Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers
Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora, Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming Pan, Nadav Borenstein, Aditya Pillai, Isabelle Augenstein, Iryna Gurevych, Preslav Nakov
DOI: https://doi.org/10.48550/arXiv.2311.09000
2024-04-16
Abstract: The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, which encompasses a multi-stage annotation scheme designed to yield detailed labels concerning the verifiability and factual inconsistencies found in LLM outputs. We further construct an open-domain document-level factuality benchmark at three levels of granularity (claim, sentence, and document), aiming to facilitate the evaluation of automatic fact-checking systems. Preliminary experiments show that FacTool, FactScore, and Perplexity.ai struggle to identify false claims, with the best F1 of 0.63 achieved by this annotation solution based on GPT-4. The annotation tool, benchmark, and code are available at https://github.com/yuxiaw/Factcheck-GPT.
Computation and Language