arxiv:2602.10748

Benchmarking Large Language Models for Knowledge Graph Validation

Published on Feb 11 · Submitted by Farzad Shami on Feb 12

AI-generated summary

Large language models show promise but lack stability and reliability for knowledge graph fact validation, with retrieval-augmented generation and multi-model consensus approaches yielding inconsistent improvements.

Abstract

Knowledge Graphs (KGs) store structured factual knowledge by linking entities through relationships and are crucial for many applications. These applications depend on the KG's factual accuracy, so verifying its facts is essential, yet challenging. Manual verification by experts is ideal but impractical at scale, and existing automated methods show promise but are not ready for real-world KGs. Large Language Models (LLMs) offer potential through their semantic understanding and broad knowledge access, yet their suitability and effectiveness for KG fact validation remain largely unexplored. In this paper, we introduce FactCheck, a benchmark designed to evaluate LLMs for KG fact validation across three key dimensions: (1) the LLMs' internal knowledge; (2) external evidence via Retrieval-Augmented Generation (RAG); and (3) aggregated knowledge through a multi-model consensus strategy. We evaluate open-source and commercial LLMs on three diverse real-world KGs. FactCheck also includes a RAG dataset of more than 2 million documents tailored for KG fact validation, and we provide an interactive exploration platform for analyzing verification decisions. The experimental analyses demonstrate that while LLMs yield promising results, they are not yet sufficiently stable and reliable for real-world KG validation scenarios. Integrating external evidence through RAG yields fluctuating performance, providing inconsistent improvements over more streamlined approaches at higher computational cost. Similarly, strategies based on multi-model consensus do not consistently outperform individual models, underscoring the lack of a one-size-fits-all solution. These findings further emphasize the need for a benchmark like FactCheck to systematically evaluate and drive progress on this difficult yet crucial task.

Community


In this work, we introduce FactCheck, a benchmark to systematically evaluate LLMs for fact validation over Knowledge Graphs, covering internal model knowledge, Retrieval-Augmented Generation (RAG), and multi-model consensus strategies across three real-world KGs (FactBench, YAGO, DBpedia).

🤖🔎 Our results show that while LLMs can reach strong performance, they still lack the stability and reliability needed for real-world KG validation, and that external evidence via RAG and ensemble consensus help only inconsistently, at non-trivial computational and operational costs. 📊⚙️
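To make the three evaluation settings concrete, here is a minimal, hypothetical sketch rather than the paper's actual pipeline: the prompt wording, `build_prompt`, `consensus_verdict`, and the stand-in model callables are all illustrative assumptions. It only shows how a single (subject, predicate, object) triple could be posed as a true/false question, optionally with retrieved passages (the RAG setting), and how several models' answers could be combined by majority vote (the consensus setting).

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def build_prompt(triple: Triple, evidence: Optional[List[str]] = None) -> str:
    """Turn a KG triple (plus optional retrieved passages) into a true/false validation prompt."""
    s, p, o = triple
    prompt = (
        "Is the following statement factually correct? Answer 'true' or 'false'.\n"
        f"Statement: {s} {p} {o}."
    )
    if evidence:  # RAG-style setting: prepend retrieved documents as context
        context = "\n".join(f"- {doc}" for doc in evidence)
        prompt = f"Context:\n{context}\n\n{prompt}"
    return prompt


def consensus_verdict(
    triple: Triple,
    models: List[Callable[[str], str]],
    evidence: Optional[List[str]] = None,
) -> bool:
    """Majority vote over several models' true/false answers for a single triple."""
    prompt = build_prompt(triple, evidence)
    votes = [m(prompt).strip().lower().startswith("t") for m in models]
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    # Stand-in callables; in practice each would wrap a call to a real LLM.
    optimist = lambda _prompt: "true"
    sceptic = lambda _prompt: "false"
    triple = ("Padua", "locatedIn", "Italy")
    print(consensus_verdict(triple, [optimist, optimist, sceptic]))  # -> True
```

The sketch is only meant as a mental model for the benchmark's three dimensions; see the repository linked below for the actual datasets and evaluation code.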

You can already explore the web platform and artifacts here:
🌐 Web app: https://factcheck.dei.unipd.it/
💻 Code and datasets: https://github.com/FactCheck-AI

Looking forward to discussing this work with the community in Tampere! 🇫🇮
