The Hallucination Leaderboard, computed using Vectara’s
Hallucination Evaluation Model, measures how readily language models introduce
hallucinations when summarizing a document – a critical concern for applications
where factual accuracy is paramount.
The leaderboard will be updated regularly as both the evaluation model and the
language models themselves continue to improve.
Current Standings
The table below presents the latest standings of various language models
based on their accuracy, hallucination rate, answer rate, and the average
length of summaries they produce:
| Model | Accuracy | Hallucination Rate | Answer Rate | Avg. Summary Length (words) |
|---|---|---|---|---|
| GPT 4 | 97.0% | 3.0% | 100.0% | 81.1 |
| GPT 3.5 | 96.5% | 3.5% | 99.6% | 84.1 |
| Llama 2 70B | 94.9% | 5.1% | 99.9% | 84.9 |
| Llama 2 7B | 94.4% | 5.6% | 99.6% | 119.9 |
| Llama 2 13B | 94.1% | 5.9% | 99.8% | 82.1 |
| Cohere-Chat | 92.5% | 7.5% | 98.0% | 74.4 |
| Cohere | 91.5% | 8.5% | 99.8% | 59.8 |
| Anthropic | 91.5% | 8.5% | 99.3% | 87.5 |
| Mistral 7B | 90.6% | 9.4% | 98.7% | 96.1 |
| Google Palm | 87.9% | 12.1% | 92.4% | 36.2 |
| Google | 72.8% | 27.2% | 88.8% | 221.1 |
The evaluation uses a model specially trained to detect hallucinations
in the output of language models. The same 1,000 short documents were fed to each
of the listed models via their public APIs, with a prompt asking for a summary
based solely on the facts in the document. From the judged summaries, the
leaderboard computes each model’s accuracy (the share of summaries containing no
hallucinations) and its hallucination rate, giving a clear, like-for-like
comparison of their capabilities.
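The snippet below sketches this scoring step in Python. It is a minimal illustration rather than Vectara’s actual evaluation harness: it assumes the evaluation model is the publicly released `vectara/hallucination_evaluation_model` cross-encoder on Hugging Face loaded through the sentence-transformers library, that a summary counts as hallucination-free when its factual-consistency score is at least 0.5, and that a model’s refusal to summarize is recorded as `None`. The `documents` and `summaries` values are hypothetical placeholders.

```python
# Minimal sketch of the scoring step (assumptions: the public HHEM
# cross-encoder on Hugging Face, the sentence-transformers CrossEncoder API,
# a 0.5 consistency threshold, and refusals stored as None).
from sentence_transformers import CrossEncoder

# Hypothetical inputs: one source document per item, plus the summary a
# given model returned for it (None if the model declined to answer).
documents = [
    "The first vaccine for Ebola was approved by the FDA in 2019, five years "
    "after the initial outbreak in 2014.",
    "Apple reported quarterly revenue of $89.5 billion, down 1 percent year over year.",
]
summaries = [
    "The FDA approved the first Ebola vaccine in 2019.",
    None,  # the model declined to summarize this document
]

# Factual-consistency scorer: returns a score in [0, 1] for each
# (document, summary) pair, where higher means more consistent.
hhem = CrossEncoder("vectara/hallucination_evaluation_model")

answered = [(doc, summ) for doc, summ in zip(documents, summaries) if summ is not None]
scores = hhem.predict([[doc, summ] for doc, summ in answered])

# A summary counts as hallucination-free if its score clears the threshold.
THRESHOLD = 0.5
consistent = sum(float(score) >= THRESHOLD for score in scores)

accuracy = 100.0 * consistent / len(answered)         # % of answered summaries with no hallucination
hallucination_rate = 100.0 - accuracy                 # complement of accuracy
answer_rate = 100.0 * len(answered) / len(documents)  # % of documents the model agreed to summarize
avg_length = sum(len(summ.split()) for _, summ in answered) / len(answered)  # words per summary

print(f"Accuracy: {accuracy:.1f}%  Hallucination rate: {hallucination_rate:.1f}%")
print(f"Answer rate: {answer_rate:.1f}%  Avg. summary length: {avg_length:.1f} words")
```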
Focusing on summarization accuracy, rather than overall factual accuracy,
allows each model’s response to be compared directly against the source
document, giving a clearer picture of the model’s reliability.