The Hallucination Leaderboard, computed using Vectara’s
Hallucination Evaluation Model, measures how readily language models introduce
hallucinations when summarizing a document – a critical concern for applications
where factual accuracy is paramount.
The leaderboard will be updated regularly as both the evaluation model and the
language models themselves continue to improve.
Current Standings
The table below presents the latest standings of various language models
based on their accuracy, hallucination rate, answer rate, and the average
length of summaries they produce:
| Model | Accuracy | Hallucination Rate | Answer Rate | Avg. Summary Length (words) |
|---|---|---|---|---|
| GPT 4 | 97.0% | 3.0% | 100.0% | 81.1 |
| GPT 3.5 | 96.5% | 3.5% | 99.6% | 84.1 |
| Llama 2 70B | 94.9% | 5.1% | 99.9% | 84.9 |
| Llama 2 7B | 94.4% | 5.6% | 99.6% | 119.9 |
| Llama 2 13B | 94.1% | 5.9% | 99.8% | 82.1 |
| Cohere-Chat | 92.5% | 7.5% | 98.0% | 74.4 |
| Cohere | 91.5% | 8.5% | 99.8% | 59.8 |
| Anthropic | 91.5% | 8.5% | 99.3% | 87.5 |
| Mistral 7B | 90.6% | 9.4% | 98.7% | 96.1 |
| Google Palm | 87.9% | 12.1% | 92.4% | 36.2 |
| Google | 72.8% | 27.2% | 88.8% | 221.1 |
The evaluation uses a model specially trained to detect hallucinations
in the output of language models. The same 1,000 short documents were fed to each
of the listed models via their public APIs, with a prompt asking for a summary
based solely on the facts in the document. From the judged summaries, the
leaderboard computes each model’s accuracy (the share of summaries containing no
hallucinations) and its hallucination rate, giving a clear, like-for-like
comparison of their capabilities.
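The snippet below sketches this scoring step in Python. It is a minimal illustration rather than Vectara’s actual evaluation harness: it assumes the evaluation model is the publicly released `vectara/hallucination_evaluation_model` cross-encoder on Hugging Face loaded through the sentence-transformers library, that a summary counts as hallucination-free when its factual-consistency score is at least 0.5, and that a model’s refusal to summarize is recorded as `None`. The `documents` and `summaries` values are hypothetical placeholders.

```python
# Minimal sketch of the scoring step (assumptions: the public HHEM
# cross-encoder on Hugging Face, the sentence-transformers CrossEncoder API,
# a 0.5 consistency threshold, and refusals stored as None).
from sentence_transformers import CrossEncoder

# Hypothetical inputs: one source document per item, plus the summary a
# given model returned for it (None if the model declined to answer).
documents = [
    "The first vaccine for Ebola was approved by the FDA in 2019, five years "
    "after the initial outbreak in 2014.",
    "Apple reported quarterly revenue of $89.5 billion, down 1 percent year over year.",
]
summaries = [
    "The FDA approved the first Ebola vaccine in 2019.",
    None,  # the model declined to summarize this document
]

# Factual-consistency scorer: returns a score in [0, 1] for each
# (document, summary) pair, where higher means more consistent.
hhem = CrossEncoder("vectara/hallucination_evaluation_model")

answered = [(doc, summ) for doc, summ in zip(documents, summaries) if summ is not None]
scores = hhem.predict([[doc, summ] for doc, summ in answered])

# A summary counts as hallucination-free if its score clears the threshold.
THRESHOLD = 0.5
consistent = sum(float(score) >= THRESHOLD for score in scores)

accuracy = 100.0 * consistent / len(answered)         # % of answered summaries with no hallucination
hallucination_rate = 100.0 - accuracy                 # complement of accuracy
answer_rate = 100.0 * len(answered) / len(documents)  # % of documents the model agreed to summarize
avg_length = sum(len(summ.split()) for _, summ in answered) / len(answered)  # words per summary

print(f"Accuracy: {accuracy:.1f}%  Hallucination rate: {hallucination_rate:.1f}%")
print(f"Answer rate: {answer_rate:.1f}%  Avg. summary length: {avg_length:.1f} words")
```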
Focusing on summarization accuracy, rather than overall factual accuracy,
allows each model’s response to be compared directly against the source
document, giving a clearer picture of the model’s reliability.