Exploring the Accuracy of Language Models in Summarisation

The Hallucination Leaderboard, computed with Vectara's
Hallucination Evaluation Model, measures how often language models introduce
hallucinations into their summaries, a key concern for applications where
factual accuracy is paramount.

The leaderboard will be updated regularly as both the evaluation model and the
language models themselves improve.

Current Standings
The table below presents the latest standings of various language models
based on their accuracy, hallucination rate, answer rate, and the average
length of summaries they produce:

Model                 Accuracy   Hallucination Rate   Answer Rate   Avg. Summary Length (Words)
GPT 4                 97.0%      3.0%                 100.0%        81.1
GPT 3.5               96.5%      3.5%                 99.6%         84.1
Llama 2 70B           94.9%      5.1%                 99.9%         84.9
Llama 2 7B            94.4%      5.6%                 99.6%         119.9
Llama 2 13B           94.1%      5.9%                 99.8%         82.1
Cohere-Chat           92.5%      7.5%                 98.0%         74.4
Cohere                91.5%      8.5%                 99.8%         59.8
Anthropic Claude 2    91.5%      8.5%                 99.3%         87.5
Mistral 7B            90.6%      9.4%                 98.7%         96.1
Google Palm           87.9%      12.1%                92.4%         36.2
Google Palm-Chat      72.8%      27.2%                88.8%         221.1

The evaluation relies on a specially trained model that detects hallucinations
in the output of language models. Each of the listed models was fed 1,000 short
documents through its public API and asked to summarise each document using only
the facts it contains. From the detector's judgements, the leaderboard computes
each model's accuracy (the proportion of summaries free of hallucinations) and
its hallucination rate, giving a direct comparison of their capabilities.
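As a rough illustration of how figures like those in the table could be derived, the sketch below aggregates per-summary consistency judgements into the four leaderboard metrics. The shape of the input data, the score_consistency helper (assumed to wrap the evaluation model and return a probability that a summary is consistent with its source), and the 0.5 decision threshold are all illustrative assumptions, not the leaderboard's exact pipeline.

from typing import Callable, Optional


def leaderboard_metrics(
    summaries: dict[str, Optional[str]],
    score_consistency: Callable[[str, str], float],
    threshold: float = 0.5,  # illustrative cut-off, not the official one
) -> dict[str, float]:
    """Compute accuracy, hallucination rate, answer rate, and average summary length.

    `summaries` maps each source document to the model's summary, or None when
    the model declined to answer. `score_consistency(source, summary)` is assumed
    to return a probability in [0, 1] that the summary is consistent with the source.
    """
    # Only documents the model actually answered count toward accuracy.
    answered = {src: s for src, s in summaries.items() if s}

    consistent = sum(
        1 for src, s in answered.items()
        if score_consistency(src, s) >= threshold
    )

    accuracy = consistent / len(answered)
    return {
        "accuracy_pct": 100.0 * accuracy,
        "hallucination_rate_pct": 100.0 * (1.0 - accuracy),
        "answer_rate_pct": 100.0 * len(answered) / len(summaries),
        "avg_summary_words": sum(len(s.split()) for s in answered.values()) / len(answered),
    }

Note that, under this formulation, accuracy and hallucination rate are complements of one another over the answered documents, which is consistent with the rows in the table above.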

The focus on summarisation accuracy, rather than overall factual accuracy,
allows the model's response to be checked directly against the source document,
giving a clearer picture of the model's reliability.