Rouge metrics measure the word overlap between generated text and expected text.
Since good summaries and headings can be phrased in many equally valid ways, Rouge scores around 50 are considered excellent results.
We grade models evaluated with Rouge scores on the following scale:
| Grade | Rouge score (summaries) | Rouge score (headings) |
|-------|-------------------------|------------------------|
| 🟢 A+ | > 48 | > 46 |
| 🟢 A  | > 45 | > 45 |
| 🟡 B  | > 40 | > 40 |
| 🟠 C  | > 35 | > 35 |
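To make the scale concrete, here is a minimal, hypothetical helper that maps a Rouge score (on the 0-100 scale) to the grades above. The function name and the choice to use the first threshold column are our own; it is an illustration of the table, not part of any library:

```python
def rouge_grade(score: float) -> str:
    """Map a Rouge score (0-100 scale) to a letter grade.

    Thresholds follow the first threshold column of the table above;
    this helper is illustrative only.
    """
    if score > 48:
        return "A+"
    if score > 45:
        return "A"
    if score > 40:
        return "B"
    if score > 35:
        return "C"
    return "below C"


print(rouge_grade(47.2))  # -> "A"
```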
Note: This is a high-level description designed to provide intuition for understanding Rouge metrics. For a more mathematically accurate explanation, please see this blog post or the original Rouge paper.
- ROUGE-1: Shared words
The fraction of words that appear in both the model output and the expected output.
Example: 0.5 means half of the words appear in both the model output and the expected output.
- ROUGE-2: Shared word-pairs
The fraction of adjacent word-pairs that appear in both the model output and the expected output (as pairs).
Example: 0.5 means half of the adjacent word-pairs appear in both the model output and the expected output.
This is a stricter metric than ROUGE-1, since it is also sensitive to word order.
- ROUGE-L: Longest shared word-sequence
The length of the longest sequence of words that appears, in the same order, in both the model output and the expected output.
Example: 0.5 means the longest word sequence shared between the model output and the expected output covers half of the expected text.
This metric is very sensitive to the order of the generated words (a simplified computation of all three metrics is sketched after this list).
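To ground the descriptions above, here is a minimal sketch of the recall-oriented versions of all three metrics, using whitespace tokenization and no stemming. This is our own simplified code for intuition, not the reference implementation:

```python
from collections import Counter


def _ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: str, reference: str, n: int) -> float:
    """Simplified ROUGE-N recall: fraction of reference n-grams also found in the candidate."""
    cand, ref = _ngrams(candidate.split(), n), _ngrams(reference.split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())


def rouge_l_recall(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L recall: length of the longest common subsequence of words,
    divided by the length of the reference."""
    c, r = candidate.split(), reference.split()
    if not r:
        return 0.0
    # Classic dynamic-programming LCS over word sequences.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c, 1):
        for j, rw in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if cw == rw else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1] / len(r)


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, 1))  # ROUGE-1 ~ 0.83
print(rouge_n_recall(candidate, reference, 2))  # ROUGE-2 ~ 0.60
print(rouge_l_recall(candidate, reference))     # ROUGE-L ~ 0.83
```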
As two summaries or headlines are unlikely to be generated exactly the same (same words, order, inflections, and suffixes), Rouge metrics usually peak around 50 (0.50) while still representing very high-quality output.
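In practice these scores are rarely computed by hand. Assuming Google's open-source rouge-score package is installed (`pip install rouge-score`), it can be used roughly like this:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the cat sat on the mat",  # expected (reference) text
    "the cat lay on the mat",  # model output
)
# Each entry holds precision, recall, and F-measure for that Rouge variant.
for name, score in scores.items():
    print(name, round(score.fmeasure, 3))
```

A typical evaluation averages these per-example scores over a whole test set before applying a grading scale like the one above.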