Dataset

Dataset name: Emotions_Go_OneAI_v1.2
Datapoints: 21,612
Labels: anger, joy, neutral, sadness, surprise
Description: Based on Google's GoEmotions dataset, cleaned and further annotated by native-speaker human annotators.

Results

See our notebook and dataset to reproduce the results.

Model                                                  Accuracy  Precision (Micro)  Recall (Micro)
1 Ours (One-AI-emotions-1.1)                           0.75      0.75               0.75
2 Emotion-english-distilroberta-base by j-hartmann     0.67      0.62               0.72
3 distilbert-base-uncased-emotion by bhadresh-savani   0.44      0.43               0.44

Notes: We report micro-averaged metrics to account for the inherent label imbalance in the labeled data. Specifically, there are 8x more 'neutral' labels than any other emotion label. Micro-averaging gives equal weight to every datapoint, so the scores should be closer to the precision and recall seen in real-world use.
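
To illustrate why the averaging mode matters on imbalanced labels, here is a minimal sketch that computes micro- and macro-averaged precision in plain Python. The function name `micro_macro_precision` and the toy data are hypothetical and are not taken from the benchmark notebook. Classes that are never predicted are counted as precision 0.0, which is one common convention and an assumption here.

```python
from collections import Counter

def micro_macro_precision(y_true, y_pred):
    """Return (micro, macro) precision for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp = Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[pred] += 1   # correct prediction for class `pred`
        else:
            fp[pred] += 1   # class `pred` was predicted but is wrong
    # Micro: pool the counts of all classes, so every datapoint weighs the same.
    micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
    # Macro: average per-class precision, so every class weighs the same.
    # Assumption: a class that is never predicted contributes 0.0.
    per_class = [
        tp[label] / (tp[label] + fp[label]) if (tp[label] + fp[label]) else 0.0
        for label in labels
    ]
    macro = sum(per_class) / len(per_class)
    return micro, macro

# Toy data with a dominant 'neutral' class (made up for illustration).
y_true = ["neutral"] * 8 + ["joy", "anger"]
y_pred = ["neutral"] * 8 + ["neutral", "joy"]
micro, macro = micro_macro_precision(y_true, y_pred)
print(f"micro={micro:.2f} macro={macro:.2f}")
```

On this toy data the macro score is pulled down by the two rare classes, while the micro score tracks how many individual datapoints were labeled correctly.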

Methodology

Data prep

  • The Emotions_go dataset was filtered to the main emotions that exhibit clearer user agreement.
    In both user testing and manual annotation tasks, we found that the other labels did not reach any agreement threshold. The other models tested were also trained on a nearly identical set of labels which stre
  • The Joy and Happiness labels from the dataset were merged into a single 'joy' label.
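
The two data prep steps above can be sketched as follows. This is a minimal illustration, not the code used to build the dataset: the `(text, label)` row shape, the lowercase label strings, and the names `KEPT_LABELS`, `MERGED_LABELS`, and `prepare` are all assumptions.

```python
# Labels kept after filtering (the dataset labels listed on this page).
KEPT_LABELS = {"anger", "joy", "neutral", "sadness", "surprise"}
# Assumed spelling of the source label that is merged into 'joy'.
MERGED_LABELS = {"happiness": "joy"}

def prepare(rows):
    """Merge labels, then drop rows whose label is not in KEPT_LABELS.

    rows: iterable of (text, label) pairs (hypothetical shape).
    """
    prepared = []
    for text, label in rows:
        label = MERGED_LABELS.get(label, label)  # happiness -> joy
        if label in KEPT_LABELS:
            prepared.append((text, label))
    return prepared

sample = [("great news!", "happiness"), ("ugh", "disgust"), ("ok", "neutral")]
print(prepare(sample))
```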

Measurement

  • As each model identifies a slightly different set of emotions, the comparison was normalized as follows:

One AI labels = ['anger', 'joy', 'neutral', 'sadness', 'surprise']
model2 labels = ['anger', 'disgust', 'fear', 'joy', 'neutral', 'sadness', 'surprise']
model3 labels = ['anger', 'fear', 'joy', 'love', 'neutral', 'sadness', 'surprise']

dataset labels = ['anger', 'joy', 'neutral', 'sadness', 'surprise']

  • The distilbert-base-uncased-emotion model by bhadresh-savani does not have a neutral label, so if no emotion scores above 0.8 the result is considered 'neutral'. A threshold of 0.8 was found to provide the best results.
  • The One-AI model can detect multiple emotions per text input (as it supports per-span annotation), unlike the other 2 models, which can only select a single emotion for the entire text input. We therefore consider a sample a correct match if the correct emotion is included in the detected emotions.
    If none of the items match the golden label, we take the first item in the list as the prediction.
    For example, if the One-AI response was ['happiness','sadness'] and the golden label was 'sadness', we consider the One-AI response to be 'sadness'.
    But if the golden label was 'neutral', we take the first value in the list ('happiness') as the One-AI response.
  • Execution command for reference models:
from transformers import pipeline

# return_all_scores=True returns a score for every label, not only the top one
model2 = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base", return_all_scores=True)
model3 = pipeline("text-classification", model="bhadresh-savani/distilbert-base-uncased-emotion", return_all_scores=True)
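
The two normalization rules described in the Measurement list (the 0.8 neutral fallback for the model without a neutral label, and collapsing a multi-emotion One-AI response to a single prediction) can be sketched roughly as below. The function names are hypothetical. The `scores` argument is assumed to be a list of `{"label", "score"}` dicts for a single input, and the handling of an empty detection list is an assumption, since the text above does not cover that case.

```python
NEUTRAL_THRESHOLD = 0.8  # threshold quoted in the Measurement notes

def label_with_neutral_fallback(scores, threshold=NEUTRAL_THRESHOLD):
    """Return the top-scoring label, or 'neutral' if no score is above threshold.

    scores: list of {"label": str, "score": float} dicts for one input
    (assumed shape).
    """
    best = max(scores, key=lambda item: item["score"])
    return best["label"] if best["score"] > threshold else "neutral"

def resolve_multi_label(detected, gold):
    """Collapse a list of detected emotions into one prediction.

    If the golden label is among the detected emotions, it is the prediction;
    otherwise the first detected emotion is used.
    """
    if gold in detected:
        return gold
    # Assumption: an empty detection list is treated as 'neutral'.
    return detected[0] if detected else "neutral"

print(label_with_neutral_fallback([{"label": "joy", "score": 0.55},
                                   {"label": "anger", "score": 0.45}]))
print(resolve_multi_label(["happiness", "sadness"], "sadness"))
print(resolve_multi_label(["happiness", "sadness"], "neutral"))
```

Note that the second rule uses the golden label to choose among the detected emotions, so it can only be applied during evaluation, not at inference time.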
