Human Validation Study

To evaluate the quality and reliability of our MoodArchive dataset, we conducted a comprehensive human validation study on Amazon Mechanical Turk. Workers compared the original web-collected alt-text with our LLaVA-generated detailed captions, judging both content accuracy and emotional interpretation.
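
For concreteness, the sketch below shows one way such a comparison batch could be prepared as an MTurk input CSV. The file names, field names (image_url, alt_text, llava_caption), and the randomized A/B ordering are illustrative assumptions, not details of our actual pipeline.

```python
import csv
import json
import random

# Hypothetical input: one record per image with its original alt-text and
# the LLaVA-generated emotional caption (field names are assumptions).
with open("moodarchive_captions.json") as f:
    records = json.load(f)

with open("mturk_comparison_batch.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["image_url", "caption_a", "caption_b", "a_is_llava"]
    )
    writer.writeheader()
    for rec in records:
        # Randomize which caption appears as option A to reduce position bias
        # (a common precaution, assumed here rather than taken from the paper).
        llava_first = random.random() < 0.5
        a, b = (
            (rec["llava_caption"], rec["alt_text"])
            if llava_first
            else (rec["alt_text"], rec["llava_caption"])
        )
        writer.writerow(
            {
                "image_url": rec["image_url"],
                "caption_a": a,
                "caption_b": b,
                "a_is_llava": int(llava_first),
            }
        )
```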

MTurk Validation Interface

Task Instructions: Comparing Web-collected Captions vs. LLaVA Emotional Descriptions

Figure 5: MTurk task instructions detailing the evaluation criteria for comparing captions.

Evaluation Interface: Original vs. LLaVA-generated Emotional Captions

Figure 6: The comparison interface showing image, original caption, and LLaVA-generated emotional caption.

Validation Results

85% of LLaVA-generated captions were selected by workers as better describing the images than the original web-collected alt-text, confirming the high quality of our automated annotation approach.
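
A minimal sketch of how such a preference rate could be computed from the raw worker judgments is shown below. The results file layout, column names (image_id, choice), and the per-image majority-vote aggregation are assumptions for illustration only.

```python
import csv
from collections import Counter, defaultdict

# Hypothetical MTurk results file: one row per worker judgment, with an
# image identifier and the worker's choice ("llava" or "original").
votes = defaultdict(Counter)
with open("mturk_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        votes[row["image_id"]][row["choice"]] += 1

# Resolve each image by majority vote among its workers, then compute the
# fraction of images on which the LLaVA caption was preferred.
llava_wins = sum(
    1 for counts in votes.values()
    if counts["llava"] > counts["original"]
)
win_rate = llava_wins / len(votes)
print(f"LLaVA caption preferred on {win_rate:.1%} of images")
```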

Common Rejection Reasons (15% of captions):

  • Inappropriate word choices or overly dramatic descriptions
  • Inaccurate emotion detection or classification
  • Misalignment between detected emotions and cultural interpretations

BibTeX

@article{,
  title={},
  author={},
  booktitle={},
  year={}
}