Abstract

Bridging emotions and visual content for emotion-driven image editing holds great potential in creative industries, yet precise manipulation remains challenging due to the abstract nature of emotions and their varied manifestations across different contexts. We tackle this challenge with an integrated approach consisting of three complementary components. First, we introduce MoodArchive, an 8M+ image dataset with detailed hierarchical emotional annotations generated by LLaVA and partially validated by human evaluators. Second, we develop MoodifyCLIP, a vision-language model fine-tuned on MoodArchive to translate abstract emotions into specific visual attributes. Third, we propose Moodifier, a training-free editing model leveraging MoodifyCLIP and multimodal large language models (MLLMs) to enable precise emotional transformations while preserving content integrity. Our system works across diverse domains such as character expressions, fashion design, jewelry, and home décor, enabling creators to quickly visualize emotional variations while preserving identity and structure. Extensive experimental evaluations show that Moodifier outperforms existing methods in both emotional accuracy and content preservation, providing contextually appropriate edits. By linking abstract emotions to concrete visual changes, our solution unlocks new possibilities for emotional content creation in real-world applications.
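
As a concrete illustration of the MoodifyCLIP component, the sketch below scores an image against a small set of emotion prompts through a standard CLIP interface. This is a minimal sketch under assumptions: the public OpenAI checkpoint stands in for the MoodifyCLIP weights, and the prompt template and file path are illustrative.

# Minimal emotion-scoring sketch with a CLIP-style model. The public
# OpenAI checkpoint stands in for the MoodifyCLIP weights (assumption),
# and "scene.jpg" is a placeholder input.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

name = "openai/clip-vit-base-patch32"  # substitute MoodifyCLIP weights here
model = CLIPModel.from_pretrained(name)
processor = CLIPProcessor.from_pretrained(name)

emotions = ["joy", "fear", "nostalgia", "serenity"]
prompts = [f"a photo evoking {e}" for e in emotions]  # illustrative template
image = Image.open("scene.jpg")

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, len(emotions))
probs = logits.softmax(dim=-1)[0]

for e, p in zip(emotions, probs.tolist()):
    print(f"{e}: {p:.3f}")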


MoodArchive Dataset

MoodArchive addresses a critical limitation in language-image pre-training by providing over 8 million images with detailed emotional annotations that existing datasets lack. Built through a four-stage pipeline of emotion selection, descriptor generation, image collection, and annotation (detailed below), the dataset encompasses 27 distinct emotions across four contexts: facial expressions, natural scenery, urban scenery, and object classes. Each image carries a comprehensive annotation generated by LLaVA-NeXT, consisting of a global summary, three specific emotional stimuli identifying visual triggers, and an overall emotion assessment. Human validation studies confirm the quality of these annotations, with 85% of participants preferring the generated captions over the original web-collected alt-text.
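
The exact on-disk format of these annotations is not given here; the dataclass below is an assumed, illustrative schema that mirrors the described hierarchy of a global summary, three emotional stimuli, and an overall emotion label.

# Illustrative schema for one MoodArchive record. Field names are
# assumptions; the three-part structure follows the description above.
from dataclasses import dataclass

@dataclass
class EmotionStimulus:
    trigger: str      # the visual cue, e.g. "rain-streaked window"
    explanation: str  # why this cue evokes the emotion

@dataclass
class MoodArchiveRecord:
    image_id: str
    context: str                    # facial expressions, natural scenery,
                                    # urban scenery, or object classes
    global_summary: str             # scene-level caption from LLaVA-NeXT
    stimuli: list[EmotionStimulus]  # exactly three per image
    overall_emotion: str            # one of the 27 emotion categories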

- 8M+ Images: largest emotion-focused dataset with hierarchical annotations
- 27 Emotions: comprehensive spectrum beyond basic emotional categories
- 4 Distinct Contexts: facial expressions, natural scenery, urban scenery, and object classes
- 85% Human Validation: preference rate over original web-collected alt-text annotations

Data Collection Process

1. Emotion Selection: 27 emotions drawn from GoEmotions, spanning the 4 contexts
2. Descriptor Generation: ChatGPT expands each emotion into specific visual cues
3. Image Collection: descriptor-based queries against multiple image sources
4. Annotation: LLaVA-NeXT generates structured emotional captions
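
Composed end to end, the four stages form a simple loop. The sketch below is a structural outline only: every function body is a placeholder, since the actual ChatGPT prompts, image sources, and LLaVA-NeXT invocation are not detailed here.

# Structural outline of the four-stage pipeline. All function bodies are
# placeholders; only the control flow reflects the process described above.
from typing import Iterable

EMOTIONS = ["joy", "fear", "nostalgia"]  # 27 GoEmotions categories in full
CONTEXTS = ["facial expressions", "natural scenery",
            "urban scenery", "object classes"]

def generate_descriptors(emotion: str, context: str) -> list[str]:
    # Stage 2: ChatGPT expands the emotion into concrete visual cues.
    return [f"{emotion} conveyed through {context}"]  # placeholder

def collect_images(query: str) -> Iterable[str]:
    # Stage 3: descriptor-based queries against multiple image sources.
    return []  # placeholder: would yield image paths or URLs

def annotate(image_path: str) -> dict:
    # Stage 4: LLaVA-NeXT produces the structured emotional caption.
    return {"global_summary": "...",
            "stimuli": ["...", "...", "..."],
            "overall_emotion": "..."}  # placeholder

dataset = []
for emotion in EMOTIONS:                      # Stage 1: emotion selection
    for context in CONTEXTS:
        for descriptor in generate_descriptors(emotion, context):
            for path in collect_images(descriptor):
                dataset.append((path, annotate(path)))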

Dataset Examples and Structure

Figure 1: Overview of the MoodArchive dataset: 27 emotion categories across facial expressions, natural scenery, urban scenery, and object classes.
Figure 2: The structured annotation format used in MoodArchive: a global summary, three emotional stimuli identifying visual triggers, and an overall emotion assessment.

Moodifier: Emotion-Driven Editing

Moodifier is a training-free, emotion-driven image editing system that integrates MoodifyCLIP's emotional understanding with multimodal large language models (MLLMs) and diffusion models. Building on the emotional knowledge MoodifyCLIP distills from MoodArchive, it performs precise emotional transformations while preserving content integrity across diverse domains such as character expressions, fashion design, jewelry, and home décor.
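
As a rough sketch of this loop, the code below substitutes plain diffusion inpainting for Moodifier's attention-control editing: a hypothetical ask_mllm helper stands in for the MLLM step that derives a detailed edit prompt and an emotion stimulus mask (the PE and ME components ablated in the experiments), and an off-the-shelf inpainting pipeline applies the edit.

# Simplified stand-in for the Moodifier loop. ask_mllm is hypothetical;
# plain inpainting replaces the paper's attention-control editing.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

def ask_mllm(image: Image.Image, emotion: str) -> tuple[str, Image.Image]:
    # Hypothetical MLLM query returning (detailed prompt, stimulus mask).
    prompt = f"the same scene, now evoking {emotion}"  # PE: detailed prompt
    mask = Image.new("L", image.size, 255)             # ME: placeholder mask
    return prompt, mask

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

source = Image.open("product.jpg").convert("RGB").resize((512, 512))
prompt, mask = ask_mllm(source, "serenity")
edited = pipe(prompt=prompt, image=source, mask_image=mask).images[0]
edited.save("product_serenity.jpg")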

System Architecture

Figure 3: Our training-free, MLLM-enhanced Moodifier system integrates instruction-based and attention-control approaches, enabling precise emotional transformations while preserving content integrity.

Experiments

We evaluate Moodifier qualitatively across domains, against state-of-the-art editing methods, and through ablations of its two key components.

Figure 4: Qualitative results of our Moodifier system performing emotion-driven image editing across diverse scenarios, including facial expressions, fashion, jewelry, and home décor.
Figure 5: Comparison of Moodifier with state-of-the-art image editing methods, showing superior emotional accuracy and content preservation.
Figure 6: Ablation study demonstrating how detailed prompts (PE) and emotion stimulus masks (ME) each contribute to Moodifier's performance.

BibTeX

@article{,
  title={},
  author={},
  journal={},
  year={}
}