3. Methodology

3.1 MoodifyCLIP

Although our human validation study confirms the overall quality of LLaVA-generated captions, we acknowledge that hallucinations are inevitable at this scale. However, the fine-grained emotional context these rich descriptions provide is invaluable, capturing nuances that shorter captions cannot. To handle these longer descriptions, we adopt a positional embedding interpolation strategy. To balance accuracy and detail, MoodifyCLIP incorporates the following objectives:

Global Contrastive Loss (GL)

We adopt CLIP's visual encoder $E_i$ and text encoder $E_t$ to map inputs into a unified embedding space, yielding an image embedding $I_f = E_i(I)$ and a full-caption embedding $T_f$. While detailed captions capture specific emotional triggers and nuanced affective dimensions missing from existing datasets, concise summary captions remain crucial for holistic image comprehension. To balance granular details with global understanding, we additionally encode a summary embedding $T_s$. Note that the full caption is the complete five-sentence LLaVA annotation, whereas the summary consists only of its first and last sentences (i.e., the summary caption and the emotion assessment).

We apply InfoNCE contrastive learning between the visual representations $I_f$ and textual embeddings $T_f$ in both directions and average them, $L_f = (L_f^{t2v} + L_f^{v2t})/2$, with explicit formulation:

$$L_f^{t2v} = -\sum_{i=1}^{N} \log \frac{\exp(\cos(I_f^i, T_f^i)/\tau)}{\sum_{j=1}^{N} \exp(\cos(I_f^j, T_f^i)/\tau)}$$

$$L_f^{v2t} = -\sum_{i=1}^{N} \log \frac{\exp(\cos(T_f^i, I_f^i)/\tau)}{\sum_{j=1}^{N} \exp(\cos(T_f^j, I_f^i)/\tau)}$$

Equation 1: Global Contrastive Loss

Similarly, we add an objective for the global summary caption and emotion assessment: $L_s = (L_s^{t2v} + L_s^{v2t})/2$.
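For concreteness, the following is a minimal PyTorch sketch of the symmetric InfoNCE objective in Equation 1, written as a standalone helper. The function name info_nce_loss, the L2 normalization, and the default temperature are our assumptions rather than details from the paper; the same helper stands in for the InfoNCE_loss call in Algorithm 1 below.

import torch
import torch.nn.functional as F

def info_nce_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings of shape [N, d]."""
    img_emb = F.normalize(img_emb, dim=-1)           # cosine similarity via L2-normalized features
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau             # [N, N] temperature-scaled similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_v2t = F.cross_entropy(logits, targets)      # each image should match its own caption
    loss_t2v = F.cross_entropy(logits.t(), targets)  # each caption should match its own image
    return 0.5 * (loss_v2t + loss_t2v)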

Fine-grained Loss (FG)

Next, to fully utilize our hierarchical annotations, where each emotional stimulus prompt is carefully engineered to describe distinct affective elements, we align each with its corresponding regions of interest within the image. We first compute cross-attention weights between the fine-grained text embeddings $T_{fg}$ and the image patch embeddings $I_{fg}$, yielding a similarity matrix $W = \{w_{m,j}\}$, where $m$ indexes the $M$ image patches and $j$ the three emotional stimuli. Applying these weights ensures that each emotional stimulus description focuses primarily on its relevant visual regions while differentiating itself from the others.

The attention-weighted visual representation for each stimulus j is computed as:

$$I_{fg}^{j} = \sum_{m=1}^{M} \frac{w_{m,j}}{\sum_{j'=1}^{3} w_{m,j'}}\, I_{fg}^{m}$$

Equation 2: Attention-weighted Visual Representation
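As a small sketch of this attention-weighted pooling for a single image, the snippet below uses raw dot-product weights and a softmax across the three stimuli as a numerically stable stand-in for the row normalization in Equation 2. The helper name group_patches_by_stimulus is hypothetical, not taken from the paper.

import torch

def group_patches_by_stimulus(patch_emb: torch.Tensor, stim_emb: torch.Tensor) -> torch.Tensor:
    """Pool M patch embeddings into one region feature per emotional stimulus (cf. Eq. 2).

    patch_emb: [M, d] image patch embeddings I_fg
    stim_emb:  [K, d] stimulus text embeddings T_fg (K = 3)
    returns:   [K, d] attention-weighted region features
    """
    w = patch_emb @ stim_emb.t()   # [M, K] cross-attention weights w_{m,j}
    w = torch.softmax(w, dim=1)    # each patch distributes its weight across the K stimuli
    return w.t() @ patch_emb       # [K, d] weighted sums over patches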

We then impose a contrastive objective on these region-level visual representations:

$$L_{fg}^{t2v} = -\sum_{i=1}^{N}\sum_{j=1}^{3} \log \frac{\exp(\cos(I_{fg}^{i,j}, T_{fg}^{i,j})/\tau)}{\sum_{m=1}^{M} \exp(\cos(I_{fg}^{i,m}, T_{fg}^{i,j})/\tau)}$$

$$L_{fg}^{v2t} = -\sum_{i=1}^{N}\sum_{j=1}^{3} \log \frac{\exp(\cos(T_{fg}^{i,j}, I_{fg}^{i,j})/\tau)}{\sum_{m=1}^{M} \exp(\cos(T_{fg}^{i,j}, I_{fg}^{i,m})/\tau)}$$

Equation 3: Fine-grained Alignment Loss

Averaging the two directions gives $L_{fg} = (L_{fg}^{t2v} + L_{fg}^{v2t})/2$. This fine-grained alignment mechanism helps the model develop a sophisticated understanding of which visual elements trigger specific emotional responses.
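Putting the pieces together for one image-caption pair, a rough usage sketch reusing the hypothetical helpers and imports above (here the other stimuli of the same pair serve as negatives, which only approximates the patch-level denominator in Equation 3; the tensor sizes are purely illustrative):

patch_emb = torch.randn(196, 512)   # M = 196 ViT patches, d = 512 (illustrative sizes)
stim_emb = torch.randn(3, 512)      # three emotional stimulus sentence embeddings
region_feats = group_patches_by_stimulus(patch_emb, stim_emb)  # [3, 512] region features
loss_fg_pair = info_nce_loss(region_feats, stim_emb)           # contrast regions with their stimuli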

Figure 3: Illustration of how MoodifyCLIP processes both global captions and localized emotion triggers to create fine-grained alignments between text and image regions.

Optimal Transport Loss (OT)

The fine-grained loss captures specific connections (zoom-in), such as matching a smiling face region with text about happiness or tearful eyes with sadness; optimal transport now provides the bigger-picture (zoom-out) view. Mathematically, it solves the matching problem by finding the most efficient overall assignment between all image regions and all text descriptions in a batch while respecting their similarities $S = \langle I_{fg}, T_{fg} \rangle$, which translate into transportation costs (a distance matrix) $W = 1 - S$.

Then, by asking the question, "What's the most efficient way to match all the emotional elements in these images with all the emotional concepts in these texts?", we use the Sinkhorn algorithm to find the optimal global assignment while respecting local emotional nuances:

$$T_{ot} = \mathrm{sinkhorn}(W) = \arg\min_{P \in \Pi(a,b)} \langle W, P \rangle - \epsilon H(P)$$

Equation 4: Optimal Transport Formulation

where $a$ and $b$ are uniform distributions over image regions and text features respectively, $\Pi(a, b)$ is the set of all possible transport plans with these marginals, and $H(P)$ is an entropy regularization term.

Here, higher values in $T_{ot}$ represent preferred transport paths (where it is "cheaper" to move mass), amplifying signals from high-similarity pairs while downplaying mismatches. This is particularly valuable for emotions, which are often ambiguous and overlapping. The final OT loss is computed as $L_{ot} = \mathrm{CrossEntropy}(T_{ot} \odot S, I)$, where $I$ is the identity matrix.
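Since the pseudocode below calls a sinkhorn routine without defining it, here is a minimal sketch of the entropic Sinkhorn iteration for a single cost matrix with uniform marginals. The regularization ε and iteration count are illustrative choices, and the batched per-pair application in the pseudocode would loop or broadcast over pairs.

import torch

def sinkhorn(cost: torch.Tensor, eps: float = 0.05, n_iters: int = 50) -> torch.Tensor:
    """Entropy-regularized optimal transport plan for one [R, C] cost matrix (here W = 1 - S)."""
    R, C = cost.shape
    a = torch.full((R,), 1.0 / R, device=cost.device)  # uniform marginal over image regions
    b = torch.full((C,), 1.0 / C, device=cost.device)  # uniform marginal over text features
    K = torch.exp(-cost / eps)                          # Gibbs kernel
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(n_iters):                            # alternating marginal-matching updates
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)          # P = diag(u) K diag(v)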

Our total loss combines all components as:

$$L_{\mathrm{MoodifyCLIP}} = \lambda_f \cdot L_f + \lambda_s \cdot L_s + \lambda_{fg} \cdot L_{fg} + \lambda_{ot} \cdot L_{ot}$$

where $\lambda_f$, $\lambda_s$, $\lambda_{fg}$, and $\lambda_{ot}$ weight the global, summary, fine-grained, and optimal transport losses respectively.

Equation 5: Total MoodifyCLIP Loss

Algorithm 1: PyTorch-like pseudocode for MoodifyCLIP

# I[b]           = minibatch of aligned images
# T[b]           = minibatch of aligned full captions (N sentences each)
# T_summary[b]   = minibatch of summary captions
# image_encoder = Vision Transformer (ViT)
# text_encoder  = Text Transformer (BERT)
# d_I          = dimension of image features
# d_T          = dimension of text features
# M            = number of image patches/tokens in ViT
# N            = number of sentences in emotional annotations [=5]
# f/g          = suffix for global features and losses
# s            = suffix for summary features and losses
# fg           = suffix for fine-grained features and losses
# ot           = suffix for optimal transport components
# w_g, w_s, w_fg, w_ot = weights for different components

# Extract features from images and text
I_g, I_patches = image_encoder(I)  # global feature [b, d_I] and patch tokens [b, M, d_I]
T_g = text_encoder(T)  # full-caption features [b, d_T]
T_s = text_encoder(T_summary)  # summary-caption features [b, d_T]
T_fg = stack([text_encoder(s) for s in sentences(T)])  # per-sentence features [b, N, d_T]

# #### Global image-text contrastive loss
loss_g = InfoNCE_loss(I_g, T_g)

# #### Summary sentence contrastive loss
loss_s = InfoNCE_loss(I_g, T_s)

# #### Fine-grained contrastive loss
# Compute cross-attention weights between stimulus sentences and image patches
w_align_weights = einops.einsum(T_fg, I_patches, 'n k d, n p d -> n k p')
w_align_weights = softmax(w_align_weights, dim=1)  # normalize each patch's weight across stimuli (Eq. 2)
I_fg_grouped = einops.einsum(w_align_weights, I_patches, 'n k p, n p d -> n k d')

# Compute loss from token-region alignment
I_fg_grouped_flat = einops.rearrange(I_fg_grouped, 'n k d -> (n k) d')
T_fg_flat = einops.rearrange(T_fg, 'n k d -> (n k) d')
loss_fg = InfoNCE_loss(I_fg_grouped_flat, T_fg_flat)

# #### Optimal transport loss
sim = einops.einsum(I_fg_grouped, T_fg, 'n k d, m j d -> n m k j')  # regions of image n vs. stimuli of caption m
sdist = 1.0 - sim                              # transport cost W = 1 - S
P_ot = sinkhorn(sdist)                         # optimal transport plan per (image, caption) pair
sim_ot = torch.sum(P_ot * sim, dim=(2, 3))     # [n, n] OT-aggregated similarity matrix
loss_ot = cross_entropy(sim_ot, torch.eye(n))  # matched pairs lie on the diagonal

# Combine losses with weights
total_loss = w_g * loss_g + w_s * loss_s + w_fg * loss_fg + w_ot * loss_ot
                    
3.3 Moodifier

With MoodifyCLIP fine-tuned on MoodArchive, we propose Moodifier, a training-free, emotion-driven image editing system. The Moodifier workflow first extracts visual features $f_V = \mathrm{Enc}_{vis}(V)$ from the source image, enabling an MLLM (LLaVA-NeXT in our case) to generate emotion-specific prompts $P_E = \mathrm{MLLM}(f_V, E)$ and attention maps $M_E$ that identify where modifications should occur. These outputs then guide the diffusion process to produce the emotionally transformed image.

More specifically, we adopt a non-iterative inversion approach to obtain the latent representation $z$. Inspired by Prompt-to-Prompt, we then manipulate the cross-attention mechanisms within the diffusion process to achieve precise emotional transformations. We operate on the key insight that cross-attention maps define the relationship between spatial image features and textual concepts.

Algorithm 2: Moodifier (MLLM-Enhanced Emotional Editing)
Require: Source image $V$, target emotion $E$
Ensure: Emotionally edited image $V'$
1: $f_V \leftarrow \mathrm{Enc}_{vis}(V)$
2: $P_E, M_E \leftarrow \mathrm{MLLM}(f_V, E)$
3: $P_E \leftarrow \mathrm{MoodifyCLIP}(P_E)$
4: $z \leftarrow \mathrm{Inv}_{dm}(V)$
5: Initialize $z^*_T \leftarrow z$
6: for $t = T, T-1, \ldots, 1$ do
7:   $z_{t-1}, M_t \leftarrow \mathrm{DM}(z_t, \varnothing, t)$
8:   $M^*_t \leftarrow \mathrm{DM}(z^*_t, P_E, t)$
9:   $\tilde{M}_t \leftarrow \mathrm{BlendMaps}(M_t, M^*_t, M_E, t)$
10:  $z^*_{t-1} \leftarrow \mathrm{DM}(z^*_t, P_E, t)\{M \leftarrow \tilde{M}_t\}$
11:  $z^*_{t-1} \leftarrow (1 - \mathbb{1}_{M_E>0}) \odot z_{t-1} + \mathbb{1}_{M_E>0} \odot z^*_{t-1}$
12: end for
13: return $z^*_0$

In text-conditioned diffusion models, each diffusion step computes attention maps Mt through:

$$M_t = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d}}\right)$$

Equation 6: Cross-attention Map Computation

where $Q = \ell_Q(\phi(z_t))$ represents visual feature queries and $K = \ell_K(\psi(P))$ represents textual feature keys. The entry $M_{i,j}$ defines the influence of the $j$-th textual token on the $i$-th pixel. For emotional editing, we need precise control over which image regions should change and how. Our MLLM generates not only detailed prompts $P_E$ but also spatial attention maps $M_E$ that identify emotion-relevant regions.
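As a concrete illustration of Equation 6 in a single cross-attention layer, under the assumption of single-head attention and with the projection layers passed in explicitly (the function and argument names below are ours, not the model's):

import torch
import torch.nn as nn

def cross_attention_maps(z_feat: torch.Tensor, prompt_feat: torch.Tensor,
                         l_q: nn.Linear, l_k: nn.Linear) -> torch.Tensor:
    """M_t = Softmax(Q K^T / sqrt(d)) between spatial features and prompt tokens.

    z_feat:      [P, d_z] spatial features phi(z_t) (P = H*W latent pixels)
    prompt_feat: [L, d_p] prompt token embeddings psi(P_E)
    l_q, l_k:    query/key projections (the l_Q, l_K of Eq. 6)
    """
    Q = l_q(z_feat)                                      # [P, d]
    K = l_k(prompt_feat)                                 # [L, d]
    d = Q.shape[-1]
    return torch.softmax(Q @ K.t() / d ** 0.5, dim=-1)   # [P, L]: M[i, j] = token j's pull on pixel i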

As shown in Algorithm 2, DM (Diffusion Model) refers to a single step of the diffusion process that denoises the latent representation while computing cross-attention maps between visual and textual features. Specifically, $\mathrm{DM}(z_t, P_E, t)$ computes one denoising step from noise level $t$ to $t-1$ conditioned on the prompt $P_E$, producing both the denoised latent $z_{t-1}$ and the attention maps $M_t$. We blend attention maps from the source image with those generated for the target emotion:

$$\tilde{M}_t = \begin{cases} \mathrm{Refine}(M_{src}, M_{tgt}, M_E) & t \ge \tau_c \\ M_{tgt} & t < \tau_c \end{cases}$$

Equation 7: Attention Map Blending

where $\tau_c$ controls attention strength. At early steps ($t \ge \tau_c$), refined attention is used to establish structure while preserving identity, whereas in later steps ($t < \tau_c$), target attention maps enhance emotional details. Finally, the emotion stimulus maps $M_E$ smoothly blend source and target latents only in regions that should express the target emotion, while preserving the rest of the image.
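The Refine operator is not spelled out here, so the following is only a schematic stand-in: a mask-weighted mix of source and target attention while $t \ge \tau_c$, plus the masked latent blend of step 11 in Algorithm 2. Both function names and the mixing rule are our assumptions rather than the paper's exact formulation.

import torch

def blend_maps(M_src: torch.Tensor, M_tgt: torch.Tensor, M_E: torch.Tensor,
               t: int, tau_c: int) -> torch.Tensor:
    """Schematic Eq. 7: refined attention early (t >= tau_c), target attention late (t < tau_c)."""
    if t >= tau_c:
        # keep source attention outside emotion-relevant regions, target attention inside them
        return (1.0 - M_E) * M_src + M_E * M_tgt
    return M_tgt

def blend_latents(z_src: torch.Tensor, z_tgt: torch.Tensor, M_E: torch.Tensor) -> torch.Tensor:
    """Step 11 of Algorithm 2: edit only where the emotion stimulus map is active."""
    mask = (M_E > 0).to(z_src.dtype)      # indicator 1_{M_E > 0}
    return (1.0 - mask) * z_src + mask * z_tgt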

Figure 4: The Moodifier workflow illustrating how MLLM-generated prompts and attention maps guide the diffusion process to achieve precise emotional transformations while preserving content integrity.

By leveraging the capabilities of MoodifyCLIP and MLLMs, our Moodifier system enables precise emotional transformations while preserving structural and semantic integrity, offering a powerful tool for creative professionals to explore emotional variations in visual content without the need for extensive manual editing or additional training.
