Face Is Not All You Need:
MIME Benchmark for Incomplete Multimodal Emotion Recognition

Yuxin Jia¹, Xing Lan¹, Wensong Wang¹, Jian Xue², Feiliang Ren¹, Ke Lu²
¹School of Computer Science and Engineering, Northeastern University, Shenyang, China
²School of Engineering Science, University of Chinese Academy of Sciences, Beijing, China
News: [2026.04] The MIME benchmark repository has been created. A mini dataset and code are publicly available in our GitHub repository.
📦 Availability

The benchmark can be accessed in two ways. For a quick look, you can directly download a small sample set containing 4 videos per subset (28 videos in total). To use the full benchmark, please sign license.pdf and send it to jinj62062@gmail.com (CC: lanx@cse.neu.edu.cn).

1. Scene Understanding

The target person seated at a table with a patterned booth seat. The scene is a restaurant with other patrons visible. Her upright posture and right hand raised near her face suggest emphasis during a heated interaction. The face is heavily blurred.

2. Emotional Analysis

Although the face is heavily blurred, we can still infer anger from the following cues. The target person says "I felt his tongue actually licked my teeth. I don't get..." with a loud, fast-paced tone and slight tremor. She faces slightly right, hand raised, indicating frustration during a direct confrontation in a public setting.

3. Conclusion

Anger

Abstract

Multimodal Emotion Recognition (MER) is a fundamental task in multimedia understanding. State-of-the-art methods have achieved strong results, but they rely on the assumption of complete, well-aligned multimodal inputs. Real-world unconstrained scenarios, however, often suffer from unpredictable modality degradation and information loss.

To bridge this gap, we formally define the Incomplete Multimodal Emotion Recognition (IMER) task, which benchmarks model generalization and robustness under multi-grained loss of modality and emotional information. We construct MIME, a dedicated IMER benchmark with 2000 video segments covering realistic fine-grained degradation of facial detail, medium-grained loss of facial structure, and extreme full-modality loss. We also propose the Chain of Emotion (CoE) analysis paradigm, with tailored evaluation metrics, to dissect how Multimodal Large Language Models (MLLMs) perceive fine-grained emotional cues and reason about them.

Key Features

Multi-Grained Scenarios

Unlike existing benchmarks, which simulate extreme full-modality loss, MIME provides 7 subsets, including Full modality (FM), Face details missing (FDM), Face structures missing (FSM), and Visual modality missing (VMM).

Adaptive Data Pipeline

A four-stage pipeline built on Qwen3-Omni applies rigorous Ground-truth Quality Filtering and Remained-Cue Verification to prevent hallucination.

Structured Reasoning (CoE)

Annotations explicitly guide models to decouple their observations into three parts: Scene Understanding, Emotional Analysis, and Predicted Conclusion.
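
As a rough illustration of this three-part structure, the sketch below splits a CoE text into its sections. The heading pattern and the `parse_coe` helper are assumptions modeled on the ground-truth examples shown on this page; they are hypothetical and not the repository's own parser.

```python
import re

# Assumption: each section starts with a heading line such as
# "1. Scene Understanding" or "1. Scene Understanding:", as in the examples on this page.
SECTION_PATTERN = re.compile(
    r"^[ \t]*[123]\.[ \t]*"
    r"(?P<title>Scene Understanding|Emotional Analysis|Conclusion)"
    r"[ \t]*:?[ \t]*$",
    re.MULTILINE,
)

# Hypothetical output keys, chosen for this sketch only.
KEYS = {
    "Scene Understanding": "scene_understanding",
    "Emotional Analysis": "emotional_analysis",
    "Conclusion": "conclusion",
}


def parse_coe(text: str) -> dict[str, str]:
    """Split a CoE text into its three sections, keyed by section name."""
    matches = list(SECTION_PATTERN.finditer(text))
    sections: dict[str, str] = {}
    for i, match in enumerate(matches):
        # A section body runs from the end of its heading to the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[KEYS[match.group("title")]] = text[match.end():end].strip()
    missing = set(KEYS.values()) - sections.keys()
    if missing:
        raise ValueError(f"CoE text is missing sections: {sorted(missing)}")
    return sections
```

Raising an error on a missing section, rather than returning a partial result, makes malformed model outputs visible before any scoring takes place.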

LLM-as-a-Judge Evaluation

We introduce a hybrid evaluation strategy using Context Perception Score (CPS), Affective Inference Score (AIS), Label Consistency Score (LCS), and Holistic CoE Index (HCI).

Chain-of-Emotion Pipeline

Data Construction Pipeline

Our adaptive four-stage generation pipeline with conditional hallucination suppression delivers high-quality tripartite structured reasoning chains. We deploy Qwen3.5-plus as a judge model to conduct automated quantitative scoring.

Dataset Diversity

The benchmark covers each type of modality missingness across diverse in-the-wild scene contexts (e.g., Lifestyle, Movie, Vlog).

Evaluation Metrics

Performance comparison across various MLLMs on four predefined CoE metrics. The radar charts illustrate the varying robustness and reasoning capabilities under multi-grained missing scenarios.

[Figures: fig_duration, fig_emotion]

7 Missing-Modality Scenarios

Evaluating models under unpredictable modality degradation and information loss.

Qualitative CoE Evaluation

A pronounced gap between visual perception and cognitive reasoning:
"Seeing" a scene does not guarantee "understanding" its emotion.

Ground Truth CoE

1. Scene Understanding:

The woman with shoulder-length reddish-brown hair in a brown sweater stands in a cozy kitchen. The face is heavily blurred. Wooden shelves with jars fill the background under warm lighting, suggesting domestic comfort as she tilts her head slightly, relaxed and engaged.

2. Emotional Analysis:

Although the face is heavily blurred, we can still infer Happy from the following cues. Other person says "Dis-moi, est-ce que la réponse est pas très proche ?" with a light, playful, inquisitive tone. The warm, domestic scene and her relaxed posture reinforce this cheerful, curious mood.

3. Conclusion:

Happy

Qwen3.5-plus (Predicted)

Rating: 7.97
1. Scene Understanding: Score: 6.67

The subject stands in a dimly lit, cozy pantry or kitchen, surrounded by shelves of jars and bottles. She wears a soft pink sweater and maintains a relaxed, upright posture. Her head is slightly bowed with eyes cast downward, suggesting a moment of quiet reflection or internal amusement within a comfortable domestic setting.

2. Emotional Analysis: Score: 7.67

Despite facial blurring, the visible upward curve of the mouth and slight cheek elevation indicate a gentle smile. The downward gaze and lack of tension in the shoulders suggest contentment rather than excitement. The subtitles imply she is responding to a question with a hint of playful confirmation, reinforcing a sense of quiet satisfaction or mild amusement.

3. Conclusion: Score: 10.00

Happy

GPT-5-mini (Predicted)

Rating: 5.33
1. Scene Understanding: Score: 8.33

A young person stands in a dim, warm-toned pantry/kitchen framed by shelves of jars and boxes. She wears a textured sweater and leans slightly forward with relaxed shoulders and a subtle upward head tilt across frames. Subtitles present questions, indicating conversational pacing. Environment is intimate and calm, with no defensive or abrupt body motions.

2. Emotional Analysis: Score: 6.67

The forward lean, head tilt and consistent questioning in the subtitles imply engagement and anticipation rather than anger or fear. Relaxed shoulders and absence of closed-off gestures point to openness. Taken together these cues suggest the subject is reacting to new information with alert interest and mild astonishment — an engaged, expectant state bordering on surprise.

3. Conclusion: Score: 0.00

Surprise
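
The ratings above can be related to the per-section scores with a small sketch. It rests on several assumptions that the page does not state: that the Scene Understanding, Emotional Analysis, and Conclusion scores correspond to CPS, AIS, and LCS, that "Rating" is the HCI, that the Conclusion score is 10 for a matching label and 0 otherwise, and that the HCI is a weighted sum. The weights 0.4/0.3/0.3 were chosen only because they reproduce the two ratings shown here; the actual definition may differ.

```python
# Assumed weights for (CPS, AIS, LCS). Inferred from the two examples on this
# page, not taken from the benchmark's definition.
ASSUMED_WEIGHTS = (0.4, 0.3, 0.3)


def label_consistency_score(predicted: str, ground_truth: str) -> float:
    """Return 10 for a matching emotion label and 0 otherwise (assumed rule)."""
    same = predicted.strip().lower() == ground_truth.strip().lower()
    return 10.0 if same else 0.0


def holistic_score(cps: float, ais: float, lcs: float,
                   weights: tuple[float, float, float] = ASSUMED_WEIGHTS) -> float:
    """Combine the three section scores into one rating via a weighted sum."""
    w_cps, w_ais, w_lcs = weights
    return w_cps * cps + w_ais * ais + w_lcs * lcs


# Scores copied from the two predicted examples above.
qwen = holistic_score(6.67, 7.67, label_consistency_score("Happy", "Happy"))
gpt = holistic_score(8.33, 6.67, label_consistency_score("Surprise", "Happy"))
print(round(qwen, 2), round(gpt, 2))  # 7.97 5.33
```

Under these assumptions, GPT-5-mini's higher scene score does not compensate for its wrong label, which matches the point that perceiving a scene is not the same as understanding its emotion.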

[Tables: table1, table2, table34]

Repository Structure

MIME/
      ├── data/
      │   ├── CASE1_FM/
      │   ├── CASE2_FDM/
      │   ├── CASE3_FSM/
      │   ├── CASE4_VMM/
      │   ├── CASE5_FDAM/
      │   ├── CASE6_FSAM/
      │   └── CASE7_AMM/
      ├── data_list.txt
      ├── eval/
      │   ├── eval_coe.py
      │   └── predictcoe_evalacc.py
      ├── label.jsonl
      ├── README.md
      ├── license.pdf
      └── supplementary_material.pdf
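
To check a download against this layout, a minimal sketch such as the following counts the files in each subset directory. The directory names come from the tree above; the `count_files_per_subset` helper and the local path `MIME` are assumptions for illustration and are not part of the repository.

```python
from pathlib import Path

# Subset directory names as listed in the repository structure.
SUBSET_DIRS = [
    "CASE1_FM",
    "CASE2_FDM",
    "CASE3_FSM",
    "CASE4_VMM",
    "CASE5_FDAM",
    "CASE6_FSAM",
    "CASE7_AMM",
]


def count_files_per_subset(root: Path) -> dict[str, int]:
    """Count regular files in each data/<subset>/ directory under root."""
    counts: dict[str, int] = {}
    for name in SUBSET_DIRS:
        subset_dir = root / "data" / name
        if subset_dir.is_dir():
            counts[name] = sum(1 for p in subset_dir.iterdir() if p.is_file())
        else:
            counts[name] = 0  # Subset directory is absent from this download.
    return counts


if __name__ == "__main__":
    # Hypothetical location of a local checkout.
    for subset, n in count_files_per_subset(Path("MIME")).items():
        print(f"{subset}: {n}")
```

For the sample set described under Availability, each subset should report 4 files, for 28 in total, provided the directories contain only videos.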

BibTeX

@misc{jia2026mime,
  title  = {Face Is Not All You Need: MIME Benchmark for Incomplete Multimodal Emotion Recognition},
  author = {Yuxin Jia and Xing Lan and Wensong Wang and Jian Xue and Feiliang Ren and Ke Lu},
  year   = {2026},
  url    = {https://yuxinokk.github.io/MIME/}
}