Face-LLaVA: Facial Expression and
Attribute Understanding through Instruction Tuning

University of Southern California
Face-LLaVA Chat Example


We propose Face-LLaVA, a multimodal large language model specialized in face-centered tasks such as facial expression and attribute recognition, and capable of generating natural-language descriptions that support reasoning. We introduce FaceInstruct-1M, a dataset tailored for face-focused instruction tuning, and develop a novel face-specific visual encoder built on Face-Region Guided Cross-Attention.

Abstract

The human face plays a central role in social communication, necessitating performant computer vision tools for human-centered applications. We propose Face-LLaVA, a multimodal large language model for face-centered, in-context learning, including facial expression and attribute recognition. Face-LLaVA can also generate natural-language descriptions that can be used for reasoning. Leveraging existing visual databases, we first develop FaceInstruct-1M, a face-centered dataset for instruction tuning MLLMs for face processing. We then develop a novel face-specific visual encoder powered by Face-Region Guided Cross-Attention, which integrates face geometry with local visual features. We evaluate the proposed method on nine datasets spanning five face processing tasks: facial expression recognition, action unit detection, facial attribute detection, age estimation, and deepfake detection. Face-LLaVA achieves superior results compared to existing open-source MLLMs and competitive performance against commercial solutions. Its outputs also receive higher reasoning ratings from GPT-4o under a zero-shot setting across all tasks.

FaceInstruct-1M

FaceInstruct-1M Data Annotation Pipeline


Overview of the data annotation pipeline used to construct the FaceInstruct-1M dataset. We leverage existing face-analysis datasets and their annotations, and prompt Gemini-1.5 Flash to convert the annotations into natural language with reasoning. To ensure data quality, we rate the generated samples with GPT-4o-mini and filter out low-rated ones.
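
To make the two-stage annotate-then-rate loop concrete, here is a minimal Python sketch. The prompts, the rating threshold, and the helper functions annotate and rate are illustrative assumptions, not the actual FaceInstruct-1M code; the sketch assumes the google-generativeai and openai Python clients.

```python
import json
import os
import google.generativeai as genai   # Gemini-1.5 Flash client
from openai import OpenAI             # GPT-4o-mini rating client

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
gemini = genai.GenerativeModel("gemini-1.5-flash")
openai_client = OpenAI()              # reads OPENAI_API_KEY from the environment

def annotate(image_path: str, raw_label: str, task: str) -> str:
    """Convert a categorical dataset label into a natural-language
    description with reasoning, using Gemini-1.5 Flash (prompt is illustrative)."""
    prompt = (
        f"The face in this image is annotated for {task} with the label "
        f"'{raw_label}'. Describe the visual evidence supporting this label "
        f"and explain your reasoning in natural language."
    )
    image = genai.upload_file(image_path)
    return gemini.generate_content([prompt, image]).text

def rate(description: str, raw_label: str) -> int:
    """Score a generated sample 0-5 for consistency with the ground-truth
    label using GPT-4o-mini; low-rated samples are discarded."""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Rate 0-5 how well this description matches the ground-truth "
                f"label '{raw_label}'. Reply with a single integer.\n\n{description}"
            ),
        }],
    )
    return int(response.choices[0].message.content.strip())

sample = annotate("face.jpg", "happiness", "facial expression recognition")
if rate(sample, "happiness") >= 4:    # threshold is an assumption, not the paper's
    print(json.dumps({"image": "face.jpg", "response": sample}))
```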

Data Samples

Face-LLaVA

Face-LLaVA Architecture


Overview of the proposed Face-LLaVA architecture. We use a pretrained face expert model, i.e., a facial landmark detector, to extract landmarks from the visual input. Our novel Face-Region Landmark Projector (FRLP) module converts these landmarks into landmark tokens that encode facial geometry. To enrich the visual tokens, Face-Region Guided Cross-Attention is performed between the visual tokens and the landmark tokens, guided by a face-region patch-proximity mask.
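
For readers who think in code, the following is a minimal PyTorch sketch of how the FRLP module and Face-Region Guided Cross-Attention could fit together. The region grouping, tensor shapes, mask construction, and all module internals are assumptions inferred from the description above, not the released implementation.

```python
import torch
import torch.nn as nn

class FaceRegionLandmarkProjector(nn.Module):
    """Sketch of the FRLP module: project per-region landmark coordinates
    into landmark tokens. The grouping of landmarks into face regions
    (e.g., brows, eyes, nose, mouth, jawline) is an assumption."""
    def __init__(self, landmarks_per_region: int, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(landmarks_per_region * 2, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, region_landmarks: torch.Tensor) -> torch.Tensor:
        # region_landmarks: (B, n_regions, landmarks_per_region, 2), normalized (x, y)
        b, r = region_landmarks.shape[:2]
        return self.mlp(region_landmarks.reshape(b, r, -1))  # (B, n_regions, dim)

class FaceRegionGuidedCrossAttention(nn.Module):
    """Sketch of Face-Region Guided Cross-Attention: visual patch tokens
    attend to landmark tokens, with a proximity mask blocking attention
    between a patch and any face region that is not spatially near it."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, landmark_tokens, proximity_mask):
        # visual_tokens:   (B, n_patches, dim)   from the visual encoder
        # landmark_tokens: (B, n_regions, dim)   from the FRLP module
        # proximity_mask:  (B, n_patches, n_regions) bool; True = block attention
        attn_mask = proximity_mask.repeat_interleave(self.attn.num_heads, dim=0)
        enriched, _ = self.attn(
            query=visual_tokens,
            key=landmark_tokens,
            value=landmark_tokens,
            attn_mask=attn_mask,
        )
        return self.norm(visual_tokens + enriched)  # residual keeps original features

# Toy usage with made-up sizes: 576 patches, 5 face regions, 14 landmarks each.
frlp = FaceRegionLandmarkProjector(landmarks_per_region=14, dim=1024)
xattn = FaceRegionGuidedCrossAttention(dim=1024)
landmark_tokens = frlp(torch.rand(2, 5, 14, 2))
mask = torch.zeros(2, 576, 5, dtype=torch.bool)   # all-visible mask for the demo
enriched = xattn(torch.randn(2, 576, 1024), landmark_tokens, mask)
```

The residual connection means the cross-attention only adds landmark-derived context on top of the original patch features, so the visual tokens stay usable by the downstream language model even where the mask blocks most regions.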

Results

Comparison with baselines on facial expression recognition
Comparison with baselines on age estimation, attribute detection, and deepfake detection
Comparison with baselines on reasoning ability, rated using GPT-4o

Comparison with baselines

Acknowledgement

Research was sponsored by the Army Research Office and was accomplished under Cooperative Agreement Number W911NF-25-2-0040. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.