Poster

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

Peng Gao · Renrui Zhang · Chris Liu · Longtian Qiu · Siyuan Huang · Weifeng Lin · Shitian Zhao · Shijie Geng · Ziyi Lin · Peng Jin · Kaipeng Zhang · Wenqi Shao · Chao Xu · Conghui He · Junjun He · Hao Shao · Pan Lu · Hongsheng Li · Yu Qiao


Abstract: We propose SPHINX-X, an extensive Multi-modal Large Language Model (MLLM) series developed upon SPHINX. To improve architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying the multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multi-modal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR-intensive and Set-of-Mark datasets, extending its diversity and generality. By training over different base LLMs, including TinyLlama-1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral-8$\times$7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between multi-modal performance and the data and parameter scales.
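The following is a minimal, hypothetical sketch of the skip-token idea mentioned in the abstract: when a high-resolution image is tiled into sub-images, tiles that consist entirely of padding are represented by a single learnable token rather than being passed through the visual encoder. The class name `SubImageTokenizer`, the `visual_encoder` interface, and the `skip_token` parameter are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SubImageTokenizer(nn.Module):
    """Hypothetical sketch: encode sub-images, replacing fully-padded tiles
    with one learnable skip token instead of a full set of patch tokens."""

    def __init__(self, visual_encoder: nn.Module, embed_dim: int):
        super().__init__()
        self.visual_encoder = visual_encoder  # assumed to map (1, C, H, W) -> (1, N, D)
        self.skip_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # learnable placeholder

    def forward(self, sub_images: torch.Tensor, pad_value: float = 0.0) -> torch.Tensor:
        # sub_images: (num_tiles, C, H, W), obtained by tiling a high-res image
        tokens = []
        for img in sub_images:
            if torch.all(img == pad_value):
                # Fully-padded tile: emit a single skip token, saving encoder compute
                tokens.append(self.skip_token.squeeze(0))          # (1, D)
            else:
                tokens.append(self.visual_encoder(img.unsqueeze(0)).squeeze(0))  # (N, D)
        # Concatenated visual tokens to be prepended to the LLM input sequence
        return torch.cat(tokens, dim=0)
```

Under these assumptions, the saving comes from skipping both the encoder forward pass and the per-patch token budget for padded tiles, which shortens the visual token sequence fed to the language model.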
