

Poster

av-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

Brian Sun · Wenyi Yu · Changli Tang · Xianzhao Chen · Tian Tan · Wei Li · Lu Lu · Zejun MA · Yuxuan Wang · Chao Zhang


Abstract:

Speech understanding, as an element of the broader task of video understanding with audio-visual large language models (av-LLMs), is a crucial yet understudied aspect. This paper proposes av-SALMONN, a single end-to-end av-LLM for video processing, which can understand not only visual frame sequences, audio events and music, but also speech. To obtain the fine-grained temporal information required for speech understanding while remaining efficient for other video elements, this paper proposes a novel multi-resolution causal Q-Former (MRC Q-Former) structure to connect the pre-trained audio-visual encoders to the backbone large language model. Moreover, dedicated training approaches, including a diversity loss and an unpaired audio-visual mixed training scheme, are proposed to avoid the dominance of specific frames or modalities. On the introduced audio-visual evaluation benchmark, av-SALMONN achieves absolute accuracy improvements of more than 25% on the video-QA task and over 30% on audio-visual QA tasks involving human speech. In addition, av-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that are unprecedented for other av-LLMs. An interactive demo is available at https://github.com/the-anonymous-bs/av-SALMONN, and the training code and model checkpoints will be released upon acceptance.
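
To make the MRC Q-Former idea concrete, the sketch below shows one plausible way to run causal Q-Former blocks at several temporal window sizes and concatenate the resulting tokens for the LLM: fine windows keep the temporal detail that speech needs, while coarse windows keep the token count low for other video content. This is a minimal sketch under assumed design choices; the window sizes, number of query tokens, and module names (CausalQFormerBlock, MultiResolutionCausalQFormer) are illustrative and not the paper's exact implementation.

```python
# Minimal sketch of a multi-resolution causal Q-Former connector,
# assuming a standard Q-Former-style cross-attention design.
# Hyperparameters and module names are illustrative guesses, not the
# paper's exact implementation.
import torch
import torch.nn as nn

class CausalQFormerBlock(nn.Module):
    """Learned queries cross-attend to encoder features inside one causal window."""
    def __init__(self, dim, num_queries, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, feats):
        # feats: (batch, window_frames, dim) audio-visual features for one window
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, feats, feats)
        return out + self.ffn(out)

class MultiResolutionCausalQFormer(nn.Module):
    """Applies causal Q-Former blocks at several temporal resolutions and
    concatenates their output tokens as the sequence fed to the LLM."""
    def __init__(self, dim, window_sizes=(1, 5, 25), queries_per_window=4):
        super().__init__()
        self.window_sizes = window_sizes
        self.blocks = nn.ModuleList(
            CausalQFormerBlock(dim, queries_per_window) for _ in window_sizes
        )

    def forward(self, feats):
        # feats: (batch, num_frames, dim) fused audio-visual encoder outputs
        b, t, d = feats.shape
        tokens = []
        for w, block in zip(self.window_sizes, self.blocks):
            # Split the frame sequence into non-overlapping causal windows of length w,
            # so each group of query tokens only sees frames up to its window's end.
            for start in range(0, t, w):
                tokens.append(block(feats[:, start:start + w]))
        return torch.cat(tokens, dim=1)  # (batch, total_tokens, dim) input to the LLM
```

Under these assumptions, the per-window query tokens give speech a fine-grained temporal grid at the smallest window size, while the larger windows summarize visual frames and non-speech audio with far fewer tokens.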
