ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst

Zijia Zhao1, Longteng Guo2, Tongtian Yue1, Sihan Chen1, Shuai Shao2, Xinxin Zhu1, Zehuan Yuan2, Jing Liu1
1Institute of Automation, Chinese Academy of Sciences, 2Bytedance Inc.

Abstract

Building general-purpose models that can perceive diverse real-world modalities and solve various tasks is an appealing target in artificial intelligence.

In this paper, we present ChatBridge, a novel multimodal language model that leverages the expressive capabilities of language as the catalyst to bridge the gap between various modalities. We show that language-paired two-modality data alone is sufficient to connect all modalities. ChatBridge builds on recent large language models (LLMs) and extends their zero-shot capabilities to incorporate diverse multimodal inputs.

ChatBridge undergoes a two-stage training. The first stage aligns each modality with language, which brings emergent multimodal correlation and collaboration abilities. The second stage instruction-finetunes ChatBridge to align it with user intent on our newly proposed multimodal instruction tuning dataset, named MULTIS, which covers a wide range of 16 multimodal tasks across text, image, video, and audio modalities.

We show strong quantitative and qualitative results on zero-shot multimodal tasks covering text, image, video, and audio modalities. All code, data, and models of ChatBridge will be open-sourced.

ChatBridge

ChatBridge is a multimodal language model capable of perceiving real-world multimodal information, as well as following instructions, thinking, and interacting with humans in natural language. Inspired by Flamingo and BLIP-2, we introduce perceiver modules to bridge the modality-specific encoders and the LLM. We choose the open-sourced Vicuna-13B as the LLM, which is built upon LLaMA and reported to achieve 90% of ChatGPT's quality as per GPT-4's evaluation. As for the modality-specific encoders, we choose EVA-ViT-G as the vision encoder to encode images and videos, and BEATs as the audio encoder to encode audio.
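
To make the bridging design concrete, below is a minimal PyTorch sketch of a perceiver-style bridge module. This is an illustration under our own assumptions rather than the released code: the class name PerceiverBridge is hypothetical, and the feature dimensions (1408 for EVA-ViT-G patch features, 5120 for Vicuna-13B embeddings), query count, and layer count are illustrative.

```python
# A minimal sketch of a perceiver-style bridge: learnable queries cross-attend to
# variable-length features from a frozen modality encoder and are projected into
# the LLM's embedding space. Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class PerceiverBridge(nn.Module):
    """Maps variable-length encoder features to a fixed number of LLM-space tokens."""

    def __init__(self, enc_dim=1408, llm_dim=5120, hidden=768,
                 num_queries=32, num_layers=2, num_heads=8):
        super().__init__()
        # Learnable query tokens that summarize the encoder features.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden) * 0.02)
        self.in_proj = nn.Linear(enc_dim, hidden)       # encoder features -> bridge space
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(hidden, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.out_proj = nn.Linear(hidden, llm_dim)      # bridge space -> LLM embedding space

    def forward(self, enc_feats):                       # enc_feats: (B, N, enc_dim)
        kv = self.in_proj(enc_feats)                    # (B, N, hidden)
        x = self.queries.unsqueeze(0).expand(enc_feats.size(0), -1, -1)
        for attn in self.layers:
            out, _ = attn(query=x, key=kv, value=kv)    # queries cross-attend to features
            x = x + out                                 # residual connection
        return self.out_proj(x)                         # (B, num_queries, llm_dim)


# Usage: in this sketch, the fixed-length output acts as a "soft prompt" that can be
# concatenated with text embeddings and fed to the frozen LLM.
bridge = PerceiverBridge()
visual_feats = torch.randn(2, 257, 1408)   # e.g. patch features from a frozen EVA-ViT-G
soft_tokens = bridge(visual_feats)         # (2, 32, 5120)
```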

  • Stage 1: Bridging each modality with language. We leverage large-scale language-paired two-modality data for multimodal alignment training, including image-text, video-text, and audio-text pairs (a minimal training-step sketch follows this list).
  • Stage 2: Multimodal instruction tuning. We instruction-finetune ChatBridge on our multimodal instruction dataset MULTIS to align the model with user intent, enabling more effective zero-shot generalization on multimodal tasks.
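
As referenced above, here is a minimal sketch of a stage-1 alignment step, assuming a Hugging Face-style causal LM and a standard language-modeling loss on the paired caption. "gpt2" merely stands in for the frozen Vicuna-13B to keep the example lightweight, only the bridge (the PerceiverBridge sketch above) is updated, and the hyperparameters are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a lightweight stand-in for the frozen Vicuna-13B used in the paper.
llm = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
for p in llm.parameters():
    p.requires_grad_(False)                        # the LLM (and encoders) stay frozen

bridge = PerceiverBridge(enc_dim=1408, llm_dim=llm.config.hidden_size)
optimizer = torch.optim.AdamW(bridge.parameters(), lr=1e-4)

enc_feats = torch.randn(1, 257, 1408)              # output of a frozen modality encoder
caption = tok("a dog playing in the park", return_tensors="pt").input_ids

soft_tokens = bridge(enc_feats)                    # (1, 32, hidden)
text_embeds = llm.get_input_embeddings()(caption)  # (1, L, hidden)
inputs_embeds = torch.cat([soft_tokens, text_embeds], dim=1)

# Language-modeling loss on the caption only; the soft-token prefix is ignored (-100).
labels = torch.cat([torch.full((1, soft_tokens.size(1)), -100), caption], dim=1)
loss = llm(inputs_embeds=inputs_embeds, labels=labels).loss
loss.backward()                                    # gradients reach only the bridge
optimizer.step()
```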

MULTimodal InStruction Datasets (MULTIS)

MULTIS consists of two distinct parts: task-specific data and multimodal chat data. In total, MULTIS covers 16 multimodal task categories and 15 source datasets.

  • Task-Specific Data. We collect a vast array of publicly available human-annotated multimodal datasets and transform them into a unified instruction tuning format (an example record follows this list).
  • Multimodal Chat Data. While task-specific data empowers the model to complete standardized tasks, multimodal chat data offers real-world, open-ended dialogues that demand more sophisticated intent comprehension and contextual reasoning, as well as more diverse, helpful, human-like responses. Aside from the image-text chat dataset LLaVA-Instruct-150K, chat data for other modalities remains limited. To address this, we construct a multimodal chat dataset that comprises both unimodal and multimodal inputs across image, video, and audio modalities.
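
For illustration of the unified format mentioned in the list above, here is a hypothetical MULTIS-style record written as a Python dict; the field names, file paths, and prompt wording are assumptions for the sake of the example, not the released schema.

```python
# Hypothetical MULTIS-style record in the unified instruction-tuning format.
# Field names, paths, and wording are illustrative, not the released schema.
sample = {
    "modalities": ["video", "audio"],     # which non-text inputs accompany the prompt
    "video": "clips/street_music.mp4",    # placeholder path
    "audio": "clips/street_music.wav",    # placeholder path
    "instruction": "Describe what is happening, using both what you see and what you hear.",
    "response": "A street musician plays a drum solo while a small crowd claps along.",
    "source": "multimodal chat",          # task-specific data or multimodal chat data
}
```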

Examples of multimodal chat

[Qualitative chat examples: Audio+Video Input, Audio+Image Input, Image Input, Audio Input, Video Input.]

BibTeX

@article{zhao2023chatbridge,
  title={ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst},
  author={Zhao, Zijia and Guo, Longteng and Yue, Tongtian and Chen, Sihan and Shao, Shuai and Zhu, Xinxin and Yuan, Zehuan and Liu, Jing},
  journal={arXiv preprint arXiv:2305.16103},
  year={2023}
}

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.