Workshop

Multi-modal Foundation Model meets Embodied AI (MFM-EAI)

Zhenfei (Jeremy) Yin · Mahi Shafiullah · Zhenhua Xu · Quan Vuong · Jing Shao · Lu Sheng · Takayuki Osa · Hengshuang Zhao · Mohamed Elhoseiny · Xihui Liu · Tatsuya Harada · Cewu Lu · Wanli Ouyang · Pete Florence · Yu Qiao · Dacheng Tao · Phil Torr

Abstract

In recent years, Multi-modal Foundation Models (MFM) such as CLIP, ImageBind, DALL·E 3, GPT-4V, and Gemini have become one of the most captivating and rapidly advancing areas in AI. The open-source community for MFM has also seen vigorous growth, with the emergence of models and algorithms like LLaVA, LAMM, Stable Diffusion, and OpenFlamingo. These MFMs are now actively exploring application scenarios beyond traditional computer vision tasks.

Recent studies have unveiled the immense potential these models hold in empowering embodied AI agents. The intersection of these fields raises a multitude of open questions and leaves much territory unexplored. This workshop, MFM-EAI, is dedicated to exploring these critical challenges:

- How can we train and evaluate MFM in open-ended environments?
- What constitutes an effective system architecture for MFM-based Embodied AI Agents?
- And importantly, how can MFM augment the perceptual and decision-making capabilities of these agents, balancing their high-level decision-making prowess with the nuanced requirements of low-level control in embodied systems?

Topics include but are not limited to:

- Training and evaluation of MFM in open-ended scenarios
- Data collection for training Embodied AI Agents and corresponding MFM
- Framework design for MFM-powered embodied agents
- Decision-making in Embodied Agents empowered by MFM
- Low-level control in Embodied Agents empowered by MFM
- Evaluation and simulation of Embodied Agents
- Limitations of MFM in empowering Embodied AI

Schedule

Timezone: America/Los_Angeles

12:00 AM

Opening remarks

Video

12:10 AM

General-Purpose Embodied AI

Sergey Levine

Video

12:40 AM

On Building General-Purpose Robots

Lerrel Pinto

Video

1:10 AM

Poster session #1 and coffee break

1:50 AM

Foundation models for robotics

Chelsea Finn

Video

2:20 AM

Early career researchers in Embodied AI: Challenges and Opportunities in Multimodal Foundation Models

Zhenfei (Jeremy) Yin · Mahi Shafiullah · Yilun Du · Boyuan Chen · Haoshu Fang

Video

3:15 AM

Lunch

4:00 AM

Poster session #2

5:00 AM

Compositional Foundation Models

Yilun Du

Video

5:30 AM

DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning (outstanding paper)

Video

5:40 AM

Instruction-Guided Visual Masking

Video

5:50 AM

BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks

Video

6:00 AM

Behavior Generation with Latent Actions

Video

6:10 AM

Multimodal foundation world models for generalist embodied agents

Video

6:20 AM

MFM-EAI Challenge 1&2&3

Video

6:50 AM

LEO: An embodied generalist agent in 3D world and Beyond

Xiaojian Ma

Video

7:20 AM

Generative Interactive Environments

Jake Bruce

Video

7:50 AM

End of program

GROOT-1.5: Learning to Follow Multi-Modal Instructions from Weak Supervision

Shaofei Cai · Bowei Zhang · Zihao Wang · Xiaojian Ma · Anji Liu · Yitao Liang