Skip to yearly menu bar Skip to main content

Workshop

Text, camera, action! Frontiers in controllable video generation

Michal Geyer · Joanna Materzynska · Jack Parker-Holder · Yuge Shi · Trevor Darrell · Nando de Freitas · Antonio Torralba

Project Page

Abstract

The past few years have seen the rapid development of Generative AI, with powerful foundation models demonstrating the ability to generate new, creative content in multiple modalities. Following breakthroughs in text and image generation, it is clear the next frontier lies in video. One challenging but compelling aspect unique to video generation is the various forms in which one could control such generation: from specifying the content of a video with text, to viewing a scene with different camera angles, or even directing the actions of characters within the video. We have also seen the use cases of these models diversify, with works that extend generation to 3D scenes, use such models to learn policies for robotics tasks or create an interactive environment for gameplay. Given the great variety of algorithmic approaches, the rapid progress, and the tremendous potential for applications, we believe now is the perfect time to engage the broader machine learning community in this exciting new research area. We thus propose the first workshop on Controllable Video Generation (CVG), focused on algorithms that can control videos with multiple modalities and frequencies, and the swathe of potential applications. We anticipate CVG would be uniquely relevant to ICML as it brings together a variety of different communities: from traditional computer vision, to safety and alignment, to those working on world models in a reinforcement learning or robotics setting. This makes ICML the perfect venue, where seemingly unrelated communities can join together and share ideas in this new emerging area of AI research.

Video

Chat is not available.

Schedule

Timezone: America/Los_Angeles

12:00 AM

Introduction and Opening Remarks

Joanna Materzynska

Video

12:05 AM

Andreas Blattman - Pre-Training Rectified Flow Transformer Models for Controllable Video Generation

Andreas Blattman

Video

12:40 AM

Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices

Nathaniel Cohen

Video

1:00 AM

Coffee break

1:30 AM

Ashley Edwards - Learning actions, policies, rewards, and environments from videos alone

Ashley Edwards

Video

2:10 AM

Tali Dekel - The Future of Video Generation: Beyond Data and Scale

Tali Dekel

Video

2:50 AM

Diverse and aligned audio-to-video generation via text-to-video model adaptation

Idan Schwartz

Video

3:10 AM

Lunch Break

4:30 AM

Poster Session

5:30 AM

Sander Dieleman - Wading through the noise: an intuitive look at diffusion models

Sander Dieleman

Video

6:10 AM

EgoSim: Egocentric Exploration in Virtual Worlds with Multi-modal Conditioning

Wei Yu

Video

6:30 AM

Alignment is All You Need: A Training-free Augmentation Strategy for Pose-guided Video Generation

Xiaoyu Jin

6:50 AM

William (Bill) Peebles - TBD

William Peebles

Video

7:30 AM

Boyi Li - Leveraging LLMs to Imagine Like Humans by Aligning Representations from Vision and Language

Boyi Li

Video

A Systematic Comparison of fMRI-to-video Reconstruction Techniques

Camilo Fosco · Benjamin Lahner · Alex Andonian · Bowen Pan · Aude Oliva

Vid3D: Synthesis of Dynamic 3D Scenes using 2D Video Diffusion

Rishab Parthasarathy · Zachary Ankner · Aaron Gokaslan

Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos

Jiahe Liu · Youran Qu · Qi Yan · xiaohui zeng · Lele Wang · Renjie Liao

EgoSim: Egocentric Exploration in Virtual Worlds with Multi-modal Conditioning

Wei Yu · Songheng Yin · Steve Easterbrook · Animesh Garg

EgoSim: Egocentric Exploration in Virtual Worlds with Multi-modal Conditioning

Wei Yu · Songheng Yin · Steve Easterbrook · Animesh Garg

Jafar: An Open-Source Genie Reimplemention in Jax

Timon Willi · Matthew T Jackson · Jakob Foerster

Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices

Nathaniel Cohen · Vladimir Kulikov · Matan Kleiner · Inbar Huberman-Spiegelglas · Tomer Michaeli

Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices

Nathaniel Cohen · Vladimir Kulikov · Matan Kleiner · Inbar Huberman-Spiegelglas · Tomer Michaeli

Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation

Susung Hong · Junyoung Seo · Heeseong Shin · Sunghwan Hong · Seungryong Kim

Diverse and aligned audio-to-video generation via text-to-video model adaptation

Guy Yariv · Idan Schwartz · Itai Gat · Yossi Adi · Sagie Benaim · Lior Wolf

Diverse and aligned audio-to-video generation via text-to-video model adaptation

Guy Yariv · Idan Schwartz · Itai Gat · Yossi Adi · Sagie Benaim · Lior Wolf

Overcoming Knowledge Barriers: Online Imitation Learning from Observation with Pretrained World Models

Xingyuan Zhang · Philip Becker-Ehmck · Patrick van der Smagt · Maximilian Karl

Alignment is All You Need: A Training-free Augmentation Strategy for Pose-guided Video Generation

Xiaoyu Jin · Zunnan Xu · Mingwen Ou · Wenming Yang

Alignment is All You Need: A Training-free Augmentation Strategy for Pose-guided Video Generation

Xiaoyu Jin · Zunnan Xu · Mingwen Ou · Wenming Yang