A Simple Diffusion Transformer on Unified Video, 3D, and Game Field Generation

Overview

We propose a new diffusion field transformer that unifies image, video, 3D viewpoint, and game generation within a single network.

The model combines a view-wise sampling algorithm, which focuses learning on local structure, with autoregressive generation, which keeps global geometry consistent. This is especially useful in game generation, where the number of frames is unbounded.
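To make the autoregressive part concrete, here is a minimal sketch of a rollout loop that conditions each new frame on a bounded window of recent frames plus a control action. It is an illustration under stated assumptions, not the authors' implementation: `autoregressive_rollout`, `generate_next`, and `context_len` are hypothetical names, and the toy "model" in the usage lines simply adds the action to the latest frame.

```python
from collections import deque
from itertools import islice, repeat
from typing import Callable, Deque, Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")
Action = TypeVar("Action")


def autoregressive_rollout(
    generate_next: Callable[[List[Frame], Action], Frame],  # hypothetical model call
    initial_frames: List[Frame],
    actions: Iterable[Action],
    context_len: int,
) -> Iterator[Frame]:
    """Yield one new frame per action, conditioned on the last context_len frames."""
    # A bounded deque keeps memory constant however long the rollout runs.
    context: Deque[Frame] = deque(initial_frames, maxlen=context_len)
    for action in actions:
        frame = generate_next(list(context), action)
        context.append(frame)  # the oldest frame is dropped automatically
        yield frame


# Toy usage: the stand-in "model" adds the action to the most recent frame.
toy_model = lambda context, action: context[-1] + action
first_five = list(
    islice(autoregressive_rollout(toy_model, [0], repeat(1), context_len=4), 5)
)
# first_five == [1, 2, 3, 4, 5]
```

Because the rollout is a generator over an arbitrary (possibly endless) stream of actions, the caller decides when to stop, which mirrors the unbounded-frame setting of game generation.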

Abstract

Probabilistic fields model the distribution of continuous functions defined over metric spaces. While they show great potential for unifying data generation across modalities, including images, videos, and 3D geometry, they can only generate simple results and do not generalize to long contexts. This can be attributed to their MLP architecture, with which it is difficult to capture global structure through uniform sampling. To this end, we propose a new and simple model that combines a view-wise sampling algorithm, which focuses on local structure learning, with autoregressive generation, which preserves global geometry. The model can be scaled to generate high-fidelity data while unifying multiple modalities and preserving long context, and it accommodates cross-modality conditions such as text prompts for text-to-video generation, camera poses for 3D view generation, and control actions for game generation. Experimental results on data generation in various modalities demonstrate the effectiveness of our 675M-parameter model, as well as its potential as a foundation framework for scalable, modality-unified visual content generation.
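The contrast between uniform and view-wise sampling can be sketched on a toy video field stored as an array of shape (frames, height, width, channels). This is a hedged illustration only: the function names `uniform_sample` and `view_wise_sample` are hypothetical, and treating one video frame as one "view" and sampling every pixel in it are simplifying assumptions, not details confirmed by the abstract.

```python
import numpy as np


def uniform_sample(field, n_points, rng):
    """Draw n_points (coordinate, signal) pairs uniformly over the whole field."""
    t, h, w, _ = field.shape
    flat_idx = rng.choice(t * h * w, size=n_points, replace=False)
    ti, yi, xi = np.unravel_index(flat_idx, (t, h, w))
    coords = np.stack([ti, yi, xi], axis=-1)  # shape (n_points, 3)
    return coords, field[ti, yi, xi]          # signals: (n_points, channels)


def view_wise_sample(field, n_views, rng):
    """Pick n_views whole frames and return every pixel of each one.

    Assumption: a "view" is one frame, so the samples inside a view are
    spatially dense and local structure stays intact.
    """
    t, h, w, _ = field.shape
    views = np.sort(rng.choice(t, size=n_views, replace=False))
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    shape = (n_views, h, w)
    coords = np.stack(
        [
            np.broadcast_to(views[:, None, None], shape),
            np.broadcast_to(yy, shape),
            np.broadcast_to(xx, shape),
        ],
        axis=-1,
    )                                          # shape (n_views, h, w, 3)
    return coords, field[views]                # signals: (n_views, h, w, channels)


# Toy usage on an 8-frame, 4x4 RGB field.
rng = np.random.default_rng(0)
toy_field = np.arange(8 * 4 * 4 * 3).reshape(8, 4, 4, 3)
u_coords, u_signals = uniform_sample(toy_field, 32, rng)
v_coords, v_signals = view_wise_sample(toy_field, 2, rng)
```

Both calls return the same number of coordinate-signal pairs here (32), but the uniform samples are scattered across all frames, while the view-wise samples cover two frames completely.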

Game Generation


Text-to-Video Generation (our results, generated by scaled-up training on the WebVid dataset)


Prompt: Female violinist rehearsing with headphones at the microphone. 4k.
Prompt: Health, environment care for mother earth. the girl's hands are holding a tree sapling. growth and agriculture new life concept. plant and tree breeding. saving life. biological diversity of plants.
Prompt: Little girl plays in the children's room. the kid plays about and throws his things out of the box. daughter plays with clothes at home.

Prompt: Pile of old tvs and retro television with green screen. dolly out. green screen. 4k resolution.
Prompt: Sun light rays through under water's glittering. underwater scene full of bubbles up to sun.
Prompt: Hand in glove put covid 19 vaccination sign small shopping cart with vaccine ampoules on blue background soft focus.
Prompt: Star abstract retro tunnel loop neon glowing animation video template seamless loop.
Prompt: Hands of man placing components on pcb board. close up. zoom in. 4k resolution.
Prompt: Globe and mouse cursor on white background.

Text-to-Face Video (Visual Comparisons)

More detailed text descriptions are used for inference.

Prompt: She is blurry and young. This female begins with a sad expression, and she then is surprised, she eventually is sad. This woman closes eyes while singing for a moderate time.
VDM
CogVideo
sDFT (ours)

Prompt: He has high cheekbones. This man starts with an expression of surprise, and he then has an expression of disgust, he eventually has an expression of surprise. This man talks meanwhile wagging head for a long time.
VDM
CogVideo
sDFT (ours)

Prompt: He is young. He has blond hair. This man begins with a disgusted expression, and next he has a disgusted face, and then he is angry, and hhe then turns into a disgusted face and afterwards he turns happy, and afterwards he turns disgusted, and he then turns into a disgusted face, and he then turns into a angry expression, and he then has a disgusted expression and later on he turns into a angry face, and next he is disgusted, and he then turns into a angry face, and later on he turns into a disgusted face, and he then is angry, in the end, he turns into a disgusted expression. The male talks, frowns, while shaking head for a long time.
VDM
CogVideo
sDFT (ours)

Prompt: She is young. At the beginning, this female is surprised, and afterwards she turns into a happy expression, she finally turns surprised. She first nods and talks at the same time for some time, next she laughs for a moderate time.
VDM
CogVideo
sDFT (ours)

Prompt: This person has sideburns and beard. He is wearing goatee. Firstly, he gazes for a short time, and he then gazes for a short time, then he blinks for a short time, he finally blinks for a short time.
VDM
CogVideo
sDFT (ours)

Prompt: She has wavy hair and high cheekbones. To begin with, this female talks for a short time, and she then talks for a short time, next she talks for a short time, in the end, she talks for a short time.
VDM
CogVideo
sDFT (ours)

Prompt: A male is young. He has a big nose and high cheekbones. The man begins to talks for a short time, then he turns for a moderate time, in the end, he turns for a moderate time.
VDM
CogVideo
sDFT (ours)

3D Object Generation (Visual Comparisons)

Five examples, each comparing GASP, GEM, and sDFT (ours).

Acknowledgements

The website template was borrowed from Mip-NeRF 360.