We propose a new diffusion field transformer that unifies image, video, 3D viewpoint, and game generation within a single network.
Our model combines a view-wise sampling algorithm that focuses on local structure learning with autoregressive generation that maintains consistent global geometry, which is especially useful for game generation with unbounded numbers of frames.

Probabilistic fields model distributions over continuous functions defined on metric spaces. Although they show great potential for unifying data generation across modalities, including images, videos, and 3D geometry, they can only produce simple results and do not generalize to long contexts. We attribute this to their MLP architecture, which struggles to capture global structure under uniform sampling. To this end, we propose a new and simple model that comprises a view-wise sampling algorithm to focus on local structure learning and incorporates autoregressive generation to preserve global geometry. The model can be scaled to generate high-fidelity data while unifying multiple modalities and preserving long-range context, and it accepts cross-modal conditions such as text prompts for text-to-video generation, camera poses for 3D view generation, and control actions for game generation. Experimental results on data generation across these modalities demonstrate the effectiveness of our 675M-parameter model, as well as its potential as a foundation framework for scalable, modality-unified visual content generation.
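To make the view-wise sampling and autoregressive generation concrete, the following is a minimal, self-contained Python sketch of the idea, not our actual implementation. The names (`FieldModel`, `view_coords`, `generate_autoregressive`), the toy denoiser, and all sizes are illustrative assumptions: each view's coordinates are denoised together to learn local structure, and a subset of already-generated (coordinate, value) pairs is carried forward as context so later views remain consistent with the global geometry.

```python
# Hypothetical sketch of view-wise sampling + autoregressive view generation.
# All names and the toy denoiser are illustrative assumptions, not the real model.
import numpy as np


class FieldModel:
    """Stand-in for a probabilistic field: maps coordinates + context to values."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def denoise(self, coords, noisy_values, context_pairs):
        # Placeholder "denoiser": in the real model this would be a transformer
        # conditioned on (coordinate, value) context pairs from previous views.
        bias = 0.0
        if context_pairs:
            bias = np.mean([v for _, v in context_pairs])
        return 0.5 * noisy_values + bias  # toy update, not a real diffusion step


def view_coords(view_index, height=4, width=4):
    """Coordinates of one view (frame): (view, y, x), spatially normalized to [0, 1]."""
    ys, xs = np.meshgrid(
        np.linspace(0, 1, height), np.linspace(0, 1, width), indexing="ij"
    )
    vs = np.full_like(ys, float(view_index))
    return np.stack([vs, ys, xs], axis=-1).reshape(-1, 3)


def generate_autoregressive(model, num_views=3, steps=4):
    """Generate views one by one; each view conditions on earlier generated views."""
    context_pairs = []  # (coordinate, value) pairs kept as global context
    generated = []
    for v in range(num_views):
        coords = view_coords(v)                      # view-wise sampling: one full view
        values = model.rng.normal(size=len(coords))  # start from noise
        for _ in range(steps):                       # iterative denoising within the view
            values = model.denoise(coords, values, context_pairs)
        generated.append(values.reshape(4, 4))
        # Keep a subset of this view as context so later views stay globally consistent.
        for c, val in zip(coords[::4], values[::4]):
            context_pairs.append((c, float(val)))
    return generated


if __name__ == "__main__":
    frames = generate_autoregressive(FieldModel())
    print(f"generated {len(frames)} views, each of shape {frames[0].shape}")
```

In this sketch, cross-modal conditions (text prompts, camera poses, or control actions) would simply be additional inputs to the denoiser alongside the context pairs; they are omitted here to keep the example short.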
The website template was borrowed from Mip-NeRF 360.