How DreamLLM Generates an Image On Its Own "Free Will"

All natural documents can be regarded as carriers of interleaved text-image information. Text-only, image-only, and text-image pair data, on the other hand, can be seen as special cases of interleaved corpora with different modality compositions. Thus, it is critical to give the model the capability to learn and generate free-form interleaved documents covering all of these possible distributions.

Interleaved Structure Learning. To model the interleaved structure, a new special token is inserted into the interleaved sequence before each image. During training, DREAMLLM learns to predict this token, which marks where an image appears; conditional image synthesis is performed afterward, as introduced next. During inference, DREAMLLM generates an image of its own “free will” whenever it predicts this token.
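The mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation: the token string `<dream>`, the `Image` class, and the `next_token` and `synthesize_image` callables are hypothetical stand-ins for the language model and the conditional image decoder, chosen only to show the control flow.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

# Assumed placeholder string for the special token that precedes images.
DREAM_TOKEN = "<dream>"
EOS_TOKEN = "<eos>"  # assumed end-of-sequence marker


@dataclass
class Image:
    """Hypothetical stand-in for an image segment in a document."""
    name: str


Segment = Union[str, Image]


def build_training_sequence(document: List[Segment]) -> List[Segment]:
    """Insert the special token before every image in an interleaved document.

    Text-only and image-only documents are handled as special cases:
    the former gets no special tokens, the latter one per image.
    """
    sequence: List[Segment] = []
    for segment in document:
        if isinstance(segment, Image):
            sequence.append(DREAM_TOKEN)  # marks where an image emerges
        sequence.append(segment)
    return sequence


def generate(
    next_token: Callable[[List[Segment]], str],
    synthesize_image: Callable[[List[Segment]], Image],
    prompt: List[Segment],
    max_steps: int = 32,
) -> List[Segment]:
    """Decode step by step; synthesize an image whenever the special token is predicted."""
    context: List[Segment] = list(prompt)
    for _ in range(max_steps):
        token = next_token(context)
        if token == EOS_TOKEN:
            break
        context.append(token)
        if token == DREAM_TOKEN:
            # The model itself chose to emit an image at this position.
            context.append(synthesize_image(context))
    return context
```

In this sketch, the decision to produce an image is just another next-token prediction, so no external rule decides where images go. Calling `build_training_sequence(["A cat.", Image("cat.png")])` places the special token between the caption and the image, while a text-only document passes through unchanged.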

Table 1: Zero-shot multimodal comprehension evaluation of image-to-text captioning, general VQA, text-related VQA, and comprehensive benchmarks. ∗Note that the results of CM3Leon are not zero-shot since captioning data and VQA data like VQAv2 are used during supervised fine-tuning.

Authors:

(1) Runpei Dong, Xi’an Jiaotong University and Internship at MEGVII;

(2) Chunrui Han, MEGVII Technology;

(3) Yuang Peng, Tsinghua University and Internship at MEGVII;

(4) Zekun Qi, Xi’an Jiaotong University and Internship at MEGVII;

(5) Zheng Ge, MEGVII Technology;

(6) Jinrong Yang, HUST and Internship at MEGVII;

(7) Liang Zhao, MEGVII Technology;

(8) Jianjian Sun, MEGVII Technology;

(9) Hongyu Zhou, MEGVII Technology;

(10) Haoran Wei, MEGVII Technology;

(11) Xiangwen Kong, MEGVII Technology;

(12) Xiangyu Zhang, MEGVII Technology and a Project leader;

(13) Kaisheng Ma, Tsinghua University and a Corresponding author;

(14) Li Yi, Tsinghua University, a Corresponding author and Project leader.

Source: https://hackernoon.com/how-dreamllm-generates-an-image-on-its-own-free-will?source=rss