Researchers from Tsinghua University and Zhipu AI Introduced CogView3: An Innovative Cascaded Framework that Enhances the Performance of Text-to-Image Diffusion
A team of researchers from Tsinghua University and Zhipu AI introduced CogView3, an innovative approach to text-to-image generation that employs a technique called relay diffusion. Unlike conventional single-stage diffusion models, CogView3 breaks down the generation into multiple stages, starting with the creation of low-resolution images followed by a relay-based super-resolution process. This cascaded approach enables the model to focus computational resources more efficiently, generating competitive high-resolution images while minimizing costs. Remarkably, CogView3 achieves a 77.0% win rate in human evaluations against SDXL, the current leading open-source model, and requires only half the inference time. A distilled variant of CogView3 further reduces the inference time to one-tenth of that required by SDXL, while still delivering comparable image quality.
CogView3 employs a cascaded relay diffusion structure that first generates a low-resolution base image, which is then refined in subsequent stages to reach higher resolutions. In contrast to traditional cascaded diffusion frameworks, CogView3 introduces a novel approach called relaying super-resolution, wherein Gaussian noise is added to the low-resolution image, and diffusion is restarted from these noised images. This allows the super-resolution stage to correct any artifacts from the earlier stages, effectively refining the image. The model operates in the latent image space, which is eight times compressed from the original pixel space. It utilizes a simplified linear blurring schedule to efficiently blend details from the base and super-resolution stages, ultimately producing images at extremely high resolutions such as 2048×2048 pixels. Furthermore, CogView3’s training process is enhanced by an automatic image recaptioning strategy using GPT-4V, enabling better alignment between training data and user prompts.
The experimental results presented in the paper demonstrate CogView3’s superiority over existing models, particularly in terms of balancing image quality and computational efficiency. For instance, in human evaluations using challenging prompt datasets like DrawBench and PartiPrompts, CogView3 consistently outperformed the state-of-the-art models SDXL and Stable Cascade. Metrics such as Aesthetic Score, Human Preference Score (HPS v2), and ImageReward indicate that CogView3 generated aesthetically pleasing images with better prompt alignment. Notably, while maintaining high image quality, CogView3 also achieved reduced inference times—a critical advancement for practical applications. The distilled version of CogView3 was also shown to have a significantly lower inference time (1.47 seconds per image) while maintaining competitive performance, which highlights the efficiency of the relay diffusion approach.
In conclusion, CogView3 represents a significant leap forward in the field of text-to-image generation, combining efficiency and quality through its innovative use of relay diffusion. By generating images in stages and refining them through a super-resolution process, CogView3 not only reduces the computational burden but also improves the quality of the resulting images. This makes it highly suitable for applications requiring fast and high-quality image generation, such as digital content creation, advertising, and interactive design. Future work could explore expanding the model’s capacity to handle even larger resolutions efficiently and further refine the distillation techniques to push the boundaries of what is possible in real-time generative AI.