Meissonic: A Non-Autoregressive Mask Image Modeling Text-to-Image Synthesis Model that can Generate High-Resolution Images

Existing attempts to solve the challenges in text-to-image synthesis have mainly focused on two approaches: diffusion-based and token-based image generation. Diffusion models like Stable Diffusion and SDXL have made significant progress by working within compressed latent spaces and introducing techniques such as micro-conditions and multi-aspect training. The integration of transformer architectures, as seen in DiT and U-ViT, has further enhanced the potential of diffusion models. However, these models still face challenges in real-time applications and quantization. Token-based approaches such as MaskGIT and MUSE have introduced masked image modeling (MIM) to overcome the computational demands of autoregressive methods, as sketched below.
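
To make the contrast with autoregressive decoding concrete, here is a minimal sketch of MaskGIT-style parallel decoding under a cosine masking schedule: all image tokens start masked, the model predicts every position at once, and only the least confident predictions are re-masked for the next step. The `model` stub, vocabulary size, and step count are illustrative assumptions, not Meissonic's actual configuration.

```python
import math
import torch

def mim_decode(model, seq_len=1024, vocab=8192, mask_id=8192, steps=12):
    """MaskGIT-style parallel decoding (illustrative sketch).

    `model` maps token ids (B, L) -> logits (B, L, vocab); it is a
    hypothetical stand-in here, not Meissonic's actual interface.
    """
    tokens = torch.full((1, seq_len), mask_id)       # start fully masked
    for step in range(steps):
        logits = model(tokens)                       # predict all positions at once
        conf, pred = logits.softmax(-1).max(-1)      # per-token confidence
        masked = tokens.eq(mask_id)
        tokens = torch.where(masked, pred, tokens)   # accept every prediction...
        # ...then re-mask the least confident ones, per a cosine schedule
        n_mask = int(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        if n_mask > 0:
            # already-fixed tokens get infinite confidence so they stay fixed
            conf = torch.where(masked, conf, torch.full_like(conf, float("inf")))
            remask = conf.topk(n_mask, largest=False).indices
            tokens.scatter_(1, remask, mask_id)
    return tokens

# Usage with a random-logits stub in place of a trained transformer:
stub = lambda ids: torch.randn(ids.shape[0], ids.shape[1], 8192)
out = mim_decode(stub)   # 12 parallel forward passes instead of 1024 sequential ones
```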

Researchers from Alibaba Group, Skywork AI, HKUST(GZ), HKUST, Zhejiang University, and UC Berkeley have proposed Meissonic, an innovative method to elevate non-autoregressive MIM text-to-image synthesis to a level comparable with state-of-the-art diffusion models like SDXL. Meissonic utilizes a comprehensive suite of architectural innovations, advanced positional encoding strategies, and optimized sampling conditions to enhance MIM’s performance and efficiency. The model uses high-quality training data, micro-conditions informed by human preference scores, and feature compression layers to improve image fidelity and resolution. Meissonic can produce 1024 × 1024 resolution images and often outperforms existing models in generating high-quality, high-resolution images.
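
As a rough illustration of micro-conditioning, the snippet below encodes scalar conditions (here, image resolution and a human-preference score) as sinusoidal embeddings and packs them into one vector the backbone can consume, in the spirit of SDXL-style micro-conditions. The specific condition set, embedding size, and function names are assumptions for illustration, not Meissonic's published recipe.

```python
import math
import torch

def sinusoidal_embed(value: float, dim: int = 256) -> torch.Tensor:
    """Embed a scalar condition with fixed sinusoidal frequencies."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = value * freqs
    return torch.cat([angles.sin(), angles.cos()])   # (dim,)

def micro_condition(height: int, width: int, hps_score: float) -> torch.Tensor:
    """Pack micro-conditions into one vector (illustrative; the exact
    conditions Meissonic feeds its backbone are an assumption here)."""
    parts = [sinusoidal_embed(float(v)) for v in (height, width, hps_score)]
    return torch.cat(parts)                          # (3 * 256,)

cond = micro_condition(1024, 1024, hps_score=0.92)   # e.g. added to the pooled text embedding
```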

Meissonic’s architecture integrates a CLIP text encoder, a vector-quantized (VQ) image encoder and decoder, and a multi-modal Transformer backbone for efficient, high-performance text-to-image synthesis. It also includes QK-Norm layers and applies gradient clipping to enhance training stability and reduce NaN-loss issues during distributed training, as sketched below.
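
Here is a minimal sketch of a QK-Norm attention layer, assuming LayerNorm applied to the per-head queries and keys; the head count, dimensions, and norm choice are illustrative, since the exact layer layout is not spelled out here. Normalizing q and k bounds the attention logits, which is what makes the layer useful against NaN losses in mixed-precision distributed training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Self-attention with normalized queries and keys (QK-Norm).

    A minimal sketch: sizes and the use of LayerNorm are illustrative
    assumptions, not Meissonic's published layout.
    """
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        self.q_norm = nn.LayerNorm(self.head_dim)   # QK-Norm: bounds attention
        self.k_norm = nn.LayerNorm(self.head_dim)   # logits, curbing NaN losses

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, tokens, head_dim)
        q, k, v = (t.view(b, n, self.heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)       # normalize before the dot product
        attn = F.scaled_dot_product_attention(q, k, v)
        return self.out(attn.transpose(1, 2).reshape(b, n, d))
```

The gradient-clipping half of the stability recipe is a single call after the backward pass, e.g. `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` (the max-norm value here is an assumption).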

Meissonic, optimized to 1 billion parameters, runs efficiently on 8GB of VRAM, making both inference and fine-tuning convenient. Qualitative comparisons demonstrate Meissonic’s image quality and text-image alignment capabilities. Human evaluations using K-Sort Arena and GPT-4 assessments indicate that Meissonic achieves performance comparable to DALL-E 2 and SDXL in human preference and text alignment, with improved efficiency. On image editing, Meissonic is benchmarked against state-of-the-art models on the EMU-Edit dataset, which covers seven different operations. The model demonstrates versatility in both mask-guided and mask-free editing, achieving strong performance without any specific training on image-editing data or instruction datasets.
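
The 8GB VRAM figure is easy to sanity-check with back-of-the-envelope arithmetic: 1 billion parameters in 16-bit precision occupy roughly 2 GB of weights, leaving several gigabytes for activations and the VQ decoder.

```python
# Rough VRAM estimate for 1B-parameter inference (illustrative arithmetic):
params = 1_000_000_000
weight_gib = params * 2 / 2**30          # fp16/bf16: 2 bytes per parameter
print(f"weights: {weight_gib:.2f} GiB")  # ~1.86 GiB of an 8 GiB budget
```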

In conclusion, the researchers introduced Meissonic, an approach that elevates non-autoregressive MIM text-to-image synthesis to the level of leading diffusion models. The model incorporates innovative elements such as a blended transformer architecture, advanced positional encoding, and adaptive masking rates to achieve superior performance in high-resolution image generation. Despite its compact 1B-parameter size, Meissonic outperforms larger diffusion models while remaining accessible on consumer-grade GPUs. Moreover, Meissonic aligns with the emerging trend of offline text-to-image applications on mobile devices, exemplified by recent innovations from Google and Apple. It enhances user experience and privacy in mobile imaging technology, empowering users with creative tools while ensuring data security.