On 28 April, Stability AI, in collaboration with its multimodal AI research lab DeepFloyd, announced the research release of DeepFloyd IF, a state-of-the-art text-to-image cascaded pixel diffusion model.
What’s exciting about DeepFloyd IF is its ability to generate legible text within images, something no open-source model has been able to do reliably until now.
DeepFloyd IF leverages the T5 language model to generate coherent, legible text alongside objects with varied properties and spatial relations, a capability that has previously been a challenge for most text-to-image models.
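The "cascaded pixel diffusion" design mentioned above means a base model generates a small 64×64 image directly in pixel space, and two super-resolution diffusion stages upscale it to 256×256 and then 1024×1024. The sketch below illustrates only the resolution chain with stand-in functions (random pixels and nearest-neighbour upscaling); the real stages each run a full text-conditioned denoising loop.

```python
import numpy as np

def fake_base_stage(rng, size=64):
    # Stand-in for Stage I: "generate" a 64x64 RGB image in pixel space.
    # The real stage denoises from pure noise, conditioned on T5 embeddings.
    return rng.random((size, size, 3))

def fake_super_resolution(image, factor=4):
    # Stand-in for Stages II/III: upscale in pixel space.
    # The real stages are diffusion models conditioned on the low-res image.
    return image.repeat(factor, axis=0).repeat(factor, axis=1)

rng = np.random.default_rng(0)
x64 = fake_base_stage(rng)           # Stage I output:   64 x 64
x256 = fake_super_resolution(x64)    # Stage II output:  256 x 256
x1024 = fake_super_resolution(x256)  # Stage III output: 1024 x 1024
```

Working in pixel space throughout (rather than a compressed latent space, as Stable Diffusion does) is part of why the model can render fine detail such as text.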
There are a few other points to note as well:
Deep text prompt understanding
Utilising the large language model T5-XXL-1.1 as a text encoder, DeepFloyd IF’s generation pipeline incorporates a significant number of text-image cross-attention layers, ensuring better alignment between prompts and generated images.
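To make the cross-attention idea concrete, here is a minimal single-head sketch in NumPy: image patches act as queries and attend over the text encoder's token embeddings (keys/values). This is illustrative only; the real layers use learned projection matrices, multiple heads, and much larger dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, text_tokens):
    # Image patches (queries) attend to text-encoder tokens (keys/values).
    # Learned projections and multi-head splitting are omitted here.
    d_k = text_tokens.shape[-1]
    scores = image_tokens @ text_tokens.T / np.sqrt(d_k)  # (patches, tokens)
    weights = softmax(scores)                             # each row sums to 1
    return weights @ text_tokens, weights

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 8))  # 16 image patches, feature dim 8
txt = rng.standard_normal((5, 8))   # 5 text tokens from the encoder
out, w = cross_attention(img, txt)  # out: (16, 8), w: (16, 5)
```

Each image location thus mixes in information from the prompt tokens most relevant to it, which is what ties the generated pixels to the text.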
High degree of photorealism
DeepFloyd IF achieves a remarkable zero-shot FID (Fréchet Inception Distance) score of 6.66 on the COCO dataset, a key metric used to evaluate the performance of text-to-image models. The lower the FID score, the better the performance.
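For readers unfamiliar with the metric: FID models the feature statistics of real and generated images as Gaussians and measures the distance between them, FID = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^½). The sketch below computes it for the simplified case of diagonal covariances, where the matrix square root reduces to an elementwise square root; real implementations extract features with an Inception network and use full covariance matrices.

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    # FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 sqrt(S1 S2)).
    # With diagonal covariances S, sqrt(S1 S2) is elementwise.
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return mean_term + cov_term

mu = np.zeros(4)
var = np.ones(4)
print(fid_diagonal(mu, var, mu, var))        # identical distributions -> 0.0
print(fid_diagonal(mu, var, mu + 1.0, var))  # shifted means -> 4.0
```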
Aspect ratio shift
The model can generate images with non-standard aspect ratios, such as vertical or horizontal formats, in addition to the standard square, allowing for greater flexibility in the types of images produced.
Zero-shot image-to-image translations
DeepFloyd IF can modify images by resizing the original image to 64×64 pixels, adding noise through forward diffusion, and then using backward diffusion with a new prompt to denoise the image. In inpainting mode, this process is applied only within a local region of the image.
The style can be further altered through the super-resolution modules via a text prompt, enabling modifications to style, patterns, and details while preserving the basic form of the source image, all without the need for fine-tuning.
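The forward-diffusion step described above can be sketched with the standard noising formula x_t = √ᾱ·x₀ + √(1−ᾱ)·ε, where ᾱ (alpha_bar) controls how much of the source image survives. This is a minimal illustration; the backward (denoising) pass, which a learned model performs under the guidance of the new prompt, is omitted.

```python
import numpy as np

def add_forward_noise(x0, alpha_bar, rng):
    # q(x_t | x_0): blend the source image with Gaussian noise.
    # Smaller alpha_bar = more noise = more freedom for the new prompt.
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
x0 = rng.random((64, 64, 3))              # source image resized to 64x64
xt = add_forward_noise(x0, 0.5, rng)      # partially noised starting point
x_keep = add_forward_noise(x0, 1.0, rng)  # alpha_bar = 1: no noise added
```

Starting the denoiser from a partially noised version of the source, rather than from pure noise, is what lets the output keep the original composition while the new prompt reshapes its style and details.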
The model is currently released under a non-commercial license that permits research use.