ML News

NVIDIA’s eDiffi: A better alternative to Stable Diffusion?

Nov 8, 2022

4 min read

In the race toward general AI, the field has settled on text-to-image generation models as its benchmark for a network's efficacy. In this publication, we have covered architectures like DALL·E 2 and Stable Diffusion, which put this task in a new perspective.

This has prompted labs worldwide, like OpenAI and DeepMind, to bring out their next revolutionary architecture to compete with these giants. Whether that push is driven by innovation or hype we leave to you, but who are we to complain when each passing architecture gives us more credit and creative freedom?

This brings us to today’s architecture in focus: NVIDIA’s eDiffi, an innovation from the lab’s trusted machine learning experts that may become a module in your next project, if you can train it. Read on to learn more.

eDiffi: Text-to-Image Diffusion Models

When it comes to understanding hardware, few organizations are more adept than NVIDIA, and that expertise shows in how this architecture works. The authors describe the model as “a new generation of generative AI content creation tool that offers unprecedented text-to-image synthesis with instant style transfer and intuitive painting with words capabilities.” This is further backed up by the modules they use for image generation.

eDiffi offers two effective techniques for generating images: Paint-with-Words and style-guided image generation. Paint-with-Words lets you assign different colors to different words or phrases in the prompt so that the model knows where to place each object. Style-guided generation, in turn, resembles a more advanced version of neural style transfer: it takes a reference image to guide the style of the generated image.
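A minimal sketch of the idea behind Paint-with-Words: the user's painted region can be turned into a bias that is added to the cross-attention scores between image locations and prompt tokens, so the painted pixels attend more strongly to the chosen word. All names, sizes, and the bias strength below are illustrative assumptions, not eDiffi's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy setup: a 4x4 image grid (16 locations) attending over 3 prompt tokens.
rng = np.random.default_rng(0)
attn_logits = rng.normal(size=(16, 3))  # stand-in cross-attention scores

# The user "paints" token 1 onto the left half of the grid.
mask = np.zeros((4, 4))
mask[:, :2] = 1.0
bias = np.zeros((16, 3))
bias[:, 1] = mask.ravel()

w = 2.0  # illustrative strength of the painted region
attn_plain = softmax(attn_logits)
attn_painted = softmax(attn_logits + w * bias)

# Every painted location now attends more to token 1 than before.
left = mask.ravel().astype(bool)
print(np.all(attn_painted[left, 1] > attn_plain[left, 1]))  # True
```

The key design point is that the bias is applied before the softmax, so it reweights attention smoothly rather than hard-masking it.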

Let us break down the model’s architecture further in the upcoming subsections and try to understand how it works.

eDiffi Architecture

The architecture follows a straightforward principle: “Starting from random noise, such text-to-image diffusion models gradually iteratively synthesize images while conditioning on text prompts.” This is achieved by modeling the distinct synthesis behaviors that arise at different stages of that iterative process.

Existing networks achieve image generation by sharing model parameters throughout the synthesis process, starting from random noise and refining it to meet the textual requirements. In contrast, eDiffi uses an ensemble of models, each specialized for a different stage of synthesis.
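The ensemble idea can be sketched as routing each denoising step to the expert responsible for that noise interval. The expert names, split points, and total step count below are made-up placeholders; real experts would be full U-Nets.

```python
# Hypothetical sketch of an "ensemble of expert denoisers": instead of one
# shared network, a different expert handles each interval of the schedule.

def make_expert(name):
    def denoise(x, t):
        # A real expert would run a U-Net; here we just tag the call.
        return f"{name} denoised step {t}"
    return denoise

T = 1000  # total diffusion steps (illustrative)
experts = [
    (700, make_expert("high-noise expert")),  # t in [700, 1000): global layout
    (300, make_expert("mid-noise expert")),   # t in [300, 700)
    (0,   make_expert("low-noise expert")),   # t in [0, 300): fine detail
]

def denoise_step(x, t):
    # Pick the first expert whose interval contains t.
    for t_min, expert in experts:
        if t >= t_min:
            return expert(x, t)

print(denoise_step(None, 950))  # high-noise expert denoised step 950
print(denoise_step(None, 10))   # low-noise expert denoised step 10
```

The intuition is that early (high-noise) steps decide composition while late (low-noise) steps refine texture, so separate capacity for each regime can help.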

The architecture encodes the text using two existing methods: CLIP and T5 encoders.

CLIP is an integrated visual and linguistic framework, useful for zero-shot image classification and image-text similarity. CLIP extracts visual features using a ViT-like transformer and text features using a causal language model. Both are then projected into a latent space of the same size, and the similarity score is computed as the dot product of the projected image and text features.
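The projection-and-dot-product step described above can be sketched in a few lines. The feature dimensions and random projection matrices below are stand-ins for CLIP's learned encoders and projections, not the real weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes standing in for CLIP's encoder outputs and joint space.
d_img, d_txt, d_joint = 768, 512, 256
W_img = rng.normal(size=(d_img, d_joint)) / np.sqrt(d_img)  # image projection
W_txt = rng.normal(size=(d_txt, d_joint)) / np.sqrt(d_txt)  # text projection

def clip_style_score(img_feat, txt_feat):
    # Project both modalities into the shared latent space...
    zi = img_feat @ W_img
    zt = txt_feat @ W_txt
    # ...L2-normalize, then take the dot product as the similarity score.
    zi = zi / np.linalg.norm(zi)
    zt = zt / np.linalg.norm(zt)
    return float(zi @ zt)

img = rng.normal(size=d_img)  # stand-in for ViT image features
txt = rng.normal(size=d_txt)  # stand-in for text-transformer features
score = clip_style_score(img, txt)
print(-1.0 <= score <= 1.0)  # True: a cosine similarity is bounded
```

Because both vectors are normalized, the dot product is a cosine similarity, which is what makes scores comparable across different image-text pairs.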

The T5 model is an encoder-decoder that casts natural language processing problems into a text-to-text format. It is trained with teacher forcing, which means an input sequence and a target sequence are always required during training.
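Teacher forcing itself is simple to illustrate: at every decoder step the model is fed the ground-truth previous token rather than its own prediction, so the target sequence is shifted right to form the decoder input. The tokens below are illustrative strings, not real T5 vocabulary entries (though T5 does start decoding from its pad token).

```python
BOS = "<pad>"  # T5 begins decoding from the pad token

def teacher_forcing_pairs(target_tokens):
    # Decoder input is the target shifted right by one position;
    # the label at each step is the token the model must predict next.
    decoder_input = [BOS] + target_tokens[:-1]
    labels = target_tokens
    return list(zip(decoder_input, labels))

# Translation-style example: the decoder learns
# (previous ground-truth token -> next token) mappings.
pairs = teacher_forcing_pairs(["Das", "ist", "gut", "</s>"])
for prev, nxt in pairs:
    print(f"given {prev!r} predict {nxt!r}")
# given '<pad>' predict 'Das'
# given 'Das' predict 'ist'
# given 'ist' predict 'gut'
# given 'gut' predict '</s>'
```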

The resulting encodings then pass through a set of cascading diffusion models that work towards an image, which is produced as the output.

NVIDIA's eDiffi vs. Stable Diffusion

NVIDIA’s eDiffi relies on a combination of cascading diffusion models: a pipeline with a base model that synthesizes images at 64×64 resolution and two super-resolution models that incrementally upsample them to 256×256 and then 1024×1024 resolution. This cascading design, according to NVIDIA, tops other architectures because it adheres better to the context behind the content specifications.
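The cascade can be sketched as three chained stages. The function names are hypothetical, and `np.repeat` is only a stand-in for the learned super-resolution diffusion models; the resolutions are the ones stated above.

```python
import numpy as np

def base_model(prompt_embedding):
    # Stand-in for the 64x64 base synthesis model.
    return np.zeros((64, 64, 3))

def super_resolve(img, factor):
    # Placeholder for a super-resolution diffusion model:
    # nearest-neighbor upsampling by `factor` in each spatial dimension.
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

img64 = base_model(None)
img256 = super_resolve(img64, 4)    # first super-resolution stage
img1024 = super_resolve(img256, 4)  # second super-resolution stage
print(img64.shape, img256.shape, img1024.shape)
# (64, 64, 3) (256, 256, 3) (1024, 1024, 3)
```

Cascading keeps the expensive text-conditioned reasoning at low resolution and leaves the later stages to add pixels, which is why several text-to-image systems adopt this pipeline shape.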

Over the past six months, we have seen a boom of architectures from various organizations, each vying to be the next point of discussion with its version of image generation and to start the conversation about generalization in AI. However, unlike DALL·E 2 and Stable Diffusion, eDiffi is still behind closed doors, and nothing besides the paper is out.

Do you think we will see NVIDIA release it on their servers for the public to try firsthand and check out the edge that cascading diffusion models provide over the existing SOTA architectures?

Written By

Aryan Kargwal

Data Evangelist

Copyright © 2023 NimbleBox, Inc.