Why regular image generation sucks
So in this paper, the authors observe that the regular image generation process relies heavily on the accuracy of prompts (specifically, the details of the image in terms of entity, scene, style, and composition) to generate high-quality images. This often fails because the training data does not represent a certain style, composition, or entity, or some combination of them.
So what is the solution?
The authors propose to use a multi-modal approach to generate images. They essentially propose an E2E approach for image generation where an LLM is first asked to understand what information it needs to augment the user prompt into the ideal prompt for the image generation task.
So once the LLM understands what it needs to know to define the constraints of the image, it can then use that information to generate the image.
For this, they propose allowing the LLM to identify gaps in its understanding and then research textual and visual information from various sources to find the specific information needed to generate the image.
Once the LLM researches sources and finds the key constraints and information, it can then restructure the prompt to include information from the researched sources.
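The gap-identify → research → restructure loop could be sketched roughly like this. This is my own illustrative pseudocode, not the paper's pipeline: `llm`, `search_text`, and `search_images` are hypothetical stand-ins passed in by the caller.

```python
# Hypothetical sketch of the research-augmented prompting loop described above.
# llm(), search_text(), and search_images() are stand-ins, not the paper's API.

def build_ideal_prompt(user_prompt, llm, search_text, search_images):
    # Step 1: ask the LLM which details (entity, scene, style, composition)
    # the user prompt leaves underspecified.
    gaps = llm(f"List the visual details missing from this prompt: {user_prompt}")

    # Step 2: research each gap from textual and visual sources.
    observations = []
    for gap in gaps.splitlines():
        observations.append(search_text(gap))
        observations.append(search_images(gap))

    # Step 3: restructure the prompt to fold in the researched constraints.
    context = "\n".join(observations)
    return llm(
        f"Rewrite this image prompt using the research notes.\n"
        f"Prompt: {user_prompt}\nNotes: {context}"
    )
```

The point of the sketch is that the researched observations are fed back into a final rewriting step, so the downstream image model only ever sees the enriched prompt.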
So it's just using tools to augment the prompt?
Not exactly. The authors went ahead and actually trained the text generation model on a curated dataset where, given a prompt, the model is expected to generate the entire thinking flow for the prompt, including tool calls, observations, and recaptioning.
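To make the "entire thinking flow" idea concrete, one training example in such a dataset might look something like the structure below. The field names and content here are my guesses for illustration, not the paper's actual schema:

```python
# Hypothetical shape of one curated training example: the supervision target is
# the full thinking trace (thought -> tool call -> observation -> recaption),
# not just a final caption. All field names and values are illustrative.
training_example = {
    "user_prompt": "a traditional Hanbok in a palace courtyard",
    "target_trace": [
        {"type": "thought",
         "text": "I need references for Hanbok structure and palace architecture."},
        {"type": "tool_call",
         "tool": "image_search", "query": "traditional Hanbok"},
        {"type": "observation",
         "text": "Hanboks pair a short jacket with a full-length skirt."},
        {"type": "recaption",
         "text": "A woman in a pastel Hanbok with a short jacket and full skirt, "
                 "standing in a historic palace courtyard."},
    ],
}
```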
Further, to make it truly unified, they also train the image generation model based on the difference between a noisy version of the generated image and the original image.
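That image-side objective sounds like a standard denoising loss. Here is a minimal numpy sketch under the assumption of a simple noise-prediction setup with a toy one-step schedule; the paper's exact formulation may differ:

```python
import numpy as np

def denoising_loss(model, image, t, rng):
    # Assumption: a simplified schedule where t in (0, 1) controls noise level.
    alpha = 1.0 - t  # higher t => noisier image
    noise = rng.standard_normal(image.shape)
    noisy = np.sqrt(alpha) * image + np.sqrt(1.0 - alpha) * noise

    # The model is trained to predict the injected noise from the noisy image;
    # the loss is the mean squared difference between prediction and noise.
    predicted = model(noisy, t)
    return np.mean((predicted - noise) ** 2)
```

The key idea is the same as in the quoted line: the model learns from the gap between a noised image and the clean target, which is what ties the image generator into the end-to-end training.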
What I’m really curious about
So one thing that stood out to me: even though the authors went to the trouble of training the text generation model on a curated dataset, I wonder what marginal improvement that made compared to a regular LLM agent with tool-calling capability.
Specifically, I wonder whether an LLM agent equipped with the ability to research and call tools would have been able to generate images of similar quality to the fine-tuned image-gen pipeline.