So what is the paper about?
The paper deals with the problem of DocumentQnA and discusses the challenges of mono-modal solutions for it. To address this problem, the authors propose a novel approach called MDocAgent, a multi-modal solution that combines the strengths of image and text representations to answer questions about documents.
What you need to know
- The authors note that answering questions over documents has traditionally been handled by single-modal systems that force users to choose between answering questions from text or from images. This approach often fails to connect the dots between the two modalities and cannot answer questions that require both.
- To address this problem, the authors propose a multi-modal agentic solution that they call MDocAgent.
How does MDocAgent work?
Context retrieval
- The first step in the process is to retrieve context relevant to the question. The framework uses ColBERT and ColPali to retrieve the top-k text and image contexts that are relevant to the question.
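The retrieval step can be sketched as a top-k nearest-neighbor search. This is only a toy illustration: random vectors stand in for real ColBERT (text) and ColPali (page-image) embeddings, and a single-vector cosine score stands in for their actual late-interaction scoring.

```python
import numpy as np

def top_k(query_emb, doc_embs, k=2):
    # Cosine similarity between the query and each candidate context.
    sims = doc_embs @ query_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb)
    )
    # Indices of the k highest-scoring contexts, best first.
    return np.argsort(sims)[::-1][:k].tolist()

rng = np.random.default_rng(0)
query = rng.normal(size=128)
text_chunks = rng.normal(size=(10, 128))   # stand-in for ColBERT embeddings
page_images = rng.normal(size=(10, 128))   # stand-in for ColPali embeddings

top_text = top_k(query, text_chunks)   # top-k text contexts (T)
top_images = top_k(query, page_images) # top-k image contexts (I)
```

The selected text chunks become the text context T and the selected page images become the image context I that the agents consume downstream.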
General Agent pass
- After context retrieval, a general agent starts the process of answering the question. The general agent receives the question, the text context, and the image context to kick off the process.
- The core task of the general agent is to generate a preliminary answer based on the question and the context.
- Mathematically, you can think of A_g = f(Q, T, I) as the general agent’s answer.
Critical Agent pass
- This agent is responsible for identifying the critical information in the question and the context.
- This agent receives the preliminary answer from the general agent along with the context, and identifies the critical information in both the text and image contexts.
- Mathematically, you can think of A_c_T = f(A_g, T, I) and A_c_I = f(A_g, T, I) as the critical agent’s output, where A_c_T and A_c_I are the critical information from the text and image context respectively.
Text based Agent pass
- This agent is responsible for generating the answer based on the critical information from the text context.
- The Text agent is asked to specifically dive deeper into the critical information from the text context and generate a more detailed answer.
- Mathematically, you can think of A_t = f(A_c_T, T, Q) as the text based agent’s answer.
Image based Agent pass
- This agent is responsible for generating the answer based on the critical information from the image context.
- The Image agent is asked to specifically dive deeper into the critical information from the image context and generate a more detailed answer.
- Mathematically, you can think of A_i = f(A_c_I, I, Q) as the image based agent’s answer.
Summarization Agent pass
- This agent is responsible for summarizing the answers from the text and image based agents.
- The Summarization agent is asked to combine the answers from the general agent, text and image based agents and generate a final answer.
- Mathematically, you can think of A_s = f(Q ,A_g, A_t, A_i) as the summarization agent’s answer.
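The five passes above can be traced end to end with a minimal dataflow sketch. The `llm` stub below is a placeholder for a real multi-modal LLM call (it just records which inputs each agent receives); the function names and structure are an illustration of the pipeline described above, not the authors' implementation.

```python
def llm(role, **inputs):
    # Placeholder for a real LLM call: returns a string recording
    # which agent ran and which inputs it was given.
    return f"{role}({', '.join(sorted(inputs))})"

def mdocagent(Q, T, I):
    # General agent: preliminary answer from question + both contexts.
    A_g = llm("general", Q=Q, T=T, I=I)
    # Critical agent: extract critical text/image info given A_g.
    A_c_T = llm("critical_text", A_g=A_g, T=T, I=I)
    A_c_I = llm("critical_image", A_g=A_g, T=T, I=I)
    # Specialized agents: dive deeper into each modality.
    A_t = llm("text", A_c_T=A_c_T, T=T, Q=Q)
    A_i = llm("image", A_c_I=A_c_I, I=I, Q=Q)
    # Summarization agent: combine everything into the final answer.
    A_s = llm("summarize", Q=Q, A_g=A_g, A_t=A_t, A_i=A_i)
    return A_s
```

Note how each agent's output feeds the next: the summarization agent never sees the raw contexts, only the question and the answers produced upstream.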
What are the key takeaways from the paper?
The key takeaway from the paper is that traditional DocumentQnA systems are single-modal, forcing users to choose between text-based and image-based answers.
To address this, one can use a general agent to form a preliminary answer over both modalities, a critical agent to narrow down the critical information in each context, specialized text and image agents to dive deeper into that information, and finally a summarization agent to combine their answers into a single final response.