How to add PDF understanding to your AI Agent

December 15, 2024

Anthropic recently announced PDF support for their API.

Based on my research and Anthropic's docs, the approach they used matches what I've done personally to add PDF understanding to my Agent workflows. It works well, like kind of insanely well... especially when combined with tool calling (future post coming on that).

Let's walk through how it's built, and how you can implement it in your own agents.

The Challenge

LLMs can't directly process PDFs, so we need to convert them into a format that can be sent to an LLM API in a chat completion. The simplest approach, where a lot of people start, is to use an OCR library (e.g. Tesseract) to extract the readable text from the PDF and send that to the LLM.

The issue with "traditional" OCR is that it loses a lot of context from the PDF. For example:

  • Tables are converted to text, losing the structure of the table... especially when it comes to headers and labels.
  • Images are mostly ignored, which can be a problem if the PDF contains important visual information.
  • The relationship between text elements is lost... for example, if there's a checkbox with a label, the OCR may grab the label but will likely miss the checkbox.

Complicating things further, when you OCR a PDF and then send it to an LLM with a prompt, you don't actually know if issues in understanding the content are due to bad OCR or the LLM (or both).

The Solution

What a number of LLM researchers have figured out over the last year is that vision models are actually really good at understanding images of documents. And it makes sense that some significant portion of multi-modal LLM training data is images of pages of documents... the internet is full of them.

So in addition to extracting the text, if we can also convert the document's pages to images then we can send BOTH to the LLM and get a much better understanding of the document's content.

In the case of a 2-page PDF document, the prompt ends up looking like this:

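from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

readable_pdf_text = "..."  # text extracted from the PDF
base64_page_1 = "..."      # base64-encoded PNG of page 1
base64_page_2 = "..."      # base64-encoded PNG of page 2

response = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": f"Summarize the following document: {readable_pdf_text}",
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/png;base64,{base64_page_1}"
          },
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/png;base64,{base64_page_2}"
          },
        },
      ],
    }
  ],
)

# Print just the model's reply text
print(response.choices[0].message.content)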

NOTE: Ordering the message types properly does matter! And it differs depending on which LLM provider you're using.

OpenAI prefers text before images, while Anthropic prefers images before text. Anthropic provides specific context for this in their docs:

Just as with document-query placement, Claude works best when images come before text. 
Images placed after text or interpolated with text will still perform well, but if your 
use case allows it, we recommend an image-then-text structure.
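
For Anthropic, the same 2-page request can be arranged image-then-text. The sketch below only builds the content list for a user message. The helper name build_anthropic_content and its parameters are my own for illustration (they are not part of Anthropic's SDK), and the block shapes follow Anthropic's base64 image format as I understand it from their docs.

```python
def build_anthropic_content(page_images_b64, pdf_text, instruction):
    """Build a Claude user-message content list with page images before text."""
    content = []
    for image_b64 in page_images_b64:
        # One image block per rendered PDF page, in page order
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": image_b64,
            },
        })
    # The instruction and extracted text go last (image-then-text)
    content.append({"type": "text", "text": f"{instruction}\n\n{pdf_text}"})
    return content


content = build_anthropic_content(
    ["<page 1 base64>", "<page 2 base64>"],  # placeholder image data
    "...",                                   # extracted PDF text
    "Summarize the following document:",
)
messages = [{"role": "user", "content": content}]
```

From there, messages can be passed to client.messages.create(...) from the anthropic package along with a model name and max_tokens.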

The Implementation

There are two basic parts to the implementation:

  1. Text extraction - converting readable text from the PDF to a string.
  2. Image conversion - converting each page of the PDF to an image.

From a code and infrastructure standpoint, you can implement the functionality in javascript or python. However, I could not find a good way to do the image conversion in a serverless javascript environment like Vercel. Javascript libraries with dependencies on <canvas> seemed to work locally, but would either exceed container image size limits or just fail to run in Vercel serverless functions.

So I ended up using a Python Flask endpoint on Vercel to handle the image conversion.

Text Extraction

In python, I use PyMuPDF (https://pymupdf.readthedocs.io/en/latest/index.html) for text extraction. It's built on top of MuPDF (and can optionally call Tesseract for OCR on scanned pages), and provides a clean API for extracting text content.

In javascript, I use pdf-parse to extract the text content... it's built on top of PDF.js (https://github.com/mozilla/pdf.js) from Mozilla.

Implementation is super simple... here's a function that extracts the text from a base64 encoded PDF:

const pdfParse = require('pdf-parse');

async function processPDFContent(pdfDataUrl) {
  try {
    // Extract base64 data from a data URL ("data:application/pdf;base64,...")
    const base64Data = pdfDataUrl.split(',')[1];

    // Convert base64 to buffer
    const pdfBuffer = Buffer.from(base64Data, 'base64');

    // Parse PDF
    const data = await pdfParse(pdfBuffer);

    // Return the extracted text
    return data.text;
  } catch (error) {
    console.error('Error processing PDF:', error);
    return '[Error extracting PDF content]';
  }
}
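
The python side looks similar with PyMuPDF. This is a minimal sketch, not the exact code I run: the function names are made up for illustration, and the "[Page N]" markers are my own addition so the model can line up each block of text with its page image.

```python
def pages_to_text(pages):
    """Join per-page text, labelling each page so it can be matched to its image."""
    sections = []
    for number, page in enumerate(pages, start=1):
        # get_text() returns the page's embedded text layer as a plain string
        text = page.get_text().strip()
        sections.append(f"[Page {number}]\n{text}")
    return "\n\n".join(sections)


def extract_pdf_text(pdf_bytes):
    """Extract the embedded text from raw PDF bytes using PyMuPDF."""
    import pymupdf  # imported lazily so pages_to_text has no dependencies

    # Open from memory; the context manager closes the document afterwards
    with pymupdf.open(stream=pdf_bytes, filetype="pdf") as doc:
        return pages_to_text(doc)
```

Note that this reads the text already embedded in the PDF. A scanned PDF with no text layer will come back mostly empty, which is one more reason to send the page images too.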

Image Conversion

Images are where it gets a bit more complicated.

In python, you can still use PyMuPDF to convert the pages to images. The only decision to make is the resolution of the image.

Anthropic and OpenAI both do automatic resizing down to the maximum size accepted by their APIs, so I just try to match the aspect ratio of a standard document.
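
If you'd rather pick the resolution deliberately, one option is to work backwards from a target pixel size. A rough sketch, assuming a cap of 1568 px on the long edge (the figure I recall from Anthropic's vision docs as the point where downscaling kicks in, so check the current docs for your provider). zoom_for_long_edge is a hypothetical helper. PDF page sizes are measured in points (1/72 inch), so a zoom of 1.0 renders at 72 DPI.

```python
def zoom_for_long_edge(width_pt, height_pt, max_long_edge_px=1568):
    """Return a uniform zoom so the page's long edge renders at max_long_edge_px."""
    long_edge_pt = max(width_pt, height_pt)
    if long_edge_pt <= 0:
        raise ValueError("page dimensions must be positive")
    return max_long_edge_px / long_edge_pt


# US Letter is 612 x 792 points
zoom = zoom_for_long_edge(612, 792)
width_px, height_px = round(612 * zoom), round(792 * zoom)
print(f"zoom={zoom:.2f}, image={width_px}x{height_px}px")
```

With PyMuPDF you would feed it page.rect.width and page.rect.height and pass the result as pymupdf.Matrix(zoom, zoom). Using the same factor on both axes keeps the page's own aspect ratio.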

I did try a few different javascript libraries, but ultimately settled on using a python Flask endpoint in Vercel. Vercel publishes a reference example project for its python support on GitHub.

The specific dependencies that I got working in requirements.txt are:

Flask==3.0.3
gunicorn==22.0.0
PyMuPDF==1.24.7
Werkzeug==3.0.3

NOTE: Vercel's deployment of python functions is really, really dumb. All code from your entire project is deployed alongside the actual python function, which immediately caused Error: The Serverless Function... exceeds the maximum size limit problems.

The fix is to manually exclude file paths from the deployment of your python functions in your vercel.json file, like this:

"functions": {
  "api/**/*.py": {
    "memory": 1024,
    "excludeFiles": "{public/**,**/node_modules/**,src/**,python/**,docs/**}"
  }
}

Ok. Now that you can get a python Flask endpoint running on Vercel, let's look at the code for the image conversion.

It's pretty straightforward:

from flask import Flask, request, jsonify
import pymupdf
import base64

app = Flask(__name__)

@app.route("/api/transformers/pdfToPNG", methods=['POST'])
def pdf_to_png():
    try:
        # Get the page limit parameter (optional)
        page_limit = request.args.get('limit', type=int)
        
        # Get PDF data from request
        pdf_bytes = base64.b64decode(request.json['base64'])
        
        # Read the PDF file
        pdf_file = pymupdf.open(stream=pdf_bytes, filetype="pdf")
        
        # Determine how many pages to process
        total_pages = pdf_file.page_count
        pages_to_process = min(total_pages, page_limit) if page_limit else total_pages
        
        # Convert pages to PNG
        images = []
        for page_num in range(pages_to_process):
            page = pdf_file[page_num]

            # Render the page at 1.7x horizontal and 2.0x vertical zoom.
            # Note: unequal factors stretch the page slightly; use the same value
            # for both to keep the PDF's own aspect ratio.
            pix = page.get_pixmap(matrix=pymupdf.Matrix(1.7, 2.0))
            
            # Get PNG bytes and convert to base64
            img_bytes = pix.tobytes("png")
            base64_image = base64.b64encode(img_bytes).decode()
            
            images.append({
                'page': page_num + 1,
                'data': base64_image
            })
        
        pdf_file.close()
        return jsonify({
            'total_pages': total_pages,
            'pages_converted': pages_to_process,
            'images': images
        })
        
    except Exception as e:
        return jsonify({'error': str(e)}), 500
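
On the calling side, the JSON that comes back from this endpoint needs to be turned into message parts. Here's a small sketch for the OpenAI-style format shown earlier (text first, then one image per page). build_openai_content is a name I made up for illustration; it only assumes the response shape returned by the endpoint above.

```python
def build_openai_content(endpoint_response, pdf_text, instruction):
    """Turn the pdfToPNG JSON response into chat-completions content parts."""
    # Sort defensively so images are always sent in page order
    pages = sorted(endpoint_response["images"], key=lambda item: item["page"])

    # Text first, then the page images
    content = [{"type": "text", "text": f"{instruction} {pdf_text}"}]
    for item in pages:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{item['data']}"},
        })
    return content
```

The returned list drops straight into the "content" field of a user message.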

Next Steps

So far, we've only focused on one-time processing of a PDF. This works well if your agent only needs to process the PDF once, or if your user is uploading the file in a specific chat conversation.

However, we should also consider the scenario where PDFs are needed for RAG (Retrieval Augmented Generation). For example, if your agent needs to understand the full contents of a PDF across many different conversations, you'll want to pre-process the PDF once and then retrieve it when relevant.

The core tenets of text and image extraction are the same, but for RAG, we need to store them in chunks so the agent can retrieve the most relevant parts of the file as needed.

There are a ton of options for how to approach this, but I'll focus on the simplest approach that's worked for me across different use cases:

  1. Extract the text and images from the PDF.
  2. Use an LLM to generate an inventory of contents and descriptions for each page.
  3. Chunk both the text and the generated descriptions.
  4. Generate embeddings for the text and descriptions.
  5. Store the image file path in the same row as the chunk and embeddings.
  6. When running semantic search, retrieve both the text chunk and the image file and pass them both to the LLM in a message.
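
Steps 5 and 6 can be sketched with an in-memory list standing in for the database. This is a toy illustration under loud assumptions: the three-number embeddings are hand-written stand-ins for real embedding vectors, the file paths are made up, and ChunkRow and search are hypothetical names. In practice you'd use your embedding model and a vector-capable store.

```python
import math
from dataclasses import dataclass


@dataclass
class ChunkRow:
    page: int
    text: str          # extracted text or LLM-generated page description
    embedding: list    # vector from your embedding model of choice
    image_path: str    # where the rendered page image is stored


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(rows, query_embedding, top_k=3):
    """Return the top_k rows ranked by similarity to the query embedding."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(row.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:top_k]


rows = [
    ChunkRow(1, "Invoice header and billing address", [0.9, 0.1, 0.0], "pages/invoice-1.png"),
    ChunkRow(2, "Line items table with totals", [0.1, 0.9, 0.2], "pages/invoice-2.png"),
]

# Toy query vector that points toward the "line items" chunk
best = search(rows, [0.0, 1.0, 0.1], top_k=1)[0]
print(best.text, best.image_path)
```

Because the image path lives in the same row as the chunk, one lookup gives you both the text and the page image to put in the message.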

This implies page-size chunks, but you may also find the need to introduce semantic chunking, where you identify logical breakpoints in the document that span one or more pages.

With semantic chunking you may end up needing multiple image files per chunk, but that's okay - I've seen good performance even if the images include contents of other chunks. A rough rule of thumb is not to include more than 5 images (so 5 different pages) in a single chunk, because the LLM starts to lose the ability to accurately gather info from every page. This is just my experience, and your mileage may vary.
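
If a semantic section runs longer than that, one simple option is to split its pages into consecutive groups. A minimal sketch, with split_page_span as a hypothetical helper and the limit of 5 taken from my rule of thumb above rather than any provider limit:

```python
def split_page_span(page_numbers, max_images=5):
    """Split a section's pages into consecutive groups of at most max_images."""
    if max_images < 1:
        raise ValueError("max_images must be at least 1")
    return [
        page_numbers[start:start + max_images]
        for start in range(0, len(page_numbers), max_images)
    ]


# A section covering pages 3 through 14 (12 pages)
groups = split_page_span(list(range(3, 15)))
print(groups)
```

Each group then becomes its own chunk row, carrying the image paths for just those pages.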

Conclusion

Adding PDF understanding to your AI agent doesn't have to be intimidating. The combination of text extraction and image conversion gives you a robust foundation that works surprisingly well with multi-modal LLMs.

The key takeaways:

  • Don't rely on OCR alone - you'll miss important context and structure
  • Use both text extraction and image conversion to get the full picture
  • Consider your deployment environment early (serverless has limitations)
  • For RAG applications, think about how you'll store and retrieve both text and images

I've found this approach to be incredibly reliable across different types of PDFs - from simple text documents to complex forms and technical diagrams. The extra effort to implement image conversion alongside text extraction pays off in much more accurate and context-aware responses from your agent.

Feel free to use the code examples above as a starting point for your own implementation. And if you run into any issues or have questions, reach out to me on Twitter/X.