PDF to Markdown Conversion Tools: Beyond the Hype - A Deep Dive into MarkItDown, Docling, and Mistral Document AI

Everywhere you look on social media, there’s a new AI tool claiming to be the absolute best at handling PDFs. They promise flawless conversions to Markdown, perfect text extraction, and magical OCR. But how much of that is just hype? We decided to cut through the noise and put three popular tools to the test: Markitdown, Docling, and Mistral Document AI. For a standardized evaluation, we ran every tool on the same sample PDF file, which contains both text and tables. This let us compare directly how each tool handles the challenges of a complex, real-world document layout.

1. Markitdown by Microsoft

Markitdown is Microsoft’s open-source solution designed specifically for converting various file formats into LLM-friendly Markdown. The tool supports an impressive range of formats including PDF, PowerPoint, Word, Excel, images, audio files, HTML, and even YouTube URLs.

Our Experience with Markitdown

Markitdown pulled out all the text from the PDF, but it failed badly at keeping the document’s structure. Here’s a simple breakdown of the problems we found.

Output File Link: Markitdown Output for Sample PDF File

Key Issues We Found:

Breaks Simple Text: The tool couldn’t keep simple lines of text together. For example, a single line from the PDF like Customer Id : 43416064 was incorrectly split into three separate lines. This made the top of the document messy and hard to read.
Messed Up Tables: The biggest problem was how it handled the main transaction table. Instead of keeping the rows and columns, the tool ripped the data out one column at a time. It listed all the dates first, then all the descriptions, and so on. This completely broke the table, making it impossible to match a transaction with its correct date and amount.
Jumbled Result: The final output was just a long, jumbled list of text. All the useful structure from the original PDF was gone, and even the page number (Page: 1/2) was broken. The file was so messy that it was unusable unless you were willing to fix everything by hand.

Basically, if the layout and tables in your PDF are important, this tool acts more like a simple text scraper than a converter.
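
A quick way to tell whether a converter kept a table intact is to count the Markdown pipe-table rows in its output. The sketch below is a rough, standard-library-only heuristic, not a full Markdown parser. The function name count_table_rows and the sample strings are made up for illustration and are not taken from our sample PDF or from any of the tools.

```python
import re


def _is_row(line: str) -> bool:
    """True if the line looks like a pipe-table row: | cell | cell |"""
    s = line.strip()
    return len(s) >= 2 and s.startswith("|") and s.endswith("|")


def _is_separator(line: str) -> bool:
    """True if the line looks like a header separator: |---|:---:|"""
    s = line.strip()
    if not _is_row(s):
        return False
    cells = s[1:-1].split("|")
    return all(re.fullmatch(r"\s*:?-+:?\s*", c) for c in cells)


def count_table_rows(markdown: str) -> int:
    """Count data rows in Markdown pipe tables (header and separator excluded)."""
    lines = markdown.splitlines()
    rows = 0
    in_table = False
    for i, line in enumerate(lines):
        # A table starts at a separator row that directly follows a header row.
        if _is_separator(line) and i > 0 and _is_row(lines[i - 1]):
            in_table = True
            continue
        if in_table and _is_row(line):
            rows += 1
        else:
            in_table = False
    return rows


# Invented sample data: a preserved table versus column-by-column output.
preserved = "| Date | Amount |\n|------|--------|\n| 01/02 | 10.00 |\n| 03/02 | 25.50 |\n"
flattened = "Date\n01/02\n03/02\nAmount\n10.00\n25.50\n"

print(count_table_rows(preserved))
print(count_table_rows(flattened))
```

If the source PDF has a table and this count comes back as zero, the converter most likely flattened it, which is the behavior we saw with Markitdown.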

Code Sample :

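import markitdown

# Placeholder: replace YOUR_FILE_PATH with the path to your PDF
src_file_path: str = YOUR_FILE_PATH

# Create the converter and convert the file to Markdown
md = markitdown.MarkItDown()
result = md.convert(src_file_path)

# Save the converted Markdown text to a file
with open("markitdown-poc-output.md", "w", encoding="utf-8") as f:
    f.write(result.markdown)

print(result)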

2. Docling by IBM

Docling is an open-source document processing toolkit that can automatically analyze PDF layouts, identify reading order, and recognize table structures. It supports a wide range of input formats, including DOCX, PPTX, XLSX, HTML, images, and audio, and integrates OCR engines like Tesseract, EasyOCR, and RapidOCR to extract text from scanned or image-based documents. With built-in connectors for AI frameworks such as LangChain and LlamaIndex, Docling makes it easy to convert complex documents into structured Markdown or JSON for downstream applications like search, summarization, and question-answering.

Our Experience with Docling

Docling was a huge improvement over Markitdown. It understood the document’s layout much better and produced a far more useful result, though it had its own quirks and considerations.

Output File Link: Docling Output for Sample PDF File

What Went Well:

Tables Were Perfect: The biggest success was the main transaction table. Docling identified it flawlessly and converted it into a clean, perfect Markdown table. All rows and columns were exactly where they should be, making the data easy to read and use.
Good Structure: Unlike Markitdown, Docling kept most of the document’s structure. The address block was correct, and the summary details at the top were neatly separated, even if they weren’t in a table. The page number was also preserved correctly.
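
Once the table survives as a Markdown pipe table, pulling the rows back into a usable structure takes only a few lines. The following is a minimal sketch using only the standard library. The function name parse_pipe_table and the sample rows are hypothetical, and it assumes a simple table with no escaped pipe characters inside cells.

```python
def parse_pipe_table(markdown: str) -> list[dict[str, str]]:
    """Parse the first Markdown pipe table found into a list of row dicts."""

    def cells(line: str) -> list[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [c.strip() for c in line[1:-1].split("|")]

    table_lines: list[str] = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if len(stripped) >= 2 and stripped.startswith("|") and stripped.endswith("|"):
            table_lines.append(stripped)
        elif table_lines:
            break  # the first table has ended

    if len(table_lines) < 2:
        return []

    header = cells(table_lines[0])
    # table_lines[1] is the |---|---| separator row, so data starts at index 2.
    return [dict(zip(header, cells(row))) for row in table_lines[2:]]


# Invented sample rows, not taken from the sample PDF.
sample = (
    "Statement\n"
    "\n"
    "| Date | Description | Amount |\n"
    "|------|-------------|--------|\n"
    "| 01/02 | Coffee | 10.00 |\n"
    "| 03/02 | Books | 25.50 |\n"
)

for row in parse_pipe_table(sample):
    print(row["Date"], row["Description"], row["Amount"])
```

Each row keeps its date, description, and amount together, which is exactly what the column-by-column output from Markitdown lost.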

Minor Issues and Considerations:

It Takes Time and Space: This higher accuracy comes at a cost. Docling isn’t instant because it has to download and run powerful AI models from Hugging Face. This means it takes a little more time and computer storage to process the file.
Lots of Options: Docling is highly configurable, which can be good for advanced users. It has many settings you can change, like choosing different OCR models to try and get better results.

Overall, Docling is a much more powerful tool, especially if your PDF contains tables. The results are far more structured and useful, but you should be prepared for it to take a little longer to run.
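
To put a number on "a little longer", you can wrap any converter in a small timing helper. This is only a sketch: timed_conversion is a name we made up, and the stand-in converter below exists just to keep the example runnable without installing anything.

```python
import time
from typing import Callable, Tuple


def timed_conversion(convert: Callable[[str], str], path: str) -> Tuple[str, float]:
    """Run convert(path) and return (markdown, elapsed_seconds)."""
    start = time.perf_counter()
    markdown = convert(path)
    elapsed = time.perf_counter() - start
    return markdown, elapsed


def fake_convert(path: str) -> str:
    """Stand-in converter used only for this illustration."""
    return f"# Converted {path}"


markdown, seconds = timed_conversion(fake_convert, "sample.pdf")
print(f"{seconds:.4f} s for {len(markdown)} characters")
```

With Docling installed, the callable could be something like lambda p: DocumentConverter().convert(p).document.export_to_markdown(). Keep in mind that the first run also includes the model download, so it is worth timing a second run separately.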

Code Sample :

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableStructureOptions, TableFormerMode, EasyOcrOptions
from docling.datamodel.accelerator_options import AcceleratorOptions, AcceleratorDevice
from docling_core.types.doc.base import ImageRefMode

# Placeholder: replace YOUR_FILE_PATH with the path to your PDF
src_file_path: str = YOUR_FILE_PATH

# Pipeline settings: accurate table recognition, full-page OCR, CPU only
pdfpipelineoptions = PdfPipelineOptions(
    do_picture_classification=True,
    do_formula_enrichment=True,
    do_table_structure=True,
    generate_picture_images=True,
    table_structure_options=TableStructureOptions(
        mode=TableFormerMode.ACCURATE, do_cell_matching=True),
    # use_gpu is set to False here to match the CPU device selected below
    ocr_options=EasyOcrOptions(
        force_full_page_ocr=True, lang=["en"], use_gpu=False),
    accelerator_options=AcceleratorOptions(
        cuda_use_flash_attention2=False, device=AcceleratorDevice.CPU)
)

# Restrict the converter to PDFs and apply the options above
converter = DocumentConverter(allowed_formats=[InputFormat.PDF], format_options={
    InputFormat.PDF: PdfFormatOption(pipeline_options=pdfpipelineoptions)
})

# Convert the file and save it as Markdown with images stored as referenced files
result = converter.convert(src_file_path)
result.document.save_as_markdown("docling-output.md", image_mode=ImageRefMode.REFERENCED)

3. Document AI by Mistral

Mistral Document AI is an OCR-based document understanding API powered by the mistral-ocr-latest model, built to automatically convert PDFs and scanned documents into structured outputs like Markdown or JSON. It handles complex layouts, such as tables, equations, images, and multilingual text, and integrates closely with modern AI workflows to enable tasks like summarization and question answering using the processed content. It is a paid service, though users can often obtain an experimental API key to test the AI model.

Our Experience with Mistral Document AI

Output File Link: Mistral Document AI Output for Sample PDF File

What Went Well:

Excellent Table Handling: Similar to Docling, Mistral Document AI demonstrated impressive capabilities in extracting and structuring the main transaction table. It accurately identified rows and columns, producing a clean and perfectly formatted Markdown table, which is crucial for data integrity.
Preserved Document Structure: The tool maintained the overall layout of the document much better than Markitdown. Elements like the address block and summary details were well-preserved, and the page number was also correctly extracted.
High Accuracy OCR: Mistral Document AI uses advanced OCR, which translated to very accurate text extraction, even from the potentially challenging layout of the sample PDF.

Minor Issues and Considerations:

Commercial Offering: As a commercial API, Mistral Document AI comes with a cost. While experimental API keys are often available for testing, ongoing or large-scale usage will incur charges.

Overall, Mistral Document AI provides a highly accurate and robust solution for PDF to Markdown conversion, especially excelling in table recognition and layout preservation.

Code Sample :

import base64
from mistralai import Mistral, OCRResponse

def encode_pdf(pdf_path):
    """Encode the PDF as a base64 string, or return None on failure."""
    try:
        with open(pdf_path, "rb") as pdf_file:
            return base64.b64encode(pdf_file.read()).decode("utf-8")
    except FileNotFoundError:
        print(f"Error: The file {pdf_path} was not found.")
        return None
    except Exception as e:
        print(f"Error: {e}")
        return None

# Placeholder: replace YOUR_FILE_PATH with the path to your PDF
pdf_path = YOUR_FILE_PATH

# Get the base64 string and stop early if encoding failed
base64_pdf = encode_pdf(pdf_path)
if base64_pdf is None:
    raise SystemExit(1)

# Placeholder: replace YOUR_API_KEY with your Mistral API key
api_key = YOUR_API_KEY
client = Mistral(api_key=api_key, timeout_ms=20000)

# Send the PDF inline as a base64 data URL
ocr_response: OCRResponse = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{base64_pdf}"
    },
    include_image_base64=True
)

# Print the Markdown for each page
for page in ocr_response.pages:
    print(page.markdown)
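
The sample above only prints each page. To get a single output file like the ones from the other two tools, the per-page Markdown can be joined and written to disk. This is a small sketch under one assumption: every page object exposes a markdown attribute, as the pages in the OCR response do. The helper name save_pages_as_markdown and the output filename are our own, and the stand-in pages below exist only so the example runs without an API key.

```python
from pathlib import Path
from types import SimpleNamespace
from typing import Iterable


def save_pages_as_markdown(pages: Iterable, out_path: str) -> str:
    """Join per-page Markdown with blank lines, write it to out_path, and return it."""
    combined = "\n\n".join(page.markdown for page in pages)
    Path(out_path).write_text(combined, encoding="utf-8")
    return combined


# Stand-in pages for illustration; real pages would come from the OCR response.
demo_pages = [
    SimpleNamespace(markdown="# Page 1"),
    SimpleNamespace(markdown="| Date | Amount |\n|---|---|\n| 01/02 | 10.00 |"),
]

text = save_pages_as_markdown(demo_pages, "mistral-output.md")
print(text)
```

With a real response, the call would be save_pages_as_markdown(ocr_response.pages, "mistral-output.md").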

Conclusion

In the crowded, hype-heavy market of PDF to Markdown converters, our hands-on test of Markitdown, Docling, and Mistral Document AI showed clear differences. On our sample PDF, Markitdown behaved like a basic text scraper and struggled with document structure and tables. Docling and Mistral Document AI, by contrast, preserved the layout and converted the transaction table accurately, which makes them far more useful for structured documents. Docling is a flexible open-source option that needs more time and storage, while Mistral Document AI is an accurate, paid API. The right choice depends on whether you need basic text extraction or faithful preservation of complex document structure, particularly tables.
