
Chandra OCR: The New Gold Standard in Open-Source Document Parsing

AI, Open Source, Developer Tools


Date: October 2025
Sources: Datalab Blog, Chandra on GitHub, Chandra on Hugging Face

Chandra OCR is a new open-source model from Datalab that has taken the top position on the olmOCR benchmark with an overall score of 83.1%. It outperforms GPT-4o (69.9%), Gemini Flash 2 (63.8%), and the other models compared on that benchmark.

What makes Chandra different from traditional OCR tools is its full-page decoding approach. Instead of processing documents in chunks, it reads the entire page at once, which gives it much better layout awareness and accuracy.

Key Capabilities

| Category | Score (%) | Rank |
|---|---|---|
| Tables | 88.0 | #1 |
| Old Scans Math | 80.3 | #1 |
| Old Scans | 50.4 | #1 |
| Long Tiny Text | 92.3 | #1 |
| Base Documents | 99.9 | Near-Perfect |

Chandra preserves original document structure and outputs to Markdown, HTML, or JSON. It handles handwritten text, extracts images with captions, supports 40+ languages, and excels at complex content like mathematical equations and intricate tables.

Datalab has made it available as open source on GitHub and Hugging Face, with a hosted API (free tier included) and quantized versions for on-premises deployment capable of processing up to 4 pages per second on an H100 GPU.
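To put the throughput figure in perspective, here is a back-of-the-envelope sketch in Python. It treats the quoted 4 pages per second as a sustained rate, which is an assumption: the source gives it as an upper bound, so the results are best-case lower bounds on processing time. The function names and the linear scaling across GPUs are illustrative assumptions, not anything published by Datalab.

```python
import math


def min_processing_seconds(total_pages: int,
                           pages_per_second: float = 4.0,
                           gpus: int = 1) -> float:
    """Best-case wall-clock seconds to process a corpus.

    Assumes the peak rate is sustained and that throughput scales
    linearly with the number of GPUs (both are simplifications).
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if pages_per_second <= 0 or gpus < 1:
        raise ValueError("pages_per_second must be > 0 and gpus >= 1")
    return total_pages / (pages_per_second * gpus)


def format_duration(seconds: float) -> str:
    """Render a duration as hours, minutes, seconds (rounded up to a whole second)."""
    whole = math.ceil(seconds)
    hours, remainder = divmod(whole, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    for pages in (10_000, 1_000_000):
        estimate = format_duration(min_processing_seconds(pages))
        print(f"{pages:>9,} pages: at least {estimate} on one GPU")
```

At the quoted peak rate, 10,000 pages take at least about 42 minutes on a single H100, and a million pages take just under three days. Real throughput depends on page complexity, batching, and the quantization used.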

Why This Matters for Developers

If your team processes documents at scale (contracts, legacy documentation, forms), Chandra is the top-scoring open-source option on the olmOCR benchmark as of October 2025. It is free to download, usable commercially subject to its license terms, and does not compromise on performance. For teams working with Sitecore or other content management systems, the structured output (Markdown, JSON) makes it straightforward to pipe extracted content directly into a CMS or RAG pipeline.
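As a rough illustration of that last step, the sketch below splits Markdown text into heading-delimited chunks that could be indexed for retrieval or mapped to CMS fields. It uses only the Python standard library and does not call Chandra itself. The `Chunk` type, the `chunk_markdown` function, and the assumption that the Markdown uses `#`-style headings are all hypothetical choices made for this example.

```python
import json
import re
from dataclasses import asdict, dataclass

# Matches "#"-style Markdown headings, e.g. "## Terms".
# Caveat: a "#" line inside a fenced code block would also match.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    heading: str  # heading text, or "" for content before the first heading
    level: int    # 1-6 for headings, 0 for the preamble
    text: str     # body text under the heading


def chunk_markdown(markdown: str) -> list[Chunk]:
    """Split Markdown into one chunk per heading-delimited section."""
    # Each entry is (heading, level, body lines); the first holds any preamble.
    sections = [("", 0, [])]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append((match.group(2), len(match.group(1)), []))
        else:
            sections[-1][2].append(line)

    chunks = []
    for heading, level, body in sections:
        text = "\n".join(body).strip()
        if heading or text:  # skip an empty preamble
            chunks.append(Chunk(heading, level, text))
    return chunks


if __name__ == "__main__":
    sample = (
        "Cover note\n\n"
        "# Contract\n\nParties and dates.\n\n"
        "## Terms\n\n| Item | Price |\n|---|---|\n| A | 10 |\n"
    )
    # Serialize to JSON, a convenient shape for a CMS import or a vector-store loader.
    print(json.dumps([asdict(c) for c in chunk_markdown(sample)], indent=2))
```

Splitting on headings keeps tables and paragraphs together with the section they belong to, which tends to matter more for retrieval quality than fixed-size chunking. A production pipeline would also need to handle fenced code blocks and very long sections.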
