
Comparing Mistral OCR vs Tesseract for Identity Document Extraction


OCR (Optical Character Recognition) is a critical tool for digitizing structured data from documents like driver’s licenses, IDs, and forms. Recently, I compared Mistral OCR (a SaaS offering) against Tesseract, testing how well each extracts fields like names, dates, and addresses from real-world samples. Here’s what I found.

1. Setup & Methodology

I tested 10 ID samples in formats like .jpeg and .webp. Each document was processed through:

  • Mistral OCR via API (extract endpoint)

  • Tesseract OCR via API (tesseract endpoint)

The results were compared field by field across 11 key attributes:

name, DL_number, date_of_birth, issue_date, expiration_date,
sex, height, weight, eye_color, donor_status, address

Each field was checked for differences, and a summary CSV was generated for analysis.

Python scripts handled:

  • API calls

  • JSON storage of outputs

  • Automatic comparison of each field

  • Visualization of discrepancies per sample and per field

2. Code Implementation

I used the following Python script to run the 10 samples against two endpoints that I created on my GPU-enabled machine. I exposed the endpoints using Flask.

import os
import json
import requests
import pandas as pd
from pathlib import Path

# CONFIG
BASE_URL = "MY HOSTED ENDPOINT"
AUTH_TOKEN = "APIKEY"  
SAMPLES_DIR = "samples"
RESULTS_DIR = "results"

os.makedirs(RESULTS_DIR, exist_ok=True)

HEADERS = {
    "Authorization": f"Bearer {AUTH_TOKEN}"
}

FIELDS = [
    "name", "DL_number", "date_of_birth", "issue_date", "expiration_date",
    "sex", "height", "weight", "eye_color", "donor_status", "address"
]

def call_api(endpoint, file_path):
    with open(file_path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/{endpoint}",
            headers=HEADERS,
            files={"file": (os.path.basename(file_path), f)},
            timeout=60
        )
    try:
        return resp.json()
    except ValueError:  # response body was not valid JSON
        return {"error": resp.text}

def compare_dicts(mistral_data, tess_data):
    diffs = {}
    for field in FIELDS:
        v1 = mistral_data.get(field, "")
        v2 = tess_data.get(field, "")
        diffs[field] = (v1 != v2)
    return diffs

def main():
    rows = []

    for file_name in os.listdir(SAMPLES_DIR):
        file_path = os.path.join(SAMPLES_DIR, file_name)
        print(f"\n▶ Processing: {file_name}")

        mistral_result = call_api("extract", file_path)
        tess_result = call_api("tesseract", file_path)

        Path(f"{RESULTS_DIR}/{file_name}_mistral.json").write_text(
            json.dumps(mistral_result, indent=2, ensure_ascii=False)
        )
        Path(f"{RESULTS_DIR}/{file_name}_tesseract.json").write_text(
            json.dumps(tess_result, indent=2, ensure_ascii=False)
        )

        diffs = compare_dicts(mistral_result, tess_result)
        diff_count = sum(1 for x in diffs.values() if x)

        rows.append({
            "file": file_name,
            "mistral_output": mistral_result,
            "tesseract_output": tess_result,
            "fields_different": diff_count,
            **{f"diff_{k}": v for k, v in diffs.items()}
        })

    # Export CSV for plotting
    df = pd.DataFrame(rows)
    csv_path = f"{RESULTS_DIR}/comparison_results.csv"
    df.to_csv(csv_path, index=False)

    print(f"\nDone! Comparison results saved to: {csv_path}")
    print("JSON outputs stored in results/ directory.")


if __name__ == "__main__":
    main()
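For reference, the two endpoints the script calls were served with Flask. A minimal sketch of what such a server might look like is below; the handler internals are placeholders (the actual OCR calls and field parsing are not shown in this post), but the route names and the `file` form field match what the comparison script sends.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_mistral_ocr(image_bytes: bytes) -> dict:
    # Placeholder: call Mistral OCR and map the raw output
    # onto the 11 expected fields.
    raise NotImplementedError

def run_tesseract_ocr(image_bytes: bytes) -> dict:
    # Placeholder: run Tesseract and parse its text into fields.
    raise NotImplementedError

@app.route("/extract", methods=["POST"])
def extract():
    # The comparison script uploads the image under the "file" key
    return jsonify(run_mistral_ocr(request.files["file"].read()))

@app.route("/tesseract", methods=["POST"])
def tesseract():
    return jsonify(run_tesseract_ocr(request.files["file"].read()))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

In production you would also want an `Authorization` check matching the Bearer token the client sends.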

3. Results Overview

| Metric | Observation |
| --- | --- |
| Total Samples | 10 |
| Fields Compared | 11 |
| Max Fields Different per Sample | 11 (worst-case Tesseract) |
| Min Fields Different per Sample | 5 |
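These headline numbers can be recomputed from the generated CSV with a few lines of pandas. A small sketch (column names follow the script in section 2):

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Aggregate a comparison_results.csv frame into headline metrics."""
    diff_cols = [c for c in df.columns if c.startswith("diff_")]
    return {
        "total_samples": len(df),
        "fields_compared": len(diff_cols),
        "max_fields_different": int(df["fields_different"].max()),
        "min_fields_different": int(df["fields_different"].min()),
        # Per-field disagreement counts across all samples
        "per_field": df[diff_cols].sum().to_dict(),
    }

# Usage: summarize(pd.read_csv("results/comparison_results.csv"))
```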

Field-wise Differences

The most frequently misread fields by Tesseract were:

  • Name

  • Address

  • Donor Status

  • Eye Color

  • Weight

Fields like sex, issue_date, and expiration_date were more reliably captured by both engines.

4. Visual Insights

Plot 1: Differences per Sample

  • Samples sample6.jpeg, sample7.jpeg, sample8.jpeg, sample9.jpeg, and sample10.jpeg had all fields misread by Tesseract.

  • sample1.jpg and sample4.webp performed better, with fewer differences.

Plot 2: Differences per Field

  • Confirms that name, address, donor status, eye color, and weight are Tesseract’s weak spots.

  • DL number, dates, and sex fields were mostly correct.
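Both plots can be produced directly from the same CSV. A matplotlib sketch (the output filenames here are my own choices, not from the original pipeline):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

def plot_diffs(df: pd.DataFrame, out_dir: str = "results") -> None:
    # Plot 1: total differing fields per sample
    ax = df.set_index("file")["fields_different"].plot(kind="bar")
    ax.set_ylabel("fields different")
    ax.figure.savefig(f"{out_dir}/diffs_per_sample.png", bbox_inches="tight")
    plt.close(ax.figure)

    # Plot 2: disagreement count per field across all samples
    diff_cols = [c for c in df.columns if c.startswith("diff_")]
    per_field = df[diff_cols].sum().sort_values(ascending=False)
    ax = per_field.plot(kind="bar")
    ax.set_ylabel("samples where engines disagree")
    ax.figure.savefig(f"{out_dir}/diffs_per_field.png", bbox_inches="tight")
    plt.close(ax.figure)

# Usage: plot_diffs(pd.read_csv("results/comparison_results.csv"))
```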

5. Detailed Field Analysis

| Field | Mistral Strengths | Tesseract Weaknesses |
| --- | --- | --- |
| DL_number | Usually captured correctly (format errors minor) | Often garbled (sample6.jpeg shows “us1n4567-650100”) |
| Name | Often accurate | Frequently misrecognized or replaced with unrelated text (MONT. ANA, Ke, Bev) |
| Address | Decent formatting, preserves structure (street/city/state) | Often jumbled or incomplete |
| Date of Birth | Correct in some, but sometimes missing | Frequently wrong or placeholder text (DD o0/00/0000) |
| Donor Status | Sometimes correct | Often default or misread (None, [Donor Status Redacted]) |
| Expiration/Issue Dates | Better consistency | Often misreads formats (01-01-2010 vs 20XX-05-17) |
| Sex | Mostly correct | Misinterpretations for some samples |
| Height/Weight | Keeps units consistent (5'-06", 150 lb) | Mixes units or misreads (150 Ib, 5-08) |
| Eye Color | Sometimes also fails, but less often | Frequently misread (BRO, cyes) |

6. Key Takeaways

  1. Structured Data Extraction

    • Mistral OCR clearly outperforms Tesseract when extracting structured fields from identity documents.

    • DL numbers, addresses, and dates are more reliable with Mistral.

  2. Tesseract Limitations

    • Struggles with stylized or messy documents.

    • Often misreads names, donor status, eye color, and units in height/weight.

  3. Hybrid Approach

    • Use Mistral for critical fields.

    • Optionally, Tesseract can handle unstructured or plain text areas.

7. Conclusion

While Tesseract is free and works well for general OCR, it struggles with highly structured documents. For reliable field extraction from identity documents, Mistral OCR (or a similar modern OCR SaaS solution) is worth the investment.

This comparison highlights the importance of field-level validation in OCR workflows: simply getting the text is not enough; the accuracy of individual data points matters most for automation and downstream applications.
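As a concrete illustration of field-level validation, each extracted value can be checked against a simple per-field rule before it reaches downstream systems. The patterns below are illustrative only, not a complete ruleset (real rules depend on the issuing state):

```python
import re
from datetime import datetime

def _valid_date(v: str) -> bool:
    """Accept MM/DD/YYYY; reject placeholders like 'DD o0/00/0000'."""
    try:
        datetime.strptime(v, "%m/%d/%Y")
        return True
    except ValueError:
        return False

# Illustrative per-field checks keyed by field name
VALIDATORS = {
    "sex": lambda v: v in {"M", "F", "X"},
    "weight": lambda v: bool(re.fullmatch(r"\d{2,3}\s*lb", v)),   # e.g. "150 lb"
    "height": lambda v: bool(re.fullmatch(r"\d'-?\d{2}\"", v)),   # e.g. "5'-06\""
    "date_of_birth": _valid_date,
    "issue_date": _valid_date,
    "expiration_date": _valid_date,
}

def validate(record: dict) -> list:
    """Return the fields whose values fail their rule (fields without rules pass)."""
    return [f for f, check in VALIDATORS.items()
            if f in record and not check(record[f])]
```

Running the two engines’ JSON outputs through a validator like this catches exactly the kind of garbage values (misread units, placeholder dates) seen in the field analysis above.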
