
Comparing Mistral OCR vs Tesseract for Identity Document Extraction


OCR (Optical Character Recognition) is a critical tool for digitizing structured data from documents like driver’s licenses, IDs, and forms. Recently, I compared Mistral OCR (a SaaS offering) against Tesseract, testing how well each extracts fields like names, dates, and addresses from real-world samples. Here’s what I found.

1. Setup & Methodology

I tested 10 ID samples in formats like .jpeg and .webp. Each document was processed through:

  • Mistral OCR via API (extract endpoint)

  • Tesseract OCR via API (tesseract endpoint)

The results were compared field by field across 11 key attributes:

name, DL_number, date_of_birth, issue_date, expiration_date,
sex, height, weight, eye_color, donor_status, address

Each field was checked for differences, and a summary CSV was generated for analysis.

Python scripts handled:

  • API calls

  • JSON storage of outputs

  • Automatic comparison of each field

  • Visualization of discrepancies per sample and per field

2. Code Implementation

I used the following Python script to run the 10 samples against two endpoints that I created on my GPU-enabled machine. I exposed the endpoints using Flask.

import os
import json
import requests
import pandas as pd
from pathlib import Path

# CONFIG
BASE_URL = "MY HOSTED ENDPOINT"
AUTH_TOKEN = "APIKEY"  
SAMPLES_DIR = "samples"
RESULTS_DIR = "results"

os.makedirs(RESULTS_DIR, exist_ok=True)

HEADERS = {
    "Authorization": f"Bearer {AUTH_TOKEN}"
}

FIELDS = [
    "name", "DL_number", "date_of_birth", "issue_date", "expiration_date",
    "sex", "height", "weight", "eye_color", "donor_status", "address"
]

def call_api(endpoint, file_path):
    with open(file_path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/{endpoint}",
            headers=HEADERS,
            files={"file": (os.path.basename(file_path), f)},
            timeout=60
        )
    try:
        return resp.json()
    except ValueError:  # response body was not valid JSON
        return {"error": resp.text}

def compare_dicts(mistral_data, tess_data):
    diffs = {}
    for field in FIELDS:
        v1 = mistral_data.get(field, "")
        v2 = tess_data.get(field, "")
        diffs[field] = (v1 != v2)
    return diffs

def main():
    rows = []

    for file_name in os.listdir(SAMPLES_DIR):
        file_path = os.path.join(SAMPLES_DIR, file_name)
        print(f"\n▶ Processing: {file_name}")

        mistral_result = call_api("extract", file_path)
        tess_result = call_api("tesseract", file_path)

        Path(f"{RESULTS_DIR}/{file_name}_mistral.json").write_text(
            json.dumps(mistral_result, indent=2, ensure_ascii=False)
        )
        Path(f"{RESULTS_DIR}/{file_name}_tesseract.json").write_text(
            json.dumps(tess_result, indent=2, ensure_ascii=False)
        )

        diffs = compare_dicts(mistral_result, tess_result)
        diff_count = sum(1 for x in diffs.values() if x)

        rows.append({
            "file": file_name,
            "mistral_output": mistral_result,
            "tesseract_output": tess_result,
            "fields_different": diff_count,
            **{f"diff_{k}": v for k, v in diffs.items()}
        })

    # Export CSV for plotting
    df = pd.DataFrame(rows)
    csv_path = f"{RESULTS_DIR}/comparison_results.csv"
    df.to_csv(csv_path, index=False)

    print(f"\nDone! Comparison results saved to: {csv_path}")
    print("JSON outputs stored in results/ directory.")


if __name__ == "__main__":
    main()
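For reference, the two endpoints the script calls were served with Flask. A minimal sketch of what such a server might look like is below; the handler internals are placeholders (the actual OCR calls and field parsing are not shown in this post), but the route names and the `file` form field match what the comparison script sends.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_mistral_ocr(image_bytes: bytes) -> dict:
    # Placeholder: call Mistral OCR and map the raw output
    # onto the 11 expected fields.
    raise NotImplementedError

def run_tesseract_ocr(image_bytes: bytes) -> dict:
    # Placeholder: run Tesseract and parse its text into fields.
    raise NotImplementedError

@app.route("/extract", methods=["POST"])
def extract():
    # The comparison script uploads the image under the "file" key
    return jsonify(run_mistral_ocr(request.files["file"].read()))

@app.route("/tesseract", methods=["POST"])
def tesseract():
    return jsonify(run_tesseract_ocr(request.files["file"].read()))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

In production you would also want an `Authorization` check matching the Bearer token the client sends.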

3. Results Overview

| Metric | Observation |
| --- | --- |
| Total Samples | 10 |
| Fields Compared | 11 |
| Max Fields Different per Sample | 11 (worst-case Tesseract) |
| Min Fields Different per Sample | 5 |
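These headline numbers can be recomputed from the generated CSV with a few lines of pandas. A small sketch (column names follow the script in section 2):

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Aggregate a comparison_results.csv frame into headline metrics."""
    diff_cols = [c for c in df.columns if c.startswith("diff_")]
    return {
        "total_samples": len(df),
        "fields_compared": len(diff_cols),
        "max_fields_different": int(df["fields_different"].max()),
        "min_fields_different": int(df["fields_different"].min()),
        # Per-field disagreement counts across all samples
        "per_field": df[diff_cols].sum().to_dict(),
    }

# Usage: summarize(pd.read_csv("results/comparison_results.csv"))
```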

Field-wise Differences

The most frequently misread fields by Tesseract were:

  • Name

  • Address

  • Donor Status

  • Eye Color

  • Weight

Fields like sex, issue_date, and expiration_date were more reliably captured by both engines.

4. Visual Insights

Plot 1: Differences per Sample

  • Samples sample6.jpeg, sample7.jpeg, sample8.jpeg, sample9.jpeg, and sample10.jpeg had all fields misread by Tesseract.

  • sample1.jpg and sample4.webp performed better, with fewer differences.

Plot 2: Differences per Field

  • Confirms that name, address, donor status, eye color, and weight are Tesseract’s weak spots.

  • DL number, dates, and sex fields were mostly correct.
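Both plots can be produced directly from the same CSV. A matplotlib sketch (the output filenames here are my own choices, not from the original pipeline):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

def plot_diffs(df: pd.DataFrame, out_dir: str = "results") -> None:
    # Plot 1: total differing fields per sample
    ax = df.set_index("file")["fields_different"].plot(kind="bar")
    ax.set_ylabel("fields different")
    ax.figure.savefig(f"{out_dir}/diffs_per_sample.png", bbox_inches="tight")
    plt.close(ax.figure)

    # Plot 2: disagreement count per field across all samples
    diff_cols = [c for c in df.columns if c.startswith("diff_")]
    per_field = df[diff_cols].sum().sort_values(ascending=False)
    ax = per_field.plot(kind="bar")
    ax.set_ylabel("samples where engines disagree")
    ax.figure.savefig(f"{out_dir}/diffs_per_field.png", bbox_inches="tight")
    plt.close(ax.figure)

# Usage: plot_diffs(pd.read_csv("results/comparison_results.csv"))
```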

5. Detailed Field Analysis

| Field | Mistral Strengths | Tesseract Weaknesses |
| --- | --- | --- |
| DL_number | Usually captured correctly (format errors minor) | Often garbled (sample6.jpeg shows “us1n4567-650100”) |
| Name | Often accurate | Frequently misrecognized or replaced with unrelated text (MONT. ANA, Ke, Bev) |
| Address | Decent formatting, preserves structure (street/city/state) | Often jumbled or incomplete |
| Date of Birth | Correct in some, but sometimes missing | Frequently wrong or placeholder text (DD o0/00/0000) |
| Donor Status | Sometimes correct | Often default or misread (None, [Donor Status Redacted]) |
| Expiration/Issue Dates | Better consistency | Often misreads formats (01-01-2010 vs 20XX-05-17) |
| Sex | Mostly correct | Misinterpretations for some samples |
| Height/Weight | Keeps units consistent (5'-06", 150 lb) | Mixes units or misreads (150 Ib, 5-08) |
| Eye Color | Sometimes also fails, but less often | Frequently misread (BRO, cyes) |

6. Key Takeaways

  1. Structured Data Extraction

    • Mistral OCR clearly outperforms Tesseract when extracting structured fields from identity documents.

    • DL numbers, addresses, and dates are more reliable with Mistral.

  2. Tesseract Limitations

    • Struggles with stylized or messy documents.

    • Often misreads names, donor status, eye color, and units in height/weight.

  3. Hybrid Approach

    • Use Mistral for critical fields.

    • Optionally, Tesseract can handle unstructured or plain text areas.

7. Conclusion

While Tesseract is free and works well for general OCR, it struggles with highly structured documents. For reliable field extraction from identity documents, Mistral OCR (or a similar modern OCR SaaS solution) is worth the investment.

This comparison highlights the importance of field-level validation in OCR workflows: simply getting the text is not enough; the accuracy of individual data points matters most for automation and downstream applications.
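As a concrete illustration of field-level validation, each extracted value can be checked against a simple per-field rule before it reaches downstream systems. The patterns below are illustrative only, not a complete ruleset (real rules depend on the issuing state):

```python
import re
from datetime import datetime

def _valid_date(v: str) -> bool:
    """Accept MM/DD/YYYY; reject placeholders like 'DD o0/00/0000'."""
    try:
        datetime.strptime(v, "%m/%d/%Y")
        return True
    except ValueError:
        return False

# Illustrative per-field checks keyed by field name
VALIDATORS = {
    "sex": lambda v: v in {"M", "F", "X"},
    "weight": lambda v: bool(re.fullmatch(r"\d{2,3}\s*lb", v)),   # e.g. "150 lb"
    "height": lambda v: bool(re.fullmatch(r"\d'-?\d{2}\"", v)),   # e.g. "5'-06\""
    "date_of_birth": _valid_date,
    "issue_date": _valid_date,
    "expiration_date": _valid_date,
}

def validate(record: dict) -> list:
    """Return the fields whose values fail their rule (fields without rules pass)."""
    return [f for f, check in VALIDATORS.items()
            if f in record and not check(record[f])]
```

Running the two engines’ JSON outputs through a validator like this catches exactly the kind of garbage values (misread units, placeholder dates) seen in the field analysis above.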
