Comparing Mistral OCR vs Tesseract for Identity Document Extraction

OCR (Optical Character Recognition) is a critical tool for digitizing structured data from documents like driver’s licenses, IDs, and forms. Recently, I compared Mistral OCR (SaaS) with Tesseract, testing how well each extracts fields such as names, dates, and addresses from real-world samples. Here’s what I found.
1. Setup & Methodology
I tested 10 ID samples in formats like .jpeg and .webp. Each document was processed through:
Mistral OCR via API (extract endpoint)
Tesseract OCR via API (tesseract endpoint)
The results were compared field by field across 11 key attributes:
name, DL_number, date_of_birth, issue_date, expiration_date,
sex, height, weight, eye_color, donor_status, address
Each field was checked for differences, and a summary CSV was generated for analysis.
Python scripts handled:
API calls
JSON storage of outputs
Automatic comparison of each field
Visualization of discrepancies per sample and per field
2. Code Implementation
I used the following Python script to run 10 samples against two endpoints I created on my GPU-enabled machine, exposed via Flask.
import os
import json
import requests
import pandas as pd
from pathlib import Path

# CONFIG
BASE_URL = "MY HOSTED ENDPOINT"
AUTH_TOKEN = "APIKEY"
SAMPLES_DIR = "samples"
RESULTS_DIR = "results"

os.makedirs(RESULTS_DIR, exist_ok=True)

HEADERS = {
    "Authorization": f"Bearer {AUTH_TOKEN}"
}

FIELDS = [
    "name", "DL_number", "date_of_birth", "issue_date", "expiration_date",
    "sex", "height", "weight", "eye_color", "donor_status", "address"
]

def call_api(endpoint, file_path):
    """POST a sample image to one of the two OCR endpoints."""
    with open(file_path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/{endpoint}",
            headers=HEADERS,
            files={"file": (os.path.basename(file_path), f)},
            timeout=60,
        )
    try:
        return resp.json()
    except ValueError:
        # Response was not valid JSON; keep the raw text for debugging.
        return {"error": resp.text}

def compare_dicts(mistral_data, tess_data):
    """Return {field: True} for every field on which the engines disagree."""
    diffs = {}
    for field in FIELDS:
        v1 = mistral_data.get(field, "")
        v2 = tess_data.get(field, "")
        diffs[field] = (v1 != v2)
    return diffs

def main():
    rows = []
    for file_name in os.listdir(SAMPLES_DIR):
        file_path = os.path.join(SAMPLES_DIR, file_name)
        print(f"\n▶ Processing: {file_name}")

        mistral_result = call_api("extract", file_path)
        tess_result = call_api("tesseract", file_path)

        # Persist both raw outputs for later inspection
        Path(f"{RESULTS_DIR}/{file_name}_mistral.json").write_text(
            json.dumps(mistral_result, indent=2, ensure_ascii=False)
        )
        Path(f"{RESULTS_DIR}/{file_name}_tesseract.json").write_text(
            json.dumps(tess_result, indent=2, ensure_ascii=False)
        )

        diffs = compare_dicts(mistral_result, tess_result)
        diff_count = sum(1 for x in diffs.values() if x)

        rows.append({
            "file": file_name,
            "mistral_output": mistral_result,
            "tesseract_output": tess_result,
            "fields_different": diff_count,
            **{f"diff_{k}": v for k, v in diffs.items()}
        })

    # Export CSV for plotting
    df = pd.DataFrame(rows)
    csv_path = f"{RESULTS_DIR}/comparison_results.csv"
    df.to_csv(csv_path, index=False)

    print(f"\nDone! Comparison results saved to: {csv_path}")
    print("JSON outputs stored in results/ directory.")

if __name__ == "__main__":
    main()
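Note that compare_dicts flags a field whenever the two engines disagree on the exact string, so even a pure formatting difference counts as a mismatch. A small standalone example of that behavior:

```python
# Standalone illustration of the exact-string comparison used above:
# any formatting difference counts as a mismatch.
FIELDS = ["name", "date_of_birth", "eye_color"]

def compare_dicts(mistral_data, tess_data):
    return {f: mistral_data.get(f, "") != tess_data.get(f, "") for f in FIELDS}

mistral = {"name": "JANE SAMPLE", "date_of_birth": "01/01/1990", "eye_color": "BRN"}
tess    = {"name": "JANE SAMPLE", "date_of_birth": "01-01-1990", "eye_color": "BRO"}

diffs = compare_dicts(mistral, tess)
print(diffs)  # name matches; the other two differ
```

This strictness is worth keeping in mind when reading the counts below: a "difference" is not always a misread, though in practice most of them were.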
3. Results Overview
| Metric | Observation |
| --- | --- |
| Total Samples | 10 |
| Fields Compared | 11 |
| Max Fields Different per Sample | 11 (worst-case Tesseract) |
| Min Fields Different per Sample | 5 |
Field-wise Differences
The most frequently misread fields by Tesseract were:
Name
Address
Donor Status
Eye Color
Weight
Fields like sex, issue_date, and expiration_date were more reliably captured by both engines.
4. Visual Insights
Plot 1: Differences per Sample
Samples sample6.jpeg, sample7.jpeg, sample8.jpeg, sample9.jpeg, and sample10.jpeg had all fields misread by Tesseract. sample1.jpg and sample4.webp performed better, with fewer differences.

Plot 2: Differences per Field
Confirms that name, address, donor status, eye color, and weight are Tesseract’s weak spots.
DL number, dates, and sex fields were mostly correct.
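Both plots come straight out of comparison_results.csv: per-sample differences are a row-wise sum over the diff_* columns, and per-field differences are a column-wise sum. A minimal sketch of that aggregation, using two made-up rows shaped like the CSV:

```python
import pandas as pd

# Made-up rows shaped like comparison_results.csv (diff_* columns are booleans).
df = pd.DataFrame([
    {"file": "sample1.jpg",  "diff_name": True, "diff_sex": False, "diff_address": True},
    {"file": "sample6.jpeg", "diff_name": True, "diff_sex": True,  "diff_address": True},
])

diff_cols = [c for c in df.columns if c.startswith("diff_")]

# Plot 1: number of mismatched fields per sample (row-wise sum).
per_sample = df.set_index("file")[diff_cols].sum(axis=1)

# Plot 2: number of samples on which each field mismatched (column-wise sum).
per_field = df[diff_cols].sum()

print(per_sample.to_dict())
print(per_field.to_dict())
```

Feeding either series to a bar plot reproduces the charts above.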

5. Detailed Field Analysis
| Field | Mistral Strengths | Tesseract Weaknesses |
| --- | --- | --- |
| DL_number | Usually captured correctly (format errors minor) | Often garbled (sample6.jpeg shows “us1n4567-650100”) |
| Name | Often accurate | Frequently misrecognized or replaced with unrelated text (MONT. ANA, Ke, Bev) |
| Address | Decent formatting, preserves structure (street/city/state) | Often jumbled or incomplete |
| Date of Birth | Correct in some samples, occasionally missing | Frequently wrong or placeholder text (DD o0/00/0000) |
| Donor Status | Sometimes correct | Often default or misread (None, [Donor Status Redacted]) |
| Expiration/Issue Dates | Better consistency | Often misreads formats (01-01-2010 vs 20XX-05-17) |
| Sex | Mostly correct | Misinterpretations on some samples |
| Height/Weight | Keeps units consistent (5'-06", 150 lb) | Mixes units or misreads (150 Ib, 5-08) |
| Eye Color | Sometimes also fails, but less often | Frequently misread (BRO, cyes) |
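Several of the mismatches in this table are format-level rather than true misreads (e.g. “150 Ib” vs “150 lb”, or dashed vs slashed dates). Normalizing values before comparison would separate format noise from genuine errors. A minimal sketch, with a hypothetical normalize helper that handles just the two patterns quoted above:

```python
import re

def normalize(value: str) -> str:
    # Hypothetical cleanup before comparison: lowercase/trim, repair the
    # common OCR confusion of "Ib" for "lb", and unify date separators.
    v = value.strip().lower()
    v = v.replace(" ib", " lb")  # '150 Ib' -> '150 lb'
    v = re.sub(r"(\d{2})-(\d{2})-(\d{4})", r"\1/\2/\3", v)  # '01-01-2010' -> '01/01/2010'
    return v

print(normalize("150 Ib"))      # -> '150 lb'
print(normalize("01-01-2010"))  # -> '01/01/2010'
```

A real pipeline would need field-specific normalizers (heights, eye-color codes, etc.), but even this coarse pass reduces false positives in the diff counts.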
6. Key Takeaways
Structured Data Extraction
Mistral OCR clearly outperforms Tesseract when extracting structured fields from identity documents.
DL numbers, addresses, and dates are more reliable with Mistral.
Tesseract Limitations
Struggles with stylized or messy documents.
Often misreads names, donor status, eye color, and units in height/weight.
Hybrid Approach
Use Mistral for critical fields.
Optionally, Tesseract can handle unstructured or plain text areas.
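One way to sketch this hybrid merge (the field split and merge_results helper below are hypothetical, not part of my pipeline): prefer Mistral for the critical identity fields, and let Tesseract fill in elsewhere or serve as a fallback when Mistral returns nothing.

```python
# Hypothetical merge for the hybrid approach: Mistral wins on critical
# identity fields; Tesseract is preferred elsewhere; either side serves
# as a fallback when the preferred engine returned an empty value.
CRITICAL = {"name", "DL_number", "date_of_birth", "address"}

def merge_results(mistral: dict, tess: dict) -> dict:
    merged = {}
    for field in set(mistral) | set(tess):
        primary, fallback = (mistral, tess) if field in CRITICAL else (tess, mistral)
        merged[field] = primary.get(field) or fallback.get(field) or ""
    return merged

out = merge_results(
    {"name": "JANE SAMPLE", "eye_color": ""},
    {"name": "Ke", "eye_color": "BRO"},
)
print(out)
```

Here Mistral’s name beats Tesseract’s garbled “Ke”, while Tesseract supplies the eye color Mistral left empty.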
7. Conclusion
While Tesseract is free and works well for general OCR, it struggles with highly structured documents. For reliable field extraction from IDs, Mistral OCR (or similar modern OCR SaaS solutions) is worth the investment.
This comparison highlights the importance of field-level validation in OCR workflows: simply getting the text is not enough; the accuracy of individual data points matters most for automation and downstream applications.
