Tue Apr 28 2026

Document vs. Data Mismatch in KYC

Why OCR vs API verification gaps create avoidable KYC failures - and how confidence-based matching fixes them

In digital KYC journeys, identity decisions usually compare:

  • Document data extracted via OCR
  • Authoritative source data from APIs (PAN, Aadhaar XML, CKYC, and similar systems)

The core issue: strict text equality does not reflect real-world identity data.

Quick takeaway: Many KYC mismatches are formatting mismatches, not identity mismatches.

Where mismatches usually appear

1) Name expansion and abbreviations

  • OCR: RAVI KUMAR
  • API: RAVI KUMAR SINGH

Or:

  • OCR: A. K. SHARMA
  • API: ASHOK KUMAR SHARMA

2) Address formatting differences

  • OCR: Flat 12, Bldg A
  • API: Flat No. 12 Building A

These are often representation differences, not fraud indicators.

Why these mismatches happen

  • OCR quality variation: blur, lighting, compression, skew.
  • Normalization differences: APIs return standardized forms; documents carry human-entered variations.
  • Initials vs expanded tokens: A K vs Ashok Kumar.
  • Punctuation and spacing: #12/A vs 12 A.
  • Legacy record drift: one source updated, another not yet synchronized.

What strict exact-match systems cause

  • Higher onboarding drop-offs
  • Unnecessary manual or VCIP escalation
  • Increased operations workload
  • Worse customer experience for genuine users

When a system rejects on pure string mismatch, genuine customers are turned away along with risky ones, and conversion drops.

Better approach: confidence-based matching

Instead of binary match/no-match, use score-based identity matching:

Confidence-based matching flow

  1. Compare OCR and API values
  2. Compute similarity score by field
  3. Apply risk thresholds instead of binary pass/fail

Example: field-level scoring

Field    | OCR         | API                | Score
Name     | A K Sharma  | Ashok Kumar Sharma | 92%
Address  | Flat 12 A   | Flat No 12 A       | 95%

With confidence scoring, these records can be accepted (or soft-reviewed) instead of hard-failed.
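
As a rough illustration of how a name pair like the one above might be scored, here is a minimal sketch that gives full credit to identical tokens and partial credit when a single-letter initial matches the first letter of a full token. The function names and the 0.8 credit for initials are assumptions made for this example, not a standard, and the output will not necessarily reproduce the illustrative percentages in the table.

```python
import re

# Assumed partial credit when an initial matches a full token (policy choice).
INITIAL_CREDIT = 0.8

def name_tokens(name: str) -> list[str]:
    """Replace non-letters with spaces, upper-case, and split into tokens."""
    return re.sub(r"[^A-Za-z\s]", " ", name).upper().split()

def token_credit(a: str, b: str) -> float:
    """1.0 for identical tokens, INITIAL_CREDIT for initial-vs-full, else 0.0."""
    if a == b:
        return 1.0
    short, long_ = sorted((a, b), key=len)
    if len(short) == 1 and long_.startswith(short):
        return INITIAL_CREDIT
    return 0.0

def name_score(ocr_name: str, api_name: str) -> float:
    """Order-insensitive token score, divided by the longer name's token count."""
    ocr, api = name_tokens(ocr_name), name_tokens(api_name)
    if not ocr or not api:
        return 0.0
    shorter, longer = sorted((ocr, api), key=len)
    remaining = list(longer)
    total = 0.0
    for tok in shorter:
        # Pick the best still-unmatched token on the other side.
        best = max(remaining, key=lambda cand: token_credit(tok, cand), default=None)
        if best is not None and token_credit(tok, best) > 0:
            total += token_credit(tok, best)
            remaining.remove(best)
    # Tokens missing from the shorter name (e.g. a dropped surname) lower the score.
    return total / len(longer)

if __name__ == "__main__":
    print(round(name_score("A. K. SHARMA", "ASHOK KUMAR SHARMA"), 3))
    print(round(name_score("RAVI KUMAR", "RAVI KUMAR SINGH"), 3))
```

How heavily to penalize a missing token, such as the absent surname in the second pair, is a policy decision. This sketch simply counts it as unmatched.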

How scoring engines typically work

  • Normalization: punctuation cleanup, consistent casing, whitespace rules.
  • Token matching: compares component-level name and address tokens.
  • Fuzzy matching: tolerates small spelling variation.
  • Weighted fields: gives higher weight to risk-critical fields (for example, Name > DOB > Address based on policy).
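
The four steps above can be sketched with the Python standard library alone. This is a simplified illustration, not a production engine: the abbreviation map, the filler-token list, and the field weights are all hypothetical values, and `difflib.SequenceMatcher` stands in for whatever fuzzy-matching method a real system uses.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lookup tables and weights; real values would come from policy.
ABBREVIATIONS = {"BLDG": "BUILDING", "APT": "APARTMENT", "RD": "ROAD"}
FILLER_TOKENS = {"NO"}
FIELD_WEIGHTS = {"name": 0.5, "dob": 0.3, "address": 0.2}

def normalize(value: str) -> list[str]:
    """Normalization: strip punctuation, upper-case, expand abbreviations."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", value).upper()
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in cleaned.split()]
    return [tok for tok in tokens if tok not in FILLER_TOKENS]

def field_score(ocr_value: str, api_value: str) -> float:
    """Fuzzy similarity (0.0 to 1.0) between the normalized token strings."""
    a, b = normalize(ocr_value), normalize(api_value)
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, " ".join(a), " ".join(b)).ratio()

def overall_score(ocr: dict, api: dict) -> float:
    """Weighted sum of per-field scores; missing fields score 0.0."""
    return sum(
        weight * field_score(ocr.get(field, ""), api.get(field, ""))
        for field, weight in FIELD_WEIGHTS.items()
    )

if __name__ == "__main__":
    # After normalization both address strings reduce to the same tokens.
    print(field_score("Flat 12, Bldg A", "Flat No. 12 Building A"))
    # A small spelling variation lowers the score only slightly.
    print(round(field_score("RAVI KUMAAR", "RAVI KUMAR"), 3))
```

Applying fuzzy similarity to every field is a simplification. A date of birth, for instance, may be better compared exactly after parsing.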

Threshold-based decisions

Score Range      | Action
>= 90%           | Auto-accept
75% to < 90%     | Soft review
< 75%            | Reject / escalation

This balances conversion against risk control. The cut-offs shown are illustrative and should be set by policy.
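
A minimal sketch of the threshold mapping, assuming scores are expressed as fractions between 0.0 and 1.0. The `Thresholds` class, the `decide` function, and the action labels are names invented for this example. The default cut-offs mirror the table above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    """Illustrative cut-offs; real values are set by policy."""
    auto_accept: float = 0.90
    soft_review: float = 0.75

def decide(score: float, thresholds: Thresholds = Thresholds()) -> str:
    """Map a 0.0-1.0 confidence score to an action label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= thresholds.auto_accept:
        return "AUTO_ACCEPT"
    if score >= thresholds.soft_review:
        return "SOFT_REVIEW"
    return "REJECT_OR_ESCALATE"

if __name__ == "__main__":
    for s in (0.92, 0.895, 0.60):
        print(s, decide(s))
```

Using half-open ranges (a score of 0.895 falls into soft review) avoids leaving fractional scores between two bands unclassified.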

Compliance perspective

Confidence-based matching can fit within a compliant KYC process when the controls around it are explicit:

  • Every decision is score-backed and logged
  • Thresholds are configurable per policy/regulation
  • High-risk segments still route to manual review
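
One way these three controls might look in code is sketched below. The policy dictionary, the logger name, the record fields, and the `decide_and_log` function are all hypothetical. The point is only that each decision carries its scores, the policy version that produced it, and a timestamp.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger("kyc.matching")  # hypothetical logger name

# Hypothetical policy; in practice this would be loaded from configuration.
DEFAULT_POLICY = {"version": "example-v1", "auto_accept": 0.90, "soft_review": 0.75}

def decide_and_log(application_id: str, field_scores: dict, overall: float,
                   high_risk: bool, policy: Optional[dict] = None) -> dict:
    """Choose an action, then emit one structured audit record for it."""
    policy = policy or DEFAULT_POLICY
    if high_risk:
        # High-risk segments go to manual review regardless of score.
        action = "MANUAL_REVIEW"
    elif overall >= policy["auto_accept"]:
        action = "AUTO_ACCEPT"
    elif overall >= policy["soft_review"]:
        action = "SOFT_REVIEW"
    else:
        action = "REJECT_OR_ESCALATE"
    record = {
        "application_id": application_id,
        "field_scores": field_scores,
        "overall_score": overall,
        "high_risk": high_risk,
        "policy_version": policy["version"],
        "action": action,
        "decided_at": datetime.now(timezone.utc).isoformat(),
    }
    logger.info(json.dumps(record, sort_keys=True))
    return record

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    decide_and_log("app-001", {"name": 0.92, "address": 0.95}, 0.93, high_risk=False)
```

Recording the policy version alongside each score makes it possible to explain later why a given application was accepted under the thresholds in force at the time.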

This model is not about relaxing KYC.
It is about interpreting equivalent data correctly and consistently.

Key insight for teams

Many KYC mismatches are representation mismatches, not identity mismatches.

The core issue is often data interpretation, not data authenticity.

Conclusion

Strict equality checks increase friction and operational cost.
Confidence-based matching improves conversion while preserving compliance posture.

The goal of KYC is not string equality.
The goal of KYC is identity confidence.