Speech-to-Text (Multimodal)
Unified multimodal speech-to-text wrapper with selectable providers, language control, accuracy tiers, and optional auto-language routing.
API reference
Authentication: JWT Bearer token, passed in the Authorization header. Obtain a token from the KwikID dashboard.

Request parameters (multipart/form-data):

| Field | Allowed values / type | Default | Description |
|---|---|---|---|
| audio | file | — | Primary audio bytes or upload field (required). |
| provider | "openai" \| "xai" \| "elevenlabs" \| "sarvam" \| "google" \| "aws" \| "azure" \| "deepgram" \| "assemblyai" \| "anthropic" | — | Vendor slug. |
| model | string | deployment default | Provider-specific model id (for example whisper-1, scribe_v1). Omit for the deployment default. |
| language | BCP-47 tag | "en" | Source language (for example en, en-IN, hi-IN). Used when language_mode is fixed. |
| language_mode | "fixed" \| "auto" | "fixed" | fixed uses language; auto requests detection (detected_language in response). |
| accuracy | "low" \| "medium" \| "high" | "medium" | Quality / latency / cost trade-off tier. |
| accuracy_language_policy | JSON string | — | Stringified object mapping tiers to language_mode, language, allowed_languages, provider_preference, fallback_chain. Example: {"high":{"language_mode":"fixed","language":"en-IN"},"low":{"language_mode":"auto"}} |
| enable_word_timestamps | "true" \| "false" | — | Pass true / false as a form string. |
| enable_diarization | "true" \| "false" | — | Pass true / false as a form string. |
| enable_pii_redaction | "true" \| "false" | — | Pass true / false as a form string. |
| callback_url | uri | — | Async completion callback for long media. |
| metadata | JSON string | — | Echoed in logs and callbacks. |
| idempotency_key | string | — | Optional correlation id for logs; enables safe retries. |
Examples

Request:

curl -X POST "https://__mock__/speech-to-text/transcribe" \
  -H "Authorization: Bearer <token>" \
  -F "audio=@<audio-file>"

Response body:

{
"status": "success",
"request_id": "req_stt_01hexample",
"text": "Sample transcript for mock responses in the Playground.",
"detected_language": "en-IN",
"confidence": 0.94,
"words": [
{
"word": "Sample",
"start": 0,
"end": 0.32
},
{
"word": "transcript",
"start": 0.35,
"end": 0.88
}
],
"provider": "openai",
"model": "whisper-1",
"accuracy": "medium",
"usage": {
"audio_seconds": 4.2,
"billed_seconds": 5
}
}

Error responses take one of two shapes. Request-side errors (400, 401, 413, 429) return:

{
  "detail": {},
  "message": "string"
}

Upstream failures (502) return:

{
  "error": "string"
}

Overview
The Speech-to-Text (STT) API is a single integration surface for audio → text with provider routing and policy-based controls. You choose a provider / model, set source language (default English), tune accuracy tier (low, medium, high), and optionally let the service pick language and routing dynamically based on tier, latency budget, and supported locales per provider.
[PLACEHOLDER FOR SCREENSHOT]
Key features
- Audio-first contract: Accept audio as required input with a stable multipart schema.
- Multi-provider routing: One contract; swap model without changing your client orchestration.
- Explicit or auto language: Pin `language` (BCP-47, for example `en`, `en-IN`, `hi-IN`) or use `language_mode: auto` for detection.
- Accuracy tiers: `low`, `medium`, `high` map to latency/cost/quality trade-offs and may constrain which providers run for a given locale.
- Dynamic language policy: Optional `accuracy_language_policy` ties tier to allowed auto-detect candidates or fallback order.
- Operational extras: Idempotency, callbacks for long jobs, timestamps, diarization hooks, and retention controls (see below).
Supported providers and popular models
The wrapper exposes `provider` (or the composite string `provider:model_id`) across commonly used stacks. Exact `model_id` strings are versioned in your environment's model registry; this table lists representative families.
| Provider | Typical STT / multimodal models (examples) | Notes |
|---|---|---|
| OpenAI / ChatGPT stack | Whisper (for example whisper-1), GPT-4o audio / Realtime where enabled | Strong en baseline; broad locale coverage via Whisper |
| xAI | Grok audio / multimodal SKUs where exposed for STT | Check locale matrix per deployment |
| ElevenLabs | Scribe and related speech-to-text model IDs | Often tuned for creator / long-form audio |
| Sarvam | Sarvam STT models (Indian languages emphasis) | Use for hi, ta, te, kn, mr, bn, gu, en-IN, etc. |
| Google | Gemini multimodal + Cloud Speech-to-Text / Chirp | Good auto language in some tiers |
| Amazon Web Services | Amazon Transcribe (batch / streaming) | Enterprise PII / redaction features |
| Microsoft Azure | Azure AI Speech fast / accurate modes | high accuracy option maps cleanly |
| Deepgram | Nova / Enhanced | Low-latency streaming |
| AssemblyAI | Universal + slam-1 and other SKUs | Rich post-processing (labels, topics) |
| Anthropic | Claude multimodal (audio-in) where available | Less traditional STT; useful for reasoning over transcripts |
Benefit: You standardize on one request schema and one observability story while retaining access to best-in-class engines per locale and SLA.
Language model
- Default: `language: "en"` (English) when omitted.
- Explicit locale: Set `language` to a BCP-47 tag (for example `en-IN`, `hi-IN`).
- Auto detection: Set `language_mode: "auto"`. The service may return `detected_language` in the response.
- Dynamic choice by accuracy: Set `accuracy_language_policy` so each `accuracy` tier defines `allowed_languages`, `provider_preference`, or `fallback_chain`. Example: `high` → fixed `en-IN` + OpenAI/Azure; `low` → `auto` + streaming Deepgram with broad `allowed_languages` (see the sketch below).
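A minimal sketch of the stringified-policy pattern in Python (field names follow the OpenAPI block above; the tier values are illustrative, not defaults):

```python
import json

# Per-tier overrides: `high` pins en-IN with a provider preference;
# `low` allows auto-detection over a constrained candidate set.
policy = {
    "high": {
        "language_mode": "fixed",
        "language": "en-IN",
        "provider_preference": ["azure", "openai"],
    },
    "low": {
        "language_mode": "auto",
        "allowed_languages": ["en", "hi", "ta"],
    },
}

# Multipart form fields carry strings, so the policy object is JSON-encoded.
form_fields = {
    "language": "en",
    "language_mode": "fixed",
    "accuracy": "high",
    "accuracy_language_policy": json.dumps(policy),
}
```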
Accuracy tiers
| Tier | Intended use | Typical behavior |
|---|---|---|
| high | Compliance-heavy, court-ready, medical | Slower, higher cost, stricter VAD, word error rate targets |
| medium | Product default | Balanced cost/latency/quality |
| low | Real-time hints, rough captions | Faster, cheaper; may widen auto language guesses |
Benefit: Product teams expose one slider; backend maps to provider capabilities and quota.
API contract (reference)
Call POST /speech-to-text/transcribe with Authorization: Bearer <token> and multipart/form-data. The OpenAPI block above is the source of truth for field names and enums. accuracy_language_policy is sent as a JSON string in multipart (same pattern as other stringified JSON fields in ML APIs).
HTTP
POST /speech-to-text/transcribe
Content-Type: multipart/form-data
Authorization: Bearer <token>

Request fields (conceptual JSON; use multipart parts or string fields per OpenAPI)
{
"audio": "<binary or multipart file field>",
"provider": "openai",
"model": "whisper-1",
"language": "en",
"language_mode": "fixed",
"accuracy": "medium",
"accuracy_language_policy": {
"high": { "language_mode": "fixed", "language": "en-IN", "provider_preference": ["azure", "openai"] },
"low": { "language_mode": "auto", "allowed_languages": ["en", "hi", "ta"], "provider_preference": ["deepgram", "sarvam"] }
},
"enable_word_timestamps": true,
"enable_diarization": false,
"enable_pii_redaction": false,
"callback_url": "https://example.com/stt-callback",
"idempotency_key": "uuid-v4",
"metadata": { "session_id": "abc123" }
}

| Field | Required | Description |
|---|---|---|
| audio | Yes | Primary audio bytes or upload field. |
| provider / model | One required | Select vendor and model. |
| language | No | BCP-47; default en. |
| language_mode | No | fixed (default) or auto. |
| accuracy | No | low, medium (default), high. |
| accuracy_language_policy | No | Per-tier overrides for language and provider preferences. |
| enable_word_timestamps | No | Word-level timing when supported. |
| enable_diarization | No | Speaker labels when supported. |
| enable_pii_redaction | No | Mask emails, phones, Aadhaar-like patterns when policy allows. |
| callback_url | No | Async completion for long media. |
| idempotency_key | No | Safe retries for paid calls. |
| metadata | No | Echoed in logs and callbacks. |
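A sketch of a synchronous call using Python's third-party `requests` library (the host is a placeholder; substitute your deployment's base URL and a real audio file):

```python
import json
import uuid

import requests  # assumed HTTP client; any multipart-capable client works

API_URL = "https://api.example.com/speech-to-text/transcribe"  # placeholder host
TOKEN = "<token>"

with open("clip.wav", "rb") as f:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {TOKEN}"},
        files={"audio": ("clip.wav", f, "audio/wav")},
        data={
            "provider": "openai",
            "model": "whisper-1",
            "language": "en",
            "language_mode": "fixed",
            "accuracy": "medium",
            "enable_word_timestamps": "true",  # booleans travel as form strings
            "idempotency_key": str(uuid.uuid4()),  # safe retries for a paid call
            "metadata": json.dumps({"session_id": "abc123"}),
        },
        timeout=120,
    )

response.raise_for_status()
result = response.json()
print(result["text"], result.get("detected_language"))
```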
Response
{
"status": "success",
"request_id": "req_01h...",
"text": "Full transcript text.",
"detected_language": "en-IN",
"confidence": 0.93,
"words": [{ "word": "Hello", "start": 0.12, "end": 0.34 }],
"provider": "openai",
"model": "whisper-1",
"accuracy": "medium"
}
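When `enable_word_timestamps` is set, the `words` array can drive simple caption windows. A sketch (the response shape is as above; the 0.6 s gap threshold is an arbitrary choice):

```python
def words_to_captions(words, max_gap=0.6):
    """Group word timings into caption lines, splitting on silences > max_gap seconds."""
    lines, current = [], []
    for w in words:
        if current and w["start"] - current[-1]["end"] > max_gap:
            lines.append(current)
            current = []
        current.append(w)
    if current:
        lines.append(current)
    return [
        (line[0]["start"], line[-1]["end"], " ".join(w["word"] for w in line))
        for line in lines
    ]

# Using the sample response above:
print(words_to_captions([
    {"word": "Sample", "start": 0, "end": 0.32},
    {"word": "transcript", "start": 0.35, "end": 0.88},
]))
# -> [(0, 0.88, 'Sample transcript')]
```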
Wrapper capabilities (recommended)
- Unified authentication and per-tenant rate limits.
- Format normalization (WAV, MP3, M4A, FLAC) and sample-rate hints.
- Streaming WebSocket or SSE channel for `low`-latency paths (provider-dependent).
- Fallback chain: If the primary `provider` errors or rejects the locale, try the `fallback_chain` from the policy.
- Cost and latency hints in the response `usage` object (seconds billed, `realtime_factor`).
- Data retention TTL per request (`retention_hours`).
- Webhook signature (HMAC) on callbacks (verification sketch after this list).
- Audit: Store hashes only when PII minimization mode is on.
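The HMAC callback signature can be verified along these lines; the SHA-256 hex-digest scheme and how the signature is delivered are assumptions, so check your deployment's webhook documentation:

```python
import hashlib
import hmac

def verify_callback(secret: bytes, raw_body: bytes, signature: str) -> bool:
    """Recompute HMAC-SHA256 over the raw request body and compare in constant time.

    Assumption for illustration: the signature arrives as a hex digest
    (for example in a request header) computed over the exact raw bytes.
    """
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```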
Implementation
- Obtain token: Use the same Bearer pattern as other KwikID APIs.
- Choose path: Synchronous upload for short clips; `callback_url` for long interviews.
- Set language: Start with `language: "en"`; expand to target locales; switch to `language_mode: auto` only when `accuracy` and policy allow.
- Map accuracy: Align `high` with review workflows; use `low` for live agent assist.
- Test matrix: For each `provider`, run gold audio in `en`, `en-IN`, and your top regional languages (see the sweep sketch below).
- Monitor: Track WER proxies, p95 latency, and empty transcript rate by `provider`.
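The test-matrix and monitoring steps can be a simple sweep; a sketch where the gold-audio paths and the `transcribe` helper are hypothetical stand-ins:

```python
# Hypothetical gold set: one reference clip per locale you care about.
GOLD = {
    "en": "gold/en.wav",
    "en-IN": "gold/en_in.wav",
    "hi-IN": "gold/hi_in.wav",
}
PROVIDERS = ["openai", "sarvam", "deepgram"]

def sweep(transcribe):
    """Run every provider against every gold clip.

    `transcribe(path, provider, language)` is a stand-in for a helper that
    issues the multipart POST shown earlier and returns the transcript text.
    """
    for provider in PROVIDERS:
        for lang, path in GOLD.items():
            text = transcribe(path, provider=provider, language=lang)
            # Feed these into WER proxies and empty-transcript-rate dashboards.
            print(provider, lang, "EMPTY" if not text.strip() else f"{len(text)} chars")
```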
Error handling
| HTTP status | When |
|---|---|
| 400 | Unsupported provider/model, bad audio, or conflicting language_mode. |
| 401 | Invalid or missing token. |
| 413 | Payload over max duration/size. |
| 429 | Quota / concurrency exceeded. |
| 502 | Upstream STT failure after retries. |
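A retry policy can key off this table. Below is a sketch that retries only quota and upstream failures (429, 502), reusing the same `idempotency_key` so repeated attempts are not double-billed; `send_request` is a stand-in for the call shown earlier:

```python
import time

RETRYABLE_STATUSES = {429, 502}

def transcribe_with_retry(send_request, max_attempts=4):
    """Call send_request() with exponential backoff on retryable statuses.

    `send_request` is a zero-argument stand-in that issues the multipart POST
    with a fixed idempotency_key and returns an HTTP response object.
    """
    response = None
    for attempt in range(max_attempts):
        response = send_request()
        if response.status_code not in RETRYABLE_STATUSES:
            return response  # success or a non-retryable 4xx
        time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
    return response
```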
Benefits
- Single integration for many STT backends.
- Predictable cost/quality via `accuracy`.
- Locale agility from English-first to broad Indian and global languages.
- Future-proofing as new `model_id` values are added to the registry.
Next steps
- Machine Learning API index
- API Suite index
- Facematch (when pairing voice with face journeys)