Speech-to-Text (Multimodal)
Unified multimodal speech-to-text wrapper with selectable providers, language control, accuracy tiers, and optional auto-language routing.
API reference
Authentication: JWT Bearer token, passed in the Authorization header. Obtain a token from the KwikID dashboard.

Request parameters (multipart/form-data):

| Field | Allowed values / type | Default | Description |
|---|---|---|---|
| audio | file | — | Primary audio bytes or upload field (required). |
| provider | "openai" \| "xai" \| "elevenlabs" \| "sarvam" \| "google" \| "aws" \| "azure" \| "deepgram" \| "assemblyai" \| "anthropic" | — | Vendor slug. |
| model | string | deployment default | Provider-specific model id (for example whisper-1, scribe_v1). Omit for the deployment default. |
| language | BCP-47 tag | "en" | Source language (for example en, en-IN, hi-IN). Used when language_mode is fixed. |
| language_mode | "fixed" \| "auto" | "fixed" | fixed uses language; auto requests detection (detected_language in response). |
| accuracy | "low" \| "medium" \| "high" | "medium" | Quality / latency / cost trade-off tier. |
| accuracy_language_policy | JSON string | — | Stringified object mapping tiers to language_mode, language, allowed_languages, provider_preference, fallback_chain. Example: {"high":{"language_mode":"fixed","language":"en-IN"},"low":{"language_mode":"auto"}} |
| enable_word_timestamps | "true" \| "false" | — | Pass true / false as a form string. |
| enable_diarization | "true" \| "false" | — | Pass true / false as a form string. |
| enable_pii_redaction | "true" \| "false" | — | Pass true / false as a form string. |
| callback_url | uri | — | Async completion callback for long media. |
| metadata | JSON string | — | Echoed in logs and callbacks. |
| idempotency_key | string | — | Optional correlation id for logs; enables safe retries. |
Examples

Request:

curl -X POST "https://__mock__/speech-to-text/transcribe" \
  -H "Authorization: Bearer <token>" \
  -F "audio=@<audio-file>"

Response body:

{
"status": "success",
"request_id": "req_stt_01hexample",
"text": "Sample transcript for mock responses in the Playground.",
"detected_language": "en-IN",
"confidence": 0.94,
"words": [
{
"word": "Sample",
"start": 0,
"end": 0.32
},
{
"word": "transcript",
"start": 0.35,
"end": 0.88
}
],
"provider": "openai",
"model": "whisper-1",
"accuracy": "medium",
"usage": {
"audio_seconds": 4.2,
"billed_seconds": 5
}
}

Error responses take one of two shapes. Request-side errors (400, 401, 413, 429) return:

{
  "detail": {},
  "message": "string"
}

Upstream failures (502) return:

{
  "error": "string"
}

Overview
The Speech-to-Text (STT) API is a single integration surface for audio → text with provider routing and policy-based controls. You choose a provider / model, set source language (default English), tune accuracy tier (low, medium, high), and optionally let the service pick language and routing dynamically based on tier, latency budget, and supported locales per provider.
[PLACEHOLDER FOR SCREENSHOT]
Key features
- Audio-first contract: Accept audio as required input with a stable multipart schema.
- Multi-provider routing: One contract; swap model without changing your client orchestration.
- Explicit or auto language: Pin `language` (BCP-47, for example `en`, `en-IN`, `hi-IN`) or use `language_mode: auto` for detection.
- Accuracy tiers: `low`, `medium`, `high` map to latency/cost/quality trade-offs and may constrain which providers run for a given locale.
- Dynamic language policy: Optional `accuracy_language_policy` ties tier to allowed auto-detect candidates or fallback order.
- Operational extras: Idempotency, callbacks for long jobs, timestamps, diarization hooks, and retention controls (see below).
Supported providers and popular models
The wrapper exposes `provider` (or the composite string `provider:model_id`) across commonly used stacks. Exact `model_id` strings are versioned in your environment's model registry; this table lists representative families.
| Provider | Typical STT / multimodal models (examples) | Notes |
|---|---|---|
| OpenAI / ChatGPT stack | Whisper (for example whisper-1), GPT-4o audio / Realtime where enabled | Strong en baseline; broad locale coverage via Whisper |
| xAI | Grok audio / multimodal SKUs where exposed for STT | Check locale matrix per deployment |
| ElevenLabs | Scribe and related speech-to-text model IDs | Often tuned for creator / long-form audio |
| Sarvam | Sarvam STT models (Indian languages emphasis) | Use for hi, ta, te, kn, mr, bn, gu, en-IN, etc. |
| Google | Gemini multimodal + Cloud Speech-to-Text / Chirp | Good auto language in some tiers |
| Amazon Web Services | Amazon Transcribe (batch / streaming) | Enterprise PII / redaction features |
| Microsoft Azure | Azure AI Speech fast / accurate modes | high accuracy option maps cleanly |
| Deepgram | Nova / Enhanced | Low-latency streaming |
| AssemblyAI | Universal + slam-1 and other SKUs | Rich post-processing (labels, topics) |
| Anthropic | Claude multimodal (audio-in) where available | Less traditional STT; useful for reasoning over transcripts |
Benefit: You standardize on one request schema and one observability story while retaining access to best-in-class engines per locale and SLA.
Language model
- Default: `language: "en"` (English) when omitted.
- Explicit locale: Set `language` to a BCP-47 tag (for example `en-IN`, `hi-IN`).
- Auto detection: Set `language_mode: "auto"`. The service may return `detected_language` in the response.
- Dynamic choice by accuracy: Set `accuracy_language_policy` so each `accuracy` tier defines `allowed_languages`, `provider_preference`, or `fallback_chain`. Example: `high` → fixed `en-IN` + OpenAI/Azure; `low` → `auto` + streaming Deepgram with broad `allowed_languages` (see the sketch below).
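A minimal sketch of the stringified-policy pattern in Python (field names follow the OpenAPI block above; the tier values are illustrative, not defaults):

```python
import json

# Per-tier overrides: `high` pins en-IN with a provider preference;
# `low` allows auto-detection over a constrained candidate set.
policy = {
    "high": {
        "language_mode": "fixed",
        "language": "en-IN",
        "provider_preference": ["azure", "openai"],
    },
    "low": {
        "language_mode": "auto",
        "allowed_languages": ["en", "hi", "ta"],
    },
}

# Multipart form fields carry strings, so the policy object is JSON-encoded.
form_fields = {
    "language": "en",
    "language_mode": "fixed",
    "accuracy": "high",
    "accuracy_language_policy": json.dumps(policy),
}
```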
Accuracy tiers
| Tier | Intended use | Typical behavior |
|---|---|---|
| high | Compliance-heavy, court-ready, medical | Slower, higher cost, stricter VAD, word error rate targets |
| medium | Product default | Balanced cost/latency/quality |
| low | Real-time hints, rough captions | Faster, cheaper; may widen auto language guesses |
Benefit: Product teams expose one slider; backend maps to provider capabilities and quota.
API contract (reference)
Call POST /speech-to-text/transcribe with Authorization: Bearer <token> and multipart/form-data. The OpenAPI block above is the source of truth for field names and enums. accuracy_language_policy is sent as a JSON string in multipart (same pattern as other stringified JSON fields in ML APIs).
HTTP
POST /speech-to-text/transcribe
Content-Type: multipart/form-data
Authorization: Bearer <token>

Request fields (conceptual JSON; use multipart parts or string fields per OpenAPI)
{
"audio": "<binary or multipart file field>",
"provider": "openai",
"model": "whisper-1",
"language": "en",
"language_mode": "fixed",
"accuracy": "medium",
"accuracy_language_policy": {
"high": { "language_mode": "fixed", "language": "en-IN", "provider_preference": ["azure", "openai"] },
"low": { "language_mode": "auto", "allowed_languages": ["en", "hi", "ta"], "provider_preference": ["deepgram", "sarvam"] }
},
"enable_word_timestamps": true,
"enable_diarization": false,
"enable_pii_redaction": false,
"callback_url": "https://example.com/stt-callback",
"idempotency_key": "uuid-v4",
"metadata": { "session_id": "abc123" }
}

| Field | Required | Description |
|---|---|---|
| audio | Yes | Primary audio bytes or upload field. |
| provider / model | One required | Select vendor and model. |
| language | No | BCP-47; default en. |
| language_mode | No | fixed (default) or auto. |
| accuracy | No | low, medium (default), high. |
| accuracy_language_policy | No | Per-tier overrides for language and provider preferences. |
| enable_word_timestamps | No | Word-level timing when supported. |
| enable_diarization | No | Speaker labels when supported. |
| enable_pii_redaction | No | Mask emails, phones, Aadhaar-like patterns when policy allows. |
| callback_url | No | Async completion for long media. |
| idempotency_key | No | Safe retries for paid calls. |
| metadata | No | Echoed in logs and callbacks. |
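A sketch of a synchronous call using Python's third-party `requests` library (the host is a placeholder; substitute your deployment's base URL and a real audio file):

```python
import json
import uuid

import requests  # assumed HTTP client; any multipart-capable client works

API_URL = "https://api.example.com/speech-to-text/transcribe"  # placeholder host
TOKEN = "<token>"

with open("clip.wav", "rb") as f:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {TOKEN}"},
        files={"audio": ("clip.wav", f, "audio/wav")},
        data={
            "provider": "openai",
            "model": "whisper-1",
            "language": "en",
            "language_mode": "fixed",
            "accuracy": "medium",
            "enable_word_timestamps": "true",  # booleans travel as form strings
            "idempotency_key": str(uuid.uuid4()),  # safe retries for a paid call
            "metadata": json.dumps({"session_id": "abc123"}),
        },
        timeout=120,
    )

response.raise_for_status()
result = response.json()
print(result["text"], result.get("detected_language"))
```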
Response
{
"status": "success",
"request_id": "req_01h...",
"text": "Full transcript text.",
"detected_language": "en-IN",
"confidence": 0.93,
"words": [{ "word": "Hello", "start": 0.12, "end": 0.34 }],
"provider": "openai",
"model": "whisper-1",
"accuracy": "medium"
}
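When `enable_word_timestamps` is set, the `words` array can drive simple caption windows. A sketch (the response shape is as above; the 0.6 s gap threshold is an arbitrary choice):

```python
def words_to_captions(words, max_gap=0.6):
    """Group word timings into caption lines, splitting on silences > max_gap seconds."""
    lines, current = [], []
    for w in words:
        if current and w["start"] - current[-1]["end"] > max_gap:
            lines.append(current)
            current = []
        current.append(w)
    if current:
        lines.append(current)
    return [
        (line[0]["start"], line[-1]["end"], " ".join(w["word"] for w in line))
        for line in lines
    ]

# Using the sample response above:
print(words_to_captions([
    {"word": "Sample", "start": 0, "end": 0.32},
    {"word": "transcript", "start": 0.35, "end": 0.88},
]))
# -> [(0, 0.88, 'Sample transcript')]
```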
Wrapper capabilities (recommended)
- Unified authentication and per-tenant rate limits.
- Format normalization (WAV, MP3, M4A, FLAC) and sample-rate hints.
- Streaming WebSocket or SSE channel for `low`-latency paths (provider-dependent).
- Fallback chain: If the primary `provider` errors or rejects the locale, try the `fallback_chain` from the policy.
- Cost and latency hints in the response `usage` object (seconds billed, `realtime_factor`).
- Data retention TTL per request (`retention_hours`).
- Webhook signature (HMAC) on callbacks (verification sketch after this list).
- Audit: Store hashes only when PII minimization mode is on.
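The HMAC callback signature can be verified along these lines; the SHA-256 hex-digest scheme and how the signature is delivered are assumptions, so check your deployment's webhook documentation:

```python
import hashlib
import hmac

def verify_callback(secret: bytes, raw_body: bytes, signature: str) -> bool:
    """Recompute HMAC-SHA256 over the raw request body and compare in constant time.

    Assumption for illustration: the signature arrives as a hex digest
    (for example in a request header) computed over the exact raw bytes.
    """
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```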
Implementation
- Obtain token: Use the same Bearer pattern as other KwikID APIs.
- Choose path: Synchronous upload for short clips; `callback_url` for long interviews.
- Set language: Start with `language: "en"`; expand to target locales; switch to `language_mode: auto` only when `accuracy` and policy allow.
- Map accuracy: Align `high` with review workflows; use `low` for live agent assist.
- Test matrix: For each `provider`, run gold audio in `en`, `en-IN`, and your top regional languages (see the sweep sketch below).
- Monitor: Track WER proxies, p95 latency, and empty transcript rate by `provider`.
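The test-matrix and monitoring steps can be a simple sweep; a sketch where the gold-audio paths and the `transcribe` helper are hypothetical stand-ins:

```python
# Hypothetical gold set: one reference clip per locale you care about.
GOLD = {
    "en": "gold/en.wav",
    "en-IN": "gold/en_in.wav",
    "hi-IN": "gold/hi_in.wav",
}
PROVIDERS = ["openai", "sarvam", "deepgram"]

def sweep(transcribe):
    """Run every provider against every gold clip.

    `transcribe(path, provider, language)` is a stand-in for a helper that
    issues the multipart POST shown earlier and returns the transcript text.
    """
    for provider in PROVIDERS:
        for lang, path in GOLD.items():
            text = transcribe(path, provider=provider, language=lang)
            # Feed these into WER proxies and empty-transcript-rate dashboards.
            print(provider, lang, "EMPTY" if not text.strip() else f"{len(text)} chars")
```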
Error handling
| HTTP status | When |
|---|---|
| 400 | Unsupported provider/model, bad audio, or conflicting language_mode. |
| 401 | Invalid or missing token. |
| 413 | Payload over max duration/size. |
| 429 | Quota / concurrency exceeded. |
| 502 | Upstream STT failure after retries. |
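A retry policy can key off this table. Below is a sketch that retries only quota and upstream failures (429, 502), reusing the same `idempotency_key` so repeated attempts are not double-billed; `send_request` is a stand-in for the call shown earlier:

```python
import time

RETRYABLE_STATUSES = {429, 502}

def transcribe_with_retry(send_request, max_attempts=4):
    """Call send_request() with exponential backoff on retryable statuses.

    `send_request` is a zero-argument stand-in that issues the multipart POST
    with a fixed idempotency_key and returns an HTTP response object.
    """
    response = None
    for attempt in range(max_attempts):
        response = send_request()
        if response.status_code not in RETRYABLE_STATUSES:
            return response  # success or a non-retryable 4xx
        time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
    return response
```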
Benefits
- Single integration for many STT backends.
- Predictable cost/quality via `accuracy`.
- Locale agility from English-first to broad Indian and global languages.
- Future-proofing as new `model_id` values are added to the registry.
Next steps
- Machine Learning API index
- API Suite index
- Facematch (when pairing voice with face journeys)