
Speech-to-Text (Multimodal)

Unified multimodal speech-to-text wrapper with selectable providers, language control, accuracy tiers, and optional auto-language routing.

API reference

Authorization (header, required): Bearer <token>. JWT Bearer token authentication; obtain a token from the KwikID dashboard.

audio (string, required): Primary audio bytes or upload field.

provider (string, optional): Vendor slug. One of "openai" | "xai" | "elevenlabs" | "sarvam" | "google" | "aws" | "azure" | "deepgram" | "assemblyai" | "anthropic".

model (string, optional): Provider-specific model id (for example whisper-1, scribe_v1). Omit for the deployment default.

language (string, optional; default "en"): BCP-47 source language (for example en, en-IN, hi-IN). Used when language_mode is fixed.

language_mode (string, optional; default "fixed"): One of "fixed" | "auto". fixed uses language; auto requests detection (detected_language in response).

accuracy (string, optional; default "medium"): One of "low" | "medium" | "high". Quality / latency / cost trade-off tier.

accuracy_language_policy (string, optional): JSON string (stringified object) mapping tiers to language_mode, language, allowed_languages, provider_preference, fallback_chain. Example: {"high":{"language_mode":"fixed","language":"en-IN"},"low":{"language_mode":"auto"}}.

enable_word_timestamps (string, optional): Pass "true" / "false" as a form string.

enable_diarization (string, optional): "true" / "false".

enable_pii_redaction (string, optional): "true" / "false".

callback_url (string, optional): URI format.

idempotency_key (string, optional): Safe retries for paid calls.

metadata (string, optional): JSON string echoed in logs and callbacks.

unique_id (string, optional): Correlation id for logs.
Example request

curl -X POST "https://__mock__/speech-to-text/transcribe" \
  -F audio="string"

Response body (200, success)

{
  "status": "success",
  "request_id": "req_stt_01hexample",
  "text": "Sample transcript for mock responses in the Playground.",
  "detected_language": "en-IN",
  "confidence": 0.94,
  "words": [
    {
      "word": "Sample",
      "start": 0,
      "end": 0.32
    },
    {
      "word": "transcript",
      "start": 0.35,
      "end": 0.88
    }
  ],
  "provider": "openai",
  "model": "whisper-1",
  "accuracy": "medium",
  "usage": {
    "audio_seconds": 4.2,
    "billed_seconds": 5
  }
}
Error body (400 / 401 / 413 / 429)

{
  "detail": {},
  "message": "string"
}

Error body (502)

{
  "error": "string"
}

Overview

The Speech-to-Text (STT) API is a single integration surface for audio → text with provider routing and policy-based controls. You choose a provider / model, set the source language (default English), pick an accuracy tier (low, medium, high), and optionally let the service choose language and routing dynamically based on tier, latency budget, and the supported locales per provider.

[PLACEHOLDER FOR SCREENSHOT]

Key features

  • Audio-first contract: Accept audio as required input with a stable multipart schema.
  • Multi-provider routing: One contract; swap model without changing your client orchestration.
  • Explicit or auto language: Pin language (BCP-47, for example en, en-IN, hi-IN) or use language_mode: auto for detection.
  • Accuracy tiers: low, medium, high map to latency/cost/quality trade-offs and may constrain which providers run for a given locale.
  • Dynamic language policy: Optional accuracy_language_policy ties tier to allowed auto-detect candidates or fallback order.
  • Operational extras: Idempotency, callbacks for long jobs, timestamps, diarization hooks, and retention controls (see below).

The wrapper accepts a provider slug (or the composite string provider:model_id) for commonly used stacks. Exact model_id strings are versioned in your environment’s model registry; the table below lists representative families.

Provider | Typical STT / multimodal models (examples) | Notes
OpenAI / ChatGPT stack | Whisper (for example whisper-1), GPT-4o audio / Realtime where enabled | Strong en baseline; broad locale coverage via Whisper
xAI | Grok audio / multimodal SKUs where exposed for STT | Check the locale matrix per deployment
ElevenLabs | Scribe and related speech-to-text model IDs | Often tuned for creator / long-form audio
Sarvam | Sarvam STT models (Indian-language emphasis) | Use for hi, ta, te, kn, mr, bn, gu, en-IN, etc.
Google | Gemini multimodal + Cloud Speech-to-Text / Chirp | Good auto language in some tiers
Amazon Web Services | Amazon Transcribe (batch / streaming) | Enterprise PII / redaction features
Microsoft Azure | Azure AI Speech fast / accurate modes | high accuracy tier maps cleanly
Deepgram | Nova / Enhanced | Low-latency streaming
AssemblyAI | Universal + slam-1 and other SKUs | Rich post-processing (labels, topics)
Anthropic | Claude multimodal (audio-in) where available | Less traditional STT; useful for reasoning over transcripts

Benefit: You standardize on one request schema and one observability story while retaining access to best-in-class engines per locale and SLA.

Language model

  1. Default: language: "en" (English) when omitted.
  2. Explicit locale: Set language to a BCP-47 tag (for example en-IN, hi-IN).
  3. Auto detection: Set language_mode: "auto". The service may return detected_language in the response.
  4. Dynamic choice by accuracy: Set accuracy_language_policy so each accuracy tier defines allowed_languages, provider_preference, or fallback_chain. Example: high → fixed en-IN + OpenAI/Azure; low → auto + streaming Deepgram with broad allowed_languages (see the serialization sketch after this list).
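
Because multipart form fields are strings, the policy object must be serialized before it is sent. A minimal sketch in Python; the field names follow the contract above, and the tier contents are illustrative:

Python

import json

# Per-tier overrides: tier name -> language routing and provider preferences.
policy = {
    "high": {"language_mode": "fixed", "language": "en-IN",
             "provider_preference": ["azure", "openai"]},
    "low": {"language_mode": "auto",
            "allowed_languages": ["en", "hi", "ta"],
            "provider_preference": ["deepgram", "sarvam"]},
}

# Multipart form fields are strings, so the policy travels as stringified JSON.
form_fields = {
    "accuracy": "high",
    "accuracy_language_policy": json.dumps(policy),
}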

Accuracy tiers

Tier | Intended use | Typical behavior
high | Compliance-heavy, court-ready, medical | Slower, higher cost, stricter VAD, word-error-rate targets
medium | Product default | Balanced cost / latency / quality
low | Real-time hints, rough captions | Faster, cheaper; may widen auto language guesses

Benefit: Product teams expose one slider; backend maps to provider capabilities and quota.

API contract (reference)

Call POST /speech-to-text/transcribe with Authorization: Bearer <token> and multipart/form-data. The OpenAPI block above is the source of truth for field names and enums. accuracy_language_policy is sent as a JSON string in multipart (same pattern as other stringified JSON fields in ML APIs).

HTTP

POST /speech-to-text/transcribe
Content-Type: multipart/form-data
Authorization: Bearer <token>

Request fields (conceptual JSON; use multipart parts or string fields per OpenAPI)

{
  "audio": "<binary or multipart file field>",
  "provider": "openai",
  "model": "whisper-1",
  "language": "en",
  "language_mode": "fixed",
  "accuracy": "medium",
  "accuracy_language_policy": {
    "high": { "language_mode": "fixed", "language": "en-IN", "provider_preference": ["azure", "openai"] },
    "low": { "language_mode": "auto", "allowed_languages": ["en", "hi", "ta"], "provider_preference": ["deepgram", "sarvam"] }
  },
  "enable_word_timestamps": true,
  "enable_diarization": false,
  "enable_pii_redaction": false,
  "callback_url": "https://example.com/stt-callback",
  "idempotency_key": "uuid-v4",
  "metadata": { "session_id": "abc123" }
}
Field | Required | Description
audio | Yes | Primary audio bytes or upload field.
provider / model | One required | Select vendor and model.
language | No | BCP-47; default en.
language_mode | No | fixed (default) or auto.
accuracy | No | low, medium (default), high.
accuracy_language_policy | No | Per-tier overrides for language and provider preferences.
enable_word_timestamps | No | Word-level timing when supported.
enable_diarization | No | Speaker labels when supported.
enable_pii_redaction | No | Mask emails, phones, Aadhaar-like patterns when policy allows.
callback_url | No | Async completion for long media.
idempotency_key | No | Safe retries for paid calls.
metadata | No | Echoed in logs and callbacks.
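
A minimal synchronous client sketch, assuming the Python requests library and a short clip. The base URL mirrors the mock host above and the token is a placeholder; boolean flags travel as form strings per the contract:

Python

import json
import requests

BASE_URL = "https://__mock__"  # replace with your deployment host
TOKEN = "<token>"              # from the KwikID dashboard

def transcribe(audio_path: str, provider: str = "openai", model: str | None = None,
               language: str = "en", accuracy: str = "medium") -> dict:
    """POST one audio file to /speech-to-text/transcribe and return the JSON body."""
    data = {
        "provider": provider,
        "language": language,
        "language_mode": "fixed",
        "accuracy": accuracy,
        "enable_word_timestamps": "true",  # booleans travel as form strings
        "metadata": json.dumps({"session_id": "abc123"}),
    }
    if model:  # omit to use the deployment default
        data["model"] = model
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/speech-to-text/transcribe",
            headers={"Authorization": f"Bearer {TOKEN}"},
            files={"audio": f},
            data=data,
            timeout=120,
        )
    resp.raise_for_status()
    return resp.json()

result = transcribe("sample.wav", model="whisper-1")
print(result["text"], result.get("detected_language"))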

Response

{
  "status": "success",
  "request_id": "req_01h...",
  "text": "Full transcript text.",
  "detected_language": "en-IN",
  "confidence": 0.93,
  "words": [{ "word": "Hello", "start": 0.12, "end": 0.34 }],
  "provider": "openai",
  "model": "whisper-1",
  "accuracy": "medium"
}
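
When enable_word_timestamps is on, the words array can drive simple captioning. A sketch that groups word timings into rough caption lines; the three-second window is an arbitrary choice, not part of the contract:

Python

def words_to_captions(words: list[dict], window: float = 3.0) -> list[str]:
    """Group word timings into caption lines no longer than `window` seconds."""
    captions, line, line_start = [], [], None
    for w in words:
        if line_start is None:
            line_start = w["start"]
        if w["end"] - line_start > window and line:
            captions.append(" ".join(line))
            line, line_start = [], w["start"]
        line.append(w["word"])
    if line:
        captions.append(" ".join(line))
    return captions

# Using the shape of the response example above:
words = [{"word": "Sample", "start": 0, "end": 0.32},
         {"word": "transcript", "start": 0.35, "end": 0.88}]
print(words_to_captions(words))
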
Operational extras

  • Unified authentication and per-tenant rate limits.
  • Format normalization (WAV, MP3, M4A, FLAC) and sample-rate hints.
  • Streaming WebSocket or SSE channel for low latency paths (provider-dependent).
  • Fallback chain: If primary provider errors or rejects locale, try fallback_chain from policy.
  • Cost and latency hints in response usage object (seconds billed, realtime_factor).
  • Data retention TTL per request (retention_hours).
  • Webhook signature (HMAC) on callbacks; see the verification sketch after this list.
  • Audit: Store hashes only when PII minimization mode is on.
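
Callback receivers should verify the HMAC signature before trusting a payload. A minimal verification sketch; the signature header name and the hex-digest scheme are assumptions to confirm against your deployment's webhook settings:

Python

import hashlib
import hmac

def verify_callback(raw_body: bytes, signature_header: str, secret: str) -> bool:
    """Recompute HMAC-SHA256 over the raw request body and compare in constant time."""
    # SHA-256 hex digest is an assumed scheme; check your deployment's settings.
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)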

Implementation

  1. Obtain token: Use the same Bearer pattern as other KwikID APIs.
  2. Choose path: Synchronous upload for short clips; callback_url for long interviews.
  3. Set language: Start with language: "en"; expand to target locales; switch to language_mode: auto only when accuracy and policy allow.
  4. Map accuracy: Align high with review workflows; use low for live agent assist.
  5. Test matrix: For each provider, run gold audio in en, en-IN, and your top regional languages (see the sketch after this list).
  6. Monitor: Track WER proxies, p95 latency, and empty transcript rate by provider.
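
A test-matrix sketch for step 5, reusing the transcribe helper from the client sketch above; the provider list and gold-audio paths are illustrative placeholders:

Python

# Gold audio per locale; the paths are placeholders.
GOLD_AUDIO = {"en": "gold/en.wav", "en-IN": "gold/en_in.wav", "hi-IN": "gold/hi_in.wav"}
PROVIDERS = ["openai", "sarvam", "deepgram"]

for provider in PROVIDERS:
    for language, path in GOLD_AUDIO.items():
        result = transcribe(path, provider=provider, language=language)
        # Empty transcripts are a cheap quality proxy worth tracking per provider.
        print(provider, language, len(result["text"]), result.get("confidence"))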

Error handling

HTTP status | When
400 | Unsupported provider/model, bad audio, or conflicting language_mode.
401 | Invalid or missing token.
413 | Payload over max duration/size.
429 | Quota / concurrency exceeded.
502 | Upstream STT failure after retries.
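
A client-side handling sketch for the retryable statuses (429 and 502), reusing BASE_URL and TOKEN from the client sketch above. Reusing one idempotency_key keeps retries of paid calls safe; the backoff constants are arbitrary:

Python

import time
import uuid
import requests

def transcribe_with_retries(audio_path: str, max_attempts: int = 3) -> dict:
    """Retry 429/502 with exponential backoff, reusing one idempotency_key."""
    idempotency_key = str(uuid.uuid4())  # same key across all attempts
    for attempt in range(max_attempts):
        with open(audio_path, "rb") as f:
            resp = requests.post(
                f"{BASE_URL}/speech-to-text/transcribe",
                headers={"Authorization": f"Bearer {TOKEN}"},
                files={"audio": f},
                data={"idempotency_key": idempotency_key},
                timeout=120,
            )
        if resp.status_code in (429, 502) and attempt < max_attempts - 1:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
            continue
        resp.raise_for_status()
        return resp.json()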

Benefits

  • Single integration for many STT backends.
  • Predictable cost/quality via accuracy.
  • Locale agility from English-first to broad Indian and global languages.
  • Future-proofing as new model_id values are added to the registry.

Next steps