Audio API
Transcribe audio files, generate speech, and create music. All endpoints consume credits and require authentication.
Base URL: https://api.ainative.studio/api/v1/audio
Transcription (Whisper)
Transcribe audio files to text using OpenAI Whisper.
POST /api/v1/audio/transcriptions 🔒
Cost: ~6 credits/minute of audio | Max file size: 25 MB
curl -X POST https://api.ainative.studio/api/v1/audio/transcriptions \
-H "Authorization: Bearer $TOKEN" \
-F file=@recording.mp3 \
-F model=whisper-1 \
-F response_format=json
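import os
import requests

token = os.environ["TOKEN"]  # assumes the same bearer token used by the curl example

with open("recording.mp3", "rb") as f:
    response = requests.post(
        "https://api.ainative.studio/api/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {token}"},
        files={"file": f},
        data={"model": "whisper-1", "response_format": "json"},
    )
response.raise_for_status()  # surface HTTP errors before parsing the body
print(response.json()["text"])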
Parameters (form data):
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| file | file | Yes | — | Audio file (mp3, mp4, wav, webm, m4a, mpeg, mpga) |
| model | string | No | whisper-1 | Whisper model |
| language | string | No | auto-detect | ISO-639-1 language code |
| prompt | string | No | — | Hint text to guide transcription |
| response_format | string | No | text | text, json, srt, or vtt |
| temperature | float | No | — | Sampling temperature (0.0-1.0) |
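Because `response_format` defaults to `text`, a client may receive either a JSON object or a plain body depending on what it asked for. The sketch below shows one way to normalize that. It assumes, without confirmation from this page, that `text`, `srt`, and `vtt` responses arrive as plain text and only `json` wraps the result in an object. The helper name `extract_transcript` is hypothetical.

```python
import json


def extract_transcript(body: str, response_format: str = "text") -> str:
    """Return the transcript from a raw response body.

    Assumption: only response_format="json" wraps the result in a JSON
    object with a "text" field; "text", "srt", and "vtt" bodies are
    treated as plain text and returned unchanged.
    """
    if response_format == "json":
        return json.loads(body)["text"]
    if response_format in ("text", "srt", "vtt"):
        return body
    raise ValueError(f"unsupported response_format: {response_format!r}")
```

With the `requests` example above, this would be called as `extract_transcript(response.text, "json")`.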
Response:
| Field | Type | Description |
|---|---|---|
| text | string | Transcribed text |
| cost_credits | int | Credits charged |
| duration_seconds | float | Audio duration |
| language | string | Detected language |
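The 25 MB limit and the per-minute rate can be checked on the client before uploading. The following is a minimal sketch, not part of the API: the helper names are hypothetical, the limit is assumed to mean 25 × 1024 × 1024 bytes, and the estimate assumes linear billing at roughly 6 credits per minute, rounded up. The `cost_credits` field in the response remains the authoritative figure.

```python
import math
from pathlib import Path

MAX_FILE_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" means 25 MiB
CREDITS_PER_MINUTE = 6             # approximate rate quoted for this endpoint
ALLOWED_EXTENSIONS = {".mp3", ".mp4", ".wav", ".webm", ".m4a", ".mpeg", ".mpga"}


def check_upload(path: str) -> None:
    """Raise ValueError if the file is clearly not acceptable for upload."""
    p = Path(path)
    if p.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {p.suffix or '(none)'}")
    if p.stat().st_size > MAX_FILE_BYTES:
        raise ValueError("file exceeds the 25 MB upload limit")


def estimate_credits(duration_seconds: float) -> int:
    """Rough cost estimate; assumes linear billing rounded up to a whole credit."""
    # Multiply before dividing to avoid floating-point drift on whole seconds.
    return math.ceil(duration_seconds * CREDITS_PER_MINUTE / 60)
```

Comparing `estimate_credits(duration_seconds)` with the returned `cost_credits` is one way to see how the service actually rounds.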
Translation (Whisper)
Translate audio from any language to English.
POST /api/v1/audio/translations 🔒
Same parameters and cost as transcription. Input can be any supported language; output is always English.
Text-to-Speech
Generate speech audio from text.
POST /api/v1/audio/tts 🔒
curl -X POST https://api.ainative.studio/api/v1/audio/tts \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"text": "Hello from AINative Studio",
"model_id": "facebook/mms-tts-eng"
}'
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | Yes | — | Text to synthesize (1-5000 chars) |
| model_id | string | No | facebook/mms-tts-eng | TTS model ID |
| voice | string | No | — | Voice/speaker parameter |
Response:
| Field | Type | Description |
|---|---|---|
| audio_base64 | string | Base64-encoded audio |
| duration_seconds | float | Audio length |
| cost_credits | int | Credits charged |
| format | string | Audio format (wav/mp3) |
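Since the audio comes back as a base64 string rather than a binary download, the client has to decode it before playback. Here is a minimal Python sketch of the request and the decode step, using only the standard library. The function names are hypothetical, and it assumes that `format` holds a bare extension such as `wav` or `mp3`, as the table suggests.

```python
import base64
import json
import urllib.request
from pathlib import Path

TTS_URL = "https://api.ainative.studio/api/v1/audio/tts"


def build_tts_request(text: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the TTS endpoint."""
    if not 1 <= len(text) <= 5000:
        raise ValueError("text must be 1-5000 characters")
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        TTS_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def save_audio(payload: dict, stem: str) -> Path:
    """Decode audio_base64 and write it to <stem>.<format>."""
    # Assumption: payload["format"] is a bare extension such as "wav" or "mp3".
    out = Path(f"{stem}.{payload['format']}")
    out.write_bytes(base64.b64decode(payload["audio_base64"]))
    return out


def synthesize(text: str, token: str, stem: str = "speech") -> Path:
    """Send the request and save the returned audio (performs network I/O)."""
    with urllib.request.urlopen(build_tts_request(text, token)) as resp:
        return save_audio(json.load(resp), stem)
```

Calling `synthesize("Hello from AINative Studio", token)` would write `speech.wav` or `speech.mp3`, depending on the `format` the service reports.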
Two TTS Endpoints
- /api/v1/audio/tts: HuggingFace models (free tier, lower quality)
- /api/v1/multimodal/tts: MiniMax TTS (14 credits, higher quality)
Choose based on your quality and budget requirements.
Music Generation
Generate music from text descriptions.
POST /api/v1/audio/music 🔒
curl -X POST https://api.ainative.studio/api/v1/audio/music \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Upbeat electronic music with synthesizer melody",
"duration": 15
}'
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| prompt | string | Yes | — | Music description (1-500 chars) |
| model_id | string | No | minimax-music-v2 | Music model |
| duration | int | No | 10 | Duration in seconds (5-60) |
Response:
| Field | Type | Description |
|---|---|---|
| audio_base64 | string | Base64-encoded audio |
| duration_seconds | float | Actual duration |
| cost_credits | int | Credits charged |
| format | string | Audio format |
| prompt | string | Prompt used |
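The documented ranges for `prompt` and `duration` can be enforced on the client so that an out-of-range request is never sent. The sketch below is a Python equivalent of the curl call with that validation added. The function names are hypothetical, and it assumes the endpoint returns a JSON object with the fields listed above.

```python
import json
import urllib.request

MUSIC_URL = "https://api.ainative.studio/api/v1/audio/music"


def build_music_payload(prompt: str, duration: int = 10) -> dict:
    """Validate inputs against the documented ranges and return the JSON body."""
    if not 1 <= len(prompt) <= 500:
        raise ValueError("prompt must be 1-500 characters")
    if not 5 <= duration <= 60:
        raise ValueError("duration must be between 5 and 60 seconds")
    return {"prompt": prompt, "duration": duration}


def generate_music(prompt: str, token: str, duration: int = 10) -> dict:
    """POST the request and return the parsed response (performs network I/O)."""
    request = urllib.request.Request(
        MUSIC_URL,
        data=json.dumps(build_music_payload(prompt, duration)).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Because `duration_seconds` is described as the actual duration, it may be worth comparing it with the requested `duration` before using the clip.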
For AI Agents
- Use transcription to process voice memos, meeting recordings, or audio commands
- Use TTS to generate voice responses for users or narrate content
- Use music generation for background audio in video content
- Audio files can be stored in ZeroDB file storage for later retrieval