Audio API

Transcribe audio files, generate speech, and create music. All endpoints consume credits and require authentication.

Base URL: https://api.ainative.studio/api/v1/audio


Transcription (Whisper)

Transcribe audio files to text using OpenAI Whisper.

POST /api/v1/audio/transcriptions 🔒

Cost: ~6 credits/minute of audio | Max file size: 25 MB

curl -X POST https://api.ainative.studio/api/v1/audio/transcriptions \
  -H "Authorization: Bearer $TOKEN" \
  -F file=@recording.mp3 \
  -F model=whisper-1 \
  -F response_format=json

import os

import requests

# Assumes the API token is exported as TOKEN, as in the curl example.
token = os.environ["TOKEN"]

with open("recording.mp3", "rb") as f:
    response = requests.post(
        "https://api.ainative.studio/api/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {token}"},
        files={"file": f},
        data={"model": "whisper-1", "response_format": "json"},
    )

# Fail loudly on HTTP errors before reading the body.
response.raise_for_status()
print(response.json()["text"])

Parameters (form data):

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| file | file | Yes | | Audio file (mp3, mp4, wav, webm, m4a, mpeg, mpga) |
| model | string | No | whisper-1 | Whisper model |
| language | string | No | auto-detect | ISO-639-1 language code |
| prompt | string | No | | Hint text to guide transcription |
| response_format | string | No | text | text, json, srt, or vtt |
| temperature | float | No | | Sampling temperature (0.0-1.0) |
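
Because uploads are limited to 25 MB and a fixed set of formats, a client can reject unsuitable files before spending a request. The following is a minimal sketch, not part of the API: `check_upload` and `estimate_credits` are hypothetical helper names, "25 MB" is assumed to mean 25 × 1024 × 1024 bytes (the page does not say whether the limit is binary or decimal), the format is judged by file extension only, and the cost uses the approximate ~6 credits/minute figure, so actual billing may differ.

```python
from pathlib import Path

# Formats listed in the parameter table.
ALLOWED_EXTENSIONS = {".mp3", ".mp4", ".wav", ".webm", ".m4a", ".mpeg", ".mpga"}
# Assumption: "25 MB" is read as 25 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024
# Approximate rate quoted in the docs; actual billing may differ.
APPROX_CREDITS_PER_MINUTE = 6


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_UPLOAD_BYTES}")
    return problems


def estimate_credits(duration_seconds: float) -> float:
    """Rough cost estimate from audio length, using the ~6 credits/minute figure."""
    return duration_seconds / 60 * APPROX_CREDITS_PER_MINUTE
```

Calling `check_upload("recording.mp3", os.path.getsize("recording.mp3"))` before the POST gives a cheap local check; the server remains the authority on what it accepts.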

Response:

| Field | Type | Description |
| --- | --- | --- |
| text | string | Transcribed text |
| cost_credits | int | Credits charged |
| duration_seconds | float | Audio duration |
| language | string | Detected language |
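
The response fields can be mapped onto a small typed structure so downstream code does not index raw dictionaries. This is a sketch under assumptions: it supposes that `response_format=json` returns all four fields at the top level of the JSON body, `TranscriptionResult` is a hypothetical name, and the sample payload is invented for illustration rather than captured from the API.

```python
from dataclasses import dataclass


@dataclass
class TranscriptionResult:
    text: str
    cost_credits: int
    duration_seconds: float
    language: str

    @classmethod
    def from_json(cls, body: dict) -> "TranscriptionResult":
        # Assumption: all four documented fields are present at the top level.
        return cls(
            text=body["text"],
            cost_credits=int(body["cost_credits"]),
            duration_seconds=float(body["duration_seconds"]),
            language=body["language"],
        )


# Illustrative payload only, not real API output.
sample = {
    "text": "hello world",
    "cost_credits": 1,
    "duration_seconds": 4.2,
    "language": "en",
}
result = TranscriptionResult.from_json(sample)
print(f"{result.language}: {result.text} ({result.cost_credits} credits)")
```

With a live call, `TranscriptionResult.from_json(response.json())` would take the place of the sample dictionary.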

Translation (Whisper)

Translate audio from any language to English.

POST /api/v1/audio/translations 🔒

Same parameters and cost as transcription. Input can be any supported language; output is always English.


Text-to-Speech

Generate speech audio from text.

POST /api/v1/audio/tts 🔒

curl -X POST https://api.ainative.studio/api/v1/audio/tts \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello from AINative Studio",
    "model_id": "facebook/mms-tts-eng"
  }'

Parameters:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| text | string | Yes | | Text to synthesize (1-5000 chars) |
| model_id | string | No | facebook/mms-tts-eng | TTS model ID |
| voice | string | No | | Voice/speaker parameter |
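
Since `text` is capped at 5000 characters, longer documents have to be split across several requests. One possible approach, sketched here with a hypothetical `chunk_text` helper, is to pack whole sentences into chunks and hard-split only when a single sentence exceeds the limit. The sentence-boundary rule (whitespace after `.`, `!`, or `?`) is a simplification chosen for illustration and is not something the API prescribes.

```python
import re

MAX_TTS_CHARS = 5000  # upper bound from the parameter table


def chunk_text(text: str, limit: int = MAX_TTS_CHARS) -> list[str]:
    """Split text into pieces of at most `limit` characters, preferring sentence breaks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            # The sentence still fits in the chunk being built.
            current = candidate
        else:
            # Close the current chunk and start a new one with this sentence.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as the `text` of its own request, and the resulting audio segments concatenated in order.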

Response:

| Field | Type | Description |
| --- | --- | --- |
| audio_base64 | string | Base64-encoded audio |
| duration_seconds | float | Audio length |
| cost_credits | int | Credits charged |
| format | string | Audio format (wav/mp3) |

Two TTS Endpoints

  • /api/v1/audio/tts: HuggingFace models (free tier, lower quality)
  • /api/v1/multimodal/tts: MiniMax TTS (14 credits, higher quality)

Choose based on your quality and budget requirements.
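
Whichever endpoint is used, the `/api/v1/audio/tts` response delivers audio as a base64 string, so it has to be decoded before it can be played or stored. A minimal sketch, assuming the `format` field is a bare extension such as `wav` or `mp3`; `save_audio` is a hypothetical helper, and the response shape of `/api/v1/multimodal/tts` is not documented on this page, so it may differ.

```python
import base64
from pathlib import Path


def save_audio(body: dict, stem: str = "output") -> Path:
    """Decode `audio_base64` from a response body and write it to `<stem>.<format>`."""
    audio_bytes = base64.b64decode(body["audio_base64"])
    # Assumption: `format` is a bare extension such as "wav" or "mp3".
    path = Path(f"{stem}.{body['format']}")
    path.write_bytes(audio_bytes)
    return path
```

With a live call this would be used as `save_audio(response.json(), stem="greeting")`, producing `greeting.wav` or `greeting.mp3` depending on the returned format.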


Music Generation

Generate music from text descriptions.

POST /api/v1/audio/music 🔒

curl -X POST https://api.ainative.studio/api/v1/audio/music \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Upbeat electronic music with synthesizer melody",
    "duration": 15
  }'

Parameters:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| prompt | string | Yes | | Music description (1-500 chars) |
| model_id | string | No | minimax-music-v2 | Music model |
| duration | int | No | 10 | Duration in seconds (5-60) |
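
The documented ranges can be checked client-side before a request is sent. The sketch below assumes both bounds are inclusive; `build_music_payload` is a hypothetical helper, and how the server itself reports out-of-range values is not stated on this page.

```python
def build_music_payload(
    prompt: str,
    duration: int = 10,
    model_id: str = "minimax-music-v2",
) -> dict:
    """Validate inputs against the documented limits and return the JSON body."""
    # Assumption: the 1-500 and 5-60 ranges are inclusive at both ends.
    if not 1 <= len(prompt) <= 500:
        raise ValueError("prompt must be 1-500 characters")
    if not 5 <= duration <= 60:
        raise ValueError("duration must be between 5 and 60 seconds")
    return {"prompt": prompt, "model_id": model_id, "duration": duration}
```

The returned dictionary can be passed as the `json=` argument of `requests.post` against the music endpoint.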

Response:

| Field | Type | Description |
| --- | --- | --- |
| audio_base64 | string | Base64-encoded audio |
| duration_seconds | float | Actual duration |
| cost_credits | int | Credits charged |
| format | string | Audio format |
| prompt | string | Prompt used |

For AI Agents

  • Use transcription to process voice memos, meeting recordings, or audio commands
  • Use TTS to generate voice responses for users or narrate content
  • Use music generation for background audio in video content
  • Audio files can be stored in ZeroDB file storage for later retrieval