
Agent Cloud Inference

AINative Agent Cloud provides a unified inference API across a tiered GPU stack — from free serverless models to single-tenant H100/MI300X deployments. One API key, one endpoint, every model.

Base URL: https://api.ainative.studio


Inference Stack

Requests route automatically based on the model you select:

| Tier | Provider | Hardware | Notes |
| --- | --- | --- | --- |
| Serverless | NVIDIA NIM | Shared GPU | Primary; 141+ models, OpenAI-compatible |
| Serverless fallback | HuggingFace | Shared GPU | Auto-fallback on NIM errors |
| Ultra-fast | Cerebras | Wafer-scale | gpt-oss-120b, zai-glm-4.7; 2,000+ tok/s |
| Dedicated HF | HuggingFace Endpoints | T4 / L4 / A100 | Per-user dedicated GPU (Tier 2) |
| Frontier | Meta API | | Llama 3.3/4; free, no per-token cost |
| Frontier | Anthropic | | Claude Haiku / Sonnet / Opus |
| Dedicated GPU | DigitalOcean | H100 / MI300X / MI325X | Single-tenant reserved capacity |

Chat Completions

Standard

POST /api/v1/public/chat/completions

OpenAI-compatible. Routes automatically across the stack based on your chosen model.

curl -X POST https://api.ainative.studio/api/v1/public/chat/completions \
  -H "Authorization: Bearer $AINATIVE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "Explain transformers in one paragraph"}],
    "max_tokens": 512,
    "stream": false
  }'

Also available in Anthropic format at POST /v1/messages — see Chat Completions API.
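
Because the endpoint is described as OpenAI-compatible, a response can presumably be read the same way as an OpenAI chat completion. The sketch below builds the request body from the curl example and extracts the reply text, assuming the `choices[0].message.content` response shape. The helper names `build_chat_payload` and `extract_text` are illustrative and not part of the AINative API, and the sample response is hand-written rather than real output.

```python
from typing import Any


def build_chat_payload(model: str, prompt: str, max_tokens: int = 512) -> dict[str, Any]:
    """Build the JSON body used in the curl example above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }


def extract_text(response_json: dict[str, Any]) -> str:
    """Return the assistant text, assuming the OpenAI response shape."""
    choices = response_json.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hand-written sample in the assumed shape, not real API output.
sample = {"choices": [{"message": {"role": "assistant", "content": "Hi there"}}]}
print(extract_text(sample))
```

The payload can be passed as `json=` to `requests.post`, and `extract_text` applied to `response.json()`.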

Dedicated Deployment

Route a request to your provisioned single-tenant GPU:

POST /api/v1/public/deployments/{deployment_id}/chat/completions

Same request schema as standard chat. Traffic is fully isolated to your reserved hardware.

import os

import requests

API_KEY = os.environ["AINATIVE_API_KEY"]
DEPLOYMENT_ID = "dep_abc123"  # ID returned when the deployment was created

response = requests.post(
    f"https://api.ainative.studio/api/v1/public/deployments/{DEPLOYMENT_ID}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "llama-3.3-70b-instruct",
        "messages": [{"role": "user", "content": "Hello"}],
    },
)

Org-Scoped Inference

Enterprise teams can isolate inference per organization, with RBAC enforced on every request:

POST /api/v1/public/orgs/{org_id}/deployments/{deployment_id}/chat/completions

Requires the caller to be a member of org_id with deployment:invoke permission on the target deployment.
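
The three chat routes differ only in their path prefix, so a client can select one from the IDs it has. A minimal sketch, assuming the paths listed above; `chat_url` is a hypothetical helper and `org_abc` is a made-up organization ID.

```python
from typing import Optional

BASE_URL = "https://api.ainative.studio"


def chat_url(deployment_id: Optional[str] = None, org_id: Optional[str] = None) -> str:
    """Return the chat completions URL for the standard, dedicated, or org-scoped route."""
    if org_id and not deployment_id:
        # The org-scoped route shown above always targets a specific deployment.
        raise ValueError("org-scoped calls need a deployment_id")
    path = "/api/v1/public"
    if org_id:
        path += f"/orgs/{org_id}"
    if deployment_id:
        path += f"/deployments/{deployment_id}"
    return f"{BASE_URL}{path}/chat/completions"


print(chat_url())
print(chat_url("dep_abc123"))
print(chat_url("dep_abc123", org_id="org_abc"))
```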


Dedicated GPU Deployments

Provision isolated GPU capacity on DigitalOcean H100/MI300X/MI325X hardware. Once a deployment is active, its reserved capacity stays warm, so requests see no cold start unless scale-to-zero is enabled (see Auto-Scaling).

Create a Deployment

POST /api/v1/public/deployments

deployment = requests.post(
    "https://api.ainative.studio/api/v1/public/deployments",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "name": "prod-inference",
        "model_id": "llama-3.3-70b-instruct",
        "gpu_type": "h100",
        "region": "nyc2",
        "replicas": 1,
    },
).json()

print(deployment["id"])      # dep_abc123
print(deployment["status"])  # provisioning → active

GPU types:

| gpu_type | Hardware | VRAM | Best for |
| --- | --- | --- | --- |
| h100 | NVIDIA H100 | 80 GB | Speed-critical, 70B+ models |
| mi300x | AMD MI300X | 192 GB | Cost-efficient, large context |
| mi300x_8 | 8× AMD MI300X | 1,536 GB | Very large models |
| mi325x | AMD MI325X | 256 GB | Balanced |
| mi325x_8 | 8× AMD MI325X | 2,048 GB | Maximum capacity |

Regions: atl1 (Atlanta), nyc2 (New York), tor1 (Toronto), ric1 (Richmond)

Deployment Lifecycle

provisioning → active → scaling → teardown → inactive

# Poll until active
import time

dep_id = deployment["id"]  # from the create call above

while True:
    status = requests.get(
        f"https://api.ainative.studio/api/v1/public/deployments/{dep_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    ).json()["status"]
    if status == "active":
        break
    time.sleep(5)
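
An unbounded loop can hang if a deployment never becomes active. The sketch below is one way to bound the wait; it takes the status lookup as a callable so it can be exercised without network access. `wait_until_active` is a hypothetical helper, and treating `teardown` and `inactive` as failure states while waiting is an assumption drawn from the lifecycle order above, not documented behavior.

```python
import time
from typing import Callable


def wait_until_active(
    get_status: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns "active" or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "active":
            return status
        # Assumption: these states mean the deployment will not become active.
        if status in ("teardown", "inactive"):
            raise RuntimeError(f"deployment entered {status!r} before becoming active")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"deployment still {status!r} after {timeout_s}s")
        sleep(interval_s)


# Simulated status sequence standing in for the GET call.
fake_statuses = iter(["provisioning", "provisioning", "active"])
print(wait_until_active(lambda: next(fake_statuses), sleep=lambda _s: None))
```

In real use, `get_status` would wrap the `requests.get(...).json()["status"]` call from the polling loop.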

List Deployments

GET /api/v1/public/deployments

Teardown

DELETE /api/v1/public/deployments/{deployment_id}

Destroys the underlying DigitalOcean GPU instance. Billing stops at teardown.


Inference Presets

Apply hardware + configuration presets at deployment creation:

{
  "name": "speed-optimized",
  "model_id": "llama-3.3-70b-instruct",
  "gpu_type": "h100",
  "region": "nyc2",
  "preset": "speed"
}

| Preset | GPU | Config | Use case |
| --- | --- | --- | --- |
| speed | H100 | fp16, max batch | Lowest latency |
| latency | H100 | speculative decoding, min batch | Interactive / streaming |
| cost | MI300X | int8, scale-to-zero | Batch workloads |
| balanced | MI300X | fp16, auto batch | General purpose |

Auto-Scaling

Configure replicas to scale automatically based on queue depth and latency:

PATCH /api/v1/public/deployments/{deployment_id}/scaling

{
  "min_replicas": 1,
  "max_replicas": 4,
  "scale_to_zero": true,
  "idle_timeout_seconds": 300
}

The autoscaler polls every 30 seconds. With scale-to-zero enabled, it destroys the DigitalOcean instance after the idle timeout, and the next request triggers a cold start.


Custom SLA

Set TTFT, throughput, and concurrency targets. Breaches fire webhook events automatically:

PATCH /api/v1/public/deployments/{deployment_id}/sla

{
  "ttft_target_ms": 500,
  "throughput_target_tps": 100,
  "concurrency_limit": 50
}

GET /api/v1/public/deployments/{deployment_id}/sla

Returns current targets and breach history.
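
To illustrate how the three targets relate to observed metrics, the sketch below compares a metrics snapshot against the SLA body shown above. It is a client-side illustration only: the platform's actual breach rules are not documented here, and the observed field names (`ttft_ms`, `throughput_tps`, `active_concurrency`) are hypothetical.

```python
def sla_breaches(targets: dict, observed: dict) -> list[str]:
    """Return the names of targets that the observed snapshot violates."""
    breaches = []
    # Time to first token should stay at or below the target.
    if observed["ttft_ms"] > targets["ttft_target_ms"]:
        breaches.append("ttft")
    # Throughput should stay at or above the target.
    if observed["throughput_tps"] < targets["throughput_target_tps"]:
        breaches.append("throughput")
    # Concurrency should stay at or below the limit.
    if observed["active_concurrency"] > targets["concurrency_limit"]:
        breaches.append("concurrency")
    return breaches


targets = {"ttft_target_ms": 500, "throughput_target_tps": 100, "concurrency_limit": 50}
snapshot = {"ttft_ms": 620, "throughput_tps": 140, "active_concurrency": 12}  # made-up values
print(sla_breaches(targets, snapshot))
```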


Deployment Health Webhooks

Subscribe to push events for state changes and SLA breaches:

POST /api/v1/public/deployments/{deployment_id}/webhooks

{
  "url": "https://your-service.com/hooks/ainative",
  "events": [
    "deployment.active",
    "deployment.degraded",
    "deployment.sla_breach",
    "deployment.recovered",
    "deployment.scaled"
  ]
}

Payloads are HMAC-SHA256 signed. Verify with the X-AINative-Signature header:

import hashlib
import hmac

def verify(payload: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", signature)
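
A quick way to test a webhook handler locally is to sign a payload yourself and confirm that verification accepts it and rejects a tampered body. The sketch below does that round trip using the same `sha256=<hex>` format as `verify`; the `sign` helper, the secret value, and the payload fields are illustrative assumptions.

```python
import hashlib
import hmac


def sign(payload: bytes, secret: str) -> str:
    """Compute a signature in the sha256=<hex digest> format."""
    digest = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify(payload: bytes, signature: str, secret: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    return hmac.compare_digest(sign(payload, secret), signature)


secret = "example-secret"  # hypothetical webhook secret
body = b'{"event": "deployment.active", "deployment_id": "dep_abc123"}'  # made-up payload
header = sign(body, secret)

print(verify(body, header, secret))         # True
print(verify(body + b" ", header, secret))  # False: the body was altered
```

Verify against the raw request bytes rather than a re-serialized JSON object, since any change in whitespace or key order changes the digest.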

RBAC — Per-Deployment Access Control

Grant specific API keys access to a deployment:

POST /api/v1/public/deployments/{deployment_id}/permissions

{
  "api_key_id": "key_abc123",
  "role": "inference"
}

Revoke:

DELETE /api/v1/public/deployments/{deployment_id}/permissions/{api_key_id}

Scoped keys can only call deployments they've been granted access to. Useful for multi-tenant products where each customer gets their own scoped key.


Bring Your Own Model (BYOM)

Upload custom GGUF fine-tuned weights and serve them on dedicated infrastructure:

POST /api/v1/public/models/upload

curl -X POST https://api.ainative.studio/api/v1/public/models/upload \
  -H "Authorization: Bearer $AINATIVE_API_KEY" \
  -F "file=@my-finetune.gguf" \
  -F "name=my-model-v1" \
  -F "deployment_id=dep_abc123"

Status transitions: pending_upload → uploading → ready

Once ready, call via standard chat completions:

{ "model": "byom:my-model-v1", "messages": [...] }

Embeddings

POST /api/v1/public/embeddings/generate

Backed by DigitalOcean inference (inference.do-ai.run/v1/embeddings).

result = requests.post(
    "https://api.ainative.studio/api/v1/public/embeddings/generate",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": "The quick brown fox", "model": "bge-m3"},
).json()

vector = result["data"][0]["embedding"]  # 1024-dim

| Model | Dimensions | Tier | Notes |
| --- | --- | --- | --- |
| bge-m3 | 1024 | basic+ | Default; best retrieval quality |
| all-minilm-l6-v2 | 384 | free | Fast, low cost |
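
Embedding vectors are typically compared with cosine similarity. The sketch below is a dependency-free version; the short vectors are toy values standing in for real 1024-dimensional or 384-dimensional embeddings, and `cosine_similarity` is an illustrative helper rather than part of the API.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, not real model output.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Vectors from different models cannot be compared with each other, since bge-m3 and all-minilm-l6-v2 produce different dimensions.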

Rerank

Cross-encoder reranking for RAG pipelines — scores documents against a query:

POST /api/v1/public/rerank

result = requests.post(
    "https://api.ainative.studio/api/v1/public/rerank",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "query": "What is quantum computing?",
        "documents": [
            "Quantum computers use qubits to process information.",
            "Classical computers use binary bits.",
            "Quantum entanglement enables faster computation.",
        ],
        "top_n": 3,
    },
).json()

for r in result["results"]:
    print(r["relevance_score"], r["document"][:60])
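
In a RAG pipeline the reranked results are usually filtered before being placed in the prompt. The sketch below keeps documents above a score cutoff, ordered best first. It assumes the `results`, `relevance_score`, and `document` fields used in the loop above; the cutoff value, the `top_documents` helper, and the sample scores are made up for illustration.

```python
def top_documents(result: dict, min_score: float = 0.5) -> list[str]:
    """Return documents scoring at least min_score, highest score first."""
    ranked = sorted(result["results"], key=lambda r: r["relevance_score"], reverse=True)
    return [r["document"] for r in ranked if r["relevance_score"] >= min_score]


# Hand-written sample in the assumed response shape, not real API output.
sample_result = {
    "results": [
        {"relevance_score": 0.31, "document": "Classical computers use binary bits."},
        {"relevance_score": 0.92, "document": "Quantum computers use qubits to process information."},
        {"relevance_score": 0.74, "document": "Quantum entanglement enables faster computation."},
    ]
}
for doc in top_documents(sample_result):
    print(doc)
```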

Audio

Transcription

POST /api/v1/public/audio/transcriptions

curl -X POST https://api.ainative.studio/api/v1/public/audio/transcriptions \
  -H "Authorization: Bearer $AINATIVE_API_KEY" \
  -F "file=@recording.mp3" \
  -F "model=whisper-large-v3-turbo"

| Model | Tier | Notes |
| --- | --- | --- |
| whisper-small | basic+ | Fast, lower accuracy |
| whisper-large-v3-turbo | basic+ | Best accuracy |

Text-to-Speech

POST /api/v1/public/audio/tts

{
  "input": "Hello, I am your AI assistant.",
  "model": "minimax-speech-02-hd",
  "voice": "en_female_1"
}

Returns binary MP3 audio. Powered by the MiniMax Speech API.


Inference Observability

Usage Metrics

GET /api/v1/public/inference/metrics

Returns per-model token usage, cost, latency percentiles, and error rates.

Deployment Metrics

GET /api/v1/public/deployments/{deployment_id}/metrics

Real-time TTFT, throughput (tok/s), active concurrency, and SLA breach history for a dedicated deployment.


Model Catalog

GET /api/v1/public/models

Full catalog with capabilities, pricing, context window, and tier requirements.

GET /api/v1/public/models/available

Only models your current plan can access.

Plan access:

| Plan | Models included |
| --- | --- |
| free | Llama 3.3 8B/70B, Llama 4 Maverick, Llama 4 Scout, all-minilm-l6-v2 |
| basic | + Whisper, Gemma, Qwen, BGE embeddings, Claude Haiku |
| professional | + Coding models, DeepSeek, image gen, audio TTS |
| enterprise | + Claude Sonnet/Opus, DeepSeek R1, video gen, SD 3.5 |

Rate Limits

| Plan | Requests/min | Tokens/min |
| --- | --- | --- |
| Free | 10 | 50,000 |
| Basic | 60 | 200,000 |
| Professional | 300 | 1,000,000 |
| Enterprise | Custom | Custom |

Rate limit headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset
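
A client can read these headers to decide how long to pause before retrying. The sketch below is one possible approach and rests on an assumption the page does not confirm: that `X-RateLimit-Reset` carries a Unix timestamp in seconds. If it instead carries a number of seconds remaining, the subtraction should be dropped. `seconds_until_reset` is a hypothetical helper.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to wait before the next request, or 0.0 if quota remains."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    current = time.time() if now is None else now
    # Assumption: the reset header is a Unix timestamp in seconds.
    reset_at = float(headers["X-RateLimit-Reset"])
    return max(0.0, reset_at - current)


# Made-up header values for illustration.
exhausted = {"X-RateLimit-Limit": "10", "X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1030"}
print(seconds_until_reset(exhausted, now=1000.0))
```

With `requests`, `response.headers` can be passed directly, since its header lookup is case-insensitive; a plain dict needs the exact header names shown.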

