Agent Cloud Inference
AINative Agent Cloud provides a unified inference API across a tiered GPU stack — from free serverless models to single-tenant H100/MI300X deployments. One API key, one endpoint, every model.
Base URL: https://api.ainative.studio
Inference Stack
Requests route automatically based on the model you select:
| Tier | Provider | Hardware | Notes |
|---|---|---|---|
| Serverless | NVIDIA NIM | Shared GPU | Primary — 141+ models, OpenAI-compatible |
| Serverless fallback | HuggingFace | Shared GPU | Auto-fallback on NIM errors |
| Ultra-fast | Cerebras | Wafer-scale | gpt-oss-120b, zai-glm-4.7 — 2,000+ tok/s |
| Dedicated HF | HuggingFace Endpoints | T4 / L4 / A100 | Per-user dedicated GPU (Tier 2) |
| Frontier | Meta API | — | Llama 3.3/4 — free, no per-token cost |
| Frontier | Anthropic | — | Claude Haiku / Sonnet / Opus |
| Dedicated GPU | DigitalOcean | H100 / MI300X / MI325X | Single-tenant reserved capacity |
Chat Completions
Standard
POST /api/v1/public/chat/completions
OpenAI-compatible. Routes automatically across the stack based on your chosen model.
curl -X POST https://api.ainative.studio/api/v1/public/chat/completions \
  -H "Authorization: Bearer $AINATIVE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "Explain transformers in one paragraph"}],
    "max_tokens": 512,
    "stream": false
  }'
Also available in Anthropic format at POST /v1/messages — see Chat Completions API.
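For the standard OpenAI-compatible endpoint, the same request can be prepared from Python. The following is a minimal sketch using only the standard library. The helper names `build_chat_request` and `extract_text` are illustrative, and the response shape (`choices[0].message.content`) is an assumption based on the endpoint being described as OpenAI-compatible.

```python
import json
import urllib.request

BASE_URL = "https://api.ainative.studio"


def build_chat_request(api_key: str, model: str, prompt: str,
                       max_tokens: int = 512) -> urllib.request.Request:
    """Prepare (but do not send) a standard chat completions request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/public/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_text(response: dict) -> str:
    """Pull the assistant text out of a response.

    Assumption: the body follows the OpenAI chat completions shape.
    """
    return response["choices"][0]["message"]["content"]
```

To send it, pass the prepared request to `urllib.request.urlopen`, decode the body with `json.load`, and hand the result to `extract_text`.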
Dedicated Deployment
Route a request to your provisioned single-tenant GPU:
POST /api/v1/public/deployments/{deployment_id}/chat/completions
Same request schema as standard chat. Traffic is fully isolated to your reserved hardware.
import requests

# DEPLOYMENT_ID and API_KEY are placeholders for your own values.
response = requests.post(
    f"https://api.ainative.studio/api/v1/public/deployments/{DEPLOYMENT_ID}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "llama-3.3-70b-instruct",
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=60,  # avoid hanging indefinitely on a stalled connection
)
Org-Scoped Inference
Enterprise teams can isolate inference by organization, with RBAC enforced:
POST /api/v1/public/orgs/{org_id}/deployments/{deployment_id}/chat/completions
Requires the caller to be a member of org_id with deployment:invoke permission on the target deployment.
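The three chat routes (standard, dedicated, and org-scoped) differ only in their path prefix. A small helper can make that explicit. This is a sketch: the function name `chat_completions_url` and the example IDs are hypothetical, while the paths are the ones listed above.

```python
from typing import Optional

BASE_URL = "https://api.ainative.studio/api/v1/public"


def chat_completions_url(deployment_id: Optional[str] = None,
                         org_id: Optional[str] = None) -> str:
    """Return the chat completions URL for the requested routing scope."""
    if org_id is not None and deployment_id is None:
        # The org-scoped route always targets a specific deployment.
        raise ValueError("org-scoped inference requires a deployment_id")
    if org_id is not None:
        return f"{BASE_URL}/orgs/{org_id}/deployments/{deployment_id}/chat/completions"
    if deployment_id is not None:
        return f"{BASE_URL}/deployments/{deployment_id}/chat/completions"
    return f"{BASE_URL}/chat/completions"
```

Because the request schema is the same on all three routes, switching scope only changes the URL, not the payload.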
Dedicated GPU Deployments
Provision isolated GPU capacity on DigitalOcean H100, MI300X, or MI325X hardware. Once a deployment is active, its reserved capacity serves requests without a cold start, unless scale-to-zero is enabled (see Auto-Scaling).
Create a Deployment
POST /api/v1/public/deployments
deployment = requests.post(
    "https://api.ainative.studio/api/v1/public/deployments",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "name": "prod-inference",
        "model_id": "llama-3.3-70b-instruct",
        "gpu_type": "h100",
        "region": "nyc2",
        "replicas": 1,
    },
).json()

print(deployment["id"])      # dep_abc123
print(deployment["status"])  # provisioning → active
GPU types:
| gpu_type | Hardware | VRAM | Best for |
|---|---|---|---|
| h100 | NVIDIA H100 | 80 GB | Speed-critical, 70B+ models |
| mi300x | AMD MI300X | 192 GB | Cost-efficient, large context |
| mi300x_8 | 8× AMD MI300X | 1,536 GB | Very large models |
| mi325x | AMD MI325X | 256 GB | Balanced |
| mi325x_8 | 8× AMD MI325X | 2,048 GB | Maximum capacity |
Regions: atl1 (Atlanta), nyc2 (New York), tor1 (Toronto), ric1 (Richmond)
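Since `gpu_type` and `region` accept a fixed set of values, it can help to validate a deployment request locally before sending it. The sketch below hard-codes the values from the table and region list above. The helper name `deployment_payload` is hypothetical, and whether the API rejects unknown values with a particular error is not covered here.

```python
# Values copied from the GPU type table and region list above.
GPU_VRAM_GB = {
    "h100": 80,
    "mi300x": 192,
    "mi300x_8": 1536,
    "mi325x": 256,
    "mi325x_8": 2048,
}
REGIONS = {"atl1", "nyc2", "tor1", "ric1"}


def deployment_payload(name: str, model_id: str, gpu_type: str,
                       region: str, replicas: int = 1) -> dict:
    """Build the JSON body for POST /api/v1/public/deployments."""
    if gpu_type not in GPU_VRAM_GB:
        raise ValueError(f"unknown gpu_type {gpu_type!r}; expected one of {sorted(GPU_VRAM_GB)}")
    if region not in REGIONS:
        raise ValueError(f"unknown region {region!r}; expected one of {sorted(REGIONS)}")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return {
        "name": name,
        "model_id": model_id,
        "gpu_type": gpu_type,
        "region": region,
        "replicas": replicas,
    }
```

The returned dictionary can be passed as the `json=` argument of the create call shown earlier.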
Deployment Lifecycle
provisioning → active → scaling → teardown → inactive
import time

dep_id = deployment["id"]  # from the create call above

# Poll every 5 seconds until the deployment reports "active"
while True:
    status = requests.get(
        f"https://api.ainative.studio/api/v1/public/deployments/{dep_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    ).json()["status"]
    if status == "active":
        break
    time.sleep(5)
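The loop above runs forever if the deployment never becomes active. A more defensive variant, sketched here, adds a deadline and stops early on states that cannot lead to `active`. The function name `wait_until_active` and the injectable `get_status` callable are illustrative. Treating `teardown` and `inactive` as terminal is an assumption drawn from the lifecycle order above.

```python
import time
from typing import Callable


def wait_until_active(get_status: Callable[[], str],
                      timeout_s: float = 600.0,
                      interval_s: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep) -> None:
    """Block until get_status() returns "active", or raise on failure.

    get_status is any zero-argument callable returning the current status
    string, for example a wrapper around GET /deployments/{deployment_id}.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "active":
            return
        # Assumption: these states never transition back to "active".
        if status in ("teardown", "inactive"):
            raise RuntimeError(f"deployment entered {status!r} before becoming active")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"deployment still {status!r} after {timeout_s} seconds")
        sleep(interval_s)
```

Passing the status lookup as a callable keeps the polling logic independent of the HTTP client and makes it easy to exercise with canned status sequences.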
List Deployments
GET /api/v1/public/deployments
Teardown
DELETE /api/v1/public/deployments/{deployment_id}
Destroys the DigitalOcean GPU instance. Billing stops at teardown.
Inference Presets
Apply hardware + configuration presets at deployment creation:
{
  "name": "speed-optimized",
  "model_id": "llama-3.3-70b-instruct",
  "gpu_type": "h100",
  "region": "nyc2",
  "preset": "speed"
}
| Preset | GPU | Config | Use case |
|---|---|---|---|
| speed | H100 | fp16, max batch | Lowest latency |
| latency | H100 | speculative decoding, min batch | Interactive / streaming |
| cost | MI300X | int8, scale-to-zero | Batch workloads |
| balanced | MI300X | fp16, auto batch | General purpose |
Auto-Scaling
Configure replicas to scale automatically based on queue depth and latency:
PATCH /api/v1/public/deployments/{deployment_id}/scaling
{
  "min_replicas": 1,
  "max_replicas": 4,
  "scale_to_zero": true,
  "idle_timeout_seconds": 300
}
The autoscaler polls every 30 seconds. With scale-to-zero enabled, the DigitalOcean instance is destroyed after the idle timeout, and the next request triggers a cold start.
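A scaling configuration can be sanity-checked before it is sent. The sketch below builds the PATCH body and also estimates how long an idle deployment might stay up. Both helper names are hypothetical. The estimate assumes idleness is only detected at a polling tick, so teardown can lag the timeout by up to one 30-second interval; the actual autoscaler behavior may differ.

```python
AUTOSCALER_POLL_SECONDS = 30  # polling interval stated in the text above


def scaling_payload(min_replicas: int, max_replicas: int,
                    scale_to_zero: bool = False,
                    idle_timeout_seconds: int = 300) -> dict:
    """Build the JSON body for PATCH /deployments/{deployment_id}/scaling."""
    if min_replicas < 0:
        raise ValueError("min_replicas must not be negative")
    if max_replicas < max(1, min_replicas):
        raise ValueError("max_replicas must be >= 1 and >= min_replicas")
    if idle_timeout_seconds <= 0:
        raise ValueError("idle_timeout_seconds must be positive")
    return {
        "min_replicas": min_replicas,
        "max_replicas": max_replicas,
        "scale_to_zero": scale_to_zero,
        "idle_timeout_seconds": idle_timeout_seconds,
    }


def worst_case_idle_seconds(idle_timeout_seconds: int) -> int:
    """Upper bound on idle time before scale-to-zero, under the assumption
    that the timeout is noticed no later than the next polling tick."""
    return idle_timeout_seconds + AUTOSCALER_POLL_SECONDS
```

With the example configuration, `worst_case_idle_seconds(300)` gives 330 seconds under that assumption.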
Custom SLA
Set time-to-first-token (TTFT), throughput, and concurrency targets. Breaches fire webhook events automatically:
PATCH /api/v1/public/deployments/{deployment_id}/sla
{
  "ttft_target_ms": 500,
  "throughput_target_tps": 100,
  "concurrency_limit": 50
}
GET /api/v1/public/deployments/{deployment_id}/sla
Returns current targets and breach history.
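To illustrate what a breach means for each target, the sketch below compares measured values against the configured targets. The function name `sla_breaches` and the `measured` keys (`ttft_ms`, `throughput_tps`) are hypothetical, and the platform's own breach detection may use windows or percentiles that this simple comparison does not model. `concurrency_limit` is left out because it reads as a cap rather than a performance target.

```python
from typing import Dict, List


def sla_breaches(targets: Dict[str, float], measured: Dict[str, float]) -> List[str]:
    """Return the names of the targets that the measured values violate."""
    breaches = []
    # TTFT is a latency: higher than the target is worse.
    if measured["ttft_ms"] > targets["ttft_target_ms"]:
        breaches.append("ttft")
    # Throughput is a rate: lower than the target is worse.
    if measured["throughput_tps"] < targets["throughput_target_tps"]:
        breaches.append("throughput")
    return breaches
```

For example, with the targets shown above, a measurement of 620 ms TTFT at 140 tokens per second would breach only the TTFT target.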
Deployment Health Webhooks
Subscribe to push events for state changes and SLA breaches:
POST /api/v1/public/deployments/{deployment_id}/webhooks
{
  "url": "https://your-service.com/hooks/ainative",
  "events": [
    "deployment.active",
    "deployment.degraded",
    "deployment.sla_breach",
    "deployment.recovered",
    "deployment.scaled"
  ]
}
Payloads are HMAC-SHA256 signed. Verify with the X-AINative-Signature header:
import hashlib
import hmac

def verify(payload: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", signature)
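A receiving service would typically verify the signature against the raw request bytes before parsing them. The sketch below wraps the verification above in a small handler and adds a `sign` helper for local testing. The `sign` and `handle_webhook` names are illustrative, and the `"event"` field in the payload is an assumption, since the payload schema is not documented here.

```python
import hashlib
import hmac
import json


def sign(payload: bytes, secret: str) -> str:
    """Produce a signature in the same "sha256=<hex>" form, for local tests."""
    digest = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify(payload: bytes, signature: str, secret: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    return hmac.compare_digest(sign(payload, secret), signature)


def handle_webhook(raw_body: bytes, signature: str, secret: str) -> str:
    """Verify first, then parse. Returns the event name.

    Assumption: the JSON payload carries the event name under "event".
    """
    if not verify(raw_body, signature, secret):
        raise PermissionError("invalid webhook signature")
    return json.loads(raw_body)["event"]
```

Verifying the raw bytes matters: re-serializing parsed JSON can change whitespace or key order, which would change the digest.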
RBAC — Per-Deployment Access Control
Grant specific API keys access to a deployment:
POST /api/v1/public/deployments/{deployment_id}/permissions
{
  "api_key_id": "key_abc123",
  "role": "inference"
}
Revoke:
DELETE /api/v1/public/deployments/{deployment_id}/permissions/{api_key_id}
Scoped keys can only call deployments they've been granted access to. Useful for multi-tenant products where each customer gets their own scoped key.
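For a multi-tenant product, granting and revoking access usually happens alongside customer onboarding and offboarding. The sketch below prepares both requests with the standard library without sending them. The helper names are hypothetical, and it assumes the calls are authorized with the same Bearer header used elsewhere in this document.

```python
import json
import urllib.request

BASE_URL = "https://api.ainative.studio/api/v1/public"


def grant_request(admin_key: str, deployment_id: str, api_key_id: str,
                  role: str = "inference") -> urllib.request.Request:
    """Prepare POST /deployments/{deployment_id}/permissions."""
    body = {"api_key_id": api_key_id, "role": role}
    return urllib.request.Request(
        f"{BASE_URL}/deployments/{deployment_id}/permissions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {admin_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def revoke_request(admin_key: str, deployment_id: str,
                   api_key_id: str) -> urllib.request.Request:
    """Prepare DELETE /deployments/{deployment_id}/permissions/{api_key_id}."""
    return urllib.request.Request(
        f"{BASE_URL}/deployments/{deployment_id}/permissions/{api_key_id}",
        headers={"Authorization": f"Bearer {admin_key}"},
        method="DELETE",
    )
```

Either request can be sent with `urllib.request.urlopen`; keeping construction separate from sending makes the calls easy to log or test.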
Bring Your Own Model (BYOM)
Upload custom fine-tuned weights in GGUF format and serve them on dedicated infrastructure:
POST /api/v1/public/models/upload
curl -X POST https://api.ainative.studio/api/v1/public/models/upload \
  -H "Authorization: Bearer $AINATIVE_API_KEY" \
  -F "file=@my-finetune.gguf" \
  -F "name=my-model-v1" \
  -F "deployment_id=dep_abc123"
Status transitions: pending_upload → uploading → ready
Once ready, call via standard chat completions:
{ "model": "byom:my-model-v1", "messages": [...] }
Embeddings
POST /api/v1/public/embeddings/generate
Backed by DigitalOcean inference (inference.do-ai.run/v1/embeddings).
result = requests.post(
    "https://api.ainative.studio/api/v1/public/embeddings/generate",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": "The quick brown fox", "model": "bge-m3"},
).json()

vector = result["data"][0]["embedding"]  # 1024-dim
| Model | Dimensions | Tier | Notes |
|---|---|---|---|
| bge-m3 | 1024 | basic+ | Default; best retrieval quality |
| all-minilm-l6-v2 | 384 | free | Fast, low cost |
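Embedding vectors are usually compared with cosine similarity. The sketch below shows that comparison in plain Python for two vectors of equal dimension, such as two `embedding` arrays returned by this endpoint. The function name is illustrative, and a vector database or NumPy would normally do this at scale.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    if len(a) != len(b):
        # Vectors from different models (e.g. 1024 vs 384 dims) are not comparable.
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Note that vectors produced by `bge-m3` and `all-minilm-l6-v2` have different dimensions, so a corpus and its queries should be embedded with the same model.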
Rerank
Cross-encoder reranking for RAG pipelines — scores documents against a query:
POST /api/v1/public/rerank
result = requests.post(
    "https://api.ainative.studio/api/v1/public/rerank",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "query": "What is quantum computing?",
        "documents": [
            "Quantum computers use qubits to process information.",
            "Classical computers use binary bits.",
            "Quantum entanglement enables faster computation.",
        ],
        "top_n": 3,
    },
).json()

for r in result["results"]:
    print(r["relevance_score"], r["document"][:60])
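In a RAG pipeline, the reranked results are typically filtered before being placed into the prompt. The sketch below keeps the highest-scoring documents above a cutoff. The function name `select_context` and the default threshold are illustrative choices, and it assumes each result carries `relevance_score` and a string `document`, as in the loop above.

```python
from typing import Dict, List


def select_context(results: List[Dict], min_score: float = 0.5,
                   max_docs: int = 3) -> List[str]:
    """Return up to max_docs documents scoring at least min_score, best first."""
    # Sort defensively rather than relying on the response order.
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    kept = [r["document"] for r in ranked if r["relevance_score"] >= min_score]
    return kept[:max_docs]
```

The right threshold depends on the reranker and the corpus, so it is worth tuning against a handful of known-good queries.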
Audio
Transcription
POST /api/v1/public/audio/transcriptions
curl -X POST https://api.ainative.studio/api/v1/public/audio/transcriptions \
  -H "Authorization: Bearer $AINATIVE_API_KEY" \
  -F "file=@recording.mp3" \
  -F "model=whisper-large-v3-turbo"
| Model | Tier | Notes |
|---|---|---|
| whisper-small | basic+ | Fast, lower accuracy |
| whisper-large-v3-turbo | basic+ | Best accuracy |
Text-to-Speech
POST /api/v1/public/audio/tts
{
  "input": "Hello, I am your AI assistant.",
  "model": "minimax-speech-02-hd",
  "voice": "en_female_1"
}
Returns binary MP3 audio. Powered by the MiniMax Speech API.
Inference Observability
Usage Metrics
GET /api/v1/public/inference/metrics
Returns per-model token usage, cost, latency percentiles, and error rates.
Deployment Metrics
GET /api/v1/public/deployments/{deployment_id}/metrics
Real-time TTFT, throughput (tok/s), active concurrency, and SLA breach history for a dedicated deployment.
Model Catalog
GET /api/v1/public/models
Full catalog with capabilities, pricing, context window, and tier requirements.
GET /api/v1/public/models/available
Only models your current plan can access.
Plan access:
| Plan | Models included |
|---|---|
| free | Llama 3.3 8B/70B, Llama 4 Maverick, Llama 4 Scout, all-minilm-l6-v2 |
| basic | + Whisper, Gemma, Qwen, BGE embeddings, Claude Haiku |
| professional | + Coding models, DeepSeek, image gen, audio TTS |
| enterprise | + Claude Sonnet/Opus, DeepSeek R1, video gen, SD 3.5 |
Rate Limits
| Plan | Requests/min | Tokens/min |
|---|---|---|
| Free | 10 | 50,000 |
| Basic | 60 | 200,000 |
| Professional | 300 | 1,000,000 |
| Enterprise | Custom | Custom |
Rate limit headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset
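Clients can use these headers to pause before the limit is hit rather than retrying after a rejection. The sketch below computes how long to wait. The function name is hypothetical, and it assumes `X-RateLimit-Reset` holds a Unix timestamp in seconds, which is a common convention but is not confirmed here. If the header instead carries a relative delay, the subtraction should be dropped.

```python
import time
from typing import Mapping, Optional


def seconds_until_retry(headers: Mapping[str, str],
                        now: Optional[float] = None) -> float:
    """Return how long to wait before the next request (0.0 if none is needed)."""
    now = time.time() if now is None else now
    # Missing header: assume quota is available rather than stalling.
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    # Assumption: the reset value is an absolute Unix timestamp in seconds.
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)
```

A caller would pass the response headers after each request and sleep for the returned duration before sending the next one.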
Next Steps
- Dedicated Deployments — Full deployment management reference
- Chat Completions API — Anthropic + OpenAI format details, tool calling
- Model Catalog — All models with pricing
- ZeroMemory — Add persistent memory to your agents