Agent Cloud Inference

AINative Agent Cloud provides a unified inference API across a tiered GPU stack — from free serverless models to single-tenant H100/MI300X deployments. One API key, one endpoint, every model.

Base URL: https://api.ainative.studio

Inference Stack

Requests route automatically based on the model you select:

Tier	Provider	Hardware	Notes
Serverless	NVIDIA NIM	Shared GPU	Primary — 141+ models, OpenAI-compatible
Serverless fallback	HuggingFace	Shared GPU	Auto-fallback on NIM errors
Ultra-fast	Cerebras	Wafer-scale	`gpt-oss-120b`, `zai-glm-4.7` — 2,000+ tok/s
Dedicated HF	HuggingFace Endpoints	T4 / L4 / A100	Per-user dedicated GPU (Tier 2)
Frontier	DigitalOcean	H100 / MI300X	Llama 3.3/4 — routed via DO inference
Frontier	Anthropic	—	Claude Haiku / Sonnet / Opus
Dedicated GPU	DigitalOcean	H100 / MI300X / MI325X	Single-tenant reserved capacity

Chat Completions

Standard

POST /api/v1/public/chat/completions

OpenAI-compatible. Routes automatically across the stack based on your chosen model.

curl -X POST https://api.ainative.studio/api/v1/public/chat/completions \
  -H "Authorization: Bearer $AINATIVE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "Explain transformers in one paragraph"}],
    "max_tokens": 512,
    "stream": false
  }'

Also available in Anthropic format at POST /v1/messages — see Chat Completions API.
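
The curl request above translates to Python roughly as follows. This is a minimal sketch: the helper names (`build_chat_request`, `extract_text`, `chat`) are illustrative, and the response shape (`choices[0].message.content`) is an assumption based on the endpoint being described as OpenAI-compatible.

```python
BASE_URL = "https://api.ainative.studio"


def build_chat_request(prompt, model="llama-3.3-70b-instruct", max_tokens=512):
    """Build the same JSON body as the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }


def extract_text(body):
    """Assumption: OpenAI-style response with text at choices[0].message.content."""
    return body["choices"][0]["message"]["content"]


def chat(session, api_key, prompt, **options):
    """Send one user turn. `session` is a requests.Session or a compatible object."""
    response = session.post(
        f"{BASE_URL}/api/v1/public/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json=build_chat_request(prompt, **options),
        timeout=60,
    )
    response.raise_for_status()
    return extract_text(response.json())
```

Typical use would be `chat(requests.Session(), os.environ["AINATIVE_API_KEY"], "Hello")`. Passing the session in keeps connection reuse explicit and lets the helper be exercised with a stub.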

Dedicated Deployment

Route a request to your provisioned single-tenant GPU:

POST /api/v1/public/deployments/{deployment_id}/chat/completions

Same request schema as standard chat. Traffic is fully isolated to your reserved hardware.

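import os
import requests

API_KEY = os.environ["AINATIVE_API_KEY"]
DEPLOYMENT_ID = "dep_abc123"  # ID of your provisioned deployment

response = requests.post(
    f"https://api.ainative.studio/api/v1/public/deployments/{DEPLOYMENT_ID}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "llama-3.3-70b-instruct",
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=60,
)
response.raise_for_status()  # surface HTTP errors instead of parsing an error body
print(response.json())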

Org-Scoped Inference

Enterprise teams can isolate inference per organization, with RBAC enforced on every request:

POST /api/v1/public/orgs/{org_id}/deployments/{deployment_id}/chat/completions

Requires the caller to be a member of org_id with deployment:invoke permission on the target deployment.
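
The three chat endpoints (standard, dedicated, org-scoped) differ only in their path prefix. A small sketch of a URL builder that picks the right one; `chat_completions_url` is an illustrative helper, not part of any SDK, and `org_1` in the usage note is a made-up ID.

```python
from typing import Optional
from urllib.parse import quote

BASE_URL = "https://api.ainative.studio"


def chat_completions_url(deployment_id: Optional[str] = None,
                         org_id: Optional[str] = None) -> str:
    """Return the chat completions URL for the requested isolation level."""
    if org_id and not deployment_id:
        # The org-scoped route always names a deployment.
        raise ValueError("org-scoped inference requires a deployment_id")
    path = "/api/v1/public"
    if org_id:
        path += f"/orgs/{quote(org_id, safe='')}"
    if deployment_id:
        path += f"/deployments/{quote(deployment_id, safe='')}"
    return f"{BASE_URL}{path}/chat/completions"
```

For example, `chat_completions_url("dep_abc123", "org_1")` yields the org-scoped route, while calling it with no arguments yields the standard serverless route.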

Dedicated GPU Deployments

Provision isolated GPU capacity on DigitalOcean H100/MI300X/MI325X hardware. Once a deployment is active, its reserved capacity serves requests without a cold start.

Create a Deployment

POST /api/v1/public/deployments

deployment = requests.post(
    "https://api.ainative.studio/api/v1/public/deployments",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "name": "prod-inference",
        "model_id": "llama-3.3-70b-instruct",
        "gpu_type": "h100",
        "region": "nyc2",
        "replicas": 1,
    },
).json()

print(deployment["id"])          # dep_abc123
print(deployment["status"])      # provisioning → active

GPU types:

`gpu_type`	Hardware	VRAM	Best for
`h100`	NVIDIA H100	80 GB	Speed-critical, 70B+ models
`mi300x`	AMD MI300X	192 GB	Cost-efficient, large context
`mi300x_8`	8× AMD MI300X	1,536 GB	Very large models
`mi325x`	AMD MI325X	256 GB	Balanced
`mi325x_8`	8× AMD MI325X	2,048 GB	Maximum capacity

Regions: atl1 (Atlanta), nyc2 (New York), tor1 (Toronto), ric1 (Richmond)
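
Checking a create request against the GPU types and regions listed above before sending it can catch typos early. A sketch under stated assumptions: the lookup tables are copied from this page, `deployment_request` is an illustrative helper, and the `replicas >= 1` rule is a guess rather than a documented constraint.

```python
# VRAM per gpu_type in GB, copied from the table above.
GPU_VRAM_GB = {
    "h100": 80,
    "mi300x": 192,
    "mi300x_8": 1536,
    "mi325x": 256,
    "mi325x_8": 2048,
}
REGIONS = {"atl1", "nyc2", "tor1", "ric1"}


def deployment_request(name, model_id, gpu_type, region, replicas=1):
    """Validate and build the JSON body for POST /api/v1/public/deployments."""
    if gpu_type not in GPU_VRAM_GB:
        raise ValueError(f"unknown gpu_type {gpu_type!r}; expected one of {sorted(GPU_VRAM_GB)}")
    if region not in REGIONS:
        raise ValueError(f"unknown region {region!r}; expected one of {sorted(REGIONS)}")
    if replicas < 1:  # assumption: at least one replica at creation time
        raise ValueError("replicas must be >= 1")
    return {
        "name": name,
        "model_id": model_id,
        "gpu_type": gpu_type,
        "region": region,
        "replicas": replicas,
    }
```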

Deployment Lifecycle

provisioning → active → scaling → teardown → inactive

# Poll until active
import time

dep_id = deployment["id"]  # from the create response above

while True:
    status = requests.get(
        f"https://api.ainative.studio/api/v1/public/deployments/{dep_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    ).json()["status"]
    if status == "active":
        break
    time.sleep(5)
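
The polling loop above runs forever if the deployment never becomes active. A bounded variant might look like the sketch below. The 10-minute default timeout is arbitrary, and treating `teardown` and `inactive` as terminal is an assumption drawn from the lifecycle order; `wait_until_active` and `get_status` are illustrative names.

```python
import time
from typing import Callable

# Assumption: a deployment in these states will not become active on its own.
TERMINAL_STATES = {"teardown", "inactive"}


def wait_until_active(get_status: Callable[[], str],
                      timeout_s: float = 600.0,
                      interval_s: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll get_status() until it returns "active", or raise on failure/timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "active":
            return
        if status in TERMINAL_STATES:
            raise RuntimeError(f"deployment entered {status!r} before becoming active")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"deployment still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

Here `get_status` would wrap the `requests.get(...).json()["status"]` call from the loop above; injecting it (and `sleep`) keeps the wait logic testable without network access.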

List Deployments

GET /api/v1/public/deployments

Teardown

DELETE /api/v1/public/deployments/{deployment_id}

Destroys the DO GPU instance. Billing stops at teardown.

Inference Presets

Apply a hardware and configuration preset when creating a deployment:

{
    "name": "speed-optimized",
    "model_id": "llama-3.3-70b-instruct",
    "gpu_type": "h100",
    "region": "nyc2",
    "preset": "speed"
}

Preset	GPU	Config	Use case
`speed`	H100	fp16, max batch	Lowest latency
`latency`	H100	speculative decoding, min batch	Interactive / streaming
`cost`	MI300X	int8, scale-to-zero	Batch workloads
`balanced`	MI300X	fp16, auto batch	General purpose

Auto-Scaling

Configure replicas to scale automatically based on queue depth and latency:

PATCH /api/v1/public/deployments/{deployment_id}/scaling

{
  "min_replicas": 1,
  "max_replicas": 4,
  "scale_to_zero": true,
  "idle_timeout_seconds": 300
}

The autoscaler polls every 30s. Scale-to-zero destroys the DO instance after the idle timeout, so the next request incurs a cold start while a new instance is provisioned.
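
A sketch of building the scaling body with basic sanity checks, plus a rough bound on how long an idle instance might stay up. The validation rules and the bound are assumptions: the bound supposes idleness is only evaluated at each 30s poll, which this page does not state. `scaling_config` and `worst_case_idle_seconds` are illustrative names.

```python
AUTOSCALER_POLL_SECONDS = 30  # polling interval stated above


def scaling_config(min_replicas, max_replicas, scale_to_zero=False,
                   idle_timeout_seconds=300):
    """Build the JSON body for PATCH .../deployments/{deployment_id}/scaling."""
    if min_replicas < 0 or max_replicas < 1:
        raise ValueError("replica counts out of range")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    if idle_timeout_seconds <= 0:
        raise ValueError("idle_timeout_seconds must be positive")
    return {
        "min_replicas": min_replicas,
        "max_replicas": max_replicas,
        "scale_to_zero": scale_to_zero,
        "idle_timeout_seconds": idle_timeout_seconds,
    }


def worst_case_idle_seconds(idle_timeout_seconds):
    """Upper bound on idle time before teardown, if checked only once per poll."""
    return idle_timeout_seconds + AUTOSCALER_POLL_SECONDS
```

With the 300-second timeout from the example, this estimate gives up to 330 seconds of idle time before the instance is destroyed.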

Custom SLA

Set time-to-first-token (TTFT), throughput, and concurrency targets. Breaches fire webhook events automatically:

PATCH /api/v1/public/deployments/{deployment_id}/sla

{
  "ttft_target_ms": 500,
  "throughput_target_tps": 100,
  "concurrency_limit": 50
}

GET /api/v1/public/deployments/{deployment_id}/sla

Returns current targets and breach history.
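
To make the three targets concrete, the sketch below compares a metrics sample against them on the client side. The target keys match the request body above; the observed keys (`ttft_ms`, `throughput_tps`, `concurrency`) are hypothetical, since the metrics response schema is not shown on this page.

```python
def sla_breaches(targets, observed):
    """Return the names of SLA targets that the observed sample violates."""
    breaches = []
    # TTFT is a ceiling: slower than the target is a breach.
    if observed["ttft_ms"] > targets["ttft_target_ms"]:
        breaches.append("ttft_target_ms")
    # Throughput is a floor: fewer tokens per second than the target is a breach.
    if observed["throughput_tps"] < targets["throughput_target_tps"]:
        breaches.append("throughput_target_tps")
    # Concurrency is a ceiling on simultaneous requests.
    if observed["concurrency"] > targets["concurrency_limit"]:
        breaches.append("concurrency_limit")
    return breaches
```

The direction of each comparison is the point: TTFT and concurrency are upper limits, throughput is a lower limit.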

Deployment Health Webhooks

Subscribe to push events for state changes and SLA breaches:

POST /api/v1/public/deployments/{deployment_id}/webhooks

{
  "url": "https://your-service.com/hooks/ainative",
  "events": [
    "deployment.active",
    "deployment.degraded",
    "deployment.sla_breach",
    "deployment.recovered",
    "deployment.scaled"
  ]
}

Payloads are HMAC-SHA256 signed. Verify with the X-AINative-Signature header:

import hmac, hashlib

def verify(payload: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", signature)
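
A receiver would typically verify the signature against the raw request body before parsing it. The sketch below shows that order, along with a `sign` helper for producing test signatures locally. The `sha256=<hex>` header format comes from the `verify` function above; the `event` field in the payload is an assumption, as the payload schema is not documented here.

```python
import hashlib
import hmac
import json


def sign(payload: bytes, secret: str) -> str:
    """Compute the signature in the same "sha256=<hex>" form that verify() expects."""
    digest = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def handle_webhook(raw_body: bytes, signature: str, secret: str) -> dict:
    """Verify the signature over the raw bytes first, then parse the JSON."""
    if not hmac.compare_digest(sign(raw_body, secret), signature):
        raise PermissionError("webhook signature mismatch")
    return json.loads(raw_body)
```

Verification must run on the bytes exactly as received: re-serializing parsed JSON can change whitespace or key order and invalidate the signature.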

RBAC — Per-Deployment Access Control

Grant specific API keys access to a deployment:

POST /api/v1/public/deployments/{deployment_id}/permissions

{
  "api_key_id": "key_abc123",
  "role": "inference"
}

Revoke:

DELETE /api/v1/public/deployments/{deployment_id}/permissions/{api_key_id}

Scoped keys can only call deployments they've been granted access to. Useful for multi-tenant products where each customer gets their own scoped key.

Bring Your Own Model (BYOM)

Upload custom fine-tuned weights in GGUF format and serve them on dedicated infrastructure:

POST /api/v1/public/models/upload

curl -X POST https://api.ainative.studio/api/v1/public/models/upload \
  -H "Authorization: Bearer $AINATIVE_API_KEY" \
  -F "file=@my-finetune.gguf" \
  -F "name=my-model-v1" \
  -F "deployment_id=dep_abc123"

Status transitions: pending_upload → uploading → ready

Once ready, call via standard chat completions:

{ "model": "byom:my-model-v1", "messages": [...] }

Embeddings

POST /api/v1/public/embeddings/generate

Backed by DigitalOcean inference (inference.do-ai.run/v1/embeddings).

result = requests.post(
    "https://api.ainative.studio/api/v1/public/embeddings/generate",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": "The quick brown fox", "model": "bge-m3"},
).json()

vector = result["data"][0]["embedding"]  # 1024-dim

Model	Dimensions	Tier	Notes
`bge-m3`	1024	basic+	Default — best retrieval quality
`all-minilm-l6-v2`	384	free	Fast, low cost
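
Embedding vectors are usually compared with cosine similarity. A dependency-free sketch; `cosine_similarity` is an illustrative helper, and whether the returned vectors are already normalized is not stated on this page, so the function normalizes explicitly.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm
```

Vectors from different models cannot be compared: a 1024-dimension `bge-m3` vector and a 384-dimension `all-minilm-l6-v2` vector trigger the dimension check.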

Rerank

Cross-encoder reranking for RAG pipelines — scores documents against a query:

POST /api/v1/public/rerank

result = requests.post(
    "https://api.ainative.studio/api/v1/public/rerank",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "query": "What is quantum computing?",
        "documents": [
            "Quantum computers use qubits to process information.",
            "Classical computers use binary bits.",
            "Quantum entanglement enables faster computation.",
        ],
        "top_n": 3,
    },
).json()

for r in result["results"]:
    print(r["relevance_score"], r["document"][:60])
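
In a RAG pipeline, the reranked results are usually filtered and truncated before being placed in the prompt. A sketch using the `relevance_score` and `document` fields from the loop above; the threshold value and the `top_documents` helper are illustrative, and the score range is not documented here.

```python
def top_documents(results, min_score=0.0, limit=None):
    """Return document texts sorted by relevance, dropping low-scoring ones."""
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    kept = [r["document"] for r in ranked if r["relevance_score"] >= min_score]
    return kept[:limit]  # limit=None keeps everything that passed the threshold
```

For example, `"\n\n".join(top_documents(result["results"], min_score=0.5, limit=2))` would build a context block from the two best passages scoring at least 0.5.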

Audio

Transcription

POST /api/v1/public/audio/transcriptions

curl -X POST https://api.ainative.studio/api/v1/public/audio/transcriptions \
  -H "Authorization: Bearer $AINATIVE_API_KEY" \
  -F "file=@recording.mp3" \
  -F "model=whisper-large-v3-turbo"

Model	Tier	Notes
`whisper-small`	basic+	Fast, lower accuracy
`whisper-large-v3-turbo`	basic+	Best accuracy

Text-to-Speech

POST /api/v1/public/audio/tts

{
  "input": "Hello, I am your AI assistant.",
  "model": "minimax-speech-02-hd",
  "voice": "en_female_1"
}

Returns binary MP3 audio. Powered by the MiniMax Speech API.
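
Because the response is binary rather than JSON, it should be written to disk as bytes. A sketch assuming the response body is the raw MP3 data with no wrapper; `synthesize_to_file` is an illustrative helper and `session` is a `requests.Session` or compatible object.

```python
from pathlib import Path

BASE_URL = "https://api.ainative.studio"


def synthesize_to_file(session, api_key, text, out_path,
                       model="minimax-speech-02-hd", voice="en_female_1"):
    """POST the TTS request and write the returned audio bytes to out_path."""
    response = session.post(
        f"{BASE_URL}/api/v1/public/audio/tts",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"input": text, "model": model, "voice": voice},
        timeout=120,
    )
    response.raise_for_status()
    path = Path(out_path)
    path.write_bytes(response.content)  # assumption: body is the raw MP3 data
    return path
```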

Inference Observability

Usage Metrics

GET /api/v1/public/inference/metrics

Returns per-model token usage, cost, latency percentiles, and error rates.

Deployment Metrics

GET /api/v1/public/deployments/{deployment_id}/metrics

Real-time TTFT, throughput (tok/s), active concurrency, and SLA breach history for a dedicated deployment.

Model Catalog

GET /api/v1/public/models

Full catalog with capabilities, pricing, context window, and tier requirements.

GET /api/v1/public/models/available

Only models your current plan can access.

Plan access:

Plan	Models included
`free`	Llama 3.3 8B/70B, Llama 4 Maverick, Llama 4 Scout, `all-minilm-l6-v2`
`basic`	+ Whisper, Gemma, Qwen, BGE embeddings, Claude Haiku
`professional`	+ Coding models, DeepSeek, image gen, audio TTS
`enterprise`	+ Claude Sonnet/Opus, DeepSeek R1, video gen, SD 3.5

Rate Limits

Plan	Requests/min	Tokens/min
Free	10	50,000
Basic	60	200,000
Professional	300	1,000,000
Enterprise	Custom	Custom

Rate limit headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset
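
Clients can pace themselves from the table above and back off using the response headers. A sketch with one loud assumption: `X-RateLimit-Reset` is treated as a Unix timestamp in seconds, which this page does not specify (some APIs send seconds-until-reset instead). The helper names are illustrative.

```python
import time

# Requests per minute, copied from the table above (Enterprise is custom).
PLAN_REQUESTS_PER_MIN = {"free": 10, "basic": 60, "professional": 300}


def min_request_interval(plan):
    """Seconds to wait between requests to stay under the per-minute limit."""
    return 60.0 / PLAN_REQUESTS_PER_MIN[plan]


def seconds_until_reset(headers, now=None):
    """How long to pause before retrying, based on the rate limit headers.

    Assumption: X-RateLimit-Reset is a Unix timestamp in seconds.
    """
    if int(headers.get("X-RateLimit-Remaining", 1)) > 0:
        return 0.0  # budget left, no need to wait
    now = time.time() if now is None else now
    return max(0.0, float(headers["X-RateLimit-Reset"]) - now)
```

On the Free plan this works out to one request every 6 seconds; on Professional, one every 0.2 seconds.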

Next Steps

Dedicated Deployments — Full deployment management reference
Chat Completions API — Anthropic + OpenAI format details, tool calling
Model Catalog — All models with pricing
ZeroMemory — Add persistent memory to your agents
