Dedicated GPU Deployments
Single-tenant inference on reserved DigitalOcean GPU hardware. Traffic is fully isolated — no shared capacity, no cold starts once active. Built to compete with Pipeshift and other enterprise inference platforms.
Base path: /api/v1/public (for example, POST /api/v1/public/deployments)
Why Dedicated vs Serverless
| | Serverless (Tier 1–3) | Dedicated (Tier 4) |
|---|---|---|
| Cold start | Possible | None — always warm |
| Isolation | Shared | Single-tenant |
| SLA enforcement | No | TTFT + throughput targets |
| Scaling | Platform-managed | You control min/max |
| BYOM | No | Yes |
| Cost | Per-token | Hourly reserved |
| Best for | Intermittent / dev | Production / enterprise |
Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /deployments | Create a deployment |
| GET | /deployments | List your deployments |
| GET | /deployments/{id} | Get deployment status |
| DELETE | /deployments/{id} | Tear down a deployment |
| POST | /deployments/{id}/chat/completions | Proxy inference (SSE supported) |
| PATCH | /deployments/{id}/scaling | Configure auto-scaling |
| GET | /deployments/{id}/scaling | Get scaling config |
| PATCH | /deployments/{id}/sla | Set SLA targets |
| GET | /deployments/{id}/sla | Get SLA + breach history |
| POST | /deployments/{id}/webhooks | Subscribe to events |
| GET | /deployments/{id}/webhooks | List subscriptions |
| DELETE | /deployments/{id}/webhooks/{webhook_id} | Unsubscribe |
| POST | /deployments/{id}/permissions | Grant API key access |
| DELETE | /deployments/{id}/permissions/{key_id} | Revoke access |
| GET | /deployments/{id}/metrics | Real-time metrics |
| PATCH | /deployments/{id}/preset | Apply inference preset |
| GET | /gpu-model-configs | Valid GPU slug + model combinations |
Org-scoped proxy: POST /orgs/{org_id}/deployments/{id}/chat/completions
Create a Deployment
import requests

API_KEY = "sk_your_key"
BASE = "https://api.ainative.studio/api/v1/public"
H = {"Authorization": f"Bearer {API_KEY}"}

dep = requests.post(f"{BASE}/deployments", headers=H, json={
    "name": "prod-llama-70b",
    "model_id": "llama-3.3-70b-instruct",
    "gpu_type": "h100",
    "region": "nyc2",
    "replicas": 1,
}).json()
print(dep["id"])      # dep_abc123
print(dep["status"])  # provisioning
GPU Types
| gpu_type | Hardware | VRAM | Hourly rate |
|---|---|---|---|
| h100 | 1× NVIDIA H100 | 80 GB | $4.41/hr |
| mi300x | 1× AMD MI300X | 192 GB | $2.59/hr |
| mi300x_8 | 8× AMD MI300X | 1,536 GB | $20.70/hr |
| mi325x | 1× AMD MI325X | 256 GB | $2.98/hr |
| mi325x_8 | 8× AMD MI325X | 2,048 GB | $23.82/hr |
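As a rough way to read the table, the sketch below picks the cheapest listed gpu_type whose total VRAM covers a given memory requirement. The VRAM and rate figures are copied from the table; the helper name cheapest_fit and the idea of choosing on VRAM alone are illustrative assumptions, not platform behavior. Real sizing also depends on KV cache, context length, and quantization.

```python
# VRAM and hourly rates copied from the GPU Types table.
# The selection logic itself is an illustrative assumption.
GPU_TYPES = {
    "h100":     {"vram_gb": 80,   "hourly_usd": 4.41},
    "mi300x":   {"vram_gb": 192,  "hourly_usd": 2.59},
    "mi300x_8": {"vram_gb": 1536, "hourly_usd": 20.70},
    "mi325x":   {"vram_gb": 256,  "hourly_usd": 2.98},
    "mi325x_8": {"vram_gb": 2048, "hourly_usd": 23.82},
}


def cheapest_fit(required_vram_gb):
    """Return the cheapest gpu_type with at least required_vram_gb, or None."""
    candidates = [
        (spec["hourly_usd"], name)
        for name, spec in GPU_TYPES.items()
        if spec["vram_gb"] >= required_vram_gb
    ]
    return min(candidates)[1] if candidates else None


# 70B parameters at 2 bytes each is about 140 GB of weights alone.
print(cheapest_fit(140))
```

A None result means no single listed gpu_type has enough VRAM for the requirement.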
Regions
| region | Location |
|---|---|
| atl1 | Atlanta |
| nyc2 | New York |
| tor1 | Toronto |
| ric1 | Richmond |
Status Lifecycle
provisioning → active → scaling → teardown → inactive
Poll until active:
import time

while True:
    # Fetch the current deployment record and check its status
    r = requests.get(f"{BASE}/deployments/{dep['id']}", headers=H).json()
    if r["status"] == "active":
        print("Ready:", r["endpoint_url"])
        break
    time.sleep(5)  # wait 5 seconds between polls
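The loop above polls forever if the deployment never becomes active. A minimal sketch of a bounded variant follows. The helper name wait_until_active, the 15-minute default timeout, and the injectable sleep and clock parameters are assumptions chosen for illustration and testability, not part of the API.

```python
import time


def wait_until_active(fetch_status, timeout_s=900, interval_s=5,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns "active" or timeout_s elapses.

    fetch_status is any zero-argument callable returning the status string.
    Returns True once active, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if fetch_status() == "active":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage with the request from the polling example (not executed here):
# ok = wait_until_active(
#     lambda: requests.get(f"{BASE}/deployments/{dep['id']}", headers=H).json()["status"]
# )
```

Passing the status lookup as a callable keeps the timeout logic separate from the HTTP call, so it can be exercised without a live deployment.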
Run Inference
Once active, call the deployment proxy directly:
response = requests.post(
    f"{BASE}/deployments/{dep['id']}/chat/completions",
    headers=H,
    json={
        "model": "llama-3.3-70b-instruct",
        "messages": [{"role": "user", "content": "Explain gradient descent"}],
        "max_tokens": 512,
        "stream": False,
    },
)
print(response.json()["choices"][0]["message"]["content"])
Streaming is also supported: set "stream" to true (True in Python) to receive tokens as server-sent events (SSE).
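The wire format of the stream is not specified here. The sketch below assumes an OpenAI-style stream: each event is a line beginning with data:, carrying a JSON chunk whose text sits at choices[0].delta.content, and the stream ends with data: [DONE]. The helper name iter_sse_tokens is hypothetical; adjust the parsing if the actual payload differs.

```python
import json


def iter_sse_tokens(lines):
    """Yield text fragments from an iterable of SSE lines (str or bytes).

    Assumes OpenAI-style chunks: data: {"choices": [{"delta": {"content": ...}}]}
    terminated by data: [DONE].
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


# Usage with requests (not executed here):
# with requests.post(url, headers=H, json={**body, "stream": True}, stream=True) as r:
#     for token in iter_sse_tokens(r.iter_lines()):
#         print(token, end="", flush=True)
```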
Inference Presets
Apply hardware + vLLM configuration presets:
requests.patch(f"{BASE}/deployments/{dep['id']}/preset", headers=H, json={
    "preset": "speed",
})
| Preset | GPU | Config | Best for |
|---|---|---|---|
| speed | H100 | fp16, max batch size | Interactive chat, coding |
| latency | H100 | speculative decoding, min batch | Streaming, realtime |
| cost | MI300X | int8, scale-to-zero | Batch jobs, overnight |
| balanced | MI300X | fp16, auto batch | General production |
Auto-Scaling
requests.patch(f"{BASE}/deployments/{dep['id']}/scaling", headers=H, json={
    "min_replicas": 1,
    "max_replicas": 4,
    "scale_to_zero": True,
    "idle_timeout_seconds": 300,
})
The Celery autoscaler polls every 30s. It reads queue depth and latency from the deployment endpoint:
- Scale up: 3 consecutive above-threshold checks → add replica
- Scale down: 3 consecutive below-threshold checks → remove replica
- Scale to zero: destroys the DO instance after idle_timeout_seconds of no traffic. The next request triggers a cold start.
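The consecutive-check rules above can be sketched as a small state machine. This is an illustration only: the class name ScalingDecider, the single load value, and the separate high and low thresholds are assumptions, since the actual metrics and thresholds the autoscaler uses are not documented here.

```python
from dataclasses import dataclass


@dataclass
class ScalingDecider:
    """Tracks consecutive above/below-threshold checks (hypothetical model)."""
    min_replicas: int = 1
    max_replicas: int = 4
    replicas: int = 1
    required_streak: int = 3  # consecutive checks needed, per the rules above
    above: int = 0
    below: int = 0

    def observe(self, load, high, low):
        """Record one polling sample and return the resulting replica count."""
        if load > high:
            self.above, self.below = self.above + 1, 0
        elif load < low:
            self.above, self.below = 0, self.below + 1
        else:
            self.above = self.below = 0  # in-band sample breaks both streaks

        if self.above >= self.required_streak and self.replicas < self.max_replicas:
            self.replicas += 1
            self.above = 0
        elif self.below >= self.required_streak and self.replicas > self.min_replicas:
            self.replicas -= 1
            self.below = 0
        return self.replicas


decider = ScalingDecider(min_replicas=1, max_replicas=4, replicas=1)
for sample in [12, 12, 12]:  # three checks above a threshold of 10
    count = decider.observe(sample, high=10, low=2)
print(count)
```

Requiring a streak rather than reacting to a single sample keeps one noisy poll from adding or removing a replica.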
Custom SLA
Define TTFT, throughput, and concurrency targets. The SLA monitor collects actual metrics every 60s and automatically fires breach webhooks:
requests.patch(f"{BASE}/deployments/{dep['id']}/sla", headers=H, json={
    "ttft_target_ms": 500,         # first token in under 500ms
    "throughput_target_tps": 100,  # 100 tokens/second sustained
    "concurrency_limit": 50,       # max parallel requests
})
Get current SLA status and breach log:
sla = requests.get(f"{BASE}/deployments/{dep['id']}/sla", headers=H).json()
print(sla["breach_count_24h"])
print(sla["last_ttft_ms"])
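For client-side alerting or tests, the breach rule can be approximated locally. The sketch below assumes a breach means measured TTFT above its target or measured throughput below its target; the function name find_breaches and the exact comparison operators are assumptions, not the monitor's documented logic.

```python
def find_breaches(ttft_ms, throughput_tps, ttft_target_ms, throughput_target_tps):
    """Return the names of SLA targets the measured values miss."""
    breaches = []
    if ttft_ms > ttft_target_ms:
        breaches.append("ttft")  # first token arrived too slowly
    if throughput_tps < throughput_target_tps:
        breaches.append("throughput")  # sustained rate fell short
    return breaches


# Targets from the example above: 500 ms TTFT, 100 tokens/second.
print(find_breaches(620, 140, ttft_target_ms=500, throughput_target_tps=100))
```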
Webhooks
Subscribe to push events so your system reacts automatically:
requests.post(f"{BASE}/deployments/{dep['id']}/webhooks", headers=H, json={
    "url": "https://your-service.com/hooks/ainative",
    "events": [
        "deployment.active",
        "deployment.degraded",
        "deployment.sla_breach",
        "deployment.recovered",
        "deployment.scaled",
    ],
})
Event types:
| Event | Triggered when |
|---|---|
| deployment.active | Provisioning complete, inference ready |
| deployment.degraded | DO API reports degraded state |
| deployment.recovered | Degraded → healthy |
| deployment.sla_breach | TTFT or throughput target missed |
| deployment.scaled | Autoscaler changed replica count |
Verify the signature:
import hashlib
import hmac

def verify_webhook(body: bytes, sig_header: str, secret: str) -> bool:
    # Recompute the HMAC-SHA256 of the raw request body and compare in constant time
    mac = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(f"sha256={mac}", sig_header)
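To exercise a handler locally, it can help to produce a signature the same way the verifier expects it. The sketch below is a self-contained round trip; sign_payload, the example payload shape, and the secret value are hypothetical, and only the sha256=<hex digest> header form comes from the verification code above.

```python
import hashlib
import hmac
import json


def sign_payload(body: bytes, secret: str) -> str:
    """Build a header value in the sha256=<hex digest> form."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify_webhook(body: bytes, sig_header: str, secret: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    return hmac.compare_digest(sign_payload(body, secret), sig_header)


# Hypothetical payload and secret, for local testing only.
body = json.dumps({"event": "deployment.sla_breach"}).encode()
header = sign_payload(body, "whsec_test")

print(verify_webhook(body, header, "whsec_test"))         # True
print(verify_webhook(body + b" ", header, "whsec_test"))  # False
```

Verify against the raw request bytes. Re-serializing parsed JSON can change whitespace or key order and invalidate an otherwise correct signature.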
Retries use exponential backoff — up to 5 attempts over 10 minutes.
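The exact retry schedule is not documented here. As one schedule consistent with 5 attempts inside a 10-minute window, the sketch below assumes delays that double each time and together fill the window. The function name backoff_delays and the doubling factor are assumptions; use it only to reason about how long a receiver may see redeliveries.

```python
def backoff_delays(attempts=5, total_window_s=600, factor=2):
    """Delays in seconds between attempts, growing by factor and summing to the window."""
    waits = attempts - 1  # no delay before the first attempt
    if waits <= 0:
        return []
    weights = [factor ** i for i in range(waits)]
    base = total_window_s / sum(weights)
    return [base * w for w in weights]


print(backoff_delays())  # [40.0, 80.0, 160.0, 320.0]
```

Because the same event can arrive more than once, receivers should treat deliveries as idempotent.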
RBAC
Scope API keys to specific deployments. Useful when each customer of your product should only access their own inference endpoint:
# Grant access
requests.post(f"{BASE}/deployments/{dep['id']}/permissions", headers=H, json={
    "api_key_id": "key_cust_123",
    "role": "inference",
})

# Revoke
requests.delete(f"{BASE}/deployments/{dep['id']}/permissions/key_cust_123", headers=H)
A scoped key that calls a deployment it has not been granted access to receives 403 Forbidden.
Org-Scoped Inference
Enterprise orgs can route inference through the org namespace with membership-enforced RBAC:
ORG_ID = "your_org_id"  # placeholder: replace with your organization ID

response = requests.post(
    f"{BASE}/orgs/{ORG_ID}/deployments/{dep['id']}/chat/completions",
    headers=H,
    json={
        "model": "llama-3.3-70b-instruct",
        "messages": [{"role": "user", "content": "Hello"}],
    },
)
The caller must be a member of ORG_ID and have deployment:invoke permission on the target deployment.
BYOM — Bring Your Own Model
Upload custom GGUF weights to serve on your dedicated deployment:
curl -X POST https://api.ainative.studio/api/v1/public/models/upload \
-H "Authorization: Bearer $AINATIVE_API_KEY" \
-F "file=@my-finetune-7b.gguf" \
-F "name=my-model-v2" \
-F "deployment_id=dep_abc123"
Poll status until ready:
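# model_id identifies the uploaded model (assumed to come from the upload response)
while True:
    m = requests.get(f"{BASE}/models/byom/{model_id}", headers=H).json()
    if m["status"] == "ready":
        break
    time.sleep(10)  # wait 10 seconds between polls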
Call via standard chat completions:
{ "model": "byom:my-model-v2", "messages": [...] }
Requirements: GGUF format, max 20 GB per file. Weights are stored in DigitalOcean Spaces.
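A quick local check before uploading can catch the two stated requirements early. The sketch below relies on GGUF files beginning with the four ASCII bytes GGUF; the helper name check_gguf and the reading of "20 GB" as decimal gigabytes are assumptions, and the server may apply further validation.

```python
import os

GGUF_MAGIC = b"GGUF"      # first four bytes of a GGUF file
MAX_BYTES = 20 * 1000**3  # assumes the 20 GB limit is in decimal gigabytes


def check_gguf(path, max_bytes=MAX_BYTES):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    size = os.path.getsize(path)
    if size > max_bytes:
        problems.append(f"file is {size} bytes, over the {max_bytes}-byte limit")
    with open(path, "rb") as f:
        if f.read(4) != GGUF_MAGIC:
            problems.append("missing GGUF magic bytes")
    return problems


# Usage (not executed here):
# issues = check_gguf("my-finetune-7b.gguf")
# if issues:
#     raise SystemExit("; ".join(issues))
```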
Billing
Dedicated deployments bill hourly from provisioning until teardown, regardless of traffic. Charges accrue to your AINative ledger and are aggregated into your Stripe invoice daily.
# Check projected monthly cost
dep = requests.get(f"{BASE}/deployments/{dep['id']}", headers=H).json()
print(f"${dep['hourly_cost_usd']}/hr × 730 = ${dep['hourly_cost_usd'] * 730:.2f}/mo projected")
To stop billing while idle, enable scale_to_zero (see Auto-Scaling): the instance is destroyed after idle_timeout_seconds without traffic.
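To estimate what scale-to-zero might save, the projection can be scaled by the share of the day the deployment is actually running. This is a rough sketch: it assumes no charges accrue while scaled to zero and ignores cold-start time and any minimum billing increment, none of which are specified here. The function name projected_monthly_usd is hypothetical; the 730-hour month matches the projection above.

```python
HOURS_PER_MONTH = 730  # same monthly approximation as the projection above


def projected_monthly_usd(hourly_usd, active_hours_per_day=24.0):
    """Estimate monthly cost when the deployment runs only part of each day."""
    if not 0 <= active_hours_per_day <= 24:
        raise ValueError("active_hours_per_day must be between 0 and 24")
    return round(hourly_usd * HOURS_PER_MONTH * active_hours_per_day / 24, 2)


print(projected_monthly_usd(4.41))     # h100, always on
print(projected_monthly_usd(4.41, 8))  # h100, about 8 active hours per day
```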
Teardown
requests.delete(f"{BASE}/deployments/{dep['id']}", headers=H)
Destroys the DO GPU instance immediately. Billing stops at teardown.
Next Steps
- Inference Overview — Quick start, embeddings, rerank, audio
- Chat Completions API — Full request/response reference
- RBAC + Orgs — OAuth 2.1 and org management