Dedicated GPU Deployments

Single-tenant inference on reserved DigitalOcean GPU hardware. Traffic is fully isolated: no shared capacity, and no cold starts once a deployment is active (unless you enable scale-to-zero, described under Auto-Scaling). Built to compete with Pipeshift and other enterprise inference platforms.

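Base path: /api/v1/public. Endpoint paths listed below are relative to it; for example, create a deployment with POST /api/v1/public/deployments.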


Why Dedicated vs Serverless

|  | Serverless (Tier 1–3) | Dedicated (Tier 4) |
| --- | --- | --- |
| Cold start | Possible | None (always warm) |
| Isolation | Shared | Single-tenant |
| SLA enforcement | No | TTFT + throughput targets |
| Scaling | Platform-managed | You control min/max |
| BYOM | No | Yes |
| Cost | Per-token | Hourly reserved |
| Best for | Intermittent / dev | Production / enterprise |

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /deployments | Create a deployment |
| GET | /deployments | List your deployments |
| GET | /deployments/{id} | Get deployment status |
| DELETE | /deployments/{id} | Tear down a deployment |
| POST | /deployments/{id}/chat/completions | Proxy inference (SSE supported) |
| PATCH | /deployments/{id}/scaling | Configure auto-scaling |
| GET | /deployments/{id}/scaling | Get scaling config |
| PATCH | /deployments/{id}/sla | Set SLA targets |
| GET | /deployments/{id}/sla | Get SLA + breach history |
| POST | /deployments/{id}/webhooks | Subscribe to events |
| GET | /deployments/{id}/webhooks | List subscriptions |
| DELETE | /deployments/{id}/webhooks/{webhook_id} | Unsubscribe |
| POST | /deployments/{id}/permissions | Grant API key access |
| DELETE | /deployments/{id}/permissions/{key_id} | Revoke access |
| GET | /deployments/{id}/metrics | Real-time metrics |
| PATCH | /deployments/{id}/preset | Apply inference preset |
| GET | /gpu-model-configs | Valid GPU slug + model combinations |

Org-scoped proxy: POST /orgs/{org_id}/deployments/{id}/chat/completions


Create a Deployment

import requests

API_KEY = "sk_your_key"
BASE = "https://api.ainative.studio/api/v1/public"
H = {"Authorization": f"Bearer {API_KEY}"}

dep = requests.post(f"{BASE}/deployments", headers=H, json={
    "name": "prod-llama-70b",
    "model_id": "llama-3.3-70b-instruct",
    "gpu_type": "h100",
    "region": "nyc2",
    "replicas": 1,
}).json()

print(dep["id"])      # dep_abc123
print(dep["status"])  # provisioning

GPU Types

| gpu_type | Hardware | VRAM | Hourly rate |
| --- | --- | --- | --- |
| h100 | 1× NVIDIA H100 | 80 GB | $4.41/hr |
| mi300x | 1× AMD MI300X | 192 GB | $2.59/hr |
| mi300x_8 | 8× AMD MI300X | 1,536 GB | $20.70/hr |
| mi325x | 1× AMD MI325X | 256 GB | $2.98/hr |
| mi325x_8 | 8× AMD MI325X | 2,048 GB | $23.82/hr |

Regions

| region | Location |
| --- | --- |
| atl1 | Atlanta |
| nyc2 | New York |
| tor1 | Toronto |
| ric1 | Richmond |
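
Because a typo in gpu_type or region only surfaces after a network round trip, it can help to check the payload locally first. The following is a minimal sketch, not part of the API: build_deployment_payload is a hypothetical helper, the allowed values are copied from the two tables above, and the replicas >= 1 rule is an assumption.

```python
# Values copied from the GPU Types and Regions tables in this document.
GPU_TYPES = {"h100", "mi300x", "mi300x_8", "mi325x", "mi325x_8"}
REGIONS = {"atl1", "nyc2", "tor1", "ric1"}


def build_deployment_payload(name, model_id, gpu_type, region, replicas=1):
    """Hypothetical helper: validate inputs before POST /deployments."""
    if gpu_type not in GPU_TYPES:
        raise ValueError(f"unknown gpu_type {gpu_type!r}; expected one of {sorted(GPU_TYPES)}")
    if region not in REGIONS:
        raise ValueError(f"unknown region {region!r}; expected one of {sorted(REGIONS)}")
    if replicas < 1:  # assumption: a new deployment needs at least one replica
        raise ValueError("replicas must be at least 1")
    return {
        "name": name,
        "model_id": model_id,
        "gpu_type": gpu_type,
        "region": region,
        "replicas": replicas,
    }


payload = build_deployment_payload(
    "prod-llama-70b", "llama-3.3-70b-instruct", "h100", "nyc2"
)
print(payload)
```

This check does not know which models fit on which GPU; GET /gpu-model-configs remains the authoritative source for valid combinations.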

Status Lifecycle

provisioning → active → scaling → teardown → inactive

Poll until active:

import time

while True:
    r = requests.get(f"{BASE}/deployments/{dep['id']}", headers=H).json()
    if r["status"] == "active":
        print("Ready:", r["endpoint_url"])
        break
    time.sleep(5)
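
The loop above runs forever if a deployment never becomes active. A bounded variant might look like the sketch below. wait_until_active is a hypothetical helper, the 15-minute default timeout is an arbitrary assumption, and the status fetch is passed in as a callable so the logic can be exercised without network access.

```python
import time


def wait_until_active(fetch_status, timeout_s=900, interval_s=5,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the returned dict has status 'active'.

    fetch_status is any zero-argument callable returning the deployment JSON.
    Raises TimeoutError if the deadline passes first.
    """
    deadline = clock() + timeout_s
    while True:
        dep = fetch_status()
        if dep["status"] == "active":
            return dep
        if clock() >= deadline:
            raise TimeoutError(f"deployment still {dep['status']!r} after {timeout_s}s")
        sleep(interval_s)


# Offline demonstration with canned responses instead of HTTP calls.
states = iter([
    {"status": "provisioning"},
    {"status": "provisioning"},
    {"status": "active", "endpoint_url": "https://example.invalid"},
])
ready = wait_until_active(lambda: next(states), sleep=lambda s: None)
print("Ready:", ready["endpoint_url"])
```

Against the real API, the callable would wrap the GET request shown above, for example lambda: requests.get(f"{BASE}/deployments/{dep['id']}", headers=H).json().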

Run Inference

Once active, call the deployment proxy directly:

response = requests.post(
    f"{BASE}/deployments/{dep['id']}/chat/completions",
    headers=H,
    json={
        "model": "llama-3.3-70b-instruct",
        "messages": [{"role": "user", "content": "Explain gradient descent"}],
        "max_tokens": 512,
        "stream": False,
    },
)
print(response.json()["choices"][0]["message"]["content"])

Streaming is also supported: set "stream": True in the request body to receive tokens as server-sent events (SSE).
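
Consuming the stream means parsing SSE lines on the client. The sketch below assumes the chunks follow the OpenAI-style convention (a data: prefix, a JSON body with choices[0].delta.content, and a [DONE] sentinel); that wire format is not stated on this page, so treat it as an assumption and verify it against a real response. iter_sse_text is a hypothetical helper.

```python
import json


def iter_sse_text(lines):
    """Yield text deltas from an iterable of SSE lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(data)
        text = chunk["choices"][0].get("delta", {}).get("content")
        if text:
            yield text


# Canned lines standing in for a live stream.
sample = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "Gradient "}}]}',
    b'data: {"choices": [{"delta": {"content": "descent"}}]}',
    b"data: [DONE]",
]
streamed = "".join(iter_sse_text(sample))
print(streamed)
```

With requests, the same function can be fed response.iter_lines() from a requests.post(..., stream=True) call.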


Inference Presets

Apply hardware + vLLM configuration presets:

requests.patch(f"{BASE}/deployments/{dep['id']}/preset", headers=H, json={
    "preset": "speed"
})

| Preset | GPU | Config | Best for |
| --- | --- | --- | --- |
| speed | H100 | fp16, max batch size | Interactive chat, coding |
| latency | H100 | speculative decoding, min batch | Streaming, realtime |
| cost | MI300X | int8, scale-to-zero | Batch jobs, overnight |
| balanced | MI300X | fp16, auto batch | General production |

Auto-Scaling

requests.patch(f"{BASE}/deployments/{dep['id']}/scaling", headers=H, json={
    "min_replicas": 1,
    "max_replicas": 4,
    "scale_to_zero": True,
    "idle_timeout_seconds": 300,
})

The Celery autoscaler polls every 30s. It reads queue depth and latency from the deployment endpoint:

  • Scale up: 3 consecutive above-threshold checks → add replica
  • Scale down: 3 consecutive below-threshold checks → remove replica
  • Scale to zero: destroys the DO instance after idle_timeout_seconds of no traffic. The next request triggers a cold start while the deployment resumes.
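
The consecutive-check rule in the list above can be sketched as a small state machine. This is an illustration only, not the platform's autoscaler: the single load number, the shared threshold for scaling up and down, and the names ScalerState and step are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ScalerState:
    replicas: int
    above: int = 0  # consecutive checks above the threshold
    below: int = 0  # consecutive checks at or below the threshold


def step(state, load, threshold, min_replicas, max_replicas, needed=3):
    """Process one polling check and return the resulting replica count."""
    if load > threshold:
        state.above += 1
        state.below = 0
    else:
        state.below += 1
        state.above = 0
    if state.above >= needed and state.replicas < max_replicas:
        state.replicas += 1  # scale up after `needed` consecutive high checks
        state.above = 0
    elif state.below >= needed and state.replicas > min_replicas:
        state.replicas -= 1  # scale down after `needed` consecutive low checks
        state.below = 0
    return state.replicas


state = ScalerState(replicas=1)
history = [step(state, load, threshold=5, min_replicas=1, max_replicas=4)
           for load in [9, 9, 9, 9, 1, 1, 1]]
print(history)  # [1, 1, 2, 2, 2, 2, 1]
```

Requiring several consecutive checks keeps a single latency spike from adding a replica; with a 30s poll interval, three checks correspond to roughly 90 seconds of sustained load.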

Custom SLA

Define TTFT, throughput, and concurrency targets. The SLA monitor collects actual metrics every 60s and automatically fires breach webhooks:

requests.patch(f"{BASE}/deployments/{dep['id']}/sla", headers=H, json={
    "ttft_target_ms": 500,         # first token in under 500ms
    "throughput_target_tps": 100,  # 100 tokens/second sustained
    "concurrency_limit": 50,       # max parallel requests
})

Get current SLA status and breach log:

sla = requests.get(f"{BASE}/deployments/{dep['id']}/sla", headers=H).json()
print(sla["breach_count_24h"])
print(sla["last_ttft_ms"])
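
To reason about when a breach would fire, it can help to write the comparison out. The sketch below is a guess at the rule implied by the targets (TTFT above target or throughput below target counts as a breach); find_breaches and the measured field names ttft_ms and throughput_tps are hypothetical, not fields documented for the API.

```python
def find_breaches(targets, measured):
    """Return the names of the SLA targets that the measured sample misses."""
    breaches = []
    if measured["ttft_ms"] > targets["ttft_target_ms"]:
        breaches.append("ttft")  # first token arrived later than the target
    if measured["throughput_tps"] < targets["throughput_target_tps"]:
        breaches.append("throughput")  # tokens/second fell under the target
    return breaches


targets = {"ttft_target_ms": 500, "throughput_target_tps": 100}
print(find_breaches(targets, {"ttft_ms": 620, "throughput_tps": 140}))  # ['ttft']
print(find_breaches(targets, {"ttft_ms": 300, "throughput_tps": 140}))  # []
```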

Webhooks

Subscribe to push events so your system reacts automatically:

requests.post(f"{BASE}/deployments/{dep['id']}/webhooks", headers=H, json={
    "url": "https://your-service.com/hooks/ainative",
    "events": [
        "deployment.active",
        "deployment.degraded",
        "deployment.sla_breach",
        "deployment.recovered",
        "deployment.scaled",
    ],
})

Event types:

| Event | Triggered when |
| --- | --- |
| deployment.active | Provisioning complete, inference ready |
| deployment.degraded | DO API reports degraded state |
| deployment.recovered | Degraded → healthy |
| deployment.sla_breach | TTFT or throughput target missed |
| deployment.scaled | Autoscaler changed replica count |

Verify the signature:

import hashlib
import hmac

def verify_webhook(body: bytes, sig_header: str, secret: str) -> bool:
    # Recompute the HMAC-SHA256 of the raw body and compare in constant time
    mac = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(f"sha256={mac}", sig_header)
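
A quick way to test a receiver locally is to sign a payload yourself and run it through the verifier. The sketch below repeats the verification logic so it is self-contained; sign_webhook and the secret value are hypothetical, and the sha256=<hex> format is taken from the function above rather than from a documented header specification.

```python
import hashlib
import hmac


def sign_webhook(body: bytes, secret: str) -> str:
    """Produce a signature in the sha256=<hex> form the verifier expects."""
    mac = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return f"sha256={mac}"


def verify_webhook(body: bytes, sig_header: str, secret: str) -> bool:
    return hmac.compare_digest(sign_webhook(body, secret), sig_header)


body = b'{"event": "deployment.sla_breach"}'
secret = "whsec_example"  # placeholder secret for local testing only
header = sign_webhook(body, secret)

print(verify_webhook(body, header, secret))         # True
print(verify_webhook(body + b" ", header, secret))  # False: body was altered
```

The second check shows why verification should run on the raw request bytes: re-serializing the JSON before hashing can change whitespace and invalidate an otherwise genuine signature.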

Retries use exponential backoff — up to 5 attempts over 10 minutes.
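
The exact retry schedule is not documented here. As a worked example only, one schedule consistent with "5 attempts over 10 minutes" doubles the wait from an assumed 40-second base; backoff_delays, the base, and the factor are all assumptions chosen to make the arithmetic fit.

```python
def backoff_delays(attempts=5, base_s=40, factor=2):
    """Delays in seconds before each retry; the first attempt is immediate."""
    return [base_s * factor**i for i in range(attempts - 1)]


delays = backoff_delays()
print(delays)       # [40, 80, 160, 320]
print(sum(delays))  # 600 seconds, i.e. 10 minutes
```

Whatever the real intervals are, a receiver should be idempotent, since a slow acknowledgement can cause the same event to be delivered more than once.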


RBAC

Scope API keys to specific deployments. Useful when each customer of your product should only access their own inference endpoint:

# Grant access
requests.post(f"{BASE}/deployments/{dep['id']}/permissions", headers=H, json={
    "api_key_id": "key_cust_123",
    "role": "inference",
})

# Revoke
requests.delete(f"{BASE}/deployments/{dep['id']}/permissions/key_cust_123", headers=H)

Scoped keys that try to call a deployment they're not granted access to receive 403 Forbidden.


Org-Scoped Inference

Enterprise orgs can route inference through the org namespace with membership-enforced RBAC:

ORG_ID = "org_your_id"  # placeholder: replace with your organization ID

response = requests.post(
    f"{BASE}/orgs/{ORG_ID}/deployments/{dep['id']}/chat/completions",
    headers=H,
    json={
        "model": "llama-3.3-70b-instruct",
        "messages": [{"role": "user", "content": "Hello"}],
    },
)

The caller must be a member of ORG_ID and have deployment:invoke permission on the target deployment.


BYOM — Bring Your Own Model

Upload custom GGUF weights to serve on your dedicated deployment:

curl -X POST https://api.ainative.studio/api/v1/public/models/upload \
  -H "Authorization: Bearer $AINATIVE_API_KEY" \
  -F "file=@my-finetune-7b.gguf" \
  -F "name=my-model-v2" \
  -F "deployment_id=dep_abc123"

Poll status until ready:

# model_id: ID of the uploaded model (assumed to come from the upload response)
while True:
    m = requests.get(f"{BASE}/models/byom/{model_id}", headers=H).json()
    if m["status"] == "ready":
        break
    time.sleep(10)

Call via standard chat completions:

{ "model": "byom:my-model-v2", "messages": [...] }

Requirements: GGUF format, max 20 GB per file. Weights are stored in DigitalOcean Spaces.
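
Uploading a multi-gigabyte file only to have it rejected is slow, so a local pre-check against these two requirements can save time. The sketch below is a hypothetical helper, not part of the API. It relies on the fact that GGUF files begin with the four ASCII bytes GGUF, and it reads "20 GB" conservatively as 20 × 10^9 bytes, which is an assumption since the unit is not spelled out.

```python
import os
import tempfile

MAX_BYTES = 20 * 1000**3  # assumption: "20 GB" read as decimal gigabytes


def check_gguf(path):
    """Raise ValueError if the file is too large or lacks the GGUF magic."""
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        raise ValueError(f"{path} is {size} bytes; limit is {MAX_BYTES}")
    with open(path, "rb") as f:
        magic = f.read(4)
    if magic != b"GGUF":
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    return size


# Demonstration on a tiny stand-in file rather than real weights.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"GGUF" + b"\x00" * 12)
    demo_path = tmp.name

checked_size = check_gguf(demo_path)
print(checked_size)  # 16
os.remove(demo_path)
```

Passing this check does not guarantee the server accepts the file; it only rules out the two failure modes named in the requirements.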


Billing

Dedicated deployments bill hourly from provisioning until teardown — regardless of traffic. Charges accrue to your AINative ledger and aggregate to your Stripe invoice daily.

# Check projected monthly cost
dep = requests.get(f"{BASE}/deployments/{dep['id']}", headers=H).json()
print(f"${dep['hourly_cost_usd']}/hr × 730 = ${dep['hourly_cost_usd'] * 730:.2f}/mo projected")

Enable scale-to-zero (see Auto-Scaling) to stop billing while a deployment is idle.
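
For budgeting before a deployment exists, the same 730-hour projection can be computed from the published rates. The sketch below copies the hourly rates from the GPU Types table; projected_monthly_cost is a hypothetical helper, and it assumes that cost scales linearly with replica count and that the deployment is never scaled to zero.

```python
from decimal import Decimal

# Hourly rates copied from the GPU Types table in this document.
HOURLY_RATES_USD = {
    "h100": Decimal("4.41"),
    "mi300x": Decimal("2.59"),
    "mi300x_8": Decimal("20.70"),
    "mi325x": Decimal("2.98"),
    "mi325x_8": Decimal("23.82"),
}
HOURS_PER_MONTH = 730


def projected_monthly_cost(gpu_type, replicas=1):
    """Always-on monthly cost; assumes linear scaling with replicas."""
    return HOURLY_RATES_USD[gpu_type] * HOURS_PER_MONTH * replicas


for gpu in HOURLY_RATES_USD:
    print(f"{gpu}: ${projected_monthly_cost(gpu)}/mo")

print(projected_monthly_cost("h100"))  # 3219.30
```

Decimal is used instead of float so the projected figures match the table rates to the cent.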


Teardown

requests.delete(f"{BASE}/deployments/{dep['id']}", headers=H)

Destroys the DO GPU instance immediately. Billing stops at teardown.

