Dedicated GPU Deployments

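Single-tenant inference on reserved DigitalOcean GPU hardware. Traffic is fully isolated: no shared capacity, and no cold starts while at least one replica is running (scale-to-zero, if enabled, reintroduces a cold start on the first request after an idle period). Built to compete with Pipeshift and other enterprise inference platforms.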

Base path: /api/v1/public (endpoint paths below are relative to it; for example, POST /api/v1/public/deployments creates a deployment)

Serving vs training

A dedicated deployment serves a model (steady-state, kept warm, latency-optimized). If you instead need a raw GPU VM to train or fine-tune a model and then tear it down, use GPU Training Nodes.

Why Dedicated vs Serverless

Feature	Serverless (Tier 1–3)	Dedicated (Tier 4)
Cold start	Possible	None — always warm
Isolation	Shared	Single-tenant
SLA enforcement	No	TTFT + throughput targets
Scaling	Platform-managed	You control min/max
BYOM	No	Yes
Cost	Per-token	Hourly reserved
Best for	Intermittent / dev	Production / enterprise

Endpoints

Method	Path	Description
`POST`	`/deployments`	Create a deployment
`GET`	`/deployments`	List your deployments
`GET`	`/deployments/{id}`	Get deployment status
`DELETE`	`/deployments/{id}`	Tear down a deployment
`POST`	`/deployments/{id}/chat/completions`	Proxy inference (SSE supported)
`PATCH`	`/deployments/{id}/scaling`	Configure auto-scaling
`GET`	`/deployments/{id}/scaling`	Get scaling config
`PATCH`	`/deployments/{id}/sla`	Set SLA targets
`GET`	`/deployments/{id}/sla`	Get SLA + breach history
`POST`	`/deployments/{id}/webhooks`	Subscribe to events
`GET`	`/deployments/{id}/webhooks`	List subscriptions
`DELETE`	`/deployments/{id}/webhooks/{webhook_id}`	Unsubscribe
`POST`	`/deployments/{id}/permissions`	Grant API key access
`DELETE`	`/deployments/{id}/permissions/{key_id}`	Revoke access
`GET`	`/deployments/{id}/metrics`	Real-time metrics
`PATCH`	`/deployments/{id}/preset`	Apply inference preset
`GET`	`/gpu-model-configs`	Valid GPU slug + model combinations

Org-scoped proxy: POST /orgs/{org_id}/deployments/{id}/chat/completions

Create a Deployment

import requests

API_KEY = "sk_your_key"
BASE = "https://api.ainative.studio/api/v1/public"
H = {"Authorization": f"Bearer {API_KEY}"}

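resp = requests.post(f"{BASE}/deployments", headers=H, json={
    "name": "prod-llama-70b",
    "model_id": "llama-3.3-70b-instruct",
    "gpu_type": "h100",   # see GPU Types below
    "region": "nyc2",     # see Regions below
    "replicas": 1,
})
resp.raise_for_status()  # fail on 4xx/5xx instead of parsing an error body
dep = resp.json()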

print(dep["id"])      # dep_abc123
print(dep["status"])  # provisioning

GPU Types

`gpu_type`	Hardware	VRAM	Hourly rate
`h100`	1× NVIDIA H100	80 GB	$4.41/hr
`mi300x`	1× AMD MI300X	192 GB	$2.59/hr
`mi300x_8`	8× AMD MI300X	1,536 GB	$20.70/hr
`mi325x`	1× AMD MI325X	256 GB	$2.98/hr
`mi325x_8`	8× AMD MI325X	2,048 GB	$23.82/hr
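
To compare the rows above, a small calculation can help. The sketch below copies the VRAM and hourly figures from the table, projects a monthly cost using 730 hours per month (the multiplier this page uses under Billing), and picks the cheapest GPU type that meets a VRAM requirement. The function names are illustrative and not part of the API, and the comparison looks only at price and VRAM, not throughput.

```python
from decimal import Decimal

# gpu_type -> (VRAM in GB, hourly rate in USD), copied from the table above
GPU_TYPES = {
    "h100": (80, Decimal("4.41")),
    "mi300x": (192, Decimal("2.59")),
    "mi300x_8": (1536, Decimal("20.70")),
    "mi325x": (256, Decimal("2.98")),
    "mi325x_8": (2048, Decimal("23.82")),
}

HOURS_PER_MONTH = 730  # same multiplier as the Billing example on this page


def monthly_cost(gpu_type: str) -> Decimal:
    """Projected cost of one instance of gpu_type running for a full month."""
    _, hourly_rate = GPU_TYPES[gpu_type]
    return hourly_rate * HOURS_PER_MONTH


def cheapest_fit(required_vram_gb: int) -> str:
    """Return the lowest-priced gpu_type with at least required_vram_gb of VRAM."""
    candidates = [
        (rate, name)
        for name, (vram_gb, rate) in GPU_TYPES.items()
        if vram_gb >= required_vram_gb
    ]
    if not candidates:
        raise ValueError(f"no GPU type offers {required_vram_gb} GB of VRAM")
    return min(candidates)[1]


if __name__ == "__main__":
    for name in GPU_TYPES:
        print(f"{name}: ${monthly_cost(name):.2f}/mo")
    print("cheapest with >= 200 GB:", cheapest_fit(200))
```

At these rates the 8-GPU nodes cost slightly less than eight single-GPU instances (for example, $20.70/hr versus 8 × $2.59 = $20.72/hr for MI300X).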

Regions

`region`	Location
`atl1`	Atlanta
`nyc2`	New York
`tor1`	Toronto
`ric1`	Richmond

Status Lifecycle

provisioning → active → scaling → teardown → inactive

Poll until active:

import time

while True:
    r = requests.get(f"{BASE}/deployments/{dep['id']}", headers=H).json()
    if r["status"] == "active":
        print("Ready:", r["endpoint_url"])
        break
    time.sleep(5)
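
The loop above runs forever if provisioning never completes. A bounded variant might look like the sketch below. The 15-minute default timeout and the choice to treat `teardown` and `inactive` as terminal failures are assumptions, not documented behavior. The status fetch is passed in as a callable so the helper can be exercised without network access.

```python
import time
from typing import Callable


def wait_until_active(
    fetch_status: Callable[[], dict],
    timeout_s: float = 900,      # assumption: give provisioning up to 15 minutes
    interval_s: float = 5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll fetch_status() until the deployment is active, or raise."""
    deadline = clock() + timeout_s
    while True:
        dep = fetch_status()
        status = dep.get("status")
        if status == "active":
            return dep
        # Assumption: these lifecycle states mean the deployment will not come up.
        if status in ("teardown", "inactive"):
            raise RuntimeError(f"deployment entered {status!r} before becoming active")
        if clock() >= deadline:
            raise TimeoutError(f"deployment still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

With the names from the earlier snippets, a call could be `wait_until_active(lambda: requests.get(f"{BASE}/deployments/{dep['id']}", headers=H).json())`.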

Run Inference

Once the deployment is active, call its proxy endpoint directly:

response = requests.post(
    f"{BASE}/deployments/{dep['id']}/chat/completions",
    headers=H,
    json={
        "model": "llama-3.3-70b-instruct",
        "messages": [{"role": "user", "content": "Explain gradient descent"}],
        "max_tokens": 512,
        "stream": False,
    },
)
print(response.json()["choices"][0]["message"]["content"])

Streaming is also supported: set "stream": True in the request body (serialized as JSON true) to receive tokens as server-sent events (SSE).
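
Reading a streamed response means parsing SSE lines as they arrive. The sketch below assumes the stream uses OpenAI-style chunks (`data: {...}` lines carrying `choices[0].delta.content`, ended by `data: [DONE]`). That chunk format is an assumption based on the non-streaming response shape shown above, so confirm it against the Chat Completions API reference.

```python
import json
from typing import Iterable, Iterator, Union


def iter_sse_content(lines: Iterable[Union[bytes, str]]) -> Iterator[str]:
    """Yield text fragments from an SSE stream of chat-completion chunks."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text
```

With `requests`, pass `stream=True` to `requests.post(...)` alongside `"stream": True` in the JSON body, then feed `response.iter_lines()` to `iter_sse_content` and print each fragment as it arrives.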

Inference Presets

Apply hardware + vLLM configuration presets:

requests.patch(f"{BASE}/deployments/{dep['id']}/preset", headers=H, json={
    "preset": "speed"
})

Preset	GPU	Config	Best for
`speed`	H100	fp16, max batch size	Interactive chat, coding
`latency`	H100	speculative decoding, min batch	Streaming, realtime
`cost`	MI300X	int8, scale-to-zero	Batch jobs, overnight
`balanced`	MI300X	fp16, auto batch	General production

Auto-Scaling

requests.patch(f"{BASE}/deployments/{dep['id']}/scaling", headers=H, json={
    "min_replicas": 1,
    "max_replicas": 4,
    "scale_to_zero": True,
    "idle_timeout_seconds": 300,
})

The Celery autoscaler polls every 30 seconds, reading queue depth and latency from the deployment endpoint, and applies these rules:

Scale up: 3 consecutive above-threshold checks → add replica
Scale down: 3 consecutive below-threshold checks → remove replica
Scale to zero: destroys the DO instance after idle_timeout_seconds without traffic. The next request triggers a cold start while a new instance is provisioned.
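
The "3 consecutive checks" rule is a form of hysteresis: a single spike or dip does not change the replica count. The sketch below illustrates that rule only and is not the platform's autoscaler code. It uses one load number and two thresholds (`high`, `low`), both hypothetical, and it does not model scale-to-zero.

```python
from dataclasses import dataclass

REQUIRED_STREAK = 3  # consecutive checks needed before acting, per the rules above


@dataclass
class ScalerState:
    replicas: int
    above: int = 0  # consecutive checks above the high threshold
    below: int = 0  # consecutive checks below the low threshold


def step(state: ScalerState, load: float, *, high: float, low: float,
         min_replicas: int, max_replicas: int) -> ScalerState:
    """Apply one polling result and return the updated state."""
    if load > high:
        state.above, state.below = state.above + 1, 0
    elif load < low:
        state.above, state.below = 0, state.below + 1
    else:
        state.above = state.below = 0  # inside the band: any streak is broken

    if state.above >= REQUIRED_STREAK:
        state.replicas = min(state.replicas + 1, max_replicas)
        state.above = 0
    elif state.below >= REQUIRED_STREAK:
        state.replicas = max(state.replicas - 1, min_replicas)
        state.below = 0
    return state
```

With a 30-second poll, three consecutive checks mean a load change has to persist across three polls before a replica is added or removed.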

Custom SLA

Define TTFT, throughput, and concurrency targets. The SLA monitor collects actual metrics every 60 seconds and fires a deployment.sla_breach webhook when a TTFT or throughput target is missed:

requests.patch(f"{BASE}/deployments/{dep['id']}/sla", headers=H, json={
    "ttft_target_ms": 500,          # first token in under 500ms
    "throughput_target_tps": 100,   # 100 tokens/second sustained
    "concurrency_limit": 50,        # max parallel requests
})

Get current SLA status and breach log:

sla = requests.get(f"{BASE}/deployments/{dep['id']}/sla", headers=H).json()
print(sla["breach_count_24h"])
print(sla["last_ttft_ms"])
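
A breach check compares observed metrics with the targets: TTFT breaches when it is above its target, throughput when it is below. The sketch below shows that comparison under assumed field names for the observed values (`ttft_ms`, `throughput_tps`). The monitor's real logic and payload shape are not documented here.

```python
from typing import Dict, List


def sla_breaches(targets: Dict[str, float], observed: Dict[str, float]) -> List[str]:
    """Return which SLA targets the observed metrics miss."""
    breaches = []
    # TTFT is a ceiling: slower than the target counts as a breach.
    if observed["ttft_ms"] > targets["ttft_target_ms"]:
        breaches.append("ttft")
    # Throughput is a floor: fewer tokens/second than the target counts as a breach.
    if observed["throughput_tps"] < targets["throughput_target_tps"]:
        breaches.append("throughput")
    return breaches


if __name__ == "__main__":
    targets = {"ttft_target_ms": 500, "throughput_target_tps": 100}
    print(sla_breaches(targets, {"ttft_ms": 620, "throughput_tps": 140}))
```

concurrency_limit is left out on purpose: this sketch treats it as a cap on parallel requests, not a measured target.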

Webhooks

Subscribe to push events so your system reacts automatically:

requests.post(f"{BASE}/deployments/{dep['id']}/webhooks", headers=H, json={
    "url": "https://your-service.com/hooks/ainative",
    "events": [
        "deployment.active",
        "deployment.degraded",
        "deployment.sla_breach",
        "deployment.recovered",
        "deployment.scaled",
    ],
})

Event types:

Event	Triggered when
`deployment.active`	Provisioning complete, inference ready
`deployment.degraded`	DO API reports degraded state
`deployment.recovered`	Degraded → healthy
`deployment.sla_breach`	TTFT or throughput target missed
`deployment.scaled`	Autoscaler changed replica count

Verify the signature:

import hmac, hashlib

def verify_webhook(body: bytes, sig_header: str, secret: str) -> bool:
    mac = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(f"sha256={mac}", sig_header)

Failed deliveries are retried with exponential backoff, up to 5 attempts over 10 minutes.
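
To test a receiver locally, it helps to produce a header value in the same `sha256=<hex digest>` format that `verify_webhook` checks. The sketch below adds a hypothetical `sign_webhook` helper and repeats the verification so the snippet runs on its own. The payload shape and secret are placeholders.

```python
import hashlib
import hmac
import json


def sign_webhook(body: bytes, secret: str) -> str:
    """Build a signature value in the sha256=<hex> format checked above."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify_webhook(body: bytes, sig_header: str, secret: str) -> bool:
    """Same check as the function above, repeated so this snippet is standalone."""
    return hmac.compare_digest(sign_webhook(body, secret), sig_header)


# Hypothetical payload; the real event schema is not documented on this page.
body = json.dumps({"event": "deployment.sla_breach"}).encode()
signature = sign_webhook(body, "test_secret")

print(verify_webhook(body, signature, "test_secret"))         # valid signature
print(verify_webhook(body + b" ", signature, "test_secret"))  # body changed after signing
```

Verify against the raw request bytes exactly as received. Parsing the JSON and re-serializing it can change whitespace or key order, which changes the digest.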

RBAC

Scope API keys to specific deployments. Useful when each customer of your product should only access their own inference endpoint:

# Grant access
requests.post(f"{BASE}/deployments/{dep['id']}/permissions", headers=H, json={
    "api_key_id": "key_cust_123",
    "role": "inference",
})

# Revoke
requests.delete(f"{BASE}/deployments/{dep['id']}/permissions/key_cust_123", headers=H)

A scoped key that calls a deployment it has not been granted access to receives 403 Forbidden.

Org-Scoped Inference

Enterprise orgs can route inference through the org namespace with membership-enforced RBAC:

ORG_ID = "your_org_id"  # placeholder: replace with your organization ID

response = requests.post(
    f"{BASE}/orgs/{ORG_ID}/deployments/{dep['id']}/chat/completions",
    headers=H,
    json={
        "model": "llama-3.3-70b-instruct",
        "messages": [{"role": "user", "content": "Hello"}],
    },
)

The caller must be a member of ORG_ID and have deployment:invoke permission on the target deployment.

BYOM — Bring Your Own Model

Upload custom GGUF weights to serve on your dedicated deployment:

curl -X POST https://api.ainative.studio/api/v1/public/models/upload \
  -H "Authorization: Bearer $AINATIVE_API_KEY" \
  -F "file=@my-finetune-7b.gguf" \
  -F "name=my-model-v2" \
  -F "deployment_id=dep_abc123"

Poll status until ready:

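# model_id identifies the uploaded model; it is assumed here to come from the upload response
while True:
    m = requests.get(f"{BASE}/models/byom/{model_id}", headers=H).json()
    if m["status"] == "ready":
        break
    time.sleep(10)  # check again in 10 seconds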

Call via standard chat completions:

{ "model": "byom:my-model-v2", "messages": [...] }

Requirements: GGUF format, max 20 GB per file. Weights are stored in DigitalOcean Spaces.
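
Both requirements can be checked locally before starting a large upload. The sketch below checks the file size and the 4-byte `GGUF` magic at the start of the file. Reading "20 GB" as 20 × 10^9 bytes is an assumption, and the server may apply further validation that this does not cover.

```python
import os

GGUF_MAGIC = b"GGUF"            # GGUF files begin with these four bytes
MAX_UPLOAD_BYTES = 20 * 10**9   # assumption: "20 GB" read as decimal gigabytes


def check_gguf(path: str, max_bytes: int = MAX_UPLOAD_BYTES) -> None:
    """Raise ValueError if the file is too large or does not look like GGUF."""
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(f"{path} is {size} bytes; limit is {max_bytes}")
    with open(path, "rb") as f:
        if f.read(4) != GGUF_MAGIC:
            raise ValueError(f"{path} does not start with the GGUF magic bytes")
```

Calling `check_gguf("my-finetune-7b.gguf")` before the curl upload fails fast on a wrong or oversized file.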

Billing

Dedicated deployments bill hourly from provisioning until teardown, regardless of traffic; the exception is a deployment scaled to zero, which stops billing while idle. Charges accrue to your AINative ledger and aggregate to your Stripe invoice daily.

# Check projected monthly cost
dep = requests.get(f"{BASE}/deployments/{dep['id']}", headers=H).json()
print(f"${dep['hourly_cost_usd']}/hr × 730 = ${dep['hourly_cost_usd'] * 730:.2f}/mo projected")

Scale to zero to stop billing when idle.

Teardown

requests.delete(f"{BASE}/deployments/{dep['id']}", headers=H)

Destroys the DO GPU instance immediately. Billing stops at teardown.

Next Steps

GPU Training Nodes — Raw GPU VMs for training / fine-tuning (spin up → train → tear down)
Inference Overview — Quick start, embeddings, rerank, audio
Chat Completions API — Full request/response reference
RBAC + Orgs — OAuth 2.1 and org management