Inference Guide

Ready-to-deploy recipes for validated open-weight LLMs on Alauda AI. Each model in this guide has been deployed end-to-end on a real cluster and benchmarked, so you get a known-good deployment manifest, the runtime image that serves it, and the throughput you can expect.

The models here were validated on Huawei Ascend 910B4 NPU with the community vLLM-Ascend engine, deployed through Alauda AI's InferNex surface — a KServe LLMInferenceService reconciled by the InferNex-Bridge into a load-aware router (hermes-router / EPP) in front of the vLLM-Ascend instances. All were run through the same InferNex aggregation surface and the same two benchmark scenarios (the two Qwen models share an identical 2 × TP=4 topology and are directly comparable; the larger DeepSeek MoE uses 1 × TP=8). For the runtime model (KServe, ModelCar storage, scheduling) see Model Deployment & Inference.

Validated models

Model	Type	Params	dtype	Device	Engine	Model page
Qwen3-32B	Qwen3 dense	32B	BF16	Ascend 910B4 ×8	vLLM-Ascend v0.18.0	Qwen3-32B
Qwen3.6-27B (W8A8)	`qwen3_5` hybrid (GDN + MTP)	27.78B	W8A8	Ascend 910B4 ×8	vLLM-Ascend nightly	Qwen3.6-27B (W8A8)
DeepSeek-V4-Flash (W4A8)	`deepseek_v4` MoE (MLA + DSA + MTP)	256-expert MoE	W4A8	Ascend 910B4 ×8	vLLM-Ascend nightly	DeepSeek-V4-Flash (W4A8)

All three were validated on Ascend 910B4 (32 GB/card), driven through KServe LLMInferenceService with load-aware routing (InferNex-Bridge + hermes-router). The two Qwen models run as an 8-card aggregation: 2 instances × TP=4. DeepSeek-V4-Flash (~151 GB W4A8) fills all 8 cards as 1 instance × TP=8, and was additionally validated through the MaaS gateway (API-key) ingress alongside the internal KServe ingress.
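The card counts above follow from instances × tensor-parallel degree. The sketch below reproduces that arithmetic and adds a rough per-card weight estimate for DeepSeek-V4-Flash. The even split of weights across TP ranks is an assumption made for illustration; real per-card memory also holds KV cache and activations, so treat the figure as a sanity check rather than a measurement.

```shell
# Cards used = instances x tensor-parallel degree.
qwen_cards=$(( 2 * 4 ))       # two Qwen instances, TP=4 each
deepseek_cards=$(( 1 * 8 ))   # one DeepSeek instance, TP=8
echo "Qwen: $qwen_cards cards, DeepSeek: $deepseek_cards cards"

# Approximate DeepSeek weight footprint per card, ASSUMING the ~151 GB
# checkpoint shards evenly over 8 ranks (illustrative, not measured).
per_card=$(awk 'BEGIN { printf "%.1f", 151 / 8 }')
echo "DeepSeek weights per card: ~${per_card} GB of 32 GB"
```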

Runtime images

Engine	Device	Image (validated tag)	Used by	Notes
vLLM-Ascend v0.18.0	Huawei Ascend NPU	`quay.io/ascend/vllm-ascend:v0.18.0-openeuler`	Qwen3-32B	Community vLLM for Ascend, V1 engine. Serves standard Qwen3 dense models directly. Match the tag's CANN version to your host NPU driver.
vLLM-Ascend nightly (release-pinned)	Huawei Ascend NPU	`quay.io/ascend/vllm-ascend:nightly-releases-v0.22.1rc-openeuler`	Qwen3.6-27B (W8A8), DeepSeek-V4-Flash (W4A8)	Carries the `qwen3_5` Gated DeltaNet hybrid + MTP and the `deepseek_v4` MoE stack — the stock `v0.18.0` cannot load either. Use this release-pinned tag, not the moving `nightly-main-openeuler` (it drifted to a broken build whose TP workers crash).

TIP

The Ascend CANN images are arm64. Always match the runtime image's CANN version to the host NPU driver on your nodes. Only the engines actually used in this guide are listed; other engines (MindIE, SGLang, …) were not benchmarked at this size.

Benchmark scenarios

All three models were measured with aiperf against the same two scenarios, modelled on real serving patterns. Output is pinned to 128 tokens and load is closed-loop at a fixed concurrency of 4 (4 in-flight requests). Each scenario ran 240 requests.

Scenario	What it models	Dataset	Request shape
① Fixed-length system-prompt reuse	A reused system prompt — performance when request length is fixed and KV cache repeats.	60 distinct 8k-token system prompts, each reused 4 times = 240 requests.	ISL ~8k / OSL 128
② Multi-turn dialogue	Multiple users in a continuing conversation, several requests per session.	60 independent users × 4 rounds = 240 requests.	Round 1 sends 16k and the model replies 128 tokens; round 2 adds 1k (~17k total); and so on for 4 rounds (16k / 17k / 18k / 19k), averaging ~17.5k ISL / OSL 128.
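The request counts and the ~17.5k average input length in the table can be checked with a few lines of shell arithmetic. The per-round input lengths use the nominal 16k / 17k / 18k / 19k values from the table; actual token counts are approximate.

```shell
# Scenario 1: 60 distinct system prompts, each sent 4 times.
s1_requests=$(( 60 * 4 ))

# Scenario 2: 60 users x 4 rounds; input grows by ~1k tokens per round.
s2_requests=$(( 60 * 4 ))
total=0
for isl in 16000 17000 18000 19000; do
  total=$(( total + isl ))
done
s2_avg_isl=$(( total / 4 ))

echo "scenario 1: $s1_requests requests"
echo "scenario 2: $s2_requests requests, average ISL $s2_avg_isl tokens"
```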

NOTE

Both scenarios run on a single-node 8-card deployment (Qwen models: 2 instances × TP=4; DeepSeek-V4-Flash: 1 instance × TP=8). Latency (TTFT / ITL / E2E) is the per-instance operating point under steady 2-in-flight load; total throughput (TPS) is the aggregate across the instances and scales with the instance count. TPS counts total tokens (input + output); the decode-only output rate is reported separately and is much smaller under these long-input workloads. DeepSeek-V4-Flash additionally ran each scenario through the MaaS gateway (API-key) ingress as well as the internal KServe ingress; see its page.
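To see why the decode-only rate is so much smaller than total-token TPS, it helps to compute what fraction of each request's tokens are output tokens. The sketch below uses the nominal ISL/OSL values from the scenario table; `output_share` is a hypothetical helper name, and the even split of the 4 in-flight requests across the two Qwen instances is an assumption about how the router balances load.

```shell
# In-flight requests per instance = client concurrency / instance count
# (Qwen layout: concurrency 4 over 2 instances, assuming an even split).
per_instance=$(( 4 / 2 ))
echo "in-flight per Qwen instance: $per_instance"

# Output tokens as a percentage of all tokens in one request:
# 100 * OSL / (ISL + OSL).
output_share() {
  awk -v isl="$1" -v osl="$2" 'BEGIN { printf "%.2f", 100 * osl / (isl + osl) }'
}

echo "scenario 1 output share: $(output_share 8000 128)%"
echo "scenario 2 output share: $(output_share 17500 128)%"
```

With inputs this long, output tokens are on the order of 1% of the total, so a total-token TPS figure is dominated by prefill.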

Deploy a validated model

Each model page links self-contained YAMLs under assets/ that hold the real InferNex deployment — a KServe LLMInferenceService (infernex.io/runtime: true) plus the two LLMInferenceServiceConfig objects (engine template + hermes-router/EPP template) that the InferNex-Bridge reconciles into the running instances.

base=https://raw.githubusercontent.com/alauda/aml-docs/master/docs/en/inference_guide/assets

# 1. Download the manifest and edit it: set the namespace and the image tag
#    (match your CANN / host driver). model.uri already points at the public
#    ModelCar on Docker Hub (oci://docker.io/alaudadockerhub/...); repoint it
#    only if you mirror locally.
curl -sLO "$base/qwen3-32b/qwen3-32b-agg-base-llmisvc.yaml"

# 2. Apply the edited file. The InferNex-Bridge reconciles the LLMInferenceService
#    into the hermes-router (EPP) + the vLLM-Ascend instances.
kubectl apply -f qwen3-32b-agg-base-llmisvc.yaml

# 3. Watch it come up (first start loads weights + compiles graphs — can take minutes).
kubectl -n <your-namespace> get llminferenceservice -w

# 4. Call the OpenAI-compatible endpoint through the gateway.
curl -s http://<gateway>/<namespace>/qwen3-32b-agg-base/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen/Qwen3-32B","messages":[{"role":"user","content":"hello"}]}'
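For repeated checks, the call above can be wrapped in a small script. This is a minimal sketch, not part of the published assets: `chat_payload` and `smoke_test` are hypothetical helper names, and `max_tokens` (a standard OpenAI chat-completions field) is set to 128 only to mirror the benchmark's output length.

```shell
# Build a chat-completions request body. No JSON escaping is done, so the
# prompt must not contain double quotes or backslashes.
chat_payload() {
  model=$1; prompt=$2; max_tokens=${3:-128}
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":%s}\n' \
    "$model" "$prompt" "$max_tokens"
}

# POST the body to the service. -f makes curl exit non-zero on HTTP errors,
# so the function's exit code can gate a script or CI step.
smoke_test() {
  endpoint=$1   # e.g. http://<gateway>/<namespace>/qwen3-32b-agg-base
  chat_payload "Qwen/Qwen3-32B" "hello" 128 |
    curl -sf "$endpoint/v1/chat/completions" \
      -H 'Content-Type: application/json' -d @-
}
```

Call it as `smoke_test "http://<gateway>/<namespace>/qwen3-32b-agg-base"` once the LLMInferenceService reports ready.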

Caveats

These manifests deploy through InferNex (LLMInferenceService + InferNex-Bridge + hermes-router). The two LLMInferenceServiceConfig objects live in the kserve namespace; the LLMInferenceService lives in your deployment namespace.
Resource keys are for Ascend 910B4 (huawei.com/Ascend910). Adjust the resource key, image, and version fields for your actual NPU model.
The ModelCar images are public on Docker Hub under alaudadockerhub — the manifests pull them with no credentials. Mirror them to your own registry and repoint model.uri if you prefer; the modelcar pull secret in the manifest is only needed for a private registry.
The benchmark numbers were measured closed-loop (concurrency 4) on 8 cards. Treat them as the per-instance operating point under steady load, not a saturation ceiling.

Verify the ModelCar signature

The ModelCar images are signed with Cosign. Verify an image against the published public key (cosign.pub) before deploying:

curl -sO https://raw.githubusercontent.com/alauda/aml-docs/master/docs/en/inference_guide/cosign.pub

cosign verify --key cosign.pub --insecure-ignore-tlog=true \
  docker.io/alaudadockerhub/modelcar-qwen3-32b:v0.1.0

The three signed images and their digests:

Image	Digest
`docker.io/alaudadockerhub/modelcar-qwen3-32b:v0.1.0`	`sha256:eccffdb567038196638eb7e1c6bb9572d8bcc7829d11fcdfa17f2c43fbca0c6e`
`docker.io/alaudadockerhub/modelcar-qwen3-6-27b-w8a8:v0.1.0`	`sha256:5eb8841a29a4f3d659f8cef25c7f8b16208b32265db23354dae79c4c04e8ef79`
`docker.io/alaudadockerhub/modelcar-deepseek-v4-flash-w4a8-mtp:v0.1.0`	`sha256:ef57a46726bec9c6505504262beaa55ec6c45d595a1726ab90d594f42606c6be`

--insecure-ignore-tlog=true is required because these were signed with --tlog-upload=false (no public transparency-log entry); verification relies on the public key alone.
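To verify all three images in one pass, and to pin each check to a digest rather than a mutable tag, a loop along these lines may help. It is a sketch under two assumptions: the digests in the table are the image manifest digests, and `cosign.pub` has already been downloaded to the current directory. `image_refs` and `verify_all` are hypothetical helper names.

```shell
# Print one digest-pinned reference per signed ModelCar image
# (values copied from the table; tag dropped, digest appended).
image_refs() {
  repo=docker.io/alaudadockerhub
  cat <<EOF
$repo/modelcar-qwen3-32b@sha256:eccffdb567038196638eb7e1c6bb9572d8bcc7829d11fcdfa17f2c43fbca0c6e
$repo/modelcar-qwen3-6-27b-w8a8@sha256:5eb8841a29a4f3d659f8cef25c7f8b16208b32265db23354dae79c4c04e8ef79
$repo/modelcar-deepseek-v4-flash-w4a8-mtp@sha256:ef57a46726bec9c6505504262beaa55ec6c45d595a1726ab90d594f42606c6be
EOF
}

# Verify each reference against the public key; stop at the first failure.
verify_all() {
  for ref in $(image_refs); do
    cosign verify --key cosign.pub --insecure-ignore-tlog=true "$ref" >/dev/null \
      || { echo "FAILED: $ref" >&2; return 1; }
    echo "ok: $ref"
  done
}
```

Run `verify_all` before deploying; a non-zero exit status means at least one image did not verify.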
