Inference Guide

Ready-to-deploy recipes for validated open-weight LLMs on Alauda AI. Each model in this guide has been deployed end-to-end on a real cluster and benchmarked, so you get a known-good deployment manifest, the runtime image that serves it, and the throughput you can expect.

The models here were validated on Huawei Ascend 910B4 NPUs with the community vLLM-Ascend engine, deployed through Alauda AI's InferNex surface: a KServe LLMInferenceService that the InferNex-Bridge reconciles into a load-aware router (hermes-router / EPP) in front of the vLLM-Ascend instances. All three ran through the same InferNex aggregation surface and the same two benchmark scenarios. The two Qwen models share an identical 2 × TP=4 topology and are directly comparable; the larger DeepSeek MoE uses 1 × TP=8. For the runtime model (KServe, ModelCar storage, scheduling), see Model Deployment & Inference.

Validated models

| Model | Type | Params | dtype | Device | Engine | Per-page |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-32B | Qwen3 dense | 32B | BF16 | Ascend 910B4 ×8 | vLLM-Ascend v0.18.0 | Qwen3-32B |
| Qwen3.6-27B (W8A8) | qwen3_5 hybrid (GDN + MTP) | 27.78B | W8A8 | Ascend 910B4 ×8 | vLLM-Ascend nightly | Qwen3.6-27B (W8A8) |
| DeepSeek-V4-Flash (W4A8) | deepseek_v4 MoE (MLA + DSA + MTP) | 256-expert MoE | W4A8 | Ascend 910B4 ×8 | vLLM-Ascend nightly | DeepSeek-V4-Flash (W4A8) |

All three were validated on Ascend 910B4 (32 GB/card), driven through KServe LLMInferenceService with load-aware routing (InferNex-Bridge + hermes-router). The two Qwen models run as an 8-card aggregation of 2 instances × TP=4. DeepSeek-V4-Flash (~151 GB in W4A8) fills all 8 cards as 1 instance × TP=8 and was additionally validated through the MaaS gateway (API-key) ingress alongside the internal KServe ingress.
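The topology arithmetic above can be checked with a small sketch. The helper names are hypothetical, and the 64 GB figure for Qwen3-32B is an estimate (nominal 32B parameters × 2 bytes for BF16), not a number taken from this guide.

```shell
# Cards consumed by a deployment = instances × tensor-parallel degree.
cards_needed() { echo $(( $1 * $2 )); }

# Approximate weight footprint per card in GB: total weight size / TP degree.
weights_per_card_gb() {
  awk -v gb="$1" -v tp="$2" 'BEGIN { printf "%.1f\n", gb / tp }'
}

cards_needed 2 4            # Qwen topology: 2 instances × TP=4      -> 8
cards_needed 1 8            # DeepSeek-V4-Flash: 1 instance × TP=8   -> 8
weights_per_card_gb 151 8   # ~151 GB W4A8 over 8 cards              -> 18.9
weights_per_card_gb 64 4    # ASSUMED ~64 GB BF16 over 4 cards       -> 16.0
```

Whatever remains of the 32 GB on each card after weights is what the engine has left for KV cache and activations, which is why the larger model cannot be split into two instances on this node.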

Runtime images

| Engine | Device | Image (validated tag) | Used by | Notes |
| --- | --- | --- | --- | --- |
| vLLM-Ascend v0.18.0 | Huawei Ascend NPU | quay.io/ascend/vllm-ascend:v0.18.0-openeuler | Qwen3-32B | Community vLLM for Ascend, V1 engine. Serves standard Qwen3 dense models directly. Match the tag's CANN version to your host NPU driver. |
| vLLM-Ascend nightly (release-pinned) | Huawei Ascend NPU | quay.io/ascend/vllm-ascend:nightly-releases-v0.22.1rc-openeuler | Qwen3.6-27B (W8A8), DeepSeek-V4-Flash (W4A8) | Carries the qwen3_5 Gated DeltaNet hybrid + MTP and the deepseek_v4 MoE stack; the stock v0.18.0 cannot load either. Use this release-pinned tag, not the moving nightly-main-openeuler (it drifted to a broken build whose TP workers crash). |
TIP

The Ascend CANN images are arm64. Always match the runtime image's CANN version to the host NPU driver on your nodes. Only the engines actually used in this guide are listed; other engines (MindIE, SGLang, …) were not benchmarked at this size.
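A minimal pre-flight sketch for the two checks above: node architecture and tag pinning. The function names are hypothetical. `npu-smi info` is the Ascend host tool and is expected to report the installed driver version; run it on the NPU node itself and compare against the CANN version the image tag was built for.

```shell
# Print each node's CPU architecture; nodes that run the Ascend images should report arm64.
node_arch() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,ARCH:.status.nodeInfo.architecture'
}

# Succeed if the image reference uses a moving nightly-main tag.
is_moving_tag() {
  case "${1##*:}" in          # strip everything up to the last colon -> tag
    nightly-main*) return 0 ;;
    *)             return 1 ;;
  esac
}

# Usage (node_arch needs cluster access; npu-smi runs on the NPU host):
#   node_arch
#   npu-smi info
if is_moving_tag "quay.io/ascend/vllm-ascend:nightly-main-openeuler"; then
  echo "moving tag: pin a release tag instead"
fi
```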

Benchmark scenarios

All three models were measured with aiperf against the same two scenarios, modelled on real serving patterns. Output is pinned to 128 tokens and load is closed-loop at concurrency 4 (4 in-flight requests, fixed). Each scenario ran 240 requests.

| Scenario | What it models | Dataset | Request shape |
| --- | --- | --- | --- |
| ① Fixed-length system-prompt reuse | A reused system prompt: performance when request length is fixed and KV cache repeats. | 60 distinct 8k-token system prompts, each reused 4 times = 240 requests. | ISL ~8k / OSL 128 |
| ② Multi-turn dialogue | Multiple users in a continuing conversation, several requests per session. | 60 independent users × 4 rounds = 240 requests. | Round 1 sends 16k and the model replies 128 tokens; round 2 adds 1k (~17k total); and so on for 4 rounds (16k / 17k / 18k / 19k), averaging ~17.5k ISL / OSL 128. |
NOTE

Both scenarios run on a single-node 8-card deployment (Qwen models: 2 instances × TP=4; DeepSeek-V4-Flash: 1 instance × TP=8). Latency (TTFT / ITL / E2E) is the per-instance operating point under steady 2-in-flight load; total throughput (TPS) is the aggregate across the instances and scales with the instance count. TPS counts total tokens (input + output); the decode-only output rate is reported separately and is much smaller under these long-input workloads. DeepSeek-V4-Flash additionally ran each scenario through the MaaS gateway (API-key) ingress as well as the internal KServe ingress; see its page.
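The scenario shapes and the gap between total-token and output-only rates follow from simple arithmetic, sketched below. The helper names are hypothetical, and "8k" / "16k" are taken as 8000 / 16000 tokens for illustration; the exact token counts are not stated in this guide.

```shell
# Total requests = distinct prompts (or users) × repeats (or rounds).
requests() { echo $(( $1 * $2 )); }

# Mean input sequence length over the per-round ISL values given as arguments.
mean_isl() {
  sum=0; n=0
  for v in "$@"; do sum=$(( sum + v )); n=$(( n + 1 )); done
  awk -v s="$sum" -v n="$n" 'BEGIN { printf "%.1f\n", s / n }'
}

# Share of all processed tokens that are output tokens, in percent.
output_share() {
  awk -v i="$1" -v o="$2" 'BEGIN { printf "%.2f\n", 100 * o / (i + o) }'
}

requests 60 4                        # both scenarios            -> 240
mean_isl 16000 17000 18000 19000     # multi-turn average ISL    -> 17500.0
output_share 8000 128                # scenario 1 output share   -> 1.57
output_share 17500 128               # scenario 2 output share   -> 0.73
```

With output making up only about 1 to 2 percent of the tokens, a total-token TPS figure is dominated by prefill, so it should not be read as a decode speed.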

Deploy a validated model

Each model page links self-contained YAMLs under assets/ that hold the real InferNex deployment — a KServe LLMInferenceService (infernex.io/runtime: true) plus the two LLMInferenceServiceConfig objects (engine template + hermes-router/EPP template) that the InferNex-Bridge reconciles into the running instances.

base=https://raw.githubusercontent.com/alauda/aml-docs/master/docs/en/inference_guide/assets

# 1. Download the manifest, then edit it: set the namespace and the image tag
#    (match your CANN / host driver). model.uri already points at the public
#    ModelCar on Docker Hub (oci://docker.io/alaudadockerhub/...); repoint it
#    only if you mirror locally.
curl -fsSLO "$base/qwen3-32b/qwen3-32b-agg-base-llmisvc.yaml"

# 2. Apply the edited file. The InferNex-Bridge reconciles the LLMInferenceService
#    into the hermes-router (EPP) + the vLLM-Ascend instances.
kubectl apply -f qwen3-32b-agg-base-llmisvc.yaml

# 3. Watch it come up (first start loads weights + compiles graphs — can take minutes).
kubectl -n <your-namespace> get llminferenceservice -w

# 4. Call the OpenAI-compatible endpoint through the gateway.
curl -s http://<gateway>/<namespace>/qwen3-32b-agg-base/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen/Qwen3-32B","messages":[{"role":"user","content":"hello"}]}'
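Steps 3 and 4 can be scripted so a rollout fails loudly instead of being watched by hand. This is a sketch under assumptions: the function names and placeholder arguments are hypothetical, and it assumes the LLMInferenceService exposes a `Ready` status condition that `kubectl wait` can observe.

```shell
# Build the chat-completions URL from gateway host, namespace and service name.
chat_url() { printf 'http://%s/%s/%s/v1/chat/completions\n' "$1" "$2" "$3"; }

# Block until the service reports Ready (ASSUMES a "Ready" condition exists).
# Args: namespace, service name.
wait_ready() {
  kubectl -n "$1" wait --for=condition=Ready "llminferenceservice/$2" --timeout=30m
}

# Send one short request; --fail makes curl exit non-zero on an HTTP error status.
# Args: gateway host, namespace, service name.
smoke_test() {
  curl -sS --fail "$(chat_url "$1" "$2" "$3")" \
    -H 'Content-Type: application/json' \
    -d '{"model":"Qwen/Qwen3-32B","max_tokens":16,"messages":[{"role":"user","content":"hello"}]}'
}

# Usage (placeholders are hypothetical):
#   wait_ready my-namespace qwen3-32b-agg-base &&
#     smoke_test gateway.example.com my-namespace qwen3-32b-agg-base
```

The generous timeout reflects the note above that the first start loads weights and compiles graphs; adjust it to what you observe on your cluster.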

Caveats

  • These manifests deploy through InferNex (LLMInferenceService + InferNex-Bridge + hermes-router). The two LLMInferenceServiceConfig objects live in the kserve namespace; the LLMInferenceService lives in your deployment namespace.
  • Resource keys are for Ascend 910B4 (huawei.com/Ascend910). Adjust the resource key, image, and version fields for your actual NPU model.
  • The ModelCar images are public on Docker Hub under alaudadockerhub — the manifests pull them with no credentials. Mirror them to your own registry and repoint model.uri if you prefer; the modelcar pull secret in the manifest is only needed for a private registry.
  • The benchmark numbers were measured closed-loop (concurrency 4) on 8 cards. Treat them as the per-instance operating point under steady load, not a saturation ceiling.
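To confirm the first two caveats on a cluster, the sketch below lists the objects in their expected namespaces and shows how many NPUs each node advertises under a given resource key. The function names are hypothetical; the dots in the resource key are escaped because kubectl's column paths treat a bare dot as a field separator.

```shell
# Escape dots so an extended-resource key can be used in a custom-columns path.
escape_key() { printf '%s\n' "$1" | sed 's/\./\\./g'; }

# Show allocatable NPUs per node for the given resource key.
npu_allocatable() {
  kubectl get nodes \
    -o custom-columns="NAME:.metadata.name,NPU:.status.allocatable.$(escape_key "$1")"
}

# List the config objects in the kserve namespace and the service in yours.
# Arg: your deployment namespace.
list_infernex_objects() {
  kubectl -n kserve get llminferenceserviceconfig
  kubectl -n "$1" get llminferenceservice
}

# Usage (namespace is a placeholder):
#   npu_allocatable huawei.com/Ascend910
#   list_infernex_objects my-namespace
```

A node that shows `<none>` in the NPU column does not advertise that key, which usually means the resource key in the manifest needs adjusting for your NPU model.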

Verify the ModelCar signature

The ModelCar images are signed with Cosign. Verify an image against the published public key (cosign.pub) before deploying:

curl -sO https://raw.githubusercontent.com/alauda/aml-docs/master/docs/en/inference_guide/cosign.pub

cosign verify --key cosign.pub --insecure-ignore-tlog=true \
  docker.io/alaudadockerhub/modelcar-qwen3-32b:v0.1.0

The three signed images and their digests:

| Image | Digest |
| --- | --- |
| docker.io/alaudadockerhub/modelcar-qwen3-32b:v0.1.0 | sha256:eccffdb567038196638eb7e1c6bb9572d8bcc7829d11fcdfa17f2c43fbca0c6e |
| docker.io/alaudadockerhub/modelcar-qwen3-6-27b-w8a8:v0.1.0 | sha256:5eb8841a29a4f3d659f8cef25c7f8b16208b32265db23354dae79c4c04e8ef79 |
| docker.io/alaudadockerhub/modelcar-deepseek-v4-flash-w4a8-mtp:v0.1.0 | sha256:ef57a46726bec9c6505504262beaa55ec6c45d595a1726ab90d594f42606c6be |

--insecure-ignore-tlog=true is required because these were signed with --tlog-upload=false (no public transparency-log entry); verification relies on the public key alone.
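Because a tag can be repointed, one option is to verify each image by its published digest rather than by tag. The sketch below does that with the same cosign flags as above; the helper names are hypothetical, and it assumes cosign.pub has already been downloaded into the current directory.

```shell
# Turn "repo:tag" plus a digest into the immutable "repo@sha256:..." form.
pin_ref() { printf '%s@%s\n' "${1%:*}" "$2"; }

# Verify one image by digest against the published public key.
# Args: image reference with tag, digest.
verify_pinned() {
  cosign verify --key cosign.pub --insecure-ignore-tlog=true "$(pin_ref "$1" "$2")"
}

# Verify all three published images; stops at the first failure.
verify_all() {
  verify_pinned docker.io/alaudadockerhub/modelcar-qwen3-32b:v0.1.0 \
    sha256:eccffdb567038196638eb7e1c6bb9572d8bcc7829d11fcdfa17f2c43fbca0c6e &&
  verify_pinned docker.io/alaudadockerhub/modelcar-qwen3-6-27b-w8a8:v0.1.0 \
    sha256:5eb8841a29a4f3d659f8cef25c7f8b16208b32265db23354dae79c4c04e8ef79 &&
  verify_pinned docker.io/alaudadockerhub/modelcar-deepseek-v4-flash-w4a8-mtp:v0.1.0 \
    sha256:ef57a46726bec9c6505504262beaa55ec6c45d595a1726ab90d594f42606c6be
}

# Usage:
#   verify_all
```

If a mirrored copy is used instead, the digest should stay the same as long as the mirror copies the image unchanged; only the repository part of the reference changes.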