DeepSeek-V4-Flash (W4A8)
A deepseek_v4-architecture model: a 256-expert MoE with MLA + DSA sparse attention
and a native MTP speculative head, served as W4A8 (~151 GB weights, 43 layers).
Validated on Ascend 910B4 (32 GB/card) with the vLLM-Ascend nightly engine
through Alauda AI's InferNex surface. Because the weights need all 8 cards, the
topology is 1 instance × TP=8 (+ expert parallel). Both benchmark scenarios
were driven through the MaaS gateway (API-key) ingress as well as the
internal KServe ingress.
Model identity
Validated hardware × stack
The release-pinned nightly image nightly-releases-v0.22.1rc-openeuler (the same
image used for Qwen3.6-27B) carries the full deepseek_v4 stack. enable_dsa_cp is
intentionally off: an upstream DSA-CP index-buffer bug crashes sustained 8k MTP
decode on this build. Prefix caching is enabled, but the hit rate is 0 because DSA
sparse attention selects KV per query, so a shared prefix is not directly reusable.
Leaving it on costs almost nothing; the speedup over older builds comes from the
nightly stack itself, not from prefix caching.
Model configuration
Deployment spec
Served as agg-base only: aggregation, with hermes-router strategy random. With a
single TP=8 instance the router has one endpoint, so routing is a no-op; the
structure is kept for consistency. The cross-instance KV store / KV-cache-aware
routing (agg-mc-kv) does not apply: the ~151 GB weights fill all 8 cards, so only
one replica fits and there is no second instance to share KV with.
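The single-replica conclusion can be checked with back-of-the-envelope memory arithmetic. The sketch below uses the weight size and card capacity quoted on this page and assumes two idealizations that are not stated in the source: weights shard perfectly evenly across tensor-parallel ranks, and only power-of-two TP degrees are considered. The helper names are illustrative and not part of InferNex or vLLM-Ascend.

```python
WEIGHTS_GB = 151.0  # approximate W4A8 weight size, from this page
CARD_GB = 32.0      # memory per Ascend 910B4 card, from this page
CARDS = 8           # cards available to the deployment


def per_card_weight_gb(weights_gb: float, tp: int) -> float:
    """Weight share per card, assuming perfectly even sharding (an idealization)."""
    return weights_gb / tp


def smallest_fitting_tp(weights_gb: float, card_gb: float, cards: int) -> int:
    """Smallest power-of-two TP degree whose per-card weight share fits on a card.

    Ignores KV cache, activations, and runtime overhead, so the real
    requirement is stricter than this check.
    """
    tp = 1
    while tp <= cards:
        if per_card_weight_gb(weights_gb, tp) < card_gb:
            return tp
        tp *= 2
    raise ValueError("weights do not fit on the available cards")


tp = smallest_fitting_tp(WEIGHTS_GB, CARD_GB, CARDS)
share = per_card_weight_gb(WEIGHTS_GB, tp)
replicas = CARDS // tp  # number of TP groups that fit on the node

print(f"smallest fitting TP degree: {tp}")
print(f"weights per card: {share:.1f} GB, remaining: {CARD_GB - share:.1f} GB")
print(f"replicas on {CARDS} cards: {replicas}")
```

Under these assumptions TP=4 would need about 37.75 GB of weights per card, which exceeds 32 GB, so TP=8 is the smallest degree that fits and the 8 cards hold exactly one replica. Whatever remains per card after the weight share has to cover KV cache, activations, and runtime overhead.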
Deploy
Self-contained InferNex manifest (engine inlined in the LLMInferenceService +
hermes-router preset, 1 replica × TP=8):
Benchmark results
Closed-loop aiperf 0.7.0, TP=8 × 1 replica (8 × 910B4), concurrency 4, agg-base.
Two scenarios, ① 8k system-prompt reuse and ② 17.5k multi-turn, each with 240
requests and output pinned to 128 tokens, n=3 (all runs 240/240, zero errors). Each
scenario was driven through two ingresses: the internal KServe Service (no auth) and
the product MaaS gateway (Envoy + API-key auth + token rate limiting; the
customer-facing OpenAI-compatible endpoint). TTFT, E2E, and ITL are in ms; TPS is
total tokens/s.
Scenario ① — fixed-length system-prompt reuse (ISL ~8k / OSL 128)
Scenario ② — multi-turn dialogue (ISL ~17.5k / OSL 128)
How to read these. All runs completed 240/240 with zero errors at a steady 2 in-flight requests per instance (n=3 mean). The difference between the MaaS gateway and the internal KServe ingress is within run-to-run jitter: scenario ① is about 1–3% higher on the gateway (a few hundred ms of extra hop), while scenario ②'s two n=3 sets, taken at different times, land a few percent the other way. The takeaway: the API-key-authenticated, rate-limited MaaS gateway adds no measurable overhead beyond noise for either typical or long-context requests, so it can serve as the production ingress. Decode-only output rate is 33 tok/s (scenario ①) and 16 tok/s (scenario ②); the TPS column counts total tokens (input + output). MTP speculative decoding is on. Because this is a single TP=8 instance, its TPS is not directly comparable to the 2-replica models in this guide.
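To make the two throughput bases concrete, the sketch below converts an output-only token rate into a total (input + output) token rate. It assumes every request has exactly the nominal ISL and OSL and that both rates are measured over the same wall-clock window. Neither assumption is confirmed by the source, so the printed figures are rough estimates and not the measured values from the tables. The function name is illustrative.

```python
def total_tps_from_output_rate(output_tps: float, isl: int, osl: int) -> float:
    """Scale an output-only token rate to a total (input + output) token rate.

    Assumes every request has exactly `isl` input and `osl` output tokens and
    that both rates cover the same wall-clock window.
    """
    if osl <= 0:
        raise ValueError("osl must be positive")
    return output_tps * (isl + osl) / osl


# Nominal figures from this page; the ISL values are approximate (~8k, ~17.5k).
scenarios = (
    ("scenario 1", 33.0, 8_000),
    ("scenario 2", 16.0, 17_500),
)
for name, output_tps, isl in scenarios:
    estimate = total_tps_from_output_rate(output_tps, isl, osl=128)
    print(f"{name}: ~{estimate:.0f} total tok/s (rough estimate, not measured)")
```

Because the output is pinned to 128 tokens while the input is thousands of tokens, the total-token rate is dominated by input tokens. This is why the two scenarios can show very different decode-only rates while their total-token figures stay in the same range.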
Long-context requests through the MaaS gateway need the gateway's request-body buffer
raised: the 17.5k scenario (~85 KB streaming body) exceeds the default limit and
hangs until ClientTrafficPolicy.connection.bufferLimit on the MaaS gateway is
increased. The internal KServe ingress has no such limit. The numbers above were
measured after that fix.
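As a minimal sketch of that fix, the snippet below builds a ClientTrafficPolicy manifest that raises connection.bufferLimit and prints it as JSON. The policy name, namespace, Gateway name, and the 1Mi value are all placeholders, not the values used in the validated setup. The apiVersion and the targetRefs field follow the Envoy Gateway API as commonly documented, and the field layout may differ by Envoy Gateway version, so check it against the version installed on the cluster.

```python
import json


def client_traffic_policy(name: str, namespace: str, gateway: str, buffer_limit: str) -> dict:
    """Build a ClientTrafficPolicy manifest that sets connection.bufferLimit."""
    return {
        "apiVersion": "gateway.envoyproxy.io/v1alpha1",
        "kind": "ClientTrafficPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Attach the policy to the Gateway that fronts the MaaS endpoint.
            "targetRefs": [
                {
                    "group": "gateway.networking.k8s.io",
                    "kind": "Gateway",
                    "name": gateway,
                }
            ],
            # Per-connection buffer size; must exceed the largest request body.
            "connection": {"bufferLimit": buffer_limit},
        },
    }


# All four arguments are hypothetical placeholders.
manifest = client_traffic_policy(
    name="maas-buffer-limit",
    namespace="maas-system",
    gateway="maas-gateway",
    buffer_limit="1Mi",
)
print(json.dumps(manifest, indent=2))
```

The JSON output can be applied like any other manifest. Pick a limit comfortably above the largest expected request body, which is about 85 KB for the 17.5k scenario described here.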