Chinese Domestic LLM Performance Comparison

好的模型,现在就是需要 2 到七八秒的返回。

wxside gateway: model latency sample (successful requests)

Against https://api.wxside.com OpenAI-compatible API, one POST /v1/chat/completions per model; prompt: “Reply with exactly one English word: ok”, max_tokens=128. A single keep-alive TCP/TLS connection; requests run sequentially.

Timing: End-to-end is measured on the client; gateway is the X-Response-Time-Ms header (includes upstream wait); rest ≈ end-to-end minus gateway (local stack and network—indicative only).

Environment (from headers): gateway 0.56.2-rc1; sample time around 2026-04-21 22:56 CST. One-off sample, not a load test.

Model E2E (ms) Gateway (ms) Rest ≈ (ms)
qwen-turbo 297 281 16
qwen3-coder-480b-a35b-instruct 339 316 23
doubao-seed-1.6-flash 591 566 25
doubao-1.5-pro-32k 603 579 24
deepseek/deepseek-v3.2-251201 1207 1185 22
qwen3-max 1916 1892 24
minimax/minimax-m2.7 2318 2297 21
moonshotai/kimi-k2.5 2774 2752 22
doubao-seed-1.6 3437 3399 38
doubao-seed-2.0-code 4097 3985 112
z-ai/glm-5 5389 5367 22
z-ai/glm-5.1 5834 5815 19
doubao-seed-2.0-pro 6362 6320 42
moonshotai/kimi-k2.6 9759 9736 23

Single sample; latency depends on load and output length. Some models return reasoning traces, so latency can exceed what a one-token reply might suggest.