Chinese Domestic LLM Performance Comparison

好的模型，现在就是需要 2 到七八秒的返回。

wxside gateway: model latency sample (successful requests)

Against https://api.wxside.com OpenAI-compatible API, one POST /v1/chat/completions per model; prompt: “Reply with exactly one English word: ok”, max_tokens=128. A single keep-alive TCP/TLS connection; requests run sequentially.

Timing: End-to-end is measured on the client; gateway is the X-Response-Time-Ms header (includes upstream wait); rest ≈ end-to-end minus gateway (local stack and network—indicative only).

Environment (from headers): gateway 0.56.2-rc1; sample time around 2026-04-21 22:56 CST. One-off sample, not a load test.

Model	E2E (ms)	Gateway (ms)	Rest ≈ (ms)
qwen-turbo	297	281	16
qwen3-coder-480b-a35b-instruct	339	316	23
doubao-seed-1.6-flash	591	566	25
doubao-1.5-pro-32k	603	579	24
deepseek/deepseek-v3.2-251201	1207	1185	22
qwen3-max	1916	1892	24
minimax/minimax-m2.7	2318	2297	21
moonshotai/kimi-k2.5	2774	2752	22
doubao-seed-1.6	3437	3399	38
doubao-seed-2.0-code	4097	3985	112
z-ai/glm-5	5389	5367	22
z-ai/glm-5.1	5834	5815	19
doubao-seed-2.0-pro	6362	6320	42
moonshotai/kimi-k2.6	9759	9736	23

Single sample; latency depends on load and output length. Some models return reasoning traces, so latency can exceed what a one-token reply might suggest.