### Qwen-Coder

- [ ] Qwen2.5-Coder-32B-Instruct
  - HF: Qwen/Qwen2.5-Coder-32B-Instruct
  - Hardware:
    - 1x H100/H200
      - --tool-call-parser hermes --enable-auto-tool-choice
    - 2x H100/H200
      - --tensor-parallel-size 2 --tool-call-parser hermes --enable-auto-tool-choice
  - Notes: Good balance of size and performance. Single GPU capable.
- [ ] Qwen3-Coder-480B-A35B-Instruct (BF16)
  - HF: Qwen/Qwen3-Coder-480B-A35B-Instruct
  - Hardware:
    - 8x H200/H20
      - --tensor-parallel-size 8 --max-model-len 32000 --enable-auto-tool-choice --tool-call-parser qwen3_coder
  - Notes: Cannot serve full 262K context on single node. Reduce max-model-len or increase gpu-memory-utilization.
- [ ] Qwen3-Coder-480B-A35B-Instruct-FP8
  - HF: Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8
  - Hardware:
    - 8x H200/H20
      - --max-model-len 131072 --enable-expert-parallel --data-parallel-size 8 --enable-auto-tool-choice --tool-call-parser qwen3_coder
      - Env: VLLM_USE_DEEP_GEMM=1
  - Notes: Use data-parallel mode (not tensor-parallel) to avoid weight quantization errors. DeepGEMM recommended.
- [ ] Qwen3-Coder-30B-A3B-Instruct (BF16)
  - HF: Qwen/Qwen3-Coder-30B-A3B-Instruct
  - Hardware:
    - 1x H100/H200
      - --enable-auto-tool-choice --tool-call-parser qwen3_coder
      - Notes: Fits comfortably on single GPU. ~60GB model weight.
    - 2x H100/H200
      - --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser qwen3_coder
      - Notes: For higher throughput/longer context.
- [ ] Qwen3-Coder-30B-A3B-Instruct-FP8
  - HF: Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
  - Hardware:
    - 1x H100/H200
      - --enable-auto-tool-choice --tool-call-parser qwen3_coder
      - Env: VLLM_USE_DEEP_GEMM=1
  - Notes: FP8 quantized, ~30GB model weight. Excellent for single GPU deployment.

### GPT-OSS

- Notes: Requires vLLM 0.10.1+gptoss. Built-in tools via /v1/responses endpoint (browsing, Python). Function calling not yet supported. --async-scheduling recommended for higher perf (not compatible with structured output).
- [ ] GPT-OSS-20B
  - HF: openai/gpt-oss-20b
  - Hardware:
    - 1x H100/H200
      - --async-scheduling
    - 1x B200
      - --async-scheduling
      - Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
- [ ] GPT-OSS-120B
  - HF: openai/gpt-oss-120b
  - Hardware:
    - 1x H100/H200
      - --async-scheduling
      - Notes: Needs --gpu-memory-utilization 0.95 --max-num-batched-tokens 1024 to avoid OOM
    - 2x H100/H200
      - --tensor-parallel-size 2 --async-scheduling
      - Notes: Set --gpu-memory-utilization <0.95 to avoid OOM
    - 4x H100/H200
      - --tensor-parallel-size 4 --async-scheduling
    - 8x H100/H200
      - --tensor-parallel-size 8 --async-scheduling --max-model-len 131072 --max-num-batched-tokens 10240 --max-num-seqs 128 --gpu-memory-utilization 0.85 --no-enable-prefix-caching
    - 1x B200
      - --async-scheduling
      - Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
    - 2x B200
      - --tensor-parallel-size 2 --async-scheduling
      - Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1

### GLM-4.5

- Notes: Listed configs support reduced context. For full 128K context, double the GPU count. Models default to thinking mode (disable with API param).
- [ ] GLM-4.5 (BF16)
  - HF: zai-org/GLM-4.5
  - Hardware:
    - 16x H100
      - --tensor-parallel-size 16 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    - 8x H200
      - --tensor-parallel-size 8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
  - Notes: On 8x H100, may need --cpu-offload-gb 16 to avoid OOM. For full 128K: needs 32x H100 or 16x H200.
- [ ] GLM-4.5-FP8
  - HF: zai-org/GLM-4.5-FP8
  - Hardware:
    - 8x H100
      - --tensor-parallel-size 8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    - 4x H200
      - --tensor-parallel-size 4 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
  - Notes: For full 128K context: needs 16x H100 or 8x H200.
- [ ] GLM-4.5-Air (BF16)
  - HF: zai-org/GLM-4.5-Air
  - Hardware:
    - 4x H100
      - --tensor-parallel-size 4 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    - 2x H200
      - --tensor-parallel-size 2 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
  - Notes: For full 128K context: needs 8x H100 or 4x H200.
- [ ] GLM-4.5-Air-FP8
  - HF: zai-org/GLM-4.5-Air-FP8
  - Hardware:
    - 2x H100
      - --tensor-parallel-size 2 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    - 1x H200
      - --tensor-parallel-size 1 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
  - Notes: For full 128K context: needs 4x H100 or 2x H200.

### Kimi

- Notes: Requires vLLM v0.10.0rc1+. Minimum 16 GPUs for FP8 with 128k context. Reuses DeepSeekV3 architecture with model_type="kimi_k2".
- [ ] Kimi-K2-Instruct
  - HF: moonshotai/Kimi-K2-Instruct
  - Hardware:
    - 16x H200/H20
      - --tensor-parallel-size 16 --trust-remote-code --enable-auto-tool-choice --tool-call-parser kimi_k2
      - Notes: Pure TP mode. For >16 GPUs, combine with pipeline-parallelism.
    - 16x H200/H20 (DP+EP mode)
      - --data-parallel-size 16 --data-parallel-size-local 8 --enable-expert-parallel --max-num-batched-tokens 8192 --max-num-seqs 256 --gpu-memory-utilization 0.85 --trust-remote-code --enable-auto-tool-choice --tool-call-parser kimi_k2
      - Notes: Data parallel + expert parallel mode for higher throughput. Requires multi-node setup with proper networking.
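
Each entry in this checklist can be read as a `vllm serve` invocation: the HF ID is the positional model argument, the flag line is appended after it, and any Env values are set as environment variables for the process. The Python sketch below illustrates that mapping for two single-node entries from the list. The helper name `build_serve_command` is hypothetical (it is not part of vLLM), and it only assembles a command string; it does not launch a server or check that the flags are valid for the installed vLLM version.

```python
import shlex


def build_serve_command(model_id, flags, env=None):
    """Assemble a `vllm serve` command line from one checklist entry.

    model_id: the "HF:" value from the checklist.
    flags:    the flag line listed under a hardware option.
    env:      optional dict built from the "Env:" line.
    This helper is illustrative only and is not a vLLM API.
    """
    env = env or {}
    # Environment variables are written as NAME=value prefixes on the command.
    env_prefix = [f"{name}={shlex.quote(value)}" for name, value in env.items()]
    # Split the flag line into tokens, then re-join everything with safe quoting.
    argv = ["vllm", "serve", model_id, *shlex.split(flags)]
    return " ".join(env_prefix + [shlex.join(argv)])


# GLM-4.5-Air-FP8 on 2x H100 (flags copied from the checklist).
glm_air_fp8 = build_serve_command(
    "zai-org/GLM-4.5-Air-FP8",
    "--tensor-parallel-size 2 --tool-call-parser glm45 "
    "--reasoning-parser glm45 --enable-auto-tool-choice",
)

# Qwen3-Coder-30B-A3B-Instruct-FP8 on 1x H100/H200, including its Env line.
qwen_fp8 = build_serve_command(
    "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
    "--enable-auto-tool-choice --tool-call-parser qwen3_coder",
    env={"VLLM_USE_DEEP_GEMM": "1"},
)

print(glm_air_fp8)
print(qwen_fp8)
```

The printed strings can be pasted into a shell on a machine with the matching GPU count. Multi-node entries such as the Kimi configurations need additional cluster setup that this sketch does not cover.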