### Qwen-Coder
- [ ] Qwen2.5-Coder-32B-Instruct
  - HF: Qwen/Qwen2.5-Coder-32B-Instruct
  - Hardware:
    - 1x H100/H200
      - --tool-call-parser hermes --enable-auto-tool-choice
    - 2x H100/H200
      - --tensor-parallel-size 2 --tool-call-parser hermes --enable-auto-tool-choice
  - Notes: Good balance of size and performance. Runs on a single GPU.
- [ ] Qwen3-Coder-480B-A35B-Instruct (BF16)
  - HF: Qwen/Qwen3-Coder-480B-A35B-Instruct
  - Hardware:
    - 8x H200/H20
      - --tensor-parallel-size 8 --max-model-len 32000 --enable-auto-tool-choice --tool-call-parser qwen3_coder
      - Notes: Cannot serve the full 262K context on a single node. Reduce --max-model-len or increase --gpu-memory-utilization.
- [ ] Qwen3-Coder-480B-A35B-Instruct-FP8
  - HF: Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8
  - Hardware:
    - 8x H200/H20
      - --max-model-len 131072 --enable-expert-parallel --data-parallel-size 8 --enable-auto-tool-choice --tool-call-parser qwen3_coder
      - Env: VLLM_USE_DEEP_GEMM=1
      - Notes: Use data-parallel mode (not tensor-parallel) to avoid weight quantization errors. DeepGEMM recommended.
- [ ] Qwen3-Coder-30B-A3B-Instruct (BF16)
  - HF: Qwen/Qwen3-Coder-30B-A3B-Instruct
  - Hardware:
    - 1x H100/H200
      - --enable-auto-tool-choice --tool-call-parser qwen3_coder
      - Notes: Fits on a single GPU. ~60GB of model weights.
    - 2x H100/H200
      - --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser qwen3_coder
      - Notes: For higher throughput/longer context.
- [ ] Qwen3-Coder-30B-A3B-Instruct-FP8
  - HF: Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
  - Hardware:
    - 1x H100/H200
      - --enable-auto-tool-choice --tool-call-parser qwen3_coder
      - Env: VLLM_USE_DEEP_GEMM=1
      - Notes: FP8 quantized, ~30GB model weight. Excellent for single GPU deployment.
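
The weight sizes quoted in the notes above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below is a rough estimate only. It assumes the nominal parameter counts from the model names (30B, 480B), nominal HBM sizes of 80 GB for H100 and 141 GB for H200, and ignores KV cache, activations, and runtime overhead. The helper names `weight_gb` and `headroom_gb` are hypothetical.

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# Parameter counts are the nominal figures from the model names (assumption).
# GPU capacities are nominal HBM sizes in GB (assumption: H100 = 80, H200 = 141).

BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}
GPU_MEMORY_GB = {"H100": 80, "H200": 141}


def weight_gb(params_billion, dtype):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]


def headroom_gb(params_billion, dtype, gpu, num_gpus):
    """Memory left across all GPUs after loading weights (for KV cache etc.)."""
    return GPU_MEMORY_GB[gpu] * num_gpus - weight_gb(params_billion, dtype)


# Qwen3-Coder-30B-A3B-Instruct in BF16 and FP8
print("30B bf16 weights (GB):", weight_gb(30, "bf16"))
print("30B fp8 weights (GB):", weight_gb(30, "fp8"))

# Qwen3-Coder-480B-A35B-Instruct in BF16 on 8x H200
print("480B bf16 headroom on 8x H200 (GB):", headroom_gb(480, "bf16", "H200", 8))
```

Under these assumptions the 480B BF16 weights take most of an 8x H200 node, which is consistent with the note that the full 262K context does not fit on a single node.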

### GPT-OSS
- Notes: Requires vLLM 0.10.1+gptoss. Built-in tools (browsing, Python) are available via the /v1/responses endpoint. Function calling is not yet supported. --async-scheduling is recommended for higher performance but is not compatible with structured output.
- [ ] GPT-OSS-20B
  - HF: openai/gpt-oss-20b
  - Hardware:
    - 1x H100/H200
      - --async-scheduling
    - 1x B200
      - --async-scheduling
      - Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
- [ ] GPT-OSS-120B
  - HF: openai/gpt-oss-120b
  - Hardware:
    - 1x H100/H200
      - --async-scheduling
      - Notes: Needs --gpu-memory-utilization 0.95 --max-num-batched-tokens 1024 to avoid OOM
    - 2x H100/H200
      - --tensor-parallel-size 2 --async-scheduling
      - Notes: Set --gpu-memory-utilization below 0.95 to avoid OOM
    - 4x H100/H200
      - --tensor-parallel-size 4 --async-scheduling
    - 8x H100/H200
      - --tensor-parallel-size 8 --async-scheduling --max-model-len 131072 --max-num-batched-tokens 10240 --max-num-seqs 128 --gpu-memory-utilization 0.85 --no-enable-prefix-caching
    - 1x B200
      - --async-scheduling
      - Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
    - 2x B200
      - --tensor-parallel-size 2 --async-scheduling
      - Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
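
Each hardware entry above pairs a set of flags with optional environment variables. As a minimal sketch of how one entry turns into a launch command, the snippet below assembles the 2x B200 configuration for GPT-OSS-120B. It assumes the server is started with `vllm serve <model>`; the helper `build_serve_command` is hypothetical and only formats a string, it does not launch anything.

```python
import shlex


def build_serve_command(model, flags, env=None):
    """Format 'ENV=... vllm serve <model> <flags>' as a single shell line."""
    # Environment variables go first so the shell applies them to this command only.
    env_prefix = [f"{name}={shlex.quote(value)}" for name, value in (env or {}).items()]
    command = shlex.join(["vllm", "serve", model] + list(flags))
    return " ".join(env_prefix + [command])


# Values copied from the "2x B200" entry for GPT-OSS-120B.
B200_ENV = {
    "VLLM_USE_TRTLLM_ATTENTION": "1",
    "VLLM_USE_TRTLLM_DECODE_ATTENTION": "1",
    "VLLM_USE_TRTLLM_CONTEXT_ATTENTION": "1",
    "VLLM_USE_FLASHINFER_MXFP4_MOE": "1",
}

cmd = build_serve_command(
    "openai/gpt-oss-120b",
    ["--tensor-parallel-size", "2", "--async-scheduling"],
    env=B200_ENV,
)
print(cmd)
```

The same helper works for the H100/H200 entries by omitting `env`.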

### GLM-4.5
- Notes: The listed configs support a reduced context length. For the full 128K context, double the GPU count. Models default to thinking mode; disable it per request with an API parameter (chat_template_kwargs with "enable_thinking": false).
- [ ] GLM-4.5 (BF16)
  - HF: zai-org/GLM-4.5
  - Hardware:
    - 16x H100
      - --tensor-parallel-size 16 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    - 8x H200
      - --tensor-parallel-size 8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
  - Notes: On 8x H100, may need --cpu-offload-gb 16 to avoid OOM. For full 128K: needs 32x H100 or 16x H200.
- [ ] GLM-4.5-FP8
  - HF: zai-org/GLM-4.5-FP8
  - Hardware:
    - 8x H100
      - --tensor-parallel-size 8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    - 4x H200
      - --tensor-parallel-size 4 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
  - Notes: For full 128K context: needs 16x H100 or 8x H200.
- [ ] GLM-4.5-Air (BF16)
  - HF: zai-org/GLM-4.5-Air
  - Hardware:
    - 4x H100
      - --tensor-parallel-size 4 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    - 2x H200
      - --tensor-parallel-size 2 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
  - Notes: For full 128K context: needs 8x H100 or 4x H200.
- [ ] GLM-4.5-Air-FP8
  - HF: zai-org/GLM-4.5-Air-FP8
  - Hardware:
    - 2x H100
      - --tensor-parallel-size 2 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    - 1x H200
      - --tensor-parallel-size 1 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
  - Notes: For full 128K context: needs 4x H100 or 2x H200.
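
The "double the GPU count" rule from the section notes can be checked against the per-model notes. The sketch below records the reduced-context GPU counts listed above and derives the full-128K counts from them. The table and the function name `full_context_gpus` are illustrative only and not part of any tool.

```python
# Reduced-context GPU counts from the hardware entries above,
# keyed by (model, GPU type).
REDUCED_CONTEXT_GPUS = {
    ("GLM-4.5", "H100"): 16,
    ("GLM-4.5", "H200"): 8,
    ("GLM-4.5-FP8", "H100"): 8,
    ("GLM-4.5-FP8", "H200"): 4,
    ("GLM-4.5-Air", "H100"): 4,
    ("GLM-4.5-Air", "H200"): 2,
    ("GLM-4.5-Air-FP8", "H100"): 2,
    ("GLM-4.5-Air-FP8", "H200"): 1,
}


def full_context_gpus(model, gpu):
    """GPU count for the full 128K context: twice the reduced-context count."""
    return 2 * REDUCED_CONTEXT_GPUS[(model, gpu)]


for (model, gpu), count in REDUCED_CONTEXT_GPUS.items():
    print(f"{model}: {count}x {gpu} reduced, {full_context_gpus(model, gpu)}x {gpu} for 128K")
```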

### Kimi
- Notes: Requires vLLM v0.10.0rc1+. FP8 with 128K context needs at least 16 GPUs. Reuses the DeepSeekV3 architecture with model_type="kimi_k2".
- [ ] Kimi-K2-Instruct
  - HF: moonshotai/Kimi-K2-Instruct
  - Hardware:
    - 16x H200/H20
      - --tensor-parallel-size 16 --trust-remote-code --enable-auto-tool-choice --tool-call-parser kimi_k2
      - Notes: Pure TP mode. For more than 16 GPUs, combine with pipeline parallelism.
    - 16x H200/H20 (DP+EP mode)
      - --data-parallel-size 16 --data-parallel-size-local 8 --enable-expert-parallel --max-num-batched-tokens 8192 --max-num-seqs 256 --gpu-memory-utilization 0.85 --trust-remote-code --enable-auto-tool-choice --tool-call-parser kimi_k2
      - Notes: Data parallel + expert parallel mode for higher throughput. Requires multi-node setup with proper networking.
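
In the DP+EP entry above, `--data-parallel-size 16` is the total number of data-parallel ranks and `--data-parallel-size-local 8` is the number of ranks on each node, which implies two nodes. The sketch below works out that layout. The function `dp_layout` is a hypothetical helper, and it assumes every node hosts the same number of ranks.

```python
def dp_layout(dp_size, dp_size_local):
    """Return (first_rank, last_rank) per node for an evenly split DP deployment."""
    if dp_size_local <= 0 or dp_size % dp_size_local != 0:
        raise ValueError("dp_size must be a positive multiple of dp_size_local")
    num_nodes = dp_size // dp_size_local
    return [
        (node * dp_size_local, (node + 1) * dp_size_local - 1)
        for node in range(num_nodes)
    ]


# Values from the Kimi-K2-Instruct DP+EP entry.
layout = dp_layout(dp_size=16, dp_size_local=8)
for node, (first, last) in enumerate(layout):
    print(f"node {node}: DP ranks {first}-{last}")
```

With 8 GPUs per node this gives one rank per GPU across two nodes, which matches the note that this mode requires a multi-node setup.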