co-mono/packages/pods/docs/models.md
Mario Zechner a74c5da112 Initial monorepo setup with npm workspaces and dual TypeScript configuration
2025-08-09 17:18:38 +02:00

Qwen-Coder

  • Qwen2.5-Coder-32B-Instruct
    • HF: Qwen/Qwen2.5-Coder-32B-Instruct
    • Hardware:
      • 1x H100/H200
        • --tool-call-parser hermes --enable-auto-tool-choice
      • 2x H100/H200
        • --tensor-parallel-size 2 --tool-call-parser hermes --enable-auto-tool-choice
    • Notes: Good balance of size and performance. Runs on a single GPU.
  • Qwen3-Coder-480B-A35B-Instruct (BF16)
    • HF: Qwen/Qwen3-Coder-480B-A35B-Instruct
    • Hardware:
      • 8x H200/H20
        • --tensor-parallel-size 8 --max-model-len 32000 --enable-auto-tool-choice --tool-call-parser qwen3_coder
        • Notes: The full 262K context does not fit on a single node. Reduce --max-model-len or increase --gpu-memory-utilization.
  • Qwen3-Coder-480B-A35B-Instruct-FP8
    • HF: Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8
    • Hardware:
      • 8x H200/H20
        • --max-model-len 131072 --enable-expert-parallel --data-parallel-size 8 --enable-auto-tool-choice --tool-call-parser qwen3_coder
        • Env: VLLM_USE_DEEP_GEMM=1
        • Notes: Use data-parallel mode (not tensor-parallel) to avoid weight quantization errors. DeepGEMM recommended.
  • Qwen3-Coder-30B-A3B-Instruct (BF16)
    • HF: Qwen/Qwen3-Coder-30B-A3B-Instruct
    • Hardware:
      • 1x H100/H200
        • --enable-auto-tool-choice --tool-call-parser qwen3_coder
        • Notes: Fits comfortably on a single GPU. ~60 GB of model weights.
      • 2x H100/H200
        • --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser qwen3_coder
        • Notes: For higher throughput/longer context.
  • Qwen3-Coder-30B-A3B-Instruct-FP8
    • HF: Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
    • Hardware:
      • 1x H100/H200
        • --enable-auto-tool-choice --tool-call-parser qwen3_coder
        • Env: VLLM_USE_DEEP_GEMM=1
        • Notes: FP8 quantized, ~30 GB of model weights. Excellent for single-GPU deployment.
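
The ~60 GB and ~30 GB figures in the notes above follow from parameter count times bytes per parameter. The sketch below reproduces that arithmetic. It assumes the "30B" in the model name can be taken at face value as the parameter count, and it ignores KV cache, activations, and runtime overhead, so treat the result as a lower bound rather than a sizing guarantee.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: the "30B" in the model name is used as a nominal 30e9 parameters.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("bf16", "fp8"):
    print(f"Qwen3-Coder-30B-A3B {dtype}: ~{weight_gb(30, dtype):.0f} GB")
```

The same arithmetic puts the 480B BF16 weights at roughly 960 GB, against 8 x 141 GB = 1128 GB on an 8x H200 node, which is consistent with the note that the full 262K context does not fit there.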

GPT-OSS

  • Notes: Requires vLLM 0.10.1+gptoss. Built-in tools (browsing, Python) are available via the /v1/responses endpoint. Function calling is not yet supported. --async-scheduling is recommended for higher performance, but it is not compatible with structured output.
  • GPT-OSS-20B
    • HF: openai/gpt-oss-20b
    • Hardware:
      • 1x H100/H200
        • --async-scheduling
      • 1x B200
        • --async-scheduling
        • Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
  • GPT-OSS-120B
    • HF: openai/gpt-oss-120b
    • Hardware:
      • 1x H100/H200
        • --async-scheduling
        • Notes: Add --gpu-memory-utilization 0.95 --max-num-batched-tokens 1024 to avoid OOM.
      • 2x H100/H200
        • --tensor-parallel-size 2 --async-scheduling
        • Notes: Set --gpu-memory-utilization below 0.95 to avoid OOM.
      • 4x H100/H200
        • --tensor-parallel-size 4 --async-scheduling
      • 8x H100/H200
        • --tensor-parallel-size 8 --async-scheduling --max-model-len 131072 --max-num-batched-tokens 10240 --max-num-seqs 128 --gpu-memory-utilization 0.85 --no-enable-prefix-caching
      • 1x B200
        • --async-scheduling
        • Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
      • 2x B200
        • --tensor-parallel-size 2 --async-scheduling
        • Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
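
Each hardware entry above is a model ID plus a flag list and, for some GPUs, an Env line. As a rough sketch of how those pieces combine into one launch line, the helper below assumes the flags are passed to `vllm serve`. The function name `build_launch_command` is hypothetical, and how the pods tooling itself launches vLLM is not shown in this document.

```python
import shlex


def build_launch_command(model_id, flags, env=None):
    """Join env assignments, `vllm serve`, the model ID, and flags into one shell line.

    Hypothetical helper for illustration; assumes `vllm serve` is the entry point.
    """
    env_part = [f"{name}={shlex.quote(value)}" for name, value in (env or {}).items()]
    cmd_part = ["vllm", "serve", shlex.quote(model_id)] + [shlex.quote(f) for f in flags]
    return " ".join(env_part + cmd_part)


# Values copied from the "GPT-OSS-120B, 2x B200" entry above.
command = build_launch_command(
    "openai/gpt-oss-120b",
    ["--tensor-parallel-size", "2", "--async-scheduling"],
    env={
        "VLLM_USE_TRTLLM_ATTENTION": "1",
        "VLLM_USE_TRTLLM_DECODE_ATTENTION": "1",
        "VLLM_USE_TRTLLM_CONTEXT_ATTENTION": "1",
        "VLLM_USE_FLASHINFER_MXFP4_MOE": "1",
    },
)
print(command)
```

Quoting every piece with `shlex.quote` keeps the line safe to paste into a shell even if a value later contains spaces.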

GLM-4.5

  • Notes: The listed configs support a reduced context length. For the full 128K context, double the GPU count. Models default to thinking mode (disable it with an API parameter).
  • GLM-4.5 (BF16)
    • HF: zai-org/GLM-4.5
    • Hardware:
      • 16x H100
        • --tensor-parallel-size 16 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
      • 8x H200
        • --tensor-parallel-size 8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    • Notes: On 8x H100, may need --cpu-offload-gb 16 to avoid OOM. For full 128K context: needs 32x H100 or 16x H200.
  • GLM-4.5-FP8
    • HF: zai-org/GLM-4.5-FP8
    • Hardware:
      • 8x H100
        • --tensor-parallel-size 8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
      • 4x H200
        • --tensor-parallel-size 4 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    • Notes: For full 128K context: needs 16x H100 or 8x H200.
  • GLM-4.5-Air (BF16)
    • HF: zai-org/GLM-4.5-Air
    • Hardware:
      • 4x H100
        • --tensor-parallel-size 4 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
      • 2x H200
        • --tensor-parallel-size 2 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    • Notes: For full 128K context: needs 8x H100 or 4x H200.
  • GLM-4.5-Air-FP8
    • HF: zai-org/GLM-4.5-Air-FP8
    • Hardware:
      • 2x H100
        • --tensor-parallel-size 2 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
      • 1x H200
        • --tensor-parallel-size 1 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    • Notes: For full 128K context: needs 4x H100 or 2x H200.
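
The "double the GPU count" rule for full 128K context can be checked against the per-model notes with a few lines. This is only a sketch of the arithmetic: the table below is copied from the hardware entries in this section, and the name `full_context_gpus` is made up for illustration.

```python
# Reduced-context GPU counts copied from the entries above: variant -> {GPU type: count}.
REDUCED_CONTEXT_GPUS = {
    "GLM-4.5": {"H100": 16, "H200": 8},
    "GLM-4.5-FP8": {"H100": 8, "H200": 4},
    "GLM-4.5-Air": {"H100": 4, "H200": 2},
    "GLM-4.5-Air-FP8": {"H100": 2, "H200": 1},
}


def full_context_gpus(variant: str, gpu: str) -> int:
    """GPU count for the full 128K context: twice the reduced-context count."""
    return 2 * REDUCED_CONTEXT_GPUS[variant][gpu]


for variant in REDUCED_CONTEXT_GPUS:
    counts = " or ".join(
        f"{full_context_gpus(variant, gpu)}x {gpu}" for gpu in ("H100", "H200")
    )
    print(f"{variant}: {counts}")
```

The printed counts match the "For full 128K context" note under each model.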

Kimi

  • Notes: Requires vLLM v0.10.0rc1+. Minimum 16 GPUs for FP8 with 128K context. Reuses DeepSeekV3 architecture with model_type="kimi_k2".
  • Kimi-K2-Instruct
    • HF: moonshotai/Kimi-K2-Instruct
    • Hardware:
      • 16x H200/H20
        • --tensor-parallel-size 16 --trust-remote-code --enable-auto-tool-choice --tool-call-parser kimi_k2
        • Notes: Pure TP mode. For >16 GPUs, combine with pipeline parallelism.
      • 16x H200/H20 (DP+EP mode)
        • --data-parallel-size 16 --data-parallel-size-local 8 --enable-expert-parallel --max-num-batched-tokens 8192 --max-num-seqs 256 --gpu-memory-utilization 0.85 --trust-remote-code --enable-auto-tool-choice --tool-call-parser kimi_k2
        • Notes: Data parallel + expert parallel mode for higher throughput. Requires multi-node setup with proper networking.
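
The DP+EP entry above implies a node count: 16 data-parallel ranks in total with 8 per node. A small sketch of that arithmetic follows, assuming `--data-parallel-size-local` is the number of data-parallel ranks on each node and that every node runs the same number. The helper name `node_count` is hypothetical.

```python
def node_count(data_parallel_size: int, data_parallel_size_local: int) -> int:
    """Nodes implied by the global and per-node data-parallel sizes.

    Assumes every node hosts the same number of data-parallel ranks.
    """
    if data_parallel_size % data_parallel_size_local != 0:
        raise ValueError("global DP size must be a multiple of the per-node DP size")
    return data_parallel_size // data_parallel_size_local


# Values from the "16x H200/H20 (DP+EP mode)" entry above.
nodes = node_count(16, 8)
print(f"{nodes} nodes x 8 GPUs")
```

With the listed flags this gives two 8-GPU nodes, which is why the entry calls for a multi-node setup.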