mirror of https://github.com/getcompanion-ai/co-mono.git
synced 2026-04-21 04:00:10 +00:00

Initial monorepo setup with npm workspaces and dual TypeScript configuration

- Set up npm workspaces for three packages: pi-tui, pi-agent, and pi (pods)
- Implemented dual TypeScript configuration:
  - Root tsconfig.json with path mappings for development and type checking
  - Package-specific tsconfig.build.json for clean production builds
- Configured lockstep versioning with a sync script for inter-package dependencies
- Added comprehensive documentation for development and publishing workflows
- All packages at version 0.5.0, ready for npm publishing

Commit a74c5da112: 63 changed files with 14558 additions and 0 deletions
---

**packages/pods/docs/gml-4.5.md** (new file, 189 lines)

# GLM-4.5

[Read this in Chinese](./README_zh.md)

<div align="center">
<img src="resources/logo.svg" width="15%"/>
</div>
<p align="center">
    👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> or <a href="https://discord.gg/QR7SARHRxK" target="_blank">Discord</a> community.
    <br>
    📖 Check out the GLM-4.5 <a href="https://z.ai/blog/glm-4.5" target="_blank">technical blog</a>.
    <br>
    📍 Use GLM-4.5 API services on the <a href="https://docs.z.ai/guides/llm/glm-4.5">Z.ai API Platform (Global)</a> or the <br> <a href="https://docs.bigmodel.cn/cn/guide/models/text/glm-4.5">Zhipu AI Open Platform (Mainland China)</a>.
    <br>
    👉 Try <a href="https://chat.z.ai">GLM-4.5</a> with one click.
</p>

## Model Introduction

The **GLM-4.5** series models are foundation models designed for intelligent agents. GLM-4.5 has **355** billion total parameters with **32** billion active parameters, while GLM-4.5-Air adopts a more compact design with **106** billion total parameters and **12** billion active parameters. GLM-4.5 models unify reasoning, coding, and agent capabilities to meet the complex demands of intelligent agent applications.

Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models offering two modes: a thinking mode for complex reasoning and tool use, and a non-thinking mode for immediate responses.

We have open-sourced the base models, the hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT license and can be used commercially and for secondary development.

In our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 scores **63.2**, placing **3rd** among all proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at **59.8** while maintaining superior efficiency.



For more evaluation results, showcases, and technical details, please visit our [technical blog](https://z.ai/blog/glm-4.5). The technical report will be released soon.

The model code, tool parser, and reasoning parser can be found in the implementations in [transformers](https://github.com/huggingface/transformers/tree/main/src/transformers/models/glm4_moe), [vLLM](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/glm4_moe_mtp.py), and [SGLang](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/glm4_moe.py).

## Model Downloads

You can try the model directly on [Hugging Face](https://huggingface.co/spaces/zai-org/GLM-4.5-Space) or [ModelScope](https://modelscope.cn/studios/ZhipuAI/GLM-4.5-Demo), or download it from the links below.

| Model            | Download Links                                                                                                                                | Model Size | Precision |
|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|------------|-----------|
| GLM-4.5          | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5)                     | 355B-A32B  | BF16      |
| GLM-4.5-Air      | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Air)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Air)             | 106B-A12B  | BF16      |
| GLM-4.5-FP8      | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-FP8)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-FP8)             | 355B-A32B  | FP8       |
| GLM-4.5-Air-FP8  | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Air-FP8)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Air-FP8)     | 106B-A12B  | FP8       |
| GLM-4.5-Base     | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Base)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Base)           | 355B-A32B  | BF16      |
| GLM-4.5-Air-Base | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Air-Base)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Air-Base)   | 106B-A12B  | BF16      |

## System Requirements

### Inference

We provide minimum and recommended configurations for "full-featured" model inference. The data in the table below is based on the following conditions:

1. All models use MTP layers and specify `--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4` to ensure competitive inference speed.
2. The `cpu-offload` parameter is not used.
3. The inference batch size does not exceed `8`.
4. All runs are on devices that natively support FP8 inference, so both weights and cache are in FP8 format.
5. Server memory must exceed `1T` to ensure normal model loading and operation.

The models can run under the configurations in the table below:

| Model       | Precision | GPU Type and Count   | Test Framework |
|-------------|-----------|----------------------|----------------|
| GLM-4.5     | BF16      | H100 x 16 / H200 x 8 | sglang         |
| GLM-4.5     | FP8       | H100 x 8 / H200 x 4  | sglang         |
| GLM-4.5-Air | BF16      | H100 x 4 / H200 x 2  | sglang         |
| GLM-4.5-Air | FP8       | H100 x 2 / H200 x 1  | sglang         |

Under the configurations in the table below, the models can use their full 128K context length:

| Model       | Precision | GPU Type and Count    | Test Framework |
|-------------|-----------|-----------------------|----------------|
| GLM-4.5     | BF16      | H100 x 32 / H200 x 16 | sglang         |
| GLM-4.5     | FP8       | H100 x 16 / H200 x 8  | sglang         |
| GLM-4.5-Air | BF16      | H100 x 8 / H200 x 4   | sglang         |
| GLM-4.5-Air | FP8       | H100 x 4 / H200 x 2   | sglang         |

### Fine-tuning

The code can run under the configurations in the table below using [Llama Factory](https://github.com/hiyouga/LLaMA-Factory):

| Model       | GPU Type and Count | Strategy | Batch Size (per GPU) |
|-------------|--------------------|----------|----------------------|
| GLM-4.5     | H100 x 16          | LoRA     | 1                    |
| GLM-4.5-Air | H100 x 4           | LoRA     | 1                    |

The code can run under the configurations in the table below using [Swift](https://github.com/modelscope/ms-swift):

| Model       | GPU Type and Count | Strategy | Batch Size (per GPU) |
|-------------|--------------------|----------|----------------------|
| GLM-4.5     | H20 (96GiB) x 16   | LoRA     | 1                    |
| GLM-4.5-Air | H20 (96GiB) x 4    | LoRA     | 1                    |
| GLM-4.5     | H20 (96GiB) x 128  | SFT      | 1                    |
| GLM-4.5-Air | H20 (96GiB) x 32   | SFT      | 1                    |
| GLM-4.5     | H20 (96GiB) x 128  | RL       | 1                    |
| GLM-4.5-Air | H20 (96GiB) x 32   | RL       | 1                    |

## Quick Start

Please install the required packages according to `requirements.txt`:

```shell
pip install -r requirements.txt
```

### transformers

Please refer to the `trans_infer_cli.py` code in the `inference` folder.

### vLLM

+ Both BF16 and FP8 models can be started with the following command:

```shell
vllm serve zai-org/GLM-4.5-Air \
    --tensor-parallel-size 8 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm-4.5-air
```

If you're using 8x H100 GPUs and encounter insufficient memory when running the GLM-4.5 model, add `--cpu-offload-gb 16` (only applicable to vLLM).

If you encounter FlashInfer issues, use `VLLM_ATTENTION_BACKEND=XFORMERS` as a temporary workaround. You can also specify `TORCH_CUDA_ARCH_LIST='9.0+PTX'` to use FlashInfer (different GPUs require different `TORCH_CUDA_ARCH_LIST` values; please check accordingly).

### SGLang

+ BF16

```shell
python3 -m sglang.launch_server \
    --model-path zai-org/GLM-4.5-Air \
    --tp-size 8 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --mem-fraction-static 0.7 \
    --served-model-name glm-4.5-air \
    --host 0.0.0.0 \
    --port 8000
```

+ FP8

```shell
python3 -m sglang.launch_server \
    --model-path zai-org/GLM-4.5-Air-FP8 \
    --tp-size 4 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --mem-fraction-static 0.7 \
    --disable-shared-experts-fusion \
    --served-model-name glm-4.5-air-fp8 \
    --host 0.0.0.0 \
    --port 8000
```

### Request Parameter Instructions

+ With both `vLLM` and `SGLang`, thinking mode is enabled by default for incoming requests. To disable the thinking switch, add the `extra_body={"chat_template_kwargs": {"enable_thinking": False}}` parameter.
+ Both frameworks support tool calling. Please use the OpenAI-style tool description format for calls.
+ For concrete code, please refer to `api_request.py` in the `inference` folder.
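To make the thinking switch concrete, here is a minimal sketch of the request body with thinking disabled. The `localhost:8000` base URL follows the serve commands above, and `chat_template_kwargs` sits at the top level of the JSON body, which is where the OpenAI SDK's `extra_body` parameter merges it; treat the exact payload as an illustration, not official reference code.

```typescript
// JSON body for POST http://localhost:8000/v1/chat/completions
// (server address assumed from the serve commands above).
const payload = {
  model: "glm-4.5-air",
  messages: [{ role: "user", content: "Hello" }],
  chat_template_kwargs: { enable_thinking: false }, // disable thinking mode
};

const body = JSON.stringify(payload);
console.log(body);
```

Sending this body with any HTTP client is equivalent to passing `extra_body` through the OpenAI SDK.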

---

**packages/pods/docs/gpt-oss.md** (new file, 233 lines)

## `gpt-oss` vLLM Usage Guide

`gpt-oss-20b` and `gpt-oss-120b` are powerful reasoning models open-sourced by OpenAI. In vLLM, you can run them on NVIDIA H100, H200, and B200, as well as AMD MI300x, MI325x, MI355x, and Radeon AI PRO R9700. We are actively working on support for Ampere, Ada Lovelace, and RTX 5090. Specifically, vLLM optimizes for the `gpt-oss` family of models with:

* **Flexible parallelism options**: the models can be sharded across 2, 4, or 8 GPUs, scaling throughput.
* **High-performance attention and MoE kernels**: the attention kernel is specifically optimized for the attention-sinks mechanism and sliding-window shapes.
* **Asynchronous scheduling**: maximizing utilization and throughput by overlapping CPU operations with GPU operations.

This is a living document, and we welcome contributions, corrections, and new recipes!

## Quickstart

### Installation

We highly recommend using a new virtual environment: the first iteration of the release requires cutting-edge kernels from various dependencies, and these might not work with other models. In particular, we will be installing a prerelease version of vLLM, PyTorch nightly, Triton nightly, a FlashInfer prerelease, a Hugging Face prerelease, Harmony, and the gpt-oss library tools.

```
uv venv
source .venv/bin/activate

uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
```

We also provide a Docker container with all the dependencies built in:

```
docker run --gpus all \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:gptoss \
    --model openai/gpt-oss-20b
```

### H100 & H200

You can serve the models with their default parameters:

* `--async-scheduling` can be enabled for higher performance. It is currently not compatible with structured output.
* We recommend TP=2 for H100 and H200 as the best performance trade-off point.

```
# openai/gpt-oss-20b should run on a single GPU
vllm serve openai/gpt-oss-20b --async-scheduling

# gpt-oss-120b fits on a single H100/H200, but scaling to higher TP sizes can help with throughput
vllm serve openai/gpt-oss-120b --async-scheduling
vllm serve openai/gpt-oss-120b --tensor-parallel-size 2 --async-scheduling
vllm serve openai/gpt-oss-120b --tensor-parallel-size 4 --async-scheduling
```

### B200

NVIDIA Blackwell requires the FlashInfer library and several environment variables to enable the necessary kernels. We recommend TP=1 as a performant starting point. We are actively working on vLLM performance on Blackwell.

```
# All 3 of these are required
export VLLM_USE_TRTLLM_ATTENTION=1
export VLLM_USE_TRTLLM_DECODE_ATTENTION=1
export VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1

# Pick only one of the two:
# mxfp8 activation for MoE: faster, but higher risk to accuracy.
export VLLM_USE_FLASHINFER_MXFP4_MOE=1
# bf16 activation for MoE: matches the reference precision.
export VLLM_USE_FLASHINFER_MXFP4_BF16_MOE=1

# openai/gpt-oss-20b
vllm serve openai/gpt-oss-20b --async-scheduling

# gpt-oss-120b
vllm serve openai/gpt-oss-120b --async-scheduling
vllm serve openai/gpt-oss-120b --tensor-parallel-size 2 --async-scheduling
vllm serve openai/gpt-oss-120b --tensor-parallel-size 4 --async-scheduling
```

### AMD

ROCm supports the OpenAI gpt-oss-120b and gpt-oss-20b models on these three GPU families on day one, along with pre-built Docker containers:

* gfx950: MI350x series, `rocm/vllm-dev:open-mi355-08052025`
* gfx942: MI300x/MI325 series, `rocm/vllm-dev:open-mi300-08052025`
* gfx1201: Radeon AI PRO R9700, `rocm/vllm-dev:open-r9700-08052025`

To run the container:

```
alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 32G -v /data:/data -v $HOME:/myhome -w /myhome'

drun rocm/vllm-dev:open-mi300-08052025
```

For MI300x and R9700:

```
export VLLM_ROCM_USE_AITER=1
export VLLM_USE_AITER_UNIFIED_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0

vllm serve openai/gpt-oss-120b --compilation-config '{"full_cuda_graph": true}'
```

For MI355x:

```
# MoE preshuffle, fusion, and Triton GEMM flags
export VLLM_USE_AITER_TRITON_FUSED_SPLIT_QKV_ROPE=1
export VLLM_USE_AITER_TRITON_FUSED_ADD_RMSNORM_PAD=1
export VLLM_USE_AITER_TRITON_GEMM=1
export VLLM_ROCM_USE_AITER=1
export VLLM_USE_AITER_UNIFIED_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0
export TRITON_HIP_PRESHUFFLE_SCALES=1

vllm serve openai/gpt-oss-120b --compilation-config '{"compile_sizes": [1, 2, 4, 8, 16, 24, 32, 64, 128, 256, 4096, 8192], "full_cuda_graph": true}' --block-size 64
```

## Usage

Once `vllm serve` is running and `INFO: Application startup complete` has been displayed, you can send requests via HTTP or the OpenAI SDK to the following endpoints:

* The `/v1/responses` endpoint can perform tool use (browsing, Python, MCP) in between chain-of-thought steps and deliver a final response. It leverages the `openai-harmony` library for input rendering and output parsing. Stateful operation and the full streaming API are work in progress. The Responses API is recommended by OpenAI as the way to interact with this model.
* The `/v1/chat/completions` endpoint offers a familiar interface to the model. No tools will be invoked, but reasoning and final text output are returned in structured form. Function calling is work in progress. You can also set `include_reasoning: false` in the request parameters to omit CoT from the output.
* The `/v1/completions` endpoint provides a simple input/output interface without any template rendering.

All endpoints accept `stream: true` to enable incremental token streaming. Please note that vLLM currently does not cover the full scope of the Responses API; for details, see the Known Limitations section below.

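As a sketch of the `/v1/chat/completions` shape described above (the host, port, and prompt are our assumptions, based on the default `vllm serve` setup rather than vLLM documentation):

```typescript
// JSON body for POST http://localhost:8000/v1/chat/completions against a
// running `vllm serve openai/gpt-oss-120b` instance (address assumed).
const request = {
  model: "openai/gpt-oss-120b",
  messages: [{ role: "user", content: "Summarize attention sinks in one line." }],
  include_reasoning: false, // omit chain-of-thought from the output
  stream: true,             // incremental token streaming
};

console.log(JSON.stringify(request));
```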
### Tool Use

A premier feature of gpt-oss is the ability to call tools directly, known as "built-in tools". In vLLM, we offer several options:

* By default, we integrate with the reference library's browser (with `ExaBackend`) and a demo Python interpreter via a Docker container. To use the search backend, you need access to [exa.ai](http://exa.ai) and must set `EXA_API_KEY=` as an environment variable. For Python, either have Docker available, or set `PYTHON_EXECUTION_BACKEND=UV` to dangerously allow model-generated code snippets to execute on the same machine.

```
uv pip install gpt-oss

vllm serve ... --tool-server demo
```

* Please note that the default options are for demo purposes only. For production usage, vLLM itself can act as an MCP client to multiple services. Here is an [example tool server](https://github.com/openai/gpt-oss/tree/main/gpt-oss-mcp-server) that vLLM can work with; it wraps the demo tools:

```
mcp run -t sse browser_server.py:mcp
mcp run -t sse python_server.py:mcp

vllm serve ... --tool-server ip-1:port-1,ip-2:port-2
```

The URLs are expected to be MCP SSE servers that implement `instructions` in their server info and provide well-documented tools. The tools are injected into the system prompt so the model can use them.

## Accuracy Evaluation Panels

OpenAI recommends using the gpt-oss reference library to perform evaluation. For example:

```
python -m gpt_oss.evals --model 120b-low --eval gpqa --n-threads 128
python -m gpt_oss.evals --model 120b --eval gpqa --n-threads 128
python -m gpt_oss.evals --model 120b-high --eval gpqa --n-threads 128
```

To evaluate on AIME2025, change `gpqa` to `aime25`. With vLLM deployed:

```
# Example deployment on 8xH100
vllm serve openai/gpt-oss-120b \
    --tensor_parallel_size 8 \
    --max-model-len 131072 \
    --max-num-batched-tokens 10240 \
    --max-num-seqs 128 \
    --gpu-memory-utilization 0.85 \
    --no-enable-prefix-caching
```

Here are the scores we were able to reproduce without tool use, and we encourage you to try reproducing them as well! We have observed that the numbers may vary slightly across runs, so feel free to run the evaluation multiple times to get a sense of the variance. For a quick correctness check, we recommend starting with the low reasoning effort setting (`120b-low`), which should complete within minutes.

Model: 120B

| Reasoning Effort | GPQA | AIME25 |
| :---- | :---- | :---- |
| Low | 65.3 | 51.2 |
| Mid | 72.4 | 79.6 |
| High | 79.4 | 93.0 |

Model: 20B

| Reasoning Effort | GPQA | AIME25 |
| :---- | :---- | :---- |
| Low | 56.8 | 38.8 |
| Mid | 67.5 | 75.0 |
| High | 70.9 | 85.8 |

## Known Limitations

* On H100 with tensor parallel size 1, the default GPU memory utilization and batched-token settings will cause CUDA out-of-memory errors. When running TP1, please increase the GPU memory utilization or lower the batched-token limit:

```
vllm serve openai/gpt-oss-120b --gpu-memory-utilization 0.95 --max-num-batched-tokens 1024
```

* When running TP2 on H100, set the GPU memory utilization below 0.95, as higher values will also cause OOM.
* The Responses API has several limitations at the moment; we strongly welcome contribution and maintenance of this service in vLLM:
  * Usage accounting is currently broken and only returns zeros.
  * Annotations (citing URLs from search results) are not supported.
  * Truncation by `max_tokens` might not preserve partial chunks.
  * Streaming is fairly bare-bones at the moment; for example:
    * Item IDs and indexing need more work.
    * Tool invocations and outputs are batched rather than properly streamed.
    * Proper error handling is missing.

## Troubleshooting

- Attention sink dtype error on Blackwell:

```
ERROR 08-05 07:31:10 [multiproc_executor.py:559] assert sinks.dtype == torch.float32, "Sinks must be of type float32"
(VllmWorker TP0 pid=174579) ERROR 08-05 07:31:10 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=174579) ERROR 08-05 07:31:10 [multiproc_executor.py:559] AssertionError: Sinks must be of type float32
```

**Solution: refer to the B200 section above and check that the related environment variables are set.**

- Triton issue where `tl.language` is not defined:

**Solution: make sure no other Triton build (pytorch-triton, etc.) is installed in your environment.**

---

**packages/pods/docs/implementation-plan.md** (new file, 183 lines)

# Implementation Plan

## Core Principles
- TypeScript throughout
- Clean, minimal code
- Self-contained modules
- Direct SSH execution (no remote manager)
- All state in local JSON

## Package 1: Pod Setup Script Generation
Generate and execute pod_setup.sh via SSH

- [ ] `src/setup/generate-setup-script.ts` - Generate bash script as string
  - [ ] Detect CUDA driver version
  - [ ] Determine CUDA toolkit version needed
  - [ ] Generate uv/Python install commands
  - [ ] Generate venv creation commands
  - [ ] Generate pip install commands (torch, vLLM, etc.)
  - [ ] Handle model-specific vLLM versions (e.g., gpt-oss needs 0.10.1+gptoss)
  - [ ] Generate mount commands if --mount provided
  - [ ] Generate env var setup (HF_TOKEN, PI_API_KEY)

- [ ] `src/setup/detect-hardware.ts` - Run nvidia-smi and parse GPU info
  - [ ] Execute nvidia-smi via SSH
  - [ ] Parse GPU count, names, memory
  - [ ] Return structured GPU info

- [ ] `src/setup/execute-setup.ts` - Main setup orchestrator
  - [ ] Generate setup script
  - [ ] Copy and execute via SSH
  - [ ] Stream output to console
  - [ ] Handle Ctrl+C properly
  - [ ] Save GPU info to local config

## Package 2: Config Management
Local JSON state management

- [ ] `src/config/types.ts` - TypeScript interfaces
  - [ ] Pod interface (ssh, gpus, models, mount)
  - [ ] Model interface (model, port, gpu, pid)
  - [ ] GPU interface (id, name, memory)

- [ ] `src/config/store.ts` - Read/write ~/.pi/pods.json
  - [ ] Load config (handle missing file)
  - [ ] Save config (atomic write)
  - [ ] Get active pod
  - [ ] Add/remove pods
  - [ ] Update model state

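A rough sketch of the Package 2 shapes and the atomic write, under the assumption that the field names above map directly onto interface members (the real definitions belong in `src/config/types.ts` and `src/config/store.ts`):

```typescript
import { existsSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Illustrative shapes; the real interfaces live in src/config/types.ts.
interface Gpu { id: number; name: string; memory: string; }
interface Model { model: string; port: number; gpu: number; pid: number; }
interface Pod { ssh: string; gpus: Gpu[]; models: Model[]; mount?: string; }
interface PodsConfig { active?: string; pods: Record<string, Pod>; }

// Load config, tolerating a missing file.
function loadConfig(path: string): PodsConfig {
  if (!existsSync(path)) return { pods: {} };
  return JSON.parse(readFileSync(path, "utf8"));
}

// Atomic write: write a temp file, then rename it over the target so a
// crash mid-write never leaves a truncated pods.json behind.
function saveConfig(path: string, config: PodsConfig): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(config, null, 2));
  renameSync(tmp, path);
}

// Quick round trip in a temporary location:
const demoPath = join(tmpdir(), `pods-demo-${process.pid}.json`);
saveConfig(demoPath, { active: "demo", pods: { demo: { ssh: "root@1.2.3.4", gpus: [], models: [] } } });
const loaded = loadConfig(demoPath);
console.log(loaded.active); // prints "demo"
```

The rename-based write is the standard way to get atomicity on POSIX filesystems when the temp file lives on the same volume as the target.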
## Package 3: SSH Executor
Clean SSH command execution

- [ ] `src/ssh/executor.ts` - SSH command wrapper
  - [ ] Execute command with streaming output
  - [ ] Execute command with captured output
  - [ ] Handle SSH errors gracefully
  - [ ] Support Ctrl+C propagation
  - [ ] Support background processes (nohup)

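One way to sketch the Package 3 wrapper; the option set and function names (`buildSshArgs`, `runSsh`) are illustrative assumptions, not the actual executor:

```typescript
import { spawn } from "node:child_process";

// Build the argv for an ssh invocation; pure, so it is easy to test.
function buildSshArgs(host: string, command: string, opts?: { background?: boolean }): string[] {
  const remote = opts?.background
    ? `nohup ${command} > /dev/null 2>&1 & echo $!` // background: print the remote PID
    : command;
  return ["-o", "BatchMode=yes", host, remote];
}

// Execute a remote command, streaming its output to the local console.
function runSsh(host: string, command: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const child = spawn("ssh", buildSshArgs(host, command), { stdio: "inherit" });
    child.on("error", reject);
    child.on("close", (code) => resolve(code ?? 1));
  });
}
```

Keeping argv construction separate from `spawn` makes the Ctrl+C and nohup behavior testable without a real pod, which matches the "mock commands" item in the Testing Strategy.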
## Package 4: Pod Commands
Pod management CLI commands

- [ ] `src/commands/pods-setup.ts` - pi pods setup
  - [ ] Parse args (name, ssh, mount)
  - [ ] Check env vars (HF_TOKEN, PI_API_KEY)
  - [ ] Call setup executor
  - [ ] Save pod to config

- [ ] `src/commands/pods-list.ts` - pi pods
  - [ ] Load config
  - [ ] Display all pods with active marker

- [ ] `src/commands/pods-active.ts` - pi pods active
  - [ ] Switch active pod
  - [ ] Update config

- [ ] `src/commands/pods-remove.ts` - pi pods remove
  - [ ] Remove from config (not remote)

## Package 5: Model Management
Model lifecycle management

- [ ] `src/models/model-config.ts` - Known model configurations
  - [ ] Load models.md data structure
  - [ ] Match hardware to vLLM args
  - [ ] Get model-specific env vars

- [ ] `src/models/download.ts` - Model download via HF
  - [ ] Check if model cached
  - [ ] Run huggingface-cli download
  - [ ] Stream progress to console
  - [ ] Handle Ctrl+C

- [ ] `src/models/vllm-builder.ts` - Build vLLM command
  - [ ] Get base command for model
  - [ ] Add hardware-specific args
  - [ ] Add user --vllm args
  - [ ] Add port and API key

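The vllm-builder steps above can be sketched as a single pure function; the argument grouping and the signature are illustrative assumptions (`--port` and `--api-key` are real vLLM serve flags):

```typescript
// Assemble a `vllm serve` command line from base, hardware, and user args.
function buildVllmCommand(
  model: string,
  port: number,
  baseArgs: string[],
  hardwareArgs: string[],
  userArgs: string[],
  apiKey?: string,
): string {
  const parts = ["vllm", "serve", model, ...baseArgs, ...hardwareArgs, ...userArgs, "--port", String(port)];
  if (apiKey) parts.push("--api-key", apiKey);
  return parts.join(" ");
}

const cmd = buildVllmCommand(
  "zai-org/GLM-4.5-Air",
  8000,
  ["--tool-call-parser", "glm45", "--reasoning-parser", "glm45"],
  ["--tensor-parallel-size", "8"],
  [],
);
console.log(cmd);
```

User `--vllm` args come last (before the managed `--port`), so the user can override hardware defaults while the CLI keeps control of port assignment.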
## Package 6: Model Commands
Model management CLI commands

- [ ] `src/commands/start.ts` - pi start
  - [ ] Parse model and args
  - [ ] Find next available port
  - [ ] Select GPU (round-robin)
  - [ ] Download if needed
  - [ ] Build and execute vLLM command
  - [ ] Wait for health check
  - [ ] Update config on success

- [ ] `src/commands/stop.ts` - pi stop
  - [ ] Find model in config
  - [ ] Kill process via PID
  - [ ] Clean up config

- [ ] `src/commands/list.ts` - pi list
  - [ ] Show models from config
  - [ ] Optionally verify PIDs

- [ ] `src/commands/logs.ts` - pi logs
  - [ ] Tail log file via SSH
  - [ ] Handle Ctrl+C (stop tailing only)

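The port and GPU selection for `pi start` can be sketched as two pure helpers; the base port and the exact round-robin policy (fewest models wins) are assumptions for illustration:

```typescript
// Find the next free port at or above `base`, skipping ports already in use.
function nextPort(used: number[], base = 8000): number {
  let port = base;
  while (used.includes(port)) port++;
  return port;
}

// Round-robin GPU selection: pick the GPU hosting the fewest models so far.
function pickGpu(gpuIds: number[], assignments: number[]): number {
  let best = gpuIds[0];
  let bestCount = Infinity;
  for (const id of gpuIds) {
    const count = assignments.filter((a) => a === id).length;
    if (count < bestCount) { best = id; bestCount = count; }
  }
  return best;
}

console.log(nextPort([8000, 8001])); // 8002
console.log(pickGpu([0, 1], [0]));   // 1
```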
## Package 7: Model Testing
Quick model testing with tools

- [ ] `src/prompt/tools.ts` - Tool definitions
  - [ ] Define ls, read, glob, rg tools
  - [ ] Format for OpenAI API

- [ ] `src/prompt/client.ts` - OpenAI client wrapper
  - [ ] Create client for model endpoint
  - [ ] Handle streaming responses
  - [ ] Display thinking, tools, content

- [ ] `src/commands/prompt.ts` - pi prompt
  - [ ] Get model endpoint from config
  - [ ] Augment prompt with CWD info
  - [ ] Send request with tools
  - [ ] Display formatted response

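An OpenAI-style function-tool definition for the `ls` tool above might look like this; the description and parameter names are illustrative, but the wrapper shape (`type: "function"` with a JSON Schema `parameters` object) is the standard chat-completions tools format:

```typescript
// One entry of the `tools` array sent with a chat-completions request.
const lsTool = {
  type: "function",
  function: {
    name: "ls",
    description: "List files in a directory relative to the working directory.",
    parameters: {
      type: "object",
      properties: {
        path: { type: "string", description: "Directory to list, e.g. 'src'" },
      },
      required: ["path"],
    },
  },
};

console.log(JSON.stringify(lsTool));
```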
## Package 8: CLI Entry Point
Main CLI with commander.js

- [ ] `src/cli.ts` - Main entry point
  - [ ] Set up commander program
  - [ ] Register all commands
  - [ ] Handle global options (--pod override)
  - [ ] Error handling

- [ ] `src/index.ts` - Package exports

## Testing Strategy
- [ ] Test pod_setup.sh generation locally
- [ ] Test on local machine with GPU
- [ ] Test SSH executor with mock commands
- [ ] Test config management with temp files
- [ ] Integration test on real pod

## Dependencies
```json
{
  "dependencies": {
    "commander": "^12.0.0",
    "@commander-js/extra-typings": "^12.0.0",
    "openai": "^4.0.0",
    "chalk": "^5.0.0",
    "ora": "^8.0.0"
  },
  "devDependencies": {
    "@types/node": "^22.0.0",
    "typescript": "^5.0.0",
    "tsx": "^4.0.0"
  }
}
```

## Build & Distribution
- [ ] TypeScript config for Node.js target
- [ ] Build to dist/
- [ ] npm package with bin entry
- [ ] npx support

---

**packages/pods/docs/kimi-k2.md** (new file, 197 lines)

# Kimi-K2 Deployment Guide

> [!Note]
> This guide only provides example deployment commands for Kimi-K2, which may not be the optimal configuration. Since inference engines are updated frequently, please continue to follow the guidance on their homepages if you want better inference performance.

## vLLM Deployment
vLLM v0.10.0rc1 or later is required.

The smallest deployment unit for Kimi-K2 FP8 weights with 128k sequence length on mainstream H200 or H20 platforms is a cluster of 16 GPUs using either Tensor Parallelism (TP) or "data parallelism + expert parallelism" (DP+EP). Running parameters for this environment are provided below. You may scale up to more nodes and increase expert parallelism to enlarge the inference batch size and overall throughput.

### Tensor Parallelism

When the parallelism degree is ≤ 16, you can run inference with pure Tensor Parallelism. A sample launch command:

```bash
# start ray on node 0 and node 1

# node 0:
vllm serve $MODEL_PATH \
    --port 8000 \
    --served-model-name kimi-k2 \
    --trust-remote-code \
    --tensor-parallel-size 16 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2
```

**Key parameter notes:**
- `--tensor-parallel-size 16`: If using more than 16 GPUs, combine with pipeline parallelism.
- `--enable-auto-tool-choice`: Required when enabling tool usage.
- `--tool-call-parser kimi_k2`: Required when enabling tool usage.

### Data Parallelism + Expert Parallelism

You can install libraries such as DeepEP and DeepGEMM as needed. Then run (example on H200):

```bash
# node 0
vllm serve $MODEL_PATH --port 8000 --served-model-name kimi-k2 --trust-remote-code \
    --data-parallel-size 16 --data-parallel-size-local 8 \
    --data-parallel-address $MASTER_IP --data-parallel-rpc-port $PORT \
    --enable-expert-parallel --max-num-batched-tokens 8192 --max-num-seqs 256 \
    --gpu-memory-utilization 0.85 --enable-auto-tool-choice --tool-call-parser kimi_k2

# node 1
vllm serve $MODEL_PATH --headless --data-parallel-start-rank 8 --port 8000 \
    --served-model-name kimi-k2 --trust-remote-code \
    --data-parallel-size 16 --data-parallel-size-local 8 \
    --data-parallel-address $MASTER_IP --data-parallel-rpc-port $PORT \
    --enable-expert-parallel --max-num-batched-tokens 8192 --max-num-seqs 256 \
    --gpu-memory-utilization 0.85 --enable-auto-tool-choice --tool-call-parser kimi_k2
```

## SGLang Deployment

Similarly, we can use TP or DP+EP in SGLang for deployment. Here are examples.

### Tensor Parallelism

Here is a simple example of running TP16 across two H200 nodes:

```bash
# Node 0
python -m sglang.launch_server --model-path $MODEL_PATH --tp 16 \
    --dist-init-addr $MASTER_IP:50000 --nnodes 2 --node-rank 0 \
    --trust-remote-code --tool-call-parser kimi_k2

# Node 1
python -m sglang.launch_server --model-path $MODEL_PATH --tp 16 \
    --dist-init-addr $MASTER_IP:50000 --nnodes 2 --node-rank 1 \
    --trust-remote-code --tool-call-parser kimi_k2
```

**Key parameter notes:**
- `--tool-call-parser kimi_k2`: Required when enabling tool usage.

### Data Parallelism + Expert Parallelism

Here is an example of large-scale prefill-decode disaggregation (4P12D on H200) with DP+EP in SGLang:

``` bash
# prefill nodes
MC_TE_METRIC=true SGLANG_DISAGGREGATION_HEARTBEAT_INTERVAL=10000000 SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 PYTHONUNBUFFERED=1 \
python -m sglang.launch_server --model-path $MODEL_PATH \
--trust-remote-code --disaggregation-mode prefill --dist-init-addr $PREFILL_NODE0:5757 --tp-size 32 --dp-size 32 --enable-dp-attention --host $LOCAL_IP --decode-log-interval 1 --disable-radix-cache --enable-deepep-moe --moe-dense-tp-size 1 --enable-dp-lm-head --disable-shared-experts-fusion --watchdog-timeout 1000000 --enable-two-batch-overlap --disaggregation-ib-device $IB_DEVICE --chunked-prefill-size 131072 --mem-fraction-static 0.85 --deepep-mode normal --ep-dispatch-algorithm dynamic --eplb-algorithm deepseek --max-running-requests 1024 --nnodes 4 --node-rank $RANK --tool-call-parser kimi_k2

# decode nodes
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=480 MC_TE_METRIC=true SGLANG_DISAGGREGATION_HEARTBEAT_INTERVAL=10000000 SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 PYTHONUNBUFFERED=1 \
python -m sglang.launch_server --model-path $MODEL_PATH --trust-remote-code --disaggregation-mode decode --dist-init-addr $DECODE_NODE0:5757 --tp-size 96 --dp-size 96 --enable-dp-attention --host $LOCAL_IP --decode-log-interval 1 --context-length 2176 --disable-radix-cache --enable-deepep-moe --moe-dense-tp-size 1 --enable-dp-lm-head --disable-shared-experts-fusion --watchdog-timeout 1000000 --enable-two-batch-overlap --disaggregation-ib-device $IB_DEVICE --deepep-mode low_latency --mem-fraction-static 0.8 --cuda-graph-bs 480 --max-running-requests 46080 --ep-num-redundant-experts 96 --nnodes 12 --node-rank $RANK --tool-call-parser kimi_k2

# prefill-decode load balancer (PDLB)
PYTHONUNBUFFERED=1 python -m sglang.srt.disaggregation.launch_lb --prefill http://${PREFILL_NODE0}:30000 --decode http://${DECODE_NODE0}:30000
```

## KTransformers Deployment

Please copy all configuration files (i.e., everything except the `.safetensors` files) into the GGUF checkpoint folder at `/path/to/K2`. Then run:

``` bash
python ktransformers/server/main.py --model_path /path/to/K2 --gguf_path /path/to/K2 --cache_lens 30000
```

To enable AMX optimization, run:

``` bash
python ktransformers/server/main.py --model_path /path/to/K2 --gguf_path /path/to/K2 --cache_lens 30000 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts-serve-amx.yaml
```

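The copy step above can be sketched as follows. This is a minimal illustration using temporary directories as stand-ins for the HF checkpoint and the GGUF folder `/path/to/K2`; the file names are made up:

```bash
# Copy every top-level file except *.safetensors into the GGUF folder.
SRC=$(mktemp -d)   # stand-in for the HF checkpoint directory
DST=$(mktemp -d)   # stand-in for /path/to/K2
touch "$SRC/config.json" "$SRC/tokenizer_config.json" "$SRC/model-00001.safetensors"

find "$SRC" -maxdepth 1 -type f ! -name '*.safetensors' -exec cp {} "$DST" \;
ls "$DST"
```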
## TensorRT-LLM Deployment

### Prerequisite

Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html) to build TensorRT-LLM v1.0.0-rc2 from source and start a TRT-LLM docker container.

Then install blobfile:

```bash
pip install blobfile
```

### Multi-node Serving

TensorRT-LLM supports multi-node inference. You can use mpirun to launch Kimi-K2 across multiple nodes. This example uses two nodes.

#### mpirun

mpirun requires each node to have passwordless SSH access to the other node, so we need to set up the environment inside the docker containers. Run the container with the host network and mount the current directory as well as the model directory into the container.

```bash
# use host network
IMAGE=<YOUR_IMAGE>
NAME=test_2node_docker
# host1
docker run -it --name ${NAME}_host1 --ipc=host --gpus=all --network host --privileged --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/workspace -v <YOUR_MODEL_DIR>:/models/DeepSeek-V3 -w /workspace ${IMAGE}
# host2
docker run -it --name ${NAME}_host2 --ipc=host --gpus=all --network host --privileged --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/workspace -v <YOUR_MODEL_DIR>:/models/DeepSeek-V3 -w /workspace ${IMAGE}
```

Set up SSH inside the containers:

```bash
apt-get update && apt-get install -y openssh-server

# edit /etc/ssh/sshd_config: allow root login with public-key auth,
# and move the default port 22 to an unused port (2233 here)
sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
sed -i 's/^#\?PubkeyAuthentication.*/PubkeyAuthentication yes/' /etc/ssh/sshd_config
sed -i 's/^#\?Port .*/Port 2233/' /etc/ssh/sshd_config
```

Generate an SSH key on host1 and copy it to host2, and vice versa.

```bash
# on host1
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
ssh-copy-id -i ~/.ssh/id_ed25519.pub root@<HOST2>
# on host2
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
ssh-copy-id -i ~/.ssh/id_ed25519.pub root@<HOST1>

# restart the ssh service on host1 and host2
service ssh restart      # or
/etc/init.d/ssh restart  # or
systemctl restart ssh
```

Generate an additional config file for `trtllm-serve`:

```bash
cat >/path/to/TensorRT-LLM/extra-llm-api-config.yml <<EOF
cuda_graph_config:
  padding_enabled: true
  batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
print_iter_log: true
enable_attention_dp: true
EOF
```

After these preparations, you can run trtllm-serve on the two nodes using mpirun:

```bash
mpirun -np 16 \
-H <HOST1>:8,<HOST2>:8 \
-mca plm_rsh_args "-p 2233" \
--allow-run-as-root \
trtllm-llmapi-launch trtllm-serve serve \
--backend pytorch \
--tp_size 16 \
--ep_size 8 \
--kv_cache_free_gpu_memory_fraction 0.95 \
--trust_remote_code \
--max_batch_size 128 \
--max_num_tokens 4096 \
--extra_llm_api_options /path/to/TensorRT-LLM/extra-llm-api-config.yml \
--port 8000 \
<YOUR_MODEL_DIR>
```

## Others

Kimi-K2 reuses the `DeepSeekV3CausalLM` architecture, with its weights converted into the proper shapes, to save redevelopment effort. To let inference engines distinguish it from DeepSeek-V3 and apply the best optimizations, we set `"model_type": "kimi_k2"` in `config.json`.

If you are using a framework that is not on the recommended list, you can still run the model by manually changing `model_type` to `"deepseek_v3"` in `config.json` as a temporary workaround. You may need to parse tool calls manually if no tool-call parser is available in your framework.

packages/pods/docs/models.md

### Qwen-Coder

- [ ] Qwen2.5-Coder-32B-Instruct
  - HF: Qwen/Qwen2.5-Coder-32B-Instruct
  - Hardware:
    - 1x H100/H200
      - --tool-call-parser hermes --enable-auto-tool-choice
    - 2x H100/H200
      - --tensor-parallel-size 2 --tool-call-parser hermes --enable-auto-tool-choice
  - Notes: Good balance of size and performance. Single GPU capable.
- [ ] Qwen3-Coder-480B-A35B-Instruct (BF16)
  - HF: Qwen/Qwen3-Coder-480B-A35B-Instruct
  - Hardware:
    - 8x H200/H20
      - --tensor-parallel-size 8 --max-model-len 32000 --enable-auto-tool-choice --tool-call-parser qwen3_coder
  - Notes: Cannot serve the full 262K context on a single node. Reduce max-model-len or increase gpu-memory-utilization.
- [ ] Qwen3-Coder-480B-A35B-Instruct-FP8
  - HF: Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8
  - Hardware:
    - 8x H200/H20
      - --max-model-len 131072 --enable-expert-parallel --data-parallel-size 8 --enable-auto-tool-choice --tool-call-parser qwen3_coder
      - Env: VLLM_USE_DEEP_GEMM=1
  - Notes: Use data-parallel mode (not tensor-parallel) to avoid weight-quantization errors. DeepGEMM recommended.
- [ ] Qwen3-Coder-30B-A3B-Instruct (BF16)
  - HF: Qwen/Qwen3-Coder-30B-A3B-Instruct
  - Hardware:
    - 1x H100/H200
      - --enable-auto-tool-choice --tool-call-parser qwen3_coder
      - Notes: Fits comfortably on a single GPU. ~60GB model weights.
    - 2x H100/H200
      - --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser qwen3_coder
      - Notes: For higher throughput/longer context.
- [ ] Qwen3-Coder-30B-A3B-Instruct-FP8
  - HF: Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
  - Hardware:
    - 1x H100/H200
      - --enable-auto-tool-choice --tool-call-parser qwen3_coder
      - Env: VLLM_USE_DEEP_GEMM=1
  - Notes: FP8 quantized, ~30GB model weights. Excellent for single-GPU deployment.

### GPT-OSS

- Notes: Requires vLLM 0.10.1+gptoss. Built-in tools (browsing, Python) via the /v1/responses endpoint. Function calling not yet supported. --async-scheduling recommended for higher performance (not compatible with structured output).
- [ ] GPT-OSS-20B
  - HF: openai/gpt-oss-20b
  - Hardware:
    - 1x H100/H200
      - --async-scheduling
    - 1x B200
      - --async-scheduling
      - Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
- [ ] GPT-OSS-120B
  - HF: openai/gpt-oss-120b
  - Hardware:
    - 1x H100/H200
      - --async-scheduling
      - Notes: Needs --gpu-memory-utilization 0.95 --max-num-batched-tokens 1024 to avoid OOM
    - 2x H100/H200
      - --tensor-parallel-size 2 --async-scheduling
      - Notes: Set --gpu-memory-utilization <0.95 to avoid OOM
    - 4x H100/H200
      - --tensor-parallel-size 4 --async-scheduling
    - 8x H100/H200
      - --tensor-parallel-size 8 --async-scheduling --max-model-len 131072 --max-num-batched-tokens 10240 --max-num-seqs 128 --gpu-memory-utilization 0.85 --no-enable-prefix-caching
    - 1x B200
      - --async-scheduling
      - Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
    - 2x B200
      - --tensor-parallel-size 2 --async-scheduling
      - Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1

### GLM-4.5

- Notes: Listed configs support reduced context. For the full 128K context, double the GPU count. Models default to thinking mode (disable with API param).
- [ ] GLM-4.5 (BF16)
  - HF: zai-org/GLM-4.5
  - Hardware:
    - 16x H100
      - --tensor-parallel-size 16 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    - 8x H200
      - --tensor-parallel-size 8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
  - Notes: On 8x H100, may need --cpu-offload-gb 16 to avoid OOM. For full 128K: needs 32x H100 or 16x H200.
- [ ] GLM-4.5-FP8
  - HF: zai-org/GLM-4.5-FP8
  - Hardware:
    - 8x H100
      - --tensor-parallel-size 8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    - 4x H200
      - --tensor-parallel-size 4 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
  - Notes: For full 128K context: needs 16x H100 or 8x H200.
- [ ] GLM-4.5-Air (BF16)
  - HF: zai-org/GLM-4.5-Air
  - Hardware:
    - 4x H100
      - --tensor-parallel-size 4 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    - 2x H200
      - --tensor-parallel-size 2 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
  - Notes: For full 128K context: needs 8x H100 or 4x H200.
- [ ] GLM-4.5-Air-FP8
  - HF: zai-org/GLM-4.5-Air-FP8
  - Hardware:
    - 2x H100
      - --tensor-parallel-size 2 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
    - 1x H200
      - --tensor-parallel-size 1 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
  - Notes: For full 128K context: needs 4x H100 or 2x H200.

### Kimi

- Notes: Requires vLLM v0.10.0rc1+. Minimum 16 GPUs for FP8 with 128K context. Reuses the DeepSeekV3 architecture with model_type="kimi_k2".
- [ ] Kimi-K2-Instruct
  - HF: moonshotai/Kimi-K2-Instruct
  - Hardware:
    - 16x H200/H20
      - --tensor-parallel-size 16 --trust-remote-code --enable-auto-tool-choice --tool-call-parser kimi_k2
      - Notes: Pure TP mode. For >16 GPUs, combine with pipeline parallelism.
    - 16x H200/H20 (DP+EP mode)
      - --data-parallel-size 16 --data-parallel-size-local 8 --enable-expert-parallel --max-num-batched-tokens 8192 --max-num-seqs 256 --gpu-memory-utilization 0.85 --trust-remote-code --enable-auto-tool-choice --tool-call-parser kimi_k2
      - Notes: Data-parallel + expert-parallel mode for higher throughput. Requires a multi-node setup with proper networking.

packages/pods/docs/plan.md

## Pi

Pi automates vLLM deployment on GPU pods from DataCrunch, Vast.ai, Prime Intellect, and RunPod (or any Ubuntu machine with NVIDIA GPUs). It manages multiple concurrent model deployments via separate vLLM instances, each accessible through the OpenAI API protocol with API key authentication.

Pods are treated as ephemeral: spin them up when needed, tear them down when done. To avoid re-downloading models (30+ minutes for 100GB+ models), pi uses persistent network volumes for model storage that can be shared across pods on the same provider. This minimizes both cost (you only pay for active compute) and setup time (models are already cached).

## Usage

### Pods

```bash
pi pods setup dc1 "ssh root@1.2.3.4" --mount "mount -t nfs..."  # Setup pod (requires HF_TOKEN, PI_API_KEY env vars)
pi pods              # List all pods (* = active)
pi pods active dc2   # Switch active pod
pi pods remove dc1   # Remove pod
```

### Models

```bash
pi start Qwen/Qwen2.5-72B-Instruct --name qwen72b  # Known model - pi handles vLLM args
pi start some/unknown-model --name mymodel --vllm --tensor-parallel-size 4 --max-model-len 32768  # Custom vLLM args
pi list              # List running models with ports
pi stop qwen72b      # Stop model
pi logs qwen72b      # View model logs
```

For known models, pi automatically configures appropriate vLLM arguments from model documentation based on the pod's hardware. For unknown models or custom configurations, pass vLLM args after `--vllm`.

## Pod management

Pi manages GPU pods from various providers (DataCrunch, Vast.ai, Prime Intellect, RunPod) as ephemeral compute resources. Users manually create pods via provider dashboards, then register them with pi for automated setup and management.

Key capabilities:
- **Pod setup**: Transform bare Ubuntu/Debian machines into vLLM-ready environments in ~2 minutes
- **Model caching**: Optional persistent storage shared by pods to avoid re-downloading 100GB+ models
- **Multi-pod management**: Register multiple pods, switch between them, maintain different environments

### Pod setup

When a user creates a fresh pod on a provider, they register it with pi using the SSH command from the provider:

```bash
pi pods setup dc1 "ssh root@1.2.3.4" --mount "mount -t nfs..."
```

This copies and executes `pod_setup.sh`, which:
1. Detects GPUs via `nvidia-smi` and stores count/memory in local config
2. Installs the CUDA toolkit matching the driver version
3. Creates a Python environment
   - Installs uv and Python 3.12
   - Creates a venv at ~/venv with PyTorch (--torch-backend=auto)
   - Installs vLLM (model-specific versions when needed)
   - Installs FlashInfer (builds from source if required)
   - Installs huggingface-hub (for model downloads)
   - Installs hf-transfer (for accelerated downloads)
4. Mounts persistent storage if provided
   - Symlinks it to ~/.cache/huggingface for model caching
5. Configures environment variables persistently

Required environment variables:
- `HF_TOKEN`: HuggingFace token for model downloads
- `PI_API_KEY`: API key for securing vLLM endpoints

### Model caching

Models can be 100GB+ and take 30+ minutes to download. The `--mount` flag enables persistent model caching:

- **DataCrunch**: NFS shared filesystems, mountable across multiple running pods in the same region
- **RunPod**: Network volumes persist independently but cannot be shared between running pods
- **Vast.ai**: Volumes are locked to a specific machine - no sharing
- **Prime Intellect**: No persistent storage documented

Without `--mount`, models download to pod-local storage and are lost on termination.

### Multi-pod management

Users can register multiple pods and switch between them:

```bash
pi pods              # List all pods (* = active)
pi pods active dc2   # Switch active pod
pi pods remove dc1   # Remove pod from local config (does not destroy the pod remotely)
```

All model commands (`pi start`, `pi stop`, etc.) target the active pod, unless `--pod <podname>` is given, which overrides the active pod for that command.

## Model deployment

Pi uses direct SSH commands to manage vLLM instances on pods. No remote manager component is needed; everything is controlled from the local pi CLI.

### Architecture

The pi CLI maintains all state locally in `~/.pi/pods.json`:

```json
{
  "pods": {
    "dc1": {
      "ssh": "ssh root@1.2.3.4",
      "gpus": [
        {"id": 0, "name": "H100", "memory": "80GB"},
        {"id": 1, "name": "H100", "memory": "80GB"}
      ],
      "models": {
        "qwen": {
          "model": "Qwen/Qwen2.5-72B",
          "port": 8001,
          "gpu": "0",
          "pid": 12345
        }
      }
    }
  },
  "active": "dc1"
}
```

The location of the pi config dir can also be specified via the `PI_CONFIG_DIR` env var, e.g. for testing.

Pods are assumed to be fully managed by pi; no other processes compete for ports or GPUs.

### Starting models

When the user runs `pi start Qwen/Qwen2.5-72B --name qwen`:
1. CLI determines the next available port (starting from 8001)
2. Selects a GPU (round-robin based on stored GPU info)
3. Downloads the model if not cached:
   - Sets `HF_HUB_ENABLE_HF_TRANSFER=1` for fast downloads
   - Runs via SSH with output piped to the local terminal
   - Ctrl+C cancels the download and returns control
4. Builds the vLLM command with appropriate args and PI_API_KEY
5. Executes via SSH: `ssh pod "nohup vllm serve ... > ~/.vllm_logs/qwen.log 2>&1 & echo $!"`
6. Waits for vLLM to be ready (checks the health endpoint)
7. On success: stores port, GPU, and PID in local state
8. On failure: shows the exact error from the vLLM logs and doesn't save to config

### Managing models

- **List**: Show models from local state, optionally verifying that PIDs are still running
- **Stop**: SSH to kill the process by PID
- **Logs**: SSH to `tail -f` log files (Ctrl+C stops tailing, doesn't kill vLLM)

### Error handling

- **SSH failures**: Prompt the user to check the connection or remove the pod from config
- **Stale state**: Commands that fail with "process not found" auto-clean local state
- **Setup failures**: Ctrl+C during setup kills the remote script and exits cleanly

### Testing models

The `pi prompt` command provides a quick way to test deployed models:

```bash
pi prompt qwen "What is 2+2?"                 # Simple prompt
pi prompt qwen "Read file.txt and summarize"  # Uses built-in tools
```

Built-in tools for agentic testing:
- `ls(path, ignore?)`: List files and directories at path, with optional ignore patterns
- `read(file_path, offset?, limit?)`: Read file contents with optional line offset/limit
- `glob(pattern, path?)`: Find files matching a glob pattern (e.g., "**/*.py", "src/**/*.ts")
- `rg(args)`: Run ripgrep with any arguments (e.g., "pattern -t py -C 3", "TODO --type-not test")

The provided prompt is augmented with info on the current local working directory. File tools expect absolute paths.

This allows testing basic agent capabilities without external tool configuration.

`prompt` is implemented using the latest OpenAI SDK for Node.js. It outputs thinking content, tool calls and results, and normal assistant messages.

## Models

We want to support the models below specifically, with alternative models marked as "possibly works". This list will be updated with new models regularly. A checked box means "supported".

See [models.md](./models.md) for the list of models, their hardware requirements, vLLM args, and notes, which we want to support out of the box with a simple `pi start <model-name> --name <local-name>`.

packages/pods/docs/qwen3-coder.md

# Qwen3-Coder Usage Guide

[Qwen3-Coder](https://github.com/QwenLM/Qwen3-Coder) is an advanced large language model created by the Qwen team at Alibaba Cloud. vLLM already supports Qwen3-Coder, and `tool-call` functionality is available in vLLM v0.10.0 and higher. You can install vLLM with `tool-call` support as follows:

## Installing vLLM

```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
```

## Launching Qwen3-Coder with vLLM

### Serving on 8x H200 (or H20) GPUs (141GB × 8)

**BF16 Model**

```bash
vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct \
--tensor-parallel-size 8 \
--max-model-len 32000 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
```

**FP8 Model**

```bash
VLLM_USE_DEEP_GEMM=1 vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
--max-model-len 131072 \
--enable-expert-parallel \
--data-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
```

## Performance Metrics

### Evaluation

We launched `Qwen3-Coder-480B-A35B-Instruct-FP8` using vLLM and evaluated its performance with [EvalPlus](https://github.com/evalplus/evalplus). The results are shown below:

| Dataset | Test Type | Pass@1 Score |
|-----------|-----------|--------------|
| HumanEval | Base tests | 0.939 |
| HumanEval+ | Base + extra tests | 0.902 |
| MBPP | Base tests | 0.918 |
| MBPP+ | Base + extra tests | 0.794 |

|
||||
We used the following script to benchmark `Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8`
|
||||
|
||||
```bash
|
||||
vllm bench serve \
|
||||
--backend vllm \
|
||||
--model Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
|
||||
--endpoint /v1/completions \
|
||||
--dataset-name random \
|
||||
--random-input 2048 \
|
||||
--random-output 1024 \
|
||||
--max-concurrency 10 \
|
||||
--num-prompt 100 \
|
||||
```
|
||||
If successful, you will see the following output.
|
||||
|
||||
```shell
|
||||
============ Serving Benchmark Result ============
|
||||
Successful requests: 100
|
||||
Benchmark duration (s): 776.49
|
||||
Total input tokens: 204169
|
||||
Total generated tokens: 102400
|
||||
Request throughput (req/s): 0.13
|
||||
Output token throughput (tok/s): 131.88
|
||||
Total Token throughput (tok/s): 394.81
|
||||
---------------Time to First Token----------------
|
||||
Mean TTFT (ms): 7639.31
|
||||
Median TTFT (ms): 6935.71
|
||||
P99 TTFT (ms): 13766.68
|
||||
-----Time per Output Token (excl. 1st token)------
|
||||
Mean TPOT (ms): 68.43
|
||||
Median TPOT (ms): 67.23
|
||||
P99 TPOT (ms): 72.14
|
||||
---------------Inter-token Latency----------------
|
||||
Mean ITL (ms): 68.43
|
||||
Median ITL (ms): 66.34
|
||||
P99 ITL (ms): 69.38
|
||||
==================================================
|
||||
|
||||
```
|
||||
|
||||
|
||||
## Usage Tips

### BF16 Models

- **Context Length Limitation**: A single H20 node cannot serve the original context length (262144). Reduce `max-model-len` or increase `gpu-memory-utilization` to work within memory constraints.

### FP8 Models

- **Context Length Limitation**: A single H20 node cannot serve the original context length (262144). Reduce `max-model-len` or increase `gpu-memory-utilization` to work within memory constraints.
- **DeepGEMM Usage**: To use [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM), set `VLLM_USE_DEEP_GEMM=1`. Follow the [setup instructions](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/deepgemm/README.md#setup) to install it.
- **Tensor Parallelism Issue**: When using `tensor-parallel-size 8`, the failure below is expected. Switch to data-parallel mode using `--data-parallel-size`.
- **Additional Resources**: Refer to the [Data Parallel Deployment documentation](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html) for more parallelism options.

```shell
ERROR [multiproc_executor.py:511] File "/vllm/vllm/model_executor/models/qwen3_moe.py", line 336, in <lambda>
ERROR [multiproc_executor.py:511] lambda prefix: Qwen3MoeDecoderLayer(config=config,
ERROR [multiproc_executor.py:511] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py:511] File "/vllm/vllm/model_executor/models/qwen3_moe.py", line 278, in __init__
ERROR [multiproc_executor.py:511] self.mlp = Qwen3MoeSparseMoeBlock(config=config,
ERROR [multiproc_executor.py:511] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py:511] File "/vllm/vllm/model_executor/models/qwen3_moe.py", line 113, in __init__
ERROR [multiproc_executor.py:511] self.experts = FusedMoE(num_experts=config.num_experts,
ERROR [multiproc_executor.py:511] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py:511] File "/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 773, in __init__
ERROR [multiproc_executor.py:511] self.quant_method.create_weights(layer=self, **moe_quant_params)
ERROR [multiproc_executor.py:511] File "/vllm/vllm/model_executor/layers/quantization/fp8.py", line 573, in create_weights
ERROR [multiproc_executor.py:511] raise ValueError(
ERROR [multiproc_executor.py:511] ValueError: The output_size of gate's and up's weight = 320 is not divisible by weight quantization block_n = 128.
```

### Tool Calling

- **Enable Tool Calls**: Add `--tool-call-parser qwen3_coder` to enable tool-call parsing. For details, see [tool_calling](https://docs.vllm.ai/en/latest/features/tool_calling.html).

## Roadmap

- [x] Add benchmark results

## Additional Resources

- [EvalPlus](https://github.com/evalplus/evalplus)
- [Qwen3-Coder](https://github.com/QwenLM/Qwen3-Coder)
- [vLLM Documentation](https://docs.vllm.ai/)