Initial monorepo setup with npm workspaces and dual TypeScript configuration

- Set up npm workspaces for three packages: pi-tui, pi-agent, and pi (pods)
- Implemented dual TypeScript configuration:
  - Root tsconfig.json with path mappings for development and type checking
  - Package-specific tsconfig.build.json for clean production builds
- Configured lockstep versioning with sync script for inter-package dependencies
- Added comprehensive documentation for development and publishing workflows
- All packages at version 0.5.0 ready for npm publishing
Mario Zechner 2025-08-09 17:18:38 +02:00
commit a74c5da112
63 changed files with 14558 additions and 0 deletions

packages/pods/README.md
# pi
Deploy and manage LLMs on GPU pods with automatic vLLM configuration for agentic workloads.
## Installation
```bash
npm install -g @mariozechner/pi
```
## What is pi?
`pi` simplifies running large language models on remote GPU pods. It automatically:
- Sets up vLLM on fresh Ubuntu pods
- Configures tool calling for agentic models (Qwen, GPT-OSS, GLM, etc.)
- Manages multiple models on the same pod with "smart" GPU allocation
- Provides OpenAI-compatible API endpoints for each model
- Includes an interactive agent with file system tools for testing
## Quick Start
```bash
# Set required environment variables
export HF_TOKEN=your_huggingface_token # Get from https://huggingface.co/settings/tokens
export PI_API_KEY=your_api_key # Any string you want for API authentication
# Setup a DataCrunch pod with NFS storage (models path auto-extracted)
pi pods setup dc1 "ssh root@1.2.3.4" \
--mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
# Start a model (automatic configuration for known models)
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen
# Send a single message to the model
pi agent qwen "What is the Fibonacci sequence?"
# Interactive chat mode with file system tools
pi agent qwen -i
# Use with any OpenAI-compatible client
export OPENAI_BASE_URL='http://1.2.3.4:8001/v1'
export OPENAI_API_KEY=$PI_API_KEY
```
## Prerequisites
- Node.js 18+
- HuggingFace token (for model downloads)
- GPU pod with:
- Ubuntu 22.04 or 24.04
- SSH root access
- NVIDIA drivers installed
- Persistent storage for models
## Supported Providers
### Primary Support
**DataCrunch** - Best for shared model storage
- NFS volumes sharable across multiple pods in same region
- Models download once, use everywhere
- Ideal for teams or multiple experiments
**RunPod** - Good persistent storage
- Network volumes persist independently
- Cannot share between running pods simultaneously
- Good for single-pod workflows
### Also Works With
- Vast.ai (volumes locked to specific machine)
- Prime Intellect (no persistent storage)
- AWS EC2 (with EFS setup)
- Any Ubuntu machine with NVIDIA GPUs, CUDA driver, and SSH
## Commands
### Pod Management
```bash
pi pods setup <name> "<ssh>" [options] # Setup new pod
--mount "<mount_command>" # Run mount command during setup
--models-path <path> # Override extracted path (optional)
--vllm release|nightly|gpt-oss # vLLM version (default: release)
pi pods # List all configured pods
pi pods active <name> # Switch active pod
pi pods remove <name> # Remove pod from local config
pi shell [<name>] # SSH into pod
pi ssh [<name>] "<command>" # Run command on pod
```
**Note**: When using `--mount`, the models path is automatically extracted from the mount command's target directory. You only need `--models-path` if not using `--mount` or to override the extracted path.
#### vLLM Version Options
- `release` (default): Stable vLLM release, recommended for most users
- `nightly`: Latest vLLM features, needed for newest models like GLM-4.5
- `gpt-oss`: Special build for OpenAI's GPT-OSS models only
### Model Management
```bash
pi start <model> --name <name> [options] # Start a model
--memory <percent> # GPU memory: 30%, 50%, 90% (default: 90%)
--context <size> # Context window: 4k, 8k, 16k, 32k, 64k, 128k
--gpus <count> # Number of GPUs to use (predefined models only)
--pod <name> # Target specific pod (overrides active)
--vllm <args...> # Pass custom args directly to vLLM
pi stop [<name>] # Stop model (or all if no name given)
pi list # List running models with status
pi logs <name> # Stream model logs (tail -f)
```
### Agent & Chat Interface
```bash
pi agent <name> "<message>" # Single message to model
pi agent <name> "<msg1>" "<msg2>" # Multiple messages in sequence
pi agent <name> -i # Interactive chat mode
pi agent <name> -i -c # Continue previous session
# Standalone OpenAI-compatible agent (works with any API)
pi-agent --base-url http://localhost:8000/v1 --model llama-3.1 "Hello"
pi-agent --api-key sk-... "What is 2+2?" # Uses OpenAI by default
pi-agent --json "What is 2+2?" # Output event stream as JSONL
pi-agent -i # Interactive mode
```
The agent includes tools for file operations (read, list, bash, glob, rg) to test agentic capabilities, particularly useful for code navigation and analysis tasks.
## Predefined Model Configurations
`pi` includes predefined configurations for popular agentic models, so you do not have to specify `--vllm` arguments manually. `pi` will also check if the model you selected can actually run on your pod with respect to the number of GPUs and available VRAM. Run `pi start` without additional arguments to see a list of predefined models that can run on the active pod.
### Qwen Models
```bash
# Qwen2.5-Coder-32B - Excellent coding model, fits on single H100/H200
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen
# Qwen3-Coder-30B - Advanced reasoning with tool use
pi start Qwen/Qwen3-Coder-30B-A3B-Instruct --name qwen3
# Qwen3-Coder-480B - State-of-the-art on 8xH200 (data-parallel mode)
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-480b
```
### GPT-OSS Models
```bash
# Requires special vLLM build during setup
pi pods setup gpt-pod "ssh root@1.2.3.4" --models-path /workspace --vllm gpt-oss
# GPT-OSS-20B - Fits on 16GB+ VRAM
pi start openai/gpt-oss-20b --name gpt20
# GPT-OSS-120B - Needs 60GB+ VRAM
pi start openai/gpt-oss-120b --name gpt120
```
### GLM Models
```bash
# GLM-4.5 - Requires 8-16 GPUs, includes thinking mode
pi start zai-org/GLM-4.5 --name glm
# GLM-4.5-Air - Smaller version, 1-2 GPUs
pi start zai-org/GLM-4.5-Air --name glm-air
```
### Custom Models with --vllm
For models not in the predefined list, use `--vllm` to pass arguments directly to vLLM:
```bash
# DeepSeek with custom settings
pi start deepseek-ai/DeepSeek-V3 --name deepseek --vllm \
--tensor-parallel-size 4 --trust-remote-code
# Mistral with pipeline parallelism
pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm \
--tensor-parallel-size 8 --pipeline-parallel-size 2
# Any model with specific tool parser
pi start some/model --name mymodel --vllm \
--tool-call-parser hermes --enable-auto-tool-choice
```
## DataCrunch Setup
DataCrunch offers the best experience with shared NFS storage across pods:
### 1. Create Shared Filesystem (SFS)
- Go to DataCrunch dashboard → Storage → Create SFS
- Choose size and datacenter
- Note the mount command (e.g., `sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/hf-models-fin02-8ac1bab7 /mnt/hf-models-fin02`)
### 2. Create GPU Instance
- Create instance in same datacenter as SFS
- Share the SFS with the instance
- Get SSH command from dashboard
### 3. Setup with pi
```bash
# Get mount command from DataCrunch dashboard
pi pods setup dc1 "ssh root@instance.datacrunch.io" \
--mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
# Models automatically stored in /mnt/hf-models (extracted from mount command)
```
### 4. Benefits
- Models persist across instance restarts
- Share models between multiple instances in same datacenter
- Download once, use everywhere
- Pay only for storage, not compute time during downloads
## RunPod Setup
RunPod offers good persistent storage with network volumes:
### 1. Create Network Volume (optional)
- Go to RunPod dashboard → Storage → Create Network Volume
- Choose size and region
### 2. Create GPU Pod
- Select "Network Volume" during pod creation (if using)
- Attach your volume to `/runpod-volume`
- Get SSH command from pod details
### 3. Setup with pi
```bash
# With network volume
pi pods setup runpod "ssh root@pod.runpod.io" --models-path /runpod-volume
# Or use workspace (persists with pod but not shareable)
pi pods setup runpod "ssh root@pod.runpod.io" --models-path /workspace
```
## Multi-GPU Support
### Automatic GPU Assignment
When running multiple models, pi automatically assigns them to different GPUs:
```bash
pi start model1 --name m1 # Auto-assigns to GPU 0
pi start model2 --name m2 # Auto-assigns to GPU 1
pi start model3 --name m3 # Auto-assigns to GPU 2
```
### Specify GPU Count for Predefined Models
For predefined models with multiple configurations, use `--gpus` to control GPU usage:
```bash
# Run Qwen on 1 GPU instead of all available
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen --gpus 1
# Run GLM-4.5 on 8 GPUs (if it has an 8-GPU config)
pi start zai-org/GLM-4.5 --name glm --gpus 8
```
If the model doesn't have a configuration for the requested GPU count, you'll see available options.
### Tensor Parallelism for Large Models
For models that don't fit on a single GPU:
```bash
# Use all available GPUs
pi start meta-llama/Llama-3.1-70B-Instruct --name llama70b --vllm \
--tensor-parallel-size 4
# Specific GPU count
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen480 --vllm \
--data-parallel-size 8 --enable-expert-parallel
```
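As a rough rule of thumb (not part of pi itself), the weight footprint alone gives a lower bound on how many GPUs a model needs; KV cache, activations, and CUDA context push real deployments higher, which is why a 70B BF16 model is often served with TP=4 even though its weights fit on two 80 GB cards:

```python
import math

def min_gpus_for_weights(n_params: float, bytes_per_param: int = 2, gpu_mem_gb: int = 80) -> int:
    """Lower bound on GPU count from weight storage alone.

    Ignores KV cache, activations, and CUDA context, which add
    significant headroom in practice.
    """
    weights_gb = n_params * bytes_per_param / 1e9
    return math.ceil(weights_gb / gpu_mem_gb)

# Llama-3.1-70B in BF16: 140 GB of weights -> at least 2x 80 GB GPUs.
# Qwen3-Coder-480B in FP8 on H200 (141 GB): 480 GB -> at least 4 GPUs.
```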
## API Integration
All models expose OpenAI-compatible endpoints:
```python
from openai import OpenAI
client = OpenAI(
base_url="http://your-pod-ip:8001/v1",
api_key="your-pi-api-key"
)
# Chat completion with tool calling
response = client.chat.completions.create(
model="Qwen/Qwen2.5-Coder-32B-Instruct",
messages=[
{"role": "user", "content": "Write a Python function to calculate fibonacci"}
],
tools=[{
"type": "function",
"function": {
"name": "execute_code",
"description": "Execute Python code",
"parameters": {
"type": "object",
"properties": {
"code": {"type": "string"}
},
"required": ["code"]
}
}
}],
tool_choice="auto"
)
```
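The snippet above only declares the tool; the client is still responsible for executing whatever tool calls come back and returning the results as `role: "tool"` messages. A minimal sketch of that dispatch step (the local `execute_code` implementation here is hypothetical, purely for illustration):

```python
import json

# Hypothetical local implementation of the `execute_code` tool declared above.
def execute_code(code: str) -> str:
    # A real agent would sandbox this; here we just illustrate dispatch.
    namespace: dict = {}
    exec(code, namespace)
    return str(namespace.get("result", "ok"))

TOOLS = {"execute_code": execute_code}

def handle_tool_call(name: str, arguments_json: str) -> str:
    """Dispatch one tool call returned by the model and produce the
    string content for the follow-up {"role": "tool"} message."""
    args = json.loads(arguments_json)
    return TOOLS[name](**args)
```

Each `tool_call` in `response.choices[0].message.tool_calls` carries `function.name` and `function.arguments` (a JSON string), which map directly onto the two parameters above.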
## Standalone Agent CLI
`pi` includes a standalone OpenAI-compatible agent that can work with any API:
```bash
# Install globally to get pi-agent command
npm install -g @mariozechner/pi
# Use with OpenAI
pi-agent --api-key sk-... "What is machine learning?"
# Use with local vLLM
pi-agent --base-url http://localhost:8000/v1 \
--model meta-llama/Llama-3.1-8B-Instruct \
--api-key dummy \
"Explain quantum computing"
# Interactive mode
pi-agent -i
# Continue previous session
pi-agent --continue "Follow up question"
# Custom system prompt
pi-agent --system-prompt "You are a Python expert" "Write a web scraper"
# Use responses API (for GPT-OSS models)
pi-agent --api responses --model openai/gpt-oss-20b "Hello"
```
The agent supports:
- Session persistence across conversations
- Interactive TUI mode with syntax highlighting
- File system tools (read, list, bash, glob, rg) for code navigation
- Both Chat Completions and Responses API formats
- Custom system prompts
## Tool Calling Support
`pi` automatically configures appropriate tool calling parsers for known models:
- **Qwen models**: `hermes` parser (Qwen3-Coder uses `qwen3_coder`)
- **GLM models**: `glm4_moe` parser with reasoning support
- **GPT-OSS models**: Uses `/v1/responses` endpoint, as tool calling (function calling in OpenAI parlance) is currently a [WIP with the `v1/chat/completions` endpoint](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#tool-use).
- **Custom models**: Specify with `--vllm --tool-call-parser <parser> --enable-auto-tool-choice`
To disable tool calling:
```bash
pi start model --name mymodel --vllm --disable-tool-call-parser
```
## Memory and Context Management
### GPU Memory Allocation
Controls how much GPU memory vLLM pre-allocates:
- `--memory 30%`: High concurrency, limited context
- `--memory 50%`: Balanced
- `--memory 90%`: Maximum context, lower concurrency (default)
### Context Window
Sets maximum input + output tokens:
- `--context 4k`: 4,096 tokens total
- `--context 32k`: 32,768 tokens total
- `--context 128k`: 131,072 tokens total
Example for coding workload:
```bash
# Large context for code analysis, moderate concurrency
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name coder \
--context 64k --memory 70%
```
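For reference, the size suffixes follow the usual 1k = 1,024 convention, matching the token counts listed above. A tiny helper makes the mapping explicit (illustrative only, not pi's actual parser):

```python
def context_tokens(size: str) -> int:
    """Convert a --context flag like "32k" into a token count (1k = 1024)."""
    if not size.endswith("k"):
        raise ValueError(f"expected a size like '32k', got {size!r}")
    return int(size[:-1]) * 1024

# "4k" -> 4096, "32k" -> 32768, "128k" -> 131072
```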
**Note**: When using `--vllm`, the `--memory`, `--context`, and `--gpus` parameters are ignored. You'll see a warning if you try to use them together.
## Session Persistence
The interactive agent mode (`-i`) saves sessions for each project directory:
```bash
# Start new session
pi agent qwen -i
# Continue previous session (maintains chat history)
pi agent qwen -i -c
```
Sessions are stored in `~/.pi/sessions/` organized by project path and include:
- Complete conversation history
- Tool call results
- Token usage statistics
## Architecture & Event System
The agent uses a unified event-based architecture where all interactions flow through `AgentEvent` types. This enables:
- Consistent UI rendering across console and TUI modes
- Session recording and replay
- Clean separation between API calls and UI updates
- JSON output mode for programmatic integration
Events are automatically converted to the appropriate API format (Chat Completions or Responses) based on the model type.
### JSON Output Mode
Use `--json` flag to output the event stream as JSONL (JSON Lines) for programmatic consumption:
```bash
pi-agent --api-key sk-... --json "What is 2+2?"
```
Each line is a complete JSON object representing an event:
```jsonl
{"type":"user_message","text":"What is 2+2?"}
{"type":"assistant_start"}
{"type":"assistant_message","text":"2 + 2 = 4"}
{"type":"token_usage","inputTokens":10,"outputTokens":5,"totalTokens":15,"cacheReadTokens":0,"cacheWriteTokens":0}
```
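Because each line is a standalone JSON object, downstream tooling can fold the stream without buffering the whole run. A sketch (event names taken from the sample above; the full event schema may include more types):

```python
import json

def summarize_events(jsonl: str) -> dict:
    """Fold a pi-agent --json event stream into the assistant text
    and the total token usage."""
    text, total = [], 0
    for line in jsonl.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event["type"] == "assistant_message":
            text.append(event["text"])
        elif event["type"] == "token_usage":
            total += event["totalTokens"]
    return {"text": "".join(text), "totalTokens": total}
```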
## Troubleshooting
### OOM (Out of Memory) Errors
- Reduce `--memory` percentage
- Use smaller model or quantized version (FP8)
- Reduce `--context` size
### Model Won't Start
```bash
# Check GPU usage
pi ssh "nvidia-smi"
# Check if port is in use
pi list
# Force stop all models
pi stop
```
### Tool Calling Issues
- Not all models support tool calling reliably
- Try different parser: `--vllm --tool-call-parser mistral`
- Or disable: `--vllm --disable-tool-call-parser`
### Access Denied for Models
Some models (Llama, Mistral) require HuggingFace access approval. Visit the model page and click "Request access".
### vLLM Build Issues
If using `--vllm nightly` fails, try:
- Use `--vllm release` for stable version
- Check CUDA compatibility with `pi ssh "nvidia-smi"`
### Agent Not Finding Messages
If the agent shows configuration instead of your message, ensure quotes around messages with special characters:
```bash
# Good
pi agent qwen "What is this file about?"
# Bad (shell might interpret special chars)
pi agent qwen What is this file about?
```
## Advanced Usage
### Working with Multiple Pods
```bash
# Override active pod for any command
pi start model --name test --pod dev-pod
pi list --pod prod-pod
pi stop test --pod dev-pod
```
### Custom vLLM Arguments
```bash
# Pass any vLLM argument after --vllm
pi start model --name custom --vllm \
--quantization awq \
--enable-prefix-caching \
--max-num-seqs 256 \
--gpu-memory-utilization 0.95
```
### Monitoring
```bash
# Watch GPU utilization
pi ssh "watch -n 1 nvidia-smi"
# Check model downloads
pi ssh "du -sh ~/.cache/huggingface/hub/*"
# View all logs
pi ssh "ls -la ~/.vllm_logs/"
# Check agent session history
ls -la ~/.pi/sessions/
```
## Environment Variables
- `HF_TOKEN` - HuggingFace token for model downloads
- `PI_API_KEY` - API key for vLLM endpoints
- `PI_CONFIG_DIR` - Config directory (default: `~/.pi`)
- `OPENAI_API_KEY` - Used by `pi-agent` when no `--api-key` provided
## License
MIT

# GLM-4.5
[Read in Chinese](./README_zh.md)
<div align="center">
<img src=resources/logo.svg width="15%"/>
</div>
<p align="center">
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> or <a href="https://discord.gg/QR7SARHRxK" target="_blank">Discord</a> community.
<br>
📖 Check out the GLM-4.5 <a href="https://z.ai/blog/glm-4.5" target="_blank">technical blog</a>.
<br>
📍 Use GLM-4.5 API services on <a href="https://docs.z.ai/guides/llm/glm-4.5">Z.ai API Platform (Global)</a> or <br> <a href="https://docs.bigmodel.cn/cn/guide/models/text/glm-4.5">Zhipu AI Open Platform (Mainland China)</a>.
<br>
👉 One click to <a href="https://chat.z.ai">GLM-4.5</a>.
</p>
## Model Introduction
The **GLM-4.5** series models are foundation models designed for intelligent agents. GLM-4.5 has **355** billion total
parameters with **32** billion active parameters, while GLM-4.5-Air adopts a more compact design with **106** billion
total parameters and **12** billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent
capabilities to meet the complex demands of intelligent agent applications.
Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and
tool usage, and non-thinking mode for immediate responses.
We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both
GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for
secondary development.
As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional
performance with a score of **63.2**, ranking **3rd** among all proprietary and open-source models. Notably,
GLM-4.5-Air delivers competitive results at **59.8** while maintaining superior efficiency.
![bench](resources/bench.png)
For more eval results, showcases, and technical details, please visit
our [technical blog](https://z.ai/blog/glm-4.5). The technical report will be released soon.
The model code, tool parser and reasoning parser can be found in the implementation
of [transformers](https://github.com/huggingface/transformers/tree/main/src/transformers/models/glm4_moe), [vLLM](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/glm4_moe_mtp.py)
and [SGLang](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/glm4_moe.py).
## Model Downloads
You can directly experience the model on [Hugging Face](https://huggingface.co/spaces/zai-org/GLM-4.5-Space)
or [ModelScope](https://modelscope.cn/studios/ZhipuAI/GLM-4.5-Demo) or download the model by following the links below.
| Model | Download Links | Model Size | Precision |
|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|------------|-----------|
| GLM-4.5 | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5) | 355B-A32B | BF16 |
| GLM-4.5-Air | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Air)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Air) | 106B-A12B | BF16 |
| GLM-4.5-FP8 | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-FP8)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-FP8) | 355B-A32B | FP8 |
| GLM-4.5-Air-FP8 | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Air-FP8)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Air-FP8) | 106B-A12B | FP8 |
| GLM-4.5-Base | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Base)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Base) | 355B-A32B | BF16 |
| GLM-4.5-Air-Base | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Air-Base)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Air-Base) | 106B-A12B | BF16 |
## System Requirements
### Inference
We provide minimum and recommended configurations for "full-featured" model inference. The data in the table below is
based on the following conditions:
1. All models use MTP layers and specify
`--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4` to ensure competitive
inference speed.
2. The `cpu-offload` parameter is not used.
3. Inference batch size does not exceed `8`.
4. All are executed on devices that natively support FP8 inference, ensuring both weights and cache are in FP8 format.
5. Server memory must exceed `1T` to ensure normal model loading and operation.
The models can run under the configurations in the table below:
| Model | Precision | GPU Type and Count | Test Framework |
|-------------|-----------|----------------------|----------------|
| GLM-4.5 | BF16 | H100 x 16 / H200 x 8 | sglang |
| GLM-4.5 | FP8 | H100 x 8 / H200 x 4 | sglang |
| GLM-4.5-Air | BF16 | H100 x 4 / H200 x 2 | sglang |
| GLM-4.5-Air | FP8 | H100 x 2 / H200 x 1 | sglang |
Under the configurations in the table below, the models can utilize their full 128K context length:
| Model | Precision | GPU Type and Count | Test Framework |
|-------------|-----------|-----------------------|----------------|
| GLM-4.5 | BF16 | H100 x 32 / H200 x 16 | sglang |
| GLM-4.5 | FP8 | H100 x 16 / H200 x 8 | sglang |
| GLM-4.5-Air | BF16 | H100 x 8 / H200 x 4 | sglang |
| GLM-4.5-Air | FP8 | H100 x 4 / H200 x 2 | sglang |
### Fine-tuning
The code can run under the configurations in the table below
using [Llama Factory](https://github.com/hiyouga/LLaMA-Factory):
| Model | GPU Type and Count | Strategy | Batch Size (per GPU) |
|-------------|--------------------|----------|----------------------|
| GLM-4.5 | H100 x 16 | Lora | 1 |
| GLM-4.5-Air | H100 x 4 | Lora | 1 |
The code can run under the configurations in the table below using [Swift](https://github.com/modelscope/ms-swift):
| Model | GPU Type and Count | Strategy | Batch Size (per GPU) |
|-------------|--------------------|----------|----------------------|
| GLM-4.5 | H20 (96GiB) x 16 | Lora | 1 |
| GLM-4.5-Air | H20 (96GiB) x 4 | Lora | 1 |
| GLM-4.5 | H20 (96GiB) x 128 | SFT | 1 |
| GLM-4.5-Air | H20 (96GiB) x 32 | SFT | 1 |
| GLM-4.5 | H20 (96GiB) x 128 | RL | 1 |
| GLM-4.5-Air | H20 (96GiB) x 32 | RL | 1 |
## Quick Start
Please install the required packages according to `requirements.txt`.
```shell
pip install -r requirements.txt
```
### transformers
Please refer to the `trans_infer_cli.py` code in the `inference` folder.
### vLLM
+ Both the BF16 and FP8 models can be started with the following command:
```shell
vllm serve zai-org/GLM-4.5-Air \
--tensor-parallel-size 8 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.5-air
```
If you're using 8x H100 GPUs and run out of memory when running the GLM-4.5 model, you'll need to add
`--cpu-offload-gb 16` (only applicable to vLLM).
If you encounter FlashInfer issues, set `VLLM_ATTENTION_BACKEND=XFORMERS` as a temporary workaround. You can also
specify `TORCH_CUDA_ARCH_LIST='9.0+PTX'` to use FlashInfer (different GPUs require different `TORCH_CUDA_ARCH_LIST`
values; check accordingly).
### SGLang
+ BF16
```shell
python3 -m sglang.launch_server \
--model-path zai-org/GLM-4.5-Air \
--tp-size 8 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.7 \
--served-model-name glm-4.5-air \
--host 0.0.0.0 \
--port 8000
```
+ FP8
```shell
python3 -m sglang.launch_server \
--model-path zai-org/GLM-4.5-Air-FP8 \
--tp-size 4 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.7 \
--disable-shared-experts-fusion \
--served-model-name glm-4.5-air-fp8 \
--host 0.0.0.0 \
--port 8000
```
### Request Parameter Instructions
+ When using `vLLM` and `SGLang`, thinking mode is enabled by default when sending requests. To disable thinking,
  add the `extra_body={"chat_template_kwargs": {"enable_thinking": False}}` parameter.
+ Both support tool calling. Please use OpenAI-style tool description format for calls.
+ For specific code, please refer to `api_request.py` in the `inference` folder.
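With the OpenAI Python SDK, the thinking toggle fits into the request like this (a sketch; the `glm-4.5-air` model name assumes the `--served-model-name` used in the serve commands above):

```python
def chat_request(messages: list, thinking: bool = True) -> dict:
    """Keyword arguments for client.chat.completions.create() against a
    vLLM/SGLang GLM-4.5 server. Thinking mode is on by default; it is
    switched off via chat_template_kwargs in extra_body."""
    kwargs = {"model": "glm-4.5-air", "messages": messages}
    if not thinking:
        kwargs["extra_body"] = {"chat_template_kwargs": {"enable_thinking": False}}
    return kwargs

# client.chat.completions.create(**chat_request(msgs, thinking=False))
```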

## `gpt-oss` vLLM Usage Guide
`gpt-oss-20b` and `gpt-oss-120b` are powerful reasoning models open-sourced by OpenAI.
In vLLM, you can run them on NVIDIA H100, H200, and B200, as well as AMD MI300x, MI325x, MI355x, and Radeon AI PRO R9700.
We are actively working on ensuring these models work on Ampere, Ada Lovelace, and the RTX 5090.
Specifically, vLLM optimizes for the `gpt-oss` family of models with:
* **Flexible parallelism options**: the model can be sharded across 2, 4, 8 GPUs, scaling throughput.
* **High performance attention and MoE kernels**: attention kernel is specifically optimized for the attention sinks mechanism and sliding window shapes.
* **Asynchronous scheduling**: optimizing for maximum utilization and high throughput by overlapping CPU operations with GPU operations.
This is a living document and we welcome contributions, corrections, and creation of new recipes!
## Quickstart
### Installation
We highly recommend using a fresh virtual environment, as this first release iteration requires cutting-edge kernels from various dependencies that might not work with other models. In particular, we will be installing: a prerelease version of vLLM, PyTorch nightly, Triton nightly, a FlashInfer prerelease, a HuggingFace prerelease, Harmony, and the gpt-oss library tools.
```
uv venv
source .venv/bin/activate
uv pip install --pre vllm==0.10.1+gptoss \
--extra-index-url https://wheels.vllm.ai/gpt-oss/ \
--extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
--index-strategy unsafe-best-match
```
We also provide a Docker container with all the dependencies built in:
```
docker run --gpus all \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:gptoss \
--model openai/gpt-oss-20b
```
### H100 & H200
You can serve the model with its default parameters:
* `--async-scheduling` can be enabled for higher performance. Currently it is not compatible with structured output.
* We recommend TP=2 for H100 and H200 as the best performance tradeoff point.
```
# openai/gpt-oss-20b runs on a single GPU
vllm serve openai/gpt-oss-20b --async-scheduling
# gpt-oss-120b will fit in a single H100/H200, but scaling it to higher TP sizes can help with throughput
vllm serve openai/gpt-oss-120b --async-scheduling
vllm serve openai/gpt-oss-120b --tensor-parallel-size 2 --async-scheduling
vllm serve openai/gpt-oss-120b --tensor-parallel-size 4 --async-scheduling
```
### B200
NVIDIA Blackwell requires installing the FlashInfer library and setting several environment variables to enable the necessary kernels. We recommend TP=1 as a starting point for a performant option. We are actively working on the performance of vLLM on Blackwell.
```
# All 3 of these are required
export VLLM_USE_TRTLLM_ATTENTION=1
export VLLM_USE_TRTLLM_DECODE_ATTENTION=1
export VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1
# Pick only one out of the two.
# mxfp8 activation for MoE. faster, but higher risk for accuracy.
export VLLM_USE_FLASHINFER_MXFP4_MOE=1
# bf16 activation for MoE. matching reference precision.
export VLLM_USE_FLASHINFER_MXFP4_BF16_MOE=1
# openai/gpt-oss-20b
vllm serve openai/gpt-oss-20b --async-scheduling
# gpt-oss-120b
vllm serve openai/gpt-oss-120b --async-scheduling
vllm serve openai/gpt-oss-120b --tensor-parallel-size 2 --async-scheduling
vllm serve openai/gpt-oss-120b --tensor-parallel-size 4 --async-scheduling
```
### AMD
ROCm supports the OpenAI gpt-oss-120b and gpt-oss-20b models on three different GPU families on day one, with pre-built Docker containers:
* gfx950: MI350x series, `rocm/vllm-dev:open-mi355-08052025`
* gfx942: MI300x/MI325 series, `rocm/vllm-dev:open-mi300-08052025`
* gfx1201: Radeon AI PRO R9700, `rocm/vllm-dev:open-r9700-08052025`
To run the container:
```
alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 32G -v /data:/data -v $HOME:/myhome -w /myhome'
drun rocm/vllm-dev:open-mi300-08052025
```
For MI300x and R9700:
```
export VLLM_ROCM_USE_AITER=1
export VLLM_USE_AITER_UNIFIED_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0
vllm serve openai/gpt-oss-120b --compilation-config '{"full_cuda_graph": true}'
```
For MI355x:
```
# MoE preshuffle, fusion and Triton GEMM flags
export VLLM_USE_AITER_TRITON_FUSED_SPLIT_QKV_ROPE=1
export VLLM_USE_AITER_TRITON_FUSED_ADD_RMSNORM_PAD=1
export VLLM_USE_AITER_TRITON_GEMM=1
export VLLM_ROCM_USE_AITER=1
export VLLM_USE_AITER_UNIFIED_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0
export TRITON_HIP_PRESHUFFLE_SCALES=1
vllm serve openai/gpt-oss-120b --compilation-config '{"compile_sizes": [1, 2, 4, 8, 16, 24, 32, 64, 128, 256, 4096, 8192], "full_cuda_graph": true}' --block-size 64
```
## Usage
Once `vllm serve` is running and `INFO: Application startup complete` has been displayed, you can send requests via HTTP or the OpenAI SDK to the following endpoints:
* `/v1/responses` can perform tool use (browsing, python, mcp) in between chain-of-thought and deliver a final response. This endpoint leverages the `openai-harmony` library for input rendering and output parsing. Stateful operation and the full streaming API are work in progress. OpenAI recommends the Responses API as the way to interact with this model.
* `/v1/chat/completions` offers a familiar interface to this model. No tools will be invoked, but reasoning and final text output are returned structurally. Function calling is work in progress. You can also set `include_reasoning: false` in the request to omit the CoT from the output.
* `/v1/completions` is a simple input/output interface without any template rendering.
All endpoints accept `stream: true` to enable incremental token streaming. Please note that vLLM currently does not cover the full scope of the Responses API; for details, see the Known Limitations section below.
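For example, a raw `/v1/chat/completions` request body that skips the CoT and enables streaming might look like this (a sketch; model name and flags as described in the text above):

```python
import json

def completions_payload(prompt: str, include_reasoning: bool = True, stream: bool = False) -> bytes:
    """JSON body for POST /v1/chat/completions on a gpt-oss vLLM server.
    include_reasoning=False drops the chain-of-thought from the response;
    stream=True enables incremental token streaming."""
    return json.dumps({
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": prompt}],
        "include_reasoning": include_reasoning,
        "stream": stream,
    }).encode()
```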
### Tool Use
One premier feature of gpt-oss is the ability to call tools directly, called "built-in tools". In vLLM, we offer several options:
* By default, we integrate with the reference library's browser (with `ExaBackend`) and a demo Python interpreter run via Docker container. To use the search backend, you need access to [exa.ai](http://exa.ai) and must set `EXA_API_KEY=` as an environment variable. For Python, either have Docker available, or set `PYTHON_EXECUTION_BACKEND=UV` to (dangerously) execute model-generated code snippets directly on the same machine.
```
uv pip install gpt-oss
vllm serve ... --tool-server demo
```
* Please note that the default options are for demo purposes only. For production usage, vLLM itself can act as an MCP client to multiple services.
Here is an [example tool server](https://github.com/openai/gpt-oss/tree/main/gpt-oss-mcp-server) that vLLM can work with; it wraps the demo tools:
```
mcp run -t sse browser_server.py:mcp
mcp run -t sse python_server.py:mcp
vllm serve ... --tool-server ip-1:port-1,ip-2:port-2
```
The URLs are expected to be MCP SSE servers that implement `instructions` in their server info and expose well-documented tools. The tools are injected into the system prompt to enable them for the model.
## Accuracy Evaluation Panels
OpenAI recommends using the gpt-oss reference library to perform evaluation. For example,
```
python -m gpt_oss.evals --model 120b-low --eval gpqa --n-threads 128
python -m gpt_oss.evals --model 120b --eval gpqa --n-threads 128
python -m gpt_oss.evals --model 120b-high --eval gpqa --n-threads 128
```
To eval on AIME2025, change `gpqa` to `aime25`.
With vLLM deployed:
```
# Example deployment on 8xH100
vllm serve openai/gpt-oss-120b \
--tensor_parallel_size 8 \
--max-model-len 131072 \
--max-num-batched-tokens 10240 \
--max-num-seqs 128 \
--gpu-memory-utilization 0.85 \
--no-enable-prefix-caching
```
Here are the scores we were able to reproduce without tool use, and we encourage you to try reproducing them as well!
We've observed that the numbers may vary slightly across runs, so feel free to run the evaluation multiple times to get a sense of the variance.
For a quick correctness check, we recommend starting with the low reasoning effort setting (`120b-low`), which should complete within minutes.
Model: 120B
| Reasoning Effort | GPQA | AIME25 |
| :---- | :---- | :---- |
| Low | 65.3 | 51.2 |
| Mid | 72.4 | 79.6 |
| High | 79.4 | 93.0 |
Model: 20B
| Reasoning Effort | GPQA | AIME25 |
| :---- | :---- | :---- |
| Low | 56.8 | 38.8 |
| Mid | 67.5 | 75.0 |
| High | 70.9 | 85.8 |
## Known Limitations
* On H100 with tensor parallel size 1, the default GPU memory utilization and max batched tokens will cause CUDA out-of-memory errors. When running TP1, increase the GPU memory utilization or lower the max batched tokens:
```
vllm serve openai/gpt-oss-120b --gpu-memory-utilization 0.95 --max-num-batched-tokens 1024
```
* When running TP2 on H100, set the GPU memory utilization below 0.95, as 0.95 will also cause OOM.
* The Responses API has several limitations at the moment; we strongly welcome contributions to and maintenance of this service in vLLM:
  * Usage accounting is currently broken and returns all zeros.
  * Annotations (citing URLs from search results) are not supported.
  * Truncation by `max_tokens` might not preserve partial chunks.
  * Streaming is fairly barebones at the moment; for example:
    * Item IDs and indexing need more work.
    * Tool invocations and outputs are not properly streamed, but batched.
    * Proper error handling is missing.
## Troubleshooting
- Attention sink dtype error on Blackwell:
```
ERROR 08-05 07:31:10 [multiproc_executor.py:559] assert sinks.dtype == torch.float32, "Sinks must be of type float32"
**(VllmWorker TP0 pid=174579)** ERROR 08-05 07:31:10 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**(VllmWorker TP0 pid=174579)** ERROR 08-05 07:31:10 [multiproc_executor.py:559] AssertionError: Sinks must be of type float32
```
**Solution: Refer to the Blackwell section to check whether the related environment variables are set.**
- Triton issue related to `tl.language` not defined:
**Solution: Make sure no other Triton is installed in your environment (pytorch-triton, etc.).**
---
# Implementation Plan
## Core Principles
- TypeScript throughout
- Clean, minimal code
- Self-contained modules
- Direct SSH execution (no remote manager)
- All state in local JSON
## Package 1: Pod Setup Script Generation
Generate and execute pod_setup.sh via SSH
- [ ] `src/setup/generate-setup-script.ts` - Generate bash script as string
- [ ] Detect CUDA driver version
- [ ] Determine CUDA toolkit version needed
- [ ] Generate uv/Python install commands
- [ ] Generate venv creation commands
- [ ] Generate pip install commands (torch, vLLM, etc.)
- [ ] Handle model-specific vLLM versions (e.g., gpt-oss needs 0.10.1+gptoss)
- [ ] Generate mount commands if --mount provided
- [ ] Generate env var setup (HF_TOKEN, PI_API_KEY)
- [ ] `src/setup/detect-hardware.ts` - Run nvidia-smi and parse GPU info
- [ ] Execute nvidia-smi via SSH
- [ ] Parse GPU count, names, memory
- [ ] Return structured GPU info
- [ ] `src/setup/execute-setup.ts` - Main setup orchestrator
- [ ] Generate setup script
- [ ] Copy and execute via SSH
- [ ] Stream output to console
- [ ] Handle Ctrl+C properly
- [ ] Save GPU info to local config
## Package 2: Config Management
Local JSON state management
- [ ] `src/config/types.ts` - TypeScript interfaces
- [ ] Pod interface (ssh, gpus, models, mount)
- [ ] Model interface (model, port, gpu, pid)
- [ ] GPU interface (id, name, memory)
- [ ] `src/config/store.ts` - Read/write ~/.pi/pods.json
- [ ] Load config (handle missing file)
- [ ] Save config (atomic write)
- [ ] Get active pod
- [ ] Add/remove pods
- [ ] Update model state
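The config-management package above can be sketched as follows. The type shapes mirror the planned `~/.pi/pods.json` layout; the function names and the temp-file-then-rename atomic-write strategy are illustrative assumptions, not the final implementation:

```typescript
import { existsSync, mkdirSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Shapes mirroring the planned ~/.pi/pods.json layout.
interface GpuInfo { id: number; name: string; memory: string; }
interface ModelState { model: string; port: number; gpu: string; pid: number; }
interface Pod { ssh: string; gpus: GpuInfo[]; models: Record<string, ModelState>; mount?: string; }
interface PodsConfig { pods: Record<string, Pod>; active?: string; }

// A missing file yields an empty config rather than an error.
function loadConfig(dir: string): PodsConfig {
  const file = join(dir, "pods.json");
  if (!existsSync(file)) return { pods: {} };
  return JSON.parse(readFileSync(file, "utf8")) as PodsConfig;
}

// Atomic write: write to a temp file, then rename over the target,
// so a crash mid-write never leaves a truncated config behind.
function saveConfig(dir: string, config: PodsConfig): void {
  mkdirSync(dir, { recursive: true });
  const file = join(dir, "pods.json");
  const tmp = `${file}.tmp`;
  writeFileSync(tmp, JSON.stringify(config, null, 2));
  renameSync(tmp, file);
}
```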
## Package 3: SSH Executor
Clean SSH command execution
- [ ] `src/ssh/executor.ts` - SSH command wrapper
- [ ] Execute command with streaming output
- [ ] Execute command with captured output
- [ ] Handle SSH errors gracefully
- [ ] Support Ctrl+C propagation
- [ ] Support background processes (nohup)
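A minimal sketch of the SSH executor's captured-output path, assuming the wrapper simply shells out to the local `ssh` binary (helper names are illustrative):

```typescript
import { execFile } from "node:child_process";

// Run a command and capture stdout; rejects on non-zero exit.
function run(cmd: string, args: string[]): Promise<string> {
  return new Promise((resolve, reject) => {
    execFile(cmd, args, (err, stdout) => (err ? reject(err) : resolve(stdout.trim())));
  });
}

// An SSH wrapper is then just `run("ssh", [...target, command])`.
// Background processes would use `nohup ... & echo $!` on the remote
// side so the remote PID comes back on stdout.
function sshCapture(sshTarget: string[], remoteCommand: string): Promise<string> {
  return run("ssh", [...sshTarget, remoteCommand]);
}
```

Streaming output would use `spawn` with inherited stdio instead of `execFile`, and Ctrl+C propagation means forwarding `SIGINT` to the child process.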
## Package 4: Pod Commands
Pod management CLI commands
- [ ] `src/commands/pods-setup.ts` - pi pods setup
- [ ] Parse args (name, ssh, mount)
- [ ] Check env vars (HF_TOKEN, PI_API_KEY)
- [ ] Call setup executor
- [ ] Save pod to config
- [ ] `src/commands/pods-list.ts` - pi pods
- [ ] Load config
- [ ] Display all pods with active marker
- [ ] `src/commands/pods-active.ts` - pi pods active
- [ ] Switch active pod
- [ ] Update config
- [ ] `src/commands/pods-remove.ts` - pi pods remove
- [ ] Remove from config (not remote)
## Package 5: Model Management
Model lifecycle management
- [ ] `src/models/model-config.ts` - Known model configurations
- [ ] Load models.md data structure
- [ ] Match hardware to vLLM args
- [ ] Get model-specific env vars
- [ ] `src/models/download.ts` - Model download via HF
- [ ] Check if model cached
- [ ] Run huggingface-cli download
- [ ] Stream progress to console
- [ ] Handle Ctrl+C
- [ ] `src/models/vllm-builder.ts` - Build vLLM command
- [ ] Get base command for model
- [ ] Add hardware-specific args
- [ ] Add user --vllm args
- [ ] Add port and API key
## Package 6: Model Commands
Model management CLI commands
- [ ] `src/commands/start.ts` - pi start
- [ ] Parse model and args
- [ ] Find next available port
- [ ] Select GPU (round-robin)
- [ ] Download if needed
- [ ] Build and execute vLLM command
- [ ] Wait for health check
- [ ] Update config on success
- [ ] `src/commands/stop.ts` - pi stop
- [ ] Find model in config
- [ ] Kill process via PID
- [ ] Clean up config
- [ ] `src/commands/list.ts` - pi list
- [ ] Show models from config
- [ ] Optionally verify PIDs
- [ ] `src/commands/logs.ts` - pi logs
- [ ] Tail log file via SSH
- [ ] Handle Ctrl+C (stop tailing only)
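The port and GPU selection logic in `pi start` can be sketched like this; the starting port of 8001 matches the plan below, while the "fewest models wins" tiebreak is an assumption about how round-robin is realized:

```typescript
// Next free vLLM port, starting from 8001.
function nextPort(models: { port: number }[]): number {
  const used = new Set(models.map((m) => m.port));
  let port = 8001;
  while (used.has(port)) port++;
  return port;
}

// Round-robin GPU selection: pick the GPU with the fewest models assigned.
function pickGpu(gpuCount: number, models: { gpu: string }[]): string {
  const load = new Array(gpuCount).fill(0);
  for (const m of models) load[Number(m.gpu)] += 1;
  let best = 0;
  for (let i = 1; i < gpuCount; i++) if (load[i] < load[best]) best = i;
  return String(best);
}
```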
## Package 7: Model Testing
Quick model testing with tools
- [ ] `src/prompt/tools.ts` - Tool definitions
- [ ] Define ls, read, glob, rg tools
- [ ] Format for OpenAI API
- [ ] `src/prompt/client.ts` - OpenAI client wrapper
- [ ] Create client for model endpoint
- [ ] Handle streaming responses
- [ ] Display thinking, tools, content
- [ ] `src/commands/prompt.ts` - pi prompt
- [ ] Get model endpoint from config
- [ ] Augment prompt with CWD info
- [ ] Send request with tools
- [ ] Display formatted response
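As an example of what one tool definition from Package 7 might look like in OpenAI function-calling format (the exact JSON schema is a sketch; only the tool names come from the plan):

```typescript
// One built-in tool declared in OpenAI function-calling format.
const lsTool = {
  type: "function" as const,
  function: {
    name: "ls",
    description: "List files and directories at a path",
    parameters: {
      type: "object",
      properties: {
        path: { type: "string", description: "Absolute path to list" },
        ignore: { type: "array", items: { type: "string" }, description: "Glob patterns to ignore" },
      },
      required: ["path"],
    },
  },
};
```

The `read`, `glob`, and `rg` tools would follow the same shape, and the full array is passed as the `tools` parameter of a chat completion request.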
## Package 8: CLI Entry Point
Main CLI with commander.js
- [ ] `src/cli.ts` - Main entry point
- [ ] Setup commander program
- [ ] Register all commands
- [ ] Handle global options (--pod override)
- [ ] Error handling
- [ ] `src/index.ts` - Package exports
## Testing Strategy
- [ ] Test pod_setup.sh generation locally
- [ ] Test on local machine with GPU
- [ ] Test SSH executor with mock commands
- [ ] Test config management with temp files
- [ ] Integration test on real pod
## Dependencies
```json
{
"dependencies": {
"commander": "^12.0.0",
"@commander-js/extra-typings": "^12.0.0",
"openai": "^4.0.0",
"chalk": "^5.0.0",
"ora": "^8.0.0"
},
"devDependencies": {
"@types/node": "^22.0.0",
"typescript": "^5.0.0",
"tsx": "^4.0.0"
}
}
```
## Build & Distribution
- [ ] TypeScript config for Node.js target
- [ ] Build to dist/
- [ ] npm package with bin entry
- [ ] npx support
---
# Kimi-K2 Deployment Guide
> [!Note]
> This guide provides example deployment commands for Kimi-K2, which may not be the optimal configuration. Since inference engines are updated frequently, please follow the guidance on their homepages if you want better inference performance.
## vLLM Deployment
vLLM version v0.10.0rc1 or later is required.
The smallest deployment unit for Kimi-K2 FP8 weights with 128k sequence length on mainstream H200 or H20 platforms is a cluster of 16 GPUs using either Tensor Parallelism (TP) or "data parallel + expert parallel" (DP+EP).
Running parameters for this environment are provided below. You may scale up to more nodes and increase expert-parallelism to enlarge the inference batch size and overall throughput.
### Tensor Parallelism
When the parallelism degree ≤ 16, you can run inference with pure Tensor Parallelism. A sample launch command is:
``` bash
# start ray on node 0 and node 1
# node 0:
vllm serve $MODEL_PATH \
--port 8000 \
--served-model-name kimi-k2 \
--trust-remote-code \
--tensor-parallel-size 16 \
--enable-auto-tool-choice \
--tool-call-parser kimi_k2
```
**Key parameter notes:**
- `--tensor-parallel-size 16`: If using more than 16 GPUs, combine with pipeline-parallelism.
- `--enable-auto-tool-choice`: Required when enabling tool usage.
- `--tool-call-parser kimi_k2`: Required when enabling tool usage.
### Data Parallelism + Expert Parallelism
You can install libraries like DeepEP and DeepGEMM as needed. Then run the command (example on H200):
``` bash
# node 0
vllm serve $MODEL_PATH --port 8000 --served-model-name kimi-k2 --trust-remote-code --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address $MASTER_IP --data-parallel-rpc-port $PORT --enable-expert-parallel --max-num-batched-tokens 8192 --max-num-seqs 256 --gpu-memory-utilization 0.85 --enable-auto-tool-choice --tool-call-parser kimi_k2
# node 1
vllm serve $MODEL_PATH --headless --data-parallel-start-rank 8 --port 8000 --served-model-name kimi-k2 --trust-remote-code --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address $MASTER_IP --data-parallel-rpc-port $PORT --enable-expert-parallel --max-num-batched-tokens 8192 --max-num-seqs 256 --gpu-memory-utilization 0.85 --enable-auto-tool-choice --tool-call-parser kimi_k2
```
## SGLang Deployment
Similarly, we can use TP or DP+EP in SGLang for deployment. Here are the examples.
### Tensor Parallelism
Here is a simple example to run TP16 across two H200 nodes:
``` bash
# Node 0
python -m sglang.launch_server --model-path $MODEL_PATH --tp 16 --dist-init-addr $MASTER_IP:50000 --nnodes 2 --node-rank 0 --trust-remote-code --tool-call-parser kimi_k2
# Node 1
python -m sglang.launch_server --model-path $MODEL_PATH --tp 16 --dist-init-addr $MASTER_IP:50000 --nnodes 2 --node-rank 1 --trust-remote-code --tool-call-parser kimi_k2
```
**Key parameter notes:**
- `--tool-call-parser kimi_k2`: Required when enabling tool usage.
### Data Parallelism + Expert Parallelism
Here is an example of large-scale Prefill-Decode Disaggregation (4 prefill + 12 decode H200 nodes, "4P12D") with DP+EP in SGLang:
``` bash
# for prefill node
MC_TE_METRIC=true SGLANG_DISAGGREGATION_HEARTBEAT_INTERVAL=10000000 SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 PYTHONUNBUFFERED=1 \
python -m sglang.launch_server --model-path $MODEL_PATH \
--trust-remote-code --disaggregation-mode prefill --dist-init-addr $PREFILL_NODE0$:5757 --tp-size 32 --dp-size 32 --enable-dp-attention --host $LOCAL_IP --decode-log-interval 1 --disable-radix-cache --enable-deepep-moe --moe-dense-tp-size 1 --enable-dp-lm-head --disable-shared-experts-fusion --watchdog-timeout 1000000 --enable-two-batch-overlap --disaggregation-ib-device $IB_DEVICE --chunked-prefill-size 131072 --mem-fraction-static 0.85 --deepep-mode normal --ep-dispatch-algorithm dynamic --eplb-algorithm deepseek --max-running-requests 1024 --nnodes 4 --node-rank $RANK --tool-call-parser kimi_k2
# for decode node
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=480 MC_TE_METRIC=true SGLANG_DISAGGREGATION_HEARTBEAT_INTERVAL=10000000 SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 PYTHONUNBUFFERED=1 \
python -m sglang.launch_server --model-path $MODEL_PATH --trust-remote-code --disaggregation-mode decode --dist-init-addr $DECODE_NODE0:5757 --tp-size 96 --dp-size 96 --enable-dp-attention --host $LOCAL_IP --decode-log-interval 1 --context-length 2176 --disable-radix-cache --enable-deepep-moe --moe-dense-tp-size 1 --enable-dp-lm-head --disable-shared-experts-fusion --watchdog-timeout 1000000 --enable-two-batch-overlap --disaggregation-ib-device $IB_DEVICE --deepep-mode low_latency --mem-fraction-static 0.8 --cuda-graph-bs 480 --max-running-requests 46080 --ep-num-redundant-experts 96 --nnodes 12 --node-rank $RANK --tool-call-parser kimi_k2
# pdlb
PYTHONUNBUFFERED=1 python -m sglang.srt.disaggregation.launch_lb --prefill http://${PREFILL_NODE0}:30000 --decode http://${DECODE_NODE0}:30000
```
## KTransformers Deployment
Copy all configuration files (i.e., everything except the `.safetensors` files) into the GGUF checkpoint folder at `/path/to/K2`. Then run:
``` bash
python ktransformers/server/main.py --model_path /path/to/K2 --gguf_path /path/to/K2 --cache_lens 30000
```
To enable AMX optimization, run:
``` bash
python ktransformers/server/main.py --model_path /path/to/K2 --gguf_path /path/to/K2 --cache_lens 30000 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts-serve-amx.yaml
```
## TensorRT-LLM Deployment
### Prerequisite
Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html) to build TensorRT-LLM v1.0.0-rc2 from source and start a TRT-LLM docker container.
Install blobfile:
```bash
pip install blobfile
```
### Multi-node Serving
TensorRT-LLM supports multi-node inference. You can use `mpirun` to launch Kimi-K2 across multiple nodes. We will use two nodes for this example.
#### mpirun
`mpirun` requires each node to have passwordless SSH access to the other node. We need to set up the environment inside the Docker container. Run the container with host networking and mount the current directory as well as the model directory into the container.
```bash
# use host network
IMAGE=<YOUR_IMAGE>
NAME=test_2node_docker
# host1
docker run -it --name ${NAME}_host1 --ipc=host --gpus=all --network host --privileged --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/workspace -v <YOUR_MODEL_DIR>:/models/DeepSeek-V3 -w /workspace ${IMAGE}
# host2
docker run -it --name ${NAME}_host2 --ipc=host --gpus=all --network host --privileged --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/workspace -v <YOUR_MODEL_DIR>:/models/DeepSeek-V3 -w /workspace ${IMAGE}
```
Set up SSH inside the container:
```bash
apt-get update && apt-get install -y openssh-server
# edit /etc/ssh/sshd_config:
PermitRootLogin yes
PubkeyAuthentication yes
# change the default port 22 to another unused port, e.g.:
Port 2233
```
Generate an SSH key on host1 and copy it to host2, and vice versa.
```bash
# on host1
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
ssh-copy-id -i ~/.ssh/id_ed25519.pub root@<HOST2>
# on host2
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
ssh-copy-id -i ~/.ssh/id_ed25519.pub root@<HOST1>
# restart ssh service on host1 and host2
service ssh restart # or
/etc/init.d/ssh restart # or
systemctl restart ssh
```
Generate the additional config for `trtllm-serve`:
```bash
cat >/path/to/TensorRT-LLM/extra-llm-api-config.yml <<EOF
cuda_graph_config:
padding_enabled: true
batch_sizes:
- 1
- 2
- 4
- 8
- 16
- 32
- 64
- 128
print_iter_log: true
enable_attention_dp: true
EOF
```
After these preparations, you can run `trtllm-serve` on the two nodes using `mpirun`:
```bash
mpirun -np 16 \
-H <HOST1>:8,<HOST2>:8 \
-mca plm_rsh_args "-p 2233" \
--allow-run-as-root \
trtllm-llmapi-launch trtllm-serve serve \
--backend pytorch \
--tp_size 16 \
--ep_size 8 \
--kv_cache_free_gpu_memory_fraction 0.95 \
--trust_remote_code \
--max_batch_size 128 \
--max_num_tokens 4096 \
--extra_llm_api_options /path/to/TensorRT-LLM/extra-llm-api-config.yml \
--port 8000 \
<YOUR_MODEL_DIR>
```
## Others
Kimi-K2 reuses the `DeepSeekV3CausalLM` architecture and converts its weights into the proper shape to save redevelopment effort. To let inference engines distinguish it from DeepSeek-V3 and apply the best optimizations, we set `"model_type": "kimi_k2"` in `config.json`.
If you are using a framework that is not on the recommended list, you can still run the model by manually changing `model_type` to `"deepseek_v3"` in `config.json` as a temporary workaround. You may need to manually parse tool calls if no tool call parser is available in your framework.
---
### Qwen-Coder
- [ ] Qwen2.5-Coder-32B-Instruct
- HF: Qwen/Qwen2.5-Coder-32B-Instruct
- Hardware:
- 1x H100/H200
- --tool-call-parser hermes --enable-auto-tool-choice
- 2x H100/H200
- --tensor-parallel-size 2 --tool-call-parser hermes --enable-auto-tool-choice
- Notes: Good balance of size and performance. Single GPU capable.
- [ ] Qwen3-Coder-480B-A35B-Instruct (BF16)
- HF: Qwen/Qwen3-Coder-480B-A35B-Instruct
- Hardware:
- 8x H200/H20
- --tensor-parallel-size 8 --max-model-len 32000 --enable-auto-tool-choice --tool-call-parser qwen3_coder
- Notes: Cannot serve full 262K context on single node. Reduce max-model-len or increase gpu-memory-utilization.
- [ ] Qwen3-Coder-480B-A35B-Instruct-FP8
- HF: Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8
- Hardware:
- 8x H200/H20
- --max-model-len 131072 --enable-expert-parallel --data-parallel-size 8 --enable-auto-tool-choice --tool-call-parser qwen3_coder
- Env: VLLM_USE_DEEP_GEMM=1
- Notes: Use data-parallel mode (not tensor-parallel) to avoid weight quantization errors. DeepGEMM recommended.
- [ ] Qwen3-Coder-30B-A3B-Instruct (BF16)
- HF: Qwen/Qwen3-Coder-30B-A3B-Instruct
- Hardware:
- 1x H100/H200
- --enable-auto-tool-choice --tool-call-parser qwen3_coder
- Notes: Fits comfortably on single GPU. ~60GB model weight.
- 2x H100/H200
- --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser qwen3_coder
- Notes: For higher throughput/longer context.
- [ ] Qwen3-Coder-30B-A3B-Instruct-FP8
- HF: Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
- Hardware:
- 1x H100/H200
- --enable-auto-tool-choice --tool-call-parser qwen3_coder
- Env: VLLM_USE_DEEP_GEMM=1
- Notes: FP8 quantized, ~30GB model weight. Excellent for single GPU deployment.
### GPT-OSS
- Notes: Requires vLLM 0.10.1+gptoss. Built-in tools via /v1/responses endpoint (browsing, Python). Function calling not yet supported. --async-scheduling recommended for higher perf (not compatible with structured output).
- [ ] GPT-OSS-20B
- HF: openai/gpt-oss-20b
- Hardware:
- 1x H100/H200
- --async-scheduling
- 1x B200
- --async-scheduling
- Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
- [ ] GPT-OSS-120B
- HF: openai/gpt-oss-120b
- Hardware:
- 1x H100/H200
- --async-scheduling
- Notes: Needs --gpu-memory-utilization 0.95 --max-num-batched-tokens 1024 to avoid OOM
- 2x H100/H200
- --tensor-parallel-size 2 --async-scheduling
- Notes: Set --gpu-memory-utilization <0.95 to avoid OOM
- 4x H100/H200
- --tensor-parallel-size 4 --async-scheduling
- 8x H100/H200
- --tensor-parallel-size 8 --async-scheduling --max-model-len 131072 --max-num-batched-tokens 10240 --max-num-seqs 128 --gpu-memory-utilization 0.85 --no-enable-prefix-caching
- 1x B200
- --async-scheduling
- Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
- 2x B200
- --tensor-parallel-size 2 --async-scheduling
- Env: VLLM_USE_TRTLLM_ATTENTION=1 VLLM_USE_TRTLLM_DECODE_ATTENTION=1 VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 VLLM_USE_FLASHINFER_MXFP4_MOE=1
### GLM-4.5
- Notes: Listed configs support reduced context. For full 128K context, double the GPU count. Models default to thinking mode (disable with API param).
- [ ] GLM-4.5 (BF16)
- HF: zai-org/GLM-4.5
- Hardware:
- 16x H100
- --tensor-parallel-size 16 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
- 8x H200
- --tensor-parallel-size 8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
- Notes: On 8x H100, may need --cpu-offload-gb 16 to avoid OOM. For full 128K: needs 32x H100 or 16x H200.
- [ ] GLM-4.5-FP8
- HF: zai-org/GLM-4.5-FP8
- Hardware:
- 8x H100
- --tensor-parallel-size 8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
- 4x H200
- --tensor-parallel-size 4 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
- Notes: For full 128K context: needs 16x H100 or 8x H200.
- [ ] GLM-4.5-Air (BF16)
- HF: zai-org/GLM-4.5-Air
- Hardware:
- 4x H100
- --tensor-parallel-size 4 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
- 2x H200
- --tensor-parallel-size 2 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
- Notes: For full 128K context: needs 8x H100 or 4x H200.
- [ ] GLM-4.5-Air-FP8
- HF: zai-org/GLM-4.5-Air-FP8
- Hardware:
- 2x H100
- --tensor-parallel-size 2 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
- 1x H200
- --tensor-parallel-size 1 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
- Notes: For full 128K context: needs 4x H100 or 2x H200.
### Kimi
- Notes: Requires vLLM v0.10.0rc1+. Minimum 16 GPUs for FP8 with 128k context. Reuses DeepSeekV3 architecture with model_type="kimi_k2".
- [ ] Kimi-K2-Instruct
- HF: moonshotai/Kimi-K2-Instruct
- Hardware:
- 16x H200/H20
- --tensor-parallel-size 16 --trust-remote-code --enable-auto-tool-choice --tool-call-parser kimi_k2
- Notes: Pure TP mode. For >16 GPUs, combine with pipeline-parallelism.
- 16x H200/H20 (DP+EP mode)
- --data-parallel-size 16 --data-parallel-size-local 8 --enable-expert-parallel --max-num-batched-tokens 8192 --max-num-seqs 256 --gpu-memory-utilization 0.85 --trust-remote-code --enable-auto-tool-choice --tool-call-parser kimi_k2
- Notes: Data parallel + expert parallel mode for higher throughput. Requires multi-node setup with proper networking.
---
`packages/pods/docs/plan.md`
## Pi
Pi automates vLLM deployment on GPU pods from DataCrunch, Vast.ai, Prime Intellect, RunPod (or any Ubuntu machine with NVIDIA GPUs). It manages multiple concurrent model deployments via separate vLLM instances, each accessible through the OpenAI API protocol with API key authentication.
Pods are treated as ephemeral - spin up when needed, tear down when done. To avoid re-downloading models (30+ minutes for 100GB+ models), pi uses persistent network volumes for model storage that can be shared across pods on the same provider. This minimizes both cost (only pay for active compute) and setup time (models already cached).
## Usage
### Pods
```bash
pi pods setup dc1 "ssh root@1.2.3.4" --mount "mount -t nfs..." # Setup pod (requires HF_TOKEN, PI_API_KEY env vars)
pi pods # List all pods (* = active)
pi pods active dc2 # Switch active pod
pi pods remove dc1 # Remove pod
```
### Models
```bash
pi start Qwen/Qwen2.5-72B-Instruct --name qwen72b # Known model - pi handles vLLM args
pi start some/unknown-model --name mymodel --vllm --tensor-parallel-size 4 --max-model-len 32768 # Custom vLLM args
pi list # List running models with ports
pi stop qwen72b # Stop model
pi logs qwen72b # View model logs
```
For known models, pi automatically configures appropriate vLLM arguments from model documentation based on the hardware of the pod. For unknown models or custom configurations, pass vLLM args after `--vllm`.
## Pod management
Pi manages GPU pods from various providers (DataCrunch, Vast.ai, Prime Intellect, RunPod) as ephemeral compute resources. Users manually create pods via provider dashboards, then register them with pi for automated setup and management.
Key capabilities:
- **Pod setup**: Transform bare Ubuntu/Debian machines into vLLM-ready environments in ~2 minutes
- **Model caching**: Optional persistent storage shared by pods to avoid re-downloading 100GB+ models
- **Multi-pod management**: Register multiple pods, switch between them, maintain different environments
### Pod setup
When a user creates a fresh pod on a provider, they register it with pi using the SSH command from the provider:
```bash
pi pods setup dc1 "ssh root@1.2.3.4" --mount "mount -t nfs..."
```
This copies and executes `pod_setup.sh` which:
1. Detects GPUs via `nvidia-smi` and stores count/memory in local config
2. Installs CUDA toolkit matching the driver version
3. Creates Python environment
- Installs uv and Python 3.12
- Creates venv at ~/venv with PyTorch (--torch-backend=auto)
- Installs vLLM (model-specific versions when needed)
- Installs FlashInfer (builds from source if required)
- Installs huggingface-hub (for model downloads)
- Installs hf-transfer (for accelerated downloads)
4. Mounts persistent storage if provided
- Symlinks to ~/.cache/huggingface for model caching
5. Configures environment variables persistently
Required environment variables:
- `HF_TOKEN`: HuggingFace token for model downloads
- `PI_API_KEY`: API key for securing vLLM endpoints
### Model caching
Models can be 100GB+ and take 30+ minutes to download. The `--mount` flag enables persistent model caching:
- **DataCrunch**: NFS shared filesystems, mountable across multiple running pods in same region
- **RunPod**: Network volumes persist independently but cannot be shared between running pods
- **Vast.ai**: Volumes locked to specific machine - no sharing
- **Prime Intellect**: No persistent storage documented
Without `--mount`, models download to pod-local storage and are lost on termination.
### Multi-pod management
Users can register multiple pods and switch between them:
```bash
pi pods # List all pods (* = active)
pi pods active dc2 # Switch active pod
pi pods remove dc1   # Remove pod from local config (doesn't destroy the pod remotely)
```
All model commands (`pi start`, `pi stop`, etc.) target the active pod, unless `--pod <podname>` is given, which overrides the active pod for that command.
## Model deployment
Pi uses direct SSH commands to manage vLLM instances on pods. No remote manager component is needed - everything is controlled from the local pi CLI.
### Architecture
The pi CLI maintains all state locally in `~/.pi/pods.json`:
```json
{
"pods": {
"dc1": {
"ssh": "ssh root@1.2.3.4",
"gpus": [
{"id": 0, "name": "H100", "memory": "80GB"},
{"id": 1, "name": "H100", "memory": "80GB"}
],
"models": {
"qwen": {
"model": "Qwen/Qwen2.5-72B",
"port": 8001,
"gpu": "0",
"pid": 12345
}
}
}
},
"active": "dc1"
}
```
The location of the pi config dir can also be specified via the `PI_CONFIG_DIR` env var, e.g. for testing.
Pods are assumed to be fully managed by pi - no other processes compete for ports or GPUs.
### Starting models
When user runs `pi start Qwen/Qwen2.5-72B --name qwen`:
1. CLI determines next available port (starting from 8001)
2. Selects GPU (round-robin based on stored GPU info)
3. Downloads model if not cached:
- Sets `HF_HUB_ENABLE_HF_TRANSFER=1` for fast downloads
- Runs via SSH with output piped to local terminal
- Ctrl+C cancels download and returns control
4. Builds vLLM command with appropriate args and PI_API_KEY
5. Executes via SSH: `ssh pod "nohup vllm serve ... > ~/.vllm_logs/qwen.log 2>&1 & echo $!"`
6. Waits for vLLM to be ready (checks health endpoint)
7. On success: stores port, GPU, PID in local state
8. On failure: shows exact error from vLLM logs, doesn't save to config
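The health-check wait in step 6 can be sketched as a simple polling loop against vLLM's `/health` endpoint; the timeout and poll interval are illustrative choices:

```typescript
// Poll vLLM's /health endpoint until it responds OK or the timeout elapses.
async function waitForHealthy(baseUrl: string, timeoutMs = 600_000): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(`${baseUrl}/health`);
      if (res.ok) return true;
    } catch {
      // server not up yet; keep polling
    }
    await new Promise((r) => setTimeout(r, 2000));
  }
  return false;
}
```

On a `false` return, `pi start` would surface the tail of the vLLM log and skip writing the model into local state.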
### Managing models
- **List**: Show models from local state, optionally verify PIDs still running
- **Stop**: SSH to kill process by PID
- **Logs**: SSH to tail -f log files (Ctrl+C stops tailing, doesn't kill vLLM)
### Error handling
- **SSH failures**: Prompt user to check connection or remove pod from config
- **Stale state**: Commands that fail with "process not found" auto-clean local state
- **Setup failures**: Ctrl+C during setup kills remote script and exits cleanly
### Testing models
The `pi prompt` command provides a quick way to test deployed models:
```bash
pi prompt qwen "What is 2+2?" # Simple prompt
pi prompt qwen "Read file.txt and summarize" # Uses built-in tools
```
Built-in tools for agentic testing:
- `ls(path, ignore?)`: List files and directories at path, with optional ignore patterns
- `read(file_path, offset?, limit?)`: Read file contents with optional line offset/limit
- `glob(pattern, path?)`: Find files matching glob pattern (e.g., "**/*.py", "src/**/*.ts")
- `rg(args)`: Run ripgrep with any arguments (e.g., "pattern -t py -C 3", "TODO --type-not test")
The provided prompt will be augmented with info on the current local working directory. File tools expect absolute paths.
This allows testing basic agent capabilities without external tool configuration.
`prompt` is implemented using the latest OpenAI SDK for NodeJS. It outputs thinking content, tool calls and results, and normal assistant messages.
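A minimal sketch of the underlying request, using plain `fetch` against the model's OpenAI-compatible endpoint rather than the SDK (the base URL, model name, and key are placeholders for a deployed pod):

```typescript
// Minimal chat round-trip against a model's OpenAI-compatible endpoint.
async function chat(baseUrl: string, apiKey: string, prompt: string): Promise<string> {
  const res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({
      model: "qwen",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = (await res.json()) as { choices: { message: { content: string } }[] };
  return data.choices[0].message.content;
}
```

The real implementation additionally passes the `tools` array, streams the response, and renders thinking content and tool calls as they arrive.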
## Models
We want to support these models specifically, with alternative models marked as "possibly works". This list will be updated with new models regularly. A checked box means "supported".
See [models.md](./models.md) for the list of models, their hardware requirements, vLLM args, and notes, which we want to support out of the box with a simple `pi start <model-name> --name <local-name>`.
---
# Qwen3-Coder Usage Guide
[Qwen3-Coder](https://github.com/QwenLM/Qwen3-Coder) is an advanced large language model created by the Qwen team at Alibaba Cloud. vLLM already supports Qwen3-Coder, and `tool-call` functionality is available in vLLM v0.10.0 and higher. You can install vLLM with `tool-call` support as follows:
## Installing vLLM
```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
```
## Launching Qwen3-Coder with vLLM
### Serving on 8xH200 (or H20) GPUs (141GB × 8)
**BF16 Model**
```bash
vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct \
--tensor-parallel-size 8 \
--max-model-len 32000 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
```
**FP8 Model**
```bash
VLLM_USE_DEEP_GEMM=1 vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
--max-model-len 131072 \
--enable-expert-parallel \
--data-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
```
## Performance Metrics
### Evaluation
We launched `Qwen3-Coder-480B-A35B-Instruct-FP8` using vLLM and evaluated its performance using [EvalPlus](https://github.com/evalplus/evalplus). The results are displayed below:
| Dataset | Test Type | Pass@1 Score |
|-----------|-----------|--------------|
| HumanEval | Base tests | 0.939 |
| HumanEval+ | Base + extra tests | 0.902 |
| MBPP | Base tests | 0.918 |
| MBPP+ | Base + extra tests | 0.794 |
### Benchmarking
We used the following script to benchmark `Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8`
```bash
vllm bench serve \
--backend vllm \
--model Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
--endpoint /v1/completions \
--dataset-name random \
--random-input 2048 \
--random-output 1024 \
--max-concurrency 10 \
  --num-prompts 100
```
If successful, you will see output like the following.
```shell
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 776.49
Total input tokens: 204169
Total generated tokens: 102400
Request throughput (req/s): 0.13
Output token throughput (tok/s): 131.88
Total Token throughput (tok/s): 394.81
---------------Time to First Token----------------
Mean TTFT (ms): 7639.31
Median TTFT (ms): 6935.71
P99 TTFT (ms): 13766.68
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 68.43
Median TPOT (ms): 67.23
P99 TPOT (ms): 72.14
---------------Inter-token Latency----------------
Mean ITL (ms): 68.43
Median ITL (ms): 66.34
P99 ITL (ms): 69.38
==================================================
```
## Usage Tips
### BF16 Models
- **Context Length Limitation**: A single H20 node cannot serve the original context length (262144). You can reduce the `max-model-len` or increase `gpu-memory-utilization` to work within memory constraints.
### FP8 Models
- **Context Length Limitation**: A single H20 node cannot serve the original context length (262144). You can reduce the `max-model-len` or increase `gpu-memory-utilization` to work within memory constraints.
- **DeepGEMM Usage**: To use [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM), set `VLLM_USE_DEEP_GEMM=1`. Follow the [setup instructions](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/deepgemm/README.md#setup) to install it.
- **Tensor Parallelism Issue**: When using `tensor-parallel-size 8`, the failure shown below is expected. Switch to data-parallel mode using `--data-parallel-size` instead.
- **Additional Resources**: Refer to the [Data Parallel Deployment documentation](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html) for more parallelism groups.
```shell
ERROR [multiproc_executor.py:511] File "/vllm/vllm/model_executor/models/qwen3_moe.py", line 336, in <lambda>
ERROR [multiproc_executor.py:511] lambda prefix: Qwen3MoeDecoderLayer(config=config,
ERROR [multiproc_executor.py:511] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py:511] File "/vllm/vllm/model_executor/models/qwen3_moe.py", line 278, in __init__
ERROR [multiproc_executor.py:511] self.mlp = Qwen3MoeSparseMoeBlock(config=config,
ERROR [multiproc_executor.py:511] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py:511] File "/vllm/vllm/model_executor/models/qwen3_moe.py", line 113, in __init__
ERROR [multiproc_executor.py:511] self.experts = FusedMoE(num_experts=config.num_experts,
ERROR [multiproc_executor.py:511] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py:511] File "/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 773, in __init__
ERROR [multiproc_executor.py:511] self.quant_method.create_weights(layer=self, **moe_quant_params)
ERROR [multiproc_executor.py:511] File "/vllm/vllm/model_executor/layers/quantization/fp8.py", line 573, in create_weights
ERROR [multiproc_executor.py:511] raise ValueError(
ERROR [multiproc_executor.py:511] ValueError: The output_size of gate's and up's weight = 320 is not divisible by weight quantization block_n = 128.
```
### Tool Calling
- **Enable Tool Calls**: Add `--enable-auto-tool-choice --tool-call-parser qwen3_coder` to enable tool-call parsing. For details, see [tool_calling](https://docs.vllm.ai/en/latest/features/tool_calling.html).
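Once the parser is enabled, tools are passed through the standard OpenAI-compatible chat API. A minimal sketch of a request body follows; the `get_weather` tool and its schema are illustrative examples, not part of vLLM:

```python
import json

# Build a chat request that advertises one callable tool. With
# --tool-call-parser qwen3_coder enabled, vLLM converts the model's
# tool-call output into the structured `tool_calls` field of the response.
payload = {
    "model": "Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8",
    "messages": [{"role": "user", "content": "What's the weather in Vienna?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical example tool
            "description": "Look up the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}

body = json.dumps(payload)
# POST `body` to the server's /v1/chat/completions endpoint with your API key.
```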
## Roadmap
- [x] Add benchmark results
## Additional Resources
- [EvalPlus](https://github.com/evalplus/evalplus)
- [Qwen3-Coder](https://github.com/QwenLM/Qwen3-Coder)
- [vLLM Documentation](https://docs.vllm.ai/)

packages/pods/package-lock.json (generated)

File diff suppressed because it is too large


@@ -0,0 +1,40 @@
{
"name": "@mariozechner/pi",
"version": "0.5.0",
"description": "CLI tool for managing vLLM deployments on GPU pods",
"type": "module",
"bin": {
"pi": "./dist/cli.js"
},
"scripts": {
"clean": "rm -rf dist tsconfig.tsbuildinfo",
"build": "tsc -p tsconfig.build.json && chmod +x dist/cli.js && cp src/models.json dist/",
"check": "biome check --write .",
"prepublishOnly": "npm run clean && npm run build"
},
"files": [
"dist"
],
"keywords": [
"llm",
"vllm",
"gpu",
"ai",
"cli"
],
"author": "Mario Zechner",
"license": "MIT",
"repository": {
"type": "git",
"url": "https://github.com/badlogic/pi-mono.git",
"directory": "packages/pods"
},
"engines": {
"node": ">=20.0.0"
},
"dependencies": {
"@mariozechner/pi-agent": "^0.5.0",
"chalk": "^5.5.0"
},
"devDependencies": {}
}


@@ -0,0 +1,83 @@
#!/usr/bin/env bash
# Model runner script - runs sequentially, killed by pi stop
set -euo pipefail
# These values are replaced before upload by pi CLI
MODEL_ID="{{MODEL_ID}}"
NAME="{{NAME}}"
PORT="{{PORT}}"
VLLM_ARGS="{{VLLM_ARGS}}"
# Trap to ensure cleanup on exit and kill any child processes
cleanup() {
local exit_code=$?
echo "Model runner exiting with code $exit_code"
# Kill any child processes
pkill -P $$ 2>/dev/null || true
exit $exit_code
}
trap cleanup EXIT TERM INT
# Force colored output even when not a TTY
export FORCE_COLOR=1
export PYTHONUNBUFFERED=1
export TERM=xterm-256color
export RICH_FORCE_TERMINAL=1
export CLICOLOR_FORCE=1
# Source virtual environment
source /root/venv/bin/activate
echo "========================================="
echo "Model Run: $NAME"
echo "Model ID: $MODEL_ID"
echo "Port: $PORT"
if [ -n "$VLLM_ARGS" ]; then
echo "vLLM Args: $VLLM_ARGS"
fi
echo "========================================="
echo ""
# Download model (with color progress bars)
echo "Downloading model (will skip if cached)..."
# With `set -e` active, checking `$?` after the command is dead code;
# test the command directly instead.
if ! HF_HUB_ENABLE_HF_TRANSFER=1 hf download "$MODEL_ID"; then
echo "❌ ERROR: Failed to download model" >&2
exit 1
fi
echo ""
echo "✅ Model download complete"
echo ""
# Build vLLM command
VLLM_CMD="vllm serve '$MODEL_ID' --port $PORT --api-key '$PI_API_KEY'"
if [ -n "$VLLM_ARGS" ]; then
VLLM_CMD="$VLLM_CMD $VLLM_ARGS"
fi
echo "Starting vLLM server..."
echo "Command: $VLLM_CMD"
echo "========================================="
echo ""
# Run vLLM in background so we can monitor it
echo "Starting vLLM process..."
bash -c "$VLLM_CMD" &
VLLM_PID=$!
# Monitor the vLLM process
echo "Monitoring vLLM process (PID: $VLLM_PID)..."
# Guard the wait: under `set -e` a nonzero exit from `wait` would
# abort the script before we can capture the exit code.
VLLM_EXIT_CODE=0
wait $VLLM_PID || VLLM_EXIT_CODE=$?
if [ $VLLM_EXIT_CODE -ne 0 ]; then
echo "❌ ERROR: vLLM exited with code $VLLM_EXIT_CODE" >&2
# Make sure to exit the script command too
kill -TERM $$ 2>/dev/null || true
exit $VLLM_EXIT_CODE
fi
echo "✅ vLLM exited normally"
exit 0


@@ -0,0 +1,334 @@
#!/usr/bin/env bash
# GPU pod bootstrap for vLLM deployment
set -euo pipefail
# Parse arguments passed from pi CLI
MOUNT_COMMAND=""
MODELS_PATH=""
HF_TOKEN=""
PI_API_KEY=""
VLLM_VERSION="release" # Default to release
while [[ $# -gt 0 ]]; do
case $1 in
--mount)
MOUNT_COMMAND="$2"
shift 2
;;
--models-path)
MODELS_PATH="$2"
shift 2
;;
--hf-token)
HF_TOKEN="$2"
shift 2
;;
--vllm-api-key)
PI_API_KEY="$2"
shift 2
;;
--vllm)
VLLM_VERSION="$2"
shift 2
;;
*)
echo "ERROR: Unknown option: $1" >&2
exit 1
;;
esac
done
# Validate required parameters
if [ -z "$HF_TOKEN" ]; then
echo "ERROR: HF_TOKEN is required" >&2
exit 1
fi
if [ -z "$PI_API_KEY" ]; then
echo "ERROR: PI_API_KEY is required" >&2
exit 1
fi
if [ -z "$MODELS_PATH" ]; then
echo "ERROR: MODELS_PATH is required" >&2
exit 1
fi
echo "=== Starting pod setup ==="
# Install system dependencies
apt update -y
apt install -y python3-pip python3-venv git build-essential cmake ninja-build curl wget lsb-release htop pkg-config
# --- Install matching CUDA toolkit -------------------------------------------
echo "Checking CUDA driver version..."
DRIVER_CUDA_VERSION=$(nvidia-smi | sed -n 's/.*CUDA Version: *\([0-9.]*\).*/\1/p' | head -n1)
echo "Driver supports CUDA: $DRIVER_CUDA_VERSION"
# Check if nvcc exists and its version
if command -v nvcc &> /dev/null; then
NVCC_VERSION=$(nvcc --version | grep "release" | awk '{print $6}' | cut -d, -f1)
echo "Current nvcc version: $NVCC_VERSION"
else
NVCC_VERSION="none"
echo "nvcc not found"
fi
# Install CUDA toolkit matching driver version if needed
if [[ "$NVCC_VERSION" != "$DRIVER_CUDA_VERSION" ]]; then
echo "Installing CUDA Toolkit $DRIVER_CUDA_VERSION to match driver..."
# Detect Ubuntu version
UBUNTU_VERSION=$(lsb_release -rs)
UBUNTU_CODENAME=$(lsb_release -cs)
echo "Detected Ubuntu $UBUNTU_VERSION ($UBUNTU_CODENAME)"
# Map Ubuntu version to NVIDIA repo path
if [[ "$UBUNTU_VERSION" == "24.04" ]]; then
REPO_PATH="ubuntu2404"
elif [[ "$UBUNTU_VERSION" == "22.04" ]]; then
REPO_PATH="ubuntu2204"
elif [[ "$UBUNTU_VERSION" == "20.04" ]]; then
REPO_PATH="ubuntu2004"
else
echo "Warning: Unsupported Ubuntu version $UBUNTU_VERSION, trying ubuntu2204"
REPO_PATH="ubuntu2204"
fi
# Add NVIDIA package repositories
wget https://developer.download.nvidia.com/compute/cuda/repos/${REPO_PATH}/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
rm cuda-keyring_1.1-1_all.deb
apt-get update
# Install specific CUDA toolkit version
# Convert version format (12.9 -> 12-9)
CUDA_VERSION_APT=$(echo $DRIVER_CUDA_VERSION | sed 's/\./-/')
echo "Installing cuda-toolkit-${CUDA_VERSION_APT}..."
apt-get install -y cuda-toolkit-${CUDA_VERSION_APT}
# Add CUDA to PATH
export PATH=/usr/local/cuda-${DRIVER_CUDA_VERSION}/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-${DRIVER_CUDA_VERSION}/lib64:${LD_LIBRARY_PATH:-}
# Verify installation
nvcc --version
else
echo "CUDA toolkit $NVCC_VERSION matches driver version"
export PATH=/usr/local/cuda-${DRIVER_CUDA_VERSION}/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-${DRIVER_CUDA_VERSION}/lib64:${LD_LIBRARY_PATH:-}
fi
# --- Install uv (fast Python package manager) --------------------------------
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
# --- Install Python 3.12 if not available ------------------------------------
if ! command -v python3.12 &> /dev/null; then
echo "Python 3.12 not found. Installing via uv..."
uv python install 3.12
fi
# --- Clean up existing environments and caches -------------------------------
echo "Cleaning up existing environments and caches..."
# Remove existing venv for a clean installation
VENV="$HOME/venv"
if [ -d "$VENV" ]; then
echo "Removing existing virtual environment..."
rm -rf "$VENV"
fi
# Remove uv cache to ensure fresh installs
if [ -d "$HOME/.cache/uv" ]; then
echo "Clearing uv cache..."
rm -rf "$HOME/.cache/uv"
fi
# Remove vLLM cache to avoid conflicts
if [ -d "$HOME/.cache/vllm" ]; then
echo "Clearing vLLM cache..."
rm -rf "$HOME/.cache/vllm"
fi
# --- Create and activate venv ------------------------------------------------
echo "Creating fresh virtual environment..."
uv venv --python 3.12 --seed "$VENV"
source "$VENV/bin/activate"
# --- Install PyTorch and vLLM ------------------------------------------------
echo "Installing vLLM and dependencies (version: $VLLM_VERSION)..."
case "$VLLM_VERSION" in
release)
echo "Installing vLLM release with PyTorch..."
# Install vLLM with automatic PyTorch backend selection
# vLLM will automatically install the correct PyTorch version
uv pip install 'vllm>=0.10.0' --torch-backend=auto || {
echo "ERROR: Failed to install vLLM"
exit 1
}
;;
nightly)
echo "Installing vLLM nightly with PyTorch..."
echo "This will install the latest nightly build of vLLM..."
# Install vLLM nightly with PyTorch
uv pip install -U vllm \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly || {
echo "ERROR: Failed to install vLLM nightly"
exit 1
}
echo "vLLM nightly successfully installed!"
;;
gpt-oss)
echo "Installing GPT-OSS special build with PyTorch nightly..."
echo "WARNING: This build is ONLY for GPT-OSS models!"
echo "Installing PyTorch nightly and cutting-edge dependencies..."
# Convert CUDA version format for PyTorch (12.4 -> cu124)
PYTORCH_CUDA="cu$(echo $DRIVER_CUDA_VERSION | sed 's/\.//')"
echo "Using PyTorch nightly with ${PYTORCH_CUDA} (driver supports ${DRIVER_CUDA_VERSION})"
# The GPT-OSS build will pull PyTorch nightly and other dependencies
# via the extra index URLs. We don't pre-install torch here to avoid conflicts.
uv pip install --pre vllm==0.10.1+gptoss \
--extra-index-url https://wheels.vllm.ai/gpt-oss/ \
--extra-index-url https://download.pytorch.org/whl/nightly/${PYTORCH_CUDA} \
--index-strategy unsafe-best-match || {
echo "ERROR: Failed to install GPT-OSS vLLM build"
echo "This automatically installs PyTorch nightly with ${PYTORCH_CUDA}, Triton nightly, and other dependencies"
exit 1
}
# Install gpt-oss library for tool support
uv pip install gpt-oss || {
echo "WARNING: Failed to install gpt-oss library (needed for tool use)"
}
;;
*)
echo "ERROR: Unknown vLLM version: $VLLM_VERSION"
exit 1
;;
esac
# --- Install additional packages ---------------------------------------------
echo "Installing additional packages..."
uv pip install huggingface-hub psutil tensorrt hf_transfer
# --- FlashInfer installation (optional, improves performance) ----------------
echo "Attempting FlashInfer installation (optional)..."
if uv pip install flashinfer-python; then
echo "FlashInfer installed successfully"
else
echo "FlashInfer not available, using Flash Attention instead"
fi
# --- Mount storage if provided -----------------------------------------------
if [ -n "$MOUNT_COMMAND" ]; then
echo "Setting up mount..."
# Create mount point directory if it doesn't exist
mkdir -p "$MODELS_PATH"
# Execute the mount command
eval "$MOUNT_COMMAND" || {
echo "WARNING: Mount command failed, continuing without mount"
}
# Verify mount succeeded (optional, may not always be a mount point)
if mountpoint -q "$MODELS_PATH" 2>/dev/null; then
echo "Storage successfully mounted at $MODELS_PATH"
else
echo "Note: $MODELS_PATH is not a mount point (might be local storage)"
fi
fi
# --- Model storage setup ------------------------------------------------------
echo ""
echo "=== Setting up model storage ==="
echo "Storage path: $MODELS_PATH"
# Check if the path exists and is writable
if [ ! -d "$MODELS_PATH" ]; then
echo "Creating model storage directory: $MODELS_PATH"
mkdir -p "$MODELS_PATH"
fi
if [ ! -w "$MODELS_PATH" ]; then
echo "ERROR: Model storage path is not writable: $MODELS_PATH"
echo "Please check permissions"
exit 1
fi
# Create the huggingface cache directory structure in the models path
mkdir -p "${MODELS_PATH}/huggingface/hub"
# Remove any existing cache directory or symlink
if [ -e ~/.cache/huggingface ] || [ -L ~/.cache/huggingface ]; then
echo "Removing existing ~/.cache/huggingface..."
rm -rf ~/.cache/huggingface 2>/dev/null || true
fi
# Create parent directory if needed
mkdir -p ~/.cache
# Create symlink from ~/.cache/huggingface to the models path
ln -s "${MODELS_PATH}/huggingface" ~/.cache/huggingface
echo "Created symlink: ~/.cache/huggingface -> ${MODELS_PATH}/huggingface"
# Verify the symlink works
if [ -d ~/.cache/huggingface/hub ]; then
echo "✓ Model storage configured successfully"
# Check available space
AVAILABLE_SPACE=$(df -h "$MODELS_PATH" | awk 'NR==2 {print $4}')
echo "Available space: $AVAILABLE_SPACE"
else
echo "ERROR: Could not verify model storage setup"
echo "The symlink was created but the target directory is not accessible"
exit 1
fi
# --- Configure environment ----------------------------------------------------
mkdir -p ~/.config/vllm
touch ~/.config/vllm/do_not_track
# Write environment to .bashrc for persistence
cat >> ~/.bashrc << EOF
# Pi vLLM environment
[ -d "\$HOME/venv" ] && source "\$HOME/venv/bin/activate"
export PATH="/usr/local/cuda-${DRIVER_CUDA_VERSION}/bin:\$HOME/.local/bin:\$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-${DRIVER_CUDA_VERSION}/lib64:\${LD_LIBRARY_PATH:-}"
export HF_TOKEN="${HF_TOKEN}"
export PI_API_KEY="${PI_API_KEY}"
export HUGGING_FACE_HUB_TOKEN="${HF_TOKEN}"
export HF_HUB_ENABLE_HF_TRANSFER=1
export VLLM_NO_USAGE_STATS=1
export VLLM_DO_NOT_TRACK=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
EOF
# Create log directory for vLLM
mkdir -p ~/.vllm_logs
# --- Output GPU info for pi CLI to parse -------------------------------------
echo ""
echo "===GPU_INFO_START==="
nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader | while IFS=, read -r id name memory; do
# Trim whitespace
id=$(echo "$id" | xargs)
name=$(echo "$name" | xargs)
memory=$(echo "$memory" | xargs)
echo "{\"id\": $id, \"name\": \"$name\", \"memory\": \"$memory\"}"
done
echo "===GPU_INFO_END==="
echo ""
echo "=== Setup complete ==="
echo "Pod is ready for vLLM deployments"
echo "Models will be cached at: $MODELS_PATH"

packages/pods/src/cli.ts

@@ -0,0 +1,362 @@
#!/usr/bin/env node
import chalk from "chalk";
import { spawn } from "child_process";
import { readFileSync } from "fs";
import { dirname, join } from "path";
import { fileURLToPath } from "url";
import { listModels, startModel, stopModel, viewLogs } from "./commands/models.js";
import { listPods, removePodCommand, setupPod, switchActivePod } from "./commands/pods.js";
import { promptModel } from "./commands/prompt.js";
import { getActivePod, loadConfig } from "./config.js";
import { sshExecStream } from "./ssh.js";
const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);
const packageJson = JSON.parse(readFileSync(join(__dirname, "../package.json"), "utf-8"));
function printHelp() {
console.log(`pi v${packageJson.version} - Manage vLLM deployments on GPU pods
Pod Management:
pi pods setup <name> "<ssh>" --mount "<mount>" Setup pod with mount command
Options:
--vllm release Install latest vLLM release >=0.10.0 (default)
--vllm nightly Install vLLM nightly build (latest features)
--vllm gpt-oss Install vLLM 0.10.1+gptoss with PyTorch nightly (GPT-OSS only)
pi pods List all pods (* = active)
pi pods active <name> Switch active pod
pi pods remove <name> Remove pod from local config
pi shell [<name>] Open shell on pod (active or specified)
pi ssh [<name>] "<command>" Run SSH command on pod
Model Management:
pi start <model> --name <name> [options] Start a model
--memory <percent> GPU memory allocation (30%, 50%, 90%)
--context <size> Context window (4k, 8k, 16k, 32k, 64k, 128k)
--gpus <count> Number of GPUs to use (predefined models only)
--vllm <args...> Pass remaining args to vLLM (ignores other options)
pi stop [<name>] Stop model (or all if no name)
pi list List running models
pi logs <name> Stream model logs
pi agent <name> ["<message>"...] [options] Chat with model using agent & tools
pi agent <name> [options] Interactive chat mode
--continue, -c Continue previous session
--json Output as JSONL
(All pi-agent options are supported)
All model commands support --pod <name> to override the active pod.
Environment:
HF_TOKEN HuggingFace token for model downloads
PI_API_KEY API key for vLLM endpoints
PI_CONFIG_DIR Config directory (default: ~/.pi)`);
}
// Parse command line arguments
const args = process.argv.slice(2);
if (args.length === 0 || args[0] === "--help" || args[0] === "-h") {
printHelp();
process.exit(0);
}
if (args[0] === "--version" || args[0] === "-v") {
console.log(packageJson.version);
process.exit(0);
}
const command = args[0];
const subcommand = args[1];
// Main command handler
try {
// Handle "pi pods" commands
if (command === "pods") {
if (!subcommand) {
// pi pods - list all pods
listPods();
} else if (subcommand === "setup") {
// pi pods setup <name> "<ssh>" [--mount "<mount>"] [--models-path <path>] [--vllm release|nightly|gpt-oss]
const name = args[2];
const sshCmd = args[3];
if (!name || !sshCmd) {
console.error(
'Usage: pi pods setup <name> "<ssh>" [--mount "<mount>"] [--models-path <path>] [--vllm release|nightly|gpt-oss]',
);
process.exit(1);
}
// Parse options
const options: { mount?: string; modelsPath?: string; vllm?: "release" | "nightly" | "gpt-oss" } = {};
for (let i = 4; i < args.length; i++) {
if (args[i] === "--mount" && i + 1 < args.length) {
options.mount = args[i + 1];
i++;
} else if (args[i] === "--models-path" && i + 1 < args.length) {
options.modelsPath = args[i + 1];
i++;
} else if (args[i] === "--vllm" && i + 1 < args.length) {
const vllmType = args[i + 1];
if (vllmType === "release" || vllmType === "nightly" || vllmType === "gpt-oss") {
options.vllm = vllmType;
} else {
console.error(chalk.red(`Invalid vLLM type: ${vllmType}`));
console.error("Valid options: release, nightly, gpt-oss");
process.exit(1);
}
i++;
}
}
// If --mount provided but no --models-path, try to extract path from mount command
if (options.mount && !options.modelsPath) {
// Extract last part of mount command as models path
const parts = options.mount.trim().split(" ");
const lastPart = parts[parts.length - 1];
if (lastPart?.startsWith("/")) {
options.modelsPath = lastPart;
}
}
await setupPod(name, sshCmd, options);
} else if (subcommand === "active") {
// pi pods active <name>
const name = args[2];
if (!name) {
console.error("Usage: pi pods active <name>");
process.exit(1);
}
switchActivePod(name);
} else if (subcommand === "remove") {
// pi pods remove <name>
const name = args[2];
if (!name) {
console.error("Usage: pi pods remove <name>");
process.exit(1);
}
removePodCommand(name);
} else {
console.error(`Unknown pods subcommand: ${subcommand}`);
process.exit(1);
}
} else {
// Parse --pod override for model commands
let podOverride: string | undefined;
const podIndex = args.indexOf("--pod");
if (podIndex !== -1 && podIndex + 1 < args.length) {
podOverride = args[podIndex + 1];
// Remove --pod and its value from args
args.splice(podIndex, 2);
}
// Handle SSH/shell commands and model commands
switch (command) {
case "shell": {
// pi shell [<name>] - open interactive shell
const podName = args[1];
let podInfo: { name: string; pod: import("./types.js").Pod } | null = null;
if (podName) {
const config = loadConfig();
const pod = config.pods[podName];
if (pod) {
podInfo = { name: podName, pod };
}
} else {
podInfo = getActivePod();
}
if (!podInfo) {
if (podName) {
console.error(chalk.red(`Pod '${podName}' not found`));
} else {
console.error(chalk.red("No active pod. Use 'pi pods active <name>' to set one."));
}
process.exit(1);
}
console.log(chalk.green(`Connecting to pod '${podInfo.name}'...`));
// Execute SSH in interactive mode
const sshArgs = podInfo.pod.ssh.split(" ").slice(1); // Remove 'ssh' from command
const sshProcess = spawn("ssh", sshArgs, {
stdio: "inherit",
env: process.env,
});
sshProcess.on("exit", (code) => {
process.exit(code || 0);
});
break;
}
case "ssh": {
// pi ssh [<name>] "<command>" - run command via SSH
let podName: string | undefined;
let sshCommand: string;
if (args.length === 2) {
// pi ssh "<command>" - use active pod
sshCommand = args[1];
} else if (args.length === 3) {
// pi ssh <name> "<command>"
podName = args[1];
sshCommand = args[2];
} else {
console.error('Usage: pi ssh [<name>] "<command>"');
process.exit(1);
}
let podInfo: { name: string; pod: import("./types.js").Pod } | null = null;
if (podName) {
const config = loadConfig();
const pod = config.pods[podName];
if (pod) {
podInfo = { name: podName, pod };
}
} else {
podInfo = getActivePod();
}
if (!podInfo) {
if (podName) {
console.error(chalk.red(`Pod '${podName}' not found`));
} else {
console.error(chalk.red("No active pod. Use 'pi pods active <name>' to set one."));
}
process.exit(1);
}
console.log(chalk.gray(`Running on pod '${podInfo.name}': ${sshCommand}`));
// Execute command and stream output
const exitCode = await sshExecStream(podInfo.pod.ssh, sshCommand);
process.exit(exitCode);
break;
}
case "start": {
// pi start <model> --name <name> [options]
const modelId = args[1];
if (!modelId) {
// Show available models
const { showKnownModels } = await import("./commands/models.js");
await showKnownModels();
process.exit(0);
}
// Parse options
let name: string | undefined;
let memory: string | undefined;
let context: string | undefined;
let gpus: number | undefined;
const vllmArgs: string[] = [];
let inVllmArgs = false;
for (let i = 2; i < args.length; i++) {
if (inVllmArgs) {
vllmArgs.push(args[i]);
} else if (args[i] === "--name" && i + 1 < args.length) {
name = args[i + 1];
i++;
} else if (args[i] === "--memory" && i + 1 < args.length) {
memory = args[i + 1];
i++;
} else if (args[i] === "--context" && i + 1 < args.length) {
context = args[i + 1];
i++;
} else if (args[i] === "--gpus" && i + 1 < args.length) {
gpus = parseInt(args[i + 1]);
if (Number.isNaN(gpus) || gpus < 1) {
console.error(chalk.red("--gpus must be a positive number"));
process.exit(1);
}
i++;
} else if (args[i] === "--vllm") {
inVllmArgs = true;
}
}
if (!name) {
console.error("--name is required");
process.exit(1);
}
// Warn if --vllm is used with other parameters
if (vllmArgs.length > 0 && (memory || context || gpus)) {
console.log(
chalk.yellow("⚠ Warning: --memory, --context, and --gpus are ignored when --vllm is specified"),
);
console.log(chalk.yellow(" Using only custom vLLM arguments"));
console.log("");
}
await startModel(modelId, name, {
pod: podOverride,
memory,
context,
gpus,
vllmArgs: vllmArgs.length > 0 ? vllmArgs : undefined,
});
break;
}
case "stop": {
// pi stop [name] - stop specific model or all models
const name = args[1];
if (!name) {
// Stop all models on the active pod
const { stopAllModels } = await import("./commands/models.js");
await stopAllModels({ pod: podOverride });
} else {
await stopModel(name, { pod: podOverride });
}
break;
}
case "list":
// pi list
await listModels({ pod: podOverride });
break;
case "logs": {
// pi logs <name>
const name = args[1];
if (!name) {
console.error("Usage: pi logs <name>");
process.exit(1);
}
await viewLogs(name, { pod: podOverride });
break;
}
case "agent": {
// pi agent <name> [messages...] [options]
const name = args[1];
if (!name) {
console.error("Usage: pi agent <name> [messages...] [options]");
process.exit(1);
}
const apiKey = process.env.PI_API_KEY;
// Pass all args after the model name
const agentArgs = args.slice(2);
// If no messages provided, it's interactive mode
await promptModel(name, agentArgs, {
pod: podOverride,
apiKey,
}).catch(() => {
// Error already handled in promptModel, just exit cleanly
process.exit(0);
});
break;
}
default:
console.error(`Unknown command: ${command}`);
printHelp();
process.exit(1);
}
}
} catch (error) {
console.error("Error:", error);
process.exit(1);
}


@@ -0,0 +1,703 @@
import chalk from "chalk";
import { spawn } from "child_process";
import { readFileSync } from "fs";
import { dirname, join } from "path";
import { fileURLToPath } from "url";
import { getActivePod, loadConfig, saveConfig } from "../config.js";
import { getModelConfig, getModelName, isKnownModel } from "../model-configs.js";
import { sshExec } from "../ssh.js";
import type { Pod } from "../types.js";
/**
* Get the pod to use (active or override)
*/
const getPod = (podOverride?: string): { name: string; pod: Pod } => {
if (podOverride) {
const config = loadConfig();
const pod = config.pods[podOverride];
if (!pod) {
console.error(chalk.red(`Pod '${podOverride}' not found`));
process.exit(1);
}
return { name: podOverride, pod };
}
const active = getActivePod();
if (!active) {
console.error(chalk.red("No active pod. Use 'pi pods active <name>' to set one."));
process.exit(1);
}
return active;
};
/**
* Find next available port starting from 8001
*/
const getNextPort = (pod: Pod): number => {
const usedPorts = Object.values(pod.models).map((m) => m.port);
let port = 8001;
while (usedPorts.includes(port)) {
port++;
}
return port;
};
/**
* Select GPUs for model deployment (round-robin)
*/
const selectGPUs = (pod: Pod, count: number = 1): number[] => {
if (count === pod.gpus.length) {
// Use all GPUs
return pod.gpus.map((g) => g.id);
}
// Count GPU usage across all models
const gpuUsage = new Map<number, number>();
for (const gpu of pod.gpus) {
gpuUsage.set(gpu.id, 0);
}
for (const model of Object.values(pod.models)) {
for (const gpuId of model.gpu) {
gpuUsage.set(gpuId, (gpuUsage.get(gpuId) || 0) + 1);
}
}
// Sort GPUs by usage (least used first)
const sortedGPUs = Array.from(gpuUsage.entries())
.sort((a, b) => a[1] - b[1])
.map((entry) => entry[0]);
// Return the least used GPUs
return sortedGPUs.slice(0, count);
};
/**
* Start a model
*/
export const startModel = async (
modelId: string,
name: string,
options: {
pod?: string;
vllmArgs?: string[];
memory?: string;
context?: string;
gpus?: number;
},
) => {
const { name: podName, pod } = getPod(options.pod);
// Validation
if (!pod.modelsPath) {
console.error(chalk.red("Pod does not have a models path configured"));
process.exit(1);
}
if (pod.models[name]) {
console.error(chalk.red(`Model '${name}' already exists on pod '${podName}'`));
process.exit(1);
}
const port = getNextPort(pod);
// Determine GPU allocation and vLLM args
let gpus: number[] = [];
let vllmArgs: string[] = [];
let modelConfig = null;
if (options.vllmArgs?.length) {
// Custom args override everything
vllmArgs = options.vllmArgs;
console.log(chalk.gray("Using custom vLLM args, GPU allocation managed by vLLM"));
} else if (isKnownModel(modelId)) {
// Handle --gpus parameter for known models
if (options.gpus) {
// Validate GPU count
if (options.gpus > pod.gpus.length) {
console.error(chalk.red(`Error: Requested ${options.gpus} GPUs but pod only has ${pod.gpus.length}`));
process.exit(1);
}
// Try to find config for requested GPU count
modelConfig = getModelConfig(modelId, pod.gpus, options.gpus);
if (modelConfig) {
gpus = selectGPUs(pod, options.gpus);
vllmArgs = [...(modelConfig.args || [])];
} else {
console.error(
chalk.red(`Model '${getModelName(modelId)}' does not have a configuration for ${options.gpus} GPU(s)`),
);
console.error(chalk.yellow("Available configurations:"));
// Show available configurations
for (let gpuCount = 1; gpuCount <= pod.gpus.length; gpuCount++) {
const config = getModelConfig(modelId, pod.gpus, gpuCount);
if (config) {
console.error(chalk.gray(` - ${gpuCount} GPU(s)`));
}
}
process.exit(1);
}
} else {
// Find best config for this hardware (original behavior)
for (let gpuCount = pod.gpus.length; gpuCount >= 1; gpuCount--) {
modelConfig = getModelConfig(modelId, pod.gpus, gpuCount);
if (modelConfig) {
gpus = selectGPUs(pod, gpuCount);
vllmArgs = [...(modelConfig.args || [])];
break;
}
}
if (!modelConfig) {
console.error(chalk.red(`Model '${getModelName(modelId)}' not compatible with this pod's GPUs`));
process.exit(1);
}
}
} else {
// Unknown model
if (options.gpus) {
console.error(chalk.red("Error: --gpus can only be used with predefined models"));
console.error(chalk.yellow("For custom models, use --vllm with tensor-parallel-size or similar arguments"));
process.exit(1);
}
// Single GPU default
gpus = selectGPUs(pod, 1);
console.log(chalk.gray("Unknown model, defaulting to single GPU"));
}
// Apply memory/context overrides
if (!options.vllmArgs?.length) {
if (options.memory) {
const fraction = parseFloat(options.memory.replace("%", "")) / 100;
vllmArgs = vllmArgs.filter((arg) => !arg.includes("gpu-memory-utilization"));
vllmArgs.push("--gpu-memory-utilization", String(fraction));
}
if (options.context) {
const contextSizes: Record<string, number> = {
"4k": 4096,
"8k": 8192,
"16k": 16384,
"32k": 32768,
"64k": 65536,
"128k": 131072,
};
const maxTokens = contextSizes[options.context.toLowerCase()] || parseInt(options.context);
vllmArgs = vllmArgs.filter((arg) => !arg.includes("max-model-len"));
vllmArgs.push("--max-model-len", String(maxTokens));
}
}
// Show what we're doing
console.log(chalk.green(`Starting model '${name}' on pod '${podName}'...`));
console.log(`Model: ${modelId}`);
console.log(`Port: ${port}`);
console.log(`GPU(s): ${gpus.length ? gpus.join(", ") : "Managed by vLLM"}`);
if (modelConfig?.notes) console.log(chalk.yellow(`Note: ${modelConfig.notes}`));
console.log("");
// Read and customize model_run.sh script with our values
const scriptPath = join(dirname(fileURLToPath(import.meta.url)), "../../scripts/model_run.sh");
let scriptContent = readFileSync(scriptPath, "utf-8");
// Replace placeholders - no escaping needed, heredoc with 'EOF' is literal
scriptContent = scriptContent
.replace("{{MODEL_ID}}", modelId)
.replace("{{NAME}}", name)
.replace("{{PORT}}", String(port))
.replace("{{VLLM_ARGS}}", vllmArgs.join(" "));
// Upload customized script
const result = await sshExec(
pod.ssh,
`cat > /tmp/model_run_${name}.sh << 'EOF'
${scriptContent}
EOF
chmod +x /tmp/model_run_${name}.sh`,
);
// Prepare environment
const env = [
`HF_TOKEN='${process.env.HF_TOKEN}'`,
`PI_API_KEY='${process.env.PI_API_KEY}'`,
`HF_HUB_ENABLE_HF_TRANSFER=1`,
`VLLM_NO_USAGE_STATS=1`,
`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`,
`FORCE_COLOR=1`,
`TERM=xterm-256color`,
...(gpus.length === 1 ? [`CUDA_VISIBLE_DEVICES=${gpus[0]}`] : []),
...Object.entries(modelConfig?.env || {}).map(([k, v]) => `${k}='${v}'`),
]
.map((e) => `export ${e}`)
.join("\n");
// Start the model runner with script command for pseudo-TTY (preserves colors)
// Note: We use script to preserve colors and create a log file
// setsid creates a new session so it survives SSH disconnection
const startCmd = `
${env}
mkdir -p ~/.vllm_logs
# Create a wrapper that monitors the script command
cat > /tmp/model_wrapper_${name}.sh << 'WRAPPER'
#!/bin/bash
script -q -f -c "/tmp/model_run_${name}.sh" ~/.vllm_logs/${name}.log
exit_code=$?
echo "Script exited with code $exit_code" >> ~/.vllm_logs/${name}.log
exit $exit_code
WRAPPER
chmod +x /tmp/model_wrapper_${name}.sh
setsid /tmp/model_wrapper_${name}.sh </dev/null >/dev/null 2>&1 &
echo $!
exit 0
`;
const pidResult = await sshExec(pod.ssh, startCmd);
const pid = parseInt(pidResult.stdout.trim());
if (!pid) {
console.error(chalk.red("Failed to start model runner"));
process.exit(1);
}
// Save to config
const config = loadConfig();
config.pods[podName].models[name] = { model: modelId, port, gpu: gpus, pid };
saveConfig(config);
console.log(`Model runner started with PID: ${pid}`);
console.log("Streaming logs... (waiting for startup)\n");
// Small delay to ensure log file is created
await new Promise((resolve) => setTimeout(resolve, 500));
// Stream logs with color support, watching for startup complete
const sshParts = pod.ssh.split(" ");
const sshCommand = sshParts[0]; // "ssh"
const sshArgs = sshParts.slice(1); // ["root@86.38.238.55"]
const host = sshArgs.find((p) => p.includes("@"))?.split("@")[1] || "localhost";
const tailCmd = `tail -f ~/.vllm_logs/${name}.log`;
// Build the full args array for spawn
const fullArgs = [...sshArgs, tailCmd];
const logProcess = spawn(sshCommand, fullArgs, {
stdio: ["inherit", "pipe", "pipe"], // capture stdout and stderr
env: { ...process.env, FORCE_COLOR: "1" },
});
let interrupted = false;
let startupComplete = false;
// Handle Ctrl+C
const sigintHandler = () => {
interrupted = true;
logProcess.kill();
};
process.on("SIGINT", sigintHandler);
// Process log output line by line
const processOutput = (data: Buffer) => {
const lines = data.toString().split("\n");
for (const line of lines) {
if (line) {
console.log(line); // Echo the line to console
// Check for startup complete message
if (line.includes("Application startup complete")) {
startupComplete = true;
logProcess.kill(); // Stop tailing logs
}
}
}
};
logProcess.stdout?.on("data", processOutput);
logProcess.stderr?.on("data", processOutput);
await new Promise<void>((resolve) => logProcess.on("exit", resolve));
process.removeListener("SIGINT", sigintHandler);
if (startupComplete) {
// Model started successfully - output connection details
console.log("\n" + chalk.green("✓ Model started successfully!"));
console.log("\n" + chalk.bold("Connection Details:"));
console.log(chalk.cyan("─".repeat(50)));
console.log(chalk.white("Base URL: ") + chalk.yellow(`http://${host}:${port}/v1`));
console.log(chalk.white("Model: ") + chalk.yellow(modelId));
console.log(chalk.white("API Key: ") + chalk.yellow(process.env.PI_API_KEY || "(not set)"));
console.log(chalk.cyan("─".repeat(50)));
console.log("\n" + chalk.bold("Export for shell:"));
console.log(chalk.gray(`export OPENAI_BASE_URL="http://${host}:${port}/v1"`));
console.log(chalk.gray(`export OPENAI_API_KEY="${process.env.PI_API_KEY || "your-api-key"}"`));
console.log(chalk.gray(`export OPENAI_MODEL="${modelId}"`));
console.log("\n" + chalk.bold("Example usage:"));
console.log(
chalk.gray(`
# Python
from openai import OpenAI
client = OpenAI() # Uses env vars
response = client.chat.completions.create(
model="${modelId}",
messages=[{"role": "user", "content": "Hello!"}]
)
# CLI
curl $OPENAI_BASE_URL/chat/completions \\
-H "Authorization: Bearer $OPENAI_API_KEY" \\
-H "Content-Type: application/json" \\
-d '{"model":"${modelId}","messages":[{"role":"user","content":"Hi"}]}'`),
);
console.log("");
console.log(chalk.cyan(`Chat with model: pi agent ${name} "Your message"`));
console.log(chalk.cyan(`Interactive mode: pi agent ${name} -i`));
console.log(chalk.cyan(`Monitor logs: pi logs ${name}`));
console.log(chalk.cyan(`Stop model: pi stop ${name}`));
} else if (interrupted) {
console.log(chalk.yellow("\n\nStopped monitoring. Model deployment continues in background."));
console.log(chalk.cyan(`Chat with model: pi agent ${name} "Your message"`));
console.log(chalk.cyan(`Check status: pi logs ${name}`));
console.log(chalk.cyan(`Stop model: pi stop ${name}`));
} else {
console.log(chalk.yellow("\n\nLog stream ended. Model may still be running."));
console.log(chalk.cyan(`Chat with model: pi agent ${name} "Your message"`));
console.log(chalk.cyan(`Check status: pi logs ${name}`));
console.log(chalk.cyan(`Stop model: pi stop ${name}`));
}
};
/**
* Stop a model
*/
export const stopModel = async (name: string, options: { pod?: string }) => {
const { name: podName, pod } = getPod(options.pod);
const model = pod.models[name];
if (!model) {
console.error(chalk.red(`Model '${name}' not found on pod '${podName}'`));
process.exit(1);
}
console.log(chalk.yellow(`Stopping model '${name}' on pod '${podName}'...`));
// Kill the wrapper process and all of its children
const killCmd = `
# Kill the script process and all its children
pkill -TERM -P ${model.pid} 2>/dev/null || true
kill ${model.pid} 2>/dev/null || true
`;
await sshExec(pod.ssh, killCmd);
// Remove from config
const config = loadConfig();
delete config.pods[podName].models[name];
saveConfig(config);
console.log(chalk.green(`✓ Model '${name}' stopped`));
};
/**
* Stop all models on a pod
*/
export const stopAllModels = async (options: { pod?: string }) => {
const { name: podName, pod } = getPod(options.pod);
const modelNames = Object.keys(pod.models);
if (modelNames.length === 0) {
console.log(`No models running on pod '${podName}'`);
return;
}
console.log(chalk.yellow(`Stopping ${modelNames.length} model(s) on pod '${podName}'...`));
// Kill all script processes and their children
const pids = Object.values(pod.models).map((m) => m.pid);
const killCmd = `
for PID in ${pids.join(" ")}; do
pkill -TERM -P $PID 2>/dev/null || true
kill $PID 2>/dev/null || true
done
`;
await sshExec(pod.ssh, killCmd);
// Clear all models from config
const config = loadConfig();
config.pods[podName].models = {};
saveConfig(config);
console.log(chalk.green(`✓ Stopped all models: ${modelNames.join(", ")}`));
};
/**
* List all models
*/
export const listModels = async (options: { pod?: string }) => {
const { name: podName, pod } = getPod(options.pod);
const modelNames = Object.keys(pod.models);
if (modelNames.length === 0) {
console.log(`No models running on pod '${podName}'`);
return;
}
// Get pod SSH host for URL display
const sshParts = pod.ssh.split(" ");
const host = sshParts.find((p) => p.includes("@"))?.split("@")[1] || "unknown";
console.log(`Models on pod '${chalk.bold(podName)}':`);
for (const name of modelNames) {
const model = pod.models[name];
const gpuStr =
model.gpu.length > 1
? `GPUs ${model.gpu.join(",")}`
: model.gpu.length === 1
? `GPU ${model.gpu[0]}`
: "GPU unknown";
console.log(` ${chalk.green(name)} - Port ${model.port} - ${gpuStr} - PID ${model.pid}`);
console.log(` Model: ${chalk.gray(model.model)}`);
console.log(` URL: ${chalk.cyan(`http://${host}:${model.port}/v1`)}`);
}
// Verify that the processes are still running
console.log("");
console.log("Verifying processes...");
let anyDead = false;
for (const name of modelNames) {
const model = pod.models[name];
// Check both the wrapper process and if vLLM is responding
const checkCmd = `
# Check if wrapper process exists
if ps -p ${model.pid} > /dev/null 2>&1; then
# Process exists, now check if vLLM is responding
if curl -s -f http://localhost:${model.port}/health > /dev/null 2>&1; then
echo "running"
else
# Check if it's still starting up
if tail -n 20 ~/.vllm_logs/${name}.log 2>/dev/null | grep -q "ERROR\\|Failed\\|Cuda error\\|died"; then
echo "crashed"
else
echo "starting"
fi
fi
else
echo "dead"
fi
`;
const result = await sshExec(pod.ssh, checkCmd);
const status = result.stdout.trim();
if (status === "dead") {
console.log(chalk.red(` ${name}: Process ${model.pid} is not running`));
anyDead = true;
} else if (status === "crashed") {
console.log(chalk.red(` ${name}: vLLM crashed (check logs with 'pi logs ${name}')`));
anyDead = true;
} else if (status === "starting") {
console.log(chalk.yellow(` ${name}: Still starting up...`));
}
}
if (anyDead) {
console.log("");
console.log(chalk.yellow("Some models are not running. Clean up with:"));
console.log(chalk.cyan(" pi stop <name>"));
} else {
console.log(chalk.green("✓ All processes verified"));
}
};
/**
* View model logs
*/
export const viewLogs = async (name: string, options: { pod?: string }) => {
const { name: podName, pod } = getPod(options.pod);
const model = pod.models[name];
if (!model) {
console.error(chalk.red(`Model '${name}' not found on pod '${podName}'`));
process.exit(1);
}
console.log(chalk.green(`Streaming logs for '${name}' on pod '${podName}'...`));
console.log(chalk.gray("Press Ctrl+C to stop"));
console.log("");
// Stream logs with color preservation
const sshParts = pod.ssh.split(" ");
const sshCommand = sshParts[0]; // "ssh"
const sshArgs = sshParts.slice(1); // ["root@86.38.238.55"]
const tailCmd = `tail -f ~/.vllm_logs/${name}.log`;
const logProcess = spawn(sshCommand, [...sshArgs, tailCmd], {
stdio: "inherit",
env: {
...process.env,
FORCE_COLOR: "1",
},
});
// Wait for process to exit
await new Promise<void>((resolve) => {
logProcess.on("exit", () => resolve());
});
};
/**
* Show known models and their hardware requirements
*/
export const showKnownModels = async () => {
const modelsJson = await import("../models.json", { assert: { type: "json" } });
const models = modelsJson.default.models;
// Get active pod info if available
const activePod = getActivePod();
let podGpuCount = 0;
let podGpuType = "";
if (activePod) {
podGpuCount = activePod.pod.gpus.length;
// Extract GPU type from name (e.g., "NVIDIA H200" -> "H200")
podGpuType = activePod.pod.gpus[0]?.name?.replace("NVIDIA", "")?.trim()?.split(" ")[0] || "";
console.log(chalk.bold(`Known Models for ${activePod.name} (${podGpuCount}x ${podGpuType || "GPU"}):\n`));
} else {
console.log(chalk.bold("Known Models:\n"));
console.log(chalk.yellow("No active pod. Use 'pi pods active <name>' to filter compatible models.\n"));
}
console.log("Usage: pi start <model> --name <name> [options]\n");
// Group models by compatibility and family
const compatible: Record<string, Array<{ id: string; name: string; config: string; notes?: string }>> = {};
const incompatible: Record<string, Array<{ id: string; name: string; minGpu: string; notes?: string }>> = {};
for (const [modelId, info] of Object.entries(models)) {
const modelInfo = info as any;
const family = modelInfo.name.split("-")[0] || "Other";
let isCompatible = false;
let compatibleConfig = "";
let minGpu = "Unknown";
let minNotes: string | undefined;
if (modelInfo.configs && modelInfo.configs.length > 0) {
// Sort configs by GPU count to find minimum
const sortedConfigs = [...modelInfo.configs].sort((a: any, b: any) => (a.gpuCount || 1) - (b.gpuCount || 1));
// Find minimum requirements
const minConfig = sortedConfigs[0];
const minGpuCount = minConfig.gpuCount || 1;
const gpuTypes = minConfig.gpuTypes?.join("/") || "H100/H200";
if (minGpuCount === 1) {
minGpu = `1x ${gpuTypes}`;
} else {
minGpu = `${minGpuCount}x ${gpuTypes}`;
}
minNotes = minConfig.notes || modelInfo.notes;
// Check compatibility with active pod
if (activePod && podGpuCount > 0) {
// Find best matching config for this pod
for (const config of sortedConfigs) {
const configGpuCount = config.gpuCount || 1;
const configGpuTypes = config.gpuTypes || [];
// Check if we have enough GPUs
if (configGpuCount <= podGpuCount) {
// Check if GPU type matches (if specified)
if (
configGpuTypes.length === 0 ||
configGpuTypes.some((type: string) => podGpuType.includes(type) || type.includes(podGpuType))
) {
isCompatible = true;
if (configGpuCount === 1) {
compatibleConfig = `1x ${podGpuType}`;
} else {
compatibleConfig = `${configGpuCount}x ${podGpuType}`;
}
minNotes = config.notes || modelInfo.notes;
break;
}
}
}
}
}
const modelEntry = {
id: modelId,
name: modelInfo.name,
notes: minNotes,
};
if (activePod && isCompatible) {
if (!compatible[family]) {
compatible[family] = [];
}
compatible[family].push({ ...modelEntry, config: compatibleConfig });
} else {
if (!incompatible[family]) {
incompatible[family] = [];
}
incompatible[family].push({ ...modelEntry, minGpu });
}
}
// Display compatible models first
if (activePod && Object.keys(compatible).length > 0) {
console.log(chalk.green.bold("✓ Compatible Models:\n"));
const sortedFamilies = Object.keys(compatible).sort();
for (const family of sortedFamilies) {
console.log(chalk.cyan(`${family} Models:`));
const modelList = compatible[family].sort((a, b) => a.name.localeCompare(b.name));
for (const model of modelList) {
console.log(` ${chalk.green(model.id)}`);
console.log(` Name: ${model.name}`);
console.log(` Config: ${model.config}`);
if (model.notes) {
console.log(chalk.gray(` Note: ${model.notes}`));
}
console.log("");
}
}
}
// Display incompatible models
if (Object.keys(incompatible).length > 0) {
if (activePod && Object.keys(compatible).length > 0) {
console.log(chalk.red.bold("✗ Incompatible Models (need more/different GPUs):\n"));
}
const sortedFamilies = Object.keys(incompatible).sort();
for (const family of sortedFamilies) {
if (!activePod) {
console.log(chalk.cyan(`${family} Models:`));
} else {
console.log(chalk.gray(`${family} Models:`));
}
const modelList = incompatible[family].sort((a, b) => a.name.localeCompare(b.name));
for (const model of modelList) {
const color = activePod ? chalk.gray : chalk.green;
console.log(` ${color(model.id)}`);
console.log(chalk.gray(` Name: ${model.name}`));
console.log(chalk.gray(` Min Hardware: ${model.minGpu}`));
if (model.notes && !activePod) {
console.log(chalk.gray(` Note: ${model.notes}`));
}
console.log("");
}
}
}
console.log(chalk.gray("\nFor unknown models, defaults to single GPU deployment."));
console.log(chalk.gray("Use --vllm to pass custom arguments to vLLM."));
};
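As a standalone sketch, the `--context` shorthand handling at the top of `startModel` could look like the following (the helper name is illustrative, and the size map is truncated in this excerpt, so smaller entries may exist in the original):

```typescript
// Shorthand context sizes map to token counts; anything else falls back to parseInt.
const CONTEXT_SIZES: Record<string, number> = {
  "16k": 16384,
  "32k": 32768,
  "64k": 65536,
  "128k": 131072,
};

const parseContextSize = (context: string): number =>
  CONTEXT_SIZES[context.toLowerCase()] || parseInt(context, 10);
```

In `startModel`, the resolved token count then replaces any existing `--max-model-len` entry in the vLLM argument list.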

@@ -0,0 +1,205 @@
import chalk from "chalk";
import { dirname, join } from "path";
import { fileURLToPath } from "url";
import { addPod, loadConfig, removePod, setActivePod } from "../config.js";
import { scpFile, sshExec, sshExecStream } from "../ssh.js";
import type { GPU, Pod } from "../types.js";
const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);
/**
* List all pods
*/
export const listPods = () => {
const config = loadConfig();
const podNames = Object.keys(config.pods);
if (podNames.length === 0) {
console.log("No pods configured. Use 'pi pods setup' to add a pod.");
return;
}
console.log("Configured pods:");
for (const name of podNames) {
const pod = config.pods[name];
const isActive = config.active === name;
const marker = isActive ? chalk.green("*") : " ";
const gpuCount = pod.gpus?.length || 0;
const gpuInfo = gpuCount > 0 ? `${gpuCount}x ${pod.gpus[0].name}` : "no GPUs detected";
const vllmInfo = pod.vllmVersion ? ` (vLLM: ${pod.vllmVersion})` : "";
console.log(`${marker} ${chalk.bold(name)} - ${gpuInfo}${vllmInfo} - ${pod.ssh}`);
if (pod.modelsPath) {
console.log(` Models: ${pod.modelsPath}`);
}
if (pod.vllmVersion === "gpt-oss") {
console.log(chalk.yellow(` ⚠️ GPT-OSS build - only for GPT-OSS models`));
}
}
};
/**
* Setup a new pod
*/
export const setupPod = async (
name: string,
sshCmd: string,
options: { mount?: string; modelsPath?: string; vllm?: "release" | "nightly" | "gpt-oss" },
) => {
// Validate environment variables
const hfToken = process.env.HF_TOKEN;
const vllmApiKey = process.env.PI_API_KEY;
if (!hfToken) {
console.error(chalk.red("ERROR: HF_TOKEN environment variable is required"));
console.error("Get a token from: https://huggingface.co/settings/tokens");
console.error("Then run: export HF_TOKEN=your_token_here");
process.exit(1);
}
if (!vllmApiKey) {
console.error(chalk.red("ERROR: PI_API_KEY environment variable is required"));
console.error("Set an API key: export PI_API_KEY=your_api_key_here");
process.exit(1);
}
// Determine models path
let modelsPath = options.modelsPath;
if (!modelsPath && options.mount) {
// Extract path from mount command if not explicitly provided
// e.g., "mount -t nfs ... /mnt/sfs" -> "/mnt/sfs"
const parts = options.mount.split(" ");
modelsPath = parts[parts.length - 1];
}
if (!modelsPath) {
console.error(chalk.red("ERROR: --models-path is required (or must be extractable from --mount)"));
process.exit(1);
}
console.log(chalk.green(`Setting up pod '${name}'...`));
console.log(`SSH: ${sshCmd}`);
console.log(`Models path: ${modelsPath}`);
console.log(
`vLLM version: ${options.vllm || "release"} ${options.vllm === "gpt-oss" ? chalk.yellow("(GPT-OSS special build)") : ""}`,
);
if (options.mount) {
console.log(`Mount command: ${options.mount}`);
}
console.log("");
// Test SSH connection
console.log("Testing SSH connection...");
const testResult = await sshExec(sshCmd, "echo 'SSH OK'");
if (testResult.exitCode !== 0) {
console.error(chalk.red("Failed to connect via SSH"));
console.error(testResult.stderr);
process.exit(1);
}
console.log(chalk.green("✓ SSH connection successful"));
// Copy setup script
console.log("Copying setup script...");
const scriptPath = join(__dirname, "../../scripts/pod_setup.sh");
const success = await scpFile(sshCmd, scriptPath, "/tmp/pod_setup.sh");
if (!success) {
console.error(chalk.red("Failed to copy setup script"));
process.exit(1);
}
console.log(chalk.green("✓ Setup script copied"));
// Build setup command
let setupCmd = `bash /tmp/pod_setup.sh --models-path '${modelsPath}' --hf-token '${hfToken}' --vllm-api-key '${vllmApiKey}'`;
if (options.mount) {
setupCmd += ` --mount '${options.mount}'`;
}
// Add vLLM version flag
const vllmVersion = options.vllm || "release";
setupCmd += ` --vllm '${vllmVersion}'`;
// Run setup script
console.log("");
console.log(chalk.yellow("Running setup (this will take 2-5 minutes)..."));
console.log("");
// Use forceTTY to preserve colors from apt, pip, etc.
const exitCode = await sshExecStream(sshCmd, setupCmd, { forceTTY: true });
if (exitCode !== 0) {
console.error(chalk.red("\nSetup failed. Check the output above for errors."));
process.exit(1);
}
// Parse GPU info from setup output
console.log("");
console.log("Detecting GPU configuration...");
const gpuResult = await sshExec(sshCmd, "nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader");
const gpus: GPU[] = [];
if (gpuResult.exitCode === 0 && gpuResult.stdout) {
const lines = gpuResult.stdout.trim().split("\n");
for (const line of lines) {
const [id, name, memory] = line.split(",").map((s) => s.trim());
if (id !== undefined) {
gpus.push({
id: parseInt(id),
name: name || "Unknown",
memory: memory || "Unknown",
});
}
}
}
console.log(chalk.green(`✓ Detected ${gpus.length} GPU(s)`));
for (const gpu of gpus) {
console.log(` GPU ${gpu.id}: ${gpu.name} (${gpu.memory})`);
}
// Save pod configuration
const pod: Pod = {
ssh: sshCmd,
gpus,
models: {},
modelsPath,
vllmVersion: options.vllm || "release",
};
addPod(name, pod);
console.log("");
console.log(chalk.green(`✓ Pod '${name}' setup complete and set as active pod`));
console.log("");
console.log("You can now deploy models with:");
console.log(chalk.cyan(` pi start <model> --name <name>`));
};
/**
* Switch active pod
*/
export const switchActivePod = (name: string) => {
const config = loadConfig();
if (!config.pods[name]) {
console.error(chalk.red(`Pod '${name}' not found`));
console.log("\nAvailable pods:");
for (const podName of Object.keys(config.pods)) {
console.log(` ${podName}`);
}
process.exit(1);
}
setActivePod(name);
console.log(chalk.green(`✓ Switched active pod to '${name}'`));
};
/**
* Remove a pod from config
*/
export const removePodCommand = (name: string) => {
const config = loadConfig();
if (!config.pods[name]) {
console.error(chalk.red(`Pod '${name}' not found`));
process.exit(1);
}
removePod(name);
console.log(chalk.green(`✓ Removed pod '${name}' from configuration`));
console.log(chalk.yellow("Note: This only removes the local configuration. The remote pod is not affected."));
};

@@ -0,0 +1,85 @@
import { main as agentMain } from "@mariozechner/pi-agent";
import chalk from "chalk";
import { getActivePod, loadConfig } from "../config.js";
// ────────────────────────────────────────────────────────────────────────────────
// Types
// ────────────────────────────────────────────────────────────────────────────────
interface PromptOptions {
pod?: string;
apiKey?: string;
}
// ────────────────────────────────────────────────────────────────────────────────
// Main prompt function
// ────────────────────────────────────────────────────────────────────────────────
export async function promptModel(modelName: string, userArgs: string[], opts: PromptOptions = {}) {
// Get pod and model configuration
const activePod = opts.pod ? { name: opts.pod, pod: loadConfig().pods[opts.pod] } : getActivePod();
if (!activePod) {
console.error(chalk.red("No active pod. Use 'pi pods active <name>' to set one."));
process.exit(1);
}
const { name: podName, pod } = activePod;
const modelConfig = pod.models[modelName];
if (!modelConfig) {
console.error(chalk.red(`Model '${modelName}' not found on pod '${podName}'`));
process.exit(1);
}
// Extract host from SSH string
const host =
pod.ssh
.split(" ")
.find((p) => p.includes("@"))
?.split("@")[1] ?? "localhost";
// Build the system prompt for code navigation
const systemPrompt = `You help the user understand and navigate the codebase in the current working directory.
You can read files, list directories, and execute shell commands via the respective tools.
Do not output file contents you read via the read_file tool directly, unless asked to.
Do not output markdown tables as part of your responses.
Keep your responses concise and relevant to the user's request.
File paths you output must include line numbers where possible, e.g. "src/index.ts:10-20" for lines 10 to 20 in src/index.ts.
Current working directory: ${process.cwd()}`;
// Build arguments for agent main function
const args: string[] = [];
// Add base configuration that we control
args.push(
"--base-url",
`http://${host}:${modelConfig.port}/v1`,
"--model",
modelConfig.model,
"--api-key",
opts.apiKey || process.env.PI_API_KEY || "dummy",
"--api",
modelConfig.model.toLowerCase().includes("gpt-oss") ? "responses" : "completions",
"--system-prompt",
systemPrompt,
);
// Pass through all user-provided arguments
// This includes messages, --continue, --json, etc.
args.push(...userArgs);
// Call agent main function directly
try {
await agentMain(args);
} catch (err: any) {
console.error(chalk.red(`Agent error: ${err.message}`));
process.exit(1);
}
}

@@ -0,0 +1,80 @@
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "fs";
import { homedir } from "os";
import { join } from "path";
import type { Config, Pod } from "./types.js";
// Get config directory from env or use default
const getConfigDir = (): string => {
const configDir = process.env.PI_CONFIG_DIR || join(homedir(), ".pi");
if (!existsSync(configDir)) {
mkdirSync(configDir, { recursive: true });
}
return configDir;
};
const getConfigPath = (): string => {
return join(getConfigDir(), "pods.json");
};
export const loadConfig = (): Config => {
const configPath = getConfigPath();
if (!existsSync(configPath)) {
// Return empty config if file doesn't exist
return { pods: {} };
}
try {
const data = readFileSync(configPath, "utf-8");
return JSON.parse(data);
} catch (e) {
console.error(`Error reading config: ${e}`);
return { pods: {} };
}
};
export const saveConfig = (config: Config): void => {
const configPath = getConfigPath();
try {
writeFileSync(configPath, JSON.stringify(config, null, 2));
} catch (e) {
console.error(`Error saving config: ${e}`);
process.exit(1);
}
};
export const getActivePod = (): { name: string; pod: Pod } | null => {
const config = loadConfig();
if (!config.active || !config.pods[config.active]) {
return null;
}
return { name: config.active, pod: config.pods[config.active] };
};
export const addPod = (name: string, pod: Pod): void => {
const config = loadConfig();
config.pods[name] = pod;
// If no active pod, make this one active
if (!config.active) {
config.active = name;
}
saveConfig(config);
};
export const removePod = (name: string): void => {
const config = loadConfig();
delete config.pods[name];
// If this was the active pod, clear active
if (config.active === name) {
config.active = undefined;
}
saveConfig(config);
};
export const setActivePod = (name: string): void => {
const config = loadConfig();
if (!config.pods[name]) {
console.error(`Pod '${name}' not found`);
process.exit(1);
}
config.active = name;
saveConfig(config);
};
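A minimal, self-contained round-trip of this persistence pattern, using a temp directory in place of `~/.pi` (which is what `PI_CONFIG_DIR` enables in pi itself); the `MiniConfig` shape below is a trimmed stand-in for the real `Config` type:

```typescript
import { existsSync, mkdtempSync, readFileSync, writeFileSync } from "fs";
import { tmpdir } from "os";
import { join } from "path";

interface MiniConfig {
  active?: string;
  pods: Record<string, { ssh: string }>;
}

// Use an isolated temp directory so the sketch never touches ~/.pi
const dir = mkdtempSync(join(tmpdir(), "pi-config-"));
const path = join(dir, "pods.json");

const load = (): MiniConfig =>
  existsSync(path) ? JSON.parse(readFileSync(path, "utf-8")) : { pods: {} };

const save = (config: MiniConfig): void =>
  writeFileSync(path, JSON.stringify(config, null, 2));

const config = load(); // empty config on first run
config.pods["dc1"] = { ssh: "ssh root@1.2.3.4" };
config.active ??= "dc1"; // first pod added becomes the active pod
save(config);

const reloaded = load();
```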

@@ -0,0 +1,2 @@
// Main library exports
export * from "./types.js";

@@ -0,0 +1,111 @@
import { readFileSync } from "fs";
import { dirname, join } from "path";
import { fileURLToPath } from "url";
import type { GPU } from "./types.js";
const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);
interface ModelConfig {
gpuCount: number;
gpuTypes?: string[];
args: string[];
env?: Record<string, string>;
notes?: string;
}
interface ModelInfo {
name: string;
configs: ModelConfig[];
notes?: string;
}
interface ModelsData {
models: Record<string, ModelInfo>;
}
// Load models configuration - resolve relative to this file
const modelsJsonPath = join(__dirname, "models.json");
const modelsData: ModelsData = JSON.parse(readFileSync(modelsJsonPath, "utf-8"));
/**
* Get the best configuration for a model based on available GPUs
*/
export const getModelConfig = (
modelId: string,
gpus: GPU[],
requestedGpuCount: number,
): { args: string[]; env?: Record<string, string>; notes?: string } | null => {
const modelInfo = modelsData.models[modelId];
if (!modelInfo) {
// Unknown model, no default config
return null;
}
// Extract GPU type from the first GPU name (e.g., "NVIDIA H200" -> "H200")
const gpuType = gpus[0]?.name?.replace("NVIDIA", "")?.trim()?.split(" ")[0] || "";
// Find best matching config
let bestConfig: ModelConfig | null = null;
for (const config of modelInfo.configs) {
// Check GPU count
if (config.gpuCount !== requestedGpuCount) {
continue;
}
// Check GPU type if specified
if (config.gpuTypes && config.gpuTypes.length > 0) {
const typeMatches = config.gpuTypes.some((type) => gpuType.includes(type) || type.includes(gpuType));
if (!typeMatches) {
continue;
}
}
// This config matches
bestConfig = config;
break;
}
// If no exact match, try to find a config with just the right GPU count
if (!bestConfig) {
for (const config of modelInfo.configs) {
if (config.gpuCount === requestedGpuCount) {
bestConfig = config;
break;
}
}
}
if (!bestConfig) {
// No suitable config found
return null;
}
return {
args: [...bestConfig.args],
env: bestConfig.env ? { ...bestConfig.env } : undefined,
notes: bestConfig.notes || modelInfo.notes,
};
};
/**
* Check if a model is known
*/
export const isKnownModel = (modelId: string): boolean => {
return modelId in modelsData.models;
};
/**
* Get all known models
*/
export const getKnownModels = (): string[] => {
return Object.keys(modelsData.models);
};
/**
* Get model display name
*/
export const getModelName = (modelId: string): string => {
return modelsData.models[modelId]?.name || modelId;
};
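The two-pass selection in `getModelConfig` can be sketched in isolation (the types and sample configs below are illustrative, not the real `models.json` entries):

```typescript
interface SketchConfig {
  gpuCount: number;
  gpuTypes?: string[];
  args: string[];
}

const pickConfig = (
  configs: SketchConfig[],
  gpuType: string,
  requestedGpuCount: number,
): SketchConfig | null => {
  // Pass 1: match GPU count plus GPU type (substring match in either direction)
  for (const config of configs) {
    if (config.gpuCount !== requestedGpuCount) continue;
    if (
      !config.gpuTypes?.length ||
      config.gpuTypes.some((t) => gpuType.includes(t) || t.includes(gpuType))
    ) {
      return config;
    }
  }
  // Pass 2: fall back to GPU count alone, ignoring type
  return configs.find((c) => c.gpuCount === requestedGpuCount) ?? null;
};

const configs: SketchConfig[] = [
  { gpuCount: 1, gpuTypes: ["H100", "H200"], args: ["--tool-call-parser", "hermes"] },
  { gpuCount: 2, gpuTypes: ["H100", "H200"], args: ["--tensor-parallel-size", "2"] },
];
const picked = pickConfig(configs, "H200", 1);
```

The fallback pass means an unrecognized GPU type (say, an A100 pod) still gets a config with the right GPU count rather than no config at all.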

@@ -0,0 +1,305 @@
{
"models": {
"Qwen/Qwen2.5-Coder-32B-Instruct": {
"name": "Qwen2.5-Coder-32B",
"configs": [
{
"gpuCount": 1,
"gpuTypes": ["H100", "H200"],
"args": ["--tool-call-parser", "hermes", "--enable-auto-tool-choice"]
},
{
"gpuCount": 2,
"gpuTypes": ["H100", "H200"],
"args": ["--tensor-parallel-size", "2", "--tool-call-parser", "hermes", "--enable-auto-tool-choice"]
}
]
},
"Qwen/Qwen3-Coder-30B-A3B-Instruct": {
"name": "Qwen3-Coder-30B",
"configs": [
{
"gpuCount": 1,
"gpuTypes": ["H100", "H200"],
"args": ["--enable-auto-tool-choice", "--tool-call-parser", "qwen3_coder"],
"notes": "Fits comfortably on single GPU. ~60GB model weight."
},
{
"gpuCount": 2,
"gpuTypes": ["H100", "H200"],
"args": [
"--tensor-parallel-size",
"2",
"--enable-auto-tool-choice",
"--tool-call-parser",
"qwen3_coder"
],
"notes": "For higher throughput/longer context."
}
]
},
"Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8": {
"name": "Qwen3-Coder-30B-FP8",
"configs": [
{
"gpuCount": 1,
"gpuTypes": ["H100", "H200"],
"args": ["--enable-auto-tool-choice", "--tool-call-parser", "qwen3_coder"],
"env": {
"VLLM_USE_DEEP_GEMM": "1"
},
"notes": "FP8 quantized, ~30GB model weight. Excellent for single GPU deployment."
}
]
},
"Qwen/Qwen3-Coder-480B-A35B-Instruct": {
"name": "Qwen3-Coder-480B",
"configs": [
{
"gpuCount": 8,
"gpuTypes": ["H200", "H20"],
"args": [
"--tensor-parallel-size",
"8",
"--max-model-len",
"32000",
"--enable-auto-tool-choice",
"--tool-call-parser",
"qwen3_coder"
],
"notes": "Cannot serve full 262K context on single node. Reduce max-model-len or increase gpu-memory-utilization."
}
]
},
"Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8": {
"name": "Qwen3-Coder-480B-FP8",
"configs": [
{
"gpuCount": 8,
"gpuTypes": ["H200", "H20"],
"args": [
"--max-model-len",
"131072",
"--enable-expert-parallel",
"--data-parallel-size",
"8",
"--enable-auto-tool-choice",
"--tool-call-parser",
"qwen3_coder"
],
"env": {
"VLLM_USE_DEEP_GEMM": "1"
},
"notes": "Use data-parallel mode (not tensor-parallel) to avoid weight quantization errors."
}
]
},
"openai/gpt-oss-20b": {
"name": "GPT-OSS-20B",
"configs": [
{
"gpuCount": 1,
"gpuTypes": ["H100", "H200"],
"args": ["--async-scheduling"]
},
{
"gpuCount": 1,
"gpuTypes": ["B200"],
"args": ["--async-scheduling"],
"env": {
"VLLM_USE_TRTLLM_ATTENTION": "1",
"VLLM_USE_TRTLLM_DECODE_ATTENTION": "1",
"VLLM_USE_TRTLLM_CONTEXT_ATTENTION": "1",
"VLLM_USE_FLASHINFER_MXFP4_MOE": "1"
}
}
],
"notes": "Requires vLLM 0.10.1+gptoss. Tools/function calls only via /v1/responses endpoint."
},
"openai/gpt-oss-120b": {
"name": "GPT-OSS-120B",
"configs": [
{
"gpuCount": 1,
"gpuTypes": ["H100", "H200"],
"args": ["--async-scheduling", "--gpu-memory-utilization", "0.95", "--max-num-batched-tokens", "1024"],
"notes": "Single GPU deployment. Requires vLLM 0.10.1+gptoss. Tools/function calls only via /v1/responses endpoint."
},
{
"gpuCount": 2,
"gpuTypes": ["H100", "H200"],
"args": ["--tensor-parallel-size", "2", "--async-scheduling", "--gpu-memory-utilization", "0.94"],
"notes": "Recommended for H100/H200. Requires vLLM 0.10.1+gptoss. Tools/function calls only via /v1/responses endpoint."
},
{
"gpuCount": 4,
"gpuTypes": ["H100", "H200"],
"args": ["--tensor-parallel-size", "4", "--async-scheduling"],
"notes": "Higher throughput. Requires vLLM 0.10.1+gptoss. Tools/function calls only via /v1/responses endpoint."
},
{
"gpuCount": 8,
"gpuTypes": ["H100", "H200"],
"args": ["--tensor-parallel-size", "8", "--async-scheduling"],
"notes": "Maximum throughput for evaluation workloads. Requires vLLM 0.10.1+gptoss. Tools/function calls only via /v1/responses endpoint."
}
]
},
"zai-org/GLM-4.5": {
"name": "GLM-4.5",
"configs": [
{
"gpuCount": 16,
"gpuTypes": ["H100"],
"args": [
"--tensor-parallel-size",
"16",
"--tool-call-parser",
"glm4_moe",
"--reasoning-parser",
"glm4_moe",
"--enable-auto-tool-choice"
]
},
{
"gpuCount": 8,
"gpuTypes": ["H200"],
"args": [
"--tensor-parallel-size",
"8",
"--tool-call-parser",
"glm4_moe",
"--reasoning-parser",
"glm4_moe",
"--enable-auto-tool-choice"
]
}
],
"notes": "Models default to thinking mode. For full 128K context, double the GPU count."
},
"zai-org/GLM-4.5-FP8": {
"name": "GLM-4.5-FP8",
"configs": [
{
"gpuCount": 8,
"gpuTypes": ["H100"],
"args": [
"--tensor-parallel-size",
"8",
"--tool-call-parser",
"glm4_moe",
"--reasoning-parser",
"glm4_moe",
"--enable-auto-tool-choice"
]
},
{
"gpuCount": 4,
"gpuTypes": ["H200"],
"args": [
"--tensor-parallel-size",
"4",
"--tool-call-parser",
"glm4_moe",
"--reasoning-parser",
"glm4_moe",
"--enable-auto-tool-choice"
]
}
]
},
"zai-org/GLM-4.5-Air-FP8": {
"name": "GLM-4.5-Air-FP8",
"configs": [
{
"gpuCount": 2,
"gpuTypes": ["H100"],
"args": [
"--tensor-parallel-size",
"2",
"--tool-call-parser",
"glm4_moe",
"--reasoning-parser",
"glm4_moe",
"--enable-auto-tool-choice",
"--quantization",
"fp8"
],
"env": {
"VLLM_ATTENTION_BACKEND": "XFORMERS"
},
"notes": "FP8 model requires vLLM with proper FP8 support or MTP module"
},
{
"gpuCount": 1,
"gpuTypes": ["H200"],
"args": [
"--tool-call-parser",
"glm4_moe",
"--reasoning-parser",
"glm4_moe",
"--enable-auto-tool-choice",
"--quantization",
"fp8"
],
"env": {
"VLLM_ATTENTION_BACKEND": "XFORMERS"
},
"notes": "FP8 model requires vLLM with proper FP8 support or MTP module"
}
]
},
"zai-org/GLM-4.5-Air": {
"name": "GLM-4.5-Air",
"configs": [
{
"gpuCount": 2,
"gpuTypes": ["H100", "H200"],
"args": [
"--tensor-parallel-size",
"2",
"--tool-call-parser",
"glm4_moe",
"--reasoning-parser",
"glm4_moe",
"--enable-auto-tool-choice"
],
"notes": "Non-quantized BF16 version, more compatible"
},
{
"gpuCount": 1,
"gpuTypes": ["H200"],
"args": [
"--tool-call-parser",
"glm4_moe",
"--reasoning-parser",
"glm4_moe",
"--enable-auto-tool-choice",
"--gpu-memory-utilization",
"0.95"
],
"notes": "Single H200 can fit the BF16 model with high memory utilization"
}
]
},
"moonshotai/Kimi-K2-Instruct": {
"name": "Kimi-K2",
"configs": [
{
"gpuCount": 16,
"gpuTypes": ["H200", "H20"],
"args": [
"--tensor-parallel-size",
"16",
"--trust-remote-code",
"--enable-auto-tool-choice",
"--tool-call-parser",
"kimi_k2"
],
"notes": "Pure TP mode. For >16 GPUs, combine with pipeline-parallelism."
}
],
"notes": "Requires vLLM v0.10.0rc1+. Minimum 16 GPUs for FP8 with 128k context."
}
}
}
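The registry above maps each model ID to per-GPU-type launch configurations. As a hypothetical sketch (not pi's actual implementation), selection could prefer the smallest configuration that the pod's free GPUs can satisfy; the registry excerpt below mirrors the JSON shape above with made-up trimmed entries:

```typescript
// Hypothetical sketch of config selection against the registry shape above.
// The excerpted registry and the "smallest fitting GPU count wins" policy
// are illustrative assumptions, not pi's actual logic.
interface ModelConfig {
	gpuCount: number;
	gpuTypes: string[];
	args: string[];
	env?: Record<string, string>;
	notes?: string;
}

const registry: Record<string, { name: string; configs: ModelConfig[] }> = {
	"zai-org/GLM-4.5-Air": {
		name: "GLM-4.5-Air",
		configs: [
			{
				gpuCount: 2,
				gpuTypes: ["H100", "H200"],
				args: ["--tensor-parallel-size", "2", "--tool-call-parser", "glm4_moe"],
			},
			{
				gpuCount: 1,
				gpuTypes: ["H200"],
				args: ["--tool-call-parser", "glm4_moe", "--gpu-memory-utilization", "0.95"],
			},
		],
	},
};

// Pick the config with the smallest GPU count that the pod can satisfy.
function selectConfig(model: string, gpuType: string, freeGpus: number): ModelConfig | undefined {
	const entry = registry[model];
	if (!entry) return undefined;
	return entry.configs
		.filter((c) => c.gpuTypes.includes(gpuType) && c.gpuCount <= freeGpus)
		.sort((a, b) => a.gpuCount - b.gpuCount)[0];
}

console.log(selectConfig("zai-org/GLM-4.5-Air", "H200", 4)?.gpuCount); // 1
console.log(selectConfig("zai-org/GLM-4.5-Air", "H100", 1)); // undefined (2-GPU config doesn't fit)
```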

packages/pods/src/ssh.ts (new file, 151 lines)
import { type SpawnOptions, spawn } from "child_process";

export interface SSHResult {
	stdout: string;
	stderr: string;
	exitCode: number;
}

/**
 * Execute an SSH command and return the result
 */
export const sshExec = async (
	sshCmd: string,
	command: string,
	options?: { keepAlive?: boolean },
): Promise<SSHResult> => {
	return new Promise((resolve) => {
		// Parse SSH command (e.g., "ssh root@1.2.3.4" or "ssh -p 22 root@1.2.3.4").
		// Note: a plain whitespace split does not handle quoted arguments.
		const sshParts = sshCmd.split(" ").filter((p) => p);
		const sshBinary = sshParts[0];
		let sshArgs = [...sshParts.slice(1)];

		// Add SSH keepalive options for long-running commands
		if (options?.keepAlive) {
			// ServerAliveInterval=30 sends a keepalive every 30 seconds
			// ServerAliveCountMax=120 allows up to 120 failures (60 minutes total)
			sshArgs = ["-o", "ServerAliveInterval=30", "-o", "ServerAliveCountMax=120", ...sshArgs];
		}

		sshArgs.push(command);

		const proc = spawn(sshBinary, sshArgs, {
			stdio: ["ignore", "pipe", "pipe"],
		});

		let stdout = "";
		let stderr = "";

		proc.stdout.on("data", (data) => {
			stdout += data.toString();
		});

		proc.stderr.on("data", (data) => {
			stderr += data.toString();
		});

		proc.on("close", (code) => {
			resolve({
				stdout,
				stderr,
				// code is null when the process was terminated by a signal
				exitCode: code ?? 0,
			});
		});

		proc.on("error", (err) => {
			resolve({
				stdout,
				stderr: err.message,
				exitCode: 1,
			});
		});
	});
};

/**
 * Execute an SSH command with streaming output to the console
 */
export const sshExecStream = async (
	sshCmd: string,
	command: string,
	options?: { silent?: boolean; forceTTY?: boolean; keepAlive?: boolean },
): Promise<number> => {
	return new Promise((resolve) => {
		const sshParts = sshCmd.split(" ").filter((p) => p);
		const sshBinary = sshParts[0];

		// Build SSH args
		let sshArgs = [...sshParts.slice(1)];

		// Add -t flag if requested and not already present
		if (options?.forceTTY && !sshParts.includes("-t")) {
			sshArgs = ["-t", ...sshArgs];
		}

		// Add SSH keepalive options for long-running commands
		if (options?.keepAlive) {
			// ServerAliveInterval=30 sends a keepalive every 30 seconds
			// ServerAliveCountMax=120 allows up to 120 failures (60 minutes total)
			sshArgs = ["-o", "ServerAliveInterval=30", "-o", "ServerAliveCountMax=120", ...sshArgs];
		}

		sshArgs.push(command);

		const spawnOptions: SpawnOptions = options?.silent
			? { stdio: ["ignore", "ignore", "ignore"] }
			: { stdio: "inherit" };

		const proc = spawn(sshBinary, sshArgs, spawnOptions);

		proc.on("close", (code) => {
			resolve(code ?? 0);
		});

		proc.on("error", () => {
			resolve(1);
		});
	});
};

/**
 * Copy a file to the remote host via SCP
 */
export const scpFile = async (sshCmd: string, localPath: string, remotePath: string): Promise<boolean> => {
	// Extract host and port from the SSH command
	const sshParts = sshCmd.split(" ").filter((p) => p);
	let host = "";
	let port = "22";
	let i = 1; // Skip 'ssh'
	while (i < sshParts.length) {
		if (sshParts[i] === "-p" && i + 1 < sshParts.length) {
			port = sshParts[i + 1];
			i += 2;
		} else if (!sshParts[i].startsWith("-")) {
			host = sshParts[i];
			break;
		} else {
			i++;
		}
	}

	if (!host) {
		console.error("Could not parse host from SSH command");
		return false;
	}

	// Build SCP command (scp uses -P for the port, unlike ssh's -p)
	const scpArgs = ["-P", port, localPath, `${host}:${remotePath}`];

	return new Promise((resolve) => {
		const proc = spawn("scp", scpArgs, { stdio: "inherit" });
		proc.on("close", (code) => {
			resolve(code === 0);
		});
		proc.on("error", () => {
			resolve(false);
		});
	});
};
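The host/port extraction inside `scpFile` can be pulled out as a standalone helper, which makes it easy to exercise on its own. This is a sketch with the same simplifications as the original (only `-p` is treated as a value-taking flag, quoted arguments are not handled); `parseSshTarget` is an illustrative name, not part of the codebase:

```typescript
// Standalone sketch of scpFile's host/port parsing: walk the ssh argv,
// honor "-p <port>", and take the first non-flag token as the host.
// Same simplifications as the original: flags other than -p that take
// values, and quoted arguments, are not handled.
function parseSshTarget(sshCmd: string): { host: string; port: string } | null {
	const parts = sshCmd.split(" ").filter((p) => p);
	let host = "";
	let port = "22";
	let i = 1; // skip the "ssh" binary itself
	while (i < parts.length) {
		if (parts[i] === "-p" && i + 1 < parts.length) {
			port = parts[i + 1];
			i += 2;
		} else if (!parts[i].startsWith("-")) {
			host = parts[i];
			break;
		} else {
			i++;
		}
	}
	return host ? { host, port } : null;
}

console.log(parseSshTarget("ssh -p 2222 root@1.2.3.4")); // { host: "root@1.2.3.4", port: "2222" }
console.log(parseSshTarget("ssh root@1.2.3.4")); // { host: "root@1.2.3.4", port: "22" }
console.log(parseSshTarget("ssh -v")); // null (no host token found)
```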

(new file, 27 lines)
// Core type definitions for pi

export interface GPU {
	id: number;
	name: string;
	memory: string;
}

export interface Model {
	model: string;
	port: number;
	gpu: number[]; // Array of GPU IDs for multi-GPU deployment
	pid: number;
}

export interface Pod {
	ssh: string;
	gpus: GPU[];
	models: Record<string, Model>;
	modelsPath?: string;
	vllmVersion?: "release" | "nightly" | "gpt-oss"; // Track which vLLM version is installed
}

export interface Config {
	pods: Record<string, Pod>;
	active?: string;
}
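A config snapshot matching these interfaces might look like the following sketch; the pod name, IP, memory strings, port, and PID are all made-up values (the interfaces are inlined here so the example stands alone):

```typescript
// Hypothetical example of the Config shape; every concrete value is made up.
interface GPU { id: number; name: string; memory: string; }
interface Model { model: string; port: number; gpu: number[]; pid: number; }
interface Pod {
	ssh: string;
	gpus: GPU[];
	models: Record<string, Model>;
	modelsPath?: string;
	vllmVersion?: "release" | "nightly" | "gpt-oss";
}
interface Config { pods: Record<string, Pod>; active?: string; }

const example: Config = {
	pods: {
		dc1: {
			ssh: "ssh root@1.2.3.4",
			gpus: [
				{ id: 0, name: "NVIDIA H100 80GB HBM3", memory: "81559 MiB" },
				{ id: 1, name: "NVIDIA H100 80GB HBM3", memory: "81559 MiB" },
			],
			models: {
				// One model deployed across both GPUs (tensor parallelism)
				qwen: { model: "Qwen/Qwen2.5-Coder-32B-Instruct", port: 8001, gpu: [0, 1], pid: 12345 },
			},
			modelsPath: "/mnt/hf-models",
			vllmVersion: "release",
		},
	},
	active: "dc1",
};

console.log(example.pods[example.active!].models.qwen.gpu.length); // 2
```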

(new file, 9 lines)
{
	"extends": "../../tsconfig.base.json",
	"compilerOptions": {
		"outDir": "./dist",
		"rootDir": "./src"
	},
	"include": ["src/**/*", "src/**/*.json"],
	"exclude": ["node_modules", "dist"]
}