mirror of
https://github.com/getcompanion-ai/co-mono.git
synced 2026-04-15 23:01:30 +00:00
- Set up npm workspaces for three packages: pi-tui, pi-agent, and pi (pods) - Implemented dual TypeScript configuration: - Root tsconfig.json with path mappings for development and type checking - Package-specific tsconfig.build.json for clean production builds - Configured lockstep versioning with sync script for inter-package dependencies - Added comprehensive documentation for development and publishing workflows - All packages at version 0.5.0 ready for npm publishing
511 lines
No EOL
16 KiB
Markdown
511 lines
No EOL
16 KiB
Markdown
# pi
|
|
|
|
Deploy and manage LLMs on GPU pods with automatic vLLM configuration for agentic workloads.
|
|
|
|
## Installation
|
|
|
|
```bash
|
|
npm install -g @mariozechner/pi
|
|
```
|
|
|
|
## What is pi?
|
|
|
|
`pi` simplifies running large language models on remote GPU pods. It automatically:
|
|
- Sets up vLLM on fresh Ubuntu pods
|
|
- Configures tool calling for agentic models (Qwen, GPT-OSS, GLM, etc.)
|
|
- Manages multiple models on the same pod with "smart" GPU allocation
|
|
- Provides OpenAI-compatible API endpoints for each model
|
|
- Includes an interactive agent with file system tools for testing
|
|
|
|
## Quick Start
|
|
|
|
```bash
|
|
# Set required environment variables
|
|
export HF_TOKEN=your_huggingface_token # Get from https://huggingface.co/settings/tokens
|
|
export PI_API_KEY=your_api_key # Any string you want for API authentication
|
|
|
|
# Setup a DataCrunch pod with NFS storage (models path auto-extracted)
|
|
pi pods setup dc1 "ssh root@1.2.3.4" \
|
|
--mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
|
|
|
|
# Start a model (automatic configuration for known models)
|
|
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen
|
|
|
|
# Send a single message to the model
|
|
pi agent qwen "What is the Fibonacci sequence?"
|
|
|
|
# Interactive chat mode with file system tools
|
|
pi agent qwen -i
|
|
|
|
# Use with any OpenAI-compatible client
|
|
export OPENAI_BASE_URL='http://1.2.3.4:8001/v1'
|
|
export OPENAI_API_KEY=$PI_API_KEY
|
|
```
|
|
|
|
## Prerequisites
|
|
|
|
- Node.js 18+
|
|
- HuggingFace token (for model downloads)
|
|
- GPU pod with:
|
|
- Ubuntu 22.04 or 24.04
|
|
- SSH root access
|
|
- NVIDIA drivers installed
|
|
- Persistent storage for models
|
|
|
|
## Supported Providers
|
|
|
|
### Primary Support
|
|
|
|
**DataCrunch** - Best for shared model storage
|
|
- NFS volumes sharable across multiple pods in same region
|
|
- Models download once, use everywhere
|
|
- Ideal for teams or multiple experiments
|
|
|
|
**RunPod** - Good persistent storage
|
|
- Network volumes persist independently
|
|
- Cannot share between running pods simultaneously
|
|
- Good for single-pod workflows
|
|
|
|
### Also Works With
|
|
- Vast.ai (volumes locked to specific machine)
|
|
- Prime Intellect (no persistent storage)
|
|
- AWS EC2 (with EFS setup)
|
|
- Any Ubuntu machine with NVIDIA GPUs, CUDA driver, and SSH
|
|
|
|
## Commands
|
|
|
|
### Pod Management
|
|
|
|
```bash
|
|
pi pods setup <name> "<ssh>" [options] # Setup new pod
|
|
--mount "<mount_command>" # Run mount command during setup
|
|
--models-path <path> # Override extracted path (optional)
|
|
--vllm release|nightly|gpt-oss # vLLM version (default: release)
|
|
|
|
pi pods # List all configured pods
|
|
pi pods active <name> # Switch active pod
|
|
pi pods remove <name> # Remove pod from local config
|
|
pi shell [<name>] # SSH into pod
|
|
pi ssh [<name>] "<command>" # Run command on pod
|
|
```
|
|
|
|
**Note**: When using `--mount`, the models path is automatically extracted from the mount command's target directory. You only need `--models-path` if not using `--mount` or to override the extracted path.
|
|
|
|
#### vLLM Version Options
|
|
|
|
- `release` (default): Stable vLLM release, recommended for most users
|
|
- `nightly`: Latest vLLM features, needed for newest models like GLM-4.5
|
|
- `gpt-oss`: Special build for OpenAI's GPT-OSS models only
|
|
|
|
### Model Management
|
|
|
|
```bash
|
|
pi start <model> --name <name> [options] # Start a model
|
|
--memory <percent> # GPU memory: 30%, 50%, 90% (default: 90%)
|
|
--context <size> # Context window: 4k, 8k, 16k, 32k, 64k, 128k
|
|
--gpus <count> # Number of GPUs to use (predefined models only)
|
|
--pod <name> # Target specific pod (overrides active)
|
|
--vllm <args...> # Pass custom args directly to vLLM
|
|
|
|
pi stop [<name>] # Stop model (or all if no name given)
|
|
pi list # List running models with status
|
|
pi logs <name> # Stream model logs (tail -f)
|
|
```
|
|
|
|
### Agent & Chat Interface
|
|
|
|
```bash
|
|
pi agent <name> "<message>" # Single message to model
|
|
pi agent <name> "<msg1>" "<msg2>" # Multiple messages in sequence
|
|
pi agent <name> -i # Interactive chat mode
|
|
pi agent <name> -i -c # Continue previous session
|
|
|
|
# Standalone OpenAI-compatible agent (works with any API)
|
|
pi-agent --base-url http://localhost:8000/v1 --model llama-3.1 "Hello"
|
|
pi-agent --api-key sk-... "What is 2+2?" # Uses OpenAI by default
|
|
pi-agent --json "What is 2+2?" # Output event stream as JSONL
|
|
pi-agent -i # Interactive mode
|
|
```
|
|
|
|
The agent includes tools for file operations (read, list, bash, glob, rg) to test agentic capabilities, particularly useful for code navigation and analysis tasks.
|
|
|
|
## Predefined Model Configurations
|
|
|
|
`pi` includes predefined configurations for popular agentic models, so you do not have to specify `--vllm` arguments manually. `pi` will also check if the model you selected can actually run on your pod with respect to the number of GPUs and available VRAM. Run `pi start` without additional arguments to see a list of predefined models that can run on the active pod.
|
|
|
|
### Qwen Models
|
|
```bash
|
|
# Qwen2.5-Coder-32B - Excellent coding model, fits on single H100/H200
|
|
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen
|
|
|
|
# Qwen3-Coder-30B - Advanced reasoning with tool use
|
|
pi start Qwen/Qwen3-Coder-30B-A3B-Instruct --name qwen3
|
|
|
|
# Qwen3-Coder-480B - State-of-the-art on 8xH200 (data-parallel mode)
|
|
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-480b
|
|
```
|
|
|
|
### GPT-OSS Models
|
|
```bash
|
|
# Requires special vLLM build during setup
|
|
pi pods setup gpt-pod "ssh root@1.2.3.4" --models-path /workspace --vllm gpt-oss
|
|
|
|
# GPT-OSS-20B - Fits on 16GB+ VRAM
|
|
pi start openai/gpt-oss-20b --name gpt20
|
|
|
|
# GPT-OSS-120B - Needs 60GB+ VRAM
|
|
pi start openai/gpt-oss-120b --name gpt120
|
|
```
|
|
|
|
### GLM Models
|
|
```bash
|
|
# GLM-4.5 - Requires 8-16 GPUs, includes thinking mode
|
|
pi start zai-org/GLM-4.5 --name glm
|
|
|
|
# GLM-4.5-Air - Smaller version, 1-2 GPUs
|
|
pi start zai-org/GLM-4.5-Air --name glm-air
|
|
```
|
|
|
|
### Custom Models with --vllm
|
|
|
|
For models not in the predefined list, use `--vllm` to pass arguments directly to vLLM:
|
|
|
|
```bash
|
|
# DeepSeek with custom settings
|
|
pi start deepseek-ai/DeepSeek-V3 --name deepseek --vllm \
|
|
--tensor-parallel-size 4 --trust-remote-code
|
|
|
|
# Mistral with pipeline parallelism
|
|
pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm \
|
|
--tensor-parallel-size 8 --pipeline-parallel-size 2
|
|
|
|
# Any model with specific tool parser
|
|
pi start some/model --name mymodel --vllm \
|
|
--tool-call-parser hermes --enable-auto-tool-choice
|
|
```
|
|
|
|
## DataCrunch Setup
|
|
|
|
DataCrunch offers the best experience with shared NFS storage across pods:
|
|
|
|
### 1. Create Shared Filesystem (SFS)
|
|
- Go to DataCrunch dashboard → Storage → Create SFS
|
|
- Choose size and datacenter
|
|
- Note the mount command (e.g., `sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/hf-models-fin02-8ac1bab7 /mnt/hf-models-fin02`)
|
|
|
|
### 2. Create GPU Instance
|
|
- Create instance in same datacenter as SFS
|
|
- Share the SFS with the instance
|
|
- Get SSH command from dashboard
|
|
|
|
### 3. Setup with pi
|
|
```bash
|
|
# Get mount command from DataCrunch dashboard
|
|
pi pods setup dc1 "ssh root@instance.datacrunch.io" \
|
|
--mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
|
|
|
|
# Models automatically stored in /mnt/hf-models (extracted from mount command)
|
|
```
|
|
|
|
### 4. Benefits
|
|
- Models persist across instance restarts
|
|
- Share models between multiple instances in same datacenter
|
|
- Download once, use everywhere
|
|
- Pay only for storage, not compute time during downloads
|
|
|
|
## RunPod Setup
|
|
|
|
RunPod offers good persistent storage with network volumes:
|
|
|
|
### 1. Create Network Volume (optional)
|
|
- Go to RunPod dashboard → Storage → Create Network Volume
|
|
- Choose size and region
|
|
|
|
### 2. Create GPU Pod
|
|
- Select "Network Volume" during pod creation (if using)
|
|
- Attach your volume to `/runpod-volume`
|
|
- Get SSH command from pod details
|
|
|
|
### 3. Setup with pi
|
|
```bash
|
|
# With network volume
|
|
pi pods setup runpod "ssh root@pod.runpod.io" --models-path /runpod-volume
|
|
|
|
# Or use workspace (persists with pod but not shareable)
|
|
pi pods setup runpod "ssh root@pod.runpod.io" --models-path /workspace
|
|
```
|
|
|
|
|
|
## Multi-GPU Support
|
|
|
|
### Automatic GPU Assignment
|
|
When running multiple models, pi automatically assigns them to different GPUs:
|
|
```bash
|
|
pi start model1 --name m1 # Auto-assigns to GPU 0
|
|
pi start model2 --name m2 # Auto-assigns to GPU 1
|
|
pi start model3 --name m3 # Auto-assigns to GPU 2
|
|
```
|
|
|
|
### Specify GPU Count for Predefined Models
|
|
For predefined models with multiple configurations, use `--gpus` to control GPU usage:
|
|
```bash
|
|
# Run Qwen on 1 GPU instead of all available
|
|
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen --gpus 1
|
|
|
|
# Run GLM-4.5 on 8 GPUs (if it has an 8-GPU config)
|
|
pi start zai-org/GLM-4.5 --name glm --gpus 8
|
|
```
|
|
|
|
If the model doesn't have a configuration for the requested GPU count, you'll see available options.
|
|
|
|
### Tensor Parallelism for Large Models
|
|
For models that don't fit on a single GPU:
|
|
```bash
|
|
# Use all available GPUs
|
|
pi start meta-llama/Llama-3.1-70B-Instruct --name llama70b --vllm \
|
|
--tensor-parallel-size 4
|
|
|
|
# Specific GPU count
|
|
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen480 --vllm \
|
|
--data-parallel-size 8 --enable-expert-parallel
|
|
```
|
|
|
|
## API Integration
|
|
|
|
All models expose OpenAI-compatible endpoints:
|
|
|
|
```python
|
|
from openai import OpenAI
|
|
|
|
client = OpenAI(
|
|
base_url="http://your-pod-ip:8001/v1",
|
|
api_key="your-pi-api-key"
|
|
)
|
|
|
|
# Chat completion with tool calling
|
|
response = client.chat.completions.create(
|
|
model="Qwen/Qwen2.5-Coder-32B-Instruct",
|
|
messages=[
|
|
{"role": "user", "content": "Write a Python function to calculate fibonacci"}
|
|
],
|
|
tools=[{
|
|
"type": "function",
|
|
"function": {
|
|
"name": "execute_code",
|
|
"description": "Execute Python code",
|
|
"parameters": {
|
|
"type": "object",
|
|
"properties": {
|
|
"code": {"type": "string"}
|
|
},
|
|
"required": ["code"]
|
|
}
|
|
}
|
|
}],
|
|
tool_choice="auto"
|
|
)
|
|
```
|
|
|
|
## Standalone Agent CLI
|
|
|
|
`pi` includes a standalone OpenAI-compatible agent that can work with any API:
|
|
|
|
```bash
|
|
# Install globally to get pi-agent command
|
|
npm install -g @mariozechner/pi
|
|
|
|
# Use with OpenAI
|
|
pi-agent --api-key sk-... "What is machine learning?"
|
|
|
|
# Use with local vLLM
|
|
pi-agent --base-url http://localhost:8000/v1 \
|
|
--model meta-llama/Llama-3.1-8B-Instruct \
|
|
--api-key dummy \
|
|
"Explain quantum computing"
|
|
|
|
# Interactive mode
|
|
pi-agent -i
|
|
|
|
# Continue previous session
|
|
pi-agent --continue "Follow up question"
|
|
|
|
# Custom system prompt
|
|
pi-agent --system-prompt "You are a Python expert" "Write a web scraper"
|
|
|
|
# Use responses API (for GPT-OSS models)
|
|
pi-agent --api responses --model openai/gpt-oss-20b "Hello"
|
|
```
|
|
|
|
The agent supports:
|
|
- Session persistence across conversations
|
|
- Interactive TUI mode with syntax highlighting
|
|
- File system tools (read, list, bash, glob, rg) for code navigation
|
|
- Both Chat Completions and Responses API formats
|
|
- Custom system prompts
|
|
|
|
## Tool Calling Support
|
|
|
|
`pi` automatically configures appropriate tool calling parsers for known models:
|
|
|
|
- **Qwen models**: `hermes` parser (Qwen3-Coder uses `qwen3_coder`)
|
|
- **GLM models**: `glm4_moe` parser with reasoning support
|
|
- **GPT-OSS models**: Uses `/v1/responses` endpoint, as tool calling (function calling in OpenAI parlance) is currently a [WIP with the `v1/chat/completions` endpoint](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#tool-use).
|
|
- **Custom models**: Specify with `--vllm --tool-call-parser <parser> --enable-auto-tool-choice`
|
|
|
|
To disable tool calling:
|
|
```bash
|
|
pi start model --name mymodel --vllm --disable-tool-call-parser
|
|
```
|
|
|
|
## Memory and Context Management
|
|
|
|
### GPU Memory Allocation
|
|
Controls how much GPU memory vLLM pre-allocates:
|
|
- `--memory 30%`: High concurrency, limited context
|
|
- `--memory 50%`: Balanced (default)
|
|
- `--memory 90%`: Maximum context, low concurrency
|
|
|
|
### Context Window
|
|
Sets maximum input + output tokens:
|
|
- `--context 4k`: 4,096 tokens total
|
|
- `--context 32k`: 32,768 tokens total
|
|
- `--context 128k`: 131,072 tokens total
|
|
|
|
Example for coding workload:
|
|
```bash
|
|
# Large context for code analysis, moderate concurrency
|
|
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name coder \
|
|
--context 64k --memory 70%
|
|
```
|
|
|
|
**Note**: When using `--vllm`, the `--memory`, `--context`, and `--gpus` parameters are ignored. You'll see a warning if you try to use them together.
|
|
|
|
## Session Persistence
|
|
|
|
The interactive agent mode (`-i`) saves sessions for each project directory:
|
|
|
|
```bash
|
|
# Start new session
|
|
pi agent qwen -i
|
|
|
|
# Continue previous session (maintains chat history)
|
|
pi agent qwen -i -c
|
|
```
|
|
|
|
Sessions are stored in `~/.pi/sessions/` organized by project path and include:
|
|
- Complete conversation history
|
|
- Tool call results
|
|
- Token usage statistics
|
|
|
|
## Architecture & Event System
|
|
|
|
The agent uses a unified event-based architecture where all interactions flow through `AgentEvent` types. This enables:
|
|
- Consistent UI rendering across console and TUI modes
|
|
- Session recording and replay
|
|
- Clean separation between API calls and UI updates
|
|
- JSON output mode for programmatic integration
|
|
|
|
Events are automatically converted to the appropriate API format (Chat Completions or Responses) based on the model type.
|
|
|
|
### JSON Output Mode
|
|
|
|
Use `--json` flag to output the event stream as JSONL (JSON Lines) for programmatic consumption:
|
|
```bash
|
|
pi-agent --api-key sk-... --json "What is 2+2?"
|
|
```
|
|
|
|
Each line is a complete JSON object representing an event:
|
|
```jsonl
|
|
{"type":"user_message","text":"What is 2+2?"}
|
|
{"type":"assistant_start"}
|
|
{"type":"assistant_message","text":"2 + 2 = 4"}
|
|
{"type":"token_usage","inputTokens":10,"outputTokens":5,"totalTokens":15,"cacheReadTokens":0,"cacheWriteTokens":0}
|
|
```
|
|
|
|
## Troubleshooting
|
|
|
|
### OOM (Out of Memory) Errors
|
|
- Reduce `--memory` percentage
|
|
- Use smaller model or quantized version (FP8)
|
|
- Reduce `--context` size
|
|
|
|
### Model Won't Start
|
|
```bash
|
|
# Check GPU usage
|
|
pi ssh "nvidia-smi"
|
|
|
|
# Check if port is in use
|
|
pi list
|
|
|
|
# Force stop all models
|
|
pi stop
|
|
```
|
|
|
|
### Tool Calling Issues
|
|
- Not all models support tool calling reliably
|
|
- Try different parser: `--vllm --tool-call-parser mistral`
|
|
- Or disable: `--vllm --disable-tool-call-parser`
|
|
|
|
### Access Denied for Models
|
|
Some models (Llama, Mistral) require HuggingFace access approval. Visit the model page and click "Request access".
|
|
|
|
### vLLM Build Issues
|
|
If using `--vllm nightly` fails, try:
|
|
- Use `--vllm release` for stable version
|
|
- Check CUDA compatibility with `pi ssh "nvidia-smi"`
|
|
|
|
### Agent Not Finding Messages
|
|
If the agent shows configuration instead of your message, ensure quotes around messages with special characters:
|
|
```bash
|
|
# Good
|
|
pi agent qwen "What is this file about?"
|
|
|
|
# Bad (shell might interpret special chars)
|
|
pi agent qwen What is this file about?
|
|
```
|
|
|
|
## Advanced Usage
|
|
|
|
### Working with Multiple Pods
|
|
```bash
|
|
# Override active pod for any command
|
|
pi start model --name test --pod dev-pod
|
|
pi list --pod prod-pod
|
|
pi stop test --pod dev-pod
|
|
```
|
|
|
|
### Custom vLLM Arguments
|
|
```bash
|
|
# Pass any vLLM argument after --vllm
|
|
pi start model --name custom --vllm \
|
|
--quantization awq \
|
|
--enable-prefix-caching \
|
|
--max-num-seqs 256 \
|
|
--gpu-memory-utilization 0.95
|
|
```
|
|
|
|
### Monitoring
|
|
```bash
|
|
# Watch GPU utilization
|
|
pi ssh "watch -n 1 nvidia-smi"
|
|
|
|
# Check model downloads
|
|
pi ssh "du -sh ~/.cache/huggingface/hub/*"
|
|
|
|
# View all logs
|
|
pi ssh "ls -la ~/.vllm_logs/"
|
|
|
|
# Check agent session history
|
|
ls -la ~/.pi/sessions/
|
|
```
|
|
|
|
## Environment Variables
|
|
|
|
- `HF_TOKEN` - HuggingFace token for model downloads
|
|
- `PI_API_KEY` - API key for vLLM endpoints
|
|
- `PI_CONFIG_DIR` - Config directory (default: `~/.pi`)
|
|
- `OPENAI_API_KEY` - Used by `pi-agent` when no `--api-key` provided
|
|
|
|
## License
|
|
|
|
MIT |