# pi

Deploy and manage LLMs on GPU pods with automatic vLLM configuration for agentic workloads.

## Installation

```bash
npm install -g @mariozechner/pi
```

## What is pi?

`pi` simplifies running large language models on remote GPU pods. It automatically:

- Sets up vLLM on fresh Ubuntu pods
- Configures tool calling for agentic models (Qwen, GPT-OSS, GLM, etc.)
- Manages multiple models on the same pod with "smart" GPU allocation
- Provides OpenAI-compatible API endpoints for each model
- Includes an interactive agent with file system tools for testing
## Quick Start

```bash
# Set required environment variables
export HF_TOKEN=your_huggingface_token  # Get from https://huggingface.co/settings/tokens
export PI_API_KEY=your_api_key          # Any string you want for API authentication

# Set up a DataCrunch pod with NFS storage (models path auto-extracted)
pi pods setup dc1 "ssh root@1.2.3.4" \
  --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"

# Start a model (automatic configuration for known models)
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen

# Send a single message to the model
pi agent qwen "What is the Fibonacci sequence?"

# Interactive chat mode with file system tools
pi agent qwen -i

# Use with any OpenAI-compatible client
export OPENAI_BASE_URL='http://1.2.3.4:8001/v1'
export OPENAI_API_KEY=$PI_API_KEY
```
## Prerequisites

- Node.js 18+
- HuggingFace token (for model downloads)
- GPU pod with:
  - Ubuntu 22.04 or 24.04
  - SSH root access
  - NVIDIA drivers installed
  - Persistent storage for models
## Supported Providers

### Primary Support

**DataCrunch** - Best for shared model storage

- NFS volumes are shareable across multiple pods in the same region
- Models download once, use everywhere
- Ideal for teams or multiple experiments

**RunPod** - Good persistent storage

- Network volumes persist independently
- Volumes cannot be shared between running pods simultaneously
- Good for single-pod workflows

### Also Works With

- Vast.ai (volumes locked to a specific machine)
- Prime Intellect (no persistent storage)
- AWS EC2 (with EFS setup)
- Any Ubuntu machine with NVIDIA GPUs, the CUDA driver, and SSH
## Commands

### Pod Management

```bash
pi pods setup <name> "<ssh>" [options]  # Set up a new pod
  --mount "<mount_command>"             # Run mount command during setup
  --models-path <path>                  # Override extracted path (optional)
  --vllm release|nightly|gpt-oss        # vLLM version (default: release)

pi pods                                 # List all configured pods
pi pods active <name>                   # Switch active pod
pi pods remove <name>                   # Remove pod from local config
pi shell [<name>]                       # SSH into pod
pi ssh [<name>] "<command>"             # Run a command on the pod
```

**Note**: When using `--mount`, the models path is automatically extracted from the mount command's target directory. You only need `--models-path` if you are not using `--mount` or want to override the extracted path.
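The extraction can be pictured as taking the mount command's last whitespace-separated token, which for `mount` is the target directory. A simplified sketch (the helper name is hypothetical, not pi's actual code):

```python
import shlex

def extract_models_path(mount_command: str) -> str:
    """Hypothetical sketch: the mount target is the last
    argument of the mount command (e.g. /mnt/hf-models)."""
    tokens = shlex.split(mount_command)
    if not tokens:
        raise ValueError("empty mount command")
    return tokens[-1]

print(extract_models_path(
    "sudo mount -t nfs -o nconnect=16 "
    "nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
))  # -> /mnt/hf-models
```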
#### vLLM Version Options

- `release` (default): Stable vLLM release, recommended for most users
- `nightly`: Latest vLLM features, needed for the newest models such as GLM-4.5
- `gpt-oss`: Special build for OpenAI's GPT-OSS models only
### Model Management

```bash
pi start <model> --name <name> [options]  # Start a model
  --memory <percent>  # GPU memory: 30%, 50%, 90% (default: 90%)
  --context <size>    # Context window: 4k, 8k, 16k, 32k, 64k, 128k
  --gpus <count>      # Number of GPUs to use (predefined models only)
  --pod <name>        # Target a specific pod (overrides active)
  --vllm <args...>    # Pass custom args directly to vLLM

pi stop [<name>]      # Stop model (or all models if no name given)
pi list               # List running models with status
pi logs <name>        # Stream model logs (tail -f)
```
### Agent & Chat Interface

```bash
pi agent <name> "<message>"        # Single message to the model
pi agent <name> "<msg1>" "<msg2>"  # Multiple messages in sequence
pi agent <name> -i                 # Interactive chat mode
pi agent <name> -i -c              # Continue previous session

# Standalone OpenAI-compatible agent (works with any API)
pi-agent --base-url http://localhost:8000/v1 --model llama-3.1 "Hello"
pi-agent --api-key sk-... "What is 2+2?"  # Uses OpenAI by default
pi-agent --json "What is 2+2?"            # Output event stream as JSONL
pi-agent -i                               # Interactive mode
```

The agent includes tools for file operations (read, list, bash, glob, rg) to test agentic capabilities, which is particularly useful for code navigation and analysis tasks.
## Predefined Model Configurations

`pi` includes predefined configurations for popular agentic models, so you do not have to specify `--vllm` arguments manually. `pi` also checks whether the selected model can actually run on your pod, given its number of GPUs and available VRAM. Run `pi start` without additional arguments to list the predefined models that can run on the active pod.
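The feasibility check can be pictured roughly as comparing a model's GPU and VRAM requirements against the pod's hardware. An illustrative sketch only; the names, fields, and numbers below are assumptions, not pi's actual code:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    name: str
    min_gpus: int          # GPUs the predefined config needs (assumed field)
    vram_per_gpu_gb: int   # approximate VRAM needed per GPU (assumed field)

def can_run(config: ModelConfig, pod_gpus: int, pod_vram_gb: int) -> bool:
    """Illustrative check: enough GPUs, and enough VRAM on each of them."""
    return pod_gpus >= config.min_gpus and pod_vram_gb >= config.vram_per_gpu_gb

# Hypothetical requirement figures for illustration
qwen32b = ModelConfig("Qwen/Qwen2.5-Coder-32B-Instruct", min_gpus=1, vram_per_gpu_gb=80)
print(can_run(qwen32b, pod_gpus=1, pod_vram_gb=80))  # True
print(can_run(qwen32b, pod_gpus=1, pod_vram_gb=24))  # False
```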
### Qwen Models

```bash
# Qwen2.5-Coder-32B - Excellent coding model, fits on a single H100/H200
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen

# Qwen3-Coder-30B - Advanced reasoning with tool use
pi start Qwen/Qwen3-Coder-30B-A3B-Instruct --name qwen3

# Qwen3-Coder-480B - State-of-the-art on 8xH200 (data-parallel mode)
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-480b
```
### GPT-OSS Models

```bash
# Requires the special vLLM build during setup
pi pods setup gpt-pod "ssh root@1.2.3.4" --models-path /workspace --vllm gpt-oss

# GPT-OSS-20B - Fits on 16 GB+ VRAM
pi start openai/gpt-oss-20b --name gpt20

# GPT-OSS-120B - Needs 60 GB+ VRAM
pi start openai/gpt-oss-120b --name gpt120
```
### GLM Models

```bash
# GLM-4.5 - Requires 8-16 GPUs, includes thinking mode
pi start zai-org/GLM-4.5 --name glm

# GLM-4.5-Air - Smaller version, 1-2 GPUs
pi start zai-org/GLM-4.5-Air --name glm-air
```
### Custom Models with --vllm

For models not in the predefined list, use `--vllm` to pass arguments directly to vLLM:

```bash
# DeepSeek with custom settings
pi start deepseek-ai/DeepSeek-V3 --name deepseek --vllm \
  --tensor-parallel-size 4 --trust-remote-code

# Mixtral with pipeline parallelism
pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm \
  --tensor-parallel-size 8 --pipeline-parallel-size 2

# Any model with a specific tool parser
pi start some/model --name mymodel --vllm \
  --tool-call-parser hermes --enable-auto-tool-choice
```
## DataCrunch Setup

DataCrunch offers the best experience, with shared NFS storage across pods:

### 1. Create Shared Filesystem (SFS)

- Go to the DataCrunch dashboard → Storage → Create SFS
- Choose size and datacenter
- Note the mount command (e.g., `sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/hf-models-fin02-8ac1bab7 /mnt/hf-models-fin02`)

### 2. Create GPU Instance

- Create the instance in the same datacenter as the SFS
- Share the SFS with the instance
- Get the SSH command from the dashboard

### 3. Setup with pi

```bash
# Get the mount command from the DataCrunch dashboard
pi pods setup dc1 "ssh root@instance.datacrunch.io" \
  --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"

# Models are automatically stored in /mnt/hf-models (extracted from the mount command)
```

### 4. Benefits

- Models persist across instance restarts
- Share models between multiple instances in the same datacenter
- Download once, use everywhere
- Pay only for storage, not compute time, during downloads
## RunPod Setup

RunPod offers good persistent storage with network volumes:

### 1. Create Network Volume (optional)

- Go to the RunPod dashboard → Storage → Create Network Volume
- Choose size and region

### 2. Create GPU Pod

- Select "Network Volume" during pod creation (if using one)
- Attach your volume at `/runpod-volume`
- Get the SSH command from the pod details

### 3. Setup with pi

```bash
# With a network volume
pi pods setup runpod "ssh root@pod.runpod.io" --models-path /runpod-volume

# Or use the workspace (persists with the pod but is not shareable)
pi pods setup runpod "ssh root@pod.runpod.io" --models-path /workspace
```
## Multi-GPU Support

### Automatic GPU Assignment

When running multiple models, `pi` automatically assigns them to different GPUs:

```bash
pi start model1 --name m1  # Auto-assigns to GPU 0
pi start model2 --name m2  # Auto-assigns to GPU 1
pi start model3 --name m3  # Auto-assigns to GPU 2
```
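The assignment behavior above can be pictured as a first-free-index scheme. A minimal sketch, assuming a simple registry of running models (this is illustrative, not pi's actual implementation):

```python
def assign_gpu(running: dict[str, int], gpu_count: int) -> int:
    """Illustrative first-free assignment: pick the lowest GPU
    index not already used by a running model."""
    used = set(running.values())
    for gpu in range(gpu_count):
        if gpu not in used:
            return gpu
    raise RuntimeError("all GPUs are in use")

running: dict[str, int] = {}
for name in ("m1", "m2", "m3"):
    running[name] = assign_gpu(running, gpu_count=4)
print(running)  # {'m1': 0, 'm2': 1, 'm3': 2}
```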
### Specify GPU Count for Predefined Models

For predefined models with multiple configurations, use `--gpus` to control GPU usage:

```bash
# Run Qwen on 1 GPU instead of all available
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen --gpus 1

# Run GLM-4.5 on 8 GPUs (if it has an 8-GPU config)
pi start zai-org/GLM-4.5 --name glm --gpus 8
```

If the model doesn't have a configuration for the requested GPU count, you'll see the available options.
### Tensor Parallelism for Large Models

For models that don't fit on a single GPU:

```bash
# Shard across 4 GPUs with tensor parallelism
pi start meta-llama/Llama-3.1-70B-Instruct --name llama70b --vllm \
  --tensor-parallel-size 4

# Data parallelism plus expert parallelism for large MoE models
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen480 --vllm \
  --data-parallel-size 8 --enable-expert-parallel
```
## API Integration

All models expose OpenAI-compatible endpoints:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://your-pod-ip:8001/v1",
    api_key="your-pi-api-key"
)

# Chat completion with tool calling
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[
        {"role": "user", "content": "Write a Python function to calculate fibonacci"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "execute_code",
            "description": "Execute Python code",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string"}
                },
                "required": ["code"]
            }
        }
    }],
    tool_choice="auto"
)
```
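When the model decides to call a tool, the response carries the tool's name and JSON-encoded arguments, and the client is responsible for executing it and sending the result back. A minimal dispatch sketch; the payload below is hand-written to mirror the shape of `response.choices[0].message.tool_calls[0]`, and the dispatcher itself is hypothetical:

```python
import json

def dispatch_tool_call(tool_call: dict) -> str:
    """Execute a tool call from an OpenAI-style response.
    Arguments arrive as a JSON string and must be parsed first."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    if name == "execute_code":
        # A real client would sandbox execution; here we just echo the code.
        return f"would execute: {args['code']}"
    raise ValueError(f"unknown tool: {name}")

# Hand-written payload mirroring a tool_calls entry
call = {
    "function": {
        "name": "execute_code",
        "arguments": '{"code": "print(1 + 1)"}',
    }
}
print(dispatch_tool_call(call))  # would execute: print(1 + 1)
```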
## Standalone Agent CLI

`pi` includes a standalone OpenAI-compatible agent that works with any API:

```bash
# Install globally to get the pi-agent command
npm install -g @mariozechner/pi

# Use with OpenAI
pi-agent --api-key sk-... "What is machine learning?"

# Use with local vLLM
pi-agent --base-url http://localhost:8000/v1 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --api-key dummy \
  "Explain quantum computing"

# Interactive mode
pi-agent -i

# Continue previous session
pi-agent --continue "Follow up question"

# Custom system prompt
pi-agent --system-prompt "You are a Python expert" "Write a web scraper"

# Use the Responses API (for GPT-OSS models)
pi-agent --api responses --model openai/gpt-oss-20b "Hello"
```

The agent supports:

- Session persistence across conversations
- Interactive TUI mode with syntax highlighting
- File system tools (read, list, bash, glob, rg) for code navigation
- Both Chat Completions and Responses API formats
- Custom system prompts
## Tool Calling Support

`pi` automatically configures the appropriate tool calling parser for known models:

- **Qwen models**: `hermes` parser (Qwen3-Coder uses `qwen3_coder`)
- **GLM models**: `glm4_moe` parser with reasoning support
- **GPT-OSS models**: use the `/v1/responses` endpoint, as tool calling (function calling in OpenAI parlance) is currently a [WIP with the `v1/chat/completions` endpoint](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#tool-use)
- **Custom models**: specify with `--vllm --tool-call-parser <parser> --enable-auto-tool-choice`

To disable tool calling:

```bash
pi start model --name mymodel --vllm --disable-tool-call-parser
```
## Memory and Context Management

### GPU Memory Allocation

Controls how much GPU memory vLLM pre-allocates:

- `--memory 30%`: High concurrency, limited context
- `--memory 50%`: Balanced
- `--memory 90%`: Maximum context, low concurrency (default)

### Context Window

Sets the maximum number of input + output tokens:

- `--context 4k`: 4,096 tokens total
- `--context 32k`: 32,768 tokens total
- `--context 128k`: 131,072 tokens total
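The `k` suffixes map to binary multiples (1k = 1,024 tokens), which is why `128k` means 131,072. A parsing sketch with a hypothetical helper name:

```python
def parse_context(size: str) -> int:
    """Convert a context spec like '32k' into a token count,
    using binary multiples (1k = 1024). Hypothetical helper."""
    if not size.endswith("k"):
        raise ValueError(f"expected a size like '32k', got {size!r}")
    return int(size[:-1]) * 1024

assert parse_context("4k") == 4096
assert parse_context("32k") == 32768
assert parse_context("128k") == 131072
```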
Example for a coding workload:

```bash
# Large context for code analysis, moderate concurrency
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name coder \
  --context 64k --memory 70%
```

**Note**: When using `--vllm`, the `--memory`, `--context`, and `--gpus` parameters are ignored. You'll see a warning if you try to use them together.
## Session Persistence

The interactive agent mode (`-i`) saves a session per project directory:

```bash
# Start a new session
pi agent qwen -i

# Continue the previous session (maintains chat history)
pi agent qwen -i -c
```

Sessions are stored in `~/.pi/sessions/`, organized by project path, and include:

- Complete conversation history
- Tool call results
- Token usage statistics
## Architecture & Event System

The agent uses a unified event-based architecture where all interactions flow through `AgentEvent` types. This enables:

- Consistent UI rendering across console and TUI modes
- Session recording and replay
- Clean separation between API calls and UI updates
- JSON output mode for programmatic integration

Events are automatically converted to the appropriate API format (Chat Completions or Responses) based on the model type.
### JSON Output Mode

Use the `--json` flag to output the event stream as JSONL (JSON Lines) for programmatic consumption:

```bash
pi-agent --api-key sk-... --json "What is 2+2?"
```

Each line is a complete JSON object representing an event:

```jsonl
{"type":"user_message","text":"What is 2+2?"}
{"type":"assistant_start"}
{"type":"assistant_message","text":"2 + 2 = 4"}
{"type":"token_usage","inputTokens":10,"outputTokens":5,"totalTokens":15,"cacheReadTokens":0,"cacheWriteTokens":0}
```
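A consumer can parse the stream one line at a time. This sketch collects assistant text and totals token usage from a captured stream; the sample lines are the ones shown above, pasted in as a string rather than read from the live CLI:

```python
import json

def summarize(stream: str) -> tuple[str, int]:
    """Collect assistant text and total token count from a JSONL event stream."""
    text, total_tokens = [], 0
    for line in stream.strip().splitlines():
        event = json.loads(line)
        if event["type"] == "assistant_message":
            text.append(event["text"])
        elif event["type"] == "token_usage":
            total_tokens += event["totalTokens"]
    return "".join(text), total_tokens

stream = """\
{"type":"user_message","text":"What is 2+2?"}
{"type":"assistant_start"}
{"type":"assistant_message","text":"2 + 2 = 4"}
{"type":"token_usage","inputTokens":10,"outputTokens":5,"totalTokens":15,"cacheReadTokens":0,"cacheWriteTokens":0}
"""
answer, tokens = summarize(stream)
print(answer, tokens)  # 2 + 2 = 4 15
```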
## Troubleshooting

### OOM (Out of Memory) Errors

- Reduce the `--memory` percentage
- Use a smaller model or a quantized version (FP8)
- Reduce the `--context` size

### Model Won't Start

```bash
# Check GPU usage
pi ssh "nvidia-smi"

# Check whether the port is in use
pi list

# Force stop all models
pi stop
```

### Tool Calling Issues

- Not all models support tool calling reliably
- Try a different parser: `--vllm --tool-call-parser mistral`
- Or disable it: `--vllm --disable-tool-call-parser`

### Access Denied for Models

Some models (Llama, Mistral) require HuggingFace access approval. Visit the model page and click "Request access".
### vLLM Build Issues

If setup with `--vllm nightly` fails, try:

- Using `--vllm release` for the stable version
- Checking CUDA compatibility with `pi ssh "nvidia-smi"`

### Agent Not Finding Messages

If the agent shows its configuration instead of responding to your message, make sure messages with special characters are quoted:

```bash
# Good
pi agent qwen "What is this file about?"

# Bad (the shell might interpret special characters)
pi agent qwen What is this file about?
```
## Advanced Usage

### Working with Multiple Pods

```bash
# Override the active pod for any command
pi start model --name test --pod dev-pod
pi list --pod prod-pod
pi stop test --pod dev-pod
```

### Custom vLLM Arguments

```bash
# Pass any vLLM argument after --vllm
pi start model --name custom --vllm \
  --quantization awq \
  --enable-prefix-caching \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.95
```
### Monitoring

```bash
# Watch GPU utilization
pi ssh "watch -n 1 nvidia-smi"

# Check model downloads
pi ssh "du -sh ~/.cache/huggingface/hub/*"

# View all logs
pi ssh "ls -la ~/.vllm_logs/"

# Check agent session history
ls -la ~/.pi/sessions/
```
## Environment Variables

- `HF_TOKEN` - HuggingFace token for model downloads
- `PI_API_KEY` - API key for the vLLM endpoints
- `PI_CONFIG_DIR` - Config directory (default: `~/.pi`)
- `OPENAI_API_KEY` - Used by `pi-agent` when no `--api-key` is provided
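Scripts wrapping `pi` can resolve these with standard environment lookups. A minimal sketch, with the `PI_CONFIG_DIR` default taken from the list above:

```python
import os
from pathlib import Path

def pi_config_dir() -> Path:
    # PI_CONFIG_DIR overrides the default ~/.pi, per the list above.
    return Path(os.environ.get("PI_CONFIG_DIR", str(Path.home() / ".pi")))

os.environ["PI_CONFIG_DIR"] = "/tmp/pi-test"
print(pi_config_dir())  # /tmp/pi-test
```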
## License

MIT