co-mono/packages/pods/README.md

# pi

Deploy and manage LLMs on GPU pods with automatic vLLM configuration for agentic workloads.

## Installation

```bash
npm install -g @mariozechner/pi
```

## What is pi?

`pi` simplifies running large language models on remote GPU pods. It automatically:
- Sets up vLLM on fresh Ubuntu pods
- Configures tool calling for agentic models (Qwen, GPT-OSS, GLM, etc.)
- Manages multiple models on the same pod with "smart" GPU allocation
- Provides OpenAI-compatible API endpoints for each model
- Includes an interactive agent with file system tools for testing

## Quick Start

```bash
# Set required environment variables
export HF_TOKEN=your_huggingface_token      # Get from https://huggingface.co/settings/tokens
export PI_API_KEY=your_api_key              # Any string you want for API authentication

# Setup a DataCrunch pod with NFS storage (models path auto-extracted)
pi pods setup dc1 "ssh root@1.2.3.4" \
  --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"

# Start a model (automatic configuration for known models)
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen

# Send a single message to the model
pi agent qwen "What is the Fibonacci sequence?"

# Interactive chat mode with file system tools
pi agent qwen -i

# Use with any OpenAI-compatible client
export OPENAI_BASE_URL='http://1.2.3.4:8001/v1'
export OPENAI_API_KEY=$PI_API_KEY
```

## Prerequisites

- Node.js 18+
- HuggingFace token (for model downloads)
- GPU pod with:
  - Ubuntu 22.04 or 24.04
  - SSH root access
  - NVIDIA drivers installed
  - Persistent storage for models

## Supported Providers

### Primary Support

**DataCrunch** - Best for shared model storage
- NFS volumes sharable across multiple pods in same region
- Models download once, use everywhere
- Ideal for teams or multiple experiments

**RunPod** - Good persistent storage
- Network volumes persist independently
- Cannot share between running pods simultaneously
- Good for single-pod workflows

### Also Works With
- Vast.ai (volumes locked to specific machine)
- Prime Intellect (no persistent storage)
- AWS EC2 (with EFS setup)
- Any Ubuntu machine with NVIDIA GPUs, CUDA driver, and SSH

## Commands

### Pod Management

```bash
pi pods setup <name> "<ssh>" [options]        # Setup new pod
  --mount "<mount_command>"                   # Run mount command during setup
  --models-path <path>                        # Override extracted path (optional)
  --vllm release|nightly|gpt-oss              # vLLM version (default: release)

pi pods                                       # List all configured pods
pi pods active <name>                         # Switch active pod
pi pods remove <name>                         # Remove pod from local config
pi shell [<name>]                             # SSH into pod
pi ssh [<name>] "<command>"                   # Run command on pod
```

**Note**: When using `--mount`, the models path is automatically extracted from the mount command's target directory. You only need `--models-path` if not using `--mount` or to override the extracted path.

#### vLLM Version Options

- `release` (default): Stable vLLM release, recommended for most users
- `nightly`: Latest vLLM features, needed for newest models like GLM-4.5
- `gpt-oss`: Special build for OpenAI's GPT-OSS models only

### Model Management

```bash
pi start <model> --name <name> [options]  # Start a model
  --memory <percent>      # GPU memory: 30%, 50%, 90% (default: 90%)
  --context <size>        # Context window: 4k, 8k, 16k, 32k, 64k, 128k
  --gpus <count>          # Number of GPUs to use (predefined models only)
  --pod <name>            # Target specific pod (overrides active)
  --vllm <args...>        # Pass custom args directly to vLLM

pi stop [<name>]          # Stop model (or all if no name given)
pi list                   # List running models with status
pi logs <name>            # Stream model logs (tail -f)
```

### Agent & Chat Interface

```bash
pi agent <name> "<message>"               # Single message to model
pi agent <name> "<msg1>" "<msg2>"         # Multiple messages in sequence
pi agent <name> -i                        # Interactive chat mode
pi agent <name> -i -c                     # Continue previous session

# Standalone OpenAI-compatible agent (works with any API)
pi-agent --base-url http://localhost:8000/v1 --model llama-3.1 "Hello"
pi-agent --api-key sk-... "What is 2+2?"  # Uses OpenAI by default
pi-agent --json "What is 2+2?"            # Output event stream as JSONL
pi-agent -i                                # Interactive mode
```

The agent includes tools for file operations (read, list, bash, glob, rg) to test agentic capabilities, particularly useful for code navigation and analysis tasks.

## Predefined Model Configurations

`pi` includes predefined configurations for popular agentic models, so you do not have to specify `--vllm` arguments manually. `pi` will also check if the model you selected can actually run on your pod with respect to the number of GPUs and available VRAM. Run `pi start` without additional arguments to see a list of predefined models that can run on the active pod.

### Qwen Models
```bash
# Qwen2.5-Coder-32B - Excellent coding model, fits on single H100/H200
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen

# Qwen3-Coder-30B - Advanced reasoning with tool use
pi start Qwen/Qwen3-Coder-30B-A3B-Instruct --name qwen3

# Qwen3-Coder-480B - State-of-the-art on 8xH200 (data-parallel mode)
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-480b
```

### GPT-OSS Models
```bash
# Requires special vLLM build during setup
pi pods setup gpt-pod "ssh root@1.2.3.4" --models-path /workspace --vllm gpt-oss

# GPT-OSS-20B - Fits on 16GB+ VRAM
pi start openai/gpt-oss-20b --name gpt20

# GPT-OSS-120B - Needs 60GB+ VRAM
pi start openai/gpt-oss-120b --name gpt120
```

### GLM Models
```bash
# GLM-4.5 - Requires 8-16 GPUs, includes thinking mode
pi start zai-org/GLM-4.5 --name glm

# GLM-4.5-Air - Smaller version, 1-2 GPUs
pi start zai-org/GLM-4.5-Air --name glm-air
```

### Custom Models with --vllm

For models not in the predefined list, use `--vllm` to pass arguments directly to vLLM:

```bash
# DeepSeek with custom settings
pi start deepseek-ai/DeepSeek-V3 --name deepseek --vllm \
  --tensor-parallel-size 4 --trust-remote-code

# Mistral with pipeline parallelism
pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm \
  --tensor-parallel-size 8 --pipeline-parallel-size 2

# Any model with specific tool parser
pi start some/model --name mymodel --vllm \
  --tool-call-parser hermes --enable-auto-tool-choice
```

## DataCrunch Setup

DataCrunch offers the best experience with shared NFS storage across pods:

### 1. Create Shared Filesystem (SFS)
- Go to DataCrunch dashboard → Storage → Create SFS
- Choose size and datacenter
- Note the mount command (e.g., `sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/hf-models-fin02-8ac1bab7 /mnt/hf-models-fin02`)

### 2. Create GPU Instance
- Create instance in same datacenter as SFS
- Share the SFS with the instance
- Get SSH command from dashboard

### 3. Setup with pi
```bash
# Get mount command from DataCrunch dashboard
pi pods setup dc1 "ssh root@instance.datacrunch.io" \
  --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"

# Models automatically stored in /mnt/hf-models (extracted from mount command)
```

### 4. Benefits
- Models persist across instance restarts
- Share models between multiple instances in same datacenter
- Download once, use everywhere
- Pay only for storage, not compute time during downloads

## RunPod Setup

RunPod offers good persistent storage with network volumes:

### 1. Create Network Volume (optional)
- Go to RunPod dashboard → Storage → Create Network Volume
- Choose size and region

### 2. Create GPU Pod
- Select "Network Volume" during pod creation (if using)
- Attach your volume to `/runpod-volume`
- Get SSH command from pod details

### 3. Setup with pi
```bash
# With network volume
pi pods setup runpod "ssh root@pod.runpod.io" --models-path /runpod-volume

# Or use workspace (persists with pod but not shareable)
pi pods setup runpod "ssh root@pod.runpod.io" --models-path /workspace
```


## Multi-GPU Support

### Automatic GPU Assignment
When running multiple models, pi automatically assigns them to different GPUs:
```bash
pi start model1 --name m1  # Auto-assigns to GPU 0
pi start model2 --name m2  # Auto-assigns to GPU 1
pi start model3 --name m3  # Auto-assigns to GPU 2
```

### Specify GPU Count for Predefined Models
For predefined models with multiple configurations, use `--gpus` to control GPU usage:
```bash
# Run Qwen on 1 GPU instead of all available
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen --gpus 1

# Run GLM-4.5 on 8 GPUs (if it has an 8-GPU config)
pi start zai-org/GLM-4.5 --name glm --gpus 8
```

If the model doesn't have a configuration for the requested GPU count, you'll see available options.

### Tensor Parallelism for Large Models
For models that don't fit on a single GPU:
```bash
# Use all available GPUs
pi start meta-llama/Llama-3.1-70B-Instruct --name llama70b --vllm \
  --tensor-parallel-size 4

# Specific GPU count
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen480 --vllm \
  --data-parallel-size 8 --enable-expert-parallel
```

## API Integration

All models expose OpenAI-compatible endpoints:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://your-pod-ip:8001/v1",
    api_key="your-pi-api-key"
)

# Chat completion with tool calling
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[
        {"role": "user", "content": "Write a Python function to calculate fibonacci"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "execute_code",
            "description": "Execute Python code",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string"}
                },
                "required": ["code"]
            }
        }
    }],
    tool_choice="auto"
)
```

## Standalone Agent CLI

`pi` includes a standalone OpenAI-compatible agent that can work with any API:

```bash
# Install globally to get pi-agent command
npm install -g @mariozechner/pi

# Use with OpenAI
pi-agent --api-key sk-... "What is machine learning?"

# Use with local vLLM
pi-agent --base-url http://localhost:8000/v1 \
         --model meta-llama/Llama-3.1-8B-Instruct \
         --api-key dummy \
         "Explain quantum computing"

# Interactive mode
pi-agent -i

# Continue previous session
pi-agent --continue "Follow up question"

# Custom system prompt
pi-agent --system-prompt "You are a Python expert" "Write a web scraper"

# Use responses API (for GPT-OSS models)
pi-agent --api responses --model openai/gpt-oss-20b "Hello"
```

The agent supports:
- Session persistence across conversations
- Interactive TUI mode with syntax highlighting
- File system tools (read, list, bash, glob, rg) for code navigation
- Both Chat Completions and Responses API formats
- Custom system prompts

## Tool Calling Support

`pi` automatically configures appropriate tool calling parsers for known models:

- **Qwen models**: `hermes` parser (Qwen3-Coder uses `qwen3_coder`)
- **GLM models**: `glm4_moe` parser with reasoning support
- **GPT-OSS models**: Uses `/v1/responses` endpoint, as tool calling (function calling in OpenAI parlance) is currently a [WIP with the `v1/chat/completions` endpoint](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#tool-use).
- **Custom models**: Specify with `--vllm --tool-call-parser <parser> --enable-auto-tool-choice`

To disable tool calling:
```bash
pi start model --name mymodel --vllm --disable-tool-call-parser
```

## Memory and Context Management

### GPU Memory Allocation
Controls how much GPU memory vLLM pre-allocates:
- `--memory 30%`: High concurrency, limited context
- `--memory 50%`: Balanced (default)
- `--memory 90%`: Maximum context, low concurrency

### Context Window
Sets maximum input + output tokens:
- `--context 4k`: 4,096 tokens total
- `--context 32k`: 32,768 tokens total
- `--context 128k`: 131,072 tokens total

Example for coding workload:
```bash
# Large context for code analysis, moderate concurrency
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name coder \
  --context 64k --memory 70%
```

**Note**: When using `--vllm`, the `--memory`, `--context`, and `--gpus` parameters are ignored. You'll see a warning if you try to use them together.

## Session Persistence

The interactive agent mode (`-i`) saves sessions for each project directory:

```bash
# Start new session
pi agent qwen -i

# Continue previous session (maintains chat history)
pi agent qwen -i -c
```

Sessions are stored in `~/.pi/sessions/` organized by project path and include:
- Complete conversation history
- Tool call results
- Token usage statistics

## Architecture & Event System

The agent uses a unified event-based architecture where all interactions flow through `AgentEvent` types. This enables:
- Consistent UI rendering across console and TUI modes
- Session recording and replay
- Clean separation between API calls and UI updates
- JSON output mode for programmatic integration

Events are automatically converted to the appropriate API format (Chat Completions or Responses) based on the model type.

### JSON Output Mode

Use `--json` flag to output the event stream as JSONL (JSON Lines) for programmatic consumption:
```bash
pi-agent --api-key sk-... --json "What is 2+2?"
```

Each line is a complete JSON object representing an event:
```jsonl
{"type":"user_message","text":"What is 2+2?"}
{"type":"assistant_start"}
{"type":"assistant_message","text":"2 + 2 = 4"}
{"type":"token_usage","inputTokens":10,"outputTokens":5,"totalTokens":15,"cacheReadTokens":0,"cacheWriteTokens":0}
```

## Troubleshooting

### OOM (Out of Memory) Errors
- Reduce `--memory` percentage
- Use smaller model or quantized version (FP8)
- Reduce `--context` size

### Model Won't Start
```bash
# Check GPU usage
pi ssh "nvidia-smi"

# Check if port is in use
pi list

# Force stop all models
pi stop
```

### Tool Calling Issues
- Not all models support tool calling reliably
- Try different parser: `--vllm --tool-call-parser mistral`
- Or disable: `--vllm --disable-tool-call-parser`

### Access Denied for Models
Some models (Llama, Mistral) require HuggingFace access approval. Visit the model page and click "Request access".

### vLLM Build Issues
If using `--vllm nightly` fails, try:
- Use `--vllm release` for stable version
- Check CUDA compatibility with `pi ssh "nvidia-smi"`

### Agent Not Finding Messages
If the agent shows configuration instead of your message, ensure quotes around messages with special characters:
```bash
# Good
pi agent qwen "What is this file about?"

# Bad (shell might interpret special chars)
pi agent qwen What is this file about?
```

## Advanced Usage

### Working with Multiple Pods
```bash
# Override active pod for any command
pi start model --name test --pod dev-pod
pi list --pod prod-pod
pi stop test --pod dev-pod
```

### Custom vLLM Arguments
```bash
# Pass any vLLM argument after --vllm
pi start model --name custom --vllm \
  --quantization awq \
  --enable-prefix-caching \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.95
```

### Monitoring
```bash
# Watch GPU utilization
pi ssh "watch -n 1 nvidia-smi"

# Check model downloads
pi ssh "du -sh ~/.cache/huggingface/hub/*"

# View all logs
pi ssh "ls -la ~/.vllm_logs/"

# Check agent session history
ls -la ~/.pi/sessions/
```

## Environment Variables

- `HF_TOKEN` - HuggingFace token for model downloads
- `PI_API_KEY` - API key for vLLM endpoints
- `PI_CONFIG_DIR` - Config directory (default: `~/.pi`)
- `OPENAI_API_KEY` - Used by `pi-agent` when no `--api-key` provided

## License

MIT