# pi

Deploy and manage LLMs on GPU pods with automatic vLLM configuration for agentic workloads.

## Installation

```bash
npm install -g @mariozechner/pi
```

## What is pi?

`pi` simplifies running large language models on remote GPU pods. It automatically:

- Sets up vLLM on fresh Ubuntu pods
- Configures tool calling for agentic models (Qwen, GPT-OSS, GLM, etc.)
- Manages multiple models on the same pod with "smart" GPU allocation
- Provides OpenAI-compatible API endpoints for each model
- Includes an interactive agent with file system tools for testing

## Quick Start

```bash
# Set required environment variables
export HF_TOKEN=your_huggingface_token   # Get from https://huggingface.co/settings/tokens
export PI_API_KEY=your_api_key           # Any string you want for API authentication

# Setup a DataCrunch pod with NFS storage (models path auto-extracted)
pi pods setup dc1 "ssh root@1.2.3.4" \
  --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"

# Start a model (automatic configuration for known models)
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen

# Send a single message to the model
pi agent qwen "What is the Fibonacci sequence?"

# Interactive chat mode with file system tools
pi agent qwen -i

# Use with any OpenAI-compatible client
export OPENAI_BASE_URL='http://1.2.3.4:8001/v1'
export OPENAI_API_KEY=$PI_API_KEY
```

## Prerequisites

- Node.js 18+
- HuggingFace token (for model downloads)
- GPU pod with:
  - Ubuntu 22.04 or 24.04
  - SSH root access
  - NVIDIA drivers installed
  - Persistent storage for models

## Supported Providers

### Primary Support

**DataCrunch**
- Best for shared model storage
- NFS volumes sharable across multiple pods in the same region
- Models download once, use everywhere
- Ideal for teams or multiple experiments

**RunPod**
- Good persistent storage
- Network volumes persist independently
- Cannot be shared between running pods simultaneously
- Good for single-pod workflows

### Also Works With

- Vast.ai (volumes locked to a specific machine)
- Prime Intellect (no persistent storage)
- AWS EC2 (with EFS setup)
- Any Ubuntu machine with NVIDIA GPUs, CUDA driver, and SSH

## Commands

### Pod Management

```bash
pi pods setup <name> "<ssh command>" [options]   # Setup new pod
  --mount "<mount command>"                      # Run mount command during setup
  --models-path <path>                           # Override extracted path (optional)
  --vllm release|nightly|gpt-oss                 # vLLM version (default: release)

pi pods                        # List all configured pods
pi pods active <name>          # Switch active pod
pi pods remove <name>          # Remove pod from local config
pi shell [<name>]              # SSH into pod
pi ssh [<name>] "<command>"    # Run command on pod
```

**Note**: When using `--mount`, the models path is automatically extracted from the mount command's target directory. You only need `--models-path` if not using `--mount`, or to override the extracted path.
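As a rough illustration of that extraction (not the actual implementation inside `pi`, which may differ), the models path is simply the mount target, i.e. the last argument of the mount command:

```python
# Sketch only: derive a models path from a mount command.
# pi's real parsing logic may differ.
def extract_models_path(mount_cmd: str) -> str:
    # "mount -t nfs ... <source> <target>" - the mount target is the last token
    return mount_cmd.strip().split()[-1]

mount = "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
print(extract_models_path(mount))  # -> /mnt/hf-models
```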
#### vLLM Version Options

- `release` (default): Stable vLLM release, recommended for most users
- `nightly`: Latest vLLM features, needed for the newest models like GLM-4.5
- `gpt-oss`: Special build for OpenAI's GPT-OSS models only

### Model Management

```bash
pi start <model> --name <name> [options]   # Start a model
  --memory <percent>     # GPU memory: 30%, 50%, 90% (default: 90%)
  --context <size>       # Context window: 4k, 8k, 16k, 32k, 64k, 128k
  --gpus <count>         # Number of GPUs to use (predefined models only)
  --pod <name>           # Target specific pod (overrides active)
  --vllm <args>          # Pass custom args directly to vLLM

pi stop [<name>]         # Stop model (or all if no name given)
pi list                  # List running models with status
pi logs <name>           # Stream model logs (tail -f)
```

### Agent & Chat Interface

```bash
pi agent <name> "<message>"                # Single message to model
pi agent <name> "<message>" "<message>"    # Multiple messages in sequence
pi agent <name> -i                         # Interactive chat mode
pi agent <name> -i -c                      # Continue previous session

# Standalone OpenAI-compatible agent (works with any API)
pi-agent --base-url http://localhost:8000/v1 --model llama-3.1 "Hello"
pi-agent --api-key sk-... "What is 2+2?"   # Uses OpenAI by default
pi-agent --json "What is 2+2?"             # Output event stream as JSONL
pi-agent -i                                # Interactive mode
```

The agent includes tools for file operations (read, list, bash, glob, rg) to test agentic capabilities, which is particularly useful for code navigation and analysis tasks.

## Predefined Model Configurations

`pi` includes predefined configurations for popular agentic models, so you do not have to specify `--vllm` arguments manually. `pi` also checks whether the selected model can actually run on your pod, given the number of GPUs and available VRAM. Run `pi start` without additional arguments to see a list of predefined models that can run on the active pod.

### Qwen Models

```bash
# Qwen2.5-Coder-32B - Excellent coding model, fits on a single H100/H200
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen

# Qwen3-Coder-30B - Advanced reasoning with tool use
pi start Qwen/Qwen3-Coder-30B-A3B-Instruct --name qwen3

# Qwen3-Coder-480B - State-of-the-art on 8xH200 (data-parallel mode)
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-480b
```

### GPT-OSS Models

```bash
# Requires the special vLLM build during setup
pi pods setup gpt-pod "ssh root@1.2.3.4" --models-path /workspace --vllm gpt-oss

# GPT-OSS-20B - Fits on 16GB+ VRAM
pi start openai/gpt-oss-20b --name gpt20

# GPT-OSS-120B - Needs 60GB+ VRAM
pi start openai/gpt-oss-120b --name gpt120
```

### GLM Models

```bash
# GLM-4.5 - Requires 8-16 GPUs, includes thinking mode
pi start zai-org/GLM-4.5 --name glm

# GLM-4.5-Air - Smaller version, 1-2 GPUs
pi start zai-org/GLM-4.5-Air --name glm-air
```

### Custom Models with --vllm

For models not in the predefined list, use `--vllm` to pass arguments directly to vLLM:

```bash
# DeepSeek with custom settings
pi start deepseek-ai/DeepSeek-V3 --name deepseek --vllm \
  --tensor-parallel-size 4 --trust-remote-code

# Mixtral with pipeline parallelism
pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm \
  --tensor-parallel-size 8 --pipeline-parallel-size 2

# Any model with a specific tool parser
pi start some/model --name mymodel --vllm \
  --tool-call-parser hermes --enable-auto-tool-choice
```
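Once a model is up (predefined or custom), you can sanity-check its endpoint from any machine with an OpenAI-compatible client. A minimal sketch, assuming the pod's IP is 1.2.3.4, the model is served on port 8001 (as in the Quick Start), and `PI_API_KEY` is exported locally:

```python
import os
from openai import OpenAI

# Assumes the first model on the pod listens on port 8001 (see Quick Start)
# and PI_API_KEY matches the key used during pod setup.
client = OpenAI(base_url="http://1.2.3.4:8001/v1", api_key=os.environ["PI_API_KEY"])

# List the models the vLLM server is currently serving
for model in client.models.list():
    print(model.id)  # e.g. Qwen/Qwen2.5-Coder-32B-Instruct
```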
## DataCrunch Setup

DataCrunch offers the best experience with shared NFS storage across pods:

### 1. Create Shared Filesystem (SFS)

- Go to the DataCrunch dashboard → Storage → Create SFS
- Choose size and datacenter
- Note the mount command (e.g., `sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/hf-models-fin02-8ac1bab7 /mnt/hf-models-fin02`)

### 2. Create GPU Instance

- Create the instance in the same datacenter as the SFS
- Share the SFS with the instance
- Get the SSH command from the dashboard

### 3. Setup with pi

```bash
# Get mount command from DataCrunch dashboard
pi pods setup dc1 "ssh root@instance.datacrunch.io" \
  --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"

# Models automatically stored in /mnt/hf-models (extracted from mount command)
```

### 4. Benefits

- Models persist across instance restarts
- Share models between multiple instances in the same datacenter
- Download once, use everywhere
- Pay only for storage, not compute time, during downloads

## RunPod Setup

RunPod offers good persistent storage with network volumes:

### 1. Create Network Volume (optional)

- Go to the RunPod dashboard → Storage → Create Network Volume
- Choose size and region

### 2. Create GPU Pod

- Select "Network Volume" during pod creation (if using one)
- Attach your volume to `/runpod-volume`
- Get the SSH command from the pod details

### 3. Setup with pi

```bash
# With network volume
pi pods setup runpod "ssh root@pod.runpod.io" --models-path /runpod-volume

# Or use workspace (persists with pod but not shareable)
pi pods setup runpod "ssh root@pod.runpod.io" --models-path /workspace
```

## Multi-GPU Support

### Automatic GPU Assignment

When running multiple models, pi automatically assigns them to different GPUs:

```bash
pi start model1 --name m1   # Auto-assigns to GPU 0
pi start model2 --name m2   # Auto-assigns to GPU 1
pi start model3 --name m3   # Auto-assigns to GPU 2
```

### Specify GPU Count for Predefined Models

For predefined models with multiple configurations, use `--gpus` to control GPU usage:

```bash
# Run Qwen on 1 GPU instead of all available
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen --gpus 1

# Run GLM-4.5 on 8 GPUs (if it has an 8-GPU config)
pi start zai-org/GLM-4.5 --name glm --gpus 8
```

If the model doesn't have a configuration for the requested GPU count, you'll see the available options.

### Tensor Parallelism for Large Models

For models that don't fit on a single GPU:

```bash
# Use all available GPUs
pi start meta-llama/Llama-3.1-70B-Instruct --name llama70b --vllm \
  --tensor-parallel-size 4

# Specific GPU count
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen480 --vllm \
  --data-parallel-size 8 --enable-expert-parallel
```

## API Integration

All models expose OpenAI-compatible endpoints:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://your-pod-ip:8001/v1",
    api_key="your-pi-api-key"
)

# Chat completion with tool calling
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[
        {"role": "user", "content": "Write a Python function to calculate fibonacci"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "execute_code",
            "description": "Execute Python code",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string"}
                },
                "required": ["code"]
            }
        }
    }],
    tool_choice="auto"
)
```
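If the model decides to call the `execute_code` tool from the example above, the call appears on the response message. A short follow-up sketch of reading it, continuing from the `response` variable; the field names are the standard OpenAI Chat Completions schema:

```python
import json

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    # Tool arguments arrive as a JSON string and must be parsed
    args = json.loads(call.function.arguments)
    print(call.function.name, args["code"])
else:
    # The model may also answer directly without calling a tool
    print(message.content)
```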
## Standalone Agent CLI

`pi` includes a standalone OpenAI-compatible agent that works with any API:

```bash
# Install globally to get the pi-agent command
npm install -g @mariozechner/pi

# Use with OpenAI
pi-agent --api-key sk-... "What is machine learning?"

# Use with local vLLM
pi-agent --base-url http://localhost:8000/v1 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --api-key dummy \
  "Explain quantum computing"

# Interactive mode
pi-agent -i

# Continue previous session
pi-agent --continue "Follow up question"

# Custom system prompt
pi-agent --system-prompt "You are a Python expert" "Write a web scraper"

# Use responses API (for GPT-OSS models)
pi-agent --api responses --model openai/gpt-oss-20b "Hello"
```

The agent supports:

- Session persistence across conversations
- Interactive TUI mode with syntax highlighting
- File system tools (read, list, bash, glob, rg) for code navigation
- Both Chat Completions and Responses API formats
- Custom system prompts

## Tool Calling Support

`pi` automatically configures appropriate tool calling parsers for known models:

- **Qwen models**: `hermes` parser (Qwen3-Coder uses `qwen3_coder`)
- **GLM models**: `glm4_moe` parser with reasoning support
- **GPT-OSS models**: Uses the `/v1/responses` endpoint, as tool calling (function calling in OpenAI parlance) is currently a [WIP with the `/v1/chat/completions` endpoint](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#tool-use)
- **Custom models**: Specify with `--vllm --tool-call-parser <parser> --enable-auto-tool-choice`

To disable tool calling:

```bash
pi start model --name mymodel --vllm --disable-tool-call-parser
```

## Memory and Context Management

### GPU Memory Allocation

Controls how much GPU memory vLLM pre-allocates:

- `--memory 30%`: High concurrency, limited context
- `--memory 50%`: Balanced (default)
- `--memory 90%`: Maximum context, low concurrency

### Context Window

Sets the maximum number of input + output tokens:

- `--context 4k`: 4,096 tokens total
- `--context 32k`: 32,768 tokens total
- `--context 128k`: 131,072 tokens total

Example for a coding workload:

```bash
# Large context for code analysis, moderate concurrency
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name coder \
  --context 64k --memory 70%
```

**Note**: When using `--vllm`, the `--memory`, `--context`, and `--gpus` parameters are ignored. You'll see a warning if you try to use them together.

## Session Persistence

The interactive agent mode (`-i`) saves sessions for each project directory:

```bash
# Start new session
pi agent qwen -i

# Continue previous session (maintains chat history)
pi agent qwen -i -c
```

Sessions are stored in `~/.pi/sessions/`, organized by project path, and include:

- Complete conversation history
- Tool call results
- Token usage statistics

## Architecture & Event System

The agent uses a unified event-based architecture in which all interactions flow through `AgentEvent` types. This enables:

- Consistent UI rendering across console and TUI modes
- Session recording and replay
- Clean separation between API calls and UI updates
- JSON output mode for programmatic integration

Events are automatically converted to the appropriate API format (Chat Completions or Responses) based on the model type.

### JSON Output Mode

Use the `--json` flag to output the event stream as JSONL (JSON Lines) for programmatic consumption:

```bash
pi-agent --api-key sk-... --json "What is 2+2?"
```

Each line is a complete JSON object representing an event:

```jsonl
{"type":"user_message","text":"What is 2+2?"}
{"type":"assistant_start"}
{"type":"assistant_message","text":"2 + 2 = 4"}
{"type":"token_usage","inputTokens":10,"outputTokens":5,"totalTokens":15,"cacheReadTokens":0,"cacheWriteTokens":0}
```
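A small sketch of consuming this stream from Python, assuming `pi-agent` is on the PATH and an API key is available via `OPENAI_API_KEY`; the event shapes follow the example above:

```python
import json
import subprocess

# Run pi-agent in JSON mode and capture the JSONL event stream
proc = subprocess.run(
    ["pi-agent", "--json", "What is 2+2?"],
    capture_output=True, text=True, check=True,
)

for line in proc.stdout.splitlines():
    if not line.strip():
        continue
    event = json.loads(line)
    if event["type"] == "assistant_message":
        print(event["text"])               # "2 + 2 = 4"
    elif event["type"] == "token_usage":
        print("tokens:", event["totalTokens"])
```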
## Troubleshooting

### OOM (Out of Memory) Errors

- Reduce the `--memory` percentage
- Use a smaller model or a quantized version (FP8)
- Reduce the `--context` size

### Model Won't Start

```bash
# Check GPU usage
pi ssh "nvidia-smi"

# Check if port is in use
pi list

# Force stop all models
pi stop
```

### Tool Calling Issues

- Not all models support tool calling reliably
- Try a different parser: `--vllm --tool-call-parser mistral`
- Or disable it: `--vllm --disable-tool-call-parser`

### Access Denied for Models

Some models (Llama, Mistral) require HuggingFace access approval. Visit the model page and click "Request access".

### vLLM Build Issues

If setup with `--vllm nightly` fails:

- Use `--vllm release` for the stable version
- Check CUDA compatibility with `pi ssh "nvidia-smi"`

### Agent Not Finding Messages

If the agent shows its configuration instead of your message, make sure messages with special characters are quoted:

```bash
# Good
pi agent qwen "What is this file about?"

# Bad (shell might interpret special chars)
pi agent qwen What is this file about?
```

## Advanced Usage

### Working with Multiple Pods

```bash
# Override active pod for any command
pi start model --name test --pod dev-pod
pi list --pod prod-pod
pi stop test --pod dev-pod
```

### Custom vLLM Arguments

```bash
# Pass any vLLM argument after --vllm
pi start model --name custom --vllm \
  --quantization awq \
  --enable-prefix-caching \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.95
```

### Monitoring

```bash
# Watch GPU utilization
pi ssh "watch -n 1 nvidia-smi"

# Check model downloads
pi ssh "du -sh ~/.cache/huggingface/hub/*"

# View all logs
pi ssh "ls -la ~/.vllm_logs/"

# Check agent session history
ls -la ~/.pi/sessions/
```

## Environment Variables

- `HF_TOKEN` - HuggingFace token for model downloads
- `PI_API_KEY` - API key for vLLM endpoints
- `PI_CONFIG_DIR` - Config directory (default: `~/.pi`)
- `OPENAI_API_KEY` - Used by `pi-agent` when no `--api-key` is provided

## License

MIT