# pi

Deploy and manage LLMs on GPU pods with automatic vLLM configuration for agentic workloads.

## Installation

```bash
npm install -g @mariozechner/pi
```

## What is pi?

`pi` simplifies running large language models on remote GPU pods. It automatically:

- Sets up vLLM on fresh Ubuntu pods
- Configures tool calling for agentic models (Qwen, GPT-OSS, GLM, etc.)
- Manages multiple models on the same pod with "smart" GPU allocation
- Provides OpenAI-compatible API endpoints for each model
- Includes an interactive agent with file system tools for testing
## Quick Start

```bash
# Set required environment variables
export HF_TOKEN=your_huggingface_token  # Get from https://huggingface.co/settings/tokens
export PI_API_KEY=your_api_key          # Any string you want for API authentication

# Set up a DataCrunch pod with NFS storage (models path auto-extracted)
pi pods setup dc1 "ssh root@1.2.3.4" \
  --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"

# Start a model (automatic configuration for known models)
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen

# Send a single message to the model
pi agent qwen "What is the Fibonacci sequence?"

# Interactive chat mode with file system tools
pi agent qwen -i

# Use with any OpenAI-compatible client
export OPENAI_BASE_URL='http://1.2.3.4:8001/v1'
export OPENAI_API_KEY=$PI_API_KEY
```
## Prerequisites

- Node.js 18+
- HuggingFace token (for model downloads)
- GPU pod with:
  - Ubuntu 22.04 or 24.04
  - SSH root access
  - NVIDIA drivers installed
  - Persistent storage for models
## Supported Providers

### Primary Support

**DataCrunch** - Best for shared model storage

- NFS volumes are shareable across multiple pods in the same region
- Models download once, use everywhere
- Ideal for teams or multiple experiments

**RunPod** - Good persistent storage

- Network volumes persist independently
- Volumes cannot be shared between running pods simultaneously
- Good for single-pod workflows

### Also Works With

- Vast.ai (volumes locked to a specific machine)
- Prime Intellect (no persistent storage)
- AWS EC2 (with EFS setup)
- Any Ubuntu machine with NVIDIA GPUs, the CUDA driver, and SSH
## Commands

### Pod Management

```bash
pi pods setup <name> "<ssh>" [options]  # Set up a new pod
  --mount "<mount_command>"             # Run mount command during setup
  --models-path <path>                  # Override extracted path (optional)
  --vllm release|nightly|gpt-oss        # vLLM version (default: release)

pi pods                                 # List all configured pods
pi pods active <name>                   # Switch active pod
pi pods remove <name>                   # Remove pod from local config
pi shell [<name>]                       # SSH into pod
pi ssh [<name>] "<command>"             # Run a command on the pod
```

**Note**: When using `--mount`, the models path is automatically extracted from the mount command's target directory. You only need `--models-path` if you are not using `--mount` or want to override the extracted path.
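The extraction can be pictured as taking the mount command's last whitespace-separated token, which for `mount` is the target directory. A simplified sketch (the helper name is hypothetical, not pi's actual code):

```python
import shlex

def extract_models_path(mount_command: str) -> str:
    """Hypothetical sketch: the mount target is the last
    argument of the mount command (e.g. /mnt/hf-models)."""
    tokens = shlex.split(mount_command)
    if not tokens:
        raise ValueError("empty mount command")
    return tokens[-1]

print(extract_models_path(
    "sudo mount -t nfs -o nconnect=16 "
    "nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
))  # -> /mnt/hf-models
```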
#### vLLM Version Options

- `release` (default): Stable vLLM release, recommended for most users
- `nightly`: Latest vLLM features, needed for the newest models such as GLM-4.5
- `gpt-oss`: Special build for OpenAI's GPT-OSS models only
### Model Management

```bash
pi start <model> --name <name> [options]  # Start a model
  --memory <percent>  # GPU memory: 30%, 50%, 90% (default: 90%)
  --context <size>    # Context window: 4k, 8k, 16k, 32k, 64k, 128k
  --gpus <count>      # Number of GPUs to use (predefined models only)
  --pod <name>        # Target a specific pod (overrides active)
  --vllm <args...>    # Pass custom args directly to vLLM

pi stop [<name>]      # Stop model (or all models if no name given)
pi list               # List running models with status
pi logs <name>        # Stream model logs (tail -f)
```
### Agent & Chat Interface

```bash
pi agent <name> "<message>"        # Single message to the model
pi agent <name> "<msg1>" "<msg2>"  # Multiple messages in sequence
pi agent <name> -i                 # Interactive chat mode
pi agent <name> -i -c              # Continue previous session

# Standalone OpenAI-compatible agent (works with any API)
pi-agent --base-url http://localhost:8000/v1 --model llama-3.1 "Hello"
pi-agent --api-key sk-... "What is 2+2?"  # Uses OpenAI by default
pi-agent --json "What is 2+2?"            # Output event stream as JSONL
pi-agent -i                               # Interactive mode
```

The agent includes tools for file operations (read, list, bash, glob, rg) to test agentic capabilities, which is particularly useful for code navigation and analysis tasks.
## Predefined Model Configurations

`pi` includes predefined configurations for popular agentic models, so you do not have to specify `--vllm` arguments manually. `pi` also checks whether the selected model can actually run on your pod, given its number of GPUs and available VRAM. Run `pi start` without additional arguments to list the predefined models that can run on the active pod.
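The feasibility check can be pictured roughly as comparing a model's GPU and VRAM requirements against the pod's hardware. An illustrative sketch only; the names, fields, and numbers below are assumptions, not pi's actual code:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    name: str
    min_gpus: int          # GPUs the predefined config needs (assumed field)
    vram_per_gpu_gb: int   # approximate VRAM needed per GPU (assumed field)

def can_run(config: ModelConfig, pod_gpus: int, pod_vram_gb: int) -> bool:
    """Illustrative check: enough GPUs, and enough VRAM on each of them."""
    return pod_gpus >= config.min_gpus and pod_vram_gb >= config.vram_per_gpu_gb

# Hypothetical requirement figures for illustration
qwen32b = ModelConfig("Qwen/Qwen2.5-Coder-32B-Instruct", min_gpus=1, vram_per_gpu_gb=80)
print(can_run(qwen32b, pod_gpus=1, pod_vram_gb=80))  # True
print(can_run(qwen32b, pod_gpus=1, pod_vram_gb=24))  # False
```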
### Qwen Models

```bash
# Qwen2.5-Coder-32B - Excellent coding model, fits on a single H100/H200
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen

# Qwen3-Coder-30B - Advanced reasoning with tool use
pi start Qwen/Qwen3-Coder-30B-A3B-Instruct --name qwen3

# Qwen3-Coder-480B - State-of-the-art on 8xH200 (data-parallel mode)
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-480b
```
### GPT-OSS Models

```bash
# Requires the special vLLM build during setup
pi pods setup gpt-pod "ssh root@1.2.3.4" --models-path /workspace --vllm gpt-oss

# GPT-OSS-20B - Fits on 16 GB+ VRAM
pi start openai/gpt-oss-20b --name gpt20

# GPT-OSS-120B - Needs 60 GB+ VRAM
pi start openai/gpt-oss-120b --name gpt120
```
### GLM Models

```bash
# GLM-4.5 - Requires 8-16 GPUs, includes thinking mode
pi start zai-org/GLM-4.5 --name glm

# GLM-4.5-Air - Smaller version, 1-2 GPUs
pi start zai-org/GLM-4.5-Air --name glm-air
```
### Custom Models with --vllm

For models not in the predefined list, use `--vllm` to pass arguments directly to vLLM:

```bash
# DeepSeek with custom settings
pi start deepseek-ai/DeepSeek-V3 --name deepseek --vllm \
  --tensor-parallel-size 4 --trust-remote-code

# Mixtral with pipeline parallelism
pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm \
  --tensor-parallel-size 8 --pipeline-parallel-size 2

# Any model with a specific tool parser
pi start some/model --name mymodel --vllm \
  --tool-call-parser hermes --enable-auto-tool-choice
```
## DataCrunch Setup

DataCrunch offers the best experience, with shared NFS storage across pods:

### 1. Create Shared Filesystem (SFS)

- Go to the DataCrunch dashboard → Storage → Create SFS
- Choose size and datacenter
- Note the mount command (e.g., `sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/hf-models-fin02-8ac1bab7 /mnt/hf-models-fin02`)

### 2. Create GPU Instance

- Create the instance in the same datacenter as the SFS
- Share the SFS with the instance
- Get the SSH command from the dashboard

### 3. Setup with pi

```bash
# Get the mount command from the DataCrunch dashboard
pi pods setup dc1 "ssh root@instance.datacrunch.io" \
  --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"

# Models are automatically stored in /mnt/hf-models (extracted from the mount command)
```

### 4. Benefits

- Models persist across instance restarts
- Share models between multiple instances in the same datacenter
- Download once, use everywhere
- Pay only for storage, not compute time, during downloads
## RunPod Setup

RunPod offers good persistent storage with network volumes:

### 1. Create Network Volume (optional)

- Go to the RunPod dashboard → Storage → Create Network Volume
- Choose size and region

### 2. Create GPU Pod

- Select "Network Volume" during pod creation (if using one)
- Attach your volume at `/runpod-volume`
- Get the SSH command from the pod details

### 3. Setup with pi

```bash
# With a network volume
pi pods setup runpod "ssh root@pod.runpod.io" --models-path /runpod-volume

# Or use the workspace (persists with the pod but is not shareable)
pi pods setup runpod "ssh root@pod.runpod.io" --models-path /workspace
```
## Multi-GPU Support

### Automatic GPU Assignment

When running multiple models, `pi` automatically assigns them to different GPUs:

```bash
pi start model1 --name m1  # Auto-assigns to GPU 0
pi start model2 --name m2  # Auto-assigns to GPU 1
pi start model3 --name m3  # Auto-assigns to GPU 2
```
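The assignment behavior above can be pictured as a first-free-index scheme. A minimal sketch, assuming a simple registry of running models (this is illustrative, not pi's actual implementation):

```python
def assign_gpu(running: dict[str, int], gpu_count: int) -> int:
    """Illustrative first-free assignment: pick the lowest GPU
    index not already used by a running model."""
    used = set(running.values())
    for gpu in range(gpu_count):
        if gpu not in used:
            return gpu
    raise RuntimeError("all GPUs are in use")

running: dict[str, int] = {}
for name in ("m1", "m2", "m3"):
    running[name] = assign_gpu(running, gpu_count=4)
print(running)  # {'m1': 0, 'm2': 1, 'm3': 2}
```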
### Specify GPU Count for Predefined Models

For predefined models with multiple configurations, use `--gpus` to control GPU usage:

```bash
# Run Qwen on 1 GPU instead of all available
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen --gpus 1

# Run GLM-4.5 on 8 GPUs (if it has an 8-GPU config)
pi start zai-org/GLM-4.5 --name glm --gpus 8
```

If the model doesn't have a configuration for the requested GPU count, you'll see the available options.
### Tensor Parallelism for Large Models

For models that don't fit on a single GPU:

```bash
# Shard across 4 GPUs with tensor parallelism
pi start meta-llama/Llama-3.1-70B-Instruct --name llama70b --vllm \
  --tensor-parallel-size 4

# Data parallelism plus expert parallelism for large MoE models
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen480 --vllm \
  --data-parallel-size 8 --enable-expert-parallel
```
## API Integration

All models expose OpenAI-compatible endpoints:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://your-pod-ip:8001/v1",
    api_key="your-pi-api-key"
)

# Chat completion with tool calling
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[
        {"role": "user", "content": "Write a Python function to calculate fibonacci"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "execute_code",
            "description": "Execute Python code",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string"}
                },
                "required": ["code"]
            }
        }
    }],
    tool_choice="auto"
)
```
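When the model decides to call a tool, the response carries the tool's name and JSON-encoded arguments, and the client is responsible for executing it and sending the result back. A minimal dispatch sketch; the payload below is hand-written to mirror the shape of `response.choices[0].message.tool_calls[0]`, and the dispatcher itself is hypothetical:

```python
import json

def dispatch_tool_call(tool_call: dict) -> str:
    """Execute a tool call from an OpenAI-style response.
    Arguments arrive as a JSON string and must be parsed first."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    if name == "execute_code":
        # A real client would sandbox execution; here we just echo the code.
        return f"would execute: {args['code']}"
    raise ValueError(f"unknown tool: {name}")

# Hand-written payload mirroring a tool_calls entry
call = {
    "function": {
        "name": "execute_code",
        "arguments": '{"code": "print(1 + 1)"}',
    }
}
print(dispatch_tool_call(call))  # would execute: print(1 + 1)
```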
## Standalone Agent CLI

`pi` includes a standalone OpenAI-compatible agent that works with any API:

```bash
# Install globally to get the pi-agent command
npm install -g @mariozechner/pi

# Use with OpenAI
pi-agent --api-key sk-... "What is machine learning?"

# Use with local vLLM
pi-agent --base-url http://localhost:8000/v1 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --api-key dummy \
  "Explain quantum computing"

# Interactive mode
pi-agent -i

# Continue previous session
pi-agent --continue "Follow up question"

# Custom system prompt
pi-agent --system-prompt "You are a Python expert" "Write a web scraper"

# Use the Responses API (for GPT-OSS models)
pi-agent --api responses --model openai/gpt-oss-20b "Hello"
```

The agent supports:

- Session persistence across conversations
- Interactive TUI mode with syntax highlighting
- File system tools (read, list, bash, glob, rg) for code navigation
- Both Chat Completions and Responses API formats
- Custom system prompts
## Tool Calling Support

`pi` automatically configures the appropriate tool calling parser for known models:

- **Qwen models**: `hermes` parser (Qwen3-Coder uses `qwen3_coder`)
- **GLM models**: `glm4_moe` parser with reasoning support
- **GPT-OSS models**: use the `/v1/responses` endpoint, as tool calling (function calling in OpenAI parlance) is currently a [WIP with the `v1/chat/completions` endpoint](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#tool-use)
- **Custom models**: specify with `--vllm --tool-call-parser <parser> --enable-auto-tool-choice`

To disable tool calling:

```bash
pi start model --name mymodel --vllm --disable-tool-call-parser
```
## Memory and Context Management

### GPU Memory Allocation

Controls how much GPU memory vLLM pre-allocates:

- `--memory 30%`: High concurrency, limited context
- `--memory 50%`: Balanced
- `--memory 90%`: Maximum context, low concurrency (default)

### Context Window

Sets the maximum number of input + output tokens:

- `--context 4k`: 4,096 tokens total
- `--context 32k`: 32,768 tokens total
- `--context 128k`: 131,072 tokens total
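The `k` suffixes map to binary multiples (1k = 1,024 tokens), which is why `128k` means 131,072. A parsing sketch with a hypothetical helper name:

```python
def parse_context(size: str) -> int:
    """Convert a context spec like '32k' into a token count,
    using binary multiples (1k = 1024). Hypothetical helper."""
    if not size.endswith("k"):
        raise ValueError(f"expected a size like '32k', got {size!r}")
    return int(size[:-1]) * 1024

assert parse_context("4k") == 4096
assert parse_context("32k") == 32768
assert parse_context("128k") == 131072
```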
Example for a coding workload:

```bash
# Large context for code analysis, moderate concurrency
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name coder \
  --context 64k --memory 70%
```

**Note**: When using `--vllm`, the `--memory`, `--context`, and `--gpus` parameters are ignored. You'll see a warning if you try to use them together.
## Session Persistence

The interactive agent mode (`-i`) saves a session per project directory:

```bash
# Start a new session
pi agent qwen -i

# Continue the previous session (maintains chat history)
pi agent qwen -i -c
```

Sessions are stored in `~/.pi/sessions/`, organized by project path, and include:

- Complete conversation history
- Tool call results
- Token usage statistics
## Architecture & Event System

The agent uses a unified event-based architecture where all interactions flow through `AgentEvent` types. This enables:

- Consistent UI rendering across console and TUI modes
- Session recording and replay
- Clean separation between API calls and UI updates
- JSON output mode for programmatic integration

Events are automatically converted to the appropriate API format (Chat Completions or Responses) based on the model type.
### JSON Output Mode

Use the `--json` flag to output the event stream as JSONL (JSON Lines) for programmatic consumption:

```bash
pi-agent --api-key sk-... --json "What is 2+2?"
```

Each line is a complete JSON object representing an event:

```jsonl
{"type":"user_message","text":"What is 2+2?"}
{"type":"assistant_start"}
{"type":"assistant_message","text":"2 + 2 = 4"}
{"type":"token_usage","inputTokens":10,"outputTokens":5,"totalTokens":15,"cacheReadTokens":0,"cacheWriteTokens":0}
```
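A consumer can parse the stream one line at a time. This sketch collects assistant text and totals token usage from a captured stream; the sample lines are the ones shown above, pasted in as a string rather than read from the live CLI:

```python
import json

def summarize(stream: str) -> tuple[str, int]:
    """Collect assistant text and total token count from a JSONL event stream."""
    text, total_tokens = [], 0
    for line in stream.strip().splitlines():
        event = json.loads(line)
        if event["type"] == "assistant_message":
            text.append(event["text"])
        elif event["type"] == "token_usage":
            total_tokens += event["totalTokens"]
    return "".join(text), total_tokens

stream = """\
{"type":"user_message","text":"What is 2+2?"}
{"type":"assistant_start"}
{"type":"assistant_message","text":"2 + 2 = 4"}
{"type":"token_usage","inputTokens":10,"outputTokens":5,"totalTokens":15,"cacheReadTokens":0,"cacheWriteTokens":0}
"""
answer, tokens = summarize(stream)
print(answer, tokens)  # 2 + 2 = 4 15
```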
## Troubleshooting

### OOM (Out of Memory) Errors

- Reduce the `--memory` percentage
- Use a smaller model or a quantized version (FP8)
- Reduce the `--context` size

### Model Won't Start

```bash
# Check GPU usage
pi ssh "nvidia-smi"

# Check whether the port is in use
pi list

# Force stop all models
pi stop
```

### Tool Calling Issues

- Not all models support tool calling reliably
- Try a different parser: `--vllm --tool-call-parser mistral`
- Or disable it: `--vllm --disable-tool-call-parser`

### Access Denied for Models

Some models (Llama, Mistral) require HuggingFace access approval. Visit the model page and click "Request access".
### vLLM Build Issues

If setup with `--vllm nightly` fails, try:

- Using `--vllm release` for the stable version
- Checking CUDA compatibility with `pi ssh "nvidia-smi"`

### Agent Not Finding Messages

If the agent shows its configuration instead of responding to your message, make sure messages with special characters are quoted:

```bash
# Good
pi agent qwen "What is this file about?"

# Bad (the shell might interpret special characters)
pi agent qwen What is this file about?
```
## Advanced Usage

### Working with Multiple Pods

```bash
# Override the active pod for any command
pi start model --name test --pod dev-pod
pi list --pod prod-pod
pi stop test --pod dev-pod
```

### Custom vLLM Arguments

```bash
# Pass any vLLM argument after --vllm
pi start model --name custom --vllm \
  --quantization awq \
  --enable-prefix-caching \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.95
```
### Monitoring

```bash
# Watch GPU utilization
pi ssh "watch -n 1 nvidia-smi"

# Check model downloads
pi ssh "du -sh ~/.cache/huggingface/hub/*"

# View all logs
pi ssh "ls -la ~/.vllm_logs/"

# Check agent session history
ls -la ~/.pi/sessions/
```
## Environment Variables

- `HF_TOKEN` - HuggingFace token for model downloads
- `PI_API_KEY` - API key for the vLLM endpoints
- `PI_CONFIG_DIR` - Config directory (default: `~/.pi`)
- `OPENAI_API_KEY` - Used by `pi-agent` when no `--api-key` is provided
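Scripts wrapping `pi` can resolve these with standard environment lookups. A minimal sketch, with the `PI_CONFIG_DIR` default taken from the list above:

```python
import os
from pathlib import Path

def pi_config_dir() -> Path:
    # PI_CONFIG_DIR overrides the default ~/.pi, per the list above.
    return Path(os.environ.get("PI_CONFIG_DIR", str(Path.home() / ".pi")))

os.environ["PI_CONFIG_DIR"] = "/tmp/pi-test"
print(pi_config_dir())  # /tmp/pi-test
```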
## License

MIT