@@ -0,0 +1,216 @@
+# Guide: Implementing Models in Ollama's Go Inference Engine
+
+> **Note**: This guide and the Go inference engine are in early development and will be updated as implementation details evolve.
+
+This guide outlines the process of implementing a new model in Ollama's Go GGML inference engine. It covers everything from initial setup to deploying your model to ollama.com.
+
+## Architecture Overview
+
+The diagram below shows the layers of Ollama's inference engine and how they interact:
+
+```mermaid
+graph TB
+    subgraph Models["Model Layer: LLM Implementations"]
+        direction TB
+        llama["model/models/llama/model.go"]
+        mllama["model/models/mllama/model.go"]
+        qwen["model/models/qwen2/model.go"]
+        qwen_vl["model/models/qwen2vl/model.go"]
+
+        note1["Each model implements a specific architecture
+        - Defines model parameters
+        - Implements forward pass"]
+    end
+
+    subgraph ML_Ops["Neural Network Operations"]
+        direction TB
+        nn_ops["nn/
+        linear.go - Matrix operations
+        embedding.go - Token embeddings
+        normalization.go - Layer normalization
+        convolution.go - Conv operations"]
+
+        backend["ml/backend.go
+        Hardware Abstraction Layer
+        - Defines tensor operations
+        - Manages computation graphs
+        - Handles memory allocation"]
+
+        note2["Common neural net operations
+        used across different models
+        - Abstracts hardware details
+        - Provides unified API
+        - Manages computation flow"]
+    end
+
+    subgraph GGML["Hardware Execution Layer"]
+        direction TB
+        ggml["ggml.go
+        CGO Interface
+        - Bridges Go and C++
+        - Handles type conversion
+        - Manages memory between languages"]
+
+        subgraph Hardware_Specific["Hardware-Specific Implementations"]
+            direction LR
+            cpu["ggml-cpu.h
+            CPU optimized ops"]
+            cuda["ggml-cuda.h
+            NVIDIA GPU ops"]
+            metal["ggml-metal.h
+            Apple GPU ops"]
+            vulkan["ggml-vulkan.h
+            Cross-platform GPU"]
+            opencl["ggml-opencl.h
+            OpenCL acceleration"]
+        end
+
+        note3["GGML provides optimized
+        implementations for each hardware:
+        - Automatic dispatch
+        - Hardware-specific optimizations
+        - Memory management
+        - Parallel execution"]
+    end
+
+    %% Connections with explanations
+    Models --> |"Makes high-level calls
+    (e.g., self-attention)"| ML_Ops
+    ML_Ops --> |"Translates to tensor operations
+    (e.g., matmul, softmax)"| GGML
+    GGML --> |"Executes optimized code
+    on target hardware"| Hardware_Specific
+
+    %% Styling
+    classDef model fill:#fff,stroke:#01579b,stroke-width:2px
+    classDef ml fill:#fff,stroke:#e65100,stroke-width:2px
+    classDef hw fill:#fff,stroke:#b71c1c,stroke-width:2px
+    classDef note fill:#fff,stroke:#666,stroke-dasharray: 5 5
+
+    class llama,mllama,qwen,qwen_vl model
+    class nn_ops,backend ml
+    class ggml,cpu,cuda,metal,vulkan,opencl hw
+    class note1,note2,note3 note
+
+    %% Style subgraphs
+    style Models fill:#fff,stroke:#01579b,stroke-width:2px
+    style ML_Ops fill:#fff,stroke:#e65100,stroke-width:2px
+    style GGML fill:#fff,stroke:#b71c1c,stroke-width:2px
+    style Hardware_Specific fill:#fff,stroke:#b71c1c,stroke-width:1px
+```
+
+When implementing a new model, you'll primarily work in the model layer, interfacing with the neural network operations layer.
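+
+As a very rough sketch of what that looks like (the engine is early in development, so every identifier and import path below is an illustrative assumption, not the final API), a model implementation pairs a struct of weights with a forward pass built from the nn operations:
+
+```go
+package yourmodel
+
+import (
+	"github.com/ollama/ollama/ml"
+	"github.com/ollama/ollama/ml/nn"
+)
+
+// Model holds the architecture's weights as modules from the nn package.
+// Field and type names here are illustrative only.
+type Model struct {
+	TokenEmbedding *nn.Embedding
+	OutputNorm     *nn.LayerNorm
+	Output         *nn.Linear
+}
+
+// Forward chains operations from the neural network layer; a real model
+// inserts its attention and feed-forward blocks between these calls.
+func (m *Model) Forward(ctx ml.Context, tokens ml.Tensor) (ml.Tensor, error) {
+	hidden := m.TokenEmbedding.Forward(ctx, tokens)
+	// ... transformer blocks go here ...
+	hidden = m.OutputNorm.Forward(ctx, hidden)
+	return m.Output.Forward(ctx, hidden), nil
+}
+```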
+
+## Implementation Steps
+
+### 1. Environment Setup
+
+First, clone the Ollama repository and get it running locally. Follow the development setup guide at:
+https://github.com/ollama/ollama/blob/main/docs/development.md
+
+### 2. Research Implementation
+
+Get the original model implementation running. This typically involves:
+- Cloning the research code repository (usually Python-based)
+- Setting up the required environment
+- Running inference with sample inputs
+- Understanding the model architecture and forward pass
+
+### 3. Project Structure Setup
+
+Create the necessary file structure by referencing previous model implementations. You'll need:
+
+```
+model/models/
+└── your-model/
+    ├── model.go         # Architecture and forward pass implementation
+    ├── convert.go       # Weight conversion logic (PyTorch/SafeTensors to GGML)
+    └── convert_test.go  # Conversion logic tests
+```
+
+Register your model in [model/models/models.go](https://github.com/ollama/ollama/blob/main/model/models/models.go):
+
+```go
+package models
+
+import (
+	_ "github.com/ollama/ollama/model/models/llama"
+	_ "github.com/ollama/ollama/model/models/mllama"
+	_ "github.com/ollama/ollama/model/models/your-model" // Add your model here
+)
+```
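+
+The blank imports matter: importing each model package for its side effects runs that package's `init` function, which is where the model registers itself with the engine.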
+
+### 4. Development Process
+
+1. **Open a Draft PR**
+   - Create a draft pull request in the `ollama/ollama` repository
+   - Use this as a communication channel with Ollama maintainers
+
+2. **Implement Weight Conversion**
+   - Work on `convert.go`
+   - Reference existing conversion implementations
+   - Create a basic Modelfile:
+     ```
+     FROM /path/to/model
+     ```
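+     Here `/path/to/model` points at the original checkpoint (for example, a directory of safetensors weights); the create command below runs your conversion logic against it.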
+   - Test conversion:
+     ```bash
+     go run . create <my-model> -f /path/to/Modelfile
+     ```
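+     Much of `convert.go` typically boils down to mapping the research checkpoint's tensor names onto the names the GGML file should use. A minimal sketch, with illustrative names only:
+     ```go
+     // tensorMap is an illustrative example of renaming source tensors
+     // to their GGML equivalents during conversion.
+     var tensorMap = map[string]string{
+         "model.embed_tokens.weight": "token_embd.weight",
+         "model.norm.weight":         "output_norm.weight",
+         "lm_head.weight":            "output.weight",
+     }
+     ```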
+
+3. **Implement Model Logic**
+   - Implement `New()` and `Forward()` functions in `model.go`
+   - Reference existing model implementations
+   - Debug the forward pass:
+     ```bash
+     OLLAMA_DEBUG=1 go run . run <my-model>
+     ```
+   - Compare the output with the research implementation
+
+4. **Tokenizer Implementation**
+   - Implement a new tokenizer if required (a sketch of the shape follows below)
+   - Ensure compatibility with the model architecture
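+
+   At minimum, a tokenizer has to round-trip text and token IDs. As an illustrative shape only (the engine's actual interface may differ):
+   ```go
+   // TextProcessor is an illustrative tokenizer interface: it encodes
+   // prompts into token IDs and decodes generated IDs back into text.
+   type TextProcessor interface {
+       Encode(s string) ([]int32, error)
+       Decode(ids []int32) (string, error)
+   }
+   ```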
+
+5. **Text Generation Testing**
+   - Implement proper prompt formatting
+   - Test basic generation:
+     ```bash
+     go run . run <my-model> "hello"
+     ```
+
+### 5. Testing
+
+1. Add comprehensive tests to:
+   - `model_test.go`
+   - `convert_test.go`
+
+2. Ensure tests cover:
+   - Weight conversion
+   - Model initialization
+   - Text generation
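+
+For example, a conversion test might look like the sketch below (`mapTensorName` is a hypothetical helper; follow the patterns in the existing `convert_test.go` files for real tests):
+
+```go
+package yourmodel
+
+import "testing"
+
+// TestTensorMapping checks a single source-to-GGML tensor rename.
+func TestTensorMapping(t *testing.T) {
+	got := mapTensorName("model.embed_tokens.weight") // hypothetical helper
+	if want := "token_embd.weight"; got != want {
+		t.Fatalf("mapTensorName() = %q, want %q", got, want)
+	}
+}
+```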
+
+### 6. Model Deployment
+
+1. **Finalize Pull Request**
+   - Move PR out of draft state
+   - Address reviewer feedback
+
+2. **Deploy to ollama.com**
+   - Determine model prompt format
+   - Convert prompt format to Go template
+   - Create final Modelfile:
+     ```
+     FROM <converted-gguf>
+     TEMPLATE <prompt-template>
+     LICENSE <license-info>
+     # Add additional parameters if needed
+     ```
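+     For a chat model, the TEMPLATE entry might look like the sketch below: `.System` and `.Prompt` are standard Modelfile template variables, while the `<|...|>` markers stand in for whatever special tokens your model was trained with:
+     ```
+     TEMPLATE """{{ if .System }}<|system|>
+     {{ .System }}{{ end }}<|user|>
+     {{ .Prompt }}<|assistant|>
+     """
+     ```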
+   - Push to ollama.com:
+     ```bash
+     ollama create <your-namespace>/<your-model> -f /path/to/Modelfile
+     ollama push <your-namespace>/<your-model>
+     ```
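+     Here `<your-namespace>` is your ollama.com username; pushing requires an ollama.com account with your local Ollama public key added to it.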
+
+3. **Integration Testing**
+   - Run end-to-end tests
+   - Verify model behavior in a production environment