# Guide: Implementing Models in Ollama's Go Inference Engine

> **Note**: This guide and the Go inference engine are in early development and will be updated as implementation details evolve.

This guide outlines the process of implementing a new model in Ollama's inference engine. It covers everything from initial setup to deploying your model to ollama.com.

## Architecture Overview

Below is a diagram showing Ollama's inference engine architecture layers and how they interact:

```mermaid
graph TB
    subgraph Models["Model Layer: LLM Implementations"]
        direction TB
        llama["model/models/llama/model.go"]
        mllama["model/models/mllama/model.go"]
        qwen["model/models/qwen2/model.go"]
        qwen_vl["model/models/qwen2vl/model.go"]

        note1["Each model implements a specific architecture<br/>- Defines model parameters<br/>- Implements forward pass"]
    end

    subgraph ML_Ops["Neural Network Operations"]
        direction TB
        nn_ops["nn/<br/>linear.go - Matrix operations<br/>embedding.go - Token embeddings<br/>normalization.go - Layer normalization<br/>convolution.go - Conv operations"]
        backend["ml/backend.go<br/>Hardware Abstraction Layer<br/>- Defines tensor operations<br/>- Manages computation graphs<br/>- Handles memory allocation"]

        note2["Common neural net operations used across different models<br/>- Abstracts hardware details<br/>- Provides unified API<br/>- Manages computation flow"]
    end

    subgraph GGML["Hardware Execution Layer"]
        direction TB
        ggml["ggml.go<br/>CGO Interface<br/>- Bridges Go and C++<br/>- Handles type conversion<br/>- Manages memory between languages"]

        subgraph Hardware_Specific["Hardware-Specific Implementations"]
            direction LR
            cpu["ggml-cpu.h<br/>CPU optimized ops"]
            cuda["ggml-cuda.h<br/>NVIDIA GPU ops"]
            metal["ggml-metal.h<br/>Apple GPU ops"]
            vulkan["ggml-vulkan.h<br/>Cross-platform GPU"]
            opencl["ggml-opencl.h<br/>OpenCL acceleration"]
        end

        note3["GGML provides optimized implementations for each hardware:<br/>- Automatic dispatch<br/>- Hardware-specific optimizations<br/>- Memory management<br/>- Parallel execution"]
    end

    %% Connections with explanations
    Models --> |"Makes high-level calls (e.g., self-attention)"| ML_Ops
    ML_Ops --> |"Translates to tensor operations (e.g., matmul, softmax)"| GGML
    GGML --> |"Executes optimized code on target hardware"| Hardware_Specific

    %% Styling
    classDef model fill:#fff,stroke:#01579b,stroke-width:2px
    classDef ml fill:#fff,stroke:#e65100,stroke-width:2px
    classDef hw fill:#fff,stroke:#b71c1c,stroke-width:2px
    classDef note fill:#fff,stroke:#666,stroke-dasharray: 5 5

    class llama,mllama,qwen,qwen_vl model
    class nn_ops,backend ml
    class ggml,cpu,cuda,metal,vulkan,opencl hw
    class note1,note2,note3 note

    %% Style subgraphs
    style Models fill:#fff,stroke:#01579b,stroke-width:2px
    style ML_Ops fill:#fff,stroke:#e65100,stroke-width:2px
    style GGML fill:#fff,stroke:#b71c1c,stroke-width:2px
    style Hardware_Specific fill:#fff,stroke:#b71c1c,stroke-width:1px
```

When implementing a new model, you'll primarily work in the model layer, interfacing with the neural network operations layer.

## Implementation Steps

### 1. Environment Setup

First, clone the Ollama repository and get it running locally. Follow the development setup guide at:

https://github.com/ollama/ollama/blob/main/docs/development.md

### 2. Research Implementation

Get the original model implementation running. This typically involves:

- Cloning the research code repository (usually Python-based)
- Setting up the required environment
- Running inference with sample inputs
- Understanding the model architecture and forward pass

### 3. Project Structure Setup

Create the necessary file structure by referencing previous model implementations. You'll need:

```
model/
└── your-model/
    ├── model.go         # Architecture and forward pass implementation
    ├── convert.go       # Weight conversion logic (PyTorch/SafeTensors to GGML)
    └── convert_test.go  # Conversion logic tests
```

Add your model to the main paths in [model/models/models.go](https://github.com/ollama/ollama/blob/main/model/models/models.go):

```go
package models

import (
	_ "github.com/ollama/ollama/model/models/llama"
	_ "github.com/ollama/ollama/model/models/mllama"
	_ "github.com/ollama/ollama/model/models/your-model" // Add your model here
)
```

### 4. Development Process

1. **Open a Draft PR**
   - Create a draft pull request in the `ollama/ollama` repository
   - Use this as a communication channel with Ollama maintainers

2. **Implement Weight Conversion**
   - Work on `convert.go`
   - Reference existing conversion implementations
   - Create a basic Modelfile:
     ```
     FROM /path/to/model
     ```
   - Test conversion:
     ```bash
     go run . create <my-model> -f /path/to/Modelfile
     ```

3. **Implement Model Logic**
   - Implement the `New()` and `Forward()` functions in `model.go` (a rough sketch of their shape follows this list)
   - Reference existing model implementations
   - Debug the forward pass:
     ```bash
     OLLAMA_DEBUG=1 go run . run <my-model>
     ```
   - Compare output with the research implementation

4. **Tokenizer Implementation**
   - Implement a new tokenizer if required
   - Ensure compatibility with the model architecture

5. **Text Generation Testing**
   - Implement proper prompt formatting
   - Test basic generation:
     ```bash
     go run . run <my-model> "hello"
     ```
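To make the model logic step more concrete, here is a rough sketch of the overall shape a `model.go` implementation takes. The interfaces in the `model`, `ml`, and `nn` packages are still evolving, so treat every signature, struct tag, config key, and field name below as a placeholder, and copy the current definitions from an existing implementation such as `model/models/llama/model.go`:

```go
// Sketch only: mirror an up-to-date implementation such as
// model/models/llama/model.go rather than copying this verbatim.
// All signatures, gguf tags, and config keys here are placeholders.
package yourmodel // package names can't contain hyphens

import (
	"github.com/ollama/ollama/ml"
	"github.com/ollama/ollama/ml/nn"
	"github.com/ollama/ollama/model"
)

// Options collects architecture hyperparameters read from the
// converted weights' metadata.
type Options struct {
	hiddenSize, numHeads, numKVHeads int
	eps                              float32
}

// Model binds weight tensors to typed layers via gguf struct tags;
// the tag names must match the tensor names produced by convert.go.
type Model struct {
	model.Base

	TokenEmbedding *nn.Embedding `gguf:"token_embd"`
	OutputNorm     *nn.RMSNorm   `gguf:"output_norm"`
	Output         *nn.Linear    `gguf:"output"`

	*Options
}

// New reads hyperparameters from the converted model's config.
func New(c ml.Config) (model.Model, error) {
	return &Model{
		Options: &Options{
			hiddenSize: int(c.Uint("embedding_length")),
			numHeads:   int(c.Uint("attention.head_count")),
			numKVHeads: int(c.Uint("attention.head_count_kv")),
			eps:        c.Float("attention.layer_norm_rms_epsilon"),
		},
	}, nil
}

// Forward runs one pass through the network: embed the input tokens,
// apply each transformer block, then project to vocabulary logits.
// Placeholder body: the real attention/feed-forward blocks, KV cache
// handling, and the exact accessor for input token IDs go here.
func (m *Model) Forward(ctx ml.Context, opts model.Options) (ml.Tensor, error) {
	hidden := m.TokenEmbedding.Forward(ctx, opts.Inputs())
	// ...transformer blocks would update hidden here...
	hidden = m.OutputNorm.Forward(ctx, hidden, m.eps)
	return m.Output.Forward(ctx, hidden), nil
}

func init() {
	// Register the architecture name so the runtime can instantiate it;
	// it must match the architecture written during conversion.
	model.Register("your-model", New)
}
```

The `gguf` struct tags are what tie the two halves of the work together: the tensor names emitted by your conversion logic must line up with the tags declared on the model struct, or loading will fail.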
### 5. Testing

1. Add comprehensive tests to:
   - `model_test.go`
   - `convert_test.go`

2. Ensure tests cover:
   - Weight conversion
   - Model initialization
   - Text generation

### 6. Model Deployment

1. **Finalize Pull Request**
   - Move the PR out of draft state
   - Address reviewer feedback

2. **Deploy to ollama.com**
   - Determine the model's prompt format
   - Convert the prompt format to a Go template
   - Create the final Modelfile (an example sketch follows at the end of this guide):
     ```
     FROM <path-to-converted-model>
     TEMPLATE <prompt-template>
     LICENSE <license-text>
     # Add additional parameters if needed
     ```
   - Push to ollama.com:
     ```bash
     ollama create <your-namespace>/<your-model> -f /path/to/Modelfile
     ollama push <your-namespace>/<your-model>
     ```

3. **Integration Testing**
   - Run end-to-end tests (a minimal smoke test is sketched at the end of this guide)
   - Verify model behavior in a production environment
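As an example of the deployment step, here is what a completed Modelfile might look like. This is a sketch for a hypothetical model trained with ChatML-style tags; the actual template, stop tokens, and license text depend entirely on how your model was trained:

```
FROM /path/to/converted-model

# Go template using Ollama's .System and .Prompt variables;
# the ChatML tags below are an assumption for this example
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER stop <|im_end|>

LICENSE """
<license text>
"""
```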
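For the integration testing step, one quick end-to-end check is to hit a running Ollama server's public `/api/generate` endpoint and confirm the model produces output. A minimal smoke test in Go (the model name is a placeholder):

```go
// Smoke test: send one prompt to a locally running Ollama server
// (ollama serve, default port 11434) and print the response.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	payload, err := json.Marshal(map[string]any{
		"model":  "your-namespace/your-model", // placeholder model name
		"prompt": "hello",
		"stream": false, // single JSON object instead of a token stream
	})
	if err != nil {
		log.Fatal(err)
	}

	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(payload))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var out struct {
		Response string `json:"response"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	fmt.Println(out.Response)
}
```

If this returns coherent text that matches what the research implementation produces for the same prompt, the full pipeline (conversion, forward pass, tokenizer, and template) is working end to end.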