> **Note**: This guide and the Go inference engine are in early development and will be updated as implementation details evolve.
This guide outlines the process of implementing a new model in Ollama's inference engine. It covers everything from initial setup to deploying your model to ollama.com.
Below is a diagram showing Ollama's inference engine architecture layers and how they interact:
```mermaid
graph TB
subgraph Models["Model Layer: LLM Implementations"]
direction TB
llama["model/models/llama/model.go"]
mllama["model/models/mllama/model.go"]
qwen["model/models/qwen2/model.go"]
qwen_vl["model/models/qwen2vl/model.go"]
note1["Each model implements a specific architecture
- Defines model parameters
- Implements forward pass"]
end
subgraph ML_Ops["Neural Network Operations"]
direction TB
nn_ops["nn/
linear.go - Matrix operations
embedding.go - Token embeddings
normalization.go - Layer normalization
convolution.go - Conv operations"]
backend["ml/backend.go
Hardware Abstraction Layer
- Defines tensor operations
- Manages computation graphs
- Handles memory allocation"]
note2["Common neural net operations
used across different models
- Abstracts hardware details
- Provides unified API
- Manages computation flow"]
end
subgraph GGML["Hardware Execution Layer"]
direction TB
ggml["ggml.go
CGO Interface
- Bridges Go and C++
- Handles type conversion
- Manages memory between languages"]
subgraph Hardware_Specific["Hardware-Specific Implementations"]
direction LR
cpu["ggml-cpu.h
CPU optimized ops"]
cuda["ggml-cuda.h
NVIDIA GPU ops"]
metal["ggml-metal.h
Apple GPU ops"]
vulkan["ggml-vulkan.h
Cross-platform GPU"]
opencl["ggml-opencl.h
OpenCL acceleration"]
end
note3["GGML provides optimized
implementations for each hardware:
- Automatic dispatch
- Hardware-specific optimizations
- Memory management
- Parallel execution"]
end
%% Connections with explanations
Models --> |"Makes high-level calls
(e.g., self-attention)"| ML_Ops
ML_Ops --> |"Translates to tensor operations
(e.g., matmul, softmax)"| GGML
GGML --> |"Executes optimized code
on target hardware"| Hardware_Specific
%% Styling
classDef model fill:#fff,stroke:#01579b,stroke-width:2px
classDef ml fill:#fff,stroke:#e65100,stroke-width:2px
classDef hw fill:#fff,stroke:#b71c1c,stroke-width:2px
classDef note fill:#fff,stroke:#666,stroke-dasharray: 5 5
class llama,mllama,qwen,qwen_vl model
class nn_ops,backend ml
class ggml,cpu,cuda,metal,vulkan,opencl hw
class note1,note2,note3 note
%% Style subgraphs
style Models fill:#fff,stroke:#01579b,stroke-width:2px
style ML_Ops fill:#fff,stroke:#e65100,stroke-width:2px
style GGML fill:#fff,stroke:#b71c1c,stroke-width:2px
style Hardware_Specific fill:#fff,stroke:#b71c1c,stroke-width:1px
```
When implementing a new model, you'll primarily work in the model layer, interfacing with the neural network operations layer.
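To make the layering concrete, here is a minimal sketch of how one high-level model-layer call decomposes into the tensor primitives that GGML ultimately dispatches to hardware. The `Tensor` interface below is an illustrative assumption, not the real `ml` API; see `ml/backend.go` for the actual abstraction.

```go
// Illustrative sketch only: the names below are assumptions chosen
// to mirror the hardware abstraction in ml/backend.go.
package sketch

import "math"

// Tensor stands in for the hardware-abstracted tensor type. Each
// method corresponds to an operation GGML dispatches to the best
// available backend (CPU, CUDA, Metal, ...).
type Tensor interface {
	Matmul(other Tensor) Tensor
	Scale(s float64) Tensor
	Softmax() Tensor
}

// attention is the model-layer view: one conceptual operation...
func attention(q, k, v Tensor, headDim int) Tensor {
	// ...which the operations layer expands into matmul, scale, and
	// softmax tensor primitives executed by GGML.
	scores := q.Matmul(k).Scale(1 / math.Sqrt(float64(headDim)))
	return scores.Softmax().Matmul(v)
}
```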
First, clone the Ollama repository and get it running locally. Follow the development setup guide at: https://github.com/ollama/ollama/blob/main/docs/development.md
Next, get the original model implementation running. This typically involves cloning the research repository (usually Python-based), setting up its environment, and running inference on sample prompts so you have reference outputs to compare against later.
Create the necessary file structure by referencing previous model implementations. You'll need:
```
model/
└── your-model/
    ├── model.go          # Architecture and forward pass implementation
    ├── convert.go        # Weight conversion logic (PyTorch/SafeTensors to GGML)
    └── convert_test.go   # Conversion logic tests
```
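For orientation, below is a heavily simplified skeleton of what `model.go` provides. All names and signatures here are assumptions; copy the real shape from an existing implementation such as `model/models/llama/model.go`.

```go
// Sketch of the pieces model.go provides; illustrative only.
package yourmodel

// Options holds architecture hyperparameters read from the
// converted weights' metadata.
type Options struct {
	hiddenSize int
	numHeads   int
	numLayers  int
	eps        float32
}

// Model owns the weight tensors and implements the forward pass.
type Model struct {
	opts Options
	// embeddings, attention/MLP weights, and norms go here
}

// New parses hyperparameters and loads tensors from the converted
// model; the config argument is a stand-in for the engine's real
// metadata type.
func New(config map[string]any) (*Model, error) {
	return &Model{}, nil
}

// Forward runs one inference step: embed tokens, apply each
// transformer block, then project to vocabulary logits.
func (m *Model) Forward(tokens []int32) ([]float32, error) {
	return nil, nil
}
```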
Add your model to the imports in `model/models/models.go`:
```go
package models

import (
	_ "github.com/ollama/ollama/model/models/llama"
	_ "github.com/ollama/ollama/model/models/mllama"
	_ "github.com/ollama/ollama/model/models/your-model" // Add your model here
)
```
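The blank imports work because each model package registers itself at import time. A sketch of that pattern follows; the exact `Register` signature is an assumption modeled on the import pattern above.

```go
// In model/models/your-model/model.go. Registering the constructor
// under the architecture name lets the engine build this model when
// it encounters that architecture in a converted model file.
package yourmodel

import "github.com/ollama/ollama/model"

func init() {
	model.Register("your-model", New)
}
```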
## Open a Draft PR

Open a draft pull request in the [ollama/ollama](https://github.com/ollama/ollama) repository so maintainers can give feedback as you work.

## Implement Weight Conversion
Write the weight conversion logic in `convert.go`, mapping the original checkpoint's tensor names and layouts (PyTorch or SafeTensors) to their GGML equivalents.

Create a basic Modelfile:

```
FROM /path/to/model
```

Test the conversion:

```shell
go run . create <my-model> -f /path/to/Modelfile
```
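The core of a converter is usually a mapping from the source checkpoint's tensor names to GGML's naming conventions. The sketch below is hypothetical and patterned on existing converters; the exact set of names depends on your architecture.

```go
package yourmodel

// tensorNameMap pairs source-checkpoint tensor name patterns with
// their GGML equivalents (%d is the layer index). These rows are
// illustrative, not exhaustive.
var tensorNameMap = map[string]string{
	"model.embed_tokens":               "token_embd",
	"model.layers.%d.self_attn.q_proj": "blk.%d.attn_q",
	"model.layers.%d.self_attn.k_proj": "blk.%d.attn_k",
	"model.layers.%d.self_attn.v_proj": "blk.%d.attn_v",
	"model.layers.%d.mlp.gate_proj":    "blk.%d.ffn_gate",
	"model.norm":                       "output_norm",
}
```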
## Implement Model Logic

Implement the `New()` and `Forward()` functions in `model.go`.

Debug the forward pass:

```shell
OLLAMA_DEBUG=1 go run . run <my-model>
```

Compare the output with the research implementation to verify the forward pass is correct.
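One low-tech but effective way to compare against the reference is to dump summary statistics of intermediate activations at matching points in both implementations. The helper below is purely illustrative.

```go
package yourmodel

import "fmt"

// dumpStats prints simple summary statistics for an activation so it
// can be checked against the reference implementation's output at
// the same layer.
func dumpStats(name string, xs []float32) {
	if len(xs) == 0 {
		fmt.Printf("%s: empty\n", name)
		return
	}
	sum, lo, hi := 0.0, float64(xs[0]), float64(xs[0])
	for _, x := range xs {
		v := float64(x)
		sum += v
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	fmt.Printf("%s: n=%d mean=%.6f min=%.6f max=%.6f\n",
		name, len(xs), sum/float64(len(xs)), lo, hi)
}
```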
## Tokenizer Implementation

Implement a tokenizer that reproduces the original model's vocabulary, merges, and special tokens.
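Conceptually, the engine needs a processor that round-trips text and token IDs. The interface below is a simplified assumption; check how existing models wire up their BPE or SentencePiece vocabularies for the real contract.

```go
package yourmodel

// TextProcessor is a simplified stand-in for the tokenizer contract
// the engine expects.
type TextProcessor interface {
	Encode(text string) ([]int32, error)
	Decode(ids []int32) (string, error)
}
```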
## Text Generation Testing

Test basic generation:

```shell
go run . run <my-model> "hello"
```
Add comprehensive tests to:

- `model_test.go`
- `convert_test.go`

Make sure the tests cover both the weight conversion logic and end-to-end model execution, as in the sketch below.
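A table-driven Go test is a natural fit here. For example, `convert_test.go` can assert that source tensor names map to the expected GGML names; `mapTensorName` is a hypothetical helper that applies the mapping table from your converter.

```go
package yourmodel

import "testing"

// TestTensorNameMapping checks a few known renames; extend the table
// as you add tensors.
func TestTensorNameMapping(t *testing.T) {
	cases := []struct{ in, want string }{
		{"model.embed_tokens", "token_embd"},
		{"model.norm", "output_norm"},
	}
	for _, c := range cases {
		if got := mapTensorName(c.in); got != c.want {
			t.Errorf("mapTensorName(%q) = %q, want %q", c.in, got, c.want)
		}
	}
}
```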
## Finalize Pull Request

Once the model's output matches the reference implementation and the tests pass, mark your draft PR as ready for review.
## Deploy to ollama.com

Create the final Modelfile:

```
FROM <converted-gguf>
TEMPLATE <prompt-template>
LICENSE <license-info>
# Add additional parameters if needed
```

Push to ollama.com:

```shell
ollama create <your-namespace>/<your-model> -f /path/to/Modelfile
ollama push <your-namespace>/<your-model>
```
## Integration Testing

After publishing, pull the model from ollama.com and verify that it downloads and generates text correctly end to end.