Note: This guide and the Go inference engine are in early development and will be updated as implementation details evolve.
This guide outlines the process of implementing a new model in Ollama's inference engine. It covers everything from initial setup to publishing your model to ollama.com.
Below is a diagram showing Ollama's inference engine architecture layers and how they interact:
graph TB
subgraph Models["Model Layer: LLM Implementations"]
direction TB
llama["model/models/llama"]
mllama["model/models/mllama"]
qwen["model/models/qwen2"]
etc["...etc"]
note1[" Each model implements a<br>specific architecture:<br>- Defines model parameters<br>- Implements forward pass"]
end
subgraph ML_Ops["Neural Network Operations"]
direction TB
nn_ops[" nn/<br>linear.go: Matrix multiplication<br>embedding.go: Token embedding lookups<br>normalization.go: Layer norm operations<br>convolution.go: Convolutional operations "]
backend[" ml/backend.go<br>Hardware Abstraction Layer:<br>- Defines tensor operations<br>- Manages computation graphs<br>- Handles memory allocation "]
note2[" Common neural net operations:<br>- Abstracts hardware details<br>- Provides unified API<br>- Manages computation flow "]
end
subgraph Hardware["Backend Execution Layer"]
direction TB
backend_impl[" The backend package provides:<br>- Unified computation interface<br>- Automatic hardware selection<br>- Optimized kernels<br>- Efficient memory management "]
subgraph Backends["Backend Implementations"]
direction LR
cpu["backend/cpu<br>- Pure Go implementation<br>- Fallback for all platforms"]
metal["backend/metal<br>- Apple Silicon (M1/M2/M3)<br>- MLX integration<br>- Leverages Apple Neural Engine"]
onnx["backend/onnx<br>- Cross-platform compatibility<br>- ONNX Runtime integration<br>- Pre-compiled graph execution"]
ggml["backend/ggml<br>- CPU/GPU quantized compute<br>- Low-precision operations<br>- Memory-efficient inferencing"]
end
end
Models --> |" Makes high-level calls<br>(e.g., self-attention) "| ML_Ops
ML_Ops --> |" Translates to tensor operations<br>(e.g., matmul, softmax) "| Hardware
backend_impl --> Backends
When implementing a new model, you'll primarily work in the model layer, interfacing with the neural network operations layer.
Here's the high-level process for implementing a new model in Ollama:
First, clone the Ollama repository and get it running locally. Follow the development setup guide at: https://github.com/ollama/ollama/blob/main/docs/development.md
Next, get the original model implementation running. This typically means running the reference (research) implementation locally on a few test prompts, so you understand its behavior and have known-good outputs to compare against during development.
Create the necessary file structure by referencing previous model implementations. You'll need:
convert/
└── convert_your-model.go # Weight conversion logic (PyTorch/SafeTensors to GGML)
model/
└── your-model/
└── model.go # Architecture and forward pass implementation
Register your model by adding a blank import to model/models/models.go:
package models
import (
_ "github.com/ollama/ollama/model/models/llama"
_ "github.com/ollama/ollama/model/models/mllama"
_ "github.com/ollama/ollama/model/models/your-model" // Add your model here
)
Create a simple Modelfile early in the process to facilitate testing:
FROM /path/to/model
TEMPLATE "{{.Prompt}}" # Use a static prompt format for initial testing
This allows you to test your implementation with consistent inputs before finalizing the proper prompt template.
Now implement the weight conversion logic in convert/convert_your-model.go.
Typical GGUF Layout:
GGUF
├── Metadata Section
│ ├── Model Parameters
│ │ ├── General architecture parameters
│ │ │ ├── "{arch}.vocab_size" (e.g., "llama.vocab_size")
│ │ │ ├── "{arch}.context_length" (e.g., "llama.context_length")
│ │ │ ├── "{arch}.embedding_length" (e.g., "llama.embedding_length")
│ │ │ └── "{arch}.block_count" (e.g., "llama.block_count")
│ │ │
│ │ └── Architecture-specific parameters
│ │ ├── "{arch}.attention.head_count" (e.g., "llama.attention.head_count")
│ │ ├── "{arch}.attention.head_count_kv" (e.g., "llama.attention.head_count_kv")
│ │ ├── "{arch}.rope.dimension_count" (e.g., "llama.rope.dimension_count")
│ │ └── "{arch}.attention.layer_norm_rms_epsilon" (e.g., "llama.attention.layer_norm_rms_epsilon")
│ │
│ ├── Tokenizer parameters
│ │ ├── "tokenizer.ggml.model" (e.g., "llama")
│ │ ├── "tokenizer.ggml.tokens" (vocabulary tokens)
│ │ ├── "tokenizer.ggml.bos_id" (beginning of sequence token ID)
│ │ └── "tokenizer.ggml.eos_id" (end of sequence token ID)
│ │
│ └── General metadata
│ └── "general.architecture" (e.g., "llama", "qwen2", "phi")
│
└── Tensor Data Section
├── Common tensors:
│ ├── "token_embd.weight" (token embedding matrix)
│ ├── "rope_freqs.weight" (RoPE frequency weights)
│ ├── "output_norm.weight" (final layer normalization)
│ └── "output.weight" (output projection)
│
└── Layer-specific tensors:
├── "blk.{i}.attn_q.weight" (query projection)
├── "blk.{i}.attn_k.weight" (key projection)
├── "blk.{i}.attn_v.weight" (value projection)
├── "blk.{i}.attn_output.weight" (attention output)
├── "blk.{i}.attn_norm.weight" (attention normalization)
├── "blk.{i}.ffn_norm.weight" (feed-forward normalization)
├── "blk.{i}.ffn_up.weight" (FFN up projection)
├── "blk.{i}.ffn_down.weight" (FFN down projection)
└── "blk.{i}.ffn_gate.weight" (FFN gate projection)
Key conversion details include mapping the source checkpoint's tensor names to the GGUF names shown above, writing the metadata keys, and applying any tensor transformations the GGUF layout expects (for example, permuting attention projection weights used with RoPE). A sketch of the name mapping follows.
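As a concrete illustration of the name mapping, here is a minimal, hypothetical helper; convertTensorName and its mapping table are invented for this sketch (Ollama's convert package handles this through its own converter types, so follow an existing convert_*.go file for the real pattern):
package convert

import (
	"fmt"
	"regexp"
)

// layerPattern matches Hugging Face-style per-layer tensor names,
// e.g. "model.layers.0.self_attn.q_proj.weight".
var layerPattern = regexp.MustCompile(`^model\.layers\.(\d+)\.(.+)$`)

// suffixMap translates per-layer suffixes to their GGUF equivalents.
var suffixMap = map[string]string{
	"self_attn.q_proj.weight":         "attn_q.weight",
	"self_attn.k_proj.weight":         "attn_k.weight",
	"self_attn.v_proj.weight":         "attn_v.weight",
	"self_attn.o_proj.weight":         "attn_output.weight",
	"input_layernorm.weight":          "attn_norm.weight",
	"post_attention_layernorm.weight": "ffn_norm.weight",
	"mlp.gate_proj.weight":            "ffn_gate.weight",
	"mlp.up_proj.weight":              "ffn_up.weight",
	"mlp.down_proj.weight":            "ffn_down.weight",
}

// convertTensorName maps a source tensor name to its GGUF name,
// returning false for tensors with no direct mapping.
func convertTensorName(name string) (string, bool) {
	if m := layerPattern.FindStringSubmatch(name); m != nil {
		if suffix, ok := suffixMap[m[2]]; ok {
			return fmt.Sprintf("blk.%s.%s", m[1], suffix), true
		}
		return "", false
	}
	switch name {
	case "model.embed_tokens.weight":
		return "token_embd.weight", true
	case "model.norm.weight":
		return "output_norm.weight", true
	case "lm_head.weight":
		return "output.weight", true
	}
	return "", false
}
For example, convertTensorName("model.layers.11.mlp.down_proj.weight") returns "blk.11.ffn_down.weight".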
Test conversion:
go run . create <my-model> -f /path/to/Modelfile
After implementing the initial weight conversion, opening a draft pull request is recommended: it enables early feedback on your approach and makes your progress visible to maintainers.
To open a draft PR, fork the ollama/ollama repository, push a branch containing your work-in-progress implementation, then open a PR against ollama/ollama and mark it as a draft.
Next, implement the New() and Forward() functions in model.go:
The New() function initializes the model: it sets up the tokenizer, reads model parameters from the converted metadata, allocates the layer array, and creates the KV cache.
Example:
func New(c ml.Config) (model.Model, error) {
m := &Model{
// Initialize tokenizer
BytePairEncoding: model.NewBytePairEncoding(...),
// Create layer arrays
Layers: make([]Layer, c.Uint("block_count")),
// Set model parameters
Options: &Options{...},
}
// Initialize KV cache for efficient inference
m.Cache = kvcache.NewCausalCache(m.Shift)
return m, nil
}
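The Options literal above is left elided. As a rough sketch of what it might hold: the field names below are hypothetical, the config keys follow the llama entries from the GGUF layout above (the "{arch}." prefix is resolved by the config, so "llama.attention.head_count" is read as "attention.head_count"), and the default-value argument to Float is an assumption; verify all of this against the current ml.Config API.
type Options struct {
	hiddenSize, numHeads, numKVHeads int
	eps, ropeBase, ropeScale         float32
	ropeDim                          uint32
}

// Inside New(): read model parameters from the converted GGUF metadata.
opts := &Options{
	hiddenSize: int(c.Uint("embedding_length")),
	numHeads:   int(c.Uint("attention.head_count")),
	numKVHeads: int(c.Uint("attention.head_count_kv")),
	eps:        c.Float("attention.layer_norm_rms_epsilon"),
	ropeBase:   c.Float("rope.freq_base"),
	ropeScale:  c.Float("rope.freq_scale", 1),
	ropeDim:    c.Uint("rope.dimension_count"),
}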
The Forward() function builds the computation graph for a single inference step: it embeds the input tokens, passes the hidden states through each transformer layer while updating the KV cache, then applies the final normalization and output projection, returning logits only for the requested output positions.
Example:
func (m *Model) Forward(ctx ml.Context, opts model.Options) (ml.Tensor, error) {
// Convert inputs to tensors
inputTensor, _ := ctx.FromIntSlice(opts.Inputs, len(opts.Inputs))
positionsTensor, _ := ctx.FromIntSlice(opts.Positions, len(opts.Positions))
// Initial token embedding
hiddenStates := m.TokenEmbedding.Forward(ctx, inputTensor)
// Process through transformer layers
for i, layer := range m.Layers {
m.Cache.SetLayer(i)
hiddenStates = layer.Forward(ctx, hiddenStates, positionsTensor, m.Cache, m.Options)
}
// Final processing and output
normalizedOutput := m.OutputNorm.Forward(ctx, hiddenStates, m.modelEpsilon)
logits := m.Output.Forward(ctx, normalizedOutput)
// Return logits for requested positions
outputsTensor, _ := ctx.FromIntSlice(opts.Outputs, len(opts.Outputs))
return logits.Rows(ctx, outputsTensor), nil
}
Key Components to Implement:
KV Cache:
Use kvcache.NewCausalCache() for autoregressive models, and implement a Shift() function so rotary position embeddings stay consistent when the cache shifts; see the sketch below.
Self-Attention:
Implement the query, key, and value projections, apply rotary position embeddings, and compute scaled dot-product attention over the cached keys and values, as also sketched below.
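Here is a simplified sketch of both pieces, modeled loosely on the llama implementation. The struct fields, the RoPE arguments, and the nn.Attention helper are assumptions that may not match the current ml, nn, and kvcache APIs exactly; treat this as a starting shape, not the definitive code.
// Sketch only; imports assumed: math, plus the ollama ml, nn, and kvcache packages.
type SelfAttention struct {
	Query  *nn.Linear `gguf:"attn_q"`
	Key    *nn.Linear `gguf:"attn_k"`
	Value  *nn.Linear `gguf:"attn_v"`
	Output *nn.Linear `gguf:"attn_output"`
}

func (sa *SelfAttention) Forward(ctx ml.Context, hiddenState, positions ml.Tensor, cache kvcache.Cache, opts *Options) ml.Tensor {
	batchSize := hiddenState.Dim(1)
	headDim := opts.hiddenSize / opts.numHeads

	// Project to Q/K/V, split into heads, and apply rotary embeddings
	// (nil stands in for optional RoPE frequency factors).
	q := sa.Query.Forward(ctx, hiddenState).Reshape(ctx, headDim, opts.numHeads, batchSize)
	q = q.RoPE(ctx, positions, nil, opts.ropeDim, opts.ropeBase, opts.ropeScale)
	k := sa.Key.Forward(ctx, hiddenState).Reshape(ctx, headDim, opts.numKVHeads, batchSize)
	k = k.RoPE(ctx, positions, nil, opts.ropeDim, opts.ropeBase, opts.ropeScale)
	v := sa.Value.Forward(ctx, hiddenState).Reshape(ctx, headDim, opts.numKVHeads, batchSize)

	// Scaled dot-product attention over the cached keys and values.
	attn := nn.Attention(ctx, q, k, v, 1/math.Sqrt(float64(headDim)), cache)
	return sa.Output.Forward(ctx, attn.Reshape(ctx, opts.hiddenSize, batchSize))
}

// Shift keeps rotary embeddings consistent when the causal cache slides:
// it re-rotates cached keys by the position delta. It is registered via
// kvcache.NewCausalCache(m.Shift) in New().
func (m *Model) Shift(ctx ml.Context, layer int, key, shift ml.Tensor) (ml.Tensor, error) {
	return key.RoPE(ctx, shift, nil, m.Options.ropeDim, m.Options.ropeBase, m.Options.ropeScale), nil
}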
Normalization Layers:
Apply the architecture's normalization (typically RMSNorm in recent models), passing the epsilon value from the model metadata, for example:
normalizedOutput := m.OutputNorm.Forward(ctx, hiddenStates, m.modelEpsilon)
Activation Functions:
Implement whichever activation the architecture specifies, for example:
// SwiGLU activation in MLP
gateActivation := mlp.Gate.Forward(ctx, hiddenState).SILU(ctx)
upProjection := mlp.Up.Forward(ctx, hiddenState)
intermediateStates := gateActivation.Mul(ctx, upProjection)
Run your forward pass:
# in the root of the ollama directory
go build .
OLLAMA_DEBUG=1 ./ollama serve
OLLAMA_DEBUG=1 ./ollama run <my-model>
Compare the output with the research implementation: given identical inputs and greedy decoding, both should produce the same tokens.
Add comprehensive tests to model_test.go and convert_test.go. Ensure they cover weight conversion, model initialization, and tokenization/text generation; a starting point is sketched below.
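As a starting point, here is a minimal table-driven test in the standard Go style. It exercises the hypothetical convertTensorName helper from the conversion sketch earlier; adapt the cases (and the function under test) to your actual converter.
package convert

import "testing"

func TestConvertTensorName(t *testing.T) {
	cases := []struct {
		in   string
		want string
	}{
		{"model.embed_tokens.weight", "token_embd.weight"},
		{"model.layers.0.self_attn.q_proj.weight", "blk.0.attn_q.weight"},
		{"model.layers.11.mlp.down_proj.weight", "blk.11.ffn_down.weight"},
		{"model.norm.weight", "output_norm.weight"},
	}
	for _, tc := range cases {
		got, ok := convertTensorName(tc.in)
		if !ok || got != tc.want {
			t.Errorf("convertTensorName(%q) = %q, %v; want %q, true", tc.in, got, ok, tc.want)
		}
	}
}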
Create Final Modelfile
Replace the static prompt with the proper Go template for your model:
FROM <converted-gguf>
TEMPLATE <prompt-template> # Add the proper Go template for your model, including tools if needed
LICENSE <license-info> # Add appropriate license information
# Add additional parameters if needed
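For reference, a chat-style template might look like the following. The role markers here (<|user|>, <|assistant|>, <|end|>) are placeholders invented for this sketch; substitute the actual special tokens from your model's documentation:
FROM <converted-gguf>
TEMPLATE """{{- range .Messages }}<|{{ .Role }}|>
{{ .Content }}<|end|>
{{ end }}<|assistant|>
"""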
End-to-end Testing
Run the model with the final Modelfile and verify that it handles real conversations (and tool calls, if supported) correctly.
Benchmark
Run performance benchmarks on your model implementation:
# in one terminal, from the root of the Ollama directory
go build .
OLLAMA_DEBUG=1 ./ollama serve
# in a second terminal, run the benchmarks against the local server
go test -bench=. -m <your-model-name> ./...
Finalize Pull Request
Mark your draft PR as ready for review and address any reviewer feedback.
Publish to ollama.com
Push to ollama.com:
ollama create <your-namespace>/<your-model> -f /path/to/Modelfile
ollama push <your-namespace>/<your-model>