> **Note**: This guide and the Go inference engine are in early development and will be updated as implementation details evolve.

This guide outlines the process of implementing a new model in Ollama's inference engine. It covers everything from initial setup to publishing your model to ollama.com.

## Architecture Overview

Below is a diagram showing Ollama's inference engine architecture layers and how they interact:

```mermaid
graph TB
    subgraph Models["Model Layer: LLM Implementations"]
        direction TB
        llama["model/models/llama"]
        mllama["model/models/mllama"]
        qwen["model/models/qwen2"]
        etc["...etc"]

        note1[" Each model implements a<br>specific architecture:<br>- Defines model parameters<br>- Implements forward pass"]
    end

    subgraph ML_Ops["Neural Network Operations"]
        direction TB
        nn_ops[" nn/<br>linear.go: Matrix multiplication<br>embedding.go: Token embedding lookups<br>normalization.go: Layer norm operations<br>convolution.go: Convolutional operations "]

        backend[" ml/backend.go<br>Hardware Abstraction Layer:<br>- Defines tensor operations<br>- Manages computation graphs<br>- Handles memory allocation "]

        note2[" Common neural net operations:<br>- Abstracts hardware details<br>- Provides unified API<br>- Manages computation flow "]
    end

    subgraph Hardware["Backend Execution Layer"]
        direction TB
        backend_impl[" The backend package provides:<br>- Unified computation interface<br>- Automatic hardware selection<br>- Optimized kernels<br>- Efficient memory management "]
    end

    Models --> |" Makes high-level calls<br>(e.g., self-attention) "| ML_Ops
    ML_Ops --> |" Translates to tensor operations<br>(e.g., matmul, softmax) "| Hardware
```

When implementing a new model, you'll primarily work in the model layer, interfacing with the neural network operations layer.
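
To make that boundary concrete, here is a sketch of what a layer in the `nn/` package looks like. Treat it as illustrative rather than the definitive source: the struct tags and exact signatures in `nn/linear.go` may differ.

```go
// Linear holds weight tensors loaded from the GGUF file. The gguf tags
// indicate which tensor names populate each field.
type Linear struct {
    Weight ml.Tensor `gguf:"weight"`
    Bias   ml.Tensor `gguf:"bias"`
}

// Forward records a matrix multiplication (plus an optional bias add) in
// the computation graph; the backend decides later how to execute it on
// the available hardware.
func (m *Linear) Forward(ctx ml.Context, t ml.Tensor) ml.Tensor {
    t = m.Weight.Mulmat(ctx, t)
    if m.Bias != nil {
        t = t.Add(ctx, m.Bias)
    }
    return t
}
```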

## Implementation Process Overview

Here's the high-level process for implementing a new model in Ollama:

1. **Environment Setup**: Clone the repository and set up your development environment
2. **Research Implementation**: Understand the original model architecture
3. **Project Structure Setup**: Set up the necessary file structure
4. **Create a Basic Modelfile**: Create a simple Modelfile for testing
5. **Implement Weight Conversion**: Map weights from the original format to GGUF
6. **Open a Draft PR**: Create a draft pull request to establish communication with maintainers
7. **Implement Model Logic**: Create the model architecture and forward pass
8. **Quality Check and Final Steps**: Create the final Modelfile, add tests, and verify functionality
9. **Finalize PR and Publish**: Complete the PR and publish to ollama.com

## Implementation Steps in Detail

### 1. Environment Setup

Clone the Ollama repository and set up your development environment so you can build and run the engine locally.

### 2. Research Implementation

Get the original model implementation running. This typically involves:

- Cloning the reference implementation
- Setting up its environment and dependencies
- Running inference with sample inputs so you have outputs to compare against later

### 3. Project Structure Setup

Create the necessary file structure by referencing previous model implementations. You'll need:

```
convert/
└── convert_your-model.go    # Weight conversion logic (PyTorch/SafeTensors to GGML)
model/
└── your-model/
    └── model.go             # Architecture and forward pass implementation
```

Add your model to the main paths in [model/models/models.go](https://github.com/ollama/ollama/blob/main/model/models/models.go):

```go
import (
    _ "github.com/ollama/ollama/model/models/llama"
    _ "github.com/ollama/ollama/model/models/mllama"
    _ "github.com/ollama/ollama/model/models/qwen2"
    _ "github.com/ollama/ollama/model/models/your-model" // Add your model here
)
```

The blank imports matter: each model package registers its architecture when imported, so adding the import is what makes your model available to the engine.

### 4. Create a Basic Modelfile

Create a simple Modelfile early in the process to facilitate testing:

```
FROM /path/to/model
TEMPLATE "{{.Prompt}}" # Use a static prompt format for initial testing
```

This allows you to test your implementation with consistent inputs before finalizing the proper prompt template.
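
For example, once the weights are in place you can exercise the create-and-run loop from the repository root (the same `create` command appears in the conversion step below):

```bash
go run . create <my-model> -f /path/to/Modelfile
go run . run <my-model> "hello"
```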

### 5. Implement Weight Conversion

- Work on `convert/convert_your-model.go`
- Reference existing conversion implementations
- Conversion maps tensor names from the PyTorch/SafeTensors checkpoint to their GGUF equivalents
- Understand the typical GGUF layout and structure:

  **Typical GGUF Layout:**
  ```
  GGUF
  ├── Metadata Section
  │   ├── Model Parameters
  │   │   ├── General architecture parameters
  │   │   │   ├── "{arch}.vocab_size" (e.g., "llama.vocab_size")
  │   │   │   ├── "{arch}.context_length" (e.g., "llama.context_length")
  │   │   │   ├── "{arch}.embedding_length" (e.g., "llama.embedding_length")
  │   │   │   └── "{arch}.block_count" (e.g., "llama.block_count")
  │   │   │
  │   │   └── Architecture-specific parameters
  │   │       ├── "{arch}.attention.head_count" (e.g., "llama.attention.head_count")
  │   │       ├── "{arch}.attention.head_count_kv" (e.g., "llama.attention.head_count_kv")
  │   │       ├── "{arch}.rope.dimension_count" (e.g., "llama.rope.dimension_count")
  │   │       └── "{arch}.attention.layer_norm_rms_epsilon" (e.g., "llama.attention.layer_norm_rms_epsilon")
  │   │
  │   ├── Tokenizer parameters
  │   │   ├── "tokenizer.ggml.model" (e.g., "llama")
  │   │   ├── "tokenizer.ggml.tokens" (vocabulary tokens)
  │   │   ├── "tokenizer.ggml.bos_id" (beginning of sequence token ID)
  │   │   └── "tokenizer.ggml.eos_id" (end of sequence token ID)
  │   │
  │   └── General metadata
  │       └── "general.architecture" (e.g., "llama", "qwen2", "phi")
  │
  └── Tensor Data Section
      ├── Common tensors:
      │   ├── "token_embd.weight" (token embedding matrix)
      │   ├── "rope_freqs.weight" (RoPE frequency weights)
      │   ├── "output_norm.weight" (final layer normalization)
      │   └── "output.weight" (output projection)
      │
      └── Layer-specific tensors:
          ├── "blk.{i}.attn_q.weight" (query projection)
          ├── "blk.{i}.attn_k.weight" (key projection)
          ├── "blk.{i}.attn_v.weight" (value projection)
          ├── "blk.{i}.attn_output.weight" (attention output)
          ├── "blk.{i}.attn_norm.weight" (attention normalization)
          ├── "blk.{i}.ffn_norm.weight" (feed-forward normalization)
          ├── "blk.{i}.ffn_up.weight" (FFN up projection)
          ├── "blk.{i}.ffn_down.weight" (FFN down projection)
          └── "blk.{i}.ffn_gate.weight" (FFN gate projection)
  ```

- Key conversion details include:
  - Linear weight matrices (sometimes need transposition)
  - Layer normalization weights (might need reshaping)
  - **Note: In the GGML naming convention, the `ffn_*` (feed-forward network) tensors are the MLP (multi-layer perceptron) part of each block**

- Test conversion:
  ```bash
  go run . create <my-model> -f /path/to/Modelfile
  ```
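
To make the name mapping concrete, below is a self-contained sketch of the remapping step. The `remapTensorName` helper is hypothetical (the real converters in `convert/` structure this differently), but the substitution from Hugging Face-style names to the GGUF names shown above is the core of the work:

```go
package main

import (
    "fmt"
    "strings"
)

// remapTensorName is a hypothetical helper that rewrites PyTorch/SafeTensors
// tensor names into the GGUF names from the layout above.
func remapTensorName(name string) string {
    return strings.NewReplacer(
        "model.embed_tokens", "token_embd",
        "model.norm", "output_norm",
        "lm_head", "output",
        "model.layers.", "blk.",
        ".self_attn.q_proj", ".attn_q",
        ".self_attn.k_proj", ".attn_k",
        ".self_attn.v_proj", ".attn_v",
        ".self_attn.o_proj", ".attn_output",
        ".input_layernorm", ".attn_norm",
        ".post_attention_layernorm", ".ffn_norm",
        ".mlp.gate_proj", ".ffn_gate",
        ".mlp.up_proj", ".ffn_up",
        ".mlp.down_proj", ".ffn_down",
    ).Replace(name)
}

func main() {
    // Prints "blk.0.attn_q.weight"
    fmt.Println(remapTensorName("model.layers.0.self_attn.q_proj.weight"))
}
```

Remember that renaming is only part of the job: as noted above, some tensors also need transposing or reshaping on the way through.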

### 6. Open a Draft PR

After implementing the initial weight conversion, creating a draft pull request is recommended because it:
- Establishes a communication channel with Ollama maintainers
- Allows for early feedback on your approach
- Makes it easier to track progress and changes

To open a draft PR:

1. Fork the repository
2. Create a new branch for your model implementation
3. Make initial commits with your weight conversion implementation
4. Open a PR in the `ollama/ollama` repository and mark it as draft
5. Include a clear description of the model you're implementing

### 7. Implement Model Logic

- Reference existing model implementations
- Implement the `New()` and `Forward()` functions in `model.go`:

  **The `New()` function:**
  - Creates and initializes your model structure
  - Loads configuration parameters (embedding size, attention heads, etc.)
  - Sets up the tokenizer with vocabulary and special tokens
  - Initializes all model layers and weights
  - **Important**: Sets up the KV cache for efficient inference
  - Example:

    ```go
    func New(c ml.Config) (model.Model, error) {
        m := &Model{
            // Initialize tokenizer
            BytePairEncoding: model.NewBytePairEncoding(...),
            // Create layer arrays
            Layers: make([]Layer, c.Uint("block_count")),
            // Set model parameters
            Options: &Options{...},
        }
        // Initialize KV cache for efficient inference
        m.Cache = kvcache.NewCausalCache(m.Shift)
        return m, nil
    }
    ```

  **The `Forward()` function:**
  - **What it does**: Defines the computational graph of your model
  - **Important**: The graph is **not** executed immediately; it is built first, then executed later when predictions are needed
  - Takes input tokens and converts them to embeddings
  - Processes inputs through transformer layers (attention and feed-forward networks)
  - Creates the path for data flow through your model's components
  - Example:

    ```go
    func (m *Model) Forward(ctx ml.Context, opts model.Options) (ml.Tensor, error) {
        // Convert inputs to tensors
        inputTensor, _ := ctx.FromIntSlice(opts.Inputs, len(opts.Inputs))
        positionsTensor, _ := ctx.FromIntSlice(opts.Positions, len(opts.Positions))

        // Initial token embedding
        hiddenStates := m.TokenEmbedding.Forward(ctx, inputTensor)

        // Process through transformer layers
        for i, layer := range m.Layers {
            m.Cache.SetLayer(i)
            hiddenStates = layer.Forward(ctx, hiddenStates, positionsTensor, m.Cache, m.Options)
        }

        // Final processing and output
        normalizedOutput := m.OutputNorm.Forward(ctx, hiddenStates, m.modelEpsilon)
        logits := m.Output.Forward(ctx, normalizedOutput)

        // Return logits for the requested positions
        outputsTensor, _ := ctx.FromIntSlice(opts.Outputs, len(opts.Outputs))
        return logits.Rows(ctx, outputsTensor), nil
    }
    ```

  **Key Components to Implement:**

  1. **KV Cache**:
     - Improves inference performance for text generation
     - How it works: stores previously computed key and value tensors from self-attention, avoiding redundant computation
     - Implementation: use `kvcache.NewCausalCache()` for autoregressive models
     - Important: you must implement the `Shift()` function to handle rotary position embeddings with the cache (a sketch appears at the end of this section)

  2. **Self-Attention**:
     - Core component that learns contextual relationships between tokens
     - Implements the query, key, and value projections and their interactions
     - Must handle positional encoding (usually Rotary Position Embeddings)
     - Uses the KV cache to make generation efficient

  3. **Normalization Layers**:
     - Purpose: stabilizes training and maintains consistent activation distributions
     - Types: RMSNorm, LayerNorm, etc., depending on the model architecture
     - Implementation: applied before attention and feed-forward networks
     - Example: `normalizedOutput := m.OutputNorm.Forward(ctx, hiddenStates, m.modelEpsilon)`

  4. **Activation Functions**:
     - Purpose: introduces non-linearity into the model
     - Common types: SiLU (Sigmoid Linear Unit), GELU, ReLU
     - Found in feed-forward/MLP blocks
     - Example:

       ```go
       // SwiGLU activation in an MLP block
       gateActivation := mlp.Gate.Forward(ctx, hiddenState).SILU(ctx)
       upProjection := mlp.Up.Forward(ctx, hiddenState)
       intermediateStates := gateActivation.Mul(ctx, upProjection)
       ```

- Run your forward pass:

  ```bash
  # in the root of the ollama directory
  go build .
  OLLAMA_DEBUG=1 ./ollama serve
  # in a separate terminal, once the server is up:
  OLLAMA_DEBUG=1 ./ollama run <my-model>
  ```

- Compare the output with the research implementation
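
As referenced in the KV cache notes above, a causal cache needs a `Shift()` callback so cached keys can be re-rotated when their positions change. A minimal sketch, assuming the model's options carry the RoPE parameters and that tensors expose a RoPE operation (names and signatures here are illustrative, not the definitive API):

```go
// Shift re-applies rotary position embeddings to cached keys when the
// cache window moves, keeping positions consistent during generation.
func (m *Model) Shift(ctx ml.Context, layer int, key, shift ml.Tensor) (ml.Tensor, error) {
    // ropeDim, ropeBase, and ropeScale are assumed fields on Options.
    return key.RoPE(ctx, shift, nil, m.Options.ropeDim, m.Options.ropeBase, m.Options.ropeScale), nil
}
```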

### 8. Quality Check and Final Steps

1. Add comprehensive tests to:
   - `model_test.go`
   - `convert_test.go`

2. Ensure tests cover (one way to structure such a test is sketched after this list):
   - Weight conversion
   - Model initialization
   - Text generation

3. **Create Final Modelfile**
   - Replace the static prompt with the proper Go template for your model:

     ```
     FROM <converted-gguf>
     TEMPLATE <prompt-template> # Add the proper Go template for your model, including tools if needed
     LICENSE <license-info>     # Add appropriate license information
     # Add additional parameters if needed
     ```

4. **End-to-end Testing**
   - Run your model with your local Ollama build to ensure it functions as expected

5. **Benchmark**
   - Run performance benchmarks on your model implementation:

     ```bash
     # from the root of the Ollama directory
     go build .
     OLLAMA_DEBUG=1 ./ollama serve
     # in a separate terminal, with the server running:
     go test -bench=. -m <your-model-name> ./...
     ```
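
As one way to structure the tests mentioned above, a small table-driven test works well for conversion logic. The sketch below assumes the hypothetical `remapTensorName` helper from the weight-conversion step; adapt the table to whatever mapping function your converter actually exposes:

```go
import "testing"

func TestRemapTensorName(t *testing.T) {
    // Each case pairs a source checkpoint name with its expected GGUF name.
    cases := map[string]string{
        "model.embed_tokens.weight":              "token_embd.weight",
        "model.layers.0.self_attn.q_proj.weight": "blk.0.attn_q.weight",
        "model.layers.10.mlp.down_proj.weight":   "blk.10.ffn_down.weight",
    }
    for in, want := range cases {
        if got := remapTensorName(in); got != want {
            t.Errorf("remapTensorName(%q) = %q, want %q", in, got, want)
        }
    }
}
```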

### 9. Finalize PR and Publish to ollama.com

1. **Finalize Pull Request**
   - Move the PR out of draft state
   - Address reviewer feedback

2. **Publish to ollama.com**
   - Push to ollama.com:

     ```bash
     ollama create <your-namespace>/<your-model> -f /path/to/Modelfile
     ollama push <your-namespace>/<your-model>
     ```