!! This is a work in progress document !!
```mermaid
graph TB
    subgraph Models["Model Layer: LLM Implementations"]
        direction TB
        llama["llama/model.go"]
        mllama["mllama/model.go"]
        qwen["qwen2/model.go"]
        qwen_vl["qwen2vl/model.go"]
        pixtral["pixtral/"]

        note1["Each model implements a specific architecture
        - Defines model parameters
        - Handles tokenization
        - Implements forward pass
        - Manages model weights"]
    end

    subgraph ML_Ops["Neural Network Operations"]
        direction TB
        nn_ops["nn/
        linear.go - Matrix operations
        embedding.go - Token embeddings
        normalization.go - Layer normalization
        convolution.go - Conv operations"]

        backend["ml/backend.go
        Hardware Abstraction Layer
        - Defines tensor operations
        - Manages computation graphs
        - Handles memory allocation"]

        note2["Common neural net operations
        used across different models
        - Abstracts hardware details
        - Provides unified API
        - Manages computation flow"]
    end

    subgraph GGML["Hardware Execution Layer"]
        direction TB
        ggml["ggml.go
        CGO Interface
        - Bridges Go and C++
        - Handles type conversion
        - Manages memory between languages"]

        subgraph Hardware_Specific["Hardware-Specific Implementations"]
            direction LR
            cpu["ggml-cpu.h
            CPU optimized ops"]
            cuda["ggml-cuda.h
            NVIDIA GPU ops"]
            metal["ggml-metal.h
            Apple GPU ops"]
            vulkan["ggml-vulkan.h
            Cross-platform GPU"]
            opencl["ggml-opencl.h
            OpenCL acceleration"]
        end

        note3["GGML provides optimized
        implementations for each hardware:
        - Automatic dispatch
        - Hardware-specific optimizations
        - Memory management
        - Parallel execution"]
    end

    %% Connections with explanations
    Models --> |"Makes high-level calls
    (e.g., self-attention)"| ML_Ops
    ML_Ops --> |"Translates to tensor operations
    (e.g., matmul, softmax)"| GGML
    GGML --> |"Executes optimized code
    on target hardware"| Hardware_Specific

    %% Styling
    classDef model fill:#fff,stroke:#01579b,stroke-width:2px
    classDef ml fill:#fff,stroke:#e65100,stroke-width:2px
    classDef hw fill:#fff,stroke:#b71c1c,stroke-width:2px
    classDef note fill:#fff,stroke:#666,stroke-dasharray: 5 5

    class llama,mllama,qwen,qwen_vl,pixtral model
    class nn_ops,backend ml
    class ggml,cpu,cuda,metal,vulkan,opencl hw
    class note1,note2,note3 note

    %% Style subgraphs
    style Models fill:#fff,stroke:#01579b,stroke-width:2px
    style ML_Ops fill:#fff,stroke:#e65100,stroke-width:2px
    style GGML fill:#fff,stroke:#b71c1c,stroke-width:2px
    style Hardware_Specific fill:#fff,stroke:#b71c1c,stroke-width:1px
```
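
To make the layering concrete, here is a minimal, self-contained Go sketch of the same flow: a model-layer module that only speaks a tensor interface, with hardware-specific implementations hidden behind it. The `Tensor` interface and `cpuTensor` type below are illustrative stand-ins, not ollama's actual `ml`/`nn` APIs.

```go
package main

import "fmt"

// Tensor is an illustrative stand-in for the backend's tensor interface
// (the real one lives in ml/backend.go and is much richer).
type Tensor interface {
    Mulmat(other Tensor) Tensor // matrix multiply, dispatched by the backend
    Shape() []int
}

// cpuTensor is a toy "hardware backend": a row-major float32 matrix.
type cpuTensor struct {
    rows, cols int
    data       []float32
}

func (t *cpuTensor) Shape() []int { return []int{t.rows, t.cols} }

func (t *cpuTensor) Mulmat(other Tensor) Tensor {
    o := other.(*cpuTensor)
    out := &cpuTensor{rows: t.rows, cols: o.cols, data: make([]float32, t.rows*o.cols)}
    for i := 0; i < t.rows; i++ {
        for j := 0; j < o.cols; j++ {
            var sum float32
            for k := 0; k < t.cols; k++ {
                sum += t.data[i*t.cols+k] * o.data[k*o.cols+j]
            }
            out.data[i*o.cols+j] = sum
        }
    }
    return out
}

// Linear mirrors the role of nn/linear.go: a model-layer module that only
// speaks the Tensor interface and never touches hardware directly.
type Linear struct{ Weight Tensor }

func (l *Linear) Forward(x Tensor) Tensor { return x.Mulmat(l.Weight) }

func main() {
    w := &cpuTensor{rows: 2, cols: 3, data: []float32{1, 0, 0, 0, 1, 0}}
    x := &cpuTensor{rows: 1, cols: 2, data: []float32{3, 4}}
    y := (&Linear{Weight: w}).Forward(x)
    fmt.Println(y.Shape()) // [1 3]
}
```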
1. Get a dump of the graph built with PyTorch or Safetensors. Use this snippet to do so:
   ```python
   import torch
   import sys
   from safetensors.torch import load_file

   def extract_graph(model_path):
       if model_path.endswith('.safetensors'):
           state_dict = load_file(model_path)
       else:
           state_dict = torch.load(model_path, weights_only=True)

       graph = []
       for name, tensor in state_dict.items():
           if isinstance(tensor, torch.Tensor):
               graph.append({
                   "name": name,
                   "shape": list(tensor.shape)
               })

       print("{")
       print('  "graph": [')
       for i, layer in enumerate(graph):
           comma = "," if i < len(graph) - 1 else ""
           print(f'    {{"name": "{layer["name"]}", "shape": {layer["shape"]}}}{comma}')
       print("  ]")
       print("}")

   if __name__ == "__main__":
       if len(sys.argv) != 2:
           print("Usage: python extract.py <path/to/model>")
           sys.exit(1)
       extract_graph(sys.argv[1])
   ```
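
   Run against a checkpoint, the script prints one entry per weight tensor. For example, on a Llama-style safetensors file the output begins roughly like this (tensor names and shapes are illustrative; yours will differ):

   ```
   {
     "graph": [
       {"name": "model.embed_tokens.weight", "shape": [32000, 4096]},
       {"name": "model.layers.0.self_attn.q_proj.weight", "shape": [4096, 4096]},
       ...
     ]
   }
   ```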
2. Look at a previous model implementation pull request and copy the structure of the files needed. We will need:
   - a `model/<model-name>` directory
   - a `model/<model-name>/model.go` file to implement the architecture and forward pass
   - a `model/<model-name>/convert.go` file to implement the conversion from pytorch/safetensors to ggml
   - `model/<model-name>/model_test.go` and `model/<model-name>/convert_test.go` files for testing
3. Open a draft pull request in the `ollama/ollama` repo, as a place to ask questions and get answers from Ollama maintainers.
4. Implement conversion from the model weights (pytorch, safetensors) to ggml in the `model/<your-model>/convert.go` file; see the sketch below, and reference other `convert.go` files.
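
   The core of `convert.go` is mapping the source checkpoint's tensor names (and, for some architectures, layouts) onto ggml's naming conventions. Here is a minimal sketch of the name-mapping idea with a hypothetical replacement table; the real pairs depend on your architecture, and ollama's actual converter interfaces live in the `convert/` package:

   ```go
   package main

   import (
       "fmt"
       "strings"
   )

   // replacements pairs source (pytorch/safetensors) tensor-name fragments
   // with their ggml counterparts. These pairs are examples only; derive
   // the real table from the graph dump in step 1.
   var replacements = []string{
       "model.embed_tokens", "token_embd",
       "model.layers", "blk",
       "self_attn.q_proj", "attn_q",
       "self_attn.k_proj", "attn_k",
       "self_attn.v_proj", "attn_v",
       "lm_head", "output",
   }

   // convertName rewrites one source tensor name into ggml naming.
   func convertName(name string) string {
       return strings.NewReplacer(replacements...).Replace(name)
   }

   func main() {
       fmt.Println(convertName("model.layers.0.self_attn.q_proj.weight"))
       // Output: blk.0.attn_q.weight
   }
   ```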
5. Create a Modelfile that only references the pytorch/safetensors directory. We will handle the other fields later. Modelfile:

   ```
   FROM /path/to/model
   ```

   Use `ollama create` to convert the model:

   ```
   go run . create <my-model> -f /path/to/Modelfile
   ```
6. Implement the `New()` and `Forward()` logic in `model/<your-model>/model.go`, along the lines of the skeleton after this step. Reference other `model.go` files.
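
   As a rough shape of the file, the skeleton might look like the following. All types here are placeholders standing in for ollama's real `ml`/`nn` packages; copy the actual signatures from an existing model such as `llama/model.go`:

   ```go
   package mymodel

   // Placeholder types standing in for ollama's real ml/nn packages.
   type Tensor interface{}
   type Context interface{}

   // Options carries hyperparameters read from the converted gguf
   // (hidden size, head counts, norm epsilon, ...).
   type Options struct {
       HiddenSize, NumHeads, NumKVHeads int
       EPS                              float64
   }

   // Layer groups the per-block weights: attention and MLP.
   type Layer struct {
       AttnNorm, AttnQ, AttnK, AttnV, AttnOut Tensor
       MLPNorm, Gate, Up, Down                Tensor
   }

   // Model owns the weights and hyperparameters.
   type Model struct {
       TokenEmbedding Tensor
       Layers         []Layer
       OutputNorm     Tensor
       Output         Tensor
       Options
   }

   // New looks up each tensor by its ggml name and fills Options from the
   // gguf metadata; the real version is registered by architecture name.
   func New( /* config + weight source */ ) (*Model, error) {
       return &Model{}, nil
   }

   // Forward runs one step: embed tokens, apply each block (attention and
   // MLP with residuals), final norm, then project to logits.
   func (m *Model) Forward(ctx Context, tokens []int32) (Tensor, error) {
       // hidden := embed(m.TokenEmbedding, tokens)
       // for _, l := range m.Layers { hidden = blockForward(ctx, l, hidden, m.Options) }
       // return project(norm(hidden, m.OutputNorm), m.Output), nil
       return nil, nil
   }
   ```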
7. Run the model and get the debug output of the forward pass to compare with the output of the research implementation from step 1:

   ```
   OLLAMA_DEBUG=1 go run . run <my-model>
   go run . run <my-model> "hello"
   ```
8. Add tests to `model/<your-model>/model_test.go` and `model/<your-model>/convert_test.go`.
9. Push changes to the `ollama/ollama` pull request, and move the pull request out of the draft state.
10. Create a final Modelfile that is `FROM` the converted gguf, and add the `TEMPLATE`, `LICENSE`, and parameters if needed (a sketch follows this list).
11. Create and push the model:

    ```
    ollama create <your-namespace>/<your-model> -f /path/to/Modelfile
    ollama push <your-namespace>/<your-model>
    ```
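
Putting steps 10 and 11 together, the final Modelfile might look like this; the template and parameters shown are placeholders, so use what the model's authors specify:

```
FROM /path/to/converted/model.gguf
TEMPLATE """{{ .Prompt }}"""
LICENSE """<license text>"""
PARAMETER temperature 0.7
PARAMETER stop "<|endoftext|>"
```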