Michael Yang | 829ff87bd1 | revert tokenize ffi (#4761) | 11 months ago
Jeffrey Morgan | 763bb65dbb | use `int32_t` for call to tokenize (#4738) | 11 months ago
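The `int32_t` change matters at the cgo boundary: Go's `int` is 64-bit on most targets, while llama.cpp's tokenize entry point takes `int32_t` lengths, so sizes must be converted explicitly. A minimal sketch of the pattern, using a stand-in C function (`fake_tokenize` is hypothetical; the real binding is llama.cpp's tokenizer):

```go
package main

/*
#include <stdint.h>
#include <stdlib.h>

// fake_tokenize is a hypothetical stand-in for a llama.cpp-style
// tokenizer entry point: note the explicitly sized int32_t
// parameters rather than plain C int.
static int32_t fake_tokenize(const char *text, int32_t text_len,
                             int32_t *out, int32_t out_cap) {
	int32_t n = text_len < out_cap ? text_len : out_cap;
	for (int32_t i = 0; i < n; i++) {
		out[i] = (int32_t)text[i];
	}
	return n;
}
*/
import "C"

import (
	"fmt"
	"unsafe"
)

func main() {
	text := "hello"
	ctext := C.CString(text)
	defer C.free(unsafe.Pointer(ctext))

	out := make([]C.int32_t, 32)
	// Convert Go's platform-sized int to int32_t explicitly at the
	// boundary; cgo will not compile an implicit int -> int32_t pass.
	n := C.fake_tokenize(ctext, C.int32_t(len(text)),
		(*C.int32_t)(unsafe.Pointer(&out[0])), C.int32_t(len(out)))
	fmt.Println("token count:", int32(n))
}
```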
Michael Yang | bf54c845e9 | vocab only | 11 months ago
Michael Yang | 26a00a0410 | use ffi for tokenizing/detokenizing | 11 months ago
Michael Yang | 01811c176a | comments | 1 year ago
Michael Yang | 9685c34509 | quantize any fp16/fp32 model | 1 year ago
Hernan Martinez | 86e67fc4a9 | Add import declaration for windows,arm64 to llm.go | 1 year ago
Michael Yang | 9502e5661f | cgo quantize | 1 year ago
Daniel Hiltgen | 58d95cc9bd | Switch back to subprocessing for llama.cpp | 1 year ago
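Running llama.cpp in a subprocess (rather than in-process via cgo) isolates native crashes from the serving process. A hedged sketch of the general pattern; the runner binary name and flags below are invented for illustration:

```go
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Hypothetical runner binary and flags. The point of subprocessing:
	// a segfault inside native llama.cpp code kills only this child,
	// and the parent can observe the exit and restart or fall back.
	cmd := exec.Command("./llm-runner", "--model", "model.gguf", "--port", "8081")
	if err := cmd.Start(); err != nil {
		log.Fatalf("failed to start runner: %v", err)
	}
	if err := cmd.Wait(); err != nil {
		log.Printf("runner exited abnormally: %v", err)
	}
}
```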
Michael Yang | 91b3e4d282 | update memory calcualtions | 1 year ago
Michael Yang | d338d70492 | refactor model parsing | 1 year ago
Patrick Devine | 1b272d5bcd | change `github.com/jmorganca/ollama` to `github.com/ollama/ollama` (#3347) | 1 year ago
Jeffrey Morgan | f9cd55c70b | disable gpu for certain model architectures and fix divide-by-zero on memory estimation | 1 year ago
Daniel Hiltgen | 6c5ccb11f9 | Revamp ROCm support | 1 year ago
Daniel Hiltgen | a1dfab43b9 | Ensure the libraries are present | 1 year ago
Jeffrey Morgan | 4458efb73a | Load all layers on `arm64` macOS if model is small enough (#2149) | 1 year ago
Daniel Hiltgen | fedd705aea | Mechanical switch from log to slog | 1 year ago
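A mechanical log-to-slog switch replaces Go's unstructured standard logger with the structured, leveled `log/slog` package (standard library since Go 1.21). Roughly:

```go
package main

import (
	"log"
	"log/slog"
)

func main() {
	// Before: printf-style, unstructured, no levels.
	log.Printf("loading model %s with %d layers", "model.gguf", 32)

	// After: leveled, structured key/value pairs that handlers can
	// render as text or JSON.
	slog.Info("loading model", "path", "model.gguf", "layers", 32)
	slog.Debug("not shown: the default handler's level is Info")
}
```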
Michael Yang | eaed6f8c45 | add max context length check | 1 year ago
Daniel Hiltgen | 7427fa1387 | Fix up the CPU fallback selection | 1 year ago
Daniel Hiltgen | de2fbdec99 | Merge pull request #1819 from dhiltgen/multi_variant | 1 year ago
Michael Yang | f4f939de28 | Merge pull request #1552 from jmorganca/mxyng/lint-test | 1 year ago
Daniel Hiltgen | 39928a42e8 | Always dynamically load the llm server library | 1 year ago
Daniel Hiltgen | d88c527be3 | Build multiple CPU variants and pick the best | 1 year ago
Jeffrey Morgan | ab6be852c7 | revisit memory allocation to account for full kv cache on main gpu | 1 year ago
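The full KV cache is the dominant per-context cost: every token keeps one key and one value vector per layer. A back-of-the-envelope sketch, assuming plain multi-head attention (grouped-query attention stores proportionally less) and f16 entries:

```go
package main

import "fmt"

// kvCacheBytes estimates KV cache size: 2 tensors (K and V) per layer,
// one embedding-width vector per token, at bytesPerElem each. Assumes
// plain multi-head attention with no grouped-query sharing.
func kvCacheBytes(nLayer, nCtx, nEmbd, bytesPerElem uint64) uint64 {
	return 2 * nLayer * nCtx * nEmbd * bytesPerElem
}

func main() {
	// Llama-2-7B-like shape: 32 layers, 4096-wide embeddings, f16 (2 bytes).
	b := kvCacheBytes(32, 4096, 4096, 2)
	fmt.Printf("%.2f GiB\n", float64(b)/(1<<30)) // 2.00 GiB at 4096-token context
}
```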
Daniel Hiltgen | 8da7bef05f | Support multiple variants for a given llm lib type | 1 year ago
Jeffrey Morgan | b24e8d17b2 | Increase minimum CUDA memory allocation overhead and fix minimum overhead for multi-gpu (#1896) | 1 year ago
Michael Yang | f921e2696e | typo | 1 year ago
Jeffrey Morgan | f387e9631b | use runner if cuda alloc won't fit | 1 year ago
Jeffrey Morgan | cb534e6ac2 | use 10% vram overhead for cuda | 1 year ago
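A fixed-fraction reservation like this is simple headroom arithmetic; the 10% figure comes from the commit message, while the function and GPU size below are illustrative:

```go
package main

import "fmt"

// usableVRAM reserves a fraction of total VRAM as headroom for CUDA
// allocations not captured by the layer-size estimate (scratch
// buffers, fragmentation, driver overhead).
func usableVRAM(totalVRAM uint64, overheadFrac float64) uint64 {
	return totalVRAM - uint64(float64(totalVRAM)*overheadFrac)
}

func main() {
	total := uint64(24) << 30 // a hypothetical 24 GiB GPU
	fmt.Printf("budget: %.1f GiB\n", float64(usableVRAM(total, 0.10))/(1<<30))
}
```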
Jeffrey Morgan | 58ce2d8273 | better estimate scratch buffer size | 1 year ago