Merge pull request #5442 from dhiltgen/concurrency_docs

Add windows radeon concurrency note
Daniel Hiltgen, 10 months ago
parent commit d2f19024d0
1 file changed, 3 insertions(+), 1 deletion(-)
docs/faq.md

@@ -266,8 +266,10 @@ If there is insufficient available memory to load a new model request while one
 
 Parallel request processing for a given model multiplies the effective context size by the number of parallel requests.  For example, a 2K context with 4 parallel requests results in an 8K context and a correspondingly larger memory allocation.
 
-The following server settings may be used to adjust how Ollama handles concurrent requests:
+The following server settings may be used to adjust how Ollama handles concurrent requests on most platforms:
 
 - `OLLAMA_MAX_LOADED_MODELS` - The maximum number of models that can be loaded concurrently, provided they fit in available memory.  The default is 3 * the number of GPUs, or 3 for CPU inference.
 - `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time.  The default will auto-select either 4 or 1 based on available memory.
 - `OLLAMA_MAX_QUEUE` - The maximum number of requests Ollama will queue when busy before rejecting additional requests.  The default is 512.
+
+Note: Windows with Radeon GPUs currently defaults to a maximum of 1 loaded model due to limitations in ROCm v5.7's reporting of available VRAM.  Once ROCm v6 is available, Windows Radeon will follow the defaults above.  You may enable concurrent model loads on Radeon on Windows, but ensure you don't load more models than will fit into your GPU's VRAM.
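
For illustration, here is one way the settings above might be applied when starting the server manually on Linux or macOS; the values are arbitrary examples chosen for this sketch, not recommendations:

```shell
# Allow up to 2 models loaded at once, 2 parallel requests per model,
# and a queue of at most 128 pending requests before new ones are rejected.
OLLAMA_MAX_LOADED_MODELS=2 OLLAMA_NUM_PARALLEL=2 OLLAMA_MAX_QUEUE=128 ollama serve
```

On Windows, the same variables would instead be set as user environment variables before Ollama starts.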