
runner.go: Scale batches to be processed by numParallel

We should process a batch of tokens for each parallel request, rather
than having a shared pool. Otherwise, a single request can fill the
batch and then subsequent ones will fail or get starved.

Server.cpp used the KV cache size allocated for each parallel request
as the allocated size for the batch. This is the upper bound for the
batch, but since we know how many tokens we will actually put in a
batch, there is no need to over-allocate.
Jesse Gross, 8 months ago
Commit 8e1554c91d
1 file changed, 1 insertion(+), 2 deletions(-)

+ 1 - 2
llama/runner/runner.go

@@ -198,8 +198,7 @@ func incompleteUnicode(token string) bool {
 }
 
 func (s *Server) run(ctx context.Context) {
-	// TODO - should this be n_ctx / parallel like the old server.cpp setup?
-	batch := llama.NewBatch(s.batchSize, 0, s.parallel)
+	batch := llama.NewBatch(s.batchSize*len(s.seqs), 0, len(s.seqs))
 	defer batch.Free()
 
 	// build up stop sequences as we recognize them
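
For context, below is a standalone sketch of why the batch is sized s.batchSize*len(s.seqs): each parallel slot contributes at most batchSize tokens per pass of the run loop, so the batch needs room for that many tokens per slot, and no single request can crowd the others out. The types and loop here are hypothetical stand-ins for illustration, not the actual runner.go or llama package code; only the sizing arithmetic comes from the diff above.

package main

import "fmt"

// Sketch only: hypothetical types to illustrate the sizing argument from the
// commit message. None of this is the real runner.go or llama package code.
type sequence struct {
	tokens []int // tokens still waiting to be decoded for this request
}

type batch struct {
	tokens   []int
	capacity int
}

func main() {
	const batchSize = 4 // stand-in for s.batchSize
	seqs := []*sequence{
		{tokens: []int{1, 2, 3, 4, 5, 6, 7, 8}}, // a long prompt
		{tokens: []int{9}},                      // a short request
		nil,                                     // an unused slot
	}

	// New sizing: room for batchSize tokens per parallel slot.
	b := batch{capacity: batchSize * len(seqs)}

	for _, seq := range seqs {
		if seq == nil {
			continue
		}
		// Each sequence contributes at most batchSize tokens per pass, so a
		// long prompt cannot fill the whole batch and starve the other slots.
		n := len(seq.tokens)
		if n > batchSize {
			n = batchSize
		}
		b.tokens = append(b.tokens, seq.tokens[:n]...)
		seq.tokens = seq.tokens[n:]
	}

	fmt.Printf("batch holds %d/%d tokens this pass\n", len(b.tokens), b.capacity)
}

With the previous shared sizing of just batchSize, the first sequence alone could have filled the batch, which is the starvation the commit message describes.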