Continuous Batching
Continuous Batching is a scheduling algorithm that lets LLM runners such as llama.cpp
make better use of GPU processing time by serving multiple completion requests in parallel.
Rather than waiting for an entire batch to finish, the server slots new requests into the
running batch as soon as earlier ones complete.
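To make the idea concrete, here is a minimal, self-contained C++ sketch of a continuous-batching scheduler. It only illustrates the scheduling logic, not llama.cpp's actual implementation; the Request struct, the n_slots value, and the stubbed one-token "decode step" are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cstdio>
#include <deque>
#include <vector>

// One in-flight completion request (illustrative, not llama.cpp's types).
struct Request {
    int id;
    int generated;   // tokens produced so far
    int max_tokens;  // stop after this many tokens
};

int main() {
    const int n_slots = 2;           // sequences decoded in parallel
    std::deque<Request> pending = {  // queued completion requests
        {1, 0, 3}, {2, 0, 5}, {3, 0, 2}, {4, 0, 4}};
    std::vector<Request> active;

    int step = 0;
    while (!pending.empty() || !active.empty()) {
        // Admit queued requests the moment a slot frees up. This is the key
        // difference from static batching, which waits for the whole batch
        // to finish before accepting new work.
        while ((int)active.size() < n_slots && !pending.empty()) {
            active.push_back(pending.front());
            pending.pop_front();
        }

        // One decode step: in a real server this would be a single forward
        // pass over a batch holding one token per active sequence.
        for (auto &r : active) {
            r.generated++;
        }
        printf("step %d: %d active sequence(s)\n", ++step, (int)active.size());

        // Retire finished sequences; their slots are reused immediately.
        for (const auto &r : active) {
            if (r.generated >= r.max_tokens) {
                printf("  request %d done (%d tokens)\n", r.id, r.generated);
            }
        }
        active.erase(std::remove_if(active.begin(), active.end(),
                                    [](const Request &r) {
                                        return r.generated >= r.max_tokens;
                                    }),
                     active.end());
    }
    return 0;
}
```

Because a finished sequence's slot is refilled immediately, the GPU keeps a full batch in flight for as long as there is queued work.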
Explanation and Demonstration
The best person to explain how Continuous Batching is implemented in llama.cpp
is the author of the library, Georgi Gerganov. Here is his tweet explaining the concept and demonstrating the algorithm’s speed.
If you want to dig even deeper, you can check out the GitHub discussion that explains it further.
Serving 8 clients in parallel on A100 with llama.cpp

Model: Codellama 7B F16
System prompt: 305 tokens
Requests: 128
Max sequence length: 100
Continuous batching: enabled
Average speed: ~484 t/s (including prompts and generated tokens)

— Georgi Gerganov (@ggerganov) October 24, 2023
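As a rough back-of-the-envelope check on those numbers: assuming each request processes its 305-token prompt in full and generates up to 100 tokens, the run covers about 128 × (305 + 100) ≈ 51,800 tokens, which at ~484 t/s works out to roughly 107 seconds to serve all 128 requests.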