Continuous Batching
Continuous Batching is a scheduling algorithm that lets LLM runners such as llama.cpp
make better use of GPU processing time by serving multiple completion requests in parallel.
Rather than waiting for an entire batch to finish, the server slots new requests into the
running batch as soon as earlier ones complete.
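To make the idea concrete, here is a minimal, self-contained C++ sketch of a continuous-batching scheduler. It only illustrates the scheduling logic, not llama.cpp's actual implementation; the Request struct, the n_slots value, and the stubbed one-token "decode step" are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cstdio>
#include <deque>
#include <vector>

// One in-flight completion request (illustrative, not llama.cpp's types).
struct Request {
    int id;
    int generated;   // tokens produced so far
    int max_tokens;  // stop after this many tokens
};

int main() {
    const int n_slots = 2;           // sequences decoded in parallel
    std::deque<Request> pending = {  // queued completion requests
        {1, 0, 3}, {2, 0, 5}, {3, 0, 2}, {4, 0, 4}};
    std::vector<Request> active;

    int step = 0;
    while (!pending.empty() || !active.empty()) {
        // Admit queued requests the moment a slot frees up. This is the key
        // difference from static batching, which waits for the whole batch
        // to finish before accepting new work.
        while ((int)active.size() < n_slots && !pending.empty()) {
            active.push_back(pending.front());
            pending.pop_front();
        }

        // One decode step: in a real server this would be a single forward
        // pass over a batch holding one token per active sequence.
        for (auto &r : active) {
            r.generated++;
        }
        printf("step %d: %d active sequence(s)\n", ++step, (int)active.size());

        // Retire finished sequences; their slots are reused immediately.
        for (const auto &r : active) {
            if (r.generated >= r.max_tokens) {
                printf("  request %d done (%d tokens)\n", r.id, r.generated);
            }
        }
        active.erase(std::remove_if(active.begin(), active.end(),
                                    [](const Request &r) {
                                        return r.generated >= r.max_tokens;
                                    }),
                     active.end());
    }
    return 0;
}
```

Because a finished sequence's slot is refilled immediately, the GPU keeps a full batch in flight for as long as there is queued work.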
Explanation and Demonstration
The best person to explain how Continuous Batching is implemented in llama.cpp
is the author of the library, Georgi Gerganov. Here is his tweet explaining the concept and demonstrating the algorithm’s speed.
If you want to dig even deeper, you can check out the GitHub discussion that explains it further.
Serving 8 clients in parallel on A100 with llama.cpp

Model: Codellama 7B F16
System prompt: 305 tokens
Requests: 128
Max sequence length: 100
Continuous batching: enabled
Average speed: ~484 t/s (including prompts and generated tokens)

— Georgi Gerganov (@ggerganov) October 24, 2023
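As a rough back-of-the-envelope check on those numbers: assuming each request processes its 305-token prompt in full and generates up to 100 tokens, the run covers about 128 × (305 + 100) ≈ 51,800 tokens, which at ~484 t/s works out to roughly 107 seconds to serve all 128 requests.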