Continuous Batching

Continuous Batching is a scheduling technique that allows LLM runners like llama.cpp to make better use of GPU processing time.

It allows the server to handle multiple completion requests in parallel: instead of waiting for every sequence in a batch to finish before accepting new work, the server decodes all active sequences together at each step and immediately hands slots freed by finished sequences to waiting requests.
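
To make the idea concrete, here is a minimal, self-contained C++ sketch of such a scheduling loop. This is a toy model, not llama.cpp's implementation: `Request`, `MAX_SLOTS`, and `decode_step` are invented names, and actual token generation is replaced by a simple counter. The key point it illustrates is that free slots are refilled from the request queue at every decoding step, so a short request finishing early does not leave capacity idle.

```cpp
#include <algorithm>
#include <cstdio>
#include <deque>
#include <string>
#include <vector>

// Toy model of a continuous-batching scheduler (illustrative only;
// these names do not correspond to llama.cpp's actual API).
struct Request {
    std::string id;
    int tokens_left;  // tokens still to generate for this request
};

constexpr int MAX_SLOTS = 2;  // how many sequences fit in one batch

// Simulate one decoding step: every active sequence emits one token.
void decode_step(std::vector<Request> &active) {
    for (auto &r : active) {
        r.tokens_left--;
        std::printf("  generated token for %s (%d left)\n",
                    r.id.c_str(), r.tokens_left);
    }
}

int main() {
    std::deque<Request> queue = {{"req-A", 3}, {"req-B", 1}, {"req-C", 2}};
    std::vector<Request> active;

    int step = 0;
    while (!queue.empty() || !active.empty()) {
        // Continuous batching: top up free slots from the queue at
        // *every* step, rather than waiting for the batch to drain.
        while ((int) active.size() < MAX_SLOTS && !queue.empty()) {
            std::printf("admitting %s\n", queue.front().id.c_str());
            active.push_back(queue.front());
            queue.pop_front();
        }

        std::printf("step %d:\n", step++);
        decode_step(active);

        // Retire finished sequences, freeing their slots immediately.
        active.erase(std::remove_if(active.begin(), active.end(),
                         [](const Request &r) { return r.tokens_left == 0; }),
                     active.end());
    }
}
```

Running this, req-B finishes after the first step and its slot is reused for req-C on the very next step. With static batching, that slot would have sat empty until req-A, the longest request in the batch, was also done.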

Explanation and Demonstration

The best person to explain how Continuous Batching is implemented in llama.cpp is the author of the library, Georgi Gerganov. Here is his tweet explaining the concept and demonstrating the algorithm’s speed.

If you want to dig even deeper, you can check out the GitHub discussion that explains the implementation in more detail.