Long-Running
In web development, there are two primary application models:
- Long-running processes (for example, a web application written in Go that keeps running, with the same process responding to multiple incoming requests)
- Worker-based, single-threaded, synchronous (for example, PHP with FPM, Ruby, and some Python setups; this model is generally used by scripting languages)
This is not necessarily tied to a specific language (for example, PHP can also start a long-running process with a web server, but most PHP frameworks were not designed with that application model in mind, and without extensions like Swoole it won't be preemptive).
Python can run synchronously with some frameworks and WSGI, but it can also run as a long-running application with ASGI or projects like Granian.
The Problem with the Worker-Based Synchronous Model
We will use PHP-FPM as an example. On Debian, it comes preconfigured with the pm.max_children parameter set to 5 by default, which means it can spawn at most 5 workers and handle at most 5 requests in parallel. This parameter can be tweaked, and under normal circumstances it can be raised to a much higher value at the cost of additional RAM.
Let’s assume we have 32 workers running. Normally, responding to a request takes milliseconds, and this runtime model is not an issue, but working with LLMs or other ML models turns that around. A typical request to an LLM can take multiple seconds to complete, sometimes 20-30 seconds, depending on the number of tokens to be generated. That means that if we have 32 requests in parallel to our application (which is not a lot), all the synchronous PHP workers can be blocked while waiting for tokens to be generated.
We can end up in a situation where our server’s CPU is almost idle, yet it cannot accept more requests because of an input/output bottleneck.
```mermaid
flowchart TD
    client[Client]
    client ---|Request| response_handler
    subgraph worker[Worker]
        response_handler[Response Handler] --- response["Response"]
    end
    subgraph llm[LLM]
        response_handler -->|Waits for| llmserver[Completion]
        llmserver --> response
    end
```
While the tokens are being generated, the worker is at a standstill and cannot accept additional requests.
The Solution
The solution is to use languages and frameworks that support long-running processes (ones that do not rely on spawning a fixed pool of workers) and that offer some form of asynchronicity.
A perfect example of such a language is Go with its goroutines. Each goroutine uses just a few kilobytes of memory, and a server with a few gigabytes of RAM can potentially spawn millions of them. They run asynchronously and are preemptively scheduled, so there shouldn’t be a situation where just a few slow requests exhaust the entire server’s capacity.
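To make the contrast concrete, here is a minimal sketch of a Go HTTP server that forwards requests to an LLM completion endpoint (the URL and route are placeholder assumptions, not a specific API). Because net/http serves every connection in its own goroutine, a handler blocked for 20-30 seconds on token generation does not stop the server from accepting other requests.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	// Generous timeout, since LLM completions can take tens of seconds.
	client := &http.Client{Timeout: 60 * time.Second}

	http.HandleFunc("/generate", func(w http.ResponseWriter, r *http.Request) {
		// This call may block for many seconds while tokens are generated,
		// but it only blocks the goroutine serving this one request.
		resp, err := client.Post(
			"http://localhost:8000/v1/completions", // placeholder LLM endpoint
			"application/json", r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		io.Copy(w, resp.Body)
	})

	// net/http spawns a goroutine per incoming connection, so slow upstream
	// responses do not exhaust a fixed pool of workers.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```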