Introduction

This handbook is a practical and advanced guide to LLMOps. It provides a solid understanding of large language models’ general concepts, deployment techniques, and software engineering practices. With this knowledge, you will be prepared to maintain the entire stack confidently.

This handbook focuses on LLM runners like llama.cpp or vLLM, which can scale and behave predictably in the infrastructure, rather than on runners aimed at casual use cases.

It will teach you how to use large language models in professional applications, self-host Open-Source models, and build software around them. It goes beyond just Retrieval Augmented Generation and Fine Tuning.

It assumes you are interested in self-hosting open source Large Language Models. If you only want to use them through HTTP APIs, you can jump straight to the application layer best practices.

This is a living document, which means it will be updated regularly. To follow us, visit our GitHub repository.

What is LLMOps?

LLMOps is a set of practices that deals with deploying, maintaining, and scaling Large Language Models. If you want to consider yourself an LLMOps practitioner, you should be able to, at minimum, deploy and maintain a scalable setup of multiple running LLM instances.

New Class of Opportunities, New Class of Problems

Although there has been a recent trend of naming everything *Ops (DevOps, Product Ops, MLOps, LLMOps, BizOps, etc.), LLMOps and MLOps truly deserve their place as a standalone set of practices.

They bridge the gap between the applications and AI models deployed in the infrastructure. They also address specific issues arising from using GPUs and TPUs, with the primary stress being Input/Output optimizations.

Authors

All contributions are welcome. If you want to contribute, follow our contributing guidelines.

  • Mateusz Charytoniuk (LinkedIn, GitHub) - Author of the original version of this handbook, project maintainer.

Contributors

If you want to be on this list, contribute a new article or substantially expand or update any existing article with new information.

Contributing

First of all, every contribution is welcome.

You do not have to add or improve articles to contribute. Even giving us suggestions or general ideas is valuable if you want to join in.

To discuss the handbook contents, use GitHub discussions.

What are we looking for?

This handbook is intended to be a living document that evolves with the community. It is aimed at more advanced LLM users who want to deploy scalable setups and/or be able to architect applications around them.

It focuses primarily on runners like llama.cpp or vLLM, aimed at production usage. However, if you find an interesting use case for aphrodite, tabby, or any other runner, that is also welcome.

Those are just general ideas. Anything related to the infrastructure, application layer, and tutorials is welcome. If you have an interesting approach to using LLMs, feel free to contribute that as well.

How to contribute?

We are using GitHub issues and pull requests to organize the work.

Submitting a new article

If you want to submit an article:

  1. Start a GitHub issue with an outline (with general points you want to cover) so we can decide together if it fits the handbook.
  2. If the article fits the handbook, add a new page and create a pull request with a finished article.

Updating an article

If you want to improve an existing article, start an issue to let us know your thoughts or create a pull request if you are ready to add changes. Add an improvement tag to such an issue or pull request.

Scrutiny

If you think something in the handbook is incorrect, add a new issue with a scrutiny tag and point out the issues.

Continuous Batching

Continuous Batching is an algorithm that allows LLM runners like llama.cpp to better utilize GPU processing time.

It allows the server to handle multiple completion requests in parallel.
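
Below is a minimal sketch of what this looks like from the client side, assuming a llama.cpp server running locally with several slots (for example, started with --parallel 4 and continuous batching enabled); the host, payload, and prompts are illustrative. With batching enabled, the total wall time for the parallel requests should be far lower than running them one after another.

# Fire several completion requests at once against a local llama.cpp server
# and measure the total wall time; with continuous batching the requests are
# processed in parallel instead of queuing behind each other.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

LLAMA_CPP_URL = "http://127.0.0.1:8080/completion"  # assumed local server

def complete(prompt: str) -> str:
    response = requests.post(
        LLAMA_CPP_URL,
        json={"prompt": prompt, "n_predict": 64},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["content"]

prompts = [f"Write a haiku about GPU number {i}." for i in range(4)]

started_at = time.monotonic()
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    results = list(pool.map(complete, prompts))
print(f"{len(results)} completions in {time.monotonic() - started_at:.1f}s")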

Explanation and Demonstration

The best person to explain how Continuous Batching is implemented in llama.cpp is the author of the library, Georgi Gerganov. Here is his tweet explaining the concept and demonstrating the algorithm’s speed.

If you want to dig even deeper, you can check out the GitHub discussion that further explains this.

Embedding

Formally, an embedding represents a word (or a phrase) as a point in a vector space. In this space, words with similar meanings are close to each other.

For example, the words “dog” and “cat” might be close to each other in the vector space because they are both animals.

RGB Analogy

Because embeddings can be vectors with 4096 or more dimensions, it might be hard to imagine them and get a good intuition on how they work in practice.

A good analogy for getting an intuition about embeddings is to imagine them as points in 3D space first.

Let’s assume a color represented by RGB is our embedding. It is a 3D vector with 3 values: red, green, and blue representing 3 dimensions. Similar colors in that space are placed near each other. Red is close to orange, blue and green are close to teal, etc.

Embeddings work similarly. Words and phrases are represented by vectors, and similar words are placed close to each other in the vector space.

Searching through similar embeddings to a given one means we are looking for vectors that are placed close to the given embedding.
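
As a toy illustration of that search, here is a sketch in the RGB analogy, where each “embedding” is just a 3-dimensional color vector and similarity is measured with cosine similarity; the same math applies unchanged to real 4096-dimensional embeddings.

# Toy nearest-neighbour search: rank stored "embeddings" (RGB vectors)
# by cosine similarity to a query vector.
import numpy as np

colors = {
    "red": np.array([255.0, 0.0, 0.0]),
    "orange": np.array([255.0, 165.0, 0.0]),
    "teal": np.array([0.0, 128.0, 128.0]),
    "blue": np.array([0.0, 0.0, 255.0]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([250.0, 60.0, 0.0])  # a reddish-orange color

for name, vector in sorted(
    colors.items(),
    key=lambda item: cosine_similarity(query, item[1]),
    reverse=True,
):
    print(name, round(cosine_similarity(query, vector), 3))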

RGB Space

Input/Output

In the broad sense, an application can either wait for a CPU to finish processing something or for some external operation to complete (like a long-running HTTP request or waiting for some specific device to be ready).

While the primary bottleneck in traditional applications and setups is often the CPU, when working with LLMs it is the GPU and general Input/Output.

For example, when working with an LLM’s HTTP API, requests can take multiple seconds to complete. The same endpoints can have vastly varying response times. This can be due to the GPU being busy, the model being swapped out of memory, or the prompt itself.

A lot of LLMOps work is about mitigating those issues.

Infrastructure

Regarding the infrastructure, Input/Output issues require us to use a different set of metrics than with CPU-bound applications.

For example, if you are running a llama.cpp server with 8 available slots, using even 2-3 of them might put a noticeable strain on your GPU. Yet, thanks to Continuous Batching and the caching of generated tokens, the server might easily handle 5 more parallel requests. Meanwhile, metrics like the percentage of hardware utilization will suggest that you have to scale up, which is not the case and might be a waste of resources.
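
A sketch of such an application-level signal, assuming a llama.cpp server build that reports slot usage (depending on the version, this is exposed via the /health or /slots endpoint, so the field names below may need adjusting):

# Ask the llama.cpp server how many slots are busy and use that, rather than
# raw GPU utilization, as the scaling signal.
import requests

health = requests.get("http://127.0.0.1:8080/health", timeout=5).json()

slots_idle = health.get("slots_idle", 0)
slots_processing = health.get("slots_processing", 0)
total_slots = slots_idle + slots_processing

if total_slots and slots_processing / total_slots > 0.8:
    print("Most slots are busy: consider scaling up.")
else:
    print(f"{slots_idle} of {total_slots} slots are free: no need to scale.")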

Application Layer

While LLMOps primarily focuses on the infrastructure, Input/Output issues on the application layer make it extremely important to pick a programming language that supports concurrency or parallelism (asynchronous or threaded execution) so that long-running operations do not block your application. Languages like JavaScript, Golang, Python with asyncio, and Rust are good choices here.

PHP can also be used, but I recommend complementing it with the Swoole extension (which gives PHP Go-like coroutines) or the AMPHP library. By default, PHP is synchronous and, combined with FPM, relies on a worker pool. Let’s say you have 32 synchronous workers running in your application. It is possible to block all of them by handling 32 parallel requests that each execute a 20-second-plus HTTP request. You might end up in a situation where your CPU is almost idling, but your server cannot accept more requests.

The same applies to Python, but it has more mature built-in solutions to handle the same issues and gives easier access to multiprocessing and threading.

You can read more in the Application Layer section.

Large Language Model

Everyone can see what a horse is.

Benedykt Chmielowski, Nowe Ateny (1745)

For practical purposes, in this handbook, any AI model that can handle user prompts to produce human-like text and follow instructions is considered a “large language model.” This includes GPT, Llama, and any other models that may be developed in the future.

These models are typically trained on a large corpus of text data and can generate human-like text in response to user prompts.

In this handbook, we will discuss how to use these models, fine-tune them, evaluate their performance, and build applications around them.

We will not focus on how to create new Large Language Models nor on their internal architecture besides the basics.

Load Balancing

Load balancing allows you to distribute the load (preferably evenly) among multiple servers.

In this handbook, we assume that you intend to use GPU or TPU servers for inference. TPUs and GPUs present pretty much the same class of benefits and issues, so we will use the term GPU to cover both.

The interesting thing is that having some experience with 3D game development might help you get into LLMOps and resolve some GPU-related issues.

Differences Between Balancing GPU and CPU Load

In the context of LLMOps, the primary factors we have to deal with are Input/Output bottlenecks instead of the usual CPU bottlenecks. That forces us to adjust how we design our infrastructure and applications.

We will also often use a different set of metrics than in traditional load balancing; they are usually closer to the application level (like the number of context slots in use, the number of buffered application requests, and so on).
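
As a sketch of what balancing on such metrics can look like (essentially what Paddler, discussed later, automates), the snippet below picks the upstream llama.cpp instance reporting the most idle slots; the upstream addresses are hypothetical, and the slot fields assume a server build that exposes them via /health:

# Slot-aware upstream selection: route the next request to the llama.cpp
# instance with the most idle slots instead of using plain round-robin.
import requests

UPSTREAMS = ["http://10.0.0.11:8080", "http://10.0.0.12:8080"]  # hypothetical hosts

def idle_slots(base_url: str) -> int:
    try:
        health = requests.get(f"{base_url}/health", timeout=2).json()
        return int(health.get("slots_idle", 0))
    except requests.RequestException:
        return -1  # treat unreachable instances as the worst choice

best_upstream = max(UPSTREAMS, key=idle_slots)
print("Routing the next completion request to", best_upstream)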

Forward Proxy

A forward proxy is an intermediary server between the client and the origin server. Clients connect to the forward proxy server and request a resource (such as a completion) available on a different server that is otherwise inaccessible to them. The forward proxy server retrieves the resource and forwards it to the client.

You can combine both forward proxy and reverse proxy to create a gateway.

llama.cpp Forward Proxy

llama.cpp implements its own forward proxy in the form of an RPC server.

The main llama.cpp host connects to multiple rpc-server backends and distributes the work among them.

flowchart TD
    rpcb---|TCP|srva
    rpcb---|TCP|srvb
    rpcb-.-|TCP|srvn
    subgraph hostn[Host N]
    srvn[rpc-server]-.-backend3["Backend (CUDA,Metal,etc.)"]
    end
    subgraph hostb[Host B]
    srvb[rpc-server]---backend2["Backend (CUDA,Metal,etc.)"]
    end
    subgraph hosta[Host A]
    srva[rpc-server]---backend["Backend (CUDA,Metal,etc.)"]
    end
    subgraph host[Main Host]
    ggml[llama.cpp]---rpcb[RPC backend]
    end
source: llama.cpp repository

Reverse Proxy

A reverse proxy server retrieves resources from one or more servers on a client’s behalf. These resources are then returned to the client, appearing to originate from the source server itself. It abstracts your infrastructure setup from the end users, which is useful for implementing scaling, security middleware, and load balancing.

While forward and reverse proxies may seem functionally similar, their differences lie primarily in their use cases and perspectives. A forward proxy acts on behalf of clients seeking resources from various servers, often used for client privacy and access control. A reverse proxy acts on behalf of servers, making resources available to clients while hiding the backend server details.

That means a reverse proxy hides its presence from the clients and acts as an intermediary between them and the servers. When you communicate with a reverse proxy, it is as if you communicated directly with the target server.

That is one of the primary differences between a forward proxy and a reverse proxy.

You can combine both forward proxy and reverse proxy to create a gateway.

Paddler

Paddler is a reverse proxy server and load balancer made specifically for llama.cpp. You can communicate with it like a regular llama.cpp instance. You can learn more on its dedicated page.

Gateway

Functionally, forward and reverse proxies are similar in that they both act as intermediaries handling requests. The key difference lies in the direction of the request. A forward proxy is used by clients to forward requests to other servers, often used to bypass network restrictions or for caching. It announces its presence to the end user. In contrast, servers use a reverse proxy to forward responses to clients, often for load balancing, security, and caching purposes, and it hides its presence from the end user.

When combined, forward and reverse proxies can create a gateway. A gateway serves as a front-end for underlying services and acts as an entry point for users to the application. It handles both incoming client requests and outgoing server responses.

Temperature

The temperature parameter in LLMs controls the randomness of the output. Lower temperatures make the output more deterministic (less creative), while higher temperatures increase variability. Even at low temperatures, some variability remains due to the probabilistic nature of the models.
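
A quick way to get a feel for the parameter, assuming a local llama.cpp server, is to send the same prompt with different temperature values and compare the outputs; the host and prompt below are illustrative.

# Send the same prompt with increasing temperature values; low values give
# near-deterministic text, higher values produce more varied completions.
import requests

LLAMA_CPP_URL = "http://127.0.0.1:8080/completion"  # assumed local server
prompt = "Describe a sunset in one sentence."

for temperature in (0.0, 0.7, 1.5):
    response = requests.post(
        LLAMA_CPP_URL,
        json={"prompt": prompt, "n_predict": 48, "temperature": temperature},
        timeout=120,
    )
    print(f"temperature={temperature}: {response.json()['content'].strip()}")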

llama.cpp

Note

Llama.cpp is a production-ready, open-source runner for various Large Language Models.

It has an excellent built-in server with HTTP API.

In this handbook, we will use it with Continuous Batching enabled, which in practice allows it to handle parallel requests.

Installing on AWS EC2 with CUDA

This tutorial was tested on a g4dn.xlarge instance running Ubuntu 22.04 and was written explicitly for that operating system.

Installation Steps

  1. Start an EC2 instance of any class with a CUDA-capable GPU.

    If you want to compile llama.cpp on this instance, you will need at least 4GB for CUDA drivers and enough space for your LLM of choice. I recommend at least 30GB. Perform the following steps of this tutorial on the instance you started.

  2. Install build dependencies:

    sudo apt update
    
    sudo apt install build-essential ccache
    
  3. Install CUDA Toolkit (only the Base Installer). Download it and follow the instructions at https://developer.nvidia.com/cuda-downloads

    At the time of writing this tutorial, the highest Ubuntu version supported by the CUDA Toolkit was 22.04. But do not fear! :) We’ll get it to work with some minor workarounds (see the Potential Errors section)

  4. Install NVIDIA Drivers:

    sudo apt install nvidia-driver-555
    
  5. Compile llama.cpp:

    git clone https://github.com/ggerganov/llama.cpp.git
    
    cd llama.cpp
    
    GGML_CUDA=1 make -j
    
  6. Benchmark llama.cpp (optional):

    Follow the official tutorial if you intend to run the benchmark. However, keep using GGML_CUDA=1 make to compile llama.cpp (do not use LLAMA_CUBLAS=1): https://github.com/ggerganov/llama.cpp/discussions/4225

    Instead of performing a model quantization yourself, you can download quantized models from Hugging Face. For example, you can download Mistral Instruct from https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/tree/main

Potential Errors

CUDA Architecture Must Be Explicitly Provided

ERROR: For CUDA versions < 11.7 a target CUDA architecture must be explicitly 
provided via environment variable CUDA_DOCKER_ARCH, e.g. by running 
"export CUDA_DOCKER_ARCH=compute_XX" on Unix-like systems, where XX is the 
minimum compute capability that the code needs to run on. A list with compute 
capabilities can be found here: https://developer.nvidia.com/cuda-gpus

You need to check the mentioned page (https://developer.nvidia.com/cuda-gpus) and pick the appropriate version for your instance’s GPU. g4dn instances use the T4 GPU, which corresponds to compute_75.

For example:

CUDA_DOCKER_ARCH=compute_75 GGML_CUDA=1 make -j

Failed to initialize CUDA

ggml_cuda_init: failed to initialize CUDA: unknown error

This can sometimes be solved with sudo modprobe nvidia_uvm.

You can also create a Systemd unit that loads the module on boot:

[Unit]
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/sbin/modprobe nvidia_uvm

[Install]
WantedBy=multi-user.target

NVCC not found

/bin/sh: 1: nvcc: not found

You need to add the CUDA path to your shell environment variables.

For example, with Bash and CUDA 12:

export PATH="/usr/local/cuda-12/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-12/lib64:$LD_LIBRARY_PATH"

cannot find -lcuda

/usr/bin/ld: cannot find -lcuda: No such file or directory

That means your Nvidia drivers are not installed. Install NVIDIA Drivers first.

Cannot communicate with NVIDIA driver

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

If you installed the drivers, reboot the instance.

Failed to decode the batch

failed to decode the batch, n_batch = 0, ret = -1
main: llama_decode() failed

There are two potential causes of this issue.

Option 1: Install NVIDIA drivers

Make sure you have installed the CUDA Toolkit and NVIDIA drivers. If you do, restart your server and try again. Most likely, NVIDIA kernel modules are not loaded.

sudo reboot

Option 2: Use different benchmarking parameters

For example, with Mistral Instruct 7B what worked for me is:

./llama-batched-bench -m ../mistral-7b-instruct-v0.2.Q4_K_M.gguf 2048 2048 512 0 999 128,256,512 128,256 1,2,4,8,16,32

Installing with AWS Image Builder

This tutorial explains how to install llama.cpp with AWS EC2 Image Builder.

By putting llama.cpp in an EC2 Image Builder pipeline, you can automatically build custom AMIs with llama.cpp pre-installed.

You can also use that AMI as a base and add your foundation model on top of it. Thanks to that, you can quickly scale your llama.cpp instance groups up or down.

We will repackage the base EC2 tutorial as a set of Image Builder Components and Workflow.

You can complete the tutorial steps either manually or by automating the setup with Terraform/OpenTofu. Terraform source files are linked to their respective tutorial steps.

Installation Steps

  1. Create an IAM imagebuilder role (source file)

    Go to the IAM Dashboard, click “Roles” from the left-hand menu, and select “AWS service” as the trusted entity type. Next, select “EC2” as the use case:

    screenshot-01

    Next, assign the following policies:

    • arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
    • arn:aws:iam::aws:policy/EC2InstanceProfileForImageBuilderECRContainerBuilds
    • arn:aws:iam::aws:policy/EC2InstanceProfileForImageBuilder
    • arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

    Name your role (for example, “imagebuilder”) and finish creating it. You should end up with permissions and trust relationships looking like this:

    screenshot-02 screenshot-03

  2. Create components.

    We’ll need to add four components that will act as the building blocks of our Image Builder pipeline.

    To create them via the GUI, navigate to the EC2 Image Builder service on AWS and select “Components” from the menu. You can refer to the generic EC2 tutorial for more details.

    Click “Create component”. Next, for each component:

    • Choose “Build” as the component type
    • Select “Linux” as the image OS
    • Select “Ubuntu 22.04” as the compatible OS version

    Provide the following as component names and contents in YAML format:

    Component name: apt_build_essential

     name: apt_build_essential
     description: "Component to install build essentials on Ubuntu"
     schemaVersion: '1.0'
     phases:
       - name: build
         steps:
           - name: InstallBuildEssential
             action: ExecuteBash
             inputs:
               commands:
                 - sudo apt-get update
                 - DEBIAN_FRONTEND=noninteractive sudo apt-get install -yq build-essential ccache
             onFailure: Abort
             timeoutSeconds: 180
    

    Component name: apt_nvidia_driver_555

     name: apt_nvidia_driver_555
     description: "Component to install NVIDIA driver 550 on Ubuntu"
     schemaVersion: '1.0'
     phases:
       - name: build
         steps:
           - name: apt_nvidia_driver_555
             action: ExecuteBash
             inputs:
               commands:
                 - sudo apt-get update
                 - DEBIAN_FRONTEND=noninteractive sudo apt-get install -yq nvidia-driver-550
             onFailure: Abort
             timeoutSeconds: 180
           - name: reboot
             action: Reboot
    

    Component name: cuda_toolkit_12

     name: cuda_toolkit_12
     description: "Component to install CUDA Toolkit 12 on Ubuntu"
     schemaVersion: '1.0'
     phases:
       - name: build
         steps:
           - name: apt_cuda_toolkit_12
             action: ExecuteBash
             inputs:
               commands:
                 - DEBIAN_FRONTEND=noninteractive sudo apt-get -yq install nvidia-cuda-toolkit
             onFailure: Abort
             timeoutSeconds: 600
           - name: reboot
             action: Reboot
    

    Component name: llamacpp_gpu_compute_75

     name: llamacpp_gpu_compute_75
     description: "Component to install and compile llama.cpp with CUDA compute capability 75 on Ubuntu"
     schemaVersion: '1.0'
     phases:
       - name: build
         steps:
           - name: compile
             action: ExecuteBash
             inputs:
               commands:
                 - cd /opt
                 - git clone https://github.com/ggerganov/llama.cpp.git
                 - cd llama.cpp
                 - |
                   CUDA_DOCKER_ARCH=compute_75 \
                   LD_LIBRARY_PATH="/usr/local/cuda-12/lib64:$LD_LIBRARY_PATH" \
                   GGML_CUDA=1 \
                   PATH="/usr/local/cuda-12/bin:$PATH" \
                   make -j
             onFailure: Abort
             timeoutSeconds: 1200
    

    Once you’re finished, you’ll see all the created components you added on the list:

    screenshot-04

  3. Add Infrastructure Configuration source file

    Next, we’ll create a new Infrastructure Configuration. Select it from the left-hand menu and click “Create”. You’ll need to use the g4dn.xlarge instance type or any other instance type that supports CUDA. Name your configuration, select the IAM role you created in step 1, and select the instance type, for example:

    screenshot-05

  4. Add Distribution Configuration source file

    Select Distribution settings in the left-hand menu to create a Distribution Configuration. It specifies how the AMI should be distributed (on what type of base AMI it will be published). Select Amazon Machine Image, name the configuration, and save:

    screenshot-06

  5. Add Image Pipeline source file

    Next, we’ll add the Image Pipeline. It will use the Components, Infrastructure Configuration, and Distribution Configuration we prepared previously. Select “Image Pipeline” from the left-hand menu and click “Create”. Name your image pipeline, and select the desired build schedule.

    As the second step, create a new recipe. Choose AMI, name the recipe:

    screenshot-07

    Next, select the previously created components:

    screenshot-08

  6. The next step is to build the image. You should be able to run the pipeline:

    screenshot-09

  7. Launch test EC2 Instance.

    When launching an EC2 instance, the llama.cpp image we prepared should be available in the “My AMIs” list:

    screenshot-10

Summary

Feel free to open an issue if you find a bug in the tutorial or have ideas on how to improve it.

Ollama

Note

Ollama is a convenient and easy-to-use wrapper around llama.cpp.

It acts like a llama.cpp multiplexer. You can start a new conversation or request completion from a specific LLM without manually downloading weights and setting them up.

For example, when you request completion from a model that is not yet loaded, it checks if it is possible to fit that new model into RAM or VRAM. If so, it internally starts a new llama.cpp instance and uses it to load the LLM. If the requested model has not yet been downloaded, it will download it for you.

In general terms, it acts like a llama.cpp forward proxy and a supervisor.

For example, if you load both llama3 and phi-3 into the same Ollama instance, you will get something like this:

flowchart TD
    Ollama --> llama1[llama.cpp with llama3]
    Ollama --> llama2[llama.cpp with phi-3]
    llama1 --> VRAM
    llama2 --> VRAM
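
A minimal sketch of requesting a completion over Ollama’s HTTP API, assuming a default local installation listening on port 11434 and a model (here phi3) that has already been pulled:

# Request a completion from Ollama's HTTP API; Ollama takes care of loading
# the model into an internal llama.cpp instance if it is not loaded yet.
import requests

response = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={
        "model": "phi3",
        "prompt": "Explain continuous batching in one paragraph.",
        "stream": False,
    },
    timeout=300,
)
print(response.json()["response"])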

Viability for Production

Predictability

Although the Ollama approach is convenient for local development, it causes some deployment problems (compared to llama.cpp).

With llama.cpp, you can divide the context of the loaded model into a specific number of slots, which makes it easy to predict how many parallel requests a given server can handle.

That is not easy to predict with Ollama. It manages the slots internally and does not expose an equivalent of the llama.cpp /health endpoint to monitor the currently used resources. Even if it did, several different models sharing server resources can be loaded simultaneously at any time.

We might end up in a situation where Ollama keeps both llama3 (a 70B parameter model) and phi-3 (a 3.8B parameter model) loaded. A completion request to llama3 will use many more resources than asking phi-3 for a completion, and 8 slots of llama3 require many more resources than 8 slots of phi-3.

How can that be balanced effectively? As a software architect, you would have to plan an infrastructure that does not allow developers to load arbitrary models into memory and that forces a specific number of slots, which defeats the purpose of Ollama.

Good Parts of Ollama

I greatly support Ollama because it makes it easy to start your journey with large language models. You can use Ollama in production deployments, but llama.cpp is a better choice because it is predictable.

Ollama is better suited than llama.cpp for end-user distributable applications. By that, I mean the applications that do not use an external server but are installed and run in their entirety on the user’s device. The same thing that makes it less predictable regarding resource usage makes it more resilient to end-user errors. In that context, resource usage predictability is less important than on the server side.

That is why this handbook is almost entirely based on vanilla llama.cpp as it is much better for server-side deployments (based on all the reasons above).

The situation might change with the future Ollama releases.

Paddler

Note

Additional note from the author of the handbook:

Paddler is my personal project and is not part of llama.cpp, but I am including it here as it is a useful tool for deploying llama.cpp in production. It helped me, and I hope it helps you too.

Note

Paddler is an open-source, stateful load balancer and reverse proxy designed for servers running llama.cpp. Unlike typical strategies such as round-robin or least connections, Paddler balances requests based on each server’s available slots.

It uses agents to monitor the health of llama.cpp instances and dynamically adjusts to servers being added or removed, making it easier to integrate with autoscaling tools.

Fine-tuning

Fine-tuning is the process of taking a pre-trained model and further training it on a new task. This is typically useful when you want to repurpose a model trained on a large-scale dataset for a new task with less data available.

In practice, that means fine-tuning allows the model to adapt to the new data without forgetting what it has learned before.

A good example might be the sqlcoder model, which is a starcoder model (a general coding model) fine-tuned to be exceptionally good at producing SQL.

Retrieval Augmented Generation

Retrieval augmented generation does not modify the underlying model in any way. Instead, it is an approach to directly influence its responses.

In practice, and as a significant simplification, RAG is about injecting data into the Large Language Model’s prompt.

For example, let’s say the user asks the LLM:

  • What are the latest articles on our website?

To augment the response, you need to intercept the user’s question and tell the LLM to respond more or less like this:

  • You are a <insert persona here>. Tell the user that the latest articles on our site are <insert latest articles metadata here>

That is greatly simplified, but generally, that is how it works. Along the way, embeddings and vector databases are involved.
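
Here is a deliberately simplified sketch of that flow. The bag-of-characters toy_embed function below is a stand-in for a real embedding model and vector database, and the final prompt would then be sent to the LLM for completion.

# Minimal RAG flow: embed the documents and the question, retrieve the most
# similar document, and inject it into the prompt.
import numpy as np

def toy_embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model: a bag-of-characters vector.
    vector = np.zeros(26)
    for character in text.lower():
        if character.isalpha():
            vector[ord(character) - ord("a")] += 1
    return vector

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / ((np.linalg.norm(a) * np.linalg.norm(b)) or 1.0))

articles = [
    "Our newest article covers continuous batching in llama.cpp.",
    "An older article explains GBNF grammars.",
]
article_vectors = [toy_embed(article) for article in articles]

question = "What are the latest articles on our website?"
query_vector = toy_embed(question)

best_article = max(
    zip(article_vectors, articles),
    key=lambda pair: cosine_similarity(query_vector, pair[0]),
)[1]

prompt = (
    "You are a helpful website assistant.\n"
    f"Relevant article: {best_article}\n"
    f"User question: {question}\n"
    "Answer using only the article above."
)
print(prompt)  # this augmented prompt is what gets sent to the LLM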

Predictability

The common issue with Large Language Models is the consistency and structure of outputs.

Software Engineering vs AI

The last few decades of IT developments have accustomed us to extreme predictability. Each time we call a specific API endpoint or use a specific button, the same thing happens consistently, under our complete control.

That is not the case with AI, which operates on probabilities. That stems from the way this kind of software is created: neural network designers design just the network and the training process, but they do not design the actual reasoning. The reasoning is learned by the network during training and is not under the designer’s control.

That is totally different from the traditional software development process, where we design the reasoning and the process, and the software just executes it.

That is why you might feed Large Language Models with the same prompt multiple times and get different outputs each time. The Temperature parameter may be used to limit the “creativity” of the model, but even setting it to zero does not guarantee predictable outputs.

Structured Outputs

LLMs not being completely predictable may cause some issues, but no technical solution is completely one-sided, and we can turn that flexibility to our advantage.

LLMs are extremely good at understanding natural language. In practice, we can finally communicate with computers in a way similar to how we communicate with other people. We can create systems that interpret such unstructured inputs and react to them in a structured and predictable way. This way, we can use the good parts of LLMs to our advantage and mitigate most of the unpredictability issues.

Use Cases

Some use cases include (but are not limited to):

  • Searching through unstructured documents (e.g., reports in .pdf, .doc, .csv, or plain text)
  • Converting emails into actionable structures (e.g., converting requests for quotes into API calls with parameters for internal systems)
  • Question answering systems that interpret the context of user queries

Matching the Grammar

Some Large Language Model runners support formal grammars. A grammar can be used to force the output to follow a certain structure (for example, speaking only in emojis or outputting only valid moves in Portable Game Notation).

It still does not guarantee that the output will be valid (in a semantic sense), but at least matching the formal grammar guarantees it will follow the correct structure.

One of the popular uses is to force a Large Language Model to match a specific JSON Schema.

llama.cpp

llama.cpp supports GBNF formal grammars, an extension of Backus-Naur form with support for some regular expression constructs.
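
As a hedged sketch, assuming a local llama.cpp server, the request below constrains the output with a tiny GBNF grammar that only allows the answers “yes” or “no”; depending on the server version, a JSON Schema can be supplied in a similar way to force valid JSON output.

# Constrain a completion with a GBNF grammar so the model can only answer
# "yes" or "no".
import requests

GRAMMAR = 'root ::= "yes" | "no"'

response = requests.post(
    "http://127.0.0.1:8080/completion",  # assumed local server
    json={
        "prompt": "Is the sky blue? Answer with yes or no.",
        "n_predict": 4,
        "grammar": GRAMMAR,
    },
    timeout=120,
)
print(response.json()["content"])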

Application Layer

This chapter is not strictly related to LLMOps, but it is a good place to discuss best practices for architecting and developing applications that use Large Language Models.

Such applications have to deal with issues that are not typically encountered in traditional web development, primarily long-running HTTP requests and, on the MLOps side, using custom models for inference.

Until Large Language Models became mainstream and in demand by a variety of applications, the issue of dealing with long-running requests was much less prevalent. Due to functional requirements, microservice requests would typically take 10ms or less, while waiting for a Large Language Model to complete inference can take multiple seconds.

That calls for some adjustments in the application architecture: non-blocking Input/Output and asynchronous programming.

This is where languages with strong asynchronous support shine: Python with its asyncio library, Rust with tokio, Go with its goroutines, and so on.

Programming languages like PHP, which are synchronous by default, might struggle unless supplemented by extensions like Swoole (which essentially gives PHP Go-like coroutines) or libraries like AMPHP. Introducing support for asynchronous programming in PHP can be a challenge, but it is possible.

Long-Running

In web development, there are two primary application models:

  1. Long-running processes (for example, a web application written in Go that keeps running and the same process responds to multiple incoming requests)
  2. Worker-based, single-threaded, synchronous (for example, PHP with FPM, Ruby, and some Python setups - it is generally used by scripting languages)

It is not necessarily connected to a specific language (for example, PHP can also start a long-running process with a web server, but most PHP frameworks were not designed with that application model in mind and without extensions like Swoole, it won’t be preemptive).

Python can run synchronously with some frameworks and WSGI, but it can also be run as a long-running application with ASGI or projects like Granian.

The Problem with the Worker-Based Synchronous Model

We will use PHP-FPM as an example. On Debian, it comes preconfigured with the max_children parameter set to 5 by default, which means it can spawn at most 5 workers and handle at most 5 requests in parallel. This parameter can be tweaked, and under normal circumstances, it can be set to a much higher value at the cost of additional RAM usage.

Let’s assume we have 32 workers running. Normally, responding to a request takes milliseconds at most, and this runtime model is not an issue, but working with LLMs or ML models turns that around. A typical request to an LLM can take multiple seconds to complete, sometimes 20-30 seconds, depending on the number of tokens to be generated. That means that with just 32 parallel requests to our application (which is not a lot), all the synchronous PHP workers can be blocked while waiting for tokens to be generated.

We can end up in a situation where our server’s CPU is almost idling, but it can’t accept more requests due to an Input/Output bottleneck.

flowchart TD
    client[Client]

    client---|Request|response_handler

    subgraph worker[Worker]
    response_handler[Response Handler]---response["Response"]
    end

    subgraph llm[LLM]
    response_handler-->|Waits for|llmserver[Completion]
    llmserver-->response
    end

While the tokens are generated, the worker is at a standstill and cannot accept additional requests.

The Solution

The solution is to use languages and frameworks that support long-running processes (that do not rely on spawning workers) and with any form of asynchronicity.

The perfect example of such a language is Go with its goroutines. Each goroutine uses just a few kilobytes of memory, and a server with a few gigabytes of RAM can potentially spawn millions of them. They run asynchronously and are preemptively scheduled, so there shouldn’t be a situation where just a few requests exhaust the entire server capacity.

Asynchronous Programming

By asynchronous programming, we mean the ability to execute multiple tasks concurrently without blocking the main thread; that does not necessarily involve using threads and processes. A good example is the JavaScript execution model, which is, by default, single-threaded but asynchronous. It does not offer parallelism (without worker threads), but it can still issue concurrent network requests, database queries, etc.

Considering that most of the bottlenecks related to working with Large Language Models stem from Input/Output issues (primarily the LLM APIs’ response times and the time it takes to generate all the completion tokens) and not the CPU itself, asynchronous programming techniques are often a necessity when architecting such applications.

When it comes to network requests, large language models pose a different challenge than most web applications. While most of the REST APIs tend to have consistent response times below 100ms, when working with large language model web APIs, the response times might easily reach 20-30 seconds until all the requested tokens are generated and streamed.

Affected Runtimes

Scripting languages like PHP and Ruby are primarily affected because they are synchronous by default. That is especially cumbersome with PHP, which commonly uses an FPM pool of workers as a hosting method. For example, Debian’s default worker pool amounts to five workers. That means that if each of them were busy handling a 30-second request, the sixth request would have to wait for the first one to finish. It also means you can easily run into a situation where your server’s CPU is idling, but it cannot accept more requests.

Coroutines, Promises to the Rescue

To mitigate the issue, you can use any programming language with async support, which primarily manifests as support for Promises (Futures) or Coroutines. That includes JavaScript, Golang, Python (with asyncio), and PHP (with Swoole).
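
A sketch of that approach in Python with asyncio, assuming the aiohttp client library and a local llama.cpp server; none of the in-flight requests blocks the event loop while tokens are being generated, so a handful of slow completions cannot exhaust the whole application.

# Issue several long-running completion requests concurrently from a single
# event loop instead of dedicating one blocking worker to each of them.
import asyncio

import aiohttp

LLAMA_CPP_URL = "http://127.0.0.1:8080/completion"  # assumed local server

async def complete(session: aiohttp.ClientSession, prompt: str) -> str:
    async with session.post(
        LLAMA_CPP_URL, json={"prompt": prompt, "n_predict": 128}
    ) as response:
        payload = await response.json()
        return payload["content"]

async def main() -> None:
    prompts = ["Summarize LLMOps in one sentence."] * 8
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *(complete(session, prompt) for prompt in prompts)
        )
    print(f"Handled {len(results)} completions concurrently.")

asyncio.run(main())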

Preemptive vs Cooperative Scheduling

It is also really important to understand the preemptive aspect of async languages. Although preemptiveness is primarily an aspect of threading, it plays a role when scheduling promises and coroutines. For example, PHP natively implements Fibers, which grants it some degree of asynchronicity, although they are not preemptive. This means that if you try something silly in your code, for example:

<?php

startCoroutine(function () {
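    // startCoroutine() is a hypothetical helper standing in for your runtime's
    // primitive, e.g. Swoole's \Swoole\Coroutine::create() or an AMPHP equivalent.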
    while (true) { echo 'hi'; }
});

PHP’s built-in scheduler will never give time to other asynchronous functions, meaning your script will get stuck on that specific coroutine. The same applies to Node.js and JavaScript Promises.

On the other hand, Go is preemptive by default, meaning that the runtime will automatically switch between coroutines, and you don’t have to worry about it. That is especially useful because you do not have to worry about infinite loops or blocking requests as long as you structure your code around coroutines.