llama.cpp

Note: GitHub repository: https://github.com/ggerganov/llama.cpp

Llama.cpp is a production-ready, open-source runner for various Large Language Models.

It includes an excellent built-in server with an HTTP API.

In this handbook, we will use Continuous Batching, which in practice lets a single server instance handle multiple requests in parallel.
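
To make the idea of parallel requests concrete, here is a minimal client-side sketch in Python. It assumes a llama.cpp server is already running locally on port 8080 and was started with more than one slot (for example with `--parallel 4`, plus `--cont-batching` if your build does not enable it by default). It uses the server's `/completion` endpoint with the `prompt` and `n_predict` fields. The helper names (`build_request`, `complete`, `complete_many`) and the example prompts are hypothetical and exist only for illustration.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: a llama.cpp server is listening at this address.
BASE_URL = "http://127.0.0.1:8080"


def build_request(prompt: str, n_predict: int = 64,
                  base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the server's /completion endpoint."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send one prompt and return the generated text."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as response:
        return json.loads(response.read())["content"]


def complete_many(prompts: list[str], workers: int = 4) -> list[str]:
    """Send several prompts concurrently; results keep the order of `prompts`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(complete, prompts))


if __name__ == "__main__":
    # Each prompt goes out as its own HTTP request; the server decides
    # how to schedule them across its slots.
    for text in complete_many(["Say hello.", "Name a color.", "Count to three."]):
        print(text)
```

The client only issues ordinary concurrent HTTP requests. Whether they are actually processed together depends on how the server was started, so treat the worker count here as something to match against the number of server slots rather than a fixed recommendation.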