1. Prerequisites
If using Windows, use WSL: Video tutorial (25 min) to install WSL (spanish)
2. Installation
Install the following base dependencies. These are necessary for the llama.cpp compilation and installation tasks we are about to perform. The second line contains mandatory dependencies, and the third line contains recommended ones:
sudo apt update
sudo apt install -y git build-essential cmake
sudo apt install -y ccache libssl-dev
Now, let's download llama.cpp:
cd ~
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
Before to continue, select your Graphic Card:
If using Windows, ensure the latest nVIDIA drivers are installed.
It is also advisable to have the CUDA Toolkit. To do so, visit their website and select x86_64 / Debian / 13 y runfile (local). Verify that it provides a code similar to the one below. We will use this to install the CUDA Toolkit:
wget https://developer.download.nvidia.com/compute/cuda/13.3.0/local_installers/cuda_13.3.0_610.43.02_linux.run
sudo sh cuda_13.3.0_610.43.02_linux.run
If using Windows, ensure the latest AMD drivers are installed.
Next, we are going to compile llama.cpp. This will adapt llama.cpp specifically to your system to achieve maximum performance.
Additionally, I recommend the following optional but beneficial guidelines:
- Use OpenBLAS to accelerate CPU inference (in case the VRAM is completely full).
- Using
nprocmaximizes compilation speed but may freeze the system until it finishes. Remove the-jparameter if you prefer not to accelerate the compilation.
For nvidia, use CUDA with -DGGML_CUDA=ON flag.
sudo apt install libopenblas-dev
cmake -B build -DGGML_CUDA=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake -B build --config Release -j$(nproc)
For AMD, use ROCm with -DGGML_HIPBLAS=ON flag.
sudo apt install libopenblas-dev
cmake -B build -DGGML_HIPBLAS=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake -B build --config Release -j$(nproc)
This compilation process may take some time, depending on your system's power. Once finished, if no errors appeared, everything should be compiled in the build/bin folder. You can check if everything is correct like this:
cd build
cd bin
./llama-cli --version
version: 9518 (7c158fbb4)
built with GNU 13.3.0 for Linux x86_64
Now, you just need to install it on the system. You can do it this way, and you will then be able to run llama-server from anywhere:
sudo cmake --install build
sudo ldconfig
llama-server --version
version: 9518 (7c158fbb4)
built with GNU 13.3.0 for Linux x86_64
3. Usage
Llama.cpp has executable binaries for multiple uses, but the main one we will use is llama-server. My recommendation is to download a small model that, while a bit "simple," is fast enough to test and verify that everything is working correctly.
For the test, we are going to use the unquantized Qwen 3.5 0.8B. Let's download it and save it in the ~/.models/gguf folder:
mkdir -p ~/.models/gguf
wget "https://huggingface.co/unsloth/Qwen3.5-0.8B-GGUF/resolve/main/Qwen3.5-0.8B-BF16.gguf?download=true" -O ~/.models/gguf/Qwen3.5-0.8B-BF16.gguf
Once downloaded, let's activate it with llama-server:
llama-server -m ~/.models/gguf/Qwen3.5-0.8B-BF16.gguf
A lot of information will appear until it stops and a message similar to the following is displayed:
llama_server: server is listening on http://127.0.0.1:8080
Congratulations! You have it ready. Accessing to http://127.0.0.1:8080 page, you will have a chat interface similar to ChatGPT to test the model. Remember that this is a very small model oriented toward mobile devices and very simple tasks. I recommend visiting the Qwen models section to find a more powerful model (and one with higher quantization to reduce its size) according to your hardware.
4. Optimization
Let's imagine we are going to load a model on a machine with 16GB of VRAM (GPU). If you want to fine-tune the performance of a local model to the maximum, you can use the following parameters in llama-server:
#!/bin/bash
llama-server \
# Path to the model to use. Unsloth Qwen3.6 35B A3B quantized to IQ2_XXS (~11GB)
-m ~/.models/gguf/qwen3.6-35b-a3b-iq2_xxs.gguf \
# The model fits GPU entirely. Try to offload as much as possible to the GPU.
--n-gpu-layers 999 \
# Context window (~64K tokens) KV quantized to 4 bits.
--ctx-size 64000 -ctk q4_0 -ctv q4_0 \
# Enabled reduces VRAM usage in large contexts (> 16K)
--flash-attn on \
# Token limit for reasoning (-1 = no limit)
--reasoning-budget 1024 \
# Reduces inter-token latency by keeping a constant CPU thread at 100%.
--poll 100 \
# Processing throughput (queue size)
--batch-size 256 --ubatch-size 256 --cont-batching \
# CPU/Threads: In case CPU is used, CPU and thread usage.
--threads 8 --threads-batch 8 \
# Locks the content allocated in RAM, preventing it from swapping to disk
--mlock \
# Skips warmup. The first request = slightly longer, server = starts up quickly.
--no-warmup \
# Disables automatic memory fitting when it doesn't fit in VRAM.
--fit off \
# Reuses up to 256 tokens from the KV cache.
--cache-reuse 256 \
# Enables the template engine. Required for Qwen3 and its thinking format.
--jinja \
# Single user/agent. Increase only if using sub-agents.
--parallel 1 \
# Temperature: Randomness, lower = more deterministic
# Top-K: Filters highly unlikely tokens (only the 20 most likely)
# Top-P: Considers tokens above 95% probability
# Min-P: Third filter. Disabled.
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 \
# Server port
--port 8999
Remove comments to create script.