llama.cpp Engine

llama.cpp is the core inference engine Jan uses to run AI models locally on your computer. This section covers the settings for the engine itself, which control how a model processes information on your hardware.

Find llama.cpp settings at Settings > Local Engine > llama.cpp.

You might need to modify these settings if:

  • Models load slowly or don’t work
  • You’ve installed new hardware (like a graphics card)
  • You want to optimize performance for your specific setup

| Feature | What It Does | When You Need It |
|---|---|---|
| Engine Version | Shows the current llama.cpp version | Check compatibility with newer models |
| Check Updates | Downloads engine updates | When new models require an updated engine |
| Backend Selection | Choose a hardware-optimized build | After hardware changes or performance issues |

Different backends are optimized for different hardware. Pick the one that matches your computer (if you're unsure which CPU instruction set yours supports, see the detection sketch after these lists):

For NVIDIA GPUs with CUDA 12.0:

  • llama.cpp-avx2-cuda-12-0 (most common)
  • llama.cpp-avx512-cuda-12-0 (newer Intel/AMD CPUs)

For NVIDIA GPUs with CUDA 11.7:

  • llama.cpp-avx2-cuda-11-7 (older NVIDIA drivers)

For CPU only:

  • llama.cpp-avx2 (modern CPUs)
  • llama.cpp-avx (older CPUs)
  • llama.cpp-noavx (very old CPUs)

For other GPUs:

  • llama.cpp-vulkan (AMD, Intel Arc)
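
For the CPU variants, the deciding factor is which SIMD instruction sets your processor supports. Here is a minimal detection sketch; it is Linux-only (it parses /proc/cpuinfo), and on other operating systems you should check your CPU model's spec sheet instead. The tier names map onto the backend list above.

```python
# Minimal sketch: detect the best SIMD tier for choosing a llama.cpp
# backend variant. Linux-only (parses /proc/cpuinfo).
def best_simd_tier() -> str:
    flags: set[str] = set()
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags.update(line.split(":", 1)[1].split())
                break
    if "avx512f" in flags:
        return "avx512"  # only offered in the CUDA 12.0 builds above
    if "avx2" in flags:
        return "avx2"    # llama.cpp-avx2 (or a -cuda-* variant)
    if "avx" in flags:
        return "avx"     # llama.cpp-avx
    return "noavx"       # llama.cpp-noavx

print("Pick a backend built for:", best_simd_tier())
```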

Performance Settings

| Setting | What It Does | Recommended | Impact |
|---|---|---|---|
| Continuous Batching | Handles multiple requests simultaneously | Enabled | Faster when using tools or multiple chats |
| Parallel Operations | Number of concurrent requests | 4 | Higher = more multitasking, but more memory |
| CPU Threads | Number of processor cores to use | Auto | More threads can speed up CPU inference |
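
To see what Continuous Batching and Parallel Operations enable, here is a sketch that fires four chat requests at once. It assumes Jan's local API server is enabled and reachable; the URL, port, and model id below are placeholders you should adjust to your own setup.

```python
# Sketch: send several chat requests concurrently. With Continuous
# Batching enabled and Parallel Operations = 4, the engine can serve
# these at the same time instead of one after another.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:1337/v1/chat/completions"  # placeholder address
MODEL = "llama3.2-3b-instruct"                     # placeholder model id

def ask(prompt: str) -> str:
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

prompts = ["Summarize AVX.", "What is a KV cache?",
           "Define mmap.", "What is Vulkan?"]
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])
```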

Memory Settings

| Setting | What It Does | Recommended | When to Change |
|---|---|---|---|
| Flash Attention | More memory-efficient attention computation | Enabled | Leave enabled unless problems occur |
| Caching | Remembers recent conversations | Enabled | Speeds up follow-up questions |
| KV Cache Type | Memory vs. quality trade-off | f16 | Change to q8_0 if low on memory |
| mmap | Memory-maps model files for efficient loading | Enabled | Helps with large models |
| Context Shift | Drops the oldest context so chats can continue past the context limit | Disabled | Enable for very long conversations |
KV Cache Type options:

  • f16: Best quality, uses the most memory
  • q8_0: Balanced memory use and quality
  • q4_0: Least memory, slight quality reduction
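
The memory differences are easy to estimate: the KV cache stores a key and a value vector per layer per token, so its size scales linearly with context length and with the bytes per element of the cache type. Here is a rough back-of-the-envelope sketch using illustrative Llama-3-8B-like dimensions; the per-element sizes for the quantized types are approximate, since llama.cpp stores small per-block scales alongside the quantized values.

```python
# Rough KV cache size estimate per cache type. Bytes per element are
# approximate for quantized types; model dimensions are illustrative.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 1.07, "q4_0": 0.57}

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, cache_type):
    # 2x for keys and values; one entry per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len \
        * BYTES_PER_ELEM[cache_type]

for t in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(32, 8, 128, 8192, t) / 2**30
    print(f"{t}: {gib:.2f} GiB")
# f16 ≈ 1.00 GiB, q8_0 ≈ 0.54 GiB, q4_0 ≈ 0.29 GiB at an 8K context
```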

Troubleshooting

Models won’t load:

  • Try a different backend
  • Check available RAM/VRAM (see the quick check after this list)
  • Update engine version
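
A quick way to rule out the most common cause is to compare the model file's size against free memory before loading. This sketch uses psutil (a third-party package: pip install psutil) and a placeholder model path; the 1.2x margin is a rough allowance for the KV cache and other runtime overhead, not an exact figure.

```python
# Sketch: compare a GGUF file's size to available system RAM before
# loading. A loaded model needs more memory than the file itself
# (KV cache, activations), hence the rough 1.2x margin.
import os
import psutil

model_path = "path/to/model.gguf"  # placeholder
model_bytes = os.path.getsize(model_path)
available = psutil.virtual_memory().available

if model_bytes * 1.2 > available:
    print(f"Likely too big: model {model_bytes/2**30:.1f} GiB, "
          f"free RAM {available/2**30:.1f} GiB")
else:
    print("Model should fit in RAM (GPU VRAM not checked).")
```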

Slow performance:

  • Verify GPU acceleration is active
  • Close memory-intensive applications
  • Increase GPU Layers in model settings

Out of memory:

  • Change KV Cache Type to q8_0
  • Reduce Context Size in model settings
  • Try a smaller model

Crashes or errors:

  • Switch to a more stable backend (avx instead of avx2)
  • Update graphics drivers
  • Check system temperature

Quick Setup

Most users:

  1. Use default settings
  2. Only change if problems occur

NVIDIA GPU users:

  1. Download CUDA backend
  2. Ensure GPU Layers is set high
  3. Enable Flash Attention

Performance optimization:

  1. Enable Continuous Batching
  2. Use appropriate backend for hardware
  3. Monitor memory usage