Local AI Engine

llama.cpp is the engine that runs AI models locally on your computer. Think of it as the software that takes an AI model file and makes it work on your hardware, whether that’s your CPU, graphics card, or Apple’s M-series chips.

Originally created by Georgi Gerganov, llama.cpp is designed to run large language models efficiently on consumer hardware without requiring specialized AI accelerators or cloud connections. Running models locally has several advantages:

  • Privacy: Your conversations never leave your computer
  • Cost: No monthly subscription fees or API costs
  • Speed: No internet required once models are downloaded
  • Control: Choose exactly which models to run and how they behave

Find llama.cpp settings at Settings > Model Providers > Llama.cpp:


| Feature | What It Does | When You Need It |
| --- | --- | --- |
| Engine Version | Shows which version of llama.cpp you’re running | Check compatibility with newer models |
| Check Updates | Downloads newer engine versions | When new models require an updated engine |
| Backend Selection | Choose the version optimized for your hardware | After installing new graphics cards or when performance is poor |

Jan offers different backend versions optimized for your specific hardware. Think of these as different “drivers”: each one is tuned for particular processors or graphics cards.

For NVIDIA GPUs, choose based on your CUDA version (check the NVIDIA Control Panel or run nvidia-smi):

For CUDA 12.0:

  • llama.cpp-avx2-cuda-12-0 (most common)
  • llama.cpp-avx512-cuda-12-0 (newer Intel/AMD CPUs)

For CUDA 11.7:

  • llama.cpp-avx2-cuda-11-7 (most common)
  • llama.cpp-avx512-cuda-11-7 (newer Intel/AMD CPUs)

For CPU only (no compatible GPU):

  • llama.cpp-avx2 (most modern CPUs)
  • llama.cpp-avx512 (newer Intel/AMD CPUs)
  • llama.cpp-avx (older CPUs)
  • llama.cpp-noavx (very old CPUs)

For other GPUs:

  • llama.cpp-vulkan (AMD, Intel Arc, some others)
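
If you’re unsure which CPU variant applies to you, you can check which instruction sets your processor reports. Here is a minimal sketch for Linux (it reads /proc/cpuinfo; on macOS or Windows, use a tool like CPU-Z or sysctl instead):

```python
# Rough check of which llama.cpp CPU backend your processor supports.
# Linux-only: reads the "flags" line from /proc/cpuinfo.

def supported_backend() -> str:
    flags = set()
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags.update(line.split(":", 1)[1].split())
                break
    # Backend names mirror the list above, from newest to oldest.
    if "avx512f" in flags:
        return "llama.cpp-avx512"
    if "avx2" in flags:
        return "llama.cpp-avx2"
    if "avx" in flags:
        return "llama.cpp-avx"
    return "llama.cpp-noavx"

print(supported_backend())
```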

Performance settings control how efficiently models run:

| Setting | What It Does | Recommended Value | Impact |
| --- | --- | --- | --- |
| Continuous Batching | Process multiple requests at once | Enabled | Faster when using multiple tools or having multiple conversations |
| Parallel Operations | How many requests to handle simultaneously | 4 | Higher = more multitasking, but uses more memory |
| CPU Threads | How many processor cores to use | Auto-detected | More threads can speed up CPU processing |
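
To see Continuous Batching and Parallel Operations in action, you can fire several requests at the local server at once. A minimal sketch, assuming Jan’s OpenAI-compatible local API server is enabled at http://localhost:1337/v1 and the model ID matches one you have downloaded (both the URL and the model name are assumptions; adjust to your setup):

```python
# Send 4 concurrent requests; with Continuous Batching enabled and
# Parallel Operations = 4, they should overlap instead of queueing.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:1337/v1/chat/completions"  # assumption: local Jan endpoint
MODEL = "your-model-id"                            # assumption: replace with your model

def ask(prompt: str) -> float:
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start

prompts = [f"Give me one fact about llamas (#{i})." for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:  # matches Parallel Operations = 4
    for latency in pool.map(ask, prompts):
        print(f"request finished in {latency:.1f}s")
```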

Memory settings control how models use your computer’s memory:

| Setting | What It Does | Recommended Value | When to Change |
| --- | --- | --- | --- |
| Flash Attention | More efficient memory usage | Enabled | Leave enabled unless you have problems |
| Caching | Remember recent conversations | Enabled | Speeds up follow-up questions |
| KV Cache Type | Memory precision trade-off | f16 | Change to q8_0 or q4_0 if running out of memory |
| mmap | Load models more efficiently | Enabled | Helps with large models |
| Context Shift | Handle very long conversations | Disabled | Enable for very long chats or multiple tool calls |

KV Cache Type options:

  • f16: Most stable, uses more memory
  • q8_0: Balanced memory usage and quality
  • q4_0: Uses least memory, slight quality loss
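
To see why this matters, here is a back-of-the-envelope estimate of KV cache size for each cache type. The model shape below (32 layers, 8 KV heads, head dimension 128) is an assumption roughly matching an 8B Llama-style model, and the per-element sizes approximate llama.cpp’s f16, q8_0, and q4_0 block formats:

```python
# KV cache holds 2 tensors (K and V) per layer, each with
# n_kv_heads * head_dim values per token of context.
n_layers, n_kv_heads, head_dim = 32, 8, 128  # assumption: 8B-class model
ctx = 8192                                   # context size in tokens

# Approximate bytes per element, including quantization block overhead.
bytes_per_elt = {"f16": 2.0, "q8_0": 1.0625, "q4_0": 0.5625}

for cache_type, b in bytes_per_elt.items():
    size = 2 * n_layers * n_kv_heads * head_dim * ctx * b
    print(f"{cache_type}: {size / 2**30:.2f} GiB")
```

In this example, dropping from f16 to q8_0 roughly halves the cache from about 1 GiB to about 0.5 GiB at an 8K context, which is why it’s the first thing to try when memory is tight.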

Models won’t load:

  • Try a different backend (switch from CUDA to CPU or vice versa)
  • Check if you have enough RAM/VRAM
  • Update to latest engine version

Very slow performance:

  • Make sure you’re using GPU acceleration (CUDA/Metal/Vulkan backend)
  • Increase GPU Layers in model settings
  • Close other memory-intensive programs

Out of memory errors:

  • Reduce Context Size in model settings
  • Switch KV Cache Type to q8_0 or q4_0
  • Try a smaller model variant

Random crashes:

  • Switch to a more stable backend (try avx instead of avx2)
  • Disable overclocking if you have it enabled
  • Update graphics drivers

For most users:

  1. Use the default backend that Jan installs
  2. Leave all performance settings at defaults
  3. Only adjust if you experience problems

If you have an NVIDIA graphics card:

  1. Download the appropriate CUDA backend
  2. Make sure GPU Layers is set high in model settings
  3. Enable Flash Attention

If models are too slow:

  1. Check you’re using GPU acceleration (a quick way to measure this is sketched after this list)
  2. Try enabling Continuous Batching
  3. Close other applications using memory
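
One quick way to confirm acceleration is working is to measure generation speed before and after a settings change. A rough sketch against the same assumed local endpoint as above (URL and model name are placeholders; a sudden drop in tokens per second usually means the model fell back to CPU):

```python
# Rough tokens-per-second measurement against the local server.
import json
import time
import urllib.request

URL = "http://localhost:1337/v1/chat/completions"  # assumption: local Jan endpoint
body = json.dumps({
    "model": "your-model-id",                      # assumption: replace with your model
    "messages": [{"role": "user", "content": "Write 200 words about llamas."}],
    "max_tokens": 200,
}).encode()
req = urllib.request.Request(
    URL, data=body, headers={"Content-Type": "application/json"})

start = time.perf_counter()
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)
elapsed = time.perf_counter() - start

# Most OpenAI-compatible servers report token counts under "usage".
tokens = data.get("usage", {}).get("completion_tokens", 0)
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```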

If running out of memory:

  1. Change KV Cache Type to q8_0
  2. Reduce Context Size in model settings
  3. Try a smaller model