# Undercurrent Kit: llama-cpp

> Generated 2026-05-14 | Status: rising | Emergence: 6.0
> Platforms: reddit

## What's Happening

"llama-cpp" has 3 mentions across 1 platform(s) in the last 7 days.
Velocity: 3.0x week-over-week.

## Key Signals

### 24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)
- **Source:** reddit | [link](https://reddit.com/r/LocalLLaMA/comments/1tcc7h5/24_toks_from_30b_moe_models_on_an_old_gtx_1080_8/)
- **Author:** mdda ([profile](https://reddit.com/u/mdda))
- **Scores:** originality:75% depth:82% psychosis:15%
- **Engagement:** 53 upvotes, 14 comments, 0 stars
- **Summary:** Practitioner demonstrates specific, reproducible performance tuning for running 30B+ MoE models on commodity 8 GB GPUs using TurboQuant/RotorQuant KV cache compression and CPU offloading, achieving 20-24 tok/s with a 128k context window.

**Implementation:**
Use llama.cpp with TurboQuant (K) and RotorQuant/turbo3 (V) KV cache quantization flags to fit 128k context in 8 GB VRAM. Key parameters: --n-cpu-moe (offload MoE experts to CPU, typically 20-30 layers depending on the model), --flash-attn for attention optimization, and Q4_K_M quantization for the base weights. Start with Qwen 3.6 35B-A3B or Gemma 4 26B-A4B (the A3B/A4B variants activate only a few billion parameters per token, which is what makes CPU offload of the experts viable). Benchmark on the target hardware to find the --n-cpu-moe threshold that balances CPU and GPU utilization, and compare token throughput across K/V quantization pairs (turbo4/turbo3, etc.).
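A minimal command-line sketch of that setup, assuming a recent llama.cpp build. The model filename and expert-offload count are placeholders, and the stock q8_0/q4_0 KV-cache quants stand in for the post's TurboQuant/RotorQuant names, which would only exist in a build that ships them:

```bash
# Placeholder model path and offload count; tune --n-cpu-moe between ~20 and 30
# and swap the cache types for the post's K/V quants if your build includes them.
./llama-server \
  --model ./models/qwen3.6-35b-a3b-Q4_K_M.gguf \
  --ctx-size 131072 \
  --flash-attn on \
  --n-gpu-layers 999 \
  --n-cpu-moe 24 \
  --cache-type-k q8_0 \
  --cache-type-v q4_0
```

Watch VRAM with `nvidia-smi` while lowering or raising --n-cpu-moe until the dense layers plus 128k of KV cache fit in 8 GB; note that older builds take `--flash-attn` as a bare flag rather than with a value.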

### Playing One Night Werewolf (Gemma4 & Qwen3.6)
- **Source:** reddit | [link](https://reddit.com/r/LocalLLaMA/comments/1tcjtmt/playing_one_night_werewolf_gemma4_qwen36/)
- **Author:** Some-Cauliflower4902 ([profile](https://reddit.com/u/Some-Cauliflower4902))
- **Scores:** originality:75% depth:65% psychosis:20%
- **Engagement:** 8 upvotes, 1 comment, 0 stars
- **Summary:** Practitioner successfully orchestrates multiple quantized local LLMs as game agents in One Night Werewolf, sharing concrete implementation details around model selection, quantization strategies, and reasoning suppression.

**Implementation:**
Use llama.cpp with model-switching capability to run multiple quantized models (Gemma 4 31B Q4/26B Q5, Qwen 3.6 27B Q5/35B Q4) as game agents. Assign each model a role (werewolf, seer, villager, troublemaker) via role cards, suppress extended-thinking output on the Qwen models via inference settings to avoid wasting tokens, and maintain a separate observation log per agent. The builder custom-coded a UI for seamless mid-chat model switching; to replicate it, use llama.cpp's server mode with batched requests or build on the llama-cpp-python bindings for role-specific prompting and per-agent state management.
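A rough sketch of the server-mode route, assuming the OpenAI-compatible llama-server endpoint: the ports, model filenames, and the `/no_think` hint (a Qwen-family convention for suppressing thinking) are assumptions, not the builder's actual setup.

```bash
# One llama-server per model; each agent's role card goes in its system prompt.
# Filenames and ports are placeholders.
./llama-server --model ./models/gemma4-26b-Q5_K_M.gguf --port 8081 &
./llama-server --model ./models/qwen3.6-27b-Q5_K_M.gguf --port 8082 &

# Deal a role to the Qwen agent and ask for its opening statement.
curl -s http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system",
           "content": "/no_think You are the Seer in One Night Werewolf. Stay in character; never quote your role card verbatim."},
          {"role": "user",
           "content": "Night phase is over. Give your opening statement to the village."}
        ],
        "temperature": 0.8
      }'
```

A thin orchestrator (even a shell loop) can append each reply to the per-agent observation log and feed the running transcript back as the next user turn.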

### New Linux user, need help compiling llamacpp
- **Source:** reddit | [link](https://reddit.com/r/LocalLLaMA/comments/1tcce4k/new_linux_user_need_help_compiling_llamacpp/)
- **Author:** Spiderboyz1 ([profile](https://reddit.com/u/Spiderboyz1))
- **Scores:** originality:20% depth:30% psychosis:10%
- **Engagement:** 8 upvotes, 31 comments, 0 stars
- **Summary:** A genuine practitioner question from someone transitioning to Linux and attempting to compile llama.cpp with a multi-GPU setup; it demonstrates a real friction point in local LLM deployment but lacks depth as a content artifact.

**Implementation:**
User should clone the llama.cpp repo (https://github.com/ggerganov/llama.cpp), verify the CUDA/ROCm toolkit and headers are installed (on Arch-based CachyOS this is the `cuda` package rather than Debian's `nvidia-cuda-toolkit`), build with CMake with the CUDA backend enabled and the GPU architecture set (compute capability 8.9 for a 4070 Super), then benchmark with their GGUF models. Key friction: CachyOS may need manual driver/toolkit setup, and multi-GPU use in llama.cpp means either splitting the model across cards or falling back to a single GPU, depending on the use case. The community would likely recommend trying the pre-built releases (https://github.com/ggerganov/llama.cpp/releases) before tackling compilation.
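A hedged build sketch for that path, assuming an Arch-based install; the package names and the architecture value are guesses to verify against the official build docs (89 is the Ada compute capability of a 4070 Super):

```bash
# Install the CUDA toolkit and build tools (Arch/CachyOS package names assumed).
sudo pacman -S --needed cuda cmake gcc git

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j "$(nproc)"

# Quick sanity check against an existing GGUF model (path is a placeholder).
./build/bin/llama-cli -m /path/to/model.gguf -ngl 999 -p "Hello"
```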

## People to Watch

- **mdda**
- **Spiderboyz1**
- **Some-Cauliflower4902**

## Prompt: Hand This to Claude Code

```
I want to explore "llama-cpp" based on these emerging signals from the AI community.

Key implementations I've found:
- 24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context): Use llama.cpp with TurboQuant (K) and RotorQuant/turbo3 (V) KV cache quantization flags to fit 128k context in 8GB VRAM. Key parameters: --n-cpu-moe (offload MoE experts to CPU, typically 20-30 depend...
- Playing One Night Werewolf (Gemma4 & Qwen3.6): Use llama.cpp with model switching capability to run multiple quantized models (Gemma 4 31B Q4/26B Q5, Qwen 3.6 27B Q5/35B Q4) as game agents. Assign each model a role (werewolf, seer, villager, troub...
- New Linux user, need help compiling llamacpp: User should clone llama.cpp repo (https://github.com/ggerganov/llama.cpp), verify CUDA/ROCm headers are installed (nvidia-cuda-toolkit on CachyOS), build with `make` or cmake specifying GPU targets (e...

Help me:
1. Assess which of these approaches fits my current stack
2. Build a minimal working prototype of the most promising one
3. Identify what these practitioners learned that isn't in the docs yet
```

---
*Kit generated by [Undercurrent](https://undercurrent-dashboard.pages.dev). Trend detection is people detection.*
