# Undercurrent Kit: local-inference

> Generated 2026-05-14 | Status: rising | Emergence: 13.9
> Platforms: reddit

## What's Happening

"local-inference" has 7 mentions across 1 platform(s) in the last 7 days.
Velocity: 7.0x week-over-week.

## Key Signals

### 24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)
- **Source:** reddit | [link](https://reddit.com/r/LocalLLaMA/comments/1tcc7h5/24_toks_from_30b_moe_models_on_an_old_gtx_1080_8/)
- **Author:** mdda ([profile](https://reddit.com/u/mdda))
- **Scores:** originality:75% depth:82% psychosis:15%
- **Engagement:** 53 upvotes, 14 comments, 0 stars
- **Summary:** Practitioner demonstrates specific, reproducible performance tuning for running 30B+ MoE models on commodity 8GB GPUs using TurboQuant/RotorQuant KV cache compression and CPU offloading, achieving 20-24 tok/s with 128k context window.

**Implementation:**
Use llama.cpp with TurboQuant (K) and RotorQuant/turbo3 (V) KV cache quantization flags to fit 128k context in 8GB VRAM. Key parameters: --n-cpu-moe (offload MoE experts to CPU, typically 20-30 depending on model), --flash-attn for attention optimization, and Q4_K_M quantization for base weights. Start with Qwen 3.6 35B-A3B or Gemma 4 26B-A4B (the A3B/A4B suffixes denote small active-parameter counts, which keeps the CPU-offloaded expert pass fast). Benchmark on target hardware to find the optimal --n-cpu-moe threshold balancing CPU/GPU utilization, and monitor token throughput across different K/V quantization pairs (turbo4/turbo3, etc.).
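
A minimal launch sketch in Python, assuming the post's TurboQuant/RotorQuant cache types are exposed through llama.cpp's standard `--cache-type-k`/`--cache-type-v` flags; the model filename and the `--n-cpu-moe` value of 24 are illustrative placeholders, not values confirmed by the post:

```python
import subprocess

# Launch llama-server with the settings described in the post. The turbo4/turbo3
# cache-type values come from the post and are assumed (not verified) to be valid
# --cache-type-* options; the model path and expert-offload count are placeholders.
cmd = [
    "./llama-server",
    "-m", "qwen3.6-35b-a3b-Q4_K_M.gguf",  # Q4_K_M base weights (illustrative path)
    "--ctx-size", "131072",               # 128k context window
    "--flash-attn",                       # flash attention, per the post
    "--n-cpu-moe", "24",                  # offload MoE experts to CPU; post suggests 20-30
    "--cache-type-k", "turbo4",           # K-cache quantization named in the post
    "--cache-type-v", "turbo3",           # V-cache quantization named in the post
]
subprocess.run(cmd, check=True)
```

Sweep `--n-cpu-moe` and the K/V pair while watching tok/s to find the balance point on your own hardware.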

### Playing One Night Werewolf (Gemma4 & Qwen3.6)
- **Source:** reddit | [link](https://reddit.com/r/LocalLLaMA/comments/1tcjtmt/playing_one_night_werewolf_gemma4_qwen36/)
- **Author:** Some-Cauliflower4902 ([profile](https://reddit.com/u/Some-Cauliflower4902))
- **Scores:** originality:75% depth:65% psychosis:20%
- **Engagement:** 8 upvotes, 1 comment, 0 stars
- **Summary:** Practitioner successfully orchestrates multiple quantized local LLMs as game agents in One Night Werewolf, sharing concrete implementation details around model selection, quantization strategies, and reasoning suppression.

**Implementation:**
Use llama.cpp with model switching capability to run multiple quantized models (Gemma 4 31B Q4/26B Q5, Qwen 3.6 27B Q5/35B Q4) as game agents. Assign each model a role (werewolf, seer, villager, troublemaker) via role cards, suppress extended thinking outputs on Qwen models via inference settings to prevent token waste, and maintain separate observation logs per agent. The builder custom-coded a UI for seamless mid-chat model switching; replicate using llama.cpp's server mode with batched requests or build on top of llama-cpp-python bindings for role-specific prompting and state management across agents.
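
A sketch of that multi-agent loop under stated assumptions: each quantized model runs behind its own llama.cpp server exposing the OpenAI-compatible `/v1/chat/completions` route, the ports and role assignments are invented for illustration, and thinking suppression is approximated with a system instruction since the post doesn't name the exact inference setting it used:

```python
import requests

# Hypothetical roster: endpoints, ports, and role cards are illustrative.
AGENTS = {
    "gemma-villager": {"url": "http://localhost:8081/v1/chat/completions", "role": "villager"},
    "qwen-seer":      {"url": "http://localhost:8082/v1/chat/completions", "role": "seer"},
    "qwen-werewolf":  {"url": "http://localhost:8083/v1/chat/completions", "role": "werewolf"},
}

# Separate observation log per agent so private night-phase information never leaks.
logs = {name: [] for name in AGENTS}

def ask(name: str, event: str) -> str:
    agent = AGENTS[name]
    logs[name].append({"role": "user", "content": event})
    system = (
        f"You are playing One Night Werewolf as the {agent['role']}. "
        "Reply in one or two sentences; do not show hidden reasoning."  # crude thinking suppression
    )
    resp = requests.post(agent["url"], json={
        "messages": [{"role": "system", "content": system}, *logs[name]],
        "temperature": 0.8,
    })
    reply = resp.json()["choices"][0]["message"]["content"]
    logs[name].append({"role": "assistant", "content": reply})
    return reply

for name in AGENTS:
    print(name, "->", ask(name, "Night phase is over. Make your opening claim."))
```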

### Where do you personally draw the line with AI access, read-only, file edits, running commands, browsing for you? Why there?
- **Source:** reddit | [link](https://reddit.com/r/ClaudeAI/comments/1tcflgd/where_do_you_personally_draw_the_line_with_ai/)
- **Author:** CarolusX74 ([profile](https://reddit.com/u/CarolusX74))
- **Scores:** originality:72% depth:35% psychosis:15%
- **Engagement:** 8 upvotes, 51 comments, 0 stars
- **Summary:** Practitioner-driven discussion of trust boundaries in agentic AI systems, grounded in real friction points rather than hype—identifies a genuine open question about where developers should gate AI capabilities.

**Implementation:**
The post doesn't prescribe a specific implementation, but frames a decision tree for tooling: assess your risk tolerance at each autonomy level (read-only via APIs like Claude Files, direct edits via file-writing APIs, shell execution via process managers like `subprocess` with strict allowlists, web browsing via tools like Playwright, or full agentic loops via frameworks like LangGraph or AutoGPT-style patterns). Practitioners should prototype at the lowest required capability level, implement audit logs for file/command changes, use sandboxing (Docker, VMs, or OS-level process isolation) before enabling shell access, and establish human checkpoints before irreversible operations. The discussion itself is the value—reading the 51 comments will surface domain-specific risk models from Android devs, backend engineers, and security-conscious users.
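
The thread prescribes no code, but the gating pattern it converges on (allowlists, audit logs, human checkpoints) fits in a few lines. A minimal sketch, assuming a shell-executing agent; the command lists and log filename are illustrative:

```python
import json
import shlex
import subprocess
import time

ALLOWLIST = {"ls", "cat", "grep", "git"}   # commands the agent may run at all
CHECKPOINT = ("git push", "git reset")     # allowed only after human sign-off
AUDIT_LOG = "agent_audit.jsonl"            # append-only action trail

def run_gated(command: str) -> str:
    argv = shlex.split(command)
    if argv[0] not in ALLOWLIST:
        raise PermissionError(f"{argv[0]} is not allowlisted")
    if any(command.startswith(prefix) for prefix in CHECKPOINT):
        if input(f"Agent wants to run {command!r}. Allow? [y/N] ").strip().lower() != "y":
            raise PermissionError("human checkpoint declined")
    with open(AUDIT_LOG, "a") as log:      # log before execution, never after
        log.write(json.dumps({"ts": time.time(), "cmd": command}) + "\n")
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout
```

Wrapping the same gate around file writes, and running the loop inside a container, covers the next autonomy levels the thread discusses.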

### Open-source, self-updating wiki for your codebase
- **Source:** reddit | [link](https://reddit.com/r/ClaudeAI/comments/1tcjv9b/opensource_selfupdating_wiki_for_your_codebase/)
- **Author:** ElectronicUnit6303 ([profile](https://reddit.com/u/ElectronicUnit6303))
- **Scores:** originality:72% depth:35% psychosis:25%
- **Engagement:** 38 upvotes, 1 comment, 0 stars
- **Summary:** Developer built Almanac, a self-updating markdown wiki that extracts institutional knowledge from conversations with Claude Code and repo structure to reduce context re-explanation overhead for coding agents.

**Implementation:**
Almanac appears to work by parsing Claude Code/Codex conversations and repo files to automatically generate and maintain markdown documentation in the repository root. To implement or extend this: (1) establish a hook or API integration with Claude Code conversations to capture context explanations, (2) build a parser that extracts rationale patterns ("we tried X but backed it out because Y") into structured wiki entries, (3) run periodic scans of codebase structure and git history to auto-populate architectural decisions, (4) store the wiki as markdown so it's version-controllable and readable. The tool likely feeds this self-maintained wiki back into agent prompts as context, reducing token waste on repetitive explanations. Specific implementation details (whether it uses AST parsing, LLM-based extraction, or pattern matching) are not provided; check the GitHub repo for architecture choices around markdown generation and conversation integration.
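
Since the post doesn't document the extraction mechanism, here is one way step (2) could look as a plain pattern-matching pass over exported transcripts; the regexes, directory layout, and `ALMANAC.md` filename are all assumptions, and an LLM-based extractor could replace the regexes entirely:

```python
import re
from pathlib import Path

# Rationale patterns to lift out of conversation transcripts (illustrative).
PATTERNS = [
    re.compile(r"we (?:tried|used) (.+?) but (?:backed it out|reverted) because (.+?)[.\n]", re.I),
    re.compile(r"(?:decided|chose) (.+?) (?:over|instead of) (.+?) because (.+?)[.\n]", re.I),
]

def extract_rationale(transcript: str) -> list[str]:
    entries = []
    for pattern in PATTERNS:
        for match in pattern.finditer(transcript):
            entries.append(" / ".join(match.groups()))
    return entries

def update_wiki(conversations_dir: str, wiki_path: str = "ALMANAC.md") -> None:
    # Scan exported conversation logs and append new rationale entries as markdown.
    entries = []
    for path in Path(conversations_dir).glob("*.txt"):
        entries += extract_rationale(path.read_text())
    with open(wiki_path, "a") as wiki:
        for entry in entries:
            wiki.write(f"- {entry}\n")
```

Because the output is plain markdown, step (4) (version control) comes for free: commit `ALMANAC.md` alongside the code it documents.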

### we really all are going to make it, aren't we? 2x3090 setup.
- **Source:** reddit | [link](https://reddit.com/r/LocalLLaMA/comments/1tcf2dt/we_really_all_are_going_to_make_it_arent_we/)
- **Author:** RedShiftedTime ([profile](https://reddit.com/u/RedShiftedTime))
- **Scores:** originality:35% depth:45% psychosis:15%
- **Engagement:** 49 upvotes, 35 comments, 0 stars
- **Summary:** Practitioner documents tangible improvements in local LLM inference on dual RTX 3090 hardware using the club-3090 project after bug fixes, with specific setup comparisons but incomplete throughput data.

**Implementation:**
User reports running club-3090 (GitHub: noonghunna/club-3090) on a dual RTX 3090 setup under WSL2, with improved results after patching. To replicate: clone the club-3090 repo, apply the Sonnet-patched fixes for SSE session drops and tool-calling bugs, and benchmark against an LM Studio baseline. The user mentions tokens/second metrics but the post cuts off; check the GitHub repo for supported models, exact setup requirements (bare-metal vs. WSL2 tradeoffs), and VRAM constraints. Relevant for hobbyists targeting 24 GB × 2 = 48 GB total VRAM budgets.
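
To produce the missing tokens/second comparison yourself, a small timing harness against any OpenAI-compatible endpoint works for both sides of the benchmark, since LM Studio and llama.cpp-style servers expose the same route; the URL, prompt, and the assumption that the server returns a `usage` block are illustrative:

```python
import time
import requests

def tokens_per_second(url: str, prompt: str, max_tokens: int = 256) -> float:
    """Time one completion and derive tok/s from the server-reported usage."""
    start = time.perf_counter()
    resp = requests.post(url, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).json()
    elapsed = time.perf_counter() - start
    return resp["usage"]["completion_tokens"] / elapsed

# Port 1234 is LM Studio's default; point at each backend in turn to compare.
print(tokens_per_second("http://localhost:1234/v1/chat/completions",
                        "Summarize the llama.cpp project in one paragraph."))
```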

### running Qwen 3.6 35b A3B on 2x 5060TI
- **Source:** reddit | [link](https://reddit.com/r/LocalLLaMA/comments/1tch5ps/running_qwen_36_35b_a3b_on_2x_5060ti/)
- **Author:** chocofoxy ([profile](https://reddit.com/u/chocofoxy))
- **Scores:** originality:30% depth:60% psychosis:10%
- **Engagement:** 9 upvotes, 17 comments, 0 stars
- **Summary:** A practitioner shares real hardware constraints and throughput metrics for running a 35B model on consumer GPUs, asking for optimization advice—genuinely practical but incremental knowledge.

**Implementation:**
Someone attempting this setup should: (1) Benchmark quantization levels (Q4 vs Q6 vs Q8) using LM Studio or similar inference engines to find the throughput/quality tradeoff on 2x 5060 Ti (32GB total); (2) Investigate memory optimization flags in the inference engine (layer splitting, KV-cache quantization, batch optimization) to squeeze throughput beyond 90 t/s; (3) For cooling, consider aftermarket GPU coolers or vertical orientation with airflow management between cards—the stacked configuration will throttle the top GPU without active cooling or ventilation gaps. Test with tools like `nvidia-smi` to monitor thermal throttling during inference.
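
For the thermal check in step (3), a small poller over `nvidia-smi` query fields makes top-card throttling visible during a benchmark run; the 83 C threshold is a rough consumer-GPU limit, not a value measured in the post:

```python
import subprocess
import time

# Poll per-GPU temperature and utilization while an inference benchmark runs.
QUERY = ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,utilization.gpu",
         "--format=csv,noheader,nounits"]

while True:  # Ctrl-C to stop
    for line in subprocess.check_output(QUERY, text=True).strip().splitlines():
        index, temp, util = (field.strip() for field in line.split(","))
        flag = "  <-- likely throttling" if int(temp) >= 83 else ""
        print(f"GPU {index}: {temp}C, {util}% util{flag}")
    time.sleep(5)
```

In a stacked two-card layout, a persistent temperature gap between GPU 0 and GPU 1 under identical load is the signal that the top card needs a ventilation gap or active cooling.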

### New Linux user, need help compiling llamacpp
- **Source:** reddit | [link](https://reddit.com/r/LocalLLaMA/comments/1tcce4k/new_linux_user_need_help_compiling_llamacpp/)
- **Author:** Spiderboyz1 ([profile](https://reddit.com/u/Spiderboyz1))
- **Scores:** originality:20% depth:30% psychosis:10%
- **Engagement:** 8 upvotes, 31 comments, 0 stars
- **Summary:** A genuine practitioner question from someone transitioning to Linux and attempting to compile llama.cpp with multi-GPU setup; demonstrates real friction point in local LLM deployment but lacks depth as a content artifact.

**Implementation:**
User should clone the llama.cpp repo (https://github.com/ggerganov/llama.cpp), verify CUDA/ROCm headers are installed (on Arch-based CachyOS the NVIDIA toolkit typically ships as the `cuda` package rather than Ubuntu's nvidia-cuda-toolkit), then build with CMake with the GPU backend enabled (e.g., `cmake -B build -DGGML_CUDA=ON`, optionally pinning `-DCMAKE_CUDA_ARCHITECTURES=89` for the 4070S's sm_89 target; the old Makefile path is deprecated upstream), and benchmark with their GGUF models. Key friction: CachyOS may need manual driver/toolkit setup, and multi-GPU inference in llama.cpp is handled by splitting layers or tensors across cards via `--split-mode` and `--tensor-split`. The community would likely recommend the pre-built releases first (https://github.com/ggerganov/llama.cpp/releases) before tackling compilation.
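
A minimal build sketch matching that advice, using llama.cpp's standard CMake flow with the CUDA backend; the `-j 8` job count is arbitrary and the architecture pin is optional:

```python
import subprocess

# Clone llama.cpp and build it with the CUDA backend enabled.
steps = [
    ["git", "clone", "https://github.com/ggerganov/llama.cpp"],
    ["cmake", "-S", "llama.cpp", "-B", "llama.cpp/build",
     "-DGGML_CUDA=ON", "-DCMAKE_CUDA_ARCHITECTURES=89"],
    ["cmake", "--build", "llama.cpp/build", "--config", "Release", "-j", "8"],
]
for step in steps:
    subprocess.run(step, check=True)
```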

## People to Watch

- **mdda**
- **Spiderboyz1**
- **RedShiftedTime**
- **CarolusX74**
- **chocofoxy**
- **Some-Cauliflower4902**
- **ElectronicUnit6303**

## Prompt: Hand This to Claude Code

```
I want to explore "local-inference" based on these emerging signals from the AI community.

Key implementations I've found:
- 24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context): Use llama.cpp with TurboQuant (K) and RotorQuant/turbo3 (V) KV cache quantization flags to fit 128k context in 8GB VRAM. Key parameters: --n-cpu-moe (offload MoE experts to CPU, typically 20-30 depend...
- Playing One Night Werewolf (Gemma4 & Qwen3.6): Use llama.cpp with model switching capability to run multiple quantized models (Gemma 4 31B Q4/26B Q5, Qwen 3.6 27B Q5/35B Q4) as game agents. Assign each model a role (werewolf, seer, villager, troub...
- Where do you personally draw the line with AI access, read-only, file edits, running commands, browsing for you? Why there?: The post doesn't prescribe a specific implementation, but frames a decision tree for tooling: assess your risk tolerance at each autonomy level (read-only via APIs like Claude Files, direct edits via ...
- Open-source, self-updating wiki for your codebase: Almanac appears to work by parsing Claude Code/Codex conversations and repo files to automatically generate and maintain markdown documentation in the repository root. To implement or extend this: (1)...
- we really all are going to make it, aren't we? 2x3090 setup.: User reports running club-3090 (GitHub: noonghunna/club-3090) on a dual RTX 3090 setup under WSL2, with improved results after patching. To replicate: clone the club-3090 repo, apply the Sonnet-patched fix...

Help me:
1. Assess which of these approaches fits my current stack
2. Build a minimal working prototype of the most promising one
3. Identify what these practitioners learned that isn't in the docs yet
```

---
*Kit generated by [Undercurrent](https://undercurrent-dashboard.pages.dev). Trend detection is people detection.*
