Running Qwen3.6-35B-A3B on RTX 3090 24GB — Real Use Cases for the 3B-Active MoE (2026)
Qwen3.6-35B-A3B (April 2026 release) puts a 35B-parameter MoE model on a single RTX 3090 24GB at usable speed thanks to its 3B active parameters and Apache 2.0 license. Practical use cases — agentic coding (SWE-bench 73.4), 262K context document analysis, vision-language tasks, and tool calling — with realistic VRAM math, expected throughput, and where the model genuinely outperforms 8B alternatives.
The 35B Model That Fits in 24GB
Qwen released Qwen3.6-35B-A3B in April 2026. The specs are unusual:
- 35B total parameters (impressive on paper)
- 3B active parameters per token (MoE — 256 experts, 8 routed + 1 shared activated)
- 262K native context (extensible to 1M via YaRN)
- Vision-language capable (multimodal)
- Apache 2.0 (commercial-friendly)
- 391 community quantizations across llama.cpp, Ollama, LM Studio, Jan
The combination that matters for RTX 3090 owners: the 35B weight count determines VRAM (≈21 GB at Q4_K_M, fits in 24 GB), but only 3B parameters compute per token, so inference runs at MoE-class speed — much faster than a dense 30B model.
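The weights figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes approximate average bits-per-weight values for a few GGUF quant types. These are assumptions, because real files keep some tensors at higher precision. The name `weights_gb` is illustrative and is not part of any tool.

```python
# Approximate average bits per weight for some GGUF quant types.
# Assumed values: actual files vary because tensors mix precisions.
APPROX_BPW = {
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q4_K_M": 4.85,
    "IQ4_XS": 4.25,
}


def weights_gb(total_params: float, bits_per_weight: float) -> float:
    """Weight footprint in GB (10^9 bytes): params * bits / 8."""
    return total_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # VRAM follows the 35B total, not the 3B active count.
    for quant, bpw in APPROX_BPW.items():
        print(f"{quant:7s} ~{weights_gb(35e9, bpw):.1f} GB")
```

Under these assumptions the estimates land close to the weights column in the VRAM table (about 21 GB for Q4_K_M and 37 GB for Q8_0).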
This guide takes a practical look at what that enables on a single RTX 3090: use cases where an 8B model isn't enough but you can't justify a multi-GPU rig. It is organized around what the model is genuinely good at per the published benchmarks (SWE-bench Verified 73.4, MMLU-Pro 85.2, AIME 2026 92.7, MMMU 81.7), translated into real workflows.
For the general RTX 3090 model comparison context, see Best Ollama Models for RTX 3090 24GB in 2026: Real Benchmarks.
VRAM Math for RTX 3090 24GB
Approximate footprint for Qwen3.6-35B-A3B per quantization. The first column is model weights only; the next two add the KV cache at 8K and 32K context:
| Quantization | Weights size | + 8K context KV | + 32K context KV | Single 3090 24GB? |
|---|---|---|---|---|
| Q8_0 | 37 GB | 39 GB | 45 GB | ❌ OOM |
| Q6_K | 28 GB | 30 GB | 36 GB | ❌ OOM |
| Q5_K_M | 25 GB | 27 GB | 32 GB | ❌ OOM |
| Q4_K_M | 21 GB | 22.5 GB | 27 GB (over 24 GB) | ✅ for ≤16K ctx |
| IQ4_XS | 18 GB | 19.5 GB | 23.5 GB | ✅ comfortable to 32K |
| Q4_K_S | 19 GB | 20.5 GB | 24.5 GB | ⚠️ slightly over 24 GB at 32K |
| Q3_K_M | 16 GB | 17.5 GB | 21 GB | ✅ long context room |
| IQ3_M | 14 GB | 15.5 GB | 18.5 GB | ✅ 64K context viable |
The practical sweet spots for RTX 3090 24GB:
- IQ4_XS for general use — best quality fit with 16-32K context
- Q4_K_M if quality matters more than context length — limit to ≤16K context
- IQ3_M for long-context workflows — 64K+ context becomes feasible
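As a quick check of those picks, here is a minimal sketch that tests the table's own estimates against a 24 GB budget. The 0.5 GB headroom for the desktop and CUDA context is an assumption, and `fits` is a hypothetical helper name.

```python
GPU_VRAM_GB = 24.0

# (weights, weights + 8K KV, weights + 32K KV) in GB, copied from the table.
FOOTPRINT_GB = {
    "Q4_K_M": (21.0, 22.5, 27.0),
    "IQ4_XS": (18.0, 19.5, 23.5),
    "Q4_K_S": (19.0, 20.5, 24.5),
    "Q3_K_M": (16.0, 17.5, 21.0),
    "IQ3_M": (14.0, 15.5, 18.5),
}


def fits(quant: str, context: str, headroom_gb: float = 0.5) -> bool:
    """True if the table footprint plus headroom stays within 24 GB."""
    column = {"8k": 1, "32k": 2}[context]
    return FOOTPRINT_GB[quant][column] + headroom_gb <= GPU_VRAM_GB


if __name__ == "__main__":
    for quant in FOOTPRINT_GB:
        print(quant, {ctx: fits(quant, ctx) for ctx in ("8k", "32k")})
```

With these numbers IQ4_XS only just clears 32K, and Q4_K_M clears 8K but not 32K.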
For the broader quantization tradeoff discussion, see GGUF Quantization Showdown — Q4_K_M vs Q4_K_S vs IQ4_XS vs Q5_K_M.
Expected Throughput on RTX 3090
Because Qwen3.6-35B-A3B is MoE with only 3B active parameters per token, generation speed is closer to a 3B dense model than a 35B dense model. Expected ballpark on RTX 3090 24GB:
| Quant + Context | Expected tokens/sec |
|---|---|
| IQ4_XS @ 8K | 45-65 |
| Q4_K_M @ 8K | 45-60 |
| IQ4_XS @ 32K (well into context) | 30-45 |
| IQ3_M @ 64K (well into context) | 20-35 |
For comparison:
- Llama 3.1 8B Q4_K_M on RTX 3090: ~95 t/s (dense 8B)
- Llama 3.1 70B Q4_K_M (split GPUs): ~10-15 t/s (dense 70B)
- Mixtral 8×7B Q4_K_M on RTX 3090: ~50-65 t/s (similar MoE class)
The MoE 35B (3B active) lands in roughly the same range as Mixtral and well above a dense 70B, which is much faster than a dense model of comparable total size would be.
Caveat on numbers: these are estimates based on architecture math and the Mixtral 8×7B precedent, not measurements. Cross-check them against community results on r/LocalLLaMA and Hugging Face discussions for your exact quant and llama.cpp version.
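To replace these estimates with your own measurements, you can use the timing fields in the final (non-streamed) reply from Ollama's `/api/generate`: `eval_count`, `eval_duration`, `prompt_eval_count` and `prompt_eval_duration`, with durations in nanoseconds. The sketch below only does the arithmetic. The helper name and the sample reply are illustrative, and the sample is not a benchmark result.

```python
def tokens_per_second(response: dict) -> dict:
    """Compute prompt and generation speed from an Ollama /api/generate reply.

    Ollama reports durations in nanoseconds.
    """
    ns_per_s = 1e9
    generation = response["eval_count"] / (response["eval_duration"] / ns_per_s)
    prompt_ns = response.get("prompt_eval_duration", 0)
    prompt = (
        response.get("prompt_eval_count", 0) / (prompt_ns / ns_per_s)
        if prompt_ns
        else None
    )
    return {"generation_tps": generation, "prompt_tps": prompt}


# Made-up reply for illustration only (not a measurement).
sample = {
    "eval_count": 512,
    "eval_duration": 10_000_000_000,
    "prompt_eval_count": 2048,
    "prompt_eval_duration": 2_000_000_000,
}
print(tokens_per_second(sample))
```

In practice, pass in the parsed JSON of a request sent with `"stream": False`, and run it at several context depths, since generation slows as the context fills.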
Real Use Cases — Where 35B-A3B Actually Earns Its VRAM
Use Case 1 — Agentic Coding (SWE-bench Verified 73.4)
The published SWE-bench Verified score of 73.4 puts Qwen3.6-35B-A3B in the top tier of open coding models — competitive with much larger frontier models on real GitHub bug-fixing tasks.
For RTX 3090 + a local code agent (Aider, Continue, Cursor with local backend), this enables:
- Fixing real bugs in your codebase without sending code to a cloud provider
- Multi-file changes — the 262K context can hold a substantial codebase
- Iterative tool use — model + IDE + test runner in a loop, all local
Practical setup with Ollama + Continue (VS Code):
# Ollama
ollama pull qwen3.6:35b-a3b-iq4_xs

# In VS Code with Continue extension, ~/.continue/config.json:
{
  "models": [{
    "title": "Qwen3.6-35B-A3B Local",
    "provider": "ollama",
    "model": "qwen3.6:35b-a3b-iq4_xs",
    "contextLength": 32768,
    "completionOptions": { "temperature": 0.1 }
  }]
}
For agentic coding (Aider):
export OLLAMA_API_BASE=http://127.0.0.1:11434  # tell aider where Ollama listens
aider --model ollama/qwen3.6:35b-a3b-iq4_xs --architect
Realistic expectation: for typical 1-3 file bug fixes in a well-structured Python or TypeScript codebase, expect output quality comparable to GPT-4-class models. For deep architectural refactors across many files, even a SWE-bench Verified score of 73.4 leaves room for failure modes: keep tests green and commit small.
Use Case 2 — Long-Context Document Analysis (262K Native)
The 262K native context (with YaRN extending to 1M) is unusual for a model that fits on a single 24GB card. Realistic workflows:
- Annotated code review across a full repository: paste an entire mid-sized library's source, ask architectural questions
- Legal document analysis: contracts, regulatory filings (~150K tokens) entirely in context
- Scientific paper synthesis: 5-10 long papers (each 15-30K tokens) compared in one session
- Long-form RAG: instead of retrieval-augmented chunks, load the full source documents
VRAM reality for long context:
- IQ3_M + 64K context: the table puts 32K at ~18.5 GB, so 64K extrapolates to roughly 22-23 GB with f16 KV cache → fits, with little headroom
- IQ4_XS + 32K context: ~23.5 GB → fits but tight
- Q4_K_M + full 262K: not viable on single 24GB without context offload
For genuinely long context (64K and up), drop to IQ3_M and accept the slight quality reduction in exchange for the context budget. Because long context is native to the model's training, it should hold up at distant positions better than older models that degrade past 32K, but verify this on your own documents before relying on it.
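To see where each quant runs out of room, here is a rough sketch that extrapolates linearly from the 8K and 32K columns of the VRAM table. It assumes KV cache memory grows linearly with context, and it keeps 0.5 GB of headroom. Both are assumptions, and the result is only as good as the table's estimates.

```python
def max_context_tokens(
    total_8k_gb: float,
    total_32k_gb: float,
    vram_gb: float = 24.0,
    headroom_gb: float = 0.5,
) -> int:
    """Linearly extrapolate the largest context that fits, from two table points."""
    gb_per_token = (total_32k_gb - total_8k_gb) / (32_768 - 8_192)
    # Weights plus any context-independent overhead.
    fixed_gb = total_8k_gb - gb_per_token * 8_192
    return int((vram_gb - headroom_gb - fixed_gb) / gb_per_token)


if __name__ == "__main__":
    print("IQ3_M :", max_context_tokens(15.5, 18.5))
    print("IQ4_XS:", max_context_tokens(19.5, 23.5))
    print("Q4_K_M:", max_context_tokens(22.5, 27.0))
```

Under these assumptions IQ3_M tops out a little above 64K, IQ4_XS at about 32K, and Q4_K_M at roughly 13-14K (or 16K with no headroom at all). A quantized KV cache would push all three higher.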
Use Case 3 — Vision-Language Tasks (MMMU 81.7)
The native vision capabilities (MMMU 81.7, RealWorldQA 85.3) mean Qwen3.6-35B-A3B handles:
- Document parsing: invoices, scanned PDFs, forms — extracting structured data from images
- Chart and table reading: convert screenshots of dashboards to text
- Code from screenshots: paste a screenshot of code from a paper or video, get text back
- Real-world QA: "what's broken in this photo of my circuit board" type queries
This was previously the domain of separate vision models (LLaVA, Pixtral, Qwen2-VL). Having text + vision in one 35B-A3B model means a single deployment serves both — significant infrastructure simplification.
Practical setup:
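# Via Ollama API (vision-capable models support image inputs natively)
import base64
import requests

# Read the image and base64-encode it, as the Ollama API expects
with open('chart.png', 'rb') as f:
    img_b64 = base64.b64encode(f.read()).decode()

r = requests.post('http://localhost:11434/api/generate', json={
    'model': 'qwen3.6:35b-a3b-iq4_xs',
    'prompt': 'Extract the quarterly revenue figures from this chart as JSON',
    'images': [img_b64],
    'stream': False,  # return one JSON object instead of a stream of chunks
})
r.raise_for_status()
print(r.json()['response'])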
Use Case 4 — Tool Calling / Agentic Workflows
The model is trained for tool use, which combined with the 262K context enables full agentic deployments:
- MCP server backends: local model serving Model Context Protocol clients
- Function calling: structured output for API calls, database queries
- Multi-step task execution: research → analysis → report generation loops
Example with OpenAI-compatible function calling (Ollama exposes this via OpenAI-format endpoint):
from openai import OpenAI

# Ollama serves an OpenAI-compatible API under /v1; the client needs an api_key, but Ollama ignores it
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

tools = [{
    "type": "function",
    "function": {
        "name": "search_codebase",
        "description": "Search a local codebase for files matching a pattern",
        "parameters": {
            "type": "object",
            "properties": {"pattern": {"type": "string"}},
            "required": ["pattern"],
        },
    },
}]

response = client.chat.completions.create(
    model='qwen3.6:35b-a3b-iq4_xs',
    messages=[{"role": "user", "content": "Find all auth-related files in my project"}],
    tools=tools,
)
# Any calls the model requests arrive as structured tool_calls
print(response.choices[0].message.tool_calls)
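The request above only gets the model to ask for a tool. Your code still has to run the tool and return the result. Here is a minimal sketch of that dispatch step, assuming the OpenAI-format fields `function.name` and `function.arguments` (a JSON string). The `search_codebase` body, its file list, and `TOOL_REGISTRY` are hypothetical stand-ins.

```python
import json


def search_codebase(pattern: str) -> list[str]:
    """Hypothetical stand-in: a real version would walk the project tree."""
    files = ["src/auth/login.py", "src/auth/tokens.py", "src/billing/invoice.py"]
    return [path for path in files if pattern in path]


# Maps tool names advertised to the model onto local Python callables.
TOOL_REGISTRY = {"search_codebase": search_codebase}


def dispatch_tool_call(name: str, arguments_json: str) -> dict:
    """Run one requested tool call and return a JSON-serializable result."""
    if name not in TOOL_REGISTRY:
        return {"error": f"unknown tool: {name}"}
    try:
        args = json.loads(arguments_json)
        return {"result": TOOL_REGISTRY[name](**args)}
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or wrong argument names from the model.
        return {"error": str(exc)}


print(dispatch_tool_call("search_codebase", '{"pattern": "auth"}'))
```

For each entry in `tool_calls`, append a message with role `tool`, the matching `tool_call_id`, and `json.dumps(...)` of the dispatch result, then call the chat endpoint again so the model can use the output. Returning errors as data lets the model retry with corrected arguments.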
Use Case 5 — Math and Reasoning (AIME 2026 92.7)
The AIME 2026 score of 92.7 is unusually high for an open model. For reasoning-heavy use cases:
- Code review for logical bugs (vs syntax bugs)
- Math/stats problem solving in research workflows
- Multi-step deductive tasks (legal reasoning, scientific hypothesis chains)
- Verification of LLM-generated code or proofs
The MoE architecture routes each token to a different subset of experts, which may help on complex multi-step problems relative to comparable dense models. The benchmark scores above are the firmer evidence.
When NOT to Use Qwen3.6-35B-A3B on RTX 3090
Honest counter-cases:
Pure chat / simple Q&A
For interactive chat at maximum speed, Llama 3.1 8B Q4_K_M (95 t/s) beats Qwen3.6-35B-A3B (~50 t/s) in raw throughput. The 35B-A3B quality advantage only shows on harder tasks. If your interactive load is mostly "summarize this email," 8B is the right pick.
Very tight VRAM (running other workloads concurrently)
Loading 35B-A3B at IQ4_XS uses ~18 GB. If you're also running Stable Diffusion (4-8 GB) or Jupyter PyTorch sessions (4-12 GB) on the same GPU, you'll OOM. Either reserve the 3090 for the LLM or use a smaller model.
Real-time low-latency (sub-second first-token)
MoE models can carry a slight first-token latency cost over dense models (on the order of 100-300 ms). For latency-sensitive applications, such as interactive chat where the response must feel immediate, a dense 8B beats 35B-A3B.
Need >32K context with quality intact
IQ3_M extends the context budget, but at a quality cost. If you genuinely need 64K+ tokens at full Q4-level quality, you've outgrown a single 3090: consider a dual-GPU split (see llama.cpp split-mode guide) or upgrade to a 32GB+ card (RTX 5090, A6000).
Setup Walkthrough (Ollama)
# Install / update Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull the IQ4_XS quant (best fit for 24GB with context headroom)
ollama pull qwen3.6:35b-a3b-iq4_xs
# Or, for higher quality at the cost of context headroom
ollama pull qwen3.6:35b-a3b-q4_k_m
# Verify
ollama list
ollama ps # Lists the model once it has served a request
# Bake a 32K context into a derived model (a Modelfile needs a FROM line)
printf 'FROM qwen3.6:35b-a3b-iq4_xs\nPARAMETER num_ctx 32768\n' > Modelfile-qwen36
ollama create qwen3.6-32k -f Modelfile-qwen36
# Run with extended context
ollama run qwen3.6-32k
Recommended environment variables:
# In /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_FLASH_ATTENTION=1"
# q8_0 KV cache uses roughly half the memory of f16 KV; requires flash attention (systemd comments need their own line)
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
The q8_0 KV cache option saves substantial VRAM on long contexts at minimal quality cost. For 35B-A3B at 64K context with q8 KV, expect about 1.5 GB savings vs f16 KV.
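The per-element arithmetic behind that saving can be sketched as follows. It relies on the ggml `q8_0` block layout (32 int8 values plus one fp16 scale, 34 bytes per 32 elements). The 4 GB input is a hypothetical f16 KV cache size, not a measurement for this model.

```python
F16_BYTES_PER_ELEM = 2.0
# ggml q8_0 block: 32 int8 values + one fp16 scale = 34 bytes per 32 elements.
Q8_0_BYTES_PER_ELEM = 34 / 32


def kv_savings_gb(kv_f16_gb: float) -> float:
    """GB saved by storing a KV cache as q8_0 instead of f16."""
    return kv_f16_gb * (1 - Q8_0_BYTES_PER_ELEM / F16_BYTES_PER_ELEM)


if __name__ == "__main__":
    # Hypothetical 4 GB f16 KV cache: a bit under half of it is saved.
    print(f"{kv_savings_gb(4.0):.2f} GB saved")
```

The saving applies only to the KV cache share of VRAM, not to the weights, so it matters most at long context.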
For OLLAMA_KEEP_ALIVE nuances see Ollama OLLAMA_KEEP_ALIVE — How Model Memory Persistence Actually Works.
Comparison — Qwen3.6-35B-A3B vs Other 24GB-Class Options
| Model | Params | Active | RTX 3090 quant | Quality tier (per benchmarks) |
|---|---|---|---|---|
| Qwen3.6-35B-A3B | 35B total | 3B | IQ4_XS / Q4_K_M | Frontier-competitive on coding + math + vision |
| Qwen3-30B-A3B (older) | 30B total | 3B | Q4_K_M | Strong but no vision, smaller context |
| Llama 3.1 70B | 70B dense | 70B | IQ3_M or split | Slower, broader knowledge |
| Mixtral 8×7B | 47B total | 13B | Q4_K_M | Older (2023), no vision |
| Qwen 3 14B | 14B dense | 14B | Q5_K_M | Solid, faster than 35B-A3B |
| Phi-4 14B | 14B dense | 14B | Q5_K_M | Strong reasoning per param |
| Llama 3.1 8B | 8B dense | 8B | Q8_0 | Fastest, less capable on hard tasks |
The 35B-A3B's specific edge in 2026: best multimodal + coding model that fits on consumer 24GB. Mixtral 8×7B was the previous best-fit MoE but is older and text-only. Qwen3-30B-A3B was the immediate predecessor without vision.
Practical Recommendations
For a single RTX 3090 24GB user in 2026:
- Default model for capable work: Qwen3.6-35B-A3B IQ4_XS, 16-32K context
- Fast chat / quick queries: Llama 3.1 8B Q4_K_M
- Long-context document analysis: Qwen3.6-35B-A3B IQ3_M, up to 64K
- Pure coding agent: Qwen3.6-35B-A3B IQ4_XS (SWE-bench 73.4 score speaks for itself)
- Vision tasks: Qwen3.6-35B-A3B (no need for separate VLM)
If you have two GTX 1080 Ti cards instead, see the GTX 1080 Ti and dual-GPU guides for similar models that fit in 22 GB combined.
FAQ
Q: Why does the 3B active parameter make speed so much better than 35B dense?
Each token's forward pass through an MoE layer runs only the active experts (8 routed + 1 shared, out of 256). Together with attention and the other shared layers, that comes to about 3B parameters of compute. A dense 35B model would compute through all 35B parameters per token. Inference time scales with active parameters, not total. VRAM usage scales with total parameters, because the router can pick any expert for any token, so all of them must stay loaded.
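A rough way to see the gap is a memory-bandwidth ceiling: token generation is usually limited by how fast weights stream from VRAM, so fewer active bytes per token means a higher upper bound. The sketch below uses the RTX 3090's 936 GB/s memory bandwidth and an assumed 4.25 bits per weight. It ignores KV cache reads, expert routing and kernel overhead, so real throughput sits far below these ceilings.

```python
RTX_3090_BANDWIDTH_GBPS = 936.0  # GB/s memory bandwidth of the RTX 3090


def decode_ceiling_tps(
    active_params: float,
    bits_per_weight: float,
    bandwidth_gbps: float = RTX_3090_BANDWIDTH_GBPS,
) -> float:
    """Upper bound on tokens/sec if each token streams the active weights once."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gbps * 1e9 / bytes_per_token


if __name__ == "__main__":
    moe = decode_ceiling_tps(3e9, 4.25)     # 3B active (MoE)
    dense = decode_ceiling_tps(35e9, 4.25)  # hypothetical dense 35B
    print(f"MoE ceiling ~{moe:.0f} t/s, dense ceiling ~{dense:.0f} t/s, ratio {moe / dense:.1f}x")
```

The ratio is the useful part: the bound scales with active parameters (35/3, or about 11.7x), whatever the absolute numbers turn out to be on real hardware.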
Q: Will my RTX 3090 thermal-throttle running this?
Sustained inference on RTX 3090 hits ~70-80°C with stock cooling. Throttling typically starts at 83°C. With adequate case airflow, this should be fine for hours of continuous use. Mining-recovered 3090s may run hotter due to thermal pad degradation — re-pad if temps spike.
Q: Does this require driver/CUDA upgrade?
llama.cpp and Ollama work with CUDA 11.8+ on the 3090. Most 2026 Linux distros ship 12.x by default, which is more than adequate. No special drivers are needed beyond the standard NVIDIA proprietary driver.
Q: Can I run two requests in parallel?
With OLLAMA_NUM_PARALLEL=2, yes — but each parallel inference uses additional KV cache memory. At 32K context per stream + IQ4_XS weights, you'll OOM. Stick with 1 parallel on 24GB; use a server with 48GB+ for concurrent serving.
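The arithmetic is simple enough to sketch, using the VRAM table's IQ4_XS estimates (18 GB of weights, about 5.5 GB more at 32K context). It assumes each parallel stream needs its own KV cache of that size, and the function name is illustrative.

```python
def total_vram_gb(weights_gb: float, kv_per_stream_gb: float, num_parallel: int) -> float:
    """Weights are shared; each parallel stream adds its own KV cache."""
    return weights_gb + kv_per_stream_gb * num_parallel


if __name__ == "__main__":
    for streams in (1, 2):
        total = total_vram_gb(18.0, 5.5, streams)
        verdict = "fits" if total <= 24.0 else "exceeds 24 GB"
        print(f"{streams} stream(s): {total:.1f} GB ({verdict})")
```

Two streams would only fit if each ran with a much shorter context or a quantized KV cache.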
Q: How does this compare to running Claude 3.7 Sonnet via API?
Subjectively, quality is similar for many tasks. Cost math: the Claude API runs $3-15/M tokens × your usage; local Qwen3.6 costs electricity (about $0.50/day if always-on, so roughly $15/month) with zero API cost. On electricity alone, break-even is around 1-5M tokens/month depending on the API rate, and higher once you amortize the hardware. Privacy and data residency are advantages only the local setup offers.
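The break-even point is easy to recompute for your own rates. This sketch counts only the electricity figure quoted above. Hardware amortization is not modeled and would raise the break-even. The function name is illustrative.

```python
def break_even_mtok(local_cost_per_day: float, api_price_per_mtok: float, days: int = 30) -> float:
    """Monthly volume (millions of tokens) where API spend equals local running cost."""
    return local_cost_per_day * days / api_price_per_mtok


if __name__ == "__main__":
    for price in (3.0, 15.0):
        print(f"${price:.0f}/M tokens: break-even at {break_even_mtok(0.50, price):.1f}M tokens/month")
```

Swap in your local electricity cost and the blended input/output price you actually pay to get a figure for your workload.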
Q: What about training/fine-tuning on RTX 3090?
Full fine-tuning of 35B-A3B requires multi-GPU + 100GB+ aggregate VRAM. QLoRA fine-tuning of select experts is feasible on 24GB but rare — usually you'd fine-tune a dense smaller model (8B) instead.
Q: Is the 1M-token YaRN extended context useful?
Theoretically yes, practically limited by VRAM. To run inference at 500K-1M context on a single 3090, you'd need IQ2 quantization which has measurable quality loss. Reserved for special use cases; standard "long context" is 32-64K.
Q: Why Qwen specifically over equivalent Llama 4 or Gemma 3 MoE?
As of mid-2026, Qwen3.6-35B-A3B's combination of license (Apache 2.0), multimodal capability, 262K context, and benchmark scores (SWE-bench Verified 73.4, AIME 2026 92.7) puts it ahead of other open MoE models of similar size. Llama 4 has different size points; Gemma 3 doesn't have a strict MoE in this class.
Closing — The Single Reason This Matters
Qwen3.6-35B-A3B is the first 30B+ class model that:
- Has frontier-competitive benchmarks (SWE-bench Verified 73.4, AIME 2026 92.7, MMMU 81.7)
- Fits on a consumer single RTX 3090 24GB (via IQ4_XS or Q4_K_M)
- Runs at MoE-class speed (~50 t/s, not 10 t/s)
- Is commercially licensable (Apache 2.0)
- Handles vision + text + tools in one model
For a hobbyist with one RTX 3090, this collapses what used to require either a 70B dense model on multi-GPU (much slower) or a cloud API (cost + privacy tradeoff). For solo developers and small teams running production agents on owned hardware, this is the model to standardize on in 2026.
Related posts:
- Best Ollama Models for RTX 3090 24GB in 2026: Real Benchmarks (Qwen3 vs DeepSeek vs Llama)
- GGUF Quantization Showdown — Q4_K_M vs Q4_K_S vs IQ4_XS vs Q5_K_M
- Ollama OLLAMA_KEEP_ALIVE — Model Memory Persistence Deep Dive
- Running Modern LLMs on GTX 1080 Ti in 2026 — What Still Works
- Ollama Dual GPU Without NVLink — Tensor Split on 2× GTX 1080 Ti
- Home AI Server Build Guide 2026: RTX 4090 vs 3090 vs 5090
- LLM VRAM Calculator
References:
- Qwen3.6-35B-A3B model card: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
- Qwen team release announcements (April 2026)
- llama.cpp quantization documentation: https://github.com/ggerganov/llama.cpp
- LocalLLaMA community benchmarks (r/LocalLLaMA, 2026)