Running Qwen3.6-35B-A3B on RTX 3090 24GB — Real Use Cases for the 3B-Active MoE (2026)

Quick Answer (TL;DR)

Can Qwen3.6-35B-A3B run on a single RTX 3090 24GB? Yes: it fits at IQ4_XS (~18 GB of weights, ~23.5 GB with 32K context) or Q4_K_M (~21 GB of weights, ~22.5 GB at 8K context, so keep it to ≤16K). Expected throughput is 45-65 tokens/sec because only 3B parameters are active per token despite the 35B total (MoE with 256 experts, 8 routed + 1 shared activated).

Best use cases on RTX 3090:

Agentic coding (SWE-bench Verified 73.4) — Aider/Continue with local backend
Long-context document analysis — 262K native context (1M with YaRN), use IQ3_M for 64K+
Vision-language tasks (MMMU 81.7) — built-in multimodal, no separate VLM needed
Tool calling / agents — native function calling support
Math/reasoning (AIME 2026 92.7) — frontier-competitive open model

Not the right pick for: pure fast chat (Llama 3.1 8B is faster at ~95 t/s), shared workstation with mixed GPU workloads, real-time low-latency apps, or workloads requiring >32K context at full Q4 quality.

Definition

Qwen3.6-35B-A3B is an open-source large language model released by Alibaba's Qwen team in April 2026 under the Apache 2.0 license (model card). It uses a Mixture-of-Experts (MoE) architecture with 35 billion total parameters but only 3 billion active per token (256 experts, 8 routed + 1 shared per forward pass). The "A3B" suffix denotes the active parameter count. It supports text + vision + tool calling, has 262K native context (1M with YaRN scaling), and is the first 30B+ class open model to combine all of that with frontier-competitive benchmarks while fitting a single consumer 24 GB GPU at usable speed.

The 35B Model That Fits in 24GB

Qwen released Qwen3.6-35B-A3B in April 2026. The specs are unusual:

35B total parameters (impressive on paper)
3B active parameters per token (MoE — 256 experts, 8 routed + 1 shared activated)
262K native context (extensible to 1M via YaRN)
Vision-language capable (multimodal)
Apache 2.0 (commercial-friendly)
391 community quantizations across llama.cpp, Ollama, LM Studio, Jan

The combination that matters for RTX 3090 owners: the 35B weight count determines VRAM (≈21 GB at Q4_K_M, fits in 24 GB), but only 3B parameters compute per token, so inference runs at MoE-class speed — much faster than a dense 30B model.

This guide is a practical look at what that enables on a single RTX 3090: use cases where an 8B model isn't enough but a multi-GPU rig can't be justified. It is organized around what the model is good at according to the published benchmarks (SWE-bench Verified 73.4, MMLU-Pro 85.2, AIME 2026 92.7, MMMU 81.7), translated into real workflows.

For the general RTX 3090 model comparison context, see Best Ollama Models for RTX 3090 24GB in 2026: Real Benchmarks.

VRAM Math for RTX 3090 24GB

Approximate footprint for Qwen3.6-35B-A3B per quantization. The first column is model weights only; the next two columns add the KV cache at 8K and 32K context:

Quantization	Weights size	+ 8K context KV	+ 32K context KV	Single 3090 24GB?
Q8_0	37 GB	39 GB	45 GB	❌ OOM
Q6_K	28 GB	30 GB	36 GB	❌ OOM
Q5_K_M	25 GB	27 GB	32 GB	❌ OOM
Q4_K_M	21 GB	22.5 GB	27 GB (over budget)	✅ for ≤16K ctx only
IQ4_XS	18 GB	19.5 GB	23.5 GB	✅ fits to 32K (little headroom)
Q4_K_S	19 GB	20.5 GB	24.5 GB (just over budget)	⚠️ stay below 32K
Q3_K_M	16 GB	17.5 GB	21 GB	✅ long context room
IQ3_M	14 GB	15.5 GB	18.5 GB	✅ 64K context viable

The practical sweet spots for RTX 3090 24GB:

IQ4_XS for general use — best quality fit with 16-32K context
Q4_K_M if quality matters more than context length — limit to ≤16K context
IQ3_M for long-context workflows — 64K+ context becomes feasible
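
The table can be turned into a quick budget check. The sketch below simply copies the table's approximate figures and subtracts them from the card's 24 GB; the function names and the `reserved_gb` allowance (for a desktop session or other processes already holding GPU memory) are illustrative assumptions, not part of any tool.

```python
# Figures copied from the table above, in GB:
# (weights only, weights + 8K context KV, weights + 32K context KV)
FOOTPRINT_GB = {
    "Q8_0":   (37.0, 39.0, 45.0),
    "Q6_K":   (28.0, 30.0, 36.0),
    "Q5_K_M": (25.0, 27.0, 32.0),
    "Q4_K_M": (21.0, 22.5, 27.0),
    "IQ4_XS": (18.0, 19.5, 23.5),
    "Q4_K_S": (19.0, 20.5, 24.5),
    "Q3_K_M": (16.0, 17.5, 21.0),
    "IQ3_M":  (14.0, 15.5, 18.5),
}
CONTEXT_COLUMN = {"8k": 1, "32k": 2}


def headroom_gb(quant: str, context: str, vram_gb: float = 24.0,
                reserved_gb: float = 0.0) -> float:
    """Free VRAM left after loading `quant` at `context`; negative means OOM."""
    total = FOOTPRINT_GB[quant][CONTEXT_COLUMN[context]]
    return vram_gb - reserved_gb - total


def fits(quant: str, context: str, vram_gb: float = 24.0,
         reserved_gb: float = 0.0) -> bool:
    """True when the table's estimate stays within the VRAM budget."""
    return headroom_gb(quant, context, vram_gb, reserved_gb) >= 0


if __name__ == "__main__":
    for quant in FOOTPRINT_GB:
        for context in CONTEXT_COLUMN:
            verdict = "fits" if fits(quant, context) else "OOM"
            free = headroom_gb(quant, context)
            print(f"{quant:7s} @ {context:>3s}: {free:+5.1f} GB ({verdict})")
```

Passing `reserved_gb=1.0` shows how quickly the IQ4_XS @ 32K configuration runs out of room once anything else is using the GPU.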

For the broader quantization tradeoff discussion, see GGUF Quantization Showdown — Q4_K_M vs Q4_K_S vs IQ4_XS vs Q5_K_M.

Expected Throughput on RTX 3090

Because Qwen3.6-35B-A3B is MoE with only 3B active parameters per token, generation speed is closer to a 3B dense model than a 35B dense model. Expected ballpark on RTX 3090 24GB:

Quant + Context	Expected tokens/sec
IQ4_XS @ 8K	45-65
Q4_K_M @ 8K	45-60
IQ4_XS @ 32K (well into context)	30-45
IQ3_M @ 64K (well into context)	20-35

For comparison:

Llama 3.1 8B Q4_K_M on RTX 3090: ~95 t/s (dense 8B)
Llama 3.1 70B Q4_K_M (split GPUs): ~10-15 t/s (dense 70B)
Mixtral 8×7B Q4_K_M on RTX 3090: ~50-65 t/s (similar MoE class)

The MoE 35B (3B active) lands in roughly the same range as Mixtral 8×7B and far above a dense 70B, which is much faster than a comparable dense model would be.

Caveat on numbers: these are estimates based on architecture math and Mixtral 8×7B precedent, not measurements. Cross-check them against 2026 community results on r/LocalLLaMA and Hugging Face discussions for your exact quant and llama.cpp version.
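
Since the figures above are estimates, it is worth measuring your own setup. A minimal sketch, assuming an Ollama server on the default port: the non-streaming `/api/generate` response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds), from which tokens/sec follows directly. The helper names `tokens_per_second` and `benchmark` are made up for this example.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's timing fields.

    `eval_count` is the number of generated tokens and `eval_duration`
    is the time spent generating them, in nanoseconds.
    """
    return response["eval_count"] / response["eval_duration"] * 1e9


def benchmark(model: str, prompt: str,
              host: str = "http://localhost:11434") -> float:
    """Run one non-streaming generation and return tokens/sec."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as reply:
        return tokens_per_second(json.load(reply))
```

With the server running, `benchmark("qwen3.6:35b-a3b-iq4_xs", "Explain MoE routing in 200 words.")` returns a single measurement; run it a few times, and with a long prompt as well, since speed drops as the context fills.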

Real Use Cases — Where 35B-A3B Actually Earns Its VRAM

Use Case 1 — Agentic Coding (SWE-bench Verified 73.4)

The published SWE-bench Verified score of 73.4 puts Qwen3.6-35B-A3B in the top tier of open coding models — competitive with much larger frontier models on real GitHub bug-fixing tasks.

For RTX 3090 + a local code agent (Aider, Continue, Cursor with local backend), this enables:

Fixing real bugs in your codebase without sending code to a cloud provider
Multi-file changes — the 262K context can hold a substantial codebase
Iterative tool use — model + IDE + test runner in a loop, all local

Practical setup with Ollama + Continue (VS Code):

# Ollama
ollama pull qwen3.6:35b-a3b-iq4_xs

# In VS Code with Continue extension, ~/.continue/config.json:
{
  "models": [{
    "title": "Qwen3.6-35B-A3B Local",
    "provider": "ollama",
    "model": "qwen3.6:35b-a3b-iq4_xs",
    "contextLength": 32768,
    "completionOptions": { "temperature": 0.1 }
  }]
}

For agentic coding (Aider):

aider --model ollama/qwen3.6:35b-a3b-iq4_xs --architect

Realistic expectation: for typical 1-3 file bug fixes in a well-structured Python or TypeScript codebase, expect GPT-4-class output quality. For deep architectural refactors across many files, even a SWE-bench Verified score of 73.4 leaves room for failure modes, so keep tests green and commit small.

Use Case 2 — Long-Context Document Analysis (262K Native)

The 262K native context (with YaRN extending to 1M) is unusual for a model that fits on a single 24GB card. Realistic workflows:

Annotated code review across a full repository: paste an entire mid-sized library's source, ask architectural questions
Legal document analysis: contracts, regulatory filings (~150K tokens) entirely in context
Scientific paper synthesis: 5-10 long papers (each 15-30K tokens) compared in one session
Long-form RAG: instead of retrieval-augmented chunks, load the full source documents

VRAM reality for long context:

IQ3_M + 64K context: the table's ~18.5 GB figure is for 32K, so budget a few GB more at 64K (less with a q8_0 KV cache) → still fits
IQ4_XS + 32K context: ~23.5 GB → fits but tight
Q4_K_M + full 262K: not viable on single 24GB without context offload

For genuinely long context (>64K), drop to IQ3_M and accept the slight quality reduction in exchange for the context budget. Because the long context is native rather than bolted on, the model should hold up far better at depth than older models that degrade past 32K, though recall at 200K tokens is worth verifying on your own documents.
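
The quant-versus-context trade-off can be written down as a small lookup. This is only a sketch of this guide's single-3090 recommendations: the tier boundaries come from the text above, while the 4-characters-per-token rule of thumb, the default reply budget, and the function names are assumptions made for illustration.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count.

    About 4 characters per token is a common rule of thumb for English
    prose; code and non-English text usually tokenize less efficiently.
    """
    return int(len(text) / chars_per_token) + 1


def pick_quant(prompt_tokens: int, reply_tokens: int = 2048):
    """Map a context need onto this guide's single-3090 recommendations.

    Returns (quant, num_ctx), or None when the request exceeds the 64K
    ceiling the guide treats as practical on 24 GB.
    """
    needed = prompt_tokens + reply_tokens
    tiers = [
        (16_384, "Q4_K_M"),  # quality first, short context
        (32_768, "IQ4_XS"),  # general-use sweet spot
        (65_536, "IQ3_M"),   # long-context fallback
    ]
    for num_ctx, quant in tiers:
        if needed <= num_ctx:
            return quant, num_ctx
    return None
```

A `None` result, for example for a ~150K-token filing, means the document has to be split, or the context offloaded or moved to a larger GPU, rather than loaded whole on one 3090.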

Use Case 3 — Vision-Language Tasks (MMMU 81.7)

The native vision capabilities (MMMU 81.7, RealWorldQA 85.3) mean Qwen3.6-35B-A3B handles:

Document parsing: invoices, scanned PDFs, forms — extracting structured data from images
Chart and table reading: convert screenshots of dashboards to text
Code from screenshots: paste a screenshot of code from a paper or video, get text back
Real-world QA: "what's broken in this photo of my circuit board" type queries

This was previously the domain of separate vision models (LLaVA, Pixtral, Qwen2-VL). Having text + vision in one 35B-A3B model means a single deployment serves both — significant infrastructure simplification.

Practical setup:

# Via Ollama's native API (vision-capable models accept base64-encoded images)
import base64

import requests

with open('chart.png', 'rb') as f:
    img_b64 = base64.b64encode(f.read()).decode()

r = requests.post('http://localhost:11434/api/generate', json={
    'model': 'qwen3.6:35b-a3b-iq4_xs',
    'prompt': 'Extract the quarterly revenue figures from this chart as JSON',
    'images': [img_b64],
    'stream': False,  # return one JSON object instead of a stream of chunks
}, timeout=300)
r.raise_for_status()
print(r.json()['response'])  # the model's text answer

Use Case 4 — Tool Calling / Agentic Workflows

The model is trained for tool use, which combined with the 262K context enables full agentic deployments:

MCP server backends: local model serving Model Context Protocol clients
Function calling: structured output for API calls, database queries
Multi-step task execution: research → analysis → report generation loops

Example with OpenAI-compatible function calling (Ollama exposes this via OpenAI-format endpoint):

from openai import OpenAI
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

tools = [{
    "type": "function",
    "function": {
        "name": "search_codebase",
        "description": "Search a local codebase for files matching a pattern",
        "parameters": {
            "type": "object",
            "properties": {"pattern": {"type": "string"}},
            "required": ["pattern"],
        },
    },
}]

response = client.chat.completions.create(
    model='qwen3.6:35b-a3b-iq4_xs',
    messages=[{"role": "user", "content": "Find all auth-related files in my project"}],
    tools=tools,
)
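
The request above only gets as far as the model asking for a tool; something still has to run it. A minimal sketch of that dispatch step follows. The `search_codebase` body and the `run_tool_call` helper are hypothetical (the original only declares the tool's schema), and errors are returned to the model as JSON rather than raised.

```python
import json
from pathlib import Path


def search_codebase(pattern: str, root: str = ".") -> list[str]:
    """Hypothetical implementation of the tool declared above:
    return files under `root` whose names match a glob pattern."""
    return sorted(str(p) for p in Path(root).rglob(pattern) if p.is_file())


# Registry mapping tool names (as declared in `tools`) to local callables.
TOOL_REGISTRY = {"search_codebase": search_codebase}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one tool call and return a JSON string for the `tool` message.

    `arguments_json` is the JSON-encoded argument string that the
    OpenAI-format API places in `tool_call.function.arguments`.
    """
    if name not in TOOL_REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        arguments = json.loads(arguments_json)
        return json.dumps({"result": TOOL_REGISTRY[name](**arguments)})
    except (json.JSONDecodeError, TypeError) as exc:
        # Models occasionally emit malformed or mismatched arguments;
        # report the problem back instead of crashing the agent loop.
        return json.dumps({"error": str(exc)})
```

In the agent loop, each entry in `response.choices[0].message.tool_calls` would be passed through `run_tool_call(call.function.name, call.function.arguments)`, and the result appended to `messages` as `{"role": "tool", "tool_call_id": call.id, "content": ...}` before calling the API again.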

Use Case 5 — Math and Reasoning (AIME 2026 92.7)

The AIME 2026 score of 92.7 is unusually high for an open model. For reasoning-heavy use cases:

Code review for logical bugs (vs syntax bugs)
Math/stats problem solving in research workflows
Multi-step deductive tasks (legal reasoning, scientific hypothesis chains)
Verification of LLM-generated code or proofs

MoE routing sends each token to a small subset of specialised experts. Whether that specifically helps multi-step reasoning is hard to isolate, but the published scores show the sparse design does not hold it back relative to comparable dense models.

When NOT to Use Qwen3.6-35B-A3B on RTX 3090

Honest counter-cases:

Pure chat / simple Q&A

For interactive chat at maximum speed, Llama 3.1 8B Q4_K_M (95 t/s) beats Qwen3.6-35B-A3B (~50 t/s) in raw throughput. The 35B-A3B quality advantage only shows on harder tasks. If your interactive load is mostly "summarize this email," 8B is the right pick.

Very tight VRAM (running other workloads concurrently)

Loading 35B-A3B at IQ4_XS uses ~18 GB. If you're also running Stable Diffusion (4-8 GB) or Jupyter PyTorch sessions (4-12 GB) on the same GPU, you'll OOM. Either reserve the 3090 for the LLM or use a smaller model.

Real-time low-latency (sub-second first-token)

MoE models have a slight first-token latency cost over dense (~100-300ms more). For latency-sensitive applications (interactive chat with immediate response perception), an 8B dense beats 35B-A3B.

Need >32K context with quality intact

IQ3_M extends the context budget, but at a quality cost. If you genuinely need 64K+ tokens at full Q4-level quality, you've outgrown a single 3090: consider a dual-GPU split (see llama.cpp split-mode guide) or upgrade to a 32GB+ card (RTX 5090, A6000).

Setup Walkthrough (Ollama)

# Install / update Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull the IQ4_XS quant (best fit for 24GB with context headroom)
ollama pull qwen3.6:35b-a3b-iq4_xs

# Or the Q4_K_M quant: higher quality, less room for context
ollama pull qwen3.6:35b-a3b-q4_k_m

# Verify
ollama list
ollama ps   # Will show after first request

# Bake a 32K context window into a derived model (a Modelfile needs a FROM line)
printf 'FROM qwen3.6:35b-a3b-iq4_xs\nPARAMETER num_ctx 32768\n' > Modelfile-qwen36
ollama create qwen3.6-32k -f Modelfile-qwen36

# Run with extended context
ollama run qwen3.6-32k

Recommended environment variables:

# In /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_FLASH_ATTENTION=1"
# q8_0 KV cache: roughly half the KV memory of f16; only applies with flash attention on
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"

The q8_0 KV cache option saves substantial VRAM on long contexts at minimal quality cost. For 35B-A3B at 64K context with q8 KV, expect about 1.5 GB savings vs f16 KV.
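
To see where KV cache savings come from, the standard attention formula can be sketched as below. The layer count, KV-head count, and head dimension are placeholders chosen for illustration, not Qwen3.6-35B-A3B's real configuration (check the model card's `config.json`), and models that use non-standard attention layers will deviate from it. The 8.5 bits per value for q8_0 follows from llama.cpp's block layout.

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bits_per_value: float) -> float:
    """KV cache size for standard attention: one key and one value vector
    per layer, per KV head, per token. Result is in GiB."""
    values = 2 * layers * kv_heads * head_dim * tokens
    return values * bits_per_value / 8 / 1024**3


F16_BITS = 16.0
# llama.cpp's q8_0 stores 32 int8 values plus one f16 scale per block:
# 34 bytes per 32 values = 8.5 bits per value.
Q8_0_BITS = 8.5

# Placeholder architecture, NOT the model's real configuration.
HYPOTHETICAL = dict(layers=48, kv_heads=4, head_dim=128)

if __name__ == "__main__":
    for ctx in (8_192, 32_768, 65_536):
        f16 = kv_cache_gb(ctx, bits_per_value=F16_BITS, **HYPOTHETICAL)
        q8 = kv_cache_gb(ctx, bits_per_value=Q8_0_BITS, **HYPOTHETICAL)
        print(f"{ctx:>6} tokens: f16 {f16:.2f} GB, "
              f"q8_0 {q8:.2f} GB, saved {f16 - q8:.2f} GB")
```

Whatever the real dimensions are, the saving is a fixed fraction of the cache, so in absolute terms it grows linearly with context length.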

For OLLAMA_KEEP_ALIVE nuances see Ollama OLLAMA_KEEP_ALIVE — How Model Memory Persistence Actually Works.

Comparison — Qwen3.6-35B-A3B vs Other 24GB-Class Options

Model	Params	Active	RTX 3090 quant	Quality tier (per benchmarks)
Qwen3.6-35B-A3B	35B total	3B	IQ4_XS / Q4_K_M	Frontier-competitive on coding + math + vision
Qwen3-30B-A3B (older)	30B total	3B	Q4_K_M	Strong but no vision, smaller context
Llama 3.1 70B	70B dense	70B	IQ3_M or split	Slower, broader knowledge
Mixtral 8×7B	47B total	13B	Q4_K_M	Older (2023), no vision
Qwen 3 14B	14B dense	14B	Q5_K_M	Solid, faster than 35B-A3B
Phi-4 14B	14B dense	14B	Q5_K_M	Strong reasoning per param
Llama 3.1 8B	8B dense	8B	Q8_0	Fastest, less capable on hard tasks

The 35B-A3B's specific edge in 2026: best multimodal + coding model that fits on consumer 24GB. Mixtral 8×7B was the previous best-fit MoE but is older and text-only. Qwen3-30B-A3B was the immediate predecessor without vision.

Practical Recommendations

For a single RTX 3090 24GB user in 2026:

Default model for capable work: Qwen3.6-35B-A3B IQ4_XS, 16-32K context
Fast chat / quick queries: Llama 3.1 8B Q4_K_M
Long-context document analysis: Qwen3.6-35B-A3B IQ3_M, up to 64K
Pure coding agent: Qwen3.6-35B-A3B IQ4_XS (SWE-bench 73.4 score speaks for itself)
Vision tasks: Qwen3.6-35B-A3B (no need for separate VLM)

If you have two GTX 1080 Ti cards instead, see the GTX 1080 Ti and dual-GPU guides for similar models that fit in 22 GB combined.

FAQ

Q: Why does the 3B active parameter make speed so much better than 35B dense?

In an MoE, each token's forward pass only runs through the active experts (8 routed + 1 shared, out of 256) plus the shared attention and embedding layers, which together come to about 3B parameters of compute. A dense 35B model would compute through all 35B parameters per token. Inference time scales with active parameters, not total. VRAM usage scales with total parameters, because the router can pick any expert for any token, so all of them must stay loaded.
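
A back-of-the-envelope way to see this: token generation is largely limited by how many bytes of weights must be read from VRAM per token. The sketch below computes that bandwidth ceiling for 3B active versus 35B dense parameters. The 4.25 bits-per-weight average is an assumption for a 4-bit GGUF quant, and the results are upper bounds, not predictions; real throughput is far lower once KV cache reads, routing, and kernel overhead are included.

```python
def weight_bytes_per_token(active_params: float, bits_per_weight: float) -> float:
    """Bytes of weights that must be read from VRAM to generate one token."""
    return active_params * bits_per_weight / 8


def bandwidth_ceiling_tps(active_params: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/sec if weight reads were the only cost."""
    bytes_per_token = weight_bytes_per_token(active_params, bits_per_weight)
    return bandwidth_gb_s * 1e9 / bytes_per_token


RTX_3090_BANDWIDTH_GB_S = 936.0  # NVIDIA's published memory bandwidth
BITS_PER_WEIGHT = 4.25           # assumed average for a 4-bit GGUF quant

moe = bandwidth_ceiling_tps(3e9, BITS_PER_WEIGHT, RTX_3090_BANDWIDTH_GB_S)
dense = bandwidth_ceiling_tps(35e9, BITS_PER_WEIGHT, RTX_3090_BANDWIDTH_GB_S)
print(f"ceiling with 3B active params: {moe:.0f} t/s")
print(f"ceiling with 35B dense params: {dense:.0f} t/s")
print(f"ratio: {moe / dense:.1f}x")
```

The ratio is the useful part: it equals 35/3 regardless of the bandwidth or quantization assumed, which is why the active parameter count dominates generation speed.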

Q: Will my RTX 3090 thermal-throttle running this?

Sustained inference on RTX 3090 hits ~70-80°C with stock cooling. Throttling typically starts at 83°C. With adequate case airflow, this should be fine for hours of continuous use. Mining-recovered 3090s may run hotter due to thermal pad degradation — re-pad if temps spike.

Q: Does this require driver/CUDA upgrade?

llama.cpp / Ollama work with CUDA 11.8+ on the 3090. Most 2026 Linux distros ship 12.x by default which is more than adequate. No special drivers beyond standard NVIDIA proprietary.

Q: Can I run two requests in parallel?

With OLLAMA_NUM_PARALLEL=2, yes — but each parallel inference uses additional KV cache memory. At 32K context per stream + IQ4_XS weights, you'll OOM. Stick with 1 parallel on 24GB; use a server with 48GB+ for concurrent serving.

Q: How does this compare to running Claude 3.7 Sonnet via API?

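Subjectively, quality is similar for many tasks. Cost math: Claude API at ~$3-15 per million tokens × your usage, versus local Qwen3.6 at electricity cost (~$0.50/day if always-on) with no API fees. Break-even is around 5-15M tokens/month depending on the rate and on how much hardware cost you amortize; electricity alone (~$15/month) is covered at about 1-5M tokens/month. Privacy and data residency are advantages only the local setup offers.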
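
The break-even arithmetic can be checked with a few lines. This sketch uses only the electricity and API price figures quoted in the answer above and leaves hardware amortization out, so the real break-even volume sits higher; the function names are illustrative.

```python
def monthly_electricity_usd(per_day_usd: float, days: int = 30) -> float:
    """Electricity cost of an always-on machine over one month."""
    return per_day_usd * days


def break_even_million_tokens(monthly_local_cost_usd: float,
                              api_price_per_million_usd: float) -> float:
    """Monthly volume (millions of tokens) at which API spend equals local cost."""
    return monthly_local_cost_usd / api_price_per_million_usd


local = monthly_electricity_usd(0.50)  # ~$0.50/day figure from the answer above
for price in (3.0, 15.0):              # API price range from the answer above
    volume = break_even_million_tokens(local, price)
    print(f"${price:.0f}/M tokens: break-even at {volume:.0f}M tokens/month")
```

To include hardware, add the monthly share of the GPU's purchase price to `local` before dividing.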

Q: What about training/fine-tuning on RTX 3090?

Full fine-tuning of 35B-A3B requires multi-GPU + 100GB+ aggregate VRAM. QLoRA fine-tuning of select experts is feasible on 24GB but rare — usually you'd fine-tune a dense smaller model (8B) instead.

Q: Is the 1M-token YaRN extended context useful?

Theoretically yes, practically limited by VRAM. To run inference at 500K-1M context on a single 3090, you'd need IQ2 quantization which has measurable quality loss. Reserved for special use cases; standard "long context" is 32-64K.

Q: Why Qwen specifically over equivalent Llama 4 or Gemma 3 MoE?

As of mid-2026, Qwen3.6-35B-A3B's combination of license (Apache 2.0), multimodal capability, 262K context, and benchmark scores (SWE-bench Verified 73.4, AIME 2026 92.7) puts it ahead of other open MoE models of similar size. Llama 4 has different size points; Gemma 3 doesn't have a strict MoE in this class.

Closing — The Single Reason This Matters

Qwen3.6-35B-A3B is the first 30B+ class model that:

Has frontier-competitive benchmarks (SWE-bench 73, AIME 92, MMMU 81)
Fits on a consumer single RTX 3090 24GB (via IQ4_XS or Q4_K_M)
Runs at MoE-class speed (~50 t/s, not 10 t/s)
Is commercially licensable (Apache 2.0)
Handles vision + text + tools in one model

For a hobbyist with one RTX 3090, this collapses what used to require either a 70B dense model on multi-GPU (much slower) or a cloud API (cost + privacy tradeoff). For solo developers and small teams running production agents on owned hardware, this is the model to standardize on in 2026.

References:

Qwen3.6-35B-A3B model card: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
Qwen team release announcements (April 2026)
llama.cpp quantization documentation: https://github.com/ggerganov/llama.cpp
LocalLLaMA community benchmarks (r/LocalLLaMA, 2026)