
Running Qwen3.6-35B-A3B on RTX 3090 24GB — Real Use Cases for the 3B-Active MoE (2026)

Qwen3.6-35B-A3B (April 2026 release) puts a 35B-parameter MoE model on a single RTX 3090 24GB at usable speed thanks to its 3B active parameters, and it ships under Apache 2.0. This guide covers the practical use cases (agentic coding with SWE-bench 73.4, 262K-context document analysis, vision-language tasks, and tool calling) with realistic VRAM math, expected throughput, and where the model genuinely outperforms 8B alternatives.



The 35B Model That Fits in 24GB

Qwen released Qwen3.6-35B-A3B in April 2026. The specs are unusual:

  • 35B total parameters (impressive on paper)
  • 3B active parameters per token (MoE — 256 experts, 8 routed + 1 shared activated)
  • 262K native context (extensible to 1M via YaRN)
  • Vision-language capable (multimodal)
  • Apache 2.0 (commercial-friendly)
  • 391 community quantizations across llama.cpp, Ollama, LM Studio, Jan

The combination that matters for RTX 3090 owners: the 35B weight count determines VRAM (≈21 GB at Q4_K_M, fits in 24 GB), but only 3B parameters compute per token, so inference runs at MoE-class speed — much faster than a dense 30B model.
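The split between resident weights and per-token compute can be sketched with simple arithmetic. This is a rough illustration, not a measurement: the bits-per-weight values are commonly quoted approximations for llama.cpp quant types (an assumption here), and `weights_gb` is a hypothetical helper name.

```python
# Approximate average bits-per-weight for a few llama.cpp quant types.
# ASSUMPTION: rough, commonly quoted values; real GGUF files vary by model.
APPROX_BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
    "IQ4_XS": 4.25,
}

TOTAL_PARAMS = 35e9   # every expert must be resident in VRAM
ACTIVE_PARAMS = 3e9   # parameters actually used for one token


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Estimate storage in GB (1 GB = 1e9 bytes) for `params` weights."""
    return params * bits_per_weight / 8 / 1e9


for quant, bpw in APPROX_BITS_PER_WEIGHT.items():
    resident = weights_gb(TOTAL_PARAMS, bpw)
    touched = weights_gb(ACTIVE_PARAMS, bpw)
    print(f"{quant}: ~{resident:.1f} GB resident, ~{touched:.1f} GB read per token")
```

With these assumed values, Q4_K_M works out to about 21 GB resident, in line with the figure above, while only a small fraction of that is read for any single token.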

This guide is a practical look at what that enables on a single RTX 3090: use cases where 8B isn't enough but you can't justify a multi-GPU rig. It is organized around what the model is genuinely good at per the published benchmarks (SWE-bench Verified 73.4, MMLU-Pro 85.2, AIME 2026 92.7, MMMU 81.7), translated to real workflows.

For the general RTX 3090 model comparison context, see Best Ollama Models for RTX 3090 24GB in 2026: Real Benchmarks.

VRAM Math for RTX 3090 24GB

Approximate footprint for Qwen3.6-35B-A3B per quantization. The first column is model weights only; the next two columns add the KV cache at 8K and 32K context:

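| Quantization | Weights size | + 8K context KV | + 32K context KV | Single 3090 24GB? |
| --- | --- | --- | --- | --- |
| Q8_0 | 37 GB | 39 GB | 45 GB | ❌ OOM |
| Q6_K | 28 GB | 30 GB | 36 GB | ❌ OOM |
| Q5_K_M | 25 GB | 27 GB | 32 GB | ❌ OOM |
| Q4_K_M | 21 GB | 22.5 GB | 27 GB (over 24 GB) | ✅ for ≤16K ctx |
| IQ4_XS | 18 GB | 19.5 GB | 23.5 GB (tight) | ✅ up to 32K |
| Q4_K_S | 19 GB | 20.5 GB | 24.5 GB (just over) | ⚠️ below 32K only |
| Q3_K_M | 16 GB | 17.5 GB | 21 GB | ✅ long context room |
| IQ3_M | 14 GB | 15.5 GB | 18.5 GB | ✅ 64K context viable |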

The practical sweet spots for RTX 3090 24GB:

  • IQ4_XS for general use — best quality fit with 16-32K context
  • Q4_K_M if quality matters more than context length — limit to ≤16K context
  • IQ3_M for long-context workflows — 64K+ context becomes feasible
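The selection logic behind these picks can be written down as a small lookup. This is a minimal sketch: the footprints are copied from the table above, while the 0.5 GB headroom default and the `best_quant` helper are assumptions made for illustration.

```python
GPU_VRAM_GB = 24.0

# (quant, weights GB, total GB at 8K, total GB at 32K) from the table above,
# ordered from largest to smallest weights so the first fit is the best quality.
FOOTPRINTS = [
    ("Q8_0", 37, 39, 45),
    ("Q6_K", 28, 30, 36),
    ("Q5_K_M", 25, 27, 32),
    ("Q4_K_M", 21, 22.5, 27),
    ("Q4_K_S", 19, 20.5, 24.5),
    ("IQ4_XS", 18, 19.5, 23.5),
    ("Q3_K_M", 16, 17.5, 21),
    ("IQ3_M", 14, 15.5, 18.5),
]


def best_quant(context: str, headroom_gb: float = 0.5):
    """Return the largest quant whose footprint plus headroom fits in VRAM.

    `context` is "8k" or "32k"; `headroom_gb` is an assumed safety margin
    for the CUDA context, desktop, and allocator overhead.
    """
    column = {"8k": 2, "32k": 3}[context]
    for row in FOOTPRINTS:
        if row[column] + headroom_gb <= GPU_VRAM_GB:
            return row[0]
    return None


print("8K context:", best_quant("8k"))
print("32K context:", best_quant("32k"))
```

Raising `headroom_gb` is a quick way to see how fragile a choice is: a pick that only fits with zero margin will not survive a desktop session or a second CUDA process.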

For the broader quantization tradeoff discussion, see GGUF Quantization Showdown — Q4_K_M vs Q4_K_S vs IQ4_XS vs Q5_K_M.

Expected Throughput on RTX 3090

Because Qwen3.6-35B-A3B is MoE with only 3B active parameters per token, generation speed is closer to a 3B dense model than a 35B dense model. Expected ballpark on RTX 3090 24GB:

| Quant + Context | Expected tokens/sec |
| --- | --- |
| IQ4_XS @ 8K | 45-65 |
| Q4_K_M @ 8K | 45-60 |
| IQ4_XS @ 32K (well into context) | 30-45 |
| IQ3_M @ 64K (well into context) | 20-35 |

For comparison:

  • Llama 3.1 8B Q4_K_M on RTX 3090: ~95 t/s (dense 8B)
  • Llama 3.1 70B Q4_K_M (split GPUs): ~10-15 t/s (dense 70B)
  • Mixtral 8×7B Q4_K_M on RTX 3090: ~50-65 t/s (similar MoE class)

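The MoE 35B (3B active) lands in roughly the same range as Mixtral 8×7B: slower than a dense 8B, but several times faster than a dense 70B split across GPUs, and much faster than a comparable dense model would be.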

Caveat on numbers: these are estimates based on architecture math and Mixtral 8×7B precedent, not measurements. Cross-reference them against 2026 community results on r/LocalLLaMA and Hugging Face discussions for your exact quant and llama.cpp version.
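Since the figures above are estimates, it is worth measuring on your own card. The sketch below reads the `eval_count` and `eval_duration` fields that Ollama returns from a non-streaming `/api/generate` call; the helper names and the prompt are illustrative assumptions, and the model tag is the one used elsewhere in this post.

```python
import json
import urllib.request


def tokens_per_second(resp: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)


def benchmark(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    """Send one non-streaming request and return generation tokens/sec."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return tokens_per_second(json.load(r))


# Usage (needs a running Ollama server and a pulled model):
# print(benchmark("qwen3.6:35b-a3b-iq4_xs", "Explain MoE routing in 200 words."))
```

Run it a few times and with a prompt long enough to fill the context you care about, since speed deep into a 32K window will differ from speed on a short prompt.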

Real Use Cases — Where 35B-A3B Actually Earns Its VRAM

Use Case 1 — Agentic Coding (SWE-bench Verified 73.4)

The published SWE-bench Verified score of 73.4 puts Qwen3.6-35B-A3B in the top tier of open coding models — competitive with much larger frontier models on real GitHub bug-fixing tasks.

For RTX 3090 + a local code agent (Aider, Continue, Cursor with local backend), this enables:

  • Fixing real bugs in your codebase without sending code to a cloud provider
  • Multi-file changes — the 262K context can hold a substantial codebase
  • Iterative tool use — model + IDE + test runner in a loop, all local

Practical setup with Ollama + Continue (VS Code):

# Ollama
ollama pull qwen3.6:35b-a3b-iq4_xs

# In VS Code with Continue extension, ~/.continue/config.json:
{
  "models": [{
    "title": "Qwen3.6-35B-A3B Local",
    "provider": "ollama",
    "model": "qwen3.6:35b-a3b-iq4_xs",
    "contextLength": 32768,
    "completionOptions": { "temperature": 0.1 }
  }]
}

For agentic coding (Aider):

export OLLAMA_API_BASE=http://127.0.0.1:11434   # tells Aider where the Ollama server is
aider --model ollama_chat/qwen3.6:35b-a3b-iq4_xs --architect

Realistic expectation: for typical 1-3 file bug fixes in a well-structured Python or TypeScript codebase, expect output quality roughly comparable to GPT-4-class models. For deep architectural refactors across many files, a 73.4 on SWE-bench Verified still means roughly one in four benchmark tasks goes unresolved, so keep tests green and commit small.

Use Case 2 — Long-Context Document Analysis (262K Native)

The 262K native context (with YaRN extending to 1M) is unusual for a model that fits on a single 24GB card. Realistic workflows:

  • Annotated code review across a full repository: paste an entire mid-sized library's source, ask architectural questions
  • Legal document analysis: contracts, regulatory filings (~150K tokens) entirely in context
  • Scientific paper synthesis: 5-10 long papers (each 15-30K tokens) compared in one session
  • Long-form RAG: instead of retrieval-augmented chunks, load the full source documents

VRAM reality for long context:

  • IQ3_M + 32K context: ~18.5 GB total (per the table above) → fits with headroom to push toward 64K
  • IQ4_XS + 32K context: ~23.5 GB → fits but tight
  • Q4_K_M + full 262K: not viable on single 24GB without context offload

For genuinely long context (>64K), drop to IQ3_M and accept the slight quality reduction in exchange for the context budget. Because long context is native to the model's training rather than bolted on afterwards, it should hold up deep into the window far better than older models that degrade past 32K. Still, verify retrieval quality at your target length before relying on it.
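Before loading a stack of documents, a quick budget check helps decide which context size (and therefore which quant) is needed. This is a rough sketch: the 4-characters-per-token ratio is a common heuristic for English text and only an assumption here, and `estimate_tokens` and `fits_in_context` are hypothetical helper names. The model's real tokenizer gives the accurate count.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the real tokenizer will differ."""
    return int(len(text) / chars_per_token)


def fits_in_context(docs, context_tokens: int, reserve_for_output: int = 2048):
    """Return (estimated prompt tokens, whether docs plus output budget fit)."""
    total = sum(estimate_tokens(d) for d in docs)
    return total, total + reserve_for_output <= context_tokens


# Three placeholder documents of 40,000 characters each (about 10K tokens apiece)
papers = ["x" * 40_000] * 3
total, ok = fits_in_context(papers, context_tokens=32_768)
print(f"~{total} prompt tokens, fits in 32K: {ok}")
```

Code and non-English text usually tokenize less efficiently than English prose, so treat the estimate as a lower bound for those inputs.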

Use Case 3 — Vision-Language Tasks (MMMU 81.7)

The native vision capabilities (MMMU 81.7, RealWorldQA 85.3) mean Qwen3.6-35B-A3B handles:

  • Document parsing: invoices, scanned PDFs, forms — extracting structured data from images
  • Chart and table reading: convert screenshots of dashboards to text
  • Code from screenshots: paste a screenshot of code from a paper or video, get text back
  • Real-world QA: "what's broken in this photo of my circuit board" type queries

This was previously the domain of separate vision models (LLaVA, Pixtral, Qwen2-VL). Having text + vision in one 35B-A3B model means a single deployment serves both — significant infrastructure simplification.

Practical setup:

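# Via Ollama API (vision-capable models accept base64-encoded images)
import base64
import requests

# Read the image and base64-encode it, as the API expects
with open('chart.png', 'rb') as f:
    img_b64 = base64.b64encode(f.read()).decode()

r = requests.post('http://localhost:11434/api/generate', json={
    'model': 'qwen3.6:35b-a3b-iq4_xs',
    'prompt': 'Extract the quarterly revenue figures from this chart as JSON',
    'images': [img_b64],
    'stream': False,  # return one JSON object instead of a stream of chunks
})
r.raise_for_status()
print(r.json()['response'])  # the model's text answer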

Use Case 4 — Tool Calling / Agentic Workflows

The model is trained for tool use, which combined with the 262K context enables full agentic deployments:

  • MCP server backends: local model serving Model Context Protocol clients
  • Function calling: structured output for API calls, database queries
  • Multi-step task execution: research → analysis → report generation loops

Example with OpenAI-compatible function calling (Ollama exposes this via OpenAI-format endpoint):

from openai import OpenAI
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

tools = [{
    "type": "function",
    "function": {
        "name": "search_codebase",
        "description": "Search a local codebase for files matching a pattern",
        "parameters": {
            "type": "object",
            "properties": {"pattern": {"type": "string"}},
            "required": ["pattern"],
        },
    },
}]

response = client.chat.completions.create(
    model='qwen3.6:35b-a3b-iq4_xs',
    messages=[{"role": "user", "content": "Find all auth-related files in my project"}],
    tools=tools,
)
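The request above only asks the model which tool to call; something still has to execute it. A minimal sketch of that local side follows, assuming the OpenAI-style response shape where each tool call carries a function name and a JSON string of arguments. The `search_codebase` implementation (file-name substring match), the extra `root` argument, and the `run_tool_call` helper are all illustrative assumptions.

```python
import json
from pathlib import Path


def search_codebase(pattern: str, root: str = ".") -> list[str]:
    """Return files under `root` whose name contains `pattern` (case-insensitive)."""
    needle = pattern.lower()
    return sorted(
        str(p) for p in Path(root).rglob("*")
        if p.is_file() and needle in p.name.lower()
    )


# Registry mapping tool names from the schema to local Python functions
TOOLS = {"search_codebase": search_codebase}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one tool call and return a JSON string for the tool message."""
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = json.loads(arguments_json)
    return json.dumps({"result": TOOLS[name](**args)})


# Usage with the response from the request above:
# for call in response.choices[0].message.tool_calls or []:
#     content = run_tool_call(call.function.name, call.function.arguments)
#     messages.append({"role": "tool", "tool_call_id": call.id, "content": content})
```

Returning errors as JSON instead of raising lets the model see the failure and retry with different arguments, which tends to matter in multi-step loops.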

Use Case 5 — Math and Reasoning (AIME 2026 92.7)

The AIME 2026 score of 92.7 is unusually high for an open model. For reasoning-heavy use cases:

  • Code review for logical bugs (vs syntax bugs)
  • Math/stats problem solving in research workflows
  • Multi-step deductive tasks (legal reasoning, scientific hypothesis chains)
  • Verification of LLM-generated code or proofs

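The MoE router selects a different subset of experts for each token, which gives the model far more total capacity than its per-token compute suggests. That may be part of why it does well on complex multi-step problems, though the published scores don't isolate the cause.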

When NOT to Use Qwen3.6-35B-A3B on RTX 3090

Honest counter-cases:

Pure chat / simple Q&A

For interactive chat at maximum speed, Llama 3.1 8B Q4_K_M (95 t/s) beats Qwen3.6-35B-A3B (~50 t/s) in raw throughput. The 35B-A3B quality advantage only shows on harder tasks. If your interactive load is mostly "summarize this email," 8B is the right pick.

Very tight VRAM (running other workloads concurrently)

Loading 35B-A3B at IQ4_XS uses ~18 GB. If you're also running Stable Diffusion (4-8 GB) or Jupyter PyTorch sessions (4-12 GB) on the same GPU, you'll OOM. Either reserve the 3090 for the LLM or use a smaller model.
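A quick way to check whether other workloads have already eaten the budget is to query `nvidia-smi` before loading the model. The sketch below uses its CSV query output; the helper names are assumptions, and note that `nvidia-smi` reports MiB, so the result is in GiB and only approximately comparable to the GB figures in this post.

```python
import subprocess


def parse_memory_csv(output: str) -> list[tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one line per GPU."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(x) for x in line.split(","))
        rows.append((used, total))
    return rows


def free_vram_gib(gpu_index: int = 0) -> float:
    """Free memory on one GPU in GiB, as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    used, total = parse_memory_csv(out)[gpu_index]
    return (total - used) / 1024


# Usage (requires an NVIDIA driver install):
# if free_vram_gib() < 18:
#     print("Not enough free VRAM for IQ4_XS; stop other GPU jobs first.")
```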

Real-time low-latency (sub-second first-token)

MoE models have a slight first-token latency cost over dense (~100-300ms more). For latency-sensitive applications (interactive chat with immediate response perception), an 8B dense beats 35B-A3B.

Need >32K context with quality intact

IQ3_M extends the context budget, but at a quality cost. If you genuinely need 64K+ tokens at full Q4-level quality, you've outgrown a single 3090: consider a dual-GPU split (see llama.cpp split-mode guide) or upgrade to a 32GB+ card (RTX 5090, A6000).

Setup Walkthrough (Ollama)

# Install / update Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull the IQ4_XS quant (best fit for 24GB with context headroom)
ollama pull qwen3.6:35b-a3b-iq4_xs

# Or Q4_K_M for higher quality with less context headroom
ollama pull qwen3.6:35b-a3b-q4_k_m

# Verify
ollama list
ollama ps   # Will show after first request

# Create a variant with a 32K context window (a Modelfile needs a FROM line)
printf 'FROM qwen3.6:35b-a3b-iq4_xs\nPARAMETER num_ctx 32768\n' > Modelfile-qwen36
ollama create qwen3.6-32k -f Modelfile-qwen36

# Run with extended context
ollama run qwen3.6-32k
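If you call the model through the API instead of `ollama run`, the context length can also be set per request through the `options` field, without creating a separate model variant. A minimal sketch, where `generate_payload` is a hypothetical helper that only builds the request body:

```python
import json


def generate_payload(model: str, prompt: str, num_ctx: int = 32768) -> bytes:
    """Build a non-streaming /api/generate body with an explicit context size."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request context window
    }).encode()


body = generate_payload("qwen3.6:35b-a3b-iq4_xs", "Summarize this file.", num_ctx=16384)
print(body.decode())

# POST `body` to http://localhost:11434/api/generate with
# Content-Type: application/json (for example via urllib.request or requests).
```

Changing `num_ctx` between requests can make Ollama reload the model, so for a steady workload the dedicated variant above is usually the calmer choice.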

Recommended environment variables:

# In /etc/systemd/system/ollama.service.d/override.conf
# (systemd does not support trailing comments, so comments get their own line)
[Service]
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_FLASH_ATTENTION=1"
# Quantized KV cache (requires flash attention); roughly halves KV memory vs f16
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"

The q8_0 KV cache option stores roughly 8.5 bits per value instead of 16, so it cuts KV cache memory roughly in half at minimal quality cost. The absolute saving grows with context length, so it matters most for 35B-A3B at 32K-64K context.
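The effect of KV cache quantization can be sketched with the standard cache-size formula for a transformer with grouped-query attention. The architecture numbers below are HYPOTHETICAL placeholders, not Qwen3.6's published configuration, so the absolute gigabytes are illustrative only; the ratio between f16 and q8_0 is the point. The sketch also assumes a conventional attention cache in every layer.

```python
def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bits_per_value: float) -> float:
    """KV cache size in GB: keys and values for every layer, head, and token."""
    values = 2 * n_layers * n_kv_heads * head_dim * context_tokens  # 2 = K and V
    return values * bits_per_value / 8 / 1e9


# HYPOTHETICAL architecture values, for illustration only
ARCH = dict(n_layers=48, n_kv_heads=4, head_dim=128)

f16 = kv_cache_gb(65_536, bits_per_value=16, **ARCH)
q8 = kv_cache_gb(65_536, bits_per_value=8.5, **ARCH)  # q8_0: 8 bits plus block scales
print(f"f16 KV at 64K: {f16:.2f} GB, q8_0 KV at 64K: {q8:.2f} GB")
print(f"q8_0 / f16 ratio: {q8 / f16:.2f}")
```

To get real numbers, replace the placeholders with the layer count, KV head count, and head dimension from the model's config file.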

For OLLAMA_KEEP_ALIVE nuances see Ollama OLLAMA_KEEP_ALIVE — How Model Memory Persistence Actually Works.

Comparison — Qwen3.6-35B-A3B vs Other 24GB-Class Options

| Model | Params | Active | RTX 3090 quant | Quality tier (per benchmarks) |
| --- | --- | --- | --- | --- |
| Qwen3.6-35B-A3B | 35B total | 3B | IQ4_XS / Q4_K_M | Frontier-competitive on coding + math + vision |
| Qwen3-30B-A3B (older) | 30B total | 3B | Q4_K_M | Strong but no vision, smaller context |
| Llama 3.1 70B | 70B dense | 70B | IQ3_M or split | Slower, broader knowledge |
| Mixtral 8×7B | 47B total | 13B | Q4_K_M | Older (2023), no vision |
| Qwen 3 14B | 14B dense | 14B | Q5_K_M | Solid, faster than 35B-A3B |
| Phi-4 14B | 14B dense | 14B | Q5_K_M | Strong reasoning per param |
| Llama 3.1 8B | 8B dense | 8B | Q8_0 | Fastest, less capable on hard tasks |

The 35B-A3B's specific edge in 2026: best multimodal + coding model that fits on consumer 24GB. Mixtral 8×7B was the previous best-fit MoE but is older and text-only. Qwen3-30B-A3B was the immediate predecessor without vision.

Practical Recommendations

For a single RTX 3090 24GB user in 2026:

  • Default model for capable work: Qwen3.6-35B-A3B IQ4_XS, 16-32K context
  • Fast chat / quick queries: Llama 3.1 8B Q4_K_M
  • Long-context document analysis: Qwen3.6-35B-A3B IQ3_M, up to 64K
  • Pure coding agent: Qwen3.6-35B-A3B IQ4_XS (SWE-bench 73.4 score speaks for itself)
  • Vision tasks: Qwen3.6-35B-A3B (no need for separate VLM)

If you have two GTX 1080 Ti cards instead, see the GTX 1080 Ti and dual-GPU guides for similar models that fit in 22 GB combined.

FAQ

Q: Why does the 3B active parameter make speed so much better than 35B dense?

Each token's forward pass through an MoE layer only runs the active experts: 8 routed experts chosen from the 256, plus 1 shared expert. Together with the attention and other always-on layers, that adds up to roughly 3B parameters of compute. A dense 35B model would run all 35B parameters for every token. Compute per token scales with active parameters, not total. VRAM usage scales with total parameters, because the router can pick any expert for any token, so all of them must stay loaded.
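A toy version of top-k routing makes the idea concrete. This is a simplified sketch with random scores standing in for a learned router, not Qwen's actual implementation; real models differ in details such as whether the softmax is applied before or after the top-k selection. The `route` helper is a hypothetical name.

```python
import math
import random


def route(router_logits, k: int = 8):
    """Pick the top-k experts and softmax-normalize their scores."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]  # stable softmax
    total = sum(exps)
    return top, [e / total for e in exps]


random.seed(0)
logits = [random.gauss(0, 1) for _ in range(256)]  # one score per routed expert
experts, weights = route(logits)

print(f"{len(experts)} of {len(logits)} routed experts run for this token")
print("mixing weights sum to", round(sum(weights), 6))
# The shared expert runs for every token in addition to the routed ones,
# and the remaining experts are skipped entirely for this token.
```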

Q: Will my RTX 3090 thermal-throttle running this?

Sustained inference on RTX 3090 hits ~70-80°C with stock cooling. Throttling typically starts at 83°C. With adequate case airflow, this should be fine for hours of continuous use. Mining-recovered 3090s may run hotter due to thermal pad degradation — re-pad if temps spike.

Q: Does this require driver/CUDA upgrade?

llama.cpp / Ollama work with CUDA 11.8+ on the 3090. Most 2026 Linux distros ship 12.x by default which is more than adequate. No special drivers beyond standard NVIDIA proprietary.

Q: Can I run two requests in parallel?

With OLLAMA_NUM_PARALLEL=2, yes — but each parallel inference uses additional KV cache memory. At 32K context per stream + IQ4_XS weights, you'll OOM. Stick with 1 parallel on 24GB; use a server with 48GB+ for concurrent serving.

Q: How does this compare to running Claude 3.7 Sonnet via API?

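Subjectively, for many tasks the quality is similar. Cost math: the Claude API costs $3-15/M tokens times your usage; running Qwen3.6 locally costs electricity (about $0.50/day, or roughly $15/month, if always on) with zero API cost. Ignoring the hardware purchase, break-even is around 1-5M tokens/month depending on the API rate ($15 ÷ $15/M at the high end of pricing, $15 ÷ $3/M at the low end). Privacy and data residency are local-only advantages.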
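The break-even point is the monthly local cost divided by the API price per million tokens. A small sketch using the electricity and API figures quoted in this answer; it assumes a 30-day month, ignores the hardware purchase price, and `break_even_millions` is a hypothetical helper name.

```python
def break_even_millions(local_cost_per_month: float,
                        api_price_per_million: float) -> float:
    """Monthly token volume (in millions) where API spend equals local cost."""
    return local_cost_per_month / api_price_per_million


ELECTRICITY_PER_DAY = 0.50                  # USD, always-on figure quoted above
monthly_cost = ELECTRICITY_PER_DAY * 30     # assumes a 30-day month

for price in (3.0, 15.0):                   # USD per million tokens
    volume = break_even_millions(monthly_cost, price)
    print(f"At ${price:.0f}/M tokens, break-even is {volume:.0f}M tokens/month")
```

Adding the card's purchase price amortized over its expected life to `monthly_cost` pushes the break-even volume up accordingly.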

Q: What about training/fine-tuning on RTX 3090?

Full fine-tuning of 35B-A3B requires multi-GPU + 100GB+ aggregate VRAM. QLoRA fine-tuning of select experts is feasible on 24GB but rare — usually you'd fine-tune a dense smaller model (8B) instead.

Q: Is the 1M-token YaRN extended context useful?

Theoretically yes, practically limited by VRAM. To run inference at 500K-1M context on a single 3090, you'd need IQ2 quantization which has measurable quality loss. Reserved for special use cases; standard "long context" is 32-64K.

Q: Why Qwen specifically over equivalent Llama 4 or Gemma 3 MoE?

As of mid-2026, Qwen3.6-35B-A3B's combination of license (Apache 2.0), multimodal capability, 262K context, and benchmark scores (SWE-bench 73.4, AIME 92.7) puts it ahead of other open MoE models of similar size. Llama 4 has different size points; Gemma 3 doesn't have a strict MoE in this class.

Closing — The Single Reason This Matters

Qwen3.6-35B-A3B is the first 30B+ class model that:

  1. Has frontier-competitive benchmarks (SWE-bench 73, AIME 92, MMMU 81)
  2. Fits on a consumer single RTX 3090 24GB (via IQ4_XS or Q4_K_M)
  3. Runs at MoE-class speed (~50 t/s, not 10 t/s)
  4. Is commercially licensable (Apache 2.0)
  5. Handles vision + text + tools in one model

For a hobbyist with one RTX 3090, this collapses what used to require either a 70B dense model on multi-GPU (much slower) or a cloud API (cost + privacy tradeoff). For solo developers and small teams running production agents on owned hardware, this is the model to standardize on in 2026.

