What Quantization Actually Does (And Why I Was Afraid of It for Too Long)
Quantization is the most misunderstood concept for anyone running local AI. I spent my first twenty-plus model downloads avoiding it because I thought my PC would blow up. Here's what quantization actually does, what it doesn't do, and how to stop leaving model capability on the table.
Through my first twenty-or-so local model downloads, I stuck to Q8 and refused to touch Q4.
Not because I’d tested and decided Q8 was better. Because I was convinced something terrible would happen if I loaded a “lower precision” model. In my head, somewhere between the forum posts warning about “severe quality loss” and the GitHub issues with cryptic GPU error messages, I’d quietly built a model of quantization where Q4 meant “my PC might catch fire.”
None of that was true. The fear was doing more harm than any quant level would have. I was running oversized quants on undersized models, leaving real capability on the table for no reason. Here’s what I wish someone had told me on day one.
What quantization actually is
Every parameter in a language model is a number. A 7B model has about 7 billion of them. Those numbers are stored at some level of numerical precision.
Full precision is 32-bit floating point (FP32). Each weight takes 4 bytes. A 7B model in FP32 is about 28 GB on disk.
Most open-weight models you download are already half-precision: 16-bit floating point (FP16 or BF16). Each weight takes 2 bytes. A 7B model in FP16 is about 14 GB.
Quantization drops the precision further. Instead of 16 bits per weight, you use 8 bits (Q8), 6 bits (Q6), 5 bits (Q5), 4 bits (Q4), or lower. Fewer bits per weight means:
- The file is smaller on disk.
- The model uses less VRAM at runtime.
- Inference runs faster, because there’s less data to move around.
That’s the whole trick. Quantization is “storing the model at lower precision.” It is not editing the model. It is not retraining the model. It is not changing what the model “knows.” It’s the same weights, rounded off more aggressively.
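If you want to sanity-check those file sizes, the arithmetic fits in a few lines. A minimal sketch; the fractional bits-per-weight figures are rough values I’m assuming, since quantized formats carry a little per-group scale metadata on top of the raw weights (which is why Q8_0 works out to about 8.5 effective bits):

```python
# Back-of-envelope file size for a 7B model at different precisions.
# The fractional bits-per-weight values are approximations that include
# per-group scale metadata; real GGUF files land within a few percent.

PARAMS = 7e9  # a 7B model

def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    # parameters * bits -> bytes -> gigabytes
    return n_params * bits_per_weight / 8 / 1e9

for label, bits in [("FP32", 32.0), ("FP16", 16.0), ("Q8_0", 8.5),
                    ("Q6_K", 6.6), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8)]:
    print(f"{label:7s} ~{approx_size_gb(PARAMS, bits):4.1f} GB")
# FP32 ~28.0 GB, FP16 ~14.0 GB, Q8_0 ~7.4 GB, Q4_K_M ~4.2 GB
```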
What the Qs and letters mean
Open any GGUF model repository and you’ll see filenames like Qwen3-14B-Q4_K_M.gguf or Mistral-Large-Instruct-Q5_K_S.gguf. Decoding this looks intimidating. It isn’t.
Qn is the nominal number of bits per weight. Q8 = 8 bits. Q4 = 4 bits. Lower number = more compression = smaller file = more quality loss.
K means the file uses the “k-quants” format, a smarter approach than naive rounding. K-quants split weights into groups and assign different precisions to different groups based on how much each group actually needs. The result is that a Q4_K file performs much closer to full precision than a simple 4-bit round-off would.
S, M, L are size variants within the same bit-width: S (small), M (medium), L (large). They trade a bit of file size for a bit of quality. Q4_K_M is the most common sweet spot. Q4_K_S is slightly smaller, slightly lower quality, and some community repos offer a Q4_K_L that’s slightly larger and slightly higher quality.
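If it helps to see the naming convention mechanically, here’s a toy parser. The helper is hypothetical and purely illustrative; real repos sometimes add extra suffixes (imatrix tags, shard numbers) that this regex ignores:

```python
import re

# Hypothetical helper: pull the quant tag out of a GGUF filename
# and describe it in plain words.
QUANT_RE = re.compile(r"(Q(\d)_(K|0|1)(?:_([SML]))?)", re.IGNORECASE)

SIZE_VARIANT = {"S": "small", "M": "medium", "L": "large"}

def describe_quant(filename: str) -> str:
    m = QUANT_RE.search(filename)
    if not m:
        return "no quant tag found (possibly FP16/BF16)"
    tag, bits, scheme, variant = m.groups()
    parts = [f"{bits} bits per weight"]
    if scheme.upper() == "K":
        parts.append("k-quants")
    if variant:
        parts.append(f"{SIZE_VARIANT[variant.upper()]} variant")
    return f"{tag}: " + ", ".join(parts)

print(describe_quant("Qwen3-14B-Q4_K_M.gguf"))
# -> Q4_K_M: 4 bits per weight, k-quants, medium variant
```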
For most people running local models, Q4_K_M is the default worth starting at. Going up to Q5_K_M or Q6_K costs VRAM and gains a small amount of quality. Going up to Q8_0 is close to lossless but uses roughly twice the VRAM of Q4.
The fears, dismantled
Here are the four fears I had, and what’s actually true about each.
“Quantization will damage my hardware.”
No. Running a quantized model is less demanding on your hardware than running the same model at full precision, because there’s less data moving around. The GPU does the same kind of math it was going to do anyway; the numbers are just smaller. Nothing strains harder. Nothing runs hotter. Nothing gets damaged. If your PC can run the full-precision version, it can definitely run the quantized version, and usually faster.
“Quantization permanently alters the model.”
No. The quantized file is a separate file. The original full-precision weights still exist on whatever server they were uploaded from. You can download Q4, Q5, Q8 versions of the same model, use them side by side, compare, delete the ones you don’t like. Nothing you do to a quantized GGUF affects the source model.
“Quantization turns the model into a drooling mess.”
Mostly no. Below Q3, quality drops noticeably. Between Q4 and Q8, most users cannot tell the difference in blind tests on most tasks. K-quants are particularly well-calibrated. Q4_K_M output is not a dumbed-down version of Q8_0 output; it’s close enough that the quality-per-VRAM ratio usually favors the smaller file.
The real exception: tasks that lean on precise numerical reasoning, long-horizon consistency, or rare token patterns, where the small perplexity increase from Q4 shows up as visible errors. Most people running local models for chat, code, or writing will not hit this.
“Quantization is an advanced technique.”
It isn’t. You don’t quantize models yourself. Someone else already did it and uploaded the .gguf file. You download the quant level that fits your VRAM and run it. That’s the whole workflow. It takes as much effort as picking which resolution to stream.
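Concretely, the whole workflow is a few lines. A minimal sketch assuming llama-cpp-python and huggingface_hub are installed (`pip install llama-cpp-python huggingface_hub`); the repo and file names here are examples, so substitute whatever model and quant level fits your VRAM:

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download one quant level of one model. Repo/file names are examples.
path = hf_hub_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",
    filename="qwen2.5-7b-instruct-q4_k_m.gguf",
)

# n_gpu_layers=-1 offloads every layer to the GPU if it fits.
llm = Llama(model_path=path, n_gpu_layers=-1)
out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```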
One note for 2026: many GGUF releases now include imatrix (importance-matrix) variants, which use calibration data to allocate precision more carefully than standard k-quants. They’re slightly better at the same bit-width. When imatrix variants are available, prefer them; when they aren’t, standard Q4_K_M or Q5_K_M is still completely fine.
The real tradeoff quantization puts in front of you
The real decision isn’t “Q4 or Q8.” It’s: at a given VRAM budget, would I rather have a bigger model at Q4, or a smaller model at Q8?
In almost every case I’ve tested, a bigger model at Q4 beats a smaller model at Q8 on real tasks. A 14B model at Q4_K_M uses roughly the same VRAM as a 7B at Q8_0, and the 14B is meaningfully smarter. A 70B at Q4_K_M, if you have the hardware, beats a 30B at Q8 on almost everything that matters.
This inversion is the single most important practical fact about quantization, and it’s the opposite of what my fear was telling me. My fear said “lower quant = worse.” The reality is “lower quant at higher parameter count > higher quant at lower parameter count, almost always.”
The reason: model capability scales more strongly with parameter count than it degrades with quantization. Dropping from FP16 to Q4 costs you maybe 2-3% on most benchmarks. Going from 7B to 14B at the same quant gains you 20-30% on the same benchmarks. The math is not even close.
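Here’s that comparison as arithmetic, using the same rough effective-bits figures as before. Weights only; KV cache and runtime overhead add a couple of GB on top, so treat these as floors:

```python
# Rough VRAM floor for the "bigger model at Q4 vs smaller at Q8" decision.
def weights_gb(n_params_billions: float, bits_per_weight: float) -> float:
    return n_params_billions * bits_per_weight / 8  # billions * bits/8 = GB

print(f"14B @ Q4_K_M: ~{weights_gb(14, 4.8):.1f} GB")  # ~8.4 GB
print(f" 7B @ Q8_0:   ~{weights_gb(7, 8.5):.1f} GB")   # ~7.4 GB
```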
Practical recommendations by hardware tier
If you’re starting out, the rule is: pick the biggest model you can fit at Q4_K_M. Only go higher-quant if you have leftover VRAM after that. Only go lower-quant if you cannot fit the model you actually want any other way.
For specific hardware tiers in 2026:
8 GB VRAM (consumer laptops, older gaming GPUs): 7B to 8B models at Q4_K_M. 7B-class models from Qwen, Llama, Mistral, and DeepSeek all work well at this tier.
12-16 GB VRAM (RTX 3080/4070, RTX 5080, M1 Pro/Max): 13B to 14B models at Q4_K_M, or 7B models at Q8_0 if you prefer the quality headroom. The 14B at Q4 is usually the smarter choice.
24 GB VRAM (RTX 3090/4090): 30B to 34B class models at Q4_K_M, or 14B at Q8_0. The 30B class at Q4 is the sweet spot for serious home use.
48+ GB VRAM (dual-GPU rigs, A6000, RTX 6000 Ada): 70B models at Q4_K_M, or smaller models at higher quality. A 70B at Q4 on this hardware gives you something that rivals cloud models on many tasks.
Apple Silicon with unified memory: Same math, but unified memory is slower than dedicated GPU VRAM, so inference is slower. The quant-level decisions are identical.
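The tiers above all come from the same rule of thumb: weights at roughly 4.8 effective bits per weight, plus headroom for everything else. A sketch, where the flat 2 GB overhead allowance is my assumption and long contexts need more:

```python
# Rule-of-thumb fit check behind the hardware tiers above.
def fits(n_params_billions: float, vram_gb: float,
         bits_per_weight: float = 4.8, overhead_gb: float = 2.0) -> bool:
    weights_gb = n_params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb <= vram_gb

for vram in (8, 16, 24, 48):
    biggest = max((b for b in (7, 8, 13, 14, 30, 34, 70) if fits(b, vram)),
                  default=None)
    print(f"{vram} GB VRAM -> ~{biggest}B at Q4_K_M")
# 8 GB -> ~8B, 16 GB -> ~14B, 24 GB -> ~34B, 48 GB -> ~70B
```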
A note: MoE architectures change this math. A 30B-A3B mixture-of-experts model needs VRAM for the full 30B of weights but only activates about 3B parameters per token. If you’re running a MoE, read “I Thought I Was Using a 30B Model. I’d Been Using a 3B One for Ten Months” before you tune the quant.
What to do if you’ve been running oversized quants
If you’re already running local models but only ever at Q8 or FP16, try this experiment. Download the Q4_K_M of a model one or two tiers bigger than what you’ve been using. Run your typical workload through both. Compare.
In most cases, the bigger-model-at-Q4 wins and you’ve been leaving capability on the table. Occasionally the smaller-model-at-Q8 wins on a specific task you care about, which tells you something real about your workload rather than something general about quantization.
Either way, you’ll know. That’s worth more than any benchmark table.
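A minimal harness for that experiment, assuming llama-cpp-python and both GGUF files already on disk. The paths and prompts are placeholders; swap in your own typical workload:

```python
from llama_cpp import Llama

# Placeholder paths: your existing quant vs. a bigger model at Q4_K_M.
models = {
    "7B @ Q8_0":    "models/7b-instruct-q8_0.gguf",
    "14B @ Q4_K_M": "models/14b-instruct-q4_k_m.gguf",
}
prompts = [
    "Summarize this bug report: ...",
    "Refactor this function for readability: ...",
]

for label, path in models.items():
    llm = Llama(model_path=path, n_gpu_layers=-1, verbose=False)
    print(f"\n=== {label} ===")
    for p in prompts:
        out = llm(p, max_tokens=200)
        print(f"\n> {p}\n{out['choices'][0]['text']}")
    del llm  # free VRAM before loading the next model
```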
The closing call
Quantization is not a risk. It’s a knob. The right setting depends on your hardware and your workload, not on a vague fear about what “lowering precision” might do.
Download a Q4_K_M. Run it. Watch nothing catch fire. Download a bigger one. Run it too. The lesson is in the comparison, not in the articles about quantization, including this one.