Most advice about running local language models is either a vague pep talk or a list of settings copied from someone else's machine. Neither helps much, because the right choices depend on understanding why they are right. This article takes the opposite approach: a set of opinionated defaults, each paired with the reasoning, so you can tell when the default applies and when your situation calls for something else.
These are practices that hold up after the novelty wears off — the ones that keep a local setup fast, predictable, and trustworthy months in, rather than just impressive on day one. They come from the recurring tension every local setup has to resolve: speed versus quality, capability versus hardware, convenience versus discipline. Good defaults pick a sensible point on each of those dials and explain why.
Where a practice is genuinely a matter of taste, the article says so. Where it is not, it makes the case.
Size Models to Your Hardware, Not Your Ambition
The foundational practice is matching the model to the machine, and it is the one people most often get backward.
Let Memory Set the Ceiling
Your available memory is a hard ceiling. A model that does not fit either fails to load or spills to disk and crawls. The correct default is to choose the largest model that fits comfortably with headroom, not the largest that technically loads. Headroom matters because the working memory a model needs grows with the length of your conversations.
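The "fits comfortably with headroom" rule can be written down as a quick check. This is a minimal sketch, not a measurement tool: the function name, the 1 GB context allowance, and the 20% headroom figure are illustrative assumptions you should replace with numbers from your own machine.

```python
def fits_with_headroom(model_file_gb, available_gb,
                       context_overhead_gb=1.0, headroom_fraction=0.2):
    """Return True if the model plus its context fits while leaving headroom.

    context_overhead_gb and headroom_fraction are assumed defaults,
    not values taken from any particular runner.
    """
    required_gb = model_file_gb + context_overhead_gb
    # Only part of the memory is treated as usable; the rest is headroom.
    budget_gb = available_gb * (1.0 - headroom_fraction)
    return required_gb <= budget_gb


if __name__ == "__main__":
    print(fits_with_headroom(4.5, 8.0))  # 5.5 GB needed against a 6.4 GB budget
    print(fits_with_headroom(7.5, 8.0))  # technically loads, but no headroom
```

The second call is the case the text warns about: the file is smaller than the memory, yet nothing is left for the conversation to grow into.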
Keep a Small Model on Hand Too
Even with a capable machine, keep a small fast model installed alongside your main one. Most quick questions do not need your largest model, and the small one answers instantly. Reaching for the right size per task beats forcing everything through one heavyweight, a habit reinforced in the sequential setup walkthrough.
Default to a Middle Quantization, Then Move Deliberately
Quantization is the dial that most affects the speed-quality balance, and the default should be a considered middle.
Why the Middle Is the Right Start
The most compressed versions run fast but lose noticeable quality; the least compressed preserve quality but strain hardware. A middle level captures most of the quality at a fraction of the memory and compute cost, which is why it is the sensible default. Start there and you are rarely far wrong.
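The cost side of that trade-off is mostly arithmetic: weight storage scales with the number of parameters times the bits used per weight. The sketch below assumes a hypothetical 7-billion-parameter model and round bit widths. Real quantized files mix precisions and carry metadata, so treat the results as rough estimates only.

```python
def approx_weight_gb(n_params_billion, bits_per_weight):
    """Rough size of the weights alone, in gigabytes (10**9 bytes).

    bits / 8 gives bytes per weight; billions of parameters times
    bytes per weight gives gigabytes directly.
    """
    return n_params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # Hypothetical 7B model at three illustrative precisions.
    for bits in (16, 8, 4):
        print(f"{bits}-bit: about {approx_weight_gb(7, bits):.1f} GB")
```

Under these assumptions the same model shrinks from about 14 GB at 16 bits to about 3.5 GB at 4 bits, which is why the quantization level often decides whether a model fits at all.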
Adjust With Evidence, Not Guesswork
Only move off the middle when you have a reason: answers feel weak, so you step up; the model is too slow, so you step down. Changing quantization on a hunch leads to thrashing. Tuning against actual observed behavior is what keeps the process from becoming superstition, and it sidesteps the quantization mistakes that trip people up.
Verify Privacy, Do Not Assume It
If privacy is why you went local, treat it as something to confirm, not take on faith.
Prove the Model Is Offline
The default discipline is to disconnect the machine from the network (unplug the cable and turn off Wi-Fi) and confirm the model still answers before trusting it with anything sensitive. A polished tool can look local while quietly calling a cloud service. One offline test settles it, and it is worth repeating after any tool update.
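The offline test has two halves: confirm the network really is down, then confirm the model still answers. A minimal sketch of that logic follows. The helper names are hypothetical, and `local_model_answered` is something you observe yourself by sending a real prompt through your tool. The script does not talk to any particular runner.

```python
import socket


def can_reach(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def offline_verdict(internet_reachable, local_model_answered):
    """Combine the two observations into a plain-language result."""
    if internet_reachable:
        return "inconclusive: the network is still up"
    if local_model_answered:
        return "local: the model answered with no network"
    return "not local: the model failed once the network was removed"


# Example use (the outside host is an assumption; any public address works):
#   internet_up = can_reach("1.1.1.1", 53)
#   print(offline_verdict(internet_up, local_model_answered=True))
```

Keeping the "inconclusive" branch matters: if the outside world is still reachable, a working answer proves nothing about where it came from.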
Keep Sensitive Work on Verified-Local Models
Once you have a model you have proven runs locally, route confidential work only through it. Mixing sensitive prompts into a tool you have not verified defeats the purpose. This separation is the practical core of the privacy argument that motivates the broader case for local.
Use the Right Hardware Path
Getting the model onto the right processor is a practice, not an accident.
Confirm the Graphics Card Is in Play
If you have a capable graphics card, the default is to ensure the runner uses it. Many setups silently run on the processor and deliver a fraction of the possible speed. Verifying hardware utilization is a one-time check that pays off in every session afterward.
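One way to run that check on an NVIDIA card is to read `nvidia-smi` while the model is generating. The sketch below assumes an NVIDIA GPU with `nvidia-smi` on the path. Other vendors need a different tool, and the 1024 MiB threshold for "idle" is an arbitrary illustrative value.

```python
import subprocess

# Ask nvidia-smi for utilization and memory use as bare CSV numbers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse lines such as '87, 6123' into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        stats.append({"util_percent": int(util), "memory_used_mib": int(mem)})
    return stats


def gpu_looks_idle(stats, min_memory_mib=1024):
    """True if no GPU holds at least min_memory_mib (assumed threshold)."""
    return all(gpu["memory_used_mib"] < min_memory_mib for gpu in stats)


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed stats."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Call `read_gpu_stats()` while a response is being generated. If `gpu_looks_idle` returns True at that moment, the model is probably running on the processor instead of the card.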
Match the Path to the Machine
On a machine without strong graphics, lean on smaller models that the processor handles gracefully rather than fighting a large model that will never run well. Matching the model to the available hardware path is more productive than wishing for hardware you do not have.
Watch the Working Memory, Not Just the File Size
A model's file size is only part of the memory story. As a conversation grows longer, the model needs more working memory to hold the context, and a setup that fit a short exchange can spill once the conversation runs long. The practical default is to leave generous headroom above the file size so long sessions do not suddenly bog down. If you routinely hold lengthy conversations, size down a notch to keep that headroom intact.
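For a standard transformer, most of that growing working memory is the key-value cache, which scales linearly with context length. The sketch below uses the usual estimate: two tensors (keys and values) per layer, per token. The model shape in the example is hypothetical, and 2 bytes per value assumes a 16-bit cache. Check your own model's configuration for the real numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens,
                   bytes_per_value=2):
    """Estimate key-value cache size for a given context length.

    The leading 2 counts one key tensor plus one value tensor per layer.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_tokens * bytes_per_value)


if __name__ == "__main__":
    # Hypothetical shape: 32 layers, 8 key-value heads, head dimension 128.
    for tokens in (8192, 32768):
        gib = kv_cache_bytes(32, 8, 128, tokens) / 2**30
        print(f"{tokens} tokens: about {gib:.1f} GiB of cache")
```

With this assumed shape the cache takes 1 GiB at 8,192 tokens and 4 GiB at 32,768 tokens, on top of the weights. That is the growth that makes a setup spill once a conversation runs long.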
Treat Output as Draft, Always
The most important discipline has nothing to do with configuration.
Local Does Not Mean Accurate
Running privately and locally creates a false sense of authority. A local model invents confident falsehoods exactly like any other. The default practice is to verify anything consequential, every time, regardless of how trustworthy the private setup feels.
Build Verification Into the Habit
Make checking facts part of how you use the model, not an afterthought. The convenience of a local model makes it tempting to lean on it, which is precisely why the verification habit matters. This discipline shows up across real examples of where local helps and where it misleads.
Maintain the Setup Like It Matters
A local setup is infrastructure, and infrastructure decays without upkeep.
Prune Model Files Regularly
Model files are large and accumulate as you experiment. The default is periodic housekeeping — remove what you do not use, keep the few you run. This prevents the disk from filling unexpectedly and keeps the machine responsive.
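A housekeeping pass can start with a report of what is large and old. The sketch below only lists candidates and never deletes anything. The directory you pass in and the 60-day cutoff are assumptions. It uses modification time as a rough proxy for "unused", because access times are often not recorded on modern filesystems.

```python
import time
from pathlib import Path


def stale_model_files(model_dir, max_idle_days=60, now=None):
    """Return (path, size_bytes) for files older than the cutoff, largest first.

    Age is judged by modification time, which is only a proxy for last use.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400
    stale = []
    for path in Path(model_dir).rglob("*"):
        if path.is_file():
            info = path.stat()
            if info.st_mtime < cutoff:
                stale.append((path, info.st_size))
    return sorted(stale, key=lambda item: item[1], reverse=True)
```

Review the list by hand before removing anything: a model you downloaded long ago may still be the one you run every day.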
Revisit Your Model Choices as Things Improve
Open-weight models improve steadily, and a model that was your best choice months ago may be outclassed by a newer one that fits the same hardware. Revisiting your defaults on a regular cadence keeps your setup current without constant churn, and the same maintenance mindset keeps any forward-looking view of these tools from going stale.
Write Down What Works
The final practice is the least glamorous and the most durable: record the configuration that works for you. Which model, which quantization level, which hardware settings, and how you verified the offline guarantee. A written record turns a lucky configuration into a repeatable one, lets you rebuild after a machine change without rediscovering everything, and makes the setup something a colleague can adopt rather than something only you can run. Undocumented setups quietly decay; documented ones endure.
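The record does not need to be elaborate; a small file kept next to your models is enough. The sketch below stores the four items named above as JSON. The field names and the class are hypothetical choices for illustration, not a format any runner expects.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SetupRecord:
    model: str                 # which model you run
    quantization: str          # which quantization level
    hardware_settings: dict    # runner options that affect hardware use
    offline_verified_on: str   # date of the last offline test
    offline_check_method: str  # how the offline guarantee was verified


def save_record(record, path):
    """Write the record as readable JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(record), f, indent=2)


def load_record(path):
    """Read a record back, for example after a machine change."""
    with open(path, encoding="utf-8") as f:
        return SetupRecord(**json.load(f))
```

A plain-text note works just as well. The point of a structured file is only that a colleague, or a later you, can read it back without guessing what each line meant.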
Frequently Asked Questions
What is the single most important practice?
Sizing the model to your hardware with headroom. Almost every speed and stability problem traces back to a model that is too big for the machine. Get that right and most other issues disappear.
How do I decide on a quantization level?
Start at a middle level that fits comfortably, then move only with evidence: step up if answers feel weak, step down if it is slow or does not fit. Avoid changing it on a hunch, which leads to endless thrashing.
Should I trust a local model more because it is private?
No. Privacy and accuracy are separate. A local model invents confident falsehoods like any other, so verify anything that matters. The private setup protects your data, not the truth of the output.
Is it worth keeping multiple models installed?
Yes. A small fast model for quick questions plus a larger one for hard tasks is a strong default. Matching model size to the task beats forcing everything through one heavyweight model.
How often should I maintain my setup?
A light monthly pass works for most people: prune unused model files and check whether a newer model would serve you better. That cadence keeps the disk clear and the setup current without constant fiddling.
How do I make sure I am using my graphics card?
Check the runner's settings or status for hardware utilization. If a capable card sits idle while the processor does the work, output will be far slower than necessary. Once you have confirmed the runner is using the card, the setting carries over to every later session.
Key Takeaways
- Size the model to your hardware with headroom; let memory set the ceiling and keep a small fast model on hand too.
- Default to a middle quantization level and move off it only with evidence, never on a hunch.
- Verify privacy by unplugging the network, and route sensitive work only through models you have confirmed run locally.
- Confirm your graphics card is in use, and always treat output as a draft to verify — local does not mean accurate.
- Maintain the setup like infrastructure: prune unused model files and revisit your model choices as open-weight models improve.