The phrase covers more software categories than newcomers expect. When someone says they want to run a language model locally, they usually mean they want one thing, but getting there involves a runtime to execute the model, an interface to talk to it, and often a serving layer to expose it to other programs. Treating these as one undifferentiated pile is why people end up with a setup that technically works but fights them at every turn.
This survey separates the landscape into the layers that actually exist, describes what each does, and gives you criteria to choose within each. The goal is not to crown a winner, because the right choice depends heavily on whether you are a solo tinkerer, a developer building an application, or a small team sharing an inference box. The goal is to let you reason about the categories so a specific recommendation makes sense for your situation.
By the end you should be able to look at any tool and place it: what layer it occupies, what it competes with, and which trade-offs it asks you to accept.
The Runtime Layer: What Actually Executes the Model
The runtime is the engine that loads model weights and produces tokens. Everything else sits on top of it.
What runtimes do
- Load quantized or full-precision weights into memory.
- Manage GPU offloading and CPU fallback.
- Expose a way to send prompts and receive completions.
How runtimes differ
- Quantization support. Some runtimes handle a wide range of quantization formats; others are narrower. Broader support means more model choices.
- Hardware coverage. Apple Silicon, NVIDIA GPUs, and plain CPU each have runtimes that excel on them. Matching the runtime to your hardware matters more than any other runtime decision.
- Setup friction. Some runtimes are a single binary; others assume a full development environment.
Our end-to-end overview of self-hosting walks through how the runtime fits into the larger picture.
The Interface Layer: How You Actually Talk to the Model
A runtime gives you completions; an interface makes them pleasant to use. This layer is where the experience of using a local model lives.
Interface styles
- Desktop chat applications that bundle a runtime and present a familiar conversation window.
- Command-line tools for people who prefer the terminal and want scriptability.
- Web interfaces that run a local server and present a browser-based chat.
The selection criterion here is mostly taste plus integration. A chat app is the fastest path to a working conversation, while a command-line tool composes better into scripts.
The Serving Layer: Exposing the Model to Other Software
When you want a local model to power an application rather than answer you directly, you need a serving layer that presents a stable interface, often mimicking a familiar API shape.
What serving buys you
- A consistent endpoint your code can call.
- The ability to swap models behind a fixed interface.
- Concurrency handling when multiple requests arrive.
Selection criteria for serving
- API compatibility with whatever your application already expects, which minimizes integration code.
- Concurrency behavior under realistic load, since a serving layer that blocks on one request at a time will bottleneck.
- Configuration surface for model loading, context size, and resource limits.
Our practical examples piece shows several of these serving setups powering real tasks.
Trade-offs That Cut Across the Layers
A few tensions show up regardless of which layer you are choosing within. Naming them helps you decide.
The recurring tensions
- Convenience versus control. Bundled chat apps are the easiest path but hide the configuration you may later need. Lower-level tools demand more setup but expose every knob.
- Breadth versus polish. Tools that support every model and format are powerful but rougher; tools that support a curated set are smoother but narrower.
- Speed of setup versus longevity. The fastest thing to stand up is not always the thing you want to depend on in six months.
The decision-focused look at competing approaches explores these tensions as explicit axes you can weigh.
Choosing for Your Situation
The right stack depends less on which tool is objectively best and more on who you are and what you are building.
Three common profiles
- The tinkerer wants a bundled chat app: one download, immediate conversation, minimal configuration.
- The application developer wants a runtime plus a serving layer with API compatibility, so the model becomes a callable component.
- The small team wants a serving layer that handles concurrency, plus disciplined version recording so everyone runs the same thing.
Whichever profile fits, our best practices for running local models help you avoid the configuration pitfalls each profile tends to hit.
Letting the profile evolve
A common error is locking into a profile and treating the choice as permanent. People often begin as tinkerers and discover they want to script the model, which pulls them toward the developer profile and its runtime-plus-serving stack. Others start building an application and realize they only ever needed a desktop chat window. Because the layers are modular, you can change one without discarding the rest, swapping an interface while keeping the runtime, or adding a serving layer to a setup that began as a personal chat tool. Designing with that modularity in mind keeps an early choice from becoming a cage.
Evaluating a New Tool You Encounter
The landscape moves, and you will regularly meet software you have never seen. A short evaluation habit lets you place any new tool quickly instead of starting from scratch each time.
Questions that locate any tool
- Which layer does it occupy? Decide whether it is a runtime, an interface, a serving layer, or a bundle of several. This single question resolves most confusion about what a tool is for.
- What does it compete with? Once you know its layer, you know its alternatives, which makes its trade-offs legible.
- What does it ask you to accept? Every tool trades something. Identify whether it costs you control for convenience, polish for breadth, or longevity for speed of setup.
Reading the signals of maturity
- An active community and recent updates signal a tool you can depend on, because help exists when something breaks.
- Clear documentation of configuration signals a tool that will not fight you when you need to tune it.
- Honest scope signals reliability; a tool that claims to do everything usually does several things poorly.
Running any unfamiliar tool through these questions turns the overwhelming pace of the landscape into something manageable, because you are placing each new entry into a structure you already understand.
Avoiding Tool Sprawl
A failure mode for enthusiastic adopters is accumulating tools faster than competence with any of them. Each new release promises an improvement, and it is tempting to chase every one, but a drawer full of half-learned tools is weaker than one tool you know deeply.
Staying disciplined
- Standardize where it counts. Settle on a runtime that suits your hardware and a serving interface your code expects, then change them only for a real reason rather than novelty.
- Evaluate before adopting. Run a candidate through the placement questions and confirm it solves a problem your current stack genuinely has, not one it merely could have.
- Retire deliberately. When you do switch, remove the old tool rather than leaving it to rot, so your setup reflects what you actually use.
The goal is a small, well-understood stack that you can configure confidently and reason about under pressure. Depth with a few tools beats breadth across many, especially when something breaks and you need to fix it quickly rather than relearn an interface you barely touched.
Frequently Asked Questions
Do I need software from all three layers?
Not always. A bundled chat application can include runtime and interface in one download, which is enough for personal use. You only need a separate serving layer when other software has to call the model.
How do I know if a runtime suits my hardware?
Match the runtime to your hardware family first. There are runtimes optimized for Apple Silicon, for NVIDIA GPUs, and for CPU-only machines. Picking the wrong one leaves performance on the table even with good hardware.
Is the command line worth learning for this?
If you want scriptability and integration, yes. If you only want to chat with a model occasionally, a desktop app is faster and equally capable for that purpose. Match the interface to how you actually work.
What makes a serving layer good?
API compatibility with what your code expects, sensible concurrency behavior under load, and enough configuration to control model loading and resource limits. These three together determine whether it integrates cleanly.
Should I optimize for the easiest setup?
Only if the setup is for experimentation. For anything you intend to depend on, weigh longevity and control alongside ease, because the easiest thing to stand up is not always the most maintainable.
Key Takeaways
- The landscape splits into runtime, interface, and serving layers; treat them as distinct decisions.
- Match the runtime to your hardware family before weighing anything else about it.
- Use a serving layer only when other software needs to call the model.
- Convenience versus control is the trade-off that recurs across every layer.
- The right stack depends on whether you are a tinkerer, a developer, or a team, not on a single best tool.