Surveying the Software That Runs Models on Your Machine

The phrase covers more software categories than newcomers expect. When someone says they want to run a language model locally, they usually mean they want one thing, but getting there involves a runtime to execute the model, an interface to talk to it, and often a serving layer to expose it to other programs. Treating these as one undifferentiated pile is why people end up with a setup that technically works but fights them at every turn.

This survey separates the landscape into the layers that actually exist, describes what each does, and gives you criteria to choose within each. The goal is not to crown a winner, because the right choice depends heavily on whether you are a solo tinkerer, a developer building an application, or a small team sharing an inference box. The goal is to let you reason about the categories so a specific recommendation makes sense for your situation.

By the end you should be able to look at any tool and place it: what layer it occupies, what it competes with, and which trade-offs it asks you to accept.

The Runtime Layer: What Actually Executes the Model

The runtime is the engine that loads model weights and produces tokens. Everything else sits on top of it.

What runtimes do

Load quantized or full-precision weights into memory.
Manage GPU offloading and CPU fallback.
Expose a way to send prompts and receive completions.

How runtimes differ

Quantization support. Some runtimes handle a wide range of quantization formats; others are narrower. Broader support means more model choices.
Hardware coverage. Apple Silicon, NVIDIA GPUs, and plain CPU each have runtimes that excel on them. Matching the runtime to your hardware matters more than any other runtime decision.
Setup friction. Some runtimes are a single binary; others assume a full development environment.

Our end-to-end overview of self-hosting walks through how the runtime fits into the larger picture.

The Interface Layer: How You Actually Talk to the Model

A runtime gives you completions; an interface makes them pleasant to use. This layer is where the experience of using a local model lives.

Interface styles

Desktop chat applications that bundle a runtime and present a familiar conversation window.
Command-line tools for people who prefer the terminal and want scriptability.
Web interfaces that run a local server and present a browser-based chat.

The selection criterion here is mostly taste plus integration. A chat app is the fastest path to a working conversation, while a command-line tool composes better into scripts.

The Serving Layer: Exposing the Model to Other Software

When you want a local model to power an application rather than answer you directly, you need a serving layer that presents a stable interface, often mimicking a familiar API shape.

What serving buys you

A consistent endpoint your code can call.
The ability to swap models behind a fixed interface.
Concurrency handling when multiple requests arrive.

Selection criteria for serving

API compatibility with whatever your application already expects, which minimizes integration code.
Concurrency behavior under realistic load, since a serving layer that blocks on one request at a time will bottleneck.
Configuration surface for model loading, context size, and resource limits.

Our practical examples piece shows several of these serving setups powering real tasks.

Trade-offs That Cut Across the Layers

A few tensions show up regardless of which layer you are choosing within. Naming them helps you decide.

The recurring tensions

Convenience versus control. Bundled chat apps are the easiest path but hide the configuration you may later need. Lower-level tools demand more setup but expose every knob.
Breadth versus polish. Tools that support every model and format are powerful but rougher; tools that support a curated set are smoother but narrower.
Speed of setup versus longevity. The fastest thing to stand up is not always the thing you want to depend on in six months.

The decision-focused look at competing approaches explores these tensions as explicit axes you can weigh.

Choosing for Your Situation

The right stack depends less on which tool is objectively best and more on who you are and what you are building.

Three common profiles

The tinkerer wants a bundled chat app: one download, immediate conversation, minimal configuration.
The application developer wants a runtime plus a serving layer with API compatibility, so the model becomes a callable component.
The small team wants a serving layer that handles concurrency, plus disciplined version recording so everyone runs the same thing.

Whichever profile fits, our best practices for running local models help you avoid the configuration pitfalls each profile tends to hit.

Letting the profile evolve

A common error is locking into a profile and treating the choice as permanent. People often begin as tinkerers and discover they want to script the model, which pulls them toward the developer profile and its runtime-plus-serving stack. Others start building an application and realize they only ever needed a desktop chat window. Because the layers are modular, you can change one without discarding the rest, swapping an interface while keeping the runtime, or adding a serving layer to a setup that began as a personal chat tool. Designing with that modularity in mind keeps an early choice from becoming a cage.

Evaluating a New Tool You Encounter

The landscape moves, and you will regularly meet software you have never seen. A short evaluation habit lets you place any new tool quickly instead of starting from scratch each time.

Questions that locate any tool

Which layer does it occupy? Decide whether it is a runtime, an interface, a serving layer, or a bundle of several. This single question resolves most confusion about what a tool is for.
What does it compete with? Once you know its layer, you know its alternatives, which makes its trade-offs legible.
What does it ask you to accept? Every tool trades something. Identify whether it costs you control for convenience, polish for breadth, or longevity for speed of setup.

Reading the signals of maturity

An active community and recent updates signal a tool you can depend on, because help exists when something breaks.
Clear documentation of configuration signals a tool that will not fight you when you need to tune it.
Honest scope signals reliability; a tool that claims to do everything usually does several things poorly.

Running any unfamiliar tool through these questions turns the overwhelming pace of the landscape into something manageable, because you are placing each new entry into a structure you already understand.

Avoiding Tool Sprawl

A failure mode for enthusiastic adopters is accumulating tools faster than competence with any of them. Each new release promises an improvement, and it is tempting to chase every one, but a drawer full of half-learned tools is weaker than one tool you know deeply.

Staying disciplined

Standardize where it counts. Settle on a runtime that suits your hardware and a serving interface your code expects, then change them only for a real reason rather than novelty.
Evaluate before adopting. Run a candidate through the placement questions and confirm it solves a problem your current stack genuinely has, not one it merely could have.
Retire deliberately. When you do switch, remove the old tool rather than leaving it to rot, so your setup reflects what you actually use.

The goal is a small, well-understood stack that you can configure confidently and reason about under pressure. Depth with a few tools beats breadth across many, especially when something breaks and you need to fix it quickly rather than relearn an interface you barely touched.

Frequently Asked Questions

Do I need software from all three layers?

Not always. A bundled chat application can include runtime and interface in one download, which is enough for personal use. You only need a separate serving layer when other software has to call the model.

How do I know if a runtime suits my hardware?

Match the runtime to your hardware family first. There are runtimes optimized for Apple Silicon, for NVIDIA GPUs, and for CPU-only machines. Picking the wrong one leaves performance on the table even with good hardware.

Is the command line worth learning for this?

If you want scriptability and integration, yes. If you only want to chat with a model occasionally, a desktop app is faster and equally capable for that purpose. Match the interface to how you actually work.

What makes a serving layer good?

API compatibility with what your code expects, sensible concurrency behavior under load, and enough configuration to control model loading and resource limits. These three together determine whether it integrates cleanly.

Should I optimize for the easiest setup?

Only if the setup is for experimentation. For anything you intend to depend on, weigh longevity and control alongside ease, because the easiest thing to stand up is not always the most maintainable.

Key Takeaways

The landscape splits into runtime, interface, and serving layers; treat them as distinct decisions.
Match the runtime to your hardware family before weighing anything else about it.
Use a serving layer only when other software needs to call the model.
Convenience versus control is the trade-off that recurs across every layer.
The right stack depends on whether you are a tinkerer, a developer, or a team, not on a single best tool.

By the end you should be able to look at any tool and place it: what layer it occupies, what it competes with, and which trade-offs it asks you to accept.

The Runtime Layer: What Actually Executes the Model

The runtime is the engine that loads model weights and produces tokens. Everything else sits on top of it.

What runtimes do

Load quantized or full-precision weights into memory.
Manage GPU offloading and CPU fallback.
Expose a way to send prompts and receive completions.

How runtimes differ

Quantization support. Some runtimes handle a wide range of quantization formats; others are narrower. Broader support means more model choices.
Hardware coverage. Apple Silicon, NVIDIA GPUs, and plain CPU each have runtimes that excel on them. Matching the runtime to your hardware matters more than any other runtime decision.
Setup friction. Some runtimes are a single binary; others assume a full development environment.

Our end-to-end overview of self-hosting walks through how the runtime fits into the larger picture.

The Interface Layer: How You Actually Talk to the Model

A runtime gives you completions; an interface makes them pleasant to use. This layer is where the experience of using a local model lives.

Interface styles

Desktop chat applications that bundle a runtime and present a familiar conversation window.
Command-line tools for people who prefer the terminal and want scriptability.
Web interfaces that run a local server and present a browser-based chat.

The selection criterion here is mostly taste plus integration. A chat app is the fastest path to a working conversation, while a command-line tool composes better into scripts.

The Serving Layer: Exposing the Model to Other Software

When you want a local model to power an application rather than answer you directly, you need a serving layer that presents a stable interface, often mimicking a familiar API shape.

What serving buys you

A consistent endpoint your code can call.
The ability to swap models behind a fixed interface.
Concurrency handling when multiple requests arrive.

Selection criteria for serving

API compatibility with whatever your application already expects, which minimizes integration code.
Concurrency behavior under realistic load, since a serving layer that blocks on one request at a time will bottleneck.
Configuration surface for model loading, context size, and resource limits.

Our practical examples piece shows several of these serving setups powering real tasks.

Trade-offs That Cut Across the Layers

A few tensions show up regardless of which layer you are choosing within. Naming them helps you decide.

The recurring tensions

Convenience versus control. Bundled chat apps are the easiest path but hide the configuration you may later need. Lower-level tools demand more setup but expose every knob.
Breadth versus polish. Tools that support every model and format are powerful but rougher; tools that support a curated set are smoother but narrower.
Speed of setup versus longevity. The fastest thing to stand up is not always the thing you want to depend on in six months.

The decision-focused look at competing approaches explores these tensions as explicit axes you can weigh.

Choosing for Your Situation

The right stack depends less on which tool is objectively best and more on who you are and what you are building.

Three common profiles

The tinkerer wants a bundled chat app: one download, immediate conversation, minimal configuration.
The application developer wants a runtime plus a serving layer with API compatibility, so the model becomes a callable component.
The small team wants a serving layer that handles concurrency, plus disciplined version recording so everyone runs the same thing.

Whichever profile fits, our best practices for running local models help you avoid the configuration pitfalls each profile tends to hit.

Letting the profile evolve

Evaluating a New Tool You Encounter

The landscape moves, and you will regularly meet software you have never seen. A short evaluation habit lets you place any new tool quickly instead of starting from scratch each time.

Questions that locate any tool

Which layer does it occupy? Decide whether it is a runtime, an interface, a serving layer, or a bundle of several. This single question resolves most confusion about what a tool is for.
What does it compete with? Once you know its layer, you know its alternatives, which makes its trade-offs legible.
What does it ask you to accept? Every tool trades something. Identify whether it costs you control for convenience, polish for breadth, or longevity for speed of setup.

Reading the signals of maturity

An active community and recent updates signal a tool you can depend on, because help exists when something breaks.
Clear documentation of configuration signals a tool that will not fight you when you need to tune it.
Honest scope signals reliability; a tool that claims to do everything usually does several things poorly.

Avoiding Tool Sprawl

Staying disciplined

Standardize where it counts. Settle on a runtime that suits your hardware and a serving interface your code expects, then change them only for a real reason rather than novelty.
Evaluate before adopting. Run a candidate through the placement questions and confirm it solves a problem your current stack genuinely has, not one it merely could have.
Retire deliberately. When you do switch, remove the old tool rather than leaving it to rot, so your setup reflects what you actually use.

Frequently Asked Questions

Do I need software from all three layers?

How do I know if a runtime suits my hardware?

Is the command line worth learning for this?

What makes a serving layer good?

Should I optimize for the easiest setup?

Only if the setup is for experimentation. For anything you intend to depend on, weigh longevity and control alongside ease, because the easiest thing to stand up is not always the most maintainable.

Key Takeaways

The landscape splits into runtime, interface, and serving layers; treat them as distinct decisions.
Match the runtime to your hardware family before weighing anything else about it.
Use a serving layer only when other software needs to call the model.
Convenience versus control is the trade-off that recurs across every layer.
The right stack depends on whether you are a tinkerer, a developer, or a team, not on a single best tool.

Surveying the Software That Runs Models on Your Machine

The Runtime Layer: What Actually Executes the Model

What runtimes do

How runtimes differ

The Interface Layer: How You Actually Talk to the Model

Interface styles

The Serving Layer: Exposing the Model to Other Software

What serving buys you

Selection criteria for serving

Trade-offs That Cut Across the Layers

The recurring tensions

Choosing for Your Situation

Three common profiles

Letting the profile evolve

Evaluating a New Tool You Encounter

Questions that locate any tool

Reading the signals of maturity

Avoiding Tool Sprawl

Staying disciplined

Frequently Asked Questions

Do I need software from all three layers?

How do I know if a runtime suits my hardware?

Is the command line worth learning for this?

What makes a serving layer good?

Should I optimize for the easiest setup?

Key Takeaways

Agency Script Editorial

Related Articles

Prompt Quality Decides Whether AI Earns Its Keep

Counting the Real Cost of Every Token You Send

Rolling Out AI Hallucinations Across a Team

Ready to certify your AI capability?

Surveying the Software That Runs Models on Your Machine

The Runtime Layer: What Actually Executes the Model

What runtimes do

How runtimes differ

The Interface Layer: How You Actually Talk to the Model

Interface styles

The Serving Layer: Exposing the Model to Other Software

What serving buys you

Selection criteria for serving

Trade-offs That Cut Across the Layers

The recurring tensions

Choosing for Your Situation

Three common profiles

Letting the profile evolve

Evaluating a New Tool You Encounter

Questions that locate any tool

Reading the signals of maturity

Avoiding Tool Sprawl

Staying disciplined

Frequently Asked Questions

Do I need software from all three layers?

How do I know if a runtime suits my hardware?

Is the command line worth learning for this?

What makes a serving layer good?

Should I optimize for the easiest setup?

Key Takeaways

Agency Script Editorial

Related Articles

Prompt Quality Decides Whether AI Earns Its Keep

Counting the Real Cost of Every Token You Send

Rolling Out AI Hallucinations Across a Team

Ready to certify your AI capability?