When a 70B Model Won't Fit: Plays for Real Weight Trouble

Agency Script Editorial

Editorial Team

March 2, 2025 · 8 min read

A playbook is not a tutorial. A tutorial teaches you what weights are; a playbook tells you what to do when a 70-billion-parameter model won't fit on your hardware, when a fine-tune degrades production quality, or when a vendor swaps the model behind your API without telling you. This document is organized around situations you will actually hit, each with a clear trigger, a recommended play, and the person who should own the call.

The premise is that parameters and weights are operational concerns, not just academic ones. The model you choose, how you store its weights, how you quantize them, and how you version them all have direct consequences for cost, latency, and quality. Teams that manage these deliberately ship faster and break less. Teams that treat them as someone else's problem discover the costs at the worst possible time, usually during an incident.

Use this as a reference. You do not need to run every play. You need to know which play applies when the trigger fires, and who pulls the trigger.

Play 1: Right-sizing the model to the task

Trigger: You are starting a new feature or evaluating a vendor and someone proposes the biggest model available "to be safe."

The default instinct toward maximum size is almost always wrong. Larger models cost more per token, add latency, and frequently outperform smaller ones by margins your users will never notice. The play is to start small and scale up only on evidence.

  • Owner: The engineer building the feature, with budget sign-off from the team lead.
  • Sequence: Define the task, pick the smallest credible model, benchmark on your real data, and only move up a size class if the smaller one fails a measurable threshold.
  • Failure mode to avoid: Choosing by parameter count on a spec sheet instead of by performance on your workload.
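The sequence above can be sketched as a small selection loop. This is a minimal illustration, not a prescribed benchmark harness: the model names, the scores, and the `evaluate` callback are all hypothetical stand-ins for whatever evaluation you run on your own data.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str              # hypothetical model label
    params_billion: float  # size class, used only for ordering


def pick_smallest_passing(
    candidates: Sequence[Candidate],
    evaluate: Callable[[Candidate], float],
    threshold: float,
) -> Optional[Candidate]:
    """Return the smallest candidate whose score meets the threshold, else None."""
    for cand in sorted(candidates, key=lambda c: c.params_billion):
        if evaluate(cand) >= threshold:
            return cand
    return None


# Stand-in scores; in practice `evaluate` would run your benchmark on real data.
fake_scores = {"small-8b": 0.81, "medium-30b": 0.90, "large-70b": 0.92}
candidates = [
    Candidate("large-70b", 70),
    Candidate("small-8b", 8),
    Candidate("medium-30b", 30),
]
choice = pick_smallest_passing(candidates, lambda c: fake_scores[c.name], threshold=0.88)
print(choice.name if choice else "no candidate passed")
```

With these made-up scores the loop stops at the mid-sized model, because it is the first size class that clears the threshold. The largest model is never the default; it is only reached if everything smaller fails.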

A disciplined right-sizing process is the single highest-leverage habit here. The companion article "Ai Model Parameters and Weights: Best Practices That Actually Work" goes deeper on how to structure that benchmark.

Play 2: Fitting weights onto your hardware

Trigger: A model you want to self-host does not fit in available GPU memory.

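You have a precise problem with a small set of known solutions. Run the memory math first: parameters times bytes per parameter for the weights, plus headroom for activations and the KV cache. A 70-billion-parameter model at FP16 (2 bytes per parameter) needs about 140 GB for the weights alone. Then choose a lever.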
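The memory math is simple enough to script. The sketch below covers weights only and uses the nominal bytes per parameter for each precision; real quantized formats carry a little extra for scales and metadata, so treat the results as a lower bound.

```python
# Nominal storage cost per parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Weight memory in GB (10^9 bytes): parameters times bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


for precision in ("fp16", "int8", "int4"):
    print(precision, weight_memory_gb(70, precision), "GB")
```

For a 70B model this gives 140 GB at FP16, 70 GB at INT8, and 35 GB at INT4, before any allowance for activations or the KV cache.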

The levers, in order of preference

  • Quantize. Drop from FP16 to INT8 or INT4, which cuts weight memory to roughly half or a quarter respectively. Test for accuracy loss on your task.
  • Use a smaller model. If quantization isn't enough, step down a size class.
  • Shard across GPUs. Split the weights across multiple devices. More cost, more complexity, but keeps the large model.
  • Offload to CPU or disk. Last resort. Works, but latency suffers badly.
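One way to make the preference order concrete is a function that walks the levers and returns the first one that fits. This is a rough sketch under stated assumptions: it counts weight memory only, reserves a hypothetical 20% of each GPU as headroom, and assumes sharding splits weights evenly across devices. The function name and its parameters are illustrative, not part of any library.

```python
from typing import Sequence

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
PRECISIONS = ("fp16", "int8", "int4")  # highest precision first


def choose_lever(
    params_billion: float,
    gpu_gb: float,
    gpu_count: int = 1,
    smaller_sizes_billion: Sequence[float] = (),
    headroom: float = 0.8,  # assumed fraction of GPU memory available for weights
) -> str:
    usable_gb = gpu_gb * headroom

    def fits(size_billion: float, precision: str, gpus: int = 1) -> bool:
        return size_billion * BYTES_PER_PARAM[precision] <= usable_gb * gpus

    if fits(params_billion, "fp16"):
        return "no lever needed: fits at fp16"
    # Lever 1: quantize the model you wanted.
    for precision in ("int8", "int4"):
        if fits(params_billion, precision):
            return f"quantize to {precision}"
    # Lever 2: step down a size class, largest acceptable model first.
    for size in sorted(smaller_sizes_billion, reverse=True):
        for precision in PRECISIONS:
            if fits(size, precision):
                return f"use a smaller model: {size:g}B at {precision}"
    # Lever 3: shard the original model across the available GPUs.
    if gpu_count > 1:
        for precision in PRECISIONS:
            if fits(params_billion, precision, gpus=gpu_count):
                return f"shard across {gpu_count} GPUs at {precision}"
    # Lever 4: last resort.
    return "offload to CPU or disk"


print(choose_lever(70, gpu_gb=80))
print(choose_lever(70, gpu_gb=24, gpu_count=4))
```

Passing an empty `smaller_sizes_billion` expresses "I need this specific model", which sends the function straight from quantization to sharding. The result is a starting point for the conversation with whoever owns the cost trade-off, not a substitute for testing accuracy after quantization.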

Owner: The infrastructure or ML engineer. The team lead owns the cost trade-off if sharding adds GPUs.

Play 3: Fine-tuning without breaking the base

Trigger: A prompt-only approach isn't getting you the quality or consistency you need on a specialized task.

Fine-tuning adjusts weights toward your data, but full fine-tuning is expensive and risks catastrophic forgetting, where the model loses general ability while gaining your narrow task. The play is to reach for parameter-efficient methods first.

  • Start with LoRA or a similar adapter. Freeze the base weights, train a small set of new ones. Cheap, fast, reversible.
  • Keep a held-out evaluation set that tests general capability, not just your task, so you catch regressions.
  • Version every adapter alongside the base model version it was trained against.
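To see why adapters are cheap, it helps to count trainable parameters. LoRA leaves a weight matrix frozen and learns its update as the product of two low-rank matrices, so the trainable count grows with the rank rather than with the full matrix size. The sketch below does that arithmetic for a single layer; the 4096 by 4096 shape and rank 8 are illustrative choices, not values from any particular model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when fine-tuning the whole d_out x d_in matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in); the base
    matrix stays frozen and is not counted.
    """
    return rank * (d_out + d_in)


d = 4096   # illustrative hidden size
rank = 8   # illustrative adapter rank
full = full_params(d, d)
adapter = lora_params(d, d, rank)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.4%}")
```

For this layer the adapter trains 65,536 parameters against 16,777,216 for the full matrix, well under one percent. Because the base weights never change, removing the adapter restores the original model exactly, which is what makes the approach reversible.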

Owner: The ML engineer running the fine-tune, with a reviewer signing off on the evaluation results before anything ships.

Play 4: Locking down weight provenance and versioning

Trigger: You are putting a model into production and need to know exactly which weights are running.

This is the play teams skip and regret. Weights are artifacts; treat them like any other production dependency.

  • Pin exact versions. Record the model checkpoint, quantization scheme, and adapter hashes.
  • Store weights in a registry, not on someone's laptop. Use checksums to verify integrity.
  • Watch for silent vendor swaps. Closed API providers update models behind stable names. Build evaluation canaries that alert you when behavior shifts.
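Pinning and integrity checks can start as a small manifest of file hashes. The sketch below uses only the Python standard library; the manifest layout and the function names are hypothetical, and a real registry would typically store this metadata for you.

```python
import hashlib
from pathlib import Path
from typing import Sequence


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(checkpoint: Path, quantization: str, adapters: Sequence[Path]) -> dict:
    """Record the checkpoint, quantization scheme, and adapter hashes together."""
    return {
        "checkpoint": {"file": checkpoint.name, "sha256": sha256_of_file(checkpoint)},
        "quantization": quantization,
        "adapters": [{"file": a.name, "sha256": sha256_of_file(a)} for a in adapters],
    }


def checkpoint_matches(manifest: dict, checkpoint: Path) -> bool:
    """True when the file on disk still has the hash recorded in the manifest."""
    return sha256_of_file(checkpoint) == manifest["checkpoint"]["sha256"]
```

Build the manifest when a model is promoted, store it next to the weights, and call `checkpoint_matches` at deploy or load time. A mismatch means the file is not the artifact you evaluated.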

Owner: The platform or MLOps engineer. This is foundational reliability work, not optional polish.

Play 5: Responding to a quality regression

Trigger: Output quality drops in production and you suspect the model.

Resist the urge to immediately swap models. Diagnose first, because the cause is often not where you think.

Diagnostic sequence

  • Confirm the weights didn't change. Check your version pins and any vendor changelog.
  • Check quantization. If you recently quantized, that is your prime suspect—roll back to higher precision and compare.
  • Check the data path. Most "model" regressions are actually changes in prompts, retrieval, or input formatting.
  • Only then consider the weights themselves.
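The diagnostic sequence is an ordered list of checks where the first failure names the suspect. A minimal sketch, assuming each check is a function that returns True when that layer is confirmed unchanged; the lambdas below are placeholders for real comparisons against your pins and configuration.

```python
from typing import Callable, List, Optional, Tuple

# (label, check) pairs; a check returns True when that layer is ruled out.
Check = Tuple[str, Callable[[], bool]]


def first_failing_check(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the label of the first one that fails."""
    for label, passed in checks:
        if not passed():
            return label
    return None


# Placeholder results; replace each lambda with a real comparison.
checks: List[Check] = [
    ("weights changed (version pins, vendor changelog)", lambda: True),
    ("quantization changed", lambda: True),
    ("data path changed (prompts, retrieval, input formatting)", lambda: False),
]
suspect = first_failing_check(checks)
print(suspect or "upstream checks pass; consider the weights themselves")
```

Stopping at the first failure keeps the on-call engineer from investigating the model when a cheaper explanation has not been ruled out yet.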

Owner: Whoever is on call, escalating to the ML engineer if the issue isolates to the model. Working through this kind of diagnosis methodically is exactly what the companion article "A Step-by-Step Approach to Ai Model Parameters and Weights" is built to support.

Play 6: Deciding open weights versus closed API

Trigger: A new project forces the build-versus-buy question for the model layer.

This is a strategic call with long consequences, so make it explicitly rather than by default.

  • Choose open weights when you need portability, on-premise deployment, data residency control, or custom fine-tuning at depth.
  • Choose a closed API when you want the strongest raw capability, zero infrastructure burden, and you can tolerate vendor dependence.
  • Hybrid is valid. Many teams use closed APIs for hard tasks and self-hosted open weights for high-volume, simpler ones.

Owner: The engineering lead or architect, with input from security and finance. The overview "The Best Tools for Ai Model Parameters and Weights" can help map the options on each side.

Sequencing the plays

These plays are not a checklist to run top to bottom; they are a library you draw from. But there is a natural order for a new project: right-size first (Play 1), fit to hardware if self-hosting (Play 2), establish provenance before launch (Play 4), and keep the fine-tuning and regression plays (Plays 3 and 5) ready for when they trigger. The open-versus-closed decision (Play 6) usually comes before all of them, since it shapes which plays even apply.

Frequently Asked Questions

Who should own model and weight decisions on a team?

It is shared but specific. Engineers own task-level model selection and fitting weights to hardware. Platform or MLOps owns versioning and provenance. The engineering lead owns strategic calls like open versus closed and any decision that materially affects budget. Naming the owner per play prevents the common failure where everyone assumes someone else is watching.

When should I quantize versus just use a smaller model?

Quantize first if you want to keep a specific model's behavior but need it to fit or run faster, since quantization usually preserves most quality. Switch to a smaller model when even aggressive quantization isn't enough, or when the smaller model already passes your benchmark. Test both options on your real task rather than deciding in the abstract.

How do I avoid being surprised by a vendor changing their model?

Build evaluation canaries—small, fixed test sets you run on a schedule against the API. When scores move, you get an early warning even if the vendor publishes nothing. Combine this with version pinning where the provider supports it. Treating a closed model as a moving dependency rather than a fixed one is the core discipline.
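A canary can be very small. The sketch below scores a fixed set of prompts by checking for an expected substring and flags drift when the score moves beyond a tolerance. Everything here is an assumption for illustration: the cases, the substring scoring rule, the 0.05 tolerance, and the `call_model` callback, which stands in for whatever client you use to reach the vendor API.

```python
from typing import Callable, Sequence, Tuple

# Fixed (prompt, expected substring) pairs; keep these stable over time.
CASES: Sequence[Tuple[str, str]] = (
    ("What is 2 + 2? Answer with a number.", "4"),
    ("Name the capital of France in one word.", "Paris"),
)


def canary_score(cases: Sequence[Tuple[str, str]], call_model: Callable[[str], str]) -> float:
    """Fraction of cases whose output contains the expected substring."""
    hits = [expected in call_model(prompt) for prompt, expected in cases]
    return sum(hits) / len(hits)


def has_drifted(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag movement in either direction; an unexplained improvement is also a change."""
    return abs(current - baseline) > tolerance


# Stand-in models so the sketch runs without network access.
def stable_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "Paris"


def changed_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "Lyon"


baseline = canary_score(CASES, stable_model)
current = canary_score(CASES, changed_model)
print(baseline, current, has_drifted(baseline, current))
```

Run the same cases on a schedule and alert on `has_drifted`. In practice you would want more cases and a scoring rule suited to your task, but the structure stays the same: fixed inputs, a recorded baseline, and a threshold.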

Is full fine-tuning ever worth it over LoRA?

Occasionally, when you have a large, high-quality dataset and the task differs substantially from the base model's training. Even then, start with LoRA to validate the approach cheaply before committing the compute. Full fine-tuning also demands more rigorous regression testing because of the catastrophic forgetting risk.

What is the first thing to check during a quality regression?

Confirm nothing changed in the weights or quantization, then check the data path—prompts, retrieval, and input formatting. The majority of regressions blamed on the model turn out to be upstream changes. Ruling those out first saves you from an expensive and unnecessary model swap.

Key Takeaways

  • Treat parameters and weights as operational concerns with named owners, not as a black box.
  • Right-size to the task first; start small and scale up only on measured evidence.
  • Quantization is your primary lever for fitting weights to hardware—test it on your real workload.
  • Prefer adapter-based fine-tuning like LoRA, and always keep a general-capability evaluation set.
  • Pin weight versions and build evaluation canaries so vendor swaps and regressions don't surprise you.
  • Diagnose quality regressions in order: weight versions, then quantization, then the data path, and only then the model itself.
