When teams start using language models for anything involving numbers, the same questions surface in nearly the same order. Why did the total come back wrong? Does telling it to show its work help? When should I reach for code instead? These are not naive questions; they are the right questions, and the answers determine whether your numerical features ship as trustworthy or as liabilities.
This piece collects the highest-frequency questions we hear from practitioners and answers each one directly, with the reasoning behind the answer. It is organized so you can read it front to back as a primer or jump to the specific question that is blocking you right now.
The throughline is simple: a language model is a superb interpreter of quantitative problems and an unreliable executor of arithmetic. Almost every good answer below comes from respecting that split and designing around it rather than against it.
Why Does the Model Get Numbers Wrong at All?
This is the foundational question, and the answer reframes everything else.
The mechanism
A model generates text by predicting likely next tokens. Digits, or short runs of digits, are tokens too. When you ask for a product or a sum, the model predicts the sequence of digit tokens that most plausibly follows your question. For arithmetic it saw often in training, that prediction usually matches the true answer. For arithmetic it rarely saw verbatim, the prediction is merely plausible, and plausible is frequently wrong.
What this implies
- Correctness correlates with how common the calculation was in training, not with how hard it is for a calculator.
- The model has no internal guarantee of arithmetic accuracy, only a probability of it.
- Errors are silent: nothing in the output signals that a particular number is a guess.
Does Asking the Model to Show Its Work Help?
Yes, meaningfully, but with an important caveat.
Why it works
Breaking a problem into steps turns one improbable prediction into several more-probable ones. The model is better at predicting each small step than at leaping to the final answer. This is the core insight behind "Why Think Step by Step Quietly Changes What Models Can Do."
The caveat
Step-by-step reasoning improves the plan far more than it guarantees the arithmetic. A model can reason flawlessly and still fumble a multiplication mid-stream. So showing work helps but is not sufficient; you still verify the numbers.
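One way to act on that caveat is to check each written-out step mechanically. The sketch below is a minimal illustration, assuming the model's work contains simple steps of the form `a op b = c`; the `STEP` pattern and the `check_steps` helper are hypothetical names, not part of any library, and real model output will need a more forgiving parser.

```python
import re
from fractions import Fraction

# Matches simple steps such as "12 * 7 = 84" or "3.5 + 1.25 = 4.75".
# Assumption: one binary operation per step, plain decimal numbers only.
STEP = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)"
)

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}

def check_steps(work: str) -> list[tuple[str, bool]]:
    """Return each 'a op b = c' step found in the text with a pass/fail flag."""
    results = []
    for match in STEP.finditer(work):
        left, op, right, claimed = match.groups()
        # Fractions keep the comparison exact; floats would add their own error.
        a, b, c = Fraction(left), Fraction(right), Fraction(claimed)
        if op == "/" and b == 0:
            results.append((match.group(0), False))
            continue
        results.append((match.group(0), OPS[op](a, b) == c))
    return results

# Hypothetical model output: the plan is sound, one subtraction is wrong.
work = """
Unit price times quantity: 17 * 24 = 408
Add shipping: 408 + 12.50 = 420.50
Apply the credit: 420.50 - 35 = 395.50
"""
for step, ok in check_steps(work):
    print("OK " if ok else "BAD", step)
```

The first two steps pass and the third is flagged, because 420.50 - 35 is 385.50. The comparison is exact, so a step that states a rounded quotient would also be flagged; loosen the check if rounded intermediate values are acceptable in your setting.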
When Should I Use Code or a Tool Instead?
Whenever the number feeds a decision, document, or downstream system.
The decision rule
- Stakes low and math common: inline arithmetic is acceptable.
- Stakes high or math non-trivial: have the model write the expression or code, then execute it deterministically.
How to wire it up
- Prompt the model to produce a calculation as code or a structured expression rather than a final figure.
- Run that code in a sandbox or calculator and return the computed result.
- Let the model interpret and explain the computed result, which is work it does well.
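The middle step, executing the model's expression deterministically, can be sketched as follows. This is a minimal example, assuming the model has been prompted to return a plain arithmetic expression as a string; `evaluate` is a hypothetical helper that walks Python's `ast` and allows only the four basic operators, so arbitrary code in the model's output is rejected rather than run.

```python
import ast
import operator
from decimal import Decimal

# Only these operators are allowed; anything else is rejected.
BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def evaluate(expression: str) -> Decimal:
    """Evaluate a plain arithmetic expression without calling eval()."""
    def walk(node: ast.AST) -> Decimal:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            # repr() is the shortest string that round-trips the float,
            # so a literal like 0.1 becomes Decimal("0.1").
            return Decimal(repr(node.value))
        if isinstance(node, ast.BinOp) and type(node.op) in BINARY:
            return BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in UNARY:
            return UNARY[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported syntax: {type(node).__name__}")
    return walk(ast.parse(expression, mode="eval"))

# Hypothetical model output: an expression, not a final figure.
model_expression = "(1249.99 * 3) + 87.50"
print(evaluate(model_expression))
```

The computed value then goes back to the model (or straight to the user) for explanation. In production you would more likely use a sandboxed interpreter or a vetted expression library; the point of the sketch is only that the number comes from execution, not from prediction.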
How Do I Handle Multi-Step Calculations?
Decompose deliberately and validate at each boundary.
The approach
Long calculations accumulate error: a small mistake in step two corrupts every step after it. Treat each step as a checkpoint. The discipline mirrors "Breaking Hard Tasks Into Prompts a Model Can Handle," where complexity is tamed by splitting it.
Practical tactics
- Ask the model to label intermediate values so they can be checked individually.
- Carry forward computed values explicitly rather than letting the model re-derive them.
- Recompute the final answer from the validated intermediates rather than trusting an end-to-end pass.
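These tactics can be sketched as a small checkpoint runner. The example assumes the model has labeled its intermediate values and that each step can be recomputed deterministically; the labels, the figures, and the `run_checkpoints` helper are all illustrative.

```python
from decimal import Decimal

# Hypothetical labeled intermediates as a model might report them.
claimed = {
    "subtotal": Decimal("3749.97"),
    "discount": Decimal("526.50"),   # digits transposed; 15% of the subtotal is 562.4955
    "total": Decimal("3223.47"),     # consistent with the bad discount, so also wrong
}

# Deterministic recomputation. Each step reads validated values, never the model's.
steps = [
    ("subtotal", lambda v: Decimal("1249.99") * 3),
    ("discount", lambda v: v["subtotal"] * Decimal("0.15")),
    ("total", lambda v: v["subtotal"] - v["discount"]),
]

def run_checkpoints(steps, claimed, tolerance=Decimal("0.005")):
    """Recompute each step in order and report where the claimed value diverges."""
    validated, mismatches = {}, []
    for label, compute in steps:
        validated[label] = compute(validated)
        if abs(validated[label] - claimed[label]) > tolerance:
            mismatches.append(label)
    return validated, mismatches

validated, mismatches = run_checkpoints(steps, claimed)
print("first bad checkpoint:", mismatches[0] if mismatches else None)
print("validated total:", validated["total"])
```

The first mismatched label is where the error entered; later mismatches are usually just propagation. Because every step carries forward the validated value, the final total is correct even though the model's own chain was not.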
How Do I Know If a Numerical Prompt Is Reliable?
Measure it across many runs and inputs, never a single example.
Why one success is meaningless
Outputs are probabilistic. The same prompt can be right once and wrong the next time. A demo that produces the right total tells you the answer is possible, not that it is reliable.
What to measure instead
- Accuracy across a representative test set of inputs.
- Consistency: sample the same problem several times and check whether the answers agree.
- Worst-case behavior, since the cost of a wrong number is usually asymmetric.
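A small harness makes these three measurements concrete. The sketch below assumes you can wrap your model call in a function that takes a question and returns an answer string; `make_stub` is a hypothetical stand-in that replays canned answers so the example runs without any API.

```python
from collections import Counter
from typing import Callable

def evaluate_prompt(ask: Callable[[str], str], cases: dict[str, str], samples: int = 5):
    """Sample each case several times; report accuracy, consistency, and the worst case."""
    report = {}
    for question, expected in cases.items():
        answers = [ask(question) for _ in range(samples)]
        top_answer, top_count = Counter(answers).most_common(1)[0]
        report[question] = {
            "accuracy": sum(a == expected for a in answers) / samples,
            "consistency": top_count / samples,  # share of runs giving the most common answer
            "modal_answer": top_answer,
        }
    worst = min(report, key=lambda q: report[q]["accuracy"])
    return report, worst

def make_stub(script: dict[str, list[str]]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model call: replays canned answers in order."""
    positions = {q: 0 for q in script}
    def ask(question: str) -> str:
        answers = script[question]
        answer = answers[positions[question] % len(answers)]
        positions[question] += 1
        return answer
    return ask

# Canned behavior: the common product is always right, the rare one drifts.
ask = make_stub({
    "17 * 24": ["408"],
    "4821 * 739": ["3562719", "3562719", "3561719", "3562719", "3572719"],
})
cases = {"17 * 24": "408", "4821 * 739": "3562719"}
report, worst = evaluate_prompt(ask, cases, samples=5)
print(worst, report[worst])
```

With a real model, replace `ask` with your own client call and grow `cases` into a representative set. Note that consistency and accuracy are separate numbers: a prompt can agree with itself on every run and still be wrong.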
What About Units, Currencies, and Rounding?
These are where correct math still produces wrong answers.
The hidden failure modes
A model can compute the right number in the wrong unit, mix currencies, or round inconsistently across a report. These errors are not arithmetic failures; they are specification failures, and they are entirely preventable.
How to prevent them
- State units, currency, and rounding rules explicitly in the prompt.
- Ask the model to carry units through every step and flag any conversion.
- Validate that totals reconcile, since reconciliation catches unit and rounding drift that spot-checks miss.
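The reconciliation check is easy to automate. The sketch below assumes a single reporting currency and a half-up-to-the-cent rounding rule; both are illustrative policies, and `round_money` and `reconcile` are hypothetical helpers.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def round_money(amount: Decimal) -> Decimal:
    """Apply one rounding rule everywhere: half-up to the cent (an assumed policy)."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)

def reconcile(line_items: list[tuple[Decimal, str]], reported_total: Decimal, currency: str):
    """Check that every line uses the expected currency and that lines sum to the total."""
    wrong_currency = [i for i, (_, cur) in enumerate(line_items) if cur != currency]
    if wrong_currency:
        return False, f"lines {wrong_currency} are not in {currency}"
    computed = sum((round_money(amount) for amount, _ in line_items), Decimal("0"))
    if computed != reported_total:
        return False, f"lines sum to {computed}, report says {reported_total}"
    return True, "reconciled"

# Three lines of 33.335 each round to 33.34, so the rounded lines sum to 100.02.
# Rounding the raw sum (100.005) instead gives 100.01: a one-cent drift.
items = [(Decimal("33.335"), "USD")] * 3
print(reconcile(items, round_money(Decimal("100.005")), "USD"))
```

Every individual number in that report looks plausible, which is why spot-checks miss this kind of drift. Whether to round per line or on the total is a specification decision; the check only enforces that the report follows one rule throughout.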
Should I Ask for the Answer or the Method?
Ask for the method, then execute it separately.
Why the method is the better target
When you ask a model for a final number, you are asking it to do its weakest job. When you ask for the method (the formula, the steps, the expression), you are asking for its strongest job: structuring a problem. The method is also auditable in a way a bare number never is.
How this changes the prompt
- Request the calculation as an expression or code rather than the result.
- Ask the model to name the formula and the inputs it plugged in.
- Run the method through a deterministic tool and let the model interpret the output.
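One possible shape for such a prompt contract is a small JSON object that names the formula and its inputs. The sketch below assumes that format and a fixed registry of formulas the application is willing to run; the schema, the formula names, and `run_method` are illustrative rather than any standard.

```python
import json
from decimal import Decimal

# Registry of formulas the application is willing to run (names are illustrative).
FORMULAS = {
    "simple_interest": lambda p: p["principal"] * p["rate"] * p["years"],
    "percent_change": lambda p: (p["new"] - p["old"]) / p["old"] * 100,
}

def run_method(method_json: str) -> Decimal:
    """Execute a model-proposed method: a formula name plus its named inputs."""
    # Parse every number as Decimal so no precision is lost on the way in.
    method = json.loads(method_json, parse_float=Decimal, parse_int=Decimal)
    name = method["formula"]
    if name not in FORMULAS:
        raise ValueError(f"unknown formula: {name}")
    return FORMULAS[name](method["inputs"])

# Hypothetical model output: the method, with no final figure.
model_output = """
{"formula": "simple_interest",
 "inputs": {"principal": 12000, "rate": 0.045, "years": 3}}
"""
print(run_method(model_output))
```

A reviewer can read that object and see at a glance whether the right formula and the right inputs were chosen, which is exactly the audit a bare number does not allow.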
How Do I Debug a Numerical Prompt That Is Wrong?
Trace the error to one of two distinct causes.
Planning fault versus computation fault
Almost every wrong number comes from one of two sources: a flawed plan, where the model chose the wrong formula or misread the problem, or a flawed computation, where the plan was right but the arithmetic was botched. These have completely different fixes, so distinguishing them is the whole job.
The diagnostic routine
- Inspect the model's stated method. If the method is wrong, the fault is in planning, and clearer instructions or examples help.
- If the method is right but the number is wrong, the fault is in computation, and the fix is to route the arithmetic to a tool.
- Keep the two separated in your prompt so this diagnosis is fast rather than a guessing game.
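The routine reduces to two comparisons once you have three numbers for a failing case: the known-correct answer, the result of executing the model's stated method deterministically, and the number the model actually reported. A minimal sketch, with `diagnose` and its tolerance as illustrative choices:

```python
from decimal import Decimal

def diagnose(expected: Decimal, method_result: Decimal, stated: Decimal,
             tolerance: Decimal = Decimal("0.005")) -> str:
    """Classify a numerical failure as a planning fault, a computation fault, or both."""
    # Plan check: does the model's method, executed exactly, give the right answer?
    plan_ok = abs(method_result - expected) <= tolerance
    # Arithmetic check: does the model's number match its own method?
    arithmetic_ok = abs(stated - method_result) <= tolerance
    if plan_ok and arithmetic_ok:
        return "ok"
    if plan_ok:
        return "computation fault: route the arithmetic to a tool"
    if arithmetic_ok:
        return "planning fault: fix instructions or examples"
    return "both: fix the plan first, then route the arithmetic to a tool"

# Right formula, wrong number reported.
print(diagnose(Decimal("1620"), Decimal("1620"), Decimal("1602")))
# Wrong formula, computed faithfully.
print(diagnose(Decimal("1620"), Decimal("1713.99"), Decimal("1713.99")))
```

This only works if the prompt makes the model state its method separately from its answer, which is the reason to keep the two apart in the first place.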
Frequently Asked Questions
Is there a single best model for numerical tasks?
The better question is which architecture, not which model. A capable model paired with tool execution and verification will beat a stronger model used freehand. Choose for reasoning quality, then add deterministic computation around it.
Can I trust the model to verify its own arithmetic?
Only if the verification is an independent re-derivation, ideally in a fresh context using a different method. A simple "is this right?" tends to elicit agreement with the prior answer rather than a genuine check.
Does a lower temperature improve accuracy?
It improves consistency, not accuracy. You will get the same answer more often, which is useful for reproducibility, but if that answer is wrong you will now get it wrong reliably.
How should I present numbers the model produced?
Compute first, format last, and keep the underlying expressions available. Polished formatting makes unverified numbers look trustworthy, so never let presentation run ahead of verification.
What is the cheapest reliability win I can adopt today?
Move actual computation to code. Have the model emit the calculation rather than the answer, then execute it. This one change eliminates the largest class of silent numerical errors.
How do I test numerical prompts before shipping?
Build a small evaluation set with known-correct answers, run the prompt many times against it, and track accuracy and consistency. Treat the prompt like code that needs tests, because it is.
Key Takeaways
- Numbers fail because digits are predicted, not computed; correctness tracks training frequency.
- Showing work improves the plan far more than it guarantees the arithmetic, so verify the numbers separately.
- Route any consequential calculation through code and reserve the model for interpretation.
- Decompose multi-step math and validate at every checkpoint to stop error propagation.
- Reliability is a measured property across many runs, never a single successful demo.
- Units, currency, and rounding are specification problems; state them explicitly and reconcile totals.