We audited our own AI against the research. It was over-prescribing.

The fear people bring to an AI coach is recklessness. Too much load, too soon, no respect for recovery. It is a reasonable fear, and it is the one we built most of our guardrails against.

A few weeks ago we audited those guardrails. Yuge's program engine runs every generated program through a set of science rules before anything reaches the lifter — weekly volume per muscle, deload cadence, push-to-pull balance, rep ranges by goal, exercise selection. Sixteen of those rules were put to the same test, one at a time, against the research library the product is anchored in: Schoenfeld, Helms, Israetel, Contreras, Wendler. One question per rule. Does the literature actually say this, or did we write a rule that merely sounds like the literature?

The result: one rule the literature contradicts outright. Five rules stricter than the evidence. One rule that turned out to be too gentle. Nothing reckless.

The engine's failure mode was over-prescription, and we think that is worth a post, because it is probably the default failure mode of any system that turns training science into code.

What held up

Most rules passed, and a few passed with a precision that was genuinely satisfying to see.

The engine reserves cluster sets and similar intensity techniques for lifters with at least two years of training behind them. That number is not ours: Helms defines an advanced lifter as someone with two or more years of continuous, structured training, and reserves techniques like cluster sets and drop sets for that stage.

The engine caps how much of a program is taken to failure: AMRAP sets are limited to 30% of weekly working sets. Contreras's "rule of thirds" puts about a third of working sets at or near failure, with the rest kept further from it, and Helms keeps failure work selective, usually the last set of an exercise. Our 30% cap is the same idea, rounded down.
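The cap reduces to a ratio check. What follows is a minimal sketch rather than the engine's code: it assumes every working set carries a flag saying whether it is taken to failure, and the names `WorkingSet`, `to_failure`, `failure_share` and `within_failure_cap` are hypothetical.

```python
from dataclasses import dataclass

FAILURE_CAP = 0.30  # share of weekly working sets allowed at failure (figure from the post)


@dataclass
class WorkingSet:
    exercise: str
    to_failure: bool  # hypothetical flag: AMRAP or otherwise taken to failure


def failure_share(week: list[WorkingSet]) -> float:
    """Fraction of the week's working sets that are taken to failure."""
    if not week:
        return 0.0
    return sum(s.to_failure for s in week) / len(week)


def within_failure_cap(week: list[WorkingSet], cap: float = FAILURE_CAP) -> bool:
    """True when failure work stays at or under the cap."""
    return failure_share(week) <= cap
```

With ten weekly working sets, three sets to failure sit exactly on the cap and pass; a fourth tips the week over it.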

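Volume targets scale with training age: beginners get about three-quarters of the intermediate baseline, advanced lifters about 120%. Helms's volume guidelines step up with training age in the same way: roughly 10–12 weekly sets per muscle for novices, 13–15 for intermediates, and 16–20 for advanced lifters. Measured against the middle of the intermediate range, that puts novices near 80% and advanced lifters near 130%, the same direction as our multipliers and a similar size.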
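The comparison can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the multipliers are the ones quoted above, but the intermediate baseline of 14 weekly sets is our illustrative choice (the midpoint of the 13–15 range), "beginner" is treated as the same stage as "novice", and the function names are hypothetical.

```python
# Multipliers from the post, applied to an intermediate baseline.
VOLUME_MULTIPLIER = {"beginner": 0.75, "intermediate": 1.0, "advanced": 1.2}

# Weekly sets per muscle as quoted from Helms in the post (novice mapped to beginner).
PUBLISHED_WEEKLY_SETS = {"beginner": (10, 12), "intermediate": (13, 15), "advanced": (16, 20)}


def weekly_set_target(training_age: str, intermediate_baseline: float = 14.0) -> float:
    """Scale the intermediate baseline by training age. The 14-set default is an assumption."""
    return intermediate_baseline * VOLUME_MULTIPLIER[training_age]


def inside_published_range(training_age: str, intermediate_baseline: float = 14.0) -> bool:
    """Check whether the scaled target lands inside the quoted range for that stage."""
    low, high = PUBLISHED_WEEKLY_SETS[training_age]
    return low <= weekly_set_target(training_age, intermediate_baseline) <= high
```

With that baseline, all three scaled targets fall inside the quoted ranges. A different baseline could move the beginner or advanced target outside them, which is the kind of drift an audit is meant to catch.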

And the deload week — 40% fewer sets, every set two RPE points easier — landed almost exactly on the published prescription. That one has its own post.
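As a transformation on a single prescription, the deload looks roughly like this. It is a sketch only: the 40% cut and the two-point RPE drop come from the paragraph above, while rounding to whole sets, the one-set floor and the `deload` function itself are assumptions made for illustration.

```python
def deload(sets: int, rpe: float, set_cut: float = 0.40, rpe_drop: float = 2.0) -> tuple[int, float]:
    """Return the deload-week (sets, RPE) for one exercise prescription.

    Sets are cut by `set_cut` and rounded to a whole number, never below one
    (the rounding and the floor are assumptions). RPE drops by `rpe_drop` points.
    """
    reduced_sets = max(1, round(sets * (1.0 - set_cut)))
    return reduced_sets, rpe - rpe_drop
```

Five sets at RPE 8 become three sets at RPE 6, and ten sets at RPE 9 become six sets at RPE 7.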

The rule the literature contradicts

The validator required at least three exercises on every training day, and treated a violation as critical — the kind of finding that forces a program to be regenerated.

It sounds harmless. Of course a training day has three exercises.

Except: Helms's three-day novice powerlifting template builds entire training days out of two lifts — squat and bench one day, bench and deadlift another. Wendler argues for minimalist sessions built around two main movements — quality over quantity. Israetel constrains sessions by total working sets and warns against cramming too many distinct exercises into a day, which is the opposite concern. Nobody in the library sets a floor on exercise count. Day structure is expressed through volume and movement coverage, not a minimum number of rows on the whiteboard.

The practical cost of the rule: handed a legitimately minimalist strength day, the engine would bolt accessory work onto it to satisfy a constraint no coach ever wrote down. We demoted it from a critical violation to an informational note the same day the audit ran.
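In code, the demotion is a one-word change to the severity a finding carries. The sketch below is illustrative and not the validator's real interface: `Severity`, `Finding`, `check_exercise_count` and `must_regenerate` are hypothetical names, and it assumes that only critical findings trigger regeneration, as described above.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    INFO = "info"          # surfaced as a note, program ships as-is
    CRITICAL = "critical"  # forces the program to be regenerated


@dataclass
class Finding:
    rule: str
    severity: Severity
    message: str


MIN_EXERCISES_PER_DAY = 3


def check_exercise_count(day_name: str, exercises: list[str]) -> list[Finding]:
    """Note (but do not reject) days with fewer than three exercises."""
    if len(exercises) >= MIN_EXERCISES_PER_DAY:
        return []
    message = f"{day_name} has {len(exercises)} exercises; unusual but valid for minimalist templates"
    # Before the audit this finding would have been Severity.CRITICAL.
    return [Finding("min_exercises_per_day", Severity.INFO, message)]


def must_regenerate(findings: list[Finding]) -> bool:
    """Only critical findings send a program back for regeneration."""
    return any(f.severity is Severity.CRITICAL for f in findings)
```

A squat-and-bench day still produces a finding, so the information is not lost, but the program is no longer sent back to have accessories bolted on.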

The pattern: averages encoded as laws

Five more rules were stricter than the evidence, and all five failed the same way — a single uniform threshold where the literature holds a range that depends on goal, split, or muscle.

Rep ranges. The hypertrophy validator flagged programs where most sets fell outside 5–15 reps. The effective rep range for hypertrophy is roughly 5 to 30 reps, provided sets are taken with high effort. We were flagging the entire top half of the legitimate range — every twenty-rep leg press, every high-rep glute block. The band is now 5–30.
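The effect of widening the band is easy to see in a sketch. Two assumptions here: "most sets" is read as more than half, and the check takes a flat list of rep targets, one per set. `mostly_outside_band` is a hypothetical name.

```python
OLD_BAND = (5, 15)
HYPERTROPHY_REP_BAND = (5, 30)  # band after the audit


def mostly_outside_band(reps_per_set: list[int], band: tuple[int, int] = HYPERTROPHY_REP_BAND) -> bool:
    """True when more than half of the sets fall outside the inclusive rep band."""
    if not reps_per_set:
        return False
    low, high = band
    outside = sum(1 for reps in reps_per_set if not low <= reps <= high)
    return outside / len(reps_per_set) > 0.5
```

A block of three twenty-rep sets plus one twelve-rep set is flagged under `OLD_BAND` and passes under the widened band.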

Specialisation volume. A rule flagged any muscle receiving more than 40% of a program's total weekly sets. But concentrating volume is what a specialisation phase is: Israetel describes specialisation phases that deliberately push a focus muscle toward its maximum recoverable volume while the rest of the body drops to maintenance volumes, and Contreras's glute-focused programming routinely prescribes thirty to forty weekly sets of glute work. A rule that flags every specialisation block is not a safety rule — it is a bias toward generic programs. The check now exempts declared focus muscles.
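A minimal sketch of the exemption, assuming the program exposes weekly sets per muscle and a declared set of focus muscles. The 40% cap is from the text; `over_concentrated` and its parameters are hypothetical names.

```python
VOLUME_SHARE_CAP = 0.40  # maximum share of weekly sets for any non-focus muscle


def over_concentrated(
    weekly_sets: dict[str, int],
    focus_muscles: frozenset[str] = frozenset(),
    cap: float = VOLUME_SHARE_CAP,
) -> list[str]:
    """Return muscles above the share cap, skipping declared focus muscles."""
    total = sum(weekly_sets.values())
    if total == 0:
        return []
    return sorted(
        muscle
        for muscle, sets in weekly_sets.items()
        if muscle not in focus_muscles and sets / total > cap
    )
```

A block with thirty glute sets out of sixty is flagged when no focus is declared and passes once glutes are declared as the focus muscle.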

Compound-to-isolation ratio. Strength programs had to be 70% compound; hypertrophy programs 40%. Both floors were too high. Helms programs dedicated accessory hypertrophy work inside strength blocks, around a quarter to a third of total volume in the six-to-fifteen rep range, and serious bodybuilding programming is intentionally isolation-heavy. The floors are now 50% and 25%.
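Sketched as code, the floor is a share-of-sets comparison keyed by goal. Whether the engine measures the share in sets or in exercises is not stated above; this sketch assumes sets, and `compound_share` and `meets_compound_floor` are hypothetical names.

```python
OLD_FLOOR = {"strength": 0.70, "hypertrophy": 0.40}
COMPOUND_FLOOR = {"strength": 0.50, "hypertrophy": 0.25}  # floors after the audit


def compound_share(exercises: list[tuple[str, bool, int]]) -> float:
    """Share of weekly sets on compound lifts. Items are (name, is_compound, sets)."""
    total = sum(sets for _, _, sets in exercises)
    if total == 0:
        return 0.0
    compound = sum(sets for _, is_compound, sets in exercises if is_compound)
    return compound / total


def meets_compound_floor(
    goal: str,
    exercises: list[tuple[str, bool, int]],
    floors: dict[str, float] = COMPOUND_FLOOR,
) -> bool:
    """True when the compound share is at or above the floor for the goal."""
    return compound_share(exercises) >= floors[goal]
```

An isolation-heavy hypertrophy program with 30% of its sets on compounds fails against `OLD_FLOOR` and passes against the current floors.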

Push–pull pairing. The validator wanted pulling work in any session that contained meaningful pushing. That is correct for full-body and 5/3/1-style templates — and wrong for every push/pull/legs split on earth. Helms's novice bodybuilding split programs a dedicated push day with no pulling in it at all — the pulls arrive on their own day later in the week. The rule now reads the program's weekly balance before it judges a single day.
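One way to express "weekly balance first, single day second" is sketched below. It is an illustration, not the shipped rule: the weekly pull-to-push ratio of 1.0 is a placeholder we chose for the example, and `unpaired_push_days` and the per-day `{"push": n, "pull": n}` shape are hypothetical.

```python
def unpaired_push_days(
    week: dict[str, dict[str, int]],
    min_weekly_pull_to_push: float = 1.0,  # placeholder threshold, an assumption
) -> list[str]:
    """Return days with pushing but no pulling, only if the week itself is unbalanced.

    `week` maps a day name to set counts such as {"push": 12, "pull": 0}.
    """
    push = sum(day.get("push", 0) for day in week.values())
    pull = sum(day.get("pull", 0) for day in week.values())
    if push == 0 or pull / push >= min_weekly_pull_to_push:
        return []  # balanced across the week, so no single day is judged
    return sorted(
        name
        for name, day in week.items()
        if day.get("push", 0) > 0 and day.get("pull", 0) == 0
    )
```

A push/pull/legs week with twelve pushing sets on one day and twelve pulling sets on another returns no findings. A week that pushes far more than it pulls still has its pull-free push days reported.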

Movement patterns. Every program had to cover all six fundamental movement patterns, including vertical pushing and pulling. Strength-focused templates legitimately drop the verticals — Wendler's assistance work is bucketed as push and pull without distinguishing horizontal from vertical at all. Strength programs are now validated against the four patterns that actually define them.

The rule that was too gentle

One finding ran the other way. Our mesocycle effort ramp climbs about one and a half RPE points from the first week to the last — start around 7, finish around 8.5. The literature is steeper: Israetel's hypertrophy mesocycles ramp from roughly RPE 5–6 in week one to RPE 9–10 in the final week, and Helms and Contreras both run steeper ramps than ours as well.

For a beginner, the gentler ramp is defensible. For an advanced lifter it leaves stimulus on the table — the opposite of the over-prescription pattern, produced by the same root cause: one number where the literature wants a range conditioned on training age. This one has not shipped yet. The likely fix is a ramp that scales with experience.
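Since this fix has not shipped, the following is only a sketch of the shape it could take: a linear ramp whose endpoints are parameters rather than constants. The function name `rpe_ramp`, the linear interpolation and the mesocycle lengths used in the example are all assumptions.

```python
def rpe_ramp(start: float, end: float, weeks: int) -> list[float]:
    """Linearly interpolate a target RPE for each week of a mesocycle."""
    if weeks < 1:
        raise ValueError("a mesocycle needs at least one week")
    if weeks == 1:
        return [start]
    step = (end - start) / (weeks - 1)
    return [round(start + step * week, 2) for week in range(weeks)]


# The current ramp from the post: about 7 to about 8.5.
current = rpe_ramp(7.0, 8.5, weeks=4)

# A steeper ramp using the midpoints of the ranges quoted above (5-6 up to 9-10).
steeper = rpe_ramp(5.5, 9.5, weeks=5)
```

Making the endpoints arguments means a scaled version only has to choose `start` and `end` by training age; what those values should be is the open question.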

Why this is the default failure mode

Turning research into code means every judgment call becomes a number. The literature is full of "it depends" — on goal, on split, on training age, on which muscle. Code wants a threshold. The path of least resistance is to take the average case and enforce it everywhere.

The result is not dangerous, which is exactly why it is insidious. Nobody gets hurt by a validator that quietly adds a third exercise to a minimalist strength day, or flags a glute specialisation block, or trims a twenty-rep set back to twelve. They just get a slightly more generic program than the one they should have had — sanded toward the middle, with no error message, ever, for anyone.

A reckless rule announces itself the first time someone gets hurt by it. An over-prescriptive rule never announces itself at all. The only way to find it is to hold each rule up against its source and ask whether the source actually says that. So that is now part of how the engine is maintained: the rules carry their citations, and the citations get audited.
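A rule that carries its citations can be as simple as a record with a sources field, which makes an uncited rule something a script can find. This is a sketch with hypothetical names (`Rule`, `sources`, `rules_without_sources`); the example rules paraphrase ones described in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    description: str
    sources: tuple[str, ...] = ()  # authors or works the threshold is drawn from


def rules_without_sources(rules: list[Rule]) -> list[str]:
    """Names of rules that cite nothing and so cannot be audited against a source."""
    return [rule.name for rule in rules if not rule.sources]


RULES = [
    Rule("intensity_techniques_min_training_age",
         "Cluster sets and drop sets need two or more years of training",
         sources=("Helms",)),
    Rule("min_exercises_per_day",
         "At least three exercises on every training day"),
]
```

Here `rules_without_sources(RULES)` picks out the three-exercise floor, the one rule in this post that nobody in the library wrote down.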

What changed

For the record, the fixes that shipped from this audit:

The three-exercise-per-day floor is no longer a critical violation.
The hypertrophy rep-range band widened from 5–15 to 5–30.
Declared focus muscles are exempt from the volume-distribution cap, so specialisation blocks validate.
Compound floors dropped from 70% to 50% for strength and 40% to 25% for hypertrophy.
Per-day push–pull pairing is now judged in the context of the weekly split.
Strength programs are no longer forced to include vertical pressing or pulling.
Pre-planned beginner deloads were removed entirely — the literature handles novice fatigue reactively, which we cover in the deload post.

The RPE ramp is still open.

If an AI coach tells you it is anchored in the literature, the question worth asking is the one we asked ours: show me the rule, show me the source, and show me what happened when they disagreed.

References

When to deload, and how much: what the evidence actually supports

Anchored in the literature. Adaptive in the gym.

How Yuge handles training interventions