AI that follows the rules — and knows when not to
Two live demos you can try now, and the recipe they are built on
Most attempts to make an AI follow the rules pour the whole rulebook into it and hope it complies. It does not, and that can be measured. Here are two running demos built on the opposite principle: they look up the one relevant rule when it is needed, answer only when the source covers the question, and hold back or refer to a human when it does not. The goal is not an AI that never fails. It is one that knows where its limit is.
Try them
Both run on an open language model (Qwen2.5-7B) and are free to try. They are demonstrations of the research, not finished products.
- The compliance assistant — a staff-handbook assistant. Ask it about leave, data, sick pay. It answers only from the handbook, and if you ask something the handbook does not cover, it says so and points you onward instead of inventing an answer.
- The counselling demo — a demonstration of a safety-first conversation architecture. It explains the mechanics behind change, refers on anything resembling diagnosis, medication, or crisis, and keeps crisis contacts visible at all times. It is explicitly not a treatment provider.
The one idea: optimal, not perfect
Following a rule is an internal race between routes, like any other action. The rule's route has to win the races it should win and yield the ones it should yield. An assistant can fail at this in two ways. It can ignore the rule under pressure, so the rule loses a race it should have won. Or it can follow the rule so rigidly that it refuses legitimate requests and becomes useless. An assistant tuned to "never break a rule" lands in the second ditch. The right place is the middle, where it follows the rule when it should and gives way when it should.
The same holds for holding back. A model that answers everything fabricates confident answers when the source is missing. A model trained to always hold back ends up withholding what it actually knows. What works is the selective middle: answer what is covered, hold back on what is not. That is what the demos try to hit.
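One way to put numbers on the two ditches, as a minimal sketch: it assumes each test question has been labelled by hand with whether the source covers it and whether the assistant answered. The function name and the labelling scheme are illustrative assumptions, not the evaluation used for the demos.

```python
def failure_rates(outcomes):
    """Return (fabrication_rate, over_withholding_rate).

    outcomes: list of (covered, answered) boolean pairs, one per question.
    Hypothetical labelling scheme, for illustration only.
    """
    answered_when_covered = [a for c, a in outcomes if c]
    answered_when_uncovered = [a for c, a in outcomes if not c]

    # Fabrication: the assistant answered although the source had nothing.
    fabrication = (
        answered_when_uncovered.count(True) / len(answered_when_uncovered)
        if answered_when_uncovered else 0.0
    )
    # Over-withholding: the assistant held back although the source covered it.
    over_withholding = (
        answered_when_covered.count(False) / len(answered_when_covered)
        if answered_when_covered else 0.0
    )
    return fabrication, over_withholding


# One covered and one uncovered question, under three policies.
always_answer = [(True, True), (False, True)]
always_hold_back = [(True, False), (False, False)]
selective = [(True, True), (False, False)]
```

An assistant that always answers scores badly on the first rate, one that always holds back scores badly on the second, and only the selective policy keeps both low at once.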
How they are built — the recipe
The architecture is the same in both, and every choice has a reason in the research.
- The rules are looked up, not trained in — the single rule is fetched into context when it is needed (this is RAG, retrieval-augmented generation). That way you can see which rule was used on each turn, update it without retraining, and keep it legible. Fine-tuning hides a rule in the weights, where it is both buried among everything else and stripped of calibration.
- Training is for the behaviour, not the facts — what you fine-tune in is the disposition: to check the source, to flag doubt, to refer rather than guess. Behaviour is what fine-tuning is good at; individual facts are not.
- Every rule is learned in many wordings — a rule known in only one phrasing fails the moment the user asks differently, and when it does match it fires rigidly, as a reflex. Wider coverage buys both more robust recall and less rigidity.
- It answers only when the answer is covered — if no source covers the question, it does not invent one. The most dangerous thing in a lookup system is a confident answer to something that was not there.
- It tries to rephrase before giving up — if your everyday wording misses the handbook's terms, it translates the question and searches again, and shows you that it did. Only if it still finds nothing does it refer you onward.
- Safety comes before everything else — in the counselling demo the crisis and referral rules are deterministic: on anything resembling crisis it delivers the right contacts without improvising, and it does not treat anyone itself.
- Everything can be checked automatically — because the product is software end to end, every rule can be checked against every conversation, down to the individual answer. A human advisor's rule-adherence can only be sampled after the fact. The assistant's can be checked exhaustively.
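The order of the checks in the recipe can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the demos' code: the handbook entries, the keyword-overlap lookup, the synonym table, the crisis terms, and the reply texts are all made-up stand-ins. A real system would use a proper retriever and let the language model phrase the answer from the fetched rule.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical handbook: rule id -> (keywords, rule text). Invented content.
HANDBOOK = {
    "leave-01": ({"annual", "leave", "holiday"}, "Example rule text about annual leave."),
    "sick-01": ({"sick", "pay", "illness"}, "Example rule text about sick pay."),
}
# Hypothetical mapping from everyday words to handbook terms.
SYNONYMS = {"vacation": "leave", "ill": "sick", "wages": "pay"}
# Hypothetical crisis terms and fixed reply; a real list needs expert review.
CRISIS_TERMS = {"suicide", "overdose"}
CRISIS_REPLY = "Please contact emergency services or a crisis line now."


@dataclass
class Reply:
    kind: str                      # "crisis", "answer", or "refer"
    text: str
    rule_id: Optional[str] = None  # which rule was used, kept for auditing
    rephrased: bool = False        # True if the question had to be translated


def tokens(text):
    return {w.strip(".,?!").lower() for w in text.split()}


def lookup(words):
    """Return the id of the rule with the largest keyword overlap, or None."""
    best_id, best_score = None, 0
    for rule_id, (keywords, _) in HANDBOOK.items():
        score = len(words & keywords)
        if score > best_score:
            best_id, best_score = rule_id, score
    return best_id


def respond(question):
    words = tokens(question)
    # 1. Safety first: a deterministic check, with no improvised wording.
    if words & CRISIS_TERMS:
        return Reply("crisis", CRISIS_REPLY)
    # 2. Look the rule up instead of relying on trained-in knowledge.
    rule_id = lookup(words)
    rephrased = False
    # 3. Rephrase once into handbook terms before giving up.
    if rule_id is None:
        translated = {SYNONYMS.get(w, w) for w in words}
        if translated != words:
            rule_id = lookup(translated)
            rephrased = rule_id is not None
    # 4. Answer only when covered; otherwise refer onward.
    if rule_id is None:
        return Reply("refer", "The handbook does not cover this. Please ask a human.")
    return Reply("answer", HANDBOOK[rule_id][1], rule_id, rephrased)


def audit(log):
    """Check every logged reply, not a sample; return the violations."""
    return [
        r for r in log
        if (r.kind == "answer" and r.rule_id not in HANDBOOK)
        or (r.kind == "crisis" and r.text != CRISIS_REPLY)
    ]


log = [respond(q) for q in (
    "How much annual leave do I get?",
    "How many vacation days are there?",
    "Can I expense a parking ticket?",
)]
```

Because every reply records its `kind`, the `rule_id` it used, and whether it was `rephrased`, a check like `audit` can run over the full conversation log rather than a sample of it.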
The research behind it
The demos build on a set of findings about how language models work inside:
- Compliance is behaviour, not information — why a rule buried among many stops working, and why the rule should be looked up rather than trained in.
- Fine-tuning installs dispositions, not data — what training actually puts in: a way of answering, not a register of facts.
- ICL as working memory, FT as long-term memory — why knowledge in context stays calibrated, while knowledge trained into the weights loses its calibration.
A full paper on the architecture is in preparation.