Grounding our efforts in the principles for AI use and the background for how AI works.
Preston Magouirk · DC CAP Enterprise AI Leadership Pilot · The mechanics behind four weeks of reps.
Context. Tokens. Memory. Models. Effort. None of these show up in a comparable way anywhere outside this room. So they feel mysterious, even when you have been using them daily for four weeks.
These are not theoretical anymore. You have been experiencing them. Today we put names on what you have already lived.
The 4Ds are how we operate. The 3As are the shape your delegation takes. The mechanics underneath them are why those moves work — or fail.
You have done the reps. The next eight slides put a real-world parallel beside each mechanic so the names stick.
An executive walks into a board meeting with whatever they reviewed before the door closed. The financials they pulled. The director questions from last quarter. The funder dossiers their team sent over. Whatever they didn't open isn't available when a board member asks a curveball.
What's loaded into the prep window decides what they can answer in the room. Everything else is a best guess.
The window holds what you've pasted, what Claude has said back, and any reference files loaded. Whatever's not in the window can't shape the answer.
Treat each chat like the prep window for one decision. Load names, dates, voice rules, audience, format. When the task ends, close the chat. For knowledge that should persist across meetings, use a Project.
The takeaway: Description quality is bounded by what sits in the window.
A patient walks in with chest pain. With no medical history, that symptom could be heartburn, anxiety, a heart attack, or a dozen other things. The doctor guesses.
Add history — cardiac stent six months ago. Add family — father died of MI at 52. Add behavior — ran six miles before symptoms started. Each piece of context narrows the diagnosis to the one that fits.
"Write me an email to a parent" gets you the average of the internet. Add the named family, the specific deadline, the cycle date, the voice rules — and Claude narrows to the answer that fits DC CAP.
If you keep retyping the same context every week, that context belongs in a Project. The doctor doesn't ask for your history every time. They have the chart.
The takeaway: Empty context, generic answer. Rich context, sharp answer.
External memory: law school, the bar exam, fifteen years of practice. Fixed. Brought to every case.
Internal memory: the case file — complaint, depositions, discovery, prior motions. Variable. Different in every matter.
Strong external + empty case file = malpractice. Empty external + full file = rookie. Real lawyers run both.
Training data is Claude's external memory. What the model learned from the public internet through its training cutoff. Fixed.
Context is Claude's internal memory. What's loaded in this chat right now. Your data, voice rules, files. You decide what goes in.
Empty context = Claude guessing from generic training. Asking about events after the cutoff = Claude inventing.
The takeaway: External memory is what the model knows about the world. Internal memory is what you've told it about your work. Both layers, every time.
Precision: of the emails marked as spam, what percent are spam? Low precision = important emails in the spam folder. False positives.
Recall: of all the spam coming in, what percent does the filter catch? Low recall = spam piling up in the inbox. False negatives.
Push precision up, recall drops. Push recall up, precision drops. Every filter trades them off.
Precision failure = hallucination. The model confidently "remembers" something and is wrong. The Mata case was a precision failure — six fake citations asserted as real.
Recall failure = forgetting what you told it earlier. The model should have used something from earlier and didn't. Drift, mechanically.
Verify catches precision failures. Re-anchoring catches recall failures.
The takeaway: Precision and recall are the two ways memory lets you down. Both have fixes — once you can name which one you're seeing.
Microsoft launched a chatbot named Tay on Twitter on March 23, 2016. Tay was designed to learn from conversation. Within 16 hours, trolls had steered her into producing racist and offensive content. Microsoft pulled her offline.
The mechanism is the same one that happens in every long chat — quieter, slower. The longer a conversation runs, the more recent inputs dominate the original instructions. Newer signals crowd out older ones.
You correct Claude on something at message 5. By message 30, the same error is back. Tone shifts. Earlier rules dilute. That's drift.
The fix: start fresh when the trail feels off. Re-anchor inside long chats with a reminder. A Project loads the foundational context fresh every time you open a new conversation inside it.
The takeaway: When the answer thins, that's a Diligence signal — not a Claude failure.
A hospital matches case to clinician. A specialist for the complex procedure. A general practitioner for the routine visit. A triage nurse for the initial intake. Same building. Different jobs. Claude works the same way.
Deep multi-step reasoning. Slowest. Most expensive.
Highest-stakes drafting. Board memos. Funder LOIs. Cross-document audits. Anything where being wrong is costly. Don't use it for routine emails or simple summaries.
The DC CAP default. Strong reasoning, everyday speed.
Counselor emails. Partnership updates. Briefings. First drafts of grant narratives. Most Projects and most Skills. If you don't know which to pick, this is the answer.
Built for speed and volume.
Quick lookups. Classifications. Pulling structured data out of unstructured text. Don't use it for anything signed by a person or anything that needs voice.
The takeaway: Model selection is a Delegation decision. The right level of care for the case — not the most expensive option in the building.
Two New York lawyers used ChatGPT to write a personal injury brief. ChatGPT cited six prior cases — including Varghese v. China Southern Airlines — with fabricated judge names, quotes, and citation numbers.
Asked if the cases were real, ChatGPT said yes and pointed to "Westlaw and LexisNexis." None existed. Judge Castel sanctioned the lawyers $5,000. He called the analysis "gibberish."
Hallucination is when a model confidently states something untrue. A fabricated quote. A made-up statistic. A paper that doesn't exist.
It happens because LLMs are pattern completers. When they don't know, they generate text that sounds like the right kind of answer. Names, numbers, dates, citations are highest-risk.
The takeaway: The lawyers asked ChatGPT to verify itself. ChatGPT said yes. Don't ask the model to grade its own homework.
In its public launch ad, Bard was asked what JWST discoveries to tell a 9-year-old. Bard said JWST "took the very first pictures" of a planet outside our solar system.
Wrong. The first exoplanet image came from the European Southern Observatory's VLT in 2004. Reuters caught it the day of Google's Paris AI event.
Jake Moffatt asked Air Canada's chatbot about bereavement fares after his grandmother died. The chatbot told him he could apply for a refund within 90 days. The actual policy required pre-booking.
The BC Civil Resolution Tribunal sided with Moffatt: a company is liable for what its chatbot says. "Air Canada is responsible for all the information on its website."
The takeaway: Hallucination has a market price, a legal price, and a reputational price. Verify before you let it ship.
Before every takeoff, a commercial pilot runs two checks. The pre-flight checklist verifies every part works — engines, controls, fuel, instruments. The flight plan validates the route fits the mission — weather, weight, fuel reserves.
A plane in perfect condition flying into a thunderstorm is verified, not validated. A flawless plan in a plane with a broken altimeter is validated, not verified. Both mandatory under FAA rules.
Verify — are the facts true? Names, numbers, dates, citations. Cross-check anything material. Don't ask the model to grade itself.
Validate — does it fit the work? Right audience, right format, right tone. Read it as the audience before you send.
For high-stakes work, run the adversarial-audit Skill from Week 4 — or hand it to a teammate.
The takeaway: Diligence is two moves, not one. Design the gates into the workflow.
Vague: "Make us a brochure." Designer makes the choices you didn't make. Revisions take six rounds. The project misses deadline.
Sharp: "Trifold brochure, 8.5×11 folded to three panels. Audience: high school juniors. Reading level: grade 9. Logo at 1″ minimum. Five program features bulleted. A CTA at the bottom. Photos from our approved set. First draft Friday, two revision rounds, final by the 28th."
One round. The vendor delivers what was needed.
1. State the work, not the task.
2. Anchor in concrete facts — named partners, real numbers, real dates.
3. Tell Claude the audience and the format.
4. Iterate in dialogue. Show Claude what's wrong; don't start over.
5. Save what works.
Every great prompt is a SKILL.md waiting to be written. The more reps you put in on the five-part structure, the less you have to rewrite it next time.
The takeaway: Description is the most valuable move in this whole pilot.
When we get these right, two things happen at once. Work quality goes up. The error surface drops. Every new tool we pick up from here — Skills, automations, agents — amplifies both effects.
The new tools scale the quality when the basics are solid. They scale the errors when they're not. Same mechanism, two directions.
Walk us through a moment from this week. Pick one of the eight. The reasoning matters more than the polish.
Which of the eight did you wrestle with this week — context, memory, drift, model choice, hallucination, verify, validate, or the prompt itself?
What did you do? What changed when you got it right — or what showed up when you didn't?
What's in your capstone scope now? Where is a Skill or Project you're building this week — and what would a teammate need from it to run it cold?