Lesson 12 – LLM Training vs. ML Training¶

An Intern's Guide to Working with AI — Cortado Group

1. The Big Misconception¶

When most people hear "AI," they assume the system is actively learning as it talks to them — getting smarter with every conversation, adapting, improving. That's not how it works. Understanding this distinction will save you a lot of confusion.

Machine Learning (ML) — What happened before you arrived

Machine learning is the process of training a model. Engineers feed enormous amounts of data into a system, and the system iterates — sometimes billions of times — asking itself: "Was that right? Was that wrong?" It gets corrected, it adjusts, it improves. Sometimes a human guides it ("no, that's wrong — try again"), sometimes it figures it out on its own. Either way, this process produces something: a model.

The LLM — What you're actually using

LLM stands for Large Language Model. The M is the key word: model. That model was built long before you typed a single word. When you open ChatGPT, Claude, or any other AI tool, the training is already done. You are not training it. You are using the result of training that happened without you.

Think of it like a master chef. Machine learning is the 20 years of culinary school, cooking disasters, Michelin-star kitchens, and relentless practice. The LLM is the chef standing in front of you today, ready to cook. Your prompt is the order you place. You're not teaching the chef to cook — you're telling them what you want for dinner.

The practical implication: you cannot fundamentally change how an LLM thinks or permanently fix its behavior. You can guide it within a single conversation. But start a new conversation, and it's a blank slate — game reset, every time.

2. How LLMs Actually Think¶

Language in, language out

The LLM's one job is: based on the language I give you, generate language. That's it. It doesn't "understand" in the human sense. It works by identifying patterns across everything it was trained on and generating the most statistically appropriate response.

Tokens, not words

LLMs don't think in English words — they think in tokens. A token is a subword unit: not a full word, not a syllable, but a chunk of characters (typically 3–4 characters, roughly ¾ of a word). Tokens get mapped internally to concepts and ideas — that's where the semantic meaning lives. That's what makes LLMs multilingual: the concept stays the same regardless of which language expresses it. Ask it to explain something in Python, then in SQL, then in Japanese — it's the same idea, three different expressions.

This also explains why you can't get the exact same document back if you ask for a rewrite. The LLM doesn't store your words — it processes your ideas and regenerates language around them. Every regeneration is a new interpretation.

3. Temperature — The Creativity Dial¶

Most LLMs have a setting called temperature. It controls how much randomness the model accepts when generating a response.

Low heat: Low temperature → deterministic. Ask the same question twice, you get the same answer. This is intentional — low heat is how you get consistent, repeatable output. Use it when you need precision.
High heat: High temperature → more variety. The model will explore less obvious paths. Use this when brainstorming titles, openings, or options where variety is the point.

The tradeoff: high temperature gives you creativity, but some outputs will be bad. That's expected — and why you always need a way to evaluate what comes back.

Rule of thumb: Use high temperature to generate options, low temperature to produce the final output.

4. Context Windows — The LLM's Working Memory¶

Every time you send a message in a chat session, the entire conversation history goes along with it — every message you've sent, every response it's given, all the way back to the start. That accumulated history is the context window.

LLMs have a primacy and recency bias — they pay the most attention to the beginning and the end of the context. The middle of a long conversation gets underweighted. This is known as the "lost in the middle" problem. Your role definition and examples belong at the top (primacy). Your actual prompt — the specific thing you're asking for right now — belongs at the bottom (recency). Content buried in the middle of a long session is most likely to be ignored or misapplied.

The right backstory reinforces your intent and improves output. The wrong backstory (irrelevant, contradictory, or distracting) confuses it — especially when it lands in the middle.

Practical rules:

Keep context purposeful. Don't let unrelated discussion pollute the history.
If a conversation has gone off-track, start fresh rather than trying to override it.
Role definition and examples at the top. Prompt at the bottom. Nothing critical in the middle.

5. Examples Are Your Most Powerful Tool¶

You can't re-train the LLM. But you can show it what "good" looks like — and that goes a long way. This is called few-shot prompting, and it's the closest thing to training you'll have access to.

The pattern is simple: before giving it the actual task, give it examples of the right answer.

3 positive examples of what you want
1 negative example of what you don't want
Then: the actual prompt

The more examples you provide, the more likely the model mimics them. You can even use a duck-goose pattern — bad example, bad example, good example, good example — so it learns both what to avoid and what to aim for.

Important: this "training" only lives in the current conversation. It does not persist. Every new session, you start over. Build your examples into your workflow, not your memory.

6. Write Instructions Like You Mean It¶

Vague instructions produce vague results. "Make this better" tells the LLM almost nothing. You need to be explicit about what "better" means.

Instead of: "Fix the language in this email."

Try: "This email contains two profanities. Replace them with professional alternatives. Do not substitute one profanity for another. Do not rewrite any other part of the email."

The more constraints you layer on, the more likely the model will miss one — so be deliberate. Give only the constraints that matter, and make them unambiguous.

7. Give Your Bot One Job¶

LLMs follow the role they've been given. If you tell it "you are an email writer," and then ask it to "rewrite this paragraph," don't be surprised if you get an entire email back. That's what it thinks it's there to do.

The fix is simple: define the role precisely at the top of your conversation.

Author bot → writes. Give it full content to produce.
Editor bot → edits. Give it a specific chunk and specific instructions for changing it.
Scoring bot → evaluates. Give it output and a rubric; ask it to flag failures.

Support your role definition with examples that match the role. An editor bot should see examples of "here is a flawed paragraph / here is the corrected version" — not examples of full emails being written from scratch.

8. Subdivide and Conquer — Don't Regenerate the Whole Thing¶

One of the most common mistakes: something is wrong with part of the output, so you regenerate everything. This is almost always the wrong call. When you regenerate, the LLM will paraphrase, consolidate, and summarize — you will not get the original document back with one thing fixed. You'll get a new document that happens to fix the thing you flagged, while quietly mangling everything else.

The right approach:

Break the output into discrete chunks (paragraphs, sections, sentences).
Evaluate each chunk independently.
Send only the failing chunks back for revision, with specific instructions for what to fix.
Reassemble from the working pieces.

9. Parallel Retries — Don't Make Users Wait¶

When you have multiple failing chunks that need to be rewritten, fire them all at the same time — in parallel. Do not fix chunk 1, wait for it to finish, then fix chunk 2, wait, then fix chunk 3. That serial approach compounds wait time. If each retry takes 5 seconds and you have 3 chunks with up to 10 retries each, that's potentially 150 seconds of waiting — serially. In parallel, it's the length of the slowest single chunk.

In parallel is important. Say it again: in parallel.

10. Score, Threshold, and Accept "Good Enough"¶

Build a scoring engine for your outputs. Define what "good" means in measurable terms: no spam words, appropriate sentence length, no banned phrases, correct tone, etc. Run every output through it. If it fails, send it back. If it passes, you're done.

Key principles:

Set a threshold (e.g., 95 out of 100). Chasing 100 can be impossible and expensive.
Cap your retries. After N attempts, take the best score you achieved and move on.
Sometimes the content itself is the constraint (e.g., the word "meeting" is flagged as a spam word, but you can't remove "meeting" from an email about a meeting). Recognize when you've hit the ceiling and accept it.
Track scores across retries. Don't blindly retry — take the best result out of all attempts.

11. Model Cost vs. Quality Tradeoffs¶

Not all models are the same. Expensive models self-correct internally — they apply LLM-on-LLM evaluation before responding, catching their own errors before you see them. They're more likely to get it right on the first try. But they cost more and take longer.

A note on code generation and self-correction: when you see AI tools generate code, hit an error, and automatically fix it, that behavior comes from an agentic layer — a system built on top of the LLM that executes the code, reads the error output, and feeds it back in as context. The base LLM itself does not execute code. Agentic tools like Claude Code or ChatGPT's code interpreter add this loop. When you're working with a raw LLM via API without that layer, you are responsible for building the execution-and-feedback loop yourself.

Cheaper, simpler models are less capable individually — but they're cheap enough to run many times in parallel. If your success rate per attempt is 30%, but each attempt costs almost nothing, running 10 in parallel and taking the best result may outperform one expensive attempt.

The choice depends on your use case. For precision outputs where one attempt matters, pay for the better model. For high-volume, retry-heavy workflows, use cheaper models and let parallelism do the work.

One important caveat: a bad prompt can ruin even the best model. Model quality does not compensate for poor prompting.

12. Putting It All Together — The Prompt Stack¶

When you're building a prompt for a real task, structure it in layers:

Role definition. Tell the model exactly what kind of bot it is and what its job is.
Examples. 3 positive examples of the output you want. 1 negative example of what to avoid. These should closely match the format of the task.
The prompt. The actual request — specific, constrained, and referencing only what's needed.

Keep each session focused on one type of task. Mix roles and you get confused output. Start clean, stay narrow, and use your examples to anchor it.

Cortado Group / Cortado Labs | Based on internal build session, June 8, 2026