Updated 6/3
Long beards who craft systems-level software often speak of Undefined Behavior: if you break the rules of a language like C or C++ and do something the compiler assumes can never happen, the compiler is free to do crazy things.
Today, we use LLMs as a nondeterministic compiler. The vagueness of English means a prompt could generate any kind of code, and the programmer may not understand much of what comes out. It’s Undefined Behavior in a new form. Maybe the code compiles, maybe it doesn’t, but it doesn’t always do what it’s supposed to.
Perhaps this is how early assembly programmers felt about the first real compilers. They didn’t want to trust the machine to do what they could do better by hand. Eventually compilers got pretty good, but in languages like C that allow unsafe pointer access and manipulation, Undefined Behavior was still possible.
Nowadays we have languages like Rust, with very sophisticated compilers. What will be the equivalent for LLMs?
So how do we avoid this new form of Undefined Behavior? Unfortunately, Uncle Bob has the right answer: Test Driven Development. Tests, or Evals as the hipster AI researchers call them, are what guide the Agent/Model/LLM towards the right goal.
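As a rough sketch of what that test-first loop might look like: the tests are written before any code exists, and a generated candidate is only accepted once it passes them. Everything here is made up for illustration. The `slugify` task, the `passes_spec` and `first_passing` helpers, and the two hard-coded candidates (standing in for what a model might return for the same prompt) are all assumptions, and no real LLM API is called.

```python
from typing import Callable, List, Optional


def passes_spec(slugify: Callable[[str], str]) -> bool:
    """The tests come first: they are the spec any generated code must meet."""
    cases = [
        ("Hello World", "hello-world"),
        ("  Undefined   Behavior ", "undefined-behavior"),
        ("", ""),
    ]
    return all(slugify(text) == expected for text, expected in cases)


# Hypothetical stand-ins for two model completions of the same prompt.
CANDIDATES = [
    # Looks plausible, but mishandles repeated and leading/trailing spaces.
    "def slugify(s):\n    return s.lower().replace(' ', '-')\n",
    # Splits on any run of whitespace, so it satisfies the spec.
    "def slugify(s):\n    return '-'.join(s.lower().split())\n",
]


def first_passing(candidates: List[str]) -> Optional[int]:
    """Return the index of the first candidate that passes the spec, or None."""
    for index, source in enumerate(candidates):
        namespace: dict = {}
        try:
            # Illustration only: real model output should run in a sandbox.
            exec(source, namespace)
            if passes_spec(namespace["slugify"]):
                return index
        except Exception:
            # Code that fails to compile or crashes simply fails the eval.
            continue
    return None


if __name__ == "__main__":
    print(first_passing(CANDIDATES))
```

The first candidate is the kind of answer that compiles and looks right but quietly breaks on messy input. The tests reject it without anyone having to read it closely.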
Evals can be as simple as the prompts that you craft, or you can have one LLM evaluate the output of another, or they can affect the actual weights of the model via Reinforcement Learning. Either way, you shift your focus to Prompt Engineering instead of code-writing.
This means programmers are becoming technical writers (I’m glad I had to take that class in college). It’s an old joke that enterprise programmers only really write about 10 lines of code a day. The rest of the time they’re reviewing, discussing, and searching through the code.
This doesn’t appeal to me very much, so I’ve been holding out and only using models like o3 as a trusted advisor with super-googling abilities. If I need some help churning out boilerplate, I use an Agent, but for the most part I’m trying to keep typing the code out myself. We’ll see how long that lasts.