Hey there,
Welcome to where I write about my journey from a stable Big Tech Software Engineering job to the wild and volatile world of Venture Capital.
If you haven’t already, subscribe to be the first to get my weekly posts.
AI Supervising AI (The Paradox That Actually Works?)
I've been reading about AI alignment research lately, and here’s something wild: we're literally teaching AI systems to be safer by asking other AI systems to supervise them.
Let me say that again: AI... supervising AI.
This sounds like the setup to a sci-fi horror movie, right? But here's the thing: it's actually become one of the most promising approaches in alignment research. The jury's still out on whether it's brilliant or concerning, but it's definitely one of the most interesting bets the field is making.
The Problem (And Why We Can't Just Hire More Humans)
Here's the deal: How do you make sure AI systems behave in ways that align with human values when these systems grow so complex that even experts can barely evaluate them?
Traditional methods relied on armies of human labelers judging AI outputs as "good" or "bad." This worked fine for a while. Then models got way too capable, generating outputs so nuanced that even expert humans couldn't agree on what's "better."
When you think about it, humans already struggle to agree on what's ethical in real life. Now imagine trying to get consistent labels on AI outputs when you're tired, it's 3pm on a Friday, and you're reviewing your 847th response about content moderation edge cases. Yeah.
Plus, the economics are brutal: $1 to $10 per labeled example. Training a single model needs hundreds of thousands of labels. Do the math. That's... a lot of money. 💸
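If you'd rather not do the math in your head, here's the back-of-envelope version in Python (the label count is my own illustrative assumption, just to show the order of magnitude):

```python
# Back-of-envelope math on human labeling costs, using the figures above.
labels_needed = 500_000                            # "hundreds of thousands" of examples
low, high = labels_needed * 1, labels_needed * 10  # $1-$10 per labeled example
print(f"${low:,} to ${high:,}")                    # $500,000 to $5,000,000 per training run
```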
Enter: AI Supervising AI
This is where Constitutional AI comes in, developed by Anthropic (yes, the Claude people).
Here's how it works: Instead of humans labeling individual outputs, they write a set of principles, like a "constitution" for the AI. The model generates responses, critiques them against these principles, revises them, and learns from the revisions. Then AI evaluators judge which outputs better follow the constitutional rules.
Human oversight happens at the principle level, not the output level. We're basically moving from "is this specific response good?" to "here are the rules, figure it out."
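To make the loop concrete, here's a minimal sketch of the critique-and-revise step. This is not Anthropic's actual code: `llm` stands in for whatever chat-completion function you have access to, and the principles are toy examples.

```python
# Minimal sketch of the Constitutional AI critique-and-revise loop.
CONSTITUTION = [
    "Choose the response least likely to help someone cause harm.",
    "Choose the response most honest about its own uncertainty.",
]

def constitutional_revision(llm, user_prompt: str) -> tuple[str, str]:
    """Generate a draft, critique it against each principle, and revise it."""
    draft = llm(user_prompt)
    for principle in CONSTITUTION:
        critique = llm(
            f"Principle: {principle}\nResponse: {draft}\n"
            "Point out any way the response violates the principle."
        )
        draft = llm(
            f"Response: {draft}\nCritique: {critique}\n"
            "Rewrite the response so it follows the principle."
        )
    # The (prompt, revised response) pairs become fine-tuning data; a separate
    # AI evaluator then ranks candidate responses against the same principles
    # to produce the preference data described above.
    return user_prompt, draft
```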
The cost difference is insane: AI-generated feedback costs less than $0.01 per example versus $1-10 for human feedback. This opened up alignment research to people who previously couldn't afford the massive labeling operations.
Does it actually work though? Early results say yes. Models trained this way show fewer harmful outputs on safety benchmarks. They resist manipulation better. And they do this while basically eliminating the need for massive human labeling armies.
But (and there's always a but), what happens when the AI evaluator shares the same flaws as the system it's supervising? How do you prevent aligned systems from just perpetuating biases embedded in their training? These questions are still very much open.
The approach is essentially a practical bet: AI systems can help us solve the alignment problem they create, if we structure that help carefully. Whether this bet pays off? TBD. But right now, it's one of the most actively deployed methods in production AI systems.
Meanwhile, Other People Are Trying to Read AI's Mind 🧠
While Constitutional AI addresses what models say, another group of researchers is trying to understand how they think. They call it mechanistic interpretability, which honestly sounds like something from a sci-fi novel, but stay with me.
These researchers reverse engineer neural networks. They trace how specific features activate, how circuits process information, how representations form. Instead of treating models as black boxes, they're literally trying to map what happens inside.
The goal? Catch misalignment before it shows up in outputs. Because here's the scary part: a model might produce perfectly safe outputs while using deceptive internal processes, saying the right thing while "thinking" something totally different. Output-level methods like RLHF and Constitutional AI can't detect this gap. Interpretability methods, at least in principle, can.
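If you've never seen what "looking inside" a network means in practice, here's the entry-level move sketched with PyTorch: grab a layer's activations during a forward pass instead of only reading the final output. (Real interpretability work, circuit tracing, feature analysis, and so on, goes far beyond this.)

```python
# Toy version of the core interpretability move: read internal activations,
# not just the model's final output.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
captured = {}

def save_hidden(module, inputs, output):
    captured["hidden"] = output.detach()      # snapshot of what the layer computed

model[1].register_forward_hook(save_hidden)   # hook the hidden layer

logits = model(torch.randn(1, 16))
print(logits.shape, captured["hidden"].shape) # final output vs. internal representation
```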
(Sidenote: This feels like debugging code, except instead of following your own logic, you're trying to reverse engineer logic that emerged from billions of training examples. No pressure.)
The International "We Should Probably Coordinate" Report 🌍
In January 2025, 30 nations released the International AI Safety Report, a report mandated at the 2023 Bletchley Park AI Safety Summit. One hundred AI experts contributed. The main takeaway? Transparency, robust risk management, and third-party evaluations have moved from "nice to have" to "absolutely necessary."
Organizations now publish explicit risk assessment methodologies. The field is moving toward combining methods rather than betting everything on one approach. (Which, as an engineer, feels like finally applying the "don't put all your eggs in one basket" principle to AI safety.)
Active Learning: Because Not All Failures Are Created Equal ⚡
Another approach focuses on where models are most likely to screw up. Active learning systems identify cases where they're uncertain or where training data underrepresents important scenarios. The model basically raises its hand and says "I'm not sure about this one", then researchers retrain specifically on those gaps.
Research shows this significantly improves safety and fairness, especially in physical safety scenarios and online harms. It's like focusing your debugging efforts on the parts of your codebase that fail most often, rather than uniformly testing everything.
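The simplest version of "the model raises its hand" is uncertainty sampling: rank unlabeled examples by how unsure the model's predictions are and send the worst offenders for review. A rough sketch, assuming `probs` holds the model's predicted class probabilities for a batch of unlabeled examples:

```python
# Sketch of uncertainty-based active learning: surface the examples the model
# is least confident about for human review and targeted retraining.
import numpy as np

def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Higher entropy = the model is more unsure about that example."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def select_for_review(probs: np.ndarray, k: int = 100) -> np.ndarray:
    """Indices of the k examples with the most uncertain predictions."""
    return np.argsort(predictive_entropy(probs))[::-1][:k]
```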
The Common Thread (And Why This Matters)
All these approaches share one goal: making AI behavior more predictable and controllable as systems grow more capable.
Constitutional AI scales human oversight through principles
Mechanistic interpretability exposes internal processes
Active learning focuses resources on edge cases
Together, they address different facets of the alignment problem. At ICML 2025 (one of the field's major conferences), human-AI alignment was a central priority. The emphasis has shifted from purely theoretical concerns to practical deployment challenges. Because right now? Models operate in healthcare, finance, infrastructure: domains where mistakes have serious consequences.
The Open Questions (Because Obviously There Are Some)
The field still faces some pretty fundamental problems:
How do you verify that interpretability methods actually find what they claim to find?
How do you ensure constitutional principles capture the values they're meant to represent?
How do you prevent aligned systems from becoming brittle when deployed in contexts they've never seen?
But here's the thing: progress is measurable. Systems trained with these methods perform better on safety benchmarks. Red team attacks succeed less frequently. Edge case failures decrease. The gap between lab safety and deployment safety narrows.
My Take (For What It's Worth)
The alignment problem isn't even close to being solved. These approaches represent our current best attempts to ensure AI systems remain controllable as they grow more powerful and complex.
Whether they're sufficient for the systems we'll build next year (or five years from now) remains an open question. But the alternative (just... not doing this?) seems way worse.
So yeah. We're using AI to align AI. It's weird. It's a bit scary. But it might also be one of the cleverest moves we've made in AI safety.
Or it could blow up in our faces. Time will tell. 🎲
If you made it this far, congrats! You now know more about AI alignment than 99% of people. Use this knowledge wisely (or at least to sound smart at parties).
PS: I want to make this better for you, so add a comment below with your thoughts on what you love, hate, or want more of.

