Interviewing for AI Fluency
Most interviews test whether someone can talk about AI. This framework tests whether they've actually changed how they work because of it.
Every question maps to one of six dimensions of AI fluency and is delivered through one of three assessment types. Together they give you a complete picture of how someone thinks with, builds with, and reasons about AI.
The Six Dimensions
AI Craft
AI Craft is the foundational dimension. It measures whether someone can actually use AI tools to get real work done.
This isn't about knowing what a transformer is or being able to list five LLMs. It's about muscle memory. When a problem lands on their desk, do they know which tool to reach for? Can they write a prompt that gets useful output on the first or second try? Do they iterate quickly when the result isn't right? Do they know the difference between what Claude is good at versus what ChatGPT is good at versus when they should just open a spreadsheet?
The people who score high on AI Craft have a toolkit, not a tool. They've developed strong intuitions about how to get the most out of AI through hundreds of hours of actual usage. They move fast because they've done the reps.
The people who score low might have tried AI a few times. Maybe they use it for the occasional email draft. But they haven't rewired their workflow around it. AI is still an addition to how they work rather than a transformation of it.
Quality Judgment
Quality Judgment is the single most important dimension. It tests whether someone can tell when AI output is good versus when it's garbage.
This matters more than any other dimension because AI is confidently wrong all the time. It generates text that reads beautifully but contains invented facts. It writes code that looks correct but fails on edge cases. It produces analysis that sounds rigorous but is based on misunderstood data. The person who can't distinguish between good and bad AI output will ship garbage at scale, and they'll do it faster than someone who doesn't use AI at all.
Strong Quality Judgment looks like a built-in smell test. These people read AI output with a critical eye. They fact-check claims that feel off. They test code before trusting it. They notice when a summary glosses over something important. They've been burned enough times to have developed real instincts about where AI tends to go wrong.
Weak Quality Judgment looks like blind trust. They paste AI output into a document without reading it carefully. They accept the first answer without verifying. They can't tell you about a time AI got something wrong because they've never bothered to check.
Adaptability
Adaptability measures how someone responds when their first approach doesn't work and whether they're keeping pace as the tools evolve underneath them.
AI moves faster than any technology most people have worked with. The best tool for a task six months ago might not be the best tool today. The prompting strategy that worked with GPT-4 might not be optimal with Claude. Someone who found one workflow in 2023 and has been running it unchanged is not AI-fluent. They're AI-familiar.
High Adaptability looks like someone who has switched tools multiple times, not out of hype-chasing but because they found something genuinely better. When their usual approach fails on a task, they don't get stuck. They try a different prompt structure, a different tool, or a completely different approach.
Low Adaptability looks like rigidity. They have one tool and one way of using it. When it doesn't work they either force it or give up. They haven't updated their toolkit in months. They talk about AI in terms of what it could do six months ago rather than what it can do now.
Systems Thinking
Systems Thinking measures whether someone can reason about AI in a business context where there are real constraints, competing priorities, and no clean answers.
This is not product strategy with AI sprinkled on top. It's about understanding the specific tradeoffs that AI introduces. What do you do when your AI feature is useful but wrong 15% of the time? How do you handle the tension between automation and the human relationships your customers value? Is your data actually a moat or does it just feel like one?
Strong Systems Thinking looks like someone who can sit in tension. They don't jump to a clean answer. They reason about second-order effects. They distinguish between problems that better models will solve and problems that are structural to how AI works.
Weak Systems Thinking looks like someone who applies standard frameworks without adapting them. They say "just make the model better" without engaging with the constraint. They treat AI as interchangeable with any other technology and miss the things that make it genuinely different.
Scaling / Org Impact
Scaling measures whether someone can make a team or organisation better with AI, not just themselves.
Personal AI fluency is necessary but not sufficient for senior roles. The harder problem is getting twenty people to change how they work. That requires understanding adoption psychology, knowing where to start, building guardrails that protect quality without killing momentum, and measuring whether the change is actually working.
Strong Scaling looks like someone who has actually done this. They introduced AI workflows to a team and can tell you what stuck and what didn't. They know that the first wave of excitement fades and have strategies for what happens after.
Weak Scaling looks like someone who either hasn't operated at the team level or who approaches it as pure evangelism. "We should all use AI more" is not a strategy. Neither is mandating tool usage without thinking about workflows, training, or quality control.
Self-Awareness
Self-Awareness measures whether someone knows what they don't know and can draw honest boundaries around where AI should and shouldn't be used.
This dimension is the antidote to hype. The most dangerous AI user is the one who is enthusiastic but uncalibrated. They deploy AI everywhere without thinking about where it's genuinely useful versus where it creates risk. They've never met an AI limitation they took seriously.
Strong Self-Awareness looks like someone with clear, experience-based opinions about AI's limits. They can tell you where AI is overhyped right now and back it up with something they personally tested. They know their own blind spots.
Weak Self-Awareness looks like undifferentiated enthusiasm or undifferentiated skepticism. Either "AI can do everything" or "AI is all hype" — both positions held without the nuance that comes from real, sustained usage.
The Three Assessment Types
Behavioural
Behavioural questions test past experience. You ask for a specific example, then probe for depth. "Tell me about a time you..." and then you dig. What tool did you use? What did your prompt look like? What went wrong? How did you iterate?
Behavioural questions are good at revealing the depth and breadth of someone's AI experience. They surface how long someone has been using AI seriously, whether they've encountered and overcome real failure modes, and whether AI has genuinely changed how they operate.
Articulate people can make surface-level experience sound deeper than it is. That's why behavioural questions should be paired with at least one other assessment type.
Best for: AI Craft, Quality Judgment, Scaling, and Self-Awareness.
Live Exercise
You give the candidate a real task, access to a laptop and whatever AI tools they want, and you watch them work for 20 minutes. Every exercise is designed so that AI is essential. The task can't be done well in the time available without it.
This is the only assessment type people can't fake. You're watching real behaviour in real time. Someone can tell a great story about their AI workflow in a behavioural question, but the live exercise reveals whether they actually have the muscle memory to back it up.
You need a laptop and prepared materials, and the candidate needs to be comfortable working while being observed. Acknowledge this at the start and tell them you're watching how they work, not expecting a perfect output.
Best for: AI Craft, Adaptability, and Quality Judgment.
Pressure-Test Scenario
Pressure-test scenarios are business situations involving AI where there is no clean answer. You present the scenario, give the candidate a minute to think, and then have a conversation about how they'd navigate it.
Every scenario contains a built-in constraint that prevents a generic answer. The best responses sit in the tension of the tradeoff. They don't rush to an answer. They ask clarifying questions. They think about second-order effects.
Weak responses pick a side without engaging with the complexity, or they give an answer that would work for any technology and has nothing specifically to do with AI.
Best for: Systems Thinking, Self-Awareness, and Adaptability.
Building an Interview
Pick questions from the dimensions that matter most for the role. Weight the assessment types based on what level you're hiring for.
For individual contributors, weight the live exercise highest: you need to see them work. For leadership roles, weight the pressure-test scenarios and the Scaling behavioural questions highest.
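If you want the weighting to be explicit in your debrief, one option is a simple weighted composite. The sketch below is illustrative only: the 1-to-5 scale and the specific weights are assumptions, not part of the framework, and you should substitute your own.

```python
# A minimal sketch of weighting assessment types by role level.
# The weights and the 1-5 scoring scale are illustrative assumptions.

# Per-level weights over the three assessment types (each row sums to 1.0).
WEIGHTS = {
    "ic":         {"behavioural": 0.25, "live_exercise": 0.50, "scenario": 0.25},
    "leadership": {"behavioural": 0.35, "live_exercise": 0.15, "scenario": 0.50},
}

def composite_score(scores: dict[str, float], level: str) -> float:
    """Combine per-type scores (e.g. on a 1-5 scale) into one weighted number."""
    weights = WEIGHTS[level]
    return sum(weights[t] * scores[t] for t in weights)

# An IC candidate who shines in the live exercise:
print(composite_score({"behavioural": 3, "live_exercise": 5, "scenario": 3}, "ic"))  # 4.0
```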
Every question in the bank has a code. The first two letters are the dimension. The letter after the dash is the assessment type. The number is the sequence. AC-B1 is the first behavioural question for AI Craft. ST-S3 is the third scenario for Systems Thinking. Use these codes in your debrief to make calibration easy across interviewers.
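It can help to pin the code scheme down somewhere shared so every interviewer expands it the same way. A minimal sketch follows: the AC and ST prefixes and the B and S type letters come from the examples above, while the remaining dimension prefixes and the L for live exercise are assumptions about how the rest of the bank would be coded.

```python
# A sketch of the question-code scheme described above. AC, ST, B, and S are
# taken from the document's examples; QJ, AD, SC, SA, and L are assumed.

import re
from typing import NamedTuple

DIMENSIONS = {
    "AC": "AI Craft",
    "QJ": "Quality Judgment",      # assumed prefix
    "AD": "Adaptability",          # assumed prefix
    "ST": "Systems Thinking",
    "SC": "Scaling / Org Impact",  # assumed prefix
    "SA": "Self-Awareness",        # assumed prefix
}

ASSESSMENT_TYPES = {
    "B": "Behavioural",
    "L": "Live Exercise",          # assumed letter
    "S": "Pressure-Test Scenario",
}

class QuestionCode(NamedTuple):
    dimension: str
    assessment_type: str
    sequence: int

def parse_code(code: str) -> QuestionCode:
    """Parse a code like 'AC-B1' into dimension, assessment type, and sequence."""
    match = re.fullmatch(r"([A-Z]{2})-([A-Z])(\d+)", code)
    if match is None:
        raise ValueError(f"Malformed question code: {code!r}")
    dim, qtype, seq = match.groups()
    return QuestionCode(DIMENSIONS[dim], ASSESSMENT_TYPES[qtype], int(seq))

print(parse_code("AC-B1"))
# QuestionCode(dimension='AI Craft', assessment_type='Behavioural', sequence=1)
print(parse_code("ST-S3"))
# QuestionCode(dimension='Systems Thinking', assessment_type='Pressure-Test Scenario', sequence=3)
```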
Question Bank
The question bank contains 30 questions.