AI Benchmarking

www.emfinfo.com

Jason Thibeault

Jason has a degree in Philosophy, was a Captain in Air Force Space Command, Plant Manager, then rock star Headhunter. Also a published author, a black belt martial artist, and a former chess champ who likes to cook and tinker – his curiosity about all things lets him see the big picture. Using all that to help others, he has built a reputation as one of the most truly gifted coaches and trainers in the country. His specialty is getting into the minds of people and unlocking the big picture, clearing the next obstacle. Sherlock Holmes has his face on a dart board, and James Bond was heard to remark, “I ordered shaken, not nerd.” You can see the timeline of his life and learn more about him here. Want to book coaching? https://mooreessentials.com/course-catalog/

By Jason Thibeault | Wednesday March 11, 2026

Every morning, I (Jason) read science and tech articles. Which means I am constantly getting updates on which model did what on a benchmark test. And it’s all meaningless. Don’t worry, I’ll explain.

I grew up watching boxing. Imagine you run an Olympic boxing gym and you have to pick fighters for a team. Boxing benchmarks might include reach, weight, and strength. Those numbers don’t tell us if the person can take a punch, predict their opponent’s movements, or even last a few rounds before exhaustion. Mike Tyson did not have the longest reach, and he was not the heaviest boxer of his era, but he was the most feared.

It’s like using IQ to determine success. Things which IQ doesn’t test: Can you read a room? Communicate well? How’s your grit, creativity, judgment… do people even trust you? All of those are integral to success, yet someone invented mental testing years ago and now there’s a whole industry around it.

Workplace testing is the worst. Quick, solve these problems in an unfamiliar environment, using none of the tools you usually use! If a potential employee uses an AI tool they built into a pair of smart glasses to pass a coding test, are they going to be a great employee who solves problems creatively? Or should they not be hired because they cheated on the test? (This happened recently.)

If I take twins who have always done comparably in school, and send only one to SAT training, will that twin do better on the test? We assume so, but it doesn’t mean they will be the better student. They just knew what to expect on the test, and a lot of that is also happening with benchmarking. It’s easier to build a model that beats a benchmark than it is to build an AI which is truly useful.

Using those benchmark numbers to declare one AI language model better than another is just as silly as using IQ or reach to determine success. If a client used only the test numbers to determine a hire, we would think they were bad at hiring.

What do you want the model to do for you, and how are you planning on using it? From there, you can decide which is the right one for you. Ignore benchmark scores.
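If you want to make that concrete, here is a rough sketch in Python of what "try it on your own work" could look like. Everything in it is made up for illustration: the two "models" are stand-in functions rather than real AI products, and the tasks and pass checks are placeholders for whatever you actually need done. Treat it as one possible starting point, not a finished evaluation method.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a "model" is any function that takes a prompt and returns text.
Model = Callable[[str], str]
# Assumption: a task is a prompt plus your own check for "good enough".
Task = Tuple[str, Callable[[str], bool]]


def score_on_my_tasks(model: Model, tasks: List[Task]) -> float:
    """Return the fraction of your own tasks the model handles acceptably."""
    if not tasks:
        raise ValueError("Provide at least one task that reflects your real work.")
    passed = sum(1 for prompt, is_acceptable in tasks if is_acceptable(model(prompt)))
    return passed / len(tasks)


def pick_model(models: Dict[str, Model], tasks: List[Task]) -> str:
    """Return the name of the model that does best on your tasks."""
    return max(models, key=lambda name: score_on_my_tasks(models[name], tasks))


# Hypothetical stand-ins for two AI tools you might be comparing.
def terse_model(prompt: str) -> str:
    return "OK"


def helpful_model(prompt: str) -> str:
    return f"Here is a draft reply about: {prompt}"


# Your tasks, with your own definition of success for each one.
my_tasks: List[Task] = [
    ("refund request", lambda out: "refund" in out.lower()),
    ("late shipment", lambda out: "shipment" in out.lower()),
]

models = {"terse": terse_model, "helpful": helpful_model}
scores = {name: score_on_my_tasks(m, my_tasks) for name, m in models.items()}
best = pick_model(models, my_tasks)
print(scores, best)
```

The point of the design is that the tasks and the pass checks come from you, not from a published leaderboard, so the ranking reflects how you plan to use the model.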