A new study warns that artificial intelligence cannot reliably predict which occupations it will disrupt. According to a working paper published by the National Bureau of Economic Research, researchers from Northwestern University and American University found that different large language models give wildly different answers when asked to assess their own impact on jobs.
The team tested four frontier AI systems (GPT-4, ChatGPT-5, Gemini 2.5, and Claude 4.5), applying the same rubric to rate nearly 19,000 work tasks. The results showed deep disagreement. Mean exposure scores ranged from 0.14 (GPT-4 and Gemini) to 0.51 (Claude), a 3.6-fold difference. Pairwise agreement between models fell as low as 57%, a level the researchers described as only "fair".
The largest disagreements occurred in occupations that mix cognitive and physical duties, such as management, teaching, and sales. Management roles ranged from roughly 0.08 (Gemini) to 0.83 (Claude). Computer and mathematical occupations ranged from 0.42 (Gemini) to 0.95 (Claude). Educational instruction, life sciences, and sales all showed spreads of 0.30 or more across the model annotators. Models broadly agreed that physical jobs like construction were safe, and that coding jobs were vulnerable. But for white-collar roles in the middle, the verdict varied sharply.
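The two statistics at issue, a model's mean exposure score and the pairwise agreement between two models, are straightforward to compute. A minimal sketch, using invented ratings and a simple within-tolerance agreement measure (the paper's actual rubric and agreement statistic may differ):

```python
import itertools

# Hypothetical 0-1 exposure ratings for five tasks from two models.
# Illustrative numbers only, not the paper's data.
ratings = {
    "model_a": [0.1, 0.2, 0.9, 0.4, 0.1],
    "model_b": [0.8, 0.6, 0.9, 0.7, 0.3],
}

def mean_exposure(scores):
    """Average task-level exposure score for one model."""
    return sum(scores) / len(scores)

def pairwise_agreement(a, b, tol=0.2):
    """Share of tasks on which two raters' scores fall within `tol`.

    A simple illustrative agreement measure; studies typically use
    statistics such as Cohen's kappa instead.
    """
    matches = sum(abs(x - y) <= tol for x, y in zip(a, b))
    return matches / len(a)

for name, scores in ratings.items():
    print(name, round(mean_exposure(scores), 2))

for (n1, s1), (n2, s2) in itertools.combinations(ratings.items(), 2):
    print(n1, "vs", n2, "agreement:", pairwise_agreement(s1, s2))
```

Even with identical tasks and an identical rubric, two raters can produce very different means while agreeing on only a fraction of individual items, which is the pattern the study reports.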
The instability changed real-world conclusions. At the county level, Claude 4.5 produced a statistically significant negative relationship between AI exposure and employment. In contrast, GPT-4, ChatGPT-5, and Gemini 2.5 all found no significant effect, with Gemini even yielding a positive, though insignificant, coefficient. At the individual level, all models gave significant negative results, but magnitudes varied: Gemini showed the largest effect, 2.4 times the original GPT-4 estimate.
"A researcher’s conclusion about whether LLM exposure reduces employment, and by how much, depends on an unreported and untested choice: which model rated the tasks," the authors wrote. They argue that asking AI to assess its own capabilities is circular. They urge regulators, economists, and workforce boards to treat current job exposure scores as highly fragile and to move towards measures based on actual AI usage data.