AI-assisted teams outperform AI-led teams but not human-only teams in assessing research reproducibility in quantitative social science

Significance

Verifying results of published social sciences research is essential but expensive, costing hundreds of dollars per study. With AI tools like ChatGPT becoming widespread, we tested whether they could help scientists check if research findings can be reproduced. We assigned 288 researchers to 103 teams working with no AI, with AI as an assistant, or AI leading the work with minimal human input. Human teams and AI-assisted teams performed similarly on most tasks, but humans caught more critical errors. AI working autonomously achieved a 37% reproduction rate, making it potentially useful for automated screening when human review is cost-prohibitive. These results nonetheless show that human expertise remains essential for reliable scientific validation.

Abstract

Large Language Models (LLMs) such as ChatGPT are transforming how scientists conduct and validate research, offering promise as tools to improve scientific reproducibility. However, computational reproducibility and error detection remain expensive and labor-intensive. We experimentally test how collaboration between researchers and LLM assistants influences the reproduction of quantitative social science findings across different levels of AI autonomy. We randomly assigned 288 researchers to 103 teams working under three conditions: human-only, AI-assisted (using ChatGPT as a collaborative tool), or AI-led (ChatGPT operating with minimal human oversight). Teams reproduced published results from leading social science journals, detected coding errors, and proposed robustness checks. Human-only and AI-assisted teams achieved comparable reproduction rates (94% vs. 91%) and performed similarly on most outcomes, except human-only teams identified significantly more major coding errors. Both substantially outperformed AI-led teams, which achieved only a 37% reproduction rate, detected fewer errors across all categories, proposed weaker robustness checks, and required more time. This autonomous approach, however, likely represents only a lower bound of AI capabilities. Despite rapid model advances, expert human judgment currently remains indispensable for reliable empirical verification. While AI assistance did not degrade most outcomes, it provided no measurable advantages and was associated with reduced detection of major errors. However, the 37% autonomous reproduction rate indicates that AI could provide value in settings where scale or cost constraints preclude human review of papers, even though general-purpose LLMs offer no immediate advantages for human-supervised verification.

Metadata

Item Type:	Article
Authors/Creators:	Brodeur, Abel Valenta, David Marcoci, Alexandru Aparicio, Juan P Mikola, Derek Barbarioli, Bruno Alexander, Rohan Deer, Lachlan Stafford, Tom Vilhuber, Lars Bensch, Gunther Motoki, Fabio Abdelhady, Mohamed Abdelmoula, Yousra Baki, Ghina Abdul Aguirre, Tomás Aiyer, Sriraj Akhtar, Shumi Akhtar, Farida Albada, Melle R Altman, Micah Angenendt, David Arjmandi Lari, Zahra De León Tejada, Jorge Armando Arana, David Rodriguez Asanov, Igor Noha, Anastasiya-Mariya Ashong, Rebecca Auer, Tobias Bahamonde-Birke, Francisco J Baker, Bradley J Bartram, Söhnke M Bao, Dongqi Batinovic, Lucija Batistoni, Tommaso Beeder, Monica Beland, Louis-Philippe Gero Bienz, Carsten Aryanto, Christ Billy Bolibaugh, Cylcia https://orcid.org/0000-0001-7500-264X Bonander, Carl Bravo, Ramiro Bronnikov, Egor Bruns, Stephan Buliskeria, Nino Caicedo-Silva, Sara Calef, Andrea Sebastian Cano Arias, Juan A Castillo Alvarez, Gustavo Caulker, Solomon Cepenas, Simonas Chatton, Arthur Chen, Zirou Chioma Ewurum, Ngozi Ciocîrlan, Anda-Bianca Clouth, Felix J Collins, Jason Cook, Nikolai Cornejo, Cesar Craveiro, João Créchet, Jonathan Cui, Jing Chalil Vayalabron, Niveditha Czymara, Christian Bermúdez Jaramillo, Carlos Daniel Datta, Hannes Denoo, Lien Dhaliwal, Arshia Dhameja, Nency Djemai, Elodie Dujeancourt, Erwan Dündar, Uǧurcan Duprey, Thibaut Eissa, Yasmine El Fassi, Youssef El Fassi, Ismail Ellis, Keaton Elminejad, Ali Elsherif, Mahmoud Emirmahmutoglu, Aysil Etingin-Frati, Giulian Eze, Emeka Dollbaum, Jan Fabian Feld, Jan Felipe Rengifo Jaramillo, Andres Fenig, Guidon Fernandes, Victoria Fiala, Lenka Fink, Lukas Firouzjaeiangalougah, Mojtaba Fish, Sara Fitzgerald, Jack Forshaw, Rachel Fortier-Chouinard, Alexandre Fréget, Louis Frese, Joris Gabani, Jacopo Gallegos, Sebastian Gamill, Max C Gáspár, Attila Gauriot, Romain Gavrilova, Evelina Geraldes, Diogo Cantone, Giulio Giacomo Gibson, Grant Goldschmitt, Dirk Gourdon-Kanhukamwe, Amélie Gregor de Varda, Andrea Grigoryeva, Idaliya Gugushvili, Alexi Fletcher, Aaron H A Habermann, Florian Hablicsek, Márton Haddad, Joanne Hall, Jonathan D Hammar, Olle Hassouneh, Malek Hausladen, Carina I Hendrikse, Sophie C F Hepplewhite, Matthew Ho, Anson T Y Hogan-Hennessy, Senan Howley, Elliot Huang, Gaoyang Hulstaert, Héloïse Ilchovska, Zlatomira G https://orcid.org/0000-0001-6682-9952 Jaimes Santamaria, Paola Jakobsson, Niklas Jansson, Joakim Jarosz, Ewa Jebeli, Hossein Jiang, Yanchen Junaid, Hiba Kalluraya, Rohan Karim, Sunny Kelly, Edmund Kimel, Eva Kingsuwankul, Sorravich Klotzbücher, Valentin Krähmer, Daniel Krūminas, Pijus Kruus, Nicholas Kujansuu, Essi Kurz, Christoph F Küster, Stephan Lee-Whiting, Blake Lewandowski, Felix Li, Tongzhe Li, Ruoxi Liu, Dan https://orcid.org/0000-0002-1891-9352 Liu, Jiacheng Lo, Helix Loter, Katharina Macedo Dias, Felipe Madan, Christopher R Mäder, Nicolas Mandas, Marco Mantilla, Cesar Marcus, Jan Marino Fages, Diego Martin, Xavier McWay, Ryan Medina-Gaspar, Daniel Meng, Sisi Meng, Lingyu Merz, Simon Miller, Alex P Mirabel, Thibault Mishra, Dibya Deepta Mishra, Sumit Moges, Belay W Mohandes Mojarrad, Morteza Mohnen, Myra Morin, Louis-Philippe Muehlenbachs, Lucija Mullin, Gastón Musulan, Andreea Muzzì, Sara Myers, James A C https://orcid.org/0000-0002-7157-9975 Neubauer, Florian Nguyen, Tuan Niazi, Ali Nordstrom, Ardyn Nowak, Bartłomiej O'Habib, Daneal Ölkers, Tim Ong, Justin Orozco Castiblanco, Valeria Özak, Ömer Ozkes, Ali I Paaso, Mikael Pandey, Shubham Papazoglou, Varvara Penheiro, Romeo Pham, Linh Phieler, Ulrike Pütz, Peter Qi, Quan Qiu, Jingyi Rein, Manuel T Reinstein, David A Repo, Juuso Rudolf, Nicolas Saha, Shree Saka, Orkun Saponaro, Chiara Sator, Georg Schoenmakers, Martijn Seri, Raffaello Shah, Meet Sibille, Paul Siemroth, Christoph Skavysh, Vladimir Slater, Ben Song, Wenting Staubli, Stefan Steindl, Tobias Waongo, Nomwendé Steven Stott, Paul Strobel, Stephenson Sudhaharan, Roshini Sun, Pu Swain, Scott D Talavera, Oleksandr Tantiangco, Hanz M Tarasenko, Georgy Tarlinton, Boyd Tarraf, Mariam Teoh, Ken Thériault, Rémi Thompson, Bethan Tian, Tonghui Tian, Wenjie Tolani, Emmanuel Borgen, Nicolai Topstad Borgen, Solveig Torralba, Javier Velez-Ospina, Carolina Mak, Man Wai Wallrich, Lukas Wang, Zeyang Ward, Leah Webb, Matthew D Webb, Duncan Weber, Bryan S Weber, Christoph Weng, Wei-Chien Westheide, Christian Wilkinson, Tom Wong, Kwong-Yu Wroński, Marcin Wu, Zhuangchen Wu, Qixia Wu, Victor Y Xiao, Bohan Xu, Feihong Xu, Cong Yadav, Pranav Yang Chou, Yu Yap, Luther Yazbeck, Myra Yao, Bo Zagrodzka, Zuzanna Zahra, Tahreen Zaneva, Mirela Zhang, Xiaomeng Zhao, Ziwei Zhong, Han Zirgulis, Aras Zou, Jiacheng Zoutman, Floris Zozoungbo, Christelle
Copyright, Publisher and Additional Information:	We make our i) AI training materials and recording, ii) data and code, iii) preanalysis plan and iv) template form available here: https://github.com/I4Replication/AI-Games (50). We declare no restrictions on sharing or reuse. © 2026 the Author(s)
Keywords:	Humans, Reproducibility of Results, Large Language Models, Social Sciences/methods, Generative Artificial Intelligence, Artificial Intelligence, Intelligent Systems, Cooperative Behavior
Dates:	Accepted: 16 March 2026; Published: 28 May 2026
Institution:	The University of York
Academic Units:	The University of York > Faculty of Social Sciences (York) > Education (York); The University of York > Faculty of Social Sciences (York) > Economics and Related Studies (York); The University of York > Faculty of Sciences (York) > Psychology (York); The University of York > Faculty of Social Sciences (York) > Centre for Health Economics (York); The University of York > Faculty of Sciences (York) > Health Sciences (York); The University of York > Faculty of Social Sciences (York) > Social Policy and Social Work (York)
Date Deposited:	29 May 2026 11:00
Last Modified:	01 Jul 2026 23:07
Published Version:	https://doi.org/10.1073/pnas.2524747123
Status:	Published
Refereed:	Yes
Identification Number:	10.1073/pnas.2524747123
Related URLs:	https://github.com/I4Replication/AI-Game...
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:241548

Download

Published Version

Filename: brodeur-et-al-2026-ai-assisted-teams-outperform-ai-led-teams-but-not-human-only-teams-in-assessing-research-1.pdf

Description: brodeur-et-al-2026-ai-assisted-teams-outperform-ai-led-teams-but-not-human-only-teams-in-assessing-research-1

Licence: CC-BY-NC-ND 2.5

