Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark.

This is the latest version of this eprint.

Li, F. orcid.org/0000-0002-1109-6285, Hogg, D.C. orcid.org/0000-0002-6125-9564 and Cohn, A.G. orcid.org/0000-0002-7652-8907 (2024) Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark. In: Wooldridge, M.J., Dy, J.G. and Natarajan, S., (eds.) Proceedings of the AAAI Conference on Artificial Intelligence. Thirty-Eighth AAAI Conference on Artificial Intelligence, 20-27 Feb 2024, Vancouver, Canada. AAAI, pp. 18500-18507. ISBN: 978-1-57735-887-9 ISSN: 2159-5399 EISSN: 2374-3468

Abstract

Artificial intelligence (AI) has made remarkable progress across various domains, with large language models like ChatGPT gaining substantial attention for their human-like text-generation capabilities. Despite these achievements, improving spatial reasoning remains a significant challenge for these models. Benchmarks like StepGame evaluate AI spatial reasoning, where ChatGPT has shown unsatisfactory performance. However, the presence of template errors in the benchmark has an impact on the evaluation results. Thus there is potential for ChatGPT to perform better if these template errors are addressed, leading to more accurate assessments of its spatial reasoning capabilities. In this study, we refine the StepGame benchmark, providing a more accurate dataset for model evaluation. We analyze GPT’s spatial reasoning performance on the rectified benchmark, identifying proficiency in mapping natural language text to spatial relations but limitations in multi-hop reasoning. We provide a flawless solution to the benchmark by combining template-to-relation mapping with logic-based reasoning. This combination demonstrates proficiency in performing qualitative reasoning on StepGame without encountering any errors. We then address the limitations of GPT models in spatial reasoning. To improve spatial reasoning, we deploy Chain-of-Thought and Tree-of-Thoughts prompting strategies, offering insights into GPT’s cognitive process. Our investigation not only sheds light on model deficiencies but also proposes enhancements, contributing to the advancement of AI with more robust spatial reasoning capabilities.

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Li, F. (https://orcid.org/0000-0002-1109-6285); Hogg, D.C. (https://orcid.org/0000-0002-6125-9564); Cohn, A.G. (https://orcid.org/0000-0002-7652-8907)
Editors:	Wooldridge, M.J.; Dy, J.G.; Natarajan, S.
Keywords:	NLP: Interpretability, Analysis, and Evaluation of NLP Models; DMKM: Mining of Spatial, Temporal or Spatio-Temporal Data; NLP: (Large) Language Models; NLP: Other; PRS: Model-Based Reasoning; PRS: Optimization of Spatio-temporal Systems
Dates:	Published (online): 24 March 2024; Published: 24 March 2024
Institution:	The University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) > Artificial Intelligence
Funding Information:	Alan Turing Institute (grant number: Not Known); Foreign Commonwealth and Development Office (grant number: Not Known)
Depositing User:	Symplectic Publications
Date Deposited:	17 Apr 2024 10:34
Last Modified:	24 Jan 2025 11:25
Status:	Published
Publisher:	AAAI
Identification Number:	10.1609/aaai.v38i17.29811
Related URLs:	Dataset
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:211546

Available Versions of this Item

Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark. (deposited 12 Mar 2024 11:15)
- Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark. (deposited 17 Apr 2024 10:34) [Currently Displayed]
