MalAlgoQA: Pedagogical Evaluation of Counterfactual Reasoning in Large Language Models and Implications for AI in Education

Liu, Naiming; Sonkar, Shashank; Le, Myco; Baraniuk, Richard

Computer Science > Computation and Language

arXiv:2407.00938 (cs)

[Submitted on 1 Jul 2024 (v1), last revised 5 Oct 2024 (this version, v2)]

Title:MalAlgoQA: Pedagogical Evaluation of Counterfactual Reasoning in Large Language Models and Implications for AI in Education

Authors:Naiming Liu, Shashank Sonkar, Myco Le, Richard Baraniuk

View PDF HTML (experimental)

Abstract:This paper introduces MalAlgoQA, a novel dataset designed to evaluate the counterfactual reasoning capabilities of Large Language Models (LLMs) through a pedagogical approach. The dataset comprises mathematics and reading comprehension questions, each accompanied by four answer choices and their corresponding rationales. At the heart of MalAlgoQA are ``malgorithms'' - rationales behind incorrect answer choices that represent flawed yet logically coherent reasoning paths. These malgorithms serve as counterfactual scenarios, allowing us to assess an LLM's ability to identify and analyze flawed reasoning patterns. We propose the Malgorithm Identification task, where LLMs are assessed based on their ability to identify corresponding malgorithm given an incorrect answer choice. To evaluate the model performance, we introduce two metrics: Algorithm Identification Accuracy (AIA) for correct answer rationale identification, and Malgorithm Identification Accuracy (MIA) for incorrect answer rationale identification. Our experiments reveal that state-of-the-art LLMs exhibit significant performance drops in MIA compared to AIA, highlighting the challenges in counterfactual reasoning. Surprisingly, we find that the chain-of-thought prompting technique not only fails to consistently enhance MIA but can sometimes lead to underperformance compared to simple prompting. These findings have important implications for developing LLMs with improved counterfactual reasoning, particularly relevant for AI-powered tutoring systems, where identifying and addressing student misconceptions is essential. MalAlgoQA dataset is available \href{this https URL}{here}.

Comments:	Accepted at EMNLP 2024 Findings Track
Subjects:	Computation and Language (cs.CL); Computers and Society (cs.CY)
Cite as:	arXiv:2407.00938 [cs.CL]
	(or arXiv:2407.00938v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2407.00938

Submission history

From: Shashank Sonkar [view email]
[v1] Mon, 1 Jul 2024 03:39:13 UTC (717 KB)
[v2] Sat, 5 Oct 2024 09:14:00 UTC (399 KB)

Computer Science > Computation and Language

Title:MalAlgoQA: Pedagogical Evaluation of Counterfactual Reasoning in Large Language Models and Implications for AI in Education

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computation and Language

Title:MalAlgoQA: Pedagogical Evaluation of Counterfactual Reasoning in Large Language Models and Implications for AI in Education

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators