Turkish Math Word Problem Corpora
Date
2023-01-03
Publisher
Turkish Journal of Electrical Engineering and Computer Sciences
Description
To support solving elementary-level math word problems (MWPs) in Turkish, we introduce new Turkish MWP corpora built by translating and combining four English benchmark datasets: MAWPS, ASDiv-A, SVAMP, and MathQA. After manual arrangement and preprocessing, we publish the corpora, which consist of question texts, equations, and answers formatted for our model.
- Combined Dataset from MAWPS, ASDiv-A, and SVAMP
MAWPS is a frequently used English benchmark dataset containing equation templates and 3320 questions. ASDiv-A is a corpus of 1218 problems that is diverse in lexical patterns and problem types. SVAMP is a challenging dataset of 1000 samples, created by applying several types of modifications to a set of seed problems derived from ASDiv-A.
The Turkish version merges these three datasets. Together with a few manually written questions, it provides 4163 MWPs in total.
There are 862 data samples in the dev set and 3301 samples in the training set.
- MathQA Dataset
The MathQA benchmark dataset, which consists of 37200 problems, is used as the second dataset. It is one of the most challenging MWP datasets, it is large, and it covers a wide variety of question types.
After manual inspection, the Turkish version is reduced to 19555 samples. Physics and geometry problems, along with some of the probability, economics, and interest problems, are eliminated because they require knowledge of formulas or involve equations with many unknowns.
There are 3904 data samples in the dev set and 15651 samples in the training set.
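As a quick consistency check on the split sizes reported above, the following minimal Python sketch confirms that each training and dev count sums to the stated total and prints the dev share of each corpus. The names `SPLITS` and `dev_fraction` are illustrative only and are not part of the released code.

```python
# Split sizes exactly as reported in this record.
SPLITS = {
    "MAWPS + ASDiv-A + SVAMP": {"train": 3301, "dev": 862, "total": 4163},
    "MathQA": {"train": 15651, "dev": 3904, "total": 19555},
}


def dev_fraction(train: int, dev: int) -> float:
    """Return the share of samples assigned to the dev set."""
    return dev / (train + dev)


for name, sizes in SPLITS.items():
    # The two splits should account for every sample in the corpus.
    assert sizes["train"] + sizes["dev"] == sizes["total"]
    share = dev_fraction(sizes["train"], sizes["dev"])
    print(f"{name}: dev share = {share:.1%}")
```

Both corpora come out at roughly an 80/20 split between training and dev data.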
For more details, see: https://github.com/esingedik/Turkish-MWP-Corpora-and-Code
Citation
Gedik, Esin, and Tunga Güngör. "Solving Turkish math word problems by sequence-to-sequence encoder-decoder models." Turkish Journal of Electrical Engineering and Computer Sciences 31.2 (2023): 431-447.