35.06 09 PDF
Automatic Short Answer Grading System in Indonesian Language Using BERT Machine
Learning
Marvin Chandra Wijaya
Computer Engineering Department, Maranatha Christian University, Jl. Suria Sumantri 65, Bandung 40164, Indonesia
https://doi.org/10.18280/ria.350609
Received: 5 October 2021
Accepted: 25 November 2021
Keywords: automatic grading system, BERT, machine learning, Indonesian language

ABSTRACT
A system capable of automatically grading short answers is a very useful tool. The system can be created using machine learning algorithms. In this study, a machine system using BERT is proposed. BERT is an open-source system that is set to English by default. The use of languages other than English Language is a challenge to be implemented in BERT. This study proposes a novel system to implement Indonesian Language in the BERT system for automatic grading of short answers. The experimental results were measured using two measuring instruments: Cohen's Kappa coefficient and the Confusion Matrix. The result of measuring the BERT output of the implemented system has a Cohen Kappa coefficient of 0.75, a precision of 0.94, a recall of 0.96, a Specificity of 0.76 and an F1 Score of 0.95. Based on the measurement results, it can be seen that the implementation of the automatic short answer grading system in Indonesian Language using BERT machine learning has been successful.
Similarity scores are calculated using four formulas: Longest Common Subsequence (LCS), Jaccard Index, Cosine Coefficient, and DICE Coefficient. LCS calculates the similarity value between the two sentences being compared (sentence 1 and sentence 2) using the following formula:

$$ sim_{lcs} = \frac{2 \times |LCS(s_1, s_2)|}{|s_1| + |s_2|} \tag{1} $$

where:
$sim_{lcs}$ = LCS value
$s_1$ = sentence 1
$s_2$ = sentence 2

The Jaccard Index is the proportion of terms shared by two documents (document 1 and document 2) out of all terms in the two documents combined. The Jaccard Index formula is as follows:

$$ sim_{ji} = \frac{|doc_1 \cap doc_2|}{|doc_1 \cup doc_2|} \tag{2} $$

where:
$sim_{ji}$ = Jaccard Index value
$doc_1$ = document 1
$doc_2$ = document 2

The Cosine Coefficient measures the similarity between two documents as the cosine of the angle between them, computed from the product of the two documents. The Cosine Coefficient formula is as follows:

model answer to process automatic short answer questions. Experiments were carried out by groups of students for grading short answers by making comparisons between student answers and model answers [8]. This study uses the Bag of Words model to represent the text in the short answer question and the Term Frequency to represent the relevance of the term in the short answer question.

In 2021, Ince and Kutlu [9] conducted a study entitled “Web-Based Turkish Automatic Short Answer Grading System”, which focused on the Turkish Language for grading short answer questions. The components used in this research are: instructor module, stored exams, questions, answer keys, exam grades, student module, exam module, student answer, similarity calculations, calculated question point, and total exam grade, as shown in Figure 2.
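The first three similarity measures named in this section (LCS, Jaccard Index, Cosine Coefficient) can be illustrated with a minimal Python sketch. It rests on several assumptions that the text does not state: sentences are lowercased and split on whitespace, the LCS is computed over word tokens with the standard denominator |s1| + |s2|, and the Cosine Coefficient is computed over raw term-frequency vectors. The function names are hypothetical and are not taken from the cited study.

```python
import math
from collections import Counter


def tokens(text):
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    return text.lower().split()


def lcs_length(a, b):
    # Classic dynamic-programming LCS length over two token lists.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sim_lcs(s1, s2):
    # 2 * |LCS(s1, s2)| / (|s1| + |s2|), measured in tokens.
    t1, t2 = tokens(s1), tokens(s2)
    total = len(t1) + len(t2)
    return 2 * lcs_length(t1, t2) / total if total else 0.0


def sim_jaccard(doc1, doc2):
    # |doc1 ∩ doc2| / |doc1 ∪ doc2| over the sets of distinct terms.
    a, b = set(tokens(doc1)), set(tokens(doc2))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sim_cosine(doc1, doc2):
    # Cosine of the angle between term-frequency vectors.
    v1, v2 = Counter(tokens(doc1)), Counter(tokens(doc2))
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(
        sum(c * c for c in v2.values())
    )
    return dot / norm if norm else 0.0


answer = "Contoh alat keluaran adalah layar"
prompt = "Contoh alat keluaran adalah"
print(sim_lcs(answer, prompt), sim_jaccard(answer, prompt), sim_cosine(answer, prompt))
```

Token-level comparison is only one possible choice; a character-level LCS would give different values for the same pair of sentences.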
Several algorithms can improve the accuracy of classifying student answers into each class, one of which is the K-Means algorithm. The K-Means algorithm uses cluster analysis to segment a data set into several clusters. Cluster analysis is a methodology for grouping data based on certain similarities [12]. K-Means is used to predict student answers, where K is the number of clusters and each data point is assigned to the cluster it is closest to. Each data point is taken and connected to each cluster centre point, and this step is repeated to get the best cluster centre points. With this repetition, the clustering of student assessments becomes more accurate every time learning is carried out. The formula used in this method is as follows:

$$ acc = mean + k \times (nq - corr) \tag{6} $$

$$ mean = \frac{rd + at}{corr + 1} \tag{7} $$

where:
acc = accuracy
nq = number of questions
corr = number of correct answers
rd = read time
at = answer time

3. METHOD

This section discusses the data set, pre-processing, and Bidirectional Encoder Representation from Transformer (BERT), as shown in Figure 4. There are three main steps proposed in this study: collecting data sets (consisting of questions and answers), pre-processing (consisting of filtering and concatenation), and the grading process using BERT. The proposed machine learning algorithm is able to optimize the resulting output [13].

Figure 4. Method

3.1 Data set

The important thing in starting the BERT machine learning process is to prepare the data set in advance [14, 15]. The data set consisted of short answers given in a quiz by high school students in Bandung, Indonesia. About 60 students responded to questions related to Computer and Information Technology (CIT) subjects. There are two kinds of data sets: training data sets and testing data sets. The training set is taken from the question and answer data bank provided by the teacher. There are 100 questions and answers in the data bank for Computer and Information Technology (CIT) subjects. The testing data set was taken from the students' responses. The respondents were 60 students (35 male and 25 female), all 10th-grade students aged 15 to 17 years. Each student is given ten questions taken randomly from the question data bank. The data set of each student consists of 10 different questions with relevant information in the context of the questions. The initial ratings were assigned by experts, and each answer was placed into one of four categories: true, true-but-incomplete, contradictory, and incorrect.

3.2 Data preprocessing

The data set obtained from the 60 students is pre-processed before being passed to the next stage. Data sets with incomplete answers are deleted. The 60 respondents were filtered for complete answers, and 48 complete answer sets were obtained.

Grading a short answer question is sometimes not only a matter of correct or incorrect; an answer can also be true but incomplete, or contradictory. However, these four rating categories can make the grading process using BERT Machine Learning more complex. The four existing rating categories were therefore reduced to two, giving a binary response variable (correct answer or wrong answer) to be processed later. In non-binary classification, the number of data sets in each category has to be managed, and this setup makes the pre-processing step more complex. In this study, the classification was made binary according to the needs of the respondents being tested. Responses in the true category are considered “correct” answers, while the other categories (true but incomplete, contradictory, and incorrect) are considered “wrong” answers.

The next step is to combine the question context text with the answer itself. Then, the combined text is tokenized using the base-BERT-uncased tokenizer, as shown in Figure 5. There are two steps in the process, which are as follows:
Step 1: The concatenation process combines questions and answers.
Step 2: The tokenizer analyzes the sentences that have been concatenated. The sentences are segmented into tokens that have a single meaning. Stop words in the sentences are also omitted.

An example of this process is:
Question: “Indonesia is located on the continent of Asia. What are the two continents and two oceans closest to Indonesia?”
Answer: “Indonesia is located between the continents of Asia and the continent of Australia. Indonesia is located between the Pacific Ocean and the Indian Ocean.”

The results of processing the question and answer are as follows:
Token 1: “The country of Indonesia is located on the continent of Asia”.
Token 2: “What are the two closest continents and two oceans of Indonesia”.
Token 3: “Indonesia is located between the continents of Asia and the continents of Australia”.
Token 4: “Indonesia lies between the Pacific Ocean and the Indian Ocean”.
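The pre-processing steps of Section 3.2 (filtering incomplete answer sets, reducing the four expert categories to a binary label, and concatenating question and answer) can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the implementation used in the study: an answer set is treated as incomplete when a student has fewer than the expected number of answers or any empty answer, the record layout (dictionaries with `student`, `answer` keys) is hypothetical, and the `[SEP]` separator is assumed because it is BERT's standard segment separator. Tokenization itself is left to the BERT tokenizer and is not reproduced here.

```python
from collections import defaultdict

# The four expert categories from Section 3.1; only "true" counts as correct.
CATEGORIES = ("true", "true-but-incomplete", "contradictory", "incorrect")


def to_binary_label(category):
    # Collapse the four categories into 1 (correct) or 0 (wrong).
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return 1 if category == "true" else 0


def filter_complete(responses, expected_answers=10):
    # Assumption: a student's set is complete when it has the expected
    # number of rows and none of the answers is blank.
    by_student = defaultdict(list)
    for row in responses:
        by_student[row["student"]].append(row)
    kept = []
    for rows in by_student.values():
        if len(rows) == expected_answers and all(r["answer"].strip() for r in rows):
            kept.extend(rows)
    return kept


def concatenate(question, answer, separator=" [SEP] "):
    # Step 1 of the process: join question text and answer text.
    return question.strip() + separator + answer.strip()


sample = [
    {"student": 1, "answer": "Contoh alat keluaran adalah layar"},
    {"student": 1, "answer": "Komputer berasal dari kata Computare"},
    {"student": 2, "answer": ""},  # blank answer: this student is dropped
    {"student": 2, "answer": "Contoh prosesor adalah INTEL dan AMD"},
]
print(len(filter_complete(sample, expected_answers=2)))
print(concatenate("Contoh alat keluaran adalah", "Contoh alat keluaran adalah layar"))
```

Keeping the label mapping in one small function makes it easy to revisit the binary simplification later, for example to experiment with the original four categories.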
Figure 7. BERT processes for non-English language
the training dataset is taken from the question and answer data bank that the teacher has prepared.

In this study, it is for the needs of the grading system. There may be a need to fine-tune the model created during training, as shown in Figure 9. Fine-tuning is used to adapt the deep learning process to the given task. In this study, BERT was specifically given the task of grading short answer questions, so fine-tuning was carried out to separate tokens between questions and answers.

Figure 9. BERT workflow with pre-training and fine-tuning phases

These steps can be repeated as needed by evaluating the results of the training. After the pre-training and fine-tuning are completed, the process continues with testing using respondent data. The model can then be tested on a real example, namely an automatic grading system application.

3.4 Measurement instrument – Cohen’s Kappa

• True Negative (TN): The model predicts that the data is in the negative class and the actual data is in the negative class.
• True Positive (TP): The model predicts that the data is in the positive class and the actual data is in the positive class.
• False Negative (FN): The model predicts that the data is in the negative class, but the data is actually in the positive class.
• False Positive (FP): The model predicts that the data is in the positive class, but the data is actually in the negative class.

Precision is the ratio of correct positive predictions to all positive predictions. The Precision formula is as follows:

$$ Precision = \frac{TP}{TP + FP} \tag{9} $$

Recall (Sensitivity) is the ratio of true positive predictions to all data that are actually positive. The formula for Recall is as follows:

$$ Recall = \frac{TP}{TP + FN} \tag{10} $$
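To make the four confusion-matrix cells and the Precision and Recall formulas concrete, the following minimal sketch counts TP, TN, FP, and FN from two lists of binary grades. The function names and the six example grades are hypothetical and serve only as an illustration; the manual grade is assumed to be the actual class and the automatic grade the predicted class, with 1 meaning "correct".

```python
def confusion_counts(actual, predicted):
    # Count the four confusion-matrix cells for binary labels (1 = positive).
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1
        elif p == 1 and a == 0:
            fp += 1
        elif p == 0 and a == 1:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def precision(tp, fp):
    # TP / (TP + FP); returns 0.0 when nothing was predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # TP / (TP + FN); returns 0.0 when there are no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical grades for six answers (illustrative only, not study data).
manual = [1, 1, 1, 0, 0, 1]
automatic = [1, 1, 0, 0, 1, 1]
tp, tn, fp, fn = confusion_counts(manual, automatic)
print(tp, tn, fp, fn, precision(tp, fp), recall(tp, fn))
```

Returning 0.0 for an empty denominator is a design choice of this sketch; other conventions (such as raising an error) are equally defensible.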
Table 1. Example of Data Set (Indonesian Language)

| Question Context | Question | Answer |
|---|---|---|
| Istilah Komputer | Komputer berasal dari kata | Komputer berasal dari kata Computare |
| Alat masukan | Contoh alat masukan adalah | Contoh alat masukan adaah tetikus |
| Alat keluaran | Contoh alat keluaran adalah | Contoh alat keluaran adalah layar |
| Prosesor | Contoh prosessor komputer adalah | Contoh prosesor komputer adalah INTEL dan AMD |

| Student no. | Question no. | Automatic Grading | Manual grading |
|---|---|---|---|
| 1 | 1 | Correct | Correct |
| 1 | 2 | Correct | Correct |
| ... | ... | ... | ... |
| 1 | 10 | Correct | Incorrect |
| 2 | 1 | Incorrect | Incorrect |
| ... | ... | ... | ... |
| 2 | 10 | Correct | Incorrect |
| ... | ... | ... | ... |
| 48 | 1 | Correct | Correct |
| ... | ... | ... | ... |
| 48 | 10 | Incorrect | Correct |

Table 3. Consistency

| Manual Grading System (by Human) | Automatic Grading System: Correct | Automatic Grading System: Incorrect |
|---|---|---|
| Correct | 76% | 5% |
| Incorrect | 3% | 16% |

$$ Recall = \frac{76\%}{76\% + 3\%} = 0.96 $$

The capability of the implemented machine learning algorithm to predict wrong answers is as follows:

$$ Specificity = \frac{16\%}{16\% + 5\%} = 0.76 $$

The F1 Score, which combines the precision and recall of the implemented machine learning algorithm into a single weighted average, is as follows:

$$ F1\ Score = 2 \times \frac{Precision \times Recall}{Precision + Recall} = 2 \times \frac{0.94 \times 0.96}{0.94 + 0.96} = 0.95 $$

5. CONCLUSIONS

The novel system proposed in this study is able to implement the Indonesian language in machine learning algorithms using BERT. BERT is an open-source system that by default has limitations because it is set for English. This study proposes a novel system to implement Indonesian in the BERT system for automatic grading of short answers. The experimental results were measured using two measuring instruments: Cohen's Kappa coefficient and the Confusion Matrix.

The BERT output of the implemented system has a Cohen's Kappa coefficient of 0.75, which means that the implemented algorithm has good consistency with manual grading.

Measured with the confusion matrix, the algorithm produces the following values: Precision of 0.94, Recall of 0.96, Specificity of 0.76, and F1 Score of 0.95. Based on the measurement results, it can be seen that the implementation of the automatic short answer grading system in Indonesian Language using BERT machine learning has been successful.
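As a cross-check, the reported figures can be recomputed from the four percentages in Table 3 with a short Python sketch. The assignment of the off-diagonal cells (FP = 5, FN = 3) is an assumption that follows the worked Recall and Specificity calculations in the text; Cohen's Kappa is unaffected by swapping them. The function names are hypothetical.

```python
def cohen_kappa(tp, fp, fn, tn):
    # kappa = (p_o - p_e) / (1 - p_e) for a 2x2 agreement table.
    total = tp + fp + fn + tn
    p_observed = (tp + tn) / total
    actual_pos = (tp + fn) / total
    predicted_pos = (tp + fp) / total
    # Agreement expected by chance from the row and column marginals.
    p_expected = actual_pos * predicted_pos + (1 - actual_pos) * (1 - predicted_pos)
    return (p_observed - p_expected) / (1 - p_expected)


def specificity(tn, fp):
    # TN / (TN + FP): share of wrong answers that were graded as wrong.
    return tn / (tn + fp)


def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall, written in terms of counts.
    return 2 * tp / (2 * tp + fp + fn)


# Percentages from Table 3, used directly as counts out of 100.
TP, FP, FN, TN = 76, 5, 3, 16

print(round(cohen_kappa(TP, FP, FN, TN), 2))
print(round(TP / (TP + FP), 2), round(TP / (TP + FN), 2))
print(round(specificity(TN, FP), 2), round(f1_score(TP, FP, FN), 2))
```

Using the percentages as counts is valid here because all of the measures are ratios, so the common factor cancels.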
undergraduate plant biodiversity lab. Evolution: Education and Outreach, 7(1): 1-9. https://doi.org/10.1186/s12052-014-0010-7
[4] de Bruin, A.B., Kok, E.M., Lobbestael, J., de Grip, A. (2017). The impact of an online tool for monitoring and regulating learning at university: Overconfidence, learning strategy, and personality. Metacognition and Learning, 12(1): 21-43. https://doi.org/10.1007/s11409-016-9159-5
[5] Wenzel, K., Reinhard, M.A. (2020). Tests and academic cheating: Do learning tasks influence cheating by way of negative evaluations? Social Psychology of Education, 23(3): 721-753. https://doi.org/10.1007/s11218-020-09556-0
[6] Boukhari, Y. (2020). Application and comparison of machine learning algorithms for predicting mass loss of cement raw materials due to decarbonation process. Revue d’Intelligence Artificielle, 34(4): 403-411. https://doi.org/10.18280/ria.340404
[7] Hasanah, U., Permanasari, A.E., Kusumawardani, S.S., Pribadi, F.S. (2019). A scoring rubric for automatic short answer grading system. Telkomnika, 17(2): 763-770. https://doi.org/10.12928/TELKOMNIKA.V17I2.11785
[8] Süzen, N., Gorban, A.N., Levesley, J., Mirkes, E.M. (2020). Automatic short answer grading and feedback using text mining methods. Procedia Computer Science, 169(2019): 726-743. https://doi.org/10.1016/j.procs.2020.02.171
[9] Ince, E.Y., Kutlu, A. (2021). Web-based Turkish Automatic Short-Answer Grading System. Natural Language Processing Research, 1(3-4): 46-55. https://doi.org/10.2991/nlpr.d.210212.001
[10] Zhang, L., Huang, Y., Yang, X., Yu, S., Zhuang, F. (2019). An automatic short-answer grading model for semi-open-ended questions. Interactive Learning Environments, pp. 1-14. https://doi.org/10.1080/10494820.2019.1648300
[11] Pado, U., Kiefer, C. (2015). Short answer grading: When sorting helps and when it doesn’t. In Proceedings of the Fourth Workshop on NLP for Computer-Assisted Language Learning, pp. 42-50.
[12] Vankayalapati, R., Ghutugade, K.B., Vannapuram, R., Prasanna, B.P.S. (2021). K-means algorithm for clustering of learners performance levels using machine learning techniques. Revue d’Intelligence Artificielle, 35(1): 99-104. https://doi.org/10.18280/ria.350112
[13] Sharma, R., Hooda, N. (2019). Optimized ensemble machine learning framework for high dimensional imbalanced bio assays. Revue d’Intelligence Artificielle, 33(5): 387-392. https://doi.org/10.18280/ria.330509
[14] Satla, S.P., Sadanandam, M., Suvarna, B. (2020). Dangerous Prediction in Roads by Using Machine Learning Models. Ingénierie des Systèmes d’Information, 25(5): 637-644. https://doi.org/10.18280/isi.250511
[15] Singla, S.K., Garg, R.D., Dubey, O.P. (2020). Ensemble machine learning methods to estimate the sugarcane yield based on remote sensing information. Revue d’Intelligence Artificielle, 34(6): 731-743. https://doi.org/10.18280/RIA.340607
[16] Das, S., Deb, N., Cortesi, A., Chaki, N. (2021). Sentence embedding models for similarity detection of software requirements. SN Computer Science, 2(2): 1-11. https://doi.org/10.1007/s42979-020-00427-1
[17] Devlin, J., Chang, M.W., Lee, K., Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. https://doi.org/10.18653/v1%2FN19-1423
[18] Alzubi, J., Alzubi, J.A., Jain, R., Singh, A., Parwekar, P., Gupta, M. (2021). COBERT: COVID-19 question answering system using BERT. Arabian Journal for Science and Engineering, pp. 1-11. https://doi.org/10.1007/s13369-021-05810-5
[19] Sur, C. (2020). RBN: Enhancement in language attribute prediction using global representation of natural language transfer learning technology like Google BERT. SN Applied Sciences, 2(1): 1-15. https://doi.org/10.1007/s42452-019-1765-9
[20] Kanerva, J., Ginter, F., Pyysalo, S. (2020). Dependency parsing of biomedical text with BERT. BMC Bioinformatics, 21(23): 1-12. https://doi.org/10.1186/s12859-020-03905-8
[21] Ahmed, M.Z., Mahesh, C. (2021). A weight based labeled classifier using machine learning technique for classification of medical data. Revue d’Intelligence Artificielle, 35(1): 39-46. https://doi.org/10.18280/ria.350104