
Revue d'Intelligence Artificielle
Vol. 35, No. 6, December, 2021, pp. 503-509

Journal homepage: http://iieta.org/journals/ria
Automatic Short Answer Grading System in Indonesian Language Using BERT Machine
Learning
Marvin Chandra Wijaya
Computer Engineering Department, Maranatha Christian University, Jl. Suria Sumantri 65, Bandung 40164, Indonesia
Corresponding Author Email: [email protected]
https://doi.org/10.18280/ria.350609

Received: 5 October 2021
Accepted: 25 November 2021

Keywords: automatic grading system, BERT, machine learning, Indonesian language

ABSTRACT

A system capable of automatically grading short answers is a very useful tool. Such a system can be created using machine learning algorithms. In this study, a machine learning system using BERT is proposed. BERT is an open-source system that is set to English by default, so using it with languages other than English is a challenge. This study proposes a novel system to implement Indonesian Language in the BERT system for automatic grading of short answers. The experimental results were measured using two measuring instruments: Cohen's Kappa coefficient and the Confusion Matrix. The BERT output of the implemented system has a Cohen's Kappa coefficient of 0.75, a Precision of 0.94, a Recall of 0.96, a Specificity of 0.76 and an F1 Score of 0.95. Based on the measurement results, it can be seen that the implementation of the automatic short answer grading system in Indonesian Language using BERT machine learning has been successful.
1. INTRODUCTION

A learning system between teachers and students requires two-way communication to measure learning outcomes. One of the tools to measure learning outcomes is to hold tests, quizzes, or exams for students who take learning classes [1]. The test has a good impact on the learning system because teachers and students can know the effectiveness of the learning being carried out [2]. The exam can be in the form of a pre-test, a test in the middle of the lesson, or a test at the end of the learning. These three methods have a good impact on the learning process [3]. The rise of online learning at this time also increases the need for a measuring tool to find out whether distance learning has been successfully implemented [4]. In a learning process, there is a lot of interaction between teachers and students, which causes a good social impact that can be measured as well [5]. There are several types of questions in an exam: memory questions, comprehension questions, application questions, analysis questions, synthesis questions, evaluation questions, and process skills questions.

A good way for an exam to measure the learning system is to use short answer questions. As the number of students whose exam answers must be checked increases, a system is needed to assist teachers in checking student answers. The automatic answer checking system can help teachers check answers; at least this system can help if teachers make mistakes in grading exam results from students.

Grading short answers automatically is more difficult than checking multiple-choice answers because student answers vary widely. The variations of these answers must be studied properly in order to get proper results. However, if the automatic grading system can be realized properly, then it can be very helpful for teachers in their learning system. The Machine Learning Algorithm proposed in this study must be able to make the classification correctly [6].

2. RELATED STUDY

Currently, many short answer question grading systems have been studied to get better grading accuracy. In 2019, Hasanah et al. [7] conducted a study entitled “A scoring rubric for automatic short answer grading system”. This study aims to perform grading without using a language semantics tool.

Experiments were carried out on seven questions, with thirty-four alternative answers, and tested on two hundred and twenty-four students. The experimental results produce a Pearson correlation of 0.65 - 0.66 with a mean absolute error of 0.95 - 1.24 [7]. The steps used in this research are: data collection, pre-processing, calculating the similarity score, calculating the keyword matching score, calculating the final score, and performance evaluation, as shown in Figure 1.

Figure 1. Research flow by Uswatun Hasanah
Similarity scores are calculated using four formulas: Longest Common Subsequence (LCS), Jaccard Index, Cosine Coefficient, and Dice Coefficient. LCS calculates the similarity value between the two sentences being compared (sentence 1 and sentence 2) using the following formula:

$$\mathrm{sim}_{lcs} = \frac{2 \times |LCS(s_1, s_2)|}{|s_1| + |s_2|} \tag{1}$$

where:
$\mathrm{sim}_{lcs}$ = LCS value
$s_1$ = sentence 1
$s_2$ = sentence 2

The Jaccard Index is the percentage of the same terms from all combined terms of the two documents (document 1 and document 2). The Jaccard Index formula is as follows:

$$\mathrm{sim}_{ji} = \frac{|doc_1 \cap doc_2|}{|doc_1 \cup doc_2|} \tag{2}$$

where:
$\mathrm{sim}_{ji}$ = Jaccard Index value
$doc_1$ = document 1
$doc_2$ = document 2

The Cosine Coefficient measures the comparison of angular cosine on the similarity between two documents and the multiplication of two documents. The Cosine Coefficient formula is as follows:

$$\mathrm{sim}_{cc} = \frac{|doc_1 \cap doc_2|}{|doc_1|^{0.5} \cdot |doc_2|^{0.5}} \tag{3}$$

The Dice Coefficient is twice the number of terms shared by both documents divided by the total number of terms in the two documents. The Dice Coefficient formula is as follows:

$$\mathrm{sim}_{dc} = \frac{2 \times |doc_1 \cap doc_2|}{|doc_1| + |doc_2|} \tag{4}$$

The similarity score is taken as the highest value among the LCS value, Jaccard Index value, Cosine Coefficient value, and Dice Coefficient value.

The Keyword Matching Score calculates the number of keywords from student answers and alternative answers, where each keyword from the alternative answers is compared with student answers. The highest match score in the calculation is used as the Keyword Matching Score.

The final score is a combined calculation between the similarity score and the keyword matching score. The final score is the average of the similarity score multiplied by the answer score and the keyword matching score multiplied by the answer score, as in the following formula:

$$FS = \frac{(sim \times as) + (km \times as)}{2} \tag{5}$$

where:
$FS$ = final score
$sim$ = similarity score
$km$ = keyword matching score
$as$ = answer score

In 2020, Süzen et al. [8] conducted a study entitled “Automatic short answer grading and feedback using text mining methods”. This research uses a standard data mining model answer to process automatic short answer questions. Experiments were carried out by groups of students for grading short answers by making comparisons between student answers and model answers [8]. This study uses the Bag of Words model to represent the text in the short answer question and the term frequency to represent the relevance of the term in the short answer question.

In 2021, Ince and Kutlu [9] conducted a study entitled “Web-Based Turkish Automatic Short Answer Grading System”, which focused on the Turkish Language for grading short answer questions. The components used in this research are: instructor module, stored exams, questions, answer keys, exam grades, student module, exam module, student answer, similarity calculations, calculated question point, and total exam grade, as shown in Figure 2.

Figure 2. TASAG System by Ebru Yilmaz Ince

In 2019, Zhang et al. [10] conducted a study entitled “An automatic short answer grading model for semi open-ended questions”. In this study, the model used is a long short-term memory recurrent neural network to process the grading of semi open-ended questions, as shown in Figure 3.

Figure 3. Automatic Grading Model by Lishan Zhang

In 2015, Pado and Kiefer conducted a study entitled “Short Answer Grading: When Sorting Helps and When it Doesn’t”. This study uses a domain-independent system to process the short answer grading [11]. The evaluation of the experiment in this study, using CREG and Computer Science Short Answer in German (CSSAG), shows an average grading speed of 1.6 seconds.
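The four similarity measures of Eqs. (1)-(4) from the approach of Hasanah et al. [7] can be illustrated with a minimal Python sketch. It is not the implementation from [7]: it assumes that sentences are compared as lowercase, whitespace-separated tokens, and all function names are hypothetical.

```python
from math import sqrt


def tokens(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def sim_lcs(s1, s2):
    """Eq. (1): 2 * |LCS| / (|s1| + |s2|)."""
    t1, t2 = tokens(s1), tokens(s2)
    total = len(t1) + len(t2)
    return 2 * lcs_length(t1, t2) / total if total else 0.0


def sim_jaccard(d1, d2):
    """Eq. (2): shared terms over all distinct terms."""
    a, b = set(tokens(d1)), set(tokens(d2))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sim_cosine(d1, d2):
    """Eq. (3): shared terms over sqrt(|doc1|) * sqrt(|doc2|)."""
    a, b = set(tokens(d1)), set(tokens(d2))
    denom = sqrt(len(a)) * sqrt(len(b))
    return len(a & b) / denom if denom else 0.0


def sim_dice(d1, d2):
    """Eq. (4): twice the shared terms over |doc1| + |doc2|."""
    a, b = set(tokens(d1)), set(tokens(d2))
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def similarity_score(s1, s2):
    """The similarity score is the highest of the four measures."""
    return max(sim_lcs(s1, s2), sim_jaccard(s1, s2),
               sim_cosine(s1, s2), sim_dice(s1, s2))


model_answer = "komputer berasal dari kata computare"
student_answer = "komputer berasal dari computare"
print(round(similarity_score(student_answer, model_answer), 3))
```

Whether LCS should be computed on words or on characters is not stated in the text; the token-level choice here only keeps the four measures comparable.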
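The final score of Eq. (5) in the approach of Hasanah et al. [7] can be sketched in the same way. The keyword matching rule below is an assumption, since the text does not give a formula for it: each alternative answer is represented by a list of keywords, the match score is the fraction of those keywords found in the student answer, and the highest value over all alternatives is kept. All names are hypothetical.

```python
def keyword_match(student_answer, keywords):
    """Fraction of the given keywords that occur in the student answer (assumed rule)."""
    if not keywords:
        return 0.0
    answer_tokens = set(student_answer.lower().split())
    hits = sum(1 for keyword in keywords if keyword.lower() in answer_tokens)
    return hits / len(keywords)


def keyword_matching_score(student_answer, alternatives):
    """Highest match score over the keyword lists of all alternative answers."""
    return max((keyword_match(student_answer, kws) for kws in alternatives),
               default=0.0)


def final_score(sim, km, answer_score):
    """Eq. (5): average of the two scores, each scaled by the answer score."""
    return ((sim * answer_score) + (km * answer_score)) / 2


student_answer = "Contoh alat masukan adalah tetikus"
alternatives = [["tetikus"], ["mouse", "keyboard"]]
km = keyword_matching_score(student_answer, alternatives)
print(final_score(sim=0.8, km=km, answer_score=10))
```

Because both terms are scaled by the same answer score, the final score is simply the answer score multiplied by the mean of the similarity and keyword matching scores.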
There are several algorithms that can improve the accuracy in the classification of student answers for each class, such as the K-Means algorithm. The K-Means algorithm uses cluster analysis to segment a data set into several clusters. Cluster analysis is a methodology for grouping data based on certain similarities [12]. K-means is used to predict student answers, where K is the number of clusters used to find the closest proximity to each cluster. Each data point is taken and connected to each cluster centre point, and this step is repeated to get the best cluster centre point. With this repetition, every time learning is carried out, the performance of student assessment clustering will be more accurate. The formulas used in this method are as follows:

$$acc = mean + k \times (nq - corr) \tag{6}$$

$$mean = \frac{rd + at}{corr + 1} \tag{7}$$

where:
$acc$ = accuracy
$nq$ = number of questions
$corr$ = number of correct answers
$rd$ = read time
$at$ = answer time

3. METHOD

This method section discusses the data set, pre-processing, and Bidirectional Encoder Representation from Transformer (BERT), as shown in Figure 4. There are three main steps proposed in this study: collecting data sets (consisting of questions and answers), pre-processing (consisting of filtering and concatenation), and the grading process using BERT. The proposed machine learning algorithm is able to optimize the resulting output [13].

Figure 4. Method

3.1 Data set

The important thing in starting the BERT machine learning process is to prepare the data set in advance [14, 15]. The data set consisted of short answers given in a quiz by students from a high school in Bandung, Indonesia. About 60 students responded to questions related to Computer and Information Technology (CIT) subjects. There are two kinds of data sets: training data sets and testing data sets. The training set is taken from the question and answer data bank provided by the teacher. There are 100 questions and answers in the data bank for Computer and Information Technology (CIT) subjects.

The testing data set was taken from the students’ responses. The respondents were 60 students, consisting of 35 men and 25 women. Respondents are 10th-grade students aged between 15 - 17 years. Each student is given ten questions which are taken randomly from the question data bank. The data set of each student consists of 10 different questions with relative information in the context of the questions. The initial rankings are settled by experts, and each answer was categorized into four categories, namely: true, true-but-incomplete, contradictory, and incorrect.

3.2 Data preprocessing

The data set that has been obtained from 60 students is pre-processed before being passed to the next stage. Data sets with incomplete answers are deleted. The 60 respondents were filtered to find complete answers, and 48 complete answer sheets were obtained.

Grading on a short answer question is sometimes not only correct or incorrect; it can also be true but incomplete or contradictory. However, these four rating categories can make the grading process using BERT Machine Learning more complex. The four existing rating categories were reduced to two to become binary response variables (correct answer or wrong answer) to be processed later. In non-binary classification, the number of data sets in each category has to be managed, which makes the pre-processing step more complex. In this study, binary classification was carried out according to the needs of the respondents being tested. Responses that are considered “correct” answers are in the true category, while the other categories (true but incomplete, contradictory, and incorrect) are considered “wrong” answers.

The next step is to combine the question context text with the answer itself. Then, the combined text is tokenized using the bert-base-uncased tokenizer, as shown in Figure 5. There are two steps in the process, which are as follows:

Step 1: The concatenation process combines questions and answers.

Step 2: The tokenizer analyzes the sentences that have been concatenated. Each sentence is segmented into tokens that have a single meaning. The stop words in the sentence are also omitted.

An example of this process is:

Question: “Indonesia is located on the continent of Asia. What are the two continents and two oceans closest to Indonesia?”

Answer: “Indonesia is located between the continents of Asia and the continent of Australia. Indonesia is located between the Pacific Ocean and the Indian Ocean.”

The results of processing the questions and answers are as follows:

Token 1: “The country of Indonesia is located on the continent of Asia”.
Token 2: “What are the two closest continents and two oceans of Indonesia”.
Token 3: “Indonesia is located between the continents of Asia and the continents of Australia”.
Token 4: “Indonesia lies between the Pacific Ocean and the Indian Ocean”.
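The filtering and label reduction described in Section 3.2 can be sketched as follows. This is only an illustration: the record layout (a dictionary with a list of answers per student) and the function names are hypothetical, and only the four category names and the rule that "true" alone counts as correct come from the text.

```python
CATEGORIES = {"true", "true-but-incomplete", "contradictory", "incorrect"}


def is_complete(sheet):
    """A sheet is complete when none of its answers is blank."""
    return all(answer.strip() for answer in sheet["answers"])


def filter_complete(sheets):
    """Keep only answer sheets without blank answers."""
    return [sheet for sheet in sheets if is_complete(sheet)]


def to_binary(category):
    """Map the four expert categories to 1 (correct) or 0 (wrong)."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return 1 if category == "true" else 0


# Hypothetical example data, not taken from the study.
sheets = [
    {"student": 1, "answers": ["Computare", "tetikus"]},
    {"student": 2, "answers": ["layar", ""]},
]
complete = filter_complete(sheets)
labels = [to_binary(c) for c in
          ["true", "true-but-incomplete", "contradictory", "incorrect"]]
print(len(complete), labels)
```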
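The concatenation of a question with its answer can be illustrated with a small sketch of the sentence-pair layout that BERT expects: a `[CLS]` token, the first segment, a `[SEP]` token, the second segment, and a closing `[SEP]`, with one segment id and one position id per token. The whitespace-and-punctuation tokenizer below is a stand-in assumption for the WordPiece tokenizer of an actual BERT model, stop-word removal is left out, and the function names are hypothetical.

```python
import re


def simple_tokenize(text):
    """Stand-in tokenizer: lowercase words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def encode_pair(question, answer):
    """Build tokens, segment ids, and position ids for a question-answer pair."""
    q_tokens = simple_tokenize(question)
    a_tokens = simple_tokenize(answer)
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + a_tokens + ["[SEP]"]
    # Segment A covers [CLS], the question, and the first [SEP];
    # segment B covers the answer and the final [SEP].
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(a_tokens) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = encode_pair(
    "Contoh alat masukan adalah",
    "Contoh alat masukan adalah tetikus",
)
for token, segment, position in zip(tokens, segment_ids, position_ids):
    print(position, segment, token)
```

In a real model, each of the three id sequences is mapped to an embedding vector and the three vectors are summed per token.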
Figure 5. Implementation steps

3.3 Bidirectional Encoder Representation from Transformer (BERT) – Machine Learning

BERT is an open-source machine learning framework for natural language processing (NLP) designed to help computers understand language meaning in the text [16]. BERT stands for Bidirectional Encoder Representations from Transformers and uses a deep learning model, as shown in Figure 6 [17].

Figure 6. BERT Model

E1 - EN represent the tokens generated from the tokenizer, which are then passed into the Trm (Intermediate representation) and T1 - TN (Final Output) layers. The classification (C) used is binary classification (Correct and Incorrect).

Each output element is connected to each input element, and the weights between those elements are calculated dynamically based on the connection. Masked Language Model (MLM) training aims to train the model to predict a hidden word based on the context of the word [18]. The Next Sentence Prediction training indicates whether the two sentences given have a logical connection and whether the two sentences are sequential [19].

BERT has been previously trained using only an unlabeled plain text corpus [20] (i.e., BooksCorpus and the entirety of the English Wikipedia). In order for BERT to be used for non-English languages, training in other languages is required, so a separate process is needed, as shown in Figure 7.

Figure 7. BERT processes for non-English language

The input representation of BERT is constructed by summing the corresponding token, segment, and position embeddings. Figure 8 shows an example of an input representation consisting of two segments (segment A and segment B). In this example, it can be seen that the token embeddings are summed with the segment embeddings and the position embeddings. The machine learning model will then classify the answers [21].

Figure 8. BERT representation

BERT by default already has a data set in English. To use BERT in other languages (besides English), some initial adjustments are required. The first step is to install the simple transformer library in accordance with the language that will be used. After the library is installed, the training that is suitable for the language model used is chosen. The training requires a number of data sets according to the needs of the application to be made. As discussed in the previous section, the training data set is taken from the question and answer data bank that the teacher has prepared.

In this study, the training serves the needs of the grading system. There may be a need to fine-tune the model created during the training, as shown in Figure 9. Fine-tuning is used to adapt the deep learning process to the given task. In this study, BERT was specifically given the task of grading short answer questions, so fine-tuning was carried out to separate tokens between questions and answers.

Figure 9. BERT workflow with pre-training and fine-tuning phases

These steps can be repeated as needed by evaluating the results of the training. After the pre-training and fine-tuning are completed, the process continues with testing using respondent data. The model can then be tested on a real example, which is to create an automatic grading system application.

3.4 Measurement instrument – Cohen’s Kappa

Cohen’s kappa coefficient (κ) is a statistic used to measure inter-rater reliability (as well as intra-rater reliability). This is generally considered a more effective measure than a simple percentage agreement calculation.

Cohen’s Kappa formula is as follows:

$$K = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)} \tag{8}$$

where:
$\Pr(a)$ = Percentage of the number of measurements that are consistent between raters.
$\Pr(e)$ = Percentage of agreement between raters that is expected by chance.

If Cohen’s Kappa score is <0.20, the consistency is poor. If it is 0.21 - 0.40, the consistency is fair. If it is 0.41 - 0.60, the consistency is moderate. If it is 0.61 - 0.80, the consistency is good. If it is 0.81 - 1.00, the consistency is very good.

3.5 Measurement instrument – Confusion Matrix

The Confusion Matrix can be used as a reference in evaluating the algorithm performance of Machine Learning (especially supervised learning), as shown in Figure 10. The Confusion Matrix represents the predictions and actual conditions of the data generated by the Machine Learning algorithm. Based on the Confusion Matrix, Accuracy, Precision, Recall, and Specificity can be determined.

Figure 10. Confusion Matrix

• True Negative (TN): The model predicts that the data is in the negative class and the actual data is in the negative class.
• True Positive (TP): The model predicts that the data is in the positive class and the actual data is in the positive class.
• False Negative (FN): The model predicts that the data is in the negative class, but actually, the data is in the positive class.
• False Positive (FP): The model predicts that the data is in the positive class, but actually, the data is in the negative class.

Precision is the ratio of correct positive predictions to all positive predictions. The Precision formula is as follows:

$$Precision = \frac{TP}{TP + FP} \tag{9}$$

Recall (Sensitivity) is the ratio of correct positive predictions to all data that are actually positive. The formula for Recall is as follows:

$$Recall = \frac{TP}{TP + FN} \tag{10}$$

Specificity is the ratio of correct negative predictions to all data that are actually negative. The formula for Specificity is as follows:

$$Specificity = \frac{TN}{TN + FP} \tag{11}$$

F1 Score is the harmonic mean of precision and recall. The formula for the F1 Score is as follows:

$$F1\ Score = \frac{2 \times Recall \times Precision}{Recall + Precision} \tag{12}$$

4. DISCUSSION AND RESULT

The grading system for short answer questions has been tested with high school students in Bandung, Indonesia. There were 60 grade 10 students from a high school in Bandung who were taken as respondents. The initial filter keeps only those who have complete answers (no blank answers). Of the 60 students, there are 48 complete answer sheets. The first step is to prepare a table for the learning module that contains context, questions, and answers. Table 1 is an example of a data set in Indonesian language.

These examples were graded both by the automatic grading system and manually by human graders. Table 2 is an example of comparing the results of automatic and manual assessments for the forty-eight (48) students who have complete answers, where each student gets ten (10) random questions.

The results of the two assessments were compared using Cohen’s Kappa method. Cohen's Kappa is a measure that states the consistency of measurements made by two raters, the consistency between two measurement methods, or the consistency between two measurement tools. Cohen's kappa coefficient is only applied to the results of qualitative (categorical) data measurement. The experiments that have been carried out have resulted in consistent equivalence, as shown in Table 3.
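Cohen's Kappa from Eq. (8) can be computed from a square agreement table, with rows for one rater and columns for the other. The sketch below is a generic illustration with hypothetical function names; the example table uses the percentages reported in Table 3, and the interpretation bands follow Section 3.4.

```python
def cohens_kappa(table):
    """Cohen's Kappa for a square table of counts or percentages."""
    total = sum(sum(row) for row in table)
    size = len(table)
    # Pr(a): observed agreement, the share on the main diagonal.
    observed = sum(table[i][i] for i in range(size)) / total
    # Pr(e): agreement expected by chance, from the row and column shares.
    row_share = [sum(row) / total for row in table]
    col_share = [sum(table[i][j] for i in range(size)) / total
                 for j in range(size)]
    expected = sum(r * c for r, c in zip(row_share, col_share))
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def interpret(kappa):
    """Consistency bands as listed in Section 3.4."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"


# Rows: manual grading (correct, incorrect); columns: automatic grading.
agreement = [[76, 5], [3, 16]]
kappa = cohens_kappa(agreement)
print(round(kappa, 2), interpret(kappa))
```

Here the observed agreement is 0.76 + 0.16 = 0.92 and the chance agreement is (0.81 × 0.79) + (0.19 × 0.21) = 0.6798.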
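The confusion matrix measures of Eqs. (9)-(12) reduce to a few ratios. The following sketch uses hypothetical function names and, as example input, the percentages reported for this study (TP = 76%, FP = 5%, FN = 3%, TN = 16%).

```python
def precision(tp, fp):
    """Eq. (9): correct positive predictions over all positive predictions."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Eq. (10): correct positive predictions over all actual positives."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Eq. (11): correct negative predictions over all actual negatives."""
    return tn / (tn + fp)


def f1_score(p, r):
    """Eq. (12): harmonic mean of precision and recall."""
    return 2 * r * p / (r + p)


tp, fp, fn, tn = 76, 5, 3, 16
p = precision(tp, fp)
r = recall(tp, fn)
s = specificity(tn, fp)
f1 = f1_score(p, r)
print(round(p, 2), round(r, 2), round(s, 2), round(f1, 2))
```

The F1 Score can also be written directly in terms of the counts as 2TP / (2TP + FP + FN), which gives 152 / 160 for these values.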
Table 1. Example of Data Set (Indonesian Language)

| Question Context | Question | Answer |
|---|---|---|
| Istilah Komputer | Komputer berasal dari kata | Komputer berasal dari kata Computare |
| Alat masukan | Contoh alat masukan adalah | Contoh alat masukan adalah tetikus |
| Alat keluaran | Contoh alat keluaran adalah | Contoh alat keluaran adalah layar |
| Prosesor | Contoh prosessor komputer adalah | Contoh prosesor komputer adalah INTEL dan AMD |

Table 2. Examples of automatic and manual grading comparison

| Student no. | Question no. | Automatic grading | Manual grading |
|---|---|---|---|
| 1 | 1 | Correct | Correct |
| 1 | 2 | Correct | Correct |
| ... | ... | ... | ... |
| 1 | 10 | Correct | Incorrect |
| 2 | 1 | Incorrect | Incorrect |
| ... | ... | ... | ... |
| 2 | 10 | Correct | Incorrect |
| ... | ... | ... | ... |
| 48 | 1 | Correct | Correct |
| ... | ... | ... | ... |
| 48 | 10 | Incorrect | Correct |

Table 3. Consistency

| Manual grading system (by human) | Automatic grading system: Correct | Automatic grading system: Incorrect |
|---|---|---|
| Correct | 76% | 5% |
| Incorrect | 3% | 16% |

$$K = \frac{(0.76 + 0.16) - ((0.81 \times 0.79) + (0.19 \times 0.21))}{1 - ((0.81 \times 0.79) + (0.19 \times 0.21))} = 0.75$$

The automatic grading system made using BERT produces a Cohen's Kappa value of 0.75. This automatic grading system has a value similar to that of a human appraiser. The Machine Learning Algorithm proposed in this study is able to classify and provide an assessment of these answers.

The experiments carried out were also measured using a confusion matrix. The first measurement is to calculate how precise the implemented machine learning algorithm is. The calculation is as follows:

TP = 76%
FP = 5%
FN = 3%
TN = 16%

$$Precision = \frac{76\%}{76\% + 5\%} = 0.94$$

Meanwhile, the sensitivity or recall of the implemented machine learning algorithm is as follows:

$$Recall = \frac{76\%}{76\% + 3\%} = 0.96$$

The capability of the implemented machine learning algorithm to predict wrong answers is as follows:

$$Specificity = \frac{16\%}{16\% + 5\%} = 0.76$$

The F1 Score, which combines the precision and recall of the implemented machine learning algorithm, is as follows:

$$F1\ Score = \frac{2 \times 0.96 \times 0.94}{0.96 + 0.94} = 0.95$$

5. CONCLUSIONS

The novel system proposed in this study is able to implement Indonesian language in machine learning algorithms using BERT. BERT is an open-source system that by default has limitations because it is set for English. This study proposes a novel system to implement Indonesian in the BERT system for automatic grading of short answers. The experimental results were measured using two measuring instruments: Cohen's Kappa coefficient and the Confusion Matrix.

The BERT output of the implemented system has a Cohen's Kappa coefficient value of 0.75, which means that the implemented algorithm has a good consistency.

In measuring the success of the algorithm using the confusion matrix, the following values are produced: Precision of 0.94, Recall of 0.96, Specificity of 0.76, and F1 Score of 0.95. Based on the measurement results, it can be seen that the implementation of the automatic short answer grading system in Indonesian Language using BERT machine learning has been successful.

ACKNOWLEDGMENT

Thank you for the support from Computer Laboratory, Department of Computer Engineering, Maranatha Christian University, Indonesia, in carrying out this study.

REFERENCES

[1] Nederhand, M.L., Tabbers, H.K., Jongerling, J., Rikers, R.M. (2020). Reflection on exam grades to improve calibration of secondary school students: A longitudinal study. Metacognition and Learning, 15(3): 291-317. https://doi.org/10.1007/s11409-020-09233-9
[2] Friedrich, D. (2019). Effectiveness of peer review as cooperative web-based learning method applied out-of-class in a role playing game: A case study by quasi-experimental approach. Smart Learning Environments, 6(19): 1-22. https://doi.org/10.1186/s40561-019-0102-5
[3] Basey, J.M., Maines, A.P., Francis, C.D., Melbourne, B., Wise, S.B., Safran, R.J., Johnson, P.T. (2014). Impact of pre-lab learning activities, a post-lab written report, and content reduction on evolution-based learning in an undergraduate plant biodiversity lab. Evolution: Education and Outreach, 7(1): 1-9. https://doi.org/10.1186/s12052-014-0010-7
[4] de Bruin, A.B., Kok, E.M., Lobbestael, J., de Grip, A. (2017). The impact of an online tool for monitoring and regulating learning at university: Overconfidence, learning strategy, and personality. Metacognition and Learning, 12(1): 21-43. https://doi.org/10.1007/s11409-016-9159-5
[5] Wenzel, K., Reinhard, M.A. (2020). Tests and academic cheating: Do learning tasks influence cheating by way of negative evaluations? Social Psychology of Education, 23(3): 721-753. https://doi.org/10.1007/s11218-020-09556-0
[6] Boukhari, Y. (2020). Application and comparison of machine learning algorithms for predicting mass loss of cement raw materials due to decarbonation process. Revue d’Intelligence Artificielle, 34(4): 403-411. https://doi.org/10.18280/ria.340404
[7] Hasanah, U., Permanasari, A.E., Kusumawardani, S.S., Pribadi, F.S. (2019). A scoring rubric for automatic short answer grading system. Telkomnika, 17(2): 763-770. https://doi.org/10.12928/TELKOMNIKA.V17I2.11785
[8] Süzen, N., Gorban, A.N., Levesley, J., Mirkes, E.M. (2020). Automatic short answer grading and feedback using text mining methods. Procedia Computer Science, 169(2019): 726-743. https://doi.org/10.1016/j.procs.2020.02.171
[9] Ince, E.Y., Kutlu, A. (2021). Web-based Turkish Automatic Short-Answer Grading System. Natural Language Processing Research, 1(3-4): 46-55. https://doi.org/10.2991/nlpr.d.210212.001
[10] Zhang, L., Huang, Y., Yang, X., Yu, S., Zhuang, F. (2019). An automatic short-answer grading model for semi-open-ended questions. Interactive Learning Environments, pp. 1-14. https://doi.org/10.1080/10494820.2019.1648300
[11] Pado, U., Kiefer, C. (2015). Short answer grading: When sorting helps and when it doesn’t. In Proceedings of the Fourth Workshop on NLP for Computer-Assisted Language Learning, pp. 42-50.
[12] Vankayalapati, R., Ghutugade, K.B., Vannapuram, R., Prasanna, B.P.S. (2021). K-means algorithm for clustering of learners performance levels using machine learning techniques. Revue d’Intelligence Artificielle, 35(1): 99-104. https://doi.org/10.18280/ria.350112
[13] Sharma, R., Hooda, N. (2019). Optimized ensemble machine learning framework for high dimensional imbalanced bio assays. Revue d’Intelligence Artificielle, 33(5): 387-392. https://doi.org/10.18280/ria.330509
[14] Satla, S.P., Sadanandam, M., Suvarna, B. (2020). Dangerous Prediction in Roads by Using Machine Learning Models. Ingénierie des Systèmes d’Information, 25(5): 637-644. https://doi.org/10.18280/isi.250511
[15] Singla, S.K., Garg, R.D., Dubey, O.P. (2020). Ensemble machine learning methods to estimate the sugarcane yield based on remote sensing information. Revue d’Intelligence Artificielle, 34(6): 731-743. https://doi.org/10.18280/RIA.340607
[16] Das, S., Deb, N., Cortesi, A., Chaki, N. (2021). Sentence embedding models for similarity detection of software requirements. SN Computer Science, 2(2): 1-11. https://doi.org/10.1007/s42979-020-00427-1
[17] Devlin, J., Chang, M.W., Lee, K., Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. https://doi.org/10.18653/v1/N19-1423
[18] Alzubi, J., Alzubi, J.A., Jain, R., Singh, A., Parwekar, P., Gupta, M. (2021). COBERT: COVID-19 question answering system using BERT. Arabian Journal for Science and Engineering, pp. 1-11. https://doi.org/10.1007/s13369-021-05810-5
[19] Sur, C. (2020). RBN: Enhancement in language attribute prediction using global representation of natural language transfer learning technology like Google BERT. SN Applied Sciences, 2(1): 1-15. https://doi.org/10.1007/s42452-019-1765-9
[20] Kanerva, J., Ginter, F., Pyysalo, S. (2020). Dependency parsing of biomedical text with BERT. BMC Bioinformatics, 21(23): 1-12. https://doi.org/10.1186/s12859-020-03905-8
[21] Ahmed, M.Z., Mahesh, C. (2021). A weight based labeled classifier using machine learning technique for classification of medical data. Revue d’Intelligence Artificielle, 35(1): 39-46. https://doi.org/10.18280/ria.350104