Ilham 2020 IOP Conf. Ser. Mater. Sci. Eng. 875 012039
Abstract. To complete their studies, students are required to submit final projects. In some
universities, final projects do not have to be submitted for publication; the final project reports
are stored in a local database. As the number of final projects in the local database grows, similar
content may appear among the documents. Commercial tools cannot be used to detect content
similarity because the documents are not published. This paper proposes a system to detect
content similarity in documents that are stored in a local database. Considering the number of
stored documents, the system implements a two-step process. First, documents are clustered to
find the most related documents. Second, content similarity is measured among the selected
documents. The experimental results show that the system successfully clusters documents and
detects content similarity by implementing the TF-IDF and Cosine Similarity algorithms. The
system is limited to processing documents written in Bahasa.
1. Introduction
Similar content among documents may indicate that the documents contain plagiarism. Plagiarism is
defined as taking essays, opinions and so on from others and presenting them as one's own. For
documents with little content, the examination can be done manually without the help of a system.
However, for documents with thousands of lines and pages, manual examination is not feasible.
Many commercial tools exist, but they cannot be used to examine documents that are stored in
a local database.
A number of techniques have been used to detect content similarity in documents, such as
string-based detection, Vector Space Models (VSM), syntax- and semantic-based detection,
structural-based detection and citation-based detection [1]. The Rabin-Karp algorithm was used
in [2], where the hash values generated by the algorithm serve as the benchmark of similarity
between the documents being tested. However, that approach still requires tuning of the K-gram
value used in the tokenization process. A weakness of this algorithm is that the system cannot
know the order in which documents appeared, in this case the time of publication; it only
determines the similarity between the documents being compared. The level of document
similarity obtained with several methods, such as K-Means Clustering, K-Nearest Neighbor and the
Shingling algorithm, is compared in [3]. The purpose of that study is to compare two text document
vectors to measure their degree of similarity. The best accuracy, 95%, was obtained with the
K-Nearest Neighbor algorithm.
The Rabin-Karp algorithm has also been applied in an Indonesian document similarity detection
system. In addition to Rabin-Karp, the Confix-Stripping method is used to find the base words
during text processing. The system works well in detecting document similarities, with an average
processing time of 0.0123 seconds and an average similarity level of 89.1967% for the tested
documents. Overall, the accuracy shown by the system is 0.7 [4].
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
The 3rd EPI International Conference on Science and Engineering 2019 (EICSE2019) IOP Publishing
IOP Conf. Series: Materials Science and Engineering 875 (2020) 012039 doi:10.1088/1757-899X/875/1/012039
A number of commercial tools have been developed, such as PlagAware, PlagScan, Check for
Plagiarism, iThenticate, PlagiarismDetection.org, Academic Plagiarism, The Plagiarism Checker,
Urkund, Docoloc and Turnitin [5]. PlagAware provides two main features, namely tracing content
theft and proof of authorship [6]. PlagScan provides a number of features such as database checking,
Internet checking, publications checking, and synonym and sentence structure checking [7].
CheckForPlagiarism.net uses fingerprints and document sources to protect documents from
plagiarism. It provides features such as Internet checking, publications checking, synonym and
sentence structure checking, and multiple document comparison [8].
2. Research Method
Considering the number of stored documents, our system implements a two-step process. First,
documents are clustered to find the 10 most related documents. Second, content similarity is
measured among the selected documents. The system design can be seen in Figure 1.
2.1. Preprocessing
Preprocessing is the initial stage of text mining, in which the text data is cleaned so that it
becomes more structured before entering the next stage of processing. Continuous character
sequences (text) must be broken into more meaningful units. This can be done on several different
levels. The text preprocessing stages in this study consist of tokenizing, case folding, filtering,
and stopword removal.
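The four stages named above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `preprocess`, the regular expression used for filtering, and the very small `STOPWORDS` set are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# a real system would load a complete stopword list for the language.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "untuk", "ini", "itu"}

def preprocess(text, stopwords=STOPWORDS):
    """Return the cleaned tokens of one document."""
    # Tokenizing: split the character stream on whitespace.
    tokens = text.split()
    # Case folding: convert every token to lowercase.
    tokens = [t.lower() for t in tokens]
    # Filtering (assumed rule): drop every character that is not a-z.
    tokens = [re.sub(r"[^a-z]", "", t) for t in tokens]
    # Stopword removal: discard empty tokens and stopwords.
    return [t for t in tokens if t and t not in stopwords]

tokens = preprocess("Sistem deteksi dokumen, yang DIBANGUN untuk Bahasa Indonesia!")
```

Here `tokens` holds the lowercase content words with punctuation and the two stopwords removed.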
where f(d,t) is the number of times the word t appears in document d. IDF considers the frequency of
a word across all documents. IDF weighting assumes that the weight of a word should be large if the
word appears often in a document but few documents contain the word. IDF values are calculated
using Equation (2):
$$\mathrm{IDF}(t) = \log\left(\frac{N}{df(t)}\right) \qquad (2)$$
where df(t) is the number of documents that contain the word t and N is the total number of
documents. The results of previous studies show that TF x IDF weighting improves performance.
The TF x IDF value is calculated using the following function [9].
Several formulas are used for Term Frequency (TF) [10]:
a. Binary TF: considers only whether or not a word exists in the document. If it exists, a value
of one is given; if not, a value of zero is given.
b. Raw TF: the TF value is the number of occurrences of a word in the document. For
example, a word that appears five times has a value of five.
c. Logarithmic TF: used to avoid the dominance of documents that contain few of the query
words but with a high frequency.
d. Normalized TF: the ratio of the frequency of a word to the total number of words in the
document.
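The four TF variants can be illustrated with the short Python sketch below. The function names are hypothetical, and the logarithmic form `1 + log10(count)` is one common definition chosen here as an assumption, since the text does not give the exact formula.

```python
import math

def binary_tf(count):
    """Binary TF: 1 if the word occurs in the document, otherwise 0."""
    return 1 if count > 0 else 0

def raw_tf(count):
    """Raw TF: the number of occurrences itself."""
    return count

def log_tf(count):
    """Logarithmic TF (assumed form): 1 + log10(count), or 0 when absent."""
    return 1 + math.log10(count) if count > 0 else 0.0

def normalized_tf(count, total_words):
    """Normalized TF: occurrences divided by the document length."""
    return count / total_words if total_words else 0.0
```

For a word that appears five times in a 20-word document, `raw_tf(5)` is 5, `binary_tf(5)` is 1, and `normalized_tf(5, 20)` is 0.25.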
$$sim(d_j, q) = \frac{\sum_{i=1}^{N} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{N} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{N} w_{i,q}^{2}}} \qquad (4)$$
where:
$d_j$ = document j
$q$ = query document
$w_{i,j}$ = weight of word i in document j
$w_{i,q}$ = weight of word i in the query document
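Equation (4) translates directly into a few lines of Python. The sketch below assumes that both documents are already represented as equal-length lists of term weights; the function name `cosine_similarity` and the choice to return 0 for an all-zero vector are assumptions of this example.

```python
import math

def cosine_similarity(w_j, w_q):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(w_j) != len(w_q):
        raise ValueError("weight vectors must have the same length")
    # Numerator: sum over all terms of w_ij * w_iq.
    dot = sum(a * b for a, b in zip(w_j, w_q))
    # Denominator: product of the two Euclidean norms.
    norm_j = math.sqrt(sum(a * a for a in w_j))
    norm_q = math.sqrt(sum(b * b for b in w_q))
    # Assumption: a vector with no weight is treated as having similarity 0.
    if norm_j == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_j * norm_q)
```

Vectors that point in the same direction give a value of 1, and vectors with no terms in common give 0.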
The TF-IDF weighting process can be seen in Figure 2. The TF-IDF method combines two concepts
for weight calculation, namely the frequency of occurrence of a word in a particular document and the
inverse frequency of the documents containing the word. The frequency with which a word appears
in a given document indicates how important that word is in the document. The number of documents
containing the word indicates how common the word is. So, the weight of the relationship between a
word and a document will be high if the frequency of the word in the document is high and the
number of documents containing the word is low across the collection.
[Figure 2. TF-IDF weighting process (flowchart): Start; calculate DF (number of documents
containing the keywords); IDF = log(n/DF); TF-IDF = TF * IDF; TF-IDF results; End.]
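The flow in Figure 2 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the system's actual code: the function names and the toy `corpus` are hypothetical, raw counts are used as TF, and the base-10 logarithm is assumed because the IDF values 0.602 and 0.301 reported in Table 4 correspond to log10(4) and log10(2).

```python
import math
from collections import Counter

def document_frequency(term, documents):
    """DF: number of documents (token lists) that contain the term."""
    return sum(1 for tokens in documents if term in tokens)

def inverse_document_frequency(term, documents):
    """IDF = log10(n / DF), with n the number of documents."""
    df = document_frequency(term, documents)
    # Assumption: a term that occurs in no document gets weight 0
    # instead of raising a division-by-zero error.
    if df == 0:
        return 0.0
    return math.log10(len(documents) / df)

def tf_idf(term, tokens, documents):
    """TF-IDF = TF * IDF, using the raw count of the term as TF."""
    return Counter(tokens)[term] * inverse_document_frequency(term, documents)

# Toy corpus of already preprocessed documents (hypothetical data).
corpus = [
    ["sistem", "deteksi", "dokumen"],
    ["metode", "vector", "space"],
    ["metode", "sistem", "metode"],
    ["bahasa", "indonesia"],
]
weight = tf_idf("metode", corpus[2], corpus)  # 2 * log10(4 / 2)
```

A term that appears in every document has an IDF of log10(1) = 0, so it contributes nothing to the weight no matter how often it occurs.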
After preprocessing, the TF and DF values are calculated by counting the occurrences of each term of
Q in the other documents. The TF and DF values can be seen in Table 3.
The DF values are used to calculate IDF using the log(n/df) formula. The resulting TF x IDF values
can be seen in Table 4.
                                          TF x IDF
Term          IDF = log(n/df)     D2        D3        D4        D5
dokumen       0                   0         0         0         0
alasan        0                   0         0         0         0
penulis       0                   0         0         0         0
mencoba       0                   0         0         0         0
membangun     0                   0         0         0         0
sistem        0                   0         0         0         0
deteksi       0.602               0         0         0.602     0
bahasa        0                   0         0         0         0
indonesia     0                   0         0         0         0
metode        0.301               0         0         0.602     0.301
vector        0                   0         0         0         0
space         0                   0         0         0         0
Document Weight Value             0         2.408     1.806     0.301
As can be seen in Table 4, document D3 has the highest weight value among the documents.
This indicates that document D3 has a high similarity with document Q. We apply the Cosine
Similarity algorithm to measure the level of similarity between documents Q and D3, as shown in
Table 5.
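The selection step described here can be sketched as a simple ranking by document weight. The weights below are the values in the "Document Weight Value" row of Table 4; the function name `select_candidates`, the default limit of 10 (taken from the "10 most related documents" in Section 2) and the rule of skipping documents with zero weight are assumptions of this illustration.

```python
# Document weight values as listed in the last row of Table 4.
document_weights = {"D2": 0.0, "D3": 2.408, "D4": 1.806, "D5": 0.301}

def select_candidates(weights, k=10):
    """Return up to k document ids, ordered from highest to lowest weight."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    # Assumption: a document with zero weight shares no weighted term
    # with the query and is therefore not passed to the second step.
    return [doc_id for doc_id, weight in ranked[:k] if weight > 0]

best = select_candidates(document_weights, k=1)
candidates = select_candidates(document_weights)
```

With these values `best` contains only D3, which is the document then compared with Q using Cosine Similarity.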
Based on the Cosine Similarity formula in Equation (4), the result is:
$$\text{Cosine Similarity} = \frac{12}{\sqrt{45} \times \sqrt{24}} = \frac{12}{6.708 \times 4.899} = 0.365$$
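The arithmetic can be checked with a few lines of Python. The three input numbers are taken from the calculation above; the variable names are hypothetical, and the source does not state here which sum of squares belongs to Q and which to D3.

```python
import math

dot_product = 12        # numerator of Equation (4)
sum_of_squares_a = 45   # sum of squared weights of the first document
sum_of_squares_b = 24   # sum of squared weights of the second document

norm_a = math.sqrt(sum_of_squares_a)
norm_b = math.sqrt(sum_of_squares_b)
similarity = dot_product / (norm_a * norm_b)

print(round(norm_a, 3), round(norm_b, 3), round(similarity, 3))
```

Rounded to three decimals, the result agrees with the reported value of 0.365.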
4. Conclusion
A system to detect content similarity in documents stored in a local database was built in this
study. Considering the number of stored documents, the system implements a two-step process:
clustering documents to find the most related documents, and then finding content similarity among
the selected documents. The system is able to recognize the order in which documents appeared, in
this case the time of publication; only documents with an older publication time are compared to the
tested document. Functional testing showed that the system was able to accept the input, process
the document and display the results as expected. The system successfully clustered documents and
detected content similarity by implementing the TF-IDF and Cosine Similarity algorithms. Future
research can explore other weighting methods for large-scale document collections.
Acknowledgment
This research is funded by Lab Based Education Grant 2019, Faculty of Engineering, Hasanuddin
University.
References
[1] K. Vani and D. Gupta, “Study on Extrinsic Text Plagiarism Detection Techniques and Tools,” J.
Eng. Sci. Technol. Rev., vol. 9, no. 3, pp. 99–105, 2016.
[2] R. E. Putri and A. P. Siahaan, “Examination of Document Similarity using Rabin-Karp
Algorithm,” International Journal of Recent Trends in Engineering and Research, vol. 03, no.
08, pp. 196–201, 2017.
[3] M. Oghbaie and M. M. Zanjireh, “Pairwise document similarity measure based on present term
set,” Journal of Big Data, vol. 5, no. 52, pp. 1–23, 2018.
[4] D. D. Sinaga and S. Hansun, “Indonesian Text Document Similarity Detection System using
Rabin-Karp and Confix-Stripping Algorithms,” International Journal of Innovative Computing,
Information and Control, vol. 14, no. 5, pp. 1893–1903, 2018.
[5] A. M. El Tahir Ali, H. M. Dahwa Abdulla, and V. Snášel, “Overview and comparison of plagiarism
detection tools,” CEUR Workshop Proc., vol. 706, pp. 161–172, 2011.
[6] “PlagAware.” [Online]. Available: https://www.plagaware.com/. [Accessed: 09-Apr-2019].
[7] “PlagScan: Online Plagiarism Checking.” [Online]. Available: https://www.plagscan.com.
[Accessed: 09-Apr-2019].