Plagiarism and Its Detection in Programming Languages
Abstract
Program similarity checking is an important application in programming education. The growing volume of material available in electronic form, together with easier access to it via the Internet, makes plagiarism, whether intentional or unintentional, easier than ever before. As more material becomes available online, checking for plagiarism is becoming increasingly difficult. This report discusses techniques for detecting plagiarism in programming languages in order to give the reader a comprehensive introduction to the area.
Keywords
Source-code, Plagiarism, Reuse, Greedy String Tiling
1. Introduction
Plagiarism in programming assignments is an inevitable issue for most academics who teach programming. The Internet, the rising number of essay banks, and textbooks are common sources from which students obtain material, and these facilities make it easier for students to plagiarize. A recent article revealed that some students use the Internet to hire expert coders to implement their programming assignments [4]. Bull et al. [1] and Culwin et al. [2] surveyed academics to determine the prevalence of plagiarism, and evaluated the performance of free-text and source-code plagiarism detection software respectively. The surveys show that both free-text and source-code plagiarism are significant problems in academic institutions. In the study by Bull et al. [1], 50% of the 293 academics who took part felt that plagiarism had increased in recent years.
A review of the current literature on source-code plagiarism in student assignments reveals no commonly agreed description of what constitutes source-code plagiarism from the perspective of academics who teach programming on computing courses. Some definitions of source-code plagiarism exist, but they appear to be very limited. For example, according to Faidhi and Robinson [3], plagiarism occurs when programming assignments are copied and transformed with very little effort from the students, whereas Joy and Luck define plagiarism as unacknowledged copying of documents and programs. A book on academic misconduct by Decoo [5] discusses various issues surrounding academic plagiarism, and briefly covers software plagiarism at the levels of user interface, content and source code. Sutherland-Smith [6] carried out a survey to gather the views of 11 teachers in the faculty of Business and Law at South-Coast University in Australia.
The findings reveal varied perceptions on plagiarism between academics
1994; 1994; 2005
Java, C, C++, natural language text.
C, C++, Java, C#, Python, Visual Basic, Javascript, FORTRAN, ML, Haskell, Lisp, Scheme, Pascal, Modula2, Perl, TCL, Matlab, VHDL, Verilog, Spice, MIPS Assembly 8086, HCL2.
Java, JSP, C, C++, Fortran and PHP code.
BASIC, C, C++, C#, Delphi, Flash ActionScript, Java, JavaScript, MASM, Pascal, Perl, PHP, PowerBuilder, Ruby, SQL, Verilog, VHDL.
Cost: Free, but the user must create an account; Free, but the user must create an account; Free and open source; Free and open source; Commercial tool, free on any code where the total of all files being examined is less than 1 megabyte.
Service / Interface: Standalone application GUI; Standalone application GUI; Standalone application GUI.
Requirements: Web browser, Java Runtime Environment (JRE), Java 1.5 or higher; A submission script for either UNIX or Windows; JDK 1.4 or later; JDK 1.4 or later.
Security: User id and e-mail needed; User id and e-mail needed; Runs locally; Runs locally; Runs locally.
Submission Methods: Command line; Standalone Java software application that can be deployed over the network; Files, or a directory with or without subdirectories; Files, or a directory with or without subdirectories; Standalone application; Standalone application.
Source Data: A directory containing subdirectories, each containing one or more zip archives or submissions.
Yes; Yes; Yes; Yes; No
Result Output: HTML output for exploring the similar code fragments found; Both text and HTML reports; Interactive dialogues for comparing detected similarities and printing reports; HTML report, and spreadsheet showing statistical information about the files that were analyzed; Text report, XML file, CSV file.
Local; Ordered list
Local; Histogram, statistics about the files that were analyzed
Powerful graphical interface for presenting results; Powerful graphical interface for presenting results; 2/4 scroll bars, simultaneous scrolling; Graphical user interface presenting the results; Table showing the similar lines of code detected among file pairs; List of results showing the detected code fragments between file pairs.
No; No; Yes; Yes
Returns suspicious lines of code; Returns duplicate lines of code, no option to view entire suspicious code fragments or the entire suspicious files.
Metrics produced: Percentage similarity, token matches; Percentage similarity, token matches, lines matched; Percentage similarity, lines matched; Token matches, lines matched.
Other features: Checkbox to mark the suspicious pairs.
Algorithms: Winnowing algorithm.
Figure 1: Table comparing the features offered by various tools that aid with the detection of source code plagiarism.
2. Methodology
The paper presents the design of an anti-plagiarism tool for use on a local machine. The most basic subproblem to address is how to analyze the codes to be compared. This requires identifying the parameters that best characterize the uniqueness of each code, so that every code under analysis can be distinguished from the others. If two codes share the same distinguishing features, one may have been copied from the other. The next questions are which comparison algorithm to choose and which issues are critical. The algorithm should be efficient in both processing time and complexity, and the processing should keep the number of false cases to a minimum. These issues are discussed in the following subsections.
2.1 The Algorithm
The most important step is selecting the algorithm used to compare the codes. Many algorithms have been proposed for this purpose; a few are worth mentioning here: the Greedy String Tiling algorithm [10], the winnowing-based approach [11], and the metrics-based approach [12]. The next concern is which of these to choose for code comparison. All of the above approaches except Greedy String Tiling are either computationally expensive or unsuitable for the intended usage, so the Greedy String Tiling algorithm is preferred. String tiling is a powerful technique used in domains such as document matching and rigorous string operations. The algorithm works by comparing the tokens identified by partially parsing the code files. The tokens are
2.3 The Graphical User Interface (GUI) for the software
Once the algorithm is finalized, the actual implementation follows. The application must take the input source code files from the chosen directory, analyze them and then
Figure 3: Plagio Guard: An Anti-Plagiarism Tool for Programming Languages. The upper code is the original whose replica was found, and the lower code is the replica. The label in between shows the percentage match.
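The token comparison described in Section 2.1 and the percentage label shown in Figure 3 can be illustrated with a minimal sketch of basic Greedy String Tiling. This is not the Plagio Guard implementation: the tokenizer, the tiny keyword set, the function names, the default minimum match length of 3 and the similarity formula (twice the number of tiled tokens divided by the total number of tokens) are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny keyword set; a real tool would use the
# full keyword list of the language being checked.
KEYWORDS = {"int", "for", "if", "else", "while", "return"}


def tokenize(source):
    """Rough partial parse: keep keywords and punctuation, and collapse
    identifiers to ID and numbers to NUM so that renaming has no effect."""
    tokens = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|\d+|\S", source):
        if lexeme in KEYWORDS:
            tokens.append(lexeme)
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append("ID")
        elif lexeme.isdigit():
            tokens.append("NUM")
        else:
            tokens.append(lexeme)
    return tokens


def greedy_string_tiling(a, b, min_match=3):
    """Return tiles (start_a, start_b, length): non-overlapping common
    token runs, placed longest first (basic form, no hashing speed-up)."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_match = min_match
        matches = []
        # Find the longest common runs made only of unmarked tokens.
        for i in range(len(a)):
            if marked_a[i]:
                continue
            for j in range(len(b)):
                if marked_b[j]:
                    continue
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k == max_match:
                    matches.append((i, j, k))
                elif k > max_match:
                    matches = [(i, j, k)]
                    max_match = k
        if not matches:
            break
        # Turn the maximal matches into tiles, skipping any that overlap
        # a tile already placed in this round.
        for i, j, k in matches:
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for t in range(k):
                marked_a[i + t] = True
                marked_b[j + t] = True
            tiles.append((i, j, k))
    return tiles


def similarity_percent(a, b, min_match=3):
    """Percentage of tokens in both sequences that are covered by tiles."""
    if not a and not b:
        return 0.0
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match))
    return 100.0 * 2 * covered / (len(a) + len(b))


original = tokenize("int sum = 0; for (int i = 0; i < n; i++) sum += i;")
replica = tokenize("int total = 0; for (int k = 0; k < n; k++) total += k;")
print(similarity_percent(original, replica))
```

Because identifiers are collapsed to a single token in this sketch, the two loops above produce identical token streams and are reported as a full match even though every variable has been renamed. The minimum match length trades sensitivity against false cases: smaller values catch shorter copied fragments but also match common idioms.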
Conclusion
The Plagio Guard application demonstrates that the proposed solution is well suited to detecting plagiarized codes. The proposed algorithm does not depend on the programming language or on the data being compared. The same approach could be applied to code in other languages, such as MASM assembly language, and to textual comparison for detecting plagiarism in essays or in documents in general.
References
[1] Bull J., et al., Technical review of plagiarism detection software report, CAA, University of Luton, (2001).
http://www.cs.berkeley.edu/moss/ (as of April 2000) and personal communication, 1998. University of Berkeley,