Abstract
The string matching and global word frequency model are two basic models of Document Copy Detection, although they are both unsatisfied in some respects. The String Kernel (SK) and Word Sequence Kernel (WSK) may map string pairs into a new feature space directly, in which the data is linearly separable. This idea inspires us with the Semantic Sequence Kin (SSK) and we apply it to document copy detection. SK and WSK only take into account the gap between the first word/term and the last word/term so that it is not good for plagiarism detection. SSK considers each common word’s position information so as to detect plagiarism in a fine granularity. SSK is based on semantic density that is indeed the local word frequency information. We believe these measures diminish the noise of rewording greatly. We test SSK in a small corpus with several common copy types. The result shows that SSK is excellent for detecting non-rewording plagiarism and valid even if documents are reworded to some extent.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998)
Cancedda, N., Gaussier, E., Goutte, C., Renders, J.M.: Word-Sequence Kernels. Journal of Machine Learning Research 3, 1059–1082 (2003)
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., Watkins, C.: Text Classification using String Kernels. Journal of Machine Learning Research 2, 419–444 (2002)
Brin, S., Davis, J., Garcia-Molina, H.: Copy detection mechanisms for digital documents. In: Proceedings of the ACM SIGMOD Annual Conference, San Francisco, USA (1995)
Broder, A.Z., Glassman, S.C., Manasse, M.S.: Syntactic Clustering of the Web. In: Proceedings of the Sixth International Web Conference, Santa Clara, California (1997)
Heintze, N.: Scalable Document Fingerprinting. In: Proceedings of the Second USENIX Workshop on Electronic Commerce, Oakland, California (1996)
Shivakumar, N., Garcia-Molina, H.: SCAM: A copy detection mechanism for digital documents. In: Proceedings of 2nd International Conference in Theory and Practice of Digital Libraries, Austin, Texas (1995)
Monostori, K., Zaslavsky, A., Schmidt, H.: MatchDetectReveal: Finding Overlapping and Similar Digital Documents. In: Proceedings of Information Resources Management Association International Conference, Alaska (2000)
Si, A., Leong, H.V., Lau, R.W.H.: CHECK: A Document Plagiarism Detection System. In: Proceedings of ACM Symposium for Applied Computing, pp. 70–77 (1997)
Qinbao, S., Junyi, S.: On illegal coping and distributing detection mechanism for digital goods. Journal of Computer Research and Development 38, 121–125 (2001)
Porter, M.F.: An algorithm for suffix stripping. Program 14, 130–137 (1980)
Denning, P.J.: Editorial: Plagiarism in the web. Communications of the ACM 38 (1995)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bao, JP., Shen, JY., Liu, XD., Liu, HY., Zhang, XD. (2004). Semantic Sequence Kin: A Method of Document Copy Detection. In: Dai, H., Srikant, R., Zhang, C. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2004. Lecture Notes in Computer Science(), vol 3056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24775-3_63
Download citation
DOI: https://doi.org/10.1007/978-3-540-24775-3_63
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22064-0
Online ISBN: 978-3-540-24775-3
eBook Packages: Springer Book Archive