计算机科学 ›› 2019, Vol. 46 ›› Issue (10): 141-147.doi: 10.11896/jsjkx.180801573

• 信息安全 • 上一篇    下一篇

基于后缀树的二进制可执行代码的克隆检测算法

张凌浩1, 桂盛霖2,3, 穆逢君2, 王胜1   

  1. (国网四川省电力公司电力科学研究院 成都610000)1
    (电子科技大学计算机科学与工程学院 成都611731)2
    (中国电子科技集团公司第三十研究所 成都610041)3
  • 收稿日期:2018-08-25 修回日期:2019-01-08 出版日期:2019-10-15 发布日期:2019-10-21
  • 作者简介:张凌浩(1985-),男,博士,工程师,主要研究方向为电力信息安全技术;桂盛霖(1983-),男,博士,副教授,CCF会员,主要研究方向为嵌入式软件技术、信息安全,E-mail:[email protected];穆逢君(1997-),男,主要研究方向为软件克隆检测技术;王胜(1987-),男,硕士,工程师,主要研究方向为网络安全技术。
  • 基金资助:
    本文受国家自然科学基金(61401067),国网四川省电力公司科技项目(521997170001P, 521997170017)资助。

Clone Detection Algorithm for Binary Executable Code with Suffix Tree

ZHANG Ling-hao1, GUI Sheng-lin2,3, MU Feng-jun2, WANG Sheng1   

  1. (State Grid Sichuan Electric Power Research Institute,Chengdu 610000,China)1
    (School of Computer Science and Engineering,University of Electronic Science and Technology of China,Chengdu 611731,China)2
    (The 30th Institute of China Electronics Technology Group Corporation,Chengdu 610041,China)3
  • Received:2018-08-25 Revised:2019-01-08 Online:2019-10-15 Published:2019-10-21

摘要: 如何发现代码克隆,是软件维护和软件侵权纠纷案件中的一个关键问题。由于商业保密等原因,在商业软件的侵权纠纷案中往往无法使用基于源代码比对的克隆检测技术。因此,针对这类无法获得源代码进行代码克隆检测的场景,文中提出一种针对二进制可执行文件分析的代码克隆检测方法。首先,通过反编译与指令类型抽象得到二进制可执行目标文件的指令类型序列;然后,对指令类型序列构建后缀树,利用后缀树的性质获取函数级的指令序列间的克隆信息,并通过消除沙砾指令进一步提高检测性能;最后,基于MIPS32指令集,使用Linux 内核和经过混淆处理的代码分别作为克隆级别0-级别2与级别1-级别4的二进制可执行文件代码克隆测试样本,并与源代码检测工具进行对比测试。结果表明,所提算法在缺少源代码的场景下同样能进行细粒度的克隆分析,且对各级代码克隆均具有较好的检测性能。

关键词: 代码克隆, 二进制可执行文件, 后缀树, 性能优化

Abstract: How to detect code clones is an important issue in software maintenance and software infringements.Clone detection techniques based on source code tend to fail in the infringement disputes of commercial software due to trade secret.Therefore,in the scenario when the source code is unavailable for detection,this paper presented a clone detection algorithm based on binary executable file analysis.Firstly,instruction type sequences of binary executable files are obtained by decompilation instruction type abstraction,then a suffix tree is constructed based on these instruction type sequences.The clone pairs among functions can be figured out based on this suffix tree.In addition,this paper eliminated gravel instructions for enhancing performance.At last,based on MIPS32 instruction set,this paper used respectively Linux kernel and obfuscated test code as samples on clone level 0-level 2 and level 1-level 4 to compare with the source code detection tools.Test results show that even in the scenario where the source code is lacking,this algorithm can also perform fine-grained clone analysis and has high detection performance for code clones at all levels.

Key words: Binary executable file, Code cone, Performance optimization, Suffix tree

中图分类号: 

  • TP311.5
[1]United States Court of Appeals for the Federal Circuit:JACOBSENV.KATZER [A/OL].(2018-03-27)[2018-07-07].http://www.cafc.uscourts.gov/sites/default/files/opinions-orders/08-1001.pdf.
[2]LV H.Computer software copyright infringement judge[D]. Lanzhou:Lanzhou University,2015.(in Chinese)
吕浩.计算机软件著作权侵权的判定研究[D].兰州:兰州大学,2015.
[3]DONG H,TANG Q.Review of Jacobsen V.Katzer Case:The Influence of the Differences Between Chinese and American Copyright Systems on the Nature of Open Source Agreements and Its Enlightenment[J].China Copyright,2009,2009(3):48-51.(in Chinese)
董皓,唐磬.Jacobsen V.Katzer案评述:中美版权制度差异对开源协议性质的影响及启示[J].中国版权,2009,2009(3):48-51.
[4]SU Q,WU W M,LI Z L,et al.Research and Application of Chaos Opaque Predicate in Code Obfuscation[J].Computer Science,2013,40(6):155-159.(in Chinese)
苏庆,吴伟民,李忠良,李景樑,陈为德.混沌不透明谓词在代码混淆中的研究与应用[J].计算机科学,2013,40(6):155-159.
[5]ROY,CHANCHAL K,CORDY,et al.Comparison and evaluation of code clone detection techniques and tools:A qualitative approach[J].Science of Computer Programming,2009,74(7):470-495.
[6]RATTAN D,BHATIA R,SINGH M.Software clone detection:A systematic review[J].Information & Software Technology,2013,55(7):1165-1199.
[7]SU X H,ZHANG F L.A Survey for Management Oriented Code Clone Research[J].Chinese Journal of Computers,2018,41(3):628-651.(in Chinese)
苏小红,张凡龙.面向管理的克隆代码研究综述[J].计算机学报,2018,41(3):628-651.
[8]CHEN H,GUO T,CUI B J,et al.Comparison and analysis on binary file similarity detection technique[C]//Proceedings of Conference on Vulnerability Analysis and Risk.Beijing:China Information Technology Security Evalution Center,2011.
[9]JOHNSON J H.Substring matching for clone detection and change tracking[C]//Proceedings of International Conference on Software Maintenance,1994.New York:IEEE,1994:120-126.
[10]KAMIYA T,KUSUMOTO S,INOUE K.CCFinder:A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code[J].IEEE Transactions on Software Engineering,2002,28(7):654-670.
[11]SAINI V,SAJNANI H,KIM J,et al.SourcererCC and SourcererCC-I:tools to detect clones in batch mode and during software development[C]// Proceedings of the 38th Proceedings of International Conference on Software Engineering Companion,2016.New York:IEEE,2016:597-600.
[12]SUDHAMANI M,RANGARAJAN L.Code clone detection based on order and content of control statements[C]//Procee-dings of International Conference on Contemporary Computing &Informatics.New York:IEEE,2017:59-64.
[13]WANG M,WANG P,XU Y.CCSharp:An Efficient Three-Phase Code Clone Detector Using Modified PDGs[C]//Procee-dings of Asia-pacific Software Engineering Conference.New York:IEEE,2018:100-109.
[14]ROY C K,CORDY J R.NICAD:Accurate Detection of Near-Miss Intentional Clones Using Flexible Pretty-Printing and Code Normalization[C]//Proceedings of IEEE International Confe-rence on Program Comprehension.Washington DC:IEEE Computer Society,2008:172-181.
[15]WHITE M,TUFANO M,VENDOME C,et al.Deep learning code fragments for code clone detection[C]//Proceedings of IEEE/ACM International Conference on Automated Software Engineering.NewYork:IEEE,2016:87-98.
[16]RAGKHITWETSAGUL C,KRINKE J,MARNETTE B.A picture is worth a thousand words:Code clone detection based on image similarity[C]//Proceedings of International Workshop on Software Clones.NewYork:IEEE,2018:44-50.
[17]SÆBJØRNSEN A,WILLCOCK J,PANAS T,et al.Detecting code clones in binary executables[C]//Proceedings of Eighteenth International Symposium on Software Testing and Ana-lysis.New York:ACM,2009:117-128.
[18]DONG M,ZHUANG H,ZHANG R,et al.A New Method of Software Clone Detection Based on Binary Instruction Structure Analysis[C]//Proceedings of International Conference on Wireless Communications.NewYork:IEEE,2013:1-4.
[19]HU Y K,ZHANG Y Y,LI J R,et al.Binary code clone detection across architectures and compiling configurations[C]//Proceedings of International Conference on Program Comprehension.New York:IEEE,2017:88-98.
[20]TIS Committee.Tool Interface Standard (TIS) Executable and Linking Format (ELF) Specification[S/OL].http://refspecs.linuxbase.org/elf/elf.pdf.
[21]MIPS32® Instruction Set Quick Reference [S/OL].https://www.mips.com/?do-download=mips32-instruction-set-quick-reference-v1-01.
[22]MCCREIGHT E M.A Space-Economical Suffix Tree Construction Algorithm[J].Journal of the Acm,1976,23(2):262-272.
[23]UKKONEN E.On-line construction of suffix trees[J].Algorithmica,1995,14(3):249-260.
[24]rswier c4:C in four functions[CP/OL].(2017-08-22) [2018-08-22].https://github.com/rswier/c4.git.
[25]C/C++ Obfuscator[EB/OL].(2018-08-09)[2018-08-09]http://stunnix.com/prod/cxxo/.
[26]UDDIN M S,ROY C K,SCHNEIDER K A.SimCad:An extensible and faster clone detection tool for large scale software systems[C]//Proceedings of International Conference on Program Comprehension.NewYork:IEEE,2013:236-238.
[1] 陈钧吾, 余华山.
面向无尺度图的Δ-stepping算法改进策略
Strategies for Improving Δ-stepping Algorithm on Scale-free Graphs
计算机科学, 2022, 49(6A): 594-600. https://doi.org/10.11896/jsjkx.210400062
[2] 鄂海红, 张田宇, 宋美娜.
基于Web的数据可视化图表渲染优化方法
Web-based Data Visualization Chart Rendering Optimization Method
计算机科学, 2021, 48(3): 119-123. https://doi.org/10.11896/jsjkx.200600038
[3] 张晓, 张思蒙, 石佳, 董聪, 李战怀.
Ceph分布式存储系统性能优化技术研究综述
Review on Performance Optimization of Ceph Distributed Storage System
计算机科学, 2021, 48(2): 1-12. https://doi.org/10.11896/jsjkx.201000149
[4] 乐乔艺, 刘建勋, 孙晓平, 张祥平.
代码克隆检测研究进展综述
Survey of Research Progress of Code Clone Detection
计算机科学, 2021, 48(11A): 509-522. https://doi.org/10.11896/jsjkx.210300310
[5] 徐江峰谭玉龙.
基于机器学习的HBase配置参数优化研究
Research on HBase Configuration Parameter Optimization Based on Machine Learning
计算机科学, 2020, 47(6A): 474-479. https://doi.org/10.11896/JsJkx.190900046
[6] 张丹,罗平.
代码相似性检测方法与工具综述
Survey of Code Similarity Detection Methods and Tools
计算机科学, 2020, 47(3): 5-10. https://doi.org/10.11896/jsjkx.190500148
[7] 张彭奕, 宋杰.
区块链共识算法效能优化研究进展
Research Advance on Efficiency Optimization of Blockchain Consensus Algorithms
计算机科学, 2020, 47(12): 296-303. https://doi.org/10.11896/jsjkx.200700020
[8] 徐传福,王曦,刘舒,陈世钊,林玉.
基于Python的大规模高性能LBM多相流模拟
Large-scale High-performance Lattice Boltzmann Multi-phase Flow Simulations Based on Python
计算机科学, 2020, 47(1): 17-23. https://doi.org/10.11896/jsjkx.190500009
[9] 徐启泽, 韩文廷, 陈俊仕, 安虹.
众核平台上广度优先搜索算法的优化
Optimization of Breadth-first Search Algorithm Based on Many-core Platform
计算机科学, 2019, 46(1): 314-319. https://doi.org/10.11896/j.issn.1002-137X.2019.01.049
[10] 孙涛, 张俊星.
SDN性能优化技术研究综述
Review of SDN Performance Optimization Technology
计算机科学, 2018, 45(11A): 84-91.
[11] 邹凌君,陈崚,戴彩艳.
基于广义后缀树的二分网络社区挖掘算法
Detecting Community from Bipartite Network Based on Generalized Suffix Tree
计算机科学, 2017, 44(7): 221-226. https://doi.org/10.11896/j.issn.1002-137X.2017.07.039
[12] 孙志龙,沙行勉,诸葛晴凤,陈咸彰,吴剀劼.
面向内存文件系统的数据一致性更新机制研究
Research on Data Consistency for In-memory File Systems
计算机科学, 2017, 44(2): 222-227. https://doi.org/10.11896/j.issn.1002-137X.2017.02.036
[13] 倪友聪,李松,叶鹏,杜欣.
基于随机搜索规则的软件体系结构层性能演化优化方法
Random Search Rule Based Performance Evolutionary Optimization Method at Software Architecture Level
计算机科学, 2017, 44(11): 156-163. https://doi.org/10.11896/j.issn.1002-137X.2017.11.023
[14] 赵利伟,陈咸彰,诸葛晴凤.
连接操作在SIMFS和EXT4上的性能比较
Performance Comparison of Join Operations on SIMFS and EXT4
计算机科学, 2016, 43(6): 184-187. https://doi.org/10.11896/j.issn.1002-137X.2016.06.037
[15] 柯叶青,马志柔,伍海江,刘 杰.
一种简历语义搜索系统的实现方法
SmartHR:A Resume Query and Management System Based on Semantic Web
计算机科学, 2015, 42(12): 56-59.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!