Notes: De-anonymizing Programmers via Code Stylometry

Paper Information

De-anonymizing Programmers via Code Stylometry
Aylin Caliskan-Islam, Richard Harang, Andrew Liu, Arvind Narayanan, Clare Voss,
Fabian Yamaguchi, and Rachel Greenstadt.
USENIX Security Symposium, 2015

Source code stylometry

  • Everyone learns to code individually and, as a result, writes code in a
    unique style, which makes de-anonymization possible.

  • Software engineering insights

    • a programmer's style changes when implementing more sophisticated
      functionality
    • programmers with different skill sets differ in coding style
  • Identify malicious programmers.

  • Scenario 1: Who wrote this code?
    Alice, an analyst, examines a library containing malicious source code.
    Bob has a collection of source code with known authors.
    Bob searches his collection to identify Alice's adversary, the author of
    the malicious code.

  • Scenario 2: Who wrote this code?
    Alice received an extension on her programming assignment.
    Bob, the teacher, has everyone else's code.
    Bob wants to check whether Alice plagiarized.


Machine learning workflow

[Figure: machine learning workflow]
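As a toy illustration of this workflow (not the paper's actual pipeline), the sketch below hand-crafts a couple of style features and feeds them to a random forest, with scikit-learn standing in for the WEKA setup used in the paper; all feature definitions and sample snippets are made up.

```python
# Toy end-to-end sketch: hand-crafted "style" features -> random forest ->
# predicted author. scikit-learn stands in for WEKA; the features below are
# illustrative, not the paper's CSFS.
from sklearn.ensemble import RandomForestClassifier

def toy_features(code: str) -> list:
    lines = code.splitlines() or [""]
    return [
        code.count("\t") / max(len(code), 1),      # tab density (layout)
        sum(len(l) for l in lines) / len(lines),   # mean line length (layout)
        code.count("for") / max(len(code), 1),     # keyword frequency (lexical)
    ]

# Training corpus with known authors, then classify an unseen snippet.
samples = {
    "alice": ["for(int i=0;i<n;i++){\n\tsum+=i;\n}"],
    "bob":   ["for (int i = 0; i < n; i++)\n    sum += i;"],
}
X = [toy_features(c) for author in samples for c in samples[author]]
y = [author for author in samples for _ in samples[author]]
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([toy_features("for(int j=0;j<m;j++){\n\ttotal+=j;\n}")]))
```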

Abstract Syntax Trees (AST)

[Figure: example abstract syntax tree]

  • Stylometry can be applied to source code to identify the author of a program.
  • Extract layout and lexical features from the source code.
    • Abstract syntax trees (ASTs) represent the structure of the program.
    • Preprocess the source code to obtain its AST.
    • Parse the AST to extract coding-style features (sketched below).
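A minimal sketch of the preprocess-and-parse step, assuming Python's built-in ast module as a stand-in for the fuzzy C++ parser the paper relies on:

```python
# Parse source code into an AST and count node types while walking it.
# Python's ast module stands in for the paper's fuzzy C++ parser.
import ast
from collections import Counter

def ast_node_counts(source: str) -> Counter:
    """Frequency of each AST node type appearing in the source."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

print(ast_node_counts("def f(x):\n    return [i * x for i in range(10)]"))
# -> Counter with entries such as FunctionDef, ListComp, BinOp, Name, ...
```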

Feature Extraction

Code Stylometry Feature Set (CSFS)

Lexical features (extracted from source code)

[Figure: lexical feature list]
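For illustration, a hedged sketch of a few lexical features of this kind (keyword frequencies, comment density, average token length); the keyword list and feature names are assumptions, not the exact CSFS definitions:

```python
# Illustrative lexical features extracted directly from the source text.
import re

KEYWORDS = ("if", "else", "for", "while", "do", "switch")

def lexical_features(code: str) -> dict:
    tokens = re.findall(r"[A-Za-z_]\w*", code)
    n_tokens = max(len(tokens), 1)
    n_lines = max(len(code.splitlines()), 1)
    feats = {f"kw_{k}_freq": tokens.count(k) / n_tokens for k in KEYWORDS}
    feats["comment_density"] = (code.count("//") + code.count("/*")) / n_lines
    feats["avg_token_len"] = sum(map(len, tokens)) / n_tokens
    return feats
```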

Layout features (extracted from source code)

[Figure: layout feature list]
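Likewise, a sketch of layout features capturing indentation, blank-line, and brace-placement habits; the definitions here are illustrative:

```python
# Illustrative layout features: how the author indents, spaces, and places
# braces. Definitions sketch the idea, not the exact CSFS.
def layout_features(code: str) -> dict:
    lines = code.splitlines() or [""]
    n = len(lines)
    return {
        "tab_lead_ratio":    sum(l.startswith("\t") for l in lines) / n,
        "space_lead_ratio":  sum(l.startswith(" ") for l in lines) / n,
        "empty_line_ratio":  sum(not l.strip() for l in lines) / n,
        "brace_on_own_line": sum(l.strip() == "{" for l in lines) / n,
    }
```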

Syntactic features (extracted from ASTs)

[Figure: syntactic feature list]
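And a sketch of syntactic features derived from the AST (maximum nesting depth and parent-to-child node-type bigrams), again with Python's ast module standing in for the paper's C++ tooling:

```python
# Illustrative syntactic features computed from the AST: maximum nesting
# depth and counts of parent->child node-type bigrams.
import ast
from collections import Counter

def syntactic_features(source: str) -> dict:
    tree = ast.parse(source)
    bigrams, max_depth = Counter(), 0

    def walk(node, depth):
        nonlocal max_depth
        max_depth = max(max_depth, depth)
        for child in ast.iter_child_nodes(node):
            bigrams[(type(node).__name__, type(child).__name__)] += 1
            walk(child, depth + 1)

    walk(tree, 0)
    return {"max_ast_depth": max_depth, "node_bigrams": bigrams}
```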

Feature Selection

Features are ranked with WEKA’s information gain criterion, which evaluates the difference between the entropy of the class distribution and the entropy of the conditional distribution of classes given a particular feature:
    IG(A, Mi) = H(A) - H(A | Mi)
where A is the class corresponding to an author, H is Shannon entropy, and Mi is the ith feature of the dataset.

Intuitively, the information gain can be thought of as measuring the amount of information that the observation of the value of feature i gives about the class label associated with the example.

To reduce the total size and sparsity of the feature vector, only the features that individually have non-zero information gain are retained (see the sketch below).
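A small sketch of this selection step, assuming the feature values have already been discretized (WEKA does this internally); the helper names are illustrative:

```python
# Information gain IG(A, Mi) = H(A) - H(A | Mi) per feature, then selection
# of the columns where it is non-zero. Assumes discretized feature values.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_column, labels):
    n = len(labels)
    conditional = 0.0
    for value in set(feature_column):
        subset = [a for f, a in zip(feature_column, labels) if f == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

def nonzero_ig_columns(X, y):
    """Indices of feature columns whose information gain is non-zero."""
    return [j for j in range(len(X[0]))
            if information_gain([row[j] for row in X], y) > 0]
```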

Random Forest Classification

Method

  • Use a random forest as the machine learning classifier
    • resistant to over-fitting
    • a multi-class classifier by nature
  • K-fold cross-validation (sketched below)
  • Validate the method on a different dataset
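A short sketch of the classification step, with scikit-learn standing in for WEKA; the tree count and fold count are illustrative defaults rather than the paper's exact settings:

```python
# Random forest evaluated with stratified k-fold cross-validation;
# scikit-learn stands in for WEKA, and the parameter values are illustrative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate(X, y, n_trees=300, k=9):
    """Mean cross-validated accuracy of a random forest author classifier."""
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    return cross_val_score(clf, X, y, cv=cv).mean()
```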

Future work

  • Multiple authorship detection
  • Multiple author identification
  • Anonymizing source code
    • obfuscation is not the answer
  • Stylometry in executable binaries
    • authorship attribution