Vulnerability Detection: A Machine Learning Approach to Identifying Security Vulnerabilities In Code

Open Access

Robertson, Peyton (Spring 2023)

Permanent URL: https://etd.library.emory.edu/concern/etds/4j03d103r?locale=en%5D
Published

Abstract

Cybercrime is a rapidly growing threat that poses significant risks to individuals, businesses, and governments worldwide. With the proliferation of technology and the increasing sophistication of cybercriminals, there is an enormous need for practical and effective techniques to identify security vulnerabilities hidden in source code. We propose a novel approach to vulnerability detection that combines machine learning with fuzzing techniques in order to identify areas of a program that are more likely to contain possible security exploits. This approach strives to improve on the state of the art for automated vulnerability detection by addressing the challenges that fuzzing faces in selecting and mutating seed inputs, expanding code coverage, and bypassing verification checks. The machine learning component of our study relies on Microsoft’s CodeBERTa, a bimodal pre-trained model for natural language processing that is well suited to vulnerability detection because it can utilize both natural language and source code as inputs. We fine-tune the model on an expanded version of the vulnerability dataset curated by Zhou et al. [54] and evaluate its performance both individually and comparatively. The results of our model include an overall accuracy score of 62.88%, an F1-score of 0.55, a ROC-AUC score of 0.70, and a PR-AUC score of 0.666. These findings suggest that the model can identify vulnerabilities in source code with relative accuracy and that the use of machine learning techniques can enhance the efficacy of vulnerability detection. As such, our CodeBERTa model improves the effectiveness of vulnerability detection techniques and assists software engineers in regaining the advantage in the battle against cybercrime.
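
To make the four reported metrics concrete, the following is a minimal Python sketch of how accuracy, F1, ROC-AUC, and PR-AUC can be computed for a binary vulnerable/non-vulnerable classifier. The labels and scores are made-up toy values, not the thesis data, and the function names are hypothetical. PR-AUC is summarized here as average precision, which is one common convention and may differ from the computation used in the thesis.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels where 1 means 'vulnerable'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    tp, _, _, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at each rank where a positive is retrieved.

    Assumes no tied scores; ties would need a grouping step.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy data (assumed for illustration only): 1 = vulnerable, 0 = non-vulnerable.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

print("accuracy:", accuracy(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("ROC-AUC:", roc_auc(y_true, scores))
print("PR-AUC (average precision):", average_precision(y_true, scores))
```

Accuracy and F1 depend on the chosen decision threshold, while ROC-AUC and PR-AUC are computed from the raw scores and summarize performance across all thresholds.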

Table of Contents

1 Introduction

2 Background

2.1 The Art of Fuzzing

2.1.1 Strategies, Types, and Limitations

2.2 Machine Learning and CodeBERT

2.2.1 Learning Paradigms

2.2.2 BERT-Based Models

2.3 Approach

3 Related Work

3.1 The Status of Machine Learning

3.2 Defect Detection with Machine Learning

3.3 CodeBERT and Vulnerability Identification

4 Materials and Methods

4.1 Data

4.1.1 Pre-Processing

4.2 Model Architecture and Hyperparameters

4.3 Evaluation Metrics

4.3.1 Accuracy

4.3.2 F1-Measure

4.3.3 Confusion Matrix

4.3.4 ROC-AUC Curve

4.3.5 PR-AUC Curve

5 Results

5.1 Semantic Differences between Vulnerable and Non-Vulnerable Code

5.2 Model Evaluation

6 Discussion

6.1 Empirical Analysis

6.2 Comparative Performance

6.3 Limitations and Confounding Variables

6.4 Code Bugs versus Code Flaws

7 Conclusion

8 Bibliography

About this Honors Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English