Vulnerability Detection: A Machine Learning Approach to Identifying Security Vulnerabilities In Code Open Access

Robertson, Peyton (Spring 2023)

Permanent URL: https://etd.library.emory.edu/concern/etds/4j03d103r?locale=en%5D

Published

Abstract

Cybercrime is a rapidly growing threat that poses significant risks to individuals, businesses, and governments worldwide. With the proliferation of technology and the increasing sophistication of cybercriminals, there is a pressing need for practical and effective techniques to identify security vulnerabilities hidden in source code. We propose a novel approach to vulnerability detection that combines machine learning with fuzzing techniques to identify areas of a program that are more likely to contain security exploits. This approach strives to improve on the state of the art for automated vulnerability detection by addressing the challenges that fuzzing faces in selecting and mutating seed inputs, expanding code coverage, and bypassing verification checks. The machine learning component of our study relies on Microsoft’s CodeBERTa, a bimodal pre-trained model for natural language processing that is well suited to vulnerability detection because it accepts both natural language and source code as inputs. We fine-tune the model on an expanded version of the vulnerability dataset curated by Zhou et al. [54] and evaluate its performance both individually and comparatively. Our model achieves an overall accuracy of 62.88%, an F1-score of 0.55, a ROC-AUC score of 0.70, and a PR-AUC score of 0.666. These findings suggest that the model can identify vulnerabilities in source code with moderate accuracy and that machine learning techniques can enhance the efficacy of vulnerability detection. As such, our CodeBERTa model improves the effectiveness of vulnerability detection techniques and helps software engineers regain the advantage in the battle against cybercrime.
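
The four evaluation scores named in the abstract (accuracy, F1, ROC-AUC, PR-AUC) can be illustrated with a minimal sketch in plain Python. This is not the thesis's evaluation code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration, and PR-AUC is approximated here by average precision, which is one common estimate among several.

```python
from typing import List, Tuple


def accuracy_and_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Accuracy and F1 for binary labels (1 = vulnerable, 0 = not vulnerable)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def roc_auc(y_true: List[int], scores: List[float]) -> float:
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pr_auc(y_true: List[int], scores: List[float]) -> float:
    """Average precision: mean precision at each true positive, ranked by score.

    Assumes no tied scores; ties would need to be grouped before ranking.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical toy data: labels and model scores for six code samples.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} f1={f1:.3f}")
print(f"roc_auc={roc_auc(y_true, scores):.3f} pr_auc={pr_auc(y_true, scores):.3f}")
```

Accuracy and F1 depend on the chosen threshold, while ROC-AUC and PR-AUC are computed from the ranking of scores alone, which is why a model can report a modest accuracy alongside a higher ROC-AUC.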

Contents

1 Introduction

2 Background

2.1 The Art of Fuzzing

2.1.1 Strategies, Types, and Limitations

2.2 Machine Learning and CodeBERT

2.2.1 Learning Paradigms

2.2.2 BERT-Based Models

2.3 Approach

3 Related Work

3.1 The Status of Machine Learning

3.2 Defect Detection with Machine Learning

3.3 CodeBERT and Vulnerability Identification

4 Materials and Methods

4.1 Data

4.1.1 Pre-Processing

4.2 Model Architecture and Hyperparameters

4.3 Evaluation Metrics

4.3.1 Accuracy

4.3.2 F1-Measure

4.3.3 Confusion Matrix

4.3.4 ROC-AUC Curve

4.3.5 PR-AUC Curve

5 Results

5.1 Semantic Differences between Vulnerable and Non-Vulnerable Code

5.2 Model Evaluation

6 Discussion

6.1 Empirical Analysis

6.2 Comparative Performance

6.3 Limitations and Confounding Variables

6.4 Code Bugs versus Code Flaws

7 Conclusion

8 Bibliography

About this Honors Thesis

Rights statement

Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.

School	Emory College
Department	Computer Science
Degree	B.A.
Submission	Honors Thesis
Language	English
Research Field	Computer Science
Keyword	Cybersecurity; Fuzzing
Committee Chair / Thesis Advisor	Ymir Vigfusson, Emory University
Committee Members	Emily Wall, Emory University; Li Xiong, Emory University; Davide Fossati, Emory University

Primary PDF

Title	Date Uploaded
Vulnerability Detection: A Machine Learning Approach to Identifying Security Vulnerabilities In Code	2023-04-09 22:17:16 -0400
