Predicting likelihood of legitimate data loss in email DLP

Faiz, Mohamed Falah; Arshad, Junaid; Alazab, Mamoun; Shalaginov, Andrii


Faiz, Mohamed Falah, Arshad, Junaid ORCID: https://orcid.org/0000-0003-0424-9498, Alazab, Mamoun and Shalaginov, Andrii (2020) Predicting likelihood of legitimate data loss in email DLP. Future Generation Computer Systems, 110. pp. 744-757. ISSN 0167-739X


PDF
Predicting_the_Likelihood_of_Legitimate_Data_Loss_in_Email_DLP-accepted_Repo.pdf - Accepted Version
Available under License Creative Commons Attribution Non-commercial No Derivatives.

Official URL: https://doi.org/10.1016/j.future.2019.11.004

Abstract

The volume and variety of data collected for modern organisations have increased significantly over the last decade, necessitating the detection and prevention of disclosure of sensitive data. Data loss prevention is an embedded process used to protect against disclosure of sensitive data to external uncontrolled environments. A typical Data Loss Prevention (DLP) system uses custom policies to identify and prevent accidental and malicious data leakage, producing a large number of security alerts, including a significant volume of false positives. Consequently, identifying legitimate data loss can be very challenging, as each incident comprises different characteristics, often requiring extensive intervention by a domain expert to review alerts individually. This limits the ability to detect data loss alerts in real time, making organisations vulnerable to financial and reputational damage. The aim of this research is to strengthen the data loss detection capabilities of a DLP system by implementing a machine learning model to predict the likelihood of legitimate data loss. We conducted extensive experimentation using Decision Tree and Random Forest algorithms with historical email incident data collected by a globally established telecommunication enterprise. The final model, produced with the Random Forest algorithm, was identified as the most effective, as it accurately predicted approximately 95% of data loss incidents with an average true positive value of 90%. Furthermore, the proposed solution enables identification of legitimate data loss in email DLP whilst facilitating prioritisation of real data loss through a human-understandable explanation of the decision, thereby improving the efficiency of the process.
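The two ideas in the abstract, ranking alerts by predicted likelihood and scoring the predictions, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Alert` record, the `triage` helper, the 0.5 threshold, and the six sample alerts are hypothetical and are not taken from the paper or its dataset. It also assumes that the abstract's "true positive value" can be read as the true positive rate.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    alert_id: str        # hypothetical identifier
    likelihood: float    # predicted probability of legitimate data loss
    is_real_loss: bool   # analyst-confirmed outcome


def triage(alerts, threshold=0.5):
    """Rank alerts by predicted likelihood and score the predictions."""
    # Highest-likelihood alerts come first so analysts review them first.
    ranked = sorted(alerts, key=lambda a: a.likelihood, reverse=True)
    tp = sum(a.likelihood >= threshold and a.is_real_loss for a in alerts)
    tn = sum(a.likelihood < threshold and not a.is_real_loss for a in alerts)
    fn = sum(a.likelihood < threshold and a.is_real_loss for a in alerts)
    accuracy = (tp + tn) / len(alerts)
    true_positive_rate = tp / (tp + fn) if (tp + fn) else 0.0
    return ranked, accuracy, true_positive_rate


# Made-up alerts; the likelihoods stand in for a classifier's output.
alerts = [
    Alert("A1", 0.92, True),
    Alert("A2", 0.10, False),
    Alert("A3", 0.65, True),
    Alert("A4", 0.40, True),   # real loss scored below the threshold
    Alert("A5", 0.55, False),  # false alarm scored above the threshold
    Alert("A6", 0.05, False),
]

ranked, accuracy, tpr = triage(alerts)
print([a.alert_id for a in ranked])
print(f"accuracy={accuracy:.2f} true_positive_rate={tpr:.2f}")
```

In a real deployment the likelihoods would come from a trained classifier such as a Random Forest rather than from hand-written values, and the threshold would be tuned against the cost of missed incidents.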

Item Type:	Article
Identifier:	10.1016/j.future.2019.11.004
Keywords:	Data Loss Prevention, Email DLP, Insider Threats, Threat Prediction, Machine Learning
Subjects:	Computing > Information security > Cyber security; Computing > Information security; Computing
Date Deposited:	05 Nov 2019
URI:	https://repository.uwl.ac.uk/id/eprint/6510