Learning Representations for Log Data in Cybersecurity

Ignacio Arnaldo, Alfredo Cuesta-Infante, Ankit Arun, Mei Lam, Costas Bassias, Kalyan Veeramachaneni
DOI: https://doi.org/10.1007/978-3-319-60080-2_19
2017-01-01
Abstract: We introduce a framework for exploring and learning representations of log data generated by enterprise-grade security devices, with the goal of detecting advanced persistent threats (APTs) spanning several weeks. The framework uses a divide-and-conquer strategy that combines behavioral analytics, time series modeling, and representation learning algorithms to model large volumes of data. In addition, given that we have access to human-engineered features, we analyze the capability of a series of representation learning algorithms to complement those features in a variety of classification approaches. We demonstrate the approach with a novel dataset extracted from 3 billion log lines generated at an enterprise's network boundaries, with reported command and control communications. The results validate our approach, achieving an area under the ROC curve of 0.943 and 95 true positives among the top 100 ranked instances on the test data set.
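
For readers unfamiliar with the two metrics reported in the abstract, the following is a minimal sketch of how an area under the ROC curve and a count of true positives among the top k ranked instances can be computed from labels and model scores. It is not the authors' evaluation code: the function names `roc_auc` and `true_positives_at_k` and the toy data are illustrative assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: list of 0/1 ground-truth values (1 = malicious, by assumption).
    scores: list of model scores, higher meaning more suspicious.
    """
    pairs = sorted(zip(scores, labels))  # ascending by score
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        # Find the run of tied scores and give each the average 1-based rank.
        j = i
        while j + 1 < len(pairs) and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_pos = sum(label for _, label in pairs)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


def true_positives_at_k(labels, scores, k):
    """Number of positives among the k highest-scored instances."""
    order = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)
    return sum(labels[idx] for idx in order[:k])


# Toy data (hypothetical, not from the paper).
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_top3 = true_positives_at_k(toy_labels, toy_scores, 3)
```

Under this reading, the reported result corresponds to `true_positives_at_k(labels, scores, 100)` returning 95 on the test set. The top-k count is the metric that matters most when an analyst can only review a fixed number of alerts, while the AUC summarizes ranking quality over the whole test set.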