Shell Language Processing: Unix command parsing for Machine Learning

Dmitrijs Trizna
DOI: https://doi.org/10.48550/arXiv.2107.02438
2022-07-07
Abstract: In this article, we present a Shell Language Preprocessing (SLP) library, which implements tokenization and encoding for parsing Unix and Linux shell commands. We explain why a new approach is needed, giving specific examples of cases where conventional Natural Language Processing (NLP) pipelines fail. Furthermore, we evaluate our methodology on a security classification task against widely accepted information and communications technology (ICT) tokenization techniques and achieve a significant improvement, raising the F1 score from 0.392 to 0.874.
Machine Learning, Programming Languages
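
The abstract's claim that conventional NLP tokenization breaks down on shell commands can be illustrated with a minimal sketch. This is not the SLP library's API: it uses Python's standard `shlex` module as a stand-in for a shell-aware tokenizer, and the function names and the sample command are hypothetical, chosen only for illustration.

```python
import re
import shlex


def naive_word_tokens(command):
    # NLP-style word tokenization: keep runs of word characters only.
    # Paths, flags, and shell operators are split apart or dropped.
    return re.findall(r"\w+", command)


def shell_aware_tokens(command):
    # Stand-in for shell-aware tokenization (an assumption, not SLP itself):
    # shlex keeps paths and flags whole and emits operators such as
    # '|' and '>' as separate tokens.
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    return list(lexer)


# Hypothetical sample command, used only to contrast the two tokenizers.
cmd = "cat /etc/passwd | grep -v 'nologin' > /tmp/users.txt"

if __name__ == "__main__":
    print(naive_word_tokens(cmd))
    print(shell_aware_tokens(cmd))
```

The naive tokenizer loses the pipe, the redirection, and the `-v` flag, and breaks `/etc/passwd` into unrelated words, which is the kind of information a security classifier would plausibly need.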