Large-scale Empirical Study on Machine Learning Related Questions on Stack Overflow

WAN Zhi-yuan, TAO Jia-heng, LIANG Jia-kun, CAI Zhen-gong, CHANG Cheng, QIAO Lin, ZHOU Qiao-ni
DOI: https://doi.org/10.3785/j.issn.1008-973x.2019.05.001
2019-01-01
Abstract: To investigate the distribution and trends of machine learning topics, 60,028 machine learning related questions were extracted, using filtered tags, from more than 41.78 million posts on the online Q&A website Stack Overflow. Counting the volume of discussion for each machine learning platform showed that the three most frequently discussed platforms were Scikit-learn, TensorFlow and Keras, which together accounted for 58% of these posts. Latent Dirichlet allocation (LDA) topic models were then trained to explore the discussion topics in more depth. A progressive search approach was proposed for choosing the number of topics in adaptive LDA; it uses the topic coherence coefficient to find the optimal number of topics for the LDA model. Nine discussion topics related to machine learning were discovered, falling into three broad categories: code-related, model-related, and theory-related. In addition, the popularity and difficulty of the topics were analyzed using the view counts and comment counts of the question posts.
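The abstract does not spell out the progressive search, so the following is only a minimal sketch of one plausible reading: a coarse-to-fine scan over candidate topic counts that keeps the count with the highest coherence score. The function name `progressive_topic_search`, its parameters, and the `toy_coherence` stand-in are all hypothetical; in a real study the `score` callable would train an LDA model with `k` topics and return its topic coherence.

```python
from typing import Callable, Dict, Tuple


def progressive_topic_search(
    score: Callable[[int], float],
    low: int,
    high: int,
    coarse_step: int = 5,
) -> Tuple[int, Dict[int, float]]:
    """Coarse-to-fine search for the topic count with the highest score.

    `score(k)` is assumed to be expensive (e.g. train LDA with k topics,
    then compute coherence), so each k is evaluated at most once.
    This is an illustrative strategy, not the paper's exact algorithm.
    """
    if low < 1 or high < low or coarse_step < 1:
        raise ValueError("need 1 <= low <= high and coarse_step >= 1")

    cache: Dict[int, float] = {}

    def evaluate(k: int) -> float:
        # Memoize so that refinement rounds never retrain the same model.
        if k not in cache:
            cache[k] = score(k)
        return cache[k]

    lo, hi, step = low, high, coarse_step
    while True:
        # Scan the current window at the current step, always including hi.
        candidates = list(range(lo, hi + 1, step))
        if candidates[-1] != hi:
            candidates.append(hi)
        best = max(candidates, key=evaluate)
        if step == 1:
            return best, cache
        # Narrow the window around the best candidate and halve the step.
        lo = max(low, best - step)
        hi = min(high, best + step)
        step = max(1, step // 2)


def toy_coherence(k: int) -> float:
    # Hypothetical stand-in for "train LDA with k topics, return coherence":
    # a single-peaked curve whose maximum is at k = 9.
    return -float((k - 9) ** 2)


best_k, scores = progressive_topic_search(toy_coherence, low=2, high=30)
print(best_k, len(scores))
```

A coarse-to-fine scan like this evaluates far fewer models than an exhaustive sweep, but it can miss the global optimum when the coherence curve has several peaks; an exhaustive sweep over a small range is the safer choice when training is cheap.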