Modifying boosted trees to improve performance on task 1 of the 2006 KDD challenge cup

Robert M. Bell, Patrick G. Haffner, Chris Volinsky
DOI: https://doi.org/10.1145/1233321.1233327
2006-12-01
ACM SIGKDD Explorations Newsletter
Abstract: Task 1 of the 2006 KDD Challenge Cup required classification of pulmonary embolisms (PEs) using variables derived from computed tomography angiography. We present our approach to the challenge and justification for our choices. We used boosted trees to perform the main classification task, but modified the algorithm to address idiosyncrasies of the scoring criteria. The two main modifications were: 1) changing the dependent variable in the training set to account for multiple PEs per patient, and 2) incorporating neighborhood information through augmentation of the set of predictor variables. Both of these resulted in measurable predictive improvement. In addition, we discuss a statistically based method for setting the classification threshold.
What problem does this paper attempt to address?
This paper addresses the classification problem in Task 1 of the 2006 KDD Challenge Cup: identifying pulmonary embolisms (PEs) from variables extracted from computed tomography angiography (CTA). The core difficulty is that the scoring criteria are non-standard. They differ from criteria based on (weighted) classification error in two main ways:

1. **PE Sensitivity Criterion**: The criterion gives no extra credit for multiple positive candidates from the same PE. Simply selecting the candidates with the highest predicted probabilities can therefore waste the false positive (FP) budget: additional candidates from an already-detected PE, even when correctly identified, do not improve the PE sensitivity score.
2. **Hard Limit on the Number of False Positives**: Submissions that exceed the specified limits of 2, 4, and 10 false positives per patient are disqualified. Staying within a limit requires an unbiased estimate of the false positive rate and of its uncertainty.

To address these issues, the authors propose two modifications:

- **Modify the Dependent Variable of the Training Set**: To handle patients with multiple PEs, the dependent variable for positive candidates is changed from +1 to +1/(PE degree), so that each PE carries the same weight. This improves sensitivity for candidates belonging to low-degree PEs.
- **Enhance the Set of Predictor Variables**: Neighborhood information is added to the set of predictor variables. At selected boosting steps, five new variables are added that reflect the relationship between a candidate and its neighboring area and help distinguish low-degree from high-degree PEs.

In addition, the paper discusses a statistically based method for setting the classification threshold, with the aim of maximizing PE sensitivity under the given false positive limit.
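The +1/(PE degree) relabeling described above can be illustrated with a minimal sketch. It assumes that "PE degree" means the number of positive candidates associated with the same PE, that negatives are labeled -1 and left unchanged, and that each PE has a unique identifier. The function name `reweight_targets` and the `(pe_id, label)` input layout are hypothetical and not taken from the paper.

```python
from collections import Counter


def reweight_targets(candidates):
    """Return training targets with positives scaled by 1 / (PE degree).

    candidates: list of (pe_id, label) pairs, where label is +1 or -1 and
    pe_id is None for negative candidates (hypothetical layout).
    Assumption: PE degree = number of positive candidates sharing a pe_id.
    """
    # Count how many positive candidates belong to each PE.
    degree = Counter(pe_id for pe_id, label in candidates if label == 1)

    targets = []
    for pe_id, label in candidates:
        if label == 1:
            # Every PE now contributes a total positive weight of 1.
            targets.append(1.0 / degree[pe_id])
        else:
            # Negatives keep their original label.
            targets.append(float(label))
    return targets


# A PE with four candidates and a PE with one candidate.
example = [("a", 1), ("a", 1), ("a", 1), ("a", 1), ("b", 1), (None, -1)]
targets = reweight_targets(example)
```

In this example the four candidates of PE "a" each receive 0.25 while the single candidate of PE "b" keeps 1.0, so both PEs carry the same total weight in training.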
Both modifications yielded measurable predictive improvement on Task 1.
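To make the two scoring criteria concrete, the sketch below computes PE sensitivity (a PE counts once, however many of its candidates are flagged) and the false positive rate per patient, then picks the lowest threshold that stays under a limit. It is an illustration only, not the authors' statistically based method: the `margin` parameter is a hypothetical stand-in for an allowance for uncertainty, the false positive rate is assumed to be averaged over patients, and the `(patient_id, pe_id, score)` layout and function names are invented for this example.

```python
def score(candidates, threshold):
    """Return (pe_sensitivity, fp_per_patient) at a given threshold.

    candidates: list of (patient_id, pe_id, score) tuples, where pe_id is
    None for candidates that are not part of any PE (hypothetical layout).
    """
    patients = {p for p, _, _ in candidates}
    all_pes = {(p, pe) for p, pe, _ in candidates if pe is not None}
    found = set()
    false_positives = 0
    for p, pe, s in candidates:
        if s >= threshold:
            if pe is None:
                false_positives += 1
            else:
                # A set ignores repeated hits on the same PE: no extra credit.
                found.add((p, pe))
    sensitivity = len(found) / len(all_pes) if all_pes else 0.0
    # Assumption: the limit applies to the average over patients.
    return sensitivity, false_positives / len(patients)


def pick_threshold(candidates, fp_limit, margin=0.0):
    """Lowest threshold whose FP rate is at most fp_limit - margin, or None."""
    best = None
    # Lowering the threshold can only add false positives, so stop at the
    # first threshold that violates the limit.
    for t in sorted({s for _, _, s in candidates}, reverse=True):
        sensitivity, fp_rate = score(candidates, t)
        if fp_rate > fp_limit - margin:
            break
        best = (t, sensitivity, fp_rate)
    return best


toy = [
    ("p1", "a", 0.9), ("p1", "a", 0.8), ("p1", None, 0.7), ("p1", "b", 0.4),
    ("p2", None, 0.6), ("p2", None, 0.5), ("p2", "c", 0.3),
]
chosen = pick_threshold(toy, fp_limit=1.0)
cautious = pick_threshold(toy, fp_limit=1.0, margin=0.5)
```

On this toy data, the two flagged candidates of PE "a" count as a single detection, which is why spending the false positive budget on redundant candidates does not raise sensitivity. A positive `margin` yields a more conservative threshold.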