Oops, I Sampled it Again: Reinterpreting Confidence Intervals in Few-Shot Learning

Raphael Lafargue, Luke Smith, Franck Vermet, Mathias Löwe, Ian Reid, Vincent Gripon, Jack Valmadre
2024-09-06
Abstract: The predominant method for computing confidence intervals (CI) in few-shot learning (FSL) is based on sampling the tasks with replacement, i.e., allowing the same samples to appear in multiple tasks. This makes the CI misleading in that it takes into account the randomness of the sampler but not the data itself. To quantify the extent of this problem, we conduct a comparative analysis between CIs computed with and without replacement. These reveal a notable underestimation by the predominant method. This observation calls for a reevaluation of how we interpret confidence intervals and the resulting conclusions in FSL comparative studies. Our research demonstrates that the use of paired tests can partially address this issue. Additionally, we explore methods to further reduce the (size of the) CI by strategically sampling tasks of a specific size. We also introduce a new optimized benchmark, which can be accessed at https://github.com/RafLaf/FSL-benchmark-again
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper examines how Confidence Intervals (CIs) are computed in Few-Shot Learning (FSL) and proposes improved methods. Specifically:

1. **Problems with Existing Methods**: The predominant method computes CIs over tasks sampled with replacement, so the CI accounts only for the randomness of the sampler and ignores the randomness of the data itself. This type of CI is referred to as "Closed Confidence Intervals" (CCIs).
2. **Open Confidence Intervals (OCIs)**: Unlike CCIs, OCIs are computed over tasks sampled without replacement, which better reflects the true distribution of the data. The drawback of OCIs is that they limit the number of tasks that can be generated, especially on small datasets, which may result in a wider CI.
3. **Comparative Analysis**: The paper compares CCIs and OCIs experimentally on multiple standard visual datasets. The results show that on small datasets, CCIs are significantly narrower than OCIs, while on large datasets, the opposite is true. Additionally, the paper finds that when accuracy approaches 100%, both types of CI become narrower, because saturated accuracy reduces variance.
4. **Paired Tests**: To make comparisons more reliable, the paper introduces paired tests. Different methods are evaluated on the same set of tasks, which reduces the impact of differences in task difficulty and makes the conclusions more reliable.
5. **Optimizing Task Size**: To further narrow the CI, the paper explores how to adjust the size of tasks. Increasing the number of query samples (Q) narrows the CI to some extent, but it also reduces the number of tasks that can be generated. There is therefore an optimal value of Q that minimizes the CI width.

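The contrast between sampling tasks with and without replacement can be illustrated with a minimal sketch. It is a toy setup, not the paper's benchmark: the per-sample correctness list `correct`, the task size of 15, the helper names, and the normal-approximation 95% interval are all assumptions made for illustration. A real FSL evaluation would also depend on the support set of each task, which this sketch ignores.

```python
import math
import random
import statistics


def ci95_halfwidth(task_accuracies):
    """Half-width of a normal-approximation 95% CI on the mean task accuracy."""
    n = len(task_accuracies)
    return 1.96 * statistics.stdev(task_accuracies) / math.sqrt(n)


def sample_tasks_with_replacement(pool, task_size, n_tasks, rng):
    """Draw every task independently from the full pool.

    Samples are distinct within a task but may recur across tasks,
    so the number of tasks is not limited by the pool size.
    """
    return [rng.sample(pool, task_size) for _ in range(n_tasks)]


def sample_tasks_without_replacement(pool, task_size, rng):
    """Split a shuffled copy of the pool into disjoint tasks.

    Each sample is used at most once, so the pool size caps the task count.
    """
    shuffled = pool[:]
    rng.shuffle(shuffled)
    n_tasks = len(shuffled) // task_size
    return [shuffled[i * task_size:(i + 1) * task_size] for i in range(n_tasks)]


rng = random.Random(0)

# Hypothetical data: correctness (1 = correct) of one fixed model on 300 samples.
correct = [1] * 210 + [0] * 90
pool = list(range(len(correct)))


def accuracy(task):
    """Fraction of correctly classified samples in a task (list of indices)."""
    return sum(correct[i] for i in task) / len(task)


closed_tasks = sample_tasks_with_replacement(pool, 15, 2000, rng)
open_tasks = sample_tasks_without_replacement(pool, 15, rng)

closed_hw = ci95_halfwidth([accuracy(t) for t in closed_tasks])
open_hw = ci95_halfwidth([accuracy(t) for t in open_tasks])

print(f"with replacement:    {len(closed_tasks)} tasks, +/- {closed_hw:.4f}")
print(f"without replacement: {len(open_tasks)} tasks, +/- {open_hw:.4f}")
```

In this toy setting the interval computed with replacement can be made as narrow as desired by drawing more tasks from the same 300 samples, whereas the interval computed without replacement is bounded by the 20 disjoint tasks the pool allows.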
In summary, the paper aims to emphasize the importance of correctly understanding and interpreting CI in FSL and proposes a series of improvements to enhance the accuracy and reliability of method comparisons.
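The effect of pairing when comparing two methods can be sketched on synthetic data. Everything here is an assumption for illustration: the per-task accuracies are generated from a shared "task difficulty" term plus small method-specific noise, the names `acc_a` and `acc_b` are hypothetical, and the intervals use a normal approximation rather than whatever test statistic the paper uses.

```python
import math
import random
import statistics


def mean_ci95(values):
    """Return (mean, half-width) of a normal-approximation 95% CI."""
    n = len(values)
    half = 1.96 * statistics.stdev(values) / math.sqrt(n)
    return statistics.fmean(values), half


rng = random.Random(0)
n_tasks = 200

# Synthetic per-task accuracies: both methods see the same tasks, so they
# share a task-difficulty term; method B is better by 0.008 on average.
difficulty = [rng.gauss(0.0, 0.06) for _ in range(n_tasks)]
acc_a = [0.700 + d + rng.gauss(0.0, 0.01) for d in difficulty]
acc_b = [0.708 + d + rng.gauss(0.0, 0.01) for d in difficulty]

# Unpaired comparison: treat the two sets of accuracies as independent,
# so the uncertainties of the two means add in quadrature.
mean_a, half_a = mean_ci95(acc_a)
mean_b, half_b = mean_ci95(acc_b)
diff_mean = mean_b - mean_a
unpaired_half = math.sqrt(half_a ** 2 + half_b ** 2)

# Paired comparison: take the per-task difference first, which cancels
# the shared difficulty term before the variance is estimated.
paired_mean, paired_half = mean_ci95([b - a for a, b in zip(acc_a, acc_b)])

print(f"unpaired: {diff_mean:+.4f} +/- {unpaired_half:.4f}")
print(f"paired:   {paired_mean:+.4f} +/- {paired_half:.4f}")
```

Both comparisons estimate the same mean difference. Only the interval width changes, because the paired version removes the variance that comes from some tasks being harder than others.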