Abstract: PP(top x%) is the proportion of papers of a unit (e.g. an institution or a group of researchers) that belong to the x% most frequently cited papers in the corresponding fields and publication years. It has been proposed that x% of papers can be expected to belong to the x% most frequently cited papers. In this Letter to the Editor we present the results of an empirical test of whether this expectation really holds and how strong the deviations from the expected values are when many random samples are drawn from the database.
What problem does this paper attempt to address?
The paper tests whether the empirical values of percentile indicators such as PP(top x%) agree with their theoretical expected values, and how large the deviation from the expected values is when a large number of random samples are drawn from the database. Specifically, the author explores the following aspects through empirical tests:
1. **Verification of expected values**: By definition, x% of papers can theoretically be expected to belong to the x% most-cited papers. The paper verifies whether this holds in practice, that is, whether the value of PP(top x%) computed over a given database is indeed close to x%.
2. **Deviation analysis**: The paper draws random samples of different sizes from the database and analyzes the deviation between the PP(top x%) values of these samples and the value for the database as a whole. The purpose is to understand how sample size affects the deviation and what to watch for when using these indicators in practice.
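The sampling test described above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the author's procedure: the "database" is a synthetic list of 100,000 flags in which 9.903% are marked as top-10% papers, and the helper names `pp_top` and `sample_extremes`, the number of samples (50) and the sample sizes are hypothetical choices.

```python
import random

def pp_top(flags):
    """Percentage of papers in `flags` that are flagged as top-x% papers."""
    return 100.0 * sum(flags) / len(flags)

def sample_extremes(population, sample_size, n_samples, rng):
    """Draw `n_samples` random samples (without replacement) and
    return the smallest and largest PP(top x%) observed."""
    values = [pp_top(rng.sample(population, sample_size))
              for _ in range(n_samples)]
    return min(values), max(values)

# Synthetic stand-in for a bibliometric database (assumption):
# 1 = paper belongs to the top 10% of its field and year, 0 = it does not.
population = [1] * 9_903 + [0] * 90_097

rng = random.Random(0)  # fixed seed so the run is reproducible
print(f"overall PP(top 10%): {pp_top(population):.3f}")
for size in (100, 1_000, 10_000):
    lo, hi = sample_extremes(population, size, 50, rng)
    print(f"n={size:>6}: min={lo:.3f}  max={hi:.3f}")
```

With real data the flags would come from field- and year-normalized citation percentiles rather than from a fixed list; the sketch only reproduces the sampling step.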
### Main findings
- **Impact of sample size**: As the sample size increases, the deviation between the actual value and the expected value gradually decreases. For example, for PP(top 10%), when the sample size is 100, the minimum value is 6.310 and the maximum value is 13.407, while when the sample size increases to 100,000, the minimum value is 9.564 and the maximum value is 10.163, which are closer to the overall value of 9.903.
- **Differences in expected values**: The paper points out that the actual expected values in the database may be different from the expected values defined by the indicators. Therefore, when interpreting the empirical research results based on a specific database, the actual expected values of the database should be used instead of simply relying on the expected values defined by the indicators.
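One way to see why the spread shrinks with sample size is the textbook standard error of a proportion. The sketch below is not taken from the paper: it assumes independent draws (a binomial approximation that ignores the finite size of the database), and `binomial_se` is a hypothetical helper name. It uses the overall PP(top 10%) value of 9.903 reported above.

```python
import math

def binomial_se(pp_percent, sample_size):
    """Approximate standard error (in percentage points) of PP(top x%)
    for a random sample, assuming independent draws (assumption)."""
    p = pp_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / sample_size)

for n in (100, 1_000, 10_000, 100_000):
    print(f"n={n:>7}: standard error ~ {binomial_se(9.903, n):.3f} percentage points")
```

Under this approximation the standard error falls with the square root of the sample size, so a hundredfold larger sample narrows the expected spread by roughly a factor of ten.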
### Conclusions
1. **Actual expected values of the database**: When interpreting the empirical research results based on a specific database, the actual expected values of the database should be used to avoid misunderstanding.
2. **Reliability of large samples**: Although percentile indicators (such as PP(top 50%), PP(top 10%) and PP(top 1%)) are based on complex cross-field calculations, their values can be expected to come close to the expected values when random samples are drawn many times or the sample size is large enough.
Through these findings, the author provides important guidance for the use of indicators in scientific research evaluation, especially when using percentile indicators to evaluate institutions or researchers.