Software Libraries and Their Reuse: Entropy, Kolmogorov Complexity, and Zipf's Law

Todd L. Veldhuizen
DOI: https://doi.org/10.48550/arXiv.cs/0508023
2005-10-03
Abstract: We analyze software reuse from the perspective of information theory and Kolmogorov complexity, assessing our ability to "compress" programs by expressing them in terms of software components reused from libraries. A common theme in the software reuse literature is that if we can only get the right environment in place (the right tools, the right generalizations, economic incentives, a "culture of reuse"), then reuse of software will soar, with consequent improvements in productivity and software quality. The analysis developed in this paper paints a different picture: the extent to which software reuse can occur is an intrinsic property of a problem domain, and better tools and culture can have only marginal impact on reuse rates if the domain is inherently resistant to reuse. We define an entropy parameter $H \in [0,1]$ of problem domains that measures program diversity, and deduce from this upper bounds on code reuse and the scale of components with which we may work. For "low entropy" domains with $H$ near 0, programs are highly similar to one another and the domain is amenable to the Component-Based Software Engineering (CBSE) dream of programming by composing large-scale components. For problem domains with $H$ near 1, programs require substantial quantities of new code, with only a modest proportion of an application comprised of reused, small-scale components. Preliminary empirical results from Unix platforms support some of the predictions of our model.
Software Engineering, Information Theory, Programming Languages
What problem does this paper attempt to address?
This paper asks what the potential and the limits of software reuse are in different problem domains. The author uses information theory and Kolmogorov complexity to analyze how far programs can be "compressed" by expressing them in terms of library components. The core question is whether the achievable degree of reuse is an inherent property of the problem domain, and whether better tools and culture can significantly improve the reuse rate.

### Main contributions of the paper

1. **Introduction of entropy parameter \( H \)**:
   - The author defines an entropy parameter \( H \in [0, 1] \) that measures program diversity in a problem domain.
   - When \( H \) is close to 0, programs are highly similar and the domain is suited to reuse of large-scale components.
   - When \( H \) is close to 1, programs are very diverse and the reuse potential is limited.
2. **Theoretical model**:
   - Models software reuse using information theory and Kolmogorov complexity.
   - Analyzes the distribution of programs in different problem domains.
   - Derives an upper bound on the reuse rate and explains how Zipf's law arises in software reuse.
3. **Experimental verification**:
   - Analyzes shared object data from Unix platforms as a preliminary check of the model's predictions.
   - Shows a good fit between the observed data and Zipf's law.

### Main conclusions

- The potential for software reuse is an inherent property of the problem domain, determined by the entropy parameter \( H \).
- Better tools and culture can have only a marginal impact on the reuse rate if the problem domain itself is not suited to reuse.
- In low-entropy domains, productivity can be improved significantly through reuse of large-scale components; in high-entropy domains, the reuse potential is limited and a large amount of new code still has to be written.
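
The Zipf's-law check mentioned under experimental verification can be illustrated with a small sketch. It fits a straight line to log(count) against log(rank); a slope near -1 is what a reuse rate proportional to 1/n would produce. The function name `zipf_slope` and the synthetic counts are assumptions made for illustration only; they are not the paper's data or its fitting procedure.

```python
import math


def zipf_slope(counts):
    """Least-squares slope of log(count) versus log(rank).

    Ranks start at 1 and are assigned after sorting counts in
    descending order. Zero counts are skipped because log(0) is undefined.
    """
    ranked = sorted(counts, reverse=True)
    points = [
        (math.log(rank), math.log(count))
        for rank, count in enumerate(ranked, start=1)
        if count > 0
    ]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic, hypothetical data: the n-th most used component is
# referenced c/n times, with c = 1000 chosen arbitrarily.
synthetic_counts = [1000.0 / n for n in range(1, 201)]
print(f"fitted log-log slope: {zipf_slope(synthetic_counts):.3f}")
```

With real usage counts (for example, how many programs reference each library routine), a fitted slope far from -1 would suggest the data does not follow this simple form.
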
### Formula summary

- Definition of the entropy parameter \( H \):
  \[ H = \limsup_{s_0 \rightarrow \infty} \left( \frac{1}{\vert A_{\leq s_0} \vert} H(p_{s_0}) \right) \]
  where
  \[ H(p_{s_0}) = -\sum_{w : \|w\| \leq s_0} p_{s_0}(w) \log_2 p_{s_0}(w) \]
  is the entropy of the distribution \( p_{s_0} \).
- Upper bound on the reuse rate:
  \[ \text{Expected reuse proportion} \leq 1 - H \]
- Application of Zipf's law:
  \[ \lambda(n) \sim \frac{c}{n} \]
  where \( \lambda(n) \) is the expected reuse rate of the \( n \)th library component and \( c \) is a constant.

Through these theoretical and empirical analyses, the paper shows that the limits of software reuse are intrinsic to the problem domain and offers a new perspective for understanding reuse potential.
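
As a rough illustration of how the entropy sum and the \( 1 - H \) bound behave, the sketch below computes Shannon entropy in bits for a toy distribution over a handful of "programs". The normalization used here (dividing by \( \log_2 \) of the number of outcomes so that the result falls in \( [0, 1] \)) is an assumption chosen for the toy example; it is not the normalization in the definition of \( H \) above. All function names are hypothetical.

```python
import math


def shannon_entropy_bits(probs):
    """Shannon entropy -sum p*log2(p) in bits; zero-probability terms contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def normalized_entropy(probs):
    """Toy stand-in for H: entropy divided by log2(number of outcomes).

    This normalization is an assumption for illustration, not the
    paper's definition.
    """
    n = len(probs)
    if n < 2:
        return 0.0
    return shannon_entropy_bits(probs) / math.log2(n)


def reuse_upper_bound(h):
    """Upper bound on the expected reuse proportion, 1 - H."""
    return 1.0 - h


# Hypothetical toy domains over 8 possible programs.
uniform = [1 / 8] * 8             # every program equally likely
skewed = [0.93] + [0.01] * 7      # one program dominates

for name, dist in [("uniform", uniform), ("skewed", skewed)]:
    h = normalized_entropy(dist)
    print(f"{name}: H = {h:.3f}, reuse bound = {reuse_upper_bound(h):.3f}")
```

The uniform distribution is the most diverse case in this toy setting, so it leaves the least room for reuse; the skewed distribution has low entropy and a correspondingly higher bound.
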