Information inequality problem over set functions

Miika Hannula
2023-09-21
Abstract: Information inequalities appear in many database applications such as query output size bounds, query containment, and implication between data dependencies. Recently Khamis et al. proposed to study the algorithmic aspects of information inequalities, including the information inequality problem: decide whether a linear inequality over entropies of random variables is valid. While the decidability of this problem is a major open question, applications often involve only inequalities that adhere to specific syntactic forms linked to useful semantic invariance properties. This paper studies the information inequality problem in different syntactic and semantic scenarios that arise from database applications. Focusing on the boundary between tractability and intractability, we show that the information inequality problem is coNP-complete if restricted to normal polymatroids, and in polynomial time if relaxed to monotone functions. We also examine syntactic restrictions related to query output size bounds, and provide an alternative proof, through monotone functions, for the polynomial-time computability of the entropic bound over simple sets of degree constraints.
Databases, Information Theory
What problem does this paper attempt to address?
The paper addresses the Information Inequality Problem (IIP): deciding whether a given information inequality is valid for all entropy functions. Specifically, the author studies the algorithmic properties of this problem in different syntactic and semantic scenarios, especially those arising from database applications. The focus is on the boundary between tractable and intractable cases, in particular on how the problem behaves over normal polymatroids and monotone functions.

### Main Contributions

1. **Complexity and Solvability**:
   - The paper proves that, when restricted to normal polymatroids, the Information Inequality Problem is coNP-complete.
   - When relaxed to monotone functions or restricted to modular functions, the Information Inequality Problem can be solved in polynomial time.
2. **Syntactic and Semantic Restrictions**:
   - The author studies the syntactic restrictions related to query output size bounds and provides an alternative proof, through monotone functions, of polynomial-time computability.
   - The paper identifies factors that make the Information Inequality Problem easy or difficult, including the influence of coefficients and the expressive power of information measures.
3. **Application Background**:
   - Information inequalities have many applications in database theory, such as bounds on query output size, query containment, and implication between data dependencies.
   - In particular, information inequalities play an important role in tight bounds on query output size and in worst-case optimal join algorithms.

### Specific Results

- **Complexity Results**:
  - Proves that the Information Inequality Problem on normal polymatroids is coNP-complete.
  - By reduction from the monotone satisfiability problem, proves that the Information Inequality Problem involving three-variable mutual information and conditional entropy is coNP-complete on step functions.
- **Solvability Results**:
  - Proves that the Information Inequality Problem on monotone functions can be solved in polynomial time.
  - For certain specific syntactic forms (such as cyclic or simple conditional sets), the Information Inequality Problem is also solvable in polynomial time.

### Formula Presentation

- **Definition of Entropy**:
  \[ H(X) := -\sum_{x\in D} p(x)\log p(x) \]
  where \(D = \text{Dom}(X)\) is the finite domain of the random variable \(X\), \(p: D\rightarrow[0, 1]\) is the probability distribution, and \(\sum_{x\in D}p(x) = 1\).
- **Multivariate Mutual Information**:
  \[ I_h(S) = \sum_{T\subseteq S}(-1)^{|T|-1}h(T) \]
  where \(S\) is a set of random variables and \(h\) is a general set function.
- **Three-Variable Mutual Information**:
  \[ I_h(ABC) = h(A)+h(B)+h(C)-h(AB)-h(AC)-h(BC)+h(ABC) \]
- **Definition of Information Inequality**:
  \[ c_1h(X_1)+\cdots+c_kh(X_k)\geq 0 \]
  where \(c_i\in\mathbb{R}\) and each \(X_i\) is a subset of the variable set \(\{X_j\}_{j = 1}^n\).

### Conclusion

By analyzing the Information Inequality Problem in different syntactic and semantic scenarios, this paper maps out where the problem is tractable and where it becomes intractable. These results help clarify which information inequalities arising in database theory can be checked efficiently.
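
To make the definitions under Formula Presentation concrete, here is a minimal Python sketch. It is not the paper's algorithm. All function names are illustrative. It assumes the usual definition of a step function \(s_U\) (0 on subsets of \(U\), 1 elsewhere) and the standard fact that normal polymatroids are the non-negative combinations of step functions, so a linear inequality holds on all of them exactly when it holds on each step function. The check is brute force and exponential in the number of variables.

```python
from itertools import combinations
from math import log2


def subsets(items):
    """Yield every subset of `items` as a frozenset, smallest first."""
    items = tuple(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def entropy_function(joint, variables):
    """Map each subset of `variables` to the entropy (in bits) of its marginal.

    `joint` maps outcome tuples (one value per variable, in order) to probabilities.
    """
    h = {}
    for s in subsets(variables):
        positions = [i for i, v in enumerate(variables) if v in s]
        marginal = {}
        for outcome, p in joint.items():
            key = tuple(outcome[i] for i in positions)
            marginal[key] = marginal.get(key, 0.0) + p
        h[s] = -sum(p * log2(p) for p in marginal.values() if p > 0)
    return h


def mutual_information(h, s):
    """I_h(S): alternating sum of h over the nonempty subsets of S."""
    return sum((-1) ** (len(t) - 1) * h[t] for t in subsets(s) if t)


def step_function(variables, u):
    """Step function s_U (assumed definition): 0 on subsets of U, 1 otherwise."""
    u = frozenset(u)
    return {s: 0.0 if s <= u else 1.0 for s in subsets(variables)}


def evaluate(inequality, h):
    """Left-hand side c_1 h(X_1) + ... + c_k h(X_k) for a list of (c_i, X_i) pairs."""
    return sum(c * h[frozenset(x)] for c, x in inequality)


def find_step_counterexample(inequality, variables):
    """Return a proper subset U whose step function violates the inequality, else None."""
    full = frozenset(variables)
    for u in subsets(variables):
        if u != full and evaluate(inequality, step_function(variables, u)) < 0:
            return u
    return None


variables = ("A", "B", "C")

# A and B are independent fair bits, C = A XOR B.
xor_joint = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
h_xor = entropy_function(xor_joint, variables)
i_abc_xor = mutual_information(h_xor, variables)  # 3 - 6 + 2 = -1 bit

# I_h(ABC) >= 0 written as a list of (coefficient, variable set) pairs.
i_abc = [(1, "A"), (1, "B"), (1, "C"),
         (-1, "AB"), (-1, "AC"), (-1, "BC"), (1, "ABC")]
holds_on_steps = find_step_counterexample(i_abc, variables) is None

# h(A) - h(AB) >= 0 fails on the step function s_U with U = {A}.
violating_u = find_step_counterexample([(1, "A"), (-1, "AB")], variables)
```

In this example \(I_h(ABC) \geq 0\) holds on every step function over three variables, yet the XOR distribution gives \(I_h(ABC) = -1\), so the same inequality is not valid for all entropy functions. A violating \(U\) also serves as a short, efficiently checkable certificate that an inequality fails on step functions, which is why this restricted problem belongs to coNP.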