Tsung-Shan Yang, Yun-Cheng Wang, Chengwei Wei, Suya You, C.-C. Jay Kuo
Abstract: Human-Object Interaction (HOI) detection is a fundamental task in image understanding. While deep-learning-based HOI methods provide high performance in terms of mean Average Precision (mAP), they are computationally expensive and opaque in training and inference processes. An Efficient HOI (EHOI) detector is proposed in this work to strike a good balance between detection performance, inference complexity, and mathematical transparency. EHOI is a two-stage method. In the first stage, it leverages a frozen object detector to localize the objects and extract various features as intermediate outputs. In the second stage, the first-stage outputs predict the interaction type using the XGBoost classifier. Our contributions include the application of error correction codes (ECCs) to encode rare interaction cases, which reduces the model size and the complexity of the XGBoost classifier in the second stage. Additionally, we provide a mathematical formulation of the relabeling and decision-making process. Apart from the architecture, we present qualitative results to explain the functionalities of the feedforward modules. Experimental results demonstrate the advantages of ECC-coded interaction labels and the excellent balance of detection performance and complexity of the proposed EHOI method.
### What problems does this paper attempt to solve?
This paper addresses several key problems in **Human-Object Interaction (HOI)** detection: computational efficiency, data imbalance, and model transparency. Specifically:
1. **Computational Efficiency and Complexity**:
- Although existing deep-learning methods achieve high mean Average Precision (mAP), they are computationally expensive and opaque in both training and inference, which limits their deployment on edge devices.
- The paper proposes an efficient detector, **Efficient Human-Object Interaction (EHOI)**, that aims to balance detection performance, inference complexity, and mathematical transparency.
2. **Data Imbalance Problem**:
- The distribution of interaction pairs in HOI datasets is highly imbalanced, so models tend to overfit to common interaction types and neglect rare ones.
- To address this, the paper introduces a **hybrid encoding scheme** that divides interaction pairs into common and rare types. Common interactions use conventional one-hot encoding; rare interactions use binary encoding combined with Error Correction Codes (ECCs), which reduces the model size and the complexity of the XGBoost classifier.
3. **Model Transparency**:
- Deep-learning models are usually black boxes whose decision-making is difficult to explain. EHOI improves interpretability through modular design and statistics-based learning.
- In particular, the paper proposes a conditional decision-making framework that expresses the final prediction as an aggregation of the probabilities of the individual interaction bits, integrated using Linear Discriminant Analysis (LDA).
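The hybrid encoding in item 2 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the frequency threshold `min_count`, the function names, the example class names, and the choice of a single even-parity bit as the error correction code are all assumptions made for the example.

```python
from collections import Counter


def parity_codeword(index, n_bits):
    """Binary code for `index` plus one even-parity bit (an assumed ECC)."""
    bits = [(index >> shift) & 1 for shift in reversed(range(n_bits))]
    bits.append(sum(bits) % 2)  # parity bit makes every codeword even-weight
    return tuple(bits)


def build_hybrid_labels(labels, min_count):
    """Split classes by frequency: one-hot for common, coded bits for rare."""
    counts = Counter(labels)
    common = sorted(c for c, n in counts.items() if n >= min_count)
    rare = sorted(c for c, n in counts.items() if n < min_count)

    one_hot = {
        c: tuple(int(i == j) for j in range(len(common)))
        for i, c in enumerate(common)
    }
    # Enough bits to index every rare class (at least one bit).
    n_bits = max(1, (len(rare) - 1).bit_length())
    coded = {c: parity_codeword(i, n_bits) for i, c in enumerate(rare)}
    return one_hot, coded


# Hypothetical class names and counts, for illustration only.
samples = (["ride bike"] * 5 + ["hold cup"] * 4
           + ["feed zebra", "hug cow", "kiss dog", "wash cat"])
one_hot, coded = build_hybrid_labels(samples, min_count=3)
print(one_hot)  # {'hold cup': (1, 0), 'ride bike': (0, 1)}
print(coded["kiss dog"])  # (1, 0, 1)
```

In this sketch, six classes need five binary targets (two one-hot, three coded) instead of six, and the saving grows with the number of rare classes because the coded part scales with the logarithm of that number.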
### Main Contributions
- **Efficient Two-Stage Architecture**: EHOI adopts a two-stage approach. In the first stage, a frozen pre-trained object detector localizes objects and extracts features; in the second stage, an XGBoost classifier predicts the interaction type.
- **Hybrid Encoding Scheme**: Encoding rare interaction types with Error Correction Codes (ECCs) reduces model complexity and improves performance.
- **Transparent Decision-Making Process**: All modules can be interpreted as probability estimators, and the entire learning process is formulated as an aggregation of conditional probabilities.
- **Experimental Verification**: The experimental results show that EHOI substantially reduces computational complexity while maintaining high detection performance; in particular, its FLOP count is far lower than that of other state-of-the-art (SOTA) models.
### Formula Summary
- **Conditional Probability Formula**:
\[
P(\text{relation}|\text{human},\text{object})=\alpha P(\text{relation}|\text{human})+\beta P(\text{relation}|\text{object})
\]
where $\alpha$ and $\beta$ are learnable parameters.
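
As a small numeric illustration of this weighted combination, the sketch below mixes two per-relation probability vectors. The function name, the example probabilities, and the fixed values of `alpha` and `beta` are made up for the example; in the paper these weights are learned.

```python
def combine_relation_probs(p_given_human, p_given_object, alpha, beta):
    """alpha * P(relation | human) + beta * P(relation | object), per relation."""
    if len(p_given_human) != len(p_given_object):
        raise ValueError("both inputs must score the same set of relations")
    return [alpha * ph + beta * po
            for ph, po in zip(p_given_human, p_given_object)]


# Hypothetical scores for three relation types.
p_h = [0.6, 0.3, 0.1]
p_o = [0.2, 0.7, 0.1]
scores = combine_relation_probs(p_h, p_o, alpha=0.5, beta=0.5)
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # 1
```

When both inputs are probability distributions and the two weights are non-negative and sum to 1, the combined scores also form a distribution; the formula itself does not require that constraint.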
- **Rare Interaction Encoding**:
\[
\text{Hamming codes: }\{000, 011, 101, 110\}
\]
The Hamming distance between every pair of codewords is 2, so any two class representations differ in at least two bits.
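
The distance property can be checked directly, and it also suggests one way a predicted bit pattern could be mapped back to a class. The sketch below is an illustration under assumptions: `decode_nearest` and the per-bit probabilities are hypothetical, and the paper's own decision rule, which aggregates bit probabilities, may differ.

```python
from itertools import combinations

CODEWORDS = ["000", "011", "101", "110"]


def hamming_distance(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(codewords):
    """Smallest pairwise Hamming distance in the code."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


def decode_nearest(bit_probs, codewords):
    """Pick the codeword closest to the per-bit probabilities P(bit = 1)."""
    def cost(word):
        return sum(abs(int(bit) - p) for bit, p in zip(word, bit_probs))
    return min(codewords, key=cost)


print(min_distance(CODEWORDS))                     # 2
print(decode_nearest([0.9, 0.2, 0.8], CODEWORDS))  # 101
```

With a minimum distance of 2, a single flipped bit always produces a non-codeword, so the error is detectable, but a hard 0/1 decision cannot always tell which codeword was intended. Working with soft bit probabilities, as in `decode_nearest`, is one way to break such ties.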
- **Feature Selection Loss Function**:
\[
L = \frac{N_{\text{left}}H_{\text{left}}+N_{\text{right}}H_{\text{right}}}{N_{\text{left}}+N_{\text{right}}}
\]
where $H_{\text{left}}$ and $H_{\text{right}}$ are the binary entropies of the left and right partitions, and $N_{\text{left}}$ and $N_{\text{right}}$ are their sample counts.
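
A minimal sketch of this loss is given below. The source gives only the formula; the way it is applied here (thresholding a single feature and comparing candidate thresholds) and the names `binary_entropy` and `split_loss` are assumptions for illustration. The sketch assumes a non-empty sample set.

```python
import math


def binary_entropy(labels):
    """Entropy in bits of a list of 0/1 labels; 0.0 for an empty or pure list."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def split_loss(feature, labels, threshold):
    """Sample-weighted entropy of the two partitions induced by `threshold`."""
    left = [y for x, y in zip(feature, labels) if x < threshold]
    right = [y for x, y in zip(feature, labels) if x >= threshold]
    n_left, n_right = len(left), len(right)
    weighted = n_left * binary_entropy(left) + n_right * binary_entropy(right)
    return weighted / (n_left + n_right)


# Hypothetical one-dimensional feature with binary labels.
feature = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]
print(split_loss(feature, labels, threshold=0.5))   # 0.0, a perfect split
print(split_loss(feature, labels, threshold=0.25))  # larger, the right side is mixed
```

A lower loss means the threshold separates the two classes more cleanly, so a feature whose best threshold yields a low loss is more discriminative under this criterion.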
With these improvements, EHOI remains competitive in detection performance while making significant gains in computational efficiency and model transparency.