Towards Robust and Fair Vision Learning in Open-World Environments

Thanh-Dat Truong
2024-12-13
Abstract: The dissertation presents four key contributions toward fairness and robustness in vision learning. First, to address the problem of large-scale data requirements, the dissertation presents a novel Fairness Domain Adaptation approach derived from two major novel research findings: Bijective Maximum Likelihood and Fairness Adaptation Learning. Second, to enable open-world modeling in vision learning, this dissertation presents a novel Open-world Fairness Continual Learning Framework. The success of this research direction is the result of two research lines, i.e., Fairness Continual Learning and Open-world Continual Learning. Third, since visual data are often captured from multiple camera views, robust vision learning methods should be capable of modeling invariant features across views. To achieve this goal, this thesis presents a novel Geometry-based Cross-view Adaptation framework to learn robust feature representations across views. Finally, with the recent increase in large-scale videos and multimodal data, understanding the feature representations and improving the robustness of large-scale visual foundation models is critical. Therefore, this thesis presents novel Transformer-based approaches to improve the robustness of feature representations for multimodal and temporal data. It then presents a novel Domain Generalization Approach to improve the robustness of visual foundation models. The theoretical analysis and experimental results show the effectiveness of the proposed approaches, demonstrating their superior performance compared to prior studies. The contributions in this dissertation advance the fairness and robustness of machine vision learning.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to solve several key problems in machine vision learning in order to achieve fairer and more robust visual perception. Specifically, it focuses on the following four problems:

1. **Large-scale data dependence problem**:
   - Current visual learning methods usually rely on large-scale labeled data, and data labeling is both expensive and time-consuming. To address this, the paper proposes a new **Fairness Domain Adaptation** method, which introduces **Bijective Maximum Likelihood** and a **Fairness Adaptation Learning Framework** to reduce the dependence on large-scale labeled data.
2. **Unfair prediction problem**:
   - Because of imbalanced data distributions, current visual models can produce unfair predictions in practical applications, especially those involving humans. To address this, the paper proposes an **Open-world Fairness Continual Learning Framework**, which combines the research directions of **Fairness Continual Learning** and **Open-world Continual Learning** to improve model fairness.
3. **Cross-view feature modeling problem**:
   - Visual data are often captured from multiple camera views, so robust methods must be able to model features that are invariant across views. The paper proposes a **Geometry-based Cross-view Adaptation** framework to learn robust feature representations across views.
4. **Large-scale multimodal data understanding problem**:
   - With the growth of large-scale video and multimodal data, it is crucial to understand and improve the robustness of large-scale visual foundation models. The paper proposes new Transformer-based methods that introduce new self-attention mechanisms and learning objectives to improve the robustness of feature representations for multimodal and temporal data. It also proposes a new **Domain Generalization Approach** to enhance the robustness of visual foundation models.
Through these contributions, the paper aims to advance the fairness and robustness of machine vision learning in open-world environments, bringing visual perception models closer to human capabilities.