Guest Editorial: Learning from limited annotations for computer vision tasks.

Yazhou Yao, Wenguan Wang, Qiang Wu, Dongfang Liu, Jin Zheng
Abstract: The past decade has witnessed remarkable achievements in computer vision, owing to the fast development of deep learning. With advances in computing power and deep learning algorithms, we can now use millions or even hundreds of millions of samples to train robust and advanced deep learning models. In spite of this impressive success, current deep learning methods tend to rely on massive annotated training data and lack the capability of learning from limited exemplars. However, constructing a million-scale annotated dataset like ImageNet is time-consuming, labour-intensive and even infeasible in many applications. In certain fields, only very limited annotated examples can be gathered, for reasons such as privacy or ethical issues. Consequently, one of the pressing challenges in computer vision is to develop approaches that are capable of learning from limited annotated data.

The purpose of this Special Issue is to collect high-quality articles on learning from limited annotations for computer vision tasks (e.g. image classification, object detection, semantic segmentation, instance segmentation and many others), to publish new ideas, theories, solutions and insights on this topic and to showcase their applications.

In this Special Issue we received 29 papers, all of which underwent peer review. Of the 29 originally submitted papers, 9 have been accepted. The nine accepted papers can be clustered into two main categories: theoretical and applied. The papers that fall into the first category are by Liu et al., Li et al. and He et al. The second category of papers offers direct solutions to various computer vision tasks. These papers are by Ma et al., Wu et al., Rao et al., Sun et al., Hou et al. and Gong et al. A brief presentation of each of the papers in this Special Issue follows.

Liu et al. 
present a Gaussianisation prototypical classifier (GPC) for few-shot classification, which mainly focuses on solving the issue of prototype bias. GPC applies a Gaussianisation operation to the features and estimates a reliable prototype with the maximum a posteriori method, using base-class features as prior information. The proposed method is simple yet effective and does not use any extra labelled data or knowledge. Moreover, it is a one-step prototype rectification method that does not resort to any complex continuous optimisation. The ablation study shows that GPC can benefit from features pretrained only with cross-entropy (CE) loss or jointly trained with a self-supervised loss. The results demonstrate that the proposed method outperforms related work and other state-of-the-art methods.

Li et al. present a novel hyperspectral unmixing method named ‘Global centralised and Structured discriminative Nonnegative Matrix Factorisation (GSNMF)’. The proposed GSNMF offers several distinct advantages over traditional unmixing techniques. Built on manifold regularisation techniques, GSNMF captures the intrinsic structural information by using local affinity and distant repulsion constraints concurrently. With the structured discriminative information, the local affinity constraint ensures that similar elements share similar estimated abundances, while the distant repulsion constraint ensures that dissimilar elements have different abundances. The experiments and analyses demonstrate that the proposed GSNMF performs remarkably well compared with the other methods.

He et al. present a taxonomy of existing algorithms for the task of makeup transfer. Evaluation methods are proposed, existing methods are analysed and existing datasets are reviewed. Finally, the current problems in the field of makeup transfer are discussed, and the trend of future research is analysed.

Ma et al. 
present a dense transformer framework for person re-identification tasks. This paper introduces densely connected class tokens to connect any two layers implicitly. The framework, Denseformer, outperforms other vision transformer models on four widely used benchmarks, namely the Market-1501, DukeMTMC-reID, MSMT17 and Occluded-Duke datasets, with only a small amount of extra computational cost. According to the visualisation results, the proposed Denseformer pays more attention to the main parts of human bodies, obtaining discriminative global features. According to the promising results, Denseformer is a general improvement on ViT and works well on other tasks that use ViT as a backbone.

Wu et al. present a new homology-continuous-based makeup transformation method, which can be roughly divided into two network branches: the age compensation branch and the makeup transformation branch. Specifically, in the age compensation branch, based on the same source continuity, the authors design a new encoding module that maps the face vector into the corresponding high-dimensional vector space and compensates for age by adjusting the vector direction. In the makeup transformation branch, the authors design a multi-style encoder to handle different types of makeup, such as Chinese, Japanese and Korean makeup. In addition, the proposed network is a two-pass encoder-decoder architecture that has good parallelism and can achieve better results with training and inference on GPUs.

Rao et al. present a novel end-to-end architecture for point cloud completion that uses a stack-style folding network, called SSFN. Because the output shape code cannot fully represent the semantics, they propose a Stack-Style Folding module that transforms the bottleneck output into a style code analogous to that of StyleGAN. Experiments on the ShapeNet and KITTI datasets indicate that the proposed SSFN architecture achieves decent visual quality and metric performance.
Sun et al. present a method for zero-shot temporal event localisation (ZSTEL) that leverages large-scale video and language models, for example, CLIP. They address two key problems for ZSTEL: (1) how to find the relevant region where the event is likely to occur and (2) how to determine the event duration after the relevant region is found. They propose query-guided optimisation for local frame relevance. Relying on the query-to-frame relationship, this method can find the most relevant local frame region where the event is most likely to occur, guided by a constructed objective. The experimental results on two standard benchmark datasets, Charades-STA and ActivityCaptions, show the effectiveness of the proposed approach.

Hou et al. present a cutting-edge few-shot detection method for logo images. To avoid misclassification between the base and novel classes, they add an extra classification head. They also apply a convolutional layer in the regression heads to improve localisation accuracy with the limited training data. Considering the characteristics of logo images, they add a balanced feature pyramid with Deformable RoI Pooling and unfreeze the region proposal network in the fine-tuning stage. Extensive comparative experiments and ablation studies illustrate the advantage of the proposed method and the effectiveness of every component in the model.

Gong et al. present a method for object detection with a long-tail distribution that includes a dual-balanced network and a balanced classification loss. This work investigates how the long-tailed distribution impacts the sub-networks in the general two-stage object detection framework Faster R-CNN and finds that unbalanced proposal sampling and unbalanced classification logic deteriorate the performance of the model in terms of AP. They propose a balanced region proposal network and a balanced classification network to address the above issues.
Experiments on the LVIS-v0.5 dataset demonstrate that the framework improves AP without sacrificing too much performance on the head categories of the long-tail distribution.

All of the papers selected for this Special Issue show that the field of learning from limited annotations for computer vision tasks is steadily moving forward. The possibility of a weakly supervised learning paradigm will remain a source of inspiration for new techniques in the years to come.

Firstly, we wish to express our thanks to the Ph.D. students at Nanjing University of Science and Technology for their continuous assistance throughout this process. Also, we wish to express our gratitude to all the contributors who submitted novel scientific results to this Special Issue and to the anonymous reviewers, whose expert work made this endeavour possible. We hope that this effort will contribute to the further development of deep learning and increase the interest of the scientific and technological community in this area. Lastly, we express our appreciation to the journal's Editors-in-Chief and the Editorial Office for their support throughout this venture.

Data sharing is not applicable to this article as no new data were created or analysed in this study.

Yazhou Yao is a professor at the School of Computer Science and Engineering, Nanjing University of Science and Technology. With the support of the China Scholarship Council, he received his Ph.D. degree in Computer Science from the University of Technology Sydney, Australia, in 2018. From July 2018 to July 2019, he worked as a Research Scientist at the Inception Institute of Artificial Intelligence, Abu Dhabi, UAE. His research interests include multimedia processing and machine learning.

Wenguan Wang is currently a ZJU100 Young Professor at Zhejiang University. He received his Ph.D. degree from Beijing Institute of Technology in 2018. From 2016 to 2018, he was a joint Ph.D. 
candidate at the University of California, Los Angeles. From 2018 to 2019, he was a senior scientist at the Inception Institute of Artificial Intelligence, UAE. From 2020 to 2022, he worked as a postdoctoral researcher at ETH Zurich, Switzerland. After that, he worked as a lecturer and ARC DECRA Fellow at the University of Technology Sydney. His current research interests include computer vision, image processing and deep learning.

Qiang Wu received the BEng and MEng degrees in electronic engineering from the Harbin Institute of Technology, Harbin, China, in 1996 and 1998, respectively, and the Ph.D. degree in computing science from the University of Technology Sydney, Sydney, Australia, in 2004. He is currently an Associate Professor and a Core Member of the Global Big Data Technologies Centre, University of Technology Sydney. He has published more than 70 refereed papers, including those published in prestigious journals and top international conferences. His major research interests include computer vision, image processing, pattern recognition, machine learning and multimedia processing. He has served as the chair and/or a Programme Committee Member for a number of international conferences.

Dongfang Liu is an Assistant Professor in the Department of Computer Engineering at the Rochester Institute of Technology (RIT). He earned his Ph.D. degree from Purdue University. Dr. Liu's research focuses on embodied AI and on creating general AI solutions to address significant societal challenges. His ongoing work consists of: (1) developing attention-guided perception models that behave like a human's perceptual capacity; and (2) developing structured and human-centred recognition systems that comprehend the surrounding visual world. His publication portfolio includes papers from major conferences in the artificial intelligence and robotics fields, such as CVPR, ECCV, ICCV, ICLR, NIPS, ICML, AAAI, IJCAI, ACL, EMNLP, WWW, WACV and IROS. 
He currently serves on the senior programme committee for AAAI and IJCAI and as an associate editor for IEEE Transactions on Circuits and Systems for Video Technology (TCSVT).

Jin Zheng received the BS and MS degrees from Liaoning Technical University in 2001 and 2004, respectively, and the Ph.D. degree from the School of Computer Science and Engineering, Beihang University, in 2009. She joined the School of Computer Science and Engineering, Beihang University, in 2009. In 2014, she visited Harvard University, MA, USA, as a Visiting Scholar for one year. Her current research interests include object detection, tracking and recognition.