Abstract: Feature representation and learning is at the core of many computer vision problems such as image classification, object recognition, action recognition, object tracking, image search, biometrics and many others. In the past two decades, remarkable progress has been witnessed in feature representation and learning, which mainly consists of two important development stages. In the first stage, from 1995 to 2012 (i.e., the pre-deep learning era), the field was dominated by milestone handcrafted feature descriptors such as SIFT, SURF, HOG, LBP, Bag of Visual Words, Fisher Vector, etc. The second stage, i.e., the deep learning era, starts from 2012, when a team led by Hinton won the prestigious ImageNet Challenge using deep learning techniques rather than traditional handcrafted features. The second stage is characterized by deep learning based representations, especially Deep Convolutional Neural Networks (Deep CNNs), which can learn powerful feature representations with multiple levels of abstraction directly from data.

Deep learning techniques have attracted enormous attention and have brought about considerable breakthroughs for many problems in computer vision. Increased computational power, deeper and more complicated deep neural networks, and the availability of large scale datasets are fueling computer vision systems. Despite the great success, the known deficiencies of deep neural networks, such as being data hungry and energy hungry and lacking theoretical interpretability, have not been fully addressed.

Nowadays, intelligence is moving towards edge devices. Running machine learning systems on end devices (e.g., smartphones, automobiles, wearable devices or Internet of Things devices) rather than in the cloud has various benefits such as immediate response, enhanced reliability, increased privacy, and efficient use of network bandwidth.
However, many real-time applications such as online learning, incremental learning, mobile, embedded, or wearable devices with limited resources and tight power budgets, or real-time systems in which constraints are imposed by a limited economic budget, expose the inadequacies of existing algorithms and require feature representations that are computationally and memory efficient. In addition, applications where only limited amounts of annotated training data can be gathered (such as many visual inspection or medical diagnostics tasks) pose great challenges for applying state-of-the-art deep neural networks. Therefore, despite the great strides, especially over recent years, there is a continued need for vigorous research in this area to solve many challenging problems by developing compact, efficient feature representations from three aspects: computational efficiency, label efficiency, and sample efficiency.

Since 2017, we have organized four international workshops associated with top conferences (ICCV 2017, ECCV 2018, CVPR 2019 and ICCV 2019), explicitly devoted to the topic “Compact and Efficient Feature Representation and Learning in Computer Vision”. This is a clear sign of the growing interest in computer vision around these themes. The goal of this special section has been to solicit and publish high quality papers that bring a clear picture of the state of the art along this direction, and identify future promising research directions. As guest editors of this special section, we were happy to receive 25 submissions. After a careful review process, we accepted ten papers for publication. We thank the reviewers who provided detailed, insightful, and timely reviews, leading to the high quality of accepted papers.
We also thank TPAMI EIC Sven Dickinson and the Associate EICs for recognizing the widespread interest in this field, which warrants this special section. The ten accepted papers in this special section can be grouped into five main categories:

Introduction to the Special Section on Compact and Efficient Feature Representation and Learning in Computer Vision
