A Review of Machine Learning and Deep Learning for Object Detection, Semantic Segmentation, and Human Action Recognition in Machine and Robotic Vision

Nikoleta Manakitsa, George S. Maraslidis, Lazaros Moysis, George F. Fragulis
DOI: https://doi.org/10.3390/technologies12020015
2024-01-23
Technologies
Abstract: Machine vision, an interdisciplinary field that aims to replicate human visual perception in computers, has experienced rapid progress and significant contributions. This paper traces the origins of machine vision, from early image processing algorithms to its convergence with computer science, mathematics, and robotics, resulting in a distinct branch of artificial intelligence. The integration of machine learning techniques, particularly deep learning, has driven its growth and adoption in everyday devices. This study focuses on the objectives of computer vision systems: replicating human visual capabilities including recognition, comprehension, and interpretation. Notably, image classification, object detection, and image segmentation are crucial tasks requiring robust mathematical foundations. Despite the advancements, challenges persist, such as clarifying terminology related to artificial intelligence, machine learning, and deep learning. Precise definitions and interpretations are vital for establishing a solid research foundation. The evolution of machine vision reflects an ambitious journey to emulate human visual perception. Interdisciplinary collaboration and the integration of deep learning techniques have propelled remarkable advancements in emulating human behavior and perception. Through this research, the field of machine vision continues to shape the future of computer systems and artificial intelligence applications.
What problem does this paper attempt to address?
This paper reviews how machine learning and deep learning are applied in machine vision and robotic vision, concentrating on three key tasks: object detection, semantic segmentation, and human action recognition. Specifically, the paper aims to:

1. **Summarize the current state of research**: Comprehensively review machine learning and deep learning methods for object detection, semantic segmentation, and human action recognition, and present recent progress in these areas.
2. **Explain the technical principles**: Discuss in detail the algorithms and techniques used in these tasks, including supervised and unsupervised learning methods, convolutional neural networks (CNNs), recurrent neural networks (RNNs), and generative adversarial networks (GANs).
3. **Analyze challenges and limitations**: Examine the challenges these techniques face, such as the difficulty of data labeling, model generalization, and the demand for computing resources.
4. **Propose future directions**: Based on current results, outline possible research directions and development trends for these fields.

### Main tasks and their mathematical foundations

#### 1. Object detection

- **Objective**: Locate specific objects in an image or video and classify them.
- **Common methods**:
  - **Convolutional neural network (CNN)**: Extract features through convolution operations, generate candidate boxes using a region proposal network (RPN), and then classify each candidate box.
  - **Sliding window**: Slide windows of different sizes across the image, extract features from each window, and classify them.
  - **Non-maximum suppression (NMS)**: Remove duplicate detection boxes and keep the most confident detections.
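The non-maximum suppression step listed above can be sketched as follows. This is a minimal pure-Python illustration, not the paper's implementation: the box format `(x1, y1, x2, y2)`, the function names `iou` and `nms`, the default overlap threshold of 0.5, and the toy boxes are all assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the kept boxes, highest score first."""
    # Visit boxes from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Toy input (hypothetical): the first two boxes overlap heavily, the third is separate.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box is suppressed because it overlaps the higher-scoring first box, while the third box is kept because it overlaps nothing. Production detectors typically use a vectorized, per-class variant of the same idea.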
- **Mathematical formulas**:
  - Convolution at a single output position (a weighted sum of the inputs under the kernel, plus a bias):
    \[ y = \sum_{i = 0}^{n - 1} x_i w_i + b \]
  - Binary cross-entropy loss:
    \[ L = -\sum_{i = 1}^{N}\left(y_i\log(p_i) + (1 - y_i)\log(1 - p_i)\right) \]

#### 2. Semantic segmentation

- **Objective**: Assign a class label to every pixel, dividing the image into semantically meaningful regions.
- **Common methods**:
  - **Fully convolutional network (FCN)**: Achieve pixel-level classification through convolutional layers and deconvolutional (transposed convolution) layers.
  - **U-Net**: Widely used in medical image segmentation, combining encoder and decoder structures.
- **Mathematical formulas**:
  - Cross-entropy loss over \(N\) pixels and \(C\) classes, where \(y_{ij}\) is the one-hot label and \(p_{ij}\) the predicted probability:
    \[ L = -\sum_{i = 1}^{N}\sum_{j = 1}^{C} y_{ij}\log(p_{ij}) \]
  - Deconvolution operation:
    \[ y = \text{deconv}(x, w) \]

#### 3. Human action recognition

- **Objective**: Identify and classify human actions from video sequences.
- **Common methods**:
  - **Convolutional neural network (CNN)**: Extract spatial features.
  - **Recurrent neural network (RNN)**: Capture time-series information, especially long short-term memory (LSTM) networks.
  - **Multi-modal fusion**: Combine RGB and depth information to improve recognition accuracy.
- **Mathematical formulas**:
  - Recurrent hidden-state update (the basic form; a full LSTM adds input, forget, and output gates and a cell state around it):
    \[ h_t = \tanh(W_h[h_{t - 1}, x_t] + b_h) \]
  - Attention mechanism (softmax over the scores \(e_1, \dots, e_T\)):
    \[ \alpha_t = \frac{\exp(e_t)}{\sum_{k = 1}^{T}\exp(e_k)} \]

### Conclusion

By comprehensively reviewing the applications of machine learning and deep learning in object detection, semantic segmentation, and human action recognition, the paper summarizes current research results and identifies the open challenges and future directions, giving researchers in related fields a reference point for further work.
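As a small numerical illustration of two formulas listed above, the sketch below evaluates the categorical cross-entropy loss used for segmentation and the softmax attention weights used for action recognition. It is only an illustration under stated assumptions: the function names `cross_entropy` and `attention_weights`, the clipping constant `eps`, and all input values are hypothetical and do not come from the paper.

```python
import math


def cross_entropy(y, p, eps=1e-12):
    """L = -sum_i sum_j y[i][j] * log(p[i][j]) over N pixels and C classes."""
    return -sum(
        y_ij * math.log(max(p_ij, eps))  # clip to avoid log(0)
        for y_i, p_i in zip(y, p)
        for y_ij, p_ij in zip(y_i, p_i)
    )


def attention_weights(e):
    """alpha_t = exp(e_t) / sum_k exp(e_k), with the max subtracted for stability."""
    m = max(e)
    exps = [math.exp(v - m) for v in e]
    total = sum(exps)
    return [v / total for v in exps]


# Toy example (hypothetical): two pixels, three classes.
y = [[1, 0, 0], [0, 1, 0]]              # one-hot labels
p = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # predicted class probabilities
loss = cross_entropy(y, p)

# Toy attention scores for three time steps.
alpha = attention_weights([1.0, 2.0, 3.0])
```

With one-hot labels, only the predicted probability of the true class contributes to the loss, so `loss` here equals `-(log 0.7 + log 0.8)`. The attention weights are positive, sum to 1, and increase with the score.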