A Diffusion-based Data Generator for Training Object Recognition Models in Ultra-Range Distance

Eran Bamani, Eden Nissinman, Lisa Koenigsberg, Inbar Meir, Avishai Sintov
2024-04-15
Abstract: Object recognition, commonly performed by a camera, is a fundamental requirement for robots to complete complex tasks. Some tasks require recognizing objects far from the robot's camera. A challenging example is Ultra-Range Gesture Recognition (URGR) in human-robot interaction where the user exhibits directive gestures at a distance of up to 25 m from the robot. However, training a model to recognize hardly visible objects located in ultra-range requires an exhaustive collection of a significant amount of labeled samples. The generation of synthetic training datasets is a recent solution to the lack of real-world data, while unable to properly replicate the realistic visual characteristics of distant objects in images. In this letter, we propose the Diffusion in Ultra-Range (DUR) framework based on a Diffusion model to generate labeled images of distant objects in various scenes. The DUR generator receives a desired distance and class (e.g., gesture) and outputs a corresponding synthetic image. We apply DUR to train a URGR model with directive gestures in which fine details of the gesturing hand are challenging to distinguish. DUR is compared to other types of generative models showcasing superiority both in fidelity and in recognition success rate when training a URGR model. More importantly, training a DUR model on a limited amount of real data and then using it to generate synthetic data for training a URGR model outperforms directly training the URGR model on real data. The synthetic-based URGR model is also demonstrated in gesture-based direction of a ground robot.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the scarcity of training data for long-distance object recognition, in particular Ultra-Range Gesture Recognition (URGR), and the difficulty of collecting such data. Specifically:

1. **Challenges in long-distance object recognition**: In human-robot interaction (HRI) and other robot-environment interaction scenarios, a robot must recognize objects or gestures that are far from its camera. As the distance grows, the object occupies fewer pixels and its fine details are lost, so recognition performance drops sharply, especially against complex backgrounds.
2. **Difficulties in data acquisition**: Training a model that accurately recognizes objects or gestures at ultra-range requires a large number of labeled samples, captured in different environments and covering a range of distances. Collecting this data is time-consuming and costly, which limits how well the model can be trained.
3. **Limitations of existing generation methods**: Existing synthetic data generation methods partially alleviate the data shortage, but they do not reproduce the visual characteristics of distant objects in real images well, particularly low resolution and blurred details.

To address these problems, the authors propose the Diffusion in Ultra-Range (DUR) framework, a diffusion-model-based generator of conditional synthetic images of distant objects. Given a specified distance and gesture class, DUR generates a corresponding labeled synthetic image, which makes it possible to produce a large amount of training data for an ultra-range gesture recognition model.

### Main contributions:
- A new framework, DUR, for generating conditional synthetic images at ultra-range distances.
- A URGR model trained on DUR-generated synthetic data achieves a higher recognition success rate than a model trained directly on real data.
- The synthetic images generated by DUR are superior to those of other generative models in both fidelity and recognition success rate.
- A demonstration of the synthetic-data-trained URGR model in gesture-based direction of a ground robot.

### Core technologies of the solution:
- **Diffusion model**: DUR is based on a non-Markovian diffusion process and generates high-fidelity synthetic images by gradually adding and then removing noise.
- **Conditional generation**: DUR generates synthetic images for a specified gesture class and distance, so the generated data matches the requirements of the target application.
- **Quality control**: A ResNet model filters out low-quality synthetic images so that the final generated dataset is of relatively high quality.

Together, these components let DUR address the data acquisition problem in ultra-range gesture recognition and improve the recognition performance of the trained model.
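
The diffusion-based, conditional generation described under the core technologies can be illustrated with a minimal sketch. The summary only states that DUR uses a non-Markovian diffusion process conditioned on gesture class and distance; the deterministic DDIM-style update, the linear noise schedule, the function names (`make_alpha_bars`, `add_noise`, `ddim_step`, `sample`), the `predict_noise` callable, and the `"pointing"` label below are all assumptions made for illustration, not the authors' implementation. A real system would use a trained neural network as the noise predictor; here an oracle that returns the true noise stands in for it, so the reverse process can be checked end to end.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars = []
    prod = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def add_noise(x0, noise, alpha_bar):
    """Forward process: jump directly from the clean sample x0 to a noisy x_t."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * n for x, n in zip(x0, noise)]


def ddim_step(x_t, eps_hat, alpha_bar_t, alpha_bar_prev):
    """Deterministic (eta = 0) DDIM-style update from step t to the previous step."""
    a_t, b_t = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    # Estimate the clean sample implied by the predicted noise.
    x0_hat = [(x - b_t * e) / a_t for x, e in zip(x_t, eps_hat)]
    a_p, b_p = math.sqrt(alpha_bar_prev), math.sqrt(1.0 - alpha_bar_prev)
    # Re-noise the estimate to the (lower) noise level of the previous step.
    return [a_p * x + b_p * e for x, e in zip(x0_hat, eps_hat)]


def sample(predict_noise, x_T, alpha_bars, gesture_class, distance_m):
    """Run the reverse process, conditioning every step on class and distance."""
    x = x_T
    for t in range(len(alpha_bars) - 1, -1, -1):
        # Hypothetical conditional noise predictor (a trained network in practice).
        eps_hat = predict_noise(x, t, gesture_class, distance_m)
        alpha_bar_prev = alpha_bars[t - 1] if t > 0 else 1.0
        x = ddim_step(x, eps_hat, alpha_bars[t], alpha_bar_prev)
    return x


# Toy check on an 8-value "image": with an oracle predictor that returns the
# true noise, the reverse process should recover the clean sample.
random.seed(0)
x0 = [random.uniform(-1.0, 1.0) for _ in range(8)]
noise = [random.gauss(0.0, 1.0) for _ in range(8)]
alpha_bars = make_alpha_bars(50)
x_T = add_noise(x0, noise, alpha_bars[-1])


def oracle(x, t, gesture_class, distance_m):
    return noise


recovered = sample(oracle, x_T, alpha_bars, gesture_class="pointing", distance_m=20.0)
```

In this sketch the conditioning values are simply passed through to the noise predictor at every step, which is one common way to condition a diffusion model; how DUR actually injects the class and distance into its network is not specified in this summary.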