YOLO11 and Vision Transformers based 3D Pose Estimation of Immature Green Fruits in Commercial Apple Orchards for Robotic Thinning

Ranjan Sapkota, Manoj Karkee
2024-10-22
Abstract: In this study, a robust method for 3D pose estimation of immature green apples (fruitlets) in commercial orchards was developed, utilizing the YOLO11 object detection and pose estimation algorithm alongside Vision Transformers (ViT) for depth estimation (Dense Prediction Transformer (DPT) and Depth Anything V2). For object detection and pose estimation, YOLO11 (YOLO11n, YOLO11s, YOLO11m, YOLO11l and YOLO11x) and YOLOv8 (YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l and YOLOv8x) were compared under identical hyperparameter settings across all configurations. YOLO11n surpassed all other configurations of YOLO11 and YOLOv8 in box precision and pose precision, achieving scores of 0.91 and 0.915, respectively. Conversely, YOLOv8n exhibited the highest box and pose recall scores of 0.905 and 0.925, respectively. For mean average precision at 50% intersection over union (mAP@50), YOLO11s led all configurations with a box mAP@50 score of 0.94, while YOLOv8n achieved the highest pose mAP@50 score of 0.96. In terms of image processing speed, YOLO11n outperformed all configurations with an inference speed of 2.7 ms, significantly faster than the quickest YOLOv8 configuration, YOLOv8n, which processed images in 7.8 ms. Subsequent integration of ViTs for depth estimation of the green fruit's pose revealed that Depth Anything V2 outperformed Dense Prediction Transformer in 3D pose length validation, achieving the lowest Root Mean Square Error (RMSE) of 1.52 and Mean Absolute Error (MAE) of 1.28 in estimating immature green fruit lengths. Integrating YOLO11 with the Depth Anything model provides a promising solution to 3D pose estimation of immature green fruits for robotic thinning applications.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses 3D pose estimation of immature fruits (fruitlets) in commercial apple orchards, with the goal of enabling automated thinning. Specifically, the research objectives include:

1. **Data Collection**: Using a robotic platform to collect high-resolution RGB images in a commercial orchard environment to build a comprehensive dataset.
2. **Deep Learning-Based Detection and Pose Estimation**: Training and evaluating YOLO11 and YOLOv8 models under the same hyperparameter settings for detecting fruitlets and estimating their pose, then selecting the best-performing model to estimate the peduncle's pose from RGB images.
3. **Using Vision Transformers for RGB to RGB-D Mapping**: Converting RGB images to 3D point clouds to enhance the pose estimation provided by the best deep learning model, using the Dense Prediction Transformer (DPT) and Depth Anything V2 vision transformers for this conversion.
4. **Field Validation**: Validating depth estimation in a commercial orchard by comparing the main-axis length of fruitlets measured with calipers to the corresponding point cloud data, to check the accuracy and reliability of pose estimation.

### Background and Motivation

In modern agriculture, labor shortages are a significant issue, especially in the management of specialty tree fruit crops. For example, apple orchards require a large amount of manual labor for pruning, thinning, and harvesting, all of which demand high precision and meticulous care. Traditional manual thinning is not only labor-intensive but also poses safety risks, such as the dangers of working at heights. Developing automated solutions that reduce labor demand and increase efficiency has therefore become particularly important.
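The third objective listed above, mapping RGB pixels into 3D, can be illustrated for the two endpoints of a fruitlet's pose. The following is a minimal sketch and not the authors' implementation: it assumes a pinhole camera model with known intrinsics and a metric depth value at each keypoint. The names `Intrinsics`, `backproject`, and `keypoint_length`, and all numeric values, are hypothetical. Monocular depth models may output relative rather than metric depth, in which case a scale calibration would be needed before this step.

```python
import math
from typing import NamedTuple, Tuple

Point3D = Tuple[float, float, float]


class Intrinsics(NamedTuple):
    """Pinhole camera intrinsics in pixels (hypothetical, supplied by the caller)."""
    fx: float  # focal length along x
    fy: float  # focal length along y
    cx: float  # principal point x
    cy: float  # principal point y


def backproject(u: float, v: float, depth: float, k: Intrinsics) -> Point3D:
    """Map pixel (u, v) with depth along the optical axis to camera-frame (X, Y, Z)."""
    x = (u - k.cx) * depth / k.fx
    y = (v - k.cy) * depth / k.fy
    return (x, y, depth)


def keypoint_length(p_a: Point3D, p_b: Point3D) -> float:
    """Euclidean distance between two 3D keypoints, in the unit of the depth values."""
    return math.dist(p_a, p_b)


# Illustrative values only: two keypoints 30 px apart, both at 0.50 m depth.
k = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
top = backproject(320.0, 240.0, 0.50, k)
bottom = backproject(320.0, 270.0, 0.50, k)
length_m = keypoint_length(top, bottom)
print(f"estimated length: {length_m * 1000:.1f} mm")
```

With these assumed values, the 30-pixel separation corresponds to 30 × 0.50 / 600 = 0.025 m. Keeping the intrinsics in a small named tuple makes it explicit that the result depends on camera calibration as well as on the depth estimate.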
### Main Objectives

The main objective of this study is to address the challenges of manual thinning in tree fruit agriculture through advanced machine vision technology, particularly during the immature fruit stage, when the fruits are green and hard to distinguish visually. The research uses the YOLO11 (You Only Look Once, version 11) deep learning algorithm to develop a system capable of accurately detecting fruitlets and determining their position and orientation (i.e., "pose") in standard RGB image space. By integrating the Depth Anything model, the study then performs monocular depth estimation from RGB images to improve the accuracy of pose estimation. Comparing these estimates with ground truth measurements in RGB-D (RGB + Depth) space validates the effectiveness of the method and lays the foundation for subsequent efforts in orchard management automation.
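The comparison against ground truth described above is summarized in the abstract with RMSE and MAE. As a minimal sketch of how those two metrics are computed from paired length measurements, the example below uses the standard definitions. The function names and sample values are hypothetical and are not data from the paper; both lists are assumed to be in the same unit.

```python
import math
from typing import Sequence


def _check_pairs(estimated: Sequence[float], measured: Sequence[float]) -> None:
    """Reject empty or mismatched inputs before computing an error metric."""
    if len(estimated) != len(measured) or len(estimated) == 0:
        raise ValueError("inputs must be non-empty and of equal length")


def mae(estimated: Sequence[float], measured: Sequence[float]) -> float:
    """Mean Absolute Error: average magnitude of the per-sample errors."""
    _check_pairs(estimated, measured)
    return sum(abs(e - m) for e, m in zip(estimated, measured)) / len(estimated)


def rmse(estimated: Sequence[float], measured: Sequence[float]) -> float:
    """Root Mean Square Error: square root of the mean squared error."""
    _check_pairs(estimated, measured)
    mean_sq = sum((e - m) ** 2 for e, m in zip(estimated, measured)) / len(estimated)
    return math.sqrt(mean_sq)


# Hypothetical paired lengths: point-cloud estimates vs. caliper measurements.
estimated_lengths = [20.0, 22.0, 19.0, 25.0]
caliper_lengths = [21.0, 20.0, 19.0, 28.0]

mae_value = mae(estimated_lengths, caliper_lengths)
rmse_value = rmse(estimated_lengths, caliper_lengths)
print(f"MAE = {mae_value:.2f}, RMSE = {rmse_value:.2f}")
```

RMSE is never smaller than MAE on the same data because squaring weights large errors more heavily, which is consistent with the reported RMSE of 1.52 sitting above the MAE of 1.28.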