Abstract: This study developed a robust method for 3D pose estimation of immature green apples (fruitlets) in commercial orchards, combining the YOLO11 object detection and pose estimation algorithm with Vision Transformers (ViTs) for depth estimation: Dense Prediction Transformer (DPT) and Depth Anything V2. For object detection and pose estimation, YOLO11 (YOLO11n, YOLO11s, YOLO11m, YOLO11l and YOLO11x) and YOLOv8 (YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l and YOLOv8x) were compared under identical hyperparameter settings across all configurations. YOLO11n surpassed all other configurations of YOLO11 and YOLOv8 in box precision and pose precision, achieving scores of 0.91 and 0.915, respectively. Conversely, YOLOv8n achieved the highest box and pose recall, with scores of 0.905 and 0.925, respectively. For mean average precision at 50% intersection over union (mAP@50), YOLO11s led all configurations with a box mAP@50 of 0.94, while YOLOv8n achieved the highest pose mAP@50 of 0.96. In image processing speed, YOLO11n outperformed all configurations with an inference time of 2.7 ms, substantially faster than the quickest YOLOv8 configuration, YOLOv8n, which processed images in 7.8 ms. Subsequent integration of ViTs for depth estimation of the fruitlet pose showed that Depth Anything V2 outperformed DPT in 3D pose length validation, achieving the lowest Root Mean Square Error (RMSE) of 1.52 and Mean Absolute Error (MAE) of 1.28, indicating high precision in estimating immature green fruit lengths. Integrating YOLO11 with the Depth Anything model provides a promising solution for 3D pose estimation of immature green fruits in robotic thinning applications.
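The two error metrics quoted for the length validation can be illustrated with a minimal sketch. The function names and the sample lengths below are hypothetical and are not taken from the paper; the abstract does not state the unit of the reported values, so none is assumed here.

```python
import math


def rmse(estimated, measured):
    """Root mean square error between paired estimates and reference values."""
    if len(estimated) != len(measured) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(e - m) ** 2 for e, m in zip(estimated, measured)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(estimated, measured):
    """Mean absolute error between paired estimates and reference values."""
    if len(estimated) != len(measured) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    absolute_errors = [abs(e - m) for e, m in zip(estimated, measured)]
    return sum(absolute_errors) / len(absolute_errors)


# Hypothetical fruitlet lengths: model-derived estimates vs. caliper readings.
estimated_lengths = [21.0, 18.5, 24.2]
caliper_lengths = [20.0, 19.5, 23.0]

print("RMSE:", rmse(estimated_lengths, caliper_lengths))
print("MAE:", mae(estimated_lengths, caliper_lengths))
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same data and reacts more strongly to a few large misses, which is why the two are commonly reported together.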

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses 3D pose estimation of immature green fruits (fruitlets) in commercial apple orchards, with the goal of enabling automated thinning. The research objectives are:

1. **Data Collection**: Using a robotic platform to collect high-resolution RGB images in a commercial orchard environment and build a comprehensive dataset.
2. **Deep Learning-Based Detection and Pose Estimation**: Training and evaluating YOLO11 and YOLOv8 models under the same hyperparameter settings for detecting fruitlets and estimating their pose, then selecting the best-performing model to estimate the peduncle's pose from RGB images.
3. **Using Vision Transformers for RGB to RGB-D Mapping**: Converting RGB images to 3D point clouds to enhance the pose estimation provided by the best deep learning model, using the Dense Prediction Transformer (DPT) and Depth Anything V2 vision transformers for this conversion.
4. **Field Validation**: Validating depth estimation in a commercial orchard by comparing the main-axis length of fruitlets measured with calipers against the corresponding point cloud data, to check the accuracy and reliability of pose estimation.

### Background and Motivation

Labor shortages are a significant issue in modern agriculture, especially in the management of specialty tree fruit crops. Apple orchards, for example, require a large amount of manual labor for pruning, thinning, and harvesting, all of which demand precision and care. Manual thinning is not only labor-intensive but also poses safety risks, such as working at heights. Automated solutions that reduce labor demand and increase efficiency are therefore increasingly important.

### Main Objectives

The main objective of this study is to address the challenges of manual thinning in tree fruit agriculture through machine vision, particularly during the fruitlet stage, when the fruits are green and similar in appearance. The research uses the YOLO11 (You Only Look Once, version 11) deep learning algorithm to develop a system that detects fruitlets and determines their position and orientation (i.e., "pose") in standard RGB image space. By integrating the Depth Anything model, the study also performs monocular depth estimation from RGB images to improve the accuracy of pose estimation. The method is validated by comparing these estimates with ground truth measurements in RGB-D (RGB + Depth) space, laying the foundation for further automation of orchard management.
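One way to picture how 2D pose keypoints and an estimated depth map combine into a 3D length is pinhole back-projection. The sketch below is an illustration under stated assumptions, not the authors' pipeline: the camera intrinsics (`fx`, `fy`, `cx`, `cy`), the keypoint pixels, and the depth values are all hypothetical, and depth is assumed to be already in metric units (a relative depth map would first need to be scaled).

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with depth to camera-frame (x, y, z) via the pinhole model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def length_3d(kp_a, kp_b, fx, fy, cx, cy):
    """Euclidean distance between two keypoints given as (u, v, depth) tuples."""
    point_a = backproject(*kp_a, fx, fy, cx, cy)
    point_b = backproject(*kp_b, fx, fy, cx, cy)
    return math.dist(point_a, point_b)


# Hypothetical intrinsics and keypoints (pixels, with depth in metres).
FX, FY, CX, CY = 400.0, 400.0, 320.0, 240.0
top_keypoint = (100.0, 100.0, 0.5)
bottom_keypoint = (100.0, 140.0, 0.5)

print("3D length:", length_3d(top_keypoint, bottom_keypoint, FX, FY, CX, CY))
```

A length computed this way is the kind of quantity that could be compared against a caliper measurement during field validation; its accuracy depends directly on the quality of the depth estimate at each keypoint.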

YOLO11 and Vision Transformers based 3D Pose Estimation of Immature Green Fruits in Commercial Apple Orchards for Robotic Thinning

Immature Green Apple Detection and Sizing in Commercial Orchards Using YOLOv8 and Shape Fitting Techniques

Comprehensive Performance Evaluation of YOLO11, YOLOv10, YOLOv9 and YOLOv8 on Detecting and Counting Fruitlet in Complex Orchard Environments

Comparing YOLO11 and YOLOv8 for instance segmentation of occluded and non-occluded immature green fruits in complex orchard environment

Machine Vision-Based Crop-Load Estimation Using YOLOv8

Apple Ripeness Identification from Digital Images Using Transformers

A Seamless Deep Learning Approach for Apple Detection, Depth Estimation, and Tracking Using YOLO Models Enhanced by Multi-Head Attention Mechanism

Detection and Counting of Small Target Apples under Complicated Environments by Using Improved YOLOv7-tiny

Multi-stage tomato fruit recognition method based on improved YOLOv8

An Improved Apple Object Detection Method Based on Lightweight YOLOv4 in Complex Backgrounds

Fruit fast tracking and recognition of apple picking robot based on improved YOLOv5

YOLO Models for Fresh Fruit Classification from Digital Videos

An improved YOLOv5s model for assessing apple graspability in automated harvesting scene

Deep learning for real-time fruit detection and orchard fruit load estimation: benchmarking of ‘MangoYOLO’

Using YOLOv3 Algorithm with Pre- and Post-Processing for Apple Detection in Fruit-Harvesting Robot

YOLOAPPLE: Augment Yolov3 deep learning algorithm for apple fruit quality detection

Apple stem/calyx real-time recognition using YOLO-v5 algorithm for fruit automatic loading system

Fruit Detection and Counting in Apple Orchards Based on Improved Yolov7 and Multi-Object Tracking Methods

Apple Detection in Complex Scene Using the Improved YOLOv4 Model

Fruits hidden by green: an improved YOLOV8n for detection of young citrus in lush citrus trees