Abstract: To automate harvesting and de-leafing of tomato plants using robots, it is important to search and detect the task-relevant plant parts. This is challenging due to high levels of occlusion in tomato plants. Active vision is a promising approach to viewpoint planning, which helps robots to deliberately plan camera viewpoints to overcome occlusion and improve perception accuracy. However, current active-vision algorithms cannot differentiate between relevant and irrelevant plant parts and spend time on perceiving irrelevant plant parts, making them inefficient for targeted perception. We propose a semantics-aware active-vision strategy that uses semantic information to identify the relevant plant parts and prioritise them during view planning. We evaluated our strategy on the task of searching and detecting the relevant plant parts using simulation and real-world experiments. In simulation, using 3D models of tomato plants with varying structural complexity, our semantics-aware strategy could search and detect 81.8% of all the relevant plant parts using nine viewpoints. It was significantly faster and detected more plant parts than predefined, random, and volumetric active-vision strategies. Our strategy was also robust to uncertainty in plant and plant-part position, plant complexity, and different viewpoint-sampling strategies. Further, in real-world experiments, our strategy could search and detect 82.7% of all the relevant plant parts using seven viewpoints, under real-world conditions with natural variation and occlusion, natural illumination, sensor noise, and uncertainty in camera poses. Our results clearly indicate the advantage of using semantics-aware active vision for targeted perception of plant parts and its applicability in real-world setups. We believe that it can significantly improve the speed and robustness of automated harvesting and de-leafing in tomato crop production.

What problem does this paper attempt to address?

This paper addresses how a robot can efficiently search for and detect task-relevant plant parts (such as tomatoes, pedicels, and petioles) during automated harvesting and de-leafing in tomato greenhouses. Specifically:

1. **Detection challenges in highly occluded environments**: In tomato greenhouses, severe occlusion between plants makes it difficult for robots to detect target parts accurately. Methods that rely solely on 2D image detection are therefore insufficient for estimating cutting points, which reduces the accuracy of automated operations.
2. **Limitations of existing active-vision algorithms**: Existing active-vision algorithms can overcome occlusion by planning camera viewpoints, but they cannot distinguish between relevant and irrelevant plant parts. They therefore spend time perceiving irrelevant parts, which reduces efficiency.
3. **The need for semantic information**: Improving detection efficiency requires a strategy that uses semantic information (such as category labels) to identify and prioritize the relevant plant parts.

To solve these problems, the authors propose a semantics-aware active-vision strategy. By introducing an attention mechanism, the strategy preferentially selects viewpoints that are expected to yield more new information, so that target plant parts are searched for and detected more efficiently.

### Specific research content

- **Problem description**: Given a tomato plant located in a bounded 3D space \( V \subset \mathbb{R}^3 \), the task is to use a robot with an RGB-D camera to explore the plant and detect all objects of interest (OOIs). Initially, the robot knows only the approximate location of the plant, not its exact position. It must therefore detect the OOIs step by step through a sequence of viewpoint selections.
- **Method overview**:
  - **Perception module**: Uses a convolutional neural network (such as Mask R-CNN) to detect OOIs.
  - **3D scene representation module**: Combines OOI information from multiple viewpoints to generate an OctoMap that contains semantic information.
  - **Viewpoint planning module**: Selects the next-best viewpoint, based on the information known so far, to maximize the acquisition of new semantic information.
- **Experimental verification**: The method was evaluated in simulation and in real-world experiments. In simulation, it detected 81.8% of the relevant plant parts using nine viewpoints, and it was significantly faster and detected more plant parts than predefined, random, and volumetric active-vision strategies. In the real greenhouse environment, it detected 82.7% of the relevant plant parts using seven viewpoints.

### Key formulas

- **Semantic information gain**:
  \[ I_{\text{sem}}(x) = -p_s(x)\log_2(p_s(x)) - (1 - p_s(x))\log_2(1 - p_s(x)) \]
  where \( p_s(x) \) is the confidence that voxel \( x \) belongs to a certain category.
- **Expected semantic information gain**:
  \[ G_{\text{sem}}(\xi) = \sum_{x \in (X_\xi \cap B)} I_{\text{sem}}(x) \]
  where \( X_\xi \) is the set of all voxels expected to be visible from viewpoint \( \xi \), and \( B \) is the set of voxels in the region of interest.
- **Total viewpoint utility**:
  \[ U_{\text{sem}} = G_{\text{sem}}(\xi) \times e^{-d} \]
  where \( d \) is the Euclidean distance between the current viewpoint and the candidate viewpoint.

### Conclusion

This research shows that introducing semantic information and an attention mechanism can significantly improve the speed and robustness with which robots search for and detect specific plant parts in complex greenhouse environments, which supports the automated harvesting and de-leafing of tomato crops.
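
The three key formulas can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the dictionary-based voxel map, the toy voxel IDs, and the choice to treat unobserved voxels as having confidence 0.5 are all assumptions made for illustration. A real system would read the confidences from the semantic OctoMap and obtain the visible voxels by ray casting from each candidate viewpoint.

```python
import math


def semantic_information(p_s: float) -> float:
    """I_sem: binary entropy (in bits) of the class confidence p_s."""
    if p_s <= 0.0 or p_s >= 1.0:
        return 0.0  # a fully certain voxel carries no new information
    return -p_s * math.log2(p_s) - (1.0 - p_s) * math.log2(1.0 - p_s)


def expected_semantic_gain(visible, roi, confidence) -> float:
    """G_sem: sum of I_sem over voxels that are visible AND inside the ROI.

    visible and roi are sets of hypothetical voxel IDs; confidence maps a
    voxel ID to p_s. Missing voxels default to 0.5 (assumption: unobserved
    voxels are maximally uncertain).
    """
    return sum(semantic_information(confidence.get(v, 0.5)) for v in visible & roi)


def viewpoint_utility(gain: float, current_pos, candidate_pos) -> float:
    """U_sem: expected gain discounted by exp(-d), d = Euclidean distance."""
    d = math.dist(current_pos, candidate_pos)
    return gain * math.exp(-d)


def next_best_view(current_pos, candidates, roi, confidence) -> str:
    """Return the name of the candidate viewpoint with the highest utility.

    candidates maps a name to (position, set of voxel IDs expected visible).
    """
    def utility(name):
        position, visible = candidates[name]
        gain = expected_semantic_gain(visible, roi, confidence)
        return viewpoint_utility(gain, current_pos, position)

    return max(candidates, key=utility)


# Toy scene: three uncertain voxels in the region of interest.
roi = {2, 3, 5}
confidence = {2: 0.5, 3: 0.5, 5: 0.5}
candidates = {
    "near": ((0.1, 0.0, 0.0), {2}),        # sees one voxel, 0.1 m away
    "far": ((2.0, 0.0, 0.0), {2, 3, 5}),   # sees three voxels, 2 m away
}
print(next_best_view((0.0, 0.0, 0.0), candidates, roi, confidence))  # prints "near"
```

In this toy case the nearby viewpoint wins even though it sees fewer uncertain voxels, because the factor \( e^{-d} \) penalizes the 2 m move more heavily than the extra gain rewards it.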

Semantics-Aware Next-best-view Planning for Efficient Search and Detection of Task-relevant Plant Parts

Attention-driven Next-best-view Planning for Efficient Reconstruction of Plants and Targeted Plant Parts

Gradient-based Local Next-best-view Planning for Improved Perception of Targeted Plant Nodes

NBV-SC: Next Best View Planning based on Shape Completion for Fruit Mapping and Reconstruction

Toward Semantic Scene Understanding for Fine-Grained 3D Modeling of Plants

Development and evaluation of automated localisation and reconstruction of all fruits on tomato plants in a greenhouse based on multi-view perception and 3D multi-object tracking

Dual-arm Cooperation and Implementing for Robotic Harvesting Tomato Using Binocular Vision.

DAVIS-Ag: A Synthetic Plant Dataset for Prototyping Domain-Inspired Active Vision in Agricultural Robots

Plant-part Segmentation Using Deep Learning and Multi-View Vision

Fruit growing direction recognition and nesting grasping strategies for tomato harvesting robots

Enhancing Agricultural Environment Perception via Active Vision and Zero-Shot Learning

Robotic Harvesting of the Occluded Fruits with a Precise Shape and Position Reconstruction Approach

Graph-based View Motion Planning for Fruit Detection

Autonomous Apple Fruitlet Sizing with Next Best View Planning

A novel perception and semantic mapping method for robot autonomy in orchards

Semantic-aware Next-Best-View for Multi-DoFs Mobile System in Search-and-Acquisition based Visual Perception

Towards Active Robotic Vision in Agriculture: A Deep Learning Approach to Visual Servoing in Occluded and Unstructured Protected Cropping Environments

Safe Leaf Manipulation for Accurate Shape and Pose Estimation of Occluded Fruits

Active Perception Fruit Harvesting Robots — A Systematic Review

AHPPEBot: Autonomous Robot for Tomato Harvesting based on Phenotyping and Pose Estimation

Visual Perception and Modelling in Unstructured Orchard for Apple Harvesting Robots