LSVOS Challenge 3rd Place Report: SAM2 and Cutie based VOS

Xinyu Liu, Jing Zhang, Kexin Zhang, Xu Liu, Lingling Li
2024-08-21
Abstract: Video Object Segmentation (VOS) presents several challenges, including object occlusion and fragmentation, the disappearance and reappearance of objects, and tracking specific objects within crowded scenes. In this work, we combine the strengths of the state-of-the-art (SOTA) models SAM2 and Cutie to address these challenges. Additionally, we explore the impact of various hyperparameters on video instance segmentation performance. Our approach achieves a J&F score of 0.7952 in the testing phase of the LSVOS challenge VOS track, ranking third overall.
Computer Vision and Pattern Recognition, Information Retrieval
What problem does this paper attempt to address?
This paper addresses several key problems in video object segmentation (VOS):

1. **Object occlusion and fragmentation**: The target object may be partially or completely occluded by other objects, resulting in incomplete or inaccurate segmentation.
2. **Object disappearance and reappearance**: The object may temporarily disappear in some frames and then reappear in later frames, which makes continuous tracking difficult.
3. **Tracking specific objects in crowded scenes**: In scenes containing multiple similar objects or cluttered backgrounds, it is difficult to accurately identify and segment the target object.

These problems are particularly prominent in long-term videos, because the appearance of objects changes significantly over time and complex motion patterns make accurate tracking and segmentation harder to maintain. In addition, the performance of current memory-based methods drops significantly on complex datasets (such as MOSE), mainly because these methods rely on pixel-level matching, are easily affected by noise and frequent occlusions, and lack high-level consistency.

To address these problems, the authors combine two state-of-the-art models, SAM2 and Cutie. By integrating the strengths of the two models, they propose a new VOS method and explore the influence of different hyperparameters on video instance segmentation performance. The method achieved a J&F score of 0.7952 in the test phase of the LSVOS challenge VOS track, ranking third.
### Formula summary

- **Jaccard value (J)**: Measures the similarity between the predicted segmentation mask \(P\) and the ground-truth segmentation mask \(G\):

  \[ J=\frac{|P\cap G|}{|P\cup G|}=\frac{\sum_{i}P_{i}\cdot G_{i}}{\sum_{i}P_{i}+\sum_{i}G_{i}-\sum_{i}P_{i}\cdot G_{i}} \]

  where \(P_{i}\) and \(G_{i}\) are the binary values of the \(i\)-th pixel in the predicted mask and the ground-truth mask, respectively.

- **F-Measure (F)**: An evaluation metric that combines Precision and Recall:

  \[ F = \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}} \]

  where:

  \[ \text{Precision}=\frac{|P\cap G|}{|P|}=\frac{\sum_{i}P_{i}\cdot G_{i}}{\sum_{i}P_{i}} \]

  \[ \text{Recall}=\frac{|P\cap G|}{|G|}=\frac{\sum_{i}P_{i}\cdot G_{i}}{\sum_{i}G_{i}} \]

  In the standard VOS evaluation protocol, \(F\) is a contour accuracy measure, so Precision and Recall are typically computed on the boundary pixels of the two masks rather than on the full regions.

- **Average of J and F**: A single score that summarizes region and contour quality:

  \[ \text{Mean}(J,F)=\frac{J + F}{2} \]

The authors use these metrics to evaluate their VOS method; the reported challenge score of 0.7952 is the J&F mean.
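The formulas above can be checked on a toy example. The following is a minimal NumPy sketch that implements them exactly as written in this summary, on binary masks. The function names (`jaccard`, `f_measure`, `mean_jf`) and the handling of empty masks are assumptions made for illustration; this is not the official challenge evaluation code, which usually computes F on mask boundaries rather than on full regions.

```python
import numpy as np


def jaccard(pred, gt):
    """Region similarity J = |P ∩ G| / |P ∪ G| for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    union = pred.sum() + gt.sum() - inter
    # Assumed convention: two empty masks count as a perfect match.
    if union == 0:
        return 1.0
    return float(inter) / float(union)


def f_measure(pred, gt):
    """F = 2 * Precision * Recall / (Precision + Recall), pixel-wise as in the summary."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    p_sum, g_sum = int(pred.sum()), int(gt.sum())
    # Assumed convention: two empty masks count as a perfect match.
    if p_sum == 0 and g_sum == 0:
        return 1.0
    inter = int(np.logical_and(pred, gt).sum())
    precision = inter / p_sum if p_sum > 0 else 0.0
    recall = inter / g_sum if g_sum > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def mean_jf(pred, gt):
    """Mean(J, F) = (J + F) / 2."""
    return (jaccard(pred, gt) + f_measure(pred, gt)) / 2.0


# Toy masks: 3 predicted pixels, 3 ground-truth pixels, 2 in common.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
gt = np.array([[1, 0, 0],
               [0, 1, 1]])

print(jaccard(pred, gt), f_measure(pred, gt), mean_jf(pred, gt))
```

For the toy masks, the intersection has 2 pixels and the union has 4, so J = 0.5; Precision and Recall are both 2/3, so F = 2/3 and the mean is about 0.583. In a benchmark setting these per-frame values would then be averaged over objects and frames.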