Abstract: Enabled by large annotated datasets, tracking and segmentation of objects in videos have made remarkable progress in recent years. Despite these advancements, algorithms still struggle under degraded conditions and during fast movements. Event cameras are novel sensors with high temporal resolution and high dynamic range that offer promising advantages to address these challenges. However, annotated data for developing learning-based mask-level tracking algorithms with events is not available. To this end, we introduce: (i) a new task termed space-time instance segmentation, similar to video instance segmentation, whose goal is to segment instances throughout the entire duration of the sensor input (here, the input consists of quasi-continuous events and optionally aligned frames); and (ii) MouseSIS, a dataset for the new task, containing aligned grayscale frames and events. It includes annotated ground-truth labels (pixel-level instance segmentation masks) of a group of up to seven freely moving and interacting mice. We also provide two reference methods, which show that leveraging event data can consistently improve tracking performance, especially when used in combination with conventional cameras. The results highlight the potential of event-aided tracking in difficult scenarios. We hope our dataset opens the field of event-based video instance segmentation and enables the development of robust tracking algorithms for challenging conditions. https://github.com/tub-rip/MouseSIS
What problem does this paper attempt to address?
The main problem this paper addresses is that existing tracking and segmentation algorithms based on conventional cameras perform poorly under challenging conditions such as low light and fast motion, particularly in video instance segmentation. To address this, the authors introduce a new task, Space-Time Instance Segmentation (SIS), and construct a dataset named MouseSIS to support research on it.
### Specific Problems and Solutions
1. **Limitations of Existing Methods**:
- **Low-light and Fast Motion**: Conventional cameras have a limited frame rate and are prone to over-exposure or under-exposure in low-light conditions or during fast motion, which degrades tracking and segmentation quality.
- **Lack of Annotated Data**: Most existing event-camera datasets target simple single-object bounding-box tracking and lack multi-instance segmentation data with pixel-level annotations.
2. **Introduced New Task**:
- **Space-Time Instance Segmentation (SIS)**: Similar to video instance segmentation, but the input is a quasi-continuous event stream, optionally with aligned frames, and the goal is to segment each instance throughout the entire duration of the sensor input.
3. **Constructed New Dataset**:
- **MouseSIS Dataset**: It contains aligned grayscale frames and event data, annotated with pixel-level instance segmentation masks for up to seven freely moving and interacting mice. The dataset comprises 33 video clips with an average duration of about 20 seconds and about 75,000 instance masks in total.
4. **Provided Reference Methods**:
- **ModelMixSort**: A tracking-by-detection method that combines several pre-trained models (such as E2VID, YOLOv8, and SAM), generates instance masks from event and frame data, and uses XMem for tracking.
- **EventSeqFormer**: An end-to-end learned method based on the SeqFormer architecture, which uses transformers to model temporal and spatial dependencies, processes event and frame data, and outputs multi-instance tracking results.
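
Both reference methods consume event data alongside frames. As a rough illustration of how an asynchronous event stream can be turned into an image-like input, the sketch below sums event polarities per pixel over a time window. It is a generic representation, not the one used by the authors; the event tuple layout `(t, x, y, polarity)`, the function name `accumulate_events`, and the window bounds are assumptions made for this example.

```python
from typing import Iterable, List, Tuple

# Hypothetical event layout: (timestamp in seconds, x, y, polarity in {-1, +1}).
Event = Tuple[float, int, int, int]


def accumulate_events(
    events: Iterable[Event],
    width: int,
    height: int,
    t_start: float,
    t_end: float,
) -> List[List[int]]:
    """Sum event polarities per pixel over the window [t_start, t_end)."""
    frame = [[0] * width for _ in range(height)]
    for t, x, y, p in events:
        inside_window = t_start <= t < t_end
        inside_sensor = 0 <= x < width and 0 <= y < height
        if inside_window and inside_sensor:
            frame[y][x] += p
    return frame


# Toy stream: two positive events at pixel (0, 0), one negative event at (1, 1),
# and one event that falls outside the 10 ms window.
events = [
    (0.001, 0, 0, 1),
    (0.002, 0, 0, 1),
    (0.003, 1, 1, -1),
    (0.020, 2, 0, 1),
]
frame = accumulate_events(events, width=3, height=2, t_start=0.0, t_end=0.010)
print(frame)
```

A shorter window preserves more temporal detail but yields sparser frames; the right trade-off depends on how fast the scene moves.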
### Experimental Results
The experimental results show that incorporating event data can significantly improve tracking performance, especially in difficult scenarios such as low light and fast motion. Specifically:
- **ModelMixSort**: When combining event and frame data, it reaches 54.94% MOTA, 65.17% IDF1, and 54.19% HOTA, outperforming variants that use frame data only.
- **EventSeqFormer**: It also performs well when combining event and frame data, but under certain contrast threshold settings the frames reconstructed by E2VID generalize poorly, which lowers overall performance.
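
For readers unfamiliar with the metrics quoted above, the sketch below computes MOTA and IDF1 from error counts using their standard definitions, $\mathrm{MOTA} = 1 - (\mathrm{FN} + \mathrm{FP} + \mathrm{IDSW}) / \mathrm{GT}$ and $\mathrm{IDF1} = 2\,\mathrm{IDTP} / (2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN})$. The function names and the counts in the usage example are made up for illustration and are not taken from the paper. HOTA is omitted because it requires averaging over localization thresholds.

```python
def mota(fn: int, fp: int, idsw: int, num_gt: int) -> float:
    """Multi-Object Tracking Accuracy from false negatives, false positives,
    identity switches, and the total number of ground-truth objects."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (fn + fp + idsw) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Identity F1 from identity true positives, false positives,
    and false negatives."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


# Illustrative counts only (not results from the paper).
example_mota = mota(fn=20, fp=10, idsw=5, num_gt=100)
example_idf1 = idf1(idtp=80, idfp=10, idfn=20)
print(f"MOTA = {example_mota:.2%}, IDF1 = {example_idf1:.2%}")
```

MOTA is dominated by detection errors and can become negative when errors outnumber ground-truth objects, whereas IDF1 emphasizes how consistently identities are preserved over time.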
### Summary
By introducing the Space-Time Instance Segmentation task and the MouseSIS dataset, the paper fills a gap in multi-instance segmentation with event cameras, demonstrates the potential of event data under challenging conditions, and provides a basis for further research.