Video Object Segmentation with 3D Convolution Network

Huiyun Tang, Pin Tao, Rui Ma, Yuanchun Shi
DOI: https://doi.org/10.1145/3341016.3341031
2019-01-01
Abstract: We explore a novel method for semi-supervised video object segmentation that uses a dedicated spatiotemporal feature extraction structure. Because a 3D convolution network can convolve over a volume of consecutive frames, it offers a direct way to capture both spatial and temporal information. Our network is composed of three parts: the visual module, the motion module, and the decoder module. The visual module learns appearance features of the object from the first frame, so that the network can detect that specific object in the following frames. The motion module extracts spatiotemporal information from the image sequence with a 3D convolution network, learning how the object's appearance and location vary over time. The decoder module produces the foreground object mask from the outputs of the visual and motion modules through a concatenation and upsampling structure. We evaluate our model on the DAVIS segmentation dataset [15]. Thanks to the visual module, our model needs no online training, unlike most detection-based methods. As a result, it takes 0.14 seconds per frame to produce a mask, which is 71 times faster than the state-of-the-art method OSVOS [2]. Our model also outperforms most methods proposed in recent years, and its mean IoU is comparable with that of state-of-the-art methods.
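To make the idea of convolving over a volume of frames concrete, here is a minimal pure-Python sketch. It is not the authors' network: the function `conv3d_valid`, the toy clip from `make_clip`, and the hand-picked `TEMPORAL_DIFF` kernel are all hypothetical names chosen for illustration, whereas a real motion module would learn its kernels. The sketch only shows that a kernel with a temporal extent responds to changes between frames, which a per-frame 2D kernel cannot see.

```python
from typing import List

# A clip is indexed as volume[time][row][col] (single channel, assumed for simplicity).
Volume = List[List[List[float]]]


def conv3d_valid(volume: Volume, kernel: Volume) -> Volume:
    """Naive 'valid' 3D cross-correlation (the operation deep-learning
    libraries usually call convolution): no padding, stride 1."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kT, kH, kW = len(kernel), len(kernel[0]), len(kernel[0][0])
    out: Volume = []
    for t in range(T - kT + 1):
        plane = []
        for y in range(H - kH + 1):
            row = []
            for x in range(W - kW + 1):
                acc = 0.0
                # Sum over the kernel's temporal AND spatial extent.
                for dt in range(kT):
                    for dy in range(kH):
                        for dx in range(kW):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def make_clip(num_frames: int = 3, size: int = 4) -> Volume:
    """Toy clip: one bright pixel on row 1 that moves one column right per frame."""
    clip = [[[0.0] * size for _ in range(size)] for _ in range(num_frames)]
    for t in range(num_frames):
        clip[t][1][t] = 1.0
    return clip


# Hand-picked kernel of shape (2, 1, 1): frame t+1 minus frame t.
TEMPORAL_DIFF: Volume = [[[-1.0]], [[1.0]]]

clip = make_clip()
response = conv3d_valid(clip, TEMPORAL_DIFF)

# Negative where the object left, positive where it arrived, zero elsewhere.
for t, plane in enumerate(response):
    print(f"frames {t}->{t + 1}: row 1 response = {plane[1]}")
```

In this toy case the response is nonzero only at the positions the pixel leaves and enters, so the output encodes motion as well as location. A learned 3D kernel with spatial extent would combine this temporal cue with appearance in the same operation.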