Abstract: Our shot boundary determination system consists of three components: a FOI detector, a generalized CUT detector, and a long gradual transition detector. A support vector machine that takes score vectors calculated with a graph partition model as input is used to detect CUTs. Long gradual transitions are detected by three further support vector machines that take multi-resolution score vectors as input. Applying these detectors in succession yields the locations of the shot boundaries and their types. Experiments on the development data show that tuning the ratio between the misclassification penalties for positive and negative samples controls the trade-off between precision and recall. We generated 31 runs from the same system by training the four support vector machines with different parameters, and submitted 10 of them for evaluation. The results show that our system is among the best.
In our system for low-level feature extraction, we propose spatial features of motion vectors to select the motion vectors that actually describe the camera motion. A four-parameter affine model describes the camera motion, and the ILSE technique is used to calculate its parameters. The camera motion is then classified into three classes (pan, tilt, and zoom) with an accurate classification method based on finite-state automata. Our system achieves the best results in this task at TRECVID 2005.
Our systems for high-level feature extraction rely heavily on visual information. The visual features are Color Auto-Correlograms, Color Coherence Vector, Color Histogram, Color Moment, Edge Histogram, and Wavelet Texture. Two systems, one using regional and one using global image features, are compared to explore the effectiveness of regional features. In the regional system, keyframes are segmented and regional features of all six types are extracted; a support vector machine classifier with an Earth Mover's Distance (EMD) kernel is then built. In the global system, the six types of global features are extracted directly from each keyframe, and the classifier ensembles for detecting each concept are formed with the Relay Boost algorithm. This is followed by a concept context module. We mainly tried two approaches: one based on stacked SVMs and the other based on a weighted sum of the confidence scores of related concepts. We then apply time-clustered post-filtering to remove false-positive shots. Our 7 runs are based on these two systems. The results show that multi-feature fusion improves significantly over any single modality.
Our automatic video search systems have three basic retrieval models: a text model based on the script generated by ASR, an image model based on region-based image matching, and a concept model that automatically parses the queries and video shots into concept vectors and then searches the video shots by computing query-shot similarity in concept space. Based on these models, we also develop several combination systems. In the score fusion system, the results are ranked by fusing the scores generated by the basic retrieval models. In the fusion system based on query type, queries are classified into two classes, and each class is retrieved with different models. Across our 7 submissions, the results show that for general topics, which are less related to persons, combining the text and concept models performs better than using the text model alone.
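As an illustration of the four-parameter affine camera-motion model mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the common parametrization u = a*x - b*y + c, v = b*x + a*y + d, which the abstract does not state, and it performs a single closed-form least-squares fit rather than the ILSE procedure, whose details are not given here. The function name `fit_four_param_affine` and the synthetic data are hypothetical.

```python
def fit_four_param_affine(points, vectors):
    """Least-squares fit of an assumed four-parameter motion model.

    Assumed model (not taken from the abstract):
        u = a*x - b*y + c
        v = b*x + a*y + d
    points  : list of (x, y) block positions
    vectors : list of (u, v) motion vectors, one per point
    Returns (a, b, c, d).
    """
    n = len(points)
    if n < 2 or n != len(vectors):
        raise ValueError("need at least two points and one vector per point")

    # Means of positions and motion vectors.
    xm = sum(x for x, _ in points) / n
    ym = sum(y for _, y in points) / n
    um = sum(u for u, _ in vectors) / n
    vm = sum(v for _, v in vectors) / n

    # After centering, the normal equations for a and b decouple.
    spread = num_a = num_b = 0.0
    for (x, y), (u, v) in zip(points, vectors):
        xc, yc, uc, vc = x - xm, y - ym, u - um, v - vm
        spread += xc * xc + yc * yc
        num_a += xc * uc + yc * vc
        num_b += xc * vc - yc * uc
    if spread == 0.0:
        raise ValueError("points must not all coincide")

    a = num_a / spread
    b = num_b / spread
    # Recover the translation terms from the means.
    c = um - a * xm + b * ym
    d = vm - b * xm - a * ym
    return a, b, c, d


# Synthetic motion field on an off-center grid (hypothetical values).
grid = [(float(x) + 10.0, float(y) + 5.0)
        for x in range(-2, 3) for y in range(-2, 3)]
true_params = (0.05, 0.02, 3.0, -1.0)
a0, b0, c0, d0 = true_params
field = [(a0 * x - b0 * y + c0, b0 * x + a0 * y + d0) for x, y in grid]

params = fit_four_param_affine(grid, field)
print(params)
```

Under this assumed parametrization, the translation terms c and d would loosely correspond to horizontal and vertical camera motion and a to a scale change, but how the fitted parameters map to the pan, tilt, and zoom classes in the actual system is not specified in the abstract.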

NJU MCG - Sensetime Team Submission to Pre-training for Video Understanding Challenge Track II.

Transformer Union Convolution Network for Visual Object Tracking

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

CUHK ETHZ SIAT Submission to ActivityNet Challenge 2016

Top-1 Solution of Multi-Moments in Time Challenge 2019

Non-local NetVLAD Encoding for Video Classification

The Monkeytyping Solution to the YouTube-8M Video Understanding Challenge.

Spiking Tucker Fusion Transformer for Audio-Visual Zero-Shot Learning

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS Instance Segmentation

Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS Instance Segmentation

UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation

NJU's submission to the WMT20 QE Shared Task.

All in One: Exploring Unified Video-Language Pre-training

NuTime: Numerically Multi-Scaled Embedding for Large-Scale Time-Series Pretraining

A Vanilla Multi-Task Framework for Dense Visual Prediction Solution to 1st VCL Challenge -- Multi-Task Robustness Track

Technical Report for ActivityNet Challenge 2022 – Temporal Action Localization

MCV-UNet: a modified convolution & transformer hybrid encoder-decoder network with multi-scale information fusion for ultrasound image semantic segmentation

The Runner-up Solution for YouTube-VIS Long Video Challenge 2022

Tsinghua University at TRECVID 2005.

UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer