Abstract: This article introduces the solutions of the two champion teams, `MMfruit' for the detection track and `MMfruitSeg' for the segmentation track, in the OpenImage Challenge 2019. It is commonly known that for an object detector, the shared feature at the end of the backbone is not appropriate for both classification and regression, which greatly limits the performance of both single-stage detectors and Faster RCNN \cite{ren2015faster} based detectors. In this competition, we observe that even with a shared feature, different locations in one object have completely inconsistent performance for the two tasks. \textit{E.g., the features of salient locations are usually good for classification, while those around the object edge are good for regression.} Inspired by this, we propose the Decoupling Head (DH) to disentangle object classification and regression via self-learned optimal feature extraction, which leads to a great improvement. Furthermore, we adjust the soft-NMS algorithm to adj-NMS to obtain a stable performance improvement. Finally, we propose a well-designed ensemble strategy that votes on bounding box location and confidence. We also introduce several training/inference strategies and a bag of tricks that give minor improvements. With all of these details, we train and aggregate 28 global models with various backbones, heads and 3+2 expert models, and achieve 1st place in the OpenImage 2019 Object Detection Challenge on both the public and private leaderboards. Given such good instance bounding boxes, we further design a simple instance-level semantic segmentation pipeline and achieve 1st place in the segmentation challenge.

Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Simple and Effective Visual Question Answering in a Single Modality

Champion Solution for the WSDM2023 Toloka VQA Challenge

First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge

The Solution for The PST-KDD-2024 OAG-Challenge

The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA

The Solution for the CVPR2023 NICE Image Captioning Challenge

2nd Place Solution SSLAD Track 1-O2O Semi-Supervised Framwork

Two-Stage Multimodality Fusion for High-Performance Text-Based Visual Question Answering.

Multitask Learning for Visual Question Answering

Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA

Solution for Temporal Sound Localisation Task of ECCV Second Perception Test Challenge 2024

Knowledge-aware image understanding with multi-level visual representation enhancement for visual question answering

Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024

Visual7W: Grounded Question Answering in Images

2nd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation

Task-driven Visual Saliency and Attention-based Visual Question Answering

Learning Rich Image Region Representation for Visual Question Answering

LCV2: A Universal Pretraining-Free Framework for Grounded Visual Question Answering

1st Place Solutions for OpenImage2019 -- Object Detection and Instance Segmentation