A-FPN: Attention Aggregation based Feature Pyramid Network for Instance Segmentation — Supplementary Material

Miao Hu, Yali Li, Lu Fang, Shengjin Wang
2021-01-01
Abstract: The input comprises queries $Q$ and keys $K$ of dimension $d_k$, and values $V$ of dimension $d_v$. The scaled dot-product attention divides each dot product by $\sqrt{d_k}$ and applies a softmax function to generate the attention weights. However, the variance of the dot products grows with $d_k$, pushing the softmax function into regions of extremely small gradients, as shown in Eqn. 2.
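The scaling step described above can be illustrated with a minimal sketch. It assumes the standard formulation softmax(Q K^T / sqrt(d_k)) V, which may differ in detail from the paper's own equations, and it uses only the Python standard library. The function names (`softmax`, `scaled_dot_product_attention`, `dot_product_variance`) are hypothetical and chosen for illustration only. The last helper estimates the variance of unscaled dot products under the assumption that query and key components are independent with zero mean and unit variance.

```python
import math
import random


def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    # Q: list of query vectors (each of length d_k)
    # K: list of key vectors (each of length d_k)
    # V: list of value vectors (each of length d_v), one per key
    d_k = len(K[0])
    d_v = len(V[0])
    outputs = []
    for q in Q:
        # Dot product of the query with every key, divided by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        weights = softmax(scores)
        # Attention output: weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(d_v)]
        )
    return outputs


def dot_product_variance(d_k, n_samples=2000, seed=0):
    # Assumption for illustration: components of q and k are independent
    # standard normal samples. Returns the empirical variance of q . k.
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / n_samples
    return sum((d - mean) ** 2 for d in dots) / n_samples
```

Under the stated assumption, `dot_product_variance(d_k)` should come out roughly equal to `d_k`, so dividing each dot product by `sqrt(d_k)` brings the variance back to about 1 and keeps the softmax inputs in a range where its gradients are not vanishingly small.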