The three problems raised by the authors
Common challenges that image classification models encounter when localizing objects are: (a) they tend to look at the most discriminative features in an image, which confines the localization map to a very small region; (b) the localization maps are class-agnostic, and the models highlight objects of multiple classes in the same image; and (c) the localization performance is affected by background noise.
The authors give visual examples for each of the three problems
To alleviate the problems above, the authors use a Transformer. Why a Transformer? Is there further analysis later in the paper?
To alleviate the above challenges, we leverage vision transformer architectures [10] to localize objects in an image.
A rough summary of what the authors did
In our work, we introduce the following simple modifications to the vision transformer.
Describes the components of the authors' proposed patch-based Attention Dropout Layer (p-ADL) and what they do
The two components of the p-ADL layer are the patch importance map and the patch drop mask. During the course of training, these components are utilized to highlight informative patches and drop discriminative patches, respectively, to balance the classification vs. localization performance of the model.
Describes the role of the authors' proposed Grad Attention Rollout (GAR) module
We introduce a weighted attention rollout mechanism using the gradients computed at each attention map. We refer to this mechanism as Grad Attention Rollout. This post-hoc method, in combination with the p-ADL, enables the model to quantify the positive and negative contributions in the attention map for each class. This guides the model to generate class-dependent attention maps, and a negative clamping operation in GAR further suppresses the effect of background noise in the attention map.
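My minimal sketch of what such a gradient-weighted rollout could look like, reconstructed from the description above (not the authors' code; the function name, tensor shapes, and the 0.5/0.5 identity mixing are assumptions carried over from standard rollout):

```python
import torch

def grad_attention_rollout(attentions, gradients):
    """Sketch: gradient-weighted attention rollout with negative clamping.

    attentions: list of per-layer attention maps, each of shape
                (num_heads, num_tokens, num_tokens)
    gradients:  gradients of the target class score w.r.t. each attention
                map, same shapes as `attentions`
    Returns a (num_tokens, num_tokens) rollout matrix.
    """
    num_tokens = attentions[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for attn, grad in zip(attentions, gradients):
        # Weight each attention entry by its gradient and clamp negative
        # contributions to zero (the negative clamping step), then average heads.
        weighted = torch.clamp(grad * attn, min=0).mean(dim=0)
        # Mix in the identity matrix to account for residual connections.
        weighted = 0.5 * weighted + 0.5 * torch.eye(num_tokens)
        rollout = weighted @ rollout
    return rollout
```

Because the weighting uses per-class gradients, the resulting map changes with the target class, which is what makes the attention maps class-dependent.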
The role of the p-ADL the authors designed, and how it differs from ADL
The p-ADL acts as a regularizer and enhances the localization capability of the model. Unlike ADL, which operates on feature maps, p-ADL operates on the patch embeddings, including the class token embedding.
p-ADL procedure: average the patch embeddings along the embedding dimension to obtain the mean attention map; one branch then thresholds it to obtain the patch drop mask, while the other applies a sigmoid activation to obtain the patch importance map
Both these components operate on the mean attention map, which is computed by taking the mean over the embedding dimension. The patch drop mask is created by dropping the most activated mean patch embeddings based on a drop threshold (λ) parameter. The patch importance map is calculated by normalizing the mean attention map using a sigmoid activation, which denotes the importance of patches in an image.
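A minimal sketch of this step as I read it (my own reconstruction, not the authors' implementation; the random choice between the two branches during training is an assumption borrowed from how ADL behaves, and all names are hypothetical):

```python
import torch

def p_adl(patch_embeddings, drop_threshold=0.9, training=True):
    """Sketch: patch-based attention dropout on patch (and class-token) embeddings.

    patch_embeddings: (batch, num_tokens, embed_dim), class token included.
    drop_threshold:   lambda; patches whose mean activation exceeds
                      lambda * max activation are dropped.
    """
    # Mean attention map: average over the embedding dimension.
    mean_attn = patch_embeddings.mean(dim=-1)                 # (batch, num_tokens)

    # Patch importance map: sigmoid-normalized mean attention map.
    importance_map = torch.sigmoid(mean_attn)                  # (batch, num_tokens)

    # Patch drop mask: zero out the most activated patches.
    max_per_sample = mean_attn.max(dim=1, keepdim=True).values
    drop_mask = (mean_attn < drop_threshold * max_per_sample).float()

    # Assumption: during training one of the two branches is applied at random,
    # so the model sometimes loses its most discriminative patches.
    if training and torch.rand(1).item() < 0.5:
        scaling = drop_mask
    else:
        scaling = importance_map
    return patch_embeddings * scaling.unsqueeze(-1)
```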
Why GAR was designed
Rollout adds an identity matrix I to the attention matrix at each layer to account for residual connections [13] in the network. However, this method assumes that attentions are linearly combined. As stated by Chefer et al. [7], this overlooks the fact that the GELU [14] activation is used in all intermediate layers. An ill effect of this oversight is that it fails to distinguish between positive and negative contributions to the final attention map.
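For reference, plain attention rollout (the baseline criticized here) can be sketched as below; it just mixes each layer's head-averaged attention with the identity matrix and multiplies the layers together, so no gradient or sign information enters the map. This is a sketch under the usual rollout conventions, not code from either paper.

```python
import torch

def attention_rollout(attentions):
    """Sketch: plain attention rollout, no gradients and no clamping.

    attentions: list of per-layer attention maps of shape
                (num_heads, num_tokens, num_tokens)
    """
    num_tokens = attentions[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for attn in attentions:
        attn = attn.mean(dim=0)                          # average over heads
        attn = 0.5 * attn + 0.5 * torch.eye(num_tokens)  # identity for residuals
        attn = attn / attn.sum(dim=-1, keepdim=True)     # re-normalize rows
        rollout = attn @ rollout
    return rollout
```

Comparing this with the GAR sketch above makes the difference concrete: rollout treats every attention entry as a positive, class-independent contribution, while GAR re-weights entries by per-class gradients and clamps negative ones to zero.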