The three problems raised by the authors
Common challenges that image classification models encounter when localizing objects are: (a) they tend to look at the most discriminative features in an image, which confines the localization map to a very small region; (b) the localization maps are class-agnostic, and the models highlight objects of multiple classes in the same image; and (c) the localization performance is affected by background noise.
The authors give visual examples for each of the three problems
To alleviate the problems above, the authors use a Transformer. Why a Transformer? Is there further analysis later in the paper?
To alleviate the above challenges, we leverage vision transformer architectures [10] to localize objects in an image.
A rough summary of what the authors did
In our work, we introduce the following simple modifications to the vision transformer.
Describes the components of the authors' proposed patch-based Attention Dropout Layer (p-ADL) and what they do
The two components of the p-ADL layer are the patch importance map and the patch drop mask. During the course of training, these components are utilized to highlight informative patches and drop discriminative patches, respectively, to balance the classification vs. localization performance of the model.
Describes the role of the authors' proposed Grad Attention Rollout (GAR) module
We introduce a weighted attention rollout mechanism using the gradients computed at each attention map. We refer to this mechanism as Grad Attention Rollout. This post-hoc method, in combination with the p-ADL, enables the model to quantify the positive and negative contributions in the attention map for each class. This guides the model to generate class-dependent attention maps, and a negative clamping operation in GAR further suppresses the effect of background noise in the attention map.
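My minimal sketch of what such a gradient-weighted rollout could look like, reconstructed from the description above (not the authors' code; the function name, tensor shapes, and the 0.5/0.5 identity mixing are assumptions carried over from standard rollout):

```python
import torch

def grad_attention_rollout(attentions, gradients):
    """Sketch: gradient-weighted attention rollout with negative clamping.

    attentions: list of per-layer attention maps, each of shape
                (num_heads, num_tokens, num_tokens)
    gradients:  gradients of the target class score w.r.t. each attention
                map, same shapes as `attentions`
    Returns a (num_tokens, num_tokens) rollout matrix.
    """
    num_tokens = attentions[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for attn, grad in zip(attentions, gradients):
        # Weight each attention entry by its gradient and clamp negative
        # contributions to zero (the negative clamping step), then average heads.
        weighted = torch.clamp(grad * attn, min=0).mean(dim=0)
        # Mix in the identity matrix to account for residual connections.
        weighted = 0.5 * weighted + 0.5 * torch.eye(num_tokens)
        rollout = weighted @ rollout
    return rollout
```

Because the weighting uses per-class gradients, the resulting map changes with the target class, which is what makes the attention maps class-dependent.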
The role of the p-ADL the authors designed, and how it differs from ADL
The p-ADL acts as a regularizer and enhances the localization capability of the model. Unlike ADL, which operates on feature maps, p-ADL operates on the patch embeddings, including the class token embedding.
p-ADL procedure: average the patch embeddings along the embedding dimension to obtain the mean attention map; one branch then thresholds it to obtain the patch drop mask, while the other applies a sigmoid activation to obtain the patch importance map
Both these components operate on the mean attention map, which is computed by taking the mean over the embedding dimension. The patch drop mask is created by dropping the most activated mean patch embeddings based on a drop threshold (λ) parameter. The patch importance map is calculated by normalizing the mean attention map using a sigmoid activation, which denotes the importance of patches in an image.
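A minimal sketch of this step as I read it (my own reconstruction, not the authors' implementation; the random choice between the two branches during training is an assumption borrowed from how ADL behaves, and all names are hypothetical):

```python
import torch

def p_adl(patch_embeddings, drop_threshold=0.9, training=True):
    """Sketch: patch-based attention dropout on patch (and class-token) embeddings.

    patch_embeddings: (batch, num_tokens, embed_dim), class token included.
    drop_threshold:   lambda; patches whose mean activation exceeds
                      lambda * max activation are dropped.
    """
    # Mean attention map: average over the embedding dimension.
    mean_attn = patch_embeddings.mean(dim=-1)                 # (batch, num_tokens)

    # Patch importance map: sigmoid-normalized mean attention map.
    importance_map = torch.sigmoid(mean_attn)                  # (batch, num_tokens)

    # Patch drop mask: zero out the most activated patches.
    max_per_sample = mean_attn.max(dim=1, keepdim=True).values
    drop_mask = (mean_attn < drop_threshold * max_per_sample).float()

    # Assumption: during training one of the two branches is applied at random,
    # so the model sometimes loses its most discriminative patches.
    if training and torch.rand(1).item() < 0.5:
        scaling = drop_mask
    else:
        scaling = importance_map
    return patch_embeddings * scaling.unsqueeze(-1)
```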
Why GAR was designed
Rollout adds an identity matrix I to the attention matrix at each layer to account for residual connections [13] in the network. However, this method assumes that attentions are linearly combined. As stated by Chefer et al. [7], this overlooks the fact that the GELU [14] activation is used in all intermediate layers. An ill effect of this oversight is that it fails to distinguish between positive and negative contributions to the final attention map.
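For reference, plain attention rollout (the baseline criticized here) can be sketched as below; it just mixes each layer's head-averaged attention with the identity matrix and multiplies the layers together, so no gradient or sign information enters the map. This is a sketch under the usual rollout conventions, not code from either paper.

```python
import torch

def attention_rollout(attentions):
    """Sketch: plain attention rollout, no gradients and no clamping.

    attentions: list of per-layer attention maps of shape
                (num_heads, num_tokens, num_tokens)
    """
    num_tokens = attentions[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for attn in attentions:
        attn = attn.mean(dim=0)                          # average over heads
        attn = 0.5 * attn + 0.5 * torch.eye(num_tokens)  # identity for residuals
        attn = attn / attn.sum(dim=-1, keepdim=True)     # re-normalize rows
        rollout = attn @ rollout
    return rollout
```

Comparing this with the GAR sketch above makes the difference concrete: rollout treats every attention entry as a positive, class-independent contribution, while GAR re-weights entries by per-class gradients and clamps negative ones to zero.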