Vivim: a Video Vision Mamba for Medical Video Object Segmentation
1 The Hong Kong University of Science and Technology (Guangzhou)
2 The Hong Kong University of Science and Technology
[email protected]
1 Introduction
Fig. 1. Overview of the proposed Vivim. "FFN" denotes the feed-forward layer.
the temporal information and integrate it with the spatial counterpart. However, the quadratic complexity of self-attention impedes its application to video scenarios: the number of tokens in long video sequences grows dramatically, imposing a heavy computational burden when MSA is used to model spatial and temporal information simultaneously [1].
Very recently, to address this bottleneck in long-sequence modeling, Mamba [3], inspired by state space models (SSMs) [7], has been developed. Its core idea is to capture long-range dependencies effectively while improving training and inference efficiency through a selection mechanism and a hardware-aware algorithm. Building on this, U-Mamba [10] designed a hybrid CNN-SSM block, composed mainly of Mamba modules, to handle long sequences in biomedical image segmentation. Vision Mamba [20] provided a new generic vision backbone with bidirectional Mamba blocks for image classification and semantic segmentation. As these works suggest, the self-attention module is not indispensable and can be replaced by Mamba to achieve efficient visual representation learning on long sequences. This naturally offers an effective way to explore long-term temporal dependencies in video scenarios.
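For intuition, a (non-selective) state-space layer processes a length-L sequence with a single recurrence, so its cost grows linearly in L, unlike the quadratic cost of self-attention. The sketch below shows only this generic discretized SSM recurrence to make the complexity argument concrete; Mamba additionally makes the SSM parameters input-dependent and uses a hardware-aware scan, which is not reproduced here.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Generic discretized SSM: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    x: (L, d_in) input sequence; A: (n, n); B: (n, d_in); C: (d_out, n).
    A single pass over the sequence -> cost linear in L (vs. quadratic for MSA).
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                       # linear scan over time steps
        h = A @ h + B @ x_t             # state update
        ys.append(C @ h)                # readout
    return np.stack(ys)                 # (L, d_out)

# Toy usage: a 1024-step sequence costs 1024 state updates, not 1024^2 pairwise interactions.
y = ssm_scan(np.random.randn(1024, 4), 0.9 * np.eye(8),
             np.random.randn(8, 4), np.random.randn(2, 8))
```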
Inspired by this, we propose a novel framework, named Vivim, that integrates Mamba into a multi-level transformer architecture, transforming a video clip into a single feature sequence containing spatiotemporal information at each scale. Vivim is designed to exploit the temporal dependency across frames to improve segmentation results at a lower computational cost than other video-based methods. To the best of our knowledge, this is the first work to incorporate Mamba into video object segmentation, enabling both faster inference and better performance. Drawing inspiration from the architecture of the modern transformer block, we present a novel Temporal Mamba Block. A hierarchical Mamba encoder consisting of multiple Temporal Mamba Blocks is introduced to investigate spatial and temporal dependency jointly at various scales. The Mamba block is integrated into each scale of the encoder, replacing the self-attention or window-attention module to achieve efficient visual representation learning. In addition, we leverage a lightweight CNN-based decoder head to integrate the multi-level feature sequences and predict segmentation masks. Experiments on breast ultrasound (US) videos demonstrate the effectiveness of our framework Vivim.
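To make the overall design concrete, the following is a minimal PyTorch sketch of a hierarchical video encoder with a lightweight fusion decoder head. The stage widths, the stand-in block (a plain convolution in place of a Temporal Mamba Block), and the decoder wiring are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVivim(nn.Module):
    """Illustrative hierarchical encoder + lightweight decoder head (not the authors' code)."""
    def __init__(self, stage_channels=(32, 64, 128, 256), num_classes=1):
        super().__init__()
        chans = (3,) + tuple(stage_channels)
        # Each stage: strided patch embedding followed by a stand-in spatiotemporal block.
        self.patch_embeds = nn.ModuleList(
            nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, stride=2, padding=1)
            for i in range(len(stage_channels)))
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.GELU())  # placeholder for a Temporal Mamba Block
            for c in stage_channels)
        # Decoder head: project every scale to a common width, fuse, predict masks.
        self.proj = nn.ModuleList(nn.Conv2d(c, 64, 1) for c in stage_channels)
        self.fuse = nn.Conv2d(64 * len(stage_channels), num_classes, 1)

    def forward(self, clip):                     # clip: (B, T, 3, H, W)
        b, t, _, _, _ = clip.shape
        x = clip.flatten(0, 1)                   # fold frames into the batch: (B*T, 3, H, W)
        feats = []
        for embed, block in zip(self.patch_embeds, self.blocks):
            x = block(embed(x))
            feats.append(x)                      # multi-level feature sequences
        size = feats[0].shape[-2:]
        ups = [F.interpolate(p(f), size=size, mode="bilinear", align_corners=False)
               for p, f in zip(self.proj, feats)]
        logits = self.fuse(torch.cat(ups, dim=1))
        return logits.view(b, t, -1, *size)      # per-frame mask logits

masks = ToyVivim()(torch.randn(2, 5, 3, 256, 256))   # -> (2, 5, 1, 128, 128)
```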
2 Method
of tokens increases linearly with the number of input frames. Motivated by this, we develop a more efficient block, the Temporal Mamba Block, to model spatial and temporal information simultaneously.
As illustrated in Fig. 1, in the Temporal Mamba Block, a spatial self-attention module is first introduced to extract spatial information, followed by a Mix-FeedForward layer; we leverage the sequence reduction process introduced in [17,15] to improve its efficiency. For the $i$-th level feature embedding of the given video clip, $F_i \in \mathbb{R}^{T \times C_i \times H \times W}$, we first transpose the channel and temporal dimensions and apply a flattening operation to reshape the spatiotemporal feature embedding into a 1D long sequence $h_i \in \mathbb{R}^{C_i \times THW}$, enabling highly efficient sequential modeling with little inductive bias. The flattened sequence $h_i$ is then fed into stacked layers, each consisting of a Mamba module (Mamba) and a Detail-specific FeedForward (DSF). The Mamba module explores the correlation among patches of the input frames, while the Detail-specific FeedForward preserves fine-grained details via a depth-wise convolution with a kernel size of $3 \times 3 \times 3$. After the Mamba layers, the feature is restored to its original shape by the inverse operation. Finally, we employ overlapped patch merging to down-sample the feature embedding. The procedure in each Mamba layer is defined as:
$$
\begin{aligned}
h^{l-1} &= \phi(h^{l-1}), \\
\hat{h}^{l} &= \mathrm{Mamba}\!\left(\mathrm{LayerNorm}(h^{l-1})\right) + h^{l-1}, \\
h^{l} &= \mathrm{DSF}\!\left(\mathrm{LayerNorm}(\hat{h}^{l})\right) + \hat{h}^{l}, \\
h^{l} &= \phi^{-1}(h^{l}),
\end{aligned}
\tag{1}
$$
where $\phi$ denotes the transposition and flattening operation, $\phi^{-1}$ denotes its inverse, and $l \in [1, N_m]$. Note that, before feeding the sequence into the $(l+1)$-th Mamba layer, we apply $\phi$ along the reverse temporal order of the $l$-th layer to enable bidirectional modeling.
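The following is a minimal PyTorch sketch of one such layer as we read Eq. (1): flatten via $\phi$, a Mamba pass and a detail-specific feed-forward, each with a residual connection, then the inverse reshape. The `DetailSpecificFF` stand-in, the per-layer temporal flip, and the wiring to the public `mamba_ssm` package are our assumptions rather than the released code.

```python
import torch
import torch.nn as nn

class DetailSpecificFF(nn.Module):
    """Stand-in DSF: depth-wise 3x3x3 convolution over (T, H, W) to keep fine details."""
    def __init__(self, dim, t, h, w):
        super().__init__()
        self.t, self.h, self.w = t, h, w
        self.dw = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):                        # x: (B, C, T*H*W)
        b, c, _ = x.shape
        v = x.view(b, c, self.t, self.h, self.w)
        return self.dw(v).flatten(2)             # back to (B, C, T*H*W)

def temporal_mamba_layer(feat, mamba, dsf, norm1, norm2, reverse_time=False):
    """One layer of Eq. (1): phi -> Mamba -> DSF -> inverse phi.  feat: (B, T, C, H, W)."""
    b, t, c, h, w = feat.shape
    if reverse_time:                             # alternate the scan direction per layer
        feat = feat.flip(1)                      # (bidirectional temporal modeling)
    x = feat.transpose(1, 2).flatten(2)          # phi: transpose C/T, flatten -> (B, C, T*H*W)
    seq = x.transpose(1, 2)                      # Mamba expects (B, L, C)
    x = x + mamba(norm1(seq)).transpose(1, 2)              # h_hat = Mamba(LN(h)) + h
    x = x + dsf(norm2(x.transpose(1, 2)).transpose(1, 2))  # h = DSF(LN(h_hat)) + h_hat
    out = x.view(b, c, t, h, w).transpose(1, 2)  # inverse phi: back to (B, T, C, H, W)
    return out.flip(1) if reverse_time else out

# Example wiring (C = channel width), assuming the public `mamba_ssm` package:
#   from mamba_ssm import Mamba
#   out = temporal_mamba_layer(feat, Mamba(d_model=C), DetailSpecificFF(C, T, H, W),
#                              nn.LayerNorm(C), nn.LayerNorm(C))
```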
2.2 Decoder
3 Experiments
3.1 Dataset and Implementation
The breast ultrasound (US) dataset consists of 63 video sequences, one per patient, containing 4619 frames annotated with pixel-level ground truth by experts. The videos were captured using various US devices, with spatial resolutions ranging from 580×600 to 600×800. Following [8], the video sequences were further cropped to a spatial resolution of 300×200. For quantitative comparison, we adopt several commonly used segmentation metrics, including the Jaccard similarity coefficient (Jaccard), Dice similarity coefficient (Dice), Precision, and Recall; for their precise definitions, please refer to [16]. We also report inference speed as the number of frames per second (FPS). We follow the official splits for training and testing.
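For reference, the sketch below shows the standard pixel-wise definitions of these metrics that we assume here; the exact evaluation protocol follows [16].

```python
import numpy as np

def binary_seg_metrics(pred, gt, eps=1e-7):
    """Jaccard, Dice, Precision and Recall for binary masks of shape (H, W) with values {0, 1}."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()          # true positives
    fp = np.logical_and(pred, ~gt).sum()         # false positives
    fn = np.logical_and(~pred, gt).sum()         # false negatives
    return {
        "Jaccard":   tp / (tp + fp + fn + eps),
        "Dice":      2 * tp / (2 * tp + fp + fn + eps),
        "Precision": tp / (tp + fp + eps),
        "Recall":    tp / (tp + fn + eps),
    }
```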
The proposed framework was implemented in PyTorch and trained on a single NVIDIA RTX 3090 GPU. The network is trained end-to-end for 100 epochs with the Adam optimizer; the initial learning rate is set to $1 \times 10^{-4}$ and decayed to $1 \times 10^{-6}$. During training, we resize the video frames to $256 \times 256$ and feed a batch of 8 video clips, each containing 5 frames, into the network at each iteration. We use cross-entropy loss and IoU loss as the objective function of our method.
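The paper does not specify how the two loss terms are weighted; the sketch below assumes a binary (lesion vs. background) formulation and an unweighted sum of binary cross-entropy and a soft IoU term.

```python
import torch
import torch.nn.functional as F

def ce_iou_loss(logits, target, eps=1e-7):
    """Binary cross-entropy + soft IoU loss; logits and float {0,1} target share shape (B, 1, H, W)."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(-2, -1))
    union = (prob + target - prob * target).sum(dim=(-2, -1))
    iou = 1.0 - (inter + eps) / (union + eps)    # soft IoU term per mask
    return bce + iou.mean()
```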
We compare our Vivim against the state-of-the-art image segmentation method SETR [18] and video object segmentation methods (OSVOS [12], ViViT [1], STM [11], AFB-URR [9], DPSTT [8]). For fairness, we reproduce these methods using their publicly available code, following [8]. Video-based methods generally outperform image-based ones, suggesting that exploiting temporal information offers significant advantages for segmenting breast lesions in ultrasound videos. More importantly, among all image-based and video-based segmentation methods, our Vivim achieves the highest scores on every metric. In particular, Vivim has the best runtime (FPS) among all video-based methods, demonstrating that our solution learns spatial and temporal cues simultaneously and efficiently, and achieves significant improvements over transformer-based methods such as SETR and ViViT.
4 Conclusion
References
1. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: ViViT: A video vision transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6836–6846 (2021)
2. Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.: TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021)
3. Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2023)
4. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2961–2969 (2017)
5. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
6. Huang, Q., Huang, Y., Luo, Y., Yuan, F., Li, X.: Segmentation of breast ultrasound image with semantic classification of superpixels. Medical Image Analysis 61, 101657 (2020)
7. Kalman, R.E.: A new approach to linear filtering and prediction problems (1960)
8. Li, J., Zheng, Q., Li, M., Liu, P., Wang, Q., Sun, L., Zhu, L.: Rethinking breast lesion segmentation in ultrasound: A new video dataset and a baseline network. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 391–400. Springer (2022)
9. Liang, Y., Li, X., Jafari, N., Chen, J.: Video object segmentation with adaptive feature bank and uncertain-region refinement. Advances in Neural Information Processing Systems 33, 3430–3441 (2020)
10. Ma, J., Li, F., Wang, B.: U-Mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722 (2024)
11. Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J.: Video object segmentation using space-time memory networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9226–9235 (2019)
12. Perazzi, F., Khoreva, A., Benenson, R., Schiele, B., Sorkine-Hornung, A.: Learning video object segmentation from static images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2663–2672 (2017)
13. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)
14. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
15. Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 568–578 (2021)
16. Wang, Y., Deng, Z., Hu, X., Zhu, L., Yang, X., Xu, X., Heng, P.A., Ni, D.: Deep attentional features for prostate segmentation in ultrasound. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, Part IV. pp. 523–530. Springer (2018)
17. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems 34, 12077–12090 (2021)
18. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6881–6890 (2021)
19. Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J.: UNet++: A nested U-Net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 3–11. Springer (2018)
20. Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., Wang, X.: Vision Mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417 (2024)