Neuron-Based Spiking Transmission and Reasoning Network for Robust Image-Text Retrieval

Wenrui Li, Zhengyu Ma, Liang-Jian Deng, Xiaopeng Fan, Yonghong Tian
DOI: https://doi.org/10.1109/tcsvt.2022.3233042
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Most image-text retrieval methods achieve accurate results by using fine-grained features for feature alignment. However, extracting robust features while maintaining retrieval accuracy in wireless communication remains a challenge, especially under channel noise and limited transmission bandwidth. Inspired by the spike signals of neurons in the human brain, we propose the neuron-based spiking transmission and reasoning network (NSTRN), which compresses features into compact, efficient representations. In NSTRN, we construct the feature sender based on a spiking activation function to selectively encode only the important information in images and sentences into binary codes, reducing the transmission cost. Moreover, the feature receiver is designed as a recurrent architecture and applies both temporal attention and global attention blocks to memorize long-term information. Finally, to compensate for the loss of visual concepts in transmission, we use the global textual features as coefficients to guide the formation of visual features in the training stage. Traditional CNN-based joint source-channel coding models output floating-point encoded features, which require additional quantization steps to convert the features into binary bitstreams in a practical wireless communication system. In contrast, spiking neural networks (SNNs) directly use binary spike trains, reducing the computational complexity caused by the quantization steps. More importantly, SNNs can naturally encode asynchronous event streams and inhibit discrete noisy events to extract robust information. Even with binary bitstreams, NSTRN is effective compared with state-of-the-art image-text retrieval methods. In the wireless communication scenario, NSTRN not only reduces the transmission bandwidth but also alleviates, to a certain extent, the "cliff effect" of traditional separate encoding methods.
To the best of our knowledge, this is the first work using SNNs on robust image-text retrieval.
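To make the idea of a spiking feature sender concrete, the following is a minimal illustrative sketch, not the NSTRN implementation. It assumes a leaky integrate-and-fire neuron with a hard reset as the spiking activation and a binary symmetric channel as the noise model; neither choice is specified in the abstract. The function names (lif_encode, binary_symmetric_channel, firing_rate) and all parameter values are hypothetical.

```python
import random


def lif_encode(features, timesteps=4, threshold=1.0, decay=0.5):
    """Encode real-valued features into binary spike trains.

    Assumption: a leaky integrate-and-fire neuron with hard reset.
    Returns a list of `timesteps` rows, one bit per feature in each row.
    """
    potentials = [0.0] * len(features)
    spikes = []
    for _ in range(timesteps):
        step = []
        for i, x in enumerate(features):
            # Leak the previous membrane potential, then integrate the input.
            potentials[i] = decay * potentials[i] + x
            if potentials[i] >= threshold:
                step.append(1)
                potentials[i] = 0.0  # hard reset after a spike
            else:
                step.append(0)
        spikes.append(step)
    return spikes


def binary_symmetric_channel(spikes, flip_prob, rng):
    """Flip each transmitted bit independently with probability flip_prob.

    Assumption: a simple stand-in for channel noise.
    """
    return [
        [bit ^ 1 if rng.random() < flip_prob else bit for bit in step]
        for step in spikes
    ]


def firing_rate(spikes):
    """Decode each feature as the fraction of timesteps in which it spiked."""
    timesteps = len(spikes)
    return [
        sum(step[i] for step in spikes) / timesteps
        for i in range(len(spikes[0]))
    ]


if __name__ == "__main__":
    features = [0.1, 0.6, 1.2]          # weak, moderate, and strong activations
    sent = lif_encode(features)
    received = binary_symmetric_channel(sent, flip_prob=0.05, rng=random.Random(0))
    print("sent bits:    ", sent)
    print("received bits:", received)
    print("decoded rates:", firing_rate(received))
```

In this sketch the weak feature never crosses the threshold and contributes no spikes, which mirrors the idea of selectively encoding only important information. The payload is already binary, so no separate quantization step is needed before transmission.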