End-to-end multimodal image registration via reinforcement learning

Jing Hu, Ziwei Luo, Xin Wang, Shanhui Sun, Youbing Yin, Kunlin Cao, Qi Song, Siwei Lyu, Xi Wu
DOI: https://doi.org/10.1016/j.media.2020.101878
IF: 10.9
2021-02-01
Medical Image Analysis
Abstract: Multimodal image registration is a vital initial step in several medical image applications for providing complementary information from different data modalities. Since images with different modalities do not exhibit the same characteristics, finding their accurate correspondences remains a challenge. For convolutional multimodal registration methods, two components are quite significant: descriptive image feature as well as the suited similarity metric. However, these two components are often custom-designed and are infeasible to the high diversity of tissue appearance across modalities. In this paper, we translate image registration into a decision-making problem, where registration is achieved via an artificial agent trained by asynchronous reinforcement learning. More specifically, convolutional long-short-term-memory is incorporated after stacked convolutional layers in this method to extract spatial-temporal image features and learn the similarity metric implicitly. A customized reward function driven by landmark error is advocated to guide the agent to the correct registration direction. A Monte Carlo rollout strategy is also leveraged to perform as a look-ahead inference in the testing stage, to increase registration accuracy further. Experiments on paired CT and MR images of patients diagnosed as nasopharyngeal carcinoma demonstrate that our method achieves state-of-the-art performance in medical image registration.
engineering, biomedical; computer science, interdisciplinary applications, artificial intelligence; radiology, nuclear medicine & medical imaging
What problem does this paper attempt to address?
This paper addresses multimodal medical image registration. Specifically, it aims to perform end-to-end multimodal registration within a reinforcement learning (RL) framework, overcoming the limitations of traditional methods in feature extraction and similarity metric definition. Because images of different modalities differ significantly in structure and appearance, finding the accurate correspondence between them is difficult. The proposed method uses convolutional neural networks (CNN) and convolutional long-short-term memory networks (ConvLSTM) to extract spatio-temporal features, and a custom-defined reward function guides the agent toward correct registration actions, yielding high-precision image registration.

### Main Contributions

1. **A new reinforcement learning framework**: The framework combines a policy network and a value network and learns the perception-action cycle from scratch, without pre-trained convolutional features.
2. **A reward function based on landmark error**: This reward sidesteps the problem of inconsistent units across transformation parameters and promotes stable convergence of the model.
3. **A Monte Carlo look-ahead strategy**: A Monte Carlo rollout provides look-ahead guidance in the testing phase, which overcomes the problem of unknown termination states and further improves the accuracy and stability of prediction.

### Method Overview

- **State Representation**: The fixed image \(I_f\) and the moving image \(I_m\) are resampled to the same size (168×168), and the state \(s_t\) is represented by a 3D tensor composed of these two images.
- **Action Space**: The action space is discretized, allowing the agent to freely explore the entire registration parameter space. It contains 8 candidate transformations, corresponding to changes of ±1 pixel, ±1° and ±0.05 for translation, rotation and scaling respectively.
- **Reward Function**: The reward is based on the Euclidean distance between landmark points and measures the improvement after the agent selects a specific action. If the distance falls below the threshold \(\tau\), the termination state is considered reached and a high reward is given.

### Model Structure

- **Deep Actor-Critic Network**: This network simultaneously maintains the policy function \(\pi(\cdot|s_t; \theta)\) and the value function \(V(s_t; \theta_v)\). The policy function selects actions according to the current state, and the value function estimates the value of the current state.
- **Convolutional Neural Network and Convolutional Long-Short-Term Memory Network**: The CNN extracts short-term local spatial features, while the ConvLSTM both captures inter-frame changes and extracts long-term spatial features, making full use of spatio-temporal redundant information.

### Training Protocol

- **Asynchronous Advantage Actor-Critic Algorithm (A3C)**: Multiple agents are associated with different environments and update the policy asynchronously. Each agent starts from a pair of unaligned images and acts until either the termination state or the maximum episode length is reached.

With these components, the proposed method achieves state-of-the-art performance on clinical datasets, demonstrating strong capabilities in multimodal medical image registration tasks.
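
The action space and landmark-based reward summarized under Method Overview can be illustrated with a minimal sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: the eight actions are taken to be ±1 pixel along x, ±1 pixel along y, ±1° rotation and ±0.05 scale; the transform is applied about the centre of a 168×168 image; the reward is the decrease in mean landmark distance plus a terminal bonus; and the names `ACTIONS`, `apply_action`, `step_reward`, as well as the values of `tau` and `bonus`, are hypothetical.

```python
import numpy as np

# Eight discrete actions as (dx, dy, d_theta_degrees, d_scale).
# Assumption: translation is split into x and y to reach eight actions.
ACTIONS = [
    (+1.0, 0.0, 0.0, 0.0), (-1.0, 0.0, 0.0, 0.0),    # translate x by +/-1 pixel
    (0.0, +1.0, 0.0, 0.0), (0.0, -1.0, 0.0, 0.0),    # translate y by +/-1 pixel
    (0.0, 0.0, +1.0, 0.0), (0.0, 0.0, -1.0, 0.0),    # rotate by +/-1 degree
    (0.0, 0.0, 0.0, +0.05), (0.0, 0.0, 0.0, -0.05),  # scale by +/-0.05
]


def apply_action(params, action_index):
    """Return the updated (tx, ty, theta_deg, scale) after one discrete step."""
    return tuple(p + d for p, d in zip(params, ACTIONS[action_index]))


def transform_points(points, params, center=(84.0, 84.0)):
    """Apply a similarity transform about `center` to an (N, 2) point array."""
    tx, ty, theta_deg, scale = params
    theta = np.deg2rad(theta_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]])
    c = np.asarray(center, dtype=float)
    pts = np.asarray(points, dtype=float)
    return scale * (pts - c) @ rot.T + c + np.array([tx, ty])


def landmark_error(fixed_pts, moving_pts, params):
    """Mean Euclidean distance between fixed and transformed moving landmarks."""
    diff = np.asarray(fixed_pts, dtype=float) - transform_points(moving_pts, params)
    return float(np.mean(np.linalg.norm(diff, axis=1)))


def step_reward(fixed_pts, moving_pts, old_params, new_params, tau=1.0, bonus=10.0):
    """Reward the reduction in landmark error; add a bonus once error < tau.

    `tau` and `bonus` are placeholder values, not taken from the paper.
    """
    d_old = landmark_error(fixed_pts, moving_pts, old_params)
    d_new = landmark_error(fixed_pts, moving_pts, new_params)
    done = d_new < tau
    reward = (d_old - d_new) + (bonus if done else 0.0)
    return reward, done
```

Because the reward is measured in landmark distance (pixels) rather than in raw parameter changes, a 1° rotation and a 1-pixel translation are scored on the same scale, which is the unit-consistency point made in the contributions list.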