GPU Technology is the Hope for Near Real‐time Monte Carlo Dose Calculations
Xun Jia,X George Xu,Colin G Orton
DOI: https://doi.org/10.1118/1.4903901
IF: 4.506
2015-01-01
Medical Physics
Abstract: Medical Physics, Volume 42, Issue 4, p. 1474-1476. Point/Counterpoint. Xun Jia, Ph.D., Department of Radiation Oncology, The University of Texas Southwestern Medical Center, Dallas, Texas 75390 (Tel: 214-648-3224; E-mail: [email protected]); X. George Xu, Ph.D., Nuclear Engineering Program, Rensselaer Polytechnic Institute, Troy, New York 12180 (Tel: 518-276-4014; E-mail: [email protected]); Colin G. Orton, Ph.D., Moderator. First published: 11 March 2015. Citations: 14.
OVERVIEW Monte Carlo (MC) dose calculations are recognized as being the most accurate modality for radiotherapy treatment planning but, because of the excessive computational time required, they cannot presently be used for near real-time dose calculations.
Currently, the most common way to accelerate MC dose calculations is to use clusters of central processing units (CPUs), but some believe that the future of near real-time MC dose calculations lies not with clusters of CPUs but with the use of graphics processing unit (GPU) technology. This is the claim debated in this month's Point/Counterpoint. Arguing for the Proposition is Xun Jia, Ph.D. Dr. Jia received his Master's degree in Applied Mathematics and Ph.D. degree in Physics, both from UCLA. He is currently an Assistant Professor in the Department of Radiation Oncology, University of Texas Southwestern Medical Center. Dr. Jia's research focuses on GPU-based high-performance computing for medical physics and medical imaging. He has developed several Monte Carlo packages to improve efficiency for photon, electron, and proton transport. Dr. Jia's research has been supported by government and industrial grants and he has published 60 peer-reviewed papers. He is currently a section editor of the Journal of Applied Clinical Medical Physics. Arguing against the Proposition is X. George Xu, Ph.D. Dr. Xu obtained his Ph.D. in Nuclear Engineering from Texas A&M University, College Station, TX and, for the past 20 years, he has been on the faculty of Rensselaer Polytechnic Institute, Troy, NY, where he currently holds the Edward E. Hood Endowed Chair of Engineering. Dr. Xu's research has centered around applications of Monte Carlo methods to problems in radiation protection, imaging, and radiation therapy. He has been continuously funded by the NIH over the past ten years, including an R01 grant to develop a new Monte Carlo code, ARCHER, for heterogeneous computing involving GPUs and coprocessors. He is the author of more than 150 journal papers and book chapters, and 270 conference abstracts. Dr. Xu is a Fellow of the American Association of Physicists in Medicine, the Health Physics Society, and the American Nuclear Society.
In 2014, he was re-elected to a 6-yr term as a council member of the National Council on Radiation Protection and Measurements.
FOR THE PROPOSITION: Xun Jia, Ph.D.
Opening Statement
Clinical applications of MC dose calculations have been limited by the long computation time to achieve a sufficient precision level. Over the years, great efforts have been devoted to accelerating MC simulations. Recently, with the success of GPU-based high-performance computing [1,2], particularly for MC simulations, near real-time (e.g., seconds or subseconds) dose calculation is becoming feasible. Achieving this will not only facilitate its routine utilization, but also realize novel applications to advance radiotherapy practice, such as MC-based inverse treatment planning. To date, the computation time for a typical photon plan has been reduced to less than a minute with ∼1% uncertainty using only one GPU, and the speed can be further boosted with multiple GPUs by a factor proportional to the number of GPUs. Also reported are computation times as low as seconds to tens of seconds for different applications [3,4]. Notably, the group at UT Southwestern [5] has developed a GPU application to visualize an MC-reconstructed dose delivery process in almost real-time during beam delivery, with a refresh frequency of >10 Hz. These achievements have clearly demonstrated the potential of near real-time MC dose calculations. Besides advantages in speed, GPUs also hold other favorable features for clinical applications. First, GPUs are orders of magnitude lower in cost than a conventional high-performance-computing structure with a similar processing power. Second, GPUs are locally hosted and managed. This is particularly important for problems aiming at near real-time applications, since data-transfer and job-scheduling times cannot be neglected if the computation facility is remotely placed and shared by many users.
Patient privacy may also be a concern when transferring medical data to a remote facility. Of course, we cannot neglect the disadvantages of using GPUs for MC. Because the GPU is a new platform, codes must be redeveloped. However, the burdens of initial code development have been overcome to a large extent, and several packages have been successfully built. Efforts have also been initiated to write MC packages in OpenCL to increase portability [6]. While there are also technical issues hindering computational efficiency, e.g., thread divergence and memory writing conflicts, many solutions exist to remove or alleviate them [4,7]. I would also like to mention a strong competitor of the GPU, the Intel many integrated core (MIC) processor. What makes this particularly attractive is its x86 compatibility, which allows existing CPU codes to run with minor modification. However, just like for GPUs, substantial effort is needed to achieve optimal performance [8]. Simply running an existing code may not achieve high acceleration, because parallel-computing specific issues such as memory access and vectorization were not considered sufficiently in the conventional CPU code. As of today, there has been only limited study regarding MC dose calculations on MIC processors. While it holds the potential to improve efficiency, a lot of research is needed. In conclusion, GPU technology has the capability of substantially accelerating MC simulations. Its advantages and extensive research efforts demonstrate the hope for near real-time dose calculations.
AGAINST THE PROPOSITION: X. George Xu, Ph.D.
Opening Statement
Since the invention of computers in the 1940s, MC codes have been developed for nuclear engineering, high-energy physics, and, recently, medical physics applications.
However, most radiation treatment planning is done currently using dosimetry algorithms that are extremely fast, but only "approximately" correct [9]. Given the lasting interest in accelerating MC methods, the recent hype related to the GPU is not surprising. Originally marketed by NVIDIA as household devices, GPU-based game consoles offered amazingly fast graphics at an affordable price. It did not take long, however, for the scientific community to realize that these desktop toys were actually parallel computers. As summarized in two review papers [1,2], GPU adopters from the medical physics community wasted no time in reporting overwhelmingly positive experiences, including a dozen studies that focused specifically on MC dosimetry. Impressive, but inconsistent, "speedup factors" ranging from single digits to several hundreds were reported within months, sometimes by the same group. It has become a cliché to highlight how fast an MC-based dose calculation can be done with a GPU. Such results indeed attracted a lot of attention from medical physicists who are notoriously busy and seeking expediency. There are two strong indications that GPU technology is only hype and not the hope for near real-time, fully MC dose calculations. First, we have not seen any convincing evidence that the GPU is indeed better than traditional solutions for running MC dose calculations. Both of the above review papers [1,2] enjoyed referencing the rapidly increasing number of GPU-related journal articles, which only reinforces the concept of a "hype cycle." Furthermore, the authors of the GPU-accelerated MC studies obscure the issue by omitting details on how they compared GPU performance with traditional CPUs. CPU-based clusters are currently so cheap that one can assemble a desk-side 32-core cluster for about $3000US, the cost of a high-end CPU/GPU system.
Using software optimization schemes and hyperthreading, such a CPU cluster may achieve a speedup similar to the best reported for GPUs, without the painful process of rewriting the MC code for the GPU/compute unified device architecture (GPU/CUDA) environment. But few of the GPU enthusiasts optimized the CPU code in order to make fair performance comparisons. It has been observed that a lack of "fair comparison" measures is responsible for exaggerated GPU performance [10]. Second, competing technologies are mostly ignored by GPU adopters. Intel's Xeon Phi coprocessor, for example, which comes with 60 embedded Pentium cores, is capable of achieving a level of parallelism similar to that of GPUs [11–13]. Adopting the coprocessor is relatively easy and a large number of them are, in fact, used in Tianhe-2, the world's number-1 supercomputer. The "heterogeneous computing" era has just begun and it is uncertain which hardware (and software) technology will dominate the market [14]. The excitement brought by the GPU has reignited our interest in achieving real-time MC dose calculations and one should take full advantage of the research opportunities [15]. However, an inflated expectation can be counterproductive, especially when investing in a single technology that may be obsolete in ten years.
Rebuttal: Xun Jia, Ph.D.
I agree that variations in reported GPU-acceleration factors exist due to different degrees of software/hardware utilization and optimization. However, it is quite difficult, if not impossible, to conduct an absolutely fair comparison. For example, I would like to mention the software aspect that unfairly treats GPUs: Software optimization schemes, such as variance reduction techniques widely employed in CPU-based MC packages, have been barely explored for GPUs. The deterministic nature of such algorithms is expected to be particularly favorable for GPU's single-instruction-multiple-thread structure.
Yet it is absolute computational efficiency, rather than performance relative to CPUs, that determines the feasibility of near real-time MC calculations. The fact that a single GPU can already compute dose in seconds strongly supports this feasibility. Practicality should also be considered. While a low-end cluster with 4–8 computers may offer high speed, it is more advantageous in a clinical environment to use GPU-enabled computers in terms of energy efficiency, ease of management, etc. The utilization of GPUs in scientific computing is absolutely more than hype. Among the world's top 500 supercomputers, 46 of them use GPU-based coprocessors compared to only 17 systems with MIC coprocessors. A few major vendors in radiotherapy, e.g., RaySearch and Elekta, already employ GPUs in their products. I agree that multiple options are available to substantially accelerate MC in this era of booming technology. Intel MIC is a great example. Nonetheless, it too may be hype which only emphasizes the ease of programmability based on existing CPU codes but hides the required efforts of performance tuning. There is probably no single technology that is undoubtedly better than others. However, based on the overall consideration of GPU's advantages and developments so far, I believe that GPU technology is the hope for near real-time MC dose calculations.
Rebuttal: X. George Xu, Ph.D.
I agree with Dr. Jia that the capability of real-time MC dose calculations is within reach owing largely to the innovative technology and marketing strategies by NVIDIA. The greatest roadblock to GPU is the fact that the effort to translate legacy MC codes to the new CUDA programming environment is prohibitively expensive. GPU also faces tough technological challenges, including limited memory and data bandwidth [14]. Given the steep investment and market risk, for everyone to jump onto the GPU wagon is costly and unwise.
To CPU enthusiasts, multithreading techniques such as OpenMP and Pthreads are readily available for parallel computing. Intel CPUs come with hyperthreading for concurrent execution, and various compiler options can be used for optimization. As a competing architecture, Intel's MIC is much easier to adopt. To avoid "unfair comparison" between GPU and CPU [11], one should consider the above-mentioned software optimization techniques and pick a "multicore" CPU (instead of a "single-core") at a similar price to the GPU implementation. Comparative studies should also consider software related labor expenses. When we recently compared the performances of ARCHER, an MC dosimetry code developed from scratch by my Ph.D. students [11–13], on the CPU, GPU, and MIC platforms, we found that GPU's advantages as a dose engine are less dramatic than some of those reported in the literature. All things considered, traditional CPU clusters and MIC remain serious competitors to GPUs when energy efficiency is not the priority. In the next five years, all these technologies are expected to evolve rapidly. The potential waste of capital and human resources due to hype and misleading information should be avoided. To this end, peer-reviewed journal publication and grant application processes should emphasize balanced GPU studies that offer the best methodologies and practices to the medical physics community.
REFERENCES
1. X. Jia, P. Ziegenhein, and S. B. Jiang, "GPU-based high-performance computing for radiation therapy," Phys. Med. Biol. 59, R151–R182 (2014). doi:10.1088/0031-9155/59/4/R151
2. G. Pratx and L. Xing, "GPU computing in medical physics: A review," Med. Phys. 38, 2685–2697 (2011). doi:10.1118/1.3578605
3. S. Hissoiny, M. D'Amours, B. Ozell, P. Despres, and L. Beaulieu, "Sub-second high dose rate brachytherapy Monte Carlo dose calculations with bGPUMCD," Med. Phys. 39, 4559–4567 (2012). doi:10.1118/1.4730500
4. X. Jia, J. Schuemann, H. Paganetti, and S. B. Jiang, "GPU-based fast Monte Carlo dose calculation for proton therapy," Phys. Med. Biol. 57, 7783–7797 (2012). doi:10.1088/0031-9155/57/23/7783
5. F. Shi, X. Gu, Y. Graves, S. Jiang, and X. Jia, "A real-time virtual delivery system for photon radiotherapy delivery monitoring," Med. Phys. 41(6), 432 (2014). doi:10.1118/1.4889184
6. Khronos OpenCL Working Group, "The open standard for parallel programming of heterogeneous systems" (2013), available at: https://www.khronos.org/opencl/.
7. S. Hissoiny, B. Ozell, H. Bouchard, and P. Despres, "GPUMCD: A new GPU-oriented Monte Carlo dose calculation platform," Med. Phys. 38, 754–764 (2011). doi:10.1118/1.3539725
8. D. Mackay, "Optimization and performance tuning for Intel® Xeon Phi™ coprocessors – Part 1: Optimization essentials" (2012), available at: https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization.
9. D. W. O. Rogers, "Fifty years of Monte Carlo simulations for medical physics," Phys. Med. Biol. 51, R287–R301 (2006). doi:10.1088/0031-9155/51/13/R17
10. V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey, "Debunking the 100X GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU," in Proceedings of the 37th Annual International Symposium on Computer Architecture (ACM, New York, NY, 2010), Vol. 38(3), pp. 451–460.
11. T. Liu, X. G. Xu, and C. D. Carothers, "Comparison of two accelerators for Monte Carlo radiation transport calculations, NVIDIA Tesla M2090 GPU and Intel Xeon Phi 3120 coprocessor: A case study for x-ray CT imaging dose calculation," in Joint International Conference on Supercomputing in Nuclear Applications and Monte Carlo (SNA + MC 2013), Paris, France, 27–31 October (EDP Sciences, Les Ulis, France, 2014).
12. L. Su, Y. M. Yang, B. Bednarz, E. Sterpin, X. Du, T. Liu, W. Ji, and X. G. Xu, "ARCHERRT – A photon-electron coupled Monte Carlo dose computing engine for GPU: Software development and application to helical tomotherapy," Med. Phys. 41, 071709 (13pp.) (2014). doi:10.1118/1.4884229
13. X. G. Xu, T. Liu, L. Su, X. Du, M. J. Riblett, W. Ji, D. Gu, C. D. Carothers, M. S. Shephard, F. B. Brown, M. K. Kalra, and B. Liu, "ARCHER, a new Monte Carlo software tool for emerging heterogeneous computing environments," in Joint International Conference on Supercomputing in Nuclear Applications and Monte Carlo (SNA + MC 2013), Paris, France, 27–31 October (EDP Sciences, Les Ulis, France, 2014).
14. B. R. Gaster, L. Howes, D. R. Kaeli, P. Mistry, and D. Schaa, Heterogeneous Computing with OpenCL, 2nd ed. (Elsevier, Inc., Waltham, MA, 2013).
15. T. Friedman, "Do believe the hype," New York Times, 2 November 2010, available at: http://www.nytimes.com/2010/11/03/opinion/03friedman.html?_r=0.
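The opening statement quotes about 1% uncertainty in under a minute on one GPU, with speed growing in proportion to the number of GPUs. A minimal sketch of the arithmetic behind such figures follows. It assumes the usual Monte Carlo behaviour that statistical uncertainty falls as 1/sqrt(N) with the number of histories N, and it assumes ideal linear scaling across devices. The function names and the throughput figure are hypothetical and are not taken from the article.

```python
import math


def histories_for_uncertainty(n_ref, sigma_ref, sigma_target):
    """Histories needed to reach sigma_target, given that n_ref histories
    produced sigma_ref, assuming uncertainty scales as 1/sqrt(N)."""
    return math.ceil(n_ref * (sigma_ref / sigma_target) ** 2)


def ideal_time_s(n_histories, rate_per_device, n_devices=1):
    """Wall time in seconds under ideal linear scaling across devices.
    rate_per_device is an assumed throughput in histories per second."""
    return n_histories / (rate_per_device * n_devices)


# Halving the uncertainty (2% to 1%) needs four times as many histories.
needed = histories_for_uncertainty(1_000_000, 2.0, 1.0)

# With an assumed 1e6 histories/s per device, two devices halve the time.
t_one = ideal_time_s(needed, 1_000_000, n_devices=1)
t_two = ideal_time_s(needed, 1_000_000, n_devices=2)
print(needed, t_one, t_two)
```

Real multi-GPU runs may fall short of the ideal curve because of data transfer and tally merging, which is part of what the two debaters disagree about.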
-
GPU-accelerated Monte Carlo convolution/superposition implementation for dose calculation
Bo Zhou,Cedric X Yu,Danny Z Chen,X Sharon Hu
DOI: https://doi.org/10.1118/1.3490083
Abstract:Purpose: Dose calculation is a key component in radiation treatment planning systems. Its performance and accuracy are crucial to the quality of treatment plans as emerging advanced radiation therapy technologies are exerting ever tighter constraints on dose calculation. A common practice is to choose either a deterministic method such as the convolution/superposition (CS) method for speed or a Monte Carlo (MC) method for accuracy. The goal of this work is to boost the performance of a hybrid Monte Carlo convolution/superposition (MCCS) method by devising a graphics processing unit (GPU) implementation so as to make the method practical for day-to-day usage. Methods: Although the MCCS algorithm combines the merits of MC fluence generation and CS fluence transport, it is still not fast enough to be used as a day-to-day planning tool. To alleviate the speed issue of MC algorithms, the authors adopted MCCS as their target method and implemented a GPU-based version. In order to fully utilize the GPU computing power, the MCCS algorithm is modified to match the GPU hardware architecture. The performance of the authors' GPU-based implementation on an Nvidia GTX260 card is compared to a multithreaded software implementation on a quad-core system. Results: A speedup in the range of 6.7-11.4x is observed for the clinical cases used. The less than 2% statistical fluctuation also indicates that the accuracy of the authors' GPU-based implementation is in good agreement with the results from the quad-core CPU implementation. Conclusions: This work shows that GPU is a feasible and cost-efficient solution compared to other alternatives such as using cluster machines or field-programmable gate arrays for satisfying the increasing demands on computation speed and accuracy of dose calculation. But there are also inherent limitations of using GPU for accelerating MC-type applications, which are also analyzed in detail in this article.
-
GPU-based fast Monte Carlo simulation for radiotherapy dose calculation
Xun Jia,Xuejun Gu,Yan Jiang Graves,Michael Folkerts,Steve B. Jiang
DOI: https://doi.org/10.1088/0031-9155/56/22/002
2011-07-18
Abstract: Monte Carlo (MC) simulation is commonly considered to be the most accurate dose calculation method in radiotherapy. However, its efficiency still requires improvement for many routine clinical applications. In this paper, we present our recent progress towards the development of a GPU-based MC dose calculation package, gDPM v2.0. It utilizes the parallel computation ability of a GPU to achieve high efficiency, while maintaining the same particle transport physics as in the original DPM code and hence the same level of simulation accuracy. In GPU computing, divergence of execution paths between threads can considerably reduce the efficiency. Since photons and electrons undergo different physics and hence attain different execution paths, we use a simulation scheme where photon transport and electron transport are separated to partially relieve the thread divergence issue. A high-performance random number generator and hardware linear interpolation are also utilized. We have also developed various components to handle fluence map and linac geometry, so that gDPM can be used to compute dose distributions for realistic IMRT or VMAT treatment plans. Our gDPM package is tested for its accuracy and efficiency in both phantoms and realistic patient cases. In all cases, the average relative uncertainties are less than 1%. A statistical t-test is performed and the dose difference between the CPU and the GPU results is found not statistically significant in over 96% of the high dose region and over 97% of the entire region. Speed-up factors of 69.1 to 87.2 have been observed using an NVIDIA Tesla C2050 GPU card against a 2.27 GHz Intel Xeon CPU processor. For realistic IMRT and VMAT plans, MC dose calculation can be completed with less than 1% standard deviation in 36.1 to 39.6 s using gDPM.
Medical Physics
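The gDPM abstract attributes part of its efficiency to separating photon and electron transport so that threads follow the same execution path. The toy cost model below illustrates why that can help. It is only a sketch under stated assumptions: threads run in lockstep groups of 32, a group pays for every distinct branch its members take, and the per-branch costs are made-up numbers. None of the names or values come from gDPM.

```python
WARP_SIZE = 32  # assumed lockstep group size

# Hypothetical per-step costs in arbitrary units; not taken from gDPM.
BRANCH_COST = {"photon": 1.0, "electron": 3.0}


def warp_cost(warp):
    """A lockstep group executes each distinct branch one after another,
    so its cost is the sum over the particle types it contains."""
    return sum(BRANCH_COST[kind] for kind in set(warp))


def total_cost(particles):
    """Split the particle list into lockstep groups and add up their costs."""
    warps = [particles[i:i + WARP_SIZE]
             for i in range(0, len(particles), WARP_SIZE)]
    return sum(warp_cost(w) for w in warps)


mixed = ["photon", "electron"] * 64   # 128 particles, types interleaved
separated = sorted(mixed)             # all electrons first, then all photons

print(total_cost(mixed))       # every group pays for both branches
print(total_cost(separated))   # each group pays for one branch only
```

In this model the interleaved queue costs twice as much as the separated one. Actual gains depend on the physics and the hardware, and the abstract itself says the divergence issue is only partially relieved.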
-
Fast on-site Monte Carlo tool for dose calculations in CT applications
Wei Chen,Daniel Kolditz,Marcel Beister,Robert Bohle,Willi A Kalender
DOI: https://doi.org/10.1118/1.4711748
Abstract: Purpose: Monte Carlo (MC) simulation is an established technique for dose calculation in diagnostic radiology. The major drawback is its high computational demand, which limits the possibility of usage in real-time applications. The aim of this study was to develop fast on-site computed tomography (CT) specific MC dose calculations by using a graphics processing unit (GPU) cluster. Methods: GPUs are powerful systems which are especially suited to problems that can be expressed as data-parallel computations. In MC simulations, each photon track is independent of the others; each launched photon can be mapped to one thread on the GPU, and thousands of threads are executed in parallel in order to achieve high performance. For further acceleration, the authors considered multiple GPUs. The total computation was divided into different parts which can be calculated in parallel on multiple devices. The GPU cluster is an MC calculation server which is connected to the CT scanner and computes 3D dose distributions on-site immediately after image reconstruction. To estimate the performance gain, the authors benchmarked dose calculation times on a 2.6 GHz Intel Xeon 5430 Quad core workstation equipped with two NVIDIA GeForce GTX 285 cards. The on-site calculation concept was demonstrated for clinical and preclinical datasets on CT scanners (multislice CT, flat-detector CT, and micro-CT) with varying geometry, spectra, and filtration. To validate the GPU-based MC algorithm, the authors measured dose values on a 64-slice CT system using calibrated ionization chambers and thermoluminescence dosimeters (TLDs) which were placed inside standard cylindrical polymethyl methacrylate (PMMA) phantoms. Results: The dose values and profiles obtained by GPU-based MC simulations were in the expected good agreement with computed tomography dose index (CTDI) measurements and reference TLD profiles with differences being less than 5%. For 10^9 photon histories simulated in a 256 × 256 × 12 voxel thorax dataset with voxel size of 1.36 × 1.36 × 3.00 mm^3, calculation times of about 70 and 24 min were necessary with single-core and multiple-core central processing unit (CPU) solutions, respectively. Using GPUs, the same MC calculations were performed in 1.27 min (single card) and 0.65 min (two cards) without a loss in quality. Simulations were thus sped up by factors up to 55 and 36 compared to single-core and multiple-core CPU, respectively. The performance scaled nearly linearly with the number of GPUs. Tests confirmed that the proposed GPU-based MC tool can be easily adapted to different types of CT scanners and used as a service provider for fast on-site dose calculations. Conclusions: The Monte Carlo software package provides fast on-site calculation of 3D dose distributions in the CT suite which makes it a practical tool for any type of CT-specific application.
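The abstract above relies on photon histories being independent, so the work can be divided among devices and the tallies summed afterwards. The sketch below shows that idea on a deliberately simple problem: the fraction of photons crossing a uniform slab without interacting. It is an assumption-laden illustration, not the authors' code. The attenuation coefficient, slab thickness, seeds, and the notion of a "device" as a plain function call are all hypothetical.

```python
import math
import random

MU = 0.2         # assumed attenuation coefficient in 1/cm (illustrative only)
THICKNESS = 5.0  # assumed slab thickness in cm (illustrative only)


def run_device(n_histories, seed):
    """Simulate independent photons on one hypothetical device and
    return how many cross the slab without interacting."""
    rng = random.Random(seed)  # each device gets its own random stream
    transmitted = 0
    for _ in range(n_histories):
        u = 1.0 - rng.random()            # in (0, 1], avoids log(0)
        free_path = -math.log(u) / MU     # exponentially distributed path
        if free_path > THICKNESS:
            transmitted += 1
    return transmitted


def run_split(n_total, n_devices):
    """Divide the histories evenly, run each share, and merge the tallies."""
    share = n_total // n_devices
    counts = [run_device(share, seed=1000 + d) for d in range(n_devices)]
    return sum(counts) / (share * n_devices)


estimate = run_split(200_000, n_devices=2)
expected = math.exp(-MU * THICKNESS)  # analytic transmission for comparison
print(estimate, expected)
```

Because no history depends on another, merging is a plain sum, which is consistent with the near-linear scaling with GPU count that the abstract reports.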
-
A New Approach to Integrate GPU-based Monte Carlo Simulation into Inverse Treatment Plan Optimization for Proton Therapy.
Yongbao Li,Zhen Tian,Ting Song,Zhaoxia Wu,Yaqiang Liu,Steve Jiang,Xun Jia
DOI: https://doi.org/10.1088/1361-6560/62/1/289
IF: 3.5
2016-01-01
Physics in Medicine and Biology
Abstract:Monte Carlo (MC)-based spot dose calculation is highly desired for inverse treatment planning in proton therapy because of its accuracy. Recent studies on biological optimization have also indicated the use of MC methods to compute relevant quantities of interest, e.g. linear energy transfer. Although GPU-based MC engines have been developed to address inverse optimization problems, their efficiency still needs to be improved. Also, the use of a large number of GPUs in MC calculation is not favorable for clinical applications. The previously proposed adaptive particle sampling (APS) method can improve the efficiency of MC-based inverse optimization by using the computationally expensive MC simulation more effectively. This method is more efficient than the conventional approach that performs spot dose calculation and optimization in two sequential steps. In this paper, we propose a computational library to perform MC-based spot dose calculation on GPU with the APS scheme. The implemented APS method performs a non-uniform sampling of the particles from pencil beam spots during the optimization process, favoring those from the high intensity spots. The library also conducts two computationally intensive matrix-vector operations frequently used when solving an optimization problem. This library design allows a streamlined integration of the MC-based spot dose calculation into an existing proton therapy inverse planning process. We tested the developed library in a typical inverse optimization system with four patient cases. The library achieved the targeted functions by supporting inverse planning in various proton therapy schemes, e.g. single field uniform dose, 3D intensity modulated proton therapy, and distal edge tracking. The efficiency was 41.6 +/- 15.3% higher than the use of a GPU-based MC package in a conventional calculation scheme. The total computation time ranged between 2 and 50 min on a single GPU card depending on the problem size.
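The adaptive particle sampling (APS) scheme described above draws particles non-uniformly from the pencil beam spots, favoring high-intensity spots. A minimal sketch of that one idea is given below, assuming that each simulated particle is assigned to a spot with probability proportional to the spot's current intensity. This is a simplification for illustration and is not the authors' library; the function name, the intensity values, and the particle budget are hypothetical.

```python
import random
from collections import Counter


def allocate_particles(intensities, budget, seed=0):
    """Assign `budget` particles to spots, each drawn with probability
    proportional to the spot intensity. Returns one count per spot."""
    rng = random.Random(seed)
    picks = rng.choices(range(len(intensities)), weights=intensities, k=budget)
    tally = Counter(picks)
    return [tally.get(i, 0) for i in range(len(intensities))]


# Hypothetical spot intensities at some iteration of the optimization.
spot_intensities = [0.0, 1.0, 3.0, 6.0]
counts = allocate_particles(spot_intensities, budget=10_000)
print(counts)
```

A spot with zero intensity receives no particles, so no simulation time is spent on it. In an optimization loop the allocation would presumably be repeated as the intensities change, which is where the reported efficiency gain over the two-step approach would come from.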
-
Development of a GPU-based Monte Carlo dose calculation code for coupled electron-photon transport
Xun Jia,Xuejun Gu,Josep Sempau,Dongju Choi,Amitava Majumdar,Steve B. Jiang
DOI: https://doi.org/10.1088/0031-9155/55/11/006
2010-03-23
Abstract:Monte Carlo simulation is the most accurate method for absorbed dose calculations in radiotherapy. Its efficiency still requires improvement for routine clinical applications, especially for online adaptive radiotherapy. In this paper, we report our recent development on a GPU-based Monte Carlo dose calculation code for coupled electron-photon transport. We have implemented the Dose Planning Method (DPM) Monte Carlo dose calculation package (Sempau et al, Phys. Med. Biol., 45(2000)2263-2291) on GPU architecture under CUDA platform. The implementation has been tested with respect to the original sequential DPM code on CPU in phantoms with water-lung-water or water-bone-water slab geometry. A 20 MeV mono-energetic electron point source or a 6 MV photon point source is used in our validation. The results demonstrate adequate accuracy of our GPU implementation for both electron and photon beams in radiotherapy energy range. Speed up factors of about 5.0 ~ 6.6 times have been observed, using an NVIDIA Tesla C1060 GPU card against a 2.27GHz Intel Xeon CPU processor.
Medical Physics
-
A fast GPU-based Monte Carlo simulation of proton transport with detailed modeling of non-elastic interactions
H. Wan Chan Tseung,J. Ma,C. Beltran
DOI: https://doi.org/10.1118/1.4921046
2014-09-30
Abstract: Purpose: Very fast Monte Carlo (MC) simulations of proton transport have been implemented recently on GPUs. However, these usually use simplified models for non-elastic (NE) proton-nucleus interactions. Our primary goal is to build a GPU-based proton transport MC with detailed modeling of elastic and NE collisions. Methods: Using CUDA, we implemented GPU kernels for these tasks: (1) Simulation of spots from our scanning nozzle configurations, (2) Proton propagation through CT geometry, considering nuclear elastic scattering, multiple scattering, and energy loss straggling, (3) Modeling of the intranuclear cascade stage of NE interactions, (4) Nuclear evaporation simulation, and (5) Statistical error estimates on the dose. To validate our MC, we performed: (1) Secondary particle yield calculations in NE collisions, (2) Dose calculations in homogeneous phantoms, (3) Re-calculations of head and neck plans from a commercial treatment planning system (TPS), and compared with Geant4.9.6p2/TOPAS. Results: Yields, energy and angular distributions of secondaries from NE collisions on various nuclei agree well with the Geant4 Bertini and Binary cascade models. The 3D-gamma pass rate at 2%-2 mm for treatment plan simulations is typically 98%. The net calculation time on an NVIDIA GTX680 card, including all data transfers, is ~20 s for 1 × 10^7 proton histories. Conclusions: Our GPU-based MC is the first of its kind to include a detailed nuclear model to handle NE interactions of protons with any nucleus. Dosimetric calculations are in very good agreement with Geant4/TOPAS. Our MC is being used to perform fast routine clinical QA of pencil-beam based treatment plans, and has also been adopted as the dose engine in a clinically-applicable MC-based IMPT TPS. The detailed nuclear modeling will allow us to perform very fast linear energy transfer and neutron dose estimates on the GPU.
Medical Physics,Computational Physics
-
Real-time dose computation: GPU-accelerated source modeling and superposition/convolution
Robert Jacques,John Wong,Russell Taylor,Todd McNutt
DOI: https://doi.org/10.1118/1.3483785
Abstract: Purpose: To accelerate dose calculation to interactive rates using highly parallel graphics processing units (GPUs). Methods: The authors have extended their prior work in GPU-accelerated superposition/convolution with a modern dual-source model and have enhanced performance. The primary source algorithm supports both focused leaf ends and asymmetric rounded leaf ends. The extra-focal algorithm uses a discretized, isotropic area source and models multileaf collimator leaf height effects. The spectral and attenuation effects of static beam modifiers were integrated into each source's spectral function. The authors introduce the concepts of arc superposition and delta superposition. Arc superposition utilizes separate angular sampling for the total energy released per unit mass (TERMA) and superposition computations to increase accuracy and performance. Delta superposition allows single beamlet changes to be computed efficiently. The authors extended their concept of multi-resolution superposition to include kernel tilting. Multi-resolution superposition approximates solid angle ray-tracing, improving performance and scalability with a minor loss in accuracy. Superposition/convolution was implemented using the inverse cumulative-cumulative kernel and exact radiological path ray-tracing. The accuracy analyses were performed using multiple kernel ray samplings, both with and without kernel tilting and multi-resolution superposition. Results: Source model performance was <9 ms (data dependent) for a high resolution (400^2) field using an NVIDIA (Santa Clara, CA) GeForce GTX 280. Computation of the physically correct multispectral TERMA attenuation was improved by a material centric approach, which increased performance by over 80%. Superposition performance was improved by approximately 24% to 0.058 and 0.94 s for 64^3 and 128^3 water phantoms; a speed-up of 101-144X over the highly optimized Pinnacle3 (Philips, Madison, WI) implementation.
Pinnacle3 times were 8.3 and 94 s, respectively, on an AMD (Sunnyvale, CA) Opteron 254 (two cores, 2.8 GHz). Conclusions: The authors have completed a comprehensive, GPU-accelerated dose engine in order to provide a substantial performance gain over CPU based implementations. Real-time dose computation is feasible with the accuracy levels of the superposition/convolution algorithm.
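The speed-up range quoted in this abstract can be checked against the timings it reports. The short Python calculation below is only an arithmetic illustration; the dictionary names are introduced here and are not from the paper. Because the published timings are rounded, the ratios land close to, rather than exactly on, the quoted 101-144X range.

```python
# Timings quoted in the abstract, in seconds (rounded as published).
gpu_s = {"64^3": 0.058, "128^3": 0.94}   # GeForce GTX 280
cpu_s = {"64^3": 8.3, "128^3": 94.0}     # Pinnacle3 on an Opteron 254

# Speed-up = CPU time / GPU time for each water phantom size.
speedups = {size: cpu_s[size] / gpu_s[size] for size in gpu_s}

for size, factor in speedups.items():
    print(f"{size} phantom: about {factor:.0f}x")
```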
-
A GPU-accelerated Monte Carlo code, RT2 for coupled transport of photon, electron/positron, and neutron
Chang-Min Lee,Sung-Joon Ye
DOI: https://doi.org/10.1088/1361-6560/ad694f
2024-08-14
Abstract:Objective. This work aims to develop a graphics processing unit (GPU)-accelerated Monte Carlo code for the coupled transport of photon, electron/positron and neutron over a broad range of energies for medical applications. Approach. By separating the MC evolution of radiation into source, transport, and interaction kernels, the branch divergence was alleviated. The memory coalescence was achieved by vectorizing the access pattern in which the secondary particles were archived. To accelerate further particle tracking, ray-tracing hardware acceleration in the Nvidia OptiX™ framework was applied. For photon and electron/positron, the EGSnrc interaction modules were ported as a GPU-optimized configuration. For neutron, a group-wise transport based on NJOY21 preprocessed data was implemented. The developed code was validated against CPU-based FLUKA. Neutron, x-ray and electron beams incident on water and ICRP phantoms were simulated. The neutron energy group and the transport parameters of photon and electron were set to be the same in both codes. A single Nvidia RTX 4090 card was used in this code while all 20 threads of a single Intel Core i9-10900K node were used in FLUKA. Main results. The number of histories was set to ensure statistical uncertainties lower than 2% for all voxels whose doses were larger than 20% of the maximum. In all cases, the dose differences in the voxels between the codes were within 2.5%. For photons and electrons, the developed code was 150-300 times faster than FLUKA in both geometries. For neutrons, the code was respectively 80 and 135 times faster in the water and ICRP phantoms than FLUKA. Significance. This study offers an appropriate solution for uncoalesced memory access and branch divergence commonly encountered in coupled MC transport on the GPU architecture. The formidable acceleration in computing times and accuracy shown in this study can promise a routine clinical use of MC simulations.
-
EVALUATION OF SPEEDUP OF MONTE CARLO CALCULATIONS OF TWO SIMPLE REACTOR PHYSICS PROBLEMS CODED FOR THE GPU/CUDA ENVIRONMENT
A. Ding,Chao Liang,F. Brown,Tianyu Liu,X. Xu,M. Shephard,W. Ji
Abstract:Monte Carlo simulation is ideally suited for solving Boltzmann neutron transport equation in inhomogeneous media. However, routine applications require the computation time to be reduced to hours and even minutes in a desktop system. The interest in adopting GPUs for Monte Carlo acceleration is rapidly mounting, fueled partially by the parallelism afforded by the latest GPU technologies and the challenge to perform full-size reactor core analysis on a routine basis. In this study, Monte Carlo codes for a fixed-source neutron transport problem and an eigenvalue/criticality problem were developed for CPU and GPU environments, respectively, to evaluate issues associated with computational speedup afforded by the use of GPUs. The results suggest that a speedup factor of 30 in Monte Carlo radiation transport of neutrons is within reach using the state-of-the-art GPU technologies. However, for the eigenvalue/criticality problem, the speedup was 8.5. In comparison, for a task of voxelizing unstructured mesh geometry that is more parallel in nature, the speedup of 45 was obtained. It was observed that, to date, most attempts to adopt GPUs for Monte Carlo acceleration were based on naive implementations and have not yielded the level of anticipated gains. Successful implementation of Monte Carlo schemes for GPUs will likely require the development of an entirely new code. Given the prediction that future-generation GPU products will likely bring exponentially improved computing power and performances, innovative hardware and software solutions may make it possible to achieve full-core Monte Carlo calculation within one hour using a desktop computer system in a few years.
Engineering,Computer Science,Physics
-
GPU-Accelerated Monte Carlo Electron Transport Methods: Development and Application for Radiation Dose Calculations Using Six GPU cards
Lin Su,Xining Du,Tianyu Liu,X. George Xu
DOI: https://doi.org/10.1051/snamc/201405405
2014-01-01
Abstract:An electron-photon coupled Monte Carlo code ARCHER - Accelerated Radiation-transport Computations in Heterogeneous EnviRonments - is being developed at Rensselaer Polytechnic Institute as a software testbed for emerging heterogeneous high performance computers that utilize accelerators such as GPUs. This paper presents the preliminary code development and the testing involving radiation dose related problems. In particular, the paper discusses the electron transport simulations using the class-II condensed history method. The considered electron energy ranges from a few hundreds of keV to 30 MeV. For the photon part, photoelectric effect, Compton scattering and pair production were modeled. Voxelized geometry was supported. A serial CPU code was first written in C++. The code was then transplanted to the GPU using the CUDA C 5.0 standards. The hardware involved a desktop PC with an Intel Xeon X5660 CPU and six NVIDIA Tesla (TM) M2090 GPUs. The code was tested for a case of 20 MeV electron beam incident perpendicularly on a water-aluminum-water phantom. The depth and lateral dose profiles were found to agree with results obtained from well tested MC codes. Using six GPU cards, 6×10⁶ electron histories were simulated within 2 seconds. In comparison, the same case running the EGSnrc and MCNPX codes required 1645 seconds and 9213 seconds, respectively. On-going work continues to test the code for different medical applications such as radiotherapy and brachytherapy.
-
GPU-based ultra fast dose calculation using a finite pencil beam model
Xuejun Gu,Dongju Choi,Chunhua Men,Hubert Pan,Amitava Majumdar,Steve B. Jiang
DOI: https://doi.org/10.1088/0031-9155/54/20/017
2009-08-31
Abstract:Online adaptive radiation therapy (ART) is an attractive concept that promises the ability to deliver an optimal treatment in response to the inter-fraction variability in patient anatomy. However, it has yet to be realized due to technical limitations. Fast dose deposit coefficient calculation is a critical component of the online planning process that is required for plan optimization of intensity modulated radiation therapy (IMRT). Computer graphics processing units (GPUs) are well-suited to provide the requisite fast performance for the data-parallel nature of dose calculation. In this work, we develop a dose calculation engine based on a finite-size pencil beam (FSPB) algorithm and a GPU parallel computing framework. The developed framework can accommodate any FSPB model. We test our implementation on a case of a water phantom and a case of a prostate cancer patient with varying beamlet and voxel sizes. All testing scenarios achieved speedups ranging from 200 to 400 times when using an NVIDIA Tesla C1060 card in comparison with a 2.27 GHz Intel Xeon CPU. The computational time for calculating dose deposition coefficients for a 9-field prostate IMRT plan with this new framework is less than 1 second. This indicates that the GPU-based FSPB algorithm is well-suited for online re-planning for adaptive radiotherapy.
Medical Physics
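The dose deposition coefficients mentioned in this abstract are used in plan optimization as a linear map from beamlet intensities to voxel doses. The following Python sketch only illustrates that relationship; it is not the FSPB engine itself, the function name `dose_from_beamlets` is hypothetical, and the matrix values are made up. It assumes `ddc[i][j]` is the dose to voxel `i` per unit intensity of beamlet `j`.

```python
def dose_from_beamlets(ddc, intensities):
    """Return voxel doses d_i = sum_j ddc[i][j] * x_j (dense matrix-vector product)."""
    return [sum(dij * xj for dij, xj in zip(row, intensities)) for row in ddc]

# Toy example (made-up numbers): 3 voxels, 2 beamlets.
ddc = [
    [0.5, 0.1],
    [0.2, 0.4],
    [0.0, 0.3],
]
intensities = [2.0, 1.0]

dose = dose_from_beamlets(ddc, intensities)
print(dose)
```

Each voxel row can be evaluated independently of the others, which is the data-parallel property that the abstract says suits GPUs.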
-
Accelerated ray tracing for radiotherapy dose calculations on a GPU
M de Greef,J Crezee,J C van Eijk,R Pool,A Bel
DOI: https://doi.org/10.1118/1.3190156
Abstract:Purpose: The graphical processing unit (GPU) on modern graphics cards offers the possibility of accelerating arithmetically intensive tasks. By splitting the work into a large number of independent jobs, order-of-magnitude speedups are reported. In this article, the possible speedup of PLATO's ray tracing algorithm for dose calculations using a GPU is investigated. Methods: A GPU version of the ray tracing algorithm was implemented using NVIDIA's CUDA, which extends the standard C language with functionality to program graphics cards. The developed algorithm was compared based on the accuracy and speed to a multithreaded version of the PLATO ray tracing algorithm. This comparison was performed for three test geometries, a phantom and two radiotherapy planning CT datasets (a pelvic and a head-and-neck case). For each geometry, four different source positions were evaluated. In addition to this, for the head-and-neck case also a vertex field was evaluated. Results: The GPU algorithm was proven to be more accurate than the PLATO algorithm by elimination of the look-up table for z indices that introduces discretization errors in the reference algorithm. Speedups for ray tracing were found to be in the range of 2.1-10.1, relative to the multithreaded PLATO algorithm running four threads. For dose calculations the speedup measured was in the range of 1.5-6.2. For the speedup of both the ray tracing and the dose calculation, a strong dependency on the tested geometry was found. This dependency is related to the fraction of air within the patient's bounding box resulting in idle threads. Conclusions: With the use of a GPU, ray tracing for dose calculations can be performed accurately in considerably less time. Ray tracing was accelerated, on average, with a factor of 6 for the evaluated cases. Dose calculation for a single beam can typically be carried out in 0.6-0.9 s for clinically realistic datasets. 
These findings can be used in conventional planning to enable (nearly) real-time dose calculations. Also the importance for treatment optimization techniques is evident.
-
A GPU implementation of a track-repeating algorithm for proton radiotherapy dose calculations
Pablo P Yepes,Dragan Mirkovic,Phillip J Taddei
DOI: https://doi.org/10.1088/0031-9155/55/23/S11
2010-11-10
Abstract:An essential component in proton radiotherapy is the algorithm to calculate the radiation dose to be delivered to the patient. The most common dose algorithms are fast but they are approximate analytical approaches. However their level of accuracy is not always satisfactory, especially for heterogeneous anatomic areas, like the thorax. Monte Carlo techniques provide superior accuracy, however, they often require large computation resources, which render them impractical for routine clinical use. Track-repeating algorithms, for example the Fast Dose Calculator, have shown promise for achieving the accuracy of Monte Carlo simulations for proton radiotherapy dose calculations in a fraction of the computation time. We report on the implementation of the Fast Dose Calculator for proton radiotherapy on a card equipped with graphics processor units (GPU) rather than a central processing unit architecture. This implementation reproduces the full Monte Carlo and CPU-based track-repeating dose calculations within 2%, while achieving a statistical uncertainty of 2% in less than one minute utilizing one single GPU card, which should allow real-time accurate dose calculations.
Medical Physics,Nuclear Experiment
-
An OpenCL-based Monte Carlo dose calculation engine (oclMC) for coupled photon-electron transport
Zhen Tian,Feng Shi,Michael Folkerts,Nan Qin,Steve B. Jiang,Xun Jia
DOI: https://doi.org/10.1118/1.4924473
2015-03-06
Abstract:The Monte Carlo (MC) method has been recognized as the most accurate dose calculation method for radiotherapy. However, its extremely long computation time impedes clinical applications. Recently, considerable effort has been made to realize fast MC dose calculation on GPUs. Nonetheless, most of the GPU-based MC dose engines were developed in the NVidia CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a fast cross-platform MC dose engine, oclMC, using the OpenCL environment for external beam photon and electron radiotherapy in the MeV energy range. Coupled photon-electron MC simulation was implemented with analogue simulations for photon transport and a Class II condensed history scheme for electron transport. To test the accuracy and efficiency of our dose engine oclMC, we compared dose calculation results of oclMC and gDPM, our previously developed GPU-based MC code, for a 15 MeV electron beam and a 6 MV photon beam on a homogeneous water phantom, one slab phantom and one half-slab phantom. Satisfactory agreement was observed in all the cases. The average dose differences within the 10% isodose line of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, our dose engine oclMC was 6-17% slower than gDPM when running both codes on the same NVidia TITAN card, due to both different physics particle transport models and different computational environments between CUDA and OpenCL. The cross-platform portability was also validated by successfully running our new dose engine on a set of different compute devices including an Nvidia GPU card, two AMD GPU cards and an Intel CPU card using one or four cores. Computational efficiency among these platforms was compared.
Medical Physics
-
TU-AB-BRC-10: Modeling of Radiotherapy Linac Source Terms Using ARCHER Monte Carlo Code: Performance Comparison of GPU and MIC Computing Accelerators.
T. Liu,H. Lin,L. Su,C. Shi,X. Tang,B. Bednarz,X. Xu
DOI: https://doi.org/10.1118/1.4957404
IF: 4.506
2016-01-01
Medical Physics
Abstract:PURPOSE (1) To perform phase space (PS) based source modeling for Tomotherapy and Varian TrueBeam 6 MV Linacs, (2) to examine the accuracy and performance of the ARCHER Monte Carlo code on a heterogeneous computing platform with Many Integrated Core coprocessors (MIC, aka Xeon Phi) and GPUs, and (3) to explore the software micro-optimization methods. METHODS The patient-specific source of Tomotherapy and Varian TrueBeam Linacs was modeled using the PS approach. For the helical Tomotherapy case, the PS data were calculated in our previous study (Su et al. 2014 41(7) Medical Physics). For the single-view Varian TrueBeam case, we analytically derived them from the raw patient-independent PS data in IAEA's database, partial geometry information of the jaw and MLC as well as the fluence map. The phantom was generated from DICOM images. The Monte Carlo simulation was performed by ARCHER-MIC and GPU codes, which were benchmarked against a modified parallel DPM code. Software micro-optimization was systematically conducted, and was focused on SIMD vectorization of tight for-loops and data prefetch, with the ultimate goal of increasing 512-bit register utilization and reducing memory access latency. RESULTS Dose calculation was performed for two clinical cases, a Tomotherapy-based prostate cancer treatment and a TrueBeam-based left breast treatment. ARCHER was verified against the DPM code. The statistical uncertainty of the dose to the PTV was less than 1%. Using double-precision, the total wall time of the multithreaded CPU code on a X5650 CPU was 339 seconds for the Tomotherapy case and 131 seconds for the TrueBeam, while on 3 5110P MICs it was reduced to 79 and 59 seconds, respectively. The single-precision GPU code on a K40 GPU took 45 seconds for the Tomotherapy dose calculation. CONCLUSION We have extended ARCHER, the MIC and GPU-based Monte Carlo dose engine to Tomotherapy and Truebeam dose calculations.
-
Technical note: A GPU‐based shared Monte Carlo method for fast photon transport in multi‐energy x‐ray exposures
Yiwen Zhou,Wenxin Deng,Jing Kang,Jinqiu Xia,Yingjie Yang,Bin Li,Yuqin Zhang,Hongliang Qi,WangJiang Wu,Mengke Qi,Linghong Zhou,Jianhui Ma,Yuan Xu
DOI: https://doi.org/10.1002/mp.17314
IF: 4.506
2024-07-20
Medical Physics
Abstract:Background The Monte Carlo (MC) method is an accurate technique for particle transport calculation due to the precise modeling of physical interactions. Nevertheless, the MC method still suffers from the problem of expensive computational cost, even with graphics processing unit (GPU) acceleration. Our previous works have investigated the acceleration strategies of photon transport simulation for single‐energy CT. But for multi‐energy CT, conventional individual simulation leads to unnecessary redundant calculation, consuming more time. Purpose This work proposes a novel GPU‐based shared MC scheme (gSMC) to reduce unnecessary repeated simulations of similar photons between different spectra, thereby enhancing the efficiency of scatter estimation in multi‐energy x‐ray exposures. Methods The shared MC method selects shared photons between different spectra using two strategies. Specifically, we introduce spectral region classification strategy to select photons with the same initial energy from different spectra, thus generating energy‐shared photon groups. Subsequently, the multi‐directional sampling strategy is utilized to select energy‐and‐direction‐shared photons, which have the same initial direction, from energy‐shared photon groups. Energy‐and‐direction‐shared photons perform shared simulations, while others are simulated individually. Finally, all results are integrated to obtain scatter distribution estimations for different spectral cases. Results The efficiency and accuracy of the proposed gSMC are evaluated on the digital phantom and clinical case. The experimental results demonstrate that gSMC can speed up the simulation in the digital case by ∼37.8% and the one in the clinical case by ∼20.6%, while keeping the differences in total scatter results within 0.09%, compared to the conventional MC package, which performs an individual simulation. 
Conclusions The proposed GPU‐based shared MC simulation method can achieve fast photon transport calculation for multi‐energy x‐ray exposures.
radiology, nuclear medicine & medical imaging
-
GPU-based Parallel Monte Carlo Simulation for Radiotherapy Dose Calculation
Huang Fei-zeng
DOI: https://doi.org/10.3969/j.issn.1005-202X.2012.06.001
2012-01-01
Abstract:Objective: Monte Carlo simulation is commonly considered to be the most accurate dose calculation method in radiotherapy. However, its efficiency still requires improvement for many routine clinical applications. Methods: This paper presents recent progress in GPU-based Monte Carlo dose calculation. We utilize the parallel computation ability of a GPU to achieve high efficiency while maintaining the same particle transport physics as in the original Monte Carlo simulation code, and therefore obtain the same level of simulation accuracy. Results: Using an NVIDIA GTX460 GPU card with all 336 processor cores working together, compared against an Intel i5 2300, the speed-up factor was as high as 116.6 for a one-million-sample computation and as high as 127.5 for a ten-million-sample computation. Conclusions: Using a GPU and CUDA to run a Monte Carlo simulation can greatly improve the efficiency of dose calculation.
-
Development and application of graphics processor units-based Monte Carlo simulation in radiation dose calculation
Ying HUANG,Haikuan LIU
DOI: https://doi.org/10.3969/j.issn.1005-202X.2017.10.001
2017-01-01
Abstract:Monte Carlo calculation plays an important role in medical physics, but its routine clinical use is limited by its computing speed. With the development of graphics processor units (GPUs), GPU parallel acceleration has been increasingly used for MC simulation. Herein, we report on the implementation of photon, electron and proton simulation on a card equipped with a GPU, and on its development in radiation dose calculation and application in medical physics.
-
A general-purpose Monte Carlo particle transport code based on inverse transform sampling for radiotherapy dose calculation
Ying Liang,Wazir Muhammad,Gregory R. Hart,Bradley J. Nartowt,Zhe J. Chen,James B. Yu,Kenneth B. Roberts,James S. Duncan,Jun Deng
DOI: https://doi.org/10.1038/s41598-020-66844-7
IF: 4.6
2020-06-17
Scientific Reports
Abstract:The Monte Carlo (MC) method is widely used to solve various problems in radiotherapy. There has been an impetus to accelerate MC simulation on GPUs whereas thread divergence remains a major issue for MC codes based on acceptance-rejection sampling. Inverse transform sampling has the potential to eliminate thread divergence but it is only implemented for photon transport. Here, we report a MC package Particle Transport in Media (PTM) to demonstrate the implementation of coupled photon-electron transport simulation using inverse transform sampling. Rayleigh scattering, Compton scattering, photo-electric effect and pair production are considered in an analogous manner for photon transport. Electron transport is simulated in a class II condensed history scheme, i.e., catastrophic inelastic scattering and Bremsstrahlung events are simulated explicitly while subthreshold interactions are subject to grouping. A random-hinge electron step correction algorithm and a modified PRESTA boundary crossing algorithm are employed to improve simulation accuracy. Benchmark studies against both EGSnrc simulations and experimental measurements are performed for various beams, phantoms and geometries. Gamma indices of the dose distributions are better than 99.6% for all the tested scenarios under the 2%/2 mm criteria. These results demonstrate the successful implementation of inverse transform sampling in coupled photon-electron transport simulation.
multidisciplinary sciences
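The contrast this abstract draws between acceptance-rejection sampling and inverse transform sampling can be illustrated with a toy distribution. The Python sketch below is not taken from PTM: the density p(x) = 2x on [0, 1] and both function names are assumptions chosen only for illustration. The inverse-transform sampler does a fixed amount of work per sample, while the rejection sampler loops a random number of times, which is the behaviour that causes threads to diverge on a GPU.

```python
import math
import random

def sample_inverse(rng):
    """Inverse transform: CDF F(x) = x^2, so x = sqrt(u). One draw, no loop."""
    return math.sqrt(rng.random())

def sample_rejection(rng):
    """Acceptance-rejection: propose x ~ U(0, 1), accept with probability x.

    Returns the accepted sample and the number of trials it took.
    """
    trials = 0
    while True:
        trials += 1
        x = rng.random()
        if rng.random() < x:
            return x, trials

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000
inv = [sample_inverse(rng) for _ in range(n)]
rej = [sample_rejection(rng) for _ in range(n)]

mean_inv = sum(inv) / n                    # exact mean of p(x) = 2x is 2/3
mean_rej = sum(x for x, _ in rej) / n
max_trials = max(t for _, t in rej)        # work per sample varies for rejection
print(mean_inv, mean_rej, max_trials)
```

Both samplers target the same distribution; only the rejection sampler has a per-sample trial count that differs from one sample to the next.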
-
Multi-GPU implementation of a VMAT treatment plan optimization algorithm
Zhen Tian,Fei Peng,Michael Folkerts,Jun Tan,Xun Jia,Steve B Jiang
DOI: https://doi.org/10.1118/1.4919742
Abstract:Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU's relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors' group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors' method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. 
A head and neck (H&N) cancer case is then used to validate the authors' method. The authors also compare their multi-GPU implementation with three different single GPU implementation strategies, i.e., truncating DDC matrix (S1), repeatedly transferring DDC matrix between CPU and GPU (S2), and porting computations involving DDC matrix to CPU (S3), in terms of both plan quality and computational efficiency. Two more H&N patient cases and three prostate cases are used to demonstrate the advantages of the authors' method. Results: The authors' multi-GPU implementation can finish the optimization process within ∼ 1 min for the H&N patient case. S1 leads to an inferior plan quality although its total time was 10 s shorter than the multi-GPU implementation due to the reduced matrix size. S2 and S3 yield the same plan quality as the multi-GPU implementation but take ∼4 and ∼6 min, respectively. High computational efficiency was consistently achieved for the other five patient cases tested, with VMAT plans of clinically acceptable quality obtained within 23-46 s. Conversely, to obtain clinically comparable or acceptable plans for all six of these VMAT cases that the authors have tested in this paper, the optimization time needed in a commercial TPS system on CPU was found to be in an order of several minutes. Conclusions: The results demonstrate that the multi-GPU implementation of the authors' column-generation-based VMAT optimization can handle the large-scale VMAT optimization problem efficiently without sacrificing plan quality. The authors' study may serve as an example to shed some light on other large-scale medical physics problems that require multi-GPU techniques.
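The storage scheme described in the Methods (a coordinate-list matrix on the CPU, split by beam angle into submatrices held in compressed sparse row format) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes rows are voxels and columns are beamlets ordered by beam angle, so that a split by beam angle is a split by contiguous column ranges. The function names, the two-way split, and the numbers are hypothetical.

```python
def coo_to_csr(n_rows, rows, cols, vals):
    """Convert COO triplets to CSR arrays (indptr, indices, data)."""
    indptr = [0] * (n_rows + 1)
    for r in rows:                      # count entries per row
        indptr[r + 1] += 1
    for i in range(n_rows):             # prefix sum gives row start offsets
        indptr[i + 1] += indptr[i]
    indices = [0] * len(vals)
    data = [0.0] * len(vals)
    next_slot = indptr[:-1]             # slicing copies: next free slot per row
    for r, c, v in zip(rows, cols, vals):
        k = next_slot[r]
        indices[k] = c
        data[k] = v
        next_slot[r] += 1
    return indptr, indices, data

def split_by_columns(rows, cols, vals, boundaries):
    """Split COO entries into blocks; block b holds columns in
    [boundaries[b], boundaries[b + 1]) and uses local column indices."""
    blocks = [([], [], []) for _ in range(len(boundaries) - 1)]
    for r, c, v in zip(rows, cols, vals):
        for b in range(len(blocks)):
            if boundaries[b] <= c < boundaries[b + 1]:
                blocks[b][0].append(r)
                blocks[b][1].append(c - boundaries[b])
                blocks[b][2].append(v)
                break
    return blocks

# Toy matrix (made-up values): 3 voxels x 4 beamlets, two beam-angle groups.
rows = [0, 0, 1, 2, 2]
cols = [0, 3, 1, 2, 3]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]

blocks = split_by_columns(rows, cols, vals, boundaries=[0, 2, 4])
csr_blocks = [coo_to_csr(3, *blk) for blk in blocks]  # one CSR block per device
print(csr_blocks)
```

In a real multi-GPU setting each CSR block would be copied to its own device; that transfer is outside the scope of this sketch.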