General Parameterized Thermal Modeling for Multi-core Microprocessor Design

Thom Eguia, Sheldon X.-D. Tan, Ruijing Shen, Duo Li, Eduardo H. Pacheco, Murli Tirumala, Lingli Wang
2010-01-01
Abstract: This paper proposes a new parameterized dynamic thermal modeling algorithm for emerging thermal-aware design and optimization for multi-core microprocessor design at architecture and package levels. Compared with existing behavioral thermal modeling algorithms, the proposed method can build the compact models from more general transient power and temperature waveforms used as training data. Such an approach can make the modeling process much easier and less restrictive than before, and more amenable for practical measured data. The new method, called ParThermSID, consists of two steps. First, the response surface method based on second order polynomials is applied to build the parameterized models at each time point for all the given sampling nodes in the parameter space. Second, an improved subspace system identification method, called ThermSID, is employed to build the discrete state space models, by construction of the Hankel matrix and state space realization, for each time-varying coefficient of the polynomials generated in the first step. To overcome the overfitting problems of the subspace method, the new method employs an overfitting mitigation technique to improve model accuracy and predictive ability. Experimental results on a practical quad-core microprocessor show that the generated parameterized thermal model matches the given data very well. The results also show that ThermSID is more accurate than the existing ThermPOF method. The compact models generated by ParThermSID also offer two orders of magnitude speedup over the commercial thermal analysis tool FloTHERM on the given example. ParThermSID is also much more general and flexible than the recently proposed parameterized thermal modeling method ParThermPOF.

I. INTRODUCTION

As VLSI technology is scaled into the nanometer region, the power density of high-performance microprocessors increases drastically.
The exponential power density increase will, in turn, cause the average chip temperature to rise rapidly [3]. Higher temperature has significant adverse impacts on chip packaging cost, performance, and reliability. Excessive on-chip temperature leads to slower transistor speed owing to reduced carrier mobility, more leakage power consumption as leakage currents grow exponentially with temperature, higher interconnect resistance, and reduced reliability [11], [5].

Thom Eguia, Sheldon X.-D. Tan, Ruijing Shen, and Duo Li are with the Department of Electrical Engineering, University of California, Riverside, CA 92521 USA (e-mail: {teguia,stan,rshen,dli}@ee.ucr.edu). Eduardo H. Pacheco and Murli Tirumala are with Intel Corporation. Lingli Wang is with the Department of Microelectronics, Fudan University, Shanghai, China, 200433.

This work is supported in part by NSF grant under No. CCF-0448 53, in part by NSF grant under No. CCF-0902885, in part by Semiconductor Research Corporation (SRC) grant under No. 2009-TJ-1991, and in part by Science and Technology Commission of Shanghai Municipality under grant No. 2009B021.

One way to mitigate the high temperature problem is to put multiple cores into one single multi-core CPU [17], [1], [2]. In this way, one can increase the total throughput via task-level parallel computation while lowering voltage and frequency to meet thermal constraints. In this case, however, the thermal effects are influenced by the placement of cores and shared caches. Therefore, it is very important to consider temperature during the floor planning and architecture design of multi-core microprocessors.

The estimated temperature at the architecture level is vital for performing accurate power (especially leakage power), performance, reliability, wear-out and aging analysis in the floor planning and packaging design [24]. As a result, a temperature-guided design can, in principle, be optimized to avoid potential thermal problems.
For cycle-accurate architecture thermal simulation, the simulated time span can be very long (several seconds) [18], [27]. For instance, for a 3 GHz CPU, 10K clock cycles (a typically used step size) last about 3.3 µs, so simulating 10 seconds takes about 3 million time steps. Although the simulation techniques have seen some progress recently [6], more efficient thermal simulators are still highly desired. To facilitate this temperature-aware architecture design, it is important to have accurate and fast thermal estimation at the architecture level. The demands for reliable and practical tools for thermal architecture modeling from both architecture and CAD tool communities could not be higher.

The traditional bottom-up approaches, including finite element (FEM), finite difference (FDM), and computational fluid dynamics (CFD) based methods, were widely used for thermal modeling and analysis in the past. They can be accurate when detailed thermal structures are known. However, these detailed models can be very large, which prevents their use in many practical problems. Static and transient thermal modeling methods at different levels (parts, package, board) have been proposed in the past. Many approaches use thermal resistance and capacitance networks with fixed topology subject to different thermal boundary conditions [14], [7], [4]. The main limitation of those methods is the difficulty of determining appropriate RC values of the elements, especially for complex geometries and boundary conditions. The RC values are typically determined and optimized against the field numerical or analytic results [10], [21] and measured data [23].

For thermal modeling at architecture level, existing work on HotSpot [12], [24] tries to solve this problem by generating the architecture thermal model in a bottom-up way based on the internal structure/architecture of the microprocessor. These bottom-up compact models, however, may suffer from accuracy loss, and they have to be calibrated with hardware if more accurate models are required. Also, the generated lumped RC models are not parameterized, as different RC models will be generated with different parameters such as thermal conductivities, different thermal conditions (ambient temperatures), and packaging configurations [25], [26]. Recently a top-down behavioral architecture level thermal modeling method, ThermPOF, has been proposed [16], where temperature impulse responses are used to build the thermal models by the matrix pencil method.

IEEE TRANSACTIONS ON VLSI SYSTEMS, VOL XX, NO. XX, DECEMBER 200X

In this paper, we propose a new parameterized thermal modeling approach for fast temperature estimation at the architecture and package levels for multi-core microprocessors. The new approach can build the behavioral thermal models from measured or simulated transient thermal and power information. The main advantage of the proposed method over existing black-box thermal modeling methods like ThermPOF [16] and ParThermPOF [15], which accept only impulse/step power inputs, is that it can accept general transient power and temperature waveforms. This makes the new method much easier to train and more general, as transfer function-like responses are typically difficult, even impossible (intractable), to obtain from measurements. Furthermore, the new method is a top-down, black-box approach, which means it does not require any knowledge of the internal structure of the system. Lastly, it can accommodate a number of parameters, such as location of thermal sensors on a heat sink, thermal conductivity of heat sink materials, etc.

The new method, called ParThermSID, consists of two techniques. First, the response surface method (RSM) based on second order polynomials is adopted to build the parameterized models at each time point for all the given sampling nodes in the parameter space (except for time).
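
The RSM step just described can be illustrated with a small sketch. This is not the authors' implementation: it assumes ordinary least squares as the fitting method, and the function names `quadratic_features` and `fit_rsm_per_time_point`, the array layout, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def quadratic_features(params):
    """Second-order polynomial terms for each parameter sample.

    params: array of shape (n_samples, n_params).
    Returns shape (n_samples, n_terms): constant, linear terms,
    then all squared and cross terms.
    """
    params = np.atleast_2d(params)
    n_samples, n_params = params.shape
    columns = [np.ones(n_samples)]
    columns += [params[:, i] for i in range(n_params)]
    columns += [params[:, i] * params[:, j]
                for i in range(n_params) for j in range(i, n_params)]
    return np.column_stack(columns)

def fit_rsm_per_time_point(params, temps):
    """Least-squares fit of one quadratic response surface per time point.

    temps: array of shape (n_samples, n_times), one transient waveform
    per sampled parameter combination (hypothetical layout).
    Returns coefficients of shape (n_terms, n_times); each row is one
    polynomial coefficient as a function of time.
    """
    design = quadratic_features(params)
    coeffs, *_ = np.linalg.lstsq(design, temps, rcond=None)
    return coeffs

if __name__ == "__main__":
    # Synthetic example: two parameters sampled on a 3x3 grid, 5 time points.
    p1, p2 = np.meshgrid([0.0, 1.0, 2.0], [0.0, 0.5, 1.0])
    params = np.column_stack([p1.ravel(), p2.ravel()])
    t = np.linspace(0.0, 1.0, 5)
    temps = 40.0 + np.outer(params[:, 0] + params[:, 1] ** 2, t)
    coeffs = fit_rsm_per_time_point(params, temps)
    print(coeffs.shape)
```

Each row of the returned array is a waveform over time, which corresponds to the time-varying polynomial coefficients that the second technique models.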
Second, ParThermSID applies an improved subspace system identification method, called ThermSID, to build the transient model for each time-varying coefficient of the polynomials generated in the first step. The subspace system identification method can accept general transient inputs and thus eliminates the need for impulse/step power inputs. It first generates the states of the desired models in terms of a Hankel matrix of Markov parameters from the measured input and output data via subspace projection and reduction. Then, the discrete state matrices are obtained through state matrix realizations. To counter the overfitting that is unavoidable in training-based modeling, ThermSID builds several models on partial training data and picks the best one among them.

Experimental results on a real multi-core microprocessor show that ThermSID and ParThermSID can provide thermal behavioral models that match the measured data very closely, with similar accuracy to ThermPOF and ParThermPOF. The compact models generated by ParThermSID also offer two orders of magnitude speedup over the commercial thermal analysis tool FloTHERM on given examples. ParThermSID is also much more general and flexible than the recently proposed parameterized thermal modeling method ParThermPOF.

Fig. 1. Quad-core architecture. (Figure labels: die:0, die:1, die:2, die:3, die:4, CACHE; 1 cm scale; package stack: DIE, TIM1, heat spreader, TIM2, heat sink.)

The rest of this paper is organized as follows: Section II presents the thermal modeling problem we are trying to solve. Section III reviews the subspace system identification method for its use in thermal modeling, while Section III-B explores overfitting and the resulting mitigation techniques. Section IV reviews the response surface method and describes its use in parameterization.
Finally, Section V presents the results of both ThermSID and ParThermSID, with Section VI concluding the paper.

II. PACKAGE-LEVEL PARAMETERIZED THERMAL MODELING PROBLEM

Our modeling problem requires building parameterized thermal models considering both time and other variable param