Latency in Reconfigurable Message-Passing Environments
A. Afsahi, Nikitas J. Dimopoulos
Abstract: Communication overhead is one of the most important factors affecting the performance of message passing multicomputers. We present evidence (through the analysis of several parallel benchmarks) that there exists communications locality, and that this locality is "structured". We have devised a number of heuristics that can "predict" the target of subsequent communication requests. This technique can be applied to reconfigurable interconnects to hide the communications latency by reconfiguring the interconnect concurrently to the computation. By comparing the inter-communication computation times of a number of parallel benchmarks with some specific reconfiguration times, we argue that the computation interval can be used to hide the concurrent reconfiguration of the interconnect, and present the performance enhancements of the proposed heuristics.

1.0 Introduction

Message-passing multicomputers are composed of a number of computing nodes that communicate by exchanging messages through an interconnect. Optics is ideally suited for implementing interconnection networks because of its superior characteristics over electronics [20,15]. Various optical interconnection networks including the works in [5,12] have been proposed.

We have introduced [1] a reconfigurable optical network, RON(k, N), consisting of N computing nodes. A node is capable of connecting directly to any other node and can establish k simultaneous connections. Connections are established by reconfiguring the interconnect and remain established until they are explicitly destroyed. A block diagram of the network is shown in Figure 1. Circuit-switching with k-port or single-port models with full-duplex communication is assumed.

Various implementation technologies exist to embody the above abstract model. Such technologies include computer generated holograms and deformable mirrors for switching, frequency hopping for coding, wavelength tuning for transceivers, and VCSELs and SEEDs for photon generation or modulation. We shall assume that one or more of these technologies will be used to implement the proposed interconnect. Under such an implementation, the various overheads associated with the reconfiguration of the network are lumped together as the reconfiguration delay d.

FIGURE 1. RON(k, N), a massively parallel computer interconnected by a complete free-space optical network. (Diagram labels: beam routers, potential links, active links, interconnect, nodes.)

The node-to-node communication delay is modeled as $T = d + t_s + l_m \tau$, with $d$ being the reconfiguration delay, $t_s$ the setup time, $l_m$ the length of the message, and $\tau$ the per unit transmission time. The setup time $t_s$ [7] and the reconfiguration delay $d$ are the major contributions to the communication delay $T$, both being of the order of several μs. Several researchers are working to minimize this cost by user-level messaging techniques such as active messages [8] or fast messages [16]. In this work we are particularly interested in the techniques that hide the reconfiguration delay, $d$.

It is obvious that if a link is already in place, then the configuration phase does not enter the picture, with a commensurate saving in the message transmission time. This can be accomplished if the target of the communication operation can be "predicted" before the message itself is available. If the communication operation is regular and known, then it is possible to determine the destinations and the instances at which these shall be used [1]. However, if the algorithm is not known, the approach mentioned above cannot be used.

In the context of shared memory programming, there are several works on hardware-controlled and software-controlled prefetching of the next shared data request [14,18,22].
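As an illustration of the delay model introduced above, $T = d + t_s + l_m \tau$, the following Python sketch computes the delay with and without a link already in place, and the mean delay when a given fraction of requests is predicted correctly. It is only a sketch under stated assumptions: the function names, the `hit_fraction` parameter, the assumption that a correct prediction hides the reconfiguration delay completely, and all numeric values are hypothetical and are not taken from the measurements reported in this paper.

```python
def node_to_node_delay(d, ts, lm, tau, link_in_place=False):
    """Delay T = d + ts + lm * tau for one message.

    d   : reconfiguration delay
    ts  : setup time
    lm  : message length (in transmission units)
    tau : per unit transmission time
    If the link is already in place, the reconfiguration delay d is
    assumed to be fully hidden (an assumption of this sketch).
    """
    reconfiguration = 0.0 if link_in_place else d
    return reconfiguration + ts + lm * tau


def mean_delay(hit_fraction, d, ts, lm, tau):
    """Mean delay when a fraction hit_fraction (0..1) of requests
    finds its link already established by a correct prediction."""
    hit = node_to_node_delay(d, ts, lm, tau, link_in_place=True)
    miss = node_to_node_delay(d, ts, lm, tau, link_in_place=False)
    return hit_fraction * hit + (1.0 - hit_fraction) * miss


# Hypothetical values, all in microseconds (not measured data).
D, TS, LM, TAU = 5.0, 5.0, 1024, 0.01
print("miss:", node_to_node_delay(D, TS, LM, TAU))
print("hit :", node_to_node_delay(D, TS, LM, TAU, link_in_place=True))
print("mean at 90% correct predictions:", mean_delay(0.9, D, TS, LM, TAU))
```

Under these assumptions, the saving per correctly predicted request equals the reconfiguration delay d, so the benefit grows with the fraction of correct predictions.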
In the context of message passing programming, many parallel algorithms are built from loops consisting of computation and communication phases. Hence, communication patterns may be repetitive. This has motivated researchers to find the communications locality properties of parallel applications [10,11].

By communications locality, we mean that if a certain source-destination pair has been used it will be re-used with high probability by a portion of code that is "near" the place that was used earlier, and that it will be re-used in the near future. If communications locality exists in parallel applications, then it is possible to cache the configuration that a previous communication request has made and reuse it at a later stage. Caching in the context of this discussion will mean that a communication channel will remain established until it is explicitly destroyed.

This work has two parts. The first part is an extension of our work in [3] and explores the effect that a number of heuristics has in predicting the target of a communication request. For these studies, we have used the MPI [13] implementation of the NAS parallel benchmarks suite (NPB) (version 2.3, W class) [4], the Parallel Spectral Transform Shallow Water Model (PSTSWM) parallel benchmark (version 6.2) [19], and the pure QCD Monte Carlo Simulation Code with MPI (QCDMPI) parallel benchmark (version 1.4) [9] on an IBM SP2. We wrote our own profiling codes using the wrapper facility of the MPI to gather the communication traces and the timing profiles of these applications. It is worth mentioning that the proposed heuristics can be used in any circuit-switched networks including the wave switching [6] and [21].

The second part considers the execution time of the computation phases of these parallel benchmarks on an IBM SP2 using its high performance switch under the user space mode when we had an exclusive access to the system. We show that computation times are sufficiently large for reconfigurations, proceeding concurrently with computations, to terminate before the computation, and we present the performance enhancements achieved because of the latency hiding power of the heuristics developed.

Section 2 analyzes the proposed heuristics. In section 3, we obtain the inter-send computation times for the benchmarks, and present the performance enhancements of the proposed heuristics. Finally, we conclude with section 4.

2.0 Latency hiding heuristics

The heuristics proposed in this section predict the destination node of a subsequent communication request based on a past history of communication patterns. Our heuristics would execute on each node of the multicomputer, and predict the destination nodes for communications originating at the node on which they reside. We use the hit ratio to establish and compare the performance of these heuristics. As a hit ratio, we define the percentage of times that the predicted destination node was correct.

A number of heuristics were proposed and studied in a previous work [2,3]. These include

1. The Least Recently Used (LRU) [10], First-In-First-Out (FIFO), and Least Frequently Used (LFU) heuristics, all of which maintain a set of k message destinations [3]. If the next destination is already in the set, then a hit is recorded. Otherwise, a miss is recorded and the new destination replaces one of the destinations in the set according to the adopted LRU, FIFO or LFU strategy.

2. The Single-cycle heuristic [3] implements a simple cycle discovery algorithm. Starting with a cycle-head node, the sequence of requests is logged until the cycle-head node is requested again. This stored sequence constitutes a cycle, and can be used to predict the subsequent requests. If the requested node does not coincide with the predicted one, then a new cycle formation stage commences with the cycle-head being the node that caused the miss.

3. The Single-cycle2 heuristic [3] is identical to the Single-cycle heuristic with the addition that during cycle formation, the previously requested node is offered as the predicted node.

Both cycle heuristics have a better performance than the LRU, FIFO and LFU heuristics under the single-port assumption.

2.1 Better-cycle and Better-cycle2 heuristics

The performance of the cycle heuristics is improved if the previously formed cycles are maintained. In the Better-cycle heuristic, we keep the last cycle associated with each cycle-head encountered. In case of a miss, if the prediction offered by the stored cycle associated with the node that caused the miss is incorrect, then a new cycle formation commences. Otherwise, the stored cycle is used to predict the subsequent requests. The state diagram of this heuristic is shown in Figure 2. This heuristic performs better than the Single-cycle and Single-cycle2 heuristics [2].

FIGURE 2. State diagram of the Better-cycle algorithm. (Labels recoverable from the diagram: Cycle-head, Miss, Hit, One-cycle-complete.)

[...] an excellent performance (hit ratios in the upper 90%) for all the benchmarks except for the CG, PSTSWM, and the QCDMPI benchmarks. The reason is that these benchmarks include send operations with a target address calculated based on loop variables. Thus, the same section of code cycles through a number of different target addresses.

The Better-cycle2 heuristic is identical to the Better-cycle heuristic with the addition that during the cycle formation and cycle revision phases the previously requested node is offered as the predicted node. The performance of this heuristic is shown in Figure 3. This heuristic has better performance than the Single-cycle and Single-cycle2 heuristics, and the Better-cycle heuristic for the BT, SP, and the QCDMPI benchmarks [2].
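To make the cycle-based prediction concrete, the Single-cycle and Single-cycle2 heuristics of Section 2.0, together with the hit ratio used to compare them, can be sketched in Python as follows. This is a minimal illustration and not the implementation used for the reported results: the class name `SingleCycle`, the `offer_previous` flag, the helper `hit_ratio`, and the choice to count a request for which no prediction is offered as a miss are all assumptions of this sketch. The traces are made-up examples, not benchmark data.

```python
class SingleCycle:
    """Sketch of the Single-cycle heuristic run on one source node.

    With offer_previous=True, the previously requested node is offered
    as the prediction during cycle formation (Single-cycle2 behaviour).
    """

    def __init__(self, offer_previous=False):
        self.offer_previous = offer_previous
        self.cycle = []       # logged requests; cycle[0] is the cycle-head
        self.forming = True   # True while a cycle is still being logged
        self.pos = 0          # index of the next predicted destination
        self.last = None      # previously requested destination

    def predict(self):
        """Return the predicted next destination, or None if none is offered."""
        if self.forming:
            return self.last if self.offer_previous else None
        return self.cycle[self.pos]

    def observe(self, dest):
        """Update the state with the destination actually requested."""
        if self.forming:
            if self.cycle and dest == self.cycle[0]:
                # Cycle-head requested again: the logged sequence is a cycle.
                self.forming = False
                self.pos = 1 % len(self.cycle)
            else:
                self.cycle.append(dest)
        elif dest == self.cycle[self.pos]:
            # Correct prediction: advance around the stored cycle.
            self.pos = (self.pos + 1) % len(self.cycle)
        else:
            # Miss: start a new cycle whose head is the node that missed.
            self.cycle = [dest]
            self.forming = True
        self.last = dest


def hit_ratio(predictor, trace):
    """Percentage of requests whose destination was predicted correctly."""
    if not trace:
        raise ValueError("trace must not be empty")
    hits = 0
    for dest in trace:
        if predictor.predict() == dest:
            hits += 1
        predictor.observe(dest)
    return 100.0 * hits / len(trace)


# Hypothetical destination traces for one source node.
repeating = [1, 2, 3] * 4
print(hit_ratio(SingleCycle(), repeating))
print(hit_ratio(SingleCycle(offer_previous=True), [5, 5, 5, 5]))
```

On a strictly repeating trace, every request after the first complete cycle is a hit in this sketch. The `offer_previous` variant differs only during cycle formation, which matters when the same destination is requested several times in a row.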