An Extended GPROF Profiling and Tracing Tool for Enabling Feedback-Directed Optimizations of Multithreaded Applications
S. Bartolini
Abstract: This paper presents an approach for profiling and tracing multithreaded applications with two main objectives. The first is to build on the strengths of the GPROF tool and overcome its limitations when it is used on parallel applications. The second is to gather information that can be useful for extending the existing GCC profile-driven optimizations and for investigating new ones for parallel applications. To profile a multithreaded application insightfully, our approach gathers intra-thread together with inter-thread information. For the latter, Operating System activity, as well as the usage of programmer-level synchronization mechanisms (e.g., semaphores, mutexes), has to be taken into account. The proposed approach exposes per-thread information such as the call-graph, and inter-thread information such as blocking relationships between threads, blocking time, usage patterns of synchronization mechanisms, and context switches. The approach introduces a relatively low overhead, which makes it widely applicable: less than 9% on test multithreaded benchmarks and less than 3.9x slowdown for the real MySQL executions.
1 Intro and motivation
Parallel and, in particular, multithreaded programming is very common, especially in general-purpose applications (e.g., office automation, OS services and tools, web browsers) and in special-purpose systems like web- and DB-servers. It is also gaining importance in the embedded domain, due to the market demand for increasingly complex portable applications and the technological offer of increasingly powerful devices. In addition, the trend towards on-chip parallel architectures reinforces the general interest in managing parallel applications along the entire software development process (i.e., from the design and programming phases down to the compiling, optimizing, debugging, testing, and running phases), even though this is far more complicated than for sequential applications [1].
The simple, but still very useful, profiling capabilities provided by the GNU gprof tool [11] for mono-process, mono-threaded applications cannot gather insightful information for multithreaded ones, for two main reasons: a) the collected information is per-process and therefore cannot reveal thread-specific behavior; b) there is no way to gather inter-thread information, which relates to both cooperation and competition for shared resources that the threads access through the Operating System (OS) synchronization primitives (e.g., semaphores). In order to tune the performance of applications through specific optimizations [5] (manually and/or automatically), the profile of each thread has to be available, as well as specific information on the interaction between threads. For instance, some feedback-directed optimizations for cache performance, like that of Pettis and Hansen [6], are already present in GCC and rely on the function call-graph, which gprof can collect for mono-threaded applications. Additional statistics for the analysis of the temporal and spatial locality of functions, which could enable more sophisticated optimizations [7][8][9], are still missing even for mono-threaded applications. For multithreaded applications, the gprof tool collects statistics only on the main thread, which may account for a negligible part of the executed instructions and of the execution time of the application. This work aims to provide a profiling framework that lays the foundations for the profiling/tracing of multithreaded applications, so that existing and, possibly, new feedback-directed optimizations can be investigated. In addition, for debugging and testing purposes, the history of the parallel execution should be made available at a granularity that allows following the execution through each function of each thread, inspecting the scheduling/descheduling events in the OS, and going through the synchronization operations.
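To make limitation (a) concrete, the following minimal sketch contrasts a per-process call graph with per-thread call graphs built from the same call events. The event format `(thread_id, caller, callee)` and all function names are hypothetical, chosen only for illustration; they are not taken from gprof or from the tool described in this paper.

```python
from collections import Counter, defaultdict

# Hypothetical call events: (thread_id, caller, callee).
EVENTS = [
    (1, "main", "parse"),
    (1, "main", "spawn_workers"),
    (2, "worker", "compress"),
    (2, "worker", "compress"),
    (3, "worker", "send"),
    (3, "worker", "compress"),
]


def per_process_graph(events):
    """Count call arcs for the whole process, ignoring the thread."""
    return Counter((caller, callee) for _, caller, callee in events)


def per_thread_graphs(events):
    """Count call arcs separately for each thread."""
    graphs = defaultdict(Counter)
    for tid, caller, callee in events:
        graphs[tid][(caller, callee)] += 1
    return dict(graphs)


if __name__ == "__main__":
    print("per-process:", dict(per_process_graph(EVENTS)))
    for tid, graph in sorted(per_thread_graphs(EVENTS).items()):
        print("thread", tid, dict(graph))
```

In this toy data, the per-process graph reports three calls from `worker` to `compress` but cannot tell that thread 2 issued two of them and thread 3 only one. That attribution is the kind of thread-specific detail a per-process profile loses.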
Essentially, in order to have a precise snapshot of the runtime behavior of a multithreaded application, the parallel execution has to be recorded in such a way that both the thread activities and the interaction between threads, mediated by the OS, could ideally be replayed offline. These requirements are far more complicated than the corresponding mono-process ones, especially because of the tight interaction between the application and the OS services, which makes it necessary to investigate the behavior of the OS itself during the application execution. Another crosscutting issue is that the profiling activity should have low overhead, because parallel applications tend to be complex (a large slowdown is typically not affordable) and, in particular, can be very sensitive to the probe effect of profiling itself, which may artificially alter the execution time of specific code fragments and, consequently, the relative speed of the concurrent activities involved.
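As an illustration of what replaying a recorded execution offline could look like, the sketch below walks a small trace of mutex events and derives the blocking time of each thread and which thread it was blocked by. The record layout `(timestamp, thread_id, event, mutex_name)` and the event names `request`, `acquire`, and `release` are assumptions made for this example; they do not reflect the actual trace format of the tool.

```python
from collections import defaultdict

# Hypothetical trace records: (timestamp, thread_id, event, mutex_name).
TRACE = [
    (0, 1, "request", "m"),
    (0, 1, "acquire", "m"),
    (2, 2, "request", "m"),
    (5, 1, "release", "m"),
    (5, 2, "acquire", "m"),
    (7, 2, "release", "m"),
]


def blocking_summary(trace):
    """Replay the trace; return (blocked time per thread, wait per blocker pair)."""
    holder = {}    # mutex -> thread currently holding it
    pending = {}   # (thread, mutex) -> (request time, holder at request time)
    blocked_time = defaultdict(int)
    blocked_by = defaultdict(int)  # (waiter, holder) -> accumulated wait
    for ts, tid, event, mutex in trace:
        if event == "request":
            pending[(tid, mutex)] = (ts, holder.get(mutex))
        elif event == "acquire":
            start, blocker = pending.pop((tid, mutex))
            wait = ts - start
            if blocker is not None and wait > 0:
                blocked_time[tid] += wait
                blocked_by[(tid, blocker)] += wait
            holder[mutex] = tid
        elif event == "release":
            if holder.get(mutex) == tid:
                del holder[mutex]
    return dict(blocked_time), dict(blocked_by)


if __name__ == "__main__":
    times, pairs = blocking_summary(TRACE)
    print("blocked time per thread:", times)
    print("blocked-by relationships:", pairs)
```

For simplicity, the sketch attributes the whole wait to the thread that held the mutex when the request was issued. A real trace would also need OS scheduling events to separate time spent blocked from time spent descheduled.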