SumatraTT: Towards a Universal Data Preprocessor
P. Aubrecht, F. Železný, P. Miksovský, Olga Štěpánková
Abstract: In the practice of data mining (DM) and data warehousing (DWH), real-life data arrive in many different formats, and unless they are put into an acceptable shape, even the most intelligent DM/DWH tool is useless. SumatraTT (Transformation Tool) is an original universal data pre-processing tool that allows users to access and transform data stored in various types of data sources (e.g. plain text, SQL, etc.). We briefly review the concept of the system and summarize its recent developments. The paper outlines the connectivity with inductive logic programming (ILP) systems and then reports on more recently added features: new data interfaces, scripting features, and templates. The usage of SumatraTT is briefly demonstrated on an example application. After touching on near-future plans, we finally discuss some questions that typically arise when SumatraTT is used for the first time.

1 OUR MOTIVATION AND GOALS

DM algorithms [6] are being designed by researchers and software houses all over the world. There are many of them, and the supply is continuously growing. Their available implementations differ in principles as well as in tiny details such as the format used for the input data. Moreover, the data subjected to DM have most often not been collected primarily for DM purposes; on the contrary, they serve e.g. as a company archive. Consequently, the format of such data most often does not meet the requirements of a specific DM algorithm.

The challenge of a DM task lies in finding the algorithm that will reveal interesting observations in the considered data. But to reach this goal, many experiments have to be done. One cannot decide in advance which set of DM tools or which derived attributes will prove most useful for the given problem. Thus the original data have to be processed or transformed in different ways to make them usable by the chosen DM algorithms.
The format of the data has to be changed, and the data have to be cleaned, filtered, aggregated, etc. This is the purpose of data transformation systems, which have recently appeared as independent software tools supporting the DM process itself [14]. This is an important step that simplifies data preparation and supports experiments with real-life data.

The common weakness of state-of-the-art data transformation systems is their insufficient generality. Our goal is to overcome this problem by designing and developing a system
• that allows for virtually any customization with respect to different data standards and requirements on the transformation,
• but at the same time provides ultimate ease of use in cases where only standard procedures are required.

The former goal can be achieved by providing a data-processing-oriented scripting language, and the latter by providing templates of common procedures and standard interfaces to many kinds of data sources. These ideas form the design principles of the data preprocessing tool SumatraTT, described briefly in the next section.

Most industries benefit from appropriate standardization. The positive reaction of the research community to the activity of the PMML group proves that this is also the case for the research field concerned with decision support systems. The Predictive Model Markup Language (PMML) is being developed to simplify the exchange or sharing of results "between compliant vendors' applications ... so that proprietary issues and incompatibilities are no longer a barrier" (http://www.dmg.org/pmmlspecs_v2/). A similar view can be taken of data transformations. We hope that our study of data transformations using SumatraTT will contribute to the development of a standard for data transformation, namely the Data Transformation Markup Language (DTML), supporting e.g. reuse of the same data by different algorithms through seamless import, rapid development of derived attributes, etc.

2 THE CONCEPT OF SUMATRA

SumatraTT (Transformation Tool) is a metadata-driven, platform-independent, extensible, and universal data processing tool [3]. These features have been achieved by building the tool as an interpreter of a transformation-oriented scripting language called Sumatra [2]. The Sumatra language is a fully interpreted Java-like language combining data access, metadata access, and common programming constructs. Furthermore, it supports RAD (Rapid Application Development) by means of a library of reusable transformation templates.

The principal scheme of SumatraTT is shown in Figure 1. As can be seen in the figure, the central part of SumatraTT is the metadata repository module. The repository plays two roles. First, it is the central storage holding descriptions of all data sources and data transformations to be used. Second, it contains data objects that connect the abstract data access level in the Sumatra interpreter with real-life data sources. This intermediate connection helps to unify access to very different data sources (e.g. SQL-based data sources, plain text files, etc.). Such unification makes the development of transformation scripts easier and independent of the data source. Moreover, it separates the transformation "logic" from data connection problems.

For a very complicated data pre-processing task, the development of a data transformation script can be rather time-consuming. SumatraTT speeds up this process by means of reusable transformation templates. The idea of reusable templates is based on a library of solved types of tasks. For example, suppose there is a data set containing time series and we need to calculate a statistical characterization of the data. If this is carried out for the first time, a new template has to be developed.
But the next time, the statistical transformation script can be developed via parametric modification of the existing template in a fraction of the time required before.

Every pre-processing task realized using SumatraTT consists of a design phase and a run-time phase. This corresponds to a client-server architecture in which the design phase consists of the definition of all data sources and the development of transformation scripts on the client side. For a typical user, who is an expert in data mining or data warehousing but not a programmer, the design phase can be carried out using a graphical user interface. The GUI allows both the data definition and the script development to be carried out interactively by simply clicking through wizards. The run-time phase, on the other hand, corresponds to script execution on the server side. From the user's perspective, the execution can be invoked immediately or scheduled for a later run.
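
To make the two central ideas of this section more concrete (a uniform data access level that hides the kind of data source, and a transformation template reused through its parameters), the following minimal sketch may help. It is written in Python, not in the Sumatra language, and it does not reproduce the SumatraTT API: the names DataSource, TextSource, SqlSource, and statistics_template are hypothetical and chosen only for illustration, as is the small time-series data set.

```python
import csv
import io
import sqlite3
from statistics import mean
from typing import Any, Callable, Dict, Iterator

Record = Dict[str, Any]


class DataSource:
    """Hypothetical abstract data access level: every source yields records as dicts."""

    def records(self) -> Iterator[Record]:
        raise NotImplementedError


class TextSource(DataSource):
    """Plain-text (delimited) data held in a string; the first row names the columns."""

    def __init__(self, text: str, delimiter: str = ",") -> None:
        self.text = text
        self.delimiter = delimiter

    def records(self) -> Iterator[Record]:
        yield from csv.DictReader(io.StringIO(self.text), delimiter=self.delimiter)


class SqlSource(DataSource):
    """SQL-based data reached through an open sqlite3 connection and a query."""

    def __init__(self, connection: sqlite3.Connection, query: str) -> None:
        self.connection = connection
        self.query = query

    def records(self) -> Iterator[Record]:
        cursor = self.connection.execute(self.query)
        names = [column[0] for column in cursor.description]
        for row in cursor:
            yield dict(zip(names, row))


def statistics_template(column: str) -> Callable[[DataSource], Dict[str, float]]:
    """Hypothetical reusable template: the column name is its only parameter."""

    def transform(source: DataSource) -> Dict[str, float]:
        # The transformation logic sees only records, never the connection details.
        values = [float(record[column]) for record in source.records()]
        return {
            "count": len(values),
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
        }

    return transform


if __name__ == "__main__":
    # The same illustrative time series, stored once as text and once in SQL.
    text_source = TextSource("t,value\n1,2.0\n2,4.0\n3,6.0\n")

    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE series (t INTEGER, value REAL)")
    connection.executemany(
        "INSERT INTO series VALUES (?, ?)", [(1, 2.0), (2, 4.0), (3, 6.0)]
    )
    sql_source = SqlSource(connection, "SELECT t, value FROM series")

    value_statistics = statistics_template("value")
    print(value_statistics(text_source))
    print(value_statistics(sql_source))
```

In this sketch the template is written once and then instantiated with a different column name or applied to a different kind of source without any change to the transformation logic, which is the kind of separation between transformation "logic" and data connection that the repository is described as providing.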