Generalization for Database Integration in a Multi-Database System. 6. Simulation Results Group 5: Both Collection C1 and Collection C2 Use

References,U Dayal,H-Y Hwang,Ieee Tse,W Du,R Krishnamurthy,M C Shan,D Harman,W Meng,C Yu,W Kim,G Wang,T Pham,S Dao,A Kamada,Y-H Chang,Trans,W Wang,N Rishe,Perfor,C Yu,Y Zhang,Wsj Fr,Doe,Minfi
1996-01-01
Abstract: collections is not very large (roughly N1 * N2 < 10000 * B) and both document collections are so large that neither can be held entirely in memory, then VVM (the sequential version) can outperform other algorithms.

4. For most other cases, the simple algorithm HHNL performs very well.

5. The costs of the random versions of these algorithms depict the worst-case scenario, in which the I/O devices are busy satisfying different obligations at the same time. Except for VVM, these costs have no impact on the ranking of these algorithms.

Overall, the simulation results match well with our analysis in Section 5.4. Since no one algorithm is definitely better than all other algorithms in all circumstances, it is desirable to construct an integrated algorithm that can automatically determine which algorithm to use given the statistics of the two collec-predicates on non-textual attributes). The sketch of an integrated algorithm can be found in [11].

7. Concluding remarks

In this paper, we presented and analyzed three algorithms for processing joins between attributes of textual type. From analysis and simulation, we identified, for each algorithm, the type of input document collections with which the algorithm is likely to perform well. More specifically, we found that HVNL can be very competitive when the number of documents in one of the two document collections is or becomes very small, and VVM can perform very well when the number of documents in each of the two collections is not very large and both document collections are so large that neither can be held entirely in memory. In other cases, HHNL is likely to be the top performer. Since no one algorithm is definitely better than all other algorithms, we proposed the idea of constructing an integrated algorithm consisting of the basic algorithms such that a particular basic algorithm is invoked if it has the lowest estimated cost.
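The selection step of such an integrated algorithm might be sketched as follows. This is only an illustration of the "invoke the basic algorithm with the lowest estimated cost" idea, not the integrated algorithm sketched in [11]. The names `CollectionStats`, `CostEstimator`, and `choose_algorithm` are hypothetical, and the cost estimators are supplied by the caller because the paper's actual cost formulas are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class CollectionStats:
    """Hypothetical summary statistics for one document collection."""
    num_documents: int
    size_in_pages: int


# A cost estimator maps (stats of C1, stats of C2, memory pages) to a cost.
# The real formulas for HHNL, HVNL, and VVM are assumed to be provided elsewhere.
CostEstimator = Callable[[CollectionStats, CollectionStats, int], float]


def choose_algorithm(
    estimators: Dict[str, CostEstimator],
    c1: CollectionStats,
    c2: CollectionStats,
    memory_pages: int,
) -> Tuple[str, float]:
    """Return the name and estimated cost of the cheapest basic algorithm."""
    if not estimators:
        raise ValueError("at least one cost estimator is required")
    # Evaluate every estimator once on the given statistics.
    costs = {name: est(c1, c2, memory_pages) for name, est in estimators.items()}
    # Pick the algorithm whose estimated cost is smallest.
    best = min(costs, key=costs.get)
    return best, costs[best]
```

A caller would register one estimator per basic algorithm (for example under the keys "HHNL", "HVNL", and "VVM") and then run whichever algorithm `choose_algorithm` names.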
We also indicated that the standardization of term numbers will be very useful in multidatabase environments. Further studies in this area include (1) investigating the impact of the availability of clusters on the performance of each algorithm; (2) developing cost formulas that include CPU cost and communication cost; (3) developing algorithms that process textual joins in parallel; and (4) more detailed simulations and experiments. Due to the large number of parameters in the cost formulas of the algorithms presented, it is very difficult to compare the performance of these …