Abstract: … collections is not very large (roughly N1 N2 < 10000 B) and both document collections are too large for either to be held entirely in memory, then VVM (the sequential version) can outperform the other algorithms.

4. For most other cases, the simple algorithm HHNL performs very well.

5. The costs of the random versions of these algorithms depict the worst-case scenario, in which the I/O devices are busy satisfying different obligations at the same time. Except for VVM, these costs have no impact on the ranking of the algorithms.

Overall, the simulation results match well with our analysis in Section 5.4. Since no one algorithm is definitely better than all other algorithms in all circumstances, it is desirable to construct an integrated algorithm that can automatically determine which algorithm to use given the statistics of the two collections … predicates on non-textual attributes). A sketch of an integrated algorithm can be found in [11].

7. Concluding remarks

In this paper, we presented and analyzed three algorithms for processing joins between attributes of textual type. From analysis and simulation, we identified, for each algorithm, the type of input document collections with which the algorithm is likely to perform well. More specifically, we found that HVNL can be very competitive when the number of documents in one of the two document collections is or becomes very small, and that VVM can perform very well when the number of documents in each of the two collections is not very large and both document collections are too large for either to be held entirely in memory. In other cases, HHNL is likely to be the top performer. Since no one algorithm is definitely better than all other algorithms, we proposed constructing an integrated algorithm from the basic algorithms, such that a particular basic algorithm is invoked if it has the lowest estimated cost.

We also indicated that the standardization of term numbers will be very useful in multidatabase environments. Further studies in this area include (1) investigating the impact of the availability of clusters on the performance of each algorithm; (2) developing cost formulas that include CPU cost and communication cost; (3) developing algorithms that process textual joins in parallel; and (4) conducting more detailed simulations and experiments. Due to the large number of parameters in the cost formulas of the algorithms presented, it is very difficult to compare the performance of these …
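The selection step of the integrated algorithm described above, invoking whichever basic algorithm has the lowest estimated cost, might be sketched as follows. This is only an illustration under stated assumptions. The `CollectionStats` fields, the `CostEstimator` signature, and the `choose_algorithm` helper are hypothetical names introduced here. The actual cost formulas for HHNL, HVNL, and VVM are not reproduced in this excerpt, so they are passed in as caller-supplied functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class CollectionStats:
    """Hypothetical per-collection statistics; the field names are assumptions."""
    num_documents: int
    size_in_pages: int


# A cost estimator maps the statistics of both collections and a memory
# budget (in pages) to an estimated cost. The real formulas are not given
# in this excerpt, so callers must supply them.
CostEstimator = Callable[[CollectionStats, CollectionStats, int], float]


def choose_algorithm(
    estimators: Dict[str, CostEstimator],
    c1: CollectionStats,
    c2: CollectionStats,
    memory_pages: int,
) -> Tuple[str, float]:
    """Return the name and estimated cost of the cheapest basic algorithm."""
    if not estimators:
        raise ValueError("at least one cost estimator is required")
    # Evaluate every candidate's estimated cost for this pair of collections.
    costs = {name: est(c1, c2, memory_pages) for name, est in estimators.items()}
    # Pick the candidate with the lowest estimate.
    best = min(costs, key=costs.get)
    return best, costs[best]
```

A caller would register one estimator per basic algorithm (for example under the keys "HHNL", "HVNL", and "VVM") and run whichever name `choose_algorithm` returns. Keeping the estimators as plain functions means the cost formulas can be replaced or extended, for instance with CPU or communication cost, without changing the selection logic.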


Performance Analysis of Three Text-Join Algorithms
