Abstract: … collections is not very large (roughly N1 · N2 < 10000 B) and both document collections are too large for either to be held entirely in memory, then VVM (the sequential version) can outperform the other algorithms. 4. For most other cases, the simple algorithm HHNL performs very well. 5. The costs of the random versions of these algorithms depict the worst-case scenario, in which the I/O devices are busy satisfying different obligations at the same time. Except for VVM, these costs have no impact on the ranking of these algorithms. Overall, the simulation results match our analysis in Section 5.4 well.

Since no one algorithm is definitely better than all the others in all circumstances, it is desirable to construct an integrated algorithm that can automatically determine which algorithm to use given the statistics of the two collections [...] predicates on non-textual attributes). A sketch of an integrated algorithm can be found in [11].

7. Concluding remarks

In this paper, we presented and analyzed three algorithms for processing joins between attributes of textual type. From analysis and simulation, we identified, for each algorithm, the type of input document collections with which the algorithm is likely to perform well. More specifically, we found that HVNL can be very competitive when the number of documents in one of the two document collections is or becomes very small, and that VVM can perform very well when the number of documents in each of the two collections is not very large and both document collections are too large for either to be held entirely in memory. In other cases, HHNL is likely to be the top performer. Since no one algorithm is definitely better than all the others, we proposed the idea of constructing an integrated algorithm consisting of the basic algorithms, such that a particular basic algorithm is invoked if it has the lowest estimated cost.

We also indicated that the standardization of term numbers will be very useful in multidatabase environments. Further studies in this area include (1) investigating the impact of the availability of clusters on the performance of each algorithm; (2) developing cost formulas that include CPU cost and communication cost; (3) developing algorithms that process textual joins in parallel; and (4) more detailed simulation and experiments. Due to the large number of parameters in the cost formulas of the algorithms presented, it is very difficult to compare the performance of these …
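
The integrated algorithm described in the abstract, which invokes whichever basic algorithm has the lowest estimated cost, might be sketched as follows. This is only an illustrative sketch under stated assumptions: the `CollectionStats` fields, the `choose_algorithm` function, and the constant placeholder estimators are hypothetical names introduced here. In particular, the estimators are not the cost formulas from the paper; a real implementation would supply its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class CollectionStats:
    """Hypothetical per-collection statistics; field names are illustrative."""
    num_documents: int
    size_in_pages: int


# A cost estimator maps (stats of C1, stats of C2, memory size B) to an
# estimated cost. The actual formulas are not reproduced here.
CostEstimator = Callable[[CollectionStats, CollectionStats, int], float]


def choose_algorithm(
    estimators: Dict[str, CostEstimator],
    c1: CollectionStats,
    c2: CollectionStats,
    memory_pages: int,
) -> Tuple[str, float]:
    """Return the name and estimated cost of the cheapest basic algorithm."""
    if not estimators:
        raise ValueError("at least one cost estimator is required")
    # Evaluate every estimator once, then pick the minimum.
    costs = {
        name: estimate(c1, c2, memory_pages)
        for name, estimate in estimators.items()
    }
    best = min(costs, key=costs.get)
    return best, costs[best]


# Placeholder estimators (assumption): arbitrary constants that only
# demonstrate the selection step, not real cost behavior.
placeholder_estimators: Dict[str, CostEstimator] = {
    "HHNL": lambda c1, c2, b: 120.0,
    "HVNL": lambda c1, c2, b: 95.0,
    "VVM": lambda c1, c2, b: 140.0,
}


if __name__ == "__main__":
    c1 = CollectionStats(num_documents=1000, size_in_pages=500)
    c2 = CollectionStats(num_documents=10, size_in_pages=5)
    name, cost = choose_algorithm(placeholder_estimators, c1, c2, memory_pages=100)
    print(f"selected {name} with estimated cost {cost}")
```

Keeping the estimators as caller-supplied functions means the selection logic stays the same if the cost formulas are later extended, for example with CPU or communication cost as suggested under further studies.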

Performance Analysis of Three Text-Join Algorithms

Performance Evaluation for Distributed Join Based on MapReduce.

Optimization Factor Analysis Of Large-Scale Join Queries On Different Platforms

Generalization for Database Integration in a Multi-Database System.

Adaptive Multi-join Query Processing in PDBMS

A Multi-Table Join Algorithm for Data Warehouse Query Processing

OLAP Query Processing Algorithm Based on Relational Storage.

Analysis of Algorithms for XML Database Structural Join

The Sort-Merge-Shrink join

Join Algorithms Based on Tertiary Storage

Parallel Join Algorithms on a Network of Workstations.

Similarity Joins Of Text With Incomplete Information Formats

An Effective High-Performance Multiway Spatial Join Algorithm with Spark

Graphical Join: A New Physical Join Algorithm for RDBMSs

Processing Long Queries Against Short Text

Skew Strikes Back: New Developments in the Theory of Join Algorithms

Efficient Join Synopsis Maintenance for Data Warehouse.

Utilizing the column imprints to accelerate no‐partitioning hash joins in large‐scale edge systems

Join Processing for Graph Patterns: An Old Dog with New Tricks

Multi-query Join Algorithm Based on Sharing for XML Publish/subscribe Data Stream Systems

Database Query Optimization Based On Parallel Ant Colony Algorithm