Parallel Evaluation of Multi-Semi-Joins
Journal
Proceedings of the VLDB Endowment
Journal Volume
9
Journal Issue
10
Pages
732-743
Date Issued
2016
Author(s)
Daenen, Jonny
Neven, Frank
Vansummeren, Stijn
TONY TAN
Abstract
While services such as Amazon AWS make computing power abundantly available, adding more computing nodes can incur high costs in, for instance, pay-as-you-go plans while not always significantly improving the net running time (aka wall-clock time) of queries. In this work, we provide algorithms for parallel evaluation of SGF queries in MapReduce that optimize total time, while retaining low net time. Not only can SGF queries specify all semi-join reducers, but also more expressive queries involving disjunction and negation. Since SGF queries can be seen as Boolean combinations of (potentially nested) semi-joins, we introduce a novel multisemi- join (MSJ) MapReduce operator that enables the evaluation of a set of semi-joins in one job. We use this operator to obtain parallel query plans for SGF queries that outvalue sequential plans w.r.t. net time and provide additional optimizations aimed at minimizing total time without severely affecting net time. Even though the latter optimizations are NP-hard, we present effective greedy algorithms. Our experiments, conducted using our own implementation Gumbo on top of Hadoop, confirm the usefulness of parallel query plans, and the effectiveness and scalability of our optimizations, all with a significant improvement over Pig and Hive. © 2016 VLDB Endowment.
Other Subjects
Boolean combinations; Computing nodes; Computing power; Greedy algorithms; Parallel evaluation; Parallel queries; Pay as you go; Running time; Optimization
Type
conference paper
