Consider the two relations R(r1,r2,r3) and S(s1,s2,s3). The tuples of R have a size of 1100B, the tuples of S have a size of 128B. The cardinalities of R and S are: |R| = 100'000, |S| = 100'000'000.
R and S are stored on 20 nodes in a cluster used for MapReduce. The data of both relations is distributed uniformly across all nodes. The number of mappers = number of reducers = number of nodes Mappers and reducers are co-located in the nodes
Given is the following SQL query:
SELECT *
FROM R, S
WHERE R.r1 = S.s1;
Calculate how much data is transferred during a MapReduce broadcast join and a MapReduce partition join. You do not need to consider sending key-value pairs and simply assume that the record size accounts for this information. Partitioning is performed with a hash and modulo operation.
Now assume there is filtering involved (S.s2 < X) in the Map phase. The optimizer can chose between a partition-based and a broadcast join strategy. The amount of shipped data depends on the local predicate (S.s2 < X).
Compute the selectivity for which the broadcast strategy and the partition-based strategy transfer the same amount of data. Give the number with 4 decimal place precision. Remember: Selectivity is the fraction of tuples that match the filter predicate.
Cost of the broadcast join:
Send all data from R to all nodes. Each node has 5000 R tuples (100'000 / 20).
1100 bytes per record * (5000 tuple per node * 19 nodes to send the data to) * 20 nodes that send their data
==> 1100 byte * 5000 * 19 * 20 = 2.09 GB
======================
Cost of partition join:
Partition R and S. Each node has 5'000'000 S tuples (100'000'000 / 20).
General form: record size * # nodes sending data * (# tuples per node / # nodes * (# nodes - 1))
Cost R: 1100 byte * 20 * (5000 / 20 * 19) = 0.1045 GB
Here, (5000 / 20 * 19) says that each node has to send one twentieth of their data to 19 other nodes. As we have 20 nodes, each of them does this, so 20 * (5000 / 20 * 19).
Cost S: 128 byte * 20 * (5'000'000 / 20 * 19) = 12.16 GB
==> total cost = cost R + cost S = 0.1045 GB + 12.16GB = 12.2645 GB
======================
Selectivity:
When we filter on S, the partition join sends less data. To find the selectivity that sets the cost of both joins about equal, we can put them into a short equation:
cost broadcast join = selectivity * cost S in partition join + cost R in partition join
==> 2.09 GB = (x * 12.16GB) + 0.1045 GB
If we solve for x, we receive:
x = (2.09GB - 0.1045GB) / 12.16 GB = 0.1633