Big Data Systems HPI

Quiz 3 - MapReduce II & Key-Value Stores


Question

Consider the two relations R(r1,r2,r3) and S(s1,s2,s3). The tuples of R have a size of 1100B, the tuples of S have a size of 128B. The cardinalities of R and S are: |R| = 100'000, |S| = 100'000'000.

R and S are stored on 20 nodes in a cluster used for MapReduce. The data of both relations is distributed uniformly across all nodes. The number of mappers = number of reducers = number of nodes Mappers and reducers are co-located in the nodes

Given is the following SQL query:

SELECT *
FROM R, S
WHERE R.r1 = S.s1;

Calculate how much data is transferred during a MapReduce broadcast join and a MapReduce partition join. You do not need to consider sending key-value pairs and simply assume that the record size accounts for this information. Partitioning is performed with a hash and modulo operation.

Now assume there is filtering involved (S.s2 < X) in the Map phase. The optimizer can chose between a partition-based and a broadcast join strategy. The amount of shipped data depends on the local predicate (S.s2 < X).

Compute the selectivity for which the broadcast strategy and the partition-based strategy transfer the same amount of data. Give the number with 4 decimal place precision. Remember: Selectivity is the fraction of tuples that match the filter predicate.


Answer

Cost of the broadcast join:

Send all data from R to all nodes. Each node has 5000 R tuples (100'000 / 20).

1100 bytes per record * (5000 tuple per node * 19 nodes to send the data to) * 20 nodes that send their data

==> 1100 byte * 5000 * 19 * 20 = 2.09 GB

======================

Cost of partition join:

Partition R and S. Each node has 5'000'000 S tuples (100'000'000 / 20).

General form: record size * # nodes sending data * (# tuples per node / # nodes * (# nodes - 1))

Cost R: 1100 byte * 20 * (5000 / 20 * 19) = 0.1045 GB

Here, (5000 / 20 * 19) says that each node has to send one twentieth of their data to 19 other nodes. As we have 20 nodes, each of them does this, so 20 * (5000 / 20 * 19).

Cost S: 128 byte * 20 * (5'000'000 / 20 * 19) = 12.16 GB

==> total cost = cost R + cost S = 0.1045 GB + 12.16GB = 12.2645 GB

======================

Selectivity:

When we filter on S, the partition join sends less data. To find the selectivity that sets the cost of both joins about equal, we can put them into a short equation:

cost broadcast join = selectivity * cost S in partition join + cost R in partition join

==> 2.09 GB = (x * 12.16GB) + 0.1045 GB

If we solve for x, we receive:

x = (2.09GB - 0.1045GB) / 12.16 GB = 0.1633




Comments

*19