Skip to content
HeZzz
Go back

云计算技术-25fa-zb最后一课-MapReduce

云计算技术最后一课的 MapReduce 部分内容整理。

题目

Section 2.3.1: Matrix-Vector Multiplication by MapReduce

2.3.1节:通过 MapReduce 进行矩阵-向量乘法

Suppose we have an n×nn \times n matrix MM, whose element in row ii and column jj will be denoted mijm_{ij}. Suppose we also have a vector vv of length nn, whose jjth element is vjv_j. Then the matrix-vector product is the vector xx of length nn, whose iith element xix_i is given by xi=j=1nmijvjx_i = \sum_{j=1}^{n} m_{ij} \cdot v_j.

假设我们有一个 n×nn \times n 矩阵 MM,其第 ii 行第 jj 列的元素表示为 mijm_{ij}。假设我们还有一个长度为 nn 的向量 vv,其第 jj 个元素为 vjv_j。那么矩阵-向量乘积是长度为 nn 的向量 xx,其第 ii 个元素 xix_ixi=j=1nmijvjx_i = \sum_{j=1}^{n} m_{ij} \cdot v_j 给出。

Question: Using the matrix-vector multiplication described in Section 2.3.1, applied to the matrix and vector:

使用 2.3.1 节中描述的矩阵-向量乘法,应用于矩阵和向量:

[12345678910111213141516]\begin{bmatrix} 1 & 2 & 3 & 4 \\\\ 5 & 6 & 7 & 8 \\\\ 9 & 10 & 11 & 12 \\\\ 13 & 14 & 15 & 16 \end{bmatrix} [1234]\begin{bmatrix} 1 \\\\ 2 \\\\ 3 \\\\ 4 \end{bmatrix}

Apply the Map function to this matrix and vector. Then, identify all the key-value pairs that are output of Map.

应用此矩阵和向量的 Map 函数。然后,识别所有 Map 输出的键值对。

Solution:

Each mijm_{ij} is multiplied by vjv_j, and this product forms the value of a key-value pair that has key ii, the row number.

每个 mijm_{ij} 乘以 vjv_j,该乘积形成一个键值对的值,该键值对的键为 ii,即行号。

Thus, in row-major order, the sixteen key-value pairs produced are:

这样,按行优先顺序,产生的十六个键值对是:

[(1,1)(1,4)(1,9)(1,16)(2,5)(2,12)(2,21)(2,32)(3,9)(3,20)(3,33)(3,48)(4,13)(4,28)(4,45)(4,64)]\begin{bmatrix} (1,1) & (1,4) & (1,9) & (1,16) \\\\ (2,5) & (2,12) & (2,21) & (2,32) \\\\ (3,9) & (3,20) & (3,33) & (3,48) \\\\ (4,13) & (4,28) & (4,45) & (4,64) \end{bmatrix}

Reduce will sum the values corresponding to the same key, thus, results of reduce are:

(1,30),(2,70),(3,110),(4,150).(1,30), (2,70), (3,110), (4,150).

Reduce 函数将对相同键对应的值进行求和,因此,reduce 的结果是:

(1,30)(2,70)(3,110)(4,150)(1,30),(2,70),(3,110),(4,150)。


Consider a simple example:

举一个简单的例子:

Solution 1

a) Initially, Tommy has finished an implementation with Version 1 (Figure 1). He finds that the implementation can have correct results, but the performance is very slow. Why?

最开始,Tommy 完成了版本 1 的实现。他发现该实现可以得到正确的结果,但性能非常慢。为什么?

class MAPPER
  method MAP(string t, integer r)
    EMIT(string t, integer r)

class REDUCER
  method REDUCE(string t, integers [r_1, r_2, ...])
    sum0
    cnt ← 0
    for all integer r ∈ integers [r_1, r_2, ...] do
        sumsum + r
        cnt ← cnt + 1
    r_avg ← sum/cnt
    EMIT(string t, integer r_avg)

Figure 1. Computing the Mean: Version 1

It requires shuffling all key-value pairs from mappers to reducers across the network, which is highly inefficient.

它需要通过网络将所有键值对从 mappers 传输到 reducers,这效率非常低下。

Solution 2

b) Tommy wants to improve the performance using combiner. He comes out the second implementation (Version 2 in Figure 2). He finds that he can seldom get the reasonable results. Why?

Tommy 想使用 combiner 来提高性能。他提出了第二个实现。他发现很少能得到合理的结果。为什么?

class MAPPER
  method MAP(string t, integer r)
    EMIT(string t, integer r)

class COMBINER
  method COMBINE(string t, integers [r_1, r_2, ...])
    sum0
    cnt ← 0
    for all integer r ∈ integers [r_1, r_2, ...] do
        sumsum + r
        cnt ← cnt + 1
    EMIT(string t, pair (sum, cnt)) // Separate sum and count

class REDUCER
  method REDUCE(string t, pairs [(s_1, c_1), (s_2, c_2) ...])
    sum0
    cnt ← 0
    for all pair (s, c) ∈ pairs [(s_1, c_1), (s_2, c_2) ...] do
        sumsum + s
        cnt ← cnt + c
    r_avg ← sum/cnt
    EMIT(string t, integer r_avg)

Figure 2. Computing the Mean: Version 2

Solution 3

c) After careful design, Tommy finally develops an efficient and correct implementation (Version 3 in Figure 3). Analyze the correctness of the combiner and efficiency of the algorithm (i.e., why it is more efficient than Version 1).

在经过小心谨慎的设计后,Tommy 最终开发出了一个高效且正确的实现。分析 combiner 的正确性和算法的效率(即,为什么它比版本 1 更高效)。

class MAPPER
  method MAP(string t, integer r)
    EMIT(string t, pair (r, 1))

class COMBINER
  method COMBINE(string t, pairs [(s₁, c₁), (s₂, c₂) ...])
    sum0
    cnt ← 0
    for all pair (s, c) ∈ pairs [(s₁, c₁), (s₂, c₂) ...] do
        sumsum + s
        cnt ← cnt + c
    EMIT(string t, pair (sum, cnt))

class REDUCER
  method REDUCE(string t, pairs [(s₁, c₁), (s₂, c₂) ...])
    sum0
    cnt ← 0
    for all pair (s, c) ∈ pairs [(s₁, c₁), (s₂, c₂) ...] do
        sumsum + s
        cnt ← cnt + c
    r_avg ← sum/cnt
    EMIT(string t, pair (r_avg, cnt))

Figure 3. Computing the Mean: Version 3

Solution 4

d) Tommy analyzes the efficiency of Version 3, and comes out an even more efficient implementation (Version 4 in Figure 4). Why does Version 4 is even more efficient than Version 3?

Tommy 分析了版本 3 的效率,并提出了一个更高效的实现。为什么版本 4 比版本 3 更高效?

class MAPPER
  method INITIALIZE
    S ← new ASSOCIATIVEARRAY
    C ← new ASSOCIATIVEARRAY
  method MAP(string t, integer r)
    S{t} ← S{t} + r
    C{t} ← C{t} + 1
  method CLOSE
    for all term t ∈ S do
      EMIT(term t, pair (S{t}, C{t}))

Figure 4. Computing the Mean: Version 4

Summary

PS: 内容与 zb 的 PPT 内容一致,仅作整理记录之用。

中文部分由 ChatGPT 翻译,仅供参考,如有错误,欢迎指正。

伪代码部分用 Python 语法高亮,仅作展示之用。

MapReduce


Share this post on:

上一篇
工作流引擎技术分享:从原理到实战
下一篇
云计算技术-25fa-zb最后一课