namespace SingleGPU {
// Thread block size. For current GPUs, 128, 256 and 512 are all reasonable
// choices, but 64 should be avoided: the number of blocks that would have to
// be resident on one SM to keep it fully loaded may exceed the hardware limit.
// Used only by the non-heuristic configuration.
constexpr int BLOCK_SIZE = 512;
// Non-heuristic configuration: the block size is fixed to BLOCK_SIZE, and the
// grid size is the one that just saturates the whole GPU, i.e. the maximum
// number of resident blocks per SM (derived from the maximum resident thread
// count) multiplied by the number of SMs.
inline void LaunchConfig(int *grid_size, int *block_size) {
*block_size = BLOCK_SIZE;
  int device_id = 0;
  CHECK(cudaGetDevice(&device_id));
int max_threads_SM = 0;
int num_SM = 0;
CHECK(cudaDeviceGetAttribute(
&max_threads_SM, cudaDevAttrMaxThreadsPerMultiProcessor, device_id
));
CHECK(cudaDeviceGetAttribute(
&num_SM, cudaDevAttrMultiProcessorCount, device_id
));
  *grid_size = max_threads_SM / (*block_size) * num_SM;
}
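// The comments above call this the "non-heuristic" configuration. A minimal
// sketch of what a per-kernel heuristic configuration could look like, based
// on the CUDA occupancy API (LaunchConfigHeuristic is a hypothetical name,
// not part of the original code):
template<typename KernelFunction>
inline void LaunchConfigHeuristic(
  KernelFunction kernel, int *grid_size, int *block_size
) {
  int min_grid_size = 0;
  // Ask the runtime for the block size that maximizes occupancy for this
  // kernel, and the smallest grid size that can reach that occupancy.
  CHECK(cudaOccupancyMaxPotentialBlockSize(
    &min_grid_size, block_size, kernel, 0, 0
  ));
  *grid_size = min_grid_size;
}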
// Launch a kernel with a grid configuration that also takes the task size into
// account: when the task is small, do not launch more blocks than needed, only
// ceil(num_threads / block_size) of them; when the task is large, fall back to
// the grid size returned by LaunchConfig.
template<typename KernelFunction, typename ...Args>
void LaunchKernelWithNumThreads(
KernelFunction kernel,
size_t num_threads,
size_t shared_memory,
cudaStream_t stream,
Args ...args
) {
int grid_size = 0;
int block_size = 0;
LaunchConfig(&grid_size, &block_size);
int N_grid_size = static_cast<int>(
(num_threads + block_size - 1) / block_size
);
if (N_grid_size < grid_size)
grid_size = N_grid_size;
kernel<<<grid_size, block_size, shared_memory, stream>>>(args...);
}
__global__ void AddKernel1(
ValueType *result,
const ValueType *v1,
const ValueType *v2,
const IndexType N,
const ValueType alpha
) {
unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
unsigned int stride = gridDim.x * blockDim.x;
for (; i < N; i += stride)
result[i] = v1[i] + alpha * v2[i];
}
// result = v1 + alpha * v2
void Add(
Vector *result,
const Vector &v1,
const Vector &v2,
const ValueType alpha = 1.0
) {
if (v1.size != v2.size)
    throw std::runtime_error{"SingleGPU::Add: the two input vectors have different sizes!"};
result->resize(v1.size);
LaunchKernelWithNumThreads(
AddKernel1, v1.size, 0, 0,
result->data(), v1.data(), v2.data(), v1.size, alpha
);
}
template<int BlockSize = BLOCK_SIZE>
__global__ void DotKernel(
ValueType *__restrict__ partial,
const ValueType *__restrict__ vec1,
const ValueType *__restrict__ vec2,
const IndexType N
) {
__shared__ volatile ValueType cache[BlockSize];
int tid = threadIdx.x;
ValueType t = 0;
for (IndexType i = threadIdx.x + blockIdx.x * blockDim.x;
i < N; i += gridDim.x * blockDim.x)
t += vec1[i] * vec2[i];
cache[tid] = t;
__syncthreads();
  // Block-level reduction in shared memory
if (BlockSize >= 1024 && tid < 512)
cache[tid] += cache[tid + 512];
__syncthreads();
if (BlockSize >= 512 && tid < 256)
cache[tid] += cache[tid + 256];
__syncthreads();
if (BlockSize >= 256 && tid < 128)
cache[tid] += cache[tid + 128];
__syncthreads();
if (BlockSize >= 128 && tid < 64)
cache[tid] += cache[tid + 64];
__syncthreads();
  // Warp-level reduction: halve the number of active lanes at each step and
  // synchronize the warp explicitly, so the code is also correct on GPUs with
  // independent thread scheduling (Volta and newer).
  if (tid < 32) cache[tid] += cache[tid + 32];
  __syncwarp();
  if (tid < 16) cache[tid] += cache[tid + 16];
  __syncwarp();
  if (tid < 8) cache[tid] += cache[tid + 8];
  __syncwarp();
  if (tid < 4) cache[tid] += cache[tid + 4];
  __syncwarp();
  if (tid < 2) cache[tid] += cache[tid + 2];
  __syncwarp();
  if (tid < 1) cache[tid] += cache[tid + 1];
if (tid == 0)
partial[blockIdx.x] = cache[0];
}
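// A sketch (not used anywhere in this file) of how the warp-level step of the
// reduction could instead be written with shuffle intrinsics, avoiding shared
// memory and the volatile trick entirely; WarpReduceSum is a hypothetical
// helper name:
__device__ inline ValueType WarpReduceSum(ValueType val) {
  // Each step folds the values held by the upper half of the warp onto the
  // lower half; after the loop, lane 0 holds the sum of all 32 lanes.
  for (int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_down_sync(0xffffffff, val, offset);
  return val;
}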
ValueType Dot(const Vector &v1, const Vector &v2) {
if (v1.size != v2.size)
    throw std::runtime_error{"SingleGPU::Dot: the two input vectors have different sizes!"};
int grid_size = 0;
int block_size = 0;
LaunchConfig(&grid_size, &block_size);
int N_grid_size = (v1.size + block_size - 1) / block_size;
if (N_grid_size < grid_size)
grid_size = N_grid_size;
Vector dev_partial(grid_size);
  DotKernel<BLOCK_SIZE><<<grid_size, block_size>>>(
dev_partial.data(), v1.data(), v2.data(), v1.size
);
Host::Vector partial;
dev_partial.CopyToHost(partial);
ValueType sum = 0;
for (int i = 0; i < grid_size; i++)
sum += partial[i];
return sum;
}
// L2 norm: ||v||_2 = sqrt(<v, v>)
ValueType Norm2(const Vector &v) {
return sqrt(Dot(v, v));
}
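// Example of how the vector routines above might be used (a sketch only; it
// assumes Vector can be constructed with a length, as the dev_partial
// constructor call in Dot suggests, and that its contents are filled
// elsewhere):
//   Vector x(n), y(n), z;
//   Add(&z, x, y, 2.0);        // z = x + 2.0 * y
//   ValueType d  = Dot(x, y);  // d  = <x, y>
//   ValueType nx = Norm2(x);   // nx = ||x||_2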
/*
Kernel computing the product of a CSR sparse matrix and a dense vector:
result = Av. Each half-warp (16 threads) computes one component of the result.
*/
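// For reference, the CSR layout assumed by the kernel for the example matrix
//   [ 10  0  0 20 ]
//   [  0 30  0  0 ]
//   [  0 40 50 60 ]
// would be
//   values         = { 10, 20, 30, 40, 50, 60 }
//   column_indices = {  0,  3,  1,  1,  2,  3 }
//   row_offsets    = {  0,  2,  3,  6 }
// i.e. row r owns the entries in the half-open range
// [row_offsets[r], row_offsets[r + 1]).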
template<int BlockSize = BLOCK_SIZE>
__global__ void SpMvKernel(
ValueType *__restrict__ result,
const ValueType *__restrict__ values,
const IndexType *__restrict__ row_offsets,
const IndexType *__restrict__ column_indices,
const IndexType num_rows,
const ValueType *__restrict__ vec
) {
__shared__ volatile ValueType cache[BlockSize];
const int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
  const int lane = thread_id & 15;  // lane index within the half-warp
  // One half-warp per row, so the row stride is the total number of
  // half-warps in the grid.
  const int row_stride = gridDim.x * blockDim.x / 16;
  for (IndexType row = thread_id / 16; row < num_rows; row += row_stride) {
ValueType t = 0;
for (IndexType j = row_offsets[row] + lane; j < row_offsets[row + 1]; j += 16)
t += values[j] * vec[column_indices[j]];
cache[threadIdx.x] = t;
// parallel reduction in shared memory
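    // Note: this step relies on volatile shared memory plus the pre-Volta
    // assumption that a (half-)warp executes in lockstep; on Volta and newer
    // GPUs, a shuffle-based reduction within 16-lane segments (cf. the
    // WarpReduceSum sketch above) would be the safer idiom.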
if (lane < 8) {
cache[threadIdx.x] = t = t + cache[threadIdx.x + 8];
cache[threadIdx.x] = t = t + cache[threadIdx.x + 4];
cache[threadIdx.x] = t = t + cache[threadIdx.x + 2];
t = t + cache[threadIdx.x + 1];
}
// first thread writes the result
if (lane == 0)
result[row] = t;
}
}
void SpMv(Vector *result, const CsrMatrix &A, const Vector &v) {
if (A.num_cols != v.size)
    throw std::runtime_error{"SingleGPU::SpMv: the number of matrix columns does not match the vector size!"};
result->resize(A.num_rows);
  LaunchKernelWithNumThreads(
SpMvKernel<>, A.num_rows * 16ull, 0, 0,
result->data(),
A.values, A.row_offsets, A.column_indices, A.num_rows,
v.data()
);
}
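// Example of how SpMv might be called (a sketch; it assumes the CsrMatrix
// fields values / row_offsets / column_indices already reside in device
// memory, as their direct use in the kernel launch above suggests):
//   CsrMatrix A = ...;          // assembled elsewhere
//   Vector x(A.num_cols), y;
//   SpMv(&y, A, x);             // y = A * x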
}  // namespace SingleGPU