首页 / 知识

关于oop：在C ++中创建稀疏数组的最佳方法是什么？

2023-04-11 19:14:00

What is the best way to create a sparse array in C++?

我正在做一个需要处理大量矩阵的项目，特别是用于求模算的金字塔求和。

简而言之，我需要在矩阵(多维数组)的零海中跟踪相对少量的值(通常为1，在极少数情况下大于1)。

稀疏数组允许用户存储少量值，并将所有未定义的记录假定为预设值。由于实际上不可能将所有值存储在内存中，因此我只需要存储几个非零元素。这可能是几百万个条目。

速度是重中之重，我还想在运行时动态选择类中的变量数量。

我目前正在使用二进制搜索树(b-tree)存储条目的系统上工作。有人知道更好的系统吗？

对于C ++，映射效果很好。数百万个对象将不是问题。一千万个项目在我的计算机上花费了大约4.4秒和大约57兆。

我的测试应用程序如下：

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34

#include <stdio.h>
#include <stdlib.h>
#include <map>

class triple {
public:
int x;
int y;
int z;
bool operator<(const triple &other) const {
if (x < other.x) return true;
if (other.x < x) return false;
if (y < other.y) return true;
if (other.y < y) return false;
return z < other.z;
}
};

int main(int, char**)
{
std::map<triple,int> data;
triple point;
int i;

for (i = 0; i < 10000000; ++i) {
point.x = rand();
point.y = rand();
point.z = rand();
//printf("%d %d %d %d
", i, point.x, point.y, point.z);
data[point] = i;
}
return 0;
}

现在要动态选择变量的数量，最简单的解决方案是将index表示为字符串，然后将string用作映射的键??。例如，可以通过" 23,55"字符串表示位于[23] [55]的项目。我们还可以将此解决方案扩展到更大的尺寸；例如对于三个维度，任意索引将看起来像" 34、45、56"。此技术的简单实现如下：

1
2
3
4
5
6
7
8

std::map data<string,int> data;
char ix[100];

sprintf(ix,"%d,%d", x, y); // 2 vars
data[ix] = i;

sprintf(ix,"%d,%d,%d", x, y, z); // 3 vars
data[ix] = i;

作为一般建议，使用字符串作为索引的方法实际上非常慢。一个更有效但其他等效的解决方案是使用向量/数组。绝对没有必要在字符串中编写索引。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

typedef vector<size_t> index_t;

struct index_cmp_t : binary_function<index_t, index_t, bool> {
bool operator ()(index_t const& a, index_t const& b) const {
for (index_t::size_type i = 0; i < a.size(); ++i)
if (a[i] != b[i])
return a[i] < b[i];
return false;
}
};

map<index_t, int, index_cmp_t> data;
index_t i(dims);
i[0] = 1;
i[1] = 2;
// … etc.
data[i] = 42;

但是，在实践中使用map通常效率不高，因为它是根据平衡的二进制搜索树实现的。在这种情况下，性能更好的数据结构将是哈希表，由std::unordered_map提供。

Boost有一个称为uBLAS的BLAS模板化实现，其中包含一个稀疏矩阵。

https://www.boost.org/doc/libs/release/libs/numeric/ublas/doc/index.htm

Eigen是一个C ++线性代数库，具有稀疏矩阵的实现。它甚至支持针对稀疏矩阵优化的矩阵运算和求解器(LU分解等)。

索引比较中的小细节。您需要进行字典比较，否则：

1 2	a= (1, 2, 1); b= (2, 1, 2); (a<b) == (b<a) is true, but b!=a

编辑：所以比较可能应该是：

1
2
3
4
5
6
7
8
9

return lhs.x<rhs.x
? true
: lhs.x==rhs.x
? lhs.y<rhs.y
? true
: lhs.y==rhs.y
? lhs.z<rhs.z
: false
: false

哈希表可以快速插入并查找。您可以编写一个简单的哈希函数，因为您知道只处理整数对作为键。

解决方案的完整列表可以在Wikipedia中找到。为了方便起见，我引用了以下相关部分。

https://zh.wikipedia.org/wiki/Sparse_matrix#Dictionary_of_keys_.28DOK.29

按键字典(DOK)

DOK consists of a dictionary that maps (row, column)-pairs to the
value of the elements. Elements that are missing from the dictionary
are taken to be zero. The format is good for incrementally
constructing a sparse matrix in random order, but poor for iterating
over non-zero values in lexicographical order. One typically
constructs a matrix in this format and then converts to another more
efficient format for processing.[1]

清单清单(LIL)

LIL stores one list per row, with each entry containing the column
index and the value. Typically, these entries are kept sorted by
column index for faster lookup. This is another format good for
incremental matrix construction.[2]

座标清单(COO)

COO stores a list of (row, column, value) tuples. Ideally, the entries
are sorted (by row index, then column index) to improve random access
times. This is another format which is good for incremental matrix
construction.[3]

压缩的稀疏行(CSR，CRS或耶鲁格式)

The compressed sparse row (CSR) or compressed row storage (CRS) format
represents a matrix M by three (one-dimensional) arrays, that
respectively contain nonzero values, the extents of rows, and column
indices. It is similar to COO, but compresses the row indices, hence
the name. This format allows fast row access and matrix-vector
multiplications (Mx).

实现稀疏矩阵的最佳方法是不实现它们-至少不是您自己。我建议使用BLAS(我认为它是LAPACK的一部分)，它可以处理非常大的矩阵。

我建议做这样的事情：

1
2
3
4
5
6
7
8
9
10
11
12

typedef std::tuple<int, int, int> coord_t;
typedef boost::hash<coord_t> coord_hash_t;
typedef std::unordered_map<coord_hash_t, int, c_hash_t> sparse_array_t;

sparse_array_t the_data;
the_data[ { x, y, z } ] = 1; /* list-initialization is cool */

for( const auto& element : the_data ) {
int xx, yy, zz, val;
std::tie( std::tie( xx, yy, zz ), val ) = element;
/* ... */
}

为了帮助保持数据稀疏，您可能需要编写unorderd_map的子类，其子类的迭代器会自动跳过(并擦除)值为0的所有项目。

这是一个相对简单的实现，应该提供合理的快速查找(使用哈希表)以及对行/列中非零元素的快速迭代。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150

// Copyright 2014 Leo Osvald
//
// Licensed under the Apache License, Version 2.0 (the"License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an"AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#ifndef UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_
#define UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_

#include
#include <limits>
#include <map>
#include <type_traits>
#include <unordered_map>
#include <utility>
#include <vector>

// A simple time-efficient implementation of an immutable sparse matrix
// Provides efficient iteration of non-zero elements by rows/cols,
// e.g. to iterate over a range [row_from, row_to) x [col_from, col_to):
// for (int row = row_from; row < row_to; ++row) {
// for (auto col_range = sm.nonzero_col_range(row, col_from, col_to);
// col_range.first != col_range.second; ++col_range.first) {
// int col = *col_range.first;
// // use sm(row, col)
// ...
// }
template<typename T = double, class Coord = int>
class SparseMatrix {
struct PointHasher;
typedef std::map< Coord, std::vector<Coord> > NonZeroList;
typedef std::pair<Coord, Coord> Point;

public:
typedef T ValueType;
typedef Coord CoordType;
typedef typename NonZeroList::mapped_type::const_iterator CoordIter;
typedef std::pair<CoordIter, CoordIter> CoordIterRange;

SparseMatrix() = default;

// Reads a matrix stored in MatrixMarket-like format, i.e.:
// <num_rows> <num_cols> <num_entries>
// <row_1> <col_1> <val_1>
// ...
// Note: the header (lines starting with '%' are ignored).
template<class InputStream, size_t max_line_length = 1024>
void Init(InputStream& is) {
rows_.clear(), cols_.clear();
values_.clear();

// skip the header (lines beginning with '%', if any)
decltype(is.tellg()) offset = 0;
for (char buf[max_line_length + 1];
is.getline(buf, sizeof(buf)) && buf[0] == '%'; )
offset = is.tellg();
is.seekg(offset);

size_t n;
is >> row_count_ >> col_count_ >> n;
values_.reserve(n);
while (n--) {
Coord row, col;
typename std::remove_cv< T >::type val;
is >> row >> col >> val;
values_[Point(--row, --col)] = val;
rows_[col].push_back(row);
cols_[row].push_back(col);
}
SortAndShrink(rows_);
SortAndShrink(cols_);
}

const T& operator()(const Coord& row, const Coord& col) const {
static const T kZero = T();
auto it = values_.find(Point(row, col));
if (it != values_.end())
return it->second;
return kZero;
}

CoordIterRange
nonzero_col_range(Coord row, Coord col_from, Coord col_to) const {
CoordIterRange r;
GetRange(cols_, row, col_from, col_to, &r);
return r;
}

CoordIterRange
nonzero_row_range(Coord col, Coord row_from, Coord row_to) const {
CoordIterRange r;
GetRange(rows_, col, row_from, row_to, &r);
return r;
}

Coord row_count() const { return row_count_; }
Coord col_count() const { return col_count_; }
size_t nonzero_count() const { return values_.size(); }
size_t element_count() const { return size_t(row_count_) * col_count_; }

private:
typedef std::unordered_map<Point,
typename std::remove_cv< T >::type,
PointHasher> ValueMap;

struct PointHasher {
size_t operator()(const Point& p) const {
return p.first << (std::numeric_limits<Coord>::digits >> 1) ^ p.second;
}
};

static void SortAndShrink(NonZeroList& list) {
for (auto& it : list) {
auto& indices = it.second;
indices.shrink_to_fit();
std::sort(indices.begin(), indices.end());
}

// insert a sentinel vector to handle the case of all zeroes
if (list.empty())
list.emplace(Coord(), std::vector<Coord>(Coord()));
}

static void GetRange(const NonZeroList& list, Coord i, Coord from, Coord to,
CoordIterRange* r) {
auto lr = list.equal_range(i);
if (lr.first == lr.second) {
r->first = r->second = list.begin()->second.end();
return;
}

auto begin = lr.first->second.begin(), end = lr.first->second.end();
r->first = lower_bound(begin, end, from);
r->second = lower_bound(r->first, end, to);
}

ValueMap values_;
NonZeroList rows_, cols_;
Coord row_count_, col_count_;
};

#endif /* UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_ */

为简单起见，它是immutable，但您可以使其可变。如果需要合理有效的"插入"(将零更改为非零)，请确保将std::vector更改为std::set。

由于仅具有[a] [b] ... [w] [x] [y] [z]的值才有意义，因此我们仅存储索引本身，而不存储几乎无处不在的值1-始终同样+没有办法对其进行哈希处理。注意存在维度的诅咒，建议使用一些已建立的工具NIST或Boost，至少阅读其来源以规避不必要的错误。

如果工作需要捕获未知数据集的时间相关性分布和参数趋势，那么具有单值根的Map或B-Tree可能不切实际。对于所有1个值，我们只能存储索引本身，如果排序(表示的敏感性)可以服从运行时时域的缩减，则可以对其进行散列处理。由于除了一个以外的非零值很少，因此可以很容易找到并理解的任何数据结构都是显而易见的选择。如果数据集确实是巨大的宇宙，我建议您自己管理文件/磁盘/持久性io的某种滑动窗口，并根据需要将部分数据移入范围。 (编写您可以理解的代码)如果您承诺为工作组提供实际的解决方案，那么如果您未这样做，您将只能依靠消费级操作系统，而这些操作系统的唯一目标是使您的午餐不再为时。

数组方法项目用于