Matrix Multiplication with C++ AMP

As part of our API tour of C++ AMP, we looked recently at parallel_for_each. I ended that post by saying we would revisit parallel_for_each after introducing array and array_view. Now is the time, so this is part 2 on parallel_for_each, and also a post that brings together everything we've seen so far.

The serial and accelerated code

Consider a naïve (or brute force) serial implementation of matrix multiplication:

0:  void MatrixMultiplySerial(std::vector<float>& vC, 
        const std::vector<float>& vA, 
        const std::vector<float>& vB, int M, int N, int W)
1:  {
2:    for (int row = 0; row < M; row++) 
3:    {
4:      for (int col = 0; col < N; col++)
5:      {
6:        float sum = 0.0f;
7:        for(int i = 0; i < W; i++)
8:          sum += vA[row * W + i] * vB[i * N + col];
9:        vC[row * N + col] = sum;
10:     }
11:   }
12: }

We notice that each iteration of the two outer loops is independent of the others, so those loops can be parallelized. If in addition we have really large amounts of data, then this is a good candidate to offload to an accelerator. First, I'll just show you an example of what that code may look like with C++ AMP, and then we'll analyze it. It is assumed that you have added #include <amp.h> at the top of your file.

13:  void MatrixMultiplySimple(std::vector<float>& vC, 
         const std::vector<float>& vA, 
         const std::vector<float>& vB, int M, int N, int W)
14:  {
15:    concurrency::array_view<const float,2> a(M, W, vA);
16:    concurrency::array_view<const float,2> b(W, N, vB);
17:    concurrency::array_view<float,2> c(M, N, vC); c.discard_data();
18:    concurrency::parallel_for_each(c.extent, 
19:    [=](concurrency::index<2> idx) restrict(amp) {
20:      int row = idx[0]; int col = idx[1];
21:      float sum = 0.0f;
22:      for(int i = 0; i < W; i++)
23:        sum += a(row, i) * b(i, col);
24:      c[idx] = sum;
25:    });
26:  }
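
Just to make the calling convention concrete, here is a minimal driver sketch; the sizes and fill values are made up for illustration:

  // Hypothetical usage: multiply a 64x32 matrix of 1.0f by a 32x16 matrix of 2.0f
  const int M = 64, W = 32, N = 16;
  std::vector<float> vA(M * W, 1.0f), vB(W * N, 2.0f), vC(M * N);
  MatrixMultiplySimple(vC, vA, vB, M, N, W);
  // every element of vC should now equal W * 1.0f * 2.0f = 64.0f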

First a visual comparison, just for fun: the beginning and end are the same, i.e. lines 0,1,12 are identical to lines 13,14,26. The doubly nested loop (lines 2,3,4,5 and 10,11) has been transformed into a parallel_for_each call (18,19,20 and 25). The core algorithm (lines 6,7,8,9) is essentially the same (lines 21,22,23,24). We have extra lines in the C++ AMP version (15,16,17). Now let's dig in deeper.

[Image: matrix multiplication, from Wikipedia]

Using array_view and extent

When we decided to convert this function to run on an accelerator, we knew we couldn't use the std::vector objects in the restrict(amp) function. So we had a choice of copying the data to a concurrency::array<T,N> object, or wrapping the vector container (and hence its data) with a concurrency::array_view<T,N> object from amp.h – here we used the latter (lines 15,16,17). Now we can access the same data through the array_view objects (a and b) instead of the vector objects (vA and vB), and the added benefit is that we can capture the array_view objects in the lambda (lines 19-25) that we pass to the parallel_for_each call (line 18), and the data will get copied on demand for us to the accelerator.
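
For contrast, here is a sketch of what the array-based alternative could have looked like; this is my illustrative variant, not code from the original function. With concurrency::array the copy to the accelerator happens at construction time, the arrays are captured by reference in the lambda, and the results must be copied back explicitly:

  // Alternative using concurrency::array: explicit copies instead of on-demand ones
  concurrency::array<float, 2> a(M, W, vA.begin());  // copies vA to the accelerator now
  concurrency::array<float, 2> b(W, N, vB.begin());  // copies vB to the accelerator now
  concurrency::array<float, 2> c(M, N);              // accelerator-side storage for the result
  concurrency::parallel_for_each(c.extent,
  [&a, &b, &c, W](concurrency::index<2> idx) restrict(amp) {
    float sum = 0.0f;
    for(int i = 0; i < W; i++)
      sum += a(idx[0], i) * b(i, idx[1]);
    c[idx] = sum;
  });
  concurrency::copy(c, vC.begin());                  // explicit copy back to the host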

Note that line 15 (and ditto for 16 and 17) could have been written as two lines instead of one:

  concurrency::extent<2> e(M, W);
  concurrency::array_view<const float, 2> a(e, vA);

In other words, we could have explicitly created the extent object instead of letting the array_view create it for us under the covers through the constructor overload we chose. The benefit of the extent object in this instance is that we can express that the data is indeed two dimensional, i.e. a matrix. When we were using a vector object we could not do that, and instead we had to track the dimensions of the matrix via additional, unrelated variables (the integers M and W) – aren't you loving C++ AMP already?

Note that the const before the float when creating a and b will result in the underlying data only being copied to the accelerator and not copied back – a nice optimization. A similar thing is happening on line 17 when creating array_view c, where we have indicated, through the discard_data call, that we do not need to copy the data to the accelerator.
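
To make the copy behavior concrete, here is an annotated sketch of those same declarations; the synchronize call at the end is my addition, forcing the copy-back that otherwise happens lazily:

  concurrency::array_view<const float, 2> a(M, W, vA);  // const element type: copied to the accelerator, never copied back
  concurrency::array_view<const float, 2> b(W, N, vB);  // ditto
  concurrency::array_view<float, 2> c(M, N, vC);
  c.discard_data();  // don't bother copying vC's current contents to the accelerator
  // ... parallel_for_each as on lines 18-25 ...
  c.synchronize();   // force the copy-back into vC (otherwise it happens on first host-side access)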

The kernel dispatch

On line 18 we make the call to the C++ AMP entry point (parallel_for_each) to invoke our parallel loop or, as some may say, dispatch our kernel.

The first argument we need to pass describes how many threads we want for this computation. For this algorithm we decided that we want exactly the same number of threads as the number of elements in the output matrix, i.e. in array_view c, which will eventually update the vector vC. So each thread will compute exactly one result. Since the elements in c are organized in a 2-dimensional manner, we can organize our threads in a two-dimensional manner too. We don't have to think too much about how to create the first argument (an extent) since the array_view object helpfully exposes that as a property. Note that instead of c.extent we could have written extent<2>(M, N) – the result is the same in that we have specified M*N threads to execute our lambda.
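
In other words, line 18 could equally have been written as:

  concurrency::parallel_for_each(concurrency::extent<2>(M, N),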

The second argument is a restrict(amp) lambda that accepts an index object. Since we elected to use a two-dimensional extent as the first argument of parallel_for_each, the index will also be two-dimensional; as covered in the previous posts, it represents the thread ID, which in our case maps perfectly to the index of each element in the resulting array_view.

The kernel itself

The lambda body (lines 20-24), or as some may say, the kernel, is the code that will actually execute on the accelerator. It will be called by M*N threads and we can use those threads to index into the two input array_views (a, b) and write results into the output array_view (c).

The four lines (21-24) are essentially identical to the four lines of the serial algorithm (6-9). The only difference is how we index into a, b, c versus how we index into vA, vB, vC. The code we wrote with C++ AMP is much nicer in its indexing, because dimensionality is a first-class concept, so you don't have to do funny arithmetic calculating where the next row starts, which you have to do when working with vectors directly (since they store all the data in a flat manner).
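
Side by side, the corresponding lines (8 and 23) look like this:

  sum += vA[row * W + i] * vB[i * N + col];  // serial version (line 8): manual flat indexing
  sum += a(row, i) * b(i, col);              // C++ AMP version (line 23): two-dimensional indexing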

I skipped over describing line 20. Note that we didn't really need to read the two components of the index into temporary local variables. This mostly reflects my personal choice: in some algorithms I break down the index into local variables with names that make sense for the algorithm, in this case row and col. In other cases it may be i,j,k or x,y,z, or M,N, or whatever. Also note that we could have written line 24 as c(idx[0], idx[1]) = sum or c(row, col) = sum, instead of the simpler c[idx] = sum.
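
That is, all three of these forms store the same result:

  c(idx[0], idx[1]) = sum;  // explicit index components
  c(row, col) = sum;        // via the local variables
  c[idx] = sum;             // the simplest form, as used on line 24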

Targeting a specific accelerator

Imagine that we had more than one hardware accelerator on the system and we wanted to pick a specific one to execute this parallel loop on. There would be some code like this anywhere before line 18:

  std::vector<concurrency::accelerator> accs = MyFunctionThatChoosesSuitableAccelerators();
  concurrency::accelerator acc = accs[0];

…and then we would modify line 18 to call another overload of parallel_for_each that accepts an accelerator_view as its first argument, so it would become:

  concurrency::parallel_for_each(acc.default_view, c.extent,

...and the rest of your code remains the same… how simple is that?
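
MyFunctionThatChoosesSuitableAccelerators is left undefined above; purely as an illustration, one plausible implementation could enumerate the devices on the system via accelerator::get_all and keep only the non-emulated ones:

  // Hypothetical chooser: return all non-emulated accelerators on the system
  std::vector<concurrency::accelerator> MyFunctionThatChoosesSuitableAccelerators()
  {
    std::vector<concurrency::accelerator> result;
    for (const concurrency::accelerator& a : concurrency::accelerator::get_all())
      if (!a.is_emulated)  // skip the CPU fallback / reference emulators
        result.push_back(a);
    return result;
  }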
