DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
This paper mainly contributes the following two points. So the core logic of the algorithm…
2025-12-27 18:09:56
134
TileLang: A Composable Tiled Programming Model for AI Systems
TileLang closely resembles TVM: it cleanly separates the scheduling space (thread binding, layout, tensorization, and pipelining) from the pure data-flow description, and it exposes the same knobs through the Python API shown in Fig. 1.
2025-12-20 12:10:23
205
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
TVM can perform graph-level and operator-level optimization. At the graph level, it applies operator fusion, constant folding, static memory pre-allocation, and data layout transformation passes. Here I want to emphasize operator fusion: it splits the operators…
2025-12-14 23:40:51
695
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
This paper proposes a very simple method for the parallel computation of large matrices. The following figure describes the core logic: A is split by columns and B is split by rows. Each shard of XA in the MLP, and each group consisting of Q1, K1, V1, is placed…
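As a quick illustration of the split described above, here is a minimal NumPy sketch (my own, not the paper's code; the shapes and the two-way split are assumptions):

```python
import numpy as np

# Minimal sketch of the Megatron-LM MLP split: A is split by columns and
# B is split by rows, so each worker computes its partial result locally
# and a single all-reduce (a sum) recovers the full output.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))      # input activations
A = rng.standard_normal((8, 16))     # first GEMM weight, split by columns
B = rng.standard_normal((16, 8))     # second GEMM weight, split by rows

A1, A2 = np.hsplit(A, 2)             # column shards, one per worker
B1, B2 = np.vsplit(B, 2)             # matching row shards

def gelu(x):
    # elementwise nonlinearity between the two GEMMs
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

Y1 = gelu(X @ A1) @ B1               # worker 1, no communication needed
Y2 = gelu(X @ A2) @ B2               # worker 2, no communication needed
Y = Y1 + Y2                          # the only all-reduce

assert np.allclose(Y, gelu(X @ A) @ B)   # matches the unsplit computation
```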
2025-12-07 23:13:53
271
LLaMA: Open and Efficient Foundation Language Models
This paper is inspired by the Chinchilla scaling law. It found that, given a fixed computing budget, the best performance is achieved not by the largest models but by smaller models trained on more data. So it proposes a collection of models ranging from 7B to 65B parameters…
2025-12-06 22:38:46
464
Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations
Triton is composed of 3 parts: Triton-C, Triton-IR, and Triton-JIT. 1.1. get_global_range(axis) is associated with the kernel, as shown in the graph above. 1.2. Broadcast. The structure of Triton-IR is similar to MLIR: both include modules, functions…
2025-11-22 17:20:32
423
MLIR: A Compiler Infrastructure for the End of Moore's Law
[Code] MLIR: A Compiler Infrastructure for the End of Moore's Law.
2025-11-16 21:35:22
341
SGLang: Efficient Execution of Structured Language Model Programs
I think there are 3 advantages to SGLang: it allows direct programming in Python, it supports RadixAttention for efficient KVCache reuse, and it uses a compressed finite state machine to accelerate structured output. RadixAttention reuses the KVCache across requests that share the same prompt prefix.
2025-11-08 14:51:06
347
Efficient Memory Management for Large Language Model Serving with PagedAttention
This paper proposes the PagedAttention algorithm, inspired by the paging technique in operating systems. It can improve serving throughput by 2~4x.
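To make the paging analogy concrete, here is a toy block-table sketch (my own illustration; the block size, pool size, and function names are assumptions, not the paper's implementation):

```python
# Toy sketch of the paging idea: the KV cache lives in fixed-size physical
# blocks, and a per-sequence block table maps logical token positions to
# physical blocks, so memory is allocated on demand rather than reserved
# contiguously up front.
BLOCK_SIZE = 16                      # tokens per KV block (assumed)

free_blocks = list(range(64))        # pool of physical block ids
block_tables = {}                    # sequence id -> list of physical block ids

def slot_for_token(seq_id, token_index):
    """Return (physical_block, offset) for the KV entries of a new token."""
    table = block_tables.setdefault(seq_id, [])
    if token_index % BLOCK_SIZE == 0:        # previous block full, or first token
        table.append(free_blocks.pop())      # allocate a new block lazily
    return table[-1], token_index % BLOCK_SIZE

# Two sequences grow independently without pre-reserving contiguous space.
for t in range(20):
    slot_for_token(0, t)
for t in range(5):
    slot_for_token(1, t)
print(block_tables)   # seq 0 uses 2 blocks, seq 1 uses 1
```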
2025-11-02 20:15:44
284
Language Models are Few-Shot Learners
This paper was written by OpenAI in 2020. It introduces GPT-3, a model with 175B parameters, and demonstrates that scaling up pretrained language models enables task-agnostic, zero-shot or few-shot use that achieves state-of-the-art results without fine-tuning in many domains…
2025-09-15 23:34:23
177
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
Kimi is the large language model tool I use most frequently. It is faster than other tools, like DeepSeek, so I decided to read this paper to figure out the magic behind its chat interface. As a MaaS provider, it faces a lot of constraints: limited resources…
2025-09-01 00:12:16
636
LoRA: Low-Rank Adaptation of LLM
Large Language Models have a huge number of parameters, such as GPT-3 (175B). If we want to fine-tune one to adapt to multiple downstream tasks, we would need to retrain all the parameters, which wastes a great deal of computing resources…
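A minimal sketch of the low-rank update LoRA uses instead of full fine-tuning (my own illustration; the shapes, scaling, and initialization are assumptions, not the paper's code):

```python
import numpy as np

# LoRA idea in brief: freeze the pretrained weight W and learn a low-rank
# update B @ A, so only r * (d_in + d_out) parameters are trainable.
d_in, d_out, r = 1024, 1024, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d_in, d_out))       # frozen pretrained weight
A = 0.01 * rng.standard_normal((d_in, r))    # trainable down-projection
B = np.zeros((r, d_out))                     # trainable up-projection, zero init
alpha = 16                                   # scaling hyperparameter (assumed)

def forward(x):
    # original path plus the low-rank adapter path
    return x @ W + (alpha / r) * (x @ A @ B)

x = rng.standard_normal((2, d_in))
print(forward(x).shape)   # (2, 1024); equals x @ W until B is trained
```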
2025-08-30 15:21:49
265
Why do LLMs possess the capability for chain of thought?
I remember that when I started using LLM tools like ChatGPT or Kimi, they were terrific at handling complex reasoning jobs, such as math, commonsense, and other tasks. After a while, I wondered how these tools can do it. The most important factor is the chain of thought…
2025-08-17 16:48:30
131
What is model distillation?
Model distillation is a method used to compress a model for convenient deployment. It involves using a small model whose accuracy does not decline significantly. The principle behind this method is the use of "soft targets", which differ…
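A small sketch of a soft-target loss, to make the idea concrete (my own illustration; the temperature T and weight alpha are assumed values, not from the post):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft part: the student matches the teacher's temperature-softened distribution.
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T
    # Hard part: ordinary cross-entropy on the true labels.
    p = softmax(student_logits)
    hard = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft + (1 - alpha) * hard

rng = np.random.default_rng(0)
loss = distillation_loss(rng.standard_normal((8, 10)),
                         rng.standard_normal((8, 10)),
                         rng.integers(0, 10, size=8))
print(float(loss))
```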
2025-08-12 23:16:10
428
GQA: Grouped-Query Attention
The principle is very simple, as the graph above shows. For the conversion from MHA to GQA, we mean-pool the key and value heads, as shown in the following figure. 1. GQA leads to higher quality than MQA but is faster than MHA…
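A tiny sketch of the mean-pool conversion (my own illustration; the paper pools the key/value projection weights of the pretrained checkpoint, but the grouping logic is the same):

```python
import numpy as np

# Convert 8 MHA key heads into 2 GQA groups by mean-pooling adjacent heads.
num_heads, num_groups, seq, head_dim = 8, 2, 16, 64
rng = np.random.default_rng(0)

k_mha = rng.standard_normal((num_heads, seq, head_dim))   # per-head keys

k_gqa = (k_mha
         .reshape(num_groups, num_heads // num_groups, seq, head_dim)
         .mean(axis=1))                                    # average within each group

print(k_gqa.shape)   # (2, 16, 64): 8 KV heads pooled down to 2 groups
```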
2025-08-09 15:40:09
321
Why is the rotation matrix like this?
[ cosθ  -sinθ ]
[ sinθ   cosθ ]
Let us demonstrate this using the "complex-number shortcut". 1. We denote the 2-D vector [x, y] as the complex number z = x + iy. 2. If we want to rotate by some angle θ, in complex numbers we multiply by e^(iθ) = cos(θ) + i·sin(θ). 3. So, we…
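The truncated step 3 is one line of algebra; here is a short worked version of the shortcut (my own reconstruction of the standard argument):

```latex
% Multiply z = x + iy by e^{i\theta} and read off real and imaginary parts:
\begin{aligned}
e^{i\theta} z &= (\cos\theta + i\sin\theta)(x + iy) \\
              &= (x\cos\theta - y\sin\theta) + i\,(x\sin\theta + y\cos\theta),
\end{aligned}
\qquad\text{i.e.}\qquad
\begin{pmatrix} x' \\ y' \end{pmatrix} =
\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}.
```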
2025-07-16 23:00:08
168
A Deep Analysis of the MLA Algorithm
(Because the formulas are so difficult to type, I wrote them by hand.)
2025-05-30 22:55:45
240
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
This paper describes two models, named DeepSeek-R1-Zero and DeepSeek-R1. First of all, it introduces DeepSeek-R1-Zero, which is trained via reinforcement learning without supervised fine-tuning (SFT). It demonstrates a remarkable ability in reasoning…
2025-05-18 15:00:19
306
Adaptive Mixtures of Local Experts
Authors: Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, Geoffrey E. Hinton. This paper begins by introducing the Mixture of Experts (MoE) architecture. For a system composed of many separate networks, this architecture can effectively manage each…
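A toy gating sketch to show how the separate expert networks are combined (my own illustration with made-up shapes; the paper's training objective, which encourages the experts to specialize, is not shown here):

```python
import numpy as np

# Mixture of experts: a gating network produces softmax weights over several
# separate expert networks, and the output is their weighted combination.
rng = np.random.default_rng(0)
d_in, d_out, n_experts = 4, 3, 5

W_experts = rng.standard_normal((n_experts, d_in, d_out))  # one linear expert each
W_gate = rng.standard_normal((d_in, n_experts))            # gating network

def moe_forward(x):
    logits = x @ W_gate
    gate = np.exp(logits - logits.max())
    gate = gate / gate.sum()                               # softmax over experts
    expert_outs = np.stack([x @ W for W in W_experts])     # (n_experts, d_out)
    return gate @ expert_outs                              # weighted mixture

x = rng.standard_normal(d_in)
print(moe_forward(x))
```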
2025-03-06 22:13:58
470
remote-vscode cannot be accessed
1.1 Remove the corresponding entries in known_hosts, located in the .ssh directory on the local machine. 1.2 Add a remote platform entry in settings.json. 1.3 Remove the remote .vscode-server directory on the remote machine.
2025-01-17 14:39:05
241
Introduction to BERT
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model proposed by Google AI in October 2018, from the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. The overall structure consists of the Embedding layer, the Transformer Encoder, and the model output.
2024-06-10 17:37:16
983
A step-by-step guide to writing a GEMM that beats cuBLAS without SASS
In the end, mykernel was 7.3% faster than cuBLAS with M, N, and K all equal to 2048. The article also walks you step by step through reading the profiler output to optimize a GEMM kernel, rather than just presenting the final result like other articles do; I hope it helps with your own kernel optimization.
2024-05-15 15:33:52
1716
A detailed explanation of expression templates in Eigen
By referring to Eigen, I implemented my own expression-template library; it currently supports addition and subtraction and reserves interfaces for other operations. The main optimization techniques are: 1. header-only, so it is easy to use; 2. elements are stored on the stack; 3. expression templates enable lazy computation, reducing memory accesses and temporary variables; 4. the assignment loop is unrolled.
2024-03-05 16:35:33
1220
1
Optimizing GEMM with TVM
Optimization vs. time (s): default 2.930760; +tile 0.261980; +vectorize 0.16041; +parallel 0.025833.
2023-11-21 13:07:11
443
1
IPC (Inter-Process Communication) (C code)
This mainly uses sem_open, sem_post, sem_wait, and sem_trywait, which are likewise matched by name: each call to sem_post increments the semaphore by 1, and each call to sem_wait decrements it by 1. Here I also wrote a simple Producer and Consumer. An unnamed pipe can only be used between related processes; a FIFO pipe avoids this problem. The core FIFO API is mkfifo, and it is used much like a file; below are a Writer and a Reader that use a FIFO. Here I wrote two classes to implement file reading and writing respectively.
2023-11-14 13:02:48
347
1
Summary of OpenMP #pragma omp task
If too many tasks are produced and each task does very little work, you get severely negative optimization; in an unbounded-loop structure, if what a task processes depends on the tasks before and after it, the result is also negative; only when each task's work is independent of the surrounding tasks does it give a positive speedup. taskloop can effectively reduce the running time, but the grainsize should not be chosen too small.
2023-11-01 18:50:05
1448
1