DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
This paper mainly contributes the following two points. So the core logic of the algorithm…
2025-12-27 18:09:56
134
TileLang: A Composable Tiled Programming Model for AI Systems
TileLang closely resembles TVM: it cleanly separates the scheduling space (thread binding, layout, tensorization, and pipelining) from the pure data-flow description, and it exposes the same knobs through the Python API shown in Fig. 1.
2025-12-20 12:10:23
205
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
TVM can perform graph-level and operator-level optimization. At the graph level, it applies operator fusion, constant folding, static memory pre-allocation, and data layout transformation passes. Here I want to emphasize operator fusion: it splits the operators…
2025-12-14 23:40:51
695
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
This paper proposes a very simple method for the parallel computation of large matrices. The following figure describes the core logic: A is split by columns and B is split by rows. Each shard of XA in the MLP, and each group consisting of Q1, K1, V1, is placed…
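As a quick illustration of the split described above, here is a minimal NumPy sketch (my own, not the paper's code; the shapes and the two-way split are assumptions):

```python
import numpy as np

# Minimal sketch of the Megatron-LM MLP split: A is split by columns and
# B is split by rows, so each worker computes its partial result locally
# and a single all-reduce (a sum) recovers the full output.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))      # input activations
A = rng.standard_normal((8, 16))     # first GEMM weight, split by columns
B = rng.standard_normal((16, 8))     # second GEMM weight, split by rows

A1, A2 = np.hsplit(A, 2)             # column shards, one per worker
B1, B2 = np.vsplit(B, 2)             # matching row shards

def gelu(x):
    # elementwise nonlinearity between the two GEMMs
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

Y1 = gelu(X @ A1) @ B1               # worker 1, no communication needed
Y2 = gelu(X @ A2) @ B2               # worker 2, no communication needed
Y = Y1 + Y2                          # the only all-reduce

assert np.allclose(Y, gelu(X @ A) @ B)   # matches the unsplit computation
```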
2025-12-07 23:13:53
271
LLaMA: Open and Efficient Foundation Language Models
This paper is inspired by the Chinchilla scaling law. It found that, given a fixed computing budget, the best performance is achieved not by the largest models but by smaller models trained on more data. So it proposes a collection of models ranging from 7B to 65B parameters…
2025-12-06 22:38:46
464
Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations
Triton is composed of 3 parts: Triton-C, Triton-IR, and Triton-JIT. 1.1. get_global_range(axis) is associated with the kernel, as shown in the graph above. 1.2. Broadcast. The structure of Triton-IR is similar to MLIR: both include modules, functions…
2025-11-22 17:20:32
423
MLIR: A Compiler Infrastructure for the End of Moore's Law
[Code] MLIR: A Compiler Infrastructure for the End of Moore's Law.
2025-11-16 21:35:22
341
SGLang: Efficient Execution of Structured Language Model Programs
I think there are 3 advantages to SGLang: it allows direct programming in Python, it supports RadixAttention for efficient KVCache reuse, and it uses a compressed finite state machine to accelerate structured output. RadixAttention reuses the KVCache across requests that share the same prompt prefix.
2025-11-08 14:51:06
347
Efficient Memory Management for Large Language Model Serving with PagedAttention
This paper proposes the PagedAttention algorithm, inspired by the paging technique in operating systems. It can improve serving throughput by 2~4x.
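To make the paging analogy concrete, here is a toy block-table sketch (my own illustration; the block size, pool size, and function names are assumptions, not the paper's implementation):

```python
# Toy sketch of the paging idea: the KV cache lives in fixed-size physical
# blocks, and a per-sequence block table maps logical token positions to
# physical blocks, so memory is allocated on demand rather than reserved
# contiguously up front.
BLOCK_SIZE = 16                      # tokens per KV block (assumed)

free_blocks = list(range(64))        # pool of physical block ids
block_tables = {}                    # sequence id -> list of physical block ids

def slot_for_token(seq_id, token_index):
    """Return (physical_block, offset) for the KV entries of a new token."""
    table = block_tables.setdefault(seq_id, [])
    if token_index % BLOCK_SIZE == 0:        # previous block full, or first token
        table.append(free_blocks.pop())      # allocate a new block lazily
    return table[-1], token_index % BLOCK_SIZE

# Two sequences grow independently without pre-reserving contiguous space.
for t in range(20):
    slot_for_token(0, t)
for t in range(5):
    slot_for_token(1, t)
print(block_tables)   # seq 0 uses 2 blocks, seq 1 uses 1
```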
2025-11-02 20:15:44
284
Language Models are Few-Shot Learners
This paper was written by OpenAI in 2020. It introduces GPT-3, a model with 175B parameters, and demonstrates that scaling up pretrained language models enables task-agnostic, zero-shot or few-shot use that achieves state-of-the-art results without fine-tuning in many domains…
2025-09-15 23:34:23
177
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
Kimi is the large language model tool I use most frequently. It is faster than other tools, like DeepSeek, so I decided to read this paper to figure out the magic behind its chat interface. As a MaaS provider, it faces a lot of constraints: limited resources…
2025-09-01 00:12:16
636
LoRA: Low-Rank Adaptation of LLM
Large Language Models have a huge number of parameters, such as GPT-3 (175B). If we want to fine-tune one to adapt to multiple downstream tasks, we would need to retrain all the parameters, which wastes a great deal of computing resources…
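A minimal sketch of the low-rank update LoRA uses instead of full fine-tuning (my own illustration; the shapes, scaling, and initialization are assumptions, not the paper's code):

```python
import numpy as np

# LoRA idea in brief: freeze the pretrained weight W and learn a low-rank
# update B @ A, so only r * (d_in + d_out) parameters are trainable.
d_in, d_out, r = 1024, 1024, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d_in, d_out))       # frozen pretrained weight
A = 0.01 * rng.standard_normal((d_in, r))    # trainable down-projection
B = np.zeros((r, d_out))                     # trainable up-projection, zero init
alpha = 16                                   # scaling hyperparameter (assumed)

def forward(x):
    # original path plus the low-rank adapter path
    return x @ W + (alpha / r) * (x @ A @ B)

x = rng.standard_normal((2, d_in))
print(forward(x).shape)   # (2, 1024); equals x @ W until B is trained
```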
2025-08-30 15:21:49
265
Why do LLMs possess the capability for chain of thought?
I remember that when I started using LLM tools like ChatGPT or Kimi, they were terrific at handling complex reasoning jobs, such as math, commonsense, and other tasks. After a while, I wondered how these tools can do it. The most important factor is the chain of thought…
2025-08-17 16:48:30
131
What is model distillation?
Model distillation is a method used to compress a model for convenient deployment. It involves using a small model whose accuracy does not decline significantly. The principle behind this method is the use of "soft targets", which differ…
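A small sketch of a soft-target loss, to make the idea concrete (my own illustration; the temperature T and weight alpha are assumed values, not from the post):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft part: the student matches the teacher's temperature-softened distribution.
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T
    # Hard part: ordinary cross-entropy on the true labels.
    p = softmax(student_logits)
    hard = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft + (1 - alpha) * hard

rng = np.random.default_rng(0)
loss = distillation_loss(rng.standard_normal((8, 10)),
                         rng.standard_normal((8, 10)),
                         rng.integers(0, 10, size=8))
print(float(loss))
```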
2025-08-12 23:16:10
428
GQA: Grouped-Query Attention
The principle is very simple, as the graph above shows. For the conversion from MHA to GQA, we mean-pool the key and value heads, as shown in the following figure. 1. GQA leads to higher quality than MQA but is faster than MHA…
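A tiny sketch of the mean-pool conversion (my own illustration; the paper pools the key/value projection weights of the pretrained checkpoint, but the grouping logic is the same):

```python
import numpy as np

# Convert 8 MHA key heads into 2 GQA groups by mean-pooling adjacent heads.
num_heads, num_groups, seq, head_dim = 8, 2, 16, 64
rng = np.random.default_rng(0)

k_mha = rng.standard_normal((num_heads, seq, head_dim))   # per-head keys

k_gqa = (k_mha
         .reshape(num_groups, num_heads // num_groups, seq, head_dim)
         .mean(axis=1))                                    # average within each group

print(k_gqa.shape)   # (2, 16, 64): 8 KV heads pooled down to 2 groups
```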
2025-08-09 15:40:09
321
Why is the rotation matrix like this?
[ cosθ  -sinθ ]
[ sinθ   cosθ ]
Let us demonstrate this using the "complex-number shortcut". 1. We denote the 2-D vector [x, y] as the complex number z = x + iy. 2. If we want to rotate by some angle θ, in complex numbers we multiply by e^(iθ) = cos(θ) + i·sin(θ). 3. So, we…
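The truncated step 3 is one line of algebra; here is a short worked version of the shortcut (my own reconstruction of the standard argument):

```latex
% Multiply z = x + iy by e^{i\theta} and read off real and imaginary parts:
\begin{aligned}
e^{i\theta} z &= (\cos\theta + i\sin\theta)(x + iy) \\
              &= (x\cos\theta - y\sin\theta) + i\,(x\sin\theta + y\cos\theta),
\end{aligned}
\qquad\text{i.e.}\qquad
\begin{pmatrix} x' \\ y' \end{pmatrix} =
\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}.
```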
2025-07-16 23:00:08
168
A Deep Analysis of the MLA Algorithm
(Because the formulas are so difficult to type, I wrote them by hand.)
2025-05-30 22:55:45
240
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
This paper describes two models, named DeepSeek-R1-Zero and DeepSeek-R1. First of all, it introduces DeepSeek-R1-Zero, which is trained via reinforcement learning without supervised fine-tuning (SFT). It demonstrates a remarkable ability in reasoning…
2025-05-18 15:00:19
306
Adaptive Mixtures of Local Experts
Authors: Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, Geoffrey E. Hinton. This paper begins by introducing the Mixture of Experts (MoE) architecture. For a system composed of many separate networks, this architecture can effectively manage each…
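A toy gating sketch to show how the separate expert networks are combined (my own illustration with made-up shapes; the paper's training objective, which encourages the experts to specialize, is not shown here):

```python
import numpy as np

# Mixture of experts: a gating network produces softmax weights over several
# separate expert networks, and the output is their weighted combination.
rng = np.random.default_rng(0)
d_in, d_out, n_experts = 4, 3, 5

W_experts = rng.standard_normal((n_experts, d_in, d_out))  # one linear expert each
W_gate = rng.standard_normal((d_in, n_experts))            # gating network

def moe_forward(x):
    logits = x @ W_gate
    gate = np.exp(logits - logits.max())
    gate = gate / gate.sum()                               # softmax over experts
    expert_outs = np.stack([x @ W for W in W_experts])     # (n_experts, d_out)
    return gate @ expert_outs                              # weighted mixture

x = rng.standard_normal(d_in)
print(moe_forward(x))
```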
2025-03-06 22:13:58
470
remote-vscode cannot be accessed
1.1 Remove the corresponding entries in known_hosts, located in the .ssh directory on the local machine. 1.2 Add a remote platform entry in settings.json. 1.3 Remove the remote .vscode-server directory on the remote machine.
2025-01-17 14:39:05
241
Introduction to BERT
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model proposed by Google AI in October 2018, from the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. The overall structure consists of the Embedding layer, the Transformer Encoder, and the model output.
2024-06-10 17:37:16
983
A step-by-step guide to writing a GEMM that beats cuBLAS without SASS
In the end, mykernel was 7.3% faster than cuBLAS with M, N, and K all equal to 2048. The article also walks you step by step through reading the profiler output to optimize a GEMM kernel, rather than just presenting the final result like other articles do; I hope it helps with your own kernel optimization.
2024-05-15 15:33:52
1716
A detailed explanation of expression templates in Eigen
By referring to Eigen, I implemented my own expression-template library; it currently supports addition and subtraction and reserves interfaces for other operations. The main optimization techniques are: 1. header-only, so it is easy to use; 2. elements are stored on the stack; 3. expression templates enable lazy computation, reducing memory accesses and temporary variables; 4. the assignment loop is unrolled.
2024-03-05 16:35:33
1220
1
Optimizing GEMM with TVM
Optimization vs. time (s): default 2.930760; +tile 0.261980; +vectorize 0.16041; +parallel 0.025833.
2023-11-21 13:07:11
443
1
IPC (Inter-Process Communication) (C code)
This mainly uses sem_open, sem_post, sem_wait, and sem_trywait, which are likewise matched by name: each call to sem_post increments the semaphore by 1, and each call to sem_wait decrements it by 1. Here I also wrote a simple Producer and Consumer. An unnamed pipe can only be used between related processes; a FIFO pipe avoids this problem. The core FIFO API is mkfifo, and it is used much like a file; below are a Writer and a Reader that use a FIFO. Here I wrote two classes to implement file reading and writing respectively.
2023-11-14 13:02:48
347
1
Summary of OpenMP #pragma omp task
If too many tasks are produced and each task does very little work, you get severely negative optimization; in an unbounded-loop structure, if what a task processes depends on the tasks before and after it, the result is also negative; only when each task's work is independent of the surrounding tasks does it give a positive speedup. taskloop can effectively reduce the running time, but the grainsize should not be chosen too small.
2023-11-01 18:50:05
1448
1