ColumnParallelLinear and RowParallelLinear

ColumnParallelLinear and RowParallelLinear are two parallel linear layers; they differ mainly in how the weight matrix is partitioned and in how the computation proceeds. Specifically:

ColumnParallelLinear

  • Weight partitioning

    • The weight matrix is split column-wise into P sub-matrices, and each sub-matrix is computed on its own parallel device.
    • In the Y = XA notation quoted below, A (shape (input_dim, output_dim)) is split along its columns, i.e. along the output dimension. Equivalently, a PyTorch-style weight W of shape (output_dim, input_dim) is split into P shards of shape (output_dim / P, input_dim), so each device produces its own slice of the output features.
  • Computation (block-matrix products obey the same rules as ordinary matrix multiplication):

    The linear layer is defined as Y = XA + b. A is parallelized along
        its second dimension as A = [A_1, ..., A_p].

  • Strengths

    • Suited to cases where the input is comparatively small but the output dimension is large: each device keeps the full input and produces only its own slice of the output, so the forward pass needs no cross-device reduction (see the sketch below).
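A minimal single-process sketch of the column split (plain PyTorch; the P devices are simulated by chunking the weight tensor, and all variable names are illustrative), showing that concatenating the per-shard output slices reproduces the full linear output:

    import torch

    torch.manual_seed(0)
    P = 4                                  # number of model-parallel shards
    X = torch.randn(2, 8)                  # input, shape (batch, input_dim)
    W = torch.randn(16, 8)                 # full weight, shape (output_dim, input_dim)

    # Column parallelism: split the output dimension (rows of W in PyTorch layout,
    # columns of A = W^T in the Y = XA notation).
    W_shards = W.chunk(P, dim=0)           # P shards of shape (output_dim / P, input_dim)

    Y_full = X @ W.t()                                 # (2, 16)
    Y_parts = [X @ Ws.t() for Ws in W_shards]          # each device computes (2, 16 / P)
    Y_gathered = torch.cat(Y_parts, dim=-1)            # gather the output slices

    assert torch.allclose(Y_full, Y_gathered, atol=1e-6)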

RowParallelLinear

  • Weight partitioning

    • The weight matrix is split row-wise into P sub-matrices, and each sub-matrix is computed on its own parallel device.
    • In the Y = XA notation quoted below, A is split along its rows, i.e. along the input dimension, and X is split along its columns to match. Equivalently, a PyTorch-style weight W of shape (output_dim, input_dim) is split into P shards of shape (output_dim, input_dim / P), so each device consumes its own slice of the input features and produces a partial result of the full output size.
  • Computation (block-matrix products obey the same rules as ordinary matrix multiplication):

    The linear layer is defined as Y = XA + b. A is parallelized along
        its first dimension and X along its second dimension as:
                   -   -
                  | A_1 |
                  | .   |
              A = | .   |        X = [X_1, ..., X_p]
                  | .   |
                  | A_p |
                   -   -
  • Strengths

    • Suited to cases where the output dimension is comparatively small but the input is large: each device handles only a slice of the input features, and the per-device partial products are summed across devices (an all-reduce) to form the final output (see the sketch below).
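A matching single-process sketch of the row split (plain PyTorch; the cross-device sum that a real setup would perform with an all-reduce is simulated with a stack-and-sum, and all variable names are illustrative):

    import torch

    torch.manual_seed(0)
    P = 4
    X = torch.randn(2, 8)                  # input, shape (batch, input_dim)
    W = torch.randn(16, 8)                 # full weight, shape (output_dim, input_dim)

    # Row parallelism: split the input dimension (columns of W in PyTorch layout,
    # rows of A = W^T in the Y = XA notation), and split X the same way.
    W_shards = W.chunk(P, dim=1)           # P shards of shape (output_dim, input_dim / P)
    X_shards = X.chunk(P, dim=1)           # P shards of shape (batch, input_dim / P)

    Y_full = X @ W.t()
    Y_partial = [Xs @ Ws.t() for Xs, Ws in zip(X_shards, W_shards)]   # each (2, output_dim)
    Y_reduced = torch.stack(Y_partial).sum(dim=0)                     # all-reduce (sum) in a real setup

    assert torch.allclose(Y_full, Y_reduced, atol=1e-5)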

Both layers appear together in the Multi-Headed Attention Layer (MLA) module below: the query and key/value up-projections are ColumnParallelLinear, while the output projection is RowParallelLinear.

    class MLA(nn.Module):
        """
        Multi-Headed Attention Layer (MLA).

        Attributes:
            dim (int): Dimensionality of the input features.
            n_heads (int): Number of attention heads.
            n_local_heads (int): Number of local attention heads for distributed systems.
            q_lora_rank (int): Rank for low-rank query projection.
            kv_lora_rank (int): Rank for low-rank key/value projection.
            qk_nope_head_dim (int): Dimensionality of non-positional query/key projections.
            qk_rope_head_dim (int): Dimensionality of rotary-positional query/key projections.
            qk_head_dim (int): Total dimensionality of query/key projections.
            v_head_dim (int): Dimensionality of value projections.
            softmax_scale (float): Scaling factor for softmax in attention computation.
        """
        def __init__(self, args: ModelArgs):
            super().__init__()
            self.dim = args.dim
            self.n_heads = args.n_heads
            self.n_local_heads = args.n_heads // world_size
            self.q_lora_rank = args.q_lora_rank
            self.kv_lora_rank = args.kv_lora_rank
            self.qk_nope_head_dim = args.qk_nope_head_dim
            self.qk_rope_head_dim = args.qk_rope_head_dim
            self.qk_head_dim = args.qk_nope_head_dim + args.qk_rope_head_dim
            self.v_head_dim = args.v_head_dim
            if self.q_lora_rank == 0:
                self.wq = ColumnParallelLinear(self.dim, self.n_heads * self.qk_head_dim)
            else:
                self.wq_a = Linear(self.dim, self.q_lora_rank)
                self.q_norm = RMSNorm(self.q_lora_rank)
                self.wq_b = ColumnParallelLinear(self.q_lora_rank, self.n_heads * self.qk_head_dim)
            self.wkv_a = Linear(self.dim, self.kv_lora_rank + self.qk_rope_head_dim)
            self.kv_norm = RMSNorm(self.kv_lora_rank)
            self.wkv_b = ColumnParallelLinear(self.kv_lora_rank, self.n_heads * (self.qk_nope_head_dim + self.v_head_dim))
            self.wo = RowParallelLinear(self.n_heads * self.v_head_dim, self.dim)
            self.softmax_scale = self.qk_head_dim ** -0.5
            if args.max_seq_len > args.original_seq_len:
                mscale = 0.1 * args.mscale * math.log(args.rope_factor) + 1.0
                self.softmax_scale = self.softmax_scale * mscale * mscale
            if attn_impl == "naive":
                self.register_buffer("k_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.n_local_heads, self.qk_head_dim), persistent=False)
                self.register_buffer("v_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.n_local_heads, self.v_head_dim), persistent=False)
            else:
                self.register_buffer("kv_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.kv_lora_rank), persistent=False)
                self.register_buffer("pe_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.qk_rope_head_dim), persistent=False)
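In this constructor, wq / wq_b and wkv_b are ColumnParallelLinear, so each rank holds n_local_heads = n_heads // world_size heads' worth of output columns, while wo is RowParallelLinear, consuming those local head outputs and summing the per-rank partial results. Below is a hedged, minimal sketch of how such layers can be built on nn.Linear and torch.distributed; it is an illustration under these assumptions, not the implementation behind the excerpt, and it omits the autograd handling a training setup would need:

    import torch
    import torch.distributed as dist
    import torch.nn.functional as F
    from torch import nn

    # Assumed to be initialized elsewhere, as `world_size` is in the excerpt above.
    world_size = dist.get_world_size() if dist.is_initialized() else 1


    class ColumnParallelLinear(nn.Linear):
        """Each rank holds out_features // world_size output columns (e.g. its local heads)."""
        def __init__(self, in_features: int, out_features: int, bias: bool = False):
            assert out_features % world_size == 0
            super().__init__(in_features, out_features // world_size, bias=bias)
        # forward is inherited from nn.Linear: the result is already the local
        # output slice, so no communication happens here.


    class RowParallelLinear(nn.Linear):
        """Each rank holds in_features // world_size input rows; partial outputs are summed."""
        def __init__(self, in_features: int, out_features: int, bias: bool = False):
            assert in_features % world_size == 0
            super().__init__(in_features // world_size, out_features, bias=bias)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            y = F.linear(x, self.weight)        # partial product on the local input shard
            if world_size > 1:
                dist.all_reduce(y)              # sum the partial products across ranks
            if self.bias is not None:
                y = y + self.bias               # add the bias once, after the reduction
            return y

Stacking a ColumnParallelLinear in front of a RowParallelLinear, as MLA does with its head projections and wo, keeps the attention math between them entirely local to each rank; the only forward-pass communication in the block is the single all-reduce inside wo.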