[Triton Tutorial] triton.language.join

Triton is a language and compiler for parallel programming. It aims to provide a Python-based programming environment for productively writing custom DNN compute kernels that can run at maximal throughput on modern GPU hardware.

More Triton documentation in Chinese is available at → https://triton.hyper.ai/

triton.language.join(a, b)

Joins the given tensors in a new, minor dimension.

For example, given two tensors of shape (4, 8), this produces a new tensor of shape (4, 8, 2). Given two scalars, it returns a tensor of shape (2,).

The two inputs are broadcast to the same shape.

If you want to join more than two elements, you can call this function multiple times. This reflects the constraint in Triton that tensor sizes must be powers of two.

join is the inverse of split.
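
The snippet below is a minimal sketch of both points: chaining join calls combines more than two tensors, and split undoes the outermost join. It assumes a Triton build that provides tl.join and tl.split; the kernel name join_four_kernel and its arguments are illustrative, not part of the API.

```python
import triton
import triton.language as tl


@triton.jit
def join_four_kernel(a_ptr, b_ptr, c_ptr, d_ptr, out_ptr, BLOCK: tl.constexpr):
    # Sketch only: assumes tl.join / tl.split are available in this Triton build.
    offs = tl.arange(0, BLOCK)
    a = tl.load(a_ptr + offs)  # shape (BLOCK,)
    b = tl.load(b_ptr + offs)
    c = tl.load(c_ptr + offs)
    d = tl.load(d_ptr + offs)

    # Each join adds a trailing dimension of size 2, so nesting the calls
    # combines four (BLOCK,) tensors into one (BLOCK, 2, 2) tensor.
    ab = tl.join(a, b)      # (BLOCK, 2): ab[:, 0] == a, ab[:, 1] == b
    cd = tl.join(c, d)      # (BLOCK, 2)
    abcd = tl.join(ab, cd)  # (BLOCK, 2, 2)

    # split is the inverse of join: it removes the trailing size-2 dimension,
    # recovering the operands of the outermost join.
    ab_again, cd_again = tl.split(abcd)  # two (BLOCK, 2) tensors

    # Flatten for a simple contiguous store of all 4 * BLOCK elements.
    tl.store(out_ptr + tl.arange(0, 4 * BLOCK), tl.ravel(abcd))
```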

Parameters

  • a (Tensor) – the first input tensor.
  • b (Tensor) – the second input tensor.
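
As an end-to-end usage sketch of the (a, b) signature, the kernel below interleaves two vectors: joining two (BLOCK_SIZE,) tensors gives a (BLOCK_SIZE, 2) tensor whose row-major flattening alternates elements of x and y. This assumes a CUDA-capable GPU and a Triton version that ships tl.join; interleave_kernel and the launch configuration are illustrative choices, not part of the documented API.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def interleave_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Illustrative sketch: requires a GPU and a Triton build that provides tl.join.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    x = tl.load(x_ptr + offsets, mask=mask)  # shape (BLOCK_SIZE,)
    y = tl.load(y_ptr + offsets, mask=mask)  # shape (BLOCK_SIZE,)

    # join adds a new minor dimension: z has shape (BLOCK_SIZE, 2) with
    # z[i, 0] == x[i] and z[i, 1] == y[i].
    z = tl.join(x, y)

    # Row-major flattening of (BLOCK_SIZE, 2) alternates elements of x and y.
    out = tl.ravel(z)
    out_offsets = pid * 2 * BLOCK_SIZE + tl.arange(0, 2 * BLOCK_SIZE)
    tl.store(out_ptr + out_offsets, out, mask=out_offsets < 2 * n_elements)


# Host-side usage: out == [x0, y0, x1, y1, ...]
x = torch.arange(8, device="cuda", dtype=torch.float32)
y = -x - 1
out = torch.empty(16, device="cuda", dtype=torch.float32)
grid = (triton.cdiv(x.numel(), 8),)
interleave_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=8)
```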