Summary 4.27 - 28

This post discusses improving the time efficiency and prediction accuracy of a machine learning model by adjusting model parameters, the optimizer, and the data-processing strategy. Concretely, this includes reducing the batch size, switching to the Adam optimizer, initializing weights sensibly, and avoiding over-regularization, and it compares the loss under different settings. MAP and NDCG are introduced as metrics for evaluating the model's predictions, with a walkthrough of how they are computed and when to use them.


1. loss

Loss under various parameter settings.

1.1 Training time reduced

Previously: batch size = 1, the training set contains roughly 40k ASTs, and one epoch took about 90 minutes. Now: batch size = 10, with the model revised following the mini seq2seq example on GitHub; one epoch now takes about 25 minutes.
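For reference, a minimal sketch of what batching variable-length ASTs into mini-batches can look like in PyTorch; the node ids, lengths, labels, and the collate function below are made-up placeholders, not the project's actual model code:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Toy stand-ins: each "AST" is a variable-length sequence of node ids (placeholders).
asts = [torch.randint(0, 100, (n,)) for n in (12, 30, 7, 25)]
labels = [torch.tensor(l) for l in (3, 41, 8, 77)]

def collate(batch):
    # Pad the variable-length ASTs in a mini-batch to a common length,
    # so the encoder can process several trees at once instead of one by one.
    seqs, ys = zip(*batch)
    return pad_sequence(seqs, batch_first=True, padding_value=0), torch.stack(ys)

loader = DataLoader(list(zip(asts, labels)), batch_size=10, collate_fn=collate)
for x, y in loader:
    print(x.shape, y.shape)  # one batch of 4: torch.Size([4, 30]) torch.Size([4])
```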

1.2 Getting the loss to decrease

1. Model structure or feature engineering problems (borrowed the structure of a seq-to-seq translation setup whose loss does decrease)
2. Dataset not shuffled (shuffling added)
3. Problems in the dataset (the training set no longer contains samples whose label is UNK)
4. Bad weight-initialization scheme (parameters that used to be randomly initialized are now trainable parameters)
5. Over-regularization (dropout had only been added earlier because the loss was not decreasing)
6. Choice of activation and loss functions (switched back to the most basic attention mechanism)
7. Choice of optimizer and learning rate (batch size = 10, optimizer changed from SGD to Adam; see the training sketch after this list)
8. Batch size too large (batch size reduced from 128 to 10).
   "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima" argues that overly large batch sizes tend to converge to sharp minima and generalize poorly. With batch size = 1 the loss also decreases.
9. Insufficient training time (still running)
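As a reference for how items 2, 7, and 8 fit together, here is a minimal, self-contained training-loop sketch; the toy dataset and the tiny model are placeholders standing in for the real AST data and the seq2seq model with attention described above:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins (placeholders, not the project's data or model):
# 200 random "sequences" of 50 node ids drawn from a vocabulary of 100.
x = torch.randint(0, 100, (200, 50))
y = torch.randint(0, 100, (200,))
train_set = TensorDataset(x, y)

model = torch.nn.Sequential(
    torch.nn.Embedding(100, 32),
    torch.nn.Flatten(),
    torch.nn.Linear(32 * 50, 100),
)

train_loader = DataLoader(train_set, batch_size=10, shuffle=True)  # items 2 and 8
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)          # item 7
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(2):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```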

2. Ranking metrics

As mentioned before, NDCG is an order-sensitive ranking metric. It rests on two premises:

  • we want the most relevant results at the top of the ranking, because most users read from top to bottom, so putting the most relevant results first minimizes their reading time;
  • we want the whole result list to be as relevant to the query as possible.

NDCG allows real-valued relevance scores, whereas MAP (Mean Average Precision) uses binary relevance (1 = relevant, 0 = not relevant).
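For reference, a minimal sketch of one common NDCG formulation (relevance discounted by log2 of the rank; some variants instead use 2^rel − 1 gains). The relevance lists at the bottom are made-up examples, not results from this project:

```python
import numpy as np

def dcg(rels):
    # Discounted cumulative gain: relevance discounted by log2(rank + 1).
    rels = np.asarray(rels, dtype=float)
    discounts = np.log2(np.arange(2, len(rels) + 2))
    return float(np.sum(rels / discounts))

def ndcg(rels):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

print(ndcg([0, 1, 0, 0, 0]))  # correct item at rank 2 -> ~0.63
print(ndcg([1, 0, 0, 0, 0]))  # correct item at rank 1 -> 1.0
print(ndcg([3, 0, 1, 2, 0]))  # real-valued relevance is also allowed
```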

2.1 MAP

MAP is simply the mean of the per-query AP (Average Precision) over all queries in the dataset.
AP (Average Precision) rests on three premises:

  • we can recommend x items for each user
  • it pays to submit all x recommendations, because we are not penalized for bad guesses
  • order matters, so it’s better to submit more certain recommendations first, followed by recommendations we are less sure about
AP can be viewed as an order-sensitive version of recall.
Its formula is:

$$\operatorname{AP} = \frac{\sum_{k=1}^{n} \big(P(k) \times \operatorname{rel}(k)\big)}{\text{number of relevant documents}}$$

$$\operatorname{MAP} = \frac{\sum_{q=1}^{Q} \operatorname{AP}(q)}{Q}$$
Specific to code suggestion: suppose the node to predict is {"type":"NameLoad","value":"x"}, and the predictions are:

| predicted top-5 node | relevance score (rel) |
| --- | --- |
| {"type":"NameStore","value":"x"} | 0 |
| {"type":"NameLoad","value":"x"} | 1 |
| {"type":"num","value":"6"} | 0 |
| {"type":"NameLoad","value":"y"} | 0 |
| {"type":"NameLoad","value":"z"} | 0 |

AP = (0 + 1/2 + 0 + 0 + 0) / 1 = 50%.
Only when the top-1 prediction is correct does AP = (1 + 0 + 0 + 0 + 0) / 1 = 100%.
Ideally, the larger the better. Since each query has exactly one relevant node, AP ≥ 1/5 = 20% exactly when the correct node appears somewhere in the top 5, so MAP > 20% indicates that the answer can usually be found within the top-5 suggestions.

code

def AP(rels):
    # Average precision for one query, assuming binary relevance
    # and exactly one relevant item in the ranked list.
    ap = 0
    for i in range(len(rels)):
        if rels[i] == 1:
            ap += 1 / (i + 1)   # precision at the rank of the relevant item
    return ap

def MAP(data):
    # Mean of the per-query APs.
    total = 0
    for rels in data:           # was: AP(pre[i]) -- referenced the global list by mistake
        total += AP(rels)
    return total / len(data)

pre = [[0, 1, 0, 0, 0], [1, 0, 0, 0, 0]]
print(MAP(pre))                 # avoid rebinding the name MAP to the result
## output: 0.75

3. Results

The model overfits, and MAP does not exceed 20%, which means the correct prediction is not guaranteed to appear in the top 5.
Detailed results will be posted tomorrow, and this part can then be filled in.
Planned improvements:

  1. Add more data (one AST can generate multiple queries, but after padding the sequence length is around 30k, which throws a memory error; I plan to cap the sequence length at no more than 2K, since 95% of the data has length ≤ 2K — see the sketch after this list).

  2. Simplify the model and shrink the parameter space.
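A minimal sketch of the length cap proposed in item 1 above (the 2K threshold is from the text; the example queries and the pad_id are made-up placeholders):

```python
MAX_LEN = 2000  # cap sequences at 2K tokens (95% of the data is already <= 2K)

def cap_and_pad(sequences, max_len=MAX_LEN, pad_id=0):
    # Truncate overly long sequences and pad the rest to max_len,
    # so a padded batch no longer grows to ~30k tokens and exhausts memory.
    out = []
    for seq in sequences:
        seq = list(seq)[:max_len]
        out.append(seq + [pad_id] * (max_len - len(seq)))
    return out

# made-up example: three queries with very different lengths
queries = [[5, 7, 9], list(range(2500)), [1] * 10]
print([len(q) for q in cap_and_pad(queries)])  # [2000, 2000, 2000]
```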

References:
1. 理解NDCG
2. evaluating-recommender-systems
3. Mean_average_precision
4. 推荐系统常见评测标准之MAP与NDCG
