Batches & False sharing

This article takes a close look at how the innerloopBatchCount parameter of ScheduleParallel() affects job performance, with particular attention to why choosing a multiple of 16 avoids false sharing. Through a small experiment and a look at how CPU caches work, it explores the impact of batch size on performance and the resulting best practice.


In this section we look at how the innerloopBatchCount parameter of the ScheduleParallel() API affects job performance.

    
    public static Unity.Jobs.JobHandle Schedule(T jobData, int arrayLength, int innerloopBatchCount, Unity.Jobs.JobHandle dependsOn);
In the earlier examples, whenever we called ScheduleParallel() we always passed 64 for the second parameter, innerloopBatchCount. What does this value actually mean, and is 64 a good choice? Let's find out.
According to the Unity documentation, innerloopBatchCount is the number of iterations a worker grabs at a time when it performs work stealing; that worker thread then calls Execute(index) that many times in a row. For example, if arrayLength is 100 and innerloopBatchCount is 10, the whole job is split into ten equal batches: each worker takes one batch and calls Execute(index) 10 times in succession.
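To make the splitting concrete, here is a minimal conceptual sketch (not Unity's actual scheduler code; the loop below only illustrates the batching) of how arrayLength = 100 and innerloopBatchCount = 10 turn into ten batches of consecutive Execute(index) calls:

    // Conceptual illustration only -- this is not Unity's scheduler implementation.
    // It shows how arrayLength work items are carved into batches of
    // innerloopBatchCount consecutive indices, and how one worker runs a
    // whole batch with back-to-back Execute(index) calls.
    int arrayLength = 100;
    int innerloopBatchCount = 10;

    for (int batchStart = 0; batchStart < arrayLength; batchStart += innerloopBatchCount)
    {
        int batchEnd = System.Math.Min(batchStart + innerloopBatchCount, arrayLength);

        // In the real job system, each batch is handed to (or stolen by) a worker thread.
        for (int index = batchStart; index < batchEnd; index++)
        {
            // job.Execute(index);
        }
    }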
So is 64 a suitable value? And if it is not, how should we pick one? Let's run a test first. The test code is as follows:
    
    var job = new VelocityJob()
    {
        deltaTime = Time.deltaTime,
        position = m_Positions,
        velocity = m_Velocity
    };

    var batchCount = 1;
    Profiler.BeginSample($"Batch = {batchCount}");
    job.ScheduleParallel(m_Positions.Length, batchCount, new JobHandle()).Complete();
    Profiler.EndSample();

    batchCount = 8;
    Profiler.BeginSample($"Batch = {batchCount}");
    job.ScheduleParallel(m_Positions.Length, batchCount, new JobHandle()).Complete();
    Profiler.EndSample();

    batchCount = 16;
    Profiler.BeginSample($"Batch = {batchCount}");
    job.ScheduleParallel(m_Positions.Length, batchCount, new JobHandle()).Complete();
    Profiler.EndSample();

    batchCount = 32;
    Profiler.BeginSample($"Batch = {batchCount}");
    job.ScheduleParallel(m_Positions.Length, batchCount, new JobHandle()).Complete();
    Profiler.EndSample();

    batchCount = 64;
    Profiler.BeginSample($"Batch = {batchCount}");
    job.ScheduleParallel(m_Positions.Length, batchCount, new JobHandle()).Complete();
    Profiler.EndSample();
We set innerloopBatchCount to 1, 8, 16, 32, and 64 and compare their performance.
The test results are shown in the figure below (measured on a 2018 MacBook Pro, i7):
In the chart above it is clear that Batch = 1 performs worst, Batch = 8 does somewhat better, and every integer multiple of 16 performs about the same.
Let's first look at the Batch = 1 case:
When Batch = 1, each worker steals work in the smallest possible units, so the stealing overhead is at its highest, and that extra cost slows things down. This explanation sounds reasonable on its own, but when we compare Batch = 16, 32, and 64 we find that performance does not keep improving as work stealing becomes cheaper, so some other factor must be limiting performance. To explain it, let's first look at how the CPU reads data from memory:
Let's start with the single-CPU case:
Suppose the CPU needs to read element 0.
When a CPU reads data from memory, the data must first be loaded into the cache, and the cache is filled one cache line at a time; a cache line is typically 32 or 64 bytes. So in order to read element 0, CPU1 also pulls in elements 1 to 4, which it does not need. With a single CPU this causes no problems at all.
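As a quick illustration, the following standalone sketch (not from the original article; it assumes 12-byte elements, the size of the float3 used later, and a 64-byte cache line) prints which cache line each element's bytes fall into. It shows that elements 0 through 4 sit entirely inside cache line 0, which is why reading element 0 also drags its neighbours into the cache:

    using System;

    class CacheLineLayout
    {
        static void Main()
        {
            const int elementSize = 12;   // assumed: a float3 is 3 * 4 bytes
            const int cacheLineSize = 64; // assumed: a common cache line size

            for (int i = 0; i < 8; i++)
            {
                int firstByte = i * elementSize;
                int lastByte = firstByte + elementSize - 1;
                Console.WriteLine($"element {i}: bytes {firstByte}-{lastByte}, " +
                                  $"cache line(s) {firstByte / cacheLineSize} to {lastByte / cacheLineSize}");
            }
            // The output shows elements 0-4 live entirely in cache line 0, so fetching
            // element 0 also brings elements 1-4 into the cache.
        }
    }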
Now let's look at the two-CPU case:
Here, suppose CPU1 reads element 0 and CPU2 reads element 1.
CPU1 processes element 0 and CPU2 processes element 1, and the two elements belong to the same cache line, which means both CPUs load that same cache line into their own caches. If the two CPUs then want to modify the same cache line, we run into the cache coherence problem. To deal with it, Intel introduced the MESI protocol. In the situation shown in the figure, this means:
If CPU1 acquires ownership of the cache line, the copy CPU2 has already loaded becomes invalid, and CPU2 has to re-read that cache line from memory. This slows CPU2 down, and the problem is known as false sharing.
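False sharing is easy to reproduce outside the job system, too. The following standalone sketch (my own illustration, not part of the article's test project; the 64-byte cache line size is an assumption) lets two threads hammer two counters that sit in the same cache line, then repeats the run with the counters placed 128 bytes apart. On typical hardware the second run is considerably faster:

    using System;
    using System.Diagnostics;
    using System.Threading;

    class FalseSharingDemo
    {
        const int Iterations = 100_000_000;

        // Each thread repeatedly increments its own counter; the only difference
        // between the two runs is how far apart the counters are in memory.
        static long Run(long[] counters, int indexA, int indexB)
        {
            counters[indexA] = 0;
            counters[indexB] = 0;

            var a = new Thread(() => { for (int i = 0; i < Iterations; i++) counters[indexA]++; });
            var b = new Thread(() => { for (int i = 0; i < Iterations; i++) counters[indexB]++; });

            var sw = Stopwatch.StartNew();
            a.Start();
            b.Start();
            a.Join();
            b.Join();
            return sw.ElapsedMilliseconds;
        }

        static void Main()
        {
            // Two adjacent longs: they share one 64-byte cache line.
            Console.WriteLine($"same cache line:       {Run(new long[2], 0, 1)} ms");

            // Two longs 16 slots (128 bytes) apart: they sit on different cache lines.
            Console.WriteLine($"different cache lines: {Run(new long[17], 0, 16)} ms");
        }
    }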
OK, now that we understand what slows things down when Batch = 1, let's look at why every multiple of 16 gives roughly the same performance.
Let's first revisit VelocityJob:
    
    struct VelocityJob : IJobFor
    {
        [ReadOnly]
        public NativeArray<float3> velocity;

        public NativeArray<float3> position;

        public float deltaTime;

        public void Execute(int i)
        {
            position[i] += velocity[i] * deltaTime;
        }
    }
velocity and position are both float3, which is 4 * 3 = 12 bytes, and the cache line on my machine is 64 bytes. Since 12 * 16 == 64 * 3 == 192, a batch whose size is a multiple of 16 occupies a whole number of cache lines. Each cache line is therefore used in full by a single batch, which effectively avoids the false sharing problem.
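That alignment argument can be checked with a few lines of arithmetic. The sketch below (assuming the same 12-byte float3 element and 64-byte cache line as above) reports how many bytes each tested batch covers and whether a batch ends exactly on a cache-line boundary; only the multiples of 16 do, so with those values two workers never write into the same cache line:

    using System;

    class BatchCacheLineCheck
    {
        static void Main()
        {
            const int elementSize = 12;   // assumed: a float3 is 3 * 4 bytes
            const int cacheLineSize = 64; // assumed: cache line size of the test machine

            foreach (int batchCount in new[] { 1, 8, 16, 32, 64 })
            {
                int batchBytes = batchCount * elementSize;
                string verdict = batchBytes % cacheLineSize == 0
                    ? $"exactly {batchBytes / cacheLineSize} cache lines -> no sharing between batches"
                    : "not a whole number of cache lines -> adjacent batches can share a line";
                Console.WriteLine($"batch = {batchCount,2}: {batchBytes,4} bytes per batch, {verdict}");
            }
        }
    }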