Parameters for big model inference
torch_dtype (str or torch.dtype, optional) — Override the default torch.dtype and load the model under a specific dtype. The different options are:
torch.float16 or torch.bfloat16 or torch.float: load in a specified dtype, ignoring the model’s config.torch_dtype if one exists. If not specified
the model will get loaded in torch.float (fp32).
"auto" - A torch_dtype entry in the config.json file of the model will be attempted to be used. If this entry isn’t found then next check the dtype of the first weight in the checkpoint that’s of a floating point type and use that as dtype. This will load the model using the dtype it was saved in at the end of the training. It can’t be used as an indicator of how the model was trained. Since it could be trained in one of half precision dtypes, but saved in fp32.
A string that is a valid torch.dtype. E.g. “float32” loads the model in torch.float32, “float16” loads in torch.float16 etc.
For some models the dtype they were trained in is unknown - you may try to check the model’s paper or reach out to the authors and ask them to add this information to the model’s card and to insert the torch_dtype entry in config.json on the hub.
device_map (str or dict[str, Union[int, str, torch.device]] or int or torch.device, optional) — A map that specifies where each submodule should go. It doesn’t need to be refined to each parameter/buffer name, once a given module name is inside, every submodule of it will be sent to the same device. If we only pass the device (e.g., "cpu", "cuda:1", "mps", or a GPU ordinal rank like 1) on which the model will be allocated, the device map will map the entire model to this device. Passing device_map = 0 means put the whole model on GPU 0.
To have Accelerate compute the most optimized device_map automatically, set device_map="auto". For more information about each option see designing a device map.
max_memory (Dict, optional) — A dictionary device identifier to maximum memory if using device_map. Will default to the maximum memory available for each GPU and the available CPU RAM if unset.
tp_plan (str, optional) — A torch tensor parallel plan, see here. Currently, it only accepts tp_plan="auto" to use predefined plan based on the model. Note that if you use it, you should launch your script accordingly with torchrun [args] script.py. This will be much faster than using a device_map, but has limitations.
tp_size (str, optional) — A torch tensor parallel degree. If not provided would default to world size.
device_mesh (torch.distributed.DeviceMesh, optional) — A torch device mesh. If not provided would default to world size. Used only for tensor parallel for now.
offload_folder (str or os.PathLike, optional) — If the device_map contains any value "disk", the folder where we will offload weights.
offload_state_dict (bool, optional) — If True, will temporarily offload the CPU state dict to the hard drive to avoid getting out of CPU RAM if the weight of the CPU state dict + the biggest shard of the checkpoint does not fit. Defaults to True when there is some disk offload.
offload_buffers (bool, optional) — Whether or not to offload the buffers with the model parameters.
quantization_config (Union[QuantizationConfigMixin,Dict], optional) — A dictionary of configuration parameters or a QuantizationConfigMixin object for quantization (e.g bitsandbytes, gptq). There may be other quantization-related kwargs, including load_in_4bit and load_in_8bit, which are parsed by QuantizationConfigParser. Supported only for bitsandbytes quantizations and not preferred. consider inserting all such arguments into quantization_config instead.
subfolder (str, optional, defaults to "") — In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here.
variant (str, optional) — If specified load weights from variant filename, e.g. pytorch_model..bin. variant is ignored when using from_tf or from_flax.
use_safetensors (bool, optional, defaults to None) — Whether or not to use safetensors checkpoints. Defaults to None. If not specified and safetensors is not installed, it will be set to False.
weights_only (bool, optional, defaults to True) — Indicates whether unpickler should be restricted to loading only tensors, primitive types, dictionaries and any types added via torch.serialization.add_safe_globals(). When set to False, we can load wrapper tensor subclass weights.
key_mapping (`dict[str, str], optional) — A potential mapping of the weight names if using a model on the Hub which is compatible to a Transformers architecture, but was not converted accordingly.
kwargs (remaining dictionary of keyword arguments, optional) — Can be used to update the configuration object (after it being loaded) and initiate the model (e.g., output_attentions=True). Behaves differently depending on whether a config is provided or automatically loaded:
If a configuration is provided with config, **kwargs will be directly passed to the underlying model’s __init__ method (we assume all relevant updates to the configuration have already been done)
If a configuration is not provided, kwargs will be first passed to the configuration class initialization function (from_pretrained()). Each key of kwargs that corresponds to a configuration attribute will be used to override said attribute with the supplied kwargs value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model’s __init__ function.
详细解释一下上面这个