Want Speed? Pass by Value.

PRE-C++11
Be honest: how does the following code make you feel?

std::vector<std::string> get_names();
…
std::vector<std::string> const names = get_names();

Frankly, even though I should know better, it makes me nervous. In principle, when get_names() returns, we have to copy a vector of strings. Then, we need to copy it again when we initialize names, and we need to destroy the first copy. If there are N strings in the vector, each copy could require as many as N+1 memory allocations and a whole slew of cache-unfriendly data accesses as the string contents are copied.

Rather than confront that sort of anxiety, I’ve often fallen back on pass-by-reference to avoid needless copies:

void get_names( std::vector<std::string>& out_param );
…
std::vector<std::string> names;
get_names( names );

Unfortunately, this approach is far from ideal.

  • The code grew by 150%.
  • We’ve had to drop const-ness because we’re mutating names.
  • As functional programmers like to remind us, mutation makes code more complex to reason about by undermining referential transparency and equational reasoning.
  • We no longer have strict value semantics[1] for names.

But is it really necessary to mess up our code in this way to gain efficiency? Fortunately, the answer turns out to be no (and especially not if you are using C++0x). This article is the first in a series that explores rvalues and their implications for efficient value semantics in C++.

Rvalues

Rvalues are expressions that create anonymous temporary objects. The name rvalue refers to the fact that an rvalue expression of builtin type can only appear on the right-hand side of an assignment. Unlike lvalues, which, when non-const, can always be used on the left-hand side of an assignment, rvalue expressions yield objects without any persistent identity to assign into.[2]

The important thing about anonymous temporaries for our purposes, though, is that they can only be used once in an expression. How could you possibly refer to such an object a second time? It doesn’t have a name (thus, “anonymous”); and after the full expression is evaluated, the object is destroyed (thus, “temporary”)!
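To make the distinction concrete, here is a minimal sketch using a builtin type (the variable names are mine, purely for illustration):

int x = 1;        // x is an lvalue: it has a name and a persistent identity
int y = x + 1;    // x + 1 is an rvalue: an anonymous temporary
x = y;            // fine: a non-const lvalue can be assigned to
// x + 1 = y;     // error: an rvalue of builtin type can't appear on the left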

Once you know you are copying from an rvalue, then, it should be possible to “steal” the expensive-to-copy resources from the source object and use them in the target object without anyone noticing. In this case that would mean transferring ownership of the source vector’s dynamically-allocated array of strings to the target vector. If we could somehow get the compiler to execute that “move” operation for us, it would be cheap–almost free–to initialize names from a vector returned by-value.
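As a sketch of what such a transfer might look like (hypothetical code, not how std::vector is actually implemented), imagine a simplified container that hands over its buffer instead of copying each element:

#include <cstddef>
#include <string>

struct StringArray            // hypothetical stand-in for std::vector<std::string>
{
    std::string* data;        // dynamically-allocated array
    std::size_t  size;
};

void move_from(StringArray& target, StringArray& source)
{
    target.data = source.data;   // steal the allocated buffer: O(1), no copies
    target.size = source.size;
    source.data = 0;             // leave the source empty but safely destructible
    source.size = 0;
}

This is exactly the kind of operation we’d like the compiler to arrange for us whenever the source is an rvalue.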

That would take care of the second expensive copy, but what about the first? When get_names returns, in principle, it has to copy the function’s return value from the inside of the function to the outside. Well, it turns out that return values have the same property as anonymous temporaries: they are about to be destroyed, and won’t be used again. So, we could eliminate the first expensive copy in the same way, transferring the resources from the return value on the inside of the function to the anonymous temporary seen by the caller.

Copy Elision and the RVO

The reason I kept writing above that copies were made “in principle” is that the compiler is actually allowed to perform some optimizations based on the same principles we’ve just discussed. This class of optimizations is known formally as copy elision. For example, in the Return Value Optimization (RVO), the calling function allocates space for the return value on its stack, and passes the address of that memory to the callee. The callee can then construct a return value directly into that space, which eliminates the need to copy from inside to outside. The copy is simply elided, or “edited out,” by the compiler. So in code like the following, no copies are required:

std::vector<std::string> names = get_names();
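You can watch the elision happen with a small tracing class (a sketch of my own; the exact output depends on your compiler and optimization settings):

#include <iostream>

struct Tracer
{
    Tracer()              { std::cout << "construct\n"; }
    Tracer(Tracer const&) { std::cout << "copy\n"; }
    ~Tracer()             { std::cout << "destroy\n"; }
};

Tracer make() { return Tracer(); }   // a candidate for the RVO

int main()
{
    Tracer t = make();   // with the RVO: one "construct", no "copy"
}

With the RVO applied, this prints “construct” and “destroy” exactly once each; without it, you’d see one or two “copy” lines in between.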

Also, although the compiler is normally required to make a copy when a function parameter is passed by value (so modifications to the parameter inside the function can’t affect the caller), it is allowed to elide the copy, and simply use the source object itself, when the source is an rvalue.

std::vector<std::string> 
sorted(std::vector<std::string> names)
{
    std::sort(names.begin(), names.end());
    return names;
}
 
// names is an lvalue; a copy is required so we don't modify names
std::vector<std::string> sorted_names1 = sorted( names );
 
// get_names() is an rvalue expression; we can omit the copy!
std::vector<std::string> sorted_names2 = sorted( get_names() );

This is pretty remarkable. In principle, in the last line above, the compiler can eliminate all the worrisome copies, making sorted_names2 the same object as the one created in get_names(). In practice, though, the principle won’t take us quite that far, as I’ll explain later.

Implications

Although copy elision is never required by the standard, recent versions of every compiler I’ve tested do perform these optimizations today. But even if you don’t feel comfortable returning heavyweight objects by value, copy elision should still change the way you write code.

Consider this cousin of our original sorted(…) function, which takes names by const reference and makes an explicit copy:

std::vector<std::string> 
sorted2(std::vector<std::string> const& names) // names passed by reference
{
    std::vector<std::string> r(names);        // and explicitly copied
    std::sort(r.begin(), r.end());
    return r;
}

Although sorted and sorted2 seem at first to be identical, there could be a huge performance difference if a compiler does copy elision. Even if the actual argument to sorted2 is an rvalue, the source of the copy, names, is an lvalue,[3] so the copy can’t be optimized away. In a sense, copy elision is a victim of the separate compilation model: inside the body of sorted2, there’s no information about whether the actual argument to the function is an rvalue; outside, at the call site, there’s no indication that a copy of the argument will eventually be made.

That realization leads us directly to this guideline:

Guideline: Don’t copy your function arguments. Instead, pass them by value and let the compiler do the copying.

At worst, if your compiler doesn’t elide copies, performance will be no worse. At best, you’ll see an enormous performance boost.

One place you can apply this guideline immediately is in assignment operators. The canonical, easy-to-write, always-correct, strong-guarantee, copy-and-swap assignment operator is often seen written this way:

T& T::operator=(T const& x) // x is a reference to the source
{ 
    T tmp(x);          // copy construction of tmp does the hard work
    swap(*this, tmp);  // trade our resources for tmp's
    return *this;      // our (old) resources get destroyed with tmp 
}

but in light of copy elision, that formulation is glaringly inefficient! It’s now “obvious” that the correct way to write a copy-and-swap assignment is:

T& T::operator=(T x) // x is a copy of the source; hard work already done
{
    swap(*this, x);  // trade our resources for x's
    return *this;    // our (old) resources get destroyed with x
}
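For a self-contained picture of the idiom, here is a minimal resource-owning class written this way (the class and its members are mine, for illustration only):

#include <algorithm>
#include <cstddef>
#include <cstring>

class Buffer
{
    std::size_t size_;
    char*       data_;

public:
    explicit Buffer(char const* s)
      : size_(std::strlen(s)), data_(new char[size_])
    {
        std::memcpy(data_, s, size_);
    }

    Buffer(Buffer const& other)
      : size_(other.size_), data_(new char[other.size_])
    {
        std::memcpy(data_, other.data_, size_);
    }

    ~Buffer() { delete[] data_; }

    friend void swap(Buffer& a, Buffer& b)
    {
        std::swap(a.size_, b.size_);
        std::swap(a.data_, b.data_);
    }

    Buffer& operator=(Buffer x)   // copy construction of x does the hard work
    {
        swap(*this, x);           // and that copy is elided for rvalue sources
        return *this;
    }
};

Because the copy happens in the parameter, the compiler can elide it entirely when the right-hand side of the assignment is an rvalue; the swap itself is just a few pointer exchanges.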

Reality Bites

Of course, lunch is never really free, so I have a couple of caveats.

First, when you pass parameters by reference and copy in the function body, the copy constructor is called from one central location. However, when you pass parameters by value, the compiler generates a call to the copy constructor at the site of each call where an lvalue argument is passed. If the function will be called from many places and code size or locality are serious considerations for your application, this difference could have a real effect.

On the other hand, it’s easy to build a wrapper function that localizes the copy:

std::vector<std::string> 
sorted3(std::vector<std::string> const& names)
{
    // copy is generated once, at the site of this call
    return sorted(names);
}

Since the converse doesn’t hold—you can’t get back a lost opportunity for copy elision by wrapping—I recommend you start by following the guideline, and add such a wrapper only where you find it to be necessary.

Second, I’ve yet to find a compiler that will elide the copy when a function parameter is returned, as in our implementation of sorted. When you think about how these elisions are done, it makes sense: without some form of inter-procedural optimization, the caller of sorted can’t know that the argument (and not some other object) will eventually be returned, so the compiler must allocate separate space on the stack for the argument and the return value.

If you need to return a function parameter, you can still get near-optimal performance by swapping into a default-constructed return value (provided default construction and swap are cheap, as they should be):

std::vector<std::string> 
sorted(std::vector<std::string> names)
{
    std::sort(names.begin(), names.end());
    std::vector<std::string> ret;
    swap(ret, names);
    return ret;
}

More To Come

Hopefully you now have the ammunition you need to stave off anxiety about passing and returning nontrivial objects by value. But we’re not done yet: now that we’ve covered rvalues, copy elision, and the RVO, we have all the background we need to attack move semantics, rvalue references, perfect forwarding, and more as we continue this article series. See you soon!

Follow this link to the next installment.

Acknowledgements

Howard Hinnant is responsible for key insights that make this article series possible. Andrei Alexandrescu was posting on comp.lang.c++.moderated about how to leverage copy elision years before I took it seriously. Most of all, though, thanks in general to all readers and reviewers!


  1. Googling for a good definition of value semantics turned up nothing for me. Unless someone else can point to one (and maybe even if they can), we’ll be running an article on that topic—in which I promise you a definition—soon. 

  2. For a detailed treatment of rvalues and lvalues, please see this excellent article by Dan Saks.

  3. Except for enums and non-type template parameters, every value with a name is an lvalue.

===========================
Valuable Comments from Howard Hinnant

I think this article should be updated for C++11. There are two things wrong with it:

  1. It leaves the impression that you should always write your assignment operator like so:

    T& operator=(T x)    // x is a copy of the source; hard work already done
    {
        swap(*this, x);  // trade our resources for x's
        return *this;    // our (old) resources get destroyed with x
    }
    

    But in some important cases, this is a large performance penalty. Vector-like classes where heap memory can be reused during the copy assignment are a classic example. I’ve just written a short example showing as much as a 7X performance penalty.

  2. In C++11 the correct way to write sorted is:

    std::vector<std::string>
    sorted(std::vector<std::string> names)
    {
        std::sort(names.begin(), names.end());
        return names;
    }
    

    Implicit return-by-move from by-value parameters is now required.

The basic point of the article is sound: Passing by value is an important tool in the tool box. But I’ve seen too many references to this article that mistakenly throw design and testing out the window on this issue, and translate this article into “always pass by value”.

#include <cstddef>
#include <new>
#include <utility>


template <class T>
class MyVector
{
    T* begin_;
    T* end_;
    T* capacity_;


public:
    MyVector()
        : begin_(nullptr),
          end_(nullptr),
          capacity_(nullptr)
        {}


    ~MyVector()
    {
        clear();
        ::operator delete(begin_);
    }


    MyVector(std::size_t N, const T& t)
        : MyVector()
    {
        if (N > 0)
        {
            begin_ = end_ = static_cast<T*>(::operator new(N*sizeof(T)));
            capacity_ = begin_ + N;
            for (; N > 0; --N, ++end_)
                ::new(end_) T(t);
        }
    }


    MyVector(const MyVector& v)
        : MyVector()
    {
        std::size_t N = v.size();
        if (N > 0)
        {
            begin_ = end_ = static_cast<T*>(::operator new(N*sizeof(T)));
            capacity_ = begin_ + N;
            for (std::size_t i = 0; i < N; ++i, ++end_)
                ::new(end_) T(v[i]);
        }
    }


    MyVector(MyVector&& v)
        : begin_(v.begin_),
          end_(v.end_),
          capacity_(v.capacity_)
    {
        v.begin_ = nullptr;
        v.end_ = nullptr;
        v.capacity_ = nullptr;
    }


#ifndef USE_SWAP_ASSIGNMENT


    // Copy assignment that reuses existing capacity when possible --
    // exactly the optimization the by-value/swap form below gives up.
    MyVector& operator=(const MyVector& v)
    {
        if (this != &v)
        {
            std::size_t N = v.size();
            if (capacity() < N)
            {
                clear();
                ::operator delete(begin_);
                begin_ = end_ = static_cast<T*>(::operator new(N*sizeof(T)));
                capacity_ = begin_ + N;
            }
            T* p = begin_;
            const T* q = v.begin_;
            for (; p < end_ && q < v.end_; ++p, ++q)
                *p = *q;
            if (q < v.end_)
            {
                for (; q < v.end_; ++q, ++end_)
                    ::new(end_) T(*q);
            }
            else
            {
                while (end_ > p)
                {
                    --end_;
                    end_->~T();
                }
            }
        }
        return *this;
    }


    MyVector& operator=(MyVector&& v)
    {
        clear();
        swap(v);
        return *this;
    }


#else


    // By-value copy-and-swap: the parameter is always a freshly
    // allocated copy, so existing capacity is never reused.
    MyVector& operator=(MyVector v)
    {
        swap(v);
        return *this;
    }


#endif


    void clear()
    {
        while (end_ > begin_)
        {
            --end_;
            end_->~T();
        }
    }


    std::size_t size() const
        {return static_cast<std::size_t>(end_ - begin_);}
    std::size_t capacity() const
        {return static_cast<std::size_t>(capacity_ - begin_);}
    const T& operator[](std::size_t i) const
        {return begin_[i];}
    T& operator[](std::size_t i)
        {return begin_[i];}
    void swap(MyVector& v)
    {
        std::swap(begin_, v.begin_);
        std::swap(end_, v.end_);
        std::swap(capacity_, v.capacity_);
    }
};


template <class T>
inline
void
swap(MyVector<T>& x, MyVector<T>& y)
{
    x.swap(y);
}


#include <iostream>
#include <string>
#include <chrono>


int main()
{
    MyVector<std::string> v1(1000, "1234567890123456789012345678901234567890");
    MyVector<std::string> v2(1000, "1234567890123456789012345678901234567890123456789");
    typedef std::chrono::high_resolution_clock Clock;
    typedef std::chrono::duration<double, std::micro> US;
    auto t0 = Clock::now();
    v2 = v1;
    auto t1 = Clock::now();
    std::cout << US(t1-t0).count() << " microseconds\n";
}


$ clang++ -stdlib=libc++ -std=c++11 -O3 -DUSE_SWAP_ASSIGNMENT test.cpp
$ a.out
174.516 microseconds
$ a.out
180.83 microseconds
$ a.out
175.848 microseconds


$ clang++ -stdlib=libc++ -std=c++11 -O3  test.cpp
$ a.out
26.339 microseconds
$ a.out
24.179 microseconds
$ a.out
24.103 microseconds
From: http://cpp-next.com/archive/2009/08/want-speed-pass-by-value/