How to Write an Effective Design Document

http://blog.slickedit.com/2007/05/how-to-write-an-effective-design-document/

Day by day, programmers are able to get more done in less time. With today’s high-level languages, development environments, tools and the “rapid application development” mindset, both programmers and managers have become accustomed to extremely fast development cycles. Programmers are now more inclined to jump directly into development, fearing that every hour they are not writing code will result in an hour of overtime the weekend before the deadline.

The process of designing before coding is becoming outdated. Documenting designs is becoming even rarer. Many developers have never written a design document and cringe at the idea of doing so. Many who are required to write one typically generate a pile of interaction and class diagrams that rarely express their thought process during the design phase. This article will discuss how to write an effective design document concisely, with no special tools and without needing to know UML. It will also discuss why a well-written design document is one of the most valuable tools a developer can have when entering a new project.

Why write a design document?

A design document is a way to communicate to others what your design decisions are and why those decisions are good ones. Don’t worry if your design is not UML compliant, and don’t worry if you didn’t use a special modeling tool to create it. The biggest factor in whether your design document is good is whether it clearly explains your intentions.

This presents a problem, however: in order to convey design decisions, you have to consider the audience you are writing for. A peer developer will understand why a well-crafted class abstraction is a good design; your manager probably will not. Because your peer developers and your manager have different concepts of what makes a design good, you really need two design documents: one for peer developers and one for managers. Each document serves a different and equally valuable purpose as you begin your project development.

If this seems like too much work, it’s not. This article will show you how to do this through documentation reuse.

What makes a good design?

A design will typically be considered good if it fulfills the requirements in a meaningful way. If any aspect of the design cannot be justified, it is probably worth reevaluating. Many programmers try to incorporate design patterns into their work and, in doing so, often add unnecessary complexity. You should be able to list at least one compelling reason, tied to the requirements, for every design decision, and that reason must then be documented. If you can’t come up with a clear reason for a design decision, it is probably not adding value.

Diagrams are a great tool for visualizing your design, but they cannot convey the motivation behind your design decisions. That is why it is so important to let diagrams supplement your design document, not be your design document.

In addition, it is extremely important to document any benefits that result from a design decision, so that readers of your document understand what value they can gain. Likewise, any associated risks must also be documented. More often than not, other programmers have faced the same risks and may have helpful pointers or solutions that you had not thought about. By listing these items, you also get others thinking about what the potential risks could be. Teammates will often be able to see potential pitfalls that you missed when you created your design. It is much easier to rearrange some boxes in a diagram than it is to rewrite hundreds of lines of code when an assumption fails or when you hit an unforeseen snag during coding. A good design document minimizes unexpected complications by addressing them before the code is written.

A design document also provides you, your manager and your team with a common vocabulary for talking about the project. It can be a powerful tool for a manager because it gives them a view into the project that they normally lack the technical expertise to see. By listing the benefits, you give your manager tangible reasons why your design is sound. By documenting the risks of your design before development, you pass the responsibility for that risk to your manager, which is where it belongs.

Finally, the design document is a written contract between you, your manager and your team. When you document your assumptions, decisions, risks, etc., you give others a chance to say, “Yes, this is exactly what I expect.” Once your document passes that stage, it becomes a baseline for limiting changes in project scope. Requirements will still change sometimes, but with a baseline document you can show that no change in scope is due to a misunderstanding of the requirements.

Writing for a Peer Developer

The goal of a peer developer design document is to make sure that your ideas are valid and that your approach works with what others are doing. When developers don’t communicate their plans, disaster is sure to strike when modules or classes begin to interact. The following items describe a general guideline for writing this type of document:

Section 1 – State the purpose of your project/sub-system: In this section, write a few paragraphs that describe what the project or sub-system does. What is the problem it is trying to solve? Why does it need to exist? Who will use it? By answering these questions, you establish the scope of your design. If you find it hard to write a few paragraphs in this section, then you probably don’t understand the domain as much as you should. If you can’t fit your description within a few paragraphs, then perhaps the scope is too large. Use this section as a tool to verify that the scope of your design is reasonable.

Section 2 – Define the high-level entities in your design: High-level entities are objects, or groups of objects, that constitute the major constructs of your design. Good examples of entities are a data access layer, a controller object, a set of business objects, etc. Figure 1 shows an example of a high-level entity diagram.

Figure 1

In this section, explain in a few sentences what each entity does. The descriptions don’t have to be verbose, just enough to explain what each block’s purpose is. Be sure to describe your reasoning for defining the entities in your diagram and what their roles are.

Section 3 – For each entity, define the low level design: This section is where your objects and object relationships are defined. For each object (or set of objects) define the following:

Usage

Describe in a paragraph how the object is used and what function it serves. If an object will interface with an external object or system, it is a good idea to show the interface for the object. Most importantly, you must again describe your thought process for defining the object as you did. List the benefits and risks. If an object provides an encapsulation, describe in a sentence why the encapsulation adds value. Use your descriptions to give meaning to the diagrams. They don’t have to be verbose, just enough to get the point across.
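For example, a minimal interface sketch like the one below is often all a peer needs in order to review the design. This is an illustration only; the class and method names are hypothetical and are not taken from the article’s figures.

    #include <string>

    // Hypothetical public surface of a data access entity. Documenting
    // just this interface lets other subsystems depend on it safely,
    // without caring how storage is implemented behind it.
    class IDataAccess {
    public:
        virtual ~IDataAccess() = default;

        // Load one record by key; returns an empty string if not found.
        virtual std::string Load(const std::string &key) = 0;

        // Persist a record; returns false on storage failure.
        virtual bool Save(const std::string &key,
                          const std::string &value) = 0;
    };

A few sentences explaining why the interface looks this way (for instance, why storage details are hidden behind it) belong right next to the listing.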

Configuration

If your object needs any special configuration or initialization, this is a good place to describe it. If not, this section can be left out.

Model

Figure 2 shows an example of a class model that supplements the System Security entity from Figure 1. It is not perfect UML, but it has some aspects of UML. Most importantly, it describes the design.

Figure 2

Don’t worry about perfection in your models, but be sure to describe exactly what is going on in the diagram. Here, two concrete security objects derive from a base security object, and a security factory will create one or the other for a client depending on the security model of the system.
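In code, the arrangement described above might look roughly like the sketch below. This is only an illustration of the factory structure; the concrete class names and the Validate method are invented here, since the figure does not dictate them.

    #include <memory>
    #include <string>

    // Base security object: the interface every security model implements.
    class Security {
    public:
        virtual ~Security() = default;
        virtual bool Validate(const std::string &user,
                              const std::string &password) = 0;
    };

    // Two concrete security objects, one per security model
    // (hypothetical names for illustration).
    class LocalSecurity : public Security {
    public:
        bool Validate(const std::string &user,
                      const std::string &password) override {
            // e.g., check a local user table
            return !user.empty() && !password.empty();
        }
    };

    class RemoteSecurity : public Security {
    public:
        bool Validate(const std::string &user,
                      const std::string &password) override {
            // e.g., defer to a remote security server
            return !user.empty() && !password.empty();
        }
    };

    // The factory picks the concrete object for the client.
    class SecurityFactory {
    public:
        enum Model { LOCAL, REMOTE };

        static std::unique_ptr<Security> Create(Model model) {
            if (model == LOCAL)
                return std::unique_ptr<Security>(new LocalSecurity());
            return std::unique_ptr<Security>(new RemoteSecurity());
        }
    };

The justification to record alongside it is that the factory keeps clients ignorant of which security model is active, which is exactly the kind of reasoning Section 3 asks you to write down.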

Interaction

This is also a good section for interaction diagrams. An interaction diagram shows how a set of objects or entities communicate with each other to perform a complex task. Figure 3 shows an example of an interaction diagram that illustrates how a user might log in. It uses objects from the various entities shown in Figure 1.

Figure 3

Again, this diagram is not perfect UML, but it explains the communication sequence to accomplish a complex task. Interaction diagrams are most useful when you want to diagram how an object in your system will communicate with an object in another subsystem. This type of diagram will let the other developer verify that the interaction is correct.
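To make the idea concrete, the login sequence in a diagram like Figure 3 might reduce to a call chain like the sketch below. It reuses the hypothetical Security and SecurityFactory classes from the previous sketch; LoginController, DataAccessLayer and UserProfile are likewise invented for illustration.

    #include <memory>
    #include <string>

    // Hypothetical stand-ins for other entities from Figure 1.
    struct UserProfile { std::string name; };

    class DataAccessLayer {
    public:
        UserProfile LoadProfile(const std::string &user) {
            UserProfile p;
            p.name = user;  // stubbed lookup
            return p;
        }
    };

    // The controller's login path mirrors the message sequence the
    // interaction diagram shows: controller -> security factory ->
    // security object -> data access layer.
    class LoginController {
    public:
        bool Login(const std::string &user, const std::string &password) {
            std::unique_ptr<Security> security =
                SecurityFactory::Create(SecurityFactory::LOCAL);
            if (!security->Validate(user, password))
                return false;  // credentials rejected
            m_profile = m_dataAccess.LoadProfile(user);  // load on success
            return true;
        }

    private:
        DataAccessLayer m_dataAccess;
        UserProfile     m_profile;
    };

Another developer reading either the diagram or the sketch can immediately verify that the calls crossing their subsystem’s boundary are correct.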

Section 4 – Benefits, assumptions, risks/issues: In this section, make a list of 5-6 top benefits of the design, a list of ALL known risks/issues and a list of ALL assumptions. Some of this may simply be rehashing what you wrote in a previous section of the document. What’s important is getting all of these items into one section so that the reader doesn’t have to read the whole document to understand what the benefits, risks and assumptions are.

Never remove anything from this section! As risks become non-risks, document that they are now non-risks and why they became non-risks. Never erase them from the document. The same holds true for assumptions. You should be able to look at this section and know instantly what the current risks are to your design.
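For example, an entry might start as “Risk: the authentication server may not handle our expected load (open)” and later become “Risk: the authentication server may not handle our expected load (resolved: load testing showed ample headroom).” The history stays in the document; only the status changes.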

Writing for a Manager

The goal of a design document for your manager is to make sure that your manager understands what the main entities of the system are, what the benefits are and, most importantly, what the risks are. The document is your chance to show that you understand the requirements and that you have come up with a plan to meet those requirements.

If you have written the peer developer document well, then writing the manager’s document is simple: it is made up of just Sections 1, 2 and 4. By dividing the peer developer document as described above, the parts that are typically not meaningful to a manager are contained in a single section (Section 3) that can simply be removed.

Conclusion

The hardest part of writing a design document has nothing to do with the writing. The difficult part is working through a logical design before you get to coding. Once you have a vision of how the objects and entities are arranged, writing the details is easy. It should not require anything more than a word processor and a simple shape-drawing program. The positive difference that spending a week on this task can make is enormous in the end. As the adage goes, “If you fail to plan, then you plan to fail.”
