p1093_Formatting Text_(worth thinking)

Formatting Text
Time Limit: 1000MS Memory Limit: 10000K
Total Submissions: 717 Accepted: 147

Description

Writing e-mails is fun, but, unfortunately, they do not look very nice, mainly because not all lines have the same lengths. In this problem, your task is to write an e-mail formatting program which reformats a paragraph of an e-mail (e.g. by inserting spaces) so that, afterwards, all lines have the same length (even the last one of each paragraph).
The easiest way to perform this task would be to insert more spaces between the words in lines which are too short. But this is not the best way. Consider the following example:
****************************
This is the example you are
actually considering.
Let us assume that we want to get lines as long as the row of stars. Then, by simply inserting spaces, we would get
****************************
This  is the example you are
actually        considering.
But this looks rather odd because of the big gap in the second line. By moving the word "are" from the first to the second line, we get a better result:
****************************
This  is  the  example   you
are  actually   considering.
Of course, this has to be formalized. To do this, we assign a badness to each gap between words. The badness assigned to a gap of n spaces is (n - 1)^2. The goal of the program is to minimize the sum of all badnesses. For example, the badness of the first example is 1 + 7^2 = 50 whereas the badness of the second one is only 1 + 1 + 1 + 4 + 1 + 4 = 12.

In the output, every line has to start and to end with a word. (I.e. there cannot be a gap at the beginning or the end of a line.) The only exception to this is the following:

If a line contains only one word this word shall be put at the beginning of the line, and a badness of 500 is assigned to this line if it is shorter than it should be. (Of course, in this case, the length of the line is simply the length of the word.)

Input

The input contains a text consisting of several paragraphs. Each paragraph is preceded by a line containing a single integer n, the desired width of the paragraph (1 <= n <= 80).

Paragraphs consist of one or more lines which contain one or more words each. Words consist of characters with ASCII codes between 33 and 126, inclusive, and are separated by spaces (possibly more than one). No word will be longer than the desired width of the paragraph. The total length of all words of one paragraph will not be more than 10000 characters.

Each paragraph is terminated by exactly one blank line. There is no limit on the number of paragraphs in the input file.

The input file will be terminated by a paragraph description starting with n=0. This paragraph should not be processed.

Output

Output the same text, formatted in the way described above (processing each paragraph separately).

If there are several ways to format a paragraph with the same badness, use the following algorithm to choose which one to output: Let A and B be two solutions. Find the first gap which does not have the same length in A and B. Do not output the solution in which this gap is bigger.

Output a blank line after each paragraph.

Sample Input

28
This is the example you are
actually considering.

25
Writing e-mails is fun, and with this program,
they even look nice.

0

Sample Output

This  is  the  example   you
are  actually   considering.

Writing e-mails  is  fun,
and  with  this  program,
they  even   look   nice.

Source


Problem summary:
Insert spaces between the words of a paragraph so that every line has the given width n. A gap of k spaces between two words has badness (k-1)^2. Find a formatting that minimizes the total badness, where every line starts and ends with a word (unless a line contains only one word; if that word is shorter than n, the line gets a badness of 500).
Among formattings of minimal badness, output the one in which the earlier gaps are as small as possible.
Analysis:
1. However the text is split into lines, the number of spaces in front of each word does not depend on how many lines there are.
2. Every line must start and end with a word (the single-word line is a special case handled separately).
From these two facts we get the following: the optimal value for the case where word i ends a line (call it mn[i]) depends only on the optimal value for some earlier word ending the previous line, with all the words between that word and i sitting on the same line as i.
So the optimal value for the last word ending a line, i.e. the answer for the whole paragraph, is easy to compute; the recurrence is written out below.
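In symbols (my own notation, not from any particular solution): number the words 1..W, let mn[0] = 0, and let cost(i, j) be the minimal badness of putting words i..j on one line of width n. Then

    mn[i] = min over all j < i such that words j+1..i fit on one line of ( mn[j] + cost(j+1, i) )

and the badness of the whole paragraph is mn[W].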
3. This does not solve the problem completely: if words i..j form one line, what is that line's minimal badness?
One way is to precompute it: words i..j have to fill exactly n columns, so enumerate the number of spaces k in front of word j, and all we need is the optimal value for words i..j-1 packed into n - len[j] - k columns.
But there is a neater observation: a sum of squares is minimized when the terms are as equal as possible, so just distribute the spaces over the gaps as evenly as possible; once no two gaps differ by 2 or more, the line cost is minimal (see the sketch below).
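A minimal sketch of that line cost in C++ (the names lineCost, sumLen, gaps and width are mine; it assumes the caller only asks about word ranges that actually fit, i.e. width - sumLen >= gaps):

long long lineCost(int sumLen, int gaps, int width) {
    if (gaps == 0)                        // a single word on the line
        return sumLen < width ? 500 : 0;
    int spaces = width - sumLen;          // total spaces to distribute over the gaps
    int q = spaces / gaps, r = spaces % gaps;
    // split as evenly as possible: r gaps of q+1 spaces, the rest of q spaces
    return 1LL * r * q * q + 1LL * (gaps - r) * (q - 1) * (q - 1);
}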
With that, computing the minimal total badness is done.
What remains is constructing the particular solution the judge wants.
4. Requiring the earlier gaps to be as small as possible means the earlier words should be packed as tightly as possible, which in turn means the earlier words should occupy as few lines as possible and the later lines should carry as few words as possible. So record, for each word i, which line it ends up on in the current optimum, and when enumerating the last word of the previous line, enumerate from back to front; this guarantees that, among all placements that put i at the end of the current line, the current line carries as few words as possible, i.e. the earlier lines carry as many words as possible.
With this the required solution can be constructed; a full sketch follows.
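Below is a compact sketch of the whole approach in C++ (my own reconstruction, not the original accepted code; the back-to-front enumeration and the small-gaps-first output follow points 3 and 4 above, and corner cases may still need more care):

#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

const long long INF = 1LL << 60;

// Minimal badness of a line holding words with `gaps` gaps and `sumLen` letters.
long long lineCost(int sumLen, int gaps, int width) {
    if (gaps == 0) return sumLen < width ? 500 : 0;   // single word on the line
    int spaces = width - sumLen;
    if (spaces < gaps) return INF;                    // the words do not fit
    int q = spaces / gaps, r = spaces % gaps;
    return 1LL * r * q * q + 1LL * (gaps - r) * (q - 1) * (q - 1);
}

void formatParagraph(const vector<string>& w, int width) {
    int W = w.size();
    vector<int> pre(W + 1, 0);                        // prefix sums of word lengths
    for (int i = 0; i < W; ++i) pre[i + 1] = pre[i] + (int)w[i].size();

    vector<long long> mn(W + 1, INF);
    vector<int> from(W + 1, -1);                      // from[i]: last word of the previous line
    mn[0] = 0;
    for (int i = 1; i <= W; ++i)
        for (int j = i - 1; j >= 0; --j) {            // back to front, as in point 4
            int sumLen = pre[i] - pre[j], gaps = i - j - 1;
            if (sumLen + gaps > width) break;         // adding more words cannot help
            long long c = lineCost(sumLen, gaps, width);
            if (mn[j] + c < mn[i]) {                  // strict '<' keeps the larger j on ties
                mn[i] = mn[j] + c;
                from[i] = j;
            }
        }

    // Recover the line breaks, then print each line with the smaller gaps first.
    vector<int> breaks;
    for (int i = W; i > 0; i = from[i]) breaks.push_back(i);
    int start = 0;
    for (int b = (int)breaks.size() - 1; b >= 0; --b) {
        int end = breaks[b];                          // this line holds words [start, end)
        int gaps = end - start - 1;
        string line = w[start];
        if (gaps > 0) {
            int spaces = width - (pre[end] - pre[start]);
            int q = spaces / gaps, r = spaces % gaps;
            for (int k = 1; k < end - start; ++k) {
                int g = (k <= gaps - r) ? q : q + 1;  // small gaps first
                line += string(g, ' ') + w[start + k];
            }
        }
        cout << line << "\n";
        start = end;
    }
    cout << "\n";                                     // blank line after each paragraph
}

int main() {
    string line;
    while (getline(cin, line)) {
        if (line.empty()) continue;
        int width = atoi(line.c_str());
        if (width == 0) break;
        vector<string> words;
        while (getline(cin, line) && !line.empty()) {
            istringstream ss(line);
            for (string w; ss >> w; ) words.push_back(w);
        }
        formatParagraph(words, width);
    }
    return 0;
}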


Afterword:
1. When I first read the problem I thought of something like minimizing a variance; a day later, when the mn[M] approach finally occurred to me, I still unthinkingly went for brute-force precomputation and enumeration... sloppy thinking.
2. With no hints at all, my first instinct was to hunt for test data; the boss caught me and gave me a talking-to... It stuck with me. How could I be that lazy? Lazy thinking is scary.
3. On my own I came up with an mn[M][R] state, the optimum when the current word sits at position j of some line; unfortunately I never realized the number of lines has to be as small as possible... I clearly had not grasped the essence of the problem. Too impatient, coding as soon as I thought I "understood" it.
4. After discussing it with the boss I just assumed it was right and never thought it through... got Presentation Error who knows how many times, convinced it was some small detail. If I keep being this lazy I'll turn into sz!