p1093_Formatting Text_(worth thinking)

Formatting Text
Time Limit: 1000MS Memory Limit: 10000K
Total Submissions: 717 Accepted: 147

Description

Writing e-mails is fun, but, unfortunately, they do not look very nice, mainly because not all lines have the same lengths. In this problem, your task is to write an e-mail formatting program which reformats a paragraph of an e-mail (e.g. by inserting spaces) so that, afterwards, all lines have the same length (even the last one of each paragraph).
The easiest way to perform this task would be to insert more spaces between the words in lines which are too short. But this is not the best way. Consider the following example:
****************************
This is the example you are
actually considering.
Let us assume that we want to get lines as long as the row of stars. Then, by simply inserting spaces, we would get
****************************
This  is the example you are
actually        considering.
But this looks rather odd because of the big gap in the second line. By moving the word "are" from the first to the second line, we get a better result:
****************************
This  is  the  example   you
are  actually   considering.
Of course, this has to be formalized. To do this, we assign a badness to each gap between words. The badness assigned to a gap of n spaces is (n - 1)^2. The goal of the program is to minimize the sum of all badnesses. For example, the badness of the first example is 1 + 7^2 = 50 whereas the badness of the second one is only 1 + 1 + 1 + 4 + 1 + 4 = 12.

In the output, every line has to start and to end with a word. (I.e. there cannot be a gap at the beginning or the end of a line.) The only exception to this is the following:

If a line contains only one word this word shall be put at the beginning of the line, and a badness of 500 is assigned to this line if it is shorter than it should be. (Of course, in this case, the length of the line is simply the length of the word.)

Input

The input contains a text consisting of several paragraphs. Each paragraph is preceded by a line containing a single integer n, the desired width of the paragraph (1 <= n <= 80).

Paragraphs consist of one or more lines which contain one or more words each. Words consist of characters with ASCII codes between 33 and 126, inclusive, and are separated by spaces (possibly more than one). No word will be longer than the desired width of the paragraph. The total length of all words of one paragraph will not be more than 10000 characters.

Each paragraph is terminated by exactly one blank line. There is no limit on the number of paragraphs in the input file.

The input file will be terminated by a paragraph description starting with n=0. This paragraph should not be processed.

Output

Output the same text, formatted in the way described above (processing each paragraph separately).

If there are several ways to format a paragraph with the same badness, use the following algorithm to choose which one to output: Let A and B be two solutions. Find the first gap that does not have the same length in A and B. Do not output the solution in which this gap is bigger.

Output a blank line after each paragraph.

Sample Input

28
This is the example you are
actually considering.

25
Writing e-mails is fun, and with this program,
they even look nice.

0

Sample Output

This  is  the  example   you
are  actually   considering.

Writing e-mails  is  fun,
and  with  this  program,
they  even   look   nice.

Source


Problem summary:
Insert spaces between the words of a paragraph so that every line has exactly the given length n. A gap of k spaces between two adjacent words has badness (k-1)^2. Find a formatting that minimizes the total badness, with every line beginning and ending with a word (the only exception is a line containing a single word: if that word is shorter than n, the line gets a badness of 500).
Among all formattings with minimum badness, output the one in which the earlier gaps are as small as possible.
Analysis:
1. However the paragraph is split into lines, the badness of a line depends only on which words it contains, not on how many lines come before or after it.
2. Every line begins and ends with a word (the single-word line is handled as a special case).
From these two observations:
The minimum badness of formatting a prefix so that some word i ends a line (call it mn[i]) depends only on the minimum badness of ending a line at some earlier word; the words between that word and word i then sit on one line together with word i.
This immediately gives the minimum badness of the whole paragraph: it is the optimal value for ending a line at the last word.
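In symbols (my own notation, not from the original write-up): let cost(j+1, i) be the minimum badness of placing words j+1..i on a single line of width n. Then

mn[i] = min over 0 <= j < i of ( mn[j] + cost(j+1, i) ),    mn[0] = 0,

and mn[m], where m is the last word, is the minimum badness of the paragraph. How to compute cost(j+1, i) is the subproblem discussed next.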
3. One subproblem remains: if words i..j form a single line, what is its minimum badness?
This is also easy with preprocessing: if words i..j must occupy exactly n columns, enumerate the number of spaces k in front of word j; then we only need the optimal value of fitting words i..j-1 into n - len[j] - k columns.
But there is a simpler observation: a sum of squares is minimized when the terms are as equal as possible, so just distribute the spaces as evenly as possible over the gaps; once any two gaps differ by at most 1, the line badness is minimal.
With this, computing the minimum badness is done (see the sketch below).
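A minimal C++ sketch of the two ideas above: the even-split line cost and the mn[] recurrence. It is my own illustration, not the original author's code; the names lineCost, solve, wordLen and INF are mine.

#include <algorithm>
#include <vector>

const long long INF = 1e18;                         // "impossible" marker

// Minimum badness of putting words l..r (0-based, inclusive) on one line of width n.
// Spreading the spare spaces as evenly as possible over the gaps minimizes the
// sum of (gap - 1)^2; a single short word costs 500 as the statement requires.
long long lineCost(const std::vector<int>& wordLen, int l, int r, int n) {
    long long chars = 0;
    for (int k = l; k <= r; ++k) chars += wordLen[k];
    int gaps = r - l;                               // gaps between consecutive words
    if (gaps == 0)                                  // single-word line
        return chars > n ? INF : (chars < n ? 500 : 0);
    long long spaces = n - chars;                   // spaces to distribute
    if (spaces < gaps) return INF;                  // words do not fit on this line
    long long q = spaces / gaps, rem = spaces % gaps;
    // rem gaps get q+1 spaces, the remaining gaps - rem gaps get q spaces
    return rem * q * q + (gaps - rem) * (q - 1) * (q - 1);
}

// mn[i] = minimum badness of formatting the first i words so that word i-1 ends a line.
// Every prefix is formattable (no word is longer than n), so mn[i] stays finite.
std::vector<long long> solve(const std::vector<int>& wordLen, int n) {
    int m = (int)wordLen.size();
    std::vector<long long> mn(m + 1, INF);
    mn[0] = 0;
    for (int i = 1; i <= m; ++i)
        for (int j = i - 1; j >= 0; --j) {          // word j-1 ends the previous line
            long long c = lineCost(wordLen, j, i - 1, n);
            if (c >= INF) break;                    // adding more words only makes the line longer
            mn[i] = std::min(mn[i], mn[j] + c);
        }
    return mn;                                      // mn[m] is the paragraph's minimum badness
}

With n <= 80 and at most 10000 characters of words per paragraph, the double loop is fast enough, since the inner loop breaks as soon as a line can no longer fit.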
Now consider how to construct the particular solution that is required.
4. Requiring the earlier gaps to be as small as possible means the earlier words must be packed as tightly as possible, i.e. the earlier lines should hold as many words as possible and the later lines as few as possible. So, besides the optimal value, record for each word where its line break came from, and when enumerating the last word of the previous line, go from back to front: among all optimal choices this keeps the current line as short (in words) as possible, which in turn keeps the earlier lines as full as possible.
This yields the required formatting (a sketch of the reconstruction follows below).
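Continuing the sketch (again my own illustration, assuming the lineCost and INF from the block above): record a predecessor pre[i] while filling mn[i], enumerate the previous line end from back to front as point 4 suggests, and inside each line give the extra spaces to the rightmost gaps so that earlier gaps stay smaller.

#include <cstdio>
#include <string>
#include <vector>

// Reconstructs and prints one paragraph. pre[i] is the chosen end of the previous
// line; updating only on a strict improvement while j runs from large to small keeps
// the largest optimal j, i.e. the fewest words on the current (later) line, which is
// how the post resolves the tie-breaking between equally bad formattings.
void formatParagraph(const std::vector<std::string>& words, int n) {
    int m = (int)words.size();
    std::vector<int> wordLen(m);
    for (int k = 0; k < m; ++k) wordLen[k] = (int)words[k].size();

    std::vector<long long> mn(m + 1, INF);
    std::vector<int> pre(m + 1, 0);
    mn[0] = 0;
    for (int i = 1; i <= m; ++i)
        for (int j = i - 1; j >= 0; --j) {              // back to front, as described above
            long long c = lineCost(wordLen, j, i - 1, n);
            if (c >= INF) break;
            if (mn[j] + c < mn[i]) { mn[i] = mn[j] + c; pre[i] = j; }
        }

    // Walk the predecessors from the back, then print the lines front to back.
    std::vector<int> ends;
    for (int i = m; i > 0; i = pre[i]) ends.push_back(i);
    for (int t = (int)ends.size() - 1; t >= 0; --t) {
        int r = ends[t];                                // one past the last word of this line
        int l = (t + 1 < (int)ends.size()) ? ends[t + 1] : 0;
        long long chars = 0;
        for (int k = l; k < r; ++k) chars += wordLen[k];
        int gaps = r - l - 1;
        std::string line = words[l];
        if (gaps > 0) {
            long long spaces = n - chars, q = spaces / gaps, rem = spaces % gaps;
            for (int g = 1; g <= gaps; ++g) {
                // the last rem gaps get q+1 spaces, so the earlier gaps stay smaller
                long long w = (g > gaps - rem) ? q + 1 : q;
                line += std::string((size_t)w, ' ') + words[l + g];
            }
        }
        std::printf("%s\n", line.c_str());
    }
}

I have not proved here that this greedy choice of pre[] matches the judge's tie-breaking exactly; it simply follows the argument in point 4 above.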


Postscript:
1. When I first read the problem I immediately thought of something like minimizing variance, yet a day later, when the mn[M] idea came to me, I still mindlessly wanted to precompute everything by enumeration... my thinking was sloppy.
2. Without any hint I went straight for the test data, got caught by the boss and was given a talking-to... it made a deep impression. How could I be so lazy? Lazy thinking is frightening.
3. On my own I came up with an mn[M][R] state, the optimal value when the current word sits at the j-th position of some line; unfortunately I never realized that the earlier lines also have to be as few as possible... I clearly had not grasped the essence of the problem. I was impatient and started coding as soon as I thought I "understood" it.
4. After discussing it with the boss I just assumed it was right and never thought it through... I got Presentation Error who knows how many times, all the while believing it was some small detail... if I stay this lazy I will turn into sz!