Assembly line / Pipeline

This article discusses a data-refresh scheme for large cache files. Given the need to update the address information of about 80M geographic coordinate points, two solutions are proposed: point-by-point updating and batch processing. After comparing the two solutions, in particular their support for interrupt-and-resume and their processing efficiency, the more efficient batch-processing approach is chosen.

Problem Definition

Cache Data Refresh

 

  • We have 576 cache files. Each file (in fact a file pair: one index file and one data file) contains a point-to-address mapping, (latitude, longitude) -> Address, i.e., the result of reverse geocoding.
  • Some of the addresses may be out of date; we now need to get updated addresses for those points.

Requirement

  1. There is no date info in a cache entry [point, address], so we refresh all points in a cache file, that is, we regenerate the cache file as a whole.
  2. The refresh can be interrupted and resumed from a checkpoint - we do not want to restart from the beginning.
  3. The total number of points is about 80M. We need to finish within a given time, or at least the refresh duration must be controllable.
  4. The refresh is done online. After the refresh finishes, the server switches to the new cache file and backs up the old one.
  5. There are several peer servers, each holding the same data (576 cache files); we want all servers to get the updated addresses at the same time.

Solution1

  1. For each cache file
    • For each point in the cache file
      • read the point from the cache file - R
      • get the updated address - G
      • write the new address into the new cache file - W

Limits:

  • If the refresh is interrupted, a cache file must restart from the very beginning. Suppose a cache file contains 5M points and the server crashes while refreshing the last of them; then all the effort spent on the previous 4.99M points is lost.
  • The whole process runs in a single thread, so its rate is limited by the bottleneck step: getting the updated address.
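Solution 1's sequential R-G-W loop can be sketched as follows; `lookup_address` and the in-memory point list are hypothetical stand-ins for the real reverse-geocoding service and cache-file I/O.

```python
def lookup_address(point):
    """Placeholder for the slow 'G' step (reverse geocoding)."""
    lat, lon = point
    return f"addr({lat},{lon})"

def refresh_cache_sequential(points):
    """R-G-W one point at a time; returns the new (point, address) entries."""
    new_entries = []
    for point in points:                    # R: read a point from the cache file
        addr = lookup_address(point)        # G: get the updated address (bottleneck)
        new_entries.append((point, addr))   # W: write into the new cache file
    return new_entries

entries = refresh_cache_sequential([(31.23, 121.47), (39.90, 116.40)])
```

Every point waits for the full R-G-W round trip of the point before it, which is exactly why a crash near the end loses all prior work in that file.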

Solution2

Instead of running 'R-G-W' point by point, we first scan all the points out of the cache file in one pass and spread them across 4096 point files, so each point file contains a certain number of points. Then several threads repeat the 'R-G' process over the point files, while one thread repeats the 'W' process. Once a point file is finished, we delete it and move on to the next file.

  1. For each cache file
    • For each point in the cache file
      • read the point from the cache file
      • write the point into a point file, starting a new file every 80M/4096 points
  2. For each point file
    • For each point in the point file - multiple threads, each thread handling one file
      • read the point from the point file
      • get the updated address
      • add the new address to the write thread's queue
    • Delete the point file once all its points are finished
  3. For each point in the write queue
    • Write the new address into the new cache file

Benefit:

  • If the refresh is interrupted, at most one point file's worth of work is lost.
  • Multiple threads run the bottleneck function ('G'), which speeds up the whole refresh. The refresh rate is controlled by the thread count.
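A minimal sketch of Solution 2's thread layout, using in-memory lists in place of real point files: several worker threads run the R-G stage, and a single writer thread drains the W queue. `lookup_address` and the data shapes are illustrative assumptions, not the real interfaces.

```python
import queue
import threading

def lookup_address(point):                  # hypothetical slow 'G' step
    lat, lon = point
    return f"addr({lat},{lon})"

def refresh_with_pipeline(point_files, num_workers=4):
    work = queue.Queue()                    # point files waiting for R-G
    for pf in point_files:
        work.put(pf)
    write_q = queue.Queue()                 # feeds the single 'W' thread
    new_cache = []

    def worker():                           # R-G stage, one point file at a time
        while True:
            try:
                pf = work.get_nowait()
            except queue.Empty:
                return
            for point in pf:                               # R
                write_q.put((point, lookup_address(point)))  # G
            # real version: delete the point file here (checkpoint)

    def writer():                           # W stage, single thread
        while True:
            item = write_q.get()
            if item is None:                # sentinel: no more work
                return
            new_cache.append(item)

    wt = threading.Thread(target=writer)
    wt.start()
    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    write_q.put(None)
    wt.join()
    return new_cache

cache = refresh_with_pipeline([[(1, 2), (3, 4)], [(5, 6)]], num_workers=2)
```

Keeping 'W' on one thread preserves sequential writes to the new cache file, while the deletable point files give the cheap checkpoint that bounds the loss on a crash.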

Postmortem

 

1. How to handle batch process

 

1) Break down (split) and reassemble to get pipelining

http://en.wikipedia.org/wiki/Assembly_line

Consider the assembly of a car: assume that certain steps in the assembly line are to install the engine, install the hood, and install the wheels (in that order, with arbitrary interstitial steps); only one of these steps can be done at a time. In traditional production, only one car would be assembled at a time. If engine installation takes 20 minutes, hood installation takes 5 minutes, and wheel installation takes 10 minutes, then a car can be produced every 35 minutes.


In an assembly line, car assembly is split between several stations, all working simultaneously. When one station is finished with a car, it passes it on to the next. By having three stations, a total of three different cars can be operated on at the same time, each one at a different stage of its assembly.

 


After finishing its work on the first car, the engine installation crew can begin working on the second car. While the engine installation crew works on the second car, the first car can be moved to the hood station and fitted with a hood, then to the wheels station and be fitted with wheels. After the engine has been installed on the second car, the second car moves to the hood assembly. At the same time, the third car moves to the engine assembly. When the third car’s engine has been mounted, it then can be moved to the hood station; meanwhile, subsequent cars (if any) can be moved to the engine installation station.


Assuming no loss of time when moving a car from one station to another, the longest stage on the assembly line determines the throughput (20 minutes for the engine installation) so a car can be produced every 20 minutes, once the first car taking 35 minutes has been produced.
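The numbers above can be checked directly: sequential production pays the full 35 minutes per car, while the pipelined line pays 35 minutes once and then the longest stage (20 minutes) per additional car.

```python
stage_minutes = [20, 5, 10]   # engine, hood, wheels

def pipelined_total(n_cars, stages):
    """Time to finish n cars on a pipelined line, assuming no transfer delay."""
    return sum(stages) + (n_cars - 1) * max(stages)

sequential = 100 * sum(stage_minutes)            # 100 cars, one at a time: 3500 min
pipelined = pipelined_total(100, stage_minutes)  # 35 + 99 * 20 = 2015 min
```

The steady-state throughput is governed by `max(stages)`, which is why Solution 2 attacks the 'G' step with multiple threads rather than speeding up 'R' or 'W'.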


 

http://en.wikipedia.org/wiki/Pipeline_(computing)

http://en.wikipedia.org/wiki/Instruction_pipeline

 

 

2) Find the bottleneck at a proper granularity

 

3) Intermediate results/state may be helpful - in the case above, the point files.

 

 

2. Cache Design

Consider cache expiry: generally, date info or a lifetime (TTL) is stored together with each cache entry.
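A sketch of what such an entry could look like if the cache were redesigned to store its creation time; the field names are illustrative, not the actual cache-file layout.

```python
import time
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    point: tuple          # (latitude, longitude)
    address: str
    created_at: float = field(default_factory=time.time)

    def is_expired(self, ttl_seconds):
        """True once the entry has outlived its time-to-live."""
        return time.time() - self.created_at > ttl_seconds

e = CacheEntry((31.23, 121.47), "Some Road, Shanghai")
fresh = not e.is_expired(30 * 24 * 3600)   # 30-day TTL
```

With a timestamp per entry, a refresh could target only stale entries instead of regenerating every cache file in whole.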

 

3. Quadtrees

Why not use a database to store the entries?

 

Please see Quadtrees:

1) C++ Implementation 

2) Use Case
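As a rough illustration of the quadtree idea (in Python rather than the C++ implementation referenced above): each node covers a rectangle and splits into four children once it holds more than a few points. Capacity and boundary handling are simplified assumptions.

```python
class QuadTree:
    CAPACITY = 4                         # points per leaf before splitting

    def __init__(self, x0, y0, x1, y1):
        self.bounds = (x0, y0, x1, y1)   # half-open rectangle [x0,x1) x [y0,y1)
        self.points = []
        self.children = None             # NW, NE, SW, SE after a split

    def _contains(self, x, y):
        x0, y0, x1, y1 = self.bounds
        return x0 <= x < x1 and y0 <= y < y1

    def insert(self, x, y):
        if not self._contains(x, y):
            return False
        if self.children is None:
            if len(self.points) < self.CAPACITY:
                self.points.append((x, y))
                return True
            self._split()
        return any(c.insert(x, y) for c in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [QuadTree(x0, my, mx, y1), QuadTree(mx, my, x1, y1),
                         QuadTree(x0, y0, mx, my), QuadTree(mx, y0, x1, my)]
        for px, py in self.points:       # push existing points down a level
            any(c.insert(px, py) for c in self.children)
        self.points = []

    def count(self):
        n = len(self.points)
        if self.children:
            n += sum(c.count() for c in self.children)
        return n

qt = QuadTree(-180, -90, 180, 90)        # (lon, lat) over the whole globe
for p in [(121.47, 31.23), (116.40, 39.90), (0.0, 0.0)]:
    qt.insert(*p)
```

The spatial subdivision keeps nearby points in the same subtree, which suits point-to-address lookups better than a flat database table keyed on raw coordinates.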
