GNU Parallel in Caffe

This article shows how to use GNU Parallel to simplify parallel task processing, including common data-science chores such as batch image processing and parameter grid searches. Examples demonstrate how to parallelize tasks easily on a single machine or across several machines.

GNU Parallel

I was reading the ImageNet tutorial for Caffe (a deep learning framework), in which they need to resize a large number of images. It struck me that they might not be aware of GNU Parallel, since it is a great tool for this task. I recommend it to any data scientist out there: it is simple to use and, like many other GNU tools, there is a good chance it is already on your computer (if not, apt-get install parallel on Debian).

In the write-up, the author says he used his own MapReduce framework to do the resizing, but it can also be done sequentially as:

for name in *.jpeg; do
    convert -resize 256x256\! "$name" "$name"
done

Instead of this sequential approach, you can run it in parallel with even less typing:

parallel convert -resize 256x256\! {} {} ::: *.jpeg

GNU Parallel will insert each filename at {} to form a command. Multiple commands will execute concurrently if you have a multicore computer.
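
If you want to check which commands Parallel will construct before actually running anything, the --dry-run flag prints them without executing them:

parallel --dry-run convert -resize 256x256\! {} {} ::: *.jpeg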

If you have ever been tempted to do this kind of parallelization by adding & at the end of each command in the for loop, then Parallel is definitely for you. Adding & introduces two problems that Parallel solves: (1) you don't know when all of the commands are done and there is no easy way to join them, and (2) every command starts at once, while Parallel schedules your tasks and runs only as many in parallel as your computer can handle.
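
As a minimal sketch of the difference (using the same resize task as above):

# Backgrounding with &: every file gets its own process immediately,
# and you have to remember to wait for all of them yourself.
for name in *.jpeg; do
    convert -resize 256x256\! "$name" "$name" &
done
wait

# Parallel: same result, but it only runs as many jobs at a time
# as your machine can handle.
parallel convert -resize 256x256\! {} {} ::: *.jpeg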

Basics

Parallel can also take input from the pipe, in which case it is similar to xargs:

ls *.jpeg | parallel mv {} {.}-old.jpeg

This renames every JPEG file in the directory from name.jpeg to name-old.jpeg. The {.} is similar to {}, except it removes the extension. There are many replacement strings like this:

parallel convert -resize 256x256\! {} resized/{/} ::: images/*.jpeg

This resizes all the JPEG files inside the folder images and places the output in the folder resized. The replacement string {/} extracts the filename and is thus similar to the command basename. For this example we went back to the ::: style input, which in many cases is preferable. For instance, it can be used several times to form a product of the inputs:

parallel "echo {1}: {2}" ::: A B C D ::: {1..8}

Note how we now used {1} and {2} to refer to the inputs. We also quoted the command, which is optional and can make things clearer (if you want to use pipes or redirection inside your command, it is required). Using multiple inputs is great for doing grid searches over parameters. However, let's say we don't want all combinations in the product and instead want to specify each pair of inputs manually. First create a file with the input and name it input.txt:

A 10
B 20
C 10

Now, use --colsep to specify the delimiter:

parallel --colsep=' ' "echo {1}: {2}" < input.txt
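
As a sketch, the same idea can drive a small parameter sweep, with each run writing its output to its own file (./experiment here stands in for whatever program you are testing):

parallel --colsep=' ' "./experiment {1} {2} > result-{1}-{2}.txt" < input.txt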

If you did this to test a variety of parameters, you might find it easier to create a file, commands.sh, with all the commands written out:

./experiment 10.0 1.5 > exp1.txt
./experiment 20.0 1.5 --extra-param 3.0 > exp2.txt

Now run them in parallel by:

parallel < commands.sh          # OR
parallel :::: commands.sh

The latter is a newer syntax (note that it has four colons), which again I prefer since it can be strung together multiple times and you can freely mix ::: and ::::.
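
For instance, a command template can take its first argument from a file and its second from the command line (params.txt is a hypothetical file with one parameter value per line; {#} is the job's sequence number):

parallel "./experiment {1} {2} > out-{#}.txt" :::: params.txt ::: 1.5 3.0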

Multiple computers using SSH

Parallel can also be used to parallelize across multiple computers. Let's say you have SSH access to the hostnames or SSH aliases node1 and node2 without being prompted for a password. Now you can tell Parallel to distribute the job across both nodes using the -S option:

parallel -S node1,node2 -j8 convert -resize 256x256\! {} {} ::: *.jpeg

You can refer to the local computer as : (e.g. do -S :,node1,node2 to include the current computer). I also added -j8 to specify that I want each node to run 8 jobs concurrently. You can try leaving this out, but Parallel could have a hard time automatically determining how many jobs to use for each node.

We assumed in this example that the files already existed on the other nodes (for instance through NFS). However, Parallel can also transfer the files to the worker nodes and transfer the results back by adding --trc {}.
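
For example, a sketch of the same resize job where the input files are transferred to the nodes, the resized files are returned, and the temporary copies are cleaned up afterwards:

parallel -S node1,node2 --trc {} convert -resize 256x256\! {} {} ::: *.jpeg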

More information

For more information I recommend the GNU Parallel man page (man parallel) and the official tutorial (man parallel_tutorial).

