2019/07/14_NGS Data Analysis Course (Harvard Chan Bioinformatics Core)_6

本文链接：https://blog.youkuaiyun.com/uqyuan0515/article/details/95871793

#Shell scripts

Shell scripts are text files that contain commands we want to run.

As with any file, you can give a shell script any name and usually have the extension .sh.

用vim 创建一个新的文件：

$ vim listing.sh

在此文件里输入几行命令：

echo "Your current working directory is:"

pwd

echo "These are the contents of this directory:"

ls -l

Exit vim and save the file. Now let’s run the new script we have created. To run a shell script you usually use the bash or sh command.

$ sh listing.sh

#Bash variables

variables 变量是许多编程语言共享的一个通用概念。变量本质上是某些信息的符号/临时名称，或对某些信息的引用。变量类似于“bucket”，在那里信息可以存储、维护和修改，而不需要太多麻烦。

Let’s start with a simple variable that has a single number stored in it:

$ num=25

How do we know that we actually created the bash variable? We can use the echo command to print to terminal:

$ echo num

when trying to retrieve the value stored in the variable, we explicitly use a $ in front of it: 如果想得到的是变量里储存的值就不要忘了在变量名前加$

$ echo $num

Variables can also store a string of character values. In the example below, we define a variable or a ‘bucket’ called file. We will put a filename Mov10_oe_1.subset.fq as the value inside the bucket.

$ file=Mov10_oe_1.subset.fq

Once you press return, you should be back at the command prompt. Let’s check what’s stored inside file, but first move into the raw_fastq directory::

$ cd ~/unix_lesson/raw_fastq
$ echo $file

Let’s try another command using the variable that we have created. We can also count the number of lines in Mov10_oe_1.subset.fq by referencing the file variable:

$ wc -l $file

Variables can store more than just a single value.

#Loops 循环

what if you want to run a sequence of multiple commands, on multiple files? This is where loop come in handy!

The structure or the syntax of (for) loops in bash is as follows:

for (variable_name) in (list)

do

(command1 $variable_name)

.

.

done

即循环运行list里的变量，从do 到done 循环运行。这种语法结构是不变的。

for x in *.fq

do

echo $x

wc -l $x

done

上面这个命令的意思是： it writes to the terminal (echo) the name of the file and the number of lines (wc -l) for each files that end in .fq in the current directory. The output is almost identical to what we had before. 向终端echo写入当前目录中以.fq结尾的每个文件的文件名和行数（wc-l）。输出几乎和我们以前的一样。 In this case the list of files is specified using the asterisk wildcard: *.fq, i.e. all files that end in .fq.

#The basename command

The basename command is used for extracting the base name of a file, which is accomplished using string splitting to strip the directory and any suffix from filenames.

Let’s try an example, by first moving back to your home directory:

$ cd

Then we will run the basename command on one of the FASTQ files. Be sure to specify the path to the file:

$ basename ~/unix_lesson/raw_fastq/Mov10_oe_1.subset.fq

What is returned to you? The filename was split into the path unix_lesson/raw_fastq/ and the filename Mov10_oe_1.subset.fq. The command returns only the filename. Now, suppose we wanted to also trim off the file extension (i.e. remove .fq leaving only the file base name). We can do this by adding a parameter to the command to specify what string of characters we want trimmed.

$ basename ~/unix_lesson/raw_fastq/Mov10_oe_1.subset.fq .fq

You should now see that only Mov10_oe_1.subset is returned.

#Automating with Scripts

if you will, a script that will run a series of commands that would do the following for us each time we get a new data set:

Use for loop to iterate over each FASTQ file
Generate a prefix to use for naming our output files
Dump out bad reads into a new file
Get the count of the number of bad reads and generate a summary file

Rather than doing all of this in the terminal we are going to create a script file with all relevant commands. Move back in to unix_lesson and use vim to create our new script file:

$ cd ~/unix_lesson

$ vim generate_bad_reads_summary.sh

We always want to start our scripts with a shebang line:

#!/bin/bash

This line is the absolute path to the Bash interpreter. The shebang line ensures that the bash shell interprets the script even if it is executed using a different shell.

After the shebang line, we enter the commands we want to execute. First we want to move into our raw_fastq directory:

# enter directory with raw FASTQs
cd ~/unix_lesson/raw_fastq

And now we loop over all the FASTQs:

# enter directory with raw FASTQs
for filename in *.fq

For each file that we process we can use basename to create a variable that will uniquely identify our output file based on where it originated from:

do
  # create a prefix for all output files
  base=`basename $filename .subset.fq`

and then we execute the commands for each loop:

  # tell us what file we're working on
  echo $filename
  
  # grab all the bad read records into new file
  grep -B1 -A2 NNNNNNNNNN $filename > ${base}-badreads.fastq

We’ll also count the number of these reads and put that in a new file, using the count flag of grep:

  # grab the number of bad reads and write it to a summary file
  grep -cH NNNNNNNNNN $filename >> badreads.count.summary
done

If you’ve noticed, we used a new grep flag -H above; this flag will report the filename along with the match string. This is useful for when we generate the summary file and we know what number associates with which file.

Save and exit vim, and voila! You now have a script you can use to assess the quality of all your new datasets. Your finished script, complete with comments, should look like the following:

#!/bin/bash 

# enter directory with raw FASTQs
cd ~/unix_lesson/raw_fastq

# count bad reads for each FASTQ file in our directory
for filename in *.fq 
do 

  # create a prefix for all output files
  base=`basename $filename .subset.fq`

  # tell us what file we're working on	
  echo $filename

  # grab all the bad read records
  grep -B1 -A2 NNNNNNNNNN $filename > ${base}-badreads.fastq

  # grab the number of bad reads and write it to a summary file
  grep -cH NNNNNNNNNN $filename >> badreads.count.summary
done

To run this script, we simply enter the following command:

$ sh generate_bad_reads_summary.sh

How do we know if the script worked? Take a look inside the raw_fastq directory, we should see that for every one of the original FASTQ files we have one bad read file. We should also have a summary log file documenting the total number of bad reads from each file.

$ ls -l ~/unix_lesson/raw_fastq

To keep our data organized, let’s move all of the bad read files out of the raw_fastq directory into a new directory called other, and the script to a new directory called scripts.

$ mv raw_fastq/*bad* other/

$ mkdir scripts
$ mv *.sh scripts/

2019/07/14_NGS Data Analysis Course (Harvard Chan Bioinformatics Core)_6_shell

#Shell scripts