9. The Three Musketeers of Text Processing: grep, sed and awk

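This post introduces three commands. grep searches file contents; sed processes text data according to script commands such as substitute, delete and insert; awk scans a file line by line and runs an action on each line that matches. An awk script has two parts (a match rule and an action), the fields of each line are assigned to variables automatically, and the BEGIN and END keywords run commands before and after the data is processed.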

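1. The grep command: searching file contents (global regular expression print)

grep searches one or more files for a given character pattern (that is, a regular expression) and prints the lines that match.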

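Regular expression metacharacters
	c*	Matches zero or more occurrences of the character c (c is any character); zero occurrences means it also matches the empty string.
	.	Matches any single character, and exactly one.
	[xyz]	Matches any one of the characters inside the brackets.
	[^xyz]	Matches any one character that is not inside the brackets.
	^	Anchors the match to the start of the line.
	$	Anchors the match to the end of the line.
	In basic regular expressions (the grep default), the characters +, ?, {, |, ( and ) are ordinary literals with no special meaning.
	To use them as operators, put a backslash in front: \{, \( and \) are standard, and GNU grep also accepts \+, \? and \|. The * is already an operator, so \* matches a literal asterisk.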
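
The metacharacters above can be tried out with a short script. This is only a sketch: the file sample.txt and its contents are hypothetical, made up for the demonstration, and the matches noted in the comments assume grep's default basic regular expressions.

```shell
#!/bin/sh
# Hypothetical sample data, created only for this demonstration.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt"
printf '%s\n' 'cat' 'ct' 'caat' 'c.t' 'a+b' 'dog' > "$sample"

# c, then zero or more a, then t, anchored to the whole line.
star=$(grep '^ca*t$' "$sample")          # cat, ct, caat

# . matches exactly one character of any kind.
dot=$(grep '^c.t$' "$sample")            # cat, c.t

# A backslash turns the metacharacter . into a literal dot.
literal_dot=$(grep '^c\.t$' "$sample")   # c.t

# In a basic regular expression, + is an ordinary character.
plus=$(grep 'a+b' "$sample")             # a+b

# [^c] matches one character that is not c, here at the start of the line.
not_c=$(grep '^[^c]' "$sample")          # a+b, dog

printf '%s\n' "$star" "$dot" "$literal_dot" "$plus" "$not_c"
rm -rf "$tmpdir"
```

Anchoring with ^ and $ keeps each pattern from matching in the middle of a longer line, which makes the effect of * and . easier to see.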

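Syntax:

[root@localhost ~]# grep [options] pattern file...

-c	Print only the number of lines that contain the pattern.
-i	Ignore case when matching the pattern.
-l	Print only the names of the files that contain a matching line.
-n	Prefix each matching line with its line number.
-v	Print the lines that do not match the pattern.
-w	Match the pattern only as a whole word; lines where it appears only as part of a longer word are skipped.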

[root@localhost ~]# grep -c 我 more.txt
4
[root@localhost ~]#
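
To compare the options side by side, the sketch below runs each one against the same small file. The file notes.txt and its four lines are hypothetical and exist only for this example.

```shell
#!/bin/sh
# Hypothetical sample data for the demonstration.
tmpdir=$(mktemp -d)
notes="$tmpdir/notes.txt"
printf '%s\n' 'Test one' 'a test line' 'testing only' 'nothing here' > "$notes"

count=$(grep -c 'test' "$notes")       # 2: "Test one" differs in case
count_i=$(grep -ci 'test' "$notes")    # 3: -i ignores case
numbered=$(grep -n 'test' "$notes")    # 2:a test line / 3:testing only
whole=$(grep -w 'test' "$notes")       # a test line ("testing" is a longer word)
inverted=$(grep -vi 'test' "$notes")   # nothing here
files=$(cd "$tmpdir" && grep -l 'test' notes.txt)   # notes.txt

printf '%s\n' "$count" "$count_i" "$numbered" "$whole" "$inverted" "$files"
rm -rf "$tmpdir"
```

Note that -c counts matching lines, not individual occurrences, so a line containing the pattern twice still counts once.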

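2. sed

sed processes the data in a text file according to script commands,
and it selects the text to act on with regular expressions.

It processes the data in this order:
1. Read one line of input at a time;
2. Match the line against the supplied commands and modify it. Note that sed copies the line into a buffer (the pattern space), and the changes apply only to that buffer;
3. Output the result.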

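[root@localhost ~]# sed [options] [script] file

Options:
-e script	Adds the script command that follows to the commands to run.
-f script-file	Adds the script commands in the given file to the commands to run.
-n	Suppresses the automatic output, so the p (print) command is needed to produce output. By default, sed prints each line after all script commands have been applied to it.
-i	Edits the source file in place. Use with care.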
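
A short sketch of how these options behave, and of the fact that sed edits its buffer rather than the file. The file colors.txt is hypothetical; -i is left out on purpose because it would overwrite the file.

```shell
#!/bin/sh
# Hypothetical sample data for the demonstration.
tmpdir=$(mktemp -d)
f="$tmpdir/colors.txt"
printf '%s\n' 'red apple' 'green pear' > "$f"

# Default: the edited text goes to standard output and the file is untouched.
edited=$(sed 's/red/yellow/' "$f")
on_disk=$(cat "$f")

# -e can be repeated to run several script commands in order.
chained=$(sed -e 's/red/yellow/' -e 's/pear/grape/' "$f")

# -n suppresses the automatic output; p then prints only what is asked for.
second=$(sed -n '2p' "$f")

printf '%s\n' "$edited" "$on_disk" "$chained" "$second"
rm -rf "$tmpdir"
```

Running a command without -i first is a cheap way to preview a change before applying it to the file.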

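1. The s script command: substitute

Syntax:

[address]s/pattern/replacement/flags

- address selects the lines the command applies to.
		By default, a sed command applies to every line of the input.
		To apply it to specific lines only, write the address part, for example as a line number or a range of line numbers;
		
- pattern is the text to be replaced, and replacement is the new text.

- flags
		n	A number (traditionally 1 to 512): replace only the n-th occurrence of the pattern on each line.
		g	Replace every occurrence on the line.
		p	Print the line if a substitution was made. Usually combined with the -n option.
		w file	Write the line to file if a substitution was made.

- Special sequences in pattern and replacement
		&	In the replacement, stands for the whole text matched by the pattern.
		\n	In the replacement, stands for the n-th subexpression, marked in the pattern with \( and \).
		\	Escapes the next character (for example \& or \\ for a literal & or \ in the replacement).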
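
The flags and special sequences can be illustrated on a single line of input. The input strings below are made up for the example; the results in the comments follow from the rules listed above.

```shell
#!/bin/sh
# Hypothetical input line for the demonstration.
line='test one test two test three'

# Numeric flag: replace only the 2nd occurrence on the line.
second=$(printf '%s\n' "$line" | sed 's/test/TRY/2')
# -> test one TRY two test three

# g flag: replace every occurrence on the line.
all=$(printf '%s\n' "$line" | sed 's/test/TRY/g')
# -> TRY one TRY two TRY three

# & stands for the matched text, so the match can be wrapped.
wrapped=$(printf '%s\n' "$line" | sed 's/one/[&]/')
# -> test [one] test two test three

# \1 and \2 refer to the subexpressions marked with \( and \).
swapped=$(printf 'hello world\n' | sed 's/\(hello\) \(world\)/\2 \1/')
# -> world hello

# \& in the replacement is a literal ampersand.
literal=$(printf 'fish chips\n' | sed 's/ / \& /')
# -> fish & chips

printf '%s\n' "$second" "$all" "$wrapped" "$swapped" "$literal"
```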

Examples

1. Replace the second test on each line of more.txt with cwz

// To replace every match instead, use the g flag:
[root@localhost ~]# sed 's/test/cwz/2' more.txt 
This is a test of the cwz script.
This is the second test of the cwz script.[root@localhost ~]#

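2. The -n option suppresses sed's normal output, while the p flag prints the lines that were changed. Used together, they print only the lines modified by the substitute command.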

[root@localhost ~]# cat more.txt
This is a test of the test script.
This is the second test of the test script.[root@localhost ~]# 

[root@localhost ~]# sed -n 's/second/last/p' more.txt
This is the last test of the test script.[root@localhost ~]#

3. The w flag saves the lines on which a substitution was made to the specified file.

[root@localhost ~]# sed 's/test/trial/w test.txt' data5.txt
This is a trial line.
This is a different line.

[root@localhost ~]# cat test.txt
This is a trial line.

2. The d script command: delete

[address]d

Delete lines 2 through 3:

[root@localhost ~]# sed '2,3d' data6.txt
This is line number 1.
This is line number 4.

Note that by default sed does not modify the original file. The deleted lines are only missing from sed's output; the file itself is unchanged.
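
Besides line numbers, the address of d can be a /pattern/ or a range ending in $ (the last line). The sketch below assumes a hypothetical four-line file shaped like data6.txt and also checks that the file on disk keeps all four lines.

```shell
#!/bin/sh
# Hypothetical file with the same shape as data6.txt.
tmpdir=$(mktemp -d)
data="$tmpdir/data.txt"
printf 'This is line number %s.\n' 1 2 3 4 > "$data"

# Delete every line that matches a pattern.
by_pattern=$(sed '/number 2/d' "$data")     # lines 1, 3 and 4 remain

# Delete from line 3 through the last line ($).
to_end=$(sed '3,$d' "$data")                # lines 1 and 2 remain

# The file itself still has all four lines.
still_there=$(grep -c 'number' "$data")     # 4

printf '%s\n' "$by_pattern" "$to_end" "$still_there"
rm -rf "$tmpdir"
```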

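3. The a and i script commands: add a line after or before a line

[address]a\text     (append text after the addressed line)
[address]i\text     (insert text before the addressed line)

Example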

[root@localhost ~]# sed '3i\This is an inserted line.' data6.txt
This is line number 1.
This is line number 2.
This is an inserted line.
This is line number 3.
This is line number 4.
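
For comparison, here is a sketch that uses both a and i on the same input. The file is hypothetical, and the commands are written in the two-line form (backslash, newline, then the text), which is assumed here to be the more portable spelling across sed implementations.

```shell
#!/bin/sh
# Hypothetical three-line file for the demonstration.
tmpdir=$(mktemp -d)
data="$tmpdir/data.txt"
printf 'line %s\n' 1 2 3 > "$data"

# a: the new text appears after line 2.
appended=$(sed '2a\
appended after line 2' "$data")

# i with the $ address: the new text appears before the last line.
inserted=$(sed '$i\
inserted before the last line' "$data")

printf '%s\n' "$appended" "$inserted"
rm -rf "$tmpdir"
```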

4. The c script command: change

Replaces the entire content of the addressed line with the given text.

[root@localhost ~]# sed '3c\This is a changed line of text.' data6.txt
This is line number 1.
This is line number 2.
This is a changed line of text.
This is line number 4.

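5. The y script command: transform

y is the only sed script command that operates on individual characters: each character in the first list is replaced by the character at the same position in the second list.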

[root@localhost ~]# sed 'y/123/789/' data8.txt
This is line number 7.
This is line number 8.
This is line number 9.
This is line number 4.
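
Because y maps characters one to one, it affects every occurrence on the line and needs no g flag. A small sketch with made-up input:

```shell
#!/bin/sh
# Map each lowercase letter to the uppercase letter at the same position.
upper=$(printf 'hello\n' | sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/')
# -> HELLO

# l becomes 0 and o becomes 1, everywhere on the line.
mapped=$(printf 'hello world\n' | sed 'y/lo/01/')
# -> he001 w1r0d

printf '%s\n' "$upper" "$mapped"
```

Both lists must have the same length, since each position in the first list is paired with the same position in the second.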

6. The p script command: print

[root@localhost ~]# sed -n '/number 3/p' data6.txt
This is line number 3.

Combining the -n option with the p command suppresses all other lines and prints only the lines that match the pattern.

7. The w script command: write

Writes the addressed lines to a file.

[address]w filename

[root@localhost ~]# sed -n '1,2w test.txt' data6.txt

[root@localhost ~]# cat test.txt
This is line number 1.
This is line number 2.

8. The r script command: read

Inserts the contents of a file after the addressed line.

[address]r filename

[root@localhost ~]# cat data12.txt
This is an added line.
// To append at the end of the file instead, use the $ address
[root@localhost ~]# sed '3r data12.txt' data6.txt
This is line number 1.
This is line number 2.
This is line number 3.
This is an added line.
This is line number 4.
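
The r command accepts the same kinds of address as the other commands. The sketch below uses two hypothetical files, main.txt and extra.txt, and reads extra.txt in once after the last line and once after a line that matches a pattern.

```shell
#!/bin/sh
# Hypothetical files for the demonstration.
tmpdir=$(mktemp -d)
printf 'line %s\n' 1 2 3 > "$tmpdir/main.txt"
printf 'added line\n' > "$tmpdir/extra.txt"

# $ address: the file is read in after the last line.
at_end=$(cd "$tmpdir" && sed '$r extra.txt' main.txt)

# Pattern address: the file is read in after each matching line.
after_match=$(cd "$tmpdir" && sed '/line 1/r extra.txt' main.txt)

printf '%s\n' "$at_end" "$after_match"
rm -rf "$tmpdir"
```

The script is kept in single quotes so that the shell does not try to expand $r as a variable.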

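9. The q script command: quit

sed quits as soon as it reaches the addressed line; that line is printed, and the lines after it are not.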

[root@localhost ~]# sed '2q' test.txt
This is line number 1.
This is line number 2.
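
A sketch of q with a line-number address and with a pattern address, using made-up input:

```shell
#!/bin/sh
# Quit after line 2: only the first two lines are printed.
first_two=$(printf '%s\n' one two three four | sed '2q')
# -> one, two

# Quit at the first line that matches: that line is still printed.
until_stop=$(printf '%s\n' one stop three | sed '/stop/q')
# -> one, stop

printf '%s\n' "$first_two" "$until_stop"
```

Since sed stops reading at that point, q can save time on large inputs when only the beginning is needed.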

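3. awk

Like sed, awk scans the file line by line (from the first line to the last), looking for lines that match the pattern. When a line matches, awk runs the requested action on it; otherwise it does nothing with that line.

[root@localhost ~]# awk [options] 'script' file

-F fs	Uses fs as the field separator for input lines. By default awk splits fields on runs of spaces or tabs.
-f file	Reads the awk script from a file instead of from the command line.
-v var=val	Sets the variable var to the initial value val before processing begins.

The script has 2 parts:

'pattern {action}'

/^$/ is a regular expression that matches empty lines: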
[root@localhost ~]# awk '/^$/ {print "Blank line"}' test.txt
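
Either part of the script may be left out. The sketch below, with made-up input, shows the blank-line rule producing output, then a pattern with no action (awk prints the matching line) and an action with no pattern (awk runs it on every line).

```shell
#!/bin/sh
# Pattern and action: one message per empty line.
blank=$(printf 'alpha\n\nbeta\n\n' | awk '/^$/ {print "Blank line"}')
# -> Blank line, Blank line

# Pattern only: matching lines are printed as they are.
pattern_only=$(printf 'alpha\n\nbeta\n' | awk '/a$/')
# -> alpha, beta

# Action only: the action runs on every line.
action_only=$(printf 'alpha\nbeta\n' | awk '{print "seen: " $0}')
# -> seen: alpha, seen: beta

printf '%s\n' "$blank" "$pattern_only" "$action_only"
```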

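awk also assigns each data field on a line to a variable automatically:

$0 is the whole line;
$1 is the 1st data field on the line;
$2 is the 2nd data field on the line;
$n is the n-th data field on the line.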

[root@localhost ~]# cat data2.txt
One line of test text.
Two lines of test text.
Three lines of test text.
[root@localhost ~]# awk '{print $1}' data2.txt
One
Two
Three
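
Field variables combine naturally with the -F and -v options. The records below are hypothetical, loosely shaped like colon-separated account entries, and the variable name min is made up for the example.

```shell
#!/bin/sh
# Hypothetical colon-separated records.
records='alice:x:1000:/home/alice
bob:x:1001:/home/bob'

# -F: splits each line on colons, so $1 is the first field.
names=$(printf '%s\n' "$records" | awk -F: '{print $1}')
# -> alice, bob

# A comma in print separates the fields with a space.
pairs=$(printf '%s\n' "$records" | awk -F: '{print $1, $4}')
# -> alice /home/alice, bob /home/bob

# -v sets min before any input is read; the pattern compares field 3 to it.
filtered=$(printf '%s\n' "$records" | awk -F: -v min=1001 '$3 >= min {print $1}')
# -> bob

printf '%s\n' "$names" "$pairs" "$filtered"
```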

To run several commands, separate them with semicolons:

[root@localhost ~]# echo "My name is Rich" | awk '{$4="Christine"; print $0}'
My name is Christine

Reading the awk program from a file

[root@localhost ~]# cat awk.sh
{print $1 "'s home directory is " $6}
[root@localhost ~]# awk -F: -f awk.sh /etc/passwd
root's home directory is /root
bin's home directory is /bin
.................

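To run commands before any data is processed, use the BEGIN keyword:

[root@localhost ~]# cat data3.txt
Line 1
Line 2
Line 3
[root@localhost ~]# awk 'BEGIN {print "The data3 File Contents:"}
> {print $0}' data3.txt
The data3 File Contents:
Line 1
Line 2
Line 3

To run commands after all the data has been read, use the END keyword: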

[root@localhost ~]# awk 'BEGIN {print "The data3 File Contents:"}
> {print $0}
> END {print "End of File"}' data3.txt
The data3 File Contents:
Line 1
Line 2
Line 3
End of File
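
A common use of the three blocks together is to print a header, accumulate a value while reading, and report it at the end. The numbers and the variable name sum below are made up for this sketch.

```shell
#!/bin/sh
# BEGIN runs once before the input, the middle block runs once per line,
# and END runs once after the last line has been read.
report=$(printf '3\n4\n5\n' | awk '
BEGIN {print "values:"}
{sum += $1; print $1}
END {print "total: " sum}')

printf '%s\n' "$report"
# -> values:, 3, 4, 5, total: 12
```

Variables such as sum start out empty (treated as 0 in arithmetic), so no explicit initialization in BEGIN is needed here.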