Reposted from: http://ferd.ca/awk-in-20-minutes.html
Awk in 20 Minutes
What's Awk
Awk is a tiny programming language and a command line tool. It's particularly appropriate for log parsing on servers, mostly because Awk will operate on files, usually structured in lines of human-readable text.
I say it's useful on servers because log files, dump files, or whatever text format servers end up dumping to disk will tend to grow large, and you'll have many of them per server. If you ever get into the situation where you have to analyze gigabytes of files from 50 different servers without tools like Splunk or its equivalents, it would feel fairly bad to have to download all these files locally to then run some forensics on them.
This personally happens to me when some Erlang nodes die and leave a crash dump of 700MB to 4GB behind, or on smaller individual servers (say a VPS) where I need to quickly go through logs, looking for a common pattern.
In any case, Awk does more than find data (otherwise, grep or ack would be enough): it also lets you process the data and transform it.
Code Structure
An Awk script is structured simply, as a sequence of patterns and actions:
# comment
Pattern1 { ACTIONS; }
# comment
Pattern2 { ACTIONS; }
# comment
Pattern3 { ACTIONS; }
# comment
Pattern4 { ACTIONS; }
Every line of the document to scan will have to go through each of the patterns, one at a time. So if I pass in a file that contains the following content:
this is line 1
this is line 2
Then the content this is line 1 will be matched against Pattern1. If it matches, ACTIONS will be executed. Then this is line 1 will be matched against Pattern2. If it doesn't match, it skips to Pattern3, and so on.
Once all patterns have been cleared, this is line 2 will go through the same process, and so on for other lines, until the input has been read entirely.
This, in short, is Awk's execution model.
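As a small illustration of this execution model, here is a sketch run from the shell. The three patterns and the A/B/C labels are made up for the example; only the two input lines come from the text above.

```shell
# Each input line is tried against every pattern, top to bottom.
out=$(printf 'this is line 1\nthis is line 2\n' | awk '
/line/ { print "A: " $0 }   # matches both lines
/1$/   { print "B: " $0 }   # matches only the first line
/2$/   { print "C: " $0 }   # matches only the second line
')
printf '%s\n' "$out"
# A: this is line 1
# B: this is line 1
# A: this is line 2
# C: this is line 2
```

Both actions for the first line run before the second line is even looked at.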
Data Types
Awk only has two main data types: strings and numbers. And even then, Awk likes to convert them into each other. Strings can be interpreted as numerals to convert their values to numbers. If the string doesn't look like a numeral, it's 0.
Both can be assigned to variables in ACTIONS parts of your code with the = operator. Variables can be declared anywhere, at any time, and used even if they're not initialized: their default value is "", the empty string (which counts as 0 when used as a number).
Finally, Awk has arrays. They're unidimensional associative arrays that can be created on the fly. Their syntax is just var[key] = value. Awk can simulate multidimensional arrays, but it's all a big hack anyway.
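The conversions described above can be checked with a quick sketch. The names unset_var and count are hypothetical, chosen only for this example.

```shell
out=$(awk 'BEGIN {
    print "23" + 1           # numeric-looking string becomes 23: prints 24
    print "abc" + 1          # non-numeric string counts as 0: prints 1
    print "[" unset_var "]"  # uninitialized variable as a string: prints []
    print unset_var + 0      # uninitialized variable as a number: prints 0
    count["GET"] = 3         # arrays spring into existence on assignment
    count["GET"]++
    print count["GET"]       # prints 4
}')
printf '%s\n' "$out"
```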
Patterns
The patterns that can be used will fall into three broad categories: regular expressions, Boolean expressions, and special patterns.
Regular and Boolean Expressions
The Awk regular expressions are your run-of-the-mill regexes. They're not PCRE under awk (but gawk will support the fancier stuff; it depends on the implementation! Check with awk --version), though for most usages they'll do plenty:
/admin/ { ... } # any line that contains 'admin'
/^admin/ { ... } # lines that begin with 'admin'
/admin$/ { ... } # lines that end with 'admin'
/^[0-9.]+ / { ... } # lines beginning with series of numbers and periods
/(POST|PUT|DELETE)/ # lines that contain specific HTTP verbs
And so on. Note that the patterns cannot capture specific groups to make them available in the ACTIONS part of the code. They are specifically to match content.
Boolean expressions are similar to what you would find in PHP or Javascript. Specifically, the operators && ("and"), || ("or"), and ! ("not") are available. This is also what you'll find in pretty much all C-like languages. They'll operate on any regular data type.
What's specifically more like PHP and Javascript is the comparison operator, ==, which will do fuzzy matching, so that the string "23" compares equal to the number 23, such that "23" == 23 is true. The operator != is also available, without forgetting the other common ones: >, <, >=, and <=.
You can also mix up the patterns: Boolean expressions can be used along with regular expressions. The pattern /admin/ || debug == true is valid and will match either when a line contains the word 'admin', or whenever the variable debug is set to true (Awk has no built-in true, so true is itself a variable you would have to define, for example true = 1 in a BEGIN block).
Note that if you have a specific string or variable you'd want to match against a regex, the operators ~ and !~ are what you want, to be used as string ~ /regex/ and string !~ /regex/.
Also note that all patterns are optional. An Awk script that contains the following:
{ ACTIONS }
Would simply run ACTIONS for every line of input.
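To make the pattern kinds above concrete, here is a short sketch mixing a regex, a Boolean expression, and the ~ operator. The three log lines are invented sample data, not taken from a real server.

```shell
log='10.0.0.1 GET /index.html
10.0.0.2 POST /admin
10.0.0.3 DELETE /admin'
out=$(printf '%s\n' "$log" | awk '
BEGIN                   { if ("23" == 23) print "\"23\" == 23 is true" }
/admin/ && $2 != "POST" { print "non-POST admin hit: " $0 }
$1 ~ /\.1$/             { print "from .1 host: " $0 }
')
printf '%s\n' "$out"
# "23" == 23 is true
# from .1 host: 10.0.0.1 GET /index.html
# non-POST admin hit: 10.0.0.3 DELETE /admin
```

The second rule combines a regex on the whole line with a comparison on one field, while the third matches a regex against a single field only.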
Special Patterns
There are a few special patterns in Awk, but not that many.
The first one is BEGIN, which matches only before any line of input has been read. This is basically where you can initialize variables and all other kinds of state in your script.
There is also END, which as you may have guessed, will match after the whole input has been handled. This lets you clean up or do some final output before exiting.
Finally, the last kind of pattern is a bit hard to classify. It's halfway between variables and special values: they're called Fields, and they deserve a section of their own.
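The BEGIN and END patterns can be sketched as follows, assuming an input made of one number per line. The variable name total is arbitrary.

```shell
out=$(printf '3\n4\n5\n' | awk '
BEGIN { total = 0; print "start" }         # runs once, before any input
      { total += $1 }                      # runs for every line
END   { print "sum=" total " lines=" NR }  # runs once, after all input
')
printf '%s\n' "$out"
# start
# sum=12 lines=3
```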
Fields
Fields are best explained with a visual example:
# According to the following line
#
# $1         $2    $3
# 00:34:23   GET   /foo/bar.html
# \_____________  _____________/
#               $0
# Hack attempt?
/admin.html$/ && $2 == "DELETE" {
    print "Hacker Alert!";
}
The fields are (by default) separated by white space. The field $0 represents the entire line on its own, as a string. The field $1 is then the first bit (before any white space), $2 is the one after, and so on.
A fun fact (and a thing to avoid in most cases) is that you can modify the line by assigning to its field. For example, if you go $0 = "HAHA THE LINE IS GONE" in one block, the next patterns will now operate on that line instead of the original one, and similarly for any other field variable!
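A quick sketch of field access and field assignment, reusing the sample line from the diagram above. NF (the number of fields) is a standard Awk variable; the replacement value POST is just for illustration.

```shell
out=$(printf '00:34:23 GET /foo/bar.html\n' | awk '{
    print $1      # 00:34:23
    print $2      # GET
    print NF      # 3
    print $NF     # the last field: /foo/bar.html
    $2 = "POST"   # assigning to a field rebuilds $0
    print $0      # 00:34:23 POST /foo/bar.html
}')
printf '%s\n' "$out"
```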
Actions
There's a bunch of possible actions, but the most common and useful ones (in my experience) are:
{ print $0; } # prints $0. In this case, equivalent to 'print' alone
{ exit; } # ends the program
{ next; } # skips to the next line of input
{ a=$1; b=$0 } # variable assignment
{ c[$1] = $2 } # variable assignment (array)
{ if (BOOLEAN) { ACTION }
  else if (BOOLEAN) { ACTION }
  else { ACTION }
}
{ for (i=1; i<x; i++) { ACTION } }
{ for (item in c) { ACTION } }
This alone will contain a major part of your Awk toolbox for casual usage when dealing with logs and whatnot.
The variables are all global. Whatever variables you declare in a given block will be visible to other blocks, for each line. This severely limits how large your Awk scripts can become before they're unmaintainable horrors. Keep it minimal.
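Several of these actions can be combined in one short sketch. The input lines and the hits array are made up for the example. The output is piped through sort because the iteration order of for (item in array) is unspecified.

```shell
out=$(printf 'GET /a\n# comment\nPOST /b\nGET /c\n' | awk '
/^#/ { next }             # skip comment lines entirely
     { hits[$1]++ }       # hits is global: visible in the END block too
END  {
    for (verb in hits) {
        if (hits[verb] > 1) {
            print verb " seen " hits[verb] " times"
        } else {
            print verb " seen once"
        }
    }
}' | sort)
printf '%s\n' "$out"
# GET seen 2 times
# POST seen once
```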
Functions
Functions can be called with the following syntax:
{ somecall($2) }
There is a somewhat restricted set of built-in functions available, so I like to point to the regular documentation for these.
User-defined functions are also fairly simple:
# function arguments are call-by-value
function name(parameter-list) {
    ACTIONS; # same actions as usual
}
# return is a valid keyword
function add1(val) {
    return val+1;
}
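As a runnable sketch, here is a user-defined function next to a few standard built-ins (length, split, substr, toupper). The helper capitalize is hypothetical and written only for this example.

```shell
out=$(printf 'hello world\n' | awk '
# hypothetical helper: upper-cases the first letter of a word
function capitalize(word) {
    return toupper(substr(word, 1, 1)) substr(word, 2)
}
{
    print length($0)           # number of characters: 11
    n = split($0, parts, " ")  # fills the array parts, returns 2
    print n
    print capitalize(parts[1]) " " capitalize(parts[2])
}')
printf '%s\n' "$out"
# 11
# 2
# Hello World
```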
Special Variables
Outside of regular variables (global, instantiated anywhere), there is a set of special variables acting a bit like configuration entries:
BEGIN { # Can be modified by the user
    FS = ",";   # Field Separator
    RS = "\n";  # Record Separator (lines)
    OFS = " ";  # Output Field Separator
    ORS = "\n"; # Output Record Separator (lines)
}
{ # Maintained by Awk itself; normally only read
    NF          # Number of Fields in the current Record (line)
    NR          # Number of Records seen so far
    ARGV / ARGC # Script Arguments
}
I put the modifiable variables in BEGIN because that's where I tend to override them, but that can be done anywhere in the script to then take effect on follow-up lines.
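A small sketch of these variables at work, assuming comma-separated input. The sample records and the " | " output separator are arbitrary choices for the example.

```shell
out=$(printf 'alice,30,paris\nbob,25,rome\n' | awk '
BEGIN { FS = ","; OFS = " | " }  # split input on commas, join output with " | "
      { print NR, $1, $3, NF }   # commas in print insert OFS between values
')
printf '%s\n' "$out"
# 1 | alice | paris | 3
# 2 | bob | rome | 3
```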
Examples
That's it for the core of the language. I don't have a whole lot of examples there because I tend to use Awk for quick one-off tasks.
I still have a few files I carry around for some usage and metrics, my favorite one being a script used to parse Erlang crash dumps shaped like this:
=erl_crash_dump:0.3
Tue Nov 18 02:52:44 2014
Slogan: init terminating in do_boot ()
System version: Erlang/OTP 17 [erts-6.2] [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false]
Compiled: Fri Sep 19 03:23:19 2014
Taints:
Atoms: 12167
=memory
total: 19012936
processes: 4327912
processes_used: 4319928
system: 14685024
atom: 339441
atom_used: 331087
binary: 1367680
code: 8384804
ets: 382552
=hash_table:atom_tab
size: 9643
used: 6949
...
=allocator:instr
option m: false
option s: false
option t: false
=proc:<0.0.0>
State: Running
Name: init
Spawned as: otp_ring0:start/2
Run queue: 0
Spawned by: []
Started: Tue Nov 18 02:52:35 2014
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [<0.3.0>, <0.7.0>, <0.6.0>]
Reductions: 29265
Stack+heap: 1598
OldHeap: 610
Heap unused: 656
OldHeap unused: 468
Memory: 18584
Program counter: 0x00007f42f9566200 (init:boot_loop/2 + 64)
CP: 0x0000000000000000 (invalid)
=proc:<0.3.0>
State: Waiting
...
=port:#Port<0.0>
Slot: 0
Connected: <0.3.0>
Links: <0.3.0>
Port controls linked-in driver: efile
=port:#Port<0.14>
Slot: 112
Connected: <0.3.0>
...
To yield the following result:
$ awk -f queue_fun.awk $PATH_TO_DUMP
MESSAGE QUEUE LENGTH: CURRENT FUNCTION
======================================
10641: io:wait_io_mon_reply/2
12646: io:wait_io_mon_reply/2
32991: io:wait_io_mon_reply/2
2183837: io:wait_io_mon_reply/2
730790: io:wait_io_mon_reply/2
80194: io:wait_io_mon_reply/2
...
Which is a list of functions running in Erlang processes that caused mailboxes to be too large. Here's the script:
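A minimal sketch of what such a script could look like, written inline rather than as a queue_fun.awk file. This is an illustration under assumptions, not necessarily the author's original: the threshold variable, its value of 1000, the names in_procs, big, and len, and the shortened sample dump are all hypothetical. Only the "Message queue length:" and "Program counter:" line formats come from the dump shown above.

```shell
# Shortened, made-up dump in the same shape as the real one.
dump='=erl_crash_dump:0.3
=proc:<0.0.0>
State: Running
Message queue length: 0
Program counter: 0x00007f42f9566200 (init:boot_loop/2 + 64)
=proc:<0.3.0>
State: Waiting
Message queue length: 10641
Program counter: 0x00007f5fb8cb2238 (io:wait_io_mon_reply/2 + 56)
=port:#Port<0.0>
Slot: 0'

out=$(printf '%s\n' "$dump" | awk -v threshold=1000 '
BEGIN {
    print "MESSAGE QUEUE LENGTH: CURRENT FUNCTION"
    print "======================================"
}
/^=proc:/ { in_procs = 1 }               # entered the process entries
in_procs && /^=/ && !/^=proc:/ { exit }  # left them: nothing more to read
# "Message queue length: 10641": the count is field 4
/^Message queue length: / { big = ($4 >= threshold); len = $4 }
# "Program counter: 0x... (mod:fun/arity + offset)": field 4, minus the "("
big && /^Program counter: / { print len ": " substr($4, 2) }
')
printf '%s\n' "$out"
```

With the sample dump above, only the second process crosses the assumed threshold, so a single result line follows the two header lines.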
Can you follow along? If so, you can understand Awk. Congratulations.
A detailed description of awk can be found at: https://www.gnu.org/software/gawk/manual/html_node/index.html#SEC_Contents
- Foreword
- Preface
- 1 Getting Started with awk
- 2 Running awk and gawk
  - 2.1 Invoking awk
  - 2.2 Command-Line Options
  - 2.3 Other Command-Line Arguments
  - 2.4 Naming Standard Input
  - 2.5 The Environment Variables gawk Uses
  - 2.6 gawk's Exit Status
  - 2.7 Including Other Files Into Your Program
  - 2.8 Loading Shared Libraries Into Your Program
  - 2.9 Obsolete Options and/or Features
  - 2.10 Undocumented Options and Features
- 3 Regular Expressions
- 4 Reading Input Files
  - 4.1 How Input Is Split into Records
  - 4.2 Examining Fields
  - 4.3 Nonconstant Field Numbers
  - 4.4 Changing the Contents of a Field
  - 4.5 Specifying How Fields Are Separated
  - 4.6 Reading Fixed-Width Data
  - 4.7 Defining Fields By Content
  - 4.8 Multiple-Line Records
  - 4.9 Explicit Input with getline
    - 4.9.1 Using getline with No Arguments
    - 4.9.2 Using getline into a Variable
    - 4.9.3 Using getline from a File
    - 4.9.4 Using getline into a Variable from a File
    - 4.9.5 Using getline from a Pipe
    - 4.9.6 Using getline into a Variable from a Pipe
    - 4.9.7 Using getline from a Coprocess
    - 4.9.8 Using getline into a Variable from a Coprocess
    - 4.9.9 Points to Remember About getline
    - 4.9.10 Summary of getline Variants
  - 4.10 Reading Input With A Timeout
  - 4.11 Directories On The Command Line
- 5 Printing Output
- 6 Expressions
  - 6.1 Constants, Variables and Conversions
  - 6.2 Operators: Doing Something With Values
  - 6.3 Truth Values and Conditions
  - 6.4 Function Calls
  - 6.5 Operator Precedence (How Operators Nest)
  - 6.6 Where You Are Makes A Difference
- 7 Patterns, Actions, and Variables
  - 7.1 Pattern Elements
  - 7.2 Using Shell Variables in Programs
  - 7.3 Actions
  - 7.4 Control Statements in Actions
  - 7.5 Built-in Variables
- 8 Arrays in awk
- 9 Functions
- 10 A Library of awk Functions
- 11 Practical awk Programs
  - 11.1 Running the Example Programs
  - 11.2 Reinventing Wheels for Fun and Profit
  - 11.3 A Grab Bag of awk Programs
    - 11.3.1 Finding Duplicated Words in a Document
    - 11.3.2 An Alarm Clock Program
    - 11.3.3 Transliterating Characters
    - 11.3.4 Printing Mailing Labels
    - 11.3.5 Generating Word-Usage Counts
    - 11.3.6 Removing Duplicates from Unsorted Text
    - 11.3.7 Extracting Programs from Texinfo Source Files
    - 11.3.8 A Simple Stream Editor
    - 11.3.9 An Easy Way to Use Library Functions
    - 11.3.10 Finding Anagrams From A Dictionary
    - 11.3.11 And Now For Something Completely Different
- 12 Advanced Features of gawk
- 13 Internationalization with gawk
- 14 Debugging awk Programs
- 15 Arithmetic and Arbitrary Precision Arithmetic with gawk
- 16 Writing Extensions for gawk
  - 16.1 Introduction
  - 16.2 Extension Licensing
  - 16.3 At A High Level How It Works
  - 16.4 API Description
    - 16.4.1 Introduction
    - 16.4.2 General Purpose Data Types
    - 16.4.3 Requesting Values
    - 16.4.4 Memory Allocation Functions and Convenience Macros
    - 16.4.5 Constructor Functions
    - 16.4.6 Registration Functions
    - 16.4.7 Printing Messages
    - 16.4.8 Updating ERRNO
    - 16.4.9 Accessing and Updating Parameters
    - 16.4.10 Symbol Table Access
    - 16.4.11 Array Manipulation
    - 16.4.12 API Variables
    - 16.4.13 Boilerplate Code
  - 16.5 How gawk Finds Extensions
  - 16.6 Example: Some File Functions
  - 16.7 The Sample Extensions In The gawk Distribution
    - 16.7.1 File Related Functions
    - 16.7.2 Interface To fnmatch()
    - 16.7.3 Interface To fork(), wait() and waitpid()
    - 16.7.4 Enabling In-Place File Editing
    - 16.7.5 Character and Numeric values: ord() and chr()
    - 16.7.6 Reading Directories
    - 16.7.7 Reversing Output
    - 16.7.8 Two-Way I/O Example
    - 16.7.9 Dumping and Restoring An Array
    - 16.7.10 Reading An Entire File
    - 16.7.11 API Tests
    - 16.7.12 Extension Time Functions
  - 16.8 The gawkextlib Project
- Appendix A The Evolution of the awk Language
  - A.1 Major Changes Between V7 and SVR3.1
  - A.2 Changes Between SVR3.1 and SVR4
  - A.3 Changes Between SVR4 and POSIX awk
  - A.4 Extensions in Brian Kernighan's awk
  - A.5 Extensions in gawk Not in POSIX awk
  - A.6 History of gawk Features
  - A.7 Common Extensions Summary
  - A.8 Regexp Ranges and Locales: A Long Sad Story
  - A.9 Major Contributors to gawk
- Appendix B Installing gawk
- Appendix C Implementation Notes
- Appendix D Basic Programming Concepts
- Glossary
- GNU General Public License
- GNU Free Documentation License