awk - 10 examples to group data in a CSV or text file

本文介绍了如何利用awk的强大功能对文件数据进行分组处理,包括求和、计数、最大值、最小值等操作,并提供了多个实际示例。

awk is very powerful when it comes for file formatting.  In this article, we will discuss some wonderful grouping features of awk. awk can group a data based on a column or field , or on a set of columns. It uses the powerful associative array for grouping. If you are new to awk, this article will be easier to understand if you can go over the article how to  parse a simple CSV file using awk.

Let us take a sample CSV file with the below contents. The file is kind of an expense report containing items and their prices. As seen, some expense items  have multiple entries.
$ cat file
Item1,200
Item2,500
Item3,900
Item2,800
Item1,600
  1. To find the total of all numbers in second column. i.e, to find the sum of all the prices.
$ awk -F"," '{x+=$2}END{print x}' file
3000
 The delimiter(-F) used is comma since its a comma separated file.  x+=$2 stands for  x=x+$2. When a line is parsed, the second column($2) which is the price, is added to the variable x. At the end, the variable x contains the sum. This example is same as discussed in the awk example of finding the  sum of all numbers in a file.

   If your input file is a text file with the only difference being the comma not present in the above file, all you need to make is one change. Remove this part from the above command: -F","  . This is because the default delimiter in awk is whitespace.


2. To find the total sum of particular group entry alone. i.e, in this case, of "Item1":

$ awk -F, '$1=="Item1"{x+=$2;}END{print x}' file
800
    This gives us the total sum of all the items pertaining to "Item1". In the earlier example, no condition was specified since we wanted awk to work on every line or record. In this case, we want awk to work on only the records whose first column($1) is equal to   Item1 .


3. If the data to be worked upon is present in a shell variable:

 

$ VAR="Item1"
$ awk -F, -v inp=$VAR '$1==inp{x+=$2;}END{print x}' file
800
  -v is used to  pass the shell variable to awk , and the rest is same as the last one.


4. To find unique values of first column

$ awk -F, '{a[$1];}END{for (i in a)print i;}' file
Item1
Item2
Item3
  Arrays in awk   are associative and is a very powerful feature. Associate arrays have an index and a corresponding value. Example:   a["Jan"]=30   meaning in the array   a,  "Jan" is an index with value 30. In our case here, we use only the index without values. So, the command   a[$1]  works like this: When the first record is processed, in the array named   a,  an index value "Item1" is stored. During the second record, a new index "Item2", during third "Item3" and so on. During the 4th record, since the "Item1" index is already there, no new index is added and the same continues.

  Now, once the file is processed completely, the control goes to the END label where we print all the index items. for loop in awk comes in 2 variants: 1. The C language kind of for loop,  Second being the one used for associate arrays.

   for i in a : This means for every index in the array  a . The variable "i" holds the index value. In place of "i", it can be any variable name. Since there are 3 elements in the array, the loop will run for 3 times, each time holding the value of an index in the "i". And by printing "i", we get the index values printed.


 To understand the for loop better, look at this:

for (i in a)
{
  print i;
}
 

Note: The order of the output in the above command may vary from system to system. Associative arrays do not store the indexes in sequence and hence the order of the output need not be the same in which it is entered.

5. To find the sum of individual group records. i.e, to sum all records pertaining to Item1 alone, Item2 alone, and so on.

$ awk -F, '{a[$1]+=$2;}END{for(i in a)print i", "a[i];}' file
Item1, 800
Item2, 1300
Item3, 900
  a[$1]+=$2 . This can be written as a[$1]=a[$1]+$2. This works like this: When the first record is processed, a["Item1"] is assigned 200(a["Item1"]=200). During second "Item1" record, a["Item1"]=800 (200+600) and so on. In this way, every index item in the array is stored with the appropriate value associated to it which is the sum of the group.
   And in the END label, we print both the index(i) and the value(a[i]) which is nothing but the sum.

6. To find the sum of all entries in second column and add it as the last record .

$ awk -F"," '{x+=$2;print}END{print "Total,"x}' file
Item1,200
Item2,500
Item3,900
Item2,800
Item1,600
Total,3000
  This is same as the first example except that along with adding the value every time, every record is also printed, and at the end, the "Total" record is also printed.


7. To print the maximum or the biggest record of every group:

$ awk -F, '{if (a[$1] < $2)a[$1]=$2;}END{for(i in a){print i,a[i];}}' OFS=, file
Item1,600
Item2,800
Item3,900
  Before storing the value($2) in the array,  the current second column value is compared with the existing value and stored only if the value in the current record is bigger. And finally, the array will contain only the maximum values against every group. In the same way, just by changing the "lesser than(<)" symbol to greater than(>), we can find the smallest element in the group.
The syntax for if in awk is, similar to the C language syntax:


if (condition)
{  
  <code for true condition >
}else{  
 <code for false condition>
 }


8. To find the count of entries against every group:

$ awk -F, '{a[$1]++;}END{for (i in a)print i, a[i];}' file
Item1 2
Item2 2
Item3 1

 a[$1]++ : This can be put as a[$1]=a[$1]+1. When the first "Item1" record is parsed, a["Item1"]=1 and every item on encountering "Item1" record, this count is incremented, and the same follows for other entries as well. This code simply increments the count by 1 for the respective index on encountering a record. And finally on printing the array, we get the item entries and their respective counts.


9. To print only the first record of every group:

$ awk -F, '!a[$1]++' file
Item1,200
Item2,500
Item3,900
   A little tricky this one. In this awk command, there is only condition, no action statement. As a result, if the condition is true, the current record gets printed by default.
 !a[$1]++ : When the first record of a group is encountered, a[$1] remains 0 since ++ is post-fix, and not(!) of 0 is 1 which is true, and hence the first record gets printed. Now,  when the second records of "Item1" is parsed, a[$1] is 1 (will become 2 after the command since its a post-fix). Not(!) of 1 is 0 which is false, and the record does not get printed. In this way, the first record of every group gets printed.
   Simply by removing '!' operator, the above command will print all records other than the first record of the group.


10. To join or concatenate the values of all group items. Join the values of the second column with a colon separator:

$ awk -F, '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS=, file
Item1,200:600
Item2,500:800
Item3,900
     This if condition is pretty simple: If there is some value in a[$1], then append or concatenate the current value using a colon delimiter, else just assign it to a[$1] since this is the first value.
To make the above if block clear, let me put it this way:  "if (a[$1])"  means "if a[$1] has some value".
if(a[$1])
 a[$1]=a[$1]":"$2;
else
 a[$1]=$2
 The same can be achieved using the awk ternary operator as well which is same as in the C language.

$ awk -F, '{a[$1]=a[$1]?a[$1]":"$2:$2;}END{for (i in a)print i, a[i];}' OFS=, file
Item1,200:600
Item2,500:800
Item3,900
 Ternary operator is a short form of if-else condition. An example of ternary operator is:  x=x>10?"Yes":"No"  means if x is greater than 10, assign "Yes" to x, else assign "No".
In the same way:  a[$1]=a[$1]?a[$1]":"$2:$2  means if a[$1] has some value assign a[$1]":"$2 to a[$1] , else simply assign $2 to a[$1].



Concatenate variables in awk:
One more thing to notice is the way string concatenation is done in awk. To concatenate 2 variables in awk, use a space in-between.
Examples:

z=x y    #to concatenate x and y
z=x":"y  #to concatenate x and y with a colon separator.
 

 


{ // DHCPv4 configuration starts here. This section will be read by DHCPv4 server // and will be ignored by other components. "Control-agent": { "http-host": "localhost", "http-port": 8000 }, "Dhcp4": { "interfaces-config": { "interfaces": [ "enp3s0f0" ] }, "control-socket": { "socket-type": "unix", "socket-name": "/path/to/kea4-ctrl-socket" }, } "Dhcp4": { // Add names of your network interfaces to listen on. "interfaces-config": { // See section 8.2.4 for more details. You probably want to add just // interface name (e.g. "eth0" or specific IPv4 address on that // interface name (e.g. "eth0/192.0.2.1"). "interfaces": ["enp3s0f1/192.168.100.1"] // Kea DHCPv4 server by default listens using raw sockets. This ensures // all packets, including those sent by directly connected clients // that don't have IPv4 address yet, are received. However, if your // traffic is always relayed, it is often better to use regular // UDP sockets. If you want to do that, uncomment this line: // "dhcp-socket-type": "udp" }, // Kea supports control channel, which is a way to receive management // commands while the server is running. This is a Unix domain socket that // receives commands formatted in JSON, e.g. config-set (which sets new // configuration), config-reload (which tells Kea to reload its // configuration from file), statistic-get (to retrieve statistics) and many // more. For detailed description, see Sections 8.8, 16 and 15. "control-socket": { "socket-type": "unix", "socket-name": "kea4-ctrl-socket" }, // Use Memfile lease database backend to store leases in a CSV file. // Depending on how Kea was compiled, it may also support SQL databases // (MySQL and/or PostgreSQL). Those database backends require more // parameters, like name, host and possibly user and password. // There are dedicated examples for each backend. See Section 7.2.2 "Lease // Storage" for details. "lease-database": { // Memfile is the simplest and easiest backend to use. It's an in-memory // C++ database that stores its state in CSV file. "type": "memfile", "lfc-interval": 3600 }, // Kea allows storing host reservations in a database. If your network is // small or you have few reservations, it's probably easier to keep them // in the configuration file. If your network is large, it's usually better // to use database for it. To enable it, uncomment the following: // "hosts-database": { // "type": "mysql", // "name": "kea", // "user": "kea", // "password": "1234", // "host": "localhost", // "port": 3306 // }, // See Section 7.2.3 "Hosts storage" for details. // Setup reclamation of the expired leases and leases affinity. // Expired leases will be reclaimed every 10 seconds. Every 25 // seconds reclaimed leases, which have expired more than 3600 // seconds ago, will be removed. The limits for leases reclamation // are 100 leases or 250 ms for a single cycle. A warning message // will be logged if there are still expired leases in the // database after 5 consecutive reclamation cycles. // If both "flush-reclaimed-timer-wait-time" and "hold-reclaimed-time" are // not 0, when the client sends a release message the lease is expired // instead of being deleted from the lease storage. "expired-leases-processing": { "reclaim-timer-wait-time": 10, "flush-reclaimed-timer-wait-time": 25, "hold-reclaimed-time": 3600, "max-reclaim-leases": 100, "max-reclaim-time": 250, "unwarned-reclaim-cycles": 5 }, // Global timers specified here apply to all subnets, unless there are // subnet specific values defined in particular subnets. "renew-timer": 900, "rebind-timer": 60, "valid-lifetime": 3600, // Many additional parameters can be specified here: // - option definitions (if you want to define vendor options, your own // custom options or perhaps handle standard options // that Kea does not support out of the box yet) // - client classes // - hooks // - ddns information (how the DHCPv4 component can reach a DDNS daemon) // // Some of them have examples below, but there are other parameters. // Consult Kea User's Guide to find out about them. // These are global options. They are going to be sent when a client // requests them, unless overwritten with values in more specific scopes. // The scope hierarchy is: // - global (most generic, can be overwritten by class, subnet or host) // - class (can be overwritten by subnet or host) // - subnet (can be overwritten by host) // - host (most specific, overwrites any other scopes) // // Not all of those options make sense. Please configure only those that // are actually useful in your network. // // For a complete list of options currently supported by Kea, see // Section 7.2.8 "Standard DHCPv4 Options". Kea also supports // vendor options (see Section 7.2.10) and allows users to define their // own custom options (see Section 7.2.9). "option-data": [ // When specifying options, you typically need to specify // one of (name or code) and data. The full option specification // covers name, code, space, csv-format and data. // space defaults to "dhcp4" which is usually correct, unless you // use encapsulate options. csv-format defaults to "true", so // this is also correct, unless you want to specify the whole // option value as long hex string. For example, to specify // domain-name-servers you could do this: // { // "name": "domain-name-servers", // "code": 6, // "csv-format": "true", // "space": "dhcp4", // "data": "192.0.2.1, 192.0.2.2" // } // but it's a lot of writing, so it's easier to do this instead: { "name": "domain-name-servers", "data": "192.0.2.1, 192.0.2.2" }, // Typically people prefer to refer to options by their names, so they // don't need to remember the code names. However, some people like // to use numerical values. For example, option "domain-name" uses // option code 15, so you can reference to it either by // "name": "domain-name" or "code": 15. { "code": 15, "data": "example.org" }, // Domain search is also a popular option. It tells the client to // attempt to resolve names within those specified domains. For // example, name "foo" would be attempted to be resolved as // foo.mydomain.example.com and if it fails, then as foo.example.com { "name": "domain-search", "data": "mydomain.example.com, example.com" }, // String options that have a comma in their values need to have // it escaped (i.e. each comma is preceded by two backslashes). // That's because commas are reserved for separating fields in // compound options. At the same time, we need to be conformant // with JSON spec, that does not allow "\,". Therefore the // slightly uncommon double backslashes notation is needed. // Legal JSON escapes are \ followed by "\/bfnrt character // or \u followed by 4 hexadecimal numbers (currently Kea // supports only \u0000 to \u00ff code points). // CSV processing translates '\\' into '\' and '\,' into ',' // only so for instance '\x' is translated into '\x'. But // as it works on a JSON string value each of these '\' // characters must be doubled on JSON input. { "name": "boot-file-name", "data": "EST5EDT4\\,M3.2.0/02:00\\,M11.1.0/02:00" }, // Options that take integer values can either be specified in // dec or hex format. Hex format could be either plain (e.g. abcd) // or prefixed with 0x (e.g. 0xabcd). { "name": "default-ip-ttl", "data": "0xf0" } // Note that Kea provides some of the options on its own. In particular, // it sends IP Address lease type (code 51, based on valid-lifetime // parameter, Subnet mask (code 1, based on subnet definition), Renewal // time (code 58, based on renew-timer parameter), Rebind time (code 59, // based on rebind-timer parameter). ], // Other global parameters that can be defined here are option definitions // (this is useful if you want to use vendor options, your own custom // options or perhaps handle options that Kea does not handle out of the box // yet). // You can also define classes. If classes are defined, incoming packets // may be assigned to specific classes. A client class can represent any // group of devices that share some common characteristic, e.g. Windows // devices, iphones, broken printers that require special options, etc. // Based on the class information, you can then allow or reject clients // to use certain subnets, add special options for them or change values // of some fixed fields. "client-classes": [ { // This specifies a name of this class. It's useful if you need to // reference this class. "name": "voip", // This is a test. It is an expression that is being evaluated on // each incoming packet. It is supposed to evaluate to either // true or false. If it's true, the packet is added to specified // class. See Section 12 for a list of available expressions. There // are several dozens. Section 8.2.14 for more details for DHCPv4 // classification and Section 9.2.19 for DHCPv6. "test": "substring(option[60].hex,0,6) == 'Aastra'", // If a client belongs to this class, you can define extra behavior. // For example, certain fields in DHCPv4 packet will be set to // certain values. "next-server": "192.0.2.254", "server-hostname": "hal9000", "boot-file-name": "/dev/null" // You can also define option values here if you want devices from // this class to receive special options. } ], // Another thing possible here are hooks. Kea supports a powerful mechanism // that allows loading external libraries that can extract information and // even influence how the server processes packets. Those libraries include // additional forensic logging capabilities, ability to reserve hosts in // more flexible ways, and even add extra commands. For a list of available // hook libraries, see https://gitlab.isc.org/isc-projects/kea/wikis/Hooks-available. "hooks-libraries":[ { "library": "/usr/local/lib64/kea/hooks/libdhcp_macauth.so", "parameters": { "server_ip": "10.10.10.1", "ac_ip": "10.10.10.102", "port": 5001, "shared_secret": "7a5b8c3e9f" } }, { "library": "/usr/local/lib64/kea/hooks/libdhcp_lease_cmds.so" } //{ // "library": "/usr/local/lib64/kea/hooks/libdhcp_lease_query.so" // } ], // "hooks-libraries": [ // { // // Forensic Logging library generates forensic type of audit trail // // of all devices serviced by Kea, including their identifiers // // (like MAC address), their location in the network, times // // when they were active etc. // "library": "/usr/local/lib64/kea/hooks/libdhcp_legal_log.so", // "parameters": { // "base-name": "kea-forensic4" // } // }, // { // // Flexible identifier (flex-id). Kea software provides a way to // // handle host reservations that include addresses, prefixes, // // options, client classes and other features. The reservation can // // be based on hardware address, DUID, circuit-id or client-id in // // DHCPv4 and using hardware address or DUID in DHCPv6. However, // // there are sometimes scenario where the reservation is more // // complex, e.g. uses other options that mentioned above, uses part // // of specific options or perhaps even a combination of several // // options and fields to uniquely identify a client. Those scenarios // // are addressed by the Flexible Identifiers hook application. // "library": "/usr/local/lib64/kea/hooks/libdhcp_flex_id.so", // "parameters": { // "identifier-expression": "relay4[2].hex" // } // }, // { // // the MySQL host backend hook library required for host storage. // "library": "/usr/local/lib64/kea/hooks/libdhcp_mysql.so" // } // ], // Below an example of a simple IPv4 subnet declaration. Uncomment to enable // it. This is a list, denoted with [ ], of structures, each denoted with // { }. Each structure describes a single subnet and may have several // parameters. One of those parameters is "pools" that is also a list of // structures. "subnet4": [ { // This defines the whole subnet. Kea will use this information to // determine where the clients are connected. This is the whole // subnet in your network. // Subnet identifier should be unique for each subnet. "id": 1, // This is mandatory parameter for each subnet. "subnet": "192.168.30.0/24", // Pools define the actual part of your subnet that is governed // by Kea. Technically this is optional parameter, but it's // almost always needed for DHCP to do its job. If you omit it, // clients won't be able to get addresses, unless there are // host reservations defined for them. "pools": [ { "pool": "192.168.30.10 - 192.168.30.200" } ], // This is one of the subnet selectors. Uncomment the "interface" // parameter and specify the appropriate interface name if the DHCPv4 // server will receive requests from local clients (connected to the // same subnet as the server). This subnet will be selected for the // requests received by the server over the specified interface. // This rule applies to the DORA exchanges and rebinding clients. // Renewing clients unicast their messages, and the renewed addresses // are used by the server to determine the subnet they belong to. // When this parameter is used, the "relay" parameter is typically // unused. // "interface": "eth0", // This is another subnet selector. Uncomment the "relay" parameter // and specify a list of the relay addresses. The server will select // this subnet for lease assignments when it receives queries over one // of these relays. When this parameter is used, the "interface" parameter // is typically unused. // "relay": { // "ip-addresses": [ "10.0.0.1" ] // }, // These are options that are subnet specific. In most cases, // you need to define at least routers option, as without this // option your clients will not be able to reach their default // gateway and will not have Internet connectivity. "option-data": [ { // For each IPv4 subnet you most likely need to specify at // least one router. "name": "routers", "data": "192.0.2.1" } ], // Kea offers host reservations mechanism. Kea supports reservations // by several different types of identifiers: hw-address // (hardware/MAC address of the client), duid (DUID inserted by the // client), client-id (client identifier inserted by the client) and // circuit-id (circuit identifier inserted by the relay agent). // // Kea also support flexible identifier (flex-id), which lets you // specify an expression that is evaluated for each incoming packet. // Resulting value is then used for as an identifier. // // Note that reservations are subnet-specific in Kea. This is // different than ISC DHCP. Keep that in mind when migrating // your configurations. "reservations": [ // This is a reservation for a specific hardware/MAC address. // It's a rather simple reservation: just an address and nothing // else. // { // "hw-address": "1a:1b:1c:1d:1e:1f", // "ip-address": "192.0.2.201" // }, // This is a reservation for a specific client-id. It also shows // the this client will get a reserved hostname. A hostname can // be defined for any identifier type, not just client-id. { "client-id": "01:11:22:33:44:55:66", "ip-address": "192.168.30.202", "hostname": "special-snowflake" }, // The third reservation is based on DUID. This reservation defines // a special option values for this particular client. If the // domain-name-servers option would have been defined on a global, // subnet or class level, the host specific values take preference. { "duid": "01:02:03:04:05", "ip-address": "192.168.30.203", "option-data": [ { "name": "domain-name-servers", "data": "10.1.1.202, 10.1.1.203" } ] }, // The fourth reservation is based on circuit-id. This is an option // inserted by the relay agent that forwards the packet from client // to the server. In this example the host is also assigned vendor // specific options. // // When using reservations, it is useful to configure // reservations-global, reservations-in-subnet, // reservations-out-of-pool (subnet specific parameters) // and host-reservation-identifiers (global parameter). { "client-id": "01:12:23:34:45:56:67", "ip-address": "192.168.30.204", "option-data": [ { "name": "vivso-suboptions", "data": "4491" }, { "name": "tftp-servers", "space": "vendor-4491", "data": "10.1.1.202, 10.1.1.203" } ] }, // This reservation is for a client that needs specific DHCPv4 // fields to be set. Three supported fields are next-server, // server-hostname and boot-file-name { "client-id": "01:0a:0b:0c:0d:0e:0f", "ip-address": "192.168.30.205", "next-server": "192.168.30.1", "server-hostname": "hal9000", "boot-file-name": "/dev/null" }, // This reservation is using flexible identifier. Instead of // relying on specific field, sysadmin can define an expression // similar to what is used for client classification, // e.g. substring(relay[0].option[17],0,6). Then, based on the // value of that expression for incoming packet, the reservation // is matched. Expression can be specified either as hex or // plain text using single quotes. // // Note: flexible identifier requires flex_id hook library to be // loaded to work. { "flex-id": "'s0mEVaLue'", "ip-address": "192.168.30.206" } // You can add more reservations here. ] // You can add more subnets there. }, { "subnet": "192.168.100.0/24", "id":100, "pools": [ { "pool": "192.168.100.100 - 192.168.100.200" } ], "option-data": [ { "name": "routers", "data": "192.168.100.2" }, { "name": "domain-name-servers", "data": "8.8.8.8, 8.8.4.4" } ] }, { "subnet": "192.168.10.0/24", "id":10, "pools": [ { "pool": "192.168.10.100 - 192.168.10.200" } ], "relay": { "ip-addresses": ["192.168.10.1"] }, "option-data": [ { "name": "routers", "data": "192.168.10.1" }, { "name": "domain-name-servers", "data": "114.114.114.114,8.8.8.8" } ] }, { "id":20, "subnet": "192.168.20.0/24", "pools": [ { "pool": "192.168.20.100 - 192.168.20.200" } ], "relay": { "ip-addresses": ["192.168.20.1"] }, "option-data": [ { "name": "routers", "data": "192.168.20.1" }, { "name": "domain-name-servers", "data": "114.114.114.114, 8.8.4.4" } ] } ], // There are many, many more parameters that DHCPv4 server is able to use. // They were not added here to not overwhelm people with too much // information at once. // Logging configuration starts here. Kea uses different loggers to log various // activities. For details (e.g. names of loggers), see Chapter 18. "loggers": [ { // This section affects kea-dhcp4, which is the base logger for DHCPv4 // component. It tells DHCPv4 server to write all log messages (on // severity INFO or more) to a file. "name": "kea-dhcp4", "output-options": [ { // Specifies the output file. There are several special values // supported: // - stdout (prints on standard output) // - stderr (prints on standard error) // - syslog (logs to syslog) // - syslog:name (logs to syslog using specified name) // Any other value is considered a name of the file "output": "kea-dhcp4.log" // Shorter log pattern suitable for use with systemd, // avoids redundant information // "pattern": "%-5p %m\n", // This governs whether the log output is flushed to disk after // every write. // "flush": false, // This specifies the maximum size of the file before it is // rotated. // "maxsize": 1048576, // This specifies the maximum number of rotated files to keep. // "maxver": 8 } ], // This specifies the severity of log messages to keep. Supported values // are: FATAL, ERROR, WARN, INFO, DEBUG "severity": "INFO", // If DEBUG level is specified, this value is used. 0 is least verbose, // 99 is most verbose. Be cautious, Kea can generate lots and lots // of logs if told to do so. "debuglevel": 0 } ] } } 查看以上配置文件查看看dhcp配置接口开放配置有什么问题及语法错误并修复
最新发布
08-15
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值