Simplify data extraction using Linux text utilities
No comments · Posted by nguyen in Linux Docs
Table 1. Sample regular expressions
Example       Description
[abc]         Matches any one of “a”, “b”, or “c”
[a-z]         Matches any one lowercase letter from “a” to “z”
[A-Z]         Matches any one uppercase letter from “A” to “Z”
[0-9]         Matches any one digit from 0 to 9
[^0-9]        Matches any one character other than the digits 0 to 9
[-0-9]        Matches any one digit from 0 to 9, or a dash (“-”)
[0-9-]        Matches any one digit from 0 to 9, or a dash (“-”)
[^-0-9]       Matches any one character that is neither a digit from 0 to 9 nor a dash (“-”)
[a-zA-Z0-9]   Matches any one alphabetic or numeric character
grep
The grep utility works by searching through each line of a file (or files) for the first occurrence of a given string. If that string is found, the line is printed; otherwise, the line is not printed. The following file, which I’ll name “memo,” illustrates grep’s usage and results.
To: All Employees
From: Human Resources
In order to better serve the needs of our mass market customers, ABC Publishing is integrating the groups selling to this channel for ABC General Reference and ABC Computer Publishing. This change will allow us to better coordinate our selling and marketing efforts, as well as simplify ABC’s relationships with these customers in the areas of customer service, co-op management, and credit and collection. Two national account managers, Ricky Ponting and Greeme Smith, have joined the sales team as a result of these changes.
To achieve this goal, we have also organized the new mass sales group into three distinct teams reporting to our current sales directors, Stephen Fleming and Boris Baker. I have outlined below the national account managers and their respective accounts in each of the teams. We have also hired two new national account managers and a new sales administrator to complete our account coverage. They include:
Sachin Tendulkar, who joins us from XYZ Consumer Electronics as a national account manager covering traditional mass merchants.
Brian Lara, who comes to us via PQR Company and will be responsible for managing our West Coast territory.
Shane Warne, who will become an account administrator for our warehouse clubs business and joins us from DEF division.
Effectively, we have seven new faces on board:
1. RICKY PONTING
2. GREEME SMITH
3. STEPHEN FLEMING
4. BORIS BAKER
5. SACHIN TENDULKAR
6. BRIAN LARA
7. SHANE WARNE
Please join me in welcoming each of our new team members.
As a simple example, to find the lines that have the word “welcoming”, the best approach would be to use the following command line:
# grep welcoming memo
Please join me in welcoming each of our new team members.
If you look for the word “market”, the results are slightly different, as shown below.
# grep market memo
In order to better serve the needs of our mass
market customers, ABC Publishing is
integrating the groups selling to this channel
for ABC General Reference and ABC Computer
Publishing. This change will allow us to
better coordinate our selling and marketing
efforts, as well as simplify ABC's
relationships with these customers in the
areas of customer service, co-op management,
and credit and collection. Two national
account managers, Ricky Ponting and Greeme
Smith, have joined the sales team as a result
of these changes.
Note that two matches are found: the requested “market”, and “marketing”. If the words “marketable” or “marketed” had occurred in the file, the utility would have displayed the lines containing those words as well.
Wildcards and meta-characters can be used with grep, and I strongly recommend that you place them inside quotation marks so that the shell doesn’t expand them as filename patterns before grep sees them.
To find all lines that contain a number, use the following:
# grep "[0-9]" memo
1. RICKY PONTING
2. GREEME SMITH
3. STEPHEN FLEMING
4. BORIS BAKER
5. SACHIN TENDULKAR
6. BRIAN LARA
7. SHANE WARNE
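As a rough sketch of how the bracket expressions from Table 1 behave, the following builds a small scratch file and searches it. The file name sample.txt and its contents are made up for illustration (it is not the memo), and the locale is pinned to C on the assumption that you want ranges such as [A-Z] to follow plain ASCII order.

```shell
# Pin the locale so character ranges follow ASCII order
export LC_ALL=C

# Scratch file with made-up content (not the memo from the article)
tmp=$(mktemp -d)
printf '%s\n' '1. RICKY PONTING' 'Please join me' 'co-op 42' > "$tmp/sample.txt"

# Lines containing at least one digit
with_digit=$(grep '[0-9]' "$tmp/sample.txt")

# Number of lines containing at least one uppercase letter
with_upper=$(grep -c '[A-Z]' "$tmp/sample.txt")

# Number of lines containing a digit or a dash
digit_or_dash=$(grep -c '[-0-9]' "$tmp/sample.txt")

printf '%s\n' "$with_digit" "$with_upper" "$digit_or_dash"
rm -r "$tmp"
```

Single quotes are used here so that the shell passes the brackets to grep untouched.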
To find all lines that contain “the”, use this:
# grep the memo
In order to better serve the needs of our mass
market customers, ABC Publishing is
integrating the groups selling to this channel
for ABC General Reference and ABC Computer
Publishing. This change will allow us to
better coordinate our selling and marketing
efforts, as well as simplify ABC's
relationships with these customers in the
areas of customer service, co-op management,
and credit and collection. Two national
account managers, Ricky Ponting and Greeme
Smith, have joined the sales team as a result
of these changes.
To achieve this goal, we have also organized
the new mass sales group into three distinct
teams reporting to our current sales
directors, Stephen Fleming and Boris Baker. I
have outlined below the national account
managers and their respective accounts in each
of the teams. We have also hired two new
national account managers and a new sales
administrator to complete our account
coverage. They include:
As you might have noticed, the output contains the word “these”, along with exact matches of the word “the”.
The grep utility, like almost every other UNIX/Linux utility, is case-sensitive, which means that a completely different result comes from looking for “The” instead of “the”.
# grep The memo
To achieve this goal, we have also organized
the new mass sales group into three distinct
teams reporting to our current sales
directors, Stephen Fleming and Boris Baker. I
have outlined below the national account
managers and their respective accounts in each
of the teams. We have also hired two new
national account managers and a new sales
administrator to complete our account
coverage. They include:
If you are seeking a particular word or phrase and don’t care about the case, there are two ways to proceed. The first is to look for both “The” and “the” by using square brackets, as shown below:
# grep "[Tt]he" memo
In order to better serve the needs of our mass
market customers, ABC Publishing is
integrating the groups selling to this channel
for ABC General Reference and ABC Computer
Publishing. This change will allow us to
better coordinate our selling and marketing
efforts, as well as simplify ABC's
relationships with these customers in the
areas of customer service, co-op management,
and credit and collection. Two national
account managers, Ricky Ponting and Greeme
Smith, have joined the sales team as a result
of these changes.
To achieve this goal, we have also organized
the new mass sales group into three distinct
teams reporting to our current sales
directors, Stephen Fleming and Boris Baker. I
have outlined below the national account
managers and their respective accounts in each
of the teams. We have also hired two new
national account managers and a new sales
administrator to complete our account
coverage. They include:
The second method is to use the -i option, which tells grep to ignore case.
# grep -i the memo
In order to better serve the needs of our mass
market customers, ABC Publishing is
integrating the groups selling to this channel
for ABC General Reference and ABC Computer
Publishing. This change will allow us to
better coordinate our selling and marketing
efforts, as well as simplify ABC's
relationships with these customers in the
areas of customer service, co-op management,
and credit and collection. Two national
account managers, Ricky Ponting and Greeme
Smith, have joined the sales team as a result
of these changes.
To achieve this goal, we have also organized
the new mass sales group into three distinct
teams reporting to our current sales
directors, Stephen Fleming and Boris Baker. I
have outlined below the national account
managers and their respective accounts in each
of the teams. We have also hired two new
national account managers and a new sales
administrator to complete our account
coverage. They include:
In addition to -i, there are several other command-line options to change grep’s output. The most relevant are the following:
* -c — Suppress normal output; instead, print a count of matching lines for each input file.
* -l — Suppress normal output; instead, print the name of each input file from which output would have normally been printed.
* -n — Prefix each line of output with the line number within its input file.
* -v — Invert the sense of matching — that is, select lines that don’t match the search criteria.
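The options listed above can be tried out with a sketch like the following. The file notes.txt and its three lines are invented for the example; only the grep options themselves come from the list.

```shell
# Scratch file with made-up content
tmp=$(mktemp -d)
printf '%s\n' 'the first line' 'a second line' 'the third line' > "$tmp/notes.txt"

count=$(grep -c the "$tmp/notes.txt")          # count of matching lines
numbered=$(grep -n the "$tmp/notes.txt")       # matching lines, prefixed with line numbers
inverted=$(grep -v the "$tmp/notes.txt")       # lines that do NOT match
names=$(cd "$tmp" && grep -l the notes.txt)    # names of files containing a match

printf '%s\n' "$count" "$numbered" "$inverted" "$names"
rm -r "$tmp"
```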
fgrep
fgrep searches files for a string and prints all lines that contain that string. Unlike grep, fgrep searches for a fixed string instead of a pattern that matches an expression; it behaves like grep -F. Compared with grep, the fgrep utility has these characteristics:
* You can search for more than one object at a time.
* The fgrep utility can be faster than grep, because it does not have to interpret patterns.
* You can’t use fgrep to search for regular expressions with patterns.
Suppose you want to pull uppercase names from your earlier memo file. Using grep as shown so far, you would issue two separate commands to find “STEPHEN” and “BRIAN”, as shown below:
# grep STEPHEN memo
3. STEPHEN FLEMING
# grep BRIAN memo
6. BRIAN LARA
You can accomplish the same task with just one fgrep command:
# fgrep "STEPHEN
> BRIAN" memo
3. STEPHEN FLEMING
6. BRIAN LARA
Note that a newline is required between entries. Without the newline, the search would look for “STEPHEN BRIAN” on each line. With the newline, it looks for lines that match either “STEPHEN” or “BRIAN”.
Note also that quotation marks must be used around the targeted text. This is what differentiates the text from the filename (or filenames).
Instead of specifying search items on the command line, you can place them in a file and use the contents of that file to search other files. The -f option allows you to specify a master file containing search items for which you search frequently.
For example, imagine a file named “search_items” that contains two search items for which you intend to search:
# cat search_items
STEPHEN
BRIAN
The following command searches for “STEPHEN” and “BRIAN” in our earlier memo file:
# fgrep -f search_items memo
3. STEPHEN FLEMING
6. BRIAN LARA
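The same fixed-string searches can be sketched with grep -F, which is the option form of fgrep. The files names.txt and dots.txt below are made up for illustration; search_items mirrors the file described above. The last two commands are an added illustration of what "string instead of pattern" means: with -F the dot is a literal dot.

```shell
# Scratch files with made-up content
tmp=$(mktemp -d)
printf '%s\n' '3. STEPHEN FLEMING' '4. BORIS BAKER' '6. BRIAN LARA' > "$tmp/names.txt"
printf '%s\n' STEPHEN BRIAN > "$tmp/search_items"
printf '%s\n' 'a.c' 'abc' > "$tmp/dots.txt"

# Several fixed strings given on the command line, one -e per string
by_flags=$(grep -F -e STEPHEN -e BRIAN "$tmp/names.txt")

# The same strings read from a file, one per line
by_file=$(grep -F -f "$tmp/search_items" "$tmp/names.txt")

# With -F the dot is literal; without it the dot matches any character
fixed=$(grep -F -c 'a.c' "$tmp/dots.txt")
regex=$(grep -c 'a.c' "$tmp/dots.txt")

printf '%s\n' "$by_flags" "$by_file" "$fixed" "$regex"
rm -r "$tmp"
```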
egrep
egrep (equivalent to grep -E) is a more powerful version of grep that allows you to search for more than one object at a time. Objects being searched for are separated by newlines (as with fgrep) or by the pipe symbol (|).
# egrep "STEPHEN
> BRIAN" memo
3. STEPHEN FLEMING
6. BRIAN LARA
# egrep "STEPHEN|BRIAN" memo
3. STEPHEN FLEMING
6. BRIAN LARA
The two commands above do the same job.
Besides the capacity to search for multiple objects, egrep offers the ability to search for repetitions and groups:
* ? looks for zero repetitions or one repetition of the character that precedes the question mark.
* + looks for one or more repetitions of the character that precedes the plus sign.
* ( ) signifies a group.
For example, imagine that you can’t remember whether Brian’s surname is “Lara” or “Laras”.
# egrep "LARAS?" memo
6. BRIAN LARA
This search produces matches to both “LARA” and “LARAS”. The following search is a bit different:
# egrep "STEPHEN+" memo
3. STEPHEN FLEMING
It matches “STEPHEN”, “STEPHENN”, “STEPHENNN”, and so on.
If you are looking for a word plus one of its possible derivatives, include the distinguishing characters of the derivative in parentheses.
# egrep -i "electron(ic)?s" memo
Sachin Tendulkar, who joins us from XYZ Consumer
Electronics as a national account manager covering
traditional mass merchants.
This finds a match for both “electrons” and “electronics”.
To summarize:
* A regular expression followed by + matches one or more occurrences of the regular expression.
* A regular expression followed by ? matches zero or one occurrence of the regular expression.
* Regular expressions separated by | or by a newline match strings that are matched by any of the expressions.
* A regular expression can be enclosed in parentheses ( ) for grouping.
* The command-line parameters you can use include -c, -f, -i, -l, -n, and -v.
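The summary above can be checked with a small sketch that uses grep -E, the option form of egrep. The word list in words.txt is invented so that each operator has something to match and something to skip.

```shell
# Scratch file with made-up words
tmp=$(mktemp -d)
printf '%s\n' 'LARA' 'LARAS' 'electrons' 'electronics' 'electric' 'STEPHENN' > "$tmp/words.txt"

# ? : zero or one of the preceding character (matches LARA and LARAS)
optional=$(grep -E -c 'LARAS?' "$tmp/words.txt")

# + : one or more of the preceding character (matches STEPHENN)
repeated=$(grep -E -c 'STEPHEN+' "$tmp/words.txt")

# ( ) : group, here made optional as a unit
grouped=$(grep -E 'electron(ic)?s' "$tmp/words.txt")

# | : either alternative (LARA, LARAS, electric)
either=$(grep -E -c 'LARA|electric' "$tmp/words.txt")

printf '%s\n' "$optional" "$repeated" "$grouped" "$either"
rm -r "$tmp"
```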
The grep utilities: A real-world example
The grep family of utilities can be used with any system file in text format to find a match in a line. For example, to find the entries in the /etc/passwd file for a user named “root”, use the following:
# grep root /etc/passwd
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
Because it looks for a match anywhere in the line, grep finds entries for both “root” and “operator” (whose home directory is /root). If you want to find only the entry with the username “root”, anchor the pattern to the start of the line with ^, as follows:
# grep "^root" /etc/passwd
root:x:0:0:root:/root:/bin/bash
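To see the effect of anchoring without touching the real /etc/passwd, here is a sketch against a scratch file. The file passwd.sample and the "rootkit" account are hypothetical; the extra account is there to show that ^root alone still matches any username that merely starts with "root", while adding the trailing colon pins the match to the whole first field.

```shell
# Scratch file in /etc/passwd format; the rootkit entry is hypothetical
tmp=$(mktemp -d)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'operator:x:11:0:operator:/root:/sbin/nologin' \
  'rootkit:x:500:500::/home/rootkit:/bin/bash' > "$tmp/passwd.sample"

anywhere=$(grep -c root "$tmp/passwd.sample")       # match anywhere in the line
at_start=$(grep -c '^root' "$tmp/passwd.sample")    # match only at the start of the line
exact=$(grep '^root:' "$tmp/passwd.sample")         # first field is exactly "root"

printf '%s\n' "$anywhere" "$at_start" "$exact"
rm -r "$tmp"
```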
cut
With the cut utility, you can separate columns that could constitute data fields in a file. The default delimiter is the tab, and the -f option is used to specify the desired field.
For example, imagine a text file named “sample” with three columns that look like this:
one two three
four five six
seven eight nine
ten eleven twelve
Now, apply the following command:
# cut -f2 sample
This will return:
two
five
eight
eleven
If you change your command like so:
# cut -f1,3 sample
It will return the first and third columns:
one three
four six
seven nine
ten twelve
Several command-line options are available with this command. Besides -f, you should be familiar with these two:
* -c — Allows you to specify characters instead of fields.
* -d — Allows you to specify a delimiter other than the tab.
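The three options can be compared side by side in a short sketch. It recreates two rows of the tab-separated "sample" file in a scratch directory; the comma-separated line at the end is made up to show -d.

```shell
# Recreate part of the tab-separated sample file in a scratch directory
tmp=$(mktemp -d)
printf 'one\ttwo\tthree\nfour\tfive\tsix\n' > "$tmp/sample"

second=$(cut -f2 "$tmp/sample")           # second tab-separated field
first_third=$(cut -f1,3 "$tmp/sample")    # first and third fields, still tab-separated
chars=$(cut -c1-3 "$tmp/sample")          # characters 1 through 3 of each line
csv=$(printf 'a,b,c\n' | cut -d, -f2)     # comma as the delimiter instead of tab

printf '%s\n' "$second" "$first_third" "$chars" "$csv"
rm -r "$tmp"
```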
cut: Two real-world examples
The ls -l command shows the permissions, number of links, owner, group, size, date, and filenames of all the files in a directory — all separated by white space. If you’re not interested in most of the fields and want to see only the file owner, you can use the following command:
# ls -l | cut -d" " -f5
root
562
root
root
root
root
root
root
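This command is meant to display only the file owner (the fifth space-delimited field on most lines here), ignoring every other field. Be aware, though, that cut treats every single space as a delimiter, so the columns that ls pads with extra spaces shift the field count from line to line; that is why “562” (a file size) appears among the owners in the output above.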
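One way to make the field positions predictable is to squeeze runs of spaces with tr -s before calling cut. The sketch below uses two invented lines in ls -l format rather than a real directory listing, so the names, sizes, and dates are assumptions; tr is not covered in this article and is swapped in here only as a helper.

```shell
# Two made-up lines in "ls -l" format, padded differently
listing='-rw-r--r--  1 root root   562 Jan  1 00:00 memo
-rwxr-xr-x 12 root root 14120 Jan  1 00:00 a.out'

# Without squeezing, field 5 lands on different columns per line
naive=$(printf '%s\n' "$listing" | cut -d' ' -f5)

# After squeezing repeated spaces, the owner is always field 3 and the size field 5
owners=$(printf '%s\n' "$listing" | tr -s ' ' | cut -d' ' -f3)
sizes=$(printf '%s\n' "$listing" | tr -s ' ' | cut -d' ' -f5)

printf '%s\n' "$naive" "$owners" "$sizes"
```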
If you know the exact position at which the file owner begins, you can use the -c option to display its first character. Assuming that it begins with the 16th character, the following command returns the 16th character, the first letter of the owner’s name.
# ls -l | cut -c16
r
r
r
r
r
r
r
If you further assume that most users will use eight characters or fewer for their name, you can use the following command:
# ls -l | cut -c16-23
It will return the eight characters of the name field, since the range is inclusive.
Now, assume that the name of the file begins with the 55th character, but that it is impossible to determine how many characters it takes up after that because some filenames are considerably longer than others. A solution is to begin with the 55th character and not specify an ending character (meaning that the entire rest of the line is taken), as shown below:
# ls -l | cut -c55-
a.out
cscope-15.5
cscope-15.5.tar
cscope.out
memo
search_items
test.c
test.s
Now, consider another scenario. To obtain a list of all the users on the system, you can pull only the first field from the /etc/passwd file used in an earlier example:
# cut -d":" -f1 /etc/passwd
root
bin
daemon
adm
lp
sync
shutdown
halt
mail
news
uucp
operator
To collect the usernames and their corresponding home directories, you can pull the first and sixth fields:
# cut -d":" -f1,6 /etc/passwd
root:/root
bin:/bin
daemon:/sbin
adm:/var/adm
lp:/var/spool/lpd
sync:/sbin
shutdown:/sbin
halt:/sbin
mail:/var/spool/mail
news:/etc/news
uucp:/var/spool/uucp
operator:/root
paste
The paste utility combines fields from files. It takes one line from one source and combines it with another line from another source.
For example, imagine that the content of a file named “fileone” is:
IBM
Global
Services
In addition, you have “filetwo” with this content:
United States
United Kingdom
India
The following command combines the contents of these files, as shown below:
# paste fileone filetwo
IBM United States
Global United Kingdom
Services India
If there were more lines in fileone than filetwo, then the pasting would continue, with blank entries following the tab.
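The tab character is the default delimiter, but you can change it with the -d option. The argument is a list of single-character delimiters that paste uses in turn, so with two files only the first character is used.
# paste -d"," fileone filetwo
IBM,United States
Global,United Kingdom
Services,India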
You can also use the -s option to output all of fileone on one line, followed by a newline and then all of filetwo.
# paste -s fileone filetwo
IBM Global Services
United States United Kingdom India
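The following sketch recreates the two files in a scratch directory, with filetwo deliberately left one line short (an assumption made here to show the unequal-length case described above), and uses a comma as the delimiter.

```shell
# Recreate the example files; filetwo is intentionally one line shorter
tmp=$(mktemp -d)
printf '%s\n' IBM Global Services > "$tmp/fileone"
printf '%s\n' 'United States' 'United Kingdom' > "$tmp/filetwo"

# Line-by-line merge; the last row has an empty second column
side_by_side=$(paste -d, "$tmp/fileone" "$tmp/filetwo")

# -s joins all lines of one file into a single line
serial=$(paste -s -d, "$tmp/fileone")

printf '%s\n' "$side_by_side" "$serial"
rm -r "$tmp"
```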
join
join is a greatly enhanced version of paste. join works only if the files being joined share a common field.
For example, consider the two files you were using with the paste command previously. Here’s what happens when you try to combine them with join:
# join fileone filetwo
Note that there is nothing to display. The join utility must find a common field between the files in question, and by default it expects that common field to be the first.
To see how this works, try adding some new content. Assume that fileone now contains these entries:
aaaa Jurassic Park
bbbb AI
cccc The Ring
dddd The Mummy
eeee Titanic
And filetwo now contains the following:
aaaa Neil 1111
bbbb Steven 2222
cccc Naomi 3333
dddd Brendan 4444
eeee Kate 5555
Now, try that command again:
# join fileone filetwo
aaaa Jurassic Park Neil 1111
bbbb AI Steven 2222
cccc The Ring Naomi 3333
dddd The Mummy Brendan 4444
eeee Titanic Kate 5555
The commonality of the first field was identified, and the matching entries were combined. Whereas paste blindly took a line from each file to create the output, join combines only lines that match, and the match must be exact. For example, imagine you added a line to filetwo:
aaaa Neil 1111
bbbb Steven 2222
ffff Elisha 6666
cccc Naomi 3333
dddd Brendan 4444
eeee Kate 5555
Now, your command will produce this output:
# join fileone filetwo
aaaa Jurassic Park Neil 1111
bbbb AI Steven 2222
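The join utility expects both files to be sorted on the join field and reads them in step. Because “ffff” appears out of order in filetwo, join compares it with “cccc”, “dddd”, and “eeee” in fileone, finds that each sorts earlier, and reaches the end of fileone without ever seeing the later lines of filetwo. Lines are incorporated into the output only when their join fields match exactly; sorting both files on that field first avoids missed matches.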
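A common way around ordering problems is to sort both inputs on the join field before joining. The sketch below uses shortened, shuffled versions of the two files; the names one.sorted and two.sorted are hypothetical, and the locale is pinned to C on the assumption that sort and join should agree on ordering.

```shell
# Use the same collation for sort and join
export LC_ALL=C

# Shortened copies of the example files, with filetwo out of order
tmp=$(mktemp -d)
printf '%s\n' 'aaaa Jurassic Park' 'bbbb AI' 'cccc The Ring' > "$tmp/fileone"
printf '%s\n' 'aaaa Neil 1111' 'ffff Elisha 6666' 'cccc Naomi 3333' 'bbbb Steven 2222' > "$tmp/filetwo"

# Sort each file on its first field, then join
sort -k1,1 "$tmp/fileone" > "$tmp/one.sorted"
sort -k1,1 "$tmp/filetwo" > "$tmp/two.sorted"
joined=$(join "$tmp/one.sorted" "$tmp/two.sorted")

printf '%s\n' "$joined"
rm -r "$tmp"
```

The unmatched "ffff" line is simply left out, and every line that has a partner is found.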
By default, join looks only at the first fields for matches and outputs all columns, but you can change this behavior. The -1 option lets you specify which field to use as the matching field in fileone, and the -2 option lets you specify which field to use as the matching field in filetwo.
For example, to match the second field of fileone to the third field of filetwo, use the following syntax:
# join -1 2 -2 3 fileone filetwo
The -o option specifies output in the format {file.field}. Thus, to print the second field of fileone and the third field of filetwo on matching lines, the syntax is:
# join -o 1.2 -o 2.3 fileone filetwo
join: A real-world example
The most obvious way you could use join in the real world would be to pull the username and the corresponding home directory from the /etc/passwd file and the group name from the /etc/group file. Groups appear in the fourth field in numerical format in the /etc/passwd file. Similarly, they appear in the third field in the /etc/group file.
# join -1 4 -2 3 -o 1.1 -o 2.1 -o 1.6 -t":" /etc/passwd /etc/group
root:root:/root
bin:bin:/bin
daemon:daemon:/sbin
adm:adm:/var/adm
lp:lp:/var/spool/lpd
nobody:nobody:/
vcsa:vcsa:/dev
rpm:rpm:/var/lib/rpm
nscd:nscd:/
ident:ident:/home/ident
netdump:netdump:/var/crash
sshd:sshd:/var/empty/sshd
rpc:rpc:/
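To experiment with this kind of join without reading the real system files, here is a sketch on scratch copies. The files passwd.sample and group.sample and their three entries are made up; they are sorted on their join fields first, since join compares keys as text, and the -o list is written in its comma-separated form.

```shell
# Use the same collation for sort and join
export LC_ALL=C
tmp=$(mktemp -d)

# Made-up passwd-style lines: name:pw:uid:gid:comment:home:shell
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'daemon:x:2:2:daemon:/sbin:/sbin/nologin' \
  'bin:x:1:1:bin:/bin:/sbin/nologin' > "$tmp/passwd.sample"

# Made-up group-style lines: name:pw:gid:members
printf '%s\n' 'root:x:0:' 'bin:x:1:' 'daemon:x:2:' > "$tmp/group.sample"

# Sort each file on its join field (gid is field 4 in passwd, field 3 in group)
sort -t: -k4,4 "$tmp/passwd.sample" > "$tmp/passwd.sorted"
sort -t: -k3,3 "$tmp/group.sample" > "$tmp/group.sorted"

# Output: user name, group name, home directory
result=$(join -t: -1 4 -2 3 -o 1.1,2.1,1.6 "$tmp/passwd.sorted" "$tmp/group.sorted")

printf '%s\n' "$result"
rm -r "$tmp"
```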
awk
awk is one of the most powerful utilities in Linux. It is actually a programming language in and of itself and can be used with complex logic statements, as well as to simply pull out snippets of text. We’ll skip the details, but let’s quickly review the syntax and then walk through some real-world examples.
An awk command consists of a pattern and an action composed of one or more statements, as shown in the syntax below:
awk '/pattern/ {action}' file
Notice that:
* awk tests every record in the specified file (or files) for a pattern match. If a match is found, the specified action is performed.
* awk can act as a filter in a pipeline or take input from the keyboard (standard input) if no file or files are specified.
One useful action is to print the data! Here is how to reference fields in a record.
* $0 — The entire record
* $1 — The first field in the record
* $2 — The second field in the record
You can also pull multiple fields in a record, separating each field by a comma.
For example, to pull the sixth field from the /etc/passwd file, the command is:
# awk -F: '{print $6}' /etc/passwd
/root
/bin
/sbin
/var/adm
/var/spool/lpd
/sbin
/sbin
/sbin
/var/spool/mail
/etc/news
/var/spool/uucp
/root
Note that the -F option sets the input field separator, which awk stores in the predefined variable FS. By default it is whitespace; here it is set to a colon.
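The pattern-plus-action form and the field separator can be tried on a few lines of inline text, as in this sketch. The three passwd-style records are sample data chosen for the example; NR and NF are awk's built-in record number and field count.

```shell
# Three sample records in /etc/passwd format
sample='root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin'

# Pattern and action: print field 6 only for records matching /nologin/
homes=$(printf '%s\n' "$sample" | awk -F: '/nologin/ {print $6}')

# NF holds the number of fields in the current record; NR is the record number
fields=$(printf '%s\n' "$sample" | awk -F: 'NR == 1 {print NF}')

# Without -F, fields are split on runs of whitespace
second=$(printf 'one two   three\n' | awk '{print $2}')

printf '%s\n' "$homes" "$fields" "$second"
```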
To pull the first and sixth fields from the /etc/passwd file, the command is:
# awk -F: '{print $1,$6}' /etc/passwd
root /root
bin /bin
daemon /sbin
adm /var/adm
lp /var/spool/lpd
sync /sbin
shutdown /sbin
halt /sbin
mail /var/spool/mail
news /etc/news
uucp /var/spool/uucp
operator /root
To print the first and sixth fields with a dash between them in place of the colon delimiter, the command is:
# awk -F: '{OFS="-"}{print $1,$6}' /etc/passwd
root-/root
bin-/bin
daemon-/sbin
adm-/var/adm
lp-/var/spool/lpd
sync-/sbin
shutdown-/sbin
halt-/sbin
mail-/var/spool/mail
news-/etc/news
uucp-/var/spool/uucp
operator-/root
To print the file using a dash between fields, and print only the first and sixth fields in reverse order, the command is:
# awk -F: '{OFS="-"}{print $6,$1}' /etc/passwd
/root-root
/bin-bin
/sbin-daemon
/var/adm-adm
/var/spool/lpd-lp
/sbin-sync
/sbin-shutdown
/sbin-halt
/var/spool/mail-mail
/etc/news-news
/var/spool/uucp-uucp
/root-operator
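The commands above set OFS inside an action block that runs for every record. Two alternative spellings that set the separators once are sketched below, using a BEGIN block and the -v option; the two input records are sample data for the example.

```shell
# Two sample records in /etc/passwd format
sample='root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin'

# Set both separators once, before any input is read
with_begin=$(printf '%s\n' "$sample" | awk 'BEGIN {FS = ":"; OFS = "-"} {print $6, $1}')

# The same result using command-line options
with_opts=$(printf '%s\n' "$sample" | awk -F: -v OFS=- '{print $6, $1}')

printf '%s\n' "$with_begin" "$with_opts"
```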
head
The head utility prints the first part of each file (10 lines by default). It reads from standard input if no files are given, or if given a filename of -.
For example, if you want to extract the first two lines from your memo file, the command is:
# head -2 memo
In order to better serve the needs of our mass
market customers, ABC Publishing is
integrating the groups selling to this channel
for ABC General Reference and ABC Computer
Publishing. This change will allow us to
better coordinate our selling and marketing
efforts, as well as simplify ABC's
relationships with these customers in the
areas of customer service, co-op management,
and credit and collection. Two national
account managers, Ricky Ponting and Greeme
Smith, have joined the sales team as a result
of these changes.
You can specify the number of bytes to display using the -c option. For example, if you want to read the first two bytes from the memo file, the command is:
# head -c 2 memo
In
tail
The tail utility prints the last part of each file (10 lines by default). It reads from standard input if no files are given, or if given a filename of -.
For example, if you want to extract the last two lines from your earlier memo, the command is:
# tail -2 memo
Please join me in welcoming each of our new team members.
You can specify the number of bytes to display using the -c option. For example, if you want to read the last five bytes from the memo file, the command is:
# tail -c 5 memo
ers.
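head and tail can also be combined in a pipeline to pull a range from the middle of a file. The sketch below uses a made-up five-line file, lines.txt, and the -n spelling of the line-count option.

```shell
# Scratch file with five short lines
tmp=$(mktemp -d)
printf '%s\n' one two three four five > "$tmp/lines.txt"

first_two=$(head -n 2 "$tmp/lines.txt")             # first two lines
last_two=$(tail -n 2 "$tmp/lines.txt")              # last two lines
middle=$(head -n 4 "$tmp/lines.txt" | tail -n 2)    # lines 3 and 4
first_bytes=$(head -c 2 "$tmp/lines.txt")           # first two bytes

printf '%s\n' "$first_two" "$last_two" "$middle" "$first_bytes"
rm -r "$tmp"
```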
Conclusion
Now you know how to use various utilities to extract data from standard Linux files. Once extracted, that data can be manipulated for viewing and printing or directed into other files or databases. Knowing how to use just this handful of tools can help you spend less time on mundane tasks and become a more efficient administrator.
