Tuesday, 5 May 2015

Linux commands for Text Manipulation

Posted by Mahesh Doijade
Unix commands for text filtering and manipulating; Shell commands for text filtering and manipulating
 
    Text filtering and manipulation is a building block for many of the problems we solve. Linux and its shell are rich in tools and utilities for this purpose, giving the user numerous handy commands for filtering and transforming text. Here, I try to cover the Linux text manipulation tools that should serve most of your needs. They are listed and described in alphabetical order below:


awk [ -F fs ] [ -v var=value ] [ 'prog' | -f progfile ] [ file ... ] : Awk is an interpreted programming language in its own right, built for pattern matching on streams of textual data. It makes heavy use of associative arrays, strings and regular expressions. It is particularly useful for parsing system data and generating automated reports.

A few essential arguments:
-F fs Sets the input field separator to the regular expression fs.
-v var=value Assigns the value value to the variable var before executing the awk program.
'prog' An awk program.
-f progfile Specify a file, progfile, which contains the awk program to be executed.
file ... A file to be processed by the specified awk program.
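
As a rough sketch of how -F and -v fit together, assume a made-up colon-separated file of names and ages (the data and the variable names min, names and total are only for illustration):

```shell
# Build a small colon-separated sample file (hypothetical data).
tmp=$(mktemp)
printf 'alice:30\nbob:25\ncarol:45\n' > "$tmp"

# -F: splits each line on ':'; -v min=28 sets an awk variable before the program runs.
# Print field 1 (the name) for every line whose field 2 (the age) is greater than min.
names=$(awk -F: -v min=28 '$2 > min { print $1 }' "$tmp")
echo "$names"     # alice and carol, one per line

# Add up field 2 over all lines and print the sum once the input ends.
total=$(awk -F: '{ sum += $2 } END { print sum }' "$tmp")
echo "$total"     # 100

rm -f "$tmp"
```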

comm [options] FILE1 FILE2 : comm compares two sorted files, FILE1 and FILE2, line by line.
Arguments:
-1 : Suppress lines unique to the left file (FILE1)
-2 : Suppress lines unique to the right file (FILE2)
-3 : Suppress lines that appear in both files

For example:
comm -12 file1 file2
                 Print only lines present in both file1 and file2.
comm -3   file1 file2
                 Print lines in file1 not in file2, and vice versa.
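
To make this concrete, here is a small sketch with two already-sorted temporary files (the fruit names are invented sample data):

```shell
# Two sorted sample files (hypothetical data); comm expects sorted input.
a=$(mktemp); b=$(mktemp)
printf 'apple\nbanana\ncherry\n' > "$a"
printf 'banana\ncherry\ndate\n'  > "$b"

# Suppress columns 1 and 2: only lines present in both files remain.
common=$(comm -12 "$a" "$b")
echo "$common"    # banana and cherry

# Suppress columns 2 and 3: only lines unique to the first file remain.
only_a=$(comm -23 "$a" "$b")
echo "$only_a"    # apple

rm -f "$a" "$b"
```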

csplit [options] [file] [pattern] : Splits a file into sections determined by context lines, i.e. by the given pattern(s).

A few essential arguments:
-f, --prefix=PREFIX   Use PREFIX instead of 'xx'
-z   Remove empty output files
-n, --digits=DIGITS   Use specified number of digits instead of 2
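
A minimal sketch of these options in use, assuming GNU csplit (the '{*}' repeat count and the -s quiet flag are GNU features) and an invented file book.txt with "Chapter" headings:

```shell
# Work in a scratch directory with a small sample file (hypothetical data).
dir=$(mktemp -d)
printf 'Chapter 1\nline a\nChapter 2\nline b\nline c\n' > "$dir/book.txt"

# Split before every line starting with "Chapter".
# -s: do not print the size of each piece
# -z: drop the empty piece that would precede the first match
# -f: name the pieces part-NNN instead of xxNN; -n 3: use three digits
# '{*}': repeat the pattern as many times as possible (GNU extension)
csplit -s -z -f "$dir/part-" -n 3 "$dir/book.txt" '/^Chapter/' '{*}'

pieces=$(ls "$dir" | grep -c '^part-')
second=$(head -n 1 "$dir/part-001")
echo "$pieces"    # 2
echo "$second"    # Chapter 2

rm -rf "$dir"
```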

cut [options] [file] : Extracts a portion of each line of a file by selecting columns.
Arguments:
-c range :  Outputs only the characters in the range


For example:
Suppose this is your text file "file.txt":

$ cat file.txt
This is a test for cut.
Linux text filtering commands.
Linux text manipulating commands. 


The following example displays the 4th character of each line of the file "file.txt".

$ cut -c4 file.txt
s
u
u
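
Besides single characters, cut can take a character range, and with its standard -d (delimiter) and -f (field) options, which are not covered in the list above, it can select delimited fields. A small sketch on an invented passwd-style line:

```shell
# A sample colon-separated record (hypothetical data).
line='root:x:0:0:root:/root:/bin/bash'

# -d: sets the delimiter, -f picks fields by number.
user=$(printf '%s\n' "$line" | cut -d: -f1)
login_shell=$(printf '%s\n' "$line" | cut -d: -f7)
echo "$user"          # root
echo "$login_shell"   # /bin/bash

# -c also accepts a range of character positions.
chars=$(printf '%s\n' 'Linux text' | cut -c1-5)
echo "$chars"         # Linux
```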

diff [options] [file1] [file2] : This command reports the differences between the two given files. It is a handy utility for quickly checking whether, and where, two files differ.
A few essential arguments:
-a   Treat all files as text and compare them line-by-line, even if they do not seem to be text.
-b     Ignore changes in amount of white space.
-c   Use the context output format.
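
For illustration, a short sketch on two throwaway files (the contents are invented). diff exits with status 0 when the files are identical and 1 when they differ, which is why it is wrapped in an if here:

```shell
old=$(mktemp); new=$(mktemp)
printf 'one\ntwo\nthree\n' > "$old"
printf 'one\n2\nthree\n'   > "$new"

# Default output format: "2c2" means line 2 of the first file
# was changed into line 2 of the second file.
if changes=$(diff "$old" "$new"); then
    echo "files are identical"
else
    echo "$changes"
fi

# With -b, a difference only in the amount of white space is ignored.
printf 'a  b\n' > "$old"
printf 'a b\n'  > "$new"
if diff -b "$old" "$new" > /dev/null; then ws_only=yes; else ws_only=no; fi
echo "$ws_only"    # yes

rm -f "$old" "$new"
```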


echo [options] [string] : Prints the given input string.
A few essential arguments:
-n    do not output the trailing newline
-e    enable interpretation of backslash escapes such as \n (newline) and \t (tab)

For example: echo is useful for checking what values your environment variables hold.
$ echo $PATH 


fold [options] [files] : Wraps each line of the given text files to fit a specified width. By default the output goes to stdout; redirect it to a file if needed.


fold -sw [SIZE] [input.txt] > [output.txt]

-s   break at spaces
-w SIZE   use SIZE columns as the width instead of the default 80.
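
A quick sketch of the wrapping behaviour on an invented sentence, folded to 20 columns:

```shell
text='The quick brown fox jumps over the lazy dog'

# -s breaks at spaces rather than in the middle of a word; -w 20 sets the width.
wrapped=$(printf '%s\n' "$text" | fold -s -w 20)
printf '%s\n' "$wrapped"

# Count the resulting lines and find the longest one.
line_count=$(printf '%s\n' "$wrapped" | awk 'END { print NR }')
longest=$(printf '%s\n' "$wrapped" | awk '{ if (length($0) > max) max = length($0) } END { print max }')
echo "$line_count"   # 3
```

Note that with -s the space at which a line is broken stays at the end of that line, so the wrapped lines can carry trailing blanks.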


grep [options] [pattern] [file] : A crucial utility for finding the lines that match a pattern in plain-text data. The name grep comes from the ed command g/re/p: globally search for a regular expression and print. Some of the more often used options are:

-m <num>  Stop reading a file after <num> matching lines.
-C <num>  Print <num> lines of output context around each match.
-x        Select only those matches that exactly match the whole line.
-i        Do case-insensitive matching.
-l        Print only the names of files that contain a match.
-R, -r    Read all files within a directory recursively.
-w        Select only those lines containing matches that form whole words.
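
To see how a few of these options change the result, here is a sketch against a made-up log file (the file contents are invented):

```shell
log=$(mktemp)
printf 'error: disk full\nwarning: low memory\nError: timeout\nerrors found\n' > "$log"

# -i ignores case, -w requires "error" to be a whole word,
# so "errors found" is not selected.
word_hits=$(grep -iw 'error' "$log")
echo "$word_hits"    # error: disk full / Error: timeout

# -m 1 stops after the first matching line.
first=$(grep -m 1 -i 'error' "$log")
echo "$first"        # error: disk full

# -x only accepts lines that match the pattern in full.
exact=$(grep -x 'errors found' "$log")
echo "$exact"        # errors found

rm -f "$log"
```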

head [options] [file] : Prints the first 10 lines of the given file.
Essential arguments:
-n N         print the first N lines instead of the first 10
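
For instance, on a twelve-line sample file (invented data), a minimal sketch:

```shell
f=$(mktemp)
printf '%s\n' one two three four five six seven eight nine ten eleven twelve > "$f"

# Ask for the first 3 lines only.
top3=$(head -n 3 "$f")
echo "$top3"            # one, two, three

# Without -n, head stops after 10 lines.
default_count=$(head "$f" | awk 'END { print NR }')
echo "$default_count"   # 10

rm -f "$f"
```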


nl [options] [file] : Numbers the lines of the given file and displays the result on standard output.

$ cat fruits.txt
  apples
  bananas
  Orange
  Jack fruit
$ nl fruits.txt
  1    apples
  2    bananas
  3    Orange
  4    Jack fruit

sed [options] [expression] [file] : sed (stream editor) is a marvelous utility; IMHO you can do almost any kind of text filtering and transformation once you learn its intricacies. In some ways it resembles an editor that allows scripted edits, but sed makes only one pass over its input(s) and is consequently more efficient. sed’s ability to filter text in a pipeline sets it apart from other types of editors.
A few essential arguments:
-n   suppress automatic printing of pattern space
-e script   add the script to the commands to be executed
-f script-file   add the contents of script-file to the commands to be executed
-l N   specify the desired line-wrap length for the ‘l’ command
-r   use extended regular expressions in the script
Essential command: s - substitution
There is much more to sed than fits in the scope of this article, but s (substitution) is its most widely used and best-known command. The example below replaces old with new in the input passed to sed.
$ echo old | sed s/old/new/  
  new
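
Two details of s are worth a short sketch: by default it replaces only the first match on each line, and the g and p flags (standard sed flags, shown here on invented input) change that behaviour:

```shell
# Without a flag, only the first "old" on the line is replaced.
first_only=$(echo 'old old' | sed 's/old/new/')
echo "$first_only"   # new old

# The g flag replaces every occurrence on the line.
all=$(echo 'old old' | sed 's/old/new/g')
echo "$all"          # new new

# -n suppresses automatic printing; the p flag then prints
# only the lines on which a substitution was made.
changed=$(printf 'keep\nold line\n' | sed -n 's/old/new/p')
echo "$changed"      # new line
```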


sort [options] [file] : sort is an easy-to-use and useful command that sorts the lines of the given file alphabetically or numerically.

A few essential arguments:
-b   Ignore leading blanks.
-d   Consider only blanks and alphanumeric characters.
-g   Compare according to general numerical value.
-i   Consider only printable characters.
-R   Sort by random hash of keys.
-r   Reverse the result of comparisons.
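
The difference between the default ordering and -g is easiest to see on numbers. A small sketch with three invented values:

```shell
data=$(printf '10\n9\n100\n')

# Default: lines are compared as text, character by character.
plain=$(printf '%s\n' "$data" | sort)
echo "$plain"      # 10, 100, 9

# -g compares by numerical value.
numeric=$(printf '%s\n' "$data" | sort -g)
echo "$numeric"    # 9, 10, 100

# -r reverses the result of the comparison.
reversed=$(printf '%s\n' "$data" | sort -gr)
echo "$reversed"   # 100, 10, 9
```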

tail [options] [file] : Prints the last 10 lines of the file on the standard output.
Essential arguments:
-f   Output appended data as the file grows.
-n <N>   Print the last N lines of the file instead of the default last 10 lines.
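
A minimal sketch on a twelve-line sample file (invented data). The -f option is left out of the runnable part because it keeps waiting for new data until interrupted:

```shell
f=$(mktemp)
printf '%s\n' one two three four five six seven eight nine ten eleven twelve > "$f"

# The last 2 lines of the file.
last2=$(tail -n 2 "$f")
echo "$last2"    # eleven, twelve

# head and tail combine to pick out a single line, here line 5.
line5=$(head -n 5 "$f" | tail -n 1)
echo "$line5"    # five

rm -f "$f"
```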


tee [options] [file] : Copies its standard input to the given file while also passing it through to standard output, so the stream is displayed and saved at the same time.
For example: the command below sends the output of the ls command to both standard output and file.txt
$ ls | tee file.txt
tee can also send the output to multiple files, as in the command shown below:
$ ls | tee file1.txt file2.txt file3.txt
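
A self-contained sketch of the same idea, using a scratch directory and invented file names. It also uses -a, a standard tee option not listed above, which appends instead of overwriting:

```shell
dir=$(mktemp -d)

# tee writes its input to both files and still passes it on to stdout.
shown=$(printf 'a\nb\n' | tee "$dir/copy1.txt" "$dir/copy2.txt")
saved=$(cat "$dir/copy2.txt")
echo "$shown"      # a, b
echo "$saved"      # a, b

# -a appends to the file rather than truncating it first.
printf 'c\n' | tee -a "$dir/copy1.txt" > /dev/null
appended=$(cat "$dir/copy1.txt")
echo "$appended"   # a, b, c

rm -rf "$dir"
```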

uniq [options] [input] [output] : Filters out adjacent duplicate lines in the input file, so the input is usually sorted first. If [input] is not specified it reads from stdin, and if [output] is not specified it writes to stdout.
Essential arguments:
-c  Prefix lines with a number representing how many times they occurred.
-d  Only print duplicate lines
-i  Ignore case when comparing; by default comparisons are case-sensitive.
-u  Only print unique lines
-w <N>  Compare not more than N characters in the lines.
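
Because uniq only compares neighbouring lines, it is usually fed from sort. A short sketch on an invented list of fruit names:

```shell
data=$(printf 'pear\napple\npear\nfig\n')

# Unsorted, the two "pear" lines are not adjacent, so nothing is removed.
unsorted=$(printf '%s\n' "$data" | uniq | awk 'END { print NR }')
echo "$unsorted"   # 4

# After sorting, -c prefixes each distinct line with its count;
# awk reformats "count line" into "line=count" for readability.
counts=$(printf '%s\n' "$data" | sort | uniq -c | awk '{ print $2 "=" $1 }')
echo "$counts"     # apple=1, fig=1, pear=2

# -d keeps only repeated lines, -u only lines that occur once.
dups=$(printf '%s\n' "$data" | sort | uniq -d)
singles=$(printf '%s\n' "$data" | sort | uniq -u)
echo "$dups"       # pear
echo "$singles"    # apple, fig
```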


wc [options] [file] : Prints the count of newlines, words and bytes for each input file.
Essential arguments:
-c   Print the byte counts.
-l   Print the newline counts.
-m   Print the character counts.
-w   Print the word counts.
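
As a quick check of these counts, here is a sketch on a two-line sample file (invented contents). Reading from stdin keeps the file name out of the output, and tr strips the padding that some wc implementations add:

```shell
f=$(mktemp)
printf 'hello world\nfoo\n' > "$f"

lines=$(wc -l < "$f" | tr -d ' ')
words=$(wc -w < "$f" | tr -d ' ')
bytes=$(wc -c < "$f" | tr -d ' ')
echo "$lines"   # 2
echo "$words"   # 3
echo "$bytes"   # 16  (12 bytes for "hello world\n" plus 4 for "foo\n")

rm -f "$f"
```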