Perl has been called the duct tape of the Internet, and it will likely remain so. In the words of its creator, Larry Wall, perl makes easy things easy, and hard things possible. It is a rich language that helps you program all manner of sysadmin tasks quickly, grow them as your needs grow, and maintain them well through their lifetime.

The goal of this course is to introduce you to perl. By going through this course, you will be able to create simple programs in perl for your everyday needs as a system administrator. In addition, the examples in this course material can be used as a starting point for more complex scripts that you want to build but did not know where to begin. The history section should give you some perspective about perl.

This material is organized so that most concepts are first introduced in an example and explained afterwards. This way, you can try a few things, play around with the snippet, and look forward to the gory details to follow. The exercises usually test what has been explained, but also introduce a few things that have not been explicitly covered, so that you have an incentive to try them out and see what you get.

Finally, this course is not a substitute for programming, nor is it a substitute for the documentation that comes with perl! The more you code, the better you can program. What is less obvious is that the more you read, the less you need to program.

It is mandatory that you read the perl documentation available on your system. At the very least, you should try to read all the manual pages mentioned in this course. Reasonably competent system administrators can implement a lot of their common tasks by making minor modifications to the program snippets in the documentation. The resources section gives you details on where to look for complete and authoritative information.

Origins, History and Philosophy of Perl

Perl is currently the most convenient and popular language for writing Web/CGI programs, system administration scripts, database applications for the web, and even complex data mining, archival and retrieval. Yahoo, Amazon, Dejanews, CDROM.COM, Netscape, Mozilla, Apache.org, Synopsys, Paul Ingram group, and Compaq are some of the companies where perl is used extensively. Microsoft has homed in on perl for its internal scripting technologies research and development. Needless to say, almost every Unix shop uses perl. Why?

The Taming of The Camel

Before perl was available on Unix, the usual recourse for programmers wanting to build home-brew solutions was a mix of shell, awk, sed, etc. for quick jobs, and C only if speed and efficiency were needed. This saved a lot of time and development effort. However, programming quick tasks in C is not fun. Shell scripts, on the other hand, don't scale well when data has to be processed repeatedly. Perl was conceived by Larry Wall when he found that shell and C lacked what he needed. Not content with writing just another program, he wrote perl so that he could reuse it for another project. He then released it to the Usenet. The rest, as they say, is history.

Manipulexity and Whipupitude

Perl is designed to mimic the flexibility of the C language and its power to manipulate everything in the machine. At the same time, perl was designed to help the programmer quickly prototype an idea and whip up a working solution much faster than in most other languages.

Perl was originally used for manipulating text data (perl stands for Practical Extraction and Report Language). But it excelled in data transformation, file manipulation and all manner of system tasks, and quietly filled a huge niche of everyday programming for everyday tasks.

A melting pot

Perl is a distilled essence of Unix. It is a language built to emulate the best features of sh, sed, awk, C and many others. It adheres to the Unix philosophy of keeping tools simple and building complexity through the way the tools are strung together into a solution. The only difference is that all the powerful tools of Unix are now available as native constructs in perl. This saves time, increases speed, and reduces error by keeping the complexity in one place: the language, instead of the program. This has also made it possible for perl programs to run virtually unchanged on all manner of operating systems!

Perl is like a Natural Language

Perl resembles English more than you expect it to. This is by design. It borrows a lot of concepts from natural languages. For example, it uses visually distinct ways to refer to different types of variables: single values, lists, and relationships. By stark contrast, most computer languages do not let you figure out the type of a variable from its name.

Also, perl is designed to be learned once, used many times. You typically learn a small subset of perl when you start, and learn more concepts as you go. The key feature of perl, like a natural language, is that you are able to program as you learn, much like a child and an adult are able to communicate reasonably well with their own levels of competence in spoken language.

Perl allows local ambiguity in programs. This means that you can operate on some things implicitly and know that they will be doing the right thing when the program runs. The programs are thus shorter, easier to read and make better sense. Contrast the following paragraphs:

        1. Mary is 18 years old. Vijay is 19. Mary and Vijay meet every day
        for music lessons. Mary and Vijay see Vinni and Vicki every day after
        practice. After meeting Vinni and Vicki every day, Mary and Vijay
        go for a movie with Vinni and Vicki.

        2. Mary is 18 years old. Vijay is 19. They meet every day for music
        lessons. After the practice they see Vinni and Vicki and they all
        go for a movie.

The first paragraph is what programs in most other languages look like. The second paragraph is what an equivalent perl program might look like. Perl also borrows extensively from the best of other languages. Here is a simple table (courtesy of postings on the Usenet newsgroup comp.lang.perl.misc ):

        ++++++++++++++++++++++++++++++++++++++++++++++++++++
        Feature                      Ancestor(s)
        ++++++++++++++++++++++++++++++++++++++++++++++++++++
        range operator(..)           awk, sed
        math operators (+,*,/)       FORTRAN
        match operator (=~)          awk
        scalars as number/string     sh, awk, lisp
        varying length strings       BASIC, awk
        substr                       awk
        lists                        lisp, APL, shell
        slices                       Ada, FORTRAN
        statement modifiers          BASIC-PLUS
        glob('*')                    csh 
        blocks                       Algol
        #comments                    shell
        system functions             Unix, libc
        $ for variables              shell
        quotes (', ", and `)         shell
        m//, s//                     sed
        sort                         qsort from libc
        do, if, while, for           C
        foreach                      shell
        OO setup                     python
        UNIVERSAL class              smalltalk?
        unless, until                BASIC-PLUS
        require                      LISP
        \u, \U, \l, \L               vi
        $0 is changeable             sendmail
        \w,\s                        emacs
        formats                      FORTRAN, COBOL, BASIC
        \e, $%                       troff
        grep, map                    LISP
        BEGIN, END                   awk
        chr, ord                     Pascal
        -e, -f, -d                   /bin/test from Unix!
        pack 'u'                     uuencode
        and, or, not                 REXX
        autoloading in modules       lisp
        /i flag                      grep
        $package'variable (obsolete) Ada
        open syntax                  shell
        [] and {} dynamic structs    python
        sub arguments (variadic)     shell, lisp
        tied arrays                  BASIC-PLUS
        system calls, networking     Unix, C

Tim Toady

Unlike most programming languages, which have a minimal set of constructs (called orthogonal) and offer one way to do a particular task, perl was designed with redundancy and multiple constructs that do similar things. This has led to the perl motto ``There's more than one way to do it'', abbreviated to TMTOWTDI or Tim Toady. This also makes experimentation possible and keeps programming from becoming a boring chore. It also makes perl programming accessible to all levels of programmers. The better you get, the more concise and clear your programs get, and the more you start to use common idioms.

History

The first version of perl was released in 1987. After successive refinements version 4 of perl was released in 1991, which also coincided with the first release of The Camel book, Programming Perl.

Perl version 4 quickly became very popular. As many people started using perl for more than a few simple tasks, the limitations of the language made it difficult to add features. To prevent perl from forking into many versions, perl was completely rewritten and released as version 5. Perl version 5, as opposed to version 4, was more extensible, contained features for large-scale programming, and added completely new features such as lexical variables and closures, an overhauled regular expression engine, references, and pretty much everything else. Version 5 also supported more operating systems, gained a clean abstraction (DBI) for database support and a Tk port, and boasted a Win32 port for PCs running Microsoft operating systems (this port has since been integrated into the core perl distribution in source form).

For the most current updates and feature list for perl, you should see the distribution, which is always available at http://www.perl.com/CPAN/src. If you have a complete perl installation, AND if you're using perl 5.005 and above, perldoc perlhist should give you everything you want to know.

Simple Program Example

We will start our session with a simple example. Before going into the example, we need a digression on how perl runs your program, and also on how you run a perl program.

Perl is an interpreted, byte-compiled language. That is, perl reads your source program and compiles it into an internal format before running it. This means that your program can be run without first being converted into a binary form. Your source program is the executable. The fact that it is interpreted means that you can greatly reduce development and build time. The fact that it is also compiled means that it is much faster than traditional interpreted languages like Basic or Tcl.

You run your perl programs just like you run Unix shell scripts. You can call perl with your program as its argument, or specify the location of your perl binary in the first line of the script and run the script directly. We will use /usr/bin/perl as the location of the perl binary in all our examples, but please replace that with the appropriate location for your site before running programs from this document. Here is the listing of a simple example program:

        1          #!/usr/bin/perl -w
        2          use strict;
        3          my $who;
        4          my $day = (localtime)[6];
        5
        6          print "What's your name? ";
        7          chomp( $who = <STDIN> );
        8
        9          if ($day == 5) {
        10             print "Have a nice weekend, $who!\n";
        11         }
        12         else {
        13             print "Have a nice day, $who!\n";
        14         }

In line 1, we specify that this program is a perl script (on Unix-like systems only) and, with the -w switch, ask perl to warn about questionable constructs. This line makes your kernel arrange for perl to run this program when it is invoked by its name. In line 2, we tell perl to be strict about variable declarations, symbolic references and barewords. This is not mandatory. However, we will always use it to catch errors and silly mistakes early. In lines 3 and 4 we declare two lexical variables. Line 4 also sets the variable $day to the day of the week (which is available as the seventh element of the return value of the localtime function in perl, which mimics the standard C library function call of the same name). Our code uses the number 6 to pull the seventh element because perl arrays are usually indexed starting at 0. (It is possible to override this setting but we usually don't).

Line 6 prints a prompt string to standard output (which is usually your terminal for interactive scripts like this example). Line 7 does many things. It uses the perl line-input operator, <>, around the filehandle STDIN to read one line from standard input (which is usually the terminal in interactive programs like this). This value is stored in the variable $who. The function chomp then removes the trailing newline from $who. At the end of line 7, we have the user's input (minus the newline character) stored in the variable $who.

Lines 9 through 14 illustrate conditional branching in perl. Conditional branching in perl is similar to the C or shell `if' statement. In line 9, we check if the value of $day is 5. The standard C library returns a number between 0 and 6 for the day of the week, 0 being Sunday. Thus, we check if today is Friday, and print the appropriate wish in line 10 or line 13. At line 14, the program ends, and perl does its normal cleanup and exits back to the calling program (our shell).

To run the above program, we will save it into a file and invoke it. To follow a convention, we name it wish.plx (plx is the current custom for naming a perl `executable'). You should make the program executable by doing a chmod +x wish.plx. Now, assuming you have the program under your current working directory, type:

        ./wish.plx

This program illustrates quite a lot of perl, so let's go over the major features that we have illustrated, line by line, along with pointers to appropriate sections of the perl man-page that comes with perl:

        ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
        Line    Description
        ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
        1       O/S will invoke perl to run your program (perlrun)
        2       Make perl check for common mistakes (perlrun)
        3,4     Variable declarations (perlfunc, perldata)
        6       print function (perlfunc, perldata)
        7       chomp, <> operator (perlfunc, perlop, perlvar)
        9..13   conditional branching (perlsyn)
                == operator (perlop)
                string interpolation (perldata, perlop)
                braces {} (perlsyn, under `Compound Statements')

Basic Data types in Perl

Almost all programs and scripts manipulate data. This is fairly obvious, but the kinds of basic data types available in your programming language largely determine the kinds of programs you can write easily and write well. Most programming languages give you a basic set of data types and constructs to build complex data types with them. Also, most languages differ in the amount of management overhead in building complex data structures from basic data types.

Perl provides you with three basic, but powerful data types. Unlike most other languages, these data types allow you to grow/shrink them dynamically without you ever having to worry about memory allocation/de-allocation. Perl does it all for you. The three fundamental data types in perl are called Scalars, Lists and Hashes.

Scalars

A scalar is the fundamental data type in perl. A scalar can hold a single value. This value may be a string, a number, a filehandle or a reference to another perl data type. Strings and numbers are sequences of characters. Filehandles are special values used as place-holders to refer to open files in your program. In short, a scalar value can be equated to the English word 'the'. Scalars are prefixed with a dollar sign ($). In spite of the apparently different types of values you can store in a scalar variable, perl stores them in a single format and converts between them as necessary.

Here are some examples:

        $a = 'this';    #stores the string 'this' in $a
        $nirvana = 42;  #stores the number 42 in $nirvana
        $ref = \$a;     #stores a reference to the variable $a in $ref
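
Since perl converts between strings and numbers as necessary, the same scalar can be used both ways. Here is a minimal sketch of that conversion; the variable names are ours, chosen only for illustration:

```perl
use strict;
use warnings;

my $count = '42';                        # a string that looks like a number
my $next  = $count + 1;                  # used as a number
my $label = $count . ' is the answer';   # used as a string

print "$next\n";                         # prints 43
print "$label\n";                        # prints 42 is the answer
```

The operator, not the variable, decides how the value is treated: `+' asks for a number and `.' asks for a string.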

You can build a scalar from other scalars through numeric and string operations. The most common operation used for building scalars is string interpolation. A double-quoted string containing scalar variables inside it will be automatically interpolated with the values of the scalars. Here is an example:

        $a = 20; 
        $b = 22;
        $c = $a + $b;
        $answer = "The sum of $a and $b " . "is $c\n";
        print $answer;

This prints: The sum of 20 and 22 is 42

The `+' operator is the familiar numeric addition. The `.' operator is the string concatenation operator that concatenates its left and right operands and returns the result.

As you may have noticed, a semicolon terminates a perl statement or declaration. You can also group multiple statements into a block by grouping them within curly braces.

Exercises

How do you define a scalar variable named foo and initialize it with the result of 5% of 92? Can you do it in one statement?

$bar

$bar = 'abc' x 4;

$dir

$file

$full_path

Lists and Arrays

A literal list is a collection of scalar values. Lists are stored in perl arrays. Thus, an array is a list each of whose elements really contains a scalar value. This is an important point that will help you manipulate list elements without confusion. As with scalars, lists can be built dynamically, and their size can be increased or decreased by adding, deleting or splicing elements at will. List names act like the English word `these'. You prefix an array with the @ character. However, to get a scalar element of an array, you provide the index of the element within square brackets and prefix the name of the array with a $. Here are some examples:

        my($borderline, @living, @server_ports);
        $borderline = 'prions';
        @living = ('plants', "animals", 'viruses', $borderline);
        @server_ports = qw(http smtp pop3 telnet ftp);

Note that each element of the list is a scalar. You can add/modify/delete them using different perl functions and operators. Here are a few examples:

        #Initializing/Adding to arrays

        my(@a) = ('this', 'that', 'and');       #define an array named `a'
        push @a, 'others';
        print "@a\n";   #prints `this that and others'

        #Modifying existing elements

        $a[0] = 'you';  #a[0] is the first element of a. It is a scalar.
        $a[1] = 'me';

        print "@a\n";   #prints `you me and others'

        my $size = @a;  #$size gets 4, the number of elements in @a!
        $size = scalar(@a);     #same as above

        pop @a; #deletes last element of array (and returns its value)
        @a =(); #deletes everything in @a
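
The text above mentions splicing elements at will. As a rough sketch of what that looks like, the following uses the builtin splice function together with $#array (the index of the last element) and a negative index. The array contents are made up for illustration:

```perl
use strict;
use warnings;

my @tools = ('sh', 'sed', 'awk', 'C');

my $last_index = $#tools;       # index of the last element: 3
my $last       = $tools[-1];    # negative indexes count from the end: 'C'

# splice(ARRAY, OFFSET, LENGTH, LIST) removes LENGTH elements starting
# at OFFSET, puts LIST in their place, and returns what it removed.
my @removed = splice(@tools, 1, 2, 'perl');

print "@tools\n";               # prints `sh perl C'
print "@removed\n";             # prints `sed awk'
```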

Exercises in Lists/Arrays

Initialize an array named valid_users to contain the user-names root, admin and http.
Add the user-name sys to the above array. What happens if you use unshift instead of push? Try printing out the array in both cases and find out!
The pop function returns the last element of an array and deletes it at the same time. There is another perl function that returns the first element of an array and deletes it after returning it! It has the same name as a Unix shell operation that does the same thing. Try to use perlfunc to find out more details.
How do you delete all elements of an array? What does undef(@array) do?

Hashes

The final perl data structure we will see is a hash. A hash is very much like a list, but it is indexed by strings (a list is indexed by number). A hash is like a database indexed by a single key field. Hashes are initialized by specifying the key and value in pairs. For example:

        %colors = ( 'red' => '#FF0000', 'green' => '#00FF00');

Hash keys are strings and hash values are scalars, so you can refer to them as if the values were real scalars by enclosing the key within curly braces. Here is an example of adding another element to the above hash and using a value stored in it:

        $colors{'blue'} = '#0000FF';
        print( qq(<BODY BGCOLOR="$colors{red}">Red</BODY>) );

Here is how it works. %colors is the hash. Its name is colors. The key for which we want to create a value is blue. So the actual value is at colors{'blue'}. However, hash values are scalars, so we prefix it with a $. Thus, $colors{'blue'} refers to this value. Similarly, $colors{red} refers to the value stored with the key 'red'.
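
Besides storing and fetching values, you will often want to test for a key or remove one. Here is a small sketch using the builtin exists and delete functions on the same %colors hash; the variables $has_blue, $has_black, $gone and $count are ours, for illustration only:

```perl
use strict;
use warnings;

my %colors = ('red' => '#FF0000', 'green' => '#00FF00');
$colors{'blue'} = '#0000FF';

# exists tests for a key without creating it
my $has_blue  = exists $colors{'blue'}  ? 'yes' : 'no';
my $has_black = exists $colors{'black'} ? 'yes' : 'no';

# delete removes a key/value pair and returns the removed value
my $gone = delete $colors{'green'};

# assigning keys to a scalar gives the number of keys
my $count = keys %colors;

print "blue: $has_blue, black: $has_black\n";  # prints `blue: yes, black: no'
print "removed $gone, $count keys left\n";     # prints `removed #00FF00, 2 keys left'
```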

Some simple Operations on variable types

Scalars: length, substr, tr, s, chomp, lc, uc, int, sprintf

Try each of the below statements and see if the result matches with the comments (You can ignore anything followed by a '#' because those are comments):

        $dozens = int( 97/12 ); # gets 8

        $_ = 'A single sentence.';
        $l = length($_);        #$l is now 18

        $is = substr($_, 9, 4); #$is is now 'sent'

        $_ =~ tr/st/tp/;        #$_ is now 'A tingle tenpence.';
        $_ =~ s/t/s/;           #$_ is now 'A single tenpence.';
        print uc($_);           #prints "A SINGLE TENPENCE.";
        $pi = sprintf("%.4f", atan2(1, 1)*4);   #$pi gets '3.1416';

Lists: push, pop, shift, unshift, sort

        @a = (1, 2, 3);
        $last = pop @a;         #$last gets 3
        unshift @a, 0;          #@a is now (0,1,2)

        @sorted = sort('jack', 'jill', 'fred', 'barney');
        print "@sorted";        #prints `barney fred jack jill'
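
One thing to keep in mind is that sort compares its arguments as strings unless told otherwise. The sketch below contrasts the default order with a numeric sort using a comparison block and the <=> operator; the port numbers are arbitrary sample data:

```perl
use strict;
use warnings;

my @ports = (8080, 25, 443, 110);

my @as_strings = sort @ports;                 # default: string order
my @as_numbers = sort { $a <=> $b } @ports;   # numeric order

print "@as_strings\n";   # prints `110 25 443 8080'
print "@as_numbers\n";   # prints `25 110 443 8080'
```

Inside the block, $a and $b are the two elements being compared; sort sets them for you.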

Hashes: keys, values, each

        %h = ('emacs' => 'RMS', 'perl' => 'Larry', 'bind' => 'Vixie');
        @software = keys %h;
        @authors  = values %h;

        while ( ($k, $v) = each %h) {
                print "$k was the brainchild of $v\n";
        }

Verify that the above code gives the below output (not necessarily in the same order):

        emacs was the brainchild of RMS
        perl was the brainchild of Larry
        bind was the brainchild of Vixie

Exercises for hash variables

Define a hash called %IP. Use it to store a few IP addresses keyed by their hostnames. Save all the keys into an array called @hosts. Print the array @hosts as well as each of the key/value pairs in %IP.
Since hashes are indexed by strings, there is no 'order' to getting back the keys or values. You would use the sort function to impose one. Try to look up this function in the perl manual pages (perldoc -f sort). If possible, modify your program for the above exercise so that it prints the hosts and IP addresses sorted by hostname.

Statements, Variables, Context and meaning

Everything in perl is an expression. An expression is a basic unit of program in perl that returns a result. For example, the print statement in perl is actually an expression that returns a value.

        $result = print("Foo\n");

A perl statement is merely an expression evaluated for its side effects. For example, we almost never need the return value of a print statement. Thus, we write the print statement as below, and ignore the result:

        print "Result of previous stmt = $result\n";

Expressions can not only return results, but can also be assigned to under appropriate conditions. When the return value of an expression is merely used to assign it to something else, it is said to be used as an rvalue. In contrast, when you assign to an expression, it is said to be used as an lvalue. Quite a few perl functions and operations can act as lvalues. This is quite contrary to most other languages, so you may need to try a few examples to get familiar with this concept:

        1       $_ = "ABC\n";
        2       print substr($_,1,1);   #prints 'B'
        3       substr($_, 1, 1) = 'C'; 
        4       print;                  #prints 'ACC'

In line 1, we assign the value ``ABC\n'' to the perl builtin variable $_. The variable $_ is the default value used in quite a lot of perl constructs where an argument is not explicitly provided.

In line 2, we use a perl function substr. This function takes 3 arguments:

        substr( EXPR, OFFSET, LEN)

The first argument is an expression. The second argument is an offset and the third argument is the length. substr returns a sub string of length LEN in EXPR starting at offset OFFSET. String offsets start at 0, like most other offsets in perl. Thus, the result of line 2 is to print the character ``B'' from ``ABC\n''.

In line 3, we see that substr actually returns the location in $_ which begins at offset 1 and has a length 1. This is namely the part of ``ABC\n'' that starts at ``B'' and ends right there (length is 1). When we assign 'C' to this expression, perl does something very natural: it replaces the substring ``B'' in ``ABC\n'' with ``C''. Thus, the original string is converted to ``ACC\n''! This may need getting used to, because there are very few equivalents of such flexibility that you will find in other languages.
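
If assigning to a function call feels too unusual, substr also accepts the replacement string as a fourth argument. The following sketch puts the two forms side by side; the variable names are hypothetical:

```perl
use strict;
use warnings;

my $lvalue_way = "ABC\n";
substr($lvalue_way, 1, 1) = 'C';              # assign to the substring

my $four_arg_way = "ABC\n";
my $old = substr($four_arg_way, 1, 1, 'C');   # replacement as 4th argument

print $lvalue_way;      # prints ACC
print $four_arg_way;    # prints ACC
print "$old\n";         # prints B
```

The four-argument form returns the part of the string that was replaced, which is sometimes handy.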

Expressions can also return different things based on the context in which they are called! The two major types of context are described below.

Scalar context

A scalar context expects/returns a single scalar value. If you use an expression in a scalar context, the expression or its return value(s) are coerced into a scalar. For example:

  $count = @lines;

Here, @lines is an expression that returns a list of all elements contained in the array @lines. This expression is forced into a scalar context by the assignment statement. In a scalar context, this gives the number of elements of the array @lines. Thus, $count will really contain the number of elements in the array @lines.
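
Assignment to a scalar is not the only thing that provides a scalar context. As a small sketch, with sample data of our own, here is the same array in three different scalar contexts:

```perl
use strict;
use warnings;

my @lines = ("one\n", "two\n", "three\n");

my $count = @lines;                           # assignment to a scalar
my $msg   = 'lines: ' . @lines;               # string concatenation
my $state = @lines ? 'non-empty' : 'empty';   # a condition

print "$count\n";    # prints 3
print "$msg\n";      # prints lines: 3
print "$state\n";    # prints non-empty
```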

List context

A list context expects/returns a list of scalars. If you use an expression in a list context, the expression or its return value(s) is/are coerced into a list. For example:

  @lines = <STDIN>;

Here, @lines provides a list context to the expression <STDIN>. This in turn makes the expression <STDIN> slurp the entire STDIN (until end-of-file, which you type as CTRL-D on Unix or CTRL-Z on DOS/Windows) and return it as a list of lines. Thus, if you were to type 10 lines in the terminal followed by a CTRL-D after this statement, @lines will contain 10 elements, each of which will contain the respective line you entered.

In this example:

  @one   = 1;

@one provides a list context to the single value 1. The value 1 is coerced into a one element list whose first and only value is 1. Thus, the result here is that @one will now contain one element with value 1.

This works for lists in general, but there is a special case of a literal list that you should be aware of: A literal list appears like a ``C comma operator'' in a scalar context. Here is an example to illustrate this important distinction:

        @a = (12, 0, 32, -23);
        $b = @a;
        print "b = $b\n";

        $c = (12, 0, 32, -23);
        print "c = $c\n";

        this prints:

                b = 4
                c = -23
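
Some builtin functions also behave differently depending on the context they are called in. As one illustration (the data is made up), the reverse function reverses the order of a list in list context, but in scalar context it joins its arguments and reverses the characters of the resulting string:

```perl
use strict;
use warnings;

my @words = ('abc', 'def');

my @list   = reverse @words;    # list context: element order is reversed
my $string = reverse @words;    # scalar context: joined, then characters reversed

print "@list\n";     # prints `def abc'
print "$string\n";   # prints `fedcba'
```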

Exercises

What are the two contexts in perl?
What happens if you assign a single scalar to an array?
Try out the following lines of perl code. Can you reason out what is happening? Try consulting perldoc -f localtime, or perlfunc man-page to see if it provides you more information:

        $\="\n";        #force a newline on every print
        print localtime(time);
        print scalar localtime(time);

Perl Predefined Variables

Perl predefined variables are builtin variables that automatically take on certain `sensible' values at runtime. As we noted before, statements are expressions that return value(s). In the absence of an explicit assignment, some of the expressions take default arguments. In addition, some of the perl expressions may return their results into certain default variables. In other cases, changing the settings of some internal variables will make other perl functions behave differently.

This may all seem rather confusing, but it helps us in writing uncluttered code. You will be able to understand them better with usage and will someday actually depend on them! The following collection is part of the huge set of builtin variables in perl. We will follow these up with examples:

Arguments, Environment etc.

@ARGV

This is the complete list of arguments with which your program was invoked. You may use it as you would any other perl array. Here are some examples of @ARGV usage:

        print_help_and_exit() if ( @ARGV && $ARGV[0] eq '-h' );

        if ( -f $ARGV[0] ) {
                die "File $ARGV[0] exists already! Won't overwrite!\n";
        }

Here is a program that prints all arguments passed to it.

        #!/usr/bin/perl -w
        use strict;
        my $arg;

        foreach (@ARGV) {
                $arg++;
                print "Argument $arg: $_\n";
        }

%ENV

This is the hash of all environment variables that your perl program inherited from its parent process. The keys in this hash are the names of the environment variables; the values are, obviously, the actual values of those variables. Here is a simple perl emulator of the env command:

        while (($key, $value) = each %ENV) {
                print "$key=$value\n";
        }
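
Because each does not return the variables in any particular order, you may prefer a sorted listing. Here is a sketch of that, followed by a single lookup with a fallback value. MY_EDITOR_CHOICE is a hypothetical variable name used only for this illustration:

```perl
use strict;
use warnings;

# Sorted version of the env emulator
foreach my $key (sort keys %ENV) {
    print "$key=$ENV{$key}\n";
}

# Look up one variable, falling back to a default when it is not set.
# MY_EDITOR_CHOICE is a made-up name, not a standard variable.
my $editor = exists $ENV{MY_EDITOR_CHOICE} ? $ENV{MY_EDITOR_CHOICE} : 'vi';
print "Would use editor: $editor\n";
```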

@INC

This is the include path for perl libraries. Its default value is set at the time perl itself was compiled from source. Here is a simple example that prints out your perl library include path:

        foreach (@INC) {
                print "$_\n";
        }

Here is a short snippet to find out where any of the standard perl modules is located. We take ``CPAN.pm'' as an example:

        $file = 'CPAN.pm';
        foreach (@INC) {
                print "Found $file under $_/$file\n" if ( -f "$_/$file");
        }

You can override this variable within your program. When you tell perl to use/require some libraries (e.g. IO::File), perl will search all the directories in @INC for the module/library. This is very useful for using perl libraries which you cannot install in your perl installation. You install them under a directory of your own and add it to @INC using the use lib pragma.

For example, if you have installed the latest cool whiz-bang version of Foo::Bar under your $HOME/lib directory, here is what you would do:

        use lib '/my/home/dir/lib';
        use Foo::Bar;
        {
                #...whatever...
        }
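
To convince yourself that use lib really changes the search path, you can print @INC afterwards. This is only a sketch: /my/home/dir/lib is the same placeholder directory as above and does not need to exist for the demonstration.

```perl
use strict;
use warnings;

# Placeholder directory, as in the text; it need not exist
use lib '/my/home/dir/lib';

# use lib puts its directories at the front of @INC,
# so they are searched before the standard locations.
print "Search path now starts with: $INC[0]\n";

my $found = grep { $_ eq '/my/home/dir/lib' } @INC;
print "Directory is in \@INC\n" if $found;
```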

$_ = default input and pattern search space

Whenever you use some perl construct which expects/returns a value and get away without an error/warning from the -w switch, it means that perl managed to understand what you wanted, and stored it, or retrieved it from somewhere. $_ is the most common place for cases when the expected/returned value is a scalar.

Example:

        while ( <FH> ) {
                split;
        }

In the above example, which is a standard perl idiom, the <FH> operator returns a single line of input from the file pointed to by FH. In the absence of an assignment within the while loop conditional, the line is automatically placed in $_. The next statement, split, takes up to 3 arguments: the character/expression to split on, the value to split, and the maximum number of values into which it should be split. In the absence of these arguments, split breaks up $_ on whitespace. This makes the code above look much better than the one below, which doesn't assume many defaults:

        while ( defined($var = <FH>) ) {
                @_ = split " ", $var;
        }
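
Many other constructs fall back on $_ in the same way. The sketch below, using made-up host names, relies on $_ for a foreach loop, a pattern match, the length function and print:

```perl
use strict;
use warnings;

my @hosts = ('www', 'mail', 'ftp', 'news-feed');
my @short;

foreach (@hosts) {                      # each element becomes $_ in turn
    next unless /^[mn]/;                # the match is applied to $_
    push @short, $_ if length() <= 4;   # length() measures $_
    print;                              # print outputs $_
    print "\n";
}
# The loop prints `mail' and `news-feed'; @short ends up holding `mail'.
```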

@_ = default parameter list for subs, default destination of 'split'

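As explained in the example above, the default destination of split is @_. (This applies to older versions of perl only: as of perl 5.12, split no longer writes to @_ when its result is not assigned, so assign the result explicitly in new code.) In the context of a subroutine call, @_ contains all the arguments to the subroutine. Note that perl subroutines can have a variable number of arguments on each invocation. @_ will automatically be sized accordingly. Since @_ is a global variable that perl localizes for each subroutine call, the old value of @_ is restored as soon as the subroutine call ends!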
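
Here is a minimal sketch of subroutines that take a variable number of arguments through @_. The names total and count_args are hypothetical helpers written for this illustration:

```perl
use strict;
use warnings;

# total() adds up however many arguments it is given
sub total {
    my $sum = 0;
    $sum += $_ foreach @_;      # @_ holds the arguments of this call
    return $sum;
}

# count_args() reports how many arguments it received
sub count_args {
    return scalar @_;
}

my $none  = total();
my $six   = total(1, 2, 3);
my $count = count_args('a', 'b', 'c', 'd');

print "$none $six $count\n";    # prints `0 6 4'
```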

$. $/ $\ : File I/O counter, record separators

When you use the < > operator to read data from a file, perl automatically stores the current line number in the file in a variable named $.. How does perl know where a line ends and the next one begins? Well, that is what the record separator variable, $/, is for! As with most perl predefined variables, this takes on a default value. Here is a way to read in a whole file to a single scalar:

        undef $/;
        open(INPUT, '/var/adm/messages') || die "/var/adm/messages: $!\n";
        $slurp = <INPUT>;
        close INPUT;

In the absence of an explicit assignment to $/, perl assumes that ``\n'' is the record separator between lines. If you undefine this variable, as above, you can read whole files at a time. (Setting it to the empty string '' instead selects paragraph mode, where records are separated by blank lines.)
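
The following sketch shows both variables side by side, using an in-memory file so that it does not depend on any log file on your system. It reads whole files by leaving $/ undefined, and it does so with local inside a block, so that the default record separator comes back automatically afterwards. All names are made up for illustration.

```perl
use strict;
use warnings;

my $data = "one\ntwo\nthree\n";         # stand-in for a file's contents

# Line at a time: $. holds the number of the line just read
open(my $in, '<', \$data) or die "can't open in-memory file: $!\n";
my $line_count = 0;
while ( <$in> ) {
        $line_count = $.;
}
close($in);

# Whole file at once: local leaves $/ undefined only inside the do block
open($in, '<', \$data) or die "can't open in-memory file: $!\n";
my $slurp = do { local $/; <$in> };
close($in);

print "Read $line_count lines, then ", length($slurp), " characters in one go\n";
```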

Similarly, every print statement will tack on the value of the builtin variable $\ to the end of what it prints. This variable is undefined by default, but if you want to, you can change this. See the -p and -l switches in perlrun for more usage information.

$0, $$ : program name, PID

When you invoke a perl program, two things happen. First, the parent process that invokes the program (usually, your shell interpreter) forks itself; the calling program takes on the role of the parent process, and the return value of its fork call is the child's PID. Second, the child process is made to run the perl interpreter on your program. The name of your program is available to it as $0 at runtime. Similarly, the PID of your program is available as $$ at runtime.

Type the following example into a test program, say, me.plx:

        #!/usr/bin/perl -w
        print "I am called as $0\n";
        print "My PID is      $$\n";

If you run it as, say, /your/home/me.plx, you will get something like this:

        I am called as /your/home/me.plx
        My PID is      2506

$! : O/S Error string or Errno

Perl allows you to interact with your O/S, mostly through system calls for which it provides a perl function of the same name as the system call. If a system call fails for any reason, perl arranges for the actual system error to be available in the $! variable. You may use this in two ways: if you use it as if it were a number, it will give you the actual errno. If you use it as if it were a string, it gives you the system error string. If you ever get an O/S error code, you can find out exactly what it means using perl. Here's an example:

        /usr/bin/perl -e '$! = 2; print $!'

Here is how this works: The assignment to $! sets $! as if it were a number. The print statement expects a list of scalars as arguments, and thus $! is retrieved in a string context, and hence yields the system error string. The above statement should print No such file or directory.
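
The two faces of $! can also be captured in separate variables, as in this small sketch. The variable names are made up, and the exact text of the message depends on your O/S and locale.

```perl
use strict;
use warnings;

$! = 2;                 # pretend a system call just failed with errno 2
my $errno  = $! + 0;    # numeric context: the error number
my $errstr = "$!";      # string context: the message for that number
print "errno $errno means: $errstr\n";
```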

More typically, you use it like this:

        open(FOO, '/some/file') or die "/some/file: $!\n";

        chdir('/for/bidden')    or die "Can't cd /for/bidden: $!\n";

        if ( !unlink('/read/only/dir/file')) {
                log_it_somewhere();     #your own logging routine, for example
                die "Can't delete /read/only/dir/file: $!\n";
        }

$?, $@ - Errors from child/pipe/eval

When you call a program from within perl, usually using the back-ticks `` or qx{} or the system function, perl arranges for the status returned by the command to be available as $?. $? also gets set if the last pipe returned a bad status. The actual value in $? is a combination of the exit status of the command, the signal it received (if at all) when it died, and whether the program dumped core while dying. If, however, your program hits an error (for example, a call to die) within an eval statement in perl, the variable that is set is $@.

Example:

        `/etc/nowhere/hostname`;
        $error  = $? >> 8;
        $signal = $? & 127;
        $core   = $? & 128;

        print "Exit status of child: $error\n";
        print "Caught signal $signal\n" if $signal;
        print "No core dumps\n"     unless $core;

        prints something like this (the exact exit status depends on your O/S and shell):

        Exit status of child: 1
        No core dumps
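
For a run whose result you can predict, here is a sketch that launches perl itself as the child (its path is available in $^X) and then shows $@ being set by a failing eval. The exit status 3 and the error message are arbitrary values picked for illustration.

```perl
use strict;
use warnings;

# $^X is the path of the running perl, so this child command always exists
system($^X, '-e', 'exit 3');
my $exit_status = $? >> 8;      # high byte: the child's exit status
my $signal      = $? & 127;     # low 7 bits: signal that killed it, if any
print "Child exited with status $exit_status, signal $signal\n";

# Errors raised inside eval do not kill the program; the message goes to $@
my $ok = eval { die "something broke\n"; 1 };
my $eval_error = $@;
print "eval failed: $eval_error" unless $ok;
```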

$<, $>, $(, $) : real, effective uid/gid

When you run any program, you typically run it as yourself, which is really a uid and gid. However, you may run programs that are setuid to some fixed user id, or setgid to some fixed group. In such instances, the program runs under the effective uid/gid of the setuid/setgid program even though you have your own real uid/gid. Here is an example:

        perl -e 'print "UID = $<\n", "Effective UID = $>\n";'

$( and $) variables have slightly different semantics because you can belong to multiple groups. Both of these variables return your primary group and a space separated list of all the groups to which you belong. See perlvar for more information.
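
A short sketch of picking these group lists apart follows. The variable names are hypothetical, and the output naturally differs from one user to the next.

```perl
use strict;
use warnings;

# $( holds the real gid followed by the ids of the groups you belong to
my ($real_gid, @all_groups) = split ' ', $(;
my ($effective_gid)         = split ' ', $);

print "Real GID = $real_gid, effective GID = $effective_gid\n";
print "Group list: @all_groups\n";
```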

Exercises

Write a program that tries to read a non-existent file, say, '/foobar/blech'. Your program should print out the system error code and die.
How do you automatically print a newline with every print statement?
Try to print the following builtin variables. Check with the perlvar man-page to find out more: $], $^O and $^X.

Perl Operators, Precedence

To do anything useful with your data, you will need to operate on it. Perl provides the standard crop of operators and more. Here is a rundown of some of them. For more details, read perlop.

Mathematical Operators

Most of the mathematical operators are available within the standard perl interpreter. The following table summarizes some standard operators:

        +       Numeric addition
        -       Subtraction
        *       Multiplication
        /       Division (floating point)
        %       Modulus operator

All of the above operators can also be used in conjunction with the assignment operator to shorten your code. Here are a few examples you can try:

        $total_size  = $total_size + $size;
        $total_size += $size;   #gives same result as above

        $usage_pct  = 100.0*($disk_capacity - $disk_free)/$disk_capacity;

        $seconds_since_midnight = time() % 86400;       #relative to GMT!

        $free_space_left = $current_free - $file_size;
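
If you want to try the operators with fixed numbers, here is a minimal sketch. The file sizes are hypothetical values.

```perl
use strict;
use warnings;

my $quotient  = 7 / 2;          # division keeps the fraction
my $whole     = int(7 / 2);     # int() drops the fraction
my $remainder = 7 % 2;          # modulus

my $total_size = 0;
$total_size += $_ foreach (512, 1024, 2048);    # hypothetical file sizes

print "$quotient $whole $remainder $total_size\n";
```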

Numeric comparison

To compare two numeric values, you use the numeric comparison operators in perl. These are very similar to those in C. Here are these operators, without much explanation. Try them out.

        +++++++++++++++++++++++++++++++++++++++++++++++++++
        Operator     Return Value
        +++++++++++++++++++++++++++++++++++++++++++++++++++

        ==           true if left and right side are numerically equal
        !=           true unless left side is equal to right side

        <            true if left side is less than right side
        >            true if left side is greater than right side
        <=>          returns -1 if left is less than right side (numerically)
                     returns +1 if left is greater than right side (numerically)
                     returns 0  if left is equal
                     (useful for numeric sorting)

Examples:

        if ( 2 == 2 ) {
                print "Yes, 2 == 2. What else did you expect?\n";
        }

String Comparison

To compare two strings, perl provides a different set of operators. The behavior of these operators mirrors that of their numeric equivalents. The string comparisons are done in a fashion very similar to the strcmp C library function.

        eq, ne       equality tests for strings (similar to == and !=)
        lt, gt       ordering tests for strings (similar to < and > )
        cmp          similar to <=>, for strings

Here are some examples:

        if ( 'Anakin' lt 'Darth_Vader' ) {
                print "Dark side looks bigger!\n";
        }

        print "Which file do you want to change? ";
        chomp($file = <STDIN>);
        if ( $file eq '/etc/passwd' ) {
                print "Turned to the dark side, did you?\n";
        }
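
The difference between the two families of operators shows up clearly when sorting. This sketch sorts the same made-up list once with <=> and once with cmp:

```perl
use strict;
use warnings;

my @sizes = (10, 9, 100);

my @as_numbers = sort { $a <=> $b } @sizes;     # numeric comparison
my @as_strings = sort { $a cmp $b } @sizes;     # string comparison

print "Numeric order: @as_numbers\n";
print "String order:  @as_strings\n";
```

Compared as strings, ``9'' sorts after ``100'' because the comparison goes character by character.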

Logical Operators

The standard crop of logical operators is available in perl too. Logical operators return true or false. However, the meaning of true and false is different in perl than in other languages, because perl considers strings and numbers to be the same data-type: Scalar. Here is a quick overview of truth as it applies to perl scalars:

The empty string ``'' is false. The string ``0'' is false. Any number that evaluates to 0 is false. Any undefined value is false. All else is true. Sometimes, this is surprising:

        1       print "Yes\n" if ( "0.0" == '');        #"0.0" evaluates to 0
        2       print "What?\n" if ( "0.0" );           #string "0.0" evaluates to true!

In line 1, we see that the string ``0.0'' is converted to 0 in the numeric context of the == operator. The empty string on the right side is similarly converted into false/0. However, in line 2, the string ``0.0'' evaluates to TRUE according to the rules. Thus, the statement does print out something.

The perl logical operators are &&, || and !. The logical and and or operators are short circuit operators, like C. This means that the second operand is evaluated only when it's necessary. Here are some examples:

        $home = $ENV{HOME} || (getpwuid($<))[7] || die "No home directory!\n";

        print "Your machine is wide open!\n" 
                if ( $< && -r "/etc/shadow");
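
To check the truth rules for yourself, you could run a sketch like the following. The helper name truth and the labels are made up for illustration.

```perl
use strict;
use warnings;

# Returns the word perl would pick for a value used as a condition
sub truth { return $_[0] ? 'true' : 'false' }

my %verdict = (
        'empty string' => truth(''),
        'string 0'     => truth('0'),
        'string 0.0'   => truth('0.0'),
        'string 00'    => truth('00'),
        'number 0.0'   => truth(0.0),
        'undef'        => truth(undef),
);
print "$_ is $verdict{$_}\n" foreach sort keys %verdict;
```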

Exercises

The -e file-test operator checks if a file exists (-f checks that it is a plain file). The -s file-test operator returns the size of any file. Write a program that asks a user for a filename and returns the size of the file, if found! If the file is not found, you should return the system error message.
Write a program that takes exactly one argument and stores it in a variable called $infile. You can assume that the input will be a text file. Now, the program should open the file for reading, and count the number of lines. The program should print the filename and its line count at the end.
What if your program above were given a binary file? Can you check whether the file is a text file programmatically and die if it is not a text file? See the -T operator in the perlfunc man-page.

Binding operators

When you need to match a string with a pattern or make changes to it using a regular expression match and replace, you use the binding operator, =~. To negate the logical sense of a match, you use the !~ operator. Here are some examples:

        $host = 'samba.org.au';
        if ( $host =~ /\./ ) {
                print "$host seems to be fully qualified!\n";

                if ($host !~ /\.(com|org|edu|mil|gov|net)$/ ) {
                        $country = $host;
                        $country =~ s#.*\.##;   #remove everything except the TLD marker
                        print "Its country of origin is: $country\n";
                }
        }
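
The same two binding operations can be wrapped in a small subroutine. This is only a sketch; last_label is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the last dot-separated label of a host name,
# or undef when the name contains no dot at all
sub last_label {
        my ($host) = @_;
        return undef unless $host =~ /\./;      # bind the match to $host
        (my $label = $host) =~ s/.*\.//;        # copy, then edit the copy
        return $label;
}

foreach my $host ('samba.org.au', 'localhost') {
        my $label = last_label($host);
        print "$host: ", defined $label ? $label : 'not fully qualified', "\n";
}
```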

String operations

There are two operations that you do on numbers that have analogues for strings. You may want to concatenate strings together, like adding numbers. Or you may want to repeat the same string multiple times, like multiplying. There are string operators just for such needs. Here are the operators, by example:

        $config_file = 'resolv.conf';
        $file = '/etc/'  .  $config_file;

        $recurse = 5;
        $GNU = 'GNU' . (' Not Unix' x $recurse);
        print "GNU expands recursively to: $GNU!\n";

Where do you use the 'x' operator? Well, here is a simple way to generate an attention grabbing notice:

        use Sys::Hostname;
        $stars = '*' x 79; 
        $host = hostname();
        $wall  = "ATTN: Machine $host going down. Please logoff NOW!";
        print "$stars\n$wall\n$stars\n\n";

Some new logical operators

In addition to && and || for logical operations, perl provides and and or. These behave identically to the && and || except that they have very low precedence. Precedence determines the order of evaluation within a single statement. Here is an example where not knowing the precedence might bite you (in fact, the perl and/or operators were designed just so that people don't make this mistake). Perl allows you to call functions without using parentheses around the arguments. If you need to open a file, here is how you'd do it with parentheses around the arguments, without checking the return values:

        open(FOO, '/etc/passwd');

This can also be written conveniently as:

        open FOO, '/etc/passwd';

These two function calls work exactly the same way. Now, if you need to add some error checking of the return value of the open call, you would do something like this:

        open(FOO, 'bar') || die "bar: $!\n";

Unfortunately, the equivalent

        open FOO, 'bar'  || die "bar: $!\n";

never works as intended. Why? Well, this function call is exactly the same as:

        open(FOO, 'bar' || die "bar: $!\n");

This is definitely not what we want. Remember that the string 'bar' is always true. Thus, the die statement can never be executed! The unintended result is that if the open really bombs out, you would never catch it! This is a situation where the or comes to the rescue:

        open FOO, 'bar' or die "bar: $!\n";

This is clearer to the eye, and also works right.
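
You can watch the difference without killing your program by wrapping each form in an eval. In this sketch, the path in $missing is assumed not to exist on your system, and the messages are made up.

```perl
use strict;
use warnings;

my $missing = '/no/such/dir/file';      # assumed not to exist

# With ||, die is glued to the file name, which is true, so it never runs
my $with_pipes = eval { open FOO, $missing || die "never reached\n"; 'open failed silently' };

# With or, die is tied to the result of open itself
my $with_or  = eval { open FOO, $missing or die "caught: open failed\n"; 'not reached' };
my $or_error = $@;

print "||: $with_pipes\n";
print "or: $or_error";
```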

Basic I/O, Filehandles and other file operations

Perl provides lots of high level operations for file manipulation that would take quite a lot of work to do in other languages. Perl provides an abstraction called Filehandle to refer to open files in your program. This is very much like the file pointer in C. Perl provides a single function called open that allows you to access almost any data source with an amazingly simple and familiar syntax.

Standard Filehandles

Following the Unix convention, perl provides three default Filehandles that are direct analogues to C: STDIN, STDOUT and STDERR. In the absence of an explicit Filehandle, the magical diamond operator `<>' automatically reads from STDIN. In the absence of an explicit Filehandle your print statements automatically print to STDOUT (You override this by using the select function call in perl). Some perl functions (namely warn and die) will print automatically to STDERR with no need for a Filehandle argument (pun intended). You can close the standard file handles if needed (say, a daemon process) or redirect them within perl. Here are some examples where these Filehandles figure, even though you don't see them:

        $next_line = <>;

        print "This prints to your standard output!\n";

        warn("No more disk space!\n") unless ($free_space > $file_size);

        unlink("/") or die "Can't do that!\n";

        die("Please run manually!\n") unless ( -t STDIN );

The first example implicitly uses STDIN (if your program did not have any arguments). The next example shows the standard usage of the print statement. warn and die will automatically write to STDERR. In the last example, we use a function -t that operates on a file handle and returns true if it's a terminal.

Opening and Closing files

Perl's open function wears many hats. Depending on the arguments you supply to it, you can open just about any file, in any mode, without having to specify all the excruciating details and without looking at the manuals for the right usage every time. Under traditional usage, open accepts two arguments: the Filehandle and the name of the file. But the name of the file can include information about what mode you want it opened with, as well as standard shell piping and redirection characters. This gives you enough flexibility to pretty much operate on anything under the O/S. If you forget to close a file after using it, perl closes it automatically when it exits. If you open the Filehandle again (for the same file or a different file altogether), the previously opened file is closed automatically. That's not all! If you use the open call with a single argument, the name of the file is taken from the scalar variable with the same name as the Filehandle. Here are a few examples:

  open(PASSWD, '/etc/passwd');
  open(LOG, "> $logfile");
  open(RCMD, "rsh $host uname -a 2>&1 |");
  open(MAIL, "|/usr/lib/sendmail -oi -t");

The first example opens /etc/passwd for reading only. The second example opens the file name contained in $logfile for writing. In the third example, something even more interesting happens: The command uname -a is executed on a remote host whose name is contained in the variable $host, and its standard error AND standard output are available for you through the Filehandle RCMD! This obviates the need for intermediate files. Similarly, the last example opens a pipe to a sendmail process on the machine. By writing to the Filehandle MAIL in this example, you will actually be sending data to the sendmail process. When you close this Filehandle, you will have actually sent an email from perl!
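
Here is a sketch of reading from a pipe that you can run anywhere perl runs on Unix. Instead of rsh it starts perl itself (via $^X) as the command, and it uses the form of open with an explicit '-|' mode and a list of arguments, which behaves like a trailing ``|'' but bypasses the shell. The message text is made up.

```perl
use strict;
use warnings;

# Read the output of a command through a pipe. The command here is perl
# itself ($^X) so that the sketch does not depend on any other program.
open(my $cmd, '-|', $^X, '-e', 'print "hello from the child\n"')
        or die "can't start child: $!\n";
my $first_line = <$cmd>;
close($cmd) or warn "child exited with status ", $? >> 8, "\n";

print "Got: $first_line";
```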

Here is a program fragment that will print the whole file in which it is contained:

        open 0;
        print <0>;

This surprising fragment works as follows: the open is called with one argument, 0. The second argument is automatically set to $0 by perl. $0 is, as we saw earlier, the name of the program itself. Thus, you are opening the program file itself with this open statement! The Filehandle to this file is 0.

In the print statement, the output Filehandle is STDOUT (or the currently selected output Filehandle). If you remember the way the diamond operator works, it gets you the next line in a scalar context, and the entire file in a list context. The function print takes a list as argument and this presents a list context to <0>, which reads your entire program. See later for some more predefined Filehandles and how to use them.

Filehandles in variables

Filehandles can be stored in scalars also, using many of the standard perl modules available with the perl distribution. Here is a simple fragment that uses the perl module IO::File (see Module Basics for more explanation of modules, classes and objects in perl).

  0  #!/usr/bin/perl -w
  1  use IO::File;

  2  my $fh = new IO::File;

  3  $fh->open('/etc/resolv.conf');
  4  print STDOUT <$fh>;

  5  $fh->close;

In line 1, we express our intent to use the IO::File module in our program. In line 2, we initialize a variable $fh with an object constructed from the new method in IO::File. If this statement succeeds, we now have a generic IO::File object with which we can manipulate files. The advantage of a variable Filehandle object is that you can dictate its scope of usage and safely manipulate it without causing side-effects on the rest of the program. With the standard FILEHANDLE notation, you would usually create a Filehandle that has a global scope within your program.

Proceeding further, in line 3, we use the IO::File open method call to open a specific file. The arrow notation ( -> ) is used to access methods of an object or class. From now on, we can use $fh within the diamond operator to read from the file which was opened in line 3. Finally, after having printed the entire file to STDOUT (remember list context?), we close the Filehandle.
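
A variation with error checking might look like the sketch below. It passes the file name and mode straight to new, which opens the file and returns undef on failure. To stay self-contained it first writes a scratch file with File::Temp; the file's contents are made up.

```perl
use strict;
use warnings;
use IO::File;
use File::Temp qw(tempfile);

# Create a small scratch file to read back (removed when the program exits)
my ($tmp, $path) = tempfile(UNLINK => 1);
print $tmp "nameserver 10.0.15.1\n";
close($tmp);

# new() with a file name and mode opens the file; it returns undef on failure
my $fh = IO::File->new($path, 'r') or die "$path: $!\n";
my @lines = <$fh>;
$fh->close;

print "Read ", scalar(@lines), " line(s): $lines[0]";
```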

Special Filehandles

There are certain file handles that perl will make available for you without an explicit open. If you run a perl program with some arguments, perl removes all arguments it can understand, and makes the rest of them available to your program as @ARGV. Now, if your program doesn't use these arguments in any way, and you use the diamond operator (<>) for reading in data, perl will consider each of those arguments as files to be opened, open them in order, and supply their contents when you use the <> operator! Here is a simple example that emulates the Unix cat command in some ways:

        #!/usr/bin/perl -w
        while ( <> ) {
                print;
        }

When you call this program without any arguments, perl will use STDIN as the input file when you read data using <>. If you call this program with some filenames as arguments, perl will cycle through each of them and print their contents to STDOUT (remember that STDOUT is the default Filehandle for the print statement)! How does perl know when a file ends? You can use the eof operator to find out. How do you find out the currently opened Filehandle? Perl provides it in the Filehandle named ARGV. What is the name of the currently opened file? It is in the scalar $ARGV. Here is how you test this:

        #!/usr/bin/perl -w
        while ( <> ) {
                next unless eof;
                print "File is $ARGV\n";
        }

The next statement skips processing the current line unless it is the last line of the file (which makes the eof function return true!).
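
The sketch below uses $ARGV to count lines per file. So that it runs without command line arguments, it creates two scratch files with File::Temp and places their names in a local copy of @ARGV; in a real script the names would come from the command line.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Two scratch files standing in for files named on the command line
my @names;
foreach my $body ("a\nb\n", "c\n") {
        my ($fh, $name) = tempfile(UNLINK => 1);
        print $fh $body;
        close($fh);
        push @names, $name;
}

my %count;
{
        local @ARGV = @names;           # what <> will cycle through
        while ( <> ) {
                $count{$ARGV}++;        # $ARGV names the file being read
        }
}
print "$_: $count{$_} line(s)\n" foreach @names;
```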

There are occasions when your program needs some small amount of input that you'd rather have in a file, but you don't want the script to hard code the name of the file or you don't want to carry the file around with the program. The Filehandle DATA is what you need in such cases. Perl will read your program until it reaches the end of your program or the end of the file. If perl reads a line which says __END__ (without any other characters) it stops reading the program right there. Anything that follows is available to your program with the DATA Filehandle. Here is an example:

        #!/usr/bin/perl -w
        print <DATA>;
        __END__
        This line three erros.
        This line ends input.

The open and close on the above Filehandles happen automatically, so you don't need to do that explicitly.

Exercises

Open any system configuration file of your choice. Print the following information about this file: the name, the size in bytes, the number of lines.
Open a pipe to your local ps command. Print out the first 10 lines of this output.
Try opening a file for reading and try writing to it. What error does perl complain about?

Variables and String Interpolation

We know what kinds of variables are out there in perl. But there are rules for making legal variable names as well as rules governing how they are interpolated within strings.

Variable names

Variables can contain the following characters: [a-zA-Z0-9_]. That is, you can use letters, digits and underscores within variable names. The first character should not be a digit. As mentioned before, you can embed variable names within strings to avoid much hassle in building complex strings. You define a plain string using single/double quotes, as we have seen in the examples. Here are the actual rules for building strings in perl:

Single quotes never interpolate except for \' and \\

        $ss = 'He said, \'She said, "Shut Up!" \'... ';

Double quotes interpolate variables and special characters

        $tobe = "To be";
        $q    = "$tobe or not $tobe is the question!\n";

Back-quotes interpolate perl variables, even within single quotes

        qx{echo '$foo'};        #returns the value of perl's $foo variable, as echoed by the shell

Fancy quoting operators

In addition to the standard quoting characters, perl provides additional syntax to allow you to simplify creation of strings with embedded quotes. These are the q{}, qq{}, qx{} and qr{} operators. These operators are flexible in that you can use almost ANY punctuation character as the quoting character. For example, instead of the curly braces, you can use the # character as quoting character:

        $something =  q#Single quoted#;
        $nother    = qq#Not '$something'#;

Thus, quoting operators allow you to embed the normal quotes within your strings without needing to escape them with backslashes galore.

Examples:

        $crazy = 'Please don\'t use \'\' within this string';
        $ok    = q{Please don't use '' within this string};

        $foo = "<A HREF=\"mailto:$address\">Mail us</A>";

        is better written as:

        $foo = qq{<A HREF="mailto:$address">Mail us</A>};

        $ip_patt = qr{^\d+\.\d+\.\d+\.\d+};
        print "Yes!\n" if ( '127.0.0.1' =~ /$ip_patt/ );

Exercises

Which of the following variables are not legal: $1_abc, $abc_1, $abc-1?
What is the main difference between the q{} and qq{} operators?

System Interaction and perl shortcuts

Perl provides features for you to interact with the operating system. The most common constructs used in such cases are the system function and the qx or ` ` operator. However, there are also perl shortcuts for these, if you need to avoid using the O/S (as when you want to make scripts portable across different O/S-es). Here are some examples:

Hostname

        chomp( $hostname = qx{ hostname });
        print "Host = $hostname\n";

        use Sys::Hostname;      #need to run h2ph after install
        print "Host = ", hostname, "\n";

Remove or rename a file:

        system("rm $file"); 
        system("mv $file1 $file2");

However, this is better written as:

        unlink $file; 
        rename($file1, $file2) || die "can't rename: $!\n";

Daemonize

A daemon is different from normal programs: it should not have a controlling terminal, and it should be immune to signals that the launching shell/program is sent. If you close all standard Filehandles, the process will still have a controlling terminal. It will also inherit a working directory which you want to set to /. Here is one way to do it:

        use POSIX qw(setsid);
        close(STDIN); close(STDOUT); close(STDERR);
        chdir('/');
        fork && exit;
        setsid();
        #reopen STDIN, STDOUT etc. if needed..

The setsid call is imported from the POSIX module (may not be fully implemented in some O/S). setsid() will make the program its own process group leader. The program will also have no controlling terminal.

Standard library calls

Perl also provides direct analogues to the C standard library calls. This way, you don't need to program in C or invoke unix commands to get at data that you would very easily get through the C standard library:

localtime, ctime, gmtime, time

These functions allow you to get time related values. localtime() converts a time value, as returned by the time() call in perl, into a 9-element list containing the time attributes in your local timezone. However, in scalar context, it returns a string much like the unix date command. Here are examples:

        ($second, $minute, $hour, $month_day, $month, 
        $year, $weekday, $day_of_year, $isdst) = localtime( time );

        $date = scalar(localtime);

If you don't provide an argument, localtime will use the result of a time() call as an argument. Two important points about the list context version of localtime concern the month value and the year: the month value goes from 0 through 11! The year value may look like a two digit year, but NO, it is NOT a Y2K bug! The year value is the offset from the base year of 1900 (so the year 2000 yields 100). Thus, to get the full year, you would do something like:

        $full_year = (localtime)[5] + 1900;

And yes, perl IS Y2K compliant, as much as your O/S is, though perl programs may not be, depending on how you wrote them.
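
Putting the month and year adjustments together, here is a sketch of a date formatter. The helper name datestamp is hypothetical, and it uses gmtime instead of localtime so that the result does not depend on your timezone.

```perl
use strict;
use warnings;

# Hypothetical helper: formats an epoch time as YYYY-MM-DD. It uses gmtime
# rather than localtime so the result does not depend on your timezone.
sub datestamp {
        my ($epoch) = @_;
        my ($mday, $month, $year) = (gmtime($epoch))[3, 4, 5];
        return sprintf("%04d-%02d-%02d", $year + 1900, $month + 1, $mday);
}

print datestamp(0), "\n";       # the start of the Unix epoch
print datestamp(time), "\n";    # today, in GMT
```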

getpwnam, getpwent, getpwuid

These functions allow you to get the password file/NIS entries from within perl. You could get a value by specifying the key through getpwnam and getpwuid. Or you could cycle through the entire list using getpwent.

        $root_shell = (getpwuid(0))[8];         #field 8 is the login shell, 7 is the home directory
        print "Blech!\n" unless $root_shell =~ /bash/;

stat, lstat

These functions allow you to get at the file meta information. These have similar semantics to the unix system calls of the same name.

        use File::stat;
        $s = stat("/etc/passwd");
        print "/etc/passwd Last modified at: ", scalar(localtime $s->mtime), "\n";

unlink, rename, link, symlink

In spite of the above functions being available from within perl, most of us shell out from perl to do things like ``rm'', ``mv'' or ``ln''. In most cases, you don't have to. Here are some examples:

        unlink("/tmp/myfile") or die "cannot remove /tmp/myfile: $!\n";
        rename("/tmp/oldfile", "/tmp/newfile") or die;

chown $uid, $gid, @files;

Example:

        chown 0, 0, '/etc/passwd', '/etc/shadow';
        chmod 0600, '/etc/shadow';

directory operations: opendir, readdir

Here is an example: find all text files within current directory:

        opendir(DIR, '.') or die "can't open current directory: $!\n";
        while (defined($file = readdir(DIR)) ) {
                next unless -T $file;
                print "text file: $file\n";
        }
        closedir(DIR);
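
Keep in mind that readdir returns bare names (including ``.'' and ``..''), not full paths. The following sketch lists a scratch directory created with File::Temp; the file names in it are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory with two files in it (removed when the program exits)
my $dir = tempdir(CLEANUP => 1);
foreach my $name ('a.conf', 'b.conf') {
        open(my $out, '>', "$dir/$name") or die "$dir/$name: $!\n";
        print $out "# sample\n";
        close($out);
}

opendir(my $dh, $dir) or die "can't opendir $dir: $!\n";
my @entries = sort grep { $_ ne '.' && $_ ne '..' } readdir($dh);
closedir($dh);

print "Found: @entries\n";
```

When the directory is not the current one, prefix each name with the directory (as in "$dir/$name") before applying file tests to it.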

Regular Expressions - basic concepts

Regular expressions are powerful tools that match a pattern in a string value. Regular expressions allow us to extract parts of information that are most relevant to us within the input data, and also allow us to transform them into any other form we need. If you are familiar with the unix grep command, you have used regular expressions already. Perl's support for regular expressions is built into the core language, so it is fast and flexible. Regular expressions (regexes) are abstractions of general patterns you are looking for, so they can get a bit terse and hairy to read. Perl's regex syntax is however rich and supports extensions that allow you to write perfectly readable regexes.

Metacharacters

The following Metacharacters allow you to match different types and amount of text:

        .       match ANY character (except a newline)
        \s, \S  whitespace, non-whitespace
        \w, \W  word, non-word character (word = a-zA-Z_0-9)
        \d, \D  digit, non-digit
        ^, $    beginning/end of line
        *       match zero or more of preceding expression
        +       match one or more of preceding expression
        ?       match zero or once
        {n,m}   match from n to m repetitions of preceding expression
        ()      grouping
        []      character class (eg. a thru z is [a-z])
        |       alternation
        $1..99  matched groups

For exact descriptions see perlre. For now, we will explain a few of these Metacharacters with examples in the following sections.

Simplest regex is a plain string

The simplest regex is a plain string. If you use it to match something, it will succeed only if your input data contains the exact same string as the regex. However, within your pattern (regex) you can use Metacharacters to match huge amounts of data in a few characters of the regex. Here is a simple example of some entries in a logfile:

        Jun 14 22:06:31 indus.fell.com in.ftpd[492]: connect from 146.223.45.6
        Jul 13 12:30:07 indus.fell.com in.telnetd[570]: connect from 10.0.15.21

This is a log of telnet/ftp sessions initiated to the machine indus.fell.com, which happens to be a linux box. This is similar to most syslog entries you will encounter, in that you may want to extract different parts of this data for different purposes. Our aim in the following examples is to construct a regular expression that matches three things: the address of the client machine, the service on this server, and the PID of the process that serviced the request. Right now, we are interested only in telnet/ftp connections. We know that the daemons are in.ftpd and in.telnetd. Here is one way to find the client IP address in the second line.

        /connect from 10.0.15.21/

Unfortunately, this will only match connections originating from 10.0.15.21 (actually it will also match 1000115021, because an unescaped dot matches any character, but we'll see later how to change that). What if you want to match ANY IP address? This is where Metacharacters come to the rescue. The Metacharacter \d signifies a digit. The next regular expression will match any IP address:

        /connect from ([\d\.]+)/

The square brackets allow us to match a class of characters. In our case, this comprises a digit (\d) and a literal dot (.) character. The plus (+) following this character class asks the expression to match a digit or a literal dot one or more times. Unfortunately, our expression not only matches valid IP addresses but spurious values as well (example: 345.567.890111.11)! In our case, we are sure the logfile will not contain such bogus matches, but in a general case, we will have to specify the pattern to match as exactly as possible. You also see the entire IP address pattern enclosed within parentheses. Why?

Perl regex is non-regular: supports back-references

Regular expressions just match. However, in practice, you might want a global match out of which you need only a subset of characters for further processing. In such cases, back-references allow you to store parts of matches and retrieve them after a match. This is what makes perl regexes really powerful. Perl stores each submatch enclosed within parentheses () in internal variables named $1, $2, etc.

Back-references allow substitution and data reduction. In the above example of matching an IP address, the parenthesized sub-pattern contains the IP address when the whole pattern matches! Thus, here is one way to make a list of all unique IP addresses that connected to your machine:

        my($ip, %connections, $n);

        open(MESSAGES, '/var/log/secure') or die("can't open logfile: $!\n");
        while ( <MESSAGES> ) {
                next unless /in\.telnetd.+connect from ([\d\.]+)/;      #XXX
                $connections{ $1 }++;
        }
        close(MESSAGES);
        foreach $ip ( keys %connections) {
                printf("%-15s connected %5d times\n", $ip, $connections{$ip});
        }

In the line marked '#XXX' we do two things: we reject all lines that do not seem to have the string in.telnetd in them (The dot character, ., is a metacharacter that matches ANY character. To make a literal match for the ``.'' in in.telnetd we need to prefix it with a backslash to escape the character). Next, we store the IP address on matched lines. The very next line allows us to keep counts of connections keyed by IP address. The IP address is stored in the variable $1 at the end of a successful match, which we use as the key. At the end, we print formatted cumulative statistics. Here is the result of the program on a sample machine:

        146.225.32.42   connected     2 times
        10.0.15.2       connected    21 times
        10.0.15.254     connected     1 times
        10.0.15.3       connected     4 times

The printf statement allows us to format our output in a way very similar to the printf() standard library call in C.
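
The character class [\d\.]+ used above also accepts spurious values such as 345.567.890111.11. Here is a minimal sketch of a stricter check, assuming you only care about dotted quads whose four fields lie between 0 and 255. The subroutine name looks_like_ipv4 is made up for this illustration and is not part of perl:

```perl
use strict;
use warnings;

# Hypothetical helper: accept only four dot-separated fields of one to
# three digits each, and reject any field larger than 255.
sub looks_like_ipv4 {
        my ($text) = @_;

        # in list context the match returns the four parenthesized submatches
        my @octets = $text =~ /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/;
        return 0 unless @octets == 4;

        foreach my $octet ( @octets ) {
                return 0 if $octet > 255;
        }
        return 1;
}

foreach my $candidate ('10.0.15.21', '345.567.890111.11', '10.0.15.256') {
        printf("%-18s %s\n", $candidate,
                looks_like_ipv4($candidate) ? 'valid' : 'rejected');
}
```

For a logfile you trust, the looser pattern is usually good enough. The stricter form pays off when the input comes from users.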

Perl regex: tries all possibilities for match to SUCCEED

The important concept with perl regular expressions is that perl tries ALL possibilities for a match to succeed. This is done through back-tracking and bumping-along which is very similar to what we do when we solve a maze problem: if we hit a wall, we backtrack to the last place where we had a choice of paths. After we backtrack to this point, we abandon our failed path and continue along another. In our example, when we match the subexpression ``in\.telnetd'', perl does something like the following:

The first two characters of the hostname ``indus.fell.com'' match the first two characters of our pattern. However, the next literal character d does NOT match the literal ``.'' in our pattern \.! Now perl doesn't declare a failure at this point! It now tries to bump along to the next character in the target string (which happens to be 'n') and tries the pattern. It fails immediately since the character n does not match our subexpression's first character, i. This happens until it reaches the right place ``in.telnetd''. At this point the first subexpression in\.telnetd matches exactly. Now the regex match proceeds to conclusion because it does succeed for this line.
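
You can watch the result of this bumping-along yourself. The following sketch uses the second logfile line from earlier and the built-in @- array, whose first element holds the offset at which the last successful match started. The variable names are chosen only for this illustration:

```perl
use strict;
use warnings;

my $line = 'Jul 13 12:30:07 indus.fell.com in.telnetd[570]: connect from 10.0.15.21';

# offset of the very first "in" in the line (the start of "indus")
my $false_start = index($line, 'in');

# offset at which the regex finally matched
my $match_start;
if ( $line =~ /in\.telnetd/ ) {
        $match_start = $-[0];
}

print "first 'in' appears at offset $false_start\n";
print "the match starts at offset   $match_start\n";
```

The two offsets differ: the attempt at ``indus'' was abandoned and perl bumped along until the pattern fit.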

Match stops at the FIRST/earliest successful match

Perl will not attempt to find all matches in a string. It will stop at the very first match. In addition, even if the pattern will match multiple places, perl will match at the earliest point in the target string. Here is an example:

        Writing c-shell scripts is a sure way to go to hell!

If we try to match /hell/ in this example, it would NOT match the last word in the example. It will match right in the middle of ``c-shell'', because that is the earliest place where the match succeeds! This is an important issue that will help you avoid spurious matches. How do we match the word ``hell'' in the above example? The pattern /\bhell/ will do. This is because \b matches a word boundary: a zero-width position with a word character (\w) on one side and a non-word character (or the start or end of the string) on the other. Thus, the position between ``s'' and ``h'' in ``c-shell'' is not a word boundary, since both are word characters, and so the regex match algorithm will bump along until it finds hell :-)
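
Here is a small sketch that compares where the two patterns match in the example sentence. It relies on the built-in @- array (offset of the start of the last successful match). The variable names are only for illustration:

```perl
use strict;
use warnings;

my $text = 'Writing c-shell scripts is a sure way to go to hell!';

my ($plain, $bounded);
$plain   = $-[0] if $text =~ /hell/;    # earliest place the letters occur
$bounded = $-[0] if $text =~ /\bhell/;  # must start at a word boundary

print "/hell/    matched inside: ", substr($text, $plain - 3, 7), "\n";
print "/\\bhell/  matched at:     ", substr($text, $bounded), "\n";
```

The first pattern lands inside ``c-shell'', while the second one is pushed along to the last word.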

Matches can be GREEDY or non greedy: backtracking

When you specify a + to match multiple characters, perl will match as many characters as it can in the beginning. If later parts of the pattern cause the match to fail, perl will backtrack into the submatch by one character and retry the failed match from the same point. This is best described by an example string and pattern:

        STRING: All that is gold does not grow old
        PATTERN1: /old/
        PATTERN2: /.+old/

Pattern 1 will match the ``old'' within the word ``gold'' in the string. This follows from the explanation in the previous section. Pattern 2 will however match the sub-pattern ``old'' at the very last word! This is because the + character is greedy. Thus, .+ gobbles up the entire string at the beginning. The sub-pattern ``old'' now fails, so perl backtracks the .+ to contain all but the last character. This fails too. Perl backtracks again, and fails. The next backtracking places the start of match before the ``o'' in ``old''. This matches with the sub-pattern ``old'' and perl reports success. In this case, the ``old'' in the regex matches the last word.
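
To see what the greedy .+ actually swallowed, you can put it in parentheses. This is a sketch with variable names made up for the illustration:

```perl
use strict;
use warnings;

my $string = 'All that is gold does not grow old';

# PATTERN1: the earliest "old" wins, which is the one inside "gold"
my $first_old;
$first_old = $-[0] if $string =~ /old/;

# PATTERN2 with a capturing group around the greedy part
my $swallowed;
$swallowed = $1 if $string =~ /(.+)old/;

print "/old/ matched at offset $first_old\n";
print ".+ ended up holding: '$swallowed'\n";
```

The captured text runs right up to the last word, which shows that .+ gave back only as many characters as the trailing ``old'' needed.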

Results depend on context

As with other things, a regex match in perl returns different values depending on the context in which you match. Here are the general rules:

        scalar context returns true (1) if the match succeeds, false otherwise
        list context returns all matches within groups

When we introduce parentheses in our regex, perl groups the subtext that matched each parenthesized sub-expression and stores them in internal variables $1, $2 etc. This happens in both contexts. In a list context, all the parenthesized matches are additionally returned as a list. Here is an example:

        $_ = 'All that is gold does not grow old';

        print "SCALAR: $1\n" if /(.+)old/;
        @foo = /(old)(.+old)/;
        print "LIST: @foo\n";

prints:

        SCALAR: All that is gold does not grow
        LIST: old  does not grow old
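
A quick way to check what a match hands back is to assign it to a scalar and to a list. This is only a sketch; the variable names are invented for the illustration:

```perl
use strict;
use warnings;

my $string = 'All that is gold does not grow old';

# scalar assignment: we only learn whether the match succeeded
my $ok = $string =~ /(g\w+)/;

# list assignment: we receive the parenthesized submatches themselves
my ($word) = $string =~ /(g\w+)/;
my @both   = $string =~ /(g\w+).*(g\w+)/;

print "scalar: $ok\n";
print "list:   $word\n";
print "two groups: @both\n";
```

Note the parentheses around $word on the left-hand side. They are what put the match into list context.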

Regular Expressions - Basic Examples

Here are some basic examples that use some simple patterns to match various things you would commonly extract from input data:

Match a word: \w+

        if ( 'One word' =~ /\w+/ ) {
                print "Matched $&\n";
        }
        #Matched One

Match an integer: [-+]?\d+

        $_ = 'One value: +23.45';
        if ( /[-+]?\d+/ ) {
                print "Matched $&\n";
        }
        #"Matched +23"

Match a number that has 3 to 5 digits: \d{3,5}

        if ( 12345 =~ /^\d{3,5}$/ ) {
                print "Number within range\n";
        }

Match everything between foo and bar: greedy version

        $_ = 'brave fools embark on travel through bare desert';
        print $& if /foo.*bar/;

        #prints "fools embark on travel through bar"

Match everything between foo and bar: non greedy version

        $_ = 'brave fools embark on travel through bare desert';
        print $& if /foo.*?bar/;

        #prints "fools embar"

Match the host in /NFS server floozey not responding/:

        /NFS server\s+(\S+)\s+not responding/
        # hostname can be retrieved as $1 (if match succeeds)

Surprise 1: '*' matches ZERO or more!

With greedy quantifiers in previous subexpressions, a later '*' will match zero times and still report success:

        $_ = 'Has a long number 12437';

        if ( /(.*)(\d*)/ ) { print "String: $1, number: $2\n"; }
        #gives  "String: Has a long number 12437, number: "

Surprise 2:

Greediness, backtracking and 'first successful match' combine to produce non-intuitive results, if you're not careful.

        $_ = 'Has a long number 12437';
        if ( /(.*)(\d+)$/ ) { print "String: $1, number: $2\n"; }
        #gives  "String: Has a long number 1243, number: 7" !!

The above expression is better written as

         if ( /(.*?)(\d+)$/ ) { print "String: $1, number: $2\n"; }

Surprise 3:

        $_ = 'your food is in the bar under the barn';
        if ( /foo(.*)bar/ ) { print "matched: $1\n";}
        #gives "matched: d is in the bar under the"
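
If what you wanted in Surprise 3 was the text up to the FIRST ``bar'', a non greedy quantifier is one way to get it. Here is a sketch comparing the two forms, with variable names made up for the illustration:

```perl
use strict;
use warnings;

my $text = 'your food is in the bar under the barn';

my ($greedy) = $text =~ /foo(.*)bar/;   # runs to the last "bar" (in "barn")
my ($lazy)   = $text =~ /foo(.*?)bar/;  # stops at the first "bar"

print "greedy: '$greedy'\n";
print "lazy:   '$lazy'\n";
```

Both matches still start inside ``food'', because the earliest match always wins. If that is not what you want, add \b in front of foo.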

Exercises

What is the simplest kind of regex? How do you match a word in perl?
Using your word matching regex from the above exercise, make a program that counts the number of words from a file (the filename is the first and only argument to your program). Using length($_) to compute the number of characters per line, add up the total number of characters. Also add your line counting code from a previous exercise and make this program print the number of characters, words AND lines.

Regular Expressions - More details

Here is the complete specification for a perl regex match operation:

        m/expr/gsimox;

You can choose to leave out the m (which stands for match, by the way) and just use /pattern/ which is what you normally do. However, perl allows you to use ANY character as the pattern delimiter, and allows you to write the regex in a more readable manner. Here are some regexes, all of which match the same pattern: finding the directory name of a file.

        1.   /(\/[^\s]+)\/[^\/\s]+/;

        2.  m,(/[^\s]+)/[^/\s]+,;

        3.  m{
                        (/[^\s]+)       #a slash followed by any non space character
                        /               #start of filename part
                        [^/\s]+         #a filename (assume no spaces in the filename)
                }x;

As we see from regex 1, match patterns can be very hairy. The reason why we had all those leaning toothpicks (\/) was due to the fact that the pattern was delimited by a /. In such cases, if you want to match a literal forward slash, you need to quote/escape it with the \ character. Regex 2 is clearer because it now uses comma characters to delimit the pattern. Thus, you don't have to quote the /. Even after this substantial improvement in readability, the pattern looks difficult. Regex 3 is probably the easiest for humans to parse. We don't offer any explanation, as it is self-evident. See below for more details on the /x modifier. With such powerful constructs perl allows you to match almost any type of pattern (nested patterns are one exception).
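
To convince yourself that the three spellings really are the same pattern, you can run them against one input. This is a sketch; the sample text and variable names are made up:

```perl
use strict;
use warnings;

my $text = 'please edit /usr/local/bin/perl today';

# the same pattern written with three different delimiters
my ($dir1) = $text =~ /(\/[^\s]+)\/[^\/\s]+/;
my ($dir2) = $text =~ m,(/[^\s]+)/[^/\s]+,;
my ($dir3) = $text =~ m{
                (/[^\s]+)       # directory part
                /               # start of filename part
                [^/\s]+         # the filename
        }x;

print "1: $dir1\n2: $dir2\n3: $dir3\n";
```

All three print the same directory name. The greedy [^\s]+ first grabs the whole path and then backtracks to the last slash.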

However, a match is not the only reason to use a regex. Once you perform a match, you can actually substitute whatever you matched, with anything else you may want to change it to. Here is the spec for the regex Substitution operator:

        s{expr}{replacement}egsimox;

The modifiers {e,g,i,m,o,s,x} specify different ways in which the match can be directed. The one additional modifier you see is the /e modifier. Here are examples that illustrate some of them:

i: Case insensitive

        $_ = "The path to my magic scripting language is /usr/bin/awk\n";

        s{/(awk|sed|sh|csh|bash|ed)\b}{/perl}i;

        print;

This prints ``The path to my magic scripting language is /usr/bin/perl''.

o: optimize (variables interpolated only ONCE)

        $val = 'something';
        $new = 'somthinels';
        while ( <> ) {
                print if s/$val/$new/o;
        }

x: use extended regular expressions (allow comments!)

Perl version 5 introduced the ability to include arbitrary comments within a regex by specifying the x modifier. This allows you to write crystal clear regexes that you would otherwise have a hard time understanding on second glance. We have seen this in an example above. Here is another, more hairy example:

        /^\w+\s+\d+\s+[\d:]+\s+.+?(in\.\w+)\[\d+\]:\s+connect\s+from\s+([\d\.]+)$/

Better written as:

        m{
                ^\w+\s+\d+      #Date: month and day
                \s+
                [\d:]+          #Time
                \s+.+?          #ignore junk

                (in\.\w+)       #get the service daemon that was connected to

                \[\d+\]         #the PID within []

                :\s+connect\s+from\s+

                ([\d\.]+)$      #the originating client IP..
        }x;

The clarity that you get with the /x modifier is well worth the effort of increasing your LOC.

$`, $&, $' = pre match, entire match and post-match strings

Example:

        if ( 'Pre match Post' =~ /\s+match\s+/ ) {
                print "Pre  match: $`\n";
                print "Match     : $&\n";
                print "Post match: $'\n";
        }

e: evaluate the replacement as a PERL expression!

The /e modifier allows you to substitute a matched pattern with the results of perl code within the substitution string! This is very powerful. Here is a simple example:

        $_ = '2 candies at 35 cents = ';
        s{
                (\d+)\D+(\d+)   #get numbers
                .*$
        }{
                $& .    #append to end
                ($1 * $2) . ' cents'
        }ex;
        print;  #prints "2 candies at 35 cents = 70 cents";

Here is another example: if you want to change the IP address of a host, and you have a table of the new IP addresses for each old IP, here is a simple way to change it:

        %new_ip = ( '10.0.0.1' => '10.1.1.1', '192.168.100.2' => '172.16.45.2');

        @old = ('10.0.0.1', '10.3.14.3', '192.168.100.2', '192.168.100.3');
        @new = @old;

        foreach ( @new ) {
                s/([\d\.]+)/$new_ip{$1} ? $new_ip{$1} . ' <--- ' : $1 /e;
        }
        print join("\n", @new), "\n";

This code snippet prints:

        10.1.1.1 <---
        10.3.14.3
        172.16.45.2 <---
        192.168.100.3

We have crafted the replacement to add the ``<---'' for clarity. This lets you clearly see where the changes have taken place in our example.

This brief introduction to regular expressions should help you craft simple regular expressions. For more details, consult perlre or the regular expressions book listed in Further reading.

Exercises

What do the /i and /o modifiers do? Why would you use the /x modifier?
Write a regular expression for checking two consecutive vowels.
Write a regular expression for substituting all numbers by the number within parentheses. Thus ``This is 1 line with 2 numbers'' should get converted to ``This is (1) line with (2) numbers''.

Subroutines, Variable Scoping

Perl allows you to write free form code, just like any other language. However, if you write large programs, or programs that behave in a variety of ways, you would like to bunch similar tasks together, and also re-use same code fragments over and over again. Perl subroutines are designed for this type of abstraction.

Subroutines are perl's way of dividing a problem into manageable chunks. They are analogous to functions in C. You declare a subroutine in perl as follows:

        sub my_sub_name {
                my(@arguments) = @_;
                #statements;
        }

Subroutines in perl are different from equivalent concepts in other languages in two important aspects: subroutines in perl have variable number of arguments, and the arguments are NOT named. Subroutines in perl can return anything they want (scalar, list or nothing).

return value

If a subroutine does not explicitly return a value, and the calling statement/expression uses the subroutine in a context requiring a return value, the subroutine's LAST evaluated expression becomes the return value. Here is an example:

        sub sum_two_numbers {
                $_[0] + $_[1];
        }

All parameters passed to the subroutine are passed automatically through the @_ variable. However, these parameters are not copied into the subroutine's stack. Instead, any modifications to these values directly affect the original values in the calling expression's name-space.

To prevent this, and to get local copies of the parameters, declare them using 'my':

        sub sum_two {
                my($arg1, $arg2) = @_;
                return $arg1 + $arg2;
        }
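
The difference between working on @_ directly and working on a copy can be seen in a small sketch. The subroutine names here are made up for the illustration:

```perl
use strict;
use warnings;

# modifies the caller's variable, because $_[0] is an alias for it
sub double_in_place {
        $_[0] *= 2;
        return;
}

# works on a private copy, so the caller's variable is left alone
sub double_copy {
        my($n) = @_;
        $n *= 2;
        return $n;
}

my $x = 21;
double_in_place($x);

my $y = 5;
my $doubled = double_copy($y);

print "x = $x, y = $y, doubled = $doubled\n";
```

After the calls, $x has changed but $y has not.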

Scoping: dynamic/lexical

Variables are global (package) variables by default unless you declare them as lexical. Global variables are accessible to the entire program/package. The local operator gives a global variable a temporary value (dynamic scoping) that lasts until the enclosing block exits and is visible to every subroutine called from that block. Subroutines may overwrite global variables, causing values to change in unpredictable ways. Typically, global values have non-intuitive consequences if you use them all over the program. Data is not protected when you use global or dynamically scoped variables.

Lexical scoping (my): increases data privacy. When you declare a variable using the my scoping operator, it creates a new variable and grants it the scope of the closest enclosing block. No code outside that block can access these values unless they are passed as arguments or returned. There are a few variables where local is unavoidable. In fact, the entire Module export mechanism in perl is built on clever use of package (global) variables.

Using my is almost always better than local.
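
Here is a sketch of how the two operators differ when another subroutine gets called. All names are invented for the illustration; our is used only so that the global variable is accepted under use strict:

```perl
use strict;
use warnings;

our $level = 'global';          # a package (global) variable

sub show {
        return $level;          # always reads the package variable
}

sub with_local {
        local $level = 'local'; # temporary value, seen by subs we call
        return show();
}

sub with_my {
        my $level = 'lexical';  # brand new variable, private to this block
        return show();          # show() cannot see it
}

my $from_local = with_local();
my $from_my    = with_my();

print "with_local saw: $from_local\n";
print "with_my saw:    $from_my\n";
print "afterwards:     $level\n";
```

The value set with local leaks into show(), while the my variable stays private. Once with_local returns, the global has its old value back.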

Example: get network number from IP

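        sub get_net {
                my($ip) = shift;
                # avoid the names $a and $b, which sort uses for its comparisons
                my($o1, $o2, $o3, $o4) = split /\./, $ip;
                return "$o1.0.0.0"      if $o1 < 128;   # class A
                return "$o1.$o2.0.0"    if $o1 < 192;   # class B
                return "$o1.$o2.$o3.0";                 # class C
        }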

Exercises

Write a subroutine to split its arguments into words and return all the words.
Write a subroutine that checks if its argument looks like a fully qualified hostname or IP address. It should return 1 if it's a name, 2 if it's an IP address, and 0 otherwise.
Write a program that uses the ls output in current directory and prints out the names of all files that are greater than 10000 bytes. Test it with /tmp and /var/tmp directories.

Standard Perl Modules

Read/write files with IO::File:

        use IO::File;
        my $fh = new IO::File;
        $fh->open('/var/adm/messages') || die "/var/adm/messages: $!\n";
        while ( <$fh> ) {
                next unless /SYSERR/;
                print "Mailer error: $_";       # $_ holds the matching log line
        }
        $fh->close;

Do all sorts of operations on multiple files using File::Find:

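        use File::Find;

        $MAXSIZE = 1000000;     # bytes
        $MAXAGE  = 5;           # days since last modification
        sub wanted {
                return unless -f $_;    # only consider plain files
                return unless ( -s $_ > $MAXSIZE || -M $_ > $MAXAGE );
                print "Purging file: $File::Find::name\n";
                unlink $_;
        }
        find( \&wanted, "/var/tmp", "/usr/local/tmp");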

Play with file attributes using File::stat:

        use File::stat;
        $s = stat('/my/file');
        print "File size = ", $s->size, "\n";
        print "Inode     = ", $s->ino, "\n";

white hatters can saturate networks with Net::Ping

        use Net::Ping;
        $p = new Net::Ping;     #some perl versions need root access

        if ( ! $p->ping("vtc.teamtaos.com", 2)) {
                print "Taos vtc is down!\n";
        }

Some `Non Standard' perl Modules

DNS lookups (a, mx, ptr etc.) with Net::DNS

    use Net::DNS::Resolver;
    $res = new Net::DNS::Resolver;
    $query = $res->search('vtc.teamtaos.com');
    if ($query) {
        foreach $rr ($query->answer) {
            next unless $rr->type eq "A";
            print $rr->address, "\n";
        }
    }

        prints "207.33.46.3" [as of Sun Jul 18 23:19:52 PDT 1999]

Automate FTP stuff with Net::FTP

           use Net::FTP;

           $ftp = Net::FTP->new("ftp.cdrom.com");
           $ftp->login("anonymous","me\@taos.com");
           $ftp->cwd("/pub/perl/CPAN");
           $ftp->get("README.html");
           $ftp->quit;

Web client programming with LWP

        use LWP::Simple;
        $content = get('http://www.linux.org') || '';

Extract MIME encoded documents (word docs eg.) in unix:

        use MIME::Base64 qw/decode_base64/;
        $doc = '...';   #get_the_mime_encoded_part
        $realdoc = decode_base64($doc);
        print SOME_MSDOC_FH $realdoc;

Wrap text using Text::Wrap

        use Text::Wrap qw(fill $columns);
        $Text::Wrap::tabstop = 4;       # set through the package name ($tabstop is not in the export list)
        $columns = 72;
        print fill("\t", "", `cat /tmp/dead.letter`);

Send mail using Mail::Mailer

        use  Mail::Mailer qw(sendmail);

        $mailer = new Mail::Mailer;
        
        my %headers = ( 'To' => 'me@taos.com', 'From' => 'me@taos.com');
        $headers{'Subject'}  = "testing";

        $mailer->open(\%headers);
        print $mailer "This is a test\n\n";
        $mailer->close;

TMTOWTDI: Sort IP-s by subnet

To illustrate the fact that there's more than one way to do it in perl, we will take a very simple example: given some IP addresses, sort them by network and host number. The approaches described here are not the only ones. They were chosen for their gradation in complexity of algorithm design and for how easy it is to grow your algorithms as you go.

Let us take the following list as a test example to be sorted:

        @ip = ('223.1.3.4', '127.0.0.1', '192.168.100.1', '223.1.3.1');

The sorted output should look like:

        127.0.0.1
        192.168.100.1
        223.1.3.1
        223.1.3.4

The perl sort function accepts an optional subroutine reference or subroutine name as argument, which it uses every time it needs to compare any two elements of the input array/list. The subroutine may be anything you like, except that it should assume the following: the comparison keys are available to your subroutine as the global variables $a and $b!

Using numeric sorting:

This method uses the standard split command to extract the individual numbers comprising the IP address. It then compares the respective bytes numerically. The short-circuit nature of the or operator ensures that the comparison terminates at the very first byte that is different.

        sub numeric {
                my($a1, $a2, $a3, $a4) = split /\./, $a;
                my($b1, $b2, $b3, $b4) = split /\./, $b;
                $a1 <=> $b1  or  $a2 <=> $b2 or $a3 <=> $b3 or $a4 <=> $b4;
        }
        @result = sort numeric @ip;
        print "Sorted: @result\n";

Using pack:

The pack function in perl will allow you to compact values into a tight structure which you can unpack later for use. This allows you to conserve space AND also gain a measure of efficiency in passing data around.

        sub packed {
           pack('C4', split(/\./, $a)) cmp pack('C4', split(/\./, $b));
        }
        @result = sort packed @ip;
        print "Sorted: @result\n";
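
If you are curious what the packed form looks like, and why comparing it with cmp gives the right order, here is a small sketch. The sample addresses and variable names are made up for the illustration:

```perl
use strict;
use warnings;

# four numbers become four bytes, and unpack gets the numbers back
my $packed   = pack('C4', split(/\./, '192.168.100.1'));
my $restored = join('.', unpack('C4', $packed));

print "packed length: ", length($packed), " bytes\n";
print "restored:      $restored\n";

# plain strings compare character by character: "9" sorts after "1"
my $as_text  = '9.0.0.1' cmp '10.0.0.1';

# packed strings compare byte by byte: byte 9 sorts before byte 10
my $as_bytes = pack('C4', 9, 0, 0, 1) cmp pack('C4', 10, 0, 0, 1);

print "text compare:   $as_text\n";
print "packed compare: $as_bytes\n";
```

This is why a plain sort of the dotted strings is not enough, while sorting on the packed bytes orders the addresses numerically.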

Using pack, cache for efficiency:

This is the same idea as above, but builds a cache of already seen IP addresses. This optimization will save you computation time when you have large sets of elements to sort.

        {
           my %cache;
           sub cached {
              ($cache{$a} ||= pack('C4', split /\./, $a))
                 cmp
              ($cache{$b} ||= pack('C4', split /\./, $b));
           }
        }
        @result = sort cached @ip;
        print "Sorted: @result\n";

Books, Documents, Further Reading

As mentioned before, this document is merely a primer. If you need to get in deeper, the following resources will help you greatly.

Documents

        perl manual pages
        perlfaq (perldoc perlfaq)

        `picking up perl' (http://www.ebb.org/PickingUpPerl/)

        man perlstyle (for style issues)

Books

        Learning Perl (Randal Schwartz, Tom Christiansen)
        Programming Perl (Larry Wall, Tom Christiansen, Randal Schwartz)
        Perl in a Nutshell
        Perl Cookbook (Tom Christiansen and Nathan Torkington)
        Perl, the programmer's companion (Nigel Chapman)

Newsgroups/mailing lists

        comp.lang.perl.misc, comp.lang.perl.moderated

Links:

        Perl home page:   http://www.perl.com/
        CPAN multiplexer: http://www.perl.com/CPAN
        The Perl Journal: http://tpj.com/
        Apache perl     : http://perl.apache.org/
        The perl oasis  : http://www.oasis.leo.org/perl/00-index.html
        Randal's columns: http://www.stonehenge.com/merlyn/UnixReview