Perl for C Programmers

From Geekyinfo

This is a Perl tutorial for C programmers. It is my goal for a C programmer to be able to digest this entire tutorial in a single 30-minute session.



Introduction

This tutorial will not make you an expert at Perl. It will let you start writing Perl code. I hope it will also give you an appreciation of why you might want to write Perl code. I assume that you are already a reasonably good C programmer, so I don't need to teach you what the "++" operator is. Also, you will catch on faster if you are experienced writing shell scripts (Bash), and even faster if you've used "grep", "sed" and "awk". But if you're only a C programmer, you should not have too much trouble.

This tutorial assumes you will go through it from start to end. Later sections use techniques taught in earlier sections.

Being basically egocentric, I will teach my own Perl programming style. Cuz it's the best, of course. I'm an old-time C programmer (not C++), and I tend to write a lot of my Perl code as if it were C. A true Perl fanatic would sneer at my non-Perl-like code.

Also, I'll assume that you're using Unix. Cuz it's the best, of course. But Perl is intended to be portable across operating systems, and while it is not quite as successful as, say, Java, it's not bad. I promise to keep the Unix-specific stuff to a minimum.

First a simple question: why Perl? Here are my reasons, in order of decreasing importance:

  1. Regular expressions (pattern matching).
  2. Hashes (hash tables, hash maps, associative arrays, whatever you want to call them).
  3. Weak typing.
  4. Language shortcuts, like while (<>) (explained later).

All of these contribute to rapid code development. For many kinds of text-processing applications, I can get a Perl program running in a tiny fraction of the time it would take me in C.

I use Perl for tools. The decision to write a tool is always a tradeoff between how much effort it takes to write and how much effort it saves. If C is my language for tools, I won't write very many; I would just do things manually. With Perl, I am much more likely to go ahead and write the tool, thus improving my overall productivity.

And yes, I know that both Java and C++ offer regular expressions and hashes. I've written some tools in Java using them. And while my tool writing productivity IS higher in Java than in ordinary C, Perl is still more productive. Especially with regular expressions - it just takes a lot more junk code in Java to do the same thing as Perl. For all the complaints about Perl being hard to read, take a look at some Java code that does regular expression matching. The junk code WAY obscures the code intent. The same code in Perl is much easier to read. (The regular expression patterns themselves are equally hard to read in both languages, and they are a big reason for Perl's reputation for being hard to read.)

You will probably notice that I include a lot of personal opinion in this article. I won't hesitate to say feature foo is better than feature bar. Sometimes I'll give a line of reasoning to support my opinion. Other times I'll simply state it as fact. I can get away with that because I'm an old geezer who will talk endlessly about the good old days if you get me started.

Finally, a word of warning. I did not exhaustively try out every code fragment contained here. I may have made some minor mistakes or typos. Let me know if you find any.


In The Beginning

I like to start every Perl program with the following lines:

 #!/usr/local/bin/perl -w
 use strict;
 use Getopt::Std;  # this one is optional, but I like to use "getopts()" in many of my programs.

Perl has a reputation for being a cowboy language which allows (encourages?) the programmer to play very fast and loose and basically ignore good programming practices. The -w on line 1 and the use strict on line 2 basically put a leash on the worst of programmer abuses. Perl will give you warnings and errors.

In Unix, you can turn on the executable bit for the file and run it like a normal command ... IF Perl is in "/usr/local/bin". Which sometimes it is not. And never mind Windows, which does not even support the shebang ("#!") construct. These days, I tend to run Perl programs like this:

   perl foo.pl

Note that the #!/usr/local/bin/perl -w line is still useful, even if invoking perl as above. When Perl reads that line, it recognizes the shebang and processes the command-line options. Thus, the -w option is applied.

Finally note that Perl uses "#"-style comments, like Unix shell scripts. A comment starts with "#" and continues to the end of the line.


Print

The "print" function is much more free-form than C's printf(). You typically do not need a format string; you just list things to print, separated by commas:

   print "hello\n", "How are", "you?\n";    # Oops, The second line is printed as "How areyou?".

Note that no parentheses are used. In Perl, enclosing function parameters in parentheses is usually optional, but I tend to always use them for clarity. But "print" is different. Some things don't work if you pass the parameters of "print" in parentheses.

You can also print to standard error:

   print STDERR "Error, if you get here, something is very wrong.\n";
   exit(1);

Note that no comma is used between STDERR and the start of the things to be printed. If you had a comma there, it would try to interpret STDERR as a thing you wanted to actually print.

You'll see more of "print" later on.


Data and Variables

Perl is weakly typed. The same variable might contain an integer one moment, a string the next moment, and a floating point number the next moment.

A useful mental model of Perl is to assume that data is always stored as a string. If you use the variable in a context that expects a numeric, Perl will convert the string to a number. The result of an arithmetic expression is converted to a string when stored into a variable. Note that this is just a useful way to think of Perl; in reality it does a lot to optimize things, like maintaining both a numeric and a string version of variables (if it thinks it should). But you should ignore all that and just think in terms of strings being the basis of variable storage, with automatic conversions based on context. Don't worry, it doesn't take long to get used to it.
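Here is a small sketch of that mental model in action. The variable names are made up for illustration, and the "my" keyword (which declares a variable) is covered under Declarations.

```perl
use strict;
use warnings;

my $x = "10";        # think of this as the 2-character string "10"
my $sum = $x + 5;    # "+" wants numbers, so "10" is converted: 10 + 5
my $cat = $x . 5;    # "." wants strings, so 5 is converted: "10" followed by "5"

print $sum, "\n";    # prints 15
print $cat, "\n";    # prints 105
```

The same variable gave two different results; the operator, not the variable, decided whether it was treated as a number or a string.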


Simple Variables

Simple variables must be prefixed with dollar sign ($). Here is some Perl code:

   $name = "Steve";
   $name = $name . "D";    # A period (.) is a string concatenation operator.
   $name .= "Ford";
   print $name, "\n";    # Oops, this prints "SteveDFord".  I forgot to include spaces.
   $age = 53;
   $age = $age + 1;    # I think it's my birthday today.
   $age--;    # Ah, just checked the calendar and it's not.
   $age += 1;    # Oops, that was last year's calendar; it IS my birthday.
   print $name, " is ", $age, " years old\n";    # prints "SteveDFord is 54 years old"

See? Once you get used to the dollar signs, it almost looks like C. In particular, all your favorite arithmetic operators are there: pre-increment, post-increment, and whatever you call those "+=" operators.


Declarations

When use strict is in effect, Perl will refuse to run the above code as-shown because you didn't declare your variables. Since Perl is weakly-typed, no data type needs to be specified. There are different ways to declare variables, but please ignore most of them and stick with "my":

   my $name;  # perl doesn't care what type "$name" will store.
   my $age = 53;    # you can init vars when declared.

Remember that $age can be interpreted as a 2-character string "53" or an integer value of 53 (numeric), depending on context. Thus:

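   print $age . "1", "\n";    # prints 531
   print $age + 1, "\n";      # prints 54
   # Careful: print ($age + 1) . "\n"; would drop the newline, because the
   # parentheses right after "print" are taken as its whole argument list.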


Undefined Variables

One nice thing about Perl is that it will warn you if you try to use an uninitialized variable. In Perl, there is even a special value called "undef" which means undefined.

Frequently, this lets you find your own bugs:

   my $age;
   $age = $age + 1;    # Oops, age is being used before initialization.  It has the value "undefined".

That will generate a run-time warning, alerting you to the bug.

There are also times that "undef" is a perfectly valid thing for a variable to contain. For example, when doing file input, EOF is indicated by returning "undef". You can test a variable to see if it is defined with the "defined()" function. So, if you read an input line into $in_line, you can test it with "defined()" to see if the read hit EOF:

   if (defined($in_line)) {
       print $in_line;
   }
   else {
       print "EOF!\n";
   }


Arrays

You can declare an array pretty easily:

   my @favorite_colors;

Note the @ instead of the $. Although Perl is weakly-typed, it does want to know if you intend to use a variable as an array. More on that in a moment.

Also note that the size of the array does not have to be specified. Perl will dynamically resize the array as needed. If you assign to [3] of the array, it will automatically create elements [0] through [3]. Those unused elements (0, 1 and 2) will be set to the undefined value.
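A quick sketch of that automatic resizing, using a made-up array name. It borrows "scalar()" (number of elements) and a C-style "for" loop, both of which are covered later in this tutorial.

```perl
use strict;
use warnings;

my @slots;                     # starts out empty
$slots[3] = "d";               # array grows to hold elements 0 through 3

my $count = scalar(@slots);    # number of elements in the array
print $count, "\n";            # prints 4

for (my $i = 0; $i < $count; ++$i) {
    if (defined($slots[$i])) {
        print $i, ": ", $slots[$i], "\n";
    } else {
        print $i, ": (undefined)\n";    # elements 0, 1 and 2 end up here
    }
}
```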

Back to the use of @. Perl is context-sensitive. The use of @ tells Perl to use the variable in its array context. I.e. the whole array. For example, there is a cool function "push()" which adds an element to an array:

   my @favorite_colors;
   push(@favorite_colors, "green");
   push(@favorite_colors, "orange");

The push(@favorite_colors, "green"); passes the entire "favorite_colors" array to "push" and asks it to add "green" as a new element. The "push()" function appends the new element immediately after the current last element. In the above code, the "favorite_colors" array starts out empty. The first "push" will store "green" into favorite_colors[0]. The second "push" will store "orange" into favorite_colors[1]. So, let's make sure:

   print $favorite_colors[1], "\n";

Whoa. What's that $ doing there?

It's the context-sensitive thing. You don't want print to operate on the entire array, only one element. A single element of an array is a simple variable, and needs to be accessed with $.

Yes, it gives Perl newbies headaches, and is probably the biggest reason Perl has a bad reputation for being hard to read and maintain. Just remember, use @ when you want to operate on the entire array, and use $ when you want to operate on a specific element.

   my @favorite_colors;
   $favorite_colors[0] = "green";    # use $ because you want to access a specific element.
   push(@favorite_colors, "orange");    # use @ because you want push to operate on the entire array.
   my $i;
   for ($i = 0; $i < 2; ++ $i) {  # looks a lot like C, with dollar signs.
       print $favorite_colors[$i], "\n";    # print each element in array
   }
   my @temp = @favorite_colors;    # copy entire array.
   $i = join(",", @temp);    # "join()" concatenates (as strings) array elements, with a separator (comma, in this example).
   print $i, "\n";    # this prints "green,orange".


Hashes

Perl calls them "hashes", but you might know them as "maps" or "hash tables" or "associative arrays" or "key/value pairs". The concept is VERY simple. A hash is simply an array which is indexed with an arbitrary string instead of an integer. Instead of calling that string an "array index", call it a "hash key". But, as simple as the concept is, it is half the reason for using Perl in the first place (the other half is regular expressions).

The percent sign (%) is used for hashes. And the same basic rules apply - use % when referring to the entire hash, and use $ when referring to a specific element. Also, instead of using square brackets to enclose the index, use curly braces ({ }) to enclose the key.

   my %student_ages;
   $student_ages{"Steve"} = 53;    # create key "Steve"
   $student_ages{"Alice"} = 15;    # create key "Alice"
   my @student_names;    # define an array
   @student_names = keys(%student_ages);    # Whoa!

That last line deserves some explanation. The "keys()" function takes a hash (the *entire* hash; hence the use of %) and finds all of the hash keys. It returns those keys in the form of an array, and assigns that to the @student_names. The result being:

   print $student_names[0], "\n";    # prints either "Steve" or "Alice"

Note that the order of keys in the returned array is not defined. If you want a predictable order, you could do:

   @student_names = sort(keys(%student_ages));  # sort() accepts an array and returns an array, sorted in alphabetic order.
   print $student_names[0], "\n";    # prints "Alice"

Here's a bit more code to show how it all works:

   print $student_ages{"Steve"}, "\n";    # prints 53
   my $i;
   for ($i = 0; $i < 2; ++ $i) {
       my $name = $student_names[$i];
       print $name, " is ", $student_ages{ $name }, " years old\n";
   }

The variable "$name" is technically not needed; the inside of the loop could have been written as the single line:

       print $student_names[$i], " is ", $student_ages{ $student_names[$i] }, " years old\n";

That print line is a perfect example of what confuses newbies, what with all the dollar signs, square brackets and curly braces. But it's fundamentally not that hard. You are wanting to access a specific element of the student_ages hash, so you use $student_ages{ SomeKey }. The SomeKey is the student's name, which is in $student_names[ $i ].
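To show why I like hashes so much, here is a sketch of a classic use: counting how many times each string appears. The data and the names "@votes" and "%tally" are invented for this example; "scalar()" (number of elements in an array) is explained later.

```perl
use strict;
use warnings;

my @votes = ("green", "orange", "green", "blue", "green");
my %tally;                          # key: color name, value: number of votes

my $i;
for ($i = 0; $i < scalar(@votes); ++$i) {
    $tally{ $votes[$i] } += 1;      # a key is created the first time it is used
}

my @colors = sort(keys(%tally));    # "blue", "green", "orange"
for ($i = 0; $i < scalar(@colors); ++$i) {
    print $colors[$i], ": ", $tally{ $colors[$i] }, "\n";
}
# prints:
#   blue: 1
#   green: 3
#   orange: 1
```

Doing the same thing in C means writing (or finding) a hash table first. In Perl it is one line inside a loop.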


Separate Name Spaces

Here's a somewhat unfortunate thing about Perl that can lead to some very hard to read code. Simple variables, arrays, and hashes (and functions, for that matter) are in separate name spaces. Which is to say that:

   my $x;
   my @x;
   my %x;

Those are three separate variables. The fact that they have the same name does not lead to any ambiguity. Perl uses context to figure out which one is being accessed:

   $x = 0;
   $x[$x] = "greeting";    # store the string "greeting" into array x element 0.
   $x{$x[$x]} = "hello there";    # store "hello there" into the hash at key "greeting"

For obvious reasons, I suggest avoiding using the same name for different variables in different name spaces.


scalar()

Although Perl does a pretty good job of using the $, @, %, [ ] and { } characters to define all the appropriate contexts, sometimes you want to override it. Usually that means forcing Perl to treat an array or a hash as a simple value. This is mostly for advanced programming, but there is one place where even a newcomer will use it.

Since arrays are dynamic, you can't use a function like "sizeof" to figure out how big it is. So, if you refer to an entire array (using @), but do so in the context of a single value, it will basically return the number of elements in the array:

   my @test;
   $test[5] = "hello";
   my $num = @test;    # "$num" is a simple variable; forces "@test" to be treated as a single value.
   print $num, "\n";    # prints 6.  (Remember elements 0 - 4 are undefined and 5 is defined.)

In my $num = @test; Perl is able to understand that you want to force interpretation of "@test" to a single value (i.e. number of elements). But what about:

   print @test, "\n";

Perl will assume you want to pass the entire array to print. Try it - the results are interesting. (Spoiler: when it tries to print elements 0-4, it will complain that they are undefined.) If you want to force it to pass the number of elements, you need to use "scalar()":

   print scalar(@test), "\n";    # prints 6.

I tend to use "scalar()" more than I absolutely have to, just for clarity's sake. So I would write an earlier line as:

   my $num = scalar(@test);    # "scalar()" technically not necessary, but explicitly states my intention.

I'm sure there are other important uses of "scalar", but I pretty much only use it to get the number of elements in an array.


Default Variable

There is no analog to this in C, but it is so universal in Perl that you might as well get used to it. The simple variable "$_", called the "default variable", is a special pre-defined variable with some magical properties. There is a specific set of built-in Perl functions which will use "$_" as input if you don't supply any arguments. And there are a small number of built-in Perl functions which will leave output in "$_" if the function's output is not specified.

For example, later on you will see the code:

   while (<>) {
       print;
   }

The "while (<>)" construct reads a line of input and places it in "$_". The "print;" line takes the contents of "$_" and prints it to STDOUT. Both of these will be explained in more detail later. For now just know that the default variable "$_" is implicitly being used there.

For a full treatment of what uses "$_", see http://perldoc.perl.org/perlvar.html#General-Variables . It is one of those things which I bring up, not because you need to know it to write Perl code, but because lots of existing Perl programs use it.
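Here is a small sketch of a few built-ins falling back on "$_". The array contents are made up, and the "foreach" loop and "chomp()" are both explained later in this tutorial; for now just watch where "$_" is implied.

```perl
use strict;
use warnings;

my @words = ("apple\n", "fig\n");
my $total = 0;

foreach (@words) {       # each element is placed in $_ in turn
    chomp;               # same as chomp($_): strip the trailing newline
    $total += length;    # same as length($_)
    print;               # same as print $_
    print "\n";
}
print $total, "\n";      # prints 8 (5 letters in "apple" plus 3 in "fig")
```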


Loops

Looping constructs are pretty much what you would expect as a C programmer. While and for loops work the same, and have already been demonstrated in previous code.


break -> last

C has the "break" statement to exit out of a loop. Perl uses "last".


continue -> next

C has the "continue" statement to jump immediately back to the top of the loop. Perl uses "next".
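A short sketch of both statements in one loop. The data is invented, with 0 standing in as an end marker.

```perl
use strict;
use warnings;

my @nums = (3, -1, 4, 0, 5);    # 0 marks the end of the useful data
my $sum = 0;
my $i;
for ($i = 0; $i < scalar(@nums); ++$i) {
    if ($nums[$i] < 0) {
        next;    # like C's "continue": skip negative values
    }
    if ($nums[$i] == 0) {
        last;    # like C's "break": stop at the end marker
    }
    $sum += $nums[$i];
}
print $sum, "\n";    # prints 7 (3 + 4; the -1 is skipped and the 5 is never reached)
```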


do { ... } while

In C, a "do" loop is just another looping construct. In Perl, "do" is very different. For example, the "last" and "next" statements don't do what you think. Avoid the "do { ... } while" construct.


foreach

There is an additional looping construct called "foreach", which has no analog in C. It is designed to loop through an array without having to bother with the array index.

Consider from a previous example:

   for ($i = 0; $i < 2; ++ $i) {
       my $name = $student_names[$i];
       print $name, " is ", $student_ages{ $name }, " years old\n";
   }

You could also code that as:

   foreach $i (@student_names) {
       # the variable $i has the contents of each array element.
       print $i, " is ", $student_ages{ $i }, " years old\n";
   }

It turns out that the "foreach" construct can make implicit use of the default variable "$_":

   foreach (@student_names) {
       # the variable $_ has the contents of each array element.
       print $_, " is ", $student_ages{ $_ }, " years old\n";
   }


String Interpolation

Shell programmers will be used to this. You can do variable substitution inside of strings. For example:

   my $name = "Steve";
   my $age = 53;
   my $out_line = "$name is $age years old";

There, isn't that easier to both write and read than using concatenation:

   my $out_line = $name . " is " . $age . " years old";    # yuk

Here's a line from earlier in this tutorial:

       print $student_names[$i], " is ", $student_ages{ $student_names[$i] }, " years old\n";

Here's an easier version:

       print "$student_names[$i] is $student_ages{ $student_names[$i] } years old\n";

All that said, I have run into situations where the expression is complicated enough that the string interpolation fails. So sometimes I have to fall back on concatenation.

Suppose that you want to add a suffix to the name. For example:

   my $name = "Steve";
   print "How many $names are here today?\n";

This won't work because it looks for a variable named "$names". You can tell Perl the exact variable name with curly braces:

   print "How many ${name}s are here today?\n";    # displays How many Steves are here today?

Finally, sometimes you simply do not want any string interpolation to happen. In that case, use single quote marks around your string:

   print 'How many ${name}s are here today?\n';    # displays How many ${name}s are here today?

It is no coincidence that the above is virtually identical to how shell scripts work (Bourne / Bash).
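The following sketch puts the quoting rules side by side. The last case, a whole array inside double quotes, is not covered above; I include it because it surprises people (the elements come out separated by spaces).

```perl
use strict;
use warnings;

my $name = "Steve";
my @colors = ("green", "orange");

my $plural  = "${name}s";    # braces mark where the variable name ends
my $literal = '$name\n';     # single quotes: no interpolation, and \n stays as two characters
my $list    = "@colors";     # a whole array interpolates with spaces between elements

print $plural, "\n";             # prints Steves
print $literal, "\n";            # prints $name\n
print length($literal), "\n";    # prints 7
print $list, "\n";               # prints green orange
```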


Conditionals

Perl "if" statements work pretty much the same way as C "if" statements. Here are some exceptions:

C:

   if (x == 1)
     printf("x=%d\n", x);

Perl:

   if ($x == 1) {
     print "x=$x\n";
   }

Curly braces are always required. Same with "else" clauses and "while" loops.

C:

   if (strcmp(x, "quit") == 0)
     exit(0);

Perl:

   if ($x eq "quit") {
     exit(0);
   }

Strings are compared with eq instead of ==. This is because variables can be interpreted either as strings or as numerics, depending on context. So you have to provide that context so that Perl does the right thing. If you do $x == "quit" it will interpret the string "quit" as a numeric; since it is not a valid numeric, it uses zero. Thus, "hi" == "bye" evaluates as true while "hi" eq "bye" evaluates to false. Without -w, Perl does not warn you when it sees "hi" == "bye". With -w, it warns at run time that the argument isn't numeric, but the comparison still evaluates as true.

C:

   if (strcmp(x, "hello") == 0)
     printf("greetings\n");
   else if (strcmp(x, "goodbye") == 0)
     printf("salutations\n");

Perl:

   if ($x eq "hello") {
     print "greetings\n";
   } elsif ($x eq "goodbye") {
     print "salutations\n";
   }

Since curly braces are always required, "else if (...)" would have to be coded as "else { if (...) {". This doubles the nesting level. So Perl invented "elsif" as a single token. Shell scripts have a similar construct, and I often confuse them.  :-(

Other string comparisons are "ne" (Not Equal), and the expected set of "lt", "le", "gt", "ge" for doing lexical (ASCII) ordering.
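Picking the wrong family of operators gives wrong answers without any error. Here is a sketch with made-up values where the numeric and the string comparison disagree:

```perl
use strict;
use warnings;

my $x = "10";
my $y = "9";

my $numeric_less = ($x < $y)  ? "yes" : "no";    # compares 10 and 9 as numbers
my $string_less  = ($x lt $y) ? "yes" : "no";    # compares "10" and "9" character by character

print "numeric: $numeric_less\n";    # prints numeric: no
print "string: $string_less\n";      # prints string: yes ("1" sorts before "9")
```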

One final useful function is defined(). Remember above when I said that variables can have an undefined value, which will print a warning if you try to use them? This is useful for more than just catching code bugs. You can check if a variable is defined:

   my $x;
   $x = some_funct_that_might_return_undefined();
   if (defined($x)) {
       print "$x\n";
   }


Short-Circuiting Instead of "if"

Here's an easy-to-understand chunk of code:

   if (! getopts('h')) {
       usage();
   }

The "getopts()" function parses some options and returns true if it succeeds. So the "if" statement calls "usage()" if the parse failed. Here's how a lot of Perl programmers would write it:

   getopts("h") || usage();

Just as in C, the "||" operator is subject to short-circuit evaluation, meaning that if "getopts()" returns true, "usage()" is guaranteed to not be executed. Conversely, if "getopts()" returns false, "usage()" is executed. Exact same behavior as the "if" code.

So why do it? Once you learn the idiom, I feel the code IS more expressive. The important thing being done is calling "getopts()". But in the 3-line version, the call is buried in the middle of an "if" conditional. The call to "usage()" is not central to the intent of the code - it's just error handling - but visually it is front-and-center. In the one-line version, the call to "getopts()" is front-and-center, and the call to "usage()" is relegated to its rightful place: only executed if something goes wrong in "getopts()".

But even if you disagree with me in terms of expressiveness, it is a very common construct in Perl programs, so get used to it.
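If you want to convince yourself that the right-hand side really is skipped, here is a sketch using three throwaway functions (all hypothetical, and written with "sub", which is explained under Functions). A counter records how often the fallback runs.

```perl
use strict;
use warnings;

my $fallback_calls = 0;

sub succeed  { return 1; }                       # stands in for a call that works
sub fail     { return 0; }                       # stands in for a call that fails
sub fallback { $fallback_calls++; return 1; }    # stands in for error handling

succeed() || fallback();    # left side is true: fallback() is skipped
fail()    || fallback();    # left side is false: fallback() runs

print $fallback_calls, "\n";    # prints 1
```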


Input / Output

The I/O mechanisms for Perl have undergone a lot of evolution since Perl's early days. This is unfortunate since there are now several methods in common use which must be learned to be fluent in Perl. I will stick with one method.

   # Copy a file
   my $i_file;    # going to use as file handle.
   if (! open($i_file, "<", "input.txt")) {    # the "<" means it is reading the file.  Like shell file re-direction.
       print STDERR "Error, could not open 'input.txt' ($!)\n";  # "$!" is built-in variable containing the error description
       exit(1);
   }
   
   my $o_file;
   # The above "if (! open" construct is very un-Perl-like.  Let's short-circuit it and use "die" to be more Perl-like.
   open($o_file, ">", "output.txt") || die("Error, could not open 'output.txt' ($!)\n");  # print to STDERR and exit
   
   my $i_line;
   # The <FileHandle> construct is how you read a line.
   while (defined($i_line = <$i_file>)) {    # $i_line will contain undefined when <$i_file> hits EOF.
       print $o_file $i_line;
   }
   close($i_file);
   close($o_file);

A couple of things to mention about the print statement. Notice the lack of comma between "$o_file" and "$i_line". That is how print knows that "$o_file" is a file handle, not a variable to actually print. Now you see why print STDERR "Error, blah blah\n"; has no comma after STDERR.

Note that the "while" statement uses the "defined" function to detect EOF. You may see some existing Perl code that assigns an input line to a variable, but does not test for defined. *This can be a bug!* Remember that the interpretation of a variable's content is context-sensitive. Suppose you wrote your code like this:

   if ($i_line = <$i_file>) {
       print $i_line;
   }

It would seem to work fine, UNTIL your input file contained a line consisting of the single character "0" with no newline (i.e. last line in the file). In that case, Perl will interpret the variable as numeric, with the value 0, and the "if" will fail even though the read succeeded. So you should always test input with "defined()".
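You can see the difference without any file at all. This sketch assigns the troublesome value directly (as a stand-in for what the read would return) and runs both tests on it:

```perl
use strict;
use warnings;

my $i_line = "0";    # what a read returns for a final line "0" with no newline

my $plain_test   = $i_line ? "kept" : "dropped";             # plain truth test
my $defined_test = defined($i_line) ? "kept" : "dropped";    # definedness test

print "plain: $plain_test\n";        # prints plain: dropped
print "defined: $defined_test\n";    # prints defined: kept
```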

Also, note that the print statement does not include "\n". This is because <$i_file> reads a line, up to and including the newline. So "$i_line" already has a newline in it. It is very easy to strip the trailing newline:

   chomp($i_line);    # strip trailing newline, if any

Now, let's streamline the code, like a real Perl programmer. Remember the default variable? When a file read operation is the conditional in a "while" loop, the variable assignment can be omitted and the inputted line will be placed in "$_". So, the above while loop can be re-written as:

   while (<$i_file>) {    # Read input line into $_ and check for "defined" (requires "while")
       print $o_file $_;    # $_ must be named here; "print $o_file;" alone would print the handle itself to STDOUT
   }

Note that the file read operation must be the *only* thing in the "while" conditional in order for it to automatically assign to "$_" and automatically test for "defined". I.e. if you try "while (!$quit && <$i_file>)" you will find that the read line is simply thrown away, not assigned to "$_". (But you can always use "while (!$quit && defined($_=<$i_file>))".) Also note that only file read operations automatically assign to "$_" when inside a while conditional. I.e. the combination of while and file read is magical.

Next, there are three built-in pre-defined file handles, STDIN, STDOUT and STDERR. For historical reasons, there is no $ in front of these. So, instead of explicitly opening files, you could re-write it to simply copy from STDIN to STDOUT. The entire program then becomes:

   while (<STDIN>) {
       print;
   }

No opens, no closes, no variables used at all. Once again, since the file read is the only thing inside the "while" condition, it uses $_ and tests for "defined".

One final cool thing:

   my @i_lines = <STDIN>;    # read operation done in array context!  Reads entire file.
   for (my $i = 0; $i < scalar(@i_lines); ++$i) {
       print $i_lines[$i];
   }

That first line declares an array and reads the ENTIRE INPUT FILE into the array. So when the "for" line starts, the whole file is in memory and available for access. Also note that you can declare the "for" loop's index variable right inside the "for" construct. Also note the use of "scalar()" to determine the number of elements in the array.

Remember the "foreach" loop? The above might be written as:

   my @i_lines = <STDIN>;    # read operation done in array context!  Reads entire file.
   foreach (@i_lines) {
       print;
   }


Super Input

One more magical construct. Let's re-write the above program one more time (in the file "copyit.pl"):

   while (<>) {
       print;
   }

The construct <> is a super form of input which has extra semantics:

  1. If no files were specified on the command-line, read from STDIN. At EOF of STDIN, "<>" returns false.
  2. If one or more files were specified on the command-line, open the first file and read it. At EOF of that file, simply open the next file and read it. And so on. At EOF of the last file, "<>" returns false.
  3. If one or more files were specified on the command-line, and one of those files is "-", STDIN is read for that file.

So, "copyit.pl" can be run as:

   perl copyit.pl <input.txt >output.txt

Or:

   perl copyit.pl input.txt >output.txt

Or:

   ls | perl copyit.pl input1.txt input2.txt - input4.txt >output.txt

That last form combines input1.txt, input2.txt, the output of "ls" (STDIN), and "input4.txt". I.e. "copyit.pl" is a simple version of the Unix command "cat". Pretty amazing for a 3-line program.

I usually hate "extra magical" things because they are hard to remember and can trip up a newbie. But the construct while (<>) { is so wide-spread that you should just get used to it. I would guess that 70% of my Perl tools use it.


die()

The "die()" function is just a way of printing an error (to STDERR) and exiting the program.

   isvalid($i_line) || die("input line '$i_line' is not valid");

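Note that a newline is not needed for "die()". If the message does not end with a newline, "die()" appends the script name and line number where it was called. If the message does end with a newline, it is printed as-is.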


Assert

You use "assert()" in your C code, right? You don't? Shame on you! You should.

An assert is simply a form of internal code validation. You add a line which asserts that some required condition is true, and assert aborts the program if the condition is false. Sprinkling asserts in your code is very useful for finding bugs in your code.

Perl does not have an "assert()" function. But you can do the same thing with a common Perl idiom:

   (condition) || die(message);  # the parentheses around "condition" are optional

The "||" operator does short-circuit evaluation, meaning it only executes "die()" if the condition is false. The "die()" function prints to STDERR and then exits the program with a non-zero status. It has the added sweetness of printing the line number in the Perl source code (as long as the message does not end with a newline). BTW, I don't think anybody besides me calls that construct an "assert", but it is basically the same idea.

Here is an example of the construct:

   open(my $i_file, "<", "input.txt") || die("Could not open 'input.txt': $!");

Compare that to the more C-like construct:

   if (! open(my $i_file, "<", "input.txt")) {
       die("Could not open 'input.txt': $!");
   }

Again, I have to point out the expressiveness of the assert-like construct. Not only does it use fewer lines of code, it also emphasizes the intent of the code. The intent is to open the input file. In the C-like version, that open is buried inside an "if" conditional. The error handling code is not central to the intent of the code, and yet the C-like version puts it front-and-center. Once you are familiar with this Perl idiom, that one-liner expresses the intent of the code better.


getopts()

Many Unix-based C programmers are used to calling "getopt()". But some aren't, and Windows doesn't have "getopt()". It is used to parse command-line options in a concise, standardized way. Perl has "getopts()" (from the Getopt::Std module) on all platforms, including Windows. It supports single-letter options, *not* long-form like --foo.

   my $num_things = 10;  # default value, overridable with "-n NUM"
   
   # "getopts()" stores option values in variables named "$opt_a" - "$opt_z" and "$opt_A" - "$opt_Z".
   # We can't declare them like "my $opt_h" because that forces a local copy.
   # We need the equiv of C's "extern", which is "use vars qw(...)".
   
   use vars qw($opt_h $opt_n);  # Gain access to getopt's variables
   
   # The string passed to getopts() tells it which options to accept.  The colon means that the option has a value.
   if (! getopts("hn:")) {
       exit(1);    # user's option string is bad.  getopts() already printed an error.
   }
   
   # Options are now parsed; $opt_h and $opt_n are either undefined, or set.
   if (defined($opt_h)) {
       help();  # our own function to print help
       exit(0);
   }
   if (defined($opt_n)) {
       $num_things = $opt_n;
   }
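To try the option parsing without typing a command line, you can load "@ARGV" by hand. In this sketch the tool name "mytool" and the argument values are made up. One thing not mentioned above: "getopts()" removes the options it parsed from "@ARGV", leaving only the non-option arguments (typically file names).

```perl
use strict;
use warnings;
use Getopt::Std;
use vars qw($opt_h $opt_n);

# Stand-in for a real command line such as:  mytool -n 5 input.txt
@ARGV = ("-n", "5", "input.txt");

my $num_things = 10;    # default value
getopts("hn:") || die("bad options\n");
if (defined($opt_n)) {
    $num_things = $opt_n;
}

print $num_things, "\n";      # prints 5
print scalar(@ARGV), "\n";    # prints 1
print $ARGV[0], "\n";         # prints input.txt
```

Because the leftovers stay in "@ARGV", a "while (<>)" loop placed after "getopts()" reads just the named files.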


Functions

Calling a function is pretty C-like. But defining a function is different. Let's take the easy case first: a function with no input parameters and no return value.

   sub usage {
       print STDERR "usage: mytool -h [ IN_FILE ... ]\n";
       return;    # bare return: no value (otherwise the result of "print" would be returned)
   }

Note that even though the function has no input parameters or return value, the caller still needs to call it with parentheses: usage(). In fact, the caller can go ahead and pass in parameters and pretend to use its return value!

   my $rtn_val = usage(1, 44, "what the hey?");    # input params are ignored.

No errors or warnings are printed. So what ends up in $rtn_val? The undefined value, of course!

   if (! defined($rtn_val)) {
       print "No return val\n";    # this one gets printed
   } else {
       print "return val = '$rtn_val'\n";    # this one not printed
   }

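Calling a function with the wrong number of input parameters is not considered an error in Perl (although it is usually poor programming practice). Also, if the function returns no value (a bare "return;"), the caller simply gets the "undefined" value out of it. Be aware that a function which runs off its end without any "return" hands back the value of the last statement it executed, which is rarely what you want. Any attempt to *use* an undefined value will generate a warning.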
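
The difference between a bare "return;" and running off the end of a function is easy to check for yourself. Here is a small sketch; the function names "bare_return" and "falls_off_end" are made up for this example.

```perl
use strict;
use warnings;

sub bare_return {
    return;             # explicit return with no value
}

sub falls_off_end {
    my $x = 7;
    $x + 1;             # no "return": the last expression evaluated is the result
}

my $r1 = bare_return();
my $r2 = falls_off_end();

if (! defined($r1)) {
    print "bare_return gave the undefined value\n";    # this one gets printed
}
print "falls_off_end gave $r2\n";    # prints "falls_off_end gave 8"
```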

So now let's say you DO have input parameters, two in this example:

   sub max {
       my ($a, $b) = @_;    # grab input parameters
       if ($a >= $b) {
           return $a;
       } else {
           return $b;
       }
   }

That my ($a, $b) = @_; line is a bit different, isn't it? Easy part first: the "@_". The fact that it starts with "@" says that it is an array. And it is. The fact that the variable name is "_" suggests that it is some kind of default variable. And it is. So "@_" is a default array. (Which is different from the default simple variable $_. Confused yet?) The @_ array is pre-loaded with the passed-in values from the function's caller.

In fact, you don't even really need that "my" line at all! You could re-write the function as:

   sub max {
       if ($_[0] >= $_[1]) {
           return $_[0];
       } else {
           return $_[1];
       }
   }

But that is a nightmare to read; you really want the input parameters to be in well-named variables. So one step better is:

   sub max {
       my $a = $_[0];  # first element of the @_ array
       my $b = $_[1];  # second element of the @_ array
       if ($a >= $b) {
           return $a;
       } else {
           return $b;
       }
   }

So, what's with the my ($a, $b) = @_; nonsense? For one thing, it is a Perl idiom so widespread that you need to get used to it. That's just how people do it. The detailed explanation of what it is doing is beyond the scope of this tutorial; suffice it to say that a parenthesized list of simple variables can be treated as if it were an array. So ($a, $b) = @_; is simply copying the contents of the @_ array to the ($a, $b) array. Seems like a lot of obfuscation to save one line, doesn't it?

Well ... there actually is another advantage. You see, whereas C has an arcane syntax for defining functions with variable numbers of input parameters, Perl functions always have variable numbers of parameters. The caller can pass in as many or as few parameters as it wants, and good old Perl will simply dynamically size the array @_ to contain the right number of elements. So, suppose you coded the function with:

   sub max {
       my $a = $_[0];
       my $b = $_[1];

If the caller only passes in 1 parameter, then $_[1] does not exist and that second assignment quietly sets $b to undefined. The my ($a, $b) = @_; idiom does exactly the same thing in one line, however many parameters you have: Perl will discard any extras if there are too many, and set variables to undefined if there are too few, all without any warnings. Either way, subsequent usages of $a or $b will generate warnings if they are undefined.

So which is really better? Don't know. Don't care. As I said before, the my ($a, $b) = @_; idiom is so widespread that you should just get used to it.
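
Here is a quick way to watch the list assignment cope with too few and too many parameters. It is just a sketch; the function "describe" and its variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: shows what ended up in its two named parameters.
sub describe {
    my ($first, $second) = @_;
    my $text = "first=" . (defined($first) ? $first : "undef");
    $text .= " second=" . (defined($second) ? $second : "undef");
    return $text;
}

print describe(5), "\n";          # prints "first=5 second=undef"
print describe(5, 6), "\n";       # prints "first=5 second=6"
print describe(5, 6, 7), "\n";    # prints "first=5 second=6" (the 7 is discarded)
```

None of the three calls produces a warning, because the function checks with defined() before using the values.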

You could do one more thing if you really want to write bullet-proof code:

   sub max {
       # The following line asserts that two parameters were passed
       (scalar(@_) == 2) || die("max: wrong number of input parameters passed: " . join(", ", @_));
       my ($a, $b) = @_;    # grab input parameters
       if ($a >= $b) {
           return $a;
       } else {
           return $b;
       }
   }

I don't know of any Perl programmers who would actually do that, but it does suggest a slightly different technique: writing functions with a variable number of input parameters. Let's make max such that you can pass arbitrarily many input numbers in (requires at least one):

   sub max {
       my $n = scalar(@_);  # number of input parameters
       my $m = $_[0];  # assume first element is the max
       for (my $i = 1; $i < $n; ++$i) {
           if ($_[$i] > $m) {
               $m = $_[$i];
           }
       }
       return $m;
   }

Now you can call it:

   print "Max is: " . max(1, 3, -8, 2) . "\n";  # prints 3
   print "Max is: " . max(@an_array) . "\n";  # scans the array and returns the largest value contained therein

Wow! Look at that second line! Passing an array into a function is the same as passing in a series of parameters. In fact, there is no real difference between the two. The function itself just has the "@_" array containing input parameters. It has no way of knowing which of those elements are individual values vs. elements in a passed-in array.
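
You can see the flattening directly by counting what arrives in @_. A minimal sketch follows; the helper "count_params" and the two arrays are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns how many values arrived in @_.
sub count_params {
    return scalar(@_);
}

my @first = (1, 2, 3);
my @second = (10, 20);

print count_params(@first), "\n";              # prints 3
print count_params(@first, @second), "\n";     # prints 5: both arrays are flattened into one list
print count_params(7, @second, 8), "\n";       # prints 4: loose values and array elements look the same
```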

This is actually a bit disappointing for C programmers. Consider the C function:

   void add_arrays (int *a1, int *a2, int n)  /* add contents of array a2 to array a1. */
   {
       int i;
       for (i=0; i<n; ++i) {
           a1[i] += a2[i];
       }
   }

There is no good way to do that in Perl without using references (the Perl equivalent of pointers). Ditto if you want to pass in a hash; you need references. Alas, references are beyond the scope of this tutorial. So until then, just use globals if you need to pass in hashes or multiple arrays.

Besides a true "variable number of parameters" model, you can also have functions with optional parameters.

   sub abc {
       my ($a, $b) = @_;
       
       if (defined($b)) {
           return($b);
       } else {
           return($a);
       }
   }

The second parameter is optional. So:

   print abc(5), "\n";    # prints 5
   print abc(5, 6), "\n";    # prints 6
   print abc(5, 6, 7), "\n";    # prints 6 (ignores 3rd parameter)
   my @x;
   $x[0] = 5;
   print abc(@x), "\n";    # prints 5
   $x[1] = 6;
   print abc(@x), "\n";    # prints 6
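
A common use of an optional parameter is to fall back to a default value when the caller leaves it out. Here is one way it might look; the function "join_pair" and its default separator are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical function: the third parameter (a separator) is optional.
sub join_pair {
    my ($left, $right, $sep) = @_;

    if (! defined($sep)) {
        $sep = ", ";    # caller omitted it; use the default
    }
    return $left . $sep . $right;
}

print join_pair("pigs", "pokes"), "\n";            # prints "pigs, pokes"
print join_pair("pigs", "pokes", " in "), "\n";    # prints "pigs in pokes"
```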


Regular Expressions

Perl's powerful Regular Expression (RE) handling capabilities are half the reason for using Perl. Without REs, I would never write another line of Perl. Yes, Java has RE capabilities with approximately the same flexibility, but Perl is so expressive that it is much faster to write in Perl than in Java (or so I have found).

REs are also half the reason that people think Perl programs are impossible to read and maintain. In this respect, it's not really Perl's fault; a Java program with lots of REs will be just as hard to read as a Perl program with lots of REs. An RE is just one of those love-hate things. Blame the 1950s computer scientist Stephen Kleene.

I'm going to assume that you're already reasonably comfortable with regular expressions from tools like "vi" and "grep". See http://www.regular-expressions.info/quickstart.html for a quick tutorial and http://perldoc.perl.org/perlre.html for a full Perl RE reference. But bear in mind that entire books have been written about regular expressions. Don't try to become an expert right off the bat; you can do amazing things with very basic RE techniques.


Patterns and Match Operators

An RE pattern is normally delimited with a forward slash "/". For example: /hi there/. A better example: /there are [0-9]* pigs in the poke/.

The match operator is "=~". Think of it as "approximately equal to"; regular expressions allow all kinds of approximate matches.

   if ( $in_line =~ /there are [0-9]* pigs in the poke/ ) {
       print "There are pigs in the poke!\n";
   }
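
To get a feel for what does and does not match, here is a small self-contained sketch. The two sample lines are made up; note that the pattern is allowed to match anywhere inside the string.

```perl
use strict;
use warnings;

my @lines = (
    "I hear there are 12 pigs in the poke today",
    "there are no pigs here",
);

my $matches = 0;
foreach my $in_line (@lines) {
    if ( $in_line =~ /there are [0-9]* pigs in the poke/ ) {
        print "match:    $in_line\n";
        ++$matches;
    } else {
        print "no match: $in_line\n";
    }
}
```

Only the first line matches; the second has "no" where the pattern wants digits and never mentions the poke.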


Sub-Patterns

Too bad you can't extract the number of pigs in that poke. Too bad you'll have to do all kinds of gross string operations to isolate the number. Too bad you can't do something like this:

   if ( $in_line =~ /there are ([0-9]*) pigs in the poke/ ) {
       print "There are $1 pigs in the poke!\n";
   }

Good news! Parentheses are used to identify sub-patterns for later use. Suppose there are not only pigs, but various other animals in the poke.

   if ( $in_line =~ /there are ([0-9]*) ([a-z]*) in the poke/ ) {
       print "There are $1 ${2}s in the poke!\n";  # curly braces mark where the variable name ends, so the "s" is plain text.
   }

But what if you want to match an actual parenthesis character? You have to escape them. You may have seen documents which spell out numbers and also include numerics in parentheses. For example, the input string might be "there are five (5) pigs in the poke". Here's how to match it:

   if ( $in_line =~ /there are [a-z]* \(([0-9]*)\) ([a-z]*) in the poke/ ) {
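
Completing that fragment into something you can run gives the sketch below. The variables $count and $animal are names I picked for the example.

```perl
use strict;
use warnings;

my $in_line = "there are five (5) pigs in the poke";
my ($count, $animal);

# \( and \) match literal parentheses; the unescaped pairs capture into $1 and $2.
if ( $in_line =~ /there are [a-z]* \(([0-9]*)\) ([a-z]*) in the poke/ ) {
    $count = $1;
    $animal = $2;
    print "count=$count animal=$animal\n";    # prints "count=5 animal=pigs"
}
```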


Case Insensitivity

What if you don't want to pay attention to case? Use the "i" suffix on the pattern:

   if ( $in_line =~ /there are ([0-9]*) ([a-z]*) in the poke/i ) {    # will match both upper and lower case


Default Variable

Remember the default variable "$_"? If the string you want to match is in $_, then you can write the "if" as:

   if ( /there are ([0-9]*) ([a-z]*) in the poke/ ) {

This, combined with the magical input construct "while (<>)" allows for a very concise way of writing simple programs:

   while (<>) {
       if ( /there are ([0-9]*) ([a-z]*) in the poke/ ) {
           print "There are $1 ${2}s in the poke!\n";
       } elsif ( /quit/ ) {
           print "goodbye\n";
           exit(0);
       }
   }
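
If you want to experiment with this loop without typing input by hand, one option is to read from a string instead of a file. This sketch assumes Perl's ability to open a file handle on a reference to a string; the sample text and the @found array are made up.

```perl
use strict;
use warnings;

my $text = "there are 3 pigs in the poke\nnothing to see here\nthere are 7 cats in the poke\n";

# Opening a reference to a string gives a file handle that reads from memory.
open(my $fh, "<", \$text) || die("Could not open in-memory file: $!");

my @found;
while (<$fh>) {                  # each line lands in $_
    if ( /there are ([0-9]*) ([a-z]*) in the poke/ ) {    # matches against $_
        push(@found, "$1 $2");
    }
}
close($fh);

print join("; ", @found), "\n";    # prints "3 pigs; 7 cats"
```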


Substitute

A special case of pattern matching uses the "s" prefix on the pattern. It allows for quick and easy text substitutions.

   $in_line =~ s/poke/bag/;

Note that if $in_line does not contain "poke", the line has no effect. No warning is printed. But also note that the entire line returns a numeric value corresponding to the number of substitutions it did. Since non-zero is interpreted as true, you can do this:

   if ( $in_line =~ s/poke/bag/ ) {
       print "'poke' is an old-fashioned term; using 'bag'\n";
   }

Suppose that "poke" appears more than once in $in_line? The above line will only replace the first one. If you want it to replace *all* of them, use the "g" suffix on the pattern:

   if ( $in_line =~ s/poke/bag/g ) {

If you want the initial match of "poke" to be case-insensitive, you can also include the "i" suffix:

   if ( $in_line =~ s/poke/bag/gi ) {  # order of "gi" does not matter

This will change "Poke the pigs in the poke." to "bag the pigs in the bag."
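
The return value of a substitute can be captured in a variable if you want to know how many replacements were made. A short sketch, using the same sentence ($count is just a name chosen here):

```perl
use strict;
use warnings;

my $in_line = "Poke the pigs in the poke.";

# With "g", s/// returns how many replacements it made.
my $count = ($in_line =~ s/poke/bag/gi);

print "$count substitutions: $in_line\n";    # prints "2 substitutions: bag the pigs in the bag."
```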

As with matching, the substitute can be used with the default variable.

   while (<>) {
       s/poke/bag/gi;
       print;
   }


Perlisms I Avoid

I rarely (if ever) use the techniques in this section, but when reading Perl documentation it is helpful to understand these idioms exist.


Unless

Don't like the "not" operator (!) in if statements, as in "if (! $quit) { process(); }"? Some people apparently hate it so much they introduced the "unless" statement:

   unless ($quit) { process(); }

There, isn't that easier to understand? No? I don't think so either.


Post-Conditional

I lied when I said that "if" statements always require curly braces. You can switch around the "if" and the "then" parts:

   process() if (! $quit);

This is *identical* to "if (! $quit) {process();}". It works for "unless" too:

   process() unless ($quit);

This last one is almost ... almost mind you ... worth using. The important thing here is that "process()" is called. But not if the "$quit" flag is set. So this idiom puts emphasis on what is normally done, and provides the exceptional case as an afterthought. I can see it, but post-conditionals are so contrary to what pretty much any other language has, that I refuse to use them. But you *will* see them used in Perl documentation, so remember them.
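
If you doubt that all of these spellings do the same thing, a counter settles it. In this sketch, "process" just counts its own calls; that body is my stand-in, not anything a real program would do.

```perl
use strict;
use warnings;

my $calls = 0;
sub process {
    ++$calls;
}

my $quit = 0;

if (! $quit) { process(); }     # C-like form
unless ($quit) { process(); }   # "unless" form
process() if (! $quit);         # post-conditional "if"
process() unless ($quit);       # post-conditional "unless"

print "process() ran $calls times\n";    # prints "process() ran 4 times"

$quit = 1;
process() unless ($quit);       # skipped: $quit is now true
print "process() ran $calls times\n";    # still prints "process() ran 4 times"
```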


Optional Parentheses

In Perl, you usually don't have to use parentheses around input parameters on a function call (for your own functions, this works only if the function is defined or declared before the call). For example, instead of this:

   $x = max(1, 2, 3);

you can use:

   $x = max 1, 2, 3;

Again, I think it is a stupid deviation from all other languages, but it is used in Perl documentation, so remember it.
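
For the record, here is a sketch of the paren-less call working on one of your own functions. It relies on "max" being defined above the call, so that Perl already knows the name is a function when it reaches the call.

```perl
use strict;
use warnings;

# Defined before the calls below, so the paren-less call can be parsed.
sub max {
    my $m = $_[0];
    for (my $i = 1; $i < scalar(@_); ++$i) {
        if ($_[$i] > $m) {
            $m = $_[$i];
        }
    }
    return $m;
}

my $x = max(1, 2, 3);
my $y = max 1, 2, 3;      # same call without parentheses

print "$x $y\n";          # prints "3 3"
```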


Perl Disappointments

Although I am generally a Perl fan, there are some things that I don't like.

  • default variable - I find the default variable $_ to be more confusing for newbies than it is worth. It leads to confusing code for people not experienced in Perl.
  • switch case - Perl does not have a switch construct. This leads to a lot of "if/elsif/else" code.
  • magic - there are a lot more magical constructs in Perl than I've mentioned. Most of them do more harm than good, IMO.


To Infinity, and Beyond

There are many Perl techniques and features that are beyond the scope of this short tutorial.

Remember that Perl is a language which has evolved significantly over the years. That evolution has improved the language in many ways. Some of those improvements were not actually adding features, but making some existing features better. Like file handles: the original file handle was hard to use. The file handles mentioned in this tutorial are newer and better. But if you want to be able to read and maintain code written a long time ago (or written recently by somebody who stopped following improvements in the language), you would have to learn all those rusty old features in addition to all the shiny new ones. ("local" is another one. Don't ask.)

But even ignoring those rusty old features, there are a number of advanced features that can be very handy if you want to learn them. I'll present a few of them in the order of usefulness:

  1. References - the Perl version of pointers. Needed to pass arrays and hashes to functions, and to create nested data structures (like an array of hashes, or a hash of arrays).
  2. Object orientation - the "good" method of modularizing Perl into re-usable components. Introduces some OO semantics.
  3. Threading - I haven't looked at it, but I presume it includes things like mutexes and such.
  4. Formats - useful for generating reports, although I must admit that I've never used them.

As for learning them, I'm fond of O'Reilly's camel book, but there are lots to choose from.

Finally, there is a staggering number of user-submitted packages available at http://www.cpan.org/ . Seriously huge. I don't even know what Galois Field arithmetic is, but CPAN has a module for it. If it's not in CPAN, it's probably NP-complete.
