Perl is an interpreted scripting language with high-level support for text processing, file/directory management, and networking. Perl originated on Unix but as of 1997 has been ported to numerous platforms including the Win32 API (on which Win95/NT are based). It is the de facto language for CGI scripts. If I had to learn just one scripting language, it would be Perl.
This document is not meant to be a thorough reference manual; instead, see the concisely written manual pages ("man pages") or buy the Perl book (Programming Perl, 2nd Edition, by Wall, Christiansen and Schwartz, ISBN 1-56592-149-6) [Note: Like the K&R book on C, this definitive reference on a popular language is dense and insightful, but not for all tastes]. This document attempts to get an experienced programmer unfamiliar with Perl up to speed as quickly as possible on the most commonly used features of Perl. For the experienced Perl programmer looking for a reference, I recommend Perl in a Nutshell, by Ellen Siever, Stephen Spainhour and Nathan Patwardhan, ISBN 1-56592-286-7.
I am willing to sacrifice 100% correctness if there is a much simpler view that is correct 99% of the time. There are several reasons for taking this approach (I need to finish this paragraph).
My Perl programming philosophy emphasizes reuse and clarity over brevity. I happily acknowledge that much of the Perl code presented could easily be written in half the number of lines and with greater efficiency. In particular, I avoid implicit variables such as @_ whenever possible.
The latest version of this document can be found at http://www.best.com/~quong/perlin20/ . Additionally there are gzip'ped 2-up (US letter) PostScript and 2-up (US letter) PDF versions. Have at them.
License/use: You are free to reproduce/redistribute this document in its entirety in any form for any use so long as (i) this license (what you are reading right now) is maintained, and (ii) you make no claims about the authorship. I, Russell Quong, have copyrighted this document. I would appreciate notification of any large scale reproduction and/or feedback.
As of Jun 1999, this document is fairly complete; continued work will be infrequent with updates every 4-10 months. Thanks to all who have pointed out errors.
This document covers Perl version 5. If you have an older version, upgrade immediately. Run perl -v to see the version. As of 6/2000, Perl 5.6 is the latest Unix and Win32 version and is available at http://www.perl.com . (Version 5.005 was out by 9/98 and version 5.004 was available by 2/98.) I used 5.003 when initially writing this document in 4/98.
Before version 5, Perl was a cryptic language, due in large part to its use of variables. In version 4, most built-in variables were named via single punctuation symbols, such as $] and $_ and, even worse, most statements operated on an implicit variable named $_ (yes, the variable named underscore) to increase brevity. In Perl 5, first released in October 1994, most of the built-in variables have descriptive English names (via use English) and all statements can be rewritten to show explicitly the variables being used.
Check http://www.perl.com and/or CPAN (the Comprehensive Perl Archive Network) for any Perl-related binaries, material, documentation, source or modules. If anything, there is too much information at CPAN. CPAN is mirrored at many (over 40) different sites. Pick one near you.
Perl is a polymorphic, interpreted language with built-in support for textual processing, regular expressions, file/directory manipulation, command execution, networking, associative arrays, lists, and dbm access. We next present three increasingly complicated examples using Perl.
In some cases, a script is not needed. For example, I often want to replace all occurrences of a regex (regular expression) FROM with a new value TOX in one or more files FILESX. Here's the command:
## replace every FROM with TOX in all files FILESX, overwriting originals
## (the trailing g replaces all occurrences on a line, not just the first)
% perl -p -i -e "s/FROM/TOX/g;" FILESX
## Same as above; use when FROM or TOX contains a '/' but not a '@'
% perl -p -i -e "s@FROM@TOX@g;" FILESX
## replace every FROM with TOX in all files FILESX, saving originals with a .bak suffix
% perl -p -i.bak -e "s/FROM/TOX/g;" FILESX
Sometimes you need a simple throw-away script to do a task once or twice, in which case the full-blown script in the next section is just too much. The following script oneShot.pl reads all files specified as command line arguments and prints out each line preceded by the file name and the line number. You may need to make the script file executable (via the Unix command chmod 755 oneShot.pl) first.
To run the script type
% oneShot.pl input-file(s)
or
% perl -w oneShot.pl input-file(s)
#! /usr/bin/perl -w
use English;

sub main () {
    my($filename, $line, $lineno) = ("f-not-set", undef, 0);   # local vars
    ## <> returns one-by-one every line of all files in @ARGV
    while ( defined($line=<>) ) {
        if ($ARGV ne $filename) {       # detect when we switch files
            $lineno = 0;                # reset the line number
            $filename = $ARGV;          # $ARGV = current file name
        }
        $lineno ++;                     # increment the line number
        chomp($line);                   # strip off newline from the line
        print "file=$filename, $lineno: line=($line)\n";
    }
}

main();
0;
We present a non-trivial prototype Perl script that illustrates many common Perl script operations, including handling command-line flags, opening input and output files, reading a file line by line, matching lines against a regex, and sorting an array.
If this script is too much for your needs, use the simpler one-shot script in the preceding section. Remember, it is much easier to remove parts from a big script than to add to a small script. (Retrospective: even after writing this prototype script, I resisted using it because it seemed too long, but in most cases I ended up cutting/pasting from it to my new script; since then, I just start with this script and whittle away.)
By breaking each of the major steps into a separate function, you can modify this prototype script for your needs with minimal changes. Although this script is long, it should be fairly easy to read.
This example script proto-getH1.pl extracts and then sorts (alphabetizes) all the high-level headings from one or more HTML files, by looking for lines that contain
<Hn> ... </Hn>
This script proto-getH1.pl is run via:
% perl -w proto-getH1.pl [-o outputfile] input-file(s)
or
% proto-getH1.pl [-o outputfile] input-file(s)
All HTML headers are sent to the output file, which is stdout by default, or the file specified after the -o command line flag.
#! /usr/bin/perl -w

# Example perl file - extract H1,H2 or H3 headers from HTML files
# Run via:
#   perl this-perl-script.pl [-o outputfile] input-file(s)
# E.g.
#   perl proto-getH1.pl -o headers *.html
#   perl proto-getH1.pl -o output.txt homepage.htm
#
# Russell Quong 2/19/98

require 5.003;          # need this version of Perl or newer
use English;            # use English names, not cryptic ones
use FileHandle;         # use FileHandles instead of open(),close()
use Carp;               # get standard error / warning messages
use strict;             # force disciplined use of variables

## define some variables.
my($author) = "Russell W. Quong";
my($version) = "Version 1.0";
my($reldate) = "Jan 1998";

my($lineno) = 0;                # variable, current line number
my($OUT) = \*STDOUT;            # default output file stream, stdout
my(@headerArr) = ();            # array of HTML headers

# print out a non-crucial for-your-information message.
# By making fyi() a function, we enable/disable debugging messages easily.
sub fyi ($) {
    my($str) = @_;
    print "$str\n";
}

sub main () {
    fyi("perl script = $PROGRAM_NAME, $version, $author, $reldate.");
    handle_flags();
    # handle remaining command line args, namely the input files
    if (@ARGV == 0) {           # @ARGV used in scalar context = number of args
        handle_file('-');
    } else {
        my($i);
        foreach $i (@ARGV) {
            handle_file($i);
        }
    }
    postProcess();              # additional processing after reading input
}

# handle all the arguments, in the @ARGV array.
# we assume flags begin with a '-' (dash or minus sign).
#
sub handle_flags () {
    my($oname) = undef;
    # loop with while, not foreach, because we shift @ARGV inside the loop
    while (@ARGV > 0 && $ARGV[0] =~ /^-o/) {
        shift @ARGV;                # discard ARGV[0] = the -o flag
        $oname = $ARGV[0];          # get arg after -o
        shift @ARGV;                # discard ARGV[0] = output file name
        $OUT = new FileHandle "> $oname";
        if (! defined($OUT) ) {
            croak "Unable to open output file: $oname. Bye-bye.";
        }
    }
}

# handle_file (FILENAME);
#   open a file handle or input stream for the file named FILENAME.
#   if FILENAME == '-' use stdin instead.
sub handle_file ($) {
    my($infile) = @_;
    fyi(" handle_file($infile)");
    if ($infile eq "-") {
        read_file(\*STDIN, "[stdin]");  # \*STDIN=input stream for STDIN.
    } else {
        my($IN) = new FileHandle "$infile";
        if (! defined($IN)) {
            fyi("Can't open spec file $infile: $!\n");
            return;
        }
        read_file($IN, "$infile");      # $IN = file handle for $infile
        $IN->close();                   # done, close the file.
    }
}

# read_file (INPUT_STREAM, filename);
#
sub read_file ($$) {
    my($IN, $filename) = @_;
    my($line, $from) = ("", "");
    $lineno = 0;                        # reset line number for this file
    while ( defined($line = <$IN>) ) {
        $lineno++;
        chomp($line);                   # strip off trailing '\n' (newline)
        do_line($line, $lineno, $filename);
    }
}

# do_line(line of text data, line number, filename);
#   process a line of text.
sub do_line ($$$) {
    my($line, $lineno, $filename) = @_;
    my($heading, $htype) = undef;
    # search for a <Hx> .... </Hx> line, save the .... in $heading.
    # where Hx = H1, H2 or H3.
    if ( $line =~ m:<(H[123])>(.*)</H[123]>:i ) {
        $htype = $1;            # either H1, H2, or H3
        $heading = $2;          # text matched by the second parentheses in the regex
        fyi("FYI: $filename, $lineno: Found ($heading)");
        print $OUT "$filename, $lineno: $heading\n";

        # we'll also save all the headers in an array, headerArr
        push(@headerArr, "$heading ($filename, $lineno)");
    }
}

# print out headers sorted alphabetically
#
sub postProcess() {
    my(@sorted) = sort { $a cmp $b } @headerArr;    # example using sort
    print $OUT "\n--- SORTED HEADERS ---\n";
    my($h);
    foreach $h (@sorted) {
        print $OUT "$h\n";
    }
    my $now = localtime();
    print $OUT "\nGenerated $now.\n";
}

# start executing at main()
#
main();
0;              # return 0 (no error from this script)
Perl has syntax similar to C/C++/Java for control constructs such as if, while, and for statements. The following table compares the control constructs between C and Perl. In Perl, the values 0, "0", "" (the empty string), and the undefined value are false; any other value is true when evaluating a condition in an if/for/while statement.
same/diff | C | Perl (braces required) |
same | if () { ... } | if () { ... } |
diff | } else if () { ... } | } elsif () { ... } |
same | while () { ... } | while () { ... } |
diff | do { ... } while (); | do { ... } while (); (See below) |
same | for (aaa;bbb;ccc) { ... } | for (aaa;bbb;ccc) { ... } |
diff | N/A | foreach $var (@array) { ... } |
diff | break | last |
diff | continue | next |
similar | 0 is FALSE | 0, "0", and "" are FALSE |
similar | != 0 is TRUE | anything not false is TRUE |
Note in Perl, the curly braces around a block are required, even if the block contains a single statement. Also you must use elsif in Perl, rather than else if as shown below.
if ( conditionAAA ) { ... } elsif ( conditionBBB ) { ... } else { ... }
Finally, although do { body } while (...) is legal Perl, it is not an actual loop construct in Perl. Instead, it is the do statement with a while modifier. In particular, last and next will not work inside the body.
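As a minimal sketch of one common workaround, assuming you want a body that runs at least once but can still exit early, a plain while (1) loop lets last and next behave normally. The function name first_negative is hypothetical, made up for this illustration.

```perl
use strict;
use warnings;

# Return the index of the first negative number in the list, or -1 if none.
# The body of a "while (1)" loop always runs at least once, like
# do { } while, but "last" and "next" work inside it.
sub first_negative {
    my(@nums) = @_;
    my $i = 0;
    my $found = -1;
    while (1) {
        last if ($i >= @nums);      # ran off the end of the list
        if ($nums[$i] < 0) {
            $found = $i;
            last;                   # early exit works here
        }
        $i++;
    }
    return $found;
}

print first_negative(3, 7, -2, 9), "\n";    # prints 2
```

The price is that the exit test has to be written explicitly inside the body.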
There are four types of data in Perl: scalars, arrays, hashes and references. Scalars and arrays are ubiquitous (used everywhere). Hashes are common in large programs and not unusual in smaller programs. References are scalars that point to other data; that is, a reference is a pointer. References are an advanced topic and can be ignored initially; there is sparse coverage of references later in this document. In the following listing, the initial symbol is the context specifier for that type.
$ (scalar): A scalar is a single string or numeric value. More advanced scalar types include references and typeglobs.
@ (array): A list or array is a one-dimensional vector of zero or more scalars. Arrays/lists are indexed as arrays via [ ]; the starting index is 0, like C/C++. The Perl reference documentation intermixes the terms list and array freely; so shall we.
% (hash): A hash is a list of (key, value) pairs, in which you can search for a particular key efficiently. In practice, a hash is implemented via a hash table, hence the name.
\ (reference): A reference refers to another value, much like a pointer in C/C++ refers to some other value.
A scalar holds a single value; an array or list holds zero or more values. The scalar types in Perl are string, number, and reference[Note: There is also a symbol table entry scalar type, poorly named a typeglob in Perl, but you are not likely to use it initially]. Like awk, a scalar data value in Perl contains either a string or a (floating point) number. For reference we create scalars of all four types.
$numx = 3.14159;                # numeric
$strx = "The constant pi";      # string
$refx = \$numx;                 # reference
$tglobx = *numx;                # typeglob (different from file name globbing)
A numeric value is a real or floating point value and can use any of the standard C specifications, e.g. 1.2 or 12e-1.
A string value is enclosed in matching single or double quotes. Within double quotes, variable references (but not expressions involving operators) are evaluated, like shells (csh,sh); within single quotes nothing is evaluated. Double quotes are especially convenient when printing out values.
$i = 123;
print('i = $i\n');                      # print: i = $i\n
print("i = $i\n");                      # print: i = 123
print("i = $i+4\n");                    # print: i = 123+4
print("i = " . ($i+4) . "\n");          # print: i = 127
print("i = " . $i+4 . "\n");            # print: 4 (may get warnings)
print((("i = " . $i) + 4) . "\n");      # print: 4 (same as previous)
Perl automatically converts from string to number or vice versa as needed, based on the operation being done. Below, + is arithmetic plus and . is string concatenation.
$pi = "3.14";
$two_pi = 2 * $pi;      # $two_pi = 6.28
$pi_pi = $pi . $pi;     # $pi_pi = "3.143.14"
The following table shows that a non-numeric string value is viewed as 0 (zero), and a numeric value viewed as a string is the ASCII representation of the number.
Type of $x | (Value of) $x | $x+1 | $x . "::" | if ($x) |
string | "abc" | 1 | abc:: | true |
number | 3 | 4 | 3:: | true |
string | "45.0" | 46 | 45.0:: | true |
number | 0 | 1 | 0:: | false |
string | "" | 1 | :: | false |
undefined | undef | 1 | :: | false |
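The rows of this table can be reproduced with a short sketch. The helper convert_row is a hypothetical name, and warnings are switched off inside it only because values such as "abc" and undef would otherwise trigger "isn't numeric" and "uninitialized" warnings.

```perl
use strict;
use warnings;

# Return ($x + 1, $x . "::", "true" or "false") for a scalar $x.
sub convert_row {
    my($x) = @_;
    no warnings;                        # silence numeric/uninitialized warnings
    my $plus   = $x + 1;                # $x used as a number
    my $concat = $x . "::";             # $x used as a string
    my $truth  = $x ? "true" : "false"; # $x used as a condition
    return ($plus, $concat, $truth);
}

my $x;
foreach $x ("abc", 3, "45.0", 0, "") {
    my($plus, $concat, $truth) = convert_row($x);
    print "'$x': $plus, $concat, $truth\n";
}
```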
Because strings are converted to numbers on demand and vice versa, there is no practical difference between a number and its string equivalent. Thus, in the following statements $i and $j are assigned the same value.
$i = 3;             # same as $i = "3"
$j = "3";           # same as $j = 3
$k = $i + $j;       # $k = 6
$s = $i . $j;       # $s = "33"
$f = "3.0";         # not the same as "3" as $f . 1 would give "3.01"
A scalar variable that has a valid string or numeric value, such as 4.3 or "hello" or even "" (the empty string), is defined. In contrast, a variable without a valid value is undefined. The builtin value undef represents this undefined value, much like NULL in C/C++, null in Java or nil in Lisp/Ada are undefined values. In the Perl versions covered here, an array is defined if it has previously held data: a never-assigned empty array () is undefined; all other array values are considered defined. Use the defined() function to test if a variable is defined. (Calling defined() on a whole array was later deprecated and is a fatal error as of Perl 5.22; to test whether an array is empty, use it in scalar context, as in if (@array).)
## defined(@array) only works in older Perls (it is a fatal error as of 5.22)
my $emptystr = "";
my(@nonemptylist) = ( undef );
if ( defined($emptystr) && defined(@nonemptylist) ) {
    print "will see this\n";
}
my $invalid;
my(@emptylist) = ();
if ( defined($invalid) || defined(@emptylist)) {
    print "will NOT see this\n";
}
@emptylist = (1, 2);
@emptylist = ();
if ( defined(@emptylist)) {
    print "emptylist is empty but is defined now\n";
}
If you read or access an undefined variable var as a string or number, you get the undefined value, which is then converted to "" or 0. Thus an undefined variable is considered false.
An entry for a key KKK in a hash can contain the undefined value. This situation is different from the key KKK not existing in the hash. Use the Perl functions exists and defined to distinguish the two cases.
sub hashdefined () {
    my(%hhh);
    $hhh{"red"} = undef;
    if (! exists $hhh{"nowhere"} ) {
        print "key nowhere is not in hash hhh.\n";                  # YES
    }
    if (! exists $hhh{"red"} ) {
        print "key red is not in hash hhh.\n";                      # NOPE
    }
    if (exists $hhh{"nowhere"} && ! defined($hhh{"nowhere"}) ) {
        print "key nowhere exists but has the undefined value.\n";  # NOPE
    }
    if (exists $hhh{"red"} && ! defined($hhh{"red"}) ) {
        print "key red exists but has the undefined value.\n";      # YES
    }
}
Most Perl operators, such as + or < or . work either on numbers or on strings but not both.
Description | string op | numeric op |
equality | eq | == |
inequality | ne | != |
ternary compare | cmp | <=> |
concatenation | . (a dot) | N/A |
arithmetic | N/A | +, -, *, / |
relational | lt, le, gt, ge | <, <=, >, >= |
Apart from <=>, the numeric operators above are the ANSI C operators.
ASCII strings are ordered character by character based on the underlying ASCII value. For purely alphabetic strings, this results in normal alphabetization, as A < B < ... < Z < a < b < ... < z. In general, strings are ordered using the locale's collating order. The tri-valued compare operations xx cmp yy and xx <=> yy return -1, 0, or 1 if xx is less than, equal to, or greater than yy, for strings and numbers respectively; these operators are commonly used as sort comparison functions.
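To see why the choice between cmp and <=> matters in a sort, here is a small sketch that sorts the same list both ways. The function names sort_as_strings and sort_as_numbers are hypothetical, chosen for this example.

```perl
use strict;
use warnings;

# Sort using string comparison (cmp).
sub sort_as_strings {
    my(@vals) = @_;
    return sort { $a cmp $b } @vals;
}

# Sort using numeric comparison (<=>).
sub sort_as_numbers {
    my(@vals) = @_;
    return sort { $a <=> $b } @vals;
}

my(@vals) = (10, 9, 100, 1);
my(@byString) = sort_as_strings(@vals);
my(@byNumber) = sort_as_numbers(@vals);
print "cmp: @byString\n";       # prints cmp: 1 10 100 9
print "<=>: @byNumber\n";       # prints <=>: 1 9 10 100
```

With cmp, "10" sorts before "9" because the character "1" precedes "9".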
A list/array is a one-dimensional vector that holds zero or more values. To Perl, lists and arrays are identical, and we shall use the terms interchangeably, using the poor justification that the existing documentation does so, too. In Perl, a list/array value is denoted by scalars enclosed in parentheses. Arrays can be indexed; like C/C++/Java, the first element has index 0.
@fib = (0, 1, 1, 2, 3, 5);
@mixed = ("quiet", +4, 3.14, "hot dog");
@empty = ();
@emptyAlso = ( (), (), () );
$five = pop @fib;       # $five = 5; @fib is now (0, 1, 1, 2, 3)
$three = $fib[4];       # $three = 3
The length or size of an array can be obtained in two different ways.
$len = @array;              ## need SCALAR CONTEXT. Number of items in the array.
$last_index = $#array;      ## index of last element in the array.
Finally, here are four ways to iterate through an array, @arr. In this example, we simply print out each element. For accessing each element, I prefer foreach; if the index is needed too, I use the second method.
my $item;
foreach $item (@arr) {                  ## cleanest, but no index
    print $item;
}
my $i;
for ($i=0; $i<@arr; $i++) {             ## just like C
    print $arr[$i];
}
for (my $i=0; $i<@arr; $i++) {          ## In v 5.004, 'my' inside for
    print $arr[$i];
}
my $j;
for ($j=0; $j<=$#arr; $j++) {           ## I don't use this much
    print $arr[$j];
}
The next block shows some common array operations. Push and pop add/remove elements at the right-end of the array. We show how to construct the list ("one1", "two2", "three3", "four4") in the following steps.
@list = ("one1");
push(@list, "two2");
$list[2] = "three3";
$nelements = @list;                 # get three, as there are three elements
$list[$nelements] = "four" . "4";
Perl automatically and dynamically enlarges an array so you do not have to predeclare the size of an array. However, if you know you will need a very large array, largeArr, you can pre-allocate space by assigning to $#largeArr. Pre-allocating is slightly more efficient, but potentially wastes a lot of space, and should only be done for arrays bigger than 16K elements.
$#largeArr = 987654; ## preallocate 987K worth of space.
A hash variable stores an array of (key, value) pairs, collectively known as a map. Typically, the key and value are different but related values, such as a person's name and phone number. A hash is implemented in Perl so that you can quickly look up the value given the key, even when there are many (key, value) pairs. From an algorithms/data structures standpoint, a Perl hash implements a dictionary, most likely using a hash table.
For example, given the name of a state, such as california, I want the Postal abbreviation, CA. We define, initialize, and modify a hash, %abbrevTable, as follows.
my(%abbrevTable) = (                # this is the initialization syntax.
    "california" => "CA",           # key = california, value = CA
    "oregon" => "OR",
);
sub printAbbrev($) {
    my($state) = @_;
    if (exists $abbrevTable{$state}) {
        print "Abbreviation for $state = $abbrevTable{$state} \n";
    } else {
        print "No known abbreviation for $state\n";
    }
}
sub hashdemo () {
    printAbbrev("arizona");             # no such key
    $abbrevTable{"arizona"} = "AZ";     # add a new (key, value) pair
    printAbbrev("arizona");             # this will succeed
}
Calling the function hashdemo() gives
No known abbreviation for arizona
Abbreviation for arizona = AZ
Note that we use the exists $hash{$key} syntax to test if a key exists in the hash table. Also a hash is asymmetric in that we can look up entries based on the key, not the value.
If treated as a normal array/list, a hash will appear as
(keyA, valueA, keyB, valueB, keyC, valueC, ... ).
The order of the keys will appear random [Note: The key order is based on the underlying hash function being used; we are simply listing the hash table buckets.].
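Because the key order is unpredictable, a common idiom is to sort the keys before walking the hash. The following is a minimal sketch; sorted_pairs is a hypothetical helper, and it also shows a hash being passed to a function as a flattened (key, value, ...) list.

```perl
use strict;
use warnings;

# Return "key=value" strings in alphabetical key order.
sub sorted_pairs {
    my(%h) = @_;            # the hash arrives flattened as (key, value, ...)
    my(@out) = ();
    my $key;
    foreach $key (sort keys %h) {
        push(@out, "$key=$h{$key}");
    }
    return @out;
}

my(%abbrev) = ("california" => "CA", "oregon" => "OR", "arizona" => "AZ");
print join(", ", sorted_pairs(%abbrev)), "\n";
# prints arizona=AZ, california=CA, oregon=OR
```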
Declare local variables using my(var-name[s]) = initial-vals, which evaluates initial-vals in list context, or my scalar-var = initial-val, which evaluates initial-val in scalar context. A local variable only exists in, and hence can only be used in, the function (or block) where it was declared.
sub some_function () {
    my(@copyOfARGV) = @ARGV;            # array local variable
    my($i, $mesg) = (0, "hi");          # local variables for some_function
    for ($i = 0; $i < @ARGV; $i++) {    # $i is an index into @ARGV
        my $arg = $ARGV[$i];            # $arg only exists in the for loop
    }
    print $arg;         # Arghh. ERROR, $arg does not exist here.
}
In older Perl code, you may see the local keyword instead of my. If in doubt, use my instead of local [Note: There are advanced situations, beyond the scope of this document, where local must be used.]. A local variable is dynamically scoped [Note: With dynamic scoping, we use the variable in the closest function-call stack frame, which means that the same line of code might use different non-local variables, depending on the function call nesting.]; a my variable is statically scoped, which is faster and almost certainly what you want. For example, C/C++/Java use static scoping.
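The difference between the two scoping rules is easiest to see when a function called from inside the scope reads the variable. This is only a sketch: the variable $level and the three function names are made up for illustration, and the our declaration assumes Perl 5.6 or newer.

```perl
use strict;
use warnings;

our $level = "global";      # package variable; local() only applies to these

# Reads whichever $level is current at the time of the call.
sub show_level {
    return $level;
}

sub with_local {
    local $level = "local"; # dynamically scoped: callees see this value
    return show_level();
}

sub with_my {
    my $level = "my";       # statically scoped: invisible to show_level()
    return show_level();
}

print with_local(), "\n";   # prints local
print with_my(), "\n";      # prints global
print show_level(), "\n";   # prints global (the local value was restored)
```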
A bareword is an unquoted literal not used as a variable or function name. Barewords are used mainly for labels and for filehandles [Note: and for package names, but this is an advanced topic]. The following code snippet shows three barewords: A_FILE_HANDLE, bare and bareword. Filehandles are uppercase to avoid naming conflicts, and to follow the normal Perl naming convention. (If you use the FileHandle package, you don't need to make your own file handles.)
open(A_FILE_HANDLE, "./perlscript.pl");
bare: while ($line = <A_FILE_HANDLE>) {
    bareword: while ($line[$i] ne "") {
        if ($line[$i] =~ /\s*#/) {
            next bare;
        }
        $i++;           # advance, so that the inner loop terminates
    }
}
A bareword not used as a filehandle or label, and which is not a known function, is viewed as a string constant.
$str = hi;      # AVOID. Use of bareword hi, same as "hi".
$str = "hi";    # same, but much easier to read.
We advise against using barewords as strings, since it impedes clarity, as function calls are typically barewords. Instead, put your strings in double quotes, which is standard across most languages.
A context specifier, which is one of the characters $, @, %, must be used before all variable references. The context indicates the kind of value that will be used or assigned. The context is not part of a variable name. Consider the following assignment statements.
$eight = 8;                 # numeric scalar
@nulllist = ();             # null or empty list.
$four = $eight / 2;         # scalar arithmetic
@cubes = (1, 8, 27, 64);    # assign an entire array/list.
$eight = $cubes[1];         # huh? cubes is an array, why not @cubes[1].
The $ specifier in the statement ... = $varX ... means that we expect to read a scalar value from a variable named varX. Thus, Perl uses the scalar variable named varX. Similarly, the @ specifier in ... = @varX means that we expect to read an array/list value from a variable varX; Perl uses the array/list variable varX.
While it might seem that the $ and the @ are part of the variable names in $varX and @varX, this view is wrong. In reality, there are two different variables, each named varX; one is a scalar, the other an array. In an expression like varX[...], because array subscripting is used, Perl selects the array variable. The last statement in the preceding example, $eight = $cubes[1];, illustrates this rule, as we precede the array variable cubes by a $.
An expression like @aaa = @bbb[$ccc] means that we expect the element bbb[$ccc] to produce a list/array value, which is probably wrong thinking. Since Perl array elements must be scalars, @bbb[$ccc] results in a one-element array containing $bbb[$ccc], namely ( $bbb[$ccc] ). [Note: If $bbb[$ccc] is undefined, we get the array ( undef ).]
In an expression like ... = $varX[kk], we first interpret the array brackets, which means varX must be an array. We get the kk-th element. Finally the leading $ specifier indicates we expect this element to be a scalar value.
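The following sketch contrasts a single element ($ specifier) with an array slice (@ specifier) taken from the same array. The helper names second_element and middle_two are hypothetical.

```perl
use strict;
use warnings;

# $ specifier: one scalar element of the array.
sub second_element {
    my(@arr) = @_;
    return $arr[1];
}

# @ specifier: an array slice, that is, a list of the selected elements.
sub middle_two {
    my(@arr) = @_;
    return @arr[1, 2];
}

my(@cubes) = (1, 8, 27, 64);
my $eight = second_element(@cubes);
my(@middle) = middle_two(@cubes);
print "$eight\n";       # prints 8
print "@middle\n";      # prints 8 27
```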
What happens if the LHS and RHS contexts do not match in an assignment statement? Perl uses the following rules which are often convenient but sometimes unexpected.
Value assigned to LHS in LL = RR
LHS | Scalar $RR = "hi" | List @RR = (1, 4, 9) | Hash %RR = ("one" -> 1, "two" -> 2) |
scalar, $LL | "hi" | 3 [arr length] | 1/8 [used/alloc buckets] |
list, @LL | ("hi") | (1, 4, 9) | ("one", 1, "two", 2) |
hash, %LL | ("hi" -> undef) [odd-count warning] | (1 -> 4, 9 -> undef) [odd-count warning] | ("one" -> 1, "two" -> 2) |
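A few of these conversions can be checked directly. This is a sketch with hypothetical helper names (count_of, as_list, pair_count), covering only cases whose results do not depend on the Perl version.

```perl
use strict;
use warnings;

# scalar = array: the scalar receives the number of elements.
sub count_of {
    my(@list) = @_;
    my $count = @list;          # scalar context on the LHS
    return $count;
}

# list = scalar: the list receives a single element.
sub as_list {
    my($value) = @_;
    my(@list) = $value;
    return @list;
}

# list = hash: the list receives the flattened (key, value) pairs.
sub pair_count {
    my(%hash) = @_;
    my(@flat) = %hash;
    return scalar(@flat);
}

my(@one) = as_list("hi");
print count_of(1, 4, 9), "\n";                      # prints 3
print scalar(@one), "\n";                           # prints 1
print pair_count("one" => 1, "two" => 2), "\n";     # prints 4
```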
Variables of different types (scalar, list, hash) can have the same name, because each type has its own namespace. Thus, the following code refers to three different variables, so that no data values are overwritten.
$xyz = "my foot";                               # scalar mode variable
@xyz = ("tulip", "rose", "mum is the word");    # list mode variable
$xyz{$xyz} = $xyz[1];                           # $xyz{"my foot"} = "rose";
Even the Perl book is misleading as it states that "all variable names start with a $, @, or %" (page 37), which would imply that $cubes[1] is using the $cubes variable, which is incorrect. (It is accurate to say that all variable uses begin with a $, @ or a %.)
The condition of an if-statement or while-loop is evaluated in scalar context. Thus it is acceptable and indeed common Perl programming practice to say
if ( @array > 4 ) {     ## @array ==> number of items in it.
    ...
}
Many functions and operators behave differently depending on the context. For example, using my($var) = RHS; produces a list context on the LHS and RHS, because the parentheses denote a list, so RHS will be evaluated in list context. To evaluate RHS in scalar context instead, use my $var = RHS;.
Thus, to get a string of the current time there are several correct ways. We show some commonly encountered cases.
my($now1) = scalar(localtime());    # CORRECT, force scalar context
my $now2 = localtime();             # CORRECT, no parens, scalar context
my($now3);
$now3 = localtime();                # CORRECT
my($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime();  # OK
my($nowWRONG) = localtime();        # WRONG, list context, get $sec
Use the scalar(...) function to force scalar context. Use (...) to force array/list context.
$scalarVar = scalar(@arrayVar);     # force scalar context.
my($line) = scalar(<FILE>);         # just read one line from filehandle FILE
Perl functions take a single list/array as a parameter, which naturally handles the case of passing several scalars. Parameters are separated by commas, because they are separate elements of the parameter list/array.
$two = sqrt 4.00;                   # square root of 4
open FILEHANDLE, "input.txt";       # open the file input.txt for reading
$i = index "abcdefg", "cde";        # index of substring cde in abcdefg
print "i = $i \n";                  # print value of i
if (defined $somevar) { ... }       # test if $somevar has been used
You may optionally put parentheses around the arguments, resulting in the standard call syntax of most languages as shown below. I personally prefer using parentheses. However, I prefer no parentheses if the function call is the entire conditional of an if or while statement.
$two = sqrt(4.00);                  # square root of 4
open (FILEHANDLE, "input.txt");     # open the file input.txt for reading
$i = index("abcdefg", "cde");       # index of substring cde in abcdefg
print ("i = $i \n");                # print value of i.
if (defined($somevar)) { ... }      # test if $somevar has been used (ugly)
A few functions, such as print, grep, map, and sort, have secondary syntaxes that require a space, not a comma, after the first parameter. If you use parentheses around the arguments, you must still use a space.
print STDERR "i = $i \n";       # print value of i to STDERR
print(STDERR "i = $i \n");      # print value of i to STDERR
print(STDERR, "i = $i \n");     # (ACK) print 'STDERR' followed by i
Beware that the first set of outermost parentheses fully delimits the parameters, so that subsequent values are not parameters. Whitespace does not affect things.
$ten = sqrt (1+3)*5;        # Ack. same as $ten = (sqrt(4)) * 5;
$ten = 5 * sqrt (1+3);      # Arithmetically the same as preceding.
$n = sqrt ((1+3)*5);        # Good. $n = sqrt (20);
A function definition looks as follows. All the parameters to the function are passed in the @_ list/array. This is one time where use of this cryptic variable cannot be avoided. I always immediately rename the parameters as shown in the prototype code.
sub do_line ($$$) {
    my($line, $lineno, $filename) = @_;
    ...
}
As of Perl 5.002, you can pre-declare the number and types of the function parameters (see Section Prototypes in perlsub) using a function prototype, so that the parameters can be interpreted in a user specified manner. In the function declaration sub do_line ($$$) {, each $ signifies a single scalar parameter. A @ in the parameter list signifies a list; nothing can follow it as the list parameter gobbles up all remaining parameters. Warning: the function prototype for a function fn must be seen before calling fn for Perl to do parameter checking.
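Here is a small sketch of a prototype that mixes a scalar with a trailing list. The function join_with is a hypothetical example, defined before its first call so that Perl has seen the prototype by the time the call is compiled.

```perl
use strict;
use warnings;

# Prototype ($@): one scalar, then a list that absorbs all remaining arguments.
sub join_with ($@) {
    my($sep, @items) = @_;
    return join($sep, @items);
}

# The definition above precedes this call, so the call is checked
# against the prototype.
print join_with("-", "a", "b", "c"), "\n";      # prints a-b-c
```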
A Perl function can return any type of value including a scalar, an array, or nothing (void). Unfortunately, the return type of a function cannot be specified in the function prototype. If a function returns one type, say an array, and you expect a scalar, Perl will silently do a conversion.
You can write functions that return different types based on the expected return type (known as the calling context) by using the wantarray function. For example,
sub scalarOrList () {
    return wantarray ? ("red", "green", "blue") : 88;
}
...
$i = scalarOrList();        # scalar context, get 88
@color = scalarOrList();    # list context, get ("red", "green", "blue")
If a function takes optional trailing parameters, they are declared and fetched as follows.
# called as:
#   dieMessage("Whoops, that hurt.");       # one parameter
#   dieMessage("Whoops, that hurt.", 0);    # two parameters
#
sub dieMessage ($;$) {
    my($message) = shift @_;
    my($shouldDie) = (@_ > 0) ? shift @_ : 1;   ## 1 = default value if no param
}
In regular expressions, Perl understands the following convenient character set symbols, each of which matches a single character. Thus, to handle arbitrary blank space you must use \s+. You may use these symbols in a character set. For example, when looking for a hex digit you might look for [a-fA-F\d]. Also, the term regex is short for regular expression.
Symbol | Equiv | Description |
\w | [a-zA-Z0-9_] | A "word" character (alphanumeric plus "_") |
\W | [^a-zA-Z0-9_] | Match a non-word character |
\s | [ \t\n\f\r] | Match a whitespace character |
\S | [^\s] | Match a non-whitespace character |
\d | [0-9] | Match a digit character |
\D | [^0-9] | Match a non-digit character |
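As a small sketch of how these symbols are used in practice, the two helpers below (is_hex and words are invented names, not part of Perl) test for a run of hex digits and split a string on arbitrary blank space.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $str consists only of hex digits.
sub is_hex {
    my ($str) = @_;
    return $str =~ /^[a-fA-F\d]+$/ ? 1 : 0;
}

# Hypothetical helper: split on runs of whitespace; \s+ matches
# one or more blanks, tabs, or newlines.
sub words {
    my ($str) = @_;
    return split(/\s+/, $str);
}

print is_hex("1aF9"), "\n";                  # prints 1
print is_hex("12g4"), "\n";                  # prints 0
print join("|", words("a  b\t c")), "\n";    # prints a|b|c
```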
Perl has the standard regex quantifiers or closures, where r is any regular expression.
r* | Zero or more occurrences of r (greedy match). |
r+ | One or more occurrences of r (greedy match). |
r? | Zero or one occurrence of r (greedy match). |
r*? | Zero or more occurrences of r (match minimal). |
r+? | One or more occurrences of r (match minimal). |
r?? | Zero or one occurrence of r (match minimal). |
Let q be a regex with a quantifier. If there are many ways for q to match some text, a greedy quantifier will match (or "eat up") as much text as possible; a minimal quantifier does the opposite and matches as little as possible. If a regex contains more than one quantifier, the quantifiers are "fed" left to right.
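The difference between greedy and minimal matching is easiest to see on a concrete string. In this sketch the helper first_capture is a made-up name; it returns whatever the first parenthesized group of the pattern matched.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text captured by the first group of $re,
# or undef if there is no match.
sub first_capture {
    my ($text, $re) = @_;
    return $text =~ $re ? $1 : undef;
}

my $text = "<b>bold</b> and <i>italic</i>";

# Greedy: .+ eats as much as it can, up to the last '>'.
print first_capture($text, qr/<(.+)>/), "\n";    # prints b>bold</b> and <i>italic</i

# Minimal: .+? stops at the first '>'.
print first_capture($text, qr/<(.+?)>/), "\n";   # prints b
```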
The two main regex operations are searching/finding and substituting. In searching, we test if a string contains a regular expression [Note: "Regex searching" is often incorrectly called "regex matching".]. In substituting, we replace part of the original string with a new string; the new string is often based on the original. Both of these operations use the regular expression operator =~ [Note: =~ is officially called the "binding operator", as there are other non-regex operations that use it.].
Searching: For example, to determine if the string $line contains a recent year such as 1998 or 1983, we use the search operator =~ /.../. Here the slashes '/' delimit or mark the beginning and the end of the regular expression.
if ($line =~ /19[89]\d/) {
    # we found a year in $line
}
In general, to determine if string $var contains the regular expression re, use any of the following forms. If the regular expression contains a slash '/' itself, then you must use the mXreX form, where each X is the same single character not appearing in re. In mX...X, the m stands for "match".
if ($var =~ /re/) { ... }
if ($var =~ m:re:) { ... }       # can replace ':' with any other character
while ($var =~ m/re/) { ... }    # can replace '/' with any other character
To access the substring in $var matched by part of the regular expression re, put that part of re in parentheses. The matched text is accessible via the variables $1, $2, ..., $k, where $k matches the k-th parenthesized part of the regular expression. For example, to break up an e-mail address user@machine in $line we could do
if ($line =~ /(\S+)@(\S+)/) {    # \S = any non-space character
    my($user, $machine) = ($1, $2);
    ...
}
The submatch variables $1, $2, ... $k are updated after each successful regex operation, which wipes out the previous values. If I want the submatch values, I store them into other well-named variables immediately after the regex operation.
Use \k, not $k, in the regular expression itself to refer to a previously matched substring. For example, to search for identical beginning and ending HTML tags <xyz> ... </xyz> on a single line $line use
if ($line =~ m|<(.*)>(.*)</\1>|) {    # search for: <xyz>stuff</xyz>
    my($stuff) = $2;
    ...
}
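Another common use of \1 is finding a word that is accidentally repeated. The following is only a sketch; doubled_word is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first word that appears twice in a row,
# or undef if there is none. \1 refers back to whatever (\w+) matched.
sub doubled_word {
    my ($line) = @_;
    return $line =~ /\b(\w+)\s+\1\b/ ? $1 : undef;
}

print doubled_word("this is is a test"), "\n";   # prints is
```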
Substitution: To replace or substitute text in $var from the regular expression old to new use the following form.
$var =~ s/old/new/;                 # replace old with new
if ($var =~ s:old:new:) { ... }     # can replace ':' with any other character
To use part of the actual text matched by the old regex, the replacement text new can use the $k variables. Taking our previous example involving years, to replace the year 19xy with xy, use
$line =~ s/19(\d\d)/$1/;
Modifiers: When searching or substituting, there are several optional modifiers you can use to alter the regular expression. For example, in if ($var =~ /<title>/i), the i at the end specifies a case-insensitive search. We use m// and s/// to represent searching and substituting.
Option | Where | What |
i | m//, s/// | case insensitive (upper=lower case) pattern |
m | m//, s/// | treat $var as multiple lines |
g | s/// | replace all orig with new. I.e. apply repeatedly. |
g | m// | (Adv) search for all occurrences. On next evaluation, continue where previous search left off. |
s | m//, s/// | (Adv) treat $var as a single line, even if it has embedded '\n' chars |
x | m//, s/// | (Adv) allow extended regex syntax. Ignore spaces in the regex (for readability) |
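To make the g and i modifiers concrete, here is a short sketch. The helper names shorten_years and count_word are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: shorten every year 19xy in $line to xy.
sub shorten_years {
    my ($line) = @_;
    $line =~ s/19(\d\d)/$1/g;     # g: replace all occurrences
    return $line;
}

# Hypothetical helper: count occurrences of $word, ignoring case.
sub count_word {
    my ($line, $word) = @_;
    my $count = 0;
    # g in scalar context: each evaluation resumes after the previous match.
    $count++ while $line =~ /\Q$word\E/gi;
    return $count;
}

print shorten_years("from 1983 to 1998"), "\n";       # prints from 83 to 98
print count_word("Perl, perl, PERL", "perl"), "\n";   # prints 3
```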
The regex operations return different results depending on the context. For clarity, I recommend using the scalar context.
context | return value |
scalar | true, if there was a match (or substitution) |
list/array | list of sub-matches ($1, $2, ...) found in the match |
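The two contexts can be sketched as follows, reusing the e-mail address pattern from earlier. The helpers parse_address and has_address are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical helper: list context returns the sub-matches directly.
sub parse_address {
    my ($line) = @_;
    my ($user, $machine) = $line =~ /(\S+)\@(\S+)/;
    return ($user, $machine);
}

# Hypothetical helper: scalar context only reports whether there was a match.
sub has_address {
    my ($line) = @_;
    return $line =~ /\S+\@\S+/ ? 1 : 0;
}

my ($user, $machine) = parse_address("mail quong\@best.com today");
print "$user at $machine\n";                    # prints quong at best.com
print has_address("no address here"), "\n";     # prints 0
```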
Perl has many built-in functions. There are numerous ways to access documentation about Perl functions; for example, read the perlfunc manual page with
% man perlfunc
Here are some of the more common functions I've used. If a function has additional options, its description starts with a (+).
@arr=split(/[ :]+/, $line); | (+) Split $line into words. Words are separated by spaces or colons (but not tabs). Store words in @arr; spaces and colons are discarded. |
@arr = stat(filename); |
Returns a 13 element list ($dev, $ino, $mode (permissions
on this file), $nlink, $uid, $gid, $rdev, $size (in bytes),
$atime, $mtime (last modification time), $ctime, $blksize,
$blocks) containing information about a file. |
$str = join("::", @arr); |
Concatenate all elements of @arr into a single scalar string;
separate all the elements by a double colon. Useful when printing out
an array. |
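A brief sketch combining split and join from the table above; rejoin is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line on runs of spaces or colons and
# rejoin the words with a double colon.
sub rejoin {
    my ($line) = @_;
    my @arr = split(/[ :]+/, $line);
    return join("::", @arr);
}

print rejoin("root:x:0 0:System Administrator"), "\n";
# prints root::x::0::0::System::Administrator
```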
Perl has several functions which test properties about files. These functions have the name -X, for some character X. (Yes, the function name starts with a dash.) These names mimic the Unix csh and the Unix sh test operations. These functions take a filename or a file handle, as in -X filename.
For example, if you want to run a command /bin/ccc on the data file ../input/ddd, you might want to check if ccc is executable and ddd is readable first.
if ( (-x "/bin/ccc") && (-r "../input/ddd") ) {
    my(@cccout) = `/bin/ccc ../input/ddd`;    # run the command.
} else {
    ... complain ...
}
I give the descriptions directly from the perlfunc manual page, listed from most common to least common, based on my own usage.
-f | File is a plain file. |
-e | File exists. |
-d | File is a directory. |
-l | File is a symbolic link. |
-r | File is readable by effective uid/gid. |
-x | File is executable by effective uid/gid. |
-w | File is writable by effective uid/gid. |
-z | File has zero size. |
-s | File has non-zero size (returns size). |
-o | File is owned by effective uid. |
-R | File is readable by real uid/gid. |
-W | File is writable by real uid/gid. |
-X | File is executable by real uid/gid. |
-O | File is owned by real uid. |
-p | File is a named pipe (FIFO). |
-S | File is a socket. |
-b | File is a block special file. |
-c | File is a character special file. |
-t | Filehandle is opened to a tty. |
-u | File has setuid bit set. |
-g | File has setgid bit set. |
-k | File has sticky bit set. |
-T | File is a text file. |
-B | File is a binary file (opposite of -T). |
-M | Age of file in days when script started. |
-A | Same for access time. |
-C | Same for inode change time. |
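The sketch below strings a few of these tests together. The helper describe is hypothetical, and the scratch file is created with the standard File::Temp module so that the example does not depend on any real path.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: describe a path using a few of the file tests above.
sub describe {
    my ($path) = @_;
    return "missing"   unless -e $path;
    return "directory" if -d $path;
    return "empty"     if -z $path;
    return "file of " . (-s $path) . " bytes" if -f $path;
    return "other";
}

# Create a scratch file holding exactly six bytes.
my ($fh, $fname) = tempfile(UNLINK => 1);
binmode($fh);                 # avoid newline translation on some platforms
print $fh "hello\n";
close($fh);

print describe($fname), "\n";           # prints file of 6 bytes
print describe("$fname.nope"), "\n";    # prints missing
print describe("."), "\n";              # prints directory
```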
When you run a Perl script, perl puts the command line arguments in the global array @ARGV. For example, running the command
% perl somescript.pl -o abc -t one.html two.html
results in
$ARGV[0] | -o |
$ARGV[1] | abc |
$ARGV[2] | -t |
$ARGV[3] | one.html |
$ARGV[4] | two.html |
The prototype code at the beginning of this document shows one way to process @ARGV.
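For readers who want something smaller than the prototype code, here is one possible sketch of walking an argument list by hand. The helper parse_args is a made-up name, and the meanings given to -o (takes a value) and -t (a simple flag) are assumptions invented for this example. It returns references, which are covered later in this document.

```perl
use strict;
use warnings;

# Hypothetical helper: split an argument list into options and file names.
# Assumes -o takes a value and -t is a simple flag.
sub parse_args {
    my (@args) = @_;
    my (%opt, @files);
    while (@args) {
        my $arg = shift @args;
        if    ($arg eq "-o") { $opt{o} = shift @args; }
        elsif ($arg eq "-t") { $opt{t} = 1; }
        else                 { push(@files, $arg); }
    }
    return (\%opt, \@files);
}

my ($opt, $files) = parse_args(qw(-o abc -t one.html two.html));
print "o=$opt->{o} t=$opt->{t} files=@$files\n";
# prints o=abc t=1 files=one.html two.html
```

In a real script you would call parse_args(@ARGV) instead of passing a fixed list.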
See the prototype example for reading/writing from/to a file.
Given a file handle FH from either open() or a new FileHandle, the operation <FH> reads the next line in scalar context or the entire file in list context.
while ( $line = <FILE_DATA> ) {    # read a line at a time.
    if ( $line =~ /keyboard/ ) {
        print $line;
    }
}
my(@whole_file) = <FILE_DATA>;          # be careful, file could be BIG.
my($numlines) = scalar(@whole_file);    # number of lines read
If you only want to read from stdin, use
while ($line = <STDIN>) {    # read a line at a time
    ...
}
But how can we read from a file sometimes and from STDIN at other times in the same Perl script? The routines handle_file() and read_file() in the prototype code show how to read from any input stream, such as a file, stdin (which itself could be a file, the keyboard or a network connection), a network connection, and so on. [Note: An input stream is any source of input data and is a generalization of an input file. In C an input stream is a file descriptor or a FILE* pointer (from stdio.h), such as stdin. In C++ an input stream is an istream, such as cin.] The function handle_file() is a "driver" for read_file() that passes as a parameter either STDIN or a FileHandle input stream to read_file().
In read_file(istream, fname) the first parameter, istream, is the input stream from which we read input data. The second parameter, fname, is the file name, which is used for, say, reporting errors. To pass STDIN as a parameter to read_file(), we use \*STDIN [Note: This is a very advanced topic as we are passing a reference to the typeglob for STDIN.]. Sadly, explaining \*STDIN is beyond the scope of this document.
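The idea of passing an input stream as a parameter can be sketched without the prototype code. The function count_lines below is a hypothetical stand-in, not the author's read_file(); it reads from an in-memory "file" so the example needs no real input.

```perl
use strict;
use warnings;

# Hypothetical function: count the lines on any input stream.
# $istream may be \*STDIN or a handle returned by open().
sub count_lines {
    my ($istream) = @_;
    my $count = 0;
    while (my $line = <$istream>) {
        $count++;
    }
    return $count;
}

# Open a string as if it were a file.
my $data = "one\ntwo\nthree\n";
open(my $fh, "<", \$data) or die "cannot open in-memory data: $!";
print count_lines($fh), "\n";    # prints 3
close($fh);

# To read standard input instead:
#   count_lines(\*STDIN);
```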
(This may or may not work on Win32.) You can run an external command, such as ls -l, by placing it in back quotes (also known as back ticks or grave accents), as in `ls -l`. The returned value is the output the command sends to stdout. In scalar context, you get one big string, with a \n character separating lines; in array context, each output line is a separate array item.
Thus, to see the contents of a tar file, xyz.tar, in Perl, you could do
my(@tarlist) = `tar tfv xyz.tar`;
Commands are run in the current working directory, which is initially the directory where you started the Perl script. You can change the current working directory to DDD by calling the built-in Perl function chdir DDD.
A reference in Perl is equivalent to a pointer in C. Any Perl scalar value/variable can be a reference. The address-of operator in Perl is the \ (backslash); the dereference operator is, sadly and confusingly, the $ (dollar sign). Thus the following lines are equivalent in Perl and C; in both cases we change the value of str from "hi" to "bye" via ptr and we add 5 to the value of num via a pointer. In Perl, we can use the same reference variable ptr because references are not typed; in C we must use different pointers sptr and iptr.
Perl | C/C++ |
$str = "hi"; | char* str = "hi"; |
$ptr = \$str; | char** sptr = &str; |
$$ptr = "bye"; | *sptr = "bye"; |
$num = 4; | int num = 4; |
$ptr = \$num; | int* iptr = &num; |
$$ptr += 5; | (*iptr) += 5; |
In the last line, the double dollar sign $$ptr is pretty ugly; as a notational convenience, for a reference to an array or hash, the postfix -> operator can be used. Thus, to dereference the array reference $arrRef, we can use either $arrRef->[...] or $$arrRef[...].
An analogous notation is used for hashes passed by reference. The following table shows how to use an array/hash versus a reference to it. There should be no surprises for experienced C programmers.
Approach | Var | whole array | k-th item | address-of array |
Normal | @arr | @arr | $arr[k] | \@arr |
Reference | $aref = \@arr | @$aref | $aref->[k] or $$aref[k] | $aref |
Approach | Var | whole hash | key lookup | address-of hash |
Normal | %hash | %hash | $hash{key} | \%hash |
Reference | $href = \%hash | %$href | $href->{key} or $$href{key} | $href |
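The notations in the table can be tried out with a short sketch. The helpers kth_item and hash_size are invented names used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: fetch the k-th item through an array reference,
# using both notations from the table.
sub kth_item {
    my ($aref, $k) = @_;
    my $with_arrow  = $aref->[$k];
    my $with_dollar = $$aref[$k];
    return ($with_arrow, $with_dollar);
}

# Hypothetical helper: count the entries of the hash behind $href.
sub hash_size {
    my ($href) = @_;
    return scalar(keys %$href);
}

my @arr  = (10, 20, 30);
my %hash = (red => 1, green => 2);
my $aref = \@arr;       # reference to the array
my $href = \%hash;      # reference to the hash

print join(" ", kth_item($aref, 1)), "\n";          # prints 20 20
print "$href->{green} $$href{green}\n";             # prints 2 2
print scalar(@$aref), " ", hash_size($href), "\n";  # prints 3 2
```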
I typically pass arrays and hashes as references, as in C/C++, because this method is fast (we only pass a scalar) and it allows the array or hash to be modified. The basic scheme is to declare the formal parameters as scalars; the actual parameters passed are "the-address-of" the array or hash.
# call via:
#   toBeCalled (array-reference, hash-reference);
#
sub toBeCalled ($$) {             # declare params to be scalars
    my($ref2arr, $ref2hash) = @_;
    ...
    $ref2arr->[idx] = ...
    ...
    $ref2hash->{key} = ...
    ...
    foreach my $item ( @$ref2arr ) { ... }
}
sub caller () {
    my(@arr) = ( ... );
    my(%hash) = ();
    ...
    toBeCalled(\@arr, \%hash);
}
Here's an example of a function clearEntry which clears the specified index idx of an array of strings arr and increments the index. Because both variables are modified, they are both passed as references.
sub clearEntry ($$) {
    my($idx, $arr) = @_;
    $arr->[$$idx] = "";
    $$idx ++;
}
sub callClear () {
    my(@stuff) = ("aa", "bb", "cc", "dd");
    my($indexer) = 1;
    print "BEFORE indexer = $indexer " . join(":", @stuff) . "\n";
    clearEntry(\$indexer, \@stuff);
    print "AFTER indexer = $indexer " . join(":", @stuff) . "\n";
}
Calling callClear()
gives
BEFORE indexer = 1 aa:bb:cc:dd
AFTER indexer = 2 aa::cc:dd
There are a variety of other quoting mechanisms as summarized in the table below, which borrows directly from the Section Quote and Quote-like Operators in perlop. Interpolates means that variables are evaluated, which in turn means that all variable references starting with $ or @ are fully evaluated (hashes written with % are not interpolated).
@squares = (0, 1, 4, 9, 16, 25);
$i = 2;
print("i = $i, 3+i = (3+$i)\n");             # print: i = 2, 3+i = (3+2)
print("squares[i+3] = $squares[$i+3]\n");    # print: squares[i+3] = 25
In the first print() statement, the arithmetic expression (3+$i) is not evaluated, because it is not a variable; however, the reference to $squares[$i+3] is fully evaluated.
Customary | Generic | Meaning | Interpolates |
'xxx' | q:xxx: | Literal | no |
"xxx" | qq:xxx: | Literal | yes |
`xxx` | qx:xxx: | Command | yes |
none | qw:xxx: | Word list | no |
/xxx/ | m:xxx: | Pattern match | yes |
none | s:xxx:yyy: | Substitution | yes |
none | tr:xxx:yyy: | Translation | no |
The generic quoting mechanism allows you to delimit a string with arbitrary characters, which is especially convenient when the string contains single and/or double quotes.
$where = "a hot dog stand";
$proverb = 'Don\'t buy sushi from a hot dog stand.';
$proverb = q/Don't buy sushi from a hot dog stand./;
$proverb = q(Don't buy sushi from a hot dog stand.);
$proverb = "Don't buy sushi from $where.";
$proverb = qq/Don't buy sushi from $where./;
$proverb = qq(Don't buy sushi from $where.);
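The table also lists qw and tr, which the examples above do not use. The following sketch shows both; primary_colors and shout are hypothetical helper names.

```perl
use strict;
use warnings;

# Hypothetical helper: qw builds a list of words without quotes or commas.
sub primary_colors {
    return qw(red green blue);
}

# Hypothetical helper: upper-case a string with tr (no interpolation).
sub shout {
    my ($str) = @_;
    $str =~ tr/a-z/A-Z/;
    return $str;
}

my @colors = primary_colors();
print scalar(@colors), " colors: @colors\n";   # prints 3 colors: red green blue
print shout("hot dog"), "\n";                  # prints HOT DOG
```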
You can specify multi-line, verbatim strings, called "here documents", using the << syntax. This syntax originated in the Bourne shell. The following three snippets produce the same output.
sub here_one () {
    my $weather = "sunny";
    print $OUT <<"EOStr";
Oh great. It
is $weather today.
EOStr
}
sub here_two () {
    my $weather = "sunny";
    my $heredoc =<<"EOStr";
Oh great. It
is $weather today.
EOStr
    print $OUT $heredoc;
}
sub no_here () {
    my $weather = "sunny";
    print $OUT "Oh great. It\n";
    print $OUT "is $weather today.\n";
}
In the preceding examples, I use EOStr as a delimiter; as a rule of thumb, the delimiter can be any string that does not appear in the here document. Beware, the syntax is intolerant of extra spaces surrounding the delimiter. In particular, at the start of the here document (i) do not put a space after the <<, and (ii) remember to add a ; (semicolon), and at the end (iii) the delimiter must be on a line by itself without spaces.
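As a self-contained sketch of these rules, the hypothetical function weather_report below builds its result with a here document instead of printing to a file handle.

```perl
use strict;
use warnings;

# Hypothetical function: build a two-line report with a here document.
# Note: no space after <<, a semicolon ends the statement, and the
# closing EOStr sits on a line by itself with no surrounding spaces.
sub weather_report {
    my ($weather) = @_;
    my $text = <<"EOStr";
Oh great. It
is $weather today.
EOStr
    return $text;
}

print weather_report("sunny");
```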
I have no plans to cover these topics in this introductory document. Perhaps in a not-in-the-near future "Reusable Perl code in 10 pages" document.
Revision | When | Description |
2000c | 9 Jun 2000 | Fixed errors (thanks AA and GGS). Added here strings. |
2000b | 19 Apr 2000 | Very minor rewrites. |
1999c | ??? 1999 | Added table of contents (by fixing ltoh ). |
I wrote this document because I wish someone had done so when I was learning Perl. I welcome any constructive feedback on this document.
This document © Russell W Quong, 1998,1999,2000. You may freely copy and distribute this document so long as the copyright is left intact. You may freely copy and post unaltered versions of this document in HTML and Postscript formats on a web site or ftp site. Lastly, if you do something injurious or stupid because of this document, I don't want to know about it. Unless it's amusing.