Friday, March 21, 2008

Perl Mass Substitution

Here's a nice trick I've used many, many times over the years. Say that you have a group of files, and you want to replace some text in every file quickly and easily. Perl makes this really easy, and its ubiquity pretty much guarantees it's around if you're on a *nix platform. For the sake of example, we'll pretend we have a group of html files, and we want to change every occurrence of index.html to index.php. The syntax is as follows:

[edited - thanks Chris!]

perl -pi.bak -e 's/index\.html/index.php/gi' *.html

The command line options provided break down as follows:

-p wraps your code in an implicit while loop over each line of every file and prints each line after your code has run
-i specifies that you want in-place substitution on the file (no redirection of STDOUT required)
-i also takes an optional argument of a backup file extension (.bak in this case)
-e tells perl to run the code that follows on the command line instead of reading it from a script file

From there, you're using standard Perl regular expressions. Don't worry if you're not a regexp guru. In the simplest form, you can simply replace one string with another.

The Perl code breaks down as follows:

s(means substitute)/string you want to match/replacement string/
(g means replace every instance on a line, not just the first)
(i means case-insensitive matching... aka ignore uppercase and lowercase differences)

Of course if you know regular expressions, you can do all kinds of fancy stuff.

Here's a nice reference.

3 comments:

Salvator said...

Thanks a lot :) I've used Perl for over 5 years and every day I find something new

Unknown said...

While it rarely causes problems when you're matching filenames, the regex 'index.html' actually matches any string containing the letters 'index', followed by any single character, followed by the letters 'html'. You would want to use the escaped '\.' to represent an actual dot character (so the regex would become 'index\.html').

For instance, 'index.html' as a regex would match 'index html' (since the space qualifies as "any character"). So if you had some method with comments reading "this method will index html pages in a directory..." or something, this comment would change to "this method will index.php pages in a directory..."

Use the regex but use it mindfully! :D

Travis Whitton said...

Ahh, good point. The often abused dot matches (almost) anything. In an unmodified Perl regexp, it will match any character except a newline. If you add the `s' modifier to the end of your regexp (i.e., /foo/s), it will match newlines as well, making it the most permissive metacharacter known to man.

For the sake of a simple example, the loose match is probably ok, but best practices definitely dictate a literal period as Chris said.