I want to show, step by step, how certain tasks can be automated easily with just a few tools. Today’s example will be about web automation. Perl will be used, but any other language would do.

To get things done, the important part is to choose the right ready-made components. For Perl, these can be found on the CPAN. When looking for modules, one has to be aware that there are both good and crap modules out there. Check the reverse dependencies, recent releases and maybe the ratings. A quick search-engine check can also tell you whether the module is proven tech.

For web automation, the module needed is WWW::Mechanize. Often it will already be packaged by your distribution: on Debian, install libwww-mechanize-perl, and on openSUSE get the perl-WWW-Mechanize package.

More to learn for SuSE users: for acquiring SuSE packages, the web interface on opensuse.org will find packages in the build service and provides one-click install. Pro-tip for text mode: if you install the osc package, finding Perl packages gets even easier: the command osc se 'perl(WWW::Mechanize)' will find the package containing this Perl module.

The work can begin. First you should know which site to automate. Often, this involves filling out forms. Open the site in your web browser and open a terminal. WWW::Mechanize comes with a little tool to help us get started: just type mech-dump http://yourwebsite to get a list of the form fields:

GET http://www.google.de/search [f]
  ie=ISO-8859-1                  (hidden readonly)
  hl=de                          (hidden readonly)
  source=hp                      (hidden readonly)
  q=                             (text)
  btnG=Google-Suche              (submit)
  btnI=Auf gut Glück!            (submit)
  gbv=1                          (hidden readonly)

There we have a possible form to fill out, and we can also peek at the hidden values. Nothing that the “View Source” button cannot do, but more convenient.

By the way, if you are seeing garbled output for national characters and are using UTF-8 in your terminal, then you need to fiddle with the Perl Unicode settings. This can be done by calling PERL_UNICODE=S mech-dump http://www.google.de instead. S will turn on Unicode on standard input and output, the latter being your terminal. See man perlrun for more details. For an amazing write-up about the state of Unicode and its caveats, see this post by Tom Christiansen. You will need to install the Symbola font to see this camel:

Now the script can be written. Fire up your favourite script editor and type the first few lines:

use strict; use warnings;
use open qw(:std :utf8);   # make the standard handles UTF-8, like PERL_UNICODE=S above
use LWP::ConnCache;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new(autocheck => 1, stack_depth => 0);
$mech->conn_cache(LWP::ConnCache->new());   # reuse connections between requests

We turn on autocheck here so that a failed request will throw an exception instead of failing silently. Later, while writing the script, we find that it is not necessary to go back in history, so we set stack_depth to zero, too.
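
If a request fails while autocheck is on, the script simply dies with an error message. Should you want to handle such a failure yourself, a plain eval block is enough; a minimal sketch (the URL is only an example):

eval {
        $mech->get('http://www.google.de/this-page-does-not-exist');
};
warn "Request failed: $@" if $@;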

In the next step, the website will be opened and then the form can be filled out!

$mech->get('http://www.google.de');

$mech->submit_form(
        with_fields => {
                q => 'hello world',
        },
        button => 'btnG');

Here, the necessary form fields have been taken from the output of mech-dump. We will discuss later how to proceed in case a log-in is required first.
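
As a small preview, such a log-in step would usually be just another get plus submit_form against the log-in form. The field names below (user, password) are purely hypothetical; the real ones come from running mech-dump on the log-in page:

$mech->get('http://yourwebsite/login');
$mech->submit_form(
        with_fields => {
                user     => 'me',
                password => 'secret',
        },
);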

After the form has been filled out and submitted, the resulting page needs to be evaluated so the information relevant to your task can be extracted. The first step is to print the result page, which can be done with one further simple command:

print $mech->content;

Use your shell to store the output. With the Z shell, a simple firefox =(perl googl.pl) will then load the resulting file in Firefox. Now you can use your web browser to inspect the page and work out how to identify the useful information. In Firefox, you can type Ctrl+Shift+I to open the Inspector and then use the inspect button to learn more about an element. For example, click on a link and you can easily see where it sits in the document tree.
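
If you would rather let the script write the file itself instead of redirecting its output in the shell, Mechanize can also save the current page directly; the file name here is just an example:

$mech->save_content('result.html');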

In the screenshot the relevant data can be seen. To extract the link titles, we will use another tool that makes dealing with structured mark-up a breeze: XPath. For that, we use libxml2.

Some XPath literature: O’Reilly, MSDN; check your favourite search engine for much more!

The corresponding package on Debian has the slightly unwieldy name libxml-libxml-perl, and on openSUSE it is perl-XML-LibXML. XML::LibXML plugs nicely into our script:

use XML::LibXML;

my $lx = XML::LibXML->new();
$lx->recover(2);   # recover silently from parse errors

my $doc = $lx->parse_html_string($mech->content, { suppress_errors => 1 });

These few lines get the HTML document parsed and ready for XPath queries. We set suppress_errors so that LibXML copes better with “real-world”, non-strict HTML. With the help of the path information as seen in Firefox, we can now formulate the query:

for ($doc->findnodes('//h3[@class="r"]/a')) {
        print $_->textContent, "\n";
}

If we want to add the actual URL that the link points to, we can do so by also printing

$_->findvalue('@href')

However, it seems Google has garbled the links somehow, but luckily the original URL is still in there as a query parameter. We can use the CGI module to parse the parameters and get the original URL back:

use CGI;

# Strip the leading "/url?" so only the query string remains,
# then let CGI parse it; the original target URL is in the q parameter.
if ((my $google_url = $_->findvalue('@href')) =~ s|^/url\?||) {
        my $q = CGI->new($google_url);
        print scalar $q->param('q');
}
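
Putting all the pieces together, the whole extraction loop could look like the following sketch; the " -- " formatting merely mirrors the sample output at the end of this post:

for ($doc->findnodes('//h3[@class="r"]/a')) {
        print $_->textContent, "\n";
        if ((my $google_url = $_->findvalue('@href')) =~ s|^/url\?||) {
                my $q = CGI->new($google_url);
                print ' -- ', scalar $q->param('q'), "\n";
        }
}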

summary

That’s all for today. We have learned how to plug LWP, Mechanize, LibXML and CGI together to do some cool stuff. Next time we can take a look at how to handle log-ins and integrate with shell scripts, and if there is still time we can check out how to write a rudimentary JavaScript parser that does just enough to get us the information we need.

sample

ailin@xli51 [35270]% perl googl.pl
Hallo-Welt-Programm – Wikipedia
 -- http://de.wikipedia.org/wiki/Hallo-Welt-Programm
Hello world program - Wikipedia, the free encyclopedia
 -- http://en.wikipedia.org/wiki/Hello_world_program
The Hello World Collection
 -- http://www.roesler-ac.de/wolfram/hello.htm
Children learn German: songs, games, activities ... - Hello-World
 -- http://www.hello-world.com/German/index.php