I learned a good bit about using Perl modules to parse HTML today. I did some research into HTML::LinkExtor, HTML::Parser and HTML::TreeBuilder.

The problem was that originally I had been doing some very simple stuff on www.patchtrader.info, where a user would log in, edit one page worth of data and then submit the form. Now I'm planning on expanding the functionality so that once you are logged in, you will stay logged in and will have more things you can do. I don't like cookies, so my only choice is to encode session data into all of the URLs. Since almost all of the links are GET links, I needed to encode my session data in every single link. I could write some template closure to generate the links (a rough sketch of what I mean follows), but that would mean always having to use some contrived method of generating links. Worse yet, it would mean going back and updating all of my previous links.
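To be concrete about what I mean by a link-generating closure, here is a rough sketch (make_link and the /trade.cgi URL are made up for illustration, not code from the site):

# Build a closure that knows the current session.
sub make_link {
    my $session = shift;
    return sub {
        my ( $url, $text ) = @_;
        my $sep = ( $url =~ /\?/ ) ? '&' : '?';
        return qq{<a href="$url${sep}session=$session">$text</a>};
    };
}

# Every template would then have to say something like:
#   my $link = make_link($session);
#   print $link->( '/trade.cgi', 'my patches' );

It works, but it is exactly the kind of contrived link generation I would rather not be stuck with.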
I wanted Perl to do the thinking for me. I finally ended up using HTML::TreeBuilder, which lets me magically rewrite the output of my web pages so that they always encode the session information without my having to rewrite any of my templates. I know I should probably use URI::URL instead of parsing those URLs by hand, but that is a project for another day.
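For the record, the query-string juggling inside the loop below would look roughly like this with the newer URI module (an untested sketch; the $link and $session variables come from the sub below, and query_form would also handle the escaping I'm ignoring):

use URI;

my $u      = URI->new( $link->attr('href') );
my %params = $u->query_form;            # parse the existing query string
$params{session} ||= $session;
$u->query_form(%params);                # write it back, properly escaped
$link->attr( 'href', $u->as_string );

Anyway, here is the sub I actually ended up with: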
use HTML::TreeBuilder;

sub add_sessions {
    my $root    = HTML::TreeBuilder->new_from_content( shift() );
    my $session = shift;

    foreach my $link ( $root->look_down( '_tag', 'a' ) ) {
        my $url = $link->attr('href');
        next unless defined $url and length $url;

        # Links that carry a scheme (mailto:, ftp:, javascript:, ...) are
        # left alone, except for http(s) links that point back at one of
        # our own hosts.  $owned_sites is a file scope lexical compiled
        # regexp at the top of the file.
        if ( $url =~ m{^([^:/?#]+):} ) {
            next unless $1 =~ /^https?$/i;
            next unless $url =~ m{://([^/?#]*)} and $1 =~ $owned_sites;
        }

        # Tack the session onto the query string unless it is already there.
        my ( $path, $params ) = split /\?/, $url, 2;
        $params = '' unless defined $params;
        my %params;
        foreach my $pair ( grep { length } split /&/, $params ) {
            my ( $key, $value ) = split /=/, $pair, 2;
            $params{$key} = defined $value ? $value : '';
        }
        $params{session} ||= $session;
        $url = join( '?', $path, join( '&', map { "$_=$params{$_}" } keys %params ) );
        $link->attr( 'href', $url );
    }

    my $html = $root->as_HTML;
    $root->delete();    # TreeBuilder trees have to be freed explicitly
    return $html;
}
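Roughly how it all fits together (the HTML fragment and the 'abc123' session are made up for the example; the exact $owned_sites pattern is a guess, but it really is just a compiled regexp of my own hosts):

# Near the top of the file, before add_sessions() is compiled:
my $owned_sites = qr/(?:^|\.)patchtrader\.info$/i;

# A quick check:
my $html = q{<a href="/edit.cgi?page=3">edit</a>
  <a href="http://www.patchtrader.info/list.cgi">list</a>
  <a href="mailto:someone@example.com">mail</a>};

print add_sessions( $html, 'abc123' );

The first two hrefs come back with session=abc123 added to the query string, the mailto: link is left alone, and TreeBuilder wraps the fragment in the usual html/head/body tags on output.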
posted at: 22:52 | permanent link to this entry | Comments: