Wed, 25 Jun 2003

I learned a good bit about using Perl modules to parse HTML today.

I did a bit of research into HTML::LinkExtor, HTML::Parser and HTML::TreeBuilder.

The problem I had was that I had originally been doing some very simple stuff on www.patchtrader.info, where a user would log in, edit one page's worth of data and then submit the form. Now I'm planning to expand the functionality so that once you are logged in, you stay logged in and have more things you can do.

I don't like cookies, so my only choice is to encode session data into the URLs. Since almost all of the links are GET links, I needed to encode my session data in every single link. I could write some template closure to generate the links, but that would mean always having to use some contrived method of generating links. Worse yet, it would mean going back and updating all of my existing links. I wanted Perl to do the thinking for me.
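For comparison, the template-closure approach might look something like the sketch below. It is only an illustration of the idea, not code from the site: make_link_builder, the session value and the sample paths are all made up for the example. Only the session parameter name matches the real code.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that appends a session
# parameter to whatever path it is given.
sub make_link_builder {
    my ($session) = @_;
    return sub {
        my ($path) = @_;
        # Use '&' when the path already carries a query string.
        my $sep = $path =~ /\?/ ? '&' : '?';
        return $path . $sep . "session=$session";
    };
}

my $link_to = make_link_builder('abc123');
print $link_to->('/edit.cgi'), "\n";          # /edit.cgi?session=abc123
print $link_to->('/list.cgi?page=2'), "\n";   # /list.cgi?page=2&session=abc123
```

Every link in every template would then have to go through $link_to, which is exactly the contrived part.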

I finally ended up using HTML::TreeBuilder to magically rewrite the output of my web pages, so that every link carries the session information without my having to rewrite all of my templates:

use strict;
use warnings;
use HTML::TreeBuilder;

sub add_sessions {
    my ( $content, $session ) = @_;
    my $root = HTML::TreeBuilder->new_from_content($content);

    foreach my $link ( $root->look_down( '_tag', 'a' ) ) {
        next unless my $url = $link->attr('href');

        if ( $url =~ m|://([^/]*)/| ) {
            # Absolute URL: skip links to sites we don't own.
            # $owned_sites is a file scope lexical compiled regexp
            # at the top of the file.
            next if ( $1 !~ $owned_sites );
        }
        elsif ( $url =~ m|^[^/]*:| ) {
            # Look for mailto: links (and other scheme-only links).
            next;
        }

        my ( $path, $params ) = split /\?/, $url, 2;
        $params = '' unless defined $params;

        # Ignore pieces without an '=' so the hash stays balanced.
        my %params = map { split( /=/, $_, 2 ) } grep { /=/ } split( /&/, $params );
        $params{session} ||= $session;

        $url = join( '?', $path, join( '&', map { "$_=$params{$_}" } sort keys(%params) ) );
        $link->attr( 'href', $url );
    }

    my $html = $root->as_HTML;
    $root->delete;    # free the parse tree

    return $html;
}

I know I should probably use URI::URL instead of parsing those URLs by hand, but that is a project for another day.
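As a rough idea of what that project could look like, here is a sketch of the per-link logic using the URI module (the current interface, which URI::URL is the older compatibility layer for) instead of hand-rolled regexps. It is an untested-in-production illustration, not the site's code: url_with_session is a hypothetical helper, and the $owned_sites pattern shown is only a stand-in for the real regexp.

```perl
use strict;
use warnings;
use URI;

# Hypothetical stand-in for the file-scoped $owned_sites regexp.
my $owned_sites = qr/^(?:www\.)?patchtrader\.info$/i;

# Returns the URL with a session parameter added, or undef when
# the link should be left alone.
sub url_with_session {
    my ( $url, $session ) = @_;
    my $uri = URI->new($url);

    if ( defined( my $scheme = $uri->scheme ) ) {
        # Skip mailto: and anything else that is not plain HTTP(S).
        return unless $scheme =~ /^https?$/;

        # Skip links to sites we do not own.
        my $host = $uri->host;
        return unless defined $host && $host =~ $owned_sites;
    }

    # query_form returns key/value pairs; repeated keys collapse
    # to the last value once they are stored in a hash.
    my %params = $uri->query_form;
    $params{session} ||= $session;
    $uri->query_form( map { $_ => $params{$_} } sort keys %params );

    return $uri->as_string;
}

for my $href ( '/edit.cgi', '/list.cgi?page=2', 'mailto:someone@example.com' ) {
    my $new = url_with_session( $href, 'abc123' );
    print defined $new ? $new : "(unchanged) $href", "\n";
}
```

Inside add_sessions, the loop body would shrink to calling this helper and updating href only when it returns a value. URI also takes care of escaping and of keeping any #fragment at the end of the link, which the split-based version does not.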

posted at: 22:52