Perl keyword text highlighting

Highlighting search terms

An August 2004 article at A List Apart, “Enhance Usability by Highlighting Search Terms,” suggests that websites highlight the search terms found in the referrer information. For example, if someone Googles “widget” and then clicks a resulting link to your site, you might want to highlight every instance of “widget” on your page. The script promoted by authors Brian Suda and Matt Riggott uses PHP. I’m more comfortable with Perl, and since I couldn’t find a robust, easy-to-use Perl version, I wrote my own.

Controversy

Some commenters on the article questioned the purpose and method of the PHP highlighting script. The principal objections, in my opinion:

Highlighting is an unnecessary distraction
Highlighting should be done on the client side (with Javascript) to save server resources
The script breaks, for example when the greater than symbol “>” appears in an image’s alt tag

The authors offer their own response. Here’s mine:

Highlighting is an unnecessary distraction

This is probably the strongest objection, and my response is pretty weak: I happen to like it. A better response is that you can add to the top of the page a brief explanation of the highlighting along with a link to the same page, so users can click the link and be rid of the highlighting.

Highlighting should be done on the client side (with Javascript) to save server resources

Another good objection. However, as the authors point out, this approach works only on Javascript-enabled browsers. More importantly, Javascript doesn’t have the thorough HTML parsing capability I’d like to see (although it might approach the regular expression ability offered by the article’s PHP script). This pertains to the next objection:

The script breaks, for example when the greater than symbol “>” appears in an image’s alt tag

The problem here has to do with what the authors have the PHP script doing. As they admit, “Implementing a full SGML/XML parser was well beyond the scope of our project.” I didn’t find that very satisfying. In fact, my early Perl version of their script did wacky things to the text within alt and title attributes. Here’s where Perl modules come in handy.

Perl HTML::Parser class

I’m not presuming to weigh in on the Perl vs. PHP dispute. My own opinion is that each has its value for different applications. The many Perl modules available for free download from CPAN make Perl a pleasure to use. The HTML::Parser module does all the heavy lifting for my script, and for that I’m thankful: I doubt my ability to come up with an efficient yet thorough regular expression to deal with HTML. I can let the minds behind HTML::Parser worry about that, and as circumstances change, I only have to update the module. Perl modules are easy to install. If your web host doesn’t have HTML::Parser (it probably does) and won’t install it when you ask, consider switching hosts; even the cheapest hosts offer this.

The script—a subroutine

I’ll discuss it more thoroughly at the bottom, but here’s what the script does in general: It is a subroutine that takes the HTML to be parsed as its sole argument; then it returns that HTML with text highlighted (if it should be). So for example, if your Perl variable “$myhtml” contains the page you want to be parsed, you could include a line such as the following in your code. It would then replace $myhtml with the relevant search terms highlighted. I do something similar just before printing the HTML to the browser.

$myhtml = &decide_texthighlight($myhtml);

Okay, the subroutine:

#------------------------------------------

#determine if search terms should be highlighted

# script with explanations and updates at https://austinmatzko.com/blog/2005/02/19/perl-text-highlighting/

sub decide_texthighlight {

     #argument: text to highlight if applicable

     #uses HTML::Parser

     #returns text with highlighting

#------------------------------------------

     #--------------------------------

     # Variables to set

     #--------------------------------

     my $highlightstarttag = '<span class="texthighlight">';

     my $highlightendtag = '</span>';

     # tags containing text that should not be highlighted

     my @ignoretags = (

     'title',

     'script'

     );

     # A list of search query keys used by various search engines.  You probably don't

     # need to change these unless you want to add your own site's unique key.  

     # Google uses 'q' and Yahoo, 'p', for example.

     my @querykeys = (

      'q',

      'p',

      'ask',

      'searchfor',

      'key',

      'query',

      'search',

      'keyword',

      'keywords',

      'qry',

      'searchitem',

      'kwd',

      'recherche',

      'search_text',

      'search_term',

      'term',

      'terms',

      'qq',

      'qry_str',

      'qu',

      's',

      'k',

      't',

      'va'

      );

#-------------------------------------

# end variables you need to set

#-------------------------------------

my $content = $_[0];

my %form;

my $num_ignoretags = 0;

# look for search terms if the referrer line contains '?'

if ($ENV{'HTTP_REFERER'} =~ m/\?/) {

     my $buffer = $ENV{'HTTP_REFERER'};

     #remove everything leading up to and including '?' 

     $buffer =~ s/(^.*\?)//;

     my @pairs = split(/&/, $buffer);

    foreach my $pair (@pairs) {

     my ($name, $value) = split(/=/, $pair);

          $value =~ tr/+/ /;

          $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

          $form{$name} = $value;

    } 

     my $searchtext;

     foreach (@querykeys) {

          if (exists $form{$_}) {

               $searchtext = $form{$_};

          }

     }

     #take 'and's and 'or's out of $searchtext

     $searchtext =~ s/(?: and | or )/ /gi;

     my @words = split /\W+/, $searchtext;

     #-------------------------------------

     # set up the package for parsing text

     #-------------------------------------

     my $html;

     package HTMLStrip;

     use base "HTML::Parser";

     sub start {

          my ($self, $tag, $attr, $attrseq, $origtext) = @_;

          # add in original start tags

          $html .=  $origtext;

          # determine if the tag is one to ignore

          $num_ignoretags = grep(/^$tag$/i, @ignoretags);

     }

     sub text {

     my ($self, $text) = @_;

          #if not within a tag to ignore

          if ($num_ignoretags < 1) {

               #replace all the search terms in the content with highlighted search terms

               foreach (@words) {

                    #make sure the search 'word' isn't some garbage or blank space

                    if ($_ =~ m/\w/) {

                         $text =~ s/($_)/$highlightstarttag$1$highlightendtag/gi;

                    }

               }

          }

     $html .= $text;

     }

     sub end {

          my ($self, $tag, $origtext) = @_;

        # add in original end tags

        $html .=  $origtext;

          $num_ignoretags = 0;

     }      

     #invoke the package

     my $p = new HTMLStrip;

     $p->parse($content);

     $p->eof;

     $content = $html;

}

return $content;

} #end sub decide_texthighlight

Explanations

     #--------------------------------

     # Variables to set

     #--------------------------------

     my $highlightstarttag = '<span class="texthighlight">';

     my $highlightendtag = '</span>';

$highlightstarttag is whatever tag you want to use to highlight your text. I chose to use the span tag with class “texthighlight” so I can do all the formatting in my CSS stylesheet. For example, here are the relevant lines from my stylesheet, setting the properties for class “texthighlight”:

.texthighlight {

     color:black; 

     font-weight: bold; 

     background-color:#ffff66;

}

The next section puts a list of the tags to ignore into the @ignoretags array. I want to ignore <title> tags to keep <span . . .> from appearing in the browser title bar, and I ignore <script> tags to keep my Javascript from getting messed up.

     # tags containing text that should not be highlighted

     my @ignoretags = (

     'title',

     'script'

     );

The next part is a list of search query keys taken directly from the PHP used for the A List Apart article. Google “widget” and you’ll see “q=widget” in the argument string that appears in the results’ URL. Yahoo uses “p” instead of “q.”

     my @querykeys = (

      'q',

      'p',

      'ask',

      'searchfor',

      'key',

      'query',

      'search',

      'keyword',

      'keywords',

      'qry',

      'searchitem',

      'kwd',

      'recherche',

      'search_text',

      'search_term',

      'term',

      'terms',

      'qq',

      'qry_str',

      'qu',

      's',

      'k',

      't',

      'va'

      );

my $content = $_[0];

my %form;

my $num_ignoretags = 0;

$content takes the first (and only) argument passed to this subroutine, in this case, the page HTML to be parsed. We’ll come back to the other two variables I’m just initializing here.

if ($ENV{'HTTP_REFERER'} =~ m/\?/) {

 

We start looking for search terms to highlight only if a question mark (“?”) appears somewhere in the HTTP_REFERER value sent by the browser to the server.

     my $buffer = $ENV{'HTTP_REFERER'};

     #remove everything leading up to and including '?' 

     $buffer =~ s/(^.*\?)//;

     my @pairs = split(/&/, $buffer);

    foreach my $pair (@pairs) {

     my ($name, $value) = split(/=/, $pair);

          $value =~ tr/+/ /;

          $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

          $form{$name} = $value;

    }

Let’s say your referrer looks like “http://www.google.com/search?hl=en&q=%22widget%22&btnG=Google+Search”

$buffer =~ s/(^.*\?)//;

 

shortens the string to “hl=en&q=%22widget%22&btnG=Google+Search”. The next little bit splits the string by “&,” translates escapes such as %22 into ", then stuffs the key-value pairs separated by “=” into the hash named %form. What we care about is that the key

q

now is associated with the value

"widget"

in the %form hash.
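To see the parsing step in isolation, here is a minimal sketch that uses only core Perl. The helper name parse_referrer_query is my own invention for illustration and is not part of the script above. It also differs in one assumption: it cuts at the first “?” rather than the last, and it skips pairs that have no name.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the query string of a referrer URL into a hash.
sub parse_referrer_query {
    my ($referrer) = @_;
    my %form;
    return %form unless defined $referrer && $referrer =~ m/\?/;

    # keep only what follows the first '?'
    (my $query = $referrer) =~ s/^[^?]*\?//;

    foreach my $pair (split /&/, $query) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $value = '' unless defined $value;
        $value =~ tr/+/ /;                                    # '+' encodes a space
        $value =~ s/%([a-fA-F0-9]{2})/pack("C", hex($1))/eg;  # %22 becomes "
        $form{$name} = $value;
    }
    return %form;
}

my %form = parse_referrer_query(
    'http://www.google.com/search?hl=en&q=%22widget%22&btnG=Google+Search'
);
print "$_ => $form{$_}\n" for sort keys %form;
```

Run against the example referrer, the q key should end up holding "widget" with the quotation marks decoded.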

     my $searchtext;

     foreach (@querykeys) {

          if (exists $form{$_}) {

               $searchtext = $form{$_};

          }

     }

Above we look through the @querykeys array to see if one of our %form hash keys is in that list. Sure enough, because q appears in the @querykeys array, $searchtext will now have "widget" as its value.
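One detail worth knowing: because the loop keeps going after a match, the last matching key in @querykeys wins if the referrer happens to contain more than one of them. The sketch below shows a first-match-wins alternative. The helper name find_searchtext and the sample data are assumptions made for illustration, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: return the value of the first recognised query key.
sub find_searchtext {
    my ($form, @querykeys) = @_;
    foreach my $key (@querykeys) {
        return $form->{$key} if exists $form->{$key};
    }
    return;    # no search terms found
}

my %form = ( hl => 'en', q => '"widget"', t => 'something else' );
my $searchtext = find_searchtext(\%form, 'q', 'p', 't');
print defined $searchtext ? "$searchtext\n" : "nothing to highlight\n";
```

Ordering @querykeys from most to least specific then gives the common engines priority over short, ambiguous keys such as “t”.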

     $searchtext =~ s/(?: and | or )/ /gi;

     my @words = split /\W+/, $searchtext;

We don’t want to highlight every “and” and “or” just because someone used those boolean terms, so we replace them with spaces. Then we split up $searchtext by non-word characters (in this example, spaces or ") into the @words array.
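A short standalone sketch of this step, using a made-up search string, shows why the script later checks each word for a word character: when the string begins with a non-word character such as a quotation mark, split returns an empty string as its first element.

```perl
use strict;
use warnings;

my $searchtext = '"blue widget" and gadget';

# drop the boolean operators, as the subroutine does
$searchtext =~ s/(?: and | or )/ /gi;

# split on runs of non-word characters; because the string starts with a
# quotation mark, the first element of the list is an empty string
my @words = split /\W+/, $searchtext;

# keep only elements that contain at least one word character
my @usable = grep { /\w/ } @words;

print join(',', @usable), "\n";
```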

     my $html;

     package HTMLStrip;

     use base "HTML::Parser";

We create a package called HTMLStrip that inherits from the HTML::Parser class. I got the idea for this section from Ken MacFarlane’s tutorial on HTML::Parser.

     sub start {

          my ($self, $tag, $attr, $attrseq, $origtext) = @_;

          # add in original start tags

          $html .=  $origtext;

          # determine if the tag is one to ignore

          $num_ignoretags = grep(/^$tag$/i, @ignoretags);

     }

“start” is an HTML::Parser method called each time the parser encounters a start tag in the HTML. We’re overriding it for our purposes. “$html .= $origtext;” adds the original start tag back to the $html variable, which is going to contain our highlighted HTML. “$num_ignoretags = grep(/^$tag$/i, @ignoretags);” looks to see if the start tag is one of the tags we want to ignore, such as “title,” and returns a 1 in $num_ignoretags if it is. (More precisely, $num_ignoretags takes as its value the number of times $tag appears in the @ignoretags array.) This will come in handy next.
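The counting behaviour comes from calling grep in scalar context, which can be tried on its own. The helper name count_ignored below is hypothetical, and I have swapped the regular expression for a case-insensitive string comparison, which is an assumption on my part rather than what the script does.

```perl
use strict;
use warnings;

my @ignoretags = ('title', 'script');

# Hypothetical helper: in scalar context grep returns the number of matches.
sub count_ignored {
    my ($tag) = @_;
    my $count = grep { lc($_) eq lc($tag) } @ignoretags;
    return $count;
}

printf "%-6s %d\n", $_, count_ignored($_) for qw(title SCRIPT p);
```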

     sub text {

     my ($self, $text) = @_;

          #if not within a tag to ignore

          if ($num_ignoretags < 1) {

               #replace all the search terms in the content with highlighted search terms

               foreach (@words) {

                    #make sure the search 'word' isn't some garbage or blank space

                    if ($_ =~ m/\w/) {

                         $text =~ s/($_)/$highlightstarttag$1$highlightendtag/gi;

                    }

               }

          }

     $html .= $text;

     }

The parser subroutine “text” deals with the text found between start and end tags. If $num_ignoretags is 0, i.e. we’re not inside a tag we’re supposed to ignore, then we replace all the search terms (if there are any) with the search terms surrounded by our highlighting start and end tags. So for example

Oh, how I love widgets!

would become

Oh, how I love <span class="texthighlight">widget</span>s!

Then we add the text to our $html variable.
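The substitution itself can be tried outside the parser. The sketch below is a variation rather than the script’s own loop: it joins all the words into a single alternation so that markup inserted for one word is never rescanned for the next (think of a visitor searching for “span” or “class”). The helper name highlight_words is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every occurrence of any search word in the
# highlight tags, using one pass over the text.
sub highlight_words {
    my ($text, @words) = @_;
    my $starttag = '<span class="texthighlight">';
    my $endtag   = '</span>';

    # skip empty or garbage entries; try longer words first
    my @usable = sort { length($b) <=> length($a) } grep { /\w/ } @words;
    return $text unless @usable;

    # quotemeta makes any regex metacharacters in a word literal
    my $pattern = join '|', map { quotemeta($_) } @usable;
    $text =~ s/($pattern)/$starttag$1$endtag/gi;
    return $text;
}

print highlight_words('Oh, how I love widgets!', 'widget'), "\n";
```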

     sub end {

          my ($self, $tag, $origtext) = @_;

        # add in original end tags

        $html .=  $origtext;

          $num_ignoretags = 0;

     }

As you could guess, the “end” subroutine handles end tags. Here we just add the end tag to the $html variable and reset $num_ignoretags to 0 in preparation for the next tag.

     #invoke the package

     my $p = new HTMLStrip;

     $p->parse($content);

     $p->eof;

     $content = $html;

}

return $content;

We’ve overridden HTML::Parser’s subroutines, so it will call our versions when we call “$p->parse($content);” to parse the HTML. “$p->eof;” tells the parser that the document has ended and flushes its internal buffer, so we could re-invoke the parser if we wanted to. Finally, we assign the highlighted $html value to $content and return $content. If you have any suggested improvements or criticisms, please leave a comment.
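For comparison, here is a sketch of the same parsing step written with HTML::Parser’s handler interface (api_version 3) instead of a subclass. It is an alternative under stated assumptions, not the script above: the name highlight_html is hypothetical, it needs HTML::Parser installed, it resets the ignore flag only at the matching end tag, and it adds a default handler so that comments and doctype declarations are copied through. Using anonymous subs also sidesteps a Perl quirk where named subs nested in another sub capture the outer “my” variables only once, which can matter in persistent environments such as mod_perl.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical variant of the parsing step using handler closures.
sub highlight_html {
    my ($content, @words) = @_;
    my %ignore  = map { $_ => 1 } qw(title script);
    my $pattern = join '|', map { quotemeta($_) } grep { /\w/ } @words;
    return $content unless length $pattern;

    my $html    = '';
    my $ignored = 0;

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $origtext) = @_;
            $ignored = 1 if $ignore{$tag};
            $html .= $origtext;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $origtext) = @_;
            $ignored = 0 if $ignore{$tag};
            $html .= $origtext;
        }, 'tagname, text' ],
        text_h => [ sub {
            my ($text) = @_;
            $text =~ s/($pattern)/<span class="texthighlight">$1<\/span>/gi
                unless $ignored;
            $html .= $text;
        }, 'text' ],
        # comments, doctype declarations, and anything else pass through untouched
        default_h => [ sub { $html .= $_[0] }, 'text' ],
    );
    $p->unbroken_text(1);    # deliver each run of text in one piece
    $p->parse($content);
    $p->eof;
    return $html;
}

print highlight_html('<title>widget</title><p>my widgets</p>', 'widget'), "\n";
```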

Austin Matzko's Blog