Thinking Sphinx with Excerpt Highlighting

So you want to highlight search terms and excerpt relevant phrases in your search results, and you want it done using the same stemming rules that your search engine uses to gather results?

Update 1/18/10: Newer versions of Thinking Sphinx include support for excerpting and don’t work with the patch I present below. The patch worked in 1.13.5, but doesn’t work in 1.13.14. If you want the functionality presented here but run a newer version, try installing the plugin:

script/plugin install git@github.com:dfurber/thinking_sphinx_excerpts.git

Thinking Sphinx has the advantage of delta indexing. Ultrasphinx has lots of cool features such as excerpt highlighting. Wouldn’t it be nice if you could have TS’s delta indexing and search highlighting?

As it turns out, neither Thinking Sphinx nor Ultrasphinx talks to Sphinx directly. Both wrap another library, Riddle, which does the actual interfacing with Sphinx.

Riddle has an excerpts method (along with many other cool toys that I haven’t played with yet) that Ultrasphinx exposes but Thinking Sphinx does not.

It turned out to be straightforward to take the excerpts method out of US and stick it into TS. Using the patched TS, the way to have your search terms excerpted and highlighted is simply as follows.

For a single model:

User.search "david", :excerpts => true

For multiple models:

ThinkingSphinx::Search.search "david", :excerpts => true, :classes => [User, Post, Photo, Event]

The patch for Paginating Find:

Thinking Sphinx patch for paginating_find

I used to be a fan of paginating_find over will_paginate. This is the version I use in production. It performs the search, runs the results through Sphinx’s excerpt highlighter, and returns the results wrapped in a Paging Enumerator.

class Object
  def _metaclass 
    class << self
      self
    end
  end
end

module ThinkingSphinx
  class Search
    class << self
      # Overwrite the configured content attributes with excerpted and highlighted versions of themselves.
      # The search must already have been run; this method only post-processes its results.
      def excerpts(results, client, parsed_query)
        return if results.empty? or client.nil?
        options = {
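            :before_match => '', # markup to insert before each match, e.g. an opening HTML tag
            :after_match => '',  # markup to insert after each match, e.g. the closing HTML tag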
            :chunk_separator => "…",
            :limit => 256,
            :around => 5
        }
        content_methods = %w{title name description} # the attributes of any model you would like to have excerpted
        # See what fields in each result might respond to our excerptable methods
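        results_with_content_methods = results.map do |result|
          # One slot per content method: the method name if this result
          # responds to it, otherwise nil. (Avoids calling detect on a String,
          # which only works on Ruby 1.8.)
          methods = content_methods.map do |method|
            method if result.respond_to?(method)
          end
          [result, methods]
        end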

        # Fetch the actual field contents
        docs = results_with_content_methods.map do |result, methods|
          methods.map do |method| 
            method and strip_bogus_characters(result.send(method)) or ""
          end
        end.flatten

        excerpting_options = {
          :docs => docs,         
          :index => "user_core", #MAIN_INDEX, # http://www.sphinxsearch.com/forum/view.html?id=100
          :words => strip_query_commands(parsed_query.to_s)
        }.merge(options)

        responses = client.excerpts(excerpting_options)

        responses = responses.in_groups_of(content_methods.size)

        results_with_content_methods.each_with_index do |result_and_methods, i|
          # Override the individual model accessors with the excerpted data
          result, methods = result_and_methods
          methods.each_with_index do |method, j|
            data = responses[i][j]
            if method
              result._metaclass.send('define_method', method) { data }
              attributes = result.instance_variable_get('@attributes')
              attributes[method] = data if attributes[method]
            end
          end
        end

        results = results_with_content_methods.map do |result_and_content_method| 
          result_and_content_method.first.freeze
        end

        results
      end  


      def search_with_excerpts_and_pagination(*args)
        query = args.clone  # an array
        options = query.extract_options!

        retry_search_on_stale_index(query, options) do
          results, client = search_results(*(query + [options]))

          ::ActiveRecord::Base.logger.error(
            "Sphinx Error: #{results[:error]}"
          ) if results[:error]

          klass   = options[:class]
          page    = options[:page] ? options[:page].to_i : 1
          total = results[:total]

          results = ThinkingSphinx::Collection.create_from_results(results, page, client.limit, options)

          if options[:excerpts] and !results.empty?
            results = excerpts(results, client, query)
          end

          if options[:page]
            PagingEnumerator.new(client.limit, total, false, page, 1) do |pg|
              results
            end
          else
            results
          end
        end
      end
      alias_method_chain :search, :excerpts_and_pagination

      def strip_bogus_characters(s)
        # Used to remove some garbage before highlighting
        s.gsub(/<.*?>|\.\.\.|\342\200\246|\n|\r/, " ").gsub(/http.*?( |$)/, ' ') if s
      end

      def strip_query_commands(s)
        # XXX Hack for query commands, since Sphinx doesn't intelligently parse the query in excerpt mode
        # Operators are replaced with a space so that the words on either side are not fused together.
        # Also removes apostrophes in the middle of words so that they don't get split in two.
        s.gsub(/(^|\s)(AND|OR|NOT|\@\w+)(\s|$)/i, " ").gsub(/(\w)\'(\w)/, '\1\2')
      end


    end
  end
end
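
The bookkeeping inside excerpts is easier to follow in isolation. Every result contributes one document per content method (an empty string when the model lacks that method), the documents are flattened into one list for Sphinx, and the flat list of responses is cut back into groups of content_methods.size. Here is a minimal sketch in plain Ruby. It assumes a made-up highlighter, fake_excerpts, in place of the real client.excerpts call, plain hashes in place of models, and each_slice in place of ActiveSupport's in_groups_of:

```ruby
# Hypothetical stand-in for client.excerpts: upcases the search word so the
# "highlighting" is visible without a running Sphinx daemon.
def fake_excerpts(docs, word)
  docs.map { |doc| doc.gsub(word, word.upcase) }
end

content_methods = %w[title name description]

# Each record exposes only some of the content methods (plain hashes here).
records = [
  { "title" => "david's post", "description" => "written by david" },
  { "name" => "david" }
]

# One slot per content method per record; nil marks a missing method.
methods_per_record = records.map do |record|
  content_methods.map { |m| record.key?(m) ? m : nil }
end

# Flatten into a single docs list, using "" for missing slots so that
# positions stay aligned with the slots above.
docs = records.zip(methods_per_record).flat_map do |record, methods|
  methods.map { |m| m ? record[m] : "" }
end

responses = fake_excerpts(docs, "david")

# Cut the flat responses back into one group per record.
grouped = responses.each_slice(content_methods.size).to_a

# Write the excerpt back only where the record had the method.
methods_per_record.each_with_index do |methods, i|
  methods.each_with_index do |m, j|
    records[i][m] = grouped[i][j] if m
  end
end
```

The empty-string placeholders are what keep the indices lined up: without them, the response for one record's description could be written into the next record's title.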

The Will Paginate Version

People asked about a will_paginate version, so I put this together and tested it. It performs the query on Sphinx, then does the highlighting, and passes the results array through Thinking Sphinx’s own will_paginate collection class.

Thinking Sphinx patch for will_paginate

class Object
  def _metaclass 
    class << self
      self
    end
  end
  
end

module ThinkingSphinx
  class Search
    class << self
      # Overwrite the configured content attributes with excerpted and highlighted versions of themselves.
      # The search must already have been run; this method only post-processes its results.
      def excerpts(results, client, parsed_query)
        return if results.empty? or client.nil?
        options = {
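            :before_match => '', # markup to insert before each match, e.g. an opening HTML tag
            :after_match => '',  # markup to insert after each match, e.g. the closing HTML tag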
            :chunk_separator => "…",
            :limit => 256,
            :around => 5
        }
        content_methods = %w{title name description} # the attributes of any model you would like to have excerpted
        # See what fields in each result might respond to our excerptable methods
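        results_with_content_methods = results.map do |result|
          # One slot per content method: the method name if this result
          # responds to it, otherwise nil. (Avoids calling detect on a String,
          # which only works on Ruby 1.8.)
          methods = content_methods.map do |method|
            method if result.respond_to?(method)
          end
          [result, methods]
        end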

        # Fetch the actual field contents
        docs = results_with_content_methods.map do |result, methods|
          methods.map do |method| 
            method and strip_bogus_characters(result.send(method)) or ""
          end
        end.flatten

        excerpting_options = {
          :docs => docs,         
          :index => "user_core", #MAIN_INDEX, # http://www.sphinxsearch.com/forum/view.html?id=100
          :words => strip_query_commands(parsed_query.to_s)
        }.merge(options)

        responses = client.excerpts(excerpting_options)

        responses = responses.in_groups_of(content_methods.size)

        results_with_content_methods.each_with_index do |result_and_methods, i|
          # Override the individual model accessors with the excerpted data
          result, methods = result_and_methods
          methods.each_with_index do |method, j|
            data = responses[i][j]
            if method
              result._metaclass.send('define_method', method) { data }
              attributes = result.instance_variable_get('@attributes')
              attributes[method] = data if attributes[method]
            end
          end
        end
        results.results = results_with_content_methods.map do |result_and_content_method| 
          result_and_content_method.first.freeze
        end

        results
      end  


      def search_with_excerpts(*args)
        query = args.clone  # an array
        options = query.extract_options!

        retry_search_on_stale_index(query, options) do
          results, client = search_results(*(query + [options]))

          ::ActiveRecord::Base.logger.error(
            "Sphinx Error: #{results[:error]}"
          ) if results[:error]

          klass   = options[:class]
          page    = options[:page] ? options[:page].to_i : 1
          total = results[:total]

          results = ThinkingSphinx::Collection.create_from_results(results, page, client.limit, options)

          if options[:excerpts] and !results.empty?
            results = excerpts(results, client, query)
          end

          results
        end
      end
      alias_method_chain :search, :excerpts

      def strip_bogus_characters(s)
        # Used to remove some garbage before highlighting
        s.gsub(/<.*?>|\.\.\.|\342\200\246|\n|\r/, " ").gsub(/http.*?( |$)/, ' ') if s
      end

      def strip_query_commands(s)
        # XXX Hack for query commands, since Sphinx doesn't intelligently parse the query in excerpt mode
        # Operators are replaced with a space so that the words on either side are not fused together.
        # Also removes apostrophes in the middle of words so that they don't get split in two.
        s.gsub(/(^|\s)(AND|OR|NOT|\@\w+)(\s|$)/i, " ").gsub(/(\w)\'(\w)/, '\1\2')
      end


    end
  end
end

Simply place whichever version matches your pagination plugin in a file in the config/initializers folder of your application, and the magic will appear the way it almost always does in Rails.
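
Part of that magic is plain Ruby. The patch redefines an accessor on each result's singleton class (what the patch calls _metaclass), so only that one object returns the excerpted text and other instances of the same class are untouched. A minimal sketch of the trick, assuming a hypothetical Article class in place of an ActiveRecord model and a hard-coded string in place of what Sphinx would return:

```ruby
# Hypothetical stand-in for an ActiveRecord model.
class Article
  attr_reader :title

  def initialize(title)
    @title = title
  end
end

# Same helper as in the patch; on Ruby 1.9.2 and later the built-in
# singleton_class returns the same object.
class Object
  def _metaclass
    class << self
      self
    end
  end
end

hit   = Article.new("Meeting david")
other = Article.new("Meeting david")

# Assumed excerpt text; the real one would come from the Sphinx response.
excerpt = "Meeting <b>david</b>"

# define_method is private on older Rubies, hence send, as in the patch.
hit._metaclass.send(:define_method, :title) { excerpt }

hit.title    # now the excerpted string
other.title  # still the original string
```

Because the override lives on the individual object, nothing is written back to the database and a freshly loaded record still shows the original text.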

Update 6/4/09 @ 9:30AM:

1. The excerpts call wants to examine one of your indices (any one) for character encoding and such; see http://www.sphinxsearch.com/forum/view.html?id=100, the thread linked in the patch. I have used “user_core”. If you don’t have a User model or haven’t defined an index on it, then you will get an “unknown index” error. Simply search the patch for “user_core” and replace it with “#{model_you_have_indexed}_core”.

2. There was an error in the will_paginate version in which the excerpt method was taking the paginated collection and returning a simple array. The patch has been updated so that the excerpt method replaces the results returned by TS with the excerpt highlighted results, without removing the pagination info. My apologies to those who were caught by this problem.
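
The problem in item 2 is easy to reproduce in miniature. Mapping over a paginated collection yields a bare Array, which drops the pagination data, whereas assigning the mapped records back into the collection keeps it. A minimal sketch, assuming a hypothetical PagedResults class that only stands in for Thinking Sphinx's collection (the real class has more to it):

```ruby
# Hypothetical stand-in for a paginated search collection.
class PagedResults
  include Enumerable

  attr_accessor :results
  attr_reader :current_page, :total_entries

  def initialize(results, current_page, total_entries)
    @results = results
    @current_page = current_page
    @total_entries = total_entries
  end

  def each(&block)
    @results.each(&block)
  end
end

page = PagedResults.new(%w[david davina], 2, 42)

# Buggy approach: map returns a plain Array with no pagination data.
plain = page.map { |name| name.upcase }

# Fixed approach: swap the records inside the collection, as the updated
# patch does with `results.results = ...`.
page.results = page.map { |name| name.upcase }
```

After the second assignment, page still answers current_page and total_entries, which is what pagination view helpers rely on.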

Update 10/06/09

There has been a new version of Thinking Sphinx since I wrote this. I have not had the chance to update this code to match the new plugin. So, if it doesn’t work, that may be why. When I have a chance, I’ll fix it up.

February 26, 2009 in Ruby on Rails