Search Engine Performance: Howto Avoid Suffix Queries

Friday, February 15, 2008

The reason why Lucene’s QueryParser doesnt allow suffix queries (“*keyword”) is the inefficient nature of such query types. In order to find a word with a suffix query, it must iterate through every term in the index and check if it matches.
Depending on the index size, this can potentially kill performance and slow down your application. There is, however, a simple trick which lets you perform suffix queries without the performance disadvantage.

Simply store the reverse of the string and transform suffix queries to prefix queries against the reverse string.

Here’s an example. With two strings in your index

A) “Tee House”
B) “Waffle House”

you’re going to store the reverse string too

A) “esuoH eeT”
B) “esuoH elffaW”

and transform each suffix query like “*House” to “esuoH*”. The results will be the same as with a real suffix query - just faster!
The disadvantage of this method? denormalization. Since each string is going to be stored twice, index size will become bigger.


5 Comments

Martin, February 16th, 2008

No shit sherlock. :)

harold, February 16th, 2008

What about substring queries? *Waffle* --> *elffaW*

kk, February 17th, 2008

For substring queries you have to store all substrings: House, ouse, use, se, e ;)

Sam Danielson, February 17th, 2008

Instead of multiple crappy indexes, how about a better index?

http://en.wikipedia.org/wiki/Suffix_array

Volkan Yazıcı, February 18th, 2008

Instead, use a Burkhard-Keller tree index on the related column and make metric (e.g. Levenshtein) comparisons for each word. No extra column is needed and serving in O(logn) access time.


Write a comment