Search Engine Performance: Howto Avoid Suffix Queries
Friday, February 15, 2008
The reason why Lucene’s QueryParser doesnt allow suffix queries (“*keyword”) is the inefficient nature of such query types. In order to find a word with a suffix query, it must iterate through every term in the index and check if it matches.
Depending on the index size, this can potentially kill performance and slow down your application. There is, however, a simple trick which lets you perform suffix queries without the performance disadvantage.
Simply store the reverse of the string and transform suffix queries to prefix queries against the reverse string.
Here’s an example. With two strings in your index
A) “Tee House”
B) “Waffle House”
you’re going to store the reverse string too
A) “esuoH eeT”
B) “esuoH elffaW”
and transform each suffix query like “*House” to “esuoH*”. The results will be the same as with a real suffix query - just faster!
The disadvantage of this method? denormalization. Since each string is going to be stored twice, index size will become bigger.
5 Comments
- Martin, February 16th, 2008
No shit sherlock. :)
- harold, February 16th, 2008
What about substring queries? *Waffle* --> *elffaW*
- kk, February 17th, 2008
For substring queries you have to store all substrings: House, ouse, use, se, e ;)
- Sam Danielson, February 17th, 2008
Instead of multiple crappy indexes, how about a better index?
- Volkan Yazıcı, February 18th, 2008
Instead, use a Burkhard-Keller tree index on the related column and make metric (e.g. Levenshtein) comparisons for each word. No extra column is needed and serving in O(logn) access time.