SharePoint and stemming
Happy New Year! :-)
Sooo, this blog has been a little quieter than planned recently, due to other activities taking priority. But hopefully it will be back on track during January.
A quick post for starters.
I've always been more than a little bit interested in the search capabilities within SharePoint, ever since Microsoft introduced probabilistic ranking in SharePoint Portal Server 2001. I can still do a pretty mean explanation of how the Okapi algorithm ranks search results and compare it to PageRank.
Anyways, there is a useful Microsoft blog specialising in search, run by Mike Taghizadeh. He's just posted a couple of articles on word stemming. Word stemming helps determine which documents are returned when you enter a search query. Mike talks all about it, so here is the short version.
- When you submit a query in SharePoint, the query is broken into individual words. For example, the query "securing the database" would be broken down into "securing", "the", and "database".
- Noise words can be eliminated, i.e. common words such as "and", "the", "or", that are unlikely to influence results. In this example, "the" would be dropped from the query.
- The query words can then be stemmed for variations. For example, a query for "security" could be expanded to include documents that refer to "securing", "securely" and so on.
- The query words will also be compared against the thesaurus. The thesaurus is customisable and very useful for words with domain-specific alternatives or abbreviations. You can choose to expand queries (e.g. expand "PMB" to also search for "Purple Medium Board") or replace queries (e.g. replace "ie" with "Instant Everywhere", because a search for "ie" on its own will return just about every document in an English-language index).
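To make those four steps a bit more concrete, here is a toy sketch in Python. It is purely illustrative and is not how SharePoint implements any of this: the noise word list, the stem table, the thesaurus entries and the `process_query` function are all made up for the example, and a real stemmer works out variations linguistically rather than from a lookup table.

```python
# Toy illustration only: none of these names or tables come from SharePoint.
NOISE_WORDS = {"and", "the", "or"}

# Hypothetical stem table (a real stemmer derives variations linguistically).
STEM_VARIANTS = {"security": ["securing", "securely"]}

# Hypothetical thesaurus entries for the two modes: expand and replace.
EXPANSIONS = {"pmb": ["purple medium board"]}
REPLACEMENTS = {"ie": ["instant everywhere"]}


def process_query(query, stemming=False):
    """Return, for each surviving query word, the list of terms to search for."""
    words = query.lower().split()                        # break into words
    words = [w for w in words if w not in NOISE_WORDS]   # drop noise words
    result = []
    for word in words:
        if word in REPLACEMENTS:
            # Replacement: the original word is not searched for at all.
            terms = list(REPLACEMENTS[word])
        else:
            terms = [word]
            if stemming:
                terms += STEM_VARIANTS.get(word, [])     # add stemmed variations
            terms += EXPANSIONS.get(word, [])            # add thesaurus expansions
        result.append(terms)
    return result


print(process_query("securing the database"))
print(process_query("security", stemming=True))
print(process_query("PMB"))
print(process_query("ie"))
```

Note that with stemming left off, "security" is searched for as-is, which is the default behaviour discussed next.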
In his post, Mike mentions that word stemming is turned off by default. I've just checked on my demo laptop and he's right. If you want to turn on word stemming, here's how:
- Go to the search page and enter any old query to bring up the search results page
- Under Site Actions, select 'Edit page'
- Locate the 'Search Core Results' web part (usually in the bottom zone)
- From the Edit button, select 'Modify shared web part'
- In the tool pane on the right-hand side, under 'Results Query Options', check the box labelled 'Enable Search Term Stemming'
And hey presto, it's switched on.
Now, before you rush to enable the feature because it seems the obvious thing to do, be warned. Word stemming can affect the relevance of your search results. If some terms have lots of stemmed variations and others have none, one word may dominate the results even if it isn't the priority in the context of what you are looking for. Stemming can also hurt performance - there will be a delay whilst the search query is expanded to include the stemmed variations, and a larger set of results will be returned.
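Here is a rough sketch of the "one word dominates" problem. Big caveat: the scoring below just counts term occurrences, which is nothing like the probabilistic ranking SharePoint actually uses, and the two documents, the stem table and the `score` and `rank` functions are invented for the example. It only shows the general effect of one query word gaining extra variations while the other gains none.

```python
# Toy illustration only: a naive term-count score, not SharePoint's ranking.
STEM_VARIANTS = {"security": ["securing", "securely"]}  # hypothetical stem table

# Two made-up documents, keyed by a made-up name.
DOCS = {
    "db-guide": "database backup and database restore guide with security notes",
    "policy": "security policy for securing servers securely and securing laptops",
}


def score(text, query, stemming):
    """Count how often the query words (and optionally their variations) occur."""
    words = text.split()
    total = 0
    for term in query.split():
        variants = [term]
        if stemming:
            variants += STEM_VARIANTS.get(term, [])
        total += sum(words.count(v) for v in variants)
    return total


def rank(query, stemming):
    """Return document names ordered from highest to lowest score."""
    scored = {name: score(text, query, stemming) for name, text in DOCS.items()}
    return sorted(scored, key=scored.get, reverse=True)


print(rank("database security", stemming=False))
print(rank("database security", stemming=True))
```

Without stemming the database guide comes first. With stemming switched on, "security" picks up its variations, "database" picks up none, and the policy document moves to the top even though it never mentions a database.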
Technorati tags: SharePoint, SharePoint 2007, MOSS 2007