Monday, 11 July 2011

Configuring Search Relevance for SharePoint

Search relevance refers to the process of tuning search results so that they closely match what the user is trying to find with a search query.
A number of factors affect the presentation and order of the search results when a user runs a search query:
  1. Precision (finding the right answers)
  2. Recall (finding all the answers)
  3. Visual Design
  4. Usability
  5. Speed
The relevance ranking engine is based on information retrieval algorithms adapted from Stephen Robertson’s BM25F algorithm. It is specifically tuned for the unique requirements of searching enterprise content. This approach orders results by decreasing probability of relevance to the query. Terms are used to describe both the document and the query. Statistics about the terms and the result make up the ranking: the document length, the number of occurrences of the term in the document, and the number of documents in which each term occurs at all (this is repeated for each property). This is further enhanced by tracking body text and properties, such as title or author, individually. Each further enhancement to the model, which adds features and facts about the document or the query, contributes to better results.
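To make the statistics mentioned above concrete, here is a minimal sketch of plain, single-field BM25 in Python. It is not SharePoint's tuned ranking model: the function name bm25_scores, the sample documents, and the parameter values k1=1.2 and b=0.75 are illustrative assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with plain BM25.

    k1 and b are commonly used textbook values, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        # Longer-than-average documents are penalized, shorter ones boosted.
        length_norm = 1 - b + b * len(d) / avg_len
        score = 0.0
        for term in query:
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical sample documents, already split into terms.
docs = [
    "sharepoint search relevance tuning".split(),
    "quarterly budget report".split(),
    "search search search tips for sharepoint users and site owners".split(),
]
scores = bm25_scores(["search", "relevance"], docs)
```

In this sample the first document ranks highest because it matches both query terms and is short, the third matches only the more common term, and the second matches nothing and scores zero.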

When performing a search, SharePoint applies two types of ranking:
1) Dynamic Ranking
2) Static Ranking
Dynamic Ranking of Search Results
This ranking is based on the search query terms and the information available in the indexed metadata. The calculation takes place on the query servers and depends on how well the query text matches the indexed term information.
The following components are used to determine the dynamic rank of search results.
Anchor Text
Anchor text is the text that is included with a hyperlink to describe the target content of that hyperlink. It only influences the rank of a result; it does not cause an item to be included in the overall search result set.
Search indexes the anchor text from the following elements:
  • HTML anchor elements
  • Microsoft Windows SharePoint Services link lists
  • Microsoft Office SharePoint Portal Server 2003 listings
  • Microsoft Office Word 2007, Microsoft Office Excel 2007, and Microsoft Office PowerPoint 2007 hyperlinks (only for files using the new Office Open XML Formats)
Property Weighting

Property weighting is the process of assigning weights, or priorities, to the various properties available in the search index. These weights can be adjusted to improve how well relevant results are ranked.

Property Length Normalization
A content item can have many different properties of varying length. If the values in these properties are treated equally regardless of their size during relevance calculation, it can have a negative impact on the calculated rank. Length normalization adjusts the rank of a content item, based on the length of the property, and the length normalization setting.
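As a rough illustration of how property weighting and length normalization can work together, the sketch below follows the general BM25F idea of combining per-property term counts into one weighted, length-normalized frequency. The property names, weights, normalization settings, and average lengths are all hypothetical values and are not SharePoint defaults.

```python
def weighted_term_frequency(term, item, weights, length_norm, avg_lengths):
    """Combine per-property term counts into one weighted frequency.

    item maps a property name to its list of terms. weights, length_norm
    and avg_lengths map each property name to a number (all hypothetical).
    """
    total = 0.0
    for prop, tokens in item.items():
        if not tokens:
            continue
        tf = tokens.count(term)
        b = length_norm[prop]  # 0 = ignore length, 1 = full normalization
        norm = 1 - b + b * len(tokens) / avg_lengths[prop]
        total += weights[prop] * tf / norm
    return total


# Hypothetical content item and settings.
item = {
    "title": "search relevance".split(),
    "body": "tips for tuning search results in the enterprise".split(),
}
weights = {"title": 5.0, "body": 1.0}
length_norm = {"title": 0.5, "body": 0.75}
avg_lengths = {"title": 2, "body": 8}

title_only = weighted_term_frequency("relevance", item, weights, length_norm, avg_lengths)
body_only = weighted_term_frequency("tuning", item, weights, length_norm, avg_lengths)
both = weighted_term_frequency("search", item, weights, length_norm, avg_lengths)
```

With these assumed settings a hit in the title counts five times as much as a hit in the body, and a property that is longer than average would have its contribution scaled down.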
URL Matching
The name of the site is one of the important search terms. SharePoint search matches the name of the site to the URL of the site.
Title Extraction
Using the title of a document on the index server can help return search results with very high precision. However, in many scenarios, the title of a document does not accurately reflect its content.
Static Ranking of Search Results
This is independent of the search query and is set during the indexing phase on the index server.
The following parameters affect the static ranking of search results.
Click Distance
Click distance refers to the number of links between a content item and an "expert" page linking to the content item.
The more links the crawler must follow from an authoritative page to reach the content item, the lower the relevance score. If there are multiple paths to a content item, relevance is calculated based on the shortest path, the one with the fewest links from the authoritative page to the content item.
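The shortest-path idea can be sketched as a breadth-first search over the link graph, starting from the authoritative pages. This is only an illustration of the concept; the function name click_distances and the sample link graph are made up for the example.

```python
from collections import deque


def click_distances(links, authoritative_pages):
    """Return the fewest number of links from any authoritative page to each page.

    links maps a page to the list of pages it links to.
    """
    distances = {page: 0 for page in authoritative_pages}
    queue = deque(authoritative_pages)
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            # Breadth-first order means the first visit is the shortest path.
            if target not in distances:
                distances[target] = distances[page] + 1
                queue.append(target)
    return distances


# Hypothetical link graph: the document is reachable by two different paths.
links = {
    "/": ["/hr", "/it"],
    "/hr": ["/hr/policies"],
    "/hr/policies": ["/hr/policies/leave.docx"],
    "/it": ["/hr/policies/leave.docx"],
}
distances = click_distances(links, ["/"])
```

Here the document is three clicks away through the HR pages but only two through the IT page, so the shorter path determines its click distance.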
URL Depth
Important or relevant content is often located closer to the top of a site’s hierarchy, instead of in a location several levels deep in the site. As a result, the content has a shorter URL, so it is more easily remembered and accessed by the user. Enterprise Search makes use of this fact by reviewing URL depth, which refers to how many levels deep within a site the content item is found. The level is determined by reviewing the number of slash ("/") characters in the URL; the greater the number of slash characters in the URL path, the deeper the URL is for that content item. As a consequence, a large URL depth number can lower the relevance of that content.
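Counting slashes is simple to sketch in Python. The exact counting rule that Enterprise Search applies is not spelled out here, so the function below, which counts slashes in the path part of the URL and ignores a trailing slash, is an assumption for illustration, as are the sample URLs.

```python
from urllib.parse import urlsplit


def url_depth(url):
    """Count the slash characters in the path part of a URL.

    A trailing slash is ignored so that a site root has depth 0 (assumed rule).
    """
    path = urlsplit(url).path
    return path.rstrip("/").count("/")


shallow = url_depth("http://intranet/")
deep = url_depth("http://intranet/sites/hr/policies/leave.docx")
```

Under this rule the site root has depth 0 and the document four levels down has depth 4, so the document would be treated as the less prominent of the two.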
Automatic Language Detection
Enterprise Search determines the user’s language from the "Accept-Language" header sent by the browser they are using; this is automatic language detection. When calculating relevance, content that is retrieved in the user’s language is considered more relevant than content in other languages, with the exception of English language content. English language content is considered as relevant as content in the user’s language.
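The rule described above can be sketched as follows. The parsing is deliberately simplified, and treating the highest-quality entry in the header as "the user's language" is an assumption made for this example; the function names are hypothetical.

```python
def preferred_languages(accept_language):
    """Parse an Accept-Language header into primary language codes, best first."""
    entries = []
    for part in accept_language.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        quality = 1.0  # entries without a q value default to 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        entries.append((quality, tag.strip().split("-")[0].lower()))
    # sorted() is stable, so entries with equal quality keep header order.
    entries.sort(key=lambda entry: entry[0], reverse=True)
    return [lang for _, lang in entries]


def is_preferred_language(content_language, accept_language):
    """True if the content is in the user's top language or in English."""
    user_languages = preferred_languages(accept_language)
    primary = content_language.split("-")[0].lower()
    if primary == "en":
        return True
    return bool(user_languages) and primary == user_languages[0]


header = "de-DE,de;q=0.9,fr;q=0.5"
german = is_preferred_language("de-DE", header)
english = is_preferred_language("en-US", header)
french = is_preferred_language("fr", header)
```

For a browser that prefers German, German and English content both qualify, while French content does not.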
File Type Biasing
In most search scenarios, certain file types are more relevant than others. For example, HTML pages and Word documents are usually more relevant to a user’s search than an Excel spreadsheet or a plain text file.
Enterprise Search’s relevance calculation includes a ranking algorithm that ranks some file types higher than other file types. This applies to the following file types, listed in default ranking order in Enterprise Search, starting with the highest:
  • HTML Web pages
  • PowerPoint presentations
  • Word documents
  • XML files
  • Excel spreadsheets
  • Plain text files
  • List items
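The default order in the list above can be expressed as a simple lookup. This sketch only shows the ordering; the real engine combines file type with the other ranking signals, and the actual weights are not given here. The type labels, function names, and sample results are illustrative.

```python
# Default order from the list above, highest-ranked first.
FILE_TYPE_ORDER = [
    "HTML Web page",
    "PowerPoint presentation",
    "Word document",
    "XML file",
    "Excel spreadsheet",
    "Plain text file",
    "List item",
]


def file_type_position(file_type):
    """Return the 0-based position in the default order; unknown types sort last."""
    try:
        return FILE_TYPE_ORDER.index(file_type)
    except ValueError:
        return len(FILE_TYPE_ORDER)


def order_by_file_type(results):
    """Sort (title, file_type) pairs by file type position only."""
    return sorted(results, key=lambda result: file_type_position(result[1]))


# Hypothetical results that are otherwise equally relevant.
results = [
    ("Budget 2011", "Excel spreadsheet"),
    ("Search overview", "HTML Web page"),
    ("Search design", "Word document"),
]
ordered = order_by_file_type(results)
```

With everything else equal, the HTML page comes first, then the Word document, then the Excel spreadsheet.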