Google's Query Refinements
Author: Danny Wirken

Search engines aim to provide the most relevant results in
response to queries but limitations can be seen on what is
actually returned based on the queries used. Search queries can
either be too specific or too general for search engines to
recognize good results. Google has filed patent applications
regarding alternative query terms or query refinements to offer
a solution.

The Google Solution

Search queries that are not too effective in providing good
results include homonyms which are words that have the same
sound or spelling but different meanings. Improper contexts in
the choice of words can also be very confusing especially to
search engines. Very general terms provide results that are too
broad while very narrow terms can be very restrictive and may
provide non-responsive search results.

Google presents a system and method that attempts to address
this particular problem. In this system, a stored query and a
stored document are associated as a logical pairing. The
pairing is assigned a weight thus when a search query is
issued, a set of search documents is produced. There is at
least one search document that matches at least one document.
Retrieval is done when the stored query and the assigned weight
associated with it matches at least one stored document. A
cluster is formed through this and scoring is done on at least
one cluster relative to at least one other cluster. At least
one such scored query is suggested as a set of query
refinements.

The process starts when Google finds results by choosing the
top 100 documents for clustering. During this phase, term
vectors are computed for each of the said documents which were
ranked by relevance score. The documents are matched to a
stored document listed in an association database. Alternative
query terms are found by looking at associations with queries
that had been computed beforehand for the matched stored
documents.

Term vectors are also created for alternative query terms.
Clusters are created from both sets of term vectors to form
groupings. Each cluster has a calculated cluster centroid.
Search queries associated with a search document in the cluster
are scored according to the distance from this centroid and the
percent of stored documents occurring in the cluster. The best
suggested query refinement contains the highest number search
query terms and the most frequently seen in the documents in
the cluster.

Other clusters and query names may be created to come up with
additional suggested query refinements. Refinements are sorted
by relevance scores. Alternative queries can include negated
forms of terms appearing in the set of refinements but does not
appear on the original search query. A number of predetermined
search queries selected from past user queries can be used to
arrive at a precomputed possible set of refinements. The
predetermined queries would be issued while search results are
maintained in a database for future user search requests. The
refined queries would be provided to the user together with the
results of the original search.

The precomputation stage happens before any query is entered
into the search engine. It is best described with the use of at
least four parts – associator, selector, regenerator and
inverter.

The associator creates relevance-weighted relationships between
stored queries and stored documents. The selector decides which
stored documents and stored queries should be retrieved. The
regenerator looks at query logs and selects stored documents
based on previous searches. The inverter looks at the cached
data and selects documents and associated queries based on the
cached data.

The query refinements system itself has four parts. A matcher
matches one or more stored documents to the actual search
documents which have been generated by the search engine to
answer a search query. It also identifies the stored queries
and assigned weights using the associations corresponding to
the matched stored documents. A clusterer forms one or more
clusters using term vectors formed from the terms occurring in
the matched stored queries and corresponding weights. The
scorer computes centroids which represent the weighted center
of each cluster's term vector. A presenter identifies the
highest scoring search queries as one or more query refinements
to the user. The interesting aspect about this approach is how
user data is incorporated into results through the use of log
files and cached information.

The patent application shows one way of achieving query
refinements but no one really knows for sure exactly how Google
comes up with alternative results. However, it offers some hints
on how to create contents on websites and how to show up in
these alternative results. By taking into careful consideration
the words that people will probably search for and what appears
in Google's results for search phrases, a clue can be provided
on how the search refinements approach will treat a website.

Multi-Stage Query Processing

The determination of page relevancy in responding to queries
from searchers considers how a term or phrase is used in the
context of a page. A patent application that looks into the
possible ways of considering the context of these words was
likewise submitted by Google. It describes a multi-stage
process that determines relevancy and finds results to a
search.

The possible actions to be taken as described in this document
can be divided into stages. The first stage deals with deletion
of stop words, term stemming and expansion of queries to use
things like synonyms and related terms that commonly co-occur
with them. During this stage, the relevancy scores are created
between query and each document computed with one or more
scoring algorithms. The second stage uses adjacency and
proximity of terms to rank documents. The third stage reviews
the term attributes such as determining whether terms are
titles, headings, metadata or whether these terms possess
certain font characteristics. The fourth and last stage is the
generation of snippets to return with results.

Interactive query refinements have shown that it can promote
effective retrieval. Major search engines use the history of a
user's actions such as queries or clicks to personalize search
results. The query-specific web recommendations (QSRs)
retroactively answer queries from the user's history as new
results arise. Its main goal is to recommend new web pages for
user's old queries. However, this will not be of any use unless
the user has a standing interest in a particular query. Focus
can also be shifted from individual queries to query sessions
which includes all actions associated with a given initial
query. A query is considered a query refinement of the previous
one if both queries contain at least one common term.


About The Author: http://www.theinternetone.net