The name "Google Dance" is often used to describe
the index update of the Google search engine. Google's
index update occurs on average once per month. It can
be identified by significant movement in search results
and especially by Google's cache of all indexed pages
reflecting the status of Google's last spidering. But
the update does not proceed as a switch from one index
to another at one point in time. In fact, it takes several
days to complete the index update. During this period,
the old and the new index alternate on www.google.com.
At an early stage, the results from the new index occur
sporadically. But later on, they appear more frequently.
Google dances.
The start dates of previous Google Dances have been
compiled by WebmasterWorld's administrator Brett Tabke.
This article explains the reasons for the Google
Dance.
Technical level: Expert || Date: 9th December 2004 || Author: Markus Sobek
MIS Editor:
Thanks to dance.efactory.de and Markus Sobek for allowing
us to reproduce this article. PageRank and Google
are trademarks of Google Inc., Mountain View CA, USA.
PageRank is protected by US Patent 6,285,999.
The Technical Background of the Google Dance
The Google search engine pulls its results from more
than 10,000 servers, which are simple Linux PCs that
Google uses for reasons of cost. Naturally, an index
update cannot be carried out on all those servers
at the same time. One server after the other has to
be updated with the new index.
Many webmasters think that, during the Google Dance,
Google is in some way able to control whether a server
with the new index or a server with the old index responds
to a search query. But since Google's index is an inverted
index, this would be very complicated. As we will show
below, there is no such control within the system. In fact,
the reason for the Google Dance is Google's way of
using the Domain Name System (DNS).
Google Dance and DNS
Not only is Google's index spread over more than 10,000
servers, but these servers are also, as of now, placed
in seven different data centers. These data centers
are mainly located in the US (e.g. Santa Clara, California
and Herndon, Virginia), but in June 2002 Google's
first European data center in Zurich, Switzerland
went online. Very likely, there are more data centers
to come, which will perhaps be spread over the whole
world.
In order to direct traffic to all these data centers,
Google could theoretically receive all queries centrally
and then forward them to the data centers. But this would
obviously be inefficient. In fact, each data center
has its own IP address (numerical address on the Internet)
and the way these IP addresses are accessed is managed
by the Domain Name System.
Basically, the DNS works like this:
On the Internet, data transfers always take place
between IP addresses. The information about which
domain resolves to which IP address is provided by
the name servers of the DNS. When a user enters a
domain into a browser, a locally configured name
server obtains the IP address for that domain by
contacting the name server which is responsible for
that domain. (The DNS is structured hierarchically.
Illustrating the whole process would go beyond the
scope of this paper.) The IP address is then cached
by the local name server, so that it is not necessary
to contact the responsible name server each time a
connection to a domain is established.
The records for a domain at the responsible name
server specify how long the record may be cached
by a caching name server. This is the Time To Live
(TTL) of a domain. As soon as the TTL expires, the
caching name server has to fetch the record for the
domain again from the responsible name server. Quite
often, the TTL is set to one or more days. In contrast,
the Time To Live of the domain www.google.com is only
five minutes. So, a name server may only cache Google's
IP address for five minutes and then has to look up
the IP address again.
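The caching behaviour described above can be sketched
in a few lines of Python. This is only a toy model, not
how any real name server is implemented: the class name
CachingResolver, the callback "authoritative" and the
injectable clock are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class CachingResolver:
    """Toy model of a caching name server that honours a record's TTL."""

    def __init__(self,
                 authoritative: Callable[[str], Tuple[str, float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        # authoritative(domain) -> (ip_address, ttl_seconds); hypothetical
        # stand-in for a query to the responsible name server.
        self._authoritative = authoritative
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}  # domain -> (ip, expiry)
        self.upstream_lookups = 0

    def resolve(self, domain: str) -> str:
        now = self._clock()
        cached = self._cache.get(domain)
        if cached is not None and now < cached[1]:
            return cached[0]  # record is still within its TTL
        # TTL expired (or never cached): ask the responsible name server.
        ip, ttl = self._authoritative(domain)
        self.upstream_lookups += 1
        self._cache[domain] = (ip, now + ttl)
        return ip
```

With a TTL of 300 seconds, as for www.google.com, every
lookup within five minutes is answered from the cache,
and the first lookup after that goes back to the
responsible name server and may return a different IP
address.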
Each time Google's name server is contacted, it
sends back the IP address of only one data center.
In this way, Google queries are always directed to
different data centers by changing DNS records. On
the one hand, the DNS records may be based on the
load of the individual data centers. In this way, Google
would conduct a simple form of load balancing by its
use of the DNS. On the other hand, the geographical
location of a caching name server may influence how
often it receives each data center's IP address, so
that the distance for data transmissions can be reduced.
In order to show the DNS records of the domain www.google.com,
we present them here by the example of one caching
name server.
How data centers, DNS and the Google Dance are related
is easily answered. During the Google Dance, the data
centers do not receive the new index at the same time.
In fact, the new index is transferred to one data
center after the other. When a user queries Google
during the Google Dance, he may get the results from
a data center which still has the old index at one
point in time and from a data center which has the
new index a few minutes later. From the user's perspective,
the index update took place within a few minutes. But
of course, this procedure may reverse, so that Google
seemingly switches back and forth between the old and
the new index.
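The effect can be illustrated with a small simulation.
It is a sketch under simplifying assumptions: every DNS
lookup picks one of the seven data centers uniformly at
random (the article suggests load and geography play a
role in reality), and which data centers already hold
the new index is made up by the caller.

```python
import random
from typing import Dict, List

# Short names of the data center domains (www-xx.google.com).
DATA_CENTERS: List[str] = ["www-ex", "www-sj", "www-va", "www-dc",
                           "www-ab", "www-in", "www-zu"]


def observe_dance(updated: Dict[str, bool], lookups: int,
                  seed: int = 0) -> List[str]:
    """Return the index ("old" or "new") a user sees after each DNS lookup.

    Each lookup stands for one expiry of the five-minute TTL, after which
    www.google.com may resolve to a different data center.
    """
    rng = random.Random(seed)
    seen = []
    for _ in range(lookups):
        data_center = rng.choice(DATA_CENTERS)  # assumption: uniform choice
        seen.append("new" if updated[data_center] else "old")
    return seen


def share_of_new(seen: List[str]) -> float:
    """Fraction of lookups that were answered from the new index."""
    return seen.count("new") / len(seen)
```

Marking only one data center as updated models the early
stage, in which results from the new index show up only
sporadically. Marking six of the seven models the late
stage, in which the old index has become the exception.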
IP Addresses and Domains of Google's Data Centers
The progression of a Google Dance could in principle
be watched by querying the IP addresses of Google's
data centers. But queries on the IP addresses are
normally redirected to www.google.com. However, Google
has domains which resolve to the individual data centers'
IP addresses. These domains as well as their IP addresses
are shown in the following list.
Domain               IP Address
www-ex.google.com    216.239.33.100
www-sj.google.com    216.239.35.100
www-va.google.com    216.239.37.100
www-dc.google.com    216.239.39.100
www-ab.google.com    216.239.51.100
www-in.google.com    216.239.53.100
www-zu.google.com    216.239.55.100
For every domain www-xx.google.com, there is an additional
domain www-xx2.google.com. The IP address of such
a domain ends in .101 instead of .100. Each pair
of domains and IP addresses belongs to the same data
center and, hence, the same index is searched by queries
on them.
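The naming rule is mechanical enough to write down as
a short function. The function name secondary_endpoint
is an assumption for illustration. It only rewrites
strings and performs no DNS lookup.

```python
from typing import Tuple


def secondary_endpoint(domain: str, ip: str) -> Tuple[str, str]:
    """Map www-xx.google.com / a.b.c.100 to www-xx2.google.com / a.b.c.101."""
    host, _, rest = domain.partition(".")
    prefix, _, last_octet = ip.rpartition(".")
    if last_octet != "100":
        raise ValueError(f"expected an address ending in .100, got {ip}")
    return f"{host}2.{rest}", f"{prefix}.101"
```

For www-va.google.com and 216.239.37.100 this yields
www-va2.google.com and 216.239.37.101.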
Note:
Searches at www-sj and www-zu are currently redirected
to other data centers. Since results for searches
at their IP addresses fluctuate heavily during a Google
Dance, these searches also seem to be internally routed
to other data centers. As we can see from our statistics
for Google's DNS records, there are currently no searches
at www.google.com directed to www-sj or www-zu. So,
we can assume that the two data centers are offline.
Those who keep an eye on Google's index updates
often think that the Google Dance is over when they
see the new index at www.google.com or when they don't
see the old index at www.google.com for some time.
In fact, the update is not finished until all the
domains listed above provide results from the new
index.
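This criterion can be expressed as a small check. The
sketch below assumes that an observer has already noted,
by hand, whether each data center domain currently shows
the old or the new index. The function name dance_status
and the labels it returns are illustrative choices.

```python
from typing import Dict, List

DATA_CENTER_DOMAINS: List[str] = [
    "www-ex.google.com", "www-sj.google.com", "www-va.google.com",
    "www-dc.google.com", "www-ab.google.com", "www-in.google.com",
    "www-zu.google.com",
]


def dance_status(observed: Dict[str, str]) -> str:
    """Classify an update from per-domain observations ("old" or "new")."""
    missing = [d for d in DATA_CENTER_DOMAINS if d not in observed]
    if missing:
        raise ValueError(f"no observation for: {', '.join(missing)}")
    new_count = sum(observed[d] == "new" for d in DATA_CENTER_DOMAINS)
    if new_count == 0:
        return "not started"
    if new_count == len(DATA_CENTER_DOMAINS):
        return "finished"
    # Seeing the new index on some domains only is not enough.
    return "in progress"
```

The point of the check is the last branch: as long as a
single data center still answers from the old index, the
update counts as being in progress, no matter what
www.google.com happens to show.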
The index updates at the individual data centers seem
to happen at one point in time. As soon as one data
center shows results from the new index, it won't
switch back to the old index. This happens most likely
because the index is redundant at each data center
and, at first, only one part of the servers (possibly
half of them) is updated. During this period, only
the other half of the servers is active and provides
search results. As soon as the update of the first
half of the servers is finished, they become active and
provide search results while the other half receives
the new index. Thus, from the user's perspective,
the update of one data center happens at one point
in time.
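The presumed half-by-half procedure can be traced step
by step. The following sketch only models the
explanation given above, which is itself a conjecture.
The two halves "A" and "B" and the function name are
assumptions made for illustration.

```python
from typing import Dict, List, Set


def visible_indexes_during_update() -> List[Set[str]]:
    """Return the index versions users can see at each step of the update."""
    index: Dict[str, str] = {"A": "old", "B": "old"}  # index held by each half
    timeline: List[Set[str]] = []

    def record(active: List[str]) -> None:
        # Only active halves answer queries, so only their index is visible.
        timeline.append({index[half] for half in active})

    record(["A", "B"])   # before the update: both halves serve the old index
    index["A"] = "new"   # half A is taken out of service and updated
    record(["B"])        # meanwhile half B alone serves the old index
    index["B"] = "new"   # roles swap: half B is now updated out of service
    record(["A"])        # half A alone serves the new index
    record(["A", "B"])   # update complete: both halves serve the new index
    return timeline
```

At every step exactly one index version is visible, and
the sequence changes from old to new only once. This is
why a single data center appears to switch at one point
in time and never falls back.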
Finally, it should be noted that access to the
individual data centers is generally controlled by the
DNS only, but sometimes queries are redirected. However,
this is easy to detect: when, for a query at one of
the domains listed above, the links to Google's cache
do not match the IP address that belongs to
the domain, the query has been redirected. If this
happens, Google is blocking, for whatever reason,
access to that data center.
The Google Dance Test Domains www2 and www3
The beginning of a Google Dance can always be watched
at the test domains www2.google.com and www3.google.com.
Those domains normally have stable DNS records which
make the domains resolve to only one (often the same)
IP address. Before the Google Dance begins, at least
one of the test domains is assigned the IP address
of the data center that receives the new index first.
Building up a completely new index once per month
can cause quite some trouble. After all, Google has
to spider several billion documents and then process
many terabytes of data. Therefore, testing the new
index is essential. Of course, the folks at Google
don't need the test domains themselves. Most certainly,
they have many options to check a new index internally,
but they do not have a lot of time to conduct the
tests.
So, the reason for having www2 and www3 is rather
to show the new index to webmasters who are interested
in their upcoming rankings. Many of these webmasters
discuss the new index at the Google forums out on
the web. These discussions can be observed by Google
employees. At that time, the general public cannot
see the new index yet, because the DNS records for
www.google.com normally do not point to the IP address
of the data center that is updated first when the
update begins.
Once Google's test community of forum members
has found no severe malfunctions caused by the
new index, Google's DNS records can be changed to make
www.google.com resolve to the data center that is
updated first. This is the time when the Google Dance
begins. But if severe malfunctions become obvious
during this test phase, there is still the possibility
to cancel the update at the other data centers. The
domain www.google.com would not resolve to the data
center which has the flawed index and the general
public would never notice it. In this
case, the index could be rebuilt or the web could
be spidered again.
So, the search results which are to be seen on www2.google.com
and www3.google.com will always appear on www.google.com
later on, as long as there is a regular index update.
However, there may be minor fluctuations. On the one
hand, the index at one data center never absolutely
equals the index at another data center. We can easily
check this by watching the number of results for the
same query at the data center domains listed above,
which often differ from each other. On the other hand,
it is often assumed that the iterative PageRank calculation
is not yet finished when the Google Dance begins,
so that preliminary values influence rankings
at that point in time.
The New PageRank Values during the Google Dance
Most webmasters are interested in ranking changes
for their website during the Google Dance. But, besides
that, many also want to know about their new PageRank
values. Normally, the Google Toolbar fetches the PageRank
values from the data center that is specified by its
IP address in the current DNS record for www.google.com.
Hence, when the Google Dance begins, the Toolbar usually
displays the old PageRank values.
Google submits PageRank values in simple text files
to the Toolbar. Formerly, this happened via
XML. The switch to text files occurred in August 2002.
The PageRank files can be requested directly from
the domain www.google.com. Basically, the URLs for
those files look as follows (without line breaks):
http://www.google.com/search?
client=navclient-auto&
ch=0123456789&
features=Rank&
q=info:http://www.domain.com/
There is only one line of text in the PageRank
files. The last digit in this line is the PageRank value.
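Reading the value out of such a line could look like the
sketch below. The exact layout of the line is not given
here, so the code makes one assumption: the PageRank is
the run of digits at the very end of the line. For
single-digit values this is the same as "the last digit".
The function name parse_pagerank and the sample line in
the usage note are purely illustrative.

```python
import re


def parse_pagerank(line: str) -> int:
    """Return the number at the end of a one-line PageRank file."""
    # Assumption: the value is the trailing run of digits on the line.
    match = re.search(r"(\d+)\s*$", line)
    if match is None:
        raise ValueError(f"no trailing number in line: {line!r}")
    return int(match.group(1))
```

For a hypothetical line such as "Rank_1:1:5" the function
returns 5.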
The parameters in the URL shown above
are required for the PageRank files to be displayed
in a browser. The value "navclient-auto"
for the parameter "client" identifies the
Toolbar. The URL is submitted via the parameter
"q". The value "Rank" for the parameter
"features" determines that the PageRank
files are requested. If it is omitted, Google's servers
still transmit XML files. The parameter "ch"
transfers a checksum for the URL to Google. For a
given URL, this checksum can only change when the
Toolbar version is updated by Google.
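Assembling such a URL from its four parameters can be
sketched as follows. The function name pagerank_url is
an assumption, and the checksum has to be supplied by
the caller, since how the Toolbar computes it is not
described here. The value 0123456789 is only the
placeholder from the URL shown above.

```python
from urllib.parse import urlencode


def pagerank_url(page_url: str, checksum: str,
                 host: str = "www.google.com") -> str:
    """Build the Toolbar-style request URL described in the article."""
    params = [
        ("client", "navclient-auto"),  # identifies the Toolbar
        ("ch", checksum),              # checksum for page_url (caller-supplied)
        ("features", "Rank"),          # request the text file, not XML
        ("q", "info:" + page_url),     # the page whose PageRank is wanted
    ]
    # Keep ":" and "/" unescaped so the result matches the form shown above.
    return f"http://{host}/search?" + urlencode(params, safe=":/")
```

Called with "http://www.domain.com/" and the placeholder
checksum, the function reproduces the example URL as a
single line.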
The PageRank files that are requested by the Google
Toolbar are cached by Internet Explorer. So, their
URLs and the checksums can simply be found by
having a look at the folder Temporary Internet Files.
Knowing the checksums of your URLs, you can view the
PageRank files in your browser. Since the PageRank
files are kept in the browser cache and, thus, are
clearly visible, and as long as requests are not automated,
watching the PageRank files in a browser should not
be a violation of Google's Terms of Service. However,
you should be cautious. The Toolbar submits its own
User-Agent to Google. It is:
Mozilla/4.0 (compatible; GoogleToolbar 1.1.60-deleon;
OS SE 4.10)
1.1.60-deleon is a Toolbar version which may of course
change. OS is the operating system that you have installed.
So, Google is able to identify requests by browsers,
if they do not go out via a proxy and if the User-Agent
is not modified accordingly.
Now, let's see how we can get the new PageRank values.
Taking a look at IE's cache, you will notice that
the PageRank files are not requested from the domain
www.google.com but from IP addresses like 216.239.33.102.
Additionally, the PageRank files' URLs often contain
a parameter "failedip" that is set to values
like "216.239.35.102;1111" (its function
is not entirely clear). However, it is pretty easy
to get the new PageRank values. Simply modify the
IP addresses in the URL so that the request goes to
one of the data centers that already has the new index.
The necessary information is given above.
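Swapping the address in a cached URL is a plain string
operation and can be sketched like this. The function
name retarget is an assumption, and which data center
already holds the new index at a given moment has to be
found out by observation.

```python
from urllib.parse import urlsplit, urlunsplit


def retarget(url: str, new_host: str) -> str:
    """Return the same URL with its host (IP address) replaced."""
    parts = urlsplit(url)
    # Only the network location changes; path and query stay untouched.
    return urlunsplit(parts._replace(netloc=new_host))
```

For example, a cached URL that starts with
http://216.239.33.102/search? can be pointed at
216.239.37.100, the address listed above for
www-va.google.com, while the checksum and all other
parameters remain as they were.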