Google is Omnipotent
The halo effect has graduated from inflating stock prices to making companies godlike. Thus, they can do anything – mere mortals can just speculate. The truth, however, is frequently mundane.
Taylor Buley, writing on the Velocity blog at Forbes, has an article with the provocative title “Google Isn’t Just Reading Your Links, It’s Now Running Your Code.” Mr. Buley goes on to explain that “for years it’s been unclear whether or not the Googlebot actually understood what it was looking at or whether it was merely doing ‘dumb’ searches for well-understood data structured like hyperlinks.” In other words, Google has built a Javascript interpreter!
The source for this headline comes directly from Google:
On Friday, a Google spokesperson confirmed to Forbes that Google does indeed go beyond mere "parsing" of JavaScript. "Google can parse and understand some JavaScript," said the spokesperson.
So it’s confirmed, then.
Mr. Buley spends most of his article explaining that building a Javascript parser is really fucking hard. In fact, a quote from one of his experts isolates the key problem – how long the code will run – and says that “The halting problem is undecidable.” There is no algorithm that can solve it. Well, OK, I suppose, but couldn’t you process a lot and cut it off at an arbitrary point? Sure you’d miss some stuff, but surely you’d get enough?
Actually, that’s what another expert says:
"It’s hard to analyze a program using another program," the person says. "Executing [JavaScript code] is pretty much that’s the only way they can do it."
Mr. Buley believes this is a great accomplishment, and quite unknown.
He’s right on one count.
In a previous post, I cited a paper, “Data Management Projects at Google,” and talked about Edward Chang. Well, the paper is actually about three projects, and one of those is “Indexing the Deep Web,” spearheaded by Jayant Madhavan. In that 2008 paper, Dr. Madhavan had this to say about Javascript:
While our surfacing approach has generated considerable traffic, there remains a large number of forms that continue to present a significant challenge to automatic analysis. For example, many forms invoke Javascript events in onselect and onsubmit tags that enable the execution of arbitrary Javascript code, a stumbling block to automatic analysis. Further, many forms involve inter-related inputs and accessing the sites involve correctly (and automatically) identifying their underlying dependencies. Addressing these and other such challenges efficiently on the scale of millions is part of our continuing effort to make the contents of the Deep Web more accessible to search engine users.
It would seem they solved this problem! (This is a big accomplishment). When did they solve it? Recently?
Well, sort of. The answer is in a 2009 paper called “Harnessing the Deep Web: Past, Present, and Future,” in which they say this:
We note that the canonical example of correlated inputs, namely, a pair of inputs that specify the make and model of cars (where the make restricts the possible models) is typically handled in a form by Javascript. Hence, by adding a Javascript emulator to the analysis of forms, one can identify such correlations easily.
So let’s back up.
What is Google doing? They’re accessing structured data hidden behind form submissions. Now, we say the information is “hidden” behind form submissions because you have to submit the form to get the data. One approach – the “dumb” approach – is to generate all possible result URLs and then crawl all of them.
But. Those clever folks at Google noticed this might be a problem:
For example, the search form on cars.com has 5 inputs and a Cartesian product will yield over 200 million URLs, even though cars.com has only 650,000 cars on sale.
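Just to see the arithmetic, here is a toy version of that “dumb” approach. The input names and value counts are invented (I have no idea what cars.com’s form actually looks like); the point is only that the Cartesian product multiplies while the inventory doesn’t.

```python
# Toy version of the "dumb" approach: enumerate every combination of form
# input values and turn each one into a result URL. All names and counts here
# are invented for illustration.
from itertools import product
from urllib.parse import urlencode

form_inputs = {
    "make":    [f"make{i}" for i in range(60)],
    "model":   [f"model{i}" for i in range(900)],
    "price":   [f"band{i}" for i in range(15)],
    "year":    [str(y) for y in range(1990, 2011)],
    "zipcode": [f"{z:05d}" for z in range(0, 100, 5)],
}

urls = (
    "http://example.com/search?" + urlencode(dict(zip(form_inputs, combo)))
    for combo in product(*form_inputs.values())
)

total = 1
for values in form_inputs.values():
    total *= len(values)

print(f"{total:,} candidate URLs")  # 340,200,000 URLs from five modest inputs
print(next(urls))                   # http://example.com/search?make=make0&model=model0&...
```

Crawl all of those against a site with 650,000 cars and the overwhelming majority of the fetches are wasted.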
The challenge, then, is to generate fewer URLs. Thus, they developed an algorithm with this property:
We have found that the number of URLs our algorithms generate is proportional to the size of the underlying database, rather than the number of possible queries.
How do they do this? Well, one big challenge is that (as noted above) the valid inputs for one field can depend on the inputs in another field. Google has taken to constructing databases of “interrelated data” (like manufacturer and car model) so they can automatically detect the data the form wants and limit their indexing accordingly.
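Here is a sketch of why that helps. The inventory is invented, and this is my illustration, not Google’s algorithm: once you know the make restricts the model, you only generate combinations the form would actually accept, so the URL count tracks the inventory rather than the full cross product.

```python
# Sketch: once the make -> model dependency is known, generate only the
# combinations the form would actually accept. The inventory below is invented.
from urllib.parse import urlencode

models_by_make = {
    "honda":  ["civic", "accord", "fit"],
    "toyota": ["corolla", "camry"],
    "ford":   ["focus", "fusion", "mustang"],
}

def generate_urls(base="http://example.com/search"):
    for make, models in models_by_make.items():
        for model in models:
            yield base + "?" + urlencode({"make": make, "model": model})

urls = list(generate_urls())
blind = len(models_by_make) * sum(len(m) for m in models_by_make.values())
print(len(urls), "URLs generated, versus", blind, "for the blind make x model product")
# 8 URLs generated, versus 24 for the blind make x model product
```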
But to detect when some fields on a form are interrelated, you… need to have more than the HTML. In fact, almost all input-dependent forms rely on Javascript to change the values around after a selection.
Well, the clever researchers at Google knew they needed to determine which fields in a form were interrelated. They also figured they only needed to determine this once per form: once they knew which fields were related, their existing algorithms could generate the URLs automatically.
As you can imagine, if you only need to do it once (for each form), then it becomes practical to emulate. You emulate one form, and get 650,000 URLs to index with solid data. It’s cheap – so cheap, it’s almost worth getting a human to do it. (Except no Googler would think of that!)
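To show what “emulate one form” might look like, here is a sketch of the detection step. The onchange_make function below is a Python stand-in for the site’s Javascript handler (exactly the part you can’t fake in real life, as the next paragraph gets to), and the inventory is invented; the crawler’s only job is to flip one input and watch whether another input’s options move.

```python
# Sketch of detecting interrelated fields by emulation: change one input,
# re-run the form's handler, and see whether another input's options change.
# "onchange_make" is a Python stand-in for the site's Javascript; the
# inventory is invented.

MODELS = {
    "honda":  ["civic", "accord"],
    "toyota": ["corolla", "camry", "prius"],
}

def onchange_make(form_state):
    """Stand-in for the form's Javascript: picking a make rewrites the model list."""
    form_state["model_options"] = MODELS[form_state["make"]]

def model_depends_on_make():
    state = {"make": "honda", "model_options": []}
    onchange_make(state)
    before = list(state["model_options"])

    state["make"] = "toyota"     # emulate the user picking a different make
    onchange_make(state)
    after = state["model_options"]

    return before != after       # the options moved, so the fields are interrelated

print(model_depends_on_make())   # True; determine this once, then generate URLs per make
```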
But – and here’s the thing – to emulate the behavior of a form driven by Javascript you have to have the Javascript files. You need to download them, and then execute them.
In other words, the second expert Mr. Buley consulted is spot-on. Google is executing the Javascript code to find out something very specific (which fields on a form are interrelated, and presumably anything done in an onsubmit event that would alter the indexing URL).
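And “download the Javascript, then execute it” is routine tooling these days. As a rough sketch (using js2py, a third-party Javascript interpreter for Python that I’m picking purely for illustration; Google obviously runs its own infrastructure, and the script below is a pretend download), it can be as little as this:

```python
# Rough illustration: execute a (pretend) downloaded script with js2py, a
# third-party Javascript interpreter for Python (pip install js2py). In
# practice the source would come from fetching the script file the form
# page links to.
import js2py

js_source = """
var modelsByMake = {honda: ['civic', 'accord'], toyota: ['corolla', 'camry']};
function modelsFor(make) { return modelsByMake[make].join(','); }
"""

ctx = js2py.EvalJs()            # a fresh Javascript execution context
ctx.execute(js_source)          # run the downloaded script inside it
print(ctx.modelsFor("toyota"))  # corolla,camry
```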
This is not news. It’s publicly available information – found very easily through Google Scholar, and even more easily if you’re following Google’s main researchers – and there is no reason to resort to speculation to answer the question. They’ve been accessing the Deep Web – the web hidden behind forms – for years; Javascript is an obvious stumbling block; and Google researchers have published papers on it (frequently presented at conferences!).
It is galling to see a reporter say that something is “unclear” when it would be difficult to make it any clearer. In 2008, Jayant Madhavan wrote a post on the Google Webmaster Central blog about crawling through forms to get to the Deep Web – this stuff isn’t restricted to academic papers easily accessible through Google Scholar and surfaced in regular Google results. No, it’s even in the blogosphere.
I think I’ve gone a bit too far, so I’ll stop now.