by David Tittle
For those of you who haven’t heard, Matt Cutts has stated that Google is now capable of indexing embedded Facebook comments. This was later confirmed by labnol.org’s Amit Agarwal. How do they do this, and why is it important?
How they do it
Most web crawlers work by simply downloading the HTML of a web page and pulling out the text. Makes sense, right? There's no need to parse images and complex JavaScript when the most relevant content is probably just text on the page. There is one problem, though. Some pages use iframes (basically, a window onto another web site), which show up to a crawler as links rather than content. On top of that, some of the text on a page might be generated by JavaScript. Both make life harder for a web crawler.
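To make that concrete, here's a quick sketch (Python, using requests and BeautifulSoup; the URL is just a placeholder) of what an HTML-only crawler actually sees. Anything delivered through an iframe, like an embedded Facebook comments box, shows up only as a URL, and anything built by JavaScript never shows up at all:

```python
# Sketch of an HTML-only crawler's view of a page.
# Assumes requests and BeautifulSoup are installed; the URL is a placeholder.
import requests
from bs4 import BeautifulSoup

url = "http://example.com/article-with-facebook-comments"  # hypothetical page
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# The "content" a simple crawler indexes: just the text present in the raw HTML.
visible_text = soup.get_text(separator=" ", strip=True)

# Iframes (e.g. an embedded Facebook comments box) appear only as URLs,
# never as the text they eventually display.
iframe_sources = [tag.get("src") for tag in soup.find_all("iframe")]

print(len(visible_text), "characters of indexable text")
print("iframes seen only as links:", iframe_sources)
```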
You might say, “But Google is full of smart people, right? I’m sure this would be a snap for them!” You’re right, but there’s a catch. The more complicated you make your web crawler, the longer it takes to run. For a small collection of web pages that’s no big deal, but extend it to the entire Internet and the overhead becomes very significant. Consider the following (math alert!):
Using a simple scraper I wrote, it takes about 0.75 seconds to load a page and scrape its text (about 80 pages per minute). That’s just the HTML.
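A scraper of roughly this shape is all that measurement takes (a sketch, not my exact script; the URL is a placeholder and your timings will vary with the page and the connection):

```python
# Rough shape of an HTML-only scraper with timing. Numbers will vary.
import time
import requests
from bs4 import BeautifulSoup

def scrape_text(url):
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

start = time.time()
text = scrape_text("http://example.com/some-article")  # placeholder URL
elapsed = time.time() - start

print(f"{elapsed:.2f} s to load and scrape {len(text)} characters")
print(f"roughly {60 / elapsed:.0f} pages per minute at this rate")
```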
Now, let’s consider parsing the text AND the JavaScript. Loading the page above, iframes included, totals only about 61.1 KB of data. Add the JavaScript and that’s another 629.0 KB, spread across 37 files that have to be downloaded and parsed.
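You can get figures like these by tallying a page's resources, along these lines (again a sketch with a placeholder URL; it only counts external script files):

```python
# Tally how much extra data a page's external scripts add. Sketch; placeholder URL.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "http://example.com/some-article"  # placeholder
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

script_urls = [urljoin(url, tag["src"]) for tag in soup.find_all("script", src=True)]
script_bytes = sum(len(requests.get(u, timeout=10).content) for u in script_urls)

print(f"HTML: {len(html) / 1024:.1f} KB")
print(f"JavaScript: {script_bytes / 1024:.1f} KB across {len(script_urls)} files")
```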
JavaScript is not the easiest language to parse, algorithmically. Without going into the details of grammars, CYK algorithms and such, let’s just say that the more code you have, the more time is needed to analyze it. For the example above, it takes a web browser about 2.1 seconds to parse and execute the page’s JavaScript. Total time comes to about 2.85 seconds for HTML plus JavaScript (roughly 21 pages per minute).
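Google presumably has far more specialized machinery for this, but you can feel the cost with an ordinary headless browser, which downloads, parses, and executes the scripts before handing back the rendered page (a sketch; assumes Selenium with a local ChromeDriver, and the URL is a placeholder):

```python
# Sketch: time a full render (download + parse + execute JS) with a headless browser.
# Assumes Selenium and a matching ChromeDriver are installed.
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
start = time.time()
driver.get("http://example.com/some-article")  # placeholder URL
rendered_html = driver.page_source              # the DOM after scripts have run
elapsed = time.time() - start
driver.quit()

print(f"{elapsed:.2f} s to fully render the page")
print(f"roughly {60 / elapsed:.0f} pages per minute at this rate")
```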
That means my simple scraper could get through about 115,200 of these pages in a day, while the more advanced parser could only manage about 30,300. The simple parser handles nearly four times as many pages, but the complex parser can, potentially, find more content.
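The per-day figures are just the seconds in a day divided by the seconds per page:

```python
# Back-of-the-envelope throughput for the two crawler styles.
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400

simple_pages = SECONDS_PER_DAY / 0.75   # HTML only:         ~115,200 pages/day
complex_pages = SECONDS_PER_DAY / 2.85  # HTML + JavaScript:  ~30,300 pages/day

print(int(simple_pages), int(complex_pages), round(simple_pages / complex_pages, 1))
# 115200 30315 3.8
```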
Why is it important?
This is good news for Facebook. Their mission of becoming the one, true web platform can only be helped by having their content displayed in search results. This means more people are likely to start incorporating Facebook into their sites. Facebook really should send Google a thank you letter for this one.
It also means that more websites will have their content show up in search results. Finally, there’s hope for all those companies that paid way too much for a designer to come in and build a site that loads all of their products via AJAX.
That being said…
We’ve seen how time-consuming it can be to parse JavaScript. I would imagine that Google uses both types of crawlers, turning to the advanced one only when necessary. If that’s the case, then the simpler your page, the easier it is to have it indexed.
To Recap:
- Google is now parsing AJAX requests, including Facebook comments embedded in web pages.
- It is more time-consuming to find JavaScript-generated content.
- Pages that are easier to parse are easier to index.
Of course, this is still speculation and needs to be tested. Perhaps someone in the community has results or thoughts on this?