Steven Levy on how Google’s search algorithm has changed over the years.
Take, for instance, the way Google’s engine learns which words are synonyms. “We discovered a nifty thing very early on,” Singhal says. “People change words in their queries. So someone would say, ‘pictures of dogs,’ and then they’d say, ‘pictures of puppies.’ So that told us that maybe ‘dogs’ and ‘puppies’ were interchangeable. We also learned that when you boil water, it’s hot water. We were relearning semantics from humans, and that was a great advance.”
But there were obstacles. Google’s synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein’s theories about how words are defined by context. As Google crawled and archived billions of documents and Web pages, it analyzed what words were close to each other. “Hot dog” would be found in searches that also contained “bread” and “mustard” and “baseball games” — not poached pooches. That helped the algorithm understand what “hot dog” — and millions of other terms — meant. “Today, if you type ‘Gandhi bio,’ we know that bio means biography,” Singhal says. “And if you type ‘bio warfare,’ it means biological.”
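To make the context idea concrete, here is a minimal sketch of counting which words keep company with a phrase. The tiny corpus and the `context_counts` helper are invented for illustration; nothing here reflects how Google actually implemented it.

```python
from collections import Counter

# Toy corpus, made up for this example. A real system would see billions of pages.
DOCS = [
    "hot dog with mustard on bread at the baseball game",
    "grilled hot dog and mustard at the baseball stadium",
    "the puppy drank water while the dog slept",
    "boiling water is hot so keep the puppy away",
]

def context_counts(phrase, docs):
    """Count the words that share a document with `phrase` (simple substring match)."""
    counts = Counter()
    phrase_words = set(phrase.split())
    for doc in docs:
        if phrase in doc:
            # Tally every other word in the document as context for the phrase.
            counts.update(w for w in doc.split() if w not in phrase_words)
    return counts

hot_dog = context_counts("hot dog", DOCS)
puppy = context_counts("puppy", DOCS)

# Words that appear around one term but never around the other.
only_hot_dog = set(hot_dog) - set(puppy)
```

In this toy data, "mustard" and "baseball" show up around "hot dog" and never around "puppy", which is the kind of signal that would keep a hot dog from being mistaken for a boiling puppy.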
Or in simpler terms, here’s a snippet of a conversation that Google might have with itself:
A rock is a rock. It’s also a stone, and it could be a boulder. Spell it “rokc” and it’s still a rock. But put “little” in front of it and it’s the capital of Arkansas. Which is not an ark. Unless Noah is around.
By mapping, among other variables, how many people click on a link, and how long they linger there, Google assigns it a value, known as PageRank, after Larry Page.
That’s from Ken Auletta’s article about Google in the New Yorker last week. Didn’t know that PageRank was named after Larry Page. (via @dens)
In addition to its utility in organizing the World Wide Web, researchers say that Google’s PageRank algorithm is useful in studying food webs, “the complex networks of who eats whom in an ecosystem”.
Dr Allesina, of the University of Chicago’s Department of Ecology and Evolution, told BBC News: “First of all we had to reverse the definition of the algorithm. In PageRank, a web page is important if important pages point to it. In our approach a species is important if it points to important species.”
The researchers compared the performance of PageRank and found it comparable to that of much more complex computational biology algorithms.
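The "reverse the definition" step is easier to see in code. Below is a rough sketch of textbook PageRank by power iteration, run once on a graph and once with every edge flipped. The four-species food web, the edge direction (each species points to the species that eat it), and the damping value are all assumptions made for illustration; this is not the researchers' actual method, which the article does not spell out.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict mapping each node to the nodes it points to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node starts each round with the small "random jump" share.
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = links.get(n, [])
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new[t] += damping * rank[n] / len(nodes)
        rank = new
    return rank

def reverse(links):
    """Flip every edge, so 'is pointed to by' becomes 'points to'."""
    flipped = {n: [] for n in links}
    for src, targets in links.items():
        for t in targets:
            flipped.setdefault(t, []).append(src)
    return flipped

# Hypothetical food web: each species points to the species that eat it.
food_web = {"grass": ["rabbit", "mouse"], "rabbit": ["fox"], "mouse": ["fox"]}

web_style = pagerank(food_web)            # important if important nodes point to you
eco_style = pagerank(reverse(food_web))   # important if you point to important nodes

top_web_style = max(web_style, key=web_style.get)
top_eco_style = max(eco_style, key=eco_style.get)
```

On this toy web the ordinary definition ranks the fox first, while the reversed one ranks the grass first, since everything else ultimately depends on it.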
There are indications that Google is changing their PageRank algorithm, possibly to penalize sites running paid links or too many cross-promotional links across blog networks. Affected sites include Engadget, Forbes, and Washington Post. Even Boing Boing, which I think had been at 9, is down to 7. You can check a site’s PR here.
Depending on the site, 30-40% of a site’s total traffic can come from search engines, much of that from Google. It will be interesting to see how much of an impact the PR drop will have on their traffic and revenue. (thx, my moon my mann)
Update: Just got the following from the editor of a site that got its PR bumped down. He says:
Two weeks ago I lost 80% of my search traffic due to, I believe, using ads from Text-Link-Ads, which does not permit the “nofollow” attribute on link ads. That meant an overall drop of more than 44% of my total traffic. It also meant a 65%-95% drop in Google AdSense earnings per day and a loss of PageRank from 7 to 6.
He has removed the text links from his site and is negotiating with Google for reinstatement but estimates a loss in revenue of $10,000 for the year due to this change. And this is for a relatively small site…the Engadget folks must be freaking out.
A couple of days ago, I pointed to a patent filed by the Flickr folks for the concept of interestingness. I should have poked around a bit more because there’s a related patent filed by the Flickr folks and Josh Schachter of del.icio.us concerning “media object metadata association and ranking”. I’m not a big fan of software patents, but even so, I can’t see the new, useful, nonobvious invention here. I also find it odd that these patents reference exactly zero prior inventions on which they are based…compare with Larry Page’s patent for PageRank.
Robert Cringely: Google may have peaked (“What if search and PageRank and AdSense are Google’s corporate apex?”) and Microsoft may have more to worry about from Apple if they start distributing older versions of OSX (the Intel version) for free on iPods.
In reaction to some ads of questionable value being placed on some of O’Reilly’s sites (response from Tim O’Reilly), Greg Yardley has written a thoughtful piece on selling PageRank called I am not responsible for making Google better:
Google, Yahoo, Microsoft and the other big search engine companies aren’t public utilities - they’re money-making, for-profit enterprises. It’s time to stop thinking of search engines as a common resource to be nurtured, and start thinking of them as just another business to compete with or cooperate with as best suits your individual needs.
I love the idea that, after more than 10 years of serious corporate interest in the Web, it’s still up to all of us and our individual decisions. The search engines in particular are based on our collective action; they watch and record the trails left as we scatter the Web with our thoughts, commerce, conversations, and connections.
Me? I tend to think I need Google to be as good a search engine as it can be and if I can help in some small way, I’m going to. As corny as it sounds, I tend to think of the sites I frequent as my neighborhood. If the barista at Starbucks is sick for a day, I’m not going to jump behind the counter and start making lattes, but if there’s a bit of litter on the stoop of the restaurant on the corner, I might stop to pick it up. Or if I see some punk slipping a candy bar into his pocket at the deli, I may alert the owner because, well, why should I be paying for that guy’s free candy bar every time I stop in for a soda?
Sure those small actions help those particular businesses, but they also benefit the neighborhood as a whole and, more importantly, the neighborhood residents. If I were the owner of a business like O’Reilly Media, I’d be concerned about making Google or Yahoo less useful because that would make it harder for my employees and customers to find what they’re looking for (including, perhaps, O’Reilly products and services). As Greg said, the Web is still largely what we make of it, so why not make it a good Web?
Long thoughtful response from Tim O’Reilly about the questionable advertising on some of O’Reilly Media’s sites. Is selling your site’s PageRank to someone more or less legitimate than selling them your customers’ attention? (via waxy)
I missed this April article in New Scientist about Google’s plans to rank news stories according to quality and credibility of the sources:
Now Google, whose name has become synonymous with internet searching, plans to build a database that will compare the track record and credibility of all news sources around the world, and adjust the ranking of any search results accordingly.
The database will be built by continually monitoring the number of stories from all news sources, along with average story length, number with bylines, and number of the bureaux cited, along with how long they have been in business. Google’s database will also keep track of the number of staff a news source employs, the volume of internet traffic to its website and the number of countries accessing the site.
Google will take all these parameters, weight them according to formulae it is constructing, and distil them down to create a single value. This number will then be used to rank the results of any news search.
The second paragraph of the story mentions that this system has been patented by Google, but I don’t see how it’s much different than what PageRank does or what Metacritic has been doing with film, game, and book reviews:
This overall score, or METASCORE, is a weighted average of the individual critic scores. Why a weighted average? When selecting our source publications, we noticed that some critics consistently write better (more detailed, more insightful, more articulate) reviews than others. In addition, some critics and/or publications typically have more prestige and weight in the industry than others. To reflect these factors, we have assigned weights to each publication (and, in the case of film, to individual critics as well), thus making some publications count more in the METASCORE calculations than others.
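Both the single news-source value and the METASCORE boil down to the same move: weight each input, then collapse everything to one number. Here is a minimal sketch of a weighted average. The critic names, scores, and weights are made up, and neither Google's formulae nor Metacritic's real weights are public in the quoted text, so treat every number as a placeholder.

```python
def weighted_average(scores, weights, default_weight=1.0):
    """Combine per-source scores into one value, counting some sources more than others."""
    total = 0.0
    total_weight = 0.0
    for source, score in scores.items():
        # Sources without an explicit weight fall back to the default.
        w = weights.get(source, default_weight)
        total += w * score
        total_weight += w
    if total_weight == 0:
        raise ValueError("at least one source needs a nonzero weight")
    return total / total_weight

# Hypothetical critics and weights, for illustration only.
scores = {"Critic A": 90, "Critic B": 60, "Critic C": 30}
weights = {"Critic A": 2.0, "Critic B": 1.0, "Critic C": 1.0}

weighted = weighted_average(scores, weights)   # (2*90 + 60 + 30) / 4
unweighted = weighted_average(scores, {})      # every source counts equally
```

With these numbers the plain average is 60.0 and the weighted one is 67.5, because the doubled weight pulls the result toward Critic A's score.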
I wonder if these systems will eventually let their users tweak the credibility algorithms to their liking. For instance, it won’t take long for conservatives to start complaining about the liberal bias of Google News. In the case of Metacritic, I’d like them to ignore Anthony Lane’s rating when he writes about summer blockbusters and put greater emphasis on whatever Ebert has to say. In the meantime, I’m readying my patent applications for RecipeRank, PhotoRank, ModernFurnitureRank, SoftDrinkRank, and, oooh, PatentRank. I’m sure they’re brilliantly unique enough to be recognized by the US Patent Office as new inventions.