Repeat after me: inbound links do not  AUG 12 2004

Repeat after me: inbound links do not indicate either readership or influence. Plus, Technorati's top 100 data remains dirty.

There are 24 reader comments

Matt13 12 200410:13PM

"Technorati's top 100 data remains dirty"

A quick glance and I see plastic.com, which is only up there because it's a default link in tens of thousands of default blogspot templates (the pre-CSS versions, it was a dark template many used). Are there others? As much as I love Dave Pell and Davenetics, I don't recall seeing Davenetics linked in too many blogrolls. Where'd that one come from?

Am I missing other obvious problems in the top 100? Drudge should be news instead of a blog?

Ramanan17 12 200410:17PM

Agreed. Most LiveJournal, Xanga, AsianAvenue, etc., etc., users would be quite influential based on a measurement such as this. Many online web communities promote linking within the community; buddy lists and what have you. This sort of link-incest mucks up a metric such as inbound links.

Matt17 12 200410:17PM

Ah, I see, Davenetics is in that default template as well. Here's a good example of a blog using the template.

dave pell19 12 200411:19PM

Hey, what can I say? I'm huge among template users from the late nineties, specifically those from Brazil. What I can't believe is that someone would actually believe that my third most popular site would somehow be LESS influential than the NY Times and CNN.

Just as a brief example, search for "my cat mister winters" on google. Guess who comes up numero uno?

And CNN and NY Times? Zilch.

Come on. I'm a player.

jkottke09 13 200412:09AM

Here are some other problems with the Top 100 (ordering may have changed by the time you read this):

#2: ScottWater writes a .NET based blogging tool...the "powered by" banner links back to his site (example).

#5: MSN Groups?

#7: Photo Matt is primarily responsible for the open source Wordpress and for some weird reason (payment for all the free work he does?), he puts a link to his Web site on the default template.

#10: Balmasque. Not quite sure what the deal is with this one, but it seems to have tons of incoming links from blog*spot sites that contain no content and have no incoming links themselves. Could be legit, I guess. Template designer?

#11: Mike Little's Journalized, see #7.

#12: geek ramblings, see #7.

#16: interney.com provides a JavaScript-based stats tracker that you can put on your site.

#19: Penny Arcade! Weblog? Folks are probably there for the comics.

#25: Wanker.

#26: Erin at Suicide Girls, alterna-porn, but I think it counts as a weblog.

#27: Bryan Bell designs blog templates with a link back to his site on them.

Ok, stopping there. I realize it's hard, the list needs some gardening (which is probably the last thing on Technorati's to-do list), and scraping weblogs for links is a messy business, but when people are using this data for research or journalistic purposes, I have a bit of a problem with it. (Hypocrite! But this is only a weblog, so it's ok, right?)

BTW, the graph linked above appears in this month's Wired** accompanying a Clay Shirky article on mapping the different kinds of blogs on the power law curve. Wired being Wired, the graph looks pretty but by my reckoning at least four and possibly five of the data points don't belong there: Plastic, Davenetics, Penny Arcade, interney.com, and maybe Balmasque. Don't pretty graphs deserve fact checkers? And never mind that, once again, links != influence. Traffic stats from Alexa would probably be more illuminating. Or PageRank-weighted link statistics from Google, kinda like what Daypop does with its Ranked By Daypop Score list.

**I believe it's the exact same data points, but I'll check (and correct) tomorrow when I get to work and have access to the magazine.

jkottke39 13 200412:39AM

I'm also highly skeptical about the accuracy of Technorati's claim of tracking 3,500,000 blogs. I'm not saying they're being deliberately misleading (again, scraping blogs is a messy business) and I don't quite know how to go about proving/disproving it, but it just doesn't seem right. Google currently indexes 4.2 billion pages. If a typical weblog has just 10-15 pages and Google indexes all the weblogs that Technorati does**, weblogs comprise ~1% of all pages in Google's index. And that 10-15 pages figure may be low...Google indexes at least 7,000 pages from kottke.org.

** Certainly not a given, but you've got to think that while Technorati gets weblogs that Google misses, the reverse is also true.
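
The back-of-the-envelope estimate above can be checked in a few lines. This is only a sketch using the figures quoted in the comment (3,500,000 tracked blogs, 4.2 billion indexed pages, 10-15 pages per blog); the helper name `blog_share_of_index` is made up for illustration and nothing here is measured data.

```python
# Figures quoted in the comment above; these are claims, not measurements.
TRACKED_BLOGS = 3_500_000
GOOGLE_INDEX_PAGES = 4_200_000_000


def blog_share_of_index(pages_per_blog: int) -> float:
    """Fraction of the index that weblog pages would occupy (hypothetical helper)."""
    return TRACKED_BLOGS * pages_per_blog / GOOGLE_INDEX_PAGES


for pages in (10, 15):
    share = blog_share_of_index(pages)
    print(f"{pages} pages per blog -> {share:.2%} of the index")
```

At 10 pages per blog the share is a bit under 1%, and at 15 pages it is a bit over, which is where the "~1%" figure comes from.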

Matt43 13 2004 1:43AM

The real question is -- how can this make me money?

Michael S.14 13 2004 6:14AM

Is this all links, ever? Technorati does have this information, but I've personally linked to Slate 233 times, and it's just not possible that I account for 5% of Slate's links. It also seems unlikely that less than 0.5% of weblogs have ever linked to the New York Times.

jkottke23 13 200410:23AM

"I believe it's the exact same data points, but I'll check (and correct) tomorrow when I get to work and have access to the magazine."

Looked at the graph in Wired and it's the same data.

Michael48 13 200410:48AM

If one is going to look for weblog influence, one has to look at links within the text of a page and discount or even ignore sidebar links. This issue came up during all the power law stuff a year (?) or so ago. Trouble is, at this time I have seen no indication that anyone is doing that kind of discrimination or, if they are, how. The evidence, as you point out, is pretty clear that most sites count all links as equal.

Of course discriminating between sidebar links and editorial links would be a very difficult task, but I think it's possible to develop some methodology that might work. Members at Technorati, for example, could be asked to make a specific CSS class for sidebar/linkbar links, and then Technorati's algorithms could be adjusted to weight those differently. They wouldn't get everyone, but if they got enough people to do it, the information gained in the aggregate could be extrapolated to a larger population. Just a thought.
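
As a rough illustration of the CSS-class idea, here is a minimal sketch using Python's standard `html.parser`. The class name `sidebar-link` and both weights are assumptions invented for this example; no such convention exists, and nothing here reflects how Technorati actually counts links.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlparse

SIDEBAR_CLASS = "sidebar-link"  # hypothetical class name, not a real convention
SIDEBAR_WEIGHT = 0.1            # arbitrary illustrative weights
EDITORIAL_WEIGHT = 1.0


class WeightedLinkCounter(HTMLParser):
    """Adds up a weighted score per linked host for one page."""

    def __init__(self):
        super().__init__()
        self.scores = defaultdict(float)

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        host = urlparse(href).netloc
        if not host:
            return  # relative link, so it points back at the same site
        classes = (attr.get("class") or "").split()
        weight = SIDEBAR_WEIGHT if SIDEBAR_CLASS in classes else EDITORIAL_WEIGHT
        self.scores[host] += weight


def score_page(html: str) -> dict:
    """Return {host: weighted link score} for a single HTML page."""
    parser = WeightedLinkCounter()
    parser.feed(html)
    return dict(parser.scores)


page = """
<div id="sidebar">
  <a class="sidebar-link" href="http://plastic.com/">Plastic</a>
</div>
<p>Worth reading: <a href="http://example.org/story">this story</a>.</p>
"""
print(score_page(page))
```

A link tagged with the sidebar class counts for a tenth of an untagged link here. The scheme only works for authors who opt in, which is the limitation the comment already acknowledges.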

Gene01 13 200411:01AM

I always thought this was a better take on influence. Also, Google should buy Technorati. Not only would they do a better job of spidering, they would prolly be able to develop a useful algorithm to estimate influence. In the meantime, I'm going to hire Wil Wheaton to pitch my new line of terry-cloth sweats.

Beerzie Yoink04 13 200411:04AM

Drudge is a blog? Hm.

tim53 13 200411:53AM

rss feeds don't contain sidebar links. and i'd hazard a guess that smaller sites (your inbounders) more often provide full text feeds. so perhaps a combination approach, with full text rss weighted more heavily if present.
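
A small sketch of the feed-based approach, assuming an RSS 2.0 feed whose `<description>` elements carry the full post HTML (many feeds carry only excerpts, or put the body elsewhere). The function name `links_in_feed` is made up for this example, and the heavier weighting for full-text feeds is left out.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_in_feed(rss_xml: str) -> list:
    """Return links found in the post bodies of an RSS 2.0 feed (hypothetical helper)."""
    root = ET.fromstring(rss_xml)
    hrefs = []
    for item in root.iter("item"):
        body = item.findtext("description") or ""
        collector = HrefCollector()
        collector.feed(body)
        hrefs.extend(collector.hrefs)
    return hrefs


feed = """<rss version="2.0"><channel>
<title>Example weblog</title>
<item><title>Post one</title>
<description>&lt;p&gt;See &lt;a href="http://example.org/story"&gt;this story&lt;/a&gt;.&lt;/p&gt;</description>
</item>
<item><title>Post two</title><description>No links here.</description></item>
</channel></rss>"""
print(links_in_feed(feed))
```

Because the feed contains only posts, blogroll and template links never show up in the result.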

jim winstead02 13 200412:02PM

i wouldn't be so quick to call penny arcade a non-weblog. the daily postings from the authors of the comic are about as weblog-ish as you can get.

Christine07 13 200412:07PM

Photo Matt actually puts links to the 4-5 core developers of WordPress, not just himself. So using that reasoning, they should all be in the top 100 - and they are not. I think his site has actually reached that level of readership. (Then again, I've been reading it for years, so maybe I'm biased?)

jkottke21 13 200412:21PM

"i wouldn't be so quick to call penny arcade a non-weblog. the daily postings from the authors of the comic are about as weblog-ish as you can get."

The front page is definitely a weblog, but most of the inbound links are to the comics, not the weblog.

"Photo Matt actually puts links to the 4-5 core developers of WordPress, not just himself. So using that reasoning, they should all be in the top 100 - and they are not. I think his site has actually reached that level of readership. (Then again, I've been reading it for years, so maybe I'm biased?)"

They are all in the top 100. Here's an example of a site that uses the default template. Three of the four developers linked there I listed in the post above and the fourth one (Alex King) is at #13 (I forgot to include that one). So Matt gets most of his links from the default Wordpress template. Actually, you can kinda tell which of the top 100 are legit by comparing the # of incoming blogs with the # of incoming links for each blog...if they are almost the same, the numbers for that site are artificially inflated.
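
The blogs-versus-links comparison can be written as a one-line check. This is a sketch only: the function name, the 5% tolerance, and the sample counts are all made up for illustration and are not Technorati figures.

```python
def looks_template_inflated(inbound_blogs: int, inbound_links: int,
                            tolerance: float = 0.05) -> bool:
    """True when a site gets roughly one link per linking blog.

    One link per blog is what a default-template or blogroll link produces;
    a site people actually write about tends to collect several links from
    each blog that links to it. The tolerance is an arbitrary assumption.
    """
    if inbound_blogs <= 0:
        return False
    links_per_blog = inbound_links / inbound_blogs
    return links_per_blog <= 1.0 + tolerance


# Made-up counts, purely for illustration.
print(looks_template_inflated(inbound_blogs=9000, inbound_links=9100))
print(looks_template_inflated(inbound_blogs=4000, inbound_links=7600))
```

The first call describes a site with about one link per blog and is flagged; the second, at nearly two links per blog, is not.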

Jordan14 13 2004 1:14PM

I'm #84 on the Top 100 because my site is one of the links in the default b2evolution template. I feel a little guilty about it. A little. If they used a PageRank-like scheme instead, this would not be the case, since almost all of those b2evo sites are just half-baked blogs with one or two "Testing!" entries.
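
To see why a PageRank-like scheme discounts links from blogs that nobody links to, here is a simplified textbook power iteration on a toy graph. It is not Google's actual algorithm, and every page name below is invented for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page with no outbound links: spread its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Toy graph: five throwaway test blogs link to a template author's site,
# while two blogs that are themselves linked to point at a site people read.
toy_links = {
    "test1": ["template_site"],
    "test2": ["template_site"],
    "test3": ["template_site"],
    "test4": ["template_site"],
    "test5": ["template_site"],
    "hub_a": ["read_site", "hub_b"],
    "hub_b": ["read_site", "hub_a"],
    "read_site": ["hub_a", "hub_b"],
}

ranks = pagerank(toy_links)
for name in ("template_site", "read_site"):
    print(name, round(ranks[name], 3))
```

In this toy graph the template site has five inbound links against two for the other site, yet it ends up with the lower score, because its linkers have almost no rank of their own to pass along. With enough throwaway blogs the template site would still come out ahead, so this softens the problem rather than removing it.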

Matt16 13 2004 1:16PM

I suppose one could think of Drudge as a sort of news-centric "Remaindered Links", though they are clearly not remaindered, but the focus.

Craig C.14 13 2004 2:14PM

I'm just endlessly amused that Ensign Crusher is more popular than President Bush.

Peter Cooper34 13 2004 5:34PM

It's a bit old school, but other than RSS, your crawler could always look at links, see if there are lots more in proximity on new lines (or as list items), and ignore those. Very few blogs post long lists of links in posts, particularly on new lines or in lists, whereas this is how nearly all blogrolls are done. With the popularity of RSS now, however, this shouldn't be an issue, although a couple of years ago (before MT was widespread), it would have worked.
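
One way to sketch the list-item half of that heuristic with Python's standard `html.parser`: links inside a `<ul>` or `<ol>` that holds many links are treated as a blogroll and dropped. The threshold of five and all names below are assumptions for illustration, and the "links on consecutive new lines" case is not handled.

```python
from html.parser import HTMLParser

BLOGROLL_THRESHOLD = 5  # arbitrary cutoff, chosen only for this example


class EditorialLinkFilter(HTMLParser):
    """Separates links in long HTML lists from all other links."""

    def __init__(self, threshold=BLOGROLL_THRESHOLD):
        super().__init__()
        self.threshold = threshold
        self.open_lists = []  # one list of hrefs per currently open <ul>/<ol>
        self.kept = []
        self.dropped = []

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.open_lists.append([])
        elif tag == "a":
            href = dict(attrs).get("href")
            if not href:
                return
            if self.open_lists:
                self.open_lists[-1].append(href)
            else:
                self.kept.append(href)

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.open_lists:
            self._settle(self.open_lists.pop())

    def _settle(self, hrefs):
        # A list with many links looks like a blogroll; a short one is kept.
        target = self.dropped if len(hrefs) >= self.threshold else self.kept
        target.extend(hrefs)

    def finish(self):
        # Apply the same rule to any lists the page never closed.
        while self.open_lists:
            self._settle(self.open_lists.pop())


def split_links(html: str):
    """Return (kept, dropped) link lists for one page (hypothetical helper)."""
    parser = EditorialLinkFilter()
    parser.feed(html)
    parser.finish()
    return parser.kept, parser.dropped


blogroll = "".join(
    f'<li><a href="http://blog{i}.example/">Blog {i}</a></li>' for i in range(5)
)
page = (
    '<p>Read <a href="http://example.org/post">this</a>.</p>'
    f"<ul>{blogroll}</ul>"
)
kept, dropped = split_links(page)
print(kept)
print(len(dropped))
```

A post that genuinely contains a long list of links would be misclassified, which is the trade-off the comment accepts when it says very few blogs do that.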

Steven Marshall21 16 200411:21PM

It's all just a popularity contest anyway

Ryan C.34 06 2004 3:34PM

Craig C.: That's because his speech and writing are better than The Pres. ;)

Randy Peterman29 06 2004 4:29PM

I write a Statistics plugin for WordPress and if there's one thing I've learned in all of the coding: trends change. When Google is old and crusty (it will be some day, unless they re-invent themselves) we'll all be glad we use search engine X (if there are still search engines). I link to PhotoMatt.net in my blog regularly because he's often got good content. Just like I link to kottke.org when I find something in remainders useful or fun.

P Scott11 07 200411:11AM

You claim that Drudge is not a blog but news. Could a newsblog just be links to other sites without commentary? Why not?

Besides, Drudge is in many blogs' blogrolls, so most do think of Drudge as a blog.

This thread is closed to new comments. Thanks to everyone who responded.
