Technorati is now tracking 1,000,000 weblogs.
Why hasn't anyone ever really looked at how Technorati determines what is a blog? I don't believe the Technorati number myself; I think it's greatly inflated.
Why? Sometimes in Technorati results I see every category of a Radio weblog counted as a separate blog. I dug through about 30 URLs until I found an example. Check out Merlin's cosmos. It says that 64 blogs are pointing 71 links to him, so there should be very few repeat listings, right?
Look down for the person running a Radio blog that is pointing at Merlin's site, called monkinetic weblog. The guy must have a sidebar link to kungfugrippe and, by the looks of it, he has 13 categories on his Radio blog, which show up as 13 blogs. In this entire list there should be only 7 doubly listed blogs (Jish is one), but here we have 13 links from a single blog, and the numbers don't add up.
This isn't the fault of Radio's design; it's how Sifry coded his algorithm to tell a separate blog from other pages of the same blog. For some reason it's not quite right for Radio blogs hosted on their own domains (personally, I've never seen the problem on the UserLand-hosted Radio sites).
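To make the miscount concrete, here is a minimal sketch of one way category pages could be collapsed into a single blog before counting. It is not how Technorati works; the `blog_key` helper, the host-only grouping rule, and the example URLs are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def blog_key(url):
    """Hypothetical rule: treat every page on one host as the same blog."""
    host = urlsplit(url).hostname or ""
    # Drop a leading "www." so both spellings of a host map to one key.
    return host[4:] if host.startswith("www.") else host


def count_linking_blogs(source_urls):
    """Count distinct blogs among the pages that link to a site."""
    return len({blog_key(u) for u in source_urls})


# Made-up example: one Radio blog with two category pages, plus one other blog.
sources = [
    "http://radio.example.com/",
    "http://radio.example.com/categories/python/",
    "http://radio.example.com/categories/music/",
    "http://www.other.example.org/",
]

naive_count = len(set(sources))                # every URL counted as its own blog
collapsed_count = count_linking_blogs(sources)  # category pages folded together
```

A host-only rule like this would undercount on services that put many blogs under one hostname, so a real classifier would need per-host exceptions.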
I bet the counting of LiveJournal sites may also be wonky, since the URLs aren't that predictable and other pages might be showing up as separate blogs.
Who was the lucky winner to own the one millionth blog?
I think it's greatly inflated.
Actually, I'm probably overdoing it a bit here by saying "greatly," but it could be off by a lot if there are enough sites with weird URL storage schemes being miscounted (and I don't see why an MT blog couldn't trip the algorithm). I would say it's got to be off by at least 10%, judging from my personal result tracking, and could be higher depending on how widespread the problem is.
You're right, Radio is somewhat messed up in that it attempts to count each "category" as a separate blog. We go through and cull the database regularly to pull that crap out. If you continue to see any results that look funky, please send an email to firstname.lastname@example.org and let us know.
I'm pretty sure the LJ stuff is accurate though, you'd be amazed at how many people are posting over there.
I'm working really hard to make sure that the Technorati database is accurate and clean, but wacky things happen all the time, and expecting 100% accuracy is, of course, unrealistic. But I really believe that the numbers are pretty accurate.
Radio is somewhat messed up in that it attempts to count each "category" as a separate blog
How does Radio do the separate blog stuff? Does Radio ping weblogs.com for each category when you make a post?
I'm pretty sure the LJ stuff is accurate though
When I was going through a bunch of cosmos pages looking for good examples of the previous problem, I found some results with a single LJ post listed 5-10 times, but there were so many results that I couldn't make out whether they were treated as one blog with many links or as many blogs (they all seemed to point at the same URL).
wacky things happen all the time
I noticed that Typepad blogs are counted twice, once for the root URL of foo.typepad.com, then again for the default blog directory, foo.typepad.com/bar (it's the same files in both places).
Oh, what about the Typepad blog having a domain name? Does that mean the blog will be counted three times?
Dave's right that it can be very difficult to filter out "false" Radio weblogs, we've had that problem ourselves. I'm not doubting his assessment of the number of LJ sites either, but that's an area we scratched our heads trying to figure out for some time. The problem with LJ or any of the blog hosting groups is that "failures" of those central servers will often cause a few thousand sites to simultaneously point to some default list of links (I distinctly remember a day when a page from the PHP manual jumped to the top of our 4 hour trends list). I think we've got them under control now however.
Our site is currently tracking around 150,000 weblogs -- nowhere near the million Technorati's got. I wonder if one of the differences in number is that we delete URLs that don't respond to our robots after a certain number of tries. Typically, if a site comes back, it finds a way to get added back into the system. This keeps our database leaner, and keeps our robots reading "actual" pages instead of waiting for errors.
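The pruning rule described above could look something like the following sketch. The threshold of three tries, the `record_fetch` name, and the dictionary layout are assumptions for illustration; the comment doesn't say how the real robots keep score.

```python
def record_fetch(tracked, url, ok, max_failures=3):
    """Update the tracked-URL table after one robot visit.

    `tracked` maps each URL to its count of consecutive failed fetches.
    A successful fetch resets the count, and also (re)adds a URL that
    had been dropped. Hitting `max_failures` in a row removes the URL.
    """
    if ok:
        tracked[url] = 0
        return
    if url not in tracked:
        return  # a failing URL we don't track is simply ignored
    tracked[url] += 1
    if tracked[url] >= max_failures:
        del tracked[url]


# Made-up example: a site fails three times, is dropped, then comes back.
tracked = {"http://blog.example.com/": 0}
for _ in range(3):
    record_fetch(tracked, "http://blog.example.com/", ok=False)
dropped = "http://blog.example.com/" not in tracked

record_fetch(tracked, "http://blog.example.com/", ok=True)
restored = "http://blog.example.com/" in tracked
```

Counting consecutive failures rather than total failures means one bad day doesn't slowly push an otherwise healthy site out of the database.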
Either way, "about one million" is a nice round number to point to for those of us trying to show how quickly blogging is growing around the world.
The million number seems fairly accurate. Maciej's Blog Census puts the number at 1.35 million with an estimate of ~890,000 that are active.
I think active is key here. I've noticed a lot of totally dead blogs (I mean ones that haven't been updated since 2000/2001) appearing in Technorati lately.
Wait, a lot is too strong. A fair amount would be a more accurate statement.
just as another few data points: according to blo.gs, 136,955 blogs have updated in the last week, 272,764 have updated in the last month, 391,042 have updated in the last two months, and 34,753 new blogs have been added in the last week (unfortunately, i haven't been keeping track of that for long).
this includes all blogs that ping weblogs.com, and that show up in the blogger.com changes feed, and a few other sources (and that ping blo.gs directly, of course).
this does almost totally exclude livejournal.com users.
Speaking of active, this Marlow post on churn rate and this Blogcensus follow up have some good information about blog activity. The Blogcensus post shows 5% of their sample had been abandoned (> 52 weeks since the last post). I wonder how Technorati's numbers would compare.
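For anyone wanting to run the same comparison on their own list of blogs, here is a rough sketch of the "abandoned" measure, using the more-than-52-weeks cutoff mentioned above. The function name and the sample dates are made up for illustration, not taken from either post.

```python
from datetime import date, timedelta

# Cutoff from the Blogcensus definition quoted above: > 52 weeks since last post.
ABANDONED_AFTER = timedelta(weeks=52)


def abandoned_share(last_post_dates, today):
    """Return the fraction of blogs whose last post is more than 52 weeks old."""
    if not last_post_dates:
        return 0.0
    abandoned = sum(1 for d in last_post_dates if today - d > ABANDONED_AFTER)
    return abandoned / len(last_post_dates)


# Made-up sample of last-post dates, checked against a fixed "today".
sample = [
    date(2003, 10, 1),  # recent
    date(2001, 1, 1),   # long dead
    date(2003, 1, 1),   # quiet, but inside the 52-week window
    date(2002, 10, 1),  # just past the cutoff
]
share = abandoned_share(sample, today=date(2003, 11, 1))
```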