Sampling networks accurately MAR 11 2003
Here are some lists of the top weblogs (as determined by counting inbound links):
Technorati Top 100
Daypop Top Weblogs
Myelin Blogging Ecosystem
TTLB Blogosphere Ecosystem
Most Watched Blogs @ blo.gs**
Blogrolling.com Top Links**
** These two lists are not like the others and the discussion below may not apply. (Or maybe it does.)
They are all different. Why? Because each is describing only a small part of the network as a whole -- with the possible exception of Technorati, because its sample size is relatively large -- much like Saxe's blind men trying to describe an elephant.
How did these lists -- which ostensibly are trying to measure the same thing -- get so dissimilar? To add weblogs into the system, each probably started with a small list of weblogs to seed the system, picking up other weblogs as each was scraped. That initial seed list pretty much determines how each map of the network is going to look. If you start with Scripting News and look at what it is linking to and what those sites are linking to (i.e. the two degrees of Scripting News), the popularity of SN is going to skew higher than its actual popularity because sites that SN links to are likely to link back to it.
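To make the seed effect concrete, here is a toy sketch in Python. Everything in it is an assumption made up for illustration: the ten-site link graph, the site names, and the `crawl` and `inbound_counts` helpers. None of the lists above is known to work exactly this way. It crawls two hops out from a seed, counts inbound links using only the pages it scraped, and compares that ranking with the one you get from the whole graph.

```python
from collections import Counter, deque

# Hypothetical link graph: site -> sites it links to.
# "seed" sits in a small, tightly linked cluster; "hub" is the site
# with the most inbound links overall, but no site in the seed's
# cluster links to it.
links = {
    "seed": ["a", "b", "c"],
    "a": ["seed", "b"],
    "b": ["seed"],
    "c": ["seed", "a"],
    "hub": ["x"],
    "v": ["hub"],
    "w": ["hub"],
    "x": ["hub"],
    "y": ["hub"],
    "z": ["hub"],
}


def crawl(links, seed, max_depth):
    """Breadth-first crawl: return the sites reachable within max_depth hops of seed."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        site, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links any further out
        for target in links.get(site, ()):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


def inbound_counts(links, scraped):
    """Count inbound links, looking only at the outbound links of scraped sites."""
    counts = Counter()
    for site in scraped:
        for target in links.get(site, ()):
            counts[target] += 1
    return counts


# Ranking from the whole network versus ranking from a two-degree crawl.
full = inbound_counts(links, links.keys())
sampled = inbound_counts(links, crawl(links, "seed", max_depth=2))

print("whole network:", full.most_common(3))
print("two degrees of seed:", sampled.most_common(3))
```

In this made-up graph the crawl never reaches the most-linked site at all, so the seed ends up on top of its own list. Real crawls are bigger and messier, but the direction of the bias is the same.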
So, my hypothesis is that because of the skew introduced by the initial conditions and the small sample sizes, all of these lists (except maybe Technorati) are pretty inaccurate. It's like the network effect squared or something -- the rich seem disproportionally richer because the network is being measured from their perspective (perhaps making this weblogs & power law business more pronounced than it actually is) -- but I can't get my head around it.
So here's my question for you. How do you construct a fairly accurate map of a network (the weblog universe in this case) with a sample size much smaller than the total number of nodes (weblogs)? Is it even possible? A random sampling would work, but how do you tell your spider to go find a random node when it can only find nodes through links from other nodes?
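One possible partial answer, offered as a sketch rather than a solution, is a Metropolis-Hastings random walk. A spider that just follows a random link at each step ends up at heavily linked sites far more often than at obscure ones. The Metropolis-Hastings variant proposes a random neighbor but only moves there with probability min(1, degree of current node / degree of proposed node), which pushes the visit frequencies toward uniform. The big assumption, stated loudly: this treats links as two-way (the spider can see inbound as well as outbound links) and the network as connected, neither of which holds for a plain weblog crawl. The star-shaped graph and the function names below are hypothetical.

```python
import random
from collections import Counter

# Hypothetical undirected network: one hub linked to five small sites.
# ASSUMPTION: every link can be followed in both directions.
neighbors = {
    "hub": ["a", "b", "c", "d", "e"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
    "d": ["hub"],
    "e": ["hub"],
}


def plain_walk(neighbors, start, steps, rng):
    """Follow a uniformly random link at every step."""
    current = start
    visited = []
    for _ in range(steps):
        current = rng.choice(neighbors[current])
        visited.append(current)
    return visited


def metropolis_hastings_walk(neighbors, start, steps, rng):
    """Propose a random neighbor; accept with probability min(1, deg(current) / deg(candidate))."""
    current = start
    visited = []
    for _ in range(steps):
        candidate = rng.choice(neighbors[current])
        ratio = len(neighbors[current]) / len(neighbors[candidate])
        # rng.random() is in [0, 1), so a ratio >= 1 is always accepted.
        if rng.random() < ratio:
            current = candidate
        # A rejected proposal still counts as a visit to the current node.
        visited.append(current)
    return visited


rng = random.Random(2003)  # fixed seed so the run is repeatable
steps = 20000

plain = Counter(plain_walk(neighbors, "hub", steps, rng))
mh = Counter(metropolis_hastings_walk(neighbors, "hub", steps, rng))

hub_share_plain = plain["hub"] / steps
hub_share_mh = mh["hub"] / steps

print("hub share, plain walk:", round(hub_share_plain, 3))
print("hub share, Metropolis-Hastings walk:", round(hub_share_mh, 3))
```

With six nodes, a uniform sample would land on the hub about one time in six; the plain walk lands there about half the time. Whether a spider can learn inbound links cheaply enough to do this on real weblogs is a separate question.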
(I didn't have time to do 2000 words on this, so it's a little incomplete and thrown together, more of a starting point for a discussion than a statement of what I actually believe. I could be wrong about all this, but it seems like there's something interesting here.)
Josh10 11 2003 1:10PM
You have a company like...oh um.. Google...go and do something like buy ..um...let's say...Blogger. 'Pyra Labs'. I think it's definitely within Google's power to create a more accurate 'rating' of Blogs. But the question is, why would they do that...