Sampling networks accurately

posted Mar 11 @ 12:56 PM by Jason Kottke

Here are some lists of the top weblogs (as determined by counting inbound links):

Technorati Top 100
Daypop Top Weblogs
Myelin Blogging Ecosystem
TTLB Blogosphere Ecosystem
Most Watched Blogs @ blo.gs**
Blogrolling.com Top Links**

** These two lists are not like the others and the discussion below may not apply. (Or maybe it does.)

They are all different. Why? Because each is describing a small part of the network as a whole — with the possible exception of Technorati because its sampling size is relatively large — much like Saxe’s blind men trying to describe an elephant.

How did these lists, which are ostensibly trying to measure the same thing, get so dissimilar? To add weblogs into the system, each probably started with a small list of weblogs to seed the system, picking up other weblogs as each was scraped. That initial seed list pretty much determines how each map of the network is going to look. If you start with Scripting News and look at what it is linking to and what those sites are linking to (i.e. the two degrees of Scripting News), the popularity of SN is going to skew higher than its actual popularity because sites that SN links to are likely to link back to it.
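To make the seed effect concrete, here is a minimal Python sketch on a made-up link graph. The site names and the `crawl` and `inbound_counts` helpers are hypothetical, and I have no idea whether any of the indexes above actually work this way; it only shows how a two-hop crawl from one seed can rank that seed above an equally linked site it never reaches.

```python
from collections import Counter, deque

def crawl(links, seed, max_depth=2):
    """Breadth-first crawl: return every page within max_depth hops of seed."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links any further out
        for target in links.get(node, ()):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen

def inbound_counts(links, pages):
    """Count inbound links using only the links found on the given pages."""
    counts = Counter()
    for page in pages:
        for target in links.get(page, ()):
            counts[target] += 1
    return counts

# Hypothetical toy graph: "sn" and "y" each have three inbound links.
links = {
    "sn": ["a", "b"],
    "a": ["sn", "c"],
    "b": ["sn"],
    "c": ["sn"],
    "x": ["y"],
    "y": ["x"],
    "z": ["y"],
    "w": ["y"],
}

sampled = inbound_counts(links, crawl(links, "sn"))  # what the spider sees
full = inbound_counts(links, links)                  # what is really there
```

In this toy graph the full count gives "sn" and "y" three inbound links each, but the crawl seeded at "sn" sees three for "sn" and none for "y", because nothing within two hops of the seed links into that other cluster.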

So, my hypothesis is that because of the skew introduced by the initial conditions and the small sample sizes, all of these lists (except maybe Technorati) are pretty inaccurate. It’s like the network effect squared or something — the rich seem disproportionally richer because the network is being measured from their perspective (perhaps making this weblogs & power law business more pronounced than it actually is) — but I can’t get my head around it.

So here’s my question for you. How do you construct a fairly accurate map of a network (the weblog universe in this case) with a sample size much smaller than the total number of nodes (weblogs)? Is it even possible? A random sampling would work, but how do you tell your spider to go find a random node when it can only find nodes through links from other nodes?
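One known approach to the "random node through links only" problem is a Metropolis-Hastings random walk: the spider wanders from neighbor to neighbor, but refuses some moves toward heavily linked nodes so that, in the long run, every node is visited about equally often. The sketch below is only that, a sketch. It assumes links can be treated as two-way and that the spider can see how many neighbors a candidate node has, neither of which holds cleanly for real weblogs. The `mh_walk` function and the star-shaped toy graph are made up for illustration.

```python
import random
from collections import Counter

def mh_walk(neighbors, start, steps, rng):
    """Walk `steps` moves and return the node where the walk ends.

    A proposed move from node to candidate is accepted with probability
    min(1, degree(node) / degree(candidate)); otherwise the walk stays put.
    """
    node = start
    for _ in range(steps):
        candidate = rng.choice(neighbors[node])
        accept = min(1.0, len(neighbors[node]) / len(neighbors[candidate]))
        if rng.random() < accept:
            node = candidate
    return node

# Hypothetical toy graph: one hub linked to four leaves (links are two-way).
star = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
    "d": ["hub"],
}

rng = random.Random(0)
visits = Counter(mh_walk(star, "hub", 50, rng) for _ in range(2000))
hub_share = visits["hub"] / 2000
```

A plain random walk on this graph would sit on the hub half the time. The acceptance step pulls that toward one in five, which is the hub's fair share of a five-node network.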

(I didn’t have time to do 2000 words on this, so it’s a little incomplete and thrown together, more of a starting point for a discussion than a statement of what I actually believe. I could be wrong about all this, but it seems like there’s something interesting here.)

Reader comments

Josh Mar 11, 2003 at 1:10PM

You have a company like...oh um.. Google...go and do something like buy ..um...let's say...Blogger. 'Pyra Labs'. I think it's definitely within Google's power to create a more accurate 'rating' of Blogs. But the question is, why would they do that...

John Wehr Mar 11, 2003 at 1:27PM

"because of the skew introduced by the initial conditions and the small sample sizes, all of these lists .... are pretty inaccurate."

Your link skew theory sounds correct. And because your metric is "counting inbound links," the obvious next step would be to increase sample size. They aren't inaccurate at all, just not reflective of some amorphous larger community you reference.

I do not think counting inbound links is a particularly good way to go about ranking weblogs as it is partial to older, more established sites and victim to the "viewpoint" skew you talk about. The method I have chosen uses a system of tracking URLs in a set of weblogs. Not surprisingly, the list of sites that consistently bring new, interesting URLs into a community closely mirrors the results from counting inbound links. But if a website stopped updating entirely today, it would remain static in the inbound link scheme but quickly fall in the URL tracking method.
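The comment does not spell out how the URL tracking works, so the following Python sketch is one guess at the general idea rather than the actual method: credit each weblog with the number of other weblogs that later post a URL it posted first. The `first_poster_scores` function and the sample posts are hypothetical.

```python
from collections import defaultdict

def first_poster_scores(posts):
    """Score blogs by how many other blogs later post a URL they posted first.

    posts: iterable of (timestamp, blog, url) tuples.
    """
    first = {}                    # url -> blog that posted it first
    followers = defaultdict(set)  # url -> blogs that posted it afterwards
    for timestamp, blog, url in sorted(posts):
        if url not in first:
            first[url] = blog
        elif blog != first[url]:
            followers[url].add(blog)
    scores = defaultdict(int)
    for url, blog in first.items():
        scores[blog] += len(followers[url])
    return dict(scores)

# Hypothetical sample data: (timestamp, blog, url).
posts = [
    (1, "a", "u1"), (2, "b", "u1"), (3, "c", "u1"),
    (1, "b", "u2"), (5, "a", "u2"),
    (4, "c", "u3"),
]
scores = first_poster_scores(posts)
```

Restricting `posts` to a recent time window would give the behavior described above: a site that stops updating stops earning credit and drops out of the ranking.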

I too am limited by time, but essentially I believe counting inbound links is an old trick whose practical applications have been explored. If you want to map the blogosphere, map the flow of ideas, not the noisy connections.

Maciej Ceglowski Mar 11, 2003 at 1:45PM

You seem to be asking whether for any two blogs, you can find your way from A to B and back just by following links.

Just like the power-law discussion, this would be a great time for any lurking graph theorists to speak up. They've been studying this kind of stuff forever. In the meantime, I'll spew hot air:

I recall a 2000 study on the entire web which found that there are 'islands' of unconnected sites, along with a large number that only had outgoing links (they show this on the world's ugliest map). I suspect something similar might be true for weblogs.

But even then, the lists of the 'top 100' should all be very similar, since every search engine will be able to index the big central core of accessible blogs, and count the links.

A good way to find weblogs that have no links pointing to them would be to comb through the referer logs on popular sites. You might consider making a donation from your own referer log - just make a long list of unique URLs, and send it off to Technorati, Daypop, and the other sites. If enough high-traffic bloggers did that, it would help increase the coverage.

Of course, the search sites should also spider each other to find new sites. Why do some engines have broader coverage than others? That just seems like sloppy programming. If I knew that PopTechnoRoogle had a more comprehensive list of URLs than my own search site, it wouldn't take a lot of work to remedy that. Just search and harvest, search and harvest, until you don't find anything new. Spambots have been doing it for years.

David Mar 11, 2003 at 3:33PM

[H]ow do you tell your spider to go find a random node when it can only find nodes through links from other nodes?

Some of the sites you mentioned find new weblogs by watching Weblogs.com. This has its own problems. For one thing, most Blogger blogs don't announce their existence to that list, so they are underrepresented in the sample.

I wish there was a good map, because the blind-men-and-elephant problem has led to some pretty skewed understandings of the whole phenomenon. (I'm just as blind as everyone else, but I've convinced myself that I know exactly what's going on.)

John Dowdell Mar 11, 2003 at 4:02PM

"How do you construct a fairly accurate map of a network (the weblog universe in this case) with a sample size much smaller than the total number of nodes (weblogs)?"

Your question here is about mapping the network, but the initial citations were for popular links within the network... two different problems...?

It's true that pure-spidering will reinforce existing paths, particularly with blogrolls which institutionalize onetime preferences. (Pulling links out of RSS rather than HTML versions can avoid this, true?)

To get a better idea of popularity, one approach is to (a) map out the territory to the widest degree possible and then (b) analyze samples from within that set. If you have addresses for 10,000 blogs, then analyzing a random set of 500 within that can avoid the path-reinforcement problems you mentioned. (You could weight such a sample, such as those posting within the last hour or with more than 10 links in the last week... these would measure slightly different things.)

Summary: Would it work for you to get a big telephone book, and pull names at random from it, rather than just calling the names on your rolodex...?
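The telephone-book approach might look something like the following Python sketch. It assumes a directory of known weblogs and their outbound links already exists, which is of course the hard part. The `directory` data, `sampled_inbound_counts`, and `scale_up` are all made up for illustration.

```python
import random
from collections import Counter

def sampled_inbound_counts(links, sample_size, rng):
    """Count inbound links using a uniform random sample of known blogs."""
    sample = rng.sample(sorted(links), sample_size)
    counts = Counter()
    for blog in sample:
        counts.update(set(links[blog]))  # one vote per blog per target
    return counts

def scale_up(counts, sample_size, total):
    """Extrapolate sampled counts to the whole directory."""
    return {target: count * total / sample_size
            for target, count in counts.items()}

# Hypothetical directory: 1,000 blogs, all linking to "hub",
# and every tenth one also linking to "niche".
directory = {
    f"blog{i}": ["hub"] + (["niche"] if i % 10 == 0 else [])
    for i in range(1000)
}

rng = random.Random(0)
counts = sampled_inbound_counts(directory, 50, rng)
estimates = scale_up(counts, 50, len(directory))
```

Because the sample is drawn from the directory rather than reached by following links, no single starting point gets to shape the result. The estimate for a rarely linked site will still be noisy at this sample size.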

Sidenote: I'm really hoping for increasing segmentation of link recommendations... Daypop Top 40 is interesting for learning what its total sample finds important today, but I could really go for finding what certain subgroups find important... "hmm, among my favorite sports writers most found this link important today, and the people in my 'devices' experts group recommended these two links today...."

jkottke Mar 11, 2003 at 4:27PM

Thanks for the link to the bow tie map, Maciej. I wanted to put that in my post somewhere (because it's highly relevant), but didn't have time to dig it up.

A good way to find weblogs that have no links pointing to them would be to comb through the referer logs on popular sites. You might consider making a donation from your own referer log - just make a long list of unique URLs, and send it off to Technorati, Daypop, and the other sites.

Not to mention a good way to improve the ranking of all these popular blogs within the system. Again with the skew.

For one thing, most Blogger blogs don't announce their existence to that list, so they are underrepresented in the sample.

I thought Blogger started pinging weblogs.com awhile back. Maybe not.

Your question here is about mapping the network, but the initial citations were for popular links within the network... two different problems...?

The popular links are determined by looking at the map. If the popular links list is not good, the map probably sucks too.

Patrick Mar 11, 2003 at 4:58PM

When you cross-reference the links from that basic set (i.e. the two degrees of Scripting News) you end up with the list of most linked blogs, correct? What if you take the "bottom" part of the list and create a new group of blogs based on the links they contain? Then from there do another iteration? I agree that the current "top" sites link to a lot of the same places but I'm sure they all have a few personal favorites that are their own. The bottom of the initial list would link to some known but mostly to a lot of rarer blogs. The second iteration I mentioned should broaden that even more. You can then group all of those groups in your new ranking system and / or have a couple of different blogospheres. If from those new spheres you start doing something like the Daypop Top 40, which is basically a news list, you could potentially have different editorial opinions, not only a more complete map. Does that make any sense?

David Sifry Mar 11, 2003 at 6:06PM

Jason, you're right in theory, and I can't speak for any of the non-Technorati engines, but for Technorati, we don't spider in the way you describe. It's interesting - looking at the Technorati DB stats, about 50% of the blogs we survey have zero incoming links or blogs, but they do link to others. So, there's a lot of dark matter out there, but we've been fortunate enough to pull together some mechanisms to shed some light on the dark matter. That's why we are approaching 125,000 blogs indexed at last count. I'm sure that there is some skew (NZ Bear's Ecosystem is a great example of the effect of warblogger skew, for example), but some of the database analysis that we've done, like the 50% number referenced above, leads me to believe that we're getting awfully close to seeing the entire elephant.

One area where we don't have much visibility is in the Live/DeadJournal world - because many of those folks keep their journals private, spiders like Technorati's can't index them, leaving that part of the ecosystem unlit. I'd love it if the LJ/DJ folks posted XML-RPC pings to weblogs.com for people who create public journals or did something of the like, but I don't even know who to talk to over there that could get it done.

Dave

kevin Mar 11, 2003 at 6:12PM

Why bother? Honestly. I'm still not convinced that it's worth the time or effort to come up with such a beast. It seems antithetical to the whole blog philosophy.

Oh, wait.. It /is/ just a big popularity contest, isn't it? :)

Martin Mar 11, 2003 at 8:22PM

Unfortunately, most of these systems are based on popularity, which doesn't account for the taste, style, content or standards.

I like the idea of "following an idea" - for example, indexing any/all sites/weblogs that happen to be linking to a particular story, or website - that way, anyone can be part of the overall consciousness of the blogosphere and the parts make up the sum of the whole.

It's a tough cookie to crack, but let's face it, exposure/popularity on the web is largely down to four or five main things:

- You have an amazing idea that catches on.

- You're the first to do something and then everyone notices - even when it's a bad idea.

- Other sites link to you - for whatever reason - but usually because your site is good and has something to offer, regularly and relatively free.

- You're damn good at what you produce and give good value if/when you have to charge for the service.

Rick Mar 11, 2003 at 10:21PM

I agree with Martin about the blogosphere, but I agree with him more on not linking them by popularity. Some of the more popular ones only get that way due to sexual content.

I think categorizing then popularizing (sp?) would be a better alternative. That way you are getting the most popular in a category. Of course some weblogs have different posts every day so this system, like all others, would have a flaw.

David Sifry Mar 11, 2003 at 11:07PM

Popularity rankings are a by-product. The key issues are breadth (are we capturing all the blogging conversations), relevance (are the links accurate, are they contributing to the conversation, or just blogrolls), and freshness (how soon after posting can you capture the conversation). Technorati's best feature, if you ask me, is the ability to quickly find and allow you to participate in conversations going on around you about you.

The Top 100 is mostly meaningless except for the fact that people get on the top 100 by consistently posting interesting stuff.

Check out the Interesting Newcomers list or the Interesting Recent Blogs list - I'm constantly finding new, interesting conversations going on using those pointers.

Dave

jkottke Mar 12, 2003 at 12:45AM

Just for the sake of completeness, BlogStreet (a new-to-me site) has a top 100 list as well as a weighted top 100 list.

jkottke Mar 12, 2003 at 12:55AM

Popularity rankings are a by-product.

Exactly right...I wish people wouldn't get hung up on the popularity aspect of it all the time. The top X list is just a list of the hubs in the network. The accuracy of how, um, hubby the hubs are is important in building an accurate map of the network (that you can then use to do all sorts of things, like slice and dice the data as John suggests).

Ben Mar 12, 2003 at 3:45AM

As an idea.....

I've been messing around with taking snapshots of what people are sharing on the P2P networks (basically generating a bunch of "profiles") which could be thought of as similar to links on blogs.

Around 33% of people have an Eminem song, as an example.

I then generated some snazzy algorithms that take into consideration the popularity of artists and songs and try to generate relationships between artists based purely on the fact that people share songs in common.

Rather than being a pure "person who likes x also likes y" system, it discounts the popular Eminems of the world (U2, Madonna, etc.) who appear in almost everyone's collections.

So what, you say?

Well I figure that a similar algorithm may be able to be applied to blogs to discount the skew issue that you are initially referring to. Even with a small crawl of the P2P networks I've been able to pull out really good data recently.

I'd be happy to provide some more details should anyone think this may have legs in determining the link patterns among blogs and ideas within blogs.
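The comment gives no details of these algorithms, so the Python sketch below is not that method. It shows one standard way to discount ubiquitous items, a "lift" score that compares how often two items appear together with how often they would if they were unrelated. The `lift_scores` function and the sample profiles are hypothetical.

```python
from collections import Counter
from itertools import combinations

def lift_scores(profiles):
    """Return {(item_a, item_b): lift} for every pair seen together.

    lift = P(a and b) / (P(a) * P(b)). A value near 1 means the pair
    co-occurs about as often as chance; higher means a real association.
    """
    n = len(profiles)
    item_counts = Counter()
    pair_counts = Counter()
    for profile in profiles:
        items = sorted(set(profile))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return {
        pair: count * n / (item_counts[pair[0]] * item_counts[pair[1]])
        for pair, count in pair_counts.items()
    }

# Hypothetical profiles: "eminem" is in every collection.
profiles = [
    ["eminem", "a", "b"],
    ["eminem", "a", "b"],
    ["eminem", "c"],
    ["eminem"],
]
scores = lift_scores(profiles)
```

Here the pair ("a", "b") scores 2.0 while every pair involving "eminem" scores 1.0, since an item that is in every collection says nothing about what else is there. For weblogs, a profile could be the set of sites one blog links to.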

David Sifry Mar 12, 2003 at 9:09AM

Ben,
Sounds like a job for LSI and the Singular Value Decomposition process. :-)
Good luck, and let us know your findings!

Dave

Maciej Ceglowski Mar 12, 2003 at 1:30PM

Ben, if you want to take up David's suggestion, you can find open source LSI code (in Perl) available here.

Rich Persaud Mar 13, 2003 at 12:31AM

With this much effort going into retroactive analysis of paths, it may be useful to involve the link author in proactive assertion of the link's verb.

Or, to borrow some of the text below this comment textbox, where are the semantic standards for civil "flaming, trolling, and ass-kissing" ?

Tom Morris Mar 26, 2003 at 1:46PM

What we really need is personalised aggregators. Not Newsisfree.com-style syndicators, but personalised aggregators that recommend places based on a mixture of, say, rating and popularity, in topic areas that you're interested in. You could have a 'plus' list and a 'minus' list - add phrases or topics to the plus list, and those topics are valued higher - ditto with the minus list.

And then you could aggregate the aggregators, and so on.

I suppose, if nothing else, the 'blog revolution' has given the geeks lots of cool things to play with. :)

cbrayton May 24, 2003 at 1:53AM

I hadn't thought of how seeding the crystal affects the growth of the diamond, that's interesting. What bugs me is the lack of granularity in, say, Technorati or Blogshares. Technorati only filters by "authority," which it equates with link-to popularity, whereas what I want to find are "authorities" in terms of those rare people who share my interests and also know more than I do about them. Anyone can link to Clay Shirky, but few actually know what he's talking about (joke).

There's a radical continuity between "authority" defined this way and "authority" in real life: I'm an authority if I've absorbed all the information on a subject and come to some useful conclusions on how to sort it out. All that linking to something implies is that I am at least a wannabe. I want to find those who link out to what I link out to and see what else they link out to. Blogmatchers goes in that direction, but the first 20 pages of results are everyone who also has a Creative License.

Blogshares bugs me because it doesn't define what a blog is, and consequently the monopolies (Blogger and other hosting sites) and commercial blogoids (Gizmodo) distort the stats, so that there can be no meaningful statistical difference between the one-person-blogging players in the market. This discourages what Blogshares ought to encourage: winning by having the best blogspotting eye for picking pearls from among the swine. It doesn't take a rocket scientist to see that Blogger (or "So Many Men, So Little Time" for that matter) is going to be link-popular. So where's the educational value in the game? Authority in blogspotting accrues to those who belabor the obvious, and nothing new is discovered.

My big thing to kvetch about, for instance, is being able to filter by language. I like to read Spanish, French and Portuguese bloggers and would like to find some cool ones I can interact with. XML allows language metadata. How hard would that be to implement for RSS? Not hard at all. Why doesn't it get done?

As a guy over at the MetaFilter said rudely the other day, "This is whose problem, exactly?" Well, the gazillion foreigners with Blogger blogs that live on server farms in Menlo Park, for example. Now that Weblogues.com is up and running in France, and others up and coming, you can kiss that site traffic goodbye. But you have kept it by adding that value to the network. Big ROI!

This thread is closed to new comments. Thanks to everyone who responded.