kottke.org posts about statistics

New NBA stat: points per miss (May 19 2011)

A couple nights ago against the Oklahoma City Thunder, Dirk Nowitzki scored 48 points and only missed three shots, prompting Bill Simmons to wonder if that was some sort of record. Jerod from Midwest Sports Fans dug into how useful a stat like points per miss would be as a measure of efficiency.

What is interesting about the table above is that Dirk comes in ahead of Bird, Jordan, and so many others. Does this mean Dirk is a better player than Jordan or Bird? Of course not. But it does mean that he is as efficient a scorer as those two were, if not better. Scoring efficiency only tells one part of the story on one side of the floor, which is why PPM can only be considered a small piece of the puzzle when comparing players, but it is a good way to give one of the most unique scoring talents in NBA history his due.
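For reference, points per miss is just points scored divided by shots missed (whether that counts field-goal misses only or free-throw misses too depends on whose definition you use); a minimal sketch:

    def points_per_miss(points: int, missed_shots: int) -> float:
        """Points scored divided by shots missed; higher means more efficient scoring."""
        if missed_shots == 0:
            return float("inf")  # a perfect shooting night
        return points / missed_shots

    # Dirk's 48-point game against the Thunder, with only three misses
    print(points_per_miss(48, 3))  # 16.0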

Bill Simmons on sabermetrics (Apr 06 2010)

Bill Simmons has finally accepted the gospel of sabermetrics as scripture and in a recent column, preaches the benefits of all these newfangled statistics to his followers. The list explaining his seven favorite statistics in down-to-earth language is really helpful to the stats newbie.

Measure BABIP to determine whether a pitcher or hitter had good luck or bad luck. In 2009, the major league BABIP average was .299. If a pitcher's BABIP dipped well below that number, he might have had good luck. If it rose well above that number, he likely had terrible luck. The reverse goes for hitters.
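For reference, BABIP (batting average on balls in play) strips out home runs and strikeouts so it only counts balls the defense could actually field; a minimal sketch of the commonly used formula, with a made-up season line:

    def babip(hits: int, home_runs: int, at_bats: int, strikeouts: int, sac_flies: int = 0) -> float:
        """Batting average on balls in play: (H - HR) / (AB - K - HR + SF)."""
        return (hits - home_runs) / (at_bats - strikeouts - home_runs + sac_flies)

    # Hypothetical season line, for illustration only
    print(round(babip(hits=180, home_runs=25, at_bats=600, strikeouts=110, sac_flies=5), 3))  # 0.33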

(via djacobs, who has an extremely high VORF)

Rating the pundits: 2009 NFL preseason predictions (Jan 07 2010)

How accurate are all those preseason predictions about how the coming NFL season will unfold?

ESPN Ranking Offsets

In an effort to find out, I collected a number of preseason "team power rankings" two days before the 2009 NFL regular season started in September. These ranking lists are compiled by columnists and pundits from media outlets like Sports Illustrated, Fox Sports, The Sporting News, and ESPN. In addition, I collected a fan-voted ranking from Yahoo Sports and the preseason Vegas odds to win the Super Bowl. As a baseline of sorts, I've also included the ranking for how the teams finished in the 2008 season.

Each team ranking from each list was compared to the final 2009 regular season standings (taken from this tentative 2010 draft order) by calculating the offset between the estimated rank and the team's actual finish. For instance, ESPN put the Steelers in the #1 slot but they actually finished 15th in the league...so ESPN's offset for the Steelers is 14. For each list, the offsets for all 32 teams were added up and divided by 32 to get the average number of places that the list was off by. See ESPN's list at right for an example; you can see that each team ranking in the list was off by an average of about 6.3 places.
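In code, the calculation amounts to averaging the absolute rank differences; here's a minimal sketch (the ranks below are illustrative, not the full 2009 data):

    # Average offset between a predicted power ranking and the actual finish.
    # Both dicts map team -> rank (1 = best) and cover the same teams.
    def average_offset(predicted: dict[str, int], actual: dict[str, int]) -> float:
        offsets = [abs(predicted[team] - actual[team]) for team in predicted]
        return sum(offsets) / len(offsets)

    # Toy example: ESPN had the Steelers #1 and they finished 15th, an offset of 14
    predicted = {"Steelers": 1, "Patriots": 2, "Lions": 32}
    actual = {"Steelers": 15, "Patriots": 8, "Lions": 30}
    print(round(average_offset(predicted, actual), 1))  # (14 + 6 + 2) / 3 = 7.3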

Here are the offset averages for each list (from best to worst):

Media outlet          Offset ave. (# of places)
CBS Sports            5.6
The Sporting News     5.6
USA Today             5.6
Vegas odds            5.8
Yahoo Sports          5.9
Sports Illustrated    5.9
ESPN                  6.3
Fox Sports            6.4
2008 finish           7.3

The good news is that all of the pundits beat the baseline ranking of last season's final standings. But they didn't beat it by that much...only 1.7 places in the best case. A few other observations:

- All the lists were pretty much the same. Last place Fox Sports and first place CBS Sports differ by less than one place in their rankings. The Steelers and Patriots were one and two on every list and the bottom five were pretty consistent as well. All the pundits said basically the same thing; no one had an edge or angle the others didn't.

- Nearly everyone was very wrong about the Steelers, Giants, Titans, Jets, Bengals, and Saints...and to a lesser extent, the Redskins, Bears, Vikings, and Packers. CBS Sports made the fewest big mistakes; their offset for the Bengals was only 4 places. The biggest mistakes were Fox Sports' and Vegas's rankings of the Bengals to finish 28th (offset: 19).

- Among the top teams, the Colts, Eagles, and Patriots more or less fulfilled the hopes of the pundits; only Fox Sports and Sports Illustrated missed the mark on one of these teams (the Colts by 9 places).

- The two "wisdom of the crowds" lists, Yahoo Sports and the Vegas list, ended up in the middle, better than some but not as good as some others. I suspect that there was not enough independent information out there for the crowd to make a good collective choice; those two lists looked pretty much like the pundits' lists.

- The teams who turned out to be bad were easier to pick than the good teams. The bottom five picks on each list were typically off by 3-5 places while the top five were off by more like 8-12 places (especially the Steelers and the Giants). Not sure why this is. Perhaps badness is easier to see than goodness. Or it's easier for a good-looking team to go bad than it is for a bad-looking team to do better.

For the curious, here's the full Google Docs spreadsheet of numbers for all of the lists.

Methodology and notes: 1) I made an assumption about all these power ranking lists: that what the pundits were really picking is the final regular season ranking. That isn't precisely true but close enough for our purposes. 2) I have no idea what the statistical error is here. 3) The 2010 draft order list isn't a perfect ranking of how the teams finished, but it is close enough. 4) Using the final regular season records as the determining factor of rank is problematic because of the playoffs. By the end of the season, some teams aren't trying to win every game because they've either made the playoffs or haven't. So some teams might be a little bit better or worse than their records indicate. 5) The Vegas odds list was a ranking of the odds of each team making the Super Bowl, not the odds for the teams' final records. But close enough. 6) The Sports Illustrated list was from before the 2009 pre-season started; I couldn't find an SI list from right before the regular season. Still, it looked a lot like the other lists and did middlingly well.

Statistical evidence of election fraud in Iran? (Jun 18 2009)

I was kinda waiting for FiveThirtyEight to weigh in on this: using Benford's Law to check for fraud in the Iranian election results (here as well).

Benford's law is sometimes useful in these cases, because human beings intuitively tend to distribute the first digits about evenly when they're making up "random" strings of numbers, when in fact many real-world distributions will be skewed toward the smaller digits.
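In code, the check amounts to comparing observed leading-digit frequencies against the Benford expectation log10(1 + 1/d). Here's a minimal sketch, using made-up precinct totals rather than the actual Iranian returns:

    import math
    from collections import Counter

    def first_digit_distribution(counts: list[int]) -> dict[int, float]:
        """Fraction of values whose leading digit is d, for d = 1..9."""
        digits = [int(str(abs(c))[0]) for c in counts if c != 0]
        tally = Counter(digits)
        return {d: tally.get(d, 0) / len(digits) for d in range(1, 10)}

    benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}  # ~30.1% for 1, ~4.6% for 9

    # Hypothetical precinct-level vote totals, for illustration only
    votes = [1204, 1876, 932, 2541, 1130, 1099, 3012, 1420, 876, 1567]
    observed = first_digit_distribution(votes)
    for d in range(1, 10):
        print(d, round(observed[d], 2), round(benford[d], 3))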

Both 538 pieces are skeptical that Benford's Law is applicable in this case. (thx, nick)

Update: Voting fraud expert Walter Mebane has produced a paper on the Iranian election that uses Benford's Law to check the results. He's updated the paper several times since it was first published and now writes that "the results give moderately strong support for a diagnosis that the 2009 election was affected by significant fraud". (thx, scott)

Update: Done just after the election, this analysis shows that the returns released by Iran's Interior Ministry over the course of election day display an unnaturally high steadiness in the voting percentages. (thx, cliff)

Update: Regarding the previous link, Nate Silver doesn't think much of that analysis. (thx, cliff)

The overtime spike in NBA basketball (Jun 15 2009)

The distribution of point differentials at the end of NBA basketball games shows that a tie is more than twice as likely as either team winning by one point. A possible simple explanation from the comments:

1. Teams down by 2 late are most likely to take a two-point shot, while teams down by 3 will most often take a three-point shot. The teams' choices make ties a likely outcome.

2. A tie is a stable equilibrium, while other scores aren't. If a team leads with the ball, they will be fouled, preventing the game from ending on that score. If a team has the ball with a tie, they'll usually be allowed to wait and take the last shot, either winning the game or leaving it as a tie.

Update: This study about golf putting seems to have something in common with the overtime finding.

Even the world's best pros are so consumed with avoiding bogeys that they make putts for birdie discernibly less often than identical-length putts for par, according to a coming paper by two professors at the University of Pennsylvania's Wharton School. After analyzing laser-precise data on more than 1.6 million Tour putts, they estimated that this preference for avoiding a negative (bogey) more than gaining an equal positive (birdie) -- known in economics as loss aversion -- costs the average pro about one stroke per 72-hole tournament, and the top 20 golfers about $1.2 million in prize money a year.

More biking = safer biking (Jun 05 2009)

The "safety in numbers" effect is proving true in NYC: the number of bicycles on the streets has more than doubled since 2001 while casualties have fallen. The increased prevalence of bike lanes in the city has to be helping too. (thx, david)

Nate Silver predicts the Oscars (Feb 16 2009)

Nate Silver, who used polling statistics to predict a clear Obama win in the Presidential election in November, turns his analytical tools loose on the Oscars.

For example, is someone more likely to win Best Actress if her film has also been nominated for Best Picture? (Yes!) But the greatest predictor (80 percent of what you need to know) is other awards earned that year, particularly from peers (the Directors Guild Awards, for instance, reliably foretells Best Picture). Genre matters a lot (the Academy has an aversion to comedy); MPAA and release date don't at all. A film's average user rating on IMDb (the Internet Movie Database) is sometimes a predictor of success; box grosses rarely are.

Silver's "Gamble-Tron 2000 Lock of the Oscars" is that Danny Boyle wins Best Director for Slumdog Millionaire with a whopping 99.7% certainty. I suspect that the Oscars will prove more difficult to predict than the election and that Silver will be wrong in at least two categories. I will report back on Oscar night. (via fimoculous)

Search correlations with StateStats (Dec 03 2008)

StateStats is hours of fun. It tracks the popularity of Google searches per state and then correlates the results to a variety of metrics. For instance:

Mittens - big in Vermont, Maine, and Minnesota, moderate positive correlation with life expectancy, and moderate negative correlation with violent crime. (Difficult to commit crimes while wearing mittens?)

Nascar - popular in North and South Carolina, strong positive correlation with obesity, and moderate negative correlation with same sex couples and income.

Sushi - big in NY and CA, moderate positive correlation with votes for Obama, and moderate negative correlation with votes for Bush.

Gun - moderate positive correlation with suicide and moderate negative correlation with votes for Obama. (Obama is gonna take away your guns but, hey, you'll live.)

Calender (misspelled) - moderate positive correlation with illiteracy and rainfall and moderate negative correlation with suicide.

Diet - moderate positive correlation with obesity and infant mortality and moderate negative correlation with high school graduation rates.

Kottke - popular in WI and MN, moderate positive correlation with votes for Obama, and moderate negative correlation with votes for Bush.

Cuisine - This was my best attempt at a word with strong correlations that isn't clustered in an obvious way (e.g. blue/red states, urban/rural, etc.). Strong positive correlation with same sex couples and votes for Obama and strong negative correlation with energy consumption and votes for Bush.

I could do this all day. A note on the site about correlation vs. causality:

Be careful drawing conclusions from this data. For example, the fact that walmart shows a moderate correlation with "Obesity" does not imply that people who search for "walmart" are obese! It only means that states with a high obesity rate tend to have a high rate of users searching for walmart, and vice versa. You should not infer causality from this tool: In the walmart example, the high correlation is driven partly by the fact that both obesity and Walmart stores are prevalent in the southeastern U.S., and these two facts may have independent explanations.

Can you find any searches that show some interesting results? Strong correlations are not that easy to find (although foie gras is a good one). (thx, ben)
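Under the hood, a correlation like the ones StateStats reports is presumably just a Pearson coefficient computed across the states; here's a rough sketch with made-up per-state values (the term and metric are placeholders):

    from statistics import correlation  # Python 3.10+

    # Hypothetical per-state numbers, one entry per state (only five shown here):
    # normalized search popularity of some term and the state's obesity rate.
    search_popularity = [0.8, 0.5, 0.9, 0.3, 0.6]
    obesity_rate = [31.0, 27.5, 33.2, 24.8, 29.1]

    r = correlation(search_popularity, obesity_rate)
    print(round(r, 2))  # near +1 or -1 means a strong correlation, near 0 means none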

The Netflix Prize and the Case of the Napoleon Dynamite Problem (Nov 24 2008)

Clive Thompson writes up the Netflix Prize -- which offers $1 million to the first team to improve upon Netflix's default recommendation algorithm by 10% -- and the vexing Napoleon Dynamite problem that is thwarting all comers.

Bertoni says it's partly because of "Napoleon Dynamite," an indie comedy from 2004 that achieved cult status and went on to become extremely popular on Netflix. It is, Bertoni and others have discovered, maddeningly hard to determine how much people will like it. When Bertoni runs his algorithms on regular hits like "Lethal Weapon" or "Miss Congeniality" and tries to predict how any given Netflix user will rate them, he's usually within eight-tenths of a star. But with films like "Napoleon Dynamite," he's off by an average of 1.2 stars.

The reason, Bertoni says, is that "Napoleon Dynamite" is very weird and very polarizing. It contains a lot of arch, ironic humor, including a famously kooky dance performed by the titular teenage character to help his hapless friend win a student-council election. It's the type of quirky entertainment that tends to be either loved or despised. The movie has been rated more than two million times in the Netflix database, and the ratings are disproportionately one or five stars.
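The error Bertoni is talking about is per-movie prediction error -- the contest itself is scored on root-mean-square error (RMSE) -- and a polarizing title blows it up; a minimal sketch with made-up ratings:

    import math

    def rmse(predicted: list[float], actual: list[float]) -> float:
        """Root-mean-square error between predicted and actual star ratings."""
        return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

    # Hypothetical ratings for a polarizing movie: the algorithm predicts middling scores,
    # but the actual ratings cluster at 1 and 5 stars.
    predicted = [3.2, 3.4, 3.1, 3.3, 3.2, 3.4]
    actual = [5.0, 1.0, 5.0, 1.0, 5.0, 1.0]
    print(round(rmse(predicted, actual), 2))  # a much bigger miss than on a middle-of-the-road hit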

This behavior was flagged as an issue by denizens of the Netflix Prize message board soon after the contest was announced two years ago.

Those are the movies you either loved loved loved or hated hated hated. These are the movies you can argue with your friends about. And good old 'Miss Congeniality' is right up there in the #4 spot. Also not surprising to see up here are: 'Napoleon Dynamite' (I hated it), 'Fahrenheit 9/11' (I loved it), and 'The Passion of the Christ' (didn't see it, but odds are, I'd hate it).

After finding that post, I wrote a little bit about why these movies are so contentious.

The thing that all those kinds of movies have in common is that if you're outside of the intended audience for a particular movie, you probably won't get it. That means that if you hear about a movie that's highly recommended within a certain group and you're not in that group, you're likely to hate it. In some ways, these are movies intended for a narrow audience, were highly regarded within that audience, tried to cross over into wider appeal, and really didn't make it.

How many coins? (Oct 27 2008)

Earlier this evening, I needed to take some coins that had been piling up to the Coinstar machine. Before I left, I uploaded a photo of the coin bags to Flickr and queried the masses: how much money in the bags?

How did the crowd do? Certainly not as well as the villagers at the 1906 livestock fair visited by Francis Galton.

In 1906 Galton visited a livestock fair and stumbled upon an intriguing contest. An ox was on display, and the villagers were invited to guess the animal's weight after it was slaughtered and dressed. Nearly 800 gave it a go and, not surprisingly, not one hit the exact mark: 1,198 pounds. Astonishingly, however, the median of those 800 guesses came close -- very close indeed. It was 1,208 pounds.

Nate Silver I am not, but after some rudimentary statistical analysis on the coin guesses, it was clear that the mean ($193.88) and median ($171.73) were both way off from the actual value ($426.55). That scatterplot is brutal...there are only a handful of guesses in the right area. But the best guess by a single individual was just 76 cents off.

To be fair, the crowd was likely misinformed. It's difficult to tell from that photo how fat those bags were -- they were bulging -- and how many quarters there were.

What are the Japanese up to right now? (Oct 20 2008)

As part of the Japanese census, people were asked to keep a record of what they were doing in 15 minute intervals. The data was publicly released and Jonathan Soma took it and graphed the results so that you can see what many Japanese are up to during the course of a normal day.

Sports: Women like swimming, but men eschew the water for productive sports, which is the most important Japanese invention.

Early to bed and early to rise... and early to bed: People start waking up at 5 AM, but are taking naps by 7:30 AM.

Fascinating.

Statistics in a Nutshell book (Aug 26 2008)

New book from O'Reilly: Statistics in a Nutshell.

Need to learn statistics as part of your job, or want some help passing a statistics course? Statistics in a Nutshell is a clear and concise introduction and reference that's perfect for anyone with no previous background in the subject. This book gives you a solid understanding of statistics without being too simple, yet without the numbing complexity of most college texts.

Extensive Olympic stats (Jul 15 2008)

The folks behind the excellent Baseball Reference have launched a statistics site for the Olympics. Every athlete that's ever competed in the Games has his/her own page. Announcement here.

Ben Fry has updated his salary vs. (Apr 29 2008)

Ben Fry has updated his salary vs. performance chart for the 2008 MLB season that compares team payrolls with winning percentage. The entire payroll of the Florida Marlins appears to be less than what Jason Giambi and A-Rod *each* made last year.

Star Trek statistics: just how likely are (Apr 10 2008)

Star Trek statistics: just how likely are you to die if you beam down to the planet's surface wearing a red shirt?

You don't know about the Red Shirt Phenomenon? Well, as any die-hard Trekkie knows, if you are wearing a red shirt and beam to the planet with Captain Kirk, you're gonna die. That's the common thinking, but I decided to put this to the test. After all, I hadn't seen any definitive proof; it's just what people said.

According to a simple statistical analysis using (Apr 01 2008)

According to a simple statistical analysis using computer simulations, a hitting streak as long as Joe DiMaggio's 1941 56-game streak is not the freakish occurrence that most people think it is.

More than half the time, or in 5,295 baseball universes, the record for the longest hitting streak exceeded 53 games. Two-thirds of the time, the best streak was between 50 and 64 games.

In other words, streaks of 56 games or longer are not at all an unusual occurrence. Forty-two percent of the simulated baseball histories have a streak of DiMaggio's length or longer. You shouldn't be too surprised that someone, at some time in the history of the game, accomplished what DiMaggio did.
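Here's a heavily simplified sketch of that kind of simulation -- one player, a fixed chance of getting a hit in any given game, made-up parameters -- whereas the actual study re-ran every player-season in baseball history:

    import random

    def longest_streak(games: int, p_hit_per_game: float) -> int:
        """Longest run of consecutive games with at least one hit."""
        best = current = 0
        for _ in range(games):
            if random.random() < p_hit_per_game:
                current += 1
                best = max(best, current)
            else:
                current = 0
        return best

    # Toy career: 2,000 games with an ~80% chance of a hit in any given game
    # (roughly a .350 hitter getting four at-bats). For a single player the
    # 56-game streak is rare; across every player in every simulated season of
    # baseball history, somebody hitting one becomes much more likely.
    random.seed(1)
    trials = 1_000
    hits_56 = sum(longest_streak(2_000, 0.80) >= 56 for _ in range(trials))
    print(f"{hits_56 / trials:.1%} of simulated careers contain a 56-game streak")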

I think there are probably some cumulative effects that are being ignored here though, like increasing media pressure/distraction, opponents trying particularly hard for an out as the streak continues, pitchers more likely to pitch around them, or even the streaking player getting super-confident. The first game in a streak and the 50th game in a streak are, as they say, completely different ball games.

Gelf Magazine enlisted the help of ZEUS, (Feb 15 2008)

Gelf Magazine enlisted the help of ZEUS, a football game analyzing computer, to see which NFL coaches called the worst plays at critical times during the 2007 season.

On average, suboptimal play-calling decisions cost each team .85 wins over the course of the season.

In particular, the world champion Giants should have won another game had they called the right plays at the right times. ZEUS also analyzed play calling in "hyper-critical" situations (those fourth-down decisions with five or fewer yards needed for the first down) and found that on average, teams made the wrong calls more than 50% of the time. Here's an interview on the results with the guys behind ZEUS.

Stats (wins, losses, probability of making the ...) (Sep 08 2007)

Stats (wins, losses, probability of making the playoffs, etc.) from the rest of the MLB baseball season, played a million times. "The post-season odds report was compiled by running a Monte Carlo simulation of the rest of the season one million times." (thx, david)

Crime in the three biggest American cities (Jun 14 2007)

Crime in the three biggest American cities (NY, Chicago, LA) is down...and up almost everywhere else. In part, this is due to the aging of the population in those cities. "Together they lost more than 200,000 15-to 24-year-olds between 2000 and 2005. That bodes ill for their creativity and future competitiveness, but it is good news for the police. Young people are not just more likely to commit crimes. Thanks to their habit of walking around at night and their taste for portable electronic gizmos, they are also more likely to become its targets." Young people, your gizmos are hurting America!

Alex Reisner's cabinet of statistical wonders (May 21 2007)

While bumping around on the internet last night, I stumbled upon Alex Reisner's site. Worth checking out are his US roadtrip photos and NYC adventures, which include an account and photographs of a man jumping from the Williamsburg Bridge.

But the real gold here is Reisner's research on baseball...a must-see for baseball and infographics nerds alike. Regarding the home run discussion on the post about Ken Griffey Jr. a few weeks ago, Reisner offers this graph of career home runs by age for a number of big-time sluggers. You can see the trajectory that Griffey was on before he turned 32/33 and how A-Rod, if he stays healthy, is poised to break any record set by Bonds. His article on Baseball Geography and Transportation details how low-cost cross-country travel made it possible for the Brooklyn Dodgers and New York Giants to move to California. The same article also riffs on how stadiums have changed from those that fit into urban environments (like Fenway Park) to more symmetric ballfields built in suburbs and other open areas accessible by car.

Fenway Shea

And then there's the pennant race graphs for each year since 1900...you can compare the dominance of the 1927 Yankees with the 1998 Yankees. And if you've gotten through all that, prepare to spend several hours sifting through all sorts of MLB statistics, represented in a way you may not have seen before:

The goal here is not to duplicate excellent resources like Total Baseball or The Baseball Encyclopedia, but to take the same data and present it in a way that shows different relationships, yields new insights, and raises new questions. The focus is on putting single season stats in a historical context and identifying the truly outstanding player seasons, not just those with big raw numbers.

Reisner's primary method of comparing players over different eras is the z-score, a measure of how a player compares to their contemporaries (e.g. the fantastic seasons of Babe Ruth in 1920 and Barry Bonds in 2001):

In short, z-score is a measure of a player's dominance in a given league and season. It allows us to compare players in different eras by quantifying how good they were compared to their competition. It is a useful measure but a relative one, and does not allow us to draw any absolute conclusions like "Babe Ruth was a better home run hitter than Barry Bonds." All we can say is that Ruth was more dominant in his time.
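Concretely, the z-score is just (player's total - league average) / league standard deviation for that season; a minimal sketch with made-up home-run totals:

    from statistics import mean, stdev

    def z_score(player_value: float, league_values: list[float]) -> float:
        """How many standard deviations a season sits above the league for that year."""
        return (player_value - mean(league_values)) / stdev(league_values)

    # Hypothetical single-season home-run totals for one league, for illustration only:
    # one slugger towering over everyone else.
    league_hr = [8, 12, 15, 19, 11, 7, 24, 14, 10, 54]
    print(round(z_score(54, league_hr), 2))  # a dominant season relative to its era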

I'm more of a basketball fan than of baseball, so I immediately thought of applying the same technique to NBA players, to shed some light on the perennial Jordan vs. Chamberlain vs. Oscar Robertson vs. whoever arguments. Until recently, the NBA didn't collect statistics as tenaciously as MLB has, so the z-score technique is not as useful, but some work has been done in that area.

Anyway, great stuff all the way around.

Update: Reisner's site seems to have gone offline since I wrote this. I hope the two aren't related and that it appears again soon.

Update: It's back up!

Ben Fry has updated his salary vs. (May 17 2007)

Ben Fry has updated his salary vs. performance graph for the 2007 MLB season...it plots team payrolls vs. winning percentage. The Mets and Red Sox should be winning and are...the Yankees, not so much. Cleveland and the Brewers are making good use of their relatively low payrolls.

Twitter vs. Blogger redux (May 11 2007)

Regarding the Twitter vs. Blogger thing from earlier in the week, I took another stab at the faulty Twitter data. Using some educated guesses and fitting some curves, I'm 80-90% sure that this is what the Twitter message growth looks like:

Blogger vs. Twitter cumulative messages

Twitter cumulative messages

These graphs cover the following time periods: 8/23/1999 - 3/7/2002 for Blogger and 3/21/2006 - 5/7/2007 for Twitter. It's important to note that the Twitter trend is not comprised of actual data points but is rather a best-guess line, an estimate based on the data. Take it as fact at your own risk. (More specifically, I'm more sure of the general shape of the curve than of its steepness. My gut tells me that the curve is probably a little flatter than depicted rather than steeper.)
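For what it's worth, here's roughly what "fitting some curves" can look like in practice. This is my own sketch with an assumed exponential model and invented data points, not the model or data behind the graphs above:

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical (day, cumulative message count) observations, for illustration only.
    days = np.array([0, 60, 120, 180, 240, 300, 360, 410])
    messages = np.array([1e3, 8e3, 3e4, 9e4, 3.5e5, 1.2e6, 3.5e6, 6e6])

    def growth(t, a, b):
        """Simple exponential growth model: a * exp(b * t)."""
        return a * np.exp(b * t)

    (a, b), _ = curve_fit(growth, days, messages, p0=(1e3, 0.02))
    print(f"fitted model: {a:.0f} * exp({b:.4f} * day)")
    print(f"model estimate at day 410: {growth(410, a, b):,.0f} messages")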

That said, most of what I wrote in the original post still holds, as do the comments in the subsequent thread. Twitter did not grow as fast as the faulty data indicated, but it did get to ~6,000,000 messages in about half the time of Blogger. Here are the reasons I offered for the difference in growth:

1. Twitter is easier to use than Blogger was and had a lower barrier to entry.
2. Twitter has more ways to update (web, phone, IM, Twitterrific) than did Blogger.
3. Blogger's growth was limited by a lack of funding.
4. Twitter had a larger pool of potential users to draw on.
5. Twitter has a built-in social aspect that Blogger did not.

And commenters in the thread noted that:

6. Twitter's 140-character limit encourages more messages.
7. More people are using Twitter for conversations than was the case with Blogger.

What's interesting is that these seeming advantages (in terms of message growth potential) for Twitter didn't result in higher message growth than Blogger over the first 9-10 months. But then the social and network effects (#5 and #7 above) kicked in and Twitter took off.

Growth of Twitter vs. Blogger (May 08 2007)

Important update: I've re-evaluated the Twitter data and came up with what I think is a much more accurate representation of what's going on.

Further update: The Twitter data is bad, bad, bad, rendering Andy's post and most of this here post useless. Both jumps in Twitter activity in Nov 2006 and March 2007 are artificial in nature. See here for an update.

Update: A commenter noted that sometime in mid-March, Twitter stopped using sequential IDs. So that big upswing that the below graphs currently show is partially artificial. I'm attempting to correct now. This is the danger of doing this type of analysis with "data" instead of data.
--

In mid-March, Andy Baio noted that Twitter uses publicly available sequential message IDs and employed Twitter co-founder Evan Williams' messages to graph the growth of the service over the first year of its existence. Williams co-founded Blogger back in 1999, a service that, as it happens, also exposed its sequential post IDs to the public. Itching to compare the growth of the two services from their inception, I emailed Matt Webb about a script he'd written a few years ago that tracked the daily growth of Blogger. His stats didn't go back far enough so I borrowed Andy's idea and used Williams' own blog to get his Blogger post IDs and corresponding dates. Here are the resulting graphs of that data.[1]

The first one covers the first 253 days of each service. The second graph shows the Twitter data through May 7, 2007 and the Blogger data through March 7, 2002. [Some notes about the data are contained in this footnote.]

Blogger vs. Twitter cumulative messages (first 253 days)

Blogger vs. Twitter cumulative messages

As you can see, the two services grew at a similar pace until around 240 days in, with Blogger posts increasing faster than Twitter messages. Then around November 21, 2006, Twitter took off and never looked back. At last count, Twitter has amassed five times as many messages as Blogger did in just under half the time. But Blogger was not the slouch that the graph makes it out to be. Plotting the service by itself reveals a healthy growth curve:

Blogger cumulative posts

From late 2001 to early 2002, Blogger doubled the number of messages in its database from 5M to 10M in under 200 days. Of course, it took Twitter just over 40 days to do the same and under 20 days to double again to 20M. The curious thing about Blogger's message growth is that large events like 9/11, SXSW 2000 & 2001, new versions of Blogger, and the launch of blog*spot didn't affect the growth at all. I expected to see a huge message spike on 9/11/01 but there was barely a blip.

The second graph also shows that Twitter's post-SXSW 2007 growth is real and not just a temporary bump...a bunch of people came to check it out, stayed on, and everyone messaged like crazy. However, it does look like growth is slowing just a bit if you look at the data on a logarithmic scale:

Blogger vs. Twitter cumulative messages, log scale

Actually, as the graph shows, the biggest rate of growth for Twitter didn't occur following SXSW 2007 but after November 21.

As for why Twitter took off so much faster than Blogger, I came up with five possible reasons (there are likely more):

1. Twitter is easier to use than Blogger was. All you need is a web browser or mobile phone. Before blog*spot came along in August 2000, you needed web space with FTP access to set up a Blogger blog, not something that everyone had.

2. Twitter has more ways to create a new message than Blogger did at that point. With Blogger, you needed to use the form on the web site to create a post. To post to Twitter, you can use the web, your phone, an IM client, Twitterrific, etc. It's also far easier to send data to Twitter programmatically...the NY Times account alone sends a couple dozen new messages into the Twitter database every day without anyone having to sit there and type them in.

3. Blogger was more strapped for cash and resources than Twitter is. The company that built Blogger ran out of money in early 2001 and nearly out of employees shortly after that. Hard to say how Blogger might have grown if the dot com crash and other factors hadn't led to the severe limitation of its resources for several key months.

4. Twitter has a much larger pool of available users than Blogger did. Blogger launched in August 1999 and Twitter almost 7 years later in March 2006. In the intervening time, hundreds of millions of people, the media, and technology & media companies have become familiar and comfortable with services like YouTube, Friendster, MySpace, Typepad, Blogger, Facebook, and GMail. Hundreds of millions more now have internet access and mobile phones. The potential user base for the two probably differed by an order of magnitude or two, if not more.

5. But the biggest factor is that the social aspect of Twitter is built in and that's where the super-fast growth comes from. With Blogger, reading, writing, and creating social ties were decoupled from each other but they're all integrated into Twitter. Essentially, the top graph shows the difference between a site with social networking and one largely without. Those steep parts of the Twitter trend on Nov 21 and mid-March? That's crazy insane viral growth[2], very contagious, users attracting more users, messages resulting in more messages, multiplying rapidly. With the way Blogger worked, it just didn't have the capability for that kind of growth.

A few miscellaneous thoughts:

It's important to keep in mind that these graphs depict the growth in messages, not users or web traffic. It would be great to have user growth data, but that's not publicly available in either case (I don't think). It's tempting to look at the growth and think of it in terms of new users because the two are obviously related. More users = more messages. But that's not a static relationship...perhaps Twitter's userbase is not increasing all that much and the message growth is due to the existing users increasing their messaging output. So, grain of salt and all that.

What impact does Twitter's API have on its message growth? As I said above, the NY Times is pumping dozens of messages into Twitter daily and hundreds of other sites do the same. This is where it would be nice to have data for the number of active users and/or readers. The usual caveats apply, but if you look at the Alexa trends for Twitter, pageviews and traffic seem to be leveling out. Compete, which only offers data as recently as March 2007, still shows traffic growing quickly for Twitter.

Just for comparison, here's a graph showing the adoption of various technologies ranging from the automobile to the internet. Here's another graph showing the adoption of four internet-based applications: Skype, Hotmail, ICQ, and Kazaa (source: a Tim Draper presentation from April 2006).

[Thanks to Andy, Matt, Anil, Meg, and Jonah for their data and thoughts.]

[1] Some notes and caveats about the data. The Blogger post IDs were taken from archived versions of Evhead and Anil Dash's site stored at the Internet Archive and from a short-lived early collaborative blog called Mezzazine. For posts prior to the introduction of the permalink in March 2000, most pages output by Blogger didn't publish the post IDs. Luckily, both Ev and Anil republished their old archives with permalinks at a later time, which allowed me to record the IDs.

The earliest Blogger post ID I could find was 9871 on November 23, 1999. Posts from before that date had higher post IDs because they were re-imported into the database at a later time so an accurate trend from before 11/23/99 is impossible. According to an archived version of the Blogger site, Blogger was released to the public on August 23, 1999, so for the purposes of the graph, I assumed that post #1 happened on that day. (As you can see, Anil was one of the first 2-3 users of Blogger who didn't work at Pyra. That's some old school flavor right there.)

Regarding the re-importing of the early posts, that happened right around mid-December 1999...the post ID numbers jumped from ~13,000 to ~25,000 in one day. In addition to the early posts, I imagine some other posts were imported from various Pyra weblogs that weren't published with Blogger at the time. I adjusted the numbers subsequent to this discontinuity and the resulting numbers are not precise but are within 100-200 of the actual values, an error of less than 1% at that point and becoming significantly smaller as the number of posts grows large. The last usable Blogger post ID is from March 7, 2002. After that, the database numbering scheme changed and I was unable to correct for it. A few months later, Blogger switched to a post numbering system that wasn't strictly sequential.

The data for Twitter from March 21, 2006 to March 15, 2007 is from Andy Baio. Twitter data subsequent to 3/15/07 was collected by me.
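Since the post ID is itself a running count of messages, building one of these growth curves is just plotting ID against date. A rough sketch, using a handful of placeholder sample points rather than the actual collected data:

    from datetime import date
    import matplotlib.pyplot as plt

    # (observation date, post/message ID seen on that date) -- placeholder points,
    # not the real Blogger or Twitter numbers.
    observations = [
        (date(1999, 8, 23), 1),
        (date(1999, 11, 23), 9_871),
        (date(2000, 8, 1), 400_000),
        (date(2001, 6, 1), 3_000_000),
        (date(2002, 3, 7), 10_000_000),
    ]

    dates, ids = zip(*sorted(observations))
    days_since_launch = [(d - dates[0]).days for d in dates]

    plt.plot(days_since_launch, ids)
    plt.xlabel("days since launch")
    plt.ylabel("cumulative messages (post ID)")
    plt.title("Cumulative messages, reconstructed from sequential post IDs")
    plt.show()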

[2] "Crazy insane viral growth" is a very technical epidemiological term. I don't expect you to understand its precise meaning.

Bread is dangerous. Here are some frightening (Apr 30 2007)

Bread is dangerous. Here are some frightening stats: "More than 90 percent of violent crimes are committed within 24 hours of eating bread" and "Bread is made from a substance called 'dough.' It has been proven that as little as one pound of dough can be used to suffocate a mouse. The average American eats more bread than that in one month!"

Shoulda, woulda, coulda (Apr 25 2007)

Last night, Ken Griffey Jr. hit the 564th home run of his career to move into 10th place on the all-time list. Reading about his accomplishment, I was surprised he was so far up on the list, given the number of injuries he's had since coming into the league in 1989. That got me wondering about what might have been had Griffey stayed healthy throughout his career...if he would have lived up to the promise of his youth when he was predicted to become one of the game's all-time greats.

Looking at his stats, I assumed a full season to be 155 games and extrapolated what his home run total would have been for each season after his rookie year in which he played under 155 games. Given that methodology, Griffey would have hit about 687 home runs up to this point. In two of those seasons, 1995 and 2002, his adjusted home run numbers were far below the usual because of injuries limiting his at-bats and effectiveness at the plate. Further adjusting those numbers brings the total up to 717 home runs, good for 3rd place on the all-time list and a race to the top with Barry Bonds.
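The extrapolation itself is easy to sketch: for any season with fewer than 155 games played, scale that season's home runs by 155 / games. The season lines below are placeholders, not Griffey's actual stats:

    def adjusted_home_runs(seasons: list[tuple[int, int]], full_season: int = 155) -> float:
        """seasons is a list of (games_played, home_runs); short seasons get scaled up."""
        total = 0.0
        for games, hr in seasons:
            total += hr * full_season / games if games < full_season else hr
        return total

    # Placeholder career for illustration: two healthy seasons and one injury-shortened one.
    career = [(158, 40), (72, 17), (155, 35)]
    print(round(adjusted_home_runs(career)))  # the 72-game season counts as ~37 HR instead of 17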

Of course, if you're going to play what-if, Babe Ruth had a couple of seasons in which he missed a lot of games and also played in the era of the 154-game season. Willie Mays played a big chunk of his career in the 154-game season era as well. Ted Williams, while known more for hitting for average, missed a lot of games for WWII & the Korean War (almost 5 full seasons) and played in the 154-game season era...and still hit 521 home runs.

A paper on the tradeoff in baseball (Feb 02 2007)

A paper on the tradeoff in baseball between home runs and hitting for average that I don't fully understand but seems interesting. "Both models find a significant and negative relationship between home runs per at-bat and contact rate." (thx, aaron)

Do The Right Thing (Jan 26 2007)

I don't typically write about many new Web 2.0 products, but Do The Right Thing is doing something interesting. The site works on a modified Digg model. If you see a story you like, you click a button to declare your interest in it. But then you also rate the social impact of the subject of the story, either positive or negative. Over time and given enough users, you can look at all the stories about a company like Starbucks and see how they're doing. This is something that people do when reading the news anyway -- e.g. "I feel worse about Exxon Mobil because they outsourced 20,000 jobs to India" -- and having them explicitly rate stories like this is a quick way of taking the temperature of the social climate around issues & companies and recording the results for all to see.

It would be interesting to see if people would be willing to specify some demographic information (provided that it's not sold to a third party) like sex, age, race, religion, political party affiliation, and income bracket...that would allow the social impact data to be sliced and diced in interesting ways. Even without that data, the opportunities for data analysis are intriguing...like graphs of a company's social impact over time.

Running the Numbers, a great new series (Jan 24 2007)

Running the Numbers, a great new series of photography from Chris Jordan, is kind of a combination of Chuck Close and Edward Burtynsky, with a bit of Stamen thrown in for good measure. (via conscientious)

Neat little infographics video. (Jan 17 2007)

Neat little infographics video.

Ethics books get stolen more often than (Jan 16 2007)

Ethics books get stolen more often than non-ethics books. "Missing books as a percentage of those off shelf were 8.7% for ethics, 6.9% for non-ethics, for an odds ratio of 1.25 to 1." (via mr)

Nicholas Felton's personal annual report for 2006 (Jan 10 2007)

Nicholas Felton's personal annual report for 2006. "Disclaimer: Alcoholic beverages were consumed during the collection of this data and the author acknowledges that the occasional drink may have gone unrecorded." Here's the one for 2005. LOVE this.

Proposal from Language Log: scientific and technical (Jan 04 2007)

Proposal from Language Log: scientific and technical papers should come with an executable "recipe" for generating numbers, graphs, and tables from published data.

Tag frequency and popularity acceleration (Nov 03 2006)

As many of you don't know, I've been working less-than-diligently[1] on a project with the eventual goal of adding tags to kottke.org. I posted some early results back in August of 2005. The other day, I started thinking about how tags could help people get a sense of what's been talked about recently on the site, like Flickr's listing of hot tags. I started by compiling a list of tags from the last 200 entries and ordering them by how many times they were used over that period. Here are the top 20 (with # of instances in parentheses):

photography (33), books (26), art (26), science (22), tv (21), movies (21), lists (20), video (17), nyc (16), weblogs (15), design (14), interviews (13), bestof (13), business (12), thewire (12), food (11), sports (11), games (10), language (10), music (9)

The items in bold also appear in the top 50 of the all-time popular tags, so obviously this list isn't telling us anything new about what's going on around here. To weed those always-popular tags from the list, I compared the recent frequency of each tag with its all-time frequency and came up with a list of tags that are freakishly popular right now compared to how popular they usually are. Call this list a measure of the popularity acceleration of each tag. The top 20:

blindside (3), pablopicasso (3), ghostmap (3), davidsimon (5), poptech2006 (4), thewire (12), andywarhol (3), michaellewis (4), education (4), youtube (4), richarddawkins (5), realestate (3), crime (8), working (8), school (3), dvd (4), georgewbush (4), stevenjohnson (5), writing (4), photoshop (3)

(Note: I also removed tags with less than three instances from this list and the ones below.) Now we're getting somewhere. None of these appear in the top 50 all-time list. But it's still not that accurate a list of what's been going on here recently. I've posted 3 times about Photoshop, but you can't discount entirely the 33 posts about photography. What's needed is a mix of the two lists: generally popular tags that are also popular right now (first list) + generally unpopular tags that are popular right now (second list). So I blended the two lists together in different proportions:

75% recent / 25% all-time:
davidsimon (5), poptech2006 (4), ghostmap (3), pablopicasso (3), blindside (3), thewire (12), andywarhol (3), michaellewis (4), education (4), photography (33), art (26), youtube (4), tv (21), richarddawkins (5), books (26), crime (8), video (17), working (8), realestate (3), science (22)

67% recent / 33% all-time:
davidsimon (5), poptech2006 (4), pablopicasso (3), ghostmap (3), blindside (3), thewire (12), andywarhol (3), photography (33), art (26), michaellewis (4), education (4), tv (21), books (26), youtube (4), video (17), science (22), richarddawkins (5), crime (8), movies (21), lists (20)

50% recent / 50% all-time:
thewire (12), davidsimon (5), photography (33), poptech2006 (4), blindside (3), ghostmap (3), pablopicasso (3), art (26), books (26), tv (21), science (22), movies (21), lists (20), andywarhol (3), video (17), michaellewis (4), education (4), nyc (16), weblogs (15), crime (8)

25% recent / 75% all-time:
photography (33), art (26), books (26), tv (21), science (22), movies (21), lists (20), thewire (12), video (17), nyc (16), weblogs (15), davidsimon (5), poptech2006 (4), design (14), interviews (13), bestof (13), blindside (3), ghostmap (3), pablopicasso (3), business (12)

The 75%-67% recent lists look like a nice mix of the newly & perennially popular and a fairly accurate representation of the last 3 weeks of posts on kottke.org.

Digression for programmers and math enthusiasts only: I'm curious to know how others would have handled this issue. I approached the problem in the most straightforward manner I could think of (using simple algebra) and the results are pretty good, but it seems like an approach that makes use of an equation that approximates the distribution of the popularity of the tags (which roughly follows a power law curve) would work better. Here's what I did for each tag (using the nyc tag as an example):

# of recent entries: 300
# of total entries: 3399
# of recent instances of the nyc tag: 16
# of total instances of the nyc tag: 247
# of instances of the most frequent recent tag: 33
# of instances of the most frequent tag, all-time: 272

Calculate the recent and all-time frequencies of the nyc tag:
16/300 = 0.0533
247/3399 = 0.0726

Then divide the recent frequency by the all-time frequency to get the popularity acceleration:
0.0533/0.0726 = 0.7342

That's how much more popular the nyc tag is now than it has been all-time. In other words, the nyc tag is 0.7342 times as popular over the last 300 entries as it has been overall...~1/4 less popular than it usually is. To get the third list with the 75% emphasis on popularity acceleration and 25% on all-time popularity, I started by normalizing the popularity acceleration and all-time frequency by dividing the nyc tag values by the top value of the group in each case (11.33 is the popularity acceleration of the blindside tag and 0.11 is the recent frequency of the photography tag (33/300)):

0.7342/11.33 = 0.0647
0.0533/0.11 = 0.4845

So, the nyc tag has a popularity acceleration of 0.0647 times that of the blindside tag and has a recent frequency that is 0.4845 times that of the most popular recent tag. Then:

0.0647*0.75 + 0.4845*0.25 = 0.1696

Calculate this number for each recent tag, rank them from highest to lowest, and you get the third list above. Now, it seems to me that I may have fudged something in the last two steps, but I'm not exactly sure. And if I did, I don't know what got fudged. Any help or insight would be appreciated.
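For anyone who wants to poke at the fudge, here's the calculation above expressed as a short sketch, using the nyc numbers from the example:

    def blended_score(recent_count, total_count, n_recent, n_total,
                      max_acceleration, max_recent_freq, w_accel=0.75):
        """Blend of normalized popularity acceleration and normalized recent frequency."""
        recent_freq = recent_count / n_recent         # nyc: 16/300   = 0.0533
        alltime_freq = total_count / n_total          # nyc: 247/3399 = 0.0727
        acceleration = recent_freq / alltime_freq     # nyc: 0.7342
        norm_accel = acceleration / max_acceleration  # nyc: / 11.33  = 0.0647
        norm_recent = recent_freq / max_recent_freq   # nyc: / 0.11   = 0.4845
        return w_accel * norm_accel + (1 - w_accel) * norm_recent

    # nyc, normalized against blindside's acceleration (11.33) and
    # photography's recent frequency (33/300 = 0.11)
    print(round(blended_score(16, 247, 300, 3399, 11.33, 33 / 300), 4))  # ~0.17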

[1] Great artists ship. Mediocre artists ship slowly.

Surowiecki explains why ever-rising housing prices may (Oct 27 2006)

Surowiecki explains why ever-rising housing prices may be deceiving. "If you control for inflation and quality...real home prices barely budged between the eighteen-nineties and the nineteen-nineties. The idea that housing prices have nowhere to go but up is, in other words, a statistical illusion."

Love it or hate it movies (Oct 27 2006)

Netflix, the online DVD rental company, recently released a bunch of their ratings data with the offer of a $1 million prize to anyone who could use that data to make a better movie recommendation system. On the forum for the prize, someone noted that the top 5 most frequently rated movies on Netflix were not particularly popular or critically acclaimed (via fakeisthenewreal):

1. Miss Congeniality
2. Independence Day
3. The Patriot
4. The Day After Tomorrow
5. Pirates of the Caribbean

That led another forum participant to analyze the data and he found some interesting things. The most intriguing result is a list of the movies that Netflix users either really love or really hate:

1. The Royal Tenenbaums
2. Lost in Translation
3. Pearl Harbor
4. Miss Congeniality
5. Napoleon Dynamite
6. Fahrenheit 9/11
7. The Patriot
8. The Day After Tomorrow
9. Sister Act
10. Armageddon
11. Kill Bill: Vol. 1
12. Independence Day
13. Sweet Home Alabama
14. Titanic
15. Gone in 60 Seconds
16. Twister
17. Anchorman: The Legend of Ron Burgundy
18. Con Air
19. The Fast and the Furious
20. Dirty Dancing
21. Troy
22. Eternal Sunshine of the Spotless Mind
23. The Passion of the Christ
24. How to Lose a Guy in 10 Days
25. Pretty Woman

So what makes these movies so contentious? Generalizing slightly (*cough*), the list is populated with three basic kinds of movies:

Misunderstood masterpieces / cult favorites (Royal Tenenbaums, Kill Bill, Eternal Sunshine)
Action movies (Pearl Harbor, Armageddon, Fast and the Furious)
Chick flicks (Sister Act, Sweet Home Alabama, Miss Congeniality)

The thing that all those kinds of movies have in common is that if you're outside of the intended audience for a particular movie, you probably won't get it. That means that if you hear about a movie that's highly recommended within a certain group and you're not in that group, you're likely to hate it. In some ways, these are movies intended for a narrow audience, were highly regarded within that audience, tried to cross over into wider appeal, and really didn't make it.

Titanic is really the only outlier on the list...massively popular among several different groups of people and critically well-regarded as well. But I know quite a few people who absolutely hate this movie -- the usual complaints are a) chick flick, b) James Cameron's heavy-handedness, and c) reaction to the huge success of what is perceived to be a marginally entertaining, middling quality film.

BTW, here are the movies on that list that fit into my "love it" category:

The Royal Tenenbaums
Lost in Translation
Napoleon Dynamite
The Day After Tomorrow
Kill Bill: Vol. 1
Titanic
Eternal Sunshine of the Spotless Mind

Where do Craigslist's Missed Connections occur in (Oct 04 2006)

Where do Craigslist's Missed Connections occur in NYC? Gawker has the breakdown by location and subway line.

A recent study concludes that in terms (Sep 20 2006)

A recent study concludes that in terms of life expectancy, there are eight different Americas, all with differing levels of health. "In 2001, 15-year-old blacks in high-risk city areas were three to four times more likely than Asians to die before age 60, and four to five times more likely before age 45. In fact, young black men living in poor, high-crime urban America have death risks similar to people living in Russia or sub-Saharan Africa." If I'm reading this right, it's interesting that geography or income doesn't have that big of an impact on the life expectancy of Asians; it's their Asian-ness (either cultural, genetic, or both) that's the key factor. Here's the study itself. (via 3qd)

Forecast Advisor tracks how accurate the major (Sep 15 2006)

Forecast Advisor tracks how accurate the major weather forecasting companies are in predicting temperature and precipitation. Results vary based on what part of the country you're in (the weather in Honolulu is easier to forecast than that of Minneapolis), but overall the forecasters have an accuracy rate of around 72%.

Graph of American house values from 1890 to (Aug 31 2006)

Graph of American house values from 1890 to the present. You can't miss the sheer cliff starting in 1997. Houses have also gotten bigger over time. It would be interesting to see the same graph in price/square feet. (via ben hyde)

Fascinating charts of how the US Senate (Aug 23 2006)

Fascinating charts of how the US Senate votes on issues from a liberal-conservative perspective and a social issues perspective. More charts here. You'll notice that the lines on the graphs are mostly straight up and down which means "it's all economic; all the noise about social issues never actually flows thru into the legislative agenda." That is, the Senate decides issues, even social issues, based mostly on economics.

Rethinking Moneyball. Jeff Passan looks at how (Aug 18 2006)

Rethinking Moneyball. Jeff Passan looks at how the Oakland A's 2002 draft class, immortalized in Michael Lewis' Moneyball, has done since then. "It is not so much scouts vs. stats anymore as it is finding the right balance between information gleaned by scouts and statistical analyses. That the Moneyball draft has produced three successful big-league players, a pair of busts and two on the fence only adds to its polarizing nature." Richard Van Zandt did a more extensive analysis back in April.

Kevin Burton looks at the Technorati "data"Aug 09 2006

Kevin Burton looks at the Technorati "data" and discovers that since the number of daily postings is growing linearly, the number of active blogs is probably growing lineary too...which means that the exponential growth of the blogosphere touted repeatedly by Technorati and parroted by mainstream media outlets is actually the growth of dead blogs.

Using the sequential serial numbers of captured (Jul 21 2006)

Using the sequential serial numbers of captured German tanks, Allied statisticians accurately determined the number of tanks the Nazis were producing each month.
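This is the classic German tank problem: if you've seen k serial numbers and the largest is m, the standard estimate of the total produced is m + m/k - 1. A minimal sketch:

    def estimate_total(serial_numbers: list[int]) -> float:
        """German tank problem estimate: m + m/k - 1 (m = largest serial seen, k = sample size)."""
        k = len(serial_numbers)
        m = max(serial_numbers)
        return m + m / k - 1

    # Hypothetical captured serial numbers, for illustration only
    print(estimate_total([19, 40, 42, 60]))  # 60 + 60/4 - 1 = 74.0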

An enormous amount of statistics about the (Jul 07 2006)

An enormous amount of statistics about the book industry. "58% of the US adult population never reads another book after high school."

People are trying to figure out why (Jul 05 2006)

People are trying to figure out why the Alexa statistics for a bunch of sites (including kottke.org) jumped sharply in mid-April. I don't buy the Digg explanation (for one thing, the timeline is off by a month)...it's gotta be some partnership or something that kicked in. Or how about Alexa's "facelift" on April 11?

A quick study shows that stocks of (May 30 2006)

A quick study shows that stocks of simply named companies do better than those of more complexly named companies. Even companies with pronounceable ticker symbols did better than those with unpronounceable symbols.

The Junk Charts blog searches for examples (Apr 19 2006)

The Junk Charts blog searches for examples of crappy graphs and charts in the media. (via do)

Demographic charts for New York City using (Apr 17 2006)

Demographic charts for New York City using data from 1790 to the present.

Catching cheaters with Benford's Law (Feb 21 2006)

Benford's Law describes a curious phenomenon about the counterintuitive distribution of numbers in sets of non-random data:

A phenomenological law also called the first digit law, first digit phenomenon, or leading digit phenomenon. Benford's law states that in listings, tables of statistics, etc., the digit 1 tends to occur with probability ~30%, much greater than the expected 11.1% (i.e., one digit out of 9). Benford's law can be observed, for instance, by examining tables of logarithms and noting that the first pages are much more worn and smudged than later pages (Newcomb 1881). While Benford's law unquestionably applies to many situations in the real world, a satisfactory explanation has been given only recently through the work of Hill (1996).

I first heard of Benford's Law in connection with the IRS using it to detect tax fraud. If you're cheating on your taxes, you might fill in amounts of money somewhat at random, the distribution of which would not match that of actual financial data. So if the digit "1" shows up on Al Capone's tax return about 15% of the time (as opposed to the expected 30%), the IRS can reasonably assume they should take a closer look at Mr. Capone's return.

Since I installed Movable Type 3.15 back in March 2005, I have been using its "post to the future" option pretty regularly to post my remaindered links...and have been using it almost exclusively for the last few months[1]. That means I'm saving the entries in draft, manually changing the dates and times, and then setting the entries to post at some point in the future. For example, an entry with a timestamp like "2006-02-20 22:19:09" when I wrote the draft might get changed to something like "2006-02-21 08:41:09" for future posting at around 8:41 am the next morning. The point is, I'm choosing basically random numbers for the timestamps of my remaindered links, particularly for the hours and minutes digits. I'm "cheating"...committing post timestamp fraud.

That got me thinking...can I use the distribution of numbers in these post timestamps to detect my cheating? Hoping that I could (or this would be a lot of work wasted), I whipped up a MT template that produced two long strings of numbers: 1) one of all the hours and minutes digits from the post timestamps from May 2005 to the present (i.e. the cheating period), and 2) one of all the hours and minutes digits from Dec 2002 - Jan 2005 (i.e. the control group). Then I used a PHP script to count the numbers in each string, dumped the results into Excel, and graphed the two distributions together. And here's what they look like, followed by a table of the values used to produce the chart:

Catching cheaters

Digit   5/05-now   12/02-1/05   Difference
1       31.76%     33.46%       1.70%
2       11.76%     14.65%       2.89%
3       10.30%      9.96%       0.34%
4       10.44%      9.58%       0.86%
5       10.02%     10.52%       0.51%
6        4.83%      5.40%       0.57%
7        5.66%      4.96%       0.70%
8        7.62%      4.65%       2.97%
9        7.60%      6.81%       0.79%

As expected, 1 & 2 show up less than they should during the cheating period, but not overly so[2]. The real fingerprint of the crime lies with the 8s. The number 8 shows up during the cheating period ~64% more than expected. After thinking about it for awhile, I came up with an explanation for the abundance of 8s. I often schedule posts between 8am-9am so that there's stuff on the site for the early-morning browse and I usually finish off the day with something between 6pm-7pm (18:00 - 19:00). Not exactly the glaring evidence I was expecting, but you can still tell.

The obvious next question is, can this technique be used for anything useful? How about detecting comment, trackback, or ping spam? I imagine IPs and timestamps from these types of spam are forged to at least some extent. The difficulties are getting enough data to be statistically significant (one forged timestamp isn't enough to tell anything) and having "clean" data to compare it against. In my case, I knew when and where to look for the cheating...it's unclear whether someone who didn't know about the timestamp tampering would have been able to detect it. I bet companies with services that deal with huge amounts of spam (Gmail, Yahoo Mail, Hotmail, TypePad, Technorati) could use this technique to help filter out unwanted emails, comments, trackbacks, or pings...although there are probably better methods for doing so.
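If you want to try this on your own timestamps, here's a rough Python sketch of the counting step described above (the original was a PHP script; this version assumes MySQL-style timestamp strings like the examples earlier in the post and ignores zeros, as the table does):

```python
from collections import Counter

def digit_distribution(timestamps):
    """Share of each digit 1-9 in the HH:MM portion of timestamps
    like '2006-02-21 08:41:09' (zeros ignored, as in the table above)."""
    counts = Counter()
    for ts in timestamps:
        hhmm = ts[11:13] + ts[14:16]  # 'HH' + 'MM'
        counts.update(d for d in hhmm if d != "0")
    total = sum(counts.values())
    return {d: counts[d] / total for d in "123456789"}

# Usage sketch with made-up timestamps; the real input would be the two
# periods compared above (May 2005-present vs. Dec 2002-Jan 2005).
cheating = digit_distribution(["2006-02-21 08:41:09", "2006-02-21 18:15:02"])
control = digit_distribution(["2003-06-14 22:19:09", "2004-01-03 13:47:55"])
for d in "123456789":
    print(d, f"{cheating[d]:.1%}", f"{control[d]:.1%}")
```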

[1] I've been doing this to achieve a more regular publishing schedule for kottke.org. I typically do a lot of work in the evening and at night, and instead of posting all the links in a bunch from 10pm to 1am, I space them out over the course of the next day. Not a big deal because increasingly few of the links I feature are time-sensitive, and it's better for readers who check back several times a day for updates...they've always got a little something new to read.

[2] You'll also notice that the distributions don't quite follow Benford's Law either. Because of the constraints on which digits can appear in timestamps (e.g. you can never have a timestamp of 71:95), some digits appear proportionally more or less than they would in statistical data. Here's the distribution of digits of every possible time from 00:00 to 23:59:

1 - 25.33%
2 - 17.49%
3 - 12.27%
4 - 10.97%
5 - 10.97%
6 - 5.74%
7 - 5.74%
8 - 5.74%
9 - 5.74%
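Those figures can be reproduced with a quick Python sketch that tallies every non-zero digit in every time from 00:00 to 23:59:

```python
from collections import Counter

# Tally every non-zero digit in every time from 00:00 to 23:59.
counts = Counter()
for h in range(24):
    for m in range(60):
        counts.update(d for d in f"{h:02d}{m:02d}" if d != "0")

total = sum(counts.values())
for d in "123456789":
    print(d, "-", f"{100 * counts[d] / total:.2f}%")
# 1 - 25.33%, 2 - 17.49%, ... 6 through 9 - 5.74%, matching the list above.
```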

Fun analysis of a moviegoer's six yearsFeb 21 2006

Fun analysis of a moviegoer's six years of ticket stubs. You can see the ticket prices rise over the years, but what's really interesting is the correspondence between the ticket price and his opinion of the movie...he ended up paying more for the movies he really liked.

Interesting graph comparing the size of newJan 24 2006

Interesting graph comparing the size of new homes and the obesity rate in America (which seem to track quite closely since 1995), prompting the question, are Americans growing to fit their environment? Relatedly, Bernard-Henri Levy on American obesity: "The obesity of the body is a metaphor of another obesity. There is a tendency in America to believe that the bigger the better for everything -- for churches, cities, malls, companies and campaign budgets. There's an idolatry of bigness."

The Baseball Visualization Tool was designed toJan 19 2006

The Baseball Visualization Tool was designed to help managers answer the question: should the pitcher be pulled from the game? Handy charts and pie graphs give managers an at-a-glance view of how much trouble the current pitcher is in. I wonder what TBVT would have told Grady Little about Pedro at the end of Game 7 of the 2003 ALCS?

Digg vs. Slashdot (or, traffic vs. influence)Jan 12 2006

There's been lots of talk on the web lately about Digg being the new Slashdot. Two months ago, a Digg reader noted that according to Alexa, Digg's traffic was catching up to that of Slashdot, even though Slashdot has been around for several years and Digg is just over a year old. The brash newcomer vs. the reigning champ, an intriguing matchup.

Last weekend, a piece on kottke.org (50 Fun Things to Do With Your iPod) was featured on Digg and Slashdot[1] and the experience left behind some data that makes for an interesting comparison with the Alexa data.

On 1/7 at around 11:00pm ET (a Saturday night), the 50 Things/iPod link appeared on Digg's front page. It's unclear exactly what time the link fell off the front page, but from the traffic pattern on my server, it looks like it lasted until around 2am Sunday morning (about 3 hours). As of 10pm ET on 1/11, the story had been "dugg" 1387 times[2], had garnered 65 comments, and had sent ~20,000 people to kottke.org.

On 1/8 at around 5pm ET (a Sunday afternoon), the 50 Things/iPod link appeared on Slashdot's front page and was up there for around 24 hours. As of 10pm ET on 1/11, the story had elicited 254 comments and sent ~84,100 people to kottke.org.

Here's a graph of my server's traffic (technically, it's a graph of the bandwidth out in megabits/second) during the Digg and Slashdot events. I've overlaid the Digg trend on the Slashdot one so you can directly compare them:

Slashdot versus Digg

That's roughly 18 hours of data...and the scales of the two trends are the same. Here's a graph that shows the two events together on the same trend, along with a "baseline" traffic graph of what the bandwidth approximately would have been had neither site linked to kottke.org:

Slashdot versus Digg (with baseline)

That's about 4.5 days of data. Each "bump" on the baseline curve is a day[3].

The two events are separated by just enough time that it's possible to consider them more or less separately and make some interesting observations. Along with some caveats, here's what the data might be telling us:

  • The bandwidth graphs represent everything that was happening on the kottke.org server during the time period in question. That means bandwidth from all other outgoing traffic is on there, mixed in with the bandwidth caused by the Digg and Slashdot visitors. According to my stats, no other significant events happened during the period shown that would cause unusual amounts of bandwidth to be consumed. Including the baseline traffic (from mid-December actually) on the second graph is an attempt to show what the traffic looks like normally, so you can see what effect the two sites had on it.
  • The Digg link happened late Saturday night in the US and the Slashdot link occurred midday on Sunday. Traffic to sites like Slashdot and Digg is typically lower on the weekend than during the week, and lower still late at night. So Digg might be at somewhat of a disadvantage here, and this is perhaps not an apples-to-apples comparison.
  • I'm pretty sure that the person who submitted this link to Slashdot got it from Digg or at least from a site that got it from Digg. Bottom line: if the iPod thing, which is several months old, hadn't been Dugg, it would not have appeared on Slashdot the next day.
  • If you look at the first 16-18 hours of the link being on both sites (first graph), you'll see that the traffic from Slashdot was initially larger and stayed large longer than the traffic from Digg. Stories appear to stay on the front page of Slashdot for about a day, but the churn is much faster on Digg...the link only lasted three hours there, and that was late on a Saturday night.
  • Since Saturday, Slashdot has sent roughly 4 times as much traffic to kottke.org as Digg has.
  • If you look at the second graph, Slashdot appears to have a significant "aftershock" effect on the traffic to kottke.org. The traffic went up and stayed up for days. In contrast, the traffic from Digg fell off when the link dropped off the front page, bumped up only a little the next day (compared to the baseline), and then Slashdot came and blew the doors off at 4pm. Some of this difference is due to the late hour at which the link was Dugg and to how much longer the link remained on the Slashdot front page. But that doesn't account for the size and duration of the aftershock from Slashdot, which is going on three days now.
  • The traffic from the Slashdot link obscures any secondary Digg effect beyond 16-18 hours. But the bump in traffic (if any) from Digg on Sunday afternoon pre-Slashdot was not that large and was declining as the afternoon wore on, so any possible Digg aftershock that's obscured by the Slashdot link is minimal and short-lived.
  • I'm guessing the Slashdot aftershock is due to 1) traffic from links to kottke.org from other blogs that got it from Slashdot (from blogs that got it from those blogs, etc.), 2) people passing the link around via email, etc. after getting it from Slashdot, 3) Slashdot visitors returning to kottke.org to check out other content, and 4) an embedded Digg mini-aftershock of linkers, emailers, and repeat visitors. The del.icio.us page for the 50 ways/iPod link shows that before 1/8, only a few del.icio.us users per day were bookmarking it, but after that it was dozens per day.

In terms of comparing this with the Alexa data, it's not a direct comparison because they're measuring visitors to Digg and Slashdot, and I'm measuring (roughly) visitors from each of those sites. Still, from the kottke.org data you can infer the relative size of each site's audience from how many people visited from each site initially...the initial bandwidth burst from Slashdot was roughly 1.8 times as large as Digg's. That's almost exactly the ratio Alexa shows (~1.8x).

But over a period of about 4 days, Slashdot has sent more than 4 times as many visitors to kottke.org as Digg -- despite an 18-hour head start for Digg -- and Slashdot's aftershock is much larger and more prolonged. It's been four days since the Slashdotting and kottke.org is still getting 15,000 more visitors a day than usual. This indicates that although Digg may rapidly be catching up to Slashdot traffic-wise, it has a way to go in terms of influence[4].
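For the arithmetic-minded, here's the back-of-the-envelope version of those two ratios, using the approximate figures quoted above (a sketch, nothing more):

```python
# Approximate referral totals quoted above, as of about 4 days in
digg_visitors = 20_000
slashdot_visitors = 84_100

# Four-day referral ratio: Slashdot sent ~4.2x as many visitors as Digg
print(slashdot_visitors / digg_visitors)  # ~4.2

# The *initial* bandwidth bursts, by contrast, were only about 1.8:1 in
# Slashdot's favor, which happens to match Alexa's traffic ratio (~1.8x).
initial_burst_ratio = 1.8
```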

Slashdot is far from dying...the site still wields an enormous amount of influence. That's because it has been around for so long, has been big, visible, and influential for so long, and its purpose is to provide its audience with 20-25 relevant links/stories each day. The "word-of-mouth" network that Slashdot has built over the years is broad and deep. When a link is posted to Slashdot, not only do its readers see it, it gets posted to other blogs (and from there to other blogs, etc.), forwarded around, and so on. And those are well-established pathways.

In contrast, Digg's network is not quite so broad and certainly less deep...they just haven't been around as long. Plus Digg has so much flow (links/day) that what influence they do have is spread out over many more links, imparting less to each individual link. (There are quite a few analogies you can use somewhat successfully here...the mafia don who outsmarts a would-be usurper because of his connections and wisdom or the aging rock group that may currently be less popular than the flavor of the month but has collectively had a bigger influence on pop music. But I'll leave making those analogies as an exercise to the reader.)

What all this suggests is that if you're really interested in how influence works on the web, just looking at traffic or links doesn't tell you the whole story and can sometimes be quite misleading. Things like longevity, what the social & linking networks look like, and how sites are designed are also important. The Alexa data suggests that Digg has about half the traffic of Slashdot, yet Slashdot sent 4x as many visitors to kottke.org and had a much larger influence afterwards. The data aside, the Digg link was fun and all but ultimately insignificant. The Slashdot link brought significantly more readers to the site, spurred many other sites to link to it, and appears to have left me with a sizable chunk of new readers. As an online publisher, having those new long-term readers is a wonderful thing.

Anyway, lots of interesting stuff here just from this little bit of data...more questions than conclusions probably. And I didn't even get into the question of quality that Gene brings up here[5] or the possible effect of RSS[6]. It would be neat to be a researcher at someplace like Google or Yahoo! and be able to look more deeply into traffic flows, link propagation, different network topologies, etc. etc. etc.

[1] The way I discovered the Digging and Slashdotting was that I started getting all sorts of really stupid email, calling me names and swearing. One Slashdot reader called me a "fag" and asked me to stop talking about "gay ipod shit". The wisdom of the crowds tragedy of the commons indeed.

[2] On Digg, a "digg" is a like a thumbs-up. You dig?

[3] That's the normal traffic pattern for kottke.org and probably most similar sites...a nearly bell-shaped curve of traffic that is low in the early morning, builds from 8am to the highest point around noon, and declines in the afternoon until it's low again at night (although not as low as in the morning).

[4] The clever reader will note here that Slashdot got the link from Digg, so who's influencing who here? All this aftershock business...the Slashdotting is part of the Digg aftershock. To stick with the earthquake analogy though, no one cares about the 5.4 quake if it's followed up by a 7.2 later in the day.

[5] Ok, twist my arm. Both Digg and Slashdot use the wisdom of crowds to offer content to their readers. Slashdot's human editors post 25 stories a day suggested by individual readers while Digg might feature dozens of stories on the front page per day, collectively voted there by its readers. In terms of editorial quality, I am unconvinced that a voting system like Digg's can produce a quality editorial product...it's too much of an informational firehose. Bloggers and Slashdot story submitters might like drinking from that hose, but there's just too much flow (and not enough editing) to make it an everyday, long-term source of information. (You might say that, duh, Digg doesn't want to be a publication like Slashdot and you'd probably be right, in which case, why are people comparing the two sites in the first place? But still, in terms of influence, editing matters and if Digg wants to keep expanding its influence, it's gotta deal with that.)

[6] Digg might be more "bursty" than Slashdot because a higher percentage of its audience reads the site via RSS (because they're younger, grew up with newsreaders in their cribs, etc.). Brighter initial burn but less influence over time.

Chris Anderson has one of the bestDec 22 2005

Chris Anderson has one of the best descriptions I've read of collective knowledge systems like Google, Wikipedia, and blogs: they're probabilistic systems "which sacrifice perfection at the microscale for optimization at the macroscale".

Author Kevin O'Keefe, fresh from his searchDec 13 2005

Author Kevin O'Keefe, fresh from his search for the average American, goes looking for the average New Yorker, discovering that there's perhaps no such thing.

Table of the odds of dying fromDec 12 2005

Table of the odds of dying from various injuries. Looking at statistics like these, I'm always amazed at how worried people are about things that don't often result in death (fireworks, sharks) and how relatively dangerous automobiles are (see, for example, this list of people on MySpace who have died...many of the deaths on the first two pages involve cars).

Three economists share a cab, getting offDec 09 2005

Three economists share a cab, getting off at three different destinations. How do they split the fare? For answers, you might look to John Nash or the Talmud.

Gapminder has a ton of stats andDec 08 2005

Gapminder has a ton of stats and resources about human population and development trends. (thx, chris)

Quick overview of increased use of statisticsNov 07 2005

Quick overview of increased use of statistics in pro basketball, i.e. the moneyballing of the NBA. More NBA stats madness at 82games.com.

Nobody's talking about the anal sex portionSep 21 2005

Nobody's talking about the anal sex portion of a recently released survey on American sexual habits. "Evidently anal sex is too icky to mention in print. But not too icky to have been tried by 35 percent of young women and 40 to 44 percent of young men -- or to have killed some of them."

Odd size comparison of Yahoo and GoogleAug 15 2005

Odd size comparison of Yahoo and Google indices. I think their assumption (that a "series of random searches to both search engines should return more than twice as many results from Yahoo! than Google") is pretty flawed. The number of returned results could vary because of the sites' different optimizations for dictionary words, for searches with small result sets, and differences in how their search algorithms include or exclude relevant results. Put it this way: if I'm looking for a frying pan in my apartment, I'm gonna refine my search to the kitchen and not worry about the rest of the house, no matter how large it is. (via /.)

I love that Davenetics still shows upAug 10 2005

I love that Davenetics still shows up in these graphs of the top blogs on Technorati. I read Davenetics daily but the only reason it is on the list is because it's linked in a default Blogger template. If T'rati actually looked at their "statistics" instead of just using them to market to us, this sort of thing would be pretty easy to spot (if the ratio of the # of links vs. the # of sites linking is close to 1.0, the site may not belong on the list). (Oh, and Binary Bonsai is suspect as well...its high rank is at least partially due to a default link on a popular Wordpress template.)

MMORPG and the Dunbar numberAug 04 2005

MMORPG and the Dunbar number. "Overall, these statistics still support my original hypothesis in my Dunbar Number post that mean group sizes will be smaller than 150 for non-survival oriented groups."

Comparison of the power law in warJul 29 2005

Comparison of the power law in war. Statistics show that fatalities in modern warfare trend toward non-G7 terrorism patterns rather than those of conventional warfare, independent of context.

Another in Edward Jay Epstein's series onJul 27 2005

Another in Edward Jay Epstein's series on the business of Hollywood. This one's about the secret industry reports done by the MPAA that reveal hard-to-come-by statistics about how much Hollywood is making from which businesses.

Racial disparities in tipping taxi driversJul 18 2005

Racial disparities in tipping taxi drivers. African-American drivers were tipped 1/3 less than white drivers and African-American passengers tipped 50% less than white passengers.

Pokernomics: Steven Levitt is researching the economics of pokerJul 18 2005

Pokernomics: Steven Levitt is researching the economics of poker. If you send him statistics from your online games, he'll share the results with you.

The demographic transition modelJun 10 2005

The demographic transition model.

Comparing newspapers' online "circulation" (# of blog links)Apr 25 2005

Comparing newspapers' online "circulation" (# of blog links) with their offline circulation. The Christian Science Monitor had the highest ratio by far, with the Wall Street Journal being almost invisible on the web (which will eventually affect their influence, I think).
