
PHP5, DOM, scraping web pages, and McSweeney’s Lists RSS

posted by Jason Kottke   Apr 25, 2005

In preparation for a larger project, I recently spent some time playing around with PHP5’s DOM support to scrape web pages. Basically you point your script at a page and use the DOM methods to root around in it. This little chunk of code gets you a list of all the <p> elements in document.html:

$dom = new DOMDocument();
$file = 'document.html';
$dom->loadHTMLFile($file);
$pgs = $dom->getElementsByTagName('p');
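
To actually get at the text, you loop over the node list that getElementsByTagName() hands back. Here is a minimal sketch of that step. The inline $html string is a made-up stand-in for document.html, and the libxml_use_internal_errors() call is an optional extra (not part of the snippet above) that keeps messy real-world markup from flooding the output with parser warnings.

```php
<?php
// Hypothetical inline markup standing in for document.html.
$html = '<html><body><p>First list</p><div><p>Second <b>list</b></p></div></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// getElementsByTagName() returns a DOMNodeList, which can be iterated with foreach.
$texts = [];
foreach ($dom->getElementsByTagName('p') as $p) {
    // textContent flattens any nested markup (like the <b> above) into plain text.
    $texts[] = trim($p->textContent);
}

print_r($texts);
```

The same loop works after loadHTMLFile(); only the loading call changes.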

I never learn anything like this without a little project to do, so I decided to use the above to make an RSS feed for McSweeney’s Lists. It currently doesn’t have one, and now that I’m using a newsreader to keep up with the web, I never remember to visit there on anything resembling a regular basis. I’ve got a cron job set up that goes out and gets the lists page each night (using Tidy to convert their circa-1999 HTML to proper XHTML that can be easily parsed with the DOM), scans it for new lists (and if it finds new ones, puts them in a DB), and then writes an RSS file.
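
The nightly job described above could be sketched roughly like this. It is not the actual script: the sample page markup, the example.com base URL, the feed title, and the helper names (extractLinks, addText, buildRss) are all hypothetical, and a plain array stands in for the DB. The Tidy step only runs if the tidy extension happens to be installed.

```php
<?php
// Hypothetical sample standing in for the fetched lists page; the real markup is unknown.
$page = '<html><body>'
      . '<p><a href="/lists/one.html">List One</a></p>'
      . '<p><a href="/lists/two.html">List Two</a></p>'
      . '</body></html>';

// Optional cleanup: convert to XHTML when the tidy extension is available.
if (function_exists('tidy_repair_string')) {
    $clean = tidy_repair_string($page, array('output-xhtml' => true), 'utf8');
    if ($clean !== false) {
        $page = $clean;
    }
}

// Return an array of href => link text for every <a> element in the page.
function extractLinks($html) {
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $items = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $items[$a->getAttribute('href')] = trim($a->textContent);
    }
    return $items;
}

// Append <name>text</name> to $parent, using a text node so & and < get escaped.
function addText(DOMDocument $doc, DOMNode $parent, $name, $text) {
    $el = $doc->createElement($name);
    $el->appendChild($doc->createTextNode($text));
    $parent->appendChild($el);
}

// Build a bare-bones RSS 2.0 document from href => title pairs.
function buildRss(array $items, $baseUrl) {
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;
    $rss = $doc->createElement('rss');
    $rss->setAttribute('version', '2.0');
    $doc->appendChild($rss);
    $channel = $doc->createElement('channel');
    $rss->appendChild($channel);
    addText($doc, $channel, 'title', 'Lists (unofficial sketch)');
    addText($doc, $channel, 'link', $baseUrl);
    addText($doc, $channel, 'description', 'Scraped list titles');
    foreach ($items as $href => $title) {
        $item = $doc->createElement('item');
        $channel->appendChild($item);
        addText($doc, $item, 'title', $title);
        addText($doc, $item, 'link', $baseUrl . $href);
    }
    return $doc->saveXML();
}

// Array standing in for the DB: lists already seen on earlier runs.
$stored = array('/lists/one.html' => 'List One');

$found  = extractLinks($page);
$new    = array_diff_key($found, $stored); // only lists not yet stored
$stored = $stored + $new;                  // "insert" the new ones

$xml = buildRss($stored, 'https://example.com');
echo $xml;
```

A real version would fetch the page over HTTP, keep the seen lists in an actual database, and write $xml to a file from cron instead of echoing it.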

Anyway, here’s the RSS feed for McSweeney’s Lists. Since it relies on screen scraping, my meagre PHP skills, and the good graces of McSweeney’s in not asking me to shut it down, there’s no guarantee this will work forever, so enjoy it while you can. I’m trying out Feedburner as well, so we’ll see how that goes.

Update: my code snippet was incorrect and is now fixed. Thanks to Eliot for pointing that out.

Update: As some of you may have noticed, the above RSS feed has not worked for some months now…it broke at some point and I never got around to fixing it. Additionally, McSweeney’s has contacted me and asked me to discontinue the feed, so it won’t ever be fixed. They’re looking at doing their own RSS feeds and hopefully that will happen sooner rather than later.