Newsfeed Mania
A couple of weeks ago, I discovered "syndicated" newsfeeds based on RSS (Rich Site Summary) files. I know this is old news for a lot of folks, but it opened up some programming & collaborative opportunities for me that I thought I'd share here...
First off, RSS files are written in XML (Extensible Markup Language) & define such things as the site's title, its base URL, general site description & logo. Most importantly, RSS newsfeed files can contain a list of <item> definitions that provide headlines & links to articles on the source site. For a technical overview, consult UserLand.Com & syndic8.com.
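To make that structure concrete, here's a minimal sketch of pulling the channel title & the headline list out of an RSS file. It's written in Python using only the standard library, not any of the Perl tools discussed in this article, and the sample feed (titles, URLs, the 0.91 version number) is invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; every title and URL here is made up.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>Headlines from an example site</description>
    <item>
      <title>First headline</title>
      <link>http://www.example.com/articles/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://www.example.com/articles/2</link>
    </item>
  </channel>
</rss>"""


def read_headlines(rss_text):
    """Return (channel_title, [(headline, link), ...]) from an RSS string."""
    root = ET.fromstring(rss_text)
    channel = root.find("channel")
    channel_title = channel.findtext("title", default="")
    items = []
    # Each <item> carries one headline and the link to its article.
    for item in channel.findall("item"):
        headline = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((headline, link))
    return channel_title, items


if __name__ == "__main__":
    title, headlines = read_headlines(SAMPLE_RSS)
    print(title)
    for headline, link in headlines:
        print(" -", headline, link)
```

A strict XML parser like this one only works when the feed is well-formed, which real-world feeds often aren't.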
The RSS definition was originally developed to allow headlines to be "channeled" to My Netscape user home pages. The XML format has been widely adopted by "Content Management Systems" such as Slash, & many of these systems provide RSS files as part of their automated "backend". The popular "blogging" programs also generate RSS newsfeeds, although you may find that the content isn't terrifically interesting & a lot of "blogs" turn out to be personal diaries more than compelling informational pages.
There are a number of Perl modules for creating & parsing headline lists from RSS files. One that's particularly easy to use is RSSLite.pm. The downside to using these tools is two-fold: (1) many ISPs & hosting services don't have these modules installed, which puts the would-be newsfeed maven in the position of installing them in his/her cgi-bin (not a tough thing to do, but a P.I.T.A. nonetheless), and (2) the actual RSS you want to process may have sloppy entity coding or may not conform strictly to the RSS definition expected by the module. After writing my own RSS parsers & using RSSLite, I found there wasn't much advantage to using the module, since most of my "custom" script coding was still devoted to cleaning up the titles, links & descriptions. Still, for fast roll-out of an RSS-based headline list, you can do far worse than looking into some of the ready-rolled module-based Perl & PHP scripts available at such places as Slashdot.
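For a flavor of what that clean-up work looks like, here's a rough sketch in Python (again, not the Perl I actually used). It assumes two common kinds of sloppiness: bare "&" characters that make the XML invalid, and titles or descriptions that arrive with leftover entities, markup & stray whitespace. The function names are my own inventions for this example, not part of RSSLite or any other module.

```python
import html
import re

# Matches a "&" that does NOT begin an entity such as &amp;, &#38; or &#x26;
BARE_AMPERSAND = re.compile(
    r"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)"
)

# Matches something that looks like an HTML tag: "<" or "</" followed by a letter.
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")


def escape_bare_ampersands(rss_text):
    """Turn stray '&' characters into '&amp;' so an XML parser will accept the feed."""
    return BARE_AMPERSAND.sub("&amp;", rss_text)


def clean_field(text):
    """Decode entities, drop leftover markup, and collapse whitespace in one field."""
    text = html.unescape(text)          # "&amp;" -> "&", "&quot;" -> '"', etc.
    text = HTML_TAG.sub("", text)       # remove things like <b> and </b>
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(escape_bare_ampersands("<title>Tom & Jerry &amp; friends</title>"))
    print(clean_field("  Tom &amp; Jerry\n <b>live</b> "))
```

The idea is to run escape_bare_ampersands over the raw file before parsing, then clean_field over each title & description afterward. Real feeds will turn up other problems this sketch doesn't handle.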
Because the cleanest RSS feeds I could find initially came from "geek" techie sites (and because I'm a "geek", according to my wife!), my first expedition into the wilderness of newsfeed processing was the djeaux news pages. Although it was just an "experiment", I found it very useful to have tons of tech & coding links dynamically updated in one place.
All that computer stuff is well & good, but I've been very active for almost 10 years in the online Bob Dylan fan community. I began to wonder why there weren't newsfeeds available from the many popular Dylan fan sites. A quick email to Arlo, chief cook & bottlewasher of the world famous Dylan Pool website, spread the "Hog Wild Newsfeed Mania" to another innocent person. Arlo set up RSS feeds for the most recent concert setlist & the Dylan Pool Phorum. In short order, Dylanfreak Headline News was born.
Now, one problem that Arlo & I encountered right off is that Slash & blogging sites aren't very common among Dylan fans. In fact, I couldn't find any. Simply put, there weren't any RSS feeds to be had! The response was to write "scraper" scripts. Scrapers are programs that load in a web page, identify "headlines" & their associated hyperlinks, and then churn out a proper RSS file from the "scraped" information. Using this approach, Arlo enabled me to add newsfeeds for the esteemed Expecting Rain site & the Google News Search Beta system. I tried my own hand at it by writing scripts to scrape the latest posts from Google Groups' rendition of rec.music.dylan & Dylan-related auction items from eBay. The official bobdylan.com discussion boards have also been scraped. (And for what it's worth, we did get approval from the bobdylan.com webmeister to use those feeds!)
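The scraper idea described above can be sketched roughly like this. This is a bare-bones Python illustration, not the scripts Arlo or I actually wrote: it treats every hyperlink on a page as a candidate "headline", lets you pass in a filter to keep only the ones you want, and writes out a simple RSS 0.91-style file. The class name, function name & the "keep" filter are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href="..."> on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link we are currently inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            title = " ".join("".join(self._text).split())
            if title:
                self.links.append((title, self._href))
            self._href = None


def scrape_to_rss(page_html, page_url, channel_title, keep=lambda title, href: True):
    """Build an RSS string from the links in page_html that pass the keep() filter."""
    collector = LinkCollector()
    collector.feed(page_html)
    collector.close()

    rss = ET.Element("rss", version="0.91")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = page_url
    ET.SubElement(channel, "description").text = "Headlines scraped from " + page_url
    for title, href in collector.links:
        if keep(title, href):
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = title
            # Relative links are resolved against the page's own URL.
            ET.SubElement(item, "link").text = urljoin(page_url, href)
    return ET.tostring(rss, encoding="unicode")
```

In practice the page would be fetched first (for example with urllib.request.urlopen), and the hard part is the filter: every site needs its own rule for telling headlines apart from navigation links, something like keep=lambda title, href: "/news/" in href for a site that happens to file its articles under /news/.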
As an offshoot of this project, I developed a scraper that will create an RSS newsfeed file from any Usenet group archived by Google Groups. Over the next few weeks, I plan to add a customizable user page where anyone can select their favorite half dozen or so newsgroups & view all the latest message postings in one place.
"Things should start to get interesting right about now!"