[11:00] jesus so many channels :P
[11:00] Right
[11:00] when you do collapse all items
[11:00] when a new item occurs
[11:00] it's not collapsed
[11:00] Yeah, I've noticed that as well
[11:01] I can raise a bug if we want...
[11:01] I think it'd be pretty simple to fix
[11:03] Yeah, that'd be best.. I'm trying to remember where and when this was added
[11:03] I believe it's in seesaw-kit
[11:04] This is the addition of the feature https://github.com/ArchiveTeam/seesaw-kit/pull/18
[11:04] so seesaw-kit is the right project to file towards
[15:53] I think this is the best room to talk about scrapers again.
[15:53] So the one I have been building is coming along nicely
[15:53] I wanted to ask about the features
[15:55] On top of the basics that we get with wget, here are the features I see the need for, based on scraping different sites
[15:55] http://pastebin.com/xiXjZ4Kp
[15:58] The spider currently fetches pages in parallel, uses DNS caching for faster access, and outputs a log file that could easily become a CDX file
[16:00] Of all the features, the one I find most important is that this spider can be started, stopped, or crash without losing data. It will resume cleanly.
[16:02] What do all of you see as the major problems with current crawling and fetching programs?
[16:06] the "One Terabyte of Kilobyte Age" blog has some complaints about the Geocities scrapes
[16:06] http://contemporary-home-computing.org/1tb/archives/3308
[16:06] might be something to keep in mind?
[16:16] Ooh I will read that
[16:18] Yeah, reading over my list, it is really short. Many of the features it already has I just left off
[16:19] Like user agent rotation based on time or number of uses
[16:20] Support for IP address binding
[16:21] I spend more time writing and testing the thing than writing the specification for it.
[16:22] Like the fetch code uses libcurl for speed
[16:22] and LibXML as the HTML parser, because it is the fastest thing out there in the open source world
[16:24] The fastest XML/HTML library is the Microsoft one. Great if you are on Windows
[23:54] Can we update this so the old projects are off the list? http://zeppelin.xrtc.net/corp.xrtc.net/shilling.corp.xrtc.net/index.html
[23:57] Just to verify, I was reading this: https://github.com/ArchiveTeam/universal-tracker
[23:57] Can the list be as simple as a list of large files?
[23:58] I think it should work, since this gets broken down and fed to wget
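
Illustration for the 15:58 message about a fetch log that could become a CDX file: a minimal sketch of a per-fetch record, laid out roughly like the common 11-field CDX line. The function and field choices here are made up for illustration and are not taken from the scraper being discussed.

    import hashlib
    from datetime import datetime, timezone

    def fetch_record(url, status, mimetype, body, warc_file, offset):
        # 11-field CDX-style line: urlkey, timestamp, original URL, MIME type,
        # status, digest, redirect, robot flags, length, offset, filename.
        # urlkey would normally be a canonicalised (SURT) form; the raw URL is used here for brevity.
        ts = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
        digest = hashlib.sha1(body).hexdigest()
        fields = [url, ts, url, mimetype, str(status), digest,
                  "-", "-", str(len(body)), str(offset), warc_file]
        return " ".join(fields)

    print(fetch_record("http://example.com/", 200, "text/html",
                       b"<html></html>", "crawl-00000.warc.gz", 0))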
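
Illustration for the 16:00 message about start/stop/crash-safe crawling: one way to get that behaviour is to keep the frontier and visited set in an on-disk store instead of only in memory. A minimal sketch with SQLite; the schema and function names are hypothetical, not the scraper's actual design.

    import sqlite3

    db = sqlite3.connect("crawl_state.db")
    db.execute("CREATE TABLE IF NOT EXISTS frontier (url TEXT PRIMARY KEY, done INTEGER DEFAULT 0)")

    def enqueue(url):
        # INSERT OR IGNORE doubles as the visited check; the commit makes the queue crash-safe
        db.execute("INSERT OR IGNORE INTO frontier (url) VALUES (?)", (url,))
        db.commit()

    def next_url():
        row = db.execute("SELECT url FROM frontier WHERE done = 0 LIMIT 1").fetchone()
        return row[0] if row else None

    def mark_done(url):
        db.execute("UPDATE frontier SET done = 1 WHERE url = ?", (url,))
        db.commit()

    # After a crash or restart, anything still marked done = 0 is simply picked up again.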
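
Illustration for the 16:19 and 16:20 messages (user agent rotation, IP address binding): both map onto standard libcurl options. A rough sketch via pycurl; the rotate-every-N-uses policy and the agent strings are invented for the example.

    import itertools
    import pycurl

    USER_AGENTS = ["Mozilla/5.0 (crawler a)", "Mozilla/5.0 (crawler b)"]  # placeholder strings
    ROTATE_EVERY = 50  # hypothetical policy: switch agents after this many uses

    _agents = itertools.cycle(USER_AGENTS)
    _current = next(_agents)
    _uses = 0

    def configure(curl, bind_ip=None):
        # Apply the current user agent, rotating every ROTATE_EVERY uses,
        # and optionally bind the request to a specific local address.
        global _current, _uses
        if _uses >= ROTATE_EVERY:
            _current, _uses = next(_agents), 0
        _uses += 1
        curl.setopt(pycurl.USERAGENT, _current)
        if bind_ip:
            curl.setopt(pycurl.INTERFACE, bind_ip)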
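
Illustration for the 16:22 messages about libcurl plus LibXML: the same combination is available from Python as pycurl (libcurl bindings) and lxml (libxml2 bindings). A bare-bones fetch-and-extract-links sketch, not the scraper's real code.

    from io import BytesIO
    import pycurl
    import lxml.html

    def fetch_links(url):
        # Fetch with libcurl (via pycurl), parse with libxml2 (via lxml), return outgoing links
        buf = BytesIO()
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, buf)
        c.setopt(pycurl.FOLLOWLOCATION, True)
        c.perform()
        status = c.getinfo(pycurl.RESPONSE_CODE)
        c.close()
        doc = lxml.html.fromstring(buf.getvalue())
        doc.make_links_absolute(url)
        return status, [link for _, _, link, _ in doc.iterlinks()]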
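
Illustration for the 23:57 and 23:58 messages: a rough sketch of what "a list of large files" handed to wget could look like on the client side. This assumes each tracker item is simply one file URL, which may not match how universal-tracker and seesaw actually break items down; the names and URLs are placeholders.

    import os
    import subprocess

    def download_item(item, dest_dir="data"):
        # Hypothetical: treat the tracker item as a single file URL and hand it straight to wget
        os.makedirs(dest_dir, exist_ok=True)
        filename = item.rstrip("/").rsplit("/", 1)[-1]
        subprocess.run(
            ["wget", "--continue", "--output-document", os.path.join(dest_dir, filename), item],
            check=True,
        )

    for item in ["http://example.com/dumps/archive-0001.tar"]:  # would come from the tracker
        download_item(item)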