Time  | Nickname | Message
11:00 | Smiley   | jesus so many channels :P
11:00 | Smiley   | Right
11:00 | Smiley   | when you do collapse all items
11:00 | Smiley   | when a new item occurs
11:00 | Smiley   | it's not collapsed
11:01 | ersi     | Yeah, I've noticed that as well
11:01 | Smiley   | I can raise a bug if we want...
11:01 | Smiley   | I think it'd be pretty simple to fix <shrug>
11:03 | ersi     | Yeah, that'd be best.. I'm trying to remember where and when this was added
11:03 | ersi     | I believe it's in seesaw-kit
11:04 | ersi     | This is the addition of the feature https://github.com/ArchiveTeam/seesaw-kit/pull/18
11:04 | ersi     | so seesaw-kit is the right project to file towards
15:53 | omf_     | I think this is the best room to talk about scrapers again.
15:53 | omf_     | So the one I have been building is coming along nicely
15:53 | omf_     | I wanted to ask after the features
15:55 | omf_     | On top of the basics that we get with wget here are the features I see the need for based on scraping different sites
15:55 | omf_     | http://pastebin.com/xiXjZ4Kp
15:58 | omf_     | The spider currently fetches pages in parallel, uses DNS caching for faster accesses and outputs a log file that could easily become a cdx file
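A minimal sketch of those two pieces, not omf_'s actual code: fetching URLs in parallel with a thread pool and memoizing DNS lookups by wrapping socket.getaddrinfo. The URLs, worker count, and logged fields are placeholders.

    import functools
    import socket
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    # Cache DNS answers so each host is resolved only once per run.
    _real_getaddrinfo = socket.getaddrinfo

    @functools.lru_cache(maxsize=1024)
    def _cached_getaddrinfo(host, port, *args):
        return _real_getaddrinfo(host, port, *args)

    socket.getaddrinfo = _cached_getaddrinfo

    def fetch(url):
        # Fetch one page and return the fields we want to log per URL.
        with urlopen(url, timeout=30) as resp:
            return url, resp.status, len(resp.read())

    urls = ["http://example.com/", "http://example.org/"]
    with ThreadPoolExecutor(max_workers=8) as pool:
        for url, status, size in pool.map(fetch, urls):
            # One line per fetch; with a timestamp and digest added,
            # this log could later be massaged into CDX.
            print(url, status, size)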
16:00 | omf_     | Of all the features the main feature I find important is that this spider can be started, stopped, crash and not lose data. It will resume cleanly.
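One way to get that start/stop/crash safety, shown only as a sketch and not as the spider's real persistence layer, is to keep the frontier and the completed set in SQLite so every finished URL is committed to disk the moment it is done.

    import sqlite3

    class CrawlState:
        """URL frontier that survives stops and crashes (illustrative only)."""

        def __init__(self, path="crawl_state.db"):
            self.db = sqlite3.connect(path)
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS urls ("
                " url TEXT PRIMARY KEY,"
                " done INTEGER NOT NULL DEFAULT 0)")
            self.db.commit()

        def add(self, url):
            # Duplicates are ignored, so re-adding discovered links is harmless.
            self.db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
            self.db.commit()

        def next_batch(self, n=100):
            rows = self.db.execute(
                "SELECT url FROM urls WHERE done = 0 LIMIT ?", (n,)).fetchall()
            return [r[0] for r in rows]

        def mark_done(self, url):
            # Committed immediately; a crash after this point cannot lose it.
            self.db.execute("UPDATE urls SET done = 1 WHERE url = ?", (url,))
            self.db.commit()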
16:02 | omf_     | What do all of you see as the major problems with current crawling and fetching programs?
16:06 | sep332   | the "One Terabyte of Kilobyte Age" blog has some complaints about the Geocities scrapes
16:06 | sep332   | http://contemporary-home-computing.org/1tb/archives/3308
16:06 | sep332   | might be something to keep in mind?
16:16 | omf_     | Ooh I will read that
16:18 | omf_     | Yeah reading over my list, it is really short. Many of the features it already has I just left off
16:19 | omf_     | Like user agent rotation based on time or number of uses
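Rotation by number of uses could look something like the sketch below; the agent strings and the 50-request threshold are invented for the example.

    import itertools

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    ]

    class AgentRotator:
        def __init__(self, agents, uses_per_agent=50):
            self._cycle = itertools.cycle(agents)
            self._limit = uses_per_agent
            self._uses = 0
            self._current = next(self._cycle)

        def get(self):
            # Hand out the current agent until it has been used enough times,
            # then move on to the next one in the cycle.
            if self._uses >= self._limit:
                self._current = next(self._cycle)
                self._uses = 0
            self._uses += 1
            return self._current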
16:20 | omf_     | Support for IP address binding
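Binding outgoing requests to a particular local address can be done with the standard library alone; the sketch below assumes a machine with several addresses and uses 192.0.2.10 purely as a placeholder.

    import http.client

    # Route this connection out through one specific local IP (port 0 = any port).
    conn = http.client.HTTPConnection(
        "example.com", source_address=("192.0.2.10", 0))
    conn.request("GET", "/")
    resp = conn.getresponse()
    print(resp.status, len(resp.read()))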
16:21 | omf_     | I spend more time writing and testing the thing than writing the specification for it.
16:22 | omf_     | Like the fetch code uses libcurl for speed
16:22 | omf_     | and LibXML as the html parser because it is the fastest thing out there in the open source world
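A small sketch of that pairing through the usual Python bindings: pycurl on top of libcurl for the fetch, and lxml (which wraps libxml2) for the HTML parsing. The URL is a placeholder and this is not the spider's own fetch code.

    from io import BytesIO

    import pycurl
    from lxml import html

    # Fetch the page with libcurl via pycurl.
    buf = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, "http://example.com/")
    curl.setopt(pycurl.WRITEDATA, buf)
    curl.setopt(pycurl.FOLLOWLOCATION, True)
    curl.perform()
    curl.close()

    # Parse with libxml2 via lxml and pull out the links for the frontier.
    doc = html.fromstring(buf.getvalue())
    links = [a.get("href") for a in doc.iter("a") if a.get("href")]
    print(len(links), "links found")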
16:24 | omf_     | The fastest xml/html library is the Microsoft one. Great if you are on Windows
23:54 | omf_     | Can we update this so the old projects are off the list? http://zeppelin.xrtc.net/corp.xrtc.net/shilling.corp.xrtc.net/index.html
23:57 | omf_     | Just to verify I was reading this https://github.com/ArchiveTeam/universal-tracker
23:57 | omf_     | Can the list be as simple as a list of large files?
23:58 | omf_     | I think it should work since this gets broken down and fed to wget