#warrior 2013-03-14,Thu


Time Nickname Message
11:00 🔗 Smiley jesus so many channels :P
11:00 🔗 Smiley Right
11:00 🔗 Smiley when you do collapse all items
11:00 🔗 Smiley when a new item occurs
11:00 🔗 Smiley it's not collapsed
11:00 🔗 ersi Yeah, I've noticed that as well
11:01 🔗 Smiley I can raise a bug if we want...
11:01 🔗 Smiley I think it'd be pretty simple to fix <shrug>
11:03 🔗 ersi Yeah, that'd be best.. I'm trying to remember where and when this was added
11:03 🔗 ersi I believe it's in seesaw-kit
11:04 🔗 ersi This is the addition of the feature https://github.com/ArchiveTeam/seesaw-kit/pull/18
11:04 🔗 ersi so seesaw-kit is the right project to file towards
15:53 🔗 omf_ I think this is the best room to talk about scrapers again.
15:53 🔗 omf_ So the one I have been building is coming along nicely
15:53 🔗 omf_ I wanted to ask after the features
15:55 🔗 omf_ On top of the basics that we get with wget here are the features I see the need for based on scraping different sites
15:55 🔗 omf_ http://pastebin.com/xiXjZ4Kp
15:58 🔗 omf_ The spider currently fetches pages in parallel, uses DNS caching for faster accesses and outputs a log file that could easily become a cdx file
16:00 🔗 omf_ Of all the features the main feature I find important is that this spider can be started, stopped, crash and not lose data. It will resume cleanly.
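The start/stop/crash-safe resumption omf_ describes can be sketched as a crawl frontier that checkpoints its state to disk after every change. This is a minimal illustration only, not omf_'s actual implementation; all names (`Frontier`, the JSON layout) are hypothetical:

```python
# Toy resumable crawl frontier: every state change is checkpointed to
# disk, so a restart (or crash) resumes exactly where the crawl stopped.
import json
import os
import tempfile

class Frontier:
    def __init__(self, path):
        self.path = path
        self.state = {"pending": [], "done": []}
        if os.path.exists(path):          # resume from an earlier run
            with open(path) as f:
                self.state = json.load(f)

    def add(self, url):
        if url not in self.state["pending"] and url not in self.state["done"]:
            self.state["pending"].append(url)
            self._checkpoint()

    def next_url(self):
        return self.state["pending"][0] if self.state["pending"] else None

    def mark_done(self, url):
        self.state["pending"].remove(url)
        self.state["done"].append(url)
        self._checkpoint()

    def _checkpoint(self):
        # Write-then-rename so a crash mid-write never corrupts the file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.path)

# Simulate a crash/restart: a second Frontier on the same file picks up
# the surviving state.
path = os.path.join(tempfile.mkdtemp(), "frontier.json")
f1 = Frontier(path)
f1.add("http://example.com/a")
f1.add("http://example.com/b")
f1.mark_done("http://example.com/a")
f2 = Frontier(path)   # "restarted" process
```

The write-then-rename trick is what makes the checkpoint crash-safe: the old file stays intact until the new one is fully written.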
16:02 🔗 omf_ What do all of you see as the major problems with current crawling and fetching programs?
16:06 🔗 sep332 the "One Terabyte of Kilobyte Age" blog has some complaints about the Geocities scrapes
16:06 🔗 sep332 http://contemporary-home-computing.org/1tb/archives/3308
16:06 🔗 sep332 might be something to keep in mind?
16:16 🔗 omf_ Ooh I will read that
16:18 🔗 omf_ Yeah reading over my list, it is really short. Many of the features it already has I just left off
16:19 🔗 omf_ Like user agent rotation based on time or number of uses
16:20 🔗 omf_ Support for IP address binding
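User-agent rotation "based on number of uses" can be shown with a small counter over a cyclic pool. A sketch under stated assumptions (the class name, the `max_uses` parameter, and the agent strings are all hypothetical):

```python
import itertools

class AgentRotator:
    """Cycle through a pool of user-agent strings, switching to the
    next one after every `max_uses` requests."""
    def __init__(self, agents, max_uses=3):
        self.pool = itertools.cycle(agents)
        self.max_uses = max_uses
        self.current = next(self.pool)
        self.uses = 0

    def get(self):
        # Rotate once the current agent has been used max_uses times.
        if self.uses >= self.max_uses:
            self.current = next(self.pool)
            self.uses = 0
        self.uses += 1
        return self.current

rot = AgentRotator(["AgentA/1.0", "AgentB/2.0"], max_uses=2)
seen = [rot.get() for _ in range(5)]
```

Rotation by elapsed time works the same way, with a timestamp comparison in place of the use counter.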
16:21 🔗 omf_ I spend more time writing and testing the thing than writing the specification for it.
16:22 🔗 omf_ Like the fetch code uses libcurl for speed
16:22 🔗 omf_ and LibXML as the html parser because it is the fastest thing out there in the open source world
16:24 🔗 omf_ The fastest xml/html library is the Microsoft one. Great if you are on Windows
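The HTML-parsing step omf_ assigns to LibXML (extracting links to feed back into the fetch queue) looks roughly like this. Shown here with Python's stdlib `html.parser` purely for illustration, not the libcurl/LibXML code the spider actually uses:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Relative links are resolved against the page URL.
                    self.links.append(urljoin(self.base_url, value))

page = ('<html><body>'
        '<a href="/about">About</a>'
        '<a href="http://other.example/x">X</a>'
        '</body></html>')
p = LinkExtractor("http://example.com/")
p.feed(page)
```

A production parser like LibXML does the same job far faster and tolerates more malformed markup; the logic of the extraction step is unchanged.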
23:54 🔗 omf_ Can we update this so the old projects are off the list? http://zeppelin.xrtc.net/corp.xrtc.net/shilling.corp.xrtc.net/index.html
23:57 🔗 omf_ Just to verify I was reading this https://github.com/ArchiveTeam/universal-tracker
23:57 🔗 omf_ Can the list be as simple as a list of large files
23:58 🔗 omf_ I think it should work since this gets broken down and fed to wget
