Time  | Nickname | Message
11:00 | Smiley   | jesus so many channels :P
11:00 | Smiley   | Right
11:00 | Smiley   | when you do collapse all items
11:00 | Smiley   | when a new item occurs
11:00 | Smiley   | it's not collapsed
11:01 | ersi     | Yeah, I've noticed that as well
11:01 | Smiley   | I can raise a bug if we want...
11:01 | Smiley   | I think it'd be pretty simple to fix <shrug>
11:03 | ersi     | Yeah, that'd be best.. I'm trying to remember where and when this was added
11:03 | ersi     | I believe it's in seesaw-kit
11:04 | ersi     | This is the addition of the feature https://github.com/ArchiveTeam/seesaw-kit/pull/18
11:04 | ersi     | so seesaw-kit is the right project to file towards
15:53 | omf_     | I think this is the best room to talk about scrapers again.
15:53 | omf_     | So the one I have been building is coming along nicely
15:53 | omf_     | I wanted to ask after the features
15:55 | omf_     | On top of the basics that we get with wget here are the features I see the need for based on scraping different sites
15:55 | omf_     | http://pastebin.com/xiXjZ4Kp
15:58 | omf_     | The spider currently fetches pages in parallel, uses DNS caching for faster accesses and outputs a log file that could easily become a cdx file
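A minimal sketch of those two pieces, not omf_'s actual code: fetching URLs in parallel with a thread pool and memoizing DNS lookups by wrapping socket.getaddrinfo. The URLs, worker count, and logged fields are placeholders.

    import functools
    import socket
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    # Cache DNS answers so each host is resolved only once per run.
    _real_getaddrinfo = socket.getaddrinfo

    @functools.lru_cache(maxsize=1024)
    def _cached_getaddrinfo(host, port, *args):
        return _real_getaddrinfo(host, port, *args)

    socket.getaddrinfo = _cached_getaddrinfo

    def fetch(url):
        # Fetch one page and return the fields we want to log per URL.
        with urlopen(url, timeout=30) as resp:
            return url, resp.status, len(resp.read())

    urls = ["http://example.com/", "http://example.org/"]
    with ThreadPoolExecutor(max_workers=8) as pool:
        for url, status, size in pool.map(fetch, urls):
            # One line per fetch; with a timestamp and digest added,
            # this log could later be massaged into CDX.
            print(url, status, size)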
16:00 | omf_     | Of all the features the main feature I find important is that this spider can be started, stopped, crash and not lose data. It will resume cleanly.
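One way to get that start/stop/crash safety, shown only as a sketch and not as the spider's real persistence layer, is to keep the frontier and the completed set in SQLite so every finished URL is committed to disk the moment it is done.

    import sqlite3

    class CrawlState:
        """URL frontier that survives stops and crashes (illustrative only)."""

        def __init__(self, path="crawl_state.db"):
            self.db = sqlite3.connect(path)
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS urls ("
                " url TEXT PRIMARY KEY,"
                " done INTEGER NOT NULL DEFAULT 0)")
            self.db.commit()

        def add(self, url):
            # Duplicates are ignored, so re-adding discovered links is harmless.
            self.db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
            self.db.commit()

        def next_batch(self, n=100):
            rows = self.db.execute(
                "SELECT url FROM urls WHERE done = 0 LIMIT ?", (n,)).fetchall()
            return [r[0] for r in rows]

        def mark_done(self, url):
            # Committed immediately; a crash after this point cannot lose it.
            self.db.execute("UPDATE urls SET done = 1 WHERE url = ?", (url,))
            self.db.commit()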
16:02 | omf_     | What do all of you see as the major problems with current crawling and fetching programs?
16:06 | sep332   | the "One Terabyte of Kilobyte Age" blog has some complaints about the Geocities scrapes
16:06 | sep332   | http://contemporary-home-computing.org/1tb/archives/3308
16:06 | sep332   | might be something to keep in mind?
16:16 | omf_     | Ooh I will read that
16:18 | omf_     | Yeah reading over my list, it is really short. Many of the features it already has I just left off
16:19 | omf_     | Like user agent rotation based on time or number of uses
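Rotation by number of uses could look something like the sketch below; the agent strings and the 50-request threshold are invented for the example.

    import itertools

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    ]

    class AgentRotator:
        def __init__(self, agents, uses_per_agent=50):
            self._cycle = itertools.cycle(agents)
            self._limit = uses_per_agent
            self._uses = 0
            self._current = next(self._cycle)

        def get(self):
            # Hand out the current agent until it has been used enough times,
            # then move on to the next one in the cycle.
            if self._uses >= self._limit:
                self._current = next(self._cycle)
                self._uses = 0
            self._uses += 1
            return self._current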
16:20 | omf_     | Support for IP address binding
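Binding outgoing requests to a particular local address can be done with the standard library alone; the sketch below assumes a machine with several addresses and uses 192.0.2.10 purely as a placeholder.

    import http.client

    # Route this connection out through one specific local IP (port 0 = any port).
    conn = http.client.HTTPConnection(
        "example.com", source_address=("192.0.2.10", 0))
    conn.request("GET", "/")
    resp = conn.getresponse()
    print(resp.status, len(resp.read()))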
16:21 | omf_     | I spend more time writing and testing the thing than writing the specification for it.
16:22 | omf_     | Like the fetch code uses libcurl for speed
16:22 | omf_     | and LibXML as the html parser because it is the fastest thing out there in the open source world
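A small sketch of that pairing through the usual Python bindings: pycurl on top of libcurl for the fetch, and lxml (which wraps libxml2) for the HTML parsing. The URL is a placeholder and this is not the spider's own fetch code.

    from io import BytesIO

    import pycurl
    from lxml import html

    # Fetch the page with libcurl via pycurl.
    buf = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, "http://example.com/")
    curl.setopt(pycurl.WRITEDATA, buf)
    curl.setopt(pycurl.FOLLOWLOCATION, True)
    curl.perform()
    curl.close()

    # Parse with libxml2 via lxml and pull out the links for the frontier.
    doc = html.fromstring(buf.getvalue())
    links = [a.get("href") for a in doc.iter("a") if a.get("href")]
    print(len(links), "links found")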
16:24 | omf_     | The fastest xml/html library is the Microsoft one. Great if you are on Windows
23:54 | omf_     | Can we update this so the old projects are off the list? http://zeppelin.xrtc.net/corp.xrtc.net/shilling.corp.xrtc.net/index.html
23:57 | omf_     | Just to verify I was reading this https://github.com/ArchiveTeam/universal-tracker
23:57 | omf_     | Can the list be as simple as a list of large files?
23:58 | omf_     | I think it should work since this gets broken down and fed to wget