#archiveteam 2013-04-08,Mon


Time Nickname Message
02:08 πŸ”— ivan` can I make wget assume that everything already downloaded is not newer on the server? I need to continue a 14-day --mirror that stopped due to a full disk
02:24 πŸ”— omf_ ivan`, The problem is that wget is going to head check everything it already got before continuing
02:24 πŸ”— omf_ so basically resume is not very useful
02:28 πŸ”— omf_ httrack is a step in the correct direction with its resume and update modes
02:29 πŸ”— omf_ but these tools were all built around the idea of the internet of the 1990s
02:29 πŸ”— omf_ now that sites are way bigger, things like stopping and starting a grab need to be first-class features in the software
02:32 πŸ”— omf_ One of the reasons curl was created was to have a more performance-oriented CLI tool, and a library to boot
02:49 πŸ”— ivan` thanks. too bad.
02:57 πŸ”— omf_ It was designed for a different time.
02:57 πŸ”— omf_ I have been looking for more modern tooling but it is not encouraging
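
The problem described above is that wget's --mirror implies timestamping (-N), so a resumed run re-requests everything it already has just to compare timestamps. A rough workaround sketch in Python: walk the partial mirror on disk and emit the URLs wget has already saved as a skip list for a smarter wrapper or crawler to consult. The host/path-to-URL mapping assumed here follows wget's default directory layout and is an assumption, not something stated in the log.

    import os

    MIRROR_ROOT = "example.com"      # hypothetical: wget --mirror names the top directory after the host
    BASE_URL = "http://example.com"  # hypothetical base URL of the interrupted mirror

    def already_fetched(root, base):
        """Yield the URLs corresponding to files wget has already written to disk."""
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                rel = os.path.relpath(os.path.join(dirpath, name), root).replace(os.sep, "/")
                # wget saves "http://host/dir/" as "host/dir/index.html" by default
                if name == "index.html":
                    rel = rel[: -len("index.html")]
                yield base + "/" + rel

    if __name__ == "__main__":
        with open("skip-list.txt", "w") as out:
            for url in already_fetched(MIRROR_ROOT, BASE_URL):
                out.write(url + "\n")
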
03:10 πŸ”— chfoo would it be worth it to write a better alternative to wget/curl for our needs?
03:11 πŸ”— chronomex I've thought about that a bit
03:11 πŸ”— omf_ I have been working on it for a few months. The hard part is robustly testing it
03:11 πŸ”— chronomex not in great detail though
03:12 πŸ”— omf_ Here are the features we need that would be extremely helpful
03:12 πŸ”— omf_ The ability to start, stop or crash the program with no data loss and the ability to resume
03:13 πŸ”— omf_ a display mode like httrack --display which shows the status of multiple file fetches at once instead of just endless wget scrolling
03:13 πŸ”— omf_ Threaded page fetches so you can scale up and down the number of grabs
03:13 πŸ”— omf_ Having the application be event based so you can put it in interactive mode to change settings dynamically
03:14 πŸ”— omf_ Everything our standard wget command does should be the default
03:14 πŸ”— omf_ The ability to sort hits to other domains into another grab pool which can be paused
03:15 πŸ”— omf_ this allows complex domain spanning and the ability to filter out ad networks
03:16 πŸ”— omf_ url redirection tracking so you never get the same page twice
03:16 πŸ”— omf_ smart fetching of CSS and linked assets so as not to get tripped up by modern cache-busting techniques like appending junk strings to the end
03:16 πŸ”— omf_ support web font fetching
03:17 πŸ”— omf_ The list I have is a few pages long
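
Two of the features above lend themselves to a small illustration: crash-safe stop/resume (persist the frontier and the finished set after every fetch) and a scalable pool of fetch threads. The following is a minimal sketch under those assumptions; the file name, state format and fetch logic are invented for illustration and are not the tool being described in the channel.

    import json
    import os
    import queue
    import threading
    import urllib.request

    STATE_FILE = "crawl-state.json"   # hypothetical on-disk state
    state_lock = threading.Lock()

    def load_state(seeds):
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                return json.load(f)
        return {"todo": list(seeds), "done": []}

    def save_state(state):
        # write-then-rename so a crash never leaves a half-written state file
        tmp = STATE_FILE + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, STATE_FILE)

    def worker(q, state):
        while True:
            url = q.get()
            if url is None:
                break
            try:
                urllib.request.urlopen(url, timeout=30).read()
            except Exception as exc:
                print("failed:", url, exc)
            with state_lock:
                state["done"].append(url)
                state["todo"].remove(url)
                save_state(state)
            q.task_done()

    def crawl(seeds, threads=4):
        state = load_state(seeds)
        q = queue.Queue()
        for url in state["todo"]:
            q.put(url)
        workers = [threading.Thread(target=worker, args=(q, state)) for _ in range(threads)]
        for t in workers:
            t.start()
        q.join()
        for _ in workers:
            q.put(None)   # shut the pool down once the queue drains

    if __name__ == "__main__":
        crawl(["http://example.com/"])
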
03:17 πŸ”— chfoo i have lots of free time right now. i might be able to work on a project like this
03:17 πŸ”— nwh is there a 'perfect' wget setup that will let me pull down an entire domain, and assets hosted on external ones too?
03:18 πŸ”— omf_ If you are willing to pull some cruft in at the same time then yes
03:18 πŸ”— nwh I either miss things that I shouldn't, or end up with wget running off and trying to archive the whole interwho
03:18 πŸ”— SketchCow rgdfgdfgdfgfg
03:18 πŸ”— SketchCow Going to bed
03:18 πŸ”— nwh omf_: hit me.
03:18 πŸ”— omf_ That is the problem. Getting everything can be really big
03:18 πŸ”— SketchCow Machine is in MUCH, MUCH better shape
03:18 πŸ”— SketchCow That is
03:18 πŸ”— SketchCow al
03:18 πŸ”— SketchCow l
03:19 πŸ”— nwh it's a small site, just with image resources hosted on completely random servers, more than I can whitelist
03:19 πŸ”— omf_ http://paste.archivingyoursh.it/
03:19 πŸ”— omf_ http://paste.archivingyoursh.it/towiguremo.hs
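
The pasted command itself is not preserved in this log. For context, the usual answer to nwh's question combines --page-requisites (fetch the images/CSS/JS a page needs, even from other hosts once --span-hosts is allowed) with --warc-file for an archival copy; note that --span-hosts together with --mirror will also follow ordinary links off-site, which is the cruft omf_ mentions, and --domains can rein that in. A hedged sketch wrapping wget from Python, since that is how many grab scripts drive it; the flags are real wget options, but this is a plausible reconstruction, not necessarily what was pasted.

    import subprocess

    def mirror(url, warc_name):
        # Plausible "whole domain plus externally hosted assets" invocation;
        # not necessarily the command omf_ pasted.
        cmd = [
            "wget",
            "--mirror",
            "--page-requisites",   # fetch images, CSS, JS needed to render each page
            "--span-hosts",        # allow requisites (and, with --mirror, links) on other hosts
            "--convert-links",
            "--adjust-extension",
            "--warc-file", warc_name,
            "-e", "robots=off",
            "--wait", "1",
            url,
        ]
        subprocess.check_call(cmd)

    if __name__ == "__main__":
        mirror("http://example.com/", "example.com")
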
03:20 πŸ”— omf_ The other major feature I am working on is driving a headless webkit browser to do the actual page loads and navigation. This allows smarter processing of javascript
03:21 πŸ”— nwh now that is looking nicer
03:21 πŸ”— nwh PhantomJS-style?
03:21 πŸ”— omf_ webkit + gtk
03:21 πŸ”— omf_ PhantomJS is not thread safe
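
For illustration, driving a WebKit view from Python through the GTK introspection bindings looks roughly like this; it assumes PyGObject and WebKit2GTK are installed, and it is only a sketch of the idea, not the libcurl/libxml2 tool being described.

    import gi
    gi.require_version("Gtk", "3.0")
    gi.require_version("WebKit2", "4.0")
    from gi.repository import Gtk, WebKit2

    def on_load_changed(view, event):
        # fires several times per load; FINISHED means JavaScript has had a chance to run
        if event == WebKit2.LoadEvent.FINISHED:
            print("loaded:", view.get_uri(), "title:", view.get_title())
            Gtk.main_quit()

    view = WebKit2.WebView()
    view.connect("load-changed", on_load_changed)

    window = Gtk.OffscreenWindow()   # render without putting a window on screen
    window.add(view)
    window.show_all()

    view.load_uri("http://example.com/")
    Gtk.main()
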
03:22 πŸ”— chronomex omf_: https://github.com/iramari/WarcProxy -> transparent proxy, everything goes into a warc file
03:22 πŸ”— chronomex that + headless browser would do great
03:24 πŸ”— omf_ Doesn't support https and to make it do that you lose the async nature of the application
03:24 πŸ”— chronomex ah true
03:24 πŸ”— chronomex well ok
03:26 πŸ”— chfoo omf_, how much work have you put into this so far?
03:26 πŸ”— omf_ I have test crawled 5 million web pages
03:27 πŸ”— omf_ I have tried 40 different libraries, from html parsing and uri verification to file-based persistent storage that saves state on every variable update
03:27 πŸ”— omf_ I need to finish up all the warc stuff
03:27 πŸ”— nwh for a small personal archive, am I better off messing around with WARC files, or just dumping into HTML
03:29 πŸ”— omf_ All that is left is the warc stuff really
03:30 πŸ”— omf_ It is multi-threaded, event driven and fast. It uses libcurl for all the network stuff and libxml2 for parsing
03:32 πŸ”— omf_ The warc format has no compliance test suite so I am building off the ISO draft, which is very bland reading and not as helpful as I would have thought
03:33 πŸ”— omf_ There are only 2 programs I know of that have test suites covering warc
03:33 πŸ”— omf_ https://github.com/internetarchive/warc
03:34 πŸ”— omf_ and Heritrix. wget does not have any code testing for warc support
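
For reference, the internetarchive/warc project linked above is a Python library; a round-trip along these lines is the kind of behaviour its tests exercise. The exact constructor and accessor names may differ between versions of the library, so treat this as a sketch.

    import warc

    # write a minimal record
    out = warc.open("test.warc.gz", "w")
    record = warc.WARCRecord(payload="Hello WARC",
                             headers={"WARC-Target-URI": "http://example.com/"})
    out.write_record(record)
    out.close()

    # read it back
    for rec in warc.open("test.warc.gz"):
        print(rec.type, rec["WARC-Target-URI"], rec["Content-Length"])
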
03:36 πŸ”— chfoo what language did you code it in? is the project hosted online?
03:36 πŸ”— omf_ It is going online once I get warc support in
03:36 πŸ”— chfoo i'm just wondering if its duplicated effort if i write a replacement for wget
03:37 πŸ”— omf_ There are more than 3 wget replacements out there
03:37 πŸ”— omf_ none with warc support but that is covered ground
03:38 πŸ”— omf_ wget, curl, aria2, zsync, httrack, aget, mulk
03:39 πŸ”— omf_ and the list goes on and on
03:44 πŸ”— chfoo that's true. but most of them don't fit the needs for large scale archives i assume.
03:44 πŸ”— omf_ zsync can grab some pretty big things
03:45 πŸ”— omf_ In terms of archiving almost all of them fall short since they do not have warc, arc or har support
03:45 πŸ”— omf_ warc support and smart url parsing are the two major things existing apps would need
03:46 πŸ”— chfoo i see
03:46 πŸ”— omf_ If you pop open the wget code the parser is a mess
03:47 πŸ”— omf_ The comments mention how it is simple and does not handle all cases for linked files like in CSS
03:47 πŸ”— omf_ and no JS ability
03:49 πŸ”— omf_ Here is html-parse.c http://paste.archivingyoursh.it/fajucuwaca.vbs
03:49 πŸ”— omf_ The comments are very telling
03:50 πŸ”— chfoo yes, they are
03:51 πŸ”— omf_ 3 html parsers
03:51 πŸ”— omf_ two of which were not needed since libxml2 was already released
03:51 πŸ”— omf_ See I never get why most of these apps reinvent the same shit. HTML parsing is hard so find the best library and stick with that
03:57 πŸ”— chfoo very true
03:57 πŸ”— omf_ I do url finding via xpath for the normal content and css selectors for css
03:58 πŸ”— omf_ xpath is way faster than it used to be
03:58 πŸ”— omf_ plus libxml2 allows you to stick in callbacks if junk html is found so you can patch over it if you want
03:58 πŸ”— omf_ wget just ignores it
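
What that looks like through libxml2's Python binding, lxml: XPath pulls link-carrying attributes out of recovered HTML, and a simple regex stands in for real CSS parsing here. This is only an illustration of the approach, not omf_'s code.

    import re
    from urllib.parse import urljoin

    import lxml.html

    def page_links(html, base_url):
        # libxml2's HTML parser recovers from junk markup instead of giving up
        doc = lxml.html.fromstring(html)
        doc.make_links_absolute(base_url)
        return doc.xpath("//a/@href | //img/@src | //script/@src | //link/@href")

    CSS_URL = re.compile(r"url\(\s*['\"]?([^'\")]+)")

    def css_links(css_text, base_url):
        # crude url(...) extraction; a real tool would use a CSS parser or selectors
        return [urljoin(base_url, u) for u in CSS_URL.findall(css_text)]
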
04:08 πŸ”— chfoo in the meantime, if there's a need for some sort of tool or processor that isn't being worked on, i might be interested.
04:09 πŸ”— omf_ another warc tool would be helpful
04:09 πŸ”— omf_ like you give it a warc file and it gives back stats on how many pages, images, etc. are in the warc
04:10 πŸ”— omf_ This would be where I would start on that http://code.hanzoarchives.com/warc-tools
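
A starting point for that stats idea, again using the internetarchive/warc Python library rather than warc-tools; header access is written dict-style here, which may need adjusting to the library version.

    import sys
    from collections import Counter

    import warc

    def warc_stats(path):
        by_type = Counter()
        by_content = Counter()
        for record in warc.open(path):
            by_type[record.type] += 1   # request, response, metadata, ...
            by_content[record.header.get("Content-Type", "unknown")] += 1
        return by_type, by_content

    if __name__ == "__main__":
        types, contents = warc_stats(sys.argv[1])
        print("record types:", dict(types))
        print("content types:", dict(contents))
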
04:12 πŸ”— chfoo the more warc tools, the better warc support?
04:19 πŸ”— chfoo ha ha, the day the number of warc tools matches the number of wget alternatives would make an archiver very happy.
04:21 πŸ”— chfoo i'll take a closer look at warc files and see what i can do
04:30 πŸ”— namespace HTML parsing is easy if all the HTML files follow the standard, and are completely correct...
04:30 πŸ”— namespace Like that wil ever happen.
04:30 πŸ”— namespace *will
10:45 πŸ”— SketchCow Oh hells yes. ANOTHER disk error on FOS.
10:46 πŸ”— Smiley D:
10:52 πŸ”— SketchCow I am inclined to think that maybe, just maybe, it's underlying hardware and not the actual disks.
10:53 πŸ”— Smiley Hmmm, or disks from the same batch/disks that are the same age?
10:54 πŸ”— Smiley One disk fails, another gets higher load, in turn fails, increasing the load on a third disk 3x
10:55 πŸ”— Smiley MTBF kind of points to that happening if all the disks are of similar age (i.e. none failed between when the system was originally set up and now).
11:10 πŸ”— SketchCow Yeah, but my years of adminning tell me that there's likely a more lingering issue upward, unless of course they're not replacing disks and just doing 'repairs'
11:10 πŸ”— SketchCow Which they might.
11:11 πŸ”— SketchCow Whoops, machine went off the air.
11:11 πŸ”— SketchCow Now I'm working locally. :)
11:12 πŸ”— Smiley Repairs..... on disks marked as bad? Oh lord tell me no.
13:53 πŸ”— omf_ Linux game tome actually did a database dump. Already grabbed a copy
14:03 πŸ”— godane ok
14:03 πŸ”— godane i was going to upload it
16:43 πŸ”— jerry66 http://xeroticmomentsx.blogspot.com/2013/04/sandra-romain-double-anal-gangbang.html
16:44 πŸ”— omf_ (╯°□°)╯︵ 99ʎɹɹǝɾ
17:30 πŸ”— Smiley D:
18:35 πŸ”— omf_ Do we need an ftp.cdrom.com archive on IA?
18:35 πŸ”— omf_ I couldn't find one
18:41 πŸ”— SketchCow I'd like one.
18:41 πŸ”— omf_ I will start pulling it down
18:44 πŸ”— omf_ This brings up warm and fuzzy feelings, the first file pulled down - http://paste.archivingyoursh.it/pelewesude.vhdl
18:48 πŸ”— Smiley hmmm
18:48 πŸ”— Smiley omf_: give me a list of urls? XD
18:48 πŸ”— Smiley Hornet Underground Volume 2 (Games) Beautiful multimedia works of art, produced by brilliant young minds, advertising nothing.
20:00 πŸ”— alard SketchCow / someone else: Do you know if there could be anything wrong with the S3 uploads? The formspring / posterous uploads have been failing since yesterday (after 30-40GB they receive an error 500 response).
20:10 πŸ”— forester ATTENTION***CONCORD DISPATCHERS BLACK MALE FOR MONEY AND PROPERT6Y. MIDENIGHT SHIFT WITH FATY STUFF AND HER GANG ST3EALIONG OUT OF UPS, MAIL, LARGE ITEMS INGTEREARING WIGTH FWECDERAL GOV. ALSO SGTEALING INTERNATIONAL ITEMS. THIS IS AN EMERGENCY STOP THIS GROUP TOO MANY DIED AND OVER 500 UNDER 18 MURDERED BY THIS GROUP
20:12 πŸ”— InitHello buh?
20:13 πŸ”— DFJustin never a dull moment on efnet
20:14 πŸ”— Nemo_ter aww lovely heroku https://wikipulse.herokuapp.com/
20:16 πŸ”— SketchCow alard: Yes
20:16 πŸ”— SketchCow We had a MAJOR outage
20:16 πŸ”— SketchCow Stuff should be better but might not
20:17 πŸ”— alard Ah, thanks. I started one uploader again ten minutes ago, so we'll see.
20:17 πŸ”— alard There's a 1.5TB backlog, but it's not urgent yet. (Still 1.5TB free.)
20:28 πŸ”— SketchCow FOS crashed hard again
20:28 πŸ”— SketchCow That shit is sick sick sick
20:28 πŸ”— SketchCow The good news is I jammed it down by, like, 3tb
20:28 πŸ”— SketchCow So it was working better.
20:28 πŸ”— SketchCow Also: http://teamarchive-1.us.archive.org:8088/mrtg/
20:44 πŸ”— Smiley ouch
20:51 πŸ”— SketchCow Teamarchive-1 is now down
20:52 πŸ”— SketchCow It's going to go into sick bay
20:56 πŸ”— alard It's overworked.
21:02 πŸ”— SketchCow I don't think so.
21:02 πŸ”— SketchCow I think we just literally have a fucked array under it.
21:02 πŸ”— SketchCow We're on a very nice journalling filesystem, so things aren't getting lost.
21:41 πŸ”— DFJustin so what you're saying is we literally downloaded an array to death
21:43 πŸ”— omf_ Achievement unlocked!
21:44 πŸ”— SketchCow Yes
21:44 πŸ”— SketchCow Also, DFJustin: additional_collections got folkscanomy and bitsavers
21:45 πŸ”— DFJustin \o/
22:48 πŸ”— omf_ Great news. I got my VHS-to-digital converter box back up. So if anyone has VHS tapes they want scanned in
22:48 πŸ”— omf_ I finally found the time to set it up and run it
22:49 πŸ”— omf_ I am converting a tape right now to make sure everything is tip top
22:52 πŸ”— dashcloud good luck- I've spent no end of time on that project
22:52 πŸ”— dashcloud what's your setup?
22:53 πŸ”— omf_ I got a vhs to dvd/digital conversion box
22:53 πŸ”— omf_ all in one unit I got years ago
22:53 πŸ”— dashcloud so, a DVD recorder?
22:53 πŸ”— dashcloud unit you hook up to your computer?
22:53 πŸ”— omf_ yep
22:55 πŸ”— dashcloud which one?
23:46 πŸ”— dashcloud hi guys, in case you're not following IA or SketchCow on twitter (you really should if you're on twitter) : http://internetarchive.tumblr.com/ - you could run the IA tumblr for a week
