[02:08] can I make wget assume that everything already downloaded is not newer on the server? I need to continue a 14-day --mirror that stopped due to a full disk
[02:24] ivan`, The problem is that wget is going to HEAD-check everything it already got before continuing
[02:24] so basically resume is not very useful
[02:28] httrack is a step in the right direction with a resume and an update mode
[02:29] but these tools all worked on the idea of the internet in the 1990s
[02:29] now, with sites way bigger, things like stopping and starting a grab need to be first-class features in the software
[02:32] One of the reasons curl was created was to have a more performance-focused cli tool and a library to boot
[02:49] thanks. too bad.
[02:57] It was designed for a different time.
[02:57] I have been looking for more modern tooling but it is not encouraging
[03:10] would it be worth it to write a better alternative to wget/curl for our needs?
[03:11] I've thought about that a bit
[03:11] I have been working on it for a few months. The hard part is robustly testing it
[03:11] not in great detail though
[03:12] Here are the features we need that would be extremely helpful
[03:12] The ability to start, stop or crash the program with no data loss and the ability to resume
[03:13] a display mode like httrack --display which shows the status of multiple file fetches at once instead of just endless wget scrolling
[03:13] Threaded page fetches so you can scale up and down the number of grabs
[03:13] Having the application be event-based so you can put it in interactive mode and change settings dynamically
[03:14] Everything our standard wget command does, as the default
[03:14] The ability to sort hits to other domains into another grab pool which can be paused
[03:15] this allows complex domain spanning and the ability to filter out ad networks
[03:16] url redirection tracking so you never get the same page twice
[03:16] smart fetching of CSS and linked assets so as not to get tripped up by modern cache-busting techniques like appending junk strings on the end
[03:16] support web font fetching
[03:17] The list I have is a few pages long
[03:17] i have lots of free time right now. i might be able to work on a project like this
[03:17] is there a 'perfect' wget setup that will let me pull down an entire domain, and assets hosted on external ones too?
[03:18] If you are willing to pull some cruft in at the same time then yes
[03:18] I either miss things that I shouldn't, or end up with wget running off and trying to archive the whole interwho
[03:18] rgdfgdfgdfgfg
[03:18] Going to bed
[03:18] omf_: hit me.
[03:18] That is the problem, getting everything can be real big
[03:18] Machine is in MUCH, MUCH better shape
[03:18] That is
[03:18] al
[03:18] l
[03:19] it's a small site, just with image resources hosted on completely random servers, more than I can whitelist
[03:19] http://paste.archivingyoursh.it/
[03:19] http://paste.archivingyoursh.it/towiguremo.hs
[03:20] The other major feature I am working on is driving a headless webkit browser to do the actual page loads and navigation. This allows smarter processing of javascript
[03:21] now that is looking nicer
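The headless-browser fetching described just above is easy to prototype today. A minimal sketch, assuming Python with Selenium driving headless Chrome (not the WebKit+GTK engine discussed in the following lines); the URL is a placeholder and the link collection is purely illustrative:

```python
# Minimal sketch: fetch a page with a headless browser so that
# javascript-generated content is visible, then collect the hrefs.
# Assumes Python + Selenium with a headless Chrome/chromedriver install;
# this illustrates the idea, it is not the WebKit+GTK tool being described.
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")       # run without a display

driver = webdriver.Chrome(options=options)
try:
    driver.get("http://example.com/")    # placeholder URL
    rendered_html = driver.page_source   # DOM after scripts have run
    links = {a.get_attribute("href")
             for a in driver.find_elements(By.TAG_NAME, "a")
             if a.get_attribute("href")}
    print(len(rendered_html), "bytes of rendered HTML,", len(links), "links")
finally:
    driver.quit()
```

Because the page is rendered before extraction, links injected by javascript show up here where a plain wget-style fetch would miss them.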
[03:21] PhantomJS-style?
[03:21] webkit + gtk
[03:21] PhantomJS is not thread safe
[03:22] omf_: https://github.com/iramari/WarcProxy -> transparent proxy, everything goes into a warc file
[03:22] that + headless browser would do great
[03:24] Doesn't support https, and to make it do that you lose the async nature of the application
[03:24] ah true
[03:24] well ok
[03:26] omf_, how much work have you put into this so far?
[03:26] I have test-crawled 5 million web pages
[03:27] I have tried 40 different libraries, from html parsing and uri verification to file-based persistent storage on variable update to save state
[03:27] I need to finish up all the warc stuff
[03:27] for a small personal archive, am I better off messing around with WARC files, or just dumping into HTML?
[03:29] All that is left is the warc stuff really
[03:30] It is multi-threaded, event-driven and fast. It uses libcurl for all the network stuff and libxml2 for parsing
[03:32] The warc format has no compliance test suite, so I am building off the ISO draft, which is very bland reading and not as helpful as I would have thought
[03:33] There are only 2 programs I know of that have tests for warc
[03:33] https://github.com/internetarchive/warc
[03:34] and Heritrix. wget does not have any code testing for warc support
[03:36] what language did you code it in? is the project hosted online?
[03:36] It is going online once I get warc support in
[03:36] i'm just wondering if it's duplicated effort if i write a replacement for wget
[03:37] There are more than 3 wget replacements out there
[03:37] none with warc support, but that is covered ground
[03:38] wget, curl, aria2, zsync, httrack, aget, mulk
[03:39] and the list goes on and on
[03:44] that's true. but most of them don't fit the needs for large scale archives i assume.
[03:44] zsync can grab some pretty big things
[03:45] In terms of archiving almost all are fails since they do not have warc, arc or har support
[03:45] warc support and smart url parsing are the two major things existing apps would need
[03:46] i see
[03:46] If you pop open the wget code the parser is a mess
[03:47] The comments mention how it is simple and does not handle all cases for linked files, like in CSS
[03:47] and no JS ability
[03:49] Here is html-parse.c http://paste.archivingyoursh.it/fajucuwaca.vbs
[03:49] The comments are very telling
[03:50] yes, it is
[03:51] 3 html parsers
[03:51] two of which were not needed since libxml2 was already released
[03:51] See, I never get why most of these apps reinvent the same shit. HTML parsing is hard, so find the best library and stick with that
[03:57] very true
[03:57] I do url finding via xpath for the normal content and css selectors for css
[03:58] xpath is way faster than it used to be
[03:58] plus libxml2 allows you to stick in callbacks when junk html is found, so you can patch over it if you want
[03:58] wget just ignores it
[04:08] in the meantime, if there's a need for some sort of tool or processor that isn't being worked on, i might be interested.
[04:09] another warc tool would be helpful
[04:09] like you give it a warc file and it gives back stats on how many pages, images, etc. are in the warc
[04:10] This would be where I would start on that: http://code.hanzoarchives.com/warc-tools
[04:12] the more warc tools, the better warc support?
[04:19] ha ha, the day the number of warc tools matches the number of wget alternatives would make an archiver very happy.
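The xpath-based url finding mentioned a few lines up is simple to illustrate. A minimal sketch, assuming Python's lxml (a wrapper around the same libxml2 discussed here), with tolerant parsing of junk markup; the xpath expressions and the sample page are illustrative, not the actual project's rules:

```python
# Minimal sketch of xpath-based link extraction over forgiving HTML parsing,
# assuming Python + lxml (which wraps libxml2). The expressions below are
# examples only and are nowhere near exhaustive.
import re
from lxml import html

def extract_urls(page_bytes, base_url):
    # lxml's HTML parser recovers from broken markup instead of bailing out,
    # much like the libxml2 "patch over it" behaviour described above.
    doc = html.fromstring(page_bytes, base_url=base_url)
    doc.make_links_absolute(base_url)

    urls = set()
    # Ordinary document links and embedded assets via xpath.
    urls.update(doc.xpath("//a/@href | //img/@src | //script/@src"))
    urls.update(doc.xpath("//link[@rel='stylesheet']/@href"))

    # CSS can also pull in assets (fonts, background images) with url(...),
    # so scan <style> blocks and style attributes as well.
    css_url = re.compile(r"url\(\s*['\"]?([^'\")\s]+)")
    for style in doc.xpath("//style/text() | //@style"):
        urls.update(css_url.findall(style))
    return urls

if __name__ == "__main__":
    sample = b"<html><body><a href='/next'>next<img src=logo.png></body>"
    print(extract_urls(sample, "http://example.com/"))
```

Note that the sample input is deliberately malformed (unclosed tags, unquoted attributes) and still parses, which is the whole point of leaning on libxml2 instead of a hand-rolled parser.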
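The "give it a warc file and it gives back stats" idea above is small enough to sketch from nothing but the record layout in the ISO draft (version line, named header fields, Content-Length, blank line, block, two trailing CRLFs). This is a rough, assumption-laden example in Python: it expects well-formed records, handles plain or per-record-gzipped files, and is no substitute for a real library such as the internetarchive/warc project linked above:

```python
#!/usr/bin/env python3
"""Rough WARC stats: record counts by WARC-Type and response Content-Type.

A sketch built from the WARC record layout in the ISO draft, not a
validator. For real work build on a proper WARC library such as
https://github.com/internetarchive/warc.
"""
import gzip
import sys
from collections import Counter

def open_warc(path):
    # .warc.gz files are typically one gzip member per record; Python's
    # gzip module reads concatenated members transparently.
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")

def records(stream):
    while True:
        version = stream.readline()
        if not version:
            return                       # end of file
        if not version.startswith(b"WARC/"):
            continue                     # skip stray blank lines between records
        headers = {}
        for line in iter(stream.readline, b"\r\n"):
            if line in (b"", b"\n"):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", 0))
        block = stream.read(length)
        stream.read(4)                   # trailing \r\n\r\n after the block
        yield headers, block

def main(path):
    types, mimes = Counter(), Counter()
    for headers, block in records(open_warc(path)):
        types[headers.get("warc-type", "?")] += 1
        if headers.get("warc-type") == "response":
            # The block is an HTTP response; its Content-Type header tells
            # us whether this record was a page, an image, and so on.
            for line in block.split(b"\r\n\r\n", 1)[0].split(b"\r\n")[1:]:
                if line.lower().startswith(b"content-type:"):
                    mime = line.split(b":", 1)[1].split(b";")[0].strip()
                    mimes[mime.decode("utf-8", "replace")] += 1
                    break
    print("record types:", dict(types))
    print("response content types:", dict(mimes))

if __name__ == "__main__":
    main(sys.argv[1])
```

Run as `python3 warcstats.py crawl.warc.gz` to get a count of request/response/metadata records and a breakdown of response MIME types (text/html vs image/* roughly maps to "pages vs images").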
[04:21] i'll take a closer look at warc files and see what i can do
[04:30] HTML parsing is easy if all the HTML files follow the standard and are completely correct...
[04:30] Like that will ever happen.
[10:45] Oh hells yes. ANOTHER disk error on FOS.
[10:46] D:
[10:52] I am inclined to think that maybe, just maybe, it's the underlying hardware and not the actual disks.
[10:53] Hmmm, or disks from the same batch / disks that are the same age?
[10:54] One disk fails, another gets higher load, in turn fails, increasing the load on a third disk 3x
[10:55] MTBF kind of points to that happening if all the disks are of similar age (i.e. none failed between when the system was originally set up and now).
[11:10] Yeah, but my years of adminning tell me that there's likely a more lingering issue upward, unless of course they're not replacing disks and just doing 'repairs'
[11:10] Which they might.
[11:11] Whoops, machine went off the air.
[11:11] Now I'm working locally. :)
[11:12] Repairs..... on disks marked as bad? Oh lord tell me no.
[13:53] The Linux Game Tome actually did a database dump. Already grabbed a copy
[14:03] ok
[14:03] i was going to upload it
[16:43] http://xeroticmomentsx.blogspot.com/2013/04/sandra-romain-double-anal-gangbang.html
[16:44] (╯°□°)╯︵ 99ʎɹɹǝɾ
[17:30] D:
[18:35] Do we need an ftp.cdrom.com archive on IA?
[18:35] I couldn't find one
[18:41] I'd like one.
[18:41] I will start pulling it down
[18:44] This brings up warm and fuzzy feelings, the first file pulled down - http://paste.archivingyoursh.it/pelewesude.vhdl
[18:48] hmmm
[18:48] omf_: give me a list of urls? XD
[18:48] Hornet Underground Volume 2 (Games) Beautiful multimedia works of art, produced by brilliant young minds, advertising nothing.
[20:00] SketchCow / someone else: Do you know if there could be anything wrong with the S3 uploads? The formspring / posterous uploads have been failing since yesterday (after 30-40GB they receive an error 500 response).
[20:10] ATTENTION***CONCORD DISPATCHERS BLACK MALE FOR MONEY AND PROPERT6Y. MIDENIGHT SHIFT WITH FATY STUFF AND HER GANG ST3EALIONG OUT OF UPS, MAIL, LARGE ITEMS INGTEREARING WIGTH FWECDERAL GOV. ALSO SGTEALING INTERNATIONAL ITEMS. THIS IS AN EMERGENCY STOP THIS GROUP TOO MANY DIED AND OVER 500 UNDER 18 MURDERED BY THIS GROUP
[20:12] buh?
[20:13] never a dull moment on efnet
[20:14] aww lovely heroku https://wikipulse.herokuapp.com/
[20:16] alard: Yes
[20:16] We had a MAJOR outage
[20:16] Stuff should be better but might not be
[20:17] Ah, thanks. I started one uploader again ten minutes ago, so we'll see.
[20:17] There's a 1.5TB backlog, but it's not urgent yet. (Still 1.5TB free.)
[20:28] FOS crashed hard again
[20:28] That shit is sick sick sick
[20:28] The good news is I jammed it down by, like, 3TB
[20:28] So it was working better.
[20:28] Also: http://teamarchive-1.us.archive.org:8088/mrtg/
[20:44] ouch
[20:51] Teamarchive-1 is now down
[20:52] It's going to go into sick bay
[20:56] It's overworked.
[21:02] I don't think so.
[21:02] I think we just literally have a fucked array under it.
[21:02] We're on a very nice journalling filesystem, so things aren't getting lost.
[21:41] so what you're saying is we literally downloaded an array to death
[21:43] Achievement unlocked!
[21:44] Yes
[21:44] Also, DFJustin: additional_collections got folkscanomy and bitsavers
[21:45] \o/
[22:48] Great news. I got my VHS-to-digital converter box back up.
So if anyone has VHS tapes they want scanned in
[22:48] I finally found the time to set it up and run it
[22:49] I am converting a tape right now to make sure everything is tip-top
[22:52] good luck - I've spent no end of time on that project
[22:52] what's your setup?
[22:53] I got a VHS to DVD/digital conversion box
[22:53] an all-in-one unit I got years ago
[22:53] so, a DVD recorder?
[22:53] a unit you hook up to your computer?
[22:53] yep
[22:55] which one?
[23:46] hi guys, in case you're not following IA or SketchCow on twitter (you really should if you're on twitter): http://internetarchive.tumblr.com/ - you could run the IA tumblr for a week