[01:29] USB over IP has been available for a while, perhaps not in kernel but it's been a thing you could do.
[03:07] Does warcextract.py actually extract files or does it do something else that is considered "extract"?
[03:39] tfgbd: I can't tell you how good this warcviewer is, but it does run on Windows, and I did use it before on a small warc: https://github.com/odie5533/WarcQtViewer
[03:56] Oh, thanks.
[03:56] will check it out
[03:56] I'm working on trying the proxy in ubuntu right now
[03:57] This may be even better, though
[03:57] will have to see how it handles 10GB warcs
[03:57] Can you put this on the Wiki?
[03:57] I'd have never found this thing on my own
[04:11] this darn proxy won't work on linux either
[04:12] I get further along but it still doesn't seem to work
[04:12] When i go to http://warc I just get
[04:12] Please make sure that the WARC server is installed and active.
[04:12] WARC server unreachable!
[04:12] You need to start the special WARC server before you can browse the past.
[04:12] lol, obviously the server is started or I wouldn't see a message saying the server can't be reached
[04:13] it is only supposed to be run form localhost?
[04:14] from*
[04:15] yup, that was it. seems to work on the local machine
[04:16] nm, that doesn't work right either
[04:16] or it doesn't like links, at least
[04:20] Okay, just tried WarcQtViewer.
[04:20] It works but it's a bit limited
[04:20] it seems to only let you extract one file at a time
[04:20] I need to get thousands
[04:49] tfgbd: if you're seeing that message, you've probably got a cached page
[04:49] also, what WARC is this
[04:50] I think you are overcomplicating this
[05:12] Well, I'm using linux now
[05:12] Still having some issues with some of this stuff
[05:12] I'm trying warctozip.py on a network share atm
[05:12] let's see how far it goes
[05:16] this is the warc: https://archive.org/download/archiveteam_archivebot_go_150/wdl1.winworldpc.com-inf-20140913-144638-8m8pg-00001.warc.gz
[05:23] tfgbd: ok, so that 00001 indicates a sequence number; ArchiveBot generates 10 GB WARCs
[05:24] So they ARE multi-part?
[05:24] hold the fuck on
[05:24] the reason for that is that it ensures that you have to repeat an upload of at most 10 GB on failure
[05:24] they are not "multipart" in the sense that a single part is an incomplete archive
[05:24] So some of thest 500GB of archives may be repeats?
[05:24] no, they're not repeats
[05:24] these*
[05:25] But I don't think the site is even 500GB
[05:25] why are there 500GB of warcs?
[05:25] Different crawl dates?
[05:25] it probably contains extra stuff, like offsite page requisites
[05:25] not different crawl dates; ArchiveBot goes continuously at a site until it finishes the queue or the job is aborted
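To make the point above concrete: each numbered .warc.gz segment is a complete, self-contained WARC, not a fragment of a larger file. A minimal sketch of how you could verify that, assuming the warcio library (my choice of reader; the log itself doesn't name one) and a locally downloaded copy of the segment:

```python
# Count record types in one ArchiveBot segment to show it parses
# cleanly on its own. Assumes "pip install warcio" and that the
# segment (filename from the log) has been downloaded locally.
from collections import Counter

from warcio.archiveiterator import ArchiveIterator

counts = Counter()
with open("wdl1.winworldpc.com-inf-20140913-144638-8m8pg-00001.warc.gz", "rb") as f:
    for record in ArchiveIterator(f):
        counts[record.rec_type] += 1

print(dict(counts))  # e.g. warcinfo/request/response/metadata totals
```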
[05:25] as someone who can barely upload a gigabyte in the space of six or seven hours
[05:25] 10GB a part seems a bit high
[05:25] what happens if a site blocks it?
[05:26] then we abort the job when we notice and call it whatever
[05:26] BlueMaxim: file an issue, it works for us
[05:26] Or, I dunno. Do you not mind using up like 500GB of bandwidth for this?
[05:26] I don't care, I burn through 160 TB or so a month running ArchiveBot nodes
[05:26] I mean for the site owners
[05:27] most people don't care
[05:27] some do
[05:27] The site owner is right in #winboards and he was pissed at me for leeching even 20GB
[05:27] [00:27:09] yipdw: Job status: 19186 completed, 532 aborted, 137 failed, 48 in progress, 0 pending
[05:27] [00:27:09] !status
[05:27] most don't
[05:27] Do you do stuff to make it look like the requests come from different people and/or do the download slowly?
[05:27] (1) no (2) depends on the job
[05:27] Or do you hit them as hard as possible with as many connections as it will let you? ;P
[05:27] depends on the job
[05:28] archivebot has configurable concurrency and delay
[05:28] so, how did you do winworldpc?
[05:28] archivebot
[05:28] hence the collection
[05:28] but I mean what were the settings
[05:28] I don't recall the parameters because (1) they aren't stored in the job and (2) the settings may vary over time
[05:28] i see
[05:28] wpull saves the parameters in the warcinfo record
[05:29] even when they change over time?
[05:29] I don't recall seeing that in the warcinfo record, but I might just have missed it
[05:29] no, just the command line args
[05:29] well, considering it is mostly an http file server with a bunch of zipped OS installs, what would you usually do for that sort of thing
[05:29] I usually just throw it in archivebot
[05:30] but you can also just crawl the site with wget or whatever
[05:30] welp, warctozip.py failed again even under linux
[05:30] Do I really need more than 300mb?
[05:30] ram*
[05:31] does warcat not do what you want it to?
[05:33] warcat works fine on that WARC
[05:34] Will warcat let me extract all the files into directories/subdirectories?
[05:34] I didn't get there yet
[05:34] https://gist.github.com/anonymous/f8f92c6858259cb8c73c
[05:34] yes
[05:34] archivebot3@starnode:~/warctozip/extraction$ python3 -mwarcat extract ../wdl1.winworldpc.com-inf-20140913-144638-8m8pg-00001.warc.gz
[05:34] for completion, the command line
[05:35] also that's still running so yes that gist is incomplete
[05:35] 300 MB of RAM is probably pushing the low side
[05:36] how about 2GB?
[05:36] I tried with that and it still failed
[05:36] warcat or warctozip
[05:36] warctozip
[05:36] then try 2 GB with warcat and see what happens
[05:36] Will it work on NT?
[05:37] I don't know
[05:37] Will try
[05:37] warctozip worked but it still failed about 100MB into the convert
[05:37] died even sooner on the linux install with 300MB of ram
[05:37] ps reports memory rss spikes around 300-500 MB using warcat but it drops soon afterwards
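For the directories/subdirectories question, here is a hedged sketch of the general streaming approach: process one record at a time and write payloads straight to disk, so memory use stays flat regardless of archive size. It again assumes warcio as the reader; it is not warcat's actual implementation of extract:

```python
# Extract response payloads from a WARC into host/path directories,
# streaming each body in chunks so a 10 GB archive never has to fit
# in RAM. warcio is an assumption; the paths simply mirror the URLs.
import os
from urllib.parse import urlsplit

from warcio.archiveiterator import ArchiveIterator

def extract(warc_path, out_dir="extracted"):
    with open(warc_path, "rb") as f:
        for record in ArchiveIterator(f):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            parts = urlsplit(url)
            rel = parts.path.lstrip("/")
            if not rel or rel.endswith("/"):
                rel += "index.html"  # give directory URLs a file name
            dest = os.path.join(out_dir, parts.netloc, *rel.split("/"))
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            stream = record.content_stream()  # decodes transfer encoding
            with open(dest, "wb") as out:
                while True:
                    chunk = stream.read(65536)
                    if not chunk:
                        break
                    out.write(chunk)

extract("wdl1.winworldpc.com-inf-20140913-144638-8m8pg-00001.warc.gz")
```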
[05:38] I'll give er a go
[05:38] In the end, after all the frustration, I at least get to learn a bit about python, I guess.. ;P
[05:38] working fine here
[05:38] the frustration is self-induced
[05:39] Well, it's not really frustration. I consider it fun.
[05:39] But i tried at least 4 tools by now
[05:39] I'm not sure why you insisted on running warctozip in a configuration that few people here have ever used, demanded assistance with said configuration, and kept on running warctozip when warcat was recommended a while back
[05:39] but if that's fun I guess whatever floats your boat
[05:41] i would suggest in the future to look at the change dates of the software. generally i avoid software that hasn't been updated in years
[05:42] I saw no mention of warcat
[05:42] This is the first I've heard of it
[05:42] maybe a netsplit or a connection timeout
[05:43] but the full list is here: http://archiveteam.org/index.php?title=WARC
[05:43] I can't even find this warcat with google
[05:43] https://pypi.python.org/pypi/Warcat/
[05:44] google customizes their search indexes according to factors unknown, but that's the first search hit I get for "warcat"
[05:45] I searched for mwarcat
[05:45] why
[05:45] because: archivebot3@starnode:~/warctozip/extraction$ python3 -mwarcat extract
[05:45] the package is called warcat, -m specifies a module
[05:45] without a space?
[05:45] we don't call it mwarcat
[05:46] -mwarcat and -m warcat both work fine
[05:46] -m is an option
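On the -mwarcat confusion: -m is the standard Python interpreter option for running an installed module as a script, and the interpreter accepts its value with or without a space, so python3 -mwarcat and python3 -m warcat are the same invocation. A rough in-code equivalent, using the stdlib runpy machinery that -m is built on (the arguments here are hypothetical):

```python
# Roughly what "python3 -m warcat extract example.warc.gz" does:
# set argv and run the warcat package as __main__ via runpy.
import runpy
import sys

sys.argv = ["warcat", "extract", "example.warc.gz"]  # hypothetical args
runpy.run_module("warcat", run_name="__main__", alter_sys=True)
```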
[06:07] you could also just try downloading the files through the wayback machine
[06:07] that's the primary way these archives are intended to be used
[06:12] There is nothing there
[06:13] i understood that -m was an option but not using a space confused me
[06:13] I find it strange nobody here has tested this on or uses Windows
[06:14] while 90% of the rest of the world does
[06:14] oh I see it blocks wayback
[06:14] well a) this is a self-selecting group of huge nerds and b) nobody is really testing anything a whole lot anywhere because nobody is being paid
[06:15] I see
[06:15] but this stuff isn't even yours
[06:15] It's random 3rd party stuff you made a list of, I guess
[06:16] I personally use windows but I mainly download things rather than actually using them
[06:17] ?
[06:17] warcat was developed by chfoo
[06:17] warctozip was built by alard
[06:17] none of this is random third-party stuff; the reason why it's mentioned is because people who hang out here made it
[06:18] where did you get the opposite idea
[06:32] I suppose because the wiki has stuff by archive.org and other people too
[06:35] Do you "support" BSDs?
[06:35] depends on the maintainer
[06:39] I wasn't really expecting "support" but just a few suggestions. The wiki page isn't that helpful and I was kind of expecting the tools I was using to just work
[06:39] I mean how should I know the warctozip web site didn't support large files?
[06:41] you aren't supposed to know
[06:42] it's bad that it doesn't work but unless one of the small number of people who can fix it is paying attention there's no point in saying so more than once
[06:43] I'm more put off by your assumptions; I read them as entitlement
[06:43] file a bug on github and move on
[06:43] if we had paid staff maintaining this stuff ok fine
[06:43] that's just me though, there are much calmer people here
[06:44] yeah, you seem a bit harsh
[06:47] I basically agree with you that the stuff is not nearly as functional or user-friendly as it needs to be but I can't do much about it so it's annoying to keep reading it
[07:01] i gather it's still early days for a lot of this stuff, it's largely built for nerds who use linux, don't mind the command-line and can tolerate breakage
[07:01] perhaps you can help drive those changes you want to see, others may follow your lead
[08:45] I need a dectalk emulator FF addon...
[11:53] [08:13] I find it strange nobody here has tested this on or uses Windows
[11:53] [08:14] while 90% of the rest of the world does
[11:53] Windows is not a nice platform for scripting or archiving
[11:54] you'll find that lots of software, unless it *explicitly* targets non-techy end-users, won't run on Windows or won't be tested on Windows
[11:54] because the regular user group are advanced users and they've chosen to use Linux, OS X or some other Unix-like so that stuff won't break
[11:55] sure, 99%(?) of people playing AAA games will do so on Windows
[11:55] i get ya
[11:55] Windows is annoying with copying folder date stamps
[11:55] but that doesn't hold true for stuff like data processing (which is more or less what archiving is from a technical POV)
[11:55] I personally don't even bother testing *anything* I build in Windows or IE (I primarily do webdev)
[11:56] if it works, great - if it doesn't, so be it
[11:56] I only have an XP VM for reverse-engineering weird USB protocols :P
[11:56] tfgbd: not just that, it's annoying in lots of ways
[11:57] the perfect testcase is copying over some 3k files of 5kb
[11:57] on NTFS, it'll take you a few minutes
[11:57] on ext3/ext4, under a second
[11:58] if you work with lots of small files, that slowness gets very annoying very fast :)
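The small-file test described above is easy to reproduce. A self-contained sketch that builds roughly 3k files of 5 kB each and times a recursive copy; the directory names are made up:

```python
# Create ~3000 files of ~5 kB each, then time a recursive copy.
# On NTFS this tends to take far longer than on ext3/ext4, which
# is the slowness being described above.
import os
import shutil
import time

src, dst = "smallfiles-src", "smallfiles-dst"
os.makedirs(src, exist_ok=True)
for i in range(3000):
    with open(os.path.join(src, f"f{i:04d}.bin"), "wb") as f:
        f.write(os.urandom(5 * 1024))

start = time.monotonic()
shutil.copytree(src, dst)  # dst must not already exist
print(f"copied 3000 x 5 kB files in {time.monotonic() - start:.2f}s")
```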
[11:58] anyway, I was about to leave for the supermarket
[12:02] tfgbd: Just so you know, contributions are very welcome. ArchiveTeam is a small pack of individuals doing things. There's no backing from anyone really. There's no money, no paid time etc.
[12:17] uhm, anyone else being hammered with fake bt traffic?
[12:18] brb
[12:18] Amazing you did all this then
[12:22] that was unpleasant :o
[12:24] was not AT related though
[12:27] schbirid: bittorrent, or bt the ISP?
[12:27] torrent
[12:27] someone ddosed my torrent port
[12:28] well, tried to. or transmission did something bad before (which would not surprise me one bit)
[12:28] got ~30 kilobytes/s of empty announces :)
[12:28] no announces, i mean whatever things peers say to each other
[12:29] blame it on transmission, useless client
[13:10] so i'm grabbing more Japanology episodes
[13:11] anyways i uploaded feb 01 to 10 2008 videos for the funny or die collection
[13:22] joepie91: you're familiar with USB reverse engineering? if so, msg me- got a question for you
[13:34] heh, 55k different IPs
[13:34] that is not transmission's fault
[13:34] either what, torrentbytes or x264 is attacked
[14:00] dashcloud: not really, I just dicked around with it
[14:01] tfgbd: archiveteam infrastructure is really rather impressive and polished given the kind of operation it is
[14:01] it's not perfect, but I've gone "huh, neat" on a number of occasions
[14:01] :p
[14:02] thanks
[14:02] dashcloud: for context, http://cryto.net/~joepie91/areson
[14:02] haven't yet gotten further than that
[15:04] SketchCow: so it looks like you guys at least got over half of the images from twitpic
[15:05] the 500 million number is what's in the globalnews.ca article: http://globalnews.ca/news/1633807/800-million-twitpic-photos-to-vanish-from-the-web-saturday/
[15:06] Got 500m directly from Cloudfront, but it's literally 500 million unindexed pictures which, since Noah tweaked Twitpic to frustrate the efforts, will be hard to tie back to the actual twitpic URLs they originally lived at
[15:06] [as I understand it]
[15:07] It would have been easy to do - but he seemed to deliberately change the site to make it hard
[15:17] Since he's a douchebag
[15:21] https://web.archive.org/web/20141025151927/http://twitpic.com/19jjry
[15:21] we can still archive it
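Whether a given Twitpic URL made it into the Wayback Machine can also be checked programmatically. A small sketch against the Wayback Machine's public availability endpoint, using the page from the snapshot linked above; the endpoint is real, but treat the response handling as a best-effort assumption:

```python
# Ask the Wayback Machine's availability API whether a snapshot
# exists for the Twitpic page linked above.
import json
import urllib.parse
import urllib.request

page = "http://twitpic.com/19jjry"
api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(page, safe="")
with urllib.request.urlopen(api) as resp:
    data = json.load(resp)

closest = data.get("archived_snapshots", {}).get("closest")
print(closest["url"] if closest else "no snapshot recorded")
```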
[15:24] does anyone know of the blocks that you guys don't have?
[15:59] http://twitpic.com/eabw9h funny.. his pictures still work
[16:04] something has changed, midas, by the look of it
[16:04] #quitpic
[16:05] moving over
[16:17] [17:31] All those weeny commenters on HN, all "they can't afford the transfer costs" and stuff.
[16:17] link?
[16:18] two ticks
[16:18] k
[16:18] https://news.ycombinator.com/item?id=8472047
[16:18] process.nextTick(function(){ process.nextTick(function(){ getUrl(); });});
[16:18] :)
[16:20] "For those downvoting this - why don't you try paying for 800M images to be leached from the server of a failed company of yours with no hope of recouping a single dollar."
[16:20] :)
[16:20] um, yeah, you mean like a number of individuals have been attempting to do for the past weeks?
[16:21] >.>
[16:24] The 500 million number came from me.
[16:28] Why do I keep visiting the HN page
[16:30] It's like it has a big banner over the tunnel entrance saying "Tunnel of Cheeto-fingered Libertarian Tech Cultists" and motherfuck if I don't just get into the goddamn boat going "Well, maybe I'll be able to extract some value from [music starts] IT'S A WORLD OF CRUELTY A WORLD OF FATE / IT'S A WORLD OF THE POWER OF DOLLARS AND HATE"
[16:37] did you have the talk yet SketchCow ?
[17:06] No, that's Monday
[17:06] * SketchCow was just chatting with Rick Prelinger, the god
[19:17] .tw https://twitter.com/joepie91/status/526089515119968256
[19:17] TwitPic gets acquired. Again. I think. Until it doesn't after all, I suppose. Really, who the fuck knows by now? http://blog.twitpic.com/2014/10/twitpics-future/ (@joepie91)
[19:27] pikhq: so perhaps we should continue here
[19:27] :P
[19:28] anyway, the whole hard-to-contact issue also exists for Google
[19:28] Ya
[19:28] and a bunch of other companies
[19:28] it really bothers me
[19:28] That's really weird.
[19:28] they basically just run a skeleton crew for consumer customer support
[19:28] responding to abusemails and that kind of crap
[19:28] if you want any actual support, you're shit out of luck
[19:28] At least Google itself has a mailing address and phone number findable.
[19:29] the phone number for Google is useless
[19:29] it's just the corporate HQ
[19:29] they won't provide any support
[19:29] I suspect the same goes for the mailing address
[19:29] For the particular situation I was talking about, corporate HQ is what you *want*. :P
[19:29] they'll just tell you to "post on the Google Group for the product"
[19:29] "We'd like to make off with a copy of your content before you die".
[19:29] pikhq: not according to them, you don't
[19:30] corporate HQ contact details are for things that make them money, and for unavoidable legal things
[19:30] everything else is end-user support
[19:30] :p
[19:30] And fuck those guys.
[19:30] basically
[19:30] Google doesn't have actual customer support unless you're an enterprise customer
[21:11] So I'm up to 'Bongo' on adding metadata...
[21:12] This thing is huge D:
[21:12] SketchCow: the actual 'internet arcade' collection itself doesn't have any description, I don't know if that's deliberate (and I can't add one anyway).
[21:19] Smiley: you're working on arcade metadata as well?
[21:20] Nod
[21:23] I've been doing it, and I'm up to Loderunner- maybe I'll start from the bottom and work up then
[21:23] hmmm can do
[21:23] whatever works
[21:26] what collection is this? just curious
[21:28] pm BlueMaxim
[21:28] OK
[21:33] you got my mesg right?
[21:51] no smiley
[21:52] and now?
[21:52] yup :P
[21:53] odd
[21:53] wow this looks like a nice collection
[21:59] Nod
[22:37] Jebus, #-bs day today
[22:57] ersi: it's been overall busy and lively today
[22:57] no doubt thanks to Jason publicly declaring archival war on Twatpic
[22:57] :)
[22:58] or people reading the same news and splattering out over #at and starting the same discussion again
[22:58] Twitpic got acquired! etc
[22:59] but haha not really etc
[23:10] so looks like Canada had a power outage like i had in 2008
[23:11] based on News Hour Toronto for dec 2013
[23:12] my power outage post: http://godane.wordpress.com/2008/12/21/out-of-power-for-9-days/
[23:52] Smiley: I'm actually fixing that.
[23:53] The more people who help with the arcade thing the better
[23:53] I redid the structure of the software library: http://archive.org/details/softwarelibrary
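For the metadata work discussed above, archive.org item descriptions can also be edited from a script rather than one item at a time in the web UI. A hedged sketch using the internetarchive Python library (my suggestion; the log doesn't say what tooling was actually used), with a hypothetical item identifier, and assuming credentials have already been set up with "ia configure":

```python
# Set a description and subject on one archive.org item.
# "arcade_bongo" is a made-up identifier for illustration.
from internetarchive import modify_metadata

response = modify_metadata(
    "arcade_bongo",
    metadata={"description": "Arcade game: Bongo.", "subject": "arcade"},
)
print(response.status_code)  # 200 on success
```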