[00:12] 80GB
[00:45] Well, now it's officially a clusterfuck.
[00:51] You know what the Ello guy needed? The Svpply guy
[00:57] Calling Ello guy on skype
[00:57] ooh, that'll be fun
[01:05] I love how he says an export isn't a priority when he is holding other people's stuff. That is like giving your money to a bank that says "you can't get your money back now but don't worry, you can in the future."
[01:06] "plus we won't disappear with it. Promise."
[01:06] except in the case of banks, they are required to have insurance just for that sort of occurrence.
[01:08] "You can reassemble your money from these jars of pennies, right?"
[01:22] Okay, I just installed the warc-proxy
[01:22] This has terrible documentation
[01:23] Where the heck am I supposed to put the WARC file once I have the thing running?
[01:26] if you configured the http proxy, go to http://warc/ and you should see an add warc button
[01:26] Ahh
[01:27] I tried to use the Firefox addon but it isn't showing up in Firefox 30's Tools menu
[01:31] Okay, it's half working now.
[01:31] I can load the http://warc page but I just get a list of Python errors in a frame
[01:34] Does it not like spaces in the path?
[01:53] WHO HELLO
[01:53] Skype chat was good.
[01:54] I said, and I quote, "I am more than happy to call a truce until Ello does the next Stupid Thing."
[01:54] And there we are.
[01:54] yay
[01:54] i think
[02:02] Just keep an eye on them
[02:20] Jesus, why can't IA just use zip?
[02:20] I found reference to another format, .war, but those seem to just be renamed zips
[02:21] nm, those are jars
[02:25] still. This is a horrible "standard" if it's a pain in the ass to even get a file out of it
[02:28] Arguably it could have been designed more conveniently, but there are some features of WARC that they *really* want that nothing else really has.
[02:30] And honestly it's not that crazy or anything. It's more-or-less a stream of HTTP-ish encoded HTTP responses.
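[Editor's note: the "stream of HTTP-ish records" description above can be made concrete with a stdlib-only toy parser. This is an illustration, not warc-proxy's actual code; real WARC files have more framing details (gzip members, version quirks) than this sketch handles.]

```python
# Toy illustration of WARC's structure: a WARC file is a sequence of
# records, each looking much like an HTTP message -- header lines, a
# blank line, then Content-Length bytes of payload, then a blank
# record separator. (Sketch only; not a full WARC parser.)
import io

def read_warc_records(stream):
    """Yield (headers_dict, payload_bytes) for each record in a raw WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return
        if line in (b"\r\n", b"\n"):
            # blank lines between records
            continue
        if not line.startswith(b"WARC/"):
            raise ValueError("not a WARC record: %r" % line)
        headers = {}
        while True:
            hline = stream.readline().rstrip(b"\r\n")
            if not hline:
                break  # blank line ends the header block
            name, _, value = hline.partition(b":")
            headers[name.decode().strip()] = value.decode().strip()
        length = int(headers["Content-Length"])
        payload = stream.read(length)
        yield headers, payload

# Build a tiny one-record WARC in memory and read it back.
record = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/\r\n"
          b"Content-Length: 13\r\n"
          b"\r\n"
          b"HTTP-ish body"
          b"\r\n\r\n")
for headers, payload in read_warc_records(io.BytesIO(record)):
    print(headers["WARC-Type"], payload)
```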
[02:32] I guess there is just (annoyingly) limited interest in decoding it
[02:33] It sure would be nice if 7-zip or whatever could view and extract these
[02:33] Yeah, *that's* the sucky thing. There's not that much in the way of good tooling.
[02:33] Right now I'm getting ready to install this: https://github.com/iipc/openwayback/wiki/How-to-install
[02:35] that last python script doesn't seem to want to work
[02:43] Why do some of the websites you guys did have like 5 separate files?
[02:43] Are they all different?
[02:43] Split?
[02:43] Just continuations of the previous crawl so each file isn't too huge?
[02:55] Okay, it's not so bad using archive.org's conversion web service
[03:04] These guys must have some setup
[03:11] TFGBD_: WARCs aren't designed for file extraction, because there is no concept of "file" on the Web
[03:12] they are request/response recordings, and for archiving HTTP sessions, that is appropriate
[03:12] I see
[03:12] Though, I certainly see files in these dumps...
[03:12] before you knock something it helps to know what it is for
[03:12] I wasn't knocking it that bad.
[03:13] Mostly just complaining aloud.
[03:13] I'm good, now
[03:13] I'll just use archive.org's warc2zip service for now
[03:13] funny you mention that because it was written by the same guy who wrote warc-proxy
[03:15] Funny.
[03:15] I just downloaded a 1GB WARC with it and it compressed to 408 MB?!
[03:15] And there is way less in it than I expected
[03:15] what gives?
[03:15] Was the rest all just http responses?!
[03:15] there are a lot of factors
[03:16] if it was warc.gz then each WARC record is individually compressed
[03:16] there is a reason for that, and the reason is seekability
[03:16] yipdw: Though, it's ZIP that he's got.
[03:16] Or did the tool just choke on a 10GB warc?
[03:16] however you lose the benefits of solid compression
[03:16] ZIP compresses each file separately.
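[Editor's note: the per-record compression and "seekability" point above can be demonstrated with the standard library alone. A `.warc.gz` is independently gzipped records concatenated back to back, so a reader can decompress any one record if it knows the byte offset — this sketch is just an illustration of that property, not tied to any tool in the chat.]

```python
# Each "record" is gzipped on its own; the .warc.gz file is just the
# concatenation of those gzip members. Any member can be decompressed
# independently given its byte offset -- that's the seekability being
# discussed. The trade-off is losing solid compression across records.
import gzip

records = [b"record one", b"record two", b"record three"]
members = [gzip.compress(r) for r in records]
warc_gz = b"".join(members)          # what a .warc.gz looks like on disk

# A stock gzip reader happily walks all members in sequence...
assert gzip.decompress(warc_gz) == b"".join(records)

# ...but you can also jump straight to one member by offset:
offset = len(members[0])             # byte offset of the second record
second = warc_gz[offset:offset + len(members[1])]
assert gzip.decompress(second) == b"record two"
print("multi-member gzip round-trip ok")
```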
[03:17] It was a WARC.gz but the WARC.gz was 10GB
[03:17] iirc it chokes on over 2gb because of lack of zip64
[03:17] and it was still 10GB extracted, so no compression
[03:17] Oh, that sucks
[03:17] will it work better if I run it locally?
[03:17] it's a trivial fix in the local script
[03:17] Oh, you passed it a 10GB warc? Yeah, that'll probably choke. .zip doesn't handle archives that big.
[03:18] Darnit. Guess I'm back to square one. And it worked so well for the 40MB one... ;P
[03:18] Is there a WARC to gzipped files tool?
[03:19] you can try warcat's extract mode
[03:19] https://pypi.python.org/pypi/Warcat/
[03:20] https://github.com/alard/warctozip + https://gist.github.com/DopefishJustin/ae8262bede1b77d87709
[03:21] nice. Why isn't that in the live tool?
[03:21] no good reason
[03:22] looks like there is also a useful change in a pull request https://github.com/alard/warctozip/pull/1/files
[03:22] does the guy who made it come here?
[03:23] he used to but not for a while
[03:23] someone should update the official archive.org copy
[03:25] Maybe my problem with the proxy was I'm trying to use portable python
[03:29] When a WARC ends in 001, 002, etc...
[03:29] Does that mean it is a multi-part split warc?
[03:29] Is that a thing?
[03:31] Do I need to download all of them to get a proper dump of the files?
[03:32] As far as I know, no.
[03:41] Hmm, this warc2zip is an offline app
[03:41] is this what the web service is based on?
[03:43] Can't I download the web app?
[03:43] is this what I need?
[03:43] https://github.com/alard/warctozip-service
[03:46] Is there a zip64 diff for the web service version?
[04:06] nope
[04:09] Okay, so I have all the requirements for the web service installed in my python but how do I actually run this thing?
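[Editor's note: the "trivial fix" for the 2 GB choke comes down to enabling zip64 extensions in Python's `zipfile` module. Whether the linked pull request does exactly this is not confirmed here, but the mechanism is the `allowZip64` flag shown below. Notably, it defaulted to `False` in Python 2 — matching the failure described — and has defaulted to `True` since Python 3.4.]

```python
# Without zip64 extensions, the ZIP format caps archives and members at
# roughly 2-4 GB. Passing allowZip64=True lets zipfile write the zip64
# records needed for bigger data. (Tiny payload here just to show the flag.)
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED, allowZip64=True) as z:
    z.writestr("page.html", b"<html>hello</html>")

# Round-trip check: the archive reads back normally.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as z:
    assert z.read("page.html") == b"<html>hello</html>"
print("zip64-enabled archive written and read back")
```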
[04:09] The documentation sucks
[04:10] It's no longer giving errors but when I run it without an argument with python, it just starts and quits with no output
[04:10] it does create a stream_post.pyc but that's about it
[04:10] Does this need to run with apache or something?
[04:16] install the packages listed in requirements.txt, use a procfile runner like foreman or whatever
[04:17] the patch DFJustin supplied can be applied at line 160 of app.py
[04:22] Ah, okay
[04:23] that's what I needed. I'm not too familiar with python and had no idea what a procfile was
[04:24] it should really mention that in the documentation, no?
[04:25] maybe, but this had an audience of like two people and both people knew how to start it
[04:25] submit a PR
[04:25] Ahh, I get it
[04:25] It kind of amazes me, though
[04:26] I'd have thought there would be a huge team of big companies behind this format
[04:26] there are
[04:26] you are conflating WARC and the tools people build to operate on it
[04:26] well, correction
[04:26] there aren't any "big" companies behind this
[04:27] it has support from significant players in the sector where it matters; two of them are Hanzo Archives and Internet Archive
[04:27] if you ask Google they'll probably push HAR on you
[04:27] Ohh, so that's where the "hanzo tools" comes from
[04:27] I'm not familiar with hanzo
[04:27] is that a competitor to Archive.org?
[04:28] http://www.hanzoarchives.com/
[04:28] no
[04:37] Ah, I see
[04:37] legal stuff
[05:00] ugh, there is no foreman for win32...
[05:06] guess i'm SOL
[05:06] or is there some way to run it manually without the procfile?
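[Editor's note: a Procfile is just a plain-text file mapping process names to shell commands; a runner like foreman reads it and launches each entry. The exact command below is an assumption for illustration — warctozip-service's real Procfile may invoke the app differently.]

```
web: python app.py
```

With that file in the project directory, `foreman start` runs the `web` process; equivalently, you can skip foreman entirely and run the listed command by hand.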
[05:06] at least the offline tool works
[05:25] https://github.com/ddollar/foreman-windows
[05:25] yes there is
[05:26] although it is weird that it has Ruby and C# code in the same project
[05:26] in any case, running this on Windows is hard to support because most of us don't try to run this code on Windows
[05:27] you are likely to receive better support on something unixish
[05:29] Just looking at the twitpic grab tracker, can someone explain how so many users manage to get so many GB of data with so few items?
[05:29] they got in on the ground floor
[05:29] when we actually had images
[05:29] ah
[06:04] I understand, though, I'd rather not install cygwin or interix right now
[06:25] a VM is another option
[07:07] ugh, this stupid thing is giving me out of memory errors
[07:11] does it need a 64-bit python and OS install?
[14:44] netsplits \o/
[14:48] Boop
[14:48] -bs
[15:43] SketchCow: when you have a moment, can you please move the following items into the Archive Team collection: comeback_inn_forums-20140326, metamorphosisalpha.net_forums-20141022, starfrontiers.info_forum-20140324, pathfinderchronicler.net_grabs, fraternity_of_shadows_forum-20140325
[17:24] stupid warctozip
[17:24] it keeps failing at 134MB
[17:26] Hi all - I'm manually running the code (on a VPS) instead of using a Warrior VM. Is there any convenient way to find out the "most urgent" project I should run?
[17:26] Corion: unless you're using the warrior-code repo, no -- each project has its own codebase
[17:27] Corion: You could take a look at http://warriorhq.archiveteam.org/projects.json
[17:27] if you are running warrior-code(2) on a VPS then just set it to ArchiveTeam's Choice
[17:27] auto_project is what the warrior uses to work out the 'most important' job
[17:27] File "warctozip.py", line 63, in
[17:27] sys.exit(main(sys.argv))
[17:27] File "warctozip.py", line 42, in main
[17:27] dump_record(fh, outzip)
[17:27] File "warctozip.py", line 51, in dump_record
[17:27] leftover = message.feed(record.content[1])
[17:27] File "hanzo\httptools\messaging.py", line 576, in feed
[17:27] Kazzy: That sounds like what I wanted, thanks!
[17:27] TFGBD: wtf
[17:27] text = HTTPMessage.feed(self, text)
[17:27] File "hanzo\httptools\messaging.py", line 97, in feed
[17:27] text = self.feed_headers(text)
[17:27] File "hanzo\httptools\messaging.py", line 191, in feed_headers
[17:27] line, text = self.feed_line(text)
[17:27] File "hanzo\httptools\messaging.py", line 159, in feed_line
[17:27] text = str(self.buffer[pos:])
[17:27] MemoryError
[17:27] gah, sorry
[17:27] No flood protection here? A stray right-click easily wreaks havoc ;)
[17:27] didn't mean to paste it all
[17:28] but that is the error
[17:28] how about don't paste any of it and use a pastebin
[17:28] my bad
[17:28] also did you apply the zip64 change
[17:28] efnet doesn't kill you on that level of flooding, and there's no bots in the chan to do it either
[17:28] Anyway, thanks for the information - I'll look at whether I can automate that, or at least, send myself an email when the main project changes
[17:28] yes, still didn't work
[17:28] Corion: enjoy :)
[17:29] i tried it on a 64bit OS too
[17:29] should that matter?
[17:29] do I need to use a 64-bit python?
[17:30] Ehh, guess I'll spin up a Colinux and see how it goes there
[17:30] TFGBD: summarize your issue in one sentence?
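[Editor's note: the projects.json approach Kazzy suggests is easy to script. The field names below (`auto_project`, `projects`, `name`, `title`) are assumptions about the feed's schema, not confirmed in the chat, and a canned sample is parsed instead of fetching the live URL — treat this as a sketch only.]

```python
# Sketch: read the tracker's "ArchiveTeam's Choice" project out of a
# projects.json-style document. Field names are assumed, and we parse a
# hard-coded sample rather than fetching
# http://warriorhq.archiveteam.org/projects.json over the network.
import json

sample = json.dumps({
    "auto_project": "twitpic",
    "projects": [
        {"name": "twitpic", "title": "Twitpic"},
        {"name": "urlteam", "title": "URLTeam"},
    ],
})

data = json.loads(sample)
chosen = data["auto_project"]                       # the "most urgent" project
titles = {p["name"]: p["title"] for p in data["projects"]}
print("most urgent project:", titles[chosen])
```

A cron job diffing `auto_project` against its last value would cover Corion's "email me when the main project changes" idea.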
[17:30] (haven't been following convo)
[17:30] s'cool
[17:31] running warctozip-the-service on Windows and trying to use it to extract stuff from a 10 GB WARC
[17:31] yipdw: warctozip-the-service?
[17:31] joepie91: I tried using warctozip with the zip64 diff and it is still only extracting about 140MB of the 10GB warc
[17:31] the warc-to-zip service won't run at all
[17:31] I'm using the cli tool
[17:31] or, I couldn't get it to run
[17:32] taking a stab at the obvious: have you tried processing a different WARC and comparing whether it breaks at the same point?
[17:32] may be a special-characters-in-filename issue
[17:32] because Windows
[17:32] hmm
[17:32] (Windows is considerably less friendly to weird characters in filenames than Linux/OSX, in my experience)
[17:32] it worked with a 40MB warc
[17:32] (or well, I suppose that it's technically NTFS that's failing, not Windows)
[17:32] MemoryError and weird characters is a stretch
[17:32] anyway #-bs
[17:33] TFGBD: try to find one that's bigger than your failing file
[17:33] er
[17:33] than your failing position in the failing file *
[17:33] i'd have to download another one then
[17:33] right
[17:33] TFGBD: can you join #archiveteam-bs
[17:33] sure
[17:33] HI
[17:33] Had a nice chat with Canadian press about twitpic
[17:34] I'm sure they were thrilled just to be not talking about Wednesday's shooting
[17:34] can you say which news org?
[17:43] Global News
[17:44] I was in some other ... oh, Globe and Mail a day or two ago
[17:45] Nice... I'll try and remember to keep an eye on their newscasts
[17:59] SketchCow: yeah I saw that
[18:00] http://www.theglobeandmail.com/technology/digital-culture/the-race-to-archive-twitpic-before-800-million-pictures-vanish/article21199755/
[18:06] hm
[18:06] i wonder why twitpic are acting the way they are
[18:11] Carl Malamud in the house!!
[18:25] SketchCow: :D
[19:15] http://globalnews.ca/news/1633807/800-million-twitpic-photos-to-vanish-from-the-web-saturday/
[19:15] http://globalnews.ca/video/1633770/twitpic-is-about-to-shut-down-after-dispute-with-twitter
[21:47] oh wow
[21:47] here's to hoping peter chura (global winnipeg anchor) gets to mention that article
[21:48] * wp494 sets a recording for 6 pm news
[22:39] SketchCow: midas dropped this wonderful quote earlier in -bs, you are probably the most likely to be able to use it: clouds disappear when the heat is on