#archiveteam 2014-10-24,Fri


Time Nickname Message
00:12 🔗 SketchCow 80Gb
00:45 🔗 SketchCow Well, now it's officially a clusterfuck.
00:51 🔗 SketchCow You know what the Ello guy needed? The Svpply guy
00:57 🔗 SketchCow Calling Ello guy on skype
00:57 🔗 garyrh ooh, that'll be fun
01:05 🔗 aaaaaaaaa I love how he says an export isn't a priority when he is holding other people's stuff. That is like giving your money to a bank that says "you can't get your money back now but don't worry, you can in the future."
01:06 🔗 aaaaaaaaa "plus we won't disappear with it. Promise."
01:06 🔗 aaaaaaaaa except in the case of banks, they are required to have insurance just for that sort of occurrence.
01:08 🔗 garyrh "You can reassemble your money from these jars of pennies, right?"
01:22 🔗 TFGBD_ Okay, I just installed the warc-proxy
01:22 🔗 TFGBD_ This has terrible documentation
01:23 🔗 TFGBD_ Where the heck am I supported to put the WARC file once I have the thing running?
01:26 🔗 TFGBD_ supposed*
01:26 🔗 garyrh if you configured the http proxy, go to http://warc/ and you should see an add warc button
01:26 🔗 TFGBD_ Ahh
01:27 🔗 TFGBD_ I tried to use the Firefox addon but it isn't showing up in Firefox 30's Tools menu
01:31 🔗 TFGBD_ Okay, it's half working now.
01:31 🔗 TFGBD_ I can load the http://warc page but I just get a list of Python errors in a frame
01:34 🔗 TFGBD_ Does it not like spaces in the path?
01:53 🔗 SketchCow WHO HELLO
01:53 🔗 SketchCow Skype chat was good.
01:54 🔗 SketchCow I said, and I quote, "I am more than happy to call a truce until Ello does the next Stupid Thing."
01:54 🔗 SketchCow And there we are.
01:54 🔗 garyrh yay
01:54 🔗 garyrh i think
02:02 🔗 SketchCow Just keep an eye on them
02:20 🔗 TFGBD_ Jesus, why can't IA just use zip?
02:20 🔗 TFGBD_ I found reference to another format .war and those seem to just be renamed zips
02:21 🔗 TFGBD_ nm, those are jars
02:25 🔗 TFGBD_ still. This is a horrible "standard" if it's a pain in the ass to even get a file out of it
02:28 🔗 pikhq Arguably it could have been designed more conveniently, but there's some features of warc that they *really* want that nothing else really has.
02:30 🔗 pikhq And honestly it's not that crazy or anything. It's more-or-less a stream of HTTP-ish encoded HTTP responses.
02:32 🔗 TFGBD_ I guess there is just (annoyingly) limited interest in decoding it
02:33 🔗 TFGBD_ It sure would be nice if 7-zip or whatever could view and extract these
02:33 🔗 pikhq Yeah, *that's* the sucky thing. There's not that much in the way of good tooling.
02:33 🔗 TFGBD_ Right now I'm getting ready to install this: https://github.com/iipc/openwayback/wiki/How-to-install
02:35 🔗 TFGBD_ that least python script doesn't seem to want to work
02:36 🔗 TFGBD_ last&
02:43 🔗 TFGBD_ Why do some of the websites you guys did have like 5 separate files?
02:43 🔗 TFGBD_ Are they all different?
02:43 🔗 TFGBD_ Split?
02:43 🔗 TFGBD_ Just continuations of the previous crawl so each file isn't too huge?
02:55 🔗 TFGBD_ Okay, it's not so bad using archive.org's conversion web service
03:04 🔗 TFGBD_ These guys must have some setup
03:11 🔗 yipdw TFGBD_: WARCs aren't designed for file extraction, because there is no concept of "file" on the Web
03:12 🔗 yipdw they are request/response recordings, and for archiving HTTP sessions, that is appropriate
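To make the "request/response recording" point concrete, here is a minimal sketch of reading records from an uncompressed WARC stream. It assumes the usual record layout (a `WARC/` version line, named headers, a blank line, then `Content-Length` bytes of payload); `read_warc_records` and the sample record are hypothetical names and data made up for illustration, not part of warc-proxy or the hanzo tools.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # blank separator lines between records
        if not version.startswith(b"WARC/"):
            raise ValueError("not a WARC record: %r" % version)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes; for a "response"
        # record it is the raw HTTP response (status line, headers, body).
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Hypothetical sample record, built by hand for illustration.
body = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
sample = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example.com/\r\n",
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n",
    b"\r\n",
    body,
    b"\r\n\r\n",
])

records = list(read_warc_records(io.BytesIO(sample)))
headers, payload = records[0]
# The "file" a user expects is whatever follows the HTTP headers.
http_body = payload.partition(b"\r\n\r\n")[2]
```

Getting a "file" out therefore means picking the response records, stripping the HTTP headers, and inventing a filename from `WARC-Target-URI`, which is roughly the job an extraction tool has to do.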
03:12 🔗 TFGBD_ I see
03:12 🔗 TFGBD_ Though, I certainly see files in these dumps...
03:12 🔗 yipdw before you knock something it helps to know what it is for
03:12 🔗 TFGBD_ I wasn't knocking it that bad.
03:13 🔗 TFGBD_ Mostly just complaining aloud.
03:13 🔗 TFGBD_ I'm good, now
03:13 🔗 TFGBD_ I'll just use archive.org's warc2zip service for now
03:13 🔗 yipdw funny you mention that because it was written by the same guy who wrote warc-proxy
03:15 🔗 TFGBD_ Funny.
03:15 🔗 TFGBD_ I just downloaded a 1GB Warc with it and it compressed to 408 MB?!
03:15 🔗 TFGBD_ And there is way less in it than I expected
03:15 🔗 TFGBD_ what gives?
03:15 🔗 TFGBD_ Was the rest all just http responses?!
03:15 🔗 yipdw there are a lot of factors
03:16 🔗 yipdw if it was warc.gz then each WARC record is individually compressed
03:16 🔗 yipdw there is a reason for that, and the reason is seekability
03:16 🔗 pikhq yipdw: Though, it's ZIP that he's got.
03:16 🔗 TFGBD_ Or did the tool just choke on a 10GB warc?
03:16 🔗 yipdw however you lose the benefits of solid compression
03:16 🔗 pikhq ZIP compresses each file separately.
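The seekability trade-off can be sketched with the standard library alone. The example below compresses three made-up records as separate gzip members, concatenates them, and then decompresses only the member at a known offset; `read_member` and the record contents are hypothetical, and a real warc.gz reader would normally get its offsets from an index rather than from a list built in memory.

```python
import gzip
import zlib

records = [b"record one\n", b"record two\n", b"record three\n"]

# Compress each record as its own gzip member and remember where it starts.
offsets = []
blob = b""
for rec in records:
    offsets.append(len(blob))
    blob += gzip.compress(rec)


def read_member(data, offset):
    """Decompress only the gzip member that begins at `offset`."""
    # wbits = 16 + MAX_WBITS tells zlib to expect gzip framing.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Decompression stops at the end of this member; bytes belonging to
    # later members are left in decomp.unused_data.
    return decomp.decompress(data[offset:])


second = read_member(blob, offsets[1])
everything = gzip.decompress(blob)  # the whole file is still valid gzip
```

Because every member restarts the compressor, redundancy shared between records is never exploited, which is the loss of "solid compression" mentioned above.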
03:17 🔗 TFGBD_ It was a WARC.gz, but the WARC.gz was 10GB
03:17 🔗 DFJustin iirc it chokes on over 2gb because of lack of zip64
03:17 🔗 TFGBD_ and it was still 10GB extracted, so no compression
03:17 🔗 TFGBD_ Oh, that sucks
03:17 🔗 TFGBD_ will it work better if I run it locally?
03:17 🔗 DFJustin it's a trivial fix in the local script
03:17 🔗 pikhq Oh, you passed it a 10GB warc? Yeah, that'll probably choke. .zip doesn't handle archives that big.
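For reference, Python's `zipfile` module exposes the ZIP64 extensions through the `allowZip64` argument of `zipfile.ZipFile`; when it is off, writing an archive that needs ZIP64 raises `zipfile.LargeZipFile`. The sketch below only shows the flag on a tiny in-memory archive. Whether this is the exact change needed in warctozip is an assumption, and the entry name is made up for illustration.

```python
import io
import zipfile

buf = io.BytesIO()

# allowZip64=True lets zipfile switch to ZIP64 records when an archive
# grows past the limits of the classic ZIP format.
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED,
                     allowZip64=True) as zf:
    zf.writestr("example.com/index.html", b"<html>hello</html>")

# Read the archive back to confirm it is well formed.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    data = zf.read("example.com/index.html")
```

Small archives written this way are still ordinary ZIP files, since the ZIP64 records are only emitted when they are needed.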
03:18 🔗 TFGBD_ Darnit. Guess I'm back to square one. And it worked so well for the 40MB one... ;P
03:18 🔗 TFGBD_ Is there a WARC to gzipped files tool?
03:19 🔗 yipdw you can try warcat's extract mode
03:19 🔗 yipdw https://pypi.python.org/pypi/Warcat/
03:20 🔗 DFJustin https://github.com/alard/warctozip + https://gist.github.com/DopefishJustin/ae8262bede1b77d87709
03:21 🔗 TFGBD_ nice. Why isn't that in the live tool?
03:21 🔗 DFJustin no good reason
03:22 🔗 DFJustin looks like there is also a useful change in a pull request https://github.com/alard/warctozip/pull/1/files
03:22 🔗 TFGBD_ does the guy who made it come here?
03:23 🔗 DFJustin he used to but not for a while
03:23 🔗 TFGBD someone should update the official archive.org copy
03:25 🔗 TFGBD Maybe my problem with the proxy was I'm trying to use portable python
03:29 🔗 TFGBD When a WARC ends in 001, 002, etc...
03:29 🔗 TFGBD Does that mean it is a multi-part split warc?
03:29 🔗 TFGBD Is that a thing?
03:31 🔗 TFGBD Do I need to download all of them to get a proper dump of the files?
03:32 🔗 pikhq As far as I know, no.
03:41 🔗 TFGBD Hmm, this warc2zip is an offline app
03:41 🔗 TFGBD is this what the web service is based on?
03:43 🔗 TFGBD Can't I download the web app?
03:43 🔗 TFGBD is this what I need?
03:43 🔗 TFGBD https://github.com/alard/warctozip-service
03:46 🔗 TFGBD Is there a zip64 diff for the web service version?
04:06 🔗 DFJustin nope
04:09 🔗 TFGBD Okay, so I have all the requirements for the web service installed in my python but how do I actually run this thing?
04:09 🔗 TFGBD The documentation sucks
04:10 🔗 TFGBD It's no longer giving errors but when I run it without an argument with python, it just starts and quits with no output
04:10 🔗 TFGBD it does create a stream_post.pyc but that's about it
04:10 🔗 TFGBD Does this need to run with apache or something?
04:16 🔗 yipdw install the packages listed in requirements.txt, use a procfile runner like foreman or whatever
04:17 🔗 yipdw the patch DFJustin supplied can be applied at line 160 of app.py
04:22 🔗 TFGBD Ah, okay
04:23 🔗 TFGBD that's what I needed. I'm not too familiar with python and had no idea what a procfile was
04:24 🔗 TFGBD it should really mention that in the documentation, no?
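For readers who have not met a Procfile before: it is a plain text file with one `name: command` line per process, and a runner such as foreman starts each command. A minimal sketch of reading one by hand follows; `parse_procfile` is a hypothetical helper, and the sample line is invented for illustration and is not the actual Procfile from warctozip-service.

```python
import shlex


def parse_procfile(text):
    """Map each process name in a Procfile to its command, split into argv."""
    processes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, command = line.partition(":")
        if not sep:
            continue  # not a "name: command" line
        processes[name.strip()] = shlex.split(command)
    return processes


# Hypothetical Procfile contents, for illustration only.
sample = "web: python app.py --port 5000\n"
processes = parse_procfile(sample)

# To start the process without a runner, the argv list could be passed to
# subprocess.call(processes["web"]). Note that shlex does not expand
# environment variables such as $PORT; a runner normally supplies those.
```

In practice this means the command after the colon can usually be copied out of the Procfile and run directly in a shell, with any variables filled in by hand.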
04:25 🔗 yipdw maybe, but this had an audience of like two people and both people knew how to start it
04:25 🔗 yipdw submit a PR
04:25 🔗 TFGBD Ahh, I get it
04:25 🔗 TFGBD It kind of amazes me, though
04:26 🔗 TFGBD I'd have thought there would be a huge team of big companies behind this format
04:26 🔗 yipdw there are
04:26 🔗 yipdw you are conflating WARC and the tools people build to operate on it
04:26 🔗 yipdw well, correction
04:26 🔗 yipdw there aren't any "big" companies behind this
04:27 🔗 yipdw it has support from significant players in the sector where it matters; two of them are Hanzo Archives and Internet Archive
04:27 🔗 yipdw if you ask Google they'll probably push HAR on you
04:27 🔗 TFGBD Ohh, so that's where the "hanzo tools" comes from
04:27 🔗 TFGBD I'm not familiar with hanzo
04:27 🔗 TFGBD is that a competitor to Archive.org?
04:28 🔗 yipdw http://www.hanzoarchives.com/
04:28 🔗 yipdw no
04:37 🔗 TFGBD Ah, I see
04:37 🔗 TFGBD legal stuff
05:00 🔗 TFGBD ugh, there is no foreman for win32...
05:06 🔗 TFGBD guess i'm SOL
05:06 🔗 TFGBD or is there some way to run it manually without the procfile?
05:06 🔗 TFGBD at least the offline tool works
05:25 🔗 yipdw https://github.com/ddollar/foreman-windows
05:25 🔗 yipdw yes there is
05:26 🔗 yipdw although it is weird that it has Ruby and C# code in the same project
05:26 🔗 yipdw in any case, running this on Windows is hard to support because most of us don't try to run this code on Windows
05:27 🔗 yipdw you are likely to receive better support on something unixish
05:29 🔗 signius Just looking at the twitpic grab tracker, can someone explain how so many users manage to get so many GB of data with so few items?
05:29 🔗 yipdw they got in on the ground floor
05:29 🔗 yipdw when we actually had images
05:29 🔗 signius ah
06:04 🔗 TFGBD I understand, though I'd rather not install cygwin or interix right now
06:25 🔗 yipdw a VM is another option
07:07 🔗 TFGBD ugh, this stupid thing is giving me out of memory errors
07:11 🔗 TFGBD does it need a 64-bit python and OS install?
14:44 🔗 Muad_Dib netsplits \o/
14:48 🔗 SketchCow Boop
14:48 🔗 SketchCow -bs
15:43 🔗 SadDM SketchCow: when you have a moment, can you please move the following items into the Archive Team collection: comeback_inn_forums-20140326, metamorphosisalpha.net_forums-20141022, starfrontiers.info_forum-20140324, pathfinderchronicler.net_grabs, fraternity_of_shadows_forum-20140325
17:24 🔗 TFGBD stupid warctozip
17:24 🔗 TFGBD it keeps failing at 134MB
17:26 🔗 Corion Hi all - I'm manually running the code (on a VPS) instead of using a Warrior VM. Is there any convenient way to find out the "most urgent" project I should run?
17:26 🔗 yipdw Corion: unless you're using the warrior-code repo, no -- each project has its own codebase
17:27 🔗 Kazzy Corion: You could take a look at http://warriorhq.archiveteam.org/projects.json
17:27 🔗 yipdw if you are running warrior-code(2) on a VPS then just set it to ArchiveTeam's Choice
17:27 🔗 Kazzy auto_project is what the warrior uses to work out the 'most important' job
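A small sketch of automating that check: fetch the JSON from the URL Kazzy gave and read the `auto_project` field. The assumption that `auto_project` is a top-level key is not confirmed by the log, the function names are hypothetical, and the sample document is made up so the parsing step can be tried without network access.

```python
import json
import urllib.request

PROJECTS_URL = "http://warriorhq.archiveteam.org/projects.json"


def fetch_projects_text(url=PROJECTS_URL):
    """Download the projects document as text (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def pick_auto_project(projects_json_text):
    """Return the value of auto_project, or None if the key is absent.

    Assumption: auto_project is a top-level key in the document.
    """
    doc = json.loads(projects_json_text)
    return doc.get("auto_project")


# Made-up sample with the assumed shape, for offline illustration.
sample = '{"auto_project": "twitpic2", "projects": [{"name": "twitpic2"}]}'
current = pick_auto_project(sample)
```

Polling this from a cron job and comparing `current` against the previously seen value would be one way to get notified when the main project changes.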
17:27 🔗 TFGBD File "warctozip.py", line 63, in <module>
17:27 🔗 TFGBD sys.exit(main(sys.argv))
17:27 🔗 TFGBD File "warctozip.py", line 42, in main
17:27 🔗 TFGBD dump_record(fh, outzip)
17:27 🔗 TFGBD File "warctozip.py", line 51, in dump_record
17:27 🔗 TFGBD leftover = message.feed(record.content[1])
17:27 🔗 TFGBD File "hanzo\httptools\messaging.py", line 576, in feed
17:27 🔗 Corion Kazzy: That sounds like what I wanted, thanks!
17:27 🔗 yipdw TFGBD: wtf
17:27 🔗 TFGBD text = HTTPMessage.feed(self, text)
17:27 🔗 TFGBD File "hanzo\httptools\messaging.py", line 97, in feed
17:27 🔗 TFGBD text = self.feed_headers(text)
17:27 🔗 TFGBD File "hanzo\httptools\messaging.py", line 191, in feed_headers
17:27 🔗 TFGBD line, text = self.feed_line(text)
17:27 🔗 TFGBD File "hanzo\httptools\messaging.py", line 159, in feed_line
17:27 🔗 TFGBD text = str(self.buffer[pos:])
17:27 🔗 TFGBD MemoryError
17:27 🔗 TFGBD gah, sorry
17:27 🔗 Corion No flood protection here? A stray right-click easily wreaks havoc ;)
17:27 🔗 TFGBD didn't mean to paste it all
17:28 🔗 TFGBD but that is the error
17:28 🔗 yipdw how about don't paste any of it and use a pastebin
17:28 🔗 TFGBD my bad
17:28 🔗 yipdw also did you apply the zip64 change
17:28 🔗 Kazzy efnet doesn't kill you on that level of flooding, and there's no bots in the chan to do it either
17:28 🔗 Corion Anyway, thanks for the information - I'll look at whether I can automate that, or at least, send myself an email when the main project changes
17:28 🔗 TFGBD yes, still didn't work
17:28 🔗 Kazzy Corion: enjoy :)
17:29 🔗 TFGBD i tried it on a 64bit OS too
17:29 🔗 TFGBD should that matter?
17:29 🔗 TFGBD do I need to use a 64-bit python?
17:30 🔗 TFGBD Ehh, guess I'll spin up a Colinux and see how it goes there
17:30 🔗 joepie91 TFGBD: summarize your issue in one sentence?
17:30 🔗 joepie91 (haven't been following convo)
17:30 🔗 TFGBD s'cool
17:31 🔗 yipdw running warctozip-the-service on Windows and trying to use it to extract stuff from a 10 GB WARC
17:31 🔗 joepie91 yipdw: warctozip-the-service?
17:31 🔗 TFGBD joepie91: I tried using warctozip with the zip64 diff and it is still only extracting about 140MB of the 10GB warc
17:31 🔗 TFGBD warc-to-zip service won't run at all
17:31 🔗 TFGBD I'm using the cli tool
17:31 🔗 TFGBD or, I couldn't get it to run
17:32 🔗 joepie91 taking a stab at the obvious: have you tried processing a different WARC and comparing whether it breaks at the same point?
17:32 🔗 joepie91 may be a special-characters-in-filename issue
17:32 🔗 joepie91 because Windows
17:32 🔗 TFGBD hmm
17:32 🔗 joepie91 (Windows is considerably less friendly to weird characters in filenames than Linux/OSX, in my experience)
17:32 🔗 TFGBD it worked with a 40mb warc
17:32 🔗 joepie91 (or well, I suppose that it's technically NTFS that's failing, not Windows)
17:32 🔗 yipdw MemoryError and weird characters is a stretch
17:32 🔗 yipdw anyway #-bs
17:33 🔗 joepie91 TFGBD: try to find one that's bigger than your failing file
17:33 🔗 joepie91 er
17:33 🔗 joepie91 than your failing position in the failing file *
17:33 🔗 TFGBD i'd have to download another one then
17:33 🔗 joepie91 right
17:33 🔗 joepie91 TFGBD: can you join #archiveteam-bs
17:33 🔗 TFGBD sure
17:33 🔗 SketchCow HI
17:33 🔗 SketchCow Had a nice chat with Canadian press about twitpic
17:34 🔗 SadDM I'm sure they were thrilled just to be not talking about wednesday's shooting
17:34 🔗 SadDM can you say which news org?
17:43 🔗 SketchCow Global News
17:44 🔗 SketchCow I was in some other ... oh, Globe and Mail a day or two ago
17:45 🔗 SadDM Nice... I'll try and remember to keep an eye on their newscasts
17:59 🔗 balrog SketchCow: yeah I saw that
18:00 🔗 balrog http://www.theglobeandmail.com/technology/digital-culture/the-race-to-archive-twitpic-before-800-million-pictures-vanish/article21199755/
18:06 🔗 raylee hm
18:06 🔗 raylee i wonder why twitpic are acting the way they are
18:11 🔗 SketchCow Carl Malamud in the house!!
18:25 🔗 balrog SketchCow: :D
19:15 🔗 bzc6p http://globalnews.ca/news/1633807/800-million-twitpic-photos-to-vanish-from-the-web-saturday/
19:15 🔗 bzc6p http://globalnews.ca/video/1633770/twitpic-is-about-to-shut-down-after-dispute-with-twitter
21:47 🔗 wp494 oh wow
21:47 🔗 wp494 here's to hoping peter chura (global winnipeg anchor) gets to mention that article
21:48 🔗 * wp494 sets a recording for 6 pm news
22:39 🔗 schbirid SketchCow: midas dropped this wonderful quote earlier in -bs, you are probably the most likely to be able to use it: <midas> clouds disappear when the heat is on
