#archiveteam 2014-10-24,Fri


Time	Nickname	Message
00:12 ^🔗	SketchCow	80Gb
00:45 ^🔗	SketchCow	Well, now it's officially a clusterfuck.
00:51 ^🔗	SketchCow	You know what the Ello guy needed? The Svpply guy
00:57 ^🔗	SketchCow	Calling Ello guy on skype
00:57 ^🔗	garyrh	ooh, that'll be fun
01:05 ^🔗	aaaaaaaaa	I love how he says an export isn't a priority when he is holding other people's stuff. That is like giving your money to a bank that says "you can't get your money back now but don't worry, you can in the future."
01:06 ^🔗	aaaaaaaaa	"plus we won't disappear with it. Promise."
01:06 ^🔗	aaaaaaaaa	except in the case of banks, they are required to have insurance just for that sort of occurrence.
01:08 ^🔗	garyrh	"You can reassemble your money from these jars of pennies, right?"
01:22 ^🔗	TFGBD_	Okay, I just installed the warc-proxy
01:22 ^🔗	TFGBD_	This has terrible documentation
01:23 ^🔗	TFGBD_	Where the heck am I supported to put the WARC file once I have the thing running?
01:26 ^🔗	TFGBD_	supposed*
01:26 ^🔗	garyrh	if you configured the http proxy, go to http://warc/ and you should see an add warc button
01:26 ^🔗	TFGBD_	Ahh
01:27 ^🔗	TFGBD_	I tried to use the Firefox addon but it isn't showing up in Firefox 30's Tools menu
01:31 ^🔗	TFGBD_	Okay, it's half working now.
01:31 ^🔗	TFGBD_	I can load the http://warc page but I just get a list of Python errors in a frame
01:34 ^🔗	TFGBD_	Does it not like spaces in the path?
01:53 ^🔗	SketchCow	WHO HELLO
01:53 ^🔗	SketchCow	Skype chat was good.
01:54 ^🔗	SketchCow	I said, and I quote, "I am more than happy to call a truce until Ello does the next Stupid Thing."
01:54 ^🔗	SketchCow	And there we are.
01:54 ^🔗	garyrh	yay
01:54 ^🔗	garyrh	i think
02:02 ^🔗	SketchCow	Just keep an eye on them
02:20 ^🔗	TFGBD_	Jesus, why can't IA just use zip?
02:20 ^🔗	TFGBD_	I found reference to another format .war and those seem to just be renamed zips
02:21 ^🔗	TFGBD_	nm, those are jars
02:25 ^🔗	TFGBD_	still. This is a horrible "standard" if it's a pain in the ass to even get a file out of it
02:28 ^🔗	pikhq	Arguably it could have been designed more conveniently, but there's some features of warc that they really want that nothing else really has.
02:30 ^🔗	pikhq	And honestly it's not that crazy or anything. It's more-or-less a stream of HTTP-ish encoded HTTP responses.
02:32 ^🔗	TFGBD_	I guess there is just (annoyingly) limited interest in decoding it
02:33 ^🔗	TFGBD_	It sure would be nice if 7-zip or whatever could view and extract these
02:33 ^🔗	pikhq	Yeah, that's the sucky thing. There's not that much in the way of good tooling.
02:33 ^🔗	TFGBD_	Right now I'm getting ready to install this: https://github.com/iipc/openwayback/wiki/How-to-install
02:35 ^🔗	TFGBD_	that least python script doesn't seem to want to work
02:36 ^🔗	TFGBD_	last&
02:43 ^🔗	TFGBD_	Why do some of the websites you guys did have like 5 separate files?
02:43 ^🔗	TFGBD_	Are they all different?
02:43 ^🔗	TFGBD_	Split?
02:43 ^🔗	TFGBD_	Just continuations of the previous crawl so each file isn't too huge?
02:55 ^🔗	TFGBD_	Okay, it's not so bad using archive.org conversion web service
03:04 ^🔗	TFGBD_	These guys must have some setup
03:11 ^🔗	yipdw	TFGBD_: WARCs aren't designed for file extraction, because there is no concept of "file" on the Web
03:12 ^🔗	yipdw	they are request/response recordings, and for archiving HTTP sessions, that is appropriate
03:12 ^🔗	TFGBD_	I see
03:12 ^🔗	TFGBD_	Though, I certainly see files in these dumps...
03:12 ^🔗	yipdw	before you knock something it helps to know what it is for
03:12 ^🔗	TFGBD_	I wasn't knocking it that bad.
03:13 ^🔗	TFGBD_	Mostly just complaining aloud.
03:13 ^🔗	TFGBD_	I'm good, now
03:13 ^🔗	TFGBD_	I'll just use archive.org's warc2zip service for now
03:13 ^🔗	yipdw	funny you mention that because it was written by the same guy who wrote warc-proxy
03:15 ^🔗	TFGBD_	Funny.
03:15 ^🔗	TFGBD_	I just downloaded a 1GB Warc with it and it compressed to 408 MB?!
03:15 ^🔗	TFGBD_	And there is way less in it than I expected
03:15 ^🔗	TFGBD_	what gives?
03:15 ^🔗	TFGBD_	Was the rest all just http responses?!
03:15 ^🔗	yipdw	there are a lot of factors
03:16 ^🔗	yipdw	if it was warc.gz then each WARC record is individually compressed
03:16 ^🔗	yipdw	there is a reason for that, and the reason is seekability
03:16 ^🔗	pikhq	yipdw: Though, it's ZIP that he's got.
03:16 ^🔗	TFGBD_	Or did the tool just choke on a 10GB warc?
03:16 ^🔗	yipdw	however you lose the benefits of solid compression
03:16 ^🔗	pikhq	ZIP compresses each file separately.
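Editor's sketch of the point yipdw makes above about warc.gz. This is a minimal Python illustration, not taken from any WARC library: each record is written as its own gzip member, so a single record can be decompressed from a known byte offset without reading what comes before it. The names `write_members` and `read_member` and the sample records are hypothetical; real tools typically keep the offsets in a separate index.

```python
import gzip
import io
import zlib


def write_members(records):
    """Compress each record as its own gzip member.

    Returns the concatenated bytes and the start offset of every member.
    """
    out = io.BytesIO()
    offsets = []
    for rec in records:
        offsets.append(out.tell())      # remember where this member starts
        out.write(gzip.compress(rec))   # one complete gzip member per record
    return out.getvalue(), offsets


def read_member(blob, offset):
    """Decompress only the gzip member that starts at `offset`."""
    # wbits=31 (16 + 15) tells zlib to expect a gzip header and trailer.
    d = zlib.decompressobj(31)
    # Decompression stops at the end of the first member; any bytes after it
    # are left untouched in d.unused_data.
    return d.decompress(blob[offset:])


if __name__ == "__main__":
    records = [b"record one", b"record two " * 50, b"record three"]
    blob, offsets = write_members(records)
    print(read_member(blob, offsets[1]) == records[1])
```

The trade-off is the one described in the log: because every record is compressed on its own, repeated content across records is not shared, so the result is usually larger than compressing the whole stream at once.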
03:17 ^🔗	TFGBD_	It was a WARC.gz but the Warc.gz was 10GB
03:17 ^🔗	DFJustin	iirc it chokes on over 2gb because of lack of zip64
03:17 ^🔗	TFGBD_	and it was still 10GB extracted, so no compression
03:17 ^🔗	TFGBD_	Oh, that sucks
03:17 ^🔗	TFGBD_	will it work better if I run it locally?
03:17 ^🔗	DFJustin	it's a trivial fix in the local script
03:17 ^🔗	pikhq	Oh, you passed it a 10GB warc? Yeah, that'll probably choke. .zip doesn't handle archives that big.
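For reference, a hedged sketch of what a zip64 fix can look like with Python's standard `zipfile` module. This is not the content of DFJustin's patch, which is not quoted in the log, and `open_output_zip` is a hypothetical helper. The only real API relied on is the `allowZip64` argument of `zipfile.ZipFile`, which was off by default in the Python 2 releases current at the time.

```python
import zipfile


def open_output_zip(path):
    """Open a ZIP file for writing with ZIP64 extensions enabled.

    Hypothetical helper for illustration only. With allowZip64=True the
    zipfile module may write ZIP64 records when an archive or member grows
    past the classic ZIP size limits instead of raising an error.
    """
    return zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED, allowZip64=True)


if __name__ == "__main__":
    # Small demonstration; the file name is arbitrary.
    with open_output_zip("example-output.zip") as outzip:
        outzip.writestr("hello.txt", b"extracted payload")
```

Enabling ZIP64 only addresses the archive size limit. It would not by itself explain the MemoryError reported later in the log.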
03:18 ^🔗	TFGBD_	Darnit. Guess I'm back to square one. And it worked so well for the 40MB one... ;P
03:18 ^🔗	TFGBD_	Is there a WARC to gzipped files tool?
03:19 ^🔗	yipdw	you can try warcat's extract mode
03:19 ^🔗	yipdw	https://pypi.python.org/pypi/Warcat/
03:20 ^🔗	DFJustin	https://github.com/alard/warctozip + https://gist.github.com/DopefishJustin/ae8262bede1b77d87709
03:21 ^🔗	TFGBD_	nice. Why isn't that in the live tool?
03:21 ^🔗	DFJustin	no good reason
03:22 ^🔗	DFJustin	looks like there is also a useful change in a pull request https://github.com/alard/warctozip/pull/1/files
03:22 ^🔗	TFGBD_	does the guy who made it come here?
03:23 ^🔗	DFJustin	he used to but not for a while
03:23 ^🔗	TFGBD	someone should update the official archive.org copy
03:25 ^🔗	TFGBD	Maybe my problem with the proxy was I'm trying to use portable python
03:29 ^🔗	TFGBD	When a WARC ends in 001, 002, etc...
03:29 ^🔗	TFGBD	Does that mean it is a multi-part split warc?
03:29 ^🔗	TFGBD	Is that a thing?
03:31 ^🔗	TFGBD	Do I need to download all of them to get a proper dump of the files?
03:32 ^🔗	pikhq	As far as I know, no.
03:41 ^🔗	TFGBD	Hmm, this warc2zip is an offline app
03:41 ^🔗	TFGBD	is this what the web service is based on?
03:43 ^🔗	TFGBD	Can't I download the web app?
03:43 ^🔗	TFGBD	is this what I need?
03:43 ^🔗	TFGBD	https://github.com/alard/warctozip-service
03:46 ^🔗	TFGBD	Is there a zip64 diff for the web service version?
04:06 ^🔗	DFJustin	nope
04:09 ^🔗	TFGBD	Okay, so I have all the requirements for the web service installed in my python but how do I actually run this thing?
04:09 ^🔗	TFGBD	The documentation sucks
04:10 ^🔗	TFGBD	It's no longer giving errors but when I run it without an argument with python, it just starts and quits with no output
04:10 ^🔗	TFGBD	it does create a stream_post.pyc but that's about it
04:10 ^🔗	TFGBD	Does this need to run with apache or something?
04:16 ^🔗	yipdw	install the packages listed in requirements.txt, use a procfile runner like foreman or whatever
04:17 ^🔗	yipdw	the patch DFJustin supplied can be applied at line 160 of app.py
04:22 ^🔗	TFGBD	Ah, okay
04:23 ^🔗	TFGBD	that's what I needed. I'm not too familiar with python and had no idea what a procfile was
04:24 ^🔗	TFGBD	it should really mention that in the documentation, no
04:25 ^🔗	yipdw	maybe, but this had an audience of like two people and both people knew how to start it
04:25 ^🔗	yipdw	submit a PR
04:25 ^🔗	TFGBD	Ahh, I get it
04:25 ^🔗	TFGBD	It kind of amazes me, though
04:26 ^🔗	TFGBD	I'd have thought there would be a huge team of big companies behind this format
04:26 ^🔗	yipdw	there are
04:26 ^🔗	yipdw	you are conflating WARC and the tools people build to operate on it
04:26 ^🔗	yipdw	well, correction
04:26 ^🔗	yipdw	there aren't any "big" companies behind this
04:27 ^🔗	yipdw	it has support from significant players in the sector where it matters; two of them are Hanzo Archives and Internet Archive
04:27 ^🔗	yipdw	if you ask Google they'll probably push HAR on you
04:27 ^🔗	TFGBD	Ohh, so that's where the "hanzo tools" comes from
04:27 ^🔗	TFGBD	I'm not familiar with hanzo
04:27 ^🔗	TFGBD	is that a competitor to Archive.org?
04:28 ^🔗	yipdw	http://www.hanzoarchives.com/
04:28 ^🔗	yipdw	no
04:37 ^🔗	TFGBD	Ah, I see
04:37 ^🔗	TFGBD	legal stuff
05:00 ^🔗	TFGBD	ugh, there is no foreman for win32...
05:06 ^🔗	TFGBD	guess i'm SOL
05:06 ^🔗	TFGBD	or is there some way to run it manually without the procfile?
05:06 ^🔗	TFGBD	at least the offline tool works
05:25 ^🔗	yipdw	https://github.com/ddollar/foreman-windows
05:25 ^🔗	yipdw	yes there is
05:26 ^🔗	yipdw	although it is weird that it has Ruby and C# code in the same project
05:26 ^🔗	yipdw	in any case, running this on Windows is hard to support because most of us don't try to run this code on Windows
05:27 ^🔗	yipdw	you are likely to receive better support on something unixish
05:29 ^🔗	signius	Just looking at the twitpic grab tracker, can someone explain how so many users manage to get so many GB of data with so few items?
05:29 ^🔗	yipdw	they got in on the ground floor
05:29 ^🔗	yipdw	when we actually had images
05:29 ^🔗	signius	ah
06:04 ^🔗	TFGBD	I understand, though I'd rather not install cygwin or interix right now
06:25 ^🔗	yipdw	a VM is another option
07:07 ^🔗	TFGBD	ugh, this stupid thing is giving me out of memory errors
07:11 ^🔗	TFGBD	does it need a 64 bit python and os install?
14:44 ^🔗	Muad_Dib	netsplits \o/
14:48 ^🔗	SketchCow	Boop
14:48 ^🔗	SketchCow	-bs
15:43 ^🔗	SadDM	SketchCow: when you have a moment, can you please move the following items into the Archive Team collection: comeback_inn_forums-20140326, metamorphosisalpha.net_forums-20141022, starfrontiers.info_forum-20140324, pathfinderchronicler.net_grabs, fraternity_of_shadows_forum-20140325
17:24 ^🔗	TFGBD	stupid warctozip
17:24 ^🔗	TFGBD	it keeps failing at 134MB
17:26 ^🔗	Corion	Hi all - I'm manually running the code (on a VPS) instead of using a Warrior VM. Is there any convenient way to find out the "most urgent" project I should run?
17:26 ^🔗	yipdw	Corion: unless you're using the warrior-code repo, no -- each project has its own codebase
17:27 ^🔗	Kazzy	Corion: You could take a look at http://warriorhq.archiveteam.org/projects.json
17:27 ^🔗	yipdw	if you are running warrior-code(2) on a VPS then just set it to ArchiveTeam's Choice
17:27 ^🔗	Kazzy	auto_project is what the warrior uses to work out the 'most important' job
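Editor's sketch of the lookup Kazzy describes, for anyone who wants to script it as Corion mentions. It assumes the JSON document has a top-level `auto_project` key (named in the log) plus a `projects` list whose entries carry a `name` field. That list layout is an assumption and should be checked against the live file. `pick_auto_project` is a hypothetical helper that works on already-downloaded text, so no network access is needed to try it.

```python
import json


def pick_auto_project(projects_json_text):
    """Return the entry for the project named by `auto_project`.

    Assumed layout (not confirmed by the log):
        {"auto_project": "<name>", "projects": [{"name": "<name>", ...}, ...]}
    Returns None when no auto_project is set.
    """
    data = json.loads(projects_json_text)
    name = data.get("auto_project")
    if not name:
        return None
    for project in data.get("projects", []):
        if project.get("name") == name:
            return project
    # auto_project is set but has no matching entry: return the bare name.
    return {"name": name}


if __name__ == "__main__":
    # Made-up sample data, only to show the shape the helper expects.
    sample = json.dumps({
        "auto_project": "example-project",
        "projects": [{"name": "example-project", "title": "Example"}],
    })
    print(pick_auto_project(sample))
```

Polling the file periodically and comparing the returned name with the previous one would be one way to send a notification when the main project changes.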
17:27 ^🔗	TFGBD	File "warctozip.py", line 63, in <module>
17:27 ^🔗	TFGBD	sys.exit(main(sys.argv))
17:27 ^🔗	TFGBD	File "warctozip.py", line 42, in main
17:27 ^🔗	TFGBD	dump_record(fh, outzip)
17:27 ^🔗	TFGBD	File "warctozip.py", line 51, in dump_record
17:27 ^🔗	TFGBD	leftover = message.feed(record.content[1])
17:27 ^🔗	TFGBD	File "hanzo\httptools\messaging.py", line 576, in feed
17:27 ^🔗	Corion	Kazzy: That sounds like what I wanted, thanks!
17:27 ^🔗	yipdw	TFGBD: wtf
17:27 ^🔗	TFGBD	text = HTTPMessage.feed(self, text)
17:27 ^🔗	TFGBD	File "hanzo\httptools\messaging.py", line 97, in feed
17:27 ^🔗	TFGBD	text = self.feed_headers(text)
17:27 ^🔗	TFGBD	File "hanzo\httptools\messaging.py", line 191, in feed_headers
17:27 ^🔗	TFGBD	line, text = self.feed_line(text)
17:27 ^🔗	TFGBD	File "hanzo\httptools\messaging.py", line 159, in feed_line
17:27 ^🔗	TFGBD	text = str(self.buffer[pos:])
17:27 ^🔗	TFGBD	MemoryError
17:27 ^🔗	TFGBD	gah, sorry
17:27 ^🔗	Corion	No flood protection here? A stray right-click easily wreaks havoc ;)
17:27 ^🔗	TFGBD	didn't mean to paste it all
17:28 ^🔗	TFGBD	but that is the error
17:28 ^🔗	yipdw	how about don't paste any of it and use a pastebin
17:28 ^🔗	TFGBD	my bad
17:28 ^🔗	yipdw	also did you apply the zip64 change
17:28 ^🔗	Kazzy	efnet doesn't kill you on that level of flooding, and there's no bots in the chan to do it either
17:28 ^🔗	Corion	Anyway, thanks for the information - I'll look at whether I can automate that, or at least, send myself an email when the main project changes
17:28 ^🔗	TFGBD	yes, still didn't work
17:28 ^🔗	Kazzy	Corion: enjoy :)
17:29 ^🔗	TFGBD	i tried it on a 64bit OS too
17:29 ^🔗	TFGBD	should that matter?
17:29 ^🔗	TFGBD	do I need to use a 64-bit python?
17:30 ^🔗	TFGBD	Ehh, guess I'll spin up a Colinux and see how it goes there
17:30 ^🔗	joepie91	TFGBD: summarize your issue in one sentence?
17:30 ^🔗	joepie91	(haven't been following convo)
17:30 ^🔗	TFGBD	s'cool
17:31 ^🔗	yipdw	running warctozip-the-service on Windows and trying to use it to extract stuff from a 10 GB WARC
17:31 ^🔗	joepie91	yipdw: warctozip-the-service?
17:31 ^🔗	TFGBD	joepie91: I tried using warctozip with the zip64 diff and it is still only extracting about 140MB of the 10GB warc
17:31 ^🔗	TFGBD	warc-to-zip service won't run at all
17:31 ^🔗	TFGBD	I'm using the cli tool
17:31 ^🔗	TFGBD	or, i couldn't get it to run
17:32 ^🔗	joepie91	taking a stab at the obvious: have you tried processing a different WARC and comparing whether it breaks at the same point?
17:32 ^🔗	joepie91	may be a special-characters-in-filename issue
17:32 ^🔗	joepie91	because Windows
17:32 ^🔗	TFGBD	hmm
17:32 ^🔗	joepie91	(Windows is considerably less friendly to weird characters in filenames than Linux/OSX, in my experience)
17:32 ^🔗	TFGBD	it worked with a 40mb warc
17:32 ^🔗	joepie91	(or well, I suppose that it's technically NTFS that's failing, not Windows)
17:32 ^🔗	yipdw	MemoryError and weird characters is a stretch
17:32 ^🔗	yipdw	anyway #-bs
17:33 ^🔗	joepie91	TFGBD: try to find one that's bigger than your failing file
17:33 ^🔗	joepie91	er
17:33 ^🔗	joepie91	than your failing position in the failing file *
17:33 ^🔗	TFGBD	i'd have to download another one then
17:33 ^🔗	joepie91	right
17:33 ^🔗	joepie91	TFGBD: can you join #archiveteam-bs
17:33 ^🔗	TFGBD	sure
17:33 ^🔗	SketchCow	HI
17:33 ^🔗	SketchCow	Had a nice chat with Canadian press about twitpic
17:34 ^🔗	SadDM	I'm sure they were thrilled just to be not talking about wednesday's shooting
17:34 ^🔗	SadDM	can you say which news org?
17:43 ^🔗	SketchCow	Global News
17:44 ^🔗	SketchCow	I was in some other ... oh, Globe and Mail a day or two ago
17:45 ^🔗	SadDM	Nice... I'll try and remember to keep an eye on their newscasts
17:59 ^🔗	balrog	SketchCow: yeah I saw that
18:00 ^🔗	balrog	http://www.theglobeandmail.com/technology/digital-culture/the-race-to-archive-twitpic-before-800-million-pictures-vanish/article21199755/
18:06 ^🔗	raylee	hm
18:06 ^🔗	raylee	i wonder why twitpic are acting the way they are
18:11 ^🔗	SketchCow	Carl Malamud in the house!!
18:25 ^🔗	balrog	SketchCow: :D
19:15 ^🔗	bzc6p	http://globalnews.ca/news/1633807/800-million-twitpic-photos-to-vanish-from-the-web-saturday/
19:15 ^🔗	bzc6p	http://globalnews.ca/video/1633770/twitpic-is-about-to-shut-down-after-dispute-with-twitter
21:47 ^🔗	wp494	oh wow
21:47 ^🔗	wp494	here's to hoping peter chura (global winnipeg anchor) gets to mention that article
21:48 ^🔗	*	wp494 sets a recording for 6 pm news
22:39 ^🔗	schbirid	SketchCow: midas dropped this wonderful quote earlier in -bs, you are probably the most likely to be able to use it: <midas> clouds dissapear when the heat is on
