Time |
Nickname |
Message |
00:12
🔗
|
SketchCow |
80Gb |
00:45
🔗
|
SketchCow |
Well, now it's officially a clusterfuck. |
00:51
🔗
|
SketchCow |
You know what the Ello guy needed? The Svpply guy |
00:57
🔗
|
SketchCow |
Calling Ello guy on skype |
00:57
🔗
|
garyrh |
ooh, that'll be fun |
01:05
🔗
|
aaaaaaaaa |
I love how he says an export isn't a priority when he is holding other people's stuff. That is like giving your money to a bank that says "you can't get your money back now but don't worry, you can in the future." |
01:06
🔗
|
aaaaaaaaa |
"plus we won't disappear with it. Promise." |
01:06
🔗
|
aaaaaaaaa |
except in the case of banks, they are required to have insurance just for that sort of occurrence. |
01:08
🔗
|
garyrh |
"You can reassemble your money from these jars of pennies, right?" |
01:22
🔗
|
TFGBD_ |
Okay, I just installed the warc-proxy |
01:22
🔗
|
TFGBD_ |
This has terrible documentation |
01:23
🔗
|
TFGBD_ |
Where the heck am I supported to put the WARC file once I have the thing running? |
01:26
🔗
|
TFGBD_ |
supposed* |
01:26
🔗
|
garyrh |
if you configured the http proxy, go to http://warc/ and you should see an add warc button |
01:26
🔗
|
TFGBD_ |
Ahh |
01:27
🔗
|
TFGBD_ |
I tried to use the Firefox addon but it isn't showing up in Firefox 30's Tools menu |
01:31
🔗
|
TFGBD_ |
Okay, it's half working now. |
01:31
🔗
|
TFGBD_ |
I can load the http://warc page but I just get a list of Python errors in a frame |
01:34
🔗
|
TFGBD_ |
Does it not like spaces in the path? |
01:53
🔗
|
SketchCow |
WHO HELLO |
01:53
🔗
|
SketchCow |
Skype chat was good. |
01:54
🔗
|
SketchCow |
I said, and I quote, "I am more than happy to call a truce until Ello does the next Stupid Thing." |
01:54
🔗
|
SketchCow |
And there we are. |
01:54
🔗
|
garyrh |
yay |
01:54
🔗
|
garyrh |
i think |
02:02
🔗
|
SketchCow |
Just keep an eye on them |
02:20
🔗
|
TFGBD_ |
Jesus, why can't IA just use zip? |
02:20
🔗
|
TFGBD_ |
I found reference to another format .war and those seem to just be renamed zips |
02:21
🔗
|
TFGBD_ |
nm, those are jars |
02:25
🔗
|
TFGBD_ |
still. This is a horrible "standard" if it's a pain in the ass to even get a file out of it |
02:28
🔗
|
pikhq |
Arguably it could have been designed more conveniently, but there's some features of warc that they *really* want that nothing else really has. |
02:30
🔗
|
pikhq |
And honestly it's not that crazy or anything. It's more-or-less a stream of HTTP-ish encoded HTTP responses. |
02:32
🔗
|
TFGBD_ |
I guess there is just (annoyingly) limited interest in decoding it |
02:33
🔗
|
TFGBD_ |
It sure would be nice if 7-zip or whatever could view and extract these |
02:33
🔗
|
pikhq |
Yeah, *that's* the sucky thing. There's not that much in the way of good tooling. |
02:33
🔗
|
TFGBD_ |
Right now I'm getting ready to install this: https://github.com/iipc/openwayback/wiki/How-to-install |
02:35
🔗
|
TFGBD_ |
that least python script doesn't seem to want to work |
02:36
🔗
|
TFGBD_ |
last& |
02:43
🔗
|
TFGBD_ |
Why do some of the websites you guys did have like 5 separate files? |
02:43
🔗
|
TFGBD_ |
Are they all different? |
02:43
🔗
|
TFGBD_ |
Split? |
02:43
🔗
|
TFGBD_ |
Just continuations of the previous crawl so each file isn't too huge? |
02:55
🔗
|
TFGBD_ |
Okay, it's not so bad using archive.org's conversion web service |
03:04
🔗
|
TFGBD_ |
These guys must have some setup |
03:11
🔗
|
yipdw |
TFGBD_: WARCs aren't designed for file extraction, because there is no concept of "file" on the Web |
03:12
🔗
|
yipdw |
they are request/response recordings, and for archiving HTTP sessions, that is appropriate |
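To make the "request/response recordings" description concrete, here is a minimal sketch of reading one uncompressed WARC record: a version line, a block of named headers, a blank line, then `Content-Length` bytes of payload (for a `response` record, the raw HTTP response). The function name `read_warc_record` and the sample record are illustrative assumptions, not part of any tool mentioned in this log, and real WARCs carry more headers than shown.

```python
import io

def read_warc_record(stream):
    """Read one uncompressed WARC record from a binary stream.

    Returns (version, headers, payload), or None at end of stream.
    Hypothetical helper for illustration only.
    """
    version = stream.readline()
    # Records are separated by blank lines; skip any that precede this one.
    while version in (b"\r\n", b"\n"):
        version = stream.readline()
    if not version:
        return None
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # a blank line ends the header block
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    # The payload is exactly Content-Length bytes.
    payload = stream.read(int(headers["Content-Length"]))
    return version.strip().decode("ascii"), headers, payload

# A made-up response record wrapping a tiny HTTP response.
http = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
sample = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/\r\n"
          b"Content-Length: " + str(len(http)).encode("ascii") + b"\r\n"
          b"\r\n" + http + b"\r\n\r\n")

stream = io.BytesIO(sample)
version, headers, payload = read_warc_record(stream)
```

Extracting a "file" then means parsing `payload` as an HTTP message and choosing a filename from `WARC-Target-URI`, which is a decision the extraction tool has to make because the format itself does not store one.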
03:12
🔗
|
TFGBD_ |
I see |
03:12
🔗
|
TFGBD_ |
Though, I certainly see files in these dumps... |
03:12
🔗
|
yipdw |
before you knock something it helps to know what it is for |
03:12
🔗
|
TFGBD_ |
I wasn't knocking it that bad. |
03:13
🔗
|
TFGBD_ |
Mostly just complaining aloud. |
03:13
🔗
|
TFGBD_ |
I'm good, now |
03:13
🔗
|
TFGBD_ |
I'll just use archive.org's warc2zip service for now |
03:13
🔗
|
yipdw |
funny you mention that because it was written by the same guy who wrote warc-proxy |
03:15
🔗
|
TFGBD_ |
Funny. |
03:15
🔗
|
TFGBD_ |
I just downloaded a 1GB Warc with it and it compressed to 408 MB?! |
03:15
🔗
|
TFGBD_ |
And there is way less in it than I expected |
03:15
🔗
|
TFGBD_ |
what gives? |
03:15
🔗
|
TFGBD_ |
Was the rest all just http responses?! |
03:15
🔗
|
yipdw |
there are a lot of factors |
03:16
🔗
|
yipdw |
if it was warc.gz then each WARC record is individually compressed |
03:16
🔗
|
yipdw |
there is a reason for that, and the reason is seekability |
03:16
🔗
|
pikhq |
yipdw: Though, it's ZIP that he's got. |
03:16
🔗
|
TFGBD_ |
Or did the tool just choke on a 10GB warc? |
03:16
🔗
|
yipdw |
however you lose the benefits of solid compression |
03:16
🔗
|
pikhq |
ZIP compresses each file separately. |
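The seekability trade-off can be sketched with the standard `gzip` module: each record is compressed as its own gzip member, so a reader that knows a record's byte offset can decompress just that member. The record contents and the `read_member` helper below are made up for illustration; in practice the offsets would come from an index file.

```python
import gzip
import io

# Stand-ins for WARC records (illustrative data only).
records = [b"record one\n" * 3, b"record two\n" * 3, b"record three\n" * 3]

# Compress each record as its own gzip member and note where it starts.
offsets = []
blob = io.BytesIO()
for rec in records:
    offsets.append(blob.tell())
    blob.write(gzip.compress(rec))
data = blob.getvalue()

def read_member(data, start, end):
    """Decompress only the gzip member stored in data[start:end]."""
    return gzip.decompress(data[start:end])

# Random access: fetch the second record without decompressing the first.
ends = offsets[1:] + [len(data)]
second = read_member(data, offsets[1], ends[1])

# Read front to back, the concatenation is still one valid gzip stream.
whole = gzip.decompress(data)
```

The cost is that every member starts with a fresh compression dictionary, so redundancy shared between records is not exploited the way it would be in a single solid stream.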
03:17
🔗
|
TFGBD_ |
It was a WARC.gz but the WARC.gz was 10GB |
03:17
🔗
|
DFJustin |
iirc it chokes on over 2gb because of lack of zip64 |
03:17
🔗
|
TFGBD_ |
and it was still 10GB extracted, so no compression |
03:17
🔗
|
TFGBD_ |
Oh, that sucks |
03:17
🔗
|
TFGBD_ |
will it work better if I run it locally? |
03:17
🔗
|
DFJustin |
it's a trivial fix in the local script |
03:17
🔗
|
pikhq |
Oh, you passed it a 10GB warc? Yeah, that'll probably choke. .zip doesn't handle archives that big. |
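For context on the zip64 limit: Python's `zipfile` module only writes the ZIP64 extensions needed for very large archives when `allowZip64` is true. As far as I know, that flag defaulted to `False` on the Python 2 interpreters current at the time of this log and defaults to `True` on modern Python 3. The sketch below shows the general shape of such a fix; it is not taken from the gist linked in the log, and `open_output_zip` is a hypothetical helper name.

```python
import io
import zipfile

def open_output_zip(target):
    """Open a ZIP for writing with ZIP64 extensions permitted.

    Hypothetical helper. Without allowZip64=True, zipfile raises
    LargeZipFile once the archive outgrows the classic ZIP size limits.
    """
    return zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED, allowZip64=True)

# Small in-memory demonstration; a real run would pass a filename.
buf = io.BytesIO()
with open_output_zip(buf) as zf:
    zf.writestr("example.com/index.html", b"<html></html>")
```

Whatever reads the resulting archive also has to understand ZIP64, which older unzip tools may not.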
03:18
🔗
|
TFGBD_ |
Darnit. Guess I'm back to square one. And it worked so well for the 40MB one... ;P |
03:18
🔗
|
TFGBD_ |
Is there a WARC to gzipped files tool? |
03:19
🔗
|
yipdw |
you can try warcat's extract mode |
03:19
🔗
|
yipdw |
https://pypi.python.org/pypi/Warcat/ |
03:20
🔗
|
DFJustin |
https://github.com/alard/warctozip + https://gist.github.com/DopefishJustin/ae8262bede1b77d87709 |
03:21
🔗
|
TFGBD_ |
nice. Why isn't that in the live tool? |
03:21
🔗
|
DFJustin |
no good reason |
03:22
🔗
|
DFJustin |
looks like there is also a useful change in a pull request https://github.com/alard/warctozip/pull/1/files |
03:22
🔗
|
TFGBD_ |
does the guy who made it come here? |
03:23
🔗
|
DFJustin |
he used to but not for a while |
03:23
🔗
|
TFGBD |
someone should update the official archive.org copy |
03:25
🔗
|
TFGBD |
Maybe my problem with the proxy was I'm trying to use portable python |
03:29
🔗
|
TFGBD |
When a WARC ends in 001, 002, etc... |
03:29
🔗
|
TFGBD |
Does that mean it is a multi-part split warc? |
03:29
🔗
|
TFGBD |
Is that a thing? |
03:31
🔗
|
TFGBD |
Do I need to download all of them to get a proper dump of the files? |
03:32
🔗
|
pikhq |
As far as I know, no. |
03:41
🔗
|
TFGBD |
Hmm, this warc2zip is an offline app |
03:41
🔗
|
TFGBD |
is this what the web service is based on? |
03:43
🔗
|
TFGBD |
Can't I download the web app? |
03:43
🔗
|
TFGBD |
is this what I need? |
03:43
🔗
|
TFGBD |
https://github.com/alard/warctozip-service |
03:46
🔗
|
TFGBD |
Is there a zip64 diff for the web service version? |
04:06
🔗
|
DFJustin |
nope |
04:09
🔗
|
TFGBD |
Okay, so I have all the requirements for the web service installed in my python but how do I actually run this thing? |
04:09
🔗
|
TFGBD |
The documentation sucks |
04:10
🔗
|
TFGBD |
It's no longer giving errors but when I run it without an argument with python, it just starts and quits with no output |
04:10
🔗
|
TFGBD |
it does create a stream_post.pyc but that's about it |
04:10
🔗
|
TFGBD |
Does this need to run with apache or something? |
04:16
🔗
|
yipdw |
install the packages listed in requirements.txt, use a procfile runner like foreman or whatever |
04:17
🔗
|
yipdw |
the patch DFJustin supplied can be applied at line 160 of app.py |
04:22
🔗
|
TFGBD |
Ah, okay |
04:23
🔗
|
TFGBD |
that's what I needed. I'm not too familiar with python and had no idea what a procfile was |
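For readers in the same position: a Procfile is a plain-text file of `name: command` lines, and a runner such as foreman starts each command as a process. The sketch below only parses that format so a command can be launched by hand. The sample line is invented and is not the actual Procfile of warctozip-service, and real runners also inject environment variables such as `PORT`, which this ignores.

```python
import shlex

def parse_procfile(text):
    """Return {process_name: argv list} from Procfile text.

    Hypothetical helper: handles only simple "name: command" lines.
    """
    processes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        name, sep, command = line.partition(":")
        if not sep:
            continue  # not a "name: command" line
        processes[name.strip()] = shlex.split(command)
    return processes

# Invented example line, for illustration only.
sample = "web: python app.py --port 5000\n"
commands = parse_procfile(sample)
```

Passing `commands["web"]` to `subprocess.run` would then start that process directly, which is roughly what a runner does minus the environment setup and log multiplexing.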
04:24
🔗
|
TFGBD |
it should really mention that in the documentation, no |
04:25
🔗
|
yipdw |
maybe, but this had an audience of like two people and both people knew how to start it |
04:25
🔗
|
yipdw |
submit a PR |
04:25
🔗
|
TFGBD |
Ahh, I get it |
04:25
🔗
|
TFGBD |
It kind of amazes me, though |
04:26
🔗
|
TFGBD |
I'd have thought there would be a huge team of big companies behind this format |
04:26
🔗
|
yipdw |
there are |
04:26
🔗
|
yipdw |
you are conflating WARC and the tools people build to operate on it |
04:26
🔗
|
yipdw |
well, correction |
04:26
🔗
|
yipdw |
there aren't any "big" companies behind this |
04:27
🔗
|
yipdw |
it has support from significant players in the sector where it matters; two of them are Hanzo Archives and Internet Archive |
04:27
🔗
|
yipdw |
if you ask Google they'll probably push HAR on you |
04:27
🔗
|
TFGBD |
Ohh, so that's where the "hanzo tools" comes from |
04:27
🔗
|
TFGBD |
I'm not familiar with hanzo |
04:27
🔗
|
TFGBD |
is that a competitor to Archive.org? |
04:28
🔗
|
yipdw |
http://www.hanzoarchives.com/ |
04:28
🔗
|
yipdw |
no |
04:37
🔗
|
TFGBD |
Ah, I see |
04:37
🔗
|
TFGBD |
legal stuff |
05:00
🔗
|
TFGBD |
ugh, there is no foreman for win32... |
05:06
🔗
|
TFGBD |
guess i'm SOL |
05:06
🔗
|
TFGBD |
or is there some way to run it manually without the procfile? |
05:06
🔗
|
TFGBD |
at least the offline tool works |
05:25
🔗
|
yipdw |
https://github.com/ddollar/foreman-windows |
05:25
🔗
|
yipdw |
yes there is |
05:26
🔗
|
yipdw |
although it is weird that it has Ruby and C# code in the same project |
05:26
🔗
|
yipdw |
in any case, running this on Windows is hard to support because most of us don't try to run this code on Windows |
05:27
🔗
|
yipdw |
you are likely to receive better support on something unixish |
05:29
🔗
|
signius |
Just looking at the twitpic grab tracker, can someone explain how so many users manage to get so many GB of data with so few items? |
05:29
🔗
|
yipdw |
they got in on the ground floor |
05:29
🔗
|
yipdw |
when we actually had images |
05:29
🔗
|
signius |
ah |
06:04
🔗
|
TFGBD |
I understand, though I'd rather not install cygwin or interix right now |
06:25
🔗
|
yipdw |
a VM is another option |
07:07
🔗
|
TFGBD |
ugh, this stupid thing is giving me out of memory errors |
07:11
🔗
|
TFGBD |
does it need a 64-bit python and os install? |
14:44
🔗
|
Muad_Dib |
netsplits \o/ |
14:48
🔗
|
SketchCow |
Boop |
14:48
🔗
|
SketchCow |
-bs |
15:43
🔗
|
SadDM |
SketchCow: when you have a moment, can you please move the following items into the Archive Team collection: comeback_inn_forums-20140326, metamorphosisalpha.net_forums-20141022, starfrontiers.info_forum-20140324, pathfinderchronicler.net_grabs, fraternity_of_shadows_forum-20140325 |
17:24
🔗
|
TFGBD |
stupid warctozip |
17:24
🔗
|
TFGBD |
it keeps failing at 134MB |
17:26
🔗
|
Corion |
Hi all - I'm manually running the code (on a VPS) instead of using a Warrior VM. Is there any convenient way to find out the "most urgent" project I should run? |
17:26
🔗
|
yipdw |
Corion: unless you're using the warrior-code repo, no -- each project has its own codebase |
17:27
🔗
|
Kazzy |
Corion: You could take a look at http://warriorhq.archiveteam.org/projects.json |
17:27
🔗
|
yipdw |
if you are running warrior-code(2) on a VPS then just set it to ArchiveTeam's Choice |
17:27
🔗
|
Kazzy |
auto_project is what the warrior uses to work out the 'most important' job |
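A sketch of the automation Corion is after, assuming `projects.json` is a JSON object with an `auto_project` string and a `projects` list whose entries carry a `name` field. Only `auto_project` is confirmed by the log; the rest of the layout and the sample document below are assumptions, so check the live file before relying on this.

```python
import json

def pick_auto_project(projects_json_text):
    """Return the project entry named by "auto_project", or None.

    Assumes a {"auto_project": ..., "projects": [{"name": ...}, ...]}
    layout, which is not confirmed by the log.
    """
    doc = json.loads(projects_json_text)
    wanted = doc.get("auto_project")
    for project in doc.get("projects", []):
        if project.get("name") == wanted:
            return project
    return None

# Made-up sample mirroring the assumed layout.
sample = json.dumps({
    "auto_project": "twitpic2",
    "projects": [
        {"name": "example-old", "title": "Some older project"},
        {"name": "twitpic2", "title": "Twitpic"},
    ],
})
chosen = pick_auto_project(sample)
```

Fetching the URL Kazzy posted on a timer and comparing `auto_project` against the last seen value would be enough to trigger an email when the main project changes.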
17:27
🔗
|
TFGBD |
File "warctozip.py", line 63, in <module> |
17:27
🔗
|
TFGBD |
sys.exit(main(sys.argv)) |
17:27
🔗
|
TFGBD |
File "warctozip.py", line 42, in main |
17:27
🔗
|
TFGBD |
dump_record(fh, outzip) |
17:27
🔗
|
TFGBD |
File "warctozip.py", line 51, in dump_record |
17:27
🔗
|
TFGBD |
leftover = message.feed(record.content[1]) |
17:27
🔗
|
TFGBD |
File "hanzo\httptools\messaging.py", line 576, in feed |
17:27
🔗
|
Corion |
Kazzy: That sounds like what I wanted, thanks! |
17:27
🔗
|
yipdw |
TFGBD: wtf |
17:27
🔗
|
TFGBD |
text = HTTPMessage.feed(self, text) |
17:27
🔗
|
TFGBD |
File "hanzo\httptools\messaging.py", line 97, in feed |
17:27
🔗
|
TFGBD |
text = self.feed_headers(text) |
17:27
🔗
|
TFGBD |
File "hanzo\httptools\messaging.py", line 191, in feed_headers |
17:27
🔗
|
TFGBD |
line, text = self.feed_line(text) |
17:27
🔗
|
TFGBD |
File "hanzo\httptools\messaging.py", line 159, in feed_line |
17:27
🔗
|
TFGBD |
text = str(self.buffer[pos:]) |
17:27
🔗
|
TFGBD |
MemoryError |
17:27
🔗
|
TFGBD |
gah, sorry |
17:27
🔗
|
Corion |
No flood protection here? A stray right-click easily wreaks havoc ;) |
17:27
🔗
|
TFGBD |
didn't mean to paste it all |
17:28
🔗
|
TFGBD |
but that is the error |
17:28
🔗
|
yipdw |
how about don't paste any of it and use a pastebin |
17:28
🔗
|
TFGBD |
my bad |
17:28
🔗
|
yipdw |
also did you apply the zip64 change |
17:28
🔗
|
Kazzy |
efnet doesn't kill you on that level of flooding, and there's no bots in the chan to do it either |
17:28
🔗
|
Corion |
Anyway, thanks for the information - I'll look at whether I can automate that, or at least, send myself an email when the main project changes |
17:28
🔗
|
TFGBD |
yes, still didn't work |
17:28
🔗
|
Kazzy |
Corion: enjoy :) |
17:29
🔗
|
TFGBD |
i tried it on a 64bit OS too |
17:29
🔗
|
TFGBD |
should that matter? |
17:29
🔗
|
TFGBD |
do I need to use a 64-bit python? |
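On the 32-bit versus 64-bit question: a 32-bit Python build keeps a 32-bit address space even on a 64-bit OS, so it can raise `MemoryError` long before physical RAM runs out. A quick way to check which build is running, using only the standard library (`interpreter_bits` is an illustrative helper name):

```python
import struct
import sys

def interpreter_bits():
    """Return the pointer width of the running Python build (32 or 64)."""
    return struct.calcsize("P") * 8

bits = interpreter_bits()
# sys.maxsize gives the same answer from a different angle.
is_64bit = sys.maxsize > 2**32
print("Python build: %d-bit" % bits)
```

If this reports 32, switching to a 64-bit interpreter raises the ceiling, though a tool that buffers a whole record in memory can still hit the limit on very large records.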
17:30
🔗
|
TFGBD |
Ehh, guess I'll spin up a Colinux and see how it goes there |
17:30
🔗
|
joepie91 |
TFGBD: summarize your issue in one sentence? |
17:30
🔗
|
joepie91 |
(haven't been following convo) |
17:30
🔗
|
TFGBD |
s'cool |
17:31
🔗
|
yipdw |
running warctozip-the-service on Windows and trying to use it to extract stuff from a 10 GB WARC |
17:31
🔗
|
joepie91 |
yipdw: warctozip-the-service? |
17:31
🔗
|
TFGBD |
joepie91: I tried using warctozip with the zip64 diff and it is still only extracting about 140MB of the 10GB warc |
17:31
🔗
|
TFGBD |
warc-to-zip service won't run at all |
17:31
🔗
|
TFGBD |
I'm using the cli tool |
17:31
🔗
|
TFGBD |
or, I couldn't get it to run |
17:32
🔗
|
joepie91 |
taking a stab at the obvious: have you tried processing a different WARC and comparing whether it breaks at the same point? |
17:32
🔗
|
joepie91 |
may be a special-characters-in-filename issue |
17:32
🔗
|
joepie91 |
because Windows |
17:32
🔗
|
TFGBD |
hmm |
17:32
🔗
|
joepie91 |
(Windows is considerably less friendly to weird characters in filenames than Linux/OSX, in my experience) |
17:32
🔗
|
TFGBD |
it worked with a 40mb warc |
17:32
🔗
|
joepie91 |
(or well, I suppose that it's technically NTFS that's failing, not Windows) |
17:32
🔗
|
yipdw |
MemoryError and weird characters is a stretch |
17:32
🔗
|
yipdw |
anyway #-bs |
17:33
🔗
|
joepie91 |
TFGBD: try to find one that's bigger than your failing file |
17:33
🔗
|
joepie91 |
er |
17:33
🔗
|
joepie91 |
than your failing position in the failing file * |
17:33
🔗
|
TFGBD |
i'd have to download another one then |
17:33
🔗
|
joepie91 |
right |
17:33
🔗
|
joepie91 |
TFGBD: can you join #archiveteam-bs |
17:33
🔗
|
TFGBD |
sure |
17:33
🔗
|
SketchCow |
HI |
17:33
🔗
|
SketchCow |
Had a nice chat with Canadian press about twitpic |
17:34
🔗
|
SadDM |
I'm sure they were thrilled just to be not talking about wednesday's shooting |
17:34
🔗
|
SadDM |
can you say which news org? |
17:43
🔗
|
SketchCow |
Global News |
17:44
🔗
|
SketchCow |
I was in some other ... oh, Globe and Mail a day or two ago |
17:45
🔗
|
SadDM |
Nice... I'll try and remember to keep an eye on their newscasts |
17:59
🔗
|
balrog |
SketchCow: yeah I saw that |
18:00
🔗
|
balrog |
http://www.theglobeandmail.com/technology/digital-culture/the-race-to-archive-twitpic-before-800-million-pictures-vanish/article21199755/ |
18:06
🔗
|
raylee |
hm |
18:06
🔗
|
raylee |
i wonder why twitpic are acting the way they are |
18:11
🔗
|
SketchCow |
Carl Malamud in the house!! |
18:25
🔗
|
balrog |
SketchCow: :D |
19:15
🔗
|
bzc6p |
http://globalnews.ca/news/1633807/800-million-twitpic-photos-to-vanish-from-the-web-saturday/ |
19:15
🔗
|
bzc6p |
http://globalnews.ca/video/1633770/twitpic-is-about-to-shut-down-after-dispute-with-twitter |
21:47
🔗
|
wp494 |
oh wow |
21:47
🔗
|
wp494 |
here's to hoping peter chura (global winnipeg anchor) gets to mention that article |
21:48
🔗
|
* |
wp494 sets a recording for 6 pm news |
22:39
🔗
|
schbirid |
SketchCow: midas dropped this wonderful quote earlier in -bs, you are probably the most likely to be able to use it: <midas> clouds disappear when the heat is on