Time |
Nickname |
Message |
01:09
🔗
|
nintendud |
hot damn, I just noticed the warrior is uploading to archive.org now |
01:09
🔗
|
nintendud |
and multiple uploads at once |
01:09
🔗
|
nintendud |
I like |
01:45
🔗
|
flaushy |
great talk at defcon SketchCow |
01:55
🔗
|
underscor |
http://ia600109.us.archive.org:8088/mrtg/networkv2.html |
01:55
🔗
|
underscor |
Guess where webshots started |
01:55
🔗
|
underscor |
xD |
01:59
🔗
|
nintendud |
hmmmmmmmmm |
02:57
🔗
|
SketchCow |
Things are slowing down on the webshots side for FOS, which is good. |
02:59
🔗
|
flaushy |
if i happen to have an ip-address change in the rsync process, is that a problem? |
03:04
🔗
|
Sue |
SketchCow: it's slowing down because a few of us are having issues running the script |
03:06
🔗
|
SketchCow |
No, no. |
03:06
🔗
|
SketchCow |
They should not be going to FOS anymore. |
03:06
🔗
|
SketchCow |
There's some stragglers. |
03:07
🔗
|
Sue |
oh |
03:50
🔗
|
godane |
uploaded: http://archive.org/details/cdrom-linuxformatmagazine-130 |
04:09
🔗
|
flaushy |
SketchCow: do you keep FOS running for rsync over the weekend? |
04:10
🔗
|
flaushy |
since i need to leave here soon and won't be able to fix stuff until monday, and currently getting some nasty errors with the new version of the script |
04:11
🔗
|
flaushy |
(it's a small node, nooon) |
04:14
🔗
|
flaushy |
ah nevermind, it won't let me run outdated code :/ |
04:23
🔗
|
S[h]O[r]T |
fos should be fine to accept pending rsyncs afaik |
04:23
🔗
|
S[h]O[r]T |
but you cant get new items from the tracker now |
04:23
🔗
|
S[h]O[r]T |
on the old code |
04:24
🔗
|
flaushy |
yepe just realized |
04:24
🔗
|
flaushy |
and thx :) |
05:04
🔗
|
Sue |
webshots standalone users getting async error: do which curl |
05:13
🔗
|
SketchCow |
http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-5 |
05:13
🔗
|
SketchCow |
Gaze upon the future! |
05:14
🔗
|
chronomex |
MEGAWARC |
05:14
🔗
|
chronomex |
I am chronomex and I approve this message. |
05:16
🔗
|
SketchCow |
Watching Breaking Bad |
05:16
🔗
|
chronomex |
I also approve of Breaking Bad. |
05:17
🔗
|
SketchCow |
What I like is that it used/uses all this music that, it is later reported, the bands had no say in it going in. |
05:18
🔗
|
SketchCow |
So you get someone getting his face shot off or some prostitute ruining her life, and a sad little musician has to see it was paired with their music. |
05:19
🔗
|
godane |
i found the glenn beck forums |
05:20
🔗
|
chronomex |
hahaha, I didn't know that |
05:20
🔗
|
godane |
what is funny is that there are 'pirated' copies of old glenn beck radio shows on it |
05:20
🔗
|
chronomex |
lol |
05:22
🔗
|
godane |
i found 3 days from 2007 |
05:26
🔗
|
SketchCow |
Uploading the SOPA blackout collection. |
05:26
🔗
|
SketchCow |
That should be interesting. |
05:28
🔗
|
SketchCow |
It'll be a nice complete grab. |
05:29
🔗
|
SketchCow |
I think we've about hit the end of the uploads through FOS of webshots |
05:32
🔗
|
godane |
looks like i need to grab the glenn beck forums |
05:32
🔗
|
godane |
based on wayback machine there are only grabs from 2009 and 2010 |
05:34
🔗
|
SketchCow |
Well, I'll give you an inside tip, godane. |
05:34
🔗
|
SketchCow |
By the end of October, the Wayback machine will have doubled its content. |
05:34
🔗
|
SketchCow |
Subsequently, it might be worth it to just wait to see what flies in first. |
05:34
🔗
|
godane |
ok |
05:38
🔗
|
SketchCow |
I've just killed webshots uploading on FOS |
05:38
🔗
|
SketchCow |
Since it's going to be replaced with a much more powerful system |
05:38
🔗
|
SketchCow |
And I need to dump a bunch of stuff through FOS to get it into Wayback |
05:39
🔗
|
SketchCow |
City of Heroes Forum is getting prepped for wayback. |
05:39
🔗
|
SketchCow |
That should be exciting in the extreme for them. |
05:40
🔗
|
SketchCow |
Wayback access |
06:20
🔗
|
SketchCow |
Breaking Bad.... perfect background for archiving work. |
06:31
🔗
|
DFJustin |
all glory to the megawarc |
06:34
🔗
|
Cameron_D |
Yay, all my BT users are done |
07:57
🔗
|
godane |
i know why there is no real archive for glenn beck forums |
07:58
🔗
|
chronomex |
why? |
07:58
🔗
|
godane |
you have to pay for access to it |
07:58
🔗
|
godane |
i think thats the case |
07:58
🔗
|
godane |
the link is only on the archive mp3 page anyway so it sort of makes sense |
07:59
🔗
|
chronomex |
hmm |
08:02
🔗
|
godane |
so i may be the only hope to archive this |
08:03
🔗
|
godane |
that's what i tell myself cause stuff like dl.tv and crankygeeks would have been lost if i left it up to you guys |
08:04
🔗
|
godane |
good thing i archived when i did |
08:17
🔗
|
chronomex |
Go for it! |
08:25
🔗
|
alard |
We're now out of btinternet usernames. 4 hard cases left, but that's it. |
09:00
🔗
|
SmileyG |
hurrar |
09:01
🔗
|
SmileyG |
Haha I was last upload? \o/ |
11:04
🔗
|
C-Keen |
hm, how can I add a warc file to the internet archive? |
11:20
🔗
|
alard |
C-Keen: http://archive.org/create/ |
11:20
🔗
|
C-Keen |
alard: and then just upload the warc? |
11:20
🔗
|
alard |
Yes, make a new item. |
11:21
🔗
|
C-Keen |
alard: will that get put in the wayback machine? |
11:21
🔗
|
alard |
Ah, that I do not know. |
11:21
🔗
|
alard |
Although if you upload it there and put it on SketchCow's list there might be a chance. |
11:22
🔗
|
C-Keen |
alard: alright, will do |
11:30
🔗
|
C-Keen |
alard: hm, should a website mirror that contains mainly educational texts and source code be put in the Community Texts collection? I am unsure what to pick here |
11:43
🔗
|
C-Keen |
hah uploaded my first item to the archive... |
11:53
🔗
|
alard |
I think Community Texts is the only collection you can pick. The others are protected (SketchCow can move items). |
12:00
🔗
|
SmileyG |
errrr |
12:00
🔗
|
SmileyG |
why has my webshots stopped generating new processes |
12:00
🔗
|
SmileyG |
Oh, restarting project ¬_¬ |
13:41
🔗
|
SketchCow |
I HAVE to start heading south. |
13:42
🔗
|
SketchCow |
But we just had a CDX derive fail off of a megawarc generator. |
13:43
🔗
|
SketchCow |
alard: http://www.us.archive.org/log_show.php?task_id=127674813 |
13:46
🔗
|
alard |
SketchCow: Is the original tar somewhere? |
13:47
🔗
|
SketchCow |
Sadly, no. |
13:47
🔗
|
SketchCow |
I should have left it. |
13:47
🔗
|
SketchCow |
BOARDS-COH-01.tar.megawarc.json.gz BOARDS-COH-02.tar.megawarc.warc.gz BOARDS-COH-04.tar.megawarc.tar |
13:47
🔗
|
SketchCow |
root@teamarchive-1:/2/CITY# ls |
13:47
🔗
|
SketchCow |
BOARDS-COH-01.tar.megawarc.tar BOARDS-COH-03.tar.megawarc.json.gz BOARDS-COH-04.tar.megawarc.warc.gz |
13:47
🔗
|
SketchCow |
BOARDS-COH-01.tar.megawarc.warc.gz BOARDS-COH-03.tar.megawarc.tar megawarc |
13:47
🔗
|
SketchCow |
BOARDS-COH-02.tar.megawarc.json.gz BOARDS-COH-03.tar.megawarc.warc.gz |
13:47
🔗
|
SketchCow |
I'm going to gunzip one myself while I get dressed here. |
13:47
🔗
|
SketchCow |
BOARDS-COH-02.tar.megawarc.tar BOARDS-COH-04.tar.megawarc.json.gz |
13:48
🔗
|
alard |
I'm downloading that failed .warc.gz now, but that will take a while. |
13:48
🔗
|
SketchCow |
I wouldn't do that. |
13:48
🔗
|
SketchCow |
root@teamarchive-1:/2/CITY# gunzip BOARDS-COH-01.tar.megawarc.warc.gz |
13:49
🔗
|
SketchCow |
More critical, MUCH more critical, is http://www.us.archive.org/log_show.php?task_id=127728961 |
13:49
🔗
|
SketchCow |
Watch it, and if it fails, THEN we have some testing to do. |
13:51
🔗
|
SketchCow |
http://www.us.archive.org/log_show.php?task_id=127728688 or this one. |
13:51
🔗
|
SketchCow |
That'll happen faster. |
13:52
🔗
|
alard |
These are converted tars? Or original megawarcs? |
13:52
🔗
|
SketchCow |
Converted tars. |
13:52
🔗
|
SketchCow |
Wait no |
13:52
🔗
|
SketchCow |
I took three different tars, uncompressed them into a file directory. |
13:53
🔗
|
SketchCow |
Then megawarc'd the file directory |
13:54
🔗
|
dragondon |
greetings all! is there a way to set the upload speed? This 0.4kB/s is ridiculous.... |
13:55
🔗
|
alard |
The BOARDS-COH-01 too? That would be strange, since it contains a directory, and the pack option isn't supposed to add directories. |
13:55
🔗
|
alard |
dragondon: Which project? |
13:55
🔗
|
dragondon |
webshots |
13:55
🔗
|
alard |
Does it say CurlUpload? |
13:55
🔗
|
dragondon |
yes |
13:56
🔗
|
alard |
Hmm. No, there isn't any limit. |
13:56
🔗
|
dragondon |
I did see higher speeds earlier but now it's dragging... |
13:56
🔗
|
dragondon |
been doing so for a few hours now |
13:58
🔗
|
SketchCow |
alard: So: |
13:58
🔗
|
SketchCow |
if http://www.us.archive.org/log_show.php?task_id=127728688 doesn't work, warning sign. |
13:58
🔗
|
SketchCow |
http://www.us.archive.org/log_show.php?task_id=127728961 is the critical one. |
13:58
🔗
|
SketchCow |
If that doesn't work, we have real issues, that's a webshots generator. |
13:59
🔗
|
SketchCow |
I have to start driving to NYC now. |
14:00
🔗
|
SketchCow |
If it doesn't work, for whatever reason (webshots), underscor needs to go back to .tar generation until we figure it out. |
14:00
🔗
|
SketchCow |
Otherwise, just assume BOARDS-COH is me doing something fucked up |
14:00
🔗
|
alard |
Yes. (I can't reach archive.org now. www.us.archive.org works.) |
14:00
🔗
|
godane |
see |
14:00
🔗
|
godane |
i'm not crazy when i couldn't get to archive.org |
14:00
🔗
|
dragondon |
same here (South Korea) "Iceweasel can't establish a connection to the server at archive.org" |
14:01
🔗
|
godane |
i got the same error |
14:01
🔗
|
godane |
:-D |
14:01
🔗
|
dragondon |
I can ping it thought |
14:01
🔗
|
dragondon |
though |
14:01
🔗
|
godane |
its not just me |
14:01
🔗
|
SketchCow |
Just alerted them |
14:03
🔗
|
alard |
The last gzip record in BOARDS-COH-01.tar.megawarc.warc.gz is fine. |
14:04
🔗
|
alard |
As is the first. |
14:04
🔗
|
dragondon |
with this version of the VM, will it lose everything if I force the machine to shutdown? I need to figure out some hardware issues here. |
14:05
🔗
|
alard |
Yes. |
14:05
🔗
|
dragondon |
I'm hoping that future updates will have a buffer to prevent that :) |
14:05
🔗
|
SketchCow |
http://www.us.archive.org/log_show.php?task_id=127728961 - task failed. |
14:06
🔗
|
SketchCow |
Luckily (?) it's a mysql error. |
14:06
🔗
|
alard |
dragondon: Wget can't resume, and upload resuming is complicated, so it's unlikely. |
14:06
🔗
|
dragondon |
:( |
14:07
🔗
|
dragondon |
alard, is there no way to generate the files first, then send, and have some sort of check/confirm/then resume? |
14:08
🔗
|
alard |
It's complicated, and you don't have to restart that often. Resume things would also complicate error recovery (if the warrior has a problem now, you can reboot and start again). |
14:09
🔗
|
dragondon |
it's not a warrior issue, for some reason my system is reporting only half my physical memory....kinda don't like killing all the work it did, hence why I was asking for any speed mods. Guess I'll have to force restart |
14:26
🔗
|
SketchCow |
the sopa item failed |
14:28
🔗
|
SketchCow |
in both of these cases, I generated it from a set of directory. |
14:28
🔗
|
alard |
I suspect there's an undetected invalid warc in there. |
14:30
🔗
|
SketchCow |
4 web shots I am going to suggest that we go back to generating large tar files. |
14:31
🔗
|
alard |
Yes. |
14:31
🔗
|
SketchCow |
it sounds like we need to do a few more additional tests. |
14:31
🔗
|
alard |
We didn't do enough. |
14:32
🔗
|
SketchCow |
that is just because I think of you as an unstoppable code juggernaut |
14:34
🔗
|
SketchCow |
however, there is a hole range of code you have absolutely no access to. |
14:37
🔗
|
alard |
It would be handy if these error messages included a byte position. That would make it easier to find the problem. |
14:43
🔗
|
SketchCow |
I am in the car in can't look this up easily, but I do believe there is a public repository of all this code. |
14:45
🔗
|
alard |
If the gzip is invalid (that's what the error message suggests, at least) that just needs to be fixed. There's nothing wrong with the indexer. |
14:45
🔗
|
alard |
./megawarc --verbose pack test.tar data/infiles/ |
14:45
🔗
|
alard |
Checking data/infiles/bad.warc.gz |
14:45
🔗
|
alard |
Checking data/infiles/good.warc.gz |
14:45
🔗
|
alard |
Copying data/infiles/good.warc.gz to warc |
14:45
🔗
|
alard |
Copying data/infiles/bad.warc.gz to warc |
14:46
🔗
|
alard |
That's wrong: bad.warc.gz isn't complete (I chopped off the last 1000 bytes) so it shouldn't go in the warc. |
14:49
🔗
|
alard |
The megawarc gzip-testing doesn't work, it seems. (The good news is that the positions in the json are correct, so the current megawarcs can be repaired.) |
15:00
🔗
|
SketchCow |
old versions are kept. |
15:00
🔗
|
alard |
For webshots? |
15:11
🔗
|
SketchCow |
all I had not to sure did wait 1 moment |
15:12
🔗
|
alard |
You shouldn't text while driving. :) |
15:12
🔗
|
chronomex |
Watch out SketchCow is using voice recognition |
15:12
🔗
|
SketchCow |
let's try again. Any archive that I was given are still in car for bad. The new batch was being tested, but we have not fully committed to it, instead we are just feelings disks on the round robin machine. |
15:13
🔗
|
alard |
Sure. |
15:14
🔗
|
SketchCow |
we were going to suck my nuts off |
15:15
🔗
|
SketchCow |
I let that 1 go because what I said was sign off. |
15:15
🔗
|
SketchCow |
obviously, voice recognition has a way to go |
15:17
🔗
|
SketchCow |
although, if something with my computer end up sucking my nuts off, dad hey, what's a little problem here and there with voice recognition? |
15:18
🔗
|
SketchCow |
maybe that's how the algorithm got the job in the first place |
15:29
🔗
|
SmileyG |
Fatal error: Uncaught exception 'Exception' with message 'WARNING-OR-ERROR: [2] [mysql_connect(): Too many connections] [/usr/local/petabox/www/common/DB.inc] [269]' in /usr/local/petabox/deriver/derive.php:46 |
15:29
🔗
|
SmileyG |
Stack trace:# |
15:29
🔗
|
SmileyG |
It died. |
15:29
🔗
|
SmileyG |
Though that doesn't appear to be an issue with the megawarc itself which seems good. |
15:42
🔗
|
alard |
I think it works better now: |
15:42
🔗
|
alard |
CRC check failed 0x5cdcbe41 != 0x30788a20L |
15:42
🔗
|
alard |
Checking data/infiles/bad-extra.warc.gz |
15:42
🔗
|
alard |
Checking data/infiles/good.warc.gz |
15:42
🔗
|
alard |
Copying data/infiles/good.warc.gz to warc |
15:42
🔗
|
alard |
Invalid gzip data/infiles/bad-extra.warc.gz |
15:42
🔗
|
alard |
Copying data/infiles/bad-extra.warc.gz to tar |
15:42
🔗
|
alard |
Checking data/infiles/bad.warc.gz |
15:42
🔗
|
alard |
CRC check failed 0xdcbe4175 != 0xc21fb9ffL |
15:42
🔗
|
alard |
Invalid gzip data/infiles/bad.warc.gz |
15:42
🔗
|
alard |
Copying data/infiles/bad.warc.gz to tar |
15:42
🔗
|
alard |
https://github.com/alard/megawarc/commit/fb0ba014ff4df76411cdd426a15764695a33c59e |
15:51
🔗
|
joepie91 |
oh look |
15:51
🔗
|
joepie91 |
http://catalysthost.com/clientarea/cart.php?gid=4 |
15:51
🔗
|
joepie91 |
:P |
15:51
🔗
|
joepie91 |
>Unmetered 1gbit |
17:02
🔗
|
sankin1 |
for $7 a month? doesn't sound like a bad deal |
17:21
🔗
|
underscor |
SketchCow: http://www.us.archive.org/log_show.php?task_id=127728961 failed due to DB problems, rerunning. |
17:22
🔗
|
underscor |
(DB problems that are unrelated to megawarc) |
17:25
🔗
|
underscor |
oh shit, boxes are almost full |
17:25
🔗
|
underscor |
better start bailing |
17:26
🔗
|
SmileyG |
underscor: herp |
17:26
🔗
|
SmileyG |
weren't they already doing so :S |
17:27
🔗
|
underscor |
hm? |
17:27
🔗
|
underscor |
There |
17:27
🔗
|
SmileyG |
Why are they running outta room? |
17:27
🔗
|
SmileyG |
:/ |
17:27
🔗
|
underscor |
Oh |
17:27
🔗
|
underscor |
There's no auto-ingest to IA |
17:27
🔗
|
SmileyG |
Do they not automatically pump to IA? |
17:27
🔗
|
SmileyG |
Ah ok. |
17:28
🔗
|
underscor |
Jason (and I) want a human to eyeball them |
17:28
🔗
|
underscor |
For now |
17:28
🔗
|
SmileyG |
Understandable. |
17:28
🔗
|
SmileyG |
So, do YOU work at IA? |
17:31
🔗
|
godane |
all of my theregister.co.uk warc dumps are up |
17:31
🔗
|
godane |
i have not done 2011 yet |
17:31
🔗
|
godane |
but its up to 2010 |
17:31
🔗
|
godane |
which is all i have right now |
17:40
🔗
|
underscor |
SmileyG: Yeah |
17:40
🔗
|
underscor |
I'm part time, though |
17:40
🔗
|
SmileyG |
o |
17:40
🔗
|
underscor |
(I'm a student the rest of the time) |
17:40
🔗
|
SmileyG |
Still, awesome. |
17:40
🔗
|
underscor |
In upstate NY |
17:40
🔗
|
underscor |
hehe, thanks :D |
17:41
🔗
|
SmileyG |
IA should have some more DC's ;) |
17:41
🔗
|
SmileyG |
Like one in coventry, hahah here, its cheap(yeah right) |
17:43
🔗
|
DFJustin |
looks like that megawarc has gz issues as well |
17:48
🔗
|
SmileyG |
Awww |
17:56
🔗
|
alard |
There must be quite a few invalid warc files then. |
18:05
🔗
|
underscor |
alard: I thought megawarc checked them out? |
18:06
🔗
|
SmileyG |
:< |
18:06
🔗
|
SmileyG |
hmmm this worries me |
18:06
🔗
|
underscor |
Is there a way to check the validity of a gz on the command line? |
18:06
🔗
|
underscor |
(besides just extracting it) |
18:12
🔗
|
SmileyG |
gunzip -t file.tar.gz |
18:12
🔗
|
underscor |
thx |
18:12
🔗
|
underscor |
alard: should we switch to tars for now, or what do you think? |
18:12
🔗
|
SmileyG |
for test :) |
18:13
🔗
|
SmileyG |
also hmmm |
18:13
🔗
|
SmileyG |
if you're worried about the tars, you can check them too |
18:13
🔗
|
SmileyG |
gunzip -c file.tar.gz | tar t > /dev/null |
18:13
🔗
|
underscor |
I am beginning to get close to drowning, so I need to figure out the exit strategy |
18:13
🔗
|
SmileyG |
herp |
18:14
🔗
|
underscor |
(people should save slower!) |
18:14
🔗
|
underscor |
xD |
18:14
🔗
|
SmileyG |
can you just do what the warrior would do? |
18:14
🔗
|
SmileyG |
but just the upload bit (and direct it to FOS/IA ? |
18:14
🔗
|
SmileyG |
You said theres... 12? servers/ |
18:16
🔗
|
underscor |
SmileyG: No, no, I *am* fos/IA |
18:16
🔗
|
* |
underscor is the servers warriors are uploading to |
18:16
🔗
|
underscor |
Those servers are nearing full |
18:16
🔗
|
SmileyG |
yeah |
18:16
🔗
|
SmileyG |
but originally all the warriors were uploading to 1 location, which is now a number of locations? |
18:16
🔗
|
underscor |
FOS is full/not accessible for this project |
18:16
🔗
|
underscor |
Yes |
18:16
🔗
|
SmileyG |
What was the plan for the original server? |
18:17
🔗
|
underscor |
It was to upload tars, which had been happening |
18:17
🔗
|
SmileyG |
can you not replicate that process over to the other servers? |
18:17
🔗
|
underscor |
now the plan was doing the megawarcing with the script from alard, which I've been doing |
18:17
🔗
|
underscor |
but if they're corrupt, then maybe we should go back to tars for now |
18:17
🔗
|
SmileyG |
yeah |
18:18
🔗
|
underscor |
I may just make a Command Decision(tm) since SketchCow is on the road |
18:18
🔗
|
underscor |
and deal with the fallout later |
18:19
🔗
|
SmileyG |
well if you don't, all archiving basically stops |
18:19
🔗
|
SmileyG |
unless you've got his number? |
18:19
🔗
|
underscor |
yeah, I may call him after class |
18:19
🔗
|
underscor |
http://p.defau.lt/?Fy8RdcZOojTsFPlt6Yyzcg |
18:19
🔗
|
underscor |
uh oh |
18:19
🔗
|
underscor |
cc alard |
18:23
🔗
|
DFJustin |
<underscor> alard: I thought megawarc checked them out? <-- I think it did, but the check wasn't working? (until he fixed it just now) |
18:23
🔗
|
DFJustin |
<alard> https://github.com/alard/megawarc/commit/fb0ba014ff4df76411cdd426a15764695a33c59e |
18:26
🔗
|
underscor |
aha |
18:34
🔗
|
underscor |
gunzip -t webshots-20121012070021.megawarc.warc.gz |
18:34
🔗
|
underscor |
gzip: webshots-20121012070021.megawarc.warc.gz: invalid compressed data--crc error |
18:34
🔗
|
underscor |
sigh |
18:35
🔗
|
underscor |
so I guess this one is fucked |
18:38
🔗
|
underscor |
sigh |
18:38
🔗
|
underscor |
gunzip -t webshots-20121012070358.megawarc.warc.gz |
18:38
🔗
|
underscor |
gzip: webshots-20121012070358.megawarc.warc.gz: invalid compressed data--crc error |
18:40
🔗
|
underscor |
Rebuilding using new code |
18:40
🔗
|
underscor |
alard: what does the script do if it encounters a "bad" .warc.gz? |
18:44
🔗
|
underscor |
DFJustin: Where did alard say that link btw? |
18:46
🔗
|
S[h]O[r]T |
underscor |
18:46
🔗
|
S[h]O[r]T |
do you not have history in here from earlier |
18:46
🔗
|
S[h]O[r]T |
alard and SketchCow were talking about the corruption. i can paste in pm if you need |
18:47
🔗
|
underscor |
woah, there we go |
18:47
🔗
|
underscor |
what the hell, quassel |
18:48
🔗
|
SmileyG |
how long does it take to rebuild :S |
18:49
🔗
|
underscor |
S[h]O[r]T: Found it. Not sure what quassel was doing saying there wasn't more scrollback >:( |
18:49
🔗
|
underscor |
SmileyG: Uh, I haven't timed them, actually |
18:50
🔗
|
underscor |
I assume if a set passes gunzip -t, then it's probably safe to upload |
18:51
🔗
|
SmileyG |
I *believe* so, the only better check is physically unpacking it and checking. |
18:51
🔗
|
SmileyG |
which kind of negates the point. |
18:58
🔗
|
alard |
underscor: It turned out that the gzip check I had in megawarc didn't really check anything. |
18:59
🔗
|
alard |
So if there was an invalid warc, it was added to the big warc, which then became unreadable. |
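[Editor's sketch, not alard's code: one way a gzip check can actually check something is to read the file all the way to the end, because the CRC and length trailer are only verified once the last byte has been consumed. The function name `gzip_is_valid` and the chunk size are assumptions made for illustration.]

```python
import gzip
import zlib


def gzip_is_valid(path, chunk_size=1024 * 1024):
    """Return True only if the whole file decompresses cleanly.

    Reading to EOF matters: a gzip member's CRC/length trailer is
    only compared once the end of that member has been reached.
    """
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass  # discard the data, we only care about errors
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers bad magic bytes and CRC mismatches,
        # EOFError covers a truncated stream,
        # zlib.error covers damaged deflate data.
        return False
```

[This is roughly what `gunzip -t` does from the command line; opening the file or reading only the header would not catch a chopped-off tail.]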
18:59
🔗
|
alard |
I think it is fixed in the latest megawarc version (it works on my test files, at least). |
18:59
🔗
|
underscor |
Is there a way to easyclean from the json? |
18:59
🔗
|
underscor |
Ouch. |
19:00
🔗
|
alard |
Before that fix, SketchCow suggested that we keep using tar until the megawarc is somewhat more stable and tested. |
19:00
🔗
|
alard |
Yes. |
19:00
🔗
|
underscor |
schweet |
19:00
🔗
|
alard |
The positions of the warcs in the json are correct. |
19:00
🔗
|
alard |
So it's possible to untangle them. |
19:01
🔗
|
alard |
So it might be an idea to keep using the latest megawarc script for webshots. |
19:01
🔗
|
alard |
I think it works, it's a good test. We also don't lose data if it does not, it just means rebuilding things. |
19:02
🔗
|
alard |
(To answer your question about what happens to the invalid gzips: they're added to the tar file.) |
19:02
🔗
|
SmileyG |
-Die in a Fire ? |
19:02
🔗
|
underscor |
ah, the "extras" tar file |
19:02
🔗
|
underscor |
? |
19:03
🔗
|
alard |
Yes. So if the tar file is not empty, that means there were things that couldn't be saved in the warc. |
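[Editor's sketch of the sorting behaviour described here, under stated assumptions: valid .warc.gz files are concatenated into one big warc, invalid ones go to a leftover tar, and an index records where each file ended up. The function names and the index keys (`name`, `target`, `offset`, `size`) are hypothetical and are not the real megawarc json layout.]

```python
import gzip
import io
import json
import tarfile
import zlib


def is_valid_gzip(data):
    """Fully decompress in memory; fine for a sketch, not for 50 GB."""
    try:
        gzip.decompress(data)
        return True
    except (OSError, EOFError, zlib.error):
        return False


def pack(files):
    """files: dict of name -> bytes. Returns (warc_bytes, tar_bytes, index_json)."""
    warc = io.BytesIO()
    tar_buf = io.BytesIO()
    index = []
    with tarfile.open(fileobj=tar_buf, mode="w") as tar:
        for name in sorted(files):
            data = files[name]
            if is_valid_gzip(data):
                # Concatenated gzip members still form a valid gzip stream.
                index.append({"name": name, "target": "warc",
                              "offset": warc.tell(), "size": len(data)})
                warc.write(data)
            else:
                # Anything that fails the check is kept, but out of the warc.
                info = tarfile.TarInfo(name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
                index.append({"name": name, "target": "tar"})
    return warc.getvalue(), tar_buf.getvalue(), json.dumps(index)
```

[The point of the split is that one bad member can no longer make the whole concatenated warc unreadable, and nothing is thrown away.]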
19:03
🔗
|
underscor |
What about the ones that say "extra field of 10 bytes ignored"? |
19:04
🔗
|
underscor |
(ones = warc.gz, when testing with gunzip -t) |
19:07
🔗
|
underscor |
Uploading the first new set |
19:08
🔗
|
alard |
That's the warc format: it has an extra gzip field with the length of the compressed warc record. |
19:08
🔗
|
SmileyG |
hmmm |
19:08
🔗
|
SmileyG |
gzip patch needed at some point then? :S |
19:08
🔗
|
alard |
That's handy if you want to skip through the warc, but the gzip utility doesn't know how to use it. |
19:08
🔗
|
alard |
Well, it does what it says: it sees an extra field and ignores it. |
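[Editor's sketch: the "extra field" lives in the gzip header. In the gzip format (RFC 1952), bit 2 of the FLG byte marks its presence, followed by a two-byte little-endian length and that many bytes of data. The helper name `read_gzip_extra` is an assumption, and the sketch does not interpret the contents of the field.]

```python
import struct

FEXTRA = 0x04  # FLG bit 2 in RFC 1952


def read_gzip_extra(data):
    """Return the raw extra field of the first gzip member, or None."""
    if len(data) < 10 or data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    flags = data[3]
    if not flags & FEXTRA:
        return None
    # The fixed header is 10 bytes; XLEN follows immediately after it.
    (xlen,) = struct.unpack_from("<H", data, 10)
    extra = data[12:12 + xlen]
    if len(extra) != xlen:
        raise ValueError("truncated extra field")
    return extra
```

[A decompressor that does not understand the field can simply skip those bytes, which matches the warning-but-success behaviour seen with `gunzip -t`.]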
19:08
🔗
|
SmileyG |
:D |
19:08
🔗
|
SmileyG |
least it doesn't blow up I guess |
19:09
🔗
|
SmileyG |
Wonder if you can tell the test to ignore it (so it only raises errors on _real_ error |
19:09
🔗
|
SmileyG |
s |
19:09
🔗
|
SmileyG |
I smell dinner. |
19:13
🔗
|
underscor |
SmileyG: It still returns $? = 0 |
19:13
🔗
|
underscor |
so it's not really a big deal |
19:14
🔗
|
alard |
Is this a new one? http://www.us.archive.org/catalog.php?history=1&identifier=webshots-freeze-frame-20121012103518 |
19:15
🔗
|
underscor |
Yes |
19:15
🔗
|
underscor |
Only the json is up though |
19:15
🔗
|
underscor |
the warc is still uploading |
19:16
🔗
|
alard |
Ah. Was there a tar? |
19:17
🔗
|
underscor |
0 bytes |
19:17
🔗
|
alard |
So it's exciting to see if this one passes the test. |
19:17
🔗
|
underscor |
It passed gunzip -t too |
19:17
🔗
|
SmileyG |
underscor: Ah ok ! I presumed it'd return some non-fatal error code |
19:17
🔗
|
SmileyG |
but if it's not showing it other than the stdout output.... no worries |
19:19
🔗
|
underscor |
alard: Is the procedure to fix these to "create" the tar backwards, and repack, or will you be able to write a "fixme" thing? :) |
19:19
🔗
|
alard |
I think it will be a fixme thing. |
19:19
🔗
|
underscor |
rad |
19:20
🔗
|
underscor |
another one finished! |
19:20
🔗
|
underscor |
-rw-r--r-- 1 abuie users 50G Oct 12 19:18 webshots-20121012070358.megawarc.warc.gz |
19:20
🔗
|
underscor |
-rw-r--r-- 1 abuie users 103K Oct 12 19:18 webshots-20121012070358.megawarc.json.gz |
19:20
🔗
|
underscor |
-rw-r--r-- 1 abuie users 388M Oct 12 19:18 webshots-20121012070358.megawarc.tar |
19:20
🔗
|
alard |
Hey, a tar. |
19:20
🔗
|
SmileyG |
working and looking correct now? |
19:20
🔗
|
alard |
That's both good and bad news. |
19:20
🔗
|
alard |
Good for megawarc, bad for webshots. |
19:21
🔗
|
SmileyG |
o_O |
19:22
🔗
|
underscor |
We can extract out the "bad" users and requeue them, though, right? |
19:23
🔗
|
SmileyG |
AH, the tars are failed users getting left over? |
19:24
🔗
|
underscor |
SmileyG: Yeah |
19:24
🔗
|
alard |
The invalid warcs end up in the tar. |
19:24
🔗
|
underscor |
Well, faulty warc.g |
19:24
🔗
|
underscor |
z |
19:24
🔗
|
underscor |
mhm |
19:25
🔗
|
alard |
We could make a list of the users that have made it to archive.org and compare that with the full list of users. |
19:25
🔗
|
alard |
But for the moment we have enough new users. |
19:26
🔗
|
underscor |
Lots of limestone networks hosts, wonder who that is |
19:26
🔗
|
underscor |
They're pumping a lot of data :D |
19:27
🔗
|
SmileyG |
Is it Sue? |
19:27
🔗
|
SmileyG |
She was saying shes gonna hit her cap shortly in #webshots |
19:29
🔗
|
underscor |
alard: http://archive.org/catalog.php?history=1&identifier=webshots-freeze-frame-20121012103518 Here we go! |
19:31
🔗
|
chronomex |
wooooo |
19:31
🔗
|
* |
chronomex parties |
19:34
🔗
|
underscor |
ugh, moving 50GB takes so long |
19:39
🔗
|
underscor |
http://www.us.archive.org/catalog.php?history=1&identifier=webshots-freeze-frame-20121012070358 is getting its replacement uploaded |
19:42
🔗
|
underscor |
schweet |
19:42
🔗
|
underscor |
Every box is now megawarcing |
19:43
🔗
|
underscor |
Although these take quite a bit of time |
19:43
🔗
|
underscor |
Wonder if I can keep up with the inflow |
19:43
🔗
|
chronomex |
INTERFLOW |
19:44
🔗
|
underscor |
http://ia600109.us.archive.org:8088/mrtg/networkv2.html http://ia601104.us.archive.org:8088/mrtg/networkv2.html http://ia700106.us.archive.org:8088/mrtg/networkv2.html |
19:44
🔗
|
underscor |
Y'all have been keeping them pretty nice and busy |
19:49
🔗
|
SmileyG |
- Downloaded 18400 URLs got another nice one :D |
21:02
🔗
|
underscor |
http://www.us.archive.org/log_show.php?task_id=127779327 It's cdxing now! |
21:02
🔗
|
underscor |
Cross fingers!!!! :D |
21:12
🔗
|
alard |
The CDX indexer is already running twice as long as the previous time. |
21:12
🔗
|
underscor |
's a good sign :D |
21:15
🔗
|
SketchCow |
hooray. |
21:15
🔗
|
SmileyG |
\o/ |
21:15
🔗
|
* |
SmileyG waits for underscor to start groveling. |
21:15
🔗
|
underscor |
yay, Jason's back |
21:16
🔗
|
chronomex |
25 seconds! |
21:16
🔗
|
underscor |
I have like 2TB processing |
21:16
🔗
|
underscor |
:P |
21:16
🔗
|
underscor |
megawarcing takes a fair bit of time/work, though |
21:16
🔗
|
underscor |
still can't quite tell if I'm filling faster than I'm dumping |
21:17
🔗
|
SmileyG |
hope not :S |
21:17
🔗
|
underscor |
Also, just the sheer (super awesome!) scale of moving 50gb bricks is... interesting |
21:17
🔗
|
SmileyG |
:) |
21:20
🔗
|
SketchCow |
[6~[6~[6~[6~[6~[6~[6~[6~ |
21:20
🔗
|
SmileyG |
what he said. |
21:23
🔗
|
alard |
The voice recognition gets more and more interesting. |
21:25
🔗
|
underscor |
hahahha |
21:25
🔗
|
underscor |
It was better when it was sucking his nuts off or whatever |
21:30
🔗
|
SketchCow |
so, status update please |
21:31
🔗
|
SketchCow |
this android ssh client has no pgup |
21:32
🔗
|
SketchCow |
also. comiccon is hell on earth. |
21:34
🔗
|
underscor |
SketchCow: We (think) we patched the bugs |
21:35
🔗
|
underscor |
Test derive is still running |
21:35
🔗
|
underscor |
but it got further than any of them have |
21:35
🔗
|
underscor |
so (probably) good |
21:35
🔗
|
underscor |
I have like 2.5TB to ingest |
21:35
🔗
|
underscor |
once we see how this goes |
21:44
🔗
|
alard |
Looks good? http://www.us.archive.org/log_show.php?task_id=127779327 |
21:47
🔗
|
underscor |
alard: you are the freakin' man |
21:47
🔗
|
underscor |
we need to set you up on gittip :D |
21:48
🔗
|
SmileyG |
i am so jeli |
21:49
🔗
|
underscor |
SketchCow: IT WORKED IT WORKED IT WORKED |
21:50
🔗
|
alard |
This isn't actually that much better than before: there's no tar with invalid warcs. That already worked. |
21:50
🔗
|
SmileyG |
http://archive.org/details/webshots-freeze-frame-20121012173401 the latest one lacks a warc? |
21:50
🔗
|
underscor |
SmileyG: still uploading |
21:51
🔗
|
underscor |
alard: oh |
21:51
🔗
|
SmileyG |
o |
21:51
🔗
|
SmileyG |
:D |
21:51
🔗
|
SketchCow |
find one where the problem is fixed versus before. |
21:51
🔗
|
SketchCow |
sopa is a good one |
21:51
🔗
|
underscor |
alard: 3 finished, no tar |
21:51
🔗
|
underscor |
:D |
21:52
🔗
|
alard |
The two that crashed with the zlib.error problem should have tars. |
21:52
🔗
|
underscor |
I just had to restart them |
21:52
🔗
|
underscor |
yes, those haven't finished |
21:52
🔗
|
alard |
I'm now testing my fix script on the sopa files. |
21:52
🔗
|
alard |
(Takes a while.) |
21:53
🔗
|
underscor |
Ah, this will be the one that lets us fix a bad megawarc.warc.gz |
21:53
🔗
|
underscor |
? |
21:53
🔗
|
SketchCow |
underscor. ask hank when the last load in of the wayback happens, please. |
21:54
🔗
|
alard |
Yes. It reads the megawarc, checks every warc.gz, sorts them into new warc/tar files, and saves the locations in a new json file. |
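[Editor's sketch of why correct positions make a faulty megawarc repairable: each original .warc.gz can be cut back out by offset and size, re-checked on its own, and sorted into a good or bad pile. This is not the real megawarc-fix; the function name and the index keys (`name`, `offset`, `size`) are hypothetical.]

```python
import gzip
import zlib


def split_by_index(megawarc, index):
    """Slice each recorded member out of the big file and re-test it.

    megawarc: bytes of the concatenated warc.gz
    index: list of dicts with hypothetical keys name/offset/size
    Returns (good, bad), each a list of (name, chunk) pairs.
    """
    good, bad = [], []
    for entry in index:
        start = entry["offset"]
        chunk = megawarc[start:start + entry["size"]]
        try:
            gzip.decompress(chunk)
            good.append((entry["name"], chunk))
        except (OSError, EOFError, zlib.error):
            bad.append((entry["name"], chunk))
    return good, bad
```

[Joining the good chunks gives a clean replacement warc; the bad chunks would go to the leftover tar, and new positions would be written to a fresh index.]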
21:54
🔗
|
alard |
It worked on my tiny test file, but 15GB takes a little longer. |
21:54
🔗
|
SketchCow |
alard, call it megarepair and add it to the repository. :) |
21:58
🔗
|
alard |
Too late, it's already called megawarc-fix. It's now in the repository. |
22:03
🔗
|
underscor |
SketchCow: asking |
22:04
🔗
|
underscor |
alard: pulling :D |
22:04
🔗
|
underscor |
you're amazing |
22:04
🔗
|
alard |
Might need some testing first, though. |
22:11
🔗
|
alard |
Hmm. Apparently not every tar header is exactly 512 bytes long. |
22:12
🔗
|
alard |
There's a 'gnu tar' type that has headers of 1024, 1536 etc bytes long, if there are long filenames. |
22:12
🔗
|
alard |
As there are in the SOPA file. |
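[Editor's sketch illustrating the observation above with Python's standard `tarfile` module: in GNU format, a file name longer than 100 bytes is stored in an extra `././@LongLink` header block plus a data block holding the name, so the member's data no longer starts 512 bytes after the start of its entry. The helper name `header_span` is an assumption.]

```python
import io
import tarfile


def header_span(name, payload=b"x"):
    """Build a one-member GNU tar in memory.

    Returns (bytes of header material before the member's data,
    first 13 raw bytes of the archive).
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    raw = buf.getvalue()
    with tarfile.open(fileobj=io.BytesIO(raw)) as tar:
        member = tar.getmembers()[0]
        # The archive holds a single member starting at byte 0, so the
        # data offset equals the total size of everything before the data.
        return member.offset_data, raw[:13]


short_span, _ = header_span("short.warc.gz")
long_span, prefix = header_span("d/" + "x" * 150 + ".warc.gz")
print(short_span, long_span, prefix)
```

[Any tool that assumes a fixed 512-byte header when computing positions will be off for such entries, which is consistent with the long-filename trouble reported for the SOPA file.]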
22:35
🔗
|
underscor |
-rw-r--r-- 1 abuie users 50G Oct 12 21:37 webshots-20121012183139.megawarc.warc.gz |
22:35
🔗
|
underscor |
-rw-r--r-- 1 abuie users 73K Oct 12 21:37 webshots-20121012183139.megawarc.json.gz |
22:35
🔗
|
underscor |
-rw-r--r-- 1 abuie users 639M Oct 12 21:37 webshots-20121012183139.megawarc.tar |
22:35
🔗
|
underscor |
alard: One of the ones with tars finished, fyi |
22:35
🔗
|
underscor |
Uploading now |
23:20
🔗
|
alard |
So, which date is *only* available in a faulty megawarc? |
23:20
🔗
|
alard |
*data |
23:20
🔗
|
alard |
The SOPA megawarc is really hard to fix, since it has these ridiculously long file names. |
23:23
🔗
|
alard |
So I'd like to suggest that we 1. make new megawarcs from scratch, and test with gunzip -tv / tar -tv / megawarc restore, if we have the original data; and 2. use megawarc-fix to fix the megawarcs that we don't have in another form, such as webshots. |
23:23
🔗
|
underscor |
alard: I'm currently running the fixer on 20121012070021 |
23:23
🔗
|
alard |
webshots doesn't have long filenames, so the fixer should work for those files. |
23:27
🔗
|
alard |
There are no long filenames in 20121012070021, so that should work, I hope: curl -s -L http://archive.org/download/webshots-freeze-frame-20121012070021/webshots-20121012070021.megawarc.json.gz | gunzip | grep LongLink |
23:28
🔗
|
alard |
CoH can also be fixed: curl -s -L http://archive.org/download/archiveteam-city-of-heroes-forums-megawarc-1/BOARDS-COH-01.tar.megawarc.json.gz | gunzip | grep LongLink |
23:29
🔗
|
alard |
But SOPA can not: curl -s -L http://archive.org/download/archiveteam-sopa-blackout/2012-sopa-day-collection.megawarc.json.gz | gunzip | grep LongLink |
23:35
🔗
|
underscor |
I'll defer to SketchCow before fixing CoH, but I assume he'll want it to be |
23:56
🔗
|
tef_ |
how are they corrupt ? |