#archiveteam 2012-10-12,Fri


Time Nickname Message
01:09 🔗 nintendud hot damn, I just noticed the warrior is uploading to archive.org now
01:09 🔗 nintendud and multiple uploads at once
01:09 🔗 nintendud I like
01:45 🔗 flaushy great talk at defcon SketchCow
01:55 🔗 underscor http://ia600109.us.archive.org:8088/mrtg/networkv2.html
01:55 🔗 underscor Guess where webshots started
01:55 🔗 underscor xD
01:59 🔗 nintendud hmmmmmmmmm
02:57 🔗 SketchCow Things are slowing down on the webshots side for FOS, which is good.
02:59 🔗 flaushy if i happen to have an ip-address change in the rsync process, is that a problem?
03:04 🔗 Sue SketchCow: it's slowing down because a few of us are having issues running the script
03:06 🔗 SketchCow No, no.
03:06 🔗 SketchCow They should not be going to FOS anymore.
03:06 🔗 SketchCow There's some stragglers.
03:07 🔗 Sue oh
03:50 🔗 godane uploaded: http://archive.org/details/cdrom-linuxformatmagazine-130
04:09 🔗 flaushy SketchCow: do you keep FOS running for rsync over the weekend?
04:10 🔗 flaushy since i need to leave here soon and won't be able to fix stuff until monday, and currently getting some nasty errors with the new version of the script
04:11 🔗 flaushy (it's a small node, nooon)
04:14 🔗 flaushy ah nevermind, it wont let me run outdated code :/
04:23 🔗 S[h]O[r]T fos should be fine to accept pending rsyncs afaik
04:23 🔗 S[h]O[r]T but you cant get new items from the tracker now
04:23 🔗 S[h]O[r]T on the old code
04:24 🔗 flaushy yepe just realized
04:24 🔗 flaushy and thx :)
05:04 🔗 Sue webshots standalone users getting async error: do which curl
05:13 🔗 SketchCow http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-5
05:13 🔗 SketchCow Gaze upon the future!
05:14 🔗 chronomex MEGAWARC
05:14 🔗 chronomex I am chronomex and I approve this message.
05:16 🔗 SketchCow Watching Breaking Bad
05:16 🔗 chronomex I also approve of Breaking Bad.
05:17 🔗 SketchCow What I like is that it used/uses all this music that, it is later reported, the bands had no say in it going in.
05:18 🔗 SketchCow So you get someone getting his face shot off or some prostitute ruining her life, and a sad little musician has to see it was paired with their music.
05:19 🔗 godane i found the glenn beck forums
05:20 🔗 chronomex hahaha, I didn't know that
05:20 🔗 godane what is funny is that there are 'pirated' copies of old glenn beck radio shows on it
05:20 🔗 chronomex lol
05:22 🔗 godane i found 3 days from 2007
05:26 🔗 SketchCow Uploading the SOPA blackout collection.
05:26 🔗 SketchCow That should be interesting.
05:28 🔗 SketchCow It'll be a nice complete grab.
05:29 🔗 SketchCow I think we've about hit the end of the uploads through FOS of webshots
05:32 🔗 godane looks like i need to grab the glenn beck forums
05:32 🔗 godane based on wayback machine there are only grabs from 2009 and 2010
05:34 🔗 SketchCow Well, I'll give you an inside tip, godane.
05:34 🔗 SketchCow By the end of October, the Wayback machine will have doubled its content.
05:34 🔗 SketchCow Subsequently, it might be worth it to just wait to see what flies in first.
05:34 🔗 godane ok
05:38 🔗 SketchCow I've just killed webshots uploading on FOS
05:38 🔗 SketchCow Since it's going to be replaced with a much more powerful system
05:38 🔗 SketchCow And I need to dump a bunch of stuff through FOS to get it into Wayback
05:39 🔗 SketchCow City of Heroes Forum is getting prepped for wayback.
05:39 🔗 SketchCow That should be exciting in the extreme for them.
05:40 🔗 SketchCow Wayback access
06:20 🔗 SketchCow Breaking Bad.... perfect background for archiving work.
06:31 🔗 DFJustin all glory to the megawarc
06:34 🔗 Cameron_D Yay, all my BT users are done
07:57 🔗 godane i know why there is not real archive for glenn beck forums
07:58 🔗 chronomex why?
07:58 🔗 godane you have to pay for access to it
07:58 🔗 godane i think thats the case
07:58 🔗 godane the link is only on the archive mp3 page anyway so it sort makes sense
07:59 🔗 chronomex hmm
08:02 🔗 godane so i may be the only hope to archive this
08:03 🔗 godane thats what i tell my self cause stuff like dl.tv and crankygeeks would have been lost if i left it up to you guys
08:04 🔗 godane good thing i archived when i did
08:17 🔗 chronomex Go for it!
08:25 🔗 alard We're now out of btinternet usernames. 4 hard cases left, but that's it.
09:00 🔗 SmileyG hurrar
09:01 🔗 SmileyG Haha I was last upload? \o/
11:04 🔗 C-Keen hm, how can I add a warc file to the internet archive?
11:20 🔗 alard C-Keen: http://archive.org/create/
11:20 🔗 C-Keen alard: and then just upload the warc?
11:20 🔗 alard Yes, make a new item.
11:21 🔗 C-Keen alard: will that get put in the wayback machine?
11:21 🔗 alard Ah, that I do not know.
11:21 🔗 alard Although if you upload it there and put it on SketchCow's list there might be a chance.
11:22 🔗 C-Keen alard: alright, will do
11:30 🔗 C-Keen alard: hm, should a website mirror that contains mainly educational texts and source code be put in the Community Texts collection? I am unsure what to pick here
11:43 🔗 C-Keen hah uploaded my first item to the archive...
11:53 🔗 alard I think Community Texts is the only collection you can pick. The others are protected (SketchCow can move items).
12:00 🔗 SmileyG errrr
12:00 🔗 SmileyG why has my webshots stopped generating new processes
12:00 🔗 SmileyG Oh, restarting project ¬_¬
13:41 🔗 SketchCow I HAVE to start heading south.
13:42 🔗 SketchCow But we just had a CDX derive fail off of a megawarc generator.
13:43 🔗 SketchCow alard: http://www.us.archive.org/log_show.php?task_id=127674813
13:46 🔗 alard SketchCow: Is the original tar somewhere?
13:47 🔗 SketchCow Sadly, no.
13:47 🔗 SketchCow I should have left it.
13:47 🔗 SketchCow root@teamarchive-1:/2/CITY# ls
13:47 🔗 SketchCow BOARDS-COH-01.tar.megawarc.json.gz BOARDS-COH-02.tar.megawarc.warc.gz BOARDS-COH-04.tar.megawarc.tar
13:47 🔗 SketchCow BOARDS-COH-01.tar.megawarc.tar BOARDS-COH-03.tar.megawarc.json.gz BOARDS-COH-04.tar.megawarc.warc.gz
13:47 🔗 SketchCow BOARDS-COH-01.tar.megawarc.warc.gz BOARDS-COH-03.tar.megawarc.tar megawarc
13:47 🔗 SketchCow BOARDS-COH-02.tar.megawarc.json.gz BOARDS-COH-03.tar.megawarc.warc.gz
13:47 🔗 SketchCow BOARDS-COH-02.tar.megawarc.tar BOARDS-COH-04.tar.megawarc.json.gz
13:47 🔗 SketchCow I'm going to gunzip one myself while I get dressed here.
13:48 🔗 alard I'm downloading that failed .warc.gz now, but that will take a while.
13:48 🔗 SketchCow I wouldn't do that.
13:48 🔗 SketchCow root@teamarchive-1:/2/CITY# gunzip BOARDS-COH-01.tar.megawarc.warc.gz
13:49 🔗 SketchCow More critical, MUCH more critical, is http://www.us.archive.org/log_show.php?task_id=127728961
13:49 🔗 SketchCow Watch it, and if it fails, THEN we have some testing to do.
13:51 🔗 SketchCow http://www.us.archive.org/log_show.php?task_id=127728688 or this one.
13:51 🔗 SketchCow That'll happen faster.
13:52 🔗 alard These are converted tars? Or original megawarcs?
13:52 🔗 SketchCow Converted tars.
13:52 🔗 SketchCow Wait no
13:52 🔗 SketchCow I took three different tars, uncompressed them into a file directory.
13:53 🔗 SketchCow Then megawarc'd the file directory
13:54 🔗 dragondon greetings all! is there a way to set the upload speed? This 0.4kB/s is ridiculous....
13:55 🔗 alard The BOARDS-COH-01 too? That would be strange, since it contains a directory, and the pack option isn't supposed to add directories.
13:55 🔗 alard dragondon: Which project?
13:55 🔗 dragondon webshots
13:55 🔗 alard Does it say CurlUpload?
13:55 🔗 dragondon yes
13:56 🔗 alard Hmm. No, there isn't any limit.
13:56 🔗 dragondon I did see higher speeds earlier but now it's dragging...
13:56 🔗 dragondon been doing so for a few hours now
13:58 🔗 SketchCow alard: So:
13:58 🔗 SketchCow if http://www.us.archive.org/log_show.php?task_id=127728688 doesn't work, warning sign.
13:58 🔗 SketchCow http://www.us.archive.org/log_show.php?task_id=127728961 is the critical one.
13:58 🔗 SketchCow If that doesn't work, we have real issues, that's a webshots generator.
13:59 🔗 SketchCow I have to start driving to NYC now.
14:00 🔗 SketchCow If it doesn't work, for whatever reason (webshots), underscor needs to go back to .tar generation until we figure it out.
14:00 🔗 SketchCow Otherwise, just assume BOARDS-COH is me doing something fucked up
14:00 🔗 alard Yes. (I can't reach archive.org now. www.us.archive.org works.)
14:00 🔗 godane see
14:00 🔗 godane i'm not crazy when i couldn't get to archive.org
14:00 🔗 dragondon same here (South Korea) "Iceweasel can't establish a connection to the server at archive.org"
14:01 🔗 godane i got the same error
14:01 🔗 godane :-D
14:01 🔗 dragondon I can ping it thought
14:01 🔗 dragondon though
14:01 🔗 godane its not just me
14:01 🔗 SketchCow Just alerted them
14:03 🔗 alard The last gzip record in BOARDS-COH-01.tar.megawarc.warc.gz is fine.
14:04 🔗 alard As is the first.
14:04 🔗 dragondon with this version of the VM, will it lose everything if I force the machine to shutdown? I need to figure out some hardware issues here.
14:05 🔗 alard Yes.
14:05 🔗 dragondon I'm hoping that future updates will have a buffer to prevent that :)
14:05 🔗 SketchCow http://www.us.archive.org/log_show.php?task_id=127728961 - task failed.
14:06 🔗 SketchCow Luckily (?) it's a mysql error.
14:06 🔗 alard dragondon: Wget can't resume, and upload resuming is complicated, so it's unlikely.
14:06 🔗 dragondon :(
14:07 🔗 dragondon alard, is there no way to generate the files first, then send, and have some sort of check/confirm/then resume?
14:08 🔗 alard It's complicated, and you don't have to restart that often. Resume things would also complicate error recovery (if the warrior has a problem now, you can reboot and start again).
14:09 🔗 dragondon it's not a warrior issue, for some reason my system is reporting only half my physical memory....kinda don't like killing all the work it did, hence why I was asking for any speed mods. Guess I'll have to force restart
14:26 🔗 SketchCow the sopa item failed
14:28 🔗 SketchCow in both of these cases, I generated it from a set of directories.
14:28 🔗 alard I suspect there's an undetected invalid warc in there.
14:30 🔗 SketchCow For webshots, I am going to suggest that we go back to generating large tar files.
14:31 🔗 alard Yes.
14:31 🔗 SketchCow it sounds like we need to do a few more additional tests.
14:31 🔗 alard We didn't do enough.
14:32 🔗 SketchCow that is just because I think of you as an unstoppable code juggernaut
14:34 🔗 SketchCow however, there is a whole range of code you have absolutely no access to.
14:37 🔗 alard It would be handy if these error messages included a byte position. That would make it easier to find the problem.
14:43 🔗 SketchCow I am in the car and can't look this up easily, but I do believe there is a public repository of all this code.
14:45 🔗 alard If the gzip is invalid (that's what the error message suggests, at least) that just needs to be fixed. There's nothing wrong with the indexer.
14:45 🔗 alard ./megawarc --verbose pack test.tar data/infiles/
14:45 🔗 alard Checking data/infiles/bad.warc.gz
14:45 🔗 alard Checking data/infiles/good.warc.gz
14:45 🔗 alard Copying data/infiles/good.warc.gz to warc
14:45 🔗 alard Copying data/infiles/bad.warc.gz to warc
14:46 🔗 alard That's wrong: bad.warc.gz isn't complete (I chopped off the last 1000 bytes) so it shouldn't go in the warc.
14:49 🔗 alard The megawarc gzip-testing doesn't work, it seems. (The good news is that the positions in the json are correct, so the current megawarcs can be repaired.)
15:00 🔗 SketchCow old versions are kept.
15:00 🔗 alard For webshots?
15:11 🔗 SketchCow all I had not to sure did wait 1 moment
15:12 🔗 alard You shouldn't text while driving. :)
15:12 🔗 chronomex Watch out SketchCow is using voice recognition
15:12 🔗 SketchCow let's try again. Any archive that I was given are still in car for bad. The new batch was being tested, but we have not fully committed to it, instead we are just feelings disks on the round robin machine.
15:13 🔗 alard Sure.
15:14 🔗 SketchCow we were going to suck my nuts off
15:15 🔗 SketchCow I let that 1 go because what I said was sign off.
15:15 🔗 SketchCow obviously, voice recognition has a way to go
15:17 🔗 SketchCow although, if something with my computer end up sucking my nuts off, dad hey, what's a little problem here and there with voice recognition?
15:18 🔗 SketchCow maybe that's how the algorithm got the job in the first place
15:29 🔗 SmileyG Fatal error: Uncaught exception 'Exception' with message 'WARNING-OR-ERROR: [2] [mysql_connect(): Too many connections] [/usr/local/petabox/www/common/DB.inc] [269]' in /usr/local/petabox/deriver/derive.php:46
15:29 🔗 SmileyG Stack trace:#
15:29 🔗 SmileyG It died.
15:29 🔗 SmileyG Though that doesn't appear to be an issue with the megawarc itself which seems good.
15:42 🔗 alard I think it works better now:
15:42 🔗 alard CRC check failed 0x5cdcbe41 != 0x30788a20L
15:42 🔗 alard Checking data/infiles/bad-extra.warc.gz
15:42 🔗 alard Checking data/infiles/good.warc.gz
15:42 🔗 alard Copying data/infiles/good.warc.gz to warc
15:42 🔗 alard Invalid gzip data/infiles/bad-extra.warc.gz
15:42 🔗 alard Copying data/infiles/bad-extra.warc.gz to tar
15:42 🔗 alard Checking data/infiles/bad.warc.gz
15:42 🔗 alard CRC check failed 0xdcbe4175 != 0xc21fb9ffL
15:42 🔗 alard Invalid gzip data/infiles/bad.warc.gz
15:42 🔗 alard Copying data/infiles/bad.warc.gz to tar
15:42 🔗 alard https://github.com/alard/megawarc/commit/fb0ba014ff4df76411cdd426a15764695a33c59e
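[Editor's note] The check-then-route behaviour in the output above can be sketched roughly as follows. A gzip file is only trusted after it has been read to the very end, because that is when the CRC and length trailer are compared. This is not the code from the linked commit: `is_valid_gzip` and `sort_inputs` are hypothetical names chosen for illustration, and the sketch assumes plain files on disk and Python's standard `gzip` module.

```python
import gzip
import zlib


def is_valid_gzip(path, chunk_size=1 << 20):
    """Return True only if the whole file decompresses cleanly.

    Reading until EOF forces the gzip module to verify the CRC32 and
    length trailer; a truncated or corrupted file raises instead.
    """
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass  # discard the data, we only care that it reads
        return True
    except (OSError, EOFError, zlib.error):
        # gzip.BadGzipFile is a subclass of OSError.
        return False


def sort_inputs(paths):
    """Split paths into (files for the big warc, files for the extras tar)."""
    to_warc, to_tar = [], []
    for path in paths:
        if is_valid_gzip(path):
            to_warc.append(path)
        else:
            to_tar.append(path)
    return to_warc, to_tar
```

With a good file, a truncated file and a file whose CRC was altered, only the first should land in the warc list; the other two would be set aside for the tar.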
15:51 🔗 joepie91 oh look
15:51 🔗 joepie91 http://catalysthost.com/clientarea/cart.php?gid=4
15:51 🔗 joepie91 :P
15:51 🔗 joepie91 >Unmetered 1gbit
17:02 🔗 sankin1 for $7 a month? doesn't sound like a bad deal
17:21 🔗 underscor SketchCow: http://www.us.archive.org/log_show.php?task_id=127728961 failed due to DB problems, rerunning.
17:22 🔗 underscor (DB problems that are unrelated to megawarc)
17:25 🔗 underscor oh shit, boxes are almost full
17:25 🔗 underscor better start bailing
17:26 🔗 SmileyG underscor: herp
17:26 🔗 SmileyG weren't they already doing so :S
17:27 🔗 underscor hm?
17:27 🔗 underscor There
17:27 🔗 SmileyG Why are they running outta room?
17:27 🔗 SmileyG :/
17:27 🔗 underscor Oh
17:27 🔗 underscor There's no auto-ingest to IA
17:27 🔗 SmileyG Do they not automatically pump to IA?
17:27 🔗 SmileyG Ah ok.
17:28 🔗 underscor Jason (and I) want a human to eyeball them
17:28 🔗 underscor For now
17:28 🔗 SmileyG Understandable.
17:28 🔗 SmileyG So, do YOU work at IA?
17:31 🔗 godane all of my theregister.co.uk warc dumps are up
17:31 🔗 godane i have not done 2011 yet
17:31 🔗 godane but its up to 2010
17:31 🔗 godane which is all i have right now
17:40 🔗 underscor SmileyG: Yeah
17:40 🔗 underscor I'm part time, though
17:40 🔗 SmileyG o
17:40 🔗 underscor (I'm a student the rest of the time)
17:40 🔗 SmileyG Still, awesome.
17:40 🔗 underscor In upstate NY
17:40 🔗 underscor hehe, thanks :D
17:41 🔗 SmileyG IA should have some more DC's ;)
17:41 🔗 SmileyG Like one in coventry, hahah here, its cheap(yeah right)
17:43 🔗 DFJustin looks like that megawarc has gz issues as well
17:48 🔗 SmileyG Awww
17:56 🔗 alard There must be quite a few invalid warc files then.
18:05 🔗 underscor alard: I thought megawarc checked them out?
18:06 🔗 SmileyG :<
18:06 🔗 SmileyG hmmm this worries me
18:06 🔗 underscor Is there a way to check the validity of a gz on the command line?
18:06 🔗 underscor (besides just extracting it)
18:12 🔗 SmileyG gunzip -t file.tar.gz
18:12 🔗 underscor thx
18:12 🔗 underscor alard: should we switch to tars for now, or what do you think?
18:12 🔗 SmileyG for test :)
18:13 🔗 SmileyG also hmmm
18:13 🔗 SmileyG if your worried about the tars, you can check them too
18:13 🔗 SmileyG gunzip -c file.tar.gz | tar t > /dev/null
18:13 🔗 underscor I am beginning to get close to drowning, so I need to figure out the exit strategy
18:13 🔗 SmileyG herp
18:14 🔗 underscor (people should save slower!)
18:14 🔗 underscor xD
18:14 🔗 SmileyG can you just do what the warrior would do?
18:14 🔗 SmileyG but just the upload bit (and direct it to FOS/IA ?
18:14 🔗 SmileyG You said theres... 12? servers/
18:16 🔗 underscor SmileyG: No, no, I *am* fos/IA
18:16 🔗 * underscor is the servers warriors are uploading to
18:16 🔗 underscor Those servers are nearing full
18:16 🔗 SmileyG yeah
18:16 🔗 SmileyG but originally all the warriors were uploading to 1 location, which is now a number of locations?
18:16 🔗 underscor FOS is full/not accessible for this project
18:16 🔗 underscor Yes
18:16 🔗 SmileyG What was the plan for the original server?
18:17 🔗 underscor It was to upload tars, which had been happening
18:17 🔗 SmileyG can you not replicate that process over to the other servers?
18:17 🔗 underscor now the plan was doing the megawarcing with the script from alard, which I've been doing
18:17 🔗 underscor but if they're corrupt, then maybe we should go back to tars for now
18:17 🔗 SmileyG yeah
18:18 🔗 underscor I may just make a Command Decision(tm) since SketchCow is on the road
18:18 🔗 underscor and deal with the fallout later
18:19 🔗 SmileyG well if you don't, all archiving basically stops
18:19 🔗 SmileyG unless you've got his number?
18:19 🔗 underscor yeah, I may call him after class
18:19 🔗 underscor http://p.defau.lt/?Fy8RdcZOojTsFPlt6Yyzcg
18:19 🔗 underscor uh oh
18:19 🔗 underscor cc alard
18:23 🔗 DFJustin <underscor> alard: I thought megawarc checked them out? <-- I think it did, but the check wasn't working? (until he fixed it just now)
18:23 🔗 DFJustin <alard> https://github.com/alard/megawarc/commit/fb0ba014ff4df76411cdd426a15764695a33c59e
18:26 🔗 underscor aha
18:34 🔗 underscor gunzip -t webshots-20121012070021.megawarc.warc.gz
18:34 🔗 underscor gzip: webshots-20121012070021.megawarc.warc.gz: invalid compressed data--crc error
18:34 🔗 underscor sigh
18:35 🔗 underscor so I guess this one is fucked
18:38 🔗 underscor sigh
18:38 🔗 underscor gunzip -t webshots-20121012070358.megawarc.warc.gz
18:38 🔗 underscor gzip: webshots-20121012070358.megawarc.warc.gz: invalid compressed data--crc error
18:40 🔗 underscor Rebuilding using new code
18:40 🔗 underscor alard: what does the script do if it encounters a "bad" .warc.gz?
18:44 🔗 underscor DFJustin: Where did alard say that link btw?
18:46 🔗 S[h]O[r]T underscor
18:46 🔗 S[h]O[r]T do you not have history in here from earlier
18:46 🔗 S[h]O[r]T alard and SketchCow were talking about the corruption. i can paste in pm if you need
18:47 🔗 underscor woah, there we go
18:47 🔗 underscor what the hell, quassel
18:48 🔗 SmileyG how long does it take to rebuild :S
18:49 🔗 underscor S[h]O[r]T: Found it. Not sure what quassel was doing saying there wasn't more scrollback >:(
18:49 🔗 underscor SmileyG: Uh, I haven't timed them, actually
18:50 🔗 underscor I assume if a set passes gunzip -t, then it's probably safe to upload
18:51 🔗 SmileyG I *believe* so, the only better check is physically unpacking it and checking.
18:51 🔗 SmileyG which kind of negates the point.
18:58 🔗 alard underscor: It turned out that the gzip check I had in megawarc didn't really check anything.
18:59 🔗 alard So if there was an invalid warc, it was added to the big warc, which then became unreadable.
18:59 🔗 alard I think it is fixed in the latest megawarc version (it works on my test files, at least).
18:59 🔗 underscor Is there a way to easyclean from the json?
18:59 🔗 underscor Ouch.
19:00 🔗 alard Before that fix, SketchCow suggested that we keep using tar until the megawarc is somewhat more stable and tested.
19:00 🔗 alard Yes.
19:00 🔗 underscor schweet
19:00 🔗 alard The positions of the warcs in the json are correct.
19:00 🔗 alard So it's possible to untangle them.
19:01 🔗 alard So it might be an idea to keep using the latest megawarc script for webshots.
19:01 🔗 alard I think it works, it's a good test. We also don't lose data if it does not, it just means rebuilding things.
19:02 🔗 alard (To answer your question about what happens to the invalid gzips: they're added to the tar file.)
19:02 🔗 SmileyG -Die in a Fire ?
19:02 🔗 underscor ah, the "extras" tar file
19:02 🔗 underscor ?
19:03 🔗 alard Yes. So if the tar file is not empty, that means there were things that couldn't be saved in the warc.
19:03 🔗 underscor What about the ones that say "extra field of 10 bytes ignored"?
19:04 🔗 underscor (ones = warc.gz, when testing with gunzip -t)
19:07 🔗 underscor Uploading the first new set
19:08 🔗 alard That's the warc format: it has an extra gzip field with the length of the compressed warc record.
19:08 🔗 SmileyG hmmm
19:08 🔗 SmileyG gzip patch needed at some point then? :S
19:08 🔗 alard That's handy if you want to skip through the warc, but the gzip utility doesn't know how to use it.
19:08 🔗 alard Well, it does what it says: it sees an extra field and ignores it.
19:08 🔗 SmileyG :D
19:08 🔗 SmileyG least it doesn't blow up I guess
19:09 🔗 SmileyG Wonder if you can tell the test to ignore it (so it only raises errors on _real_ error
19:09 🔗 SmileyG s
19:09 🔗 SmileyG I smell diner.
19:13 🔗 underscor SmileyG: It still returns $? = 0
19:13 🔗 underscor so it's not really a big deal
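[Editor's note] The "extra field" discussed above is the optional FEXTRA block of the gzip header described in RFC 1952. A minimal sketch of where it sits, assuming a well-formed first member: `read_gzip_extra` and `gzip_with_extra` are hypothetical helpers written for this note, and the sketch does not try to reproduce the exact subfield layout that warc writers use.

```python
import struct
import zlib

FEXTRA = 0x04  # FLG bit 2 in RFC 1952: an extra field follows the header


def read_gzip_extra(data):
    """Return the raw extra field of the first gzip member, or None."""
    if len(data) < 10 or data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    flags = data[3]
    if not flags & FEXTRA:
        return None
    # XLEN is a 2-byte little-endian length right after the 10-byte header.
    (xlen,) = struct.unpack_from("<H", data, 10)
    return data[12:12 + xlen]


def gzip_with_extra(payload, extra):
    """Build one gzip member by hand so the extra field can be set."""
    deflater = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate
    body = deflater.compress(payload) + deflater.flush()
    # ID1 ID2 CM FLG, MTIME (4 bytes), XFL, OS (255 = unknown)
    header = b"\x1f\x8b\x08" + bytes([FEXTRA]) + b"\x00" * 4 + b"\x00\xff"
    header += struct.pack("<H", len(extra)) + extra
    trailer = struct.pack("<II", zlib.crc32(payload) & 0xFFFFFFFF,
                          len(payload) & 0xFFFFFFFF)
    return header + body + trailer
```

A decompressor that does not understand the field can simply skip `XLEN` bytes and carry on, which matches the "ignored" message and the zero exit status mentioned above.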
19:14 🔗 alard Is this a new one? http://www.us.archive.org/catalog.php?history=1&identifier=webshots-freeze-frame-20121012103518
19:15 🔗 underscor Yes
19:15 🔗 underscor Only the json is up though
19:15 🔗 underscor the warc is still uploading
19:16 🔗 alard Ah. Was there a tar?
19:17 🔗 underscor 0 bytes
19:17 🔗 alard So it's exciting to see if this one passes the test.
19:17 🔗 underscor It passed gunzip -t too
19:17 🔗 SmileyG underscor: Ah ok ! I presumed it'd return some non-fatal error code
19:17 🔗 SmileyG but if its not showing it other than the stdout output.... no worries
19:19 🔗 underscor alard: Is the procedure to fix these to "create" the tar backwards, and repack, or will you be able to write a "fixme" thing? :)
19:19 🔗 alard I think it will be a fixme thing.
19:19 🔗 underscor rad
19:20 🔗 underscor another one finished!
19:20 🔗 underscor -rw-r--r-- 1 abuie users 50G Oct 12 19:18 webshots-20121012070358.megawarc.warc.gz
19:20 🔗 underscor -rw-r--r-- 1 abuie users 103K Oct 12 19:18 webshots-20121012070358.megawarc.json.gz
19:20 🔗 underscor -rw-r--r-- 1 abuie users 388M Oct 12 19:18 webshots-20121012070358.megawarc.tar
19:20 🔗 alard Hey, a tar.
19:20 🔗 SmileyG working and looking correct now?
19:20 🔗 alard That's both good and bad news.
19:20 🔗 alard Good for megawarc, bad for webshots.
19:21 🔗 SmileyG o_O
19:22 🔗 underscor We can extract out the "bad" users and requeue them, though, right?
19:23 🔗 SmileyG AH, the tars are failed users getting left over?
19:24 🔗 underscor SmileyG: Yeah
19:24 🔗 alard The invalid warcs end up in the tar.
19:24 🔗 underscor Well, faulty warc.g
19:24 🔗 underscor z
19:24 🔗 underscor mhm
19:25 🔗 alard We could make a list of the users that have made it to archive.org and compare that with the full list of users.
19:25 🔗 alard But for the moment we have enough new users.
19:26 🔗 underscor Lots of limestone networks hosts, wonder who that is
19:26 🔗 underscor They're pumping a lot of data :D
19:27 🔗 SmileyG Is it Sue?
19:27 🔗 SmileyG She was saying shes gonna hit her cap shortly in #webshots
19:29 🔗 underscor alard: http://archive.org/catalog.php?history=1&identifier=webshots-freeze-frame-20121012103518 Here we go!
19:31 🔗 chronomex wooooo
19:31 🔗 * chronomex parties
19:34 🔗 underscor ugh, moving 50GB takes so long
19:39 🔗 underscor http://www.us.archive.org/catalog.php?history=1&identifier=webshots-freeze-frame-20121012070358 is getting its replacement uploaded
19:42 🔗 underscor schweet
19:42 🔗 underscor Every box is now megawarcing
19:43 🔗 underscor Although these take quite a bit of time
19:43 🔗 underscor Wonder if I can keep up with the inflow
19:43 🔗 chronomex INTERFLOW
19:44 🔗 underscor http://ia600109.us.archive.org:8088/mrtg/networkv2.html http://ia601104.us.archive.org:8088/mrtg/networkv2.html http://ia700106.us.archive.org:8088/mrtg/networkv2.html
19:44 🔗 underscor Y'all have been keeping them pretty nice and busy
19:49 🔗 SmileyG - Downloaded 18400 URLs got another nice one :D
21:02 🔗 underscor http://www.us.archive.org/log_show.php?task_id=127779327 It's cdxing now!
21:02 🔗 underscor Cross fingers!!!! :D
21:12 🔗 alard The CDX indexer is already running twice as long as the previous time.
21:12 🔗 underscor 's a good sign :D
21:15 🔗 SketchCow hooray.
21:15 🔗 SmileyG \o/
21:15 🔗 * SmileyG waits for underscor to start groveling.
21:15 🔗 underscor yay, Jason's back
21:16 🔗 chronomex 25 seconds!
21:16 🔗 underscor I have like 2TB processing
21:16 🔗 underscor :P
21:16 🔗 underscor megawarcing takes a fair bit of time/work, though
21:16 🔗 underscor still can't quite tell if I'm filling faster than I'm dumping
21:17 🔗 SmileyG hope not :S
21:17 🔗 underscor Also, just the sheer (super awesome!) scale of moving 50gb bricks is... interesting
21:17 🔗 SmileyG :)
21:20 🔗 SketchCow [6~[6~[6~[6~[6~[6~[6~[6~
21:20 🔗 SmileyG what he said.
21:23 🔗 alard The voice recognition gets more and more interesting.
21:25 🔗 underscor hahahha
21:25 🔗 underscor It was better when it was sucking his nuts off or whatever
21:30 🔗 SketchCow so, status update please
21:31 🔗 SketchCow this android ssh client has no pgup
21:32 🔗 SketchCow also. comiccon is hell on earth.
21:34 🔗 underscor SketchCow: We (think) we patched the bugs
21:35 🔗 underscor Test derive is still running
21:35 🔗 underscor but it got further than any of them have
21:35 🔗 underscor so (probably) good
21:35 🔗 underscor I have like 2.5TB to ingest
21:35 🔗 underscor once we see how this goes
21:44 🔗 alard Looks good? http://www.us.archive.org/log_show.php?task_id=127779327
21:47 🔗 underscor alard: you are the freakin' man
21:47 🔗 underscor we need to set you up on gittip :D
21:48 🔗 SmileyG i am so jeli
21:49 🔗 underscor SketchCow: IT WORKED IT WORKED IT WORKED
21:50 🔗 alard This isn't actually that much better than before: there's no tar with invalid warcs. That already worked.
21:50 🔗 SmileyG http://archive.org/details/webshots-freeze-frame-20121012173401 the latest one lacks a warc?
21:50 🔗 underscor SmileyG: still uploading
21:51 🔗 underscor alard: oh
21:51 🔗 SmileyG o
21:51 🔗 SmileyG :D
21:51 🔗 SketchCow find one where the problem is fixed versus before.
21:51 🔗 SketchCow sopa is a good one
21:51 🔗 underscor alard: 3 finished, no tar
21:51 🔗 underscor :D
21:52 🔗 alard The two that crashed with the zlib.error problem should have tars.
21:52 🔗 underscor I just had to restart them
21:52 🔗 underscor yes, those haven't finished
21:52 🔗 alard I'm now testing my fix script on the sopa files.
21:52 🔗 alard (Takes a while.)
21:53 🔗 underscor Ah, this will be the one that lets us fix a bad megawarc.warc.gz
21:53 🔗 underscor ?
21:53 🔗 SketchCow underscor. ask hank when the last load in of the wayback happens, please.
21:54 🔗 alard Yes. It reads the megawarc, checks every warc.gz, sorts them into new warc/tar files, and saves the locations in a new json file.
21:54 🔗 alard It worked on my tiny test file, but 15GB takes a little longer.
21:54 🔗 SketchCow alard, call it megarepair and add it to the repository. :)
21:58 🔗 alard Too late, it's already called megawarc-fix. It's now in the repository.
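[Editor's note] The repair pass described above (read the megawarc, check every warc.gz, sort them into new warc/tar outputs, record the new locations) could look roughly like this. It is not the megawarc-fix code. The index keys `name`, `offset` and `size` are assumptions made for this sketch, rejected members are returned as a list rather than written to a tar, and everything is held in memory for brevity, which a tool handling 50 GB files would not do.

```python
import gzip
import io
import zlib


def repair(megawarc_bytes, entries):
    """Rebuild a concatenated .warc.gz from the members that still validate.

    entries: list of dicts with hypothetical keys "name", "offset", "size"
    giving each member's position in the old file.
    Returns (new_warc_bytes, new_index, rejected).
    """
    new_warc = io.BytesIO()
    new_index = []
    rejected = []
    for entry in entries:
        start = entry["offset"]
        raw = megawarc_bytes[start:start + entry["size"]]
        try:
            gzip.decompress(raw)  # full decode, including the CRC check
        except (OSError, EOFError, zlib.error):
            rejected.append((entry["name"], raw))  # would go to the tar
            continue
        new_index.append({"name": entry["name"],
                          "offset": new_warc.tell(),
                          "size": len(raw)})
        new_warc.write(raw)
    return new_warc.getvalue(), new_index, rejected
```

Because valid members are copied byte for byte, the rebuilt file is again a sequence of independent gzip members, and the new index points at each of them.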
22:03 🔗 underscor SketchCow: asking
22:04 🔗 underscor alard: pulling :D
22:04 🔗 underscor you're amazing
22:04 🔗 alard Might need some testing first, though.
22:11 🔗 alard Hmm. Apparently not every tar header is exactly 512 bytes long.
22:12 🔗 alard There's a 'gnu tar' type that has headers of 1024, 1536 etc bytes long, if there are long filenames.
22:12 🔗 alard As there are in the SOPA file.
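[Editor's note] The header-length observation above can be reproduced with Python's standard `tarfile` module. In GNU format, a name longer than 100 characters is stored in an extra `././@LongLink` header plus data blocks ahead of the regular header, so the distance from the start of an entry to its file data is no longer a fixed 512 bytes. `header_lengths` is a hypothetical helper for this note, and the file names in the usage line are made up.

```python
import io
import tarfile


def header_lengths(names, payload=b"x"):
    """Write a GNU-format tar in memory and measure each entry's header.

    Returns {name: number of bytes between the start of the entry
    (including any long-name blocks) and the start of its file data}.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w",
                      format=tarfile.GNU_FORMAT) as tar:
        for name in names:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        # For long names, tarfile reports the offset of the LongLink header.
        return {m.name: m.offset_data - m.offset for m in tar}
```

Calling `header_lengths(["short.warc.gz", "dir/" + "x" * 150])` shows the short name with a plain 512-byte header and the long one with a larger multiple of 512, which is why a fixer that assumes 512 bytes per header misplaces the data.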
22:35 🔗 underscor -rw-r--r-- 1 abuie users 50G Oct 12 21:37 webshots-20121012183139.megawarc.warc.gz
22:35 🔗 underscor -rw-r--r-- 1 abuie users 73K Oct 12 21:37 webshots-20121012183139.megawarc.json.gz
22:35 🔗 underscor -rw-r--r-- 1 abuie users 639M Oct 12 21:37 webshots-20121012183139.megawarc.tar
22:35 🔗 underscor alard: One of the ones with tars finished, fyi
22:35 🔗 underscor Uploading now
23:20 🔗 alard So, which date is *only* available in a faulty megawarc?
23:20 🔗 alard *data
23:20 🔗 alard The SOPA megawarc is really hard to fix, since it has these ridiculously long file names.
23:23 🔗 alard So I'd like to suggest that we 1. make new megawarcs from scratch, and test with gunzip -tv / tar -tv / megawarc restore, if we have the original data; and 2. use megawarc-fix to fix the megawarcs that we don't have in another form, such as webshots.
23:23 🔗 underscor alard: I'm currently running the fixer on 20121012070021
23:23 🔗 alard webshots doesn't have long filenames, so the fixer should work for those files.
23:27 🔗 alard There are no long filenames in 20121012070021, so that should work, I hope: curl -s -L http://archive.org/download/webshots-freeze-frame-20121012070021/webshots-20121012070021.megawarc.json.gz | gunzip | grep LongLink
23:28 🔗 alard CoH can also be fixed: curl -s -L http://archive.org/download/archiveteam-city-of-heroes-forums-megawarc-1/BOARDS-COH-01.tar.megawarc.json.gz | gunzip | grep LongLink
23:29 🔗 alard But SOPA can not: curl -s -L http://archive.org/download/archiveteam-sopa-blackout/2012-sopa-day-collection.megawarc.json.gz | gunzip | grep LongLink
23:35 🔗 underscor I'll defer to SketchCow before fixing CoH, but I assume he'll want it to be
23:56 🔗 tef_ how are they corrupt ?
