#archiveteam 2012-10-27,Sat


Time Nickname Message
01:45 SketchCow FortuneCities is now 100% into the format for the Wayback machine.
01:46 SketchCow No idea if they've done the final sweep yet ; the power outage definitely set some projects and activities back.
01:46 SketchCow But this basically leaves PicPlz.
01:47 SketchCow I've got that project going now in two windows (one ingesting, one uploading)
01:47 SketchCow Will definitely take a day or two, since it's 3.5tb of pictures.
01:51 SketchCow For MobileMe, we're going to have to do flat-out pull-downs, conversion, and REPLACEMENT, so I want to hold off on that for a bit.
01:52 SketchCow There's just no other way - too much data.
01:54 chronomex yeah, we oughtn't double it unnecessarily
01:54 SketchCow So first I want to see all this data we just did make the full journey into Wayback and completely live.
01:55 SketchCow When it does, I want to then start dialing down/removing the doubled Fortunecity and other large collections, like Picplz, to not be doubles.
02:02 SketchCow On the good side, I'm getting between 15-21mb a second off of the pipe.
02:02 SketchCow So I think whatever was going on before the power outage is now in good shape.
02:03 Patt SketchCow, so it really went out?
02:03 chronomex no, it was staged, just like the moon landing
02:03 chronomex of course the power went out
02:04 SketchCow Richmond district of SF had a power outage.
02:04 SketchCow Took Internet Archive with it.
02:04 SketchCow Exciting.
02:04 SketchCow That's a lot of stuff out.
02:04 Patt sorry
02:04 SketchCow Stuff came back slowly, but it did come back.
02:04 SketchCow RIGHT during their big party to celebrate 10 petabytes of web historical data in the Wayback machine.
02:05 SketchCow http://www.flickr.com/photos/mlinksva/8126312466/
02:05 SketchCow So as you can see, they got out emergency lights and power, put the laptop on it, and just kept going.
02:06 chronomex sucko
02:06 chronomex funny tho
02:06 chronomex was the livestream interrupted?
02:06 SketchCow Well, this was the party - it had no livestream.
02:07 chronomex ah
02:07 SketchCow The whole Books in Browsers went fine - they lost power at, like, 8pm.
02:09 shaqfu Is H-Net at any sort of risk?
02:09 SketchCow What's H-Net in this context?
02:09 chronomex hurricane electric?
02:10 shaqfu The humanities mailing list
02:10 SketchCow I mean, never trust any mailing list is my rule.
02:10 SketchCow And it's text and trivial to grab.
02:10 shaqfu Did a very quick survey and saw a lot of defunct lists, but it still seems to be in some use
02:10 SketchCow I assume you don't mean http://dhhumanist.org/text.html
02:11 chronomex I'm on several mailing lists more or less solely so I have my own archive of it
02:11 chronomex for example, it's why I will never unsubscribe from a yahoo list
02:11 shaqfu No, http://www.h-net.org/lists/
02:11 shaqfu DHHumanist is what made me look at it
02:16 shaqfu And yeah, it's trivial, but no need if it's still under active watch
02:27 godane hey shaqfu
02:27 shaqfu Yo
02:28 godane i'm grabbing another magazine
02:28 godane called ce lifestyles
02:40 SketchCow https://docs.google.com/a/textfiles.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDZIUEVjR3d1UmRoU0lPSWZYX0Q1Ync#gid=0
02:40 SketchCow Lot of great stuff - thanks, team.
02:57 flaushy alard: 2 instances running with different ips
03:23 SketchCow Could someone please WARC http://www.wikipediareview.com/ up?
03:29 flaushy is all the info how to do it on the wiki? then i might try :)
03:29 SketchCow SHOULD be. If not, let me know
03:30 flaushy then i'll try. Thx
03:35 flaushy SketchCow: src/wget "http://www.archiveteam.org/" --mirror --warc-file="at" <- is the command to use, right? anything else i need to watch out for?
03:35 flaushy via http://archiveteam.org/index.php?title=Wget_with_WARC_output
03:38 tef flaushy: with a different starting url, of course :)
03:38 flaushy tef yeah :)
03:52 SketchCow http://projects.metafilter.com/3766/Just-Solve-the-Problem-Month-Solve-File-Formats
07:09 DFJustin looks like we have an admirer https://archive.org/details/virtualitera.freeweb7.com
07:14 joepie91 yay, wrote a javascript unpacker
07:14 joepie91 ... in python
09:24 alard flaushy: Thanks. The ask-crawl produced 100 new usernames overnight.
13:32 flaushy alard: i think i got blocked at ask
13:32 flaushy Your client does not have permission to access this site.
13:54 flaushy I'm getting HTTP 400s on wikipediareview after a while, and i feel like it is too small to be a complete rip. Any suggestions on wget parameters to "crawl gently" and avoid blocks? And any suggestions on "checking" complete rips?
14:20 alard flaushy: You can use gunzip -c *.warc.gz | grep Target-URI to get a list of the URLs in the warc file.
14:21 alard wget has options to have a delay between requests (there are multiple, look in wget --help).
14:22 alard If you want to download the images, you probably need --page-requisites (I think that's not included in --mirror).
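A rough Python equivalent of the gunzip-and-grep check alard suggests, for anyone who prefers not to shell out. This is only a sketch: the function name `list_target_uris` is made up for illustration, and it scans every line of the decompressed file, so a payload line that happens to start with `WARC-Target-URI:` would be counted too (the same limitation grep has).

```python
import gzip
from pathlib import Path


def list_target_uris(warc_path):
    """Return the WARC-Target-URI values found in a .warc.gz file, in order."""
    uris = []
    # gzip.open reads through concatenated gzip members, which is how
    # .warc.gz files are commonly laid out (one member per record).
    with gzip.open(warc_path, "rb") as stream:
        for raw_line in stream:
            if raw_line.lower().startswith(b"warc-target-uri:"):
                value = raw_line.split(b":", 1)[1].strip()
                # Strip angle brackets in case the writer wrapped the URI in them.
                uris.append(value.decode("utf-8", "replace").strip("<>"))
    return uris


if __name__ == "__main__":
    # Print every captured URL from the WARC files in the current directory.
    for path in sorted(Path(".").glob("*.warc.gz")):
        for uri in list_target_uris(path):
            print(uri)
```

Comparing the printed list against the pages you expect (for example, how many forum topics show up) gives a crude sense of whether the rip is complete.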
14:23 flaushy ah cool
14:23 flaushy humanizing as well?
14:23 flaushy what delays would be good? (i once mirrored a wiki with 5 sec average, painful...)
14:25 flaushy alard: do we want to keep crawling at ask?
14:25 alard What do you mean with "humanizing"? I don't know about delays, it really depends on the purpose.
14:26 alard Well, I'm not sure if it's worth it. It is producing some usernames, very slowly, and they block reasonably quick.
14:26 flaushy you have a random backoff and it is on average Xsecs
14:26 flaushy eg so you keep a window of 0 - 10 secs
14:27 flaushy ah --random-wait was its name in wget :)
14:27 alard Yes, that's it.
14:27 alard You might want to set the --user-agent to something other than Wget.
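Pulling the suggestions so far together, a gentler crawl might look something like the sketch below. It only assembles the command line; nothing is fetched. The helper names, the 5-second default, and the user-agent string are placeholders chosen for illustration, and it calls plain `wget` rather than the locally built `src/wget` mentioned earlier. The wget manual describes `--random-wait` as varying each delay between 0.5 and 1.5 times the `--wait` value, which `random_wait_window` reproduces.

```python
import shlex


def build_wget_command(start_url, warc_name, wait_seconds=5,
                       user_agent="Mozilla/5.0 (compatible; example-archiver)"):
    """Assemble a wget argument list for a slower, WARC-producing crawl."""
    return [
        "wget",
        start_url,
        "--mirror",
        "--page-requisites",             # also fetch images, stylesheets, etc.
        "--warc-file=" + warc_name,      # wget adds the .warc.gz suffix itself
        "--wait=" + str(wait_seconds),   # base delay between requests
        "--random-wait",                 # vary each delay around --wait
        "--user-agent=" + user_agent,    # placeholder value, pick your own
    ]


def random_wait_window(wait_seconds):
    """Delay range documented for --random-wait: 0.5x to 1.5x of --wait."""
    return (0.5 * wait_seconds, 1.5 * wait_seconds)


if __name__ == "__main__":
    cmd = build_wget_command("http://www.wikipediareview.com/", "wikipediareview")
    print(shlex.join(cmd))
    print(random_wait_window(5))
```

With a base wait of 5 seconds the delays fall between 2.5 and 7.5 seconds, so the window is narrower than the 0 to 10 seconds described above while keeping the same average.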
14:28 flaushy okie thx
14:28 flaushy btw did yacy crawl over bt?
14:28 flaushy maybe we could get data out of their index
14:31 alard Yacy, never heard of before. Does it contain any data? (I just used the demo to search for "archive team", but that didn't produce results.)
14:32 flaushy it is a p2p search engine attempt
14:32 flaushy it didn't scale a couple of years ago, lost interest in it after a while
14:34 alard http://search.yacy.net/HostBrowser.html?path=www.btinternet.com&list=Browse+Host
14:34 flaushy okie screw that ;(
14:35 alard I now see that http://wikipediareview.com/ is a forum. They're hard to archive.
14:39 SketchCow Agreed
14:40 SketchCow There's not a RUSH on this. They're just on the skids
14:40 SketchCow They've been up and down over the past couple years. Unpaid bills, etc.
14:43 alard Well, with 'hard to archive' I only meant that it probably needs something more structured than just wget --mirror.
14:43 alard A Lua script could be a solution to do a structured download. What type of forum software is it?
14:44 alard It may be time for a new script in the forum download library.
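To illustrate what "more structured" could mean, here is one possible shape in Python, not the Lua approach alard has in mind: enumerate the topic pages explicitly instead of letting a recursive mirror wander through every sort order and view mode. Everything specific here is an assumption for illustration only: the `showtopic` and `st` query parameters, the page size of 20, and the `posts_per_topic` mapping would all need checking against the real forum.

```python
def forum_page_urls(base_url, topic_ids, posts_per_topic, page_size=20):
    """Yield one URL per page of each topic.

    The query parameters (showtopic, st) and the page size are assumed
    for illustration; check the forum's own links before relying on them.
    """
    for topic_id in topic_ids:
        # Treat topics with unknown post counts as a single page.
        total = max(posts_per_topic.get(topic_id, 1), 1)
        for offset in range(0, total, page_size):
            url = "{}index.php?showtopic={}".format(base_url, topic_id)
            if offset:
                url += "&st={}".format(offset)
            yield url


if __name__ == "__main__":
    # Hypothetical topic ids and post counts.
    for url in forum_page_urls("http://example.com/", [1, 2], {1: 45, 2: 5}):
        print(url)
```

A list like this could be written to a file and handed to wget with `--input-file`, so the crawl visits each page exactly once.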
14:49 alard Which reminds me: should we do a second run of boards.cityofheroes.com?
14:53 SketchCow I am not against it.
14:53 SketchCow I think people are probably going pretty nuts towards the end as this idiocy goes down.
14:54 alard Oh, wait, it's less urgent than I thought: "The City of Heroes® servers will shut off on November 30, 2012" http://na.cityofheroes.com/en/news/news_archive/city_of_heroes_sunset_faq.php
14:57 SketchCow Well, that's still urgent. And I mean obviously we wait to closer to end, like the 20th or later.
14:57 SketchCow Set an alarm!
14:58 SketchCow Someone is working on a file format of motion picture film.
14:58 SketchCow Agreed, they encode audio and visual data right into the film.
15:39 closure ha! http://source.git-annex.branchable.com/?p=source.git;a=commitdiff;h=b72d04988f767dd4b8dab3e1267c03b7f80d4c2c;hp=6633a5158d4d3a6f0bdf9fa5c2c8725e47b051cc
15:59 underscor http://monitor.us.archive.org/weathermap/weathermap.html
16:11 alard "Paul" is very important.
16:18 alard So here's a script to download the Invision Powerboard forums on wikipediareview.com:
16:18 alard https://github.com/ArchiveTeam/wikipediareview-grab/blob/master/invpowerboard.lua
16:22 alard It could be a small warrior project. (I think it might be too big for one single wget run.)
16:38 flaushy alard: thx. i will put it on my nas later on :)
20:16 ivan` is all of ftp.scene.org backed up?
20:31 DFJustin it has multiple mirrors right
20:51 godane i think i found something interesting
20:51 godane a magazine called hebdogiciel
20:52 godane it's a french magazine from 1983 to 1987
20:52 godane collection item is not viewable even though i can see the magazines just fine
20:53 godane anyways archive.org only has 13 issues
20:55 godane i have found the rest of them
22:17 DFJustin godane: the french site that has all that stuff got really butthurt when jason put it up so that's why it all went dark
22:48 godane DFJustin: Only 13 issues are visible
22:49 godane I don't think jason uploaded the full set
23:11 SketchCow Back.
23:11 SketchCow Had to visit underscor
23:11 SketchCow Yes.
23:12 SketchCow All the french magazines are dark unless we digitize them ourselves.
23:13 SketchCow They got butthurt like the Al Qaeda gets butthurt
23:13 balrog- :(
23:14 balrog- yeah I was digging around about early PC programming books ... quite a bit of that on IA, but dark :/
23:15 SketchCow Anyway, you're all missing the important point
23:15 SketchCow John Romero got married.
23:15 SketchCow Off the market
23:15 SketchCow Now, this is a major blow to the group but I think we can recover
23:15 SketchCow If we stick together
23:16 SketchCow "Sandy could potentially be an unprecedented threat to the masses in its path, a massive storm that hasn't been rivaled in generations."
23:16 SketchCow Whew, way to couch it carefully
23:36 godane SketchCow: So you can only undark them if archive.org digitizes them?
23:38 godane SketchCow: that sort of would defeat the point of darking it then
23:40 SketchCow No.
23:40 SketchCow I can undark them if SOMEONE I KNOW digitizes them.
23:40 SketchCow This is a specific situation, to those magazines.
23:41 godane thats just weird
23:42 godane anyway there is no seed to torrent collection right now
23:42 godane no point in downloading it
