Time |
Nickname |
Message |
01:45
|
SketchCow |
FortuneCity is now 100% into the format for the Wayback Machine. |
01:46
|
SketchCow |
No idea if they've done the final sweep yet; the power outage definitely set some projects and activities back. |
01:46
|
SketchCow |
But this basically leaves PicPlz. |
01:47
|
SketchCow |
I've got that project going now in two windows (one ingesting, one uploading) |
01:47
|
SketchCow |
Will definitely take a day or two, since it's 3.5 TB of pictures. |
01:51
|
SketchCow |
For MobileMe, we're going to have to do flat-out pull-downs, conversion, and REPLACEMENT, so I want to hold off on that for a bit. |
01:52
|
SketchCow |
There's just no other way - too much data. |
01:54
|
chronomex |
yeah, we oughtn't double it unnecessarily |
01:54
|
SketchCow |
So first I want to see all this data we just did make the full journey into Wayback and completely live. |
01:55
|
SketchCow |
When it does, I want to then start dialing down/removing the doubled FortuneCity and other large collections, like PicPlz, to not be doubles. |
02:02
|
SketchCow |
On the good side, I'm getting between 15-21mb a second off of the pipe. |
02:02
|
SketchCow |
So I think whatever was going on before the power outage is now in good shape. |
02:03
|
Patt |
SketchCow, so it really went out? |
02:03
|
chronomex |
no, it was staged, just like the moon landing |
02:03
|
chronomex |
of course the power went out |
02:04
|
SketchCow |
Richmond district of SF had a power outage. |
02:04
|
SketchCow |
Took Internet Archive with it. |
02:04
|
SketchCow |
Exciting. |
02:04
|
SketchCow |
That's a lot of stuff out. |
02:04
|
Patt |
sorry |
02:04
|
SketchCow |
Stuff came back slowly, but it did come back. |
02:04
|
SketchCow |
RIGHT during their big party to celebrate 10 petabytes of web historical data in the Wayback Machine. |
02:05
|
SketchCow |
http://www.flickr.com/photos/mlinksva/8126312466/ |
02:05
|
SketchCow |
So as you can see, they got out emergency lights and power, put the laptop on it, and just kept going. |
02:06
|
chronomex |
sucko |
02:06
|
chronomex |
funny tho |
02:06
|
chronomex |
was the livestream interrupted? |
02:06
|
SketchCow |
Well, this was the party - it had no livestream. |
02:07
|
chronomex |
ah |
02:07
|
SketchCow |
The whole Books in Browsers went fine - they lost power at, like, 8pm. |
02:09
|
shaqfu |
Is H-Net at any sort of risk? |
02:09
|
SketchCow |
What's H-Net in this context? |
02:09
|
chronomex |
hurricane electric? |
02:10
|
shaqfu |
The humanities mailing list |
02:10
|
SketchCow |
I mean, never trust any mailing list is my rule. |
02:10
|
SketchCow |
And it's text and trivial to grab. |
02:10
|
shaqfu |
Did a very quick survey and saw a lot of defunct lists, but it still seems to be in some use |
02:10
|
SketchCow |
I assume you don't mean http://dhhumanist.org/text.html |
02:11
|
chronomex |
I'm on several mailing lists more or less solely so I have my own archive of it |
02:11
|
chronomex |
for example, it's why I will never unsubscribe from a yahoo list |
02:11
|
shaqfu |
No, http://www.h-net.org/lists/ |
02:11
|
shaqfu |
DHHumanist is what made me look at it |
02:16
|
shaqfu |
And yeah, it's trivial, but no need if it's still under active watch |
02:27
|
godane |
hey shaqfu |
02:27
|
shaqfu |
Yo |
02:28
|
godane |
i'm grabbing another magazine |
02:28
|
godane |
called ce lifestyles |
02:40
|
SketchCow |
https://docs.google.com/a/textfiles.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDZIUEVjR3d1UmRoU0lPSWZYX0Q1Ync#gid=0 |
02:40
|
SketchCow |
Lot of great stuff - thanks, team. |
02:57
|
flaushy |
alard: 2 instances running with different ips |
03:23
|
SketchCow |
Could someone please WARC http://www.wikipediareview.com/ up? |
03:29
|
flaushy |
is all the info how to do it on the wiki? then i might try :) |
03:29
|
SketchCow |
SHOULD be. If not, let me know |
03:30
|
flaushy |
then i'll try. Thx |
03:35
|
flaushy |
SketchCow: src/wget "http://www.archiveteam.org/" --mirror --warc-file="at" <- is the command to use, right? anything else i need to watch out for? |
03:35
|
flaushy |
via http://archiveteam.org/index.php?title=Wget_with_WARC_output |
03:38
|
tef |
flaushy: with a different starting url, of course :) |
03:38
|
flaushy |
tef yeah :) |
03:52
|
SketchCow |
http://projects.metafilter.com/3766/Just-Solve-the-Problem-Month-Solve-File-Formats |
07:09
|
DFJustin |
looks like we have an admirer https://archive.org/details/virtualitera.freeweb7.com |
07:14
|
joepie91 |
yay, wrote a javascript unpacker |
07:14
|
joepie91 |
... in python |
09:24
|
alard |
flaushy: Thanks. The ask-crawl produced 100 new usernames overnight. |
13:32
|
flaushy |
alard: i think i got blocked at ask |
13:32
|
flaushy |
Your client does not have permission to access this site. |
13:54
|
flaushy |
I'm getting HTTP 400s on wikipediareview after a while, and i feel like it is too small to be a complete rip. Any suggestions on wget parameters to "crawl gently" and avoid blocks? And any suggestions on "checking" complete rips? |
14:20
|
alard |
flaushy: You can use gunzip -c *.warc.gz | grep Target-URI to get a list of the URLs in the warc file. |
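For anyone who would rather do the same check from a script, here is a minimal Python sketch of alard's suggestion. It assumes the records carry a `WARC-Target-URI:` header line, and the function name `list_target_uris` is made up for illustration. Like the grep approach, it would also match a payload line that happens to start with that header name.

```python
import gzip
from pathlib import Path


def list_target_uris(warc_path):
    """Return the WARC-Target-URI values found in one .warc.gz file, in order."""
    uris = []
    # gzip.open reads through concatenated gzip members, which is how
    # .warc.gz files are commonly written (one member per record).
    with gzip.open(warc_path, "rb") as stream:
        for raw_line in stream:
            if raw_line.startswith(b"WARC-Target-URI:"):
                value = raw_line.split(b":", 1)[1].strip()
                # Some writers wrap the URI in angle brackets; drop them.
                uris.append(value.decode("utf-8", "replace").strip("<>"))
    return uris


if __name__ == "__main__":
    # Print every URL recorded in the WARC files in the current directory.
    for path in sorted(Path(".").glob("*.warc.gz")):
        for uri in list_target_uris(path):
            print(uri)
```

Comparing this list against the pages you expected to see is one rough way to judge whether a rip is complete.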
14:21
|
alard |
wget has options to have a delay between requests (there are multiple, look in wget --help). |
14:22
|
alard |
If you want to download the images, you probably need --page-requisites (I think that's not included in --mirror). |
14:23
|
flaushy |
ah cool |
14:23
|
flaushy |
humanizing as well? |
14:23
|
flaushy |
what delays would be good? (i once mirrored a wiki with 5 sec average, painful...) |
14:25
|
flaushy |
alard: do we want to keep crawling at ask? |
14:25
|
alard |
What do you mean by "humanizing"? I don't know about delays, it really depends on the purpose. |
14:26
|
alard |
Well, I'm not sure if it's worth it. It is producing some usernames, very slowly, and they block reasonably quickly. |
14:26
|
flaushy |
you have a random backoff and it is on average Xsecs |
14:26
|
flaushy |
eg so you keep a window of 0 - 10 secs |
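The "humanizing" idea flaushy describes, a random pause whose average is X seconds, can be sketched roughly as below. This is only an illustration: the helper name `jittered_delays` and the uniform distribution are assumptions, not something the log specifies.

```python
import random


def jittered_delays(average_seconds, count, rng=None):
    """Return `count` random pauses drawn uniformly from 0 to 2 * average.

    With average_seconds=5 this gives the "window of 0 - 10 secs"
    mentioned above, averaging about 5 seconds per request.
    """
    rng = rng or random.Random()
    return [rng.uniform(0, 2 * average_seconds) for _ in range(count)]


if __name__ == "__main__":
    for pause in jittered_delays(5, 3):
        print(f"sleep {pause:.1f}s before the next request")
```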
14:27
|
flaushy |
ah --random-wait was its name in wget :) |
14:27
|
alard |
Yes, that's it. |
14:27
|
alard |
You might want to set the --user-agent to something other than Wget. |
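Pulling the suggestions so far into one place, a gentler crawl might be assembled like the sketch below. It only builds the argument list and does not run anything. The flags are standard wget options, but the 5-second wait, the user-agent string and the helper name `gentle_wget_command` are placeholders, and it assumes a wget build with WARC support. As far as I understand the wget manual, `--random-wait` varies each pause between 0.5 and 1.5 times the `--wait` value.

```python
import shlex


def gentle_wget_command(start_url, warc_name, wait_seconds=5,
                        user_agent="ExampleArchiver/0.1", wget_path="wget"):
    """Build a wget argument list for a slower, WARC-writing mirror run."""
    return [
        wget_path,
        start_url,
        "--mirror",
        "--page-requisites",             # also fetch images, CSS, etc.
        f"--wait={wait_seconds}",        # base delay between requests
        "--random-wait",                 # randomize the delay around --wait
        f"--user-agent={user_agent}",    # placeholder, pick your own
        f"--warc-file={warc_name}",      # output name prefix for the WARC
    ]


if __name__ == "__main__":
    cmd = gentle_wget_command("http://www.wikipediareview.com/", "wikipediareview")
    # Print a copy-pasteable shell command instead of executing it.
    print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to hand to `subprocess.run` later without worrying about shell quoting.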
14:28
|
flaushy |
okie thx |
14:28
|
flaushy |
btw did yacy crawl over bt? |
14:28
|
flaushy |
maybe we could get data out of their index |
14:31
|
alard |
Yacy, never heard of before. Does it contain any data? (I just used the demo to search for "archive team", but that didn't produce results.) |
14:32
|
flaushy |
it is a p2p search engine attempt |
14:32
|
flaushy |
it didn't scale a couple of years ago, lost interest in it after a while |
14:34
|
alard |
http://search.yacy.net/HostBrowser.html?path=www.btinternet.com&list=Browse+Host |
14:34
|
flaushy |
okie screw that ;( |
14:35
|
alard |
I now see that http://wikipediareview.com/ is a forum. They're hard to archive. |
14:39
|
SketchCow |
Agreed |
14:40
|
SketchCow |
There's not a RUSH on this. They're just on the skids |
14:40
|
SketchCow |
They've been up and down over the past couple years. Unpaid bills, etc. |
14:43
|
alard |
Well, with 'hard to archive' I only meant that it probably needs something more structured than just wget --mirror. |
14:43
|
alard |
A Lua script could be a solution to do a structured download. What type of forum software is it? |
14:44
|
alard |
It may be time for a new script in the forum download library. |
14:49
|
alard |
Which reminds me: should we do a second run of boards.cityofheroes.com? |
14:53
|
SketchCow |
I am not against it. |
14:53
|
SketchCow |
I think people are probably going pretty nuts towards the end as this idiocy goes down. |
14:54
|
alard |
Oh, wait, it's less urgent than I thought: "The City of Heroes® servers will shut off on November 30, 2012" http://na.cityofheroes.com/en/news/news_archive/city_of_heroes_sunset_faq.php |
14:57
|
SketchCow |
Well, that's still urgent. And I mean obviously we wait until closer to the end, like the 20th or later. |
14:57
|
SketchCow |
Set an alarm! |
14:58
|
SketchCow |
Someone is working on a file format of motion picture film. |
14:58
|
SketchCow |
Agreed, they encode audio and visual data right into the film. |
15:39
|
closure |
ha! http://source.git-annex.branchable.com/?p=source.git;a=commitdiff;h=b72d04988f767dd4b8dab3e1267c03b7f80d4c2c;hp=6633a5158d4d3a6f0bdf9fa5c2c8725e47b051cc |
15:59
|
underscor |
http://monitor.us.archive.org/weathermap/weathermap.html |
16:11
|
alard |
"Paul" is very important. |
16:18
|
alard |
So here's a script to download the Invision Powerboard forums on wikipediareview.com: |
16:18
|
alard |
https://github.com/ArchiveTeam/wikipediareview-grab/blob/master/invpowerboard.lua |
16:22
|
alard |
It could be a small warrior project. (I think it might be too big for one single wget run.) |
16:38
|
flaushy |
alard: thx. i will put it on my nas later on :) |
20:16
|
ivan` |
is all of ftp.scene.org backed up? |
20:31
|
DFJustin |
it has multiple mirrors right |
20:51
|
godane |
i think i found something interesting |
20:51
|
godane |
a magazine called hebdogiciel |
20:52
|
godane |
it's a French magazine from 1983 to 1987 |
20:52
|
godane |
collection item is not viewable even though i can see the magazines just fine |
20:53
|
godane |
anyways archive.org only has 13 issues |
20:55
|
godane |
i have found the rest of them |
22:17
|
DFJustin |
godane: the French site that has all that stuff got really butthurt when jason put it up so that's why it all went dark |
22:48
|
godane |
DFJustin: Only 13 issues are visible |
22:49
|
godane |
I don't think jason uploaded the full set |
23:11
|
SketchCow |
Back. |
23:11
|
SketchCow |
Had to visit underscor |
23:11
|
SketchCow |
Yes. |
23:12
|
SketchCow |
All the French magazines are dark unless we digitize them ourselves. |
23:13
|
SketchCow |
They got butthurt like the Al Qaeda gets butthurt |
23:13
|
balrog- |
:( |
23:14
|
balrog- |
yeah I was digging around about early PC programming books ... quite a bit of that on IA, but dark :/ |
23:15
|
SketchCow |
Anyway, you're all missing the important point |
23:15
|
SketchCow |
John Romero got married. |
23:15
|
SketchCow |
Off the market |
23:15
|
SketchCow |
Now, this is a major blow to the group but I think we can recover |
23:15
|
SketchCow |
If we stick together |
23:16
|
SketchCow |
"Sandy could potentially be an unprecedented threat to the masses in its path, a massive storm that hasn't been rivaled in generations." |
23:16
|
SketchCow |
Whew, way to couch it carefully |
23:36
|
godane |
SketchCow: So you can only undark them if archive.org digitizes them? |
23:38
|
godane |
SketchCow: that sort of would defeat the point of darking it then |
23:40
|
SketchCow |
No. |
23:40
|
SketchCow |
I can undark them if SOMEONE I KNOW digitizes them. |
23:40
|
SketchCow |
This is a specific situation, to those magazines. |
23:41
|
godane |
that's just weird |
23:42
|
godane |
anyway there is no seed to torrent collection right now |
23:42
|
godane |
no point in downloading it |