Time |
Nickname |
Message |
01:20
🔗
|
human39 |
neat, I found an undeveloped roll of film |
01:20
🔗
|
human39 |
(in this box, not mine) |
01:20
🔗
|
Coderjoe |
uh |
01:21
🔗
|
Coderjoe |
I hope it was not exposed to light or anything |
01:21
🔗
|
human39 |
well, it's mine. |
01:21
🔗
|
human39 |
now |
01:21
🔗
|
human39 |
na, it's been in the container |
01:21
🔗
|
Coderjoe |
(aside from actually taking the pictures) |
01:21
🔗
|
Coderjoe |
oh. you said roll not reel |
01:22
🔗
|
Coderjoe |
easier to tell with rolls |
01:22
🔗
|
human39 |
yeah |
01:22
🔗
|
human39 |
I wonder if it's worth getting developed. Hope this guy wasn't into weird stuff. |
01:27
🔗
|
underscor |
Yeah, EFNet |
01:27
🔗
|
underscor |
Fuck you too |
01:27
🔗
|
underscor |
alard: Absolutely |
01:30
🔗
|
Coderjoe |
mmm |
01:30
🔗
|
Coderjoe |
DCI... talking around 1.5TB for a single 100-minute movie, with only one 8-channel soundtrack (at 96k) |
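(Rough arithmetic behind that figure, assuming an uncompressed 2K DCDM rather than a finished JPEG 2000 DCP: 2048 x 1080 px x 4.5 bytes/px for 12-bit X'Y'Z' is about 10 MB per frame; at 24 fps that is roughly 239 MB/s, or about 1.43 TB of picture for 100 minutes, plus around 14 GB for eight channels of 24-bit/96 kHz audio.)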
02:43
🔗
|
Coderjoe |
lachlan mirror still chugging. at 3.7G |
02:46
🔗
|
chronomex |
that's quite the website |
03:24
🔗
|
SketchCow |
Back |
03:42
🔗
|
underscor |
wb |
03:42
🔗
|
underscor |
:> |
05:24
🔗
|
chronomex |
today my work as an archivist involves simulating a tape read circuit to decode bits off a data tape image recorded with audio gear |
05:24
🔗
|
chronomex |
just in case you guys thought I was slacking :) |
05:26
🔗
|
balrog |
ooh, wow. what's this for? |
05:29
🔗
|
chronomex |
http://xrtc.net/f/phreak/3ess.shtml <-- this machine, a 1973 computer welded to a telephone switch, has bad tape carts. |
05:29
🔗
|
chronomex |
solution: replace tape drive with something solid-state |
05:29
🔗
|
chronomex |
tape drive is in center above teletype, the thing with the round sticker on |
05:30
🔗
|
chronomex |
have to replace tape drive to run diagnostics |
05:31
🔗
|
chronomex |
have to run diagnostics to figure out what's wrong with the offline processor |
05:31
🔗
|
chronomex |
have to fix the offline processor to run code on the machine safely |
05:31
🔗
|
chronomex |
have to run code on the machine to do a backup |
05:31
🔗
|
chronomex |
have to do a backup before rebooting |
05:31
🔗
|
chronomex |
have to reboot because that will probably clear some stuck trouble that's been plaguing it since 1998 at least |
05:32
🔗
|
chronomex |
yeah ... it was last booted in 1992 |
05:33
🔗
|
chronomex |
that view is the operator console side; the machine is two of those lineups - the second is the switching network and stuff |
05:35
🔗
|
chronomex |
I want to strangle the fucker that decided that 1/4" tape cartridges are better than open-reel tape |
05:36
🔗
|
chronomex |
STRANGLE you hear me |
05:52
🔗
|
SketchCow |
Yeah |
05:52
🔗
|
SketchCow |
batcave went south, can't get anyone to reset. |
05:53
🔗
|
SketchCow |
So heartbroken, I know |
05:53
🔗
|
chronomex |
D: |
06:19
🔗
|
SketchCow |
http://www.freshdv.com/wp-content/uploads/2011/10/hurlbut-letus-41.jpg |
06:19
🔗
|
SketchCow |
What a way to jizz up a perfectly fine DSLR |
06:20
🔗
|
chronomex |
wow that's a lot of shit to bolt onto a dslr |
06:21
🔗
|
bbot_ |
wow |
06:21
🔗
|
bbot_ |
I count... four different handles? |
06:55
🔗
|
SketchCow |
http://www.archive.org/search.php?query=collection%3Aarchiveteam-yahoovideo&sort=-publicdate |
06:55
🔗
|
SketchCow |
Back in business. |
06:55
🔗
|
chronomex |
speaking of video: http://ia700209.us.archive.org/6/items/dicksonfilmtwo/DicksonFilm_High_512kb.mp4 |
06:55
🔗
|
chronomex |
cool shit |
07:02
🔗
|
SketchCow |
Yeah, going to let those go |
07:02
🔗
|
SketchCow |
And get some rest, then back up |
07:02
🔗
|
SketchCow |
There's so much stuff uploading now, the machine's finally emptying out |
07:07
🔗
|
SketchCow |
Oh, and I found the artist for the archiveteam t-shirt and poster |
07:10
🔗
|
chronomex |
oh? |
07:34
🔗
|
Ymgve |
Dicks On Film? |
07:34
🔗
|
Ymgve |
documentary about chatroulette? |
07:34
🔗
|
Coderjoe |
ah. that explains the rsync troubles |
07:45
🔗
|
Ymgve |
daamn: http://popc64.blogspot.com/ |
07:48
🔗
|
Coderjoe |
lachlan mirror still underway, at 4.2G |
11:10
🔗
|
underscor |
chronomex: http://www.myspace.com/pagefault D: |
11:10
🔗
|
underscor |
hahahaha |
11:18
🔗
|
SketchCow |
Morning, probably need to sleep a tad |
11:18
🔗
|
SketchCow |
But the batcave now has 12tb free |
11:19
🔗
|
SketchCow |
So we have a lot of room again. |
11:36
🔗
|
alard |
SketchCow: The scripts for me.com/mac.com are more or less working now, so that would be a way to get new things to fill it with. |
11:36
🔗
|
SketchCow |
Excellent. |
11:36
🔗
|
SketchCow |
So, we should talk about that. |
11:37
🔗
|
SketchCow |
The number one thing besides making stuff be in a way the wayback machine can accept, when possible, is to have ways to package this crap up into units I can use to upload again. |
11:37
🔗
|
alard |
Yes, probably have a look at the results as well. |
11:37
🔗
|
SketchCow |
I'm starting down the google groups stuff, and oh man, this is going to take forever. |
11:38
🔗
|
ersi |
Did wayback successfully swallow the earlier warc-files btw? |
11:38
🔗
|
SketchCow |
They've been doing lots of runs against them. |
11:38
🔗
|
SketchCow |
I don't know how many are fully in but that work is being done. |
11:38
🔗
|
alard |
MobileMe works with usernames, so there's not an easy way to group it into numbered chunks. (And the full list of usernames is not yet available.) |
11:38
🔗
|
ersi |
So that's a yes? |
11:39
🔗
|
SketchCow |
I am pretty sure it's a yes. |
11:39
🔗
|
ersi |
Awesome, to 11 |
11:39
🔗
|
alard |
Even the wget-warc ones? That's good news. |
11:41
🔗
|
SketchCow |
So, I asked archive team to back up a site. |
11:41
🔗
|
SketchCow |
Someone came out and said he was doing it, but he got me nervous because he basically said "their robots.txt is blocking the images!" |
11:42
🔗
|
SketchCow |
Which is like a private detective saying "and then they walked into a building that said no trespassers!" |
11:42
🔗
|
SketchCow |
11:31 <bearh> I have the backup of csoon.com |
11:42
🔗
|
SketchCow |
11:45 <bearh> And i'm kinda unsure where to upload it. |
11:42
🔗
|
SketchCow |
So, I'd like someone else to do it. |
11:42
🔗
|
SketchCow |
It's not that large. |
11:42
🔗
|
SketchCow |
But it's fucking hilarious. |
11:42
🔗
|
SketchCow |
Died in 2000. |
11:42
🔗
|
alard |
Heh. (Already did it, yesterday. Look in batcave. :) |
11:42
🔗
|
SketchCow |
Been there ever since. |
11:42
🔗
|
SketchCow |
Good deal, thanks. |
11:43
🔗
|
SketchCow |
They're right, that's like finding an untouched dinosaur fossil |
11:44
🔗
|
SketchCow |
I found another amazing site |
11:44
🔗
|
SketchCow |
Collections of old department stores |
11:45
🔗
|
SketchCow |
http://departmentstoremuseum.blogspot.com/ |
11:46
🔗
|
SketchCow |
http://departmentstoremuseum.blogspot.com/2010/06/may-co-cleveland-ohio.html |
11:46
🔗
|
SketchCow |
That is a lot of crazy work |
11:46
🔗
|
SketchCow |
I also had a nice long chat with the head of the CULINARY CURATION GROUP OF THE NEW YORK PUBLIC LIBRARY |
11:46
🔗
|
SketchCow |
Try THAT for crazy |
11:46
🔗
|
SketchCow |
http://legacy.www.nypl.org/research/chss/grd/resguides/menus/ |
11:57
🔗
|
SketchCow |
http://batcave.textfiles.com/ocrcount/ <--- You can see how long batcave was in the shitter |
12:00
🔗
|
ersi |
was that, ocr jobs that were running on batcave? :o |
12:08
🔗
|
SketchCow |
No. |
12:09
🔗
|
SketchCow |
This was me tracking a limit imposed on my ingestion. |
12:09
🔗
|
ersi |
Ah, alrighty |
12:09
🔗
|
SketchCow |
I was using a method that worked fine but was hard on the structure |
12:09
🔗
|
SketchCow |
And got into a fight over that |
12:09
🔗
|
SketchCow |
Part of it was "you shouldn't use that method if there's more than 200 jobs in queue" |
12:09
🔗
|
SketchCow |
Now, over time, that's not going to matter, i.e., a queue will be made that DOESN'T hold the job in queue on the machine, but just generally. |
12:10
🔗
|
SketchCow |
But this was me seeing "So, does it EVER go below 200 or should I even watch" |
12:10
🔗
|
SketchCow |
Answer: Yes |
12:10
🔗
|
ersi |
And bam, you started filling it up gradually instead of appending to an ever increasing derive queue? :) |
12:10
🔗
|
SketchCow |
Fuck no |
12:11
🔗
|
SketchCow |
I slammed that shit up to max |
12:11
🔗
|
ersi |
Then what was the point of that tracking? |
12:11
🔗
|
SketchCow |
To know if I was being lied to |
12:11
🔗
|
SketchCow |
I was not specifically being lied to |
12:11
🔗
|
ersi |
ah |
12:12
🔗
|
SketchCow |
Any time you see me mention interacting with other human beings, ask yourself "So, what's the most hostile interpretation as to why Jason is doing this" |
12:12
🔗
|
SketchCow |
It'll save you time |
12:12
🔗
|
SketchCow |
"Hey, guys, I went out to eat" |
12:12
🔗
|
SketchCow |
Meaning: I got banned from a new diner |
12:13
🔗
|
ersi |
Already known for.. long :) |
12:13
🔗
|
SketchCow |
Apparently you forgot, twerp! |
12:13
🔗
|
ersi |
Zing! |
12:13
🔗
|
SketchCow |
The brutal thing coming up with yahoo video is I will be writing something that pulls down an item, does huge stats on it, then uploads again. |
12:14
🔗
|
ersi |
hm, I should get going on instructables again |
12:14
🔗
|
ersi |
that thing is fuckin' huge though |
12:15
🔗
|
SketchCow |
It's funny for me that I now go into a directory on batcave, see it's 35gb, go "oh." |
12:15
🔗
|
SketchCow |
I've put up 400gb items |
12:15
🔗
|
SketchCow |
This is going to be hilarious |
12:17
🔗
|
SketchCow |
http://googleblog.blogspot.com/2011/10/fall-sweep.html |
12:18
🔗
|
SketchCow |
Shutting down: Code Search, Google Buzz, Jaiku, Google Labs (Immediately), University Research Program for Google Search |
12:18
🔗
|
ersi |
Yeah |
12:18
🔗
|
SketchCow |
Boutiques.com and like.com gone |
12:19
🔗
|
SketchCow |
Code Search was critical |
12:47
🔗
|
alard |
What would you like to get from the me.com/mac.com downloaders? At the moment, they produce: |
12:48
🔗
|
alard |
1. a warc.gz for web.me.com (plus xml index and log file) |
12:48
🔗
|
alard |
2. a warc.gz for homepage.mac.com (plus a log file) |
12:48
🔗
|
alard |
3. the xml feed for public.me.com, plus a copy of the file structure + the headers for each file (not warc) |
12:49
🔗
|
alard |
4. the xml feed for gallery.me.com, plus a zip file for each gallery |
13:37
🔗
|
SketchCow |
Hmmm. |
13:37
🔗
|
SketchCow |
I'd like all of it - what's the size differential. |
13:42
🔗
|
alard |
You do get all of the content, it's just a question of in what form you'd like to get it. |
13:42
🔗
|
alard |
Just a WARC or also separate files, that sort of thing. |
13:45
🔗
|
alard |
Here's an example listing of what it produces now: http://pastebin.com/raw.php?i=438zhmSR |
13:46
🔗
|
SketchCow |
http://vimeo.com/28173775 |
13:46
🔗
|
alard |
The files can get quite large (up to 2 GB for the users I've tried so far), so I don't think it's useful to have the data in more than one form. |
13:46
🔗
|
SketchCow |
I think it could be. |
13:47
🔗
|
SketchCow |
WARC is so forward looking, but you can't use it for anything BUT wayback. |
13:47
🔗
|
alard |
Or you have to run a WARC extractor to create the structure wget would create otherwise. |
13:48
🔗
|
SketchCow |
Hmmm. |
13:48
🔗
|
alard |
So you'd like to have the wget copy as well? |
13:48
🔗
|
SketchCow |
Well, you know, I could see that. |
13:48
🔗
|
alard |
With or without link conversion? |
13:48
🔗
|
SketchCow |
Massive post-processing. |
13:48
🔗
|
SketchCow |
I am fine with massive post-processing. |
13:48
🔗
|
SketchCow |
So WARC might make the most sense. |
13:48
🔗
|
SketchCow |
I'd like to run that against your warcs we've added already to archive.org, see how that looks. |
13:48
🔗
|
alard |
It does save a lot of duplicate uploading. |
13:49
🔗
|
SketchCow |
Agreed. |
13:49
🔗
|
SketchCow |
And the thing with these machines I have, they suck down data at 40-80MB a second. |
13:49
🔗
|
SketchCow |
So it can yank it down, rejigger, upload |
13:50
🔗
|
alard |
(As a reference: the four users I have now have 3.6GB of data together. But maybe I chose the wrong examples.) |
13:50
🔗
|
SketchCow |
Wow, what the hell. |
13:50
🔗
|
SketchCow |
Can you link me to them? |
13:50
🔗
|
alard |
http://web.me.com/sleemason/ |
13:51
🔗
|
SketchCow |
WARC is the way. |
13:51
🔗
|
alard |
http://homepage.mac.com/ueda_daisuke/ |
13:52
🔗
|
alard |
http://gallery.me.com/amurnieks |
13:52
🔗
|
balrog |
yeah, those. |
13:52
🔗
|
alard |
(each user has something on homepage, gallery, public, web) |
13:53
🔗
|
balrog |
hmm, how does WARC do it? |
13:53
🔗
|
alard |
I currently make WARCs for homepage.mac.com and web.me.com. |
13:53
🔗
|
alard |
For gallery.me.com I download the zip files that the server offers. |
13:53
🔗
|
balrog |
ohh, http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml |
13:53
🔗
|
alard |
For public.me.com I download the files. |
13:53
🔗
|
alard |
balrog: Yup. |
13:53
🔗
|
SketchCow |
And this is all closing June 2012? |
13:54
🔗
|
SketchCow |
Are they blocking with robots.txt? |
13:54
🔗
|
balrog |
SketchCow: yes, as per current info |
13:54
🔗
|
SketchCow |
Sorry for not paying more attention, been dealing with data |
13:54
🔗
|
balrog |
SketchCow: last I checked, no, but it's messy to parse because it uses XML and JS |
13:54
🔗
|
balrog |
basically it uses JS to load the web content on many pages |
13:54
🔗
|
balrog |
(from an XML file) |
13:55
🔗
|
alard |
Only gallery.me.com has a robots.txt. public.me.com doesn't, but it is somewhat inaccessible to crawlers. |
13:55
🔗
|
SketchCow |
Well, Jobs is dead, nobody is watching |
13:55
🔗
|
alard |
homepage.mac.com has normal sites, can be crawled. web.me.com has some iWeb sites which are hard to crawl (but it's possible if you use webdav). |
13:56
🔗
|
balrog |
alard: homepage.mac.com could have iWeb sites. |
13:56
🔗
|
alard |
Any examples? The wayback machine doesn't have any. |
13:56
🔗
|
balrog |
I should dig around, but I thought I saw some. |
13:57
🔗
|
SketchCow |
But wow, we're talking a fuckton of data, aren't we. |
13:57
🔗
|
alard |
Not really sure, the gallery/public sections can get large, the web sites are somewhat smaller. |
13:57
🔗
|
SketchCow |
I'm sure this is some related concept to having such intense integration of the OS and the site |
13:57
🔗
|
balrog |
I'm pretty sure there are many GB of data on here. |
13:57
🔗
|
SketchCow |
So people can just blow shit back and forth. |
13:58
🔗
|
alard |
balrog: TB, probably. |
13:58
🔗
|
balrog |
alard: I'll send you a list of homepage.mac.com pulled from my webhistory (which unfortunately doesn't go all that far back) |
13:58
🔗
|
balrog |
SketchCow: what exactly are you referring to? |
13:58
🔗
|
balrog |
alard: a few hundred TB, if you count all the gallery data |
13:58
🔗
|
SketchCow |
I mean that the .me stuff Apple did really smoothed the process of handling data and stuff. |
13:59
🔗
|
SketchCow |
Similar to what we saw with Friendster, when photo albums explode |
13:59
🔗
|
balrog |
yeah, they did. they improved it with iCloud, but took away the web-facing features :[ |
14:01
🔗
|
balrog |
alard: hold on a moment :) |
14:02
🔗
|
balrog |
alard: this is not mac.com but may be useful … http://www.wilmut.webspace.virginmedia.com/notes/webpages.html |
14:02
🔗
|
SketchCow |
http://www.archive.org/details/ARCHIVETEAM-YV-9200002-9299997 |
14:02
🔗
|
SketchCow |
I am going to get in trouble for that one. |
14:03
🔗
|
SketchCow |
There was major debate what the maximum item size should be. |
14:03
🔗
|
SketchCow |
Most people agreed 100gb |
14:03
🔗
|
balrog |
ooooh. |
14:03
🔗
|
SketchCow |
That's 408gb |
14:03
🔗
|
balrog |
why not break it up then? |
14:03
🔗
|
SketchCow |
I meant to but it was in the wrong directory when an uploader script ran |
14:03
🔗
|
SketchCow |
I misread it as 40gb |
14:03
🔗
|
SketchCow |
I may have to yank it down and split it |
14:03
🔗
|
balrog |
urgh. can you take it down? |
14:03
🔗
|
SketchCow |
I am really good at yanking it, ask around |
14:04
🔗
|
SketchCow |
Nothing's breaking, it just becomes harder for it to be moved around. |
14:04
🔗
|
balrog |
alard: you there? |
14:04
🔗
|
alard |
Yes. |
14:04
🔗
|
balrog |
http://pastie.org/private/gi3mrystmzx5ogyeocapg |
14:04
🔗
|
balrog |
that came out of my history |
14:04
🔗
|
balrog |
not all may work though |
14:04
🔗
|
balrog |
and it's short |
14:04
🔗
|
balrog |
there's another db I have which I have to go through |
14:05
🔗
|
balrog |
(raw sql) |
14:07
🔗
|
SketchCow |
http://www.archive.org/details/ARCHIVETEAM-YV-3900000-3999999&reCache=1 |
14:07
🔗
|
SketchCow |
Really, 200gb is not bad for the videos from 100,000 potential userspaces |
14:07
🔗
|
balrog |
isn't that a little large too? |
14:07
🔗
|
SketchCow |
I am fine with 200gb |
14:08
🔗
|
balrog |
alard: I'll grep this db for mac.com/me.com :p |
14:08
🔗
|
balrog |
however, do you know of a regex that can be used? |
14:08
🔗
|
alard |
balrog: I downloaded your list. (Though most of the users were already on my list, it seems.) |
14:08
🔗
|
alard |
grep (homepage|web)\.(me|mac)\.com ? |
14:09
🔗
|
balrog |
I'll get another bigger list, I just need a regex that will get the proper results |
14:09
🔗
|
balrog |
yeah but this is sql |
14:09
🔗
|
balrog |
it's likely to be in the middle of a line |
14:09
🔗
|
balrog |
like, a forum post |
14:09
🔗
|
alard |
I see. Dump all the content, feed it to grep? |
14:09
🔗
|
balrog |
well yeah, I'd be working from a sql dump |
14:09
🔗
|
balrog |
but there's stuff in the middle of lines |
14:10
🔗
|
alard |
In that case, I repeat the previous regexp. |
14:10
🔗
|
balrog |
ok ... |
14:10
🔗
|
balrog |
we'll see if it works. |
14:10
🔗
|
alard |
SketchCow: So I should keep it as WARCs? |
14:11
🔗
|
SketchCow |
Yeah |
14:11
🔗
|
alard |
What about the files public.me.com? |
14:11
🔗
|
SketchCow |
As we discussed, we can make more contemporary extractions. |
14:11
🔗
|
SketchCow |
All of them |
14:11
🔗
|
SketchCow |
archive.org can sustain two copies, one generated from the others. |
14:11
🔗
|
alard |
So don't download them separately, but download to a WARC. |
14:11
🔗
|
SketchCow |
WARC ensures long-term sustaining |
14:11
🔗
|
SketchCow |
This is the tradeoff, which I am fine with |
14:12
🔗
|
SketchCow |
(archive.org prefers we always do WARCs, in return a fuck they do not give how much we waterfall into their serverspace) |
14:12
🔗
|
SketchCow |
This from on-high |
14:12
🔗
|
balrog |
alard: you mean each user in his own WARC? |
14:12
🔗
|
alard |
What about the images on gallery.me.com? I currently ask Apple to produce zip files, which is really handy, but isn't WARC. |
14:12
🔗
|
SketchCow |
If that's the best we can do, that's fine. |
14:12
🔗
|
alard |
balrog: Yes, each user results in four WARCs. |
14:12
🔗
|
balrog |
aha. |
14:12
🔗
|
alard |
SketchCow: You can download the images, it just takes a little longer. |
14:13
🔗
|
alard |
So if WARC is nicer, we should do WARC. |
14:13
🔗
|
SketchCow |
Yes |
14:13
🔗
|
* |
balrog copies over the latest .sql |
14:13
🔗
|
SketchCow |
Also a mess: Our star wars forum thing |
14:13
🔗
|
alard |
(Although I should look at what happens to the album structure if we do that.) |
14:13
🔗
|
SketchCow |
That's what's not up |
14:13
🔗
|
SketchCow |
I trust your judgement, alard. |
14:14
🔗
|
SketchCow |
Now you know big daddy's preferences. |
14:14
🔗
|
alard |
Heh. |
14:14
🔗
|
SketchCow |
I just didn't like us shutting out the potential for contemporary users, and if post-facto conversions to items that are easier to regard are possible then I'm on board. |
14:14
🔗
|
SketchCow |
Where possible, WARC is what the "legit" sites like |
14:14
🔗
|
balrog |
alard: what's used to dump sites as WARC? |
14:15
🔗
|
alard |
wget-warc. |
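A minimal sketch of a typical wget-warc invocation, assuming a wget build with the WARC patch applied (the WARC options here are the ones that later landed in mainline wget) and a placeholder URL:

    wget --mirror --page-requisites --no-parent \
         --warc-file=example-site --warc-cdx \
         --warc-header="operator: Archive Team" \
         "http://example.com/"

This writes example-site.warc.gz and a CDX index alongside the usual mirrored directory tree, so one crawl can feed both the wayback machine and ordinary browsing.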
14:15
🔗
|
balrog |
also does that deal with when you have to use phantomjs? |
14:15
🔗
|
SketchCow |
What's the status on those fucks accepting wget-warc |
14:15
🔗
|
balrog |
or are those special-case? |
14:16
🔗
|
alard |
SketchCow: The last response was 'wow, that diff is huge', and he was inclined not to include it, but offer it as a separate extension (as in: you'd have to enable it before compiling). |
14:16
🔗
|
balrog |
alard: your regex doesn't work :/ |
14:16
🔗
|
balrog |
alard: hmmmm… mailing list? |
14:16
🔗
|
alard |
But I made the mistake of including the whole warctools library, which includes things like the curl-extension etc. |
14:16
🔗
|
SketchCow |
Well optimize and get that in |
14:16
🔗
|
SketchCow |
That's a huge win |
14:17
🔗
|
SketchCow |
It'll change everything out there |
14:17
🔗
|
* |
balrog reads up on regex |
14:17
🔗
|
alard |
Yeah, well, I replied that the files that the wget extension uses are much smaller. I haven't yet got a reply to that. |
14:17
🔗
|
SketchCow |
I say just do it. |
14:17
🔗
|
SketchCow |
It'll make a huge change in the world. |
14:17
🔗
|
alard |
I'll probably make a smaller diff and send that to them. |
14:18
🔗
|
alard |
Or two versions: the small one with built-in warc, the other one with the warctools library included. |
14:18
🔗
|
SketchCow |
I have now discovered I have two .tar files of the same range. |
14:18
🔗
|
ersi |
Kick ass effort alard. Kick ass |
14:18
🔗
|
SketchCow |
One is 111gb. One is 206gb |
14:18
🔗
|
balrog |
huh, why the difference? |
14:18
🔗
|
SketchCow |
NO IDEA |
14:18
🔗
|
alard |
balrog: Did you use grep -E ? |
14:18
🔗
|
balrog |
oops, no :p |
14:19
🔗
|
balrog |
that worked, but it grabbed full lines |
14:19
🔗
|
balrog |
I don't want full lines |
14:19
🔗
|
balrog |
I want to isolate the relevant parts |
14:19
🔗
|
alard |
Maybe do grep -oE "http://(homepage|web)\.(mac|me)\.com/[^/]+" |
14:20
🔗
|
balrog |
alard: does that assume lines start with http://? they don't |
14:21
🔗
|
alard |
Yes, it does. It also assumes that every url ends with a / |
14:21
🔗
|
alard |
grep -oE "(homepage|web)\.(mac|me)\.com/[^ ]+" stops as the first whitespace character. |
14:22
🔗
|
balrog |
URLs are formatted http:// … /username. however they may have text in front, or after them, within the same line |
14:22
🔗
|
balrog |
you could have like "Check out this site: <a href="http://homepage.mac.com/someone">Here!</a>" |
14:22
🔗
|
alard |
Oh, sorry, it doesn't assume that the *line* starts with http://, just that the *url* starts with http://. |
14:22
🔗
|
alard |
grep -oE 'http://(homepage|web)\.(mac|me)\.com/[^/"]+' |
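Putting the pieces of this exchange together, a sketch of the full extraction, assuming a plain-text SQL dump named forum.sql (the filename and output path are hypothetical):

    # pull mac.com/me.com user URLs even when they sit mid-line in a post,
    # then strip the scheme and host to leave a deduplicated username list
    grep -oE 'http://(homepage|web)\.(mac|me)\.com/[^/" ]+' forum.sql \
      | sed -E 's|^http://[^/]+/||' \
      | sort -u > mobileme-usernames.txt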
14:26
🔗
|
balrog |
much shorter list than I expected. |
14:26
🔗
|
alard |
Then it's probably good to check the regexp. |
14:26
🔗
|
balrog |
http://pastie.org/private/l5cjotdi58ttf8bq8g4m8g |
14:26
🔗
|
balrog |
I did. |
14:27
🔗
|
balrog |
the incoming HTML filter would put http:// before all urls |
14:27
🔗
|
balrog |
you have all these? |
14:40
🔗
|
balrog |
alard: did you have these already? |
14:43
🔗
|
alard |
balrog: Just checked, most of them, not all. |
14:43
🔗
|
balrog |
OK |
15:01
🔗
|
alard |
SketchCow: One more question, if you're still there. It's possible to download the gallery contents to WARC. However, I think it doesn't make sense. It certainly wouldn't be useful with the wayback machine. |
15:02
🔗
|
alard |
So I'm thinking that downloading the metadata xml/json and zipping the images per album is the best solution. |
15:03
🔗
|
SketchCow |
I agree, then. |
15:04
🔗
|
alard |
The problem with the gallery is that it isn't really a web page, but a collection of image files that can be rendered in different formats. So for a wayback-thing, you'd have to get every possible format. |
15:14
🔗
|
alard |
Well then, I think that the scripts are finished. |
15:14
🔗
|
alard |
If anyone would like to do a test run, please do! https://github.com/ArchiveTeam/mobileme-grab |
15:42
🔗
|
SketchCow |
-rw-r--r-- 1 root root 205 2011-10-05 17:14 ballsack |
15:42
🔗
|
SketchCow |
-rw-r--r-- 1 root root 2425 2011-10-05 16:20 balls |
15:42
🔗
|
SketchCow |
drwxr-xr-x 2 root root 4096 2011-10-05 17:19 DONE |
15:42
🔗
|
SketchCow |
That's how you know it was me |
15:43
🔗
|
balrog |
LOL |
15:48
🔗
|
lowtekk |
i seem to have acquired an "@", considering I may as well be a stranger, someone should probably take it away |
15:49
🔗
|
balrog |
"@"? |
15:49
🔗
|
lowtekk |
i do enjoy lurking, and as much as i love collecting old documents, i haven't contributed a darn thing to this cause |
15:49
🔗
|
lowtekk |
op status, unless I'm mistaken |
15:49
🔗
|
balrog |
oh, that |
15:50
🔗
|
balrog |
yeah I don't know :p |
15:50
🔗
|
balrog |
I think I was made op here once, though. idk either |
15:50
🔗
|
balrog |
this is efnet though |
15:50
🔗
|
balrog |
if you were to part and return, it would go away |
15:50
🔗
|
lowtekk |
i've grown rather fond of it |
16:00
🔗
|
sp0rus |
lol, i was made ops once in this chan |
16:00
🔗
|
sp0rus |
happens sometimes |
16:01
🔗
|
SketchCow |
It's all on my arbitrary observations, bitches |
16:31
🔗
|
yipdw |
free-flowing ephemeral op-bit |
16:31
🔗
|
yipdw |
probably the best way to avoid power clashes |
16:56
🔗
|
jjonas |
hey friends:) |
16:56
🔗
|
sp0rus |
hello |
16:57
🔗
|
jjonas |
its old news but i think it would make sense to note the closure of labs.google.com somewhere in the archiveteam.org wiki? |
16:58
🔗
|
sp0rus |
do it |
16:59
🔗
|
sp0rus |
http://archiveteam.org/index.php?title=Deathwatch |
17:01
🔗
|
jjonas |
it made me lose thrust in google and google inovation, i miss google sets and google squared |
17:02
🔗
|
jjonas |
ok im going to add a line there and to the article about google |
17:02
🔗
|
ersi |
s/thrust/trust |
17:02
🔗
|
ersi |
I made that spelling error a lot earlier :) |
17:03
🔗
|
jjonas |
of course... |
17:03
🔗
|
SketchCow |
I agree. |
17:04
🔗
|
SketchCow |
Stupid Google |
17:04
🔗
|
SketchCow |
It's not impressive to turn off Google Labs |
17:04
🔗
|
SketchCow |
It was inspiring to go there and see crazy projects |
17:04
🔗
|
SketchCow |
The only (only) justification I can come up with is that people/businesses/entities were monetizing or showing reliance on them |
17:05
🔗
|
ersi |
Closing down Google Code Search is fucking stupid as well |
17:05
🔗
|
jjonas |
that was back when you had gameing equipment by thrustmaster?^^ |
17:05
🔗
|
ersi |
their main shit is/was search once upon a time |
17:07
🔗
|
jjonas |
when did google code search vanish :-O |
17:07
🔗
|
jjonas |
? |
17:07
🔗
|
jjonas |
was it considered part of google labs too? |
17:07
🔗
|
SketchCow |
It's not gone yet |
17:08
🔗
|
SketchCow |
It's being killed |
17:08
🔗
|
SketchCow |
January |
17:08
🔗
|
jjonas |
*sigh* |
17:09
🔗
|
ersi |
Also, no, it was a separate project. |
17:36
🔗
|
Ymgve |
but is there any content in google code search? or was it just an alternative view of stuff that's already on the web? |
17:36
🔗
|
SketchCow |
No content |
17:36
🔗
|
SketchCow |
Just a great tool |
17:36
🔗
|
ersi |
Which still makes it a fucking shame that they're disbanding it |
17:37
🔗
|
Ymgve |
someone tell ms to make bing code search |
17:37
🔗
|
ersi |
I mean, what do you think, when you think Google? Most people think Search. |
17:37
🔗
|
ersi |
Or did, at least. I think of advertisement these days.. and crappy search |
17:42
🔗
|
sep332 |
is there a better search engine? I know blekko and duckduckgo have some cool stuff, but for general web stuff? |
17:45
🔗
|
SketchCow |
grep |
17:52
🔗
|
* |
Coderjoe grumbles |
17:52
🔗
|
Coderjoe |
I am beginning to think I should have used wget-warc |
17:53
🔗
|
Coderjoe |
5GB and still going. apparently there are some books in there too |
17:55
🔗
|
jjonas |
what are you archiving? |
17:55
🔗
|
sp0rus |
Coderjoe: wow, when he popped in talking about the site I expected a few hundred megs tops |
17:56
🔗
|
jjonas |
possibly for google code search there is some rationale to close it down - that it can be used as a tool for hacking in various ways |
17:56
🔗
|
Coderjoe |
jjonas: lachlan.bluehaze.com.au |
17:57
🔗
|
Coderjoe |
australian physicist that died last year. doing an AFK pull |
17:57
🔗
|
Coderjoe |
I should go bluehaze.com.au as well, as that site belonged to a guy that died in 2006 |
17:57
🔗
|
Coderjoe |
s/go/do |
17:58
🔗
|
Coderjoe |
argh. can't type |
17:58
🔗
|
jjonas |
but then, who wrote on top of it that he died in 2010 and that it stays as a memorial? |
17:59
🔗
|
Coderjoe |
the person keeping bluehaze around as well. |
17:59
🔗
|
jjonas |
.... but i really have no idea why they dropped/hid google labs completely |
17:59
🔗
|
jjonas |
i tried to look it up in the waybackmachine |
18:00
🔗
|
jjonas |
to see all the various nice tools/attempts that i dont even remember |
18:00
🔗
|
jjonas |
but its not in the waybackmachine |
18:03
🔗
|
jjonas |
nvm! googlelabs.com is, just the subdomain isnt |
18:15
🔗
|
ersi |
jjonas: That's a fucking stupid ass rationale |
18:15
🔗
|
jjonas |
:D |
18:15
🔗
|
jjonas |
haha |
18:15
🔗
|
ersi |
I mean seriously, punch you in the face stupid |
18:16
🔗
|
Coderjoe |
I can stab someone in the eye with a pencil. should we remove all pencils? |
18:16
🔗
|
jjonas |
i wasnt trying to defend such a rationale |
18:17
🔗
|
ersi |
I didn't perhaps mean you as in you |
18:17
🔗
|
ersi |
If you're a sad frightened panda right now, that is |
18:17
🔗
|
jjonas |
i would just be as surprised about that kind of reasoning |
18:17
🔗
|
jjonas |
that google might have done before deciding to close it down |
18:18
🔗
|
Coderjoe |
heh.. it's like someone went "The terrorists crashed planes into buildings. We must outlaw all planes." |
18:18
🔗
|
jjonas |
than i am about them closing google labs |
18:18
🔗
|
jjonas |
*NOT be as surprised |
18:19
🔗
|
sep332 |
Remember Johnny Long's "Google Hacking" books? |
18:20
🔗
|
sp0rus |
sep332: aye |
18:21
🔗
|
jjonas |
but if google and other big companies would think like you consequently |
18:21
🔗
|
jjonas |
they would have realized many useful features already |
18:22
🔗
|
jjonas |
that aren't there yet |
18:23
🔗
|
jjonas |
if you add this as a firefox bookmark and set keyword "mp3" |
18:23
🔗
|
jjonas |
http://www.google.de/search?hl=de&safe=off&q=intitle%3A%22index.of%22+(mp*|avi|wma|mov)+%s%2Bparent%2Bdirectory+-inurl%3A(htm|html|cf|jsp|asp|php|js)+-site%3Amp3s.pl+-download+-torrent+-inurl%3A(franceradio|null3d|infoweb|realm|boxxet|openftp|indexofmp3|spider|listen77|karelia|randombase|mp3*)&btnG=Suche&meta= |
18:23
🔗
|
ersi |
shrug |
18:23
🔗
|
sep332 |
I think we should remove all CoderJoe's, the world will be safer without their(?) violent imaginations |
18:23
🔗
|
jjonas |
then you can type in the address bar "mp3 any title/artist" |
18:23
🔗
|
jjonas |
and find working mp3 links |
18:24
🔗
|
jjonas |
i changed it the last time like 5 years ago so the excluded spam sites might not be up to date |
18:24
🔗
|
Coderjoe |
yeah... there is apparently another coderjoe out there, whose name is actually Joe |
18:24
🔗
|
Coderjoe |
(mine is not) |
18:24
🔗
|
jjonas |
...but it works |
18:24
🔗
|
jjonas |
and you maybe use something similar already |
18:24
🔗
|
jjonas |
so why does google not have a tab "mp3" next to images,maps,... |
18:25
🔗
|
ersi |
Let's get back to talking about archiveteam stuff instead of fluff |
18:25
🔗
|
Coderjoe |
expected record company outrage? |
18:26
🔗
|
sep332 |
baidu has an mp3 search, mp3.baidu.com |
18:26
🔗
|
jjonas |
thats a different environment, google china also has a million songs freely downloadable |
18:26
🔗
|
jjonas |
(with a chinese IP only of course |
18:27
🔗
|
jjonas |
...... |
18:28
🔗
|
jjonas |
yeah, lets talk about archiving, since i made my point why they maybe would (sadly) close down google code for such a reason :D |
18:29
🔗
|
jjonas |
if you dont mind check my grammar about google labs in http://archiveteam.org/index.php?title=Deathwatch#2011 |
18:34
🔗
|
jjonas |
btw, just to finish the mp3 subtopic condignly: the russian facebook counterpart vkontakte.ru has a great community directory sharing all mp3s paired with lyrics files among 100+ million users just like there is no copyright :D |
18:34
🔗
|
chronomex |
is no copyright in soviet russia |
18:35
🔗
|
chronomex |
nor in capitalist russia |
18:35
🔗
|
jjonas |
so warez sites are legal there too |
18:35
🔗
|
jjonas |
? |
18:35
🔗
|
jjonas |
even if they have international users |
18:35
🔗
|
jjonas |
:-O |
18:35
🔗
|
* |
chronomex shrugs |
18:35
🔗
|
chronomex |
eez joke |
18:36
🔗
|
ersi |
Calm the fuck down |
18:36
🔗
|
* |
ersi brings out the sedatives |
18:36
🔗
|
SketchCow |
http://yfrog.com/z/obj01nxj |
18:36
🔗
|
ersi |
SketchCow: Hah, awesome |
18:37
🔗
|
jjonas |
:) im not nervous, just kidding |
20:19
🔗
|
Coderjoe |
oh joy |
20:19
🔗
|
Coderjoe |
I don't know where the link was that caused me to go astray |
20:19
🔗
|
Coderjoe |
but apparently, the server has no trouble treating html files as directories |
20:19
🔗
|
Coderjoe |
http://lachlan.bluehaze.com.au/deep.html/books/usa2001/usa2001/usa2001/gnomes.html |
20:20
🔗
|
Coderjoe |
that gives you the "deep.html" page |
20:20
🔗
|
Frigolit |
that's called "path info" |
20:20
🔗
|
Coderjoe |
yes, i know. and I've used it on php, just not html |
20:21
🔗
|
Coderjoe |
but there is a bad link that led me to an infinite recursion problem |
20:21
🔗
|
Frigolit |
ah |
20:23
🔗
|
Coderjoe |
my apache config at home does not appear to allow pathinfo on html, but then I am not parsing html (while the lachlan server is) |
20:25
🔗
|
Coderjoe |
somewhere on that site is at least one bad link that adds a directory level to the entire site |
20:30
🔗
|
Coderjoe |
i'm going to terminate that until I have a chance to inspect things a bit more |
21:40
🔗
|
Paradoks |
http://www.economist.com/node/21529030 |
21:41
🔗
|
Paradoks |
Scanning and destroying books, for a fee. I wonder if this horrifies Sketchcow. Obviously, it's not archiving, though some people might use it that way. |
21:44
🔗
|
Coderjoe |
scanning good. destruction BAD |
21:46
🔗
|
sep332 |
related blog post on it http://ascii.textfiles.com/archives/2672 |
21:52
🔗
|
goekesmi |
It's always a hard call when it comes to books. |
22:20
🔗
|
dashcloud |
if the book needs to be destroyed, I'm expecting perfection for results- anything less isn't worth it (for the sake of archiving, it's not worth it, but I'm sure many people would be happy to make that choice) |
22:21
🔗
|
yipdw |
oh, I dunno -- people seem perfectly happy to accept 1080p masters for films these days |
22:24
🔗
|
sp0rus |
yeah, but people are stupid |
22:24
🔗
|
yipdw |
at least 1DollarScan/Bookscan seem to be clear that they only do this for mass-market copies |
22:24
🔗
|
yipdw |
that seems to be a bit more sane |
22:24
🔗
|
yipdw |
well, I think, I dunno -- it's not spelled out in that article |
22:26
🔗
|
sp0rus |
if it's mass-market and not hard to find, that's a little different |
22:27
🔗
|
yipdw |
right |
22:27
🔗
|
yipdw |
I think that's the intent here |
22:30
🔗
|
dashcloud |
yipdw: what quality masters should people be asking for? |
22:32
🔗
|
SketchCow |
Hiiii |
22:33
🔗
|
yipdw |
dashcloud: the highest available, which for some films is 1080p -- Ultraviolet and new scenes in Star Wars come to mind |
22:33
🔗
|
yipdw |
dashcloud: but it's more that 1080p is markedly inferior in terms of resolution to earlier production techniques, and what with the availability of digital cameras like the RED ONE system it doesn't have to be that way |
22:33
🔗
|
yipdw |
so, yeah, more of an offhand snark |
22:34
🔗
|
dashcloud |
at least some of the Blender Foundation's open movies are available as higher than 1080p films |
22:36
🔗
|
yipdw |
yeah, and with those it's theoretically better because the film assets are available |
22:36
🔗
|
dashcloud |
here's an awesome article about gifs : http://motherboard.tv/2010/11/19/the-gif-that-keeps-on-gifing-why-animated-images-are-still-a-defining-part-of-our-internets |
22:37
🔗
|
yipdw |
I say theoretically because I sure as hell haven't been able to e.g. re-render Big Buck Bunny from the assets directory :P |
22:37
🔗
|
dashcloud |
I know the 2k frames were/are available from xiph's sample site |
22:41
🔗
|
SketchCow |
My attitude on 1dollarbookscan is it makes more sense than throwing them out |
22:54
🔗
|
SketchCow |
Barely |
23:20
🔗
|
chronomex |
^ |
23:45
🔗
|
underscor |
BURP |
23:45
🔗
|
underscor |
Another 300GB into the archive |