Time | Nickname | Message
00:52 | joepie91 | Google Latitude is shutting down, and will be deleting friends lists, badges, and perhaps other things
09:15 | archivist | WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
09:22 | omf_ | yahoosucks
09:22 | omf_ | IS THY SECRET WORD
11:05 | SmileyG | Here's an update: it works. A team of researchers at the University of Southampton have demonstrated a way to record and retrieve as much as 360 terabytes of digital data onto a single disk of quartz glass in a way that can withstand temperatures of up to 1000 C and should keep the data stable and readable for up to a million years.
11:05 | SmileyG | :O :D
11:08 | godane | can we start a kickstarter for them?
11:33 | antomatic | And the single glass disc is approximately three miles in diameter. :)
11:42 | winr4r | and/or costs six billion quid
11:47 | antomatic | or both. :)
11:48 | antomatic | "The wildly impractical storage innovation was immediately purchased by Iomega Corporation." :)
11:49 | Baljem | when the technology takes off they'll cost-reduce it to 'up to 100 C and readable for up to a million minutes', and in ten years we'll be worrying about backing it all up again
11:49 | Baljem | or, in other words, what antomatic said ;)
11:54 | godane | g4tv.com-video3281: Alan Paller Interview: https://archive.org/details/g4tv.com-video3281
11:56 | godane | g4tv.com-video3166: DMCA Debate: https://archive.org/details/g4tv.com-video3166
11:57 | godane | g4tv.com-video3145: Andy Jones Interview: https://archive.org/details/g4tv.com-video3145
18:10 | joepie91 | newsflash: Linden Labs bought Desura
18:10 | joepie91 | and Linden doesn't exactly have a stellar reputation of being careful stewards for user data
18:10 | joepie91 | so perhaps it's worth seeing if there's anything scrape-able on Desura in terms of user content
19:19 | SmileyG | Maybe old BUT. "After slightly more than 30 years, PCWorld – one of the most successful computer magazines of all time – is discontinuing print publication."
20:47 | SketchCow | http://i.imgur.com/h8qs65w.gif
20:47 | SketchCow | ARCHIVE TEAM SUMMONS
20:55 | omf_ | me gusta
21:26 | godane | i'm starting to hate the way the wayback machine finds pages
21:27 | godane | it's archiving pages that i was trying to search using * with
21:27 | godane | like: http://podcast.cbc.ca/mp3/podcasts/bonusspark*
21:28 | godane | and that was 2 days ago
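
A prefix lookup like the one godane describes can also be made against the Wayback Machine's CDX search endpoint; a minimal sketch, assuming the standard web.archive.org/cdx/search/cdx parameters, with the prefix taken from the URL above:

    # Prefix (wildcard) lookup against the Wayback CDX search endpoint.
    # limit kept small for a quick check of what has been captured.
    curl -s 'https://web.archive.org/cdx/search/cdx?url=podcast.cbc.ca/mp3/podcasts/bonusspark&matchType=prefix&limit=50'
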
21:28 | Asparagir | Howdy peoples. I found a site that needs saving and SketchCow suggested I show up here and mention it so we can haz collaboration.
21:29 | Asparagir | BuzzData.com is closing, and all data is being deleted at the end of July.
21:29 | Asparagir | The 31st, to be specific.
21:29 | Asparagir | If you have a username and password, you can see 2300+ datasets, plus comments on them and user profiles and stuff like that. All about to go *poof*.
21:30 | Asparagir | Not very personal stuff, mostly dry government-created spreadsheets.
21:32 | Asparagir | But still, someone ought to try, right? Screenshot #1: https://www.dropbox.com/s/b891l8y177g6vt9/buzzdata_screenshot_01.png
21:32 | Asparagir | Screenshot #2: https://www.dropbox.com/s/7868f285lnbiref/buzzdata_screenshot_02.png
21:35 | winr4r | yes, get it
21:37 | Asparagir | Okay. I can start a panic grab on a cloud server later tonight. I have wget 1.14, which I think is the latest, and I will follow the directions on the AT wiki for doing the WARC dump.
21:38 | winr4r | "The team behind BuzzData has a new product, a new name and a new mission – we're now LookBookHQ."
21:38 | Asparagir | Based on the wiki, it should be this, I think?
21:38 | winr4r | nothing says "wow i want to use that new product" more than that
21:38 | Asparagir | wget -e robots=off --mirror --page-requisites --save-headers --wait 3 --waitretry 5 --timeout 60 --tries 5 -H -Dbuzzdata.com --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$SAVE_HOST"
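
The $WARC_NAME, $USER_AGENT and $SAVE_HOST placeholders in that command come from the wiki template; a minimal sketch of how they might be filled in for this grab, with purely illustrative values that were not agreed on in the channel:

    # Illustrative values only; the actual WARC name, agent string, and start URL were not settled here.
    WARC_NAME="buzzdata.com-panicgrab-$(date +%Y%m%d)"   # basename for the .warc.gz and .cdx output
    USER_AGENT="ArchiveTeam BuzzData grab (wget 1.14)"   # identify the crawler politely
    SAVE_HOST="http://buzzdata.com/"                     # start URL for the mirror

    wget -e robots=off --mirror --page-requisites --save-headers \
         --wait 3 --waitretry 5 --timeout 60 --tries 5 \
         -H -Dbuzzdata.com \
         --warc-header "operator: Archive Team" --warc-cdx \
         --warc-file="$WARC_NAME" -U "$USER_AGENT" "$SAVE_HOST"
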
21:38 | winr4r | Asparagir: yes
21:39 | Asparagir | Okay. New to the world of WARC, want to make sure I get this right.
21:39 | Asparagir | I will also need to figure out the code for cookies to do the initial login
21:39 | winr4r | if there's stuff that only appears while logged in, you might want to look at --load-cookies as well
21:39 | Asparagir | Since the data requires user login first.
21:40 | Asparagir | Right, basically the entire site is visible *only* when logged in.
21:40 | Asparagir | Even the "public" stuff.
21:40 | Asparagir | (Gee, I wonder why it never got popular.)
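
Since the whole site sits behind a login, the cookie handling winr4r mentions would come before the mirror run; a rough sketch, assuming a conventional form login (the /session path and the field names are guesses, not BuzzData's documented form):

    # Hypothetical login-first flow; the /session path and the email/password field names
    # are assumptions and would need to be checked against the real login page.
    wget --save-cookies cookies.txt --keep-session-cookies \
         --post-data 'email=you@example.com&password=changeme' \
         -O /dev/null 'http://buzzdata.com/session'

    # Reuse the saved session cookies for the actual mirror/WARC run.
    wget --load-cookies cookies.txt -e robots=off --mirror --page-requisites --save-headers \
         --warc-file="$WARC_NAME" -U "$USER_AGENT" "$SAVE_HOST"
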
21:41 | winr4r | in that case, i'd suggest you concentrate on getting the data; WARCs are mostly useful for ingesting into the wayback machine, which this won't be
21:41 | winr4r | ^^ NOT OFFICIAL ARCHIVE TEAM OPINION, YOU CAN IGNORE IT
21:42 | omf_ | Can we still create an account to access the data
21:42 | Asparagir | I don't know.
21:42 | Asparagir | Accounts are (or were) free, though.
21:42 | Asparagir | You're all welcome to use mine. :-)
21:46 | Asparagir | Okay, have to step out for 30 minutes to go pick up my daughter from summer camp. Back later...
21:46 | winr4r | hb
21:47 | omf_ | I just created a new account
21:48 | omf_ | I would recommend a few other people create accounts as well so we have multiple ones to work with in case the ban hammer comes down
21:48 | omf_ | Their base URL scheme is sensible, so that is good
22:47 | Asparagir | Back now.
22:52 | Asparagir | So, how does one organize a panic grab of something like this? Do we each just run wget on our own boxes, and hope we each get different things than the other AT peeps? Or do we seed a tracker with usernames first, or what?
22:52 | Asparagir | Verily, I am new at this.
22:56 | omf_ | I just went through the motions of trying to get a dataset. It is a mess of javascript to make things go
22:59 | omf_ | If someone pulls down the 238 pages of public dataset results I can probably whip up some javascript bullshit to get the datasets
23:00 | omf_ | they do some ajax stuff to get to the dataset url
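
Grabbing those 238 listing pages could be a plain loop before any of the AJAX handling; a rough sketch, assuming (unverified) that the public dataset index paginates with a ?page=N parameter and that cookies.txt holds the logged-in session from earlier:

    # Assumption: the public dataset index lives at /datasets and paginates via ?page=N.
    for page in $(seq 1 238); do
        wget --load-cookies cookies.txt -O "datasets-page-$page.html" \
             "http://buzzdata.com/datasets?page=$page"
        sleep 3   # stay polite between requests
    done
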
23:16 | Asparagir | Ugh, yeah, it looks like they're using jQuery templates to render a lot of the links to the actual data; it's not all written directly into the HTML.
23:25 | Asparagir | Format of the data preview is like this: http://buzzdata.com/data_files/blIVT0LGqr4y37yAyCM7w3
23:33 | Asparagir | Aaaaand I see unique authenticity tokens posted for each dataset, to make it hard to screen scrape.
23:34 | Asparagir | Curses, foiled again.
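
If those authenticity tokens are the usual Rails-style CSRF tokens embedded in each page, they can often be scraped out of the HTML and replayed; a speculative sketch, assuming a hidden authenticity_token form field (the field name and the approach are assumptions):

    # Speculative: fetch one data-file page, pull out a hidden authenticity_token field,
    # and keep it for the follow-up download request. Field name is an assumption.
    page_url='http://buzzdata.com/data_files/blIVT0LGqr4y37yAyCM7w3'
    token=$(wget --load-cookies cookies.txt -q -O - "$page_url" \
            | grep -o 'name="authenticity_token" value="[^"]*"' \
            | head -n 1 | sed 's/.*value="//; s/"$//')
    echo "authenticity_token: $token"
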
23:38 | Asparagir | Hold up, they have a free API.
23:38 | Asparagir | http://buzzdata.com/faq/api/api-methods
23:40 | omf_ | that makes it easy
23:40 | winr4r | hooray!
23:41 | omf_ | you want to do it Asparagir ?
23:42 | Asparagir | I don't think my kung-fu is quite strong enough to code the whole thing.
23:44 | Asparagir | It looks like at a minimum you would need to know a list of usernames and/or "hive names" beforehand, in order to use the API to grab each of their datasets.
23:44 | Asparagir | See also https://github.com/buzzdata/api-docs
23:46 | Asparagir | Yeah, the minimum requirement for all the API endpoints is an already-known username. Such as GET `https://:HIVE_NAME.buzzdata.com/api/:USERNAME` where HIVE_NAME is optional (if you leave it out, it just gets the public stuff)
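
Given a list of usernames gathered beforehand, that per-user endpoint could be walked with a simple loop; a minimal sketch, assuming a hypothetical usernames.txt (one name per line, scraped from the listing pages) and leaving out any API-key handling the api-docs may require:

    # Walk the documented per-user endpoint for each known username.
    # The no-hive form of the URL is used, which per the message above returns the public data.
    while read -r username; do
        curl -s "https://buzzdata.com/api/$username" -o "api-$username.json"
        sleep 2   # stay polite
    done < usernames.txt
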