#archiveteam 2013-07-11,Thu


Time Nickname Message
00:52 πŸ”— joepie91 Google Latitude is shutting down, and will be deleting friends lists, badges, and perhaps other things
09:15 πŸ”— archivist WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
09:22 πŸ”— omf_ yahoosucks
09:22 πŸ”— omf_ IS THY SECRET WORD
11:05 πŸ”— SmileyG Here's an update: it works. A team of researchers at the University of Southampton has demonstrated a way to record and retrieve as much as 360 terabytes of digital data onto a single disk of quartz glass in a way that can withstand temperatures of up to 1000 C and should keep the data stable and readable for up to a million years.
11:05 πŸ”— SmileyG :O :D
11:08 πŸ”— godane can we start a kickstarter for them?
11:33 πŸ”— antomatic And the single glass disc is approximately three miles in diameter. :)
11:42 πŸ”— winr4r and/or costs six billion quid
11:47 πŸ”— antomatic or both. :)
11:48 πŸ”— antomatic "The wildly impractical storage innovation was immediately purchased by Iomega Corporation." :)
11:49 πŸ”— Baljem when the technology takes off they'll cost-reduce it to 'up to 100 C and readable for up to a million minutes', and in ten years we'll be worrying about backing it all up again
11:49 πŸ”— Baljem or, in other words, what antomatic said ;)
11:54 πŸ”— godane g4tv.com-video3281: Alan Paller Interview: https://archive.org/details/g4tv.com-video3281
11:56 πŸ”— godane g4tv.com-video3166: DMCA Debate: https://archive.org/details/g4tv.com-video3166
11:57 πŸ”— godane g4tv.com-video3145: Andy Jones Interview: https://archive.org/details/g4tv.com-video3145
18:10 πŸ”— joepie91 newsflash: Linden Labs bought Desura
18:10 πŸ”— joepie91 and Linden doesn't exactly have a stellar reputation of being careful stewards for user data
18:10 πŸ”— joepie91 so perhaps it's worth seeing if there's anything scrape-able on Desura in terms of user content
19:19 πŸ”— SmileyG Maybe old BUT. "After slightly more than 30 years, PCWorld -- one of the most successful computer magazines of all time -- is discontinuing print publication."
20:47 πŸ”— SketchCow http://i.imgur.com/h8qs65w.gif
20:47 πŸ”— SketchCow ARCHIVE TEAM SUMMONS
20:55 πŸ”— omf_ me gusta
21:26 πŸ”— godane i'm starting to hate the way the wayback machine finds pages
21:27 πŸ”— godane it's archiving pages that i was trying to search using * with
21:27 πŸ”— godane like: http://podcast.cbc.ca/mp3/podcasts/bonusspark*
21:28 πŸ”— godane and that was 2 days ago
21:28 πŸ”— Asparagir Howdy peoples. I found a site that needs saving and SketchCow suggested I show up here and mention it so we can haz collaboration.
21:29 πŸ”— Asparagir BuzzData.com is closing, and all data is being deleted at the end of July.
21:29 πŸ”— Asparagir The 31st, to be specific.
21:29 πŸ”— Asparagir If you have a username and password, you can see 2300+ datasets, plus comments on them and user profiles and stuff like that. All about to go *poof*.
21:30 πŸ”— Asparagir Not very personal stuff, mostly dry government created spreadsheets.
21:32 πŸ”— Asparagir But still, someone ought to try, right? Screenshot #1: https://www.dropbox.com/s/b891l8y177g6vt9/buzzdata_screenshot_01.png
21:32 πŸ”— Asparagir Screenshot #2: https://www.dropbox.com/s/7868f285lnbiref/buzzdata_screenshot_02.png
21:35 πŸ”— winr4r yes, get it
21:37 πŸ”— Asparagir Okay. I can start a panic grab on a cloud server later tonight. I have wget 1.14, which I think is the latest, and I will follow the directions on the AT wiki for doing the WARC dump.
21:38 πŸ”— winr4r "The team behind BuzzData has a new product, a new name and a new mission -- we're now LookBookHQ."
21:38 πŸ”— Asparagir Based on the wiki, it should be this, I think?
21:38 πŸ”— winr4r nothing says "wow i want to use that new product" more than that
21:38 πŸ”— Asparagir wget -e robots=off --mirror --page-requisites --save-headers --wait 3 --waitretry 5 --timeout 60 --tries 5 -H -Dbuzzdata.com --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$SAVE_HOST"
21:38 πŸ”— winr4r Asparagir: yes
21:39 πŸ”— Asparagir Okay. New to the world of WARC, want to make sure I get this right.
21:39 πŸ”— Asparagir I will also need to figure out the code for cookies to do the initial login
21:39 πŸ”— winr4r if there's stuff that only appears while logged in, you might want to look at --load-cookies as well
21:39 πŸ”— Asparagir Since the data requires user login first.
21:39 πŸ”— Asparagir Right, basically the entire site is visible *only* when logged in.
21:40 πŸ”— Asparagir Even the "public" stuff.
21:40 πŸ”— Asparagir (Gee, I wonder why it never got popular.)
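
A minimal sketch of the login step for the wget grab quoted above, since the site shows everything only to logged-in users. The /session path and the form field names are assumptions and would need to be checked against BuzzData's actual sign-in form:

    # Log in once, saving the session cookie. The URL and field names
    # below are guesses -- inspect the real login form before running.
    wget --save-cookies=cookies.txt --keep-session-cookies \
         --post-data='login=YOUR_USERNAME&password=YOUR_PASSWORD' \
         -O /dev/null "https://buzzdata.com/session"

    # Then add --load-cookies=cookies.txt to the mirror command so every
    # request is made as a logged-in user.
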
21:40 πŸ”— winr4r in that case, i'd suggest you concentrate on getting the data; WARCs are mostly useful when ingested into wayback, which this won't be
21:41 πŸ”— winr4r ^^ NOT OFFICIAL ARCHIVE TEAM OPINION, YOU CAN IGNORE IT
21:42 πŸ”— omf_ Can we still create an account to access the data
21:42 πŸ”— Asparagir I don't know.
21:42 πŸ”— Asparagir Accounts are (or were) free, though.
21:42 πŸ”— Asparagir You're all welcome to use mine. :-)
21:46 πŸ”— Asparagir Okay, have to step out for 30 minutes to go pick up my daughter from summer camp. Back later...
21:46 πŸ”— winr4r hb
21:47 πŸ”— omf_ I just created a new account
21:48 πŸ”— omf_ I would recommend a few other people create accounts as well so we have multiple ones to work with in case the ban hammer comes down
21:48 πŸ”— omf_ Their base url scheme is not retarded so that is good
22:47 πŸ”— Asparagir Back now.
22:52 πŸ”— Asparagir So, how does one organize a panic grab of something like this? Do we each just run wget on our own boxes, and hope we each get different things than the other AT peeps? Or do we seed a tracker with usernames first, or what?
22:52 πŸ”— Asparagir Verily, I am new at this.
22:56 πŸ”— omf_ I just went through the motions of trying to get a dataset. It is a mess of javascript to make things go
22:59 πŸ”— omf_ If someone pulls down the 238 pages of public dataset results I can probably whip up some javascript bullshit to get the datasets
23:00 πŸ”— omf_ they do some ajax stuff to get to the dataset url
23:16 πŸ”— Asparagir Ugh, yeah, it looks like they're using jQuery templates to render a lot of the links to the actual data; it's not all written right to the HTML.
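
A hedged sketch of what pulling the 238 pages of public dataset results omf_ mentions might look like; the /explore?page=N URL pattern is a guess at the site's pagination scheme, and cookies.txt comes from a prior login:

    # Fetch each page of the public dataset listing for later link
    # extraction. The listing URL is an assumption.
    for p in $(seq 1 238); do
        wget --load-cookies=cookies.txt --wait 3 \
             -O "listing-page-$p.html" \
             "https://buzzdata.com/explore?page=$p"
    done
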
23:25 πŸ”— Asparagir Format of the data preview is like this: http://buzzdata.com/data_files/blIVT0LGqr4y37yAyCM7w3
23:33 πŸ”— Asparagir Aaaaand I see unique authenticity tokens posted for each dataset, to make it hard to screen scrape.
23:34 πŸ”— Asparagir Curses, foiled again.
23:38 πŸ”— Asparagir Hold up, they have a free API.
23:38 πŸ”— Asparagir http://buzzdata.com/faq/api/api-methods
23:40 πŸ”— omf_ that makes it easy
23:40 πŸ”— winr4r hooray!
23:41 πŸ”— omf_ you want to do it Asparagir ?
23:42 πŸ”— Asparagir I don't think my kung-fu is quite strong enough to code the whole thing.
23:44 πŸ”— Asparagir It looks like at a minimum you would need to know a list of usernames and/or "hive names" beforehand, in order to use the API to grab each of their datasets.
23:44 πŸ”— Asparagir See also https://github.com/buzzdata/api-docs
23:46 πŸ”— Asparagir Yeah, minimum requirements for all the API endpoints is an already-known username. Such as GET `https://:HIVE_NAME.buzzdata.com/api/:USERNAME` where HIVE_NAME is optional (if you leave it out, it just gets the public stuff)
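
A minimal sketch built on the endpoint quoted above, assuming a usernames.txt list has been assembled first; the /datasets suffix for listing a user's datasets is an assumption to verify against the api-docs repo:

    # For each known username, pull the user record and (assumed) dataset
    # listing via the public API. usernames.txt is one username per line.
    while read -r u; do
        curl -s "https://buzzdata.com/api/$u" -o "user-$u.json"
        curl -s "https://buzzdata.com/api/$u/datasets" -o "datasets-$u.json"
    done < usernames.txt
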
