Time | Nickname | Message
00:52 | joepie91 | Google Latitude is shutting down, and will be deleting friends lists, badges, and perhaps other things
09:15 | archivist | WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
09:22 | omf_ | yahoosucks
09:22 | omf_ | IS THY SECRET WORD
11:05 | SmileyG | Here's an update: it works. A team of researchers at the University of Southampton have demonstrated a way to record and retrieve as much as 360 terabytes of digital data onto a single disk of quartz glass in a way that can withstand temperatures of up to 1000 C and should keep the data stable and readable for up to a million years.
11:05 | SmileyG | :O :D
11:08 | godane | can we start a kickstarter for them?
11:33 | antomatic | And the single glass disc is approximately three miles in diameter. :)
11:42 | winr4r | and/or costs six billion quid
11:47 | antomatic | or both. :)
11:48 | antomatic | "The wildly impractical storage innovation was immediately purchased by Iomega Corporation." :)
11:49 | Baljem | when the technology takes off they'll cost-reduce it to 'up to 100 C and readable for up to a million minutes', and in ten years we'll be worrying about backing it all up again
11:49 | Baljem | or, in other words, what antomatic said ;)
11:54 | godane | g4tv.com-video3281: Alan Paller Interview: https://archive.org/details/g4tv.com-video3281
11:56 | godane | g4tv.com-video3166: DMCA Debate: https://archive.org/details/g4tv.com-video3166
11:57 | godane | g4tv.com-video3145: Andy Jones Interview: https://archive.org/details/g4tv.com-video3145
18:10 | joepie91 | newsflash: Linden Labs bought Desura
18:10 | joepie91 | and Linden doesn't exactly have a stellar reputation of being careful stewards for user data
18:10 | joepie91 | so perhaps it's worth seeing if there's anything scrape-able on Desura in terms of user content
19:19 | SmileyG | Maybe old BUT. "After slightly more than 30 years, PCWorld – one of the most successful computer magazines of all time – is discontinuing print publication."
20:47 | SketchCow | http://i.imgur.com/h8qs65w.gif
20:47 | SketchCow | ARCHIVE TEAM SUMMONS
20:55 | omf_ | me gusta
21:26 | godane | i'm starting to hate the way the wayback machine finds pages
21:27 | godane | it's archiving pages that i was trying to search using * with
21:27 | godane | like: http://podcast.cbc.ca/mp3/podcasts/bonusspark*
21:28 | godane | and that was 2 days ago
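
A prefix lookup like the one godane describes can also be made against the Wayback Machine's CDX search endpoint; a minimal sketch, assuming the standard web.archive.org/cdx/search/cdx parameters, with the prefix taken from the URL above:

    # Prefix (wildcard) lookup against the Wayback CDX search endpoint.
    # limit kept small for a quick check of what has been captured.
    curl -s 'https://web.archive.org/cdx/search/cdx?url=podcast.cbc.ca/mp3/podcasts/bonusspark&matchType=prefix&limit=50'
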
21:28 | Asparagir | Howdy peoples. I found a site that needs saving and SketchCow suggested I show up here and mention it so we can haz collaboration.
21:29 | Asparagir | BuzzData.com is closing, and all data is being deleted at the end of July.
21:29 | Asparagir | The 31st, to be specific.
21:29 | Asparagir | If you have a username and password, you can see 2300+ datasets, plus comments on them and user profiles and stuff like that. All about to go *poof*.
21:30 | Asparagir | Not very personal stuff, mostly dry government-created spreadsheets.
21:32 | Asparagir | But still, someone ought to try, right? Screenshot #1: https://www.dropbox.com/s/b891l8y177g6vt9/buzzdata_screenshot_01.png
21:32 | Asparagir | Screenshot #2: https://www.dropbox.com/s/7868f285lnbiref/buzzdata_screenshot_02.png
21:35 | winr4r | yes, get it
21:37 | Asparagir | Okay. I can start a panic grab on a cloud server later tonight. I have wget 1.14, which I think is the latest, and I will follow the directions on the AT wiki for doing the WARC dump.
21:38 | winr4r | "The team behind BuzzData has a new product, a new name and a new mission – we're now LookBookHQ."
21:38 | Asparagir | Based on the wiki, it should be this, I think?
21:38 | winr4r | nothing says "wow i want to use that new product" more than that
21:38 | Asparagir | wget -e robots=off --mirror --page-requisites --save-headers --wait 3 --waitretry 5 --timeout 60 --tries 5 -H -Dbuzzdata.com --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$SAVE_HOST"
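
The $WARC_NAME, $USER_AGENT and $SAVE_HOST placeholders in that command come from the wiki template; a minimal sketch of how they might be filled in for this grab, with purely illustrative values that were not agreed on in the channel:

    # Illustrative values only; the actual WARC name, agent string, and start URL were not settled here.
    WARC_NAME="buzzdata.com-panicgrab-$(date +%Y%m%d)"   # basename for the .warc.gz and .cdx output
    USER_AGENT="ArchiveTeam BuzzData grab (wget 1.14)"   # identify the crawler politely
    SAVE_HOST="http://buzzdata.com/"                     # start URL for the mirror

    wget -e robots=off --mirror --page-requisites --save-headers \
         --wait 3 --waitretry 5 --timeout 60 --tries 5 \
         -H -Dbuzzdata.com \
         --warc-header "operator: Archive Team" --warc-cdx \
         --warc-file="$WARC_NAME" -U "$USER_AGENT" "$SAVE_HOST"
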
21:38 | winr4r | Asparagir: yes
21:39 | Asparagir | Okay. New to the world of WARC, want to make sure I get this right.
21:39 | Asparagir | I will also need to figure out the code for cookies to do the initial login
21:39 | winr4r | if there's stuff that only appears while logged in, you might want to look at --load-cookies as well
21:39 | Asparagir | Since the data requires user login first.
21:40 | Asparagir | Right, basically the entire site is visible *only* when logged in.
21:40 | Asparagir | Even the "public" stuff.
21:40 | Asparagir | (Gee, I wonder why it never got popular.)
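
Since the whole site sits behind a login, the cookie handling winr4r mentions would come before the mirror run; a rough sketch, assuming a conventional form login (the /session path and the field names are guesses, not BuzzData's documented form):

    # Hypothetical login-first flow; the /session path and the email/password field names
    # are assumptions and would need to be checked against the real login page.
    wget --save-cookies cookies.txt --keep-session-cookies \
         --post-data 'email=you@example.com&password=changeme' \
         -O /dev/null 'http://buzzdata.com/session'

    # Reuse the saved session cookies for the actual mirror/WARC run.
    wget --load-cookies cookies.txt -e robots=off --mirror --page-requisites --save-headers \
         --warc-file="$WARC_NAME" -U "$USER_AGENT" "$SAVE_HOST"
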
21:41 | winr4r | in that case, i'd suggest you concentrate on getting the data; WARCs are mostly useful for ingesting into the wayback machine, which this won't be
21:41 | winr4r | ^^ NOT OFFICIAL ARCHIVE TEAM OPINION, YOU CAN IGNORE IT
21:42 | omf_ | Can we still create an account to access the data
21:42 | Asparagir | I don't know.
21:42 | Asparagir | Accounts are (or were) free, though.
21:42 | Asparagir | You're all welcome to use mine. :-)
21:46 | Asparagir | Okay, have to step out for 30 minutes to go pick up my daughter from summer camp. Back later...
21:46 | winr4r | hb
21:47 | omf_ | I just created a new account
21:48 | omf_ | I would recommend a few other people create accounts as well so we have multiple ones to work with in case the ban hammer comes down
21:48 | omf_ | Their base URL scheme is sensible, so that is good
22:47 | Asparagir | Back now.
22:52 | Asparagir | So, how does one organize a panic grab of something like this? Do we each just run wget on our own boxes, and hope we each get different things than the other AT peeps? Or do we seed a tracker with usernames first, or what?
22:52 | Asparagir | Verily, I am new at this.
22:56 | omf_ | I just went through the motions of trying to get a dataset. It is a mess of javascript to make things go
22:59 | omf_ | If someone pulls down the 238 pages of public dataset results I can probably whip up some javascript bullshit to get the datasets
23:00 | omf_ | they do some ajax stuff to get to the dataset url
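
Grabbing those 238 listing pages could be a plain loop before any of the AJAX handling; a rough sketch, assuming (unverified) that the public dataset index paginates with a ?page=N parameter and that cookies.txt holds the logged-in session from earlier:

    # Assumption: the public dataset index lives at /datasets and paginates via ?page=N.
    for page in $(seq 1 238); do
        wget --load-cookies cookies.txt -O "datasets-page-$page.html" \
             "http://buzzdata.com/datasets?page=$page"
        sleep 3   # stay polite between requests
    done
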
23:16 | Asparagir | Ugh, yeah, it looks like they're using jQuery templates to render a lot of the links to the actual data; it's not all written directly into the HTML.
23:25 | Asparagir | Format of the data preview is like this: http://buzzdata.com/data_files/blIVT0LGqr4y37yAyCM7w3
23:33 | Asparagir | Aaaaand I see unique authenticity tokens posted for each dataset, to make it hard to screen scrape.
23:34 | Asparagir | Curses, foiled again.
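
If those authenticity tokens are the usual Rails-style CSRF tokens embedded in each page, they can often be scraped out of the HTML and replayed; a speculative sketch, assuming a hidden authenticity_token form field (the field name and the approach are assumptions):

    # Speculative: fetch one data-file page, pull out a hidden authenticity_token field,
    # and keep it for the follow-up download request. Field name is an assumption.
    page_url='http://buzzdata.com/data_files/blIVT0LGqr4y37yAyCM7w3'
    token=$(wget --load-cookies cookies.txt -q -O - "$page_url" \
            | grep -o 'name="authenticity_token" value="[^"]*"' \
            | head -n 1 | sed 's/.*value="//; s/"$//')
    echo "authenticity_token: $token"
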
23:38 | Asparagir | Hold up, they have a free API.
23:38 | Asparagir | http://buzzdata.com/faq/api/api-methods
23:40 | omf_ | that makes it easy
23:40 | winr4r | hooray!
23:41 | omf_ | you want to do it Asparagir ?
23:42 | Asparagir | I don't think my kung-fu is quite strong enough to code the whole thing.
23:44 | Asparagir | It looks like at a minimum you would need to know a list of usernames and/or "hive names" beforehand, in order to use the API to grab each of their datasets.
23:44 | Asparagir | See also https://github.com/buzzdata/api-docs
23:46 | Asparagir | Yeah, the minimum requirement for all the API endpoints is an already-known username. Such as GET `https://:HIVE_NAME.buzzdata.com/api/:USERNAME` where HIVE_NAME is optional (if you leave it out, it just gets the public stuff)
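
Given a list of usernames gathered beforehand, that per-user endpoint could be walked with a simple loop; a minimal sketch, assuming a hypothetical usernames.txt (one name per line, scraped from the listing pages) and leaving out any API-key handling the api-docs may require:

    # Walk the documented per-user endpoint for each known username.
    # The no-hive form of the URL is used, which per the message above returns the public data.
    while read -r username; do
        curl -s "https://buzzdata.com/api/$username" -o "api-$username.json"
        sleep 2   # stay polite
    done < usernames.txt
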