Time | Nickname | Message
03:03 | redhook | Hello Archive Team. The website GameTrailers was sold by Viacom to a different media conglomerate today, Defy. GameTrailers has spent 12 years making superb video game review videos, as well as original video game-related shows. Several staff have been laid off already, and I fear their years of content may be in jeopardy to make way for a "reboot" or somesuch. No official announcement to that effect yet, but I wanted to bring
03:03 | redhook | this site to your attention.
03:38 | garyrh | hmm, it looks like gametrailers.com's videos are downloadable, even w/o being logged in
03:40 | garyrh | ...and youtube-dl can get the videos as well, so yeah.
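For anyone scripting that, a minimal sketch of driving youtube-dl over a list of GameTrailers video page URLs from Python; the list file name is hypothetical and youtube-dl is assumed to be on the PATH:

    import subprocess

    # Feed each saved GameTrailers video page URL to youtube-dl.
    # "gametrailers_urls.txt" is a hypothetical hand-built list, one URL per line.
    with open("gametrailers_urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        # --no-overwrites skips files already grabbed on a previous run
        subprocess.run(["youtube-dl", "--no-overwrites", url], check=False)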
03:43 | redhook | Yeah, most have direct download links. Not 100% though.
03:54 | trs80 | how many videos are there?
04:05 | SN4T14 | trs80, tons, it's a big site
04:14 | redhook | It looks like there's 1,704 reviews, which I think are the most important. They've also produced some great retrospectives, covering the history of franchises like Zelda, Grand Theft Auto, etc. There's talk shows too, which are less important. If you just go to their videos page (http://www.gametrailers.com/videos-trailers) and do the math (20 videos/page * 3520 pages) it's about 70,000 gameplay, interview, trailer, etc.
04:14 | redhook | videos, which would be nice to have but not crucial.
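A quick sketch of that arithmetic and of enumerating the listing pages; the ?page= query parameter is an assumption, since the log doesn't show the site's actual pagination scheme:

    # 20 videos per listing page times 3520 pages comes to roughly 70,000 videos.
    VIDEOS_PER_PAGE = 20
    PAGES = 3520
    print(VIDEOS_PER_PAGE * PAGES)  # 70400

    # Hypothetical listing-page URLs to feed a crawler; "?page=N" is assumed.
    listing_urls = ["http://www.gametrailers.com/videos-trailers?page=%d" % n
                    for n in range(1, PAGES + 1)]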
04:17 | garyrh | for user videos, it looks to be ~263,180 videos
04:20 | garyrh | looks like youtube-dl can get videos w/o a download button via rtmpdump
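One way to tell which transport a given video would use is youtube-dl's -g/--get-url option, which prints the resolved media URL without downloading; an rtmp:// result is the case that goes through rtmpdump. A small sketch (the example page URL is hypothetical):

    import subprocess

    # Resolve the media URL for one video page; rtmp:// means rtmpdump would be used.
    page_url = "http://www.gametrailers.com/videos/example"  # hypothetical
    result = subprocess.run(["youtube-dl", "-g", page_url],
                            capture_output=True, text=True)
    print(result.stdout.strip().startswith("rtmp"))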
05:59 | Nemo_bis | the Ancestry.com agreement, by which Ancestry digitized records of genealogical interest to make available behind their subscription service (which is free to use at NARA facilities) and then transmitted the digital copies to NARA to put in the catalog after 5 or 10 years.
07:41 | * | db48x laughs at http://archiveteam.org/images/1/1b/Archiveteam_warrior_infrastructure.png
07:42 | db48x | chfoo gets a +1 for that
07:56 | Nemo_bis | Yes, it's pretty :)
08:01 | godane | so i got about 30mins of video about ritalin
08:01 | godane | from 2001
08:40 | godane | SketchCow: i really need some sort of back-end access to IA
08:41 | godane | nbc blocked the modules folder, i think because of drupal
08:42 | godane | if i can get access to everything here i may have a better chance to get all nbc news clips: http://msnbc.com/modules/
08:45 | godane | fun fact: robots.txt didn't even exist in 2007 for msnbc.com: https://web.archive.org/web/20070326005247/http://www.msnbc.com/robots.txt
08:46 | godane | 2011, not blocked: https://web.archive.org/web/20110625001406/http://msnbc.com/robots.txt
09:54 | schbirid | earbits news: we did not manage to get a file list off the half-open earbits s3 bucket with the music, but we will grab the assets (images etc.) off another bucket where we did get one.
09:55 | schbirid | if someone wants a real challenge, reverse engineer how their stream IDs are constructed. see http://archiveteam.org/index.php?title=Earbits
09:55 | schbirid | i think it is done in client-side javascript, so it should be doable in a way
10:21 | schbirid | if you want to help with downloading images etc, come to #earbite
12:00 | danneh_ | So just to letchas know, I'm grabbing a bunch more from here: http://h18000.www1.hp.com/cpq-products/quickspecs/productbulletin.html
12:00 | danneh_ | looking into how their URLs and resources are addressed, fairly easy to get lists of every single product ID on there
12:01 | danneh_ | so I'll just go through and make some lists and set stuff to download, got HTML files, images, PDF files, all that sorta stuff should be alright to save
12:01 | danneh_ | will letchas know
12:46 | danneh_ | Grabbing the item JSON files now, after that I should be able to parse through those, extract all the PDF/jpg/html/etc links from that and set those to all download
12:47 | danneh_ | About 14k items to go through, so it might take a little bit to grab, but it should be alright
12:48 | danneh_ | Easier than trying to do it manually, just got a script generating all the links to grab at each step
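A rough sketch of the parse-and-extract step described above; the directory layout, output file name, and the assumption that asset links appear as plain URLs inside the item JSON are all hypothetical, since the schema isn't shown in the log:

    import glob
    import re

    # Pull every PDF/JPG/HTML link out of the downloaded item JSON files and
    # write a de-duplicated list for the next download pass.
    # "items/*.json" and "asset_urls.txt" are hypothetical names.
    link_re = re.compile(r'https?://\S+?\.(?:pdf|jpe?g|html?)', re.IGNORECASE)
    links = set()

    for path in glob.glob("items/*.json"):
        with open(path, errors="replace") as f:
            links.update(link_re.findall(f.read()))

    with open("asset_urls.txt", "w") as out:
        out.write("\n".join(sorted(links)))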
13:28 | dashcloud | my angelfire.com grab is continuing along slowly
13:35 | Nemo_bis | aww memories
13:47 | dashcloud | tripod's still around if you want to try grabbing that
15:18 | joepie91 | dashcloud: for relative values of "around"
15:23 | Nemo_bis | "around the graveyard"
15:24 | joepie91 | the Dutch Tripod is obliterated as far as I can tell
16:45 | dashcloud | if I provided a list of URLs to wget in a file, can I append that file with new URLs and have wget pick them up, or does wget just read the file once at startup?
17:12 | schbirid | i am very sure it reads it just once :(
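Since wget only reads its URL list once at startup, one workaround (a sketch, not something anyone in the log used; the file name is hypothetical) is to watch the list yourself and hand each newly appended URL to a separate wget call:

    import subprocess
    import time

    # Poll a growing URL list and download each new line as it appears.
    # "urls.txt" is a hypothetical file another process keeps appending to.
    seen = 0
    while True:
        with open("urls.txt") as f:
            urls = [line.strip() for line in f if line.strip()]
        for url in urls[seen:]:
            subprocess.run(["wget", "--no-clobber", url], check=False)
        seen = len(urls)
        time.sleep(10)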
17:15 | SN4T14 | He's gone. >.>
17:29 | schbirid | we now have about 50000 mp3s to download from earbits, join #earbite if you want to help
17:48 | schbirid | can you make aria2c download files with their server datetime like wget does?
17:49 | schbirid | --remote-time=true
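For reference, a sketch of that invocation driven from Python; the input-file name is hypothetical:

    import subprocess

    # --remote-time=true makes aria2c set each file's mtime from the server's
    # Last-Modified header, matching what wget does by default.
    subprocess.run(["aria2c", "--remote-time=true", "--input-file=mp3_urls.txt"],
                   check=False)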
21:17 | schbirid | http://www.ikeahackers.net/2014/06/big-changes-coming-to-ikeahackers.html
21:36 | danneh_ | Alright, and downloading about 46k pdf/json/jpg files, should be done in about 12 hours hopefully
21:37 | danneh_ | And that should be pretty well 100% of the stuff on that HP website, from what I've seen
21:37 | danneh_ | As much as can be accessed through that interface, at least
22:58 | db48x | only one more justintv item left