Time | Nickname | Message
03:03 | redhook | Hello Archive Team. The website GameTrailers was sold by Viacom to a different media conglomerate today, Defy. GameTrailers has spent 12 years making superb video game review videos, as well as original video game-related shows. Several staff have been laid off already, and I fear their years of content may be in jeopardy to make way for a "reboot" or somesuch. No official announcement to that effect yet, but I wanted to bring
03:03 | redhook | this site to your attention.
03:38 | garyrh | hmm, it looks like gametrailers.com's videos are downloadable, even w/o being logged in
03:40 | garyrh | ...and youtube-dl can get the videos as well, so yeah.
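For anyone scripting that, a minimal sketch of driving youtube-dl over a list of GameTrailers video page URLs from Python; the list file name is hypothetical and youtube-dl is assumed to be on the PATH:

    import subprocess

    # Feed each saved GameTrailers video page URL to youtube-dl.
    # "gametrailers_urls.txt" is a hypothetical hand-built list, one URL per line.
    with open("gametrailers_urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        # --no-overwrites skips files already grabbed on a previous run
        subprocess.run(["youtube-dl", "--no-overwrites", url], check=False)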
03:43 | redhook | Yeah, most have direct download links. Not 100% though.
03:54 | trs80 | how many videos are there?
04:05 | SN4T14 | trs80, tons, it's a big site
04:14 | redhook | It looks like there's 1,704 reviews, which I think are the most important. They've also produced some great retrospectives, covering the history of franchises like Zelda, Grand Theft Auto, etc. There's talk shows too, which are less important. If you just go to their videos page (http://www.gametrailers.com/videos-trailers) and do the math (20 videos/page * 3520 pages) it's about 70,000 gameplay, interview, trailer, etc.
04:14 | redhook | videos, which would be nice to have but not crucial.
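A quick sketch of that arithmetic and of enumerating the listing pages; the ?page= query parameter is an assumption, since the log doesn't show the site's actual pagination scheme:

    # 20 videos per listing page times 3520 pages comes to roughly 70,000 videos.
    VIDEOS_PER_PAGE = 20
    PAGES = 3520
    print(VIDEOS_PER_PAGE * PAGES)  # 70400

    # Hypothetical listing-page URLs to feed a crawler; "?page=N" is assumed.
    listing_urls = ["http://www.gametrailers.com/videos-trailers?page=%d" % n
                    for n in range(1, PAGES + 1)]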
04:17 | garyrh | for user videos, it looks to be ~263,180 videos
04:20 | garyrh | looks like youtube-dl can get videos w/o a download button via rtmpdump
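One way to tell which transport a given video would use is youtube-dl's -g/--get-url option, which prints the resolved media URL without downloading; an rtmp:// result is the case that goes through rtmpdump. A small sketch (the example page URL is hypothetical):

    import subprocess

    # Resolve the media URL for one video page; rtmp:// means rtmpdump would be used.
    page_url = "http://www.gametrailers.com/videos/example"  # hypothetical
    result = subprocess.run(["youtube-dl", "-g", page_url],
                            capture_output=True, text=True)
    print(result.stdout.strip().startswith("rtmp"))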
05:59 | Nemo_bis | the Ancestry.com agreement, by which Ancestry digitized records of genealogical interest to make available behind their subscription service (which is free to use at NARA facilities) and then transmitted the digital copies to NARA to put in the catalog after 5 or 10 years.
07:41 | * | db48x laughs at http://archiveteam.org/images/1/1b/Archiveteam_warrior_infrastructure.png
07:42 | db48x | chfoo gets a +1 for that
07:56 | Nemo_bis | Yes, it's pretty :)
08:01 | godane | so i got about 30mins of video about ritalin
08:01 | godane | from 2001
08:40 | godane | SketchCow: i really need some sort of back-end access to IA
08:41 | godane | nbc blocked the modules folder, i think because of drupal
08:42 | godane | if i can get access to everything here i may have a better chance to get all nbc news clips: http://msnbc.com/modules/
08:45 | godane | fun fact: robots.txt didn't even exist in 2007 for msnbc.com: https://web.archive.org/web/20070326005247/http://www.msnbc.com/robots.txt
08:46 | godane | 2011, not blocked: https://web.archive.org/web/20110625001406/http://msnbc.com/robots.txt
09:54 | schbirid | earbits news: we did not manage to get a file list off the half-open earbits s3 bucket with the music, but we will grab the assets (images etc.) off another bucket where we did get one.
09:55 | schbirid | if someone wants a real challenge, reverse engineer how their stream IDs are constructed. see http://archiveteam.org/index.php?title=Earbits
09:55 | schbirid | i think it is done in client-side javascript, so it should be doable in a way
10:21 | schbirid | if you want to help with downloading images etc, come to #earbite
12:00 | danneh_ | So just to letchas know, I'm grabbing a bunch more from here: http://h18000.www1.hp.com/cpq-products/quickspecs/productbulletin.html
12:00 | danneh_ | looking into how their URLs and resources are addressed, fairly easy to get lists of every single product ID on there
12:01 | danneh_ | so I'll just go through and make some lists and set stuff to download, got HTML files, images, PDF files, all that sorta stuff should be alright to save
12:01 | danneh_ | will letchas know
12:46 | danneh_ | Grabbing the item JSON files now, after that I should be able to parse through those, extract all the PDF/jpg/html/etc links from that and set those to all download
12:47 | danneh_ | About 14k items to go through, so it might take a little bit to grab, but it should be alright
12:48 | danneh_ | Easier than trying to do it manually, just got a script generating all the links to grab at each step
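A rough sketch of the parse-and-extract step described above; the directory layout, output file name, and the assumption that asset links appear as plain URLs inside the item JSON are all hypothetical, since the schema isn't shown in the log:

    import glob
    import re

    # Pull every PDF/JPG/HTML link out of the downloaded item JSON files and
    # write a de-duplicated list for the next download pass.
    # "items/*.json" and "asset_urls.txt" are hypothetical names.
    link_re = re.compile(r'https?://\S+?\.(?:pdf|jpe?g|html?)', re.IGNORECASE)
    links = set()

    for path in glob.glob("items/*.json"):
        with open(path, errors="replace") as f:
            links.update(link_re.findall(f.read()))

    with open("asset_urls.txt", "w") as out:
        out.write("\n".join(sorted(links)))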
13:28 | dashcloud | my angelfire.com grab is continuing along slowly
13:35 | Nemo_bis | aww memories
13:47 | dashcloud | tripod's still around if you want to try grabbing that
15:18 | joepie91 | dashcloud: for relative values of "around"
15:23 | Nemo_bis | "around the graveyard"
15:24 | joepie91 | the Dutch Tripod is obliterated as far as I can tell
16:45 | dashcloud | if I provided a list of URLs to wget in a file, can I append that file with new URLs and have wget pick them up, or does wget just read the file once at startup?
17:12 | schbirid | i am very sure it reads it just once :(
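Since wget only reads its URL list once at startup, one workaround (a sketch, not something anyone in the log used; the file name is hypothetical) is to watch the list yourself and hand each newly appended URL to a separate wget call:

    import subprocess
    import time

    # Poll a growing URL list and download each new line as it appears.
    # "urls.txt" is a hypothetical file another process keeps appending to.
    seen = 0
    while True:
        with open("urls.txt") as f:
            urls = [line.strip() for line in f if line.strip()]
        for url in urls[seen:]:
            subprocess.run(["wget", "--no-clobber", url], check=False)
        seen = len(urls)
        time.sleep(10)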
17:15 | SN4T14 | He's gone. >.>
17:29 | schbirid | we now have about 50000 mp3s to download from earbits, join #earbite if you want to help
17:48 | schbirid | can you make aria2c download files with their server datetime like wget does?
17:49 | schbirid | --remote-time=true
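For reference, a sketch of that invocation driven from Python; the input-file name is hypothetical:

    import subprocess

    # --remote-time=true makes aria2c set each file's mtime from the server's
    # Last-Modified header, matching what wget does by default.
    subprocess.run(["aria2c", "--remote-time=true", "--input-file=mp3_urls.txt"],
                   check=False)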
21:17 | schbirid | http://www.ikeahackers.net/2014/06/big-changes-coming-to-ikeahackers.html
21:36 | danneh_ | Alright, and downloading about 46k pdf/json/jpg files, should be done in about 12 hours hopefully
21:37 | danneh_ | And that should be pretty well 100% of the stuff on that HP website, from what I've seen
21:37 | danneh_ | As much as can be accessed through that interface, at least
22:58 | db48x | only one more justintv item left