[04:29] anyone working on justin.tv?
[04:30] SketchCow said on #archiveteam that it's deleting all archive footage in a week
[04:48] ohhdemgir: 700GB is pretty large for an item
[04:48] I don't think that will ever successfully derive
[04:49] not positive, but I think few if any of the workers have that much free space at once
[04:49] (plus room for all the output files)
[04:50] SketchCow: i found something useful
[04:50] you can grab the id of a video and put it in a URL like this: http://api.justin.tv/api/clip/show/502307186.xml
[04:50] then get the flv file
[05:00] anyways i'm grabbing the twit videos
[05:20] so they IP ban me on the API end
[06:20] so i figured out my other problem
[06:21] i was seeing like 41 hours of twit on the web pages
[06:21] but only got about 30 mins
[06:21] turns out that clip/show is just one of the files
[06:21] the best way is to do it this way: http://api.justin.tv/api/broadcast/by_archive/502307186.xml
[06:21] that way you get the full broadcast
[06:37] youtube-dl will do it too
[06:37] and get all the pieces
[06:37] it can even save the metadata, too
[06:38] youtube-dl http://justin.tv/marugawa/b/533453715 --restrict-filenames --write-info-json --write-annotations --write-thumbnail
[07:45] can i just give one of you guys a list of urls?
[07:45] i'm grabbing it based on the api
[08:02] it's now close to 15k videos just for twit alone
[08:03] i'm going to pastebin the video list
[08:03] i don't think i have the storage to grab it anyways
[08:04] with most videos of twit being 500MB x 15k
[08:05] 1G x 7500
[08:05] shit
[08:05] make that 17k
[08:10] part 1: http://pastebin.com/SzBzn6XE
[08:12] part 2: http://pastebin.com/rSEBJHjr
[08:13] part 3: http://pastebin.com/Fbnk9B8f
[08:14] part 4: http://pastebin.com/7vhSqCVV
[08:15] underscor: you can use the backend of IA right?
[09:01] godane: grab them one by one or all at once, how strict is the banning policy?
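The API flow described in the log (take a broadcast id, hit `broadcast/by_archive` to list every part, then fetch each flv) can be sketched in shell. This is a hedged sketch only: the justin.tv API is long gone, and the `<video_file_url>` element name is an assumption about the XML shape, not something confirmed by the log.

```shell
#!/bin/sh
# Sketch of the by_archive flow described above. Assumptions: the API
# returned XML with one <video_file_url> element per broadcast part
# (element name is a guess); the service no longer exists, so the actual
# network fetch is shown only in a comment.

api_url() {
  # build the by_archive URL for a numeric broadcast id
  printf 'http://api.justin.tv/api/broadcast/by_archive/%s.xml\n' "$1"
}

flv_urls() {
  # pull flv URLs out of the XML on stdin; naive grep/sed, no real parser
  grep -o '<video_file_url>[^<]*</video_file_url>' | sed 's/<[^>]*>//g'
}

# hypothetical usage (the curl/wget step cannot run today):
#   curl -s "$(api_url 502307186)" | flv_urls | xargs -n1 wget
api_url 502307186
```

The split into two functions keeps the dead network call isolated: `api_url` and `flv_urls` are still testable offline against sample XML.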
[09:05] i don't know if there's strict banning with videos
[09:05] but there is when hitting the api
[09:06] i'm grabbing geek beat live episodes
[09:06] part1, lots missing
[09:07] did you get stuff in the 2010-4 folders
[09:08] i only ask cause there was a lot there
[09:13] right, using more than 2 wgets will break it, it seems
[09:14] getting the videos is easy like this, but we probably need some metadata to go with that
[09:22] i think people are grabbing a ton of stuff from justin.tv
[09:22] keep getting error page
[09:22] *pages
[09:43] justin downloads are between 60K/s and 18MB/s so yeah, no idea! :p
[17:07] any help needed with justin?
[17:09] antomatic: talk to midas / SketchCow
[18:12] and to add, godane
[21:09] We need help.
[21:20] SketchCow: here is an example of the api to grab a justin.tv archive broadcast: http://api.justin.tv/api/broadcast/by_archive/326563367.xml
[21:21] the number is the video id
[21:22] midas has my 4-part twit channel list
[21:22] since that has close to 4 years of the twit.tv 24/7 live stream
[21:23] based on what i can tell it may be close to 7.5TB to 10TB alone
[21:23] midas: re-check 404 errors
[21:24] the trove has given me fake 404 errors so they may be fake
[22:31] Only I care about this, but....
[22:32] I wrote this word cloud generation into subject keywords thing on IA.
[22:32] It worked, but occasionally it hit some bug in the system
[22:32] Fixed it... now adding those one or two words every 12 items that were missing
[22:53] you guys will be getting a GeekBeat.TV Live collection at some point
[22:53] based on these justin.tv rips
[23:48] SketchCow: for the record, I care... that thing is super-awesome.
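The ban reports in the log (IP bans when hitting the API fast, more than 2 parallel wgets breaking) suggest a single-connection, throttled loop over an id list like the pastebin dumps above. A minimal sketch, assuming one numeric broadcast id per line on stdin; the 5-second delay is a guess, not a documented limit, and the real wget call is commented out since the service no longer exists.

```shell
#!/bin/sh
# Throttled, single-connection fetch loop: one broadcast id per line on
# stdin. DELAY defaults to 5 seconds; both the delay and the one-at-a-time
# approach are assumptions based on the ban reports in the log.

fetch_ids() {
  while IFS= read -r id; do
    url="http://api.justin.tv/api/broadcast/by_archive/${id}.xml"
    echo "$url"
    # wget -q "$url" -O "${id}.xml"   # real fetch; justin.tv is gone now
    sleep "${DELAY:-5}"
  done
}

# hypothetical usage (twit_ids.txt is an assumed file name):
#   fetch_ids < twit_ids.txt
```

Printing the URL before each fetch makes interrupted runs resumable by diffing the log against the id list.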