[00:30] maybe the internet archive will want to get this: http://www.ebay.com/itm/IC-Technology-Fabrication-Dr-Carlton-Osburn-North-Carolina-State-University-VHS-/390573993963
[00:43] Not for $1k
[00:53] probably NCSU has it archived
[01:02] even i thought the price was way too much
[01:02] maybe $50 or $100 but not $1000
[01:20] Why is the Atari Diagnostic Test Cart so popular? Is there some kind of meme or link going around that references it?
[02:13] Mobygames never ceases to amaze me with the depth of content on it
[02:42] dashcloud: cause it's the spotlight item for the collection
[02:44] ah
[03:14] Right
[03:14] Also, small bug in some cases.
[10:14] so i'm close to being up to date with wall street journal tech briefs
[10:15] up to date in that i have everything from 2006 to 2013 uploaded
[10:52] so some good news with the wall street videos
[10:52] i found their api
[11:17] i need to filter this by dates: http://live.wsj.com/api-video/find_all_videos.asp?count=5&fields=id,name,description,duration,thumbnailURL,videoURL,formattedCreationDate,linkURL
[11:17] the query is not working for me for some reason
[14:31] another stupidly rare find http://mamedev.emulab.it/haze/2014/05/12/other-news-part-4/
[16:56] i still need help with the wsj api
[16:57] i need to be able to grab metadata by dates if possible
[16:57] here is the example url: http://live.wsj.com/api-video/find_all_videos.asp?count=5&fields=id,name,description,duration,thumbnailURL,videoURL,formattedCreationDate,linkURL
[17:00] Got a total of 14,000 so far. What date ranges?
[17:00] how did you get 14,000
[17:00] Iterating by id
[17:01] I will start parsing these down with regex and give you something usable.
[17:02] post the urls you're using?
[17:02] i keep getting the 5 most recent videos
[17:03] http://live.wsj.com/api-video/find_all_videos.asp?count=15000&fields=id,name,description,duration,thumbnailURL,videoURL,formattedCreationDate,linkURL So far I haven't run into a limit
[17:04] oh, you just run the count up
[17:04] Ayep
[17:05] i couldn't get that to work on my end
[17:05] i get a proxy error
[17:07] https://gist.githubusercontent.com/rocode/8a45dcc192dfaecab930/raw/gistfile1.txt
[17:08] Small sample, doing it by 2000 every 15 seconds to avoid the lockout
[17:09] I haven't found a way to limit the data yet.
[17:11] Oh, this isn't a documented API. Neat.
[17:20] There is a massive slowdown after 15k. Iterating by 500 now. It will error out unless it already has a partial amount of the data in cache to hand you. So you have to start out small and work your way up.
[17:23] Current data set so far: https://gist.githubusercontent.com/rocode/9ecb6f53be6011d85624/raw/gistfile1.txt
[17:24] i'm doing it now
[17:25] it's doing it in sets of 100
[17:25] What are you up to?
[17:26] i'm mirroring the wsj videos so we can have a collection of them
[17:26] No, I mean, how many so far?
[17:26] If you are ahead of me, I don't want to duplicate effort
[17:26] it's at the 1701 count
[17:28] 2401 now
[17:28] I am starting to run into proxy errors. Not sure if we are hitting this too hard.
[17:29] https://gist.githubusercontent.com/rocode/e3780ecbf7d09862ddd8/raw/gistfile1.txt
[17:32] Just broke into 2012 videos.
[17:38] Well, uh, I think I just got IP banned. wget is now pulling in 403 errors.
[17:40] How are you going, godane?
[17:41] i'm at the 5000 count
[17:42] it skipped 5501 but i got 6001
[17:46] Here is 1-6500, no skips. https://gist.github.com/rocode/a3f6ec3af3209c27dd30/raw/gistfile1.txt
[17:46] metadata motherfuckers https://en.wikipedia.org/wiki/File:Autographic_Kodak_writing.jpg
[17:47] Author: Kodak
[17:47] ahaha
[17:54] https://gist.github.com/rocode/a3f6ec3af3209c27dd30/raw/gistfile1.txt last set before I got 403'd. Keep trucking, godane.
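For anyone reading this later, a minimal Python sketch of the approach described above: walking the `count` parameter up in steps with a pause between requests. The endpoint and field list are copied from the chat; the step size, delay, and backoff are assumptions based on the lockouts mentioned there, and since the API is undocumented the response format isn't known, so this just saves each raw body for later parsing.

```python
import time
import urllib.request

# Undocumented endpoint and field list taken from the chat above. The
# "2000 every 15 seconds" pacing is what was reported to avoid lockouts;
# the real limits are unknown.
API = ("http://live.wsj.com/api-video/find_all_videos.asp"
       "?count={count}&fields=id,name,description,duration,"
       "thumbnailURL,videoURL,formattedCreationDate,linkURL")

def fetch(count):
    """Request the most recent `count` videos and return the raw response body."""
    with urllib.request.urlopen(API.format(count=count), timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")

def fetch_all(target=15000, step=2000, delay=15):
    """Walk the count up gradually so the backend has the data cached."""
    count = step
    while count <= target:
        try:
            body = fetch(count)
        except Exception as exc:      # proxy errors and 403s surface here
            print("count", count, "failed:", exc)
            time.sleep(delay * 4)     # back off harder, retry the same count
            continue
        print("fetched count %d (%d bytes)" % (count, len(body)))
        with open("wsj_videos_%05d.txt" % count, "w") as fh:
            fh.write(body)            # keep every snapshot; filter by date later
        count += step
        time.sleep(delay)

if __name__ == "__main__":
    fetch_all()
```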
[17:56] when uploading a 3TB+ ftp site, what is the best way to pack it?
[18:01] tar.gz I presume...
[18:02] the archive can auto-extract/view some types of archive, but I'm unsure which to be quite honest.
[18:02] so a single 3TB tar is ok?
[18:02] no
[18:02] didn't think so..
[18:03] i think you need to split it to have any hope of it uploading
[18:03] 500Gb is what we go with I believe.... but you may just want to contact SketchCow about getting him to upload it.
[18:03] there is a script for bucketing
[18:03] 50!
[18:03] Oh yeah, 50Gb tars, haha
[18:03] * Smiley so dumb
[18:05] bucket script: http://pastebin.com/jww5mVZx (will expire in 1 hour)
[18:05] looks like i wrote it so it will probably dd quantum noise into your boot sector
[18:06] i mirrored that bucket script
[18:06] lol
[18:06] please edit the fileplanet mentions out of it then
[18:07] i just gave the link to archivebot
[18:07] :(
[18:07] muh privacee
[18:08] it looks like a script that makes tar files out of big things
[18:08] schbirid: INTERNETS MOTHERFUCKER. DO YOU KNOW HOW THEY WORK?
[18:09] aka don't share what you don't want shared :S
[18:09] Srry bud :D
[18:09] Smiley: I WOULD KICK YOU IF I COULD BUT ITS LIKE OPPOSITE DAY IN HERE
[18:09] ah, no problem :)
[18:09] ohhdemgir: archive.org can browse tar, zip, or iso
[18:09] it can't browse tar.gz / rar / anything else
[18:09] Doh
[18:09] i was mostly doing it so we can point to the wayback machine copy of it :P
[18:10] when it's needed
[18:10] * Smiley tells schbirid to update it
[18:10] * Smiley then tells godane to add the updated version to the bot.
[18:10] and if individual items (including all files on the item) get up into the hundreds of GB it doesn't play well with the archive.org infrastructure
[18:11] so ~50GB tars or zips is a good rule of thumb
[18:11] i should just put it on github
[18:11] some ftps you can split up nicely by subdirectories
[18:13] here's an example sketchcow did recently https://archive.org/search.php?query=collection%3Aftpsites%20ftp.icm.edu.pl&sort=-publicdate
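Since the pastebin with the bucket script expires, here is a rough Python sketch of the same idea under the rules discussed above: pack a mirrored FTP tree into roughly 50 GB buckets and write each bucket as an uncompressed tar so archive.org can still browse it. Only the ~50 GB figure and the plain-tar requirement come from the chat; the greedy packing by top-level entry and the naming scheme are assumptions, not the original script.

```python
import tarfile
from pathlib import Path

# ~50 GB per item, per the rule of thumb above. Oversized single entries
# still get their own (larger) bucket rather than being split mid-file.
BUCKET_BYTES = 50 * 1024**3

def entry_size(path: Path) -> int:
    """Total size of a file or directory tree, in bytes."""
    if path.is_file():
        return path.stat().st_size
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

def make_buckets(root: Path):
    """Greedily group top-level entries into buckets of at most BUCKET_BYTES."""
    buckets, current, used = [], [], 0
    for entry in sorted(root.iterdir()):
        size = entry_size(entry)
        if current and used + size > BUCKET_BYTES:
            buckets.append(current)
            current, used = [], 0
        current.append(entry)
        used += size
    if current:
        buckets.append(current)
    return buckets

def write_tars(root: Path, prefix: str):
    for i, bucket in enumerate(make_buckets(root)):
        name = "%s.%03d.tar" % (prefix, i)
        with tarfile.open(name, "w") as tar:   # plain tar, no gzip, so IA can browse it
            for entry in bucket:
                tar.add(str(entry), arcname=entry.relative_to(root).as_posix())
        print("wrote", name)

if __name__ == "__main__":
    # "ftp-mirror" is a placeholder path for the locally mirrored site.
    write_tars(Path("ftp-mirror"), "ftp-mirror")
```

Splitting by subdirectory, as mentioned above, is often cleaner when the site's layout allows it; this greedy packing is just the fallback for trees that don't divide evenly.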
[18:16] some of them he just got lazy and threw in a 600GB file though heh
[18:17] it happens
[18:17] I think I'll be lazy up to around 250-300GB, I don't know, getting a few sites at once right now
[18:18] basically the issue is that archive.org has a whole bunch of numbered servers with their own amount of free space rather than one big flat filesystem, and each item is assigned to a particular server
[18:18] so the bigger the item gets, the more likely that server will run out of space
[18:19] makes sense
[18:20] also nemo_bis uploaded a file that was like 2TB and overflowed their database column, but I think that's fixed now
[18:20] hah
[18:20] pure class
[18:24] That was great.
[18:24] We have, in all of archive.org, something like 12 1TB+ objects.
[18:24] That's one.
[18:24] And it made a column explode.
[18:24] Now fixed.
[18:25] My boss likes this, she likes me causing things to explode.
[18:25] I was pretty happy with IA's response
[18:25] somewhere like, e.g., a YC company would have banned nemo
[18:27] yep
[18:27] SketchCow: i wish my boss would do the same
[18:27] i must cause things to explode sometimes
[18:27] you killed archivebot a few times
[18:32] everyone knows about this list right? it's super old now, but does anyone know who compiled it, why, and how many times have you scanned the 'do not scan' ranges? I remember doing this like 8-10 years ago!! - http://pastebin.com/raw.php?i=vcMXurEX
[18:33] talk of it dating back to 2003 - https://www.webhostingtalk.com/showthread.php?t=144678
[18:33] pfff, scan all the ranges
[18:33] exactly
[18:34] many of those don't even route anyway
[18:38] 207.60.13.64 - 207.60.13.71 SierraCom o_O :D
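For anyone who would rather honour that list than "scan all the ranges", a minimal Python sketch: parse "start - end name" lines (the layout of the range quoted at [18:38]) into networks and check scan targets against them. The local file name, and the assumption that the whole pastebin uses that layout, are mine, not taken from the list itself.

```python
import ipaddress

def load_ranges(path):
    """Read 'start - end name' lines into a list of ip_network objects."""
    networks = []
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) < 3 or parts[1] != "-":
                continue                          # skip blanks and odd lines
            try:
                start = ipaddress.ip_address(parts[0])
                end = ipaddress.ip_address(parts[2])
            except ValueError:
                continue
            networks.extend(ipaddress.summarize_address_range(start, end))
    return networks

def is_excluded(ip, networks):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

if __name__ == "__main__":
    nets = load_ranges("do_not_scan.txt")         # hypothetical local copy of the pastebin
    for target in ["207.60.13.65", "8.8.8.8"]:
        print(target, "excluded" if is_excluded(target, nets) else "ok")
```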