#archiveteam-bs 2014-05-12,Mon


Time Nickname Message
00:30 🔗 godane maybe the internet archive will want to get this: http://www.ebay.com/itm/IC-Technology-Fabrication-Dr-Carlton-Osburn-North-Carolina-State-University-VHS-/390573993963
00:43 🔗 SketchCow Not for $1k
00:53 🔗 Ravenloft probably the NCSU have it archived
01:02 🔗 godane even i thought the price is way too much
01:02 🔗 godane maybe $50 or $100 but not a $1000
01:20 🔗 dashcloud Why is the Atari Diagnostic Test Cart so popular? Is there some kind of meme or link going around that references it?
02:13 🔗 dashcloud Mobygames never ceases to amaze me with the depth of content on it
02:42 🔗 DFJustin dashcloud: cause it's the spotlight item for the collection
02:44 🔗 dashcloud ah
03:14 🔗 SketchCow Right
03:14 🔗 SketchCow Also, small bug in some cases.
10:14 🔗 godane so i'm close to being up to date with wall street journal tech briefs
10:15 🔗 godane up to date in that i have every thing from 2006 to 2013 uploaded
10:52 🔗 godane so some good news with the wall street videos
10:52 🔗 godane i found their api
11:17 🔗 godane i need to filter this by dates: http://live.wsj.com/api-video/find_all_videos.asp?count=5&fields=id,name,description,duration,thumbnailURL,videoURL,formattedCreationDate,linkURL
11:17 🔗 godane query is not working for me for some reason
14:31 🔗 DFJustin another stupidly rare find http://mamedev.emulab.it/haze/2014/05/12/other-news-part-4/
16:56 🔗 godane i still need help with wsj api
16:57 🔗 godane i need to be able to grab metadata by dates if possible
16:57 🔗 godane here is the example url: http://live.wsj.com/api-video/find_all_videos.asp?count=5&fields=id,name,description,duration,thumbnailURL,videoURL,formattedCreationDate,linkURL
17:00 🔗 rocode Got a total of 14,000 so far. What date ranges?
17:00 🔗 godane how did you get 14,000
17:00 🔗 rocode Iterating by id
17:01 🔗 rocode I will start parsing these down with regex and give you something usable.
17:02 🔗 godane post urls you're using?
17:02 🔗 godane i keep getting the 5 recent videos
17:03 🔗 rocode http://live.wsj.com/api-video/find_all_videos.asp?count=15000&fields=id,name,description,duration,thumbnailURL,videoURL,formattedCreationDate,linkURL So far I haven't run into a limit
17:04 🔗 godane oh you just run the count up
17:04 🔗 rocode Ayep
17:05 🔗 godane i couldn't get that to work on my end
17:05 🔗 godane i get a proxy error
17:07 🔗 rocode https://gist.githubusercontent.com/rocode/8a45dcc192dfaecab930/raw/gistfile1.txt
17:08 🔗 rocode Small sample, doing it by 2000 every 15 seconds to avoid the lockout
17:09 🔗 rocode I haven't found a way to limit the data yet.
17:11 🔗 rocode Oh, this isn't a documented API. Neat.
17:20 🔗 rocode There is a massive slowdown after 15k. Iterating by 500 now. It will error out unless it already has a partial amount of the data in cache to hand you. So you have to start out small and work your way up.
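A rough sketch of the "start small and work your way up" approach rocode describes, in Python. The endpoint and field list are copied from the URLs posted in the channel; the helper names `build_url` and `count_schedule` are made up for illustration, and since the API is undocumented none of this should be read as a stable contract. No requests are sent here; the code only builds the URLs.

```python
from urllib.parse import urlencode

# Endpoint and field list as quoted in the channel. The API is undocumented,
# so both are observations from the log, not a guaranteed interface.
BASE_URL = "http://live.wsj.com/api-video/find_all_videos.asp"
FIELDS = "id,name,description,duration,thumbnailURL,videoURL,formattedCreationDate,linkURL"


def build_url(count):
    """Return the request URL for the given count value (hypothetical helper)."""
    # safe="," keeps the commas in the field list unescaped, as in the log.
    return BASE_URL + "?" + urlencode({"count": count, "fields": FIELDS}, safe=",")


def count_schedule(start, step, limit):
    """Yield growing count values: start, start + step, ... up to limit."""
    count = start
    while count <= limit:
        yield count
        count += step


if __name__ == "__main__":
    # Example: step up by 500, as rocode did after the slowdown.
    for count in count_schedule(500, 500, 2000):
        print(build_url(count))
```

A real run would fetch each URL in turn and pause between requests (rocode mentions 15 seconds) to stay clear of the proxy errors and 403s seen later in the log.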
17:23 🔗 rocode Current data set so far: https://gist.githubusercontent.com/rocode/9ecb6f53be6011d85624/raw/gistfile1.txt
17:24 🔗 godane i'm doing it now
17:25 🔗 godane its doing it in sets of 100
17:25 🔗 rocode What are you up to?
17:26 🔗 godane i'm mirroring the wsj videos so we can have a collection of them
17:26 🔗 rocode No, I mean, how many so far?
17:26 🔗 rocode If you are ahead of me, I don't want to duplicate effort
17:26 🔗 godane its at the 1701 count
17:28 🔗 godane 2401 now
17:28 🔗 rocode I am starting to run into proxy errors. Not sure if we are hitting this too hard.
17:29 🔗 rocode https://gist.githubusercontent.com/rocode/e3780ecbf7d09862ddd8/raw/gistfile1.txt
17:32 🔗 rocode Just broke into 2012 videos.
17:38 🔗 rocode Well, uh, I think I just got IP banned. wget is now pulling in 403 errors.
17:40 🔗 rocode How are you going, godane?
17:41 🔗 godane i'm at 5000 count
17:42 🔗 godane it skipped 5501 but i got 6001
17:46 🔗 rocode Here is 1-6500, no skips. https://gist.github.com/rocode/a3f6ec3af3209c27dd30/raw/gistfile1.txt
17:46 🔗 exmic metadata motherfuckers https://en.wikipedia.org/wiki/File:Autographic_Kodak_writing.jpg
17:47 🔗 rocode Author: Kodak
17:47 🔗 rocode ahaha
17:54 🔗 rocode https://gist.github.com/rocode/a3f6ec3af3209c27dd30/raw/gistfile1.txt last set before I got 403'd. Keep trucking, godane.
17:56 🔗 ohhdemgir when uploading a 3TB+ ftp site what is the best way to pack it?
18:01 🔗 Smiley tar.gz I presume...
18:02 🔗 Smiley the archive can auto-extract/view some types of archive, but I'm unsure which to be quite honest.
18:02 🔗 ohhdemgir so a single 3TB tar is ok?
18:02 🔗 schbirid no
18:02 🔗 ohhdemgir didn't think so..
18:03 🔗 Smiley i think you need to split it to get a hope of it uploading
18:03 🔗 Smiley 500Gb is what we go with I believe.... but you may just want to contact SketchCow about getting him to upload it.
18:03 🔗 schbirid there is a script for bucketing
18:03 🔗 schbirid 50!
18:03 🔗 Smiley Oh yeah, 50Gb tars, haha
18:03 🔗 * Smiley so dumb
18:05 🔗 schbirid bucket script: http://pastebin.com/jww5mVZx (will expire in 1 hour)
18:05 🔗 schbirid looks like i wrote it so it will probably dd quantum noise into your boot sector
18:06 🔗 godane i mirror that bucket script
18:06 🔗 Smiley lol
18:06 🔗 schbirid please edit the fileplanet mentions out of it then
18:07 🔗 godane i just gave the link to archivebot
18:07 🔗 schbirid :(
18:07 🔗 schbirid muh privacee
18:08 🔗 godane it looks like a script to make tar files out of big things
18:08 🔗 Smiley schbirid: INTERNETS MOTHERFUCKER. DO YOU KNOW HOW THEY WORK?
18:09 🔗 Smiley aka don't share what you don't want shared :S
18:09 🔗 Smiley Srry bud :D
18:09 🔗 schbirid Smiley: I WOULD KICK YOU IF I COULD BUT ITS LIKE OPPOSITE DAY IN HERE
18:09 🔗 schbirid ah, no problem :)
18:09 🔗 DFJustin ohhdemgir: archive.org can browse tar, zip, or iso
18:09 🔗 DFJustin it can't browse tar.gz / rar / anything else
18:09 🔗 Smiley Doh
18:09 🔗 godane i was mostly doing it so we can point to the wayback machine copy of it :P
18:10 🔗 godane when its needed
18:10 🔗 * Smiley tells schbirid to update it
18:10 🔗 * Smiley then tells godane to add the updated version to the bot.
18:10 🔗 DFJustin and if individual items (including all files on the item) get up into the hundreds of gb it doesn't play well with the archive.org infrastructure
18:11 🔗 DFJustin so ~50gb tars or zips is a good rule of thumb
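One possible way to plan ~50 GB buckets before tarring, sketched in Python. This is not schbirid's bucket script (its contents are not shown in the log); `bucket_files` is a hypothetical helper, and the 50 * 10**9 byte limit is only an assumed reading of the "~50gb" rule of thumb.

```python
def bucket_files(files, max_bytes):
    """Group (path, size) pairs into buckets of at most max_bytes each.

    Files are taken in the order given. A single file larger than
    max_bytes gets a bucket of its own rather than being split.
    """
    buckets = []
    current, current_size = [], 0
    for path, size in files:
        # Close the current bucket if this file would push it over the limit.
        if current and current_size + size > max_bytes:
            buckets.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        buckets.append(current)
    return buckets


if __name__ == "__main__":
    # Made-up paths and sizes, purely for illustration.
    listing = [
        ("pub/a.iso", 30 * 10**9),
        ("pub/b.iso", 30 * 10**9),
        ("pub/c.zip", 10 * 10**9),
    ]
    for number, paths in enumerate(bucket_files(listing, 50 * 10**9), start=1):
        print(number, paths)
```

Each resulting list could then become one tar (not tar.gz, since DFJustin notes archive.org can only browse tar, zip, or iso). Keeping files in their original order means a directory's contents tend to stay together, which fits the suggestion to split by subdirectory where possible.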
18:11 🔗 schbirid i should just put it on github
18:11 🔗 DFJustin some ftps you can split them up nicely by subdirectories
18:13 🔗 DFJustin here's an example sketchcow did recently https://archive.org/search.php?query=collection%3Aftpsites%20ftp.icm.edu.pl&sort=-publicdate
18:16 🔗 DFJustin some of them he just got lazy and threw in a 600gb file though heh
18:17 🔗 exmic it happens
18:17 🔗 ohhdemgir I think I'll be lazy up to around 250-300GB, I don't know, getting a few sites at once right now
18:18 🔗 DFJustin basically the issue is that archive.org has a whole bunch of numbered servers with their own amount of free space rather than one big flat filesystem, and each item is assigned to a particular server
18:18 🔗 DFJustin so the bigger the item gets the more likely that server will run out of space
18:19 🔗 ohhdemgir makes sense
18:20 🔗 DFJustin also nemo_bis uploaded a file that was like 2tb and overflowed their database column but I think that's fixed now
18:20 🔗 exmic hah
18:20 🔗 exmic pure class
18:24 🔗 SketchCow That was great.
18:24 🔗 SketchCow We have, in all of archive.org, something like 12 1tb+ objects.
18:24 🔗 SketchCow That's one.
18:24 🔗 SketchCow And it made a column explode.
18:24 🔗 SketchCow Now fixed.
18:25 🔗 SketchCow My boss likes this, she likes me causing things to explode.
18:25 🔗 yipdw I was pretty happy with IA's response
18:25 🔗 yipdw something like e.g. a YC company would have banned nemo
18:27 🔗 exmic yep
18:27 🔗 midas SketchCow: i wish my boss would do the same
18:27 🔗 godane i must cause things to explode sometimes
18:27 🔗 yipdw you killed archivebot a few times
18:32 🔗 ohhdemgir everyone knows about this list right? it's super old now, but does anyone know who compiled, why and how many times have you scanned the 'do not scan' ranges, I remember doing this like 8-10 years ago!! - http://pastebin.com/raw.php?i=vcMXurEX
18:33 🔗 ohhdemgir talk of it dating back to 2003 - https://www.webhostingtalk.com/showthread.php?t=144678
18:33 🔗 exmic pfff, scan all the ranges
18:33 🔗 ohhdemgir exactly
18:34 🔗 exmic many of those don't even route anyway
18:38 🔗 Smiley 207.60.13.64 - 207.60.13.71 SierraCom o_O :D
