Time |
Nickname |
Message |
00:30
🔗
|
godane |
maybe the internet archive will want to get this: http://www.ebay.com/itm/IC-Technology-Fabrication-Dr-Carlton-Osburn-North-Carolina-State-University-VHS-/390573993963 |
00:43
🔗
|
SketchCow |
Not for $1k |
00:53
🔗
|
Ravenloft |
probably the NCSU have it archived |
01:02
🔗
|
godane |
even i thought the price is way too much |
01:02
🔗
|
godane |
maybe $50 or $100 but not a $1000 |
01:20
🔗
|
dashcloud |
Why is the Atari Diagnostic Test Cart so popular? Is there some kind of meme or link going around that references it? |
02:13
🔗
|
dashcloud |
Mobygames never ceases to amaze me with the depth of content on it |
02:42
🔗
|
DFJustin |
dashcloud: cause it's the spotlight item for the collection |
02:44
🔗
|
dashcloud |
ah |
03:14
🔗
|
SketchCow |
Right |
03:14
🔗
|
SketchCow |
Also, small bug in some cases. |
10:14
🔗
|
godane |
so i'm close to being up to date with wall street journal tech briefs |
10:15
🔗
|
godane |
up to date in that i have every thing from 2006 to 2013 uploaded |
10:52
🔗
|
godane |
so some good news with the wall street videos |
10:52
🔗
|
godane |
i found their api |
11:17
🔗
|
godane |
i need to filter this by dates: http://live.wsj.com/api-video/find_all_videos.asp?count=5&fields=id,name,description,duration,thumbnailURL,videoURL,formattedCreationDate,linkURL |
11:17
🔗
|
godane |
query is not working for me for some reason |
14:31
🔗
|
DFJustin |
another stupidly rare find http://mamedev.emulab.it/haze/2014/05/12/other-news-part-4/ |
16:56
🔗
|
godane |
i still need help with wsj api |
16:57
🔗
|
godane |
i need to be able to grab metadata by dates if possible |
16:57
🔗
|
godane |
here is the example url: http://live.wsj.com/api-video/find_all_videos.asp?count=5&fields=id,name,description,duration,thumbnailURL,videoURL,formattedCreationDate,linkURL |
17:00
🔗
|
rocode |
Got a total of 14,000 so far. What date ranges? |
17:00
🔗
|
godane |
how did you get 14,000 |
17:00
🔗
|
rocode |
Iterating by id |
17:01
🔗
|
rocode |
I will start parsing these down with regex and give you something usable. |
17:02
🔗
|
godane |
post urls you're using? |
17:02
🔗
|
godane |
i keep getting the 5 recent videos |
17:03
🔗
|
rocode |
http://live.wsj.com/api-video/find_all_videos.asp?count=15000&fields=id,name,description,duration,thumbnailURL,videoURL,formattedCreationDate,linkURL So far I haven't run into a limit |
17:04
🔗
|
godane |
oh you just run the count up |
17:04
🔗
|
rocode |
Ayep |
17:05
🔗
|
godane |
i couldn't get that to work on my end |
17:05
🔗
|
godane |
i get a proxy error |
17:07
🔗
|
rocode |
https://gist.githubusercontent.com/rocode/8a45dcc192dfaecab930/raw/gistfile1.txt |
17:08
🔗
|
rocode |
Small sample, doing it by 2000 every 15 seconds to avoid the lockout |
17:09
🔗
|
rocode |
I haven't found a way to limit the data yet. |
17:11
🔗
|
rocode |
Oh, this isn't a documented API. Neat. |
17:20
🔗
|
rocode |
There is a massive slowdown after 15k. Iterating by 500 now. It will error out unless it already has a partial amount of the data in cache to hand you. So you have to start out small and work your way up. |
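[Editor's sketch] The approach rocode describes (raise `count` in small steps, pausing between requests so the server has part of the result cached) could look roughly like the following. The endpoint and field list are copied from the URLs posted above; the function names, the step size, and the 15-second delay as a default are assumptions for illustration, and nothing here is rocode's or godane's actual script. The API is undocumented, so its behaviour may differ from what this assumes.

```python
import time
import urllib.request

# Endpoint and fields as posted in the channel; everything else is hypothetical.
BASE = "http://live.wsj.com/api-video/find_all_videos.asp"
FIELDS = ("id,name,description,duration,thumbnailURL,videoURL,"
          "formattedCreationDate,linkURL")


def build_url(count):
    """Build the request URL for the newest `count` videos."""
    return f"{BASE}?count={count}&fields={FIELDS}"


def count_schedule(step, limit):
    """Counts to request in order: step, 2*step, ... up to limit."""
    return list(range(step, limit + 1, step))


def fetch(url):
    """Plain HTTP GET; raises on errors such as 403."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()


def crawl(step, limit, delay=15.0, fetch_fn=fetch, sleep_fn=time.sleep):
    """Request growing counts, sleeping between requests.

    Returns the body of the last (largest) successful response.
    """
    last = None
    for i, count in enumerate(count_schedule(step, limit)):
        if i:
            sleep_fn(delay)  # pause to avoid hammering the server
        last = fetch_fn(build_url(count))
    return last
```

Something like `crawl(500, 15000)` would then walk the count up in steps of 500. Since each response contains the newest `count` videos, only the last response needs to be kept.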
17:23
🔗
|
rocode |
Current data set so far: https://gist.githubusercontent.com/rocode/9ecb6f53be6011d85624/raw/gistfile1.txt |
17:24
🔗
|
godane |
i'm doing it now |
17:25
🔗
|
godane |
its doing it in sets of 100 |
17:25
🔗
|
rocode |
What are you up to? |
17:26
🔗
|
godane |
i'm mirroring the wsj videos so we can have a collection of them |
17:26
🔗
|
rocode |
No, I mean, how many so far? |
17:26
🔗
|
rocode |
If you are ahead of me, I don't want to duplicate effort |
17:26
🔗
|
godane |
its at the 1701 count |
17:28
🔗
|
godane |
2401 now |
17:28
🔗
|
rocode |
I am starting to run into proxy errors. Not sure if we are hitting this too hard. |
17:29
🔗
|
rocode |
https://gist.githubusercontent.com/rocode/e3780ecbf7d09862ddd8/raw/gistfile1.txt |
17:32
🔗
|
rocode |
Just broke into 2012 videos. |
17:38
🔗
|
rocode |
Well, uh, I think I just got IP banned. wget is now pulling in 403 errors. |
17:40
🔗
|
rocode |
How are you going, godane? |
17:41
🔗
|
godane |
i'm at 5000 count |
17:42
🔗
|
godane |
it skipped 5501 but i got 6001 |
17:46
🔗
|
rocode |
Here is 1-6500, no skips. https://gist.github.com/rocode/a3f6ec3af3209c27dd30/raw/gistfile1.txt |
17:46
🔗
|
exmic |
metadata motherfuckers https://en.wikipedia.org/wiki/File:Autographic_Kodak_writing.jpg |
17:47
🔗
|
rocode |
Author: Kodak |
17:47
🔗
|
rocode |
ahaha |
17:54
🔗
|
rocode |
https://gist.github.com/rocode/a3f6ec3af3209c27dd30/raw/gistfile1.txt last set before I got 403'd. Keep trucking, godane. |
17:56
🔗
|
ohhdemgir |
when uploading a 3TB+ ftp site what is the best way to pack it? |
18:01
🔗
|
Smiley |
tar.gz I presume... |
18:02
🔗
|
Smiley |
the archive can auto-extract/view some types of archive, but I'm unsure which to be quite honest. |
18:02
🔗
|
ohhdemgir |
so a single 3TB tar is ok? |
18:02
🔗
|
schbirid |
no |
18:02
🔗
|
ohhdemgir |
didn't think so.. |
18:03
🔗
|
Smiley |
i think you need to split it to get a hope of it uploading |
18:03
🔗
|
Smiley |
500Gb is what we go with I believe.... but you may just want to contact SketchCow about getting him to upload it. |
18:03
🔗
|
schbirid |
there is a script for bucketing |
18:03
🔗
|
schbirid |
50! |
18:03
🔗
|
Smiley |
Oh yeah, 50Gb tars, haha |
18:03
🔗
|
* |
Smiley so dumb |
18:05
🔗
|
schbirid |
bucket script: http://pastebin.com/jww5mVZx (will expire in 1 hour) |
18:05
🔗
|
schbirid |
looks like i wrote it so it will probably dd quantum noise into your boot sector |
18:06
🔗
|
godane |
i mirrored that bucket script |
18:06
🔗
|
Smiley |
lol |
18:06
🔗
|
schbirid |
please edit the fileplanet mentions out of it then |
18:07
🔗
|
godane |
i just gave the link to archivebot |
18:07
🔗
|
schbirid |
:( |
18:07
🔗
|
schbirid |
muh privacee |
18:08
🔗
|
godane |
it looks like a script to make tar files out of big things |
18:08
🔗
|
Smiley |
schbirid: INTERNETS MOTHERFUCKER. DO YOU KNOW HOW THEY WORK? |
18:09
🔗
|
Smiley |
aka don't share what you don't want shared :S |
18:09
🔗
|
Smiley |
Srry bud :D |
18:09
🔗
|
schbirid |
Smiley: I WOULD KICK YOU IF I COULD BUT ITS LIKE OPPOSITE DAY IN HERE |
18:09
🔗
|
schbirid |
ah, no problem :) |
18:09
🔗
|
DFJustin |
ohhdemgir: archive.org can browse tar, zip, or iso |
18:09
🔗
|
DFJustin |
it can't browse tar.gz / rar / anything else |
18:09
🔗
|
Smiley |
Doh |
18:09
🔗
|
godane |
i was mostly doing it so we can point to the wayback machine copy of it :P |
18:10
🔗
|
godane |
when its needed |
18:10
🔗
|
* |
Smiley tells schbirid to update it |
18:10
🔗
|
* |
Smiley then tells godane to add the updated version to the bot. |
18:10
🔗
|
DFJustin |
and if individual items (including all files on the item) get up into the hundreds of gb it doesn't play well with the archive.org infrastructure |
18:11
🔗
|
DFJustin |
so ~50gb tars or zips is a good rule of thumb |
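[Editor's sketch] The ~50 GB rule of thumb can be applied by grouping files into size-capped buckets and writing each bucket as an uncompressed tar (archive.org can browse tar but not tar.gz, per DFJustin above). This is not schbirid's bucket script, whose contents are not shown here; the function names and the greedy in-order packing are assumptions chosen for illustration.

```python
import tarfile

LIMIT = 50 * 1000**3  # roughly 50 GB, the rule of thumb mentioned above


def bucket_files(sizes, limit=LIMIT):
    """Group (path, size_bytes) pairs, in order, into buckets of at most `limit`.

    A single file larger than `limit` ends up alone in its own bucket.
    """
    buckets = []
    current, used = [], 0
    for path, size in sizes:
        if current and used + size > limit:
            buckets.append(current)  # close the bucket that would overflow
            current, used = [], 0
        current.append(path)
        used += size
    if current:
        buckets.append(current)
    return buckets


def write_bucket(paths, out_name):
    """Write one bucket as a plain, uncompressed tar file."""
    with tarfile.open(out_name, "w") as tar:
        for path in paths:
            tar.add(path)
```

Keeping the files in directory order, rather than packing for the tightest fit, keeps related files in the same tar, which fits the idea of splitting an FTP site by subdirectory.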
18:11
🔗
|
schbirid |
i should just put it on github |
18:11
🔗
|
DFJustin |
some ftps you can split them up nicely by subdirectories |
18:13
🔗
|
DFJustin |
here's an example sketchcow did recently https://archive.org/search.php?query=collection%3Aftpsites%20ftp.icm.edu.pl&sort=-publicdate |
18:16
🔗
|
DFJustin |
some of them he just got lazy and threw in a 600gb file though heh |
18:17
🔗
|
exmic |
it happens |
18:17
🔗
|
ohhdemgir |
I think I'll be lazy up to around 250-300GB, I don't know, getting a few sites at once right now |
18:18
🔗
|
DFJustin |
basically the issue is that archive.org has a whole bunch of numbered servers with their own amount of free space rather than one big flat filesystem, and each item is assigned to a particular server |
18:18
🔗
|
DFJustin |
so the bigger the item gets the more likely that server will run out of space |
18:19
🔗
|
ohhdemgir |
makes sense |
18:20
🔗
|
DFJustin |
also nemo_bis uploaded a file that was like 2tb and overflowed their database column but I think that's fixed now |
18:20
🔗
|
exmic |
hah |
18:20
🔗
|
exmic |
pure class |
18:24
🔗
|
SketchCow |
That was great. |
18:24
🔗
|
SketchCow |
We have, in all of archive.org, something like 12 1tb+ objects. |
18:24
🔗
|
SketchCow |
That's one. |
18:24
🔗
|
SketchCow |
And it made a column explode. |
18:24
🔗
|
SketchCow |
Now fixed. |
18:25
🔗
|
SketchCow |
My boss likes this, she likes me causing things to explode. |
18:25
🔗
|
yipdw |
I was pretty happy with IA's response |
18:25
🔗
|
yipdw |
something like e.g. a YC company would have banned nemo |
18:27
🔗
|
exmic |
yep |
18:27
🔗
|
midas |
SketchCow: i wish my boss would do the same |
18:27
🔗
|
godane |
i must cause things to explode sometimes |
18:27
🔗
|
yipdw |
you killed archivebot a few times |
18:32
🔗
|
ohhdemgir |
everyone knows about this list right? it's super old now, but does anyone know who compiled, why and how many times have you scanned the 'do not scan' ranges, I remember doing this like 8-10 years ago!! - http://pastebin.com/raw.php?i=vcMXurEX |
18:33
🔗
|
ohhdemgir |
talk of it dating back to 2003 - https://www.webhostingtalk.com/showthread.php?t=144678 |
18:33
🔗
|
exmic |
pfff, scan all the ranges |
18:33
🔗
|
ohhdemgir |
exactly |
18:34
🔗
|
exmic |
many of those don't even route anyway |
18:38
🔗
|
Smiley |
207.60.13.64 - 207.60.13.71 SierraCom o_O :D |