Time |
Nickname |
Message |
02:55
🔗
|
SketchCow |
There are geniuses in this channel. |
02:55
🔗
|
SketchCow |
https://archive.org/stream/zx_Mushroom_Alley_1984_Mogul_Communications/Mushroom_Alley_1984_Mogul_Communications.z80?module=zx81&scale=2 |
02:55
🔗
|
SketchCow |
What is it getting a 404 on |
02:55
🔗
|
SketchCow |
(I'm looking into it myself) |
02:59
🔗
|
garyrh |
https://archive.org/cors/jsmess_config_v1/zx81.cfg shows up in network requests and 404s |
02:59
🔗
|
SketchCow |
Thank you. |
03:00
🔗
|
garyrh |
:) |
04:38
🔗
|
garyrh |
https://medium.com/@vijayp/of-taxis-and-rainbows-f6bc289679a1 |
04:55
🔗
|
yipdw |
garyrh: guess we better download that |
04:58
🔗
|
garyrh |
you mean the data? someone already got that: https://archive.org/details/nycTaxiTripData2013 |
19:01
🔗
|
honestdua |
hello |
19:02
🔗
|
honestdua |
Jason told me on twitter to come here and ask about backups of sourceforge.net |
19:03
🔗
|
hduane |
. |
19:06
🔗
|
hduane |
hello? Is anybody alive? |
19:06
🔗
|
hduane |
Or have the bots fully taken over? |
19:08
🔗
|
* |
hduane thinks to himself.. "What could get people fired up a bit enough to respond?" |
19:08
🔗
|
hduane |
Destroy all old backups? |
19:08
🔗
|
hduane |
*jk* |
19:09
🔗
|
db48x |
howdy |
19:09
🔗
|
hduane |
http://www.quickmeme.com/meme/354li3 |
19:09
🔗
|
hduane |
Hello |
19:09
🔗
|
db48x |
someone is alive somewhere, I am sure |
19:09
🔗
|
hduane |
Jason Scott sent me here on twitter to ask about backups of sourceforge |
19:09
🔗
|
db48x |
excellent |
19:10
🔗
|
hduane |
I'm trying to get copies |
19:10
🔗
|
db48x |
backing up forges is tricky |
19:10
🔗
|
db48x |
esr has written some nice software that can back up a single project from a forge, with very high fidelity, but only if you are an admin of that project |
19:11
🔗
|
hduane |
https://twitter.com/textfiles/status/480766550593773569 |
19:12
🔗
|
hduane |
Is it just me or is SF not very open |
19:12
🔗
|
db48x |
it's from an older era |
19:12
🔗
|
hduane |
it seems to use a lot of tricks to keep you from getting copies unless you explicitly click a link |
19:13
🔗
|
db48x |
http://esr.ibiblio.org/?p=1369 |
19:13
🔗
|
db48x |
http://home.gna.org/forgeplucker/ |
19:14
🔗
|
hduane |
http://www.coderforgood.com |
19:15
🔗
|
hduane |
is a little thing I'm starting as well, and that's one of the reasons I wanted to have a backup of SF |
19:15
🔗
|
hduane |
is one of the things that I am involved with, that's why I wanted a copy of SF's data |
19:15
🔗
|
hduane |
I think I idled out |
19:16
🔗
|
hduane |
oh ok just lag |
19:16
🔗
|
db48x |
cool :) |
19:17
🔗
|
hduane |
Well if I wanted to mirror SF or at least get a list of all its projects, that doesn't seem to be that hard if only due to how the sitemap is |
19:17
🔗
|
hduane |
if google can do it so can we |
19:18
🔗
|
hduane |
in fact I just looked at the sitemap and was given a file that leads to other files, for all projects, sitemapped out into separate files |
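A minimal sketch of the sitemap walk hduane describes here, assuming the site serves a standard sitemaps.org index file whose entries point at further sitemap files. The names `parse_sitemap` and `walk` and the `fetch` callable are hypothetical and are not taken from the code shared later in this log.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return (kind, locs), where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == NS + "sitemapindex" else "urlset"
    locs = [el.text.strip() for el in root.iter(NS + "loc") if el.text]
    return kind, locs


def walk(url, fetch):
    """Yield every page URL reachable from the sitemap at `url`.

    `fetch` is a hypothetical callable that takes a URL and returns the
    XML text; it is passed in so the walk can be tried without a network.
    """
    kind, locs = parse_sitemap(fetch(url))
    if kind == "index":
        # An index file lists further sitemap files; descend into each.
        for child in locs:
            yield from walk(child, fetch)
    else:
        # A urlset file lists the page URLs themselves.
        yield from locs
```

Because `walk` is a generator, the caller can write URLs out as they arrive instead of holding the whole list in memory.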
19:19
🔗
|
DFJustin |
"data jail", that's a good term |
19:19
🔗
|
db48x |
Yes, grabbing a list of all projects on a forge is fairly straight-forward |
19:19
🔗
|
db48x |
the real work will be in adapting ForgePlucker |
19:20
🔗
|
db48x |
last time I used it it didn't handle failure gracefully |
19:20
🔗
|
db48x |
so it just doesn't do anything useful if you're not an admin of the project |
19:20
🔗
|
hduane |
well SF uses multiple types of data storage |
19:20
🔗
|
hduane |
it's old-style cvs, and not as old but still old svn |
19:21
🔗
|
db48x |
yes, plus bug tracking, mailing list, etc |
19:21
🔗
|
hduane |
To me the code is the priority |
19:21
🔗
|
hduane |
the most |
19:21
🔗
|
hduane |
all the other stuff matters as well |
19:21
🔗
|
hduane |
but the code is what is important |
19:21
🔗
|
db48x |
yes :) |
19:22
🔗
|
db48x |
the other stuff is context and community, but the code is the thing itself in a sense |
19:23
🔗
|
db48x |
if all you want is the code, then the path is quite straight-forward |
19:23
🔗
|
db48x |
cloning a cvs repository isn't hard (but hard to do perfectly), svn is easier |
19:23
🔗
|
db48x |
if you want to go the extra mile, then consider creating a job for our ArchiveTeam Warrior |
19:24
🔗
|
db48x |
then you'll have a couple of dozen people helping out with the downloading automatically |
19:24
🔗
|
db48x |
I'd be more interested in extending ForgePlucker though, and then making a warrior task out of that |
19:25
🔗
|
db48x |
http://archiveteam.org/index.php?title=Warrior |
19:26
🔗
|
hduane |
Well I just signed up for GNA and am waiting on the email |
19:26
🔗
|
hduane |
that said I was hoping that I could just ask SF for a copy and get one |
19:27
🔗
|
hduane |
ok on gna as honestduane |
19:28
🔗
|
db48x |
that would be nice :) |
19:29
🔗
|
hduane |
well I did send sf an email and a tweet asking about this but it was yesterday on a saturday |
19:32
🔗
|
hduane |
Do you know roughly how much data is on SF? |
19:35
🔗
|
db48x |
more than Geocities, less than Google Video? |
19:37
🔗
|
hduane |
so a lot? ;) |
19:37
🔗
|
hduane |
well I don't have access to the source for the plucker but I can look deeper into this issue of the sitemap of SF |
19:39
🔗
|
Nemo_bis |
There were some papers on SourceForge, they probably mention the total size but I forgot |
19:39
🔗
|
hduane |
let me see if I can calculate it myself using the sitemap as input |
19:39
🔗
|
db48x |
http://svn.gna.org/viewcvs/forgeplucker/trunk/ |
19:39
🔗
|
db48x |
Geocities was only a terabyte |
19:40
🔗
|
hduane |
yet I remember when it seemed to be at least half the internet |
19:44
🔗
|
joepie91_ |
nobody pinged SketchCow about hduane yet? :P |
19:45
🔗
|
joepie91_ |
hduane: it was, probably |
19:45
🔗
|
joepie91_ |
content-wise |
19:45
🔗
|
db48x |
yea, it was the place to be |
20:38
🔗
|
honestdua |
ok so I was able to collect all the links but interestingly enough it looks like they have an Apache Allura sitemap as well |
20:40
🔗
|
honestdua |
also the patterns used by each project in cvs are pretty predictable |
20:41
🔗
|
honestdua |
and outlined in the robots.txt file for google and company to not mirror |
20:42
🔗
|
db48x |
yea :) |
20:42
🔗
|
SketchCow |
Hey. |
20:43
🔗
|
db48x |
I dislike robots.txt in general, but they are occasionally awesome |
20:43
🔗
|
SketchCow |
I figured hduane could get the archive team outlook |
20:43
🔗
|
honestdua |
Also found 1,505,096 links in the main sitemap |
20:43
🔗
|
honestdua |
hey Jason |
20:43
🔗
|
honestdua |
they all seem to follow a common pattern of about 3-4 links per project |
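A rough back-of-the-envelope check of what those two figures imply, assuming the 3-4 links per project pattern holds across the whole main sitemap (honestdua only says the links "seem" to follow it). The function name `project_range` is hypothetical.

```python
MAIN_SITEMAP_LINKS = 1_505_096  # link count quoted in the log


def project_range(links, min_per_project=3, max_per_project=4):
    """Return (low, high) project counts for a links-per-project range."""
    # More links per project means fewer projects, so the maximum
    # links-per-project figure gives the low end of the range.
    return links // max_per_project, links // min_per_project


low, high = project_range(MAIN_SITEMAP_LINKS)
print(f"roughly {low:,} to {high:,} projects")
```

Under that assumption the main sitemap would cover somewhere between about 376,000 and 502,000 projects.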
20:44
🔗
|
SketchCow |
Grab it all. |
20:44
🔗
|
SketchCow |
Did we LOSE freshmeat or is freshmeat around in some way? |
20:45
🔗
|
antithesi |
Yo |
20:45
🔗
|
honestdua |
I'm running my sitemap slurper right now.. it doesn't exactly respect robots.txt |
20:45
🔗
|
antithesi |
Can you guys archive userscripts.org? I'm afraid it'll die |
20:45
🔗
|
db48x |
SketchCow: it's still up |
20:46
🔗
|
honestdua |
but the link file for the main sitemap is about 75 megs of just urls, one per line |
20:46
🔗
|
honestdua |
github has gotten so big.. I worry about it being the main choice |
20:46
🔗
|
honestdua |
no its gone |
20:46
🔗
|
honestdua |
is what was posted on slashdot etc |
20:46
🔗
|
honestdua |
I heard on reddit about it first then slashdot that it was gone |
20:46
🔗
|
db48x |
the site is still there |
20:47
🔗
|
db48x |
antithesi: I can't load it; it just times out |
20:48
🔗
|
antithesi |
db48x it's still available at userscripts.org:8080 |
20:48
🔗
|
honestdua |
it redirects to something else last i checked for freshmeat |
20:49
🔗
|
db48x |
honestdua: it was renamed to freecode a while back |
20:50
🔗
|
antithesi |
Okay, looks like there's http://userscripts-mirror.org/ too, but that one isn't downloadable |
20:51
🔗
|
honestdua |
hmm OOM error from trying the secondary Apache Allura sitemap |
20:52
🔗
|
honestdua |
its over 600 files so that may be it |
20:52
🔗
|
honestdua |
hmm.. recoding it to use less memory at the expense of speed.. |
20:55
🔗
|
honestdua |
oh and I am on this cruddy irc cgi chat client |
20:55
🔗
|
honestdua |
its been years since i had something like mirc installed |
20:55
🔗
|
honestdua |
not even sure if mirc is still around |
20:55
🔗
|
honestdua |
so anyway I may timeout |
20:55
🔗
|
honestdua |
as I work |
21:05
🔗
|
honestdua |
ok 615 sitemap files for the second part |
21:05
🔗
|
honestdua |
at sitemap 7 and we have over 379k urls |
21:05
🔗
|
honestdua |
distinct urls* |
21:06
🔗
|
honestdua |
as I am making sure they are all unique |
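One possible way to keep URLs distinct without holding every URL string in memory, sketched under assumptions: the log does not say how honestdua's C# code does it, and `write_unique` is a hypothetical name. It remembers an 8-byte digest per URL and writes each new URL straight to the output, so a digest collision could in principle drop a distinct URL; at a few million entries that risk is very small but not zero.

```python
import hashlib


def write_unique(urls, out):
    """Write each distinct URL from `urls` to `out` once; return the count."""
    seen = set()  # holds 8-byte digests, not the URL strings themselves
    written = 0
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines
        key = hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()
        if key in seen:
            continue  # already written (or, very rarely, a collision)
        seen.add(key)
        out.write(url + "\n")
        written += 1
    return written
```

`urls` can be any iterable, including a generator that reads sitemap files one at a time, and `out` any object with a `write` method such as an open text file.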
21:06
🔗
|
honestdua |
just passed over a million urls as of sitemap number 100 |
21:08
🔗
|
Smiley |
looking fun |
21:08
🔗
|
honestdua |
i think if this fails I'm just going to start having it download the files and work on processing as |
21:08
🔗
|
honestdua |
a separate task |
21:08
🔗
|
honestdua |
just passed over 2 million as of file 200 |
21:09
🔗
|
Smiley |
you put this list online somewhere yet? |
21:09
🔗
|
honestdua |
so if the numbers stay sane SF has almost 7 million links |
21:10
🔗
|
honestdua |
and if its true to the 3-4 link per project that I'm seeing, under 2 million projects |
21:11
🔗
|
honestdua |
looks like some of the optimizations I made also got rid of my OOM *crosses fingers* |
21:12
🔗
|
honestdua |
so up to around 2 million or so projects at most; of course this also includes user profiles |
21:12
🔗
|
honestdua |
so the number may be bad |
21:12
🔗
|
honestdua |
ok over half done |
21:15
🔗
|
honestdua |
just passed 400 files and over 400 links |
21:16
🔗
|
honestdua |
I am so glad i have 16gb of ram on this thing |
21:16
🔗
|
honestdua |
http://jeffcovey.net/2014/06/19/freshmeat-net-1997-2014/ |
21:17
🔗
|
SketchCow |
Anyway, the summary is that archive team downloads everything it can, and we don't ask nicely. |
21:17
🔗
|
honestdua |
What about legality? |
21:18
🔗
|
honestdua |
I'm going to put it in my dropbox after its done and then send you links to the public urls |
21:18
🔗
|
honestdua |
Its just the urls |
21:18
🔗
|
honestdua |
about 500 sitemap files done |
21:18
🔗
|
honestdua |
so only 20% left |
21:19
🔗
|
honestdua |
i feel like its perfectly legit to do this because they make everything public anyway |
21:19
🔗
|
honestdua |
but if it was paywalled or whatever |
21:19
🔗
|
honestdua |
it would not be ok |
21:20
🔗
|
honestdua |
my wife is sitting here asking me to make sure I do not endanger her or our family with "hacker stuff" |
21:21
🔗
|
honestdua |
but sitemaps are public so no problem |
21:21
🔗
|
godane |
PRO TIP: don't endanger family with "hacker stuff" |
21:21
🔗
|
honestdua |
yes never do that |
21:22
🔗
|
honestdua |
besides I'm a security professional i need to keep my rep solid |
21:22
🔗
|
godane |
going after sitemaps should be find |
21:22
🔗
|
godane |
*fine |
21:23
🔗
|
honestdua |
ok just got to 6 million |
21:26
🔗
|
honestdua |
ok so final allura links file is 273.4 megs in size of just links |
21:27
🔗
|
honestdua |
and the main sitemap is 73 or so megs of just links |
21:27
🔗
|
SketchCow |
See, here I'm sad. |
21:27
🔗
|
honestdua |
and they compress down to a simple 40 meg zip |
21:28
🔗
|
SketchCow |
Because this is IRC, and not a dystopian hacker movie |
21:28
🔗
|
honestdua |
https://dl.dropboxusercontent.com/u/18627325/sourceforge.net/sf_net_sitemap.zip is syncing now |
21:28
🔗
|
SketchCow |
Because then I'd turn up to the rafters with all the insane kids in harnesses and hammocks with laptops |
21:28
🔗
|
SketchCow |
and go "AND WHAT ABOUT THE LEGALITY, BOYS" |
21:28
🔗
|
SketchCow |
And, like, hundreds of soda cans just come raining down in laughter |
21:29
🔗
|
SketchCow |
But we'll settle for "let's worry about saving the data" |
21:30
🔗
|
honestdua |
that data seems very valuable; at this point it should be everything you need to automate the collection of every project's data |
21:36
🔗
|
midas |
honestdua: tell your wife you're just grabbing public data anyway. nobody got sued for visiting a site. EVER. |
21:37
🔗
|
midas |
that's all we do, but at warp speed 10 |
21:37
🔗
|
SketchCow |
17:33 < honestdua> my wife is sitting here asking me to make sure I do not endanger her or our family with "hacker stuff" |
21:37
🔗
|
SketchCow |
Then hand the fun off to us, we'll do the rest. |
21:37
🔗
|
SketchCow |
You've done enough! |
21:38
🔗
|
Smiley |
plz make sure you upload infos asap |
21:38
🔗
|
Smiley |
just in case you disappear. |
21:39
🔗
|
honestdua |
oh btw the data shows that over 3.28 million of the 6.15+ million links are for user profiles |
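The log does not say how the profile links were identified. One way to get a breakdown like this is to count URLs by their first path segment, sketched below on the assumption that profile pages share a common prefix such as `/u/` (an assumption, not something stated here). The function name `count_by_prefix` is hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_by_prefix(urls):
    """Count URLs by their first path segment, e.g. 'u' or 'projects'."""
    counts = Counter()
    for url in urls:
        # Split the path on '/' and drop the empty pieces.
        segments = [s for s in urlsplit(url.strip()).path.split("/") if s]
        counts[segments[0] if segments else ""] += 1
    return counts
```

Fed the one-URL-per-line link file, `count_by_prefix(open(path))` would return a `Counter` whose largest entries show which kinds of pages dominate the sitemap.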
21:39
🔗
|
honestdua |
that zip is the output of my code |
21:40
🔗
|
honestdua |
you want the code as well? |
21:40
🔗
|
honestdua |
it's just like 7 lines of C# but it should run on mono |
21:41
🔗
|
honestdua |
https://dl.dropboxusercontent.com/u/18627325/sourceforge.net/sf_sitemap_sucker%20-%20Copy.zip |
21:41
🔗
|
honestdua |
there you go |
21:43
🔗
|
honestdua |
ok between the sf_net_sitemap.zip file above and that 9k zip of the code to generate it you have everything to duplicate my efforts to use the public sitemap to generate a map of sf and its projects/users |
21:44
🔗
|
db48x |
sweet |
21:45
🔗
|
honestdua |
https://dl.dropboxusercontent.com/u/18627325/sourceforge.net/sf_net_sitemap.zip is the output |
21:46
🔗
|
honestdua |
not bad for an hour or so's work |
21:46
🔗
|
honestdua |
I'm on twitter as @honestduane and it looks like my wife wants me to go to home depot and get a bolt to fix the lawn mower.. so I may end up having to mow the lawn as well knowing how things go :/ |
21:46
🔗
|
honestdua |
Either way, hope that helps |
21:47
🔗
|
honestdua |
I need to log, have a good day |
21:48
🔗
|
honestdua |
just going to let this idle |
21:48
🔗
|
db48x |
honestdua: you too :) |
21:48
🔗
|
db48x |
and thanks :) |
21:53
🔗
|
db48x |
nooo, everyone is catching up: http://argonath.db48x.net/ |
22:14
🔗
|
midas |
what is it db48x ? |
23:01
🔗
|
amerrykan |
wow, that's huge. is there some concern about sf going away? |
23:03
🔗
|
SketchCow |
By most assumptions, it has |
23:11
🔗
|
db48x |
midas: we're scraping the pixori.al url shortener in preparation for grabbing all the pixorial videos we can find |
23:22
🔗
|
honestdua |
ok back, should probably get a real irc client installed |
23:29
🔗
|
honestdua |
ok shutting this down |
23:58
🔗
|
honestdua |
Question: Is bittorrent the only way to get all this data you guys are archiving? |
23:58
🔗
|
honestdua |
that doesn't seem like a very stable storage medium. |
23:59
🔗
|
db48x |
it's not a storage medium, it's a delivery mechanism |
23:59
🔗
|
honestdua |
Well what if I want to get a copy of everything. |
23:59
🔗
|
honestdua |
10 tb from bittorrent would take forever |