[02:55] <SketchCow> There are geniuses in this channel.
[02:55] <SketchCow> https://archive.org/stream/zx_Mushroom_Alley_1984_Mogul_Communications/Mushroom_Alley_1984_Mogul_Communications.z80?module=zx81&scale=2
[02:55] <SketchCow> What is it getting a 404 on?
[02:55] <SketchCow> (I'm looking into it myself)
[02:59] <garyrh> https://archive.org/cors/jsmess_config_v1/zx81.cfg shows up in network requests and 404s
[02:59] <SketchCow> Thank you.
[03:00] <garyrh> :)
[04:38] <garyrh> https://medium.com/@vijayp/of-taxis-and-rainbows-f6bc289679a1
[04:55] <yipdw> garyrh: guess we better download that
[04:58] <garyrh> you mean the data? someone already got that: https://archive.org/details/nycTaxiTripData2013
[19:01] <honestdua> hello
[19:02] <honestdua> Jason told me on twitter to come here and ask about backups of sourceforge.net
[19:03] <hduane> .
[19:06] <hduane> hello? Is anybody alive?
[19:06] <hduane> Or have the bots fully taken over?
[19:08] * hduane thinks to himself.. "What could get people fired up enough to respond?"
[19:08] <hduane> Destroy all old backups?
[19:08] <hduane> *jk*
[19:09] <db48x> howdy
[19:09] <hduane> http://www.quickmeme.com/meme/354li3
[19:09] <hduane> Hello
[19:09] <db48x> someone is alive somewhere, I am sure
[19:09] <hduane> Jason Scott sent me here on twitter to ask about backups of SourceForge
[19:09] <db48x> excellent
[19:10] <hduane> I'm trying to get copies
[19:10] <db48x> backing up forges is tricky
[19:10] <db48x> esr has written some nice software that can back up a single project from a forge, with very high fidelity, but only if you are an admin of that project
[19:11] <hduane> https://twitter.com/textfiles/status/480766550593773569
[19:12] <hduane> Is it just me, or is SF not very open?
[19:12] <db48x> it's from an older era
[19:12] <hduane> it seems to use a lot of tricks to keep you from getting copies unless you explicitly click a link
[19:13] <db48x> http://esr.ibiblio.org/?p=1369
[19:13] <db48x> http://home.gna.org/forgeplucker/
[19:14] <hduane> http://www.coderforgood.com
[19:15] <hduane> is a little thing I'm starting as well, and that's one of the reasons I wanted to have a backup of SF
[19:15] <hduane> is one of the things that I am involved with; that's why I wanted a copy of SF's data
[19:15] <hduane> I think I idled out
[19:16] <hduane> oh ok, just lag
[19:16] <db48x> cool :)
[19:17] <hduane> Well, if I wanted to mirror SF, or at least get a list of all its projects, that doesn't seem to be that hard, if only due to how the sitemap is set up
[19:17] <hduane> if Google can do it, so can we
[19:18] <hduane> in fact I just looked at the sitemap and was given a file that leads to other files, for all projects, sitemapped out into separate files
[19:19] <DFJustin> "data jail", that's a good term
[19:19] <db48x> Yes, grabbing a list of all projects on a forge is fairly straightforward
[19:19] <db48x> the real work will be in adapting ForgePlucker
[19:20] <db48x> last time I used it, it didn't handle failure gracefully
[19:20] <db48x> so it just doesn't do anything useful if you're not an admin of the project
[19:20] <hduane> well, SF uses multiple types of data storage
[19:20] <hduane> it's old-style CVS, plus SVN, which is not as old but still old
[19:21] <db48x> yes, plus bug tracking, mailing lists, etc.
[19:21] <hduane> To me the code is the priority
[19:21] <hduane> the most
[19:21] <hduane> all the other stuff matters as well
[19:21] <hduane> but the code is what is important
[19:21] <db48x> yes :)
[19:22] <db48x> the other stuff is context and community, but the code is the thing itself, in a sense
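The sitemap-index layout hduane describes at 19:18 (one index file that points at per-project sub-sitemap files) is just the standard sitemap protocol, and enumerating it takes only a few lines of C# on Mono. This is a hedged sketch, not the tool shared later in the log; the index URL is an assumption and may not match SourceForge's real entry point.

    // Minimal sketch: list the sub-sitemaps referenced by a sitemap index.
    // The index URL below is an assumption; the real entry point may differ.
    using System;
    using System.Net;
    using System.Xml.Linq;

    class SitemapIndexLister
    {
        static void Main()
        {
            const string indexUrl = "https://sourceforge.net/sitemap.xml"; // assumed location
            XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

            using (var client = new WebClient())
            {
                var doc = XDocument.Parse(client.DownloadString(indexUrl));
                // A sitemap index wraps each child sitemap URL in <sitemap><loc>.
                foreach (var loc in doc.Descendants(ns + "loc"))
                    Console.WriteLine(loc.Value.Trim());
            }
        }
    }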
[19:23] <db48x> if all you want is the code, then the path is quite straightforward
[19:23] <db48x> cloning a CVS repository isn't hard (but it's hard to do perfectly); SVN is easier
[19:23] <db48x> if you want to go the extra mile, then consider creating a job for our ArchiveTeam Warrior
[19:24] <db48x> then you'll have a couple of dozen people helping out with the downloading automatically
[19:24] <db48x> I'd be more interested in extending ForgePlucker though, and then making a warrior task out of that
[19:25] <db48x> http://archiveteam.org/index.php?title=Warrior
[19:26] <hduane> Well, I just signed up for GNA and am waiting on the email
[19:26] <hduane> that said, I was hoping that I could just ask SF for a copy and get one
[19:27] <hduane> ok, on GNA as honestduane
[19:28] <db48x> that would be nice :)
[19:29] <hduane> Well, I did send SF an email and a tweet asking about this, but it was yesterday, on a Saturday
[19:32] <hduane> Do you know roughly how much data is on SF?
[19:35] <db48x> more than Geocities, less than Google Video?
[19:37] <hduane> so, a lot? ;)
[19:37] <hduane> well, I don't have access to the source for the plucker, but I can look deeper into this issue of the sitemap of SF
[19:39] <Nemo_bis> There were some papers on SourceForge; they probably mention the total size, but I forgot
[19:39] <hduane> let me see if I can calculate it myself using the sitemap as input
[19:39] <db48x> http://svn.gna.org/viewcvs/forgeplucker/trunk/
[19:39] <db48x> Geocities was only a terabyte
[19:40] <hduane> yet I remember when it seemed to be at least half the internet
[19:44] <joepie91_> nobody pinged SketchCow about hduane yet? :P
[19:45] <joepie91_> hduane: it was, probably
[19:45] <joepie91_> content-wise
[19:45] <db48x> yea, it was the place to be
[20:38] <honestdua> ok, so I was able to collect all the links, but interestingly enough it looks like they have an Apache Allura sitemap as well
[20:40] <honestdua> also, the patterns used by each project in csv are pretty predictable
[20:41] <honestdua> and outlined in the robots.txt file for Google and company to not mirror
[20:42] <db48x> yea :)
[20:42] <SketchCow> Hey.
[20:43] <db48x> I dislike robots.txt in general, but they are occasionally awesome
[20:43] <SketchCow> I figured hduane could get the Archive Team outlook
[20:43] <honestdua> Also found 1,505,096 links in the main sitemap
[20:43] <honestdua> hey Jason
[20:43] <honestdua> they all seem to follow a common pattern of about 3-4 links per project
[20:44] <SketchCow> Grab it all.
[20:44] <SketchCow> Did we LOSE freshmeat, or is freshmeat around in some way?
[20:45] <antithesi> Yo
[20:45] <honestdua> I'm running my sitemap slurper right now.. it doesn't exactly respect robots.txt
[20:45] <antithesi> Can you guys archive userscripts.org? I'm afraid it'll die
[20:45] <db48x> SketchCow: it's still up
[20:46] <honestdua> but the link file for the main sitemap is about 75 megs of just URLs, one per line
[20:46] <honestdua> github has gotten so big.. I worry about it being the main choice
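A guess at what the "sitemap slurper" mentioned at 20:45 might look like: walk a list of sub-sitemap URLs and append every <loc> entry to a text file, one URL per line, producing a link file like the 75 MB one described above. This is a sketch under assumptions; the file names are hypothetical and the actual 7-line C# tool shared later in the log was not available for reference.

    // Sketch of a "sitemap slurper": pull every <loc> out of each sub-sitemap
    // and write the URLs to a plain text file, one per line.
    // "subsitemaps.txt" and "sf_net_links.txt" are hypothetical file names.
    using System;
    using System.IO;
    using System.Net;
    using System.Xml.Linq;

    class SitemapSlurper
    {
        static void Main()
        {
            XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

            using (var client = new WebClient())
            using (var output = new StreamWriter("sf_net_links.txt"))
            {
                foreach (var sitemapUrl in File.ReadLines("subsitemaps.txt"))
                {
                    var doc = XDocument.Parse(client.DownloadString(sitemapUrl));
                    foreach (var loc in doc.Descendants(ns + "loc"))
                        output.WriteLine(loc.Value.Trim());
                }
            }
        }
    }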
[20:46] <honestdua> no, it's gone
[20:46] <honestdua> that's what was posted on Slashdot etc.
[20:46] <honestdua> I heard about it on reddit first, then Slashdot, that it was gone
[20:46] <db48x> the site is still there
[20:47] <db48x> antithesi: I can't load it; it just times out
[20:48] <antithesi> db48x: it's still available at userscripts.org:8080
[20:48] <honestdua> it redirects to something else, last I checked, for freshmeat
[20:49] <db48x> honestdua: it was renamed to freecode a while back
[20:50] <antithesi> Okay, looks like there's http://userscripts-mirror.org/ too, but that one isn't downloadable
[20:51] <honestdua> hmm, OOM error from trying the secondary Apache Allura sitemap
[20:52] <honestdua> it's over 600 files, so that may be it
[20:52] <honestdua> hmm.. recoding it to use less memory at the expense of speed..
[20:55] <honestdua> oh, and I am on this cruddy IRC CGI chat client
[20:55] <honestdua> it's been years since I had something like mIRC installed
[20:55] <honestdua> not even sure if mIRC is still around
[20:55] <honestdua> so anyway, I may time out
[20:55] <honestdua> as I work
[21:05] <honestdua> ok, 615 sitemap files for the second part
[21:05] <honestdua> at sitemap 7 and we have over 379k URLs
[21:05] <honestdua> distinct URLs*
[21:06] <honestdua> as I am making sure they are all unique
[21:06] <honestdua> just passed over a million URLs as of sitemap number 100
[21:08] <Smiley> looking fun
[21:08] <honestdua> I think if this fails I'm just going to start having it download the files and work on processing as a separate task
[21:08] <honestdua> just passed over 2 million as of file 200
[21:09] <Smiley> you put this list online somewhere yet?
[21:09] <honestdua> so if the numbers stay sane, SF has almost 7 million links
[21:10] <honestdua> and if it's true to the 3-4 links per project that I'm seeing, under 2 million projects
[21:11] <honestdua> looks like some of the optimizations I made also got rid of my OOM *crosses fingers*
[21:12] <honestdua> so up to around 2 million or so projects at most; of course this also includes user profiles
[21:12] <honestdua> so the number may be bad
[21:12] <honestdua> ok, over half done
[21:15] <honestdua> just passed 400 files and over 400 links
[21:16] <honestdua> I am so glad I have 16GB of RAM on this thing
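The memory fix hduane alludes to at 20:52 and 21:11 (recoding to use less memory at the expense of speed while still counting distinct URLs) could plausibly be done by streaming each sub-sitemap with XmlReader instead of loading whole documents, and keeping only a HashSet of URLs seen. This is an assumed approach, not the actual change he made.

    // Hedged sketch of a memory-frugal pass: stream each sub-sitemap element
    // by element and keep only a HashSet of URLs for the distinct count.
    // "subsitemaps.txt" is the same hypothetical input file as above.
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Net;
    using System.Xml;

    class DistinctUrlCounter
    {
        static void Main()
        {
            var seen = new HashSet<string>();

            using (var client = new WebClient())
            {
                foreach (var sitemapUrl in File.ReadLines("subsitemaps.txt"))
                {
                    using (var stream = client.OpenRead(sitemapUrl))
                    using (var reader = XmlReader.Create(stream))
                    {
                        // Read forward to each <loc> without building a full DOM.
                        while (reader.ReadToFollowing("loc"))
                            seen.Add(reader.ReadElementContentAsString().Trim());
                    }
                }
            }

            Console.WriteLine("distinct URLs: " + seen.Count);
        }
    }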
[21:16] <honestdua> http://jeffcovey.net/2014/06/19/freshmeat-net-1997-2014/
[21:17] <SketchCow> Anyway, the summary is that Archive Team downloads everything it can, and we don't ask nicely.
[21:17] <honestdua> What about legality?
[21:18] <honestdua> I'm going to put it in my Dropbox after it's done and then send you links to the public URLs
[21:18] <honestdua> It's just the URLs
[21:18] <honestdua> about 500 sitemap files done
[21:18] <honestdua> so only 20% left
[21:19] <honestdua> I feel like it's perfectly legit to do this because they make everything public anyway
[21:19] <honestdua> but if it was paywalled or whatever
[21:19] <honestdua> it would not be ok
[21:20] <honestdua> my wife is sitting here asking me to make sure I do not endanger her or our family with "hacker stuff"
[21:21] <honestdua> but sitemaps are public, so no problem
[21:21] <godane> PRO TIP: don't endanger family with "hacker stuff"
[21:21] <honestdua> yes, never do that
[21:22] <honestdua> besides, I'm a security professional; I need to keep my rep solid
[21:22] <godane> going after sitemaps should be fine
[21:23] <honestdua> ok, just got to 6 million
[21:26] <honestdua> ok, so the final Allura links file is 273.4 megs of just links
[21:27] <honestdua> and the main sitemap is 73 or so megs of just links
[21:27] <SketchCow> See, here I'm sad.
[21:27] <honestdua> and they compress down to a simple 40 meg zip
[21:28] <SketchCow> Because this is IRC, and not a dystopian hacker movie
[21:28] <honestdua> https://dl.dropboxusercontent.com/u/18627325/sourceforge.net/sf_net_sitemap.zip is syncing now
[21:28] <SketchCow> Because then I'd turn up to the rafters with all the insane kids in harnesses and hammocks with laptops
[21:28] <SketchCow> and go "AND WHAT ABOUT THE LEGALITY, BOYS"
[21:28] <SketchCow> And, like, hundreds of soda cans just come raining down in laughter
[21:29] <SketchCow> But we'll settle for "let's worry about saving the data"
[21:30] <honestdua> that data seems very valuable; at this point it should be everything you need to automate the collection of every project's data
[21:36] <midas> honestdua: tell your wife you're just grabbing public data anyway. nobody got sued for visiting a site. EVER.
[21:37] <midas> that's all we do, but then at warp speed 10
[21:37] <SketchCow> 17:33 < honestdua> my wife is sitting here asking me to make sure I do not endanger her or our family with "hacker stuff"
[21:37] <SketchCow> Then hand the fun off to us, we'll do the rest.
[21:37] <SketchCow> You've done enough!
[21:38] <Smiley> plz make sure you upload the info asap
[21:38] <Smiley> just in case you disappear.
[21:39] <honestdua> oh, btw, the data shows that over 3.28 million of the 6.15+ million links are for user profiles
[21:39] <honestdua> that zip is the output of my code
[21:40] <honestdua> you want the code as well?
[21:40] <honestdua> it's just like 7 lines of C#, but it should run on Mono
[21:41] <honestdua> https://dl.dropboxusercontent.com/u/18627325/sourceforge.net/sf_sitemap_sucker%20-%20Copy.zip
[21:41] <honestdua> there you go
[21:43] <honestdua> ok, between the sf_net_sitemap.zip file above and that 9k zip of the code to generate it, you have everything you need to duplicate my efforts to use the public sitemap to generate a map of SF and its projects/users
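The user-profile breakdown at 21:39 (over 3.28 million of the 6.15+ million links being profile pages) could be reproduced from the link file with a simple prefix count. The "/u/" path for SourceForge profile pages and the input file name are assumptions about the site's URL layout, not something stated in the log.

    // Hedged sketch: split the link file into profile links and everything else.
    // The "/u/" profile prefix and "sf_net_links.txt" are assumptions.
    using System;
    using System.IO;

    class LinkBreakdown
    {
        static void Main()
        {
            long profiles = 0, other = 0;

            foreach (var url in File.ReadLines("sf_net_links.txt"))
            {
                if (url.Contains("sourceforge.net/u/"))
                    profiles++;
                else
                    other++;
            }

            Console.WriteLine("user profile links: " + profiles);
            Console.WriteLine("other links: " + other);
        }
    }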
[21:44] <db48x> sweet
[21:45] <honestdua> https://dl.dropboxusercontent.com/u/18627325/sourceforge.net/sf_net_sitemap.zip is the output
[21:46] <honestdua> not bad for an hour or so's work
[21:46] <honestdua> I'm on twitter as @honestduane, and it looks like my wife wants me to go to Home Depot and get a bolt to fix the lawn mower.. so I may end up having to mow the lawn as well, knowing how things go :/
[21:46] <honestdua> Either way, hope that helps
[21:47] <honestdua> I need to log off; have a good day
[21:48] <honestdua> just going to let this idle
[21:48] <db48x> honestdua: you too :)
[21:48] <db48x> and thanks :)
[21:53] <db48x> nooo, everyone is catching up: http://argonath.db48x.net/
[22:14] <midas> what is it, db48x?
[23:01] <amerrykan> wow, that's huge. is there some concern about SF going away?
[23:03] <SketchCow> By most assumptions, it has
[23:11] <db48x> midas: we're scraping the pixori.al URL shortener in preparation for grabbing all the pixorial videos we can find
[23:22] <honestdua> ok, back; should probably get a real IRC client installed
[23:29] <honestdua> ok, shutting this down
[23:58] <honestdua> Question: Is BitTorrent the only way to get all this data you guys are archiving?
[23:58] <honestdua> that doesn't seem like a very stable storage medium.
[23:59] <db48x> it's not a storage medium, it's a delivery mechanism
[23:59] <honestdua> Well, what if I want to get a copy of everything?
[23:59] <honestdua> 10 TB from BitTorrent would take forever