[02:55] <SketchCow> There are geniuses in this channel.
[02:55] <SketchCow> https://archive.org/stream/zx_Mushroom_Alley_1984_Mogul_Communications/Mushroom_Alley_1984_Mogul_Communications.z80?module=zx81&scale=2
[02:55] <SketchCow> What is it getting a 404 on
[02:55] <SketchCow> (I'm looking into it myself)
[02:59] <garyrh> https://archive.org/cors/jsmess_config_v1/zx81.cfg shows up in network requests and 404s
[02:59] <SketchCow> Thank you.
[03:00] <garyrh> :)
[04:38] <garyrh> https://medium.com/@vijayp/of-taxis-and-rainbows-f6bc289679a1
[04:55] <yipdw> garyrh: guess we better download that
[04:58] <garyrh> you mean the data? someone already got that: https://archive.org/details/nycTaxiTripData2013
[19:01] <honestdua> hello
[19:02] <honestdua> Jason told me on twitter to come here and ask about backups of sourceforge.net
[19:03] <hduane> .
[19:06] <hduane> hello? Is anybody alive?
[19:06] <hduane> Or have the bots fully taken over?
[19:08] * hduane thinks to himself.. "What could get people fired up enough to respond?"
[19:08] <hduane> Destroy all old backups?
[19:08] <hduane> *jk*
[19:09] <db48x> howdy
[19:09] <hduane> http://www.quickmeme.com/meme/354li3
[19:09] <hduane> Hello
[19:09] <db48x> someone is alive somewhere, I am sure
[19:09] <hduane> Jason Scott sent me here on twitter to ask about backups of sourceforge
[19:09] <db48x> excellent
[19:10] <hduane> I'm trying to get copies
[19:10] <db48x> backing up forges is tricky
[19:10] <db48x> esr has written some nice software that can back up a single project from a forge, with very high fidelity, but only if you are an admin of that project
[19:11] <hduane> https://twitter.com/textfiles/status/480766550593773569
[19:12] <hduane> Is it just me or is SF not very open
[19:12] <db48x> it's from an older era
[19:12] <hduane> it seems to use a lot of tricks to keep you from getting copies unless you explicitly click a link
[19:13] <db48x> http://esr.ibiblio.org/?p=1369
[19:13] <db48x> http://home.gna.org/forgeplucker/
[19:14] <hduane> http://www.coderforgood.com
[19:15] <hduane> is a little thing I'm starting as well, and that's one of the reasons I wanted to have a backup of SF
[19:15] <hduane> is one of the things that I am involved with, that's why I wanted a copy of SF's data
[19:15] <hduane> I think I idled out
[19:16] <hduane> oh ok just lag
[19:16] <db48x> cool :)
[19:17] <hduane> Well, if I wanted to mirror SF, or at least get a list of all its projects, that doesn't seem to be that hard, if only due to how the sitemap is set up
[19:17] <hduane> if google can do it so can we
[19:18] <hduane> in fact I just looked at the sitemap and was given a file that leads to other files, for all projects, sitemapped out into separate files
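What he's describing is the standard sitemap-index layout: one index file whose <sitemap>/<loc> entries each point at a chunk of per-project URLs. A minimal sketch of reading that index is below; the sourceforge.net/sitemap.xml location and the standard schema are assumptions on my part, not anything taken from his actual code.

    // Sketch: read a sitemap index and print the sub-sitemap URLs it points to.
    // Assumes the standard <sitemapindex>/<sitemap>/<loc> format; the index URL
    // is an assumption, not confirmed anywhere in this log.
    using System;
    using System.Net;
    using System.Xml.Linq;

    class SitemapIndex
    {
        static void Main()
        {
            XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
            using (var web = new WebClient())
            {
                string xml = web.DownloadString("https://sourceforge.net/sitemap.xml");
                var index = XDocument.Parse(xml);
                // Each <loc> in the index names one chunk file full of page URLs.
                foreach (var loc in index.Descendants(ns + "loc"))
                    Console.WriteLine(loc.Value.Trim());
            }
        }
    }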
[19:19] <DFJustin> "data jail", that's a good term
[19:19] <db48x> Yes, grabbing a list of all projects on a forge is fairly straight-forward
[19:19] <db48x> the real work will be in adapting ForgePlucker
[19:20] <db48x> last time I used it it didn't handle failure gracefully
[19:20] <db48x> so it just doesn't do anything useful if you're not an admin of the project
[19:20] <hduane> well SF uses multiple types of data storage
[19:20] <hduane> it's old-style CVS, and not-as-old but still old SVN
[19:21] <db48x> yes, plus bug tracking, mailing list, etc
[19:21] <hduane> To me the code is the priority
[19:21] <hduane> the most
[19:21] <hduane> all the other stuff matters as well
[19:21] <hduane> but the code is what is important
[19:21] <db48x> yes :)
[19:22] <db48x> the other stuff is context and community, but the code is the thing itself in a sense
[19:23] <db48x> if all you want is the code, then the path is quite straight-forward
[19:23] <db48x> cloning a cvs repository isn't hard (but hard to do perfectly), svn is easier
[19:23] <db48x> if you want to go the extra mile, then consider creating a job for our ArchiveTeam Warrior
[19:24] <db48x> then you'll have a couple of dozen people helping out with the downloading automatically
[19:24] <db48x> I'd be more interested in extending ForgePlucker though, and then making a warrior task out of that
[19:25] <db48x> http://archiveteam.org/index.php?title=Warrior
[19:26] <hduane> Well I just signed up for GNA and am waiting on the email
[19:26] <hduane> that said I was hoping that I could just ask SF for a copy and get one
[19:27] <hduane> ok on gna as honestduane
[19:28] <db48x> that would be nice :)
[19:29] <hduane> Well I did send SF an email and a tweet asking about this, but it was yesterday, on a Saturday
[19:32] <hduane> Do you know roughly how much data is on SF?
[19:35] <db48x> more than Geocities, less than Google Video?
[19:37] <hduane> so a lot? ;)
[19:37] <hduane> well I don't have access to the source for the plucker, but I can look deeper into this issue of the SF sitemap
[19:39] <Nemo_bis> There were some papers on SourceForge, they probably mention the total size but I forgot
[19:39] <hduane> let me see if I can calculate it myself using the sitemap as input
[19:39] <db48x> http://svn.gna.org/viewcvs/forgeplucker/trunk/
[19:39] <db48x> Geocities was only a terabyte
[19:40] <hduane> yet I remember when it seemed to be at least half the internet
[19:44] <joepie91_> nobody pinged SketchCow about hduane yet? :P
[19:45] <joepie91_> hduane: it was, probably
[19:45] <joepie91_> content-wise
[19:45] <db48x> yea, it was the place to be
[20:38] <honestdua> ok so I was able to collect all the links but interestingly enough it looks like they have an Apache Allura sitemap as well
[20:40] <honestdua> also the patterns used by each project in CVS are pretty predictable
[20:41] <honestdua> and they're spelled out in the robots.txt file so google and company won't mirror them
[20:42] <db48x> yea :)
[20:42] <SketchCow> Hey.
[20:43] <db48x> I dislike robots.txt in general, but they are occasionally awesome
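Finding the secondary Allura sitemap needs nothing exotic either: robots.txt's standard Sitemap: directive advertises sitemap locations even while its Disallow rules fence off the repository paths. A rough sketch of pulling those directives out follows; which URLs SourceForge actually listed there isn't reproduced here.

    // Sketch: list the Sitemap: directives in robots.txt.
    // The Sitemap: field is part of the common robots.txt conventions.
    using System;
    using System.Net;

    class RobotsSitemaps
    {
        static void Main()
        {
            using (var web = new WebClient())
            {
                string robots = web.DownloadString("https://sourceforge.net/robots.txt");
                foreach (var line in robots.Split('\n'))
                {
                    var trimmed = line.Trim();
                    if (trimmed.StartsWith("Sitemap:", StringComparison.OrdinalIgnoreCase))
                        Console.WriteLine(trimmed.Substring("Sitemap:".Length).Trim());
                }
            }
        }
    }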
[20:43] <SketchCow> I figured hduane could get the archive team outlook
[20:43] <honestdua> Also found 1,505,096 links in the main sitemap
[20:43] <honestdua> hey Jason
[20:43] <honestdua> they all seem to follow a common pattern of about 3-4 links per project
[20:44] <SketchCow> Grab it all.
[20:44] <SketchCow> Did we LOSE freshmeat or is freshmeat around in some way?
[20:45] <antithesi> Yo
[20:45] <honestdua> I'm running my sitemap slurper right now.. it doesn't exactly respect robots.txt
[20:45] <antithesi> Can you guys archive userscripts.org? I'm afraid it'll die
[20:45] <db48x> SketchCow: it's still up
[20:46] <honestdua> but the link file for the main sitemap is about 75 megs of just urls, one per line
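The one-URL-per-line file he describes falls out of the same XML handling: fetch each chunk the index points to, pull out every <loc>, and append it to a flat text file. A hedged sketch of that loop; the file names here are placeholders, not his.

    // Sketch: turn downloaded sitemap chunks into a flat "one URL per line" file.
    // sitemap_chunks.txt stands in for the list produced by the index step above;
    // both file names are made up for the example.
    using System.IO;
    using System.Net;
    using System.Xml.Linq;

    class SitemapToLines
    {
        static void Main()
        {
            XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
            string[] chunkUrls = File.ReadAllLines("sitemap_chunks.txt");
            using (var web = new WebClient())
            using (var output = new StreamWriter("sf_net_sitemap_urls.txt"))
            {
                foreach (var chunk in chunkUrls)
                {
                    var doc = XDocument.Parse(web.DownloadString(chunk));
                    foreach (var loc in doc.Descendants(ns + "loc"))
                        output.WriteLine(loc.Value.Trim());
                }
            }
        }
    }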
[20:46] <honestdua> github has gotten so big.. I worry about it being the main choice
[20:46] <honestdua> no it's gone
[20:46] <honestdua> is what was posted on slashdot etc
[20:46] <honestdua> I heard about it on reddit first, then slashdot, that it was gone
[20:46] <db48x> the site is still there
[20:47] <db48x> antithesi: I can't load it; it just times out
[20:48] <antithesi> db48x it's still available at userscripts.org:8080
[20:48] <honestdua> freshmeat redirects to something else, last I checked
[20:49] <db48x> honestdua: it was renamed to freecode a while back
[20:50] <antithesi> Okay, looks like there's http://userscripts-mirror.org/ too, but that one isn't downloadable
[20:51] <honestdua> hmm OOM error from trying the secondary Apache Allura sitemap
[20:52] <honestdua> it's over 600 files so that may be it
[20:52] <honestdua> hmm.. recoding it to use less memory at the expense of speed..
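A common way to make that memory-for-speed trade is to stop loading each sitemap file whole and stream it with XmlReader instead, writing URLs to disk as they appear and keeping only the dedup set in memory. A sketch of that shape, not his actual fix; the directory and output names are placeholders.

    // Sketch: stream sitemap chunks with XmlReader so only the dedup set
    // stays resident, instead of whole parsed documents. Paths are placeholders.
    using System.Collections.Generic;
    using System.IO;
    using System.Xml;

    class StreamingSlurp
    {
        static void Main()
        {
            var seen = new HashSet<string>();
            using (var output = new StreamWriter("allura_urls.txt"))
            {
                foreach (var file in Directory.GetFiles("allura_sitemaps", "*.xml"))
                {
                    using (var reader = XmlReader.Create(file))
                    {
                        while (reader.Read())
                        {
                            if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "loc")
                            {
                                string url = reader.ReadElementContentAsString().Trim();
                                if (seen.Add(url))   // Add returns false for duplicates
                                    output.WriteLine(url);
                            }
                        }
                    }
                }
            }
        }
    }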
[20:55] <honestdua> oh and I am on this cruddy irc cgi chat client
[20:55] <honestdua> it's been years since I had something like mIRC installed
[20:55] <honestdua> not even sure if mIRC is still around
[20:55] <honestdua> so anyway I may timeout
[20:55] <honestdua> as I work
[21:05] <honestdua> ok 615 sitemap files for the second part
[21:05] <honestdua> at sitemap 7 and we have over 379k urls
[21:05] <honestdua> distinct urls*
[21:06] <honestdua> as I am making sure they are all unique
[21:06] <honestdua> just passed over a million urls as of sitemap number 100
[21:08] <Smiley> looking fun
[21:08] <honestdua> I think if this fails I'm just going to start having it download the files and work on processing as a separate task
[21:08] <honestdua> just passed over 2 million as of file 200
[21:09] <Smiley> you put this list online somewhere yet?
[21:09] <honestdua> so if the numbers stay sane SF has almost 7 million links
[21:10] <honestdua> and if it's true to the 3-4 links per project that I'm seeing, under 2 million projects
[21:11] <honestdua> looks like some of the optimizations I made also got rid of my OOM *crosses fingers*
[21:12] <honestdua> so up to around 2 million or so projects at most, though of course this also includes user profiles
[21:12] <honestdua> so the number may be bad
[21:12] <honestdua> ok over half done
[21:15] <honestdua> just passed 400 files and over 4 million links
[21:16] <honestdua> I am so glad I have 16GB of RAM on this thing
[21:16] <honestdua> http://jeffcovey.net/2014/06/19/freshmeat-net-1997-2014/
[21:17] <SketchCow> Anyway, the summary is that archive team downloads everything it can, and we don't ask nicely.
[21:17] <honestdua> What about legality?
[21:18] <honestdua> I'm going to put it in my dropbox after its done and then send you links to the public urls
[21:18] <honestdua> Its just the urls
[21:18] <honestdua> about 500 sitemap files done
[21:18] <honestdua> so only 20% left
[21:19] <honestdua> I feel like it's perfectly legit to do this because they make everything public anyway
[21:19] <honestdua> but if it was paywalled or whatever
[21:19] <honestdua> it would not be ok
[21:20] <honestdua> my wife is sitting here asking me to make sure I do not endanger her or our family with "hacker stuff"
[21:21] <honestdua> but sitemaps are public so no problem
[21:21] <godane> PRO TIP: don't endanger family with "hacker stuff"
[21:21] <honestdua> yes never do that
[21:22] <honestdua> besides I'm a security professional, I need to keep my rep solid
[21:22] <godane> going after sitemaps should be find
[21:22] <godane> *fine
[21:23] <honestdua> ok just got to 6 million
[21:26] <honestdua> ok so final allura links file is 273.4 megs in size of just links
[21:27] <honestdua> and the main sitemap is 73 or so megs of just links
[21:27] <SketchCow> See, here I'm sad.
[21:27] <honestdua> and they compress down to a simple 40 meg zip
[21:28] <SketchCow> Because this is IRC, and not a dystopian hacker movie
[21:28] <honestdua> https://dl.dropboxusercontent.com/u/18627325/sourceforge.net/sf_net_sitemap.zip is syncing now
[21:28] <SketchCow> Because then I'd turn up to the rafters with all the insane kids in harnesses and hammocks with laptops
[21:28] <SketchCow> and go "AND WHAT ABOUT THE LEGALITY, BOYS"
[21:28] <SketchCow> And, like, hundreds of soda cans just come raining down in laughter
[21:29] <SketchCow> But we'll settle for "let's worry about saving the data"
[21:30] <honestdua> that data seems very valuable; at this point it should be everything you need to automate the collection of every project's data
[21:36] <midas> honestdua: tell your wife you're just grabbing public data anyway. nobody got sued for visiting a site. EVER.
[21:37] <midas> that's all we do, but then at warp speed 10
[21:37] <SketchCow> 17:33 < honestdua> my wife is sitting here asking me to make sure I do not endanger her or our family with "hacker stuff"
[21:37] <SketchCow> Then hand the fun off to us, we'll do the rest.
[21:37] <SketchCow> You've done enough!
[21:38] <Smiley> plz make sure you upload infos asap
[21:38] <Smiley> just in case you disappear.
[21:39] <honestdua> oh btw the data shows that over 3.28 million of the 6.15+ million links are for user profiles
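That split can be read straight off the URL paths once the flat list exists. The prefixes below (/u/ and /users/ for profiles, /projects/ and /p/ for project pages) are my assumption about SourceForge's layout at the time, not something stated in the log.

    // Sketch: rough count of user-profile links vs project links in the URL list.
    // The path prefixes are assumptions; adjust them to whatever the data shows.
    using System;
    using System.IO;

    class ClassifyUrls
    {
        static void Main()
        {
            int users = 0, projects = 0, other = 0;
            foreach (var url in File.ReadLines("sf_net_sitemap_urls.txt"))
            {
                var path = new Uri(url).AbsolutePath;
                if (path.StartsWith("/u/") || path.StartsWith("/users/"))
                    users++;
                else if (path.StartsWith("/projects/") || path.StartsWith("/p/"))
                    projects++;
                else
                    other++;
            }
            Console.WriteLine("user profiles: {0}, project pages: {1}, other: {2}",
                              users, projects, other);
        }
    }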
[21:39] <honestdua> that zip is the output of my code
[21:40] <honestdua> you want the code as well?
[21:40] <honestdua> it's just like 7 lines of C# but it should run on Mono
[21:41] <honestdua> https://dl.dropboxusercontent.com/u/18627325/sourceforge.net/sf_sitemap_sucker%20-%20Copy.zip
[21:41] <honestdua> there you go
[21:43] <honestdua> ok, between the sf_net_sitemap.zip file above and that 9k zip of the code to generate it, you have everything you need to duplicate my efforts to use the public sitemap to generate a map of SF and its projects/users
[21:44] <db48x> sweet
[21:45] <honestdua> https://dl.dropboxusercontent.com/u/18627325/sourceforge.net/sf_net_sitemap.zip is the output
[21:46] <honestdua> not bad for an hour or so's work
[21:46] <honestdua> I'm on twitter as @honestduane and it looks like my wife wants me to go to home depot and get a bolt to fix the lawn mower.. so I may end up having to mow the lawn as well knowing how things go :/
[21:46] <honestdua> Either way, hope that helps
[21:47] <honestdua> I need to log, have a good day
[21:48] <honestdua> just going to let this idle
[21:48] <db48x> honestdua: you too :)
[21:48] <db48x> and thanks :)
[21:53] <db48x> nooo, everyone is catching up: http://argonath.db48x.net/
[22:14] <midas> what is it db48x ?
[23:01] <amerrykan> wow, that's huge.  is there some concern about sf going away?
[23:03] <SketchCow> By most assumptions, it has
[23:11] <db48x> midas: we're scraping the pixori.al url shortener in preparation for grabbing all the pixorial videos we can find
[23:22] <honestdua> ok back, should probably get a real irc client installed
[23:29] <honestdua> ok shutting this down
[23:58] <honestdua> Question: Is bittorrent the only way to get all this data you guys are archiving?
[23:58] <honestdua> that doesn't seem like a very stable storage medium.
[23:59] <db48x> it's not a storage medium, it's a delivery mechanism
[23:59] <honestdua> Well what if I want to get a copy of everything.
[23:59] <honestdua> 10 tb from bittorrent would take forever