[02:55] <SketchCow> There are geniuses in this channel.
[02:55] <SketchCow> https://archive.org/stream/zx_Mushroom_Alley_1984_Mogul_Communications/Mushroom_Alley_1984_Mogul_Communications.z80?module=zx81&scale=2
[02:55] <SketchCow> What is it getting a 404 on
[02:55] <SketchCow> (I'm looking into it myself)
[02:59] <garyrh> https://archive.org/cors/jsmess_config_v1/zx81.cfg shows up in network requests and 404s
[02:59] <SketchCow> Thank you.
[03:00] <garyrh> :)
[04:38] <garyrh> https://medium.com/@vijayp/of-taxis-and-rainbows-f6bc289679a1
[04:55] <yipdw> garyrh: guess we better download that
[04:58] <garyrh> you mean the data? someone already got that: https://archive.org/details/nycTaxiTripData2013
[19:01] <honestdua> hello
[19:02] <honestdua> Jason told me on twitter to come here and ask about backups of sourceforge.net
[19:03] <hduane> .
[19:06] <hduane> hello? Is anybody alive?
[19:06] <hduane> Or have the bots fully taken over?
[19:08] * hduane thinks to himself.. "What could get people fired up enough to respond?"
[19:08] <hduane> Destroy all old backups?
[19:08] <hduane> *jk*
[19:09] <db48x> howdy
[19:09] <hduane> http://www.quickmeme.com/meme/354li3
[19:09] <hduane> Hello
[19:09] <db48x> someone is alive somewhere, I am sure
[19:09] <hduane> Jason Scott sent me here on twitter to ask about backups of sourceforge
[19:09] <db48x> excellent
[19:10] <hduane> I'm trying to get copies
[19:10] <db48x> backing up forges is tricky
[19:10] <db48x> esr has written some nice software that can back up a single project from a forge, with very high fidelity, but only if you are an admin of that project
[19:11] <hduane> https://twitter.com/textfiles/status/480766550593773569
[19:12] <hduane> Is it just me or is SF not very open
[19:12] <db48x> it's from an older era
[19:12] <hduane> it seems to use a lot of tricks to keep you from getting copies unless you explicitly click a link
[19:13] <db48x> http://esr.ibiblio.org/?p=1369
[19:13] <db48x> http://home.gna.org/forgeplucker/
[19:14] <hduane> http://www.coderforgood.com
[19:15] <hduane> is a little thing I'm starting as well, and that's one of the reasons I wanted to have a backup of SF
[19:15] <hduane> is one of the things that I am involved with, that's why I wanted a copy of SF's data
[19:15] <hduane> I think I idled out
[19:16] <hduane> oh ok just lag
[19:16] <db48x> cool :)
[19:17] <hduane> Well, if I wanted to mirror SF, or at least get a list of all its projects, that doesn't seem to be that hard, if only due to how the sitemap is set up
[19:17] <hduane> if google can do it so can we
[19:18] <hduane> in fact I just looked at the sitemap and was given a file that leads to other files, for all projects, sitemapped out into separate files
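What he's describing is the standard sitemap-index layout: one index file whose <sitemap>/<loc> entries each point at a chunk of per-project URLs. A minimal sketch of reading that index is below; the sourceforge.net/sitemap.xml location and the standard schema are assumptions on my part, not anything taken from his actual code.

    // Sketch: read a sitemap index and print the sub-sitemap URLs it points to.
    // Assumes the standard <sitemapindex>/<sitemap>/<loc> format; the index URL
    // is an assumption, not confirmed anywhere in this log.
    using System;
    using System.Net;
    using System.Xml.Linq;

    class SitemapIndex
    {
        static void Main()
        {
            XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
            using (var web = new WebClient())
            {
                string xml = web.DownloadString("https://sourceforge.net/sitemap.xml");
                var index = XDocument.Parse(xml);
                // Each <loc> in the index names one chunk file full of page URLs.
                foreach (var loc in index.Descendants(ns + "loc"))
                    Console.WriteLine(loc.Value.Trim());
            }
        }
    }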
[19:19] <DFJustin> "data jail", that's a good term
[19:19] <db48x> Yes, grabbing a list of all projects on a forge is fairly straight-forward
[19:19] <db48x> the real work will be in adapting ForgePlucker
[19:20] <db48x> last time I used it it didn't handle failure gracefully
[19:20] <db48x> so it just doesn't do anything useful if you're not an admin of the project
[19:20] <hduane> well SF uses multiple types of data storage
[19:20] <hduane> it's old-style CVS, and not-as-old but still old SVN
[19:21] <db48x> yes, plus bug tracking, mailing list, etc
[19:21] <hduane> To me the code is the priority
[19:21] <hduane> the most
[19:21] <hduane> all the other stuff matters as well
[19:21] <hduane> but the code is what is important
[19:21] <db48x> yes :)
[19:22] <db48x> the other stuff is context and community, but the code is the thing itself in a sense
[19:23] <db48x> if all you want is the code, then the path is quite straight-forward
[19:23] <db48x> cloning a cvs repository isn't hard (but hard to do perfectly), svn is easier
[19:23] <db48x> if you want to go the extra mile, then consider creating a job for our ArchiveTeam Warrior
[19:24] <db48x> then you'll have a couple of dozen people helping out with the downloading automatically
[19:24] <db48x> I'd be more interested in extending ForgePlucker though, and then making a warrior task out of that
[19:25] <db48x> http://archiveteam.org/index.php?title=Warrior
[19:26] <hduane> Well I just signed up for GNA and am waiting on the email
[19:26] <hduane> that said I was hoping that I could just ask SF for a copy and get one
[19:27] <hduane> ok on gna as honestduane
[19:28] <db48x> that would be nice :)
[19:29] <hduane> Well I did send SF an email and a tweet asking about this, but it was yesterday, on a Saturday
[19:32] <hduane> Do you know roughly how much data is on SF?
[19:35] <db48x> more than Geocities, less than Google Video?
[19:37] <hduane> so a lot? ;)
[19:37] <hduane> well I don't have access to the source for the plucker, but I can look deeper into this issue of the SF sitemap
[19:39] <Nemo_bis> There were some papers on SourceForge, they probably mention the total size but I forgot
[19:39] <hduane> let me see if I can calculate it myself using the sitemap as input
[19:39] <db48x> http://svn.gna.org/viewcvs/forgeplucker/trunk/
[19:39] <db48x> Geocities was only a terabyte
[19:40] <hduane> yet I remember when it seemed to be at least half the internet
[19:44] <joepie91_> nobody pinged SketchCow about hduane yet? :P
[19:45] <joepie91_> hduane: it was, probably
[19:45] <joepie91_> content-wise
[19:45] <db48x> yea, it was the place to be
[20:38] <honestdua> ok so I was able to collect all the links but interestingly enough it looks like they have an Apache Allura sitemap as well
[20:40] <honestdua> also the patterns used by each project in CVS are pretty predictable
[20:41] <honestdua> and they're spelled out in the robots.txt file so google and company won't mirror them
[20:42] <db48x> yea :)
[20:42] <SketchCow> Hey.
[20:43] <db48x> I dislike robots.txt in general, but they are occasionally awesome
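Finding the secondary Allura sitemap needs nothing exotic either: robots.txt's standard Sitemap: directive advertises sitemap locations even while its Disallow rules fence off the repository paths. A rough sketch of pulling those directives out follows; which URLs SourceForge actually listed there isn't reproduced here.

    // Sketch: list the Sitemap: directives in robots.txt.
    // The Sitemap: field is part of the common robots.txt conventions.
    using System;
    using System.Net;

    class RobotsSitemaps
    {
        static void Main()
        {
            using (var web = new WebClient())
            {
                string robots = web.DownloadString("https://sourceforge.net/robots.txt");
                foreach (var line in robots.Split('\n'))
                {
                    var trimmed = line.Trim();
                    if (trimmed.StartsWith("Sitemap:", StringComparison.OrdinalIgnoreCase))
                        Console.WriteLine(trimmed.Substring("Sitemap:".Length).Trim());
                }
            }
        }
    }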
[20:43] <SketchCow> I figured hduane could get the archive team outlook
[20:43] <honestdua> Also found 1,505,096 links in the main sitemap
[20:43] <honestdua> hey Jason
[20:43] <honestdua> they all seem to follow a common pattern of about 3-4 links per project
[20:44] <SketchCow> Grab it all.
[20:44] <SketchCow> Did we LOSE freshmeat or is freshmeat around in some way?
[20:45] <antithesi> Yo
[20:45] <honestdua> I'm running my sitemap slurper right now.. it doesn't exactly respect robots.txt
[20:45] <antithesi> Can you guys archive userscripts.org? I'm afraid it'll die
[20:45] <db48x> SketchCow: it's still up
[20:46] <honestdua> but the link file for the main sitemap is about 75 megs of just urls, one per line
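The one-URL-per-line file he describes falls out of the same XML handling: fetch each chunk the index points to, pull out every <loc>, and append it to a flat text file. A hedged sketch of that loop; the file names here are placeholders, not his.

    // Sketch: turn downloaded sitemap chunks into a flat "one URL per line" file.
    // sitemap_chunks.txt stands in for the list produced by the index step above;
    // both file names are made up for the example.
    using System.IO;
    using System.Net;
    using System.Xml.Linq;

    class SitemapToLines
    {
        static void Main()
        {
            XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
            string[] chunkUrls = File.ReadAllLines("sitemap_chunks.txt");
            using (var web = new WebClient())
            using (var output = new StreamWriter("sf_net_sitemap_urls.txt"))
            {
                foreach (var chunk in chunkUrls)
                {
                    var doc = XDocument.Parse(web.DownloadString(chunk));
                    foreach (var loc in doc.Descendants(ns + "loc"))
                        output.WriteLine(loc.Value.Trim());
                }
            }
        }
    }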
[20:46] <honestdua> github has gotten so big.. I worry about it being the main choice
[20:46] <honestdua> no it's gone
[20:46] <honestdua> is what was posted on slashdot etc
[20:46] <honestdua> I heard about it on reddit first, then slashdot, that it was gone
[20:46] <db48x> the site is still there
[20:47] <db48x> antithesi: I can't load it; it just times out
[20:48] <antithesi> db48x it's still available at userscripts.org:8080
[20:48] <honestdua> freshmeat redirects to something else, last I checked
[20:49] <db48x> honestdua: it was renamed to freecode a while back
[20:50] <antithesi> Okay, looks like there's http://userscripts-mirror.org/ too, but that one isn't downloadable
[20:51] <honestdua> hmm OOM error from trying the secondary Apache Allura sitemap
[20:52] <honestdua> it's over 600 files so that may be it
[20:52] <honestdua> hmm.. recoding it to use less memory at the expense of speed..
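A common way to make that memory-for-speed trade is to stop loading each sitemap file whole and stream it with XmlReader instead, writing URLs to disk as they appear and keeping only the dedup set in memory. A sketch of that shape, not his actual fix; the directory and output names are placeholders.

    // Sketch: stream sitemap chunks with XmlReader so only the dedup set
    // stays resident, instead of whole parsed documents. Paths are placeholders.
    using System.Collections.Generic;
    using System.IO;
    using System.Xml;

    class StreamingSlurp
    {
        static void Main()
        {
            var seen = new HashSet<string>();
            using (var output = new StreamWriter("allura_urls.txt"))
            {
                foreach (var file in Directory.GetFiles("allura_sitemaps", "*.xml"))
                {
                    using (var reader = XmlReader.Create(file))
                    {
                        while (reader.Read())
                        {
                            if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "loc")
                            {
                                string url = reader.ReadElementContentAsString().Trim();
                                if (seen.Add(url))   // Add returns false for duplicates
                                    output.WriteLine(url);
                            }
                        }
                    }
                }
            }
        }
    }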
[20:55] <honestdua> oh and I am on this cruddy irc cgi chat client
[20:55] <honestdua> it's been years since I had something like mIRC installed
[20:55] <honestdua> not even sure if mIRC is still around
[20:55] <honestdua> so anyway I may timeout
[20:55] <honestdua> as I work
[21:05] <honestdua> ok 615 sitemap files for the second part
[21:05] <honestdua> at sitemap 7 and we have over 379k urls
[21:05] <honestdua> distinct urls*
[21:06] <honestdua> as I am making sure they are all unique
[21:06] <honestdua> just passed over a million urls as of sitemap number 100
[21:08] <Smiley> looking fun
[21:08] <honestdua> I think if this fails I'm just going to start having it download the files and work on processing as a separate task
[21:08] <honestdua> just passed over 2 million as of file 200
[21:09] <Smiley> you put this list online somewhere yet?
[21:09] <honestdua> so if the numbers stay sane SF has almost 7 million links
[21:10] <honestdua> and if it's true to the 3-4 links per project that I'm seeing, under 2 million projects
[21:11] <honestdua> looks like some of the optimizations I made also got rid of my OOM *crosses fingers*
[21:12] <honestdua> so up to around 2 million or so projects at most, though of course this also includes user profiles
[21:12] <honestdua> so the number may be bad
[21:12] <honestdua> ok over half done
[21:15] <honestdua> just passed 400 files and over 4 million links
[21:16] <honestdua> I am so glad I have 16GB of RAM on this thing
[21:16] <honestdua> http://jeffcovey.net/2014/06/19/freshmeat-net-1997-2014/
[21:17] <SketchCow> Anyway, the summary is that archive team downloads everything it can, and we don't ask nicely.
[21:17] <honestdua> What about legality?
[21:18] <honestdua> I'm going to put it in my dropbox after its done and then send you links to the public urls
[21:18] <honestdua> Its just the urls
[21:18] <honestdua> about 500 sitemap files done
[21:18] <honestdua> so only 20% left
[21:19] <honestdua> I feel like it's perfectly legit to do this because they make everything public anyway
[21:19] <honestdua> but if it was paywalled or whatever
[21:19] <honestdua> it would not be ok
[21:20] <honestdua> my wife is sitting here asking me to make sure I do not endanger her or our family with "hacker stuff"
[21:21] <honestdua> but sitemaps are public so no problem
[21:21] <godane> PRO TIP: don't endanger family with "hacker stuff"
[21:21] <honestdua> yes never do that
[21:22] <honestdua> besides I'm a security professional, I need to keep my rep solid
[21:22] <godane> going after sitemaps should be find
[21:22] <godane> *fine
[21:23] <honestdua> ok just got to 6 million
[21:26] <honestdua> ok so final allura links file is 273.4 megs in size of just links
[21:27] <honestdua> and the main sitemap is 73 or so megs of just links
[21:27] <SketchCow> See, here I'm sad.
[21:27] <honestdua> and they compress down to a simple 40 meg zip
[21:28] <SketchCow> Because this is IRC, and not a dystopian hacker movie
[21:28] <honestdua> https://dl.dropboxusercontent.com/u/18627325/sourceforge.net/sf_net_sitemap.zip is syncing now
[21:28] <SketchCow> Because then I'd turn up to the rafters with all the insane kids in harnesses and hammocks with laptops
[21:28] <SketchCow> and go "AND WHAT ABOUT THE LEGALITY, BOYS"
[21:28] <SketchCow> And, like, hundreds of soda cans just come raining down in laughter
[21:29] <SketchCow> But we'll settle for "let's worry about saving the data"
[21:30] <honestdua> that data seems very valuable; at this point it should be everything you need to automate the collection of every project's data
[21:36] <midas> honestdua: tell your wife you're just grabbing public data anyway. nobody got sued for visiting a site. EVER.
[21:37] <midas> that's all we do, but then at warp speed 10
[21:37] <SketchCow> 17:33 < honestdua> my wife is sitting here asking me to make sure I do not endanger her or our family with "hacker stuff"
[21:37] <SketchCow> Then hand the fun off to us, we'll do the rest.
[21:37] <SketchCow> You've done enough!
[21:38] <Smiley> plz make sure you upload infos asap
[21:38] <Smiley> just in case you disappear.
[21:39] <honestdua> oh btw the data shows that over 3.28 million of the 6.15+ million links are for user profiles
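That split can be read straight off the URL paths once the flat list exists. The prefixes below (/u/ and /users/ for profiles, /projects/ and /p/ for project pages) are my assumption about SourceForge's layout at the time, not something stated in the log.

    // Sketch: rough count of user-profile links vs project links in the URL list.
    // The path prefixes are assumptions; adjust them to whatever the data shows.
    using System;
    using System.IO;

    class ClassifyUrls
    {
        static void Main()
        {
            int users = 0, projects = 0, other = 0;
            foreach (var url in File.ReadLines("sf_net_sitemap_urls.txt"))
            {
                var path = new Uri(url).AbsolutePath;
                if (path.StartsWith("/u/") || path.StartsWith("/users/"))
                    users++;
                else if (path.StartsWith("/projects/") || path.StartsWith("/p/"))
                    projects++;
                else
                    other++;
            }
            Console.WriteLine("user profiles: {0}, project pages: {1}, other: {2}",
                              users, projects, other);
        }
    }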
[21:39] <honestdua> that zip is the output of my code
[21:40] <honestdua> you want the code as well?
[21:40] <honestdua> it's just like 7 lines of C# but it should run on Mono
[21:41] <honestdua> https://dl.dropboxusercontent.com/u/18627325/sourceforge.net/sf_sitemap_sucker%20-%20Copy.zip
[21:41] <honestdua> there you go
[21:43] <honestdua> ok, between the sf_net_sitemap.zip file above and that 9k zip of the code to generate it, you have everything you need to duplicate my efforts to use the public sitemap to generate a map of SF and its projects/users
[21:44] <db48x> sweet
[21:45] <honestdua> https://dl.dropboxusercontent.com/u/18627325/sourceforge.net/sf_net_sitemap.zip is the output
[21:46] <honestdua> not bad for an hour or so's work
[21:46] <honestdua> I'm on twitter as @honestduane and it looks like my wife wants me to go to home depot and get a bolt to fix the lawn mower.. so I may end up having to mow the lawn as well knowing how things go :/
[21:46] <honestdua> Either way, hope that helps
[21:47] <honestdua> I need to log, have a good day
[21:48] <honestdua> just going to let this idle
[21:48] <db48x> honestdua: you too :)
[21:48] <db48x> and thanks :)
[21:53] <db48x> nooo, everyone is catching up: http://argonath.db48x.net/
[22:14] <midas> what is it db48x ?
[23:01] <amerrykan> wow, that's huge.  is there some concern about sf going away?
[23:03] <SketchCow> By most assumptions, it has
[23:11] <db48x> midas: we're scraping the pixori.al url shortener in preparation for grabbing all the pixorial videos we can find
[23:22] <honestdua> ok back, should probably get a real irc client installed
[23:29] <honestdua> ok shutting this down
[23:58] <honestdua> Question: Is bittorrent the only way to get all this data you guys are archiving?
[23:58] <honestdua> that doesn't seem like a very stable storage medium.
[23:59] <db48x> it's not a storage medium, it's a delivery mechanism
[23:59] <honestdua> Well what if I want to get a copy of everything.
[23:59] <honestdua> 10 tb from bittorrent would take forever