#archiveteam 2014-06-22,Sun


Time Nickname Message
02:55 🔗 SketchCow There are geniuses in this channel.
02:55 🔗 SketchCow https://archive.org/stream/zx_Mushroom_Alley_1984_Mogul_Communications/Mushroom_Alley_1984_Mogul_Communications.z80?module=zx81&scale=2
02:55 🔗 SketchCow What is it getting a 404 on
02:55 🔗 SketchCow (I'm looking into it myself)
02:59 🔗 garyrh https://archive.org/cors/jsmess_config_v1/zx81.cfg shows up in network requests and 404s
02:59 🔗 SketchCow Thank you.
03:00 🔗 garyrh :)
04:38 🔗 garyrh https://medium.com/@vijayp/of-taxis-and-rainbows-f6bc289679a1
04:55 🔗 yipdw garyrh: guess we better download that
04:58 🔗 garyrh you mean the data? someone already got that: https://archive.org/details/nycTaxiTripData2013
19:01 🔗 honestdua hello
19:02 🔗 honestdua Jason told me on twitter to come here and ask about backups of sourceforge.net
19:03 🔗 hduane .
19:06 🔗 hduane hello? Is anybody alive?
19:06 🔗 hduane Or have the bots fully taken over?
19:08 🔗 * hduane thinks to himself.. "What could get people fired up a bit enough to respond?"
19:08 🔗 hduane Destroy all old backups?
19:08 🔗 hduane *jk*
19:09 🔗 db48x howdy
19:09 🔗 hduane http://www.quickmeme.com/meme/354li3
19:09 🔗 hduane Hello
19:09 🔗 db48x someone is alive somewhere, I am sure
19:09 🔗 hduane Jason Scott sent me here on twitter to ask about backups of sourceforge
19:09 🔗 db48x excellent
19:10 🔗 hduane I'm trying to get copies
19:10 🔗 db48x backing up forges is tricky
19:10 🔗 db48x esr has written some nice software that can back up a single project from a forge, with very high fidelity, but only if you are an admin of that project
19:11 🔗 hduane https://twitter.com/textfiles/status/480766550593773569
19:12 🔗 hduane Is it just me or is SF not very open
19:12 🔗 db48x it's from an older era
19:12 🔗 hduane it seems to use a lot of tricks to keep you from getting copies unless you explicitly click a link
19:13 🔗 db48x http://esr.ibiblio.org/?p=1369
19:13 🔗 db48x http://home.gna.org/forgeplucker/
19:14 🔗 hduane http://www.coderforgood.com
19:15 🔗 hduane is a little thing I'm starting as well, and that's one of the reasons I wanted to have a backup of SF
19:15 🔗 hduane is one of the things that I am involved with; that's why I wanted a copy of SF's data
19:15 🔗 hduane I think I idled out
19:16 🔗 hduane oh ok just lag
19:16 🔗 db48x cool :)
19:17 🔗 hduane Well, if I wanted to mirror SF, or at least get a list of all its projects, that doesn't seem to be that hard, if only due to how the sitemap is set up
19:17 🔗 hduane if google can do it so can we
19:18 🔗 hduane in fact I just looked at the sitemap and was given a file that leads to other files, for all projects, sitemapped out into separate files
19:19 🔗 DFJustin "data jail", that's a good term
19:19 🔗 db48x Yes, grabbing a list of all projects on a forge is fairly straight-forward
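[The sitemap-index walk described here follows the standard sitemaps.org protocol (`<sitemapindex>/<sitemap>/<loc>` in the index, `<urlset>/<url>/<loc>` in the child files). A minimal Python sketch; the exact SF index URL is not shown in the log, so fetching is left generic:]

```python
# Sketch of enumerating a site's pages via its sitemap index.
# Assumes the standard sitemaps.org XML namespace and layout.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def extract_locs(xml_text):
    """Return every <loc> URL in a sitemap or sitemap-index document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter("{%s}loc" % SITEMAP_NS)]

def walk_index(index_url):
    """Fetch a sitemap index, then each child sitemap, yielding page URLs."""
    with urllib.request.urlopen(index_url) as resp:
        children = extract_locs(resp.read().decode("utf-8"))
    for child in children:
        with urllib.request.urlopen(child) as resp:
            yield from extract_locs(resp.read().decode("utf-8"))
```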
19:19 🔗 db48x the real work will be in adapting ForgePlucker
19:20 🔗 db48x last time I used it it didn't handle failure gracefully
19:20 🔗 db48x so it just doesn't do anything useful if you're not an admin of the project
19:20 🔗 hduane well SF uses multiple types of data storage
19:20 🔗 hduane it's old-style CVS, and SVN, which is not as old but still old
19:21 🔗 db48x yes, plus bug tracking, mailing list, etc
19:21 🔗 hduane To me the code is the priority
19:21 🔗 hduane the most
19:21 🔗 hduane all the other stuff matters as well
19:21 🔗 hduane but the code is what is important
19:21 🔗 db48x yes :)
19:22 🔗 db48x the other stuff is context and community, but the code is the thing itself in a sense
19:23 🔗 db48x if all you want is the code, then the path is quite straight-forward
19:23 🔗 db48x cloning a CVS repository isn't hard (but hard to do perfectly); SVN is easier
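[For the "code is the priority" path, per-project checkout commands can be generated mechanically. The repository URL patterns below (`svn.code.sf.net/p/<name>/code`, `<name>.cvs.sourceforge.net`) are assumptions about SourceForge's layout at the time, not something stated in the log; verify them before relying on this sketch:]

```python
# Build (but do not run) the mirror commands for one project's
# SVN and CVS repositories. URL patterns are assumed, not verified.
def mirror_commands(project):
    svn = "svn checkout svn://svn.code.sf.net/p/%s/code %s-svn" % (project, project)
    cvs = ("rsync -av rsync://%s.cvs.sourceforge.net/cvsroot/%s/ %s-cvs/"
           % (project, project, project))
    return [svn, cvs]
```

[Each string could then be handed to `subprocess.run(shlex.split(cmd))` in a download loop.]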
19:23 🔗 db48x if you want to go the extra mile, then consider creating a job for our ArchiveTeam Warrior
19:24 🔗 db48x then you'll have a couple of dozen people helping out with the downloading automatically
19:24 🔗 db48x I'd be more interested in extending ForgePlucker though, and then making a warrior task out of that
19:25 🔗 db48x http://archiveteam.org/index.php?title=Warrior
19:26 🔗 hduane Well I just signed up for GNA and am waiting on the email
19:26 🔗 hduane that said I was hoping that I could just ask SF for a copy and get one
19:27 🔗 hduane ok on gna as honestduane
19:28 🔗 db48x that would be nice :)
19:29 🔗 hduane Well, I did send SF an email and a tweet asking about this, but it was yesterday, on a Saturday
19:32 🔗 hduane Do you know roughly how much data is on SF?
19:35 🔗 db48x more than Geocities, less than Google Video?
19:37 🔗 hduane so a lot? ;)
19:37 🔗 hduane well I don't have access to the source for the plucker, but I can look deeper into this issue of the sitemap of SF
19:39 🔗 Nemo_bis There were some papers on SourceForge, they probably mention the total size but I forgot
19:39 🔗 hduane let me see if I can calculate it myself using the sitemap as input
19:39 🔗 db48x http://svn.gna.org/viewcvs/forgeplucker/trunk/
19:39 🔗 db48x Geocities was only a terabyte
19:40 🔗 hduane yet I remember when it seemed to be at least half the internet
19:44 🔗 joepie91_ nobody pinged SketchCow about hduane yet? :P
19:45 🔗 joepie91_ hduane: it was, probably
19:45 🔗 joepie91_ content-wise
19:45 🔗 db48x yea, it was the place to be
20:38 🔗 honestdua ok so I was able to collect all the links but interestingly enough it looks like they have an Apache Allura sitemap as well
20:40 🔗 honestdua also the patterns used by each project in CVS are pretty predictable
20:41 🔗 honestdua and outlined in the robots.txt file for google and company to not mirror
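[Checking URL patterns against a robots.txt, as mentioned above, can be done with Python's standard `urllib.robotparser`. The Disallow paths below are invented for illustration; SourceForge's actual 2014 robots.txt is not reproduced in the log:]

```python
# Parse a robots.txt (here, hypothetical example rules) and test
# whether a crawler is permitted to fetch specific URLs.
import urllib.robotparser

rules = [
    "User-agent: *",
    "Disallow: /rest/",
    "Disallow: /export/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)  # parse in-memory lines instead of fetching over HTTP
```

[`rp.can_fetch("*", url)` then returns False for anything under the disallowed prefixes; as noted below, the slurper in question didn't exactly respect these rules.]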
20:42 🔗 db48x yea :)
20:42 🔗 SketchCow Hey.
20:43 🔗 db48x I dislike robots.txt in general, but they are occasionally awesome
20:43 🔗 SketchCow I figured hduane could get the archive team outlook
20:43 🔗 honestdua Also found 1,505,096 links in the main sitemap
20:43 🔗 honestdua hey Jason
20:43 🔗 honestdua they all seem to follow a common pattern of about 3-4 links per project
20:44 🔗 SketchCow Grab it all.
20:44 🔗 SketchCow Did we LOSE freshmeat or is freshmeat around in some way?
20:45 🔗 antithesi Yo
20:45 🔗 honestdua I'm running my sitemap slurper right now.. it doesn't exactly respect robots.txt
20:45 🔗 antithesi Can you guys archive userscripts.org? I'm afraid it'll die
20:45 🔗 db48x SketchCow: it's still up
20:46 🔗 honestdua but the link file for the main sitemap is about 75 megs of just urls, one per line
20:46 🔗 honestdua github has gotten so big.. I worry about it being the main choice
20:46 🔗 honestdua no, it's gone
20:46 🔗 honestdua is what was posted on slashdot etc
20:46 🔗 honestdua I heard about it on reddit first, then slashdot, that it was gone
20:46 🔗 db48x the site is still there
20:47 🔗 db48x antithesi: I can't load it; it just times out
20:48 🔗 antithesi db48x it's still available at userscripts.org:8080
20:48 🔗 honestdua it redirects to something else, last I checked, for freshmeat
20:49 🔗 db48x honestdua: it was renamed to freecode a while back
20:50 🔗 antithesi Okay, looks like there's http://userscripts-mirror.org/ too, but that one isn't downloadable
20:51 🔗 honestdua hmm OOM error from trying the secondary Apache Allura sitemap
20:52 🔗 honestdua it's over 600 files so that may be it
20:52 🔗 honestdua hmm.. recoding it to use less memory at the expense of speed..
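[The memory fix described here, sketched in Python (the actual tool was C# and its source isn't shown in the log): stream each downloaded sitemap file with `iterparse`, clearing elements as they finish, so only the set of seen URLs stays resident instead of 600+ fully parsed documents:]

```python
# Stream <loc> URLs out of many sitemap files with bounded memory,
# deduplicating on the fly. Trades speed for a flat memory profile.
import xml.etree.ElementTree as ET

def stream_unique_urls(paths):
    seen = set()
    for path in paths:
        for _event, elem in ET.iterparse(path):
            if elem.tag.endswith("loc") and elem.text:
                url = elem.text.strip()
                if url not in seen:
                    seen.add(url)
                    yield url
            elem.clear()  # drop the parsed subtree so memory stays flat
```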
20:55 🔗 honestdua oh and I am on this cruddy irc cgi chat client
20:55 🔗 honestdua it's been years since I had something like mIRC installed
20:55 🔗 honestdua not even sure if mIRC is still around
20:55 🔗 honestdua so anyway I may timeout
20:55 🔗 honestdua as I work
21:05 🔗 honestdua ok 615 sitemap files for the second part
21:05 🔗 honestdua at sitemap 7 and we have over 379k urls
21:05 🔗 honestdua distinct urls*
21:06 🔗 honestdua as I am making sure they are all unique
21:06 🔗 honestdua just passed over a million urls as of sitemap number 100
21:08 🔗 Smiley looking fun
21:08 🔗 honestdua I think if this fails I'm just going to start having it download the files and work on processing as a
21:08 🔗 honestdua separate task
21:08 🔗 honestdua just passed over 2 million as of file 200
21:09 🔗 Smiley you put this list online somewhere yet?
21:09 🔗 honestdua so if the numbers stay sane SF has almost 7 million links
21:10 🔗 honestdua and if its true to the 3-4 link per project that I'm seeing, under 2 million projects
21:11 🔗 honestdua looks like some of the optimizations I made also got rid of my OOM *crosses fingers*
21:12 🔗 honestdua so up to around 2 million or so projects at most; of course this also includes user profiles
21:12 🔗 honestdua so the number may be bad
21:12 🔗 honestdua ok over half done
21:15 🔗 honestdua just passed 400 files and over 4 million links
21:16 🔗 honestdua I am so glad I have 16 GB of RAM on this thing
21:16 🔗 honestdua http://jeffcovey.net/2014/06/19/freshmeat-net-1997-2014/
21:17 🔗 SketchCow Anyway, the summary is that archive team downloads everything it can, and we don't ask nicely.
21:17 🔗 honestdua What about legality?
21:18 🔗 honestdua I'm going to put it in my dropbox after it's done and then send you links to the public urls
21:18 🔗 honestdua Its just the urls
21:18 🔗 honestdua about 500 sitemap files done
21:18 🔗 honestdua so only 20% left
21:19 🔗 honestdua I feel like it's perfectly legit to do this because they make everything public anyway
21:19 🔗 honestdua but if it was paywalled or whatever
21:19 🔗 honestdua it would not be ok
21:20 🔗 honestdua my wife is sitting here asking me to make sure I do not endanger her or our family with "hacker stuff"
21:21 🔗 honestdua but sitemaps are public so no problem
21:21 🔗 godane PRO TIP: don't endanger family with "hacker stuff"
21:21 🔗 honestdua yes never do that
21:22 🔗 honestdua besides, I'm a security professional; I need to keep my rep solid
21:22 🔗 godane going after sitemaps should be find
21:22 🔗 godane *fine
21:23 🔗 honestdua ok just got to 6 million
21:26 🔗 honestdua ok so final allura links file is 273.4 megs in size of just links
21:27 🔗 honestdua and the main sitemap is 73 or so megs of just links
21:27 🔗 SketchCow See, here I'm sad.
21:27 🔗 honestdua and they compress down to a simple 40 meg zip
21:28 🔗 SketchCow Because this is IRC, and not a dystopian hacker movie
21:28 🔗 honestdua https://dl.dropboxusercontent.com/u/18627325/sourceforge.net/sf_net_sitemap.zip is syncing now
21:28 🔗 SketchCow Because then I'd turn up to the rafters with all the insane kids in harnesses and hammocks with laptops
21:28 🔗 SketchCow and go "AND WHAT ABOUT THE LEGALITY, BOYS"
21:28 🔗 SketchCow And, like, hundreds of soda cans just come raining down in laughter
21:29 🔗 SketchCow But we'll settle for "let's worry about saving the data"
21:30 🔗 honestdua that data seems very valuable; at this point it should be everything you need to automate the collection of every project's data
21:36 🔗 midas honestdua: tell your wife you're just grabbing public data anyway. nobody got sued for visiting a site. EVER.
21:37 🔗 midas that's all we do, but then at warp speed 10
21:37 🔗 SketchCow 17:33 < honestdua> my wife is sitting here asking me to make sure I do not endanger her or our family with "hacker stuff"
21:37 🔗 SketchCow Then hand the fun off to us, we'll do the rest.
21:37 🔗 SketchCow You've done enough!
21:38 🔗 Smiley plz make sure you upload infos asap
21:38 🔗 Smiley just incase you disappear.
21:39 🔗 honestdua oh btw the data shows that over 3.28 million of the 6.15+ million links are for user profiles
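[A sketch of the project-vs-profile split reported above. The path prefixes (`/projects/` for project pages, `/u/` for user profiles) are assumptions about SourceForge's URL scheme, not taken from the log:]

```python
# Classify a SourceForge URL by its path prefix (prefixes assumed).
from urllib.parse import urlparse

def classify(url):
    path = urlparse(url).path
    if path.startswith("/projects/"):
        return "project"
    if path.startswith("/u/"):
        return "user-profile"
    return "other"
```

[Running this over the 6.15M-link list and counting by category would reproduce the 3.28M user-profile figure, if the prefix assumptions hold.]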
21:39 🔗 honestdua that zip is the output of my code
21:40 🔗 honestdua you want the code as well?
21:40 🔗 honestdua it's just like 7 lines of C#, but it should run on Mono
21:41 🔗 honestdua https://dl.dropboxusercontent.com/u/18627325/sourceforge.net/sf_sitemap_sucker%20-%20Copy.zip
21:41 🔗 honestdua there you go
21:43 🔗 honestdua ok between the sf_net_sitemap.zip file above and that 9k zip of the code to generate it you have everything to duplicate my efforts to use the public sitemap to generate a map of sf and its projects/users
21:44 🔗 db48x sweet
21:45 🔗 honestdua https://dl.dropboxusercontent.com/u/18627325/sourceforge.net/sf_net_sitemap.zip is the output
21:46 🔗 honestdua not bad for an hour or so's work
21:46 🔗 honestdua I'm on twitter as @honestduane and it looks like my wife wants me to go to home depot and get a bolt to fix the lawn mower.. so I may end up having to mow the lawn as well knowing how things go :/
21:46 🔗 honestdua Either way, hope that helps
21:47 🔗 honestdua I need to log, have a good day
21:48 🔗 honestdua just going to let this idle
21:48 🔗 db48x honestdua: you too :)
21:48 🔗 db48x and thanks :)
21:53 🔗 db48x nooo, everyone is catching up: http://argonath.db48x.net/
22:14 🔗 midas what is it db48x ?
23:01 🔗 amerrykan wow, that's huge. is there some concern about sf going away?
23:03 🔗 SketchCow By most assumptions, it has
23:11 🔗 db48x midas: we're scraping the pixori.al URL shortener in preparation for grabbing all the pixorial videos we can find
23:22 🔗 honestdua ok back, should probably get a real irc client installed
23:29 🔗 honestdua ok shutting this down
23:58 🔗 honestdua Question: Is bittorent the only way to get all this data you guys are archiving?
23:58 🔗 honestdua that doesn't seem like a very stable storage medium.
23:59 🔗 db48x it's not a storage medium, it's a delivery mechanism
23:59 🔗 honestdua Well what if I want to get a copy of everything.
23:59 🔗 honestdua 10 TB from bittorrent would take forever
