#archiveteam 2014-06-23,Mon


Time Nickname Message
00:00 🔗 honestdua it seems like you could fund yourselves by selling copies, or something
00:00 🔗 honestdua I myself am now wondering how big a hd can be
00:01 🔗 honestdua hmm, 6tb is if I spend a lot
00:01 🔗 honestdua I see 2tb drives sometimes, but 1tb is more common
00:03 🔗 honestdua one thing you may want to think about is that when we get quantum computers, the only thing we will need is the hashes of files and their size in bytes, so a list of the files by type, size in bytes, md5 hash, sha1 hash, and crc32 hash should be enough for you to recreate everything at that point.
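
A minimal sketch of the manifest idea above, assuming Python's standard hashlib and zlib; the directory name and output format are illustrative, not anything Archive Team actually uses:

    import hashlib
    import os
    import zlib

    def file_record(path):
        """Return (size, md5, sha1, crc32) for one file, hashing in 1 MiB chunks."""
        md5, sha1, crc, size = hashlib.md5(), hashlib.sha1(), 0, 0
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                size += len(chunk)
                md5.update(chunk)
                sha1.update(chunk)
                crc = zlib.crc32(chunk, crc)
        return size, md5.hexdigest(), sha1.hexdigest(), format(crc & 0xFFFFFFFF, "08x")

    # One manifest line per file: path, size in bytes, md5, sha1, crc32.
    # "archive/" is a hypothetical directory of files to fingerprint.
    for root, _, names in os.walk("archive/"):
        for name in names:
            path = os.path.join(root, name)
            print(path, *file_record(path))
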
00:08 🔗 honestdua Do you guys enjoy http://www.drobo.com/ type products?
00:10 🔗 db48x are you talking about the torrents hosted by the Internet Archive?
00:11 🔗 db48x those are backed by IA's monster servers; you can download those as fast as your internet connection will allow
00:11 🔗 db48x as a seeder it's really hard to compete
00:13 🔗 honestdua Not sure where they are hosted; I saw the word "bittorrent" and assumed a bunch of people's random computers
00:15 🔗 honestdua If I hosted all of it on AWS it would cost about $320/month to store 12tb
00:16 🔗 db48x yep
00:18 🔗 honestdua that's without downloading it at all
00:18 🔗 honestdua just storing it
00:18 🔗 db48x almost everything we archive ends up on the Internet Archive (in addition to other places)
00:19 🔗 honestdua https://archive.org/index.php ?
00:19 🔗 db48x yes
00:19 🔗 db48x so if being able to access something is as good as having a local copy, then a donation to IA is a pretty cost-effective way to go :)
00:19 🔗 honestdua So every time I pull up a website that no longer exists and extract a bit of data from an old copy of the page, that's you guys?
00:19 🔗 db48x no
00:20 🔗 honestdua the "wayback machine"?
00:20 🔗 db48x we're Archive Team; we're just a bunch of hobbiests
00:20 🔗 db48x (however that might be spelled)
00:20 🔗 honestdua so you source for IA but are not a part of them?
00:21 🔗 db48x right. we focus on grabbing things that are shutting down, while IA uses the Wayback Machine to crawl everything on the net, hitting most places a few times a year
00:21 🔗 honestdua so the sourceforge.net stuff I gave you guys earlier, is that going to be used?
00:21 🔗 honestdua or is it not very high priority?
00:22 🔗 db48x we're not that cohesive or structured
00:22 🔗 honestdua So you run "Warriors" but you are not set up as an army?
00:23 🔗 honestdua ;)
00:23 🔗 db48x :)
00:23 🔗 honestdua I see the word warrior used and that makes me expect a chain of command, etc
00:23 🔗 db48x heh
00:23 🔗 honestdua and honestly I think sf is going to die, suddenly, with nobody told in advance
00:23 🔗 honestdua if it goes the way freshmeat and such did
00:24 🔗 db48x yea, it's quite possible
00:24 🔗 honestdua it's owned by the same group, I think
00:24 🔗 db48x I'd like to see it get saved
00:24 🔗 db48x I'm working on Pixorial right now though; that has an actual deadline
00:24 🔗 honestdua The "Warrior" thing thats just your distributed comuting efforts, correct?
00:24 🔗 honestdua When yiu say "working" do you mean you are just running a script? or manually working on the websites/
00:24 🔗 db48x yea, it's just a VM people can download and run, it'll automatically join in on any job we put up on the server
00:25 🔗 honestdua ?
00:25 🔗 db48x I'm working on writing the script that the warriors will download and run so that we can archive Pixorial
00:26 🔗 db48x right now I've got the warrior scanning Pixorial's url shortener, so that we can simultaneously archive the mapping of short url to full url, and get a list of things that need to be saved
00:26 🔗 honestdua I see. So you're building the tasklet run by the "Warrior" distributed task-running system, if I understand correctly?
00:26 🔗 db48x (Pixorial doesn't provide a way to search or browse the content they host, unlike most video sites)
00:26 🔗 db48x correct
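
A rough sketch of the shortener-scanning step described above, assuming Python's urllib; the short domain in the usage example is hypothetical, and Pixorial's actual shortener may behave differently:

    import urllib.error
    import urllib.request

    class _NoRedirect(urllib.request.HTTPRedirectHandler):
        """Stop urllib from following redirects so we can read the Location header."""
        def redirect_request(self, req, fp, code, msg, headers, newurl):
            return None

    def resolve_short_url(short_url):
        """Return the long URL a short URL redirects to, or None if it doesn't redirect."""
        opener = urllib.request.build_opener(_NoRedirect)
        try:
            opener.open(short_url, timeout=30)
        except urllib.error.HTTPError as e:
            if e.code in (301, 302, 303, 307, 308):
                return e.headers.get("Location")
            raise
        return None

    # e.g. resolve_short_url("http://pix.example/abc123")  # hypothetical short code
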
00:26 🔗 honestdua What gets me is the large barrier to entry to run a tasklet
00:27 🔗 db48x it's not very large :)
00:27 🔗 honestdua it's a VM setup, correct?
00:27 🔗 db48x just download a virtual machine image and run it in virtualbox
00:27 🔗 db48x there's also a docker image a few people use
00:28 🔗 honestdua why not a webstart browser page to let people boot from a webpage?
00:28 🔗 honestdua http://bellard.org/jslinux/
00:29 🔗 db48x that would be pretty funny, actually
00:29 🔗 honestdua just make the image used your warrior
00:29 🔗 db48x it's not exactly the fastest way to go
00:30 🔗 honestdua Well, fastest deployment or fastest execution?
00:30 🔗 db48x and the memory and storage needed to archive a single "item" varies
00:33 🔗 db48x if you're interested in running things inside the browser, you should check out the JSMESS project
00:33 🔗 db48x http://jsmess.textfiles.com/
00:33 🔗 honestdua interesting
00:34 🔗 honestdua So you already have a warrior tasklet for checking out files from an svn?
00:35 🔗 db48x hmm. not specifically
00:35 🔗 db48x I would instead program the task to download and then run svnsync
00:38 🔗 db48x I wouldn't spider the HTML view of the repository though, that would be much more laborious
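A sketch of that svnsync approach, assuming Python driving the stock Subversion command-line tools; the SourceForge repository URL in the usage example only shows the general URL shape, with a made-up project name:

    import os
    import stat
    import subprocess

    def mirror_svn(source_url, mirror_dir):
        """Create an empty local repo and pull every revision of source_url into it."""
        subprocess.run(["svnadmin", "create", mirror_dir], check=True)
        # svnsync stores its bookkeeping in revision properties, so the mirror
        # needs a pre-revprop-change hook that allows revprop changes.
        hook = os.path.join(mirror_dir, "hooks", "pre-revprop-change")
        with open(hook, "w") as f:
            f.write("#!/bin/sh\nexit 0\n")
        os.chmod(hook, os.stat(hook).st_mode | stat.S_IXUSR)
        mirror_url = "file://" + os.path.abspath(mirror_dir)
        subprocess.run(["svnsync", "init", mirror_url, source_url], check=True)
        subprocess.run(["svnsync", "sync", mirror_url], check=True)

    # e.g. mirror_svn("https://svn.code.sf.net/p/someproject/code/", "someproject-svn")
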
00:38 🔗 honestdua hmm... ok. Well my wife wants me to go get burgers for the grill, bbl
00:38 🔗 db48x on the other hand, a historical recreation would be harder as a result
00:39 🔗 db48x would have to at least make a note of what version of cvsweb was in use at the time
00:39 🔗 honestdua well I have a list of all projects on sourceforge as of earlier today
00:39 🔗 db48x mmm, burgers
00:39 🔗 honestdua wouldn't be hard to get the svns of every one
00:39 🔗 honestdua anyway, bbl
00:39 🔗 db48x honestdua: if you want to make a warrior task for that, I'd be happy to help out
00:39 🔗 db48x enjoy your burgers :)
00:40 🔗 honestdua just going to the store.. wife is going to grill them
00:40 🔗 honestdua like most Canadian women, she is not at all worthless around a grill
00:40 🔗 honestdua bbiab
01:04 🔗 honestdua ok. back
01:05 🔗 yipdw honestdua: the tricky thing about boot-from-webpage is that, although the warrior infrastructure has some degree of fault tolerance, we do not have any way for clients to communicate "this client gave up on this work item"
01:05 🔗 yipdw we do have ways to requeue "failed" items, but "failure" is more or less defined as "project admin thinks some node is gone"
01:06 🔗 yipdw that said, warrior pipelines do provide ways to explicitly fail items, and the tracker has an endpoint for reporting failures, so AFAICT the remaining bit is plumbing
01:07 🔗 db48x I think boot-from-webpage would be fine for something like urlteam where the items are all quite small (on the order of a kilobyte)
01:07 🔗 yipdw yes
01:07 🔗 db48x but less fine for something like Google Video where an individual video could be a gigabyte
01:08 🔗 db48x also, for urlteam the client could just be written in straight-up javascript, rather than writing it in python and then compiling the python compiler, linux kernel, filesystem drivers and a million other things to Javascript
01:09 🔗 honestdua It's an interesting idea either way; if the goal is to harvest more, faster, logic states that more workers are better.
01:09 🔗 db48x yea
01:09 🔗 yipdw sure, but we've also managed to do that by being lucky and having people who run ISPs run workers :P
01:09 🔗 trs80 I doubt jslinux could provide the performance required
01:10 🔗 db48x did you guys watch the 'Birth and Death of Javascript' video?
01:10 🔗 yipdw is that Gary Bernhardt's thing
01:10 🔗 db48x yes
01:10 🔗 db48x it's probable that lots of things will end up that way
01:11 🔗 yipdw yeah
01:11 🔗 yipdw I also hope the part about the San Francisco Exclusion Zone is true
01:11 🔗 db48x at least on powerful machines, there will probably be more aggregate computing power in tiny machines though
01:11 🔗 honestdua Exclusion zone??
01:11 🔗 db48x heh
01:11 🔗 db48x honestdua: a joke from the video
01:11 🔗 db48x https://www.destroyallsoftware.com/talks/the-birth-and-death-of-javascript
01:13 🔗 trs80 jslinux doesn't seem to have a network stack
01:13 🔗 trs80 although it has wget for some reason
01:13 🔗 honestdua the issue there is CORS
01:14 🔗 honestdua by default browsers limit traffic to just the site hosting the page
01:14 🔗 honestdua unless you disable it
01:14 🔗 db48x honestdua: yes
01:14 🔗 honestdua on the website to tell the client it's ok
01:14 🔗 honestdua man I can't type sorry
01:14 🔗 db48x it's cool
01:14 🔗 honestdua but that's something you can disable/enable
01:14 🔗 yipdw honestdua: anyway, if you'd like to get the warrior working on jslinux, that'd be cool
01:14 🔗 honestdua if you host the page that loads the app
01:14 🔗 db48x I _think_ we could do urlteam in spite of CORS
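
For reference, the server-side opt-in honestdua and db48x are describing is a response header the hosting site sends; a minimal sketch with Python's http.server (the port is arbitrary):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class CORSHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            # This header tells browsers that pages from any origin may read
            # this response, relaxing the default same-origin restriction.
            self.send_header("Access-Control-Allow-Origin", "*")
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"ok\n")

    HTTPServer(("", 8000), CORSHandler).serve_forever()
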
01:15 🔗 honestdua right now I'm looking through the data I collected earlier on my 16gb RAM box
01:15 🔗 yipdw it'd be hilarious to have that and then exploit some inevitable Twitter client XSS exploit to have a billion warriors
01:15 🔗 yipdw no just kidding, that'd be mean
01:15 🔗 db48x heh
01:15 🔗 honestdua and extracting a list of actual projects versus user profiles, etc
01:15 🔗 honestdua since over 3.2 million user profiles are included in the list of links, that's really only 2.5 or so million project links
01:16 🔗 honestdua and most projects have 3-4 links in there
01:16 🔗 honestdua each
01:17 🔗 honestdua coding up the extractors now
01:17 🔗 honestdua and users have up to 4 links for them as well
01:17 🔗 honestdua so if all projects have 4 links and all users had 4 links
01:18 🔗 honestdua interesting math
01:18 🔗 honestdua 312k or so possible projects in that scenario
01:19 🔗 db48x you'd probably want to do one item per user and one item per project
01:19 🔗 db48x or if you're just going after repositories, then one per project
01:24 🔗 honestdua yep
01:24 🔗 honestdua code would be the priority
01:24 🔗 honestdua indexes by licence
01:24 🔗 honestdua etc
01:27 🔗 honestdua and users on SF can have blogs and wikis
01:27 🔗 honestdua not just an activities page and a profile
01:28 🔗 db48x yes, that's why I'd like to use ForgePlucker
01:28 🔗 db48x it knows how to grab all of that efficiently
01:28 🔗 db48x our standard tools would just follow the links and record what the website returned
01:29 🔗 db48x which is great for recreating the website, but not for exploring the data or importing it elsewhere
01:36 🔗 honestdua hmm they also have a third type of link, http://sourceforge.net/apps/mediawiki/nhruby.u/ to show the apps a given person is related to
01:38 🔗 honestdua hmm.. I'm counting up to 7 possible links for just one project
01:38 🔗 honestdua there could be only 10k or so projects, a lot less than I thought, on SF
01:40 🔗 trs80 http://sourceforge.net/blog/sourceforge-myths/ says 325k
01:42 🔗 honestdua hmm, that's in line with the number of links I'm finding
01:42 🔗 honestdua but we have multiple links per project
01:42 🔗 honestdua we shall know soon enough the exact number
02:00 🔗 honestdua wow.. getting OOMs
02:01 🔗 honestdua pretty much means anybody with less than my 16 gb of RAM would too
02:01 🔗 honestdua that's the serialization step, however
02:01 🔗 honestdua hmm...
02:01 🔗 * honestdua fades into his computer code
02:02 🔗 honestdua Found 443487 Projects and 1451925 users
02:03 🔗 honestdua thats the actual number
02:03 🔗 honestdua of projects and users on sourceforge as of earlier today
02:04 🔗 honestdua from just the big sitemap file
02:05 🔗 db48x awesome
02:18 🔗 honestdua well I can serialize out the project data into json but my machine says "no" to the users file; I think it's due to me being on Windows, however
02:22 🔗 honestdua https://dl.dropboxusercontent.com/u/18627325/sourceforge.net/projects.json
02:23 🔗 honestdua that's every project URL and its sub-URLs, collected in a Dictionary<project-name, Set<project-sub-page>> collection
02:23 🔗 honestdua using the /p/ pages as aliases of /project/
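
Assuming that Dictionary serializes to a JSON object mapping each project name to an array of sub-page URLs (a guess at the file's shape, not a documented format), the per-page counts that follow could be reproduced with something like:

    import json
    from collections import Counter

    with open("projects.json") as f:
        projects = json.load(f)  # {project-name: [sub-page URLs]}

    # Count how many projects have each kind of sub-page (wiki, files, git, ...),
    # taking the last path segment of each sub-URL as the page name.
    counts = Counter()
    for name, subpages in projects.items():
        kinds = {url.rstrip("/").rsplit("/", 1)[-1] for url in subpages}
        counts.update(kinds)
    print(len(projects), "projects")
    print(counts.most_common(10))
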
02:26 🔗 honestdua 398418 of them have a wiki
02:27 🔗 honestdua 443174 of them have a files download page
02:28 🔗 honestdua only 27 of them have a git page
02:28 🔗 honestdua 3 of them have a page named svn
02:28 🔗 db48x heh
02:29 🔗 honestdua 3574 of them have a page named cvs
02:29 🔗 honestdua as in, most are just file uploads
02:29 🔗 honestdua and I bet you a lot of such projects are binary-only, or have no uploads and just a link to an external site
02:32 🔗 honestdua 74973 of them have mailman setups
02:33 🔗 honestdua 143518 of them have a tickets page
02:36 🔗 honestdua so I would say that around 200k of them are actually active
02:39 🔗 honestdua still thats a lot of code
02:44 🔗 honestdua and an average of over 3 users per project
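
(A quick check against the totals reported above: 1451925 users / 443487 projects ≈ 3.27, so the average does come out to just over 3 users per project.)
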
02:47 🔗 honestdua man.. this is cooler than I expected
02:48 🔗 honestdua I wonder if i posted this online if anybody else besides you guys would be interested?
02:55 🔗 honestdua hmm, tweeting.. just cuz
03:49 🔗 SketchCow Boop
05:03 🔗 trs80 freecode appears to be back up
05:04 🔗 trs80 with a "no longer updated" banner
06:18 🔗 joepie91_ trs80: it's always been up for me, just stylesheets broke
07:52 🔗 midas db48x: not using the tracker? if it works it works :)
08:09 🔗 db48x midas: I don't understand your question
08:10 🔗 db48x I've got the tinyarchive tracker running on http://argonath.db48x.net/
08:13 🔗 midas ah ok :)
08:13 🔗 midas got you
08:45 🔗 exmic I'm still seeding that 75G urlteam torrent.
09:56 🔗 Nemo_bis ouch https://gerrit.wikimedia.org/r/141386
09:57 🔗 Nemo_bis exmic: cute, do you have 100 % of it? is it on archive.org now?
10:01 🔗 midas Nemo_bis: what? :|
10:07 🔗 Nemo_bis https://ganglia.wikimedia.org/latest/graph.php?r=day&z=large&c=Miscellaneous+eqiad&h=sodium.wikimedia.org&jr=&js=&v=13.5&m=cpu_wio&vl=%25&ti=CPU+wio I think
10:15 🔗 midas all we did was cause a little cpu load and everybody starts screaming
10:15 🔗 Nemo_bis 20 % io wait probably equals swapdeath :)
10:21 🔗 midas it's not like we killed wikipedia :p
13:58 🔗 Jonimus they could have contacted someone rather than just banning via useragent
14:01 🔗 midas use google's useragent, good luck!
14:03 🔗 Smiley hon
14:03 🔗 Smiley gah he's not here
14:04 🔗 Smiley asking if he should post stuff online if we are interested... _post everything_ even if people aren't.
