Time | Nickname | Message
00:00 | honestdua | it seems like you could fund yourselves by selling copies, or something
00:00 | honestdua | I myself am now wondering how big a hd can be
00:01 | honestdua | hmm, 6tb if I spend a lot
00:01 | honestdua | I see 2tb drives sometimes, but 1tb is more common
00:03 | honestdua | one thing you may want to think about is that when we get quantum computers, the only thing we will need is the hashes of files and their size in bytes, so a list of the files by type, size in bytes, md5 hash, sha1 hash, and crc32 hash should be enough for you to recreate everything at that point.
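Whether such hashes could ever be inverted to regenerate content is speculative, but the manifest itself is straightforward to produce. A minimal sketch, using only Python's standard library, of the per-file listing described above (type, size in bytes, md5, sha1, crc32):

```python
import hashlib
import mimetypes
import os
import zlib

def fingerprint(path):
    """Compute the manifest fields described above for one file."""
    md5, sha1, crc32 = hashlib.md5(), hashlib.sha1(), 0
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so huge files don't need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
            sha1.update(chunk)
            crc32 = zlib.crc32(chunk, crc32)
    return {
        "type": mimetypes.guess_type(path)[0] or "application/octet-stream",
        "size": os.path.getsize(path),
        "md5": md5.hexdigest(),
        "sha1": sha1.hexdigest(),
        "crc32": f"{crc32 & 0xFFFFFFFF:08x}",
    }
```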
00:08 | honestdua | Do you guys enjoy http://www.drobo.com/ type products?
00:10 | db48x | are you talking about the torrents hosted by the Internet Archive?
00:11 | db48x | those are backed by IA's monster servers; you can download those as fast as your internet connection will allow
00:11 | db48x | as a seeder it's really hard to compete
00:13 | honestdua | Not sure where they are hosted; I saw the word "bit-torrent" and assumed a bunch of people's random computers
00:15 | honestdua | If I hosted all of it on AWS it would cost about $320/month to store 12tb
00:16 | db48x | yep
00:18 | honestdua | that's without downloading it at all
00:18 | honestdua | just storing it
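For context, that figure implies a storage rate of about $0.026 per GB-month (12 TB ≈ 12,288 GB; $320 ÷ 12,288 GB ≈ $0.026/GB), roughly in line with S3 standard storage pricing of the period, and before any request or transfer charges.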
00:18
🔗
|
db48x |
almost everything we archive ends up on the Internet Archive (in addition to other places) |
00:19
🔗
|
honestdua |
https://archive.org/index.php ? |
00:19
🔗
|
db48x |
yes |
00:19
🔗
|
db48x |
so if being able to access something is as good as having a local copy, then a donation to IA is a pretty cost-effective way to go :) |
00:19
🔗
|
honestdua |
So every time I pull up a website tht no longer exists and extract a bit of data ftrom an old copy of the page, thats you guys? |
00:19
🔗
|
db48x |
no |
00:20
🔗
|
honestdua |
the "wayback machine"? |
00:20
🔗
|
db48x |
we're Archive Team; we're just a bunch of hobbiests |
00:20
🔗
|
db48x |
(however that might be spelled) |
00:20
🔗
|
honestdua |
so you source for IA but are not a part of them? |
00:21
🔗
|
db48x |
right. we focus on grabbing things that are shutting down, while IA uses the Wayback Machine to crawl everything on the net, hitting most places a few times a year |
00:21
🔗
|
honestdua |
so the sourcforge.net stuff I gave you guys earlier, is that going to be used? |
00:21
🔗
|
honestdua |
or is it not very high priority? |
00:22
🔗
|
db48x |
we're not that cohesive or structured |
00:22
🔗
|
honestdua |
So you run "Warriors" but you are not set up as an army/ |
00:22
🔗
|
honestdua |
? |
00:23
🔗
|
honestdua |
;) |
00:23
🔗
|
db48x |
:) |
00:23
🔗
|
honestdua |
I see the word warrior used and that makes me expect a chain of command, etc |
00:23
🔗
|
db48x |
heh |
00:23
🔗
|
honestdua |
and honestly I think sf is going to die, suddenly, with no body told in advance |
00:23
🔗
|
honestdua |
if it goes the way freshmeat and such did |
00:24
🔗
|
db48x |
yea, it's quite possible |
00:24
🔗
|
honestdua |
its owned by the same groups I think |
00:24
🔗
|
db48x |
I'd like to see it get saved |
00:24
🔗
|
db48x |
I'm working on Pixorial right now though; that has an actual deadline |
00:24
🔗
|
honestdua |
The "Warrior" thing thats just your distributed comuting efforts, correct? |
00:24
🔗
|
honestdua |
When yiu say "working" do you mean you are just running a script? or manually working on the websites/ |
00:24
🔗
|
db48x |
yea, it's just a VM people can download and run, it'll automatically join in on any job we put up on the server |
00:25
🔗
|
honestdua |
? |
00:25
🔗
|
db48x |
I'm working on writing the script that the warriors will download and run so that we can archive Pixorial |
00:26
🔗
|
db48x |
right now I've got the warrior scanning Pixorial's url shortener, so that we can simultaneously archive the mapping of short url to full url, and get a list of things that need to be saved |
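The scanning step db48x describes boils down to requesting each short code without following the redirect and recording the Location header. A minimal sketch, assuming Python's requests library and a placeholder shortener domain (the real warrior pipeline code is more involved):

```python
import requests

BASE = "http://short.example/"  # placeholder; not Pixorial's real short domain

def resolve(code):
    """Request a short code without following the redirect; the Location
    header is the short-url -> full-url mapping to be archived."""
    resp = requests.get(BASE + code, allow_redirects=False, timeout=30)
    if resp.status_code in (301, 302, 303, 307, 308):
        return resp.headers.get("Location")
    return None  # unused code, or an error page

for code in ("a1b2", "a1b3"):  # in practice: an enumerated keyspace
    full = resolve(code)
    if full:
        print(BASE + code, "->", full)
```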
00:26 | honestdua | I see. So you're building the tasklet run by the "Warrior" distributed task running system, if I understand correctly?
00:26 | db48x | (Pixorial doesn't provide a way to search or browse the content they host, unlike most video sites)
00:26 | db48x | correct
00:26 | honestdua | What gets me is the large barrier to entry to run a tasklet
00:27 | db48x | it's not very large :)
00:27 | honestdua | it's a VM setup, correct?
00:27 | db48x | just download a virtual machine image and run it in virtualbox
00:27 | db48x | there's also a docker image a few people use
00:28 | honestdua | why not a webstart browser page to let people boot from a webpage?
00:28 | honestdua | http://bellard.org/jslinux/
00:29 | db48x | that would be pretty funny, actually
00:29 | honestdua | just make the image used your warrior
00:29 | db48x | it's not exactly the fastest way to go
00:30 | honestdua | Well, fastest deployment or fastest execution?
00:30 | db48x | and the memory and storage needed to archive a single "item" varies
00:33 | db48x | if you're interested in running things inside the browser, you should check out the JSMESS project
00:33 | db48x | http://jsmess.textfiles.com/
00:33 | honestdua | interesting
00:34 | honestdua | So you already have a warrior tasklet for checking out files from an svn?
00:34 | honestdua | ?
00:35 | db48x | hmm. not specifically
00:35 | db48x | I would instead program the task to download and then run svnsync
00:38 | db48x | I wouldn't spider the HTML view of the repository though; that would be much more laborious
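For reference, the svnsync route mirrors a repository's full revision history rather than scraping its web view. A rough sketch of that sequence, wrapped in Python's subprocess; the repository URL and paths here are illustrative assumptions:

```python
import os
import stat
import subprocess

def mirror_svn(source_url, dest_path):
    """Create a local repo and replay every revision from source_url into it."""
    # 1. Create an empty local repository.
    subprocess.run(["svnadmin", "create", dest_path], check=True)

    # 2. svnsync must be allowed to set revision properties on the mirror,
    #    so install a pre-revprop-change hook that always permits it.
    hook = os.path.join(dest_path, "hooks", "pre-revprop-change")
    with open(hook, "w") as f:
        f.write("#!/bin/sh\nexit 0\n")
    os.chmod(hook, os.stat(hook).st_mode | stat.S_IEXEC)

    dest_url = "file://" + os.path.abspath(dest_path)

    # 3. Point the mirror at the source, then copy every revision.
    subprocess.run(["svnsync", "initialize", dest_url, source_url], check=True)
    subprocess.run(["svnsync", "synchronize", dest_url], check=True)

# Illustrative URL; actual SourceForge SVN paths varied by project.
mirror_svn("svn://svn.code.sf.net/p/someproject/code", "someproject-svn")
```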
00:38 | honestdua | hmm... ok. Well my wife wants me to go get burgers for the grill, bbl
00:38 | db48x | on the other hand, a historical recreation would be harder as a result
00:39 | db48x | would have to at least make a note of what version of cvsweb was in use at the time
00:39 | honestdua | well I have a list of all projects on sourceforge as of earlier today
00:39 | db48x | mmm, burgers
00:39 | honestdua | wouldn't be hard to get the svns of every one
00:39 | honestdua | anyway, bbl
00:39 | db48x | honestdua: if you want to make a warrior task for that, I'd be happy to help out
00:39 | db48x | enjoy your burgers :)
00:40 | honestdua | just going to the store.. wife is going to grill them
00:40 | honestdua | like most Canadian women, she is not at all worthless around a grill
00:40 | honestdua | bbiab
01:04 | honestdua | ok. back
01:05 | yipdw | honestdua: the tricky thing about boot-from-webpage is that, although the warrior infrastructure has some degree of fault tolerance, we do not have any way for clients to communicate "this client gave up on this work item"
01:05 | yipdw | we do have ways to requeue "failed" items, but "failure" is more or less defined as "project admin thinks some node is gone"
01:06 | yipdw | that said, warrior pipelines do provide ways to explicitly fail items, and the tracker has an endpoint for reporting failures, so AFAICT the remaining bit is plumbing
01:07 | db48x | I think boot-from-webpage would be fine for something like urlteam where the items are all quite small (on the order of a kilobyte)
01:07 | yipdw | yes
01:07 | db48x | but less fine for something like Google Video where an individual video could be a gigabyte
01:08 | db48x | also, for urlteam the client could just be written in straight-up javascript, rather than writing it in python and then compiling the python compiler, linux kernel, filesystem drivers and a million other things to Javascript
01:09 | honestdua | It's an interesting idea either way; if the goal is to harvet more, faster, logic states that more workers is better.
01:09 | honestdua | *harvest
01:09 | db48x | yea
01:09 | yipdw | sure, but we've also managed to do that by being lucky and having people who run ISPs run workers :P
01:09 | trs80 | I doubt jslinux could provide the performance required
01:10 | db48x | did you guys watch the 'Birth and Death of Javascript' video?
01:10 | yipdw | is that Gary Bernhardt's thing
01:10 | db48x | yes
01:10 | db48x | it's probable that lots of things will end up that way
01:11 | yipdw | yeah
01:11 | yipdw | I also hope the part about the San Francisco Exclusion Zone is true
01:11 | db48x | at least on powerful machines; there will probably be more aggregate computing power in tiny machines though
01:11 | honestdua | Exclusion zone??
01:11 | db48x | heh
01:11 | db48x | honestdua: a joke from the video
01:11 | db48x | https://www.destroyallsoftware.com/talks/the-birth-and-death-of-javascript
01:13 | trs80 | jslinux doesn't seem to have a network stack
01:13 | trs80 | although it has wget for some reason
01:13 | honestdua | the issue there is CORS
01:14 | honestdua | by default browsers limit traffic to just the site hosting the page
01:14 | honestdua | unless you disable it
01:14 | db48x | honestdua: yes
01:14 | honestdua | on the website to tell the cleint its ok
01:14 | honestdua | *client
01:14 | honestdua | man I can't type sorry
01:14 | db48x | it's cool
01:14 | honestdua | but that's something you can disable/enable
01:14 | yipdw | honestdua: anyway, if you'd like to get the warrior working on jslinux, that'd be cool
01:14 | honestdua | if you host the page that loads the app
01:15 | db48x | I _think_ we could do urlteam in spite of CORS
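The enable/disable knob honestdua is describing lives on the server being contacted: a browser only permits a cross-origin read when the response carries a CORS header. A toy sketch of a tracker-like endpoint opting in, using Python's standard library (the endpoint and payload are made up for illustration):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class TrackerHandler(BaseHTTPRequestHandler):
    """Toy endpoint that lets pages from any origin fetch work items."""
    def do_GET(self):
        body = b'{"item": "a1b2c3"}'  # placeholder work item
        self.send_response(200)
        # The header that tells the browser cross-origin reads are OK:
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 8000), TrackerHandler).serve_forever()
```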
01:15 | honestdua | right now I'm looking through the data I collected earlier on my 16 GB RAM box
01:15 | yipdw | it'd be hilarious to have that and then exploit some inevitable Twitter client XSS exploit to have a billion warriors
01:15 | yipdw | no, just kidding, that'd be mean
01:15 | db48x | heh
01:15 | honestdua | and extracting a list of actual projects versus user profiles, etc
01:16 | honestdua | since over 3.2 million user profiles are included in the list of links, that's really only 2.5 or so million project links
01:16 | honestdua | and most projects have 3-4 links in there
01:17 | honestdua | each
01:17 | honestdua | coding up the extractors now
01:17 | honestdua | and users have up to 4 links for them as well
01:18 | honestdua | so if all projects have 4 links and all users had 4 links
01:18 | honestdua | interesting math
01:19 | honestdua | 312k or so possible projects in that scenario
01:19 | db48x | you'd probably want to do one item per user and one item per project
01:19 | db48x | or if you're just going after repositories, then one per project
01:24 | honestdua | yep
01:24 | honestdua | code would be the priority
01:24 | honestdua | indexes by licence
01:24 | honestdua | etc
01:27 | honestdua | and users on SF can have blogs and wikis
01:27 | honestdua | not just an activities page and a profile
01:28 | db48x | yes, that's why I'd like to use ForgePlucker
01:28 | db48x | it knows how to grab all of that efficiently
01:28 | db48x | our standard tools would just follow the links and record what the website returned
01:29 | db48x | which is great for recreating the website, but not for exploring the data or importing it elsewhere
01:36 | honestdua | hmm, they also have a third type of link, http://sourceforge.net/apps/mediawiki/nhruby.u/ to show the apps a given person is related to
01:38 | honestdua | hmm.. I'm counting up to 7 possible links for just one project
01:38 | honestdua | there could be only 10k or so projects, a lot less than I thought, on SF
01:40 | trs80 | http://sourceforge.net/blog/sourceforge-myths/ says 325k
01:42 | honestdua | hmm, that's in line with the number of links I'm finging
01:42 | honestdua | *finding
01:42 | honestdua | but we have multiple links per project
01:42 | honestdua | we shall know soon enough the exact number
02:00 | honestdua | wow.. getting OOMs
02:01 | honestdua | pretty much means anybody with less than my 16 GB of RAM would too
02:01 | honestdua | that's the serialization step however
02:01 | honestdua | hmm...
02:01 | * | honestdua fades into his computer code
02:02 | honestdua | Found 443487 projects and 1451925 users
02:03 | honestdua | that's the actual number
02:03 | honestdua | of projects and users on sourceforge as of earlier today
02:04 | honestdua | from just the big sitemap file
02:05 | db48x | awesome
02:18 | honestdua | well I can serialize out the project data into json but my machine says "no" to the users file; I think it's due to me being on windows however
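One hedged way around an OOM at that step, sketched here in Python rather than the C#-on-Windows setup described above, is to avoid building the whole document in memory and write one JSON record per line instead:

```python
import json

def write_jsonl(records, path):
    """Serialize incrementally: one JSON object per line, so memory use
    stays proportional to a single record, not the whole dataset."""
    with open(path, "w", encoding="utf-8") as f:
        for name, pages in records:  # an iterable of (user, pages) pairs
            f.write(json.dumps({"name": name, "pages": sorted(pages)}))
            f.write("\n")
```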
02:22 | honestdua | https://dl.dropboxusercontent.com/u/18627325/sourceforge.net/projects.json
02:23 | honestdua | that's every project url, and its sub urls, collected in a Dictionary<project-name, Set<project-sub-page>> collection
02:23 | honestdua | using the /p/ pages as aliases of /project/
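A rough Python equivalent of that structure, mapping each project name to the set of its sub-pages and treating /p/ and /projects/ as aliases; the URL shapes here are assumptions about the sitemap:

```python
from collections import defaultdict
from urllib.parse import urlparse

# project name -> set of sub-pages, as described above
projects = defaultdict(set)

def add_url(url):
    parts = urlparse(url).path.strip("/").split("/")
    # Treat /projects/<name>/... as an alias of /p/<name>/...
    if len(parts) >= 2 and parts[0] in ("p", "projects"):
        name = parts[1]
        sub = "/".join(parts[2:]) or "summary"  # arbitrary label for the root page
        projects[name].add(sub)

# Illustrative sitemap entries:
add_url("http://sourceforge.net/projects/someproject/files/")
add_url("http://sourceforge.net/p/someproject/wiki/")
print(dict(projects))  # {'someproject': {'files', 'wiki'}} (set order varies)
```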
02:26 | honestdua | 398418 of them have a wiki
02:27 | honestdua | 443174 of them have a files download page
02:28 | honestdua | only 27 of them have a git page
02:28 | honestdua | 3 of them have a page named svn
02:28 | db48x | heh
02:29 | honestdua | 3574 of them have a page named cvs
02:29 | honestdua | as in, most are just file uploads
02:29 | honestdua | and I bet you a lot of such projects are binary only, or have no uploads and a link to an external site
02:32 | honestdua | 74973 of them have mailman setups
02:33 | honestdua | 143518 of them have a tickets page
02:36 | honestdua | so I would say that around 200 of them are actually active
02:36 | honestdua | *200k of them
02:39 | honestdua | still, that's a lot of code
02:44 | honestdua | and an average of over 3 users per project
02:47 | honestdua | man.. this is cooler than I expected
02:48 | honestdua | I wonder, if I posted this online, if anybody else besides you guys would be interested?
02:55 | honestdua | hmm, tweeting.. just cuz
03:49 | SketchCow | Boop
05:03 | trs80 | freecode appears to be back up
05:04 | trs80 | with a "no longer updated" banner
06:18 | joepie91_ | trs80: it's always been up for me, just the stylesheets broke
07:52 | midas | db48x: not using the tracker? if it works it works :)
08:09 | db48x | midas: I don't understand your question
08:10 | db48x | I've got the tinyarchive tracker running on http://argonath.db48x.net/
08:13 | midas | ah ok :)
08:13 | midas | got you
08:45 | exmic | I'm still seeding that 75G urlteam torrent.
09:56 | Nemo_bis | ouch https://gerrit.wikimedia.org/r/141386
09:57 | Nemo_bis | exmic: cute, do you have 100% of it? is it on archive.org now?
10:01 | midas | Nemo_bis: what? :|
10:07 | Nemo_bis | https://ganglia.wikimedia.org/latest/graph.php?r=day&z=large&c=Miscellaneous+eqiad&h=sodium.wikimedia.org&jr=&js=&v=13.5&m=cpu_wio&vl=%25&ti=CPU+wio I think
10:15 | midas | all we did was cause a little cpu load and everybody starts screaming
10:15 | Nemo_bis | 20% IO wait probably equals swapdeath :)
10:21 | midas | it's not like we killed wikipedia :p
13:58 | Jonimus | they could have contacted someone rather than just banning via user agent
14:01 | midas | use Google's user agent, good luck!
14:03 | Smiley | hon
14:03 | Smiley | gah, he's not here
14:04 | Smiley | asking if he should post stuff online if we are interested... _post everything_ even if people aren't.