#archiveteam 2012-12-13,Thu


Time Nickname Message
00:02 🔗 corobo running it on a couple machines, if you could add something like an optional argument to bind to an ip could run it off a few ips :)
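The optional bind argument corobo asks for could look roughly like the sketch below. It is only an illustration: the `--bind-address` flag name, `build_parser`, and `make_connection` are hypothetical and do not come from any Archive Team script. It relies on the standard library's `source_address` parameter to choose the local IP.

```python
import argparse
import http.client


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical grab script")
    # --bind-address is an assumed flag name, used here for illustration only.
    parser.add_argument(
        "--bind-address",
        default=None,
        help="local IP to send requests from (default: let the OS choose)",
    )
    return parser


def make_connection(host, bind_address=None):
    # source_address is an (ip, port) pair; port 0 lets the OS pick the port.
    source = (bind_address, 0) if bind_address else None
    # No socket is opened until a request is actually sent.
    return http.client.HTTPSConnection(host, source_address=source, timeout=30)
```

Running one copy of the script per IP, each with a different `--bind-address`, would then spread the requests across the addresses on a machine.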
05:32 🔗 tuankiet @alard Hello!
10:43 🔗 alard tuankiet: Hi.
10:44 🔗 alard kennethre: No help needed, yet, other than running one of those repository-discovery scripts, perhaps. Once we have a nice list we can look at downloading the downloads.
10:46 🔗 alard tuankiet: You were after the Google scraper script for the Yahoo blogs, but I haven't had time to look at it yet (the version I have is really personalized, so it basically only works for me, at the moment).
11:56 🔗 tuankiet @alard: Ok
11:57 🔗 tuankiet I am running the Yahoo and Github script
12:34 🔗 alard tuankiet: Very good.
12:35 🔗 alard It's a pity that Dailybooth is so slow. We're working on too many projects.
13:23 🔗 Nemo_bis At last, it looks like Wikia is generating a dump per minute instead of one every 5, since the 10th http://wikistats.wikia.com/c/ca/
13:48 🔗 SketchCow OK, so.
13:48 🔗 SketchCow I have to say.
13:48 🔗 SketchCow When you checked in the github content for the github content for us to turn around and download github
13:48 🔗 SketchCow Oh man
13:48 🔗 SketchCow I almost died
13:53 🔗 GLaDOS So uuh, I heard you like github..
13:55 🔗 SketchCow At this exact moment, archive.org has one petabyte of disk space
13:56 🔗 SketchCow free
13:56 🔗 Nemo_bis SketchCow: are you saying because you plan to reduce it vastly and very soon? :p
13:56 🔗 SketchCow Yes
13:56 🔗 Nemo_bis :)
13:56 🔗 SketchCow I'd like to understand.... do we need more archiveteam warriors on the dailybooth project?
13:57 🔗 Nemo_bis I also have to admit that it's not so obvious what one has to do to help the archiveteam
13:57 🔗 Nemo_bis too many projects and we're too lazy to update the wiki
14:03 🔗 SketchCow We're not too lazy.
14:03 🔗 SketchCow The wiki's choked because of the spam. I will fix it.
14:04 🔗 Nemo_bis Speaking of which, can you make me sysop
14:04 🔗 Nemo_bis it's weird not to have the delete button on a wiki
14:04 🔗 Nemo_bis and frustrating for me :)
14:18 🔗 alard No, I don't think more warriors would help with dailybooth. It's dailybooth that's too slow.
14:19 🔗 alard Perhaps we should consider giving the warriors something else to do (github!), since we have more than enough non-warriors to keep dailybooth busy.
14:20 🔗 alard We're doing 12 / 7 / 8 / 14 / 16 dailybooth users per minute (and that includes 404's).
14:21 🔗 alard Do we want the Github downloads in warc format?
14:36 🔗 SketchCow I personally think no.
14:38 🔗 alard You don't want to go for maximum inaccessibility?
14:40 🔗 alard If not warc, then what? A .tar?
14:40 🔗 alard (What to do with the /downloads HTML page?)
14:42 🔗 alard We could also just rsync the files as-is. The url structure is tidy enough (user/repo/download).
14:43 🔗 alard The downloads page has the download count, everything else exists in other forms: https://github.com/ArchiveTeam/mobileme-grab/downloads
15:18 🔗 SketchCow I think in this case, we're rescuing a filesystem, not an experience
15:19 🔗 SketchCow A .txt file accompanying the files indicating the download count, if you're being completist.
15:19 🔗 SketchCow And personally, I think that assessment could be in a single .txt file
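The "files as-is plus a .txt with the download counts" idea discussed above might be sketched as follows. This is a guess at one workable layout, not the format Archive Team settled on: the `save_listing` helper and the tab-separated `index.txt` format are assumptions.

```python
from pathlib import Path


def save_listing(root, user, repo, downloads):
    """Write one index.txt for a repo, listing each file and its download count.

    `downloads` is a list of (filename, download_count) pairs. The
    tab-separated format is an assumption made for this sketch.
    """
    # Mirror the tidy user/repo URL structure on disk.
    repo_dir = Path(root) / user / repo
    repo_dir.mkdir(parents=True, exist_ok=True)
    lines = [f"{name}\t{count}" for name, count in downloads]
    index = repo_dir / "index.txt"
    index.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return index
```

The downloaded files themselves would sit next to `index.txt` in the same `user/repo` directory, so the tree can be copied with rsync unchanged.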
16:18 🔗 alard SketchCow: Could you have a look at alardland/github?
16:36 🔗 * closure perks up his ears hearing about plans to do something with github
16:36 🔗 closure is this about archiving the git repos, or some of their other data?
16:36 🔗 alard The downloads.
16:37 🔗 closure hmm, not familiar with that
16:37 🔗 alard https://github.com/blog/1302-goodbye-uploads
16:37 🔗 closure aha, thanks
16:38 🔗 alard We're making a list of repositories, so that could be used for other things in the future.
16:38 🔗 closure so there's a guy who has been using their API to find all new repositories for a while.. I forget the link to his site
16:40 🔗 SketchCow alard - Looks good.
16:40 🔗 SketchCow I suspect this won't be a LOT of data
16:41 🔗 alard You *hope* it's not a lot of data.
16:42 🔗 closure for a lot of data, see sourceforge downloads :P
16:42 🔗 SketchCow I don't actually (hope)
16:42 🔗 SketchCow Because once again the Compass Has Swung and archive.org has tons of disk space.
16:42 🔗 SketchCow I mean, we still should help raise funds because it helps
16:43 🔗 SketchCow But 1 petabyte of free disk space right now
16:43 🔗 SketchCow So yeah, let's do it.
16:43 🔗 SketchCow I'll e-mail a hug to my github buddies
16:46 🔗 closure ah, I see you already found githubarchive.org
16:48 🔗 alard SketchCow: Want to say hi in the User-Agent header as well?
16:52 🔗 SketchCow Sure.
16:52 🔗 SketchCow "Archive Team Loves GitHub"
16:55 🔗 alard https://github.com/ArchiveTeam/github-download-grab/commit/e3073ec5573a6d9b1e9508ad283168358019aae3
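For readers unfamiliar with the idea, setting a custom User-Agent on a request looks like this in Python's standard library. This is not what the linked commit contains (the grab itself uses wget-lua); `build_request` is a hypothetical helper shown only to illustrate the header.

```python
import urllib.request

# The greeting SketchCow suggested above.
USER_AGENT = "Archive Team Loves GitHub"


def build_request(url):
    # Constructing the Request object does not touch the network.
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

Passing the result to `urllib.request.urlopen` would send the request with that header, so anyone reading GitHub's access logs could see who is fetching the pages.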
17:07 🔗 alard Heh, the tracker might not like this: http://tracker.archiveteam.org/github/
17:08 🔗 closure have you already pulled in the api dump data? If not, I might try some massaging
17:09 🔗 alard No, I haven't. We're well on our way with the API exploration, though: http://tracker.archiveteam.org:8125/
17:09 🔗 alard (I think the highest ID is in the 7,000,000 range.)
17:12 🔗 closure I'm running the scraper for that, so if there's time to plow through the whole range, that's fine
17:44 🔗 SketchCow What is our HQ url again?
17:45 🔗 nitro2k01 What? Headquarters? http://archiveteam.org/ ?
17:49 🔗 SketchCow No, got it.
17:49 🔗 SketchCow http://warriorhq.archiveteam.org/
17:49 🔗 nitro2k01 Ah, that
17:50 🔗 godane burning a bluray of gbtv/theblaze episodes
17:50 🔗 godane the rest of november and election coverage is on this one
18:43 🔗 Nemo_bis SketchCow: can I buy another 50 kg of magazines to send you? :D
18:43 🔗 Nemo_bis "PC Professionale" 110-189
18:43 🔗 Nemo_bis shipping will cost about three times as much as buying them
18:44 🔗 DFJustin I like how kg is our standard unit for magazines now
18:45 🔗 Nemo_bis DFJustin: what other unit could I choose for transatlantic cooperations? :p
18:53 🔗 Nemo_bis I don't remember if ias3upload.pl overwrites existing files with same name or not
18:57 🔗 godane i uploaded august of 2011 episodes of x-play today
19:11 🔗 SketchCow At current trends, github data will be about 200gb
19:14 🔗 DFJustin *yawn*
19:17 🔗 chronomex *slurp*
19:31 🔗 chronomex alard: did we already finish the API grabbing?
19:32 🔗 chronomex my discoverer died last night with requests.exceptions.ConnectionError: HTTPSConnectionPool(host=u'api.github.com', port=443): Max retries exceeded with url: /repositories?since=1295141
19:32 🔗 Deewiant I'm still running the github repo explorer, it seems to come up with some new tasks every couple of minutes
19:33 🔗 Deewiant (I put my auth info in there so it can do 5000 instead of 60 per hour)
19:33 🔗 chronomex neato
19:34 🔗 Deewiant (At first I accidentally put them on a tracker HTTP request, had to change the password then >_<)
19:34 🔗 chronomex hah, woops
19:34 🔗 chronomex probably nobody's looking at those ... except the NSA watches them in transit
19:35 🔗 Deewiant Yep, I think it was an unencrypted request too
19:35 🔗 chronomex you're fucked
19:35 🔗 Deewiant Well, I managed to change the password without any trouble
19:36 🔗 Deewiant Maybe somebody defaced all my repos in the interim ;-P
19:45 🔗 chronomex you seem to be sucking the job queue dry
19:45 🔗 chronomex good work
19:46 🔗 Deewiant Where does it get jobs from?
20:10 🔗 kennethre chronomex: sorry :)
20:11 🔗 kennethre I'd recommend using something like celery
20:23 🔗 chronomex erp, what?
20:27 🔗 kennethre re: requests.exceptions.ConnectionError
20:27 🔗 kennethre to spread them across different machines, handle exceptions, etc
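Short of a full task queue, a discoverer can survive the kind of connection error chronomex hit by retrying with a backoff. The sketch below is an assumption-laden illustration, not the actual discovery script: `fetch` is a hypothetical callable that returns one page of repositories for a given `since` id, and the built-in `ConnectionError` stands in for `requests.exceptions.ConnectionError`, which is a different class and would have to be passed via `retry_on` in a script that uses requests.

```python
import time


def fetch_with_retry(fetch, since, retries=5, base_delay=1.0,
                     retry_on=(ConnectionError,), sleep=time.sleep):
    """Call fetch(since), retrying with exponential backoff on retry_on errors."""
    for attempt in range(retries):
        try:
            return fetch(since)
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts, let the caller see the error
            sleep(base_delay * 2 ** attempt)


def discover(fetch, start=0, sleep=time.sleep):
    """Yield repository ids, moving `since` to the last id seen on each page."""
    since = start
    while True:
        page = fetch_with_retry(fetch, since, sleep=sleep)
        if not page:
            return  # an empty page means there is nothing left to list
        for repo in page:
            yield repo["id"]
        since = page[-1]["id"]
```

At the authenticated limit of 5000 requests per hour that Deewiant mentions, a single discoverer would also want to pause about 0.72 seconds between calls (3600 / 5000).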
20:27 🔗 SketchCow 62 BBC R&D Descriptions left!
20:27 🔗 SketchCow Poor github
20:30 🔗 balrog_ yeah, I'm getting no tasks.
20:30 🔗 balrog_ actually I am getting one in a while
20:33 🔗 alard http://zeppelin.xrtc.net/corp.xrtc.net/shilling.corp.xrtc.net/index.html
20:42 🔗 godane SketchCow: Thanks for putting up x-play episodes in collection
20:54 🔗 SketchCow No problem.
20:54 🔗 SketchCow More soon
21:04 🔗 godane i will do 2012 episodes in 2013 so i don't get this stuff darked
21:05 🔗 godane when the network is dead there shouldn't be fear of nbc sending dmca notices i hope
21:07 🔗 soultcer there are so many people fetching github repo lists that it is hard to actually get a task assigned
21:24 🔗 sankin1 the leaderboard is flying
21:26 🔗 soultcer Whoa there's already a project to download
21:28 🔗 alard Perhaps I should ask: what is an acceptable number of requests to send to GitHub?
21:28 🔗 alard We're currently doing over 50 requests per second.
21:31 🔗 soultcer As long as Github doesn't show elevated error response rates, keep it up :D
21:32 🔗 alard Apparently underscor has joined us.
21:32 🔗 Deewiant The non-Warrioring cheater.
21:33 🔗 soultcer Well dailybooth is kind of boring with its low download speed and timeouts
21:33 🔗 * underscor pads in drearily, rubbing sleep out of his eyes
21:33 🔗 underscor what oh yes hi
21:33 🔗 kennethre alard: to the api?
21:33 🔗 alard No, to the /downloads page.
21:33 🔗 kennethre i wouldn't worry about it
21:34 🔗 kennethre unless you get 500s
21:37 🔗 soultcer The actual downloads are from cloudfront and probably s3-backed
21:37 🔗 kennethre yep
21:38 🔗 alard (The precise thing to say would be: 50 r/s to the /downloads pages.)
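The rule of thumb given above (keep going unless GitHub starts returning 500s) could be automated with a small sliding-window check like the one below. It is a sketch only: the `ErrorRateMonitor` class, the window size, and the 5% threshold are all assumptions, not anything the tracker or the warrior actually implements.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the share of 5xx responses among the most recent requests."""

    def __init__(self, window=200, threshold=0.05):
        # Only the last `window` status codes are kept.
        self.statuses = deque(maxlen=window)
        self.threshold = threshold

    def record(self, status):
        self.statuses.append(status)

    def error_rate(self):
        if not self.statuses:
            return 0.0
        # 404s are expected here (many repos have no downloads); only 5xx count.
        errors = sum(1 for s in self.statuses if 500 <= s <= 599)
        return errors / len(self.statuses)

    def should_back_off(self):
        return self.error_rate() > self.threshold
```

A downloader would call `record` after every response and slow down or pause whenever `should_back_off` returns true.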
21:38 🔗 SketchCow Just for the record, godane - you are cutting it way close to the edge.
21:39 🔗 SketchCow I realize a lot of these safe times and cooldown periods are fake and wishful thinking, but putting up stuff that is less than a year old is inviting the scorpion to reflexively sting even though it is "dead"
21:39 🔗 SketchCow I'd be happier if we were downloading and putting up stuff from the 1980s, like you were doing with tv shows and older material.
21:39 🔗 SketchCow Even the 90s
21:39 🔗 SketchCow I mean, if you have a choice.
21:44 🔗 SketchCow In other news, this test looks successful. http://archive.org/details/creativecomputingv11n11-tiffcbz\
21:44 🔗 SketchCow In other news, this test looks successful. http://archive.org/details/creativecomputingv11n11-tiffcbz
21:46 🔗 underscor Boy, my browser really hates the tracker
21:47 🔗 Nemo_bis underscor: isn't it cute to see the top downloaded items in http://archive.org/details/philosophicaltransactions a year later :)
21:47 🔗 Deewiant Pause your scripts, it'll be much more palatable ;-)
21:48 🔗 underscor Nemo_bis: Wow, I'd forgotten about that
21:48 🔗 Nemo_bis :D
21:48 🔗 underscor Damn, that is cool :D
21:48 🔗 underscor Deewiant: <:B <:B
21:51 🔗 Nemo_bis Experiments on the Refrangibility of the Invisible Rays of the Sun. By William Herschel, LL. D. F. R. S. 602 downloads
21:58 🔗 SketchCow http://archive.org/details/creativecomputingv11n11-tiffcbz
21:59 🔗 DFJustin same url is same
22:00 🔗 Nemo_bis I love how iasupload smartly retries
22:01 🔗 Nemo_bis SketchCow: would you create a collection for these 106 magazine issues I uploaded? https://archive.org/search.php?query=subject%3A"Hacker+Journal"
22:06 🔗 SketchCow If I do it right now, it'll explode. Let them derive and settle and I'll do it in 5 seconds.
22:06 🔗 SketchCow They're still deriving
22:08 🔗 godane looks like archive.org is having problems
22:10 🔗 godane also everything is waiting to be archived
22:11 🔗 Nemo_bis SketchCow: ah ok sorry, yes there are about 6000 items in the derive queue
22:12 🔗 Nemo_bis Also, I miss ocrcount
22:19 🔗 chronomex oh jesus I just loaded the github tracker
22:19 🔗 chronomex I don't think I've ever seen a tracker go this fast
22:19 🔗 chronomex zoooom
22:25 🔗 SketchCow Is there a total repository count somewhere on github?
22:25 🔗 SketchCow I'm looking for it.
22:25 🔗 SketchCow I see press release saying 3.7 million
22:26 🔗 balrog_ that's in Sep 13
22:28 🔗 SketchCow https://twitter.com/textfiles/status/279350174541819905
22:28 🔗 balrog_ this is just downloading file listings, right?
22:29 🔗 balrog_ or is that part finished?
22:29 🔗 balrog_ SketchCow: also note that there are many private github repos
22:29 🔗 balrog_ since you can pay for private ones
22:30 🔗 balrog_ 3.7 million would include those
22:30 🔗 chronomex I think that number includes gists as well
22:30 🔗 balrog_ doubt it, but maybe
22:31 🔗 balrog_ I liked github downloads because you could post binaries and hotlink them from elsewhere
22:31 🔗 balrog_ sucks that they're going away
22:34 🔗 balrog_ are you guys sure the downloads contain data?
22:34 🔗 balrog_ or is this just listings?
22:35 🔗 DFJustin I saw one that was 7mb
22:36 🔗 balrog_ some should be 20-50 or more
22:36 🔗 balrog_ DFJustin: are all the lists retrieved?
22:36 🔗 SketchCow First, realize what these are.
22:36 🔗 balrog_ so it's now downloading files, right?
22:37 🔗 SketchCow These are NOT the code repositories.
22:37 🔗 balrog_ most of them will be under 1mb
22:37 🔗 balrog_ yes, I understand
22:37 🔗 SketchCow Like github/boner-muncher is code
22:37 🔗 balrog_ however, some projects have posted fairly large files
22:37 🔗 balrog_ I've used this service myself for some of my code.
22:37 🔗 SketchCow The /downloads are JUST the separate files.
22:37 🔗 balrog_ yes
22:37 🔗 DFJustin just watched it for a couple seconds and some are dozens of mb so I think it's ok
22:37 🔗 SketchCow Well, conclusively, we're finding the vast vast vast majority of the 3.7 million never used this feature
22:37 🔗 SketchCow VAST majority.
22:37 🔗 balrog_ that is correct
22:37 🔗 balrog_ ahh, so the warrior lists those who didn't use it.
22:37 🔗 DFJustin also that is cartoonishly fast
22:37 🔗 SketchCow root@teamarchive-1:/1/ALARD/warrior/github# du -sh .
22:37 🔗 SketchCow 55G .
22:38 🔗 balrog_ hopefully wget-lua compiles before this is done :P
22:38 🔗 SketchCow root@teamarchive-1:/1/ALARD/warrior/github# find . -type f | wc -l
22:38 🔗 SketchCow 18303
22:38 🔗 SketchCow Remember, that's including index.txt
22:38 🔗 balrog_ and index.txt is generated for all repos?
22:38 🔗 SketchCow root@teamarchive-1:/1/ALARD/warrior/github# find . -name index.txt | wc -l
22:38 🔗 SketchCow 2717
22:38 🔗 SketchCow See? Yes
22:39 🔗 SketchCow Yes, it is.
22:39 🔗 SketchCow Just to keep the download counts
22:45 🔗 balrog_ how do I set this to work without warrior?
22:46 🔗 soultcer I assume same as all other warrior projects that use wget-lua
22:46 🔗 balrog_ just python ./pipeline.py?
22:47 🔗 soultcer 1) Install python, python tornado (> v2), python tornadio, python argparse, openssl headers, lua headers
22:47 🔗 soultcer 2) git clone github.com/archiveteam/seesaw-kit.git
22:47 🔗 SketchCow Poor github, they just want to do the right thing.
22:47 🔗 SketchCow OK, separate channel
22:48 🔗 SketchCow #gothub
22:48 🔗 soultcer 3) git clone github.com/archiveteam/github-download-grab.git
22:51 🔗 balrog_ soultcer: yeah I have all that, I have wget lua, just how to start it?
22:51 🔗 soultcer with run-pipeline, as usual
22:52 🔗 SketchCow Please redirect people over to #gothub.
22:52 🔗 SketchCow We're back to the Usual Crap again
23:10 🔗 SketchCow alard: Please come to #gothub - possible bug
