[00:02] running it on a couple of machines; if you could add something like an optional argument to bind to an IP, I could run it off a few IPs :)
[05:32] @alard Hello!
[10:43] tuankiet: Hi.
[10:44] kennethre: No help needed, yet, other than running one of those repository-discovery scripts, perhaps. Once we have a nice list we can look at downloading the downloads.
[10:46] tuankiet: You were after the Google scraper script for the Yahoo blogs, but I haven't had time to look at it yet (the version I have is really personalized, so it basically only works for me, at the moment).
[11:56] @alard: Ok
[11:57] I am running the Yahoo and Github script
[12:34] tuankiet: Very good.
[12:35] It's a pity that Dailybooth is so slow. We're working on too many projects.
[13:23] At last, it looks like Wikia is generating a dump per minute instead of one every 5, since the 10th http://wikistats.wikia.com/c/ca/
[13:48] OK, so.
[13:48] I have to say.
[13:48] When you checked in the github content for us to turn around and download github
[13:48] Oh man
[13:48] I almost died
[13:53] So uuh, I heard you like github..
[13:55] At this exact moment, archive.org has one petabyte of disk space
[13:56] free
[13:56] SketchCow: are you saying that because you plan to reduce it vastly and very soon? :p
[13:56] Yes
[13:56] :)
[13:56] I'd like to understand.... do we need more archiveteam warriors on the dailybooth project?
[13:57] I also have to admit that it's not so obvious what one has to do to help the archiveteam
[13:57] too many projects and we're too lazy to update the wiki
[14:03] We're not too lazy.
[14:03] The wiki's choked because of the spam. I will fix it.
[14:04] Speaking of which, can you make me sysop
[14:04] it's weird not to have the delete button on a wiki
[14:04] and frustrating for me :)
[14:18] No, I don't think more warriors would help with dailybooth. It's dailybooth that's too slow.
[14:19] Perhaps we should consider giving the warriors something else to do (github!), since we have more than enough non-warriors to keep dailybooth busy.
[14:20] We're doing 12 / 7 / 8 / 14 / 16 dailybooth users per minute (and that includes 404's).
[14:21] Do we want the Github downloads in warc format?
[14:36] I personally think no.
[14:38] You don't want to go for maximum inaccessibility?
[14:40] If not warc, then what? A .tar?
[14:40] (What to do with the /downloads HTML page?)
[14:42] We could also just rsync the files as-is. The url structure is tidy enough (user/repo/download).
[14:43] The downloads page has the download count, everything else exists in other forms: https://github.com/ArchiveTeam/mobileme-grab/downloads
[15:18] I think in this case, we're rescuing a filesystem, not an experience
[15:19] A .txt file accompanying the files indicating the download count, if you're being completist.
[15:19] And personally, I think that assessment could be in a single .txt file
[16:18] SketchCow: Could you have a look at alardland/github?
[16:36] * closure perks up his ears hearing about plans to do something with github
[16:36] is this about archiving the git repos, or some of their other data?
[16:36] The downloads.
[16:37] hmm, not familiar with that
[16:37] https://github.com/blog/1302-goodbye-uploads
[16:37] aha, thanks
[16:38] We're making a list of repositories, so that could be used for other things in the future.
[16:38] so there's a guy who has been using their API to find all new repositories for a while.. I forget the link to his site
[16:40] alard - Looks good.
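On the [00:02] request for an option to bind to a specific IP: wget (and therefore the wget-lua builds these scripts use) already accepts --bind-address, so per-IP runs are possible in principle if a script passes the option through. A minimal sketch, with a placeholder address and URL rather than anything from this project:

    # Hypothetical example: pin one fetch to a specific local source address.
    # 192.0.2.10 and the URL are placeholders, not values from this project.
    wget --bind-address=192.0.2.10 "https://example.org/some/file"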
[16:40] I suspect this won't be a LOT of data
[16:41] You *hope* it's not a lot of data.
[16:42] for a lot of data, see sourceforge downloads :P
[16:42] I don't actually (hope)
[16:42] Because once again the Compass Has Swung and archive.org has tons of disk space.
[16:42] I mean, we still should help raise funds because it helps
[16:43] But 1 petabyte of free disk space right now
[16:43] So yeah, let's do it.
[16:43] I'll e-mail a hug to my github buddies
[16:46] ah, I see you already found githubarchive.org
[16:48] SketchCow: Want to say hi in the User-Agent header as well?
[16:52] Sure.
[16:52] "Archive Team Loves GitHub"
[16:55] https://github.com/ArchiveTeam/github-download-grab/commit/e3073ec5573a6d9b1e9508ad283168358019aae3
[17:07] Heh, the tracker might not like this: http://tracker.archiveteam.org/github/
[17:08] have you already pulled in the api dump data? If not, I might try some massaging
[17:09] No, I haven't. We're well on our way with the API exploration, though: http://tracker.archiveteam.org:8125/
[17:09] (I think the highest ID is in the 7,000,000 range.)
[17:12] I'm running the scraper for that, so if there's time to plow through the whole range, that's fine
[17:44] What is our HQ url again?
[17:45] What? Headquarters? http://archiveteam.org/ ?
[17:49] No, got it.
[17:49] http://warriorhq.archiveteam.org/
[17:49] Ah, that
[17:50] burning a bluray of gbtv/theblaze episodes
[17:50] the rest of november and election coverage is on this one
[18:43] SketchCow: can I buy another 50 kg of magazines to send you? :D
[18:43] "PC Professionale" 110-189
[18:43] shipping will cost about three times as much as buying them
[18:44] I like how kg is our standard unit for magazines now
[18:45] DFJustin: what other unit could I choose for transatlantic cooperation? :p
[18:53] I don't remember if ias3upload.pl overwrites existing files with the same name or not
[18:57] i uploaded august of 2011 episodes of x-play today
[19:11] At current trends, github data will be about 200gb
[19:14] *yawn*
[19:17] *slurp*
[19:31] alard: did we already finish the API grabbing?
[19:32] my discoverer died last night with requests.exceptions.ConnectionError: HTTPSConnectionPool(host=u'api.github.com', port=443): Max retries exceeded with url: /repositories?since=1295141
[19:32] I'm still running the github repo explorer, it seems to come up with some new tasks every couple of minutes
[19:33] (I put my auth info in there so it can do 5000 instead of 60 per hour)
[19:33] neato
[19:34] (At first I accidentally put them on a tracker HTTP request, had to change the password then >_<)
[19:34] hah, woops
[19:34] probably nobody's looking at those ... except the NSA watches them in transit
[19:35] Yep, I think it was an unencrypted request too
[19:35] you're fucked
[19:35] Well, I managed to change the password without any trouble
[19:36] Maybe somebody defaced all my repos in the interim ;-P
[19:45] you seem to be sucking the job queue dry
[19:45] good work
[19:46] Where does it get jobs from?
[20:10] chronomex: sorry :)
[20:11] I'd recommend using something like celery
[20:23] erp, what?
[20:27] re: requests.exceptions.ConnectionError
[20:27] to spread them across different machines, handle exceptions, etc
[20:27] 62 BBC R&D Descriptions left!
[20:27] Poor github
[20:30] yeah, I'm getting no tasks.
[20:30] actually I am getting one once in a while
[20:33] http://zeppelin.xrtc.net/corp.xrtc.net/shilling.corp.xrtc.net/index.html
[20:42] SketchCow: Thanks for putting up x-play episodes in collection
[20:54] No problem.
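For context on the discovery errors above: the repo explorer walks GitHub's public /repositories endpoint, which pages by the id of the last repository seen, and authenticated requests get 5000 per hour instead of 60. A rough curl equivalent of a single step (the token variable is a placeholder; the actual discoverer is a Python script with its own retry handling):

    # One page of the public repository listing, starting after repo id 1295141
    # (the id from the traceback above). $GITHUB_TOKEN is a placeholder for a
    # personal API token; authenticating raises the limit from 60/hr to 5000/hr.
    curl -s -H "Authorization: token $GITHUB_TOKEN" \
         "https://api.github.com/repositories?since=1295141"
    # The highest "id" in the JSON response becomes the next ?since= value,
    # so the whole id range can be walked page by page.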
[20:54] More soon
[21:04] i will do 2012 episodes in 2013 so i don't get this stuff darked
[21:05] when the network is dead there shouldn't be fear of nbc sending dmca notices i hope
[21:07] there are so many people fetching github repo lists that it is hard to actually get a task assigned
[21:24] the leaderboard is flying
[21:26] Whoa there's already a project to download
[21:28] Perhaps I should ask: what is an acceptable number of requests to send to GitHub?
[21:28] We're currently doing over 50 requests per second.
[21:31] As long as Github doesn't show elevated error response rates, keep it up :D
[21:32] Apparently underscor has joined us.
[21:32] The non-Warrioring cheater.
[21:33] Well dailybooth is kind of boring with its low download speed and timeouts
[21:33] * underscor pads in drearily, rubbing sleep out of his eyes
[21:33] what oh yes hi
[21:33] alard: to the api?
[21:33] No, to the /downloads page.
[21:33] i wouldn't worry about it
[21:34] unless you get 500s
[21:37] The actual downloads are from cloudfront and probably s3-backed
[21:37] yep
[21:38] (The precise thing to say would be: 50 r/s to the /downloads pages.)
[21:38] Just for the record, godane - you are cutting it way close to the edge.
[21:39] I realize a lot of these safe times and cooldown periods are fake and wishful thinking, but putting up stuff that is less than a year old is inviting the scorpion to reflexively sting even though it is "dead"
[21:39] I'd be happier if we were downloading and putting up stuff from the 1980s, like you were doing with tv shows and older material.
[21:39] Even the 90s
[21:39] I mean, if you have a choice.
[21:44] In other news, this test looks successful. http://archive.org/details/creativecomputingv11n11-tiffcbz\
[21:44] In other news, this test looks successful. http://archive.org/details/creativecomputingv11n11-tiffcbz
[21:46] Boy, my browser really hates the tracker
[21:47] underscor: isn't it cute to see the top downloaded items in http://archive.org/details/philosophicaltransactions a year later :)
[21:47] Pause your scripts, it'll be much more palatable ;-)
[21:48] Nemo_bis: Wow, I'd forgotten about that
[21:48] :D
[21:48] Damn, that is cool :D
[21:48] Deewiant: <:B <:B
[21:51] Experiments on the Refrangibility of the Invisible Rays of the Sun. By William Herschel, LL. D. F. R. S. 602 downloads
[21:58] http://archive.org/details/creativecomputingv11n11-tiffcbz
[21:59] same url is same
[22:00] I love how iasupload smartly retries
[22:01] SketchCow: would you create a collection for these 106 magazine issues I uploaded? https://archive.org/search.php?query=subject%3A"Hacker+Journal"
[22:06] If I do it right now, it'll explode. Let them derive and settle and I'll do it in 5 seconds.
[22:06] They're still deriving
[22:08] looks like archive.org is having problems
[22:10] also everything is waiting to be archived
[22:11] SketchCow: ah ok sorry, yes there are about 6000 items in the derive queue
[22:12] Also, I miss ocrcount
[22:19] oh jesus I just loaded the github tracker
[22:19] I don't think I've ever seen a tracker go this fast
[22:19] zoooom
[22:25] Is there a total repository count somewhere on github?
[22:25] I'm looking for it.
[22:25] I see a press release saying 3.7 million
[22:26] that's in Sep 13
[22:28] https://twitter.com/textfiles/status/279350174541819905
[22:28] this is just downloading file listings, right?
[22:29] or is that part finished?
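For anyone curious what the requests being rate-discussed above look like outside the warrior, a hand-rolled fetch of a single /downloads page might be roughly (someuser/somerepo is a placeholder; the actual project uses wget-lua and tracker-assigned items):

    # Fetch one repository's /downloads page with the friendly User-Agent
    # mentioned earlier. Keep the overall rate well below the ~50 r/s discussed
    # above, and back off if GitHub starts returning 500s.
    wget --user-agent="Archive Team Loves GitHub" \
         -O somerepo-downloads.html \
         "https://github.com/someuser/somerepo/downloads"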
[22:29] SketchCow: also note that there are many private github repos
[22:29] since you can pay for private ones
[22:30] 3.7 million would include those
[22:30] I think that number includes gists as well
[22:30] doubt it, but maybe
[22:31] I liked github downloads because you could post binaries and hotlink them from elsewhere
[22:31] sucks that they're going away
[22:34] are you guys sure the downloads contain data?
[22:34] or is this just listings?
[22:35] I saw one that was 7mb
[22:36] some should be 20-50 or more
[22:36] DFJustin: are all the lists retrieved?
[22:36] First, realize what these are.
[22:36] so it's now downloading files, right?
[22:37] These are NOT the code repositories.
[22:37] most of them will be under 1mb
[22:37] yes, I understand
[22:37] Like github/boner-muncher is code
[22:37] however, some projects have posted fairly large files
[22:37] I've used this service myself for some of my code.
[22:37] The /downloads are JUST the separate files.
[22:37] yes
[22:37] just watched it for a couple seconds and some are dozens of mb so I think it's ok
[22:37] Well, conclusively, we're finding the vast vast vast majority of the 3.7 million never used this feature
[22:37] VAST majority.
[22:37] that is correct
[22:37] ahh, so the warrior lists those who didn't use it.
[22:37] also that is cartoonishly fast
[22:37] root@teamarchive-1:/1/ALARD/warrior/github# du -sh .
[22:37] 55G .
[22:38] hopefully wget-lua compiles before this is done :P
[22:38] 18303
[22:38] root@teamarchive-1:/1/ALARD/warrior/github# find . -type f | wc -l
[22:38] Remember, that's including index.txt
[22:38] and index.txt is generated for all repos?
[22:38] 2717
[22:38] root@teamarchive-1:/1/ALARD/warrior/github# find . -name index.txt | wc -l
[22:38] See? Yes
[22:39] Yes, it is.
[22:39] Just to keep the download counts
[22:45] how do I set this to work without the warrior?
[22:46] I assume same as all other warrior projects that use wget-lua
[22:46] just python ./pipeline.py?
[22:47] 1) Install python, python tornado (> v2), python tornadio, python argparse, openssl headers, lua headers
[22:47] 2) git clone github.com/archiveteam/seesaw-kit.git
[22:47] Poor github, they just want to do the right thing.
[22:47] OK, separate channel
[22:48] #gothub
[22:48] 3) git clone github.com/archiveteam/github-download.git
[22:51] soultcer: yeah I have all that, I have wget-lua, just how do I start it?
[22:51] with run-pipeline, as usual
[22:52] Please redirect people over to #gothub.
[22:52] We're back to the Usual Crap again
[23:10] alard: Please come to #gothub - possible bug
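Pulling the setup steps above together, a non-warrior run looks roughly like this; the dependency list and repo names are from the chat, while the exact run-pipeline arguments (nickname, --concurrent) are assumed to follow the usual seesaw conventions of the time:

    # Prerequisites listed above: python, tornado (>2), tornadio, argparse,
    # plus the openssl and lua headers needed to build wget-lua.
    git clone https://github.com/ArchiveTeam/seesaw-kit.git
    git clone https://github.com/ArchiveTeam/github-download.git
    cd github-download
    # YOURNICK is a placeholder for your tracker nickname; run-pipeline comes
    # from seesaw-kit and may need to be invoked as ../seesaw-kit/run-pipeline.
    run-pipeline pipeline.py YOURNICK --concurrent 2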