00:02 <corobo> running it on a couple machines, if you could add something like an optional argument to bind to an ip could run it off a few ips :)
05:32 <tuankiet> @alard Hello!
10:43 <alard> tuankiet: Hi.
10:44 <alard> kennethre: No help needed, yet, other than running one of those repository-discovery scripts, perhaps. Once we have a nice list we can look at downloading the downloads.
10:46 <alard> tuankiet: You were after the Google scraper script for the Yahoo blogs, but I haven't had time to look at it yet (the version I have is really personalized, so it basically only works for me, at the moment).
11:56 <tuankiet> @alard: Ok
11:57 <tuankiet> I am running the Yahoo and Github scripts
12:34 <alard> tuankiet: Very good.
12:35 <alard> It's a pity that Dailybooth is so slow. We're working on too many projects.
13:23 <Nemo_bis> At last, it looks like Wikia is generating a dump per minute instead of one every 5, since the 10th http://wikistats.wikia.com/c/ca/
13:48 <SketchCow> OK, so.
13:48 <SketchCow> I have to say.
13:48 <SketchCow> When you checked in the github content for us to turn around and download github
13:48 <SketchCow> Oh man
13:48 <SketchCow> I almost died
13:53 <GLaDOS> So uuh, I heard you like github..
13:55 <SketchCow> At this exact moment, archive.org has one petabyte of disk space
13:56 <SketchCow> free
13:56 <Nemo_bis> SketchCow: are you saying that because you plan to reduce it vastly and very soon? :p
13:56 <SketchCow> Yes
13:56 <Nemo_bis> :)
13:56 <SketchCow> I'd like to understand.... do we need more archiveteam warriors on the dailybooth project?
13:57 <Nemo_bis> I also have to admit that it's not so obvious what one has to do to help the archiveteam
13:57 <Nemo_bis> too many projects and we're too lazy to update the wiki
14:03 <SketchCow> We're not too lazy.
14:03 <SketchCow> The wiki's choked because of the spam. I will fix it.
14:04 <Nemo_bis> Speaking of which, can you make me sysop?
14:04 <Nemo_bis> it's weird not to have the delete button on a wiki
14:04 <Nemo_bis> and frustrating for me :)
14:18 <alard> No, I don't think more warriors would help with dailybooth. It's dailybooth that's too slow.
14:19 <alard> Perhaps we should consider giving the warriors something else to do (github!), since we have more than enough non-warriors to keep dailybooth busy.
14:20 <alard> We're doing 12 / 7 / 8 / 14 / 16 dailybooth users per minute (and that includes 404's).
14:21 <alard> Do we want the Github downloads in warc format?
14:36 <SketchCow> I personally think no.
14:38 <alard> You don't want to go for maximum inaccessibility?
14:40 <alard> If not warc, then what? A .tar?
14:40 <alard> (What to do with the /downloads HTML page?)
14:42 <alard> We could also just rsync the files as-is. The url structure is tidy enough (user/repo/download).
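The tidy `user/repo/download` structure alard mentions can be sketched as a small path mapper. The URL shape and the helper below are assumptions for illustration, not the actual grab code.

```python
# Hypothetical sketch: map a GitHub /downloads URL onto the tidy
# user/repo/file layout described above. The exact URL format is an
# assumption, not taken from the real grab script.
from pathlib import PurePosixPath
from urllib.parse import urlparse

def download_path(url):
    """https://github.com/<user>/<repo>/downloads/<file> -> user/repo/file"""
    # parts of the URL path: ('/', user, repo, 'downloads', file)
    parts = PurePosixPath(urlparse(url).path).parts
    return str(PurePosixPath(parts[1], parts[2], parts[-1]))

print(download_path("https://github.com/ArchiveTeam/mobileme-grab/downloads/example.tar.gz"))
```

Rsyncing into such a layout keeps the archive browsable as a plain filesystem, which matches the "rescuing a filesystem, not an experience" approach discussed below in the log.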
14:43 <alard> The downloads page has the download count; everything else exists in other forms: https://github.com/ArchiveTeam/mobileme-grab/downloads
15:18 <SketchCow> I think in this case, we're rescuing a filesystem, not an experience
15:19 <SketchCow> A .txt file accompanying the files indicating the download count, if you're being completist.
15:19 <SketchCow> And personally, I think that assessment could be in a single .txt file
16:18 <alard> SketchCow: Could you have a look at alardland/github?
16:36 * closure perks up his ears hearing about plans to do something with github
16:36 <closure> is this about archiving the git repos, or some of their other data?
16:36 <alard> The downloads.
16:37 <closure> hmm, not familiar with that
16:37 <alard> https://github.com/blog/1302-goodbye-uploads
16:37 <closure> aha, thanks
16:38 <alard> We're making a list of repositories, so that could be used for other things in the future.
16:38 <closure> so there's a guy who has been using their API to find all new repositories for a while.. I forget the link to his site
16:40 <SketchCow> alard - Looks good.
16:40 <SketchCow> I suspect this won't be a LOT of data
16:41 <alard> You *hope* it's not a lot of data.
16:42 <closure> for a lot of data, see sourceforge downloads :P
16:42 <SketchCow> I don't actually (hope)
16:42 <SketchCow> Because once again the Compass Has Swung and archive.org has tons of disk space.
16:42 <SketchCow> I mean, we still should help raise funds because it helps
16:43 <SketchCow> But 1 petabyte of free disk space right now
16:43 <SketchCow> So yeah, let's do it.
16:43 <SketchCow> I'll e-mail a hug to my github buddies
16:46 <closure> ah, I see you already found githubarchive.org
16:48 <alard> SketchCow: Want to say hi in the User-Agent header as well?
16:52 <SketchCow> Sure.
16:52 <SketchCow> "Archive Team Loves GitHub"
16:55 <alard> https://github.com/ArchiveTeam/github-download-grab/commit/e3073ec5573a6d9b1e9508ad283168358019aae3
17:07 <alard> Heh, the tracker might not like this: http://tracker.archiveteam.org/github/
17:08 <closure> have you already pulled in the api dump data? If not, I might try some massaging
17:09 <alard> No, I haven't. We're well on our way with the API exploration, though: http://tracker.archiveteam.org:8125/
17:09 <alard> (I think the highest ID is in the 7,000,000 range.)
17:12 <closure> I'm running the scraper for that, so if there's time to plow through the whole range, that's fine
17:44 <SketchCow> What is our HQ url again?
17:45 <nitro2k01> What? Headquarters? http://archiveteam.org/ ?
17:49 <SketchCow> No, got it.
17:49 <SketchCow> http://warriorhq.archiveteam.org/
17:49 <nitro2k01> Ah, that
17:50 <godane> burning a bluray of gbtv/theblaze episodes
17:50 <godane> the rest of november and election coverage is on this one
18:43 <Nemo_bis> SketchCow: can I buy another 50 kg of magazines to send you? :D
18:43 <Nemo_bis> "PC Professionale" 110-189
18:43 <Nemo_bis> shipping will cost about three times as much as buying
18:44 <DFJustin> I like how kg is our standard unit for magazines now
18:45 <Nemo_bis> DFJustin: what other unit could I choose for transatlantic cooperations? :p
18:53 <Nemo_bis> I don't remember if ias3upload.pl overwrites existing files with the same name or not
18:57 <godane> i uploaded august of 2011 episodes of x-play today
19:11 <SketchCow> At current trends, github data will be about 200gb
19:14 <DFJustin> *yawn*
19:17 <chronomex> *slurp*
19:31 <chronomex> alard: did we already finish the API grabbing?
19:32 <chronomex> my discoverer died last night with requests.exceptions.ConnectionError: HTTPSConnectionPool(host=u'api.github.com', port=443): Max retries exceeded with url: /repositories?since=1295141
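One generic way to keep a discovery script alive through a transient `ConnectionError` like this is a retry wrapper with exponential backoff. This is a hedged sketch, not the actual discoverer code; the wrapper and its parameters are illustrative.

```python
# Sketch: retry fn() on transient errors, doubling the delay each time,
# and re-raise once the attempts are exhausted.
import time

def with_retries(fn, attempts=5, base_delay=1.0, retry_on=(Exception,)):
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts, let the caller see the error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Wrapping the page fetch, e.g. `with_retries(lambda: fetch_page(since))` with a hypothetical `fetch_page`, would let a long-running discoverer ride out a dropped connection instead of dying overnight.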
19:32 <Deewiant> I'm still running the github repo explorer, it seems to come up with some new tasks every couple of minutes
19:33 <Deewiant> (I put my auth info in there so it can do 5000 instead of 60 per hour)
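The explorer being described walks GitHub's `GET /repositories?since=<id>` endpoint, which returns a page of public repositories with ids greater than `since`; authenticating raises the API quota from 60 to 5,000 requests per hour. Below is a hedged sketch of that cursor loop, with the HTTP call injected so the pagination logic stands alone (in real use `fetch_page` would be an authenticated request to api.github.com).

```python
# Sketch of the repository-discovery cursor loop. fetch_page(since) is a
# stand-in for the real (authenticated) HTTP call and should return the
# decoded JSON list for one page, or an empty list when done.

def discover(fetch_page, start=0):
    """Yield repository full names in id order."""
    since = start
    while True:
        page = fetch_page(since)
        if not page:
            return
        for repo in page:
            yield repo["full_name"]
        # the cursor for the next page is the highest id seen so far
        since = max(r["id"] for r in page)
```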
19:33 <chronomex> neato
19:34 <Deewiant> (At first I accidentally put them on a tracker HTTP request, had to change the password then >_<)
19:34 <chronomex> hah, woops
19:34 <chronomex> probably nobody's looking at those ... except the NSA watches them in transit
19:35 <Deewiant> Yep, I think it was an unencrypted request too
19:35 <chronomex> you're fucked
19:36 <Deewiant> Well, I managed to change the password without any trouble
19:36 <Deewiant> Maybe somebody defaced all my repos in the interim ;-P
19:45 <chronomex> you seem to be sucking the job queue dry
19:45 <chronomex> good work
19:46 <Deewiant> Where does it get jobs from?
20:10 <kennethre> chronomex: sorry :)
20:11 <kennethre> I'd recommend using something like celery
20:23 <chronomex> erp, what?
20:27 <kennethre> re: requests.exceptions.ConnectionError
20:27 <kennethre> to spread them across different machines, handle exceptions, etc
20:27 <SketchCow> 62 BBC R&D Descriptions left!
20:27 <SketchCow> Poor github
20:30 <balrog_> yeah, I'm getting no tasks.
20:30 <balrog_> actually I am getting one every once in a while
20:33 <alard> http://zeppelin.xrtc.net/corp.xrtc.net/shilling.corp.xrtc.net/index.html
20:42 <godane> SketchCow: Thanks for putting up x-play episodes in the collection
20:54 <SketchCow> No problem.
20:54 <SketchCow> More soon
21:04 <godane> i will do 2012 episodes in 2013 so i don't get this stuff darked
21:05 <godane> when the network is dead there shouldn't be fear of nbc sending dmca notices i hope
21:07 <soultcer> there are so many people fetching github repo lists that it is hard to actually get a task assigned
21:24 <sankin1> the leaderboard is flying
21:26 <soultcer> Whoa there's already a project to download
21:28 <alard> Perhaps I should ask: what is an acceptable number of requests to send to GitHub?
21:28 <alard> We're currently doing over 50 requests per second.
21:31 <soultcer> As long as Github doesn't show elevated error response rates, keep it up :D
21:32 <alard> Apparently underscor has joined us.
21:32 <Deewiant> The non-Warrioring cheater.
21:33 <soultcer> Well dailybooth is kind of boring with its low download speed and timeouts
21:33 * underscor pads in drearily, rubbing sleep out of his eyes
21:33 <underscor> what oh yes hi
21:33 <kennethre> alard: to the api?
21:33 <alard> No, to the /downloads page.
21:33 <kennethre> i wouldn't worry about it
21:34 <kennethre> unless you get 500s
21:37 <soultcer> The actual downloads are from cloudfront and probably s3-backed
21:37 <kennethre> yep
21:38 <alard> (The precise thing to say would be: 50 r/s to the /downloads pages.)
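One hedged way to hold a crawler to a budget like 50 requests per second is an interval-based limiter that spaces requests at least `1/rate` seconds apart. The single-threaded sketch below is illustrative, not Archive Team's actual tooling; the clock and sleep hooks are injectable so the behaviour can be tested without waiting.

```python
# Sketch: a minimal request-rate limiter. rate=50 means one request
# slot every 20 ms.
import time

class RateLimiter:
    def __init__(self, rate):
        self.interval = 1.0 / rate
        self.next_slot = 0.0

    def wait(self, now=None, sleep=time.sleep):
        """Block until the next request slot; return the slot time."""
        if now is None:
            now = time.monotonic()
        if now < self.next_slot:
            sleep(self.next_slot - now)
            now = self.next_slot
        self.next_slot = now + self.interval
        return now
```

A polite crawler would call `limiter.wait()` before each request, and back off further on 500s, as kennethre suggests below.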
21:38 <SketchCow> Just for the record, godane - you are cutting it way close to the edge.
21:39 <SketchCow> I realize a lot of these safe times and cooldown periods are fake and wishful thinking, but putting up stuff that is less than a year old is inviting the scorpion to reflexively sting even though it is "dead"
21:39 <SketchCow> I'd be happier if we were downloading and putting up stuff from the 1980s, like you were doing with tv shows and older material.
21:39 <SketchCow> Even the 90s
21:39 <SketchCow> I mean, if you have a choice.
21:44 <SketchCow> In other news, this test looks successful. http://archive.org/details/creativecomputingv11n11-tiffcbz\
21:44 <SketchCow> In other news, this test looks successful. http://archive.org/details/creativecomputingv11n11-tiffcbz
21:46 <underscor> Boy, my browser really hates the tracker
21:47 <Nemo_bis> underscor: isn't it cute to see the top downloaded items in http://archive.org/details/philosophicaltransactions a year later :)
21:47 <Deewiant> Pause your scripts, it'll be much more palatable ;-)
21:48 <underscor> Nemo_bis: Wow, I'd forgotten about that
21:48 <Nemo_bis> :D
21:48 <underscor> Damn, that is cool :D
21:48 <underscor> Deewiant: <:B <:B
21:51 <Nemo_bis> Experiments on the Refrangibility of the Invisible Rays of the Sun. By William Herschel, LL. D. F. R. S. 602 downloads
21:58 <SketchCow> http://archive.org/details/creativecomputingv11n11-tiffcbz
21:59 <DFJustin> same url is same
22:00 <Nemo_bis> I love how ias3upload smartly retries
22:01 <Nemo_bis> SketchCow: would you create a collection for these 106 magazine issues I uploaded? https://archive.org/search.php?query=subject%3A"Hacker+Journal"
22:06 <SketchCow> If I do it right now, it'll explode. Let them derive and settle and I'll do it in 5 seconds.
22:06 <SketchCow> They're still deriving
22:08 <godane> looks like archive.org is having problems
22:10 <godane> also everything is waiting to be archived
22:11 <Nemo_bis> SketchCow: ah ok sorry, yes there are about 6000 items in the derive queue
22:12 <Nemo_bis> Also, I miss ocrcount
22:19 <chronomex> oh jesus I just loaded the github tracker
22:19 <chronomex> I don't think I've ever seen a tracker go this fast
22:19 <chronomex> zoooom
22:25 <SketchCow> Is there a total repository count somewhere on github?
22:25 <SketchCow> I'm looking for it.
22:25 <SketchCow> I see a press release saying 3.7 million
22:26 <balrog_> that's in Sep 13
22:28 <SketchCow> https://twitter.com/textfiles/status/279350174541819905
22:28 <balrog_> this is just downloading file listings, right?
22:29 <balrog_> or is that part finished?
22:29 <balrog_> SketchCow: also note that there are many private github repos
22:29 <balrog_> since you can pay for private ones
22:30 <balrog_> 3.7 million would include those
22:30 <chronomex> I think that number includes gists as well
22:30 <balrog_> doubt it, but maybe
22:31 <balrog_> I liked github downloads because you could post binaries and hotlink them from elsewhere
22:31 <balrog_> sucks that they're going away
22:34 <balrog_> are you guys sure the downloads contain data?
22:34 <balrog_> or is this just listings?
22:35 <DFJustin> I saw one that was 7mb
22:36 <balrog_> some should be 20-50 or more
22:36 <balrog_> DFJustin: are all the lists retrieved?
22:36 <SketchCow> First, realize what these are.
22:36 <balrog_> so it's now downloading files, right?
22:37 <SketchCow> These are NOT the code repositories.
22:37 <balrog_> most of them will be under 1mb
22:37 <balrog_> yes, I understand
22:37 <SketchCow> Like github/boner-muncher is code
22:37 <balrog_> however, some projects have posted fairly large files
22:37 <balrog_> I've used this service myself for some of my code.
22:37 <SketchCow> The /downloads are JUST the separate files.
22:37 <balrog_> yes
22:37 <DFJustin> just watched it for a couple seconds and some are dozens of mb so I think it's ok
22:37 <SketchCow> Well, conclusively, we're finding the vast vast vast majority of the 3.7 million never used this feature
22:37 <SketchCow> VAST majority.
22:37 <balrog_> that is correct
22:37 <balrog_> ahh, so the warrior lists those who didn't use it.
22:37 <DFJustin> also that is cartoonishly fast
22:37 <SketchCow> root@teamarchive-1:/1/ALARD/warrior/github# du -sh .
22:37 <SketchCow> 55G .
22:38 <balrog_> hopefully wget-lua compiles before this is done :P
22:38 <SketchCow> 18303
22:38 <SketchCow> root@teamarchive-1:/1/ALARD/warrior/github# find . -type f | wc -l
22:38 <SketchCow> Remember, that's including index.txt
22:38 <balrog_> and index.txt is generated for all repos?
22:38 <SketchCow> 2717
22:38 <SketchCow> root@teamarchive-1:/1/ALARD/warrior/github# find . -name index.txt | wc -l
22:39 <SketchCow> See? Yes
22:39 <SketchCow> Yes, it is.
22:39 <SketchCow> Just to keep the download counts
22:45 <balrog_> how do I set this to work without warrior?
22:46 <soultcer> I assume same as all other warrior projects that use wget-lua
22:46 <balrog_> just python ./pipeline.py?
22:47 <soultcer> 1) Install python, python tornado (> v2), python tornadio, python argparse, openssl headers, lua headers
22:47 <soultcer> 2) git clone github.com/archiveteam/seesaw-kit.git
22:47 <SketchCow> Poor github, they just want to do the right thing.
22:48 <SketchCow> OK, separate channel
22:48 <SketchCow> #gothub
22:48 <soultcer> 3) git clone github.com/archiveteam/github-download.git
22:51 <balrog_> soultcer: yeah I have all that, I have wget-lua, just how to start it?
22:51 <soultcer> with run-pipeline, as usual
22:52 <SketchCow> Please redirect people over to #gothub.
22:52 <SketchCow> We're back to the Usual Crap again
23:10 <SketchCow> alard: Please come to #gothub - possible bug