Time |
Nickname |
Message |
00:13
🔗
|
db48x |
closure: very cool :) |
01:01
🔗
|
|
patricko- is now known as patrickod |
01:28
🔗
|
Kazzy |
git-annex seems to ignore annex-ssh-options when running git annex find? |
01:29
🔗
|
Kazzy |
(I have no idea how git-annex works.. i assume that tries to connect to iabak.archiveteam.org when it runs.) |
01:29
🔗
|
db48x |
git annex find only needs to look at the local metadata |
01:34
🔗
|
Kazzy |
oops, debug messages i threw in threw me.. git annex sync doesn't play nicely with ssh-options |
01:34
🔗
|
Kazzy |
https://github.com/db48x/IA.BAK/blob/a320bbbf0abd1359c0b20fbe7f412864437fa357/iabak-helper#L21 |
01:36
🔗
|
closure |
Kazzy: well, git annex sync didn't use to support the ssh-options, but that was improved in february and afaik it does now |
01:37
🔗
|
Kazzy |
hm, this was downloaded today from https://downloads.kitenet.net/git-annex/linux... etc |
01:37
🔗
|
Kazzy |
is there a newer version out there somewhere? |
01:37
🔗
|
|
patrickod is now known as patricko- |
01:37
🔗
|
closure |
if it wasn't working, we'd not be getting stats updates from all the clients |
01:37
🔗
|
trs80 |
Kazzy: are you running git-annex in your regular PATH, or the one that was downloaded? |
01:38
🔗
|
|
patricko- is now known as patrickod |
01:38
🔗
|
Kazzy |
from the downloaded copy |
01:42
🔗
|
trs80 |
closure: flock: invalid option -- 'E' |
01:42
🔗
|
trs80 |
flock (util-linux 2.20.1) |
01:42
🔗
|
closure |
oh, ok |
01:42
🔗
|
trs80 |
although that's the wheezy version, let me go to jessie |
01:42
🔗
|
closure |
that makes it not exit 1 when the lock is busy |
01:44
🔗
|
trs80 |
yeah, flock from util-linux 2.25.2 has -E |
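The log does not show how closure fixed the `-E` problem. As a minimal sketch of one way to get "don't fail when the lock is busy" without `-E`, the file-descriptor form of `flock` can be used, which older util-linux releases also accept. The helper name `run_unless_locked` is hypothetical and not taken from the IA.BAK scripts.

```sh
# Hypothetical helper: run a command under a non-blocking lock.
# If another process already holds the lock, skip quietly with exit 0
# instead of failing, without relying on flock's -E option.
run_unless_locked() {
    lockfile=$1
    shift
    (
        # fd 9 is opened on the lock file by the redirection below
        flock -n 9 || exit 0   # lock busy: leave the subshell successfully
        "$@"                   # lock held: run the real command
    ) 9>"$lockfile"
}

# Usage (paths are placeholders):
#   run_unless_locked /tmp/iabak.lock ./some-periodic-job
```

On a util-linux new enough to have `-E`, `flock -n -E 0 LOCKFILE COMMAND` expresses the same intent in one line; the subshell form is only a fallback for older systems such as the wheezy one mentioned above.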
01:44
🔗
|
GitHub103/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to server: http://git.io/vek2l |
01:44
🔗
|
GitHub103/#internetarchive.bak |
IA.BAK/server 926dc43 Joey Hess: comment typo |
01:51
🔗
|
closure |
trs80: should be fixed |
01:56
🔗
|
Kazzy |
closure: what version of git-annex are you currently using? |
01:57
🔗
|
closure |
5.20150219 or newer will work |
02:02
🔗
|
Kazzy |
totally weird behaviour, really.. can auth fine using the key itself, but when the script calls it, we don't play ball |
02:04
🔗
|
closure |
if you have a ~/.config/git-annex/program that points at some other, old version you have installed, that could possibly explain it |
02:05
🔗
|
closure |
hmm, no, I made a change recently that prevents that being a problem |
02:06
🔗
|
Kazzy |
nope, ~/.config/git-annex/ doesn't exist at all, i changed the script to use the copy saved in a directory i downloaded it to earlier, from the link in the script |
02:06
🔗
|
Kazzy |
git-annex version: 5.20150327-g0ae1f8c |
02:07
🔗
|
closure |
my guess comes down to "I changed the script" ... |
02:08
🔗
|
Kazzy |
running 'git annex sync' in the shard1 directory asks for the password, even outside the script |
02:09
🔗
|
Kazzy |
the only real difference is the location that remote.origin.annex-ssh-options points to for the key, it's an absolute path |
02:22
🔗
|
|
Disconnected (Connection reset by peer). |
02:23
🔗
|
|
Now talking on #internetarchive.bak |
02:23
🔗
|
|
Topic for #internetarchive.bak is: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK | #archiveteam |
02:23
🔗
|
|
Topic for #internetarchive.bak set by chfoo!~chris@[redacted] at Wed Mar 4 18:38:46 2015 |
02:25
🔗
|
|
chfoo has quit (Read error: Operation timed out) |
02:27
🔗
|
|
You are now known as chfoo |
03:00
🔗
|
underscor |
closure: does the "IA Only" include dark items? |
03:01
🔗
|
underscor |
(ie, we'll never get rid of all of the red on that graph?) |
03:01
🔗
|
db48x |
no, dark items were excluded from the census, and therefore weren't added to the shard |
03:02
🔗
|
underscor |
There have since been a lot though |
03:02
🔗
|
underscor |
I get a lot of 403s, at least |
03:03
🔗
|
underscor |
http://archive.org/history/housegarden140julnewy |
03:03
🔗
|
underscor |
for example |
03:03
🔗
|
underscor |
isn't dark, but is in printdisabled collection |
03:03
🔗
|
underscor |
which makes it "private" but not "dark" |
03:05
🔗
|
db48x |
hmm, 12 days |
03:05
🔗
|
db48x |
I think that change is newer than the census |
03:06
🔗
|
db48x |
yes, the census was taken 3-4 weeks ago: https://archive.org/details/ia-bak-census_20150304 |
03:08
🔗
|
db48x |
so I guess the answer is that there will eventually be some small number of IA Only items left |
03:13
🔗
|
closure |
we can remove them from the repo if we want to |
03:18
🔗
|
GitHub24/#internetarchive.bak |
IA.BAK/server 94105b3 Joey Hess: run gc as SHARD user, not root, which was messing up perms and stats |
03:18
🔗
|
GitHub24/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to server: http://git.io/vekFJ |
03:22
🔗
|
|
ENDING LOGGING AT Thu Apr 2 22:22:26 2015 |
03:23
🔗
|
|
BEGIN LOGGING AT Thu Apr 2 22:23:08 2015 |
03:37
🔗
|
GitHub19/#internetarchive.bak |
IA.BAK/server 3bc8668 Joey Hess: cache geoip lookups |
03:37
🔗
|
GitHub19/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to server: http://git.io/vekAO |
03:39
🔗
|
closure |
only 1193 files remain that have not been downloaded at least once |
03:39
🔗
|
closure |
those may all be dark at this point? |
03:45
🔗
|
closure |
underscor: I didn't know about printdisabled collection.. |
03:47
🔗
|
closure |
I guess it's unlikely items in it will change.. tending toward deleting those from the repo |
03:52
🔗
|
tpw_rules |
will it finish the git-annex if it can't get some because they have been darked? |
03:53
🔗
|
closure |
it will finish, just will say some files failed to download |
03:53
🔗
|
closure |
usenet-alt/alt.sex.pictures.mbox.zip |
03:53
🔗
|
closure |
hah that's one file we've not gotten yet |
03:53
🔗
|
closure |
that is not dark |
03:53
🔗
|
|
closure watches for that file to suddenly jump to 10 copies |
03:58
🔗
|
tpw_rules |
i wonder what my disk bottleneck is like running 16 processes at once |
04:11
🔗
|
pikhq |
Hrm, y'know, I've been idling here long enough. |
04:12
🔗
|
pikhq |
I should actually go ahead and get a clone of the annex going. :) |
04:13
🔗
|
|
pikhq would like to request being on the permitted hosts to access the repo |
04:13
🔗
|
closure |
pikhq: I'll add you, paste me the key |
04:14
🔗
|
pikhq |
Thanks. |
04:15
🔗
|
GitHub103/#internetarchive.bak |
IA.BAK/pubkey 6918741 Joey Hess: add pikhq |
04:15
🔗
|
GitHub103/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to pubkey: http://git.io/veIJa |
04:16
🔗
|
db48x |
that red patch is getting delightfully small |
04:21
🔗
|
pikhq |
Hooray, "downloading from the Internet Archive". |
04:38
🔗
|
GitHub100/#internetarchive.bak |
IA.BAK/server 546181a Joey Hess: typo |
04:38
🔗
|
GitHub100/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veIkC |
04:43
🔗
|
GitHub130/#internetarchive.bak |
IA.BAK/server 74438bc Joey Hess: fix su call params |
04:43
🔗
|
GitHub130/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veIIT |
04:54
🔗
|
GitHub138/#internetarchive.bak |
IA.BAK/server 0a5354e Joey Hess: support for more shards |
04:54
🔗
|
GitHub138/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veILH |
05:01
🔗
|
db48x |
I've been thinking about the stats page |
05:01
🔗
|
db48x |
I think we need time-series data, to show the rate of progress |
05:03
🔗
|
db48x |
we could use Graphite, and feed the existing stats into it whenever they're updated |
05:13
🔗
|
db48x |
a simpler improvement would be to add the item count to the tooltips in the treemap: |
05:13
🔗
|
db48x |
generateTooltip: function (row) { return "Of all the items in this shard, "+ data.getValue(row, 2) +" have "+ data.getValue(row, 3) +" known copies."; } |
05:20
🔗
|
closure |
db48x: I have some data files that I think you could use |
05:20
🔗
|
closure |
http://iabak.archiveteam.org/stats/SHARD1.filestransferred |
05:20
🔗
|
closure |
http://iabak.archiveteam.org/stats/SHARD1.clientconnsperhour |
05:21
🔗
|
closure |
please come up with a visualization |
05:30
🔗
|
db48x |
I can't do a visualization, but I can do a chart |
05:37
🔗
|
SketchCow |
I think he wants me to do it. |
05:37
🔗
|
SketchCow |
Well, so, my stats page stuff is good for one-offs |
05:38
🔗
|
SketchCow |
Also, I wanted someone else to do visualization, but nobody was stepping forward, so I started it. |
05:39
🔗
|
SketchCow |
Also, looks like we're just on the edge of having zero IA-only items in the shard. |
05:40
🔗
|
GitHub181/#internetarchive.bak |
IA.BAK/pubkey 92f416f Joey Hess: copy SHARD1 pubkeys to SHARD2 |
05:40
🔗
|
GitHub181/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to pubkey: http://git.io/veIGg |
05:41
🔗
|
db48x |
SketchCow: that generateTooltip line above could be added to the options that you pass to tree.draw |
05:46
🔗
|
db48x |
closure: assuming we had a graphite server, something like this would push the current shardstats into it: |
05:46
🔗
|
db48x |
while read mult count; do |
05:46
🔗
|
db48x |
echo "something.something.numcopies.${mult}.${SHARD} ${count} ${DATE}" | nc ${GRAPHITE_HOST} ${GRAPHITE_PORT} |
05:46
🔗
|
db48x |
done < <(grep 'numcopies +' "/var/www/html/stats/$SHARD"); |
05:46
🔗
|
pikhq |
Apparently I picked a halfway-decent time to start grabbing stuff: when there was a point in fetching new things. :) |
05:47
🔗
|
db48x |
should be modified slightly to feed it all of the stats in a single TCP connection |
05:47
🔗
|
db48x |
and we've got to name the stats fairly well, and pick storage policies for them, and so on |
05:48
🔗
|
db48x |
but then Graphite puts them into a time-series database (Whisper), and makes querying that via HTTP really easy |
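db48x's point about feeding all of the stats through a single TCP connection could be sketched roughly as follows. This is an illustration, not the code from his commit: the function name `graphite_lines` and the metric prefix `iabak.stats` are made up, and the input is assumed to be already reduced to `<mult> <count>` pairs, one per line. Each output line follows Graphite's plaintext format of metric path, value, and Unix timestamp.

```sh
# Hypothetical helper: read "<mult> <count>" pairs on stdin and print one
# Graphite plaintext line per pair: "<metric path> <value> <timestamp>".
graphite_lines() {
    shard=$1
    now=$2
    while read -r mult count; do
        # "iabak.stats" is a placeholder prefix, not an agreed naming scheme
        printf 'iabak.stats.numcopies.%s.%s %s %s\n' "$mult" "$shard" "$count" "$now"
    done
}

# Usage: format everything first, then open one connection for all lines.
#   some-command-printing-pairs \
#     | graphite_lines "$SHARD" "$(date +%s)" \
#     | nc "$GRAPHITE_HOST" "$GRAPHITE_PORT"
```

Because `nc` sits at the end of the pipeline rather than inside the loop, it is started once per run instead of once per statistic.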
05:49
🔗
|
GitHub111/#internetarchive.bak |
IA.BAK/server 4e1486f Joey Hess: typo |
05:49
🔗
|
GitHub111/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veInB |
05:50
🔗
|
GitHub173/#internetarchive.bak |
IA.BAK/server 3a1b03a Joey Hess: typo |
05:50
🔗
|
GitHub173/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veInD |
05:52
🔗
|
GitHub89/#internetarchive.bak |
IA.BAK/server e04331c Joey Hess: fix cd |
05:52
🔗
|
GitHub89/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veIcv |
05:53
🔗
|
GitHub49/#internetarchive.bak |
IA.BAK/server d3bab05 Joey Hess: remove dirname complication, not needed |
05:53
🔗
|
GitHub49/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veIc8 |
05:54
🔗
|
GitHub156/#internetarchive.bak |
IA.BAK/server 41a6f38 Joey Hess: typo |
05:54
🔗
|
GitHub156/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veIcu |
06:01
🔗
|
SketchCow |
We're going to break the zero-IA barrier |
06:06
🔗
|
GitHub61/#internetarchive.bak |
IA.BAK/master 229385f Joey Hess: initial support for multiple shards |
06:06
🔗
|
GitHub61/#internetarchive.bak |
IA.BAK/master e819d37 Joey Hess: typo |
06:06
🔗
|
GitHub61/#internetarchive.bak |
IA.BAK/master e9adea4 Joey Hess: fix set -- lines |
06:06
🔗
|
GitHub27/#internetarchive.bak |
IA.BAK/server 25d0680 Joey Hess: few fixed to shard user setup |
06:06
🔗
|
GitHub27/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veIlU |
06:06
🔗
|
GitHub61/#internetarchive.bak |
[IA.BAK] joeyh pushed 4 new commits to master: http://git.io/veIlT |
06:07
🔗
|
closure |
so, I've set up SHARD2. If someone wants to jump the gun, see the README for how to add another shard to your system |
06:07
🔗
|
closure |
SHARD2 is 3.2 tb IIRC |
06:07
🔗
|
closure |
it contains: NISTJournalofResearch, 1880_census, speedydeletionwiki |
06:11
🔗
|
GitHub159/#internetarchive.bak |
IA.BAK/server a5325ee Joey Hess: new script |
06:11
🔗
|
GitHub159/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veI8p |
06:14
🔗
|
SketchCow |
I'd like to do some tests with SHARD1 when it's synced |
06:14
🔗
|
closure |
yes, absolutely |
06:15
🔗
|
closure |
quite a few things to work out.. need to test a restore of some files. |
06:15
🔗
|
closure |
and, need to get the fscking working and make sure it notices when a client drops out, and deals with that |
06:17
🔗
|
closure |
I just added a repolist file to the repo, that lists the shards. I'm thinking they can be in one of a few states |
06:17
🔗
|
closure |
active: how SHARD1 is now, being filled in |
06:17
🔗
|
closure |
pending: like SHARD2, not recommended to use yet |
06:18
🔗
|
closure |
maintenance: just sitting there and being checked periodically |
06:18
🔗
|
closure |
if a client falls out, a shard may change from maintenance back to active, so it'll get more clients to fix it back up |
06:20
🔗
|
closure |
oh, and also restore: clients should try to upload files from this shard that are marked as not being present in the IA any longer |
06:22
🔗
|
SketchCow |
I think one thing is I'd like a random client to delete 50-60gb of material |
06:24
🔗
|
underscor |
closure: ./iabak-helper: 54: ./iabak-helper: numfmt: not found |
06:25
🔗
|
closure |
and/or a client to quietly vanish and never be heard from again |
06:25
🔗
|
closure |
(for all we know, this has already happened) |
06:27
🔗
|
GitHub195/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veIE5 |
06:27
🔗
|
GitHub195/#internetarchive.bak |
IA.BAK/master 1f8be49 Joey Hess: if numfmt is not available (old coreutils), skip the diskreserve sanity check |
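The commit above skips the diskreserve sanity check when `numfmt` is missing. The actual change is not shown in the log; a minimal sketch of that kind of guard, with a hypothetical function name and assuming the reserve is given in IEC units such as `1K` or `10G`, might look like this:

```sh
# Hypothetical guard: convert a size like "10G" to bytes when numfmt exists,
# otherwise print nothing and let the caller skip the check.
parse_size_or_skip() {
    if command -v numfmt >/dev/null 2>&1; then
        numfmt --from=iec "$1"
    else
        echo "numfmt not found (old coreutils?); skipping size check" >&2
        return 0
    fi
}

# Usage:
#   bytes=$(parse_size_or_skip 10G)
#   [ -n "$bytes" ] && echo "reserve is $bytes bytes"
```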
06:46
🔗
|
closure |
mappy |
06:58
🔗
|
SketchCow |
http://iabackup.archiveteam.org/ia.bak/ now has US |
06:58
🔗
|
SketchCow |
It loads in a little weird and slow for me, but I'll leave it for now. |
07:07
🔗
|
db48x |
closure: syncing now needs a password; was that intentional? |
07:16
🔗
|
yipdw |
that US map is freaky accurate |
07:16
🔗
|
yipdw |
SketchCow: is that all geolocation data? |
07:17
🔗
|
SketchCow |
yes, it's built off the geoip utility |
07:17
🔗
|
yipdw |
"freaky accurate" meaning "I can see myself in that map" |
07:17
🔗
|
DFJustin |
ooh lithuania get |
07:17
🔗
|
SketchCow |
https://freegeoip.net |
07:17
🔗
|
SketchCow |
So, you can go and do: |
07:17
🔗
|
SketchCow |
freegeoip.net/csv/8.8.8.8 |
07:17
🔗
|
SketchCow |
freegeoip.net/json/github.com |
07:18
🔗
|
SketchCow |
And it'll do things in that format. |
07:18
🔗
|
SketchCow |
As you can see, you can use hostnames or IPs. |
07:18
🔗
|
SketchCow |
You get a limit of 10,000 free requests a day. |
07:18
🔗
|
yipdw |
now I'm wondering who the others in Chicagoland are |
07:19
🔗
|
SketchCow |
They're not. |
07:19
🔗
|
SketchCow |
I knew it, knew I'd have to go ahead and get that right. |
07:19
🔗
|
SketchCow |
Ach, coding |
07:20
🔗
|
SketchCow |
I'm using cities |
07:20
🔗
|
SketchCow |
Just cities, no states. Goog tries to do its best. |
07:20
🔗
|
SketchCow |
I have to go in and make that different now. |
07:20
🔗
|
yipdw |
ahh |
07:20
🔗
|
SketchCow |
For example, it's Deerfield, NH. Not Deerfield, IL |
07:25
🔗
|
yipdw |
dang and I was getting excited that there were others around here |
07:29
🔗
|
db48x |
:) |
07:29
🔗
|
closure |
db48x: shouldn't need password. Seems to work ok from here. |
07:29
🔗
|
SketchCow |
OK, so. |
07:29
🔗
|
SketchCow |
I redid it. |
07:29
🔗
|
SketchCow |
It uses zip code instead of city. |
07:30
🔗
|
SketchCow |
You can see now there ARE two clients in chicagoland |
07:32
🔗
|
db48x |
closure: hrm, it's only broken on one machine... |
07:33
🔗
|
|
yipdw has changed the topic to: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK | Stats: http://iabackup.archiveteam.org/ia.bak/ | #archiveteam |
07:33
🔗
|
closure |
db48x: must be a ssh key issue |
07:37
🔗
|
db48x |
repo has the key in it... |
07:38
🔗
|
GitHub94/#internetarchive.bak |
IA.BAK/master 28755e9 Joey Hess: disable auto gc by default... |
07:38
🔗
|
GitHub94/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veI1K |
07:40
🔗
|
underscor |
closure: does shard2 require a different ssh key adding location? |
07:40
🔗
|
underscor |
or will it use the same user as shard1? |
07:40
🔗
|
underscor |
(I suppose I could read the scripts lol) |
07:41
🔗
|
db48x |
different directory on the pubkeys branch |
07:42
🔗
|
SketchCow |
Added information about the project at the bottom of the graph page. |
07:44
🔗
|
db48x |
SketchCow: s/which/wish/ |
07:44
🔗
|
SketchCow |
it was always perfect and that never happened |
07:45
🔗
|
db48x |
:) |
07:48
🔗
|
|
bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak |
07:49
🔗
|
SketchCow |
closure: I upped my side with the graph to 10 minutes. |
07:49
🔗
|
SketchCow |
Simply because we're starting to see real progress |
07:50
🔗
|
|
bzc6p has quit (Read error: Operation timed out) |
07:50
🔗
|
SketchCow |
Also: If in the future we have a country with a ton of clients (I guess Germany is getting there) it is not difficult to do additional ones. |
08:48
🔗
|
db48x |
closure: it's untested, but if you want to take a look, https://github.com/db48x/IA.BAK/commit/c8ea0e2d8c70bda9a117d03d933c179cf3dd265c has code for sending the stats to graphite |
09:02
🔗
|
SketchCow |
Very minor addition: My graph now looks at numbers higher than 3 and just smooshes them in |
09:02
🔗
|
SketchCow |
Apparently some files/items are in six places now, meaning a total of 8! |
09:07
🔗
|
|
zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak |
09:59
🔗
|
|
zottelbe- (~zottelbey@[redacted]) has joined #internetarchive.bak |
10:00
🔗
|
|
zottelbey has quit (Ping timeout: 512 seconds) |
10:11
🔗
|
|
zottelbe- is now known as zottelbey |
10:53
🔗
|
|
db48x has quit (Ping timeout: 258 seconds) |
12:45
🔗
|
hater |
underscor: ./iabak-helper: numfmt: not found what's your OS? |
13:39
🔗
|
|
closure looks at beautiful docker-style multi progress bars |
13:39
🔗
|
closure |
oh yesss |
13:40
🔗
|
closure |
Working 85% [========================================= ] 17/ 20 (for 1.7, 0.3 remaining) |
13:40
🔗
|
closure |
Working 20% [========== ] 4/ 20 (for 1.6, 6.4 remaining) |
13:40
🔗
|
closure |
Working 40% [==================== ] 8/ 20 (for 1.6, 2.4 remaining) |
13:40
🔗
|
closure |
guy who wrote the library explicitly wants to support git-annex |
13:41
🔗
|
hater |
nice |
13:42
🔗
|
hater |
when is the patch ready? :D |
13:42
🔗
|
hater |
want to try it myself |
13:42
🔗
|
closure |
well, I still have to write all the code to use the library, and he has to fix at least one of the 5 bugs I've filed on it this morning |
13:50
🔗
|
zottelbey |
looking forward to it! |
14:20
🔗
|
|
bzc6p_ is now known as bzc6p |
14:25
🔗
|
trs80 |
what's the total size of shard1? |
14:27
🔗
|
zottelbey |
http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation - 2.91TB |
14:33
🔗
|
trs80 |
ah, for some reason I thought it was 1.9TB |
14:33
🔗
|
trs80 |
still got another TB to go |
14:36
🔗
|
zottelbey |
im at a measly .06TB -.- |
14:36
🔗
|
hater |
trs80: how long did it take to download 1.9TB? |
14:37
🔗
|
hater |
closure: does it even make sense to implement a multi-support into git-annex and not into the wrapper-script? |
14:38
🔗
|
trs80 |
it's been about 36 hours with 10 instances of iabak |
14:38
🔗
|
trs80 |
at about 150Mb/s |
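As a rough cross-check of those figures, and assuming "Mb/s" means megabits per second and decimal units throughout, the volume moved at a sustained rate can be worked out with shell arithmetic. The function name is made up for this sketch.

```sh
# gb_transferred <megabits_per_second> <hours>
# Decimal gigabytes moved at a sustained rate (integer math, rounded down).
gb_transferred() {
    echo $(( $1 * $2 * 3600 / 8 / 1000 ))
}

gb_transferred 150 36   # prints 2430
```

A constant 150 Mb/s for 36 hours would come to about 2.4 TB, so if roughly 1.9 TB had been fetched at that point, the average rate over the whole period was presumably somewhat below the quoted figure.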
14:40
🔗
|
|
trs80 fires up another 10 |
15:14
🔗
|
SketchCow |
http://iabackup.archiveteam.org/ia.bak |
15:14
🔗
|
SketchCow |
So, I'm noticing that +0 is at 22. I'm intrigued that it's still sticking around, even after hours. |
15:14
🔗
|
SketchCow |
Maybe they're really, really huge? Or something else? |
15:15
🔗
|
closure |
there are some 403s |
15:15
🔗
|
SketchCow |
OK. So 22 ones that turn out to be private. |
15:16
🔗
|
closure |
not clear what's going on, I just ran git annex get --not --copies 4 again, and it managed to download 1 more file |
15:16
🔗
|
closure |
that had failed earlier |
15:41
🔗
|
sep332 |
tonight I'm going to move this drive to a different computer |
15:41
🔗
|
sep332 |
so if you want to see what it looks like when a node disappears... |
15:46
🔗
|
SketchCow |
And we do |
16:41
🔗
|
trs80 |
going from 10->20 iabaks has hit diminishing returns, was getting 150Mbps and now 200Mbps |
16:45
🔗
|
zottelbey |
oh well. i get 1.2-2MB/s with 8. |
16:47
🔗
|
closure |
look at that red evaporate |
16:48
🔗
|
sep332 |
Will records of this copy just expire after failing to sync for a while? or do i have to remove the data and then sync again? |
16:52
🔗
|
SketchCow |
closure: I told you graphs were important! |
16:52
🔗
|
SketchCow |
It really helps get a grip on what's there. |
17:02
🔗
|
closure |
sep332: the distributed fsck stuff that will let us notice when a copy drops out is implemented in git-annex, but the scripts need to be made to run it periodically |
17:07
🔗
|
|
sankin1 (~sankin@[redacted]) has joined #internetarchive.bak |
17:12
🔗
|
|
londoncal (~londoncal@[redacted]) has joined #internetarchive.bak |
17:13
🔗
|
sep332 |
So when another node runs fsck --distributed, it will notice that my node hasn't checked in in a while? |
17:17
🔗
|
closure |
we run fsck --expire on the server, and it notices |
17:17
🔗
|
closure |
the nodes run fsck --distributed to avoid being expired |
17:19
🔗
|
sep332 |
ok |
17:22
🔗
|
SketchCow |
closure - is --expire run automatically? |
17:23
🔗
|
SketchCow |
(or is this a feature you're adding?) |
17:23
🔗
|
closure |
we can run --expire in a cron job |
17:23
🔗
|
closure |
on the server, so easy |
17:23
🔗
|
closure |
running fsck --distributed on the clients periodically is harder |
17:24
🔗
|
closure |
means the clients have to run iabak from time to time even once they're full |
17:24
🔗
|
closure |
and users have to remember, or it be automated somehow, etc |
17:24
🔗
|
SketchCow |
I am assuming that. |
17:24
🔗
|
SketchCow |
Like, I'm assuming the client will do that. |
17:24
🔗
|
SketchCow |
My estimate is once every two weeks. |
17:25
🔗
|
SketchCow |
And it should be automated, and our server should mail the contact to go "we haven't heard from you in X days" |
17:25
🔗
|
closure |
something on the order of a month, 2 weeks might work. It will increase the git repo size over time some |
17:25
🔗
|
SketchCow |
By how much |
17:25
🔗
|
SketchCow |
A few k, right |
17:25
🔗
|
closure |
one of the things I am going to check with shard1 |
17:25
🔗
|
closure |
dunno exactly |
17:26
🔗
|
closure |
so the email contact.. I was thinking about that too, and it's worth it, but we need registration then |
17:27
🔗
|
SketchCow |
Agreed. |
17:28
🔗
|
pikhq |
Email's especially going to be worth it if anyone's using mostly offline disks for this. |
17:28
🔗
|
SketchCow |
Well, without communication, or a way to mail out updates, it won't scale anyway. |
17:28
🔗
|
SketchCow |
I mean, we can do 30 people relatively well. |
17:28
🔗
|
pikhq |
e.g. have a spare external drive that they're shoving bits on rather than spare space on a system they use. |
17:28
🔗
|
SketchCow |
But we haven't figured out how many people this is going to be. Possibly over 5000. |
17:28
🔗
|
closure |
pikhq: I agree, and we can also give people we trust longer offline between checkins before they get expired |
17:29
🔗
|
closure |
SketchCow: 25000, back of the napkin estimate |
17:30
🔗
|
closure |
we have 1 shard almost done with around 15 people |
17:31
🔗
|
SketchCow |
25,000 assumes people don't take up roles. |
17:31
🔗
|
SketchCow |
You mean 25,000 "Shard Roles" |
17:32
🔗
|
SketchCow |
We don't want the same person taking two copies of the same shard, but we want people taking 5-50 shards where possible. |
17:33
🔗
|
SketchCow |
Tell me when I'm wrong. |
17:33
🔗
|
closure |
although.. this shard is probably smaller than average disk size. average shard is probably 11 tb. |
17:34
🔗
|
closure |
SketchCow: if people take 5-50 shards (which does make sense), they are probably not storing a full tb of each shard, which some of the people are for SHARD1. |
17:35
🔗
|
SketchCow |
So, two obvious thoughts I will probably restate multiple times |
17:36
🔗
|
SketchCow |
1. The amount this project backs up will always be a fraction of the Archive's stores, simply because of the realistic situation of not sharing non-public items and the fact that some of the contents are warmed over horseshit |
17:36
🔗
|
closure |
20 petabytes is 42000 x 500 gb, so 25000 people seems ballpark |
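The 42000 figure appears to come from binary prefixes (1 PB = 1024 × 1024 GB); with decimal prefixes the same division gives 40000. A quick sketch of the arithmetic, using a hypothetical helper name:

```sh
# chunks_needed <petabytes> <chunk_size_gb>
# Number of chunk-sized pieces in the total, using binary prefixes
# (1 PB = 1024 * 1024 GB); integer division, rounded down.
chunks_needed() {
    echo $(( $1 * 1024 * 1024 / $2 ))
}

chunks_needed 20 500   # prints 41943
```

The log does not spell out the step from about 42000 chunks of 500 GB to 25000 people, so that part remains closure's back-of-the-napkin estimate.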
17:36
🔗
|
SketchCow |
2. The project will "succeed" when it gains enough traction that various companies donate disk server space to the project, real space. Real REAL space. |
17:37
🔗
|
SketchCow |
Whether we stay down in Ham Radio Club-level adoption or SETI-level is unclear. |
17:37
🔗
|
SketchCow |
What this IS doing is making git-annex better and it's making IA ask very important questions it needed to ask 10 years ago. |
18:09
🔗
|
|
wp494_ (~wickedpla@[redacted]) has joined #internetarchive.bak |
18:12
🔗
|
|
wp494 has quit (Ping timeout: 740 seconds) |
18:13
🔗
|
|
wp494_ is now known as wp494 |
18:15
🔗
|
|
beardicus (~beardicus@[redacted]) has joined #internetarchive.bak |
18:17
🔗
|
|
wp494 has quit (LOUD UNNECESSARY QUIT MESSAGES) |
18:20
🔗
|
beardicus |
ahoy! |
18:21
🔗
|
beardicus |
who shall receive my glorious public key? |
18:21
🔗
|
|
wp494 (~wickedpla@[redacted]) has joined #internetarchive.bak |
18:23
🔗
|
closure |
me |
18:23
🔗
|
closure |
bestow it upon me |
18:23
🔗
|
beardicus |
get ready!!!!! |
18:23
🔗
|
beardicus |
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDCGLT7D7/XP2ZhTF0sEBU1L1cOWXWeEAGNa/P4OvjZfaaA4Sew6V9jTvK8fL3bX |
18:23
🔗
|
beardicus |
H/z9dcSHZ4mpLj6+7RvSo2F8jpxc9ggO/dVu6bxtnicsP6Yha6yeFwu6V7n5zDOQ57YSRzo8OR3tfzumH6Gg08nzdmSkmjMrOFk34 |
18:23
🔗
|
beardicus |
K32sXINKkmOL9ekJneBkIx7yQ1buFXXMQl57JsKD3QE1kGWM1oWidMsNF8Q3WJ1mmE6yS3Pa489WW4B8frgfYU4UeA0UmlPBCH0cR |
18:23
🔗
|
beardicus |
f3M9jKkN4ET8Q5zfeI42J6ZvO6cr2INKwDKUA1GFOk8zDeWE4JAMv9GNVqShKbv3KsbqMZEo5bclX bert@laslo |
18:23
🔗
|
beardicus |
doh. |
18:24
🔗
|
beardicus |
can pastebin that if you prefer. newb mistake. |
18:24
🔗
|
closure |
one line plz. /msg is ok |
18:27
🔗
|
GitHub155/#internetarchive.bak |
IA.BAK/pubkey 0cb8d91 Joey Hess: I am beardicus! |
18:27
🔗
|
GitHub155/#internetarchive.bak |
[IA.BAK] joeyh pushed 1 new commit to pubkey: http://git.io/vemK3 |
18:27
🔗
|
closure |
yer in |
18:27
🔗
|
beardicus |
thanky hanky. |
18:29
🔗
|
beardicus |
no dice closure. patience required or mis-paste? |
18:31
🔗
|
closure |
seems I need to fix something on the server |
18:31
🔗
|
closure |
key didn't deploy |
18:31
🔗
|
closure |
try it now |
18:32
🔗
|
beardicus |
success. |
18:32
🔗
|
beardicus |
or wait... |
18:32
🔗
|
closure |
hmm, all I did was hit the webhook. Perhaps github is being slow.. |
18:32
🔗
|
beardicus |
it's asking me for SHARD1's password. |
18:32
🔗
|
beardicus |
so not so much. |
18:34
🔗
|
beardicus |
closure: http://pastebin.com/Mh7yjGKV |
18:35
🔗
|
closure |
so, it shouldn't be prompting for a password, ssh is being run with -o BatchMode=yes |
18:36
🔗
|
closure |
and, it managed to connect to the server once with the key |
18:36
🔗
|
closure |
interesting. Maybe we need to enable batchmode all the time, not just in the initial probe |
18:37
🔗
|
closure |
muxserver_listen: link mux listener .git/annex/ssh/SHARD1@iabak.archiveteam.org.QK8zOCbtNebI7q54 => .git/annex/ssh/SHARD1@iabak.archiveteam.org: Operation not permitted |
18:37
🔗
|
closure |
I have never seen ssh say that before |
18:38
🔗
|
closure |
are you perhaps using this on a filesystem that does not support unix sockets? |
18:38
🔗
|
beardicus |
heh. it's on an nfs share. :( |
18:38
🔗
|
closure |
try this: cd shard1; git config annex.sshcaching false |
18:38
🔗
|
closure |
and then re-run ./iabak |
18:39
🔗
|
beardicus |
that got rid of the muxserver error, but the password prompt remains. |
18:40
🔗
|
Kazzy |
beardicus: check the permissions of id_rsa in the .git/annex/ directory |
18:40
🔗
|
closure |
yeah, that sounds wise |
18:40
🔗
|
Kazzy |
i think it's possible that due ot being a nnfs share, permissions screw up, and ssh will (for some reason silently) refuse to use the key |
18:40
🔗
|
Kazzy |
to*, nfs* |
18:41
🔗
|
closure |
maybe try: ssh -i shard1/.git/annex/id_rsa -v SHARD1@iabak.archiveteam.org |
18:41
🔗
|
closure |
and see why ssh is refusing to use that id_rsa |
18:42
🔗
|
yipdw |
-o IdentitiesOnly=yes might also be useful if this is a recent Ubuntu system and you have many SSH keys |
18:42
🔗
|
beardicus |
id_rsa is rw user only. it is user "nobody" though. so that's probably it. |
18:43
🔗
|
closure |
beardicus: also, I'm curious about the fifo problem. git-annex probes to see if the filesystem supports them. but its probe seems to have missed this case |
18:44
🔗
|
closure |
I wonder if you could try, on the nfs share: mkfifo foo, and then if that succeeds, see if you can ln foo bar |
18:44
🔗
|
closure |
beardicus: are you running this as nobody? |
18:44
🔗
|
beardicus |
i am not. |
18:45
🔗
|
closure |
so, nfs ate owner?! |
18:45
🔗
|
|
closure remembers there was a reason he stopped using nfs in 1996 |
18:45
🔗
|
beardicus |
yeah. nfs is a black hole for me. can't think of a better way to permanently mount my nas on my linux box. |
18:46
🔗
|
closure |
well, you can run git-annex and iabak on some nas's :) |
18:46
🔗
|
closure |
but I would like to investigate the nfs problems a bit more |
18:46
🔗
|
beardicus |
yeah. thought about it. i think it's x86-based. |
18:47
🔗
|
closure |
it creates IA.BAK/id_rsa and then copies it to IA.BAK/shard1/.git/annex/id_rsa .. is only the copy owned by the wrong user |
18:47
🔗
|
closure |
? |
18:49
🔗
|
beardicus |
everything on the mount ends up as nobody/nogroup |
18:49
🔗
|
beardicus |
could not hard link that fifo closure |
18:50
🔗
|
closure |
hmm, it was able to use the id_rsa once |
18:50
🔗
|
closure |
so you were able to create the fifo? |
18:50
🔗
|
beardicus |
yes. |
18:50
🔗
|
closure |
cool.. I guessed right about the problem with the fifo! |
18:50
🔗
|
closure |
will fix that |
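The manual test closure asked for (create a fifo, then try to hard link it) can be wrapped in a small probe. This is only a sketch of the check described in the conversation, not git-annex's own probe, and the function name is hypothetical.

```sh
# Hypothetical probe: succeed only if a fifo can be created in the given
# directory AND hard linked there, mirroring "mkfifo foo; ln foo bar".
fs_can_hardlink_fifo() {
    dir=$1
    probe="$dir/.fifo-probe.$$"
    rm -f "$probe" "$probe.link"
    mkfifo "$probe" 2>/dev/null || return 1
    if ln "$probe" "$probe.link" 2>/dev/null; then
        rc=0
    else
        rc=1
    fi
    rm -f "$probe" "$probe.link"   # always clean up the probe files
    return "$rc"
}

# Usage:
#   fs_can_hardlink_fifo /path/to/repo/.git/annex || echo "no fifo hard links here"
```

On beardicus's NFS mount the first step succeeded and the second failed, which is the case a probe that only runs `mkfifo` would miss.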
18:53
🔗
|
beardicus |
guess i need to figure out all this nfs permissions baloney before i proceed. Going to try running it on the nas. |
18:54
🔗
|
closure |
actually, you can hack around it |
18:54
🔗
|
closure |
just do this: copy id_rsa from the nfs to your home directory someplace |
18:54
🔗
|
closure |
and then, edit IA.BAK/shard1/.git/config |
18:55
🔗
|
closure |
there's a line that has the path to the id_rsa file, just update that |
18:55
🔗
|
closure |
we could even automate this |
18:55
🔗
|
closure |
check if the file is owned by whoami and if not, put it in $HOME/ |
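The automation closure describes could look something like the sketch below. It is an assumption-laden illustration, not code from the IA.BAK scripts: the function name and the fallback location `$HOME/.iabak_id_rsa` are made up, and ownership is compared by numeric uid so that it still works when the owner shows up as "nobody".

```sh
# Hypothetical helper: print a key path that ssh is likely to accept.
# If the key is owned by the current user, use it as-is; otherwise copy it
# into $HOME with tight permissions and print the copy's path.
usable_key_path() {
    key=$1
    fallback=${2:-$HOME/.iabak_id_rsa}   # placeholder location
    owner_uid=$(ls -ldn "$key" | awk '{print $3}')
    if [ "$owner_uid" = "$(id -u)" ]; then
        printf '%s\n' "$key"
    else
        cp "$key" "$fallback" && chmod 600 "$fallback" && printf '%s\n' "$fallback"
    fi
}

# Usage, run inside the shard directory (prefer an absolute key path):
#   git config remote.origin.annex-ssh-options \
#     "-i $(usable_key_path "$PWD/.git/annex/id_rsa")"
```

Whether this alone cures the password prompt is not settled by the log; beardicus still saw the prompt after moving the key by hand.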
18:58
🔗
|
beardicus |
no dice. |
18:58
🔗
|
beardicus |
copied id_rsa to ~/.ssh/at_id_rsa |
18:58
🔗
|
beardicus |
permissions are rw only for user, owner is current user |
18:59
🔗
|
beardicus |
edited shard1/.git/config |
18:59
🔗
|
beardicus |
"annex-ssh-options = -i ~/.ssh/at_id_rsa" |
18:59
🔗
|
closure |
not sure if ~ works in there |
19:00
🔗
|
beardicus |
tried it with a full path. same password prompt. |
19:02
🔗
|
closure |
is group still nobody? |
19:02
🔗
|
beardicus |
nope. |
19:06
🔗
|
beardicus |
poking around on the nas... busybox `find` command chokes on the -printf option. |
19:06
🔗
|
beardicus |
this is used in sharddirs="$(find . -maxdepth 1 -name shard\* -type d -printf "%P\n")" |
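The log does not record which workaround beardicus ended up using. One possible substitute, sketched here under the assumption that the local `find` supports `-maxdepth`, `-name` and `-type` but not `-printf`, is to let `sed` strip the leading `./` that `%P` would have removed. The function name is hypothetical.

```sh
# Hypothetical replacement for: find . ... -printf "%P\n"
# Lists shard directories in the current directory, one name per line,
# without the leading "./".
list_sharddirs() {
    find . -maxdepth 1 -name 'shard*' -type d | sed 's|^\./||'
}

# Usage:
#   sharddirs="$(list_sharddirs)"
```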
19:07
🔗
|
closure |
I think you'll have better luck debugging ssh |
19:08
🔗
|
|
db48x (~user@[redacted]) has joined #internetarchive.bak |
19:08
🔗
|
|
svchfoo2 gives channel operator status to db48x |
19:10
🔗
|
SketchCow |
Shift-reload on the graph page shows the blocks getting smaller and the 4 block getting bigger. |
19:13
🔗
|
|
db48x wonders why his computer was off |
19:15
🔗
|
Kazzy |
beardicus: does .git/config have references to crippled mode? |
19:16
🔗
|
beardicus |
Kazzy no. |
19:21
🔗
|
|
londoncal has quit (Quit: Leaving...) |
19:28
🔗
|
|
matthusb- (~matthusby@[redacted]) has joined #internetarchive.bak |
19:33
🔗
|
beardicus |
working ok on a synology nas after working around that printf issue. |
19:33
🔗
|
|
Nemo_bis (~Nemo_bis@[redacted]) has joined #internetarchive.bak |
19:33
🔗
|
beardicus |
though i still had to fix permissions on the id_rsa file. |
19:36
🔗
|
|
kniffy (~kniffy@[redacted]) has joined #internetarchive.bak |
19:36
🔗
|
|
Nemo_bis (~Nemo_bis@[redacted]) has left #internetarchive.bak |
19:48
🔗
|
|
SN4T14__ (~SN4T14@[redacted]) has joined #internetarchive.bak |
19:56
🔗
|
|
SN4T14_ has quit (Ping timeout: 512 seconds) |
20:13
🔗
|
closure |
beardicus: awesome! |
20:13
🔗
|
closure |
beardicus: hey, you could open a pull request with the printf workaround |
20:13
🔗
|
beardicus |
not sure it went all the way. |
20:13
🔗
|
beardicus |
is shard1 only about 600mb? |
20:13
🔗
|
closure |
no, it's much bigger |
20:14
🔗
|
beardicus |
yeah. i got a message that said it's fully backed up, and then exit to shell. |
20:14
🔗
|
closure |
I think we're not done with SHARD1, so probably some problem there |
20:15
🔗
|
beardicus |
yeah. must've just downloaded the index. |
20:17
🔗
|
beardicus |
if i run iabak again, i get a bunch of 403s. |
20:18
🔗
|
closure |
well, the 403's are known |
20:18
🔗
|
beardicus |
ok. |
20:18
🔗
|
closure |
but, your system seems to not be noticing the files that remain to get |
20:18
🔗
|
beardicus |
i'll leave it going. |
20:19
🔗
|
beardicus |
i'll see if it starts downloading some stuff. |
20:19
🔗
|
closure |
do you have a /usr/bin/shuf? |
20:19
🔗
|
beardicus |
i don't, not on the nas. |
20:19
🔗
|
beardicus |
i wonder if busybox can do that and they just didn't link it. |
20:20
🔗
|
closure |
you don't need it, I was just wondering which branch it was running |
20:20
🔗
|
beardicus |
no shuf in busybox i guess. |
20:21
🔗
|
closure |
it sounds like it tried to download only the files that no other node has downloaded yet |
20:21
🔗
|
closure |
where instead, it's supposed to download files that are not on at least 3 other nodes |
20:21
🔗
|
closure |
try this: cd shard1 ; git annex find --not --copies 4 |
20:21
🔗
|
beardicus |
well it seemed to think all files were on four nodes. |
20:21
🔗
|
closure |
see if it finds any files |
20:22
🔗
|
closure |
seems so, but I don't think we're there yet are we? |
20:23
🔗
|
db48x |
git annex find --copies 2 --not --copies 3 | wc -l gives me 40383 |
20:23
🔗
|
beardicus |
this could be it "git: 'annex' is not a git command. See 'git --help'." |
20:24
🔗
|
closure |
beardicus: oh, I forgot to say, run IA.BAK/git-annex.linux/runshell first |
20:26
🔗
|
beardicus |
closure that command is finding some files. |
20:26
🔗
|
closure |
hmm, weird |
20:26
🔗
|
closure |
try git annex get --not --copies 4 |
20:26
🔗
|
closure |
that's what the script is supposed to run, in your case (with no shuf) |
20:27
🔗
|
beardicus |
yeah. that looks like the output i get when i rerun iabak now. |
20:27
🔗
|
beardicus |
so i should probably just let it run through all the 403s until it gets to the undone stuff. |
20:28
🔗
|
beardicus |
doesn't really explain why it gave up in the first place though. |
20:28
🔗
|
closure |
oh, I thought you said it got through them all and said it was done |
20:28
🔗
|
|
bzc6p has quit (Read error: Operation timed out) |
20:28
🔗
|
beardicus |
no, it downloaded what i assume is an index of some sort, and then claimed it was all done and exited back to shell. |
20:29
🔗
|
closure |
hmm |
20:30
🔗
|
closure |
what kind of claim it was done? |
20:30
🔗
|
|
bzc6p (~bzc6p@[redacted]) has joined #internetarchive.bak |
20:30
🔗
|
closure |
this is where real output pastes are useful |
20:30
🔗
|
closure |
"Wow! This shard of the IA is fully backed up now!" |
20:30
🔗
|
beardicus |
http://pastebin.com/dU9FQrrv |
20:32
🔗
|
closure |
yeah, looks like the git annex get step somehow didn't run at all |
20:32
🔗
|
closure |
I don't understand how |
20:32
🔗
|
closure |
if it's plowing through 403's now, that step is certainly running this time |
20:33
🔗
|
db48x |
heh, after a sync I see only 8k left |
20:34
🔗
|
beardicus |
not sure. gotta run to make dindin now. i'll try to do that pull request this evening closure. |
20:35
🔗
|
closure |
there's a lot of better homes and gardens and ladies home journal in there that is all 403 |
20:37
🔗
|
SketchCow |
usfederalcourts, man |
20:37
🔗
|
SketchCow |
Should have gone with that |
20:38
🔗
|
sep332 |
$git-annex fsck --distributed |
20:38
🔗
|
sep332 |
git-annex: unrecognized option `--distributed' |
20:38
🔗
|
closure |
sep332: it's in the daily builds, next release |
20:39
🔗
|
sep332 |
ok. would it be useful for me to get it and do that before moving this drive? |
20:39
🔗
|
closure |
no, you don't need to run it |
20:39
🔗
|
closure |
I mean, you can run it wherever the drive ends up, and we'll know the drive still has the files |
20:40
🔗
|
sep332 |
ok thanks |
20:40
🔗
|
closure |
or you can not run it, and we'll realize you are not keeping the files |
20:40
🔗
|
db48x |
which is one of the scenarios we need to test |
20:40
🔗
|
sep332 |
yeah i meant for testing |
20:54
🔗
|
|
sankin1 has quit (Leaving.) |
21:01
🔗
|
db48x |
uh oh, getting really long wait times on disk operations |
22:14
🔗
|
closure |
oh boy, I can make this distributed fsck a lot more efficient |
22:14
🔗
|
closure |
I think we could run fscks daily, or hourly if we wanted to, if I pull this off |
22:18
🔗
|
closure |
I think it will be around 128 bytes per fsck per client |
22:26
🔗
|
db48x |
closure: just read about your changes to git annex get |
22:26
🔗
|
db48x |
as I understand it, "git annex get dir1 dir2" needs to download dir1/file1, dir1/file2, dir1/file3, dir2/subdir1/file1, dir2/subdir1/file2, dir2/subdir2/file1 etc, and you want them to be in the same order as the directories were specified on the command line? |
22:27
🔗
|
closure |
yep |
22:27
🔗
|
closure |
SketchCow: I broke the graph. Bonus: No numcopies +0 left! |
22:27
🔗
|
db48x |
why even generate the list of files ahead of time? why not produce a lazy sequence of them instead? |
22:28
🔗
|
db48x |
you can then produce them in sorted order, and amortize the cost over the whole time you're downloading things |
22:28
🔗
|
closure |
git ls-files b c a will output a b c |
22:30
🔗
|
closure |
I'd love to stream the list lazily, but I cannot when it's not generated in the right order |
22:32
🔗
|
db48x |
hmm |
22:34
🔗
|
db48x |
still seems like you'd be able to call ls-files on each dir individually, whenever the list runs out |
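The idea in this message can be sketched in shell, purely as an illustration and not as how git-annex is implemented: calling `git ls-files` once per argument keeps the command-line order and lets output stream directory by directory. The function name `list_in_order` is hypothetical.

```shell
# Hypothetical helper: list tracked files under each argument,
# in the order the arguments were given.
# Example: list_in_order b c a
list_in_order() {
    for dir in "$@"; do
        # Each call only sorts within one directory, so the caller can
        # start consuming the first directory's files right away.
        git ls-files -- "$dir"
    done
}
```

A file that falls under two overlapping arguments would be listed twice, so a real implementation would need to handle that case.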
22:36
🔗
|
pikhq |
closure: Did you remove the numcopies 0 stuff, or did it just happen to finally get all retrieved? |
22:58
🔗
|
|
kyan (~kyan@[redacted]) has joined #internetarchive.bak |
23:28
🔗
|
db48x |
pikhq: git log says he removed the unavailable ones |
23:32
🔗
|
|
kniffy (~kniffy@[redacted]) has left #internetarchive.bak (Leaving) |