#internetarchive.bak 2016-11-19,Sat


Time Nickname Message
00:03 🔗 bwn has joined #internetarchive.bak
02:07 🔗 Start has quit IRC (Read error: Connection reset by peer)
02:08 🔗 Start_ has joined #internetarchive.bak
02:34 🔗 db48x heh, I've been letting iabak download some stuff to test the new code I wrote
02:34 🔗 db48x but I just realized that it's working on an item with ~650 files
02:35 🔗 db48x 80+ GB
02:37 🔗 db48x https://archive.org/details/13jany2014warcs
02:38 🔗 patrickod has quit IRC (Quit: ZNC - http://znc.in)
02:38 🔗 patrickod has joined #internetarchive.bak
02:40 🔗 patrickod has quit IRC (Client Quit)
02:40 🔗 patrickod has joined #internetarchive.bak
02:43 🔗 db48x` has joined #internetarchive.bak
02:45 🔗 db48x has quit IRC (Ping timeout: 255 seconds)
02:54 🔗 Start_ is now known as Start
04:28 🔗 db48x` oops, no wonder
04:28 🔗 db48x` I made a list of 5 items, then counted from 1 to 6 to download them
05:51 🔗 db48x` is now known as db48x
06:03 🔗 yipdw Kaz: yes
06:03 🔗 yipdw what's up
06:51 🔗 db48x yipdw: not sure what he was going to ask, but the stats aren't getting to graphite
06:52 🔗 yipdw hmm
06:52 🔗 db48x want to check it out?
06:52 🔗 db48x he fixed the cronjobs, which apparently had vanished
06:52 🔗 yipdw I can poke at it slowly over the next day or two
06:52 🔗 db48x ah :)
06:52 🔗 db48x well, I have a few hours before I sleep
06:56 🔗 db48x a lot of exceptions
07:00 🔗 db48x well, carbon is getting lots of connections
07:00 🔗 db48x 19/11/2016 02:00:34 :: MetricLineReceiver connection with 127.0.0.1:37932 established
07:00 🔗 db48x 19/11/2016 02:00:34 :: MetricLineReceiver connection with 127.0.0.1:37932 closed cleanly
07:00 🔗 db48x 19/11/2016 02:00:34 :: MetricLineReceiver connection with 127.0.0.1:37933 established
07:00 🔗 db48x 19/11/2016 02:00:34 :: MetricLineReceiver connection with 127.0.0.1:37933 closed cleanly
07:06 🔗 bwn has quit IRC (Ping timeout: 961 seconds)
07:06 🔗 db48x /var/lib/graphite/whisper/iabak/shardstats/connections/all.wsp has a very recent modification time
07:16 🔗 bwn has joined #internetarchive.bak
07:28 🔗 kyan has quit IRC (Quit: Leaving)
10:45 🔗 bwn has quit IRC (Ping timeout: 244 seconds)
10:56 🔗 atomotic has joined #internetarchive.bak
11:07 🔗 bwn has joined #internetarchive.bak
11:15 🔗 asktoomuc has joined #internetarchive.bak
12:14 🔗 Whopper has joined #internetarchive.bak
12:18 🔗 Optical has joined #internetarchive.bak
12:20 🔗 Optical I was wondering guys, as you are sort of an online group, have you tried reaching out to Seagate or other HDD companies to sponsor you for HDDs for the IA.BAK and other projects?
12:20 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
12:47 🔗 sevs has joined #internetarchive.bak
12:53 🔗 sevs db48x: you here?
13:08 🔗 atomotic has joined #internetarchive.bak
13:17 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
13:41 🔗 iabak-reg registrar master fd84ce3 other SHARD15/pubkeys registration of ronin_fight on SHARD15
13:53 🔗 iabak-reg registrar master 53598d0 other SHARD12/pubkeys registration of roninfight on SHARD12
13:57 🔗 asktoomuc even with NFS enabled (to test), this process is anything but straightforward. I'm now getting "Connection refused" errors to .git/annex/ssh/SHARD12@iabak.archiveteam.org
13:57 🔗 asktoomuc not sure what I'm doing wrong
14:17 🔗 kurt db48x / yipdw I haven't added the cronjobs back in - I was running the scripts manually to see if it actually ended up giving us updated graphs
14:26 🔗 trs80 asktoomuc: can you pastebin the error? we'll need more details to work out what's going on
15:43 🔗 cmaldonad has joined #internetarchive.bak
15:44 🔗 cmaldonad has quit IRC (Client Quit)
16:24 🔗 Optical has quit IRC (Quit: Page closed)
16:29 🔗 asktoomuc http://pastebin.com/E9gdpN1Q
16:31 🔗 iabak-reg registrar master 2af9b61 other SHARD12/pubkeys registration of removed on SHARD12
16:33 🔗 iabak-reg registrar master 9391ff2 other SHARD12/pubkeys registration of removed@gmail.com on SHARD12
16:50 🔗 db48x asktoomuc: do ssh control master connections generally work for you, or is the error message just saying that you couldn't connect to iabak.archiveteam.org?
16:54 🔗 db48x asktoomuc: http://stackoverflow.com/questions/36459785/shared-ssh-connection-with-control-master-not-working perhaps?
16:54 🔗 db48x oh, is this another NFS-or-SMB thing?
16:58 🔗 iabak-reg registrar master 4ed30a9 other SHARD12/pubkeys registration of removed@gmail.com on SHARD12
17:20 🔗 asktoomuc db48x: NFS this time round
17:21 🔗 asktoomuc as SMB isn't supported with symlinks it seems. I created a special share with NFS enabled for it
17:22 🔗 asktoomuc I'm not sure what shared ssh or control master are tbh. I have just installed a Debian 8 and I'm running the process from it
17:23 🔗 db48x it's a way for multiple SSH processes to share the same TCP connection, when they're all talking to the same server
17:24 🔗 db48x it makes it quicker to start multiple concurrent downloads, since they only have to set up the connection once
17:25 🔗 db48x but it requires creating a unix socket file
17:25 🔗 asktoomuc ok, I see
17:25 🔗 asktoomuc do I need any special configuration to make it work?
17:26 🔗 db48x hmm
17:26 🔗 asktoomuc I'm not too sure what's wrong from the logs. I have linked the pastebin of the execution as you've probably seen
17:27 🔗 db48x yes, see lines 167 and 168
17:28 🔗 db48x that's git printing out a message that it couldn't use the socket file we specified
17:28 🔗 asktoomuc hmmm ok
17:29 🔗 db48x or rather, it's git-annex that specifies where the socket should be stored
17:32 🔗 db48x ok, this is configurable in git-annex
17:32 🔗 db48x go into the shard directory and run 'git config annex.sshcaching false'
17:33 🔗 db48x of course, as soon as it checks out a new shard that shard will be broken
17:33 🔗 asktoomuc that command isn't supposed to return anything, is it?
17:34 🔗 db48x no
17:34 🔗 db48x you can run git config --list to see what settings are set
17:35 🔗 db48x I will think about a way to make this easier
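
Because the setting lives in each shard's own git config, one way to apply it everywhere is a loop over the shard directories. This is only a sketch, assuming the shards are checked out as `shard*` subdirectories of the current directory; `set_in_all_shards` is a hypothetical helper name and is not part of iabak.

```shell
# Hypothetical helper, not part of iabak: apply one git config setting
# in every shard* directory below the current directory.
# GIT can be overridden; it defaults to the git found on PATH.
set_in_all_shards() {
    local key="$1" value="$2" dir
    for dir in shard*/; do
        [ -d "$dir" ] || continue      # no shards checked out: nothing to do
        "${GIT:-git}" -C "$dir" config "$key" "$value"
    done
}

# Example (run from the IA.BAK directory):
#   set_in_all_shards annex.sshcaching false
```

A shard that is checked out later will not have the setting, so the loop would need to be run again at that point.
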
17:36 🔗 asktoomuc http://pastebin.com/uCHZZfXb
17:36 🔗 asktoomuc looks like the setting is set
17:37 🔗 asktoomuc can I run the main iabak script in the parent directory after that?
17:50 🔗 sevs db48x: you should be able to put git options into the ANNEXGETOPTS file, I think there is a switch for that
17:52 🔗 sevs -c name=value Overrides git configuration settings.
17:54 🔗 asktoomuc hmmm this didn't fix it
17:55 🔗 asktoomuc "./iabak-helper: line 162: 5400000000000 - : syntax error: operand expected (error token is "- ")"
17:56 🔗 asktoomuc http://pastebin.com/7YiUKJ7k
17:56 🔗 asktoomuc sorry guys, I'm just trying to contribute but I end up creating a lot of problems...
17:57 🔗 sevs just had that a couple hours ago
17:57 🔗 sevs one sec
17:59 🔗 sevs go to the shard dir, and set the git config variable annex.diskreserve=100M
18:01 🔗 asktoomuc now that I think about it, the script never asked me how much free space I wanted to keep
18:01 🔗 asktoomuc I seem to remember reading about that in the wiki
18:02 🔗 asktoomuc "It should prompt you for how much disk space to not use. To adjust this value later, use git config annex.diskreserve 200GB in all of the IA.BAK/shard* directories."
18:02 🔗 asktoomuc yep, it didn't
18:04 🔗 sevs yeah, on one machine it was set, on another it wasn't
18:05 🔗 sevs db48x: I believe this was introduced with the newest commit to https://github.com/ArchiveTeam/IA.BAK
18:08 🔗 db48x yep
18:08 🔗 db48x annoyingly I thought I handled the case where there was no reserve set
18:08 🔗 db48x except right there, obviously
18:09 🔗 asktoomuc :)
18:10 🔗 asktoomuc well at least I'm happy that I'm helping with finding issues in the current process
18:10 🔗 asktoomuc it seems to be downloading now
18:10 🔗 sevs if i got my bits of bash right you tried with "${GIT} config annex.diskreserve || true" except it should be "|| 0"
18:11 🔗 db48x sevs: that's a good idea
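
The "operand expected" error quoted earlier is what shell arithmetic reports when the right-hand operand expands to nothing. A literal `|| 0` would try to run a command named `0`, so the usual spellings are `|| echo 0` or a default inside the expansion. The sketch below shows the second form; `available_bytes` is a hypothetical name, not the code in iabak-helper.

```shell
# Hypothetical helper, not the code from iabak-helper: subtract the reserve
# from the free space, treating an empty or missing reserve as 0.
available_bytes() {
    local free="$1"
    local reserve="${2:-0}"    # ${var:-0} substitutes 0 when the value is empty
    echo $(( free - reserve ))
}

available_bytes 5400000000000 ""          # reserve never configured
available_bytes 5400000000000 100000000   # reserve of 100M
```

With the default in place the first call prints 5400000000000 instead of failing, and the second prints 5399900000000.
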
18:11 🔗 sevs asktoomuc: yay!
18:11 🔗 asktoomuc the downloading seems a bit slow (~500KB/s) but I can live with that
18:13 🔗 sevs asktoomuc: by default it only downloads files sequentially
18:14 🔗 sevs you can put a file ANNEXGETOPTS in your ia.bak dir
18:14 🔗 asktoomuc yeah, I saw a message about enabling concurrent download but I don't know how to do that and I'm worried it would not work with my setup that doesn't support control master connections it seems
18:14 🔗 asktoomuc oh ok, well I'll try that
18:14 🔗 sevs with the content "-J<num>"
18:14 🔗 kurt bit of an edge case here really, but on the subject of concurrent downloads
18:14 🔗 asktoomuc the message says: "(Not using enough bandwith? Enable concurrent downloads with: echo -J5 > ANNEXGETOPTS)"
18:14 🔗 sevs exactly
18:15 🔗 kurt if you've got more than one concurrent download going then kill the script, it'll finish them off one by one once you restart the script
18:15 🔗 kurt any way to avoid that?
18:15 🔗 asktoomuc so I create ANNEXGETOPTS. What's a sensible value for n on a VM running on Core i7 with enough RAM and a 1Gbps symmetrical connection?
18:16 🔗 db48x it'll work, it'll just make an extra TCP connection for each concurrent download
18:16 🔗 db48x kurt: tweak the rundownloaddirs function and send us a pull request :)
18:16 🔗 sevs yes? i remember deleting everything in <shard>/.git/annex/tmp/ worked at some point
18:17 🔗 asktoomuc *value for J sorry
18:17 🔗 sevs at least once that worked, no idea if i was just lucky
18:19 🔗 sevs asktoomuc: start with 10, see where that brings you to
18:19 🔗 asktoomuc ok thanks!
18:20 🔗 sevs do you have iftop? wonderful tool, shows the traffic/sec
18:20 🔗 asktoomuc nope, I'll install the package now
18:21 🔗 sevs in the lower right corner you see the total incoming and outgoing rate
18:27 🔗 Deewiant has quit IRC (Quit: Viivan loppu.)
18:30 🔗 kurt sevs: looks like that's just restarted the downloads :(
18:31 🔗 sevs puhh, yeah, no idea
18:31 🔗 db48x yes, it'll restart any downloads that were interrupted
18:32 🔗 db48x but it doesn't do so concurrently
18:32 🔗 db48x once those are done it'll go back to normal operation, which can be concurrent
18:33 🔗 sevs the one time I did this I got "Hey, you have shuf installed ..."
18:33 🔗 db48x yes, you'll get that next
18:33 🔗 db48x see https://github.com/ArchiveTeam/IA.BAK/blob/master/iabak-helper#L144-L149
18:36 🔗 kurt I know what the normal operation is - just wondering if there were a way to have it 'forget' that it's got some partly-finished downloads so it goes straight to concurrent downloads
18:37 🔗 kurt my issue is that huge 20gb files at 10mbit/s each 40 times is time inefficient
18:37 🔗 kurt I'll have a poke around anywho
18:39 🔗 db48x kurt: sure, run git annex unused
18:40 🔗 db48x then git annex dropunused
18:42 🔗 asktoomuc I keep getting that message "Filled up available disk space, so stopping here!"
18:43 🔗 sevs db48x: shouldn't it be possible to run the list from "git annex unused" through the same "| dirname_pipe | sumofbytes | shuffle | rundownloads" pipeline?
18:43 🔗 asktoomuc "Wow! I'm done downloading this shard of the IA!" <= that seems too good to be true
18:44 🔗 sevs asktoomuc: which shard were you working on?
18:44 🔗 asktoomuc shard 12
18:44 🔗 sevs might be that there were enough copies
18:44 🔗 asktoomuc I chose for me, I didn't specify anything
18:44 🔗 asktoomuc *it
18:45 🔗 sevs yeah
18:45 🔗 sevs you *do* have space free?
18:45 🔗 asktoomuc it says the shard directory is 641M (du -sh)
18:45 🔗 asktoomuc I have ~6TB free
18:45 🔗 sevs hmm
18:46 🔗 db48x what is your annex.diskreserve setting in shard12?
19:06 🔗 bwn has quit IRC (Ping timeout: 964 seconds)
19:09 🔗 asktoomuc annex.diskreserve=100M
19:09 🔗 asktoomuc how do you choose which shard you download?
19:09 🔗 asktoomuc and why does the script exit after just finishing 1 shard?
19:12 🔗 sevs has quit IRC (Ping timeout: 268 seconds)
19:14 🔗 kyan has joined #internetarchive.bak
19:15 🔗 db48x it's not supposed to
19:17 🔗 asktoomuc it probably has to do with my "Filled up available disk space" error message then
19:20 🔗 Kenshin who's managing the iabak node currently?
19:20 🔗 Kenshin i need to move the VM to another machine with more space, and also IP change
19:21 🔗 HCross db48x and closure
19:21 🔗 HCross probably need SketchCow for DNS
19:23 🔗 db48x Kenshin: why do you say that?
19:24 🔗 Kenshin db48x: ?
19:24 🔗 Kenshin db48x: what do you mean
19:25 🔗 db48x oh, I misread
19:28 🔗 db48x asktoomuc: so, what does "df -Ph ." print out on your system?
19:31 🔗 asktoomuc http://pastebin.com/sKGhShBx
19:31 🔗 db48x you forgot the .
19:31 🔗 db48x but presumably it's just the first and last lines of that
19:35 🔗 asktoomuc sorry
19:35 🔗 asktoomuc Filesystem Size Used Avail Use% Mounted on /dev/sda1 1.9G 1012M 726M 59% /
19:36 🔗 db48x ok, so iabak did the right thing
19:37 🔗 db48x you have 700M available and wanted to reserve 100M, so it downloaded ~600M of stuff :)
19:38 🔗 asktoomuc wrong directory...
19:38 🔗 asktoomuc man I'm useless
19:39 🔗 asktoomuc root@IABAK-VM:/mnt/IABAK/IA.BAK# df -Ph . Filesystem Size Used Avail Use% Mounted on 192.168.11.98:/mnt/user/IABAK 7.3T 1.9T 5.4T 26% /mnt/IABAK
19:39 🔗 iabak-reg registrar master 16a57e4 other SHARD12/pubkeys registration of hcross on SHARD12
19:40 🔗 asktoomuc but I'm thinking what you were saying is related somehow. It's suspicious that it stopped at ~640M when the space available on the main disk is 726M
19:44 🔗 bwn has joined #internetarchive.bak
19:48 🔗 SketchCow ?
19:56 🔗 yipdw asktoomuc: that's the reserve behavior; that seems expected to me
19:59 🔗 db48x yipdw: except that he actually has terabytes available in the mount
20:00 🔗 db48x asktoomuc: just for kicks, what do you get when you run it in the shard directory?
20:05 🔗 iabak-reg registrar master fa6c2de other SHARD10/pubkeys registration of Kaz on SHARD10
20:06 🔗 yipdw oh hmm
20:06 🔗 yipdw I missed that
20:08 🔗 asktoomuc when I run what, db48x ?
20:22 🔗 asktoomuc this? root@IABAK-VM:/mnt/IABAK/IA.BAK/shard12# df -Ph . Filesystem Size Used Avail Use% Mounted on 192.168.11.98:/mnt/user/IABAK 7.3T 1.9T 5.4T 26% /mnt/IABAK
20:24 🔗 iabak-reg registrar master 2e6043e other SHARD16/pubkeys registration of Kaz on SHARD16
20:30 🔗 thelsdj I'm having the out of disk space problem as well, only started when I re-ran iabak which updated code from git
20:31 🔗 thelsdj I have 8.8T free (what df says) and its set to save 7TB
20:34 🔗 thelsdj Filesystem Size Used Available Capacity Mounted on
20:34 🔗 thelsdj /dev/sda1 16.0T 7.1T 8.8T 45% /mnt/DroboFS
20:34 🔗 thelsdj annex.diskreserve=7TB
20:34 🔗 thelsdj Checking for any items that still need to be downloaded...
20:34 🔗 thelsdj oops, out of disk space
20:40 🔗 thelsdj tried setting it in GB so 7168GB and same problem, trying by removing the limit to verify that is the problem
20:41 🔗 thelsdj now it says i'm done with the shard, is shard16 only 202G? or is that another bug?
20:46 🔗 thelsdj huh, so i removed NOMORE and it won't even start another shard, says its used all the disk space (even though I don't have limit set in my existing shard)
20:49 🔗 kurt diskreserve = how much it'll keep free, no?
20:50 🔗 asktoomuc yeah
20:50 🔗 kurt and /mnt/IABAK has 5.4TB free?
20:51 🔗 kurt or are you no longer using /mnt/IABAK
20:51 🔗 asktoomuc no no, I'm indeed using it. And yes, it has 5.4TB free
20:52 🔗 kurt so you want to keep 7TB free
20:52 🔗 asktoomuc so I should be using it until it's almost full with the current setting
20:52 🔗 kurt you have 5.4TB free
20:52 🔗 kurt and you're wondering why it won't download more?
20:52 🔗 asktoomuc no, that's thelsdj
20:52 🔗 kurt you are correct, I am an idiot
20:52 🔗 kurt names are hard I want nick colors back, sorry
20:53 🔗 asktoomuc no worries
20:54 🔗 db48x kurt: :)
20:55 🔗 db48x so apparently I broke how it measures the free disk space?
20:55 🔗 db48x asktoomuc: can you do some debugging for me?
20:55 🔗 thelsdj db48x: i can as well, yeah you seem to have broken it
20:55 🔗 asktoomuc sure, thanks for trying to help
20:56 🔗 iabak-reg registrar master 6193808 other SHARD10/pubkeys registration of thelsdj on SHARD10
20:56 🔗 thelsdj ok so hmm
20:56 🔗 thelsdj i set the annex.diskreserve on the IA.BAK directory as well
20:57 🔗 thelsdj and now it seems to be working
20:57 🔗 db48x https://gist.github.com/anonymous/41195d51c16e880df5e67c62fb46cca6
20:57 🔗 db48x ah, hmm
20:59 🔗 thelsdj also, sort of unrelated, how can i double check that if it tells me its finished getting a shard that I can believe it? is shard16 only 202G? i thought it was more than that
21:00 🔗 kurt heh, now I don't even have concurrent grabs at all
21:00 🔗 db48x forgot one function: https://gist.github.com/db48x/b079eaf83d33361d28c8115e8e5352da
21:01 🔗 db48x thelsdj: you can use git annex list --all to see which remotes have copies of which files
21:01 🔗 db48x you can use git annex list --not --copies 4 to see a list of all files that don't have enough copies
21:02 🔗 db48x asktoomuc: if you can download the test.sh from that second gist and source it, then you'll be able to call the functions and make sure they work correctly
21:03 🔗 db48x for example, bytesFromSize $(annexreserved) should print out 100000000
21:03 🔗 db48x and bytesFromSize $(diskfree) should print out 5400000000000
21:03 🔗 thelsdj line 13 has syntax error i think
21:03 🔗 asktoomuc I need a tiny bit more hand-holding, sorry
21:04 🔗 asktoomuc I can download the file, where do you want me to put it and what to do with it?
21:04 🔗 db48x asktoomuc: put it in IA.BAK
21:04 🔗 thelsdj no, just my shell was weird
21:04 🔗 db48x then use the "source" command to add it to your current environment
21:04 🔗 db48x (basically "source test.sh")
21:05 🔗 db48x thelsdj: it almost has LTS, but if you get an error message let me know what it is :)
21:05 🔗 thelsdj needs 'pow' as well
21:05 🔗 db48x ah, right
21:06 🔗 asktoomuc ok, copied and sourced
21:06 🔗 asktoomuc root@IABAK-VM:/mnt/IABAK/IA.BAK# bytesFromSize $(annexreserved) 100000000
21:06 🔗 db48x updated: https://gist.github.com/db48x/b079eaf83d33361d28c8115e8e5352da
21:07 🔗 asktoomuc root@IABAK-VM:/mnt/IABAK/IA.BAK# bytesFromSize $(diskfree) 5400000000000
21:07 🔗 db48x asktoomuc: ok, excellent
21:07 🔗 db48x asktoomuc: that rules out a class of problems
21:07 🔗 db48x asktoomuc: although for completeness, let's make sure that subtraction works :)
21:07 🔗 thelsdj so for me annexreserved is 0
21:07 🔗 thelsdj but i don't think thats right
21:08 🔗 db48x echo $(($(bytesFromSize $(diskfree)) - $(bytesFromSize $(annexreserved))))
21:08 🔗 asktoomuc root@IABAK-VM:/mnt/IABAK/IA.BAK# echo $(($(bytesFromSize $(diskfree)) - $(bytesFromSize $(annexreserved)))) 5399900000000
21:08 🔗 db48x asktoomuc: perfect
21:08 🔗 db48x thelsdj: annexreserved tries to be smart
21:09 🔗 db48x if you run annexreserved . it looks in the current directory
21:09 🔗 db48x if you just run annexreserved it looks in a shard* subdirectory of the current directory
21:10 🔗 db48x asktoomuc: ok, another thing we can do is to run iabak-helper with debugging output
21:10 🔗 thelsdj oh i guess my new shard wasn't really setup yet and didn't have the reserve
21:11 🔗 db48x thelsdj: ok, that's a bug we need to track down separately
21:11 🔗 db48x asktoomuc: if you edit iabak-helper and add "set -x" as the second line, then when you run iabak it'll be super verbose
21:11 🔗 db48x then I can read that output and see what's going on
21:11 🔗 asktoomuc ok, let's give that a try
21:12 🔗 db48x I'll be back in 5 minutes
21:19 🔗 iabak-reg registrar master a3e059a other SHARD15/pubkeys registration of thelsdj on SHARD15
21:19 🔗 iabak-reg registrar master 62053cb other SHARD16/pubkeys registration of octobyt3 on SHARD16
21:20 🔗 thelsdj + available=-6992000000000000
21:20 🔗 thelsdj + [[ -6992000000000000 -gt 34359738368 ]]
21:20 🔗 thelsdj + [[ -6992000000000000 -gt 0 ]]
21:20 🔗 thelsdj + echo 'oops, out of disk space'
21:20 🔗 thelsdj hmmm
21:20 🔗 thelsdj too much space free maybe?
21:20 🔗 thelsdj lol
21:23 🔗 kurt are you running freenas or something? could try changing the dataset quota if so
21:27 🔗 thelsdj db48x: so your bytesFromSize returns different things if I do 7TB or 7T
21:27 🔗 thelsdj maybe thats the bug?
21:28 🔗 thelsdj or a bug, not sure, still messing and trying to figure this out
21:32 🔗 db48x thelsdj: ah, indeed
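
A size parser should map "7TB" and "7T" to the same byte count. The sketch below is one way to do that, using the decimal multipliers implied by the values quoted in this log (100M is 100000000). `bytes_from_size` is a hypothetical stand-in and not the `bytesFromSize` function from the gist. It does not handle binary suffixes such as GiB, and it uses awk for the multiplication so that fractional sizes like the "5.4T" printed by df also work.

```shell
# Hypothetical stand-in, not the bytesFromSize function from the gist:
# convert sizes such as "100M", "7T", "7TB" or "5.4T" to a byte count,
# using decimal multipliers (1K = 1000).
bytes_from_size() {
    local size num unit mult
    size=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    size="${size%B}"            # "7TB" -> "7T"; a bare "7T" is left alone
    num="${size%[KMGTP]}"       # numeric part, e.g. "5.4"
    unit="${size#"$num"}"       # suffix letter, or empty for plain bytes
    case "$unit" in
        K) mult=1000 ;;
        M) mult=1000000 ;;
        G) mult=1000000000 ;;
        T) mult=1000000000000 ;;
        P) mult=1000000000000000 ;;
        *) mult=1 ;;
    esac
    awk -v n="$num" -v m="$mult" 'BEGIN { printf "%.0f\n", n * m }'
}

bytes_from_size 7TB
bytes_from_size 7T
```

Both calls print 7000000000000, which is the property the two spellings need to share.
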
21:33 🔗 asktoomuc on my side, rerunning it with set -x seems to have changed something. It is still downloading for now and hasn't aborted with the no space message yet
21:34 🔗 db48x asktoomuc: heh, ok. can you scroll back up to just before it started downloading?
21:34 🔗 asktoomuc sure
21:35 🔗 asktoomuc there's a bunch of stuff there
21:35 🔗 db48x look for this:
21:35 🔗 db48x + echo 'Checking for any files that still need to be downloaded...'
21:35 🔗 db48x + periodicsync
21:35 🔗 db48x Checking for any files that still need to be downloaded...
21:35 🔗 db48x + find_insufficient_copies
21:36 🔗 db48x and capture the log down until it starts downloading something
21:36 🔗 db48x (personally, I use tmux, so I can search backwards through the back buffer for the string 'find_insufficient_copies' and I'm there; perhaps you can do something similar in your setup)
21:37 🔗 asktoomuc http://pastebin.com/fVGUgQbv
21:37 🔗 db48x ok, your flock is breaking too, but that's not the cause of this problem
21:37 🔗 asktoomuc I'm not and Putty is very annoying because it keeps scrolling down the window when updating the download graph
21:38 🔗 asktoomuc hopefully I captured everything you needed
21:38 🔗 db48x yes, it looks good
21:38 🔗 db48x + available=5399900000000
21:38 🔗 db48x so it knows that it has plenty of space available
21:39 🔗 db48x + [[ 5399900000000 -gt 34359738368 ]]
21:39 🔗 db48x + available=34359738368
21:39 🔗 db48x + [[ 34359738368 -gt 0 ]]
21:39 🔗 db48x + spacelimit=34359738368
21:39 🔗 db48x it limits it to a cap that I put in on a whim
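
The two checks visible in the trace can be sketched as a small function. This is a reconstruction from the `set -x` output only, with a hypothetical name (`space_limit`); the cap value is the one that appears in the trace, and the real logic lives in iabak-helper.

```shell
# Hypothetical reconstruction of the two checks visible in the trace;
# the real logic lives in iabak-helper.
space_limit() {
    local available="$1"
    local cap=34359738368                  # the cap seen in the trace (32 GiB)
    if [ "$available" -gt "$cap" ]; then
        available="$cap"                   # plenty of room: limit this pass to the cap
    fi
    if [ "$available" -gt 0 ]; then
        echo "$available"
    else
        echo 'oops, out of disk space' >&2
        return 1
    fi
}

space_limit 5399900000000
```

With 5399900000000 bytes available the function prints the cap, 34359738368. A negative value, such as the one in thelsdj's trace, fails both tests and reports that the disk is full.
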
21:40 🔗 db48x ah:
21:40 🔗 db48x + read -d '' bytes filename
21:40 🔗 db48x + [[ 387325952 -lt 34359738368 ]]
21:40 🔗 db48x + spaceneeded=387325952
21:40 🔗 db48x + files+=(${filename})
21:40 🔗 db48x + read -d '' bytes filename
21:40 🔗 db48x + numfiles=1
21:40 🔗 db48x we read a name from the list of things to download, and end up only finding one thing
21:40 🔗 db48x occupying 387MB
21:42 🔗 thelsdj huh, so if i manually run the find_insufficient_copies in my shard dir i get a ton of files, but the script doesn't seem to find anything
21:44 🔗 thelsdj + files=()
21:44 🔗 thelsdj + read -d '' bytes filename
21:44 🔗 thelsdj + numfiles=0
21:44 🔗 db48x hrm
21:50 🔗 thelsdj looks like my xargs and dirname don't behave as expected
21:51 🔗 db48x perfect
21:51 🔗 thelsdj maybe they are being short circuited or not, let me try hard coding it to 'cat' and see if that fixes it
21:56 🔗 thelsdj is the sumofbytes supposed to only print one line?
21:56 🔗 db48x potentially
21:56 🔗 db48x it groups them by whatever is in the second column
21:57 🔗 db48x and we use dirname to shorten the second column and get the item names
21:57 🔗 db48x if there's only one item in the shard then there will only be one line in the final output
21:57 🔗 thelsdj i just get 734964 archivebot/archiveteam_archivebot_go_080/00000_Header.png but thats with cutting out the xargs/dirname stuff and just using 'cat'
21:57 🔗 db48x then no
21:57 🔗 thelsdj but theres a ton of items i don't yet have
21:58 🔗 db48x with filenames like that, sumofbytes should return them unchanged
21:58 🔗 db48x since there shouldn't be any duplicates in the second column to group the lines up by
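
The grouping described here, summing the first column for each distinct value of the second, is a short awk idiom. The sketch below is not the `sumofbytes` from iabak-helper; `sum_by_item` is a hypothetical name, and it assumes newline-separated input with no whitespace in the names, whereas the real pipeline passes NUL-delimited records.

```shell
# Hypothetical sketch, not the sumofbytes from iabak-helper:
# input lines are "<bytes> <name>"; output is one line per distinct name
# with the byte counts added up.
sum_by_item() {
    awk '{ total[$2] += $1 }
         END { for (item in total) printf "%.0f %s\n", total[item], item }' | sort -k2
}

printf '100 itemA\n200 itemA\n50 itemB\n' | sum_by_item
```

The example prints "300 itemA" and "50 itemB". When every name is unique, as with full file paths, the output has as many lines as the input.
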
21:59 🔗 thelsdj so by making it just find_insufficient_copies | rundownloads it works
21:59 🔗 thelsdj annoying that these pipes are so hard to debug
22:01 🔗 db48x thelsdj: :)
22:01 🔗 thelsdj hmm maybe not
22:01 🔗 db48x find_insufficient_copies | head -z | tr '\000' '\n'
22:01 🔗 db48x find_insufficient_copies | head -z | dirname_pipe | tr '\000' '\n', etc
22:02 🔗 VADemon has joined #internetarchive.bak
22:03 🔗 db48x (it would be nice if bash had a better debugger; one that let you set breakpoints, and inspect everything that had flowed through a pipeline)
22:05 🔗 asktoomuc it's still downloading on my side. I'm confused
22:06 🔗 db48x asktoomuc: :)
22:06 🔗 db48x gremlins?
22:09 🔗 asktoomuc yeah I don't know. The only thing I changed was the set -x after running the tests you asked me
22:10 🔗 asktoomuc anyway, not going to complain
22:12 🔗 thelsdj hmmm still not working, it tries to download 2 torrents that i don't have the tools for, but there's still a TON of files that find_insufficient_copies returns but aren't being attempted
22:32 🔗 thelsdj i don't quite get shard16, i have 202G and i seem to have everything that there aren't 4 copies of and yet like half the shard is still IA only? are the torrent files really 50+% of the shard?
22:33 🔗 jsp12345 has quit IRC (Read error: Operation timed out)
22:33 🔗 thelsdj i guess if deewiant has 1.34T then maybe the torrents really are 1.1T total
22:34 🔗 thelsdj huh, now git annex list is showing a ton of stuff i don't have but is on web and yet iabak is not downloading it
22:39 🔗 thelsdj find_insufficient_copies |tr '\000' '\n'| awk '{print $2}' |xargs -n 100 ../git-annex.linux/git annex list|grep "^__X_"|wc -l
22:39 🔗 thelsdj 3874
22:40 🔗 thelsdj so by my basic understanding, iabak _should_ be trying to download these?
22:46 🔗 iabak-reg registrar master c9abe05 other SHARD3/pubkeys registration of mitch on SHARD3
22:53 🔗 db48x thelsdj: we're also having some trouble with the stats on the website
22:53 🔗 thelsdj i think i also may be hitting a bash pipe limit
22:54 🔗 db48x yes. it sounds to me like it's not processing that pipeline correctly or something
22:54 🔗 thelsdj the output of find_insufficient_copies is 490k but i think my bash gives up at 256k
22:54 🔗 db48x well, it's not like a pipe can only transfer a limited amount of data
22:55 🔗 thelsdj right, it doesn't make sense but stuffs not coming out the end, even if i remove all the steps
22:55 🔗 db48x weird
22:56 🔗 db48x what is find_insufficient_copies |tr '\000' '\n'|wc -l
22:57 🔗 komarEX has joined #internetarchive.bak
22:57 🔗 komarEX hey everyone
22:58 🔗 komarEX is it me or ANNEXGETOPTS disappeared from iabak ?
22:58 🔗 db48x komarEX: no, it's still there
22:58 🔗 komarEX well the information is but I don't see usage
22:58 🔗 thelsdj the 'read' is only getting 383 different filenames
22:59 🔗 thelsdj out of ~4800 that start the pipe
22:59 🔗 db48x thelsdj: funky
22:59 🔗 komarEX I have -J15 in file and it still downloads just one file
22:59 🔗 kurt komarEX: glad I'm not the only one noticing that today
22:59 🔗 kurt but I can't see any changes that would affect it in the iabak repo
23:00 🔗 komarEX Then I guess I have to look into my files backup
23:00 🔗 HCross2 Mine is doing that too
23:01 🔗 db48x it's right there on line... oh, uh
23:02 🔗 db48x sorry about that, I did break it
23:03 🔗 komarEX so let's just restart iabak right?
23:04 🔗 db48x yep
23:04 🔗 komarEX could you confirm 2 things with me
23:04 🔗 db48x hopefully :)
23:04 🔗 komarEX 1. shard4 is 1,41TB ?
23:04 🔗 db48x no, shard1 has 2.71 TB of stuff
23:05 🔗 komarEX oh
23:05 🔗 komarEX ok
23:05 🔗 komarEX 2. what will happen if I let annex use for ex. 4TB but shard is 2,71
23:05 🔗 db48x it will move on to another shard
23:05 🔗 komarEX ok
23:06 🔗 komarEX oh
23:06 🔗 komarEX btw
23:06 🔗 db48x feel free to also peruse the content of the shards and manually run git annex get for any files you would like to use
23:06 🔗 komarEX shard4 =/= shard1 I believe :d
23:06 🔗 db48x music you'd like to listen to, for instance
23:07 🔗 db48x oh, shard 4 is 1.41 TB
23:07 🔗 komarEX ok and one more thing
23:07 🔗 db48x sure
23:07 🔗 komarEX can you deny annex/iabak to download other shards ?
23:08 🔗 db48x sure, just edit the repolist
23:08 🔗 db48x set them all to "maint"
23:08 🔗 komarEX can you tell me which file/command ?
23:09 🔗 db48x use your favorite text editor
23:09 🔗 db48x it's a very simple file
23:09 🔗 komarEX oh I'm blind
23:09 🔗 komarEX it's in main dir ok
23:11 🔗 thelsdj so it seems to be stopping right before a 10G file, which i obviously have space for, so guessing issue is with: [[ $((${spaceneeded} + ${bytes})) -lt ${spacelimit} ]]
23:13 🔗 db48x thelsdj: what are the other variables at the time?
23:14 🔗 db48x komarEX: just be aware that this may require you to manually pull future updates
23:14 🔗 komarEX db48x: I'm aware
23:15 🔗 bwn has quit IRC (Ping timeout: 244 seconds)
23:15 🔗 db48x :)
23:15 🔗 thelsdj it gives up between these two: XXX:spaceneeded:70256110876,bytes:262,spacelimit:34359738368,archivebot/archiveteam_archivebot_go_081/www.geekhard.fr-shallow-20140726-175015-1f1e6.json
23:15 🔗 thelsdj XXX:spaceneeded:80994514176,bytes:10738403300,spacelimit:34359738368,archivebot/archiveteam_archivebot_go_081/www.genealogy.com-inf-20140605-184144-ahip0-000
23:16 🔗 thelsdj 00.warc.gz
23:16 🔗 thelsdj (i took out the check so i can get output from before and after)
23:16 🔗 komarEX db48x: I guess I should just throw this file to .gitignore?
23:17 🔗 db48x thelsdj: each time you run it it's going to get the list of files in a different order, and it'll stop at a different place
23:17 🔗 thelsdj no, its same space every time
23:17 🔗 thelsdj i took out the randomize
23:17 🔗 thelsdj since i don't have shuf anyways
23:17 🔗 db48x ah
23:17 🔗 db48x ok, so why is spaceneeded 70G even though spacelimit is 34G?
23:18 🔗 db48x it's supposed to stop when spaceneeded + bytes -gt spacelimit
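
The stopping rule can be sketched as a loop that keeps a running total and stops at the first file that would reach the limit. This follows the `-lt` test quoted by thelsdj above; `pick_batch` is a hypothetical name, and the sketch reads newline-separated "bytes filename" pairs where the real script reads NUL-delimited ones.

```shell
# Hypothetical sketch of the batching rule, not the loop from iabak-helper.
# Reads "<bytes> <filename>" lines on stdin and prints the filenames that
# fit within the limit given as the first argument.
pick_batch() {
    local spacelimit="$1" spaceneeded=0 bytes filename
    while read -r bytes filename; do
        # stop once the next file would reach the budget
        if [ $(( spaceneeded + bytes )) -ge "$spacelimit" ]; then
            break
        fi
        spaceneeded=$(( spaceneeded + bytes ))
        printf '%s\n' "$filename"
    done
}

printf '10 a\n20 b\n50 c\n5 d\n' | pick_batch 40
```

The example prints a and b: the third file would push the total to 80, so the loop stops there and never looks at d.
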
23:18 🔗 asktoomuc I'm still pulling files. I'm at roughly 15G for now, it's quite slow (between 1 and 6MB/s) but I guess I'm not in a hurry. I just hope it will keep working
23:19 🔗 thelsdj well, why is spacelimit 34G when I should have like 1.7T available for it
23:19 🔗 db48x thelsdj: because I had the idea that iabak should sync the repository more frequently
23:19 🔗 db48x so I wrote it to stop after a slightly random threshold in the hopes that it would do so
23:20 🔗 db48x (the threshold is 8 hours at 1MB/s)
23:21 🔗 thelsdj ok, so right, theres like 300+ files that it goes through no problem, doesn't take any time
23:21 🔗 thelsdj and it always tries to download them
23:21 🔗 thelsdj so i guess its blowing through that in like 20 seconds
23:22 🔗 db48x metadata and torrents and stuff that fails?
23:22 🔗 thelsdj i've manually run the git annex get command for them and theres no output, return value is 0
23:22 🔗 thelsdj doesn't seem to be failures as far as i can tell
23:22 🔗 thelsdj just silently succeeds immediately
23:22 🔗 db48x then you already have those files
23:23 🔗 db48x that may be something we forgot along the way, now that I think about it
23:23 🔗 thelsdj right so i already have the files, but they aren't at 4 copies yet, but they aren't filtered out
23:23 🔗 thelsdj ok that makes sense
23:24 🔗 db48x and now that I've added a threshold, they're clogging up the works
23:24 🔗 thelsdj right
23:24 🔗 db48x good bug report :)
23:24 🔗 db48x --not --copies 4 --not --here?
23:25 🔗 thelsdj yeah --not is there
23:26 🔗 db48x ah, --not --copies 4 --not --in=here?
23:27 🔗 thelsdj oh i get it, ok adding that and trying
23:29 🔗 komarEX has quit IRC (Quit: Page closed)
23:31 🔗 thelsdj combined with changing sumofbytes to cat, it seems to be downloading again
23:31 🔗 db48x ok, so sumofbytes is broken
23:31 🔗 db48x it's just an awk script; could your awk be broken?
23:31 🔗 thelsdj at least on my system i think it prints only 1 line
23:31 🔗 thelsdj yeah its possible my awk is weird
23:32 🔗 thelsdj i have busybox awk
23:32 🔗 db48x oh
23:32 🔗 db48x busybox
23:33 🔗 thelsdj i'm surprised that this works as well as it does on my Drobo NAS
23:33 🔗 db48x gawk instead?
23:34 🔗 thelsdj if git-annex arm build could include gawk that would be great
23:34 🔗 thelsdj seeing if i have a gawk elsewhere or available
23:35 🔗 db48x heh
23:36 🔗 thelsdj not in any obvious places or in the repo for what people have built for the Drobo
23:38 🔗 thelsdj yeah so i think if its needed having git-annex include it in its arm binary would be very useful
23:40 🔗 asktoomuc am I supposed to see multiple progress bars when concurrent downloads happen?
23:43 🔗 asktoomuc because right now it only seems to download files sequentially even though I created the ANNEXGETOPTS file
23:43 🔗 asktoomuc iabak@IABAK-VM:/mnt/IABAK/IA.BAK$ cat ANNEXGETOPTS J9
23:43 🔗 thelsdj should be -J9 right?
23:44 🔗 asktoomuc indeed, thanks for spotting my mistake
23:45 🔗 asktoomuc time to go to bed I guess, I can't read properly anymore
23:45 🔗 asktoomuc thank you all for your help today!
23:45 🔗 bwn has joined #internetarchive.bak
23:48 🔗 db48x asktoomuc: you're welcome!
23:48 🔗 db48x asktoomuc: thanks for helping us out
23:48 🔗 db48x thelsdj: git-annex doesn't need awk at all
23:48 🔗 db48x iabak does, but iabak isn't going to distribute awk
23:48 🔗 db48x I mean, we could distribute bash too, and perl
23:48 🔗 db48x so I'd rather not distribute any of them
23:49 🔗 yipdw i have been on this weird rust kick lately
23:49 🔗 thelsdj yeah, i mean its worth discussing as embedded NAS' are a good source of space so would be nice to be able to run on them without much problem
23:54 🔗 db48x thelsdj: I'd rather have a few lines at the top of the file like AWK=awk that people can edit
23:57 🔗 db48x I added some things to the readme
23:57 🔗 db48x anything else we've covered today that I've forgotten?
