#internetarchive.bak 2016-11-10,Thu


Time Nickname Message
00:02 🔗 sep332 has quit IRC (konversation out)
01:08 🔗 kyan has joined #internetarchive.bak
01:24 🔗 Lord_Nigh has quit IRC (Ping timeout: 633 seconds)
01:25 🔗 Lord_Nigh has joined #internetarchive.bak
02:01 🔗 Lord_Nigh has quit IRC (Read error: Operation timed out)
02:08 🔗 Lord_Nigh has joined #internetarchive.bak
02:26 🔗 Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
02:28 🔗 Lord_Nigh has joined #internetarchive.bak
03:08 🔗 Lord_Nigh has quit IRC (Read error: Operation timed out)
03:31 🔗 Lord_Nigh has joined #internetarchive.bak
04:26 🔗 Start has quit IRC (Quit: Disconnected.)
04:28 🔗 Start has joined #internetarchive.bak
04:50 🔗 SketchPho No action?
04:50 🔗 SketchPho I need to tap closure. Is there other stuff people have questions on?
04:51 🔗 SketchPho I'm not kidding when I say this is day in day out first priority
04:51 🔗 db48x do we have a list of things to put in a new shard? I can run the scripts
04:53 🔗 db48x actually, I take that back. I can't log into the server
04:53 🔗 db48x oh, nvm
04:54 🔗 db48x I was doing it wrong
05:04 🔗 db48x gotta grab the latest census
05:07 🔗 db48x ok, confusion
05:07 🔗 db48x archiveteam_census_2016 doesn't have what I expected; it's just a list of identifiers with none of the rest of the information
05:08 🔗 Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
05:14 🔗 Lord_Nigh has joined #internetarchive.bak
05:24 🔗 Lord_Nigh has quit IRC (Read error: Operation timed out)
05:27 🔗 Lord_Nigh has joined #internetarchive.bak
05:30 🔗 SketchPho I'd like us to turn to archivebot and archiveteam items for new shards
05:30 🔗 db48x sure
05:30 🔗 db48x do we have any that aren't made of 50GB warcs?
05:51 🔗 bwn db48x: sorry, yes, no metadata in there yet
05:52 🔗 db48x ah
05:52 🔗 db48x :)
05:52 🔗 bwn i've started ia-mine running on archivebot collection to get metadata
05:59 🔗 yipdw bwn: also use http://archive.fart.website/archivebot/viewer/items/
05:59 🔗 yipdw that may be a more complete index of archivebot materials, as it includes a large number of items that are not in the archivebot collection
06:08 🔗 HCross2 Best person to make my SSH key available to?
06:16 🔗 bwn yipdw: looks like the items you're talking about got added to the archivebot collection at some point
06:17 🔗 bwn ia gave me archiveteam_archivebot_go_20161110020001 but it hasn't been added to the viewer yet
06:17 🔗 yipdw some of them did
06:17 🔗 yipdw yeah, the viewer updates every 24 hours
06:17 🔗 yipdw items are created every, uh, 3 or so
06:17 🔗 db48x HCross2: you can just paste it in here and I'll add it to the server. ed25519 keys are preferred
06:17 🔗 yipdw depending on upload speed from fos
06:17 🔗 kyan has quit IRC (Quit: Leaving)
06:18 🔗 HCross2 https://www.irccloud.com/pastebin/BPtozFKj
06:19 🔗 HCross2 db48x:
06:19 🔗 db48x strongly preferred :)
06:20 🔗 HCross2 Ah ok. I'll have to re do it
06:26 🔗 Deewiant Is it normal to see "verification of content failed" and "Unable to access these remotes: web" often? (Happens for approximately 1 file out of 5)
06:26 🔗 db48x Deewiant: sometimes
06:26 🔗 db48x even when an item is hidden on IA it can still be mentioned in the backup, but nobody will be able to download it
06:36 🔗 bwn yipdw: unless i'm making a mistake, it looks like the viewer matches up with `ia search collection:archivebot`
06:36 🔗 yipdw ah ok cool
06:39 🔗 bwn :)
06:39 🔗 bwn db48x: that metadata is finished, i also created a sorted list of the archivebot item sizes
06:40 🔗 bwn where would a good place to put this for anyone who needs it? i could create an ia item for now
06:41 🔗 db48x that's a nice meta way to do it
06:41 🔗 db48x or just serve it up the old-fashioned way and I'll download it from you
06:52 🔗 Deewiant db48x: Ok, it just looks a bit worrying to have it show up so often. Seems to mostly (or only?) affect _archive.torrent and _meta.xml files though.
06:52 🔗 bwn db48x: http://erebos.undo.it/MIRROR/db48x/
06:55 🔗 db48x Deewiant: ah, that's actually a slightly different issue; we don't know the correct hash of those files so they can't be verified
06:56 🔗 db48x bwn: I'm taking a look now
06:57 🔗 bwn i'm writing up a readme so you know what's what
06:58 🔗 db48x hmm
06:59 🔗 db48x the script we have reads from a file with the md5 hash, the size, the category and the file url
07:04 🔗 HCross2 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIP5OhU2Lita9RdjPkX9N0w9wZnmVlednUDEx24bVn4Mk IABAK key - Harry C
07:04 🔗 HCross2 db48x: ^
07:05 🔗 HCross2 TIL: You have to use the development version of PuTTYgen for that
07:05 🔗 bwn get_item_files.jq outputs a partial url collection/filename for each file
07:06 🔗 db48x HCross2: great, what username do you want?
07:06 🔗 HCross2 HCross
07:10 🔗 bwn that get_item_files.jq can be modified to get the format you need with the jq filters from https://archive.org/details/ia-bak-census_20150304
07:10 🔗 bwn ^ can likely be modified
07:11 🔗 db48x HCross2: ok, you should be able to log in to iabak.archiveteam.org, username hcross
07:11 🔗 HCross2 Would it be a good idea to run another IA census now?
07:13 🔗 Atom has quit IRC (Read error: Connection reset by peer)
07:14 🔗 HCross2 db48x: thanks, am in
07:14 🔗 db48x excellent
07:15 🔗 HCross2 I need to leave for work now, but ill have a little play around when I get into the office
07:15 🔗 db48x oh, you know what? it just occurred to me
07:15 🔗 db48x ok
07:15 🔗 db48x there's a minor wrinkle that I'll straighten out shortly
07:17 🔗 db48x closure's way of setting machines up is better than mine, but requires remembering to do things better
07:28 🔗 db48x hrm
07:28 🔗 db48x I can't remember how we handled items that are in multiple collections
07:29 🔗 HCross2 How are we also going to do large collections
07:30 🔗 db48x split them across multiple shards
07:31 🔗 db48x .files |
07:31 🔗 db48x map(
07:31 🔗 db48x select(.source != "derivative") |
07:31 🔗 db48x # if case for catching files with size=null (i.e. files.xml).
07:31 🔗 db48x if .size != null then
07:31 🔗 db48x {"name": .name, "size": (.size | tonumber), "collection": $c, "md5": .md5}
07:31 🔗 db48x else
07:31 🔗 db48x {"name": .name, "size": 0, "collection": $c, "md5": .md5}
07:31 🔗 db48x end
07:32 🔗 db48x ) |
07:32 🔗 db48x map([.md5, .size, .collection[0], .name]) | map(@tsv) | .[]
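The jq filter pasted above can be read as: skip derivative files, treat a missing size as 0, and emit one tab-separated row of md5, size, first collection and name per file. A minimal Python sketch of the same idea follows, as an illustration rather than the script the project actually uses. It assumes each item's metadata is a JSON object with a "files" list and a "metadata.collection" field (the jq filter gets the collection from a `$c` variable bound outside the pasted fragment); `item_to_rows`, `to_tsv` and `sample_item` are hypothetical names and data.

```python
def item_to_rows(item):
    """Yield (md5, size, collection, name) for each non-derivative file."""
    # Assumption: the collection list lives under metadata.collection.
    collection = item.get("metadata", {}).get("collection", [])
    if isinstance(collection, str):
        collection = [collection]
    # An item may be in several collections; keep only the first one,
    # mirroring `.collection[0]` in the jq filter.
    first = collection[0] if collection else ""
    for f in item.get("files", []):
        if f.get("source") == "derivative":
            continue
        size = f.get("size")
        # Files with no recorded size (e.g. files.xml) count as 0 bytes.
        size = int(size) if size is not None else 0
        yield (f.get("md5") or "", size, first, f["name"])


def to_tsv(items):
    """Join all rows of all items into tab-separated lines."""
    return "\n".join(
        "\t".join(str(col) for col in row)
        for item in items
        for row in item_to_rows(item)
    )


# Hypothetical item, shaped like the metadata described in the log.
sample_item = {
    "metadata": {"identifier": "example_item",
                 "collection": ["archivebot", "archiveteam"]},
    "files": [
        {"name": "example.warc.gz", "source": "original",
         "md5": "abc123", "size": "1024"},
        {"name": "example.cdx.idx", "source": "derivative",
         "md5": "def456", "size": "10"},
        {"name": "example_files.xml", "source": "original"},
    ],
}

print(to_tsv([sample_item]))
```

Taking only the first collection is also what keeps an item that belongs to several collections from landing in more than one shard.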
07:33 🔗 db48x bwn: is every single item in this dataset in the archivebot collection?
07:33 🔗 HCross2 Ah. So for example, I would like to back up https://archive.org/details/archiveteam_newssites which is going up by at least half a terabyte a day
07:36 🔗 db48x yea, I guess they are
07:38 🔗 bwn yes, from `ia search collection:archivebot`
07:39 🔗 bwn i need to eat and get some sleep soon, hope that stuff helps :)
07:40 🔗 db48x it does
07:44 🔗 db48x archivebot and archiveteam_newssites are both going to have to end up split across a bunch of shards
08:02 🔗 HCross2 Easiest way of splitting them?
08:04 🔗 Kaz db48x: can you drop my key in too please?
08:04 🔗 Kaz only here for a few minutes before I run off to work, but have key
08:12 🔗 Kaz ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIHhFYMd9Htlf9wPZzIDyqbYYNwuo3m+kWQ9/pfAD/TE9 Kaz IABAK
08:12 🔗 Kaz ^ if you're around, gotta run now
08:12 🔗 db48x Kaz: I will add you in
08:13 🔗 Kaz awesome, thanks
08:13 🔗 db48x you will be kaz on the server
08:16 🔗 jsp12345 has joined #internetarchive.bak
08:43 🔗 * db48x sighs
08:57 🔗 atomotic has joined #internetarchive.bak
09:45 🔗 antomatic has quit IRC (Read error: Connection reset by peer)
09:46 🔗 antomatic has joined #internetarchive.bak
10:06 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
11:11 🔗 atomotic has joined #internetarchive.bak
11:26 🔗 atomotic has quit IRC (Remote host closed the connection)
11:28 🔗 atomotic has joined #internetarchive.bak
12:29 🔗 SketchPho Please do not add news sites to this project quite yet
12:30 🔗 SketchPho I would want us to go after the archivebot collection first, just because that is pure websites that were grabbed for various reasons
12:31 🔗 SketchPho This is also going to bring to bear our management of large amounts of shards, which we might as well deal with anyway
12:32 🔗 SketchPho I also understand that we will be forced to deal with the situation of a growing collection, which means that we might want to just focus on a cutoff date with archivebot
12:33 🔗 SketchPho Perhaps it is time for me to begin discussions on data hoarders for space
13:05 🔗 HCross2 Ok. How do we cut up a large collection like archivebot
13:14 🔗 luckcolor By timestamp or by data size i suppose
13:17 🔗 luckcolor Shards with big files could have a smaller number of files, unlike the normal ones
13:17 🔗 HCross2 But we add it to the shard per item
13:20 🔗 luckcolor I mean collections with Big files :)
13:20 🔗 luckcolor *meant
13:20 🔗 luckcolor HCross2: ?
13:22 🔗 HCross2 Each shard has "items" in it, really not sure how we cut it up. I am probably mistaken about that though. Someone better can probably advise
13:22 🔗 db48x with shell scripting
13:23 🔗 db48x I wrote one
13:23 🔗 db48x which I've been testing
13:24 🔗 db48x SketchPho: since archivebot is sooo big I looked at archiveteam_fire instead
13:24 🔗 db48x 400k files, so it could be four shards
13:24 🔗 db48x (easy, just split -n1/4 ...)
13:25 🔗 db48x but I checked the dates, and 2011-2015 is 100k files and 4.3tb, so I figure I'll do it that way
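Splitting a file list into a fixed number of pieces, as suggested for the 400k-file case, can be sketched in a few lines. This is an illustration only: `split_lines` is a hypothetical helper, and it divides by line count, which is not necessarily how the project's scripts or GNU split (which has its own line-based chunk mode, `-n l/N`) decide the boundaries.

```python
def split_lines(lines, n):
    """Split lines into n contiguous pieces whose lengths differ by at most one."""
    base, extra = divmod(len(lines), n)
    pieces, start = [], 0
    for i in range(n):
        # The first `extra` pieces get one additional line each.
        end = start + base + (1 if i < extra else 0)
        pieces.append(lines[start:end])
        start = end
    return pieces


# Hypothetical list of ten file entries split into four shards.
entries = [f"file-{i}" for i in range(10)]
for number, piece in enumerate(split_lines(entries, 4)):
    print(number, len(piece))
```

Splitting by date, as chosen in the log for archiveteam_fire, would replace the index arithmetic with a filter on each entry's date.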
13:28 🔗 db48x HCross2: https://gist.github.com/db48x/a1a8847916ab149abbfce25517944bdc
13:28 🔗 db48x I'll check it in to IA.BAK as well
13:28 🔗 luckcolor Isn't it better to have smaller shards?
13:28 🔗 SketchPho I trust your judgement
13:29 🔗 db48x luckcolor: total file size is less important than number of files
13:30 🔗 db48x because git gradually slows down on repositories with more files in them
13:30 🔗 luckcolor Ok right
13:31 🔗 db48x whereas the files are split up across multiple contributors
13:32 🔗 db48x annoyingly, archiveteam_fire for 2016 is 300k files but only .6tb :)
13:36 🔗 HCross2 db48x: will you be around in 5 and a half hours or so? I want to get started on shard making and want someone to show me the ropes
13:39 🔗 HCross2 4 and a bit I mean
13:39 🔗 db48x I'll probably be asleep
13:40 🔗 db48x for about the next 8 hours
13:42 🔗 HCross2 Ok
13:44 🔗 db48x it's not very hard
13:45 🔗 db48x use ia-mine (https://github.com/jjjake/iamine) to get the list of items in a collection, and the metadata for each of those items
13:46 🔗 db48x you can make a shard out of small collections, or split up a large collection to make several shards; either way you end up with a list of items
13:46 🔗 db48x then use jq to parse the metadata (which is one json object per item), and output a tsv file
13:46 🔗 db48x then feed the tsv file to a slightly-modified version of the mkSHARD script
13:48 🔗 db48x https://gist.github.com/db48x/a1a8847916ab149abbfce25517944bdc
13:48 🔗 HCross2 Awesome
13:49 🔗 db48x I'll get these checked into the IA.BAK repository at some point; probably after I've slept
13:50 🔗 db48x we used to do it slightly differently; ia-mine didn't exist, but we had a single huge json dump of every single public item on IA called the census
13:50 🔗 HCross2 Cool. What's the jq command for it please
13:50 🔗 db48x jq -r -f file.jq input.json > files.txt
13:51 🔗 HCross2 Thanks :)
13:51 🔗 db48x if you look on the server, you can see the huge TSV that we created from the huge json dump in ~joey/IA.BAK
13:52 🔗 db48x ok, I'm about to hit the normal-sized red button
13:54 🔗 db48x ok, I think that worked
13:55 🔗 db48x anyone want to try out SHARD12 before I hit the big red button?
14:03 🔗 db48x well, I guess I can hit the button without testing it
14:03 🔗 db48x what's the worst that could happen?
14:06 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
14:21 🔗 Whopper has joined #internetarchive.bak
14:38 🔗 db48x merge: refs/remotes/origin/synced/master - not something we can merge
14:38 🔗 db48x failed
14:38 🔗 db48x (merging origin/git-annex into git-annex...)
14:38 🔗 db48x (recording state in git...)
14:38 🔗 db48x that's not quite what I expected
14:40 🔗 db48x ok, when I rerun iabak it works fine
14:41 🔗 db48x so I guess it's just a bug that happens to the first person to try out a new shard
14:41 🔗 Atom has joined #internetarchive.bak
14:49 🔗 db48x time for me to sleep
15:05 🔗 Jon has joined #internetarchive.bak
15:40 🔗 DFJustin has joined #internetarchive.bak
15:49 🔗 * closure waves
15:50 🔗 kurt o/
15:52 🔗 atomotic has joined #internetarchive.bak
15:54 🔗 db48x closure: howdy
15:58 🔗 closure sounds like you guys are making headway. I got a slack invite, but would rather avoid slack, so I'll be over here
16:00 🔗 db48x yes, managed to add a shard
16:00 🔗 db48x I chickened out and didn't add a bunch of shards all at once though
16:03 🔗 closure you asked about items in multiple collections. IIRC the list I generated picked an arbitrary collection for such items, so they only go into one
16:03 🔗 db48x yea, I did the same
16:06 🔗 db48x biggest development is that there's no census any more
16:06 🔗 db48x but there is ia-mine, which is handy
16:07 🔗 closure db48x: propellor pull request> hasGroup takes a User and a Group, so hasUser should not deconstruct the parameters
16:07 🔗 closure simplest implementation: hasUser = flip hasGroup
16:08 🔗 closure no census anymore? What is ia-mine?
16:09 🔗 closure ah, I guess it finds the items in a collection
16:09 🔗 db48x ia-mine fetches the same metadata json that we had in the census, but for arbitrary searches, from the command-line
16:09 🔗 db48x ah, I didn't know about flip
16:09 🔗 db48x (though I figured there would be something)
16:15 🔗 db48x bah
16:15 🔗 db48x I couldn't guess the type signature of flip well enough for hoogle to find it
16:15 🔗 db48x though it's obvious now that I look at it
16:25 🔗 Jon pulling a shard onto my first donated 1T; I have a second 1T but not contiguous, might need to run a second IA.BAK instance? not sure
16:52 🔗 Lord_Nigh has quit IRC (Ping timeout: 250 seconds)
16:52 🔗 Lord_Nigh has joined #internetarchive.bak
16:58 🔗 Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
17:05 🔗 Lord_Nigh has joined #internetarchive.bak
17:30 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
17:51 🔗 Kaz right
17:51 🔗 Kaz I'm on the server, how do I master the shards
17:52 🔗 closure db48x: respin the patch?
17:52 🔗 closure Kaz: did you see my wiki page on how to do it?
17:53 🔗 Kaz I did not. that might be a place to start
17:54 🔗 closure http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/admin
17:56 🔗 Kaz thanks, taking a look now
18:17 🔗 HCross has joined #internetarchive.bak
18:41 🔗 sep332 has joined #internetarchive.bak
18:46 🔗 kyan has joined #internetarchive.bak
18:59 🔗 Frogging so are these the instructions I should follow if I want to donate space? anything else? http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation
19:03 🔗 sep332 Frogging: the README is very helpful too, especially "tuning resource usage" part
19:03 🔗 Frogging yah I'm looking at that
19:04 🔗 Frogging wondering if maybe I should use a filesystem quota or sub-image
19:05 🔗 sep332 we've been trying to keep it simple but that certainly wouldn't hurt
19:06 🔗 Frogging it's just the diskreserve option is for how much space to _not_ use. just slightly complicates allocation on my array. it's only a minor issue though really
19:08 🔗 HCross closure, getting a ton of "server refused our key" now
19:25 🔗 Kaz right
19:26 🔗 Kaz so one of the issues I see is that we can't actually update the iabak repo to point to the new shards
19:26 🔗 Kaz uh, setup a new shard repo even
19:27 🔗 HCross ignore my error, cant spell my own name this evening
19:27 🔗 Kaz reading the wrong bit
19:29 🔗 HCross looking at the examples, I see "10191 jstor_jamerinstcrimlaw", how do I get that first number?
19:30 🔗 HCross also, I dont see a ./mkSHARD file
19:30 🔗 HCross done the clone
19:30 🔗 Kaz HCross: see /usr/local/IA.BAK
19:31 🔗 Kaz HCross: and/or checkout server, rather than master
19:31 🔗 HCross ah, I checked out /master
19:35 🔗 HCross Kaz,
19:35 🔗 HCross hcross@ia-bak:~/IA.BAK$ git checkout -b server master/server
19:35 🔗 HCross fatal: Cannot update paths and switch to branch 'server' at the same time.
19:35 🔗 HCross Did you intend to checkout 'master/server' which can not be resolved as commit?
19:35 🔗 HCross can you please tell me where I am being a clot?
19:35 🔗 Kaz git checkout server
19:36 🔗 HCross thanks
19:36 🔗 Kaz -b makes new branch, which you don't want because it already exists
19:36 🔗 Kaz or at least, afaict
19:37 🔗 HCross going to create shard 14, based on https://archive.org/details/newspapers
19:38 🔗 Kaz 13?
19:39 🔗 HCross yours
19:39 🔗 Kaz I guess one thing to think of, is if items are added to a collection, how are these reflected in the iabak version?
19:43 🔗 HCross http://paste.nerds.io/uvaqeteyah.md im now slighly confused about why that went wrong
19:44 🔗 Kaz have you got the md5_collection_url.txt.pick1.sorted.uniq link?
19:44 🔗 HCross yes
19:45 🔗 Kaz it 'sort of works' if you do it from /usr/local/IA.BAK, but have no perms to create the SHARD14.list
19:48 🔗 HCross yea, we need a way of doing it in our directories
19:52 🔗 Kaz wait hang on
19:55 🔗 Kaz HCross: mkSHARD in /usr/local/IA.BAK is different to the one in the repo
19:56 🔗 Kaz this explains some things
19:57 🔗 HCross that one seems to be working
19:57 🔗 HCross its doing thinking about it now
19:57 🔗 HCross ive cp'ed it into my folder and renamed it
20:00 🔗 HCross 228575 SHARD14.list
20:00 🔗 HCross may be a tad high
20:02 🔗 HCross or not hang on
20:03 🔗 db48x yea, don't go over 100k
20:03 🔗 HCross thats the file size I think though, its still generating. My calculation is 60k
20:03 🔗 Kaz db48x: could you push the updated mkshard to the repo please? just in case there are other changes that should be pulled through too
20:03 🔗 db48x although the wiki page says to use that file, it's an out of date index to the archive
20:04 🔗 db48x probably better to use ia-mine to generate a more up-to-date index of the collection you're interested in
20:04 🔗 HCross db48x, if you show me how, I can get us a new one
20:04 🔗 HCross should we do a whole archive one?
20:04 🔗 db48x nah, takes ages
20:05 🔗 db48x ia-mine --secure -c -s "collection:${1}" --itemlist
20:05 🔗 HCross how long is "ages"
20:05 🔗 db48x a life-age of the earth
20:05 🔗 db48x seriously, the IA has 20PB of stuff
20:06 🔗 db48x takes ages, but if you just do one collection it only takes a minute or so
20:06 🔗 db48x see the split-collection.sh script that I checked in
20:06 🔗 db48x it first gets a list of identifiers for the items in the collection, then gets the json metadata for each item
20:07 🔗 db48x then you use jq to convert the json into a tsv of just the four things we need in order to make the git-annex repository
20:08 🔗 db48x you can adjust that basic recipe to suit the needs of the moment
20:08 🔗 HCross about to kick off several shards with archivebot.... db48x is that OK?
20:08 🔗 db48x probably :)
20:09 🔗 db48x how are you splitting them between shards?
20:09 🔗 HCross using your splitting script
20:10 🔗 db48x ok. just be aware that the archivebot collection is a smaller number of huge items, so 100k files is a bad way to split it :)
20:10 🔗 HCross best idea to split it?
20:10 🔗 db48x I'm not sure
20:10 🔗 db48x probably into about 50 shards of 2-3TB
20:11 🔗 HCross archivebot is 1.8k items.
20:12 🔗 HCross assuming each pack is 50GB, its 5 items per pack
20:12 🔗 db48x sounds right, but don't assume
20:12 🔗 HCross yea, ill check now
20:12 🔗 db48x that jq script dumps out the file sizes too, so dice it up and measure the size of all the pieces :)
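One way to "dice it up and measure the size of all the pieces" is to walk the item list in order and start a new shard whenever the next item would push the current one past a size or file-count cap. The sketch below is an assumption about how that could be done, not the project's split-collection script; `pack_shards` and the sample numbers are hypothetical, with the 100k-file cap and the 50GB item size taken from the discussion.

```python
def pack_shards(items, max_bytes, max_files):
    """Group (identifier, total_bytes, file_count) tuples into shards.

    Items stay whole and in order; a new shard starts when adding the next
    item would exceed either cap. A single oversized item still gets a
    shard of its own.
    """
    shards, current, cur_bytes, cur_files = [], [], 0, 0
    for ident, size, nfiles in items:
        over = cur_bytes + size > max_bytes or cur_files + nfiles > max_files
        if current and over:
            shards.append(current)
            current, cur_bytes, cur_files = [], 0, 0
        current.append(ident)
        cur_bytes += size
        cur_files += nfiles
    if current:
        shards.append(current)
    return shards


GB = 10**9
# Hypothetical: ten items of 50GB with three files each, capped at 200GB.
items = [(f"item-{i}", 50 * GB, 3) for i in range(10)]
for shard in pack_shards(items, max_bytes=200 * GB, max_files=100_000):
    print(len(shard), shard[0], "...", shard[-1])
```

Measuring per-shard totals afterwards is then a matter of summing the sizes of the identifiers in each list.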
20:13 🔗 Kaz so just to be clear
20:13 🔗 Kaz 1) run split-collection on huge collection
20:13 🔗 Kaz 2) do some jq magic
20:13 🔗 Kaz 3) make shard based on each list that gets pumped out by 1&2?
20:14 🔗 db48x yes
20:14 🔗 db48x but split-collection is just one possible way to split up a collection
20:14 🔗 db48x you could do it by date
20:14 🔗 Kaz right
20:15 🔗 HCross db48x, can we have nano please?
20:15 🔗 db48x as I did for archiveteam_fire, or by size, for archivebot, or by some clever means I haven't thought up
20:15 🔗 db48x sure
20:15 🔗 db48x emacs is on there, as is vim, but go ahead and install it
20:15 🔗 Frogging how does one get things "out" of the backup, should the need arise?
20:16 🔗 Kaz how does one go about running the jq script? -bash: jq: command not found
20:16 🔗 db48x while it's installing, go to /usr/local/propellor and edit joeyconfig.hs so that the iabak machine defined in there includes the package as well
20:16 🔗 Kaz or do I need to install for myself
20:16 🔗 HCross db48x, not root/cant sudo
20:16 🔗 db48x I had zero luck installing jq on iabak, sorry
20:16 🔗 db48x I had to do that part on my own machine
20:16 🔗 Kaz ah, okay
20:17 🔗 HCross I may do all the mining closer to myself then, and push over after
20:17 🔗 db48x but you're welcome to show me up :)
20:17 🔗 HCross I'd rather not try and upload lots of small files from London > Singapore
20:17 🔗 db48x oh, I never actually added you guys to /etc/sudoers
20:18 🔗 HCross Kaz, if I get a VPS somewhere near SG (aka LA), would you want to try and sort jq on that?
20:18 🔗 HCross then we can use that
20:19 🔗 Kaz could give it a shot, let me see if I can get it running locally first
20:20 🔗 db48x or just get it installed on the server
20:20 🔗 db48x I gave up because it was 4am
20:21 🔗 Kaz uh
20:21 🔗 Kaz am I missing something or am I supposed to be expecting something more than apt-get install jq
20:22 🔗 db48x dunno
20:22 🔗 db48x does it work?
20:22 🔗 Kaz yeah
20:22 🔗 Kaz well, it installed
20:22 🔗 db48x nice :)
20:22 🔗 Kaz will test functionality in a sec
20:26 🔗 HCross struggling to get iamine to go as well
20:26 🔗 db48x closure: I updated my pull request
20:27 🔗 db48x speaking of which, you guys should peruse it as well: https://github.com/joeyh/propellor/pull/17
20:29 🔗 HCross it doesnt half take a time to create a shard
20:29 🔗 HCross my first one is still computing
20:29 🔗 db48x indeed
20:30 🔗 Kaz HCross: do you have iamine working?
20:30 🔗 HCross I dont, it needs someones IA credentials
20:30 🔗 db48x it needs your own
20:31 🔗 HCross ah, per user
20:31 🔗 db48x just uses a username and password to get an auth key
20:31 🔗 db48x yea
20:31 🔗 db48x s/key/token/
20:32 🔗 Kaz iamine needs py3, right
20:34 🔗 db48x indeed
20:34 🔗 db48x https://github.com/joeyh/propellor/pull/17/commits/1d689b1e4ce1f5eeedab140bd3c330484a928586
20:35 🔗 yipdw Frogging: each remote location is a git-annex remote; to restore shard contents, you pull from remotes
20:36 🔗 Frogging ooh, okay
20:36 🔗 yipdw https://git-annex.branchable.com/tips/offline_archive_drives/ and https://git-annex.branchable.com/location_tracking/
20:36 🔗 db48x Frogging: I'm sorry, I completely forgot to answer your question!
20:39 🔗 HCross db48x, split-collection: line 8: syntax error near unexpected token `('
20:39 🔗 HCross on your script
20:41 🔗 Kaz iamine is not playing nicely for me
20:42 🔗 db48x HCross: odd
20:42 🔗 db48x line 8 is just lines=$(wc -l "${itemfile}" | cut -d ' ' -f 1)
20:43 🔗 HCross not on https://raw.githubusercontent.com/ArchiveTeam/IA.BAK/server/split-collection
20:44 🔗 db48x oooh
20:44 🔗 Kaz yeah, the 'new' mkshard also isn't on the repo
20:45 🔗 atomotic has joined #internetarchive.bak
20:47 🔗 db48x Kaz: I committed it: https://github.com/ArchiveTeam/IA.BAK/commit/5b457779b2ffd9fb1342671a3dbc1cd73edcd14e#diff-85b50cc2f5b54f1254ecdcd0fec1959d
20:47 🔗 db48x just pushed the fixed split-collection script
20:48 🔗 Kaz https://github.com/ArchiveTeam/IA.BAK/blob/server/mkSHARD doesn't match /usr/local/IA.BAK/mkSHARD, which one should we be using?
20:49 🔗 db48x well, let's look at the diff
20:50 🔗 HCross db48x, http://paste.nerds.io/ujacucekid.vhdl what am I doing wrong?
20:50 🔗 db48x Kaz: yea, that's the old version
20:51 🔗 db48x there, I did a git pull
20:52 🔗 Kaz okay, thanks
20:52 🔗 db48x ++ basename 'archivebot-files/-*.json' .json
20:52 🔗 db48x error: "archivebot-meta-*.json" should be readable
20:54 🔗 db48x it's in quotes, so it didn't expand the glob
20:54 🔗 HCross but its made a load of shards
20:55 🔗 db48x naturally
20:55 🔗 HCross ah
20:55 🔗 db48x that's why the script has a for loop, to work on each one independently
20:56 🔗 db48x but if the filename is quoted, then the loop only runs once, with f equal to "archivebot-meta-*.json" rather than with f equal to "archivebot-meta-0.json", then equal to "archivebot-meta-1.json", etc
20:57 🔗 HCross looks like ive got 357 shards of just archivebot
20:58 🔗 HCross hmm, thats not right
20:58 🔗 db48x that is a lot of shards
20:59 🔗 HCross yea, I messed up
21:00 🔗 HCross 34, that looks better
21:00 🔗 db48x :)
21:01 🔗 HCross db48x, also. annexed files in working tree: 228575
21:01 🔗 HCross - but from the info on archive.org for each collection there are only 68k files
21:01 🔗 db48x HCross: what collection are you looking at?
21:02 🔗 HCross The_Sydney_Morning_Herald svoboda_newspaper antiochnews The_Notre_Dame_Scholastic NCAA-News dailyracingform
21:04 🔗 db48x 23856 + 23183 + 5305 + 2206 + 857 + 10956 = ~66k items
21:04 🔗 db48x but yea, each item has multiple files
21:05 🔗 HCross ahh, its a file thing
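The gap between the two numbers is items versus files, and the arithmetic is quick to check. The figures below are the ones quoted in the log; the variable names are only for illustration.

```python
# Item counts quoted for the six newspaper collections.
item_counts = [23856, 23183, 5305, 2206, 857, 10956]
total_items = sum(item_counts)

# "annexed files in working tree" reported for the same shard.
annexed_files = 228575

files_per_item = annexed_files / total_items
print(total_items, round(files_per_item, 2))
```

So the shard holds roughly 66k items averaging a little over three files each, which accounts for the larger file count.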
21:05 🔗 HCross sorry about my cockups this evening
21:05 🔗 db48x indeed
21:07 🔗 db48x no worries
21:08 🔗 db48x I'd never done it before until last night either, I just didn't have anyone to talk to about the mistakes I made :)
21:12 🔗 HCross db48x, so I now have http://harrycross.me/dae.png
21:12 🔗 HCross - what do I do on them next? Tried running get_item_files.jq and I get a compile error
21:12 🔗 db48x show me the command and the error?
21:14 🔗 HCross jq get_item_files.jq
21:14 🔗 HCross error: get_item_files is not defined
21:14 🔗 HCross get_item_files.jq1 compile error
21:14 🔗 HCross no matter what file I add on the end of the command
21:17 🔗 db48x oh
21:17 🔗 db48x if you want it to run commands from a file, you have to use the -f option
21:17 🔗 db48x jq -f somefile.jq
21:18 🔗 db48x and since this script produces an array of strings that we want to treat as lines of a file, you also need the -r option
21:18 🔗 db48x so it becomes jq -r -f get_item_files.jq archivebot-meta-00.txt
21:19 🔗 db48x oh, and then you want to redirect the output, so tack on > archivebot-meta-00.tsv to the end
21:21 🔗 HCross parse error: Invalid numeric literal at line 2, column 0
21:22 🔗 db48x hrm
21:22 🔗 db48x what's line 2 of your input file look like?
21:23 🔗 db48x some horrible json object, I hope
21:24 🔗 HCross http://harrycross.me/123.png
21:24 🔗 HCross is archivebot-meta-00.txt
21:25 🔗 db48x ah, those are the ids, not the metadata
21:25 🔗 db48x split-collection should run ia-mine again using those as input
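A jq parse error like the one above is what a list of bare identifiers tends to produce, since identifiers are not valid JSON. A small check before running jq can tell the two kinds of file apart. This is a hedged sketch: `looks_like_json_lines` is a hypothetical helper, and it assumes the metadata file holds one JSON object per line, as described earlier in the log.

```python
import json


def looks_like_json_lines(lines):
    """Return True if every non-blank line parses as a JSON object."""
    seen_any = False
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            # A bare identifier such as an item name is not valid JSON.
            return False
        if not isinstance(value, dict):
            return False
        seen_any = True
    return seen_any


# Hypothetical file contents: a list of ids versus per-item metadata.
id_lines = ["archiveteam_archivebot_go_20161110020001", "another_item_id"]
meta_lines = ['{"metadata": {"identifier": "example_item"}, "files": []}']
print(looks_like_json_lines(id_lines), looks_like_json_lines(meta_lines))
```

Checking the first few lines with `head` or `less`, as suggested later in the log, serves the same purpose by eye.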
21:25 🔗 HCross it did something, then spat out the IA mine readme
21:26 🔗 db48x heh
21:27 🔗 HCross db48x, http://paste.nerds.io/fomihecuma.vhdl full log of what it did
21:27 🔗 HCross TO MAKE THOSE
21:27 🔗 HCross to make those files
21:30 🔗 db48x yea, you didn't fix the error on line 13
21:30 🔗 db48x I guess you edited split-collection, putting quotes around the glob of the for loop?
21:31 🔗 HCross db48x, http://paste.nerds.io/dokefiboxu.pl is that file, only edit is line 9
21:32 🔗 db48x http://paste.nerds.io/dokefiboxu.sh
21:32 🔗 db48x there, at least the colors are better now :)
21:33 🔗 db48x your previous log clearly shows the quotes
21:33 🔗 db48x maybe you edited it back after it failed, but forgot to rerun it? gremlins maybe?
21:33 🔗 Frogging sorry if the question is stupid, but why the sudden interest in archiving the Archive? the election was mentioned but I didn't really understand the connection
21:37 🔗 db48x Frogging: we've been doing this for a while now, but we've let it slide a bit
21:38 🔗 Frogging yeah, I know. it just seemed to get a sudden bump to top priority on Tuesday
21:39 🔗 db48x could be because previous attempts to get it going again didn't really accomplish much?
21:40 🔗 Frogging maybe, though the implication was that it's now more urgent because of the election
21:40 🔗 Frogging I think it's great that it's getting more attention, I didn't understand that though
21:40 🔗 db48x hmm. no idea about that
21:40 🔗 db48x I suppose it's possible, but it didn't seem that way to me
21:41 🔗 HCross db48x, can you tell me what line 13 should be, not quite getting this (am a real newbie with this kind of thing), sorry
21:41 🔗 Kksmkrn has joined #internetarchive.bak
21:42 🔗 db48x HCross: just "done"
21:42 🔗 db48x oh, in the log
21:42 🔗 db48x not the 13th line of the script
21:42 🔗 db48x it shouldn't be an error
21:43 🔗 db48x let me run it and show you a log of when it works
21:43 🔗 HCross ok, thanks
21:43 🔗 kyan has quit IRC (Read error: Operation timed out)
21:44 🔗 Jon yay 21G down of the 1T allocated so far
21:45 🔗 db48x nice
21:48 🔗 db48x bah: {"params": {"rows": 50, "q": "collection:archiveteam-fire", "page": 397, "output": "json"}, "url": "https://archive.org/advancedsearch.php", "retries_left": 0, "message": "Maximum retries exceeded for url, giving up."}
21:51 🔗 Start_ has joined #internetarchive.bak
21:52 🔗 Start has quit IRC (Read error: Connection reset by peer)
21:52 🔗 db48x HCross: do you see a similar stream of error messages when you run ia-mine?
21:52 🔗 HCross nope, just the IA mine readme
21:52 🔗 db48x ah, it only happens on that collection
21:53 🔗 db48x I'll look into that later, back to helping you
21:53 🔗 HCross thanks
21:54 🔗 db48x oooh, lol
21:55 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
21:55 🔗 db48x it's because I am dumb
21:58 🔗 wp494 has quit IRC (Read error: Connection reset by peer)
22:02 🔗 db48x http://paste.nerds.io/ubuhujimur.sh
22:02 🔗 db48x HCross: I pushed a fix
22:03 🔗 db48x basically, I made some changes to split-collection, then had the bright idea to split archiveteam-fire by date, and didn't actually run it again with my changes
22:04 🔗 db48x the quotes in your log were a red herring
22:04 🔗 db48x I was forgetting that this is bash we're dealing with
22:04 🔗 HCross i just ordered a time4vps for storage for this, and as soon as I paid, their portal crashed
22:05 🔗 db48x heh
22:08 🔗 HCross that looks more like what it should be doing db48x - its also taking more time
22:08 🔗 db48x yay
22:09 🔗 wp494 has joined #internetarchive.bak
22:09 🔗 HCross db48x, is https://billing.time4vps.eu down/showing a cloudflare live thingy for you?
22:11 🔗 db48x it's showing nothing but a spinning spinner so far
22:11 🔗 db48x ah, indeed it is
22:12 🔗 HCross rip, just fed them 20 eur
22:12 🔗 Kksmkrn Loading for me
22:13 🔗 Kksmkrn Nevermind, I should drink more, talk less.. live thingy
22:14 🔗 HCross db48x, it worked this time, have a nice JSON car-crash here
22:14 🔗 HCross cat'ing that first file was a BAD idea
22:14 🔗 db48x :D
22:14 🔗 db48x use less instead
22:14 🔗 db48x or head
22:15 🔗 Senji hexdump -C
22:16 🔗 HCross especially on a server 200ms away. Also, jq is still cocking up
22:16 🔗 HCross same parse error: Invalid numeric literal at line 2, column 0
22:22 🔗 kyan has joined #internetarchive.bak
22:35 🔗 Kksmkrn has left zZzZ..
22:49 🔗 db48x HCross: sounds like you fed it another file containing ids instead of json :)
23:01 🔗 HCross db48x, I now have .tsv files full of JSON and .json files full of lists of IDs
23:03 🔗 HCross db48x, I am going to head to bed, ill sort this out tomorrow. Goodnight
23:20 🔗 SketchPho Looks like good work today.
23:20 🔗 SketchPho Do I need to help in any way?
23:24 🔗 db48x HCross: awesome! :D
23:25 🔗 db48x SketchPho: I think we're making progress
23:25 🔗 SketchPho Agreed. (Checked scrollback)
23:26 🔗 db48x it's not really repeatable, quality-controlled, iso9000 certified progress, but it's not too bad :)
23:26 🔗 HCross2 I'll sort out the rest of it later tomorrow. It should be easier from here. Its now just converting those files to shards
23:27 🔗 HCross2 I've also got 4tb downloading
23:28 🔗 db48x I had to stop after 15gb of shard12
