[00:02] *** sep332 has quit IRC (konversation out)
[01:08] *** kyan has joined #internetarchive.bak
[01:24] *** Lord_Nigh has quit IRC (Ping timeout: 633 seconds)
[01:25] *** Lord_Nigh has joined #internetarchive.bak
[02:01] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[02:08] *** Lord_Nigh has joined #internetarchive.bak
[02:26] *** Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
[02:28] *** Lord_Nigh has joined #internetarchive.bak
[03:08] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[03:31] *** Lord_Nigh has joined #internetarchive.bak
[04:26] *** Start has quit IRC (Quit: Disconnected.)
[04:28] *** Start has joined #internetarchive.bak
[04:50] No action?
[04:50] I need to tap closure. Is there other stuff people have questions on?
[04:51] I'm not kidding when I say this is day-in, day-out first priority
[04:51] do we have a list of things to put in a new shard? I can run the scripts
[04:53] actually, I take that back. I can't log into the server
[04:53] oh, nvm
[04:54] I was doing it wrong
[05:04] gotta grab the latest census
[05:07] ok, confusion
[05:07] archiveteam_census_2016 doesn't have what I expected; it's just a list of identifiers with none of the rest of the information
[05:08] *** Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
[05:14] *** Lord_Nigh has joined #internetarchive.bak
[05:24] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[05:27] *** Lord_Nigh has joined #internetarchive.bak
[05:30] I'd like us to turn to archivebot and archiveteam items for new shards
[05:30] sure
[05:30] do we have any that aren't made of 50GB warcs?
[05:51] db48x: sorry, yes, no metadata in there yet
[05:52] ah
[05:52] :)
[05:52] I've started ia-mine running on the archivebot collection to get metadata
[05:59] bwn: also use http://archive.fart.website/archivebot/viewer/items/
[05:59] that may be a more complete index of archivebot materials, as it includes a large number of items that are not in the archivebot collection
[06:08] Best person to make my SSH key available to?
[06:16] yipdw: looks like the items you're talking about got added to the archivebot collection at some point
[06:17] ia gave me archiveteam_archivebot_go_20161110020001 but it hasn't been added to the viewer yet
[06:17] some of them did
[06:17] yeah, the viewer updates every 24 hours
[06:17] items are created every, uh, 3 or so
[06:17] HCross2: you can just paste it in here and I'll add it to the server. ed25519 keys are preferred
[06:17] depending on upload speed from fos
[06:17] *** kyan has quit IRC (Quit: Leaving)
[06:18] https://www.irccloud.com/pastebin/BPtozFKj
[06:19] db48x:
[06:19] strongly preferred :)
[06:20] Ah ok. I'll have to redo it
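
(An aside for anyone following along: the ed25519 keys preferred above can be generated with stock OpenSSH. A minimal sketch, assuming OpenSSH 6.5 or newer; the comment string and output path are just the defaults.)

    # generate an ed25519 keypair; -C attaches a human-readable comment
    ssh-keygen -t ed25519 -C "IABAK key - yourname"
    # then paste ~/.ssh/id_ed25519.pub (the public half) into the channel
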
[06:26] Is it normal to see "verification of content failed" and "Unable to access these remotes: web" often? (Happens for approximately 1 file out of 5)
[06:26] Deewiant: sometimes
[06:26] even when an item is hidden on IA it can still be mentioned in the backup, but nobody will be able to download it
[06:36] yipdw: unless I'm making a mistake, it looks like the viewer matches up with `ia search collection:archivebot`
[06:36] ah ok cool
[06:39] :)
[06:39] db48x: that metadata is finished, I also created a sorted list of the archivebot item sizes
[06:40] where would be a good place to put this for anyone who needs it? I could create an IA item for now
[06:41] that's a nice meta way to do it
[06:41] or just serve it up the old-fashioned way and I'll download it from you
[06:52] db48x: Ok, it just looks a bit worrying to have it show up so often. Seems to mostly (or only?) affect _archive.torrent and _meta.xml files though.
[06:52] db48x: http://erebos.undo.it/MIRROR/db48x/
[06:55] Deewiant: ah, that's actually a slightly different issue; we don't know the correct hash of those files so they can't be verified
[06:56] bwn: I'm taking a look now
[06:57] I'm writing up a readme so you know what's what
[06:58] hmm
[06:59] the script we have reads from a file with the md5 hash, the size, the category and the file url
[07:04] ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIP5OhU2Lita9RdjPkX9N0w9wZnmVlednUDEx24bVn4Mk IABAK key - Harry C
[07:04] db48x: ^
[07:05] TIL: You have to use the development version of PuTTYgen for that
[07:05] get_item_files.jq outputs a partial url collection/filename for each file
[07:06] HCross2: great, what username do you want?
[07:06] HCross
[07:10] that get_item_files.jq can be modified to get the format you need with the jq filters from https://archive.org/details/ia-bak-census_20150304
[07:10] ^ can likely be modified
[07:11] HCross2: ok, you should be able to log in to iabak.archiveteam.org, username hcross
[07:11] Would it be a good idea to run another IA census now?
[07:13] *** Atom has quit IRC (Read error: Connection reset by peer)
[07:14] db48x: thanks, am in
[07:14] excellent
[07:15] I need to leave for work now, but I'll have a little play around when I get into the office
[07:15] oh, you know what? it just occurred to me
[07:15] ok
[07:15] there's a minor wrinkle that I'll straighten out shortly
[07:17] closure's way of setting machines up is better than mine, but requires remembering to do things better
[07:28] hrm
[07:28] I can't remember how we handled items that are in multiple collections
[07:29] Also, how are we going to handle large collections?
[07:30] split them across multiple shards
[07:31] .files |
[07:31] map(
[07:31] select(.source != "derivative") |
[07:31] # if case for catching files with size=null (i.e. files.xml).
[07:31] if .size != null then
[07:31] {"name": .name, "size": (.size | tonumber), "collection": $c, "md5": .md5}
[07:31] else
[07:31] {"name": .name, "size": 0, "collection": $c, "md5": .md5}
[07:31] end
[07:32] ) |
[07:32] map([.md5, .size, .collection[0], .name]) | map(@tsv) | .[]
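
(The filter pasted above at 07:31, reassembled into one runnable file. This is a sketch, not necessarily the repo's exact get_item_files.jq: it assumes item metadata JSON as fetched by ia-mine, and it derives the collection array from the metadata itself instead of taking it as a $c argument.)

    # get_item_files.jq (sketch): one "md5 TAB size TAB collection TAB name"
    # row per non-derivative file in an item
    ([.metadata.collection] | flatten) as $c |
    .files |
    map(select(.source != "derivative") |
        # files.xml has size == null; treat it as 0
        {"name": .name,
         "size": ((.size // "0") | tonumber),
         "collection": $c,
         "md5": (.md5 // "")}) |
    map([.md5, .size, .collection[0], .name]) | map(@tsv) | .[]

Run as: jq -r -f get_item_files.jq item-metadata.json > files.tsv
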
[07:33] bwn: is every single item in this dataset in the archivebot collection?
[07:33] Ah. So for example, I would like to back up https://archive.org/details/archiveteam_newssites which is going up by at least half a terabyte a day
[07:36] yea, I guess they are
[07:38] yes, from `ia search collection:archivebot`
[07:39] I need to eat and get some sleep soon, hope that stuff helps :)
[07:40] it does
[07:44] archivebot and archiveteam_newssites are both going to have to end up split across a bunch of shards
[08:02] Easiest way of splitting them?
[08:04] db48x: can you drop my key in too please?
[08:04] only here for a few minutes before I run off to work, but have key
[08:12] ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIHhFYMd9Htlf9wPZzIDyqbYYNwuo3m+kWQ9/pfAD/TE9 Kaz IABAK
[08:12] ^ if you're around, gotta run now
[08:12] Kaz: I will add you in
[08:13] awesome, thanks
[08:13] you will be kaz on the server
[08:16] *** jsp12345 has joined #internetarchive.bak
[08:43] * db48x sighs
[08:57] *** atomotic has joined #internetarchive.bak
[09:45] *** antomatic has quit IRC (Read error: Connection reset by peer)
[09:46] *** antomatic has joined #internetarchive.bak
[10:06] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[11:11] *** atomotic has joined #internetarchive.bak
[11:26] *** atomotic has quit IRC (Remote host closed the connection)
[11:28] *** atomotic has joined #internetarchive.bak
[12:29] Please do not add news sites to this project quite yet
[12:30] I would want us to go after the ArchiveBot collection first, just because that is pure websites that were grabbed for various reasons
[12:31] This is also going to bring to bear our management of large numbers of shards, which we might as well deal with anyway
[12:32] I also understand that we will be forced to deal with the situation of a growing collection, which means that we might want to just focus on a cutoff date with ArchiveBot
[12:33] Perhaps it is time for me to begin discussions with data hoarders for space
[13:05] Ok. How do we cut up a large collection like archivebot?
[13:14] By timestamp or by data size I suppose
[13:17] Shards with big files could have a smaller number of files, unlike the normal ones
[13:17] But we add it to the shard per item
[13:20] I mean collections with Big files :)
[13:20] *meant
[13:20] HCross2: ?
[13:22] Each shard has "items" in it, really not sure how we cut it up. I am probably mistaken about that though. Someone better can probably advise
[13:22] with shell scripting
[13:23] I wrote one
[13:23] which I've been testing
[13:24] SketchPho: since archivebot is sooo big I looked at archiveteam_fire instead
[13:24] 400k files, so it could be four shards
[13:24] (easy, just split -n1/4 ...)
[13:25] but I checked the dates, and 2011-2015 is 100k files and 4.3TB, so I figure I'll do it that way
[13:28] HCross2: https://gist.github.com/db48x/a1a8847916ab149abbfce25517944bdc
[13:28] I'll check it in to IA.BAK as well
[13:28] Isn't it better to have smaller shards?
[13:28] I trust your judgement
[13:29] luckcolor: total file size is less important than number of files
[13:30] because git gradually slows down on repositories with more files in them
[13:30] Ok right
[13:31] whereas the files are split up across multiple contributors
[13:32] annoyingly, archiveteam_fire for 2016 is 300k files but only 0.6TB :)
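
(One way to do the size-based carving discussed above, as a sketch: it assumes a tab-separated "bytes, identifier" list like the sorted archivebot item-size list mentioned earlier; the 2.5TB target and the file names are illustrative.)

    # greedily pack items into roughly 2.5TB shard lists
    awk -F'\t' -v max=2500000000000 '
      BEGIN { shard = 0 }
      total + $1 > max && total > 0 { shard++; total = 0 }
      { total += $1; print $2 > ("SHARD-" shard ".list") }
    ' archivebot-item-sizes.tsv
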
[13:36] db48x: will you be around in 5 and a half hours or so? I want to get started on shard making and want someone to show me the ropes
[13:39] 4 and a bit I mean
[13:39] I'll probably be asleep
[13:40] for about the next 8 hours
[13:42] Ok
[13:44] it's not very hard
[13:45] use ia-mine (https://github.com/jjjake/iamine) to get the list of items in a collection, and the metadata for each of those items
[13:46] you can make a shard out of small collections, or split up a large collection to make several shards; either way you end up with a list of items
[13:46] then use jq to parse the metadata (which is one json object per item), and output a tsv file
[13:46] then feed the tsv file to a slightly-modified version of the mkSHARD script
[13:48] https://gist.github.com/db48x/a1a8847916ab149abbfce25517944bdc
[13:48] Awesome
[13:49] I'll get these checked into the IA.BAK repository at some point; probably after I've slept
[13:50] we used to do it slightly differently; ia-mine didn't exist, but we had a single huge json dump of every single public item on IA called the census
[13:50] Cool. What's the jq command for it please?
[13:50] jq -r -f file.jq input.json > files.txt
[13:51] Thanks :)
[13:51] if you look on the server, you can see the huge TSV that we created from the huge json dump in ~joey/IA.BAK
[13:52] ok, I'm about to hit the normal-sized red button
[13:54] ok, I think that worked
[13:55] anyone want to try out SHARD12 before I hit the big red button?
[14:03] well, I guess I can hit the button without testing it
[14:03] what's the worst that could happen?
[14:06] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[14:21] *** Whopper has joined #internetarchive.bak
[14:38] merge: refs/remotes/origin/synced/master - not something we can merge
[14:38] failed
[14:38] (merging origin/git-annex into git-annex...)
[14:38] (recording state in git...)
[14:38] that's not quite what I expected
[14:40] ok, when I rerun iabak it works fine
[14:41] so I guess it's just a bug that happens to the first person to try out a new shard
[14:41] *** Atom has joined #internetarchive.bak
[14:49] time for me to sleep
[15:05] *** Jon has joined #internetarchive.bak
[15:40] *** DFJustin has joined #internetarchive.bak
[15:49] * closure waves
[15:50] o/
[15:52] *** atomotic has joined #internetarchive.bak
[15:54] closure: howdy
[15:58] sounds like you guys are making headway. I got a slack invite, but would rather avoid slack, so I'll be over here
[16:00] yes, managed to add a shard
[16:00] I chickened out and didn't add a bunch of shards all at once though
[16:03] you asked about items in multiple collections. IIRC the list I generated picked an arbitrary collection for such items, so they only go into one
[16:03] yea, I did the same
[16:06] biggest development is that there's no census any more
[16:06] but there is ia-mine, which is handy
[16:07] db48x: propellor pull request> hasGroup takes a User and a Group, so hasUser should not deconstruct the parameters
[16:07] simplest implementation: hasUser = flip hasGroup
[16:08] no census anymore? What is ia-mine?
[16:09] ah, I guess it finds the items in a collection
[16:09] ia-mine fetches the same metadata json that we had in the census, but for arbitrary searches, from the command-line
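
(db48x's recipe from 13:45, condensed into one hedged end-to-end sketch. The collection and file names are illustrative, the second ia-mine call assumes it accepts an identifier file as input the way split-collection uses it, and mkSHARD's exact invocation lives in the repo's server branch.)

    # 1. list the identifiers in a collection
    ia-mine --secure -s 'collection:example_collection' --itemlist > ids.txt
    # 2. fetch the metadata JSON (one object per item) for those identifiers
    ia-mine --secure ids.txt > meta.json
    # 3. turn the JSON into the md5/size/collection/name TSV
    jq -r -f get_item_files.jq meta.json > shard.tsv
    # 4. feed the TSV to the mkSHARD script from the server branch
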
[16:09] ah, I didn't know about flip
[16:09] (though I figured there would be something)
[16:15] bah
[16:15] I couldn't guess the type signature of flip well enough for hoogle to find it
[16:15] though it's obvious now that I look at it
[16:25] pulling a shard onto my first donated 1T; I have a second 1T but not contiguous, might need to run a second IA.BAK instance? not sure
[16:52] *** Lord_Nigh has quit IRC (Ping timeout: 250 seconds)
[16:52] *** Lord_Nigh has joined #internetarchive.bak
[16:58] *** Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
[17:05] *** Lord_Nigh has joined #internetarchive.bak
[17:30] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[17:51] right
[17:51] I'm on the server, how do I master the shards
[17:52] db48x: respin the patch?
[17:52] Kaz: did you see my wiki page on how to do it?
[17:53] I did not. that might be a place to start
[17:54] http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/admin
[17:56] thanks, taking a look now
[18:17] *** HCross has joined #internetarchive.bak
[18:41] *** sep332 has joined #internetarchive.bak
[18:46] *** kyan has joined #internetarchive.bak
[18:59] so are these the instructions I should follow if I want to donate space? anything else? http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation
[19:03] Frogging: the README is very helpful too, especially the "tuning resource usage" part
[19:03] yah I'm looking at that
[19:04] wondering if maybe I should use a filesystem quota or sub-image
[19:05] we've been trying to keep it simple but that certainly wouldn't hurt
[19:06] it's just that the diskreserve option is for how much space to _not_ use. just slightly complicates allocation on my array. it's only a minor issue though really
[19:08] closure, getting a ton of "server refused our key" now
[19:25] right
[19:26] so one of the issues I see is that we can't actually update the iabak repo to point to the new shards
[19:26] uh, set up a new shard repo even
[19:27] ignore my error, can't spell my own name this evening
[19:27] reading the wrong bit
[19:29] looking at the examples, I see "10191 jstor_jamerinstcrimlaw", how do I get that first number?
[19:30] also, I don't see a ./mkSHARD file
[19:30] done the clone
[19:30] HCross: see /usr/local/IA.BAK
[19:31] HCross: and/or checkout server, rather than master
[19:31] ah, I checked out /master
[19:35] Kaz,
[19:35] hcross@ia-bak:~/IA.BAK$ git checkout -b server master/server
[19:35] fatal: Cannot update paths and switch to branch 'server' at the same time.
[19:35] Did you intend to checkout 'master/server' which can not be resolved as commit?
[19:35] can you please tell me where I am being a clot?
[19:35] git checkout server
[19:36] thanks
[19:36] -b makes a new branch, which you don't want because it already exists
[19:36] or at least, afaict
[19:37] going to create shard 14, based on https://archive.org/details/newspapers
[19:38] 13?
[19:39] yours
[19:39] I guess one thing to think of is: if items are added to a collection, how are these reflected in the iabak version?
[19:43] http://paste.nerds.io/uvaqeteyah.md I'm now slightly confused about why that went wrong
[19:44] have you got the md5_collection_url.txt.pick1.sorted.uniq link?
[19:44] yes
[19:45] it 'sort of works' if you do it from /usr/local/IA.BAK, but you have no perms to create the SHARD14.list
[19:48] yea, we need a way of doing it in our directories
[19:52] wait, hang on
[19:55] HCross: mkSHARD in /usr/local/IA.BAK is different to the one in the repo
[19:56] this explains some things
[19:57] that one seems to be working
[19:57] it's thinking about it now
[19:57] I've cp'ed it into my folder and renamed it
[20:00] 228575 SHARD14.list
[20:00] may be a tad high
[20:02] or not, hang on
[20:03] yea, don't go over 100k
[20:03] that's the file size I think though, it's still generating. My calculation is 60k
[20:03] db48x: could you push the updated mkSHARD to the repo please? just in case there are other changes that should be pulled through too
[20:03] although the wiki page says to use that file, it's an out-of-date index of the archive
[20:04] probably better to use ia-mine to generate a more up-to-date index of the collection you're interested in
[20:04] db48x, if you show me how, I can get us a new one
[20:04] should we do a whole-archive one?
[20:04] nah, takes ages
[20:05] ia-mine --secure -c -s "collection:${1}" --itemlist
[20:05] how long is "ages"
[20:05] a life-age of the earth
[20:05] seriously, the IA has 20PB of stuff
[20:06] takes ages, but if you just do one collection it only takes a minute or so
[20:06] see the split-collection.sh script that I checked in
[20:06] it first gets a list of identifiers for the items in the collection, then gets the json metadata for each item
[20:07] then you use jq to convert the json into a tsv of just the four things we need in order to make the git-annex repository
[20:08] you can adjust that basic recipe to suit the needs of the moment
[20:08] about to kick off several shards with archivebot.... db48x is that OK?
[20:08] probably :)
[20:09] how are you splitting them between shards?
[20:09] using your splitting script
[20:10] ok. just be aware that the archivebot collection is a smaller number of huge items, so 100k files is a bad way to split it :)
[20:10] best idea to split it?
[20:10] I'm not sure
[20:10] probably into about 50 shards of 2-3TB
[20:11] archivebot is 1.8k items.
[20:12] assuming each pack is 50GB, it's 5 items per pack
[20:12] sounds right, but don't assume
[20:12] yea, I'll check now
[20:12] that jq script dumps out the file sizes too, so dice it up and measure the size of all the pieces :)
[20:13] so just to be clear
[20:13] 1) run split-collection on huge collection
[20:13] 2) do some jq magic
[20:13] 3) make a shard based on each list that gets pumped out by 1 & 2?
[20:14] yes
[20:14] but split-collection is just one possible way to split up a collection
[20:14] you could do it by date
[20:14] right
[20:15] db48x, can we have nano please?
[20:15] as I did for archiveteam_fire, or by size, for archivebot, or by some clever means I haven't thought up
[20:15] sure
[20:15] emacs is on there, as is vim, but go ahead and install it
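
(For "dice it up and measure the size of all the pieces" at 20:12, a quick sketch; it assumes TSV shard lists with the size in bytes in the second column, and the file names are illustrative.)

    # report each candidate shard list's total size in TB
    for f in SHARD-*.tsv; do
      awk -F'\t' -v name="$f" \
        '{ sum += $2 } END { printf "%s\t%.2f TB\n", name, sum / 1e12 }' "$f"
    done
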
[20:16] how does one get things "out" of the backup, should the need arise?
[20:16] how does one go about running the jq script? -bash: jq: command not found
[20:16] while it's installing, go to /usr/local/propellor and edit joeyconfig.hs so that the iabak machine defined in there includes the package as well
[20:16] or do I need to install it for myself
[20:16] db48x, not root/can't sudo
[20:16] I had zero luck installing jq on iabak, sorry
[20:16] I had to do that part on my own machine
[20:16] ah, okay
[20:17] I may do all the mining closer to myself then, and push over after
[20:17] but you're welcome to show me up :)
[20:17] I'd rather not try and upload lots of small files from London > Singapore
[20:17] oh, I never actually added you guys to /etc/sudoers
[20:18] Kaz, if I get a VPS somewhere near SG (aka LA), would you want to try and sort jq on that?
[20:18] then we can use that
[20:19] could give it a shot, let me see if I can get it running locally first
[20:20] or just get it installed on the server
[20:20] I gave up because it was 4am
[20:21] uh
[20:21] am I missing something or am I supposed to be expecting something more than apt-get install jq
[20:22] dunno
[20:22] does it work?
[20:22] yeah
[20:22] well, it installed
[20:22] nice :)
[20:22] will test functionality in a sec
[20:26] struggling to get iamine to go as well
[20:26] closure: I updated my pull request
[20:27] speaking of which, you guys should peruse it as well: https://github.com/joeyh/propellor/pull/17
[20:29] it doesn't half take a time to create a shard
[20:29] my first one is still computing
[20:29] indeed
[20:30] HCross: do you have iamine working?
[20:30] I don't, it needs someone's IA credentials
[20:30] it needs your own
[20:31] ah, per user
[20:31] just uses a username and password to get an auth key
[20:31] yea
[20:31] s/key/token/
[20:32] iamine needs py3, right?
[20:34] indeed
[20:34] https://github.com/joeyh/propellor/pull/17/commits/1d689b1e4ce1f5eeedab140bd3c330484a928586
[20:35] Frogging: each remote location is a git-annex remote; to restore shard contents, you pull from remotes
[20:36] ooh, okay
[20:36] https://git-annex.branchable.com/tips/offline_archive_drives/ and https://git-annex.branchable.com/location_tracking/
[20:36] Frogging: I'm sorry, I completely forgot to answer your question!
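
(Expanding on the restore answer at 20:35: each shard is a git-annex repository and every participant's copy is a remote, so getting data back out is an ordinary git-annex fetch. A minimal sketch; the repo URL and file path are hypothetical.)

    # clone the shard repo, see who holds a file, then fetch it
    git clone https://iabak.example.org/shard1.git && cd shard1
    git annex whereis path/to/item/file.warc.gz   # lists remotes with copies
    git annex get path/to/item/file.warc.gz       # retrieves from any of them
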
[20:39] db48x, split-collection: line 8: syntax error near unexpected token `('
[20:39] on your script
[20:41] iamine is not playing nicely for me
[20:42] HCross: odd
[20:42] line 8 is just lines=$(wc -l "${itemfile}" | cut -d ' ' -f 1)
[20:43] not on https://raw.githubusercontent.com/ArchiveTeam/IA.BAK/server/split-collection
[20:44] oooh
[20:44] yeah, the 'new' mkSHARD also isn't on the repo
[20:45] *** atomotic has joined #internetarchive.bak
[20:47] Kaz: I committed it: https://github.com/ArchiveTeam/IA.BAK/commit/5b457779b2ffd9fb1342671a3dbc1cd73edcd14e#diff-85b50cc2f5b54f1254ecdcd0fec1959d
[20:47] just pushed the fixed split-collection script
[20:48] https://github.com/ArchiveTeam/IA.BAK/blob/server/mkSHARD doesn't match /usr/local/IA.BAK/mkSHARD, which one should we be using?
[20:49] well, let's look at the diff
[20:50] db48x, http://paste.nerds.io/ujacucekid.vhdl what am I doing wrong?
[20:50] Kaz: yea, that's the old version
[20:51] there, I did a git pull
[20:52] okay, thanks
[20:52] ++ basename 'archivebot-files/-*.json' .json
[20:52] error: "archivebot-meta-*.json" should be readable
[20:54] it's in quotes, so it didn't expand the glob
[20:54] but it's made a load of shards
[20:55] naturally
[20:55] ah
[20:55] that's why the script has a for loop, to work on each one independently
[20:56] but if the filename is quoted, then the loop only runs once, with f equal to "archivebot-meta-*.json" rather than with f equal to "archivebot-meta-0.json", then equal to "archivebot-meta-1.json", etc
[20:57] looks like I've got 357 shards of just archivebot
[20:58] hmm, that's not right
[20:58] that is a lot of shards
[20:59] yea, I messed up
[21:00] 34, that looks better
[21:00] :)
[21:01] db48x, also. annexed files in working tree: 228575
[21:01] - but from the info on archive.org for each collection there are only 68k files
[21:01] HCross: what collection are you looking at?
[21:02] The_Sydney_Morning_Herald svoboda_newspaper antiochnews The_Notre_Dame_Scholastic NCAA-News dailyracingform
[21:04] 23856 + 23183 + 5305 + 2206 + 857 + 10956 = ~66k items
[21:04] but yea, each item has multiple files
[21:05] ahh, it's a file thing
[21:05] sorry about my cockups this evening
[21:05] indeed
[21:07] no worries
[21:08] I'd never done it before last night either, I just didn't have anyone to talk to about the mistakes I made :)
[21:12] db48x, so I now have http://harrycross.me/dae.png
[21:12] - what do I do with them next? Tried running get_item_files.jq and I get a compile error
[21:12] show me the command and the error?
[21:14] jq get_item_files.jq
[21:14] error: get_item_files is not defined
[21:14] get_item_files.jq1 compile error
[21:14] no matter what file I add on the end of the command
[21:17] oh
[21:17] if you want it to run commands from a file, you have to use the -f option
[21:17] jq -f somefile.jq
[21:18] and since this script produces an array of strings that we want to treat as lines of a file, you also need the -r option
[21:18] so it becomes jq -r -f get_item_files.jq archivebot-meta-00.txt
[21:19] oh, and then you want to redirect the output, so tack on > archivebot-meta-00.tsv to the end
[21:21] parse error: Invalid numeric literal at line 2, column 0
[21:22] hrm
[21:22] what's line 2 of your input file look like?
[21:23] some horrible json object, I hope
[21:24] http://harrycross.me/123.png
[21:24] is archivebot-meta-00.txt
[21:25] ah, those are the ids, not the metadata
[21:25] split-collection should run ia-mine again using those as input
[21:25] it did something, then spat out the IA mine readme
[21:26] heh
[21:27] db48x, http://paste.nerds.io/fomihecuma.vhdl full log of what it did
[21:27] TO MAKE THOSE
[21:27] to make those files
[21:30] yea, you didn't fix the error on line 13
[21:30] I guess you edited split-collection, putting quotes around the glob of the for loop?
[21:31] db48x, http://paste.nerds.io/dokefiboxu.pl is that file, only edit is line 9
[21:32] http://paste.nerds.io/dokefiboxu.sh
[21:32] there, at least the colors are better now :)
[21:33] your previous log clearly shows the quotes
[21:33] maybe you edited it back after it failed, but forgot to rerun it? gremlins maybe?
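
(The quoting bug diagnosed above, boiled down to a two-line demo; the file names are illustrative, and the unquoted case assumes the matching files actually exist.)

    for f in "archivebot-meta-*.json"; do echo "$f"; done
    # quoted: one pass, with the literal string archivebot-meta-*.json
    for f in archivebot-meta-*.json; do echo "$f"; done
    # unquoted: the glob expands, one pass per matching file
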
[21:33] sorry if the question is stupid, but why the sudden interest in archiving the Archive? the election was mentioned but I didn't really understand the connection
[21:37] Frogging: we've been doing this for a while now, but we've let it slide a bit
[21:38] yeah, I know. it just seemed to get a sudden bump to top priority on Tuesday
[21:39] could be because previous attempts to get it going again didn't really accomplish much?
[21:40] maybe, though the implication was that it's now more urgent because of the election
[21:40] I think it's great that it's getting more attention, I didn't understand that though
[21:40] hmm. no idea about that
[21:40] I suppose it's possible, but it didn't seem that way to me
[21:41] db48x, can you tell me what line 13 should be, not quite getting this (am a real newbie with this kind of thing), sorry
[21:41] *** Kksmkrn has joined #internetarchive.bak
[21:42] HCross: just "done"
[21:42] oh, in the log
[21:42] not the 13th line of the script
[21:42] it shouldn't be an error
[21:43] let me run it and show you a log of when it works
[21:43] ok, thanks
[21:43] *** kyan has quit IRC (Read error: Operation timed out)
[21:44] yay, 21G down of the 1T allocated so far
[21:45] nice
[21:48] bah: {"params": {"rows": 50, "q": "collection:archiveteam-fire", "page": 397, "output": "json"}, "url": "https://archive.org/advancedsearch.php", "retries_left": 0, "message": "Maximum retries exceeded for url, giving up."}
[21:51] *** Start_ has joined #internetarchive.bak
[21:52] *** Start has quit IRC (Read error: Connection reset by peer)
[21:52] HCross: do you see a similar stream of error messages when you run ia-mine?
[21:52] nope, just the IA mine readme
[21:52] ah, it only happens on that collection
[21:53] I'll look into that later, back to helping you
[21:53] thanks
[21:54] oooh, lol
[21:55] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[21:55] it's because I am dumb
[21:58] *** wp494 has quit IRC (Read error: Connection reset by peer)
[22:02] http://paste.nerds.io/ubuhujimur.sh
[22:02] HCross: I pushed a fix
[22:03] basically, I made some changes to split-collection, then had the bright idea to split archiveteam-fire by date, and didn't actually run it again with my changes
[22:04] the quotes in your log were a red herring
[22:04] I was forgetting that this is bash we're dealing with
[22:04] I just ordered a time4vps for storage for this, and as soon as I paid, their portal crashed
[22:05] heh
[22:08] that looks more like what it should be doing db48x - it's also taking more time
[22:08] yay
[22:09] *** wp494 has joined #internetarchive.bak
[22:09] db48x, is https://billing.time4vps.eu down/showing a cloudflare live thingy for you?
[22:11] it's showing nothing but a spinning spinner so far
[22:11] ah, indeed it is
[22:12] rip, just fed them 20 eur
[22:12] Loading for me
[22:13] Nevermind, I should drink more, talk less.. live thingy
[22:14] db48x, it worked this time, have a nice JSON car-crash here
[22:14] cat'ing that first file was a BAD idea
[22:14] :D
[22:14] use less instead
[22:14] or head
[22:15] hexdump -C
[22:16] especially on a server 200ms away. Also, jq is still cocking up
[22:16] same parse error: Invalid numeric literal at line 2, column 0
[22:22] *** kyan has joined #internetarchive.bak
[22:35] *** Kksmkrn has left zZzZ..
[22:49] HCross: sounds like you fed it another file containing ids instead of json :)
[23:01] db48x, I now have .tsv files full of JSON and .json files full of lists of IDs
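
(A cheap guard against the ids-versus-JSON mix-up above, as a sketch; the file name is illustrative, and a purely numeric identifier would still fool it.)

    # does the first line of the file parse as JSON?
    head -n 1 archivebot-meta-00.json | jq -e type >/dev/null 2>&1 \
      && echo "looks like JSON" \
      || echo "not JSON; probably an identifier list"
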
[23:03] db48x, I am going to head to bed, I'll sort this out tomorrow. Goodnight
[23:20] Looks like good work today.
[23:20] Do I need to help in any way?
[23:24] HCross: awesome! :D
[23:25] SketchPho: I think we're making progress
[23:25] Agreed. (Checked scrollback)
[23:26] it's not really repeatable, quality-controlled, ISO 9000-certified progress, but it's not too bad :)
[23:26] I'll sort out the rest of it later tomorrow. It should be easier from here. It's now just converting those files to shards
[23:27] I've also got 4TB downloading
[23:28] I had to stop after 15GB of shard12