Time |
Nickname |
Message |
00:02
🔗
|
|
sep332 has quit IRC (konversation out) |
01:08
🔗
|
|
kyan has joined #internetarchive.bak |
01:24
🔗
|
|
Lord_Nigh has quit IRC (Ping timeout: 633 seconds) |
01:25
🔗
|
|
Lord_Nigh has joined #internetarchive.bak |
02:01
🔗
|
|
Lord_Nigh has quit IRC (Read error: Operation timed out) |
02:08
🔗
|
|
Lord_Nigh has joined #internetarchive.bak |
02:26
🔗
|
|
Lord_Nigh has quit IRC (Ping timeout: 244 seconds) |
02:28
🔗
|
|
Lord_Nigh has joined #internetarchive.bak |
03:08
🔗
|
|
Lord_Nigh has quit IRC (Read error: Operation timed out) |
03:31
🔗
|
|
Lord_Nigh has joined #internetarchive.bak |
04:26
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
04:28
🔗
|
|
Start has joined #internetarchive.bak |
04:50
🔗
|
SketchPho |
No action? |
04:50
🔗
|
SketchPho |
I need to tap closure. Is there other stuff people have questions on? |
04:51
🔗
|
SketchPho |
I'm not kidding when I say this is day in day out first priority |
04:51
🔗
|
db48x |
do we have a list of things to put in a new shard? I can run the scripts |
04:53
🔗
|
db48x |
actually, I take that back. I can't log into the server |
04:53
🔗
|
db48x |
oh, nvm |
04:54
🔗
|
db48x |
I was doing it wrong |
05:04
🔗
|
db48x |
gotta grab the latest census |
05:07
🔗
|
db48x |
ok, confusion |
05:07
🔗
|
db48x |
archiveteam_census_2016 doesn't have what I expected; it's just a list of identifiers with none of the rest of the information |
05:08
🔗
|
|
Lord_Nigh has quit IRC (Ping timeout: 244 seconds) |
05:14
🔗
|
|
Lord_Nigh has joined #internetarchive.bak |
05:24
🔗
|
|
Lord_Nigh has quit IRC (Read error: Operation timed out) |
05:27
🔗
|
|
Lord_Nigh has joined #internetarchive.bak |
05:30
🔗
|
SketchPho |
I'd like us to turn to archivebot and archiveteam items for new shards |
05:30
🔗
|
db48x |
sure |
05:30
🔗
|
db48x |
do we have any that aren't made of 50GB warcs? |
05:51
🔗
|
bwn |
db48x: sorry, yes, no metadata in there yet |
05:52
🔗
|
db48x |
ah |
05:52
🔗
|
db48x |
:) |
05:52
🔗
|
bwn |
i've started ia-mine running on archivebot collection to get metadata |
05:59
🔗
|
yipdw |
bwn: also use http://archive.fart.website/archivebot/viewer/items/ |
05:59
🔗
|
yipdw |
that may be a more complete index of archivebot materials, as it includes a large number of items that are not in the archivebot collection |
06:08
🔗
|
HCross2 |
Best person to make my SSH key available to? |
06:16
🔗
|
bwn |
yipdw: looks like the items you're talking about got added to the archivebot collection at some point |
06:17
🔗
|
bwn |
ia gave me archiveteam_archivebot_go_20161110020001 but it hasn't been added to the viewer yet |
06:17
🔗
|
yipdw |
some of them did |
06:17
🔗
|
yipdw |
yeah, the viewer updates every 24 hours |
06:17
🔗
|
yipdw |
items are created every, uh, 3 or so |
06:17
🔗
|
db48x |
HCross2: you can just paste it in here and I'll add it to the server. ed25519 keys are preferred |
06:17
🔗
|
yipdw |
depending on upload speed from fos |
06:17
🔗
|
|
kyan has quit IRC (Quit: Leaving) |
06:18
🔗
|
HCross2 |
https://www.irccloud.com/pastebin/BPtozFKj |
06:19
🔗
|
HCross2 |
db48x: |
06:19
🔗
|
db48x |
strongly preferred :) |
06:20
🔗
|
HCross2 |
Ah ok. I'll have to re do it |
06:26
🔗
|
Deewiant |
Is it normal to see "verification of content failed" and "Unable to access these remotes: web" often? (Happens for approximately 1 file out of 5) |
06:26
🔗
|
db48x |
Deewiant: sometimes |
06:26
🔗
|
db48x |
even when an item is hidden on IA it can still be mentioned in the backup, but nobody will be able to download it |
06:36
🔗
|
bwn |
yipdw: unless i'm making a mistake, it looks like the viewer matches up with `ia search collection:archivebot` |
06:36
🔗
|
yipdw |
ah ok cool |
06:39
🔗
|
bwn |
:) |
06:39
🔗
|
bwn |
db48x: that metadata is finished, i also created a sorted list of the archivebot item sizes |
06:40
🔗
|
bwn |
where would be a good place to put this for anyone who needs it? i could create an ia item for now |
06:41
🔗
|
db48x |
that's a nice meta way to do it |
06:41
🔗
|
db48x |
or just serve it up the old-fashioned way and I'll download it from you |
06:52
🔗
|
Deewiant |
db48x: Ok, it just looks a bit worrying to have it show up so often. Seems to mostly (or only?) affect _archive.torrent and _meta.xml files though. |
06:52
🔗
|
bwn |
db48x: http://erebos.undo.it/MIRROR/db48x/ |
06:55
🔗
|
db48x |
Deewiant: ah, that's actually a slightly different issue; we don't know the correct hash of those files so they can't be verified |
06:56
🔗
|
db48x |
bwn: I'm taking a look now |
06:57
🔗
|
bwn |
i'm writing up a readme so you know what's what |
06:58
🔗
|
db48x |
hmm |
06:59
🔗
|
db48x |
the script we have reads from a file with the md5 hash, the size, the category and the file url |
07:04
🔗
|
HCross2 |
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIP5OhU2Lita9RdjPkX9N0w9wZnmVlednUDEx24bVn4Mk IABAK key - Harry C |
07:04
🔗
|
HCross2 |
db48x: ^ |
07:05
🔗
|
HCross2 |
TIL: You have to use the development version of PuTTYgen for that |
07:05
🔗
|
bwn |
get_item_files.jq outputs a partial url collection/filename for each file |
07:06
🔗
|
db48x |
HCross2: great, what username do you want? |
07:06
🔗
|
HCross2 |
HCross |
07:10
🔗
|
bwn |
that get_item_files.jq can be modified to get the format you need with the jq filters from https://archive.org/details/ia-bak-census_20150304 |
07:10
🔗
|
bwn |
^ can likely be modified |
07:11
🔗
|
db48x |
HCross2: ok, you should be able to log in to iabak.archiveteam.org, username hcross |
07:11
🔗
|
HCross2 |
Would it be a good idea to run another IA census now? |
07:13
🔗
|
|
Atom has quit IRC (Read error: Connection reset by peer) |
07:14
🔗
|
HCross2 |
db48x: thanks, am in |
07:14
🔗
|
db48x |
excellent |
07:15
🔗
|
HCross2 |
I need to leave for work now, but I'll have a little play around when I get into the office |
07:15
🔗
|
db48x |
oh, you know what? it just occurred to me |
07:15
🔗
|
db48x |
ok |
07:15
🔗
|
db48x |
there's a minor wrinkle that I'll straighten out shortly |
07:17
🔗
|
db48x |
closure's way of setting machines up is better than mine, but requires remembering to do things better |
07:28
🔗
|
db48x |
hrm |
07:28
🔗
|
db48x |
I can't remember how we handled items that are in multiple collections |
07:29
🔗
|
HCross2 |
How are we also going to do large collections |
07:30
🔗
|
db48x |
split them across multiple shards |
07:31
🔗
|
db48x |
.files | |
07:31
🔗
|
db48x |
map( |
07:31
🔗
|
db48x |
select(.source != "derivative") | |
07:31
🔗
|
db48x |
# if case for catching files with size=null (i.e. files.xml). |
07:31
🔗
|
db48x |
if .size != null then |
07:31
🔗
|
db48x |
{"name": .name, "size": (.size | tonumber), "collection": $c, "md5": .md5} |
07:31
🔗
|
db48x |
else |
07:31
🔗
|
db48x |
{"name": .name, "size": 0, "collection": $c, "md5": .md5} |
07:31
🔗
|
db48x |
end |
07:32
🔗
|
db48x |
) | |
07:32
🔗
|
db48x |
map([.md5, .size, .collection[0], .name]) | map(@tsv) | .[] |
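For readers without jq, the filter pasted above can be approximated in Python. This is only a minimal sketch under stated assumptions: each item's metadata is taken to be a JSON object with a `files` list and a `metadata.collection` field (the pasted fragment uses `$c` without showing where it is bound, so that binding is a guess), and `item_to_rows` is a hypothetical name, not part of IA.BAK.

```python
def item_to_rows(item):
    """Turn one item's metadata object into TSV rows: md5, size, collection, name."""
    # Assumption: the collection lives under metadata.collection and may be
    # either a single string or a list; like the filter, keep only the first.
    collection = item.get("metadata", {}).get("collection", [])
    if isinstance(collection, str):
        collection = [collection]
    first = collection[0] if collection else ""

    rows = []
    for f in item.get("files", []):
        if f.get("source") == "derivative":
            continue  # derived files are skipped, as in the filter
        size = f.get("size")
        size = int(size) if size is not None else 0  # e.g. files.xml has no size
        md5 = f.get("md5") or ""  # unknown hash becomes an empty column
        rows.append("\t".join([md5, str(size), first, f["name"]]))
    return rows


# Hypothetical sample item, for illustration only.
sample = {
    "metadata": {"collection": ["archivebot", "archiveteam"]},
    "files": [
        {"name": "a.warc.gz", "source": "original", "size": "1024", "md5": "abc"},
        {"name": "a_thumb.jpg", "source": "derivative", "size": "10", "md5": "def"},
        {"name": "a_files.xml", "source": "original", "md5": "ghi"},
    ],
}

for row in item_to_rows(sample):
    print(row)
```

The column order (md5, size, collection, name) follows the order in the last line of the pasted filter.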
07:33
🔗
|
db48x |
bwn: is every single item in this dataset in the archivebot collection? |
07:33
🔗
|
HCross2 |
Ah. So for example, I would like to back up https://archive.org/details/archiveteam_newssites which is going up by at least half a terabyte a day |
07:36
🔗
|
db48x |
yea, I guess they are |
07:38
🔗
|
bwn |
yes, from `ia search collection:archivebot` |
07:39
🔗
|
bwn |
i need to eat and get some sleep soon, hope that stuff helps :) |
07:40
🔗
|
db48x |
it does |
07:44
🔗
|
db48x |
archivebot and archiveteam_newssites are both going to have to end up split across a bunch of shards |
08:02
🔗
|
HCross2 |
Easiest way of splitting them? |
08:04
🔗
|
Kaz |
db48x: can you drop my key in too please? |
08:04
🔗
|
Kaz |
only here for a few minutes before I run off to work, but have key |
08:12
🔗
|
Kaz |
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIHhFYMd9Htlf9wPZzIDyqbYYNwuo3m+kWQ9/pfAD/TE9 Kaz IABAK |
08:12
🔗
|
Kaz |
^ if you're around, gotta run now |
08:12
🔗
|
db48x |
Kaz: I will add you in |
08:13
🔗
|
Kaz |
awesome, thanks |
08:13
🔗
|
db48x |
you will be kaz on the server |
08:16
🔗
|
|
jsp12345 has joined #internetarchive.bak |
08:43
🔗
|
* |
db48x sighs |
08:57
🔗
|
|
atomotic has joined #internetarchive.bak |
09:45
🔗
|
|
antomatic has quit IRC (Read error: Connection reset by peer) |
09:46
🔗
|
|
antomatic has joined #internetarchive.bak |
10:06
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
11:11
🔗
|
|
atomotic has joined #internetarchive.bak |
11:26
🔗
|
|
atomotic has quit IRC (Remote host closed the connection) |
11:28
🔗
|
|
atomotic has joined #internetarchive.bak |
12:29
🔗
|
SketchPho |
Please do not add news sites to this project quite yet |
12:30
🔗
|
SketchPho |
I would want us to go after the archivebot collection first, just because that is pure websites that were grabbed for various reasons |
12:31
🔗
|
SketchPho |
This is also going to bring to bear our management of large amounts of shards, which we might as well deal with anyway |
12:32
🔗
|
SketchPho |
I also understand that we will be forced to deal with a growing collection, which means that we might want to just focus on a cutoff date with archivebot |
12:33
🔗
|
SketchPho |
Perhaps it is time for me to begin discussions on data hoarders for space |
13:05
🔗
|
HCross2 |
Ok. How do we cut up a large collection like archivebot |
13:14
🔗
|
luckcolor |
By timestamp or by data size i suppose |
13:17
🔗
|
luckcolor |
Shards with big files could have a smaller number of files, unlike the normal ones |
13:17
🔗
|
HCross2 |
But we add it to the shard per item |
13:20
🔗
|
luckcolor |
I mean collections with Big files :) |
13:20
🔗
|
luckcolor |
*meant |
13:20
🔗
|
luckcolor |
HCross2: ? |
13:22
🔗
|
HCross2 |
Each shard has "items" in it, really not sure how we cut it up. I am probably mistaken about that though. Someone better can probably advise |
13:22
🔗
|
db48x |
with shell scripting |
13:23
🔗
|
db48x |
I wrote one |
13:23
🔗
|
db48x |
which I've been testing |
13:24
🔗
|
db48x |
SketchPho: since archivebot is sooo big I looked at archiveteam_fire instead |
13:24
🔗
|
db48x |
400k files, so it could be four shards |
13:24
🔗
|
db48x |
(easy, just split -n1/4 ...) |
13:25
🔗
|
db48x |
but I checked the dates, and 2011-2015 is 100k files and 4.3tb, so I figure I'll do it that way |
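Splitting by date, as described here, might look roughly like the following. It is a sketch only: the log does not say which date field was used or how the counts were obtained, so the `(identifier, iso_date, file_count)` tuples and both function names are hypothetical.

```python
from collections import defaultdict


def split_by_year(items, cutoff_year):
    """Partition (identifier, iso_date, file_count) tuples around a cutoff year."""
    buckets = defaultdict(list)
    for identifier, iso_date, file_count in items:
        year = int(iso_date[:4])  # assumes dates start with a four-digit year
        key = "early" if year <= cutoff_year else "late"
        buckets[key].append((identifier, file_count))
    return dict(buckets)


def total_files(bucket):
    """Sum the file counts in one bucket, to compare against the per-shard limit."""
    return sum(count for _, count in bucket)


# Hypothetical items, for illustration only.
items = [
    ("item-2012", "2012-03-01", 40000),
    ("item-2015", "2015-11-20", 55000),
    ("item-2016", "2016-06-05", 300000),
]

buckets = split_by_year(items, cutoff_year=2015)
for name, bucket in sorted(buckets.items()):
    print(name, total_files(bucket))
```

Checking each bucket's file total is what tells you whether a date range fits in one shard or needs splitting further.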
13:28
🔗
|
db48x |
HCross2: https://gist.github.com/db48x/a1a8847916ab149abbfce25517944bdc |
13:28
🔗
|
db48x |
I'll check it in to IA.BAK as well |
13:28
🔗
|
luckcolor |
Isn't it better to have smaller shards? |
13:28
🔗
|
SketchPho |
I trust your judgement |
13:29
🔗
|
db48x |
luckcolor: total file size is less important than number of files |
13:30
🔗
|
db48x |
because git gradually slows down on repositories with more files in them |
13:30
🔗
|
luckcolor |
Ok right |
13:31
🔗
|
db48x |
whereas the files are split up across multiple contributors |
13:32
🔗
|
db48x |
annoyingly, archiveteam_fire for 2016 is 300k files but only .6tb :) |
13:36
🔗
|
HCross2 |
db48x: will you be around in 5 and a half hours or so? I want to get started on shard making and want someone to show me the ropes |
13:39
🔗
|
HCross2 |
4 and a bit I mean |
13:39
🔗
|
db48x |
I'll probably be asleep |
13:40
🔗
|
db48x |
for about the next 8 hours |
13:42
🔗
|
HCross2 |
Ok |
13:44
🔗
|
db48x |
it's not very hard |
13:45
🔗
|
db48x |
use ia-mine (https://github.com/jjjake/iamine) to get the list of items in a collection, and the metadata for each of those items |
13:46
🔗
|
db48x |
you can make a shard out of small collections, or split up a large collection to make several shards; either way you end up with a list of items |
13:46
🔗
|
db48x |
then use jq to parse the metadata (which is one json object per item), and output a tsv file |
13:46
🔗
|
db48x |
then feed the tsv file to a slightly-modified version of the mkSHARD script |
13:48
🔗
|
db48x |
https://gist.github.com/db48x/a1a8847916ab149abbfce25517944bdc |
13:48
🔗
|
HCross2 |
Awesome |
13:49
🔗
|
db48x |
I'll get these checked into the IA.BAK repository at some point; probably after I've slept |
13:50
🔗
|
db48x |
we used to do it slightly differently; ia-mine didn't exist, but we had a single huge json dump of every single public item on IA called the census |
13:50
🔗
|
HCross2 |
Cool. What's the jq command for it please |
13:50
🔗
|
db48x |
jq -r -f file.jq input.json > files.txt |
13:51
🔗
|
HCross2 |
Thanks :) |
13:51
🔗
|
db48x |
if you look on the server, you can see the huge TSV that we created from the huge json dump in ~joey/IA.BAK |
13:52
🔗
|
db48x |
ok, I'm about to hit the normal-sized red button |
13:54
🔗
|
db48x |
ok, I think that worked |
13:55
🔗
|
db48x |
anyone want to try out SHARD12 before I hit the big red button? |
14:03
🔗
|
db48x |
well, I guess I can hit the button without testing it |
14:03
🔗
|
db48x |
what's the worst that could happen? |
14:06
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
14:21
🔗
|
|
Whopper has joined #internetarchive.bak |
14:38
🔗
|
db48x |
merge: refs/remotes/origin/synced/master - not something we can merge |
14:38
🔗
|
db48x |
failed |
14:38
🔗
|
db48x |
(merging origin/git-annex into git-annex...) |
14:38
🔗
|
db48x |
(recording state in git...) |
14:38
🔗
|
db48x |
that's not quite what I expected |
14:40
🔗
|
db48x |
ok, when I rerun iabak it works fine |
14:41
🔗
|
db48x |
so I guess it's just a bug that happens to the first person to try out a new shard |
14:41
🔗
|
|
Atom has joined #internetarchive.bak |
14:49
🔗
|
db48x |
time for me to sleep |
15:05
🔗
|
|
Jon has joined #internetarchive.bak |
15:40
🔗
|
|
DFJustin has joined #internetarchive.bak |
15:49
🔗
|
* |
closure waves |
15:50
🔗
|
kurt |
o/ |
15:52
🔗
|
|
atomotic has joined #internetarchive.bak |
15:54
🔗
|
db48x |
closure: howdy |
15:58
🔗
|
closure |
sounds like you guys are making headway. I got a slack invite, but would rather avoid slack, so I'll be over here |
16:00
🔗
|
db48x |
yes, managed to add a shard |
16:00
🔗
|
db48x |
I chickened out and didn't add a bunch of shards all at once though |
16:03
🔗
|
closure |
you asked about items in multiple collections. IIRC the list I generated picked an arbitrary collection for such items, so they only go into one |
16:03
🔗
|
db48x |
yea, I did the same |
16:06
🔗
|
db48x |
biggest development is that there's no census any more |
16:06
🔗
|
db48x |
but there is ia-mine, which is handy |
16:07
🔗
|
closure |
db48x: propellor pull request> hasGroup takes a User and a Group, so hasUser should not deconstruct the parameters |
16:07
🔗
|
closure |
simplest implementation: hasUser = flip hasGroup |
16:08
🔗
|
closure |
no census anymore? What is ia-mine? |
16:09
🔗
|
closure |
ah, I guess it finds the items in a collection |
16:09
🔗
|
db48x |
ia-mine fetches the same metadata json that we had in the census, but for arbitrary searches, from the command-line |
16:09
🔗
|
db48x |
ah, I didn't know about flip |
16:09
🔗
|
db48x |
(though I figured there would be something) |
16:15
🔗
|
db48x |
bah |
16:15
🔗
|
db48x |
I couldn't guess the type signature of flip well enough for hoogle to find it |
16:15
🔗
|
db48x |
though it's obvious now that I look at it |
16:25
🔗
|
Jon |
pulling a shard onto my first donated 1T; I have a second 1T but not contiguous, might need to run a second IA.BAK instance? not sure |
16:52
🔗
|
|
Lord_Nigh has quit IRC (Ping timeout: 250 seconds) |
16:52
🔗
|
|
Lord_Nigh has joined #internetarchive.bak |
16:58
🔗
|
|
Lord_Nigh has quit IRC (Ping timeout: 244 seconds) |
17:05
🔗
|
|
Lord_Nigh has joined #internetarchive.bak |
17:30
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
17:51
🔗
|
Kaz |
right |
17:51
🔗
|
Kaz |
I'm on the server, how do I master the shards |
17:52
🔗
|
closure |
db48x: respin the patch? |
17:52
🔗
|
closure |
Kaz: did you see my wiki page on how to do it? |
17:53
🔗
|
Kaz |
I did not. that might be a place to start |
17:54
🔗
|
closure |
http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/admin |
17:56
🔗
|
Kaz |
thanks, taking a look now |
18:17
🔗
|
|
HCross has joined #internetarchive.bak |
18:41
🔗
|
|
sep332 has joined #internetarchive.bak |
18:46
🔗
|
|
kyan has joined #internetarchive.bak |
18:59
🔗
|
Frogging |
so are these the instructions I should follow if I want to donate space? anything else? http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation |
19:03
🔗
|
sep332 |
Frogging: the README is very helpful too, especially "tuning resource usage" part |
19:03
🔗
|
Frogging |
yah I'm looking at that |
19:04
🔗
|
Frogging |
wondering if maybe I should use a filesystem quota or sub-image |
19:05
🔗
|
sep332 |
we've been trying to keep it simple but that certainly wouldn't hurt |
19:06
🔗
|
Frogging |
it's just the diskreserve option is for how much space to _not_ use. just slightly complicates allocation on my array. it's only a minor issue though really |
19:08
🔗
|
HCross |
closure, getting a ton of "server refused our key" now |
19:25
🔗
|
Kaz |
right |
19:26
🔗
|
Kaz |
so one of the issues I see is that we can't actually update the iabak repo to point to the new shards |
19:26
🔗
|
Kaz |
uh, setup a new shard repo even |
19:27
🔗
|
HCross |
ignore my error, cant spell my own name this evening |
19:27
🔗
|
Kaz |
reading the wrong bit |
19:29
🔗
|
HCross |
looking at the examples, I see "10191 jstor_jamerinstcrimlaw |
19:29
🔗
|
HCross |
", how do I get that first number? |
19:30
🔗
|
HCross |
also, I dont see a ./mkSHARD file |
19:30
🔗
|
HCross |
done the clone |
19:30
🔗
|
Kaz |
HCross: see /usr/local/IA.BAK |
19:31
🔗
|
Kaz |
HCross: and/or checkout server, rather than master |
19:31
🔗
|
HCross |
ah, I checked out /master |
19:35
🔗
|
HCross |
Kaz, |
19:35
🔗
|
HCross |
hcross@ia-bak:~/IA.BAK$ git checkout -b server master/server |
19:35
🔗
|
HCross |
fatal: Cannot update paths and switch to branch 'server' at the same time. |
19:35
🔗
|
HCross |
Did you intend to checkout 'master/server' which can not be resolved as commit? |
19:35
🔗
|
HCross |
can you please tell me where I am being a clot? |
19:35
🔗
|
Kaz |
git checkout server |
19:36
🔗
|
HCross |
thanks |
19:36
🔗
|
Kaz |
-b makes new branch, which you don't want because it already exists |
19:36
🔗
|
Kaz |
or at least, afaict |
19:37
🔗
|
HCross |
going to create shard 14, based on https://archive.org/details/newspapers |
19:38
🔗
|
Kaz |
13? |
19:39
🔗
|
HCross |
yours |
19:39
🔗
|
Kaz |
I guess one thing to think of, is if items are added to a collection, how are these reflected in the iabak version? |
19:43
🔗
|
HCross |
http://paste.nerds.io/uvaqeteyah.md im now slightly confused about why that went wrong |
19:44
🔗
|
Kaz |
have you got the md5_collection_url.txt.pick1.sorted.uniq link? |
19:44
🔗
|
HCross |
yes |
19:45
🔗
|
Kaz |
it 'sort of works' if you do it from /usr/local/IA.BAK, but have no perms to create the SHARD14.list |
19:48
🔗
|
HCross |
yea, we need a way of doing it in our directories |
19:52
🔗
|
Kaz |
wait hang on |
19:55
🔗
|
Kaz |
HCross: mkSHARD in /usr/local/IA.BAK is different to the one in the repo |
19:56
🔗
|
Kaz |
this explains some things |
19:57
🔗
|
HCross |
that one seems to be working |
19:57
🔗
|
HCross |
it's thinking about it now |
19:57
🔗
|
HCross |
ive cp'ed it into my folder and renamed it |
20:00
🔗
|
HCross |
228575 SHARD14.list |
20:00
🔗
|
HCross |
may be a tad high |
20:02
🔗
|
HCross |
or not hang on |
20:03
🔗
|
db48x |
yea, don't go over 100k |
20:03
🔗
|
HCross |
thats the file size I think though, its still generating. My calculation is 60k |
20:03
🔗
|
Kaz |
db48x: could you push the updated mkshard to the repo please? just in case there are other changes that should be pulled through too |
20:03
🔗
|
db48x |
although the wiki page says to use that file, it's an out of date index to the archive |
20:04
🔗
|
db48x |
probably better to use ia-mine to generate a more up-to-date index of the collection you're interested in |
20:04
🔗
|
HCross |
db48x, if you show me how, I can get us a new one |
20:04
🔗
|
HCross |
should we do a whole archive one? |
20:04
🔗
|
db48x |
nah, takes ages |
20:05
🔗
|
db48x |
ia-mine --secure -c -s "collection:${1}" --itemlist |
20:05
🔗
|
HCross |
how long is "ages" |
20:05
🔗
|
db48x |
a life-age of the earth |
20:05
🔗
|
db48x |
seriously, the IA has 20PB of stuff |
20:06
🔗
|
db48x |
takes ages, but if you just do one collection it only takes a minute or so |
20:06
🔗
|
db48x |
see the split-collection.sh script that I checked in |
20:06
🔗
|
db48x |
it first gets a list of identifiers for the items in the collection, then gets the json metadata for each item |
20:07
🔗
|
db48x |
then you use jq to convert the json into a tsv of just the four things we need in order to make the git-annex repository |
20:08
🔗
|
db48x |
you can adjust that basic recipe to suit the needs of the moment |
20:08
🔗
|
HCross |
about to kick off several shards with archivebot.... db48x is that OK? |
20:08
🔗
|
db48x |
probably :) |
20:09
🔗
|
db48x |
how are you splitting them between shards? |
20:09
🔗
|
HCross |
using your splitting script |
20:10
🔗
|
db48x |
ok. just be aware that the archivebot collection is a smaller number of huge items, so 100k files is a bad way to split it :) |
20:10
🔗
|
HCross |
best idea to split it? |
20:10
🔗
|
db48x |
I'm not sure |
20:10
🔗
|
db48x |
probably into about 50 shards of 2-3TB |
20:11
🔗
|
HCross |
archivebot is 1.8k items. |
20:12
🔗
|
HCross |
assuming each pack is 50GB, its 5 items per pack |
20:12
🔗
|
db48x |
sounds right, but don't assume |
20:12
🔗
|
HCross |
yea, ill check now |
20:12
🔗
|
db48x |
that jq script dumps out the file sizes too, so dice it up and measure the size of all the pieces :) |
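"Dice it up and measure the pieces" for a collection of a few huge items could be sketched as a simple greedy packer that keeps items whole. This is an illustration under assumptions, not the split-collection script: `pack_items` is a hypothetical name, the input is assumed to be `(item_id, size_in_bytes)` pairs already summed per item, and the 100 GB target below is only for the example (the discussion suggests something nearer 2-3 TB).

```python
def pack_items(item_sizes, target_bytes):
    """Greedily group whole items into shards of at most target_bytes each.

    An item larger than target_bytes still gets a shard of its own.
    Returns a list of (item_ids, total_bytes) tuples, one per shard.
    """
    shards = []
    current, current_size = [], 0
    for item_id, size in item_sizes:
        # Start a new shard when adding this item would overshoot the target.
        if current and current_size + size > target_bytes:
            shards.append((current, current_size))
            current, current_size = [], 0
        current.append(item_id)
        current_size += size
    if current:
        shards.append((current, current_size))
    return shards


GB = 10**9

# Hypothetical item sizes, for illustration only.
items = [
    ("item-a", 50 * GB),
    ("item-b", 40 * GB),
    ("item-c", 30 * GB),
    ("item-d", 90 * GB),
]

for ids, total in pack_items(items, target_bytes=100 * GB):
    print(len(ids), "items,", total // GB, "GB:", ids)
```

Printing the per-shard totals is the "measure" half: it shows at a glance whether any piece came out far from the intended size.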
20:13
🔗
|
Kaz |
so just to be clear |
20:13
🔗
|
Kaz |
1) run split-collection on huge collection |
20:13
🔗
|
Kaz |
2) do some jq magic |
20:13
🔗
|
Kaz |
3) make shard based on each list that gets pumped out by 1&2? |
20:14
🔗
|
db48x |
yes |
20:14
🔗
|
db48x |
but split-collection is just one possible way to split up a collection |
20:14
🔗
|
db48x |
you could do it by date |
20:14
🔗
|
Kaz |
right |
20:15
🔗
|
HCross |
db48x, can we have nano please? |
20:15
🔗
|
db48x |
as I did for archiveteam_fire, or by size, for archivebot, or by some clever means I haven't thought up |
20:15
🔗
|
db48x |
sure |
20:15
🔗
|
db48x |
emacs is on there, as is vim, but go ahead and install it |
20:15
🔗
|
Frogging |
how does one get things "out" of the backup, should the need arise? |
20:16
🔗
|
Kaz |
how does one go about running the jq script? -bash: jq: command not found |
20:16
🔗
|
db48x |
while it's installing, go to /usr/local/propellor and edit joeyconfig.hs so that the iabak machine defined in there includes the package as well |
20:16
🔗
|
Kaz |
or do I need to install for myself |
20:16
🔗
|
HCross |
db48x, not root/cant sudo |
20:16
🔗
|
db48x |
I had zero luck installing jq on iabak, sorry |
20:16
🔗
|
db48x |
I had to do that part on my own machine |
20:16
🔗
|
Kaz |
ah, okay |
20:17
🔗
|
HCross |
I may do all the mining closer to myself then, and push over after |
20:17
🔗
|
db48x |
but you're welcome to show me up :) |
20:17
🔗
|
HCross |
I'd rather not try and upload lots of small files from London > Singapore |
20:17
🔗
|
db48x |
oh, I never actually added you guys to /etc/sudoers |
20:18
🔗
|
HCross |
Kaz, if I get a VPS somewhere near SG (aka LA), would you want to try and sort jq on that? |
20:18
🔗
|
HCross |
then we can use that |
20:19
🔗
|
Kaz |
could give it a shot, let me see if I can get it running locally first |
20:20
🔗
|
db48x |
or just get it installed on the server |
20:20
🔗
|
db48x |
I gave up because it was 4am |
20:21
🔗
|
Kaz |
uh |
20:21
🔗
|
Kaz |
am I missing something or am I supposed to be expecting something more than apt-get install jq |
20:22
🔗
|
db48x |
dunno |
20:22
🔗
|
db48x |
does it work? |
20:22
🔗
|
Kaz |
yeah |
20:22
🔗
|
Kaz |
well, it installed |
20:22
🔗
|
db48x |
nice :) |
20:22
🔗
|
Kaz |
will test functionality in a sec |
20:26
🔗
|
HCross |
struggling to get iamine to go as well |
20:26
🔗
|
db48x |
closure: I updated my pull request |
20:27
🔗
|
db48x |
speaking of which, you guys should peruse it as well: https://github.com/joeyh/propellor/pull/17 |
20:29
🔗
|
HCross |
it doesnt half take a time to create a shard |
20:29
🔗
|
HCross |
my first one is still computing |
20:29
🔗
|
db48x |
indeed |
20:30
🔗
|
Kaz |
HCross: do you have iamine working? |
20:30
🔗
|
HCross |
I dont, it needs someones IA credentials |
20:30
🔗
|
db48x |
it needs your own |
20:31
🔗
|
HCross |
ah, per user |
20:31
🔗
|
db48x |
just uses a username and password to get an auth key |
20:31
🔗
|
db48x |
yea |
20:31
🔗
|
db48x |
s/key/token |
20:31
🔗
|
db48x |
/ |
20:32
🔗
|
Kaz |
iamine needs py3, right |
20:34
🔗
|
db48x |
indeed |
20:34
🔗
|
db48x |
https://github.com/joeyh/propellor/pull/17/commits/1d689b1e4ce1f5eeedab140bd3c330484a928586 |
20:35
🔗
|
yipdw |
Frogging: each remote location is a git-annex remote; to restore shard contents, you pull from remots |
20:35
🔗
|
yipdw |
es |
20:36
🔗
|
Frogging |
ooh, okay |
20:36
🔗
|
yipdw |
https://git-annex.branchable.com/tips/offline_archive_drives/ and https://git-annex.branchable.com/location_tracking/ |
20:36
🔗
|
db48x |
Frogging: I'm sorry, I completely forgot to answer your question! |
20:39
🔗
|
HCross |
db48x, split-collection: line 8: syntax error near unexpected token `(' |
20:39
🔗
|
HCross |
on your script |
20:41
🔗
|
Kaz |
iamine is not playing nicely for me |
20:42
🔗
|
db48x |
HCross: odd |
20:42
🔗
|
db48x |
line 8 is just lines=$(wc -l "${itemfile}" | cut -d ' ' -f 1) |
20:43
🔗
|
HCross |
not on https://raw.githubusercontent.com/ArchiveTeam/IA.BAK/server/split-collection |
20:44
🔗
|
db48x |
oooh |
20:44
🔗
|
Kaz |
yeah, the 'new' mkshard also isn't on the repo |
20:45
🔗
|
|
atomotic has joined #internetarchive.bak |
20:47
🔗
|
db48x |
Kaz: I committed it: https://github.com/ArchiveTeam/IA.BAK/commit/5b457779b2ffd9fb1342671a3dbc1cd73edcd14e#diff-85b50cc2f5b54f1254ecdcd0fec1959d |
20:47
🔗
|
db48x |
just pushed the fixed split-collection script |
20:48
🔗
|
Kaz |
https://github.com/ArchiveTeam/IA.BAK/blob/server/mkSHARD doesn't match /usr/local/IA.BAK/mkSHARD, which one should we be using? |
20:49
🔗
|
db48x |
well, let's look at the diff |
20:50
🔗
|
HCross |
db48x, http://paste.nerds.io/ujacucekid.vhdl what am I doing wrong? |
20:50
🔗
|
db48x |
Kaz: yea, that's the old version |
20:51
🔗
|
db48x |
there, I did a git pull |
20:52
🔗
|
Kaz |
okay, thanks |
20:52
🔗
|
db48x |
++ basename 'archivebot-files/-*.json' .json |
20:52
🔗
|
db48x |
error: "archivebot-meta-*.json" should be readable |
20:54
🔗
|
db48x |
it's in quotes, so it didn't expand the glob |
20:54
🔗
|
HCross |
but its made a load of shards |
20:55
🔗
|
db48x |
naturally |
20:55
🔗
|
HCross |
ah |
20:55
🔗
|
db48x |
that's why the script has a for loop, to work on each one independently |
20:56
🔗
|
db48x |
but if the filename is quoted, then the loop only runs once, with f equal to "archivebot-meta-*.json" rather than with f equal to "archivebot-meta-0.json", then equal to "archivebot-meta-1.json", etc |
20:57
🔗
|
HCross |
looks like ive got 357 shards of just archivebot |
20:58
🔗
|
HCross |
hmm, thats not right |
20:58
🔗
|
db48x |
that is a lot of shards |
20:59
🔗
|
HCross |
yea, I messed up |
21:00
🔗
|
HCross |
34, that looks better |
21:00
🔗
|
db48x |
:) |
21:01
🔗
|
HCross |
db48x, also. annexed files in working tree: 228575 |
21:01
🔗
|
HCross |
- but from the info on archive.org for each collection there are only 68k files |
21:01
🔗
|
db48x |
HCross: what collection are you looking at? |
21:02
🔗
|
HCross |
The_Sydney_Morning_Herald svoboda_newspaper antiochnews The_Notre_Dame_Scholastic NCAA-News dailyracingform |
21:04
🔗
|
db48x |
23856 + 23183 + 5305 + 2206 + 857 + 10956 = ~66k items |
21:04
🔗
|
db48x |
but yea, each item has multiple files |
21:05
🔗
|
HCross |
ahh, its a file thing |
21:05
🔗
|
HCross |
sorry about my cockups this evening |
21:05
🔗
|
db48x |
indeed |
21:07
🔗
|
db48x |
no worries |
21:08
🔗
|
db48x |
I'd never done it before until last night either, I just didn't have anyone to talk to about the mistakes I made :) |
21:12
🔗
|
HCross |
db48x, so I now have http://harrycross.me/dae.png |
21:12
🔗
|
HCross |
- what do I do on them next? Tried running get_item_files.jq and I get a compile error |
21:12
🔗
|
db48x |
show me the command and the error? |
21:14
🔗
|
HCross |
jq get_item_files.jq |
21:14
🔗
|
HCross |
error: get_item_files is not defined |
21:14
🔗
|
HCross |
get_item_files.jq1 compile error |
21:14
🔗
|
HCross |
no matter what file I add on the end of the command |
21:17
🔗
|
db48x |
oh |
21:17
🔗
|
db48x |
if you want it to run commands from a file, you have to use the -f option |
21:17
🔗
|
db48x |
jq -f somefile.jq |
21:18
🔗
|
db48x |
and since this script produces an array of strings that we want to treat as lines of a file, you also need the -r option |
21:18
🔗
|
db48x |
so it becomes jq -r -f get_item_files.jq archivebot-meta-00.txt |
21:19
🔗
|
db48x |
oh, and then you want to redirect the output, so tack on > archivebot-meta-00.tsv to the end |
21:21
🔗
|
HCross |
parse error: Invalid numeric literal at line 2, column 0 |
21:22
🔗
|
db48x |
hrm |
21:22
🔗
|
db48x |
what's line 2 of your input file look like? |
21:23
🔗
|
db48x |
some horrible json object, I hope |
21:24
🔗
|
HCross |
http://harrycross.me/123.png |
21:24
🔗
|
HCross |
is archivebot-meta-00.txt |
21:25
🔗
|
db48x |
ah, those are the ids, not the metadata |
21:25
🔗
|
db48x |
split-collection should run ia-mine again using those as input |
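A quick way to tell the two kinds of file apart before handing one to jq is to check whether its lines parse as JSON objects. A minimal sketch, assuming the metadata files hold one JSON object per line (as described earlier in the log) and the id lists hold one bare identifier per line; `classify_lines` is a hypothetical helper, not part of IA.BAK or ia-mine.

```python
import json


def classify_lines(lines):
    """Report whether lines look like JSON objects, bare identifiers, or a mix."""
    objects = other = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            parsed = json.loads(line)
        except ValueError:
            other += 1  # a bare identifier is not valid JSON
            continue
        if isinstance(parsed, dict):
            objects += 1
        else:
            other += 1  # e.g. an all-digit identifier parses as a number
    if objects and not other:
        return "json"
    if other and not objects:
        return "identifiers"
    return "mixed" if objects else "empty"


print(classify_lines(['{"files": []}', '{"metadata": {}}']))
print(classify_lines(["archiveteam_archivebot_go_20161110020001"]))
```

To check a real file, pass an open file handle, for example `classify_lines(open("archivebot-meta-00.txt"))`; a result of "identifiers" means the second ia-mine pass has not produced metadata for it yet.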
21:25
🔗
|
HCross |
it did something, then spat out the IA mine readme |
21:26
🔗
|
db48x |
heh |
21:27
🔗
|
HCross |
db48x, http://paste.nerds.io/fomihecuma.vhdl full log of what it did |
21:27
🔗
|
HCross |
TO MAKE THOSE |
21:27
🔗
|
HCross |
to make those files |
21:30
🔗
|
db48x |
yea, you didn't fix the error on line 13 |
21:30
🔗
|
db48x |
I guess you edited split-collection, putting quotes around the glob of the for loop? |
21:31
🔗
|
HCross |
db48x, http://paste.nerds.io/dokefiboxu.pl is that file, only edit is line 9 |
21:32
🔗
|
db48x |
http://paste.nerds.io/dokefiboxu.sh |
21:32
🔗
|
db48x |
there, at least the colors are better now :) |
21:33
🔗
|
db48x |
your previous log clearly shows the quotes |
21:33
🔗
|
db48x |
maybe you edited it back after it failed, but forgot to rerun it? gremlins maybe? |
21:33
🔗
|
Frogging |
sorry if the question is stupid, but why the sudden interest in archiving the Archive? the election was mentioned but I didn't really understand the connection |
21:37
🔗
|
db48x |
Frogging: we've been doing this for a while now, but we've let it slide a bit |
21:38
🔗
|
Frogging |
yeah, I know. it just seemed to get a sudden bump to top priority on Tuesday |
21:39
🔗
|
db48x |
could be because previous attempts to get it going again didn't really accomplish much? |
21:40
🔗
|
Frogging |
maybe, though the implication was that it's now more urgent because of the election |
21:40
🔗
|
Frogging |
I think it's great that it's getting more attention, I didn't understand that though |
21:40
🔗
|
db48x |
hmm. no idea about that |
21:40
🔗
|
db48x |
I suppose it's possible, but it didn't seem that way to me |
21:41
🔗
|
HCross |
db48x, can you tell me what line 13 should be, not quite getting this (am a real newbie with this kind of thing), sorry |
21:41
🔗
|
|
Kksmkrn has joined #internetarchive.bak |
21:42
🔗
|
db48x |
HCross: just "done" |
21:42
🔗
|
db48x |
oh, in the log |
21:42
🔗
|
db48x |
not the 13th line of the script |
21:42
🔗
|
db48x |
it shouldn't be an error |
21:43
🔗
|
db48x |
let me run it and show you a log of when it works |
21:43
🔗
|
HCross |
ok, thanks |
21:43
🔗
|
|
kyan has quit IRC (Read error: Operation timed out) |
21:44
🔗
|
Jon |
yay 21G down of the 1T allocated so far |
21:45
🔗
|
db48x |
nice |
21:48
🔗
|
db48x |
bah: {"params": {"rows": 50, "q": "collection:archiveteam-fire", "page": 397, "output": "json"}, "url": "https://archive.org/advancedsearch.php", "retries_left": 0, "message": "Maximum retries exceeded for url, giving up."} |
21:51
🔗
|
|
Start_ has joined #internetarchive.bak |
21:52
🔗
|
|
Start has quit IRC (Read error: Connection reset by peer) |
21:52
🔗
|
db48x |
HCross: do you see a similar stream of error messages when you run ia-mine? |
21:52
🔗
|
HCross |
nope, just the IA mine readme |
21:52
🔗
|
db48x |
ah, it only happens on that collection |
21:53
🔗
|
db48x |
I'll look into that later, back to helping you |
21:53
🔗
|
HCross |
thanks |
21:54
🔗
|
db48x |
oooh, lol |
21:55
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
21:55
🔗
|
db48x |
it's because I am dumb |
21:58
🔗
|
|
wp494 has quit IRC (Read error: Connection reset by peer) |
22:02
🔗
|
db48x |
http://paste.nerds.io/ubuhujimur.sh |
22:02
🔗
|
db48x |
HCross: I pushed a fix |
22:03
🔗
|
db48x |
basically, I made some changes to split-collection, then had the bright idea to split archiveteam-fire by date, and didn't actually run it again with my changes |
22:04
🔗
|
db48x |
the quotes in your log were a red herring |
22:04
🔗
|
db48x |
I was forgetting that this is bash we're dealing with |
22:04
🔗
|
HCross |
i just ordered a time4vps for storage for this, and as soon as I paid, their portal crashed |
22:05
🔗
|
db48x |
heh |
22:08
🔗
|
HCross |
that looks more like what it should be doing db48x - its also taking more time |
22:08
🔗
|
db48x |
yay |
22:09
🔗
|
|
wp494 has joined #internetarchive.bak |
22:09
🔗
|
HCross |
db48x, is https://billing.time4vps.eu down/showing a cloudflare live thingy for you? |
22:11
🔗
|
db48x |
it's showing nothing but a spinning spinner so far |
22:11
🔗
|
db48x |
ah, indeed it is |
22:12
🔗
|
HCross |
rip, just fed them 20 eur |
22:12
🔗
|
Kksmkrn |
Loading for me |
22:13
🔗
|
Kksmkrn |
Nevermind, I should drink more, talk less.. live thingy |
22:14
🔗
|
HCross |
db48x, it worked this time, have a nice JSON car-crash here |
22:14
🔗
|
HCross |
cat'ing that first file was a BAD idea |
22:14
🔗
|
db48x |
:D |
22:14
🔗
|
db48x |
use less instead |
22:14
🔗
|
db48x |
or head |
22:15
🔗
|
Senji |
hexdump -C |
22:16
🔗
|
HCross |
especially on a server 200ms away. Also, jq is still cocking up |
22:16
🔗
|
HCross |
same parse error: Invalid numeric literal at line 2, column 0 |
22:22
🔗
|
|
kyan has joined #internetarchive.bak |
22:35
🔗
|
|
Kksmkrn has left zZzZ.. |
22:49
🔗
|
db48x |
HCross: sounds like you fed it another file containing ids instead of json :) |
23:01
🔗
|
HCross |
db48x, I now have .tsv files full of JSON and .json files full of lists of IDs |
23:03
🔗
|
HCross |
db48x, I am going to head to bed, ill sort this out tomorrow. Goodnight |
23:20
🔗
|
SketchPho |
Looks like good work today. |
23:20
🔗
|
SketchPho |
Do I need to help in any way? |
23:24
🔗
|
db48x |
HCross: awesome! :D |
23:25
🔗
|
db48x |
SketchPho: I think we're making progress |
23:25
🔗
|
SketchPho |
Agreed. (Checked scrollback) |
23:26
🔗
|
db48x |
it's not really repeatable, quality-controlled, iso9000 certified progress, but it's not too bad :) |
23:26
🔗
|
HCross2 |
I'll sort out the rest of it later tomorrow. It should be easier from here. Its now just converting those files to shards |
23:27
🔗
|
HCross2 |
I've also got 4tb downloading |
23:28
🔗
|
db48x |
I had to stop after 15gb of shard12 |