#internetarchive.bak 2016-11-10,Thu


Time	Nickname	Message
00:02 ^🔗		sep332 has quit IRC (konversation out)
01:08 ^🔗		kyan has joined #internetarchive.bak
01:24 ^🔗		Lord_Nigh has quit IRC (Ping timeout: 633 seconds)
01:25 ^🔗		Lord_Nigh has joined #internetarchive.bak
02:01 ^🔗		Lord_Nigh has quit IRC (Read error: Operation timed out)
02:08 ^🔗		Lord_Nigh has joined #internetarchive.bak
02:26 ^🔗		Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
02:28 ^🔗		Lord_Nigh has joined #internetarchive.bak
03:08 ^🔗		Lord_Nigh has quit IRC (Read error: Operation timed out)
03:31 ^🔗		Lord_Nigh has joined #internetarchive.bak
04:26 ^🔗		Start has quit IRC (Quit: Disconnected.)
04:28 ^🔗		Start has joined #internetarchive.bak
04:50 ^🔗	SketchPho	No action?
04:50 ^🔗	SketchPho	I need to tap closure. Is there other stuff people have questions on?
04:51 ^🔗	SketchPho	I'm not kidding when I say this is day in day out first priority
04:51 ^🔗	db48x	do we have a list of things to put in a new shard? I can run the scripts
04:53 ^🔗	db48x	actually, I take that back. I can't log into the server
04:53 ^🔗	db48x	oh, nvm
04:54 ^🔗	db48x	I was doing it wrong
05:04 ^🔗	db48x	gotta grab the latest census
05:07 ^🔗	db48x	ok, confusion
05:07 ^🔗	db48x	archiveteam_census_2016 doesn't have what I expected; it's just a list of identifiers with none of the rest of the information
05:08 ^🔗		Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
05:14 ^🔗		Lord_Nigh has joined #internetarchive.bak
05:24 ^🔗		Lord_Nigh has quit IRC (Read error: Operation timed out)
05:27 ^🔗		Lord_Nigh has joined #internetarchive.bak
05:30 ^🔗	SketchPho	I'd like us to turn to archivebot and archiveteam items for new shards
05:30 ^🔗	db48x	sure
05:30 ^🔗	db48x	do we have any that aren't made of 50GB warcs?
05:51 ^🔗	bwn	db48x: sorry, yes, no metadata in there yet
05:52 ^🔗	db48x	ah
05:52 ^🔗	db48x	:)
05:52 ^🔗	bwn	i've started ia-mine running on archivebot collection to get metadata
05:59 ^🔗	yipdw	bwn: also use http://archive.fart.website/archivebot/viewer/items/
05:59 ^🔗	yipdw	that may be a more complete index of archivebot materials, as it includes a large number of items that are not in the archivebot collection
06:08 ^🔗	HCross2	Best person to make my SSH key available to?
06:16 ^🔗	bwn	yipdw: looks like the items you're talking about got added to the archivebot collection at some point
06:17 ^🔗	bwn	ia gave me archiveteam_archivebot_go_20161110020001 but it hasn't been added to the viewer yet
06:17 ^🔗	yipdw	some of them did
06:17 ^🔗	yipdw	yeah, the viewer updates every 24 hours
06:17 ^🔗	yipdw	items are created every, uh, 3 or so
06:17 ^🔗	db48x	HCross2: you can just paste it in here and I'll add it to the server. ed25519 keys are preferred
06:17 ^🔗	yipdw	depending on upload speed from fos
06:17 ^🔗		kyan has quit IRC (Quit: Leaving)
06:18 ^🔗	HCross2	https://www.irccloud.com/pastebin/BPtozFKj
06:19 ^🔗	HCross2	db48x:
06:19 ^🔗	db48x	strongly preferred :)
06:20 ^🔗	HCross2	Ah ok. I'll have to re do it
06:26 ^🔗	Deewiant	Is it normal to see "verification of content failed" and "Unable to access these remotes: web" often? (Happens for approximately 1 file out of 5)
06:26 ^🔗	db48x	Deewiant: sometimes
06:26 ^🔗	db48x	even when an item is hidden on IA it can still be mentioned in the backup, but nobody will be able to download it
06:36 ^🔗	bwn	yipdw: unless i'm making a mistake, it looks like the viewer matches up with `ia search collection:archivebot`
06:36 ^🔗	yipdw	ah ok cool
06:39 ^🔗	bwn	:)
06:39 ^🔗	bwn	db48x: that metadata is finished, i also created a sorted list of the archivebot item sizes
06:40 ^🔗	bwn	where would a good place to put this for anyone who needs it? i could create an ia item for now
06:41 ^🔗	db48x	that's a nice meta way to do it
06:41 ^🔗	db48x	or just serve it up the old-fashioned way and I'll download it from you
06:52 ^🔗	Deewiant	db48x: Ok, it just looks a bit worrying to have it show up so often. Seems to mostly (or only?) affect _archive.torrent and _meta.xml files though.
06:52 ^🔗	bwn	db48x: http://erebos.undo.it/MIRROR/db48x/
06:55 ^🔗	db48x	Deewiant: ah, that's actually a slightly different issue; we don't know the correct hash of those files so they can't be verified
06:56 ^🔗	db48x	bwn: I'm taking a look now
06:57 ^🔗	bwn	i'm writing up a readme so you know what's what
06:58 ^🔗	db48x	hmm
06:59 ^🔗	db48x	the script we have reads from a file with the md5 hash, the size, the category and the file url
07:04 ^🔗	HCross2	ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIP5OhU2Lita9RdjPkX9N0w9wZnmVlednUDEx24bVn4Mk IABAK key - Harry C
07:04 ^🔗	HCross2	db48x: ^
07:05 ^🔗	HCross2	TIL: You have to use the development version of PuTTYgen for that
07:05 ^🔗	bwn	get_item_files.jq outputs a partial url collection/filename for each file
07:06 ^🔗	db48x	HCross2: great, what username do you want?
07:06 ^🔗	HCross2	HCross
07:10 ^🔗	bwn	that get_item_files.jq can be modified to get the format you need with the jq filters from https://archive.org/details/ia-bak-census_20150304
07:10 ^🔗	bwn	^ can likely be modified
07:11 ^🔗	db48x	HCross2: ok, you should be able to log in to iabak.archiveteam.org, username hcross
07:11 ^🔗	HCross2	Would it be a good idea to run another IA census now?
07:13 ^🔗		Atom has quit IRC (Read error: Connection reset by peer)
07:14 ^🔗	HCross2	db48x: thanks, am in
07:14 ^🔗	db48x	excellent
07:15 ^🔗	HCross2	I need to leave for work now, but ill have a little play around when I get into the office
07:15 ^🔗	db48x	oh, you know what? it just occurred to me
07:15 ^🔗	db48x	ok
07:15 ^🔗	db48x	there's a minor wrinkle that I'll straighten out shortly
07:17 ^🔗	db48x	closure's way of setting machines up is better than mine, but requires remembering to do things better
07:28 ^🔗	db48x	hrm
07:28 ^🔗	db48x	I can't remember how we handled items that are in multiple collections
07:29 ^🔗	HCross2	How are we also going to do large collections
07:30 ^🔗	db48x	split them across multiple shards
07:31 ^🔗	db48x	.files |
07:31 ^🔗	db48x	map(
07:31 ^🔗	db48x	select(.source != "derivative") |
07:31 ^🔗	db48x	# if case for catching files with size=null (i.e. files.xml).
07:31 ^🔗	db48x	if .size != null then
07:31 ^🔗	db48x	{"name": .name, "size": (.size | tonumber), "collection": $c, "md5": .md5}
07:31 ^🔗	db48x	else
07:31 ^🔗	db48x	{"name": .name, "size": 0, "collection": $c, "md5": .md5}
07:31 ^🔗	db48x	end
07:32 ^🔗	db48x	) |
07:32 ^🔗	db48x	map([.md5, .size, .collection[0], .name]) | map(@tsv) | .[]
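For readers who do not use jq, here is a rough Python sketch of what the filter pasted above appears to do: drop derivative files, treat a missing size as 0, and emit one tab-separated line of md5, size, first collection, and file name. The layout of the item JSON (`files`, `metadata.collection`) and the sample values are assumptions for illustration only; the chat does not show how `$c` is bound.

```python
def item_to_tsv_rows(item):
    """Return one TSV line per non-derivative file of an item (sketch)."""
    # Assumption: the collection list lives under metadata.collection.
    collections = item.get("metadata", {}).get("collection", [])
    if isinstance(collections, str):
        collections = [collections]
    # Like .collection[0] in the filter: keep only the first collection.
    collection = collections[0] if collections else ""

    rows = []
    for f in item.get("files", []):
        if f.get("source") == "derivative":
            continue  # same as select(.source != "derivative")
        # Files with no size (e.g. files.xml) are counted as 0 bytes.
        size = int(f["size"]) if f.get("size") is not None else 0
        rows.append("\t".join([f.get("md5") or "", str(size), collection, f["name"]]))
    return rows


# Made-up sample item, not real Internet Archive data.
sample_item = {
    "metadata": {"collection": ["archivebot", "archiveteam"]},
    "files": [
        {"name": "site.warc.gz", "source": "original", "size": "1024", "md5": "aaa"},
        {"name": "site_thumb.jpg", "source": "derivative", "size": "10", "md5": "bbb"},
        {"name": "item_files.xml", "source": "metadata", "md5": "ccc"},
    ],
}

for row in item_to_tsv_rows(sample_item):
    print(row)
```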
07:33 ^🔗	db48x	bwn: is every single item in this dataset in the archivebot collection?
07:33 ^🔗	HCross2	Ah. So for example, I would like to back up https://archive.org/details/archiveteam_newssites which is going up by at least half a terabyte a day
07:36 ^🔗	db48x	yea, I guess they are
07:38 ^🔗	bwn	yes, from `ia search collection:archivebot`
07:39 ^🔗	bwn	i need to eat and get some sleep soon, hope that stuff helps :)
07:40 ^🔗	db48x	it does
07:44 ^🔗	db48x	archivebot and archiveteam_newssites are both going to have to end up split across a bunch of shards
08:02 ^🔗	HCross2	Easiest way of splitting them?
08:04 ^🔗	Kaz	db48x: can you drop my key in too please?
08:04 ^🔗	Kaz	only here for a few minutes before I run off to work, but have key
08:12 ^🔗	Kaz	ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIHhFYMd9Htlf9wPZzIDyqbYYNwuo3m+kWQ9/pfAD/TE9 Kaz IABAK
08:12 ^🔗	Kaz	^ if you're around, gotta run now
08:12 ^🔗	db48x	Kaz: I will add you in
08:13 ^🔗	Kaz	awesome, thanks
08:13 ^🔗	db48x	you will be kaz on the server
08:16 ^🔗		jsp12345 has joined #internetarchive.bak
08:43 ^🔗	*	db48x sighs
08:57 ^🔗		atomotic has joined #internetarchive.bak
09:45 ^🔗		antomatic has quit IRC (Read error: Connection reset by peer)
09:46 ^🔗		antomatic has joined #internetarchive.bak
10:06 ^🔗		atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
11:11 ^🔗		atomotic has joined #internetarchive.bak
11:26 ^🔗		atomotic has quit IRC (Remote host closed the connection)
11:28 ^🔗		atomotic has joined #internetarchive.bak
12:29 ^🔗	SketchPho	Please do not add news sites to this project quite yet
12:30 ^🔗	SketchPho	I would want us to go after the archivebot collection first, just because that is pure websites that were grabbed for various reasons
12:31 ^🔗	SketchPho	This is also going to bring to bear our management of large amounts of shards, which we might as well deal with anyway
12:32 ^🔗	SketchPho	I also understand that we will be forced to deal with the situation of a growing collection, which means that we might want to just focus on a cutoff date with archivebot
12:33 ^🔗	SketchPho	Perhaps it is time for me to begin discussions on data hoarders for space
13:05 ^🔗	HCross2	Ok. How do we cut up a large collection like archivebot
13:14 ^🔗	luckcolor	By timestamp or by data size i suppose
13:17 ^🔗	luckcolor	Shards with big files could have a smaller number of files, unlike the normal ones
13:17 ^🔗	HCross2	But we add it to the shard per item
13:20 ^🔗	luckcolor	I mean collections with Big files :)
13:20 ^🔗	luckcolor	*meant
13:20 ^🔗	luckcolor	HCross2: ?
13:22 ^🔗	HCross2	Each shard has "items" in it, really not sure how we cut it up. I am probably mistaken about that though. Someone better can probably advise
13:22 ^🔗	db48x	with shell scripting
13:23 ^🔗	db48x	I wrote one
13:23 ^🔗	db48x	which I've been testing
13:24 ^🔗	db48x	SketchPho: since archivebot is sooo big I looked at archiveteam_fire instead
13:24 ^🔗	db48x	400k files, so it could be four shards
13:24 ^🔗	db48x	(easy, just split -n1/4 ...)
13:25 ^🔗	db48x	but I checked the dates, and 2011-2015 is 100k files and 4.3tb, so I figure I'll do it that way
13:28 ^🔗	db48x	HCross2: https://gist.github.com/db48x/a1a8847916ab149abbfce25517944bdc
13:28 ^🔗	db48x	I'll check it in to IA.BAK as well
13:28 ^🔗	luckcolor	Isn't it better to have smaller shards?
13:28 ^🔗	SketchPho	I trust your judgement
13:29 ^🔗	db48x	luckcolor: total file size is less important than number of files
13:30 ^🔗	db48x	because git gradually slows down on repositories with more files in them
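A minimal sketch of the count-based split described above, assuming the input is simply the list of TSV lines (one per file) and that the 100k ceiling mentioned later in the log is the limit. The function name is hypothetical, and this plain chunking does not try to keep all files of one item in the same shard.

```python
def split_by_file_count(lines, max_files=100_000):
    """Cut a list of file lines into consecutive chunks of at most max_files."""
    return [lines[i:i + max_files] for i in range(0, len(lines), max_files)]


# Toy input: 250 fake file lines split with a small limit to keep the demo short.
fake_lines = [f"file-{n}" for n in range(250)]
shards = split_by_file_count(fake_lines, max_files=100)
print([len(s) for s in shards])
```

Because git slows down with the number of files rather than their total size, the limit here is on lines, not bytes.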
13:30 ^🔗	luckcolor	Ok right
13:31 ^🔗	db48x	whereas the files are split up across multiple contributors
13:32 ^🔗	db48x	annoyingly, archiveteam_fire for 2016 is 300k files but only .6tb :)
13:36 ^🔗	HCross2	db48x: will you be around in 5 and a half hours or so? I want to get started on shard making and want someone to show me the ropes
13:39 ^🔗	HCross2	4 and a bit I mean
13:39 ^🔗	db48x	I'll probably be asleep
13:40 ^🔗	db48x	for about the next 8 hours
13:42 ^🔗	HCross2	Ok
13:44 ^🔗	db48x	it's not very hard
13:45 ^🔗	db48x	use ia-mine (https://github.com/jjjake/iamine) to get the list of items in a collection, and the metadata for each of those items
13:46 ^🔗	db48x	you can make a shard out of small collections, or split up a large collection to make several shards; either way you end up with a list of items
13:46 ^🔗	db48x	then use jq to parse the metadata (which is one json object per item), and output a tsv file
13:46 ^🔗	db48x	then feed the tsv file to a slightly-modified version of the mkSHARD script
13:48 ^🔗	db48x	https://gist.github.com/db48x/a1a8847916ab149abbfce25517944bdc
13:48 ^🔗	HCross2	Awesome
13:49 ^🔗	db48x	I'll get these checked into the IA.BAK repository at some point; probably after I've slept
13:50 ^🔗	db48x	we used to do it slightly differently; ia-mine didn't exist, but we had a single huge json dump of every single public item on IA called the census
13:50 ^🔗	HCross2	Cool. What's the jq command for it please
13:50 ^🔗	db48x	jq -r -f file.jq input.json > files.txt
13:51 ^🔗	HCross2	Thanks :)
13:51 ^🔗	db48x	if you look on the server, you can see the huge TSV that we created from the huge json dump in ~joey/IA.BAK
13:52 ^🔗	db48x	ok, I'm about to hit the normal-sized red button
13:54 ^🔗	db48x	ok, I think that worked
13:55 ^🔗	db48x	anyone want to try out SHARD12 before I hit the big red button?
14:03 ^🔗	db48x	well, I guess I can hit the button without testing it
14:03 ^🔗	db48x	what's the worst that could happen?
14:06 ^🔗		atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
14:21 ^🔗		Whopper has joined #internetarchive.bak
14:38 ^🔗	db48x	merge: refs/remotes/origin/synced/master - not something we can merge
14:38 ^🔗	db48x	failed
14:38 ^🔗	db48x	(merging origin/git-annex into git-annex...)
14:38 ^🔗	db48x	(recording state in git...)
14:38 ^🔗	db48x	that's not quite what I expected
14:40 ^🔗	db48x	ok, when I rerun iabak it works fine
14:41 ^🔗	db48x	so I guess it's just a bug that happens to the first person to try out a new shard
14:41 ^🔗		Atom has joined #internetarchive.bak
14:49 ^🔗	db48x	time for me to sleep
15:05 ^🔗		Jon has joined #internetarchive.bak
15:40 ^🔗		DFJustin has joined #internetarchive.bak
15:49 ^🔗	*	closure waves
15:50 ^🔗	kurt	o/
15:52 ^🔗		atomotic has joined #internetarchive.bak
15:54 ^🔗	db48x	closure: howdy
15:58 ^🔗	closure	sounds like you guys are making headway. I got a slack invite, but would rather avoid slack, so I'll be over here
16:00 ^🔗	db48x	yes, managed to add a shard
16:00 ^🔗	db48x	I chickened out and didn't add a bunch of shards all at once though
16:03 ^🔗	closure	you asked about items in multiple collections. IIRC the list I generated picked an arbitrary collection for such items, so they only go into one
16:03 ^🔗	db48x	yea, I did the same
16:06 ^🔗	db48x	biggest development is that there's no census any more
16:06 ^🔗	db48x	but there is ia-mine, which is handy
16:07 ^🔗	closure	db48x: propellor pull request> hasGroup takes a User and a Group, so hasUser should not deconstruct the parameters
16:07 ^🔗	closure	simplest implementation: hasUser = flip hasGroup
16:08 ^🔗	closure	no census anymore? What is ia-mine?
16:09 ^🔗	closure	ah, I guess it finds the items in a collection
16:09 ^🔗	db48x	ia-mine fetches the same metadata json that we had in the census, but for arbitrary searches, from the command-line
16:09 ^🔗	db48x	ah, I didn't know about flip
16:09 ^🔗	db48x	(though I figured there would be something)
16:15 ^🔗	db48x	bah
16:15 ^🔗	db48x	I couldn't guess the type signature of flip well enough for hoogle to find it
16:15 ^🔗	db48x	though it's obvious now that I look at it
16:25 ^🔗	Jon	pulling a shard onto my first donated 1T; I have a second 1T but not contiguous, might need to run a second IA.BAK instance? not sure
16:52 ^🔗		Lord_Nigh has quit IRC (Ping timeout: 250 seconds)
16:52 ^🔗		Lord_Nigh has joined #internetarchive.bak
16:58 ^🔗		Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
17:05 ^🔗		Lord_Nigh has joined #internetarchive.bak
17:30 ^🔗		atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
17:51 ^🔗	Kaz	right
17:51 ^🔗	Kaz	I'm on the server, how do I master the shards
17:52 ^🔗	closure	db48x: respin the patch?
17:52 ^🔗	closure	Kaz: did you see my wiki page on how to do it?
17:53 ^🔗	Kaz	I did not. that might be a place to start
17:54 ^🔗	closure	http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/admin
17:56 ^🔗	Kaz	thanks, taking a look now
18:17 ^🔗		HCross has joined #internetarchive.bak
18:41 ^🔗		sep332 has joined #internetarchive.bak
18:46 ^🔗		kyan has joined #internetarchive.bak
18:59 ^🔗	Frogging	so are these the instructions I should follow if I want to donate space? anything else? http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation
19:03 ^🔗	sep332	Frogging: the README is very helpful too, especially "tuning resource usage" part
19:03 ^🔗	Frogging	yah I'm looking at that
19:04 ^🔗	Frogging	wondering if maybe I should use a filesystem quota or sub-image
19:05 ^🔗	sep332	we've been trying to keep it simple but that certainly wouldn't hurt
19:06 ^🔗	Frogging	it's just the diskreserve option is for how much space to _not_ use. just slightly complicates allocation on my array. it's only a minor issue though really
19:08 ^🔗	HCross	closure, getting a ton of "server refused our key" now
19:25 ^🔗	Kaz	right
19:26 ^🔗	Kaz	so one of the issues I see is that we can't actually update the iabak repo to point to the new shards
19:26 ^🔗	Kaz	uh, setup a new shard repo even
19:27 ^🔗	HCross	ignore my error, cant spell my own name this evening
19:27 ^🔗	Kaz	reading the wrong bit
19:29 ^🔗	HCross	looking at the examples, I see "10191 jstor_jamerinstcrimlaw", how do I get that first number?
19:30 ^🔗	HCross	also, I dont see a ./mkSHARD file
19:30 ^🔗	HCross	done the clone
19:30 ^🔗	Kaz	HCross: see /usr/local/IA.BAK
19:31 ^🔗	Kaz	HCross: and/or checkout server, rather than master
19:31 ^🔗	HCross	ah, I checked out /master
19:35 ^🔗	HCross	Kaz,
19:35 ^🔗	HCross	hcross@ia-bak:~/IA.BAK$ git checkout -b server master/server
19:35 ^🔗	HCross	fatal: Cannot update paths and switch to branch 'server' at the same time.
19:35 ^🔗	HCross	Did you intend to checkout 'master/server' which can not be resolved as commit?
19:35 ^🔗	HCross	can you please tell me where I am being a clot?
19:35 ^🔗	Kaz	git checkout server
19:36 ^🔗	HCross	thanks
19:36 ^🔗	Kaz	-b makes new branch, which you don't want because it already exists
19:36 ^🔗	Kaz	or at least, afaict
19:37 ^🔗	HCross	going to create shard 14, based on https://archive.org/details/newspapers
19:38 ^🔗	Kaz	13?
19:39 ^🔗	HCross	yours
19:39 ^🔗	Kaz	I guess one thing to think of, is if items are added to a collection, how are these reflected in the iabak version?
19:43 ^🔗	HCross	http://paste.nerds.io/uvaqeteyah.md im now slighly confused about why that went wrong
19:44 ^🔗	Kaz	have you got the md5_collection_url.txt.pick1.sorted.uniq link?
19:44 ^🔗	HCross	yes
19:45 ^🔗	Kaz	it 'sort of works' if you do it from /usr/local/IA.BAK, but have no perms to create the SHARD14.list
19:48 ^🔗	HCross	yea, we need a way of doing it in our directories
19:52 ^🔗	Kaz	wait hang on
19:55 ^🔗	Kaz	HCross: mkSHARD in /usr/local/IA.BAK is different to the one in the repo
19:56 ^🔗	Kaz	this explains some things
19:57 ^🔗	HCross	that one seems to be working
19:57 ^🔗	HCross	its doing thinking about it now
19:57 ^🔗	HCross	ive cp'ed it into my folder and renamed it
20:00 ^🔗	HCross	228575 SHARD14.list
20:00 ^🔗	HCross	may be a tad high
20:02 ^🔗	HCross	or not hang on
20:03 ^🔗	db48x	yea, don't go over 100k
20:03 ^🔗	HCross	thats the file size I think though, its still generating. My calculation is 60k
20:03 ^🔗	Kaz	db48x: could you push the updated mkshard to the repo please? just in case there are other changes that should be pulled through too
20:03 ^🔗	db48x	although the wiki page says to use that file, it's an out of date index to the archive
20:04 ^🔗	db48x	probably better to use ia-mine to generate a more up-to-date index of the collection you're interested in
20:04 ^🔗	HCross	db48x, if you show me how, I can get us a new one
20:04 ^🔗	HCross	should we do a whole archive one?
20:04 ^🔗	db48x	nah, takes ages
20:05 ^🔗	db48x	ia-mine --secure -c -s "collection:${1}" --itemlist
20:05 ^🔗	HCross	how long is "ages"
20:05 ^🔗	db48x	a life-age of the earth
20:05 ^🔗	db48x	seriously, the IA has 20PB of stuff
20:06 ^🔗	db48x	takes ages, but if you just do one collection it only takes a minute or so
20:06 ^🔗	db48x	see the split-collection.sh script that I checked in
20:06 ^🔗	db48x	it first gets a list of identifiers for the items in the collection, then gets the json metadata for each item
20:07 ^🔗	db48x	then you use jq to convert the json into a tsv of just the four things we need in order to make the git-annex repository
20:08 ^🔗	db48x	you can adjust that basic recipe to suit the needs of the moment
20:08 ^🔗	HCross	about to kick off several shards with archivebot.... db48x is that OK?
20:08 ^🔗	db48x	probably :)
20:09 ^🔗	db48x	how are you splitting them between shards?
20:09 ^🔗	HCross	using your splitting script
20:10 ^🔗	db48x	ok. just be aware that the archivebot collection is a smaller number of huge items, so 100k files is a bad way to split it :)
20:10 ^🔗	HCross	best idea to split it?
20:10 ^🔗	db48x	I'm not sure
20:10 ^🔗	db48x	probably into about 50 shards of 2-3TB
20:11 ^🔗	HCross	archivebot is 1.8k items.
20:12 ^🔗	HCross	assuming each pack is 50GB, its 5 items per pack
20:12 ^🔗	db48x	sounds right, but don't assume
20:12 ^🔗	HCross	yea, ill check now
20:12 ^🔗	db48x	that jq script dumps out the file sizes too, so dice it up and measure the size of all the pieces :)
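One way to "dice it up and measure" for a collection of few but huge items is to pack items in order until a target shard size is reached. This is only a sketch: the function name, the greedy in-order strategy, and the sample sizes are assumptions, and the per-item sizes would in practice come from summing the size column of the TSV.

```python
def group_by_size(items, target_bytes):
    """Greedily pack (identifier, size) pairs into shards of about target_bytes.

    A single item larger than the target still gets a shard of its own.
    """
    shards, current, current_size = [], [], 0
    for ident, size in items:
        # Start a new shard when adding this item would overshoot the target.
        if current and current_size + size > target_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append((ident, size))
        current_size += size
    if current:
        shards.append(current)
    return shards


GB = 10**9
# Made-up item sizes for illustration.
items = [("item-a", 50 * GB), ("item-b", 50 * GB), ("item-c", 50 * GB), ("item-d", 10 * GB)]
packed = group_by_size(items, target_bytes=100 * GB)
for n, shard in enumerate(packed):
    total = sum(size for _, size in shard)
    print(n, [ident for ident, _ in shard], total // GB, "GB")
```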
20:13 ^🔗	Kaz	so just to be clear
20:13 ^🔗	Kaz	1) run split-collection on huge collection
20:13 ^🔗	Kaz	2) do some jq magic
20:13 ^🔗	Kaz	3) make shard based on each list that gets pumped out by 1&2?
20:14 ^🔗	db48x	yes
20:14 ^🔗	db48x	but split-collection is just one possible way to split up a collection
20:14 ^🔗	db48x	you could do it by date
20:14 ^🔗	Kaz	right
20:15 ^🔗	HCross	db48x, can we have nano please?
20:15 ^🔗	db48x	as I did for archiveteam_fire, or by size, for archivebot, or by some clever means I haven't thought up
20:15 ^🔗	db48x	sure
20:15 ^🔗	db48x	emacs is on there, as is vim, but go ahead and install it
20:15 ^🔗	Frogging	how does one get things "out" of the backup, should the need arise?
20:16 ^🔗	Kaz	how does one go about running the jq script? -bash: jq: command not found
20:16 ^🔗	db48x	while it's installing, go to /usr/local/propellor and edit joeyconfig.hs so that the iabak machine defined in there includes the package as well
20:16 ^🔗	Kaz	or do I need to install for myself
20:16 ^🔗	HCross	db48x, not root/cant sudo
20:16 ^🔗	db48x	I had zero luck installing jq on iabak, sorry
20:16 ^🔗	db48x	I had to do that part on my own machine
20:16 ^🔗	Kaz	ah, okay
20:17 ^🔗	HCross	I may do all the mining closer to myself then, and push over after
20:17 ^🔗	db48x	but you're welcome to show me up :)
20:17 ^🔗	HCross	I'd rather not try and upload lots of small files from London > Singapore
20:17 ^🔗	db48x	oh, I never actually added you guys to /etc/sudoers
20:18 ^🔗	HCross	Kaz, if I get a VPS somewhere near SG (aka LA), would you want to try and sort jq on that?
20:18 ^🔗	HCross	then we can use that
20:19 ^🔗	Kaz	could give it a shot, let me see if I can get it running locally first
20:20 ^🔗	db48x	or just get it installed on the server
20:20 ^🔗	db48x	I gave up because it was 4am
20:21 ^🔗	Kaz	uh
20:21 ^🔗	Kaz	am I missing something or am I supposed to be expecting something more than apt-get install jq
20:22 ^🔗	db48x	dunno
20:22 ^🔗	db48x	does it work?
20:22 ^🔗	Kaz	yeah
20:22 ^🔗	Kaz	well, it installed
20:22 ^🔗	db48x	nice :)
20:22 ^🔗	Kaz	will test functionality in a sec
20:26 ^🔗	HCross	struggling to get iamine to go as well
20:26 ^🔗	db48x	closure: I updated my pull request
20:27 ^🔗	db48x	speaking of which, you guys should peruse it as well: https://github.com/joeyh/propellor/pull/17
20:29 ^🔗	HCross	it doesnt half take a time to create a shard
20:29 ^🔗	HCross	my first one is still computing
20:29 ^🔗	db48x	indeed
20:30 ^🔗	Kaz	HCross: do you have iamine working?
20:30 ^🔗	HCross	I dont, it needs someones IA credentials
20:30 ^🔗	db48x	it needs your own
20:31 ^🔗	HCross	ah, per user
20:31 ^🔗	db48x	just uses a username and password to get an auth key
20:31 ^🔗	db48x	yea
20:31 ^🔗	db48x	s/key/token/
20:32 ^🔗	Kaz	iamine needs py3, right
20:34 ^🔗	db48x	indeed
20:34 ^🔗	db48x	https://github.com/joeyh/propellor/pull/17/commits/1d689b1e4ce1f5eeedab140bd3c330484a928586
20:35 ^🔗	yipdw	Frogging: each remote location is a git-annex remote; to restore shard contents, you pull from remotes
20:36 ^🔗	Frogging	ooh, okay
20:36 ^🔗	yipdw	https://git-annex.branchable.com/tips/offline_archive_drives/ and https://git-annex.branchable.com/location_tracking/
20:36 ^🔗	db48x	Frogging: I'm sorry, I completely forgot to answer your question!
20:39 ^🔗	HCross	db48x, split-collection: line 8: syntax error near unexpected token `('
20:39 ^🔗	HCross	on your script
20:41 ^🔗	Kaz	iamine is not playing nicely for me
20:42 ^🔗	db48x	HCross: odd
20:42 ^🔗	db48x	line 8 is just lines=$(wc -l "${itemfile}" | cut -d ' ' -f 1)
20:43 ^🔗	HCross	not on https://raw.githubusercontent.com/ArchiveTeam/IA.BAK/server/split-collection
20:44 ^🔗	db48x	oooh
20:44 ^🔗	Kaz	yeah, the 'new' mkshard also isn't on the repo
20:45 ^🔗		atomotic has joined #internetarchive.bak
20:47 ^🔗	db48x	Kaz: I committed it: https://github.com/ArchiveTeam/IA.BAK/commit/5b457779b2ffd9fb1342671a3dbc1cd73edcd14e#diff-85b50cc2f5b54f1254ecdcd0fec1959d
20:47 ^🔗	db48x	just pushed the fixed split-collection script
20:48 ^🔗	Kaz	https://github.com/ArchiveTeam/IA.BAK/blob/server/mkSHARD doesn't match /usr/local/IA.BAK/mkSHARD, which one should we be using?
20:49 ^🔗	db48x	well, let's look at the diff
20:50 ^🔗	HCross	db48x, http://paste.nerds.io/ujacucekid.vhdl what am I doing wrong?
20:50 ^🔗	db48x	Kaz: yea, that's the old version
20:51 ^🔗	db48x	there, I did a git pull
20:52 ^🔗	Kaz	okay, thanks
20:52 ^🔗	db48x	++ basename 'archivebot-files/-*.json' .json
20:52 ^🔗	db48x	error: "archivebot-meta-*.json" should be readable
20:54 ^🔗	db48x	it's in quotes, so it didn't expand the glob
20:54 ^🔗	HCross	but its made a load of shards
20:55 ^🔗	db48x	naturally
20:55 ^🔗	HCross	ah
20:55 ^🔗	db48x	that's why the script has a for loop, to work on each one independently
20:56 ^🔗	db48x	but if the filename is quoted, then the loop only runs once, with f equal to "archivebot-meta-*.json" rather than with f equal to "archivebot-meta-0.json", then equal to "archivebot-meta-1.json", etc
20:57 ^🔗	HCross	looks like ive got 357 shards of just archivebot
20:58 ^🔗	HCross	hmm, thats not right
20:58 ^🔗	db48x	that is a lot of shards
20:59 ^🔗	HCross	yea, I messed up
21:00 ^🔗	HCross	34, that looks better
21:00 ^🔗	db48x	:)
21:01 ^🔗	HCross	db48x, also. annexed files in working tree: 228575
21:01 ^🔗	HCross	- but from the info on archive.org for each collection there are only 68k files
21:01 ^🔗	db48x	HCross: what collection are you looking at?
21:02 ^🔗	HCross	The_Sydney_Morning_Herald svoboda_newspaper antiochnews The_Notre_Dame_Scholastic NCAA-News dailyracingform
21:04 ^🔗	db48x	23856 + 23183 + 5305 + 2206 + 857 + 10956 = ~66k items
21:04 ^🔗	db48x	but yea, each item has multiple files
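The arithmetic behind the mismatch, as a small worked check. The six item counts and the 228575 figure are the ones quoted in the chat; the average is just derived from them.

```python
# Item counts quoted for the six collections that went into the shard.
item_counts = [23856, 23183, 5305, 2206, 857, 10956]
annexed_files = 228575  # "annexed files in working tree" reported above

total_items = sum(item_counts)
files_per_item = annexed_files / total_items

print(total_items)               # items, not files
print(round(files_per_item, 2))  # average files per item
```

So the shard list counts files, while the collection pages count items, and each item carries several files on average.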
21:05 ^🔗	HCross	ahh, its a file thing
21:05 ^🔗	HCross	sorry about my cockups this evening
21:05 ^🔗	db48x	indeed
21:07 ^🔗	db48x	no worries
21:08 ^🔗	db48x	I'd never done it before until last night either, I just didn't have anyone to talk to about the mistakes I made :)
21:12 ^🔗	HCross	db48x, so I now have http://harrycross.me/dae.png
21:12 ^🔗	HCross	- what do I do on them next? Tried running get_item_files.jq and I get a compile error
21:12 ^🔗	db48x	show me the command and the error?
21:14 ^🔗	HCross	jq get_item_files.jq
21:14 ^🔗	HCross	error: get_item_files is not defined
21:14 ^🔗	HCross	get_item_files.jq1 compile error
21:14 ^🔗	HCross	no matter what file I add on the end of the command
21:17 ^🔗	db48x	oh
21:17 ^🔗	db48x	if you want it to run commands from a file, you have to use the -f option
21:17 ^🔗	db48x	jq -f somefile.jq
21:18 ^🔗	db48x	and since this script produces an array of strings that we want to treat as lines of a file, you also need the -r option
21:18 ^🔗	db48x	so it becomes jq -r -f get_item_files.jq archivebot-meta-00.txt
21:19 ^🔗	db48x	oh, and then you want to redirect the output, so tack on > archivebot-meta-00.tsv to the end
21:21 ^🔗	HCross	parse error: Invalid numeric literal at line 2, column 0
21:22 ^🔗	db48x	hrm
21:22 ^🔗	db48x	what's line 2 of your input file look like?
21:23 ^🔗	db48x	some horrible json object, I hope
21:24 ^🔗	HCross	http://harrycross.me/123.png
21:24 ^🔗	HCross	is archivebot-meta-00.txt
21:25 ^🔗	db48x	ah, those are the ids, not the metadata
21:25 ^🔗	db48x	split-collection should run ia-mine again using those as input
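A quick way to tell the two kinds of file apart before handing one to jq, sketched in Python. It assumes the metadata dump holds one JSON object per line, as described earlier in the log, and that the identifier list is plain text; the function name is hypothetical.

```python
import json


def looks_like_json_lines(lines):
    """Return True if the first non-empty line parses as a JSON object."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            value = json.loads(line)
        except ValueError:
            # A bare identifier is not valid JSON, which is what jq trips over.
            return False
        return isinstance(value, dict)
    return False


# Illustrative inputs: an identifier list versus a metadata dump.
id_lines = ["archiveteam_archivebot_go_20161110020001"]
meta_lines = ['{"files": [], "metadata": {"identifier": "example"}}']

print(looks_like_json_lines(id_lines))
print(looks_like_json_lines(meta_lines))
```

To check a real file, pass an open file object, since iterating over it yields lines.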
21:25 ^🔗	HCross	it did something, then spat out the IA mine readme
21:26 ^🔗	db48x	heh
21:27 ^🔗	HCross	db48x, http://paste.nerds.io/fomihecuma.vhdl full log of what it did
21:27 ^🔗	HCross	TO MAKE THOSE
21:27 ^🔗	HCross	to make those files
21:30 ^🔗	db48x	yea, you didn't fix the error on line 13
21:30 ^🔗	db48x	I guess you edited split-collection, putting quotes around the glob of the for loop?
21:31 ^🔗	HCross	db48x, http://paste.nerds.io/dokefiboxu.pl is that file, only edit is line 9
21:32 ^🔗	db48x	http://paste.nerds.io/dokefiboxu.sh
21:32 ^🔗	db48x	there, at least the colors are better now :)
21:33 ^🔗	db48x	your previous log clearly shows the quotes
21:33 ^🔗	db48x	maybe you edited it back after it failed, but forgot to rerun it? gremlins maybe?
21:33 ^🔗	Frogging	sorry if the question is stupid, but why the sudden interest in archiving the Archive? the election was mentioned but I didn't really understand the connection
21:37 ^🔗	db48x	Frogging: we've been doing this for a while now, but we've let it slide a bit
21:38 ^🔗	Frogging	yeah, I know. it just seemed to get a sudden bump to top priority on Tuesday
21:39 ^🔗	db48x	could be because previous attempts to get it going again didn't really accomplish much?
21:40 ^🔗	Frogging	maybe, though the implication was that it's now more urgent because of the election
21:40 ^🔗	Frogging	I think it's great that it's getting more attention, I didn't understand that though
21:40 ^🔗	db48x	hmm. no idea about that
21:40 ^🔗	db48x	I suppose it's possible, but it didn't seem that way to me
21:41 ^🔗	HCross	db48x, can you tell me what line 13 should be, not quite getting this (am a real newbie with this kind of thing), sorry
21:41 ^🔗		Kksmkrn has joined #internetarchive.bak
21:42 ^🔗	db48x	HCross: just "done"
21:42 ^🔗	db48x	oh, in the log
21:42 ^🔗	db48x	not the 13th line of the script
21:42 ^🔗	db48x	it shouldn't be an error
21:43 ^🔗	db48x	let me run it and show you a log of when it works
21:43 ^🔗	HCross	ok, thanks
21:43 ^🔗		kyan has quit IRC (Read error: Operation timed out)
21:44 ^🔗	Jon	yay 21G down of the 1T allocated so far
21:45 ^🔗	db48x	nice
21:48 ^🔗	db48x	bah: {"params": {"rows": 50, "q": "collection:archiveteam-fire", "page": 397, "output": "json"}, "url": "https://archive.org/advancedsearch.php", "retries_left": 0, "message": "Maximum retries exceeded for url, giving up."}
21:51 ^🔗		Start_ has joined #internetarchive.bak
21:52 ^🔗		Start has quit IRC (Read error: Connection reset by peer)
21:52 ^🔗	db48x	HCross: do you see a similar stream of error messages when you run ia-mine?
21:52 ^🔗	HCross	nope, just the IA mine readme
21:52 ^🔗	db48x	ah, it only happens on that collection
21:53 ^🔗	db48x	I'll look into that later, back to helping you
21:53 ^🔗	HCross	thanks
21:54 ^🔗	db48x	oooh, lol
21:55 ^🔗		atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
21:55 ^🔗	db48x	it's because I am dumb
21:58 ^🔗		wp494 has quit IRC (Read error: Connection reset by peer)
22:02 ^🔗	db48x	http://paste.nerds.io/ubuhujimur.sh
22:02 ^🔗	db48x	HCross: I pushed a fix
22:03 ^🔗	db48x	basically, I made some changes to split-collection, then had the bright idea to split archiveteam-fire by date, and didn't actually run it again with my changes
22:04 ^🔗	db48x	the quotes in your log were a red herring
22:04 ^🔗	db48x	I was forgetting that this is bash we're dealing with
22:04 ^🔗	HCross	i just ordered a time4vps for storage for this, and as soon as I paid, their portal crashed
22:05 ^🔗	db48x	heh
22:08 ^🔗	HCross	that looks more like what it should be doing db48x - its also taking more time
22:08 ^🔗	db48x	yay
22:09 ^🔗		wp494 has joined #internetarchive.bak
22:09 ^🔗	HCross	db48x, is https://billing.time4vps.eu down/showing a cloudflare live thingy for you?
22:11 ^🔗	db48x	it's showing nothing but a spinning spinner so far
22:11 ^🔗	db48x	ah, indeed it is
22:12 ^🔗	HCross	rip, just fed them 20 eur
22:12 ^🔗	Kksmkrn	Loading for me
22:13 ^🔗	Kksmkrn	Nevermind, I should drink more, talk less.. live thingy
22:14 ^🔗	HCross	db48x, it worked this time, have a nice JSON car-crash here
22:14 ^🔗	HCross	cat'ing that first file was a BAD idea
22:14 ^🔗	db48x	:D
22:14 ^🔗	db48x	use less instead
22:14 ^🔗	db48x	or head
22:15 ^🔗	Senji	hexdump -C
22:16 ^🔗	HCross	especially on a server 200ms away. Also, jq is still cocking up
22:16 ^🔗	HCross	same parse error: Invalid numeric literal at line 2, column 0
22:22 ^🔗		kyan has joined #internetarchive.bak
22:35 ^🔗		Kksmkrn has left zZzZ..
22:49 ^🔗	db48x	HCross: sounds like you fed it another file containing ids instead of json :)
23:01 ^🔗	HCross	db48x, I now have .tsv files full of JSON and .json files full of lists of IDs
23:03 ^🔗	HCross	db48x, I am going to head to bed, ill sort this out tomorrow. Goodnight
23:20 ^🔗	SketchPho	Looks like good work today.
23:20 ^🔗	SketchPho	Do I need to help in any way?
23:24 ^🔗	db48x	HCross: awesome! :D
23:25 ^🔗	db48x	SketchPho: I think we're making progress
23:25 ^🔗	SketchPho	Agreed. (Checked scrollback)
23:26 ^🔗	db48x	it's not really repeatable, quality-controlled, iso9000 certified progress, but it's not too bad :)
23:26 ^🔗	HCross2	I'll sort out the rest of it later tomorrow. It should be easier from here. Its now just converting those files to shards
23:27 ^🔗	HCross2	I've also got 4tb downloading
23:28 ^🔗	db48x	I had to stop after 15gb of shard12
