00:12 -- RichardG_ has joined #archiveteam-bs
00:17 -- RichardG has quit IRC (Read error: Operation timed out)
00:25 -- Somebody has joined #archiveteam-bs
00:26 -- VADemon has quit IRC (Quit: left4dead)
00:46 -- Aranje has joined #archiveteam-bs
00:47 -- RichardG_ is now known as RichardG
01:17 -- Somebody has quit IRC (Ping timeout: 370 seconds)
01:22 -- hawc145 has joined #archiveteam-bs
01:24 -- wacky has quit IRC (Ping timeout: 250 seconds)
01:24 -- Kksmkrn has quit IRC (Ping timeout: 250 seconds)
01:24 -- wacky has joined #archiveteam-bs
01:25 -- HCross has quit IRC (Ping timeout: 250 seconds)
01:25 -- dashcloud has quit IRC (Ping timeout: 250 seconds)
01:25 -- dxdx has quit IRC (Ping timeout: 250 seconds)
01:25 -- pikhq has quit IRC (Ping timeout: 250 seconds)
01:25 -- Zebranky has quit IRC (Ping timeout: 250 seconds)
01:25 -- dashcloud has joined #archiveteam-bs
01:26 -- pikhq has joined #archiveteam-bs
01:32 -- dx has joined #archiveteam-bs
01:33 -- Zebranky has joined #archiveteam-bs
02:03 -- Somebody has joined #archiveteam-bs
02:58 -- Kksmkrn has joined #archiveteam-bs
03:42 -- Lord_Nigh has quit IRC (Read error: Operation timed out)
03:44 -- ravetcofx has quit IRC (Read error: Operation timed out)
03:51 -- ravetcofx has joined #archiveteam-bs
03:55 -- Lord_Nigh has joined #archiveteam-bs
04:00 -- Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
04:01 -- Lord_Nigh has joined #archiveteam-bs
04:01 -- jrwr has quit IRC (Remote host closed the connection)
04:12 -- ndiddy has quit IRC (Read error: Connection reset by peer)
05:09 -- Lord_Nigh has quit IRC (Read error: Operation timed out)
05:20 -- Lord_Nigh has joined #archiveteam-bs
05:42 -- Sk1d has quit IRC (Ping timeout: 194 seconds)
05:48 -- Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
05:48 -- Sk1d has joined #archiveteam-bs
05:49 -- Lord_Nigh has joined #archiveteam-bs
06:02 -- Aranje has quit IRC (Quit: Three sheets to the wind)
06:36 -- Lord_Nigh has quit IRC (Read error: Operation timed out)
06:38 -- Lord_Nigh has joined #archiveteam-bs
06:46 -- Lord_Nigh has quit IRC (Read error: Operation timed out)
07:09 -- Lord_Nigh has joined #archiveteam-bs
07:51 -- Lord_Nigh has quit IRC (Ping timeout: 250 seconds)
07:52 -- Lord_Nigh has joined #archiveteam-bs
08:06 -- Somebody has quit IRC (Ping timeout: 370 seconds)
09:16 -- GE has joined #archiveteam-bs
09:18 -- phuzion has quit IRC (Read error: Operation timed out)
09:19 -- phuzion has joined #archiveteam-bs
09:22 -- ravetcofx has quit IRC (Read error: Operation timed out)
09:22 -- Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
09:26 -- Lord_Nigh has joined #archiveteam-bs
09:49 -- dashcloud has quit IRC (Ping timeout: 244 seconds)
09:51 -- dashcloud has joined #archiveteam-bs
10:05 <godane> i'm uploading more kpra audio
11:30 -- BlueMaxim has quit IRC (Quit: Leaving)
11:38 -- GE has quit IRC (Remote host closed the connection)
11:40 <whydomain> Anyone know of a way to download an icecast stream in chunks (e.g. 500mb parts)?
11:41 <whydomain> I want to grab an ongoing radio stream but if I just download as one file I'll eventually run out of disk space
11:44 <whydomain> The problem is most ways of splitting a file create a *copy* of the file, rather than splitting the original
12:22 <ranma> worth backing up? https://www.youtube.com/watch?v=miw39UKfKPU
12:22 <ranma> <Chii> At Dinner With Donald Trump, Mitt Romney Ate Crow - [8m54s] 2016-12-01 - The Late Show with Stephen Colbert - 1,078,394 views
12:22 <ranma> references Trump's "loss of citizenship or year in jail" quote
12:52 -- hawc145 is now known as HCross
12:55 <ae_g_i_s> whydomain: i suspect that `split` should be able to do that if you output the icecast stream to stdout and pipe it to `split`
12:57 <ae_g_i_s> the drawback is that it won't conserve any headers, so the resulting files (after the first one) might be slightly broken - but if you can just `cat` them together on the target system, that's no issue
12:58 <ae_g_i_s> okay, wrong phrasing. "won't conserve headers" as in "won't write extra headers to each output file"
13:05 -- GE has joined #archiveteam-bs
13:07 -- BartoCH has quit IRC (Remote host closed the connection)
13:07 <arkiver> whydomain: which radio stream?
13:09 <whydomain> A local community one, that I don't think will be archived.
13:09 <arkiver> do you have a link?
13:09 <whydomain> But I think that ae_g_i_s's method is working
13:09 <whydomain> http://icecast.easystream.co.uk:8000/blackdiamondfm.m3u
13:11 <whydomain> Yes! Thanks ae_g_i_s, it works.
13:11 <whydomain> curl http://icecast.easystream.co.uk:8000/blackdiamondfm | split -d -b 100M - radio
13:11 <HCross> arkiver, something to write up/add to videobot?
13:11 <whydomain> (if anyone else is interested)
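
A minimal sketch of the approach worked out above, wrapped in a script so each run gets its own timestamped chunk prefix; the stream URL is the one from the log, and the 100M chunk size and "radio" prefix are just the values whydomain used.

    #!/bin/bash
    # Record an icecast stream and cut it into fixed-size pieces on the fly,
    # so no single output file can fill the disk.
    set -eu
    URL="${1:-http://icecast.easystream.co.uk:8000/blackdiamondfm}"
    PREFIX="${2:-radio}.$(date +%Y%m%d-%H%M%S)."
    # curl writes the raw stream to stdout; split cuts it into numbered 100 MB chunks.
    curl -s "$URL" | split -d -b 100M - "$PREFIX"

As ae_g_i_s notes, only the first chunk carries the stream headers, so the later pieces may need to be concatenated back together (for example cat "$PREFIX"* > full-stream, with the container/codec depending on the station) before they play cleanly.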
13:15 <arkiver> well, we're doing a radio recording project over at IA
13:15 <arkiver> it's mostly not public though
13:17 <whydomain> arkiver: out of curiosity, will IA be targeting smaller community/local stations, or just the big ones?
13:17 <arkiver> everything
13:18 <arkiver> however, we prefer informative radio stations
13:18 <arkiver> and this project is not very public, FYI
13:19 <whydomain> everything? (even non-US stuff? - like the one I'm grabbing right now (black diamond) )
13:19 <arkiver> definitely non-US stuff!
13:21 -- BartoCH has joined #archiveteam-bs
13:21 <whydomain> what if there is no web stream?
13:21 <arkiver> well, currently only web streaming stations
13:22 <arkiver> but they almost all have a web stream
13:25 -- GE has quit IRC (Remote host closed the connection)
13:50 -- GE has joined #archiveteam-bs
14:01 <godane> SketchCow: looks like the metadata for the date has to be fixed here: https://archive.org/details/1988-JUn-compute-magazine
14:05 -- BartoCH has quit IRC (Ping timeout: 260 seconds)
14:11 -- BartoCH has joined #archiveteam-bs
15:00 -- BartoCH has quit IRC (Ping timeout: 260 seconds)
15:08 <tapedrive> arkiver: Just in case you haven't seen this: http://www.radiofeeds.co.uk/ is a listing of nearly all radio feeds in the UK.
15:37 <arkiver> tapedrive: thank you!
15:42 <arkiver> That's a very nice list
15:42 <arkiver> if you have anything, please let me know :D
16:17 -- BartoCH has joined #archiveteam-bs
16:24 -- kristian_ has joined #archiveteam-bs
16:39 <tapedrive> arkiver: All of the ones I've tested from that list work in non-UK countries, but there may be some that don't.
16:43 <whydomain> arkiver: there's Roland Radio (Amstrad CPC computer music) at http://streaming.rolandradio.net:8000/rolandradio
16:43 -- kvieta has quit IRC (Ping timeout: 246 seconds)
16:49 -- kvieta has joined #archiveteam-bs
17:34 -- BartoCH has quit IRC (Ping timeout: 260 seconds)
17:39 -- BartoCH has joined #archiveteam-bs
17:45 -- BartoCH has quit IRC (Ping timeout: 260 seconds)
17:49 -- BartoCH has joined #archiveteam-bs
18:08 -- zerkalo has quit IRC (Read error: Connection reset by peer)
18:08 -- zerkalo has joined #archiveteam-bs
18:09 -- zerkalo has quit IRC (Read error: Connection reset by peer)
18:09 -- zerkalo has joined #archiveteam-bs
18:10 -- ndiddy has joined #archiveteam-bs
18:18 -- zerkalo has quit IRC (Ping timeout: 244 seconds)
18:30 -- zerkalo has joined #archiveteam-bs
19:01 -- VADemon has joined #archiveteam-bs
19:06 -- Somebody has joined #archiveteam-bs
19:17 -- Somebody has quit IRC (Ping timeout: 370 seconds)
19:24 <godane> we are up to 2016-11-30 with kpfa
19:26 <HCross> Sanqui, only 1k more PewDiePie videos to go
19:26 <Kaz> anyone in here with *lots* of local storage? Looking for some advice
19:26 <Frogging> how much is lots?
19:26 <HCross> how much is defined by "lots"
19:27 <Kaz> let's say 20TB+
19:27 <Frogging> ah I don't have quite that much
19:27 <Kaz> trying to work out the best route for 8-12 drives in a non-huge physical space
19:27 <Frogging> http://www.ncix.com/detail/fractal-design-node-804-matx-23-97165.htm
19:28 <Kaz> HP microserver (4 bays) is doing fine at the moment, but I'm not too sure on expansion
19:28 <HCross> Kaz, best bet may be #DataHoarder on Freenode
19:28 <Kaz> already there :)
19:28 <Frogging> that thing holds 10 unmodded
19:28 <Kaz> Frogging: ..did not realise that had space for 10 3.5"'s inside wtf
19:29 <Frogging> yeah it's pretty amazing. It's what I'm using for my NAS
19:29 <Frogging> 8 in the main bays and there are mounting points next to the motherboard for two more
19:29 <Frogging> and then you can stick an SSD or two in the front
19:29 <Kaz> what mobo/cpu are you running in there?
19:30 <Kaz> and freenas/unraid or anything?
19:30 <Frogging> ASRock 970M and an AMD Phenom II 965 quad core
19:31 <Frogging> it's running Debian with md RAID
19:31 <Frogging> I'm not a fan of freenas
19:32 <Kaz> ah, right
19:32 <Kaz> god this won't be cheap
19:32 <Frogging> by far the most expensive thing for me was the drives
19:33 <Frogging> I have four 4TB WD Reds in there right now, and some cheap SSD for the OS
19:33 <Kaz> yeah, I'm looking at 4-6TB drives for now, 8-10 in future
19:33 <Kaz> or maybe I could delete some stuff
19:33 <Frogging> actually I have three drives, not four
19:33 <Frogging> oops
19:46 -- ravetcofx has joined #archiveteam-bs
19:52 -- BlueMaxim has joined #archiveteam-bs
20:06 -- Somebody has joined #archiveteam-bs
20:23 <godane> HCross: I guess if you're doing the PewDiePie youtube channel i don't have to download it
20:23 <godane> it's a good thing cause i still have 2800+ to go
20:24 <HCross> godane, I've nearly got it down, just need some advice on the best way to get it to the archive now
20:24 <Frogging> I was thinking of making a wiki page where people who archive youtube channels can add them to a table and provide contact information in the event that someone wants to get something out
20:24 <HCross> godane, do you find that youtube seems to have really variable download speeds?
20:25 <godane> yes
20:25 <Frogging> I imagine their storage is very geographically distributed, maybe that has something to do with it
20:26 <godane> but you're downloading at a much faster rate than i could
20:26 <HCross> I'm getting anywhere from several hundred Mbps to less than 1
20:26 <godane> sometimes a stop and restart fixes that
20:26 <Frogging> where is the downloader located HCross?
20:26 <HCross> OVH, Roubaix
20:27 <Frogging> ah
20:27 <HCross> probably IP range throttles as well
20:27 <godane> i use a move script to sort my videos by date before i upload
20:27 <godane> based on the json script
20:27 <godane> *json files
20:28 <Frogging> you're downloading in full quality I hope, HCross?
20:28 <Frogging> though in this instance, with the number of videos, I'd understand if that isn't feasible..
20:28 <HCross> yep. max video and max audio quality
20:28 <HCross> Frogging, I don't have 12TB storage for nothing
20:29 <HCross> godane, which script do you use?
20:29 <Frogging> nice
20:29 <Frogging> this is the command I use for grabbing channels
20:29 <Frogging> youtube-dl --download-archive archive.txt --write-description --write-annotations --write-info-json -f bestvideo[ext=webm]+bestaudio[ext=webm]/bestvideo[ext=mp4]+bestaudio[ext=m4a]/best $*
20:30 <godane> http://pastebin.com/KyYJk6pE
20:30 <godane> that is just my move script
20:30 <HCross> thanks godane :)
20:30 <godane> the move script is for sorting the files locally
20:31 <HCross> nearing 400GB so far
20:31 <godane> i use another script to upload the sorted files to make ids like this: https://archive.org/details/achannelthatsawesome-youtube-channel-2016-02-06
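
godane's actual move script is the pastebin above; purely as an illustration of the same idea, here is a rough sketch that sorts youtube-dl output into per-date directories using the upload_date field from each video's .info.json (it assumes jq is installed and that every downloaded file shares its basename with the matching .info.json).

    #!/bin/bash
    # Group downloaded videos into YYYY-MM-DD directories based on their
    # youtube-dl .info.json metadata, so they can later be uploaded as dated items.
    set -eu
    for meta in *.info.json; do
        [ -e "$meta" ] || continue
        d=$(jq -r '.upload_date' "$meta")   # upload_date is stored as YYYYMMDD
        [ "$d" != "null" ] || continue      # skip entries with no date
        dir="${d:0:4}-${d:4:2}-${d:6:2}"
        mkdir -p "$dir"
        base="${meta%.info.json}"
        mv -v "$base".* "$dir"/             # moves the video, json, description, etc.
    done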
20:32 <godane> anyways do what you know best
20:33 <HCross> trying to get a collection sorted, as it'll all probably need to be darked
20:35 <Frogging> hope it can be un-darked if he actually ends up deleting all his videos
20:43 <godane> i do the youtube-channel dates cause there is no metadata in the titles
20:44 <godane> this was also my way of sorting through stuff so i know what has to be uploaded next
21:09 -- jrwr has joined #archiveteam-bs
21:21 -- Stiletto has quit IRC (Ping timeout: 244 seconds)
21:39 -- jsp234 has joined #archiveteam-bs
21:41 -- jsp12345 has quit IRC (Read error: Operation timed out)
21:49 -- kanzure has joined #archiveteam-bs
21:50 <Sanqui> moving to -bs
21:50 -- jsp234 has quit IRC (Remote host closed the connection)
21:50 -- nicolas17 has joined #archiveteam-bs
21:50 <kanzure> actually i don't care which one i get,
21:50 <DFJustin> they do generate sha1s for every file in every item
21:51 <kanzure> torrent hashes, the hashes inside each torrent file, or the actual file hashes
21:51 <DFJustin> but I guess they only included md5 in the collected census for whatever reason
21:52 <kanzure> i mean, i wouldn't feel great about recomputing hashes for everything every few years either :)
21:52 <DFJustin> you could run ia mine yourself I guess but that would take a while
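
For a single item, the stored checksums are already exposed through the metadata API, so nothing needs to be re-hashed; a sketch using the internetarchive command-line tool plus jq (ia_census_201604 is just an example identifier from later in this log, and the .files[] field names are worth double-checking against the API output).

    # pip install internetarchive   (provides the `ia` command)
    # Print name, md5 and sha1 for every file in one item, straight from the metadata API.
    ia metadata ia_census_201604 | jq -r '.files[] | [.name, .md5, .sha1] | @tsv'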
21:52 <Sanqui> what's the goal here?
21:52 <Frogging> yeah I was going to ask
21:52 <Frogging> I may have missed it but what are you trying to achieve
21:53 <kanzure> https://petertodd.org/2016/opentimestamps-announcement
21:53 <kanzure> timestamping using merkle trees
21:53 <kanzure> bitcoin timestamp proofs, in particular... although it would be applicable to non-bitcoin systems as well i suppose.
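
For reference, the client described in that announcement is driven from the command line roughly like this (a sketch based on the opentimestamps-client; the census filename is the one that comes up later in this log).

    # pip install opentimestamps-client   (provides the `ots` command)
    ots stamp file_hashes_sha1_20160411221100_public.tsv.gz        # writes a <file>.ots proof next to the input
    ots upgrade file_hashes_sha1_20160411221100_public.tsv.gz.ots  # fetch the Bitcoin attestation once it is confirmed
    ots verify file_hashes_sha1_20160411221100_public.tsv.gz.ots   # check the proof against the original file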
21:54 <DFJustin> hmm interesting
21:54 <Frogging> and where do the files on archive.org enter into this?
21:55 <kanzure> archive.org has hashes, i just need the hashes
21:55 <kanzure> and using a weak hash (like md5) is not appropriate
21:55 <Frogging> the hashes of what though?
21:55 <kanzure> all of it
21:55 <kanzure> everything :)
21:55 <Kaz> why must you have the hashes
21:55 <Kaz> what is your quest
21:56 <kanzure> timestamping is a way of showing the existence of an item based on something other than a trusted clock
21:56 <Sanqui> take a look at IA.BAK. it currently only covers a (small) subset of IA, but may have good data for a trial run
21:56 <Kaz> right
21:56 <Sanqui> it runs on git-annex
21:56 <Kaz> so you want to timestamp things to prove the IA has them?
21:56 <Sanqui> http://iabak.archiveteam.org/
21:56 <kanzure> https://petertodd.org/2016/opentimestamps-announcement#what-can-and-cant-timestamps-prove
21:56 <kanzure> kaz, ^
21:57 <Frogging> i think what he's asking is what does the IA have to do with this
21:57 <Kaz> I don't want your link
21:57 <kanzure> Sanqui: are you the same Sanqui that i know
21:57 <Kaz> I want to understand what you actually want here
21:57 <Sanqui> kanzure: yes!
21:57 <kanzure> ohai
21:57 <Sanqui> hi!
21:58 -- jsp12345 has joined #archiveteam-bs
22:01 -- jsp12345 has quit IRC (Remote host closed the connection)
22:02 <kanzure> kaz: timestamping, in this style, can help protect against future allegations of backdating
22:02 <kanzure> or rather, anyone can always make an allegation of backdating, but at least here you can show a timestamp proof that a certain version existed at a certain time
22:03 <xmc> well, ia.bak uses a git repo of all the hashes, and you can drop the commit ids into some kind of timestamping service
22:03 <kanzure> ah. opentimestamps is compatible with git repositories, actually. it uses the git commit's tree hash.
22:03 <xmc> kool
22:03 <xmc> so it sounds like maybe what you want is something you can get trivially
22:05 -- jsp12345 has joined #archiveteam-bs
22:06 <kanzure> also it looks like public-file-size-md_20150304205357.json.gz is about right too, 'cept for all the md5 hashes
22:07 <Frogging> MD5 is quite a bit faster than SHA1, probably why they did it that way
22:08 <Frogging> (that's just a guess, I wasn't around for it)
22:08 <kanzure> md5 was an okay option at one point, i think. i dunno, i'm not a cryptographer.
22:10 <Frogging> neither am I but as far as I know, MD5 is fine if your intended application isn't at risk of being tampered with. like verifying file integrity in a relatively safe environment
22:11 <Somebody> kanzure: The original census only included md5s because we were only concerned about identifying accidental identical files
22:11 <Frogging> it can be coerced into generating a collision (I believe it's called a preimage attack), but if that's not something you're trying to protect against, then it's fine
22:11 <kanzure> in the context of timestamping, it would mean that anyone can forge an alternative and show hey this document "existed" back then too, and you can try to pass that different version off as legitimate
22:11 <Somebody> The most recent census does include both md5 and sha1
22:11 <kanzure> Somebody: is the most recent census available for download somewhere?
22:11 <Frogging> kanzure: you're right for sure. but the census wasn't designed for that
22:11 <Somebody> kanzure: yeah, it *should* be listed on the wiki page, but I think I haven't updated it yet. Just a sec.
22:13 <Somebody> kanzure: https://archive.org/details/ia_census_201604
22:13 <kanzure> two different types of hashes is definitely helpful
22:13 <godane> so i found out that WFMU has audio going back to 2002
22:15 <kanzure> Somebody: thank you much
22:15 * Frogging forgot who Somebody is until just now
22:15 <Frogging> :p
22:15 <Somebody> Frogging: that's the idea. :-)
22:15 <Frogging> oh :p
22:15 -- jsp12345 has quit IRC (Remote host closed the connection)
22:15 <godane> the only bad news with WFMU is that the streams are only in big MP4 files
22:16 -- jsp12345 has joined #archiveteam-bs
22:16 <Somebody> kanzure: glad to help -- if you have any further questions about the data or format, please ask!
22:18 <kanzure> got ratelimited womp womp
22:18 <Somebody> ratelimited by what?
22:19 <kanzure> dunno, i was doing 10 MB/sec for a few minutes. i'll blame my ISP, it's fine.
22:19 <Somebody> no, I mean, what are you trying to download?
22:19 -- jsp12345 has quit IRC (Remote host closed the connection)
22:19 <kanzure> public-file-size-md_20150304205357.json.gz
22:19 <kanzure> (this was from before you gave me the more recent link)
22:21 -- jsp12345 has joined #archiveteam-bs
22:21 -- ndiddy has quit IRC (Read error: Connection reset by peer)
22:24 -- ndiddy has joined #archiveteam-bs
22:24 <Somebody> kanzure: Try downloading it through a torrent -- I think there are peers for the censuses.
22:30 -- jsp12345 has quit IRC (Remote host closed the connection)
22:30 -- jsp12345 has joined #archiveteam-bs
22:42 <kanzure> a39e3a8d37793792f62b85cbd7b74cafe482b5b2014203ca28b8555822ce74f3 public-file-size-md_20150304205357.json.gz
22:55 <Somebody> kanzure: what hash is that, sha256?
22:56 <kanzure> for that file
22:56 <kanzure> yes it's sha256
23:03 <kanzure> should i also do the private collection? what's in there
23:04 <Somebody> kanzure: that hash matches what other people downloaded, see: https://hash-archive.org/history/https://archive.org/download/ia-bak-census_20150304/public-file-size-md_20150304205357.json.gz
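
To reproduce that check locally, hash the downloaded file and compare the digest against the entries on the hash-archive.org history page for the same URL (sha256sum is the GNU coreutils tool; on macOS, shasum -a 256 does the same job).

    sha256sum public-file-size-md_20150304205357.json.gz
    # expected: a39e3a8d37793792f62b85cbd7b74cafe482b5b2014203ca28b8555822ce74f3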
23:04 <kanzure> ah hash-archive.org had it, okay
23:05 <Somebody> well, it has it *now* (I just added it)
23:05 <kanzure> i had already hashed one of the previous databases of hash-archive.org
23:05 <kanzure> oh i see.
23:05 <Somebody> anyone can put anything into hash-archive, just submit a URL
23:05 <Somebody> and it will download it and hash it
23:05 <kanzure> yes i know
23:06 <kanzure> i submitted this one the other day: https://hash-archive.org/history/https://archive.org/download/archiveteam_archivebot_go_068/bitcointalk.org-inf-20140403-045710-7i531.warc.gz
23:07 <Somebody> oh, good!
23:07 <Somebody> If you think of any features to add to hash-archive, do contact the author.
23:07 <kanzure> he actually didn't reply to me, but whatever
23:07 <Somebody> hm, that's odd. what did you suggest?
23:07 <kanzure> merkle roots
23:07 <kanzure> also the database download page is broken
23:07 <Somebody> ah, probably just hasn't gotten to it then
23:08 <kanzure> s/merkle roots/merkle trees
23:08 <Somebody> regarding the _private data file, it has (nearly) all the Wayback Machine data, and various other stuff
23:09 <kanzure> oh, wayback machine data sounds potentially useful to timestamp. alright.
23:09 <Somebody> but it won't have any torrents for you to grab (I think)
23:09 <kanzure> right. i can just hash the .json file that lists all the hashes.
23:10 <kanzure> it's a nasty hack but whatever
23:10 <Somebody> yeah, if all you are doing is timestamping the existence of census data files, certainly, do all of them
23:11 <kanzure> this is helpful metadata. it can be useful in the future to point out that no, the archive was not backdated, and here's why :).
23:11 <Somebody> kanzure: you'll probably also be interested in https://archive.org/download/archiveteam_census_2016 -- it will contain monthly lists of all the identifiers included in the search engine
23:11 <kanzure> or rather: not backdated after it was timestamped today
23:11 <kanzure> 8daa7a635d77eddb9fecb000abbe10b19611b623a1242b4a7b4b7881b92ddae6 file_hashes_sha1_20160411221100_public.tsv.gz.ots
23:12 <Somebody> uploaded automatically once a month (with a new item generated each year)
23:12 <Somebody> ah, nice
23:12 <Somebody> kanzure: agreed
23:13 <Somebody> Yes, my basic interest in the census work was to provide 3rd-party validation of "this was in the archive at this time"
23:13 <Somebody> so I'm delighted to see you working on timestamping it
23:14 <Somebody> If you'd like to drop copies of the census data into a gmail mailbox, and/or AWS, that'd be nice too
23:14 <Somebody> and please *do* dump copies of the toplevel merkle hashes into multiple pastebins, and then archive the pastebins
23:16 <kanzure> any other files i should look at before i do that?
23:17 <Somebody> kanzure: might as well grab http://archiveteam.org/index.php?title=Internet_Archive_Census#See_Also
23:17 <nicolas17> Google Cloud Platform has some nice "coldline" storage now too
23:18 <kanzure> oops pardon me, 8daa7a635d77eddb9fecb000abbe10b19611b623a1242b4a7b4b7881b92ddae6 was for file_hashes_sha1_20160411221100_public.tsv.gz not file_hashes_sha1_20160411221100_public.tsv.gz.ots
23:18 <Somebody> also note that item _meta.xml and _files.xml will change every time the metadata for an item changes, which can happen pretty much anytime, and will happen regularly. So differences in those aren't generally very interesting
23:19 <kanzure> are these census files going to ever change? if they change, a new census is released, right?
23:19 <Somebody> kanzure: The census files *shouldn't* change, as I conceptualize them, no.
23:19 <kanzure> tsv == tab csv?
23:20 <Somebody> yep
23:20 <Somebody> tab-separated-values
23:20 <kanzure> all sorts of fancy up in this joint, wow
23:21 <Somebody> looks nicer, and commas are more likely to be found in item identifiers than tabs
23:22 <Somebody> and pretty much nothing that supports csv can't be tweaked to support tsv instead
23:22 <ae_g_i_s> tsv is much better, agreed
23:22 <Somebody> in the next census, I might generate a separate file-hashes list that excludes the _meta.xml and _files.xml files.
23:22 <Somebody> ae_g_i_s: glad you agree; did I leave out any of the other advantages?
23:22 <ae_g_i_s> you use a character that's almost never used in content as the separator instead of a character that's often used in texts
23:22 <kanzure> Somebody: use lots of different hash functions, very helpful to resist bit rot and hash function failure over time
23:23 <Somebody> nicolas17: thanks -- please do dump copies of the census stuff in there, if you get a chance!
23:24 <ae_g_i_s> which is kinda the main parser perspective on it - you want the separator to not be in the set of valid "content" characters
23:24 <Somebody> I think csv was originally intended for primarily-numerical spreadsheets -- where, while it still hurts, it is likely to be *more* sensible.
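
One reason the tab-separated format is convenient in practice: ordinary text tools take a tab delimiter directly, so the census hash lists can be inspected without a dedicated parser. A quick sketch (the exact column layout is not documented here, so check the census item's description before relying on column numbers).

    # Peek at the first few rows of a census hash list.
    zcat file_hashes_sha1_20160411221100_public.tsv.gz | head -n 5
    # Split on tabs with awk, e.g. to pull out just the first column.
    zcat file_hashes_sha1_20160411221100_public.tsv.gz | awk -F'\t' '{print $1}' | head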
23:24 <kanzure> ideally, all content on archive.org would be timestamped (using opentimestamps or w/e) at submission time. and then archive.org would store the timestamp proof itself.
23:24 <Somebody> kanzure: yes, that would be quite neat.
23:24 <nicolas17> there are a bunch of csv variants wrt how they handle escaping of the comma, or quoting
23:24 <kanzure> timestamp proof is only a few hundred bytes per item
23:24 <Kaz> but this is AT
23:25 <Kaz> not IA
23:25 <kanzure> i refuse to believe that IA would be so cold as to not hang out in here
23:25 <Somebody> kanzure: yes, that would be awesome. please send that idea into mek if you haven't already.
23:25 <kanzure> i have not
23:25 <kanzure> who is mek
23:25 <Somebody> kanzure: eh, there's overlap, but we try to maintain separation
23:25 <Kaz> kanzure: #archiveteam
23:25 <Kaz> uh
23:25 <Kaz> #internetarchive
23:25 <kanzure> this is an endless loop...
23:26 <kanzure> ah
23:26 <kanzure> mek does not seem to be there
23:26 <Kaz> oh, yeah I have no idea who/what mek is, but #internetarchive is the channel for your needs
23:27 <nicolas17> hm, I need to update the Mapillary page on the AT wiki
23:27 <Somebody> kanzure: mek is a staffer at Archive.org, and interested in various new ideas. https://michaelkarpeles.com/
23:27 <nicolas17> they have 104M photos by now :P
23:30 <kanzure> e2bc2f240490e91d52a1eaeb5636664f75795da9bfcbbe7692c07b90ae18244b file_hashes_sha1_20160411221100_private.tsv.gz
23:31 -- GE has quit IRC (Remote host closed the connection)
23:32 <nicolas17> how would you go about archiving 200TB of photos from an AWS S3 bucket in Europe? preferably without giving the owner high AWS bandwidth costs
23:33 <kanzure> amazon has snowball data container things
23:37 <ae_g_i_s> yeah, afaik that one's called glacier
23:37 <ae_g_i_s> though i dunno if glacier works in the 'restore direction', i do know that it works in the 'backup direction', i.e. they send you drives, you fill them up and send it to them
23:37 <kanzure> https://aws.amazon.com/blogs/aws/aws-importexport-snowball-transfer-1-petabyte-per-week-using-amazon-owned-storage-appliances/
23:39 <kanzure> oops that was last year
23:39 <kanzure> ah here we go,
23:39 <kanzure> https://aws.amazon.com/blogs/aws/aws-snowmobile-move-exabytes-of-data-to-the-cloud-in-weeks/
23:40 <ae_g_i_s> yeah, snowmobile is the new one, where they send you a powered truck
23:40 <kanzure> "In order to meet the needs of these customers, we are launching Snowmobile today. This secure data truck stores up to 100 PB of data and can help you to move exabytes to AWS in a matter of weeks (you can get more than one if necessary)."
23:40 <kanzure> raise the pirate flag, let's go raiding
23:42 <kanzure> from https://archive.org/download/archiveteam_census_2016 ,
23:42 <kanzure> 2840f4e64f4c2bf562e97294714371cfe7beb4122e73c3437d535175d93e53df 2016.10.23-ia_identifiers.txt.gz
23:49 <ae_g_i_s> fun fact: you can bake ~175 chicken at the same time with the power necessary for an amazon snowmobile
23:49 <Somebody> https://hash-archive.org/history/https://archive.org/download/ia_census_201604/file_hashes_sha1_20160411221100_private.tsv.gz
23:50 <Somebody> https://hash-archive.org/history/https://archive.org/download/archiveteam_census_2016/2016.10.23-ia_identifiers.txt.gz
23:51 <kanzure> ah.
23:51 <kanzure> yes i should have checked that first.
23:51 <Somebody> eh, the order doesn't matter -- but it is a nice way to get a 3rd-party check