#archiveteam-bs 2016-12-03,Sat

Time Nickname Message
00:12 🔗 RichardG_ has joined #archiveteam-bs
00:17 🔗 RichardG has quit IRC (Read error: Operation timed out)
00:25 🔗 Somebody has joined #archiveteam-bs
00:26 🔗 VADemon has quit IRC (Quit: left4dead)
00:46 🔗 Aranje has joined #archiveteam-bs
00:47 🔗 RichardG_ is now known as RichardG
01:17 🔗 Somebody has quit IRC (Ping timeout: 370 seconds)
01:22 🔗 hawc145 has joined #archiveteam-bs
01:24 🔗 wacky has quit IRC (Ping timeout: 250 seconds)
01:24 🔗 Kksmkrn has quit IRC (Ping timeout: 250 seconds)
01:24 🔗 wacky has joined #archiveteam-bs
01:25 🔗 HCross has quit IRC (Ping timeout: 250 seconds)
01:25 🔗 dashcloud has quit IRC (Ping timeout: 250 seconds)
01:25 🔗 dxdx has quit IRC (Ping timeout: 250 seconds)
01:25 🔗 pikhq has quit IRC (Ping timeout: 250 seconds)
01:25 🔗 Zebranky has quit IRC (Ping timeout: 250 seconds)
01:25 🔗 dashcloud has joined #archiveteam-bs
01:26 🔗 pikhq has joined #archiveteam-bs
01:32 🔗 dx has joined #archiveteam-bs
01:33 🔗 Zebranky has joined #archiveteam-bs
02:03 🔗 Somebody has joined #archiveteam-bs
02:58 🔗 Kksmkrn has joined #archiveteam-bs
03:42 🔗 Lord_Nigh has quit IRC (Read error: Operation timed out)
03:44 🔗 ravetcofx has quit IRC (Read error: Operation timed out)
03:51 🔗 ravetcofx has joined #archiveteam-bs
03:55 🔗 Lord_Nigh has joined #archiveteam-bs
04:00 🔗 Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
04:01 🔗 Lord_Nigh has joined #archiveteam-bs
04:01 🔗 jrwr has quit IRC (Remote host closed the connection)
04:12 🔗 ndiddy has quit IRC (Read error: Connection reset by peer)
05:09 🔗 Lord_Nigh has quit IRC (Read error: Operation timed out)
05:20 🔗 Lord_Nigh has joined #archiveteam-bs
05:42 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
05:48 🔗 Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
05:48 🔗 Sk1d has joined #archiveteam-bs
05:49 🔗 Lord_Nigh has joined #archiveteam-bs
06:02 🔗 Aranje has quit IRC (Quit: Three sheets to the wind)
06:36 🔗 Lord_Nigh has quit IRC (Read error: Operation timed out)
06:38 🔗 Lord_Nigh has joined #archiveteam-bs
06:46 🔗 Lord_Nigh has quit IRC (Read error: Operation timed out)
07:09 🔗 Lord_Nigh has joined #archiveteam-bs
07:51 🔗 Lord_Nigh has quit IRC (Ping timeout: 250 seconds)
07:52 🔗 Lord_Nigh has joined #archiveteam-bs
08:06 🔗 Somebody has quit IRC (Ping timeout: 370 seconds)
09:16 🔗 GE has joined #archiveteam-bs
09:18 🔗 phuzion has quit IRC (Read error: Operation timed out)
09:19 🔗 phuzion has joined #archiveteam-bs
09:22 🔗 ravetcofx has quit IRC (Read error: Operation timed out)
09:22 🔗 Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
09:26 🔗 Lord_Nigh has joined #archiveteam-bs
09:49 🔗 dashcloud has quit IRC (Ping timeout: 244 seconds)
09:51 🔗 dashcloud has joined #archiveteam-bs
10:05 🔗 godane i'm uploading more kpfa audio
11:30 🔗 BlueMaxim has quit IRC (Quit: Leaving)
11:38 🔗 GE has quit IRC (Remote host closed the connection)
11:40 🔗 whydomain Anyone know of a way to download an icecast stream in chunks (e.g. 500 MB parts)?
11:41 🔗 whydomain I want to grab an ongoing radio stream but if I just download as one file I'll eventually run out of disk space
11:44 🔗 whydomain The problem is most ways of splitting a file create a *copy* of the file, rather than splitting the original
12:22 🔗 ranma worth backing up? https://www.youtube.com/watch?v=miw39UKfKPU
12:22 🔗 ranma <Chii> At Dinner With Donald Trump, Mitt Romney Ate Crow - [8m54s] 2016-12-01 - The Late Show with Stephen Colbert - 1,078,394 views
12:22 🔗 ranma references Trump's "loss of citizenship or year in jail" quote
12:52 🔗 hawc145 is now known as HCross
12:55 🔗 ae_g_i_s whydomain: i suspect that `split` should be able to do that if you output the icecast stream to stdout and pipe it to `split`
12:57 🔗 ae_g_i_s the drawback is that it won't conserve any headers, so the resulting files (after the first one) might be slightly broken - but if you can just `cat` them together on the target system, that's no issue
12:58 🔗 ae_g_i_s okay, wrong phrasing. "won't conserve headers" as in "won't write extra headers to each output file"
13:05 🔗 GE has joined #archiveteam-bs
13:07 🔗 BartoCH has quit IRC (Remote host closed the connection)
13:07 🔗 arkiver whydomain: which radio stream?
13:09 🔗 whydomain A local community one, that I don't think will be archived.
13:09 🔗 arkiver do you have a link?
13:09 🔗 whydomain But I think that ae_g_i_s's method is working
13:09 🔗 whydomain http://icecast.easystream.co.uk:8000/blackdiamondfm.m3u
13:11 🔗 whydomain Yes! Thanks ae_g_i_s, it works.
13:11 🔗 whydomain curl http://icecast.easystream.co.uk:8000/blackdiamondfm | split -d -b 100M - radio
13:11 🔗 HCross arkiver, something to write up/add to videobot?
13:11 🔗 whydomain (if anyone else is interested)
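Spelled out, the pipeline whydomain settled on works like this (a sketch assuming GNU coreutils and that the stream URL stays up; the output prefix "radio" and the reassembled filename are illustrative):

    # download the stream and cut it into 100 MB numbered parts as it arrives
    curl -sL http://icecast.easystream.co.uk:8000/blackdiamondfm \
      | split -d -b 100M - radio
    # as ae_g_i_s noted, split won't rewrite audio headers per part,
    # but concatenating the parts restores the original byte stream exactly
    cat radio* > blackdiamondfm-full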
13:15 🔗 arkiver well, we're doing a radio recording project over at IA
13:15 🔗 arkiver it's mostly not public though
13:17 🔗 whydomain arkiver: out of curiosity, will IA be targeting smaller community/local stations, or just the big ones?
13:17 🔗 arkiver everything
13:18 🔗 arkiver however, we prefer informative radio stations
13:18 🔗 arkiver and this project is not very public, FYI
13:19 🔗 whydomain everything? (even non-US stuff? - like the one I'm grabbing right now (black diamond) )
13:19 🔗 arkiver definitely non-US stuff!
13:21 🔗 BartoCH has joined #archiveteam-bs
13:21 🔗 whydomain what if there is no web stream?
13:21 🔗 arkiver well, currently only web streaming stations
13:22 🔗 arkiver but they almost all have a web stream
13:25 🔗 GE has quit IRC (Remote host closed the connection)
13:50 🔗 GE has joined #archiveteam-bs
14:01 🔗 godane SketchCow: looks like the date metadata has to be fixed here: https://archive.org/details/1988-JUn-compute-magazine
14:05 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
14:11 🔗 BartoCH has joined #archiveteam-bs
15:00 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
15:08 🔗 tapedrive arkiver: Just in case you haven't seen this: http://www.radiofeeds.co.uk/ is a listing of nearly all radio feeds in the UK.
15:37 🔗 arkiver tapedrive: thank you!
15:42 🔗 arkiver That's a very nice list
15:42 🔗 arkiver if you have anything, please let me know :D
16:17 🔗 BartoCH has joined #archiveteam-bs
16:24 🔗 kristian_ has joined #archiveteam-bs
16:39 🔗 tapedrive arkiver: All of the ones I've tested from that list work in non-uk countries, but there may be some that don't.
16:43 🔗 whydomain arkiver: there's Roland Radio (Amstrad CPC computer music) at http://streaming.rolandradio.net:8000/rolandradio
16:43 🔗 kvieta has quit IRC (Ping timeout: 246 seconds)
16:49 🔗 kvieta has joined #archiveteam-bs
17:34 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
17:39 🔗 BartoCH has joined #archiveteam-bs
17:45 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
17:49 🔗 BartoCH has joined #archiveteam-bs
18:08 🔗 zerkalo has quit IRC (Read error: Connection reset by peer)
18:08 🔗 zerkalo has joined #archiveteam-bs
18:09 🔗 zerkalo has quit IRC (Read error: Connection reset by peer)
18:09 🔗 zerkalo has joined #archiveteam-bs
18:10 🔗 ndiddy has joined #archiveteam-bs
18:18 🔗 zerkalo has quit IRC (Ping timeout: 244 seconds)
18:30 🔗 zerkalo has joined #archiveteam-bs
19:01 🔗 VADemon has joined #archiveteam-bs
19:06 🔗 Somebody has joined #archiveteam-bs
19:17 🔗 Somebody has quit IRC (Ping timeout: 370 seconds)
19:24 🔗 godane we are up to 2016-11-30 with kpfa
19:26 🔗 HCross Sanqui, only 1k more PewDiePie videos to go
19:26 🔗 Kaz anyone in here with *lots* of local storage? Looking for some advice
19:26 🔗 Frogging how much is lots?
19:26 🔗 HCross how much is defined by "lots"
19:27 🔗 Kaz lets say 20TB+
19:27 🔗 Frogging ah I don't have quite that much
19:27 🔗 Kaz trying to work out the best route for 8-12 drives in a non-huge physical space
19:27 🔗 Frogging http://www.ncix.com/detail/fractal-design-node-804-matx-23-97165.htm
19:28 🔗 Kaz HP microserver (4 bays) is doing fine at the moment, but I'm not too sure on expansion
19:28 🔗 HCross Kaz, best bet may be #DataHoarder on Freenode
19:28 🔗 Kaz already there :)
19:28 🔗 Frogging that thing holds 10 unmodded
19:28 🔗 Kaz Frogging: ..did not realise that had space for 10 3.5"'s inside wtf
19:28 🔗 Frogging yeah it's pretty amazing. It's what I'm using for my NAS
19:29 🔗 Frogging 8 in the main bays and there are mounting points next to the motherboard for two more
19:29 🔗 Frogging and then you can stick an SSD or two in the front
19:29 🔗 Kaz what mobo/cpu are you running in there?
19:29 🔗 Kaz and freenas/unraid or anything?
19:30 🔗 Frogging ASRock 970M and an AMD Phenom II 965 quad core
19:30 🔗 Frogging it's running Debian with md RAID
19:31 🔗 Frogging I'm not a fan of freenas
19:31 🔗 Kaz ah, right
19:32 🔗 Kaz god this won't be cheap
19:32 🔗 Frogging by far the most expensive thing for me was the drives
19:33 🔗 Frogging I have four 4TB WD Reds in there right now, and some cheap SSD for the OS
19:33 🔗 Kaz yeah, I'm looking at 4-6TB drives for now, 8-10 in future
19:33 🔗 Kaz or maybe I could delete some stuff
19:33 🔗 Frogging actually I have three drives, not four
19:33 🔗 Frogging oops
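For anyone wanting to replicate the Debian-plus-md-RAID setup Frogging describes, the core commands look roughly like the following. This is a sketch with hypothetical device names and an assumed RAID5 layout; the log doesn't say which level he runs:

    # build a RAID5 array from three drives (device names are examples)
    sudo mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
    # put a filesystem on it and persist the array definition across reboots
    sudo mkfs.ext4 /dev/md0
    sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf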
19:46 🔗 ravetcofx has joined #archiveteam-bs
19:52 🔗 BlueMaxim has joined #archiveteam-bs
20:06 🔗 Somebody has joined #archiveteam-bs
20:23 🔗 godane HCross: I guess if you're doing the PewDiePie youtube channel i don't have to download it
20:23 🔗 godane its a good thing cause i still have 2800+ to go
20:23 🔗 HCross godane, Ive nearly got it down, just need some advice on the best way to get it to the archive now
20:24 🔗 Frogging I was thinking of making a wiki page where people who archive youtube channels can add them to a table and provide contact information in the event that someone wants to get something out
20:24 🔗 HCross godane, do you find that youtube seems to have really variable download speeds?
20:24 🔗 godane yes
20:25 🔗 Frogging I imagine their storage is very geographically distributed, maybe that has something to do with it
20:25 🔗 godane but you're downloading at a much faster rate than i could
20:26 🔗 HCross Im getting anywhere from several hundred Mbps, to less than 1
20:26 🔗 godane sometimes stopping and restarting fixes that
20:26 🔗 Frogging where is the downloader located HCross?
20:26 🔗 HCross OVH, Roubaix
20:26 🔗 Frogging ah
20:27 🔗 HCross probably IP range throttles as well
20:27 🔗 godane i use a move script to sort my videos by date before i upload
20:27 🔗 godane based on the json script
20:27 🔗 godane *json files
20:27 🔗 Frogging you're downloading in full quality I hope, HCross ?
20:28 🔗 Frogging though in this instance, with the number of videos, I'd understand if that isn't feasible..
20:28 🔗 HCross yep. max video and max audio quality
20:28 🔗 HCross Frogging, I dont have 12TB storage for nothing
20:28 🔗 HCross godane, which script do you use?
20:29 🔗 Frogging nice
20:29 🔗 Frogging this is the command I use for grabbing channels
20:29 🔗 Frogging youtube-dl --download-archive archive.txt --write-description --write-annotations --write-info-json -f bestvideo[ext=webm]+bestaudio[ext=webm]/bestvideo[ext=mp4]+bestaudio[ext=m4a]/best $*
20:30 🔗 godane http://pastebin.com/KyYJk6pE
20:30 🔗 godane that is just my move script
20:30 🔗 HCross thanks godane :)
20:30 🔗 godane the move script is for sorting the files locally
20:31 🔗 HCross nearing 400GB so far
20:31 🔗 godane i then use another script to upload the sorted files to make ids like this: https://archive.org/details/achannelthatsawesome-youtube-channel-2016-02-06
20:32 🔗 godane anyways do what you know best
20:33 🔗 HCross trying to get a collection sorted, as it'll all probably need to be darked
20:35 🔗 Frogging hope it can be un-darked if he actually ends up deleting all his videos
20:43 🔗 godane i do the youtube-channel dates cause there is no metadata in the titles
20:44 🔗 godane this was also my way of sorting through stuff so i know what has to be uploaded next
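godane's pastebin above is the authoritative version of his move script; a rough sketch of the same sort-by-date idea, assuming youtube-dl's --write-info-json sidecar files and that jq is installed (the directory layout is illustrative):

    # group each video and its sidecar files by upload date from the .info.json
    for j in *.info.json; do
      d=$(jq -r '.upload_date' "$j")   # YYYYMMDD field youtube-dl writes
      mkdir -p "sorted/$d"
      mv "${j%.info.json}".* "sorted/$d/"
    done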
21:09 🔗 jrwr has joined #archiveteam-bs
21:21 🔗 Stiletto has quit IRC (Ping timeout: 244 seconds)
21:39 🔗 jsp234 has joined #archiveteam-bs
21:41 🔗 jsp12345 has quit IRC (Read error: Operation timed out)
21:49 🔗 kanzure has joined #archiveteam-bs
21:50 🔗 Sanqui moving to -bs
21:50 🔗 jsp234 has quit IRC (Remote host closed the connection)
21:50 🔗 nicolas17 has joined #archiveteam-bs
21:50 🔗 kanzure actually i don't care which one i get,
21:50 🔗 DFJustin they do generate sha1s for every file in every item
21:51 🔗 kanzure torrent hashes, the hashes inside each torrent file, or the actual file hashes
21:51 🔗 DFJustin but I guess they only included md5 in the collected census for whatever reason
21:52 🔗 kanzure i mean, i wouldn't feel great about recomputing hashes for everything every few years either :)
21:52 🔗 DFJustin you could run ia mine yourself I guess but that would take a while
21:52 🔗 Sanqui what's the goal here?
21:52 🔗 Frogging yeah I was going to ask
21:52 🔗 Frogging I may have missed it but what are you trying to achieve
21:53 🔗 kanzure https://petertodd.org/2016/opentimestamps-announcement
21:53 🔗 kanzure timestamping using merkle trees
21:53 🔗 kanzure bitcoin timestamp proofs, in particular... although it would be applicable to non-bitcoin systems as well i suppose.
21:54 🔗 DFJustin hmm interesting
21:54 🔗 Frogging and where does the files on archive.org enter into this?
21:55 🔗 kanzure archive.org has hashes, i just need the hashes
21:55 🔗 kanzure and using a weak hash (like md5) is not appropriate
21:55 🔗 Frogging the hashes of what though?
21:55 🔗 kanzure al of it
21:55 🔗 kanzure all of it
21:55 🔗 Kaz why must you have the hashes
21:55 🔗 Kaz what is your quest
21:55 🔗 kanzure timestamping is a way of showing the existence of an item based on something other than a trusted clock
21:56 🔗 Sanqui take a look at IA.BAK. it currently only covers a (small) subset of IA, but may have good data for a trial run
21:56 🔗 Kaz right
21:56 🔗 Sanqui it runs on git-annex
21:56 🔗 Kaz so you want to timestamp things to prove the IA has them?
21:56 🔗 Sanqui http://iabak.archiveteam.org/
21:56 🔗 kanzure https://petertodd.org/2016/opentimestamps-announcement#what-can-and-cant-timestamps-prove
21:56 🔗 kanzure kaz, ^
21:56 🔗 Frogging i think what he's asking is what does the IA have to do with this
21:57 🔗 Kaz I don't want your link
21:57 🔗 kanzure Sanqui: are you the same Sanqui that i know
21:57 🔗 Kaz I want to understand what you actually want here
21:57 🔗 Sanqui kanzure: yes!
21:57 🔗 kanzure ohai
21:57 🔗 Sanqui hi!
21:58 🔗 jsp12345 has joined #archiveteam-bs
22:01 🔗 jsp12345 has quit IRC (Remote host closed the connection)
22:02 🔗 kanzure kaz: timestamping, in this style, can help protect against future allegations of backdating
22:02 🔗 kanzure or rather, anyone can always make an allegation of backdating, but at least here you can show a timestamp proof that a certain version existed at a certain time
22:02 🔗 xmc well, ia.bak uses a git repo of all the hashes, and you can drop the commit ids into some kind of timestamping service
22:03 🔗 kanzure ah. opentimestamps is compatible with git repositories, actually. it uses the git commit's tree hash.
22:03 🔗 xmc kool
22:03 🔗 xmc so it sounds like maybe what you want is something you can get trivially
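Concretely, stamping the IA.BAK commit ids the way xmc suggests takes only a couple of commands. This is a sketch assuming the `ots` tool from opentimestamps-client is installed and a calendar server is reachable:

    # write the current commit id to a file and timestamp it
    git rev-parse HEAD > head-commit.txt
    ots stamp head-commit.txt        # produces head-commit.txt.ots
    # later, anyone holding the .ots proof can check it independently
    ots verify head-commit.txt.ots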
22:05 🔗 jsp12345 has joined #archiveteam-bs
22:06 🔗 kanzure also it looks like public-file-size-md_20150304205357.json.gz is about right too, 'cept for all the md5 hashes
22:07 🔗 Frogging MD5 is quite a bit faster than SHA1, probably why they did it that way
22:08 🔗 Frogging (that's just a guess, I wasn't around for it)
22:08 🔗 kanzure md5 was an okay option at one point, i think. i dunno, i'm not a cryptographer.
22:10 🔗 Frogging neither am I but as far as I know, MD5 is fine if your intended application isn't at risk of being tampered with. like verifying file integrity in a relatively safe environment
22:11 🔗 Somebody kanzure: The original census only included md5s because we were only concerned about identifying accidental identical files
22:11 🔗 Frogging it can be coerced into generating a collision (I believe that's called a collision attack), but if that's not something you're trying to protect against, then it's fine
22:11 🔗 kanzure in the context of timestamping, it would mean that anyone can forge an alternative and show hey this document "existed" back then too, and you can try to pass that different version off as legitimate
22:11 🔗 Somebody The most recent census does include both md5 and sha1
22:11 🔗 kanzure Somebody: is the most recent census available for download somewhere?
22:11 🔗 Frogging kanzure: you're right for sure. but the census wasn't designed for that
22:11 🔗 Somebody kanzure: yeah, it *should* be listed on the wiki page, but I think I haven't updated it yet. Just a sec.
22:13 🔗 Somebody kanzure: https://archive.org/details/ia_census_201604
22:13 🔗 kanzure two different types of hashes is definitely helpful
22:13 🔗 godane so i found out that WFMU has audio going back to 2002
22:15 🔗 kanzure Somebody: thank you much
22:15 🔗 * Frogging forgot who Somebody is until just now
22:15 🔗 Frogging :p
22:15 🔗 Somebody Frogging: that's the idea. :-)
22:15 🔗 Frogging oh :p
22:15 🔗 jsp12345 has quit IRC (Remote host closed the connection)
22:15 🔗 godane the only bad news with WFMU is that the streams are only in big MP4 files
22:16 🔗 jsp12345 has joined #archiveteam-bs
22:16 🔗 Somebody kanzure: glad to help -- if you have any further questions about the data or format, please ask!
22:18 🔗 kanzure got ratelimited womp womp
22:18 🔗 Somebody ratelimited by what?
22:18 🔗 kanzure dunno, i was doing 10 MB/sec for a few minutes. i'll blame my ISP, it's fine.
22:19 🔗 Somebody no, I mean, what are you trying to download?
22:19 🔗 jsp12345 has quit IRC (Remote host closed the connection)
22:19 🔗 kanzure public-file-size-md_20150304205357.json.gz
22:19 🔗 kanzure (this was from before you gave me the more recent link)
22:19 🔗 jsp12345 has joined #archiveteam-bs
22:21 🔗 ndiddy has quit IRC (Read error: Connection reset by peer)
22:24 🔗 ndiddy has joined #archiveteam-bs
22:24 🔗 Somebody kanzure: Try downloading it through a torrent -- I think there are peers for the censuses.
22:30 🔗 jsp12345 has quit IRC (Remote host closed the connection)
22:30 🔗 jsp12345 has joined #archiveteam-bs
22:42 🔗 kanzure a39e3a8d37793792f62b85cbd7b74cafe482b5b2014203ca28b8555822ce74f3 public-file-size-md_20150304205357.json.gz
22:55 🔗 Somebody kanzure: what hash is that, sha256?
22:56 🔗 kanzure for that file
22:56 🔗 kanzure yes it's sha256
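For anyone reproducing the check locally, sha256sum can verify the download against the hash kanzure posted, assuming GNU coreutils (the checklist format requires two spaces between hash and filename):

    echo "a39e3a8d37793792f62b85cbd7b74cafe482b5b2014203ca28b8555822ce74f3  public-file-size-md_20150304205357.json.gz" \
      | sha256sum -c -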
23:03 🔗 kanzure should i also do the private collection? what's in there
23:04 🔗 Somebody kanzure: that hash matches what other people downloaded, see: https://hash-archive.org/history/https://archive.org/download/ia-bak-census_20150304/public-file-size-md_20150304205357.json.gz
23:04 🔗 kanzure ah hash-archive.org had it, okay
23:05 🔗 Somebody well, it had it *now* (I just added it)
23:05 🔗 kanzure i had already hashed one of the previous databases of hash-archive.org
23:05 🔗 kanzure oh i see.
23:05 🔗 Somebody anyone can put anything into hash-archive, just submit a URL
23:05 🔗 Somebody and it will download it and hash it
23:05 🔗 kanzure yes i know
23:05 🔗 kanzure i submitted this one the other day: https://hash-archive.org/history/https://archive.org/download/archiveteam_archivebot_go_068/bitcointalk.org-inf-20140403-045710-7i531.warc.gz
23:06 🔗 Somebody oh, good!
23:07 🔗 Somebody If you think of any features to add to hash-archive, do contact the author.
23:07 🔗 kanzure he actually didn't reply to me, but whatever
23:07 🔗 Somebody hm, that's odd. what did you suggest?
23:07 🔗 kanzure merkle roots
23:07 🔗 kanzure also the database download page is broken
23:07 🔗 Somebody ah, probably just hasn't gotten to it then
23:07 🔗 kanzure s/merkle roots/merkle trees
23:08 🔗 Somebody regarding the _private data file, it has (nearly) all the Wayback Machine data, and various other stuff
23:08 🔗 kanzure oh, wayback machine data sounds potentially useful to timestamp. alright.
23:09 🔗 Somebody but it won't have any torrents for you to grab (I think)
23:09 🔗 kanzure right. i can just hash the .json file that lists all the hashes.
23:09 🔗 kanzure it's a nasty hack but whatever
23:10 🔗 Somebody yeah, if all you are doing is timestamping the existence of census data files, certainly, do all of them
23:10 🔗 kanzure this is helpful metadata. it can be useful in the future to point out that no, the archive was not backdated, and here's why :).
23:11 🔗 Somebody kanzure: you'll probably also be interested in https://archive.org/download/archiveteam_census_2016 -- it will contain monthly lists of all the identifiers included in the search engine
23:11 🔗 kanzure or rather: not backdated after it was timestamped today
23:11 🔗 kanzure 8daa7a635d77eddb9fecb000abbe10b19611b623a1242b4a7b4b7881b92ddae6 file_hashes_sha1_20160411221100_public.tsv.gz.ots
23:11 🔗 Somebody uploaded automatically once a month (with a new item generated each year)
23:12 🔗 Somebody ah, nice
23:12 🔗 Somebody kanzure: agreed
23:12 🔗 Somebody Yes, my basic interest in the census work was to provide 3rd-party validation of "this was in the archive at this time"
23:13 🔗 Somebody so I'm delighted to see you working on timestamping it
23:13 🔗 Somebody If you'd like to drop copies of the census data into a gmail mailbox, and/or AWS, that'd be nice too
23:14 🔗 Somebody and please *do* dump copies of the toplevel merkle hashes into multiple pastebins, and then archive the pastebins
23:14 🔗 kanzure any other files i should look at before i do that?
23:16 🔗 Somebody kanzure: might as well grab http://archiveteam.org/index.php?title=Internet_Archive_Census#See_Also
23:17 🔗 nicolas17 Google Cloud Platform has some nice "coldline" storage now too
23:17 🔗 kanzure oops pardon me, 8daa7a635d77eddb9fecb000abbe10b19611b623a1242b4a7b4b7881b92ddae6 was for file_hashes_sha1_20160411221100_public.tsv.gz not file_hashes_sha1_20160411221100_public.tsv.gz.ots
23:18 🔗 Somebody also note that item _meta.xml and _files.xml will change every time the metadata for an item changes, which can happen pretty much anytime, and will happen regularly. So differences in those isn't generally very interesting
23:18 🔗 kanzure are these census files going to ever change? if they change, a new census is released, right?
23:19 🔗 Somebody kanzure: The census files *shouldn't* change, as I conceptualize them, no.
23:19 🔗 kanzure tsv == tab csv?
23:19 🔗 Somebody yep
23:20 🔗 Somebody tab-separated-values
23:20 🔗 kanzure all sorts of fancy up in this joint, wow
23:20 🔗 Somebody looks nicer, and commas are more likely to be found in item identifiers than tabs
23:21 🔗 Somebody and pretty much nothing that supports csv can't be tweaked to support tsv instead
23:22 🔗 ae_g_i_s tsv is much better, agreed
23:22 🔗 Somebody in the next census, I might generate a separate file-hashes list that excludes the _meta.xml and _files.xml files.
23:22 🔗 Somebody ae_g_i_s: glad you agree; did I leave out any of the other advantages?
23:22 🔗 ae_g_i_s you use a character that's almost never used in content as the separator instead of a character that's often used in texts
23:22 🔗 kanzure Somebody: use lots of different hash functions, very helpful to resist bit rot and hash function failure over time
23:22 🔗 Somebody nicolas17: thanks -- please do dump copies of the census stuff in there, if you get a chance!
23:23 🔗 ae_g_i_s which is kinda the main parser perspective on it - you want the separator to not be in the set of valid "content" characters
23:24 🔗 Somebody I think csv was originally intended for primarily-numerical spreadsheets -- where, while it still hurts, it is likely to be *more* sensible.
23:24 🔗 kanzure ideally, all content on archive.org would be timestamped (using opentimestamps or w/e) at submission time. and then archive.org would store the timestamp proof itself.
23:24 🔗 Somebody kanzure: yes, that would be quite neat.
23:24 🔗 nicolas17 there are a bunch of csv variants wrt how they handle escaping of the comma, or quoting
23:24 🔗 kanzure timestamp proof is only a few hundred bytes per item
23:24 🔗 Kaz but this is AT
23:24 🔗 Kaz not IA
23:25 🔗 kanzure i refuse to believe that IA would be so cold as to not hang out in here
23:25 🔗 Somebody kanzure: yes, that would be awesome. please send that idea into mek if you haven't already.
23:25 🔗 kanzure i have not
23:25 🔗 kanzure who is mek
23:25 🔗 Somebody kanzure: eh, there's overlap, but we try to maintain separation
23:25 🔗 Kaz kanzure: #archiveteam
23:25 🔗 Kaz uh
23:25 🔗 Kaz #internetarchive
23:25 🔗 kanzure this is an endless loop...
23:25 🔗 kanzure ah
23:26 🔗 kanzure mek does not seem to be there
23:26 🔗 Kaz oh, yeah I have no idea who/what mek is, but #internetarchive is the channel for your needs
23:26 🔗 nicolas17 hm, I need to update the Mapillary page on the AT wiki
23:27 🔗 Somebody kanzure: mek is a staffer at Archive.org, and interested in various new ideas. https://michaelkarpeles.com/
23:27 🔗 nicolas17 they have 104M photos by now :P
23:30 🔗 kanzure e2bc2f240490e91d52a1eaeb5636664f75795da9bfcbbe7692c07b90ae18244b file_hashes_sha1_20160411221100_private.tsv.gz
23:31 🔗 GE has quit IRC (Remote host closed the connection)
23:32 🔗 nicolas17 how would you go about archiving 200TB of photos from an AWS S3 bucket in Europe? preferably without giving the owner high AWS bandwidth costs
23:33 🔗 kanzure amazon has snowball data container things
23:37 🔗 ae_g_i_s yeah, afaik that one's called glacier
23:37 🔗 ae_g_i_s though i dunno if glacier works in the 'restore direction', i do know that it works in the 'backup direction', i.e. they send you drives, you fill them up and send it to them
23:37 🔗 kanzure https://aws.amazon.com/blogs/aws/aws-importexport-snowball-transfer-1-petabyte-per-week-using-amazon-owned-storage-appliances/
23:39 🔗 kanzure oops that was last year
23:39 🔗 kanzure ah here we go,
23:39 🔗 kanzure https://aws.amazon.com/blogs/aws/aws-snowmobile-move-exabytes-of-data-to-the-cloud-in-weeks/
23:40 🔗 ae_g_i_s yeah, snowmobile is the new one, where they send you a powered truck
23:40 🔗 kanzure "In order to meet the needs of these customers, we are launching Snowmobile today. This secure data truck stores up to 100 PB of data and can help you to move exabytes to AWS in a matter of weeks (you can get more than one if necessary). "
23:40 🔗 kanzure raise the pirate flag, let's go raiding
23:42 🔗 kanzure from https://archive.org/download/archiveteam_census_2016 ,
23:42 🔗 kanzure 2840f4e64f4c2bf562e97294714371cfe7beb4122e73c3437d535175d93e53df 2016.10.23-ia_identifiers.txt.gz
23:42 🔗 ae_g_i_s fun fact: you can bake ~175 chickens at the same time with the power necessary for an amazon snowmobile
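Worth noting for nicolas17's original question: besides the physical-shipment options above, S3 has a Requester Pays mode that shifts the transfer bill from the bucket owner to the downloader. A sketch with a hypothetical bucket name, assuming the AWS CLI and the owner's cooperation:

    # owner enables Requester Pays on the bucket (one-time)
    aws s3api put-bucket-request-payment \
      --bucket example-photo-bucket \
      --request-payment-configuration Payer=Requester
    # downloader pulls the data, explicitly agreeing to pay transfer costs
    aws s3 sync s3://example-photo-bucket ./photos --request-payer requester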
23:49 🔗 Somebody https://hash-archive.org/history/https://archive.org/download/ia_census_201604/file_hashes_sha1_20160411221100_private.tsv.gz
23:49 🔗 Somebody https://hash-archive.org/history/https://archive.org/download/archiveteam_census_2016/2016.10.23-ia_identifiers.txt.gz
23:50 🔗 kanzure ah.
23:51 🔗 kanzure yes i should have checked that first.
23:51 🔗 Somebody eh, the order doesn't matter -- but it is a nice way to get a 3rd-party check
