00:12 -- RichardG_ has joined #archiveteam-bs
00:17 -- RichardG has quit IRC (Read error: Operation timed out)
00:25 -- Somebody has joined #archiveteam-bs
00:26 -- VADemon has quit IRC (Quit: left4dead)
00:46 -- Aranje has joined #archiveteam-bs
00:47 -- RichardG_ is now known as RichardG
01:17 -- Somebody has quit IRC (Ping timeout: 370 seconds)
01:22 -- hawc145 has joined #archiveteam-bs
01:24 -- wacky has quit IRC (Ping timeout: 250 seconds)
01:24 -- Kksmkrn has quit IRC (Ping timeout: 250 seconds)
01:24 -- wacky has joined #archiveteam-bs
01:25 -- HCross has quit IRC (Ping timeout: 250 seconds)
01:25 -- dashcloud has quit IRC (Ping timeout: 250 seconds)
01:25 -- dxdx has quit IRC (Ping timeout: 250 seconds)
01:25 -- pikhq has quit IRC (Ping timeout: 250 seconds)
01:25 -- Zebranky has quit IRC (Ping timeout: 250 seconds)
01:25 -- dashcloud has joined #archiveteam-bs
01:26 -- pikhq has joined #archiveteam-bs
01:32 -- dx has joined #archiveteam-bs
01:33 -- Zebranky has joined #archiveteam-bs
02:03 -- Somebody has joined #archiveteam-bs
02:58 -- Kksmkrn has joined #archiveteam-bs
03:42 -- Lord_Nigh has quit IRC (Read error: Operation timed out)
03:44 -- ravetcofx has quit IRC (Read error: Operation timed out)
03:51 -- ravetcofx has joined #archiveteam-bs
03:55 -- Lord_Nigh has joined #archiveteam-bs
04:00 -- Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
04:01 -- Lord_Nigh has joined #archiveteam-bs
04:01 -- jrwr has quit IRC (Remote host closed the connection)
04:12 -- ndiddy has quit IRC (Read error: Connection reset by peer)
05:09 -- Lord_Nigh has quit IRC (Read error: Operation timed out)
05:20 -- Lord_Nigh has joined #archiveteam-bs
05:42 -- Sk1d has quit IRC (Ping timeout: 194 seconds)
05:48 -- Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
05:48 -- Sk1d has joined #archiveteam-bs
05:49 -- Lord_Nigh has joined #archiveteam-bs
06:02 -- Aranje has quit IRC (Quit: Three sheets to the wind)
06:36 -- Lord_Nigh has quit IRC (Read error: Operation timed out)
06:38 -- Lord_Nigh has joined #archiveteam-bs
06:46 -- Lord_Nigh has quit IRC (Read error: Operation timed out)
07:09 -- Lord_Nigh has joined #archiveteam-bs
07:51 -- Lord_Nigh has quit IRC (Ping timeout: 250 seconds)
07:52 -- Lord_Nigh has joined #archiveteam-bs
08:06 -- Somebody has quit IRC (Ping timeout: 370 seconds)
09:16 -- GE has joined #archiveteam-bs
09:18 -- phuzion has quit IRC (Read error: Operation timed out)
09:19 -- phuzion has joined #archiveteam-bs
09:22 -- ravetcofx has quit IRC (Read error: Operation timed out)
09:22 -- Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
09:26 -- Lord_Nigh has joined #archiveteam-bs
09:49 -- dashcloud has quit IRC (Ping timeout: 244 seconds)
09:51 -- dashcloud has joined #archiveteam-bs
10:05 <godane> i'm uploading more kpra audio
11:30 -- BlueMaxim has quit IRC (Quit: Leaving)
11:38 -- GE has quit IRC (Remote host closed the connection)
11:40 <whydomain> Anyone know of a way to download an icecast stream in chunks (e.g. 500mb parts)?
11:41 <whydomain> I want to grab an ongoing radio stream but if I just download as one file I'll eventually run out of disk space
11:44 <whydomain> The problem is most ways of splitting a file create a *copy* of the file, rather than splitting the original
12:22 <ranma> worth backing up? https://www.youtube.com/watch?v=miw39UKfKPU
12:22 <ranma> <Chii> At Dinner With Donald Trump, Mitt Romney Ate Crow - [8m54s] 2016-12-01 - The Late Show with Stephen Colbert - 1,078,394 views
12:22 <ranma> references Trump's "loss of citizenship or year in jail" quote
12:52 -- hawc145 is now known as HCross
12:55 <ae_g_i_s> whydomain: i suspect that `split` should be able to do that if you output the icecast stream to stdout and pipe it to `split`
12:57 <ae_g_i_s> the drawback is that it won't conserve any headers, so the resulting files (after the first one) might be slightly broken - but if you can just `cat` them together on the target system, that's no issue
12:58 <ae_g_i_s> okay, wrong phrasing. "won't conserve headers" as in "won't write extra headers to each output file"
13:05 -- GE has joined #archiveteam-bs
13:07 -- BartoCH has quit IRC (Remote host closed the connection)
13:07 <arkiver> whydomain: which radio stream?
13:09 <whydomain> A local community one, that I don't think will be archived.
13:09 <arkiver> do you have a link?
13:09 <whydomain> But I think that ae_g_i_s's method is working
13:09 <whydomain> http://icecast.easystream.co.uk:8000/blackdiamondfm.m3u
13:11 <whydomain> Yes! Thanks ae_g_i_s, it works.
13:11 <whydomain> curl http://icecast.easystream.co.uk:8000/blackdiamondfm | split -d -b 100M - radio
13:11 <HCross> arkiver, something to write up/add to videobot?
13:11 <whydomain> (if anyone else is interested)
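
A minimal sketch of the approach worked out above, wrapped in a script so each run gets its own timestamped chunk prefix; the stream URL is the one from the log, and the 100M chunk size and "radio" prefix are just the values whydomain used.

    #!/bin/bash
    # Record an icecast stream and cut it into fixed-size pieces on the fly,
    # so no single output file can fill the disk.
    set -eu
    URL="${1:-http://icecast.easystream.co.uk:8000/blackdiamondfm}"
    PREFIX="${2:-radio}.$(date +%Y%m%d-%H%M%S)."
    # curl writes the raw stream to stdout; split cuts it into numbered 100 MB chunks.
    curl -s "$URL" | split -d -b 100M - "$PREFIX"

As ae_g_i_s notes, only the first chunk carries the stream headers, so the later pieces may need to be concatenated back together (for example cat "$PREFIX"* > full-stream, with the container/codec depending on the station) before they play cleanly.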
13:15 <arkiver> well, we're doing a radio recording project over at IA
13:15 <arkiver> it's mostly not public though
13:17 <whydomain> arkiver: out of curiosity, will IA be targeting smaller community/local stations, or just the big ones?
13:17 <arkiver> everything
13:18 <arkiver> however, we prefer informative radio stations
13:18 <arkiver> and this project is not very public, FYI
13:19 <whydomain> everything? (even non-US stuff? - like the one I'm grabbing right now (black diamond) )
13:19 <arkiver> definitely non-US stuff!
13:21 -- BartoCH has joined #archiveteam-bs
13:21 <whydomain> what if there is no web stream?
13:21 <arkiver> well, currently only web streaming stations
13:22 <arkiver> but they almost all have a web stream
13:25 -- GE has quit IRC (Remote host closed the connection)
13:50 -- GE has joined #archiveteam-bs
14:01 <godane> SketchCow: looks like the metadata for the date has to be fixed here: https://archive.org/details/1988-JUn-compute-magazine
14:05 -- BartoCH has quit IRC (Ping timeout: 260 seconds)
14:11 -- BartoCH has joined #archiveteam-bs
15:00 -- BartoCH has quit IRC (Ping timeout: 260 seconds)
15:08 <tapedrive> arkiver: Just in case you haven't seen this: http://www.radiofeeds.co.uk/ is a listing of nearly all radio feeds in the UK.
15:37 <arkiver> tapedrive: thank you!
15:42 <arkiver> That's a very nice list
15:42 <arkiver> if you have anything, please let me know :D
16:17 -- BartoCH has joined #archiveteam-bs
16:24 -- kristian_ has joined #archiveteam-bs
16:39 <tapedrive> arkiver: All of the ones I've tested from that list work in non-UK countries, but there may be some that don't.
16:43 <whydomain> arkiver: there's Roland Radio (Amstrad CPC computer music) at http://streaming.rolandradio.net:8000/rolandradio
16:43 -- kvieta has quit IRC (Ping timeout: 246 seconds)
16:49 -- kvieta has joined #archiveteam-bs
17:34 -- BartoCH has quit IRC (Ping timeout: 260 seconds)
17:39 -- BartoCH has joined #archiveteam-bs
17:45 -- BartoCH has quit IRC (Ping timeout: 260 seconds)
17:49 -- BartoCH has joined #archiveteam-bs
18:08 -- zerkalo has quit IRC (Read error: Connection reset by peer)
18:08 -- zerkalo has joined #archiveteam-bs
18:09 -- zerkalo has quit IRC (Read error: Connection reset by peer)
18:09 -- zerkalo has joined #archiveteam-bs
18:10 -- ndiddy has joined #archiveteam-bs
18:18 -- zerkalo has quit IRC (Ping timeout: 244 seconds)
18:30 -- zerkalo has joined #archiveteam-bs
19:01 -- VADemon has joined #archiveteam-bs
19:06 -- Somebody has joined #archiveteam-bs
19:17 -- Somebody has quit IRC (Ping timeout: 370 seconds)
19:24 <godane> we are up to 2016-11-30 with kpfa
19:26 <HCross> Sanqui, only 1k more PewDiePie videos to go
19:26 <Kaz> anyone in here with *lots* of local storage? Looking for some advice
19:26 <Frogging> how much is lots?
19:26 <HCross> how much is defined by "lots"
19:27 <Kaz> let's say 20TB+
19:27 <Frogging> ah I don't have quite that much
19:27 <Kaz> trying to work out the best route for 8-12 drives in a non-huge physical space
19:27 <Frogging> http://www.ncix.com/detail/fractal-design-node-804-matx-23-97165.htm
19:28 <Kaz> HP microserver (4 bays) is doing fine at the moment, but I'm not too sure on expansion
19:28 <HCross> Kaz, best bet may be #DataHoarder on Freenode
19:28 <Kaz> already there :)
19:28 <Frogging> that thing holds 10 unmodded
19:28 <Kaz> Frogging: ..did not realise that had space for 10 3.5"'s inside wtf
19:29 <Frogging> yeah it's pretty amazing. It's what I'm using for my NAS
19:29 <Frogging> 8 in the main bays and there are mounting points next to the motherboard for two more
19:29 <Frogging> and then you can stick an SSD or two in the front
19:29 <Kaz> what mobo/cpu are you running in there?
19:30 <Kaz> and freenas/unraid or anything?
19:30 <Frogging> ASRock 970M and an AMD Phenom II 965 quad core
19:31 <Frogging> it's running Debian with md RAID
19:31 <Frogging> I'm not a fan of freenas
19:32 <Kaz> ah, right
19:32 <Kaz> god this won't be cheap
19:32 <Frogging> by far the most expensive thing for me was the drives
19:33 <Frogging> I have four 4TB WD Reds in there right now, and some cheap SSD for the OS
19:33 <Kaz> yeah, I'm looking at 4-6TB drives for now, 8-10 in future
19:33 <Kaz> or maybe I could delete some stuff
19:33 <Frogging> actually I have three drives, not four
19:33 <Frogging> oops
19:46 -- ravetcofx has joined #archiveteam-bs
19:52 -- BlueMaxim has joined #archiveteam-bs
20:06 -- Somebody has joined #archiveteam-bs
20:23 <godane> HCross: I guess if you're doing the PewDiePie youtube channel i don't have to download it
20:23 <godane> it's a good thing cause i still have 2800+ to go
20:24 <HCross> godane, I've nearly got it down, just need some advice on the best way to get it to the archive now
20:24 <Frogging> I was thinking of making a wiki page where people who archive youtube channels can add them to a table and provide contact information in the event that someone wants to get something out
20:24 <HCross> godane, do you find that youtube seems to have really variable download speeds?
20:25 <godane> yes
20:25 <Frogging> I imagine their storage is very geographically distributed, maybe that has something to do with it
20:26 <godane> but you're downloading at a much faster rate than i could
20:26 <HCross> I'm getting anywhere from several hundred Mbps to less than 1
20:26 <godane> sometimes a stop and restart fixes that
20:26 <Frogging> where is the downloader located HCross?
20:26 <HCross> OVH, Roubaix
20:27 <Frogging> ah
20:27 <HCross> probably IP range throttles as well
20:27 <godane> i use a move script to sort my videos by date before i upload
20:27 <godane> based on the json script
20:27 <godane> *json files
20:28 <Frogging> you're downloading in full quality I hope, HCross?
20:28 <Frogging> though in this instance, with the number of videos, I'd understand if that isn't feasible..
20:28 <HCross> yep. max video and max audio quality
20:28 <HCross> Frogging, I don't have 12TB storage for nothing
20:29 <HCross> godane, which script do you use?
20:29 <Frogging> nice
20:29 <Frogging> this is the command I use for grabbing channels
20:29 <Frogging> youtube-dl --download-archive archive.txt --write-description --write-annotations --write-info-json -f bestvideo[ext=webm]+bestaudio[ext=webm]/bestvideo[ext=mp4]+bestaudio[ext=m4a]/best $*
20:30 <godane> http://pastebin.com/KyYJk6pE
20:30 <godane> that is just my move script
20:30 <HCross> thanks godane :)
20:30 <godane> the move script is for sorting the files locally
20:31 <HCross> nearing 400GB so far
20:31 <godane> i use another script to upload the sorted files to make ids like this: https://archive.org/details/achannelthatsawesome-youtube-channel-2016-02-06
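
godane's actual move script is the pastebin above; purely as an illustration of the same idea, here is a rough sketch that sorts youtube-dl output into per-date directories using the upload_date field from each video's .info.json (it assumes jq is installed and that every downloaded file shares its basename with the matching .info.json).

    #!/bin/bash
    # Group downloaded videos into YYYY-MM-DD directories based on their
    # youtube-dl .info.json metadata, so they can later be uploaded as dated items.
    set -eu
    for meta in *.info.json; do
        [ -e "$meta" ] || continue
        d=$(jq -r '.upload_date' "$meta")   # upload_date is stored as YYYYMMDD
        [ "$d" != "null" ] || continue      # skip entries with no date
        dir="${d:0:4}-${d:4:2}-${d:6:2}"
        mkdir -p "$dir"
        base="${meta%.info.json}"
        mv -v "$base".* "$dir"/             # moves the video, json, description, etc.
    done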
20:32 <godane> anyways do what you know best
20:33 <HCross> trying to get a collection sorted, as it'll all probably need to be darked
20:35 <Frogging> hope it can be un-darked if he actually ends up deleting all his videos
20:43 <godane> i do the youtube-channel dates cause there is no metadata in the titles
20:44 <godane> this was also my way of sorting through stuff so i know what has to be uploaded next
21:09 -- jrwr has joined #archiveteam-bs
21:21 -- Stiletto has quit IRC (Ping timeout: 244 seconds)
21:39 -- jsp234 has joined #archiveteam-bs
21:41 -- jsp12345 has quit IRC (Read error: Operation timed out)
21:49 -- kanzure has joined #archiveteam-bs
21:50 <Sanqui> moving to -bs
21:50 -- jsp234 has quit IRC (Remote host closed the connection)
21:50 -- nicolas17 has joined #archiveteam-bs
21:50 <kanzure> actually i don't care which one i get,
21:50 <DFJustin> they do generate sha1s for every file in every item
21:51 <kanzure> torrent hashes, the hashes inside each torrent file, or the actual file hashes
21:51 <DFJustin> but I guess they only included md5 in the collected census for whatever reason
21:52 <kanzure> i mean, i wouldn't feel great about recomputing hashes for everything every few years either :)
21:52 <DFJustin> you could run ia mine yourself I guess but that would take a while
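
For a single item, the stored checksums are already exposed through the metadata API, so nothing needs to be re-hashed; a sketch using the internetarchive command-line tool plus jq (ia_census_201604 is just an example identifier from later in this log, and the .files[] field names are worth double-checking against the API output).

    # pip install internetarchive   (provides the `ia` command)
    # Print name, md5 and sha1 for every file in one item, straight from the metadata API.
    ia metadata ia_census_201604 | jq -r '.files[] | [.name, .md5, .sha1] | @tsv'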
21:52 <Sanqui> what's the goal here?
21:52 <Frogging> yeah I was going to ask
21:52 <Frogging> I may have missed it but what are you trying to achieve
21:53 <kanzure> https://petertodd.org/2016/opentimestamps-announcement
21:53 <kanzure> timestamping using merkle trees
21:53 <kanzure> bitcoin timestamp proofs, in particular... although it would be applicable to non-bitcoin systems as well i suppose.
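
For reference, the client described in that announcement is driven from the command line roughly like this (a sketch based on the opentimestamps-client; the census filename is the one that comes up later in this log).

    # pip install opentimestamps-client   (provides the `ots` command)
    ots stamp file_hashes_sha1_20160411221100_public.tsv.gz        # writes a <file>.ots proof next to the input
    ots upgrade file_hashes_sha1_20160411221100_public.tsv.gz.ots  # fetch the Bitcoin attestation once it is confirmed
    ots verify file_hashes_sha1_20160411221100_public.tsv.gz.ots   # check the proof against the original file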
21:54 <DFJustin> hmm interesting
21:54 <Frogging> and where do the files on archive.org enter into this?
21:55 <kanzure> archive.org has hashes, i just need the hashes
21:55 <kanzure> and using a weak hash (like md5) is not appropriate
21:55 <Frogging> the hashes of what though?
21:55 <kanzure> all of it
21:55 <kanzure> everything :)
21:55 <Kaz> why must you have the hashes
21:55 <Kaz> what is your quest
21:56 <kanzure> timestamping is a way of showing the existence of an item based on something other than a trusted clock
21:56 <Sanqui> take a look at IA.BAK. it currently only covers a (small) subset of IA, but may have good data for a trial run
21:56 <Kaz> right
21:56 <Sanqui> it runs on git-annex
21:56 <Kaz> so you want to timestamp things to prove the IA has them?
21:56 <Sanqui> http://iabak.archiveteam.org/
21:56 <kanzure> https://petertodd.org/2016/opentimestamps-announcement#what-can-and-cant-timestamps-prove
21:56 <kanzure> kaz, ^
21:57 <Frogging> i think what he's asking is what does the IA have to do with this
21:57 <Kaz> I don't want your link
21:57 <kanzure> Sanqui: are you the same Sanqui that i know
21:57 <Kaz> I want to understand what you actually want here
21:57 <Sanqui> kanzure: yes!
21:57 <kanzure> ohai
21:57 <Sanqui> hi!
21:58 -- jsp12345 has joined #archiveteam-bs
22:01 -- jsp12345 has quit IRC (Remote host closed the connection)
22:02 <kanzure> kaz: timestamping, in this style, can help protect against future allegations of backdating
22:02 <kanzure> or rather, anyone can always make an allegation of backdating, but at least here you can show a timestamp proof that a certain version existed at a certain time
22:03 <xmc> well, ia.bak uses a git repo of all the hashes, and you can drop the commit ids into some kind of timestamping service
22:03 <kanzure> ah. opentimestamps is compatible with git repositories, actually. it uses the git commit's tree hash.
22:03 <xmc> kool
22:03 <xmc> so it sounds like maybe what you want is something you can get trivially
22:05 -- jsp12345 has joined #archiveteam-bs
22:06 <kanzure> also it looks like public-file-size-md_20150304205357.json.gz is about right too, 'cept for all the md5 hashes
22:07 <Frogging> MD5 is quite a bit faster than SHA1, probably why they did it that way
22:08 <Frogging> (that's just a guess, I wasn't around for it)
22:08 <kanzure> md5 was an okay option at one point, i think. i dunno, i'm not a cryptographer.
22:10 <Frogging> neither am I but as far as I know, MD5 is fine if your intended application isn't at risk of being tampered with. like verifying file integrity in a relatively safe environment
22:11 <Somebody> kanzure: The original census only included md5s because we were only concerned about identifying accidental identical files
22:11 <Frogging> it can be coerced into generating a collision (I believe it's called a preimage attack), but if that's not something you're trying to protect against, then it's fine
22:11 <kanzure> in the context of timestamping, it would mean that anyone can forge an alternative and show hey this document "existed" back then too, and you can try to pass that different version off as legitimate
22:11 <Somebody> The most recent census does include both md5 and sha1
22:11 <kanzure> Somebody: is the most recent census available for download somewhere?
22:11 <Frogging> kanzure: you're right for sure. but the census wasn't designed for that
22:11 <Somebody> kanzure: yeah, it *should* be listed on the wiki page, but I think I haven't updated it yet. Just a sec.
22:13 <Somebody> kanzure: https://archive.org/details/ia_census_201604
22:13 <kanzure> two different types of hashes is definitely helpful
22:13 <godane> so i found out that WFMU has audio going back to 2002
22:15 <kanzure> Somebody: thank you much
22:15 * Frogging forgot who Somebody is until just now
22:15 <Frogging> :p
22:15 <Somebody> Frogging: that's the idea. :-)
22:15 <Frogging> oh :p
22:15 -- jsp12345 has quit IRC (Remote host closed the connection)
22:15 <godane> the only bad news with WFMU is that the streams are only in big MP4 files
22:16 -- jsp12345 has joined #archiveteam-bs
22:16 <Somebody> kanzure: glad to help -- if you have any further questions about the data or format, please ask!
22:18 <kanzure> got ratelimited womp womp
22:18 <Somebody> ratelimited by what?
22:19 <kanzure> dunno, i was doing 10 MB/sec for a few minutes. i'll blame my ISP, it's fine.
22:19 <Somebody> no, I mean, what are you trying to download?
22:19 -- jsp12345 has quit IRC (Remote host closed the connection)
22:19 <kanzure> public-file-size-md_20150304205357.json.gz
22:19 <kanzure> (this was from before you gave me the more recent link)
22:21 -- jsp12345 has joined #archiveteam-bs
22:21 -- ndiddy has quit IRC (Read error: Connection reset by peer)
22:24 -- ndiddy has joined #archiveteam-bs
22:24 <Somebody> kanzure: Try downloading it through a torrent -- I think there are peers for the censuses.
22:30 -- jsp12345 has quit IRC (Remote host closed the connection)
22:30 -- jsp12345 has joined #archiveteam-bs
22:42 <kanzure> a39e3a8d37793792f62b85cbd7b74cafe482b5b2014203ca28b8555822ce74f3 public-file-size-md_20150304205357.json.gz
22:55 <Somebody> kanzure: what hash is that, sha256?
22:56 <kanzure> for that file
22:56 <kanzure> yes it's sha256
23:03 <kanzure> should i also do the private collection? what's in there
23:04 <Somebody> kanzure: that hash matches what other people downloaded, see: https://hash-archive.org/history/https://archive.org/download/ia-bak-census_20150304/public-file-size-md_20150304205357.json.gz
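
To reproduce that check locally, hash the downloaded file and compare the digest against the entries on the hash-archive.org history page for the same URL (sha256sum is the GNU coreutils tool; on macOS, shasum -a 256 does the same job).

    sha256sum public-file-size-md_20150304205357.json.gz
    # expected: a39e3a8d37793792f62b85cbd7b74cafe482b5b2014203ca28b8555822ce74f3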
23:04 <kanzure> ah hash-archive.org had it, okay
23:05 <Somebody> well, it has it *now* (I just added it)
23:05 <kanzure> i had already hashed one of the previous databases of hash-archive.org
23:05 <kanzure> oh i see.
23:05 <Somebody> anyone can put anything into hash-archive, just submit a URL
23:05 <Somebody> and it will download it and hash it
23:05 <kanzure> yes i know
23:06 <kanzure> i submitted this one the other day: https://hash-archive.org/history/https://archive.org/download/archiveteam_archivebot_go_068/bitcointalk.org-inf-20140403-045710-7i531.warc.gz
23:07 <Somebody> oh, good!
23:07 <Somebody> If you think of any features to add to hash-archive, do contact the author.
23:07 <kanzure> he actually didn't reply to me, but whatever
23:07 <Somebody> hm, that's odd. what did you suggest?
23:07 <kanzure> merkle roots
23:07 <kanzure> also the database download page is broken
23:07 <Somebody> ah, probably just hasn't gotten to it then
23:08 <kanzure> s/merkle roots/merkle trees
23:08 <Somebody> regarding the _private data file, it has (nearly) all the Wayback Machine data, and various other stuff
23:09 <kanzure> oh, wayback machine data sounds potentially useful to timestamp. alright.
23:09 <Somebody> but it won't have any torrents for you to grab (I think)
23:09 <kanzure> right. i can just hash the .json file that lists all the hashes.
23:10 <kanzure> it's a nasty hack but whatever
23:10 <Somebody> yeah, if all you are doing is timestamping the existence of census data files, certainly, do all of them
23:11 <kanzure> this is helpful metadata. it can be useful in the future to point out that no, the archive was not backdated, and here's why :).
23:11 <Somebody> kanzure: you'll probably also be interested in https://archive.org/download/archiveteam_census_2016 -- it will contain monthly lists of all the identifiers included in the search engine
23:11 <kanzure> or rather: not backdated after it was timestamped today
23:11 <kanzure> 8daa7a635d77eddb9fecb000abbe10b19611b623a1242b4a7b4b7881b92ddae6 file_hashes_sha1_20160411221100_public.tsv.gz.ots
23:12 <Somebody> uploaded automatically once a month (with a new item generated each year)
23:12 <Somebody> ah, nice
23:12 <Somebody> kanzure: agreed
23:13 <Somebody> Yes, my basic interest in the census work was to provide 3rd-party validation of "this was in the archive at this time"
23:13 <Somebody> so I'm delighted to see you working on timestamping it
23:14 <Somebody> If you'd like to drop copies of the census data into a gmail mailbox, and/or AWS, that'd be nice too
23:14 <Somebody> and please *do* dump copies of the toplevel merkle hashes into multiple pastebins, and then archive the pastebins
23:16 <kanzure> any other files i should look at before i do that?
23:17 <Somebody> kanzure: might as well grab http://archiveteam.org/index.php?title=Internet_Archive_Census#See_Also
23:17 <nicolas17> Google Cloud Platform has some nice "coldline" storage now too
23:18 <kanzure> oops pardon me, 8daa7a635d77eddb9fecb000abbe10b19611b623a1242b4a7b4b7881b92ddae6 was for file_hashes_sha1_20160411221100_public.tsv.gz not file_hashes_sha1_20160411221100_public.tsv.gz.ots
23:18 <Somebody> also note that item _meta.xml and _files.xml will change every time the metadata for an item changes, which can happen pretty much anytime, and will happen regularly. So differences in those aren't generally very interesting
23:19 <kanzure> are these census files going to ever change? if they change, a new census is released, right?
23:19 <Somebody> kanzure: The census files *shouldn't* change, as I conceptualize them, no.
23:19 <kanzure> tsv == tab csv?
23:20 <Somebody> yep
23:20 <Somebody> tab-separated-values
23:20 <kanzure> all sorts of fancy up in this joint, wow
23:21 <Somebody> looks nicer, and commas are more likely to be found in item identifiers than tabs
23:22 <Somebody> and pretty much nothing that supports csv can't be tweaked to support tsv instead
23:22 <ae_g_i_s> tsv is much better, agreed
23:22 <Somebody> in the next census, I might generate a separate file-hashes list that excludes the _meta.xml and _files.xml files.
23:22 <Somebody> ae_g_i_s: glad you agree; did I leave out any of the other advantages?
23:22 <ae_g_i_s> you use a character that's almost never used in content as the separator instead of a character that's often used in texts
23:22 <kanzure> Somebody: use lots of different hash functions, very helpful to resist bit rot and hash function failure over time
23:23 <Somebody> nicolas17: thanks -- please do dump copies of the census stuff in there, if you get a chance!
23:24 <ae_g_i_s> which is kinda the main parser perspective on it - you want the separator to not be in the set of valid "content" characters
23:24 <Somebody> I think csv was originally intended for primarily-numerical spreadsheets -- where, while it still hurts, it is likely to be *more* sensible.
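
One reason the tab-separated format is convenient in practice: ordinary text tools take a tab delimiter directly, so the census hash lists can be inspected without a dedicated parser. A quick sketch (the exact column layout is not documented here, so check the census item's description before relying on column numbers).

    # Peek at the first few rows of a census hash list.
    zcat file_hashes_sha1_20160411221100_public.tsv.gz | head -n 5
    # Split on tabs with awk, e.g. to pull out just the first column.
    zcat file_hashes_sha1_20160411221100_public.tsv.gz | awk -F'\t' '{print $1}' | head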
23:24 <kanzure> ideally, all content on archive.org would be timestamped (using opentimestamps or w/e) at submission time. and then archive.org would store the timestamp proof itself.
23:24 <Somebody> kanzure: yes, that would be quite neat.
23:24 <nicolas17> there are a bunch of csv variants wrt how they handle escaping of the comma, or quoting
23:24 <kanzure> timestamp proof is only a few hundred bytes per item
23:24 <Kaz> but this is AT
23:25 <Kaz> not IA
23:25 <kanzure> i refuse to believe that IA would be so cold as to not hang out in here
23:25 <Somebody> kanzure: yes, that would be awesome. please send that idea into mek if you haven't already.
23:25 <kanzure> i have not
23:25 <kanzure> who is mek
23:25 <Somebody> kanzure: eh, there's overlap, but we try to maintain separation
23:25 <Kaz> kanzure: #archiveteam
23:25 <Kaz> uh
23:25 <Kaz> #internetarchive
23:25 <kanzure> this is an endless loop...
23:26 <kanzure> ah
23:26 <kanzure> mek does not seem to be there
23:26 <Kaz> oh, yeah I have no idea who/what mek is, but #internetarchive is the channel for your needs
23:27 <nicolas17> hm, I need to update the Mapillary page on the AT wiki
23:27 <Somebody> kanzure: mek is a staffer at Archive.org, and interested in various new ideas. https://michaelkarpeles.com/
23:27 <nicolas17> they have 104M photos by now :P
23:30 <kanzure> e2bc2f240490e91d52a1eaeb5636664f75795da9bfcbbe7692c07b90ae18244b file_hashes_sha1_20160411221100_private.tsv.gz
23:31 -- GE has quit IRC (Remote host closed the connection)
23:32 <nicolas17> how would you go about archiving 200TB of photos from an AWS S3 bucket in Europe? preferably without giving the owner high AWS bandwidth costs
23:33 <kanzure> amazon has snowball data container things
23:37 <ae_g_i_s> yeah, afaik that one's called glacier
23:37 <ae_g_i_s> though i dunno if glacier works in the 'restore direction', i do know that it works in the 'backup direction', i.e. they send you drives, you fill them up and send it to them
23:37 <kanzure> https://aws.amazon.com/blogs/aws/aws-importexport-snowball-transfer-1-petabyte-per-week-using-amazon-owned-storage-appliances/
23:39 <kanzure> oops that was last year
23:39 <kanzure> ah here we go,
23:39 <kanzure> https://aws.amazon.com/blogs/aws/aws-snowmobile-move-exabytes-of-data-to-the-cloud-in-weeks/
23:40 <ae_g_i_s> yeah, snowmobile is the new one, where they send you a powered truck
23:40 <kanzure> "In order to meet the needs of these customers, we are launching Snowmobile today. This secure data truck stores up to 100 PB of data and can help you to move exabytes to AWS in a matter of weeks (you can get more than one if necessary)."
23:40 <kanzure> raise the pirate flag, let's go raiding
23:42 <kanzure> from https://archive.org/download/archiveteam_census_2016 ,
23:42 <kanzure> 2840f4e64f4c2bf562e97294714371cfe7beb4122e73c3437d535175d93e53df 2016.10.23-ia_identifiers.txt.gz
23:49 <ae_g_i_s> fun fact: you can bake ~175 chicken at the same time with the power necessary for an amazon snowmobile
23:49 <Somebody> https://hash-archive.org/history/https://archive.org/download/ia_census_201604/file_hashes_sha1_20160411221100_private.tsv.gz
23:50 <Somebody> https://hash-archive.org/history/https://archive.org/download/archiveteam_census_2016/2016.10.23-ia_identifiers.txt.gz
23:51 <kanzure> ah.
23:51 <kanzure> yes i should have checked that first.
23:51 <Somebody> eh, the order doesn't matter -- but it is a nice way to get a 3rd-party check