Time |
Nickname |
Message |
00:04
🔗
|
|
h3ndr1k has quit IRC (Read error: Operation timed out) |
00:05
🔗
|
|
benjinsmi has joined #archiveteam-bs |
00:06
🔗
|
|
h3ndr1k has joined #archiveteam-bs |
01:05
🔗
|
|
Selavi has quit IRC (verb. to stop or discontinue) |
01:06
🔗
|
|
ivan- has joined #archiveteam-bs |
01:06
🔗
|
|
Fusl sets mode: +o ivan- |
01:08
🔗
|
|
ivan_ has quit IRC (Ping timeout: 265 seconds) |
01:08
🔗
|
|
Selavi has joined #archiveteam-bs |
01:45
🔗
|
|
m007a83 has joined #archiveteam-bs |
02:11
🔗
|
|
xarph has joined #archiveteam-bs |
02:21
🔗
|
|
BlueMax has joined #archiveteam-bs |
03:05
🔗
|
SketchCow |
Great |
03:32
🔗
|
|
Pixi_ has quit IRC (Quit: Pixi_) |
03:33
🔗
|
|
Pixi has joined #archiveteam-bs |
03:33
🔗
|
|
odemgi has joined #archiveteam-bs |
03:35
🔗
|
|
odemgi_ has quit IRC (Ping timeout: 252 seconds) |
03:35
🔗
|
|
odemg has quit IRC (Ping timeout: 265 seconds) |
03:38
🔗
|
|
Fusl4 has joined #archiveteam-bs |
03:43
🔗
|
|
Fusl3 has quit IRC (Read error: Operation timed out) |
03:48
🔗
|
|
odemg has joined #archiveteam-bs |
04:58
🔗
|
|
odemgi_ has joined #archiveteam-bs |
05:00
🔗
|
|
xarph has quit IRC (Quit: Connection closed for inactivity) |
05:02
🔗
|
|
benjins has joined #archiveteam-bs |
05:02
🔗
|
|
m007a83_ has joined #archiveteam-bs |
05:03
🔗
|
|
SmileyG has joined #archiveteam-bs |
05:03
🔗
|
|
deevious has quit IRC (Read error: Connection reset by peer) |
05:04
🔗
|
|
af10b3e5e has joined #archiveteam-bs |
05:06
🔗
|
|
deevious has joined #archiveteam-bs |
05:06
🔗
|
|
stapler11 has quit IRC (Read error: Connection reset by peer) |
05:06
🔗
|
|
Fionera_ has joined #archiveteam-bs |
05:09
🔗
|
|
stapler11 has joined #archiveteam-bs |
05:10
🔗
|
|
dxrt- has joined #archiveteam-bs |
05:10
🔗
|
|
Fusl sets mode: +o dxrt- |
05:11
🔗
|
|
m007a83_ has quit IRC (Quit: Fuck you Comcast) |
05:14
🔗
|
|
deevious has quit IRC (se.hub irc.underworld.no) |
05:14
🔗
|
|
odemgi has quit IRC (se.hub irc.underworld.no) |
05:14
🔗
|
|
m007a83 has quit IRC (se.hub irc.underworld.no) |
05:14
🔗
|
|
benjinsmi has quit IRC (se.hub irc.underworld.no) |
05:14
🔗
|
|
VerifiedJ has quit IRC (se.hub irc.underworld.no) |
05:14
🔗
|
|
JH88 has quit IRC (se.hub irc.underworld.no) |
05:14
🔗
|
|
d5f4a3622 has quit IRC (se.hub irc.underworld.no) |
05:14
🔗
|
|
dxrt has quit IRC (se.hub irc.underworld.no) |
05:14
🔗
|
|
Smiley has quit IRC (se.hub irc.underworld.no) |
05:14
🔗
|
|
kiska has quit IRC (se.hub irc.underworld.no) |
05:14
🔗
|
|
Flashfire has quit IRC (se.hub irc.underworld.no) |
05:14
🔗
|
|
Fionera has quit IRC (se.hub irc.underworld.no) |
05:14
🔗
|
|
coderobe has quit IRC (se.hub irc.underworld.no) |
05:14
🔗
|
|
PurpleSym has quit IRC (se.hub irc.underworld.no) |
05:14
🔗
|
|
Lord_Nigh has quit IRC (se.hub irc.underworld.no) |
05:14
🔗
|
|
jut has quit IRC (se.hub irc.underworld.no) |
05:14
🔗
|
|
i0npulse has quit IRC (se.hub irc.underworld.no) |
05:14
🔗
|
|
ranma has quit IRC (se.hub irc.underworld.no) |
05:17
🔗
|
|
LordNigh2 has joined #archiveteam-bs |
05:29
🔗
|
|
enick_187 has joined #archiveteam-bs |
05:29
🔗
|
|
LordNigh2 is now known as Lord_Nigh |
05:38
🔗
|
|
jut has joined #archiveteam-bs |
05:41
🔗
|
|
i0npulse has joined #archiveteam-bs |
05:43
🔗
|
|
Flashfire has joined #archiveteam-bs |
05:46
🔗
|
|
ivan- is now known as ivan_ |
05:51
🔗
|
|
killsushi has quit IRC (Quit: Leaving) |
05:52
🔗
|
|
Pixi has quit IRC (Ping timeout: 255 seconds) |
05:52
🔗
|
|
m007a83 has joined #archiveteam-bs |
05:58
🔗
|
|
Pixi has joined #archiveteam-bs |
06:13
🔗
|
|
DigiDigi has quit IRC (Quit: Leaving) |
06:13
🔗
|
|
tuluu has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
yano has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
bsmith093 has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
TC01 has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
BnAboyZ has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
sHATNER has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
MrRadar2 has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
brayden has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
joshua_ has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
Tenebrae has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
Xibalba has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
Lord_Nigh has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
stapler11 has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
benjins has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
odemgi_ has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
h3ndr1k has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
Atom-- has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
MillerBOS has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
Yurume has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
thejsa has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
omglolbah has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
Jon has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
pikami has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
sknebel has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
drcd has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
legoktm has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
Kenshin has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
chfoo has quit IRC (hub.efnet.us irc.efnet.nl) |
06:13
🔗
|
|
_niklas has quit IRC (hub.efnet.us irc.efnet.nl) |
06:15
🔗
|
|
Lord_Nigh has joined #archiveteam-bs |
06:15
🔗
|
|
stapler11 has joined #archiveteam-bs |
06:15
🔗
|
|
benjins has joined #archiveteam-bs |
06:15
🔗
|
|
odemgi_ has joined #archiveteam-bs |
06:15
🔗
|
|
h3ndr1k has joined #archiveteam-bs |
06:15
🔗
|
|
tuluu has joined #archiveteam-bs |
06:15
🔗
|
|
Atom-- has joined #archiveteam-bs |
06:15
🔗
|
|
yano has joined #archiveteam-bs |
06:15
🔗
|
|
bsmith093 has joined #archiveteam-bs |
06:15
🔗
|
|
MillerBOS has joined #archiveteam-bs |
06:15
🔗
|
|
TC01 has joined #archiveteam-bs |
06:15
🔗
|
|
BnAboyZ has joined #archiveteam-bs |
06:15
🔗
|
|
sHATNER has joined #archiveteam-bs |
06:15
🔗
|
|
brayden has joined #archiveteam-bs |
06:15
🔗
|
|
MrRadar2 has joined #archiveteam-bs |
06:15
🔗
|
|
joshua_ has joined #archiveteam-bs |
06:15
🔗
|
|
Tenebrae has joined #archiveteam-bs |
06:15
🔗
|
|
Yurume has joined #archiveteam-bs |
06:15
🔗
|
|
thejsa has joined #archiveteam-bs |
06:15
🔗
|
|
_niklas has joined #archiveteam-bs |
06:15
🔗
|
|
chfoo has joined #archiveteam-bs |
06:15
🔗
|
|
Kenshin has joined #archiveteam-bs |
06:15
🔗
|
|
legoktm has joined #archiveteam-bs |
06:15
🔗
|
|
drcd has joined #archiveteam-bs |
06:15
🔗
|
|
sknebel has joined #archiveteam-bs |
06:15
🔗
|
|
pikami has joined #archiveteam-bs |
06:15
🔗
|
|
Jon has joined #archiveteam-bs |
06:15
🔗
|
|
omglolbah has joined #archiveteam-bs |
06:15
🔗
|
|
Xibalba has joined #archiveteam-bs |
06:15
🔗
|
|
irc.efnet.nl sets mode: +o MrRadar2 |
06:17
🔗
|
|
DigiDigi has joined #archiveteam-bs |
06:20
🔗
|
|
yano_ has joined #archiveteam-bs |
06:22
🔗
|
|
yano has quit IRC (Ping timeout: 268 seconds) |
06:22
🔗
|
|
brayden has quit IRC (Read error: Operation timed out) |
06:23
🔗
|
|
BnAboyZ has quit IRC (Read error: Operation timed out) |
06:23
🔗
|
|
tuluu has quit IRC (Ping timeout: 268 seconds) |
06:23
🔗
|
|
TC01 has quit IRC (Ping timeout: 268 seconds) |
06:23
🔗
|
|
sHATNER has quit IRC (Read error: Operation timed out) |
06:23
🔗
|
|
bsmith093 has quit IRC (Ping timeout: 268 seconds) |
06:23
🔗
|
|
TC01 has joined #archiveteam-bs |
06:24
🔗
|
|
bsmith093 has joined #archiveteam-bs |
06:26
🔗
|
|
tuluu has joined #archiveteam-bs |
06:31
🔗
|
|
joshua_ has quit IRC (Ping timeout: 268 seconds) |
06:31
🔗
|
|
Xibalba has quit IRC (Ping timeout: 268 seconds) |
06:31
🔗
|
|
MrRadar2 has quit IRC (Ping timeout: 268 seconds) |
06:31
🔗
|
|
joshua_ has joined #archiveteam-bs |
06:31
🔗
|
|
sHATNER has joined #archiveteam-bs |
06:32
🔗
|
|
BnAboyZ has joined #archiveteam-bs |
06:33
🔗
|
|
brayden has joined #archiveteam-bs |
06:33
🔗
|
|
Xibalba has joined #archiveteam-bs |
06:36
🔗
|
|
kiska has joined #archiveteam-bs |
06:36
🔗
|
|
Fusl sets mode: +o kiska |
06:36
🔗
|
|
MrRadar2 has joined #archiveteam-bs |
06:37
🔗
|
|
svchfoo1 sets mode: +o MrRadar2 |
06:37
🔗
|
|
svchfoo3 sets mode: +o MrRadar2 |
06:43
🔗
|
|
DFJustin has quit IRC (Remote host closed the connection) |
06:47
🔗
|
|
DFJustin has joined #archiveteam-bs |
07:18
🔗
|
|
schbirid has quit IRC (Remote host closed the connection) |
07:29
🔗
|
|
dxrt- is now known as dxrt |
09:01
🔗
|
|
Stilettoo has joined #archiveteam-bs |
09:09
🔗
|
|
Stiletto has quit IRC (Ping timeout: 615 seconds) |
09:13
🔗
|
|
coderobe has joined #archiveteam-bs |
09:23
🔗
|
|
atomicthu has joined #archiveteam-bs |
09:34
🔗
|
|
h3ndr1k has quit IRC (Read error: Connection reset by peer) |
09:43
🔗
|
|
h3ndr1k has joined #archiveteam-bs |
10:17
🔗
|
|
stapler11 has quit IRC (Read error: Connection reset by peer) |
10:32
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
11:37
🔗
|
|
VerifiedJ has joined #archiveteam-bs |
11:40
🔗
|
JAA |
The NRATV playlist download finished about 10 hours ago. Now the fun part begins, parsing all that shit and extracting the video segments we want. |
11:43
🔗
|
JAA |
The wpull DB for that crawl is 16.5 GB, by the way. 46 million segments or so. |
11:46
🔗
|
JAA |
Six playlists failed to download with a 403: |
11:46
🔗
|
JAA |
http://d3ldmcwnurseh3.cloudfront.net/assets/nr_140815_news_cc_sp_AmandaCollins/nr_140815_news_cc_sp_AmandaCollins.m3u8 |
11:46
🔗
|
JAA |
http://d3ldmcwnurseh3.cloudfront.net/assets/nr_140815_news_cc_sp_BrittanyBoddington/nr_140815_news_cc_sp_BrittanyBoddington.m3u8 |
11:46
🔗
|
JAA |
http://d3ldmcwnurseh3.cloudfront.net/assets/nr_140815_news_cc_sp_JeremyGreene/nr_140815_news_cc_sp_JeremyGreene.m3u8 |
11:46
🔗
|
JAA |
http://d3ldmcwnurseh3.cloudfront.net/assets/nr-181107-americanheroes-s01e01-daveeubank-social2-18-nr-382-rv1_1080ph-00005/nr-181107-americanheroes-s01e01-daveeubank-social2-18-nr-382-rv1_1080ph-00005.m3u8 |
11:46
🔗
|
JAA |
http://d3ldmcwnurseh3.cloudfront.net/assets/nra_fs_1%E2%80%8B41030_no%E2%80%8Bir_ep17_%E2%80%8Bandrew_k%E2%80%8Bline_bon%E2%80%8Bus_rv04/nra_fs_1%E2%80%8B41030_no%E2%80%8Bir_ep17_%E2%80%8Bandrew_k%E2%80%8Bline_bon%E2%80%8Bus_rv04.m3u8 |
11:46
🔗
|
JAA |
http://d3ldmcwnurseh3.cloudfront.net/assets/nra_fs_1%E2%80%8B41121_no%E2%80%8Bir_ep19_%E2%80%8Brob_pinc%E2%80%8Bus_bonus%E2%80%8B_rv04/nra_fs_1%E2%80%8B41121_no%E2%80%8Bir_ep19_%E2%80%8Brob_pinc%E2%80%8Bus_bonus%E2%80%8B_rv04.m3u8 |
11:47
🔗
|
JAA |
Just in case someone wants to look into whether these can be grabbed somewhere else. |
11:53
🔗
|
|
Sokar has quit IRC (Ping timeout: 615 seconds) |
12:26
🔗
|
|
Sokar has joined #archiveteam-bs |
12:30
🔗
|
|
HashbangI has quit IRC (Read error: Connection reset by peer) |
12:32
🔗
|
|
joshua_ has quit IRC (Read error: Operation timed out) |
12:34
🔗
|
|
joshua_ has joined #archiveteam-bs |
12:54
🔗
|
JAA |
Soo, after a bunch of grep and awk, it looks like there are about 6 million video segments to download. |
12:55
🔗
|
JAA |
Will try to get a size estimate now. |
13:04
🔗
|
|
yano_ is now known as yano |
13:08
🔗
|
JAA |
Ok, this is going to be big. |
13:09
🔗
|
JAA |
I took a 1 ‰ random sample (6044 URLs) and ran HEAD requests against those: 13365751712 bytes. So it'll be around 13.4 TB. |
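The estimate in the message above is plain scaling: if a uniform random sample of the URLs adds up to S bytes, the full set is roughly S times (population / sample size). A minimal Python sketch of that arithmetic follows. The function name is made up for illustration, and the population of 6,044,000 segments is an assumption derived from "6044 URLs at 1 ‰"; the result is only as good as the sample is representative.

```python
def extrapolate_total_bytes(sample_bytes: int, sample_size: int, population_size: int) -> float:
    """Scale the byte total of a uniform random sample up to the whole population."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return sample_bytes * population_size / sample_size


# Figures quoted in the chat: 6044 sampled URLs totalling 13365751712 bytes.
# The population size is an assumption (sample size divided by 1 per mille).
estimate = extrapolate_total_bytes(13365751712, 6044, 6_044_000)
print(f"{estimate / 1e12:.1f} TB")
```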
13:13
🔗
|
JAA |
Who wants to grab that? I don't have anywhere to put that much data at the moment. |
13:14
🔗
|
JAA |
(Remember that this will also need to be muxed together afterwards if we want it to be accessible in IA items.) |
13:17
🔗
|
Fusl |
abox hel1, megawarc it and then dump into IA? |
13:17
🔗
|
Fusl |
oh |
13:17
🔗
|
Fusl |
muxed together |
13:17
🔗
|
Fusl |
thats what you mean |
13:17
🔗
|
Fusl |
ffmpeg or something |
13:19
🔗
|
Fusl |
yeah, abox hel1 can do that, has 12 cores and plenty of memory for caching |
13:20
🔗
|
Fusl |
NAME USED AVAIL REFER MOUNTPOINT |
13:20
🔗
|
Fusl |
zpool-docker 35.4T 31.4T 33.1T /var/lib/docker |
13:20
🔗
|
JAA |
My plan was to grab everything as WARCs and then deal with the muxing later. |
13:20
🔗
|
JAA |
But at this size, that'll be a pain. |
13:20
🔗
|
JAA |
So not sure what the best solution is. |
13:20
🔗
|
Fusl |
i wouldnt know how to reliably dump warcs into files so ehh |
13:21
🔗
|
JAA |
Yeah |
13:21
🔗
|
JAA |
We could use wpull without --delete-after so that the plain files stay on disk. Means double the size obviously. |
13:22
🔗
|
JAA |
But we'd have to extract everything anyway, so... |
13:22
🔗
|
JAA |
I also have no idea what the best method for the actual muxing would be. |
13:23
🔗
|
Fusl |
can you give me an example m3u8 url? |
13:23
🔗
|
JAA |
http://d3ldmcwnurseh3.cloudfront.net/assets/-nr-180507-news-cc-lucretiahughes-social-1-oc/-nr-180507-news-cc-lucretiahughes-social-1-oc_1080ph.m3u8 |
13:23
🔗
|
Fusl |
they are probably concatenatable mpeg transport stream files |
13:24
🔗
|
Fusl |
yeah, they are plain mpeg ts cat files |
13:24
🔗
|
JAA |
Ok, that sounds good. |
13:24
🔗
|
Fusl |
so doing `cat $file1 $file2 ... $fileN` will do the trick |
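The `cat` approach above works because MPEG transport stream segments can be joined by plain byte concatenation. A rough Python equivalent is sketched below, assuming the caller already has the segment paths in playback order; `concat_segments` is a hypothetical helper, not part of any tool mentioned in the channel.

```python
import shutil
from pathlib import Path
from typing import Iterable


def concat_segments(segments: Iterable[Path], output: Path) -> None:
    """Append each segment's bytes to `output`, in the order given."""
    with output.open("wb") as out:
        for segment in segments:
            with segment.open("rb") as src:
                # Stream in chunks so large segments never sit fully in memory.
                shutil.copyfileobj(src, out)


# Example call (paths are placeholders):
# concat_segments([Path("seg-00000.ts"), Path("seg-00001.ts")], Path("video.ts"))
```

Like `cat`, this does no remuxing or validation; it only preserves the order it is handed.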
13:25
🔗
|
Fusl |
can probably just warc them |
13:25
🔗
|
Fusl |
and then have pywb with nginx playback the files for you |
13:26
🔗
|
JAA |
Oh sure, might even work in the WBM. My idea was just to have them as IA items with straight video files. |
13:26
🔗
|
JAA |
With metadata and everything for searchability. |
13:27
🔗
|
JAA |
Anyway, the important part right now is to grab everything. |
13:27
🔗
|
Fusl |
i rather meant, grab them into warcs for now |
13:27
🔗
|
JAA |
Yeah |
13:27
🔗
|
Fusl |
and then we can go over the WARCs, dump the files and concat them |
13:27
🔗
|
JAA |
Yup |
13:28
🔗
|
JAA |
warcat should be able to dump the files, but I've had issues with that before. |
13:28
🔗
|
Fusl |
can you do a few of them and then rsync the warcs over to rsync://archivebox-hel1.meo.ws/thomas-the-tank-engine/ ? |
13:28
🔗
|
JAA |
As a test you mean? |
13:28
🔗
|
Fusl |
yeah |
13:28
🔗
|
JAA |
Sure |
13:29
🔗
|
Fusl |
wanna see how reliably i can dump the WARCs into files and cat the .ts files together |
13:32
🔗
|
JAA |
Grabbing 3 videos with a single wpull right now. |
13:32
🔗
|
JAA |
So the segments should be sequential in this file, but don't rely on that since we'll have to parallelise this download obviously. |
13:33
🔗
|
Fusl |
yeah |
13:33
🔗
|
Fusl |
im probably gonna do some pywb -> nginx -> ffmpeg -c copy -> single .ts magic |
13:34
🔗
|
JAA |
That sounds really complicated. |
13:34
🔗
|
JAA |
And pywb is going to make a full copy of the entire dataset. |
13:34
🔗
|
JAA |
I don't think anarcat's PR got merged that would support symlinks etc. |
13:35
🔗
|
JAA |
https://github.com/webrecorder/pywb/pull/409 |
13:35
🔗
|
Fusl |
maybe just warcat into files and then do some grep magic, i dunno, ill figure out something :P |
13:35
🔗
|
JAA |
https://github.com/chfoo/warcat |
13:36
🔗
|
JAA |
I remember having issues with it before, but I don't remember what they were. |
13:36
🔗
|
JAA |
I did write a small little tool myself, but that's not suitable for this. |
13:36
🔗
|
Fusl |
i mean if we wanted to, we could also just do a wpull on the .m3u8 files and ffmpeg on the .m3u8 files as well, that way we wont have to extract the files from the warcs but it will cause double bandwidth usage for them |
13:36
🔗
|
JAA |
In any case, would be nice to fix warcat if it isn't working since it's quite useful also for other things. |
13:37
🔗
|
JAA |
Yeah |
13:37
🔗
|
JAA |
I already have all the M3U8 files. |
13:37
🔗
|
JAA |
Except not as files but in a WARC, but yeah. |
13:41
🔗
|
JAA |
(This is my little tool, by the way: https://github.com/JustAnotherArchivist/little-things/blob/master/warc-tiny If warcat has issues and we don't want to dive into it, it could probably be converted into something that works for this.) |
13:42
🔗
|
JAA |
(There's at least one bug in it that makes it unusable, and it's also quite slow.) |
13:47
🔗
|
* |
anarcat sad cat |
13:49
🔗
|
JAA |
Fusl: nratv-3vid-sample.warc.gz is now on hel1. |
13:53
🔗
|
|
VerifiedJ has quit IRC (Read error: Connection reset by peer) |
13:54
🔗
|
|
VerifiedJ has joined #archiveteam-bs |
13:58
🔗
|
Fusl |
JAA: will each .warc contain exactly one .m3u8 and multiple .ts for that .m3u8? |
13:58
🔗
|
Fusl |
actually, looks like the actual .m3u8 is missing in that warc |
13:59
🔗
|
Fusl |
thats what i need for correctly concatenating them together tho. any way you can get that into the warc as well? |
14:00
🔗
|
JAA |
Depends on how we split those 6M .ts files up. I guess we could do one WARC per video. That'd be ~24k WARCs. |
14:00
🔗
|
JAA |
And yeah, can throw in the .m3u8 if necessary. I have them in a separate WARC containing all of them at the moment. |
14:01
🔗
|
JAA |
Curious though, why do you need it? |
14:02
🔗
|
JAA |
It's just *00000.ts, *00001.ts etc. Haven't seen anything other than that, or anything non-sequential. |
14:02
🔗
|
Fusl |
do we wanna rely on the name being sequential? |
14:03
🔗
|
JAA |
Yeah, probably better safe than sorry. |
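Taking the segment order from the playlist rather than from the file names could look roughly like the sketch below. In an HLS media playlist, lines starting with `#` are tags or comments and every other non-blank line is a segment URI, listed in playback order. The function name and the example URLs in the usage note are hypothetical.

```python
from urllib.parse import urljoin


def segment_urls(playlist_text: str, playlist_url: str) -> list[str]:
    """Return the segment URLs of an HLS media playlist, in playback order."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines, tags and comments carry no segment URI
        # Relative URIs are resolved against the playlist's own URL.
        urls.append(urljoin(playlist_url, line))
    return urls
```

For a playlist at `http://example.invalid/assets/v/v.m3u8` that lists `v-00001.ts` before `v-00000.ts`, the function returns the two absolute URLs in that listed order, so the concatenation step never has to trust the numbering.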
14:03
🔗
|
Fusl |
warcat extracts the data just fine btw and ill probably write a docker container so that we can spread the load across many small hcloud instances |
14:05
🔗
|
JAA |
Sweet. Looking at https://github.com/chfoo/warcat/issues/19 I think I mainly had issues with broken HTTP servers. CloudFront seems to be doing this right. |
14:06
🔗
|
|
PhrackD has quit IRC (Read error: Operation timed out) |
14:06
🔗
|
JAA |
Ok, so one wpull process per video which downloads the .m3u8 and all .ts? |
14:06
🔗
|
Fusl |
lol \nnn |
14:07
🔗
|
JAA |
nnCoection Yup, someone fucked up there. |
14:07
🔗
|
Fusl |
(?:\r\n|\n|\r) is usually how i write my code, that'll eat windows, linux and macos line endings |
14:08
🔗
|
Fusl |
by now i think macos also uses \r\n though? |
14:09
🔗
|
JAA |
I know it isn't \r anymore, but I don't know what it changed to. |
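On the open question: Mac OS X and later use `\n`, the same as Linux; only classic Mac OS used a bare `\r`. A tolerant line splitter along the lines Fusl describes can be sketched in Python as below (the names are illustrative). The alternation order matters: `\r\n` has to be tried first, otherwise a Windows line ending would count as two breaks.

```python
import re

# Try "\r\n" before the single-character forms so it matches as one break.
LINE_BREAK = re.compile(r"\r\n|\n|\r")


def split_lines(text: str) -> list[str]:
    """Split text on Windows, Unix, or classic Mac OS line endings."""
    return LINE_BREAK.split(text)


print(split_lines("a\r\nb\nc\rd"))
```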
14:09
🔗
|
JAA |
So if we do this "one wpull per video", we might as well just drop --delete-after and have the plain files directly. |
14:10
🔗
|
Fusl |
whats --delete-after? |
14:10
🔗
|
|
PhrackD has joined #archiveteam-bs |
14:10
🔗
|
Fusl |
also |
14:10
🔗
|
Fusl |
dont need one wpull per video |
14:10
🔗
|
Fusl |
was just my expectation |
14:10
🔗
|
JAA |
That removes the downloaded file. If you don't use it, you end up with two copies, one in the WARC and one as a plain file. |
14:10
🔗
|
Fusl |
i can work with multiple videos in a single warc |
14:11
🔗
|
Fusl |
just make sure that all the videos for a single m3u8 are always in a single warc and not split across multiple warcs |
14:12
🔗
|
JAA |
Yeah, grouping multiple videos together is easy. |
14:13
🔗
|
JAA |
On the other hand, there's 250 segments per video on average, and the downloads aren't incredibly fast. |
14:13
🔗
|
JAA |
(Individual downloads, I mean.) |
14:14
🔗
|
JAA |
If a download fails, that would also limit potential damage to a single video. |
14:46
🔗
|
|
katocala has quit IRC () |
14:57
🔗
|
|
rellem has joined #archiveteam-bs |
14:57
🔗
|
|
katocala has joined #archiveteam-bs |
15:23
🔗
|
JAA |
So I have one file per video now, listing the .m3u8 and all .ts. |
15:30
🔗
|
JAA |
Nevermind, fucked something up there. |
15:57
🔗
|
|
Mayonaise has quit IRC (Read error: Operation timed out) |
16:01
🔗
|
JAA |
Ok, fixed that. The ts files are no longer in order, but I can't be bothered to fix that. |
16:05
🔗
|
JAA |
Fusl: So, want to run this? |
16:05
🔗
|
JAA |
22188 videos |
16:05
🔗
|
Fusl |
the download or the concat? or both? |
16:05
🔗
|
|
Mayonaise has joined #archiveteam-bs |
16:05
🔗
|
JAA |
Both |
16:06
🔗
|
Fusl |
sure |
16:07
🔗
|
JAA |
1 WARC per video, then we can still pack them into megawarcs if necessary for upload, but it seems that the size of one video should be roughly 600 MB, so that probably makes sense. |
16:26
🔗
|
JAA |
Fusl: https://transfer.notkiska.pw/skqvS/nratv-videos.tar.gz |
16:26
🔗
|
JAA |
wpull --input-file X --warc-file X --user-agent 'ArchiveTeam' --no-verbose |
16:28
🔗
|
JAA |
And then you can run the muxing directly. |
16:28
🔗
|
JAA |
I think wpull puts the files in a directory structure like ./domain/path/.../file. There's probably an option to change that though. |
16:30
🔗
|
Fusl |
JAA: is that one video per file now? |
16:30
🔗
|
JAA |
Yep |
16:53
🔗
|
JAA |
Playlist grab: https://archive.org/details/nratv_video_playlists_201906 |
16:59
🔗
|
|
icedice has joined #archiveteam-bs |
17:02
🔗
|
Fusl |
wpull --input-file "${item_file}" --warc-file "${item_id}.warc.gz" --user-agent 'ArchiveTeam' --no-verbose || exit 1 |
17:02
🔗
|
Fusl |
JAA: is it .warc.gz or .warc? |
17:08
🔗
|
JAA |
Fusl: wpull appends it, so --warc-file is without an extension. |
17:10
🔗
|
Fusl |
k |
17:11
🔗
|
JAA |
Would make more sense to have --warc-file-prefix for when you use --warc-max-size and --warc-file as a full filename when it's a single file, but "it's how wget does it", so wpull can't change that. :-/ |
17:45
🔗
|
Fusl |
JAA: are 500-599 errors retried? |
17:47
🔗
|
Fusl |
Kaz: http://xor.meo.ws/a12adf5f/30c9/4517/b759/97164e3b4c80.png |
17:48
🔗
|
Kaz |
I refuse to accept you can find a file in there easily |
17:48
🔗
|
Fusl |
i cant |
17:49
🔗
|
Fusl |
i have scripts for nano and ssh |
17:49
🔗
|
Fusl |
if i type in `nano $filename`, it looks up an open file handle, if there is one open already, it searches the tmux window, switches to it and voila |
17:50
🔗
|
Kaz |
Hah, fair enough |
17:50
🔗
|
Fusl |
if i type in `ssh <hostname>` it looks up an open ssh master file handle, traces that to the master ssh process and an sqlite3 database to switch to that correct window as well |
18:02
🔗
|
|
icedice2 has joined #archiveteam-bs |
18:07
🔗
|
|
icedice has quit IRC (Read error: Operation timed out) |
18:10
🔗
|
Fusl |
JAA: http://archivebox-hel1.meo.ws/nratv/files/ |
18:13
🔗
|
JAA |
Fusl: Yes. Specifically, wpull considers 200, 204, and 304 successful, and 401, 403, 404, 405, and 410 permanent errors. Any other status code is retried. |
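The rule JAA describes can be written down as a small lookup. This Python sketch only mirrors the description in the message above; it is not wpull's actual code, and the names are invented for illustration.

```python
SUCCESS = frozenset({200, 204, 304})
PERMANENT_ERROR = frozenset({401, 403, 404, 405, 410})


def classify_status(status: int) -> str:
    """Classify an HTTP status code the way the chat describes wpull's behaviour."""
    if status in SUCCESS:
        return "success"
    if status in PERMANENT_ERROR:
        return "permanent_error"
    # Everything else, including all 5xx responses, is retried.
    return "retry"


for code in (200, 403, 503):
    print(code, classify_status(code))
```

Under this rule the six playlists that returned 403 would count as permanent errors rather than being retried.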
18:14
🔗
|
JAA |
Nice :-) |
18:18
🔗
|
|
MaximeLeG has joined #archiveteam-bs |
18:21
🔗
|
|
RichardG has quit IRC (Ping timeout: 255 seconds) |
18:21
🔗
|
Fusl |
i really need to get 10gbit on this hetzner server ready |
18:22
🔗
|
Fusl |
its already at 1gbit :$ http://xor.meo.ws/c471c0f6/99b0/4039/a2d1/09421f959c45.png |
18:22
🔗
|
JAA |
Hetzner has 10 Gb/s servers? |
18:22
🔗
|
anarcat |
Fusl: grafana? |
18:22
🔗
|
Fusl |
grafana. |
18:22
🔗
|
Fusl |
JAA: they offer 10gbit upgrades on some of their servers but that means the traffic is limited to 20tb |
18:23
🔗
|
Fusl |
i may have a solution around that by using hetzner cloud for outbound connections :P |
18:23
🔗
|
JAA |
Ah |
18:23
🔗
|
Fusl |
because internal hetzner is not counted towards the traffic limit |
18:23
🔗
|
JAA |
Right |
18:25
🔗
|
|
RichardG has joined #archiveteam-bs |
18:28
🔗
|
Fusl |
300+300 MB per item, 22188 items total = 6.6tb*2 total ? |
18:30
🔗
|
JAA |
Seems a bit small, but maybe my estimate wasn't a representative sample. |
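The two figures can be put side by side with a little arithmetic. The Python sketch below uses only numbers quoted in the channel (13365751712 bytes for a 1 ‰ sample, 22188 videos, about 300 MB per WARC); the variable names are illustrative, and decimal units (1 MB = 10^6 bytes) are assumed.

```python
VIDEOS = 22188
SAMPLE_BYTES = 13365751712   # byte total of the 1 per mille sample
SAMPLE_FRACTION_INV = 1000   # 1 per mille, so scale up by 1000

# Per-video size implied by the sampling estimate.
estimated_total = SAMPLE_BYTES * SAMPLE_FRACTION_INV
estimated_per_video_mb = estimated_total / VIDEOS / 1e6

# Total implied by the roughly 300 MB per WARC seen on the first items.
observed_total_tb = 300e6 * VIDEOS / 1e12

print(f"sampling estimate: {estimated_per_video_mb:.0f} MB per video")
print(f"observed so far:   {observed_total_tb:.2f} TB for the WARCs alone")
```

So the sample implies about twice the per-video size seen on the first finished items, which is the gap JAA is reacting to.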
18:32
🔗
|
Fusl |
have the sizes for the samples you pulled? |
18:33
🔗
|
JAA |
Nope, only the total. |
18:35
🔗
|
JAA |
This is what I did: shuf -n 6044 <best-videos-ts >best-videos-ts-sample + parallel -X -j 16 curl -sI {} <best-videos-ts-sample | grep -i '^content-length: ' | awk '{sum+=$2} END {printf "%.0f\n",sum}' |
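The `grep | awk` half of that pipeline sums every `Content-Length` header in the HEAD responses. A Python sketch of the same step, working on the raw header text that `curl -sI` prints, is shown below; `total_content_length` is a hypothetical name. Like the pipeline, it does not check status codes, so an error response's `Content-Length` would be counted as well.

```python
def total_content_length(raw_headers: str) -> int:
    """Sum all Content-Length values found in concatenated HTTP response headers."""
    total = 0
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        # Header names are case-insensitive; status lines have no usable colon.
        if sep and name.strip().lower() == "content-length":
            total += int(value.strip())
    return total


sample = (
    "HTTP/1.1 200 OK\r\nContent-Type: video/MP2T\r\nContent-Length: 100\r\n\r\n"
    "HTTP/1.1 200 OK\r\ncontent-length: 250\r\n\r\n"
)
print(total_content_length(sample))
```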
18:35
🔗
|
SketchCow |
Just wanted to mention that a bug in the upload script made it so archivejobs that start with - would cause that bulk of items to not upload. |
18:35
🔗
|
SketchCow |
So there were two sets from may that were just sitting around being ignored |
18:35
🔗
|
SketchCow |
Safe, but not put into the archive. |
18:35
🔗
|
JAA |
Oh, heh. |
18:35
🔗
|
JAA |
Nice find |
18:36
🔗
|
SketchCow |
Well, I was wondering why they weren't getting uploaded, also why that script was barfing errors |
18:36
🔗
|
JAA |
ArchiveBot I guess? |
18:36
🔗
|
SketchCow |
No, my uploader |
18:36
🔗
|
JAA |
Yeah, but it's AB archives? |
18:36
🔗
|
SketchCow |
I mean the name of the warc was -inf plus stuff |
18:36
🔗
|
SketchCow |
Yes, archivebot |
18:36
🔗
|
JAA |
Huh, that seems odd. |
18:36
🔗
|
JAA |
Do you have the full filename? |
18:37
🔗
|
SketchCow |
-inf-20190507-055626-3foh0-00000.warc.gz was one |
18:38
🔗
|
JAA |
Oh, heh, I actually noticed that one at the time. |
18:38
🔗
|
JAA |
Yeah, that was a broken job. |
18:39
🔗
|
JAA |
"!archive https:/twitter.com/..." missing second slash before the domain. |
18:39
🔗
|
SketchCow |
# Special Use Case: You can feed this script a pile of arguments and it's like running it |
18:39
🔗
|
SketchCow |
# over and over. It'll give you a LOT of output (for now, unless we do major surgery) |
18:39
🔗
|
SketchCow |
# but it will do the right thing. Otherwise, use as the docs say. |
18:39
🔗
|
SketchCow |
Well, it's righting itself |
18:39
🔗
|
SketchCow |
for ARGUMENT in $@ |
18:39
🔗
|
SketchCow |
Ha, past |
18:39
🔗
|
SketchCow |
a |
18:39
🔗
|
SketchCow |
do |
18:40
🔗
|
SketchCow |
That reminds me to make the stupid announcement in the main channel |
18:40
🔗
|
JAA |
-- and "$@" ftw. |
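JAA's suggestion covers two separate problems: an unquoted `$@` is word-split by the shell, and a file name that starts with `-` (such as the `-inf-...` WARC above) can be mistaken for an option. When the command is built from Python instead of a shell loop, the usual counterparts are to pass arguments as a list, so no shell splitting happens, and to put `--` before the file names. The sketch below assumes the called program follows the POSIX convention that `--` ends option parsing; `upload-tool` in the usage note is a placeholder, not SketchCow's actual script.

```python
import subprocess
from typing import Sequence


def build_command(program: str, files: Sequence[str]) -> list[str]:
    """Build an argv list in which no file name can be parsed as an option."""
    # "--" marks the end of options for programs that follow the POSIX
    # utility conventions; not every program honours it.
    return [program, "--", *files]


def run_on_files(program: str, files: Sequence[str]) -> subprocess.CompletedProcess:
    """Run the program on the files without going through a shell."""
    # A list argv means names with spaces or leading dashes arrive unchanged.
    return subprocess.run(build_command(program, files), check=True)
```

For example, `build_command("upload-tool", ["-inf-20190507-055626-3foh0-00000.warc.gz"])` yields an argv where the WARC name follows `--` and is treated as an operand.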
18:41
🔗
|
SketchCow |
Anyway, this delayed adding kiwifarms to the wayback for 90 days, boo hoo |
18:56
🔗
|
|
MaximeLeG has quit IRC (Ping timeout: 260 seconds) |
19:03
🔗
|
|
icedice has joined #archiveteam-bs |
19:05
🔗
|
|
icedice2 has quit IRC (Ping timeout: 252 seconds) |
19:06
🔗
|
|
MaximeLeG has joined #archiveteam-bs |
19:07
🔗
|
|
MaximeLeG has quit IRC (Client Quit) |
19:08
🔗
|
|
MaximeLeG has joined #archiveteam-bs |
19:09
🔗
|
|
MaximeLeG has quit IRC (Client Quit) |
20:02
🔗
|
|
icedice2 has joined #archiveteam-bs |
20:08
🔗
|
|
icedice has quit IRC (Read error: Operation timed out) |
20:08
🔗
|
|
AnthonyI has joined #archiveteam-bs |
20:20
🔗
|
|
ranma has joined #archiveteam-bs |
21:07
🔗
|
|
t3 has joined #archiveteam-bs |
21:09
🔗
|
|
godane has joined #archiveteam-bs |
22:38
🔗
|
|
znak has joined #archiveteam-bs |
22:40
🔗
|
|
znak has left |
22:46
🔗
|
|
killsushi has joined #archiveteam-bs |
22:48
🔗
|
|
Dallas has joined #archiveteam-bs |
22:52
🔗
|
|
BlueMax has joined #archiveteam-bs |