#archiveteam-bs 2019-06-28,Fri


Time Nickname Message
00:04 🔗 h3ndr1k has quit IRC (Read error: Operation timed out)
00:05 🔗 benjinsmi has joined #archiveteam-bs
00:06 🔗 h3ndr1k has joined #archiveteam-bs
01:05 🔗 Selavi has quit IRC (verb. to stop or discontinue)
01:06 🔗 ivan- has joined #archiveteam-bs
01:06 🔗 Fusl sets mode: +o ivan-
01:08 🔗 ivan_ has quit IRC (Ping timeout: 265 seconds)
01:08 🔗 Selavi has joined #archiveteam-bs
01:45 🔗 m007a83 has joined #archiveteam-bs
02:11 🔗 xarph has joined #archiveteam-bs
02:21 🔗 BlueMax has joined #archiveteam-bs
03:05 🔗 SketchCow Great
03:32 🔗 Pixi_ has quit IRC (Quit: Pixi_)
03:33 🔗 Pixi has joined #archiveteam-bs
03:33 🔗 odemgi has joined #archiveteam-bs
03:35 🔗 odemgi_ has quit IRC (Ping timeout: 252 seconds)
03:35 🔗 odemg has quit IRC (Ping timeout: 265 seconds)
03:38 🔗 Fusl4 has joined #archiveteam-bs
03:43 🔗 Fusl3 has quit IRC (Read error: Operation timed out)
03:48 🔗 odemg has joined #archiveteam-bs
04:58 🔗 odemgi_ has joined #archiveteam-bs
05:00 🔗 xarph has quit IRC (Quit: Connection closed for inactivity)
05:02 🔗 benjins has joined #archiveteam-bs
05:02 🔗 m007a83_ has joined #archiveteam-bs
05:03 🔗 SmileyG has joined #archiveteam-bs
05:03 🔗 deevious has quit IRC (Read error: Connection reset by peer)
05:04 🔗 af10b3e5e has joined #archiveteam-bs
05:06 🔗 deevious has joined #archiveteam-bs
05:06 🔗 stapler11 has quit IRC (Read error: Connection reset by peer)
05:06 🔗 Fionera_ has joined #archiveteam-bs
05:09 🔗 stapler11 has joined #archiveteam-bs
05:10 🔗 dxrt- has joined #archiveteam-bs
05:10 🔗 Fusl sets mode: +o dxrt-
05:11 🔗 m007a83_ has quit IRC (Quit: Fuck you Comcast)
05:14 🔗 deevious has quit IRC (se.hub irc.underworld.no)
05:14 🔗 odemgi has quit IRC (se.hub irc.underworld.no)
05:14 🔗 m007a83 has quit IRC (se.hub irc.underworld.no)
05:14 🔗 benjinsmi has quit IRC (se.hub irc.underworld.no)
05:14 🔗 VerifiedJ has quit IRC (se.hub irc.underworld.no)
05:14 🔗 JH88 has quit IRC (se.hub irc.underworld.no)
05:14 🔗 d5f4a3622 has quit IRC (se.hub irc.underworld.no)
05:14 🔗 dxrt has quit IRC (se.hub irc.underworld.no)
05:14 🔗 Smiley has quit IRC (se.hub irc.underworld.no)
05:14 🔗 kiska has quit IRC (se.hub irc.underworld.no)
05:14 🔗 Flashfire has quit IRC (se.hub irc.underworld.no)
05:14 🔗 Fionera has quit IRC (se.hub irc.underworld.no)
05:14 🔗 coderobe has quit IRC (se.hub irc.underworld.no)
05:14 🔗 PurpleSym has quit IRC (se.hub irc.underworld.no)
05:14 🔗 Lord_Nigh has quit IRC (se.hub irc.underworld.no)
05:14 🔗 jut has quit IRC (se.hub irc.underworld.no)
05:14 🔗 i0npulse has quit IRC (se.hub irc.underworld.no)
05:14 🔗 ranma has quit IRC (se.hub irc.underworld.no)
05:17 🔗 LordNigh2 has joined #archiveteam-bs
05:29 🔗 enick_187 has joined #archiveteam-bs
05:29 🔗 LordNigh2 is now known as Lord_Nigh
05:38 🔗 jut has joined #archiveteam-bs
05:41 🔗 i0npulse has joined #archiveteam-bs
05:43 🔗 Flashfire has joined #archiveteam-bs
05:46 🔗 ivan- is now known as ivan_
05:51 🔗 killsushi has quit IRC (Quit: Leaving)
05:52 🔗 Pixi has quit IRC (Ping timeout: 255 seconds)
05:52 🔗 m007a83 has joined #archiveteam-bs
05:58 🔗 Pixi has joined #archiveteam-bs
06:13 🔗 DigiDigi has quit IRC (Quit: Leaving)
06:13 🔗 tuluu has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 yano has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 bsmith093 has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 TC01 has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 BnAboyZ has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 sHATNER has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 MrRadar2 has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 brayden has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 joshua_ has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 Tenebrae has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 Xibalba has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 Lord_Nigh has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 stapler11 has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 benjins has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 odemgi_ has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 h3ndr1k has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 Atom-- has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 MillerBOS has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 Yurume has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 thejsa has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 omglolbah has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 Jon has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 pikami has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 sknebel has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 drcd has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 legoktm has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 Kenshin has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 chfoo has quit IRC (hub.efnet.us irc.efnet.nl)
06:13 🔗 _niklas has quit IRC (hub.efnet.us irc.efnet.nl)
06:15 🔗 Lord_Nigh has joined #archiveteam-bs
06:15 🔗 stapler11 has joined #archiveteam-bs
06:15 🔗 benjins has joined #archiveteam-bs
06:15 🔗 odemgi_ has joined #archiveteam-bs
06:15 🔗 h3ndr1k has joined #archiveteam-bs
06:15 🔗 tuluu has joined #archiveteam-bs
06:15 🔗 Atom-- has joined #archiveteam-bs
06:15 🔗 yano has joined #archiveteam-bs
06:15 🔗 bsmith093 has joined #archiveteam-bs
06:15 🔗 MillerBOS has joined #archiveteam-bs
06:15 🔗 TC01 has joined #archiveteam-bs
06:15 🔗 BnAboyZ has joined #archiveteam-bs
06:15 🔗 sHATNER has joined #archiveteam-bs
06:15 🔗 brayden has joined #archiveteam-bs
06:15 🔗 MrRadar2 has joined #archiveteam-bs
06:15 🔗 joshua_ has joined #archiveteam-bs
06:15 🔗 Tenebrae has joined #archiveteam-bs
06:15 🔗 Yurume has joined #archiveteam-bs
06:15 🔗 thejsa has joined #archiveteam-bs
06:15 🔗 _niklas has joined #archiveteam-bs
06:15 🔗 chfoo has joined #archiveteam-bs
06:15 🔗 Kenshin has joined #archiveteam-bs
06:15 🔗 legoktm has joined #archiveteam-bs
06:15 🔗 drcd has joined #archiveteam-bs
06:15 🔗 sknebel has joined #archiveteam-bs
06:15 🔗 pikami has joined #archiveteam-bs
06:15 🔗 Jon has joined #archiveteam-bs
06:15 🔗 omglolbah has joined #archiveteam-bs
06:15 🔗 Xibalba has joined #archiveteam-bs
06:15 🔗 irc.efnet.nl sets mode: +o MrRadar2
06:17 🔗 DigiDigi has joined #archiveteam-bs
06:20 🔗 yano_ has joined #archiveteam-bs
06:22 🔗 yano has quit IRC (Ping timeout: 268 seconds)
06:22 🔗 brayden has quit IRC (Read error: Operation timed out)
06:23 🔗 BnAboyZ has quit IRC (Read error: Operation timed out)
06:23 🔗 tuluu has quit IRC (Ping timeout: 268 seconds)
06:23 🔗 TC01 has quit IRC (Ping timeout: 268 seconds)
06:23 🔗 sHATNER has quit IRC (Read error: Operation timed out)
06:23 🔗 bsmith093 has quit IRC (Ping timeout: 268 seconds)
06:23 🔗 TC01 has joined #archiveteam-bs
06:24 🔗 bsmith093 has joined #archiveteam-bs
06:26 🔗 tuluu has joined #archiveteam-bs
06:31 🔗 joshua_ has quit IRC (Ping timeout: 268 seconds)
06:31 🔗 Xibalba has quit IRC (Ping timeout: 268 seconds)
06:31 🔗 MrRadar2 has quit IRC (Ping timeout: 268 seconds)
06:31 🔗 joshua_ has joined #archiveteam-bs
06:31 🔗 sHATNER has joined #archiveteam-bs
06:32 🔗 BnAboyZ has joined #archiveteam-bs
06:33 🔗 brayden has joined #archiveteam-bs
06:33 🔗 Xibalba has joined #archiveteam-bs
06:36 🔗 kiska has joined #archiveteam-bs
06:36 🔗 Fusl sets mode: +o kiska
06:36 🔗 MrRadar2 has joined #archiveteam-bs
06:37 🔗 svchfoo1 sets mode: +o MrRadar2
06:37 🔗 svchfoo3 sets mode: +o MrRadar2
06:43 🔗 DFJustin has quit IRC (Remote host closed the connection)
06:47 🔗 DFJustin has joined #archiveteam-bs
07:18 🔗 schbirid has quit IRC (Remote host closed the connection)
07:29 🔗 dxrt- is now known as dxrt
09:01 🔗 Stilettoo has joined #archiveteam-bs
09:09 🔗 Stiletto has quit IRC (Ping timeout: 615 seconds)
09:13 🔗 coderobe has joined #archiveteam-bs
09:23 🔗 atomicthu has joined #archiveteam-bs
09:34 🔗 h3ndr1k has quit IRC (Read error: Connection reset by peer)
09:43 🔗 h3ndr1k has joined #archiveteam-bs
10:17 🔗 stapler11 has quit IRC (Read error: Connection reset by peer)
10:32 🔗 BlueMax has quit IRC (Quit: Leaving)
11:37 🔗 VerifiedJ has joined #archiveteam-bs
11:40 🔗 JAA The NRATV playlist download finished about 10 hours ago. Now the fun part begins, parsing all that shit and extracting the video segments we want.
11:43 🔗 JAA The wpull DB for that crawl is 16.5 GB, by the way. 46 million segments or so.
11:46 🔗 JAA Six playlists failed to download with a 403:
11:46 🔗 JAA http://d3ldmcwnurseh3.cloudfront.net/assets/nr_140815_news_cc_sp_AmandaCollins/nr_140815_news_cc_sp_AmandaCollins.m3u8
11:46 🔗 JAA http://d3ldmcwnurseh3.cloudfront.net/assets/nr_140815_news_cc_sp_BrittanyBoddington/nr_140815_news_cc_sp_BrittanyBoddington.m3u8
11:46 🔗 JAA http://d3ldmcwnurseh3.cloudfront.net/assets/nr_140815_news_cc_sp_JeremyGreene/nr_140815_news_cc_sp_JeremyGreene.m3u8
11:46 🔗 JAA http://d3ldmcwnurseh3.cloudfront.net/assets/nr-181107-americanheroes-s01e01-daveeubank-social2-18-nr-382-rv1_1080ph-00005/nr-181107-americanheroes-s01e01-daveeubank-social2-18-nr-382-rv1_1080ph-00005.m3u8
11:46 🔗 JAA http://d3ldmcwnurseh3.cloudfront.net/assets/nra_fs_1%E2%80%8B41030_no%E2%80%8Bir_ep17_%E2%80%8Bandrew_k%E2%80%8Bline_bon%E2%80%8Bus_rv04/nra_fs_1%E2%80%8B41030_no%E2%80%8Bir_ep17_%E2%80%8Bandrew_k%E2%80%8Bline_bon%E2%80%8Bus_rv04.m3u8
11:46 🔗 JAA http://d3ldmcwnurseh3.cloudfront.net/assets/nra_fs_1%E2%80%8B41121_no%E2%80%8Bir_ep19_%E2%80%8Brob_pinc%E2%80%8Bus_bonus%E2%80%8B_rv04/nra_fs_1%E2%80%8B41121_no%E2%80%8Bir_ep19_%E2%80%8Brob_pinc%E2%80%8Bus_bonus%E2%80%8B_rv04.m3u8
11:47 🔗 JAA Just in case someone wants to look into whether these can be grabbed somewhere else.
11:53 🔗 Sokar has quit IRC (Ping timeout: 615 seconds)
12:26 🔗 Sokar has joined #archiveteam-bs
12:30 🔗 HashbangI has quit IRC (Read error: Connection reset by peer)
12:32 🔗 joshua_ has quit IRC (Read error: Operation timed out)
12:34 🔗 joshua_ has joined #archiveteam-bs
12:54 🔗 JAA Soo, after a bunch of grep and awk, it looks like there are about 6 million video segments to download.
12:55 🔗 JAA Will try to get a size estimate now.
13:04 🔗 yano_ is now known as yano
13:08 🔗 JAA Ok, this is going to be big.
13:09 🔗 JAA I took a 1 ‰ random sample (6044 URLs) and ran HEAD requests against those: 13365751712 bytes. So it'll be around 13.4 TB.
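The extrapolation in the message above can be checked with a few lines of arithmetic. This is only a sketch: the function name and the `scale` parameter are illustrative, and it assumes the 1 ‰ sample is representative of all segments.

```python
SAMPLE_BYTES = 13_365_751_712  # summed Content-Length of the 6044 sampled URLs (from the log)
SAMPLE_URLS = 6044
SCALE = 1000                   # a 1 per mille sample is scaled up by a factor of 1000


def estimate_total_bytes(sample_bytes: int, scale: int) -> int:
    """Scale the byte total of a random sample up to the whole population."""
    return sample_bytes * scale


total = estimate_total_bytes(SAMPLE_BYTES, SCALE)
print(f"estimated total: {total / 1e12:.1f} TB")
print(f"mean segment size in sample: {SAMPLE_BYTES / SAMPLE_URLS / 1e6:.2f} MB")
```

The estimate is only as good as the sample, so a different random draw could shift the total noticeably.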
13:13 🔗 JAA Who wants to grab that? I don't have anywhere to put that much data at the moment.
13:14 🔗 JAA (Remember that this will also need to be muxed together afterwards if we want it to be accessible in IA items.)
13:17 🔗 Fusl abox hel1, megawarc it and then dump into IA?
13:17 🔗 Fusl oh
13:17 🔗 Fusl muxed together
13:17 🔗 Fusl thats what you mean
13:17 🔗 Fusl ffmpeg or something
13:19 🔗 Fusl yeah, abox hel1 can do that, has 12 cores and plenty of memory for caching
13:20 🔗 Fusl NAME USED AVAIL REFER MOUNTPOINT
13:20 🔗 Fusl zpool-docker 35.4T 31.4T 33.1T /var/lib/docker
13:20 🔗 JAA My plan was to grab everything as WARCs and then deal with the muxing later.
13:20 🔗 JAA But at this size, that'll be a pain.
13:20 🔗 JAA So not sure what the best solution is.
13:20 🔗 Fusl i wouldnt know how to reliably dump warcs into files so ehh
13:21 🔗 JAA Yeah
13:21 🔗 JAA We could use wpull without --delete-after so that the plain files stay on disk. Means double the size obviously.
13:22 🔗 JAA But we'd have to extract everything anyway, so...
13:22 🔗 JAA I also have no idea what the best method for the actual muxing would be.
13:23 🔗 Fusl can you give me an example m3u8 url?
13:23 🔗 JAA http://d3ldmcwnurseh3.cloudfront.net/assets/-nr-180507-news-cc-lucretiahughes-social-1-oc/-nr-180507-news-cc-lucretiahughes-social-1-oc_1080ph.m3u8
13:23 🔗 Fusl they're probably concatenatable mpeg transport stream files
13:24 🔗 Fusl yeah, they are plain mpeg ts cat files
13:24 🔗 JAA Ok, that sounds good.
13:24 🔗 Fusl so doing `cat $file1 $file2 ... $fileN` will do the trick
13:25 🔗 Fusl can probably just warc them
13:25 🔗 Fusl and then have pywb with nginx playback the files for you
13:26 🔗 JAA Oh sure, might even work in the WBM. My idea was just to have them as IA items with straight video files.
13:26 🔗 JAA With metadata and everything for searchability.
13:27 🔗 JAA Anyway, the important part right now is to grab everything.
13:27 🔗 Fusl i rather meant, grab them into warcs for now
13:27 🔗 JAA Yeah
13:27 🔗 Fusl and then we can go over the WARCs, dump the files and concat them
13:27 🔗 JAA Yup
13:28 🔗 JAA warcat should be able to dump the files, but I've had issues with that before.
13:28 🔗 Fusl can you do a few of them and then rsync the warcs over to rsync://archivebox-hel1.meo.ws/thomas-the-tank-engine/ ?
13:28 🔗 JAA As a test you mean?
13:28 🔗 Fusl yeah
13:28 🔗 JAA Sure
13:29 🔗 Fusl wanna see how reliably i can dump the WARCs into files and cat the .ts files together
13:32 🔗 JAA Grabbing 3 videos with a single wpull right now.
13:32 🔗 JAA So the segments should be sequential in this file, but don't rely on that since we'll have to parallelise this download obviously.
13:33 🔗 Fusl yeah
13:33 🔗 Fusl im probably gonna do some pywb -> nginx -> ffmpeg -c copy -> single .ts magic
13:34 🔗 JAA That sounds really complicated.
13:34 🔗 JAA And pywb is going to make a full copy of the entire dataset.
13:34 🔗 JAA I don't think anarcat's PR got merged that would support symlinks etc.
13:35 🔗 JAA https://github.com/webrecorder/pywb/pull/409
13:35 🔗 Fusl maybe just warcat into files and then do some grep magic, i dunno, ill figure out something :P
13:35 🔗 JAA https://github.com/chfoo/warcat
13:36 🔗 JAA I remember having issues with it before, but I don't remember what they were.
13:36 🔗 JAA I did write a small little tool myself, but that's not suitable for this.
13:36 🔗 Fusl i mean if we wanted to, we could also just do a wpull on the .m3u8 files and ffmpeg on the .m3u8 files as well, that way we wont have to extract the files from the warcs but it will cause double bandwidth usage for them
13:36 🔗 JAA In any case, would be nice to fix warcat if it isn't working since it's quite useful also for other things.
13:37 🔗 JAA Yeah
13:37 🔗 JAA I already have all the M3U8 files.
13:37 🔗 JAA Except not as files but in a WARC, but yeah.
13:41 🔗 JAA (This is my little tool, by the way: https://github.com/JustAnotherArchivist/little-things/blob/master/warc-tiny If warcat has issues and we don't want to dive into it, it could probably be converted into something that works for this.)
13:42 🔗 JAA (There's at least one bug in it that makes it unusable, and it's also quite slow.)
13:47 🔗 * anarcat sad cat
13:49 🔗 JAA Fusl: nratv-3vid-sample.warc.gz is now on hel1.
13:53 🔗 VerifiedJ has quit IRC (Read error: Connection reset by peer)
13:54 🔗 VerifiedJ has joined #archiveteam-bs
13:58 🔗 Fusl JAA: will each .warc contain exactly one .m3u8 and multiple .ts for that .m3u8?
13:58 🔗 Fusl actually, looks like the actual .m3u8 is missing in that warc
13:59 🔗 Fusl thats what i need for correctly concatenating them together tho. any way you can get that into the warc as well?
14:00 🔗 JAA Depends on how we split those 6M .ts files up. I guess we could do one WARC per video. That'd be ~24k WARCs.
14:00 🔗 JAA And yeah, can throw in the .m3u8 if necessary. I have them in a separate WARC containing all of them at the moment.
14:01 🔗 JAA Curious though, why do you need it?
14:02 🔗 JAA It's just *00000.ts, *00001.ts etc. Haven't seen anything else than that or non-sequential.
14:02 🔗 Fusl do we wanna rely on the name being sequential?
14:03 🔗 JAA Yeah, probably better safe than sorry.
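A minimal sketch of the "don't rely on the file names" approach discussed above: read the segment order from the .m3u8 and concatenate the .ts files in that order, as `cat` would. The function names and the flat directory layout are assumptions for illustration, not what was actually run for this grab.

```python
import shutil
from pathlib import Path
from typing import List


def segment_names(playlist_text: str) -> List[str]:
    """Return the segment URIs of a media playlist, in playlist order."""
    names = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # In M3U8, lines starting with '#' are tags or comments;
        # every other non-empty line is a URI.
        if line and not line.startswith("#"):
            names.append(line)
    return names


def concat_segments(playlist: Path, segment_dir: Path, output: Path) -> None:
    """Append every segment to `output` in playlist order, like `cat a.ts b.ts > out.ts`."""
    with output.open("wb") as out:
        for name in segment_names(playlist.read_text()):
            # Assumption: segments were extracted flat into segment_dir
            # under their base file names.
            with (segment_dir / Path(name).name).open("rb") as seg:
                shutil.copyfileobj(seg, out)
```

Because the order comes from the playlist, this keeps working even if the segments were downloaded or extracted out of sequence.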
14:03 🔗 Fusl warcat extracts the data just fine btw and ill probably write a docker container so that we can spread the load across many small hcloud instances
14:05 🔗 JAA Sweet. Looking at https://github.com/chfoo/warcat/issues/19 I think I mainly had issues with broken HTTP servers. CloudFront seems to be doing this right.
14:06 🔗 PhrackD has quit IRC (Read error: Operation timed out)
14:06 🔗 JAA Ok, so one wpull process per video which downloads the .m3u8 and all .ts?
14:06 🔗 Fusl lol \nnn
14:07 🔗 JAA nnCoection Yup, someone fucked up there.
14:07 🔗 Fusl (!?\r\n|\n|\r) is usually how i write my code, that'll eat windows, linux and macos line endings
14:08 🔗 Fusl by now i think macos also uses \r\n though?
14:09 🔗 JAA I know it isn't \r anymore, but I don't know what it changed to.
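For reference, classic Mac OS used \r, while modern macOS uses \n like Linux. A sketch of a tolerant line splitter in the spirit of the pattern quoted above follows. It is written as a plain alternation here, which is an assumption about what was intended, and `split_lines` is an illustrative name.

```python
import re
from typing import List

# "\r\n" is tried first so that a Windows line ending is consumed as one
# break rather than being split into "\r" followed by "\n".
LINE_BREAK = re.compile(r"\r\n|\n|\r")


def split_lines(text: str) -> List[str]:
    """Split text on Windows (\\r\\n), Unix/macOS (\\n) and classic Mac OS (\\r) line endings."""
    return LINE_BREAK.split(text)


print(split_lines("a\r\nb\nc\rd"))
```

Python's built-in `str.splitlines()` covers these cases as well, but it also splits on several other separator characters, which may or may not be wanted when parsing HTTP headers.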
14:09 🔗 JAA So if we do this "one wpull per video", we might as well just drop --delete-after and have the plain files directly.
14:10 🔗 Fusl whats --delete-after?
14:10 🔗 PhrackD has joined #archiveteam-bs
14:10 🔗 Fusl also
14:10 🔗 Fusl dont need one wpull per video
14:10 🔗 Fusl was just my expectation
14:10 🔗 JAA That removes the downloaded file. If you don't use it, you end up with two copies, one in the WARC and one as a plain file.
14:10 🔗 Fusl i can work with multiple videos in a single warc
14:11 🔗 Fusl just make sure that all the videos for a single m3u8 are always in a single warc and not split across multiple warcs
14:12 🔗 JAA Yeah, grouping multiple videos together is easy.
14:13 🔗 JAA On the other hand, there's 250 segments per video on average, and the downloads aren't incredibly fast.
14:13 🔗 JAA (Individual downloads, I mean.)
14:14 🔗 JAA If a download fails, that would also limit potential damage to a single video.
14:46 🔗 katocala has quit IRC ()
14:57 🔗 rellem has joined #archiveteam-bs
14:57 🔗 katocala has joined #archiveteam-bs
15:23 🔗 JAA So I have one file per video now, listing the .m3u8 and all .ts.
15:30 🔗 JAA Nevermind, fucked something up there.
15:57 🔗 Mayonaise has quit IRC (Read error: Operation timed out)
16:01 🔗 JAA Ok, fixed that. The ts files are no longer in order, but I can't be bothered to fix that.
16:05 🔗 JAA Fusl: So, want to run this?
16:05 🔗 JAA 22188 videos
16:05 🔗 Fusl the download or the concat? or both?
16:05 🔗 Mayonaise has joined #archiveteam-bs
16:05 🔗 JAA Both
16:06 🔗 Fusl sure
16:07 🔗 JAA 1 WARC per video, then we can still pack them into megawarcs if necessary for upload, but it seems that the size of one video should be roughly 600 MB, so that probably makes sense.
16:26 🔗 JAA Fusl: https://transfer.notkiska.pw/skqvS/nratv-videos.tar.gz
16:26 🔗 JAA wpull --input-file X --warc-file X --user-agent 'ArchiveTeam' --no-verbose
16:28 🔗 JAA And then you can run the muxing directly.
16:28 🔗 JAA I think wpull puts the files in a directory structure like ./domain/path/.../file. There's probably an option to change that though.
16:30 🔗 Fusl JAA: is that one video per file now?
16:30 🔗 JAA Yep
16:53 🔗 JAA Playlist grab: https://archive.org/details/nratv_video_playlists_201906
16:59 🔗 icedice has joined #archiveteam-bs
17:02 🔗 Fusl wpull --input-file "${item_file}" --warc-file "${item_id}.warc.gz" --user-agent 'ArchiveTeam' --no-verbose || exit 1
17:02 🔗 Fusl JAA: is it .warc.gz or .warc?
17:08 🔗 JAA Fusl: wpull appends it, so --warc-file is without an extension.
17:10 🔗 Fusl k
17:11 🔗 JAA Would make more sense to have --warc-file-prefix for when you use --warc-max-size and --warc-file as a full filename when it's a single file, but "it's how wget does it", so wpull can't change that. :-/
17:45 🔗 Fusl JAA: are 500-599 errors retried?
17:47 🔗 Fusl Kaz: http://xor.meo.ws/a12adf5f/30c9/4517/b759/97164e3b4c80.png
17:48 🔗 Kaz I refuse to accept you can find a file in there easily
17:48 🔗 Fusl i cant
17:49 🔗 Fusl i have scripts for nano and ssh
17:49 🔗 Fusl if i type in `nano $filename`, it looks up an open file handle, if there is one open already, it searches the tmux window, switches to it and voila
17:50 🔗 Kaz Hah, fair enough
17:50 🔗 Fusl if i type in `ssh <hostname>` it looks up an open ssh master file handle, traces that to the master ssh process and an sqlite3 database to switch to that correct window as well
18:02 🔗 icedice2 has joined #archiveteam-bs
18:07 🔗 icedice has quit IRC (Read error: Operation timed out)
18:10 🔗 Fusl JAA: http://archivebox-hel1.meo.ws/nratv/files/
18:13 🔗 JAA Fusl: Yes. Specifically, wpull considers 200, 204, and 304 successful, and 401, 403, 404, 405, and 410 permanent errors. Any other status code is retried.
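The status handling described in the message above can be restated as a small classifier. This is an illustration of the rule as stated in the log, not wpull's actual code; the names are made up for the example.

```python
SUCCESS = {200, 204, 304}
PERMANENT_ERROR = {401, 403, 404, 405, 410}


def classify(status: int) -> str:
    """Map an HTTP status code to the outcome described in the log."""
    if status in SUCCESS:
        return "success"
    if status in PERMANENT_ERROR:
        return "permanent_error"
    # Everything else, including all 5xx codes, is retried.
    return "retry"


for code in (200, 403, 500, 503):
    print(code, classify(code))
```

Under this rule the six playlists that returned 403 would not have been retried, while any 500-599 response would be.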
18:14 🔗 JAA Nice :-)
18:18 🔗 MaximeLeG has joined #archiveteam-bs
18:21 🔗 RichardG has quit IRC (Ping timeout: 255 seconds)
18:21 🔗 Fusl i really need to get 10gbit on this hetzner server ready
18:22 🔗 Fusl its already at 1gbit :$ http://xor.meo.ws/c471c0f6/99b0/4039/a2d1/09421f959c45.png
18:22 🔗 JAA Hetzner has 10 Gb/s servers?
18:22 🔗 anarcat Fusl: grafana?
18:22 🔗 Fusl grafana.
18:22 🔗 Fusl JAA: they offer 10gbit upgrades on some of their servers but that means the traffic is limited to 20tb
18:23 🔗 Fusl i may have a solution around that by using hetzner cloud for outbound connections :P
18:23 🔗 JAA Ah
18:23 🔗 Fusl because internal hetzner is not counted towards the traffic limit
18:23 🔗 JAA Right
18:25 🔗 RichardG has joined #archiveteam-bs
18:28 🔗 Fusl 300+300 MB per item, 22188 items total = 6.6tb*2 total ?
18:30 🔗 JAA Seems a bit small, but maybe my estimate wasn't a representative sample.
18:32 🔗 Fusl have the sizes for the samples you pulled?
18:33 🔗 JAA Nope, only the total.
18:35 🔗 JAA This is what I did: shuf -n 6044 <best-videos-ts >best-videos-ts-sample + parallel -X -j 16 curl -sI {} <best-videos-ts-sample | grep -i '^content-length: ' | awk '{sum+=$2} END {printf "%.0f\n",sum}'
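The `grep | awk` stage of the pipeline above sums the Content-Length values from the HEAD responses. A rough Python equivalent of just that stage is sketched below; it assumes the raw response headers from all requests are available as one string, and `sum_content_length` is an illustrative name.

```python
def sum_content_length(raw_headers: str) -> int:
    """Sum all Content-Length header values found in concatenated HTTP response headers."""
    total = 0
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        # Header names are case-insensitive, matching the `grep -i` in the original pipeline.
        if sep and name.strip().lower() == "content-length":
            total += int(value.strip())
    return total


example = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: video/MP2T\r\n"
    "Content-Length: 2000000\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "content-length: 2500000\r\n"
    "\r\n"
)
print(sum_content_length(example))
```

Keeping the per-URL sizes rather than only the sum would make it possible to check later how representative the sample was.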
18:35 🔗 SketchCow Just wanted to mention that a bug in the upload script made it so archivejobs that start with - would cause that bulk of items to not upload.
18:35 🔗 SketchCow So there were two sets from may that were just sitting around being ignored
18:35 🔗 SketchCow Safe, but not put into the archive.
18:35 🔗 JAA Oh, heh.
18:35 🔗 JAA Nice find
18:36 🔗 SketchCow Well, I was wondering why they weren't getting uploaded, also why that script was barfing errors
18:36 🔗 JAA ArchiveBot I guess?
18:36 🔗 SketchCow No, my uploader
18:36 🔗 JAA Yeah, but it's AB archives?
18:36 🔗 SketchCow I mean the name of the warc was -inf plus stuff
18:36 🔗 SketchCow Yes, archivebot
18:36 🔗 JAA Huh, that seems odd.
18:36 🔗 JAA Do you have the full filename?
18:37 🔗 SketchCow -inf-20190507-055626-3foh0-00000.warc.gz was one
18:38 🔗 JAA Oh, heh, I actually noticed that one at the time.
18:38 🔗 JAA Yeah, that was a broken job.
18:39 🔗 JAA "!archive https:/twitter.com/..." missing second slash before the domain.
18:39 🔗 SketchCow # Special Use Case: You can feed this script a pile of arguments and it's like running it
18:39 🔗 SketchCow # over and over. It'll give you a LOT of output (for now, unless we do major surgery)
18:39 🔗 SketchCow # but it will do the right thing. Otherwise, use as the docs say.
18:39 🔗 SketchCow Well, it's righting itself
18:39 🔗 SketchCow for ARGUMENT in $@
18:39 🔗 SketchCow Ha, past
18:39 🔗 SketchCow a
18:39 🔗 SketchCow do
18:40 🔗 SketchCow That reminds me to make the stupid announcement in the main channel
18:40 🔗 JAA -- and "$@" ftw.
18:41 🔗 SketchCow Anyway, this delayed adding kiwifarms to the wayback for 90 days, boo hoo
18:56 🔗 MaximeLeG has quit IRC (Ping timeout: 260 seconds)
19:03 🔗 icedice has joined #archiveteam-bs
19:05 🔗 icedice2 has quit IRC (Ping timeout: 252 seconds)
19:06 🔗 MaximeLeG has joined #archiveteam-bs
19:07 🔗 MaximeLeG has quit IRC (Client Quit)
19:08 🔗 MaximeLeG has joined #archiveteam-bs
19:09 🔗 MaximeLeG has quit IRC (Client Quit)
20:02 🔗 icedice2 has joined #archiveteam-bs
20:08 🔗 icedice has quit IRC (Read error: Operation timed out)
20:08 🔗 AnthonyI has joined #archiveteam-bs
20:20 🔗 ranma has joined #archiveteam-bs
21:07 🔗 t3 has joined #archiveteam-bs
21:09 🔗 godane has joined #archiveteam-bs
22:38 🔗 znak has joined #archiveteam-bs
22:40 🔗 znak has left
22:46 🔗 killsushi has joined #archiveteam-bs
22:48 🔗 Dallas has joined #archiveteam-bs
22:52 🔗 BlueMax has joined #archiveteam-bs
