[00:02] Not sure if this has been posted before, but Stringify is closing https://www.stringify.com
[00:07] It's a service similar to IFTTT - might be worth archiving but I'm not sure
[00:13] JAA: I guess if you got anything >1M you can manually upload it to rsync://95.217.3.46/sola/JAA/ at least it'll be something
[00:36] anything new running/resources needed anywhere? been out of town on work for a week
[00:37] *** netsound_ has joined #archiveteam-bs
[00:42] *** netsound has quit IRC (Ping timeout: 360 seconds)
[00:53] *** tech234a has quit IRC (Quit: Connection closed for inactivity)
[00:56] *** BlueMax has quit IRC (Quit: Leaving)
[01:12] *** ayanami_ has quit IRC (Quit: Leaving)
[01:15] *** Zerote has quit IRC (Ping timeout: 260 seconds)
[01:16] *** Exairnous has joined #archiveteam-bs
[01:20] *** BlueMax has joined #archiveteam-bs
[01:45] *** xit_ has joined #archiveteam-bs
[01:46] *** m007a83 has joined #archiveteam-bs
[01:51] *** icedice has quit IRC (Quit: Leaving)
[02:12] *** enowaldo has joined #archiveteam-bs
[02:14] *** Odd0002_ has joined #archiveteam-bs
[02:18] *** Odd0002 has quit IRC (Read error: Operation timed out)
[02:18] *** Odd0002_ is now known as Odd0002
[02:19] *** enowaldo has quit IRC (Read error: Operation timed out)
[03:13] *** ndiddy has quit IRC ()
[03:14] *** odemgi_ has joined #archiveteam-bs
[03:17] *** odemgi has quit IRC (Ping timeout: 252 seconds)
[03:23] *** odemg has quit IRC (Ping timeout: 615 seconds)
[03:29] *** qw3rty111 has joined #archiveteam-bs
[03:36] *** qw3rty119 has quit IRC (Ping timeout: 600 seconds)
[03:52] *** Mateon1 has quit IRC (Remote host closed the connection)
[03:52] *** Mateon1 has joined #archiveteam-bs
[04:03] *** tech234a has joined #archiveteam-bs
[04:42] *** Despatche has quit IRC (Quit: Read error: Connection reset by deer)
[05:08] *** Exairnous has quit IRC (Ping timeout: 268 seconds)
[05:12] *** enowaldo has joined #archiveteam-bs
[05:16] *** Stilett0 has quit IRC (Read error: Operation timed out)
[05:19] *** sep332 has quit IRC (Read error: Operation timed out)
[05:21] *** enowaldo has quit IRC (Ping timeout: 492 seconds)
[05:26] *** sep332 has joined #archiveteam-bs
[05:44] *** Stiletto has joined #archiveteam-bs
[06:01] *** deevious has joined #archiveteam-bs
[06:03] *** deevious has quit IRC (Client Quit)
[06:13] *** benjins has quit IRC (Read error: Connection reset by peer)
[06:18] *** deevious has joined #archiveteam-bs
[06:32] *** Stiletto has quit IRC ()
[06:40] *** Stiletto has joined #archiveteam-bs
[06:50] *** Exairnous has joined #archiveteam-bs
[06:50] *** deevious has quit IRC (Read error: Connection reset by peer)
[06:51] *** Hani has quit IRC (west.us.hub irc.Prison.NET)
[06:51] *** achip has quit IRC (west.us.hub irc.Prison.NET)
[06:51] *** marked has quit IRC (west.us.hub irc.Prison.NET)
[06:51] *** K4k has quit IRC (west.us.hub irc.Prison.NET)
[06:51] *** deevious has joined #archiveteam-bs
[06:52] *** K4k_ has joined #archiveteam-bs
[07:10] https://twitter.com/MarkMan23/status/1116072831228436480
[07:32] *** Dallas has quit IRC (Read error: Connection reset by peer)
[07:32] *** BnAboyZ has quit IRC (Read error: Connection reset by peer)
[07:32] *** Dallas6 has joined #archiveteam-bs
[07:32] *** VoynichCr has quit IRC (Ping timeout: 268 seconds)
[07:33] *** MrRadar2 has quit IRC (Ping timeout: 268 seconds)
[07:33] *** Zerote has joined #archiveteam-bs
[07:35] *** Tenebrae has quit IRC (Ping timeout: 268 seconds)
[07:39] *** asie has quit IRC (Ping timeout: 268 seconds)
[07:40] *** Dallas6 has quit IRC (Ping timeout: 268 seconds)
[07:40] *** sHATNER has quit IRC (Ping timeout: 365 seconds)
[07:40] *** Xibalba has quit IRC (Ping timeout: 268 seconds)
[07:41] kiska: Come again?
[07:42] *** marked has joined #archiveteam-bs
[07:43] *** achip has joined #archiveteam-bs
[07:44] *** Tenebrae has joined #archiveteam-bs
[07:44] *** sHATNER has joined #archiveteam-bs
[07:44] *** asie has joined #archiveteam-bs
[07:44] *** MrRadar2 has joined #archiveteam-bs
[07:45] *** Dallas6 has joined #archiveteam-bs
[07:45] *** svchfoo3 sets mode: +o MrRadar2
[07:45] *** BnAboyZ has joined #archiveteam-bs
[07:47] *** Xibalba has joined #archiveteam-bs
[07:47] *** VoynichCr has joined #archiveteam-bs
[08:03] *** tech234a has quit IRC (Quit: Connection closed for inactivity)
[08:18] *** fuzzy8021 has quit IRC (Read error: Connection reset by peer)
[08:18] *** fuzzy8021 has joined #archiveteam-bs
[08:18] *** dxrt_ has quit IRC (Read error: Operation timed out)
[08:19] kiska: Doesn't look like there's any data left on your machines. I'll take a closer look later.
[08:20] *** paul2520 has quit IRC (Read error: Operation timed out)
[08:20] *** K4k_ has quit IRC (Read error: Operation timed out)
[08:20] *** K4k_ has joined #archiveteam-bs
[08:21] *** Damme has quit IRC (Read error: Operation timed out)
[08:21] *** TigerbotH has quit IRC (Read error: Operation timed out)
[08:22] *** chuckx has quit IRC (Read error: Operation timed out)
[08:22] *** Ravenloft has quit IRC (Read error: Operation timed out)
[08:23] *** Odd0002 has quit IRC (Read error: Operation timed out)
[08:23] *** ivan- has joined #archiveteam-bs
[08:24] *** kiska1 has quit IRC (Read error: Operation timed out)
[08:25] *** Odd0002 has joined #archiveteam-bs
[08:25] *** LeG0ax has joined #archiveteam-bs
[08:25] *** qw3rty111 has quit IRC (Read error: Operation timed out)
[08:26] *** PotcFdk has quit IRC (Ping timeout: 600 seconds)
[08:26] *** PhrackD has quit IRC (Read error: Operation timed out)
[08:26] *** step has quit IRC (Ping timeout: 600 seconds)
[08:27] *** deevious has quit IRC (Quit: deevious)
[08:28] *** Damme has joined #archiveteam-bs
[08:28] *** ivan has quit IRC (Ping timeout: 600 seconds)
[08:29] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[08:30] *** paul2520 has joined #archiveteam-bs
[08:31] *** Ing3b0rg has quit IRC (Ping timeout: 600 seconds)
[08:31] *** LeG0ax is now known as Ing3b0rg
[08:31] *** Lord_Nigh has joined #archiveteam-bs
[08:32] *** Mayonaise has quit IRC (Ping timeout: 600 seconds)
[08:34] *** step has joined #archiveteam-bs
[08:34] *** qw3rty111 has joined #archiveteam-bs
[08:34] *** kiska1 has joined #archiveteam-bs
[08:35] *** svchfoo3 sets mode: +o kiska1
[08:36] *** Mayonaise has joined #archiveteam-bs
[08:36] *** chuckx has joined #archiveteam-bs
[08:38] *** dxrt_ has joined #archiveteam-bs
[08:38] *** dxrt sets mode: +o dxrt_
[08:38] *** PhrackD has joined #archiveteam-bs
[08:38] *** TigerbotH has joined #archiveteam-bs
[08:47] *** enowaldo has joined #archiveteam-bs
[08:47] *** deevious has joined #archiveteam-bs
[08:52] *** enowaldo has quit IRC (Ping timeout: 265 seconds)
[09:19] *** PotcFdk has joined #archiveteam-bs
[09:36] *** Hani has joined #archiveteam-bs
[10:24] *** Verified_ has quit IRC (Remote host closed the connection)
[10:25] *** Verified_ has joined #archiveteam-bs
[10:50] *** ivan- is now known as ivan
[10:57] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[11:08] *** benjins has joined #archiveteam-bs
[11:12] *** SynMonger has quit IRC (Ping timeout: 615 seconds)
[11:29] *** Terbium has quit IRC (Ping timeout: 604 seconds)
[11:32] *** icedice has joined #archiveteam-bs
[11:42] *** jesso has quit IRC (Quit: jesso)
[12:05] *** icedice has quit IRC (Quit: Leaving)
[12:08] *** Despatche has joined #archiveteam-bs
[12:11] *** odemg has joined #archiveteam-bs
[12:11] *** odemg has quit IRC (Remote host closed the connection)
[12:12] *** odemg has joined #archiveteam-bs
[12:13] *** Jopik has quit IRC (Ping timeout: 360 seconds)
[12:52] *** enowaldo has joined #archiveteam-bs
[13:06] *** jesso has joined #archiveteam-bs
[13:10] *** jesso has quit IRC (Client Quit)
[13:21] *** tapos has joined #archiveteam-bs
[13:25] *** drcd has joined #archiveteam-bs
[13:52] *** tapos has quit IRC (Leaving)
[14:02] Sola's CDN is still up, so I'll try to extract those URLs from the API AB job.
[14:09] *** icedice has joined #archiveteam-bs
[14:20] *** jesso has joined #archiveteam-bs
[14:20] kiska: Nothing running on your machines anymore from my side and no data or anything else remaining. Just your AB pipeline on one of them.
[14:21] Hrm... I had ~3G of data on my instances which I am now going to pack and ship to IA
[14:27] *** wyatt8740 has joined #archiveteam-bs
[14:29] *** Verified_ has quit IRC (Ping timeout: 252 seconds)
[14:30] *** jesso has quit IRC (Remote host closed the connection)
[14:39] *** Odd0002 has quit IRC (Ping timeout: 252 seconds)
[14:39] *** Odd0002 has joined #archiveteam-bs
[14:45] *** Verified_ has joined #archiveteam-bs
[14:57] *** Zerote has quit IRC (Ping timeout: 260 seconds)
[14:58] *** enowaldo has quit IRC (Read error: Operation timed out)
[15:01] *** Smiley has quit IRC (Ping timeout: 265 seconds)
[15:04] *** wp494 has quit IRC (Ping timeout: 506 seconds)
[15:05] *** wp494 has joined #archiveteam-bs
[15:10] Oof, the text file of URLs extracted from the API WARC is almost as large as the WARC itself... :-|
[15:10] (Compression etc.)
[15:10] 13.9M URLs, but that's before deduplication.
[15:23] *** wyatt8740 has quit IRC (Ping timeout: 246 seconds)
[15:27] HCross, tracker borked on newgrabber?
[15:28] *** ayanami_ has joined #archiveteam-bs
[15:29] Ooh, TIL deduping with Perl instead of AWK is faster by two orders of magnitude.
[15:30] Specifically: perl -ne 'print if ! $a{$_}++' in >out instead of awk '!seen[$0]++' in >out (which already beats sort -u by a few orders of magnitude for large files)
[15:32] JAA: Are you sure `sort` is not writing temporary files to disk (i.e. you increased --buffer-size)?
[15:33] PurpleSym: No idea what sort's doing besides, you know, sorting, which is ridiculously expensive for large files. I haven't used sort for pure deduplication in years. I just wasn't aware that Perl was so much faster than AWK.
[15:34] Sorting a million lines takes several minutes with AWK and about 1.2 seconds with Perl...
[15:34] s/Sorting/Deduplicating/
[15:34] It uses a relatively small memory buffer by default and writes temporary files instead. That’s why it’s slow for big files.
[15:35] Hmm, but surely there's still a massive overhead due to the sorting itself, right?
[15:35] Also, the temporary file is written to TMPDIR I assume, so when that's a tmpfs, that shouldn't matter too much?
[15:36] Well, if you don’t need sorting you can go with `uniq`.
[15:37] Uniq only works for consecutive duplicates.
[15:38] Just tested sort's --buffer-size/-S option, and there is a difference, but it's pretty small (0.8 instead of 0.9 seconds for sorting a file generated with 'seq 100000 | sort -R').
[15:39] And with 1M lines, it's 3.1 vs 2.9 seconds.
[15:39] https://github.com/ludios/quickmunge/blob/master/bin/uniqify
[15:40] I'm gonna guess my Python 3 port is slower than the original :-)
[15:41] The Perl/awk versions use a hash table, is that right?
[15:41] Correct
[15:42] ivan: Much faster than AWK, a bit slower than Perl.
[15:42] Ok, if that’s O(1) it’ll always beat sort’s O(n*log(n)) until you run out of memory.
[15:43] Specifically, with a 250k line test file: Perl 0.26 s, uniqify 0.37 s, AWK 21.3 s
[15:43] However, the Perl and AWK solutions will preserve order while your script doesn't (I think).
[15:43] Not that important usually, just something worth noting.
[15:44] Switching it to an 'if line not in s' check inside the loop to preserve the order gives 0.40 s.
[15:46] Obviously, that'll depend a lot on how many duplicates are in the input.
[15:47] Deduplicating the 13.9M URLs with Perl took 14 s. I killed the AWK run after 10 minutes, when it had reached only about 10 % of the file.
[15:47] *** Zerote has joined #archiveteam-bs
[15:52] Anyway, back to topic: 8.4 million unique URLs from the Sola API data, and the vast majority of that is either directly the Sola CDN (cdn.solacore.net) or S3.
[15:52] Ah, also s.plague.io, which I'm not really sure what it is.
[15:53] *** enowaldo has joined #archiveteam-bs
[15:54] NB, this does not contain URLs within posts unless the posts start with a URL (in which case the "URL" contains the entire post, not just the URL itself) due to how I extracted it from the JSON data (grep).
[15:56] Mainly CDN? I guess just pictures are better than nothing
[15:56] Ah, the s.plague.io URLs come from "share_image" fields on very old posts. Newer posts have URLs like "https://cdn.solacore.net/upload/2017-11-23/8935a00d-5507-4597-b2f3-d169d7e85bdb.jpg". They also switched the post IDs from numerical to alphanumerical when they changed those URLs.
[16:14] *** Terbium has joined #archiveteam-bs
[16:16] *** enowaldo has quit IRC (Read error: Operation timed out)
[16:27] *** omarroth has joined #archiveteam-bs
[16:30] *** ReimuHaku has quit IRC (Remote host closed the connection)
[16:35] *** ReimuHaku has joined #archiveteam-bs
[17:02] *** enowaldo has joined #archiveteam-bs
[17:14] JAA: is your file not already sorted? because if it is, 'uniq' is much faster than 'sort -u' :)
[17:18] *** enowaldo has quit IRC (Ping timeout: 268 seconds)
[17:26] *** fuzzy8021 has quit IRC (Read error: Operation timed out)
[17:26] *** fuzzy8021 has joined #archiveteam-bs
[17:37] the tumblr dedup jobs were running for tens of minutes so would make a better benchmark test case
[18:15] *** Smiley has joined #archiveteam-bs
[18:19] *** drcd_ has joined #archiveteam-bs
[18:21] *** drcd has quit IRC (Ping timeout: 252 seconds)
[18:31] *** drcd has joined #archiveteam-bs
[18:35] *** drcd_ has quit IRC (Read error: Operation timed out)
[18:37] *** wyatt8740 has joined #archiveteam-bs
[18:37] *** ayanami_ has quit IRC (Quit: Leaving)
[18:47] *** enowaldo has joined #archiveteam-bs
[19:00] *** schbirid has quit IRC (Read error: Connection reset by peer)
[19:15] *** killsushi has joined #archiveteam-bs
[19:25] *** enowaldo has quit IRC (Read error: Operation timed out)
[19:52] *** bytefray has joined #archiveteam-bs
[19:53] *** bytefray has quit IRC (Client Quit)
[20:00] *** enowaldo has joined #archiveteam-bs
[20:03] *** drcd has quit IRC (Ping timeout: 252 seconds)
[20:07] *** godane1 has quit IRC (Ping timeout: 246 seconds)
[20:08] *** godane has joined #archiveteam-bs
[20:16] *** godane has quit IRC (Ping timeout: 246 seconds)
[20:18] *** godane has joined #archiveteam-bs
[20:26] *** godane has quit IRC (Ping timeout: 246 seconds)
[20:28] *** godane has joined #archiveteam-bs
[20:37] *** godane has quit IRC (Ping timeout: 246 seconds)
[20:38] *** godane has joined #archiveteam-bs
[20:46] *** godane has quit IRC (Ping timeout: 246 seconds)
[20:46] *** godane has joined #archiveteam-bs
[20:55] *** godane has quit IRC (Ping timeout: 246 seconds)
[20:56] *** godane has joined #archiveteam-bs
[21:04] *** godane has quit IRC (Ping timeout: 246 seconds)
[21:05] *** godane has joined #archiveteam-bs
[21:14] *** godane has quit IRC (Ping timeout: 246 seconds)
[21:15] *** godane has joined #archiveteam-bs
[21:23] *** godane has quit IRC (Ping timeout: 246 seconds)
[22:03] *** tomaspark has joined #archiveteam-bs
[22:03] *** deevious has quit IRC (Read error: Connection reset by peer)
[22:04] *** deevious has joined #archiveteam-bs
[22:08] *** BlueMax has joined #archiveteam-bs
[22:32] *** BlueMax has quit IRC (Quit: Leaving)
[22:55] *** godane has joined #archiveteam-bs
[23:12] *** enowaldo has quit IRC (Read error: Operation timed out)
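
Note on the 15:29 to 15:47 deduplication discussion: the hash-table approach used by the Perl and AWK one-liners can be sketched in Python roughly as follows. This is only a minimal illustration, not the linked uniqify script and not a benchmark of anything mentioned above; the function names `dedupe_lines` and `dedupe_consecutive` are hypothetical. The first keeps the first occurrence of every line in input order (the 'if line not in s' variant from 15:44), the second drops only adjacent repeats, which is the `uniq` behaviour mentioned at 15:37.

```python
from itertools import groupby
from typing import Iterable, Iterator


def dedupe_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, in first-seen order.

    Uses a set (a hash table), so each membership check is O(1) on
    average, at the cost of holding every distinct line in memory.
    """
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


def dedupe_consecutive(lines: Iterable[str]) -> Iterator[str]:
    """Drop only adjacent duplicates; non-adjacent repeats survive."""
    for line, _run in groupby(lines):
        yield line
```

To use it as a filter, something like `sys.stdout.writelines(dedupe_lines(sys.stdin))` would read lines from standard input and write the unique ones back out. As noted at 15:42, memory is the limit: the set grows with the number of distinct lines.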