#archiveteam-bs 2019-04-11,Thu


Time Nickname Message
00:02 ayanami_ Not sure if this has been posted before, but Stringify is closing https://www.stringify.com
00:07 ayanami_ It's a service similar to IFTTT - might be worth archiving but I'm not sure
00:13 kiska JAA: I guess if you got anything >1M you can manually upload it to rsync://95.217.3.46/sola/JAA/ at least it'll be something
00:36 fuzzy8021 anything new running/resources needed anywhere? been out of town on work for a week
00:37 netsound_ has joined #archiveteam-bs
00:42 netsound has quit IRC (Ping timeout: 360 seconds)
00:53 tech234a has quit IRC (Quit: Connection closed for inactivity)
00:56 BlueMax has quit IRC (Quit: Leaving)
01:12 ayanami_ has quit IRC (Quit: Leaving)
01:15 Zerote has quit IRC (Ping timeout: 260 seconds)
01:16 Exairnous has joined #archiveteam-bs
01:20 BlueMax has joined #archiveteam-bs
01:45 xit_ has joined #archiveteam-bs
01:46 m007a83 has joined #archiveteam-bs
01:51 icedice has quit IRC (Quit: Leaving)
02:12 enowaldo has joined #archiveteam-bs
02:14 Odd0002_ has joined #archiveteam-bs
02:18 Odd0002 has quit IRC (Read error: Operation timed out)
02:18 Odd0002_ is now known as Odd0002
02:19 enowaldo has quit IRC (Read error: Operation timed out)
03:13 ndiddy has quit IRC ()
03:14 odemgi_ has joined #archiveteam-bs
03:17 odemgi has quit IRC (Ping timeout: 252 seconds)
03:23 odemg has quit IRC (Ping timeout: 615 seconds)
03:29 qw3rty111 has joined #archiveteam-bs
03:36 qw3rty119 has quit IRC (Ping timeout: 600 seconds)
03:52 Mateon1 has quit IRC (Remote host closed the connection)
03:52 Mateon1 has joined #archiveteam-bs
04:03 tech234a has joined #archiveteam-bs
04:42 Despatche has quit IRC (Quit: Read error: Connection reset by deer)
05:08 Exairnous has quit IRC (Ping timeout: 268 seconds)
05:12 enowaldo has joined #archiveteam-bs
05:16 Stilett0 has quit IRC (Read error: Operation timed out)
05:19 sep332 has quit IRC (Read error: Operation timed out)
05:21 enowaldo has quit IRC (Ping timeout: 492 seconds)
05:26 sep332 has joined #archiveteam-bs
05:44 Stiletto has joined #archiveteam-bs
06:01 deevious has joined #archiveteam-bs
06:03 deevious has quit IRC (Client Quit)
06:13 benjins has quit IRC (Read error: Connection reset by peer)
06:18 deevious has joined #archiveteam-bs
06:32 Stiletto has quit IRC ()
06:40 Stiletto has joined #archiveteam-bs
06:50 Exairnous has joined #archiveteam-bs
06:50 deevious has quit IRC (Read error: Connection reset by peer)
06:51 Hani has quit IRC (west.us.hub irc.Prison.NET)
06:51 achip has quit IRC (west.us.hub irc.Prison.NET)
06:51 marked has quit IRC (west.us.hub irc.Prison.NET)
06:51 K4k has quit IRC (west.us.hub irc.Prison.NET)
06:51 deevious has joined #archiveteam-bs
06:52 K4k_ has joined #archiveteam-bs
07:10 schbirid https://twitter.com/MarkMan23/status/1116072831228436480
07:32 Dallas has quit IRC (Read error: Connection reset by peer)
07:32 BnAboyZ has quit IRC (Read error: Connection reset by peer)
07:32 Dallas6 has joined #archiveteam-bs
07:32 VoynichCr has quit IRC (Ping timeout: 268 seconds)
07:33 MrRadar2 has quit IRC (Ping timeout: 268 seconds)
07:33 Zerote has joined #archiveteam-bs
07:35 Tenebrae has quit IRC (Ping timeout: 268 seconds)
07:39 asie has quit IRC (Ping timeout: 268 seconds)
07:40 Dallas6 has quit IRC (Ping timeout: 268 seconds)
07:40 sHATNER has quit IRC (Ping timeout: 365 seconds)
07:40 Xibalba has quit IRC (Ping timeout: 268 seconds)
07:41 JAA kiska: Come again?
07:42 marked has joined #archiveteam-bs
07:43 achip has joined #archiveteam-bs
07:44 Tenebrae has joined #archiveteam-bs
07:44 sHATNER has joined #archiveteam-bs
07:44 asie has joined #archiveteam-bs
07:44 MrRadar2 has joined #archiveteam-bs
07:45 Dallas6 has joined #archiveteam-bs
07:45 svchfoo3 sets mode: +o MrRadar2
07:45 BnAboyZ has joined #archiveteam-bs
07:47 Xibalba has joined #archiveteam-bs
07:47 VoynichCr has joined #archiveteam-bs
08:03 tech234a has quit IRC (Quit: Connection closed for inactivity)
08:18 fuzzy8021 has quit IRC (Read error: Connection reset by peer)
08:18 fuzzy8021 has joined #archiveteam-bs
08:18 dxrt_ has quit IRC (Read error: Operation timed out)
08:19 JAA kiska: Doesn't look like there's any data left on your machines. I'll take a closer look later.
08:20 paul2520 has quit IRC (Read error: Operation timed out)
08:20 K4k_ has quit IRC (Read error: Operation timed out)
08:20 K4k_ has joined #archiveteam-bs
08:21 Damme has quit IRC (Read error: Operation timed out)
08:21 TigerbotH has quit IRC (Read error: Operation timed out)
08:22 chuckx has quit IRC (Read error: Operation timed out)
08:22 Ravenloft has quit IRC (Read error: Operation timed out)
08:23 Odd0002 has quit IRC (Read error: Operation timed out)
08:23 ivan- has joined #archiveteam-bs
08:24 kiska1 has quit IRC (Read error: Operation timed out)
08:25 Odd0002 has joined #archiveteam-bs
08:25 LeG0ax has joined #archiveteam-bs
08:25 qw3rty111 has quit IRC (Read error: Operation timed out)
08:26 PotcFdk has quit IRC (Ping timeout: 600 seconds)
08:26 PhrackD has quit IRC (Read error: Operation timed out)
08:26 step has quit IRC (Ping timeout: 600 seconds)
08:27 deevious has quit IRC (Quit: deevious)
08:28 Damme has joined #archiveteam-bs
08:28 ivan has quit IRC (Ping timeout: 600 seconds)
08:29 Lord_Nigh has quit IRC (Read error: Operation timed out)
08:30 paul2520 has joined #archiveteam-bs
08:31 Ing3b0rg has quit IRC (Ping timeout: 600 seconds)
08:31 LeG0ax is now known as Ing3b0rg
08:31 Lord_Nigh has joined #archiveteam-bs
08:32 Mayonaise has quit IRC (Ping timeout: 600 seconds)
08:34 step has joined #archiveteam-bs
08:34 qw3rty111 has joined #archiveteam-bs
08:34 kiska1 has joined #archiveteam-bs
08:35 svchfoo3 sets mode: +o kiska1
08:36 Mayonaise has joined #archiveteam-bs
08:36 chuckx has joined #archiveteam-bs
08:38 dxrt_ has joined #archiveteam-bs
08:38 dxrt sets mode: +o dxrt_
08:38 PhrackD has joined #archiveteam-bs
08:38 TigerbotH has joined #archiveteam-bs
08:47 enowaldo has joined #archiveteam-bs
08:47 deevious has joined #archiveteam-bs
08:52 enowaldo has quit IRC (Ping timeout: 265 seconds)
09:19 PotcFdk has joined #archiveteam-bs
09:36 Hani has joined #archiveteam-bs
10:24 Verified_ has quit IRC (Remote host closed the connection)
10:25 Verified_ has joined #archiveteam-bs
10:50 ivan- is now known as ivan
10:57 BlueMax has quit IRC (Read error: Connection reset by peer)
11:08 benjins has joined #archiveteam-bs
11:12 SynMonger has quit IRC (Ping timeout: 615 seconds)
11:29 Terbium has quit IRC (Ping timeout: 604 seconds)
11:32 icedice has joined #archiveteam-bs
11:42 jesso has quit IRC (Quit: jesso)
12:05 icedice has quit IRC (Quit: Leaving)
12:08 Despatche has joined #archiveteam-bs
12:11 odemg has joined #archiveteam-bs
12:11 odemg has quit IRC (Remote host closed the connection)
12:12 odemg has joined #archiveteam-bs
12:13 Jopik has quit IRC (Ping timeout: 360 seconds)
12:52 enowaldo has joined #archiveteam-bs
13:06 jesso has joined #archiveteam-bs
13:10 jesso has quit IRC (Client Quit)
13:21 tapos has joined #archiveteam-bs
13:25 drcd has joined #archiveteam-bs
13:52 tapos has quit IRC (Leaving)
14:02 JAA Sola's CDN is still up, so I'll try to extract those URLs from the API AB job.
14:09 icedice has joined #archiveteam-bs
14:20 jesso has joined #archiveteam-bs
14:20 JAA kiska: Nothing running on your machines anymore from my side and no data or anything else remaining. Just your AB pipeline on one of them.
14:21 kiska Hrm... I had ~3G of data on my instances which I am now going to pack and ship to IA
14:27 wyatt8740 has joined #archiveteam-bs
14:29 Verified_ has quit IRC (Ping timeout: 252 seconds)
14:30 jesso has quit IRC (Remote host closed the connection)
14:39 Odd0002 has quit IRC (Ping timeout: 252 seconds)
14:39 Odd0002 has joined #archiveteam-bs
14:45 Verified_ has joined #archiveteam-bs
14:57 Zerote has quit IRC (Ping timeout: 260 seconds)
14:58 enowaldo has quit IRC (Read error: Operation timed out)
15:01 Smiley has quit IRC (Ping timeout: 265 seconds)
15:04 wp494 has quit IRC (Ping timeout: 506 seconds)
15:05 wp494 has joined #archiveteam-bs
15:10 JAA Oof, the text file of URLs extracted from the API WARC is almost as large as the WARC itself... :-|
15:10 JAA (Compression etc.)
15:10 JAA 13.9M URLs, but that's before deduplication.
15:23 wyatt8740 has quit IRC (Ping timeout: 246 seconds)
15:27 odemgi_ HCross, tracker borked on newgrabber?
15:28 ayanami_ has joined #archiveteam-bs
15:29 JAA Ooh, TIL deduping with Perl instead of AWK is faster by two orders of magnitude.
15:30 JAA Specifically: perl -ne 'print if ! $a{$_}++' in >out instead of awk '!seen[$0]++' in >out (which already beats sort -u by a few orders of magnitude for large files)
15:32 PurpleSym JAA: Are you sure `sort` is not writing temporary files to disk (i.e. you increased --buffer-size)?
15:33 JAA PurpleSym: No idea what sort's doing besides, you know, sorting, which is ridiculously expensive for large files. I haven't used sort for pure deduplication in years. I just wasn't aware that Perl was so much faster than AWK.
15:34 JAA Sorting a million lines takes several minutes with AWK and about 1.2 seconds with Perl...
15:34 JAA s/Sorting/Deduplicating/
15:34 PurpleSym It uses a relatively small memory buffer by default and writes temporary files instead. That's why it's slow for big files.
15:35 JAA Hmm, but surely there's still a massive overhead due to the sorting itself, right?
15:35 JAA Also, the temporary file is written to TMPDIR I assume, so when that's a tmpfs, that shouldn't matter too much?
15:36 PurpleSym Well, if you don't need sorting you can go with `uniq`.
15:37 JAA Uniq only works for consecutive duplicates.
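The point about `uniq` can be illustrated with a small Python sketch. It is not anything that was run in the channel: `dedupe_adjacent` and the sample list are hypothetical, and `itertools.groupby` is used here only because it collapses runs of equal neighbours in the same way `uniq` does.

```python
from itertools import groupby


def dedupe_adjacent(lines):
    """Collapse runs of identical neighbouring lines, as uniq does."""
    # groupby starts a new group whenever the value changes, so a repeat
    # that is separated by other lines is NOT detected.
    return [key for key, _group in groupby(lines)]


# Hypothetical sample: "a" and "b" repeat, but not always consecutively.
sample = ["a", "b", "a", "a", "c", "b"]

unsorted_result = dedupe_adjacent(sample)        # non-adjacent repeats survive
sorted_result = dedupe_adjacent(sorted(sample))  # equivalent of sort | uniq
```

On unsorted input only the adjacent pair of "a" lines is collapsed; sorting first removes every repeat, at the cost of the sort and of the original line order.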
15:38 JAA Just tested sort's --buffer-size/-S option, and there is a difference, but it's pretty small (0.8 instead of 0.9 seconds for sorting a file generated with 'seq 100000 | sort -R').
15:39 JAA And with 1M lines, it's 3.1 vs 2.9 seconds.
15:39 ivan https://github.com/ludios/quickmunge/blob/master/bin/uniqify
15:40 ivan I'm gonna guess my Python 3 port is slower than the original :-)
15:41 PurpleSym The Perl/awk versions use a hash table, is that right?
15:41 JAA Correct
15:42 JAA ivan: Much faster than AWK, a bit slower than Perl.
15:42 PurpleSym Ok, if that's O(1) it'll always beat sort's O(n*log(n)) until you run out of memory.
15:43 JAA Specifically, with a 250k line test file: Perl 0.26 s, uniqify 0.37 s, AWK 21.3 s
15:43 JAA However, the Perl and AWK solutions will preserve order while your script doesn't (I think).
15:43 JAA Not that important usually, just something worth noting.
15:44 JAA Switching it to an 'if line not in s' check inside the loop to preserve the order gives 0.40 s.
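A minimal sketch of the order-preserving variant described above, assuming it amounts to a set-membership check inside the read loop. This is not the actual `uniqify` source; the name `unique_lines` is hypothetical.

```python
def unique_lines(lines):
    """Yield each distinct line once, in first-seen order."""
    seen = set()  # hash-based lookup, average O(1) per line
    for line in lines:
        if line not in seen:  # the 'if line not in s' check
            seen.add(line)
            yield line


# Hypothetical input; any iterable of strings works, including a file object.
example = ["b\n", "a\n", "b\n", "c\n", "a\n"]
deduped = list(unique_lines(example))
```

Because it is a generator it can be fed an open file and written out with `writelines` without holding the output in memory. The `seen` set still grows with the number of distinct lines, which is the memory limit mentioned at 15:42.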
15:46 JAA Obviously, that'll depend a lot on how many duplicates are in the input.
15:47 JAA Deduplicating the 13.9M URLs with Perl took 14 s. I killed the AWK run after 10 minutes, when it had reached only about 10 % of the file.
15:47 Zerote has joined #archiveteam-bs
15:52 JAA Anyway, back to topic: 8.4 million unique URLs from the Sola API data, and the vast majority of that is either directly the Sola CDN (cdn.solacore.net) or S3.
15:52 JAA Ah, also s.plague.io, which I'm not really sure what it is.
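One way such a per-host breakdown could be produced is sketched below. The log does not say how the tally was actually made; `count_hosts` is a hypothetical helper and the URL paths are invented placeholders, with only the two hostnames taken from the messages above.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_hosts(urls):
    """Tally URLs by hostname (urlsplit lower-cases the host)."""
    return Counter(urlsplit(url).hostname for url in urls)


# Placeholder URLs for illustration only; the paths are made up.
urls = [
    "https://cdn.solacore.net/upload/example-a.jpg",
    "https://cdn.solacore.net/upload/example-b.jpg",
    "https://s.plague.io/example-c.jpg",
]
host_counts = count_hosts(urls)
```

`host_counts.most_common()` then lists the hosts from most to least frequent.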
15:53 enowaldo has joined #archiveteam-bs
15:54 JAA NB, this does not contain URLs within posts unless the posts start with a URL (in which case the "URL" contains the entire post, not just the URL itself) due to how I extracted it from the JSON data (grep).
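The limitation described above comes from matching whole JSON string values. A hedged alternative, which was not used in the channel, is to parse each record and search inside every string. The field names `image_url` and `text`, the `example.invalid` URLs and the URL regex are all assumptions made for this sketch.

```python
import json
import re

# Deliberately simple pattern: scheme, then everything up to whitespace,
# a quote or an angle bracket. Trailing punctuation may still be captured.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def iter_strings(node):
    """Yield every string value found anywhere in a parsed JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def extract_urls(json_text):
    """Return all URLs found inside any string of one JSON document."""
    urls = []
    for text in iter_strings(json.loads(json_text)):
        urls.extend(URL_RE.findall(text))
    return urls


# Hypothetical record; the field names are invented for illustration.
record = json.dumps({
    "image_url": "https://cdn.example.invalid/upload/a.jpg",
    "text": "look at https://example.invalid/page and more text",
})
found = extract_urls(record)
```

With this approach a URL in the middle of a post is picked up on its own and the rest of the post text is left out.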
15:56 ayanami_ Mainly CDN? I guess just pictures are better than nothing
15:56 JAA Ah, the s.plague.io URLs come from "share_image" fields on very old posts. Newer posts have URLs like "https://cdn.solacore.net/upload/2017-11-23/8935a00d-5507-4597-b2f3-d169d7e85bdb.jpg". They also switched the post IDs from numerical to alphanumerical when they changed those URLs.
16:14 Terbium has joined #archiveteam-bs
16:16 enowaldo has quit IRC (Read error: Operation timed out)
16:27 omarroth has joined #archiveteam-bs
16:30 ReimuHaku has quit IRC (Remote host closed the connection)
16:35 ReimuHaku has joined #archiveteam-bs
17:02 enowaldo has joined #archiveteam-bs
17:14 astrid JAA: is your file not already sorted? because if it is, 'uniq' is much faster than 'sort -u' :)
17:18 enowaldo has quit IRC (Ping timeout: 268 seconds)
17:26 fuzzy8021 has quit IRC (Read error: Operation timed out)
17:26 fuzzy8021 has joined #archiveteam-bs
17:37 marked the tumblr dedup jobs were running for tens of minutes so would make a better benchmark test case
18:15 Smiley has joined #archiveteam-bs
18:19 drcd_ has joined #archiveteam-bs
18:21 drcd has quit IRC (Ping timeout: 252 seconds)
18:31 drcd has joined #archiveteam-bs
18:35 drcd_ has quit IRC (Read error: Operation timed out)
18:37 wyatt8740 has joined #archiveteam-bs
18:37 ayanami_ has quit IRC (Quit: Leaving)
18:47 enowaldo has joined #archiveteam-bs
19:00 schbirid has quit IRC (Read error: Connection reset by peer)
19:15 killsushi has joined #archiveteam-bs
19:25 enowaldo has quit IRC (Read error: Operation timed out)
19:52 bytefray has joined #archiveteam-bs
19:53 bytefray has quit IRC (Client Quit)
20:00 enowaldo has joined #archiveteam-bs
20:03 drcd has quit IRC (Ping timeout: 252 seconds)
20:07 godane1 has quit IRC (Ping timeout: 246 seconds)
20:08 godane has joined #archiveteam-bs
20:16 godane has quit IRC (Ping timeout: 246 seconds)
20:18 godane has joined #archiveteam-bs
20:26 godane has quit IRC (Ping timeout: 246 seconds)
20:28 godane has joined #archiveteam-bs
20:37 godane has quit IRC (Ping timeout: 246 seconds)
20:38 godane has joined #archiveteam-bs
20:46 godane has quit IRC (Ping timeout: 246 seconds)
20:46 godane has joined #archiveteam-bs
20:55 godane has quit IRC (Ping timeout: 246 seconds)
20:56 godane has joined #archiveteam-bs
21:04 godane has quit IRC (Ping timeout: 246 seconds)
21:05 godane has joined #archiveteam-bs
21:14 godane has quit IRC (Ping timeout: 246 seconds)
21:15 godane has joined #archiveteam-bs
21:23 godane has quit IRC (Ping timeout: 246 seconds)
22:03 tomaspark has joined #archiveteam-bs
22:03 deevious has quit IRC (Read error: Connection reset by peer)
22:04 deevious has joined #archiveteam-bs
22:08 BlueMax has joined #archiveteam-bs
22:32 BlueMax has quit IRC (Quit: Leaving)
22:55 godane has joined #archiveteam-bs
23:12 enowaldo has quit IRC (Read error: Operation timed out)
