[00:42] *** Maylay has quit IRC (Quit: No Ping reply in 300 seconds.) [00:44] *** manwith1n is now known as ranma [00:53] *** Maylay has joined #archiveteam-bs [01:12] *** Maylay has quit IRC (Remote host closed the connection) [01:15] *** Maylay has joined #archiveteam-bs [01:18] *** X-Scale has quit IRC (Quit: HydraIRC -> http://www.hydrairc.com <- Organize your IRC) [01:19] *** zino_ has quit IRC (Read error: Operation timed out) [01:21] *** zino has joined #archiveteam-bs [01:21] *** Maylay has quit IRC (Quit: No Ping reply in 300 seconds.) [01:24] *** Maylay has joined #archiveteam-bs [01:32] *** X-Scale has joined #archiveteam-bs [01:33] *** VerifiedJ has quit IRC (Quit: Leaving) [01:40] *** Maylay has quit IRC (Quit: No Ping reply in 300 seconds.) [01:45] *** sirvy_ has joined #archiveteam-bs [01:50] *** sirvy has quit IRC (Ping timeout: 615 seconds) [01:55] *** Maylay has joined #archiveteam-bs [02:04] *** Maylay has quit IRC (Quit: No Ping reply in 300 seconds.) [02:06] *** Maylay has joined #archiveteam-bs [02:14] *** Maylay has quit IRC (Remote host closed the connection) [02:24] *** Maylay has joined #archiveteam-bs [02:27] someone smarter than me, how was this thing escaped: data-uix-load-more-href=\"\/browse_ajax?action_continuation=1\u0026amp;continuation=4qmFsgIqEhpWTExMTXU1Z1BtS3A1YXYwUUNBYWpLVE1odxoMZWdkUVZEcERUV2RD\"\u003e\u003c [02:30] markedL: an HTML-safe JSON encoder that also escapes / to \/, followed by html entity escaping [02:31] the reason some JSON encoders do that is to prevent from closing a script (and starting a new one) [02:36] *** kode54 has quit IRC (Quit: The Lounge - https://thelounge.chat) [02:48] *** Maylay has quit IRC (Remote host closed the connection) [02:51] *** Maylay has joined #archiveteam-bs [03:06] *** Maylay has quit IRC (No Ping reply in 300 seconds.) [03:08] *** Maylay has joined #archiveteam-bs [03:12] *** kode54 has joined #archiveteam-bs [03:15] *** Maylay has quit IRC (Quit: No Ping reply in 300 seconds.) 
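Note on the escaping question from [02:27] to [02:31]: the answer given there (an HTML-safe JSON encoder that also writes / as \/, applied on top of HTML entity escaping) can be checked by undoing the two layers in reverse order. The sketch below is only an illustration in Python using the standard `json` and `html` modules; the function names are made up for this note, and it assumes the quoted fragment is the inside of a JSON string literal, which the log does not state outright.

```python
import html
import json

# Fragment quoted at [02:27], copied from the log as-is.
RAW = r'data-uix-load-more-href=\"\/browse_ajax?action_continuation=1\u0026amp;continuation=4qmFsgIqEhpWTExMTXU1Z1BtS3A1YXYwUUNBYWpLVE1odxoMZWdkUVZEcERUV2RD\"\u003e\u003c'


def undo_json_escaping(fragment: str) -> str:
    """Assumption: the fragment is the body of a JSON string literal.

    Wrapping it in double quotes lets json.loads resolve \\", \\/ and \\uXXXX.
    """
    return json.loads('"' + fragment + '"')


def undo_html_escaping(text: str) -> str:
    """Resolve HTML entities such as &amp; back to literal characters."""
    return html.unescape(text)


if __name__ == "__main__":
    as_html = undo_json_escaping(RAW)
    print(as_html)                      # HTML layer still escaped (&amp;)
    print(undo_html_escaping(as_html))  # plain attribute text with a bare &
```

The first step yields ordinary HTML in which the query string still contains `&amp;`; the second turns that into a plain `&`. Running `html.unescape` over a whole HTML fragment is a shortcut for inspection only; a real extractor would parse the attribute value first.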
[03:17] *** Maylay has joined #archiveteam-bs [03:29] *** Maylay has quit IRC (Read error: Operation timed out) [03:37] *** kode54 has quit IRC (Remote host closed the connection) [03:38] *** kode54 has joined #archiveteam-bs [04:00] *** Maylay has joined #archiveteam-bs [04:00] *** Maylay has quit IRC (Remote host closed the connection!) [04:02] *** DLoader has quit IRC (Quit: DLoader) [04:07] *** cppchrisc has joined #archiveteam-bs [04:07] *** cppchrisc has quit IRC (Connection closed) [04:08] *** cppchrisc has joined #archiveteam-bs [04:10] *** qw3rty has joined #archiveteam-bs [04:19] *** qw3rty2 has quit IRC (Ping timeout: 745 seconds) [04:27] *** odemgi_ has joined #archiveteam-bs [04:28] *** Maylay has joined #archiveteam-bs [04:32] *** odemgi has quit IRC (Read error: Operation timed out) [04:34] *** sirvy has joined #archiveteam-bs [04:38] *** sirvy_ has quit IRC (Ping timeout: 615 seconds) [04:46] *** K4k has quit IRC (Read error: Connection reset by peer) [04:49] *** killsushi has joined #archiveteam-bs [05:18] *** tech234a has quit IRC (Quit: Connection closed for inactivity) [05:59] I know [06:00] They still need to qa the thing [06:04] it's cool, looks like it's deployed [06:10] update on Youtube liked-lists, going to need a repo and target soon. 4.4TB of HTML in warc.gz [06:12] markedL: can you do a search of your scrapings for videos by channel name [06:13] can you join the project channel? [06:13] a channel that was recently removed has been backed up, but no metadata. [06:13] sort of a side question anyway [06:14] just wondering if you have any said metadata [06:18] I just process what I'm handed, you're welcome to ask the upstream data providers [07:13] *** DLoader has joined #archiveteam-bs [07:36] *** godane has quit IRC (Ping timeout: 252 seconds) [07:49] *** klg has quit IRC (Ping timeout: 258 seconds) [07:54] *** klg has joined #archiveteam-bs [09:16] what's your favorite tool for crawling and producing WARC archives? 
[09:16] (and is WARC the recommended format?) [09:21] apache2: Yes, WARC is the generally recommended format. There are a bunch of tools people use, but this one is pretty good for personal use: https://github.com/archiveteam/grab-site [09:21] There's also a tools section on the wiki: https://www.archiveteam.org/index.php?title=The_WARC_Ecosystem#Tools [09:30] thank you jodizzle [09:31] follow-up question: what do you personally use for searching/extracting from WARC dumps, and is there something like a browser-based application that can browse a WARC archive WITHOUT internet connections being made? [09:32] zgrep for searching WARCs because I'm insane. For playback, pywb is pretty good. [09:33] There are tools for extracting WARC contents into plain files, but I don't have much experience with them. [09:33] At most, I might dump the response contents to stdout (using my own tool, warc-tiny) and process them using grep, sed, awk, etc. to extract the information I'm interested in. [09:34] I'd primarily be looking for something that could extract article content (so some sort of CSS3-like selector syntax) and media linked from those articles [09:34] fair enough, I'm happy to hack something together myself too. I'll look into pywb and grab-site [09:35] how does grab-site deal with media content like streaming video? [09:36] It's complicated. [09:36] If it's a simple