[00:02] *** zyphlar_ has joined #archiveteam-bs
[00:02] *** Sk1d has joined #archiveteam-bs
[00:09] *** Sk1d has quit IRC (Read error: Operation timed out)
[00:11] *** Sk1d has joined #archiveteam-bs
[00:20] *** wp494 has quit IRC (Ping timeout: 506 seconds)
[00:20] *** wp494 has joined #archiveteam-bs
[00:25] *** godane has quit IRC (Ping timeout: 252 seconds)
[00:38] *** BlueMax has quit IRC (Quit: Leaving)
[00:44] *** godane has joined #archiveteam-bs
[01:42] godane Sorry I should have done more research
[01:42] The VPN is useful though
[01:50] *** Sk1d has quit IRC (Read error: Operation timed out)
[01:54] *** Sk1d has joined #archiveteam-bs
[01:54] *** bithippo has quit IRC (My MacBook Air has gone to sleep. ZZZzzz…)
[02:05] *** fuzy802 has joined #archiveteam-bs
[02:12] *** zyphlar_ has quit IRC (Quit: Connection closed for inactivity)
[02:12] *** fuzzy8021 has quit IRC (Ping timeout: 615 seconds)
[02:15] *** fuzy802 is now known as fuzzy8021
[04:03] *** BlueMax has joined #archiveteam-bs
[04:32] *** odemgi has joined #archiveteam-bs
[04:33] *** qw3rty115 has joined #archiveteam-bs
[04:35] *** odemgi_ has quit IRC (Ping timeout: 252 seconds)
[04:39] *** qw3rty114 has quit IRC (Read error: Operation timed out)
[04:41] *** odemg has quit IRC (Ping timeout: 615 seconds)
[04:47] *** odemg has joined #archiveteam-bs
[05:07] *** newbie98 has joined #archiveteam-bs
[05:13] Can anyone give me a hand processing some twitter WARCs? I've got a bot scraping some twitter content on a regular basis and spitting out .warc.gz files, which I then concatenate. I'd like to turn the concatenated WARC into a searchable text file of tweets + URLs, and I've put together this basic script to do it: https://pastebin.com/8e05En8C
[05:13] ...but it seems to take *forever* to run
[05:13] *** newbie98 is now known as jianaran
[05:13] (whoops)
[05:15] I'm not very good at python, but this seems to take ~2 seconds per record which seems insane.
[05:19] BeautifulSoup is probably the slow part
[05:20] you can use a tracing profiler to confirm
[05:20] https://github.com/uber/pyflame or https://github.com/benfred/py-spy or https://github.com/kwlzn/pytracing I have only used the first
[05:21] if you're looking for just the title you might as well use a regexp
[05:21] jianaran: ^
[05:22] you can check how fast it runs without the BeautifulSoup before doing anything else
[05:27] Replacing BS with a simple regexp would be pretty easy; I'll try that. thank you
[05:27] Is the gz (de)compression normally an issue?
[05:35] no
[05:48] OK, thank you. ivan_: can you tell me how to extract raw text from a warcio object?
[05:52] jianaran: something other than record.raw_stream?
[05:53] record.raw_stream seems to give a warcio.limitreader.LimitReader object, though, which I'm not quite sure how to access or parse from
[05:53] try .read()
[05:54] *** tomaspark has quit IRC (Remote host closed the connection)
[05:55] *** chimyatta has joined #archiveteam-bs
[05:59] *** Sk1d has quit IRC (Read error: Operation timed out)
[06:02] *** Sk1d has joined #archiveteam-bs
[06:06] *** Atom has joined #archiveteam-bs
[06:19] *** Sk1d has quit IRC (Read error: Operation timed out)
[06:22] *** Sk1d has joined #archiveteam-bs
[06:47] *** deevious has joined #archiveteam-bs
[06:48] *** Despatche has quit IRC (Quit: Connection reset by deer)
[06:55] *** VADemon has quit IRC (Read error: Connection reset by peer)
[06:56] *** VADemon has joined #archiveteam-bs
[07:16] *** achip has quit IRC (Read error: Operation timed out)
[07:18] *** achip has joined #archiveteam-bs
[07:26] *** VADemon has quit IRC (Read error: Connection reset by peer)
[07:27] *** VADemon has joined #archiveteam-bs
[07:43] *** jianaran has quit IRC (Ping timeout: 252 seconds)
[08:16] *** Mateon1 has quit IRC (Quit: Mateon1)
[08:17] *** Mateon1 has joined #archiveteam-bs
[08:21] *** Sk1d has quit IRC (Read error: Operation timed out)
[08:24] *** Sk1d has joined #archiveteam-bs
[08:24] *** Oddly has joined #archiveteam-bs
[08:25] *** Oddly2 has joined #archiveteam-bs
[08:29] *** ubahn has joined #archiveteam-bs
[08:30] *** Sk1d has quit IRC (Read error: Operation timed out)
[08:33] *** Oddly has quit IRC (Quit: Leaving)
[08:33] *** Sk1d has joined #archiveteam-bs
[08:34] *** ubahn has quit IRC (Quit: ubahn)
[09:04] *** Mateon1 has quit IRC (Read error: Operation timed out)
[09:05] *** Mateon1 has joined #archiveteam-bs
[09:22] *** wp494 has quit IRC (Ping timeout: 615 seconds)
[09:22] *** Exairnous has quit IRC (Ping timeout: 615 seconds)
[09:23] *** wp494 has joined #archiveteam-bs
[09:44] *** BlueMax has quit IRC (Quit: Leaving)
[09:45] *** Sk1d has quit IRC (Read error: Operation timed out)
[09:48] *** Sk1d has joined #archiveteam-bs
[10:08] *** PhrackD has quit IRC (Read error: Connection reset by peer)
[10:12] *** PhrackD has joined #archiveteam-bs
[10:42] *** godane has quit IRC (Ping timeout: 360 seconds)
[10:59] *** godane has joined #archiveteam-bs
[11:03] *** Sk1d has quit IRC (Read error: Operation timed out)
[11:06] *** Sk1d has joined #archiveteam-bs
[11:29] *** LFlare has quit IRC (Remote host closed the connection)
[11:56] *** Oddly2 has quit IRC (Read error: Operation timed out)
[11:59] *** Sk1d has quit IRC (Read error: Operation timed out)
[12:02] *** Sk1d has joined #archiveteam-bs
[12:22] so looks like my new vcr at least doesn't have an audio/video sync issue
[12:22] one tape called david (which is a musical about King David) was having some bad audio/video sync issue
[12:23] the guy in it has a head mic so i didn't get why it would have that
[12:24] anyways digitizing batman two-face tape i got
[13:58] *** bitBaron has joined #archiveteam-bs
[13:59] *** sep332 has quit IRC (Read error: Operation timed out)
[14:00] *** sep332 has joined #archiveteam-bs
[14:30] *** fredgido has quit IRC (Read error: Operation timed out)
[14:58] *** bitBaron has quit IRC (My computer has gone to sleep. 😴😪ZZZzzz…)
[15:03] *** bitBaron has joined #archiveteam-bs
[15:27] *** Hani has quit IRC (Read error: Connection reset by peer)
[15:27] *** Hani has joined #archiveteam-bs
[17:08] what's the process for downloading archiveteam dumps and searching the contents?
[17:08] I'm fine with a grep-like approach
[17:09] but having trouble figuring out which tools to use for parsing the files
[18:02] *** bitBaron has quit IRC (Read error: Connection reset by peer)
[18:09] *** C4K3 has joined #archiveteam-bs
[18:12] *** C4K3_ has quit IRC (Ping timeout: 252 seconds)
[18:19] *** ubahn has joined #archiveteam-bs
[18:20] *** wp494 has quit IRC (Read error: Operation timed out)
[18:21] *** wp494 has joined #archiveteam-bs
[18:47] *** ubahn has quit IRC (Quit: ubahn)
[18:48] *** schbirid has joined #archiveteam-bs
[18:59] *** Gfy has quit IRC (Quit: I'll be back!)
[19:06] *** phuz has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
[19:07] *** phuzion has joined #archiveteam-bs
[19:08] *** Exairnous has joined #archiveteam-bs
[19:11] *** ubahn has joined #archiveteam-bs
[19:12] *** Sk1d has quit IRC (Read error: Operation timed out)
[19:13] *** Gfy has joined #archiveteam-bs
[19:15] *** Sk1d has joined #archiveteam-bs
[19:19] *** Sk1d has quit IRC (Read error: Operation timed out)
[19:22] *** Sk1d has joined #archiveteam-bs
[19:33] *** sep332_ has quit IRC (konversation out)
[19:34] *** Sk1d has quit IRC (Read error: Operation timed out)
[19:37] *** Sk1d has joined #archiveteam-bs
[19:59] *** bitBaron has joined #archiveteam-bs
[20:54] *** ubahn has quit IRC (Quit: ubahn)
[21:00] *** ubahn has joined #archiveteam-bs
[21:01] apache2: it depends on the project in question
[21:02] but there is a good chance you're dealing with WARC files
[21:03] if WARC, then warcat is similar in style to grep
[21:03] https://github.com/chfoo/warcat
[21:04] disclaimer: I've not used this
[21:08] *** ubahn has quit IRC (Quit: ubahn)
[21:08] *** ubahn has joined #archiveteam-bs
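Both WARC questions in this log (pulling titles out of concatenated twitter .warc.gz files with a regexp instead of BeautifulSoup, and a grep-like search over a dump) come down to walking the records and matching against each payload. The following is only a rough sketch of that idea using the Python standard library; it is not the pastebin script, not warcat, and not warcio. The function names (iter_warc_records, iter_payloads, extract_titles, grep_warc) are made up for illustration. It assumes well-formed records and HTTP bodies stored without gzip or chunked transfer encoding; for real dumps, warcio's ArchiveIterator with record.content_stream().read() is likely the more robust route.

```python
import gzip
import re
from typing import Dict, Iterator, Tuple

# Loose <title> matcher: a regexp stand-in for a full HTML parser.
TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def iter_warc_records(path: str) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, block) for each record in a .warc.gz file."""
    # gzip.open reads straight through concatenated gzip members.
    with gzip.open(path, "rb") as stream:
        while True:
            line = stream.readline()
            if not line:
                return  # end of file
            if not line.startswith(b"WARC/"):
                continue  # blank separator lines between records
            headers: Dict[str, str] = {}
            while True:
                line = stream.readline()
                if line in (b"\r\n", b"\n", b""):
                    break  # a blank line ends the WARC header block
                name, _, value = line.decode("utf-8", "replace").partition(":")
                headers[name.strip().lower()] = value.strip()
            length = int(headers.get("content-length", "0"))
            yield headers, stream.read(length)


def iter_payloads(path: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (target URI, HTTP body) for each response record."""
    for headers, block in iter_warc_records(path):
        if headers.get("warc-type") != "response":
            continue
        # Drop the HTTP status line and headers; keep the body as stored.
        _, _, payload = block.partition(b"\r\n\r\n")
        yield headers.get("warc-target-uri", ""), payload


def extract_titles(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (URI, page title) using a regexp rather than an HTML parser."""
    for uri, payload in iter_payloads(path):
        match = TITLE_RE.search(payload)
        if match:
            yield uri, match.group(1).decode("utf-8", "replace").strip()


def grep_warc(path: str, pattern: bytes) -> Iterator[Tuple[str, str]]:
    """Yield (URI, matched text) for every regexp hit in a response body."""
    regex = re.compile(pattern)
    for uri, payload in iter_payloads(path):
        for match in regex.finditer(payload):
            yield uri, match.group(0).decode("utf-8", "replace")
```

Something like `for uri, title in extract_titles("tweets.warc.gz"): print(uri, title)` would then produce a text listing that plain grep can search; the file name here is a placeholder. Reading exactly Content-Length bytes per record keeps the scan from mistaking payload text for a record boundary.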
[21:08] *** ubahn has quit IRC (Client Quit)
[21:20] *** wyatt8740 has quit IRC (Ping timeout: 246 seconds)
[21:21] *** wyatt8740 has joined #archiveteam-bs
[21:23] *** cast has joined #archiveteam-bs
[21:42] *** schbirid has quit IRC (Remote host closed the connection)
[21:54] *** Despatche has joined #archiveteam-bs
[22:06] *** BlueMax has joined #archiveteam-bs
[22:14] *** wyatt8740 has quit IRC (Ping timeout: 255 seconds)
[22:28] *** Oddly2 has joined #archiveteam-bs
[22:31] betamax: thanks!
[22:36] *** fredgido has joined #archiveteam-bs
[22:40] *** Sk1d has quit IRC (Read error: Operation timed out)
[22:40] *** godane has quit IRC (Read error: Connection reset by peer)
[22:43] *** Sk1d has joined #archiveteam-bs
[22:49] *** Despatche has quit IRC (Read error: Operation timed out)
[22:56] *** godane has joined #archiveteam-bs
[22:58] *** Oddly2 has quit IRC (Read error: Operation timed out)
[23:08] *** second has quit IRC (Quit: ZNC 1.6.5 - http://znc.in)
[23:11] *** second has joined #archiveteam-bs
[23:17] *** chimyatta has quit IRC (Quit: quitting)
[23:34] *** chauffer has quit IRC (Ping timeout: 615 seconds)
[23:38] *** cast has quit IRC (Remote host closed the connection)
[23:39] *** cast has joined #archiveteam-bs
[23:44] *** cast has quit IRC (Remote host closed the connection)
[23:45] *** Sk1d has quit IRC (Read error: Operation timed out)
[23:47] *** Sk1d has joined #archiveteam-bs
[23:50] *** DFJustin has quit IRC (Remote host closed the connection)
[23:51] *** DFJustin has joined #archiveteam-bs
[23:55] *** martle has quit IRC (ZNC 1.7.2 - https://znc.in)
[23:58] *** martle has joined #archiveteam-bs