#archiveteam-bs 2019-02-11,Mon


Time Nickname Message
00:02 🔗 zyphlar_ has joined #archiveteam-bs
00:02 🔗 Sk1d has joined #archiveteam-bs
00:09 🔗 Sk1d has quit IRC (Read error: Operation timed out)
00:11 🔗 Sk1d has joined #archiveteam-bs
00:20 🔗 wp494 has quit IRC (Ping timeout: 506 seconds)
00:20 🔗 wp494 has joined #archiveteam-bs
00:25 🔗 godane has quit IRC (Ping timeout: 252 seconds)
00:38 🔗 BlueMax has quit IRC (Quit: Leaving)
00:44 🔗 godane has joined #archiveteam-bs
01:42 🔗 Flashfire godane Sorry I should have done more research
01:42 🔗 Flashfire The VPN is useful though
01:50 🔗 Sk1d has quit IRC (Read error: Operation timed out)
01:54 🔗 Sk1d has joined #archiveteam-bs
01:54 🔗 bithippo has quit IRC (My MacBook Air has gone to sleep. ZZZzzz…)
02:05 🔗 fuzy802 has joined #archiveteam-bs
02:12 🔗 zyphlar_ has quit IRC (Quit: Connection closed for inactivity)
02:12 🔗 fuzzy8021 has quit IRC (Ping timeout: 615 seconds)
02:15 🔗 fuzy802 is now known as fuzzy8021
04:03 🔗 BlueMax has joined #archiveteam-bs
04:32 🔗 odemgi has joined #archiveteam-bs
04:33 🔗 qw3rty115 has joined #archiveteam-bs
04:35 🔗 odemgi_ has quit IRC (Ping timeout: 252 seconds)
04:39 🔗 qw3rty114 has quit IRC (Read error: Operation timed out)
04:41 🔗 odemg has quit IRC (Ping timeout: 615 seconds)
04:47 🔗 odemg has joined #archiveteam-bs
05:07 🔗 newbie98 has joined #archiveteam-bs
05:13 🔗 newbie98 Can anyone give me a hand processing some twitter WARCs? I've got a bot scraping some twitter content on a regular basis and spitting out .warc.gz files, which I then concatenate. I'd like to turn the concatenated WARC into a searchable text file of tweets + URLs, and I've put together this basic script to do it: https://pastebin.com/8e05En8C
05:13 🔗 newbie98 ...but it seems to take *forever* to run
05:13 🔗 newbie98 is now known as jianaran
05:13 🔗 jianaran (whoops)
05:15 🔗 jianaran I'm not very good at python, but this seems to take ~2 seconds per record which seems insane.
05:19 🔗 ivan_ BeautifulSoup is probably the slow part
05:20 🔗 ivan_ you can use a tracing profiler to confirm
05:20 🔗 ivan_ https://github.com/uber/pyflame or https://github.com/benfred/py-spy or https://github.com/kwlzn/pytracing I have only used the first
05:21 🔗 ivan_ if you're looking for just the title you might as well use a regexp
05:21 🔗 ivan_ jianaran: ^
05:22 🔗 ivan_ you can check how fast it runs without the BeautifulSoup before doing anything else
05:27 🔗 jianaran Replacing BS with a simple regexp would be pretty easy; I'll try that. Thank you
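A minimal sketch of the regexp approach ivan_ suggests, assuming the goal is to pull the <title> text out of each HTML payload. The pastebin script is not reproduced in this log, so the names TITLE_RE and extract_title are hypothetical and not taken from it.

```python
import re

# Matches the first <title>...</title> pair; case-insensitive, and DOTALL so
# that titles spanning several lines are still captured.
TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html_bytes, encoding="utf-8"):
    """Return the page title as text, or None if no <title> element is found.

    Hypothetical helper: the UTF-8 default is an assumption about the pages.
    """
    match = TITLE_RE.search(html_bytes)
    if match is None:
        return None
    text = match.group(1).decode(encoding, errors="replace")
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(text.split())
```

A regexp like this skips building a full parse tree, which is the likely saving over BeautifulSoup, but it does not decode HTML entities and will miss malformed markup.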
05:27 🔗 jianaran Is the gz (de)compression normally an issue?
05:35 🔗 ivan_ no
05:48 🔗 jianaran OK, thank you. ivan_: can you tell me how to extract raw text from a warcio object?
05:52 🔗 ivan_ jianaran: something other than record.raw_stream?
05:53 🔗 jianaran record.raw_stream seems to give a warcio.limitreader.LimitReader object, though, which I'm not quite sure how to access or parse from
05:53 🔗 ivan_ try .read()
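A rough sketch of what the .read() suggestion looks like in practice, assuming the third-party warcio package is installed. The function names iter_response_bodies and read_warc are made up for illustration; ArchiveIterator, rec_type, rec_headers.get_header and raw_stream are the warcio names used in the conversation or its documentation.

```python
def iter_response_bodies(records):
    """Yield (url, body_bytes) for each 'response' record in an iterable.

    Hypothetical helper; it only relies on the record attributes named above.
    """
    for record in records:
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        # raw_stream is a file-like LimitReader; read() returns the remaining
        # payload bytes as stored, so any HTTP content encoding is still applied.
        yield url, record.raw_stream.read()


def read_warc(path):
    """Iterate over the response bodies in a (possibly concatenated) .warc.gz."""
    # Imported here so the helper above can be used without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(path, "rb") as stream:
        yield from iter_response_bodies(ArchiveIterator(stream))
```

Each record has to be read while the iterator is positioned on it, which is why the body is read inside the loop. If the payload turns out to be gzip- or chunk-encoded, warcio's record.content_stream().read() is the decoded alternative.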
05:54 🔗 tomaspark has quit IRC (Remote host closed the connection)
05:55 🔗 chimyatta has joined #archiveteam-bs
05:59 🔗 Sk1d has quit IRC (Read error: Operation timed out)
06:02 🔗 Sk1d has joined #archiveteam-bs
06:06 🔗 Atom has joined #archiveteam-bs
06:19 🔗 Sk1d has quit IRC (Read error: Operation timed out)
06:22 🔗 Sk1d has joined #archiveteam-bs
06:47 🔗 deevious has joined #archiveteam-bs
06:48 🔗 Despatche has quit IRC (Quit: Connection reset by deer)
06:55 🔗 VADemon has quit IRC (Read error: Connection reset by peer)
06:56 🔗 VADemon has joined #archiveteam-bs
07:16 🔗 achip has quit IRC (Read error: Operation timed out)
07:18 🔗 achip has joined #archiveteam-bs
07:26 🔗 VADemon has quit IRC (Read error: Connection reset by peer)
07:27 🔗 VADemon has joined #archiveteam-bs
07:43 🔗 jianaran has quit IRC (Ping timeout: 252 seconds)
08:16 🔗 Mateon1 has quit IRC (Quit: Mateon1)
08:17 🔗 Mateon1 has joined #archiveteam-bs
08:21 🔗 Sk1d has quit IRC (Read error: Operation timed out)
08:24 🔗 Sk1d has joined #archiveteam-bs
08:24 🔗 Oddly has joined #archiveteam-bs
08:25 🔗 Oddly2 has joined #archiveteam-bs
08:29 🔗 ubahn has joined #archiveteam-bs
08:30 🔗 Sk1d has quit IRC (Read error: Operation timed out)
08:33 🔗 Oddly has quit IRC (Quit: Leaving)
08:33 🔗 Sk1d has joined #archiveteam-bs
08:34 🔗 ubahn has quit IRC (Quit: ubahn)
09:04 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
09:05 🔗 Mateon1 has joined #archiveteam-bs
09:22 🔗 wp494 has quit IRC (Ping timeout: 615 seconds)
09:22 🔗 Exairnous has quit IRC (Ping timeout: 615 seconds)
09:23 🔗 wp494 has joined #archiveteam-bs
09:44 🔗 BlueMax has quit IRC (Quit: Leaving)
09:45 🔗 Sk1d has quit IRC (Read error: Operation timed out)
09:48 🔗 Sk1d has joined #archiveteam-bs
10:08 🔗 PhrackD has quit IRC (Read error: Connection reset by peer)
10:12 🔗 PhrackD has joined #archiveteam-bs
10:42 🔗 godane has quit IRC (Ping timeout: 360 seconds)
10:59 🔗 godane has joined #archiveteam-bs
11:03 🔗 Sk1d has quit IRC (Read error: Operation timed out)
11:06 🔗 Sk1d has joined #archiveteam-bs
11:29 🔗 LFlare has quit IRC (Remote host closed the connection)
11:56 🔗 Oddly2 has quit IRC (Read error: Operation timed out)
11:59 🔗 Sk1d has quit IRC (Read error: Operation timed out)
12:02 🔗 Sk1d has joined #archiveteam-bs
12:22 🔗 godane so looks like my new vcr at least doesn't have an audio/video sync issue
12:22 🔗 godane one tape called david (which is a musical about King David) was having some bad audio/video sync issue
12:23 🔗 godane the guy in it has a head mic so i didn't get why it would have that
12:24 🔗 godane anyways digitizing batman two-face tape i got
13:58 🔗 bitBaron has joined #archiveteam-bs
13:59 🔗 sep332 has quit IRC (Read error: Operation timed out)
14:00 🔗 sep332 has joined #archiveteam-bs
14:30 🔗 fredgido has quit IRC (Read error: Operation timed out)
14:58 🔗 bitBaron has quit IRC (My computer has gone to sleep. 😴😪ZZZzzz…)
15:03 🔗 bitBaron has joined #archiveteam-bs
15:27 🔗 Hani has quit IRC (Read error: Connection reset by peer)
15:27 🔗 Hani has joined #archiveteam-bs
17:08 🔗 apache2 what's the process for downloading archiveteam dumps and searching the contents?
17:08 🔗 apache2 I'm fine with a grep-like approach
17:09 🔗 apache2 but having trouble figuring out which tools to use for parsing the files
18:02 🔗 bitBaron has quit IRC (Read error: Connection reset by peer)
18:09 🔗 C4K3 has joined #archiveteam-bs
18:12 🔗 C4K3_ has quit IRC (Ping timeout: 252 seconds)
18:19 🔗 ubahn has joined #archiveteam-bs
18:20 🔗 wp494 has quit IRC (Read error: Operation timed out)
18:21 🔗 wp494 has joined #archiveteam-bs
18:47 🔗 ubahn has quit IRC (Quit: ubahn)
18:48 🔗 schbirid has joined #archiveteam-bs
18:59 🔗 Gfy has quit IRC (Quit: I'll be back!)
19:06 🔗 phuz has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
19:07 🔗 phuzion has joined #archiveteam-bs
19:08 🔗 Exairnous has joined #archiveteam-bs
19:11 🔗 ubahn has joined #archiveteam-bs
19:12 🔗 Sk1d has quit IRC (Read error: Operation timed out)
19:13 🔗 Gfy has joined #archiveteam-bs
19:15 🔗 Sk1d has joined #archiveteam-bs
19:19 🔗 Sk1d has quit IRC (Read error: Operation timed out)
19:22 🔗 Sk1d has joined #archiveteam-bs
19:33 🔗 sep332_ has quit IRC (konversation out)
19:34 🔗 Sk1d has quit IRC (Read error: Operation timed out)
19:37 🔗 Sk1d has joined #archiveteam-bs
19:59 🔗 bitBaron has joined #archiveteam-bs
20:54 🔗 ubahn has quit IRC (Quit: ubahn)
21:00 🔗 ubahn has joined #archiveteam-bs
21:01 🔗 betamax apache2: it depends on the project in question
21:02 🔗 betamax but there is a good chance you're dealing with WARC files
21:03 🔗 betamax if WARC, then warcat is similar in style to grep
21:03 🔗 betamax https://github.com/chfoo/warcat
21:04 🔗 betamax disclaimer: I've not used this
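For the grep-like approach apache2 mentions, here is a standard-library sketch that does not use warcat. It assumes the dump is a gzip-compressed WARC (.warc.gz); the function name grep_gz is hypothetical. Python's gzip module reads concatenated gzip members transparently, which is how .warc.gz files are laid out.

```python
import gzip
import re


def grep_gz(path, pattern):
    """Yield (line_number, line) for every line in a gzip file matching pattern.

    pattern is a bytes regular expression, e.g. rb"WARC-Target-URI".
    """
    regex = re.compile(pattern)
    with gzip.open(path, "rb") as fh:
        for lineno, line in enumerate(fh, start=1):
            if regex.search(line):
                # Strip the trailing CRLF/LF so matches print cleanly.
                yield lineno, line.rstrip(b"\r\n")
```

For example, list(grep_gz("dump.warc.gz", rb"WARC-Target-URI")) would list the captured URLs, where dump.warc.gz is a placeholder filename. This only sees bytes as stored in the WARC, so response bodies that were themselves served compressed will not be searchable this way.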
21:08 🔗 ubahn has quit IRC (Quit: ubahn)
21:08 🔗 ubahn has joined #archiveteam-bs
21:08 🔗 ubahn has quit IRC (Client Quit)
21:20 🔗 wyatt8740 has quit IRC (Ping timeout: 246 seconds)
21:21 🔗 wyatt8740 has joined #archiveteam-bs
21:23 🔗 cast has joined #archiveteam-bs
21:42 🔗 schbirid has quit IRC (Remote host closed the connection)
21:54 🔗 Despatche has joined #archiveteam-bs
22:06 🔗 BlueMax has joined #archiveteam-bs
22:14 🔗 wyatt8740 has quit IRC (Ping timeout: 255 seconds)
22:28 🔗 Oddly2 has joined #archiveteam-bs
22:31 🔗 apache2 betamax: thanks!
22:36 🔗 fredgido has joined #archiveteam-bs
22:40 🔗 Sk1d has quit IRC (Read error: Operation timed out)
22:40 🔗 godane has quit IRC (Read error: Connection reset by peer)
22:43 🔗 Sk1d has joined #archiveteam-bs
22:49 🔗 Despatche has quit IRC (Read error: Operation timed out)
22:56 🔗 godane has joined #archiveteam-bs
22:58 🔗 Oddly2 has quit IRC (Read error: Operation timed out)
23:08 🔗 second has quit IRC (Quit: ZNC 1.6.5 - http://znc.in)
23:11 🔗 second has joined #archiveteam-bs
23:17 🔗 chimyatta has quit IRC (Quit: quitting)
23:34 🔗 chauffer has quit IRC (Ping timeout: 615 seconds)
23:38 🔗 cast has quit IRC (Remote host closed the connection)
23:39 🔗 cast has joined #archiveteam-bs
23:44 🔗 cast has quit IRC (Remote host closed the connection)
23:45 🔗 Sk1d has quit IRC (Read error: Operation timed out)
23:47 🔗 Sk1d has joined #archiveteam-bs
23:50 🔗 DFJustin has quit IRC (Remote host closed the connection)
23:51 🔗 DFJustin has joined #archiveteam-bs
23:55 🔗 martle has quit IRC (ZNC 1.7.2 - https://znc.in)
23:58 🔗 martle has joined #archiveteam-bs
