Time |
Nickname |
Message |
00:02
🔗
|
|
zyphlar_ has joined #archiveteam-bs |
00:02
🔗
|
|
Sk1d has joined #archiveteam-bs |
00:09
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
00:11
🔗
|
|
Sk1d has joined #archiveteam-bs |
00:20
🔗
|
|
wp494 has quit IRC (Ping timeout: 506 seconds) |
00:20
🔗
|
|
wp494 has joined #archiveteam-bs |
00:25
🔗
|
|
godane has quit IRC (Ping timeout: 252 seconds) |
00:38
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
00:44
🔗
|
|
godane has joined #archiveteam-bs |
01:42
🔗
|
Flashfire |
godane Sorry I should have done more research |
01:42
🔗
|
Flashfire |
The VPN is useful though |
01:50
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
01:54
🔗
|
|
Sk1d has joined #archiveteam-bs |
01:54
🔗
|
|
bithippo has quit IRC (My MacBook Air has gone to sleep. ZZZzzz…) |
02:05
🔗
|
|
fuzy802 has joined #archiveteam-bs |
02:12
🔗
|
|
zyphlar_ has quit IRC (Quit: Connection closed for inactivity) |
02:12
🔗
|
|
fuzzy8021 has quit IRC (Ping timeout: 615 seconds) |
02:15
🔗
|
|
fuzy802 is now known as fuzzy8021 |
04:03
🔗
|
|
BlueMax has joined #archiveteam-bs |
04:32
🔗
|
|
odemgi has joined #archiveteam-bs |
04:33
🔗
|
|
qw3rty115 has joined #archiveteam-bs |
04:35
🔗
|
|
odemgi_ has quit IRC (Ping timeout: 252 seconds) |
04:39
🔗
|
|
qw3rty114 has quit IRC (Read error: Operation timed out) |
04:41
🔗
|
|
odemg has quit IRC (Ping timeout: 615 seconds) |
04:47
🔗
|
|
odemg has joined #archiveteam-bs |
05:07
🔗
|
|
newbie98 has joined #archiveteam-bs |
05:13
🔗
|
newbie98 |
Can anyone give me a hand processing some twitter WARCs? I've got a bot scraping some twitter content on a regular basis and spitting out .warc.gz files, which I then concatenate. I'd like to turn the concatenated WARC into a searchable text file of tweets + URLs, and I've put together this basic script to do it: https://pastebin.com/8e05En8C |
05:13
🔗
|
newbie98 |
...but it seems to take *forever* to run |
05:13
🔗
|
|
newbie98 is now known as jianaran |
05:13
🔗
|
jianaran |
(whoops) |
05:15
🔗
|
jianaran |
I'm not very good at python, but this seems to take ~2 seconds per record which seems insane. |
05:19
🔗
|
ivan_ |
BeautifulSoup is probably the slow part |
05:20
🔗
|
ivan_ |
you can use a tracing profiler to confirm |
05:20
🔗
|
ivan_ |
https://github.com/uber/pyflame or https://github.com/benfred/py-spy or https://github.com/kwlzn/pytracing I have only used the first |
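(Editor's sketch, not part of the conversation.) The tools linked above each need a separate install. As a no-install way to confirm where the time goes, the standard library's `cProfile` can be swapped in; it is a different profiler from the ones ivan_ names. `slow_parse` and `profile_call` are hypothetical names used only for illustration.

```python
import cProfile
import io
import pstats
import re


def slow_parse(html):
    # Hypothetical stand-in for the per-record parsing step under suspicion.
    return re.findall(r'<a\s[^>]*href="([^"]*)"', html)


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


if __name__ == "__main__":
    page = '<a href="https://example.com/1">x</a>' * 1000
    links, report = profile_call(slow_parse, page)
    print(len(links))
    print(report)
```

Wrapping the real per-record function in `profile_call` should show whether the HTML parser or something else dominates the runtime.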
05:21
🔗
|
ivan_ |
if you're looking for just the title you might as well use a regexp |
05:21
🔗
|
ivan_ |
jianaran: ^ |
05:22
🔗
|
ivan_ |
you can check how fast it runs without the BeautifulSoup before doing anything else |
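(Editor's sketch, not part of the conversation.) A minimal version of the regexp approach suggested above, assuming only the `<title>` element is wanted and that the page bytes are UTF-8 or close to it. `TITLE_RE` and `extract_title` are hypothetical names. A regexp like this is not a general HTML parser and can be fooled by unusual markup.

```python
import html
import re

# Matches the first <title>...</title> pair; DOTALL lets a title span lines.
TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html_bytes):
    """Return the page title as text, or None if no <title> element is found."""
    match = TITLE_RE.search(html_bytes)
    if match is None:
        return None
    # Replace undecodable bytes rather than fail on a single bad page.
    text = match.group(1).decode("utf-8", errors="replace")
    # Collapse runs of whitespace and resolve entities such as &amp;.
    return html.unescape(" ".join(text.split()))
```

Because the pattern is compiled once and stops at the first match, the cost per record should be small compared with building a full parse tree.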
05:27
🔗
|
jianaran |
Replacing BS with a simple regexp would be pretty easy; I'll try that. Thank you |
05:27
🔗
|
jianaran |
Is the gz (de)compression normally an issue? |
05:35
🔗
|
ivan_ |
no |
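(Editor's sketch, not part of the conversation.) One way to check this on a particular file is to time decompression alone, with no WARC parsing at all. The sketch uses only the standard library; `time_gunzip` is a hypothetical helper name. Python's `gzip` module reads through concatenated gzip members, so a file made by concatenating several .warc.gz files is handled as one stream.

```python
import gzip
import time


def time_gunzip(path, chunk_size=1 << 20):
    """Decompress path fully, discarding output; return (byte_count, seconds)."""
    total = 0
    start = time.perf_counter()
    with gzip.open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start
```

If this finishes quickly on the concatenated file while the full script still takes seconds per record, decompression is unlikely to be the bottleneck.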
05:48
🔗
|
jianaran |
OK, thank you. ivan_: can you tell me how to extract raw text from a warcio object? |
05:52
🔗
|
ivan_ |
jianaran: something other than record.raw_stream? |
05:53
🔗
|
jianaran |
record.raw_stream seems to give a warcio.limitreader.LimitReader object, though, which I'm not quite sure how to access or parse from |
05:53
🔗
|
ivan_ |
try .read() |
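(Editor's sketch, not part of the conversation.) Calling `.read()` on a warcio record might look like the following. It assumes the third-party `warcio` package is installed; `read_payload` and `iter_responses` are hypothetical helper names. `raw_stream` yields the payload bytes as stored, while `content_stream()` additionally undoes any Content-Encoding or chunked transfer encoding.

```python
def read_payload(record, decode=True):
    """Return a record's payload bytes.

    decode=True uses content_stream(), which undoes Content-Encoding and
    chunked transfer encoding; decode=False reads raw_stream as stored.
    """
    stream = record.content_stream() if decode else record.raw_stream
    return stream.read()


def iter_responses(path):
    """Yield (url, payload_bytes) for each response record in a WARC file."""
    # warcio is third-party (pip install warcio); it is imported lazily so
    # read_payload can be used and tested without it.
    from warcio.archiveiterator import ArchiveIterator

    with open(path, "rb") as fh:
        for record in ArchiveIterator(fh):
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                # Read inside the loop: the stream is only valid until the
                # iterator advances to the next record.
                yield url, read_payload(record)
```

The returned value is `bytes`, so it can be passed straight to a bytes regexp or decoded with `.decode("utf-8", errors="replace")`.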
05:54
🔗
|
|
tomaspark has quit IRC (Remote host closed the connection) |
05:55
🔗
|
|
chimyatta has joined #archiveteam-bs |
05:59
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
06:02
🔗
|
|
Sk1d has joined #archiveteam-bs |
06:06
🔗
|
|
Atom has joined #archiveteam-bs |
06:19
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
06:22
🔗
|
|
Sk1d has joined #archiveteam-bs |
06:47
🔗
|
|
deevious has joined #archiveteam-bs |
06:48
🔗
|
|
Despatche has quit IRC (Quit: Connection reset by deer) |
06:55
🔗
|
|
VADemon has quit IRC (Read error: Connection reset by peer) |
06:56
🔗
|
|
VADemon has joined #archiveteam-bs |
07:16
🔗
|
|
achip has quit IRC (Read error: Operation timed out) |
07:18
🔗
|
|
achip has joined #archiveteam-bs |
07:26
🔗
|
|
VADemon has quit IRC (Read error: Connection reset by peer) |
07:27
🔗
|
|
VADemon has joined #archiveteam-bs |
07:43
🔗
|
|
jianaran has quit IRC (Ping timeout: 252 seconds) |
08:16
🔗
|
|
Mateon1 has quit IRC (Quit: Mateon1) |
08:17
🔗
|
|
Mateon1 has joined #archiveteam-bs |
08:21
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
08:24
🔗
|
|
Sk1d has joined #archiveteam-bs |
08:24
🔗
|
|
Oddly has joined #archiveteam-bs |
08:25
🔗
|
|
Oddly2 has joined #archiveteam-bs |
08:29
🔗
|
|
ubahn has joined #archiveteam-bs |
08:30
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
08:33
🔗
|
|
Oddly has quit IRC (Quit: Leaving) |
08:33
🔗
|
|
Sk1d has joined #archiveteam-bs |
08:34
🔗
|
|
ubahn has quit IRC (Quit: ubahn) |
09:04
🔗
|
|
Mateon1 has quit IRC (Read error: Operation timed out) |
09:05
🔗
|
|
Mateon1 has joined #archiveteam-bs |
09:22
🔗
|
|
wp494 has quit IRC (Ping timeout: 615 seconds) |
09:22
🔗
|
|
Exairnous has quit IRC (Ping timeout: 615 seconds) |
09:23
🔗
|
|
wp494 has joined #archiveteam-bs |
09:44
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
09:45
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
09:48
🔗
|
|
Sk1d has joined #archiveteam-bs |
10:08
🔗
|
|
PhrackD has quit IRC (Read error: Connection reset by peer) |
10:12
🔗
|
|
PhrackD has joined #archiveteam-bs |
10:42
🔗
|
|
godane has quit IRC (Ping timeout: 360 seconds) |
10:59
🔗
|
|
godane has joined #archiveteam-bs |
11:03
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
11:06
🔗
|
|
Sk1d has joined #archiveteam-bs |
11:29
🔗
|
|
LFlare has quit IRC (Remote host closed the connection) |
11:56
🔗
|
|
Oddly2 has quit IRC (Read error: Operation timed out) |
11:59
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
12:02
🔗
|
|
Sk1d has joined #archiveteam-bs |
12:22
🔗
|
godane |
so looks like my new vcr at least doesn't have an audio/video sync issue |
12:22
🔗
|
godane |
one tape called david (which is a musical about King David) was having some bad audio/video sync issues |
12:23
🔗
|
godane |
the guy in it has a head mic so i didn't get why it would have that |
12:24
🔗
|
godane |
anyways digitizing batman two-face tape i got |
13:58
🔗
|
|
bitBaron has joined #archiveteam-bs |
13:59
🔗
|
|
sep332 has quit IRC (Read error: Operation timed out) |
14:00
🔗
|
|
sep332 has joined #archiveteam-bs |
14:30
🔗
|
|
fredgido has quit IRC (Read error: Operation timed out) |
14:58
🔗
|
|
bitBaron has quit IRC (My computer has gone to sleep. 😴😪ZZZzzz…) |
15:03
🔗
|
|
bitBaron has joined #archiveteam-bs |
15:27
🔗
|
|
Hani has quit IRC (Read error: Connection reset by peer) |
15:27
🔗
|
|
Hani has joined #archiveteam-bs |
17:08
🔗
|
apache2 |
what's the process for downloading archiveteam dumps and searching the contents? |
17:08
🔗
|
apache2 |
I'm fine with a grep-like approach |
17:09
🔗
|
apache2 |
but having trouble figuring out which tools to use for parsing the files |
18:02
🔗
|
|
bitBaron has quit IRC (Read error: Connection reset by peer) |
18:09
🔗
|
|
C4K3 has joined #archiveteam-bs |
18:12
🔗
|
|
C4K3_ has quit IRC (Ping timeout: 252 seconds) |
18:19
🔗
|
|
ubahn has joined #archiveteam-bs |
18:20
🔗
|
|
wp494 has quit IRC (Read error: Operation timed out) |
18:21
🔗
|
|
wp494 has joined #archiveteam-bs |
18:47
🔗
|
|
ubahn has quit IRC (Quit: ubahn) |
18:48
🔗
|
|
schbirid has joined #archiveteam-bs |
18:59
🔗
|
|
Gfy has quit IRC (Quit: I'll be back!) |
19:06
🔗
|
|
phuz has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.) |
19:07
🔗
|
|
phuzion has joined #archiveteam-bs |
19:08
🔗
|
|
Exairnous has joined #archiveteam-bs |
19:11
🔗
|
|
ubahn has joined #archiveteam-bs |
19:12
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
19:13
🔗
|
|
Gfy has joined #archiveteam-bs |
19:15
🔗
|
|
Sk1d has joined #archiveteam-bs |
19:19
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
19:22
🔗
|
|
Sk1d has joined #archiveteam-bs |
19:33
🔗
|
|
sep332_ has quit IRC (konversation out) |
19:34
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
19:37
🔗
|
|
Sk1d has joined #archiveteam-bs |
19:59
🔗
|
|
bitBaron has joined #archiveteam-bs |
20:54
🔗
|
|
ubahn has quit IRC (Quit: ubahn) |
21:00
🔗
|
|
ubahn has joined #archiveteam-bs |
21:01
🔗
|
betamax |
apache2: it depends on the project in question |
21:02
🔗
|
betamax |
but there is a good chance you're dealing with WARC files |
21:03
🔗
|
betamax |
if WARC, then warcat is similar in style to grep |
21:03
🔗
|
betamax |
https://github.com/chfoo/warcat |
21:04
🔗
|
betamax |
disclaimer: I've not used this |
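(Editor's sketch, not part of the conversation.) For a grep-like pass over WARC files, one option is a few lines of Python instead. This uses the third-party `warcio` package rather than warcat, whose command-line options are not shown here. `grep_payloads` and `warc_records` are hypothetical names, and the sketch assumes only `response` records are of interest.

```python
import re


def grep_payloads(records, pattern):
    """Yield (url, matching_line) pairs, grep-style, from (url, bytes) records."""
    regex = re.compile(pattern)  # pattern must be bytes, like the payloads
    for url, payload in records:
        for line in payload.splitlines():
            if regex.search(line):
                yield url, line


def warc_records(path):
    """Yield (url, payload_bytes) for each response record, via warcio."""
    # Third-party dependency (pip install warcio), imported lazily so that
    # grep_payloads works on any iterable of (url, bytes) pairs.
    from warcio.archiveiterator import ArchiveIterator

    with open(path, "rb") as fh:
        for record in ArchiveIterator(fh):
            if record.rec_type == "response":
                yield (record.rec_headers.get_header("WARC-Target-URI"),
                       record.content_stream().read())
```

Usage would be along the lines of `for url, line in grep_payloads(warc_records("dump.warc.gz"), rb"some phrase"): print(url, line)`, where `dump.warc.gz` is a placeholder file name.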
21:08
🔗
|
|
ubahn has quit IRC (Quit: ubahn) |
21:08
🔗
|
|
ubahn has joined #archiveteam-bs |
21:08
🔗
|
|
ubahn has quit IRC (Client Quit) |
21:20
🔗
|
|
wyatt8740 has quit IRC (Ping timeout: 246 seconds) |
21:21
🔗
|
|
wyatt8740 has joined #archiveteam-bs |
21:23
🔗
|
|
cast has joined #archiveteam-bs |
21:42
🔗
|
|
schbirid has quit IRC (Remote host closed the connection) |
21:54
🔗
|
|
Despatche has joined #archiveteam-bs |
22:06
🔗
|
|
BlueMax has joined #archiveteam-bs |
22:14
🔗
|
|
wyatt8740 has quit IRC (Ping timeout: 255 seconds) |
22:28
🔗
|
|
Oddly2 has joined #archiveteam-bs |
22:31
🔗
|
apache2 |
betamax: thanks! |
22:36
🔗
|
|
fredgido has joined #archiveteam-bs |
22:40
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
22:40
🔗
|
|
godane has quit IRC (Read error: Connection reset by peer) |
22:43
🔗
|
|
Sk1d has joined #archiveteam-bs |
22:49
🔗
|
|
Despatche has quit IRC (Read error: Operation timed out) |
22:56
🔗
|
|
godane has joined #archiveteam-bs |
22:58
🔗
|
|
Oddly2 has quit IRC (Read error: Operation timed out) |
23:08
🔗
|
|
second has quit IRC (Quit: ZNC 1.6.5 - http://znc.in) |
23:11
🔗
|
|
second has joined #archiveteam-bs |
23:17
🔗
|
|
chimyatta has quit IRC (Quit: quitting) |
23:34
🔗
|
|
chauffer has quit IRC (Ping timeout: 615 seconds) |
23:38
🔗
|
|
cast has quit IRC (Remote host closed the connection) |
23:39
🔗
|
|
cast has joined #archiveteam-bs |
23:44
🔗
|
|
cast has quit IRC (Remote host closed the connection) |
23:45
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
23:47
🔗
|
|
Sk1d has joined #archiveteam-bs |
23:50
🔗
|
|
DFJustin has quit IRC (Remote host closed the connection) |
23:51
🔗
|
|
DFJustin has joined #archiveteam-bs |
23:55
🔗
|
|
martle has quit IRC (ZNC 1.7.2 - https://znc.in) |
23:58
🔗
|
|
martle has joined #archiveteam-bs |