Time |
Nickname |
Message |
00:49
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
00:53
🔗
|
|
dashcloud has joined #archiveteam-ot |
00:58
🔗
|
|
hook54321 sets mode: +o wp494 |
00:59
🔗
|
|
Stiletto has joined #archiveteam-ot |
01:01
🔗
|
|
Stilett0 has quit IRC (Ping timeout: 268 seconds) |
01:19
🔗
|
|
dashcloud has quit IRC (Remote host closed the connection) |
01:20
🔗
|
|
dashcloud has joined #archiveteam-ot |
01:23
🔗
|
ivan |
ZizzyDizz: feed `gs-dump-urls wpull.db done | grep youtube.com/watch` into youtube-dl |
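A rough Python stand-in for the `grep youtube.com/watch` step above, as a sketch only. It assumes the URL dump has one URL per line. The function name `youtube_watch_urls`, the host list, and the sample URLs are all made up for illustration. Unlike the plain substring match that grep does, it checks the host and path, so it skips URLs that merely mention YouTube in a query string.

```python
from urllib.parse import urlsplit

# Hosts treated as YouTube here; this list is an assumption, not from the log.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def youtube_watch_urls(lines):
    """Yield URLs whose host is a YouTube host and whose path is /watch."""
    for line in lines:
        url = line.strip()
        if not url:
            continue
        parts = urlsplit(url)
        if parts.hostname in YOUTUBE_HOSTS and parts.path == "/watch":
            yield url


# Hypothetical sample input, one URL per line as a dump might produce.
sample = [
    "https://www.youtube.com/watch?v=abc123\n",
    "https://example.com/page?ref=youtube.com/watch\n",  # grep would match this one
    "https://www.youtube.com/channel/xyz\n",
]
print(list(youtube_watch_urls(sample)))
```

To use it in a pipeline, pass `sys.stdin` instead of `sample` and print one URL per line; the result can then go to youtube-dl's batch-file option, which accepts `-` for stdin.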
02:55
🔗
|
ivan |
!status |
02:55
🔗
|
ivan |
oops |
02:57
🔗
|
ivan |
does anyone have 4+ idle cores somewhere that I can use to archive more twitter with grab-site? |
02:58
🔗
|
ivan |
I'm working through all the politicians and next up are the verified accounts |
02:58
🔗
|
kiska |
I have 4 free cores, but it's doing AB work |
02:59
🔗
|
kiska |
ivan: https://server5.kiska.pw/laptop/ |
02:59
🔗
|
ivan |
AB work is good work |
03:04
🔗
|
Flashfire |
If you do them as !ao on archivebot that works |
03:04
🔗
|
Flashfire |
!ao < ExampleURLList |
03:04
🔗
|
Flashfire |
ivan |
03:05
🔗
|
ivan |
I am familiar |
03:06
🔗
|
Flashfire |
Alright then I can monitor some that way |
03:06
🔗
|
ivan |
I just have too much and I don't want to mix grab-site/wpull 3 and AB |
03:07
🔗
|
ivan |
AB is backlogged anyway |
03:08
🔗
|
kiska |
AB is backlogged with !a requests, !ao isn't |
03:09
🔗
|
ivan |
Flashfire: huh that https://www.versace.com/ you found is interesting |
03:09
🔗
|
ivan |
interesting but my !ao jobs are 100K each and end up at ~700K requests, heh |
03:10
🔗
|
Flashfire |
OK, I have a core spare. How much bandwidth would it use? I have a TB per month |
03:10
🔗
|
Flashfire |
and it's only the 9th |
03:10
🔗
|
Flashfire |
What fork of grab-site do I set up? |
03:11
🔗
|
kiska |
ivan: There are at least 3 slots free for !ao requests |
03:12
🔗
|
Flashfire |
Yeah !ao jobs will be fine I can monitor them |
03:12
🔗
|
ivan |
Flashfire: I don't want to eat into your TB, it might go over |
03:13
🔗
|
Flashfire |
Ok then |
03:13
🔗
|
Flashfire |
and Versace is neat but has been bought out by another company |
03:15
🔗
|
kiska |
ivan: I need better utilisation of my network, so I'll be fine if you want to use grab-site on my server |
03:16
🔗
|
kiska |
I just need to grab a pubkey from you, since I am going to chuck you on the AB user |
03:16
🔗
|
kiska |
Or the other option is to use !ao < <list> and it'll still land on my pipeline |
03:20
🔗
|
ivan |
PMed |
03:30
🔗
|
|
odemg has quit IRC (Ping timeout: 260 seconds) |
03:43
🔗
|
|
odemg has joined #archiveteam-ot |
06:32
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
06:55
🔗
|
|
icedice has joined #archiveteam-ot |
07:06
🔗
|
|
djsundog has quit IRC (Read error: Operation timed out) |
07:11
🔗
|
|
djsundog has joined #archiveteam-ot |
08:17
🔗
|
Flashfire |
TIME TO WATCH DOCTOR WHO SEASON 11 EPISODE 1 |
08:57
🔗
|
|
VerifiedJ has joined #archiveteam-ot |
09:09
🔗
|
|
VerifiedJ has quit IRC (Read error: Connection reset by peer) |
09:11
🔗
|
|
ZizzyDizz has quit IRC (Ping timeout: 260 seconds) |
09:21
🔗
|
|
faolingf_ has quit IRC (Quit: Leaving) |
09:39
🔗
|
kiska |
Time for me to get el jannahs |
10:32
🔗
|
|
VerifiedJ has joined #archiveteam-ot |
12:10
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
12:59
🔗
|
ivan |
https://gist.github.com/ivan/e36e347875936c7933d9adc30ebc7a6e |
13:01
🔗
|
JAA |
Ah yeah, those are fun. |
13:02
🔗
|
ivan |
I have my telepathy hat on tonight |
13:03
🔗
|
ivan |
because it's a quickly reproducible regression |
13:05
🔗
|
ivan |
yep. don't read `.pattern` on a pyre2 object, don't even copy it to your own string |
13:12
🔗
|
kiska |
Guessing that is on my server? |
13:20
🔗
|
ivan |
it is not |
13:32
🔗
|
ivan |
this one's more exciting https://gist.github.com/ivan/fa91d5dcd6d4cee3d285f423f4b42846#file-the-new-html-parser-L1335 |
13:34
🔗
|
ivan |
oh sweet it repros |
13:39
🔗
|
ivan |
the unparseable page http://sm-hs.eu/index.php/smhs/article/view/sm-hs.2016.102 |
13:41
🔗
|
JAA |
<html /> |
13:42
🔗
|
JAA |
That's the one thing that jumps at me and could easily cause problems in a parser. |
13:45
🔗
|
JAA |
https://bugs.php.net/bug.php?id=76980 Now in PHP: partially defined classes! :-) |
13:52
🔗
|
ivan |
thanks, you were right |
13:52
🔗
|
ivan |
I minimized it to |
13:52
🔗
|
ivan |
import html5_parser |
13:52
🔗
|
ivan |
html5_parser.parse("<html><html />", maybe_xhtml=True) |
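The log does not say whether the minimized input fails with a Python exception or a hard crash. One cautious way to try a repro like this is to run it in a child interpreter, so that a failure shows up as an exit code instead of taking down the calling process. The sketch below is an assumption about how one might do that; `run_isolated` and `REPRO` are hypothetical names, and nothing here is taken from grab-site or html5-parser itself.

```python
import subprocess
import sys


def run_isolated(snippet, timeout=30):
    """Run a Python snippet in a child interpreter.

    Returns (exit_code, stderr_text). On POSIX, a negative exit code
    means the child was killed by a signal (for example a segfault).
    """
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stderr


# The two lines from the log, as a snippet. Running this needs the
# third-party html5-parser package, so it is not executed here.
REPRO = (
    "import html5_parser\n"
    'html5_parser.parse("<html><html />", maybe_xhtml=True)\n'
)

# Harmless demonstration of the helper itself.
code, err = run_isolated("raise SystemExit(3)")
print(code)
```

Calling `run_isolated(REPRO)` on a machine with html5-parser installed would then report how the parse ends without risking the parent process.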
15:01
🔗
|
ivan |
does anyone want to test grab-site v2? https://github.com/ludios/grab-site/tree/v2#install-on-ubuntu-1604-1804-debian-9-stretch-debian-10-buster but add @v2 to the url in the last line |
15:02
🔗
|
ivan |
https://gist.github.com/ivan/3d6d3d4f3574fa460c44204567e4184d upgrade guide |
15:05
🔗
|
ivan |
oh yeah I need to test macOS |
15:20
🔗
|
JAA |
snscrape is now on PyPI, so it can be installed with a simple 'pip install snscrape' and upgraded with 'pip install --upgrade snscrape' now. |
15:43
🔗
|
JAA |
#archiveteam-ot is becoming #archiveteam-dev. :-) |
15:46
🔗
|
jrwr |
#archiveteam-dev: You have joined too many channels |
15:47
🔗
|
astrid |
lol |
15:48
🔗
|
JAA |
Just EFNet being EFNet. |
15:48
🔗
|
jrwr |
I say we all pile onto sdf.org irc |
15:48
🔗
|
jrwr |
:) |
15:51
🔗
|
|
schbirid has joined #archiveteam-ot |
17:04
🔗
|
|
wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES) |
17:04
🔗
|
|
Stilett0 has joined #archiveteam-ot |
17:06
🔗
|
|
wp494 has joined #archiveteam-ot |
17:06
🔗
|
|
svchfoo1 sets mode: +o wp494 |
17:11
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
17:19
🔗
|
ivan |
grab-site 2.x is out, enjoy potentially exotic crashes many days into a crawl |
17:20
🔗
|
ivan |
dupespotter performance improvements may come later |
17:21
🔗
|
schbirid |
<3 |
17:26
🔗
|
|
Stiletto has joined #archiveteam-ot |
17:29
🔗
|
|
Stilett0 has quit IRC (Read error: Operation timed out) |
17:56
🔗
|
|
icedice has quit IRC (Quit: Leaving) |
17:58
🔗
|
|
Stilett0 has joined #archiveteam-ot |
17:59
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
18:33
🔗
|
|
Stiletto has joined #archiveteam-ot |
18:36
🔗
|
|
Stilett0 has quit IRC (Read error: Operation timed out) |
18:44
🔗
|
ivan |
https://twitter.com/Malaysia_Gov disappeared between the time I snscraped it a week ago and now |
18:44
🔗
|
ivan |
was suspended, I mean |
18:49
🔗
|
JAA |
Between this and Kanye West's Twitter disappearance, I feel like we should start a project to continuously archive any social media accounts by government institutions and popular figures. I think I heard that politicians on the federal level in the US are covered by IA, but I'm not sure if that's true. |
18:50
🔗
|
JAA |
Even better would be automatically archiving any tweet with a retweet/like count beyond some threshold and any profile with more than x followers. But not sure if that's even possible. |
18:50
🔗
|
JAA |
(Without paying Twitter for their enterprise API, that is.) |
18:51
🔗
|
|
schbirid has quit IRC (Remote host closed the connection) |
18:55
🔗
|
|
Mateon1 has quit IRC (Ping timeout: 252 seconds) |
18:55
🔗
|
|
Mateon1 has joined #archiveteam-ot |
19:08
🔗
|
ivan |
https://twitter.com/search?q=trump%20min_faves%3A10000&src=typd |
19:08
🔗
|
ivan |
https://twitter.com/search?q=trump%20min_retweets%3A10000&src=typd |
19:20
🔗
|
ivan |
just crawling twitter is a good way to discover popular tweets |
19:20
🔗
|
ivan |
I often set it on some search |
19:35
🔗
|
JAA |
Oh nice, secret search options. :-) |
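The two search links above differ only in the operator inside the percent-encoded `q` parameter. A small sketch of how such links could be built: the function name `twitter_search_url` is hypothetical, the `src=typd` suffix is simply copied from the links in the log, and combining both operators in one query is an untested assumption.

```python
from urllib.parse import quote


def twitter_search_url(terms, min_faves=None, min_retweets=None):
    """Build a search URL shaped like the ones pasted in the log."""
    query = terms
    if min_faves is not None:
        query += f" min_faves:{min_faves}"
    if min_retweets is not None:
        query += f" min_retweets:{min_retweets}"
    # safe="" makes quote() encode ":" as %3A and " " as %20.
    return "https://twitter.com/search?q=" + quote(query, safe="") + "&src=typd"


print(twitter_search_url("trump", min_faves=10000))
print(twitter_search_url("trump", min_retweets=10000))
```

With these arguments the two printed URLs are the same as the two links ivan pasted.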
20:07
🔗
|
JAA |
Huh, echo 'a&foo'$'\n''bar' | perl -pe 's,&[^&]+$,,m;' prints "abar" instead of "a" and "bar" on two lines. That's weird. |
20:13
🔗
|
JAA |
Changing [^&]+ to a non-greedy [^&]+? "fixes" that, but I don't understand why the former would match a newline at the end. $ is supposed to match *before* the newline... |
20:20
🔗
|
JAA |
(By the way, the m flag makes no difference.) |
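A likely explanation for the behaviour above: `perl -p` hands the substitution one line at a time with its trailing newline still attached, and `[^&]+` is free to match that newline because a newline is not `&`. A greedy match therefore runs to the very end of the string, where `$` also succeeds, and the newline is deleted along with `&foo`. The non-greedy version stops at the first position where `$` matches, which is just before the newline. Python's `re` module uses the same rule for `$`, so the effect can be reproduced there; this is an illustration in Python, not a trace of what Perl does internally.

```python
import re

# What the first input line looks like to the substitution: newline included.
line = "a&foo\n"

# Greedy: [^&]+ swallows "foo\n", then $ matches at the very end of the string.
greedy = re.sub(r"&[^&]+$", "", line)

# Non-greedy: stops after "foo", where $ matches just before the final newline.
lazy = re.sub(r"&[^&]+?$", "", line)

# Excluding the newline from the character class keeps the greedy form safe.
no_newline = re.sub(r"&[^&\n]+$", "", line)

print(repr(greedy))      # 'a'
print(repr(lazy))        # 'a\n'
print(repr(no_newline))  # 'a\n'
```

The multiline flag makes no difference here, as noted in the log, because it only adds extra places where `$` can match; the greedy pattern still reaches the end of the string first.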
20:24
🔗
|
|
BlueMax has joined #archiveteam-ot |
21:03
🔗
|
ivan |
someone is touching my software https://github.com/DuckHP/grab-site/commits/master |
21:07
🔗
|
JAA |
ivan: That's ola_norsk, but he hasn't been here in a few months. |
21:07
🔗
|
ivan |
ah |
22:01
🔗
|
JAA |
ivan: I just pushed all my wpull code to https://github.com/JustAnotherArchivist/wpull |
22:02
🔗
|
JAA |
To anyone thinking about using my fork: don't. |
22:04
🔗
|
|
m007a83_ has joined #archiveteam-ot |
22:05
🔗
|
moufu |
why |
22:05
🔗
|
|
m007a83 has quit IRC (Ping timeout: 252 seconds) |
22:05
🔗
|
|
m007a83_ is now known as m007a83 |
22:07
🔗
|
JAA |
moufu: It's untested, and I'm sure there was a reason why I didn't push this code back in Jan/Feb when I wrote it, though I don't remember the details. Testing the code would be much appreciated, but actually using it for archival is probably a bad idea at the moment. |
22:17
🔗
|
|
VerifiedJ has quit IRC (Quit: Leaving) |
22:17
🔗
|
|
m007a83_ has joined #archiveteam-ot |
22:19
🔗
|
|
m007a83 has quit IRC (Read error: Operation timed out) |
22:34
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
22:50
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
22:50
🔗
|
|
Stilett0 has joined #archiveteam-ot |
22:58
🔗
|
|
m007a83 has joined #archiveteam-ot |
23:02
🔗
|
|
m007a83_ has quit IRC (Read error: Operation timed out) |
23:04
🔗
|
|
Polylith_ has quit IRC (Read error: Operation timed out) |
23:05
🔗
|
|
svchfoo3 has quit IRC (Read error: Operation timed out) |
23:05
🔗
|
|
Polylith has joined #archiveteam-ot |
23:10
🔗
|
|
svchfoo3 has joined #archiveteam-ot |
23:11
🔗
|
|
svchfoo1 sets mode: +o svchfoo3 |
23:14
🔗
|
|
Stiletto has joined #archiveteam-ot |
23:16
🔗
|
|
Stilett0 has quit IRC (Read error: Operation timed out) |
23:23
🔗
|
|
Stilett0 has joined #archiveteam-ot |
23:24
🔗
|
|
Stiletto has quit IRC (Ping timeout: 260 seconds) |