[00:49] *** dashcloud has quit IRC (Read error: Operation timed out)
[00:53] *** dashcloud has joined #archiveteam-ot
[00:58] *** hook54321 sets mode: +o wp494
[00:59] *** Stiletto has joined #archiveteam-ot
[01:01] *** Stilett0 has quit IRC (Ping timeout: 268 seconds)
[01:19] *** dashcloud has quit IRC (Remote host closed the connection)
[01:20] *** dashcloud has joined #archiveteam-ot
[01:23] ZizzyDizz: feed `gs-dump-urls wpull.db done | grep youtube.com/watch` into youtube-dl
[02:55] !status
[02:55] oops
[02:57] does anyone have 4+ idle cores somewhere that I can use to archive more twitter with grab-site?
[02:58] I'm working through all the politicians and next up are the verified accounts
[02:58] I have 4 free cores, but it's doing AB work
[02:59] ivan: https://server5.kiska.pw/laptop/
[02:59] AB work is good work
[03:04] If you do them as !ao on archivebot that works
[03:04] !ao < ExampleURLList
[03:04] ivan
[03:05] I am familiar
[03:06] Alright then I can monitor some that way
[03:06] I just have too much and I don't want to mix grab-site/wpull 3 and AB
[03:07] AB is backlogged anyway
[03:08] AB is backlogged with !a requests, !ao isn't
[03:09] Flashfire: huh that https://www.versace.com/ you found is interesting
[03:09] interesting but my !ao jobs are 100K each and end up at ~700K requests, heh
[03:10] Ok I have a core spare, how much bandwidth would it use? I have a TB per month
[03:10] and it's only the 9th
[03:10] What fork of grab-site do I set up?
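The filtering step suggested at [01:23] can also be sketched in Python. This is only an illustration, assuming the crawl's URLs are already available as lines of text (for example from the `gs-dump-urls` command quoted in the log); the function name `youtube_watch_urls` and the sample URLs are hypothetical.

```python
from typing import Iterable, List


def youtube_watch_urls(urls: Iterable[str]) -> List[str]:
    """Keep only URLs containing 'youtube.com/watch', like the grep in the log."""
    return [u.strip() for u in urls if "youtube.com/watch" in u]


# Hypothetical sample input; real input would be the dumped URL list.
sample = [
    "https://www.youtube.com/watch?v=EXAMPLE\n",
    "https://example.com/page\n",
]
print("\n".join(youtube_watch_urls(sample)))
```

The filtered list can be written to a file and handed to youtube-dl with its `-a` (`--batch-file`) option.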
[03:11] ivan: There are at least 3 slots free for ao requests
[03:12] Yeah !ao jobs will be fine, I can monitor them
[03:12] Flashfire: I don't want to eat into your TB, it might go over
[03:13] Ok then
[03:13] and Versace is neat but has been bought out by another company
[03:15] ivan: I need better utilisation of my network, so I'll be fine if you want to use grab-site on my server
[03:16] I just need to grab a pubkey from you, since I am going to chuck you on the AB user
[03:16] Or the other option is to use !ao < and it'll still land on my pipeline
[03:20] PMed
[03:30] *** odemg has quit IRC (Ping timeout: 260 seconds)
[03:43] *** odemg has joined #archiveteam-ot
[06:32] *** dashcloud has quit IRC (Read error: Operation timed out)
[06:55] *** icedice has joined #archiveteam-ot
[07:06] *** djsundog has quit IRC (Read error: Operation timed out)
[07:11] *** djsundog has joined #archiveteam-ot
[08:17] TIME TO WATCH DOCTOR WHO SEASON 11 EPISODE 1
[08:57] *** VerifiedJ has joined #archiveteam-ot
[09:09] *** VerifiedJ has quit IRC (Read error: Connection reset by peer)
[09:11] *** ZizzyDizz has quit IRC (Ping timeout: 260 seconds)
[09:21] *** faolingf_ has quit IRC (Quit: Leaving)
[09:39] Time for me to get el jannahs
[10:32] *** VerifiedJ has joined #archiveteam-ot
[12:10] *** BlueMax has quit IRC (Quit: Leaving)
[12:59] https://gist.github.com/ivan/e36e347875936c7933d9adc30ebc7a6e
[13:01] Ah yeah, those are fun.
[13:02] I have my telepathy hat on tonight
[13:03] because it's a quickly reproducible regression
[13:05] yep. don't read `.pattern` on a pyre2 object, don't even copy it to your own string
[13:12] Guessing that is on my server?
[13:20] it is not
[13:32] this one's more exciting https://gist.github.com/ivan/fa91d5dcd6d4cee3d285f423f4b42846#file-the-new-html-parser-L1335
[13:34] oh sweet it repros
[13:39] the unparseable page http://sm-hs.eu/index.php/smhs/article/view/sm-hs.2016.102
[13:41]
[13:42] That's the one thing that jumps at me and could easily cause problems in a parser.
[13:45] https://bugs.php.net/bug.php?id=76980 Now in PHP: partially defined classes! :-)
[13:52] thanks, you were right
[13:52] I minimized it to
[13:52] import html5_parser
[13:52] html5_parser.parse("", maybe_xhtml=True)
[15:01] does anyone want to test grab-site v2? https://github.com/ludios/grab-site/tree/v2#install-on-ubuntu-1604-1804-debian-9-stretch-debian-10-buster but add @v2 to the url in the last line
[15:02] https://gist.github.com/ivan/3d6d3d4f3574fa460c44204567e4184d upgrade guide
[15:05] oh yeah I need to test macOS
[15:20] snscrape is now on PyPI, so it can be installed with a simple 'pip install snscrape' and upgraded with 'pip install --upgrade snscrape' now.
[15:43] #archiveteam-ot is becoming #archiveteam-dev. :-)
[15:46] #archiveteam-dev: You have joined too many channels
[15:47] lol
[15:48] Just EFNet being EFNet.
[15:48] I say we all pile onto sdf.org irc
[15:48] :)
[15:51] *** schbirid has joined #archiveteam-ot
[17:04] *** wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES)
[17:04] *** Stilett0 has joined #archiveteam-ot
[17:06] *** wp494 has joined #archiveteam-ot
[17:06] *** svchfoo1 sets mode: +o wp494
[17:11] *** Stiletto has quit IRC (Read error: Operation timed out)
[17:19] grab-site 2.x is out, enjoy potentially exotic crashes many days into a crawl
[17:20] dupespotter performance improvements may come later
[17:21] <3
[17:26] *** Stiletto has joined #archiveteam-ot
[17:29] *** Stilett0 has quit IRC (Read error: Operation timed out)
[17:56] *** icedice has quit IRC (Quit: Leaving)
[17:58] *** Stilett0 has joined #archiveteam-ot
[17:59] *** Stiletto has quit IRC (Read error: Operation timed out)
[18:33] *** Stiletto has joined #archiveteam-ot
[18:36] *** Stilett0 has quit IRC (Read error: Operation timed out)
[18:44] https://twitter.com/Malaysia_Gov disappeared between the time I snscraped it a week ago and now
[18:44] was suspended, I mean
[18:49] Between this and Kanye West's Twitter disappearance, I feel like we should start a project to continuously archive any social media accounts by government institutions and popular figures. I think I heard that politicians on the federal level in the US are covered by IA, but I'm not sure if that's true.
[18:50] Even better would be automatically archiving any tweet with a retweet/like count beyond some threshold and any profile with more than x followers. But not sure if that's even possible.
[18:50] (Without paying Twitter for their enterprise API, that is.)
[18:51] *** schbirid has quit IRC (Remote host closed the connection)
[18:55] *** Mateon1 has quit IRC (Ping timeout: 252 seconds)
[18:55] *** Mateon1 has joined #archiveteam-ot
[19:08] https://twitter.com/search?q=trump%20min_faves%3A10000&src=typd
[19:08] https://twitter.com/search?q=trump%20min_retweets%3A10000&src=typd
[19:20] just crawling twitter is a good way to discover popular tweets
[19:20] I often set it on some search
[19:35] Oh nice, secret search options. :-)
[20:07] Huh, echo 'a&foo'$'\n''bar' | perl -pe 's,&[^&]+$,,m;' prints "abar" instead of "a" and "bar" on two lines. That's weird.
[20:13] Changing [^&]+ to a non-greedy [^&]+? "fixes" that, but I don't understand why the former would match a newline at the end. $ is supposed to match *before* the newline...
[20:20] (By the way, the m flag makes no difference.)
[20:24] *** BlueMax has joined #archiveteam-ot
[21:03] someone is touching my software https://github.com/DuckHP/grab-site/commits/master
[21:07] ivan: That's ola_norsk, but he hasn't been here in a few months.
[21:07] ah
[22:01] ivan: I just pushed all my wpull code to https://github.com/JustAnotherArchivist/wpull
[22:02] To anyone thinking about using my fork: don't.
[22:04] *** m007a83_ has joined #archiveteam-ot
[22:05] why
[22:05] *** m007a83 has quit IRC (Ping timeout: 252 seconds)
[22:05] *** m007a83_ is now known as m007a83
[22:07] moufu: It's untested, and I'm sure there was a reason why I didn't push this code back in Jan/Feb when I wrote it, though I don't remember the details. Testing the code would be much appreciated, but actually using it for archival is probably a bad idea at the moment.
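The `$` puzzle raised at [20:07] to [20:20] can be reproduced outside Perl. What follows is a minimal sketch in Python, whose `re` module documents the same rule for `$`: it matches at the very end of the string and also just before a newline at the end of the string. The helper name `strip_tail` is hypothetical. With `perl -p`, each input line still carries its trailing newline, and the class `[^&]` accepts a newline, so the greedy `[^&]+` consumes it and `$` then succeeds at the true end of the string. The lazy form stops at the first position where `$` succeeds, which is just before the newline.

```python
import re


def strip_tail(line: str, pattern: str) -> str:
    """Remove the first match of `pattern` from `line` (multiline mode, like the `m` flag in the log)."""
    return re.sub(pattern, "", line, count=1, flags=re.MULTILINE)


# What `perl -p` hands to the substitution for the first input line: newline included.
line = "a&foo\n"

greedy = strip_tail(line, r"&[^&]+$")      # [^&]+ also swallows the "\n"
lazy = strip_tail(line, r"&[^&]+?$")       # stops just before the "\n"
explicit = strip_tail(line, r"&[^&\n]+$")  # newline excluded from the class

print(repr(greedy), repr(lazy), repr(explicit))
```

Excluding the newline from the character class, as in the third pattern, keeps the greedy quantifier and still leaves the line ending in place.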
[22:17] *** VerifiedJ has quit IRC (Quit: Leaving)
[22:17] *** m007a83_ has joined #archiveteam-ot
[22:19] *** m007a83 has quit IRC (Read error: Operation timed out)
[22:34] *** BlueMax has quit IRC (Quit: Leaving)
[22:50] *** Stiletto has quit IRC (Read error: Operation timed out)
[22:50] *** Stilett0 has joined #archiveteam-ot
[22:58] *** m007a83 has joined #archiveteam-ot
[23:02] *** m007a83_ has quit IRC (Read error: Operation timed out)
[23:04] *** Polylith_ has quit IRC (Read error: Operation timed out)
[23:05] *** svchfoo3 has quit IRC (Read error: Operation timed out)
[23:05] *** Polylith has joined #archiveteam-ot
[23:10] *** svchfoo3 has joined #archiveteam-ot
[23:11] *** svchfoo1 sets mode: +o svchfoo3
[23:14] *** Stiletto has joined #archiveteam-ot
[23:16] *** Stilett0 has quit IRC (Read error: Operation timed out)
[23:23] *** Stilett0 has joined #archiveteam-ot
[23:24] *** Stiletto has quit IRC (Ping timeout: 260 seconds)