#archiveteam-ot 2018-10-09,Tue

Time Nickname Message
00:49 🔗 dashcloud has quit IRC (Read error: Operation timed out)
00:53 🔗 dashcloud has joined #archiveteam-ot
00:58 🔗 hook54321 sets mode: +o wp494
00:59 🔗 Stiletto has joined #archiveteam-ot
01:01 🔗 Stilett0 has quit IRC (Ping timeout: 268 seconds)
01:19 🔗 dashcloud has quit IRC (Remote host closed the connection)
01:20 🔗 dashcloud has joined #archiveteam-ot
01:23 🔗 ivan ZizzyDizz: feed `gs-dump-urls wpull.db done | grep youtube.com/watch` into youtube-dl
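The pipeline ivan describes could be scripted roughly as below. This is only a sketch: `youtube_watch_urls` and `main` are hypothetical helper names, the `gs-dump-urls wpull.db done` invocation is copied from the message above, and it assumes both `gs-dump-urls` and `youtube-dl` are on the PATH.

```python
import subprocess
from typing import Iterable, List


def youtube_watch_urls(urls: Iterable[str]) -> List[str]:
    """Keep only URLs containing youtube.com/watch (the job of the grep step)."""
    return [u.strip() for u in urls if "youtube.com/watch" in u]


def main(db_path: str = "wpull.db") -> None:
    # Dump the URLs marked "done" from the crawl database, as in ivan's command.
    dumped = subprocess.run(
        ["gs-dump-urls", db_path, "done"],
        check=True, capture_output=True, text=True,
    ).stdout.splitlines()
    watch_urls = youtube_watch_urls(dumped)
    if watch_urls:
        # "--batch-file -" makes youtube-dl read its URL list from stdin.
        subprocess.run(
            ["youtube-dl", "--batch-file", "-"],
            input="\n".join(watch_urls) + "\n", check=True, text=True,
        )
```

Calling `main()` from the crawl directory runs the whole chain; `youtube_watch_urls` can be tried on its own with any list of URL strings.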
02:55 🔗 ivan !status
02:55 🔗 ivan oops
02:57 🔗 ivan does anyone have 4+ idle cores somewhere that I can use to archive more twitter with grab-site?
02:58 🔗 ivan I'm working through all the politicians and next up are the verified accounts
02:58 🔗 kiska I have 4 free cores, but it's doing AB work
02:59 🔗 kiska ivan: https://server5.kiska.pw/laptop/
02:59 🔗 ivan AB work is good work
03:04 🔗 Flashfire If you do them as !ao on archivebot that works
03:04 🔗 Flashfire !ao < ExampleURLList
03:04 🔗 Flashfire ivan
03:05 🔗 ivan I am familiar
03:06 🔗 Flashfire Alright then I can monitor some that way
03:06 🔗 ivan I just have too much and I don't want to mix grab-site/wpull 3 and AB
03:07 🔗 ivan AB is backlogged anyway
03:08 🔗 kiska AB is backlogged with !a requests, !ao isn't
03:09 🔗 ivan Flashfire: huh that https://www.versace.com/ you found is interesting
03:09 🔗 ivan interesting but my !ao jobs are 100K each and end up at ~700K requests, heh
03:10 🔗 Flashfire OK, I have a core spare. How much bandwidth would it use? I have a TB per month
03:10 🔗 Flashfire and it's only the 9th
03:10 🔗 Flashfire What fork of grab-site do I set up?
03:11 🔗 kiska ivan: There are at least 3 slots free for !ao requests
03:12 🔗 Flashfire Yeah !ao jobs will be fine I can monitor them
03:12 🔗 ivan Flashfire: I don't want to eat into your TB, it might go over
03:13 🔗 Flashfire Ok then
03:13 🔗 Flashfire and Versace is neat but has been bought out by another company
03:15 🔗 kiska ivan: I need better utilisation of my network, so I'll be fine if you want to use grab-site on my server
03:16 🔗 kiska I just need to grab a pubkey from you, since I am going to chuck you on the AB user
03:16 🔗 kiska Or the other option is to use !ao < <list> and it'll still land on my pipeline
03:20 🔗 ivan PMed
03:30 🔗 odemg has quit IRC (Ping timeout: 260 seconds)
03:43 🔗 odemg has joined #archiveteam-ot
06:32 🔗 dashcloud has quit IRC (Read error: Operation timed out)
06:55 🔗 icedice has joined #archiveteam-ot
07:06 🔗 djsundog has quit IRC (Read error: Operation timed out)
07:11 🔗 djsundog has joined #archiveteam-ot
08:17 🔗 Flashfire TIME TO WATCH DOCTOR WHO SEASON 11 EPISODE 1
08:57 🔗 VerifiedJ has joined #archiveteam-ot
09:09 🔗 VerifiedJ has quit IRC (Read error: Connection reset by peer)
09:11 🔗 ZizzyDizz has quit IRC (Ping timeout: 260 seconds)
09:21 🔗 faolingf_ has quit IRC (Quit: Leaving)
09:39 🔗 kiska Time for me to get el jannahs
10:32 🔗 VerifiedJ has joined #archiveteam-ot
12:10 🔗 BlueMax has quit IRC (Quit: Leaving)
12:59 🔗 ivan https://gist.github.com/ivan/e36e347875936c7933d9adc30ebc7a6e
13:01 🔗 JAA Ah yeah, those are fun.
13:02 🔗 ivan I have my telepathy hat on tonight
13:03 🔗 ivan because it's a quickly reproducible regression
13:05 🔗 ivan yep. don't read `.pattern` on a pyre2 object, don't even copy it to your own string
13:12 🔗 kiska Guessing that is on my server?
13:20 🔗 ivan it is not
13:32 🔗 ivan this one's more exciting https://gist.github.com/ivan/fa91d5dcd6d4cee3d285f423f4b42846#file-the-new-html-parser-L1335
13:34 🔗 ivan oh sweet it repros
13:39 🔗 ivan the unparseable page http://sm-hs.eu/index.php/smhs/article/view/sm-hs.2016.102
13:41 🔗 JAA <html />
13:42 🔗 JAA That's the one thing that jumps out at me and could easily cause problems in a parser.
13:45 🔗 JAA https://bugs.php.net/bug.php?id=76980 Now in PHP: partially defined classes! :-)
13:52 🔗 ivan thanks, you were right
13:52 🔗 ivan I minimized it to
13:52 🔗 ivan import html5_parser
13:52 🔗 ivan html5_parser.parse("<html><html />", maybe_xhtml=True)
15:01 🔗 ivan does anyone want to test grab-site v2? https://github.com/ludios/grab-site/tree/v2#install-on-ubuntu-1604-1804-debian-9-stretch-debian-10-buster but add @v2 to the url in the last line
15:02 🔗 ivan https://gist.github.com/ivan/3d6d3d4f3574fa460c44204567e4184d upgrade guide
15:05 🔗 ivan oh yeah I need to test macOS
15:20 🔗 JAA snscrape is now on PyPI, so it can be installed with a simple 'pip install snscrape' and upgraded with 'pip install --upgrade snscrape' now.
15:43 🔗 JAA #archiveteam-ot is becoming #archiveteam-dev. :-)
15:46 🔗 jrwr #archiveteam-dev: You have joined too many channels
15:47 🔗 astrid lol
15:48 🔗 JAA Just EFNet being EFNet.
15:48 🔗 jrwr I say we all pile onto sdf.org irc
15:48 🔗 jrwr :)
15:51 🔗 schbirid has joined #archiveteam-ot
17:04 🔗 wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES)
17:04 🔗 Stilett0 has joined #archiveteam-ot
17:06 🔗 wp494 has joined #archiveteam-ot
17:06 🔗 svchfoo1 sets mode: +o wp494
17:11 🔗 Stiletto has quit IRC (Read error: Operation timed out)
17:19 🔗 ivan grab-site 2.x is out, enjoy potentially exotic crashes many days into a crawl
17:20 🔗 ivan dupespotter performance improvements may come later
17:21 🔗 schbirid <3
17:26 🔗 Stiletto has joined #archiveteam-ot
17:29 🔗 Stilett0 has quit IRC (Read error: Operation timed out)
17:56 🔗 icedice has quit IRC (Quit: Leaving)
17:58 🔗 Stilett0 has joined #archiveteam-ot
17:59 🔗 Stiletto has quit IRC (Read error: Operation timed out)
18:33 🔗 Stiletto has joined #archiveteam-ot
18:36 🔗 Stilett0 has quit IRC (Read error: Operation timed out)
18:44 🔗 ivan https://twitter.com/Malaysia_Gov disappeared between the time I snscraped it a week ago and now
18:44 🔗 ivan was suspended, I mean
18:49 🔗 JAA Between this and Kanye West's Twitter disappearance, I feel like we should start a project to continuously archive any social media accounts by government institutions and popular figures. I think I heard that politicians on the federal level in the US are covered by IA, but I'm not sure if that's true.
18:50 🔗 JAA Even better would be automatically archiving any tweet with a retweet/like count beyond some threshold and any profile with more than x followers. But not sure if that's even possible.
18:50 🔗 JAA (Without paying Twitter for their enterprise API, that is.)
18:51 🔗 schbirid has quit IRC (Remote host closed the connection)
18:55 🔗 Mateon1 has quit IRC (Ping timeout: 252 seconds)
18:55 🔗 Mateon1 has joined #archiveteam-ot
19:08 🔗 ivan https://twitter.com/search?q=trump%20min_faves%3A10000&src=typd
19:08 🔗 ivan https://twitter.com/search?q=trump%20min_retweets%3A10000&src=typd
19:20 🔗 ivan just crawling twitter is a good way to discover popular tweets
19:20 🔗 ivan I often set it on some search
19:35 🔗 JAA Oh nice, secret search options. :-)
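The two search URLs ivan posted can be rebuilt with the standard library, as a rough sketch. `twitter_search_url` is a hypothetical helper; the `min_faves:` and `min_retweets:` operators and the `src=typd` parameter are taken from the URLs above, and nothing here checks that Twitter still honours them.

```python
from urllib.parse import quote, urlencode


def twitter_search_url(term: str, min_faves: int = 0, min_retweets: int = 0) -> str:
    """Build a search URL with optional popularity thresholds."""
    parts = [term]
    if min_faves:
        parts.append(f"min_faves:{min_faves}")
    if min_retweets:
        parts.append(f"min_retweets:{min_retweets}")
    # quote_via=quote encodes spaces as %20 (not +), matching the posted URLs.
    query = urlencode({"q": " ".join(parts), "src": "typd"}, quote_via=quote)
    return "https://twitter.com/search?" + query


print(twitter_search_url("trump", min_faves=10000))
print(twitter_search_url("trump", min_retweets=10000))
```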
20:07 🔗 JAA Huh, echo 'a&foo'$'\n''bar' | perl -pe 's,&[^&]+$,,m;' prints "abar" instead of "a" and "bar" on two lines. That's weird.
20:13 🔗 JAA Changing [^&]+ to a non-greedy [^&]+? "fixes" that, but I don't understand why the former would match a newline at the end. $ is supposed to match *before* the newline...
20:20 🔗 JAA (By the way, the m flag makes no difference.)
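A likely explanation, shown in Python rather than Perl: `perl -p` hands the substitution the line with its trailing newline still attached, and `[^&]` happily matches a newline, so the greedy `[^&]+` swallows it and `$` then matches at the very end of the string. Python's `re` treats `$` the same way here (end of string, or just before a final newline), so the sketch below reproduces the effect. The variable names are illustrative only.

```python
import re

line = "a&foo\n"  # what the substitution sees for the first input line

# Greedy: [^&]+ consumes "foo\n", then $ matches at the very end of the string.
greedy = re.sub(r"&[^&]+$", "", line)

# Non-greedy: stops after "foo", where $ can already match before the newline.
lazy = re.sub(r"&[^&]+?$", "", line)

# Excluding the newline from the character class keeps the greedy form safe.
explicit = re.sub(r"&[^&\n]+$", "", line)

print(repr(greedy), repr(lazy), repr(explicit))
```

Only the greedy form drops the newline, which is why the two input lines end up joined as "abar".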
20:24 🔗 BlueMax has joined #archiveteam-ot
21:03 🔗 ivan someone is touching my software https://github.com/DuckHP/grab-site/commits/master
21:07 🔗 JAA ivan: That's ola_norsk, but he hasn't been here in a few months.
21:07 🔗 ivan ah
22:01 🔗 JAA ivan: I just pushed all my wpull code to https://github.com/JustAnotherArchivist/wpull
22:02 🔗 JAA To anyone thinking about using my fork: don't.
22:04 🔗 m007a83_ has joined #archiveteam-ot
22:05 🔗 moufu why
22:05 🔗 m007a83 has quit IRC (Ping timeout: 252 seconds)
22:05 🔗 m007a83_ is now known as m007a83
22:07 🔗 JAA moufu: It's untested, and I'm sure there was a reason why I didn't push this code back in Jan/Feb when I wrote it, though I don't remember the details. Testing the code would be much appreciated, but actually using it for archival is probably a bad idea at the moment.
22:17 🔗 VerifiedJ has quit IRC (Quit: Leaving)
22:17 🔗 m007a83_ has joined #archiveteam-ot
22:19 🔗 m007a83 has quit IRC (Read error: Operation timed out)
22:34 🔗 BlueMax has quit IRC (Quit: Leaving)
22:50 🔗 Stiletto has quit IRC (Read error: Operation timed out)
22:50 🔗 Stilett0 has joined #archiveteam-ot
22:58 🔗 m007a83 has joined #archiveteam-ot
23:02 🔗 m007a83_ has quit IRC (Read error: Operation timed out)
23:04 🔗 Polylith_ has quit IRC (Read error: Operation timed out)
23:05 🔗 svchfoo3 has quit IRC (Read error: Operation timed out)
23:05 🔗 Polylith has joined #archiveteam-ot
23:10 🔗 svchfoo3 has joined #archiveteam-ot
23:11 🔗 svchfoo1 sets mode: +o svchfoo3
23:14 🔗 Stiletto has joined #archiveteam-ot
23:16 🔗 Stilett0 has quit IRC (Read error: Operation timed out)
23:23 🔗 Stilett0 has joined #archiveteam-ot
23:24 🔗 Stiletto has quit IRC (Ping timeout: 260 seconds)