[00:19] Good news: I've finally fixed https://github.com/JustAnotherArchivist/snscrape/issues/17
[00:24] JAA: Nice, that's a super valuable fix.
[00:25] Might have to revisit some Facebook pages I couldn't scrape before.
[00:57] *** BlueMax has quit IRC (Quit: Leaving)
[01:04] https://pbs.twimg.com/media/D1yyNpxXQAIAx_g.jpg
[01:20] *** BlueMax has joined #archiveteam-ot
[01:21] jodizzle: You might want to wait with that. I noticed that my scraper probably missed a *lot* of posts on some Facebook pages, namely all that have /permalink.php URLs rather than something with /posts/, /photos/, or /videos/.
[01:25] https://github.com/JustAnotherArchivist/snscrape/issues/32
[01:26] *** Despatche has quit IRC (Read error: Connection reset by peer)
[01:34] Ah, interesting
[01:34] I have noticed particularly low URL counts from snscrape on some Facebook pages in the past
[02:20] lol Facebook: https://transfer.notkiska.pw/a5UPe/fbdivs.png
[02:20] I couldn't even screenshot the whole thing due to my screen size. The whole tower is about twice as large.
[02:36] *** kbtoo has joined #archiveteam-ot
[02:37] *** kbtoo_ has quit IRC (Ping timeout: 255 seconds)
[02:40] T minus 5 minutes
[02:44] 1 minute
[02:45] > TZ=UTC date '+%Y-%m-%d %H:%M:%S %Z'
[02:45] 2019-04-18 02:45:51 UTC
[02:45] > date +%s
[02:45] 1555555551
[02:46] :-)
[02:47] what was that countdown for?
[02:47] See above. Unix time just reached a pretty neat number.
[03:23] *** qw3rty113 has joined #archiveteam-ot
[03:29] *** qw3rty112 has quit IRC (Read error: Operation timed out)
[03:37] *** odemg has quit IRC (Ping timeout: 615 seconds)
[03:44] *** odemg has joined #archiveteam-ot
[04:00] *** Tsuser has joined #archiveteam-ot
[04:04] JAA: yeet, I slept through it, again.
[04:04] Same as I slept through 1500000000
[04:17] *** LowLevelM has joined #archiveteam-ot
[04:18] Are we going to have any new projects soon?
[04:19] yes
[04:19] I mean, if you want to suggest shorteners to add to the URL tracker, we are always accepting suggestions. In the grand scheme of things, kaz might be able to tell you more.
[04:20] Ok, is the urlteam tracker overloaded? It seems to have enough URLs
[04:21] no, it's just rate-limited in a sane way
[04:21] ok
[04:21] Ahaha, we are currently exporting. We can always use more shorteners to add; it's just a pain picking which ones, hence I usually ask for suggestions.
[04:21] We are always up for more workers, but as kaz said, there is some rate limiting to avoid getting banned by any shorteners.
[04:22] Will there be a rescan to get new urls later?
[04:22] Sometimes we do rescans of older shortcode ranges, but it's not always viable.
[04:22] Like, we don't want to scrape bitly from 0 again, for example, but a smaller URL service may get rescanned every few years if it's viable.
[04:23] But everyone treats it a little differently. I am the newest admin there, so I might be overruled later on ahahaha
[04:24] I just looked at the tracker and it says export in progress. Now I know why there are no jobs.
[04:24] Yes, we export regularly to the archive.
[04:25] should be finished soon, then back to brute forcing
[04:25] Why is there not a second server? One could be exporting while the other one is up.
[04:26] I mean, I don't know to be honest, but this way works at the moment; there is just a little bit of downtime sometimes.
[04:44] There is no need for a second host.
[04:44] That's absolutely not the solution to that problem in any case.
[04:57] Are there any url shorteners that need code written for them?
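For anyone wondering what the brute forcing and rescanning discussed above actually involve, here is a toy sketch. Everything in it is invented for illustration — the alphabet, delay, and shortener URL are made up, and this is not the actual tracker code:

```python
import itertools
import string
import time

import requests

# All values below are invented for illustration -- not a real shortener
# or the URLTeam tracker's actual settings.
ALPHABET = string.digits + string.ascii_lowercase
BASE_URL = 'https://shortener.example/'
DELAY = 1.0  # pause between requests so we stay under the rate limit

def scan_range(length):
    """Try every shortcode of a given length and print where each one redirects."""
    for combo in itertools.product(ALPHABET, repeat=length):
        code = ''.join(combo)
        resp = requests.get(BASE_URL + code, allow_redirects=False)
        if resp.status_code in (301, 302):
            print(code, resp.headers.get('Location'))
        time.sleep(DELAY)

scan_range(2)  # a "rescan" is simply running this again over an already-scanned range
```

A rescan is viable for a small service with short code ranges, but re-enumerating something the size of bitly from 0 would take far too many requests, which is the trade-off mentioned above.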
[05:06] I mean, you can look at the code and see if you can fix any of the issues, LowLevelM
[05:06] Ok
[05:07] https://github.com/ArchiveTeam/terroroftinytown LowLevelM
[05:07] BTW, when will Google+ be removed as an available project?
[05:08] No idea, I don't handle that stuff
[05:09] Who does?
[05:22] I could
[05:22] I did
[05:22] Thanks, I wasn't going to
[05:40] *** dhyan_nat has joined #archiveteam-ot
[05:44] *** LowLevelM has quit IRC (Ping timeout: 260 seconds)
[06:30] *** dashcloud has quit IRC (Ping timeout: 265 seconds)
[06:31] *** dashcloud has joined #archiveteam-ot
[06:37] JAA: merged
[07:19] *** schbirid has joined #archiveteam-ot
[08:15] *** BlueMax has quit IRC (Quit: Leaving)
[08:48] *** kbtoo_ has joined #archiveteam-ot
[08:55] *** kbtoo has quit IRC (Read error: Operation timed out)
[10:03] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[10:09] *** dhyan_nat has joined #archiveteam-ot
[11:33] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[11:36] *** drcd has joined #archiveteam-ot
[11:50] *** dhyan_nat has joined #archiveteam-ot
[11:54] *** nataraj has joined #archiveteam-ot
[11:55] *** dhyan_nat has quit IRC (Read error: Connection reset by peer)
[12:01] *** nataraj has quit IRC (Ping timeout: 268 seconds)
[12:06] *** nataraj has joined #archiveteam-ot
[12:11] *** Despatche has joined #archiveteam-ot
[12:12] *** nataraj has quit IRC (Ping timeout: 268 seconds)
[12:12] *** nataraj has joined #archiveteam-ot
[12:16] *** nataraj has quit IRC (Read error: Operation timed out)
[12:16] *** cfarquhar has quit IRC (Read error: Operation timed out)
[12:17] *** cfarquhar has joined #archiveteam-ot
[12:19] *** Mateon1 has quit IRC (Remote host closed the connection)
[12:19] *** Mateon1 has joined #archiveteam-ot
[12:45] snscrape now yields proper items for each service instead of just the URL. You can customise the CLI output format with the --format option; the fields are so far undocumented, though.
[12:46] Example: snscrape --format '{date:%Y-%m-%d} {url} {content}' -n 10 twitter-user realDonaldTrump
[12:47] For Facebook and Instagram, snscrape will now by default print clean URLs instead of the hot garbage Facebook in particular returns. Use --format '{dirtyUrl}' if you want the previous output.
[12:49] awesome
[12:49] awesome indeed
[12:49] now we want an IRC bot feature
[12:50] (-:
[12:51] Hehe
[12:51] Should be trivial to create a bot for this, but I won't integrate it into snscrape directly.
[13:08] *** Verified_ has joined #archiveteam-ot
[13:13] And now there's a --since option as well to stop when a certain date is reached. It accepts these date formats: ('%Y-%m-%d %H:%M:%S %z', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d %z', '%Y-%m-%d') (if no UTC offset is specified, UTC is used)
[13:14] (Note that it's %z (e.g. "+0200"), not %Z (e.g. "CEST"). Python doesn't like the latter, and I didn't want to add an external dependency just for this. https://bugs.python.org/issue22377)
[13:22] I am fine with you piping it to my transfer.sh instance
[13:23] Automatically, if you want to add that feature
[13:32] *** Shen has joined #archiveteam-ot
[13:45] JAA: what about %s format (unix ts) for --since?
[13:53] *** jspiros__ has quit IRC (Quit: ZNC - https://znc.in)
[13:54] *** jspiros__ has joined #archiveteam-ot
[14:01] Fusl: Good idea, done.
[14:03] thanks, that makes scripting for the irc bot a lot easier now
[14:05] *** Hani has quit IRC (Quit: Going offline, see ya! (www.adiirc.com))
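A minimal sketch of how the --since parsing described above might work: try each of the four listed formats in turn, default to UTC when no offset is given, and accept a %s-style Unix timestamp as added after Fusl's suggestion. This is an illustration under those assumptions, not necessarily snscrape's actual code:

```python
import datetime

# The four strptime formats listed above; %z is a numeric offset like +0200.
FORMATS = ('%Y-%m-%d %H:%M:%S %z', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d %z', '%Y-%m-%d')

def parse_since(value):
    if value.isdigit():  # %s-style Unix timestamp, e.g. '1555555551'
        return datetime.datetime.fromtimestamp(int(value), tz=datetime.timezone.utc)
    for fmt in FORMATS:
        try:
            parsed = datetime.datetime.strptime(value, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:  # no UTC offset specified -> assume UTC
            parsed = parsed.replace(tzinfo=datetime.timezone.utc)
        return parsed
    raise ValueError(f'unrecognised date: {value!r}')
```

This also shows why %Z names like "CEST" are avoided: strptime handles numeric %z offsets natively, while timezone names would need an external library (see the linked Python bug).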
[14:07] Turns out that Instagram doesn't like it when you hammer their GraphQL endpoint.
[14:07] Implementing support for hashtags right now, and it's working too well. :-P
[14:15] Now there's an instagram-hashtag scraper. Expect issues if you combine it with --since: Instagram sometimes returns weird results, so the recursion might end too soon.
[14:16] *** jspiros__ has quit IRC (Quit: ZNC - https://znc.in)
[14:16] Specifically, towards the end of the returned results, it sometimes returns some much older entries. Once you load the next page, it continues from where the newer ones stopped. But because of those intermediate older results, snscrape would already stop with --since.
[14:54] *** nataraj has joined #archiveteam-ot
[15:02] *** dhyan_nat has joined #archiveteam-ot
[15:02] *** nataraj has quit IRC (Read error: Connection reset by peer)
[15:11] *** dhyan_nat has quit IRC (Ping timeout: 268 seconds)
[15:15] *** dhyan_nat has joined #archiveteam-ot
[15:20] *** dhyan_nat has quit IRC (Ping timeout: 268 seconds)
[15:59] BitTorrent of the Mueller report: 4ffa11aa616e0481f23743e8e2d8c3105621f0d2, also available at https://www.justice.gov/storage/report.pdf
[16:30] OCR attempt: d614a8634a7bb48c75b1bae6e2c8b5ebb08df8fe
[16:31] OCR attempt: 101a36145ca0db64ccdc370f158ac4c9e7c4436f
[16:32] *** dhyan_nat has joined #archiveteam-ot
[16:37] Of *course* they printed and scanned it again instead of uploading a clean document directly...
[16:40] that's intentional
[16:40] that's how the government properly redacts things
[16:41] snscrape now has VKontakte support. (Tested with a bunch of random pages including profiles and communities: navalny, maria_butina, dm, durov, teachers.union, land_of_fire, id497489393.)
[16:42] Unfortunately, VK only returns a timestamp for the most recent posts, so older posts currently lack a date. I didn't feel like trying to write a parser for "today at 12:44 pm"-type strings involving the local timezone etc.
[16:45] https://slate.com/technology/2016/06/house-democrats-improperly-redacted-documents-wrong-but-they-re-not-alone.html
[16:56] I'm aware of why they're doing it, but it's still silly. There are technical solutions for properly removing information from a digital document, or rather, for preventing it from being in the distributed document at all (e.g. \phantom in LaTeX just generates a correspondingly sized empty space, as far as I know).
[16:57] Whereas physically covering it with a marker or tape or whatever they use and scanning it again may still leave the original content detectable.
[18:35] *** Stilettoo is now known as Stiletto
[19:30] VoynichCr: So I think I found a bug on https://www.archiveteam.org/index.php?title=ArchiveBot/Governments/Micronesia with the 'raw list' links. The links go to pages that appear to be blank. (https://www.archiveteam.org/index.php?title=ArchiveBot/Governments/Micronesia/list)
[19:32] VoynichCr: Actually, the 'raw list' links don't correspond to the tables.
[19:44] t3: I see... that's because I am including content from another page in the wiki, but the template uses the title of the current page.
[19:45] I created that page before JAA added the sections feature, so I am going to convert it.
[19:52] Ah, nice trick actually.
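To illustrate the --since problem described at 14:16, here is a simplified, hypothetical sketch of a date cutoff (not snscrape's actual implementation): the scraper stops at the first item older than the cutoff, so a few out-of-order old entries near the end of an Instagram page end the whole scrape prematurely.

```python
def scrape_with_since(pages, since):
    """Yield items newest-first and stop once they get older than 'since'."""
    for page in pages:
        for item in page:
            if item.date < since:
                # One out-of-order older entry near the end of a page lands here
                # and ends the whole scrape, even though the next page would have
                # continued with newer items.
                return
            yield item
```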
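And a minimal LaTeX illustration of the \phantom approach mentioned at 16:56, with an invented sentence. The sensitive text still lives in the .tex source, so only the compiled PDF would be distributed:

```latex
\documentclass{article}
\begin{document}
% The argument of \phantom is measured but never typeset: the compiled PDF
% contains only a blank space of the right size, not the text itself.
The payment was authorised by \phantom{John Q. Example} on 17 April.
\end{document}
```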
[20:04] *** killsushi has joined #archiveteam-ot
[20:26] *** chirlu has quit IRC (Ping timeout: 255 seconds)
[20:53] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[21:20] *** drcd has quit IRC (Read error: Connection reset by peer)
[21:46] *** Soni has joined #archiveteam-ot
[21:48] *** katocala has joined #archiveteam-ot
[21:57] katocala: Can you find me some test cases for Gab? An empty profile and a small one in particular.
[21:58] (Empty meaning the user exists but hasn't made any posts.)
[21:59] I added https://gab.ai/PastorLindstedt with chromebot on 4/15. It hasn't hit the WBM yet, so I can't see how it went.
[22:00] I'll check for others - I tried a handful
[22:01] On 2/14, I used chromebot on https://gab.com/PatriotFront. The WBM shows a blank page: https://web.archive.org/web/20190214200215/https://gab.com/PatriotFront
[22:02] Might have the same issue as Instagram?
[22:03] Not the same issue, but Gab is unusable without JS and loads everything asynchronously.
[22:05] alright, this one isn't WBM material then, I guess
[22:09] https://twitter.com/doidepsec and https://twitter.com/depsecdef were deleted
[22:19] katocala: It probably just needs some extra tricks and possibly better JS support in the WBM. Better to have the data in a currently inaccessible form than to not have it at all.
[22:19] I'm adding Gab support to snscrape anyway because it's really easy and could still be useful for other things.
[22:22] Found some small test cases in the meantime: https://gab.com/TRUTHREIGNS and https://gab.com/MsWen
[22:40] katocala: Gab support has landed.
[22:41] This has by far been the most pleasant service to work with so far.
[22:45] Gab has a clean public API which is also used on the website everywhere, and there is no cookie, special-header, or parameter bullshit needed either. Just a simple API endpoint URL and a 'before' parameter, which is either the publishing date of the last post on the page (for the "posts" tab) or a number (for comments and media).
[23:05] well damn. That is great!
[23:16] *** BlueMax has joined #archiveteam-ot
[23:52] https://gab.com/blah is an empty profile
[23:54] https://gab.com/test has 138 posts
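Based on the description at 22:45, the 'before'-based pagination for the posts tab might look roughly like this. The endpoint path and JSON field names are guesses for illustration only — they are not taken from Gab's actual API documentation:

```python
import requests

def gab_posts(username):
    """Page through a user's posts using the 'before' parameter described above."""
    params = {}
    while True:
        # Endpoint path is an assumption, not verified against Gab's API.
        resp = requests.get(f'https://gab.com/api/feed/{username}', params=params)
        resp.raise_for_status()
        posts = resp.json().get('data', [])
        if not posts:
            return
        yield from posts
        # For the posts tab, 'before' is the publishing date of the last post
        # on the current page; for comments and media it is a number instead.
        params = {'before': posts[-1]['published_at']}
```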