#archiveteam-ot 2019-04-18,Thu


Time Nickname Message
00:19 🔗 JAA Good news: I've finally fixed https://github.com/JustAnotherArchivist/snscrape/issues/17
00:24 🔗 jodizzle JAA: Nice, that's a super valuable fix.
00:25 🔗 jodizzle Might have to revisit some Facebook pages I couldn't scrape before.
00:57 🔗 BlueMax has quit IRC (Quit: Leaving)
01:04 🔗 Ravenloft https://pbs.twimg.com/media/D1yyNpxXQAIAx_g.jpg
01:20 🔗 BlueMax has joined #archiveteam-ot
01:21 🔗 JAA jodizzle: You might want to wait with that. I noticed that my scraper probably missed a *lot* of posts on some Facebook pages, namely all that have /permalink.php URLs rather than something with /posts/, /photos/, or /videos/.
01:25 🔗 JAA https://github.com/JustAnotherArchivist/snscrape/issues/32
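The missed-post problem JAA describes comes down to URL matching: a pattern that only accepts /posts/, /photos/, or /videos/ paths silently skips posts whose only URL is /permalink.php. A minimal sketch of that difference (the path shapes are illustrative, not Facebook's exact URL grammar, and this is not snscrape's actual code):

```python
import re

# A pattern limited to the three "nice" path types misses permalink posts;
# adding /permalink.php to the alternation catches them too.
narrow = re.compile(r"/(posts|photos|videos)/")
broad = re.compile(r"/(posts|photos|videos)/|/permalink\.php")

urls = [
    "https://www.facebook.com/somepage/posts/123",
    "https://www.facebook.com/permalink.php?story_fbid=456&id=789",
]
print([bool(narrow.search(u)) for u in urls])  # [True, False]
print([bool(broad.search(u)) for u in urls])   # [True, True]
```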
01:26 🔗 Despatche has quit IRC (Read error: Connection reset by peer)
01:34 🔗 jodizzle Ah, interesting
01:34 🔗 jodizzle I have noticed particularly low URL counts from snscrape on some Facebook pages in the past
02:20 🔗 JAA lol Facebook: https://transfer.notkiska.pw/a5UPe/fbdivs.png
02:20 🔗 JAA I couldn't even screenshot the whole thing due to my screen size. The whole tower is about twice as large.
02:36 🔗 kbtoo has joined #archiveteam-ot
02:37 🔗 kbtoo_ has quit IRC (Ping timeout: 255 seconds)
02:40 🔗 JAA T minus 5 minutes
02:44 🔗 JAA 1 minute
02:45 🔗 JAA > TZ=UTC date '+%Y-%m-%d %H:%M:%S %Z'
02:45 🔗 JAA 2019-04-18 02:45:51 UTC
02:45 🔗 JAA > date +%s
02:45 🔗 JAA 1555555551
02:46 🔗 JAA :-)
02:47 🔗 Flashfire what was that countdown for?
02:47 🔗 JAA See above. Unix time just reached a pretty neat number.
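The countdown above can be cross-checked in a couple of lines of Python: 1555555551 seconds after the Unix epoch is exactly the moment the log shows.

```python
from datetime import datetime, timezone

# Convert the "neat" Unix timestamp back to a UTC wall-clock time.
ts = 1555555551
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt.strftime("%Y-%m-%d %H:%M:%S %Z"))  # 2019-04-18 02:45:51 UTC
```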
03:23 🔗 qw3rty113 has joined #archiveteam-ot
03:29 🔗 qw3rty112 has quit IRC (Read error: Operation timed out)
03:37 🔗 odemg has quit IRC (Ping timeout: 615 seconds)
03:44 🔗 odemg has joined #archiveteam-ot
04:00 🔗 Tsuser has joined #archiveteam-ot
04:04 🔗 Fusl JAA: yeet, I slept through it, again.
04:04 🔗 Fusl same as when I slept through 1500000000
04:17 🔗 LowLevelM has joined #archiveteam-ot
04:18 🔗 LowLevelM Are we going to have any new projects soon?
04:19 🔗 Kaz yes
04:19 🔗 Flashfire I mean, if you want to suggest shorteners to add to the URL tracker, we are always accepting suggestions. In the grand scheme of things, Kaz might be able to tell you more
04:20 🔗 LowLevelM Ok, is the urlteam tracker overloaded? it seems to have enough urls
04:21 🔗 Kaz no, it's just rate-limited in a sane way
04:21 🔗 LowLevelM ok
04:21 🔗 Flashfire Ahaha we are currently exporting. We can always use more shorteners to add; it's just a pain picking which ones, hence I usually ask for suggestions
04:21 🔗 Flashfire We are always up for more workers but as kaz said there is some rate limiting to avoid getting banned by any shorteners
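The "rate-limited in a sane way" idea is simple to sketch: space requests out by a minimum interval so the target shortener never sees a burst. The interval value below is made up for illustration; the real tracker's limits are its own.

```python
import time

class RateLimiter:
    """Block callers so consecutive requests are at least min_interval apart."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = None  # monotonic time of the previous request, if any

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            delay = self._last + self.min_interval - now
            if delay > 0:
                time.sleep(delay)  # too soon: sleep off the remainder
        self._last = time.monotonic()

limiter = RateLimiter(0.01)  # 10 ms between requests, purely illustrative
start = time.monotonic()
for _ in range(3):
    limiter.wait()  # every call after the first enforces the gap
elapsed = time.monotonic() - start
```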
04:22 🔗 LowLevelM Will there be a rescan to get new urls later?
04:22 🔗 Flashfire Sometimes we do rescans of older shortcode ranges, but it's not always viable
04:22 🔗 Flashfire Like we don't want to scrape bitly from 0 again, for example, but a smaller URL service may get rescanned every few years if it's viable
04:23 🔗 Flashfire But everyone treats it a little differently. I am the newest admin there, so I might be overruled later on ahahaha
04:24 🔗 LowLevelM I just looked at the tracker and it says export in progress. Now I know why there are no jobs
04:24 🔗 Flashfire Yes, we export regularly to the archive
04:25 🔗 Flashfire should be finished soon then back to brute forcing
04:25 🔗 LowLevelM Why is there not a 2nd server? one could be exporting while the other one is up
04:26 🔗 Flashfire I mean, I don't know, to be honest, but this way works at the moment; there is just a little bit of downtime sometimes
04:44 🔗 Kaz There is no need for a second host..
04:44 🔗 Kaz That's absolutely not the solution to that problem in any case
04:57 🔗 LowLevelM Are there any url shorteners that need code written for them?
05:06 🔗 Flashfire I mean you can look at the code and see if you can fix any of the issues, LowLevelM
05:06 🔗 LowLevelM Ok
05:07 🔗 Flashfire https://github.com/ArchiveTeam/terroroftinytown LowLevelM
05:07 🔗 LowLevelM BTW When will google+ be removed as an available project?
05:08 🔗 Flashfire No idea I dont handle that stuff
05:09 🔗 LowLevelM Who does?
05:22 🔗 Kaz I could
05:22 🔗 Fusl I did
05:22 🔗 Kaz Thanks, I wasn't going to
05:40 🔗 dhyan_nat has joined #archiveteam-ot
05:44 🔗 LowLevelM has quit IRC (Ping timeout: 260 seconds)
06:30 🔗 dashcloud has quit IRC (Ping timeout: 265 seconds)
06:31 🔗 dashcloud has joined #archiveteam-ot
06:37 🔗 VoynichCr JAA: merged
07:19 🔗 schbirid has joined #archiveteam-ot
08:15 🔗 BlueMax has quit IRC (Quit: Leaving)
08:48 🔗 kbtoo_ has joined #archiveteam-ot
08:55 🔗 kbtoo has quit IRC (Read error: Operation timed out)
10:03 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
10:09 🔗 dhyan_nat has joined #archiveteam-ot
11:33 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
11:36 🔗 drcd has joined #archiveteam-ot
11:50 🔗 dhyan_nat has joined #archiveteam-ot
11:54 🔗 nataraj has joined #archiveteam-ot
11:55 🔗 dhyan_nat has quit IRC (Read error: Connection reset by peer)
12:01 🔗 nataraj has quit IRC (Ping timeout: 268 seconds)
12:06 🔗 nataraj has joined #archiveteam-ot
12:11 🔗 Despatche has joined #archiveteam-ot
12:12 🔗 nataraj has quit IRC (Ping timeout: 268 seconds)
12:12 🔗 nataraj has joined #archiveteam-ot
12:16 🔗 nataraj has quit IRC (Read error: Operation timed out)
12:16 🔗 cfarquhar has quit IRC (Read error: Operation timed out)
12:17 🔗 cfarquhar has joined #archiveteam-ot
12:19 🔗 Mateon1 has quit IRC (Remote host closed the connection)
12:19 🔗 Mateon1 has joined #archiveteam-ot
12:45 🔗 JAA snscrape now yields proper items for each service instead of just the URL. You can customise the CLI output format with the --format option; the fields are so far undocumented though.
12:46 🔗 JAA Example: snscrape --format '{date:%Y-%m-%d} {url} {content}' -n 10 twitter-user realDonaldTrump
12:47 🔗 JAA For Facebook and Instagram, snscrape will now by default print clean URLs instead of the hot garbage Facebook in particular returns. Use --format '{dirtyUrl}' if you want the previous output.
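The --format string in JAA's example behaves like Python's str.format, with the format spec after the colon passed through to each field (so a datetime field accepts strftime codes). A small sketch with a hypothetical item (the field names date, url, content are taken from the example above; the values are made up):

```python
from datetime import datetime, timezone

# Hypothetical scraped item; real snscrape items have more fields.
item = {
    "date": datetime(2019, 4, 18, 2, 45, 51, tzinfo=timezone.utc),
    "url": "https://example.com/post/1",
    "content": "hello",
}

# The spec after ':' is forwarded to datetime.__format__, i.e. strftime.
line = "{date:%Y-%m-%d} {url} {content}".format(**item)
print(line)  # 2019-04-18 https://example.com/post/1 hello
```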
12:49 🔗 VoynichCr awesome
12:49 🔗 Fusl awesome indeed
12:49 🔗 VoynichCr now we want a irc bot feature
12:50 🔗 VoynichCr (-:
12:51 🔗 JAA Hehe
12:51 🔗 JAA Should be trivial to create a bot for this, but I won't integrate it into snscrape directly.
13:08 🔗 Verified_ has joined #archiveteam-ot
13:13 🔗 JAA And now there's a --since option as well to stop when a certain date is reached. It accepts these date formats: ('%Y-%m-%d %H:%M:%S %z', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d %z', '%Y-%m-%d') (if no UTC offset is specified, UTC is used)
13:14 🔗 JAA (Note that it's %z (e.g. "+0200"), not %Z (e.g. "CEST"). Python doesn't like the latter, and I didn't want to add an external dependency just for this. https://bugs.python.org/issue22377)
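The described behaviour (try each format in turn, default to UTC when no offset matches) can be sketched as follows; this mirrors the description above, not snscrape's actual implementation:

```python
from datetime import datetime, timezone

# The format list from the --since description above.
FORMATS = ('%Y-%m-%d %H:%M:%S %z', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d %z', '%Y-%m-%d')

def parse_since(value: str) -> datetime:
    for fmt in FORMATS:
        try:
            dt = datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next format
        if dt.tzinfo is None:  # no %z matched: assume UTC
            dt = dt.replace(tzinfo=timezone.utc)
        return dt
    raise ValueError(f'unrecognised date: {value!r}')

print(parse_since('2019-04-18 02:45:51 +0200').utcoffset())  # 2:00:00
print(parse_since('2019-04-18'))  # 2019-04-18 00:00:00+00:00
```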
13:22 🔗 kiska I am fine with you piping it to my transfer.sh instance
13:23 🔗 kiska Automatically, if you want to add that feature
13:32 🔗 Shen has joined #archiveteam-ot
13:45 🔗 Fusl JAA: what about %s format (unix ts) for --since?
13:53 🔗 jspiros__ has quit IRC (Quit: ZNC - https://znc.in)
13:54 🔗 jspiros__ has joined #archiveteam-ot
14:01 🔗 JAA Fusl: Good idea, done.
14:03 🔗 Fusl thanks, that makes scripting for the irc bot a lot easier now
14:05 🔗 Hani has quit IRC (Quit: Going offline, see ya! (www.adiirc.com))
14:07 🔗 JAA Turns out that Instagram doesn't like it when you hammer their GraphQL endpoint.
14:07 🔗 JAA Implementing support for hashtags right now, and it's working too well. :-P
14:15 🔗 JAA Now there's an instagram-hashtag scraper. Expect issues if you combine it with --since: Instagram sometimes returns weird results, so the recursion might end too soon.
14:16 🔗 jspiros__ has quit IRC (Quit: ZNC - https://znc.in)
14:16 🔗 JAA Specifically, towards the end of the returned results, it sometimes returns some much older entries. Once you load the next page, it continues from where the newer ones stopped. But because of those intermediate older results, snscrape would already stop with --since.
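A toy illustration of that pitfall (not snscrape code): if a stray much-older entry appears at the end of one page while the next page continues with in-range posts, a scraper that stops at the first item older than --since loses everything after the stray.

```python
from datetime import datetime, timezone

since = datetime(2019, 4, 1, tzinfo=timezone.utc)
pages = [
    [  # page 1, newest first, with a stray old entry at the end
        datetime(2019, 4, 17, tzinfo=timezone.utc),
        datetime(2019, 4, 16, tzinfo=timezone.utc),
        datetime(2018, 1, 1, tzinfo=timezone.utc),  # stray old entry
    ],
    [  # page 2 continues where the newer results stopped
        datetime(2019, 4, 15, tzinfo=timezone.utc),
    ],
]

def naive_scrape(pages, since):
    out = []
    for page in pages:
        for ts in page:
            if ts < since:  # stop as soon as anything older appears
                return out
            out.append(ts)
    return out

print(len(naive_scrape(pages, since)))  # 2: the in-range 2019-04-15 post is lost
```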
14:54 🔗 nataraj has joined #archiveteam-ot
15:02 🔗 dhyan_nat has joined #archiveteam-ot
15:02 🔗 nataraj has quit IRC (Read error: Connection reset by peer)
15:11 🔗 dhyan_nat has quit IRC (Ping timeout: 268 seconds)
15:15 🔗 dhyan_nat has joined #archiveteam-ot
15:20 🔗 dhyan_nat has quit IRC (Ping timeout: 268 seconds)
15:59 🔗 yano bittorrent of mueller report: 4ffa11aa616e0481f23743e8e2d8c3105621f0d2 also available at https://www.justice.gov/storage/report.pdf
16:30 🔗 yano OCR Attempt: d614a8634a7bb48c75b1bae6e2c8b5ebb08df8fe
16:31 🔗 yano OCR Attempt: 101a36145ca0db64ccdc370f158ac4c9e7c4436f
16:32 🔗 dhyan_nat has joined #archiveteam-ot
16:37 🔗 JAA Of *course* they printed and scanned it again instead of uploading a clean document directly...
16:40 🔗 yano that's intentional
16:40 🔗 yano that's how the gov properly redacts things
16:41 🔗 JAA snscrape now has VKontakte support. (Tested with a bunch of random pages including profiles and communities: navalny, maria_butina, dm, durov, teachers.union, land_of_fire, id497489393.)
16:42 🔗 JAA Unfortunately, VK only returns a timestamp for the most recent posts, so older posts currently lack a date. I didn't feel like trying to write a parser for "today at 12:44 pm"-type strings involving the local timezone etc.
16:45 🔗 yano https://slate.com/technology/2016/06/house-democrats-improperly-redacted-documents-wrong-but-they-re-not-alone.html
16:56 🔗 JAA I'm aware of why they're doing it, but it's still silly. There are technical solutions for properly removing information from a digital document, or rather preventing it from being in the distributed document at all (e.g. \phantom in LaTeX just generates a correspondingly sized empty space, as far as I know).
16:57 🔗 JAA Whereas physically covering it with a marker or tape or whatever they use and scanning it again may still leave the original content detectable.
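A minimal LaTeX sketch of the \phantom approach JAA mentions (the redacted strings here are placeholders): the argument is measured but never drawn, so the compiled output contains only a blank of the right size. The caveat is that the text still exists in the .tex source, so only the compiled output is safe to distribute.

```latex
\documentclass{article}
\begin{document}
% \phantom typesets its argument invisibly: it reserves exactly the space
% the text would occupy, but the characters are not written to the output.
The suspect, \phantom{Jane Doe}, was questioned on
\phantom{18 April} at an undisclosed location.
\end{document}
```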
18:35 🔗 Stilettoo is now known as Stiletto
19:30 🔗 t3 VoynichCr: So I think I found a bug on https://www.archiveteam.org/index.php?title=ArchiveBot/Governments/Micronesia with the 'raw list' links. The links go to pages that appear to be blank. (https://www.archiveteam.org/index.php?title=ArchiveBot/Governments/Micronesia/list)
19:32 🔗 t3 VoynichCr: Actually, the 'raw list' links don't correspond to the tables.
19:44 🔗 VoynichCr t3: i see... that's because I am including content from another page in the wiki, but the template uses the title of the current page
19:45 🔗 VoynichCr i created that page before JAA added the sections feature, so I am going to convert it
19:52 🔗 JAA Ah, nice trick actually.
20:04 🔗 killsushi has joined #archiveteam-ot
20:26 🔗 chirlu has quit IRC (Ping timeout: 255 seconds)
20:53 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
21:20 🔗 drcd has quit IRC (Read error: Connection reset by peer)
21:46 🔗 Soni has joined #archiveteam-ot
21:48 🔗 katocala has joined #archiveteam-ot
21:57 🔗 JAA katocala: Can you find me some test cases for Gab? An empty profile and a small one in particular.
21:58 🔗 JAA (Empty meaning the user exists but hasn't made any posts.)
21:59 🔗 katocala I added with chromebot on 4/15: https://gab.ai/PastorLindstedt. It hasn't hit WBM yet to see how it went
22:00 🔗 katocala I'll check for others - I tried a handful
22:01 🔗 katocala 2/14 I used chromebot on https://gab.com/PatriotFront. WBM shows a blank: https://web.archive.org/web/20190214200215/https://gab.com/PatriotFront
22:02 🔗 katocala Might have the same issue as Instagram?
22:03 🔗 JAA Not the same issue, but Gab is unusable without JS and loads everything asynchronously.
22:05 🔗 katocala alright, this one isn't WBM material then, I guess
22:09 🔗 yano https://twitter.com/doidepsec and https://twitter.com/depsecdef were deleted
22:19 🔗 JAA katocala: It probably just needs some extra tricks and possibly better JS support in the WBM. Better to have the data in a currently inaccessible form than to not have it at all.
22:19 🔗 JAA I'm adding Gab support to snscrape anyway because it's really easy and could still be useful for other things.
22:22 🔗 JAA Found some small test cases in the meantime: https://gab.com/TRUTHREIGNS and https://gab.com/MsWen
22:40 🔗 JAA katocala: Gab support has landed.
22:41 🔗 JAA This has by far been the most pleasant service to work with so far.
22:45 🔗 JAA Gab has a clean public API which is also used on the website everywhere, and there is no cookie, special header, or parameter bullshit needed either. Just a simple API endpoint URL and a 'before' parameter which is either the publishing date of the last post on the page (for the "posts" tab) or a number (for comments and media).
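The 'before'-cursor pagination JAA describes can be sketched generically: each request asks for items published before the last item already seen. The endpoint URL and field name below are placeholders, not Gab's actual API shape.

```python
import urllib.parse

ENDPOINT = "https://example.com/api/feed"  # placeholder, not the real endpoint

def next_page_url(last_item_date: str) -> str:
    # Ask for everything published before the last item we have seen;
    # urlencode handles the ':' and '+' in the timestamp.
    query = urllib.parse.urlencode({"before": last_item_date})
    return f"{ENDPOINT}?{query}"

print(next_page_url("2019-04-18T02:45:51+00:00"))
```

Repeating this with the date of each page's last post walks the whole feed, which is all the log says the "posts" tab requires.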
23:05 🔗 katocala well damn. That is great!
23:16 🔗 BlueMax has joined #archiveteam-ot
23:52 🔗 marked https://gab.com/blah is an empty profile
23:54 🔗 marked https://gab.com/test has 138 posts
