Time | Nickname | Message
00:19 | JAA | Good news: I've finally fixed https://github.com/JustAnotherArchivist/snscrape/issues/17
00:24 | jodizzle | JAA: Nice, that's a super valuable fix.
00:25 | jodizzle | Might have to revisit some Facebook pages I couldn't scrape before.
00:57 | | BlueMax has quit IRC (Quit: Leaving)
01:04 | Ravenloft | https://pbs.twimg.com/media/D1yyNpxXQAIAx_g.jpg
01:20 | | BlueMax has joined #archiveteam-ot
01:21 | JAA | jodizzle: You might want to wait with that. I noticed that my scraper probably missed a *lot* of posts on some Facebook pages, namely all that have /permalink.php URLs rather than something with /posts/, /photos/, or /videos/.
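A rough sketch of the URL shapes JAA is describing, illustrating why posts reachable only via /permalink.php can be dropped by a scraper that only recognises /posts/, /photos/, and /videos/ paths (illustrative only, not snscrape's actual code):

```python
from urllib.parse import urlparse

def classify_facebook_post_url(url: str) -> str:
    # /pagename/posts/..., /pagename/photos/..., /pagename/videos/... vs.
    # /permalink.php?story_fbid=...&id=... -- the last shape was the one being missed.
    path = urlparse(url).path
    if '/posts/' in path:
        return 'post'
    if '/photos/' in path:
        return 'photo'
    if '/videos/' in path:
        return 'video'
    if path.endswith('/permalink.php'):
        return 'permalink'
    return 'other'
```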
01:25 | JAA | https://github.com/JustAnotherArchivist/snscrape/issues/32
01:26 | | Despatche has quit IRC (Read error: Connection reset by peer)
01:34 | jodizzle | Ah, interesting
01:34 | jodizzle | I have noticed particularly low URL counts from snscrape on some Facebook pages in the past
02:20 | JAA | lol Facebook: https://transfer.notkiska.pw/a5UPe/fbdivs.png
02:20 | JAA | I couldn't even screenshot the whole thing due to my screen size. The whole tower is about twice as large.
02:36 | | kbtoo has joined #archiveteam-ot
02:37 | | kbtoo_ has quit IRC (Ping timeout: 255 seconds)
02:40 | JAA | T minus 5 minutes
02:44 | JAA | 1 minute
02:45 | JAA | > TZ=UTC date '+%Y-%m-%d %H:%M:%S %Z'
02:45 | JAA | 2019-04-18 02:45:51 UTC
02:45 | JAA | > date +%s
02:45 | JAA | 1555555551
02:46 | JAA | :-)
02:47 | Flashfire | what was that countdown for?
02:47 | JAA | See above. Unix time just reached a pretty neat number.
03:23 | | qw3rty113 has joined #archiveteam-ot
03:29 | | qw3rty112 has quit IRC (Read error: Operation timed out)
03:37 | | odemg has quit IRC (Ping timeout: 615 seconds)
03:44 | | odemg has joined #archiveteam-ot
04:00 | | Tsuser has joined #archiveteam-ot
04:04 | Fusl | JAA: yeet, I slept through it, again.
04:04 | Fusl | the same way I slept through 1500000000
04:17 | | LowLevelM has joined #archiveteam-ot
04:18 | LowLevelM | Are we going to have any new projects soon?
04:19 | Kaz | yes
04:19 | Flashfire | I mean if you want to suggest shorteners to add to the URL tracker, we are always accepting suggestions. In the grand scheme of things kaz might be able to tell you more.
04:20 | LowLevelM | Ok, is the urlteam tracker overloaded? It seems to have enough URLs
04:21 | Kaz | no, it's just rate-limited in a sane way
04:21 | LowLevelM | ok
04:21 | Flashfire | Ahaha we are currently exporting. We can always use more shorteners to add; it's just a pain picking which ones, hence I usually ask for suggestions
04:21 | Flashfire | We are always up for more workers, but as kaz said there is some rate limiting to avoid getting banned by any shorteners
04:22 | LowLevelM | Will there be a rescan to get new URLs later?
04:22 | Flashfire | Sometimes we do rescans of older shortcode ranges, but it's not always viable
04:22 | Flashfire | Like we don't want to scrape bitly from 0 again, for example, but a smaller URL service may get rescanned every few years if it's viable
04:23 | Flashfire | But everyone treats it a little differently. I am the newest admin there, so I might be overruled later on ahahaha
04:24 | LowLevelM | I just looked at the tracker and it says export in progress. Now I know why there are no jobs
04:24 | Flashfire | Yes, we export regularly to the archive
04:25 | Flashfire | should be finished soon, then back to brute forcing
04:25 | LowLevelM | Why is there not a 2nd server? One could be exporting while the other one is up
04:26 | Flashfire | I mean I don't know, if I am honest, but this way works at the moment; there is just a little bit of downtime sometimes
04:44 | Kaz | There is no need for a second host..
04:44 | Kaz | That's absolutely not the solution to that problem in any case
04:57 | LowLevelM | Are there any URL shorteners that need code written for them?
05:06 | Flashfire | I mean you can look at the code and see if you can fix any of the issues, LowLevelM
05:06 | LowLevelM | Ok
05:07 | Flashfire | https://github.com/ArchiveTeam/terroroftinytown LowLevelM
05:07 | LowLevelM | BTW, when will Google+ be removed as an available project?
05:08 | Flashfire | No idea, I don't handle that stuff
05:09 | LowLevelM | Who does?
05:22 | Kaz | I could
05:22 | Fusl | I did
05:22 | Kaz | Thanks, I wasn't going to
05:40 | | dhyan_nat has joined #archiveteam-ot
05:44 | | LowLevelM has quit IRC (Ping timeout: 260 seconds)
06:30 | | dashcloud has quit IRC (Ping timeout: 265 seconds)
06:31 | | dashcloud has joined #archiveteam-ot
06:37 | VoynichCr | JAA: merged
07:19 | | schbirid has joined #archiveteam-ot
08:15 | | BlueMax has quit IRC (Quit: Leaving)
08:48 | | kbtoo_ has joined #archiveteam-ot
08:55 | | kbtoo has quit IRC (Read error: Operation timed out)
10:03 | | dhyan_nat has quit IRC (Read error: Operation timed out)
10:09 | | dhyan_nat has joined #archiveteam-ot
11:33 | | dhyan_nat has quit IRC (Read error: Operation timed out)
11:36 | | drcd has joined #archiveteam-ot
11:50 | | dhyan_nat has joined #archiveteam-ot
11:54 | | nataraj has joined #archiveteam-ot
11:55 | | dhyan_nat has quit IRC (Read error: Connection reset by peer)
12:01 | | nataraj has quit IRC (Ping timeout: 268 seconds)
12:06 | | nataraj has joined #archiveteam-ot
12:11 | | Despatche has joined #archiveteam-ot
12:12 | | nataraj has quit IRC (Ping timeout: 268 seconds)
12:12 | | nataraj has joined #archiveteam-ot
12:16 | | nataraj has quit IRC (Read error: Operation timed out)
12:16 | | cfarquhar has quit IRC (Read error: Operation timed out)
12:17 | | cfarquhar has joined #archiveteam-ot
12:19 | | Mateon1 has quit IRC (Remote host closed the connection)
12:19 | | Mateon1 has joined #archiveteam-ot
12:45 | JAA | snscrape now yields proper items for each service instead of just the URL. You can customise the CLI output format with the --format option; the fields are so far undocumented though.
12:46 | JAA | Example: snscrape --format '{date:%Y-%m-%d} {url} {content}' -n 10 twitter-user realDonaldTrump
12:47 | JAA | For Facebook and Instagram, snscrape will now by default print clean URLs instead of the hot garbage Facebook in particular returns. Use --format '{dirtyUrl}' if you want the previous output.
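A small sketch of how the --format option could be driven from a script, based only on the example invocation above; the {date}, {url}, and {content} field names come from that example, and everything else here is an assumption:

```python
import subprocess

cmd = [
    'snscrape',
    # Tab-separated fields for easy splitting; field names taken from the example above.
    '--format', '{date:%Y-%m-%d}\t{url}\t{content}',
    '-n', '10',
    'twitter-user', 'realDonaldTrump',
]
output = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
for line in output.splitlines():
    date, url, content = line.split('\t', 2)
    print(date, url)  # content with embedded newlines would need extra care
```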
12:49 | VoynichCr | awesome
12:49 | Fusl | awesome indeed
12:49 | VoynichCr | now we want an IRC bot feature
12:50 | VoynichCr | (-:
12:51 | JAA | Hehe
12:51 | JAA | Should be trivial to create a bot for this, but I won't integrate it into snscrape directly.
13:08 | | Verified_ has joined #archiveteam-ot
13:13 | JAA | And now there's a --since option as well to stop when a certain date is reached. It accepts these date formats: ('%Y-%m-%d %H:%M:%S %z', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d %z', '%Y-%m-%d') (if no UTC offset is specified, UTC is used)
13:14 | JAA | (Note that it's %z (e.g. "+0200"), not %Z (e.g. "CEST"). Python doesn't like the latter, and I didn't want to add an external dependency just for this. https://bugs.python.org/issue22377)
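An illustration of how the listed formats can be tried in order with Python's strptime, falling back to UTC when no offset is present; this mirrors the behaviour described above but is not necessarily snscrape's exact implementation:

```python
import datetime

FORMATS = ('%Y-%m-%d %H:%M:%S %z', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d %z', '%Y-%m-%d')

def parse_since(value):
    for fmt in FORMATS:
        try:
            dt = datetime.datetime.strptime(value, fmt)
        except ValueError:
            continue
        if dt.tzinfo is None:
            # No %z in the matched format: assume UTC, as described above.
            dt = dt.replace(tzinfo=datetime.timezone.utc)
        return dt
    raise ValueError(f'unrecognised date: {value!r}')

# parse_since('2019-04-18 02:45:51 +0200') works; a '%Z' name like 'CEST' would not
# (see https://bugs.python.org/issue22377).
```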
13:22 | kiska | I am fine with you piping it to my transfer.sh instance
13:23 | kiska | Automatically, if you want to add that feature
13:32 | | Shen has joined #archiveteam-ot
13:45 | Fusl | JAA: what about %s format (unix ts) for --since?
13:53 | | jspiros__ has quit IRC (Quit: ZNC - https://znc.in)
13:54 | | jspiros__ has joined #archiveteam-ot
14:01 | JAA | Fusl: Good idea, done.
14:03 | Fusl | thanks, that makes scripting for the irc bot a lot easier now
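A minimal sketch of the polling core such a bot could wrap around the snscrape CLI, using the unix-timestamp form of --since added above; the scraper/target names are placeholders, the flag placement follows the earlier --format example, and the IRC plumbing itself is left out:

```python
import subprocess
import time

def poll(scraper, target, interval=300):
    """Yield URLs of posts newer than the last poll, forever."""
    last_check = int(time.time())
    seen = set()
    while True:
        cmd = ['snscrape', '--format', '{url}', '--since', str(last_check), scraper, target]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        last_check = int(time.time())
        for url in out.splitlines():
            if url not in seen:
                seen.add(url)
                yield url  # an IRC bot would PRIVMSG this to a channel instead
        time.sleep(interval)

# e.g. for url in poll('twitter-user', 'someaccount'): ...  # placeholder account name
```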
14:05 | | Hani has quit IRC (Quit: Going offline, see ya! (www.adiirc.com))
14:07 | JAA | Turns out that Instagram doesn't like it when you hammer their GraphQL endpoint.
14:07 | JAA | Implementing support for hashtags right now, and it's working too well. :-P
14:15 | JAA | Now there's an instagram-hashtag scraper. Expect issues if you combine it with --since: Instagram sometimes returns weird results, so the recursion might end too soon.
14:16 | | jspiros__ has quit IRC (Quit: ZNC - https://znc.in)
14:16 | JAA | Specifically, towards the end of the returned results, it sometimes returns some much older entries. Once you load the next page, it continues from where the newer ones stopped. But because of those intermediate older results, snscrape would already stop with --since.
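One conceivable guard against this (a hypothetical workaround, not what snscrape does): only stop once several consecutive items fall before the cutoff, so a few stray old entries near the end of a page don't terminate the scrape early.

```python
def take_until(items, cutoff, patience=20):
    """Yield items until `patience` consecutive ones are dated before `cutoff`."""
    too_old = 0
    for item in items:
        if item.date < cutoff:
            too_old += 1
            if too_old >= patience:
                break  # enough consecutive old items: assume we really are past the cutoff
            continue  # skip a stray out-of-order old item but keep scanning
        too_old = 0
        yield item
```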
14:54 | | nataraj has joined #archiveteam-ot
15:02 | | dhyan_nat has joined #archiveteam-ot
15:02 | | nataraj has quit IRC (Read error: Connection reset by peer)
15:11 | | dhyan_nat has quit IRC (Ping timeout: 268 seconds)
15:15 | | dhyan_nat has joined #archiveteam-ot
15:20 | | dhyan_nat has quit IRC (Ping timeout: 268 seconds)
15:59 | yano | bittorrent of mueller report: 4ffa11aa616e0481f23743e8e2d8c3105621f0d2 also available at https://www.justice.gov/storage/report.pdf
16:30 | yano | OCR Attempt: d614a8634a7bb48c75b1bae6e2c8b5ebb08df8fe
16:31 | yano | OCR Attempt: 101a36145ca0db64ccdc370f158ac4c9e7c4436f
16:32 | | dhyan_nat has joined #archiveteam-ot
16:37 | JAA | Of *course* they printed and scanned it again instead of uploading a clean document directly...
16:40 | yano | that's intentional
16:40 | yano | that's how the gov properly redacts things
16:41 | JAA | snscrape now has VKontakte support. (Tested with a bunch of random pages including profiles and communities: navalny, maria_butina, dm, durov, teachers.union, land_of_fire, id497489393.)
16:42 | JAA | Unfortunately, VK only returns a timestamp for the most recent posts, so older posts currently lack a date. I didn't feel like trying to write a parser for "today at 12:44 pm"-type strings involving the local timezone etc.
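For a sense of what such a parser would involve, here is a rough sketch that handles only English "today/yesterday at HH:MM am/pm" strings with a known timezone; real VK output is locale-dependent, which is a large part of why this wasn't implemented:

```python
import datetime
import re

def parse_relative(text, tz):
    """Parse e.g. 'today at 12:44 pm' relative to the current date in `tz`."""
    m = re.fullmatch(r'(today|yesterday) at (\d{1,2}):(\d{2}) (am|pm)', text.strip().lower())
    if not m:
        return None  # absolute dates, 'two hours ago', other locales, ... not handled
    day, hour, minute, half = m.group(1), int(m.group(2)), int(m.group(3)), m.group(4)
    hour = hour % 12 + (12 if half == 'pm' else 0)
    base = datetime.datetime.now(tz).date()
    if day == 'yesterday':
        base -= datetime.timedelta(days=1)
    return datetime.datetime.combine(base, datetime.time(hour, minute), tzinfo=tz)
```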
16:45 | yano | https://slate.com/technology/2016/06/house-democrats-improperly-redacted-documents-wrong-but-they-re-not-alone.html
16:56 | JAA | I'm aware of why they're doing it, but it's still silly. There are technical solutions for properly removing information from a digital document, or rather for preventing it from being in the distributed document at all (e.g. \phantom in LaTeX just generates a corresponding empty space as far as I know).
16:57 | JAA | Whereas physically covering it with a marker or tape or whatever they use and scanning it again may still leave the original content detectable.
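A tiny illustration of the \phantom approach mentioned above, assuming the redacted version is typeset from source; the argument is only measured for its size, so the hidden text never ends up in the produced PDF:

```latex
% The argument of \phantom is measured for spacing but never typeset, so the
% hidden text does not appear in the produced PDF -- unlike drawing a black box
% over text that is still present underneath.
The meeting took place at \phantom{REDACTED LOCATION} on the agreed date.
```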
18:35 | | Stilettoo is now known as Stiletto
19:30 | t3 | VoynichCr: So I think I found a bug on https://www.archiveteam.org/index.php?title=ArchiveBot/Governments/Micronesia with the 'raw list' links. The links go to pages that appear to be blank. (https://www.archiveteam.org/index.php?title=ArchiveBot/Governments/Micronesia/list)
19:32 | t3 | VoynichCr: Actually, the 'raw list' links don't correspond to the tables.
19:44 | VoynichCr | t3: I see... that's because I am including content from another page in the wiki, but the template uses the title of the current page
19:45 | VoynichCr | I created that page before JAA added the sections feature, so I am going to convert it
19:52 | JAA | Ah, nice trick actually.
20:04 | | killsushi has joined #archiveteam-ot
20:26 | | chirlu has quit IRC (Ping timeout: 255 seconds)
20:53 | | dhyan_nat has quit IRC (Read error: Operation timed out)
21:20 | | drcd has quit IRC (Read error: Connection reset by peer)
21:46 | | Soni has joined #archiveteam-ot
21:48 | | katocala has joined #archiveteam-ot
21:57 | JAA | katocala: Can you find me some test cases for Gab? An empty profile and a small one in particular.
21:58 | JAA | (Empty meaning the user exists but hasn't made any posts.)
21:59 | katocala | I added with chromebot on 4/15: https://gab.ai/PastorLindstedt. It hasn't hit WBM yet to see how it went
22:00 | katocala | I'll check for others - I tried a handful
22:01 | katocala | 2/14 I used chromebot on https://gab.com/PatriotFront. WBM shows a blank: https://web.archive.org/web/20190214200215/https://gab.com/PatriotFront
22:02 | katocala | Might have the same issue as Instagram?
22:03 | JAA | Not the same issue, but Gab is unusable without JS and loads everything asynchronously.
22:05 | katocala | alright, this one isn't WBM material then, I guess
22:09 | yano | https://twitter.com/doidepsec and https://twitter.com/depsecdef were deleted
22:19 | JAA | katocala: It probably just needs some extra tricks and possibly better JS support in the WBM. Better to have the data in a currently inaccessible form than to not have it at all.
22:19 | JAA | I'm adding Gab support to snscrape anyway because it's really easy and could still be useful for other things.
22:22 | JAA | Found some small test cases in the meantime: https://gab.com/TRUTHREIGNS and https://gab.com/MsWen
22:40 | JAA | katocala: Gab support has landed.
22:41 | JAA | This has by far been the most pleasant service to work with so far.
22:45 | JAA | Gab has a clean public API which is also used on the website everywhere, and there is no cookie, special header, or parameter bullshit needed either. Just a simple API endpoint URL and a 'before' parameter which is either the publishing date of the last post on the page (for the "posts" tab) or a number (for comments and media).
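A sketch of that 'before'-based pagination pattern; the endpoint URL and JSON field names below are hypothetical placeholders, since the message describes only the mechanism, not the exact endpoint or schema:

```python
import requests

def fetch_all_posts(feed_url):
    """Page through a feed by passing the last item's publication date as 'before'."""
    before = None
    while True:
        params = {'before': before} if before else {}
        posts = requests.get(feed_url, params=params, timeout=30).json()
        if not posts:
            break
        yield from posts
        before = posts[-1]['published_at']  # hypothetical field name

# e.g. fetch_all_posts('https://example.invalid/api/feed/someuser')  # placeholder URL
```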
23:05 | katocala | well damn. That is great!
23:16 | | BlueMax has joined #archiveteam-ot
23:52 | marked | https://gab.com/blah is an empty profile
23:54 | marked | https://gab.com/test has 138 posts