#archiveteam-ot 2019-09-01,Sun


Time	Nickname	Message
00:13 ^🔗		ShellyRol has quit IRC (Ping timeout: 496 seconds)
00:14 ^🔗		ShellyRol has joined #archiveteam-ot
00:14 ^🔗		ephemer0l has quit IRC (Ping timeout: 745 seconds)
00:18 ^🔗		ephemer0l has joined #archiveteam-ot
00:42 ^🔗		Quirk8 has quit IRC (Read error: Operation timed out)
00:55 ^🔗		Quirk8 has joined #archiveteam-ot
01:00 ^🔗		Quirk8 has quit IRC (Read error: Operation timed out)
01:02 ^🔗		Quirk8 has joined #archiveteam-ot
01:16 ^🔗		ZizzyDizz has quit IRC (Ping timeout: 260 seconds)
01:19 ^🔗		BlueMax has quit IRC (Quit: Leaving)
02:13 ^🔗		ola_norsk has joined #archiveteam-ot
02:13 ^🔗		dxrt_2 is now known as dxrt_
02:21 ^🔗	ola_norsk	I archived a couple of letters just now, but when IA converts PDFs to txt, does it do so by OCR? Being too lazy to type, I decided to check the resulting full-text derivatives for links to copy-paste, but found that the URLs in the (Google Docs) PDFs were quite mangled https://imgur.com/m1fDP1e
02:22 ^🔗	ola_norsk	where e.g 'tinyurl.com' had become 'tinvurl.com'
02:23 ^🔗	ola_norsk	albeit, just before one of the links, there was a 'y' present, in the same font..
02:24 ^🔗	ola_norsk	item: https://archive.org/details/Qriist_letters_20190901 (the 'Summary' doc is where i noticed it)
02:26 ^🔗	ola_norsk	i've just never noticed that before, that's all
02:27 ^🔗	ola_norsk	that, or it's Google Docs' PDF export that's fubared
02:28 ^🔗	markedL	It says in the metadata at the bottom: Ocr ABBYY FineReader 11.0 (Extended OCR)
02:29 ^🔗	markedL	does the original PDF not have a text layer? Abbyy is often done during the original scan and pdf creation
02:29 ^🔗	ola_norsk	i have no idea, i just exported the letters from a google drive link to pdf
02:30 ^🔗	ola_norsk	but if it's OCR then it's to be expected i guess
02:37 ^🔗	markedL	did you scan the letters using a scanner from a paper copy?
02:39 ^🔗	ola_norsk	markedL: No. They are not mine. They are apparently written by the person who seemingly showed up to Tim Pool's door at 4am the other day. This is the 'Summary' original ( https://docs.google.com/document/d/1hnnybwRQoqkX0teHC2HCuErLiS9kMRV0SgyhYesI08Q/edit?usp=drivesdk ) . Might there be a better format i should export and upload?
02:40 ^🔗	markedL	yeah this link is text. The OCR should not have been needed, ideally.
02:40 ^🔗	ola_norsk	i don't know what is the most 'native' format of Google Docs
02:40 ^🔗	ola_norsk	so i simply picked pdf
02:43 ^🔗	markedL	so sounds like archive.org's fault. googledoc, to pdf, preserved the text. then archive OCR'd it to get a text format.
02:44 ^🔗	ola_norsk	could exporting and adding e.g. the *.odt files help?
02:44 ^🔗	ola_norsk	or the epubs perhaps?
02:48 ^🔗	ola_norsk	could be the underlining of the URLs that did it, causing 'y' and 'g' to become 'v' and 'q'
02:48 ^🔗	markedL	the PDFs on archive.org still have the text too, so it's their choice to use ABBYY to create the text version. I guess one reason to do that would be to better preserve formatting.
02:48 ^🔗	markedL	I agree it's the underlining confusing it
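A minimal sketch of how the 'y' to 'v' and 'g' to 'q' misreads described above could be undone when the real hostnames are already known (for instance from the PDF's own text layer). The function names and the confusion table are hypothetical and cover only the two substitutions reported in this log; this is not how IA or ABBYY handle it.

```python
# Hypothetical helper: map an OCR-misread hostname back to a known-good one.
# Assumption: the only confusions are y->v and g->q, as reported in this log.
DESCENDER_CONFUSIONS = str.maketrans({"y": "v", "g": "q"})


def degrade(host):
    """Apply the assumed OCR confusions so both sides compare on equal terms."""
    return host.lower().translate(DESCENDER_CONFUSIONS)


def repair_host(ocr_host, known_hosts):
    """Return the known host that degrades to the same string, or None."""
    target = degrade(ocr_host)
    for host in known_hosts:
        if degrade(host) == target:
            return host
    return None


if __name__ == "__main__":
    print(repair_host("tinvurl.com", ["archive.org", "tinyurl.com"]))
```

Comparing the degraded forms of both strings, rather than trying to guess which 'v' was originally a 'y', avoids rewriting hostnames that legitimately contain 'v' or 'q'.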
02:52 ^🔗	ola_norsk	i'll try re-exporting them as OpenDocument and add those as well and see what happens. I've rarely used google docs so i don't know what its most native format is.
02:53 ^🔗	markedL	is there a way to upload a text file that will overwrite the autoconverted text version?
02:55 ^🔗	ola_norsk	it's possible to replace the derived file i think
02:56 ^🔗	ola_norsk	though, i can't be arsed to export every format presented by google docs for the documents
02:57 ^🔗	ola_norsk	so i'll try to add the *.odt exports and re-derive the item and see if that helps
03:00 ^🔗	ola_norsk	btw (and 99% OT) , does the US have some sort of national army/military veteran outreach organization? More specifically in the Washington surroundings?
03:01 ^🔗	markedL	this might not be desired, but might be good for a test: https://docs.google.com/document/d/1236Dgb5QASspdGLu_XK5B6It6M0qov1AgOyxX_vIHUg/edit
03:04 ^🔗	markedL	that's a clone of the gdoc then removing the link detection.
03:05 ^🔗	markedL	the VA (Veterans Affairs) he mentioned is the govt agency. there's likely some non-profits as well though I can think of one
03:06 ^🔗	ola_norsk	what surprised me is this: I always figured that PDFs that were not mere scans, but saved by a word processor/editor, included the full text in the PDF, rather than e.g. simply converting it to vector graphics.
03:07 ^🔗	ola_norsk	and that IA simply extracted that text directly from the PDF
03:07 ^🔗	markedL	there is a way for IA to do that, so I'm not sure why they chose to use OCR, except maybe for layout preservation.
03:07 ^🔗	markedL	because as you say, the text is there in the PDF
03:08 ^🔗	markedL	regarding formats, you're basically looking for one that IA treats differently because gdocs is doing the sane thing
03:09 ^🔗	ola_norsk	could it be google docs exporting shitty PDFs?
03:09 ^🔗	markedL	odt, rtf, or epub would be my best guess
03:09 ^🔗	markedL	it could have been, but looking at the pdf, it has the text, so it's not gdocs this time.
03:09 ^🔗	markedL	it's IA. IA's choice would make sense if they also factor in they get scans and want a single work flow
03:11 ^🔗	markedL	linux has a few utilities that will strip the text out of a pdf the way you said
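One such utility is Poppler's `pdftotext`; the log does not name a specific tool, so that choice is an assumption. The sketch below wraps it from Python to read the embedded text layer without any OCR step. The function names are hypothetical, and `pdftotext` must be installed separately (commonly packaged as poppler-utils).

```python
import shutil
import subprocess


def pdftotext_command(pdf_path):
    """Build the argument list for Poppler's pdftotext.

    "-layout" asks it to keep the physical layout as far as possible;
    the trailing "-" sends the extracted text to stdout instead of a file.
    """
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_text_layer(pdf_path):
    """Return the text embedded in the PDF, with no OCR involved."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

For a PDF exported from a word processor this should return the URLs exactly as typed; for a scan without a text layer it will return little or nothing, which is the case where OCR is actually needed.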
03:26 ^🔗		qw3rty118 has joined #archiveteam-ot
03:29 ^🔗	ola_norsk	markedL: When in the IA document reader, would you happen to know which of the derived formats it uses? Or is that simply the original in the case of PDFs?
03:29 ^🔗	ola_norsk	what format the IA reader is showing, i guess is what i'm asking
03:32 ^🔗	ola_norsk	btw, IA appears unwilling to re-derive all files simply by adding a second original format
03:33 ^🔗		qw3rty117 has quit IRC (Ping timeout: 612 seconds)
03:34 ^🔗	ola_norsk	anywho, thanks for the insights. I simply never thought OCR might be an issue with non-scanned PDFs. Skål!
03:34 ^🔗		ola_norsk has quit IRC (leaving)
03:42 ^🔗		ZizzyDizz has joined #archiveteam-ot
03:43 ^🔗	ZizzyDizz	markedL: sorry for not getting back to you earlier, never got a notification for this tab. I got everything I wanted in the end, though.
03:44 ^🔗	ZizzyDizz	Specifically, I archived everything within this /channel/ https://disqus.com/home/channel/friendshipdaily/ and ended up doing it manually with webrecorder. I'd love to hear about chromebot though
04:13 ^🔗		dhyan_nat has joined #archiveteam-ot
04:31 ^🔗	markedL	ola_norsk : looks to me like IA is displaying an image
04:32 ^🔗	markedL	ZizzyDizz : I thought JA_ made something for your disqus for you
04:34 ^🔗	markedL	ola_norsk : yeah definitely an image> https://ia801507.us.archive.org/BookReader/BookReaderImages.php?zip=/32/items/Qriist_letters_20190901/Case%20Summary%20_jp2.zip&file=Case%20Summary%20_jp2/Case%20Summary%20_0000.jp2&scale=2.910958904109589&rotate=0
04:51 ^🔗		BlueMax has joined #archiveteam-ot
05:43 ^🔗		ZizzyDizz has quit IRC (Ping timeout: 260 seconds)
07:35 ^🔗		killsushi has quit IRC (Quit: Leaving)
09:32 ^🔗		dhyan_nat has quit IRC (Read error: Operation timed out)
09:56 ^🔗		chirlu` has quit IRC (Read error: Operation timed out)
10:15 ^🔗		BlueMax has quit IRC (Ping timeout: 745 seconds)
10:23 ^🔗		BlueMax has joined #archiveteam-ot
10:29 ^🔗		Somebody2 has quit IRC (west.us.hub irc.Prison.NET)
10:32 ^🔗		dhyan_nat has joined #archiveteam-ot
10:46 ^🔗		Mateon1 has quit IRC (Read error: Operation timed out)
10:48 ^🔗		Mateon1 has joined #archiveteam-ot
10:53 ^🔗		Somebody2 has joined #archiveteam-ot
11:07 ^🔗		dhyan_nat has quit IRC (Read error: Operation timed out)
11:30 ^🔗		kiskabak has quit IRC (Remote host closed the connection)
11:30 ^🔗		kiskabak has joined #archiveteam-ot
11:30 ^🔗		Fusl_ sets mode: +o kiskabak
11:30 ^🔗		Fusl sets mode: +o kiskabak
11:30 ^🔗		Fusl__ sets mode: +o kiskabak
11:33 ^🔗		kiska1 has quit IRC (Remote host closed the connection)
11:33 ^🔗		kiska1 has joined #archiveteam-ot
11:33 ^🔗		Fusl__ sets mode: +o kiska1
11:33 ^🔗		Fusl sets mode: +o kiska1
11:33 ^🔗		Fusl_ sets mode: +o kiska1
11:41 ^🔗		Quirk8 has quit IRC (Ping timeout: 246 seconds)
11:46 ^🔗		dhyan_nat has joined #archiveteam-ot
11:49 ^🔗		zino_ has quit IRC (Read error: Operation timed out)
12:00 ^🔗		zino_ has joined #archiveteam-ot
12:12 ^🔗		Quirk8 has joined #archiveteam-ot
12:32 ^🔗		BlueMax has quit IRC (Quit: Leaving)
12:53 ^🔗		DogsRNice has joined #archiveteam-ot
13:03 ^🔗	Dallas	Does anyone know (roughly) how quickly you can download from Instagram before your IP gets banned/rate limited?
13:21 ^🔗	ivan_	https://gitee.com/ TIL Chinese GitHub
13:55 ^🔗	JAA	Dallas: What I can tell you is that you get banned pretty quickly on the pagination (i.e. scrolling) through the GraphQL API. I haven't seen any issues with retrieving the actual post pages and images/videos yet on ArchiveBot jobs.
14:15 ^🔗		Mateon1 has quit IRC (Read error: Operation timed out)
14:15 ^🔗		Mateon1 has joined #archiveteam-ot
15:09 ^🔗		Mateon1 has quit IRC (Quit: Mateon1)
15:09 ^🔗		Mateon1 has joined #archiveteam-ot
15:09 ^🔗		Mateon1 has quit IRC (Client Quit)
15:09 ^🔗		Mateon1 has joined #archiveteam-ot
15:31 ^🔗		David_ has joined #archiveteam-ot
15:32 ^🔗	David_	WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
15:37 ^🔗		David_ has quit IRC (Ping timeout: 260 seconds)
15:59 ^🔗		kiskabak has quit IRC (Remote host closed the connection)
15:59 ^🔗		kiskabak has joined #archiveteam-ot
15:59 ^🔗		Fusl sets mode: +o kiskabak
15:59 ^🔗		Fusl__ sets mode: +o kiskabak
15:59 ^🔗		Fusl_ sets mode: +o kiskabak
16:02 ^🔗		kiska1 has quit IRC (Remote host closed the connection)
16:02 ^🔗		kiska1 has joined #archiveteam-ot
16:02 ^🔗		Fusl__ sets mode: +o kiska1
16:02 ^🔗		Fusl sets mode: +o kiska1
16:02 ^🔗		Fusl_ sets mode: +o kiska1
16:04 ^🔗		kiskabak has quit IRC (Remote host closed the connection)
16:04 ^🔗		kiskabak has joined #archiveteam-ot
16:04 ^🔗		Fusl__ sets mode: +o kiskabak
16:04 ^🔗		Fusl sets mode: +o kiskabak
16:04 ^🔗		Fusl_ sets mode: +o kiskabak
17:10 ^🔗		icedice has joined #archiveteam-ot
17:15 ^🔗		ShellyRol has quit IRC (Read error: Operation timed out)
17:15 ^🔗		ShellyRol has joined #archiveteam-ot
17:17 ^🔗		justas1 has quit IRC (Read error: Connection reset by peer)
17:17 ^🔗		justas1 has joined #archiveteam-ot
17:27 ^🔗		icedice2 has joined #archiveteam-ot
17:29 ^🔗		icedice2 has quit IRC (Client Quit)
17:31 ^🔗		icedice2 has joined #archiveteam-ot
17:33 ^🔗		icedice has quit IRC (Read error: Operation timed out)
18:00 ^🔗		Laverne has joined #archiveteam-ot
18:04 ^🔗		chirlu has joined #archiveteam-ot
18:09 ^🔗		icedice2 has quit IRC (Read error: Connection reset by peer)
18:09 ^🔗		icedice2 has joined #archiveteam-ot
18:12 ^🔗		icedice2 has quit IRC (Read error: Connection reset by peer)
18:13 ^🔗		icedice2 has joined #archiveteam-ot
18:15 ^🔗		icedice has joined #archiveteam-ot
18:24 ^🔗		icedice2 has quit IRC (Ping timeout: 612 seconds)
18:37 ^🔗		icedice2 has joined #archiveteam-ot
18:40 ^🔗		icedice has quit IRC (Ping timeout: 246 seconds)
18:43 ^🔗		icedice2 has quit IRC (Quit: Leaving)
18:43 ^🔗		icedice has joined #archiveteam-ot
18:56 ^🔗		icedice has quit IRC (Read error: Connection reset by peer)
18:56 ^🔗		icedice has joined #archiveteam-ot
19:11 ^🔗		icedice has quit IRC (Read error: Connection reset by peer)
19:11 ^🔗		icedice has joined #archiveteam-ot
19:37 ^🔗		icedice2 has joined #archiveteam-ot
19:40 ^🔗		icedice has quit IRC (Ping timeout: 252 seconds)
19:49 ^🔗		icedice2 has quit IRC (Leaving)
20:05 ^🔗		ShellyRol has quit IRC (Ping timeout: 252 seconds)
20:08 ^🔗		ShellyRol has joined #archiveteam-ot
20:21 ^🔗	markedL	David_ someone asked you in the main channel what you were planning to work on. answer there and wait around for a reply
20:48 ^🔗	t3	My browser seems to have some trouble loading the HTTPS version of the ArchiveBot Dashboard. The HTTP version is working fine, however.
20:49 ^🔗	t3	Does anyone else have the same issue? Visit https://dashboard.at.ninjawedding.org/
20:50 ^🔗	t3	Would it be a hassle to make the site work with HTTPS?
20:59 ^🔗	JAA	The AB dashboard has never worked through HTTPS.
21:00 ^🔗	JAA	And yes, it would be fairly complicated to make that work.
21:10 ^🔗		DogsRNice has quit IRC (Read error: Connection reset by peer)
21:42 ^🔗		dhyan_nat has quit IRC (Read error: Operation timed out)
22:00 ^🔗		benjins has quit IRC (Read error: Connection reset by peer)
22:03 ^🔗		benjins has joined #archiveteam-ot
22:16 ^🔗		Sanqui has quit IRC (Remote host closed the connection)
22:16 ^🔗		Sanqui has joined #archiveteam-ot
22:36 ^🔗		Leslie has joined #archiveteam-ot
23:59 ^🔗		Somebody2 has quit IRC (west.us.hub irc.Prison.NET)
