[00:13] *** ShellyRol has quit IRC (Ping timeout: 496 seconds)
[00:14] *** ShellyRol has joined #archiveteam-ot
[00:14] *** ephemer0l has quit IRC (Ping timeout: 745 seconds)
[00:18] *** ephemer0l has joined #archiveteam-ot
[00:42] *** Quirk8 has quit IRC (Read error: Operation timed out)
[00:55] *** Quirk8 has joined #archiveteam-ot
[01:00] *** Quirk8 has quit IRC (Read error: Operation timed out)
[01:02] *** Quirk8 has joined #archiveteam-ot
[01:16] *** ZizzyDizz has quit IRC (Ping timeout: 260 seconds)
[01:19] *** BlueMax has quit IRC (Quit: Leaving)
[02:13] *** ola_norsk has joined #archiveteam-ot
[02:13] *** dxrt_2 is now known as dxrt_
[02:21] <ola_norsk> I archived a couple of letters just now; But, when IA converts pdf's to txt, does it do so by OCR? Being too lazy to type, i decided to check the resulting full-text derivatives for links to copy-paste.. but found that the urls in the (google docs) pdf's were quite mangled https://imgur.com/m1fDP1e
[02:22] <ola_norsk> where e.g 'tinyurl.com' had become 'tinvurl.com'
[02:23] <ola_norsk> albeit, just before one of the links, there was a 'y' present, in the same font..
[02:24] <ola_norsk> item: https://archive.org/details/Qriist_letters_20190901 (the 'Summary' doc is where i noticed it)
[02:26] <ola_norsk> i've just never noticed that before, that's all
[02:27] <ola_norsk> that, or it's Google Doc's pdf exporting that fubared
[02:28] <markedL> It says in the metadata at the bottom: Ocr ABBYY FineReader 11.0 (Extended OCR)
[02:29] <markedL> does the original PDF not have a text layer? ABBYY OCR is often run during the original scan and pdf creation
[02:29] <ola_norsk> i have no idea, i just exported the letters from a google drive link to pdf
[02:30] <ola_norsk> but if it's OCR then it's to be expected i guess
[02:37] <markedL> did you scan the letters using a scanner from a paper copy? 
[02:39] <ola_norsk> markedL: No. They are not mine. They are apparently written by the person who seemingly showed up to Tim Pool's door at 4am the other day. This is the 'Summary' original ( https://docs.google.com/document/d/1hnnybwRQoqkX0teHC2HCuErLiS9kMRV0SgyhYesI08Q/edit?usp=drivesdk ) . Might there be a better format i should export and upload?
[02:40] <markedL> yeah this link is text.  The OCR should not have been needed, ideally. 
[02:40] <ola_norsk> i don't know what is the most 'native' format of Google Docs
[02:40] <ola_norsk> so i simply picked pdf
[02:43] <markedL> so sounds like archive.org's fault.  googledoc, to pdf, preserved the text.  then archive OCR'd it to get a text format.
[02:44] <ola_norsk> could exporting and adding e.g the *.odt files help?
[02:44] <ola_norsk> or the epubs perhaps?
[02:48] <ola_norsk> could be the underlining of the urls that did it, causing 'y' and 'g' to become 'v' and 'q'
[02:48] <markedL> the pdf's on archive.org have the text still too, so it's their choice to use abbyy to create the text version. I guess a reason to do that would be to better preserve formatting.
[02:48] <markedL> I agree it's the underlining confusing it
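A minimal sketch of how the mangled hosts mentioned above ('tinvurl.com' for 'tinyurl.com') could be repaired after the fact, assuming the only OCR confusions are the two reported in the chat (y read as v, g read as q) and that a list of expected hosts is available. The names `CONFUSIONS`, `candidate_hosts`, and `repair_host` are hypothetical and not part of any IA tooling.

```python
from itertools import product

# Hypothetical confusion table: OCR output character -> characters it may stand for.
# Only the two swaps reported in the chat (y read as v, g read as q) are listed.
CONFUSIONS = {"v": "vy", "q": "qg"}


def candidate_hosts(host):
    """Yield every spelling of `host` obtainable by undoing the listed confusions."""
    options = [CONFUSIONS.get(ch, ch) for ch in host]
    for combo in product(*options):
        yield "".join(combo)


def repair_host(host, known_hosts):
    """Return the single known host that `host` could stand for.

    If no candidate, or more than one candidate, is in `known_hosts`,
    the input is returned unchanged rather than guessed at.
    """
    matches = [c for c in candidate_hosts(host) if c in known_hosts]
    return matches[0] if len(matches) == 1 else host


known = {"tinyurl.com", "imgur.com"}  # assumed list of hosts expected in the letters
print(repair_host("tinvurl.com", known))
```

The candidate set doubles with every 'v' or 'q' in the host, so this only suits short strings such as hostnames.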
[02:52] <ola_norsk> i'll try re-exporting them as OpenDocument and add those as well and see what happens. I've rarely used google docs so i don't know what its most native format is.
[02:53] <markedL> is there a way to upload a text file that will overwrite the autoconverted text version? 
[02:55] <ola_norsk> it's possible to replace the derived file i think
[02:56] <ola_norsk> though, i can't be arsed to export every format presented by google docs for the documents
[02:57] <ola_norsk> so i'll try to add the *.odt exports and re-derive the item and see if that helps
[03:00] <ola_norsk> btw (and 99% OT) , does the US have some sort of national army/military veteran outreach organization? More specifically in the Washington surroundings?
[03:01] <markedL> this might not be desired, but might be good for a test: https://docs.google.com/document/d/1236Dgb5QASspdGLu_XK5B6It6M0qov1AgOyxX_vIHUg/edit
[03:04] <markedL> that's a clone of the gdoc then removing the link detection.  
[03:05] <markedL> the VA (Veterans Affairs) he mentioned is the govt agency. there's likely some non-profits as well though I can think of one 
[03:06] <ola_norsk> what i figured, and what surprised me is; I always figured that pdf's that were not mere scans, but saved by a word-processor/editor, included the full text in the pdf. Not e.g simply converting it to vector graphics.
[03:07] <ola_norsk> and, that IA simply extracted that text directly from the PDF
[03:07] <markedL> there is a way for IA to do that, so I'm not sure why they chose to use OCR, except maybe for layout preservation.
[03:07] <markedL> because as you say, the text is there in the PDF
[03:08] <markedL> regarding formats, you're basically looking for one that IA treats differently because gdocs is doing the sane thing
[03:09] <ola_norsk> could it be google docs exporting shitty pdf's ?
[03:09] <markedL> odt, rtf, or epub would be my best guess 
[03:09] <markedL> it could have been, but looking at the pdf, it has the text, so it's not gdocs this time.
[03:09] <markedL> it's IA.  IA's choice would make sense if they also factor in they get scans and want a single work flow
[03:11] <markedL> linux has a few utilities that will strip the text out of a pdf the way you said
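A minimal sketch of pulling the embedded text layer straight out of a PDF, with no OCR involved. The chat does not name a utility; `pdftotext` from poppler-utils is assumed here as one such tool, and the helper names and the file name `summary.pdf` are hypothetical.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path):
    """Build the pdftotext invocation for `pdf_path`.

    "-layout" asks pdftotext to keep the physical page layout,
    and the trailing "-" sends the extracted text to stdout.
    """
    return ["pdftotext", "-layout", pdf_path, "-"]


def embedded_text(pdf_path):
    """Return the PDF's embedded text, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise if pdftotext reports an error
    )
    return result.stdout


print(pdftotext_command("summary.pdf"))
```

If the returned text already contains intact URLs, the PDF carries a usable text layer and the mangling comes from a later OCR step.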
[03:26] *** qw3rty118 has joined #archiveteam-ot
[03:29] <ola_norsk> markedL: When in the IA document reader, would you happen to know which of the derived formats it uses? Or is that simply the original in the case of pdf's ?
[03:29] <ola_norsk> what format the IA reader is showing, i guess is what i'm asking
[03:32] <ola_norsk> btw, IA appears unwilling to re-derive all files simply by adding a second original format
[03:33] *** qw3rty117 has quit IRC (Ping timeout: 612 seconds)
[03:34] <ola_norsk> anywho, thanks for insights. I simply never thought OCR might be an issue with non-scanned pdf's. Skål!
[03:34] *** ola_norsk has quit IRC (leaving)
[03:42] *** ZizzyDizz has joined #archiveteam-ot
[03:43] <ZizzyDizz> markedL: sorry for not getting back to you earlier, never got a notification for this tab. I got everything I wanted though in the end.
[03:44] <ZizzyDizz> Specifically archived everything within this /channel/ https://disqus.com/home/channel/friendshipdaily/ ended up doing it manually with webrecorder.  I'd love to hear about chromebot though
[04:13] *** dhyan_nat has joined #archiveteam-ot
[04:31] <markedL> ola_norsk : looks to me like IA is displaying an image
[04:32] <markedL> ZizzyDizz : I thought JA_ made something for your disqus for you 
[04:34] <markedL> ola_norsk : yeah definitely an image: https://ia801507.us.archive.org/BookReader/BookReaderImages.php?zip=/32/items/Qriist_letters_20190901/Case%20Summary%20_jp2.zip&file=Case%20Summary%20_jp2/Case%20Summary%20_0000.jp2&scale=2.910958904109589&rotate=0
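A minimal sketch of reading the reader URL above to see what is being served, using only the standard library. It decodes the query string and returns the `file` parameter; the helper name `served_file` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# The BookReader URL quoted in the chat.
READER_URL = (
    "https://ia801507.us.archive.org/BookReader/BookReaderImages.php"
    "?zip=/32/items/Qriist_letters_20190901/Case%20Summary%20_jp2.zip"
    "&file=Case%20Summary%20_jp2/Case%20Summary%20_0000.jp2"
    "&scale=2.910958904109589&rotate=0"
)


def served_file(reader_url):
    """Return the percent-decoded `file` query parameter of a reader URL."""
    params = parse_qs(urlsplit(reader_url).query)
    return params["file"][0]


print(served_file(READER_URL))
```

The path ends in `.jp2`, a JPEG 2000 page image inside the `_jp2.zip` derivative, which matches the observation that the reader shows an image rather than the PDF's text.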
[04:51] *** BlueMax has joined #archiveteam-ot
[05:43] *** ZizzyDizz has quit IRC (Ping timeout: 260 seconds)
[07:35] *** killsushi has quit IRC (Quit: Leaving)
[09:32] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[09:56] *** chirlu` has quit IRC (Read error: Operation timed out)
[10:15] *** BlueMax has quit IRC (Ping timeout: 745 seconds)
[10:23] *** BlueMax has joined #archiveteam-ot
[10:29] *** Somebody2 has quit IRC (west.us.hub irc.Prison.NET)
[10:32] *** dhyan_nat has joined #archiveteam-ot
[10:46] *** Mateon1 has quit IRC (Read error: Operation timed out)
[10:48] *** Mateon1 has joined #archiveteam-ot
[10:53] *** Somebody2 has joined #archiveteam-ot
[11:07] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[11:30] *** kiskabak has quit IRC (Remote host closed the connection)
[11:30] *** kiskabak has joined #archiveteam-ot
[11:30] *** Fusl_ sets mode: +o kiskabak
[11:30] *** Fusl sets mode: +o kiskabak
[11:30] *** Fusl__ sets mode: +o kiskabak
[11:33] *** kiska1 has quit IRC (Remote host closed the connection)
[11:33] *** kiska1 has joined #archiveteam-ot
[11:33] *** Fusl__ sets mode: +o kiska1
[11:33] *** Fusl sets mode: +o kiska1
[11:33] *** Fusl_ sets mode: +o kiska1
[11:41] *** Quirk8 has quit IRC (Ping timeout: 246 seconds)
[11:46] *** dhyan_nat has joined #archiveteam-ot
[11:49] *** zino_ has quit IRC (Read error: Operation timed out)
[12:00] *** zino_ has joined #archiveteam-ot
[12:12] *** Quirk8 has joined #archiveteam-ot
[12:32] *** BlueMax has quit IRC (Quit: Leaving)
[12:53] *** DogsRNice has joined #archiveteam-ot
[13:03] <Dallas> Does anyone know (roughly) how quickly you can download from instagram before your ip gets banned/rate limited ?
[13:21] <ivan_> https://gitee.com/ TIL Chinese GitHub
[13:55] <JAA> Dallas: What I can tell you is that you get banned pretty quickly on the pagination (i.e. scrolling) through the GraphQL API. I haven't seen any issues with retrieving the actual post pages and images/videos yet on ArchiveBot jobs.
[14:15] *** Mateon1 has quit IRC (Read error: Operation timed out)
[14:15] *** Mateon1 has joined #archiveteam-ot
[15:09] *** Mateon1 has quit IRC (Quit: Mateon1)
[15:09] *** Mateon1 has joined #archiveteam-ot
[15:09] *** Mateon1 has quit IRC (Client Quit)
[15:09] *** Mateon1 has joined #archiveteam-ot
[15:31] *** David_ has joined #archiveteam-ot
[15:32] <David_> WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[15:37] *** David_ has quit IRC (Ping timeout: 260 seconds)
[15:59] *** kiskabak has quit IRC (Remote host closed the connection)
[15:59] *** kiskabak has joined #archiveteam-ot
[15:59] *** Fusl sets mode: +o kiskabak
[15:59] *** Fusl__ sets mode: +o kiskabak
[15:59] *** Fusl_ sets mode: +o kiskabak
[16:02] *** kiska1 has quit IRC (Remote host closed the connection)
[16:02] *** kiska1 has joined #archiveteam-ot
[16:02] *** Fusl__ sets mode: +o kiska1
[16:02] *** Fusl sets mode: +o kiska1
[16:02] *** Fusl_ sets mode: +o kiska1
[16:04] *** kiskabak has quit IRC (Remote host closed the connection)
[16:04] *** kiskabak has joined #archiveteam-ot
[16:04] *** Fusl__ sets mode: +o kiskabak
[16:04] *** Fusl sets mode: +o kiskabak
[16:04] *** Fusl_ sets mode: +o kiskabak
[17:10] *** icedice has joined #archiveteam-ot
[17:15] *** ShellyRol has quit IRC (Read error: Operation timed out)
[17:15] *** ShellyRol has joined #archiveteam-ot
[17:17] *** justas1 has quit IRC (Read error: Connection reset by peer)
[17:17] *** justas1 has joined #archiveteam-ot
[17:27] *** icedice2 has joined #archiveteam-ot
[17:29] *** icedice2 has quit IRC (Client Quit)
[17:31] *** icedice2 has joined #archiveteam-ot
[17:33] *** icedice has quit IRC (Read error: Operation timed out)
[18:00] *** Laverne has joined #archiveteam-ot
[18:04] *** chirlu has joined #archiveteam-ot
[18:09] *** icedice2 has quit IRC (Read error: Connection reset by peer)
[18:09] *** icedice2 has joined #archiveteam-ot
[18:12] *** icedice2 has quit IRC (Read error: Connection reset by peer)
[18:13] *** icedice2 has joined #archiveteam-ot
[18:15] *** icedice has joined #archiveteam-ot
[18:24] *** icedice2 has quit IRC (Ping timeout: 612 seconds)
[18:37] *** icedice2 has joined #archiveteam-ot
[18:40] *** icedice has quit IRC (Ping timeout: 246 seconds)
[18:43] *** icedice2 has quit IRC (Quit: Leaving)
[18:43] *** icedice has joined #archiveteam-ot
[18:56] *** icedice has quit IRC (Read error: Connection reset by peer)
[18:56] *** icedice has joined #archiveteam-ot
[19:11] *** icedice has quit IRC (Read error: Connection reset by peer)
[19:11] *** icedice has joined #archiveteam-ot
[19:37] *** icedice2 has joined #archiveteam-ot
[19:40] *** icedice has quit IRC (Ping timeout: 252 seconds)
[19:49] *** icedice2 has quit IRC (Leaving)
[20:05] *** ShellyRol has quit IRC (Ping timeout: 252 seconds)
[20:08] *** ShellyRol has joined #archiveteam-ot
[20:21] <markedL> David_ someone asked you what you were planning to work on in the main channel.  answer there and wait around for an answer there  
[20:48] <t3> My browser seems to have some trouble loading the HTTPS version of the ArchiveBot Dashboard. The HTTP version is working fine, however.
[20:49] <t3> Does anyone else have the same issue? Visit https://dashboard.at.ninjawedding.org/
[20:50] <t3> Would it be a hassle to make the site work with HTTPS?
[20:59] <JAA> The AB dashboard has never worked through HTTPS.
[21:00] <JAA> And yes, it would be fairly complicated to make that work.
[21:10] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[21:42] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[22:00] *** benjins has quit IRC (Read error: Connection reset by peer)
[22:03] *** benjins has joined #archiveteam-ot
[22:16] *** Sanqui has quit IRC (Remote host closed the connection)
[22:16] *** Sanqui has joined #archiveteam-ot
[22:36] *** Leslie has joined #archiveteam-ot
[23:59] *** Somebody2 has quit IRC (west.us.hub irc.Prison.NET)