[00:13] *** ShellyRol has quit IRC (Ping timeout: 496 seconds)
[00:14] *** ShellyRol has joined #archiveteam-ot
[00:14] *** ephemer0l has quit IRC (Ping timeout: 745 seconds)
[00:18] *** ephemer0l has joined #archiveteam-ot
[00:42] *** Quirk8 has quit IRC (Read error: Operation timed out)
[00:55] *** Quirk8 has joined #archiveteam-ot
[01:00] *** Quirk8 has quit IRC (Read error: Operation timed out)
[01:02] *** Quirk8 has joined #archiveteam-ot
[01:16] *** ZizzyDizz has quit IRC (Ping timeout: 260 seconds)
[01:19] *** BlueMax has quit IRC (Quit: Leaving)
[02:13] *** ola_norsk has joined #archiveteam-ot
[02:13] *** dxrt_2 is now known as dxrt_
[02:21] I archived a couple of letters just now; But, when IA converts pdf's to txt, does it do so by OCR? Being too lazy to type, i decided to check the resulting full-text derivatives for links to copy-paste.. but found that the urls in the (google docs) pdf's were quite mangled https://imgur.com/m1fDP1e
[02:22] where e.g 'tinyurl.com' had become 'tinvurl.com'
[02:23] albeit, just before one of the links, there was a 'y' present, in the same font..
[02:24] item: https://archive.org/details/Qriist_letters_20190901 (the 'Summary' doc is where i noticed it)
[02:26] i've just never noticed that before, that's all
[02:27] that, or it's Google Docs' pdf exporting that fubared
[02:28] It says in the metadata at the bottom: Ocr ABBYY FineReader 11.0 (Extended OCR)
[02:29] does the original PDF not have a text layer? Abbyy is often done during the original scan and pdf creation
[02:29] i have no idea, i just exported the letters from a google drive link to pdf
[02:30] but if it's OCR then it's to be expected i guess
[02:37] did you scan the letters using a scanner from a paper copy?
[02:39] markedL: No. They are not mine. They are apparently written by the person who seemingly showed up to Tim Pool's door at 4am the other day.
This is the 'Summary' original ( https://docs.google.com/document/d/1hnnybwRQoqkX0teHC2HCuErLiS9kMRV0SgyhYesI08Q/edit?usp=drivesdk ) . Might there be a better format i should export and upload?
[02:40] yeah this link is text. The OCR should not have been needed, ideally.
[02:40] i don't know what is the most 'native' format of Google Docs
[02:40] so i simply picked pdf
[02:43] so sounds like archive.org's fault. googledoc, to pdf, preserved the text. then archive OCR'd it to get a text format.
[02:44] could exporting and adding e.g the *.odt files help?
[02:44] or the epubs perhaps?
[02:48] could be the underlining of the urls' what did it, causing 'y' and 'g' to become 'v' and 'q'
[02:48] the pdf's on archive.org have the text still too, so it's their choice to use abbyy to create the text version. I guess one reason to do that would be to better preserve formatting.
[02:48] I agree it's the underlining confusing it
[02:52] i'll try re-exporting them as OpenDocument and add those as well and see what happens. I've rarely used google docs so i don't know what's its most native format.
[02:53] is there a way to upload a text file that will overwrite the autoconverted text version?
[02:55] it's possible to replace the derived file i think
[02:56] though, i can't be arsed to export every format presented by google docs for the documents
[02:57] so i'll try to add the *.odt exports and re-derive the item and see if that helps
[03:00] btw (and 99% OT) , does the US have some sort of national army/military veteran outreach organization? More specifically in the Washington surroundings?
[03:01] this might not be desired, but might be good for a test: https://docs.google.com/document/d/1236Dgb5QASspdGLu_XK5B6It6M0qov1AgOyxX_vIHUg/edit
[03:04] that's a clone of the gdoc with the link detection removed.
[03:05] the VA (Veterans Affairs) he mentioned is the govt agency.
there's likely some non-profits as well though I can think of one
[03:06] what i figured, and what surprised me is; I always figured that pdf's that were not mere scans, but saved by a word-processor/editor, included the full text in the pdf. Not e.g simply converting it to vector graphics.
[03:07] and, that IA simply extracted that text directly from the PDF
[03:07] there is a way for IA to do that, so I'm not sure why they choose to use OCR, except maybe for layout preservation.
[03:07] because as you say, the text is there in the PDF
[03:08] regarding formats, you're basically looking for one that IA treats differently because gdocs is doing the sane thing
[03:09] could it be google docs exporting shitty pdf's ?
[03:09] odt, rtf, or epub would be my best guess
[03:09] it could have been, but looking at the pdf, it has the text, so it's not gdocs this time.
[03:09] it's IA. IA's choice would make sense if they also factor in that they get scans and want a single workflow
[03:11] linux has a few utilities that will strip the text out of a pdf the way you said
[03:26] *** qw3rty118 has joined #archiveteam-ot
[03:29] markedL: When in the IA document reader, would you happen to know which of the derived formats it uses? Or is that simply the original in the case of pdf's ?
[03:29] what format the IA reader is showing, i guess is what i'm asking
[03:32] btw, IA appears unwilling to re-derive all files simply by adding a second original format
[03:33] *** qw3rty117 has quit IRC (Ping timeout: 612 seconds)
[03:34] anywho, thanks for insights. I simply never thought OCR might be an issue with non-scanned pdf's. Skål!
[03:34] *** ola_norsk has quit IRC (leaving)
[03:42] *** ZizzyDizz has joined #archiveteam-ot
[03:43] markedL: sorry not getting to you earlier, never got a notification for this tab. I got everything I wanted though in the end.
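[editor's note on the 02:21 to 03:11 OCR discussion] One way to check whether an OCR-derived text has mangled strings that exist in a PDF's embedded text layer is to diff the two character by character. The sketch below is only an illustration under assumptions: the function name `char_substitutions` is hypothetical, the inputs are plain strings you have already extracted yourself, and nothing here reflects how IA's derive process or ABBYY actually works. It uses Python's standard `difflib` to list the substitutions, using the 'tinyurl.com' / 'tinvurl.com' pair mentioned in the log.

```python
from difflib import SequenceMatcher


def char_substitutions(reference: str, ocr: str) -> list[tuple[str, str]]:
    """Return (reference_chars, ocr_chars) pairs for every span where the strings differ.

    `reference` is assumed to be text taken from the PDF's own text layer,
    `ocr` the corresponding text produced by OCR. Both names are illustrative.
    """
    # autojunk=False keeps difflib from treating frequent characters as junk.
    matcher = SequenceMatcher(None, reference, ocr, autojunk=False)
    return [
        (reference[i1:i2], ocr[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # The mangled hostname reported at 02:22.
    print(char_substitutions("tinyurl.com", "tinvurl.com"))  # [('y', 'v')]
```

Run over whole lines rather than single hostnames, the same function would surface recurring pairs such as ('y', 'v') and ('g', 'q'), which fits the guess in the log that underlining clipped the descenders.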
[03:44] Specifically archived everything within this /channel/ https://disqus.com/home/channel/friendshipdaily/ ended up doing it manually with webrecorder. I'd love to hear about chromebot though
[04:13] *** dhyan_nat has joined #archiveteam-ot
[04:31] ola_norsk : looks to me like IA is displaying an image
[04:32] ZizzyDizz : I thought JA_ made something for your disqus for you
[04:34] ola_norsk : yeah definitely an image> https://ia801507.us.archive.org/BookReader/BookReaderImages.php?zip=/32/items/Qriist_letters_20190901/Case%20Summary%20_jp2.zip&file=Case%20Summary%20_jp2/Case%20Summary%20_0000.jp2&scale=2.910958904109589&rotate=0
[04:51] *** BlueMax has joined #archiveteam-ot
[05:43] *** ZizzyDizz has quit IRC (Ping timeout: 260 seconds)
[07:35] *** killsushi has quit IRC (Quit: Leaving)
[09:32] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[09:56] *** chirlu` has quit IRC (Read error: Operation timed out)
[10:15] *** BlueMax has quit IRC (Ping timeout: 745 seconds)
[10:23] *** BlueMax has joined #archiveteam-ot
[10:29] *** Somebody2 has quit IRC (west.us.hub irc.Prison.NET)
[10:32] *** dhyan_nat has joined #archiveteam-ot
[10:46] *** Mateon1 has quit IRC (Read error: Operation timed out)
[10:48] *** Mateon1 has joined #archiveteam-ot
[10:53] *** Somebody2 has joined #archiveteam-ot
[11:07] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[11:30] *** kiskabak has quit IRC (Remote host closed the connection)
[11:30] *** kiskabak has joined #archiveteam-ot
[11:30] *** Fusl_ sets mode: +o kiskabak
[11:30] *** Fusl sets mode: +o kiskabak
[11:30] *** Fusl__ sets mode: +o kiskabak
[11:33] *** kiska1 has quit IRC (Remote host closed the connection)
[11:33] *** kiska1 has joined #archiveteam-ot
[11:33] *** Fusl__ sets mode: +o kiska1
[11:33] *** Fusl sets mode: +o kiska1
[11:33] *** Fusl_ sets mode: +o kiska1
[11:41] *** Quirk8 has quit IRC (Ping timeout: 246 seconds)
[11:46] *** dhyan_nat has joined #archiveteam-ot
[11:49] *** zino_ has
quit IRC (Read error: Operation timed out)
[12:00] *** zino_ has joined #archiveteam-ot
[12:12] *** Quirk8 has joined #archiveteam-ot
[12:32] *** BlueMax has quit IRC (Quit: Leaving)
[12:53] *** DogsRNice has joined #archiveteam-ot
[13:03] Does anyone know (roughly) how quickly you can download from instagram before your ip gets banned/rate limited ?
[13:21] https://gitee.com/ TIL Chinese GitHub
[13:55] Dallas: What I can tell you is that you get banned pretty quickly on the pagination (i.e. scrolling) through the GraphQL API. I haven't seen any issues with retrieving the actual post pages and images/videos yet on ArchiveBot jobs.
[14:15] *** Mateon1 has quit IRC (Read error: Operation timed out)
[14:15] *** Mateon1 has joined #archiveteam-ot
[15:09] *** Mateon1 has quit IRC (Quit: Mateon1)
[15:09] *** Mateon1 has joined #archiveteam-ot
[15:09] *** Mateon1 has quit IRC (Client Quit)
[15:09] *** Mateon1 has joined #archiveteam-ot
[15:31] *** David_ has joined #archiveteam-ot
[15:32] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[15:37] *** David_ has quit IRC (Ping timeout: 260 seconds)
[15:59] *** kiskabak has quit IRC (Remote host closed the connection)
[15:59] *** kiskabak has joined #archiveteam-ot
[15:59] *** Fusl sets mode: +o kiskabak
[15:59] *** Fusl__ sets mode: +o kiskabak
[15:59] *** Fusl_ sets mode: +o kiskabak
[16:02] *** kiska1 has quit IRC (Remote host closed the connection)
[16:02] *** kiska1 has joined #archiveteam-ot
[16:02] *** Fusl__ sets mode: +o kiska1
[16:02] *** Fusl sets mode: +o kiska1
[16:02] *** Fusl_ sets mode: +o kiska1
[16:04] *** kiskabak has quit IRC (Remote host closed the connection)
[16:04] *** kiskabak has joined #archiveteam-ot
[16:04] *** Fusl__ sets mode: +o kiskabak
[16:04] *** Fusl sets mode: +o kiskabak
[16:04] *** Fusl_ sets mode: +o kiskabak
[17:10] *** icedice has joined #archiveteam-ot
[17:15] *** ShellyRol has quit IRC (Read error: Operation timed out)
[17:15] *** ShellyRol has joined #archiveteam-ot
[17:17] ***
justas1 has quit IRC (Read error: Connection reset by peer)
[17:17] *** justas1 has joined #archiveteam-ot
[17:27] *** icedice2 has joined #archiveteam-ot
[17:29] *** icedice2 has quit IRC (Client Quit)
[17:31] *** icedice2 has joined #archiveteam-ot
[17:33] *** icedice has quit IRC (Read error: Operation timed out)
[18:00] *** Laverne has joined #archiveteam-ot
[18:04] *** chirlu has joined #archiveteam-ot
[18:09] *** icedice2 has quit IRC (Read error: Connection reset by peer)
[18:09] *** icedice2 has joined #archiveteam-ot
[18:12] *** icedice2 has quit IRC (Read error: Connection reset by peer)
[18:13] *** icedice2 has joined #archiveteam-ot
[18:15] *** icedice has joined #archiveteam-ot
[18:24] *** icedice2 has quit IRC (Ping timeout: 612 seconds)
[18:37] *** icedice2 has joined #archiveteam-ot
[18:40] *** icedice has quit IRC (Ping timeout: 246 seconds)
[18:43] *** icedice2 has quit IRC (Quit: Leaving)
[18:43] *** icedice has joined #archiveteam-ot
[18:56] *** icedice has quit IRC (Read error: Connection reset by peer)
[18:56] *** icedice has joined #archiveteam-ot
[19:11] *** icedice has quit IRC (Read error: Connection reset by peer)
[19:11] *** icedice has joined #archiveteam-ot
[19:37] *** icedice2 has joined #archiveteam-ot
[19:40] *** icedice has quit IRC (Ping timeout: 252 seconds)
[19:49] *** icedice2 has quit IRC (Leaving)
[20:05] *** ShellyRol has quit IRC (Ping timeout: 252 seconds)
[20:08] *** ShellyRol has joined #archiveteam-ot
[20:21] David_ someone asked you what you were planning to work on in the main channel. answer there and wait around for an answer there
[20:48] My browser seems to have some trouble loading the HTTPS version of the ArchiveBot Dashboard. The HTTP version is working fine, however.
[20:49] Does anyone else have the same issue? Visit https://dashboard.at.ninjawedding.org/
[20:50] Would it be a hassle to make the site work with HTTPS?
[20:59] The AB dashboard has never worked through HTTPS.
[21:00] And yes, it would be fairly complicated to make that work.
[21:10] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[21:42] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[22:00] *** benjins has quit IRC (Read error: Connection reset by peer)
[22:03] *** benjins has joined #archiveteam-ot
[22:16] *** Sanqui has quit IRC (Remote host closed the connection)
[22:16] *** Sanqui has joined #archiveteam-ot
[22:36] *** Leslie has joined #archiveteam-ot
[23:59] *** Somebody2 has quit IRC (west.us.hub irc.Prison.NET)