#archiveteam-ot 2019-09-01,Sun


Time Nickname Message
00:13 🔗 ShellyRol has quit IRC (Ping timeout: 496 seconds)
00:14 🔗 ShellyRol has joined #archiveteam-ot
00:14 🔗 ephemer0l has quit IRC (Ping timeout: 745 seconds)
00:18 🔗 ephemer0l has joined #archiveteam-ot
00:42 🔗 Quirk8 has quit IRC (Read error: Operation timed out)
00:55 🔗 Quirk8 has joined #archiveteam-ot
01:00 🔗 Quirk8 has quit IRC (Read error: Operation timed out)
01:02 🔗 Quirk8 has joined #archiveteam-ot
01:16 🔗 ZizzyDizz has quit IRC (Ping timeout: 260 seconds)
01:19 🔗 BlueMax has quit IRC (Quit: Leaving)
02:13 🔗 ola_norsk has joined #archiveteam-ot
02:13 🔗 dxrt_2 is now known as dxrt_
02:21 🔗 ola_norsk I archived a couple of letters just now. But when IA converts PDFs to txt, does it do so by OCR? Being too lazy to type, I decided to check the resulting full-text derivatives for links to copy-paste, but found that the URLs in the (Google Docs) PDFs were quite mangled https://imgur.com/m1fDP1e
02:22 🔗 ola_norsk where e.g 'tinyurl.com' had become 'tinvurl.com'
02:23 🔗 ola_norsk albeit, just before one of the links, there was a 'y' present, in the same font..
02:24 🔗 ola_norsk item: https://archive.org/details/Qriist_letters_20190901 (the 'Summary' doc is where i noticed it)
02:26 🔗 ola_norsk i've just never noticed that before, that's all
02:27 🔗 ola_norsk that, or it's Google Docs' PDF export that fubared it
02:28 🔗 markedL It says in the metadata at the bottom: Ocr ABBYY FineReader 11.0 (Extended OCR)
02:29 🔗 markedL does the original PDF not have a text layer? Abbyy is often done during the original scan and pdf creation
02:29 🔗 ola_norsk i have no idea, i just exported the letters from a google drive link to pdf
02:30 🔗 ola_norsk but if it's OCR then it's to be expected i guess
02:37 🔗 markedL did you scan the letters using a scanner from a paper copy?
02:39 🔗 ola_norsk markedL: No. They are not mine. They are apparently written by the person who seemingly showed up to Tim Pool's door at 4am the other day. This is the 'Summary' original ( https://docs.google.com/document/d/1hnnybwRQoqkX0teHC2HCuErLiS9kMRV0SgyhYesI08Q/edit?usp=drivesdk ) . Might there be a better format i should export and upload?
02:40 🔗 markedL yeah this link is text. The OCR should not have been needed, ideally.
02:40 🔗 ola_norsk i don't know what is the most 'native' format of Google Docs
02:40 🔗 ola_norsk so i simply picked pdf
02:43 🔗 markedL so sounds like archive.org's fault. googledoc, to pdf, preserved the text. then archive OCR'd it to get a text format.
02:44 🔗 ola_norsk could exporting and adding e.g. the *.odt files help?
02:44 🔗 ola_norsk or the epubs perhaps?
02:48 🔗 ola_norsk could be the underlining of the URLs that did it, causing 'y' and 'g' to become 'v' and 'q'
02:48 🔗 markedL the PDFs on archive.org still have the text too, so it's their choice to use ABBYY to create the text version. I guess one reason to do that would be to better preserve formatting.
02:48 🔗 markedL I agree it's the underlining confusing it
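Editorial sketch: the substitutions described above ('y' read as 'v', 'g' read as 'q') can be checked for mechanically when a trusted copy of the text exists, such as the PDF's own text layer. The function and variable names below are hypothetical, and the confusable table contains only the two pairs mentioned in this conversation; it is an illustration under those assumptions, not how IA or ABBYY handle this.

```python
from typing import Dict, Iterable

# Assumption: only the two confusions reported in the chat are modelled.
# Each character is mapped to the shape OCR mistook it for.
CONFUSABLE = str.maketrans({"y": "v", "g": "q"})


def canonical(token: str) -> str:
    """Collapse confusable characters so mangled and clean forms compare equal."""
    return token.lower().translate(CONFUSABLE)


def find_mangled(trusted: Iterable[str], ocr_tokens: Iterable[str]) -> Dict[str, str]:
    """Map each OCR token that looks like a mangled trusted token to its original."""
    by_canonical = {canonical(t): t for t in trusted}
    mangled = {}
    for token in ocr_tokens:
        original = by_canonical.get(canonical(token))
        # Report only tokens that differ from the trusted spelling.
        if original is not None and token.lower() != original.lower():
            mangled[token] = original
    return mangled


if __name__ == "__main__":
    print(find_mangled(["tinyurl.com"], ["tinvurl.com", "tinyurl.com", "example.org"]))
```

With the hostname from the chat as input, the mangled 'tinvurl.com' is mapped back to 'tinyurl.com', while the correctly recognised copy and unrelated tokens are ignored.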
02:52 🔗 ola_norsk i'll try re-exporting them as OpenDocument and add those as well and see what happens. I've rarely used google docs so i don't know what its most native format is.
02:53 🔗 markedL is there a way to upload a text file that will overwrite the autoconverted text version?
02:55 🔗 ola_norsk it's possible to replace the derived file i think
02:56 🔗 ola_norsk though, i can't be arsed to export every format presented by google docs for the documents
02:57 🔗 ola_norsk so i'll try to add the *.odt exports and re-derive the item and see if that helps
03:00 🔗 ola_norsk btw (and 99% OT) , does the US have some sort of national army/military veteran outreach organization? More specifically in the Washington surroundings?
03:01 🔗 markedL this might not be desired, but might be good for a test: https://docs.google.com/document/d/1236Dgb5QASspdGLu_XK5B6It6M0qov1AgOyxX_vIHUg/edit
03:04 🔗 markedL that's a clone of the gdoc then removing the link detection.
03:05 🔗 markedL the VA (Veterans Affairs) he mentioned is the govt agency. there's likely some non-profits as well though I can think of one
03:06 🔗 ola_norsk what i figured, and what surprised me is; I always figured that pdf's that were not mere scans, but saved by a word-processor/editor, included the full text in the pdf. Not e.g simply converting it to vector graphics.
03:07 🔗 ola_norsk and, that IA simply extracted that text directly from the PDF
03:07 🔗 markedL there is a way for IA to do that, so I'm not sure why they chose to use OCR, except maybe for layout preservation.
03:07 🔗 markedL because as you say, the text is there in the PDF
03:08 🔗 markedL regarding formats, you're basically looking for one that IA treats differently because gdocs is doing the sane thing
03:09 🔗 ola_norsk could it be google docs exporting shitty pdf's ?
03:09 🔗 markedL odt, rtf, or epub would be my best guess
03:09 🔗 markedL it could have been, but looking at the pdf, it has the text, so it's not gdocs this time.
03:09 🔗 markedL it's IA. IA's choice would make sense if they also factor in they get scans and want a single work flow
03:11 🔗 markedL linux has a few utilities that will strip the text out of a pdf the way you said
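Editorial sketch: one such utility is pdftotext from poppler-utils, which reads the embedded text layer directly instead of running OCR. The chat does not name a specific tool, so the choice of pdftotext, the helper names, and the file name 'summary.pdf' are all assumptions made for illustration.

```python
import shutil
import subprocess
from typing import List


def build_pdftotext_command(pdf_path: str, keep_layout: bool = False) -> List[str]:
    """Build the argument list for pdftotext, writing the text to stdout ('-')."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical layout of the page.
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def extract_text_layer(pdf_path: str, keep_layout: bool = False) -> str:
    """Return the PDF's embedded text; no OCR is involved."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; it ships with poppler-utils")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # 'summary.pdf' is a hypothetical local copy of the exported document.
    print(extract_text_layer("summary.pdf"))
```

If the PDF was exported from a word processor, the URLs in this output should match the original document exactly, which gives a clean copy to compare against the OCR-derived full text.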
03:26 🔗 qw3rty118 has joined #archiveteam-ot
03:29 🔗 ola_norsk markedL: When in the IA document reader, would you happen to know which of the derived files it uses? Or is that simply the original in the case of PDFs?
03:29 🔗 ola_norsk what format the IA reader is showing, i guess is what i'm asking
03:32 🔗 ola_norsk btw, IA appears unwilling to re-derive all files simply by adding a second original format
03:33 🔗 qw3rty117 has quit IRC (Ping timeout: 612 seconds)
03:34 🔗 ola_norsk anywho, thanks for insights. I simply never thought OCR might be an issue with non-scanned pdf's. Skål!
03:34 🔗 ola_norsk has quit IRC (leaving)
03:42 🔗 ZizzyDizz has joined #archiveteam-ot
03:43 🔗 ZizzyDizz markedL: sorry for not getting back to you earlier, never got a notification for this tab. I got everything I wanted though in the end.
03:44 🔗 ZizzyDizz Specifically archived everything within this /channel/ https://disqus.com/home/channel/friendshipdaily/ ended up doing it manually with webrecorder. I'd love to hear about chromebot though
04:13 🔗 dhyan_nat has joined #archiveteam-ot
04:31 🔗 markedL ola_norsk : looks to me like IA is displaying an image
04:32 🔗 markedL ZizzyDizz : I thought JA_ made something for your disqus for you
04:34 🔗 markedL ola_norsk : yeah definitely an image> https://ia801507.us.archive.org/BookReader/BookReaderImages.php?zip=/32/items/Qriist_letters_20190901/Case%20Summary%20_jp2.zip&file=Case%20Summary%20_jp2/Case%20Summary%20_0000.jp2&scale=2.910958904109589&rotate=0
04:51 🔗 BlueMax has joined #archiveteam-ot
05:43 🔗 ZizzyDizz has quit IRC (Ping timeout: 260 seconds)
07:35 🔗 killsushi has quit IRC (Quit: Leaving)
09:32 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
09:56 🔗 chirlu` has quit IRC (Read error: Operation timed out)
10:15 🔗 BlueMax has quit IRC (Ping timeout: 745 seconds)
10:23 🔗 BlueMax has joined #archiveteam-ot
10:29 🔗 Somebody2 has quit IRC (west.us.hub irc.Prison.NET)
10:32 🔗 dhyan_nat has joined #archiveteam-ot
10:46 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
10:48 🔗 Mateon1 has joined #archiveteam-ot
10:53 🔗 Somebody2 has joined #archiveteam-ot
11:07 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
11:30 🔗 kiskabak has quit IRC (Remote host closed the connection)
11:30 🔗 kiskabak has joined #archiveteam-ot
11:30 🔗 Fusl_ sets mode: +o kiskabak
11:30 🔗 Fusl sets mode: +o kiskabak
11:30 🔗 Fusl__ sets mode: +o kiskabak
11:33 🔗 kiska1 has quit IRC (Remote host closed the connection)
11:33 🔗 kiska1 has joined #archiveteam-ot
11:33 🔗 Fusl__ sets mode: +o kiska1
11:33 🔗 Fusl sets mode: +o kiska1
11:33 🔗 Fusl_ sets mode: +o kiska1
11:41 🔗 Quirk8 has quit IRC (Ping timeout: 246 seconds)
11:46 🔗 dhyan_nat has joined #archiveteam-ot
11:49 🔗 zino_ has quit IRC (Read error: Operation timed out)
12:00 🔗 zino_ has joined #archiveteam-ot
12:12 🔗 Quirk8 has joined #archiveteam-ot
12:32 🔗 BlueMax has quit IRC (Quit: Leaving)
12:53 🔗 DogsRNice has joined #archiveteam-ot
13:03 🔗 Dallas Does anyone know (roughly) how quickly you can download from instagram before your ip gets banned/rate limited?
13:21 🔗 ivan_ https://gitee.com/ TIL Chinese GitHub
13:55 🔗 JAA Dallas: What I can tell you is that you get banned pretty quickly on the pagination (i.e. scrolling) through the GraphQL API. I haven't seen any issues with retrieving the actual post pages and images/videos yet on ArchiveBot jobs.
14:15 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
14:15 🔗 Mateon1 has joined #archiveteam-ot
15:09 🔗 Mateon1 has quit IRC (Quit: Mateon1)
15:09 🔗 Mateon1 has joined #archiveteam-ot
15:09 🔗 Mateon1 has quit IRC (Client Quit)
15:09 🔗 Mateon1 has joined #archiveteam-ot
15:31 🔗 David_ has joined #archiveteam-ot
15:32 🔗 David_ WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
15:37 🔗 David_ has quit IRC (Ping timeout: 260 seconds)
15:59 🔗 kiskabak has quit IRC (Remote host closed the connection)
15:59 🔗 kiskabak has joined #archiveteam-ot
15:59 🔗 Fusl sets mode: +o kiskabak
15:59 🔗 Fusl__ sets mode: +o kiskabak
15:59 🔗 Fusl_ sets mode: +o kiskabak
16:02 🔗 kiska1 has quit IRC (Remote host closed the connection)
16:02 🔗 kiska1 has joined #archiveteam-ot
16:02 🔗 Fusl__ sets mode: +o kiska1
16:02 🔗 Fusl sets mode: +o kiska1
16:02 🔗 Fusl_ sets mode: +o kiska1
16:04 🔗 kiskabak has quit IRC (Remote host closed the connection)
16:04 🔗 kiskabak has joined #archiveteam-ot
16:04 🔗 Fusl__ sets mode: +o kiskabak
16:04 🔗 Fusl sets mode: +o kiskabak
16:04 🔗 Fusl_ sets mode: +o kiskabak
17:10 🔗 icedice has joined #archiveteam-ot
17:15 🔗 ShellyRol has quit IRC (Read error: Operation timed out)
17:15 🔗 ShellyRol has joined #archiveteam-ot
17:17 🔗 justas1 has quit IRC (Read error: Connection reset by peer)
17:17 🔗 justas1 has joined #archiveteam-ot
17:27 🔗 icedice2 has joined #archiveteam-ot
17:29 🔗 icedice2 has quit IRC (Client Quit)
17:31 🔗 icedice2 has joined #archiveteam-ot
17:33 🔗 icedice has quit IRC (Read error: Operation timed out)
18:00 🔗 Laverne has joined #archiveteam-ot
18:04 🔗 chirlu has joined #archiveteam-ot
18:09 🔗 icedice2 has quit IRC (Read error: Connection reset by peer)
18:09 🔗 icedice2 has joined #archiveteam-ot
18:12 🔗 icedice2 has quit IRC (Read error: Connection reset by peer)
18:13 🔗 icedice2 has joined #archiveteam-ot
18:15 🔗 icedice has joined #archiveteam-ot
18:24 🔗 icedice2 has quit IRC (Ping timeout: 612 seconds)
18:37 🔗 icedice2 has joined #archiveteam-ot
18:40 🔗 icedice has quit IRC (Ping timeout: 246 seconds)
18:43 🔗 icedice2 has quit IRC (Quit: Leaving)
18:43 🔗 icedice has joined #archiveteam-ot
18:56 🔗 icedice has quit IRC (Read error: Connection reset by peer)
18:56 🔗 icedice has joined #archiveteam-ot
19:11 🔗 icedice has quit IRC (Read error: Connection reset by peer)
19:11 🔗 icedice has joined #archiveteam-ot
19:37 🔗 icedice2 has joined #archiveteam-ot
19:40 🔗 icedice has quit IRC (Ping timeout: 252 seconds)
19:49 🔗 icedice2 has quit IRC (Leaving)
20:05 🔗 ShellyRol has quit IRC (Ping timeout: 252 seconds)
20:08 🔗 ShellyRol has joined #archiveteam-ot
20:21 🔗 markedL David_ someone asked you what you were planning to work on in the main channel. answer there and wait around for an answer there
20:48 🔗 t3 My browser seems to have some trouble loading the HTTPS version of the ArchiveBot Dashboard. The HTTP version is working fine, however.
20:49 🔗 t3 Does anyone else have the same issue? Visit https://dashboard.at.ninjawedding.org/
20:50 🔗 t3 Would it be a hassle to make the site work with HTTPS?
20:59 🔗 JAA The AB dashboard has never worked through HTTPS.
21:00 🔗 JAA And yes, it would be fairly complicated to make that work.
21:10 🔗 DogsRNice has quit IRC (Read error: Connection reset by peer)
21:42 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
22:00 🔗 benjins has quit IRC (Read error: Connection reset by peer)
22:03 🔗 benjins has joined #archiveteam-ot
22:16 🔗 Sanqui has quit IRC (Remote host closed the connection)
22:16 🔗 Sanqui has joined #archiveteam-ot
22:36 🔗 Leslie has joined #archiveteam-ot
23:59 🔗 Somebody2 has quit IRC (west.us.hub irc.Prison.NET)
