Time |
Nickname |
Message |
00:13
🔗
|
|
ShellyRol has quit IRC (Ping timeout: 496 seconds) |
00:14
🔗
|
|
ShellyRol has joined #archiveteam-ot |
00:14
🔗
|
|
ephemer0l has quit IRC (Ping timeout: 745 seconds) |
00:18
🔗
|
|
ephemer0l has joined #archiveteam-ot |
00:42
🔗
|
|
Quirk8 has quit IRC (Read error: Operation timed out) |
00:55
🔗
|
|
Quirk8 has joined #archiveteam-ot |
01:00
🔗
|
|
Quirk8 has quit IRC (Read error: Operation timed out) |
01:02
🔗
|
|
Quirk8 has joined #archiveteam-ot |
01:16
🔗
|
|
ZizzyDizz has quit IRC (Ping timeout: 260 seconds) |
01:19
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
02:13
🔗
|
|
ola_norsk has joined #archiveteam-ot |
02:13
🔗
|
|
dxrt_2 is now known as dxrt_ |
02:21
🔗
|
ola_norsk |
I archived a couple of letters just now; but when IA converts PDFs to txt, does it do so by OCR? Being too lazy to type, i decided to check the resulting full-text derivatives for links to copy-paste.. but found that the urls in the (google docs) PDFs were quite mangled https://imgur.com/m1fDP1e |
02:22
🔗
|
ola_norsk |
where e.g 'tinyurl.com' had become 'tinvurl.com' |
02:23
🔗
|
ola_norsk |
albeit, just before one of the links, there was an 'y' present, in same font.. |
02:24
🔗
|
ola_norsk |
item: https://archive.org/details/Qriist_letters_20190901 (the 'Summary' doc is where i noticed it) |
02:26
🔗
|
ola_norsk |
i've just never noticed that before, that's all |
02:27
🔗
|
ola_norsk |
that, or it's Google Docs' pdf exporting that fubared |
02:28
🔗
|
markedL |
It says in the metadata at the bottom: Ocr ABBYY FineReader 11.0 (Extended OCR) |
02:29
🔗
|
markedL |
does the original PDF not have a text layer? Abbyy is often done during the original scan and pdf creation |
02:29
🔗
|
ola_norsk |
i have no idea, i just exported the letters from a google drive link to pdf |
02:30
🔗
|
ola_norsk |
but if it's OCR then it's to be expected i guess |
02:37
🔗
|
markedL |
did you scan the letters using a scanner from a paper copy? |
02:39
🔗
|
ola_norsk |
markedL: No. They are not mine. They are apparently written by the person who seemingly showed up to Tim Pool's door at 4am the other day. This is the 'Summary' original ( https://docs.google.com/document/d/1hnnybwRQoqkX0teHC2HCuErLiS9kMRV0SgyhYesI08Q/edit?usp=drivesdk ) . Might there be a better format i should export and upload? |
02:40
🔗
|
markedL |
yeah this link is text. The OCR should not have been needed, ideally. |
02:40
🔗
|
ola_norsk |
i don't know what is the most 'native' format of Google Docs |
02:40
🔗
|
ola_norsk |
so i simply picked pdf |
02:43
🔗
|
markedL |
so sounds like archive.org's fault. googledoc, to pdf, preserved the text. then archive OCR'd it to get a text format. |
02:44
🔗
|
ola_norsk |
could exporting and adding e.g. the *.odt files help? |
02:44
🔗
|
ola_norsk |
or the epubs perhaps? |
02:48
🔗
|
ola_norsk |
could be the underlining of the urls that did it, causing 'y' and 'g' to become 'v' and 'q' |
02:48
🔗
|
markedL |
the pdf's on archive.org have the text still too, so it's their choice to use abbyy to create the text version. I guess one reason to do that would be to better preserve formatting. |
02:48
🔗
|
markedL |
I agree it's the underlining confusing it |
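The substitutions described above ('y' read as 'v', 'g' read as 'q') can be spotted by comparing a URL as written in the source document with its OCR'd form. A minimal Python sketch, assuming OCR only swapped characters so both strings have the same length; the function name `char_substitutions` is made up for illustration:

```python
from __future__ import annotations


def char_substitutions(expected: str, ocr: str) -> list[tuple[str, str]]:
    """Return (expected_char, ocr_char) pairs where the two strings differ.

    Assumes a pure character-for-character substitution (equal lengths);
    raises ValueError otherwise.
    """
    if len(expected) != len(ocr):
        raise ValueError("strings differ in length; not a pure substitution")
    return [(e, o) for e, o in zip(expected, ocr) if e != o]


if __name__ == "__main__":
    # Example taken from the log: 'tinyurl.com' became 'tinvurl.com'.
    print(char_substitutions("tinyurl.com", "tinvurl.com"))  # [('y', 'v')]
```

Running this over every URL found in both the embedded text and the OCR output would give a rough picture of which glyphs the underline interferes with.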
02:52
🔗
|
ola_norsk |
i'll try re-exporting them as OpenDocument and add those as well and see what happens. I've rarely used google docs so i don't know what its most native format is. |
02:53
🔗
|
markedL |
is there a way to upload a text file that will overwrite the autoconverted text version? |
02:55
🔗
|
ola_norsk |
it's possible to replace the derived file i think |
02:56
🔗
|
ola_norsk |
though, i can't be arsed to export every format presented by google docs for the documents |
02:57
🔗
|
ola_norsk |
so i'll try to add the *.odt exports and re-derive the item and see if that helps |
03:00
🔗
|
ola_norsk |
btw (and 99% OT) , does the US have some sort of national army/military veteran outreach organization? More specifically in the Washington surroundings? |
03:01
🔗
|
markedL |
this might not be desired, but might be good for a test: https://docs.google.com/document/d/1236Dgb5QASspdGLu_XK5B6It6M0qov1AgOyxX_vIHUg/edit |
03:04
🔗
|
markedL |
that's a clone of the gdoc with the link detection removed. |
03:05
🔗
|
markedL |
the VA (Veterans Affairs) he mentioned is the govt agency. there's likely some non-profits as well, though I can't think of one |
03:06
🔗
|
ola_norsk |
what i figured, and what surprised me is; I always figured that pdf's that were not mere scans, but saved by a word-processor/editor, included the full text in the pdf. Not e.g simply converting it to vector graphics. |
03:07
🔗
|
ola_norsk |
and, that IA simply extracted that text directly from the PDF |
03:07
🔗
|
markedL |
there is a way for IA to do that, so I'm not sure why they chose to use OCR, except maybe for layout preservation. |
03:07
🔗
|
markedL |
because as you say, the text is there in the PDF |
03:08
🔗
|
markedL |
regarding formats, you're basically looking for one that IA treats differently because gdocs is doing the sane thing |
03:09
🔗
|
ola_norsk |
could it be google docs exporting shitty pdf's ? |
03:09
🔗
|
markedL |
odt, rtf, or epub would be my best guess |
03:09
🔗
|
markedL |
it could have been, but looking at the pdf, it has the text, so it's not gdocs this time. |
03:09
🔗
|
markedL |
it's IA. IA's choice would make sense if they also factor in they get scans and want a single work flow |
03:11
🔗
|
markedL |
linux has a few utilities that will strip the text out of a pdf the way you said |
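One such utility is `pdftotext` from poppler-utils, which reads the PDF's embedded text layer rather than running OCR. A minimal Python sketch that shells out to it, assuming `pdftotext` is installed and on PATH; the helper names and file names are illustrative only:

```python
from __future__ import annotations

import shutil
import subprocess


def build_pdftotext_cmd(pdf_path: str, txt_path: str, layout: bool = True) -> list[str]:
    """Build the pdftotext command line; -layout tries to keep the page layout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text_layer(pdf_path: str, txt_path: str) -> None:
    """Write the PDF's embedded text to txt_path (no OCR involved)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(build_pdftotext_cmd("Case Summary.pdf", "Case Summary.txt"))
```

If the resulting text file has 'tinyurl.com' spelled correctly, the mangling comes from the OCR pass and not from the PDF export.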
03:26
🔗
|
|
qw3rty118 has joined #archiveteam-ot |
03:29
🔗
|
ola_norsk |
markedL: When in the IA document reader, would you happen to know which of the derived formats that uses? Or is that simply the original in the case of pdf's ? |
03:29
🔗
|
ola_norsk |
what format the IA reader is showing, i guess is what i'm asking |
03:32
🔗
|
ola_norsk |
btw, IA appears unwilling to re-derive all files simply by adding a second original format |
03:33
🔗
|
|
qw3rty117 has quit IRC (Ping timeout: 612 seconds) |
03:34
🔗
|
ola_norsk |
anywho, thanks for insights. I simply never thought OCR might be an issue with non-scanned pdf's. Skål! |
03:34
🔗
|
|
ola_norsk has quit IRC (leaving) |
03:42
🔗
|
|
ZizzyDizz has joined #archiveteam-ot |
03:43
🔗
|
ZizzyDizz |
markedL: sorry for not getting to you earlier, never got a notification for this tab. I got everything I wanted though in the end. |
03:44
🔗
|
ZizzyDizz |
Specifically archived everything within this /channel/ https://disqus.com/home/channel/friendshipdaily/ . Ended up doing it manually with webrecorder. I'd love to hear about chromebot though |
04:13
🔗
|
|
dhyan_nat has joined #archiveteam-ot |
04:31
🔗
|
markedL |
ola_norsk : looks to me like IA is displaying an image |
04:32
🔗
|
markedL |
ZizzyDizz : I thought JA_ made something for your disqus for you |
04:34
🔗
|
markedL |
ola_norsk : yeah definitely an image> https://ia801507.us.archive.org/BookReader/BookReaderImages.php?zip=/32/items/Qriist_letters_20190901/Case%20Summary%20_jp2.zip&file=Case%20Summary%20_jp2/Case%20Summary%20_0000.jp2&scale=2.910958904109589&rotate=0 |
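The URL above already shows that the reader is serving an image: its `file` parameter points at a `.jp2` (JPEG 2000) page inside a `_jp2.zip` archive. A small Python sketch that pulls this out of the query string; it only parses the URL from the log and assumes nothing else about how the reader works (the function name is made up):

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, urlsplit


def page_image_suffix(reader_url: str) -> str:
    """Return the file suffix of the page named in a BookReaderImages.php URL."""
    query = parse_qs(urlsplit(reader_url).query)
    # parse_qs percent-decodes values, so %20 becomes a space.
    page_file = query["file"][0]
    return PurePosixPath(page_file).suffix


URL = (
    "https://ia801507.us.archive.org/BookReader/BookReaderImages.php"
    "?zip=/32/items/Qriist_letters_20190901/Case%20Summary%20_jp2.zip"
    "&file=Case%20Summary%20_jp2/Case%20Summary%20_0000.jp2"
    "&scale=2.910958904109589&rotate=0"
)

if __name__ == "__main__":
    print(page_image_suffix(URL))  # .jp2
```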
04:51
🔗
|
|
BlueMax has joined #archiveteam-ot |
05:43
🔗
|
|
ZizzyDizz has quit IRC (Ping timeout: 260 seconds) |
07:35
🔗
|
|
killsushi has quit IRC (Quit: Leaving) |
09:32
🔗
|
|
dhyan_nat has quit IRC (Read error: Operation timed out) |
09:56
🔗
|
|
chirlu` has quit IRC (Read error: Operation timed out) |
10:15
🔗
|
|
BlueMax has quit IRC (Ping timeout: 745 seconds) |
10:23
🔗
|
|
BlueMax has joined #archiveteam-ot |
10:29
🔗
|
|
Somebody2 has quit IRC (west.us.hub irc.Prison.NET) |
10:32
🔗
|
|
dhyan_nat has joined #archiveteam-ot |
10:46
🔗
|
|
Mateon1 has quit IRC (Read error: Operation timed out) |
10:48
🔗
|
|
Mateon1 has joined #archiveteam-ot |
10:53
🔗
|
|
Somebody2 has joined #archiveteam-ot |
11:07
🔗
|
|
dhyan_nat has quit IRC (Read error: Operation timed out) |
11:30
🔗
|
|
kiskabak has quit IRC (Remote host closed the connection) |
11:30
🔗
|
|
kiskabak has joined #archiveteam-ot |
11:30
🔗
|
|
Fusl_ sets mode: +o kiskabak |
11:30
🔗
|
|
Fusl sets mode: +o kiskabak |
11:30
🔗
|
|
Fusl__ sets mode: +o kiskabak |
11:33
🔗
|
|
kiska1 has quit IRC (Remote host closed the connection) |
11:33
🔗
|
|
kiska1 has joined #archiveteam-ot |
11:33
🔗
|
|
Fusl__ sets mode: +o kiska1 |
11:33
🔗
|
|
Fusl sets mode: +o kiska1 |
11:33
🔗
|
|
Fusl_ sets mode: +o kiska1 |
11:41
🔗
|
|
Quirk8 has quit IRC (Ping timeout: 246 seconds) |
11:46
🔗
|
|
dhyan_nat has joined #archiveteam-ot |
11:49
🔗
|
|
zino_ has quit IRC (Read error: Operation timed out) |
12:00
🔗
|
|
zino_ has joined #archiveteam-ot |
12:12
🔗
|
|
Quirk8 has joined #archiveteam-ot |
12:32
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
12:53
🔗
|
|
DogsRNice has joined #archiveteam-ot |
13:03
🔗
|
Dallas |
Does anyone know (roughly) how quickly you can download from instagram before your ip gets banned/rate limited ? |
13:21
🔗
|
ivan_ |
https://gitee.com/ TIL Chinese GitHub |
13:55
🔗
|
JAA |
Dallas: What I can tell you is that you get banned pretty quickly on the pagination (i.e. scrolling) through the GraphQL API. I haven't seen any issues with retrieving the actual post pages and images/videos yet on ArchiveBot jobs. |
14:15
🔗
|
|
Mateon1 has quit IRC (Read error: Operation timed out) |
14:15
🔗
|
|
Mateon1 has joined #archiveteam-ot |
15:09
🔗
|
|
Mateon1 has quit IRC (Quit: Mateon1) |
15:09
🔗
|
|
Mateon1 has joined #archiveteam-ot |
15:09
🔗
|
|
Mateon1 has quit IRC (Client Quit) |
15:09
🔗
|
|
Mateon1 has joined #archiveteam-ot |
15:31
🔗
|
|
David_ has joined #archiveteam-ot |
15:32
🔗
|
David_ |
WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD |
15:37
🔗
|
|
David_ has quit IRC (Ping timeout: 260 seconds) |
15:59
🔗
|
|
kiskabak has quit IRC (Remote host closed the connection) |
15:59
🔗
|
|
kiskabak has joined #archiveteam-ot |
15:59
🔗
|
|
Fusl sets mode: +o kiskabak |
15:59
🔗
|
|
Fusl__ sets mode: +o kiskabak |
15:59
🔗
|
|
Fusl_ sets mode: +o kiskabak |
16:02
🔗
|
|
kiska1 has quit IRC (Remote host closed the connection) |
16:02
🔗
|
|
kiska1 has joined #archiveteam-ot |
16:02
🔗
|
|
Fusl__ sets mode: +o kiska1 |
16:02
🔗
|
|
Fusl sets mode: +o kiska1 |
16:02
🔗
|
|
Fusl_ sets mode: +o kiska1 |
16:04
🔗
|
|
kiskabak has quit IRC (Remote host closed the connection) |
16:04
🔗
|
|
kiskabak has joined #archiveteam-ot |
16:04
🔗
|
|
Fusl__ sets mode: +o kiskabak |
16:04
🔗
|
|
Fusl sets mode: +o kiskabak |
16:04
🔗
|
|
Fusl_ sets mode: +o kiskabak |
17:10
🔗
|
|
icedice has joined #archiveteam-ot |
17:15
🔗
|
|
ShellyRol has quit IRC (Read error: Operation timed out) |
17:15
🔗
|
|
ShellyRol has joined #archiveteam-ot |
17:17
🔗
|
|
justas1 has quit IRC (Read error: Connection reset by peer) |
17:17
🔗
|
|
justas1 has joined #archiveteam-ot |
17:27
🔗
|
|
icedice2 has joined #archiveteam-ot |
17:29
🔗
|
|
icedice2 has quit IRC (Client Quit) |
17:31
🔗
|
|
icedice2 has joined #archiveteam-ot |
17:33
🔗
|
|
icedice has quit IRC (Read error: Operation timed out) |
18:00
🔗
|
|
Laverne has joined #archiveteam-ot |
18:04
🔗
|
|
chirlu has joined #archiveteam-ot |
18:09
🔗
|
|
icedice2 has quit IRC (Read error: Connection reset by peer) |
18:09
🔗
|
|
icedice2 has joined #archiveteam-ot |
18:12
🔗
|
|
icedice2 has quit IRC (Read error: Connection reset by peer) |
18:13
🔗
|
|
icedice2 has joined #archiveteam-ot |
18:15
🔗
|
|
icedice has joined #archiveteam-ot |
18:24
🔗
|
|
icedice2 has quit IRC (Ping timeout: 612 seconds) |
18:37
🔗
|
|
icedice2 has joined #archiveteam-ot |
18:40
🔗
|
|
icedice has quit IRC (Ping timeout: 246 seconds) |
18:43
🔗
|
|
icedice2 has quit IRC (Quit: Leaving) |
18:43
🔗
|
|
icedice has joined #archiveteam-ot |
18:56
🔗
|
|
icedice has quit IRC (Read error: Connection reset by peer) |
18:56
🔗
|
|
icedice has joined #archiveteam-ot |
19:11
🔗
|
|
icedice has quit IRC (Read error: Connection reset by peer) |
19:11
🔗
|
|
icedice has joined #archiveteam-ot |
19:37
🔗
|
|
icedice2 has joined #archiveteam-ot |
19:40
🔗
|
|
icedice has quit IRC (Ping timeout: 252 seconds) |
19:49
🔗
|
|
icedice2 has quit IRC (Leaving) |
20:05
🔗
|
|
ShellyRol has quit IRC (Ping timeout: 252 seconds) |
20:08
🔗
|
|
ShellyRol has joined #archiveteam-ot |
20:21
🔗
|
markedL |
David_ someone asked you what you were planning to work on in the main channel. answer there and wait around for an answer there |
20:48
🔗
|
t3 |
My browser seems to have some trouble loading the HTTPS version of the ArchiveBot Dashboard. The HTTP version is working fine, however. |
20:49
🔗
|
t3 |
Does anyone else have the same issue? Visit https://dashboard.at.ninjawedding.org/ |
20:50
🔗
|
t3 |
Would it be a hassle to make the site work with HTTPS? |
20:59
🔗
|
JAA |
The AB dashboard has never worked through HTTPS. |
21:00
🔗
|
JAA |
And yes, it would be fairly complicated to make that work. |
21:10
🔗
|
|
DogsRNice has quit IRC (Read error: Connection reset by peer) |
21:42
🔗
|
|
dhyan_nat has quit IRC (Read error: Operation timed out) |
22:00
🔗
|
|
benjins has quit IRC (Read error: Connection reset by peer) |
22:03
🔗
|
|
benjins has joined #archiveteam-ot |
22:16
🔗
|
|
Sanqui has quit IRC (Remote host closed the connection) |
22:16
🔗
|
|
Sanqui has joined #archiveteam-ot |
22:36
🔗
|
|
Leslie has joined #archiveteam-ot |
23:59
🔗
|
|
Somebody2 has quit IRC (west.us.hub irc.Prison.NET) |