00:33 -- X-Scale` has joined #archiveteam-bs
00:37 -- X-Scale has quit IRC (Read error: Operation timed out)
00:37 -- X-Scale` is now known as X-Scale
00:39 <britmob> JAA: Did you fix the malformed WARC issue with qwarc?
00:40 <anarcat> what's qwarc
00:40 <britmob> https://github.com/JustAnotherArchivist/qwarc
00:40 <JAA> britmob: The partial records? Yes, that's fixed in 0.2.2.
00:41 <britmob> Perfect, thanks.
00:41 <JAA> Are you using qwarc?
00:41 <britmob> Occasionally
00:41 <JAA> Nice, you might be the only one. :-P
00:41 <britmob> hehe
00:42 <markedL> I ran it once, but didn't write the grab behavior
00:42 <britmob> qwarc/brozzler/grab-site is what I use most often
00:42 <britmob> Sometimes wpull.
00:43 <anarcat> so it's kind of this curl url-list.txt | qwarc kind of thing?
00:44 <JAA> Nope, not at all.
00:44 <JAA> Think of qwarc like a local version of the tracker.
00:45 <JAA> The work unit is an item, and each item fetches any number of things via HTTP requests.
00:45 <JAA> It's very low level. You have to write all of the retrieval stuff, recursion as desired, etc. yourself.
00:46 <britmob> Which is why I like it :)
00:46 <anarcat> so it's a dispatcher
00:47 <JAA> While it's possible to do what you suggest (one item per URL, no further processing like extraction of inline resources etc.), that would be quite inefficient and entirely blocked by SQLite lock contention.
00:49 <JAA> Here's an example of the code you'd need to write: https://transfer.notkiska.pw/p5U8I/storywars.py
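In case the paste above is gone: a minimal sketch of what a spec file along those lines can look like, pieced together from JAA's description. The attribute and method names follow qwarc 0.2.x as I understand it; treat them, and the site/URL, as assumptions rather than a verified API.

    import qwarc

    class Story(qwarc.Item):
        # One item = one unit of work, like a tracker item.
        itemType = 'story'

        @classmethod
        def generate(cls):
            # Yield the item values to process, e.g. a hypothetical ID range.
            yield from (str(i) for i in range(1, 1001))

        async def process(self):
            # Fetch whatever this item covers; qwarc records the raw
            # HTTP traffic into the WARC.
            await self.fetch('https://www.example.com/story/{}'.format(self.itemValue))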
00:50 <anarcat> okay, so it does have fetch primitives
00:51 <JAA> It's intentionally minimal to achieve very high request rates. Even with a shitty old i3-2130, I can easily do hundreds of requests per second, assuming the remote server lets me.
00:51 <anarcat> interesting
00:51 <anarcat> what's the http backend?
00:51 <anarcat> aiohttp?
00:51 <JAA> Yeah
00:52 <anarcat> figures
00:52 <JAA> A highly hacked version of it though. :-P
00:52 <anarcat> ouch
00:52 <anarcat> also figures :p
00:52 <JAA> aiohttp doesn't expose the raw data stream.
00:52 <anarcat> i wonder what's the entry point in storywars.py
00:52 <JAA> You run it like `qwarc storywars.py`.
00:53 <JAA> (Plus a bunch of options usually for concurrency etc.)
00:54 <anarcat> but how does qwarc know which classes to load
00:54 <JAA> qwarc.Item.__subclasses__() + recursion
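In plain Python, that discovery amounts to something like the following sketch (the qwarc.Item usage in the comment is illustrative):

    def all_subclasses(cls):
        # Depth-first walk of the class hierarchy; this is the
        # "__subclasses__() + recursion" part.
        for sub in cls.__subclasses__():
            yield sub
            yield from all_subclasses(sub)

    # e.g.: specs = list(all_subclasses(qwarc.Item))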
00:54 <anarcat> clever
00:55 <JAA> Which is actually a bit annoying because the subclass order is random.
00:55 <JAA> Though Python 3.7 should fix that. (I'm still running 3.6 on my main qwarc machine.)
00:55 <anarcat> sorted(subclasses)? :)
00:55 <anarcat> ah
00:55 <JAA> No, I'd like it in the order specified actually.
00:55 <anarcat> where is 3.6 from... debian has 3.5 or 3.7?
00:55 <JAA> But that's extremely tricky.
00:56 <anarcat> oic
00:56 <JAA> You need a metaclass to record the insertion order, because it's all stored in a dict internally.
00:56 <anarcat> but newer python dict objects preserve order now
00:56 <anarcat> iirc
00:57 <anarcat> brb
00:57 <JAA> Yeah, actually I was confusing that; it's been the case since Python 3.6, not 3.7. Not sure why the order is still random on 3.6 for me.
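For what it's worth, on Python 3.6+ the definition order can be recorded without a full metaclass; a sketch using __init_subclass__:

    class Item:
        # Subclasses register themselves at definition time, so the list
        # preserves the order they appear in the spec file, independent
        # of whatever order __subclasses__() happens to return.
        _registry = []

        def __init_subclass__(cls, **kwargs):
            super().__init_subclass__(**kwargs)
            Item._registry.append(cls)

    class A(Item): pass
    class B(Item): pass

    assert Item._registry == [A, B]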
00:57
🔗
|
JAA |
And yeah, 3.6 isn't in Debian package repos; I installed it with pyenv. |
01:00
🔗
|
JAA |
britmob: So what's been your experience with qwarc so far? |
01:01
🔗
|
britmob |
Well, I used it a few times like.. 2 months ago? Then I switched to my own scripts with wpull for websites that needed it. Otherwise, it's grab-site all the way. |
01:01
🔗
|
|
DigiDigi has quit IRC (Remote host closed the connection) |
01:01
🔗
|
britmob |
I appreciate the customizability but it's unneeded for me most of the time |
01:02
🔗
|
britmob |
Doesn't help my python isn't great either haha |
01:03
🔗
|
JAA |
Yeah, I rarely need all of it either. |
01:06
🔗
|
JAA |
I've been wanting to write a shitty recursive crawler with it. One that extracts hrefs, srcs, etc. using string processing (str.find et al.) and then somehow groups the found resources together to avoid the DB overhead. Because why not? :-P |
01:06
🔗
|
JAA |
For reasonably HTML standard compliant sites, it should probably work okay-ish. |
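A sketch of the parserless extraction JAA describes, assuming double-quoted, lowercase attributes (exactly the case variations complained about further down):

    def extract_hrefs(content: bytes):
        # Scan for href="..." with plain byte searches, no HTML parser.
        pos = 0
        while True:
            pos = content.find(b'href="', pos)
            if pos == -1:
                return
            start = pos + len(b'href="')
            end = content.find(b'"', start)
            if end == -1:
                return
            yield content[start:end]
            pos = end + 1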
01:08 <britmob> "Because why not" very much fits the theme here lol..
01:08 <JAA> :-)
01:09 <JAA> Another thing I'd like to do is couple it to snscrape.
01:11 <britmob> Oh, that's interesting. Hadn't seen that before.
01:12 <britmob> Gonna have to play with that later :P
01:12 <JAA> :-)
01:15 <JAA> Have fun!
01:18 <britmob> I plan to.
01:19 <anarcat> rewriting snscrape with qwarc would make sense in itself no?
01:19 <anarcat> otherwise plugging qwarc into chromium or some other headless parser would make sense as well
01:20 <JAA> snscrape is inherently unparallelisable since you only know the required pagination parameters after retrieving the previous page.
01:20 <anarcat> well you can still parallelize the fetches within that page
01:20 <JAA> So that would only make sense for library usage of multiple simultaneous scrapes.
01:21 <JAA> It doesn't request anything else though.
01:21 <JAA> In a coupled setup, it would make sense. For snscrape itself, not so much.
01:21 <anarcat> couldn't you parallelize fetching, say, all the tweets from the first page of a profile?
01:21 <anarcat> right
01:21 <anarcat> i meant rewrite, not couple :)
01:21 <JAA> The only benefit from rewriting snscrape on top of qwarc would be generating WARCs.
01:22 <anarcat> right
01:22 <JAA> Which is a nice benefit, but it can be achieved in easier ways (e.g. warcprox).
01:23 <JAA> But yes, a properly coupled setup could then fetch the individual post pages, images, videos, etc. in parallel to the scraping.
01:23 <JAA> It's just that this doesn't really belong to the intended use cases of snscrape, which is just extracting the relevant info from a feed.
01:23 <anarcat> ah right, i see what you mean
01:24 <anarcat> forgot that part of snscrape :)
01:25 <JAA> Regarding browsers and HTML parsers: that would completely destroy the main advantage of qwarc and the reason why I wrote it in the first place, efficiency/speed. HTML parsing entirely dominates wpull execution time, for example.
01:25 <anarcat> well it would be a sample plugin, not core qwarc
01:28 * anarcat thinking of chromebot
01:29 <JAA> Hmm, how would it be different from a MITM WARC-writing proxy?
01:29 <anarcat> i... don't know
01:30 <anarcat> would a mitm WARC-writing proxy feed URLs back into qwarc?
01:30 <JAA> I meant such a proxy with a headless browser (plus recursion logic).
01:40 <anarcat> no difference then i guess
01:56 -- OrIdow6 has joined #archiveteam-bs
02:49 -- DigiDigi has joined #archiveteam-bs
02:49 <JAA> The shitty recursive crawler with qwarc is a thing now. :-P
02:53 <JAA> I fully expect this to blow up in numerous ways if it's ever actually used though.
02:57 <anarcat> haha no way
03:04 <JAA> https://transfer.notkiska.pw/mCQbe/qwarc-recur-simple.py
03:07 <JAA> (Just in case it wasn't clear enough, no, you shouldn't ever use this. lol)
03:07 -- VADemon has quit IRC (Read error: Connection reset by peer)
03:09 -- VADemon has joined #archiveteam-bs
03:12 <britmob> What's that? Petition the IA to switch to qwarc?
03:14 <JAA> My announcement that I'll move ArchiveBot to this tomorrow. :-P
03:14 -- revi has quit IRC ()
03:15 -- revi has joined #archiveteam-bs
03:15 <anarcat> for hrefPos in qwarc.utils.find_all(content, b'href'):
03:16 <anarcat> whee
03:16 <JAA> :-)
03:16 <JAA> But it doesn't handle case variations. I wish HTML were stricter about how you have to write it.
03:18 <anarcat> if case variation is your only concern, you're in for a ride
03:19 <JAA> I know, but on the other hand, I'm not really writing a parser, just a shitty thing to extract stuff.
03:22 <JAA> So most of the weird edge and corner cases aren't that relevant here.
03:23 <JAA> The whitespace handling is obviously also annoying, but this is the hardest one for this particular purpose.
03:28 <JAA> Regex is sloooow, maybe .lower() is faster.
03:34 <JAA> Or rather, .translate() since I'm working with bytes.
03:38 <anarcat> which is much more reasonable anyways
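The .translate() idea in miniature: build a 256-byte lowercase table once and fold each buffer before searching. One C-level pass per buffer is the speed argument against a case-insensitive regex.

    # bytes.translate() with a precomputed table lowercases ASCII in one pass.
    LOWERCASE = bytes(range(256)).lower()

    def find_token(content: bytes, token: bytes) -> int:
        return content.translate(LOWERCASE).find(token)

    # find_token(b'<A HREF="/x">', b'href') -> 3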
03:41 -- cerca has quit IRC (Remote host closed the connection)
03:45 <JAA> What, you don't like to "parse" HTML with regex?
03:45 <JAA> Ok, here's a slightly saner version: https://transfer.notkiska.pw/QeN1G/qwarc-recur.py
03:47 <JAA> Performance is actually pretty decent at ~45 requests per second with a concurrency of 1.
03:56 <JAA> Anyway, this is way beyond -bs territory by now. I'm curious how far this concept can be taken, but let's do that in -dev.
03:56 <anarcat> i don't like to parse html period
03:56 <anarcat> good job
03:58 <kiiwii> When and where should I upload my archive of gopherholes? Like once I hit a certain amount, and should I update the archive every month or so?
03:59 <JAA> anarcat: By the way, regarding __subclasses__ order: https://bugs.python.org/issue17936#msg190005 :-|
04:00 <JAA> kiiwii: Can you do incremental archives, or do you have to regrab everything every time? But in general, I'd upload one complete archive to one item on the Internet Archive (with a sensible name and all the metadata you can add).
04:00 <JAA> If your method allows it, you can of course start uploading before it's done if the entire thing is too large to store at once, but that's probably not too relevant here since it's only ~4 million resources.
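The upload JAA describes maps onto the internetarchive Python package; the identifier, filename, and metadata below are invented placeholders, not anything kiiwii actually used.

    from internetarchive import upload

    # Hypothetical item; one complete grab goes into one IA item.
    upload(
        'gopherhole-archive-2019-12',
        files=['gopher-grab-2019-12.tar.gz'],
        metadata={
            'title': 'Gopherhole archive (December 2019)',
            'mediatype': 'data',
            'description': 'Crawl of roughly 4 million Gopher resources.',
        },
    )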
04:01 <kiiwii> It doesn't regrab everything each time, it grabs new files and any modified files.
04:02 <JAA> Cool. How does that work?
04:02 <JAA> I didn't see anything regarding modification timestamps or similar in the Gopher descriptions I skimmed over.
04:03 <kiiwii> the python script has certain commands that allow you to do it lol
04:03 <JAA> Which script is that?
04:06 <kiiwii> https://github.com/jnlon/gopherdl
04:07 <JAA> Hmm, I don't see an option for not redownloading already downloaded things?
04:07 <kiiwii> it doesn't by default, it says "not overwriting" and skips it
04:08 <JAA> Ah, ok, but then it still redownloads it, just doesn't write to disk.
04:08 <kiiwii> I believe so, yes
04:09 <JAA> And that's just clobbering, not checking whether the file contents have changed.
04:09 <kiiwii> The problem though is that some gopherholes like sdf.org or quux.org have so many directories that it errors out
04:09 <kiiwii> Maybe I'll learn python so I can fix that issue
04:09 <JAA> Mhm
04:10 <JAA> I also saw that it buffers the entire response in memory, so if there are any large files, that could also be a problem.
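Streaming would be cheap to add, since a Gopher fetch is just a TCP connection, a selector line, and a read until EOF. A minimal sketch (not gopherdl's actual code):

    import socket

    def gopher_fetch(host: str, selector: str, port: int = 70):
        # Send the selector, then yield the response in chunks instead of
        # holding the whole file in memory.
        with socket.create_connection((host, port)) as sock:
            sock.sendall(selector.encode('utf-8') + b'\r\n')
            while True:
                chunk = sock.recv(65536)
                if not chunk:
                    return
                yield chunk

    # with open('out.bin', 'wb') as f:
    #     for chunk in gopher_fetch('gopher.quux.org', '/'):
    #         f.write(chunk)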
04:15 <kiiwii> I thought that may have been the problem, since it changed how many directories it got to before shitting itself
04:21 -- odemgi_ has joined #archiveteam-bs
04:30 -- odemgi has quit IRC (Read error: Operation timed out)
04:44 -- tech234a has quit IRC (Quit: Connection closed for inactivity)
04:53 -- bluefoo_ has quit IRC (Quit: bluefoo_)
04:53 -- qw3rty2 has joined #archiveteam-bs
05:02 -- tech234a has joined #archiveteam-bs
05:04 -- qw3rty has quit IRC (Ping timeout: 745 seconds)
05:39 -- superkuh_ has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
05:51 <JAA> Apparently the NGINX forums at https://forum.nginx.org/ broke sometime in the last two months, throwing only a DB error now. If it comes back, might be a good idea to archive that.
05:57 -- bluefoo has joined #archiveteam-bs
05:58 <Terbium> Btw nginx offices have been raided by the police recently
05:59 <astrid> i hear they have a big lawsuit
06:00 <JAA> It broke in the last two days based on Google's cache, so yeah, possible it's somehow connected to the raids.
06:02 <astrid> nginx got bought by f5 about 8 months ago, and f5 doesn't really care about keeping historical stuff around
06:02 <Terbium> Yep, the buyout by f5 did not make people happy
06:02 <Frogging> Though the forums going down in the last 2 days, right after the raid...
06:03 <JAA> Yeah, ^
06:04 <JAA> Also, what's the relation between nginx.com and nginx.org again?
06:05 <JAA> I wonder if the forums' database is somehow involved in those copyright claims that triggered the raids.
06:05 <Frogging> com is the corporate/enterprise site, org is the open source project
06:05 <Frogging> I think.
06:07 <Frogging> The nginx project/corporate structure never did sit right with me and that's why I stopped using it
06:07 <Frogging> that and being based in Russia, where the kind of shit we just saw tends to happen a lot and due process is ignored when it's convenient
06:10 -- dewdrop has joined #archiveteam-bs
06:23 -- bluefoo has quit IRC (Read error: Operation timed out)
06:30 -- LowLevelM has quit IRC (Read error: Operation timed out)
06:31 -- LowLevelM has joined #archiveteam-bs
06:32 -- bluefoo has joined #archiveteam-bs
06:34 -- d5f4a3622 has quit IRC (Read error: Connection reset by peer)
06:51 -- d5f4a3622 has joined #archiveteam-bs
06:58 -- HP_Archiv has quit IRC (Quit: Leaving)
06:58
🔗
|
Ryz |
Oh hey, http://assemblergames.com/ now disappeared long after their expected death date~ |
07:16
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
07:18
🔗
|
Flashfire |
May I please have access restored to archivebot? I have some old webhosts I want to save some userpages of |
07:20
🔗
|
Flashfire |
Specifically that of Zoominternet as the company has since folded |
07:20
🔗
|
Flashfire |
And my SaveNow captures are only doing so much when the outlinks function keeps freezing |
07:23
🔗
|
|
VADemon has joined #archiveteam-bs |
07:25
🔗
|
|
m007a83 has joined #archiveteam-bs |
07:55
🔗
|
Ryz |
On Zoom Internet stuff, Flashfire - on what you gave me on a Google search term, being 'site:zoominternet.net' - something curious happened, |
07:56
🔗
|
Ryz |
There have been search results that have something like http://www.zoominternet.net/~tfm2006/ - but also stuff like http://users.zoominternet.net/~rdetoro/ |
07:57
🔗
|
Ryz |
It would appear that links like http://users.zoominternet.net/~tfm2006/ are also acceptable, being the same as http://www.zoominternet.net/~tfm2006/ - which may introduce some kind of friction on what to save first |
07:58
🔗
|
Flashfire |
I would go with users and then to be safe www |
08:00
🔗
|
Ryz |
Ah, userapge links under http://users.zoominternet.net/ appear more than just http://www.zoominternet.net/ |
08:01
🔗
|
Ryz |
Even more curious, is http://static-acs-24-144-176-47.zoominternet.net/ - which was found in the search results, but can't access it at all |
08:14
🔗
|
Flashfire |
Oh that ones easy Ryz those are actual websites hosted by the company |
08:41
🔗
|
|
killsushi has quit IRC (Quit: Leaving) |
08:41
🔗
|
Ryz |
Some more investigating, I came across something I never seen before, a 300 web code; I stumbled upon http://users.zoominternet.net/~rbtson/hitty.htm that came from me checking out http://users.zoominternet.net/~rbtson/chap59.htm - unsure if manually created or auto-generated at the time |
08:44
🔗
|
Ryz |
...Pondering whether it's better to run them individually still or run all of 'em as "!a <" |
08:59
🔗
|
Ryz |
Did some curious investigating Flashfire; so while http://www.zoominternet.net/~blown85z/ and http://users.zoominternet.net/~blown85z/ are acceptable, it appears that further into the userpages, it would have to use either of those two, |
08:59
🔗
|
Flashfire |
And now you see why I wanted these web spaces saved |
09:00
🔗
|
Ryz |
Oh no, I did a further check, it seems the two can be used interchangeably~ I somehow typed 'user' instead of 'users' as the sub-domain, |
09:00
🔗
|
Ryz |
The unfortunate thing is that there could be two types of links being used in one page |
09:01
🔗
|
|
Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat) |
09:02
🔗
|
Ryz |
I already check if the links are http://www.zoominternet.net/ links or http://users.zoominternet.net/ when checking the main userpages anyway~ |
09:06
🔗
|
Ryz |
The way you said that makes me uncertain of you s: |
09:10
🔗
|
Ryz |
Flashfire: ^ |
09:11
🔗
|
Flashfire |
No Sorry dude I meant that as I wanted them saved because some do have these variations. Web spaces like that are unstable at the best of times |
09:29
🔗
|
|
kiska has quit IRC (Remote host closed the connection) |
09:29
🔗
|
|
Flashfire has quit IRC (Remote host closed the connection) |
09:30
🔗
|
|
kiska has joined #archiveteam-bs |
09:30
🔗
|
|
Flashfire has joined #archiveteam-bs |
09:31
🔗
|
|
svchfoo3 sets mode: +o kiska |
09:31
🔗
|
|
svchfoo1 sets mode: +o kiska |
10:08 -- deevious has quit IRC (Remote host closed the connection)
10:20 -- tech234a has quit IRC (Quit: Connection closed for inactivity)
10:22 -- BlueMax has quit IRC (Read error: Connection reset by peer)
10:31 -- Craigle has joined #archiveteam-bs
10:31 -- deevious has joined #archiveteam-bs
11:15 -- zerkalo has quit IRC (Remote host closed the connection)
11:15 -- erin has quit IRC (Quit: WeeChat 2.5)
11:23 -- cerca has joined #archiveteam-bs
11:31 -- bluefoo has quit IRC (Ping timeout: 360 seconds)
11:45 -- tech234a has joined #archiveteam-bs
12:25 -- Jopik has quit IRC (Read error: Operation timed out)
13:12 -- LowLevelM has quit IRC (Read error: Connection reset by peer)
13:22 -- bluefoo has joined #archiveteam-bs
13:46 -- kiiwii has quit IRC (Quit: Konversation terminated!)
14:16 <godane> SketchCow: so i got some good news and bad news in my magazine finding
14:17 <godane> good news is i found a website called 1001mags.com that has tons of french magazines and some are very old
14:17 <godane> also most of the magazines are free
14:18 -- LowLevelM has joined #archiveteam-bs
14:18 <godane> the bad news is the pdfs are auto-generated to have the buyer's personal info put on the cover page, because the only way to download these free magazines is to "buy" them
14:19 <godane> the magazines are still free, it's just done through their cart buying system
14:20 <arkiver> joepie91_: can you please ping me back?
14:23 -- superkuh_ has joined #archiveteam-bs
14:25 <markedL> godane: that's often easy to remove with a pdf rewriter
14:29 <godane> i figure we just source the cover to remove and re-add a clean cover to the pdf
14:29 <godane> markedL: you can download all pages at 750x
14:30 <godane> it just will be very small
14:30 <godane> vs pdf
14:32 <markedL> does the free one require a credit card on file?
14:32 <godane> no
14:32 <godane> no credit card required for this
14:36 <Raccoon> godane: is the pii on the cover page a watermark modified on the image, or just a text layer added to the PDF?
14:36 <Raccoon> *modifying the image
14:36 <godane> there is a white background then text
14:37 <Raccoon> try opening the PDF in a text editor to see if you can remove the line
14:37 <Raccoon> "oh, just delete line 14 from every PDF file"
14:39 <SootBectr> Example: https://i.imgur.com/fjYgkZo.png the first two pages could just be removed if they're all done like that
14:40 <SootBectr> as they're adverts.
14:40 <markedL> i can remove that as long as it's text and not an image
14:40 <markedL> is it selectable? for copy/paste
14:41 <Raccoon> oh that's definitely a text object slapped on there
14:42 <Raccoon> easy peasy 99% deletable
14:42 <SootBectr> Can you recommend a pdf editor/viewer for linux that lets you select text? I reinstalled this laptop recently and can't remember which one I used to use
14:43 <Raccoon> try Okular
14:43 <Raccoon> as for doing the work, it's probably easier via script
14:44 <godane> there's definitely text in it
14:44 <SootBectr> Thanks. Oh this one (Atril) does actually, it's just awkward to get that bit of it. Yes the PII is text.
14:45 <Raccoon> if it's consistent, it should be predictable to find and remove either by line number or substring match
14:46 <markedL> it's probably encoded strings, grep will tell you
14:47 <Raccoon> just don't break the PDF :)
14:47 <godane> pdfinfo of one of my pdfs: https://pastebin.com/2tTCaDBk
14:48 <godane> it is encrypted
14:50 <Raccoon> gross. Okular has an option to ignore protection, you have to turn it on.
14:51 <SootBectr> This one begins with the title page and the PII box is located differently https://i.imgur.com/WcuJG0T.png
14:52 <godane> the placement of that will be different
14:52 <Raccoon> unless it's just different x/y coords for the exact same element located similarly in the file
14:52 <Raccoon> i don't know about a tool for removing encryption from a pdf
14:55 <Raccoon> probably exists
14:55 <godane> i figured it out, maybe
14:56 <godane> using qpdf
15:00 <godane> qpdf -decrypt Air-le-Mag-101.pdf output.pdf
15:01 <SootBectr> Decrypted one and can see the PII is there as metadata too
15:08 <Raccoon> try reading that in a text editor that won't barf on binary content, to locate the element their script is inserting
15:14 -- SoraUta has quit IRC (Read error: Operation timed out)
15:16 <SootBectr> Here's what I have at the very beginning of the file, can't find any other occurrences of "204." or "archive" https://paste.ubuntu.com/p/nRrmcXRM7z/
15:17 <markedL> that's just metadata. it would be closer to the drawing sections
15:17 <markedL> if you want an easy job, order the same document with two different accounts, then diff the decrypted versions
15:17 <SootBectr> It's metadata, yes
15:18 <markedL> it's not going to be an ascii string, but it will tell you where the changes are without having to understand the pdf language
15:20 <Raccoon> why won't it be an ascii string? seems their script is so dumb it doesn't even indent, it just injects print.
15:24 <markedL> https://blog.didierstevens.com/2008/05/19/pdf-stream-objects/
15:28 <Raccoon> i see. https://blog.didierstevens.com/2008/04/29/pdf-let-me-count-the-ways/
15:32 <SootBectr> godane: If you'd like to compare, here's my encrypted file for http://fr.1001mags.com/magazine/douane-magazine (I figure best to give you encrypted in case there's any differences between our versions of qpdf) https://transfer.sh/tTNZu/Douane-Magazine-014.pdf
15:33 <SootBectr> Note that there's one small difference every time you qpdf -decrypt the same file, looks like an md5sum of something.
15:34 -- OrIdow6 has quit IRC (Quit: Leaving.)
15:36 <godane> so the md5sum is different every time you use qpdf -decrypt
15:38 -- OrIdow6 has joined #archiveteam-bs
15:43 <godane> SootBectr: looks like the md5sum is different with each decrypt
15:43 <SootBectr> Yes, there's a line in the file that changes every time you qpdf -decrypt
15:58 <SootBectr> I imagine it isn't important, just pointing it out to avoid confusion.
15:58 <godane> good news
15:58 <godane> the full cover is there
15:59 <godane> i was able to remove the white background but not the text using ghostscript
15:59 <godane> turns out that white background is a vector image
15:59 -- deevious has quit IRC (Read error: Connection reset by peer)
16:01 -- deevious has joined #archiveteam-bs
16:09 <godane> bad news is it removed other vector stuff also
16:11 <SootBectr> The metadata is easy to strip at least: exiftool -e -all:all="" file.pdf -o temp.pdf ; qpdf --linearize temp.pdf output.pdf
16:26 <SootBectr> The linearize step is necessary because exiftool's deletions are reversible
16:26 <SootBectr> godane: if you'd like to send me a file I can try comparing too.
16:34 <SootBectr> I suggest an encrypted source file.
16:38 -- jamiew has joined #archiveteam-bs
16:49 <godane> SootBectr: https://archive.org/details/CNEWS-Matin-2504
16:49 <markedL> can you get two differently sourced copies of the same issue?
16:52 <SootBectr> Looks like this is the relevant section https://paste.ubuntu.com/p/7kFRsQ7dGT/
16:53 <SootBectr> There's loads of other FlateDecode occurrences though, I can't see a way to identify that one in particular.
16:53 <SootBectr> ...besides decoding it, of course.
16:53 <godane> maybe mess with it uncompressed
16:55 <godane> pdftk file.pdf output uncompress.pdf uncompress
16:55 <SootBectr> I did qpdf -qdf --object-streams=disable in.pdf out.pdf and that lets you read it all.
16:56 -- asdf0101 has quit IRC (Read error: Operation timed out)
17:16 -- markedL has quit IRC (Read error: Operation timed out)
17:30 -- superkuh_ has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
17:31 -- trc has joined #archiveteam-bs
17:31 -- markedL has joined #archiveteam-bs
17:41 -- asdf0101 has joined #archiveteam-bs
17:42 <godane> SootBectr: i'm making progress
18:06 <godane> i was able to remove my name and my email text
18:06 -- zerkalo has joined #archiveteam-bs
18:06 <godane> bad news is i may have to break up the pdf to do this right
18:07 <godane> this is because, when removing the text, page 2 becomes blank for some reason
18:07 <godane> so the theory is to make a cover pdf and a pages-2-to-end pdf
18:15 <godane> edit the cover pdf, then use pdfunite to combine the cover pdf and the 2-to-end pdf
18:18 <SootBectr> I suspect it can be done with a regex search and replace, but as I understand it you need to keep the string length the same - don't know if there's a go-to tool that doesn't trip up on binary files to do that?
18:18 <SootBectr> You can certainly use a hex editor and just blank out the strings
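A sketch of that fixed-length replacement in Python (near the end of the log, SootBectr confirms the replace-with-spaces approach works). The regex is a guess at the watermark string and would need adjusting to the real files:

    import re
    import sys

    # Hypothetical pattern: a PDF literal string containing the watermark
    # text. Adjust to whatever the decrypted file actually contains.
    PII = re.compile(rb'\((?:[^()\\]|\\.)*Exemplaire strictement personnel(?:[^()\\]|\\.)*\)')

    def blank(match):
        # Same byte length in, same byte length out, so the byte offsets
        # in the xref table stay valid.
        return b'(' + b' ' * (len(match.group(0)) - 2) + b')'

    with open(sys.argv[1], 'rb') as f:
        data = f.read()
    with open(sys.argv[2], 'wb') as f:
        f.write(PII.sub(blank, data))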
18:18 -- DogsRNice has joined #archiveteam-bs
18:23 <godane> fuck yes i got it: cat CNEWS-Matin-2504-cover.pdf | sed "/Length 372$/,/Length 768/d" > diff.pdf
18:27 <SootBectr> Won't that also delete any other sections that happen to be the same length?
18:27 <godane> that deletes everything between Length 372 and Length 768
18:28 <godane> so both Lengths would have to match for it to be a problem
18:39 <godane> also, making the cover its own pdf and just doing it on that limits the problem
18:40 -- katocala has joined #archiveteam-bs
18:48 -- katocala has left
18:48 -- Craigle has quit IRC (Quit: Ping timeout (120 seconds))
18:52 -- Craigle has joined #archiveteam-bs
18:56 -- schbirid has joined #archiveteam-bs
19:02 <godane> sadly the lengths may be different in each pdf
19:10 <SootBectr> Perhaps the way to approach it is to find the block of stream ... endstream that contains the email address
19:12 -- dashcloud has joined #archiveteam-bs
19:16 -- Myself has quit IRC (Read error: Connection reset by peer)
19:32 -- Myself has joined #archiveteam-bs
19:40 <godane> so this works in removing the watermark: cat "cover.pdf" | sed "/Length 3[0-9][0-9]$/,/Length 768/d"
19:55 -- prq has quit IRC (Remote host closed the connection)
20:04 <markedL> was the goal here to remove the data, or prevent its render?
20:07 <godane> can it be prevented in render?
20:09 <markedL> mods which don't remove the data but prevent its render would be taking out the instructions to draw
20:10 -- jamiew has quit IRC (zzz)
20:39 -- tech234a has quit IRC (Quit: Connection closed for inactivity)
20:40 -- mtntmnky has quit IRC (Remote host closed the connection)
20:40 -- schbirid has quit IRC (Quit: Leaving)
20:51 -- mtntmnky has joined #archiveteam-bs
20:51 <SootBectr> I'm trying my hand at some python to remove it, so far I have it reading line by line, detecting the stream .. endstream blocks and writing out a file which is identical to source.
20:52 <SootBectr> Now to remind myself how to regex in pythonland
20:53 -- tech234a has joined #archiveteam-bs
20:54 <godane> the 'crappier' option could be this: cat "$cover" | sed "/Exemplaire strictement personnel/,/gmail.com/d"
20:55 <godane> that removes the text, but there is still a white box where the text would be, so the cover is not fully unedited
20:57 <godane> at least this would mostly get it done 99% of the time i would think
21:01 -- trc has quit IRC (Quit: Goodbye)
21:02 <SootBectr> I get an invalid file if I do that
21:02 <markedL> https://blog.didierstevens.com/programs/pdf-tools/
21:04 <godane> i have a very big script for this
21:04 <godane> that's just part of it
21:08 <markedL> https://github.com/pdfminer/pdfminer.six
21:12 <godane> also, if you tried that on one of your pdfs, it would just delete everything after "Exemplaire strictement personnel" because you don't have a gmail.com address to tell it to stop
21:13 <SootBectr> Oh I changed the email bit
21:36 -- BlueMax has joined #archiveteam-bs
21:37 <godane> SootBectr: did you fix it, or are you just saying it did that when it gave you the invalid file?
21:37 -- Stiletto has quit IRC ()
21:43 -- Stiletto has joined #archiveteam-bs
21:59 -- Stiletto has quit IRC (Client Quit)
22:11 -- Stiletto has joined #archiveteam-bs
22:17 -- Stiletto has quit IRC ()
22:18 -- Stiletto has joined #archiveteam-bs
22:19 <dashcloud> godane: what are you trying to do exactly?
22:29 <Raccoon> dashcloud: removing PII tags. https://i.imgur.com/fjYgkZo.png
22:31 -- SoraUta has joined #archiveteam-bs
22:31 <SootBectr> godane: that gave me invalid. I have some python code that's successfully removing some regex matches now, will improve it a bit and share
22:32 -- Stiletto has quit IRC ()
22:36 <SootBectr> markedL: Thanks, I had a quick skim but couldn't see an option to save changes to a pdf, I'm sure the object parsing code would be useful though
22:40 -- Stiletto has joined #archiveteam-bs
22:47 -- Stiletto has quit IRC (Client Quit)
22:53 -- Stiletto has joined #archiveteam-bs
22:59 -- superkuh_ has joined #archiveteam-bs
23:04 -- jamiew has joined #archiveteam-bs
23:10 -- Zerote_ has joined #archiveteam-bs
23:15 -- Zerote has quit IRC (Read error: Operation timed out)
23:17 <SootBectr> godane: give this a spin https://paste.ubuntu.com/p/yg33z24DG9/
23:20 -- jamiew has quit IRC (zzz)
23:25 <godane> doesn't work at all
23:26 <SootBectr> I tested it on the file you gave me. Oh, were you deflating the streams with pdftk? Maybe that affects it
23:26 <godane> SootBectr: my script: https://pastebin.com/KHzFvBq0
23:29 <SootBectr> It does, let me see why. Or you can try qpdf -qdf --object-streams=disable in.pdf out.pdf and then run the python
23:31 -- LowLevelM has quit IRC (Read error: Operation timed out)
23:32 <godane> it works after i did that
23:32 <godane> my script gets rid of the white box though (mostly)
23:41 <markedL> oh that flag makes this easy
23:44 <godane> my script also gets rid of the metadata
23:44 <SootBectr> Yeah, I'd probably have that step in a shell script that runs the python afterwards
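That wrapper might look like the following sketch, with strip_pii.py standing in for a space-padding script like the one sketched earlier; filenames are placeholders, and the qpdf flags are the ones already used in this log:

    #!/bin/sh
    # Expand object streams so the strings are visible to a byte-level
    # pass, scrub the PII, then write a normal PDF again.
    qpdf --decrypt --qdf --object-streams=disable "$1" expanded.pdf
    python3 strip_pii.py expanded.pdf scrubbed.pdf
    qpdf --linearize scrubbed.pdf "$2"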
23:47 <markedL> I'm not sure how you're editing the files without updating the xref table
23:48 <SootBectr> I don't even know what an xref table is :)
23:48 <markedL> **** Error: An error occurred while reading an XREF table. **** Error: An error occurred while reading an XREF table.
23:49 <markedL> for my edits, haven't tried your edits yet
23:49 <SootBectr> What program is giving you that error, and how are you making edits?
23:50 -- oofdere has joined #archiveteam-bs
23:50 <markedL> ghostscript gives that error, and xpdf doesn't like it either. I deleted the strings' contents so that they're 0 bytes long. this moved the byte offsets those 2 programs were trying to follow
23:51 <markedL> i'll have to redo my edits so the byte offsets don't change
23:51 <SootBectr> Aha, I'm just counting the length of a regex match and replacing it with spaces
23:51 <SootBectr> with that number of spaces
23:53 <markedL> ah yes, that would preserve it
23:54 <markedL> ok, yeah spaces method works, which you knew
23:54 <markedL> rectangle should be right around here