#archiveteam-bs 2019-12-16,Mon

Time Nickname Message
00:33 🔗 X-Scale` has joined #archiveteam-bs
00:37 🔗 X-Scale has quit IRC (Read error: Operation timed out)
00:37 🔗 X-Scale` is now known as X-Scale
00:39 🔗 britmob JAA: Did you fix the malformed WARC issue with qwarc?
00:40 🔗 anarcat what's qwarc
00:40 🔗 britmob https://github.com/JustAnotherArchivist/qwarc
00:40 🔗 JAA britmob: The partial records? Yes, that's fixed in 0.2.2.
00:41 🔗 britmob Perfect, thanks.
00:41 🔗 JAA Are you using qwarc?
00:41 🔗 britmob Occasionally
00:41 🔗 JAA Nice, you might be the only one. :-P
00:41 🔗 britmob hehe
00:42 🔗 markedL I ran it once, but didn't write the grab behavior
00:42 🔗 britmob qwarc/brozzler/grab-site is what I use most often
00:42 🔗 britmob Sometimes wpull.
00:43 🔗 anarcat so it's kind of this curl url-list.txt | qwarc kind of thing?
00:44 🔗 JAA Nope, not at all.
00:44 🔗 JAA Think of qwarc like a local version of the tracker.
00:45 🔗 JAA The work unit is an item, and each item fetches any number of things via HTTP requests.
00:45 🔗 JAA It's very low level. You have to write all of the retrieval stuff, recursion as desired, etc. yourself.
00:46 🔗 britmob Which is why I like it :)
00:46 🔗 anarcat so it's a dispatcher
00:47 🔗 JAA While it's possible to do what you suggest (one item per URL, no further processing like extraction of inline resources etc.), that would be quite inefficient and entirely blocked by SQLite lock contention.
00:49 🔗 JAA Here's an example of the code you'd need to write: https://transfer.notkiska.pw/p5U8I/storywars.py
00:50 🔗 anarcat okay, so it does have fetch primitives
00:51 🔗 JAA It's intentionally minimal to achieve very high request rates. Even with a shitty old i3-2130, I can easily do hundreds of requests per second, assuming the remote server lets me.
00:51 🔗 anarcat interesting
00:51 🔗 anarcat what's the http backend?
00:51 🔗 anarcat aiohttp?
00:51 🔗 JAA Yeah
00:51 🔗 anarcat figures
00:52 🔗 JAA A highly hacked version of it though. :-P
00:52 🔗 anarcat ouch
00:52 🔗 anarcat also figures :p
00:52 🔗 JAA aiohttp doesn't expose the raw data stream.
00:52 🔗 anarcat i wonder what's the entry point in storywars.py
00:52 🔗 JAA You run it like `qwarc storywars.py`.
00:52 🔗 JAA (Plus a bunch of options usually for concurrency etc.)
00:53 🔗 anarcat but how does qwarc know which classes to load
00:54 🔗 JAA qwarc.Item.__subclasses__() + recursion
00:54 🔗 anarcat clever
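
The discovery trick JAA names here, sketched in plain Python — this is only the generic mechanism, not qwarc's actual loader code, and the example class names are made up:

    class Item:
        pass

    def all_subclasses(cls):
        # Recursively collect every direct and indirect subclass of cls.
        found = []
        for sub in cls.__subclasses__():
            found.append(sub)
            found.extend(all_subclasses(sub))
        return found

    # After loading a spec file that defines Item subclasses:
    class StoryItem(Item): pass
    class ChapterItem(StoryItem): pass

    print(all_subclasses(Item))  # [StoryItem, ChapterItem]
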
00:54 🔗 JAA Which is actually a bit annoying because the subclass order is random.
00:55 🔗 JAA Though Python 3.7 should fix that. (I'm still running 3.6 on my main qwarc machine.)
00:55 🔗 anarcat sorted(subclasses)? :)
00:55 🔗 anarcat ah
00:55 🔗 JAA No, I'd like it in the order specified actually.
00:55 🔗 anarcat where is 3.6 from... debian has 3.5 or 3.7?
00:55 🔗 JAA But that's extremely tricky.
00:55 🔗 anarcat oic
00:56 🔗 JAA You need a metaclass to record the insertion order, because it's all stored in a dict internally.
00:56 🔗 anarcat but newer python dict objects preserve order now
00:56 🔗 anarcat iirc
00:56 🔗 anarcat brb
00:57 🔗 JAA Yeah, actually I was confusing that, that's the case since Python 3.6, not 3.7. Not sure why the order is still random on 3.6 for me.
00:57 🔗 JAA And yeah, 3.6 isn't in Debian package repos; I installed it with pyenv.
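
The metaclass approach JAA mentions could look roughly like this (illustrative names, not qwarc code): the metaclass appends each new subclass to a registry list at class-creation time, so the spec file's definition order is preserved no matter what __subclasses__() returns:

    class ItemMeta(type):
        registry = []  # subclasses, recorded in definition order

        def __init__(cls, name, bases, namespace):
            super().__init__(name, bases, namespace)
            if bases:  # skip the root Item class itself
                ItemMeta.registry.append(cls)

    class Item(metaclass=ItemMeta):
        pass

    class First(Item): pass
    class Second(Item): pass

    print(ItemMeta.registry)  # [First, Second], in definition order
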
01:00 🔗 JAA britmob: So what's been your experience with qwarc so far?
01:01 🔗 britmob Well, I used it a few times like.. 2 months ago? Then I switched to my own scripts with wpull for websites that needed it. Otherwise, it's grab-site all the way.
01:01 🔗 DigiDigi has quit IRC (Remote host closed the connection)
01:01 🔗 britmob I appreciate the customizability but it's unneeded for me most of the time
01:02 🔗 britmob Doesn't help my python isn't great either haha
01:03 🔗 JAA Yeah, I rarely need all of it either.
01:06 🔗 JAA I've been wanting to write a shitty recursive crawler with it. One that extracts hrefs, srcs, etc. using string processing (str.find et al.) and then somehow groups the found resources together to avoid the DB overhead. Because why not? :-P
01:06 🔗 JAA For reasonably HTML standard compliant sites, it should probably work okay-ish.
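
The kind of string-processing extraction JAA is describing might look something like this — an assumed simplification, not the script he links later in this log, and it only handles quoted, lowercase attributes:

    def extract_hrefs(content: bytes):
        # Walk the buffer with bytes.find instead of parsing the HTML.
        urls = []
        pos = 0
        while True:
            pos = content.find(b'href=', pos)
            if pos < 0:
                break
            pos += len(b'href=')
            quote = content[pos:pos + 1]
            if quote in (b'"', b"'"):
                end = content.find(quote, pos + 1)
                if end >= 0:
                    urls.append(content[pos + 1:end])
                    pos = end
            pos += 1
        return urls

    print(extract_hrefs(b'<a href="/page1"><a href=\'/page2\'>'))
    # [b'/page1', b'/page2']
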
01:08 🔗 britmob "Because why not" very much fits the theme here lol..
01:08 🔗 JAA :-)
01:08 🔗 JAA Another thing I'd like to do is couple it to snscrape.
01:09 🔗 britmob Oh, that's interesting. Hadn't seen that before.
01:11 🔗 britmob Gonna have to play with that later :P
01:12 🔗 JAA :-)
01:12 🔗 JAA Have fun!
01:15 🔗 britmob I plan to.
01:18 🔗 anarcat rewriting snscrape with qwarc would make sense in itself no?
01:19 🔗 anarcat otherwise plugging qwarc into chromium or some other headless parser would make sense as well
01:19 🔗 JAA snscrape is inherently unparallelisable since you only know the required pagination parameters after retrieving the previous page.
01:20 🔗 anarcat well you can still parallelize the fetches within that page
01:20 🔗 JAA So that would only make sense for library usage of multiple simultaneous scrapes.
01:20 🔗 JAA It doesn't request anything else though.
01:21 🔗 JAA In a coupled setup, it would make sense. For snscrape itself, not so much.
01:21 🔗 anarcat couldn't you parallelize fetching, say, all the tweets from the first page of a profile?
01:21 🔗 anarcat right
01:21 🔗 anarcat i meant rewrite, not couple :)
01:21 🔗 JAA The only benefit from rewriting snscrape on top of qwarc would be generating WARCs.
01:21 🔗 anarcat right
01:22 🔗 JAA Which is a nice benefit, but it can be achieved in easier ways (e.g. warcprox).
01:22 🔗 JAA But yes, a properly coupled setup could then fetch the individual post pages, images, videos, etc. in parallel to the scraping.
01:23 🔗 JAA It's just that this doesn't really belong to the intended use cases of snscrape, which is just extracting the relevant info from a feed.
01:23 🔗 anarcat ah right, i see what you mean
01:23 🔗 anarcat forgot that part of snscrape :)
01:24 🔗 JAA Regarding browsers and HTML parsers: that would completely destroy the main advantage of qwarc and reason why I wrote it in the first place, efficiency/speed. HTML parsing is entirely dominating wpull execution time, for example.
01:25 🔗 anarcat well it would be a sample plugin, not core qwarc
01:25 🔗 * anarcat thinking of chromebot
01:28 🔗 JAA Hmm, how would it be different from a MITM WARC-writing proxy?
01:29 🔗 anarcat i... don't know
01:29 🔗 anarcat would a mitm warc-writing proxy feed URLs back into qwarc?
01:30 🔗 JAA I meant such a proxy with a headless browser (plus recursion logic).
01:30 🔗 anarcat no difference then i guess
01:40 🔗 OrIdow6 has joined #archiveteam-bs
01:56 🔗 DigiDigi has joined #archiveteam-bs
02:49 🔗 JAA The shitty recursive crawler with qwarc is a thing now. :-P
02:49 🔗 JAA I fully expect this to blow up in numerous ways if it's ever actually used though.
02:53 🔗 anarcat haha no way
02:57 🔗 JAA https://transfer.notkiska.pw/mCQbe/qwarc-recur-simple.py
03:04 🔗 JAA (Just in case it wasn't clear enough, no, you shouldn't ever use this. lol)
03:07 🔗 VADemon has quit IRC (Read error: Connection reset by peer)
03:07 🔗 VADemon has joined #archiveteam-bs
03:09 🔗 britmob What's that? Petition the IA to switch to qwarc?
03:12 🔗 JAA My announcement that I'll move ArchiveBot to this tomorrow. :-P
03:14 🔗 revi has quit IRC ()
03:14 🔗 revi has joined #archiveteam-bs
03:15 🔗 anarcat for hrefPos in qwarc.utils.find_all(content, b'href'):
03:15 🔗 anarcat whee
03:16 🔗 JAA :-)
03:16 🔗 JAA But it doesn't handle case variations. I wish HTML were stricter about how you have to write it.
03:16 🔗 anarcat if case variation is your only concern, you're in for a ride
03:18 🔗 JAA I know, but on the other hand, I'm not really writing a parser, just a shitty thing to extract stuff.
03:19 🔗 JAA So most of the weird edge and corner cases aren't that relevant here.
03:22 🔗 JAA The whitespace handling is obviously also annoying, but this is the hardest one for this particular purpose.
03:23 🔗 JAA Regex is sloooow, maybe .lower() is faster.
03:28 🔗 JAA Or rather, .translate() since I'm working with bytes.
03:34 🔗 anarcat which is much more reasonable anyways
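
A sketch of the .translate() idea: build a 256-entry table once that lowercases ASCII, fold the whole buffer with it, then do plain substring searches — no regex involved. (Hypothetical helper, not qwarc.utils code.)

    # Table mapping A-Z to a-z, leaving every other byte value alone.
    LOWER = bytes(range(256)).lower()

    def find_all_ci(haystack: bytes, needle: bytes):
        # Yield offsets of case-insensitive matches of an ASCII needle.
        folded = haystack.translate(LOWER)
        needle = needle.lower()
        pos = folded.find(needle)
        while pos >= 0:
            yield pos
            pos = folded.find(needle, pos + 1)

    print(list(find_all_ci(b'<A HREF=x><a href=y>', b'href')))  # [3, 13]
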
03:38 🔗 cerca has quit IRC (Remote host closed the connection)
03:41 🔗 JAA What, you don't like to "parse" HTML with regex?
03:45 🔗 JAA Ok, here's a slightly saner version: https://transfer.notkiska.pw/QeN1G/qwarc-recur.py
03:45 🔗 JAA Performance is actually pretty decent at ~45 requests per second with a concurrency of 1.
03:47 🔗 JAA Anyway, this is way beyond -bs territory by now. I'm curious how far this concept can be taken, but let's do that in -dev.
03:56 🔗 anarcat i don't like to parse html period
03:56 🔗 anarcat good job
03:56 🔗 kiiwii When and where should I upload my archive of gopherholes? Like once I hit a certain amount and should I update the archive every month or so?
03:58 🔗 JAA anarcat: By the way, regarding __subclasses__ order: https://bugs.python.org/issue17936#msg190005 :-|
03:59 🔗 JAA kiiwii: Can you do incremental archives, or do you have to regrab everything every time? But in general, I'd upload one complete archive to one item on the Internet Archive (with a sensible name and all the metadata you can add).
04:00 🔗 JAA If your method allows it, you can of course start uploading before it's done if the entire thing is too large to store at once, but that's probably not too relevant here since it's only ~4 million resources.
04:00 🔗 kiiwii It doesn't regrab everything each time, it grabs new files and any modified files.
04:01 🔗 JAA Cool. How does that work?
04:02 🔗 JAA I didn't see anything regarding modification timestamps or similar in the Gopher descriptions I skimmed over.
04:02 🔗 kiiwii the python script written has certain commands that allow you to do it lol
04:03 🔗 JAA Which script is that?
04:03 🔗 kiiwii https://github.com/jnlon/gopherdl
04:06 🔗 JAA Hmm, I don't see an option for not redownloading already downloaded things?
04:07 🔗 kiiwii it doesn't by default, it says "not overwriting" and skips it
04:07 🔗 JAA Ah, ok, but then it still redownloads it, just doesn't write to disk.
04:08 🔗 kiiwii I believe so, yes
04:08 🔗 JAA And that's just clobbering, not checking whether the file contents have changed.
04:09 🔗 kiiwii The problem though is that some gopherholes like sdf.org or quux.org have so many directories that it errors out
04:09 🔗 kiiwii Maybe I'll learn python so I can fix that issue
04:09 🔗 JAA Mhm
04:09 🔗 JAA I also saw that it buffers the entire response in memory, so if there are any large files, that could also be a problem.
04:10 🔗 kiiwii I thought that may have been the problem, since it changed how many directories it got to before shitting itself
04:15 🔗 odemgi_ has joined #archiveteam-bs
04:21 🔗 odemgi has quit IRC (Read error: Operation timed out)
04:30 🔗 tech234a has quit IRC (Quit: Connection closed for inactivity)
04:44 🔗 bluefoo_ has quit IRC (Quit: bluefoo_)
04:53 🔗 qw3rty2 has joined #archiveteam-bs
04:53 🔗 tech234a has joined #archiveteam-bs
05:02 🔗 qw3rty has quit IRC (Ping timeout: 745 seconds)
05:04 🔗 superkuh_ has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
05:39 🔗 JAA Apparently the NGINX forums at https://forum.nginx.org/ broke sometime in the last two months, throwing only a DB error now. If it comes back, might be a good idea to archive that.
05:51 🔗 bluefoo has joined #archiveteam-bs
05:57 🔗 Terbium Btw nginx offices have been raided by the police recently
05:58 🔗 astrid i hear they have a big lawsuit
05:59 🔗 JAA It broke in the last two days based on Google's cache, so yeah, possible it's somehow connected to the raids.
06:00 🔗 astrid nginx got bought by f5 about 8 months ago, and f5 doesn't really care about keeping historical stuff around
06:02 🔗 Terbium Yep, the buy out by f5 did not make people happy
06:02 🔗 Frogging Though the forums going down in the last 2 days, right after the raid...
06:02 🔗 JAA Yeah, ^
06:03 🔗 JAA Also, what's the relation between nginx.com and nginx.org again?
06:04 🔗 JAA I wonder if the forums' database is somehow involved in those copyright claims that triggered the raids.
06:05 🔗 Frogging com is the corporate/enterprise site, org is the open source project
06:05 🔗 Frogging I think.
06:05 🔗 Frogging The nginx project/corporate structure never did sit right with me and that's why I stopped using it
06:07 🔗 Frogging that and being based in Russia where the kind of shit we just saw tends to happen a lot, and due process is ignored when it's convenient
06:07 🔗 dewdrop has joined #archiveteam-bs
06:10 🔗 bluefoo has quit IRC (Read error: Operation timed out)
06:23 🔗 LowLevelM has quit IRC (Read error: Operation timed out)
06:30 🔗 LowLevelM has joined #archiveteam-bs
06:31 🔗 bluefoo has joined #archiveteam-bs
06:32 🔗 d5f4a3622 has quit IRC (Read error: Connection reset by peer)
06:34 🔗 d5f4a3622 has joined #archiveteam-bs
06:51 🔗 HP_Archiv has quit IRC (Quit: Leaving)
06:58 🔗 Ryz Oh hey, http://assemblergames.com/ has now disappeared, long after its expected death date~
07:16 🔗 VADemon has quit IRC (Quit: left4dead)
07:18 🔗 Flashfire May I please have access restored to archivebot? I have some old webhosts I want to save some userpages of
07:20 🔗 Flashfire Specifically that of Zoominternet as the company has since folded
07:20 🔗 Flashfire And my SaveNow captures are only doing so much when the outlinks function keeps freezing
07:23 🔗 VADemon has joined #archiveteam-bs
07:25 🔗 m007a83 has joined #archiveteam-bs
07:55 🔗 Ryz On the Zoom Internet stuff, Flashfire - using the Google search term you gave me, 'site:zoominternet.net' - something curious happened,
07:56 🔗 Ryz There have been search results that have something like http://www.zoominternet.net/~tfm2006/ - but also stuff like http://users.zoominternet.net/~rdetoro/
07:57 🔗 Ryz It would appear that links like http://users.zoominternet.net/~tfm2006/ are also acceptable, being the same as http://www.zoominternet.net/~tfm2006/ - which may introduce some kind of friction on what to save first
07:58 🔗 Flashfire I would go with users and then to be safe www
08:00 🔗 Ryz Ah, userpage links under http://users.zoominternet.net/ appear more than just under http://www.zoominternet.net/
08:01 🔗 Ryz Even more curious, is http://static-acs-24-144-176-47.zoominternet.net/ - which was found in the search results, but can't access it at all
08:14 🔗 Flashfire Oh that one's easy Ryz, those are actual websites hosted by the company
08:41 🔗 killsushi has quit IRC (Quit: Leaving)
08:41 🔗 Ryz Some more investigating: I came across something I'd never seen before, a 300 HTTP status code; I stumbled upon http://users.zoominternet.net/~rbtson/hitty.htm that came from me checking out http://users.zoominternet.net/~rbtson/chap59.htm - unsure if manually created or auto-generated at the time
08:44 🔗 Ryz ...Pondering whether it's better to run them individually still or run all of 'em as "!a <"
08:59 🔗 Ryz Did some curious investigating Flashfire; so while http://www.zoominternet.net/~blown85z/ and http://users.zoominternet.net/~blown85z/ are acceptable, it appears that further into the userpages, it would have to use either of those two,
08:59 🔗 Flashfire And now you see why I wanted these web spaces saved
09:00 🔗 Ryz Oh no, I did a further check, it seems the two can be used interchangeably~ I somehow typed 'user' instead of 'users' as the sub-domain,
09:00 🔗 Ryz The unfortunate thing is that there could be two types of links being used in one page
09:01 🔗 Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat)
09:02 🔗 Ryz I already check if the links are http://www.zoominternet.net/ links or http://users.zoominternet.net/ when checking the main userpages anyway~
09:06 🔗 Ryz The way you said that makes me uncertain of you s:
09:10 🔗 Ryz Flashfire: ^
09:11 🔗 Flashfire No, sorry dude, I meant that I wanted them saved because some do have these variations. Web spaces like that are unstable at the best of times
09:29 🔗 kiska has quit IRC (Remote host closed the connection)
09:29 🔗 Flashfire has quit IRC (Remote host closed the connection)
09:30 🔗 kiska has joined #archiveteam-bs
09:30 🔗 Flashfire has joined #archiveteam-bs
09:31 🔗 svchfoo3 sets mode: +o kiska
09:31 🔗 svchfoo1 sets mode: +o kiska
10:08 🔗 deevious has quit IRC (Remote host closed the connection)
10:20 🔗 tech234a has quit IRC (Quit: Connection closed for inactivity)
10:22 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
10:31 🔗 Craigle has joined #archiveteam-bs
10:31 🔗 deevious has joined #archiveteam-bs
11:15 🔗 zerkalo has quit IRC (Remote host closed the connection)
11:15 🔗 erin has quit IRC (Quit: WeeChat 2.5)
11:23 🔗 cerca has joined #archiveteam-bs
11:31 🔗 bluefoo has quit IRC (Ping timeout: 360 seconds)
11:45 🔗 tech234a has joined #archiveteam-bs
12:25 🔗 Jopik has quit IRC (Read error: Operation timed out)
13:12 🔗 LowLevelM has quit IRC (Read error: Connection reset by peer)
13:22 🔗 bluefoo has joined #archiveteam-bs
13:46 🔗 kiiwii has quit IRC (Quit: Konversation terminated!)
14:16 🔗 godane SketchCow: so i got some good news and bad news in my magazine finding
14:17 🔗 godane good news is i found a website called 1001mags.com that has tons of french magazines and some are very old
14:17 🔗 godane also most of the magazines are free
14:18 🔗 LowLevelM has joined #archiveteam-bs
14:18 🔗 godane the bad news is the pdfs are auto-generated to have their personal info put on the cover page, cause the only way to download these free magazines is to "buy" them
14:19 🔗 godane the magazines are still free, it's just done through their cart buying system
14:20 🔗 arkiver joepie91_: can you please ping me back?
14:23 🔗 superkuh_ has joined #archiveteam-bs
14:25 🔗 markedL godane that's often easy to remove with a pdf rewriter
14:29 🔗 godane i figure we just source the cover to remove, and re-add a clean cover to the pdf
14:29 🔗 godane markedL: you can download all pages at 750x
14:30 🔗 godane it just will be very small
14:30 🔗 godane vs pdf
14:32 🔗 markedL does the free one require a credit card on file?
14:32 🔗 godane no
14:32 🔗 godane no credit card required for this
14:36 🔗 Raccoon godane: is the pii on the cover page a watermark modified on the image, or just a text layer added to the PDF?
14:36 🔗 Raccoon *modifying the image
14:36 🔗 godane there is a white background then text
14:37 🔗 Raccoon try opening the PDF in a text editor to see if you can remove the line
14:37 🔗 Raccoon "oh, just delete line 14 from every PDF file"
14:39 🔗 SootBectr Example: https://i.imgur.com/fjYgkZo.png the first two pages could just be removed if they're all done like that
14:40 🔗 SootBectr as they're adverts.
14:40 🔗 markedL i can remove that as long as it's text and not an image
14:40 🔗 markedL is it selectable? for copy/paste
14:41 🔗 Raccoon oh that's definitely a text object slapped on there
14:42 🔗 Raccoon easy peasy 99% deletable
14:42 🔗 SootBectr Can you recommend a pdf editor/viewer for linux that lets you select text? I reinstalled this laptop recently and can't remember which one I used to use
14:43 🔗 Raccoon try Okular
14:43 🔗 Raccoon as for doing the work, it's probably easier via script
14:44 🔗 godane it's definitely text in it
14:44 🔗 SootBectr Thanks. Oh this one (Atril) does actually, it's just awkward to get that bit of it. Yes the PII is text.
14:44 🔗 Raccoon if it's consistent, it should be predictable to find and remove either by line number or substring match
14:45 🔗 markedL it's probably encoded strings, grep will tell you
14:46 🔗 Raccoon just don't break the PDF :)
14:47 🔗 godane pdfinfo of one my pdfs : https://pastebin.com/2tTCaDBk
14:47 🔗 godane it is encrypted
14:48 🔗 Raccoon gross. Okular has an option to ignore protection, you have to turn it on.
14:50 🔗 SootBectr This one begins with the title page and the PII box is located differently https://i.imgur.com/WcuJG0T.png
14:51 🔗 godane the place of that will be different
14:52 🔗 Raccoon unless it's just different x/y coords for the exact same element located similarly in the file
14:52 🔗 Raccoon i don't know about a tool for removing encryption from a pdf
14:52 🔗 Raccoon probably exists
14:55 🔗 godane i figured it out, maybe
14:55 🔗 godane using qpdf
14:56 🔗 godane qpdf -decrypt Air-le-Mag-101.pdf output.pdf
15:00 🔗 SootBectr Decrypted one and can see the PII is there as metadata too
15:01 🔗 Raccoon try reading that in a text editor that won't barf on binary content, to locate the element their script is inserting
15:08 🔗 SoraUta has quit IRC (Read error: Operation timed out)
15:14 🔗 SootBectr Here's what I have at the very beginning of the file, can't find any other occurrences of "204." or "archive" https://paste.ubuntu.com/p/nRrmcXRM7z/
15:16 🔗 markedL that's just metadata. it would be closer to the drawing sections
15:17 🔗 markedL if you want an easy job, order the same document with two different accounts, then diff the decrypted versions
15:17 🔗 SootBectr It's metadata, yes
15:17 🔗 markedL it's not going to be an ascii string but it will tell you where the changes are without having to understand the pdf language
15:18 🔗 Raccoon why won't it be an ascii string? seems their script is so dumb it doesn't even indent, it just injects print.
15:20 🔗 markedL https://blog.didierstevens.com/2008/05/19/pdf-stream-objects/
15:24 🔗 Raccoon i see. https://blog.didierstevens.com/2008/04/29/pdf-let-me-count-the-ways/
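
The two-copies diff markedL suggests takes only a few lines; a sketch, with hypothetical filenames:

    def diff_offsets(path_a, path_b):
        # List byte offsets where two decrypted copies of the same
        # issue differ; those point straight at the per-buyer data.
        with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
            a, b = fa.read(), fb.read()
        return [i for i in range(min(len(a), len(b))) if a[i] != b[i]]

    print(diff_offsets('issue-account1.pdf', 'issue-account2.pdf')[:20])
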
15:28 🔗 SootBectr godane: If you'd like to compare here's my encrypted file for http://fr.1001mags.com/magazine/douane-magazine (I figure best to give you encrypted in case there's any differences between our versions of qpdf) https://transfer.sh/tTNZu/Douane-Magazine-014.pdf
15:32 🔗 SootBectr Note that there's one small difference every time you qpdf -decrypt the same file, looks like a md5sum of something.
15:33 🔗 OrIdow6 has quit IRC (Quit: Leaving.)
15:34 🔗 godane so the md5sum is different every time you use qpdf -decrypt
15:36 🔗 OrIdow6 has joined #archiveteam-bs
15:38 🔗 godane SootBectr: looks like the md5sum is different with each decrypt
15:43 🔗 SootBectr Yes, there's a line in the file that changes every time you qpdf -decrypt
15:43 🔗 SootBectr I imagine it isn't important, just pointing it out to avoid confusion.
15:58 🔗 godane good news
15:58 🔗 godane the full cover is there
15:58 🔗 godane i was able to remove the white background but not the text using ghostscript
15:59 🔗 godane turns out that white background is a vector image
15:59 🔗 deevious has quit IRC (Read error: Connection reset by peer)
15:59 🔗 deevious has joined #archiveteam-bs
16:01 🔗 godane bad news is it removed other vector stuff also
16:09 🔗 SootBectr The metadata is easy to strip at least: exiftool -e -all:all="" file.pdf -o temp.pdf ; qpdf --linearize temp.pdf output.pdf
16:11 🔗 SootBectr The linearize step is necessary because exiftool's deletions are reversible
16:26 🔗 SootBectr godane: if you'd like to send me a file I can try comparing too.
16:26 🔗 SootBectr I suggest an encrypted source file.
16:34 🔗 jamiew has joined #archiveteam-bs
16:38 🔗 godane SootBectr: https://archive.org/details/CNEWS-Matin-2504
16:49 🔗 markedL can you get two different sourced copies of the same issue?
16:49 🔗 SootBectr Looks like this is the relevant section https://paste.ubuntu.com/p/7kFRsQ7dGT/
16:52 🔗 SootBectr There's loads of other FlateDecode occurrences though, I can't see a way to identify that one in particular.
16:53 🔗 SootBectr ...besides decoding it, of course.
16:53 🔗 godane maybe mess with it uncompressed
16:53 🔗 godane pdftk file.pdf output uncompress.pdf uncompress
16:55 🔗 SootBectr I did qpdf -qdf --object-streams=disable in.pdf out.pdf and that lets you read it all.
16:55 🔗 asdf0101 has quit IRC (Read error: Operation timed out)
16:56 🔗 markedL has quit IRC (Read error: Operation timed out)
17:16 🔗 superkuh_ has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
17:30 🔗 trc has joined #archiveteam-bs
17:31 🔗 markedL has joined #archiveteam-bs
17:31 🔗 asdf0101 has joined #archiveteam-bs
17:41 🔗 godane SootBectr: i'm making progress
17:42 🔗 godane i was able to remove my name and my email text
18:06 🔗 zerkalo has joined #archiveteam-bs
18:06 🔗 godane bad news is i may have to break up the pdf to do this right
18:06 🔗 godane this is cause when removing the text, page 2 becomes blank for some reason
18:07 🔗 godane so the theory is to make a cover pdf and a pages-2-to-end pdf
18:07 🔗 godane edit the cover pdf, then use pdfunite to combine the cover pdf and the pages-2-to-end pdf
18:15 🔗 SootBectr I suspect it can be done with a regex search and replace, but as I understand it you need to keep the string length the same - don't know if there's a go-to tool that doesn't trip up on binary files to do that?
18:18 🔗 SootBectr You can certainly use a hex editor and just blank out the strings
18:18 🔗 DogsRNice has joined #archiveteam-bs
18:18 🔗 godane fuck yes i got it : cat CNEWS-Matin-2504-cover.pdf | sed "/Length 372$/,/Length 768/d" > diff.pdf
18:23 🔗 SootBectr Won't that also delete any other sections that happen to be the same length?
18:27 🔗 godane that deletes everything between Length 372 and Length 768
18:27 🔗 godane so you'd need both Lengths to match for it to be a problem
18:28 🔗 godane also, making the cover its own pdf and just doing it on that limits that problem
18:39 🔗 katocala has joined #archiveteam-bs
18:40 🔗 katocala has left
18:48 🔗 Craigle has quit IRC (Quit: Ping timeout (120 seconds))
18:48 🔗 Craigle has joined #archiveteam-bs
18:52 🔗 schbirid has joined #archiveteam-bs
18:56 🔗 godane sadly the lengths are different in each pdf, maybe
19:02 🔗 SootBectr Perhaps the way to approach it is to find the block of stream ... endstream that contains the email address
19:10 🔗 dashcloud has joined #archiveteam-bs
19:12 🔗 Myself has quit IRC (Read error: Connection reset by peer)
19:16 🔗 Myself has joined #archiveteam-bs
19:32 🔗 godane so this works to remove the watermark: cat "cover.pdf" | sed "/Length 3[0-9][0-9]$/,/Length 768/d"
19:40 🔗 prq has quit IRC (Remote host closed the connection)
19:55 🔗 markedL was the goal here to remove the data, or prevent its render?
20:04 🔗 godane can it be prevented in render?
20:07 🔗 markedL mods which don't remove the data but prevent its render would be taking out the instructions to draw
20:09 🔗 jamiew has quit IRC (zzz)
20:10 🔗 tech234a has quit IRC (Quit: Connection closed for inactivity)
20:39 🔗 mtntmnky has quit IRC (Remote host closed the connection)
20:40 🔗 schbirid has quit IRC (Quit: Leaving)
20:40 🔗 mtntmnky has joined #archiveteam-bs
20:51 🔗 SootBectr I'm trying my hand at some python to remove it, so far I have it reading line by line, detecting the stream .. endstream blocks and writing out a file which is identical to source.
20:51 🔗 SootBectr Now to remind myself how to regex in pythonland
20:52 🔗 tech234a has joined #archiveteam-bs
20:53 🔗 godane the 'crappier' option could be this : cat "$cover" | sed "/Exemplaire strictement personnel/,/gmail.com/d"
20:54 🔗 godane that removes the text but there is still a white box where the text would be, so the cover is not fully unedited
20:55 🔗 godane at least this would mostly get it done 99% of the time, i would think
20:57 🔗 trc has quit IRC (Quit: Goodbye)
21:01 🔗 SootBectr I get an invalid file if I do that
21:02 🔗 markedL https://blog.didierstevens.com/programs/pdf-tools/
21:02 🔗 godane i have a very big script for this
21:02 🔗 godane that's just part of it
21:04 🔗 markedL https://github.com/pdfminer/pdfminer.six
21:04 🔗 godane 2nd, if you tried that on one of your pdfs it would just delete everything after "Exemplaire strictement personnel", cause you don't have a gmail.com to tell it to stop
21:08 🔗 SootBectr Oh I changed the email bit
21:12 🔗 BlueMax has joined #archiveteam-bs
21:13 🔗 godane SootBectr: did you fix it, or are you just saying it did that when it gave you the invalid file
21:36 🔗 Stiletto has quit IRC ()
21:37 🔗 Stiletto has joined #archiveteam-bs
21:37 🔗 Stiletto has quit IRC (Client Quit)
21:43 🔗 Stiletto has joined #archiveteam-bs
21:59 🔗 Stiletto has quit IRC ()
22:11 🔗 Stiletto has joined #archiveteam-bs
22:17 🔗 dashcloud godane: what are you trying to do exactly?
22:18 🔗 Raccoon dashcloud: removing PII tags. https://i.imgur.com/fjYgkZo.png
22:19 🔗 SoraUta has joined #archiveteam-bs
22:29 🔗 SootBectr godane: that gave me an invalid file. I have some python code that's successfully removing some regex matches now, will improve it a bit and share
22:31 🔗 Stiletto has quit IRC ()
22:31 🔗 SootBectr markedL: Thanks, I had a quick skim but couldn't see an option to save changes to a pdf, I'm sure the object parsing code would be useful though
22:32 🔗 Stiletto has joined #archiveteam-bs
22:36 🔗 Stiletto has quit IRC (Client Quit)
22:40 🔗 Stiletto has joined #archiveteam-bs
22:47 🔗 superkuh_ has joined #archiveteam-bs
22:53 🔗 jamiew has joined #archiveteam-bs
22:59 🔗 Zerote_ has joined #archiveteam-bs
23:04 🔗 Zerote has quit IRC (Read error: Operation timed out)
23:10 🔗 SootBectr godane: give this a spin https://paste.ubuntu.com/p/yg33z24DG9/
23:15 🔗 jamiew has quit IRC (zzz)
23:17 🔗 godane doesn't work at all
23:20 🔗 SootBectr I tested it on the file you gave me, oh were you deflating the streams with pdftk? maybe that affects it
23:25 🔗 godane SootBectr: my script : https://pastebin.com/KHzFvBq0
23:26 🔗 SootBectr It does, let me see why. Or you can try qpdf -qdf --object-streams=disable in.pdf out.pdf and then run the python
23:26 🔗 LowLevelM has quit IRC (Read error: Operation timed out)
23:29 🔗 godane it works after i did that
23:31 🔗 godane my script gets rid of the white box though (mostly)
23:32 🔗 markedL oh that flag makes this easy
23:32 🔗 godane my script also gets rid of the metadata
23:41 🔗 SootBectr Yeah I'd probably have that step in a shell script that runs the python afterwards
23:44 🔗 markedL I'm not sure how you're editing the files without updating the xref table
23:47 🔗 SootBectr I don't even know what an xref table is :)
23:48 🔗 markedL **** Error: An error occurred while reading an XREF table. **** Error: An error occurred while reading an XREF table.
23:48 🔗 markedL for my edits, haven't tried your edits yet
23:49 🔗 SootBectr What program is giving you that error, and how are you making edits?
23:49 🔗 oofdere has joined #archiveteam-bs
23:50 🔗 markedL ghostscript gives that error, and xpdf doesn't like it either. I deleted the strings' contents so that they're 0 bytes long. This moved the byte offsets those 2 programs were trying to follow
23:50 🔗 markedL i'll have to redo my edits so the byte offsets don't change
23:51 🔗 SootBectr Aha, I'm just counting the length of a regex match and replacing it with spaces
23:51 🔗 SootBectr with that number of spaces
23:51 🔗 markedL ah yes, that would preserve it
23:53 🔗 markedL ok, yeah spaces method works, which you knew
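
A minimal sketch of the spaces method as SootBectr describes it: every PII match is replaced by the same number of spaces, so no byte offset (and hence no xref table entry) moves. The regex is an assumed placeholder; the real script was pasted earlier in the log:

    import re

    # Assumed pattern: the watermark text starting with the known French
    # phrase, up to the end of a PDF literal string. Adjust as needed.
    PII = re.compile(rb'Exemplaire strictement personnel[^)]*')

    def blank_pii(path_in, path_out):
        with open(path_in, 'rb') as f:
            data = f.read()
        # Same-length replacement keeps the xref table's offsets valid.
        data = PII.sub(lambda m: b' ' * len(m.group(0)), data)
        with open(path_out, 'wb') as f:
            f.write(data)

    blank_pii('decrypted.pdf', 'clean.pdf')
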
23:54 🔗 markedL rectangle should be right around here
