00:33 -- X-Scale` has joined #archiveteam-bs
00:37 -- X-Scale has quit IRC (Read error: Operation timed out)
00:37 -- X-Scale` is now known as X-Scale
00:39 <britmob> JAA: Did you fix the malformed WARC issue with qwarc?
00:40 <anarcat> what's qwarc
00:40 <britmob> https://github.com/JustAnotherArchivist/qwarc
00:40 <JAA> britmob: The partial records? Yes, that's fixed in 0.2.2.
00:41 <britmob> Perfect, thanks.
00:41 <JAA> Are you using qwarc?
00:41 <britmob> Occasionally
00:41 <JAA> Nice, you might be the only one. :-P
00:41 <britmob> hehe
00:42 <markedL> I ran it once, but didn't write the grab behavior
00:42 <britmob> qwarc/brozzler/grab-site is what I use most often
00:42 <britmob> Sometimes wpull.
00:43 <anarcat> so it's kind of this curl url-list.txt | qwarc kind of thing?
00:44 <JAA> Nope, not at all.
00:44 <JAA> Think of qwarc like a local version of the tracker.
00:45 <JAA> The work unit is an item, and each item fetches any number of things via HTTP requests.
00:45 <JAA> It's very low level. You have to write all of the retrieval stuff, recursion as desired, etc. yourself.
00:46 <britmob> Which is why I like it :)
00:46 <anarcat> so it's a dispatcher
00:47 <JAA> While it's possible to do what you suggest (one item per URL, no further processing like extraction of inline resources etc.), that would be quite inefficient and entirely blocked by SQLite lock contention.
00:49 <JAA> Here's an example of the code you'd need to write: https://transfer.notkiska.pw/p5U8I/storywars.py
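In case the paste above is gone: a minimal sketch of what a spec file along those lines can look like, pieced together from JAA's description. The attribute and method names follow qwarc 0.2.x as I understand it; treat them, and the site/URL, as assumptions rather than a verified API.

    import qwarc

    class Story(qwarc.Item):
        # One item = one unit of work, like a tracker item.
        itemType = 'story'

        @classmethod
        def generate(cls):
            # Yield the item values to process, e.g. a hypothetical ID range.
            yield from (str(i) for i in range(1, 1001))

        async def process(self):
            # Fetch whatever this item covers; qwarc records the raw
            # HTTP traffic into the WARC.
            await self.fetch('https://www.example.com/story/{}'.format(self.itemValue))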
00:50 <anarcat> okay, so it does have fetch primitives
00:51 <JAA> It's intentionally minimal to achieve very high request rates. Even with a shitty old i3-2130, I can easily do hundreds of requests per second, assuming the remote server lets me.
00:51 <anarcat> interesting
00:51 <anarcat> what's the http backend?
00:51 <anarcat> aiohttp?
00:51 <JAA> Yeah
00:52 <anarcat> figures
00:52 <JAA> A highly hacked version of it though. :-P
00:52 <anarcat> ouch
00:52 <anarcat> also figures :p
00:52 <JAA> aiohttp doesn't expose the raw data stream.
00:52 <anarcat> i wonder what's the entry point in storywars.py
00:52 <JAA> You run it like `qwarc storywars.py`.
00:53 <JAA> (Plus a bunch of options usually for concurrency etc.)
00:54 <anarcat> but how does qwarc know which classes to load
00:54 <JAA> qwarc.Item.__subclasses__() + recursion
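In plain Python, that discovery amounts to something like the following sketch (the qwarc.Item usage in the comment is illustrative):

    def all_subclasses(cls):
        # Depth-first walk of the class hierarchy; this is the
        # "__subclasses__() + recursion" part.
        for sub in cls.__subclasses__():
            yield sub
            yield from all_subclasses(sub)

    # e.g.: specs = list(all_subclasses(qwarc.Item))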
00:54 <anarcat> clever
00:55 <JAA> Which is actually a bit annoying because the subclass order is random.
00:55 <JAA> Though Python 3.7 should fix that. (I'm still running 3.6 on my main qwarc machine.)
00:55 <anarcat> sorted(subclasses)? :)
00:55 <anarcat> ah
00:55 <JAA> No, I'd like it in the order specified actually.
00:55 <anarcat> where is 3.6 from... debian has 3.5 or 3.7?
00:55 <JAA> But that's extremely tricky.
00:56 <anarcat> oic
00:56 <JAA> You need a metaclass to record the insertion order, because it's all stored in a dict internally.
00:56 <anarcat> but newer python dict objects preserve order now
00:56 <anarcat> iirc
00:57 <anarcat> brb
00:57 <JAA> Yeah, actually I was confusing that; it's been the case since Python 3.6, not 3.7. Not sure why the order is still random on 3.6 for me.
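For what it's worth, on Python 3.6+ the definition order can be recorded without a full metaclass; a sketch using __init_subclass__:

    class Item:
        # Subclasses register themselves at definition time, so the list
        # preserves the order they appear in the spec file, independent
        # of whatever order __subclasses__() happens to return.
        _registry = []

        def __init_subclass__(cls, **kwargs):
            super().__init_subclass__(**kwargs)
            Item._registry.append(cls)

    class A(Item): pass
    class B(Item): pass

    assert Item._registry == [A, B]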
00:57
🔗
|
JAA |
And yeah, 3.6 isn't in Debian package repos; I installed it with pyenv. |
01:00
🔗
|
JAA |
britmob: So what's been your experience with qwarc so far? |
01:01
🔗
|
britmob |
Well, I used it a few times like.. 2 months ago? Then I switched to my own scripts with wpull for websites that needed it. Otherwise, it's grab-site all the way. |
01:01
🔗
|
|
DigiDigi has quit IRC (Remote host closed the connection) |
01:01
🔗
|
britmob |
I appreciate the customizability but it's unneeded for me most of the time |
01:02
🔗
|
britmob |
Doesn't help my python isn't great either haha |
01:03
🔗
|
JAA |
Yeah, I rarely need all of it either. |
01:06
🔗
|
JAA |
I've been wanting to write a shitty recursive crawler with it. One that extracts hrefs, srcs, etc. using string processing (str.find et al.) and then somehow groups the found resources together to avoid the DB overhead. Because why not? :-P |
01:06
🔗
|
JAA |
For reasonably HTML standard compliant sites, it should probably work okay-ish. |
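A sketch of the parserless extraction JAA describes, assuming double-quoted, lowercase attributes (exactly the case variations complained about further down):

    def extract_hrefs(content: bytes):
        # Scan for href="..." with plain byte searches, no HTML parser.
        pos = 0
        while True:
            pos = content.find(b'href="', pos)
            if pos == -1:
                return
            start = pos + len(b'href="')
            end = content.find(b'"', start)
            if end == -1:
                return
            yield content[start:end]
            pos = end + 1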
01:08 <britmob> "Because why not" very much fits the theme here lol..
01:08 <JAA> :-)
01:09 <JAA> Another thing I'd like to do is couple it to snscrape.
01:11 <britmob> Oh, that's interesting. Hadn't seen that before.
01:12 <britmob> Gonna have to play with that later :P
01:12 <JAA> :-)
01:15 <JAA> Have fun!
01:18 <britmob> I plan to.
01:19 <anarcat> rewriting snscrape with qwarc would make sense in itself no?
01:19 <anarcat> otherwise plugging qwarc into chromium or some other headless parser would make sense as well
01:20 <JAA> snscrape is inherently unparallelisable since you only know the required pagination parameters after retrieving the previous page.
01:20 <anarcat> well you can still parallelize the fetches within that page
01:20 <JAA> So that would only make sense for library usage of multiple simultaneous scrapes.
01:21 <JAA> It doesn't request anything else though.
01:21 <JAA> In a coupled setup, it would make sense. For snscrape itself, not so much.
01:21 <anarcat> couldn't you parallelize fetching, say, all the tweets from the first page of a profile?
01:21 <anarcat> right
01:21 <anarcat> i meant rewrite, not couple :)
01:21 <JAA> The only benefit from rewriting snscrape on top of qwarc would be generating WARCs.
01:22 <anarcat> right
01:22 <JAA> Which is a nice benefit, but it can be achieved in easier ways (e.g. warcprox).
01:23 <JAA> But yes, a properly coupled setup could then fetch the individual post pages, images, videos, etc. in parallel to the scraping.
01:23 <JAA> It's just that this doesn't really belong to the intended use cases of snscrape, which is just extracting the relevant info from a feed.
01:23 <anarcat> ah right, i see what you mean
01:24 <anarcat> forgot that part of snscrape :)
01:25 <JAA> Regarding browsers and HTML parsers: that would completely destroy the main advantage of qwarc and the reason why I wrote it in the first place, efficiency/speed. HTML parsing entirely dominates wpull execution time, for example.
01:25 <anarcat> well it would be a sample plugin, not core qwarc
01:28 * anarcat thinking of chromebot
01:29 <JAA> Hmm, how would it be different from a MITM WARC-writing proxy?
01:29 <anarcat> i... don't know
01:30 <anarcat> would a mitm WARC-writing proxy feed URLs back into qwarc?
01:30 <JAA> I meant such a proxy with a headless browser (plus recursion logic).
01:40 <anarcat> no difference then i guess
01:56 -- OrIdow6 has joined #archiveteam-bs
02:49 -- DigiDigi has joined #archiveteam-bs
02:49 <JAA> The shitty recursive crawler with qwarc is a thing now. :-P
02:53 <JAA> I fully expect this to blow up in numerous ways if it's ever actually used though.
02:57 <anarcat> haha no way
03:04 <JAA> https://transfer.notkiska.pw/mCQbe/qwarc-recur-simple.py
03:07 <JAA> (Just in case it wasn't clear enough, no, you shouldn't ever use this. lol)
03:07 -- VADemon has quit IRC (Read error: Connection reset by peer)
03:09 -- VADemon has joined #archiveteam-bs
03:12 <britmob> What's that? Petition the IA to switch to qwarc?
03:14 <JAA> My announcement that I'll move ArchiveBot to this tomorrow. :-P
03:14 -- revi has quit IRC ()
03:15 -- revi has joined #archiveteam-bs
03:15 <anarcat> for hrefPos in qwarc.utils.find_all(content, b'href'):
03:16 <anarcat> whee
03:16 <JAA> :-)
03:16 <JAA> But it doesn't handle case variations. I wish HTML were stricter about how you have to write it.
03:18 <anarcat> if case variation is your only concern, you're in for a ride
03:19 <JAA> I know, but on the other hand, I'm not really writing a parser, just a shitty thing to extract stuff.
03:22 <JAA> So most of the weird edge and corner cases aren't that relevant here.
03:23 <JAA> The whitespace handling is obviously also annoying, but this is the hardest one for this particular purpose.
03:28 <JAA> Regex is sloooow, maybe .lower() is faster.
03:34 <JAA> Or rather, .translate() since I'm working with bytes.
03:38 <anarcat> which is much more reasonable anyways
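The .translate() idea in miniature: build a 256-byte lowercase table once and fold each buffer before searching. One C-level pass per buffer is the speed argument against a case-insensitive regex.

    # bytes.translate() with a precomputed table lowercases ASCII in one pass.
    LOWERCASE = bytes(range(256)).lower()

    def find_token(content: bytes, token: bytes) -> int:
        return content.translate(LOWERCASE).find(token)

    # find_token(b'<A HREF="/x">', b'href') -> 3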
03:41 -- cerca has quit IRC (Remote host closed the connection)
03:45 <JAA> What, you don't like to "parse" HTML with regex?
03:45 <JAA> Ok, here's a slightly saner version: https://transfer.notkiska.pw/QeN1G/qwarc-recur.py
03:47 <JAA> Performance is actually pretty decent at ~45 requests per second with a concurrency of 1.
03:56 <JAA> Anyway, this is way beyond -bs territory by now. I'm curious how far this concept can be taken, but let's do that in -dev.
03:56 <anarcat> i don't like to parse html period
03:56 <anarcat> good job
03:58 <kiiwii> When and where should I upload my archive of gopherholes? Like once I hit a certain amount, and should I update the archive every month or so?
03:59 <JAA> anarcat: By the way, regarding __subclasses__ order: https://bugs.python.org/issue17936#msg190005 :-|
04:00 <JAA> kiiwii: Can you do incremental archives, or do you have to regrab everything every time? But in general, I'd upload one complete archive to one item on the Internet Archive (with a sensible name and all the metadata you can add).
04:00 <JAA> If your method allows it, you can of course start uploading before it's done if the entire thing is too large to store at once, but that's probably not too relevant here since it's only ~4 million resources.
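The upload JAA describes maps onto the internetarchive Python package; the identifier, filename, and metadata below are invented placeholders, not anything kiiwii actually used.

    from internetarchive import upload

    # Hypothetical item; one complete grab goes into one IA item.
    upload(
        'gopherhole-archive-2019-12',
        files=['gopher-grab-2019-12.tar.gz'],
        metadata={
            'title': 'Gopherhole archive (December 2019)',
            'mediatype': 'data',
            'description': 'Crawl of roughly 4 million Gopher resources.',
        },
    )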
04:01 <kiiwii> It doesn't regrab everything each time, it grabs new files and any modified files.
04:02 <JAA> Cool. How does that work?
04:02 <JAA> I didn't see anything regarding modification timestamps or similar in the Gopher descriptions I skimmed over.
04:03 <kiiwii> the python script has certain commands that allow you to do it lol
04:03 <JAA> Which script is that?
04:06 <kiiwii> https://github.com/jnlon/gopherdl
04:07 <JAA> Hmm, I don't see an option for not redownloading already downloaded things?
04:07 <kiiwii> it doesn't by default, it says "not overwriting" and skips it
04:08 <JAA> Ah, ok, but then it still redownloads it, just doesn't write to disk.
04:08 <kiiwii> I believe so, yes
04:09 <JAA> And that's just clobbering, not checking whether the file contents have changed.
04:09 <kiiwii> The problem though is that some gopherholes like sdf.org or quux.org have so many directories that it errors out
04:09 <kiiwii> Maybe I'll learn python so I can fix that issue
04:09 <JAA> Mhm
04:10 <JAA> I also saw that it buffers the entire response in memory, so if there are any large files, that could also be a problem.
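Streaming would be cheap to add, since a Gopher fetch is just a TCP connection, a selector line, and a read until EOF. A minimal sketch (not gopherdl's actual code):

    import socket

    def gopher_fetch(host: str, selector: str, port: int = 70):
        # Send the selector, then yield the response in chunks instead of
        # holding the whole file in memory.
        with socket.create_connection((host, port)) as sock:
            sock.sendall(selector.encode('utf-8') + b'\r\n')
            while True:
                chunk = sock.recv(65536)
                if not chunk:
                    return
                yield chunk

    # with open('out.bin', 'wb') as f:
    #     for chunk in gopher_fetch('gopher.quux.org', '/'):
    #         f.write(chunk)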
04:15 <kiiwii> I thought that may have been the problem, since it changed how many directories it got to before shitting itself
04:21 -- odemgi_ has joined #archiveteam-bs
04:30 -- odemgi has quit IRC (Read error: Operation timed out)
04:44 -- tech234a has quit IRC (Quit: Connection closed for inactivity)
04:53 -- bluefoo_ has quit IRC (Quit: bluefoo_)
04:53 -- qw3rty2 has joined #archiveteam-bs
05:02 -- tech234a has joined #archiveteam-bs
05:04 -- qw3rty has quit IRC (Ping timeout: 745 seconds)
05:39 -- superkuh_ has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
05:51 <JAA> Apparently the NGINX forums at https://forum.nginx.org/ broke sometime in the last two months, throwing only a DB error now. If it comes back, might be a good idea to archive that.
05:57 -- bluefoo has joined #archiveteam-bs
05:58 <Terbium> Btw nginx offices have been raided by the police recently
05:59 <astrid> i hear they have a big lawsuit
06:00 <JAA> It broke in the last two days based on Google's cache, so yeah, possible it's somehow connected to the raids.
06:02 <astrid> nginx got bought by f5 about 8 months ago, and f5 doesn't really care about keeping historical stuff around
06:02 <Terbium> Yep, the buyout by f5 did not make people happy
06:02 <Frogging> Though the forums going down in the last 2 days, right after the raid...
06:03 <JAA> Yeah, ^
06:04 <JAA> Also, what's the relation between nginx.com and nginx.org again?
06:05 <JAA> I wonder if the forums' database is somehow involved in those copyright claims that triggered the raids.
06:05 <Frogging> com is the corporate/enterprise site, org is the open source project
06:05 <Frogging> I think.
06:07 <Frogging> The nginx project/corporate structure never did sit right with me and that's why I stopped using it
06:07 <Frogging> that and being based in Russia, where the kind of shit we just saw tends to happen a lot and due process is ignored when it's convenient
06:10 -- dewdrop has joined #archiveteam-bs
06:23 -- bluefoo has quit IRC (Read error: Operation timed out)
06:30 -- LowLevelM has quit IRC (Read error: Operation timed out)
06:31 -- LowLevelM has joined #archiveteam-bs
06:32 -- bluefoo has joined #archiveteam-bs
06:34 -- d5f4a3622 has quit IRC (Read error: Connection reset by peer)
06:51 -- d5f4a3622 has joined #archiveteam-bs
06:58 -- HP_Archiv has quit IRC (Quit: Leaving)
06:58
🔗
|
Ryz |
Oh hey, http://assemblergames.com/ now disappeared long after their expected death date~ |
07:16
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
07:18
🔗
|
Flashfire |
May I please have access restored to archivebot? I have some old webhosts I want to save some userpages of |
07:20
🔗
|
Flashfire |
Specifically that of Zoominternet as the company has since folded |
07:20
🔗
|
Flashfire |
And my SaveNow captures are only doing so much when the outlinks function keeps freezing |
07:23
🔗
|
|
VADemon has joined #archiveteam-bs |
07:25
🔗
|
|
m007a83 has joined #archiveteam-bs |
07:55
🔗
|
Ryz |
On Zoom Internet stuff, Flashfire - on what you gave me on a Google search term, being 'site:zoominternet.net' - something curious happened, |
07:56
🔗
|
Ryz |
There have been search results that have something like http://www.zoominternet.net/~tfm2006/ - but also stuff like http://users.zoominternet.net/~rdetoro/ |
07:57
🔗
|
Ryz |
It would appear that links like http://users.zoominternet.net/~tfm2006/ are also acceptable, being the same as http://www.zoominternet.net/~tfm2006/ - which may introduce some kind of friction on what to save first |
07:58
🔗
|
Flashfire |
I would go with users and then to be safe www |
08:00
🔗
|
Ryz |
Ah, userapge links under http://users.zoominternet.net/ appear more than just http://www.zoominternet.net/ |
08:01
🔗
|
Ryz |
Even more curious, is http://static-acs-24-144-176-47.zoominternet.net/ - which was found in the search results, but can't access it at all |
08:14
🔗
|
Flashfire |
Oh that ones easy Ryz those are actual websites hosted by the company |
08:41
🔗
|
|
killsushi has quit IRC (Quit: Leaving) |
08:41
🔗
|
Ryz |
Some more investigating, I came across something I never seen before, a 300 web code; I stumbled upon http://users.zoominternet.net/~rbtson/hitty.htm that came from me checking out http://users.zoominternet.net/~rbtson/chap59.htm - unsure if manually created or auto-generated at the time |
08:44
🔗
|
Ryz |
...Pondering whether it's better to run them individually still or run all of 'em as "!a <" |
08:59
🔗
|
Ryz |
Did some curious investigating Flashfire; so while http://www.zoominternet.net/~blown85z/ and http://users.zoominternet.net/~blown85z/ are acceptable, it appears that further into the userpages, it would have to use either of those two, |
08:59
🔗
|
Flashfire |
And now you see why I wanted these web spaces saved |
09:00
🔗
|
Ryz |
Oh no, I did a further check, it seems the two can be used interchangeably~ I somehow typed 'user' instead of 'users' as the sub-domain, |
09:00
🔗
|
Ryz |
The unfortunate thing is that there could be two types of links being used in one page |
09:01
🔗
|
|
Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat) |
09:02
🔗
|
Ryz |
I already check if the links are http://www.zoominternet.net/ links or http://users.zoominternet.net/ when checking the main userpages anyway~ |
09:06
🔗
|
Ryz |
The way you said that makes me uncertain of you s: |
09:10
🔗
|
Ryz |
Flashfire: ^ |
09:11
🔗
|
Flashfire |
No Sorry dude I meant that as I wanted them saved because some do have these variations. Web spaces like that are unstable at the best of times |
09:29
🔗
|
|
kiska has quit IRC (Remote host closed the connection) |
09:29
🔗
|
|
Flashfire has quit IRC (Remote host closed the connection) |
09:30
🔗
|
|
kiska has joined #archiveteam-bs |
09:30
🔗
|
|
Flashfire has joined #archiveteam-bs |
09:31
🔗
|
|
svchfoo3 sets mode: +o kiska |
09:31
🔗
|
|
svchfoo1 sets mode: +o kiska |
10:08 -- deevious has quit IRC (Remote host closed the connection)
10:20 -- tech234a has quit IRC (Quit: Connection closed for inactivity)
10:22 -- BlueMax has quit IRC (Read error: Connection reset by peer)
10:31 -- Craigle has joined #archiveteam-bs
10:31 -- deevious has joined #archiveteam-bs
11:15 -- zerkalo has quit IRC (Remote host closed the connection)
11:15 -- erin has quit IRC (Quit: WeeChat 2.5)
11:23 -- cerca has joined #archiveteam-bs
11:31 -- bluefoo has quit IRC (Ping timeout: 360 seconds)
11:45 -- tech234a has joined #archiveteam-bs
12:25 -- Jopik has quit IRC (Read error: Operation timed out)
13:12 -- LowLevelM has quit IRC (Read error: Connection reset by peer)
13:22 -- bluefoo has joined #archiveteam-bs
13:46 -- kiiwii has quit IRC (Quit: Konversation terminated!)
14:16 <godane> SketchCow: so i got some good news and bad news in my magazine finding
14:17 <godane> good news is i found a website called 1001mags.com that has tons of french magazines and some are very old
14:17 <godane> also most of the magazines are free
14:18 -- LowLevelM has joined #archiveteam-bs
14:18 <godane> the bad news is the pdfs are auto-generated to have the buyer's personal info put on the cover page, because the only way to download these free magazines is to "buy" them
14:19 <godane> the magazines are still free, it's just done through their cart buying system
14:20 <arkiver> joepie91_: can you please ping me back?
14:23 -- superkuh_ has joined #archiveteam-bs
14:25 <markedL> godane: that's often easy to remove with a pdf rewriter
14:29 <godane> i figure we just source the cover to remove and re-add a clean cover to the pdf
14:29 <godane> markedL: you can download all pages at 750x
14:30 <godane> it just will be very small
14:30 <godane> vs pdf
14:32 <markedL> does the free one require a credit card on file?
14:32 <godane> no
14:32 <godane> no credit card required for this
14:36 <Raccoon> godane: is the pii on the cover page a watermark modified on the image, or just a text layer added to the PDF?
14:36 <Raccoon> *modifying the image
14:36 <godane> there is a white background then text
14:37 <Raccoon> try opening the PDF in a text editor to see if you can remove the line
14:37 <Raccoon> "oh, just delete line 14 from every PDF file"
14:39 <SootBectr> Example: https://i.imgur.com/fjYgkZo.png the first two pages could just be removed if they're all done like that
14:40 <SootBectr> as they're adverts.
14:40 <markedL> i can remove that as long as it's text and not an image
14:40 <markedL> is it selectable? for copy/paste
14:41 <Raccoon> oh that's definitely a text object slapped on there
14:42 <Raccoon> easy peasy 99% deletable
14:42 <SootBectr> Can you recommend a pdf editor/viewer for linux that lets you select text? I reinstalled this laptop recently and can't remember which one I used to use
14:43 <Raccoon> try Okular
14:43 <Raccoon> as for doing the work, it's probably easier via script
14:44 <godane> there's definitely text in it
14:44 <SootBectr> Thanks. Oh this one (Atril) does actually, it's just awkward to get that bit of it. Yes the PII is text.
14:45 <Raccoon> if it's consistent, it should be predictable to find and remove either by line number or substring match
14:46 <markedL> it's probably encoded strings, grep will tell you
14:47 <Raccoon> just don't break the PDF :)
14:47 <godane> pdfinfo of one of my pdfs: https://pastebin.com/2tTCaDBk
14:48 <godane> it is encrypted
14:50 <Raccoon> gross. Okular has an option to ignore protection, you have to turn it on.
14:51 <SootBectr> This one begins with the title page and the PII box is located differently https://i.imgur.com/WcuJG0T.png
14:52 <godane> the placement of that will be different
14:52 <Raccoon> unless it's just different x/y coords for the exact same element located similarly in the file
14:52 <Raccoon> i don't know about a tool for removing encryption from a pdf
14:55 <Raccoon> probably exists
14:55 <godane> i figured it out, maybe
14:56 <godane> using qpdf
15:00 <godane> qpdf -decrypt Air-le-Mag-101.pdf output.pdf
15:01 <SootBectr> Decrypted one and can see the PII is there as metadata too
15:08 <Raccoon> try reading that in a text editor that won't barf on binary content, to locate the element their script is inserting
15:14 -- SoraUta has quit IRC (Read error: Operation timed out)
15:16 <SootBectr> Here's what I have at the very beginning of the file, can't find any other occurrences of "204." or "archive" https://paste.ubuntu.com/p/nRrmcXRM7z/
15:17 <markedL> that's just metadata. it would be closer to the drawing sections
15:17 <markedL> if you want an easy job, order the same document with two different accounts, then diff the decrypted versions
15:17 <SootBectr> It's metadata, yes
15:18 <markedL> it's not going to be an ascii string, but it will tell you where the changes are without having to understand the pdf language
15:20 <Raccoon> why won't it be an ascii string? seems their script is so dumb it doesn't even indent, it just injects print.
15:24 <markedL> https://blog.didierstevens.com/2008/05/19/pdf-stream-objects/
15:28 <Raccoon> i see. https://blog.didierstevens.com/2008/04/29/pdf-let-me-count-the-ways/
15:32 <SootBectr> godane: If you'd like to compare, here's my encrypted file for http://fr.1001mags.com/magazine/douane-magazine (I figure best to give you encrypted in case there's any differences between our versions of qpdf) https://transfer.sh/tTNZu/Douane-Magazine-014.pdf
15:33 <SootBectr> Note that there's one small difference every time you qpdf -decrypt the same file, looks like an md5sum of something.
15:34 -- OrIdow6 has quit IRC (Quit: Leaving.)
15:36 <godane> so the md5sum is different every time you use qpdf -decrypt
15:38 -- OrIdow6 has joined #archiveteam-bs
15:43 <godane> SootBectr: looks like the md5sum is different with each decrypt
15:43 <SootBectr> Yes, there's a line in the file that changes every time you qpdf -decrypt
15:58 <SootBectr> I imagine it isn't important, just pointing it out to avoid confusion.
15:58 <godane> good news
15:58 <godane> the full cover is there
15:59 <godane> i was able to remove the white background but not the text using ghostscript
15:59 <godane> turns out that white background is a vector image
15:59 -- deevious has quit IRC (Read error: Connection reset by peer)
16:01 -- deevious has joined #archiveteam-bs
16:09 <godane> bad news is it removed other vector stuff also
16:11 <SootBectr> The metadata is easy to strip at least: exiftool -e -all:all="" file.pdf -o temp.pdf ; qpdf --linearize temp.pdf output.pdf
16:26 <SootBectr> The linearize step is necessary because exiftool's deletions are reversible
16:26 <SootBectr> godane: if you'd like to send me a file I can try comparing too.
16:34 <SootBectr> I suggest an encrypted source file.
16:38 -- jamiew has joined #archiveteam-bs
16:49 <godane> SootBectr: https://archive.org/details/CNEWS-Matin-2504
16:49 <markedL> can you get two differently sourced copies of the same issue?
16:52 <SootBectr> Looks like this is the relevant section https://paste.ubuntu.com/p/7kFRsQ7dGT/
16:53 <SootBectr> There's loads of other FlateDecode occurrences though, I can't see a way to identify that one in particular.
16:53 <SootBectr> ...besides decoding it, of course.
16:53 <godane> maybe mess with it uncompressed
16:55 <godane> pdftk file.pdf output uncompress.pdf uncompress
16:55 <SootBectr> I did qpdf -qdf --object-streams=disable in.pdf out.pdf and that lets you read it all.
16:56 -- asdf0101 has quit IRC (Read error: Operation timed out)
17:16 -- markedL has quit IRC (Read error: Operation timed out)
17:30 -- superkuh_ has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
17:31 -- trc has joined #archiveteam-bs
17:31 -- markedL has joined #archiveteam-bs
17:41 -- asdf0101 has joined #archiveteam-bs
17:42 <godane> SootBectr: i'm making progress
18:06 <godane> i was able to remove my name and my email text
18:06 -- zerkalo has joined #archiveteam-bs
18:06 <godane> bad news is i may have to break up the pdf to do this right
18:07 <godane> this is because, when removing the text, page 2 becomes blank for some reason
18:07 <godane> so the theory is to make a cover pdf and a pages-2-to-end pdf
18:15 <godane> edit the cover pdf, then use pdfunite to combine the cover pdf and the 2-to-end pdf
18:18 <SootBectr> I suspect it can be done with a regex search and replace, but as I understand it you need to keep the string length the same - don't know if there's a go-to tool that doesn't trip up on binary files to do that?
18:18 <SootBectr> You can certainly use a hex editor and just blank out the strings
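A sketch of that fixed-length replacement in Python (near the end of the log, SootBectr confirms the replace-with-spaces approach works). The regex is a guess at the watermark string and would need adjusting to the real files:

    import re
    import sys

    # Hypothetical pattern: a PDF literal string containing the watermark
    # text. Adjust to whatever the decrypted file actually contains.
    PII = re.compile(rb'\((?:[^()\\]|\\.)*Exemplaire strictement personnel(?:[^()\\]|\\.)*\)')

    def blank(match):
        # Same byte length in, same byte length out, so the byte offsets
        # in the xref table stay valid.
        return b'(' + b' ' * (len(match.group(0)) - 2) + b')'

    with open(sys.argv[1], 'rb') as f:
        data = f.read()
    with open(sys.argv[2], 'wb') as f:
        f.write(PII.sub(blank, data))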
18:18 -- DogsRNice has joined #archiveteam-bs
18:23 <godane> fuck yes i got it: cat CNEWS-Matin-2504-cover.pdf | sed "/Length 372$/,/Length 768/d" > diff.pdf
18:27 <SootBectr> Won't that also delete any other sections that happen to be the same length?
18:27 <godane> that deletes everything between Length 372 and Length 768
18:28 <godane> so both Lengths would have to match for it to be a problem
18:39 <godane> also, making the cover its own pdf and just doing it on that limits the problem
18:40 -- katocala has joined #archiveteam-bs
18:48 -- katocala has left
18:48 -- Craigle has quit IRC (Quit: Ping timeout (120 seconds))
18:52 -- Craigle has joined #archiveteam-bs
18:56 -- schbirid has joined #archiveteam-bs
19:02 <godane> sadly the lengths may be different in each pdf
19:10 <SootBectr> Perhaps the way to approach it is to find the block of stream ... endstream that contains the email address
19:12 -- dashcloud has joined #archiveteam-bs
19:16 -- Myself has quit IRC (Read error: Connection reset by peer)
19:32 -- Myself has joined #archiveteam-bs
19:40 <godane> so this works in removing the watermark: cat "cover.pdf" | sed "/Length 3[0-9][0-9]$/,/Length 768/d"
19:55 -- prq has quit IRC (Remote host closed the connection)
20:04 <markedL> was the goal here to remove the data, or prevent its render?
20:07 <godane> can it be prevented in render?
20:09 <markedL> mods which don't remove the data but prevent its render would be taking out the instructions to draw
20:10 -- jamiew has quit IRC (zzz)
20:39 -- tech234a has quit IRC (Quit: Connection closed for inactivity)
20:40 -- mtntmnky has quit IRC (Remote host closed the connection)
20:40 -- schbirid has quit IRC (Quit: Leaving)
20:51 -- mtntmnky has joined #archiveteam-bs
20:51 <SootBectr> I'm trying my hand at some python to remove it, so far I have it reading line by line, detecting the stream .. endstream blocks and writing out a file which is identical to source.
20:52 <SootBectr> Now to remind myself how to regex in pythonland
20:53 -- tech234a has joined #archiveteam-bs
20:54 <godane> the 'crappier' option could be this: cat "$cover" | sed "/Exemplaire strictement personnel/,/gmail.com/d"
20:55 <godane> that removes the text, but there is still a white box where the text would be, so the cover is not fully unedited
20:57 <godane> at least this would mostly get it done 99% of the time i would think
21:01 -- trc has quit IRC (Quit: Goodbye)
21:02 <SootBectr> I get an invalid file if I do that
21:02 <markedL> https://blog.didierstevens.com/programs/pdf-tools/
21:04 <godane> i have a very big script for this
21:04 <godane> that's just part of it
21:08 <markedL> https://github.com/pdfminer/pdfminer.six
21:12 <godane> also, if you tried that on one of your pdfs, it would just delete everything after "Exemplaire strictement personnel" because you don't have a gmail.com address to tell it to stop
21:13 <SootBectr> Oh I changed the email bit
21:36 -- BlueMax has joined #archiveteam-bs
21:37 <godane> SootBectr: did you fix it, or are you just saying it did that when it gave you the invalid file?
21:37 -- Stiletto has quit IRC ()
21:43 -- Stiletto has joined #archiveteam-bs
21:59 -- Stiletto has quit IRC (Client Quit)
22:11 -- Stiletto has joined #archiveteam-bs
22:17 -- Stiletto has quit IRC ()
22:18 -- Stiletto has joined #archiveteam-bs
22:19 <dashcloud> godane: what are you trying to do exactly?
22:29 <Raccoon> dashcloud: removing PII tags. https://i.imgur.com/fjYgkZo.png
22:31 -- SoraUta has joined #archiveteam-bs
22:31 <SootBectr> godane: that gave me invalid. I have some python code that's successfully removing some regex matches now, will improve it a bit and share
22:32 -- Stiletto has quit IRC ()
22:36 <SootBectr> markedL: Thanks, I had a quick skim but couldn't see an option to save changes to a pdf, I'm sure the object parsing code would be useful though
22:40 -- Stiletto has joined #archiveteam-bs
22:47 -- Stiletto has quit IRC (Client Quit)
22:53 -- Stiletto has joined #archiveteam-bs
22:59 -- superkuh_ has joined #archiveteam-bs
23:04 -- jamiew has joined #archiveteam-bs
23:10 -- Zerote_ has joined #archiveteam-bs
23:15 -- Zerote has quit IRC (Read error: Operation timed out)
23:17 <SootBectr> godane: give this a spin https://paste.ubuntu.com/p/yg33z24DG9/
23:20 -- jamiew has quit IRC (zzz)
23:25 <godane> doesn't work at all
23:26 <SootBectr> I tested it on the file you gave me. Oh, were you deflating the streams with pdftk? Maybe that affects it
23:26 <godane> SootBectr: my script: https://pastebin.com/KHzFvBq0
23:29 <SootBectr> It does, let me see why. Or you can try qpdf -qdf --object-streams=disable in.pdf out.pdf and then run the python
23:31 -- LowLevelM has quit IRC (Read error: Operation timed out)
23:32 <godane> it works after i did that
23:32 <godane> my script gets rid of the white box though (mostly)
23:41 <markedL> oh that flag makes this easy
23:44 <godane> my script also gets rid of the metadata
23:44 <SootBectr> Yeah, I'd probably have that step in a shell script that runs the python afterwards
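That wrapper might look like the following sketch, with strip_pii.py standing in for a space-padding script like the one sketched earlier; filenames are placeholders, and the qpdf flags are the ones already used in this log:

    #!/bin/sh
    # Expand object streams so the strings are visible to a byte-level
    # pass, scrub the PII, then write a normal PDF again.
    qpdf --decrypt --qdf --object-streams=disable "$1" expanded.pdf
    python3 strip_pii.py expanded.pdf scrubbed.pdf
    qpdf --linearize scrubbed.pdf "$2"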
23:47 <markedL> I'm not sure how you're editing the files without updating the xref table
23:48 <SootBectr> I don't even know what an xref table is :)
23:48 <markedL> **** Error: An error occurred while reading an XREF table. **** Error: An error occurred while reading an XREF table.
23:49 <markedL> for my edits, haven't tried your edits yet
23:49 <SootBectr> What program is giving you that error, and how are you making edits?
23:50 -- oofdere has joined #archiveteam-bs
23:50 <markedL> ghostscript gives that error, and xpdf doesn't like it either. I deleted the strings' contents so that they're 0 bytes long. this moved the byte offsets those 2 programs were trying to follow
23:51 <markedL> i'll have to redo my edits so the byte offsets don't change
23:51 <SootBectr> Aha, I'm just counting the length of a regex match and replacing it with spaces
23:51 <SootBectr> with that number of spaces
23:53 <markedL> ah yes, that would preserve it
23:54 <markedL> ok, yeah spaces method works, which you knew
23:54 <markedL> rectangle should be right around here