Converting untrusted PDFs into trusted ones: The Qubes Way (2013)
by transpute
Related:
https://github.com/freedomofpress/dangerzone
> Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs.
From the about page:
> Dangerzone was inspired by TrustedPDF but it works in non-Qubes operating systems, which is important, because most of the journalists I know use Macs and probably won’t be jumping to Qubes for some time.
> It uses gVisor sandboxes running in Linux containers to open dangerous documents, instead of virtual machines. And it also adds some features that TrustedPDF doesn’t have: it works with any office documents, not just PDFs; it uses optical character recognition (OCR) to make the safe PDF have a searchable text layer; and it compresses the final safe PDF.
Previously (announcement and details of gVisor sandboxing etc):
Safe Ride into the Dangerzone: Reducing Attack Surface with GVisor
I appreciate this! It would add another attack vector, but I could see the utility in SaaS'ifying this for servers to convert user-uploaded content on-the-fly.
This looks better than q(ubes)-pdf, but still not ideal.
Seems that PyMuPDF is used with a fixed (single pathname) "/tmp/input_file" ? https://github.com/freedomofpress/dangerzone/blob/main/dange...
Everything else is tossed through LibreOffice.
Meanwhile what I'd prefer for PDFs is some allow-listed set of 'safe' PDF operations (layout image, layout text) to be used with sanitized inputs (no underflows, overflows, corruption, etc), and the results of any at snapshot-runtime code evaluated and then flattened out to a safe element. Image OCR could be run atop that.
Similarly it'd be nice if a filter like that existed for the other documents, but as an individual contributor I don't have the human power to keep up with that goal and would take the same low hanging fruit worse but secure output route.
> allow-listed set of 'safe' PDF operations (layout image
That's the problem right there. PDF supports many image formats, including ones that are useful but you may have never heard of like JBIG2 for scanned documents. And the parser for those image formats needed to be secure as well. One very famous exploit is just exploiting JBIG2 (among other things): https://googleprojectzero.blogspot.com/2021/12/a-deep-dive-i...
You could simply not support those obscure image formats?
And when a journalist receives a tip from a reliable source they can't open with their secure PDF tool, they are more likely to do what? Disregard the info, or open the PDF in an unsafe tool?
That's not an option. JBIG2 is objectively the best format for compressing black-and-white scanned images. The Adobe Acrobat software has an "optimize scanned PDF" feature that produces this format. It's not obscure at all.
That objectively best format that was banned in some areas due to replacing important numbers in accounting documents? But if it's the default in Acrobat, yeah, likely not obscure enough
In terms of threat model, what is the problem with MuPDF in gVisor (a very tight sandbox)? Obviously, a memory-safe language would be ideal, but there's nothing fundamentally wrong with the approach.
Someone could potentially use the underlying tools directly in their own needs workflow, or an expert in those tools could mention if there's a vulnerability / common mistake pattern.
A single fixed path for the temporary input file means only one copy of the converter can run at a time (unless it's in a sandbox, which wasn't clear from the context). A PID-specific temp file or, better, a standard 'make a new temp file name so there's no collision' utility would allow parallel conversions.
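A minimal sketch of the "standard utility" suggestion, assuming Python. This is not Dangerzone's code, and `write_untrusted_input` is a hypothetical helper name; it only shows how `tempfile.mkstemp` avoids the name collision a fixed path would cause when conversions share one filesystem.

```python
import os
import tempfile


def write_untrusted_input(data: bytes) -> str:
    """Write untrusted bytes to a uniquely named temp file and return its path.

    Hypothetical helper for illustration only.
    """
    # mkstemp creates the file atomically under a fresh, unique name and
    # opens it readable/writable by the current user only, so two
    # concurrent conversions never share a path.
    fd, path = tempfile.mkstemp(prefix="input_", suffix=".pdf")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path
```

The caller is responsible for deleting the file (for example with `os.remove(path)`) once the conversion finishes.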
Hey, Dangerzone dev here. Just wanted to say that the temporary input file is actually within a sandbox - and there's one sandbox per file - so there's no collision with other files in the host.
> Someone could potentially use the underlying tools directly in their own needs workflow, or an expert in those tools could mention if there's a vulnerability / common mistake pattern.
Yep, that's something we cover in our about page. We process the untrusted document in a gVisor sandbox, that runs within a hardened Linux container, that runs within a Docker Desktop VM (in Windows/macOS). And yet, that doesn't mean that Dangerzone can't get hacked by a determined (possibly state-backed) attacker. So yeah, you have a point.
At the same time, using tools like this raises the "you must be THIS determined to enter" bar. This means attackers must spend much more money, time, expertise, 0-days, etc. These resources are finite, so more people are protected as a net result, even if we can't protect everyone. That's the way I see it at least.
very good answer :)
It looks like the qpdf-converter source, along with everything else, is now on Github according to the Developer / Source Code links on the site.
https://github.com/QubesOS/qubes-app-linux-pdf-converter
Their source code seems to take the most obvious path... flatten it to an image printout then possibly do more? https://github.com/QubesOS/qubes-app-linux-pdf-converter/blo... https://github.com/QubesOS/qubes-app-linux-pdf-converter/blo...
Though at a quick skim I can't see any OCR steps.
I was wondering that myself, but one of the downsides mentioned is that you lose text search, which seems to suggest that OCR is not being used.
This is a good approach. It’s an old design pattern in high-assurance systems, where a gateway converts things into a usable, safer form. Another concept, often called LANGSEC, is generating parsers from simple grammars that are hopefully bulletproof. These ideas can be combined.
Two more things can happen.
The increasing volume of memory-safe utilities means they can be used on one or both sides of this. That might prevent the exploit entirely. Even on a memory-safe CPU, isolation can still help in case of hardware failures (esp. bitflips).
It can also be used to boost performance in non-Qubes systems where a secure (or OSS) processor is in use. Those are often slower than commodity CPUs. So, one can use disposable VMs on commodity CPUs to filter data (blocking most attacks), transform it, and send it over a simple wire protocol. Commodity VMs might also present it back to the user in dressed-up form.
Outside of security, a long time ago, they were doing similar things to decrease latency and boost bandwidth on Beowulf clusters. A team made Fast (or Active?) Messages to eliminate TCP/IP as a bottleneck. So, sometimes a security technique can also be a performance booster.
I haven't used this in a few years since switching off of Qubes but something no one mentioned is that output PDFs are *huge*. They're practically PNGs with a .PDF extension in terms of size.
I love the idea of making PDFs dumber and safer but maybe ePub would fit the bill? I'm just thinking out loud, I would like to do this again, but the Qubes way of spinning up a disposable VM to produce a monster PDF file is unsatisfying. More generally, Qubes being slow was a big reason I switched off of it.
Related:
Converting untrusted PDFs into trusted ones: The Qubes Way (2013) - https://news.ycombinator.com/item?id=10538888 - Nov 2015 (5 comments)
I'm curious, we use many PDF parsing and formatting tools as part of an ML ingest pipeline. Our goal is to keep the document as close as possible to the original in meaning, but remove unwanted junk and simplify the document by removing everything non-content related or converting it to text the ML can work with.
Surely you can do that instead? Parse the PDFs and format them in basic ways without support for "extensions" or anything. Let the user read that before using the "real" document with extensions potentially enabled.
then you at least need a pdf reader that implements that, and have to be sure that the parsing you do have cannot be exploited, while still giving a useful representation. This might be easier for ML where you don't care about visual display, but a human generally doesn't want to read raw, unformatted text. And a surprising amount of stuff is probably needed for a half-way decent visual display.
I do view the documents we format after they have gone through the processing stage. They seem to be the same in most ways I would care about. Diagrams are still present, etc. I don't know about PDFs that contain forms as these are not those kind of documents but closer to research documents.
Cute idea. Reminds me of the format conversions typically used to lessen the risk of steganography. But boy, the article took forever to get to the PDF -> RGB idea, talking about Simple Representation and everything in between. RGB has less complex parsing, ergo less attack surface.
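To make the "less complex parsing" point concrete, here is a minimal sketch of what the trusted side's parser for one page could look like, assuming Python. The header layout (`"WIDTH HEIGHT"` as ASCII decimals), the `MAX_DIM` cap, and the name `parse_rgb_page` are all assumptions for illustration, not the actual Qubes wire format.

```python
MAX_DIM = 10_000  # assumed upper bound on page width/height, in pixels


def parse_rgb_page(header: str, pixels: bytes) -> tuple[int, int, bytes]:
    """Validate a 'WIDTH HEIGHT' header against a raw RGB pixel buffer.

    Hypothetical format for illustration only.
    """
    parts = header.split()
    # Accept exactly two short runs of ASCII digits; no signs, no floats.
    if len(parts) != 2 or not all(
        p.isascii() and p.isdigit() and len(p) <= 5 for p in parts
    ):
        raise ValueError("header must be two small decimal integers")
    width, height = int(parts[0]), int(parts[1])
    if not (1 <= width <= MAX_DIM and 1 <= height <= MAX_DIM):
        raise ValueError("dimensions out of range")
    # Three bytes (R, G, B) per pixel; the buffer must match exactly.
    if len(pixels) != width * height * 3:
        raise ValueError("pixel buffer does not match declared size")
    return width, height, pixels
```

The whole parser is two integer checks and one length comparison, which is the attack-surface argument in a nutshell: there is no compression, no nested structure, and no optional feature for a malicious sender to abuse.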
PDFs are essentially a representation of PostScript. PostScript itself is expected to be run in a VM and relatively straightforward to be isolated/sandboxed.
So there is a certain sense of absurdity of needing to spin up an entire VM just to render a PDF. Running a standard PostScript renderer in a user executable (perhaps in a chroot jail to be a little bit paranoid) should be enough for safety. Or just stick it inside a Docker.
Restrict the permissions on the user process to “read my static data files like fonts” and “write output to this 1 file, or a parsing error to this other 1 file”.
PDFs are not simply a representation of postscript. Over the years Adobe has added all sorts of scripting and code execution environments, which are regularly exploited. It supports a lot of image formats, opening yet more attack vectors. It allows plugin integration with every major business app, another wide set of attack surfaces.
Here, for example, is but one of the scripting engines https://helpx.adobe.com/acrobat/using/applying-actions-scrip...
Here’s its JavaScript execution engine for embedding JavaScript in a pdf: https://opensource.adobe.com/dc-acrobat-sdk-docs/library/jsa...
Your belief is what makes them an excellent exploit deployment format. “What can go wrong? I know tech and they cannot be harmful.” Click.
> PostScript itself is expected to be run in a VM and relatively straightforward to be isolated/sandboxed.
Expected by who? It's been associated with security bugs for decades.
I wonder how Google handles this. Thousands of their software people will need to read PDFs from all over the web from work machines.
Are PDF parsers really so bad nowadays (this article is over 10 years old), that opening a PDF opens you up to vulnerabilities?
The author made this seem like such a fundamental issue. Is that because PDFs natively have support for say executing code (i doubt) or accessing the filesystem (i doubt), etc...
> Are PDF parsers really so bad nowadays (this article is over 10 years old), that opening a PDF opens you up to vulnerabilities?
Yep, here’s an Acrobat Reader release from two days ago that fixes two arbitrary code execution vulnerabilities since the previous one two months ago: https://helpx.adobe.com/security/products/acrobat/apsb24-92....
I haven’t looked into browser-embedded PDF viewers enough to know how they compare to other software – they’re definitely much safer than Acrobat and still not completely safe (e.g. CVE-2023-1530 in Chrome wasn’t that long ago) – but I would expect them to be at least as safe as other browser functionality.
> Is that because PDFs natively have support for say executing code (i doubt)
They do (https://helpx.adobe.com/ca/acrobat/using/applying-actions-sc..., including “Run a JavaScript”, although that has to be enabled), but indeed that’s not the one fundamental issue; it’s usually just standard vulnerabilities of memory unsafety or terrible design (XML).
Yeah, avoiding Adobe software is probably a mandatory first step in this context.
If I remember correctly Google bought a source code license from some Aussie company (?) for rendering PDFs in Chrome. That was like a decade ago though. I wonder what happened since. Probably lots.
Google bought a license from Foxit, a Chinese company. They've open sourced it since. Mozilla wrote their own and also open sourced it, PDF.js
I do believe Foxit didn't present itself as a Chinese company at the time. Not sure where I got the Aussie thing from. They do and/or did have a Melbourne office. Perhaps its address was used for marketing purposes at the time.
https://web.archive.org/web/20140529210328/http://www.foxits...
> Founded in 2001, Foxit is a leading software provider of solutions for reading, editing, creating, organizing, and securing PDF documents. Headquartered in Fremont, CA, USA, Foxit has operations worldwide in China, Belgium, Japan, and Taiwan
There's more details in the wikipedia talk page: https://en.wikipedia.org/wiki/Talk:Foxit_Software
Seems like most of their presence was in China, and they were domiciled in China, but they had sales "offices" in other countries and emphasized that part for better PR.
PDFs support JavaScript. Here's Adobe's guide on how to add JS to your PDFs: https://helpx.adobe.com/uk/acrobat/using/applying-actions-sc...
It might be in the PDF spec, but most viewers won't run JS.
The most widely used ones run JS. And the majority of PDF exploits are not JS, but a wide mix across all the things a PDF reader supports.
On non-Qubes systems, it doesn't take 5 seconds to launch a sandbox.
But the security is also incomparable.
Chrome
This seems like a lot of work. Surely you can just rewrite the PDF parser in Rust?
Many thanks for this HN moment comment.
I honestly can’t tell if you’re serious or not.
I can't either
Because that's less work? There are PDF parsers and renderers in Rust, but it's unclear what the goal is. There are also PDF parsers and renderers in JS/TS, etc.
isn't it possible to scan PDFs for unsafe code? Having a tool that makes docs safe from "now on" is semi useful, if you don't know whether you were already compromised.
This is addressed in the article...
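A small sketch of why keyword scanning is fragile, assuming Python. The function names and the keyword list are hypothetical, not taken from any real scanner. PDF names may spell characters as `#xx` hex escapes, so a byte-level search for `/JavaScript` misses `/J#61vaScript` even though a reader treats the two as the same name.

```python
import re

# Assumed, incomplete list of names a naive scanner might look for.
SUSPICIOUS = (b"/JavaScript", b"/JS", b"/OpenAction", b"/Launch")


def naive_scan(pdf_bytes: bytes) -> list[str]:
    """Return the suspicious names that appear literally in the raw bytes."""
    return [name.decode() for name in SUSPICIOUS if name in pdf_bytes]


def decode_name_escapes(pdf_bytes: bytes) -> bytes:
    """Expand #xx hex escapes everywhere.

    Crude on purpose: a real parser would only do this inside name tokens,
    and this would corrupt binary stream data.
    """
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        pdf_bytes,
    )
```

With `b"<< /S /J#61vaScript /J#53 (app.alert(1)) >>"` as input, `naive_scan` finds nothing until the escapes are decoded first. Even then, names inside compressed object streams stay invisible to a byte search, so a clean scan says little about whether a file is safe.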
What about PDF/A, archive version with restricted features?