Bugzilla – Full Text Bug Listing

Summary: | AUDIT-TRACKER: keylime: review TPM remote attestation framework | |
---|---|---|---
Product: | [Novell Products] SUSE Security Incidents | Reporter: | Marcus Meissner <meissner>
Component: | Audits | Assignee: | Matthias Gerstner <matthias.gerstner>
Status: | RESOLVED FIXED | QA Contact: | Security Team bot <security-team>
Severity: | Normal | |
Priority: | P5 - None | CC: | aplanas, security-team
Version: | unspecified | |
Target Milestone: | --- | |
Hardware: | Other | |
OS: | Other | |
Whiteboard: | | |
Found By: | --- | Services Priority: |
Business Priority: | | Blocker: | ---
Marketing QA Status: | --- | IT Deployment: | ---
Bug Depends on: | 1193997, 1193998, 1193999, 1194000, 1194002, 1194004, 1194005 | |
Bug Blocks: | | |
Description
Marcus Meissner
2021-10-18 09:26:18 UTC
package is in openSUSE:Factory: keylime

I'll look into this now.

1) Scope of Review
==================

I have been looking into the accessible code paths of all four main components of Keylime: the agent, registrar, verifier and tenant applications. The review was done against version 6.2.0. Version 6.2.1 is already available by now; from comparing the changes I expect that most findings from this report still apply to 6.2.1 as well. Any source code locations mentioned in this report refer to version 6.2.0.

2) Findings in the Agent Component
==================================

Security Related
----------------

### a) `check_mounted()` Function Logic can be Fooled by Unprivileged Mounts

The `check_mounted()` function in `secure_mount.py` attempts to make sure that a "secure" tmpfs is mounted at `/var/lib/keylime/secure`, used to store sensitive data that never gets written to disk. To do so the function parses the output of the `mount` utility to determine whether this file system is already mounted at the desired location.

Unprivileged users can perform certain mount operations, one of the most prominent examples being the `fusermount` setuid-root binary for mounting FUSE file systems. In view of this, parsing mount table output needs to be done with care. I described the basic issue previously in another report [1]. The following reproducer using `fusermount` shows the basic local attack vector:

    user$ export _FUSE_COMMFD=0
    user$ fusermount some/path/ -ononempty,fsname="tmpfs on /var/lib/keylime/secure"

This fools the parsing logic in `check_mounted()`, so the function assumes that the "secure" tmpfs is already mounted while it actually isn't. A local attacker can thus prevent this security feature from being effective, *if* the attacker manages to create such a mount entry before the `keylime_agent` runs. The attack vector can also be used for a local DoS against `keylime_agent` by claiming a different `fsname` than tmpfs: `check_mounted()` will throw an exception in this case and the agent won't start.

On a side note, calls to `secure_mount.mount()` are spread throughout the Keylime codebase (for example three times in `keylime_agent.py`, two times in `tpm_main.py` and two times in `ca_impl_cfssl.py`). There is no code to *clean up* this mount again, however, so it potentially leaves behind a stale mount after services are shut down. Furthermore, if multiple Keylime processes operate in parallel, this could result in a race condition where the "secure" tmpfs is mounted twice, in the worst case mounting a fresh tmpfs over previously stored content.

My recommendation is to parse the `/proc/self/mountinfo` pseudo file for mount table information instead; whitespace is specially encoded in this file. Furthermore, the responsibility for mounting and unmounting this file system should be more clearly defined during startup/shutdown of the processes, and maybe a reference counting / locking scheme should be used to prevent race conditions.
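To illustrate this recommendation, here is a minimal sketch of such a check based on `/proc/self/mountinfo`. This is not Keylime's actual code; the helper names are made up for this example:

    # Sketch: check whether a tmpfs is mounted at the secure directory by
    # parsing /proc/self/mountinfo, whose fields are octal-escaped by the
    # kernel (e.g. "\040" for a space), instead of parsing `mount` output.
    SECURE_DIR = "/var/lib/keylime/secure"

    def _unescape(field):
        # mountinfo escapes space, tab, newline and backslash as octal sequences
        for esc, char in ((r"\040", " "), (r"\011", "\t"), (r"\012", "\n"), (r"\134", "\\")):
            field = field.replace(esc, char)
        return field

    def is_secure_mounted(path=SECURE_DIR):
        # returns True if a tmpfs is mounted at `path`, raises if something
        # else is mounted there
        with open("/proc/self/mountinfo") as mountinfo:
            for line in mountinfo:
                fields = line.split()
                if _unescape(fields[4]) != path:
                    continue
                # the file system type follows the "-" separator field
                fs_type = fields[fields.index("-") + 1]
                if fs_type != "tmpfs":
                    raise RuntimeError(f"something other than a tmpfs is mounted on {path}")
                return True
        return False

Since the kernel escapes whitespace in these fields, forged `fsname` strings as in the reproducer above can no longer confuse the parser.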
[1]: https://www.openwall.com/lists/oss-security/2020/06/04/5

### b) Possible Information Leaks via Unauthenticated Agent Quote Interface

A TPM quote can be requested without authentication from the agent service via the network:

    $ curl "keyagent-host:9002/?api_version=500&quotes=myquote&nonce=mynonce"
    {
      "code": 200,
      "status": "Success",
      "results": {
        "quote": "<base64-data>",
        "hash_alg": "sha256",
        "enc_alg": "rsa",
        "sign_alg": "rsassa",
        "pubkey": "<PEM key>",
        "boottime": 1639999864
      }
    }

This exposes for example the *boottime* of the host the agent is running on, information that is not otherwise easily publicly available. Furthermore: could the contents of the TPM quote data also be interesting? Could it for example allow deductions about which kind of operating system kernel is running on the host?

I recommend to authenticate and cryptographically secure this agent interface in some way to prevent information leaks of this kind.

### c) Arbitrary Remote Code Execution via Unauthenticated Bootstrap Interface

It looks like it is possible to simply post arbitrary new values for the U and V key parts and to provide a new configuration payload to the agent, knowing only the agent's UUID. The agent's UUID can be public or semi-public information, for example when `agent_uuid=hostname` is configured. From the Keylime paper [2, section 3.2.2] it sounds like the UUID HMAC check is not considered a security feature but only a sanity check:

> This provides the node with a quick check to determine if Kb is correct.

When `extract_payload_script=true` (default) and `payload_script=autorun.sh` (default) are configured in `keylime.conf`, the provided payload will be unzipped and a potentially contained `autorun.sh` script will be executed with full root privileges. Attached you can find a reproducer script `post_key.py` that demonstrates the issue by creating a file `/tmp/evil` on the agent host, given only the agent hostname and UUID as input parameters.

Even if `payload_script` is disabled, the extraction of a ZIP file as _root_ might result in a remote root exploit by extracting files outside of the intended target directory. I did not test this variant of the attack vector, though. Furthermore, by providing a ZIP bomb as payload, the agent process can be subjected to a remote DoS through memory exhaustion.

Retrieving the full symmetric key previously stored in `/var/lib/keylime/secure/derived_tci_key` should not be possible this way, because the previous data is removed in `keylime_agent.py:242` when the bootstrap protocol is performed. A skillful attacker might, however, first compromise the agent node and then wait for the tenant to re-deploy the agent using authentic keys and payload. Should this happen, the attacker can obtain the secret symmetric key from the compromised agent node after all.

Similar to issue b), I recommend to authenticate and cryptographically secure this agent interface in some way to prevent these attacks. As a hotfix, disabling the relevant configuration features should at least prevent the remote code execution and memory exhaustion attack vectors. Setting non-predictable UUID values can also help (but one should also consider item 3.a in this context). Even then this interface still allows to disrupt the operational state of the agent host by simply overwriting its current configuration.
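Independently of the missing authentication, the ZIP handling itself could be made more defensive. The following is a minimal sketch under my own assumptions (the function name and size limit are made up, this is not Keylime code) of extracting an untrusted ZIP payload while rejecting path traversal and capping the uncompressed size:

    # Sketch: defensive extraction of an untrusted ZIP payload. Rejects members
    # that would escape the target directory and caps the declared total
    # uncompressed size; the limit below is an arbitrary example value.
    import os
    import zipfile

    MAX_UNCOMPRESSED = 16 * 1024 * 1024  # example budget: 16 MiB

    def extract_payload(zip_path, target_dir):
        with zipfile.ZipFile(zip_path) as zf:
            total = sum(info.file_size for info in zf.infolist())
            if total > MAX_UNCOMPRESSED:
                raise ValueError("payload exceeds the allowed uncompressed size")
            base = os.path.realpath(target_dir)
            for info in zf.infolist():
                dest = os.path.realpath(os.path.join(base, info.filename))
                if dest != base and not dest.startswith(base + os.sep):
                    raise ValueError(f"path traversal attempt: {info.filename}")
            zf.extractall(base)

Note that the declared member sizes can lie, so an even more robust variant would enforce the budget while streaming the decompressed data, but the sketch already covers the straightforward ZIP bomb and zip-slip cases.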
[2]: https://www.ll.mit.edu/sites/default/files/publication/doc/2018-04/2016_12_07_SchearN_ACSAC_FP.pdf

### d) Key Exchange and Bootstrap Protocol Susceptible to Replay Attacks

Authentic payloads passed from the tenant to the agent should be reasonably safe from attackers (when not considering issue c)), since the two halves of the symmetric key are encrypted using the per-agent-node RSA public key. The bootstrap protocol seems to be susceptible to certain replay attacks, however. Since the interface does not employ transport security, the bootstrap protocol can simply be recorded and replayed to activate an authentic configuration payload. This could e.g. be used by an attacker to activate an outdated or even insecure older configuration of the agent node.

Code Quality and other Concerns
-------------------------------

### e) Lacking Robustness of the REST API Interface

There exist various combinations of REST API requests that lead to exception backtraces in the agent's output when running it on the console. For example:

- `keylime_agent.py:87`: the challenge is extracted without checking whether it was supplied at all.
- `keylime_agent.py:103`: the nonce is extracted without checks; failure ends in an empty reply from the server.
- the `mask` parameter is not checked for being numeric and can trigger a `ValueError` in `check_mask()`.
- `partial=something` also triggers a `ValueError`.
- in the POST case a lot of parameters aren't properly validated either.

While these don't seem to pose security problems as such, a more defined way of processing the requests would benefit the overall stability and security of the service. This was particularly apparent in the agent REST interface, but some of the REST APIs in the other components suffer from similar issues.

### f) Quote Nonce is not Size Limited

In the REST API call to create a TPM quote in `keylime_agent.py:127` there is no size restriction on the nonce parameter. This data is turned into a hexdigest and passed directly on the command line of the `tpm2_quote` utility. This could hit command line length restrictions and cause the subprocess execution to fail. To prevent unexpected results, overly large parameters or non-alphanumeric contents should be rejected (a validation sketch follows after item 3.a below).

3) Findings in the Registrar Component
======================================

Security Related
----------------

### a) UUID of Agents is Received on Unprotected HTTP Interface

The registrar provides two separate HTTP interfaces, a TLS protected one and an unprotected one. Agent registration is part of the unprotected interface, and during registration the agent UUID is passed unencrypted (processed in `registrar_common.py:229`). This is not a security issue on its own, but it relates to issue 2.c, where knowledge of the UUID facilitates remote code execution on the agent nodes. It means that if an attacker can listen in on the registrar's agent registration communication, then even unpredictable agent UUIDs no longer hinder the attack described in issue 2.c. As outlined in 2.c, the UUID does not seem to have been thought of as a security property in the first place, so I see no urge to change anything here. Should the bootstrapping protocol get TLS protection, though, it could for completeness also make sense to protect this registrar interface in the same way.
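Coming back to item 2.f above: a minimal sketch of rejecting oversized or non-alphanumeric nonces before they reach the `tpm2_quote` command line. The helper name and the length limit are made up for illustration, this is not Keylime code:

    # Sketch: validate an untrusted nonce before building a tpm2_quote command
    # line. The maximum length is an arbitrary example value.
    MAX_NONCE_LEN = 128

    def validate_nonce(nonce):
        if not nonce or len(nonce) > MAX_NONCE_LEN:
            raise ValueError("nonce missing or too long")
        if not nonce.isalnum():
            raise ValueError("nonce contains non-alphanumeric characters")
        return nonce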
### b) Unsanitized UUID Passed on Unprotected HTTP Interface Facilitates Log Spoofing

Since the registrar's unprotected HTTP interface requires no authentication, anybody can post arbitrary agent registrations with arbitrary parameters. The agent ID (UUID) parameter is not sanitized in any way and is used unfiltered in log messages (e.g. `registrar_common.py:107`). As a result the agent ID parameter can be used to inject seemingly valid additional log lines that appear e.g. in `journalctl -u keylime_registrar.service`. The attached reproducer script `post_agent.py` can be used to demonstrate this:

    $ ./post_agent.py --host registrar-host --log-line "Please run rm -rf /* to protect your system"

In the journal we will then see:

    Dec 21 11:44:22 registrar-host keylime_registrar[1426]: 2021-12-21 11:44:22.281 - keylime.registrar - WARNING - POST for trusted-agent
    Dec 21 11:44:22 registrar-host keylime_registrar[1426]: 2021-12-21 11:44:22.931 - keylime.registrar - WARNING - Please run rm -rf /* to protect your system
    Dec 21 11:44:22 registrar-host keylime_registrar[1426]: 2021-12-21 11:44:22.940 - keylime.registrar - DEBUG - returning 400 response. [...]

Such log spoofing could be used to entice administrators into performing actions that are harmful or otherwise in the interest of an attacker. My recommendation is on the one hand to diligently sanitize untrusted input parameters (see the sketch at the end of this chapter), and on the other hand to consider authenticating this currently untrusted interface.

### c) Existing Agent Registrations can be Overwritten without Authentication

Since the agent registration interface is not authenticated, anybody can add new or overwrite existing agent registrations. These will not be "active"; they can only be activated by performing a "PUT" request that also provides an `auth_tag`. The unauthenticated overwrite nevertheless seems to allow disrupting the actual agent's trust state in certain situations. It might also be possible to actually overwrite and *activate* an existing agent while providing arbitrary "ip" and "port" parameters that e.g. the tenant application could evaluate and perform an outgoing connection to.

Code Quality and other Concerns
-------------------------------

- `registrar_common.py:549`: this busy-wait sleep loop to join threads is bad style, why not do blocking joins?
- is the `tpm2_makecredential` external program safe to process untrusted blob inputs? It is called in the registrar context in `registrar_common.py:289`.
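Regarding the input sanitization recommended in item b) above, here is a minimal sketch of neutralizing log injection before untrusted values reach the logger. The names and the length limit are illustrative, not Keylime code:

    # Sketch: remove control characters (most importantly newlines) from
    # untrusted input and bound its length before it is logged.
    MAX_LOGGED_LEN = 256  # arbitrary example limit

    def sanitize_for_log(value):
        cleaned = "".join(ch if ch.isprintable() else "?" for ch in str(value))
        return cleaned[:MAX_LOGGED_LEN]

    # usage sketch:
    # logger.warning("POST for %s", sanitize_for_log(agent_id))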
4) Findings in the Verifier Component
=====================================

Security Related
----------------

### a) Revocation Notifier Uses Fixed /tmp Path for UNIX Domain Socket

In *revocation_notifier.py* a fixed path in the world writable location */tmp/keylime.verifier.ipc* is used. The code (in this case the third party `zeromq` Python module) forcefully removes any file object found there earlier. Should the program be running as non-root, or if another local user simply places a *directory* at this location, then this serves as a local DoS attack against the revocation notifier process, because the socket cannot be created. This situation doesn't even seem to be noticed by the verifier main process, because the child process `broker_proc` is never waited on. A local attacker could even replace the "blocking" directory with his own UNIX domain socket later on and would then receive revocation events from invocations of the `notify()` function in the main verifier process. The full impact of this would have to be researched further; it looks like failed quote notifications would no longer be sent out.

I recommend to place UNIX domain sockets in a dedicated safe directory in /run that cannot be targeted with such attacks by other local users in the system.

### b) Get Quote Response Contains Possibly Untrusted ZIP Data

The verifier process periodically performs quote operations on registered agents. As part of this, `process_quote_response()` is called, which in turn calls `check_quote()` and finally `_tpm2_checkquote()`. In `tpm_main.py:1018` a couple of compressed data streams are uncompressed via `zlib.decompress()`. Since this processes possibly untrusted data - the verifier is attempting to verify the current trust status of the node, after all - it needs to be assumed that malicious data can also be supplied here. Therefore the question arises whether `zlib.decompress()` is robust against processing invalid data streams. One thing I already found out is that it is not robust against ZIP bombs, which will cause a memory exhaustion in the verifier process.

Code Quality and other Concerns
-------------------------------

### c) Lifetime of Global Variable `global_password` for SSL Private Keys

In the *ca_util.py* module the `global_password` global variable holds the cleartext passphrase for SSL private keys. This is global program state that should rather be coupled to the lifetime of some program object like the web server instance. As it stands, the password is possibly kept in memory longer than necessary and is also inherited by forked child processes that exist in the verifier context for zeromq or the tornado web service. Even though it is difficult in Python to make sure the password no longer resides in memory, there should still be a clear and as short as possible logical lifetime during which the password is available in the program.

### d) Risky `pkill -f cfssl` to Shutdown a Local cfssl Instance

In `ca_impl_cfssl.py:67` the external command `pkill -f cfssl` is used to try and terminate any already / still running cfssl instances in the system. Since Keylime runs as _root_ this is a risky operation that might conflict with other applications or services utilizing cfssl in the system, possibly leading to hard to find bugs. Instead, child process objects should be used for cfssl, and they should be used to terminate and wait for cfssl. A PID file in the file system could be used to avoid zombie processes from previous runs sticking around. Also signal handlers or program exit handlers can be used to kill child processes in case the program terminates unexpectedly.

Related to this cfssl handling is also `ca_impl_cfssl.py:71`, where a racy check is made to test whether the child process "immediately" returned. A little later a `time.sleep(0.2)` is added to further add "stability". This is no deterministic way to check for child process errors. Instead the child process's stdout should be inspected for proper startup messages; maybe cfssl provides further functionality that allows programmatic evaluation of its status.
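To illustrate the recommendation from item d), here is a minimal sketch of keeping a handle on the cfssl child process and shutting it down deterministically instead of using `pkill`. This is not the actual Keylime code and the cfssl command line shown is only an example:

    # Sketch: manage cfssl as a tracked child process instead of killing by name.
    import subprocess

    class CfsslInstance:
        def __init__(self, cmd=("cfssl", "serve", "-address", "127.0.0.1", "-port", "8888")):
            self.proc = subprocess.Popen(
                cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
            # read the first output line instead of sleeping a fixed time;
            # a real implementation would also apply a timeout here
            line = self.proc.stdout.readline()
            if self.proc.poll() is not None:
                raise RuntimeError(f"cfssl failed to start: {line!r}")

        def stop(self, timeout=5):
            # terminate and reap exactly the process we started
            self.proc.terminate()
            try:
                self.proc.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                self.proc.kill()
                self.proc.wait()

This way only the cfssl instance started by Keylime is ever terminated, and the shutdown can be combined with an exit handler so no stray processes are left behind.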
### e) Letting cfssl Locally Listen via HTTP

The interface towards the cfssl instance as used in `post_cfssl()` raises security questions. Currently the default listen address and port are 127.0.0.1:8888. This means that a malicious local user could bind their own service on this port, thereby performing a DoS against the verifier. If the local attacker manages to make the verifier code assume that cfssl started successfully anyway (see the shaky child process handling in item d)), then that local attacker could even hand out bogus certificates to the verifier, thereby undermining trust in the TLS setup.

Another issue is that if other users in the system have special capabilities like *CAP_NET_RAW*, then they could listen in on the unencrypted localhost communication towards the cfssl instance and obtain the certificates and private keys that the verifier (or other Keylime components) creates. For the generation of sensitive data like private key files it would be desirable to use an interface towards cfssl that is not accessible to other users in the system.

### f) Connection Reliability Towards Agent Quote API

In _cloud_verifier_tornado.py:719_ the quote requests against agents are carried out using the untrusted HTTP interface of the agent. The question comes to mind what happens if an attacker in the attached network manages to interrupt or hijack this communication. Could it be used as a DoS vector to make the verifier assume that verification failed, so that it revokes the verification state of agents without valid reason? Could it furthermore be used to *prevent* the revocation action from being carried out *on* the agent node itself? I am aware that this is a difficult technical problem to solve even with transport security enabled. It is an interesting angle to consider for a remote attestation solution, though, and should be thought through.

5) General Findings
===================

This section contains findings that apply to all Keylime components alike.

Security Related
----------------

### a) World-Readable keylime.conf Contains Potentially Sensitive Data

The configuration file `/etc/keylime.conf` is installed world-readable:

    $ ls -l /etc/keylime.conf
    -rw-r--r-- 1 root root 26770 Dec 16 14:54 /etc/keylime.conf

This is the case for installations performed manually via the provided `installer.sh` script as well as for the RPM packaging found in both the openSUSE Tumbleweed and Fedora 35 Linux distributions. Further distributions might be affected.

`keylime.conf` contains a lot of information, some of it sensitive, like the TPM ownership password (`tpm_ownerpassword`), TLS certificate private key passwords (`private_key_pw`, `registrar_private_key_pw`) or the database password for the registrar (`database_password`). Thus this is a local information leak, because arbitrary local users can obtain these passwords from the configuration file.

NOTE: Alberto (in CC), our SUSE packager, already fixed the permissions of the file for openSUSE Tumbleweed in the meantime.

My recommendation is to make this file only accessible to _root_ and to adjust all installation routines and possibly the documentation. Affected distributors should be notified about this. The Keylime code could also perform a sanity check of the permissions of the configuration file before reading it in (a minimal sketch of such a check follows after item b) below).

Design Related
--------------

### b) Lack of Privilege Separation

All Keylime services are currently designed to run as root all the time (except for testing purposes, see `REQUIRE_ROOT` in `config.py`). Only a few bits of the Keylime components should actually need root privileges, most notably the bootstrapping scripts in the agent component or the ability to bind privileged ports. Implementing a privilege separation approach would increase the defense in depth of Keylime considerably, preventing smaller security issues from quickly becoming severe.
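A minimal sketch of the configuration file permission check suggested in item a) above (the function name is made up, not part of Keylime):

    # Sketch: refuse to use a configuration file that is not root-owned or is
    # accessible to group or other users.
    import os
    import stat

    def check_config_permissions(path="/etc/keylime.conf"):
        st = os.stat(path)
        if st.st_uid != 0:
            raise RuntimeError(f"{path} is not owned by root")
        if st.st_mode & (stat.S_IRWXG | stat.S_IRWXO):
            raise RuntimeError(f"{path} is accessible by group/other users")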
Code Quality Concerns
---------------------

### c) Addition of `/usr/local/...` Directories to `PATH` and `LD_LIBRARY_PATH`

Code in `tpm_main.py` and `ca_impl_cfssl.py` adds directories from `/usr/local` to the binary and library search path environment variables used for executing child programs. This is done unconditionally and is unnecessary for RPM based installations. While it should be safe to assume that programs and libraries found in `/usr/local` are safe to run, it still adds a certain attack surface, should something be misconfigured within `/usr/local` that badly influences the execution of Keylime. My recommendation is to make the installation path better configurable for integrators and to _only_ add it at runtime if it is required and isn't already part of the respective environment variables.

### d) `umask()` Invocations with Unclear Responsibility

There are currently three `os.umask(0o077)` invocations, in `keylime_db.py`, `tpm_abstract.py` and `config.py`. These invocations happen somewhat randomly during process execution and the original umask value is never restored. This leaves the process's umask in a more or less undefined state for other parts of the program, depending on the exact program execution flow. My recommendation is to set a conservative umask once during program startup and to cleanly change and restore it temporarily where necessary.

### e) Uses of `os.makedirs()` and `os.chown()`

`config.ch_dir()` in conjunction with `config.chownroot()` causes an unknown number of new directories to be created (with mode `700`, which is good) and then calls `chown(0, 0)` on the final directory component. This should actually be an unnecessary operation, but *if* one of the upper directory components is owned by a non-root user for some reason, then this could allow for a race condition where the attacker controlled path is replaced by symlinks so that the `chown()` happens on other targets than intended. My recommendation is not to perform `chown()` operations at all without a good reason; where they are needed, it should be made sure that the directory tree is only _root_ controlled, and `lchown()` should be used to avoid symlink attacks at least for the final path component.

### f) Relying on the CWD to Lookup and Write out Data

Some code portions rely on the current working directory being correctly set, like the functions `__read_tpm_data()` and `__write_tpm_data()` in `tpm_abstract.py`. This means that if another program component decides to change the CWD, or an attack vector allows manipulating the CWD, then sensitive data will be written to or obtained from unsafe places. Since other aspects show that the current code base does not always have clear responsibilities for global process settings (see items d, e), the danger of this happening is tangible. I recommend not to use CWD relative path lookups but always absolute paths.

### g) Needless Re-opens of Temporary Files

There exist a couple of code sequences involving `tempfile.mkstemp()` in the files `tpm_main.py` and `vtpm_manager.py` like this:

    # write out the public EK
    efd, etemp = tempfile.mkstemp()
    ekFile = open(etemp, "wb")

This means that two file descriptors are created for each temporary file and the cleanup (closing, deletion) is also not straightforward. I recommend to use `with` statements and `NamedTemporaryFile` instead.
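As a sketch of the suggested pattern (illustrative only; the variable holding the EK data is a placeholder):

    # Sketch: one file object, automatic close, explicit lifetime.
    import tempfile

    ek_public_blob = b"<EK public data>"  # placeholder for the data written out

    with tempfile.NamedTemporaryFile(prefix="ek-", delete=False) as ek_file:
        ek_file.write(ek_public_blob)
        ek_path = ek_file.name  # hand ek_path to the external tool, unlink it afterwards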
### h) Remaining Uses of SHA-1

SHA-1 is becoming more and more questionable for a lot of cryptographic use cases. In the Keylime configuration we still have SHA-1 enabled by default in the `accept_tpm_hash_algs` setting. If no TPM 1.2 support is required, then this default could probably be hardened.

Another instance where SHA-1 is used is in `keylime/crypto.py` in the context of RSA key usage. It seems to be used only for padding, which should be okay, but just to get this question out of the way it could be considered to upgrade this to SHA-2 as well.

6) CVE Assignments
==================

I suggest to assign CVEs for the following issues:

- item 2.a: `check_mounted()` logic can be confused by local user mounts
- item 2.c: arbitrary remote code execution via unauthenticated agent bootstrap interface
- item 2.d: bootstrap protocol susceptible to replay attacks
- item 3.b: log spoofing in registrar due to unsanitized agent UUID
- item 4.a: revocation notifier uses fixed /tmp/keylime.verifier.ipc
- item 4.b: untrusted ZIP data can at least lead to memory exhaustion in the verifier
- item 5.a: world-readable keylime.conf

Possible candidates:

- item 2.b: information leaks via unauthenticated agent quote interface. If more concrete exploitation of this information leak can be thought of then this might also be worth a CVE. I only consider it minor for the time being, though.
- item 3.c: existing agent registrations can be overwritten. I'm not sure of the full impact.

Upstream development has been quite active over the last couple of weeks, also with support from Alberto. I reviewed the patches available so far and it looks like we're on a good way to finish this. Since a new TLS path towards the agent will be introduced, there will be a somewhat complex upgrade path to the to-be-released version. There is still no definite publication date as far as I know, but there was talk about releasing this by the end of the month.

The security advisories are public now in the upstream repository:

https://github.com/keylime/keylime/security/advisories

Maintenance process is complete, thus closing the tracker.