Bug 1191739

Summary: AUDIT-TRACKER: keylime: review TPM remote attestation framework
Product: [Novell Products] SUSE Security Incidents Reporter: Marcus Meissner <meissner>
Component: AuditsAssignee: Matthias Gerstner <matthias.gerstner>
Status: RESOLVED FIXED QA Contact: Security Team bot <security-team>
Severity: Normal    
Priority: P5 - None CC: aplanas, security-team
Version: unspecified   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Bug Depends on: 1193997, 1193998, 1193999, 1194000, 1194002, 1194004, 1194005    
Bug Blocks:    

Description Marcus Meissner 2021-10-18 09:26:18 UTC
https://jira.suse.com/browse/SLE-18265

adds keylime to sles15-sp4
Comment 1 Marcus Meissner 2021-10-18 09:30:00 UTC
package is in openSUSE:Factory keylime
Comment 2 Matthias Gerstner 2021-12-09 10:25:21 UTC
I'll look into this now.
Comment 5 Matthias Gerstner 2021-12-22 10:50:49 UTC
 1) Scope of Review
 ==================
 
 I've been looking into the accessible code paths for all four main components
 of Keylime: the agent, registrar, verifier and tenant applications. I have
 been looking into version 6.2.0. Currently version 6.2.1 is already available.
 From comparing the changes I expect that most findings from this report still
 apply to version 6.2.1 as well. Any source code locations mentioned in this
 report relate to version 6.2.0.
 
 2) Findings in the Agent Component
 ==================================
 
 Security Related
 ----------------
 
 ### a) `check_mounted()` Function Logic can be Fooled by Unprivileged Mounts
 
 The `check_mounted()` function in `secure_mount.py` attempts to make sure
 that a "secure" tmpfs is mounted at `/var/lib/keylime/secure` to store
 sensitive
 data on that never gets written to disk. To do so the function parses the
 output of the `mount` utility to determine whether this file system is already
 mounted at the desired location.
 
 There can exist the possibility of unprivileged users performing certain mount
 operations, one of the most prominent examples being the `fusermount`
 setuid-root binary for mounting FUSE file systems. In view of this parsing
 mount table output needs prudence. I described the basic issue previously
 already in another report [1].
 
 The following is a reproducer using `fusermount` that shows the basic local
 attack vector:
 
     user$ export _FUSE_COMMFD=0
     user$ fusermount some/path/ -ononempty,fsname="tmpfs on
 /var/lib/keylime/secure"
 
 This will fool the parsing logic in `check_mounted()` and thus the function
 assumes that the "secure" tmpfs is already mounted, while it actually isn't.
 Thus this will allow a local attacker on the system to prevent this security
 feature to be effective, *if* the local attacker manages to create such a
 mount
 entry before the `keylime_agent` runs.
 
 The attack vector can also be used to perform a local DoS against
 `keylime_agent` by claiming a different `fsname` than tmpfs. `check_mounted()`
 will throw an Exception in this case and the agent won't start.
 
 On a side note there are calls to `secure_mount.mount()` spread throughout the
 Keylime codebase (for example three times in `keylime_agent.py`, two times in
 `tpm_main.py` and two times in `ca_impl_cfssl.py`. There is no code to *clean
 up* this mount again, however. So it potentially leaves behind a stale mount
 after services are shutdown. Furthermore if multiple Keylime processes should
 operate in parallel this could result in a race condition where the "secure"
 tmpfs is mounted twice, in the worst case mounting a fresh tmpfs over
 previously stored content there.
 
 My recommendation is to parse the `/proc/self/mountinfo` pseudo file for mount
 table information instead. Whitespace is specially encoded in this file.
 Furthermore the responsibility of mounting and unmounting this file system
 should be more clearly defined during startup/shutdown of processes and maybe
 a reference counting / locking scheme to prevent race conditions should be
 used.
 
 [1]: https://www.openwall.com/lists/oss-security/2020/06/04/5
 
 ### b) Possible Information Leaks via Unauthenticated Agent Quote Interface
 
 A TPM quote can be requested without authentication from the agent service via
 the network:
 
     $ curl "keyagent-host:9002/?api_version=500&quotes=myquote&nonce=mynonce"
     {
         "code": 200,
         "status": "Success",
         "results": {
             "quote": <base64-data>, "hash_alg": "sha256", "enc_alg": "rsa",
 "sign_alg": "rsassa",
             "pubkey": <PEM key>", "boottime": 1639999864
         }
     }
 
 This exposes for example the *boottime* of the host where the agent is
 running, information that is not otherwise easily publicly available.
 Furthermore: Could the contents of the TPM quote data also be interesting
 data? Could it for example allow deductions about which kind of operating
 system kernel is running on the host?
 
 I recommend to somehow authenticate and cryptographically secure this agent
 interface to prevent information leaks of this kind.
 
 ### c) Arbitrary Remote Code Execution via Unauthenticated Bootstrap Interface
 
 It looks like it is possible to simply post arbitrary new values for the U and
 V key parts and provide a new configuration payload to the agent, only knowing
 the agent's UUID. The agent's UUID can be public or semi-public information
 like when `agent_uuid=hostname` is configured. From the Keylime paper [2,
 section 3.2.2] it sounds like the UUID HMAC check is not considered a security
 feature but only a sanity check:
 
 > This provides the node with a quick check to determine if Kb is correct.
 
 When `extract_payload_script=true` (default) and `payload_script=autorun.sh`
 (default) are configured in `keylime.conf` then the provided payload will be
 unzipped and a potentially contained `autorun.sh` script is executed with full
 root privileges. Attached you can find a reproducer script `post_key.py` that
 demonstrates the issue by creating a file `/tmp/evil` on the agent host by
 only providing the agent hostname and UUID as input parameters.
 
 Even if `payload_script` is disabled then the extraction of a ZIP file as
 _root_ might result in a remote root exploit by extracting files outside of
 the intended target directory. I did not test this variant of the attack
 vector, though. Furthermore by providing a ZIP bomb as payload the agent
 process can be subjected to a remote DoS through memory exhaustion.
 
 Retrieving the full symmetric key previously stored in
 `/var/lib/keylime/secure/derived_tci_key` should not be possible this way,
 because when performing the bootstrap protocol, the previous data is removed
 in
 `keylime_agent.py:242`. A skillful attacker might attempt to first compromise
 the agent node and then wait for the tenant to re-deploy the agent using
 authentic keys and payload. Should this happen then the attacker can obtain
 the secret symmetric key from the compromised agent node after all.
 
 Similar to issue b) I recommend to somehow authenticate and cryptographically
 secure this agent interface to prevent these attacks. As a hotfix disabling
 the relevant configuration features should at least prevent the remote code
 execution and memory exhaustion attack vectors. Setting non-predictable UUID
 values can also help (but one should also consider item 3.a in this context).
 Even then this interface still allows to disrupt the operational state of the
 agent host by simply overwriting its current configuration.
 
 [3]:
 https://www.ll.mit.edu/sites/default/files/publication/doc/2018-04/2016_12_07_SchearN_ACSAC_FP.pdf
 
 ### d) Key Exchange and Bootstrap Protocol Susceptible to Replay Attacks
 
 Authentic payloads being passed from the tenant to the agent should be
 reasonably safe from attackers (when not considering issue c)), since the two
 halves of the symmetric key are encrypted using the per-agent node RSA public
 key. The bootstrap protocol seems to be susceptible to certain replay attacks,
 however. Since the interface does not employ transport security, the
 bootstrap protocol can simply be recorded and replayed to activate an
 authentic configuration payload. This could e.g. be used by an attacker to
 activate an outdated or even insecure older configuration of the agent node.
 
 Code Quality and other Concerns
 -------------------------------
 
 ### e) Lacking Robustness of the REST API Interface
 
 There exist various combinations of REST API requests that lead to exception
 backtraces in the agent's output when running it on the console. For example:
 
 - `keylime_agent.py:87`: challenge extracted without check whether it's
   supplied.
 - `keylime_agent.py:103`: nonce extracted without checks, failure ends in
   empty reply from server.
 - `mask` parameter is not checked for being numeric, can trigger a
   `ValueError` in `check_mask()`.
 - `partial=something` also triggers a `ValueError`.
 - also in the POST case a lot of parameters aren't properly validated.
 
 While these don't seem to pose any security problems as such it seems a more
 defined way of executing the requests would be beneficial for the overall
 stability and security of the service.
 
 In the agent REST interface this was particularly apparent, but some of the
 REST APIs in other components suffer from similar issues.
 
 ### f) Quote Nonce is not Size Limited
 
 In the REST API call to create a TPM quote in `keylime_agent.py:127` there is
 no size restriction on the nonce parameter. This data is turned into a
 hexdigest and passed directly on the command line of the `tpm2_quote` utility.
 This could hit command line length restrictions and cause the subprocess
 execution to fail. To prevent unexpected results overly large parameters
 or non-alphanumeric contents should be rejected.
 
 3) Findings in the Registrar Component
 ======================================
 
 Security Related
 ----------------
 
 ### a) UUID of Agents is Received on Unprotected HTTP Interface
 
 The Registrar provides two separate HTTP interfaces, a TLS protected one and
 an
 unprotected one. Part of the unprotected interface is the agent registration.
 As part of the agent registration the agent UUID is passed unencrypted
 (processed in `registrar_common.py:229`).
 
 This is not a security issue in its own but relates to issue 2.c where the
 knowledge of the UUID facilitates remote code execution on the agent nodes.
 This means if an attacker can listen in on the registrar's agent registration
 communication then even unpredictable agent UUIDs no longer hinder the attack
 described in issue 2.c.
 
 As outlined in 2.c the UUID does not seem to have been thought of as a
 security property in the first place so I see no urge to change anything here.
 Although when the bootstrapping protocol should get TLS protection then for
 completeness it could also make sense to protect this registrar interface as
 well the same way.
 
 ### b) Unsanitized UUID passed on Unprotected HTTP Interface Facilitates Log
 Spoofing
 
 Since the registrar's unprotected HTTP interface requires no authentication,
 anybody can post arbitrary agent registrations with arbitrary parameters. The
 agent ID (UUID) parameter is not sanitized in any way and is used unfiltered
 in log messages (e.g. `registrar_common.py:107`).
 
 As a result the agent ID parameter can be used to inject seemingly valid
 additional log lines that appear e.g. in `journalctl -u
 keylime_registrar.service`. The attached reproducer script `post_agent.py` can
 be used to demonstrate this:
 
     $ ./post_agent.py --host registrar-host --log-line "Please run rm -rf /*
 to protect your system"
 
 In the journal we will then see:
 
     Dec 21 11:44:22 registrar-host keylime_registrar[1426]: 2021-12-21
 11:44:22.281 - keylime.registrar - WARNING - POST for trusted-agent
     Dec 21 11:44:22 registrar-host keylime_registrar[1426]: 2021-12-21
 11:44:22.931 - keylime.registrar - WARNING - Please run rm -rf /* to protect
 your system
     Dec 21 11:44:22 registrar-host keylime_registrar[1426]: 2021-12-21
 11:44:22.940 - keylime.registrar - DEBUG - returning 400 response. [...]
 
 Such log spoofing could be used to entice Administrators to perform actions
 that can be harmful or otherwise in the interest of an attacker.
 
 My recommendation is on the one hand to diligently sanitize untrusted input
 parameters. On the other hand it might make sense to authenticate this
 currently untrusted interface.
 
 ### c) Existing Agent Registrations can be Overwritten without Authentication
 
 Since the agent registration interface is not authenticated anybody can
 add new or overwrite existing agent registrations. These will not be "active",
 they can only be activated by performing a "PUT" request whilst also providing
 an `auth_tag`.
 
 The unauthenticated overwrite seems to allow to disrupt the actual agent's
 trust
 state in certain situations. It might also be possible to actually overwrite
 and *activate* an existing agent while also providing arbitrary "ip" and
 "port" parameters that e.g. the tenant application could evaluate and perform
 an outgoing connection to.
 
 Code Quality and other Concerns
 -------------------------------
 
 - `registrar_common.py:549`: this busy wait sleep loop to join threads is bad
   style, why not do blocking joins?
 - is the `tpm2_makecredential` external program safe to process untrusted blob
   inputs? It is called in the Registrar context in `registrar_common.py:289`.
 
 4) Findings in the Verifier Component
 =====================================
 
 Security Related
 ----------------
 
 ### a) Revocation Notifier Uses fixed /tmp Path for UNIX Domain Socket
 
 In *revocation_notifier.py* a fixed path in the world writable location
 */tmp/keylime.verifier.ipc* is used. The code (in this case the third party
 `zeromq` Python module) forcefully removes any file object found there
 earlier.
 Should the program be running as non-root, or if another local user simply
 places a *directory* at this location, then this serves as a local DoS attack
 against the revocation notifier process, because the socket cannot be created.
 
 This situation doesn't even seem to be noticed by the verifier main process,
 because the child process `broker_proc` is never waited on. This means that
 the local attacker could even replace the "blocking" directory by his own UNIX
 domain socket later on and will then receive revocation events from
 invocations of the `notify()` function in the main verifier process.
 The full impact of this would have to be researched further. It looks like
 failed quote notifications would longer be sent out.
 
 I recommend to place UNIX domains sockets in a dedicated safe directory in
 /run that cannot be staged with attacks by other local users in the system.
 
 ### b) Get Quote Response Contains Possibly Untrusted ZIP Data
 
 The verifier process periodically performs quote operations on registered
 agents. As part of this `process_quote_response()` is called and furthermore
 `check_quote()` and finally `_tpm2_checkquote()`. In `tpm_main.py:1018` a
 couple of ZIP data streams are uncompressed via `zlib.decompress()`.
 
 Since this is processing possibly untrusted data - the verifier is attempting
 to verify the current trust status of the node after all - it needs to be
 assumed that malicous data can also be supplied here.
 
 Therefore the question arises whether `zlib.decompress()` is robust against
 processing invalid ZIP data streams. One thing I already found out is that it
 is not robust against delivering ZIP bombs that will cause a memory exhaustion
 in the verifier process.
 
 Code Quality and other Concerns
 -------------------------------
 
 ### c) Lifetime of Global Variable `global_password` for SSL Private Keys
 
 In the *ca_util.py* module the `global_password` global variable holds the
 cleartext passphrase for SSL private keys. This is global program state
 that should rather be coupled to the lifetime of some program object like the
 web server instance. This means that the password is possibly kept in memory
 longer than necessary and also inherited to forked child processes that
 exist in the verifier context for zeromq or the tornado web service.
 
 Even though it is difficult in Python to make sure the password is not
 residing in memory any more there should still be a clear and as short as
 possible logical lifetime during which the password is available in the
 program.
 
 ### d) Risky `pkill -f cfssl` to Shutdown a Local cfssl Instance
 
 In `ca_impl_cfssl.py:67` the external command `pkill -f cfssl` is used to try
 and terminate any already / still running cfssl instances in the system. Since
 Keylime runs as _root_ this is a risky operation that might conflict with
 other
 applications or services utilizing cfssl in the system possibly leading to
 hard to find bugs.
 
 Instead child process objects should be used for cfssl and they should be used
 to terminate and wait for cfssl. A PID file in the file system could be used
 to avoid zombie processes from previous runs to stick around. Also signal
 handlers or program exit handlers can be used to kill child processes in case
 the program terminates unexpectedly.
 
 Related to this cfssl handling is also `ca_impl_cfssl.py:71` where a racy
 check is made to test whether the child process "immediately" returned. A
 little later a `time.sleep(0.2)` is added to further add "stability". This is
 no
 deterministic way to check for child process errors. Instead the child
 process's stdout should be inspected for proper startup messages, maybe cfssl
 provides further functionality that allows programmatic evaluation of its
 status.
 
 ### e) Letting cfssl Locally Listen via HTTP
 
 The interface towards the cfssl instance as used in `post_cfssl()` raises
 questions about security. Currently the default listen address and port are
 127.0.0.1:8888. This means that a local malicious user could bind a service
 itself on this port, thereby performing a DoS against the verifier. If the
 local attacker manages to make the verifier code assume that cfssl started
 successfully anyway (see shaky child process handling in item d)), then that
 local attacker could even hand out bogus certificates to the verifier, thereby
 undermining trust in the TLS setup.
 
 Another issue could be that if other users in the system have special
 capabilities like *CAP_NET_RAW* then they could start listening in on the
 unencrypted localhost communication towards the cfssl instance and obtain the
 certificates and private keys that the verifier (or other keylime components)
 creates.
 
 For the generation of sensitive data like private key files it would be
 desirable to use an interface towards cfssl that is not accessible to other
 users in the system.
 
 ### f) Connection Reliability Towards Agent Quote API
 
 In _cloud_verifier_tornado.py:719_ the quote requests against agents are
 carried out using the untrusted HTTP interface of the agent. Here the question
 comes to my mind what happens if an attacker in the attached network manages
 to interrupt or hijack this communication. Could it be used as a DoS vector
 to make the verifier assume that verification failed and it revokes the
 verification state of agents without valid reason? Could it furthermore be
 used even to *prevent* the revocation action to be carried out *on* the agent
 node itself?
 
 I am aware that this is a difficult technical problem to solve even with
 transport security enabled. It is an interesting viewing angle when dealing
 with a remote attestation solution, though, that should be thought through.
 
 5) General Findings
 ===================
 
 This section contains findings that apply to all keylime components alike.
 
 Security Related
 ----------------
 
 ### a) World-Readable keylime.conf Contains Potentially Sensitive Data
 
 The configuration `/etc/keylime.conf` is installed world-readable:
 
     $ ls -l /etc/keylime.conf
     -rw-r--r-- 1 root root 26770 Dec 16 14:54 /etc/keylime.conf
 
 This is the case for installations performed manually via the provided
 `installer.sh` script as well as for the RPM packaging found in both openSUSE
 Tumbleweed and Fedora 35 Linux distributions. Further distributions might be
 affected.
 
 `keylime.conf` contains a lot of information, some of it sensitive like the
 TPM ownership password (`tpm_ownerpassword`), TLS certificate private key
 passwords (`private_key_pw`, `registrar_private_key_pw`) or the database
 password for the registrar (`database_password`). Thus this is a local
 information leak, because arbitrary local users can obtain these passwords
 from the configuration file.
 
 NOTE: Alberto (in CC), our SUSE packager, already fixed the permission of the
 file for openSUSE Tumbleweed in the meantime.
 
 My recommendation is to make this file only accessible to _root_ and adjust
 all installation routines and possibly documentation. Affected distributors
 should be notified about this. The Keylime code could perform a sanity check
 of the permissions of the configuration file before reading it in.
 
 Design Related
 --------------
 
 ### b) Lack of Privilege Separation
 
 All keylime services are currently designed to run as root all the time
 (except for testing purposes, see `REQUIRE_ROOT` in `config.py`). Only few
 bits of the keylime components actually should need root privileges. Most
 notably the bootstrapping scripts in the Agent component or the ability to
 bind privileged ports.
 
 Implementing a privilege separation approach would increase the defense in
 depth for keylime considerably, avoiding smaller security issues to become
 severe fast.
 
 Code Quality Concerns
 ---------------------
 
 ### c) Addition of `/usr/local/...` Directories to `PATH` and
 `LD_LIBRARY_PATH`
 
 Code in `tpm_main.py` and `ca_impl_cfssl.py` adds directories from
 `/usr/local` to the binary and library search path environment variables for
 executing child programs. This is done unconditionally and is unnecessary for
 RPM based installations. While it should be safe to assume that programs and
 libraries found in `/usr/local` are safe to run it still adds certain attack
 surface, should something be misconfigured within `/usr/local` that badly
 influences execution of Keylime.
 
 My recommendation is to make the installation path better configurable for
 integrators and to _only_ add it during runtime if it is required and isn't
 already part of the respective environment variables.
 
 ### d) `umask()` Invocations with Unclear Responsibility
 
 There are currently three `os.umask(0o077)` invocations in `keylime_db.py`,
 `tpm_abstract.py` and `config.py`. These invocations happen somewhat
 randomly during process execution and the original umask value is never
 restored. This leaves the process's umask in a more or less undefined state
 for other parts of the program depending on the exact program execution flow.
 
 My recommendation is to set a conservative umask once during program startup
 and cleanly change and restore it temporarily if necessary.
 
 ### e) Uses of `os.makedirs()` and `os.chown()`
 
 `config.ch_dir()` in conjunction with `config.chownroot()` cause an unknown
 number of new directories to be created (with mode `700`, which is good) and
 then calls `chown(0, 0)` on the final directory component. This should
 actually
 be an unnecessary operation, but *if* one of the upper directory components
 should be owned by a non-root user for some reason, then this could allow for
 a race condition where the attacker controlled path is replaced by symlinks to
 have the `chown()` happen on other targets than actually intended.
 
 My recommendation is not to perform `chown()` operations at all without a good
 reason and then it should be made sure that the directory tree is only _root_
 controlled and `lchown()` should be used to avoid symlink attacks at least for
 the final path component.
 
 ### f) Relying on the CWD to Lookup and Write out Data
 
 Some code portions rely on the current working directory to be correctly set
 like in `tpm_abstract.py` the functions `__read_tpm_data()` and
 `__write_tpm_data()`. This means that if another program component decides to
 change the CWD or an attack vector allows to manipulate the CWD then sensitive
 data will be written to unsafe places or will be obtained from unsafe places.
 
 Since other aspects show that the current code base does not always have clear
 responsibilities for global process settings (see items d, e) the danger of
 this happening is tangible. I recommend not to use CWD relative path lookups
 but always absolute paths.
 
 ### g) Needless Re-opens of Temporary Files
 
 There exist a couple of code sequences involving `tempfile.mkstemp()` in the
 files `tpm_main.py` and `vtpm_manager.py` like this:
 
     # write out the public EK
     efd, etemp = tempfile.mkstemp()
     ekFile = open(etemp, "wb")
 
 This means that two file descriptors are created for each temporary file and
 the cleanup (closing, deletion) is also not straightforward. I recommend to
 use `with` statements and `NamedTemporaryFile` instead.
 
 ### h) Remaining uses of SHA-1
 
 SHA-1 is becoming more and more questionable for a lot of cryptographic use
 cases. In the Keylime configuration we still have SHA-1 enabled by default in
 the `accept_tpm_hash_algs` setting. If no tpm 1.2 support is required then
 this default could probably be hardened.
 
 Another instance where SHA1 is used is in `keylime/crypto.py` in the context
 of RSA key usage. It seems only to be used for padding, which should be okay,
 but just to get this question out of the way it could be considered to upgrade
 this to SHA2 just as well.
 
 6) CVE Assignments
 ==================
 
 I suggest to assign CVEs for the following issues:
 
 - item 2.a `check_mounted()` logic can be confused by local user mounts
 - item 2.c arbitrary Remote Code Execution via Unauthenticated Agent Bootstrap
   Interface
 - item 2.d bootstrap Protocol Susceptible to Replay Attacks
 - item 3.b log Spoofing in Registrar due to unsanitized agent UUID
 - item 4.a revocation notifier uses fixed /tmp/keylime.verifier.ipc
 - item 4.b untrusted ZIP data can at least lead to memory exhaustion in the
   Verifier
 - item 5.a world-readable keylime.conf
 
 Possible candidates:
 
 - item 3.b information leaks via unauthenticated agent quote interface. If
   more concrete exploitation of this information leak can be thought of then
   this might also be worth a CVE. I only consider it minor for the time being,
   though.
 - item 3.c existing agent registration can be overwritten. I'm not sure of
   the full impact.
Comment 8 Matthias Gerstner 2022-01-20 13:48:00 UTC
Upstream development is quite active over the last couple of weeks also with
support from Alberto. I reviewed the patches available so far and it looks
like we're on a good way to finish this.

Since a new TLS path towards the Agent will be introduced there will be a
somewhat complex upgrade path to the to-be-released version.

There is still no definite publication date as far as I know but there was
talk about releasing this by the end of the month.
Comment 9 Matthias Gerstner 2022-01-28 09:27:18 UTC
The security advisories are public now in the upstream repository:

https://github.com/keylime/keylime/security/advisories
Comment 10 Matthias Gerstner 2022-02-18 10:08:10 UTC
Maintenance process is complete, thus closing the tracker.