In the Industry 4.0 context, data are a valuable asset that must be protected. OPC UA PubSub enables secure and interoperable solutions, but authentication of IIoT devices remains a sensitive issue. This article presents a novel approach based on open source software, using a Trusted Platform Module to protect secrets on devices, and evaluates its security level on a predictive maintenance use case.
This article reproduces the entirety of a pre-print version of an article written jointly with Thales Research and Technology and published in volume 134 of the Journal of Systems Architecture.
The most common architecture in industrial systems is one where a supervisor component is connected to a set of slave devices, which can be either sensors or actuators. Sensors typically produce periodic or sporadic data (in the case of alarms) while actuators consume commands. Sensors can also receive data and commands (typically software/firmware updates, configuration or state changes), while actuators can also produce data (generally their current state, possibly in the form of a heartbeat). In traditional Industrial Control Systems such as SCADA systems, none of these components is connected to open networks, and most of them rely on field buses, using wired communications and proprietary protocols (e.g. PROFINET, EtherCAT).
As supply chain externalization and subcontracting become more common, and as industrial empires tend to split into separate legal entities, there is an increasing need for interoperability between industrial systems. Furthermore, the rise of machine learning has turned operational data into an asset rather than a mere tool for production. Data can now be exchanged between different companies, which strengthens this need.
In the Industrial Internet of Things (IIoT), as opposed to the more generic IoT context, the things (devices, but also supervisors) are rarely directly connected to open networks such as the Internet. They are generally connected to the public network through gateways, as illustrated in Figure 1. A breach of the software integrity of devices involved in such systems could lead to catastrophic results [stuxnet], [blackenergy].
To enforce the system’s integrity, the gateways must ensure a high degree of security.
In this context, which requires both interoperability and security, we focused our analysis on existing open protocols. We present in this paper a secure system architecture based on such open protocols, namely MQTT and OPC UA PubSub, and on a hardware security element, the TPM.
The remainder of this paper is organized as follows. Section 2 summarizes the main goals the IIoT devices must reach. To do so, we chose MQTT and OPC UA PubSub, which are presented in Section 3. In Section 4, we analyze some threats and vulnerabilities that arise when using such secure protocols. In Section 5, we propose a solution to counter these threats, which we illustrate in Section 6 in an actual system architecture that shows its applicability. In Section 7, we describe the industry challenge of predictive maintenance, as well as the prototype relying on our solution to address its security and connectivity issues. We evaluate the security level of our solution with respect to this practical use case in Section 7.3, comparing it to industry standards.
Finally, we conclude and discuss further enhancements.
Industrial Internet of Things Opportunities and Challenges¶
Compared with the IT world, and in addition to increased security needs, there is also a specific need for interoperability. Because of the industry legacy and specific constraints on their environment and usage, the subsystems containing the end-devices often run on proprietary or somewhat exotic protocols. Switching to more open or widespread protocols is generally not an option, as it implies the re-certification of already safe systems.
As a consequence, there is a need for an upper-level interoperable protocol that can bridge industry protocols and also connect to open protocols such as TCP/IP, MQTT or even HTTPS. To be adopted by new stakeholders, this protocol should be associated with standard data models for all the industry fields involved.
IIoT may use a gateway to bridge interoperable protocols to the existing field-specific protocols. For instance, it enables the communication of legacy devices with more recent ones without re-certification.
Properly signed and encrypted messages are hard to decrypt. Yet in the long term no system is immune to compromise, not even mature, market-proven ones [cisa]. This is particularly true for long-life safe systems (for instance automatic train controllers), which are rarely updated: such systems must be thoroughly tested for systematic bugs before being deployed [metayer2018], [Wolf18]. Since new vulnerabilities can be revealed after a device is deployed, a defense-in-depth strategy must be adopted to ensure persistent security throughout the whole system’s lifecycle.
If the keys used to sign and encrypt the messages of a device are compromised, the attacker is able to forge messages as if they were produced by this device. Furthermore, a rogue client could obtain the same access as the device, leading to potential leaks of industrial secrets or giving the attacker leverage for physical sabotage.
IIoT often relies on a gateway to ensure security. This security must be effective regarding each part of the CIA triad (Confidentiality, Integrity and Availability). Typically, integrity is addressed through cryptographic signature, confidentiality is addressed through encryption, and availability is ensured by the existence of the gateway and its hardening against Denial of Service (DoS) attacks.
Using a Gateway¶
As shown in Figure 1 and underlined by the need for interoperability with legacy devices, it is not always possible to secure all communications between all devices. In this case, devices that must communicate without encryption must belong to the same isolated private network. To overcome this limitation and enable communications with devices outside this network, a gateway may be used to bridge communications to other isolated networks.
In this situation, the gateway is responsible for the confidentiality and the integrity of the data it relays, from the edge (OT-to-IT border) to the subscriber system — in this case, the analytics mainframe. It can use secure and interoperable protocols to communicate these data to/from other gateways or other remote controllers, even across public infrastructures and perform encryption/authentication of the communications when necessary.
Strict OT zone protection is not covered in this document, as it depends heavily on the actual use case: (1) physical protections may impede access to the OT network, (2) physical limitations of the OT devices or network may prevent cryptographic security measures from being applied, or (3) the data samples may be of relatively low value compared with the correlated data produced by the secure gateway.
It is important to note that most industrial systems involve heterogeneous sensors that capture different phenomena originating from the same physical event. Since these phenomena are not independent, the results produced by the different sensors can be correlated, which makes it possible to detect forged messages. While this solution is not covered by our document due to the specificities of our target use case, it would indeed be possible to validate the integrity of data from the sensors to the gateway without signing data on the sensors, with solutions ranging from checking a single sensor state to comparing the samples of the different sensors. These topics are well studied from the safety viewpoint for SCADA systems, and embedded solutions have been proposed [SCHOLL]. If applicable to OT sensor data, business-specific edge computing may also be performed, i.e. the gateway computes a local state that is sent instead of (or in addition to) the raw data received from the OT sensors.
However, this approach implies that compromising the gateway gives access to the whole subnetwork. In this case, it is possible to forge messages on behalf of the compromised gateway. Since the final receiver may be a critical system, such attacks can lead to catastrophic results, including crippling vital infrastructure, severe long-term environmental damage and loss of human lives.
In order to ensure the security of the industrial system, we present in this document a hardware-based approach to secure the gateway communications, and give in Section 7.2 an example of its usage in predictive maintenance for the railway industry.
Relying on a secure gateway is not always an option for very sensitive systems, for example when they are subject to state security regulations or are targets of industrial espionage. In these cases, confidentiality must involve only the two ends of the communication, without any third party.
As such, end-to-end encryption is an important need for IIoT. Only the final receiver of a message should be able to access its cleartext data. This encompasses the capability to cross networks through as many gateways as needed while the data keeps its confidentiality and integrity, both being guaranteed by cryptographic means. This latter feature is important, as devices in IIoT are usually embedded in multiple layers of subnetworks.
Using end-to-end encryption is only possible when using a data transmission protocol that implements it, and it is usually not applicable for legacy systems.
Recognized Protocols for Interoperability¶
In this paper, we present two protocols that have safe and open implementations: MQTT in Section 3.1 and OPC UA in Section 3.2. Both provide means to secure communications between devices with low bandwidth overhead.
They also both support the publish-subscribe communication pattern, where publishers publish messages to topics and subscribers subscribe to topics to receive the messages published on them. In this pattern, publishers and subscribers are not directly connected to each other. Instead, they are connected to an agent commonly known as a broker. The broker receives messages from publishers and dispatches them to the subscribers of the corresponding topic. In other words, publishers send messages to potentially many subscribers, and subscribers receive messages from publishers they might not be aware of. However, as they talk about the same topic, they understand each other. This loose coupling facilitates network scalability and versatility.
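The dispatching role of the broker can be illustrated with a minimal in-memory sketch. It is not an MQTT implementation: the `Broker` class, its method names and the topic names are hypothetical and only serve to show that a publisher addresses a topic rather than a subscriber.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str, bytes], None]


class Broker:
    """In-memory stand-in for a publish-subscribe broker (hypothetical, not MQTT)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: bytes) -> int:
        """Dispatch the payload to every handler subscribed to the topic; return the count."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(topic, payload)
        return len(handlers)


broker = Broker()
received_by_control_center = []
received_by_actuator = []

# Two independent subscribers register for the same topic.
broker.subscribe("line1/temperature", lambda t, p: received_by_control_center.append((t, p)))
broker.subscribe("line1/temperature", lambda t, p: received_by_actuator.append((t, p)))

# The publisher only knows the topic, not who is listening.
delivered = broker.publish("line1/temperature", b"21.5")
ignored = broker.publish("line1/pressure", b"1.02")  # no subscriber on this topic
```

Note that the broker in this sketch handles every payload in clear, which is also why transport-level protection between agents and broker is not end-to-end.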
MQTT [mqtt] stands for Message Queuing Telemetry Transport and is an OASIS standard messaging protocol for the IoT. Running over the Transmission Control Protocol (TCP), it is a lightweight protocol based on the publish-subscribe pattern. It includes features such as Quality of Service (QoS) and security using the Transport Layer Security (TLS) protocol.
The MQTT protocol works with a central broker which connects Agents (Publishers, Subscribers, or both). Agents connect (optionally) securely to the broker and subscribe to topics and/or publish payloads to topics. The broker duplicates messages to all the agents that subscribed to the topic. This process is illustrated in Figure 2, where different IIoT sensors monitoring the manufacturing line feed actuators, safety-specific systems and the factory control center through MQTT.
Using TLS to protect the MQTT transport only protects the connections between the broker and the agents. As such, MQTT is not end-to-end encrypted. Data are decrypted by the broker and may be filtered/altered before being encrypted and sent to subscribers. Even if it depends on the use case, this level of protection is usually enough.
Using a centralized broker may also be an advantage in scalable networks, where Agents need to communicate through layers of subnetworks, and will more easily and reliably work with a TCP connection to the broker than, for instance, protocols using UDP Multicast. However, the broker is a central point of vulnerability in a distributed network of otherwise unconnected devices.
The MQTT protocol has been widely adopted in multiple industrial domains, and multiple interoperable MQTT solutions exist. This makes it possible to re-use building blocks such as broker software and publish-subscribe libraries, which is an important advantage for its deployment in new scenarios.
OPC UA and OPC UA PubSub¶
OPC UA is a standard [opcua] for data exchange in industrial communications. It provides safe and secure means to connect supervision systems (SCADA) with programmable logic controllers (PLCs), actuators, and sensors.
Using OPC UA offers many advantages such as:
- its interoperability ensured by an independent compliance laboratory,
- its independence from hardware and platforms,
- its enhanced modeling capabilities,
- its management by a foundation (OPC Foundation) of over 800 members, ensuring the development and dissemination of the technology.
The OPC UA standard contains two different flavors:
- connected communications with the classic client-server protocol,
- publish-subscribe communications with the OPC UA PubSub protocol (with or without broker).
In the client-server approach, servers hold the data in an address space, and clients connect to them to read or write data. Client-server services include address space discovery (browse), remote procedure calls (method call), object-oriented data management, …
In a network topology where numerous (typically several hundred) sensors and actuators need to communicate with supervisors, a classical client-server approach requires a dense connection mesh to gather and distribute information. From this perspective, the PubSub mechanism is attractive because a supervisor only needs to send one message to communicate information to all the remote devices.
Part 14 of the OPC UA standard, revision 1.04, details the PubSub flavor of the OPC UA protocol.
OPC UA PubSub is structured in two layers: the Transport layer and the Message layer. This layered architecture ensures that OPC UA PubSub is not tied to a specific communication protocol that may become obsolete as technology evolves. This was the case in the past with OPC DA, which is based on the Microsoft COM/DCOM technology.
The OPC UA transport layer is unrelated to the OSI transport layer of the same name. It covers communication protocols ranging from OSI layer 2 (data link) to layer 7 (application). OPC UA PubSub supports a set of well-established communication protocols to convey its payload between two clients:
- Multicast UDP,
- Raw Ethernet,
- AMQP (Advanced Message Queuing Protocol),
- MQTT (Message Queuing Telemetry Transport).
The first two protocols are dedicated to “Controller to Controller” communications and do not require a broker (the network switches act as lightweight brokers by broadcasting messages). They can be combined with TSN/Ethernet (Time-Sensitive Networking over Ethernet) to meet hard real-time requirements. The latter two transports rely on existing publish-subscribe protocols, which are only used for their transport capabilities. They are better suited to cloud integration, as they are supported by the main cloud providers such as AWS, Microsoft Azure or Google Cloud. AMQP is designed with more advanced features than MQTT, but its higher overhead makes MQTT better suited to numerous IIoT scenarios.
To overcome the diverse and limited protection these transport layer protocols provide (e.g. Raw Ethernet and Multicast UDP offer no transport security), OPC UA PubSub provides facilities for end-to-end encryption through its Message layer. We discuss them in the following section.
OPC UA PubSub End-to-End Encryption¶
Message content can be encoded either in text (JSON) or binary (UADP), the latter being the only one to provide end-to-end encryption.
Messages are composed of headers and a payload. When using encryption, only the payload, which bears data values, is encrypted. The whole Message is then signed, and the signature is appended at the end of the message. In this way, regardless of the Transport layer, end-to-end encryption is achieved.
OPC UA PubSub supports three security modes: encryption and signature, signature only, or none of them. There are multiple levels of encryption which are called security policies. They describe which algorithms are used, with which key lengths, signature scheme, … New security policies are regularly added by the OPC Foundation to keep up with the most current technologies. In the same way, older security policies are marked as deprecated and should not be used for new systems.
For now, PubSub encryption always uses a block cipher in counter mode (more specifically, AES in CTR mode), and the signature algorithm is a keyed-hash message authentication code (HMAC) based on the SHA-256 hash algorithm. Both algorithms need a secret to work with: one key is used for both encryption and decryption, and another one for both signature and signature verification. Hence, publishers and subscribers must possess the same secrets to securely exchange messages.
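The message protection described above (payload encrypted in counter mode, whole message signed, signature appended) can be sketched with the Python standard library. This is only an illustration under stated assumptions: Python ships no AES, so HMAC-SHA256 is used here as a stand-in for the AES block function when generating the counter-mode keystream, and the message layout (header, nonce, ciphertext, signature) as well as the function names are hypothetical and do not reproduce the UADP encoding.

```python
import hashlib
import hmac
import secrets

SIGNATURE_LEN = 32  # HMAC-SHA256 output size in bytes


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Counter-mode keystream. HMAC-SHA256 stands in for the AES block function."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out.extend(hmac.new(key, nonce + counter.to_bytes(4, "big"), hashlib.sha256).digest())
        counter += 1
    return bytes(out[:length])


def xor(data: bytes, stream: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, stream))


def protect(header: bytes, payload: bytes, enc_key: bytes, sign_key: bytes, nonce: bytes) -> bytes:
    """Encrypt the payload only, then sign header + nonce + ciphertext and append the signature."""
    ciphertext = xor(payload, keystream(enc_key, nonce, len(payload)))
    body = header + nonce + ciphertext
    return body + hmac.new(sign_key, body, hashlib.sha256).digest()


def unprotect(message: bytes, header_len: int, nonce_len: int,
              enc_key: bytes, sign_key: bytes) -> bytes:
    """Verify the signature over the whole message, then decrypt the payload."""
    body, signature = message[:-SIGNATURE_LEN], message[-SIGNATURE_LEN:]
    expected = hmac.new(sign_key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature check failed")
    nonce = body[header_len:header_len + nonce_len]
    ciphertext = body[header_len + nonce_len:]
    return xor(ciphertext, keystream(enc_key, nonce, len(ciphertext)))


# Publisher and subscriber share the same two secrets.
enc_key, sign_key = secrets.token_bytes(32), secrets.token_bytes(32)
nonce = secrets.token_bytes(8)
header = b"HDR1"
message = protect(header, b"temperature=21.5", enc_key, sign_key, nonce)
recovered = unprotect(message, len(header), len(nonce), enc_key, sign_key)
```

Because the same operation encrypts and decrypts in counter mode, and the same key signs and verifies with an HMAC, any party able to read a message is also able to produce one.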
PubSub messages are associated with a security group. All messages of a security group are encrypted and/or signed with the same keys and the same security policy. To access messages from a publisher, subscribers shall have access to the corresponding security group.
The solution defined in the specification to distribute these keys (see Section 5.4.3 of OPC UA specification Part 14) is to use an entity named Security Key Service (SKS). This OPC UA server is in charge of authenticating the agents of the network and distributing the security keys.
As a consequence, publishers and subscribers must connect to this server using the classic OPC UA protocol to fetch the security keys. They must do so securely, so that the security group keys are transferred confidentially over a possibly public network.
In classic OPC UA, the authentication of the client is handled through a Public Key Infrastructure (PKI) with Certificate Authorities (CA). It uses the X.509 public key certificate and Certificate Revocation List (CRL) standard. This implies that the client embeds a key pair to prove its identity. This pair is composed of a so-called public key that is signed by a certificate authority (this signed key is also called a certificate), and a private key which must stay secret at all times. The client and server shall also embed the trust chain to be able to check respectively the client and server certificates (see Certificate trust chain validation in OPC UA for details).
The two main phases of OPC UA PubSub are described in Figure 3, where communications using the OPC UA Client/Server paradigm (i.e. authentication and key distribution) are colored in red, while communications using the PubSub paradigm (i.e. encrypted messaging) are colored in blue.
S2OPC is an open-source implementation of the OPC UA specification made by Systerel. Resulting from the INGOPCS project and supported by the French National Cybersecurity Agency (ANSSI), it is designed to target safety and security certifications. To date, S2OPC has been certified by the OPC Foundation. A CSPN certification of the product is also in progress.
S2OPC is an OPC UA library aimed at building a combination of the following components:
- OPC UA Client,
- OPC UA Server,
- OPC UA PubSub Publisher,
- OPC UA PubSub Subscriber.
Regarding PubSub, S2OPC supports the Multicast UDP, the Raw Ethernet and the MQTT transport implementations as well as the UADP Message implementation (binary encoding) with end-to-end encryption.
An SKS server implementation using S2OPC with an OPC UA Server configuration is also provided to support a complete deployment of the secure PubSub architecture.
Identifying Threats and Vulnerabilities¶
The OPC UA client-server protocol has been analyzed for security vulnerabilities in the past and is considered secure. Moreover, as of today, no vulnerability has been identified in the OPC UA PubSub protocol, whose security relies on modern cryptographic means.
However, as mentioned in Section 3.3, all agents participating in a given security group share the same encryption and signature keys. This holds true whether the clients are subscribers or publishers.
As a consequence, a malicious client holding valid group keys would be able to compromise a whole group, as it could either inject forged data (acting as a publisher, an attack on data integrity) or read secret data (acting as a subscriber, an attack on data confidentiality). Hence, the handling and distribution of the group keys is a vulnerable process. Protecting the group keys (encryption and signature) is thus critical to ensure the group security.
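The integrity side of this threat can be made concrete with a short sketch. Since the group signature is a symmetric HMAC, any holder of the group signing key produces signatures that are indistinguishable from those of a legitimate publisher. The message contents and the `pump-7` identifier are hypothetical.

```python
import hashlib
import hmac
import secrets


def sign(group_sign_key: bytes, message: bytes) -> bytes:
    return hmac.new(group_sign_key, message, hashlib.sha256).digest()


def verify(group_sign_key: bytes, message: bytes, signature: bytes) -> bool:
    return hmac.compare_digest(sign(group_sign_key, message), signature)


group_sign_key = secrets.token_bytes(32)  # shared by every agent of the security group

# A legitimate publisher signs a sample.
genuine = b"publisher=pump-7;vibration=0.12"
genuine_sig = sign(group_sign_key, genuine)

# A compromised agent holds the same key and forges a sample "from" pump-7.
forged = b"publisher=pump-7;vibration=9.99"
forged_sig = sign(group_sign_key, forged)

genuine_ok = verify(group_sign_key, genuine, genuine_sig)
forged_ok = verify(group_sign_key, forged, forged_sig)  # receivers cannot tell the difference

# An outsider without the group key cannot produce a valid signature.
outsider_ok = verify(group_sign_key, forged, sign(secrets.token_bytes(32), forged))
```

The signature therefore proves membership of the security group, not the identity of an individual publisher, which is why access to the group keys has to be controlled per device.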
As the group keys are securely exchanged on a secure OPC UA client-server connection, their confidentiality is based on two factors:
- their storage in PubSub agents,
- the authenticity of the client certificate used to fetch the group keys from the SKS server.
Legitimate PubSub agents may be compromised, for instance by an attack on software integrity, either at boot time or at runtime, which enables attackers to gain access to the group keys. Mitigations against boot time attacks are known and can be provided by techniques such as secure boot and Control-Flow Integrity (CFI).
In-memory protection of the keys is an open subject which reaches beyond the scope of this article.
The client authentication process is detailed in the following subsections.
Securing Client Authentication¶
In the OPC UA PubSub specification, agents connect to the SKS for authentication using the OPC UA client/server protocol. To better understand how this process works, and how authenticity is asserted by the SKS server, the steps of the process are summarized below:
1. the OPC UA client connects to the SKS server,
2. the client encrypts a message containing its client nonce using the server’s public key, so that only the server can decrypt it,
3. the client also signs the message with its private key to prove its identity,
4. the client then sends its certificate and the encrypted/signed message to the server,
5. the SKS server checks the validity of the client certificate through its Public Key Infrastructure (PKI),
6. if the client certificate has been signed by a trusted certificate authority, the server extracts the client’s public key from the certificate and verifies the message signature,
7. if the message is correctly signed, the server decrypts it to recover the client nonce,
8. the server then encrypts its response message containing the server nonce with the client public key, and signs it with its own private key,
9. the client checks the validity of the server certificate through its PKI,
10. if authenticated, the client decrypts the message using its own private key and recovers the server nonce,
11. on both sides, using the exchanged nonces, temporary session keys (encryption and signature) are generated,
12. the rest of the communication is encrypted using these unique session keys.
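The step in which both sides generate temporary session keys from the exchanged nonces can be sketched as follows. This is a minimal illustration, assuming the TLS 1.2-style P_SHA256 expansion (HMAC-SHA256 iterated over a secret and a seed). The function names `p_sha256` and `derive_keys`, the 32/32/16-byte key layout, and the choice of which nonce acts as secret or seed for each direction are assumptions made for this example; the normative definition is in the OPC UA specification.

```python
import hashlib
import hmac
import secrets


def p_sha256(secret: bytes, seed: bytes, length: int) -> bytes:
    """TLS 1.2-style P_hash with SHA-256: expands (secret, seed) into `length` bytes."""
    out = bytearray()
    a = seed  # A(0)
    while len(out) < length:
        a = hmac.new(secret, a, hashlib.sha256).digest()  # A(i) = HMAC(secret, A(i-1))
        out.extend(hmac.new(secret, a + seed, hashlib.sha256).digest())
    return bytes(out[:length])


def derive_keys(secret_nonce: bytes, seed_nonce: bytes) -> dict:
    """Split the expanded material into a signing key, an encryption key and an IV.

    The 32/32/16-byte layout is an illustrative assumption.
    """
    material = p_sha256(secret_nonce, seed_nonce, 80)
    return {"sign": material[:32], "encrypt": material[32:64], "iv": material[64:80]}


# Nonces exchanged during the handshake (each one was sent encrypted to the peer).
client_nonce = secrets.token_bytes(32)
server_nonce = secrets.token_bytes(32)

# Both sides hold both nonces, so both can compute one key set per direction.
client_side = (derive_keys(server_nonce, client_nonce), derive_keys(client_nonce, server_nonce))
server_side = (derive_keys(server_nonce, client_nonce), derive_keys(client_nonce, server_nonce))
```

No key material travels on the wire in this step: only the nonces were exchanged, and each was encrypted for its recipient, so recovering the session keys requires one of the two private keys.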
Now that the client has established a secure way to communicate with the server, it can securely fetch the PubSub group keys. As the server has authenticated the client, it can either authorize or deny the client request to fetch the group keys, depending on whether the client is part of the security group.
This process relies on the fact that the client private key cannot be recovered, so that attackers cannot impersonate legitimate clients.
Security measures must be taken all along the life cycle of the client device, as many opportunities may be used to steal the client’s private key. For instance, if the private key is provisioned by an external actor (distributed by a central authority over the network, brought physically to the device on an external storage device, …), it could be compromised even before its provisioning.
Even if the keys are provisioned securely, it is still possible to compromise the device’s private key. For instance, if the private key is stored physically unprotected on the device, an attacker could compromise it either at rest in the ROM or at runtime in the RAM, cache or CPU bus.
Another means of attack is to recover the keys from decommissioned devices. A common mitigation is to revoke the keys of decommissioned devices.
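The authorization decision taken by the SKS once a client is authenticated, combined with revocation, can be sketched as a simple lookup. The `KeyService` class, its fields and the thumbprint strings are hypothetical bookkeeping for illustration only; they are neither the OPC UA service interface nor the S2OPC API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class KeyService:
    """Hypothetical SKS-side bookkeeping; not the S2OPC or OPC UA API."""
    group_members: Dict[str, Set[str]] = field(default_factory=dict)  # group -> certificate thumbprints
    group_keys: Dict[str, bytes] = field(default_factory=dict)        # group -> current key material
    revoked: Set[str] = field(default_factory=set)                    # revoked certificate thumbprints

    def revoke(self, thumbprint: str) -> None:
        self.revoked.add(thumbprint)

    def get_group_keys(self, thumbprint: str, group: str) -> Optional[bytes]:
        """Return the keys only to an authenticated, non-revoked member of the group."""
        if thumbprint in self.revoked:
            return None
        if thumbprint not in self.group_members.get(group, set()):
            return None
        return self.group_keys.get(group)


sks = KeyService(
    group_members={"line1": {"thumb-gateway", "thumb-remote"}},
    group_keys={"line1": b"\x01" * 32},
)

before = sks.get_group_keys("thumb-remote", "line1")
sks.revoke("thumb-remote")  # device decommissioned or stolen
after = sks.get_group_keys("thumb-remote", "line1")
stranger = sks.get_group_keys("thumb-unknown", "line1")
```

Revocation in this sketch only blocks future requests. Keys that a device fetched before being revoked remain usable until the group keys are renewed.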
To protect the client private key on the device, we used a TPM2-compliant device, as detailed in the following section.
Asserting Authenticity with a Secure Cryptoprocessor¶
The Trusted Platform Module (TPM) is a standard for secure cryptoprocessors providing services such as random number generation, generation and storage of cryptographic keys, and encryption and decryption of data. The TPM standard is currently in its second iteration, which is frequently referred to as TPM2.
The benefit of using a TPM is that it protects the secrets and performs the encryption/signature operations without loading the secrets into the system’s memory. The secrets thus never leave the TPM hardware module. Moreover, some TPM implementations are tamper-resistant against physical intrusion and reset themselves in such an event, protecting the secrets.
TPMs are usually available as discrete hardware components integrated into the target device. In other cases, for instance with ARM TrustZone, TPMs are available as software components executed in a Trusted Execution Environment (TEE). In all cases, the communication between the device and the TPM is defined by the TPM standard.
TPM and OPC UA¶
In the OPC UA client-server protocol, a TPM can be used to securely store the asymmetric keys and to perform signature operations (robust authentication, step 3 in Section 4.1) and decryption (to obtain the nonce, step 10). This requires two additional configuration steps:
- the TPM is first provisioned with an asymmetric key pair, i.e. the TPM generates an asymmetric key pair for which the public key part can be retrieved but the private key part cannot,
- the public key part of the created key pair is retrieved from the TPM and signed by a trusted certificate authority.
The signed public key part is the certificate that uniquely authenticates an OPC UA client or server in the network architecture. An OPC UA PubSub agent that has a configured TPM is now able to recover the security group keys securely, and the private key used in the OPC UA client-server protocol cannot be extracted from the device.
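The interface boundary that the TPM provides can be pictured with a toy sketch: the host may ask for the public part and for a signature, but no operation returns the private part. The `ToyKeyStore` class is purely illustrative and is not a TPM or TSS2 API. It uses textbook RSA with fixed, small primes, which is not secure, and Python attribute hiding is of course not a hardware protection.

```python
import hashlib


class ToyKeyStore:
    """Illustrative stand-in for a TPM-resident key pair.

    Textbook RSA with fixed toy primes: NOT secure and NOT a TPM API.
    The class exposes the public key and a signing operation, but no
    method that returns the private exponent.
    """

    def __init__(self) -> None:
        p, q = 2**127 - 1, 2**89 - 1  # two known Mersenne primes (toy sizes)
        self._n = p * q
        self._e = 65537
        self.__d = pow(self._e, -1, (p - 1) * (q - 1))  # private exponent, kept internal

    def public_key(self) -> tuple:
        return (self._n, self._e)

    def sign(self, message: bytes) -> int:
        digest = int.from_bytes(hashlib.sha256(message).digest(), "big") % self._n
        return pow(digest, self.__d, self._n)


def verify(public_key: tuple, message: bytes, signature: int) -> bool:
    """Verification only needs the public part, as a server would have from a certificate."""
    n, e = public_key
    digest = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(signature, e, n) == digest


store = ToyKeyStore()
pub = store.public_key()            # the part that would be sent to the certificate authority
sig = store.sign(b"client hello")   # the signature is computed inside the key store
ok = verify(pub, b"client hello", sig)
bad = verify(pub, b"tampered", sig)
```

In a real deployment the key pair is generated inside the TPM and the signing request crosses the hardware boundary, so the host software only ever handles the public key and the resulting signatures.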
Protection of the Secrets¶
As illustrated by the previous example, the TPM provides the following properties:
- the private key is not readable on non-volatile storage (hard-disk, ROM, …),
- the private key is not loaded in live memory of the device,
- no human operator ever accesses or manipulates the private key.
In addition to this increase in security assurance, using a TPM significantly reduces the attack surface of the system. Without a TPM (or another kind of Hardware Security Module, HSM), the private key appears in static and/or live memory. It is also generated and provisioned by different actors, human and/or automated. At any of these points, the private key may be targeted by an attacker. In our solution, the only point of failure is the TPM, and thus the security assessment is much easier and less error-prone.
Having a single point of protected storage also helps to monitor the security of all the communications. It facilitates the identification of a potential key leak and the revocation of the key, keeping the other agents of the network safe more efficiently.
Limits and Remaining Vulnerabilities¶
As it is an additional component, a TPM module must be integrated into the target board. Although technically not challenging since the TPM suppliers generally make this integration as easy as possible, it is likely to lead to a re-certification in critical domains (e.g. avionics). In these domains, TPM usage should thus be considered for new products, rather than used for improving existing systems’ security.
A second possible issue is the limited computational power of the TPM. In the example of the OPC UA client-server connection, the TPM is used to sign a message to be sent (i.e. to encrypt a 32 B hash of the message, which is computed by the host) and to decrypt another message (the AES256 nonce, so 16 B of payload). In our benchmarks, this process takes around 1 second with a rather naive implementation. Experience in TPM usage tells us that a more optimized implementation would take around 100 ms (performance may vary with the TPM version and supplier). In the PubSub protocol, the client-server connection is only established to fetch the group keys, so the impact of establishing the connection strongly depends on the group key lifetime. In some real-time systems with strict worst-case latency requirements, it may be an issue.
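The dependency on the group key lifetime can be made explicit with a small calculation. The 1 s and 100 ms figures are those quoted above; the one-hour key lifetime is an assumption chosen only for illustration, as is the function name.

```python
def connection_overhead(connect_seconds: float, key_lifetime_seconds: float) -> float:
    """Fraction of each key lifetime spent on the TPM-backed connection to the SKS."""
    return connect_seconds / key_lifetime_seconds


KEY_LIFETIME_S = 3600.0  # assumption for illustration: group keys renewed hourly

naive = connection_overhead(1.0, KEY_LIFETIME_S)      # about 1 s with the naive implementation
optimized = connection_overhead(0.1, KEY_LIFETIME_S)  # about 100 ms expected once optimized

print(f"naive: {naive:.4%} of the key lifetime, optimized: {optimized:.4%}")
```

The average cost is small under this assumption, but it says nothing about worst-case latency: a device that blocks on the connection at renewal time still pays the full delay once per key lifetime.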
As with any security system, the TPM reinforces security but cannot prevent all types of attacks. It is still possible to physically steal the whole device, or to compromise access to the machine and make connections on its behalf. However, it is easy to revoke the certificates of devices that have been stolen or compromised, while preserving the security of the exchanges in the rest of the security group.
In a production system, the TPM can additionally be used to perform measured boot coupled with a root of trust, and the use of the created key pair can be tied to this measurement, so that the device is only used when the system integrity can be ensured. This reinforces the benefits of using the TPM to protect the group keys.
Finally, decommissioning a TPM should also be done with care, as the private key may still be embedded in the device. Ensuring a total and irreversible wipe of the key is possible, but this action must be clearly planned in the device life cycle. In all cases, it is recommended to also have the device certificate revoked by the certificate authority, preventing potential use of an incorrectly cleared TPM.
Proof of Concept¶
An architecture of a system based on MQTT and OPC UA PubSub was developed to test the feasibility of the concepts proposed in this paper. Figure 4 presents the global architecture of the experiment. This architecture represents a system isolated on a private subnetwork on the left, and a remote device on the right. The typical use case for this kind of architecture is for a system that is spread out physically. The controller of the system is in the private subnetwork but needs to communicate securely and efficiently with the remote device.
In this proof of concept, the remote device represents potentially many remote devices that may be spread out physically. The private subnetwork represents another part of the system that requires an isolated network (for example because of legacy devices using unsecured protocols).
The communication protocol is OPC UA PubSub with end-to-end encryption enabled. The MQTT protocol is used to transport the encrypted OPC UA PubSub messages, as it is more convenient to use a TCP broker to connect to remote devices. The gateway of the private subnetwork is then represented by two entry points: the MQTT broker and the SKS server used to distribute the group keys.
The S2OPC library is used to build the SKS server, but also for the devices using the OPC UA PubSub protocol and the OPC UA client-server protocol (connections between PubSub devices and the SKS server). The Paho library is used by S2OPC for the MQTT transport of the OPC UA PubSub messages. The broker is an off-the-shelf Mosquitto broker. An OPC UA server is also added to the PubSub modules to expose the published data in a client-server manner. This access is mostly for convenience and tests, as data transfers between devices are only done in PubSub.
In the private network, the hypothesis is made that the devices are protected from physical access. It is also supposed that the gateways are hidden behind a firewall so that their access may not be compromised.
However, on the remote device, such hypotheses cannot be made. In this case, its secrets should be protected by a TPM module. To illustrate the principles shown in this paper, we developed a first proof of concept using a Raspberry Pi 4 as remote device, and integrated a TPM2 from ST Microelectronics (the STPM4RasPI extension board) as a discrete component. Its Common Criteria evaluation reached EAL 4+ [STtpmANSSIrep], hence offering a satisfying level of security for this proof of concept.
On the software side, we used the S2OPC OPC UA client-server and PubSub libraries. The S2OPC library has been adapted to communicate with the TPM through the TPM library from the Linux TPM2 & TSS2 Software community.
All software used in this experiment is open-source, with the exception of tpm2-lib, a proprietary library developed by Thales and used as a high-level API above TSS2.
The proof of concept shows that it is possible to realize such a secure infrastructure with a Raspberry Pi 4 and the STPM4RasPI. It is relatively easy to provision the TPM, extract its public key, and have it signed by the certificate authority. Once this configuration is done, the remote device connects securely to the SKS server and fetches the group keys. The group keys are used to sign and encrypt the PubSub messages. The Raspberry Pi 4 had no trouble handling the encryption/decryption workload, and the only slowdown caused by the TPM occurs when connecting to the SKS server (see Sections 5.1 and 5.3).
Since the proof of concept achieved its technical goals, we then developed an industrial prototype that addresses a real-life railway industry challenge and relies on industry-proven systems. This is described in detail in the following section.
Predictive Maintenance in railway industry¶
Usage of IIoT can bring benefits along the industrial systems’ lifecycle, including but not limited to phases such as provisioning, deployment, production or operational use. Among these different stages, we explored the maintenance phase as a promising target for introduction of IIoTs.
The traditional approach to maintaining industrial devices consists of periodic inspections by operators, who call in a maintenance specialist when needed. This approach has multiple flaws:
- it typically leads to high ownership costs, especially in the case of large distributed systems,
- it is error-prone, as repetitive tasks with a low rate of actual positives are not managed well by humans,
- and low failure rates make it tempting for management to lengthen the inspection period, which mitigates the extra cost but increases the likelihood of a critical failure, with large disruption of service or even catastrophic situations.
In contrast, predictive maintenance is a technique aiming to optimize the maintenance time by using information remotely harvested from sensors (typically IIoT devices) monitoring the industrial system of interest. In this approach, monitoring the state of the industrial device makes it possible to predict a short-term failure, and thus to dispatch a maintenance operator just in time. Predictive maintenance also allows implementing an optimization feedback loop, by harvesting knowledge on the behavior of the industrial device and refining the definition of its states and thresholds, for instance through machine learning.
Predictive maintenance is a popular way to introduce IIoT technologies into industry as (1) it is purely additive to the industrial system, without impacting the core critical functionalities and (2) it has a fast Return On Investment, leading to a drastic decrease in cost of ownership if done correctly.
The prototype of our solution has been tested in an industrial device to ensure secure communications for predictive maintenance in the railway industry, which is at the same time a critical asset of the supply chain and an industrial system itself. Predictive maintenance, however, is not limited to this specific industry: it can and indeed should be applied in any industry where critical assets are costly to access and complex enough for failures to be difficult to predict on a strictly time-based schedule. Since modern manufacturing processes are typically distributed over large areas, involve multiple stakeholders [Bauer], and use complex devices, they typically fit both requirements.
Specifically, we monitored wireless trackside sensors (connected thermometers and an unwinder) to get information on the catenaries (heating and physical tension, respectively), and exploited it to infer the state of the catenaries in real time. These data are then transmitted to an analytics center that decides on and prioritizes the maintenance operations according to different business factors, such as maintenance team availability and current position, relative severity of the catenary state, or per-hour cost of the line disruption, thus saving maintenance costs while minimizing service disruption for train users. Figure 5 illustrates the use case. A more realistic use case would use some additional sensors, such as cameras to watch the catenary tension (correlating it to the unwinder) and lasers to measure the counter-weight altitude.
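As an illustration of how an analytics center might combine these business factors, the following Python sketch ranks catenary alerts. The `CatenaryAlert` fields, the severity scale and the scoring formula are hypothetical choices made for this example; they are not taken from the actual analytics server.

```python
from dataclasses import dataclass


@dataclass
class CatenaryAlert:
    """One inferred catenary state (all fields are illustrative assumptions)."""
    segment: str
    severity: float                  # 0.0 (nominal) to 1.0 (failure imminent)
    disruption_cost_per_hour: float  # cost of a line disruption on this segment
    team_travel_hours: float         # travel time of the nearest available team


def priority(alert: CatenaryAlert) -> float:
    # Hypothetical score: severity and disruption cost raise the priority,
    # while a long travel time for the maintenance team lowers it.
    return alert.severity * alert.disruption_cost_per_hour / (1.0 + alert.team_travel_hours)


def prioritize(alerts: list) -> list:
    """Return the alerts sorted from most to least urgent."""
    return sorted(alerts, key=priority, reverse=True)
```

In practice the weighting of each factor would be tuned by the operator; the point of the sketch is only that the decision is a ranking over several inputs rather than a single threshold.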
Security needs for Railway predictive maintenance¶
In order to define the security needs for the use case, we performed a risk analysis following the EBIOS methodology involving cybersecurity and safety experts from the railway industry. We defined three critical assets:
- the analytics datalake,
- the train service availability,
- the safety-critical network (i.e. the system in charge of real-time signaling).
Regarding data sent to the analytics, the main need is integrity, as forged data may trigger emergency counter-measures, generally involving interruption of service on the whole track. Thus, forged sensor data can be used to perform a business-critical denial of service, impacting asset (2). Furthermore, allowing an unknown amount of untrusted data to be integrated into the analytics datalake defeats its purpose and prevents its smart exploitation, impacting asset (1).
Availability must be guaranteed in order to ensure persistence of service (asset 2). However, considering the average duration of a maintenance operation compared to the information sampling period, the only challenge will be ensuring that the security measures taken do not detract from this objective.
Confidentiality is also needed on data, since sensors may indicate the real-time position of the trains, which is sensitive information since trains are national strategic assets. Hence, the operating company must ensure protection of such information.
From a connection perspective, authentication and authorization must be strictly enforced in order to ensure security of the safety-critical network (asset 3). Although the maintenance and safety networks are currently strictly separated, it may not be the case in the future, since signaling / critical sensors’ information may be used to refine predictive maintenance. Furthermore, it cannot be ruled out that some hidden path already exists between the two networks, for instance through corporate IT network. Secure authentication ensures that no rogue device can connect into the network, while authorization ensures that any legitimate device will be used the way it is intended to (e.g. that a subscriber will not begin to publish data).
Securing the gateway for predictive maintenance¶
We apply our approach to secure a trackside gateway for connected sensors, in order to ensure connectivity and security of the predictive maintenance. As illustrated in Figure 6, sensors are connected through LoRaWAN to the gateway, which is in turn connected to open networks (such as the Internet) through LTE.
We tested two sensors: an unwinder and a thermometer, both equipped with a STIMIO Railnode module, which embeds their outputs into LoRa frames. The gateway was based on a STIMIO Railnet, an ARM Cortex-A7-based gateway supporting both LoRaWAN and 2G/4G communications. The LoRaWAN Join Server responsible for sensor enrollment was hosted by the gateway. The Railnet gateway is railway-certified and used in actual railway infrastructures.
The MQTT broker was the open-source tool Mosquitto, while the Secure Key Service was implemented through an S2OPC server working in client/server OPC UA mode. Both services were run on the same physical remote server, although in an actual deployment we would encourage hosting them on two distinct machines.
Finally, the analytics center was made of a back-end S2OPC server, in charge of establishing the connection, getting OPC UA messages and transferring them to the analytics server. The latter has been simulated through a simple display of the results on a remote machine, although connectivity to TIRIS, the actual analytics server used by the Thales railway division, has been demonstrated in previous experiments.
Certificates from both the SKS and the clients embed the public key used for the authentication process described in Section 5.1 (associated with the related private key). In our use case, these certificates are all signed by the same Certificate Authority (CA), using a SHA256 hash and an RSA4096 key pair. All end-users owned the CA certificate, and thus were able to check the validity of other devices' certificates. While we did not experiment with it, the TPM could be used to ensure the integrity of the CA certificate, i.e. that an attacker did not tamper with the file. To do so, a hash of the certificate can be stored within the TPM and compared to the hash of the certificate file on disk before any usage.
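The integrity check suggested above can be sketched in a few lines of Python. This is a minimal sketch, not the implementation used in the experiment: the reference digest is assumed to have been read from the TPM beforehand (the retrieval itself is not shown), and the function names are illustrative.

```python
import hashlib
import hmac
from pathlib import Path


def certificate_digest(cert_path) -> bytes:
    """SHA-256 digest of the certificate file as stored on disk."""
    return hashlib.sha256(Path(cert_path).read_bytes()).digest()


def ca_certificate_is_intact(cert_path, reference_digest: bytes) -> bool:
    # reference_digest is assumed to come from TPM-protected storage;
    # how it is read from the TPM is outside the scope of this sketch.
    # compare_digest performs a constant-time comparison.
    return hmac.compare_digest(certificate_digest(cert_path), reference_digest)
```

The gateway would call `ca_certificate_is_intact` before each use of the CA certificate and refuse to connect if it returns `False`.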
In our experiment, the CA was not connected to clients and SKS, so certificate revocation was not directly possible. In most operationally deployed use cases, a complete PKI should be set up, although the actual implementation of this PKI might differ a lot depending on the use case needs and constraints.
As shown in Figure 7, we integrated a TPM ST33 into the STIMIO Railnet module, and installed the TSS2 open-source stack as well as the libtpm2. The latter was interfaced with the open-source, light-weight cryptography library mbedTLS, itself performing cryptography services for the Systerel S2OPC stack. During the authentication phase, the S2OPC stack uses this customized cryptography stack to perform hardware-based secure authentication, as described in Section 5, communicating with the SKS with OPC UA Client/Server directly over TCP/IP. During the communication phase, data received over LoRa from the remote IIoT sensor (Railnode) are processed by the LoRaWAN server, which dispatches messages to the local MQTT broker (Mosquitto), in turn queried by the S2OPC stack through the Paho library, a light-weight MQTT client implementation. Once the data are received and processed by the gateway sensor logic, new information is sent through Paho to the remote MQTT broker.
Using these software and hardware components, we were able to implement a prototype of a secure gateway for IIoT systems. More specifically, we ensured hardware-based authentication of the gateway, and then end-to-end signature and encryption for communications, as intended.
Some very basic level of edge computing was also performed in the gateway, such as computing relative heating and timestamping the data. In more realistic use cases, more extensive computation should be performed at the edge (i.e. in the gateway).
Successful connection, persistent connectivity and effective signature and encryption were demonstrated through the use of forged messages and rogue gateways, which were both rejected by the SKS and unable to log in and access any secret or data from the system.
Risk assessment methodology¶
Industrial System Security standards¶
The ISA/IEC 62443 is a set of standards relative to the security of industrial communication networks and systems. It describes a complete methodology to assess risk and evaluate security requirements as well as needed countermeasures. Among the standard’s parts, part 4-2 describing technical requirements for the security of industrial automation and control systems (IACS) is of special interest, as it is directly related to IACS products security.
While ISA/IEC 62443 is targeted at IACS, it is used by many other Operational Technology industries such as automotive, railway, energy, etc., as these industries only recently began to develop security standards tailored to their specific needs. Although guidelines had existed beforehand, none of them were recognized international standards.
Security requirements described in ISA/IEC 62443-4-2 are quite high-level, as they mostly describe capabilities rather than actual technologies and do not cover the relative efficiency of the respective implementations. Hence, ISA/IEC 62443 is better suited to a system architect aiming to provide security requirements to a cybersecurity expert than to an actual in-depth assessment of the security of the deployed system.
Using the STRIDE method, we can perform a high-level threat modelling that actually addresses counter-measure implementation and efficiency. While lacking the completeness of official standards such as ISA/IEC 62443, the STRIDE approach defines six common risks on systems: spoofing, tampering, repudiation, information disclosure, denial of service and elevation of privilege. It is the responsibility of product owners to (1) state instances of these risks on their specific system and (2) enunciate the needed counter-measures. While in industrial deployments these risks should be connected to an actual risk analysis, we judged the concise nature of the STRIDE methodology to fit well the limited space and effort that we could allocate to this task.
This is done in the following study, on the scope of OPC UA communications (from the gateway OPC UA front-end to the analytics OPC UA back-end). Anything outside this scope is excluded, i.e. events on the IIoT side (including the receiving end of the gateway) and communications between the server back-end and the analytics. As a consequence, we added a third, more prospective step to each of the detected threats, in the cases where residual risks still existed after the application of the counter-measures proposed in the article: "residual risks and remediation". In this latter case, we mention possible remediations that are out of the scope of our work.
In the following assessment, we describe threats following the STRIDE approach, and then mention the related ISA/IEC 62443 security requirement as expressed in part 4-2 (component security) of the standard. Figure 8 illustrates the dependencies between the different threats, leading to 3 feared events (in black in the figure): data disclosure, data corruption and denial of service.
Residual risk assessment¶
Threat 1.1: A rogue gateway may try to get a group key from a legitimate SKS, allowing an attacker to breach data confidentiality and potentially to compromise the IIoT subnetwork.
Protections and Counter-measures: In order to get a group key, a gateway must send its certificate, signed with the private key of its Certificate Authority, to the SKS, within a message signed with its own private key. Since the certificate contains the gateway public key, the SKS can verify (1) the authenticity of the message (the signature matches the public key) and (2) the credential of the source (the certificate is signed by the CA). Signatures are performed with RSA4096, which is the state-of-the-art asymmetric cryptography standard, safe against current cryptanalysis techniques.
Related ISA/IEC 62443-4-2 security requirement: Component Requirement (CR) 1.2 — Software process and device identification and authentication and CR 1.8 — Public key infrastructure certificates with security level 4, as well as CR 1.5 — Authenticator management and CR 1.9 — Strength of public key-based authentication in security level 2 with basic OPC UA PubSub protection. Our work allows reaching security level 4 for the latter.
Residual risk and remediation: New cryptanalysis techniques may eventually make RSA4096 obsolete. The remediation is to adopt a cryptographic solution resilient to these new techniques (e.g. a quantum-resistant cryptography algorithm).
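The two checks performed by the SKS for threat 1.1 can be illustrated with the following Python sketch. It is deliberately a toy: it uses textbook RSA without padding, keys built from two caller-supplied primes, and an ad hoc certificate structure, all of which are assumptions made for readability. A real deployment relies on RSA4096 with standard padding and X.509 certificates handled by a vetted cryptographic library.

```python
import hashlib
from dataclasses import dataclass

E = 65537  # public exponent shared by all toy keys in this sketch


def make_keypair(p: int, q: int):
    """Build a toy RSA key pair (public, private) from two primes."""
    n = p * q
    d = pow(E, -1, (p - 1) * (q - 1))  # modular inverse, Python 3.8+
    return (n, E), (n, d)


def _digest(data: bytes, n: int) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n


def sign(private_key, data: bytes) -> int:
    n, d = private_key
    return pow(_digest(data, n), d, n)


def verify(public_key, data: bytes, signature: int) -> bool:
    n, e = public_key
    return pow(signature, e, n) == _digest(data, n)


@dataclass
class Certificate:
    """Hypothetical, simplified certificate: subject, public key, CA signature."""
    subject: str
    public_key: tuple
    ca_signature: int = 0

    def signed_bytes(self) -> bytes:
        n, e = self.public_key
        return f"{self.subject}|{n}|{e}".encode()


def issue_certificate(ca_private, subject: str, public_key) -> Certificate:
    cert = Certificate(subject, public_key)
    cert.ca_signature = sign(ca_private, cert.signed_bytes())
    return cert


def sks_accepts(ca_public, cert: Certificate, request: bytes, request_signature: int) -> bool:
    # (2) credential of the source: the certificate is signed by the CA
    credential_ok = verify(ca_public, cert.signed_bytes(), cert.ca_signature)
    # (1) authenticity of the message: the signature matches the certified key
    message_ok = verify(cert.public_key, request, request_signature)
    return credential_ok and message_ok
```

A rogue gateway fails the first check if it presents a self-signed certificate, and fails the second if it replays a legitimate certificate without owning the matching private key, which is the key the TPM protects.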
Threat 1.2: On the other side, a rogue SKS may attempt to lure a legitimate gateway into a false network, in order to breach data confidentiality and potentially to compromise the whole network. In order to do so, it could usurp an IP address or compromise a Domain Name Server.
Related ISA/IEC 62443-4-2 security requirement: same as threat 1.1.
Protections and Counter-measures: before asking for a group key, the gateway will verify the SKS certificate. If it does not match its CA, the connection will be interrupted. The gateway stores the CA certificate in its mass memory.
Residual risk and remediation: same as threat 1.1.
Threat 2.1: An attacker may replace the SKS CA certificate in order to connect it with a rogue SKS.
Protections and Counter-measures: Standard OS hardening, with writing protection and logging for sensitive files such as CA certificates.
Related ISA/IEC 62443-4-2 security requirement: CR 3.4 Software and information integrity (with a scope restricted to information) with security level 1.
Residual risk and remediation: Storing a hash of the certificate in the TPM and systematically checking the certificate against it would prevent this attack.
Threat 3.1: A message cannot be traced to its emitter.
Protections and Counter-measures: In compliance with OPC UA PubSub standard and following the publisher/subscriber philosophy, messages published in a group are signed with group keys, not individual keys. Thus, a group cannot repudiate a message, but an individual client (either a publisher or a subscriber) can.
Related ISA/IEC 62443-4-2 security requirement: CR 2.12 — Non-repudiation with security level 3.
Residual risk and remediation: Modifying the OPC UA PubSub standard to provide individual signature keys, while encryption keys would stay group-based. This, however, would raise many issues in the key distribution process, especially if there is more than one publisher.
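The repudiation issue of threat 3.1 follows directly from the use of a shared symmetric key, as the short Python sketch below shows. It assumes an HMAC-SHA256 signature computed with the group signing key; the function names are illustrative and do not reproduce the S2OPC API.

```python
import hashlib
import hmac


def sign_group_message(group_signing_key: bytes, payload: bytes) -> bytes:
    """Signature computed with the key shared by the whole security group."""
    return hmac.new(group_signing_key, payload, hashlib.sha256).digest()


def verify_group_message(group_signing_key: bytes, payload: bytes, signature: bytes) -> bool:
    expected = sign_group_message(group_signing_key, payload)
    return hmac.compare_digest(expected, signature)
```

Any two members holding the same group key produce byte-identical signatures for the same payload, so a verifier can prove that a message comes from the group but cannot attribute it to one member.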
Information disclosure (Confidentiality)¶
Threat 4.1: An attacker may try accessing a group data by simply sniffing the packets on the open network.
Protections and Counter-measures: group messages are encrypted with group keys, using AES256 encryption, which is the state-of-the-art symmetric cryptographic standard and makes the data unreadable for an attacker.
Related ISA/IEC 62443-4-2 security requirement: CR 4.3 — Use of cryptography with security level 4.
Threat 4.2: An attacker may try reading group keys by simply sniffing the packets on the open network during a legitimate gateway authentication.
Protections and Counter-measures: while transmitted to the clients, group keys are encrypted with temporary session keys, using AES256 encryption, which is the state-of-the-art symmetric cryptographic standard and makes the keys unreadable for an attacker.
Related ISA/IEC 62443-4-2 security requirement: CR 4.3 — Use of cryptography with security level 4.
Threat 4.3: An attacker may try to read session keys by simply sniffing the packets on the open network during a legitimate gateway authentication.
Protections and Counter-measures: session keys are computed on both sides using resident secret keys, and the data transmitted through the network do not allow a listening third party to re-compute them.
Related ISA/IEC 62443-4-2 security requirement: CR 4.3 — Use of cryptography with security level 4.
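Threat 4.3 relies on both ends deriving the same session keys locally. The following Python sketch shows one such derivation, assuming the TLS-style P_SHA256 pseudo-random function applied to the client and server nonces, which are themselves assumed to be exchanged encrypted under the public key of the peer. The function names and the default key length are illustrative and are not taken from S2OPC.

```python
import hashlib
import hmac


def p_sha256(secret: bytes, seed: bytes, length: int) -> bytes:
    """TLS-style P_hash with HMAC-SHA256 (RFC 5246, section 5)."""
    output = b""
    a = seed  # A(0) = seed
    while len(output) < length:
        a = hmac.new(secret, a, hashlib.sha256).digest()  # A(i) = HMAC(secret, A(i-1))
        output += hmac.new(secret, a + seed, hashlib.sha256).digest()
    return output[:length]


def derive_client_keys(client_nonce: bytes, server_nonce: bytes, length: int = 32) -> bytes:
    # Assumption: the keys protecting client-to-server traffic use the server
    # nonce as the secret and the client nonce as the seed.
    return p_sha256(server_nonce, client_nonce, length)
```

Both the gateway and the SKS call `derive_client_keys` with the same two nonces and obtain the same bytes, so the session key itself never travels on the network. This also explains threat 4.8: whoever reads the clear-text nonces in memory can run the same derivation.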
Threat 4.4: An attacker may try to send its own forged nonce after a legitimate gateway authenticates, and then read the server nonce and thus spoof the legitimate gateway and compute group keys.
Protections and Counter-measures: the nonce of the client must be signed by the private key matching the client certificate, which is stored in the TPM, thus preventing an attacker from stealing it. Thus, it is not possible to send a forged nonce to the SKS.
Related ISA/IEC 62443-4-2 security requirement: CR 1.9 — Strength of public key-based authentication in security level 2 with basic OPC UA PubSub protection. Our work allows reaching security level 4.
Threat 4.5: An attacker may extract a gateway’s private key in order to set up a rogue gateway with legitimate credentials. This threat is a technical step for threat 1.1.
Protections and Counter-measures: The gateway’s private key being stored in a TPM, it is not possible to access it directly without supplier support.
Related ISA/IEC 62443-4-2 security requirement: CR 1.5 — Authenticator management in security level 2 with basic OPC UA PubSub protection. Our work allows reaching security level 4.
Threat 4.6: An attacker may try to read messages within the MQTT broker, which is not trusted.
Protections and Counter-measures: Messages are encrypted end-to-end, which means that messages stored in the MQTT broker are encrypted with AES256 group keys.
Related ISA/IEC 62443-4-2 security requirement: CR 4.3 — Use of cryptography with security level 4.
Threat 4.7: An attacker may try to access a group key (or multiple ones) on a legitimate gateway, either by compromising a side process or by physical access.
Protections and Counter-measures: Group keys are stored as standard files. They can be read-protected from anyone but the OPC UA process with standard OS rights management, and encrypted at rest with an encrypted file system.
Related ISA/IEC 62443-4-2 security requirement: CR 4.1 — Information confidentiality and CR 4.3 — Use of cryptography with security level 4 (although not sufficient in this case).
Residual risk and remediation: Elevation of privilege from malicious code is always possible, and is addressed by CR 3.2 — Protection from malicious code. Such activity may be detected by an HIDS, and possibly a NIDS.
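The file-level protection mentioned for threat 4.7 can be sketched as follows in Python, assuming a POSIX system. The function names are illustrative; the sketch only covers access rights and does not address the encrypted file system.

```python
import os


def write_group_key(path: str, key: bytes) -> None:
    """Store a group key in a file readable and writable by its owner only."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as key_file:
        # Enforce the mode even if the file already existed with wider rights
        # (os.fchmod is available on POSIX systems only).
        os.fchmod(key_file.fileno(), 0o600)
        key_file.write(key)


def is_owner_only(path: str) -> bool:
    """True if neither group nor other users have any right on the file."""
    return (os.stat(path).st_mode & 0o077) == 0
```

As noted in the residual risk, such rights do not stop a process that has gained the privileges of the owner or of the administrator.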
Threat 4.8: An attacker may try to access the clear-text server nonce on a legitimate client gateway, either by compromising a side process or by physical access, and thus would be able to compute the session key locally and then to get access to the group keys in transit (so this attack bypasses any protection on the group keys at rest).
Protections and Counter-measures: While the nonce may be dumped into a file in the file system, it is also possible that it stays in memory during its whole usage, in which case protections inherited from rights management on the file system will be irrelevant (cf. threat 4.7). In that case, the nonce stored in the memory of a process will be protected against other processes by the CPU Memory Management Unit, if available, or by the Memory Protection Unit. Some CPUs have neither and are particularly vulnerable. Such systems will also typically have no file system or rights management whatsoever. Protection against these attacks on memory requires installing partitioned systems such as hypervisors, or using dedicated trusted enclaves available within the hardware.
Related ISA/IEC 62443-4-2 security requirement: Not covered.
Residual risk and remediation: Same as threat 4.7.
Denial of Service (Availability)¶
Threat 5.1: An attacker may try to flood a SKS with connection requests in order to create a Denial of Service (DoS) for key distribution.
Protections and Counter-measures: OPC UA implements classical measures, such as a limited number of attempts per host, which remove the risk of single-source DoS. In any case, the standard specifies that the SKS may be duplicated, and this feature is implemented in our solution.
Related ISA/IEC 62443-4-2 security requirement: CR 1.11 — Unsuccessful login attempts and CR 7.1 — Denial of service protection with security level 4.
Residual risk and remediation: Firewalling mitigates the risk of Distributed DoS (DDoS).
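A per-host limit on connection attempts, as mentioned for threat 5.1, can be sketched with a sliding window in Python. The class name, the default limits and the injectable clock are assumptions made for this example; this is not the mechanism implemented in S2OPC.

```python
import time
from collections import defaultdict, deque


class AttemptLimiter:
    """Allow at most max_attempts per host within a sliding time window."""

    def __init__(self, max_attempts=5, window_seconds=60.0, clock=time.monotonic):
        self._max = max_attempts
        self._window = window_seconds
        self._clock = clock  # injectable so the behavior can be tested
        self._attempts = defaultdict(deque)

    def allow(self, host: str) -> bool:
        now = self._clock()
        attempts = self._attempts[host]
        # Forget attempts that have left the time window.
        while attempts and now - attempts[0] >= self._window:
            attempts.popleft()
        if len(attempts) >= self._max:
            return False  # host is over its quota, reject the connection
        attempts.append(now)
        return True
```

Because the quota is tracked per source host, this only addresses single-source flooding; a distributed attack from many hosts stays within each individual quota, which is why firewalling is listed as the remediation.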
Elevation of Privilege (Authorization)¶
Threat 6.1: An already compromised subscriber (e.g. a monitoring client) may attempt publishing (writing) its group data instead of reading them.
Protections and Counter-measures: This only applies to the data on which the subscriber has reading rights, so it will not be able to publish data it cannot subscribe to.
Related ISA/IEC 62443-4-2 security requirement: CR 2.1 — Authorization enforcement with security level 1.
Residual risk and remediation: Since signature is group-based (cf. threat 3.1), it is not possible to prevent this attack without a compensating counter-measure such as network partitioning, network monitoring with HIDS and NIDS. Another solution is to ensure full non-repudiation for individual devices.
Threat 6.2: An already compromised publisher (e.g. a monitoring client) may attempt subscribing (reading) its group data instead of writing them.
Protections and Counter-measures: This only applies to the data on which the publisher has writing rights, so it will not be able to subscribe to data it cannot publish. It also only benefits an attacker if there is more than one publisher for this data.
Related ISA/IEC 62443-4-2 security requirement: CR 2.1 — Authorization enforcement with security level 1.
Residual risk and remediation: In an asymmetric signature pattern, it would be possible to send the verification (public) keys of publishers only to subscribers.
Organizations such as the German IUNO or the French GIMELEC mainly promote methods and tools to ensure the conformance of national companies to Industry 4.0 standards. While they may also support research activities, they mainly propose high-level guidelines for the industry. In the case of IUNO, the usage of a TPM for authentication is mentioned in a paper, but no implementation is provided, nor is its usage within an existing communication protocol described. In the same paper, hardware properties (SRAM PUFs) are exploited to ensure secure authentication by guaranteeing the integrity and confidentiality of the secret key, similarly to our usage of the TPM. This work, however, relies on a specific architecture which may not be usable in actual use cases. Furthermore, the article gives no indication on how to integrate the proposed protocol into established standards.
Regarding communication protocols for industrial use cases, the Data Distribution Service (DDS) is an open standard for real-time distributed communications. While the standard is open regarding the actual communications implementation, it supports a brokerless publish/subscribe pattern of communications, and so can be compared to OPC UA, which offers such a possibility in its PubSub version. DDS is a rich standard offering fine-grained control of the Quality of Service. A security standard has been published, allowing security patterns similar to OPC UA PubSub. Both the base and the security DDS specifications, however, only describe APIs, whereas OPC UA provides in-depth protocol specifications, which drastically increases tool interoperability. Furthermore, to our knowledge, no national cybersecurity authority has performed a thorough review of DDS Security, as the German BSI did for OPC UA.
The OPC Foundation also mentions the possibility of using a TPM or other secure storage solutions. However, it does not discuss its potential usage or benefits for authentication.
The wolfTPM library from wolfSSL allows integration of a TPM into the wolfSSL library, a cryptographic library implementing the TLS 1.3 standard. This approach is quite similar to ours, but it only applies to client/server connections. Its usage in publish/subscribe communication schemes is not mentioned in the literature. Another example of such an approach is presented by the authors of A Type-safe, TPM Backed TLS Infrastructure, where the benefits of integrating a TPM within the TLS cryptography standard are exploited not only for authentication but also for payload encryption. However, as in the former case, it does not cover the publish/subscribe topology, and in the case of multiple publishers it may lead to multiple unnecessary encryptions.
We propose in this paper a consolidation of the authentication of devices that are part of a distributed system, with both legacy devices that cannot be updated because of safety requirements, and remote devices that must use the public network and require end-to-end encryption.
The proposed solution uses open protocols to communicate its data (MQTT, OPC UA PubSub), enhancing the interoperability of the system, hence its maintainability. It also uses a TPM hardware module to conceal secret identifiers and guarantee the authenticity of the remote modules. The developed prototype shows that the remote data distribution works efficiently and securely. Functional tests included in the S2OPC suite all passed successfully, and operational communications were working as intended. Key distribution was performed in a relatively long time (around 10 seconds), probably because of the low bandwidth of the SPI bus connecting the TPM to the CPU. Considering the timing needs of our application, such duration was acceptable. More constrained applications may require a closer integration of the TPM.
On the security side, the experiment and the subsequent risk assessment described in Section 7.3 highlight the next subjects of attention: (1) the protection of the group keys of the OPC UA PubSub security groups (threat 4.7), and (2) the protection of the clear-text nonce (threat 4.8). The first item is under discussion within the OPC Foundation Security Working Group, and an amendment may soon be added to the specification.
On a final note, while the performance of the remote component proved satisfactory on the industrial prototype, complete industrialization of the solution requires de-risking the whole architecture, and more specifically further performance testing, since the domain of application of the solution is Industrial IoT, where a high number of devices leads to situations similar to data lakes (this is particularly the case in the railway industry). In the current architecture, there are two bottlenecks: the broker and the SKS server.
Broker reliability and performance is a well-known topic that our architecture is delegating to third-party technology, such as MQTT [banno]. On the other hand, the SKS server should be analyzed with more attention, with focus on the public key infrastructure configuration and maintenance for numerous devices.
This work has been financially supported by the European commission through the ELASTIC project (H2020 grant agreement 825473).