A Platform for Run-time Health Verification of Elastic Cyber-physical Systems

Work was partially supported by the European Commission under the U-Test H2020 project


Example of Elastic Cyber-Physical System (eCPS) for analysis of streaming data


A Cyber-physical System (CPS) is a system which has components deployed both in the physical world (e.g., industrial machines, smart buildings) and in computing environments (e.g., data centers, cloud infrastructures). For example, a smart factory can be seen as a CPS with components: (i) inside assembly robots, (ii) inside sensor gateways deployed in the factory to collect environmental conditions, and (iii) deployed in a private data center to analyze data collected from robots and sensor gateways. An Elastic Cyber-physical System (eCPS) can further add/remove components at run-time, from computing resources to physical devices. Elasticity enables eCPSs to align their costs, quality, and resource usage to load and owner requirements.

The owner of a smart factory builds an elastic cyber-physical system (eCPS) for analysis of streaming data coming from the factory's industrial robots and environmental sensors. The system can scale to adapt to changes in load or factory requirements by adding and removing both physical and cyber components. Factory sensors and robots send data to physical devices called Sensor Gateways. The gateways perform local data processing and send the data through an HAProxy HTTP Load Balancer to Streaming Analytics services hosted in virtual machines in a Private Cloud. The Streaming Analytics service is deployed as a software artifact in a Tomcat web server. Selected analytics results are published to interested parties through a third-party Messaging Service offered as-is by a Public Cloud provider.

The smart factory owner wants to ensure that the system is healthy and operates within specified parameters, especially after scaling actions which add/remove components. That is, the system is correctly configured, its components are deployed and running, and it provides the expected performance.

Approach for run-time verification of elastic cyber-physical systems


We introduce a platform for run-time health verification of elastic cyber-physical systems (eCPSs), providing functionality for:

  • Specifying the logical structure of elastic cyber-physical systems, introducing a model capturing their deployment stack and communication dependencies.
  • Managing the run-time structure of elastic cyber-physical systems, introducing a decentralized notification-based system for managing addition/removal of system components.
  • Specifying verification strategies, introducing a domain-specific language for defining periodic and event-driven execution of direct and indirect verification tests on different system components.
  • Executing verification strategies, introducing a distributed mechanism based on remote code execution for execution of verification tests and collection of verification results.
  • Notifying interested parties about the verification result, introducing mechanisms for notifying users about changes in the result of verification tests.


Specifying the logical structure of elastic cyber-physical systems

We need a model for capturing the deployment stack and dependencies of system components. As our goal is run-time verification of real eCPSs, the model must capture the state of the run-time infrastructure. The model must also be applicable to heterogeneous eCPSs, and easy to extend with additional types of components depending on particular systems. To this end we introduce an abstract model for representing eCPS components and their run-time instances. Our model targets only the infrastructure of eCPSs and is designed with simplicity and generality in mind. These properties allow the model to be applied to a wide range of systems without requiring a large amount of domain-specific knowledge.

We first capture Physical Machine, Physical Device, and Virtual Machine (VM) components, crucial in describing systems which run both in the cloud and in the physical world. We capture Virtual Container components to describe and verify virtualization containers such as Docker. Increasing the verification detail, we capture OS Process and Service components. Capturing components from different stack levels enables hierarchical testing, in which we can verify the lower level (e.g., VM) and, if that succeeds, verify the higher levels (e.g., an OS Process running inside a VM). Additional component types can be defined by extending the Type enumeration, as sketched below.
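For concreteness, such a Type enumeration could be written in Python as follows. This is a minimal sketch: the member list mirrors the component types named above and in the strategy language below, not the platform's exact source.

  from enum import Enum

  class Type(Enum):
      # component types of the model; extend this enumeration to
      # support additional, system-specific component types
      PHYSICAL_MACHINE = "PhysicalMachine"
      PHYSICAL_DEVICE = "PhysicalDevice"
      GATEWAY = "Gateway"
      VIRTUAL_MACHINE = "VirtualMachine"
      VIRTUAL_CONTAINER = "VirtualContainer"
      SOFTWARE_CONTAINER = "SoftwareContainer"
      SOFTWARE_PLATFORM = "SoftwarePlatform"
      PROCESS = "Process"
      SERVICE = "Service"
      COMPOSITE = "Composite"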

A system Component can have one or more Component Instances according to the system's run-time structure, e.g., multiple instances of the Streaming Analytics component. A component instance can be hostedOn another component instance, e.g., an OS Process running inside a Virtual Machine. The reverse relationship of hostedOn is hosts, enabling model navigation in the opposite direction. Instances can also communicate with other instances, captured with a connectsTo relationship. Further, components can be combined to achieve functionality. We use the term Composite Component to describe combinations of multiple system components working towards the same functionality goal, for example, the Streaming Analytics component using a VM hosting a Web Server hosting in turn a software Service.
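The relationships above can be summarized in code. Below is a minimal sketch of the model as Python dataclasses, reusing the Type enumeration sketched earlier; field names mirror the relationships described, while the platform's internal representation may differ.

  from dataclasses import dataclass, field
  from typing import List, Optional

  @dataclass
  class Component:
      name: str
      type: Type                      # e.g., Type.VIRTUAL_MACHINE
      containedUnits: List["Component"] = field(default_factory=list)

  @dataclass
  class ComponentInstance:
      uuid: str
      component: Component            # the component this instance belongs to
      hostedOn: Optional["ComponentInstance"] = None       # e.g., a Process on a VM
      hosts: List["ComponentInstance"] = field(default_factory=list)       # reverse of hostedOn
      connectsTo: List["ComponentInstance"] = field(default_factory=list)  # communication links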

eCPS structure specification in JSON

To verify a system, its static structure description is submitted to our platform as JSON. The system is described as a recursive composition of components according to the model introduced above. Each component has a name, a type, and potentially containedUnits. A component can also be hosted on another component, indicated by the hostedOn property.

{
  "type": "Composite",
  "name": "SportsAnalytics",
  "containedUnits": [
    {
      "type": "Composite",
      "name": "DataCapture",
      "containedUnits": [
        {
          "type": "Gateway",
          "name": "Gateway.DataCapture"
        },
        {
          "hostedOn": "Gateway.DataCapture",
          "type": "Process",
          "name": "Process.DataCapture"
        }
      ]
    },
    {
      "type": "Composite",
      "name": "LoadBalancer",
      "containedUnits": [
        {
          "type": "VirtualMachine",
          "name": "VM.LoadBalancer"
        },
        {
          "hostedOn": "VM.LoadBalancer",
          "type": "Process",
          "name": "Process.HAProxy"
        }
      ]
    },
    {
      "type": "Composite",
      "name": "StreamingAnalytics",
      "containedUnits": [
        {
          "type": "VirtualMachine",
          "name": "VM.StreamingAnalytics"
        },
        {
          "hostedOn": "VM.StreamingAnalytics",
          "type": "Process",
          "name": "Process.Tomcat"
        },
        {
          "hostedOn": "Process.Tomcat",
          "type": "Service",
          "name": "Service.StreamingAnalytics"
        }
      ]
    },
    {
      "type": "Composite",
      "name": "MessagingService",
      "containedUnits": [
        {
          "type": "Service",
          "name": "Service.MessagingService"
        }
      ]
    }
  ]
}
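Since the prototype (described below) exposes the platform's operations as RESTful services, submitting this description amounts to a plain HTTP POST. A hypothetical usage sketch with the requests library follows; the endpoint path and port are assumptions for illustration, not the platform's documented API.

  import json
  import requests  # third-party HTTP client

  with open("smart_factory_structure.json") as f:
      structure = json.load(f)

  # endpoint path and port are hypothetical
  response = requests.post("http://localhost:5000/structure",
                           json=structure, timeout=10)
  response.raise_for_status()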

Test strategies description language

#Description
#name: "TestName"
#description: "human readable description"
#timeout: 10

#Triggers
#every:  30 s
#event:  "E1" , "E2" on UnitType.VirtualMachine
#event:  "E3" , "E4" on UnitType.Process


#Execution
#executor: UnitType.VirtualMachine for UnitType.VirtualMachine, UnitType.VirtualContainer, UnitType.Process
#executor: UnitType.VirtualContainer for UnitType.Process
#executor: UnitType.SoftwareContainer for UnitType.SoftwareContainer
#executor: UnitType.SoftwareContainer for UnitID."A-Za-z0-9_", UnitID."Process.ProcessNAME", UnitUUID."A-Za-z0-9_."
#executor: UnitID."A-Za-z0-9_" for UnitID."Process.ProcessNAME", UnitUUID."A-Za-z0-9_."
#supported types are Service | Process | SoftwarePlatform | PhysicalDevice | SoftwareContainer | VirtualContainer | Gateway | VirtualMachine | PhysicalMachine

Domain-specific language for verification strategies

After determining When to verify each health indicator, the user defines one or more verification descriptions for each indicator using our domain-specific language. The strategy for verifying that the VM component is healthy is shown in the listing "Verification strategy to test if VM is network accessible" below. As the Streaming Analytics is elastic, network accessibility should be verified when a new VM is created. A test Trigger entry is therefore added (Line 5) for the event "Added" on ID."VM.StreamingAnalytics", representing the Streaming Analytics VMs, detected by our verification platform. VMs can also fail at run-time due to various factors, so network accessibility should also be verified periodically during the system's run-time. To this end, an every: 30 s periodic test trigger is defined in the strategy (Line 6). The executor of the test must also be specified. VM network accessibility should be verified from outside the VM; thus, a distinct executor is requested (Line 9), having the type VirtualMachine. Finally, a timeout specifies how long to wait for the test result before considering that it has failed (Line 2). This is useful if something has happened to the test executor component, e.g., it has also failed.

Evaluation

Verification platform prototype

We implement our run-time verification platform prototype in Python due to its low resource consumption and reduced complexity in deploying and operating the platform. Our platform has a centralized Verification Orchestrator providing most of the platform's functionality, and Test Executor components deployed alongside system components to enforce verification tests. We expect custom test executors to be implemented for particular target systems, and provide a Messaging Queue. The queue acts as a communication broker between the Verification Orchestrator and the Test Executors, hiding their particular implementation details from each other. We use RabbitMQ as the queuing middleware, as it supports both the AMQP and MQTT protocols, providing a queuing solution applicable to a wide range of systems and components.

The platform's functionality is divided between: (i) a System Structure Manager handling any structure-related operation; (ii) an Events Manager processing events received from the test executors due to verification results or addition/removal of system component instances; (iii) a Tests Execution Manager dispatching verification tests; (iv) a Persistence Manager using SQLite to persist system and verification information; and (v) a UI Manager handling interactions with the platform's web user interface. For ease of use and integration with third-party software components, we implement the interactions with our run-time verification platform as RESTful services using Flask and JSON. We also implement a web-based interface relying on HTML, JavaScript, and D3.js, enabling human users to interact with our platform. A verification test is a self-contained sequence of Python code, and we provide a library for reporting the results of particular test executions. We also provide a contextualization mechanism that injects into each Python test variables denoting the IDs and UUIDs of the test target and executor, to be used in the test.
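To illustrate the executor side, the following is a minimal sketch of a custom Test Executor consuming test requests from RabbitMQ with the pika AMQP client. The queue names, message format, and run_test helper (sketched further below) are assumptions for illustration, not the platform's actual protocol.

  import json
  import pika  # AMQP client for RabbitMQ

  def on_request(channel, method, properties, body):
      request = json.loads(body)            # hypothetical message format
      result = run_test(request["code"],    # test code shipped by the orchestrator
                        request["targetID"],
                        request["executorID"])
      channel.basic_publish(exchange="",
                            routing_key="verification.results",
                            body=json.dumps(result.__dict__))
      channel.basic_ack(delivery_tag=method.delivery_tag)

  connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
  channel = connection.channel()
  channel.queue_declare(queue="verification.tests")
  channel.queue_declare(queue="verification.results")
  channel.basic_consume(queue="verification.tests", on_message_callback=on_request)
  channel.start_consuming()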

Verification strategy to test if VM is network accessible

  Description
  timeout: 30

  Triggers
  event: "Added" on ID."VM.StreamingAnalytics"
  every:  30 s

  Execution
  executor: distinct Type.VirtualMachine for Type.VirtualMachine

Describing verification strategy

We write one verification strategy for each verification test, structured in three parts: (i) test properties (Description), (ii) specification of test execution Triggers, and (iii) test Execution information. The test properties specify for each test a name, a human-readable description, and an optional timeout. The name is used to identify the test. The timeout is used to mark as failed tests which do not return results within the specified interval of time.

We use triggers to specify when a particular test should be executed. A trigger can be an event, or a periodic timer.

We support both direct and indirect tests, as detailed in the next section. Thus, in the last strategy section we specify which component will execute the test. One or more executor specifications can be defined, describing which specific executor executes the test for which specific component identifier. The distinct keyword states that the test executor must be different from the test target, which is useful for executing indirect tests from components with the same identifier (e.g., pinging a VM from another VM).
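For illustration, the distinct keyword could be resolved as in the following sketch, reusing the instance model sketched earlier; the random selection policy is an assumption, not necessarily the platform's behavior.

  import random

  def pick_executor(instances, executor_type, target, distinct=False):
      # candidate executors of the requested type; with 'distinct',
      # the test target itself is excluded from the candidates
      candidates = [i for i in instances
                    if i.component.type == executor_type
                    and (not distinct or i.uuid != target.uuid)]
      return random.choice(candidates) if candidates else None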

Verification test to test if VM is network accessible

  #test implemented as standalone python code
  # all imports must be local
  os = __import__('os')
  #contextualized "targetID" variable
  #executing custom OS command
  response = os.system("ping -c 1 " + targetID)
  #construct result
  if response == 0: #if ping fails response is 256
    success = 100
  else:
    success = 0
  #TestResult type provided by our verification platform
  return TestResult(success, response)

Writing verification test

The user must further decide How each health indicator can be verified, depending on system capabilities. The VM network accessibility indicator can be verified using the ping command available in each VM's operating system. Using our platform, the test is defined as a standalone Python script, shown above. The script can use contextualized variables injected at test execution by our platform, such as targetID, which for VMs is their IP address (Line 6). It is the responsibility of the test designer to use domain-specific knowledge in implementing the test logic and deciding when a test is successful (Lines 8-11). Each test result is returned using the type defined by our platform (Line 13).
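Because a test is a straight-line script ending in a return statement, the platform has to wrap it into a callable before executing it. The following run_test helper is a minimal sketch of how such wrapping and contextualization could work; the TestResult stub and the wrapping approach are assumptions for illustration.

  import textwrap

  class TestResult:
      # stub of the result type provided by the platform (illustrative only)
      def __init__(self, successful=None, details=None, meta=None):
          self.successful, self.details, self.meta = successful, details, meta

  def run_test(test_code, target_id, executor_id):
      # wrap the straight-line test script into a function so that its
      # final 'return TestResult(...)' statement becomes legal, and inject
      # the contextualized variables as parameters; a real implementation
      # would inject further variables (e.g., credentials) the same way
      source = ("def __test__(targetID, executorID, TestResult):\n"
                + textwrap.indent(test_code, "    "))
      namespace = {}
      exec(source, namespace)
      return namespace["__test__"](target_id, executor_id, TestResult)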


Verification strategy to test black-box components

Description
name: "CloudAMQPAlive"
description: "Check if CloudAMQP is accessible"
timeout: 30

Triggers
every: 30 s

Execution
executor: UnitType.VirtualMachine for UnitID."Service.MessagingService"

Verification test to check if CloudAMQP is alive

 os = __import__('os')
 base64 = __import__('base64')
 httplib = __import__('httplib')

 url = "/api/overview"
 instanceIP = targetID
 #"username" and "password" are the contextualized CloudAMQP credentials
 auth = base64.encodestring('%s:%s' % (username, password)).replace('\n', '')
 webservice = httplib.HTTPS(instanceIP)
 webservice.putrequest("GET", url)
 webservice.putheader("User-Agent", "Python http auth")
 webservice.putheader("Content-type", "text/html; charset=\"UTF-8\"")
 webservice.putheader("Authorization", "Basic %s" % auth)
 webservice.endheaders()
 #read the reply status before the body
 statuscode, statusmessage, headers = webservice.getreply()
 res = webservice.getfile().read()
 successful = "OK" in statusmessage
 details = "/api/overview returned " + str(statusmessage)
 meta = {}
 meta["type"] = "Checks if RabbitMQ API responds to GET"
 return TestResult(successful=successful, details=details, meta=meta)

Describing verification strategy

In the following we discuss how our approach can be used to verify black-box components which do not allow the installation of test executors. We focus on the Messaging Service component using CloudAMQP, which provides a standalone RabbitMQ instance accessible through an API over the Internet.

The system owner answers the What? and When?: we verify that the component is alive, i.e., its provider has not encountered failures. The How? to verify is answered by checking if the RabbitMQ API "/api/overview" is online and accessible. Then, a developer implements the verification test as a sequence of Python code issuing an HTTP GET with the CloudAMQP credentials to the service's API. The system owner or developer further defines a verification strategy to execute the test every 30 seconds from any running VM, describing the test executor as executor: UnitType.VirtualMachine for UnitID."Service.MessagingService". Finally, the developer can send an alive message to our platform, notifying that the component is running and should be tested.
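The test above uses the legacy Python 2 httplib interface. For reference, here is the same check sketched against Python 3's standard library, following the document's convention of local imports; targetID, username, and password are again assumed to be contextualized by the platform.

  # Python 3 variant of the CloudAMQP aliveness test (illustrative sketch)
  base64 = __import__('base64')
  client = __import__('http.client', fromlist=['HTTPSConnection'])

  auth = base64.b64encode(('%s:%s' % (username, password)).encode()).decode()
  conn = client.HTTPSConnection(targetID)
  conn.request("GET", "/api/overview",
               headers={"Authorization": "Basic " + auth,
                        "User-Agent": "Python http auth"})
  response = conn.getresponse()
  successful = response.status == 200
  details = "/api/overview returned %d %s" % (response.status, response.reason)
  return TestResult(successful=successful, details=details,
                    meta={"type": "Checks if RabbitMQ API responds to GET"})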


Platform demonstration and usage guide

Page still under construction

More results, as well as instructions on how to download, install, and use the platform, will follow soon.

Contact:

  • This is part of our work in the U-Test EU project. Please contact Hong-Linh Truong (truong@dsg.tuwien.ac.at) for further information about our work.