Sharemind HI as a Data Analysis Platform

1. Introduction

Analyzing data from multiple input parties is the simplest use case for Sharemind HI. This tutorial goes through a sample scenario that showcases the basics of Sharemind HI development. It is recommended to become familiar with the basic vocabulary of Sharemind HI by reading through the overview before working through this scenario.

The tutorial is accompanied by the source code of the final project based on the sample scenario, and will demonstrate:

  • how to set up a new project using the Task Enclave Project (TEP) creation script,

  • what is included in the pre-built CMake project template,

  • the contents of the template .cpp files,

  • some useful headers from the Sharemind HI SDK.

1.1. Sample scenario

The tutorial will create an application that performs privacy-preserving set intersection based on a sample scenario. The sample scenario models a situation where a number of input parties each have a set of sensitive elements that they do not wish to disclose in full to the other parties, but they do wish to find the elements present in every set, i.e. the set intersection of the inputs. For example, this can model a conglomerate of companies trying to identify joint clients without disclosing their full client lists to each other, or financial institutions trying to find common potentially fraudulent accounts without disclosing all of their high-risk accounts.

To keep the focus on operations with Sharemind HI, the tutorial will further constrain the scenario to only two parties providing inputs and an additional party receiving an output. A single task enclave will be used to implement the set intersection, and the inputs and output are assumed to be proper sorted sets (with no duplicate elements) formatted as 32-bit signed integers.
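
To ground the goal before any enclave code is written, the computation itself can be sketched in plain C++. The following standalone example (not part of the generated project) intersects two sorted sets of 32-bit integers with std::set_intersection; the rest of the tutorial implements the same logic inside a task enclave:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    // Two proper sorted sets, mirroring the sample inputs used later in the tutorial.
    std::vector<int32_t> const input1{1, 2, 3, 4};
    std::vector<int32_t> const input2{2, 4, 8, 16, 32};

    // std::set_intersection requires sorted ranges and emits the common elements.
    std::vector<int32_t> result{};
    std::set_intersection(input1.begin(), input1.end(),
                          input2.begin(), input2.end(),
                          std::back_inserter(result));

    for (int32_t const el : result) {
        std::cout << el << ' '; // Prints: 2 4
    }
    std::cout << '\n';
}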

2. Set Up the Task Enclave Project

The very first step in any Sharemind HI project is to set up the Sharemind HI server, the task enclaves, and their configuration. To simplify this step, we have created a Task Enclave Project (TEP) creation tool that generates a template project with all of the basic components and default configurations. The tool is located in the Sharemind HI Development bundle: /path/to/HI-bundle/bin/sharemind-hi-create-task-enclave-project.

Running the tool requires specifying three parameters: the name of your project, the output directory for the generated project, and the path to the local SGX SDK installation. As an example, we will name the project "DataAnalysis" and create it under the user's home directory. For simplicity, we also assume the Sharemind HI Development bundle and the SGX SDK to be in /usr/local/:

/usr/local/HI-bundle/bin/sharemind-hi-create-task-enclave-project --project-name DataAnalysis --result-dir ~/DataAnalysisProject/ --sgxsdk-dir /usr/local/sgxsdk/

The tool also has a number of optional parameters to further customize the template. All available options and their descriptions can be examined by running /usr/local/HI-bundle/bin/sharemind-hi-create-task-enclave-project --help

The tool will create a CMake project with the following file structure:

├── build/
├── cmake/
│   └── FindSgxSdk.cmake
├── CMakeLists.txt
├── config.local
├── src/enclave/
│   ├── CMakeLists.txt
│   └── Enclave.cpp
├── test/
│   ├── client-and-server.sh
│   ├── client.sh
│   ├── common.sh
│   ├── dataflow-configuration-description.yaml
│   ├── server.sh
│   └── webclient.sh
└── web-client/

build/

contains the compiled project binaries.

cmake/

contains helper CMake modules (such as FindSgxSdk.cmake) used to locate the SGX SDK when building the project.

CMakeLists.txt

is the top-level CMake build script of the project.

config.local

is used to configure the paths of the Sharemind HI Development bundle and the SGX SDK. These paths should already be filled in by the TEP tool.

src/enclave/

contains the source code for the task enclave and a CMakeLists.txt file.

test/

contains a number of scripts that set up and start the Sharemind HI server, as well as a script to run a test scenario. Additionally, it includes the dataflow-configuration-description.yaml file, which contains an example dataflow configuration.

web-client/

contains an example integration of the Sharemind HI web client into a web application. This tutorial, however, uses only the CLI client and does not use this directory.

To ensure that everything was set up correctly you can run the test scenario with:

cd build/; ctest -V

The test will start the Sharemind HI Server and run a test scenario involving three stakeholders and a test enclave. In the following we will modify this test scenario step-by-step, starting with the dataflow configuration file located in the test/ folder.

3. Dataflow Configuration

The dataflow configuration (DFC) is the basis for access control in Sharemind HI. It lists the roles and identities (certificates) of each stakeholder, describes who can access which data, who can initiate which enclave, and how long data is stored.

This section will edit the default test DFC to fit our use-case.

3.1. Stakeholders

The provided example DFC already defines three stakeholders, sh1, sh2, and sh3, with test certificates.

Stakeholders:
    - Name: sh1
      CertificateFile: "regular-stakeholder-1.crt"
      RecoveryPublicKeyFile: "regular-stakeholder-1-recovery-pub.pem"
    - Name: sh2
      CertificateFile: "regular-stakeholder-2.crt"
      RecoveryPublicKeyFile: "regular-stakeholder-2-recovery-pub.pem"
    - Name: sh3
      CertificateFile: "regular-stakeholder-3.crt"
      RecoveryPublicKeyFile: "regular-stakeholder-3-recovery-pub.pem"

These test certificates have all possible roles encoded in them and will initially work for getting the rest of the project ready. The actual certificates should be generated and distributed to the stakeholders during the deployment step.

As the sample scenario includes exactly two input parties and one output party, the existing stakeholders can be renamed to better fit the scenario, for example to input_provider1, input_provider2, and output_consumer.

For other use cases, a number of pre-made test certificates and their corresponding recovery key pairs are located in the Sharemind HI Development bundle: /path/to/HI-bundle/lib/cmake/sharemind-hi/task-enclave-project-default-files/ca_stakeholders/. Additional certificates can be generated by following the Certificate Setup page.

Now the Stakeholders section of the DFC should look as follows:

Stakeholders:
    - Name: input_provider1
      CertificateFile: "regular-stakeholder-1.crt"
      RecoveryPublicKeyFile: "regular-stakeholder-1-recovery-pub.pem"
    - Name: input_provider2
      CertificateFile: "regular-stakeholder-2.crt"
      RecoveryPublicKeyFile: "regular-stakeholder-2-recovery-pub.pem"
    - Name: output_consumer
      CertificateFile: "regular-stakeholder-3.crt"
      RecoveryPublicKeyFile: "regular-stakeholder-3-recovery-pub.pem"

The defined stakeholders also need to be assigned roles. The example DFC lists all three stakeholders as auditors and only the first stakeholder as an enforcer.

The exact role assignments are highly dependent on the specific use case and the capabilities of the involved parties. As such, the assignments need to be analyzed thoroughly before a real deployment. For this tutorial we can keep the default assignments; however, as the stakeholders were renamed, the change should also be reflected in the role assignments. The role assignment section of the DFC should look as follows:

Auditors:
    - input_provider1
    - input_provider2
    - output_consumer

Enforcers:
    - input_provider1

3.2. Tasks

The next section in the DFC defines all task enclaves included in the service. Defining a new task enclave requires specifying a unique name, the fingerprints of the built and signed enclave, a list of stakeholders that can run the enclave, and, optionally, how long the information about the task can be retained.

The sample scenario only requires a single task enclave to run the set intersection algorithm, so we will edit the existing definition. The name and fingerprint entries of the enclave can be kept as is; during development the fingerprints are filled in automatically by the test scripts. The fingerprints should be fixed in place during the deployment step, once the enclave has been fully tested and audited.

As with the Auditor and Enforcer roles, assigning the Runner role requires analyzing the project needs and the specific use case. Any of the involved stakeholders, or a combination of them, could be assigned as runners, or an entirely new party could be added to run the task. For the sample scenario we will assign the output_consumer stakeholder as the sole Runner.

Lastly, the data retention time for the enclave information can be set. The data retention time is counted from the moment the enclave is started; once it elapses, all data related to the enclave (who ran the enclave and when) is deleted. It is important to consider local regulations related to data processing and data retention when choosing an appropriate data retention time. If data retention is not required and data can be stored indefinitely, then the DataRetentionTime entry can be omitted entirely. As the sample scenario does not require any data retention, the entry can be removed.
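
Had retention been required, the entry would be added alongside the other task fields. The exact value format depends on your Sharemind HI version, so treat the following only as a hypothetical illustration:

Tasks:
    - Name: "enclave"
      # ... fingerprints and Runners as in the listing below ...
      # Hypothetical illustration; consult the Sharemind HI documentation
      # for the exact DataRetentionTime value format.
      DataRetentionTime: ...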

The Tasks section of the DFC should now look as follows:

Tasks:
    - Name: "enclave"
      # Will be filled with the correct fingerprints when running the test.
      EnclaveFingerprint: "${enclave_ENCLAVE_FINGERPRINT}"
      SignerFingerprint: "${enclave_SIGNER_FINGERPRINT}"
      Runners:
        - output_consumer

3.3. Topics

Lastly, the DFC defines all data topics used in the service, along with access rights for each topic. Defining a topic requires specifying a unique name for the topic, a list of stakeholders and tasks that are allowed to upload data to the topic, a list of stakeholders and tasks that are allowed to download data from the topic, and, optionally, a data retention time similar to that of the task enclaves.

The test DFC uses three input and three output topics; however, the sample scenario only requires a single input topic and a single output topic. Hence we can delete the definitions for topics input2, input3, output2, and output3, and keep the definitions for topics input1 and output1. The remaining topics can also be renamed if needed; the names input and output fit the sample scenario nicely.

To allow the input parties to upload data into the service, they need to be listed as Producers for a topic. Similarly, to allow the receiving party to download the results, they need to be listed as a Consumer for a topic. As the output_consumer should not be able to access the inputs, it is important that they are not listed as a consumer of the input topic, but only of the output topic. In order for the task enclave to process the inputs and generate an output, it has to be allowed to download data from the input topic and to upload data to the output topic.

Similarly to the task enclave definition, the sample scenario does not require any data retention, so any DataRetentionTime entries can be removed.

The final DFC should look as follows:

Stakeholders:
    - Name: input_provider1
      CertificateFile: "regular-stakeholder-1.crt"
      RecoveryPublicKeyFile: "regular-stakeholder-1-recovery-pub.pem"
    - Name: input_provider2
      CertificateFile: "regular-stakeholder-2.crt"
      RecoveryPublicKeyFile: "regular-stakeholder-2-recovery-pub.pem"
    - Name: output_consumer
      CertificateFile: "regular-stakeholder-3.crt"
      RecoveryPublicKeyFile: "regular-stakeholder-3-recovery-pub.pem"
Auditors:
    - input_provider1
    - input_provider2
    - output_consumer

Enforcers:
    - input_provider1

Tasks:
    - Name: "enclave"
      # Will be filled with the correct fingerprints during runtime.
      EnclaveFingerprint: "${enclave_ENCLAVE_FINGERPRINT}"
      SignerFingerprint: "${enclave_SIGNER_FINGERPRINT}"
      Runners:
        - output_consumer

Topics:
    - Name: input
      Producers:
        - input_provider1
        - input_provider2
      Consumers:
        - "enclave"

    - Name: output
      Producers:
        - "enclave"
      Consumers:
        - output_consumer

4. Client Side Test

The template project includes a simple test scenario and a number of bash scripts, which set up the Sharemind HI Server containing the test enclave and run the test scenario.

The main flow of the test is located in the test/client.sh script, which is responsible for emulating each of the stakeholders using the Sharemind HI CLI client. The test first sets up the necessary folders and configuration files, and then runs the test flow under the section MODIFY Running your enclave. The flow consists of performing the following actions using the Sharemind HI CLI client:

sharemind-hi-client -c client-config.yaml -a attestation

The test first sets up sessions between the Sharemind HI server and the stakeholders by performing remote attestation for each of them. The client configuration files (regular-stakeholder-{1,2,3}.yaml) used in the test are generated automatically for the test certificates.

sharemind-hi-client -c client-config.yaml -a dfcApprove

The stakeholder with the enforcer role will approve the DFC. The approval procedure displays the staging DFC to the enforcer, who can then accept or decline the DFC after carefully examining it.
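
In the scripted test, this approval is answered non-interactively by piping the confirmation into the client, as this line from the full test listing below shows:

"${SHAREMINDHI_CLIENT}" -c regular-stakeholder-1.yaml -a dfcApprove <<< "Y"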

sharemind-hi-client -c client-config.yaml -a dataUpload -- --topic input --datafile input.data

The input stakeholders upload their data. In addition to specifying the client configuration file and the desired action, uploading data requires specifying the data file and which topic to upload the data to.

sharemind-hi-client -c client-config.yaml -a taskRun -- --task enclave --wait

The taskRun action requires specifying which task enclave to run (using the name specified in the DFC). The optional --wait argument blocks the CLI until the task finishes, and is equivalent to using the taskWait action. The existing test enclave simply copies any data in the input topics to the output topics.
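
For reference, the blocking invocation above could also be split into a non-blocking run followed by a separate wait. The exact arguments of the taskWait action are an assumption here; consult the CLI --help output for the authoritative syntax:

sharemind-hi-client -c client-config.yaml -a taskRun -- --task enclave
sharemind-hi-client -c client-config.yaml -a taskWait -- --task enclave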

sharemind-hi-client -c client-config.yaml -a dataDownload -- --topic output --dataid 0 --datafile output.data

Finally, the task output is downloaded. Similarly to uploading data, the output data file and the topic from which the data is downloaded need to be specified. However, as the output topic may contain multiple data items, the data ID also has to be provided.

This test is run when initially setting up the template project and can be re-run at any time using ctest in the build/ folder. As the DFC was changed in the previous section, the test will now fail during the data upload step, because the topics referenced by the test no longer exist. To make the test succeed, we first need to modify it to match the sample scenario.

4.1. Modifying the test

Modifying the test/client.sh script to match the sample scenario requires

  • changing the input data,

  • updating the operations performed by each stakeholder,

  • updating the input and output topics,

  • and testing the correctness of the enclave output.

For simplicity, we will assume that the inputs and the output are proper sorted sets formatted as 32-bit integers. Perl's pack function can be used to write both the inputs and the expected output into temporary files as follows:

#######
# MODIFY Running your enclave
#######

# Some input data for data upload.
perl -e 'print pack "l*", 1, 2, 3, 4' > input1.data
perl -e 'print pack "l*", 2, 4, 8, 16, 32' > input2.data
perl -e 'print pack "l*", 2, 4' > expected_output.data
set -x

We can also update the correctness testing at the end of the file:

# Test the correctness of the output.
if ! cmp -s expected_output.data output.data; then
    >&2 echo "File content of expected_output.data does not match the one of output.data."
    >&2 echo "expected_output.data: $(perl -ne 'print unpack "l2", $_' expected_output.data)"
    >&2 echo "output.data: $(perl -ne 'print unpack "l2", $_' output.data)"
    exit 1
fi

Performing the attestation and the enforcer approving the DFC are important steps in ensuring the security of the final solution. As the sample scenario uses the same stakeholders and roles as the template, no changes are required in these steps to fit the new use case.

The next change has to be made in the data upload step, where the last stakeholder is no longer an input party. The data upload operations for the first two stakeholders can remain as is, and the upload operation for the third party should be removed, leaving the section as follows:

# Add input data to the task.
"${SHAREMINDHI_CLIENT}" -c regular-stakeholder-1.yaml -a dataUpload \
    -- --topic input --datafile input1.data --allow-missing-trust
"${SHAREMINDHI_CLIENT}" -c regular-stakeholder-2.yaml -a dataUpload \
    -- --topic input --datafile input2.data --allow-missing-trust

The taskRun action can only be successfully called by the output_consumer stakeholder, as they are the only stakeholder listed as the Runner of the task in the DFC. Similarly, with the dataDownload action, only the output_consumer stakeholder is allowed to download data from the output topic. Updating the task run and download steps is the last change needed and leaves the whole test as follows:

#######
# MODIFY Running your enclave
#######

# Some input data for data upload.
perl -e 'print pack "l*", 1, 2, 3, 4' > input1.data
perl -e 'print pack "l*", 2, 4, 8, 16, 32' > input2.data
perl -e 'print pack "l*", 2, 4' > expected_output.data
set -x

if [[ "$SGX_MODE" == "HW" ]]; then
    "${SHAREMINDHI_CLIENT}" -c "regular-stakeholder-1.yaml" -a attestation
    "${SHAREMINDHI_CLIENT}" -c "regular-stakeholder-2.yaml" -a attestation
    "${SHAREMINDHI_CLIENT}" -c "regular-stakeholder-3.yaml" -a attestation
fi

# Approves the task (required to add data to the task and to run it)
"${SHAREMINDHI_CLIENT}" -c regular-stakeholder-1.yaml -a dfcApprove <<< "Y"

# Add input data to the task.
"${SHAREMINDHI_CLIENT}" -c regular-stakeholder-1.yaml -a dataUpload \
    -- --topic input --datafile input1.data --allow-missing-trust
"${SHAREMINDHI_CLIENT}" -c regular-stakeholder-2.yaml -a dataUpload \
    -- --topic input --datafile input2.data --allow-missing-trust

# Run the task.
"${SHAREMINDHI_CLIENT}" -c regular-stakeholder-3.yaml -a taskRun \
    -- --task enclave --wait

# Get output data.
"${SHAREMINDHI_CLIENT}" -c regular-stakeholder-3.yaml -a dataDownload \
    -- --topic output --dataid 0 --datafile output.data --allow-missing-trust

# Test the correctness of the output.
if ! cmp -s expected_output.data output.data; then
    >&2 echo "File content of expected_output.data does not match the one of output.data."
    >&2 echo "expected_output.data: $(perl -ne 'print unpack "l2", $_' expected_output.data)"
    >&2 echo "output.data: $(perl -ne 'print unpack "l2", $_' output.data)"
    exit 1
fi

At this point the test will still fail; however, it now fails at the final step, when checking the correctness of the output. As the test enclave simply copies any data in the input topic to the output topic, the output will be equal to the first data item uploaded. To make the test succeed, the set intersection algorithm has to be implemented in the task enclave.

5. Set Intersection Algorithm

The most generic form of the set intersection algorithm takes as input any number of collections of elements and returns a new collection consisting of the elements that are included in each of the input collections. For example, the intersection of {1, 2, 3, 4} and {2, 4, 8, 16, 32} is {2, 4}, which is exactly the expected output used in the test above. The sample scenario uses a simplified version that considers only inputs that are proper sets (each element can occur only once in a single input).

For clarity we will go through two versions of the algorithm: first a simple and straightforward approach that uses only basic programming building blocks, and then a more streamlined approach that uses features available in the Sharemind HI SDK and is better suited to the enclave environment.

The simple version first gathers and parses all the input sets, counting the occurrences of each element using a map (or dictionary) structure. Once all the elements have been counted, the elements whose count equals the number of inputs, i.e. the elements that appeared in every input, are placed in the output set. The more complex version uses streams to create a pipeline involving a join operation to gather all equal elements.

It is important to note that both implementations are kept simple and somewhat naive for the purposes of the tutorial. They are of limited practical use, as no protections against side-channel attacks are applied. Implementing side-channel-safe algorithms requires additional measures specific to each algorithm, project, and use case.

5.1. The run Method

While the actual process of setting up, starting, and tearing down the secure enclave is long and complex, Sharemind HI hides and abstracts most of it from the task enclave developer. The only part of the task enclave lifecycle that the developer needs to participate in is implementing the data processing performed inside the enclave. The data processing flow inside the enclave has only a single entrypoint, the run(TaskInputs const & inputs, TaskOutputs & outputs) method in src/enclave/Enclave.cpp. From the perspective of the task enclave developer, this is the equivalent of the "main" function of the enclave.

The run method is invoked asynchronously by the taskRun action from the client libraries and provides the developer access to all of the input and output topics available to the enclave. Everything related to how the data is parsed, transformed, or manipulated in between the input and output is left for the developer.
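
Stripped of all processing logic, the entrypoint is just this method, as sketched below (the template's Enclave.cpp fills it with example operations):

// src/enclave/Enclave.cpp: the "main" function of the task enclave.
void run(TaskInputs const & inputs, TaskOutputs & outputs) {
    // Everything between reading the input topics and writing the
    // output topics is implemented here by the task enclave developer.
}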

The test task enclave provides a number of example operations that can be performed in the enclave, including file operations, working with the inputs and outputs, accessing the DFC, and using the streams header. It is recommended to work through the examples before writing your own code, as they introduce the tools provided by the Sharemind HI SDK.

Access to the input and output topics is provided through the TaskInputs and TaskOutputs classes defined in the sharemind-hi/enclave/task/TaskIO.h header (all available headers are located in HI-bundle/include/sharemind-hi/sharemind-hi/). The encrypted inputs in a topic can be accessed with inputs.topic(topicName).data, or alternatively a single data item can be accessed by its data ID with inputs.get(topicName, DataId). The encrypted inputs are then decrypted using encrypted_input.decrypt(inputPlaintext).

Writing data to a topic is done with outputs.put(topicName, data, dataSize). Note that it is only possible to write to the end of an output topic, and the data is encrypted automatically before writing.
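
Condensed into a sketch (assuming the call signatures described above, with dataSize given in bytes; the data ID 0 is only an example):

// Iterate over every encrypted item in a topic and decrypt it ...
for (auto const & encrypted_input : inputs.topic("input").data) {
    std::vector<int32_t> plaintext{};
    encrypted_input.decrypt(plaintext);
}

// ... or fetch a single item by its data ID.
auto const item = inputs.get("input", 0);

// Append to the end of an output topic; the data is encrypted automatically.
std::vector<int32_t> result{2, 4};
outputs.put("output", result.data(), result.size() * sizeof(int32_t));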

Once familiar with the provided test code, the content of the run method can be cleared and replaced with the implementation of the set intersection algorithm. Using the constructs and examples from the test code, an initial attempt may look something like this:

void run(TaskInputs const & inputs, TaskOutputs & outputs) {
    enclave_printf_log("Running set intersection task!");

    // Count the occurrences of each unique element
    // We assume that each input is a proper set (no duplicate elements)
    std::map<int32_t, size_t> occurrences{};
    const size_t n_inputs = inputs.topic("input").data.size();

    // Iterate over each input
    for (auto const & input : inputs.topic("input").data) {
        // Decrypt the input
        std::vector<int32_t> input_data {};
        input.decrypt(input_data);

        for (const int32_t num : input_data) {
            occurrences[num] += 1;
        }
    }

    // Now all occurrences have been counted.
    // Output consists of elements that were in all sets
    std::vector<int32_t> output_data {};
    for (const auto & entry : occurrences) {
        if (entry.second == n_inputs) {
            output_data.push_back(entry.first);
        }
    }

    // Store the output in the "output" topic (the size is given in bytes)
    outputs.put("output", output_data.data(), output_data.size() * sizeof(int32_t));
}

This implementation fits the sample scenario and passes the test written in the previous section. However, juggling the data items and decryption distracts from the logic of the algorithm. Sharemind HI offers a number of useful headers and pre-made functions to ease the development process; one of the most useful is the Streams.h library.

5.2. The Streams.h Library

Using large amounts of memory inside an enclave can incur significant performance penalties, so it is important to keep the memory footprint of the enclave as low as possible. Streams are a natural way to process elements as they arrive without storing entire inputs in enclave memory. The streams library provides a convenient way to read the data coming from the input topics and to write data to the output topics. Additionally, it implements a number of operations common in data processing workflows, allowing the developer to use constructs similar to those found in other languages and environments.

Using the tools provided in the streams header, the decryption and formatting of each input can be replaced with a single line: stream::into<int32_t>(inputs, "input"). The into function takes the last data item in the specified topic and generates elements matching the template type (in this case int32_t). The library also provides a number of other functions for creating stream sources, like vec, which creates a source from a std::vector, and range, which creates a source of sequential integers.

Elements generated by the source can be forwarded, using the pipe operator >>=, to intermediate "pipes" or stream-terminating "sinks". Pipes process the incoming elements to generate new ones, which can again be forwarded to further stream operations. Examples of pipes are filter, which filters out elements based on a custom predicate, smap, which transforms each element by applying a custom function, and join, which joins two sorted streams into a single sorted stream of the common elements.

Sinks consume incoming elements and terminate the stream to produce a single output. Examples of sinks are foreach, which, similarly to smap, applies a custom function to each element, collect, which gathers all incoming elements into a std::vector, and encryptedOutput, which stores all incoming elements directly in an output topic.
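
Put together, a pipeline always has the shape source >>= pipes >>= sink. As a toy illustration using the functions named above (a sketch only; the exact signatures and the enclave_printf_log format string are assumptions based on the test enclave examples):

// Keep the even numbers among 0..9, square them, and log each result.
stream::range(0, 10) >>=                                        // source: 0, 1, ..., 9
    stream::filter([](int32_t el) { return el % 2 == 0; }) >>=  // pipe: keep even elements
    stream::smap([](int32_t el) { return el * el; }) >>=        // pipe: square each element
    stream::foreach([](int32_t el) {                            // sink: consume each element
        enclave_printf_log("%d", el);
    });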

The set intersection problem is solved neatly by the join method, as it reduces two input streams into a stream of only the common elements. Turning inputs into stream sources is best done using the into method. However, the into method only retrieves the last item in a topic. While a more complex approach could be taken to create a source from each item in a single topic, the simpler method of adding a second input topic suits our scenario. Adding a new topic requires updating the DFC:

Topics:
    - Name: input1
      Producers:
        - input_provider1
      Consumers:
        - "enclave"

    - Name: input2
      Producers:
        - input_provider2
      Consumers:
        - "enclave"

    - Name: output
      Producers:
        - "enclave"
      Consumers:
        - output_consumer

The test also needs to be updated to reflect the new topics:

# Add input data to the task.
"${SHAREMINDHI_CLIENT}" -c regular-stakeholder-1.yaml -a dataUpload \
    -- --topic input1 --datafile input1.data --allow-missing-trust
"${SHAREMINDHI_CLIENT}" -c regular-stakeholder-2.yaml -a dataUpload \
    -- --topic input2 --datafile input2.data --allow-missing-trust

Now all that is left is to join the two inputs and write the result to the output:

void run(TaskInputs const & inputs, TaskOutputs & outputs) {
    enclave_printf_log("Running set intersection task!");

    stream::join(
        stream::into<int32_t>(inputs, "input1"), // Input of input_provider1
        stream::into<int32_t>(inputs, "input2"), // Input of input_provider2
        [](int32_t el){return el;}, // These lambdas extract joinable keys from more complex structures
        [](int32_t el){return el;}
    ) >>=
    stream::smap([](auto const & joined_pair) {
        return joined_pair.first[0]; // Join returns a pair of vectors of equal elements; we only need one.
    }) >>=
    stream::encryptedOutput(outputs, "output"); // Write all elements to the output
}

The result is significantly more compact and readable code that is more efficient and better suited to the enclave environment.

6. Deployment Outlook

Creating a working task enclave project is the first step toward successfully deploying a solution that utilizes Sharemind HI. Once the task enclave and the DFC have been solidified and the project is ready to be deployed, a number of operational procedures still have to be completed.

Only once all of these steps have been completed can Sharemind HI be run in a live environment with its full security guarantees.