Third Party Auditing for the Archistar Multi-Cloud Storage

Abstract

The significant cost savings and processing benefits attract individuals and organizations to outsource their data to third-party service providers. As a result, concerns about data security and privacy arise, and users need a way to check whether their data is stored properly on the distributed storage system. One solution is to engage a third party auditor (who does not need to be trusted) to perform the auditing process. In addition, the auditing mechanism should guarantee data privacy to the user even if the third party auditor cooperates with a set or subset of the storage servers. At the time of writing this thesis, third party auditing mechanisms exist only for single-server storage systems, not for distributed storage systems. In this thesis, we implement the auditing protocol in Java using four applications (client, server, share, reconstruct) and evaluate its performance. The protocol we implement can be based on any homomorphic secret sharing scheme and is information-theoretically private, efficient and keyless. In addition, it is backward compatible with existing secret-sharing-based storage solutions. The result of this thesis is the first implementation of a new kind of privacy-preserving remote data checking for dispersed storage systems.

Motivation

Cloud computing is becoming more popular day by day. Its main advantage is that resources are available whenever you need them and you pay only for what you use. The Internet is the most common way to access cloud computing resources; in the case of a private cloud, an intranet is used. However, it is hard for cloud storage customers to secure their data, because they have no direct control over how and where it is stored. There are encryption methods, which we will discuss, but the risk of losing data is always there.

One solution is to encrypt the data before putting it into cloud storage, so that companies regain direct control of the data stored on remote servers. When applying encryption, businesses that decide to migrate to the cloud should consider two options: encrypt the data before storing it in the cloud, or choose a SaaS provider that manages the encryption and decryption keys of the corporate data. The latter solution is rather ineffective, because the cloud service provider manages the keys and therefore still has full access to the stored data. On the other hand, if data is encrypted before being uploaded to a cloud storage provider, it is protected from provider access; however, if that data is then needed on a mobile or remote device that does not already have the decryption key, the resulting download is useless encrypted data. Furthermore, encryption is computationally expensive and introduces complexity, because it typically requires both private-key and public-key encryption, and the systems that perform this computational heavy lifting must always have resources available. Another point to remember about encryption is that the security of the data becomes the security of the encryption key: losing the key is equal to losing the data. Finally, encrypting the data before storing it in the cloud is never information-theoretically secure, a property required for long-term secure data archiving.

Instead of encryption, another option is to split the data into fragments and distribute them to multiple servers using threshold secret sharing techniques. Threshold secret sharing schemes protect the integrity and secrecy of information by distributing it over different locations. The main principle is that data is shared among n parties in such a way that it can be reconstructed easily from any k or more shares, while knowledge of k − 1 or fewer shares gives no information about the data. A disadvantage of outsourced storage systems is that one has to fully trust the storage nodes that hold the data. To address this, Proofs of Retrievability and Proofs of Data Possession have been introduced. A Proof of Retrievability (PoR) is a proof by a cloud service provider to a client (auditor) that a file F can be fully recovered by the client. A Proof of Data Possession (PoDP) scheme provides an efficient audit to convince a client that his or her file is available at the storage server, ready for retrieval when needed. This audit can be performed by an auditor instead of the data owner, and the auditor must not learn any additional information about the stored data.

Methodological approach

This thesis aims at the implementation and evaluation of a new kind of privacy-preserving remote data checking for dispersed storage systems. The results of the thesis will be integrated into an existing open source software prototype and will therefore have a substantial impact and broad distribution. The thesis makes use of the additive homomorphic properties of linear secret sharing, i.e., Shamir's scheme in our case, to realize a remote data checking protocol. The key new properties of this protocol are that it is keyless and privacy preserving. Moreover, the scheme offers additional advantages: increased flexibility, compatibility with existing secret-sharing-based storage solutions as well as those using Reed-Solomon based erasure coding, and compatibility with proactive security measures in storage systems. The auditing scheme achieves constant communication complexity even when auditing an arbitrarily large set of shares simultaneously. The complexity on the server's side is essentially given by computing a sum over the audited data shares. On the auditor's side, the computation consists of one degree-t polynomial interpolation over n values, and is therefore fully independent of the batch size. The theoretical part of the thesis, consisting of chapters 2 and 3, gives an overview of cloud computing: it presents the cloud computing models and the benefits of using them, describes the three biggest cloud service providers in detail, and explains the basic principles of how remote data auditing works at different cloud service providers. The practical part of the thesis, chapter 4, covers the implementation and evaluation of the new remote data auditing protocol in Java, deployed on the Amazon EC2 web service. Furthermore, we evaluate the implemented protocol and present a table of values describing its performance, compared with existing research. The last chapter gives the conclusion of the thesis.

Outline of the thesis

Step 1 gives an overview of cloud computing technology, its advantages and disadvantages, and how different cloud service providers deal with third party auditing.

Step 2 presents distributed storage systems, the ARCHISTAR dispersed storage system, remote data checking and the proposed third party auditing protocol.

Step 3 gives the results of the implementation and evaluation.

Step 4 summarizes the contribution of the thesis and presents the conclusions.

Overview scheme of the proposed auditing algorithm

The main issue in the design of an auditing protocol is data privacy. This stems from the fact that, for public data, the auditor might obtain information about the data by recovering data blocks from the data proof. In this thesis, we overview and present a third party auditing protocol for distributed storage which is flexible, information-theoretically private, efficient and auditable, and which is compatible with existing storage solutions. Moreover, the proposed scheme is entirely keyless. The scheme has constant communication complexity. The computational complexity on the server's side is given by computing a sum over the audited data shares, plus a constant overhead which is independent of the number of shares to be batch audited. On the auditor's side it consists of one degree-t interpolation over n values. Furthermore, the scheme is very flexible because it does not require additional preprocessing or keys, it is compatible with existing Shamir-based storage solutions, and it is compatible with proactive security measures in storage systems. Lastly, for a practical storage system clients must be prevented from distributing inconsistent data to the storage servers, which can be achieved by using verifiable secret sharing schemes. The proposed audit protocol consists of four steps, as shown in the figure below:

  1. When uploading the files, the client shares random values so that they can be used later during the audit. The final random value r is the sum of all shares. After generating this random value, Shamir secret sharing is applied to it, so every server holds a share p_i of the random value r.
  2. The auditor broadcasts a challenge c to the servers.
  3. Each server then computes s_i = c * σ_i + p_i, where σ_i denotes the share that the server holds as a result of applying Shamir secret sharing to the file (see the sketch below).
  4. If all the s_i are consistent, the auditor accepts the data integrity.
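The following is a minimal Java sketch of one audit round under simplifying assumptions: the shares σ_i and the mask shares p_i are elements of a prime field (the actual implementation works on raw bytes), and consistency is checked by interpolating the degree-t polynomial through t+1 responses and verifying that the remaining responses lie on it. The class and method names are illustrative and not taken from the actual Archistar code base.

    import java.math.BigInteger;

    // Illustrative sketch of one audit round; not the actual Archistar implementation.
    public class AuditRoundSketch {

        // Small prime field for illustration; a real deployment would use a larger field.
        static final BigInteger Q = BigInteger.valueOf(2147483647L); // 2^31 - 1

        // Step 3: server i answers the challenge c with s_i = c * sigma_i + p_i (mod Q).
        static BigInteger serverResponse(BigInteger c, BigInteger sigma_i, BigInteger p_i) {
            return c.multiply(sigma_i).add(p_i).mod(Q);
        }

        // Evaluate the polynomial interpolating the points (xs[j], ys[j]) at position x (Lagrange form).
        static BigInteger lagrangeEval(BigInteger[] xs, BigInteger[] ys, BigInteger x) {
            BigInteger result = BigInteger.ZERO;
            for (int j = 0; j < xs.length; j++) {
                BigInteger num = BigInteger.ONE, den = BigInteger.ONE;
                for (int m = 0; m < xs.length; m++) {
                    if (m == j) continue;
                    num = num.multiply(x.subtract(xs[m])).mod(Q);
                    den = den.multiply(xs[j].subtract(xs[m])).mod(Q);
                }
                result = result.add(ys[j].multiply(num).multiply(den.modInverse(Q))).mod(Q);
            }
            return result;
        }

        // Step 4: the auditor accepts if every response lies on the degree-t polynomial
        // defined by the first t+1 responses (one way to implement the consistency check).
        static boolean auditorAccepts(BigInteger[] xs, BigInteger[] responses, int t) {
            BigInteger[] xHead = java.util.Arrays.copyOf(xs, t + 1);
            BigInteger[] yHead = java.util.Arrays.copyOf(responses, t + 1);
            for (int i = t + 1; i < xs.length; i++) {
                if (!lagrangeEval(xHead, yHead, xs[i]).equals(responses[i])) {
                    return false;
                }
            }
            return true;
        }
    }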

Software Architecture

The auditing protocol is implemented in the Java programming language.
We use four applications, respectively called audit client.java, audit server.java, share.java and reconstruct.java. The figure below represents the architecture of the third party auditing protocol.

Firstly, the audit server.java application has four methods, as described below.

  1. randomShare() generates the jointly random number. It produces a set of shares from the random number; as input it takes the number of servers and the number of servers required to reconstruct a secret, and its output is a list of shares as raw bytes. In a nutshell, this method creates n random numbers and applies the SSS (Shamir secret sharing) algorithm to each of them.
  2. evaluate() evaluates the polynomial.
  3. sumofShares() combines the shares obtained from applying secret sharing to the random numbers; it holds the resulting values p_i (see the sketch after this list).
  4. summation() computes s_i = Σ c * σ_i + p_i, where σ_i denotes the share that each server receives, p_i denotes the server's share of the random value r, and c denotes the challenge.
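The homomorphic property behind randomShare() and sumofShares() is that the component-wise sum of Shamir shares is itself a Shamir share of the sum of the shared values. A minimal sketch, assuming the shares are field elements laid out as shares[j][i] (this layout and the field size are assumptions for illustration only):

    import java.math.BigInteger;

    // Sketch: given shares[j][i] = i-th Shamir share of the j-th random value,
    // server i's mask p_i is the sum of the i-th shares, which is itself a Shamir
    // share of the joint random value r = r_1 + ... + r_n (additive homomorphism).
    public class SumOfSharesSketch {

        static final BigInteger Q = BigInteger.valueOf(2147483647L); // illustrative prime field

        static BigInteger[] sumOfShares(BigInteger[][] shares) {
            int n = shares[0].length;
            BigInteger[] p = new BigInteger[n];
            for (int i = 0; i < n; i++) {
                p[i] = BigInteger.ZERO;
                for (BigInteger[] row : shares) {
                    p[i] = p[i].add(row[i]).mod(Q);
                }
            }
            return p;
        }
    }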

Secondly, the audit client.java application has two methods, as described below.

  1. In the challenge() method, the auditor generates a random number c, and this challenge number is sent to every server, in our case up to the 11th server, because this is the maximum size of the array which represents the list of servers.
  2. getPolynomialCoeff() determines whether the auditor accepts the data integrity, based on the consistency of the returned values.

Thirdly, the share.java application has two methods, as described below.

  1. calculateThresholdScheme() generates a set of shares from the provided secret. As input it takes the secret, the number of shares to generate and the number of shares required to reconstruct the secret; the output is a list of shares as raw bytes (see the sketch after this list).
  2. evaluate() multiplies the shares of the provided secret with the corresponding powers, i.e., it evaluates the sharing polynomial.
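As a rough sketch of what calculateThresholdScheme() and evaluate() do, assuming standard Shamir sharing over a prime field; the names and the field size are illustrative, and the real share.java works on raw byte arrays:

    import java.math.BigInteger;
    import java.security.SecureRandom;

    // Sketch: split a secret into n Shamir shares with reconstruction threshold k.
    public class ShareSketch {

        static final BigInteger Q = BigInteger.valueOf(2147483647L);
        static final SecureRandom RNG = new SecureRandom();

        // calculateThresholdScheme()-like step: pick a random polynomial of degree k-1
        // whose constant term is the secret, then hand share (i, f(i)) to server i.
        static BigInteger[] calculateThresholdScheme(BigInteger secret, int n, int k) {
            BigInteger[] coeffs = new BigInteger[k];
            coeffs[0] = secret.mod(Q);
            for (int j = 1; j < k; j++) {
                coeffs[j] = new BigInteger(Q.bitLength() - 1, RNG);
            }
            BigInteger[] shares = new BigInteger[n];
            for (int i = 1; i <= n; i++) {
                shares[i - 1] = evaluate(coeffs, BigInteger.valueOf(i));
            }
            return shares;
        }

        // evaluate()-like step: multiply each coefficient with the corresponding power of x
        // and sum up, i.e. f(x) = a_0 + a_1*x + ... + a_{k-1}*x^{k-1} (mod Q).
        static BigInteger evaluate(BigInteger[] coeffs, BigInteger x) {
            BigInteger y = BigInteger.ZERO;
            for (int j = 0; j < coeffs.length; j++) {
                y = y.add(coeffs[j].multiply(x.modPow(BigInteger.valueOf(j), Q))).mod(Q);
            }
            return y;
        }
    }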

Lastly, the reconstruct.java application has three methods, as described below.

  1. recoverSecret() requires 'threshold' shares in order to correctly reconstruct a secret. The input is a list of shares as raw bytes, and the output is the secret (see the sketch after this list).
  2. production() calls our next method, which is called LagranP().
  3. LagranP() multiplies the value of our polynomial, which is returned by the calculateThresholdScheme() method, with an index.
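A rough sketch of the reconstruction step, assuming Shamir shares over a prime field: the secret is recovered by Lagrange interpolation at x = 0, where the LagranP()-like step computes the Lagrange coefficient of each share. The field size and the worked example are illustrative only.

    import java.math.BigInteger;

    // Sketch: recover the secret from k shares (xs[j], ys[j]) by Lagrange interpolation at x = 0.
    public class ReconstructSketch {

        static final BigInteger Q = BigInteger.valueOf(2147483647L);

        static BigInteger recoverSecret(BigInteger[] xs, BigInteger[] ys) {
            BigInteger secret = BigInteger.ZERO;
            for (int j = 0; j < xs.length; j++) {
                // LagranP()-like step: Lagrange basis coefficient of share j, evaluated at x = 0.
                BigInteger num = BigInteger.ONE, den = BigInteger.ONE;
                for (int m = 0; m < xs.length; m++) {
                    if (m == j) continue;
                    num = num.multiply(xs[m].negate()).mod(Q);
                    den = den.multiply(xs[j].subtract(xs[m])).mod(Q);
                }
                secret = secret.add(ys[j].multiply(num).multiply(den.modInverse(Q))).mod(Q);
            }
            return secret;
        }

        public static void main(String[] args) {
            // Example: secret 1234 shared with f(x) = 1234 + 17x (threshold 2), shares at x = 1 and x = 2.
            BigInteger[] xs = { BigInteger.valueOf(1), BigInteger.valueOf(2) };
            BigInteger[] ys = { BigInteger.valueOf(1251), BigInteger.valueOf(1268) };
            System.out.println(recoverSecret(xs, ys)); // prints 1234
        }
    }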

Conclusions

We gave a detailed explanation of the first privacy-preserving third party auditing protocol for distributed storage systems. The protocol can be based on any homomorphic secret sharing scheme and is information-theoretically private. It also allows arbitrary modifications of the stored data. The challenge we addressed was the implementation of this new kind of privacy-preserving remote data checking protocol for dispersed storage systems. The protocol was implemented in Java using four applications (audit client.java, audit server.java, share.java and reconstruct.java) and then evaluated. The implemented protocol will have a substantial impact and broad distribution because it is flexible, compatible with existing Shamir-based storage solutions, and compatible with proactive security measures in storage systems.

Contact me if you would like to read the full version of my master's thesis.

Anomaly Detection for Simulated IEC-60870-5-104 Traffic

The aim of the Substation Security project is the detection of anomalies in energy distribution networks.

Motivation: During the past years, there has been an increase in the number of attacks on automation systems. In spite of that, not enough attention has been dedicated to the protection of such networks. This project introduces a novel machine-learning-based intrusion detection system targeted at automation networks of substations based on the IEC 60870-5-104 protocol. The novelty of the proposed approach compared to the state of the art is the monitoring of several features on multiple protocol layers, which enables the identification of multiple types of attacks.

Suggested Approach:

  1. Simulate the communication between the substation slave and the server based on data gained from real substations, and also simulate the system's behavior under attack.
  2. Observe the system's normal behavior and its behavior under attack, in order to extract the features needed for building an anomaly detection system.
  3. Based on these features, suggest an anomaly detection system for the asynchronous IEC 60870-5-104 protocol.
  4. Design the anomaly detection model by applying machine learning to the acquired IEC 60870-5-104 protocol data.
  5. Choose the classifier with the highest performance by comparing 7 different classification algorithms; the Rule Learner classifier turned out to be the best.

IEC 104 Protocol

IEC 60870 is a standard for interoperability among compatible telecontrol equipment. The IEC 60870-5-104 protocol specification combines the application layer of IEC 60870-5-101 with the TCP/IP transport profile. IEC 60870-5-101 is a standard for electric power systems with features such as support for balanced and unbalanced data transfer modes, classification of data into high and low priority, cyclic and spontaneous data, etc.

Data Generation

The traffic generation is based on a simulation tool for bi-directional IEC 60870-5-104 communication; the simulation replays real-world data packets. The figure shows the working principle of the tool:

On one side of the communication we have a client and on the other side a server/router. Once the communication between the client (e.g. an intelligent electronic device on the process layer) and the server is established, we use Wireshark as a network analysis tool to capture all 104 packets and their features (such as frame number, time stamp, delta time between packets, round trip times, etc.).

Attack Modelling

In this project three attacks were used.

  1. First of all, we consider an attack model limited to passive eavesdropping and information gathering. The attacker's goal is to gather as much information as possible by observing the network traffic and trying to map the structure of the substation for subsequent active attacks.
  2. A second attack we consider tries to manipulate the network traffic by replacing legitimate packets with forged ones (here the attacker must know both the 104 protocol and the structure of the substation). The goal is to feed the industrial control system with false measurement data, to issue arbitrary commands to the IEDs, or to desynchronize the system time.
  3. The third attack model is a plain DoS attack aiming at the disruption of network functionality.

The three attacks above were chosen by referring to multiple state-of-the-art papers which used similar attacks to verify the functionality of their intrusion detection systems.

Machine Learning Approach

Intrusion Detection Systems (IDS) monitor inbound and outbound network traffic in order to detect abnormal activities: they collect data from the network and look for suspicious activities and anomalies.

In this project, a machine learning approach is followed to build an IDS customized for IEC 60870-5-104 traffic. The scheme is illustrated in the figure below:

  1. Traffic generation phase: this step involves gathering the learning data that we use to generate actionable knowledge. The data is combined into a single CSV file.
  2. Data exploration and preparation: during the preprocessing phase, the raw PCAP data was reduced by keeping only the features useful for our case. With filtering operations we removed all unnecessary packets (TCP, UDP, ARP, etc.) and concentrated only on 104 protocol ASDU packets. As already presented in the section above, our data consists of normal and anomalous traffic.
  3. Feature Extraction: from the pcapng packets, 9 features were extracted: Length (frame length on the wire), Total_Len (total length of the packet), Window size value (window size value from the TCP header), DeltaTime_prev_capt_frame (time difference from the previous captured frame), Frame_Length_on_wire (number of bytes sent/received by the adapter), RTT (round trip time), Value (normalized value), Month (month when the packets are sent/received), Packets_s (number of packets sent per second).
  4. Labeling: as already mentioned, the status attribute is divided into two classes: target and outlier. Target represents packets from normal traffic and outlier represents packets from attack traffic.
  5. Feature selection: there are 903 instances in our dataset. We selected the 9 features that correlate best with the status (label). Automatic feature selection methods were used, such as CorrelationAttributeEval, which computes the correlation of each feature with the class (target or outlier); a sketch using the Weka API follows this list. Moreover, reducing the number of features improves the performance of our predictive model.
  6. Classification Model: machine learning algorithms are categorized into groups according to their purpose. Predictive models aim at predicting a value, called the target feature, based on the other features and their values in the dataset. In supervised learning, the target feature (called the class) provides a way for the learner to learn the intended task. Categorical features are divided into levels, and classification predicts to which category an instance belongs.
Implementation and Performance Evaluation

After building the classifier, we evaluated its performance using a cross-validation approach; when analyzing the data, misclassification errors must be taken into account. Cross-validation is the standard way of evaluating the performance of machine learning algorithms. It defines a fixed number of folds of the data; let us suppose the number of folds is ten. The entire dataset is then split into ten partitions, and each partition is used once for testing (1/10) while the remainder is used for training (9/10). This procedure is repeated 10 times, so every partition is used for testing exactly once. The table below presents the results of 10-fold cross-validation, where Acc stands for accuracy, CC for correctly classified instances and IC for incorrectly classified instances:

Taking into account the TPR, FPR, precision, F-measure and ROC area, we can conclude that the best performance is achieved by decision trees and rule learners. In a nutshell, the Decision Table classifier is the best machine learning approach in our case, because it achieves the best values for accuracy, TPR, FPR, precision, F-measure and ROC area. To the best of the authors' knowledge, these results are the first obtained for a behaviour-based IDS for 104 protocol data acquired from a real substation.
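A minimal sketch of the 10-fold cross-validation described above, using the Weka API and reporting the same kind of metrics (Acc, CC, IC and the per-class details); the data file name is illustrative, and any of the compared classifiers could be substituted for the Decision Table:

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.DecisionTable;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // Sketch: evaluate the Decision Table classifier with 10-fold cross-validation in Weka.
    public class CrossValidationSketch {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("iec104_features.csv").getDataSet();
            data.setClassIndex(data.numAttributes() - 1); // status (target/outlier) as class

            DecisionTable classifier = new DecisionTable();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(classifier, data, 10, new Random(1)); // 10 folds

            System.out.printf("Acc: %.2f%%%n", eval.pctCorrect());
            System.out.printf("CC:  %.0f%n", eval.correct());
            System.out.printf("IC:  %.0f%n", eval.incorrect());
            System.out.println(eval.toClassDetailsString()); // TPR, FPR, precision, F-measure, ROC area
        }
    }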

Conclusion

A solution based on machine learning for detecting anomalies in electrical substations was proposed. First, data collection was done using a replay tool; then three different attacks were performed (passive eavesdropping, reprogramming, DoS). The Weka tool was used to compare the accuracy and effectiveness of different classifiers. We used a dataset with 903 examples/instances, consisting of 183 target instances and 720 outlier instances. The following classifiers were used: NaiveBayes, IBk, J48, RandomForest, RandomTree, Decision Table and OneR. The results were analyzed based on accuracy, F1 measure and ROC value in order to find the best approach for our problem. The Decision Table turned out to be the best solution, with an accuracy of 91.69%. Future work aims at improving classification accuracy by using larger data sets and refined feature selection.