gchq / palisade
A Tool for Complex and Scalable Data Access Policy Enforcement
License: Apache License 2.0
At the moment the directory and system resources have a type and serialised format, but I don't think it makes sense for them to have those attributes. Should the resource interfaces therefore include a LeafResource interface that carries the type and serialised format?
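A minimal sketch of what that split might look like (interface and method names here are assumptions, not the current API):

// Sketch: only leaf resources carry a type and serialised format.
interface Resource {
    String getId();
}

// DirectoryResource and SystemResource would implement Resource alone,
// while FileResource would implement LeafResource.
interface LeafResource extends Resource {
    String getType();
    String getSerialisedFormat();
}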
This could be the URL for a ProxyRestDataService or an instance of a DataService running in the same JVM.
The Docker example deployment is broken, and the breakage has not been picked up by the Travis CI tests.
The examples and integration tests are leaving around far too many data/example/exampleObj_file1.txt files. These should be cleaned up on both normal and exceptional exits.
Once this is done, remove the offending lines from the master .gitignore file.
It is expected that the client will need to take a DataRequestResponse returned by the palisade service and create multiple copies of it, each of whose Resource -> ConnectionDetail map contains only a subset of the Resources. This enables the client to define how to parallelise the requests for data across multiple processes/mappers/machines.
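A rough illustration of that splitting, assuming DataRequestResponse exposes its map via a getResources() accessor (an assumption) alongside the resource(...) builder method:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: split one DataRequestResponse into one-per-resource responses so
// each can be handed to a separate process/mapper/machine.
static List<DataRequestResponse> splitPerResource(final DataRequestResponse response) {
    final List<DataRequestResponse> parts = new ArrayList<>();
    for (final Map.Entry<Resource, ConnectionDetail> entry : response.getResources().entrySet()) {
        final DataRequestResponse part = new DataRequestResponse();
        part.resource(entry.getKey(), entry.getValue());
        parts.add(part);
    }
    return parts;
}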
Create a Python client that makes it simple for Python users to read data via Palisade. It will need to provide a way for users to specify the resources they want to access and any key/value configuration that should go into the Context object. The users should then get a pandas DataFrame back.
Create a client that users can use from the command line to access data via Palisade.
This could use any number of open source distributed storage technologies.
HDFS resource service can't find the path id for a file in a JAR
Work out the best way to hook into Alluxio so that users interact with Alluxio as they currently do, but with Palisade hooking in just before the data is sent back to the user, ensuring the data has gone through Palisade to apply the data access policies and the required logging.
Once gh-53 is completed, the palisade services should only need to be configured to point at a config management service in order to initialise.
Currently there is no error handling in PRR for dealing with DataServices that can't be contacted. The CompletableFuture objects may fail with various completion exceptions, but at the moment these are not dealt with in the class at all.
Suitable code should be added to gracefully handle the failure and move on to the next resource, as sketched below.
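One possible shape for that handling, using CompletableFuture's exceptionally to log and skip a failed resource rather than failing the whole request (contactDataService and LOGGER are illustrative names, not the existing code):

import java.util.concurrent.CompletableFuture;

// Sketch: if a DataService cannot be contacted, log the failure and yield
// null for that resource instead of propagating the CompletionException,
// so the remaining resources can still be processed.
CompletableFuture<ReadResponse> future =
        contactDataService(resource)  // hypothetical call per resource
                .exceptionally(throwable -> {
                    LOGGER.warn("Failed to contact data service for {}: {}",
                            resource, throwable.getMessage());
                    return null;      // skip this resource, move on to the next
                });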
All the contents of the mapreduce library module are actually client implementation details specific to a MapReduce client. We should therefore move the module and rename it accordingly.
We need to make it clearer what DataFormat and DataType mean.
Once gh-53 is completed, the data services should only need to be configured to point at a config management service in order to initialise.
Update the HDFS resource service to store the mapping of data type to connection details in a cache service rather than a local HashMap.
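Loosely, the change replaces per-instance state with cache service calls. The request class names below follow the AddCacheRequest/GetCacheRequest pattern listed elsewhere in these issues, but the exact builder API is an assumption:

// Before (local, per-instance state):
// private final Map<String, ConnectionDetail> dataTypeToConnection = new HashMap<>();

// After (shared and persisted via the cache service) -- a sketch only:
cacheService.add(new AddCacheRequest().key("type:" + dataType).value(connectionDetail));
final ConnectionDetail detail =
        cacheService.get(new GetCacheRequest().key("type:" + dataType));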
The DataService should return a stream of bytes, probably an InputStream, rather than a Stream.
Once gh-53 is completed, the cache services should only need to be configured to point at a config management service in order to initialise.
Create a client package for Palisade in the same way that the other data source packages at https://spark-packages.org/?q=tags%3A"Data Sources" do. The client should enable a user to use the following syntax to query data via palisade:
spark.read.format("palisade")
    .option("justification", "testing client")
    .load("hdfs://dir/testfile.avro");
This should add any key/value pairs from the option calls into a Context object, with the load value used as the resource ID. The client will then interact with the palisade service and the relevant data services to return the data as a Spark DataFrame.
Once gh-53 is completed, the audit service should only need to be configured to point at a config management service in order to initialise.
Currently the methods for adding new rules to the Rules and Policy classes take a unique id (String) and a rule, and you then have to update the message in the Rules/Policy object separately to reflect the new rule.
I therefore propose changing the method to take the user-friendly message describing what the rule does, together with the rule. A UUID can then be created to serve as the rule's unique id, and the message can be automatically appended to the existing message.
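Sketched out, the proposed method might look like this (assuming Rules keeps a message field and a map of id -> rule; Rule stands in for the real rule type):

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

interface Rule<T> { }  // placeholder for the real rule interface

// Sketch of the proposed API: callers supply a human-readable message and the
// rule; the unique id is generated and the message appended automatically.
class Rules<T> {
    private final Map<String, Rule<T>> rules = new HashMap<>();
    private String message = "";

    public Rules<T> rule(final String ruleMessage, final Rule<T> rule) {
        final String id = UUID.randomUUID().toString();
        rules.put(id, rule);
        message = message.isEmpty() ? ruleMessage : message + "; " + ruleMessage;
        return this;
    }
}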
https://github.com/gchq/koryphe provides several lightweight reusable functions and predicates that would be useful for defining the policy rules. It also provides useful mechanisms for extracting fields from objects and passing them to functions.
Once gh-25 is merged into develop, it would be useful to get all the examples using the same, more complex data model.
It has been highlighted that the dataType attribute on the resource, which specifies the Java object that the data will be serialised to, can be confused with the type of the file, such as Avro or CSV.
We could therefore rename it to dataSchema, as the Java object represents the structure of the data.
The Context class will be the way the user passes through any query-time configuration, such as the justification, push-down filters, etc. It is also where we can add environmental properties such as the system being used (which could be added by the client code or the Palisade service, dependent on deployment decisions).
The Context class will be a wrapper around a HashMap<String, Object>, with some of the common keys declared to enable consistency.
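A minimal sketch of that wrapper (the JUSTIFICATION key reflects the existing justification concept; the rest of the shape is an assumption):

import java.util.HashMap;
import java.util.Map;

// Sketch: a thin wrapper around HashMap<String, Object> with common keys
// declared as constants so clients and services agree on spelling.
class Context {
    public static final String JUSTIFICATION = "justification";

    private final Map<String, Object> contents = new HashMap<>();

    public Context put(final String key, final Object value) {
        contents.put(key, value);
        return this;
    }

    public Object get(final String key) {
        return contents.get(key);
    }

    public String getJustification() {
        return (String) contents.get(JUSTIFICATION);
    }
}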
Many of the builder methods take objects of the form:
public DataRequestResponse resource(final Resource resource, final ConnectionDetail connectionDetail) {
    resources.put(resource, connectionDetail);
    return this;
}
Without the appropriate Objects.requireNonNull(...) checks in place, we are allowing null values to be injected into objects. This can cause issues later at runtime, when an unrelated method suddenly finds a null object or map where it is not expecting one or cannot handle one.
The following classes need modification (derived from "develop" branch):
UserId
User
Context
RequestID
Justification
SimpleConnectionDetail
Request
DataRequestResponse
DataRequestConfig
AbstractResource
System/Stream/File/DirectoryResource
Rules
AuditRequest
AddCacheRequest
GetCacheRequest
ReadRequest
ReadResponse
DataReaderRequest
DataReaderResponse
RegisterDataRequest
MultiPolicy
Policy
CanAccessRequest
GetPolicyRequest
SetPolicyRequest
AddResourceRequest
GetResourcesBy*Request
AddUserRequest
GetUserRequest
For an argument named "dataItem", null checks should be of the form:
Objects.requireNonNull(dataItem, "dataItem");
For arrays, e.g. public User auths(final String... auths), the following should be used:
Objects.requireNonNull(auths, "auths");
for (final String auth : auths) {
    if (auth == null) {
        throw new NullPointerException("entry in auths");
    }
}
For situations where a String is initialised to null by the constructor, a sentinel "UNKNOWN" value should be used instead. For an example of how to do this, please see the UserId or RequestId classes.
Create a resource service that uses the HDFS API to find out what resources exist and to return the list of resources.
The connection details should be of the form 'hdfs://<path to resource>'.
The format should be the file ending, e.g. the filename bob_00001.txt would have a format of 'txt'.
The type should be the first word in the resource name, e.g. the filename bob_00001.txt would have a type of 'bob'.
The id should be the same as the connection details.
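Under those conventions, deriving the attributes from a file name is straightforward (a sketch; the real service would also need to handle names that don't match the pattern):

// Sketch: derive type and format from a name such as "bob_00001.txt".
static String typeOf(final String fileName) {
    return fileName.substring(0, fileName.indexOf('_'));      // "bob"
}

static String formatOf(final String fileName) {
    return fileName.substring(fileName.lastIndexOf('.') + 1); // "txt"
}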
Just as with the single JVM and multi JVM examples it would be good to have an example of how to write a MapReduce client using the Palisade input formats and record readers on a Hadoop cluster.
Once we have all the services using a config management service, we can tidy up the Docker example to make use of it: put all the configuration into the config management service and have the Docker images pull their configuration rather than storing it in lots of flat files. Then we can remove the currently used Maven build process.
Explore what helpful scripts or tools can be created to ease the deployment of Palisade on Kubernetes.
Update the MapReduce example to pull in the MapReduceClient as a reusable component.
We want to be able to apply policies to a data type rather than adding the same policies to potentially thousands of resources of the same data type.
The initial idea would be to have two tables: one mapping resources to policies and one mapping data types to policies.
To find the policies for a resource, you would first get the relevant (resource/record level) policy for that resource, then work your way up the resource parentage chain, and finally get the relevant policy for the type. This way child resource policies take precedence over their parents, and parents take precedence over the data type.
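One simple reading of that lookup order, returning the most specific policy found (a sketch: the two maps stand in for the proposed tables, and getParent() on resources is an assumption):

import java.util.Map;

// Sketch: child resource policies win over parents; parents win over the type.
static Policy policyFor(final Resource resource, final String dataType,
                        final Map<Resource, Policy> resourcePolicies,
                        final Map<String, Policy> typePolicies) {
    for (Resource current = resource; current != null; current = current.getParent()) {
        final Policy policy = resourcePolicies.get(current);  // table 1: resource -> policy
        if (policy != null) {
            return policy;
        }
    }
    return typePolicies.get(dataType);                        // table 2: type -> policy
}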
We can change the cache service to make it more generic so that it can be used to store any key/value mappings that may need to be persisted and/or shared between lots of processes.
Currently, if the policy service is asked for a policy for a resource of which it has no record (anywhere in the hierarchy), it returns an empty policy that equates to "show everything". This is not a sensible default; as a fail-safe, the default behaviour should be to throw an error for requested resources that have no associated policy.
Break the developer guide down into:
We need to update the equals and hashCode methods of the Request implementations so they call super.equals and super.hashCode.
These methods can be auto-generated using IntelliJ.
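The generated methods should follow this shape (sketched for an arbitrary Request subclass; the resourceId field is illustrative):

import java.util.Objects;

// Sketch: a Request subclass whose equals/hashCode include the superclass state.
class RegisterDataRequest extends Request {
    private String resourceId;

    @Override
    public boolean equals(final Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        if (!super.equals(o)) {          // include Request's own fields (e.g. id)
            return false;
        }
        final RegisterDataRequest that = (RegisterDataRequest) o;
        return Objects.equals(resourceId, that.resourceId);
    }

    @Override
    public int hashCode() {
        return Objects.hash(super.hashCode(), resourceId);  // mix in super's hash
    }
}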
In order to pass the Request objects around as JSON, the id needs to be settable from JSON; currently it cannot be set manually. We should remove the 'final' keyword and add a setter.
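Concretely, the change amounts to (a sketch of the Request class; field type taken from the RequestID class listed above):

// Sketch: 'id' loses its 'final' modifier so a JSON mapper can set it
// when deserialising a Request.
class Request {
    private RequestID id = new RequestID();  // previously: private final RequestID id;

    public RequestID getId() {
        return id;
    }

    public void setId(final RequestID id) {
        this.id = id;
    }
}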
Add to the documentation to make it clear that there is a mechanism by which we can secure the client's request for data: for example, by authenticating the user when they register their data access request and then providing a token which can be passed to the data service and verified when the data service requests the trusted data from the palisade service.
I don't think the RecordResource is required or makes sense as a resource, given we would not be able to directly access a given record.
The mapreduce, single JVM and multi JVM examples contain some duplicated modules at the moment, such as xxx-example-model and associated classes. These could be combined under a single example-model module.
Create an audit service implementation which can create an audit record at the following times:
This may require some changes to the AuditRequest object to include an AuditType. Alternatively, you could assume that if there is an exception then the request has failed, and that if there is a howItWasProcessed then it was successfully served.
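Under the second option, the audit service would infer the outcome from the request's fields (a sketch; the accessors and recordFailure/recordSuccess helpers are assumptions):

// Sketch: infer success/failure from the AuditRequest itself rather than
// adding an explicit AuditType.
void audit(final AuditRequest request) {
    if (request.getException() != null) {
        recordFailure(request);          // the request failed with an exception
    } else if (request.getHowItWasProcessed() != null) {
        recordSuccess(request);          // the request was successfully served
    }
}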
Create a data service that can read Avro files stored in HDFS and return the records.
There will be a requirement that the data service is initialised with a schema for reading the data and a converter for turning each datum into the POJO that the rules expect each record to be.
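Reading Avro records out of HDFS can be done with the standard Avro file APIs; a sketch under the assumption that the service holds an Avro Schema and a converter function:

import java.io.InputStream;
import java.util.function.Function;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: open an Avro file in HDFS, read each datum and convert it to the
// POJO the policy rules expect. The converter is supplied at initialisation.
static <T> void readAvro(final Path path, final Schema schema,
                         final Function<GenericRecord, T> converter) throws Exception {
    final FileSystem fs = FileSystem.get(new Configuration());
    try (InputStream in = fs.open(path);
         DataFileStream<GenericRecord> records =
                 new DataFileStream<>(in, new GenericDatumReader<>(schema))) {
        for (final GenericRecord record : records) {
            final T pojo = converter.apply(record);  // datum -> rule-facing POJO
            // ... apply the policy rules to 'pojo' and stream it back to the client
        }
    }
}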
This will be fixed as part of the fix to #70
This should include some nesting to be able to test policy definitions against more complex data schemas