gchq / palisade
A Tool for Complex and Scalable Data Access Policy Enforcement
License: Apache License 2.0
At the moment the directory and system resources have a type and serialised format, but I don't think it makes sense for them to have those attributes. Should the resource interfaces therefore include a LeafResource interface that carries the type and serialised format?
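A minimal sketch of what that split might look like (interface and method names here are assumptions, not the current API):

// Sketch: only leaf resources carry a type and serialised format.
interface Resource {
    String getId();
}

// DirectoryResource and SystemResource would implement Resource alone,
// while FileResource would implement LeafResource.
interface LeafResource extends Resource {
    String getType();
    String getSerialisedFormat();
}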
This could be the URL for a ProxyRestDataService or an instance of a DataService running in the same JVM.
The Docker example deployment is broken, and the breakage has not been picked up by the Travis CI tests.
The examples and integration tests are leaving around far too many data/example/exampleObj_file1.txt files. These should be cleaned up on both normal and exceptional exits.
Once this is done, remove the offending lines from the master .gitignore file.
It is expected that the client will need to take a DataRequestResponse returned by the palisade service and create multiple copies of it, each of whose Resource -> ConnectionDetail map contains only a subset of the Resources. This enables the client to define how to parallelise the requests for data across multiple processes/mappers/machines.
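A rough illustration of that splitting, assuming DataRequestResponse exposes its map via a getResources() accessor (an assumption) alongside the resource(...) builder method:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: split one DataRequestResponse into one-per-resource responses so
// each can be handed to a separate process/mapper/machine.
static List<DataRequestResponse> splitPerResource(final DataRequestResponse response) {
    final List<DataRequestResponse> parts = new ArrayList<>();
    for (final Map.Entry<Resource, ConnectionDetail> entry : response.getResources().entrySet()) {
        final DataRequestResponse part = new DataRequestResponse();
        part.resource(entry.getKey(), entry.getValue());
        parts.add(part);
    }
    return parts;
}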
Create a Python client that makes it simple for Python users to read data via Palisade. It will need to provide a way for users to specify the resources they want to access and any key/value configuration that should go into the Context object. The users should then get a pandas DataFrame back.
Create a client that users can use from the command line to access data via Palisade.
This could use any number of open source distributed storage technologies.
HDFS resource service can't find the path id for a file in a JAR
Work out the best way to hook into Alluxio so that users interact with Alluxio as they currently do, but with Palisade hooking in just before the data is sent back to the user, ensuring the data has gone through Palisade to apply the data access policies and the required logging.
Once gh-53 is completed, the palisade services should only need to be configured to point at a config management service in order to initialise.
Currently there is no error handling in PRR for dealing with DataServices that can't be contacted. The CompletableFuture objects may fail with various completion exceptions, but at the moment these are not dealt with in the class at all.
Suitable code should be added to gracefully handle the failure and move on to the next resource, as sketched below.
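One possible shape for that handling, using CompletableFuture's exceptionally to log and skip a failed resource rather than failing the whole request (contactDataService and LOGGER are illustrative names, not the existing code):

import java.util.concurrent.CompletableFuture;

// Sketch: if a DataService cannot be contacted, log the failure and yield
// null for that resource instead of propagating the CompletionException,
// so the remaining resources can still be processed.
CompletableFuture<ReadResponse> future =
        contactDataService(resource)  // hypothetical call per resource
                .exceptionally(throwable -> {
                    LOGGER.warn("Failed to contact data service for {}: {}",
                            resource, throwable.getMessage());
                    return null;      // skip this resource, move on to the next
                });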
All the contents of the mapreduce library module are actually client implementation details specific to a MapReduce client. We should therefore move the module and rename it accordingly.
We need to make it clearer what DataFormat and DataType mean.
Once gh-53 is completed, the data services should only need to be configured to point at a config management service in order to initialise.
Update the HDFS resource service to store the mapping of data type to connection details in a cache service rather than a local HashMap.
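Loosely, the change replaces per-instance state with cache service calls. The request class names below follow the AddCacheRequest/GetCacheRequest pattern listed elsewhere in these issues, but the exact builder API is an assumption:

// Before (local, per-instance state):
// private final Map<String, ConnectionDetail> dataTypeToConnection = new HashMap<>();

// After (shared and persisted via the cache service) -- a sketch only:
cacheService.add(new AddCacheRequest().key("type:" + dataType).value(connectionDetail));
final ConnectionDetail detail =
        cacheService.get(new GetCacheRequest().key("type:" + dataType));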
The DataService should return a stream of bytes, probably an InputStream, rather than a Stream.
Once gh-53 is completed, the cache services should only need to be configured to point at a config management service in order to initialise.
Create a client package for Palisade in the same way that the other data source packages at https://spark-packages.org/?q=tags%3A"Data Sources" do. The client should enable a user to use the following syntax to query data via palisade:
spark.read.format("palisade")
    .option("justification", "testing client")
    .load("hdfs://dir/testfile.avro");
This should add any key/value pairs from the option calls into a Context object, with the load value used as the resource ID. The client will then interact with the palisade service and the relevant data services to return the data as a Spark DataFrame.
Once gh-53 is completed, the audit service should only need to be configured to point at a config management service in order to initialise.
Currently the methods for adding new rules to the Rules and Policy classes take a unique id (String) and a rule, and you then have to update the message in the Rules/Policy object separately to reflect the new rule.
I therefore propose changing the method to take the user-friendly message describing what the rule does, together with the rule. A UUID can then be created to serve as the rule's unique id, and the message can be automatically appended to the existing message.
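Sketched out, the proposed method might look like this (assuming Rules keeps a message field and a map of id -> rule; Rule stands in for the real rule type):

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

interface Rule<T> { }  // placeholder for the real rule interface

// Sketch of the proposed API: callers supply a human-readable message and the
// rule; the unique id is generated and the message appended automatically.
class Rules<T> {
    private final Map<String, Rule<T>> rules = new HashMap<>();
    private String message = "";

    public Rules<T> rule(final String ruleMessage, final Rule<T> rule) {
        final String id = UUID.randomUUID().toString();
        rules.put(id, rule);
        message = message.isEmpty() ? ruleMessage : message + "; " + ruleMessage;
        return this;
    }
}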
https://github.com/gchq/koryphe provides several lightweight reusable functions and predicates that would be useful for defining the policy rules. It also provides useful mechanisms for extracting fields from objects and passing them to functions.
Once gh-25 is merged into develop, it would be useful to get all the examples using the same, more complex data model.
It has been highlighted that the dataType attribute on the resource, which specifies the Java object that the data will be serialised to, can be confused with the type of the file, such as Avro or CSV.
We could therefore rename it to dataSchema, as the Java object represents the structure of the data.
The Context class will be the way the user passes through any query-time configuration, such as the justification, push-down filters, etc. It is also where we can add environmental properties such as the system being used (which could be added by the client code or the Palisade service, dependent on deployment decisions).
The Context class will be a wrapper around a HashMap<String, Object>, with some of the common keys declared to enable consistency.
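A minimal sketch of that wrapper (the JUSTIFICATION key reflects the existing justification concept; the rest of the shape is an assumption):

import java.util.HashMap;
import java.util.Map;

// Sketch: a thin wrapper around HashMap<String, Object> with common keys
// declared as constants so clients and services agree on spelling.
class Context {
    public static final String JUSTIFICATION = "justification";

    private final Map<String, Object> contents = new HashMap<>();

    public Context put(final String key, final Object value) {
        contents.put(key, value);
        return this;
    }

    public Object get(final String key) {
        return contents.get(key);
    }

    public String getJustification() {
        return (String) contents.get(JUSTIFICATION);
    }
}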
Many of the builder methods take objects of the form:
public DataRequestResponse resource(final Resource resource, final ConnectionDetail connectionDetail) {
    resources.put(resource, connectionDetail);
    return this;
}
Without the appropriate Objects.requireNonNull(...) checks in place, we are allowing null values to be injected into objects. This can cause issues later at runtime, when an unrelated method suddenly finds a null object or map where it is not expecting one or cannot handle one.
The following classes need modification (derived from "develop" branch):
UserId
User
Context
RequestID
Justification
SimpleConnectionDetail
Request
DataRequestResponse
DataRequestConfig
AbstractResource
System/Stream/File/DirectoryResource
Rules
AuditRequest
AddCacheRequest
GetCacheRequest
ReadRequest
ReadResponse
DataReaderRequest
DataReaderResponse
RegisterDataRequest
MultiPolicy
Policy
CanAccessRequest
GetPolicyRequest
SetPolicyRequest
AddResourceRequest
GetResourcesBy*Request
AddUserRequest
GetUserRequest
For an argument named "dataItem", null checks should be of the form:
Objects.requireNonNull(dataItem, "dataItem");
For arrays, e.g. public User auths(final String... auths), the following should be used:
Objects.requireNonNull(auths, "auths");
for (final String auth : auths) {
    if (auth == null) {
        throw new NullPointerException("entry in auths");
    }
}
For situations where a String is initialised to null by the constructor, a sentinel "UNKNOWN" value should be used instead. For an example of how to do this, please see the UserId or RequestId classes.
Create a resource service that uses the HDFS API to find out what resources exist and to return the list of resources.
The connection details should be of the form 'hdfs://<path to resource>'.
The format should be the file ending, e.g. the filename bob_00001.txt would have a format of 'txt'.
The type should be the first word in the resource name, e.g. the filename bob_00001.txt would have a type of 'bob'.
The id should be the same as the connection details.
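Under those conventions, deriving the attributes from a file name is straightforward (a sketch; the real service would also need to handle names that don't match the pattern):

// Sketch: derive type and format from a name such as "bob_00001.txt".
static String typeOf(final String fileName) {
    return fileName.substring(0, fileName.indexOf('_'));      // "bob"
}

static String formatOf(final String fileName) {
    return fileName.substring(fileName.lastIndexOf('.') + 1); // "txt"
}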
Just as with the single JVM and multi JVM examples it would be good to have an example of how to write a MapReduce client using the Palisade input formats and record readers on a Hadoop cluster.
Once we have all the services using a config management service, we can tidy up the Docker example to make use of it: put all the configuration into the config management service and have the Docker images pull their configuration rather than storing it in lots of flat files. Then we can remove the currently used Maven build process.
Explore what helpful scripts or tools can be created to ease the deployment of Palisade on Kubernetes.
Update the MapReduce example to pull in the MapReduceClient as a reusable component.
We want to be able to apply policies to a data type rather than adding the same policies to potentially thousands of resources of the same data type.
The initial idea would be to have two tables: one mapping resources to policies and one mapping data types to policies.
To find the policies for a resource, you would first get the relevant (resource/record level) policy for that resource, then work your way up the resource parentage chain, and finally get the relevant policy for the type. This way child resource policies take precedence over their parents, and parents take precedence over the data type.
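One simple reading of that lookup order, returning the most specific policy found (a sketch: the two maps stand in for the proposed tables, and getParent() on resources is an assumption):

import java.util.Map;

// Sketch: child resource policies win over parents; parents win over the type.
static Policy policyFor(final Resource resource, final String dataType,
                        final Map<Resource, Policy> resourcePolicies,
                        final Map<String, Policy> typePolicies) {
    for (Resource current = resource; current != null; current = current.getParent()) {
        final Policy policy = resourcePolicies.get(current);  // table 1: resource -> policy
        if (policy != null) {
            return policy;
        }
    }
    return typePolicies.get(dataType);                        // table 2: type -> policy
}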
We can change the cache service to make it more generic so that it can be used to store any key/value mappings that may need to be persisted and/or shared between lots of processes.
Currently, if the policy service is asked for a policy for a resource of which it has no record (anywhere in the hierarchy), it returns an empty policy that equates to "show everything". This is not a sensible default; as a fail-safe, the default behaviour should be to throw an error for requested resources that have no associated policy.
Break the developer guide down into:
We need to update the equals and hashCode methods of the Request implementations so they call super.equals and super.hashCode.
These methods can be auto-generated using IntelliJ.
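The generated methods should follow this shape (sketched for an arbitrary Request subclass; the resourceId field is illustrative):

import java.util.Objects;

// Sketch: a Request subclass whose equals/hashCode include the superclass state.
class RegisterDataRequest extends Request {
    private String resourceId;

    @Override
    public boolean equals(final Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        if (!super.equals(o)) {          // include Request's own fields (e.g. id)
            return false;
        }
        final RegisterDataRequest that = (RegisterDataRequest) o;
        return Objects.equals(resourceId, that.resourceId);
    }

    @Override
    public int hashCode() {
        return Objects.hash(super.hashCode(), resourceId);  // mix in super's hash
    }
}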
In order to pass the Request objects around as JSON, the id needs to be settable from JSON; currently it cannot be set manually. We should remove the 'final' keyword and add a setter.
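Concretely, the change amounts to (a sketch of the Request class; field type taken from the RequestID class listed above):

// Sketch: 'id' loses its 'final' modifier so a JSON mapper can set it
// when deserialising a Request.
class Request {
    private RequestID id = new RequestID();  // previously: private final RequestID id;

    public RequestID getId() {
        return id;
    }

    public void setId(final RequestID id) {
        this.id = id;
    }
}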
Add to the documentation to make it clear that there is a mechanism by which we can secure the client's request for data: for example, by authenticating the user when they register their data access request and then providing a token which can be passed to the data service and verified when the data service requests the trusted data from the palisade service.
I don't think the RecordResource is required or makes sense as a resource, given we would not be able to directly access a given record.
The mapreduce, single JVM and multi JVM examples contain some duplicated modules at the moment, such as xxx-example-model and associated classes. These could be combined under a single example-model module.
Create an audit service implementation which can create an audit record at the following times:
This may require some changes to the AuditRequest object to include an AuditType. Alternatively, you could assume that if there is an exception then the request has failed, and that if there is a howItWasProcessed then it was successfully served.
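Under the second option, the audit service would infer the outcome from the request's fields (a sketch; the accessors and recordFailure/recordSuccess helpers are assumptions):

// Sketch: infer success/failure from the AuditRequest itself rather than
// adding an explicit AuditType.
void audit(final AuditRequest request) {
    if (request.getException() != null) {
        recordFailure(request);          // the request failed with an exception
    } else if (request.getHowItWasProcessed() != null) {
        recordSuccess(request);          // the request was successfully served
    }
}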
Create a data service that can read Avro files stored in HDFS and return the records.
There will be a requirement that the data service is initialised with a schema for reading the data and a converter for turning each datum into the POJO that the rules expect each record to be.
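Reading Avro records out of HDFS can be done with the standard Avro file APIs; a sketch under the assumption that the service holds an Avro Schema and a converter function:

import java.io.InputStream;
import java.util.function.Function;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: open an Avro file in HDFS, read each datum and convert it to the
// POJO the policy rules expect. The converter is supplied at initialisation.
static <T> void readAvro(final Path path, final Schema schema,
                         final Function<GenericRecord, T> converter) throws Exception {
    final FileSystem fs = FileSystem.get(new Configuration());
    try (InputStream in = fs.open(path);
         DataFileStream<GenericRecord> records =
                 new DataFileStream<>(in, new GenericDatumReader<>(schema))) {
        for (final GenericRecord record : records) {
            final T pojo = converter.apply(record);  // datum -> rule-facing POJO
            // ... apply the policy rules to 'pojo' and stream it back to the client
        }
    }
}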
This will be fixed as part of the fix to #70
This should include some nesting to be able to test policy definitions against more complex data schemas