palisade's People

Contributors

ac74475, davidradl, dev-958, dev930018, developer01189998819991197253, developer6959, gaffer01, gchqdeveloper404, m09526, m55624, m78233, mw342762, nw1984, p0000001, p013570, pd104923, r34721, tar7575, timyagan, w86432

palisade's Issues

Does it make sense for a directory to have a type and format?

At the moment the directory and system resources have a type and serialised format; however, I don't think it makes sense for them to have those attributes. Do the resource interfaces therefore need a LeafResource interface that carries the type and serialised format?
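As a sketch of what that split might look like (all class and method names here are assumptions based on the issue text, not a confirmed API), only leaf resources would expose the type and serialised format:

```java
// Sketch only: LeafResource is the name suggested in the issue.
interface Resource {
    String getId();
}

// A leaf resource points at actual data, so it is the only place where
// a type and serialised format make sense.
interface LeafResource extends Resource {
    String getType();
    String getSerialisedFormat();
}

class FileResource implements LeafResource {
    private final String id;
    private final String type;
    private final String serialisedFormat;

    FileResource(final String id, final String type, final String serialisedFormat) {
        this.id = id;
        this.type = type;
        this.serialisedFormat = serialisedFormat;
    }

    public String getId() { return id; }
    public String getType() { return type; }
    public String getSerialisedFormat() { return serialisedFormat; }
}

// A directory keeps only an id; there is no type/format to misuse.
class DirectoryResource implements Resource {
    private final String id;

    DirectoryResource(final String id) { this.id = id; }

    public String getId() { return id; }
}
```

Directory and system resources would then simply not implement LeafResource, making the mismatch impossible at compile time.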

Ensure examples and Integration tests clean up

The examples and integration tests are leaving data/example/exampleObj_file1.txt files lying around everywhere. These should be cleaned up on both normal and exceptional exits.

Once this is done, remove the offending lines from the master .gitignore file.

Add helpful methods to DataRequestResponse class

It is expected that the client will need to take a DataRequestResponse returned by the Palisade service and create multiple copies of it, each containing only a subset of the Resource -> ConnectionDetail map. That enables the client to define how to parallelise the requests for data across multiple processes/mappers/machines.
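A minimal sketch of that splitting, using plain maps in place of the real DataRequestResponse/Resource/ConnectionDetail classes to stay self-contained (the helper name and round-robin strategy are assumptions):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DataRequestResponseSplitter {
    // Hypothetical helper: partition the resource -> connection-detail map
    // into n roughly equal sub-maps, one per process/mapper/machine.
    // In the real class each sub-map would be wrapped in its own
    // DataRequestResponse.
    static List<Map<String, String>> split(final Map<String, String> resources, final int n) {
        final List<Map<String, String>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            parts.add(new LinkedHashMap<>());
        }
        int next = 0;
        for (final Map.Entry<String, String> entry : resources.entrySet()) {
            parts.get(next++ % n).put(entry.getKey(), entry.getValue());
        }
        return parts;
    }
}
```

Each sub-map carries a disjoint subset of the resources, so a mapper can be handed one part without duplicating work.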

Create a python client

Create a Python client that makes it simple for Python users to read data via Palisade. It will need to provide a way for users to supply the resources they want to access and any key/value configuration that would go into the Context object. The users should then get a pandas DataFrame returned.

Explore integration with Alluxio

Work out the best way to hook into Alluxio so that users interact with Alluxio as they currently do, but with Palisade hooking in just before the data is sent back, ensuring the data passes through Palisade to apply the data access policies and the required logging.

Error handling for PalisadeRecordReader

Currently there is no error handling in PalisadeRecordReader for dealing with DataServices that can't be contacted. The CompletableFuture objects may fail with various completion exceptions, but at the moment these are not dealt with in the class at all.

Suitable code should be added to gracefully handle the failure and move on to the next resource.
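A minimal sketch of the intended behaviour (class and method names are assumptions, with String standing in for the real record type): join each resource's future, catch the completion exception, log it, and carry on to the next resource.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class RecordReaderErrorHandling {
    // When a DataService cannot be contacted its future completes
    // exceptionally; instead of propagating, log the failure and move
    // on to the next resource.
    static List<String> readAll(final List<CompletableFuture<String>> responses) {
        final List<String> records = new ArrayList<>();
        for (final CompletableFuture<String> response : responses) {
            try {
                records.add(response.join());
            } catch (CompletionException | CancellationException e) {
                System.err.println("Skipping uncontactable data service: " + e);
            }
        }
        return records;
    }
}
```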

Create an Apache Spark client

Create a client package for Palisade in the same way that the other data source packages at https://spark-packages.org/?q=tags%3A"Data Sources" do. The client should enable a user to use the following syntax to query data via palisade:

spark.read.format("palisade")
    .option("justification", "testing client")
    .load("hdfs://dir/testfile.avro");

This should add any key/value pairs from the option calls into a Context object, with the load value as the resource ID. The client will then interact with the Palisade service and the relevant data services to return the data as a Spark dataframe.

Modify the Rules and Policy object

Currently the methods for adding new rules to the Rules and Policy classes take a unique id (String) and a rule; you then have to separately update the message in the Rules/Policy object to reflect the new rule.

I therefore propose changing the method to take the user-friendly message describing what the rule does, together with the rule itself. A UUID can then be generated to serve as the unique id, and the message can be automatically appended to the end of the existing message.
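A sketch of the proposed method, using a simplified stand-in for the real Rules class (the real class is richer; this just shows the signature change):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Simplified stand-in for Rules, showing the proposed signature only.
public class Rules {
    private final Map<String, Object> rules = new LinkedHashMap<>();
    private String message = "";

    // Proposed: the caller supplies the user-friendly message and the rule;
    // the unique id is a generated UUID, and the message is appended
    // automatically to the running description.
    public Rules rule(final String ruleMessage, final Object rule) {
        rules.put(UUID.randomUUID().toString(), rule);
        message = message.isEmpty() ? ruleMessage : message + ", " + ruleMessage;
        return this;
    }

    public String getMessage() { return message; }

    public Map<String, Object> getRules() { return rules; }
}
```

The caller never handles ids, and the description stays in sync with the rule set by construction.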

Rename the resource DataType attribute to DataSchema?

It has been highlighted that the dataType attribute on the resource, which specifies the Java object that the data will be serialised to, can be confused with the type of the file, such as Avro or CSV.

We could therefore rename it to dataSchema instead, as the Java object represents the structure of the data.

Refactor the Justification class to be a Context class

The Context class will be the way for the user to pass through any query-time configuration, such as the justification, push-down filters, etc. It is also where we can add environmental properties such as the system being used (which could be added by the client code or the Palisade service, dependent on deployment decisions).
The Context class will be a wrapper around a HashMap<String, Object>, with some of the common keys declared to enable consistency.
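A sketch of that wrapper (the key names and convenience accessor are assumptions):

```java
import java.util.HashMap;
import java.util.Map;

// Thin wrapper around a HashMap<String, Object>, with common keys
// declared as constants so all callers spell them the same way.
public class Context {
    public static final String JUSTIFICATION = "justification";

    private final Map<String, Object> contents = new HashMap<>();

    public Context put(final String key, final Object value) {
        contents.put(key, value);
        return this;
    }

    public Object get(final String key) {
        return contents.get(key);
    }

    // Convenience accessor for one of the declared common keys.
    public String getJustification() {
        return (String) contents.get(JUSTIFICATION);
    }
}
```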

Add null check on all the various builder methods around the codebase

Many of the builder methods take objects of the form:

public DataRequestResponse resource(final Resource resource, final ConnectionDetail connectionDetail) {
        resources.put(resource, connectionDetail);
        return this;
}

Without the appropriate "Objects.requireNonNull(....)" checks in place, we are allowing null values to be injected into objects. This can cause issues later at runtime when an unrelated method suddenly finds a null object or map where it is not expecting/cannot handle one.

The following classes need modification (derived from "develop" branch):

UserId
User
Context
RequestID
Justification
SimpleConnectionDetail
Request
DataRequestResponse
DataRequestConfig
AbstractResource
System/Stream/File/DirectoryResource
Rules
AuditRequest
AddCacheRequest
GetCacheRequest
ReadRequest
ReadResponse
DataReaderRequest
DataReaderResponse
RegisterDataRequest
MultiPolicy
Policy
CanAccessRequest
GetPolicyRequest
SetPolicyRequest
AddResourceRequest
GetResourcesBy*Request
AddUserRequest
GetUserRequest

Null checks should be of the following form for an argument named "dataItem":
Objects.requireNonNull(dataItem, "dataItem");

For arrays, e.g. public User auths(final String... auths) the following should be used:

Objects.requireNonNull(auths, "auths");
for (final String auth : auths) {
    if (auth == null) {
        throw new NullPointerException("entry in auths");
    }
}

For situations where a String is initialised to null by the constructor, a sentinel "UNKNOWN" value should be used. For an example of how to do this, please see the UserId or RequestId classes.

Create a HDFS resource service

Create a resource service that uses the HDFS API to find out what resources exist and to return the list of resources.
The connection details should be in the form of 'hdfs://<path to resource>'.
The format should be the file ending, e.g. the filename bob_00001.txt would have a format of 'txt'.
The type should be the first word in the resource name, e.g. the filename bob_00001.txt would have a type of 'bob'.
The id should be the same as the connection details.
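The naming rules above can be sketched as two small helpers (class and method names are assumptions for illustration):

```java
public class HdfsResourceNaming {
    // Format is the file ending: "bob_00001.txt" -> "txt".
    static String format(final String fileName) {
        final int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1);
    }

    // Type is the first word of the resource name: "bob_00001.txt" -> "bob".
    static String type(final String fileName) {
        final int dot = fileName.lastIndexOf('.');
        final String base = dot < 0 ? fileName : fileName.substring(0, dot);
        final int underscore = base.indexOf('_');
        return underscore < 0 ? base : base.substring(0, underscore);
    }
}
```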

Create a MapReduce example client

Just as with the single JVM and multi JVM examples it would be good to have an example of how to write a MapReduce client using the Palisade input formats and record readers on a Hadoop cluster.

Move example to use the docker stage build deployment

Once we have all the services using a config management service, we can tidy up the Docker example by putting all the configuration into that service and having the Docker images pull the configuration, rather than storing it in lots of flat files. We can then remove the currently used Maven build process.

Create a hierarchical policy service

We want to be able to apply policies to a data type rather than adding the same policies to potentially thousands of resources of the same data type.

The initial idea would be to have 2 tables:

  • The first would be a mapping of type to policies (resource level policy, record level policy)
  • The second would be a mapping of resource to policies (resource level policy, record level policy)

To find the policies for a resource, you would first look up the relevant (resource/record level) policy for that resource, then work your way up the resource parentage chain, and finally fall back to the relevant policy for the type. This way the child resource policies take precedence over the parents', and the parents' take precedence over the data type's.
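The precedence rule above can be sketched as follows (names assumed; plain String maps stand in for the two policy tables):

```java
import java.util.HashMap;
import java.util.Map;

public class HierarchicalPolicyLookup {
    // Walk from the resource up its parentage chain checking the
    // resource table; only if nothing matches, fall back to the type table.
    static String policyFor(final String resourceId, final String type,
                            final Map<String, String> policiesByResource,
                            final Map<String, String> policiesByType) {
        String current = resourceId;
        while (current != null && !current.isEmpty()) {
            final String policy = policiesByResource.get(current);
            if (policy != null) {
                return policy;
            }
            // Parent of "/data/set/file" is "/data/set", and so on up.
            final int slash = current.lastIndexOf('/');
            current = slash <= 0 ? null : current.substring(0, slash);
        }
        return policiesByType.get(type);
    }
}
```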

Policy service should throw an error if no policy can be found

Currently if the policy service is asked for a policy for a resource of which it has no record (anywhere in the hierarchy) it returns an empty policy that equates to "show everything". This is not a sensible default and the default behaviour should be to throw an error for requested resources that have no associated policy as a fail-safe.

Split developer guide into multiple pages

Break the developer guide down into:

  • Initial requirements
  • Design principles
  • High level architectural diagram
  • Description of each component
  • High level architecture using a map reduce client
  • Standard data flow through the Palisade system
  • How might the system be deployed?
  • Roadmap for Palisade

Request implementation classes don't call super in equals/hashcode

We need to update the equals and hashcode methods for the Request implementations so they call super.equals and super.hashcode.

These methods can be auto-generated using IntelliJ.

In order to be able to pass the Request objects around as JSON we would need the id to be able to be set from JSON. Currently you cannot manually set the value. We should remove the 'final' keyword and add a setter.

Clarify the security aspects of the architecture

Add to the documentation to make it clear that there is a mechanism by which we can secure the client's request for data: for example, by authenticating the user when they register their data access request and then providing a token which can be passed to the data service and verified when the data service requests the trusted data from the Palisade service.

Remove the RecordResource

I don't think the RecordResource is required or makes sense as a resource, given we would not be able to directly access a given record.

Create an audit service that can send data to Stroom

Create an audit service implementation which can create an audit record at the following times:

  • When a request for data is received
  • When a request for data fails
  • When a request for data is successfully responded to

This may require some changes to the AuditRequest object to include an AuditType. Alternatively you could assume that if there is an exception then the request has failed, and if there is a howItWasProcessed then it was successfully served.

Create a HDFS Avro data service

Create a data service that can read Avro files stored in HDFS and return the records.
There will be a requirement that the data service is initialised with a schema for reading the data and a converter for turning each datum into the POJO that the rules expect each record to be.
