gchq / kai

Kai is an experimental Graph-as-a-Service framework built with the Amazon CDK

License: Apache License 2.0

JavaScript 0.45% TypeScript 74.22% Python 25.33%

kai's Introduction

Kai

Kai is an experimental Graph as a Service application built on AWS. It uses the Amazon CDK.

The cdk.json file tells the CDK Toolkit how to execute your app.

NOTE: As Kai is currently early in development and likely subject to breaking changes, we do not advise this product be used in any production capacity. If you have an interest in using Kai in production, please watch this repository to stay updated.

Useful commands

npm run build    compile TypeScript to JS
npm run watch    watch for changes and compile
npm run lint     run the ESLint style checking
npm run test     perform the Jest unit tests
npm run e2e      run end-to-end Jest integration tests
cdk deploy       deploy this stack to your default AWS account/region
cdk diff         compare deployed stack with current state
cdk synth        emits the synthesized CloudFormation template

Configuration

Kai has a number of properties that can be altered in the cdk.json file or by passing context values through the --context option.

Name	Type	Default value	Description
vpcId	string	"DEFAULT"	The VPC that the EKS cluster will use. By default it uses the default VPC for the account you're deploying with. If this property is removed, a VPC will be created. If a VPC id is specified, that VPC is used.
extraIngressSecurityGroups	string	""	Additional security groups that will be added to every application load balancer that comes with a Gaffer deployment. To add more than one, use a comma-separated list, e.g. "sg-xxxxxxxxx, sg-yyyyyyyyyy". The security group of the EKS cluster is automatically added.
globalTags	object	{}	Tags that get added to every taggable resource.
clusterNodeGroup	object	null	Configuration for the EKS cluster nodegroup. See below for details.
userPoolConfiguration	object	null	Cognito UserPool configuration. See below for details.
graphDatabaseProps	object	see cdk.json	Configuration for the DynamoDB graph database's autoscaling.

Changing the nodegroup properties

By default, Kai ships with a nodegroup with the following parameters:

{
    "instanceType": "m3.medium",
    "minSize": 1,
    "maxSize": 10,
    "preferredSize": 2
}

These properties are changeable through the context variable: "clusterNodeGroup".

Graph Database Autoscaling

Depending on your needs, you may want to change the autoscaling properties of the Graph Database. The default properties in the cdk.json file are as follows:

{
    "graphDatabaseProps": {
      "minCapacity": 1,
      "maxCapacity": 25,
      "targetUtilizationPercent": 80
    }
}

The min and max capacity relate to DynamoDB's read and write capacity units.

These settings haven't yet been tested at production scale so may change when we do.

Cognito UserPool configuration

By default Kai uses a vanilla AWS Cognito UserPool to manage authentication with the application. The default UserPool and UserPoolClient settings can be overridden by supplying a userPoolConfiguration context option populated as shown here:

{
    "defaultPoolConfig": {
        "userPoolProps": {
            "selfSignUpEnabled": false // See below for full options
        },
        "userPoolClientOptions": {
            "disableOAuth": true // See below for full options
        }
    }
}

The full list of userPoolProps and userPoolClientOptions can be found in Amazon's docs.

Alternatively a pre-configured external pool can be referenced using the following example:

{
    "externalPool": {
        "userPoolId": "myRegion_userPoolId",
        "userPoolClientId": "randomString"
    }
}

kai's People

m29827 jpelbertrios macenturalxl1 msgpo uk-gov-mirror chriss-0x01

kai's Issues

Add clarification that Kai is not yet ready to be used in production

Add a line to the docs informing the community that Kai is still very early in development and is not secure. Therefore any production use of Kai is done at their own risk.

Add a programmatic client

Add clients for Kai that will interact with the deployed infrastructure via the REST API. These should be aimed at developers and data scientists who want to interact with Kai programmatically. As a first step, a Python client should be developed; if developers want other languages, we can add those later.
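As a rough illustration of what such a client might look like, here is a minimal sketch using only the Python standard library. The class name KaiClient, its method names, the graphId and schema payload keys and the use of an Authorization header are all assumptions made for this example; only the GET /graphs and POST /graph paths come from the API Gateway issue further down.

```python
import json
import urllib.request


class KaiClient:
    """Hypothetical client for the Kai REST API (not part of the repository)."""

    def __init__(self, base_url, token=None):
        # base_url: root of the deployed REST API.
        # token: assumed to be sent as-is in an Authorization header.
        self.base_url = base_url.rstrip("/")
        self.token = token

    def _request(self, method, path, body=None):
        # Build the request without sending it, so it can be inspected or tested.
        data = json.dumps(body).encode("utf-8") if body is not None else None
        request = urllib.request.Request(self.base_url + path, data=data, method=method)
        request.add_header("Content-Type", "application/json")
        if self.token:
            request.add_header("Authorization", self.token)
        return request

    def list_graphs_request(self):
        return self._request("GET", "/graphs")

    def add_graph_request(self, graph_id, schema):
        # Payload keys are an assumption for illustration.
        return self._request("POST", "/graph", {"graphId": graph_id, "schema": schema})

    def send(self, request):
        # Sends the request and decodes a JSON response body.
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending keeps the sketch testable without a deployed stack.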

Integrate user pool with REST API

Only allow users part of the user pool to access the REST API. The pool could also be used to determine who administrates the graph (This should be whoever created the graph initially)

Decouple graph IDs from release names

At the moment we use the graphId as the Helm release name when deploying a Gaffer graph. Unfortunately, this means that all graphIds have to be lowercase alphanumerical strings. We could have them linked - so have a uniqueId which is basically a lowercased graphId that we use as a release name. We would have to use this uniqueId as the primary key in the graph table to stop users having releaseIds which conflict.
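The idea above could be sketched roughly as follows. The function names are hypothetical, and the rule that a release name is a lowercase alphanumeric string is taken from the constraint described in this issue rather than from Helm's full naming rules.

```python
import re

# Constraint described above: lowercase alphanumeric strings only.
_RELEASE_NAME = re.compile(r"[a-z0-9]+")


def to_release_name(graph_id):
    """Derive the uniqueId / release name by lowercasing the graphId."""
    release_name = graph_id.lower()
    if not _RELEASE_NAME.fullmatch(release_name):
        raise ValueError("graphId must be alphanumeric: {!r}".format(graph_id))
    return release_name


def conflicts(graph_id, existing_release_names):
    """True if the derived release name is already used as a primary key."""
    return to_release_name(graph_id) in existing_release_names
```

With this approach "MyGraph" and "MYGRAPH" both map to the release name "mygraph", so the second one would be rejected as a conflict.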

Clean up warnings in user pool and tests

Some lint warnings were introduced by #12. To see warnings run npm run lint as indicated in the README
Currently the warnings are as follows:

lib/app-stack.ts
  33:15  warning  'userPool' is assigned a value but never used  @typescript-eslint/no-unused-vars

lib/authentication/user-pool.ts
  46:49  warning  Forbidden non-null assertion  @typescript-eslint/no-non-null-assertion
  53:59  warning  Forbidden non-null assertion  @typescript-eslint/no-non-null-assertion
  56:33  warning  Forbidden non-null assertion  @typescript-eslint/no-non-null-assertion
  63:52  warning  Forbidden non-null assertion  @typescript-eslint/no-non-null-assertion

test/authentication/user-pool-config.test.ts
  17:13  warning  'cdk' is defined but never used      @typescript-eslint/no-unused-vars
  18:13  warning  'cognito' is defined but never used  @typescript-eslint/no-unused-vars

test/authentication/user-pool.test.ts
  19:13  warning  'cognito' is defined but never used  @typescript-eslint/no-unused-vars

The no-unused-vars warnings can be solved by removing the variable/constant they are assigned to and just running the constructor. The no-non-null-assertion ones are fixed by doing a null check and handling it appropriately. It looks as though they have been checked already by the fromConfig function, in which case they can be temporarily disabled.

Add endpoints to Graph objects after deployment

Upon successful deployment of a Gaffer Graph, the Graph object in the backend table should be updated with the various endpoints that get created by the application load balancer. These include:

The Gaffer REST service / UI
The Hadoop Namenode UI
The Accumulo Monitor

The URLs for these interfaces can be found by running a kubectl get ing command, which should be replicated in the Graph deployment lambda.
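One possible way to extract the endpoints is to ask kubectl for JSON output (kubectl get ing -o json) and read the load balancer hostnames from each ingress status. The sketch below only covers the parsing step; the function name is hypothetical and how the lambda would invoke kubectl is left open.

```python
import json


def ingress_hostnames(kubectl_json):
    """Map each ingress name to the load balancer hostnames in its status."""
    endpoints = {}
    for item in json.loads(kubectl_json).get("items", []):
        name = item["metadata"]["name"]
        # The status is empty until the load balancer has been provisioned.
        ingresses = item.get("status", {}).get("loadBalancer", {}).get("ingress", [])
        endpoints[name] = [entry["hostname"] for entry in ingresses if "hostname" in entry]
    return endpoints
```

An ingress whose load balancer has not been created yet maps to an empty list, which the lambda could treat as "not ready".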

Add automated E2E testing

Add testing that deploys the project on AWS, makes some API calls and checks the response. These tests should check that:

The project deploys correctly
A user can create a graph
A user can see that their graph exists
A user cannot add a graph twice
A user can delete a graph

As well as any new features implemented before this issue is resolved.

Generated passwords break Accumulo configuration file parsing.

Punctuation characters in generated passwords for Accumulo break configuration file parsing.

/etc/accumulo/conf/accumulo-site.xml:37.15: StartTag: invalid element name
    <value>m?<|Ppg'</value>
...
Initializing Accumulo...
[Fatal Error] accumulo-site.xml:37:15: The content of elements must consist of well-formed character data or markup.
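Two possible mitigations, sketched under the assumption that the password is written into the XML file by Python code: restrict generated passwords to alphanumeric characters, or escape the value before writing it. The function names are hypothetical and this is not the fix that was applied in the repository.

```python
import secrets
import string
from xml.sax.saxutils import escape

ALPHANUMERIC = string.ascii_letters + string.digits


def generate_password(length=20):
    """Generate a password without punctuation, so no XML escaping is needed."""
    return "".join(secrets.choice(ALPHANUMERIC) for _ in range(length))


def xml_value(password):
    """Wrap a password in a <value> element, escaping &, < and >."""
    return "<value>{}</value>".format(escape(password))
```

For the password in the log above, xml_value would emit `<value>m?&lt;|Ppg'</value>`, which is well-formed.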

Introduce more separation to graphs

All graphs currently have to have unique names as they all share the default namespace. If users could create and use different namespaces, they could re-use graph names and from an admin point of view, it will be easier to locate problems with a graph if they are searching through a namespace with 5 graph deployments in rather than 50.

It would also allow users to make use of test / dev / ref namespaces to use as a sort of environment and allow teams to keep their graphs within their own namespace.

Application load-balancers not created

Adding a graph through the REST API reports success, but the application ingress load balancers for the Accumulo, HDFS and Gaffer UIs fail to create. Logs from the ingress controller pod show there is a permissions problem for the role:

E0630 09:47:45.838990       1 controller.go:217] kubebuilder/controller "msg"="Reconciler error" "error"="failed to find existing LoadBalancer due to AccessDenied: User: arn:aws:sts::01234567890123456789:assumed-role/KaiStackm29827-GraphPlatformEksClusterNodegroupgra-G22JPFJROJLR/i-0187f6c77f1c6a4e8 is not authorized to perform: elasticloadbalancing:DescribeLoadBalancers\n\tstatus code: 403, request id: 05b649bf-2e19-4e95-9a12-b045f9c79700"  "controller"="alb-ingress-controller" "request"={"Namespace":"default","Name":"test-hdfs"}

Add continuous integration

Add continuous integration with a framework of your choice. Typically we use Travis CI.

This should:

Build the project
Run all the tests

Remove EBS volumes when cluster deleted

Similar to behaviour of application load-balancers described in gh-26, EBS Volumes provisioned when graphs are deployed into a cluster are orphaned when the cluster is torn down.

Allow users to run bulk ingest to load large volumes of data into a graph

The ingest should be carried out by lambdas which can run spark-submit jobs to the Kubernetes cluster. These lambdas should initially be developed outside of Kai and referenced via their ARN. The admins of Kai need some way of adding ingest lambdas to the deployment. The easiest way I can think to do this is with configuration. You could do it via REST, but that would require a new user pool etc.

The ingest objects should be stored in DynamoDB and should have the rough structure:

{
    "name": "My Ingest Job",
    "arn": "lambda arn",
    "arguments": {
        "inputFile": "text",
        "generatorJson": "json"
    }
}

A Kai user should be able to retrieve these objects (minus the arn) and a UI should be able to use the arguments and their types to render a form that the user can fill in to trigger a bulk ingest.
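A minimal sketch of the "minus the arn" step, assuming the ingest object has already been read from DynamoDB into a plain dictionary. The function name public_view is hypothetical.

```python
def public_view(ingest_job):
    """Return a copy of an ingest object without its Lambda ARN."""
    return {key: value for key, value in ingest_job.items() if key != "arn"}


job = {
    "name": "My Ingest Job",
    "arn": "lambda arn",
    "arguments": {"inputFile": "text", "generatorJson": "json"},
}
visible = public_view(job)
```

Here visible keeps the name and arguments fields, which is what a UI would need to render the form, while the stored object is left unchanged.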

Load balancers are not torn down after cluster deletion

When the cluster is torn down as the stack is deleted, the load balancers and their target groups remain. This probably happens because the alb helm chart is torn down before the graphs have unregistered.

Add Basic CRUD backend using DynamoDB

Create a DynamoDB table as part of the deployment.
Integrate the Table with the existing Lambda Functions

When a graph is added (initially by providing a graphId and schema) via the API Gateway, an entry should be created in the DynamoDB table containing its id, the owner and a URL which should take the user to the Gaffer REST API. Until #15 or some other mechanism for deploying a Gaffer instance is done, we can stub this value.

When a graph is deleted, the row in the DynamoDB table should be deleted.

When the API is queried for all the graphs, all the graphs the user owns in the DynamoDB table should be returned.

Create initial folder structure

Create initial folder structure using the Amazon CDK.

Externalise storage of graph credentials

Accumulo passwords are generated when the add_graph lambda is invoked. It would be better to create and store these when the stack is provisioned and change the add_graph lambda to retrieve the credentials when required.

Add Lambda to Start a Gaffer Deployment on AWS

Subtask of #6

Add a Lambda which starts a Gaffer deployment in EKS. It should use the Gaffer Helm chart defined in Gaffer Docker. It should return as soon as the request has been sent and provide some mechanism for tracking the progress of the deployment.

503 Temporarily unavailable message on random ALB endpoints

Sometimes when you deploy a graph, one or more of the endpoints will not work. I deployed one graph and the HDFS namenode was broken, then in the next one the monitor was showing the same thing. It could be something to do with the AZ the node is deployed in, but this needs further investigation.

Add a UI for the app

Add a dynamic SPA which will allow users to provision and manage their Graph instances. It should use some kind of UI framework to make it easier to deliver features in the future. This should probably be either React or Angular.

Add Lambdas to start and stop Gaffer instances on AWS

Add Lambdas which sit behind the API Gateway implemented in #4. Their jobs should be to spin up / tear down the Gaffer instances. They should also communicate with the underlying database implemented in #5.

Deployment fails when extra security groups are not set

When extra security groups are not set, the deployment fails as the environment cannot contain a null value.

This is due to line 43 in worker.ts which is extra_security_groups: extraSecurityGroups == "" ? null : extraSecurityGroups

The desired behaviour should be that if the context variable is an empty string, the environment variable is not set.

Remove hard coded resource names

To encourage collaboration some of the hard coded names should be removed from the deployment. These include:

The name of the cluster (currently this is a variable in cdk.json but it can be autogenerated)
The name of the REST API. This uses the id of the construct, so setting the ID to something like this.node.uniqueId + "API" would work

Improve reliability of ALB deployment

We use the ALB helm chart to deploy the ingress controller. However this deployment is unreliable and often fails. This is particularly frustrating as it has to happen after the EKS cluster has been deployed which takes around half an hour. This results in a lot of time wasted and makes it harder to deploy Kai.

To address this, find the underlying cause of the issue and fix it, or just deploy the Kubernetes resources separately.

Fix failing build

Change oraclejdk8 to openjdk8

Add an underlying database for users and graphs

Add an underlying persistent Database which will store data about users and their graphs.
The database needs to be indexed by username, so graph owners can see which graphs they own, and by REST endpoint, so that we can quickly delete instances by their endpoint. It makes sense to use a relational database for this.

Worker always logs "Successfully deployed ..." even when it fails

The add graph worker always logs a message saying Gaffer was successfully deployed, even when it wasn't. To fix this, change lib/workers/lambdas/add_graph.py so that this message is not printed if the deployment fails.
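A sketch of the kind of change described, assuming the worker shells out to a deployment command and can inspect its exit code. The function name deploy_graph and its parameters are hypothetical and do not reflect the actual contents of add_graph.py.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def deploy_graph(command, run=subprocess.run):
    """Run the deployment command and only log success if it exits with 0."""
    result = run(command, capture_output=True, text=True)
    if result.returncode != 0:
        logger.error("Deployment failed: %s", result.stderr)
        return False
    logger.info("Successfully deployed")
    return True
```

The run parameter exists only so the function can be exercised in a unit test without invoking a real command.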

Add Contributing guidelines

Add a CONTRIBUTING.md file similar to the one in Gaffer Docker which will outline guidelines for contributing.

Create API Gateway

Create an API Gateway (Can be stubbed for now). The endpoints it must support initially are:

GET /graphs - Get all the graphs the user owns

POST /graph - Start a graph in AWS. Data in the POST request must include:

The name of the graph
The schema of the graph

It should return the REST endpoint where the graph is located.

DELETE /graph/ deletes the graph at a given URL.

Add Cognito User Pool

Add a Cognito User pool which will allow users to sign in to the app once #7 is done. It will also allow us to control which graphs are returned from a backend service such as the one implemented in #17