finraos / herd


Herd is a managed data lake for the cloud. The Herd unified data catalog helps separate storage from compute in the cloud. Manage petabytes of data and make it accessible for data processing and analytical purposes by any cloud compute platform.

Home Page: http://finraos.github.io/herd/

License: Apache License 2.0

JavaScript 4.62% HTML 0.08% Shell 0.06% Java 91.00% Batchfile 0.01% Scala 2.04% Python 1.92% SCSS 0.27%

herd's People

Contributors

afelde, aniruddhadas9, aniruddhadas9finra, charliepy, davidbalash, dependabot[bot], foxsmart, gudin-anton, jazhou, jzhang80, k26389, kenisteward, kood1, kusid, mchao47, mona62, nateiam, rongwang0930, saisumughis, saisuryafinra, seoj, wz1371


herd's Issues

Generate endpoints in swagger

As a herd Consumer or Publisher I want to view swagger documentation for REST endpoints and ensure that the documentation is in sync with the code and version-controlled

Acceptance Criteria

  • Generate swagger YAML for REST including:
    • endpoints
    • list of parameters
    • model, JSON, XML
  • swagger YAML does not include
    • parameter descriptions, data types, required/optional indication
    • values in JSON, XML
  • Generate swagger as part of build process
  • swagger YAML generates swagger-ui without error
  • swagger YAML can be parsed by swagger editor without error

Use KMS Key ID from Storage attribute in LFU/LFS

As the LFU service, I want to obtain the KMS Key ID for the LFS bucket from a Storage Attribute so I can conform to the new standard way of storing and retrieving KMS Keys.
#40 introduces a mechanism to obtain the KMS Key ID for a bucket from its Storage Attribute

Acceptance Criteria

  • KMS Key ID for LFS bucket is present as Storage Attribute
  • LFU Upload service obtains KMS Key ID from Storage Attribute
  • LFS/LFU regression passes

Downloader fails when not using secret and access key

Downloader was modified in 0.7.0 to utilize the Herd credential endpoint. With this modification, users are not required to pass secret and access keys on the command line. Instead, the Downloader calls the Herd credential endpoint, which authenticates with the username and password supplied on the command line and generates temporary credentials. However, the call to the Herd credential service is failing because the Downloader does not pass the format version.

Steps to reproduce

  • Utilize 0.7.0 Downloader without passing secret and access key at the command line

Defective behavior

  • Downloader will call Herd credential endpoint but not pass required format version
  • When this call fails, Downloader will fall back on the AWS credential chain
  • If no credentials present on AWS credential chain, Downloader will fail

Desired behavior

  • Downloader already knows the latest format version from the BData Get. Downloader should pass this format version to the Herd credential endpoint, which will then succeed, and use the temporary credentials.

NOTE: This passed during testing because the fallback to the AWS credential chain found valid credentials and the operation succeeded.

Manage Data Provider values

As a herd Publisher I want to view, insert, and delete Data Provider values so I can manage this data without a herd Administrator

Acceptance Criteria

  • CRD (no update) endpoints present to manage Data Provider values
  • Delete should only succeed if the value is unused in the database

Update notification registrations

As a notification user, I want to update notification registrations so I can modify my registration instead of having to create a new one.

Assumption - no support for versioning

Acceptance Criteria

  • PUT endpoint for notification registrations
  • Same capabilities and data as POST endpoint except requires additional input of namespace + job name
  • Replaces all data for this notification registration with what is provided in the payload. The user must provide all values they want stored. In other words, Herd will not merge the PUT payload with the existing persisted state; the user must provide the entire state.

Ongoing DB Create Script Maintenance

Introduce process to maintain create and upgrade scripts for each release for OSS

  • We are already good at maintaining incremental upgrade scripts with each release; this is well tested and works smoothly
  • Maintain create script along with upgrade scripts by extracting create script from our environment with each release
  • Discuss upgrade paths and intervals at which we will maintain incremental scripts eg major versions

Acceptance Criteria

  • Wiki page documentation

Add high volume of files to Storage Files Post

As a Herd Publisher I want to add a high volume of files without the service timing out so I can register a large number of files associated with a BData.

Acceptance Criteria

  • Business Object Data Storage Files Post works under timeout threshold for up to 30k files for Storages of S3 Storage Platform
    • Baseline the performance before and after the fix
  • Business Object Data Storage Files Post works in performance criteria when files are included in request -- this does not cover auto-discovery

Monitor active Activiti jobs that have not completed

As an Activiti user I want visibility into running jobs so I can perform troubleshooting/cleanup/re-run of jobs that were created by a trigger.

This might take the form of a REST endpoint that takes the job name and/or job definition ID and returns any jobs that are not in completed status.

Acceptance Criteria

  • Programmatic interface available to retrieve list of incomplete jobs for a given namespace and optionally job name
  • Returns for all active jobs
    • Include namespace, job name, start time, job ID (to facilitate subsequent calls to Get Job Status: http://finraos.github.io/herd/docs/0.1.0/rest/index.html#!/job/getJobStatus), and environment
  • Structure response with future expansion in mind - we will likely add different status and additional information about jobs

In the future:

  • We will add the ability to see jobs that have completed with some sort of walking window filter.
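
If the endpoint is backed directly by the Activiti API, the query could look roughly like the sketch below; deriving the process definition key from namespace and job name is an assumption for illustration, not necessarily how herd names its process definitions.

import java.util.List;
import org.activiti.engine.RuntimeService;
import org.activiti.engine.runtime.ProcessInstance;

public class ActiveJobsQuerySketch {
    public static List<ProcessInstance> findIncompleteJobs(RuntimeService runtimeService, String namespace, String jobName) {
        // Assumption: the Activiti process definition key is derived from namespace and job name.
        String processDefinitionKey = namespace + "." + jobName;
        return runtimeService.createProcessInstanceQuery()
            .processDefinitionKey(processDefinitionKey)
            .active() // completed instances are no longer in the runtime tables; active() also excludes suspended ones
            .list();
    }
}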

Specify prefix template as Storage Attribute

As a Storage owner I want to specify the prefix template for a Storage that I own so I can manage the prefix structure as I wish.

Currently the prefix template is set as a separate configuration value in the Herd DB for each managed bucket. This should instead be set as an attribute on each Storage.

Affects how the prefix is defined; it becomes part of the public interface, so we must make it more flexible (eg use Velocity).
Also affects S3KeyPrefix - must modify it to return the prefix on this basis.

Acceptance Criteria

  • Migrate prefix template logic to access from storage attribute instead of configuration value. Remove global config value.
  • S3KeyPrefix will take Storage Name as optional parameter, default = S3_MANAGED
  • Populate existing S3 Storages with template values
    • Reminder - environment difference 'frmt' in DEV vs 'schema' in other environments - make this all 'schema'
  • Ensure error handling exists for all operations that touch prefix
    • If path validation prefix is present and prefix is not present (not enforced at storage creation but enforced at runtime eg when prefix is accessed)
  • Prefix template should be defined using the Velocity template language (see the sketch below)
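
As a rough illustration of a Velocity-based prefix template (attribute contents and variable names here are assumptions, not the final design), the Storage attribute could hold a template that is evaluated per registration:

import java.io.StringWriter;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;

public class S3KeyPrefixTemplateSketch {
    public static String buildPrefix(String template) {
        VelocityEngine velocityEngine = new VelocityEngine();
        velocityEngine.init();

        // Hypothetical variables exposed to the template; the real variable set would be defined by herd.
        VelocityContext context = new VelocityContext();
        context.put("namespace", "APP_A");
        context.put("businessObjectDefinitionName", "OBJECT_O");
        context.put("businessObjectFormatUsage", "PRC");
        context.put("businessObjectFormatFileType", "ORC");
        context.put("partitionValue", "2015-11-10");

        StringWriter writer = new StringWriter();
        velocityEngine.evaluate(context, writer, "s3KeyPrefix", template);
        return writer.toString();
    }
}

An attribute value might then look like $namespace/$businessObjectDefinitionName/$businessObjectFormatUsage/$businessObjectFormatFileType/schema-v0/data-v0/partition=$partitionValue (illustrative only).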

Provide consistent error messages for failed Activiti tasks

As an Activiti user I want to have improved error reporting on both sync and async tasks so I can tell which task had an error and see it in the workflow variables.

Currently Activiti sync tasks have herd-specific code that handles errors by catching exceptions, capturing the error message in a workflow variable, and then returning flow to Activiti.

Herd Activiti tasks running async and built-in Activiti tasks do not have this behavior - their error reporting is not handled consistently with the above. Instead, exception messages are not captured in the workflow; they appear only in logs, and the workflow appears stopped at the last sync task. This also affects all built-in Activiti tasks.

Acceptance Criteria

  • Tasks that fail will have exception message in workflow variables
  • Standard herd workflow variables that show task status will always reflect correct completed, running, or error status
  • Behavior applies to all herd tasks running sync and async and all built-in Activiti tasks (eg script tasks)

Restrict permissions when executing Java from JavaScript in Activiti

As a herd Administrator, I want to ensure that Java executed from JavaScript in Activiti cannot be used to execute malicious code.

Use the Java SecurityManager on the JavaScript engine used by Activiti to restrict all permissions in the SecurityManager spec. This will not affect any Activiti tasks or Java written by herd in Activiti tasks - it only impacts Java running from Activiti JavaScript.

Acceptance Criteria

  • Javascript in Activiti cannot execute known examples of malicious code such as System.exit, file I/O, network I/O, spawning threads or processes
  • Javascript in Activiti will still be able to do legitimate, harmless operations that are currently in use such as assembling requests and parsing responses
  • herd Java code running in Tomcat is not affected
  • Solution is compatible with Java 1.8
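
One possible approach (a sketch under the assumption that a SecurityManager is installed in Tomcat, not necessarily the approach herd chose) is to evaluate the script under an AccessControlContext that grants no permissions, so privileged operations attempted from JavaScript are rejected:

import java.security.AccessControlContext;
import java.security.AccessController;
import java.security.CodeSource;
import java.security.Permissions;
import java.security.PrivilegedExceptionAction;
import java.security.ProtectionDomain;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class RestrictedScriptRunner {
    public static Object evalRestricted(String script) throws Exception {
        // Requires a SecurityManager, e.g. System.setSecurityManager(new SecurityManager());
        Permissions noPermissions = new Permissions(); // grant nothing
        ProtectionDomain restrictedDomain =
            new ProtectionDomain(new CodeSource(null, (java.security.cert.Certificate[]) null), noPermissions);
        AccessControlContext restrictedContext = new AccessControlContext(new ProtectionDomain[] { restrictedDomain });

        ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
        // System.exit, file and network I/O, thread/process creation, etc. now fail with AccessControlException,
        // while plain string/JSON manipulation in the script continues to work.
        return AccessController.doPrivileged(
            (PrivilegedExceptionAction<Object>) () -> engine.eval(script), restrictedContext);
    }
}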

Retrieve Availability and DDL from data in multiple Storage

As a herd Consumer I want to receive availability data that considers the accessibility of data in the underlying Storage Platform so I can avoid trying to access data that has been archived to Glacier

Acceptance Criteria

  • For BData that has been archived to Glacier, return as part of notAvailableStatuses with reason ARCHIVED
    • Archived to Glacier is defined as the Storage Unit not being in ENABLED status for a Storage of the S3 Storage Platform while the Storage Unit is ENABLED for a Storage of the Glacier platform
  • For BData that does not have any ENABLED Storage, should return as part of notAvailableStatuses with reason NO_ENABLED_STORAGE_UNIT
  • DDL will exclude notAvailableStatuses according to this new rule and the current ignoreMissing logic.
  • Even if Glacier Storage is specified in Availability request, it will still be returned as notAvailableStatus
  • Additional Storage Unit Status such as ARCHIVING should be considered equivalent to DISABLED.
    • This involves a generic mechanism that allows adding various statuses and identifying whether they are considered ENABLED or DISABLED for purposes of evaluating availability

Receive response from Execute Job only after job data persisted

As an Activiti user I want to ensure that jobs executed via Execute Job REST are persisted to Activiti prior to the rest call returning.

The current Activiti Execute Job REST returns with an instance ID before the job is actually persisted to Activiti. This leaves a window where jobs could be dropped due to system failure. Some Activiti users that use external scheduling and have operations staff to monitor jobs will notice this type of failure. But now we have teams that will be using notifications to trigger jobs and will not know if there is a failure of this sort.

We need to close the small window where jobs might not be persisted by doing the following:

  • Write code that modifies the first task (i.e. the start task?) and makes it "async=true" if it isn't already in every workflow at registration
  • Remove our Activiti create and start workflow “hack” command that returns a process instance Id before the Activiti transaction gets persisted and use the normal command instead where we could catch an exception if the process doesn’t get created and return an error in the REST API

Acceptance Criteria

  • Ensure that any failure prior to persisting will result in Execute Job returning with error
  • Ensure job is persisted to Activiti prior to Execute Job returning
  • Execute Job returns in reasonable time (less than 2-3 seconds) including persisting in Activiti
  • If no-op is added at registration time ensure no issues with updating registration (eg adds another no-op each time updated)

Retrieve EMR cluster status without calling deprecated DescribeJobFlows

As herd Product Owner I would like EMR cluster status to come from AWS' newer APIs so we no longer rely on the deprecated API.

We need to fulfill our current EMR Cluster GET service (http://finraos.github.io/herd/docs/0.1.0/rest/index.html#!/cluster/getEMRCluster) by calling some combination of the AWS APIs listed below.


Based on our current records, you are currently using one of our deprecated API (DescribeJobFlows) to get the details for all of your EMR clusters.

Starting Midnight (24:00 UTC) 12/01/2015, we will require users to move to the new set of APIs. The newer set of APIs provide a much more granular and low latency access.

Here is the list of new APIs:

ListClusters, DescribeCluster, ListSteps, ListInstanceGroups and ListBootstrapActions instead.

API Documentation: http://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_Operations.html

DescribeJobFlows API Doc: http://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_DescribeJobFlows
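
For reference, a sketch of reading cluster status through the newer APIs with the AWS SDK for Java v1 (how herd wires these calls into the EMR Cluster GET service may differ):

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.model.Cluster;
import com.amazonaws.services.elasticmapreduce.model.DescribeClusterRequest;
import com.amazonaws.services.elasticmapreduce.model.ListStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.ListStepsResult;

public class EmrClusterStatusSketch {
    public static String getClusterState(AmazonElasticMapReduce emrClient, String clusterId) {
        // DescribeCluster replaces the cluster-level details previously read from DescribeJobFlows.
        Cluster cluster = emrClient.describeCluster(
            new DescribeClusterRequest().withClusterId(clusterId)).getCluster();

        // ListSteps replaces the per-step details previously read from DescribeJobFlows.
        ListStepsResult steps = emrClient.listSteps(new ListStepsRequest().withClusterId(clusterId));

        return cluster.getStatus().getState() + " (" + steps.getSteps().size() + " steps listed)";
    }
}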

Utilize temporary token in DataBridge to write to S3

As a herd Administrator I want to provision Uploader/Downloader user access to S3 buckets with temporary tokens so I can eliminate the use of secret and access keys

The tech approach will be similar to the LFS use of the AWS STS service.

Acceptance Criteria

  • Uploader/Downloader code will call a REST endpoint that will request a token from STS on behalf of authorized users and grant them a temporary token
    • Full alternate key must be passed into the endpoint and if that full key down to the Format level does not exist, the endpoint will return an error
    • For Upload Credentials endpoint, businessObjectDataVersion is optional and user can specify createNewVersion. These changes are temporary until pre-registration is implemented
  • REST endpoint is associated with appropriate roles through existing configurable mechanism
  • Temporary token will have permissions on resources in AWS:
    • Read/Write AWS roles = PUT, DELETE on S3 prefix in manifest
    • Read-only AWS roles = GET on bucket
  • Uploader/Downloader code will still pass user credentials to other REST services used by Uploader/Downloader subject to RBAC for endpoint access - no change to this portion of access control
  • Uploader and Downloader need to extend credentials to maintain token across entire list of files. Utilize existing REST endpoint.
  • Retain previous mechanism using permanent secret/access keys without STS
  • Documentation should explicitly mention that the credentials endpoint should only be used by the Uploader/Downloader, not for other access
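
A minimal sketch of how such an endpoint could obtain scoped temporary credentials from STS (policy contents, session name, and duration below are illustrative assumptions, not herd's actual configuration):

import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClient;
import com.amazonaws.services.securitytoken.model.Credentials;
import com.amazonaws.services.securitytoken.model.GetFederationTokenRequest;

public class TemporaryCredentialsSketch {
    public static Credentials getScopedCredentials(String userName, String bucketName, String s3KeyPrefix) {
        // Hypothetical inline policy restricting the token to the S3 prefix from the manifest.
        String policy = "{\"Version\":\"2012-10-17\",\"Statement\":[{"
            + "\"Effect\":\"Allow\","
            + "\"Action\":[\"s3:GetObject\",\"s3:PutObject\",\"s3:DeleteObject\"],"
            + "\"Resource\":\"arn:aws:s3:::" + bucketName + "/" + s3KeyPrefix + "*\"}]}";

        AWSSecurityTokenServiceClient stsClient = new AWSSecurityTokenServiceClient();
        GetFederationTokenRequest request = new GetFederationTokenRequest()
            .withName(userName)
            .withPolicy(policy)
            .withDurationSeconds(3600);
        return stsClient.getFederationToken(request).getCredentials();
    }
}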

Boundary tokens (min/max/latest*) in Availability/DDL should observe status

As a Consumer I want to receive the latest VALID partition when retrieving with the latestbefore token

Currently, the boundary tokens (min, max, latestbefore, latestafter) do not consider VALID or INVALID status when evaluating. They find the boundary date and then return Availability for that date. Instead we want them to consider status in this logic and find the boundary date with VALID status.

Case that shows current undesired behavior:

  2015-05-01 VALID
  2015-05-02 VALID
  2015-05-03 VALID
  2015-05-04 INVALID

  Scenario                      Expected      Currently received
  "latest before 2015-05-04"    2015-05-03    Not Available
  "latest before 2015-05-10"    2015-05-03    Not Available
  "latest before 2015-05-03"    2015-05-03    2015-05-03

And for reference, the happy path case:

  2015-05-01 VALID
  2015-05-02 VALID
  2015-05-03 VALID
  2015-05-04 VALID

  Scenario                      Expected      Received
  "latest before 2015-05-04"    2015-05-04    2015-05-04
  "latest before 2015-05-10"    2015-05-04    2015-05-04
  "latest before 2015-05-03"    2015-05-03    2015-05-03

Acceptance Criteria

  • Receive expected response in case above where latest data is INVALID
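
A rough sketch of the intended selection logic for latestbefore, assuming a simple in-memory map of partition value to status (the real implementation presumably applies this filter in the database query):

import java.util.Comparator;
import java.util.Map;
import java.util.Optional;

public class LatestBeforeSketch {
    // Given partition values mapped to their status, pick the latest VALID partition value
    // that is <= the requested bound ("latest before" is inclusive in the examples above;
    // ISO-8601 dates compare correctly as strings).
    public static Optional<String> latestBefore(Map<String, String> statusByPartitionValue, String upperBound) {
        return statusByPartitionValue.entrySet().stream()
            .filter(entry -> "VALID".equals(entry.getValue()))
            .map(Map.Entry::getKey)
            .filter(value -> value.compareTo(upperBound) <= 0)
            .max(Comparator.naturalOrder());
    }
}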

Utilize temporary token in Uploader/Downloader to write to S3

As a herd Administrator I want to provision Uploader/Downloader user access to S3 buckets with temporary tokens so I can eliminate the use of secret and access keys

Tech approach will be similar to LFU use of AWS STS service.

Acceptance Criteria

  • Uploader/Downloader code will call a REST endpoint that will request a token from STS on behalf of authorized users and grant them a temporary token
  • REST endpoint is associated with appropriate roles through existing configurable mechanism
  • Temporary token will have permissions on resources in AWS:
    • Read/Write AWS roles = GET on bucket, PUT, DELETE on S3 prefix in manifest
    • Read-only AWS roles = GET on bucket
  • Uploader/Downloader code will still pass user credentials to other REST services used by Uploader/Downloader subject to RBAC for endpoint access - no change to this portion of access control
  • Uploader and Downloader need to extend credentials to maintain token across entire list of files. Utilize existing REST endpoint.
  • Remove previous mechanism using permanent secret/access keys

Retrieve artifacts from Sonatype repository

As a herd user, I want to be able to retrieve artifacts from the Sonatype repository so that I can create an instance of herd or deploy artifacts to an existing instance without access to the FINRA network. See http://central.sonatype.org/pages/requirements.html for more details.

Acceptance Criteria:

  • Sonatype requirements
    • Need to build and include Javadoc JARs
    • Need to sign files with GPG/PGP. Considering using the Maven plug-in: http://central.sonatype.org/pages/apache-maven.html
    • We need to add “description” and “url” tags to all our POMs.
    • Need to add a “licenses” tag for our Apache 2.0 usage.
    • Need to add a list of “developers” to the POM.
    • Need to add an “scm” section with GitHub connection information.
    • Move from org.finra.dm to org.finra.herd.
  • Additionally rename classes and comments as required
  • Publish to Sonatype during open source release
  • Need to maintain context-root as dm-app in existing environments
  • Must ensure regression passes in every environment and watch this very closely when promoting in existing environments

Re-instate removal of legacy namespace

As a herd User, I would like to always have business object definition endpoints that require namespace. Any endpoints that do not require namespace have already been deprecated - with this story we want to remove these from the system.

In addition, all endpoints with request XML that includes namespace should not have namespace optional - it should always be required.

Acceptance criteria:

  • All endpoints including Activiti wrappers must specify namespace as part of the alternate key
  • Legacy flag will be removed from the database
  • Update tables that have the list of endpoints for security
  • EXCEPT - leave the endpoint that lists BDef without specifying Namespace

Delete bucket versions in source Storage Platform after move to destination Storage Platform

As a Publisher I want to delete files from the source Storage Platform after they have been moved to the destination Storage Platform

Acceptance Criteria

  • Trigger at Bdata level
  • Verify all files for this Bdata moved from source Storage Platform to destination Storage Platform - this is performed in #41
  • If all moved successfully, delete all prior versions from source Storage Platform
    • Note, this operation does not propagate to replicated buckets
  • This is integrated with #41, which moves BData from S3 to Glacier.

Migrate DDL generation to use velocity templates

As a Product Owner, I want DDL generation to use a generic templating engine rather than string manipulation in Java

Acceptance Criteria

  • Ensure existing HIVE DDL output format passes all unit and regression tests after migration to velocity
  • Validate special logic is present in tests:
    • escape characters (slash, single quote, back tick) for applicability
    • ORGNL_ prefixing behavior
  • Apply to all DDL generation including collection
  • Consider impacts of the following and discuss as needed
    • less duplication of logic than the Java approach

Select Storage Policy regardless of age

As the herd PolicyProcessor I want to select Storage Policy based on order of precedence regardless of age so I can override cross-cutting policies by stating a more specific policy.

Storage policies specify some portion of the alternate key and can apply to multiple BData. The following logic should apply to choose the correct policy; more specific rules should override the behavior of less-specific rules according to this order.

  1. BDef and (Usage + File Type) --- Most specific
  2. BDef
  3. (Usage + File Type)
  4. [no portion of key specified] --- Least specific

Acceptance Criteria

  • If multiple storage policies apply to a BData, only the most specific should be selected to execute
  • Age of BData and the elapsed day value in the Storage Policy should not be considered until after a single policy is selected according to the precedence rules

Increase attribute length limits

As a DM Consumer/Publisher, I want to save longer values in certain user-entered fields such as Notification Name

  • Review all column lengths that are <= 50
  • Review to determine which are user-entered and make sense to change (eg not Namespace)
  • Review to ensure that none of these have downstream impact if they are larger (eg aggregate total of alternate key must be < 1024)
    • Check strings that will end up in SQS, S3 key

Acceptance Criteria

  • Desired fields updated to 250 length or appropriate as discussed with user base
  • No changes in full regression results

Manage Storage Policy

As a Publisher, I want to declare a Storage Policy that transitions data from one Storage Platform to another so I can define when data should move to cold storage

Policy definition includes:

  • Scope - expressed by limited subset of alternate keys and listed here in order of precedence
    • Always include StorageName of current storage location
    • BDef and (Usage & FileType) - override at this specific level
    • BDef - override at this specific level
    • (Usage & FileType) - a cross-cutting scope that applies to all BData of this Usage and FileType
    • [none] - global scope that applies to all BData
    • NOTE: (Usage & FileType) must always be used together to keep it simple - or else there are too many permutations
  • Rule - rule type and numeric value
    • Initially support one rule type of DAYS_SINCE_BDATA_REGISTERED with the numeric value in days
  • Transition - destination Storage (required)
    • Initially support only Glacier platform as destination

See policy definition examples on the Storage Policy Model page

Acceptance Criteria

  • REST endpoints to manage Storage Policy: POST, GET
  • Apply endpoint permissions to appropriate Roles
  • NOTE: does not include movement of data - this is just CRUD for the policy

Calling Job Executions Get does not return jobs due to case mismatch

Steps to reproduce:

  1. Create JobDefinition with jobName in a different case than the Activiti XML element ID.
  2. Start job of this type
  3. Query for job using Job Executions Get and provide namespace, jobName inputs with any case. Do not specify status.

Current Behavior

  • Job Executions Get does not return job of namespace, jobName specified in request

Expected Behavior

  • Job Executions Get returns job of namespace, jobName specified in request

Run Activiti jobs on latest version of Activiti

As herd Product Owner I want to upgrade to the latest stable version of Activiti so I can take advantage of new features and defect fixes.

For example we recently introduced an endpoint that lists basic information for jobs that meet certain filtering criteria. Currently we only support criteria that specify a single namespace+jobName combination. Later versions of Activiti (eg 5.18.0) support supplying lists of criteria, so we could support supplying only the namespace and getting all jobs of any jobName.

Acceptance Criteria

  • Review release notes of Activiti prior to starting technical work
  • Activiti jar and database schemas upgraded
  • Update all assets related to Activiti UI
  • Unit and functional regression tests pass
    • Special attention to herd-specific Activiti code like the first task modifications

Retrieve list of BData affected by Storage Policy rule and scope

As the herd PolicyProcessor I want to retrieve a list of all BData that match the scope and rule of any StoragePolicy so I can execute the transition described in the applicable Storage Policy.

Acceptance Criteria

  • Create list that contains BData+Storage (storage unit) that match criteria of any registered Storage Policy at the time the list is requested
    • BData matches scope of some Storage Policy, for example it has same BDef or Usage&FileType that matches any Storage Policy
      AND
    • BData matches rule of some Storage Policy, for example it was created >= 90 days ago and it is in scope of a Storage Policy with a rule saying DAYS_SINCE_BDATA_REGISTERED = 90
  • If BData matches several Storage Policies, use order of precedence described in Storage Policy Model to identify a single Storage Policy
  • BData must not already be present in destination storage of transition, for example it has not already been archived to Glacier
  • A message will be placed in SQS queue for each BData
    • Message format contains: source Storage, alternate keys and/or BData ID, Storage Policy name and/or ID
    • Message should not contain any actual data - just meta-data - as SQS is not encrypted
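
A sketch of publishing such a metadata-only message (queue URL and JSON field names are illustrative assumptions):

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.model.SendMessageRequest;

public class StoragePolicySelectorSketch {
    public static void publishSelection(AmazonSQS sqs, String queueUrl,
            String sourceStorageName, String businessObjectDataKey, String storagePolicyName) {
        // The message carries only meta-data (no actual data) because the SQS queue is not encrypted.
        String body = "{"
            + "\"sourceStorageName\":\"" + sourceStorageName + "\","
            + "\"businessObjectDataKey\":\"" + businessObjectDataKey + "\","
            + "\"storagePolicyName\":\"" + storagePolicyName + "\"}";
        sqs.sendMessage(new SendMessageRequest(queueUrl, body));
    }
}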

Factory CQ - eliminate checkstyle exception for herd-model

Looks like we exclude our herd-model classes from our code quality rules. Most of these classes have auto-generated equals/hash code methods which aren’t compliant with our rules (e.g. Object “o” can’t be a single character object parameter, etc.). It’s probably a good idea for us to clean up these classes and remove the code quality exclusion.

View documentation with improved layout and versioning

As a developer or potential contributor, I want to view documentation with optimized layout and ensure that the documentation is version-controlled.

Acceptance Criteria

  • Migrate content from wiki to GitHub Pages
    • Activiti tasks
    • Configuration values
    • Uploader/Downloader
  • Configure build to archive versions of documentation
  • Main navigation present and covers entire tree down to
  • Update all links from swagger API pages and other content from io and wiki

Based on Storage Policy, move data to another Storage Platform

As a Publisher, I want data to move from one Storage Platform to another according to the policy I defined

For this story, following limitations are in place

  • The only supported transition is from source Storage of S3 Storage Platform to destination Storage of Glacier Storage Platform
  • The only supported rule is DAYS_SINCE_BDATA_REGISTERED
  • Size limit and concurrency limits in place as described below

Acceptance Criteria

  • Run at scheduled interval - initially nightly - to move data according to Storage Policy transition, rule, and scope
  • Pick messages off SQS queue (see #34 ) and perform defined transition
  • Must have configurable limit based on BData size. Will not move BData if size exceeds this limit.
  • Must check that the BData is not already being archived and has not already been archived. This is necessary because there could be multiple messages in the queue for the same BData
  • Destination Storage Unit will be created and put in ARCHIVING status while storage operation completes.
    • Destination Storage Unit will contain reference to Source Storage Unit
  • Once storage operation is complete, status of Destination Storage Unit will be set to ENABLED
  • Source Storage Unit will not be modified, except that its status will be set to DISABLED. This should occur as an atomic operation together with setting Destination Storage Unit status to ENABLED.
    • The timing and history of the previous Source Storage Unit status will be captured in a history table
  • Should log at minimum:
    • Single log entry at start of processing with: Date/time, Bdata alternate key, Policy, source Storage, destination Storage
    • Single log entry at end with same as start plus: success/failure, error message

Validate file size based on configuration setting

As a Publisher I want meta-data about file size to be validated against S3 so I can ensure the accuracy of the meta-data

Acceptance Criteria

  • Validation controlled by Storage attribute
    • This file size validation will only be active if the file existence attribute is also present
  • Validation should occur when adding files and when registering BData
  • Validation can return error upon first mismatch and does not have to continue processing
  • Validation failure should result in rollback of meta-data operation (file add or BData registration) and return error message with mismatching file size
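
A sketch of the per-file check with the AWS SDK for Java v1, assuming file existence validation has already confirmed the key is present:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;

public class FileSizeValidatorSketch {
    public static void validateFileSize(AmazonS3 s3, String bucketName, String key, long registeredSizeBytes) {
        ObjectMetadata metadata = s3.getObjectMetadata(bucketName, key);
        long actualSizeBytes = metadata.getContentLength();
        if (actualSizeBytes != registeredSizeBytes) {
            // Fail fast on the first mismatch so the meta-data operation can be rolled back.
            throw new IllegalArgumentException(String.format(
                "File size mismatch for s3://%s/%s: registered %d bytes, found %d bytes.",
                bucketName, key, registeredSizeBytes, actualSizeBytes));
        }
    }
}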

Transaction issue with long-running jdbc tasks

Users of the JDBC Activiti task experienced a defect when performing long-running (ie ~30 minutes or more) operations. The task would fail with taskErrorMessage of "commit failed; nested exception is org.hibernate.TransactionException: commit failed". The JDBC statement would complete successfully but the workflow would stop with this error message.

Analysis showed that the database connection was being timed out by Tomcat and closed. A potential fix is to switch the transaction propagation from REQUIRES_NEW to NOT_SUPPORTED in JdbcServiceImpl.executeJdbc, as sketched below.
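
A minimal illustration of the proposed change (method body simplified; the real JdbcServiceImpl does considerably more):

import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;

public class JdbcServiceImplSketch {
    // NOT_SUPPORTED runs the long JDBC statement outside a Spring-managed transaction,
    // so there is no Hibernate commit left to fail after Tomcat times out and closes the pooled connection.
    @Transactional(propagation = Propagation.NOT_SUPPORTED)
    public void executeJdbc(/* JdbcExecutionRequest request */) {
        // ... execute the user-supplied JDBC statements here ...
    }
}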

Activiti tasks will not return immediately from request to start Activiti task

A previous story (Job parameters from S3) introduced a defect where the start job request will wait until the workflow is completed before responding to the user.

Defective behavior: Caller of start job request will not receive response until either workflow completes or workflow encounters async task.
Expected behavior: Caller of start job request will receive response immediately before workflow starts. Note that this behavior will be improved in a subsequent story to return almost immediately but only after the initial Activiti job state is persisted to the database.

The bug was introduced when we attempted to fix an issue reported during testing of that story, where having an S3 property longer than the database column length would cause the workflow to silently fail without any trace of its existence. The resolution for this defect is to remove the code introduced in the previous story that prevents the request from returning before the Activiti workflow is started.

Deleting BData files fails if over 1000 files in S3

Steps to reproduce

  • Create BData with > 1000 files in S3
  • Call BData DELETE including the deleteFiles option
  • Delete will fail on S3 operation

Desired behavior

  • Create BData with > 1000 files in S3
  • Call BData DELETE including the deleteFiles option
  • Delete succeeds in catalog and S3
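
The S3 multi-object delete API accepts at most 1000 keys per request, so the fix presumably needs to chunk the key list; a sketch with the AWS SDK for Java v1:

import java.util.List;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;
import com.amazonaws.services.s3.model.DeleteObjectsRequest.KeyVersion;

public class BatchDeleteSketch {
    private static final int MAX_KEYS_PER_REQUEST = 1000; // S3 multi-object delete limit

    public static void deleteAll(AmazonS3 s3, String bucketName, List<KeyVersion> keys) {
        for (int start = 0; start < keys.size(); start += MAX_KEYS_PER_REQUEST) {
            int end = Math.min(start + MAX_KEYS_PER_REQUEST, keys.size());
            DeleteObjectsRequest request = new DeleteObjectsRequest(bucketName)
                .withKeys(keys.subList(start, end));
            s3.deleteObjects(request);
        }
    }
}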

Generate parameter documentation in swagger

As a herd Consumer or Publisher I want to view swagger documentation including parameters and request/response bodies, and ensure that the documentation is in sync with the code and version-controlled

Does not include actual values, only placeholders for parameters, request model, response model.

Acceptance Criteria

  • Generate swagger YAML for body/parameters from XSD annotations
  • Does not include values in JSON, XML, these are not required for this story
  • swagger YAML generates swagger-ui without error
  • swagger YAML can be parsed by swagger editor without error
  • Integrate with #38 to ensure this executes as part of build process

Log Workflows triggered by Notifications

As a Herd Administrator I want to view information about each workflow triggered by a Notification so I can analyze and isolate issues more quickly when troubleshooting

Acceptance Criteria

  • INFO level application log entry for each workflow triggered including triggering BData alt key, namespace, job name
  • Do not require additional information as Activiti tables will cover the workflow details

Access file from KMS-encrypted bucket via pre-signed URL

As an application that downloads files from a KMS-encrypted bucket, I want to utilize a pre-signed URL and non-signed URL to access the file.

The Download Single Initiation service currently provides a temporary session access key, secret key, and token. We should provide a pre-signed URL (using the approach described at https://java.awsblog.com/post/Tx1518K5RPDEG0B/Generating-Amazon-S3-Pre-signed-URLs-with-SSE-KMS-Part-2 or a similar method) along with the existing information. We should also provide a standard non-signed URL to the file. In this scenario, the user is expected to already have the necessary credentials to access the file.

The URL format should be templated in the configuration table to accommodate URL options such as scheme (http/https, etc.). The pre-signed URL should have the temporary credentials embedded within the URL as necessary.

Acceptance Criteria

  • Ensure previous Amazon defect with pre-signed URL and KMS has been fixed
  • If there are options for generating pre-signed URL (such as region, bucket naming format) make them configurable
  • Application that does not have permissions to KMS key for KMS-encrypted bucket should be able to utilize pre-signed URL to download file
  • Pre-signed URL should be valid for the same expiration date as the token currently provided
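
A sketch of generating such a URL with the AWS SDK for Java v1; SSE-KMS requires Signature Version 4, so the client is pinned to the V4 signer (region and expiration below are illustrative):

import java.util.Date;
import com.amazonaws.ClientConfiguration;
import com.amazonaws.HttpMethod;
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.regions.Region;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

public class PresignedUrlSketch {
    public static String generateDownloadUrl(String bucketName, String key, long validityMillis) {
        // SSE-KMS objects require Signature Version 4 signing.
        ClientConfiguration clientConfiguration = new ClientConfiguration();
        clientConfiguration.setSignerOverride("AWSS3V4SignerType");
        AmazonS3Client s3 = new AmazonS3Client(new DefaultAWSCredentialsProviderChain(), clientConfiguration);
        s3.setRegion(Region.getRegion(Regions.US_EAST_1));

        Date expiration = new Date(System.currentTimeMillis() + validityMillis);
        GeneratePresignedUrlRequest request = new GeneratePresignedUrlRequest(bucketName, key)
            .withMethod(HttpMethod.GET)
            .withExpiration(expiration);
        return s3.generatePresignedUrl(request).toString();
    }
}

To embed the temporary session credentials from Download Single Initiation, the client would be constructed with those credentials instead of the default provider chain.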

Receive full Business Data Object data in workflow variables for notification

As an Activiti User whose job is triggered by notifications, I want to receive additional data in workflows so I can avoid multiple calls in my workflow.

Use case is related to notifications and especially cases where Activiti users can't pass state from one workflow to another. This request will prevent Activiti users from continuing to request more and more data in a reactive fashion as they discover additional use cases.

Acceptance Criteria

  • Include a single workflow variable with JSON that contains full information about the Business Data Object that triggered the notification
  • Information should be in a single JSON string
  • Information should reference - and not be a copy of - the response to Business Object Data Get endpoint. If the Business Object Data Get endpoint is updated, the workflow variable in notification trigger should get updated without developer action.
  • Do not remove any existing variables - this will be handled in future story to give users a migration path.
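
A sketch of how the trigger could populate such a variable, assuming Jackson is used for serialization; the variable name businessObjectDataJson is illustrative:

import java.util.HashMap;
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.activiti.engine.RuntimeService;

public class NotificationTriggerSketch {
    public static void startJobWithBdataJson(RuntimeService runtimeService, String processDefinitionKey,
            Object businessObjectDataGetResponse) throws Exception {
        // Serialize the same object that backs the Business Object Data Get response, so the
        // workflow variable tracks future changes to that endpoint without developer action.
        String json = new ObjectMapper().writeValueAsString(businessObjectDataGetResponse);

        Map<String, Object> variables = new HashMap<>();
        variables.put("businessObjectDataJson", json); // hypothetical variable name
        runtimeService.startProcessInstanceByKey(processDefinitionKey, variables);
    }
}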

Retrieve Availability and DDL from multiple storage without specifying storage

As a Consumer I want to retrieve Availability and DDL without specifying Storage so I can process data without being concerned about storage.

The herd data model structure currently allows files from multiple Storages in a single BData. This drove the inclusion of Storage in Availability and DDL requests -- DDL in particular will break if there are two files for the same partition. At this point, Storage is required in Availability and DDL, and when specifying multiple Storages, the first value takes precedence. Note that this data scenario is theoretical based on the model and is not in common usage by the current user set.

This story is to perform the following to ensure consumers can make these requests without passing Storage.

Acceptance Criteria

  • Storage is now optional in Availability and DDL
  • If multiple Storages are found for any single BData
    • Return an error if Storage is not provided in the request
    • Return Availability and DDL based on a single Storage if one is provided (like today)
    • Return Availability and DDL based on precedence by order if multiple Storages are provided (like today)
  • If BData range provided in request and data is spread across different Storage in range, return Availability and DDL based on both Storage combined (like today)

Provide user-friendly message when DDL fails due to more columns than previous format

Current exception/message:

java.lang.IllegalArgumentException: fromIndex(2) > toIndex(1)

Instead, we need to throw a custom error message – similar to this:

java.lang.IllegalArgumentException: Number of subpartition values specified must be less than the number of partition columns defined in schema for business object format {namespace: "UT_Namespace18224", businessObjectDefinitionName: "UT_Bodef18224", businessObjectFormatUsage: "UT_Usage18224", businessObjectFormatFileType: "TXT", businessObjectFormatVersion: 1837542006}.
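
A sketch of the guard that would produce the friendlier message before the index arithmetic fails (class and parameter names are illustrative, not the actual herd code):

import java.util.List;

public class SubPartitionValidationSketch {
    public static void validateSubPartitionCount(List<String> subPartitionValues,
            List<String> schemaPartitionColumns, String businessObjectFormatKey) {
        // The first partition column is covered by the primary partition value, so the number of
        // subpartition values must be strictly less than the number of partition columns in the schema.
        if (subPartitionValues.size() >= schemaPartitionColumns.size()) {
            throw new IllegalArgumentException(String.format(
                "Number of subpartition values specified must be less than the number of partition columns "
                    + "defined in schema for business object format {%s}.", businessObjectFormatKey));
        }
    }
}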

Utilize Uploader/Downloader on KMS-encrypted storage

As an Uploader/Downloader user I want to store and register confidential data in appropriately encrypted storage so I can fulfill records management requirements.

Assumption: Every Storage written to by Uploader will conform to the S3KeyPrefix prefix structure

Acceptance Criteria

  • User can optionally specify storage in manifest file when uploading and downloading data
    • If user does not specify, defaults to current behavior to pass hard-coded bucket name
  • Add attribute at Storage level that contains KMS Key ID that will be populated for encrypted buckets
  • Uploader will utilize KMS Key ID from Storage-level attribute when writing to S3
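
A sketch of applying the Storage-level KMS Key ID when the Uploader writes to S3, with the AWS SDK for Java v1:

import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.SSEAwsKeyManagementParams;

public class KmsUploadSketch {
    public static void upload(AmazonS3 s3, String bucketName, String key, File file, String kmsKeyId) {
        PutObjectRequest request = new PutObjectRequest(bucketName, key, file);
        if (kmsKeyId != null) {
            // Use the KMS Key ID read from the Storage-level attribute for SSE-KMS encryption.
            request.withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams(kmsKeyId));
        }
        s3.putObject(request);
    }
}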

Disable existing notifications so jobs are not triggered

As an Activiti user I want to disable registration notifications so I can temporarily suppress triggering jobs

Acceptance Criteria

  • Add Enabled/Disabled status as optional element to CRUD endpoints. Default value is enabled.
  • Create PUT endpoint for notification enable/disable without having to provide data for the entire notification update
  • Observe status when processing registrations and don't trigger job start if status is Disabled

Notification to trigger Activiti job based on status change, not registration

As a herd notification consumer, I want to trigger an Activiti job when a BData changes status.

This will be consistent with the way SQS notifications work. Follow SQS notifications for consistency in terms of interface/behavior details eg old and new status.

Acceptance Criteria

  • Consumer will register notification by specifying similar to today's Add Notification endpoint (http://finraos.github.io/herd/docs/0.1.0/rest/index.html#!/notification/addNotification)
    • BDef to watch for status changes, optionally specify format
    • Activiti job to run when status changes
  • Add notification filters:
    • Status update from or to a specified status value
    • Leave 'from' and 'to' blank to match all
  • Provide previous and current status in workflow variable (and add it to the existing registration-only notification type)
  • Provide partition keys in addition to partition values in workflow variables
    • partition column names should be a pipe-delimited list. We will only send as many partition column names as partition values were used to register this BData
    • partition values should be a pipe-delimited list
    • remove existing variables for partition value and sub-partition value as they will be replaced by the variable containing all partition values
  • Add PUT for notification registration as part of this story
  • Maintain the existing Add Notification endpoint (http://finraos.github.io/herd/docs/0.1.0/rest/index.html#!/notification/addNotification) -- this should not be removed and will still trigger only on initial registration and not on any status changes

Clean up incomplete or failed Activiti jobs

As an Activiti user or herd administrator I want to be able to clean up jobs that did not complete so I can avoid build-up of data from these jobs

Acceptance Criteria

  • New REST endpoint takes Job ID
  • Move all data associated with the job to the history tables
  • Only applies to jobs in RUNNING status
  • Access control by endpoint/role mapping (as usual)

Security mapping cache should not be empty if DB connection fails

During a recent outage in an integration environment, DB connections were not available for a period of several minutes as an unrelated issue maxed out the connection pool. At this time, service users received 403 responses from RBAC. It was found that the security mapping caching mechanism had not obtained a connection and was empty of security mappings.

Defective behavior

  • Security mapping cache was empty after failure to obtain DB connection

Desired behavior

  • Security mapping cache should retain old information if it fails to obtain DB connection

Steps to reproduce

  • Utilize all DB connections and wait (or artificially trigger) refresh of security mapping cache
  • Make request to REST endpoint with valid credentials
  • Receive 403 response as security mappings empty

Generating DDL for large range with sub-partitions and thousands of files results in error

The following request results in the error displayed below. Each partition contained thousands of files. This request should succeed without error. Consider disregarding storage file processing in favor of storage directory.

<businessObjectDataDdlRequest>
    <namespace>APP_A</namespace>
    <businessObjectDefinitionName>OBJECT_O</businessObjectDefinitionName>
    <businessObjectFormatUsage>PRC</businessObjectFormatUsage>
    <businessObjectFormatFileType>ORC</businessObjectFormatFileType>
    <businessObjectFormatVersion>2</businessObjectFormatVersion>
    <partitionValueFilters>
        <partitionValueFilter>
            <partitionKey>PARTITION_DATE</partitionKey>
            <partitionValueRange>
                <startPartitionValue>2015-01-12</startPartitionValue>
                <endPartitionValue>2015-11-10</endPartitionValue>
            </partitionValueRange>
        </partitionValueFilter>
    </partitionValueFilters>
    <storageName>S3_MANAGED</storageName>
    <outputFormat>HIVE_13_DDL</outputFormat>
    <tableName>OBJECT_O</tableName>
    <includeDropTableStatement>true</includeDropTableStatement>
    <includeIfNotExistsOption>true</includeIfNotExistsOption>
    <allowMissingData>true</allowMissingData>
</businessObjectDataDdlRequest>

results in "java.lang.OutOfMemoryError: GC overhead limit exceeded" with stack trace:

org.springframework.web.servlet.DispatcherServlet.triggerAfterCompletionWithError(DispatcherServlet.java:1287)
                 org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:961)
                   org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:877)
                   org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:966)
                   org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:868)
                   javax.servlet.http.HttpServlet.service(HttpServlet.java:646)
                   org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:842)
                   javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
                   org.finra.herd.ui.RequestLoggingFilter.doFilterInternal(RequestLoggingFilter.java:152)
                   org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
                   org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:88)
                   org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
                   org.finra.herd.app.Log4jMdcLoggingFilter.doFilter(Log4jMdcLoggingFilter.java:86)
                   org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:330)
                   org.springframework.security.web.authentication.AnonymousAuthenticationFilter.doFilter(AnonymousAuthenticationFilter.java:113)
                   org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342)
                   org.finra.herd.app.security.HttpHeaderAuthenticationFilter.doHttpFilter(HttpHeaderAuthenticationFilter.java:157)
                   org.finra.herd.app.security.HttpHeaderAuthenticationFilter.doFilter(HttpHeaderAuthenticationFilter.java:99)
                   org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342)
                   org.finra.herd.app.security.TrustedUserAuthenticationFilter.doHttpFilter(TrustedUserAuthenticationFilter.java:103)
                   org.finra.herd.app.security.TrustedUserAuthenticationFilter.doFilter(TrustedUserAuthenticationFilter.java:72)
                   org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342)
                   org.springframework.security.web.context.SecurityContextPersistenceFilter.doFilter(SecurityContextPersistenceFilter.java:87)
                   org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342)
                   org.springframework.security.web.FilterChainProxy.doFilterInternal(FilterChainProxy.java:192)
                   org.springframework.security.web.FilterChainProxy.doFilter(FilterChainProxy.java:160)
                   org.springframework.web.filter.DelegatingFilterProxy.invokeDelegate(DelegatingFilterProxy.java:344)
                   org.springframework.web.filter.DelegatingFilterProxy.doFilter(DelegatingFilterProxy.java:261)
                   org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
                   org.finra.herd.ui.HerdActivitiFilter.doFilter(HerdActivitiFilter.java:80)
root cause: java.lang.OutOfMemoryError: GC overhead limit exceeded
                   org.hibernate.internal.SessionImpl.internalLoad(SessionImpl.java:988)
                   org.hibernate.type.EntityType.resolveIdentifier(EntityType.java:716)
                   org.hibernate.type.EntityType.resolve(EntityType.java:502)
                   org.hibernate.engine.internal.TwoPhaseLoad.doInitializeEntity(TwoPhaseLoad.java:170)
                   org.hibernate.engine.internal.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:144)
                   org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:1114)
                   org.hibernate.loader.Loader.processResultSet(Loader.java:972)
                   org.hibernate.loader.Loader.doQuery(Loader.java:920)
                   org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:354)
                   org.hibernate.loader.Loader.doList(Loader.java:2553)
                   org.hibernate.loader.Loader.doList(Loader.java:2539)
                   org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2369)
                   org.hibernate.loader.Loader.list(Loader.java:2364)
                   org.hibernate.loader.hql.QueryLoader.list(QueryLoader.java:496)
                   org.hibernate.hql.internal.ast.QueryTranslatorImpl.list(QueryTranslatorImpl.java:387)
                   org.hibernate.engine.query.spi.HQLQueryPlan.performList(HQLQueryPlan.java:231)
                   org.hibernate.internal.SessionImpl.list(SessionImpl.java:1264)
                   org.hibernate.internal.QueryImpl.list(QueryImpl.java:103)
                   org.hibernate.jpa.internal.QueryImpl.list(QueryImpl.java:573)
                   org.hibernate.jpa.internal.QueryImpl.getResultList(QueryImpl.java:449)
                   org.hibernate.jpa.criteria.compile.CriteriaQueryTypeQueryAdapter.getResultList(CriteriaQueryTypeQueryAdapter.java:67)
                   org.finra.herd.dao.impl.HerdDaoImpl.getStorageFilesByStorageUnits(HerdDaoImpl.java:2258)
                   sun.reflect.GeneratedMethodAccessor384.invoke(Unknown Source)
                   sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                   java.lang.reflect.Method.invoke(Method.java:497)
                   org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
                   org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:201)
                   com.sun.proxy.$Proxy202.getStorageFilesByStorageUnits(Unknown Source)
                   org.finra.herd.service.helper.Hive13DdlGenerator.processStorageUnitsForGenerateDdl(Hive13DdlGenerator.java:541)
                   org.finra.herd.service.helper.Hive13DdlGenerator.processPartitionFiltersForGenerateDdl(Hive13DdlGenerator.java:521)
                   org.finra.herd.service.helper.Hive13DdlGenerator.generateCreateTableDdlHelper(Hive13DdlGenerator.java:280)
                   org.finra.herd.service.helper.Hive13DdlGenerator.generateCreateTableDdl(Hive13DdlGenerator.java:204)

FactoryCQ update libraries

  • Review Maven dependencies (except Activiti) and update to latest.
  • If deprecation requires code changes beyond the defined time box, raise for review and possibly defer.
  • Resolve dependency conflicts.
