finraos / herd


Herd is a managed data lake for the cloud. The Herd unified data catalog helps separate storage from compute in the cloud. Manage petabytes of data and make it accessible for data processing and analytical purposes by any cloud compute platform.

Home Page: http://finraos.github.io/herd/

License: Apache License 2.0

JavaScript 4.62% HTML 0.08% Shell 0.06% Java 91.00% Batchfile 0.01% Scala 2.04% Python 1.92% SCSS 0.27%

herd's People

Contributors

afelde, aniruddhadas9, aniruddhadas9finra, charliepy, davidbalash, dependabot[bot], foxsmart, gudin-anton, jazhou, jzhang80, k26389, kenisteward, kood1, kusid, mchao47, mona62, nateiam, rongwang0930, saisumughis, saisuryafinra, seoj, wz1371


herd's Issues

Generate endpoints in swagger

As a herd Consumer or Publisher I want to view swagger documentation for REST endpoints and ensure that the documentation is in sync with the code and version-controlled

Acceptance Criteria

  • Generate swagger YAML for REST including:
    • endpoints
    • list of parameters
    • model, JSON, XML
  • swagger YAML does not include
    • parameter descriptions, data types, required/optional indication
    • values in JSON, XML
  • Generate swagger as part of build process
  • swagger YAML generates swagger-ui without error
  • swagger YAML can be parsed by swagger editor without error

Use KMS Key ID from Storage attribute in LFU/LFS

As the LFU service, I want to obtain the KMS Key ID for the LFS bucket from a Storage Attribute so I can conform to the new standard way of storing and retrieving KMS Keys.
#40 introduces a mechanism to obtain the KMS Key ID for a bucket from its Storage Attribute

Acceptance Criteria

  • KMS Key ID for LFS bucket is present as Storage Attribute
  • LFU Upload service obtains KMS Key ID from Storage Attribute
  • LFS/LFU regression passes

Downloader fails when not using secret and access key

Downloader was modified in 0.7.0 to utilize the Herd credential endpoint. With this modification, users are not required to pass secret and access keys on the command line. Instead, the Downloader calls the Herd credential endpoint, which authenticates with the username and password supplied on the command line and generates temporary credentials. However, the call to the Herd credential service is failing because the Downloader does not pass the format version.

Steps to reproduce

  • Utilize 0.7.0 Downloader without passing secret and access key at the command line

Defective behavior

  • Downloader will call Herd credential endpoint but not pass required format version
  • When this call fails, Downloader will fall back on the AWS credential chain
  • If no credentials present on AWS credential chain, Downloader will fail

Desired behavior

  • Downloader already knows the latest format version from the BData Get. Downloader should pass this format version to the Herd credential endpoint, which will then succeed, and use the temporary credentials.

NOTE: This passed during testing because the fallback to the AWS credential chain found valid credentials and the operation succeeded.

Manage Data Provider values

As a herd Publisher I want to view, insert, and delete Data Provider values so I can manage this data without a herd Administrator

Acceptance Criteria

  • CRD (no update) endpoints present to manage Data Provider values
  • Delete should only succeed if the value is unused in the database

Update notification registrations

As a notification user, I want to update notification registrations so I can modify my registration instead of having to create a new one.

Assumption - no support for versioning

Acceptance Criteria

  • PUT endpoint for notification registrations
  • Same capabilities and data as POST endpoint except requires additional input of namespace + job name
  • Replaces all data for this notification registration with what is provided in the payload. The user must provide all values they want stored. In other words, Herd will not merge the PUT payload with the existing persisted state; the user must provide the entire state.

Ongoing DB Create Script Maintenance

Introduce process to maintain create and upgrade scripts for each release for OSS

  • We are already good at maintaining incremental upgrade scripts with each release; this is well tested and works smoothly
  • Maintain create script along with upgrade scripts by extracting create script from our environment with each release
  • Discuss upgrade paths and intervals at which we will maintain incremental scripts eg major versions

Acceptance Criteria

  • Wiki page documentation

Add high volume of files to Storage Files Post

As a Herd Publisher I want to add a high volume of files without the service timing out so I can register a large number of files associated with a BData.

Acceptance Criteria

  • Business Object Data Storage Files Post works under timeout threshold for up to 30k files for Storages of S3 Storage Platform
    • Baseline the performance before and after the fix
  • Business Object Data Storage Files Post works in performance criteria when files are included in request -- this does not cover auto-discovery

Monitor active Activiti jobs that have not completed

As an Activiti user I want visibility into running jobs so I can perform troubleshooting/cleanup/re-run of jobs that were created by a trigger.

This might take the form of a REST endpoint that takes the job name and/or job definition ID and returns any jobs that are not in completed status.

Acceptance Criteria

  • Programmatic interface available to retrieve list of incomplete jobs for a given namespace and optionally job name
  • Returns for all active jobs
    • Include namespace, job name, start time, job ID (to facilitate subsequent calls to Get Job Status: http://finraos.github.io/herd/docs/0.1.0/rest/index.html#!/job/getJobStatus), and environment
  • Structure response with future expansion in mind - we will likely add different status and additional information about jobs

In the future:

  • We will add the ability to see jobs that have completed with some sort of walking window filter.
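
If the endpoint is backed directly by the Activiti API, the query could look roughly like the sketch below; deriving the process definition key from namespace and job name is an assumption for illustration, not necessarily how herd names its process definitions.

import java.util.List;
import org.activiti.engine.RuntimeService;
import org.activiti.engine.runtime.ProcessInstance;

public class ActiveJobsQuerySketch {
    public static List<ProcessInstance> findIncompleteJobs(RuntimeService runtimeService, String namespace, String jobName) {
        // Assumption: the Activiti process definition key is derived from namespace and job name.
        String processDefinitionKey = namespace + "." + jobName;
        return runtimeService.createProcessInstanceQuery()
            .processDefinitionKey(processDefinitionKey)
            .active() // completed instances are no longer in the runtime tables; active() also excludes suspended ones
            .list();
    }
}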

Specify prefix template as Storage Attribute

As a Storage owner I want to specify the prefix template for a Storage that I own so I can manage the prefix structure as I wish.

Currently the prefix template is set as a separate configuration value in the Herd DB for each managed bucket. This should instead be set as an attribute on each Storage.

Affects how the prefix is defined; it becomes part of the public interface, so we must make it more flexible (eg use Velocity).
Also affects S3KeyPrefix - must modify it to return the prefix on this basis.

Acceptance Criteria

  • Migrate prefix template logic to access from storage attribute instead of configuration value. Remove global config value.
  • S3KeyPrefix will take Storage Name as optional parameter, default = S3_MANAGED
  • Populate existing S3 Storages with template values
    • Reminder - environment difference 'frmt' in DEV vs 'schema' in other environments - make this all 'schema'
  • Ensure error handling exists for all operations that touch prefix
    • If path validation prefix is present and prefix is not present (not enforced at storage creation but enforced at runtime eg when prefix is accessed)
  • Prefix template should be defined using the Velocity template language (see the sketch below)
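
As a rough illustration of a Velocity-based prefix template (attribute contents and variable names here are assumptions, not the final design), the Storage attribute could hold a template that is evaluated per registration:

import java.io.StringWriter;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;

public class S3KeyPrefixTemplateSketch {
    public static String buildPrefix(String template) {
        VelocityEngine velocityEngine = new VelocityEngine();
        velocityEngine.init();

        // Hypothetical variables exposed to the template; the real variable set would be defined by herd.
        VelocityContext context = new VelocityContext();
        context.put("namespace", "APP_A");
        context.put("businessObjectDefinitionName", "OBJECT_O");
        context.put("businessObjectFormatUsage", "PRC");
        context.put("businessObjectFormatFileType", "ORC");
        context.put("partitionValue", "2015-11-10");

        StringWriter writer = new StringWriter();
        velocityEngine.evaluate(context, writer, "s3KeyPrefix", template);
        return writer.toString();
    }
}

An attribute value might then look like $namespace/$businessObjectDefinitionName/$businessObjectFormatUsage/$businessObjectFormatFileType/schema-v0/data-v0/partition=$partitionValue (illustrative only).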

Provide consistent error messages for failed Activiti tasks

As an Activiti user I want to have improved error reporting on both sync and async tasks so I can tell which task had an error and see it in the workflow variables.

Currently Activiti sync tasks have herd-specific code that handles errors by catching exceptions, capturing the error message in a workflow variable, and then returning flow to Activiti.

Herd Activiti tasks running async and built-in Activiti tasks do not have this behavior - their error reporting is not handled consistently with the above. Instead, exception messages are not captured in the workflow; they appear only in logs, and the workflow appears stopped at the last sync task. This also affects all built-in Activiti tasks.

Acceptance Criteria

  • Tasks that fail will have exception message in workflow variables
  • Standard herd workflow variables that show task status will always reflect correct completed, running, or error status
  • Behavior applies to all herd tasks running sync and async and all built-in Activiti tasks (eg script tasks)

Restrict permissions when executing Java from JavaScript in Activiti

As a herd Administrator, I want to ensure that Java executed from JavaScript in Activiti cannot be used to execute malicious code.

Use the Java SecurityManager on the JavaScript engine used by Activiti to restrict all permissions in the SecurityManager spec. This will not affect any Activiti tasks or Java written by herd in Activiti tasks - it only impacts Java running from Activiti JavaScript.

Acceptance Criteria

  • Javascript in Activiti cannot execute known examples of malicious code such as System.exit, file I/O, network I/O, spawning threads or processes
  • Javascript in Activiti will still be able to do legitimate, harmless operations that are currently in use such as assembling requests and parsing responses
  • herd Java code running in Tomcat is not affected
  • Solution is compatible with Java 1.8
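
One possible approach (a sketch under the assumption that a SecurityManager is installed in Tomcat, not necessarily the approach herd chose) is to evaluate the script under an AccessControlContext that grants no permissions, so privileged operations attempted from JavaScript are rejected:

import java.security.AccessControlContext;
import java.security.AccessController;
import java.security.CodeSource;
import java.security.Permissions;
import java.security.PrivilegedExceptionAction;
import java.security.ProtectionDomain;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class RestrictedScriptRunner {
    public static Object evalRestricted(String script) throws Exception {
        // Requires a SecurityManager, e.g. System.setSecurityManager(new SecurityManager());
        Permissions noPermissions = new Permissions(); // grant nothing
        ProtectionDomain restrictedDomain =
            new ProtectionDomain(new CodeSource(null, (java.security.cert.Certificate[]) null), noPermissions);
        AccessControlContext restrictedContext = new AccessControlContext(new ProtectionDomain[] { restrictedDomain });

        ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
        // System.exit, file and network I/O, thread/process creation, etc. now fail with AccessControlException,
        // while plain string/JSON manipulation in the script continues to work.
        return AccessController.doPrivileged(
            (PrivilegedExceptionAction<Object>) () -> engine.eval(script), restrictedContext);
    }
}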

Retrieve Availability and DDL from data in multiple Storage

As a herd Consumer I want to receive availability data that considers the accessibility of data in the underlying Storage Platform so I can avoid trying to access data that has been archived to Glacier

Acceptance Criteria

  • For BData that has been archived to Glacier, return as part of notAvailableStatuses with reason ARCHIVED
    • Archived to Glacier is defined as the Storage Unit not being in ENABLED status for a Storage of the S3 Storage Platform while the Storage Unit is ENABLED for a Storage of the Glacier platform
  • For BData that does not have any ENABLED Storage, should return as part of notAvailableStatuses with reason NO_ENABLED_STORAGE_UNIT
  • DDL will exclude notAvailableStatuses according to this new rule and the current ignoreMissing logic.
  • Even if Glacier Storage is specified in Availability request, it will still be returned as notAvailableStatus
  • Additional Storage Unit Status such as ARCHIVING should be considered equivalent to DISABLED.
    • This involves a generic mechanism that allows adding various statuses and identifying whether they are considered ENABLED or DISABLED for purposes of evaluating availability

Receive response from Execute Job only after job data persisted

As an Activiti user I want to ensure that jobs executed via Execute Job REST are persisted to Activiti prior to the rest call returning.

The current Activiti Execute Job REST returns with an instance ID before the job is actually persisted to Activiti. This leaves a window where jobs could be dropped due to system failure. Some Activiti users that use external scheduling and have operations staff to monitor jobs will notice this type of failure. But now we have teams that will be using notifications to trigger jobs and will not know if there is a failure of this sort.

We need to close the small window where jobs might not be persisted by doing the following:

  • Write code that modifies the first task (i.e. the start task?) and makes it "async=true" if it isn't already in every workflow at registration
  • Remove our Activiti create and start workflow “hack” command that returns a process instance Id before the Activiti transaction gets persisted and use the normal command instead where we could catch an exception if the process doesn’t get created and return an error in the REST API

Acceptance Criteria

  • Ensure that any failure prior to persisting will result in Execute Job returning with error
  • Ensure job is persisted to Activiti prior to Execute Job returning
  • Execute Job returns in reasonable time (less than 2-3 seconds) including persisting in Activiti
  • If no-op is added at registration time ensure no issues with updating registration (eg adds another no-op each time updated)

Retrieve EMR cluster status without calling deprecated DescribeJobFlows

As herd Product Owner I would like EMR cluster status to come from AWS' newer APIs so we no longer rely on the deprecated API.

We need to fulfill our current EMR Cluster GET service (http://finraos.github.io/herd/docs/0.1.0/rest/index.html#!/cluster/getEMRCluster) by calling some combination of the AWS APIs listed below.


Based on our current records, you are currently using one of our deprecated API (DescribeJobFlows) to get the details for all of your EMR clusters.

Starting Midnight (24:00 UTC) 12/01/2015, we will require users to move to the new set of APIs. The newer set of APIs provide a much more granular and low latency access.

Here is the list of new APIs:

ListClusters, DescribeCluster, ListSteps, ListInstanceGroups and ListBootstrapActions instead.

API Documentation: http://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_Operations.html

DescribeJobFlows API Doc: http://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_DescribeJobFlows
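
For reference, a sketch of reading cluster status through the newer APIs with the AWS SDK for Java v1 (how herd wires these calls into the EMR Cluster GET service may differ):

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.model.Cluster;
import com.amazonaws.services.elasticmapreduce.model.DescribeClusterRequest;
import com.amazonaws.services.elasticmapreduce.model.ListStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.ListStepsResult;

public class EmrClusterStatusSketch {
    public static String getClusterState(AmazonElasticMapReduce emrClient, String clusterId) {
        // DescribeCluster replaces the cluster-level details previously read from DescribeJobFlows.
        Cluster cluster = emrClient.describeCluster(
            new DescribeClusterRequest().withClusterId(clusterId)).getCluster();

        // ListSteps replaces the per-step details previously read from DescribeJobFlows.
        ListStepsResult steps = emrClient.listSteps(new ListStepsRequest().withClusterId(clusterId));

        return cluster.getStatus().getState() + " (" + steps.getSteps().size() + " steps listed)";
    }
}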

Utilize temporary token in DataBridge to write to S3

As a herd Administrator I want to provision Uploader/Downloader user access to S3 buckets with temporary tokens so I can eliminate the use of secret and access keys

The tech approach will be similar to the LFS use of the AWS STS service.

Acceptance Criteria

  • Uploader/Downloader code will call a REST endpoint that will request a token from STS on behalf of authorized users and grant them a temporary token
    • Full alternate key must be passed into the endpoint and if that full key down to the Format level does not exist, the endpoint will return an error
    • For Upload Credentials endpoint, businessObjectDataVersion is optional and user can specify createNewVersion. These changes are temporary until pre-registration is implemented
  • REST endpoint is associated with appropriate roles through existing configurable mechanism
  • Temporary token will have permissions on resources in AWS:
    • Read/Write AWS roles = PUT, DELETE on S3 prefix in manifest
    • Read-only AWS roles = GET on bucket
  • Uploader/Downloader code will still pass user credentials to other REST services used by Uploader/Downloader subject to RBAC for endpoint access - no change to this portion of access control
  • Uploader and Downloader need to extend credentials to maintain token across entire list of files. Utilize existing REST endpoint.
  • Retain previous mechanism using permanent secret/access keys without STS
  • Documentation should explicitly mention that the credentials endpoint should only be used by the Uploader/Downloader, not for other access
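
A minimal sketch of how such an endpoint could obtain scoped temporary credentials from STS (policy contents, session name, and duration below are illustrative assumptions, not herd's actual configuration):

import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClient;
import com.amazonaws.services.securitytoken.model.Credentials;
import com.amazonaws.services.securitytoken.model.GetFederationTokenRequest;

public class TemporaryCredentialsSketch {
    public static Credentials getScopedCredentials(String userName, String bucketName, String s3KeyPrefix) {
        // Hypothetical inline policy restricting the token to the S3 prefix from the manifest.
        String policy = "{\"Version\":\"2012-10-17\",\"Statement\":[{"
            + "\"Effect\":\"Allow\","
            + "\"Action\":[\"s3:GetObject\",\"s3:PutObject\",\"s3:DeleteObject\"],"
            + "\"Resource\":\"arn:aws:s3:::" + bucketName + "/" + s3KeyPrefix + "*\"}]}";

        AWSSecurityTokenServiceClient stsClient = new AWSSecurityTokenServiceClient();
        GetFederationTokenRequest request = new GetFederationTokenRequest()
            .withName(userName)
            .withPolicy(policy)
            .withDurationSeconds(3600);
        return stsClient.getFederationToken(request).getCredentials();
    }
}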

Boundary tokens (min/max/latest*) in Availability/DDL should observe status

As a Consumer I want to receive the latest VALID partition when retrieving with the latestbefore token

Currently, the boundary tokens (min, max, latestbefore, latestafter) do not consider VALID or INVALID status when evaluating. They find the boundary date and then return Availability for that date. Instead we want them to consider status in this logic and find the boundary date with VALID status.

Case that shows current undesired behavior:

  2015-05-01 VALID
  2015-05-02 VALID
  2015-05-03 VALID
  2015-05-04 INVALID

  Scenario                      Expected      Currently received
  "latest before 2015-05-04"    2015-05-03    Not Available
  "latest before 2015-05-10"    2015-05-03    Not Available
  "latest before 2015-05-03"    2015-05-03    2015-05-03

And for reference, the happy path case:

  2015-05-01 VALID
  2015-05-02 VALID
  2015-05-03 VALID
  2015-05-04 VALID

  Scenario                      Expected      Received
  "latest before 2015-05-04"    2015-05-04    2015-05-04
  "latest before 2015-05-10"    2015-05-04    2015-05-04
  "latest before 2015-05-03"    2015-05-03    2015-05-03

Acceptance Criteria

  • Receive expected response in case above where latest data is INVALID
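
A rough sketch of the intended selection logic for latestbefore, assuming a simple in-memory map of partition value to status (the real implementation presumably applies this filter in the database query):

import java.util.Comparator;
import java.util.Map;
import java.util.Optional;

public class LatestBeforeSketch {
    // Given partition values mapped to their status, pick the latest VALID partition value
    // that is <= the requested bound ("latest before" is inclusive in the examples above;
    // ISO-8601 dates compare correctly as strings).
    public static Optional<String> latestBefore(Map<String, String> statusByPartitionValue, String upperBound) {
        return statusByPartitionValue.entrySet().stream()
            .filter(entry -> "VALID".equals(entry.getValue()))
            .map(Map.Entry::getKey)
            .filter(value -> value.compareTo(upperBound) <= 0)
            .max(Comparator.naturalOrder());
    }
}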

Utilize temporary token in Uploader/Downloader to write to S3

As a herd Administrator I want to provision Uploader/Downloader user access to S3 buckets with temporary tokens so I can eliminate the use of secret and access keys

Tech approach will be similar to LFU use of AWS STS service.

Acceptance Criteria

  • Uploader/Downloader code will call a REST endpoint that will request a token from STS on behalf of authorized users and grant them a temporary token
  • REST endpoint is associated with appropriate roles through existing configurable mechanism
  • Temporary token will have permissions on resources in AWS:
    • Read/Write AWS roles = GET on bucket, PUT, DELETE on S3 prefix in manifest
    • Read-only AWS roles = GET on bucket
  • Uploader/Downloader code will still pass user credentials to other REST services used by Uploader/Downloader subject to RBAC for endpoint access - no change to this portion of access control
  • Uploader and Downloader need to extend credentials to maintain token across entire list of files. Utilize existing REST endpoint.
  • Remove previous mechanism using permanent secret/access keys

Retrieve artifacts from Sonatype repository

As a herd user, I want to be able to retrieve artifacts from the Sonatype repository so that I can create an instance of herd or deploy artifacts to an existing instance without access to the FINRA network. See http://central.sonatype.org/pages/requirements.html for more details.

Acceptance Criteria:

  • Sonatype requirements
    • Need to build and include Javadoc JARs
    • Need to sign files with GPG/PGP. Considering using the Maven plug-in: http://central.sonatype.org/pages/apache-maven.html
    • We need to add “description” and “url” tags to all our POMs.
    • Need to add a “licenses” tag for our Apache 2.0 usage.
    • Need to add a list of “developers” to the POM.
    • Need to add an “scm” section with GitHub connection information.
    • Move from org.finra.dm to org.finra.herd.
  • Additionally rename classes and comments as required
  • Publish to Sonatype during open source release
  • Need to maintain context-root as dm-app in existing environments
  • Must ensure regression passes in every environment and watch this very closely when promoting in existing environments

Re-instate removal of legacy namespace

As a herd User, I would like to always have business object definition endpoints that require namespace. Any endpoints that do not require namespace have already been deprecated - with this story we want to remove these from the system.

In addition, all endpoints with request XML that includes namespace should not have namespace optional - it should always be required.

Acceptance criteria:

  • All endpoints including Activiti wrappers must specify namespace as part of the alternate key
  • Legacy flag will be removed from the database
  • Update tables that have the list of endpoints for security
  • EXCEPT - leave the endpoint that lists BDef without specifying Namespace

Delete bucket versions in source Storage Platform after move to destination Storage Platform

As a Publisher I want to delete files from the source Storage Platform after they have been moved to the destination Storage Platform

Acceptance Criteria

  • Trigger at Bdata level
  • Verify all files for this Bdata moved from source Storage Platform to destination Storage Platform - this is performed in #41
  • If all moved successfully, delete all prior versions from source Storage Platform
    • Note, this operation does not propagate to replicated buckets
  • This is integrated with #41, which moves BData from S3 to Glacier.

Migrate DDL generation to use velocity templates

As a Product Owner, I want DDL generation to use a generic templating engine rather than string manipulation in Java

Acceptance Criteria

  • Ensure existing HIVE DDL output format passes all unit and regression tests after migration to velocity
  • Validate special logic is present in tests:
    • escape characters (slash, single quote, back tick) for applicability
    • ORGNL_ prefixing behavior
  • Apply to all DDL generation including collection
  • Consider impacts of the following and discuss as needed
    • less duplication of logic than the Java approach

Select Storage Policy regardless of age

As the herd PolicyProcessor I want to select Storage Policy based on order of precedence regardless of age so I can override cross-cutting policies by stating a more specific policy.

Storage policies specify some portion of the alternate key and can apply to multiple BData. The following logic should apply to choose the correct policy; more specific rules should override the behavior of less-specific rules according to this order.

  1. BDef and (Usage + File Type) --- Most specific
  2. BDef
  3. (Usage + File Type)
  4. [no portion of key specified] --- Least specific

Acceptance Criteria

  • If multiple storage policies apply to a BData, only the most specific should be selected to execute
  • Age of BData and the elapsed day value in the Storage Policy should not be considered until after a single policy is selected according to the precedence rules

Increase attribute length limits

As a DM Consumer/Publisher, I want to save longer values in certain user-entered fields such as Notification Name

  • Review all column lengths that are <= 50
  • Review to determine which are user-entered and make sense to change (eg not Namespace)
  • Review to ensure that none of these have downstream impact if they are larger (eg aggregate total of alternate key must be < 1024)
    • Check strings that will end up in SQS, S3 key

Acceptance Criteria

  • Desired fields updated to 250 length or appropriate as discussed with user base
  • No changes in full regression results

Manage Storage Policy

As a Publisher, I want to declare a Storage Policy that transitions data from one Storage Platform to another so I can define when data should move to cold storage

Policy definition includes:

  • Scope - expressed by limited subset of alternate keys and listed here in order of precedence
    • Always include StorageName of current storage location
    • BDef and (Usage & FileType) - override at this specific level
    • BDef - override at this specific level
    • (Usage & FileType) - a cross-cutting scope that applies to all BData of this Usage and FileType
    • [none] - global scope that applies to all BData
    • NOTE: (Usage & FileType) must always be used together to keep it simple - or else there are too many permutations
  • Rule - rule type and numeric value
    • Initially support one rule type of DAYS_SINCE_BDATA_REGISTERED with the numeric value in days
  • Transition - destination Storage (required)
    • Initially support only Glacier platform as destination

See policy definition examples on the Storage Policy Model page

Acceptance Criteria

  • REST endpoints to manage Storage Policy: POST, GET
  • Apply endpoint permissions to appropriate Roles
  • NOTE: does not include movement of data - this is just CRUD for the policy

Calling Job Executions Get does not return jobs due to case mismatch

Steps to reproduce:

  1. Create JobDefinition with jobName in a different case than the Activiti XML element ID.
  2. Start job of this type
  3. Query for job using Job Executions Get and provide namespace, jobName inputs with any case. Do not specify status.

Current Behavior

  • Job Executions Get does not return job of namespace, jobName specified in request

Expected Behavior

  • Job Executions Get returns job of namespace, jobName specified in request

Run Activiti jobs on latest version of Activiti

As herd Product Owner I want to upgrade to the latest stable version of Activiti so I can take advantage of new features and defect fixes.

For example we recently introduced an endpoint that lists basic information for jobs that meet certain filtering criteria. Currently we only support criteria that specify a single namespace+jobName combination. Later versions of Activiti (eg 5.18.0) support supplying lists of criteria, so we could support supplying only the namespace and getting all jobs of any jobName.

Acceptance Criteria

  • Review release notes of Activiti prior to starting technical work
  • Activiti jar and database schemas upgraded
  • Update all assets related to Activiti UI
  • Unit and functional regression tests pass
    • Special attention to herd-specific Activiti code like the first task modifications

Retrieve list of BData affected by Storage Policy rule and scope

As the herd PolicyProcessor I want to retrieve a list of all BData that match the scope and rule of any StoragePolicy so I can execute the transition described in the applicable Storage Policy.

Acceptance Criteria

  • Create list that contains BData+Storage (storage unit) that match criteria of any registered Storage Policy at the time the list is requested
    • BData matches scope of some Storage Policy, for example it has same BDef or Usage&FileType that matches any Storage Policy
      AND
    • BData matches rule of some Storage Policy, for example it was created >= 90 days ago and it is in scope of a Storage Policy with a rule saying DAYS_SINCE_BDATA_REGISTERED = 90
  • If BData matches several Storage Policies, use order of precedence described in Storage Policy Model to identify a single Storage Policy
  • BData must not already be present in destination storage of transition, for example it has not already been archived to Glacier
  • A message will be placed in SQS queue for each BData
    • Message format contains: source Storage, alternate keys and/or BData ID, Storage Policy name and/or ID
    • Message should not contain any actual data - just meta-data - as SQS is not encrypted
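
A sketch of publishing such a metadata-only message (queue URL and JSON field names are illustrative assumptions):

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.model.SendMessageRequest;

public class StoragePolicySelectorSketch {
    public static void publishSelection(AmazonSQS sqs, String queueUrl,
            String sourceStorageName, String businessObjectDataKey, String storagePolicyName) {
        // The message carries only meta-data (no actual data) because the SQS queue is not encrypted.
        String body = "{"
            + "\"sourceStorageName\":\"" + sourceStorageName + "\","
            + "\"businessObjectDataKey\":\"" + businessObjectDataKey + "\","
            + "\"storagePolicyName\":\"" + storagePolicyName + "\"}";
        sqs.sendMessage(new SendMessageRequest(queueUrl, body));
    }
}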

Factory CQ - eliminate checkstyle exception for herd-model

Looks like we exclude our herd-model classes from our code quality rules. Most of these classes have auto-generated equals/hash code methods which aren’t compliant with our rules (e.g. Object “o” can’t be a single character object parameter, etc.). It’s probably a good idea for us to clean up these classes and remove the code quality exclusion.

View documentation with improved layout and versioning

As a developer or potential contributor, I want to view documentation with optimized layout and ensure that the documentation is version-controlled.

Acceptance Criteria

  • Migrate content from wiki to GitHub Pages
    • Activiti tasks
    • Configuration values
    • Uploader/Downloader
  • Configure build to archive versions of documentation
  • Main navigation present and covers entire tree down to
  • Update all links from swagger API pages and other content from io and wiki

Based on Storage Policy, move data to another Storage Platform

As a Publisher, I want data to move from one Storage Platform to another according to the policy I defined

For this story, following limitations are in place

  • The only supported transition is from source Storage of S3 Storage Platform to destination Storage of Glacier Storage Platform
  • The only supported rule is DAYS_SINCE_BDATA_REGISTERED
  • Size limit and concurrency limits in place as described below

Acceptance Criteria

  • Run at scheduled interval - initially nightly - to move data according to Storage Policy transition, rule, and scope
  • Pick messages off SQS queue (see #34 ) and perform defined transition
  • Must have configurable limit based on BData size. Will not move BData if size exceeds this limit.
  • Must check that the BData is not already being archived and has not already been archived. This is necessary because there could be multiple messages in the queue for the same BData
  • Destination Storage Unit will be created and put in ARCHIVING status while storage operation completes.
    • Destination Storage Unit will contain reference to Source Storage Unit
  • Once storage operation is complete, status of Destination Storage Unit will be set to ENABLED
  • Source Storage Unit will not be modified, except that its status will be set to DISABLED. This should occur as an atomic operation together with setting Destination Storage Unit status to ENABLED.
    • The timing and history of the previous Source Storage Unit status will be captured in a history table
  • Should log at minimum:
    • Single log entry at start of processing with: Date/time, Bdata alternate key, Policy, source Storage, destination Storage
    • Single log entry at end with same as start plus: success/failure, error message

Validate file size based on configuration setting

As a Publisher I want meta-data about file size to be validated against S3 so I can ensure the accuracy of the meta-data

Acceptance Criteria

  • Validation controlled by Storage attribute
    • This file size validation will only be active if the file existence attribute is also present
  • Validation should occur when adding files and when registering BData
  • Validation can return error upon first mismatch and does not have to continue processing
  • Validation failure should result in rollback of meta-data operation (file add or BData registration) and return error message with mismatching file size
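
A sketch of the per-file check with the AWS SDK for Java v1, assuming file existence validation has already confirmed the key is present:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;

public class FileSizeValidatorSketch {
    public static void validateFileSize(AmazonS3 s3, String bucketName, String key, long registeredSizeBytes) {
        ObjectMetadata metadata = s3.getObjectMetadata(bucketName, key);
        long actualSizeBytes = metadata.getContentLength();
        if (actualSizeBytes != registeredSizeBytes) {
            // Fail fast on the first mismatch so the meta-data operation can be rolled back.
            throw new IllegalArgumentException(String.format(
                "File size mismatch for s3://%s/%s: registered %d bytes, found %d bytes.",
                bucketName, key, registeredSizeBytes, actualSizeBytes));
        }
    }
}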

Transaction issue with long-running jdbc tasks

Users of the JDBC Activiti task experienced a defect when performing long-running (ie ~30 minutes or more) operations. The task would fail with taskErrorMessage of "commit failed; nested exception is org.hibernate.TransactionException: commit failed". The JDBC statement would complete successfully but the workflow would stop with this error message.

Analysis showed that the database connection was being timed out by Tomcat and closed. A potential fix is to switch the transaction propagation from REQUIRES_NEW to NOT_SUPPORTED in JdbcServiceImpl.executeJdbc, as sketched below.
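
A minimal illustration of the proposed change (method body simplified; the real JdbcServiceImpl does considerably more):

import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;

public class JdbcServiceImplSketch {
    // NOT_SUPPORTED runs the long JDBC statement outside a Spring-managed transaction,
    // so there is no Hibernate commit left to fail after Tomcat times out and closes the pooled connection.
    @Transactional(propagation = Propagation.NOT_SUPPORTED)
    public void executeJdbc(/* JdbcExecutionRequest request */) {
        // ... execute the user-supplied JDBC statements here ...
    }
}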

Activiti tasks will not return immediately from request to start Activiti task

A previous story (Job parameters from S3) introduced a defect where the start job request will wait until the workflow is completed before responding to the user.

Defective behavior: Caller of start job request will not receive response until either workflow completes or workflow encounters async task.
Expected behavior: Caller of start job request will receive response immediately before workflow starts. Note that this behavior will be improved in a subsequent story to return almost immediately but only after the initial Activiti job state is persisted to the database.

The bug was introduced when we attempted to fix an issue reported during testing of that story, where having an S3 property longer than the database column length would cause the workflow to silently fail without any trace of its existence. The resolution for this defect is to remove the code introduced in the previous story that prevents the request from returning before the Activiti workflow is started.

Deleting BData files fails if over 1000 files in S3

Steps to reproduce

  • Create BData with > 1000 files in S3
  • Call BData DELETE including the deleteFiles option
  • Delete will fail on S3 operation

Desired behavior

  • Create BData with > 1000 files in S3
  • Call BData DELETE including the deleteFiles option
  • Delete succeeds in catalog and S3
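
The S3 multi-object delete API accepts at most 1000 keys per request, so the fix presumably needs to chunk the key list; a sketch with the AWS SDK for Java v1:

import java.util.List;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;
import com.amazonaws.services.s3.model.DeleteObjectsRequest.KeyVersion;

public class BatchDeleteSketch {
    private static final int MAX_KEYS_PER_REQUEST = 1000; // S3 multi-object delete limit

    public static void deleteAll(AmazonS3 s3, String bucketName, List<KeyVersion> keys) {
        for (int start = 0; start < keys.size(); start += MAX_KEYS_PER_REQUEST) {
            int end = Math.min(start + MAX_KEYS_PER_REQUEST, keys.size());
            DeleteObjectsRequest request = new DeleteObjectsRequest(bucketName)
                .withKeys(keys.subList(start, end));
            s3.deleteObjects(request);
        }
    }
}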

Generate parameter documentation in swagger

As a herd Consumer or Publisher I want to view swagger documentation including parameters and request/response bodies, and ensure that the documentation is in sync with the code and version-controlled

Does not include actual values, only placeholders for parameters, request model, response model.

Acceptance Criteria

  • Generate swagger YAML for body/parameters from XSD annotations
  • Does not include values in JSON, XML, these are not required for this story
  • swagger YAML generates swagger-ui without error
  • swagger YAML can be parsed by swagger editor without error
  • Integrate with #38 to ensure this executes as part of build process

Log Workflows triggered by Notifications

As a Herd Administrator I want to view information about each workflow triggered by a Notification so I can analyze and isolate issues more quickly when troubleshooting

Acceptance Criteria

  • INFO level application log entry for each workflow triggered including triggering BData alt key, namespace, job name
  • Do not require additional information as Activiti tables will cover the workflow details

Access file from KMS-encrypted bucket via pre-signed URL

As an application that downloads files from a KMS-encrypted bucket, I want to utilize a pre-signed URL and non-signed URL to access the file.

The Download Single Initiation service currently provides a temporary session access key, secret key, and token. We should provide a pre-signed URL (using the approach described at https://java.awsblog.com/post/Tx1518K5RPDEG0B/Generating-Amazon-S3-Pre-signed-URLs-with-SSE-KMS-Part-2 or a similar method) along with the existing information. We should also provide a standard non-signed URL to the file. In this scenario, the user is expected to already have the necessary credentials to access the file.

The URL format should be templated in the configuration table to accommodate URL options such as scheme (http/https, etc.). The pre-signed URL should have the temporary credentials embedded within the URL as necessary.

Acceptance Criteria

  • Ensure previous Amazon defect with pre-signed URL and KMS has been fixed
  • If there are options for generating pre-signed URL (such as region, bucket naming format) make them configurable
  • Application that does not have permissions to KMS key for KMS-encrypted bucket should be able to utilize pre-signed URL to download file
  • Pre-signed URL should be valid for the same expiration date as the token currently provided
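
A sketch of generating such a URL with the AWS SDK for Java v1; SSE-KMS requires Signature Version 4, so the client is pinned to the V4 signer (region and expiration below are illustrative):

import java.util.Date;
import com.amazonaws.ClientConfiguration;
import com.amazonaws.HttpMethod;
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.regions.Region;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

public class PresignedUrlSketch {
    public static String generateDownloadUrl(String bucketName, String key, long validityMillis) {
        // SSE-KMS objects require Signature Version 4 signing.
        ClientConfiguration clientConfiguration = new ClientConfiguration();
        clientConfiguration.setSignerOverride("AWSS3V4SignerType");
        AmazonS3Client s3 = new AmazonS3Client(new DefaultAWSCredentialsProviderChain(), clientConfiguration);
        s3.setRegion(Region.getRegion(Regions.US_EAST_1));

        Date expiration = new Date(System.currentTimeMillis() + validityMillis);
        GeneratePresignedUrlRequest request = new GeneratePresignedUrlRequest(bucketName, key)
            .withMethod(HttpMethod.GET)
            .withExpiration(expiration);
        return s3.generatePresignedUrl(request).toString();
    }
}

To embed the temporary session credentials from Download Single Initiation, the client would be constructed with those credentials instead of the default provider chain.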

Receive full Business Data Object data in workflow variables for notification

As an Activiti User whose job is triggered by notifications, I want to receive additional data in workflows so I can avoid multiple calls in my workflow.

Use case is related to notifications and especially cases where Activiti users can't pass state from one workflow to another. This request will prevent Activiti users from continuing to request more and more data in a reactive fashion as they discover additional use cases.

Acceptance Criteria

  • Include a single workflow variable with JSON that contains full information about the Business Data Object that triggered the notification
  • Information should be in a single JSON string
  • Information should reference - and not be a copy of - the response to Business Object Data Get endpoint. If the Business Object Data Get endpoint is updated, the workflow variable in notification trigger should get updated without developer action.
  • Do not remove any existing variables - this will be handled in future story to give users a migration path.
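
A sketch of how the trigger could populate such a variable, assuming Jackson is used for serialization; the variable name businessObjectDataJson is illustrative:

import java.util.HashMap;
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.activiti.engine.RuntimeService;

public class NotificationTriggerSketch {
    public static void startJobWithBdataJson(RuntimeService runtimeService, String processDefinitionKey,
            Object businessObjectDataGetResponse) throws Exception {
        // Serialize the same object that backs the Business Object Data Get response, so the
        // workflow variable tracks future changes to that endpoint without developer action.
        String json = new ObjectMapper().writeValueAsString(businessObjectDataGetResponse);

        Map<String, Object> variables = new HashMap<>();
        variables.put("businessObjectDataJson", json); // hypothetical variable name
        runtimeService.startProcessInstanceByKey(processDefinitionKey, variables);
    }
}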

Retrieve Availability and DDL from multiple storage without specifying storage

As a Consumer I want to retrieve Availability and DDL without specifying Storage so I can process data without being concerned about storage.

The herd data model structure currently allows files from multiple Storages in a single BData. This drove the inclusion of Storage in Availability and DDL requests -- DDL in particular will break if there are two files for the same partition. At this point, Storage is required in Availability and DDL, and when specifying multiple Storages, the first value takes precedence. Note that this data scenario is theoretical based on the model and is not in common usage by the current user set.

This story is to perform the following to ensure consumers can make these requests without passing Storage.

Acceptance Criteria

  • Storage is now optional in Availability and DDL
  • If multiple Storages are found for any single BData
    • Return an error if Storage is not provided in the request
    • Return Availability and DDL based on a single Storage if one is provided (like today)
    • Return Availability and DDL based on precedence by order if multiple Storages are provided (like today)
  • If BData range provided in request and data is spread across different Storage in range, return Availability and DDL based on both Storage combined (like today)

Provide user-friendly message when DDL fails due to more columns than previous format

Current exception/message:

java.lang.IllegalArgumentException: fromIndex(2) > toIndex(1)

Instead, we need to throw a custom error message – similar to this:

java.lang.IllegalArgumentException: Number of subpartition values specified must be less than the number of partition columns defined in schema for business object format {namespace: "UT_Namespace18224", businessObjectDefinitionName: "UT_Bodef18224", businessObjectFormatUsage: "UT_Usage18224", businessObjectFormatFileType: "TXT", businessObjectFormatVersion: 1837542006}.
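
A sketch of the guard that would produce the friendlier message before the index arithmetic fails (class and parameter names are illustrative, not the actual herd code):

import java.util.List;

public class SubPartitionValidationSketch {
    public static void validateSubPartitionCount(List<String> subPartitionValues,
            List<String> schemaPartitionColumns, String businessObjectFormatKey) {
        // The first partition column is covered by the primary partition value, so the number of
        // subpartition values must be strictly less than the number of partition columns in the schema.
        if (subPartitionValues.size() >= schemaPartitionColumns.size()) {
            throw new IllegalArgumentException(String.format(
                "Number of subpartition values specified must be less than the number of partition columns "
                    + "defined in schema for business object format {%s}.", businessObjectFormatKey));
        }
    }
}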

Utilize Uploader/Downloader on KMS-encrypted storage

As an Uploader/Downloader user I want to store and register confidential data in appropriately encrypted storage so I can fulfill records management requirements.

Assumption: Every Storage written to by Uploader will conform to the S3KeyPrefix prefix structure

Acceptance Criteria

  • User can optionally specify storage in manifest file when uploading and downloading data
    • If user does not specify, defaults to current behavior to pass hard-coded bucket name
  • Add attribute at Storage level that contains KMS Key ID that will be populated for encrypted buckets
  • Uploader will utilize KMS Key ID from Storage-level attribute when writing to S3
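
A sketch of applying the Storage-level KMS Key ID when the Uploader writes to S3, with the AWS SDK for Java v1:

import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.SSEAwsKeyManagementParams;

public class KmsUploadSketch {
    public static void upload(AmazonS3 s3, String bucketName, String key, File file, String kmsKeyId) {
        PutObjectRequest request = new PutObjectRequest(bucketName, key, file);
        if (kmsKeyId != null) {
            // Use the KMS Key ID read from the Storage-level attribute for SSE-KMS encryption.
            request.withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams(kmsKeyId));
        }
        s3.putObject(request);
    }
}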

Disable existing notifications so jobs are not triggered

As an Activiti user I want to disable registration notifications so I can temporarily suppress triggering jobs

Acceptance Criteria

  • Add Enabled/Disabled status as optional element to CRUD endpoints. Default value is enabled.
  • Create PUT endpoint for notification enable/disable without having to provide data for the entire notification update
  • Observe status when processing registrations and don't trigger job start if status is Disabled

Notification to trigger Activiti job based on status change, not registration

As a herd notification consumer, I want to trigger an Activiti job when a BData changes status.

This will be consistent with the way SQS notifications work. Follow SQS notifications for consistency in terms of interface/behavior details eg old and new status.

Acceptance Criteria

  • Consumer will register notification by specifying similar to today's Add Notification endpoint (http://finraos.github.io/herd/docs/0.1.0/rest/index.html#!/notification/addNotification)
    • BDef to watch for status changes, optionally specify format
    • Activiti job to run when status changes
  • Add notification filters:
    • Status update from or to a specified status value
    • Leave 'from' and 'to' blank to match all
  • Provide previous and current status in workflow variable (and add it to the existing registration-only notification type)
  • Provide partition keys in addition to partition values in workflow variables
    • partition column names should be a pipe-delimited list. We will only send as many partition column names as partition values were used to register this BData
    • partition values should be a pipe-delimited list
    • remove existing variables for partition value and sub-partition value as they will be replaced by the variable containing all partition values
  • Add PUT for notification registration as part of this story
  • Maintain the existing Add Notification endpoint (http://finraos.github.io/herd/docs/0.1.0/rest/index.html#!/notification/addNotification) -- this should not be removed and will still trigger only on initial registration and not on any status changes

Clean up incomplete or failed Activiti jobs

As an Activiti user or herd administrator I want to be able to clean up jobs that did not complete so I can avoid build-up of data from these jobs

Acceptance Criteria

  • New REST endpoint takes Job ID
  • Move all data associated with the job to the history tables
  • Only applies to jobs in RUNNING status
  • Access control by endpoint/role mapping (as usual)

Security mapping cache should not be empty if DB connection fails

During a recent outage in an integration environment, DB connections were not available for a period of several minutes as an unrelated issue maxed out the connection pool. At this time, service users received 403 responses from RBAC. It was found that the security mapping caching mechanism had not obtained a connection and was empty of security mappings.

Defective behavior

  • Security mapping cache was empty after failure to obtain DB connection

Desired behavior

  • Security mapping cache should retain old information if it fails to obtain DB connection

Steps to reproduce

  • Utilize all DB connections and wait (or artificially trigger) refresh of security mapping cache
  • Make request to REST endpoint with valid credentials
  • Receive 403 response as security mappings empty

Generating DDL for large range with sub-partitions and thousands of files results in error

The following request results in the error displayed below. Each partition contained thousands of files. This request should succeed without error. Consider disregarding storage file processing in favor of storage directory.

<businessObjectDataDdlRequest>
    <namespace>APP_A</namespace>
    <businessObjectDefinitionName>OBJECT_O</businessObjectDefinitionName>
    <businessObjectFormatUsage>PRC</businessObjectFormatUsage>
    <businessObjectFormatFileType>ORC</businessObjectFormatFileType>
    <businessObjectFormatVersion>2</businessObjectFormatVersion>
    <partitionValueFilters>
        <partitionValueFilter>
            <partitionKey>PARTITION_DATE</partitionKey>
            <partitionValueRange>
                <startPartitionValue>2015-01-12</startPartitionValue>
                <endPartitionValue>2015-11-10</endPartitionValue>
            </partitionValueRange>
        </partitionValueFilter>
    </partitionValueFilters>
    <storageName>S3_MANAGED</storageName>
    <outputFormat>HIVE_13_DDL</outputFormat>
    <tableName>OBJECT_O</tableName>
    <includeDropTableStatement>true</includeDropTableStatement>
    <includeIfNotExistsOption>true</includeIfNotExistsOption>
    <allowMissingData>true</allowMissingData>
</businessObjectDataDdlRequest>

results in "java.lang.OutOfMemoryError: GC overhead limit exceeded" with stack trace:

org.springframework.web.servlet.DispatcherServlet.triggerAfterCompletionWithError(DispatcherServlet.java:1287)
                 org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:961)
                   org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:877)
                   org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:966)
                   org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:868)
                   javax.servlet.http.HttpServlet.service(HttpServlet.java:646)
                   org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:842)
                   javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
                   org.finra.herd.ui.RequestLoggingFilter.doFilterInternal(RequestLoggingFilter.java:152)
                   org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
                   org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:88)
                   org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
                   org.finra.herd.app.Log4jMdcLoggingFilter.doFilter(Log4jMdcLoggingFilter.java:86)
                   org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:330)
                   org.springframework.security.web.authentication.AnonymousAuthenticationFilter.doFilter(AnonymousAuthenticationFilter.java:113)
                   org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342)
                   org.finra.herd.app.security.HttpHeaderAuthenticationFilter.doHttpFilter(HttpHeaderAuthenticationFilter.java:157)
                   org.finra.herd.app.security.HttpHeaderAuthenticationFilter.doFilter(HttpHeaderAuthenticationFilter.java:99)
                   org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342)
                   org.finra.herd.app.security.TrustedUserAuthenticationFilter.doHttpFilter(TrustedUserAuthenticationFilter.java:103)
                   org.finra.herd.app.security.TrustedUserAuthenticationFilter.doFilter(TrustedUserAuthenticationFilter.java:72)
                   org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342)
                   org.springframework.security.web.context.SecurityContextPersistenceFilter.doFilter(SecurityContextPersistenceFilter.java:87)
                   org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342)
                   org.springframework.security.web.FilterChainProxy.doFilterInternal(FilterChainProxy.java:192)
                   org.springframework.security.web.FilterChainProxy.doFilter(FilterChainProxy.java:160)
                   org.springframework.web.filter.DelegatingFilterProxy.invokeDelegate(DelegatingFilterProxy.java:344)
                   org.springframework.web.filter.DelegatingFilterProxy.doFilter(DelegatingFilterProxy.java:261)
                   org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
                   org.finra.herd.ui.HerdActivitiFilter.doFilter(HerdActivitiFilter.java:80)
root cause: java.lang.OutOfMemoryError: GC overhead limit exceeded
                   org.hibernate.internal.SessionImpl.internalLoad(SessionImpl.java:988)
                   org.hibernate.type.EntityType.resolveIdentifier(EntityType.java:716)
                   org.hibernate.type.EntityType.resolve(EntityType.java:502)
                   org.hibernate.engine.internal.TwoPhaseLoad.doInitializeEntity(TwoPhaseLoad.java:170)
                   org.hibernate.engine.internal.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:144)
                   org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:1114)
                   org.hibernate.loader.Loader.processResultSet(Loader.java:972)
                   org.hibernate.loader.Loader.doQuery(Loader.java:920)
                   org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:354)
                   org.hibernate.loader.Loader.doList(Loader.java:2553)
                   org.hibernate.loader.Loader.doList(Loader.java:2539)
                   org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2369)
                   org.hibernate.loader.Loader.list(Loader.java:2364)
                   org.hibernate.loader.hql.QueryLoader.list(QueryLoader.java:496)
                   org.hibernate.hql.internal.ast.QueryTranslatorImpl.list(QueryTranslatorImpl.java:387)
                   org.hibernate.engine.query.spi.HQLQueryPlan.performList(HQLQueryPlan.java:231)
                   org.hibernate.internal.SessionImpl.list(SessionImpl.java:1264)
                   org.hibernate.internal.QueryImpl.list(QueryImpl.java:103)
                   org.hibernate.jpa.internal.QueryImpl.list(QueryImpl.java:573)
                   org.hibernate.jpa.internal.QueryImpl.getResultList(QueryImpl.java:449)
                   org.hibernate.jpa.criteria.compile.CriteriaQueryTypeQueryAdapter.getResultList(CriteriaQueryTypeQueryAdapter.java:67)
                   org.finra.herd.dao.impl.HerdDaoImpl.getStorageFilesByStorageUnits(HerdDaoImpl.java:2258)
                   sun.reflect.GeneratedMethodAccessor384.invoke(Unknown Source)
                   sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                   java.lang.reflect.Method.invoke(Method.java:497)
                   org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
                   org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:201)
                   com.sun.proxy.$Proxy202.getStorageFilesByStorageUnits(Unknown Source)
                   org.finra.herd.service.helper.Hive13DdlGenerator.processStorageUnitsForGenerateDdl(Hive13DdlGenerator.java:541)
                   org.finra.herd.service.helper.Hive13DdlGenerator.processPartitionFiltersForGenerateDdl(Hive13DdlGenerator.java:521)
                   org.finra.herd.service.helper.Hive13DdlGenerator.generateCreateTableDdlHelper(Hive13DdlGenerator.java:280)
                   org.finra.herd.service.helper.Hive13DdlGenerator.generateCreateTableDdl(Hive13DdlGenerator.java:204)

FactoryCQ update libraries

  • Review Maven dependencies (except Activiti) and update to latest.
  • If deprecation requires code changes beyond the defined time box, raise for review and possibly defer.
  • Resolve dependency conflicts.
