nationalsecurityagency / datawave

DataWave is an ingest/query framework that leverages Apache Accumulo to provide fast, secure data access.

Home Page: https://code.nsa.gov/datawave

License: Apache License 2.0

Java 97.40% Shell 2.18% HTML 0.18% XSLT 0.01% Python 0.03% CSS 0.02% JavaScript 0.15% Dockerfile 0.04%
accumulo java bigdata

datawave's Introduction

DataWave is a Java-based ingest and query framework that leverages Apache Accumulo to provide fast, secure access to your data. DataWave supports a wide variety of use cases, including but not limited to...

  • Data fusion across structured and unstructured datasets
  • Construction and analysis of distributed graphs
  • Multi-tenant data architectures, with tenants having distinct security requirements and data access patterns
  • Fine-grained control over data access, integrated easily with existing user-authorization services and PKI

The easiest way to get started is the DataWave Quickstart

Documentation is located here

Basic build instructions are here

How to Use this Repository

The microservices and associated utility projects are intended to be developed, versioned, and released independently and as such are stored in separate repositories. This repository includes them all as submodules in order to provide an easy way to import them all in an IDE for viewing the code, or refactoring. Git submodules require some extra commands over the normal ones that one may be familiar with.

Cloning with all submodules

Cloning with all of the submodules is not required; however, if you are interested in checking out and building all of the datawave projects under one repo, read this!

It's easiest to clone the repository pointing the submodules at the same branch

# Start out by cloning the project as you normally would.
git clone git@github.com:NationalSecurityAgency/datawave.git

# Now, use git to retrieve all of the datawave submodules.
# This will leave your submodules in a detached head state.
cd datawave
git submodule update --init --recursive

# You can checkout the main branch for each submodule so that you are no longer in a detached head state.
# The addition of `|| :` will ensure that the command is executed for each submodule, 
# ignoring failures for submodules that don't have a main branch.
git submodule foreach 'git checkout main || :'

# It is recommended to build the project using multiple threads.
mvn -Pdocker,dist clean install -T 1C

# If you don't want to build the microservices, you can skip them.
mvn -Pdocker,dist -DskipMicroservices clean install -T 1C

# If you decide that you no longer need the submodules, you can remove them.
git submodule deinit --all

DataWave Microservices

For more information about deploying the datawave quickstart and microservices, check out the Docker Readme

datawave's People

Contributors

alerman, apmoriarty, austin007008, avgagb, bbux-atg, billoley, bmwmaestoso, brianloss, cjmctague, cogross, d-hwang, dependabot[bot], drewfarris, ejrgilbert, fineanddandy, friedlou, hgklohr, hlgp, ivakegg, jschmidt10, jwomeara, jzgithub1, keith-ratcliffe, lbschanno, miguelricardos, milleruntime, mineralntl, nonessentialprototype, plainolneesh, tomnelson

datawave's Issues

QueryPropertyMarker.instanceof is not identifying marked nodes correctly

The instanceof method in QueryPropertyMarker is used to identify nodes that have been marked with a QueryPropertyMarker (like ASTDelayedPredicate), but at the moment, it does not correctly identify marked nodes after they have gone through serialization/deserialization. At times, it also prematurely identifies the parent of a marked node as the marked node, which can cause problems.

Wildfly setup script fails to create symlink during deploy

During web services deployment, setup-wildfly.sh throws an error on this line....

ln -s $HADOOP_HOME/lib/native $WILDFLY_HOME/modules/org/apache/hadoop/common/main/lib/${OSNAME}-$(uname -m)

...because $WILDFLY_HOME/modules/org/apache/hadoop/common/main/lib doesn't exist.

QueryIterator incorrectly pulls up delayed predicates with an or

When we have a sequence of or'ed terms that have been delayed by the planner but have term frequencies (i.e., are tokenized), the QueryIterator will pull those terms back up. However, if this sequence is part of a union with other terms, this can cause results to be missed.

Add ability to ingest/query via new GeoWave Index Strategy

This should be a configurable option to index geospatial data using either the original TieredSFCIndexStrategy or the new XZHierarchicalIndexFactory. GeoWave query functions should be updated as well to support querying the new indices (perhaps via an optional parameter) while also supporting the older indices.

Allow table name overrides in config helpers

We need the ability to allow the configuration for a table to override properties. The idea is to prefix the properties to override with the table name or a defined prefix via another property.
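
The prefix idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class `TableOverrideConfig`, its methods, and the `<table>.override.prefix` property name are all hypothetical and do not come from the DataWave config helpers.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: none of these names come from the DataWave code base.
class TableOverrideConfig {
    private final Map<String, String> properties;

    TableOverrideConfig(Map<String, String> properties) {
        this.properties = new HashMap<>(properties);
    }

    // Resolve the prefix for a table: an explicit "<table>.override.prefix"
    // property (an assumed naming scheme) wins, otherwise the table name is used.
    String prefixFor(String tableName) {
        return properties.getOrDefault(tableName + ".override.prefix", tableName);
    }

    // Return the overriding value stored under "<prefix>.<key>" when present,
    // otherwise fall back to the shared value stored under "<key>".
    String getForTable(String tableName, String key) {
        String override = properties.get(prefixFor(tableName) + "." + key);
        return override != null ? override : properties.get(key);
    }
}
```

With `batch.size=100` and `shard.batch.size=500` configured, a lookup of `batch.size` for table `shard` would resolve to the override, while any table without an override would see the shared value.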

come up with a way to make certain queries survive web server restarts

With the current query API, it can be problematic to restart a web server during a long-running query since either the query will be cancelled and the user has to start over, or we must wait a long time for the query to complete before shutting the server down. The latter also requires load balancer support to ensure that requests for an active query go to the draining server, but no other new requests go to it.

Design and implement a mechanism to checkpoint query progress so that a server can be restarted between next calls without interrupting the query. This could be supported only for certain query logics at first, although it would be nice to have it for all query logics.

prepare microservices to be versioned independently

The "services" folder contains microservices, which are intended to be evolved and deployed largely independently of each other. Currently, this entire hierarchy shares the same pom version as the main datawave parent pom. Instead, each service should support having its own version. The services parent pom (which will eventually become the main parent pom) could then name the current version of each service that will be in use.
Another task that is part of this issue is to break the dependencies that currently exist where microservices code depends on legacy datawave code. Instead, the cross-dependent code should be moved to new modules under services. For example, each service might need a service-api module that contains the public api for the service.

Ivarator directory conflicts

Currently multiple scans of the same shard can occur. This happens when a day range gets expanded at the same time we have the shard range for shard 0:

day range: 20180101 to 20180101
shard range: 20180101_0 to 20180101\x00
after day range expansion we get: 20180101 to 20180101\x00

The expanded day range does not get collapsed with the shard range resulting in two scans against the same shard.

Meanwhile, if an ivarator is involved, both scans will be working with the same directory at the same time, which can cause all kinds of confusion. Typically, we found that when one of the ivarators compacts the files, the other then falters because of missing files.

Shutdown/refresh deadlock in HTTPClient

The version of HTTPClient we are using is subject to a bug during connection pool shutdown where deadlock can occur (HTTPCORE-446). Since we perform shutdown as a part of the "/Common/Configuration/refresh" method, this bug can take effect during runtime after a refresh has been issued.

Update to a newer version of HTTPClient that is not susceptible to this bug. Also update usage of HTTPClient to delay shutdown for a while to give existing pending requests a chance to finish since they would otherwise be immediately canceled and would fail, which is not what we want during a refresh.
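
One way to picture the delayed shutdown is the sketch below. It is not the DataWave implementation: `RefreshableClient` is a hypothetical holder, the type parameter `C` stands in for whatever HTTP client type is in use, and the grace period is an assumed tuning value. The new client is swapped in immediately, and the old one is closed only after the grace period so that pending requests have a chance to finish.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder; "C" stands in for whatever HTTP client type is in use.
class RefreshableClient<C extends AutoCloseable> {
    private final AtomicReference<C> current;
    private final long graceMillis; // assumed tuning value, not taken from DataWave
    private final ScheduledExecutorService closer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "client-closer");
                t.setDaemon(true); // do not keep the JVM alive just to close old clients
                return t;
            });

    RefreshableClient(C initial, long graceMillis) {
        this.current = new AtomicReference<>(initial);
        this.graceMillis = graceMillis;
    }

    // New requests always see the most recently installed client.
    C get() {
        return current.get();
    }

    // Swap in the replacement right away, then close the old client only
    // after the grace period has elapsed.
    void refresh(C replacement) {
        C old = current.getAndSet(replacement);
        closer.schedule(() -> {
            try {
                old.close();
            } catch (Exception e) {
                // a real service would log this failure
            }
        }, graceMillis, TimeUnit.MILLISECONDS);
    }
}
```

Requests that obtained the old client before the refresh keep using it until they complete or the grace period expires, whichever comes first.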

The NumShards cache needs to handle multiple hadoop filesystems

Currently the NumShards cache can only handle URIs that correspond to the configured defaultFS. However, we have instances where multiple filesystems may be available. Instead of using FileSystem.get(conf), we should be using FileSystem.get(uri, conf) to obtain the appropriate one.
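
A minimal sketch of the idea, using only the JDK: handles are cached per filesystem (URI scheme plus authority) rather than assuming a single default. The class `PerFileSystemCache`, its `opener` function, and the type parameter `F` are hypothetical; in a Hadoop setting the opener would presumably wrap a call such as `FileSystem.get(URI, Configuration)`.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache; "F" stands in for a filesystem handle type.
class PerFileSystemCache<F> {
    private final Map<String, F> handles = new ConcurrentHashMap<>();
    private final URI defaultFs;
    private final Function<URI, F> opener;

    PerFileSystemCache(URI defaultFs, Function<URI, F> opener) {
        this.defaultFs = defaultFs;
        this.opener = opener;
    }

    // Key on scheme + authority so that hdfs://nn1 and hdfs://nn2 stay distinct.
    static String keyOf(URI uri) {
        String authority = uri.getAuthority() == null ? "" : uri.getAuthority();
        return uri.getScheme() + "://" + authority;
    }

    // Paths without a scheme are resolved against the default filesystem;
    // each distinct filesystem is opened at most once.
    F forPath(URI path) {
        URI effective = path.getScheme() == null ? defaultFs : path;
        return handles.computeIfAbsent(keyOf(effective), k -> opener.apply(effective));
    }
}
```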

Negation may filter out too many results

Too many results may be filtered out when we have a document specific zone and an index-only field with a query of the form
(some expression) && !(some other expression)
The reason has to do with the limitSources path, where we return a field index entry without actually looking it up, assuming we already found it in the global index. There is a check for negation, but that flag is not appropriately set within an ASTNotNode in the IteratorBuildingVisitor.

Add ability to alias indexed fields

Add the ability to store the alias at ingest rather than be expanded via query model, reducing index scans. Trade storage for performance.

UID generation is being done redundantly

The UID generation needs to be separated from the setting of the raw data. In some circumstances, the current coupling causes the UID to be generated multiple times unnecessarily. We also need to ensure that the time on the event is set before the UID is generated. Finally, we should allow for alternative UID implementations other than hash and snowflake.

In Quickstart, 5 Datawave Menu Options Are Forbidden

I performed the following steps:

The response was simply 'forbidden'.

Remove fixed length UID parse assumptions from TLDEventDataFilter

TLDEventDataFilter assumes a 20 character fixed length base UID for rootPointer parsing purposes. This is not always the case. Adjust the code to remove the bad assumption and instead calculate the length for each Key, counting the separators to determine root document state.
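
Counting separators instead of relying on a fixed length might look like the sketch below. It rests on two assumptions that are not stated in the issue: that UID parts are separated by `.` and that a root UID contains exactly two separators, with child documents appending further parts. `UidParser` and its methods are hypothetical names.

```java
// Hypothetical helper; the separator and the root separator count are assumptions.
class UidParser {
    static final char SEPARATOR = '.';
    static final int ROOT_SEPARATORS = 2; // assumed: a root UID has three dot-separated parts

    // Count how many separators appear in the UID, whatever its length.
    static int countSeparators(String uid) {
        int count = 0;
        for (int i = 0; i < uid.length(); i++) {
            if (uid.charAt(i) == SEPARATOR) {
                count++;
            }
        }
        return count;
    }

    // A UID with no parts beyond the root parts refers to the root document.
    static boolean isRoot(String uid) {
        return countSeparators(uid) <= ROOT_SEPARATORS;
    }

    // Return everything before the separator that starts the first child part,
    // or the whole UID when it is already a root UID.
    static String rootOf(String uid) {
        int seen = 0;
        for (int i = 0; i < uid.length(); i++) {
            if (uid.charAt(i) == SEPARATOR && ++seen > ROOT_SEPARATORS) {
                return uid.substring(0, i);
            }
        }
        return uid;
    }
}
```

Because the root boundary is found by position of the separators, base UIDs of any length are handled the same way.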

Support cache eviction across multiple copies of the authorization microservice

The authorization microservice exposes an endpoint to evict users from the cache. However, if more than one copy of the service is running, the operation won't currently take effect on all copies. The backing cache is shared; however, if users are in an in-memory cache on a different copy of the service, they won't be evicted. Use the Spring event bus to notify all copies of the authorization service of an eviction request.

Add the ability to permute documents for evaluation

A feature has been requested where we configure the shard query logic with one or more objects that can be used to modify the document prior to evaluation. The particular requirement is to allow us to dynamically create the pieces of a virtual field for insertion into the JexlContext. This will allow one to drop the pieces of a virtual field (saving space in the DB) but still allow users to potentially query on the pieces.

Replace uses of AtomicInteger with LongAdder

Through profiling with YourKit and monitoring cache misses, I've noticed improved execution time and fewer cache misses when I avoid some of the CAS operations with AtomicInteger. By using LongAdder in QueryStatsDClient I've seen improved performance. I suspect this will also be the case with the schedulers and scanners that increment integers across threads. I'll make these additional targeted changes (beyond the statsd client) as a POC.
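
For illustration, a counter shared across threads can be written with `java.util.concurrent.atomic.LongAdder` as in the sketch below. `CounterDemo` is a hypothetical class, not code from QueryStatsDClient, and the sketch shows only the usage pattern, not the performance difference reported above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical demo class; not taken from the DataWave code base.
class CounterDemo {
    // Increment a shared LongAdder from several threads and return the total.
    static long countWithLongAdder(int threads, int incrementsPerThread) {
        LongAdder counter = new LongAdder();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    counter.increment(); // updates are spread over internal cells under contention
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counter.sum(); // combines the cells; accurate once all writers are done
    }
}
```

LongAdder tends to suit counters that are written often and read rarely, such as statistics. Since `sum()` is not an atomic snapshot while updates are in flight, AtomicInteger may remain the better fit where each increment's return value is needed.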
