precog / platform

Advanced Analytics Engine for NoSQL Data

Home Page: http://www.slamdata.com

License: GNU Affero General Public License v3.0

Scala 97.72% Shell 1.52% Python 0.17% Java 0.39% Ruby 0.21%

platform's Introduction

Precog

Precog is an advanced analytics engine for NoSQL data. It's sort of like a traditional analytics database, but instead of working with normalized, tabular data, it works with denormalized data that may not have a uniform schema.

You can plop large amounts of JSON into Precog and, without any preprocessing, start doing analytics: time series analysis, filtering, rollups, statistics, and even some kinds of machine learning.

There's an API for developer integration, and a high-level application called Labcoat for doing ad hoc and exploratory analytics.

Developers have used Precog to build reporting features into applications (its APIs are comprehensive and developer-friendly), and data scientists have used Precog together with Labcoat to perform ad hoc analysis of semi-structured data.

This is the Community Edition of Precog. For more information about commercial support and maintenance options, check out SlamData, Inc, the official sponsor of the Precog open source project.

Community

  • Precog-Dev — An open email list for developers of Precog.
  • Precog-User — An open email list for users of Precog.
  • #precog — An IRC channel for Precog.
  • #quirrel — An IRC channel for the Quirrel query language.

Developer Guide

A few landmarks:

  • common - Data structures and service interfaces that are shared between multiple submodules.

  • quirrel - The Quirrel compiler, including the parser, static analysis code and bytecode emitter

    • Parser
    • Binder
    • ProvenanceChecker
  • mimir - The Quirrel optimizer, evaluator and standard library

    • EvaluatorModule
    • StdLibModule
    • StaticInlinerModule
  • yggdrasil - Core data access and manipulation layer

    • TableModule
    • ColumnarTableModule
    • Slice
    • Column
  • niflheim - Low-level columnar block store (NIHDB).

    • NIHDB
  • ingest - BlueEyes service front-end for data ingest.

  • muspelheim - Convergence point for the compiler and evaluator stacks; integration test sources and data

    • ParseEvalStack
    • MiscStackSpecs
  • surtr - Integration tests that run on the NIHDB backend. Surtr also provides a (somewhat defunct) REPL that gives access to the evaluator and other parts of the Precog environment.

    • NIHDBPlatformSpecs
    • REPL
  • bifrost - BlueEyes service front-end for query evaluation (the shard service).

  • miklagard - Standalone versions for the desktop and alternate backend data stores -- see local README.rst. These need a bit of work to bring them up to date; they were disabled some time ago and may have bitrotted.

  • util - Generic utility functions and data structures that are not specific to any particular function of the Precog codebase; convenience APIs for external libraries.

Thus, to work on the evaluator, one would be in the mimir project, writing tests in the mimir and muspelheim projects. The tests in the muspelheim project would be run from the surtr project (not from muspelheim), but using the test data stored in muspelheim. All of the other projects are significantly saner.

Getting Started

Step one: obtain PaulP's sbt launcher script. At this point, ideally you would be able to run ./build-test.sh and everything would be fine. Unfortunately, at the present time, you have to jump through a few hoops to get all of the dependencies in order.

First, you need to clone and build blueeyes. This should be relatively painless: grab the repository and run sbt publish-local. After everything finishes, you can move on to the next ball of wax: Kafka. Unfortunately, Kafka has yet to publish any public Maven artifacts, much less artifacts for precisely the version on which Precog depends. At the current time, the best way to deal with this problem is to grab the tarball of Ivy dependencies and extract it into your ~/.ivy2/cache/ directory. Once this is done, you should be ready to go.

Altogether, you need to run the following commands:

$ git clone git@github.com:jdegoes/blueeyes.git
$ cd blueeyes
$ sbt publish-local
$ cd ..
$ cd /tmp
$ wget https://dl.dropboxusercontent.com/u/1679797/kafka-stuff.tar.gz
$ tar xf kafka-stuff.tar.gz -C ~/.ivy2/cache/
$ cd -
$ cd platform
$ sbt

From here, you must run the following tasks in order:

  • test:compile
  • ratatoskr/assembly
  • extract-data
  • test

The last one should take a fair amount of time, but when it completes (and everything is green), you can have a pretty solid assurance that you're up and running!
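
For reference, these are entered one at a time at the sbt prompt started by the last command above:

> test:compile
> ratatoskr/assembly
> extract-data
> test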

In order to more easily navigate the codebase, it is highly recommended that you install CTAGS, if your editor supports it. Our filename conventions are…inconsistent.

Building and Running

These instructions are at best rudimentary, but should be sufficient to get started in a minimal way. More will be coming soon!

The Precog environment is organized in a modular, service-oriented fashion, with loosely coupled components that are relatively tolerant of the failure of any single component (albeit with degraded function). Most of the components allow for redundant instances of the relevant service, although in some cases (bifrost in particular) some tricky configuration is required, which will not be detailed here.

Services:

  • bifrost - The primary query (shard) service; evaluates queries against NIHDB data
  • auth - Authentication provider (checks tokens and grants; to be merged with accounts in the near term)
  • accounts - Account provider (records association between user information and an account root token; to be merged with auth in the near term)
  • dvergr - A simple job tracking service that is used to track batch query completion.
  • ingest - The primary service for adding data to the Precog database.

Runnable jar files for all of these services can be built using the sbt assembly target from the root (platform) project. Sample configuration files can be found in the <projectname>/configs/dev directory of each relevant project; to run a simple test instance you can use the start-shard.sh script. Note that this will download, configure, and run local instances of MongoDB, Apache Kafka, and ZooKeeper. Additional instructions for running the Precog database in a server environment will be coming soon.
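
For example, a minimal local build-and-run session (assuming start-shard.sh is invoked from the repository root) might look like this:

$ sbt assembly
$ ./start-shard.sh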

Contributing

All Contributions are bound by the terms and conditions of the Precog Contributor License Agreement.

Pull Request Process

We use a pull request model for development. When you want to work on a new feature or bug, create a new branch based off of master (do not base off of another branch unless you absolutely need the work in progress on that branch). Collaboration is highly encouraged; accidental branch dependencies are not. Your branch name should be given one of the following prefixes:

  • topic/ - For features, changes, refactorings, etc. (e.g. topic/parse-function)
  • bug/ - For things that are broken, investigations, etc. (e.g. bug/double-allocation)
  • wip/ - For code that is not ready for team-wide sharing (e.g. wip/touch-me-and-die)

If you see a topic/ or bug/ branch on someone else's repository that has changes you need, it is safe to base off of that branch instead of master, though you should still base off of master if at all possible. Do not ever base off of a wip/ branch! This is because the commits in a wip/ branch may be rewritten, rearranged or discarded entirely, and thus the history is not stable.

Do your work on your local branch, committing as frequently as you like, squashing and rebasing off of updated master (or any other topic/ or bug/ branch) at your discretion.

When you are confident in your changes and ready for them to land, push your topic/ or bug/ branch to your own fork of platform (you can create a fork from the main repository's GitHub page).

Once you have pushed to your fork, submit a Pull Request using GitHub's interface. Take a moment to describe your changes as a whole, particularly highlighting any API or Quirrel language changes which land as part of the changeset.

Once your pull request is ready to be merged, it will be brought into the staging branch, which is a branch on the mainline repository that exists purely for the purposes of aggregating pull requests. It should not be considered a developer branch, but is used to run the full build as a final sanity check before the changes are pushed as a fast forward to master once the build has completed successfully. This process ensures a minimum of friction between concurrent tasks while simultaneously making it extremely difficult to break the build in master. Build problems are generally caught and resolved in pull requests, and in very rare cases, in staging. This process also provides a very natural and fluid avenue for code review and discussion, ensuring that the entire team is involved and aware of everything that is happening. Code review is everyone's responsibility.

Rebase Policy

There is one hard and fast rule: if the commits have been pushed, do not rebase. Once you push a set of commits, either to the mainline repository or your own fork, you cannot rebase those commits any more. The only exception to this rule is if you have pushed a wip/ branch, in which case you are allowed to rebase and/or delete the branch as needed.

The reason for this policy is to encourage collaboration and avoid merge conflicts. Rewriting history is a lovely Git trick, but it is extremely disruptive to others if you rewrite history out from under their feet. Thus, you should only ever rebase commits which are local to your machine. Once a commit has been pushed on a non-wip/ branch, you no longer control that commit and you cannot rewrite it.

With that said, rebasing locally is highly encouraged, assuming you're fluent enough with Git to know how to use the tool. As a rule of thumb, always rebase against the branch that you initially cut your local branch from whenever you are ready to push. Thus, my workflow looks something like the following:

$ git checkout -b topic/doin-stuff
...
# hack commit hack commit hack commit hack
...
$ git fetch upstream
$ git branch -f master upstream/master
$ git rebase -i master
# squash checkpoint commits, etc
$ git push origin topic/doin-stuff

If I had based off a branch other than master, such as a topic/ branch on another fork, then obviously the branch names would be different. The basic workflow remains the same though.

Once I get beyond the last command though, everything changes. I can no longer rebase the topic/doin-stuff branch. Instead, if I need to bring in changes from another branch, or even just resolve conflicts with master, I need to use git merge. This is because someone else may have decided to start a project based on topic/doin-stuff, and I cannot just rewrite commits which they are now depending on.
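
Concretely, continuing the example above, bringing updated master into the already-pushed branch looks something like this:

$ git checkout topic/doin-stuff
$ git fetch upstream
$ git merge upstream/master
# resolve any conflicts, commit the merge
$ git push origin topic/doin-stuff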

To summarize: rebase privately, merge publicly.

Roadmap

Phase 1: Simplified Deployment

Precog was originally designed to be offered exclusively via the cloud in a multi-tenant offering. As such, it has made certain tradeoffs that make it much harder for individuals and casual users to install and maintain.

In the current roadmap, Phase 1 involves simplifying Precog to the point where there are so few moving pieces that anyone can install and launch Precog, and keep it running with nothing more than an occasional restart.

The work is currently tracked in the Simplified Precog milestone and divided into the following tickets:

Many of these tickets indirectly contribute to Phase 2, by bringing the foundations of Precog closer into alignment with HDFS.

Phase 2: Support for Big Data

Currently, Precog can only handle the amount of data that can reside on a single machine. While there are many optimizations that still need to be made (such as support for indexes, type-specific columnar compression, etc.), a bigger win with more immediate impact will be making Precog "big data-ready", where it can compete head-to-head with Hive, Pig, and other analytics options for Hadoop.

Spark is an in-memory computational framework that runs as a YARN application inside a Hadoop cluster. It can read from and write to the Hadoop file system (HDFS), and exposes a wide range of primitives for performing data processing. Several high-performance, scalable query systems have been built on Spark, such as Shark and BlinkDB.

Given that Spark's emphasis is on fast, in-memory computation, that it's written in Scala, and that it has already been used to implement several query languages, it seems an ideal target for Precog.

The work is currently divided into the following tickets:

  • Introduce a "group by" operator into the intermediate algebra
  • Refactor solve with simpler & saner semantics
  • Create a table representation based on Spark's RDD
  • Implement table ops in terms of Spark operations
  • TODO
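
As a purely illustrative sketch of the table representation and table ops tickets above (not actual Precog code, and not taken from the tickets themselves), a Spark-backed table might simply wrap an RDD and delegate operations to it:

import org.apache.spark.rdd.RDD

// Hypothetical names; Row and SparkTable are illustrative only.
final case class Row(fields: Map[String, String])

final class SparkTable(val rows: RDD[Row]) {
  // A table op ("filter") expressed directly as a Spark operation.
  def where(p: Row => Boolean): SparkTable = new SparkTable(rows.filter(p))

  // A simple single-field projection, also delegated to the underlying RDD.
  def project(field: String): RDD[Option[String]] = rows.map(_.fields.get(field))
}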

Alternate Front-Ends

Support for dynamically-typed, multi-dimensional SQL ("SQL for heterogeneous JSON"), and possibly other query interfaces such as JSONiq and UNQL.

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

Legalese

Copyright (C) 2010 - 2013 SlamData, Inc. All Rights Reserved. Precog is a registered trademark of SlamData, Inc, licensed to this open source project.

platform's People

Contributors

alissapajer, dchenbecker, dcsobral, djspiewak, gclaramunt, jdegoes, nlubchenco, precog-ubuntu, puffnfresh


platform's Issues

Shard Service Error

[ERROR] [10/07/2013 18:20:47.688] [PrecogShard-akka.actor.default-dispatcher-310] [akka.dispatch.Dispatcher] Promise already completed: akka.dispatch.DefaultPromise@103dccd9 tried to complete with Right(HttpResponse(OK ,HttpHeaders(Map(Access-Control-Allow-Origin -> *, Access-Control-Allow-Headers -> Origin,X-Requested-With,Content-Type,X-File-Name,X-File-Size,X-File-Type,X-Precog-Path,X-Precog-Service,X-Precog-Token,X-Precog-Uuid,Accept,Authorization, Access-Control-Allow-Methods -> GET,POST,OPTIONS,DELETE,PUT, Allow -> GET,POST,OPTIONS,DELETE,PUT, Content-Type -> application/json)),Some(Right(scalaz.StreamT@2313e481)),HTTP/1.1))
java.lang.IllegalStateException: Promise already completed: akka.dispatch.DefaultPromise@103dccd9 tried to complete with Right(HttpResponse(OK ,HttpHeaders(Map(Access-Control-Allow-Origin -> *, Access-Control-Allow-Headers -> Origin,X-Requested-With,Content-Type,X-File-Name,X-File-Size,X-File-Type,X-Precog-Path,X-Precog-Service,X-Precog-Token,X-Precog-Uuid,Accept,Authorization, Access-Control-Allow-Methods -> GET,POST,OPTIONS,DELETE,PUT, Allow -> GET,POST,OPTIONS,DELETE,PUT, Content-Type -> application/json)),Some(Right(scalaz.StreamT@2313e481)),HTTP/1.1))
at akka.dispatch.Promise$class.complete(Future.scala:782)
at akka.dispatch.DefaultPromise.complete(Future.scala:847)
at akka.dispatch.Future$$anonfun$recover$1.apply(Future.scala:548)
at akka.dispatch.Future$$anonfun$recover$1.apply(Future.scala:546)
at akka.dispatch.DefaultPromise.akka$dispatch$DefaultPromise$$notifyCompleted(Future.scala:943)
at akka.dispatch.DefaultPromise$$anonfun$tryComplete$1$$anonfun$apply$mcV$sp$4.apply(Future.scala:920)
at akka.dispatch.DefaultPromise$$anonfun$tryComplete$1$$anonfun$apply$mcV$sp$4.apply(Future.scala:920)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:76)
at akka.dispatch.DefaultPromise$$anonfun$tryComplete$1.apply$mcV$sp(Future.scala:920)
at akka.dispatch.Future$$anon$4$$anonfun$run$1.apply$mcV$sp(Future.scala:386)
at akka.dispatch.Future$$anon$4$$anonfun$run$1.apply(Future.scala:378)
at akka.dispatch.Future$$anon$4$$anonfun$run$1.apply(Future.scala:378)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.dispatch.Future$$anon$4.run(Future.scala:378)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:94)
at akka.jsr166y.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1381)
at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

[ERROR] [10/07/2013 18:20:59.229] [PrecogShard-akka.actor.default-dispatcher-314] [akka.dispatch.Dispatcher] Promise already completed: akka.dispatch.DefaultPromise@41a0e9b5 tried to complete with Right(HttpResponse(OK ,HttpHeaders(Map()),Some(Left(java.nio.HeapByteBuffer[pos=0 lim=744 cap=744])),HTTP/1.1))
java.lang.IllegalStateException: Promise already completed: akka.dispatch.DefaultPromise@41a0e9b5 tried to complete with Right(HttpResponse(OK ,HttpHeaders(Map()),Some(Left(java.nio.HeapByteBuffer[pos=0 lim=744 cap=744])),HTTP/1.1))
at akka.dispatch.Promise$class.complete(Future.scala:782)
at akka.dispatch.DefaultPromise.complete(Future.scala:847)
at akka.dispatch.Future$$anonfun$recover$1.apply(Future.scala:548)
at akka.dispatch.Future$$anonfun$recover$1.apply(Future.scala:546)
at akka.dispatch.DefaultPromise.akka$dispatch$DefaultPromise$$notifyCompleted(Future.scala:943)
at akka.dispatch.DefaultPromise$$anonfun$tryComplete$1$$anonfun$apply$mcV$sp$4.apply(Future.scala:920)
at akka.dispatch.DefaultPromise$$anonfun$tryComplete$1$$anonfun$apply$mcV$sp$4.apply(Future.scala:920)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:76)
at akka.dispatch.DefaultPromise$$anonfun$tryComplete$1.apply$mcV$sp(Future.scala:920)
at akka.dispatch.Future$$anon$4$$anonfun$run$1.apply$mcV$sp(Future.scala:386)
at akka.dispatch.Future$$anon$4$$anonfun$run$1.apply(Future.scala:378)
at akka.dispatch.Future$$anon$4$$anonfun$run$1.apply(Future.scala:378)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.dispatch.Future$$anon$4.run(Future.scala:378)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:94)
at akka.jsr166y.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1381)
at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Query directly from raw files

To make Precog much more accessible and user-friendly for local installs, as well as to prepare for work on a distributed version of Precog, we should allow querying directly on files which are stored in formats for which we have an input adapter, similar to how Hive and Pig handle data analysis.

This ticket is to refactor the query engine so that we are able to allow querying directly over JSON data files, CSV files and, of course, NIHDB 'files', in a file system containing a variety of file formats.

To do this, we need to define a suitable input adapter which exposes a Table-oriented view of a file format, and propagate the information necessary to use a particular adapter (e.g. for CSV files, or possibly even JSON files, the input may be ambiguous and require information such as delimiters in order to be interpreted unambiguously as a Table).
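
For illustration only, such an adapter interface might take roughly this shape (the names here are hypothetical, not part of the existing TableModule code):

// Stand-in for the engine's actual table abstraction.
trait Table

// An input adapter exposes a Table-oriented view of one file format and carries
// any format-specific configuration it needs (e.g. CSV delimiters).
trait InputAdapter {
  // Selection rule: should this adapter handle the given path at load time,
  // e.g. based on file extension or mime type?
  def handles(path: String): Boolean

  // Interpret the file at the given path as a Table.
  def load(path: String): Table
}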

Some file "formats" may in fact be directories containing many files; we should think about how to handle these.

Note that as per @nuttycom's comment, we already have JSON-backed and even JDBC-backed table adapters. The exact functionality we lack is the ability to discriminate between alternate representations at runtime based on the actual string paths passed to the table load function, as well as an architecture that makes it easy to add new input adapters and rules for selecting them during runtime loads.

This ticket will be considered complete when it is possible to create a Quirrel script that loads data from a JSON file, a CSV file, and a NIHDB file, and joins them all together; and when the associated architecture allows cleanly adding support and selection criteria for new input adapters (by defining the input adapter and describing the rules that dictate when the input adapter is used for dynamically loaded data -- e.g. when the file extension or mime type is such and such).

Unresolved dependencies

[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  ::          UNRESOLVED DEPENDENCIES         ::
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  :: com.reportgrid#blueeyes-json_2.9.2;1.0.0-M9.5: not found
[warn]  :: com.reportgrid#blueeyes-util_2.9.2;1.0.0-M9.5: not found
[warn]  :: com.reportgrid#blueeyes-core_2.9.2;1.0.0-M9.5: not found
[warn]  :: com.reportgrid#blueeyes-mongo_2.9.2;1.0.0-M9.5: not found
[warn]  :: com.reportgrid#bkka_2.9.2;1.0.0-M9.5: not found
[warn]  :: com.reportgrid#akka_testing_2.9.2;1.0.0-M9.5: not found
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::

Conversion from raw files to NihDB file format

Once the analytics engine can query across raw files (e.g. JSON), as well as NihDB files, we need the ability to convert from raw formats to NihDB, to improve performance.

This needn't be integrated into the query engine (although it could be, as some sort of caching layer for queries across raw files), but could instead exist as a separate process which a user could run -- a utility of sorts that would convert data from various formats to NihDB, where queries would run much more efficiently.

When this ticket is done, although users can query across their JSON / CSV files directly, they will be able to achieve increased performance by converting those files into NihDB format.

At minimum, a user will be able to convert files manually by running a command-line process, or through a REST API on the storage side which executes the conversion asynchronously.

The conversion should not delete the original files (users can always do that themselves).

Remove Kafka dependency

Kafka is a large, complex piece of software which requires installation and maintenance. There are many ways for Kafka to fail, and Kafka requires ongoing management in order to prevent disk overflow and make tradeoffs between recoverability and resource usage.

While Kafka is very appropriate for a large scale distributed ingest system which has to keep up with fluctuating loads and be fully redundant, it is less appropriate for a single node analytics engine like Precog. When Precog becomes distributed, the focus will be on reading data from HDFS, and not on the ingest of that data, so even long-term, the direct use of Kafka in the Precog project is an unnecessary distraction.

In order to simplify the number of moving pieces in Precog, Kafka needs to be eliminated as a dependency.

Ingest can be as simple as batching up a chunk of data and writing it out to the (abstract) file system -- e.g. appending to the relevant file.
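
As a minimal sketch of that idea (batch up records, then append them), assuming a plain local file rather than the real abstract file system:

import java.nio.file.{Files, Paths, StandardOpenOption}

// Append a batch of newline-delimited records to the file backing a path.
def appendBatch(file: String, records: Seq[String]): Unit = {
  val bytes = records.mkString("", "\n", "\n").getBytes("UTF-8")
  Files.write(Paths.get(file), bytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND)
}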

This ticket will be considered complete when Kafka is not a dependency of the project nor referenced or utilized anywhere in the source code, unit tests, or documentation.

See @nuttycom's comment below.

array flatten error

This works:

datam:=//statistics/jobs/months
datam1:=datam.Results where datam.JobGroup="Database Administrator"
finalmonth:=flatten(datam1.avgsalbyitskill) with {date:datam.QueryDate}
finalmonth

This doesn't work, and it's essentially the same data, just in a different collection:

data:=//statistics/jobs/jobgroups
data1:=data.Results where data.JobGroup="Database Administrator" & data.QueryDate="9/31/2013"
finalday:=flatten(data1.avgsalbyitskill) with {date:data.QueryDate}
finalday

This is the error I am getting:

2013-11-01 12:46:51,582 [atcher-148] E c.p.s.s.SyncQueryServiceHandler {} - Error executing shard query:
java.lang.UnsupportedOperationException: empty.max
at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:201)
at scala.collection.immutable.Set$EmptySet$.max(Set.scala:52)
at com.precog.daze.ArrayLibModule$ArrayLib$Flatten$$anonfun$apply$1$$anonfun$2.apply(ArrayLib.scala:33)
at com.precog.daze.ArrayLibModule$ArrayLib$Flatten$$anonfun$apply$1$$anonfun$2.apply(ArrayLib.scala:27)
at scalaz.StreamT$$anonfun$map$1$$anonfun$apply$58.apply(StreamT.scala:82)
at scalaz.StreamT$$anonfun$map$1$$anonfun$apply$58.apply(StreamT.scala:82)
at scalaz.StreamT$Yield$$anon$7.apply(StreamT.scala:215)
at scalaz.StreamT$$anonfun$map$1.apply(StreamT.scala:82)
at scalaz.StreamT$$anonfun$map$1.apply(StreamT.scala:82)
at scalaz.Monad$$anonfun$map$1$$anonfun$apply$1.apply(Monad.scala:14)
at akka.dispatch.Future$$anon$3.liftedTree1$1(Future.scala:195)
at akka.dispatch.Future$$anon$3.run(Future.scala:194)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:94)
at akka.jsr166y.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1381)
at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Remove MongoDB dependency

Precog currently uses MongoDB to persist metadata, such as account information, security information, and so forth. MongoDB is a database in its own right, which has to be installed and maintained. This takes a lot of time and domain knowledge.

In order to minimize the moving pieces in Precog, remove a point of failure, and simplify both initial installation of Precog and ongoing maintenance, the dependency on MongoDB needs to be completely eliminated.

We need to swap out MongoDB with a robust, pure Java embedded database (such as H2), which will make it easier to allow users to introduce redundancy in the future (by switching to a distributed SQL database).

The main criterion is that whatever we use to replace MongoDB should not require installation or maintenance of any kind; it needs to be idiot-proof, crash-proof (auto-recovery), and configuration-free.

Per @nuttycom, the main touch point in the code for the security service is ApiKeyFinder (what about for the accounts service?). The trickier bit will be refactoring and dead code elimination in the dvergr service.

Unresolved dependency: howl;1.0.1-2-precog

I got the same issue as #519, but after resolving that one, I hit another:
sbt.ResolveException: unresolved dependency: org.objectweb.howl#howl;1.0.1-2-precog: not found

For this one, I cannot find any repo that contains the file.

Mongo still supported?

Hey

Is MongoDB still supported as a database, or is the expectation now to use the Precog built-in database and use the API to get the data in?

thanks

Enable labcoat

Currently we have the shard server running and we can fetch data. Is there any documentation on how to run Labcoat against this?

start-shard.sh depends on a private s3 bucket

start-shard.sh tries to download zookeeper-3.4.3 and kafka-0.7.5 from a private s3 repo.

ZooKeeper 3.4.3 can be obtained from the Apache project website.
The Kafka 0.7.5 zip file is more problematic, as 0.7.5 is not an official release.

Missing dependencies

Hey,

I'm getting a missing dependency issue with a fresh clone and running sbt assembly.

Kafka 0.7.5 doesn't exist anywhere. Is there something I'm missing?

The specific error I'm getting is:

[error] (common/*:update) sbt.ResolveException: unresolved dependency: org.apache#kafka-core_2.9.2;0.7.5: not found

Separate ingest from query

The goal of this ticket is to ensure that ingest and query are completely decoupled and interact only through the file system, so that overall architectural complexity is minimized and other tickets may be completed more easily.

  1. Ingest must write data to a file system.
  2. Query must read data from a file system.
  3. Query and ingest may interact only through a file system.

This is mostly (entirely?) already the case, but we need to verify the extent of the separation and sever any remaining ties that exist.

Data In-memory

Is there an option we can use to put the Precog databases fully in memory?

Roadmap

Hey.

Do you guys have a roadmap of features, things still to be worked on or developed from scratch, and things that won't be implemented?

Thanks!

Merge and simplify auth / accounts

The auth and accounts services need to be merged (they are heavily dependent on each other), and their interfaces simplified.

Below is a brief account of the intended Precog security model resulting from this ticket.

Users have grants. Grants are the analogue of operations in an ACL security model.

All grants are bound to a particular file or directory; they confer permissions with respect to that resource.

  • Read -- Read contents of file / read children of directory
  • Append -- Append new contents to file / append new child in directory
  • Update -- Change contents of file / rename children
  • Execute -- Execute script / execute default script associated with directory
  • Delete -- Delete file / delete directory
  • Mount -- Mount a data source to the file / mount a data source in the directory
  • Unmount -- Unmount a data source to the file / unmount a data source in the directory

Unlike the POSIX file security model, grants are hierarchical. Currently, they are always and only hierarchical.

Grants can be used to create additional grants that have the same or reduced permissions.
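
A rough sketch of the model described above (the names are illustrative and do not correspond to the actual auth sources):

sealed trait Permission
case object Read extends Permission
case object Append extends Permission
case object Update extends Permission
case object Execute extends Permission
case object Delete extends Permission
case object Mount extends Permission
case object Unmount extends Permission

// A grant is bound to a file or directory path and applies hierarchically,
// i.e. to that path and everything beneath it.
final case class Grant(path: String, permissions: Set[Permission]) {
  // A grant can be used to create further grants with the same or reduced
  // permissions, on the same path or below it.
  def derive(subPath: String, perms: Set[Permission]): Option[Grant] =
    if (subPath.startsWith(path) && perms.subsetOf(permissions)) Some(Grant(subPath, perms))
    else None
}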

This ticket will be considered complete when the auth and accounts service have been unified into a single service, the internal logic simplified and refactored to match the above, and a clean, robust, and well-documented REST API exposed (the existing API is not unified, is inconsistent in places, is not robust, and is poorly documented).

Minimal API

GET, POST /access/users/
GET, PUT /access/users/'userId
GET /access/users/'userId/grants/
GET /access/users/'userId/grants/'grantId
GET, POST, DELETE /access/users/'userId/shares/_byusers/'user
GET, POST, DELETE /access/users/'userId/shares/_bypaths/'path
GET, POST, DELETE /access/users/'userId/shares/_byperms/'perm

Simplify file system model

This ticket is to dramatically simplify the file system in a way that is simpler to maintain and is friendly to other tickets and future directions for Precog.

The capabilities of the file system interface must be the intersection of capabilities in a local file system and the HDFS file system used in Hadoop.

Ingest should leverage a write-oriented view of a file system to store data, while query should leverage a read-oriented view of a file system to read data.

The file system model does not need to support any versioning, nor does it need to support atomicity beyond that provided by the intersection of capabilities noted above. The surface area of the file system model should be minimal and limited to whatever is needed to implement the functionality we need now.

Although it's not necessary to completely formalize it at this point, in general, a distributed file system will be capable of executing a subset of operations that are expressible in the DAG that describes a query; and this subset will depend on the exact nature of the file system (e.g. HDFS, Tachyon, Ceph, etc.), as well as the path at which the data is being accessed (in the case of file systems that support mounting). Even a local system that has compact encoding for some file types might support pushing down operations such as "projection" (for a column-oriented file format) or "filtering" (for an indexed view of data). In fact, we could implement a layered file system approach where a NIHDB-encoded file would be handled by a NIHDB file system capable of efficiently handling operations for which acceleration is possible.

Read View

  • read file
  • list children
  • retrieve size

Write View

  • create file (empty)
  • create file (with contents)
  • append file
  • delete file
  • rename file

Care should be taken when defining these interfaces so that good implementations are possible on MongoDB and other systems to which Precog might be ported in the future.
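
For illustration only, the two views listed above might be captured by interfaces along these lines (the signatures are hypothetical, not the actual yggdrasil interfaces):

import java.io.InputStream

trait ReadView {
  def read(path: String): InputStream       // read file
  def children(path: String): List[String]  // list children
  def size(path: String): Long              // retrieve size
}

trait WriteView {
  def createEmpty(path: String): Unit                    // create file (empty)
  def create(path: String, contents: Array[Byte]): Unit  // create file (with contents)
  def append(path: String, contents: Array[Byte]): Unit  // append to file
  def delete(path: String): Unit                         // delete file
  def rename(from: String, to: String): Unit             // rename file
}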

See the following for Hadoop's file system:

And the following for Apache Commons VFS:
http://commons.apache.org/proper/commons-vfs/

This ticket will be considered complete when the internal file system model has been simplified to resemble HDFS / Apache Common FS, when the REST API for the file system supports the semantics of the file system model, and when there is ample documentation and tests for all of the above.

See here for more documentation on the REST API for the exposed file system: https://docs.google.com/document/d/1j43rvBNPvV7sDpO5l9vUXqtO9IPO-oWT2_tF8fMJEt0/edit?usp=sharing

Comment on the ticket for clarification.

Finalize cached queries support

Cached queries are implemented in Precog master, but may have bugs, have not been tested extensively, and their APIs likely differ from the required format.

The following document contains the "analysis" API which is meant to replace the old query APIs, and form the basis of cached queries:

https://docs.google.com/document/d/1j43rvBNPvV7sDpO5l9vUXqtO9IPO-oWT2_tF8fMJEt0/edit?usp=sharing

This ticket will be considered complete when these APIs are implemented, tested, and thoroughly documented.

Indexing needed

Our data is getting to around 200 GB and our query times are going up quite substantially. It seems that, with no indexing, it is doing full scans of the data. This is a pretty high priority for us, since our data will begin to grow at 5 GB/day. Thanks.

pathname in datastore cooked files

Adding additional shard services to a current precog instance requires copying over the shard-data path. However, since the original pathname is stored in the cooked files, the new shard service will not be able to query the new (copied) shard-data path. Here is the config:

queryExecutor {
  systemId = "dev2"
  precog {
    storage {
      root = /home/precog/work/shard2-data/
    }
  }
}

Here is the error log:

2013-10-17 12:05:15,061 [patcher-13] E c.p.s.s.SyncQueryServiceHandler {} - Error executing shard query:
java.io.FileNotFoundException: /home/precog/work/shard-data/data/0000000066/statistics/jobs/jobgroups/perAuthProjections/b35a170be4ef15e078a741233a0f4245854c9bb3/cooked_blocks/segment-0--580161124-Decimal6211499993968270611.cooked (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.(FileInputStream.java:137)
at com.precog.niflheim.CookedReader.com$precog$niflheim$CookedReader$$read(CookedReader.scala:32)
at com.precog.niflheim.CookedReader$$anonfun$load$1$$anonfun$apply$15$$anonfun$9.apply(CookedReader.scala:119)
at com.precog.niflheim.CookedReader$$anonfun$load$1$$anonfun$apply$15$$anonfun$9.apply(CookedReader.scala:118)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:76)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
at scala.collection.immutable.List.map(List.scala:76)
at com.precog.niflheim.CookedReader$$anonfun$load$1$$anonfun$apply$15.apply(CookedReader.scala:118)
at com.precog.niflheim.CookedReader$$anonfun$load$1$$anonfun$apply$15.apply(CookedReader.scala:117)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:76)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
at scala.collection.immutable.List.map(List.scala:76)
at com.precog.niflheim.CookedReader$$anonfun$load$1.apply(CookedReader.scala:117)
at com.precog.niflheim.CookedReader$$anonfun$load$1.apply(CookedReader.scala:116)
at scalaz.Validation$class.flatMap(Validation.scala:141)
at scalaz.Success.flatMap(Validation.scala:329)
at com.precog.niflheim.CookedReader.load(CookedReader.scala:116)
at com.precog.niflheim.CookedReader$$anonfun$6.apply(CookedReader.scala:75)
at com.precog.niflheim.CookedReader$$anonfun$6.apply(CookedReader.scala:74)
at scala.Option.map(Option.scala:133)
at com.precog.niflheim.CookedReader.snapshotRef(CookedReader.scala:74)
at com.precog.niflheim.NIHDBSnapshot$$anonfun$getBlockAfter$1.apply(NIHDBSnapshot.scala:54)
at com.precog.niflheim.NIHDBSnapshot$$anonfun$getBlockAfter$1.apply(NIHDBSnapshot.scala:53)
at scala.Option.map(Option.scala:133)
at com.precog.niflheim.NIHDBSnapshot$class.getBlockAfter(NIHDBSnapshot.scala:53)
at com.precog.niflheim.NIHDBSnapshot$$anon$1.getBlockAfter(NIHDBSnapshot.scala:18)
at com.precog.niflheim.NIHDB$$anonfun$getBlockAfter$1.apply(NIHDBActor.scala:75)
at com.precog.niflheim.NIHDB$$anonfun$getBlockAfter$1.apply(NIHDBActor.scala:75)
at akka.dispatch.Future$$anonfun$map$1.liftedTree3$1(Future.scala:625)
at akka.dispatch.Future$$anonfun$map$1.apply(Future.scala:624)
at akka.dispatch.Future$$anonfun$map$1.apply(Future.scala:621)
at akka.dispatch.DefaultPromise.akka$dispatch$DefaultPromise$$notifyCompleted(Future.scala:943)
at akka.dispatch.DefaultPromise$$anonfun$tryComplete$1$$anonfun$apply$mcV$sp$4.apply(Future.scala:920)
at akka.dispatch.DefaultPromise$$anonfun$tryComplete$1$$anonfun$apply$mcV$sp$4.apply(Future.scala:920)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:76)
at akka.dispatch.DefaultPromise$$anonfun$tryComplete$1.apply$mcV$sp(Future.scala:920)
at akka.dispatch.Future$$anon$4$$anonfun$run$1.apply$mcV$sp(Future.scala:386)
at akka.dispatch.Future$$anon$4$$anonfun$run$1.apply(Future.scala:378)
at akka.dispatch.Future$$anon$4$$anonfun$run$1.apply(Future.scala:378)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.dispatch.Future$$anon$4.run(Future.scala:378)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:94)
at akka.jsr166y.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1381)
at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Nondeterministic failure in yggdrasil NIHDB specs

[error] x Properly convert raw blocks to cooked
[error]    A counter-example is '-847538703' (after 0 try)
[error]    '0' is not equal to '1' (NIHDBProjectionSpecs.scala:170)
[info]  
[error] ! step error
[error]   FileNotFoundException: File does not exist: /tmp/nihdbspecs1378877010068-0/nihdbspec003/txLog.lock (FileUtils.java:1653)
[error] org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2275)
[error] org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
[error] org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
[error] org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270)
[error] org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
[error] org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
[error] com.precog.util.IOUtils$$anonfun$recursiveDelete$1.apply(IOUtils.scala:69)
[error] com.precog.util.IOUtils$$anonfun$recursiveDelete$1.apply(IOUtils.scala:68)
[error] com.precog.yggdrasil.nihdb.NIHDBProjectionSpecs.shutdown(NIHDBProjectionSpecs.scala:200)
[error] com.precog.yggdrasil.nihdb.NIHDBProjectionSpecs$$anonfun$map$1.apply$mcV$sp(NIHDBProjectionSpecs.scala:204)
[error] com.precog.yggdrasil.nihdb.NIHDBProjectionSpecs$$anonfun$map$1.apply(NIHDBProjectionSpecs.scala:204)
[error] com.precog.yggdrasil.nihdb.NIHDBProjectionSpecs$$anonfun$map$1.apply(NIHDBProjectionSpecs.scala:204)

Single process server

To make it as simple as possible to run Precog, we need to bundle Precog and Labcoat and all required dependencies into a fully self-contained package suitable for distribution, run all Precog services in a single process using a single port, and launch Precog and Labcoat with a single command.

Configuration options for this single process server should be kept extremely minimal, and every option must have a sensible default which works out of the box on all supported platforms (Mac, Linux, Windows).

Among the possible options:

  • The port to run on. Could default to something like 7777.
  • The home directory for the file system (if local file system is being used). Could default to something like ./data/.
  • The home directory for temporary files. Could default to /tmp/ or ./tmp/.
  • The directory for accounts/security/etc. metadata (location of H2 database?). Could default to something like ./meta/.

With no external dependencies and simple configuration options that all have sensible defaults, it will be possible for average and casual users to maintain Precog, and many more people to try Precog out without having to master a half dozen other technologies (kafka, zookeeper, haproxy, mongodb, httpd, etc.).

This ticket will be considered complete when the following is possible:

  • Run an sbt task to build the new standalone release from scratch (both Precog and Labcoat)
  • cd into the standalone release directory
  • Run precog or precog.bat scripts depending on OS (Mac/Linux or Windows)
  • If no command-line arguments are specified, the script launches Precog server in a single process and port (if it is not already running), and launches Labcoat configured to point to the newly-launched Precog server
  • In addition to the default action of starting Precog and launching Labcoat, the scripts support the commands 'stop', 'start', and 'restart', which stop, start, and restart the Precog server (respectively), as well as a 'launch' command which launches Labcoat, and a 'status' command which shows whether or not Precog server is running.

The standalone release will be the release version that's pre-built and distributed online for users who don't want to build Precog / Labcoat from scratch. Therefore, it's essential that it be bullet-proof and "just work" out of the box with no tweaking, configuration, or additional external dependencies.

This ticket should not be completed until most of the other tickets in the Simplified Precog milestone have been completed.

Remove Zookeeper dependency

Zookeeper is a large, complex piece of software which requires installation and maintenance. There are many ways for Zookeeper to fail, and Zookeeper requires ongoing management in order to ensure continuous uptime.

While Zookeeper is very appropriate for a large scale distributed storage and analysis system, it is less appropriate for a single node analytics engine like Precog. The dependency just needlessly complicates the architecture, adds more points of failure, results in more lines of code that have to be maintained, and makes Precog more difficult for non-IT wizards to deploy.

In order to simplify the number of moving pieces in Precog, Zookeeper needs to be eliminated as a dependency. Once Kafka is removed (see #524), extraction of Zookeeper will be trivial as Zookeeper is used exclusively for tracking offsets.
