This project forked from apache/griffin

Mirror of Apache griffin (Incubating)

License: Other

Apache Griffin

License: Apache 2.0

Apache Griffin is a model-driven data quality solution for modern data systems. It provides a standard process for defining data quality measures, executing them, and reporting the results, along with a unified dashboard across multiple data systems.

Getting Started

You can try Griffin in Docker by following the Docker guide.

To run Griffin locally, follow the instructions below.

Prerequisites

Install the following items:

  • JDK (1.8 or later).
  • MySQL.
  • npm (version 6.0.0+).
  • Hadoop (2.6.0 or later), you can get some help here.
  • Spark (version 1.6.x; Griffin does not currently support 2.0.x). If you want to install a pseudo-distributed/single-node cluster, you can get some help here.
  • Hive (version 1.2.1 or later), you can get some help here. Make sure that your Spark cluster can access your HiveContext.
  • Livy, you can get some help here. Griffin needs to schedule Spark jobs from the server, so we use Livy to submit them. To work around some Livy issues with HiveContext, download the following 3 files and put them into HDFS.
    datanucleus-api-jdo-3.2.6.jar
    datanucleus-core-3.2.10.jar
    datanucleus-rdbms-3.2.9.jar
    
  • ElasticSearch. ElasticSearch works as the metrics collector: Griffin writes metrics to it, and the default UI reads metrics from it. You can use your own approach as well.

Configuration

Create the database 'quartz' in MySQL:

mysql -u username -e "create database quartz" -p

Initialize the quartz tables in MySQL with service/src/main/resources/Init_quartz.sql:

mysql -u username -p quartz < service/src/main/resources/Init_quartz.sql

You should also adjust the following Griffin configuration files for your environment.

  • service/src/main/resources/application.properties

    # mysql
    spring.datasource.url = jdbc:mysql://<your IP>:3306/quartz?autoReconnect=true&useSSL=false
    spring.datasource.username = <user name>
    spring.datasource.password = <password>
    
    # hive
    hive.metastore.uris = thrift://<your IP>:9083
    hive.metastore.dbname = <hive database name>    # default is "default"
    
    # external properties directory location, ignore it if not required
    external.config.location =
    
    # login strategy, default is "default"
    login.strategy = <default or ldap>
    
    # ldap properties, ignore them if ldap is not enabled
    ldap.url = ldap://hostname:port
    ldap.email = @example.com
    ldap.searchBase = DC=org,DC=example
    ldap.searchPattern = (sAMAccountName={0})
    
    # hdfs, ignore it if you do not need predicate job
    fs.defaultFS = hdfs://<hdfs-default-name>
    
    # elasticsearch
    elasticsearch.host = <your IP>
    elasticsearch.port = <your elasticsearch rest port>
    # authentication properties, uncomment if basic authentication is enabled
    # elasticsearch.user = user
    # elasticsearch.password = password
    
  • measure/src/main/resources/env.json

    "persist": [
        ...
        {
            "type": "http",
            "config": {
                "method": "post",
                "api": "http://<your ES IP>:<ES rest port>/griffin/accuracy"
            }
        }
    ]
    

    Put the modified env.json file into HDFS.

  • service/src/main/resources/sparkJob.properties

    sparkJob.file = hdfs://<griffin measure path>/griffin-measure.jar
    sparkJob.args_1 = hdfs://<griffin env path>/env.json
    
    sparkJob.jars = hdfs://<datanucleus path>/spark-avro_2.11-2.0.1.jar\
        hdfs://<datanucleus path>/datanucleus-api-jdo-3.2.6.jar\
        hdfs://<datanucleus path>/datanucleus-core-3.2.10.jar\
        hdfs://<datanucleus path>/datanucleus-rdbms-3.2.9.jar
        
    spark.yarn.dist.files = hdfs:///<spark conf path>/hive-site.xml
    
    livy.uri = http://<your IP>:8998/batches
    spark.uri = http://<your IP>:8088
    
    • <griffin measure path> is the HDFS location where you should put the jar file of the measure module.
    • <griffin env path> is the HDFS location where you should put the env.json file.
    • <datanucleus path> is the HDFS location where you should put the 3 datanucleus jar files needed by Livy, plus the spark-avro jar file if you need it.
    • <spark conf path> is the location of the Spark conf directory.
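
Before deploying, it can help to confirm that no template placeholder such as `<your IP>` or `<griffin env path>` is left in the edited files. The following is a minimal sketch, not part of Griffin: the helper name `find_placeholders` is hypothetical, and it assumes every placeholder has the form `<some text>` on a single line.

```python
import re

# Assumption: template placeholders look like <your IP> or <griffin env path>.
PLACEHOLDER = re.compile(r"<[^<>\n]+>")


def find_placeholders(text):
    """Return (line_number, placeholder) pairs for placeholders left in `text`."""
    found = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # whole-line comments (as in .properties files) are ignored
        for match in PLACEHOLDER.finditer(line):
            found.append((number, match.group(0)))
    return found
```

To use it, read a file such as service/src/main/resources/application.properties into a string, pass it to `find_placeholders`, and fix every line it reports before building.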

Build and Run

Build the whole project and deploy it (npm must be installed).

mvn clean install

Put the jar file of the measure module into <griffin measure path> in HDFS.

cp measure/target/measure-<version>-incubating-SNAPSHOT.jar measure/target/griffin-measure.jar
hdfs dfs -put measure/target/griffin-measure.jar <griffin measure path>/

After all environment services have started, start the Griffin server.

java -jar service/target/service.jar

After a few seconds, you can visit Griffin's default UI (Spring Boot listens on port 8080 by default).

http://<your IP>:8080
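
Because the service takes a few seconds to come up, a small polling script can tell you when the UI is reachable. This is a sketch under stated assumptions, not a Griffin tool: the function name `wait_for_ui` and its defaults are hypothetical, and it assumes the root URL answers with HTTP 200 once the server is ready.

```python
import time
import urllib.error
import urllib.request


def wait_for_ui(url, attempts=10, delay=3.0):
    """Return True once `url` answers with HTTP 200, False if every attempt fails."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Not reachable yet: connection refused, timeout, or an HTTP error.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before the next attempt
    return False
```

For example, `wait_for_ui("http://localhost:8080")` polls a service running on the local machine; replace the host with your own IP as needed.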

You can use the UI by following the steps here.

Note: The front-end UI is still under development; only some basic features are currently available.

Community

You can contact us via email: [email protected]

You can also subscribe to this mailing list by sending an email here.

You can access our JIRA issues page here.

Contributing

See Contributing Guide for details on how to contribute code, documentation, etc.
