
zinggai / zingg

891 stars · 18 watchers · 109 forks · 448.76 MB

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

License: GNU Affero General Public License v3.0

Java 76.31% Scala 1.42% Shell 0.29% Dockerfile 0.06% FreeMarker 0.01% Python 13.13% Makefile 0.08% Batchfile 0.09% HTML 8.62%
fuzzymatch fuzzy-matching deduplication dedupe masterdata dataengineering data-transformation analytics-engineering entity-resolution identity-resolution

Zingg's Introduction

The 0.4.0 release of Zingg is out!

The Problem

Real world data contains multiple records belonging to the same customer. These records can live in a single system or across multiple systems, and they have variations across fields, which makes it hard to combine them, especially with growing data volumes. This hurts customer analytics: establishing lifetime value, loyalty programs, or marketing channels is impossible when the base data is not linked. No AI algorithm for segmentation can produce the right results when there are multiple copies of the same customer lurking in the data. No warehouse can live up to its promise if the dimension tables have duplicates.

[Image: Zingg - Data Silos]

With a modern data stack and DataOps, we have established patterns for the E and L in ELT for building data warehouses, data lakes and delta lakes. However, the T, getting data ready for analytics, still needs a lot of effort. Modern tools like dbt are actively and successfully addressing this. What is also needed is a quick and scalable way to build a single source of truth for core business entities post extraction and pre or post loading.

With Zingg, the analytics engineer and the data scientist can quickly integrate data silos and build unified views at scale!

[Image: Zingg - Data Mastering At Scale with ML]

Besides probabilistic (fuzzy) matching, Zingg also does deterministic matching, which is useful in identity resolution and householding applications.

[Image: Zingg Deterministic Matching]

Why Zingg

Zingg is an ML based tool for entity resolution. The following features set Zingg apart from other tools and libraries:

  • Ability to handle any entity like customer, patient, supplier, product, etc.
  • Ability to connect to disparate data sources: local and cloud file systems in any format; enterprise applications; and relational, NoSQL and cloud databases and warehouses
  • Ability to scale to large volumes of data. See why this is important and Zingg performance numbers
  • Interactive training data builder using active learning that builds highly accurate models from frugally small training samples. It shows record pairs and asks the user to mark yes, no, or can't say on the CLI.
  • Ability to define domain specific functions to improve matching
  • Out of the box support for English as well as Chinese, Thai, Japanese, Hindi and other languages

Zingg is useful for

  • Building unified and trusted views of customers and suppliers across multiple systems
  • Large Scale Entity Resolution for AML, KYC and other fraud and compliance scenarios
  • Deduplication and data quality
  • Identity Resolution
  • Integrating data silos during mergers and acquisitions
  • Data enrichment from external sources
  • Establishing customer households

The Story

What is the backstory behind Zingg?

Documentation

Check the detailed Zingg documentation

Community

Be part of the conversation in the Zingg Community Slack

People behind Zingg

Zingg is being developed by the Zingg.AI team. If you need custom help with or around Zingg, let us know.

Demo

See Zingg in action here

Getting Started

The easiest way to get started with Zingg is through Docker and by running the prebuilt models.

docker pull zingg/zingg:0.4.0
docker run -it zingg/zingg:0.4.0 bash
./scripts/zingg.sh --phase match --conf examples/febrl/config.json

Check the step by step guide for more details.

Connectors

Zingg connects, reads and writes to most on-premise and cloud data sources. Zingg runs on any private or cloud based Spark service.

[Image: Zingg connectors]

Zingg can read and write to Snowflake, Cassandra, S3, Azure, Elastic, major RDBMSs and any Spark-supported data source. Zingg also works with all major file formats including Parquet, Avro, JSON, XLSX, CSV and TSV. This is done through the Zingg pipe abstraction.
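As a sketch of what pipes look like, here is a minimal config fragment modeled on the pipe definitions in the user config shown later on this page; the names, file locations and the Parquet output format are illustrative assumptions:

"data": [{
    "name": "customers",
    "format": "csv",
    "props": {
        "location": "examples/febrl/test.csv",
        "delimiter": ",",
        "header": true
    }
}],
"output": [{
    "name": "unified",
    "format": "parquet",
    "props": {
        "location": "/tmp/zinggOutput"
    }
}]

Retargeting Zingg at a different source or sink is then a matter of swapping the pipe's format and props.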

Key Zingg Concepts

Zingg learns two models on the data:

  1. Blocking Model

One fundamental problem with scaling data mastering is that the number of comparisons increases quadratically as the number of input records increases.

[Image: Data Mastering At Scale]

Zingg learns a clustering/blocking model which indexes near-similar records. This means that Zingg does not compare every record with every other record. Typical Zingg comparisons are 0.05-1% of the possible problem space; a worked example follows below.

  2. Similarity Model

The similarity model helps Zingg predict which record pairs match. Similarity is run only on records within the same block/cluster to scale the problem to larger datasets. The similarity model is a classifier which predicts similarity between records that are not exactly the same, but could belong together.
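To make the blocking gain concrete, here is an illustrative calculation (1 million records is an assumed example, not a benchmark): for n input records, naive matching scores n(n - 1) / 2 pairs. With n = 1,000,000, that is about 5 × 10^11 pairs. At 0.05-1% of the possible space, blocking cuts this to roughly 2.5 × 10^8 to 5 × 10^9 comparisons.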

[Image: Fuzzy matching comparisons]

To build these models, training data is needed. Zingg comes with an interactive learner to rapidly build training sets.

[Image: The interactive labeler shows record pairs and asks the user to mark yes, no, or can't say on the CLI]
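A typical end-to-end run, sketched with the phases referenced throughout this page (paths follow the Getting Started example; the 30-40 pair target comes from the labeling guidance further down):

./scripts/zingg.sh --phase findTrainingData --conf examples/febrl/config.json
./scripts/zingg.sh --phase label --conf examples/febrl/config.json
# repeat findTrainingData and label until roughly 30-40 matching pairs are marked
./scripts/zingg.sh --phase train --conf examples/febrl/config.json
./scripts/zingg.sh --phase match --conf examples/febrl/config.json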

Pretrained models

Zingg comes with pretrained models for the Febrl dataset under the models folder.

Reporting bugs and contributing

Want to report a bug or request a feature? Let us know on Slack, or open an issue.

Want to commit code? Please check the contributing documentation.

Book Office Hours

If you want to schedule a 30-min call with our team to help you understand if Zingg is the right technology for your problem, please book a slot here. For troubleshooting and Zingg issues, please report the problem as an issue on GitHub.

Asking questions on running Zingg

If you have a question or issue while using Zingg, please log a question and we will reply quickly :-) You can also use Slack.

License

Zingg is licensed under AGPL v3.0, which means you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.

AGPL allows unrestricted use of Zingg by end users, solution builders and partners. We strongly encourage solution builders to create custom solutions for their clients using Zingg.

Need a different license? Write to us.

Acknowledgements

Zingg would not have been possible without the excellent work of the open source projects it builds on.

Zingg's People

Contributors

abhay447, aditya-r-chakole, akash-r-7, bagotia16, chetan453, dependabot[bot], edmondop, gnanaprakash-ravi, jgransac, manan-s0ni, morazow, navinrathore, ravirajbaraiya, semyonsinchenko, shefalika-thapa, siddharth2798, sonalgoyal, vikasgupta78


Zingg's Issues

Add case studies

Now that the documentation is in better, more consumable shape, we should add the case studies.

Issue on running --phase train

WARN org.apache.spark.ml.util.Instrumentation - [7dacb264] All labels are the same value and fitIntercept=true, so the coefficients will be zeros. Training is not needed.
INFO org.apache.spark.ml.util.Instrumentation - [aa3fb364] training finished
INFO org.apache.spark.ml.util.Instrumentation - [e20e40e0] training finished
ERROR org.apache.spark.ml.util.Instrumentation - org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)

WARN zingg.client.util.Email - Unable to send email Can't send command to SMTP host
WARN zingg.client.Client - Apologies for this message. Zingg has encountered an error. Exception thrown in awaitResult:

findTrainingData and Label completed

Get output using our input file

Hi Sonalgoyal,

In the test.csv, the first column, called REC ID, already says whether a record is Original or Duplicate. In my case, I have an input file that contains first name, last name and other columns similar to those in test.csv. How do I use Zingg to find the duplicates? Is there a pre-processing step I am missing?

Add documentation on scoring

How similarity works, and how to interpret the scores. I already have this written up in an email explaining it to a customer; it needs to be put into shape and added to this repo.

Client argument list formatting

Describe the bug
Hello! I'm testing this out on Azure Databricks. I believe my job spec is correct:

{
    "settings": {
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "spark_conf": {
                "spark.databricks.delta.preview.enabled": "true"
            },
            "azure_attributes": {
                "availability": "ON_DEMAND_AZURE",
                "first_on_demand": 1,
                "spot_bid_max_price": -1
            },
            "node_type_id": "Standard_DS3_v2",
            "enable_elastic_disk": true,
            "num_workers": 5
        },
        "libraries": [
            {
                "jar": "dbfs:/FileStore/jars/960f43e3_3c55_4c5b_a296_e69bc80ed1a6-zingg_0_3_0_SNAPSHOT-aa6ea.jar"
            },
            {
                "maven": {
                    "coordinates": "org.jsoup:jsoup:1.7.2"
                }
            }
        ],
        "spark_jar_task": {
            "main_class_name": "zingg.client.Client",
            "parameters": [
                "--phase",
                "findTrainingData",
                "--conf",
                "dbfs:/Bilbro/examples/zinggconfig.json"
            ],
            "run_as_repl": true
        },
        "email_notifications": {},
        "name": "zinggtest",
        "max_concurrent_runs": 1
    }
}

But the main class does not appear to accept a list of strings. I receive this error:

21/09/22 20:41:51 command--1:1: error: type mismatch;
 found   : Array[String]
 required: String
zingg.client.Client.main(Array("--phase","findTrainingData","--conf","dbfs:/Bilbro/examples/zinggconfig.json"))
                              ^

Zingg setup issue

I installed Spark and the JDK on my local machine. When I try to run the following script, I get an error:
./scripts/zingg.sh --phase match --conf examples/febrl/config.json
Error message:
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.unsafe.array.ByteArrayMethods.<clinit>(ByteArrayMethods.java:54)
at org.apache.spark.internal.config.package$.<init>(package.scala:1095)
at org.apache.spark.internal.config.package$.<clinit>(package.scala)
at org.apache.spark.deploy.SparkSubmitArguments.$anonfun$loadEnvironmentArguments$3(SparkSubmitArguments.scala:157)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:157)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:115)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$3.<init>(SparkSubmit.scala:1022)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:1022)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:85)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make private java.nio.DirectByteBuffer(long,int) accessible: module java.base does not "opens java.nio" to unnamed module @755c9148
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
at java.base/java.lang.reflect.Constructor.checkCanSetAccessible(Constructor.java:188)
at java.base/java.lang.reflect.Constructor.setAccessible(Constructor.java:181)
at org.apache.spark.unsafe.Platform.<clinit>(Platform.java:56)

NullPointerException during match phase

Hi, I hit a NullPointerException running Zingg in the match phase. I had previously run the findTrainingData / label / train phases successfully. The error occurs early in the run (after ~1 minute):

I have no name!@e036d492b95b:/zingg-0.3.0-SNAPSHOT$ time ./scripts/zingg.sh --phase match --conf /tmp/research/tsmart_config.json
2021-11-04 14:52:31,814 [main] WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 2021-11-04 14:52:32,064 [main] INFO  zingg.client.Client -
 2021-11-04 14:52:32,065 [main] INFO  zingg.client.Client - ********************************************************
 2021-11-04 14:52:32,066 [main] INFO  zingg.client.Client - *                    Zingg AI                           *
 2021-11-04 14:52:32,067 [main] INFO  zingg.client.Client - *               (C) 2021 Zingg.AI                       *
 2021-11-04 14:52:32,068 [main] INFO  zingg.client.Client - ********************************************************
 2021-11-04 14:52:32,069 [main] INFO  zingg.client.Client -
 2021-11-04 14:52:32,070 [main] INFO  zingg.client.Client - using: Zingg v0.3
 2021-11-04 14:52:32,070 [main] INFO  zingg.client.Client -
 2021-11-04 14:52:32,289 [main] WARN  zingg.client.Arguments - Config Argument is /tmp/research/tsmart_config.json
 2021-11-04 14:52:32,564 [main] WARN  zingg.client.Arguments - phase is match
 2021-11-04 14:52:35,660 [main] WARN  org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry - The function round replaced a previously registered function.
 2021-11-04 14:52:35,660 [main] INFO  zingg.ZinggBase - Start reading internal configurations and functions
 2021-11-04 14:52:35,675 [main] INFO  zingg.ZinggBase - Finished reading internal configurations and functions
 2021-11-04 14:52:35,728 [main] WARN  zingg.util.PipeUtil - Reading input csv
 2021-11-04 14:53:13,325 [main] INFO  zingg.Matcher - Read 932474
 2021-11-04 14:53:13,414 [main] INFO  zingg.Matcher - Blocked
 2021-11-04 14:53:13,918 [main] WARN  org.apache.spark.sql.catalyst.util.package - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
 2021-11-04 14:53:16,745 [Executor task launch worker for task 30] ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 8.0 (TID 30)
 java.lang.NullPointerException
        at zingg.block.Block$BlockFunction.call(Block.java:403)
        at zingg.block.Block$BlockFunction.call(Block.java:393)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
...

Several instances of the exception are reported; this is the first one.

My config file is:

{
  "fieldDefinition":[
    {
      "fieldName" : "id",
      "matchType" : "DONT_USE",
      "fields" : "voterbase_id"
    },
    {
      "fieldName" : "fname",
      "matchType" : "fuzzy",
      "fields" : "vb_tsmart_first_name",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "mname",
      "matchType" : "fuzzy",
      "fields" : "vb_tsmart_middle_name",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "lname",
      "matchType" : "fuzzy",
      "fields" : "vb_tsmart_last_name",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "address1",
      "matchType": "fuzzy",
      "fields" : "vb_tsmart_full_address",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "city",
      "matchType": "fuzzy",
      "fields" : "vb_tsmart_city",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "state",
      "matchType": "exact",
      "fields" : "vb_tsmart_state",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "zip",
      "matchType": "exact",
      "fields" : "vb_tsmart_zip",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "homephone",
      "matchType": "exact",
      "fields" : "vb_voterbase_phone",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "cellphone",
      "matchType": "exact",
      "fields" : "vb_voterbase_phone_wireless",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "dob",
      "matchType": "fuzzy",
      "fields" : "vb_voterbase_dob",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "gender",
      "matchType": "fuzzy",
      "fields" : "vb_voterbase_gender",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "tier",
      "matchType": "DONT_USE",
      "fields" : "tier",
      "dataType": "\"string\""
    }
    ],
    "output" : [{
      "name":"output",
      "format":"csv",
      "props": {
        "location": "/tmp/zinggOutput",
        "delimiter": ",",
        "header":true
      }
    }],
    "data" : [{
      "name":"test",
      "format":"csv",
      "props": {
        "location": "/tmp/research/tmp/dupes/combined.csv",
        "delimiter": ",",
        "header":false
      },
      "schema":
        "{\"type\" : \"struct\",
        \"fields\" : [
          {\"name\":\"id\", \"type\":\"string\", \"nullable\":false},
          {\"name\":\"fname\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"mname\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"lname\",\"type\":\"string\",\"nullable\":true} ,
          {\"name\":\"address1\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"city\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"state\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"zip\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"homephone\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"cellphone\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"dob\",\"type\":\"string\",\"nullable\":true} ,
          {\"name\":\"gender\",\"type\":\"string\",\"nullable\":true}
        ]
      }"
    }],
    "labelDataSampleSize" : 0.1,
    "numPartitions":4,
    "modelId": 100,
    "zinggDir": "/tmp/research/tmp/models"
}

I am running Zingg via Docker. I built my image using the release tar.gz from GitHub combined with a slightly altered Dockerfile:

FROM docker.io/bitnami/spark:3.0.2
WORKDIR /
ADD assembly/target/zingg-0.3.0-SNAPSHOT-spark-3.0.3.tar.gz .
WORKDIR /zingg-0.3.0-SNAPSHOT
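For reference, building and entering such an image would look roughly like this (the zingg-local tag is an arbitrary example):

docker build -t zingg-local .
docker run -it zingg-local bash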

I am not able to include a data sample since it includes PII, but it has fields:

voterbase_id
vb_tsmart_first_name
vb_tsmart_middle_name
vb_tsmart_last_name
vb_tsmart_full_address
vb_tsmart_city
vb_tsmart_state
vb_tsmart_zip
vb_voterbase_phone
vb_voterbase_phone_wireless
vb_voterbase_dob
vb_voterbase_gender
tier

Linking phase matches from the same dataset

During the linking phase, matching and comparison are done on rows from the same dataset.

**To Reproduce**

  1. Run the findTrainingData, label, train, and link phases.
  2. During the label phase, rows from the same dataset are displayed for labeling.
  3. The resulting output has matching pairs with the same z_source.

**Expected behavior**
I expect the labeling phase to only show pairs of rows with different z_source values, and I expect the output to likewise only contain pairs with different z_source values.

Add makefiles and targets to compile and install the code.

Is your feature request related to a problem? Please describe.
Currently, the user has to manually set up the environment to compile and install Zingg. Instead, we should add make targets that handle all the prerequisites and setup.

zingg-0.3.0-SNAPSHOT.jar is not available

I tried to run the Zingg utility as per the steps shown in the README, but the zingg-0.3.0-SNAPSHOT.jar referred to in zingg.sh is not present in the published code. Kindly provide it.

SQL based blocking and distance functions

What if we could take SQL from, say, a dbt model or elsewhere and use that for our model training, blocking as well as similarity? Then non-Java programmers could also code and customize Zingg without bothering about the internals.
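For illustration only, such a model might expose a blocking key along these lines (hypothetical SQL; nothing in Zingg consumes this today):

-- phonetic key on last name groups likely-duplicate records into blocks
SELECT *,
       soundex(last_name) AS block_key
FROM customers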

Anonymous Telemetry

Understanding Zingg usage in an anonymous way (number of attributes, size of data, pipes, number of runs, time taken for each run) will help us build the right features. This should be absolutely anonymous, with no actual data leaving the user's environment, and the user should have the ability to switch it off at any time.

Clean up zingg.sh script

We need to revisit the zingg.sh script and remove the elastic, license and email options, as well as SPARK_MEM and similar settings.

Add graphic to denote that the findTrainingData and label phases have to be repeated

I see. The pairs to be labeled are found by looking through the entire dataset sample and do not depend on where the records are placed. How many times did you run the findTrainingData and the label phases? You can start off from where you left off until you have 30-40 pairs of actual matches.

Also, if you are not finding many matches in the label phase, and if the findTrainingData phase is fast enough on the hardware you have deployed, you can change labelDataSampleSize to 0.2 so it will look at twice the sample to fetch the pairs.
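For example, in the job's JSON config (the config shown earlier on this page uses 0.1):

"labelDataSampleSize" : 0.2,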

I will update the documentation to make things clearer.

Originally posted by @sonalgoyal in #46 (comment)
