
zinggai / zingg

891 stars · 18 watchers · 109 forks · 448.76 MB

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

License: GNU Affero General Public License v3.0

Java 76.31% Scala 1.42% Shell 0.29% Dockerfile 0.06% FreeMarker 0.01% Python 13.13% Makefile 0.08% Batchfile 0.09% HTML 8.62%
fuzzymatch fuzzy-matching deduplication dedupe masterdata dataengineering data-transformation analytics-engineering entity-resolution identity-resolution

Zingg's Introduction

The 0.4.0 release of Zingg is out!

The Problem

Real world data contains multiple records belonging to the same customer. These records can live in a single system or across multiple systems, and they have variations across fields, which makes it hard to combine them, especially with growing data volumes. This hurts customer analytics: establishing lifetime value, loyalty programs, or marketing channels is impossible when the base data is not linked. No AI algorithm for segmentation can produce the right results when there are multiple copies of the same customer lurking in the data. No warehouse can live up to its promise if the dimension tables have duplicates.

[Image: Zingg - Data Silos]

With a modern data stack and DataOps, we have established patterns for the E and L in ELT for building data warehouses, data lakes and delta lakes. However, the T, getting data ready for analytics, still needs a lot of effort. Modern tools like dbt are actively and successfully addressing this. What is also needed is a quick and scalable way to build a single source of truth for core business entities post extraction and pre or post loading.

With Zingg, the analytics engineer and the data scientist can quickly integrate data silos and build unified views at scale!

[Image: Zingg - Data Mastering At Scale with ML]

Besides probabilistic (fuzzy) matching, Zingg also does deterministic matching, which is useful in identity resolution and householding applications.

[Image: Zingg Deterministic Matching]

Why Zingg

Zingg is an ML based tool for entity resolution. The following features set Zingg apart from other tools and libraries:

  • Ability to handle any entity like customer, patient, supplier, product, etc.
  • Ability to connect to disparate data sources: local and cloud file systems in any format; enterprise applications; and relational, NoSQL and cloud databases and warehouses
  • Ability to scale to large volumes of data. See why this is important and Zingg performance numbers
  • Interactive training data builder using active learning that builds highly accurate models from frugally small training samples. It shows record pairs and asks the user to mark yes, no, or can't say on the CLI.
  • Ability to define domain specific functions to improve matching
  • Out of the box support for English as well as Chinese, Thai, Japanese, Hindi and other languages

Zingg is useful for

  • Building unified and trusted views of customers and suppliers across multiple systems
  • Large Scale Entity Resolution for AML, KYC and other fraud and compliance scenarios
  • Deduplication and data quality
  • Identity Resolution
  • Integrating data silos during mergers and acquisitions
  • Data enrichment from external sources
  • Establishing customer households

The Story

What is the backstory behind Zingg?

Documentation

Check the detailed Zingg documentation

Community

Be part of the conversation in the Zingg Community Slack

People behind Zingg

Zingg is being developed by the Zingg.AI team. If you need custom help with or around Zingg, let us know.

Demo

See Zingg in action here

Getting Started

The easiest way to get started with Zingg is through Docker and by running the prebuilt models.

docker pull zingg/zingg:0.4.0
docker run -it zingg/zingg:0.4.0 bash
./scripts/zingg.sh --phase match --conf examples/febrl/config.json

Check the step by step guide for more details.

Connectors

Zingg connects, reads and writes to most on-premise and cloud data sources. Zingg runs on any private or cloud based Spark service.

[Image: Zingg connectors]

Zingg can read and write to Snowflake, Cassandra, S3, Azure, Elastic, major RDBMSs and any Spark-supported data source. Zingg also works with all major file formats including Parquet, Avro, JSON, XLSX, CSV and TSV. This is done through the Zingg pipe abstraction.
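As a sketch of what pipes look like, here is a minimal config fragment modeled on the pipe definitions in the user config shown later on this page; the names, file locations and the Parquet output format are illustrative assumptions:

"data": [{
    "name": "customers",
    "format": "csv",
    "props": {
        "location": "examples/febrl/test.csv",
        "delimiter": ",",
        "header": true
    }
}],
"output": [{
    "name": "unified",
    "format": "parquet",
    "props": {
        "location": "/tmp/zinggOutput"
    }
}]

Retargeting Zingg at a different source or sink is then a matter of swapping the pipe's format and props.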

Key Zingg Concepts

Zingg learns two models on the data:

  1. Blocking Model

One fundamental problem with scaling data mastering is that the number of comparisons increases quadratically as the number of input records increases.

[Image: Data Mastering At Scale]

Zingg learns a clustering/blocking model which indexes near-similar records. This means that Zingg does not compare every record with every other record. Typical Zingg comparisons are 0.05-1% of the possible problem space; a worked example follows below.

  2. Similarity Model

The similarity model helps Zingg predict which record pairs match. Similarity is run only on records within the same block/cluster to scale the problem to larger datasets. The similarity model is a classifier which predicts similarity between records that are not exactly the same, but could belong together.
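To make the blocking gain concrete, here is an illustrative calculation (1 million records is an assumed example, not a benchmark): for n input records, naive matching scores n(n - 1) / 2 pairs. With n = 1,000,000, that is about 5 × 10^11 pairs. At 0.05-1% of the possible space, blocking cuts this to roughly 2.5 × 10^8 to 5 × 10^9 comparisons.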

[Image: Fuzzy matching comparisons]

To build these models, training data is needed. Zingg comes with an interactive learner to rapidly build training sets.

[Image: The interactive labeler shows record pairs and asks the user to mark yes, no, or can't say on the CLI]
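A typical end-to-end run, sketched with the phases referenced throughout this page (paths follow the Getting Started example; the 30-40 pair target comes from the labeling guidance further down):

./scripts/zingg.sh --phase findTrainingData --conf examples/febrl/config.json
./scripts/zingg.sh --phase label --conf examples/febrl/config.json
# repeat findTrainingData and label until roughly 30-40 matching pairs are marked
./scripts/zingg.sh --phase train --conf examples/febrl/config.json
./scripts/zingg.sh --phase match --conf examples/febrl/config.json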

Pretrained models

Zingg comes with pretrained models for the Febrl dataset under the models folder.

Reporting bugs and contributing

Want to report a bug or request a feature? Let us know on Slack, or open an issue.

Want to commit code? Please check the contributing documentation.

Book Office Hours

If you want to schedule a 30-min call with our team to help you understand if Zingg is the right technology for your problem, please book a slot here. For troubleshooting and Zingg issues, please report the problem as an issue on GitHub.

Asking questions on running Zingg

If you have a question or issue while using Zingg, please log a question and we will reply quickly :-) You can also use Slack.

License

Zingg is licensed under AGPL v3.0, which means you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.

AGPL allows unrestricted use of Zingg by end users, solution builders and partners. We strongly encourage solution builders to create custom solutions for their clients using Zingg.

Need a different license? Write to us.

Acknowledgements

Zingg would not have been possible without the excellent work of the open source projects it builds on.

Zingg's People

Contributors

abhay447, aditya-r-chakole, akash-r-7, bagotia16, chetan453, dependabot[bot], edmondop, gnanaprakash-ravi, jgransac, manan-s0ni, morazow, navinrathore, ravirajbaraiya, semyonsinchenko, shefalika-thapa, siddharth2798, sonalgoyal, vikasgupta78


Zingg's Issues

Add case studies

Now that the documentation is in better, more consumable shape, we should add the case studies.

Issue on running --phase train

WARN org.apache.spark.ml.util.Instrumentation - [7dacb264] All labels are the same value and fitIntercept=true, so the coefficients will be zeros. Training is not needed.
INFO org.apache.spark.ml.util.Instrumentation - [aa3fb364] training finished
INFO org.apache.spark.ml.util.Instrumentation - [e20e40e0] training finished
ERROR org.apache.spark.ml.util.Instrumentation - org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)

WARN zingg.client.util.Email - Unable to send email Can't send command to SMTP host
WARN zingg.client.Client - Apologies for this message. Zingg has encountered an error. Exception thrown in awaitResult:

findTrainingData and Label completed

Get output using our input file

Hi Sonalgoyal,

In the test.csv, the first column, called REC ID, already says whether a record is Original or Duplicate. In my case, I have an input file that contains first name, last name and other columns similar to those in test.csv. How do I use Zingg to find the duplicates? Is there a pre-processing step I am missing?

Add documentation on scoring

How similarity works, and how to interpret the scores. I already have this written up in an email explaining it to a customer; it needs to be put into shape and added to this repo.

Client argument list formatting

Describe the bug
Hello! I'm testing this out on Azure Databricks. I believe my job spec is correct:

{
    "settings": {
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "spark_conf": {
                "spark.databricks.delta.preview.enabled": "true"
            },
            "azure_attributes": {
                "availability": "ON_DEMAND_AZURE",
                "first_on_demand": 1,
                "spot_bid_max_price": -1
            },
            "node_type_id": "Standard_DS3_v2",
            "enable_elastic_disk": true,
            "num_workers": 5
        },
        "libraries": [
            {
                "jar": "dbfs:/FileStore/jars/960f43e3_3c55_4c5b_a296_e69bc80ed1a6-zingg_0_3_0_SNAPSHOT-aa6ea.jar"
            },
            {
                "maven": {
                    "coordinates": "org.jsoup:jsoup:1.7.2"
                }
            }
        ],
        "spark_jar_task": {
            "main_class_name": "zingg.client.Client",
            "parameters": [
                "--phase",
                "findTrainingData",
                "--conf",
                "dbfs:/Bilbro/examples/zinggconfig.json"
            ],
            "run_as_repl": true
        },
        "email_notifications": {},
        "name": "zinggtest",
        "max_concurrent_runs": 1
    }
}

But the main class does not appear to accept a list of strings. I receive this error:

21/09/22 20:41:51 command--1:1: error: type mismatch;
 found   : Array[String]
 required: String
zingg.client.Client.main(Array("--phase","findTrainingData","--conf","dbfs:/Bilbro/examples/zinggconfig.json"))
                              ^

Zingg setup issue

I installed Spark and the JDK on my local machine. When I try to run the following script, I get an error:
./scripts/zingg.sh --phase match --conf examples/febrl/config.json
Error message:
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.unsafe.array.ByteArrayMethods.<clinit>(ByteArrayMethods.java:54)
at org.apache.spark.internal.config.package$.<init>(package.scala:1095)
at org.apache.spark.internal.config.package$.<clinit>(package.scala)
at org.apache.spark.deploy.SparkSubmitArguments.$anonfun$loadEnvironmentArguments$3(SparkSubmitArguments.scala:157)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:157)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:115)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$3.<init>(SparkSubmit.scala:1022)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:1022)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:85)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make private java.nio.DirectByteBuffer(long,int) accessible: module java.base does not "opens java.nio" to unnamed module @755c9148
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
at java.base/java.lang.reflect.Constructor.checkCanSetAccessible(Constructor.java:188)
at java.base/java.lang.reflect.Constructor.setAccessible(Constructor.java:181)
at org.apache.spark.unsafe.Platform.<clinit>(Platform.java:56)

NullPointerException during match phase

Hi, I hit a NullPointerException running Zingg in the match phase. I had previously run the findTrainingData / label / train phases successfully. The error occurs early in the run (after ~1 minute):

I have no name!@e036d492b95b:/zingg-0.3.0-SNAPSHOT$ time ./scripts/zingg.sh --phase match --conf /tmp/research/tsmart_config.json
2021-11-04 14:52:31,814 [main] WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 2021-11-04 14:52:32,064 [main] INFO  zingg.client.Client -
 2021-11-04 14:52:32,065 [main] INFO  zingg.client.Client - ********************************************************
 2021-11-04 14:52:32,066 [main] INFO  zingg.client.Client - *                    Zingg AI                           *
 2021-11-04 14:52:32,067 [main] INFO  zingg.client.Client - *               (C) 2021 Zingg.AI                       *
 2021-11-04 14:52:32,068 [main] INFO  zingg.client.Client - ********************************************************
 2021-11-04 14:52:32,069 [main] INFO  zingg.client.Client -
 2021-11-04 14:52:32,070 [main] INFO  zingg.client.Client - using: Zingg v0.3
 2021-11-04 14:52:32,070 [main] INFO  zingg.client.Client -
 2021-11-04 14:52:32,289 [main] WARN  zingg.client.Arguments - Config Argument is /tmp/research/tsmart_config.json
 2021-11-04 14:52:32,564 [main] WARN  zingg.client.Arguments - phase is match
 2021-11-04 14:52:35,660 [main] WARN  org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry - The function round replaced a previously registered function.
 2021-11-04 14:52:35,660 [main] INFO  zingg.ZinggBase - Start reading internal configurations and functions
 2021-11-04 14:52:35,675 [main] INFO  zingg.ZinggBase - Finished reading internal configurations and functions
 2021-11-04 14:52:35,728 [main] WARN  zingg.util.PipeUtil - Reading input csv
 2021-11-04 14:53:13,325 [main] INFO  zingg.Matcher - Read 932474
 2021-11-04 14:53:13,414 [main] INFO  zingg.Matcher - Blocked
 2021-11-04 14:53:13,918 [main] WARN  org.apache.spark.sql.catalyst.util.package - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
 2021-11-04 14:53:16,745 [Executor task launch worker for task 30] ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 8.0 (TID 30)
 java.lang.NullPointerException
        at zingg.block.Block$BlockFunction.call(Block.java:403)
        at zingg.block.Block$BlockFunction.call(Block.java:393)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
...

Several instances of the exception are reported; this is the first one.

My config file is:

{
  "fieldDefinition":[
    {
      "fieldName" : "id",
      "matchType" : "DONT_USE",
      "fields" : "voterbase_id"
    },
    {
      "fieldName" : "fname",
      "matchType" : "fuzzy",
      "fields" : "vb_tsmart_first_name",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "mname",
      "matchType" : "fuzzy",
      "fields" : "vb_tsmart_middle_name",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "lname",
      "matchType" : "fuzzy",
      "fields" : "vb_tsmart_last_name",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "address1",
      "matchType": "fuzzy",
      "fields" : "vb_tsmart_full_address",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "city",
      "matchType": "fuzzy",
      "fields" : "vb_tsmart_city",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "state",
      "matchType": "exact",
      "fields" : "vb_tsmart_state",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "zip",
      "matchType": "exact",
      "fields" : "vb_tsmart_zip",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "homephone",
      "matchType": "exact",
      "fields" : "vb_voterbase_phone",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "cellphone",
      "matchType": "exact",
      "fields" : "vb_voterbase_phone_wireless",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "dob",
      "matchType": "fuzzy",
      "fields" : "vb_voterbase_dob",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "gender",
      "matchType": "fuzzy",
      "fields" : "vb_voterbase_gender",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "tier",
      "matchType": "DONT_USE",
      "fields" : "tier",
      "dataType": "\"string\""
    }
    ],
    "output" : [{
      "name":"output",
      "format":"csv",
      "props": {
        "location": "/tmp/zinggOutput",
        "delimiter": ",",
        "header":true
      }
    }],
    "data" : [{
      "name":"test",
      "format":"csv",
      "props": {
        "location": "/tmp/research/tmp/dupes/combined.csv",
        "delimiter": ",",
        "header":false
      },
      "schema":
        "{\"type\" : \"struct\",
        \"fields\" : [
          {\"name\":\"id\", \"type\":\"string\", \"nullable\":false},
          {\"name\":\"fname\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"mname\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"lname\",\"type\":\"string\",\"nullable\":true} ,
          {\"name\":\"address1\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"city\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"state\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"zip\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"homephone\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"cellphone\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"dob\",\"type\":\"string\",\"nullable\":true} ,
          {\"name\":\"gender\",\"type\":\"string\",\"nullable\":true}
        ]
      }"
    }],
    "labelDataSampleSize" : 0.1,
    "numPartitions":4,
    "modelId": 100,
    "zinggDir": "/tmp/research/tmp/models"
}

I am running Zingg via Docker. I built my image using the release tar.gz from GitHub combined with a slightly altered Dockerfile:

FROM docker.io/bitnami/spark:3.0.2
WORKDIR /
ADD assembly/target/zingg-0.3.0-SNAPSHOT-spark-3.0.3.tar.gz .
WORKDIR /zingg-0.3.0-SNAPSHOT
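For reference, building and entering such an image would look roughly like this (the zingg-local tag is an arbitrary example):

docker build -t zingg-local .
docker run -it zingg-local bash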

I am not able to include a data sample since it includes PII, but it has fields:

voterbase_id
vb_tsmart_first_name
vb_tsmart_middle_name
vb_tsmart_last_name
vb_tsmart_full_address
vb_tsmart_city
vb_tsmart_state
vb_tsmart_zip
vb_voterbase_phone
vb_voterbase_phone_wireless
vb_voterbase_dob
vb_voterbase_gender
tier

Linking phase matches from the same dataset

During the linking phase, matching and comparison are done on rows from the same dataset.

**To Reproduce**

  1. Run the findTrainingData, label, train, and link phases.
  2. During the label phase, rows from the same dataset are displayed for labeling.
  3. The resulting output has matching pairs with the same z_source.

**Expected behavior**
I expect the labeling phase to only show pairs of rows with different z_source values, and I expect the output to likewise only contain pairs with different z_source values.

Add makefiles and targets to compile and install the code.

Is your feature request related to a problem? Please describe.
Currently, the user has to manually set up the environment to compile and install Zingg. Instead, we should add make targets that handle all the prerequisites and setup.

zingg-0.3.0-SNAPSHOT.jar is not available

I tried to run the Zingg utility as per the steps shown in the README, but the zingg-0.3.0-SNAPSHOT.jar referred to in zingg.sh is not present in the published code. Kindly provide it.

SQL based blocking and distance functions

What if we could take SQL from, say, a dbt model or elsewhere and use that for our model training, blocking as well as similarity? Then non-Java programmers could also code and customize Zingg without bothering about the internals.
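For illustration only, such a model might expose a blocking key along these lines (hypothetical SQL; nothing in Zingg consumes this today):

-- phonetic key on last name groups likely-duplicate records into blocks
SELECT *,
       soundex(last_name) AS block_key
FROM customers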

Anonymous Telemetry

Understanding Zingg usage in an anonymous way (number of attributes, size of data, pipes, number of runs, time taken for each run) will help us build the right features. This should be absolutely anonymous, with no actual data leaving the user's environment, and the user should have the ability to switch it off at any time.

Clean up zingg.sh script

We need to revisit the zingg.sh script and remove the elastic, license and email options, as well as SPARK_MEM and similar settings.

Add graphic to denote that the findTrainingData and label phases have to be repeated

I see. The pairs to be labeled are found by looking through the entire dataset sample and do not depend on where the records are placed. How many times did you run the findTrainingData and the label phases? You can start off from where you left off until you have 30-40 pairs of actual matches.

Also, if you are not finding many matches in the label phase, and if the findTrainingData phase is fast enough on the hardware you have deployed, you can change labelDataSampleSize to 0.2 so it will look at twice the sample to fetch the pairs.
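For example, in the job's JSON config (the config shown earlier on this page uses 0.1):

"labelDataSampleSize" : 0.2,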

I will update the documentation to make things clearer.

Originally posted by @sonalgoyal in #46 (comment)
