Comments (13)
Sure! Happy to try a debug build.
from zingg.
It seems that the blocking functions are giving a null value. Did you train on the same data on which you are running the match? Do you have many null values in the data but not in the training set?
from zingg.
> Did you train on the same data on which you are running the match?

Yes, I used the same CSV as the input to both.
> Do you have many null values in the data but not in the training set?

Maybe? The source data certainly has several nullable fields, but I don't know about the training data's representation of them. I tried to lean towards more training data by setting labelDataSampleSize to 0.1, but maybe that wasn't the right thing?
Edit: since id is the only nullable field, does that imply there's a record w/o an id?
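One way to answer the null-value question is to count blank cells per column in the input CSV. The following is a minimal sketch, not part of Zingg: it assumes the file has a header row and that an empty cell or the literal text `null` marks a missing value. The function name and the `null_tokens` parameter are hypothetical.

```python
import csv
from collections import Counter


def count_blank_values(csv_path, null_tokens=("", "null", "NULL")):
    """Return {column: number of rows whose value is blank or a null token}."""
    blanks = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Start every column at zero so fully populated columns are reported too.
        for column in reader.fieldnames or []:
            blanks[column] = 0
        for row in reader:
            for column, value in row.items():
                if column is None:
                    # Extra cells beyond the header are ignored.
                    continue
                # Short rows yield None for the missing trailing cells.
                if value is None or value.strip() in null_tokens:
                    blanks[column] += 1
    return dict(blanks)
```

Running this on the match input shows which columns (for example homephone or cellphone) are frequently empty, which can then be compared against the pairs seen while labeling.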
from zingg.
Hard to say, sorry. The logs could be better at telling exactly what went wrong. I can try to give you a debug build to find out. Will that work for you?
from zingg.
Great. Please give me a few hours
from zingg.
Please use this and let me know how it goes: https://github.com/zinggAI/zingg/releases/tag/v0.3.0debug
from zingg.
That was definitely more informative (I'd encourage you to leave the extra debugging in by default or via a flag), but seeing the row doesn't immediately give me clarity on what's wrong. The error, lightly redacted:
```
I have no name!@097be7d42402:/zingg-0.3.0-SNAPSHOT$ ./scripts/zingg.sh --phase match --conf /tmp/research/zingg/tsmart_config.json
2021-11-05 14:57:32,055 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-11-05 14:57:32,316 [main] INFO zingg.client.Client -
2021-11-05 14:57:32,317 [main] INFO zingg.client.Client - ********************************************************
2021-11-05 14:57:32,317 [main] INFO zingg.client.Client - * Zingg AI *
2021-11-05 14:57:32,317 [main] INFO zingg.client.Client - * (C) 2021 Zingg.AI *
2021-11-05 14:57:32,317 [main] INFO zingg.client.Client - ********************************************************
2021-11-05 14:57:32,317 [main] INFO zingg.client.Client -
2021-11-05 14:57:32,317 [main] INFO zingg.client.Client - using: Zingg v0.3
2021-11-05 14:57:32,318 [main] INFO zingg.client.Client -
2021-11-05 14:57:32,560 [main] WARN zingg.client.Arguments - Config Argument is /tmp/research/zingg/tsmart_config.json
2021-11-05 14:57:32,812 [main] WARN zingg.client.Arguments - phase is match
2021-11-05 14:57:35,917 [main] WARN org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry - The function round replaced a previously registered function.
2021-11-05 14:57:35,918 [main] INFO zingg.ZinggBase - Start reading internal configurations and functions
2021-11-05 14:57:35,932 [main] INFO zingg.ZinggBase - Finished reading internal configurations and functions
2021-11-05 14:57:35,981 [main] WARN zingg.util.PipeUtil - Reading input csv
2021-11-05 14:58:10,151 [main] INFO zingg.Matcher - Read 932474
2021-11-05 14:58:10,163 [main] DEBUG zingg.block.Block - returning schema after step 1 is StructType(StructField(z_zid,LongType,false), StructField(id,StringType,true), StructField(fname,StringType,true), StructField(mname,StringType,true), StructField(lname,StringType,true), StructField(address1,StringType,true), StructField(city,StringType,true), StructField(state,StringType,true), StructField(zip,StringType,true), StructField(homephone,StringType,true), StructField(cellphone,StringType,true), StructField(dob,StringType,true), StructField(gender,StringType,true), StructField(z_source,StringType,false), StructField(z_hash,IntegerType,false))
2021-11-05 14:58:10,227 [main] INFO zingg.Matcher - Blocked
2021-11-05 14:58:10,670 [main] WARN org.apache.spark.sql.catalyst.util.package - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
2021-11-05 14:58:13,319 [Executor task launch worker for task 30] DEBUG zingg.block.Block - blocking row [338504,NY-000001111111,MELISSA,Y,XXXXXXX,XXX XXXX XX,BROOKLYN,NY,11228,null,null,19610101,Female,test]
2021-11-05 14:58:13,325 [Executor task launch worker for task 30] ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 8.0 (TID 30)
java.lang.NullPointerException
at zingg.block.Block$BlockFunction.call(Block.java:409)
at zingg.block.Block$BlockFunction.call(Block.java:396)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
I'm not sure if this is the issue, but the dumped schema specifies a z_hash in the last field, and the printed row appears to be missing that.
from zingg.
z_hash is an internal field that is built after applying the blocking. If the operation fails, it will not be added.
Looks like the cellphone/homephone fields are null in this row.
How many samples have you trained on? Did they have similar data?
from zingg.
Also, what is the size of the file under /tmp/research/tmp/models/100/model/block?
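The size check asked for here can be done with `ls -l` or `du`, or scripted. The sketch below is a hypothetical helper, not a Zingg utility. It reports the total bytes under a path and handles both cases, since it is an assumption whether the block path is a single file or a directory of part files.

```python
from pathlib import Path


def path_size_bytes(path):
    """Return the size of a file, or the summed size of all files in a directory."""
    target = Path(path)
    if not target.exists():
        raise FileNotFoundError(f"{target} does not exist")
    if target.is_file():
        return target.stat().st_size
    # Walk the directory tree and add up every regular file.
    return sum(f.stat().st_size for f in target.rglob("*") if f.is_file())
```

For example, `path_size_bytes("/tmp/research/tmp/models/100/model/block")` returning only a handful of bytes would suggest that no usable blocking model was written.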
from zingg.
> How many samples have you trained on? Did they have similar data?

I'm not sure how to answer that. I followed the instructions in running.md and used the command invocations therein.
> Also, what is the size of the file under /tmp/research/tmp/models/100/model/block ?

Aha, a whopping 5 bytes. I guess something went wrong earlier in the process. I'll clear out the working space and start fresh.
from zingg.
Looks like the blocking model is blank (5 bytes). The findTrainingData and label phases are meant to be run repeatedly until you have about 30-40 pairs of positive matches. After that you run train and then match, or trainMatch to save the models and execute them in one go. If you have already labeled quite a few pairs by running findTrainingData and label, do not clear the space out; just continue from where you left off. Hope this helps.
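The phase order described above can be written out as a list of command lines. This is a sketch only: the `./scripts/zingg.sh --phase ... --conf ...` form is taken from the invocation in the log earlier in this thread, while the helper name and the `labeling_rounds` parameter are hypothetical. The number of rounds actually needed is not fixed; it depends on when about 30-40 positive pairs have been labeled.

```python
def zingg_commands(conf_path, labeling_rounds, script="./scripts/zingg.sh"):
    """Build the ordered command lines: repeated labeling rounds, then train and match."""
    if labeling_rounds < 1:
        raise ValueError("at least one findTrainingData/label round is needed")
    # Each round finds candidate pairs and then asks the user to label them.
    phases = ["findTrainingData", "label"] * labeling_rounds
    # Only after enough positive pairs exist are the models built and applied.
    phases += ["train", "match"]
    return [f"{script} --phase {phase} --conf {conf_path}" for phase in phases]


if __name__ == "__main__":
    for command in zingg_commands("/tmp/research/zingg/tsmart_config.json", 3):
        print(command)
```

The label phase is interactive, so the commands are printed rather than executed; the last two phases could be replaced by a single trainMatch run.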
from zingg.
Aha, that explains it - when I labelled I never found any positive matches. That seems like it's a condition worth flagging both in the docs and after labeling has exhausted its set of pairs.
It's possible the distribution of duplicates in my data tickles this behavior. This set comes from a known dupe-free file, combined with a slice of itself after fuzzing some fields (introducing typos, blanking some fields, etc). The result contains about 10% duplicates, but they're all at the end (I've simply concatenated the dupes, not interspersed them). This shape may have exacerbated the issue.
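For reference, a test set of this shape can be built roughly as follows. This is an illustrative sketch, not the script actually used: the function names, the single-character-deletion "typo", and the `blank_prob` and `dup_fraction` defaults are all assumptions. Setting `shuffle=True` intersperses the duplicates instead of concatenating them at the end.

```python
import random


def fuzz_record(record, rng, fields, blank_prob=0.3):
    """Return a copy of record with one of the given fields blanked or given a typo."""
    fuzzed = dict(record)
    field = rng.choice(fields)
    value = fuzzed[field]
    if not value or rng.random() < blank_prob:
        fuzzed[field] = ""
    else:
        # Simulate a typo by dropping one character.
        i = rng.randrange(len(value))
        fuzzed[field] = value[:i] + value[i + 1:]
    return fuzzed


def with_fuzzed_duplicates(records, fields, dup_fraction=0.1, seed=0, shuffle=True):
    """Append fuzzed copies of a random slice of records, optionally shuffled in."""
    rng = random.Random(seed)
    n_dups = int(len(records) * dup_fraction)
    dups = [fuzz_record(r, rng, fields) for r in rng.sample(records, n_dups)]
    combined = list(records) + dups
    if shuffle:
        rng.shuffle(combined)
    return combined
```

Note that `dup_fraction` is measured against the original file, so adding 10% fuzzed copies yields a combined file in which slightly under 10% of the rows are duplicates.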
from zingg.
I see. The pairs to be labeled are found by looking through the entire dataset sample and do not depend on where the records are placed. How many times did you run the findTrainingData and label phases? You can pick up from where you left off and continue until you have 30-40 pairs of actual matches.
Also, if you are not finding many matches in the label phase and the findTrainingData phase runs fast enough on your hardware, you can change labelDataSampleSize to 0.2 so that it looks at twice the sample to fetch the pairs.
I will update the documentation to make things clearer.
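The labelDataSampleSize change suggested above can be made by editing the JSON config by hand or with a small script. The sketch below is a hypothetical helper, not part of Zingg. It assumes labelDataSampleSize is a top-level key in the config file passed via `--conf`, and that it is a fraction between 0 and 1.

```python
import json


def set_label_sample_size(conf_path, sample_size):
    """Rewrite the config file with a new labelDataSampleSize and return the config."""
    # Assumption: the value is a fraction of the data, so it must lie in (0, 1].
    if not 0 < sample_size <= 1:
        raise ValueError("sample_size is expected to be a fraction in (0, 1]")
    with open(conf_path, encoding="utf-8") as handle:
        config = json.load(handle)
    config["labelDataSampleSize"] = sample_size
    with open(conf_path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
    return config
```

After a call such as `set_label_sample_size("/tmp/research/zingg/tsmart_config.json", 0.2)`, the findTrainingData and label phases are run again so the larger sample is used.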
from zingg.