Hi, I hit a NullPointerException running zingg at the

NullPointerException during match phase about zingg HOT 13 CLOSED

redterror commented on June 17, 2024

NullPointerException during match phase

from zingg.

Comments (13)

redterror commented on June 17, 2024

Sure! Happy to try a debug build.

sonalgoyal commented on June 17, 2024

It seems that the blocking functions are giving a null value. Did you train on the same data on which you are running the match? Do you have many null values in the data but not in the training set?

redterror commented on June 17, 2024

> Did you train on the same data on which you are running the match?

Yes, I used the same CSV as the input to both.

> Do you have many null values in the data but not in the training set?

Maybe? The source data certainly has several nullable fields, but I don't know about the training data's representation of them. I tried to lean towards more training data by setting labelDataSampleSize to 0.1 but maybe that wasn't the right thing?

Edit: since id is the only nullable field, does that imply there's a record w/o an id?

sonalgoyal commented on June 17, 2024

Hard to say, sorry. The logs could be better at telling exactly what went wrong. I can try to give you a debug build to find out what happened. Will that work for you?

sonalgoyal commented on June 17, 2024

Great. Please give me a few hours

sonalgoyal commented on June 17, 2024

please use this and let me know how it goes https://github.com/zinggAI/zingg/releases/tag/v0.3.0debug

redterror commented on June 17, 2024

That was definitely more informative (I'd encourage you to leave the extra debugging in by default or via a flag), but seeing the row doesn't immediately give me clarity on what's wrong. The error, lightly redacted:

 I have no name!@097be7d42402:/zingg-0.3.0-SNAPSHOT$ ./scripts/zingg.sh --phase match --conf /tmp/research/zingg/tsmart_config.json
2021-11-05 14:57:32,055 [main] WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 2021-11-05 14:57:32,316 [main] INFO  zingg.client.Client -
 2021-11-05 14:57:32,317 [main] INFO  zingg.client.Client - ********************************************************
 2021-11-05 14:57:32,317 [main] INFO  zingg.client.Client - *                    Zingg AI                           *
 2021-11-05 14:57:32,317 [main] INFO  zingg.client.Client - *               (C) 2021 Zingg.AI                       *
 2021-11-05 14:57:32,317 [main] INFO  zingg.client.Client - ********************************************************
 2021-11-05 14:57:32,317 [main] INFO  zingg.client.Client -
 2021-11-05 14:57:32,317 [main] INFO  zingg.client.Client - using: Zingg v0.3
 2021-11-05 14:57:32,318 [main] INFO  zingg.client.Client -
 2021-11-05 14:57:32,560 [main] WARN  zingg.client.Arguments - Config Argument is /tmp/research/zingg/tsmart_config.json
 2021-11-05 14:57:32,812 [main] WARN  zingg.client.Arguments - phase is match
 2021-11-05 14:57:35,917 [main] WARN  org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry - The function round replaced a previously registered function.
 2021-11-05 14:57:35,918 [main] INFO  zingg.ZinggBase - Start reading internal configurations and functions
 2021-11-05 14:57:35,932 [main] INFO  zingg.ZinggBase - Finished reading internal configurations and functions
 2021-11-05 14:57:35,981 [main] WARN  zingg.util.PipeUtil - Reading input csv
 2021-11-05 14:58:10,151 [main] INFO  zingg.Matcher - Read 932474
 2021-11-05 14:58:10,163 [main] DEBUG zingg.block.Block - returning schema after step 1 is StructType(StructField(z_zid,LongType,false), StructField(id,StringType,true), StructField(fname,StringType,true), StructField(mname,StringType,true), StructField(lname,StringType,true), StructField(address1,StringType,true), StructField(city,StringType,true), StructField(state,StringType,true), StructField(zip,StringType,true), StructField(homephone,StringType,true), StructField(cellphone,StringType,true), StructField(dob,StringType,true), StructField(gender,StringType,true), StructField(z_source,StringType,false), StructField(z_hash,IntegerType,false))
 2021-11-05 14:58:10,227 [main] INFO  zingg.Matcher - Blocked
 2021-11-05 14:58:10,670 [main] WARN  org.apache.spark.sql.catalyst.util.package - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
 2021-11-05 14:58:13,319 [Executor task launch worker for task 30] DEBUG zingg.block.Block - blocking row [338504,NY-000001111111,MELISSA,Y,XXXXXXX,XXX XXXX XX,BROOKLYN,NY,11228,null,null,19610101,Female,test]
 2021-11-05 14:58:13,325 [Executor task launch worker for task 30] ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 8.0 (TID 30)
 java.lang.NullPointerException
        at zingg.block.Block$BlockFunction.call(Block.java:409)
        at zingg.block.Block$BlockFunction.call(Block.java:396)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

I'm not sure if this is the issue, but the dumped schema specifies a z_hash as the last field, and the printed row appears to be missing it.

sonalgoyal commented on June 17, 2024

z_hash is an internal field that is built after applying the blocking. If the operation fails, it will not be added.
Looks like the cellphone/homephone fields are null in this row.
How many samples have you trained on? Did they have similar data?
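
To get a feel for how many null values the input has per column, a rough sketch like the following could help. It assumes the input CSV has a header row and that missing values show up as empty strings or the literal text "null"; the function name and the null tokens are illustrative and not part of Zingg.

```python
import csv
from collections import Counter


def count_empty_fields(csv_path, null_tokens=("", "null")):
    """Return (row_count, {column: empty_count}) for a CSV with a header row.

    null_tokens is an assumption about how missing values are written
    in the file; adjust it to match the actual data.
    """
    empty = Counter()
    total = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            for col, val in row.items():
                if col is None:
                    # Extra unnamed fields on a malformed row; skip them.
                    continue
                if val is None or val.strip().lower() in null_tokens:
                    empty[col] += 1
    return total, dict(empty)


# Example usage (the path is hypothetical):
# total, empty = count_empty_fields("/tmp/research/input.csv")
# for col, n in sorted(empty.items(), key=lambda kv: -kv[1]):
#     print(f"{col}: {n} of {total} rows empty")
```

Columns such as homephone and cellphone that are empty in a large share of rows would stand out in this output.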

sonalgoyal commented on June 17, 2024

Also, what is the size of the file under /tmp/research/tmp/models/100/model/block ?
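
A small sketch for checking that size before running match is shown below. The path is the one mentioned in this thread; the helper names and the 1 KB threshold are arbitrary assumptions for illustration, not something Zingg defines.

```python
from pathlib import Path


def dir_size_bytes(path):
    """Total size in bytes of a file, or of all files under a directory."""
    p = Path(path)
    if not p.exists():
        return 0
    if p.is_file():
        return p.stat().st_size
    return sum(f.stat().st_size for f in p.rglob("*") if f.is_file())


def looks_untrained(path, min_bytes=1024):
    """Heuristic only: a model smaller than min_bytes is probably blank.

    min_bytes is a guess chosen for this sketch.
    """
    return dir_size_bytes(path) < min_bytes


# Example usage with the path from this thread:
# block_model = "/tmp/research/tmp/models/100/model/block"
# print(dir_size_bytes(block_model), looks_untrained(block_model))
```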

redterror commented on June 17, 2024

> How many samples have you trained on? Did they have similar data?

I'm not sure how to answer that. I followed the instructions in running.md and used the command invocations therein.

> Also, what is the size of the file under /tmp/research/tmp/models/100/model/block ?

Aha, a whopping 5 bytes. I guess something went wrong earlier in the process. I'll clear out the working space and start fresh.

sonalgoyal commented on June 17, 2024

It looks like the blocking model is blank (5 bytes). The findTrainingData and label phases should be run repeatedly until you have about 30-40 pairs of positive matches; after that, run train and then match, or use trainMatch to save the models and execute them in one go. If you have already labeled quite a few pairs by running findTrainingData and label, do not clear the space out; just pick up from where you left off. Hope this helps.
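
The phase order described above can be sketched as a small driver script. This is only an illustration: the script path and the --phase/--conf flags mirror the match invocation in the log earlier in this thread, the phase names are the ones mentioned here, and the helper functions are hypothetical. With dry_run=True it only prints the commands.

```python
import subprocess

# Same launcher script as in the match invocation shown in the log.
ZINGG_SCRIPT = "./scripts/zingg.sh"


def phase_command(phase, conf):
    """Build the command line for one Zingg phase."""
    return [ZINGG_SCRIPT, "--phase", phase, "--conf", conf]


def run_phase(phase, conf, dry_run=True):
    """Print the command (dry run) or execute it; returns the command list."""
    cmd = phase_command(phase, conf)
    if dry_run:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd


def labeling_round(conf, dry_run=True):
    """One round of findTrainingData followed by label.

    Repeat rounds until roughly 30-40 positive pairs have been labeled.
    """
    return [run_phase(p, conf, dry_run) for p in ("findTrainingData", "label")]


def train_and_match(conf, dry_run=True):
    """Run train, then match, once enough pairs are labeled."""
    return [run_phase(p, conf, dry_run) for p in ("train", "match")]


# Example usage (dry run, config path from this thread):
# conf = "/tmp/research/zingg/tsmart_config.json"
# labeling_round(conf)      # repeat as needed
# train_and_match(conf)
```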

redterror commented on June 17, 2024

Aha, that explains it - when I labelled I never found any positive matches. That seems like a condition worth flagging both in the docs and after labeling has exhausted its set of pairs.

It's possible the distribution of duplicates in my data tickles this behavior. This set comes from a known dupe-free file, combined with a slice of itself after fuzzing some fields (introducing typos, blanking some fields, etc). The result contains about 10% duplicates, but they're all at the end (I've simply concatenated the dupes, not interspersed them). This shape may have exacerbated the issue.

sonalgoyal commented on June 17, 2024

I see. The pairs to be labeled are found by looking through the entire dataset sample and do not depend on where the records are placed. How many times did you run the findTrainingData and label phases? You can pick up from where you left off and continue until you have 30-40 pairs of actual matches.

Also, if you are not finding many matches in the label phase and if the findTrainingData phase is fast enough depending on the hardware you have deployed, you can also change labelDataSampleSize to 0.2 so it will look at twice the sample to fetch the pairs.
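
As a rough sketch of that change, the snippet below rewrites labelDataSampleSize in the JSON config. It assumes the setting is a top-level key in the config file and that the value is a fraction between 0 and 1; the function name is made up for this example.

```python
import json


def set_label_sample_size(conf_path, size):
    """Set labelDataSampleSize in a Zingg JSON config and return the config.

    Assumes labelDataSampleSize is a top-level key and a fraction in (0, 1].
    """
    if not 0 < size <= 1:
        raise ValueError("size is expected to be a fraction in (0, 1]")
    with open(conf_path) as f:
        conf = json.load(f)
    conf["labelDataSampleSize"] = size
    with open(conf_path, "w") as f:
        json.dump(conf, f, indent=2)
    return conf


# Example usage (config path from this thread):
# set_label_sample_size("/tmp/research/zingg/tsmart_config.json", 0.2)
```

After the change, findTrainingData and label would need to be run again so that the larger sample is actually used.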

I will update the documentation to make things clearer.
