
Comments (13)

redterror commented on June 17, 2024

Sure! Happy to try a debug build.

from zingg.

sonalgoyal commented on June 17, 2024

It seems that the blocking functions are giving a null value. Did you train on the same data on which you are running the match? Do you have many null values in the data but not in the training set?

redterror commented on June 17, 2024

Did you train on the same data on which you are running the match?

Yes, I used the same CSV as the input to both.

Do you have many null values in the data but not in the training set?

Maybe? The source data certainly has several nullable fields, but I don't know how they're represented in the training data. I tried to lean towards more training data by setting labelDataSampleSize to 0.1, but maybe that wasn't the right thing?

Edit: since id is the only nullable field, does that imply there's a record w/o an id?

sonalgoyal commented on June 17, 2024

Hard to say; sorry, the logs could be better at pinpointing exactly what went wrong. I can give you a debug build to track it down. Will that work for you?

sonalgoyal commented on June 17, 2024

Great. Please give me a few hours

sonalgoyal commented on June 17, 2024

Please use this and let me know how it goes: https://github.com/zinggAI/zingg/releases/tag/v0.3.0debug

redterror commented on June 17, 2024

That was definitely more informative (I'd encourage you to leave the extra debugging in by default or behind a flag), but seeing the row doesn't immediately tell me what's wrong. The error, lightly redacted:

 I have no name!@097be7d42402:/zingg-0.3.0-SNAPSHOT$ ./scripts/zingg.sh --phase match --conf /tmp/research/zingg/tsmart_config.json
2021-11-05 14:57:32,055 [main] WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 2021-11-05 14:57:32,316 [main] INFO  zingg.client.Client -
 2021-11-05 14:57:32,317 [main] INFO  zingg.client.Client - ********************************************************
 2021-11-05 14:57:32,317 [main] INFO  zingg.client.Client - *                    Zingg AI                           *
 2021-11-05 14:57:32,317 [main] INFO  zingg.client.Client - *               (C) 2021 Zingg.AI                       *
 2021-11-05 14:57:32,317 [main] INFO  zingg.client.Client - ********************************************************
 2021-11-05 14:57:32,317 [main] INFO  zingg.client.Client -
 2021-11-05 14:57:32,317 [main] INFO  zingg.client.Client - using: Zingg v0.3
 2021-11-05 14:57:32,318 [main] INFO  zingg.client.Client -
 2021-11-05 14:57:32,560 [main] WARN  zingg.client.Arguments - Config Argument is /tmp/research/zingg/tsmart_config.json
 2021-11-05 14:57:32,812 [main] WARN  zingg.client.Arguments - phase is match
 2021-11-05 14:57:35,917 [main] WARN  org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry - The function round replaced a previously registered function.
 2021-11-05 14:57:35,918 [main] INFO  zingg.ZinggBase - Start reading internal configurations and functions
 2021-11-05 14:57:35,932 [main] INFO  zingg.ZinggBase - Finished reading internal configurations and functions
 2021-11-05 14:57:35,981 [main] WARN  zingg.util.PipeUtil - Reading input csv
 2021-11-05 14:58:10,151 [main] INFO  zingg.Matcher - Read 932474
 2021-11-05 14:58:10,163 [main] DEBUG zingg.block.Block - returning schema after step 1 is StructType(StructField(z_zid,LongType,false), StructField(id,StringType,true), StructField(fname,StringType,true), StructField(mname,StringType,true), StructField(lname,StringType,true), StructField(address1,StringType,true), StructField(city,StringType,true), StructField(state,StringType,true), StructField(zip,StringType,true), StructField(homephone,StringType,true), StructField(cellphone,StringType,true), StructField(dob,StringType,true), StructField(gender,StringType,true), StructField(z_source,StringType,false), StructField(z_hash,IntegerType,false))
 2021-11-05 14:58:10,227 [main] INFO  zingg.Matcher - Blocked
 2021-11-05 14:58:10,670 [main] WARN  org.apache.spark.sql.catalyst.util.package - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
 2021-11-05 14:58:13,319 [Executor task launch worker for task 30] DEBUG zingg.block.Block - blocking row [338504,NY-000001111111,MELISSA,Y,XXXXXXX,XXX XXXX XX,BROOKLYN,NY,11228,null,null,19610101,Female,test]
 2021-11-05 14:58:13,325 [Executor task launch worker for task 30] ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 8.0 (TID 30)
 java.lang.NullPointerException
        at zingg.block.Block$BlockFunction.call(Block.java:409)
        at zingg.block.Block$BlockFunction.call(Block.java:396)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

I'm not sure if this is the issue, but the dumped schema specifies z_hash as its last field, and the printed row appears to be missing it.
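The NullPointerException at Block.java:409 is consistent with a blocking function being applied to a null field (here, homephone/cellphone are null in the dumped row). A hypothetical illustration of the failure mode and a null-safe variant — this is not Zingg's actual code, just a sketch of the pattern:

```python
def first3(value):
    # Naive blocking function: assumes the field is always present.
    # Raises on None, analogous to the NPE in the stack trace above.
    return value[:3]

def first3_null_safe(value):
    # Null-safe variant: map missing values into a sentinel bucket
    # instead of failing the whole Spark task.
    return value[:3] if value is not None else ""

row = {"homephone": None, "zip": "11228"}
first3_null_safe(row["homephone"])  # -> "" instead of raising
```

If the trained blocking tree picked a function like the first one for a nullable column it never saw nulls for during training, a single null row at match time would kill the executor task exactly as shown in the log.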

sonalgoyal commented on June 17, 2024

z_hash is an internal field that is built after applying the blocking. If the operation fails, it will not be added.
Looks like the cellphone/homephone fields are null in this row.
How many samples have you trained on? Did they have similar data?

sonalgoyal commented on June 17, 2024

Also, what is the size of the file under /tmp/research/tmp/models/100/model/block?

redterror commented on June 17, 2024

How many samples have you trained on? Did they have similar data?

I'm not sure how to answer that. I followed the instructions in running.md and used the command invocations therein.

Also, what is the size of the file under /tmp/research/tmp/models/100/model/block ?

Aha, a whopping 5 bytes. I guess something went wrong earlier in the process. I'll clear out the working space and start fresh.

sonalgoyal commented on June 17, 2024

Looks like the blocking model is blank (5 bytes). The findTrainingData and label phases are run until you have about 30-40 pairs of positive matches; then you run train and then match, or trainMatch to save the models and execute them in one go. If you have already labeled quite a few pairs by running findTrainingData and label, do not clear the space out; just start from where you left off. Hope this helps.
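The phase sequence described above might look like this, using the same `zingg.sh` invocation style and config path as the command earlier in the thread (paths are illustrative):

```shell
# Repeat this pair until ~30-40 positive pairs have been labeled
./scripts/zingg.sh --phase findTrainingData --conf /tmp/research/zingg/tsmart_config.json
./scripts/zingg.sh --phase label --conf /tmp/research/zingg/tsmart_config.json

# Then build the models and run matching
./scripts/zingg.sh --phase train --conf /tmp/research/zingg/tsmart_config.json
./scripts/zingg.sh --phase match --conf /tmp/research/zingg/tsmart_config.json

# Or combine the last two steps
./scripts/zingg.sh --phase trainMatch --conf /tmp/research/zingg/tsmart_config.json
```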

redterror commented on June 17, 2024

Aha, that explains it - when I labelled I never found any positive matches. That seems like it's a condition worth flagging, both in the docs and after labeling has exhausted its set of pairs.

It's possible the distribution of duplicates in my data tickles this behavior. This set comes from a known dupe-free file, combined with a slice of itself after fuzzing some fields (introducing typos, blanking some fields, etc.). The result contains about 10% duplicates, but they're all at the end (I simply concatenated the dupes rather than interspersing them). This shape may have exacerbated the issue.
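The test-set construction described above can be sketched as follows; shuffling the combined set would intersperse the duplicates instead of leaving them all at the end. All names here (`make_test_set`, `fuzz`, the field names) are illustrative, not from this thread's actual data pipeline:

```python
import random

def fuzz(row, rng):
    # Toy fuzzer: blank one field and introduce a typo in another.
    if rng.random() < 0.5 and row.get("mname"):
        row["mname"] = ""                        # blank a field
    if row.get("fname"):
        row["fname"] = row["fname"][:-1] + "X"   # simple typo
    return row

def make_test_set(rows, dupe_fraction=0.1, seed=42):
    """Append fuzzed copies of a slice of `rows`, then shuffle so the
    duplicates are interspersed rather than concatenated at the end."""
    rng = random.Random(seed)
    n_dupes = int(len(rows) * dupe_fraction)
    dupes = [fuzz(dict(r), rng) for r in rows[:n_dupes]]  # copy, then fuzz
    combined = rows + dupes
    rng.shuffle(combined)  # intersperse instead of concatenating
    return combined
```

That said, per the reply below the sampling does not depend on record order, so the shuffle mainly helps when eyeballing the data rather than when training.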

sonalgoyal commented on June 17, 2024

I see. The pairs to be labeled are found by looking through the entire dataset sample and do not depend on where the records are placed. How many times did you run the findTrainingData and label phases? You can pick up from where you left off until you have 30-40 pairs of actual matches.

Also, if you are not finding many matches in the label phase, and the findTrainingData phase is fast enough on the hardware you have deployed, you can change labelDataSampleSize to 0.2 so it looks at twice the sample to fetch the pairs.
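In the JSON config (the tsmart_config.json from the command above), that change is a single field; a minimal fragment, with the rest of the config omitted:

```json
{
  "labelDataSampleSize": 0.2
}
```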

I will update the documentation to make things clearer.
