larsga / duke Goto Github PK

View Code? Open in Web Editor NEW

613.0 613.0 194.0 3.46 MB

Duke is a fast and flexible deduplication engine written in Java

License: Apache License 2.0

Java 99.88% HTML 0.12%

duke's People

Stargazers

Watchers

Forkers

fabriziofortino leprinco sobolk dmnpignaud jsalim jkot dwag mpiacenza ducky427 fgc haoyuan snazz2001 yangxu02 cklee75 davechan ziodave mpmanesh pombredanne ztsmith ryanhnkim carycorreia antonimmo swamikevala nhambletccri anujsrc tupilabs jffist chornyi szydan shanefitzgibbon tmarthal dineshkumarsarangapani lexsf liangjiema rootfs-analytics efesler muschneider calamarbicefalo dbose chandranil2606 salahoukoud qtips frictionlesscoin hhstechgroup ceperez ecoutu datashed-lewis nmadhire nmataro anandgsal usbrandon programaths ravimanik nventdata pwangjing seankilleen tankclassxer sxyxm willschipp mneedham bharathsubash markodjurovic firatkarakusoglu pelumi cordje jerry33 stg-247 eugenek1978 jatin7 drdebian carljmosca feristhia gaurav46 johnconnelly75 ofergold jamespr civics sbasant-cmx prashanthcz esmundoz bensu fgregg ntbti pankajpawan anandsemtech riversandtechnologies fenfenxu simonschlosser enricopal lfoppiano dilaver fgiasson olamy nhoussos rlugojr federicomorando shivam5992 containerz gilfernandes ivangrod

duke's Issues

Make Maven repository for Duke

From [email protected] on June 10, 2011 21:51:39

There should be a Maven repository for Duke where people can get the .jar

Original issue: http://code.google.com/p/duke/issues/detail?id=21

Match entities against fix index

From [email protected] on November 04, 2011 10:17:28

For our application I'd like to build once a day an index of our entity database and match new entities (online) against this index.

Is this setting supported by duke?

Original issue: http://code.google.com/p/duke/issues/detail?id=49

SPARQL data source hangs

From Michael.Hausenblas on May 21, 2011 16:46:06

What steps will reproduce the problem? 1. java no.priv.garshol.duke.Duke --showmatches dogfood.xml What is the expected output? What do you see instead? When I do a kill -SIGQUIT {PID} I get the following trace:

2011-05-21 15:39:25
Full thread dump Java HotSpot(TM) 64-Bit Server VM (19.1-b02-334 mixed mode):

"Low Memory Detector" daemon prio=5 tid=10184e800 nid=0x108b69000 runnable [00000000]
java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=9 tid=10184d000 nid=0x108a66000 waiting on condition [00000000]
java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=9 tid=10184b800 nid=0x108963000 waiting on condition [00000000]
java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=9 tid=10184a800 nid=0x108860000 waiting on condition [00000000]
java.lang.Thread.State: RUNNABLE

"Surrogate Locker Thread (CMS)" daemon prio=5 tid=101849000 nid=0x10875d000 waiting on condition [00000000]
java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=8 tid=101830000 nid=0x108643000 in Object.wait() [108642000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <7f3001300> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
- locked <7f3001300> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=10182f000 nid=0x108532000 in Object.wait() [108531000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <7f30011d8> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:485)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
- locked <7f30011d8> (a java.lang.ref.Reference$Lock)

"main" prio=5 tid=101801800 nid=0x100501000 runnable [100500000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream. read1 (BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
- locked <7f3e12df0> (a java.io.BufferedInputStream)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
- locked <7f3e10418> (a sun.net.www.protocol.http.HttpURLConnection)
at no.priv.garshol.duke.SparqlClient.getResponse(SparqlClient.java:52)
at no.priv.garshol.duke.SparqlClient.execute(SparqlClient.java:30)
at no.priv.garshol.duke.SparqlDataSource$SparqlIterator.fetchNextPage(SparqlDataSource.java:106)
at no.priv.garshol.duke.SparqlDataSource$SparqlIterator.next(SparqlDataSource.java:92)
at no.priv.garshol.duke.SparqlDataSource$SparqlIterator.next(SparqlDataSource.java:43)
at no.priv.garshol.duke.Duke.main(Duke.java:82)

"VM Thread" prio=9 tid=10182a000 nid=0x10842f000 runnable

"Gang worker#0 (Parallel GC Threads)" prio=9 tid=101804800 nid=0x1007c7000 runnable

"Gang worker#1 (Parallel GC Threads)" prio=9 tid=101805800 nid=0x1017cc000 runnable

"Concurrent Mark-Sweep GC Thread" prio=9 tid=101808000 nid=0x1080b6000 runnable
"VM Periodic Task Thread" prio=10 tid=101850800 nid=0x108c6c000 waiting on condition

"Exception Catcher Thread" prio=10 tid=101802800 nid=0x100604000 runnable
JNI global references: 1704

Heap
par new generation total 19136K, used 14785K [7f3000000, 7f44c0000, 7f44c0000)
eden space 17024K, 86% used [7f3000000, 7f3e707e8, 7f40a0000)
from space 2112K, 0% used [7f40a0000, 7f40a0000, 7f42b0000)
to space 2112K, 0% used [7f42b0000, 7f42b0000, 7f44c0000)
concurrent mark-sweep generation total 63872K, used 0K [7f44c0000, 7f8320000, 7fae00000)
concurrent-mark-sweep perm gen total 21248K, used 8748K [7fae00000, 7fc2c0000, 800000000) What version of the product are you using? On what operating system? Using duke-0.2-SNAPSHOT.jar built from source with Java version "1.6.0_24" on Mac OS X 10.5.8

Original issue: http://code.google.com/p/duke/issues/detail?id=16

PersonNameComparator: handling of short words

From [email protected] on October 28, 2011 02:06:13

I may be wrong but it seems the PersonNameComparator has a couple of bugs:

The execution does not really reach the else if responsible for short tokens handling:
line 88
} else if (t1[ix].length() + t2[ix].length() <= 4)
// it's not an initial, so if the strings are 4 characters
// or less, we quadruple the edit dist
d = d * 4;
else
In line 72, t1.length needs to be t2.length? As t1 is always the longer token.
} else if (d > 1 && (ix + 1) <= t1.length)

What do you think?

Original issue: http://code.google.com/p/duke/issues/detail?id=45

Support NTriples test files

From [email protected] on September 04, 2011 15:39:58

We should support loading NTriples test files.

Original issue: http://code.google.com/p/duke/issues/detail?id=38

Support SPARQL test files

From [email protected] on September 04, 2011 15:40:28

That is, we should be able to query a SPARQL endpoint for definitive link information.

Original issue: http://code.google.com/p/duke/issues/detail?id=39

Duke should be version-stamped

From [email protected] on September 09, 2011 08:27:32

The META-INF file should have the version number, as well as methods in the Duke API. The command-line client should also be able to print the version number.

Original issue: http://code.google.com/p/duke/issues/detail?id=42

Set up a standard performance test

From [email protected] on September 04, 2011 15:21:15

Need this in order to be able to judge the performance impact of various changes.

Original issue: http://code.google.com/p/duke/issues/detail?id=34

Weighted Levenshtein

From [email protected] on October 01, 2011 08:46:49

We need to be able to treat some edits as larger than others. Particularly edits involving numbers are important.

Original issue: http://code.google.com/p/duke/issues/detail?id=44

Formalize and document API

From [email protected] on May 21, 2011 10:05:06

Need to do some refactoring to ensure that the API for embedding Duke is optimal. Ideally the driving loop should be implemented only once. Also, the Deduplicator and Database should be merged. And the code to retrieve a fully functional database should be simpler.

Original issue: http://code.google.com/p/duke/issues/detail?id=14

Test file should be in CSV format

From [email protected] on September 04, 2011 15:39:38

Having a special test file format is idiotic. We need to change over to using just plain CSV files.

Original issue: http://code.google.com/p/duke/issues/detail?id=37

The fixed-size search result loses matches

From [email protected] on May 22, 2011 11:29:14

At the moment we never retrieve more than 50 search results from Lucene. We need to change this to use a variable-size result set that gets sized dynamically.

Original issue: http://code.google.com/p/duke/issues/detail?id=17

Save/Retrieve multiple property values in the lucene database

From [email protected] on June 13, 2011 08:32:15

Hello,

I would like to indexing a record in which a property has >1 values.
At the moment I can see that only 1st value gets saved in the lucene database:

String value = record.getValue(propname);

So that when a candidate is retrieved from the database all other values are lost and the comparison is not what I'd like it to be.

Would it be possible to correctly save/retrieve the whole collection of property values in the database?

Thanks

Original issue: http://code.google.com/p/duke/issues/detail?id=22

Try opening reader directly from the writer

From [email protected] on August 25, 2011 21:30:46

One of the costliest operations we perform right now is IndexWriter.commit(), and in fact we introduced the whole troublesome batching concept specifically to be able to live with this limitation. It's possible to open a special reader from a writer to get "near real-time" searching, and we should try out whether this works better. http://lucene.apache.org/java/3_3_0/api/core/org/apache/lucene/index/IndexReader.html#open(org.apache.lucene.index.IndexWriter , boolean)

Original issue: http://code.google.com/p/duke/issues/detail?id=31

Merge matching records and use this for improved matching

From [email protected] on April 02, 2011 12:15:15

Once two records have been found to match this can be exploited to go back and see if adding more information about the entity allows us to find more matches.

Original issue: http://code.google.com/p/duke/issues/detail?id=4

Lookup property analysis must consider maybeThreshold

From [email protected] on June 10, 2011 21:51:00

If the maybe threshold is set, that should be used for lookup property analysis rather than the certain threshold.

Original issue: http://code.google.com/p/duke/issues/detail?id=20

Validation of config file

From [email protected] on May 21, 2011 11:22:31

We need to do some kind of structural validation of the config file to help users get the format right. The best choice is to use RELAX-NG, but this involves an extra dependency (Jing). We could try using a DTD, if we can convince the crap standard parser to load our DTD and not listen to the document.

Original issue: http://code.google.com/p/duke/issues/detail?id=15

Proper tests for handling of null and ""

From [email protected] on September 05, 2011 14:41:22

In RecordImpl and elsewhere.

Original issue: http://code.google.com/p/duke/issues/detail?id=41

Support for SPARQL link database

From [email protected] on September 04, 2011 15:59:00

That is, use the LinkDatabase interface to maintain same-as statements in a triple store via SPARQL and SPARQL Update.

Original issue: http://code.google.com/p/duke/issues/detail?id=40

Add persistent store management actions to command-line client

From [email protected] on April 02, 2011 12:24:55

The command-line client needs to be able to implement at least the "reindex" action. Other actions may be needed, too.

Original issue: http://code.google.com/p/duke/issues/detail?id=6

Support for more complex string matching

From [email protected] on April 02, 2011 12:07:02

At the moment we only support string equality matching. Need to extend this with support for tokenized matching and Levenshtein distance matching.

Original issue: http://code.google.com/p/duke/issues/detail?id=2

Support for Hadoop processing

From [email protected] on September 04, 2011 15:25:27

Longer-term we should be able to farm out processing work to Hadoop clusters.

Original issue: http://code.google.com/p/duke/issues/detail?id=36

Implement proper escaping in NTriplesWriter

From [email protected] on August 31, 2011 12:09:20

This writer is now being used for real, so we need to complete it and add full support for escaping etc

Original issue: http://code.google.com/p/duke/issues/detail?id=33

Must handle all properties having the same probability

From [email protected] on June 10, 2011 21:50:20

It may happen that the user sets all properties to the same probability. We must handle this case correctly. (Whatever correctly might be. Not sure yet.)

Original issue: http://code.google.com/p/duke/issues/detail?id=19

Design and implement XML configuration language

From [email protected] on April 02, 2011 12:16:37

We need to be able to write up configurations in an XML format, and to be able to load these configurations.

Original issue: http://code.google.com/p/duke/issues/detail?id=5

Can we add support for TFIDF matching?

From [email protected] on August 25, 2011 12:09:49

Term frequency matching is generally considered the best string matching approach, but requires a source of information about term frequencies. Can we come up with some way to add this?

Original issue: http://code.google.com/p/duke/issues/detail?id=27

Support exporting link files to NTriples format

From [email protected] on May 24, 2011 16:51:47

We need to be able to write link files in NTriples format, to make it easier to work with Linked Data.

Original issue: http://code.google.com/p/duke/issues/detail?id=18

Why not just extend Lucene scoring?

From [email protected] on June 18, 2011 03:00:44

Just curious. You seem to be doing Bayes calculation after getting results from Lucene. Why not implement your own scoring instead? Wouldn't that work? Like - https://issues.apache.org/jira/browse/LUCENE-2091

Original issue: http://code.google.com/p/duke/issues/detail?id=24

Character encoding in NTriples parser

From [email protected] on May 11, 2011 15:08:31

We need to fix this parser so that it (a) only allows US-ASCII character literals (as per spec) and (b) interprets character escapes correctly.

Original issue: http://code.google.com/p/duke/issues/detail?id=12

Fuzzy search in Lucene

From [email protected] on August 24, 2011 10:52:59

Apparently it's possible to do fast fuzzy searches in Lucene 3.x. Need to find out how. Keywords are "ngram index" and "spellcheck". Haven't found anything yet, but need to see if there is a way to do this.

Original issue: http://code.google.com/p/duke/issues/detail?id=26

Add MatchListener method for debugging record scores

From [email protected] on June 13, 2011 08:50:13

Would it be possible to add a method in the MatchListener which would be called regardless of whether a match has been identified or not. For debugging and fine-tuning purposes it would be nice to see what kind of probabilities each record scores. It would be even better if each property showed what probability it scored on its own.

Thanks

Original issue: http://code.google.com/p/duke/issues/detail?id=23

Implement SPARQL data source

From [email protected] on April 07, 2011 16:18:45

We need a SPARQL data source so that we can process data from SPARQL stores. Note that this needs to work in two different modes: (1) batch mode where we get all the data (probably with some kind of paging), and (2) incremental mode where an outside source tells us what resources to process.

Original issue: http://code.google.com/p/duke/issues/detail?id=9

Better configuration options for handling link output

From [email protected] on August 25, 2011 21:27:15

It should be possible to choose an output for the links, whether it be JDBCLinkDatabase or one of the other alternatives. Need to consider how, though.

Original issue: http://code.google.com/p/duke/issues/detail?id=30

Support for qgrams matching

From [email protected] on August 25, 2011 12:10:16

Need to implement a qgrams comparator.

Original issue: http://code.google.com/p/duke/issues/detail?id=28

Design data source API

From [email protected] on April 02, 2011 12:10:13

Design the data source API and implement two (CSV & JDBC) data sources so that we can try it out. Also make sure it will support the RDF push use case.

Original issue: http://code.google.com/p/duke/issues/detail?id=3

Proper command-line syntax parsing

From [email protected] on April 09, 2011 11:03:27

Need this in order to add useful control switches to the command-line client.

Original issue: http://code.google.com/p/duke/issues/detail?id=11

Build command-line client

From [email protected] on April 02, 2011 12:06:15

We need to build a command-line client that supports CSV and JDBC data sources so that we can try out the basic engine and configuration to ensure that performance is acceptable.

Original issue: http://code.google.com/p/duke/issues/detail?id=1

Should be possible for data sources to assert difference

From [email protected] on November 01, 2011 11:23:39

It should be possible to assert A owl:differentFrom B in a data source, and for this to prevent Duke from ever claiming that A owl:sameAs B.

The JDBCLinkDatabase component in Duke already supports this. If a row (A, B, DIFFERENT, ASSERTED) were to appear in the database, Duke would never add an owl:sameAs between A and B. However, Duke cannot now get this information from the UMIC and into the link database.

To add support for that we'd need to:

Add a Collection getLinks() method to the Record interface, so that records can arrive in Duke with pre-known link information.
Add support for populating this data to individual data sources.

Oh, and we also need to update the code so that this gets written correctly to the link database.

Original issue: http://code.google.com/p/duke/issues/detail?id=46

Turn Database into an interface

From [email protected] on September 18, 2011 13:37:06

After all, we could conceivably create more than just one Database backend. For example, for smaller datasets we could simply keep all records and do full n x n matching.

Original issue: http://code.google.com/p/duke/issues/detail?id=43

Configure Database connection via JNDI

From [email protected] on November 04, 2011 09:21:45

We have a web application were all database connections are configured via JNDI. This allows us, for example, to set up different database connections for the test and our production system without maintaining different war files.

A datasource configuration could look like this:

<column name=.../>

The actual data source would be configured within the context of the web application's servlet container.

In any case: nice project!

Original issue: http://code.google.com/p/duke/issues/detail?id=47

Upgrade to Lucene 3.3

From [email protected] on August 25, 2011 21:31:33

Should also test the performance difference in doing so.

Original issue: http://code.google.com/p/duke/issues/detail?id=32

Open a discussion forum

From [email protected] on November 04, 2011 10:05:48

Is there an open discussion forum for duke?

Original issue: http://code.google.com/p/duke/issues/detail?id=48

Implement LinkDatabase interface persistently

From [email protected] on April 07, 2011 16:05:02

Either on top of H2, or on top of something simpler. Could be just a b-tree implementation, really.

Original issue: http://code.google.com/p/duke/issues/detail?id=7

Support for multithreaded processing

From [email protected] on September 04, 2011 15:22:56

We should be able to use threads to make use of all the processor cores in modern machines. Below is an outline of how it might be done.

one thread runs the data source and collects records from there into
a queue.

another set of threads collects records from the queue and indexes
them. it seems that multiple threads doing indexes should work. http://darksleep.com/lucene/ once indexed the records are stuffed
into a second queue.

a pool of threads picks records from the second queue and does the
matching on them

Original issue: http://code.google.com/p/duke/issues/detail?id=35

setBeanProperty must support type java.lang.Integer

Now it tries to look that up as an object by reference.

Support writing links to SPARQL endpoint

From [email protected] on August 25, 2011 21:26:37

We'll need to handle different dialects, but that should be doable.

Original issue: http://code.google.com/p/duke/issues/detail?id=29

Implement SDshare server

From [email protected] on April 07, 2011 16:13:53

Plug in as backend to some SDshare server framework, and build it on top of the LinkDatabase.

Original issue: http://code.google.com/p/duke/issues/detail?id=8

Normalizer should strip accents

From [email protected] on August 22, 2011 11:10:46

One way to do it is given here: http://blog.smartkey.co.uk/2009/10/how-to-strip-accents-from-strings-using-java-6/

Original issue: http://code.google.com/p/duke/issues/detail?id=25

Build SDshare client backend

From [email protected] on April 07, 2011 16:22:00

Need to implement a backend to the Ontopia SDshare client which will receive snapshots and fragments and use these to trigger processing via the SPARQL data source.

Original issue: http://code.google.com/p/duke/issues/detail?id=10

Finalize Jaro-Winkler implementation

From [email protected] on May 20, 2011 15:29:44

The current implementation does not include the final three JW adjustments described by Yancey. Also, it does not include the full battery of tests that the LingPipe people documented. Get all this in.

Original issue: http://code.google.com/p/duke/issues/detail?id=13

larsga / duke Goto Github PK

duke's People

Stargazers

Watchers

Forkers

duke's Issues

Recommend Projects

Recommend Topics

Recommend Org