
Title disambiguation about patcit (open, 7 comments)

cverluise avatar cverluise commented on May 8, 2024
Title disambiguation

from patcit.

Comments (7)

gg4u avatar gg4u commented on May 8, 2024 1

I'll take on this issue.

A clean and transparent disambiguation would definitely be a strong plus.

My approach will be to use a Jupyter notebook to explore possible methods.

Other contributors could publish their new strategies there.

Once a method is chosen, we can create a final script to be imported into the library.

I am creating a notebook now, after having dealt with some issues with the Google BigQuery client.
I will also document in the notebook how to solve them (someone might have the same problems).


cverluise avatar cverluise commented on May 8, 2024

Hello @gg4u ,

welcome on board!

We've been thinking about an exotic strategy for this question.

The idea would be the following:

  1. Get unique title_j (about 1.5 million) and their number of occurrences
  2. Send these unique title_j to the Google Knowledge Graph API (API reference here). Restricting to "types"="Periodical" (extend to "Thing" if you don't get anything from "Periodical") and to the 3 highest-scoring results seems satisfying at first sight. This should be battle-tested.
  3. You end up with a table as follows:

| title_j        | g_id                     |
| -------------- | ------------------------ |
| Nature         | [g_id-1, g_id-2]         |
| Nature journal | [g_id-1, g_id-3, g_id-4] |
| Science        | [g_id-5]                 |

In brief, there are keys (the g_id) with a list of candidate values (the title_j) for each of these keys.

  4. Iterate over keys by decreasing number of potential matches and select the true candidates

NB: Although 4. might seem tedious, we have been investigating an efficient technical solution using Label Studio and have had great support from the community. See here for their guidelines. Importantly, this would provide high-quality disambiguation.
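Step 2 above could be sketched against the public Knowledge Graph Search API. This is a sketch under assumptions: the helper names (`build_kg_url`, `parse_kg_response`) and the sample entity id are illustrative, not part of patcit.

```python
# Sketch of step 2, using the public Google Knowledge Graph Search API.
# Helper names and the sample data are illustrative.
from urllib.parse import urlencode

KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"

def build_kg_url(title, api_key, types="Periodical", limit=3):
    """Build a Knowledge Graph Search request URL for one journal title."""
    params = {"query": title, "types": types, "limit": limit, "key": api_key}
    return KG_ENDPOINT + "?" + urlencode(params)

def parse_kg_response(payload):
    """Extract (machine id, name, score) triples from a KG Search response."""
    out = []
    for element in payload.get("itemListElement", []):
        result = element.get("result", {})
        out.append((result.get("@id"), result.get("name"),
                    element.get("resultScore")))
    return out

# Actual call (needs a valid API key):
# import json, urllib.request
# with urllib.request.urlopen(build_kg_url("Nature", MY_API_KEY)) as resp:
#     candidates = parse_kg_response(json.load(resp))
```

Keeping the 3 highest-scoring results then amounts to `limit=3`, with a fallback request using `types="Thing"` when the "Periodical" query returns nothing.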

If you could investigate this idea, that would be great. In any case, any idea is most welcome!

Thanks


gg4u avatar gg4u commented on May 8, 2024

Hi @cverluise , thank you.

I started a jupyter notebook before your note.

Please help me clarify a few things.

I looked at the BigQuery db; title_j is the title of a journal.
What is title_m? It is None for all rows in your example query:


q = """
    SELECT
      DISTINCT(title_j), title_m
    FROM
      `npl-parsing.patcit.beta`
    WHERE
      LOWER(title_j) LIKE "%ibm%"
    ORDER BY
      title_j DESC
"""
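For reference, the query could be run with the official google-cloud-bigquery client; this is a sketch under assumptions (`build_title_query` is a hypothetical helper, and credentials plus billing are assumed to be configured).

```python
# Sketch of running the query above with the official google-cloud-bigquery
# client (`pip install google-cloud-bigquery`); build_title_query is a
# hypothetical helper, not part of patcit.

def build_title_query(pattern):
    """Build the journal-title query for a LIKE pattern, e.g. 'ibm'."""
    return f"""
    SELECT
      DISTINCT(title_j), title_m
    FROM
      `npl-parsing.patcit.beta`
    WHERE
      LOWER(title_j) LIKE "%{pattern.lower()}%"
    ORDER BY
      title_j DESC
    """

# from google.cloud import bigquery
# client = bigquery.Client()
# for row in client.query(build_title_query("ibm")).result():
#     print(row.title_j, row.title_m)
```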

What is a g_id key? I cannot find it in the schema of the db.

Is it an arbitrary id of a record? E.g. could it be a hash key of a record for the query above?
Or is it the unique id of a journal?

The task is entity matching.
Do you want to reconcile entities against the Google Knowledge Graph?

Any particular reason for not choosing Wikidata?

I looked at the Google Knowledge Graph (KG) reference:
https://developers.google.com/knowledge-graph/reference/rest/v1
but it seems it might not be complete.

As an example, I queried a search engine for IBM Systems Journal and found it on the web.
The KG instead points me to IBM Journal of Research and Development, which is a different entity according to the search engine.

By contrast, on Wikidata I can find the two distinct entities:

https://www.wikidata.org/wiki/Q15760627?wprov=srpw1_0

https://www.wikidata.org/wiki/Q15753899?wprov=srpw1_0

In my opinion Wikidata can be a good choice (the Google KG is based on Wikidata anyway, although I don't know why the query above reconciles differently).
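Wikidata lookups like the two above can be scripted against the public wbsearchentities endpoint (no API key needed). A minimal sketch; helper names are illustrative, and the QID in the test data is the IBM Systems Journal item linked above.

```python
# Sketch of the Wikidata alternative via the public wbsearchentities API.
from urllib.parse import urlencode

WD_ENDPOINT = "https://www.wikidata.org/w/api.php"

def build_wd_url(title, limit=3):
    """Build a Wikidata entity-search URL for one journal title."""
    params = {"action": "wbsearchentities", "search": title,
              "language": "en", "format": "json", "limit": limit}
    return WD_ENDPOINT + "?" + urlencode(params)

def parse_wd_response(payload):
    """Extract (QID, label) pairs from a wbsearchentities response."""
    return [(hit.get("id"), hit.get("label"))
            for hit in payload.get("search", [])]

# import json, urllib.request
# with urllib.request.urlopen(build_wd_url("IBM Systems Journal")) as resp:
#     print(parse_wd_response(json.load(resp)))
```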

What shall we do with them?

  • One option would be to point all the journals to the IBM entity.
  • Another option would be to manually create a list of all IBM journals, as labels to then train the suggestions. I thought Wikidata might come to the rescue.

In the query above, I get dirty results: it seems that together with a journal title, I also get part of an article title. I can handle that, but I wonder what people using Label Studio should do with it.

Shall they select from a list of suggested labels (journal titles) for each item?
How would they choose the right journal when there are several options (as IBM Systems Journal and IBM Journal of Research and Development may be)?

Please let me know how people should work with it; it will help me model the output.

So far I created a matrix like the table you suggested, of N x N items.
A list of the correct journals would help, for one could create a matrix M x N, where M (number of journals) << N (number of rows) (at least, I expect so).


To update you on what I am working on:
I'm exploring methods to compute differences and similarities between strings.

The approach is endogenous (the only information comes from the string itself; I don't know what it represents). It might be expanded with

I am exploring different approaches with tokenizers, n-grams and collations. I will select the ones offering the best results; sharing ideas on the points above (see point 3) will help.
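As a minimal illustration of the n-gram direction, a character-trigram Jaccard similarity (one possible endogenous measure among those being explored, not the chosen method):

```python
# Minimal character-trigram Jaccard similarity between strings.

def ngrams(s, n=3):
    """Set of lowercase character n-grams of a string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

good = jaccard("Ibm Technical Disclosure Bulletin",
               "Ibm Tecnical Disclosure Bulletin")   # near-duplicate, high
weak = jaccard("Ibm Technical Disclosure Bulletin", "Ibm Tdb")  # low
```

Misspelled variants of one journal title score much higher against each other than against abbreviations or unrelated strings, which is the property needed here.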

Having a list of final labels would help, for I could train against them.
Otherwise, I could work on other ways to cluster them.

For one entity-matching job (granular food ingredients, about 10^4 raw distinct records reduced to 10^3), I ended up creating an interface in Wolfram Mathematica to get a fine-grained overview of the automatic clustering. I could not do without it, and it is useful to know how the community will interact (see point 4).


cverluise avatar cverluise commented on May 8, 2024

Hello,

thanks a lot for your feedback!

on 1.
title_m: title of the item holding the NPL -- for non-journal items only, e.g. conferences, proceedings, etc.

The schema of the table is detailed in the "Schema" pane on GBQ https://console.cloud.google.com/bigquery?project=npl-parsing&p=npl-parsing&d=patcit&t=v02_npl&page=table

It is also detailed here: https://grobid.readthedocs.io/en/latest/training/Bibliographical-references/

on 2.

At this point there is no specific reason to favor the knowledge graph API. So, if you find that wikidata can do better, feel free to experiment!

On 3.

That's why I suggest to:

  1. keep the k best candidate values (e.g. k=3)
  2. validate the key-value matching by hand using a custom Label Studio setting (see my points 3 and 4 above + the guidelines)

In any case, we cannot fully rely on the output of the Wikidata / Knowledge Graph / ... APIs.

On 4.

In the query above, I get dirty results: it seems that together with a journal title, I also get part of an article title.

No, the labeller will just accept the dirty result as long as it actually contains the info for the journal title.

The idea is that, from the API, you will have, let's say, 3 keys per ambiguous title_j. In the end, there will be multiple ambiguous title_j (values) with a common key. E.g. "IBM tech Disclosure" and "IBM technical disclosure" will certainly have a common key. From the API, each key is defined by a unique id (what I call g_id in the case of the Google Knowledge Graph) and a clean title.

A list of the correct journals would help, for one could create a matrix M x N, where M (number of journals) << N (number of rows) (at least, I expect so).

We don't have that because patent to "science" citations are not standard. For example, IBM technical disclosure bulletin is absent from major academic databases.

By the way, as you should keep only the k (~ 3) most relevant API entity outputs, you should end up with a dict with "number of distinct title_j" keys x k values. Nb: the keys here will eventually be the "candidate values" and the values the "unique keys" of the labeling task. The final clustering should be high quality and will be done as much as possible by hand; that's why an efficient labeling environment is crucial.
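The key/value flip described in the Nb could look like this; ids and titles are illustrative, not real API output:

```python
# Sketch of the key/value flip: the API gives, per ambiguous title_j, up to
# k candidate entity ids; the labeling task groups the title_j variants
# under each unique id instead.
from collections import defaultdict

title_to_ids = {
    "IBM tech Disclosure": ["g_id-1", "g_id-2"],
    "IBM technical disclosure": ["g_id-1"],
    "Nature": ["g_id-3"],
}

def invert(title_to_ids):
    """Group candidate title_j variants under each unique entity id."""
    id_to_titles = defaultdict(list)
    for title, ids in title_to_ids.items():
        for gid in ids:
            id_to_titles[gid].append(title)
    return dict(id_to_titles)
```

After inversion, each unique id carries all the ambiguous spellings to validate together in the labeling tool.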

Hope it helps. If it is still unclear, we can have a call if you want.

Thanks, you are doing great work!

Cheers!


gg4u avatar gg4u commented on May 8, 2024

Hi,

Thanks, I had overlooked title_m in the schema.

Ok, I will look at wikidata or google knowledge graph.

Interesting. For an entity-matching job at my former startup, I had to match raw ingredients described by people (things like "little chunk tomatoes" and "pomodorini") to the corresponding ingredient in nutritional datasets. Complexity: 10^4 raw ingredients to 10^3 meta.

I dealt with the task by combining NLP with a GUI built in Wolfram Mathematica that dynamically selects a group of matching items, so a person can manually tick which suggestions match the queried string.

You might want to replicate something similar (@niklub, if it is of interest to you too), for I found that approach very useful and effective (and also boring if you are the only one doing the fine-graining, but with a community the process goes fast).

Ok.

From the API, each key is defined by a unique id (what I call g_id in the case of the Google Knowledge Graph) and a clean title.

So:

There is a g_id, which is the matching id in a knowledge graph.
We will use the corresponding title from the chosen knowledge graph as the label.

you should keep only the k (~ 3) most relevant API entity outputs,

Mm, here I think we might have longer outputs.
I looked at the plot of the "decay" of the ranking, and the approach I used in my other work is to take all of the items up to the point where the proximity distribution "bends enough" (an elbow in the curve).

This approach worked reasonably well.
But if you don't have many titles, you can also take them all.
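The elbow cutoff described above can be sketched as follows; the scores are illustrative:

```python
# Sketch of the elbow cutoff: keep ranked candidates up to the largest drop
# in the descending score curve.

def elbow_cut(scores):
    """Return how many leading items to keep, cutting at the biggest drop."""
    if len(scores) < 2:
        return len(scores)
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    return drops.index(max(drops)) + 1

ranked = [0.92, 0.88, 0.85, 0.31, 0.29]  # clear elbow after the third item
```

With a fixed k, this simply becomes `min(k, elbow_cut(scores))`.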

Nb: the keys here will eventually be the "candidate values" and the values the "unique keys" of the labeling task

Please give me an example.

I raise this one:


 ('Ibm Corp', 'Ibme Technical Disclosure Bulletin', 0.10831550828287981),
 ('Ibm Corp', 'Ibm Tecnical Disclosure Bulletin', 0.10831550828287981),
 ('Ibm Corp', 'Ibm Technical Dosclosure Bulletin', 0.10831550828287981),
 ('Ibm Corp', 'Ibm Technical Disclusure Bulletin', 0.10831550828287981),
 ('Ibm Corp', 'Ibm Technical Disclosure Bulletin', 0.10831550828287981),
 ('Ibm Corp', 'Ibm Technical Disclossure Bulletin', 0.10831550828287981),
 ('Ibm Corp', 'Ibm Tdb', 0.07898279407844809),

In the example above, I compute a proximity between the labels. We humans know that Ibm Technical Disclosure Bulletin is the label, but how do we know, looking only at the strings? Why not Ibm Corp?

A few options.

a)
I thought to look at frequency: if Ibm Technical Disclosure Bulletin or a similar string occurs most often, then it is likely a journal.

b)
Another approach may be that a human queries for the label - 'Ibm Tecnical Disclosure Bulletin' - and is prompted with all suggestions to match with the true label, instead of the opposite (being prompted with the true labels for an entity).
(See point 3.)

c)
I considered that we don't know the journals.
So another approach may be to select a sample (randomly picked?) of values from the 1.5 million and run queries against a knowledge graph for each of them. For the ones that get a result (not None), we use the corresponding g_id and true title as a label.
This way we could also build a training set.
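Option (a) above, picking the most frequent normalized variant as a provisional label, could be sketched like this; the cluster data is illustrative:

```python
# Sketch of option (a): among the variant strings of one cluster, take the
# normalized form that occurs most often as the provisional label.
from collections import Counter

def most_frequent_label(variants):
    """Pick the most common normalized variant as the cluster label."""
    counts = Counter(v.strip().lower() for v in variants)
    return counts.most_common(1)[0][0]

cluster = ["Ibm Technical Disclosure Bulletin",
           "Ibm Technical Disclosure Bulletin",
           "Ibm Tecnical Disclosure Bulletin",
           "Ibm Tdb"]
```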

Please let me know what you think and whether we understand each other on these points.


Question:
Google BigQuery may carry costs.
I wonder if this project is somehow supported by a university, your research center, or your PhD.
May I ask for in-kind support for Google BigQuery usage? E.g. a quota.

Get unique title_j (about 1.5 million) and their number of occurrences

Or could you share a table? (WeTransfer may work, but since we are working with BigQuery it would also be nice to just use that :) )

Hope it helps. If it is still unclear, we can have a call if you want.

Yes, I wrote you an email proposing a call to share some thoughts
(see the email from nifty.works).

Glad to connect!


cverluise avatar cverluise commented on May 8, 2024

Hello @gg4u,

Hope you are doing well.

Any news to share?

Cheers


cverluise avatar cverluise commented on May 8, 2024

This could help a lot for this kind of task:
https://github.com/dedupeio

