Comments (7)
I'll take this issue on.
A clean and transparent disambiguation would definitely be a strong plus.
My approach will be to use a Jupyter notebook to explore possible methods.
Other contributors could publish their new strategies there.
Once a method is chosen, we can create a final script to be imported into the library.
I am creating a notebook now, after having dealt with some issues with the Google BigQuery client.
I will also document how to solve them in the notebook (someone might have the same problems).
from patcit.
Hello @gg4u ,
welcome on board!
We've been thinking about an exotic strategy for this question.
The idea would be the following:
1. Get unique `title_j` (about 1.5 million) and their number of occurrences.
2. Send these unique `title_j` to the Google Knowledge Graph API (API reference here). Restricting to `"types"="Periodical"` (extend to `"Thing"` if you don't get anything from `"Periodical"`) and to the 3 highest-scoring results seems to be satisfying at first sight. Should be battle-tested.
3. You end up having a table as follows:

| title_j | g_id |
|---|---|
| Nature | [g_id-1, g_id-2] |
| Nature journal | [g_id-1, g_id-3, g_id-4] |
| Science | [g_id-5] |

In brief, there are keys (the `g_id`) with a list of candidate values (the `title_j`) for each of these keys.

4. Iterate over keys by decreasing number of potential matches and select the true candidates.
NB: Although 4. might seem tedious, we have been investigating an efficient technical solution using Label Studio and had great support from the community. See here for their guidelines. Importantly, this would provide high-quality disambiguation.
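The Knowledge Graph step above could be sketched roughly as follows, stdlib only. `build_kg_url` and `top_candidates` are hypothetical helper names, and you need your own API key:

```python
import json
import urllib.parse
import urllib.request

KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"

def build_kg_url(title_j, api_key, types="Periodical", limit=3):
    """Build the Knowledge Graph Search API request URL for one title_j."""
    params = urllib.parse.urlencode({
        "query": title_j,
        "types": types,   # fall back to "Thing" if "Periodical" returns nothing
        "limit": limit,   # keep only the highest-scoring results
        "key": api_key,
    })
    return f"{KG_ENDPOINT}?{params}"

def top_candidates(title_j, api_key, types="Periodical", limit=3):
    """Return [(machine id, name, score), ...] for the best KG matches."""
    with urllib.request.urlopen(build_kg_url(title_j, api_key, types, limit)) as resp:
        data = json.load(resp)
    return [
        (el["result"]["@id"], el["result"].get("name"), el["resultScore"])
        for el in data.get("itemListElement", [])
    ]
```

The `@id` returned by the API would play the role of the `g_id` in the table above.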
If you could investigate this idea, that would be great. In any case, any idea is most welcome!
Thanks
Hi @cverluise , thank you.
I started a jupyter notebook before your note.
Please help me clarify a few things.
1. I looked at the BigQuery db; `title_j` is the title of a journal. What is `title_m`? It is `None` for all fields in your example query:
```python
q = """
SELECT
  DISTINCT(title_j), title_m
FROM
  `npl-parsing.patcit.beta`
WHERE
  LOWER(title_j) LIKE "%ibm%"
ORDER BY
  title_j DESC
"""
```
What is a `g_id` key? I cannot find it in the schema of the db.
Is it an arbitrary id of a record? E.g., could it be a hash key of a record for the query above?
Or is it the unique id of a journal?
2. The task is entity matching. Do you want to reconcile entities against the Google Knowledge Graph?
Any particular reason for not choosing Wikidata?
I looked at the Google Knowledge Graph (KG) reference:
https://developers.google.com/knowledge-graph/reference/rest/v1
but it seems it might not be complete.
As an example, I queried a search engine for `IBM Systems Journal` and found it on the web. Instead, the KG points me to `IBM Journal of Research and Development`, which is a different entity according to the search engine.
By contrast, on Wikidata I can find the two distinct entities:
https://www.wikidata.org/wiki/Q15760627?wprov=srpw1_0
https://www.wikidata.org/wiki/Q15753899?wprov=srpw1_0
In my opinion Wikidata could be a good choice (the Google KG is based on Wikidata anyway, although I don't know why the query above reconciles differently).
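For reference, Wikidata can be queried without an API key via its `wbsearchentities` endpoint; a minimal sketch (helper names are mine):

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def build_wd_url(title_j, limit=3):
    """Build a wbsearchentities request URL for one journal title."""
    params = urllib.parse.urlencode({
        "action": "wbsearchentities",
        "search": title_j,
        "language": "en",
        "type": "item",
        "limit": limit,
        "format": "json",
    })
    return f"{WIKIDATA_API}?{params}"

def wikidata_candidates(title_j, limit=3):
    """Return [(Q-id, label, description), ...] candidates from Wikidata."""
    with urllib.request.urlopen(build_wd_url(title_j, limit)) as resp:
        data = json.load(resp)
    return [
        (hit["id"], hit.get("label"), hit.get("description"))
        for hit in data.get("search", [])
    ]
```

Here the Q-id (e.g. Q15760627) would play the same role as the `g_id` you described.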
3. What shall we do with them?
- One option would be to point all the journals to the `IBM` entity.
- Another option would be to manually create a list of all IBM journals, as labels to then train the suggestions. I thought Wikidata might come to the rescue.
4. In the query above, I get dirty results: it seems that together with a journal title, I also get parts of article titles. I can handle that, but I wonder what people using Label Studio would do with it.
Shall they select a list of suggested labels (journal titles) for each item?
How would they choose the right journal when there are several options (as `IBM Systems Journal` and `IBM Journal of Research and Development` may be)?
Please let me know how people should work with it; it will help me model an output.
So far I have created a matrix like the table you suggested, of N x N items.
If one had a list of the correct journals, that would help, for one could create a matrix M x N, where M (number of journals) << N (number of rows), at least I expect that.
To update you on what I am working on:
I am exploring methods to compute differences and similarities between strings.
The approach is endogenous (the only information comes from the string itself; I don't know what it represents); it might be expanded later.
I am exploring different approaches with tokenizers, n-grams and collations. I will select the ones offering the best results; sharing ideas on the points above (see point 3) will help.
Having a list of final labels would help, for I could train against them.
Otherwise, I could work on other ways to cluster them.
For one entity-matching project (granular food ingredients, about 10^4 raw distinct records reduced to 10^3), I ended up creating an interface in Wolfram Mathematica to review the automatic clustering at a fine grain. I cannot do that now, but it is useful to know how the community will interact (see point 4).
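A minimal sketch of the character n-gram direction (one of the variants I am trying; the notebook may end up using something different):

```python
def char_ngrams(s, n=3):
    """Lower-cased character n-grams of a string (endogenous features only)."""
    s = " ".join(s.lower().split())  # normalise case and whitespace
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

Misspelled variants of the same journal score much higher than unrelated titles, which is what the clustering needs.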
Hello,
thanks a lot for your feedback!
On 1. `title_m`: the title of the item holding the NPL -- for non-journal items only, e.g. conferences, proceedings, etc.
The schema of the table is detailed in the "Schema" pane on GBQ: https://console.cloud.google.com/bigquery?project=npl-parsing&p=npl-parsing&d=patcit&t=v02_npl&page=table
It is also detailed here: https://grobid.readthedocs.io/en/latest/training/Bibliographical-references/
On 2. At this point there is no specific reason to favor the Knowledge Graph API. So, if you find that Wikidata can do better, feel free to experiment!
On 3. That's why I suggest to:
- keep the k best candidate values (e.g. k=3)
- validate the key-value matching by hand using a custom Label Studio setting (see my points 3 and 4 above + the guidelines)
In any case, we cannot fully rely on the output of the Wikidata/Knowledge Graph/... API.
On 4.
> In the query above, I have dirty results: it seems that together with a journal title, I also have some part of an article title.
No, the labeller will just accept the dirty result as long as it actually contains the info for the journal title.
The idea is that, from the API, you will have, let's say, 3 keys per ambiguous `title_j`. In the end, there will be multiple ambiguous `title_j` (values) with a common key. E.g. "IBM tech Disclosure" and "IBM technical disclosure" will certainly have a common key. From the API, each key is defined by a unique id (what I call `g_id` in the case of the Google Knowledge Graph) and a clean title.
> If one would have a list of the correct journals, that would help for one could create a matrix M x N, where M (number of journals) << N (number of rows) (at least, I expect that).
We don't have that because patent to "science" citations are not standard. For example, IBM technical disclosure bulletin is absent from major academic databases.
By the way, as you should keep only the k (~3) most relevant API entity outputs, you should end up with a dict with "number of distinct title_j" keys x k values. NB: the keys here will eventually be the "candidate values" and the values the "unique keys" of the labeling task. The final clustering task should be high quality and will be done as much as possible by hand -- that's why an efficient labeling environment is crucial.
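Concretely, inverting that dict gives the structure for the labeling task (toy data; the ids are hypothetical):

```python
from collections import defaultdict

# Hypothetical output of the API step: each distinct title_j mapped to its k best ids.
candidates = {
    "IBM tech Disclosure": ["g_id-7", "g_id-2"],
    "IBM technical disclosure": ["g_id-7", "g_id-9"],
    "Nature": ["g_id-1"],
}

def invert(candidates):
    """Group ambiguous title_j variants under each candidate key (g_id)."""
    by_key = defaultdict(list)
    for title_j, ids in candidates.items():
        for g_id in ids:
            by_key[g_id].append(title_j)
    return dict(by_key)

grouped = invert(candidates)
# "g_id-7" now collects both IBM disclosure variants for the labeller to validate.
```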
Hope it helps. If it is still unclear, we can have a call if you want.
Thanks, you are doing great work!
Cheers!
Hi,
Thanks, I had overlooked `title_m` in the schema.
OK, I will look at Wikidata or the Google Knowledge Graph.
Interesting. For an entity-matching project at my former startup I had to match raw ingredients described by people (things like "little chunk tomatoes" and "pomodorini") to the corresponding ingredient in nutritional datasets. Complexity: 10^4 raw ingredients to 10^3 meta.
I dealt with the task by combining NLP with a GUI built in Wolfram Mathematica, which dynamically selects a group of matching items so that a person can manually tick which suggestions match the queried string.
You might want to replicate something similar (@niklub, if of interest to you too), for I found that approach very useful and effective (and also boring if you are the only one doing the fine-graining, but with a community the process goes fast).
OK.
> From the API, each key is defined by a unique id (what I call g-id in the case of the google knowledge graph) and a clean title.
So: there is a `g_id` that is the matching id in a knowledge graph, and we will use the corresponding title from the chosen knowledge graph as the label.
> you should keep only the k (~ 3) most relevant API entity outputs,
Mm, here I think we might have longer outputs.
I looked at the plot of the "decay" of the ranking; the approach I used in my other work is to take all of the items up to the point where the proximity distribution "bends enough" (an elbow in the curve).
This approach worked reasonably well.
But if you don't have many titles, you can also take them all.
> Nb: the keys here will eventually be the "candidate values" and the values the "unique keys" of the labeling task
Could you give me an example?
I raise this one:

```python
('Ibm Corp', 'Ibme Technical Disclosure Bulletin', 0.10831550828287981),
('Ibm Corp', 'Ibm Tecnical Disclosure Bulletin', 0.10831550828287981),
('Ibm Corp', 'Ibm Technical Dosclosure Bulletin', 0.10831550828287981),
('Ibm Corp', 'Ibm Technical Disclusure Bulletin', 0.10831550828287981),
('Ibm Corp', 'Ibm Technical Disclosure Bulletin', 0.10831550828287981),
('Ibm Corp', 'Ibm Technical Disclossure Bulletin', 0.10831550828287981),
('Ibm Corp', 'Ibm Tdb', 0.07898279407844809),
```
In the example above, I compute a proximity between the labels. We humans know that `Ibm Technical Disclosure Bulletin` is the label, but how can we know it is, looking only at the strings? Why not `Ibm Corp`?
A few options:
a) I thought to look at frequency: if `Ibm Technical Disclosure Bulletin` or a similar variant occurs the most, then it might be a journal.
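A sketch of option a), with made-up occurrence counts (the real counts would come from the `title_j` frequencies):

```python
from collections import Counter

# Hypothetical occurrence counts for one cluster of near-identical strings.
variants = Counter({
    "Ibm Technical Disclosure Bulletin": 950,
    "Ibm Tecnical Disclosure Bulletin": 12,
    "Ibm Technical Disclusure Bulletin": 7,
    "Ibm Tdb": 31,
})

def most_frequent_label(variants):
    """Pick the most frequent spelling in a cluster as the provisional label."""
    label, _ = variants.most_common(1)[0]
    return label
```

The assumption is that the correct spelling dominates and typos are rare; it would not distinguish `Ibm Corp` if that string were itself very frequent.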
b) Another approach may be that a human queries for the label - 'Ibm Tecnical Disclosure Bulletin' - and is prompted with all suggestions to match with the true label, instead of the opposite (being prompted with the true labels for an entity) (see point 3).
c) We don't know the journals in advance.
So another approach may be to select a sample (randomly picked?) of values from the 1.5 million and run queries against a knowledge graph for each of them. For the ones that get a result (not `None`), we will use the corresponding `g_id` and true title as a label.
This way we could also build a training set.
Please let me know what you think and whether we understand each other on these points.
Question:
Google BigQuery may carry costs.
I wonder whether this project is supported somehow by a university, your research center, or your PhD.
May I ask for in-kind support to consume Google BigQuery, e.g. a quota?
> Get unique title_j (about 1.5 million) and their number of occurrences
Or could you share a table? (WeTransfer may work, but since we are working with BigQuery it would also be nice to just use that :) )
> Hope it helps. If it is still unclear, we can have a call if you want.
Yes, I wrote you an email proposing a call to share some thoughts (see the email from nifty.works).
Glad to connect!
Hello @gg4u,
Hope you are doing well.
Any news to share?
Cheers
This could help a lot for this kind of task:
https://github.com/dedupeio
Related Issues (20)
- Dead links in `target`
- Variable description
- Missing `title_*`
- "Pages" in `title_j`
- Make data available for download
- Add the version of the PATSTAT that was used as source data into the description
- npl_publn_id with same doi -> merge?
- Create variable dedicated to NPL class (bibliographical resources, search report, standards, etc)
- Sources of NPL
- Non latin NPL citations mess up the npl_class
- Add link to patstat appln_id
- Naming of the files in the tar archives
- Broken link
- Using npl_publn_id to merge PatCit to PATSTAT ???
- Zotero gzipped file is corrupt
- Geographic information
- Multiple `title_j` for the same `ISSN`/`ISSNe`
- Consolidate technical bulletins and conferences
- Standardise and/or propagate `title_abbrev_j`