cverluise / patcit

Making Patent Citations Uncool Again

Home Page: https://cverluise.github.io/PatCit/

License: MIT License

Python 8.67% Shell 0.07% Makefile 0.12% Jupyter Notebook 91.14%

patent-citations patents science innovation economics

patcit's Introduction

patCit

Building a comprehensive dataset of patent citations

👩‍🔬 Exploring the universe of patent citations has never been easier. No more complicated data set-up, memory issues or queries running forever: we host patCit on BigQuery for you.

🤗 patCit is community driven and benefits from the support of a reactive team that is happy to help and tackle your next request. This is where academics and industry practitioners meet.

🔮 patCit is based on state-of-the-art open source projects and libraries such as grobid/biblio-glutton and spaCy. Even better, patCit is continuously improving with the rest of its ecosystem.

🎓 Want to know more? Read the patCit academic presentation or dive into the usage and technical guides on the patCit documentation website.

💌 Receive project updates in your mail or GitHub feed: join the patCit newsletter and star the repository on GitHub.

What will you find in patCit?

Patents are at the crossroads of many innovation nodes: science, open knowledge, products, competition, etc. At patCit, we are building a comprehensive dataset of patent citations to help the community explore this terra incognita. patCit offers:

🌎 worldwide coverage
📄 & 📚 front-page and in-text citations
🌈 all sorts of documents, not just scientific articles

💡 How do we do it? We use recent progress in Natural Language Processing (NLP) to extract and structure citations into actionable pieces of information.

Front-page

patCit builds on DOCDB, the largest database of Non Patent Literature (NPL) citations. First, we deduplicate this corpus and organize it into 10 categories. Then, we design and apply category-specific information extraction models using spaCy. Finally, when possible, we enrich the data using external, domain-specific, high-quality databases.

Category	Classification	Information extraction	Enrichment	Colab notebook
Bibliographical reference	✅	✅	✅	🔜
Office action	✅
Patent	✅
Search report	✅
Product documentation	✅
Norm & standard	✅	✅
Webpage	✅
Database	✅	✅		🔜
Litigation	✅
Wiki	✅	✅
All	✅	NR	✅

In-text

patCit builds on the Google Patents corpus of USPTO full-text patents. First, we extract patent and bibliographical reference citations. Then, we parse the detected in-text citations into a series of category-dependent attributes using grobid. Patent citations are matched with a standard publication number using the Google Patents matching API, and bibliographical references are matched with a DOI using biblio-glutton. Finally, when possible, we enrich the data using external, domain-specific, high-quality databases.

Category	Citation extraction	Information extraction	Enrichment	BigQuery table	Colab notebook
Bibliographical reference	✅	✅	✅		🔜
Patents	✅	✅	✅		🔜

FAIR

📍 Find - The patCit dataset is available on BigQuery in an interactive environment. For those who have a smattering of SQL, this is the perfect place to explore the data. It can also be downloaded on Zenodo.

👨‍🎓 If you are new to Google BigQuery (GBQ) and want to learn the basics, you can take the GBQ Quickstart. This should not take more than 2 minutes and might help a lot!

📖 Access - We maintain detailed documentation on how to access the data once you have found them on BigQuery or Zenodo. See the usage notes on the patCit documentation website.

🔀 Interoperate - Interoperability is at the core of patCit's ambition. We take care to extract unique identifiers whenever possible to enable data enrichment from domain-specific, high-quality databases. This includes the DOI, PMID and PMCID for bibliographical references, the Technical Doc Number for standards, the Accession Number for genetic databases, the publication number for PATSTAT and Claims, etc. See the specific tables for more details.

🔂 Reproduce - You are at the right place. This GitHub repository is the project factory. You can learn more about data recipes and models on the patCit documentation website.

Contributing

There are many ways to contribute to patCit, and many of them do not involve coding.

Give feedback - We want to make patCit truly useful to the community. We are thus very happy to receive feedback.

Share your thoughts - We believe that discussions are much more valuable if they are publicly shared. This way, everyone can benefit from them. Hence, we strongly encourage you to share your issues and requests in the issue section of the patCit GitHub repository.

Feel like coding today? - We will be more than happy to receive any contributions from you and the community. We have already started to tag some issues with labels.

Team

This project was initiated by Gaétan de Rassenfosse (EPFL) and Cyril Verluise (Collège de France) in 2019.

Since then, it has benefited from the contributions of Gabriele Cristelli (EPFL), Francesco Gerotto (Sciences Po), Kyle Higham (Hitotsubashi University) and Lucas Violon (HEC Paris).

We are also thankful to Domenico Golzio for constant support and to @leflix311, @kermitt2, Tim Simcoe (Boston University), @SuperMayo and @wetherbeei for helpful comments.

Contribution details are available in CRediT.

patcit's Issues

Non-Latin NPL citations mess up the npl_class

Due to the limited abilities of the labelers (including me), the classification model was trained only on examples in English (and some other Latin-script languages). Hence, citations written in non-Latin scripts are misclassified out of sample.

Proposal

Add a language detection pipeline (e.g. spaCy-langdetect or spaCy-cld) and exclude non-English citations or create a specific subset for them.
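As a dependency-free first pass ahead of a full language detector, one could measure the share of Latin-script letters in each citation. The sketch below is not the spaCy-langdetect or spaCy-cld pipeline proposed above: it checks the script, not the language. The names `latin_share` and `is_mostly_latin` and the 0.8 threshold are assumptions chosen for illustration.

```python
import unicodedata


def latin_share(text):
    """Return the share of alphabetic characters whose Unicode name starts with LATIN.

    Returns None when the text contains no alphabetic character at all.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return None
    latin = sum(1 for ch in letters if unicodedata.name(ch, "").startswith("LATIN"))
    return latin / len(letters)


def is_mostly_latin(text, threshold=0.8):
    """Hypothetical filter: keep a citation only if most of its letters are Latin."""
    share = latin_share(text)
    return share is not None and share >= threshold
```

A German or French citation passes this filter, so a proper language detector would still be needed to isolate English-only citations.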

Variable description

Missing variable description for the beta dataset

Complete schema

Sources of NPL

Hi all,

just tried to find out whether this dataset would be useful for someone else's project, but I could not find the list of databases that you can link to anywhere. Some pointers on the main websites would be really useful!

Cheers,
Felix

Date in the IDNO column instead of WHEN

How to reproduce the behaviour

SELECT a.npl_publn_id,
       a.idno,
       a.title_main_a,
       a.when
FROM `npl-parsing.patcit.beta` a
WHERE
  npl_publn_id = 12761
  OR npl_publn_id = 42574

Returns:

Row	npl_publn_id	idno	title_main_a	when
1	12761	1988-08-02	Test procedure and specifications for component susceptibility to electrostatic discharges	null
2	42574	1988-03-07	QuickSilver distributed file services: an architecture for horizontal growth	null

Comments

I am not sure about the nature of the idno column, as I cannot see any pattern across observations.
But as you can see, the date is sometimes wrongly stored in this column (for proof I have included links). Both cases are IEEE conference papers, so there may be a pattern.
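To gauge how widespread the problem is, one could flag every idno value that is itself a valid calendar date. This is a minimal sketch on the two rows reported above; the function name `idno_is_date` and the repair step (moving the date into an empty when) are assumptions, not the project's actual fix.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def idno_is_date(idno):
    """Return True when an idno value is a valid YYYY-MM-DD calendar date."""
    if idno is None or not DATE_PATTERN.fullmatch(idno):
        return False
    try:
        datetime.strptime(idno, "%Y-%m-%d")
    except ValueError:
        return False
    return True


rows = [
    {"npl_publn_id": 12761, "idno": "1988-08-02", "when": None},
    {"npl_publn_id": 42574, "idno": "1988-03-07", "when": None},
]
# Candidate repair (an assumption): move the date to `when` only if `when` is empty.
for row in rows:
    if row["when"] is None and idno_is_date(row["idno"]):
        row["when"], row["idno"] = row["idno"], None
```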

Broken link

From this page: https://cverluise.github.io/PatCit/ to this page: https://cverluise.github.io/notebook

Make data available for download

Data is available on BigQuery but some users might want to extract the full (part of?) database to work with their own tools.

Feature description

The feature might actually come in different flavours:

full extraction
partial extraction based on publication year, a list of patents, a list of NPL, etc

Any other use case is most welcome and will be addressed asap, depending on the amount of resources required.

Consolidate technical bulletins and conferences

Technical bulletins (e.g. Ibm Technical Disclosure Bulletin) and conferences (e.g. various IEEE conferences) are frequent in NPL citations. Many of them are not covered by Crossref, meaning that they are not consolidated in the baseline Grobid/Biblio-Glutton pipeline.

Feature request

Increase consolidation coverage to these journals/conferences.

Ideas

At this point, I am still thinking about the right approach.

If we cannot find a global approach, a simple way to list quick wins would be to address the top-ranking journals returned by the following query:

SELECT
  title_j,
  COUNT(title_j) AS count
FROM
  `npl-parsing.patcit.beta`
WHERE
  ISSN IS NULL
GROUP BY
  title_j
ORDER BY
  count DESC

Any suggestions/contributions in general and in particular on a more global approach are most welcome

Standardise and/or propagate `title_abbrev_j`

Motivation

There is a 1:n relation between title_j and title_abbrev_j. E.g.

title_j	count_distinct	title_abbrev_j
Inflammation Research	3	Inflamm. res.,Inflamm. Res.,Inflamm Res
Biological Cybernetics	3	Biological Cybernetics,Biol. Cybern.,Biol. Cybernetics
Journal of Materials Science	3	J Mater Sci,JOURNAL OF MATERIALS SCIENCE,Journal of Materials Science

About half of the npl_publn with a title_j have no title_abbrev_j.

How to reproduce the behavior

See below

SELECT
  title_j,
  count(distinct(title_abbrev_j)) as count_distinct,
  STRING_AGG(distinct(title_abbrev_j)) as title_abbrev_j
FROM
  `npl-parsing.patcit.beta`
WHERE
  title_j is not null
GROUP BY 
  title_j 
ORDER BY
  count_distinct
  DESC

See below

SELECT
  count(distinct(npl_publn_id)) as count_distinct
FROM
  `npl-parsing.patcit.beta`
WHERE
  title_j is not null and title_abbrev_j is NULL # comment out "and title_abbrev_j is NULL" to get the denominator

Feature request

0. Decide whether we should keep title_abbrev_j. Note that, at least in the beta, there is no npl_publn with a null title_j but a non-null title_abbrev_j. In a sense, title_abbrev_j does not add any specific/new information.
1. Following 0., the priority seems to be to standardise title_j. From that, we can populate title_abbrev_j regardless of its parsed value.
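Many of the variants listed in the motivation differ only by case and punctuation, so a normalised key already collapses part of them. The sketch below is one possible normalisation, not the project's method; `abbrev_key` is a hypothetical helper.

```python
import re


def abbrev_key(title_abbrev):
    """Collapse case, dots and repeated spaces so that spelling variants share one key."""
    no_dots = title_abbrev.replace(".", " ")
    return re.sub(r"\s+", " ", no_dots).strip().casefold()


variants = ["Inflamm. res.", "Inflamm. Res.", "Inflamm Res"]
keys = {abbrev_key(v) for v in variants}  # a single key for the three variants
```

Variants that differ in wording rather than punctuation (e.g. Biol. Cybern. and Biol. Cybernetics) keep distinct keys and would need a lookup against a reference list of abbreviations.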

Dead links in `target`

Issue

There are dead links in the target field. (e.g Http://Edrm.Net/002/Wp-Content/Uploads/2009/09/Edrm-Legaltech.Pdf )

How to reproduce the behaviour

SELECT
  npl_publn_id,
  target
FROM
  `npl-parsing.patcit.beta`
WHERE
  npl_publn_id=260140

npl_publn_id	target
260140	Http://Edrm.Net/002/Wp-Content/Uploads/2009/09/Edrm-Legaltech.Pdf

Request

Might be nice to flag (remove?) dead links
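Flagging could be sketched as follows, using only the Python standard library. Note that the example target appears title-cased: the scheme and host can safely be lowercased, but the path is case-sensitive on many servers and its original casing cannot be recovered, so a failure here only marks a candidate dead link. The helper names `normalize_url` and `is_dead` are assumptions.

```python
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def normalize_url(target):
    """Lowercase the scheme and host only; the path is left untouched."""
    parts = urlsplit(target.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, parts.fragment)
    )


def is_dead(target, timeout=10):
    """Return True if a HEAD request fails or answers with an HTTP error status."""
    request = urllib.request.Request(normalize_url(target), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return False
    except (OSError, ValueError):  # URLError and HTTPError are subclasses of OSError
        return True
```

Some servers reject HEAD requests, so a flagged link may deserve a second check with a regular GET before it is removed.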

Add link to patstat appln_id

Patstat and PatCit

In most cases, researchers use Patstat to explore patent-level information. The main identifying variable in Patstat is called appln_id. This variable can be linked to the many different tables of Patstat and does not change in subsequent releases of Patstat.

What could be done

appln_id can be linked to the publication number, which is the identifier in the current version of PatCit (v2.0) using Patstat table tls211. There is a one to one mapping between the appln_id and the final publication number. It would be useful to have a new entry in the array cited_by adding the corresponding appln_id

Thanks

Naming of the files in the tar archives

Hi and, first of all, thanks for this tremendous job!

I downloaded both the frontpage_bibliographicalreference.tar and intext_bibliographicalreference.tar files.

I untarred both in the same folder. If I'm not wrong, the file names in both tar archives start from 0. Therefore, the second extraction overwrote the first files.

Why not use different names for the files, like frontpage_bibliographicalreference_000000000000.jsonl.gz instead of bibliographical_reference_000000000000.jsonl.gz?

Does this raise an issue with the names of the database tables? In that case, it may be worth simply adding a warning to the documentation so that people know they need two subfolders, one for the front-page citations and one for the in-text ones.
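Until the files are renamed, a user-side workaround is to extract each archive into a subfolder named after the archive. This is a minimal sketch; `extract_to_own_folder` is a hypothetical helper, not part of patCit.

```python
import tarfile
from pathlib import Path


def extract_to_own_folder(archive_path, root="."):
    """Extract a tar archive into a subfolder named after the archive itself."""
    archive_path = Path(archive_path)
    destination = Path(root) / archive_path.stem
    destination.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter when this Python version supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path) as archive:
        archive.extractall(destination, **kwargs)
    return destination
```

Calling it on frontpage_bibliographicalreference.tar and intext_bibliographicalreference.tar yields two separate folders, so identically named members no longer overwrite each other.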

Create variable dedicated to NPL class (bibliographical resources, search report, standards, etc)

Despite the recent focus on scientific papers cited by patents, NPL citations are actually a superset of so-called patent-to-science citations.

The different NPL classes include (inter alia):

Bibliographical References
Search Reports
Office Actions
Databases
Patents
Webpages
Product Documentations
Norms and standard documents
Litigation documents

Feature description

It would be useful to have a dedicated variable with the NPL class (let's say npl_class).

Zenodo gzipped file is corrupt

How to reproduce the behaviour

Download intext_patent_csv.tar from Zenodo

Your Environment

The file with index 5 (intext_patent_000000000005.csv.gz) is corrupt.

$ gzip -tv data-release/*.gz
data-release/intext_patent_000000000000.csv.gz:   OK
data-release/intext_patent_000000000001.csv.gz:   OK
data-release/intext_patent_000000000002.csv.gz:   OK
data-release/intext_patent_000000000003.csv.gz:   OK
data-release/intext_patent_000000000004.csv.gz:   OK
gzip: data-release/intext_patent_000000000005.csv.gz: unexpected end of file
gzip: data-release/intext_patent_000000000005.csv.gz: uncompress failed
data-release/intext_patent_000000000005.csv.gz:   NOT OK
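The same integrity check can be run from Python, for instance before loading the files in a notebook. The sketch below decompresses each file fully and reports the ones that fail; the function names are hypothetical.

```python
import gzip
import zlib
from pathlib import Path


def gzip_is_intact(path, chunk_size=1 << 20):
    """Decompress the whole file in chunks; return False on a truncated or invalid stream."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True


def corrupt_files(folder):
    """List the .gz files in a folder that fail the integrity check."""
    return sorted(p.name for p in Path(folder).glob("*.gz") if not gzip_is_intact(p))
```

For example, `corrupt_files("data-release")` returns the names of the archives that need to be downloaded again.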

Using npl_publn_id to merge PatCit to PATSTAT ???

I am trying to unnest the front-page npl data on BigQuery so I can merge it into PATSTAT data in Stata. However, both the npl_publn_id and the cited_by variables are repeated. Why is this necessary?

Missing `title_*`

Around 10% of the npl_publn in the beta version have neither title_j nor title_m nor title_main_a. Most of the time, part of these elements is wrongly parsed into title_main_m.

How to reproduce the behaviour

SELECT
  *
FROM (
  SELECT
    *
  FROM
    `npl-parsing.patcit.beta`
  WHERE
    title_j is NULL
    AND title_m is NULL
    AND title_main_a is NULL
    ) 
    AS parsing
JOIN (
  SELECT
    npl_publn_id AS id,
    npl_biblio
  FROM
    `usptobias.patstat.tls214`) AS tls214
ON
  tls214.id=parsing.npl_publn_id

Ideas/ solution

There seems to be a common pattern in these citations in the sense that they are already very structured (e.g NIELSEN F ET AL: 'HERSTELLUNG STAUBARMER, FREIFLIESSENDER PRODUKTE', CHEMIETECHNIK, HUTHIG, HEIDELBERG, DE, vol. 22, no. 10, 1 October 1993 (1993-10-01), pages 48 - 49, XP000415410, ISSN: 0340-9961).

At this stage, training the Grobid model on these examples seems to be the best option. Then, examples affected by this issue will be processed again.

"Pages" in `title_j`

Around 0.8% of the NPL publication in the beta dataset have "Pages" as title_j.

How to reproduce the behaviour

SELECT
  *
FROM (
  SELECT
    *
  FROM
    `npl-parsing.patcit.beta`
  WHERE
    title_j ="Pages"
    ) 
    AS parsing
JOIN (
  SELECT
    npl_publn_id AS id,
    npl_biblio
  FROM
    `usptobias.patstat.tls214`) AS tls214
ON
  tls214.id=parsing.npl_publn_id

Ideas/ solution

The issue seems to be closely related to the one described in #14

There seems to be a common pattern in these citations in the sense that they are already well structured (e.g ENTNEHEMEN UND PRUEFEN MIT EINEM SCHNELLEN HANDHABUNGSGERAET', KUNSTSTOFFE,DE,CARL HANSER VERLAG. MUNCHEN, vol. 80, no. 8, 1 August 1990 (1990-08-01), pages 894, XP000150775, ISSN: 0023-5563).

As for #14 , training the Grobid model on these examples seems to be the best option. Then, examples affected by this issue will be processed again.

Geographic information

First off: Thank you so much for this awesome tool/data. I'm not sure it is under active development any more, but it should be if it isn't!

Geographic information on both the citing patents and the cited bibliographic publications would be a superb addition. It would help one track knowledge flows across space and across time (as of now, as far as I can tell, the best one can do is link patents to their patent office applications). In principle, I believe there is a way to link some of these records back to PATSTAT, but it appears a bit involved.

I of course know de minimis about the data pipeline, but, for citing patents, there is likely inventor and/or applicant location information wherever the raw patent data is coming from and of course there is de Rassenfosse et al. (2019) for more carefully geo-located patent data (I believe de Rassenfosse and perhaps others on that paper are/were involved in this project).

For cited academic publications, I can see that being tougher, but, in principle, the author affiliations should be useful in "geo-locating." Indeed, in a nonstandard format, it appears the table bibliographical_reference documents, when it can, the affiliation location. There is also a location field under "event" but I'm not 100% certain what "event" refers to here.

npl_publn_id with same doi -> merge?

There are different npl_publn_id with the same doi.

Actually, of the ~11 million npl_publn_id matched with a DOI, only 1.328 million have no duplicate according to the matched DOI.

WITH
  tmp AS (
  SELECT
    doi,
    COUNT(npl_publn_id) AS nb_dupl
  FROM
    `npl-parsing.patcit.v01`
  WHERE
    doi IS NOT NULL
  GROUP BY
    doi)
SELECT
  nb_dupl,
  COUNT(doi) AS nb
FROM
  tmp
GROUP BY
  nb_dupl
order by 
  nb_dupl ASC

nb_dupl	nb
1	1328805
2	516832
3	246001
4	145311
5	94703
...	...
3994	1
4376	1
4516	1
5183	1
9019	1

Feature description

Ultimately, it might help to merge the npl_publn_id values that actually refer to the same NPL. This is not straightforward because we want to keep PATSTAT compatibility. Open question.
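One way to keep PATSTAT compatibility is to leave every npl_publn_id untouched and publish a correspondence table from DOI to the ids that share it. The sketch below runs on made-up rows; `group_by_doi` is a hypothetical helper, and DOIs are lowercased before grouping because DOI names are case-insensitive.

```python
from collections import defaultdict


def group_by_doi(rows):
    """Map each DOI to the sorted list of npl_publn_id values that share it."""
    groups = defaultdict(list)
    for npl_publn_id, doi in rows:
        if doi:  # skip rows without a matched DOI
            groups[doi.lower()].append(npl_publn_id)
    return {doi: sorted(ids) for doi, ids in groups.items()}


# Made-up rows for illustration: (npl_publn_id, doi)
rows = [(1, "10.1000/abc"), (2, "10.1000/ABC"), (3, None), (4, "10.1000/xyz")]
groups = group_by_doi(rows)
duplicates = {doi: ids for doi, ids in groups.items() if len(ids) > 1}
```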

Title disambiguation

A given title (in title_j, title_m) can appear under different forms in the database. This might be due to typos (e.g. Ibm Tchnical Disclosure Bulletin), abbreviations (Ibm Tdb), parsing errors (Ibm Tech-Nical Disclosure Bulletin, Ibm Corp), etc.

Example ⬇️

SELECT
  DISTINCT(title_j)
FROM
  `npl-parsing.patcit.beta`
WHERE
  LOWER(title_j) LIKE "%ibm%"
ORDER BY
  title_j DESC

title_j
Ibme Technical Disclosure Bulletin
Ibm-Tdb
Ibm Tecnical Disclosure Bulletin
Ibm Technical Dosclosure Bulletin
Ibm Technical Document
Ibm Technical Dislosure Bulletin
Ibm Technical Disclusure Bulletin
Ibm Technical Disclosures Bulletin
Ibm Technical Disclosure Bulleting
Ibm Technical Disclosure Bulletin; 'Improved First-In First-Out'
Ibm Technical Disclosure Bulletin, Ref. No. Xp
Ibm Technical Disclosure Bulletin, Nn Corp., Us
Ibm Technical Disclosure Bulletin, Ibm Corp. Ny
Ibm Technical Disclosure Bulletin, Ibm Corp
Ibm Technical Disclosure Bulletin Ibm
Ibm Technical Disclosure Bulletin
Ibm Technical Disclosure Bullentin
Ibm Technical Disclosure Bulle
Ibm Technical Disclossure Bulletin
Ibm Techn.Discl.Mag
Ibm Techn. Discl. Bull
Ibm Tech-Nical Disclosure Bulletin, Ibm Corp
Ibm Tech Disc Bulletin
Ibm Tdb
Ibm Tchnical Disclosure Bulletin
Ibm Disclosure Bulletin

Feature description

Title variables are useful to many use-cases. A clean and transparent disambiguation would definitely be a strong plus.

At this point, I have no particular idea on the most appropriate tools/algos to be used in the disambiguation process. Anyone should feel free to contribute.
Ultimately, we want a correspondence table between a "unique identifier" (e.g "Ibm Technical Disclosure Bulletin") and all the related variations.
The output of the disambiguation could be used to propagate ISSN(e)s (see issue #6 )
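As one candidate starting point, fuzzy string matching from the Python standard library already catches the typo variants listed above. This sketch assumes a list of canonical titles is available; `match_title` and the 0.85 cutoff are illustrative choices, not a recommendation from the project.

```python
import difflib


def match_title(variant, canonical_titles, cutoff=0.85):
    """Return the canonical title closest to `variant`, or None below the cutoff."""
    by_key = {title.casefold(): title for title in canonical_titles}
    hits = difflib.get_close_matches(variant.casefold(), list(by_key), n=1, cutoff=cutoff)
    return by_key[hits[0]] if hits else None


canonical = ["Ibm Technical Disclosure Bulletin"]
correspondence = {
    variant: match_title(variant, canonical)
    for variant in [
        "Ibm Tchnical Disclosure Bulletin",
        "Ibm Technical Disclosure Bullentin",
        "Ibm Tdb",
    ]
}
```

Typos are matched, but abbreviations such as Ibm Tdb are not, so an abbreviation dictionary or a learned model would still be needed for those.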

Confusion between `npl_publn_id` and `when` fields

Sometimes, the field "when" takes the value in field "npl_publn_id" (see, e.g., npl_publn_id 1003)

DOI field contains `doi:`

Issue

In some cases ( ~2.5% of the beta dataset), the doi contains an explicit doi: string. E.g. Doi:Doi:10.1117/12.148585 for npl_publn_id 571155

-> The metadata matching is likely to fail while we have the right identifier(parsed from the raw citation).

How to reproduce the behaviour

SELECT
  *
FROM
  `npl-parsing.patcit.beta`
WHERE
  DOI IS NOT NULL
  AND DOI!=""
  AND LOWER(DOI) LIKE "%doi%"

Solution

1. Extract DOIs with this issue (see above query for instance)
2. Clean DOIs (remove "doi" and ":")
3. Bibliographical metadata look-up (e.g. curl http://localhost:8080/service/lookup?doi=10.1117/12.148585)
4. Update data if 3. succeeds; otherwise, update the DOI only with the clean version from 2.
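The cleaning step could look like the sketch below, which strips any leading and possibly repeated doi: prefix. It covers that step only; the metadata look-up is left out. `clean_doi` is a hypothetical helper.

```python
import re

# One or more leading "doi:" prefixes, in any letter case, with optional spaces.
DOI_PREFIX = re.compile(r"^(?:\s*doi\s*:\s*)+", flags=re.IGNORECASE)


def clean_doi(raw):
    """Strip any leading, possibly repeated, `doi:` prefix from a DOI string."""
    return DOI_PREFIX.sub("", raw).strip()
```

Only the start of the string is touched, so a DOI whose suffix happens to contain the letters "doi" is left intact.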

More homogenous date format

Problem

The when column is a string that holds either the full date (%Y-%m-%d) or only the year (%Y), which complicates analyses because the values must be parsed first.

Feature request

I see two possible workarounds:

1. Coerce year-only cells to %Y-01-01, but this would mix precise dates with truncated dates
2. Have two columns, date and year, where year-only values are coerced to NULL in date and to the year in year (see example below)

Before:

id	when
1	null
2	1993
3	1969-05-03

Solution 1

id	date
1	null
2	1993-01-01
3	1969-05-03

Solution 2

id	date	year
1	null	null
2	null	1993
3	1969-05-03	1969
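Solution 2 can be sketched in a few lines of Python. The function name `split_when` is an assumption; any value that is neither a bare year nor a full date raises a ValueError rather than being guessed.

```python
import re
from datetime import datetime


def split_when(when):
    """Split a `when` string into (date, year) following Solution 2."""
    if when is None:
        return None, None
    if re.fullmatch(r"\d{4}", when):
        return None, int(when)  # year-only value: no precise date
    parsed = datetime.strptime(when, "%Y-%m-%d").date()  # raises ValueError otherwise
    return parsed.isoformat(), parsed.year
```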

No `cited_by` data?

Cyril, Gaétan
Great job, I am just looking into it. Have two questions:

Do I understand correctly that the present beta version of dataset contains only bibliographic info on NPL, and no patent identifier? Because I can find none (correct me if I am wrong). That is, it is about what's cited, but not who's citing
Do NPL citations come only from search reports, or do they also include in-text citations? From your e-mail it is not entirely clear
Thanks and congratulations again
Francesco

Multiple `title_j` for the same `ISSN`/`ISSNe`

Issue

Some npl publications sharing the same ISSN(e) have different title_j (journal title). E.g

`ISSN`	`journal_titles`
0946-2171	Applied Physics B,Applied Physics B Laser and Optics,Applied Physics B Lasers and Optics,Applied Physics B: Lasers and Optics
0003-021X	Journal of the American Oil Chemists Society,Journal of the American Oil Chemists' Society,Journal of the American Oil Chemists’ Society
0236-5731	Journal of Radioanalytical and Nuclear Chemistry Articles,Journal of Radioanalytical and Nuclear Chemistry Letters,Journal of Radioanalytical and Nuclear Chemistry

Quantitatively, this seems to be the case for 0.5-1% of npl_publn_id in the beta.

How to reproduce the behaviour

WITH
  tmp AS (
  SELECT
    title_j,
    ISSN,
    COUNT(npl_publn_id ) AS count
  FROM
    `npl-parsing.patcit.beta`
  WHERE
    ISSN IS NOT NULL
    AND ISSN!=""
  GROUP BY
    title_j,
    ISSN )
SELECT
  ISSN,
  STRING_AGG(title_j) AS journal_titles,
  SUM(count) AS sum,
  COUNT(ISSN) AS count
FROM
  tmp
GROUP BY
  ISSN
ORDER BY
  count DESC

Ideas/ solutions

title_j could be standardized using the title_proper reported at https://portal.issn.org/

Any suggestions welcome

Add the version of the PATSTAT that was used as source data into the description

The NPL_PUBLN_ID is a surrogate key with a special rule: every value under 950,000,000 is stable across PATSTAT releases, but any value above can change.

Feature description

Knowing the PATSTAT release allows us to link to the correct record without any doubt and without further matching via other fields (a patent publication can cite multiple NPLs).
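The rule described above can be encoded as a small check that tells which ids can be linked across releases and which ones require the source release to be known. This is a sketch based only on the threshold quoted in this issue; the names are hypothetical.

```python
# Threshold quoted in the issue: ids below it are described as stable across releases.
STABLE_ID_CEILING = 950_000_000


def is_stable_npl_publn_id(npl_publn_id):
    """Return True if the id falls in the range described as stable across PATSTAT releases."""
    return npl_publn_id < STABLE_ID_CEILING
```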
