Comments (2)
Implemented in patcit@nightly using pycld2, which is based on CLD2, itself derived from the Chromium Compact Language Detector project.
Note for dev: I chose CLD2 rather than CLD3 because CLD2 guarantees text preprocessing (such as URL cleaning) while CLD3 does not, which can cause strange errors.
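A minimal sketch of how a single NPL string might be tagged with pycld2, assuming the package is installed. `pycld2.detect` returns a reliability flag, the number of text bytes found, and a tuple of `(languageName, languageCode, percent, score)` guesses. `LanguageGuess`, `top_guess` and `detect_language` are hypothetical names used for illustration, not patcit functions.

```python
from typing import NamedTuple, Sequence, Tuple


class LanguageGuess(NamedTuple):
    name: str       # e.g. "ENGLISH" or "Unknown", as in the LANGUAGE column
    code: str       # e.g. "en" or "un"
    reliable: bool  # CLD2's own reliability flag


def top_guess(is_reliable: bool, details: Sequence[Tuple]) -> LanguageGuess:
    """Keep only the best-ranked guess from a pycld2.detect() result."""
    name, code = details[0][0], details[0][1]
    return LanguageGuess(name=name, code=code, reliable=is_reliable)


def detect_language(text: str) -> LanguageGuess:
    """Detect the language of one NPL string (requires pycld2)."""
    import pycld2  # imported lazily so top_guess() is usable without it

    is_reliable, _bytes_found, details = pycld2.detect(text)
    return top_guess(is_reliable, details)
```

Keeping only the first guess matches a table with one `LANGUAGE` value per `npl_publn_id`; whether patcit also stores the reliability flag is not stated here.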
NPL citations language (top 100)
SELECT
COUNT(npl_publn_id) AS nb,
LANGUAGE
FROM
`npl-parsing.external.npl_language`
GROUP BY
LANGUAGE
ORDER BY
nb DESC
LIMIT
100
Row | nb | LANGUAGE | |
---|---|---|---|
1 | 34768364 | ENGLISH | |
2 | 1944290 | Unknown | |
3 | 1533901 | Chinese | |
4 | 347973 | GERMAN | |
5 | 177638 | Japanese | |
6 | 98810 | DANISH | |
7 | 72890 | FRENCH | |
8 | 30161 | LATIN | |
9 | 27244 | LUXEMBOURGISH | |
10 | 21236 | Korean | |
11 | 19275 | RUSSIAN | |
12 | 9886 | DUTCH | |
13 | 5631 | NORWEGIAN | |
14 | 5598 | POLISH | |
15 | 5141 | ChineseT | |
16 | 4544 | PORTUGUESE | |
17 | 4454 | SPANISH | |
18 | 4301 | ITALIAN | |
19 | 3505 | INTERLINGUE | |
20 | 3109 | NORWEGIAN_N | |
21 | 1786 | INDONESIAN | |
22 | 1501 | SCOTS | |
23 | 1307 | CZECH | |
24 | 1297 | INTERLINGUA | |
25 | 1287 | FRISIAN | |
26 | 1276 | SWEDISH | |
27 | 1211 | KHASI | |
28 | 1210 | RHAETO_ROMANCE | |
29 | 1105 | JAVANESE | |
30 | 1079 | AFAR | |
31 | 1078 | MALAGASY | |
32 | 1022 | HAUSA | |
33 | 1012 | CATALAN | |
34 | 922 | CORSICAN | |
35 | 783 | GALICIAN | |
36 | 779 | VOLAPUK | |
37 | 775 | SANSKRIT | |
38 | 774 | SCOTS_GAELIC | |
39 | 743 | AFRIKAANS | |
40 | 676 | GREEK | |
41 | 650 | FINNISH | |
42 | 640 | ROMANIAN | |
43 | 640 | SLOVAK | |
44 | 590 | WARAY_PHILIPPINES | |
45 | 540 | MANX | |
46 | 522 | HUNGARIAN | |
47 | 488 | X_PIG_LATIN | |
48 | 480 | SERBIAN | |
49 | 468 | LITHUANIAN | |
50 | 467 | TATAR | |
51 | 442 | NAURU | |
52 | 440 | CEBUANO | |
53 | 435 | MALAY | |
54 | 431 | BASQUE | |
55 | 427 | HAITIAN_CREOLE | |
56 | 426 | OCCITAN | |
57 | 412 | ESTONIAN | |
58 | 411 | BRETON | |
59 | 408 | GUARANI | |
60 | 408 | TAGALOG | |
61 | 390 | UZBEK | |
62 | 367 | SESELWA | |
63 | 354 | VIETNAMESE | |
64 | 336 | WOLOF | |
65 | 323 | KINYARWANDA | |
66 | 311 | X_KLINGON | |
67 | 301 | MAURITIAN_CREOLE | |
68 | 298 | SLOVENIAN | |
69 | 289 | ESPERANTO | |
70 | 284 | WELSH | |
71 | 271 | LINGALA | |
72 | 270 | XHOSA | |
73 | 254 | CROATIAN | |
74 | 243 | TURKISH | |
75 | 238 | BISLAMA | |
76 | 219 | SHONA | |
77 | 214 | RUNDI | |
78 | 205 | TSWANA | |
79 | 188 | SAMOAN | |
80 | 179 | FAROESE | |
81 | 174 | ALBANIAN | |
82 | 164 | NYANJA | |
83 | 162 | SWAHILI | |
84 | 158 | LATVIAN | |
85 | 157 | SUNDANESE | |
86 | 156 | IRISH | |
87 | 156 | HAWAIIAN | |
88 | 153 | SESOTHO | |
89 | 145 | SOMALI | |
90 | 138 | ZHUANG | |
91 | 135 | TURKMEN | |
92 | 132 | GANDA | |
93 | 130 | MALTESE | |
94 | 121 | FIJIAN | |
95 | 108 | TONGA | |
96 | 108 | TSONGA | |
97 | 105 | OROMO | |
98 | 86 | ICELANDIC | |
99 | 77 | AKAN | |
100 | 75 | GREENLANDIC |
`Unknown` seems to consist mainly of very short NPL strings, in particular bibliographical references with many abbreviations, so they should be kept.
Sample of `Unknown`
WITH
tmp AS (
SELECT
npl_publn_id
FROM
`npl-parsing.external.npl_language`
WHERE
LANGUAGE="Unknown")
SELECT
npl_biblio
FROM
`usptobias.patstat.tls214` AS npl,
tmp
WHERE
tmp.npl_publn_id = npl.npl_publn_id
AND rand()<100/1900000
Row | npl_biblio | |
---|---|---|
1 | MicroVit Vitrectomy System , Copyright 1983. | |
2 | JP 2001-095573 | |
3 | JPN6013032839; J. Dairy Sci., 2001, Vol.84, No.2, pp.319-331 | |
4 | Mac Tool Catalog (1997), p. 17. | |
5 | Oppolzer, Tetrahedron Lett. No. 12, pp. 1001 1004 (1974). | |
6 | Derwent-Ref. 84-056356/10 | |
7 | Kishio et al., Jpn. J. Appl. Phys. (1987) 26:L1228. | |
8 | Georges et al., Macromolecules 1994, 27, 7228. | |
9 | Kalmbach et al (2007 JMB 371:639-48). | |
10 | Laser Pegs 2012 Catalog. | |
11 | Ueda et al., CA, 106, 1987, 79659k. | |
12 | JPN7015002669; Feng, Guo-Liang; Ji, Shun-Jun; Lai, Wen-Yong; Huang, Wei: 'Synthesis and optical properties of starburst carbazoles based on 9-phenylcarbazole core' Synlett (17), 2006, 2841-2845 | |
13 | WO 88/00617 | |
14 | Hannun et al., J. Biol. Chem. 262: 13620, 1987. | |
15 | AKIRI ET AL., ONCOGENE, vol. 28, 2009, pages 2163 - 2172 | |
16 | SUGAWARE M. ET AL.: 'pH Kanjusei Maku Yugo Liposome Lipoplex Fukugotai ni yoru Idenshi Delivery: Ca Ion Doji Donyu ni yoru Idenshi Donyu Koka no Zokyo', DRUG DELIVERY SYSTEM, vol. 17, no. 3, 2002, pages 272, II-O-13, XP003016762 | |
17 | Morrison & Boyd, Chapter 22, Organic Chemistry, 3rd Ed. (1973). | |
18 | Cheng, et al., Tetrahedron Lett., 32(49), 7333 7336 (1991). | |
19 | JPN6012043393; Tim Olson, Bob O'Hara, Emily H. Qi, Necati Canpolat, Simon Black, Jari Jokela: 'Normative Text Proposal for Diagnostics and Troubleshooting' IEEE 802.11-05/1070r2 , 20060111, paragraph 7.3,21.13, IEEE mentor | |
20 | Yayon et al. 1991. Cell 64:841. | |
21 | JPN6013054674; Zinner H et al: Journal fuer Praktische Chemie Vol.317, 1975, p.379-86 | |
22 | DE-Z: 'ntz' Heft 13, 1984, S. 175-176 | |
23 | Poulos, et al, GenBank No. AAT67231.1 2006. | |
24 | Presnov, M.A., et al., 'Antitumor properties of cis-dichlorodiamminedihydroxyplatinum(IV)', Izvestiya Akademii Nauk SSSR, Seriya Biologicheskaya (1986), (3), pp. 417-428, 1986. | |
25 | Hartlage-Rubsamen et al., Glia 41(2) 169-179 (Dec. 28, 2002). | |
26 | KOSHKIN ET AL., TETRAHEDRON, vol. 54, 1998, pages 3607 - 3630 | |
27 | Dubreuil et al., Endocrinology (1989) 125(3):1378 1384. | |
28 | J. Kresta, R. Chang, S. Kathiriya and K. Frisch, Makromol Chemie , 180, p. 1081 (1979). | |
29 | Schilmiller et al, 2009, PNAS, 106:10865-10870, see pp. 10866-10867. | |
30 | BIOORGANIC & MEDICINAL CHEMISTRY LETTERS, vol. 15, no. 1, 2005, pages 231 - 234 | |
31 | VAN DIJK; VAN DE WINKEL, CURR. OPIN. PHARMACOL., vol. 5, 2001, pages 368 - 74 | |
32 | PEYRAUD J. L.; ROUILLÉ B.; HURTAUD C.; BRUNSCHWIG P.: 'Les acides gras du lait de vache - Collection Synthèse', 2011, article 'La modulation du profil en acides gras des laits par l'alimentation', pages: 13 - 28 | |
33 | 肖刚等: '《大能源 分布式能源》', 30 September 2015 | |
34 | Lettau, Chemie der Heterocyclen, p. 17-27, 1st edition, VEB, Weinheim (1979). | |
35 | Diamond 2001 | |
36 | 康文甲: '《管道工》', 31 December 1989, article '冷凝器', pages: 604 | |
37 | Gillessen, S. et al., Mouse interleukin 12 (IL 12) p40 homodimer: a potent IL 12 antagonist Eur. J. Immunol. 25:200 206 (1995). | |
38 | 梁金钟等: '微生物发酵法合成高分子聚合物γ-PGA的研究', 《北京工商大学学报(自然科学版)》 | |
39 | Neurosci. Ltrs 188(1995)41-44,Daidson et al. | |
40 | Murphy et al., J. Biol. Chem. 269, 6632-6636 (1994). | |
41 | REICH ET AL., MOL. VISION., vol. 9, 2003, pages 210 - 216 | |
42 | McClean et al, 1993, Eur J Cancer, 29A: 2243-2248.* | |
43 | Database Uniprot, 'Interleukin-17 receptor B precursor (IL-17 receptor B) (IL-17RB) (Interleukin-17B receptor) (IL-17B receptor) (IL-17 receptor homolog 1) (IL-17Rh1) (IL17Rh1) (Cytokine receptor CRL4)', Accession No. Q9NRM6, May 27, 2002. | |
44 | Kretzschmar, E. et al., 'Synthese von 2,6-disubstituierten 4-Hydroxy-5,6,7,8-tetrahydropyrido[4,3-d]pyrimidinen', Pharmazie, 43(7), 475-476 (1988). | |
45 | DE-Firmenprospekt, Flying Kajakat, 1987 | |
46 | JP Office Action dtd Sep. 2, 2008, JP Appln. 2007-021773. | |
47 | Crainich, L. ‘Forming a 90 deg Bend’ Metal Forming Magazine (1991) vol. 25, No. 8 pp. 59-60. | |
48 | JPN6015011443; Journal of Experimental Medicine Vol.205,No.2, 2008, p287-294 | |
49 | Cordoba, J. and B. Minguez (2008) “Hepatic Encephalopathy” Semin Liver Dis, 28(1):70-80. | |
50 | LU, X .; YU, M .; WANG, G .; ZHAI T .; XIE, S .; LING , Y .; TONG, Y .; LI, Y., ADV. MATER., vol. 25, 2013, pages 267 - 272 | |
51 | Kluting, Flierl, Grudno and Luttermann; MTZ Magazine, Aug. 1999, 'Drosselfreie Laststeuerung miy vollvariablen Ventiltrieben'. | |
52 | DE-Z.: Korrespondenz Abwasser 38(1991), S. 228-34 | |
53 | U.S. Appl. No. 13/608,744. | |
54 | JPN6013021469; MAALEJ N et al: 'Antithrombotic Effect of Flavonoids in Red Wine' ACS Symp Ser No.661, 1997, Page.247-260 | |
55 | Dixon et al., Ann. Rev. Pharmacol. Toxicol., 1980, p. 441-462, 20. | |
56 | Albery et al., Amperometric enzyme electrodes , Phil. Trans. R. Soc. Long., vol. B 316, pp. 107 119 (1987). | |
57 | Kniskern, P. J. et al., Gene 46, 135 (1986) (Kniskern I). | |
58 | Prospekt, VVS-Isolering der Fa. Gullfiber, 1979 | |
59 | Crosslinking Polymer CA 81(24):153514t Kajiyama et al. Feb. 1970. | |
60 | M. J. GROGAN; M. R. PRATT; L. A. MARCAURELLE; C. R. BERTOZZI, ANNU. REV. BIOCHEM., vol. 71, 2002, pages 593 - 634 | |
61 | U.S. Appl. No. 11/090,432. | |
62 | SAMBROOK, J.; RUSSELL, D. W.: 'Molecular Cloning: a Laboratory Manual', 2001, COLD SPRING HARBOR LABORATORY | |
63 | BiliBed® Phototherapy System, Medela AG, http://www.medela.com/ISBD/neonatology/bilibed/index.php, 6 pages, 2008. | |
64 | JPN7011004201; J. Natl. Cancer Inst. (1997) vol.89, no.4, p.293-300 | |
65 | GUSTAFSSON ET AL., N ENGL. J. MED., vol. 334, 1996, pages 349 - 355 | |
66 | Sommer-Knudsen, J. et al., Hydroxyproline-Rich Plant Glycoproteins, Phytochemistry, 1998, 47(4): 483-497. | |
67 | Kaiser, Amino Acids 2012, 42, 679-684 | |
68 | CA113(8): 68388q, 1989. | |
69 | Honée, G., Convents, D., Van Rie, J., Jansens, S., Peferoen, M., Visser, B. The C-terminal domain of the toxic fragment of a Bacillus thuringiensis crystal protein determines receptor binding. (1991) Mol. Microbiol. 5:2799-2806. | |
70 | Zhang et al., Acta Pharmacol. Sinica 27(2): 179-183 (2006). | |
71 | Franz et al., (1980) Pflugeos arch., p. R2. | |
72 | 王建新: '《化妆品植物原料大全》', 30 June 2012 | |
73 | Hall et al., Carcinogenesis 2000; 21: 53-60. | |
74 | Lahourcade, Lise , et al., 'Molecular beam epitaxy of semipolar AlN(1122) and GaN(1122) on m-sapphire', J Mater Sci: Mater Electron, No. 19, (2008), pp. 805-809. | |
75 | JPN6012063322; JETI Vol.55, No.13, 2007, p.35-37 | |
76 | Norm DIN EN 14604 | |
77 | M. Aldissi et al., Polymer, vol. 23, pp. 243 245, (1982). | |
78 | XP002900204 | |
79 | JPN6012065635; Usha R Deshpande et al: Indian Journal of Experimental Biology 36(6), 1998, p.573-577 | |
80 | Bowie et al. (1990) Science 247 : 1306-1310. | |
81 | ALTSCHUL ET AL., J. MOL. BIOL., vol. 215, 1990, pages 403 - 410 | |
82 | SP 103 bulletin. | |
83 | Liu, et.al., '99mTc-Labeling of a Hydrazinonicotinamide-Conjugated Vitronectin Receptor Antagonist Useful For Imaging Tumors' Bioconjugate Chem. 2001, 12, 623-629. | |
84 | Pereira et al. Polymorphism of Human Cytomegalovirus Glycoproteins Characterized by Monoclonal Antibodies Virology (1984) 139:73 86. | |
85 | Thompson, J.Virol. 61: 229 232 (1987). | |
86 | B. Kumar and J. Kumar, J. Electrochem. Soc., 2010, 157, A611. | |
87 | Rauvala et al., Biochim. Biophys. Acta 531: 266 274, 1978. | |
88 | Carvajal et al., J. Vet. Diagn. Invest., 7:60-64, (1995). | |
89 | Okabe, et al. J. Org. Chem. 56:4392 (1991). | |
90 | ORGANIC LETTERS, 2000, pages 1749 - 51 | |
91 | Einde et al., JFS, 2003, Vol. 68, No. 8, p. 2396-2404. | |
92 | DIN 3223 |
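The filter `rand()<100/1900000` in the query above keeps each `Unknown` row independently with probability 100/1900000, so the sample size is random rather than fixed. A small sketch of the expected size, using the `Unknown` count from the language table (1944290); the function name is illustrative.

```python
import math


def expected_sample(n_rows: int, p: float) -> tuple:
    """Mean and standard deviation of a Bernoulli row sample of n_rows at rate p."""
    mean = n_rows * p
    std = math.sqrt(n_rows * p * (1.0 - p))
    return mean, std


mean, std = expected_sample(1_944_290, 100 / 1_900_000)
print(f"expected rows: {mean:.1f} +/- {std:.1f}")
```

This gives roughly 102 rows, give or take about 10, which is in line with the 92 rows shown.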
from patcit.
Addressed in v03 🎉.
The `npl_cat` classifier was trained on examples in English (and unknown) only. A `npl_cat_flag` bool was added to v03.
`npl_cat_flag`:
- if lang in ['en', 'un'], false
- else, true
Ideally, one should restrict to `npl_cat_flag=True`.
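The rule above can be sketched in a few lines, assuming `lang` holds the two-letter CLD2 code (`'en'`, `'un'`, and so on). `npl_cat_flag` and `TRAINED_LANGS` here are hypothetical names that mirror the rule as stated, not the patcit implementation.

```python
TRAINED_LANGS = frozenset({"en", "un"})  # languages the classifier was trained on


def npl_cat_flag(lang: str) -> bool:
    """Return False for 'en'/'un', True for any other language code."""
    return lang not in TRAINED_LANGS


# Flag a few example codes
flags = {code: npl_cat_flag(code) for code in ("en", "un", "de", "zh")}
```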
Closing this issue, feel free to reopen.