
Comments (2)

cverluise commented on May 9, 2024

Implemented in patcit@nightly using pycld2 (based on CLD2, which is itself derived from the Chromium Compact Language Detector project).

Note for devs: I chose CLD2 rather than CLD3 because CLD2 guarantees text preprocessing (URL cleaning, etc.) while CLD3 does not, which can cause strange errors.
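A minimal sketch of what a pycld2-based detection step could look like. This is not patcit's actual code: the helper name `detect_language` and its `detector` parameter are hypothetical. It assumes `pycld2.detect`, which returns `(is_reliable, text_bytes_found, details)`, with each entry of `details` being `(language_name, language_code, percent, score)`.

```python
from typing import Callable, Optional, Tuple


def detect_language(
    text: str, detector: Optional[Callable] = None
) -> Tuple[str, str, bool]:
    """Return (language_name, language_code, is_reliable) for `text`.

    `detector` is a hypothetical hook so the helper can run with a stub;
    by default it falls back to pycld2.detect.
    """
    if detector is None:
        import pycld2  # imported lazily; requires `pip install pycld2`

        detector = pycld2.detect
    is_reliable, _bytes_found, details = detector(text)
    # details is ordered by likelihood; the first entry is the top guess
    name, code = details[0][0], details[0][1]
    return name, code, is_reliable
```

The language names (`ENGLISH`, `Unknown`, ...) and codes (`en`, `un`, ...) used elsewhere in this thread match the name and code fields that CLD2 reports.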

NPL citations language (top 100)
SELECT
  COUNT(npl_publn_id) AS nb,
  LANGUAGE
FROM
  `npl-parsing.external.npl_language`
GROUP BY
  LANGUAGE
ORDER BY
  nb DESC
Row nb LANGUAGE  
1 34768364 ENGLISH  
2 1944290 Unknown  
3 1533901 Chinese  
4 347973 GERMAN  
5 177638 Japanese  
6 98810 DANISH  
7 72890 FRENCH  
8 30161 LATIN  
9 27244 LUXEMBOURGISH  
10 21236 Korean  
11 19275 RUSSIAN  
12 9886 DUTCH  
13 5631 NORWEGIAN  
14 5598 POLISH  
15 5141 ChineseT  
16 4544 PORTUGUESE  
17 4454 SPANISH  
18 4301 ITALIAN  
19 3505 INTERLINGUE  
20 3109 NORWEGIAN_N  
21 1786 INDONESIAN  
22 1501 SCOTS  
23 1307 CZECH  
24 1297 INTERLINGUA  
25 1287 FRISIAN  
26 1276 SWEDISH  
27 1211 KHASI  
28 1210 RHAETO_ROMANCE  
29 1105 JAVANESE  
30 1079 AFAR  
31 1078 MALAGASY  
32 1022 HAUSA  
33 1012 CATALAN  
34 922 CORSICAN  
35 783 GALICIAN  
36 779 VOLAPUK  
37 775 SANSKRIT  
38 774 SCOTS_GAELIC  
39 743 AFRIKAANS  
40 676 GREEK  
41 650 FINNISH  
42 640 ROMANIAN  
43 640 SLOVAK  
44 590 WARAY_PHILIPPINES  
45 540 MANX  
46 522 HUNGARIAN  
47 488 X_PIG_LATIN  
48 480 SERBIAN  
49 468 LITHUANIAN  
50 467 TATAR  
51 442 NAURU  
52 440 CEBUANO  
53 435 MALAY  
54 431 BASQUE  
55 427 HAITIAN_CREOLE  
56 426 OCCITAN  
57 412 ESTONIAN  
58 411 BRETON  
59 408 GUARANI  
60 408 TAGALOG  
61 390 UZBEK  
62 367 SESELWA  
63 354 VIETNAMESE  
64 336 WOLOF  
65 323 KINYARWANDA  
66 311 X_KLINGON  
67 301 MAURITIAN_CREOLE  
68 298 SLOVENIAN  
69 289 ESPERANTO  
70 284 WELSH  
71 271 LINGALA  
72 270 XHOSA  
73 254 CROATIAN  
74 243 TURKISH  
75 238 BISLAMA  
76 219 SHONA  
77 214 RUNDI  
78 205 TSWANA  
79 188 SAMOAN  
80 179 FAROESE  
81 174 ALBANIAN  
82 164 NYANJA  
83 162 SWAHILI  
84 158 LATVIAN  
85 157 SUNDANESE  
86 156 IRISH  
87 156 HAWAIIAN  
88 153 SESOTHO  
89 145 SOMALI  
90 138 ZHUANG  
91 135 TURKMEN  
92 132 GANDA  
93 130 MALTESE  
94 121 FIJIAN  
95 108 TONGA  
96 108 TSONGA  
97 105 OROMO  
98 86 ICELANDIC  
99 77 AKAN  
100 75 GREENLANDIC

`Unknown` seems to consist mainly of very short NPL citations, in particular bibliographical references with many abbreviations, so they should be kept.

Sample of `Unknown`
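-- IDs of NPL citations whose detected language is "Unknown"
WITH
  tmp AS (
  SELECT
    npl_publn_id
  FROM
    `npl-parsing.external.npl_language`
  WHERE
    LANGUAGE = "Unknown")
SELECT
  npl.npl_biblio
FROM
  `usptobias.patstat.tls214` AS npl
JOIN
  tmp
ON
  tmp.npl_publn_id = npl.npl_publn_id
WHERE
  -- keep roughly 100 of the ~1.9M "Unknown" rows at random
  RAND() < 100 / 1900000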
Row npl_biblio  
1 MicroVit Vitrectomy System , Copyright 1983.  
2 JP 2001-095573  
3 JPN6013032839; J. Dairy Sci., 2001, Vol.84, No.2, pp.319-331  
4 Mac Tool Catalog (1997), p. 17.  
5 Oppolzer, Tetrahedron Lett. No. 12, pp. 1001 1004 (1974).  
6 Derwent-Ref. 84-056356/10  
7 Kishio et al., Jpn. J. Appl. Phys. (1987) 26:L1228.  
8 Georges et al., Macromolecules 1994, 27, 7228.  
9 Kalmbach et al (2007 JMB 371:639-48).  
10 Laser Pegs 2012 Catalog.  
11 Ueda et al., CA, 106, 1987, 79659k.  
12 JPN7015002669; Feng, Guo-Liang; Ji, Shun-Jun; Lai, Wen-Yong; Huang, Wei: 'Synthesis and optical properties of starburst carbazoles based on 9-phenylcarbazole core' Synlett (17), 2006, 2841-2845  
13 WO 88/00617  
14 Hannun et al., J. Biol. Chem. 262: 13620, 1987.  
15 AKIRI ET AL., ONCOGENE, vol. 28, 2009, pages 2163 - 2172  
16 SUGAWARE M. ET AL.: 'pH Kanjusei Maku Yugo Liposome Lipoplex Fukugotai ni yoru Idenshi Delivery: Ca Ion Doji Donyu ni yoru Idenshi Donyu Koka no Zokyo', DRUG DELIVERY SYSTEM, vol. 17, no. 3, 2002, pages 272, II-O-13, XP003016762  
17 Morrison & Boyd, Chapter 22, Organic Chemistry, 3rd Ed. (1973).  
18 Cheng, et al., Tetrahedron Lett., 32(49), 7333 7336 (1991).  
19 JPN6012043393; Tim Olson, Bob O'Hara, Emily H. Qi, Necati Canpolat, Simon Black, Jari Jokela: 'Normative Text Proposal for Diagnostics and Troubleshooting' IEEE 802.11-05/1070r2 , 20060111, paragraph 7.3,21.13, IEEE mentor  
20 Yayon et al. 1991. Cell 64:841.  
21 JPN6013054674; Zinner H et al: Journal fuer Praktische Chemie Vol.317, 1975, p.379-86  
22 DE-Z: 'ntz' Heft 13, 1984, S. 175-176  
23 Poulos, et al, GenBank No. AAT67231.1 2006.  
24 Presnov, M.A., et al., 'Antitumor properties of cis-dichlorodiamminedihydroxyplatinum(IV)', Izvestiya Akademii Nauk SSSR, Seriya Biologicheskaya (1986), (3), pp. 417-428, 1986.  
25 Hartlage-Rubsamen et al., Glia 41(2) 169-179 (Dec. 28, 2002).  
26 KOSHKIN ET AL., TETRAHEDRON, vol. 54, 1998, pages 3607 - 3630  
27 Dubreuil et al., Endocrinology (1989) 125(3):1378 1384.  
28 J. Kresta, R. Chang, S. Kathiriya and K. Frisch, Makromol Chemie , 180, p. 1081 (1979).  
29 Schilmiller et al, 2009, PNAS, 106:10865-10870, see pp. 10866-10867.  
30 BIOORGANIC & MEDICINAL CHEMISTRY LETTERS, vol. 15, no. 1, 2005, pages 231 - 234  
31 VAN DIJK; VAN DE WINKEL, CURR. OPIN. PHARMACOL., vol. 5, 2001, pages 368 - 74  
32 PEYRAUD J. L.; ROUILLÉ B.; HURTAUD C.; BRUNSCHWIG P.: 'Les acides gras du lait de vache - Collection Synthèse', 2011, article 'La modulation du profil en acides gras des laits par l'alimentation', pages: 13 - 28  
33 肖刚等: '《大能源 分布式能源》', 30 September 2015  
34 Lettau, Chemie der Heterocyclen, p. 17-27, 1st edition, VEB, Weinheim (1979).  
35 Diamond 2001  
36 康文甲: '《管道工》', 31 December 1989, article '冷凝器', pages: 604  
37 Gillessen, S. et al., Mouse interleukin 12 (IL 12) p40 homodimer: a potent IL 12 antagonist Eur. J. Immunol. 25:200 206 (1995).  
38 梁金钟等: '微生物发酵法合成高分子聚合物γ-PGA的研究', 《北京工商大学学报(自然科学版)》  
39 Neurosci. Ltrs 188(1995)41-44,Daidson et al.  
40 Murphy et al., J. Biol. Chem. 269, 6632-6636 (1994).  
41 REICH ET AL., MOL. VISION., vol. 9, 2003, pages 210 - 216  
42 McClean et al, 1993, Eur J Cancer, 29A: 2243-2248.*  
43 Database Uniprot, 'Interleukin-17 receptor B precursor (IL-17 receptor B) (IL-17RB) (Interleukin-17B receptor) (IL-17B receptor) (IL-17 receptor homolog 1) (IL-17Rh1) (IL17Rh1) (Cytokine receptor CRL4)', Accession No. Q9NRM6, May 27, 2002.  
44 Kretzschmar, E. et al., 'Synthese von 2,6-disubstituierten 4-Hydroxy-5,6,7,8-tetrahydropyrido[4,3-d]pyrimidinen', Pharmazie, 43(7), 475-476 (1988).  
45 DE-Firmenprospekt, Flying Kajakat, 1987  
46 JP Office Action dtd Sep. 2, 2008, JP Appln. 2007-021773.  
47 Crainich, L. ‘Forming a 90 deg Bend’ Metal Forming Magazine (1991) vol. 25, No. 8 pp. 59-60.  
48 JPN6015011443; Journal of Experimental Medicine Vol.205,No.2, 2008, p287-294  
49 Cordoba, J. and B. Minguez (2008) “Hepatic Encephalopathy” Semin Liver Dis, 28(1):70-80.  
50 LU, X .; YU, M .; WANG, G .; ZHAI T .; XIE, S .; LING , Y .; TONG, Y .; LI, Y., ADV. MATER., vol. 25, 2013, pages 267 - 272  
51 Kluting, Flierl, Grudno and Luttermann; MTZ Magazine, Aug. 1999, 'Drosselfreie Laststeuerung miy vollvariablen Ventiltrieben'.  
52 DE-Z.: Korrespondenz Abwasser 38(1991), S. 228-34  
53 U.S. Appl. No. 13/608,744.  
54 JPN6013021469; MAALEJ N et al: 'Antithrombotic Effect of Flavonoids in Red Wine' ACS Symp Ser No.661, 1997, Page.247-260  
55 Dixon et al., Ann. Rev. Pharmacol. Toxicol., 1980, p. 441-462, 20.  
56 Albery et al., Amperometric enzyme electrodes , Phil. Trans. R. Soc. Long., vol. B 316, pp. 107 119 (1987).  
57 Kniskern, P. J. et al., Gene 46, 135 (1986) (Kniskern I).  
58 Prospekt, VVS-Isolering der Fa. Gullfiber, 1979  
59 Crosslinking Polymer CA 81(24):153514t Kajiyama et al. Feb. 1970.  
60 M. J. GROGAN; M. R. PRATT; L. A. MARCAURELLE; C. R. BERTOZZI, ANNU. REV. BIOCHEM., vol. 71, 2002, pages 593 - 634  
61 U.S. Appl. No. 11/090,432.  
62 SAMBROOK, J.; RUSSELL, D. W.: 'Molecular Cloning: a Laboratory Manual', 2001, COLD SPRING HARBOR LABORATORY  
63 BiliBed® Phototherapy System, Medela AG, http://www.medela.com/ISBD/neonatology/bilibed/index.php, 6 pages, 2008.  
64 JPN7011004201; J. Natl. Cancer Inst. (1997) vol.89, no.4, p.293-300  
65 GUSTAFSSON ET AL., N ENGL. J. MED., vol. 334, 1996, pages 349 - 355  
66 Sommer-Knudsen, J. et al., Hydroxyproline-Rich Plant Glycoproteins, Phytochemistry, 1998, 47(4): 483-497.  
67 Kaiser, Amino Acids 2012, 42, 679-684  
68 CA113(8): 68388q, 1989.  
69 Honée, G., Convents, D., Van Rie, J., Jansens, S., Peferoen, M., Visser, B. The C-terminal domain of the toxic fragment of a Bacillus thuringiensis crystal protein determines receptor binding. (1991) Mol. Microbiol. 5:2799-2806.  
70 Zhang et al., Acta Pharmacol. Sinica 27(2): 179-183 (2006).  
71 Franz et al., (1980) Pflugeos arch., p. R2.  
72 王建新: '《化妆品植物原料大全》', 30 June 2012  
73 Hall et al., Carcinogenesis 2000; 21: 53-60.  
74 Lahourcade, Lise , et al., 'Molecular beam epitaxy of semipolar AlN(1122) and GaN(1122) on m-sapphire', J Mater Sci: Mater Electron, No. 19, (2008), pp. 805-809.  
75 JPN6012063322; JETI Vol.55, No.13, 2007, p.35-37  
76 Norm DIN EN 14604  
77 M. Aldissi et al., Polymer, vol. 23, pp. 243 245, (1982).  
78 XP002900204  
79 JPN6012065635; Usha R Deshpande et al: Indian Journal of Experimental Biology 36(6), 1998, p.573-577  
80 Bowie et al. (1990) Science 247 : 1306-1310.  
81 ALTSCHUL ET AL., J. MOL. BIOL., vol. 215, 1990, pages 403 - 410  
82 SP 103 bulletin.  
83 Liu, et.al., '99mTc-Labeling of a Hydrazinonicotinamide-Conjugated Vitronectin Receptor Antagonist Useful For Imaging Tumors' Bioconjugate Chem. 2001, 12, 623-629.  
84 Pereira et al. Polymorphism of Human Cytomegalovirus Glycoproteins Characterized by Monoclonal Antibodies Virology (1984) 139:73 86.  
85 Thompson, J.Virol. 61: 229 232 (1987).  
86 B. Kumar and J. Kumar, J. Electrochem. Soc., 2010, 157, A611.  
87 Rauvala et al., Biochim. Biophys. Acta 531: 266 274, 1978.  
88 Carvajal et al., J. Vet. Diagn. Invest., 7:60-64, (1995).  
89 Okabe, et al. J. Org. Chem. 56:4392 (1991).  
90 ORGANIC LETTERS, 2000, pages 1749 - 51  
91 Einde et al., JFS, 2003, Vol. 68, No. 8, p. 2396-2404.  
92 DIN 3223
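As a rough check on the sample size, the `rand() < 100/1900000` filter keeps each `Unknown` row independently with a fixed probability, so the number of rows returned varies from run to run. The sketch below uses only figures from this thread (1,944,290 `Unknown` rows and the threshold from the query). The variable names are illustrative.

```python
import math

n_unknown = 1_944_290        # rows labelled "Unknown" in the language table
p = 100 / 1_900_000          # per-row sampling probability from the WHERE clause

# Each row is kept independently, so the row count is binomial(n_unknown, p)
expected_rows = n_unknown * p
std_rows = math.sqrt(n_unknown * p * (1 - p))

print(f"expected rows: {expected_rows:.1f}, standard deviation: {std_rows:.1f}")
```

The expectation is a little over 100 rows, so the 92 rows listed above are within the normal run-to-run variation of this kind of sampling.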

from patcit.

cverluise commented on May 9, 2024

Addressed in v03 🎉.
The npl_cat classifier was trained on examples in English (and unknown) only. A boolean npl_cat_flag was added in v03.
npl_cat_flag:

  • if lang in ['en', 'un'], false
  • else, true

Ideally, one should restrict to npl_cat_flag=True.
Closing this issue, feel free to reopen.
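The flag rule in the bullets above can be sketched as follows. This is an illustration of the rule as written in this comment, not the v03 implementation: the function name mirrors the field name, and the `TRAINING_LANGS` constant is hypothetical.

```python
# Language codes the npl_cat classifier was trained on, per the comment above
TRAINING_LANGS = frozenset({"en", "un"})


def npl_cat_flag(lang: str) -> bool:
    """Return False for 'en' and 'un', True for any other language code."""
    return lang not in TRAINING_LANGS


# Example: flag values for a few detected language codes
for code in ("en", "un", "de", "zh"):
    print(code, npl_cat_flag(code))
```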

