angelosalatino / dblp-parser

A simple Python script for parsing the DBLP dataset

License: MIT License

Python 100.00%
dblp dblp-dataset dblp-bibliography science-communication scholarly-metadata science-of-science

1. DBLP Parser

A simple Python script for parsing the DBLP dataset.

2. Set up

From your terminal run:

git clone https://github.com/angelosalatino/dblp-parser.git
cd dblp-parser
pip install -r requirements.txt

The parser needs both the DBLP dump (download dblp.xml.gz and unzip it) and the DTD file (dblp.dtd). The basic requirements to run this code are pandas and lxml.
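
As a quick sanity check before parsing, the following sketch reports which of the two files are missing. It assumes that dblp.dtd is expected in the same directory as dblp.xml; the helper name missing_dblp_files is hypothetical and not part of dblp-parser.

```python
from pathlib import Path
from typing import List


def missing_dblp_files(xml_path: str, dtd_name: str = "dblp.dtd") -> List[str]:
    """Return the required files that are missing (empty list if all are present).

    Assumption: the DTD is looked up next to the XML dump.
    """
    xml_file = Path(xml_path)
    dtd_file = xml_file.with_name(dtd_name)
    return [str(p) for p in (xml_file, dtd_file) if not p.is_file()]


if __name__ == "__main__":
    missing = missing_dblp_files("dblp.xml")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Both dblp.xml and dblp.dtd are in place.")
```

Running such a check first gives a clearer message than a parser error raised part-way through the dump.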

3. Download DBLP

The code already provides facilities to download everything that is needed:

  • the dtd file,
  • the latest dump available.

To do so, you need to run the following lines:

from dblp_parser import DBLP
dblp = DBLP()
dblp.download_latest_dump()

Otherwise, you can simply do this outside the Python environment using the terminal:

wget https://dblp.org/xml/dblp.xml.gz
wget https://dblp.org/xml/dblp.dtd
gzip -d dblp.xml.gz
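
If gzip is not available on the command line, the decompression step can also be done from Python with the standard library. This is a minimal sketch; the helper name gunzip is hypothetical and not part of dblp-parser. Unlike gzip -d, it leaves the .gz file in place.

```python
import gzip
import shutil


def gunzip(src: str, dst: str) -> None:
    """Decompress the gzip file src into dst, streaming so the dump never sits fully in memory."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)


# Example (assumes dblp.xml.gz is in the current directory):
# gunzip("dblp.xml.gz", "dblp.xml")
```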

4. Parser

4.1. Type of documents extracted

Here is the list of the 10 types of documents available within the DBLP dump:

"article",
"inproceedings",
"proceedings",
"book",
"incollection",
"phdthesis",
"mastersthesis",
"www",
"person",
"data"

4.2. Type of features extracted per document

Here are the 23 types of features that can be used to describe a particular document in DBLP:

"address" 
"author" 
"booktitle"
"cdrom" 
"chapter" 
"cite" 
"crossref" 
"editor" 
"ee" 
"isbn"
"journal" 
"month" 
"note" 
"number" 
"pages" 
"publisher" 
"publnr" 
"school" 
"series" 
"title" 
"url"
"volume" 
"year"

In addition to these, the parser extracts one more feature: type. This feature specifies the kind of document extracted (article, inproceedings and so on).

Finally, if the parameter include_key_and_mdate is set to True, two more features are added: key and mdate, which are attributes of the document element in the XML file.
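
When selecting a subset of features, a typo in a feature name is easy to miss. The sketch below checks a requested set against the 23 names listed above; ALL_FEATURES and unknown_features are hypothetical helpers, not part of dblp-parser, and how the parser itself treats unknown names is not covered here.

```python
# The 23 feature names listed in this section.
ALL_FEATURES = frozenset({
    "address", "author", "booktitle", "cdrom", "chapter", "cite",
    "crossref", "editor", "ee", "isbn", "journal", "month", "note",
    "number", "pages", "publisher", "publnr", "school", "series",
    "title", "url", "volume", "year",
})


def unknown_features(requested):
    """Return, sorted, the requested names that are not among the 23 known features."""
    return sorted(set(requested) - ALL_FEATURES)


# Example: "abstract" and "doi" are not DBLP features in the list above.
# unknown_features({"title", "doi", "abstract"})
```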

5. Usage examples

5.1. Parse all papers

With this function (parse_all) you can parse all documents available in DBLP.

5.1.1. Parameters

Parameter              Default  Info
dblp_path              -        File to load
save_path              -        Where to save the file
features_to_extract    None     Features to extract from the dump. If None (default), extracts everything
include_key_and_mdate  False    Also extracts the key and mdate attributes of the element tag
output                 "jsonl"  Defines the kind of output (jsonl or dataframe)

Full signature: parse_all(dblp_path: str, save_path: str, features_to_extract: dict = None, include_key_and_mdate: bool = False, output: str = "jsonl")

5.1.2. Generate JSONL file

From Python, you can run the following code:

from dblp_parser import DBLP
dblp_path = "dblp.xml"
save_path = "dblp.jsonl"
dblp = DBLP()
dblp.parse_all(dblp_path, save_path)

This will extract all documents from dblp.xml and describe them according to the 23 features available in the dataset. The output is a JSONL file in which each row is a dictionary. To read it, go through the file line by line and parse each line as a JSON object.
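
Reading the output line by line can be sketched as follows. The helper read_jsonl is hypothetical, and the sketch assumes the file is UTF-8 encoded with one JSON object per line.

```python
import json
from typing import Dict, Iterator


def read_jsonl(path: str) -> Iterator[Dict]:
    """Yield one dictionary per non-empty line of a JSONL file."""
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Example (assumes dblp.jsonl was produced by parse_all):
# for record in read_jsonl("dblp.jsonl"):
#     print(record.get("title"))
```

Because the function is a generator, only one record is held in memory at a time, which matters for a file as large as the full DBLP export.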

To extract a specific set of features (e.g., just title, url, ee and a few others) per document:

from dblp_parser import DBLP
dblp_path = "dblp.xml"
save_path = "dblp.jsonl"
dblp = DBLP()
features = {"url", "author", "ee", "journal", "number", "pages", "publisher", "series","booktitle", "title", "volume", "year"}
dblp.parse_all(dblp_path, save_path, features_to_extract=features)

This will create an output file with one row per document, each described by only the requested features.

5.1.3. Generate Dataframe (pandas)

Export DBLP to a dataframe:

from dblp_parser import DBLP
dblp_path = "dblp.xml"
dblp = DBLP()
features = {"url", "author", "ee", "journal", "number", "pages", "publisher", "series","booktitle", "title", "volume", "year"}
df = dblp.parse_all(dblp_path, features_to_extract=features, output="dataframe")
print(df)

5.1.4. Export by Year

Export the DBLP content of a given year:

from dblp_parser import DBLP
dblp = DBLP()

dblp_path = "dblp.xml"
save_path = "dblp_2022.json"
dblp.parse_by_year("2022", dblp_path, save_path)

6. Coming soon

New features and use cases will be added soon:

  • parse just conference papers
  • parse just journal papers
  • ... if you have an idea please open an issue
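
Until these are available, a similar result can be approximated by filtering an existing JSONL export on the type feature. This is only a sketch: filter_jsonl_by_type is a hypothetical helper, not part of dblp-parser, and it assumes each record carries a "type" field whose value is one of the document types listed in section 4.1 (for example "inproceedings" for conference papers and "article" for journal papers).

```python
import json


def filter_jsonl_by_type(src_path: str, dst_path: str, doc_type: str) -> int:
    """Copy records whose "type" equals doc_type from src_path to dst_path.

    Returns the number of records written.
    """
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if record.get("type") == doc_type:
                dst.write(json.dumps(record, ensure_ascii=False) + "\n")
                kept += 1
    return kept


# Example (assumes dblp.jsonl was produced by parse_all):
# filter_jsonl_by_type("dblp.jsonl", "dblp_conferences.jsonl", "inproceedings")
```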

7. Disclaimer

This work is inspired by: https://github.com/IsaacChanghau/DBLPParser

dblp-parser's Issues

XMLSyntaxError

I got the following error:

  File "dblp.xml", line 284
    <author>Francesco Trov&ograve;</author>
                                 ^
XMLSyntaxError: Entity 'ograve' not defined, line 284, column 31

It seems to be an error caused by decoding
