covuworie / nobel-physics-prizes

Predicting Nobel Physics Prize winners. Final project for Harvard CS109a 2017 edition https://github.com/covuworie/a-2017.

License: MIT License

nobel-physics-prizes nobel-laureates machine-learning logistic-regression svm random-forest natural-language-processing topic-modeling matrix-factorization scraping

nobel-physics-prizes's Introduction

Predicting Nobel Physics Prize Winners

And the Nobel goes to ...

Illustration: Niklas Elmehed/Nobel Media (IEEE Spectrum)

Winners of the Nobel Prize in Physics 2018

Background

The Nobel Prize in Physics is widely regarded as the most prestigious award in physics. It was awarded to 207 Nobel Laureates between 1901 and 2017. John Bardeen is the only double Nobel Laureate in Physics, meaning that 206 physicists have actually won the prize. The will of Alfred Nobel states that the prize should be awarded to the "person who shall have made the most important discovery or invention within the field of physics". In practice, the prize can be awarded to a maximum of 3 people in any year and can be split between a maximum of 2 inventions or discoveries. The prize is not awarded posthumously; however, if a person is awarded a prize and dies before receiving it, the prize may still be presented.

Problem Statement

The Nobel Prize in Physics is awarded by The Royal Swedish Academy of Sciences, Stockholm, Sweden. The nomination and selection process is lengthy and complex, taking just over a year. Three of the key stages are:

September - Nomination forms are sent out. The Nobel Committee sends out confidential forms to around 3,000 people - selected professors at universities around the world, Nobel Laureates in Physics and Chemistry, and members of the Royal Swedish Academy of Sciences, among others.
March-May - Consultation with experts. The Nobel Committee sends the names of the preliminary candidates to specially appointed experts for their assessment of the candidates' work.
October - Nobel Laureates are chosen. In early October, the Academy selects the Nobel Laureates in Physics through a majority vote. The decision is final and without appeal. The names of the Nobel Laureates are then announced.

Furthermore, details of the nominations are not made public until 50 years later. The nature of the selection process has led to claims that it is dominated more by the demographics of the nominee and the nominators than by the quality of the nominee's work. For more details, see this excellent five-part series from Physics Today that examines the data and dives into the history of physicists nominated for the Nobel Prize. This PBS article also describes 8 ways to win the Nobel Prize in Physics, of which 5 refer to demographics. Some of the nominee demographics mentioned in both articles include:

Gender
Age / years lived
Nationality
Institutions studied at and affiliated with
Connected to past winners of the Nobel Prize in Physics or Chemistry through progeny or academics
Theorist or experimentalist
Astronomer or physicist

The Physics Today article claims that "We'll probably never know for sure why some physicists win Nobel glory and others come up short; the Nobel committee is notoriously secretive about their deliberations." However, the data in the article suggests that there may exist underlying patterns that in general enhance a physicist's chance of winning a Nobel prize.

Project Goals

The goals of the project are to answer the following questions:

Do demographics play a major role in selecting the winner of the Nobel Prize in Physics?
Which demographic factors have the biggest influence on the outcome?
Who are the most likely winners of The Nobel Prize in Physics 2018?

The questions will be answered by building a machine learning model, based on demographic data alone, that predicts whether a physicist has won or will win a Nobel Prize. The Nobel Committee has acknowledged the gender bias against women across all of the Nobel Prizes and is actively looking to address the situation. A predictive model such as this could provide insight into biases present in the selection process, and the Nobel Committee could use such a model to make informed decisions that help to eradicate such biases.

Data Resources

A list of physicists notable for their achievements will be created by scraping the following Wikipedia articles:

Lists of Nobel Prize Winners in both Physics and Chemistry from 1901-2017 will be obtained by scraping the following Wikipedia articles:

These lists will be used to obtain demographic data in JSON format for the physicists by sending HTTP requests to DBpedia. DBpedia is a crowd-sourced community effort to extract structured content from the information created in various Wikimedia projects. In this case, the JSON data is similar to the structured data in an Infobox on the top right side of the Wikipedia article for each physicist. The following are examples of data that is available for the physicists:

Environment

An environment for computational reproducibility of this project can be set up by following these simple steps:

Download and install Python 3.6.5 (64-bit) (any 3.6.x version should be ok) for your operating system from python.org or Anaconda. Make sure to check the option "Add python 3.6 to PATH" during installation.
Download and install the latest version (any version should be ok) of git-scm for your operating system.
Clone the GitHub repository:

git clone https://github.com/covuworie/nobel-physics-prizes.git

Create a .env file at the root where you cloned the repo. See .env-example for an example.
Use pipenv to spawn a shell with the virtualenv activated (this will also load the .env environment variables):

pipenv shell

Install all packages from the Pipfile (both develop and default packages):

pipenv install --dev

Launch the JupyterLab application in your default browser:

jupyter lab

Notebooks

Notebooks are located under the notebooks directory. The individual notebooks of the project can be run interactively in JupyterLab. Alternatively, the run-all notebook runs all the notebooks sequentially in a non-interactive manner. This is useful for reproducing the output results of the entire study without having to interact with the individual notebooks.

The outputs of the individual notebooks are located in HTML files under the notebooks/html_output directory and can be viewed in a web browser. They are produced after a notebook has been run by issuing the following command in a terminal from the notebooks directory:

jupyter nbconvert --to html --output-dir=html mynotebook.ipynb

The actual notebooks only contain source code and markdown narrative as the output is cleaned after running them by issuing the following commands in a terminal from the notebooks directory:

jupyter nbconvert --to notebook --ClearOutputPreprocessor.enabled=True mynotebook.ipynb

mv mynotebook.nbconvert.ipynb mynotebook.ipynb

Cleaning the output allows for better source control of notebooks as the diff outputs only contain code and markdown narrative changes. If output diffs are desired then the diffs between the versions of html files can be examined.

Tests

Tests are located under the tests directory. There are two sets of tests, tests for the notebooks located at tests/notebooks and tests for the scripts located at tests/src.

Notebook tests use ipytest. The functions in the notebook under test need to be loaded into the same IPython interactive namespace as the tests. There are a few different ways of doing this; the simplest is to use JupyterLab to connect both notebooks to the same kernel. This can be achieved through the Kernel > Change Kernel option in the JupyterLab user interface. Please see the JupyterLab documentation for more information on managing kernels.

Script tests use pytest and can be run from within the virtualenv with the command:

pytest

Website

A website describing the findings of this project is available under the website directory and can be viewed using any web browser. Once you have cloned the repository, just open the index.html file to view the contents of the website offline.

nobel-physics-prizes's Issues

Collect list of physicists

Scrape List of physicists and List of theoretical physicists from Wikipedia and output the combined list to a txt file.

Change alpha 2 country code to alpha 3 country code

Change alpha-2 country codes to alpha-3 country codes in 3.0-build-features.ipynb since the alpha-3 values are clearer.

Issues with DBpedia and wikipedia being out of sync

Redirects for the following physicists lead to JSON data without a name. There are a few issues here.

DBpedia names not in sync with Wikipedia names:

Ea Ea -> Craige Schensted
Gian Carlo Wick -> Gian-Carlo Wick
Hans Adolf Buchdahl -> Hans Adolph Buchdahl
James Jeans -> James Hopwood Jeans
Lawrence Bragg -> William Lawrence Bragg
Shin'ichirō Tomonaga -> Sin'ichirō_Tomonaga
Thales of Miletus -> Thales

DBpedia data without the necessary fields:

Ricardo Carezani

Not a physicist:

Matthew Sanders

These links did not correctly redirect due to commas in the name:

John William Strutt -> John William Strutt, 3rd Baron Rayleigh
Sir George Stokes -> Sir_George_Stokes,_1st_Baronet

Build residence country codes and continents features

Create residence country features from residence field (indicators). Clearly some NER and a lookup of ISO country codes from city and / or state names is needed. A few options are available:

https://stackoverflow.com/questions/4844811/how-can-i-determine-a-region-country-and-continent-based-on-a-city-using-pytho
https://github.com/ushahidi/geography

Next try to convert the path of the URL (assumed to be a nationality) to a country and if a latitude and longitude exists in the response keep it, otherwise
Use named entity recognition on the path of the URL to extract any NORP (nationalities) and convert to countries. If a latitude and longitude exists in the response then keep it, otherwise discard (nothing was found).

The following links are useful libraries for the conversions:

https://stackoverflow.com/questions/44772314/converting-nationality-to-country-in-python
https://github.com/Dinu/country-nationality-list/blob/master/countries.csv (not as comprehensive as strict ISO)

One hot encode categorical features consisting of lists

One hot encode categorical features consisting of lists since machine learning models prefer to deal with these instead. Use sklearn MultiLabelBinarizer but remember to convert back to bool so that the one-hot encoded values are treated as categorical by Factor Analysis of Mixed Data (FAMD).
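
As a rough illustration of the encoding described here, the sketch below is a dependency-free stand-in for the idea, not a use of sklearn's MultiLabelBinarizer itself. It turns list-valued cells into one boolean column per distinct label, which is the shape the issue asks for after converting back to bool. The function name and the sample country codes are hypothetical.

```python
def one_hot_encode_lists(rows):
    """Return (classes, matrix); matrix[i][j] is True when classes[j] is in rows[i]."""
    # Collect every distinct label and sort so the column order is stable.
    classes = sorted({label for row in rows for label in row})
    # Membership tests yield real bools, so the columns stay categorical.
    matrix = [[label in row for label in classes] for row in rows]
    return classes, matrix


# Hypothetical list-valued feature: country codes per physicist.
rows = [["DNK", "SWE"], ["USA"], []]
classes, matrix = one_hot_encode_lists(rows)
for row in matrix:
    print(dict(zip(classes, row)))
```

An empty list simply becomes a row of False values, so no separate missing-value column is needed.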

Build number of doctoral students who are Physics and Chemistry Nobel Laureate features

Create features for the number of doctoral students who are Physics Nobel Laureates and the number who are Chemistry Nobel Laureates (ints). Use the list of Nobel Laureates in Physics, the list of Nobel Laureates in Chemistry and the doctoralStudent field to determine this.
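
A minimal sketch of how such a count feature could be derived, assuming the doctoralStudent field holds pipe-separated names and the laureate lists are available as plain sets of names. The function name and the sample values are illustrative, not the project's actual code or data.

```python
def count_laureates(field_value, laureates):
    """Count how many pipe-separated names in a field appear in the laureate set."""
    if not field_value:
        return 0
    names = {name.strip() for name in field_value.split("|") if name.strip()}
    return len(names & laureates)


# Hypothetical inputs: tiny excerpts standing in for the scraped laureate lists.
physics_laureates = {"Werner Heisenberg", "Wolfgang Pauli"}
chemistry_laureates = {"Peter Debye"}

doctoral_students = "Werner Heisenberg|Wolfgang Pauli|Peter Debye|Another Student"
num_physics = count_laureates(doctoral_students, physics_laureates)
num_chemistry = count_laureates(doctoral_students, chemistry_laureates)
print(num_physics, num_chemistry)  # 2 1
```

The same helper would apply unchanged to other name-valued fields such as notable students, advisors or spouses.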

Build number of notable students who are Physics and Chemistry Nobel Laureate features

Create features for the number of notable students who are Physics Nobel Laureates and the number who are Chemistry Nobel Laureates (ints). Use the list of Nobel Laureates in Physics, the list of Nobel Laureates in Chemistry and the notableStudent field to determine this.

Collect raw data on physicists

Collect raw data on every physicist in the list of physicists. Obtain the structured data from DBpedia.
For an example see Albert Einstein. Output the list to a jsonlines file.
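
The sketch below shows one way the lookup could be organised. It makes two assumptions that should be checked against DBpedia itself: that the JSON view of a resource lives at http://dbpedia.org/data/<name>.json, and that the payload maps resource URIs to property URIs to lists of value objects. The helper names and the hand-written sample payload are hypothetical, and no network request is made here.

```python
from urllib.parse import quote


def dbpedia_data_url(name):
    """Build the (assumed) DBpedia JSON URL for a Wikipedia-style page name."""
    return "http://dbpedia.org/data/" + quote(name.replace(" ", "_")) + ".json"


def property_values(payload, resource_uri, property_uri):
    """Return the plain values stored for one property of one resource."""
    entries = payload.get(resource_uri, {}).get(property_uri, [])
    return [entry["value"] for entry in entries]


# Hand-written sample in the assumed payload shape (not fetched from DBpedia).
sample_payload = {
    "http://dbpedia.org/resource/Albert_Einstein": {
        "http://dbpedia.org/ontology/birthDate": [
            {"type": "literal", "value": "1879-03-14"}
        ]
    }
}

url = dbpedia_data_url("Albert Einstein")
birth_dates = property_values(
    sample_payload,
    "http://dbpedia.org/resource/Albert_Einstein",
    "http://dbpedia.org/ontology/birthDate",
)
print(url, birth_dates)
```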

Build birth and death country codes and continents features

Create birth country code feature from birthPlace field and death country code feature from deathPlace field (categoricals). Also create birth continent and death continent features. Also create features for the number of birth country codes and number of death country codes and likewise for the continent codes.

Remove imputation of parents / children in features

Remove imputation of parents / children when building features since this is not transferable to the testing phase and is a form of data snooping. In particular this is not possible to do in the case of one test example. Just accept the fact that there is some missing data here.

Build number of doctoral advisors who are Physics and Chemistry Nobel Laureate features

Create features for the number of doctoral advisors who are Physics Nobel Laureates and the number who are Chemistry Nobel Laureates (ints). Use the list of Nobel Laureates in Physics, the list of Nobel Laureates in Chemistry and the doctoralAdvisor field to determine this.

Scope of work

Project statement
Description of the data
High level project goals
References

Write explanation for feature building

Write prose explaining rationale for feature building.

Regenerate all data as links on dbpedia have changed

Regenerate all data as links on dbpedia have changed.

Use copy.deepcopy for all dictionaries

Change all the code to use copy.deepcopy when copying dictionaries. Also, check to make sure the passed parameter dictionaries are not being modified.
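
A short illustration of why this matters, using a hypothetical nested parameter dictionary: a shallow copy shares its nested objects with the original, so changes made through the copy leak back, whereas copy.deepcopy does not have this problem.

```python
import copy

# Hypothetical nested parameter dictionary.
params = {"grid": {"C": [0.1, 1.0]}}

shallow = params.copy()       # top-level copy only: nested objects are shared
deep = copy.deepcopy(params)  # nested objects are copied recursively

deep["grid"]["C"].append(10.0)
unchanged_after_deep = params["grid"]["C"] == [0.1, 1.0]

shallow["grid"]["C"].append(100.0)
changed_after_shallow = params["grid"]["C"] == [0.1, 1.0, 100.0]

print(unchanged_after_deep, changed_after_shallow)  # True True
```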

Build the target variable

Create the target variable. An indicator that states whether the physicist is a Nobel Laureate in Physics. Base this on whether the physicist is in the list of Nobel Laureates in Physics. Note the award field is not sufficient to use as some Nobel Laureates in Physics are not listed as such there.

Process physicists raw data

Process the JSON data on physicists to access the demographic data of interest and output to a csv file.

Fix last issues with Wikipedia and DBpedia names being out of sync

The following names need to be force-mapped to the correct resources:

Ernest Mouchez -> Amédée_Mouchez
Hans Ziegler (physicist) -> Hans Ziegler
Kenneth Young (physicist) -> Kenneth Young,
Raúl Rabadán -> Raúl Rabadan
William Fuller Brown Jr. -> William Fuller Brown, Jr.
Yakov Alpert -> Yakov Lvovich Alpert
Yang Chen-Ning -> Chen-Ning Yang

Apparently the last guy is very famous, a Nobel Laureate no less!

Collect raw data on Nobel Chemistry Prize winners

Scrape Wikipedia list of Nobel Laureates in Chemistry and output as csv file.

Process places raw data

Create pandas dataframe from the places json lines file. The variables at a minimum should be:

resource, source, fullName
abstract
comment
categories
latitude
longitude
city
country

Check for redirects and impute them. And impute missing latitude and longitude values where possible. These should be based on the city.

Persist the dataframes to disk for later use.

Remove Royal Prussia from list of physicists

Add Royal Prussia to the list of urls to ignore so that it does not show up in the list of physicists. Fix up all associated notebooks downstream of this and regenerate outputs. Correct relevant asserts.

Build number of influenced and influencedBy who are Physics and Chemistry Nobel Laureate features

Create features for the number of influenced and influencedBy who are Physics Nobel Laureates and the number who are Chemistry Nobel Laureates (ints). Use the list of Nobel Laureates in Physics, the list of Nobel Laureates in Chemistry and the influenced and influencedBy fields to determine this.

Build years lived feature

Create the years lived feature (int) from the birth date and death date fields. If there is no death date then use today's date.
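
A minimal sketch of this calculation, assuming the birth and death dates have already been parsed into datetime.date objects. The function name and the today parameter (added so results are reproducible) are illustrative choices.

```python
from datetime import date


def years_lived(birth, death=None, today=None):
    """Whole years between birth and death, or between birth and today if still alive."""
    end = death or today or date.today()
    # Subtract one year if the birthday has not yet occurred in the end year.
    had_birthday = (end.month, end.day) >= (birth.month, birth.day)
    return end.year - birth.year - (0 if had_birthday else 1)


# Albert Einstein: born 14 March 1879, died 18 April 1955.
einstein = years_lived(date(1879, 3, 14), date(1955, 4, 18))
# Hypothetical living physicist, evaluated one day before a birthday.
living = years_lived(date(1950, 6, 15), today=date(2018, 6, 14))
print(einstein, living)  # 76 67
```

Note that using today's date makes the feature depend on when the data is generated, so regenerating the data later will shift the values for living physicists.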

Reverse geocode places and map country codes and names.

Reverse geocode from latitude and longitudes to countryCode (alpha2). Map from countryCode (alpha2) to the following variables:

countryName
countryCode (alpha3)
continentCode
continentName

Add all these variables to the places dataframe and persist the dataframe to disk.
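
The mapping step (not the reverse geocoding itself) could look roughly like the sketch below. The lookup table is a hand-written three-country excerpt for illustration; a real table would come from an ISO 3166 data source. The key name countryAlpha3Code and the function name are hypothetical.

```python
# Hand-written excerpt; a full table would be loaded from an ISO 3166 data source.
COUNTRY_INFO = {
    "DK": {"countryName": "Denmark", "countryAlpha3Code": "DNK",
           "continentCode": "EU", "continentName": "Europe"},
    "US": {"countryName": "United States", "countryAlpha3Code": "USA",
           "continentCode": "NA", "continentName": "North America"},
    "JP": {"countryName": "Japan", "countryAlpha3Code": "JPN",
           "continentCode": "AS", "continentName": "Asia"},
}

EMPTY = {"countryName": None, "countryAlpha3Code": None,
         "continentCode": None, "continentName": None}


def country_variables(alpha2):
    """Map an alpha-2 country code to the derived variables, or to None values if unknown."""
    if not isinstance(alpha2, str):
        return dict(EMPTY)
    return dict(COUNTRY_INFO.get(alpha2.upper(), EMPTY))


print(country_variables("dk"))
print(country_variables("ZZ"))
```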

Redirect links in Physics and Chemistry Nobel Laureates

Similar to the list of physicists, some of the Laureate names (links) are redirected. Call the same code that was used for that.

Fix redirect issue from physicist links collected from Wikipedia

Some links were redirected in the list of physicists collected. Use requests to find the URLs that these are redirected to and regenerate the list. The situation is even trickier than suspected as redirects are done via javascript so requests does not handle that. The value is stored in the variable wgInternalRedirectTargetUrl.
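
One possible way to recover the target, sketched under the assumption that the value appears in the page source as a JSON string following the key wgInternalRedirectTargetUrl. The regular expression, the function name and the sample HTML are illustrative; fetching the page with requests is not shown.

```python
import json
import re

# Capture the JSON string literal that follows the key, including escaped characters.
REDIRECT_PATTERN = re.compile(
    r'"wgInternalRedirectTargetUrl"\s*:\s*("(?:[^"\\]|\\.)*")'
)


def internal_redirect_target(html):
    """Return the redirect target path found in the page source, or None."""
    match = REDIRECT_PATTERN.search(html)
    if match is None:
        return None
    # json.loads decodes any escapes inside the captured string literal.
    return json.loads(match.group(1))


# Hand-written sample standing in for a fetched Wikipedia page.
sample_html = (
    '<script>RLCONF={"wgPageName":"William_Lawrence_Bragg",'
    '"wgInternalRedirectTargetUrl":"/wiki/William_Lawrence_Bragg"};</script>'
)
target = internal_redirect_target(sample_html)
print(target)  # /wiki/William_Lawrence_Bragg
```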

Collect raw data on Nobel Physics Prize winners

Scrape Wikipedia list of Nobel Laureates in Physics and output as csv file.

Recreate pipfile

Recreate the pipfile and the lock file as the previous one had issues. Remove unneeded dependencies.

Train-test split for physicists data

Process the physicists dataframe to obtain a train-test split for physicists who were or are potentially eligible to be awarded a Nobel Prize in Physics. This essentially means physicists who were alive since the end of 1901.

Build number of academic advisors who are Physics and Chemistry Nobel Laureate features

Create features for the number of academic advisors who are Physics Nobel Laureates and the number who are Chemistry Nobel Laureates (ints). Use the list of Nobel Laureates in Physics and the list of Nobel Laureates in Chemistry to determine this.

Build number of spouses who are a Physics or Chemistry Nobel Laureate features

Create features for the number of spouses who are Physics Nobel Laureates and the number who are Chemistry Nobel Laureates (ints). Use the lists of Nobel Laureates in Physics and Nobel Laureates in Chemistry and the spouse field to determine this.

Impute missing country codes for nationalities

Impute any missing country alpha-2 codes, alpha-3 codes, country names, continent codes, continent names etc. using a demonym list:

https://github.com/nicolanrizzo/nationalitylist/blob/master/csv/en.csv

This will make dealing with nationalities and citizenships much simpler when creating features.

Issue with places not being split on pipe

When building features and getting the codes for countries and continents, the codes in the cell are not being split on the pipe which treats multiples as one.

Proper quoting in URLs

Ensure that quoting is applied to all characters that need it. It seems that the parse.quote method in urllib decides for itself which characters to quote. These issues result in certain URLs being wrong.

Also replace new lines in a field with a pipe as these are logically separate entities. Seems to only happen in almaMater and workplaces column.
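
For reference, a small sketch of how urllib.parse.quote behaves with its default arguments, plus a helper for the newline replacement. The helper names are hypothetical, and whether DBpedia expects exactly this encoding is an assumption to verify.

```python
from urllib.parse import quote


def resource_path(name):
    """Percent-encode a resource name; by default quote leaves letters, digits, '_.-~' and '/' alone."""
    return quote(name.replace(" ", "_"))


def join_multiline(value):
    """Replace newlines in a field with pipes, dropping empty fragments."""
    parts = [part.strip() for part in value.splitlines() if part.strip()]
    return "|".join(parts)


print(resource_path("Sir George Stokes, 1st Baronet"))  # Sir_George_Stokes%2C_1st_Baronet
print(resource_path("Sin'ichirō Tomonaga"))             # Sin%27ichir%C5%8D_Tomonaga
print(join_multiline("University of Copenhagen\nUniversity of Cambridge"))
```

Characters such as commas and apostrophes are encoded by default; passing them in the safe argument of quote would leave them untouched if the target service expects them literally.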

Build number of children and parents who are Physics and Chemistry Nobel Laureate features

Create features for the number of children and parents who are a Physics Nobel Laureate and who are a Chemistry Nobel Laureate (ints). Use lists of Nobel Laureates in Physics and Nobel Laureates in Chemistry and children field to determine this. Impute missing values as necessary.

Fix new line star problem and multiple references to the same redirected URL

Fix issue with \n* appearing in the fields of interest. Remove them, remove all white space, break up the field, resort the field.

Fix issue with multiple URLs referring to the same thing. e.g. Aage Bohr birthPlace that is:

results in duplicated entries. Resolve this in the impute_urls method by using a set instead.
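
A sketch of the cleaning and de-duplication described in this issue, assuming field values are pipe-separated strings that may contain "\n*" list markers. The function name and the sample value are hypothetical and are not taken from the DBpedia record mentioned above.

```python
def clean_field(value):
    """Strip '\\n*' markers and whitespace, drop duplicates, and return a sorted pipe-joined string."""
    # Treat each "\n*" marker as another separator between entries.
    items = value.replace("\n*", "|").split("|")
    # A set removes duplicates such as two URLs resolving to the same resource.
    unique = {item.strip() for item in items if item.strip()}
    return "|".join(sorted(unique))


# Hypothetical raw value containing list markers and a duplicate entry.
raw = "\n* Copenhagen\n* Denmark|Copenhagen"
print(clean_field(raw))  # Copenhagen|Denmark
```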

Build country code of citizenship feature

Combine citizenship and nationality into one feature referencing the country code of citizenship. There will still be many missing values. So use named entity recognition to extract values from description field. Use demonym listing to convert nationality to country.
Also have features for continent and number of all of these.

Pipenv fails to install jupyterlab

Temporarily remove jupyterlab from the list of dev dependencies due to this issue:

pypa/pipenv#2880

The workaround given in the link does not work. When the issue is fixed, add jupyterlab back to the dev dependencies.

Need to check for redirected links when processing physicists raw data

When processing physicists raw data, a check needs to be made when a link is encountered in case the link is redirected in DBpedia. The redirected link should replace the other link in this case. This simplifies the workflow downstream as it enables all semantic URLs to point to the correct "thing".

Build alma mater and country and continent codes features

Create alma mater features and features for the country and continent codes of the alma mater. Also create features for the number of all of these.

Collect raw data for places

Use the notable physicists dataframe to extract all the semantic URLs from the following fields of interest:

almaMater
workplaces
birthPlace
deathPlace
residence
citizenship
nationality

Use requests to fetch the json data from the URLs. Persist the data to a json lines file.
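
The persistence step could be sketched as follows, using only the standard library. The record contents are hypothetical, and an in-memory buffer stands in for the file on disk; the fetching with requests is not shown.

```python
import io
import json


def write_json_lines(records, handle):
    """Write one JSON document per line."""
    for record in records:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_json_lines(handle):
    """Read the records back, skipping blank lines."""
    return [json.loads(line) for line in handle if line.strip()]


# Hypothetical place records.
records = [
    {"resource": "http://dbpedia.org/resource/Copenhagen", "country": "Denmark"},
    {"resource": "http://dbpedia.org/resource/Utrecht_University", "country": None},
]

buffer = io.StringIO()  # stands in for open("places.jsonl", "w", encoding="utf-8")
write_json_lines(records, buffer)
buffer.seek(0)
restored = read_json_lines(buffer)
print(restored == records)  # True
```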

Build theoretical, experimental, astronomer features

Create theoretical, experimental, astronomer features (indicators) to indicate whether theoretical physicist, experimental physicist and astronomer. Use the fields categories, field, description and comment as necessary to extract this information.
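
A simple keyword-based sketch of these indicators. The keyword choices and the function name are assumptions made for illustration; the actual extraction may need more careful matching.

```python
def field_indicators(*texts):
    """Derive theoretical / experimental / astronomer flags from free-text fields."""
    combined = " ".join(text for text in texts if text).lower()
    return {
        "theoretical": "theoretical" in combined or "theorist" in combined,
        "experimental": "experimental" in combined,
        "astronomer": "astronom" in combined or "astrophysic" in combined,
    }


# Hypothetical values for the categories, field and description fields.
theorist = field_indicators("Theoretical physicists", None, "Danish physicist")
stargazer = field_indicators("American astronomers", "Astrophysics")
print(theorist)
print(stargazer)
```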

Richard Feynman is his own child

Crazily Richard Feynman is his own child in the physicists data. This happened as the URLs for his actual children redirect back to him! DBpedia nonsense here!!!

I suggest being extra conservative by excluding spouse, child and parent from the impute keys list. Very few of these are actually associated with Nobel Prizes (the Curie family being the notable exception). However, there is already code to handle this when creating the features by looking at the name field, which gives variants of the name.

Most of the interesting stuff involving names is in academic advisors, doctoral advisors, notable students etc. And these fields still have the redirected names.
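
The suggestion could be sketched as below: redirects are applied to every field except an excluded set of keys. The function name, the key names in the record and the shortened resource identifiers are hypothetical stand-ins.

```python
EXCLUDED_KEYS = frozenset({"spouse", "child", "parent"})


def impute_redirects(record, redirects, excluded_keys=EXCLUDED_KEYS):
    """Replace redirected URLs in pipe-separated fields, skipping the excluded keys."""
    imputed = {}
    for key, value in record.items():
        if key in excluded_keys or not isinstance(value, str):
            imputed[key] = value  # left untouched
            continue
        urls = value.split("|")
        imputed[key] = "|".join(redirects.get(url, url) for url in urls)
    return imputed


# Hypothetical record and redirect table.
record = {"doctoralAdvisor": "res/Lawrence_Bragg", "child": "res/Some_Child"}
redirects = {
    "res/Lawrence_Bragg": "res/William_Lawrence_Bragg",
    "res/Some_Child": "res/Richard_Feynman",  # the problematic self-redirect
}
result = impute_redirects(record, redirects)
print(result)
```

The function returns a new dictionary, so the input record is not modified.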

Issue with empty alma mater and number of alma mater

Alma mater is coming back as empty for some records, e.g. the 5th physicist in the training data. Utrecht University has no country alpha-2 codes and consequently none of the other values. The number of alpha-2 codes is 1 even though the list is empty.

Probably better to just return empty list when nothing is found rather than special casing NaN values. In any case I will be converting these lists to indicators for each of the categoricals. It seems natural that the NaN value would be the one dropped in that case.
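
One plausible cause of the count of 1 (an assumption, not confirmed by the issue) is that splitting an empty string on the pipe yields a one-element list. The sketch below shows that behaviour and a helper that returns an empty list instead; the helper name is hypothetical.

```python
def split_codes(value):
    """Split a pipe-separated cell into codes, returning [] for empty or missing values."""
    if not isinstance(value, str) or not value:
        return []
    return [code for code in value.split("|") if code]


naive_count = len("".split("|"))  # "".split("|") is [""], so this is 1
print(naive_count)
print(len(split_codes("")), len(split_codes(float("nan"))), split_codes("NLD|DNK"))
```

With this approach the count feature is simply the length of the returned list, and no special casing of NaN is needed downstream.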