covuworie / nobel-physics-prizes

Predicting Nobel Physics Prize winners. Final project for Harvard CS109a 2017 edition https://github.com/covuworie/a-2017.

License: MIT License

Jupyter Notebook 5.24% HTML 93.14% Python 1.57% JavaScript 0.03% CSS 0.02%
nobel-physics-prizes nobel-laureates machine-learning logistic-regression svm random-forest natural-language-processing topic-modeling matrix-factorization scraping

nobel-physics-prizes's Introduction

Predicting Nobel Physics Prize Winners

And the Nobel goes to ...

Illustration: Niklas Elmehed/Nobel Media (IEEE Spectrum)

Winners of the Nobel Prize in Physics 2018

Background

The Nobel Prize in Physics is widely regarded as the most prestigious award in physics. Between 1901 and 2017 there were 207 Nobel Laureates in Physics. John Bardeen is the only double laureate, so 206 distinct physicists have actually won the prize. The will of Alfred Nobel states that the prize should be awarded to the "person who shall have made the most important discovery or invention within the field of physics". In practice, the prize can be awarded to a maximum of 3 people in any year and can be split between a maximum of 2 inventions or discoveries. The prize is not awarded posthumously; however, if a person is awarded a prize and dies before receiving it, the prize may still be presented.

Problem Statement

The Nobel Prize in Physics is awarded by The Royal Swedish Academy of Sciences, Stockholm, Sweden. The nomination and selection process is lengthy and complex, taking just over a year. Three of the key stages are:

  • September - Nomination forms are sent out. The Nobel Committee sends out confidential forms to around 3,000 people - selected professors at universities around the world, Nobel Laureates in Physics and Chemistry, and members of the Royal Swedish Academy of Sciences, among others.

  • March-May - Consultation with experts. The Nobel Committee sends the names of the preliminary candidates to specially appointed experts for their assessment of the candidates' work.

  • October - Nobel Laureates are chosen. In early October, the Academy selects the Nobel Laureates in Physics through a majority vote. The decision is final and without appeal. The names of the Nobel Laureates are then announced.

Furthermore, details of the nominations are not made public until 50 years later. The nature of the selection process has led to claims that it is dominated more by the demographics of the nominee and the nominators than by the quality of the nominee's work. For more details, see this excellent five-part series from Physics Today that examines the data and dives into the history of physicists nominated for the Nobel Prize. This PBS article also describes 8 ways to win the Nobel Prize in Physics, of which 5 refer to demographics. Some of the nominee demographics mentioned in both articles include:

  • Gender
  • Age / years lived
  • Nationality
  • Institutions studied at and affiliated with
  • Connected to past winners of the Nobel Prize in Physics or Chemistry through progeny or academics
  • Theorist or experimentalist
  • Astronomer or physicist

The Physics Today article claims that "We'll probably never know for sure why some physicists win Nobel glory and others come up short; the Nobel committee is notoriously secretive about their deliberations." However, the data in the article suggests that there may exist underlying patterns that in general enhance a physicist's chance of winning a Nobel prize.

Project Goals

The goals of the project are to answer the following questions:

  1. Do demographics play a major role in selecting the winner of the Nobel Prize in Physics?
  2. Which demographic factors have the biggest influence on the outcome?
  3. Who are the most likely winners of The Nobel Prize in Physics 2018?

The questions will be answered by building a machine learning model, based on demographic data alone, that predicts whether a physicist has won or will win a Nobel Prize. The Nobel Committee has acknowledged the gender bias against women across all of the Nobel Prizes and is actively looking to address the situation. A predictive model such as this could provide insight into biases present in the selection process. The Nobel Committee could use such a model to make informed decisions that help to permanently eradicate such biases.

Data Resources

A list of physicists notable for their achievements will be created by scraping the following Wikipedia articles:

Lists of Nobel Prize Winners in both Physics and Chemistry from 1901-2017 will be obtained by scraping the following Wikipedia articles:

These lists will be used to obtain demographic data in JSON format for the physicists by sending HTTP requests to DBpedia. DBpedia is a crowd-sourced community effort to extract structured content from the information created in various Wikimedia projects. In this case, the JSON data is similar to the structured data in an Infobox on the top right side of the Wikipedia article for each physicist. The following are examples of data that is available for the physicists:

Environment

An environment for computational reproducibility of this project can be set up by following these simple steps:

  1. Download and install Python 3.6.5 (64-bit) (any 3.6.x version should be ok) for your operating system from python.org or Anaconda. Make sure to check the option "Add python 3.6 to PATH" during installation.

  2. Download and install the latest version (any version should be ok) of git-scm for your operating system.

  3. Clone the github repository:

git clone https://github.com/covuworie/nobel-physics-prizes.git

  4. Create a .env file at the root where you cloned the repo. See .env-example for an example.

  5. Use pipenv to spawn a shell with the virtualenv activated (this will also load the .env environment variables):

pipenv shell

  6. Install all packages from the Pipfile (both develop and default packages):

pipenv install --dev

  7. Launch the JupyterLab application in your default browser:

jupyter lab

Notebooks

Notebooks are located under the notebooks directory. The individual notebooks of the project can be run interactively in JupyterLab. Alternatively, the run-all notebook runs all the notebooks sequentially and non-interactively. This is useful for reproducing the results of the entire study without having to interact with the individual notebooks.

The outputs of the individual notebooks are located in HTML files under the notebooks/html_output directory and can be viewed in a web browser. They are produced after a notebook has been run by issuing the following command in a terminal from the notebooks directory:

jupyter nbconvert --to html --output-dir=html mynotebook.ipynb

The actual notebooks only contain source code and markdown narrative as the output is cleaned after running them by issuing the following commands in a terminal from the notebooks directory:

jupyter nbconvert --to notebook --ClearOutputPreprocessor.enabled=True mynotebook.ipynb

mv mynotebook.nbconvert.ipynb mynotebook.ipynb

Cleaning the output allows for better source control of notebooks as the diff outputs only contain code and markdown narrative changes. If output diffs are desired then the diffs between the versions of html files can be examined.
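
The same cleaning step can be illustrated without nbconvert. The following is only a minimal sketch, assuming the standard notebook JSON layout (a top-level "cells" list whose code cells carry "outputs" and "execution_count"). The function names clear_outputs and clear_notebook_file are hypothetical and are not part of this repository.

```python
import json
from pathlib import Path


def clear_outputs(notebook: dict) -> dict:
    """Strip outputs and execution counts from code cells (modifies the dict in place)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_notebook_file(path: str) -> None:
    """Rewrite the .ipynb file at `path` so that it holds only code and markdown."""
    nb_path = Path(path)
    notebook = json.loads(nb_path.read_text(encoding="utf-8"))
    cleaned = json.dumps(clear_outputs(notebook), indent=1, ensure_ascii=False)
    nb_path.write_text(cleaned + "\n", encoding="utf-8")
```

In practice the nbconvert command above remains the simpler route; the sketch only makes explicit what "cleaning the output" removes from the file.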

Tests

Tests are located under the tests directory. There are two sets of tests: tests for the notebooks located at tests/notebooks and tests for the scripts located at tests/src.

Notebook tests use ipytest. The functions in the notebook under test need to be loaded into the same IPython interactive namespace. There are a few ways of doing this, but the simplest is to use JupyterLab to connect both notebooks to the same kernel. This can be achieved through the Kernel > Change Kernel option in the JupyterLab user interface. Please see the JupyterLab documentation for more information on managing kernels.

Script tests use pytest and can be run from within the virtualenv with the command:

pytest

Website

A website describing the findings of this project is available under the website directory and can be viewed using any web browser. Once you have cloned the repository, just open the index.html file to view the contents of the website offline.

nobel-physics-prizes's Issues

Issues with DBpedia and wikipedia being out of sync

Redirects for the following physicists lead to JSON data without a name. There are a few issues here.

  1. DBpedia names not in sync with Wikipedia names:

Ea Ea -> Craige Schensted
Gian Carlo Wick -> Gian-Carlo Wick
Hans Adolf Buchdahl -> Hans Adolph Buchdahl
James Jeans -> James Hopwood Jeans
Lawrence Bragg -> William Lawrence Bragg
Shin'ichirō Tomonaga -> Sin'ichirō_Tomonaga
Thales of Miletus -> Thales

  2. DBpedia data without the necessary fields:

Ricardo Carezani

  3. Not a physicist:

Matthew Sanders

  4. These links did not correctly redirect due to commas in the name:

John William Strutt -> John William Strutt, 3rd Baron Rayleigh
Sir George Stokes -> Sir_George_Stokes,_1st_Baronet

Build residence country codes and continents features

Create residence country features from residence field (indicators). Clearly some NER and a lookup of ISO country codes from city and / or state names is needed. A few options are available:

https://stackoverflow.com/questions/4844811/how-can-i-determine-a-region-country-and-continent-based-on-a-city-using-pytho
https://github.com/ushahidi/geography

Next, try to convert the path of the URL (assumed to be a nationality) to a country. If a latitude and longitude exist in the response, keep them.
Otherwise, use named entity recognition on the path of the URL to extract any NORP entities (nationalities) and convert them to countries. If a latitude and longitude exist in the response, keep them; otherwise discard the result (nothing was found).

The following links point to useful resources for the conversions:

https://stackoverflow.com/questions/44772314/converting-nationality-to-country-in-python
https://github.com/Dinu/country-nationality-list/blob/master/countries.csv (not as comprehensive as strict ISO)

One hot encode categorical features consisting of lists

One-hot encode categorical features consisting of lists, since machine learning models work better with indicator columns than with lists. Use sklearn MultiLabelBinarizer, but remember to convert the result back to bool so that the one-hot encoded values are treated as categorical by Factor Analysis of Mixed Data (FAMD).
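
As a rough illustration of the encoding, here is a plain-Python stand-in for what MultiLabelBinarizer produces, with the values already as bool. It is only a sketch: the function name one_hot_lists and the sample country codes are hypothetical, and with sklearn the equivalent would be along the lines of MultiLabelBinarizer().fit_transform(rows).astype(bool).

```python
def one_hot_lists(rows):
    """One-hot encode a sequence of label lists.

    Returns (classes, matrix): classes is the sorted list of distinct labels and
    matrix[i][j] is True when rows[i] contains classes[j].
    """
    classes = sorted({label for labels in rows for label in labels})
    matrix = []
    for labels in rows:
        present = set(labels)
        matrix.append([label in present for label in classes])
    return classes, matrix


# Hypothetical example: country codes of the alma maters of three physicists.
alma_mater_codes = [["NL"], ["US", "GB"], []]
classes, matrix = one_hot_lists(alma_mater_codes)
```

An empty list simply yields a row of False values, so no special placeholder is needed for physicists without data.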

Build birth and death country codes and continents features

Create birth country code feature from birthPlace field and death country code feature from deathPlace field (categoricals). Also create birth continent and death continent features. Also create features for the number of birth country codes and number of death country codes and likewise for the continent codes.

Remove imputation of parents / children in features

Remove imputation of parents / children when building features since this is not transferable to the testing phase and is a form of data snooping. In particular this is not possible to do in the case of one test example. Just accept the fact that there is some missing data here.

Scope of work

  • Project statement
  • Description of the data
  • High level project goals
  • References

Build the target variable

Create the target variable: an indicator that states whether the physicist is a Nobel Laureate in Physics. Base this on whether the physicist is in the list of Nobel Laureates in Physics. Note that the award field alone is not sufficient, as some Nobel Laureates in Physics are not listed as such there.

Fix last issues with Wikipedia and DBpedia names being out of sync

The following names need to be force-mapped to the correct resources:

Ernest Mouchez -> Amédée_Mouchez
Hans Ziegler (physicist) -> Hans Ziegler
Kenneth Young (physicist) -> Kenneth Young,
Raúl Rabadán -> Raúl Rabadan
William Fuller Brown Jr. -> William Fuller Brown, Jr.
Yakov Alpert -> Yakov Lvovich Alpert
Yang Chen-Ning -> Chen-Ning Yang

Apparently the last guy is very famous, a Nobel Laureate nonetheless!
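
One way to express such a forced mapping is a lookup table consulted before the resource name is built. This is only a sketch: the names FORCED_NAMES and dbpedia_resource_name are hypothetical, the table holds just a few of the entries listed above, and replacing spaces with underscores is an assumption about how the resource names are formed.

```python
# A few of the forced mappings from the list above (Wikipedia name -> DBpedia name).
FORCED_NAMES = {
    "Ernest Mouchez": "Amédée_Mouchez",
    "Hans Ziegler (physicist)": "Hans Ziegler",
    "Yakov Alpert": "Yakov Lvovich Alpert",
    "Yang Chen-Ning": "Chen-Ning Yang",
}


def dbpedia_resource_name(wikipedia_name, forced=FORCED_NAMES):
    """Return the resource name to look up, applying a forced mapping when one exists."""
    name = forced.get(wikipedia_name, wikipedia_name)
    # Assumption: resource names use underscores in place of spaces.
    return name.replace(" ", "_")
```

Names without an entry in the table pass through unchanged apart from the underscore substitution.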

Process places raw data

Create pandas dataframe from the places json lines file. The variables at a minimum should be:

  • resource, source, fullName
  • abstract
  • comment
  • categories
  • latitude
  • longitude
  • city
  • country

Check for redirects and impute them. Also impute missing latitude and longitude values where possible, based on the city.

Persist the dataframes to disk for later use.

Remove Royal Prussia from list of physicists

Add Royal Prussia to the list of urls to ignore so that it does not show up in the list of physicists. Fix up all associated notebooks downstream of this and regenerate outputs. Correct relevant asserts.

Build years lived feature

Create the years lived feature (int) from the birth date and death date fields. If there is no death date then use today's date.
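
A minimal sketch of this calculation, assuming the birth and death dates have already been parsed into datetime.date objects. The function name years_lived and the optional today parameter (useful for reproducible runs) are hypothetical.

```python
from datetime import date


def years_lived(birth, death=None, today=None):
    """Whole years between birth and death, or between birth and today if still alive."""
    end = death or today or date.today()
    years = end.year - birth.year
    # Subtract one if the birthday has not yet been reached in the final year.
    if (end.month, end.day) < (birth.month, birth.day):
        years -= 1
    return years
```

For example, a physicist born on 11 May 1918 who died on 15 February 1988 gets years_lived(date(1918, 5, 11), date(1988, 2, 15)) == 69.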

Reverse geocode places and map country codes and names.

Reverse geocode from latitude and longitudes to countryCode (alpha2). Map from countryCode (alpha2) to the following variables:

  • countryName
  • countryCode (alpha3)
  • continentCode
  • continentName

Add all these variables to the places dataframe and persist the dataframe to disk.
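
The mapping half of this task can be sketched as a table lookup keyed on the alpha-2 code. The reverse geocoding step itself is not shown, since it needs a geocoding library or service. The name country_variables is hypothetical and COUNTRY_INFO is a hand-written three-row excerpt; a real run would need a complete ISO 3166 table.

```python
# Hand-written excerpt: alpha2 -> (countryName, alpha3, continentCode, continentName).
COUNTRY_INFO = {
    "NL": ("Netherlands", "NLD", "EU", "Europe"),
    "US": ("United States", "USA", "NA", "North America"),
    "JP": ("Japan", "JPN", "AS", "Asia"),
}


def country_variables(alpha2):
    """Return the derived country variables for an alpha-2 code, or None if unknown."""
    info = COUNTRY_INFO.get(alpha2)
    if info is None:
        return None
    name, alpha3, continent_code, continent_name = info
    return {
        "countryName": name,
        "countryAlpha3Code": alpha3,
        "continentCode": continent_code,
        "continentName": continent_name,
    }
```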

Fix redirect issue from physicist links collected from Wikipedia

Some links in the collected list of physicists are redirects. Use requests to find the URLs that these redirect to and regenerate the list. The situation is trickier than expected: the redirects are done via JavaScript, which requests does not handle. The target is stored in the variable wgInternalRedirectTargetUrl.
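
Since requests only returns the page source, one option is to pull the value of wgInternalRedirectTargetUrl out of that source with a regular expression. The sketch below assumes the variable appears in the HTML as a JSON-style key with a quoted string value; the function name internal_redirect_target is hypothetical, and the exact page layout should be checked against a real response.

```python
import json
import re

# Matches "wgInternalRedirectTargetUrl":"<JSON string>" and captures the quoted value.
_REDIRECT_RE = re.compile(
    r'"wgInternalRedirectTargetUrl"\s*:\s*("(?:[^"\\]|\\.)*")'
)


def internal_redirect_target(html):
    """Return the redirect target path found in the page source, or None if absent."""
    match = _REDIRECT_RE.search(html)
    if match is None:
        return None
    # json.loads decodes any escape sequences inside the quoted value.
    return json.loads(match.group(1))
```

A page that is not a redirect has no such variable, in which case the function returns None and the original link can be kept.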

Recreate pipfile

Recreate the pipfile and the lock file as the previous one had issues. Remove unneeded dependencies.

Train-test split for physicists data

Process the physicists dataframe to obtain a train-test split for physicists who were or are potentially eligible to be awarded a Nobel Prize in Physics. This essentially means physicists who were alive since the end of 1901.

Proper quoting in URLs

Ensure that quoting is applied to every URL in full rather than to hand-picked characters. It seems that the parse.quote function in urllib already knows which characters need quoting. Incorrect quoting currently results in certain URLs being wrong.

Also replace new lines in a field with a pipe as these are logically separate entities. Seems to only happen in almaMater and workplaces column.
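
Both fixes can be sketched with the standard library. The helper names resource_path and split_multiline are hypothetical, and replacing spaces with underscores before quoting is an assumption about how the resource names are formed.

```python
from urllib.parse import quote


def resource_path(name):
    """Percent-encode a resource name, letting quote decide which characters need it."""
    # Assumption: spaces become underscores before quoting.
    return quote(name.replace(" ", "_"))


def split_multiline(value):
    """Join the non-empty lines of a field with pipes, one entity per segment."""
    return "|".join(part.strip() for part in value.splitlines() if part.strip())
```

With these helpers, resource_path("Sin'ichirō Tomonaga") gives "Sin%27ichir%C5%8D_Tomonaga" and a comma is encoded as %2C, which covers the names that previously failed to redirect.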

Build country code of citizenship feature

Combine citizenship and nationality into one feature referencing the country code of citizenship. There will still be many missing values, so use named entity recognition to extract values from the description field. Use a demonym listing to convert nationality to country.
Also create features for the continent and for the number of each of these.

Pipenv fails to install jupyterlab

Temporarily remove jupyterlab from the list of dev dependencies due to this issue:

pypa/pipenv#2880

The workaround given in the link does not work. When the issue is fixed, add jupyterlab back to the dev dependencies.

Need to check for redirected links when processing physicists raw data

When processing physicists raw data, a check needs to be made when a link is encountered in case the link is redirected in DBpedia. The redirected link should replace the other link in this case. This simplifies the workflow downstream as it enables all semantic URLs to point to the correct "thing".

Collect raw data for places

Use the notable physicists dataframe to extract all the semantic URLs from the following fields of interest:

  • almaMater
  • workplaces
  • birthPlace
  • deathPlace
  • residence
  • citizenship
  • nationality

Use requests to fetch the json data from the URLs. Persist the data to a json lines file.
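
The persistence half of this task can be sketched with the standard library alone. The fetching step with requests is omitted here, and the function names write_json_lines and read_json_lines are hypothetical. Each record is written as one JSON document per line.

```python
import json


def write_json_lines(records, path):
    """Write an iterable of JSON-serializable records, one per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps names such as "Amédée" readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_json_lines(path):
    """Read the records back, skipping any blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

One record per line means the file can be appended to as responses arrive and read back incrementally later.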

Build theoretical, experimental, astronomer features

Create theoretical, experimental and astronomer features (indicators) that state whether the physicist is a theoretical physicist, an experimental physicist or an astronomer. Use the categories, field, description and comment fields as necessary to extract this information.
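
A simple keyword search over the text fields is one possible starting point. This is only a sketch: the function name role_indicators and the keyword choices are assumptions, not the approach the project settled on, and substring matching will produce some false positives.

```python
def role_indicators(*texts):
    """Derive the three indicators from any number of text fields (None values are ignored)."""
    text = " ".join(t for t in texts if t).lower()
    return {
        "theoretical": "theoretical" in text or "theorist" in text,
        # "experimental" also matches "experimentalist".
        "experimental": "experimental" in text,
        "astronomer": "astronom" in text or "astrophysic" in text,
    }
```

The fields would be passed in together, for example role_indicators(description, comment, categories), so that a match in any of them sets the indicator.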

Richard Feynman is his own child

Crazily Richard Feynman is his own child in the physicists data. This happened as the URLs for his actual children redirect back to him! DBpedia nonsense here!!!

I suggest being extra conservative by excluding spouse, child and parent from the impute keys list. Very few of these are actually associated with Nobel Prizes (the Curie family being the notable exception). However, there is already code to handle this when creating the features by looking at the name field, which gives variants of the name.

Most of the interesting stuff involving names is in academic advisors, doctoral advisors, notable students etc. And these fields still have the redirected names.

Issue with empty alma mater and number of alma mater

Alma mater is coming back as empty for some records, e.g. the 5th physicist in the training data. Note that Utrecht University has no country alpha-2 codes and consequently none of the other values. The number of alpha-2 codes is 1 even though the list is empty.

Probably better to just return empty list when nothing is found rather than special casing NaN values. In any case I will be converting these lists to indicators for each of the categoricals. It seems natural that the NaN value would be the one dropped in that case.
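
The suggested behaviour can be sketched as a small normalizing helper, so that the count always agrees with the list. The function name country_codes_feature is hypothetical.

```python
def country_codes_feature(codes):
    """Return (codes, count), treating None or NaN as an empty list."""
    # NaN is the only value that is not equal to itself.
    if codes is None or codes != codes:
        return [], 0
    cleaned = [code for code in codes if code]
    return cleaned, len(cleaned)
```

Deriving the count from the cleaned list avoids the mismatch described above, where the count is 1 while the list itself is empty.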

Build features for workplaces

Create features for workplaces: the workplace itself, the country and continent of each workplace, and the number of each of these.
