filippobovo / production-data-science

Production Data Science: a workflow for collaborative data science aimed at production

Jupyter Notebook 81.46% Python 18.54%
collaborative data-science production workflow

production-data-science's People

Contributors

filippobovo, kykrueger

production-data-science's Issues

Review on Explore Tutorial

  • At the end of the last tutorial I had not started or cloned a git repo (I did not follow the link), so to start this one I set up the folder I had been using as a new git repo with git init.
  • At the beginning I forgot to reactivate the titanic virtual environment, which I had to find and activate by typing cd ./.virtualenvs/titanic/bin && source activate.
  • Once I was ready to use the notebook, it took me some time to understand what to do: I launched Jupyter locally (running jupyter notebook) and created a new notebook from the titanic kernel just created (in Jupyter, New > titanic).
  • I had to find the local location of my dataset: ../data/titanic.csv did not work (see the sketch after this list).
  • At Out [6], after replacing missing values with the median, why do we look at the df head again? I would rather have checked the missing-value percentages to verify that the replacement worked correctly.
  • The git commands did not work smoothly because I was working in a locally created repo.
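
For reference, a minimal sketch of a path-independent way to load the dataset from a notebook, assuming the data folder sits next to the notebook's folder (the layout is an assumption, not necessarily the tutorial's):

# Build the path relative to the notebook's working directory instead of
# relying on where Jupyter was launched from.
import pandas as pd
from pathlib import Path

data_path = Path.cwd().parent / "data" / "titanic.csv"  # assumed layout
df = pd.read_csv(data_path)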

Question on command_line.py

I have a question on command_line.py. I believe I followed all the steps in the tutorial and entered the following command:

titanic_analysis --filename exploration/data/titanic.csv

When I enter this command, I get the following message:

titanic_analysis: command not found

It says in the instructions that "the virtual environment titanic has to be active to run this command". How do I activate the titanic virtual environment? I have the titanic_datascience virtual environment active, as described in the previous parts of the tutorial, and I have the titanic package installed, but I don't know how to activate the titanic virtual environment. Or could some other issue be causing the titanic_analysis: command not found error?

Any help would be appreciated!
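
For context, titanic_analysis only exists as a shell command inside the environment where the titanic package was installed, because it is exposed as a console-script entry point in the package's setup.py. A minimal sketch of such an entry point; the module path titanic.command_line:main is an assumption based on the file name:

# setup.py - sketch of how a command like titanic_analysis is typically declared
from setuptools import setup, find_packages

setup(
    name="titanic",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            # command name = module:function; the target here is assumed
            "titanic_analysis=titanic.command_line:main",
        ],
    },
)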

VirtualEnv vs. Conda

Hi @FilippoBovo ,

I finished your tutorial. Thanks for writing it. It was well written and informative.

I am wondering, though, why you chose to create a virtual environment based on virtualenv instead of conda. In my own limited experience doing data science, I have always used Anaconda and didn't come across virtualenvs until this tutorial. I also found this article comparing the two, in which the author recommends using Anaconda.

Is the reason because virtualenvs are far more common in the general software engineering world and you wanted to use this more general practice?

Thanks!

Give a motivation for splitting the notebook into several files

From Part C - Refactoring the Notebook:

In this section, we refactor some of the notebook code into the titanic package for production. We do this by creating the modules titanic/titanic/data.py and titanic/titanic/models.py, where we put, respectively, functions for data processing and predictive modelling.

It might make sense to elaborate on why it's good to keep files short and focused. You could mention the Single Responsibility Principle, which states that each piece of code should focus on doing one thing within a very limited scope. Modular files are easier to maintain; they discourage global variables, and push you towards variables with narrow scopes and towards clear input and output parameters (see the sketch below).

Source: https://github.com/Satalia/production-data-science/tree/master/tutorial/c-refactor#refactoring-the-notebook
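
As an illustration of such a focused module, a minimal sketch of what titanic/titanic/data.py could contain; the function and column names are made up for the example:

# titanic/data.py - a small module with a single responsibility: data processing
import pandas as pd


def fill_missing_ages(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with missing ages replaced by the median age."""
    df = df.copy()
    df["Age"] = df["Age"].fillna(df["Age"].median())
    return df

Each function takes explicit inputs and returns explicit outputs, which is what makes it easy to test and to reuse from both notebooks and production code.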

One folder for each exploratory project, even those with only one file

From Part B - Organisation of the Exploration Folder:

  • Dedicated Folder – Each exploratory project should be placed in a dedicated folder. If the project involves a single document, like a Jupyter Notebook, no folder is needed.

I would suggest having a folder for each piece of exploratory work, even if it involves only one file. During the refactoring stage you are very likely to break it down into several more focused files anyway. This way the top folder looks consistent: only subfolders and no loose notebook files (see the sketch below).

Source: https://github.com/Satalia/production-data-science/tree/master/tutorial/b-explore#organisation-of-the-exploration-folder
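
For example, the exploration folder would then contain only one level of subfolders, something like this (names are made up for illustration):

exploration/
    2017-01-21-fb-titanic-survival/
        titanic-survival.ipynb
    2017-02-03-fb-feature-engineering/
        feature-engineering.ipynb
        helpers.py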

Better explanation for branching for exploration

In software development, branching is used to isolate the development of new features. Branching can be easily extended to data science, where instead of developing new features, we explore data.

Here it would be clearer to explain that branches are used so that several people can work on their own explorations without interfering with each other's work. (Thanks Yohann)

Keep your focus on the goal at hand

From Part C - Refactoring for exploration:

It is important to understand that this is just a qualitative criterion which should act more as a principle rather than a rule. This is why the words "within reason" and "reasonably" were specified in the italic sentences above. For example, if an analysis just needs few words to prove a point, it does not make sense to forcefully simplify the code to obtain a better code to word ratio. In other words, rather than blindly following rules, understand the principles and the motivations behind them, and do what makes most sense.

A similar concept is stated in the Python PEP 8 document: "A Foolish Consistency is the Hobgoblin of Little Minds".

I'm not sure this aside serves the general purpose of the tutorial. It somewhat distracted me from the ideas you presented before it. Maybe it could belong in a separate section at the end of the tutorial, where you would invite participants to reflect on what they have just learned?

Source: https://github.com/Satalia/production-data-science/tree/master/tutorial/c-refactor#refactoring-for-exploration

Typo

There are several toolS for exploratory data analysis in Python. Some of the most widely used are:

Jupyter Notebook
Spyder or other IDEs specific to data science
Normal text editors

Typos in A - Setup

Remove .6 from python3.6:

git clone <git-repository>
cd titanic
mkvirtualenv --python=python3.6 titanic
pip install -r titanic/requierements.txt
pip install -e titanic

b-collaborate: clarity

Thanks for this tutorial; I appreciate the thought and effort behind it.

A couple of things failed for me due to clarity issues:

The tutorial doesn't say where to start, so I stayed in the virtualenv, where pip freeze doesn't work as expected. You might note where you intend folks to start this part of the tutorial.

Secondly, for the install_requires part, I'm not sure what's intended there. I manually wrote it in via nano, but running that code in a bash shell fails (see the sketch below).
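
For reference, install_requires belongs inside the setup() call in setup.py, which is Python rather than something to run in a bash shell. A minimal sketch, with an illustrative dependency list:

# setup.py - install_requires declares the package's runtime dependencies
from setuptools import setup, find_packages

setup(
    name="titanic",
    packages=find_packages(),
    install_requires=[
        "pandas",        # illustrative; list whatever the tutorial's package needs
        "scikit-learn",
    ],
)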

Replace code in Cookie Cutter template README.md

Replace this:

pip install -e {{cookiecutter.package_name}}
pip freeze | grep -v {{cookiecutter.package_name}} > {{cookiecutter.package_name}}/requirements.txt

with this:

pip install -r requirements.txt
pip install -e {{cookiecutter.package_name}}

Also, add later how to freeze the requirements:

pip freeze | grep -v {{cookiecutter.package_name}} > requirements.txt

Breaking notebooks and explorations under refactoring

When refactoring the core package, old explorations may break. For this reason, notebooks should be tested as well.

Testing a notebook may mean running it before and after the refactoring, and making sure that there are no errors and that the cell outputs are the same (see the sketch below).
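
A minimal sketch of the "run and check for errors" half, assuming nbformat and nbconvert are installed; the notebook path is illustrative, and comparing saved cell outputs would need an extra diff step:

# check_notebook.py - re-execute a notebook; raises CellExecutionError on failure
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

nb = nbformat.read("exploration/titanic-survival.ipynb", as_version=4)
ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
ep.preprocess(nb, {"metadata": {"path": "exploration"}})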

mkvirtualenv

Hello,
I know this is a basic question, but I'm not sure how to get out of the gate using mkvirtualenv to set up the virtual environment. I get the response "bash: mkvirtualenv: command not found".
I am using a debian instance on GCP, if that helps.
Any help is much appreciated.

Pytest warning

When running pytest, you may notice the following warning:

RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility.

This is a Cython warning that is normally silenced, but pytest un-silences it (see the sketch below).

See this link for more information.
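
If the warning is indeed harmless here, one way to hide it is pytest's filterwarnings marker; a sketch, not necessarily the fix the author has in mind:

# In a test module: ignore the benign binary-compatibility warning
import pytest


@pytest.mark.filterwarnings("ignore:numpy.ufunc size changed")
def test_example():
    assert True

The same filter can be applied to the whole suite through the filterwarnings option in pytest's configuration file.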

Elaborate on the concept of "code to text ratio"

From Part C - Refactoring for Exploration:

Because the main purpose of an exploratory analysis is to prove a point rather than showing code, refactoring for exploration should be aimed at reducing the code to text ratio, within reason. In this way, a notebook would look more like a document that uses words (and plots) to reason and prove a point.

However, with this criterion, we may as well increase the number of words, just to decrease the code to text ratio. This leads to longer documents that are both text and code heavy and, in turn, harder to read. A better solution is to simplify both code and text, while keeping the code to word ratio reasonably low.

What is this ratio actually about? What is the goal of this step? That is, can you give some examples of what is wanted and what is not? For instance, where does a file with no comments and only code stand on your quality scale? How about a file with lots of text but little code?

Source: https://github.com/Satalia/production-data-science/tree/master/tutorial/c-refactor#refactoring-for-exploration
