filippobovo / production-data-science

Production Data Science: a workflow for collaborative data science aimed at production

Jupyter Notebook 81.46% Python 18.54%
collaborative data-science production workflow

production-data-science's People

Contributors

filippobovo, kykrueger

production-data-science's Issues

Review on Explore Tutorial

  • At the end of the last tutorial I had not started or cloned a git repo (I did not follow the link), so to start this one I set up the folder I had been using as a new git repo with git init.
  • At the beginning I forgot to reactivate the titanic virtual environment, which I had to find and activate by typing cd ./.virtualenvs/titanic/bin && source activate.
  • Once I was ready to use the notebook, it took me some time to understand what to do: I launched Jupyter locally (running jupyter notebook) and created a new notebook from the titanic kernel just created (in Jupyter, New > titanic).
  • I had to find the local location of my dataset: ../data/titanic.csv did not work (see the sketch after this list).
  • At Out [6], after replacing missing values with the median, why do we look at the df head again? I would rather have checked the missing-value percentages to verify that the replacement worked correctly.
  • The git commands did not work smoothly because I was working in a locally created repo.
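
For reference, a minimal sketch of a path-independent way to load the dataset from a notebook, assuming the data folder sits next to the notebook's folder (the layout is an assumption, not necessarily the tutorial's):

# Build the path relative to the notebook's working directory instead of
# relying on where Jupyter was launched from.
import pandas as pd
from pathlib import Path

data_path = Path.cwd().parent / "data" / "titanic.csv"  # assumed layout
df = pd.read_csv(data_path)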

Question on command_line.py

I have a question on command_line.py. I believe I followed all the steps in the tutorial and entered the following command:

titanic_analysis --filename exploration/data/titanic.csv

When I enter this command, I get the following message:

titanic_analysis: command not found

It says in the instructions that "the virtual environment titanic has to be active to run this command". How do I activate the titanic virtual environment? I have the titanic_datascience virtual environment active, as described in the previous parts of the tutorial, and I have the titanic package installed, but I don't know how to activate the titanic virtual environment. Or could some other issue be causing the titanic_analysis: command not found error?

Any help would be appreciated!
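
For context, titanic_analysis only exists as a shell command inside the environment where the titanic package was installed, because it is exposed as a console-script entry point in the package's setup.py. A minimal sketch of such an entry point; the module path titanic.command_line:main is an assumption based on the file name:

# setup.py - sketch of how a command like titanic_analysis is typically declared
from setuptools import setup, find_packages

setup(
    name="titanic",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            # command name = module:function; the target here is assumed
            "titanic_analysis=titanic.command_line:main",
        ],
    },
)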

VirtualEnv vs. Conda

Hi @FilippoBovo ,

I finished your tutorial. Thanks for writing it. It was well written and informative.

I am wondering, though, why you chose to create a virtual environment based on virtualenv instead of conda. In my own limited experience doing data science, I have always used Anaconda and didn't come across virtualenvs until this tutorial. I also found this article comparing the two, in which the author recommends using Anaconda.

Is the reason because virtualenvs are far more common in the general software engineering world and you wanted to use this more general practice?

Thanks!

Give a motivation for splitting the notebook into several files

From Part C - Refactoring the Notebook:

In this section, we refactor some of the notebook code into the titanic package for production. We do this by creating the modules titanic/titanic/data.py and titanic/titanic/models.py, where we put, respectively, functions for data processing and predictive modelling.

It might make sense to elaborate on why it's good to keep files short and focused. You could mention the Single Responsibility Principle, which states that each piece of code should focus on doing one thing within a very limited scope. Modular files are easier to maintain; they discourage global variables, and push you towards variables with narrow scopes and towards clear input and output parameters (see the sketch below).

Source: https://github.com/Satalia/production-data-science/tree/master/tutorial/c-refactor#refactoring-the-notebook
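
As an illustration of such a focused module, a minimal sketch of what titanic/titanic/data.py could contain; the function and column names are made up for the example:

# titanic/data.py - a small module with a single responsibility: data processing
import pandas as pd


def fill_missing_ages(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with missing ages replaced by the median age."""
    df = df.copy()
    df["Age"] = df["Age"].fillna(df["Age"].median())
    return df

Each function takes explicit inputs and returns explicit outputs, which is what makes it easy to test and to reuse from both notebooks and production code.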

One folder for each exploratory project, even those with only one file

From Part B - Organisation of the Exploration Folder:

  • Dedicated Folder – Each exploratory project should be placed in a dedicated folder. If the project involves a single document, like a Jupyter Notebook, no folder is needed.

I would suggest having a folder for each piece of exploratory work, even if it involves only one file. During the refactoring stage you are very likely to break it down into several more focused files anyway. This way the top folder looks consistent: only subfolders and no loose notebook files (see the sketch below).

Source: https://github.com/Satalia/production-data-science/tree/master/tutorial/b-explore#organisation-of-the-exploration-folder
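
For example, the exploration folder would then contain only one level of subfolders, something like this (names are made up for illustration):

exploration/
    2017-01-21-fb-titanic-survival/
        titanic-survival.ipynb
    2017-02-03-fb-feature-engineering/
        feature-engineering.ipynb
        helpers.py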

Better explanation for branching for exploration

In software development, branching is used to isolate the development of new features. Branching can be easily extended to data science, where instead of developing new features, we explore data.

Here it would be clearer to explain that branches are used so that several people can work on their own explorations without interfering with each other's work. (Thanks Yohann)

Keep your focus on the goal at hand

From Part C - Refactoring for exploration:

It is important to understand that this is just a qualitative criterion which should act more as a principle rather than a rule. This is why the words "within reason" and "reasonably" were specified in the italic sentences above. For example, if an analysis just needs few words to prove a point, it does not make sense to forcefully simplify the code to obtain a better code to word ratio. In other words, rather than blindly following rules, understand the principles and the motivations behind them, and do what makes most sense.

A similar concept is stated in the Python PEP 8 document: "A Foolish Consistency is the Hobgoblin of Little Minds".

I'm not sure this aside serves the general purpose of the tutorial. It somewhat distracted me from the ideas you presented before it. Maybe it could belong in a separate section at the end of the tutorial, where you would invite participants to reflect on what they have just learned?

Source: https://github.com/Satalia/production-data-science/tree/master/tutorial/c-refactor#refactoring-for-exploration

Typo

There are several toolS for exploratory data analysis in Python. Some of the most widely used are:

Jupyter Notebook
Spyder or other IDEs specific to data science
Normal text editors

Typos in A - Setup

Remove .6 from python3.6:

git clone <git-repository>
cd titanic
mkvirtualenv --python=python3.6 titanic
pip install -r titanic/requierements.txt
pip install -e titanic

b-collaborate: clarity

Thanks for this tutorial; I appreciate the thought and effort behind it.

A couple of things failed for me due to clarity issues:

The tutorial doesn't say where to start, so I stayed in the virtualenv, where pip freeze doesn't work as expected. You might note where you intend folks to start this part of the tutorial.

Secondly, for the install_requires part, I'm not sure what's intended there. I manually wrote it in via nano, but running that code in a bash shell fails (see the sketch below).
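
For reference, install_requires belongs inside the setup() call in setup.py, which is Python rather than something to run in a bash shell. A minimal sketch, with an illustrative dependency list:

# setup.py - install_requires declares the package's runtime dependencies
from setuptools import setup, find_packages

setup(
    name="titanic",
    packages=find_packages(),
    install_requires=[
        "pandas",        # illustrative; list whatever the tutorial's package needs
        "scikit-learn",
    ],
)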

Replace code in Cookie Cutter template README.md

Replace this:

pip install -e {{cookiecutter.package_name}}
pip freeze | grep -v {{cookiecutter.package_name}} > {{cookiecutter.package_name}}/requirements.txt

with this:

pip install -r requirements.txt
pip install -e {{cookiecutter.package_name}}

Also, add later how to freeze the requirements:

pip freeze | grep -v {{cookiecutter.package_name}} > requirements.txt

Breaking notebooks and explorations under refactoring

When refactoring the core package, old explorations may break. For this reason, notebooks should be tested as well.

Testing a notebook may mean running it before and after the refactoring, and making sure that there are no errors and that the cell outputs are the same (see the sketch below).
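
A minimal sketch of the "run and check for errors" half, assuming nbformat and nbconvert are installed; the notebook path is illustrative, and comparing saved cell outputs would need an extra diff step:

# check_notebook.py - re-execute a notebook; raises CellExecutionError on failure
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

nb = nbformat.read("exploration/titanic-survival.ipynb", as_version=4)
ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
ep.preprocess(nb, {"metadata": {"path": "exploration"}})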

mkvirtualenv

Hello,
I know this is a basic question, but I'm not sure how to get out of the gate using mkvirtualenv to set up the virtual environment. I get the response "bash: mkvirtualenv: command not found".
I am using a debian instance on GCP, if that helps.
Any help is much appreciated.

Pytest warning

When running pytest, you may notice the following warning:

RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility.

This is a Cython warning that is normally silenced, but pytest un-silences it (see the sketch below).

See this link for more information.
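
If the warning is indeed harmless here, one way to hide it is pytest's filterwarnings marker; a sketch, not necessarily the fix the author has in mind:

# In a test module: ignore the benign binary-compatibility warning
import pytest


@pytest.mark.filterwarnings("ignore:numpy.ufunc size changed")
def test_example():
    assert True

The same filter can be applied to the whole suite through the filterwarnings option in pytest's configuration file.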

Elaborate on the concept of "code to text ratio"

From Part C - Refactoring for Exploration:

Because the main purpose of an exploratory analysis is to prove a point rather than showing code, refactoring for exploration should be aimed at reducing the code to text ratio, within reason. In this way, a notebook would look more like a document that uses words (and plots) to reason and prove a point.

However, with this criterion, we may as well increase the number of words, just to decrease the code to text ratio. This leads to longer documents that are both text and code heavy and, in turn, harder to read. A better solution is to simplify both code and text, while keeping the code to word ratio reasonably low.

What is this ratio actually about? What is the goal of this step? That is, can you give some examples of what is wanted and what is not? For instance, where does a file with no comments and only code stand on your quality scale? How about a file with lots of text but little code?

Source: https://github.com/Satalia/production-data-science/tree/master/tutorial/c-refactor#refactoring-for-exploration
