This repo introduces basic Python concepts to begin mastering programming, machine learning, and data science concepts.
This guide to Python overviews concepts and tools for machine learning in Python that include environments, packages, scripts, and notebooks.
Covering all the skills of data science is beyond the scope of this repo. Additional skills and tools are mentioned throughout to facilitate independent study. A great many resources, many of them free, are available for learning and experimenting with data science.
If you are new to data science, or desire a quick refresh, the Supporting Knowledge section covers some of these additional skills briefly to provide context and a more holistic understanding of the data science process.
Links to example files, websites, and notebooks are included when possible. Many subdirectories in this repo contain additional README files with a deeper dive into the specific concepts or example of that directory.
- Python Basics Repo
- Why Python
- Supporting Knowledge
- Packages
- Essential Packages
- Environments
- Workflow Tools
- Scripts
- Notebooks
- Effective Programming
- Tips and Tricks
This repo is designed to demonstrate key aspects of Python programming and basic data science. This readme provides an overview of key concepts and links to other relevant files in the repo. Additional information on some topics may be found in Markdowns. Markdown (.md) files include text, links, and code snippets. Notebook (.ipynb) files include text, executable code, and data visualizations. Python script (.py) files also contain executable code and may save models and files, but do not display visuals directly. See Workflow Tools for more information on the use of scripts vs notebooks.
Examples are divided by goal (Python vs ML) and analysis type (e.g., classification vs regression). Details on setting up workflow and running code are found throughout this file. A common environment is used for all examples. NOTE: Code will not execute correctly without the proper environment. Examples are designed to run independently. Therefore, once the appropriate environment is set up and activated, examples may be run in any order. NOTE: Notebooks and scripts within an example may have a required order of execution.
Outlined below are key directories (alphabetically) in this repo:
- Basics
- Basics.ipynb: Notebook introduction to Python data structures and syntax
- Basics.md: Covers basic Python operations including loops, functions, methods, and classes
- ClassificationExamples
- iris: example workflow classifying 3 species of iris flowers using multiple notebooks
- random_forest_lecture: a single notebook with in-depth details on decision trees and random forests. Good starting point for individuals with programming experience looking for entry into machine learning
- Environment
- README: details on environment installs and management tools
- environment files: txt and ymls containing required details on Python version and packages
- LICENSE: Info on using this repository
- Markdowns
- Data Descriptors: Factors that define and provide insight into data quality, such as structured data, missing values, and time-series features
- Data Science Topics: Overview of data science goals and topics providing a quick explanation of types of analyses and their importance
- Helpful Tools: Tools to optimize efficiency including text editing, debugging, style guides, and more
- Jupyter Notebooks: Describes cell types, kernels, and basic execution instructions for notebooks
- Model Training and Evaluation: Terminology and descriptions of core data science best practices, training, and evaluation of models
- Operations: Loops, dictionaries, functions, and methods to control workflows
- Package Importance: Expands on Packages section emphasizing the importance of writing clean, reusable, and maintainable code
- README.md: This file with an overview of many topics. If new to Python, read the whole file. If experienced, use the Topics section to jump to topics of interest
- RegressionExamples
- housing: multi-notebook regression example using historic Boston housing prices
With the goal of consistency, several common directories may be found in example folders. Some examples may be a single notebook, while others contain all the directories listed below.
- clf_results: holds results of multiple runs with metrics and visuals for each
- images: images generated from EDA or analysis
- notebooks: collection of ordered notebooks to run
- README: markdown file with details about that specific example
- src: python scripts which may be imported into notebooks or executed directly
Python is a popular platform for machine learning due to its simplicity, readability, and extensive ecosystem of libraries and frameworks. Its syntax is straightforward, making it accessible for beginners and efficient for experienced programmers. Python’s robust community support and comprehensive documentation further enhance its usability in machine learning projects. Key libraries and packages such as Numpy, Pandas, Scikit-Learn, TensorFlow, and PyTorch provide powerful tools for numerical computations, data manipulation, and model building, streamlining the development process. More on packages is found in the Packages section below.
To effectively work in Python for machine learning, it’s important to have knowledge of the following tools: Git, GitHub, CLIs, and SQL.
Deep-dive topics of interest:
- Data Types: Summary of data types and factors for data science such as structured or not, missing values, and time-series features
- Data Science Topics: Overview of data science goals and topics providing a quick explanation of types of analyses and their importance
- Model Training and Evaluation: Terminology and descriptions of core data science best practices, training, and evaluation of models
- Helpful Tools: Tools to optimize efficiency including text editing, debugging, style guides, and more
- Package Importance: Expands on the Packages section emphasizing the importance of writing clean, reusable, and maintainable code

Git and GitHub:

- Git: A version control system that helps track changes to code, collaborate with others, and manage different versions of a project.
- Basic Commands:

```shell
git init                 # Initialize a new repository
git add .                # Stage changes
git commit -m "message"  # Commit changes
git push                 # Push changes to a remote repository
```
- GitHub: A platform for hosting and managing Git repositories, facilitating collaboration and sharing of code.
  - NOTE: Repos can be private or public, making it very important to consider what information and data is shared and with whom.
Command Line Interfaces (CLIs):
- CLIs allow for efficient navigation and execution of commands on your computer. Familiarity with basic command line operations is essential for tasks such as setting up environments, managing packages, and running scripts.
- Basic Commands:

```shell
ls                    # List directory contents
cd path/to/directory  # Change directory
mkdir new_directory   # Create a new directory
```
SQL:
- A relational database querying language that's efficient for data analysis and manipulation
- Data scientists must often collect data from databases such as SQL for analysis
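As a minimal sketch of querying a database from Python, the snippet below uses the standard library's sqlite3 module with an in-memory database; the table name and values are made up for illustration:

```python
import sqlite3

# Create an in-memory database with a sample table (stand-in for a real database)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (species TEXT, petal_length REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("setosa", 1.5), ("setosa", 2.5), ("virginica", 5.5)],
)

# A typical analysis query: aggregate values per group
rows = conn.execute(
    "SELECT species, AVG(petal_length) FROM measurements"
    " GROUP BY species ORDER BY species"
).fetchall()
print(rows)  # [('setosa', 2.0), ('virginica', 5.5)]
conn.close()
```

The same query patterns carry over to production databases (PostgreSQL, MySQL, etc.); only the connection library changes.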
A package in Python is a collection of modules organized in directories that provide reusable code and functionality. Packages allow you to structure your Python code logically and systematically, facilitating code management and reuse. They can include various types of files, such as modules (individual Python files), sub-packages (directories with their own `__init__.py` file), and data files.
Example Structure of a Package:
```
mypackage/
    __init__.py
    module1.py
    module2.py
    subpackage/
        __init__.py
        submodule1.py
```
- `__init__.py`: Indicates that the directory should be treated as a package.
- Modules: Python files (`.py`) containing code, functions, classes, and variables.
- Sub-packages: Nested directories with their own `__init__.py` files, further organizing code.
Using a Package:
```python
# Importing a module from a package
from mypackage import module1

# Using functions or classes from the imported module
result = module1.my_function()
```
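To see this layout in action, here is a self-contained sketch that writes the example package to a temporary directory and then imports it; the body of `my_function` is made up purely for illustration:

```python
import os
import sys
import tempfile

# Build the example layout on disk so the import below actually works
root = tempfile.mkdtemp()
pkg = os.path.join(root, "mypackage")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("")  # an empty __init__.py marks the directory as a package
with open(os.path.join(pkg, "module1.py"), "w") as f:
    f.write("def my_function():\n    return 'hello from module1'\n")

# Make the package importable, then use it
sys.path.insert(0, root)
from mypackage import module1

result = module1.my_function()
print(result)  # hello from module1
```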
There are many benefits to the package organization including sharing, updating, and reproducing analysis. For more details see: Package Importance.
Key benefits, elaborated on in the file linked above, include:
- Code Reusability
- Maintainability
- Collaboration
- Reproducibility
- Robustness
Numpy is a library for numerical computations. It is used directly and is also a common dependency for other relevant libraries.
- Creating Arrays:
```python
import numpy as np

a = np.array([1, 2, 3])
```
- Array Operations:
```python
b = np.array([4, 5, 6])
c = a + b  # array addition
```
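Beyond element-wise addition, a brief sketch of a few other everyday NumPy operations (values chosen for illustration):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Element-wise arithmetic
c = a + b  # array([5, 7, 9])

# Aggregations over an array
total = c.sum()   # 21
mean = c.mean()   # 7.0

# Broadcasting: a scalar is applied to every element
scaled = a * 10   # array([10, 20, 30])
print(total, mean, scaled)
```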
Pandas is used for data manipulation and analysis.
A pandas DataFrame is a two-dimensional data structure that can store data of different types, including:
- Characters
- Integers
- Floating point values
- Categorical data
- Dates and times
- Boolean values
DataFrames are similar to spreadsheets, SQL tables, or the data.frame object in R. They are made up of rows and columns, and each cell can contain a single value. DataFrames are often used to store and analyze data from a variety of sources, such as CSV files, Excel spreadsheets, and SQL databases.
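As a small sketch of loading data from one such source (a CSV file written to a temporary directory just for this example):

```python
import os
import tempfile

import pandas as pd

# Write a tiny CSV to stand in for a real data file
csv_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(csv_path, "w") as f:
    f.write("name,score\nalice,90\nbob,85\n")

# Load it into a DataFrame; the first row becomes the column names
df = pd.read_csv(csv_path)
print(df.shape)          # (2, 2)
print(list(df.columns))  # ['name', 'score']
```

`pd.read_excel` and `pd.read_sql` follow the same pattern for the other sources mentioned above.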
DataFrames can be used to perform a variety of data analysis tasks, such as:
- Filtering and sorting data
- Calculating summary statistics
- Creating visualizations
- Joining and merging dataframes
- Performing machine learning
- Creating DataFrames:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
```

- Basic Operations:

```python
df['C'] = df['A'] + df['B']  # add new column
```
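A few of the analysis tasks listed above can be sketched briefly (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Filtering: keep rows where A is greater than 1
filtered = df[df["A"] > 1]

# Sorting: order rows by column B, descending
ordered = df.sort_values("B", ascending=False)

# Summary statistics: per-column mean
means = df.mean()
print(filtered.shape, means["A"])  # (2, 2) 2.0
```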
Scikit-learn is a machine learning library. It offers a wide variety of analysis options, many of which follow a standard order of operations. Most commonly, the sequence is to instantiate a model, fit the model, and then predict with the model. It also includes tools for splitting data for training, testing, and/or cross-validation.
Analysis types include:
- Classification
- Regression
- Clustering
- Dimensionality Reduction
- Model Selection
- Data Preprocessing

- Training a Model:

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
```

- Making Predictions:

```python
predictions = model.predict(X_test)
```
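Putting the instantiate / fit / predict sequence together with a train/test split, here is a self-contained sketch on synthetic data (made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y is a noiseless linear function of X
X = np.arange(20).reshape(-1, 1)
y = 3 * X.ravel() + 1

# Hold out a test set, then instantiate, fit, and predict
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# R^2 on held-out data; close to 1.0 here because the data is noiseless
score = model.score(X_test, y_test)
```

Evaluating on held-out data, rather than the training set, is what guards against overfitting; see Model Training and Evaluation for details.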
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. It is highly customizable and supports a wide range of plots and charts.
- Key Features:
- 2D Plotting: Line plots, scatter plots, bar charts, histograms, etc.
- Customization: Extensive options for customizing the appearance of plots, including colors, labels, and styles.
- Integration: Can be used with Jupyter Notebooks for interactive visualizations.
- Example Usage:
```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 40]

plt.plot(x, y)
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Sample Plot')
plt.show()
```
Seaborn is a data visualization library built on top of Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- Key Features:
- Statistical Plots: Supports complex visualizations like violin plots, box plots, pair plots, and heatmaps.
- Themes: Comes with built-in themes for styling Matplotlib graphics.
- DataFrames: Works seamlessly with Pandas DataFrames for data visualization.
- Example Usage:
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load sample dataset
data = sns.load_dataset('iris')

# Create a pair plot
sns.pairplot(data, hue='species')
plt.show()
```
Plotly is a powerful graphing library for Python that allows for interactive, publication-quality visualizations. It is particularly useful for creating complex and customizable visualizations with ease. It integrates seamlessly with Pandas and can be used in Jupyter notebooks for interactive data analysis.
- Key Features:
- Interactive Plots: Zoom, pan, hover, and click to interact with data points.
- Wide Range of Charts: From simple line charts to complex 3D scatter plots.
- Customization: Extensive options for styling and customizing visualizations.
- Integration: Works well with Pandas and Jupyter notebooks.
- Offline and Online Modes: Create plots that work both offline and online.
- Example Usage:

```python
import plotly.graph_objects as go

# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 11, 12, 13, 14]

# Create a figure
fig = go.Figure()

# Add a line trace
fig.add_trace(go.Scatter(x=x, y=y, mode='lines', name='Line Chart'))

# Add titles and labels
fig.update_layout(title='Basic Line Chart', xaxis_title='X Axis', yaxis_title='Y Axis')

# Show the plot
fig.show()
```
Python has a large and ever-growing number of resources for machine learning, AI, visualizations, and more.
Additional advanced libraries of interest for data science:
Environments, and their management, are essential to successful project collaboration and maintenance. It is vital to document the environment in which a project was made and deployed because Python versions and packages change. Without a record, it is easy to introduce errors and issues into your code. A list of packages and versions can be generated with Conda or pip to help reduce such problems. Multiple Python environments can be stored and activated when needed. See Environments for more details.
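As a sketch, a minimal Conda environment file (`environment.yml`; the environment name and pinned version here are illustrative) might look like:

```yaml
name: py-basics
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy
  - pandas
  - scikit-learn
  - jupyter
```

An environment can then be recreated with `conda env create -f environment.yml` and activated with `conda activate py-basics`; with pip, `pip freeze > requirements.txt` serves a similar record-keeping role.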
Code can be developed and executed in many different ways. The two primary data science options are scripts and notebooks.
Choosing between scripts and notebooks depends on the task at hand and the stage of your workflow. Here are some guidelines on when to use each:
- Production Code:
  - Scripts are ideal for code that needs to be run in production environments. They can be version-controlled more effectively and integrated into automated workflows.
  - Scripts are easier to test, maintain, and refactor compared to notebooks.
- Automation and Reproducibility:
  - If you need to automate tasks, such as data processing pipelines, scripts are more suitable because they can be scheduled and executed without manual intervention.
  - Scripts are also better for ensuring reproducibility of your workflows.
- Complex Projects:
  - For larger projects with multiple components, scripts (organized into modules and packages) are preferable. They allow for better organization, modularity, and separation of concerns.
  - Using scripts makes it easier to manage dependencies and environment configurations.
- Performance:
  - If you need to optimize performance and reduce overhead, scripts are typically more efficient as they avoid the interactive overhead of notebooks.
- Exploratory Data Analysis (EDA):
  - Notebooks are great for EDA because they allow you to iteratively run and refine your code, visualize data, and document insights in a linear, readable manner.
  - The ability to mix code, visualizations, and markdown makes it easy to understand and share your exploratory process.
- Data Visualization:
  - If you are generating plots and visualizations, notebooks allow you to see the results immediately and make adjustments on the fly.
- Documentation and Tutorials:
  - Notebooks are excellent for creating tutorials or documenting workflows because of their readability and interactivity.
  - Combining code with detailed explanations helps others (or future you) understand the thought process and logic behind the code.
- Interactive Development:
  - For tasks where you need to quickly test hypotheses or experiment with different approaches, notebooks provide a flexible environment.

Sometimes, a project may require both interactive analysis and automated processing. You can use notebooks for the interactive parts (like data exploration and visualization) and scripts for the automated parts (like data cleaning and model training).

- Prototyping to Production:
  - You might start with a notebook for prototyping and exploration. Once the approach is validated, you can refactor the code into scripts for production deployment.
  - This allows you to benefit from the interactive development environment of notebooks while ensuring the robustness and maintainability of scripts.
- Comprehensive Reports:
  - Notebooks can be used to create detailed reports with visualizations and explanations. Scripts can be called within these notebooks to run heavy computations or preprocess data.
  - This combination allows you to maintain clean, well-documented analyses while leveraging the efficiency of scripts for computational tasks.
- Machine Learning Workflow:
  - Notebook: Initial data exploration, feature engineering, model prototyping, and visualizations.
  - Script: Data preprocessing pipeline, model training, evaluation, and deployment.
- Data Analysis Project:
  - Notebook: EDA, hypothesis testing, and generating visualizations for reports.
  - Script: Data extraction, transformation, and loading (ETL) processes, as well as scheduled report generation.
- Tutorial Development:
  - Notebook: Creating interactive tutorials with step-by-step explanations and code snippets.
  - Script: Including complex functions or utility scripts to keep the notebook concise and focused on the teaching content.
By leveraging the strengths of both scripts and notebooks, you can create a workflow that is both efficient and easy to maintain, while also being adaptable to different stages of your projects.
Scripts are Python files (.py) that contain code to be executed.
- Basic Structure:
```python
# myscript.py
import numpy as np
import pandas as pd

def main():
    df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
    df['C'] = df['A'] + df['B']
    print(df)

if __name__ == "__main__":
    main()
```
- Running a Script:
```shell
python myscript.py
```
Notebooks allow for interactive code execution and visualization.
- Installing Jupyter:
```shell
conda install jupyter  # or: pip install jupyter
```
- Starting a Notebook:
```shell
jupyter notebook
```
- Basic Operations in a Notebook:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Create data
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Plot data
plt.plot(x, y)
plt.show()
```
To maximize efficiency, data scientists not only rely on packages and libraries, but must also customize their workflow and tools. A myriad of tools exist for text editing, debugging, optimization, and testing. Available features include color syntax highlighting, color themes, testing frameworks, and documentation standards.
I highly encourage investing time in researching, selecting, and fine-tuning your tools and workflows. This investment will save many hours down the road by providing a positive workspace and on-the-go feedback.
A concise list of options with brief descriptions is in HelpfulTools.
Below are code snippets for common commands that can help quickly pinpoint code issues.
**NOTE:** Examples shown below are from Mac and should work on most Linux distros too. Windows machines may require different commands and syntax.
- Checking Versions: A common source of error is the use of incorrect versions of required software. Remember, environments help address this issue, and the correct environment must be activated. If an environment is active, the env name may appear in the filepaths returned.
  - Shell Version (in terminal):

```shell
echo "$SHELL"
```

  - Python Version (CLI):

```shell
which python      # filepath
python --version  # version
```

  - Python Version (notebook):

```python
import sys
sys.executable
```

  - Package Version (notebook or script):

```python
import pandas as pd
pd.__version__
```
- `.gitignore`: Git repos have a hidden file, `.gitignore`, which can be used to exclude files from being stored in the repo. Git/GitHub is not an ideal location for large data storage, which can make the repo slow to sync and update. Data, large model files, and junk files should often be ignored.

```
file_to_ignore.txt  # ignore single file
**/data/            # ignore all directories named data anywhere in the repo
**/.DS_Store        # ignore all files with this name anywhere in the repo
```