Coder Social home page Coder Social logo

zstenger93 / data_science_piscine Goto Github PK

View Code? Open in Web Editor NEW
2.0 1.0 0.0 76 KB

42 Data Science Piscine using SQL, Python and docker-compose

Python 79.42% Shell 1.79% SQL 18.79%
42 42heilbronn data-science datascience piscine python docker-compose docker-container sql data-analysis

data_science_piscine's Introduction

data_science_piscine

Since postgres is not available on our Mac's and we are allergic to VM's we decided to run things on docker

I recommend move all of your database csv files to the docker image and not trying to access it from outside which will reduce your headaches a bit and as well executing all the queries directly on the postgres database which for creating the 5 table takes ~35 sec compared to 8.5 min otherwise.

Day 0 Data Engineer

ex 00 and 01 was just about to create the docker-compose file which runs 2 service, postgres for database and pgadmin to handle and see the db easier

ex 02 we created our first data table where I quickly realized doing things NOT ON the database make it significantly slower so first I connect to the postgres database and run my sql code directly on the database which made a table creation 7 second instead of 1,5-2 minute

ex 03 is the same as 02 but we need to create all the tables from the provided csv files

ex 04 we create the items table from the corresponding csv file, technically same as 02

Day 1 Data Warehouse

ex 00 create all the tables again from previous day with one extra from this subject.

ex 01 we merge all data from 2022 oct until 2023 feb into a customers table keeping all the duplicates

ex 02 now, remove all the duplicate rows from table

ex 03 we merge the items table into the customers table but be careful, the items table has duplicates as well, so you should group those duplicates before adding the values to the customers table with the matching id's

Day 2 Data Analyst

ex 00 we create a pie chart from the event types. For this day the pdf asks to use the data without removing the duplicates which makes no sense so we used the cleared datatable

ex 01 to create a plot for the number of customers during the time period from the data, a histogram for the total sales in each month and another plot with the average spend/cutomer/month which is filled up

ex 02 creating a box plot, another from the Interquartile range (IQR) of the box plot, and at the third one you will rethink your lifedecisions

ex 03 instant braindamage

The frequency will be the count of purchases, grouped by the user_id

The monetary value will be the sum of prices grouped by user_id and the sum is below 225, ordered by the total

ex 04 Elbow Method

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)

In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.

The elbow method is considered both subjective and unreliable. In many practical applications, the choice of an "elbow" is highly ambiguous as the plot does not contain a sharp elbow.

WSS = Σ (i=1 to k) Σ (j=1 to n_i) ||x_ij - c_i||^2

Where:

WSS is the within-cluster sum of squares.

k is the number of clusters.

n_i is the number of data points in cluster i.

x_ij is the j-th data point in cluster i.

c_i is the centroid of cluster i.

ex 05 CLustering algorithms

You are required to use clustering algorithms to create at least 2 plot from the website customers grouped by let's say, their activity on the website, how many times did they purchase something, are they loyal customers or just bought something once, etc.

Day 3 Data Scientist 1

Day 4 Data Scientist 1

A few alias/command to make things fast, or at least faster than python

up alias as:

alias up='docker-compose up -d'

run alias as:

alias run='docker stop $(docker ps -a -q) && docker rm $(docker ps -a -q) && docker-compose up -d'

connect alias as:

alias connect='docker exec -it postgres psql -U zstenger -d piscineds -h localhost -W'

postgres command examples

SELECT * FROM data_2022_oct LIMIT 10;
DROP TABLE "data_2022_oct";
SELECT COUNT(*) FROM customers;

list tables \dt \d

quit \q

data_science_piscine's People

Contributors

zstenger93 avatar

Stargazers

 avatar  avatar

Watchers

 avatar

data_science_piscine's Issues

Setup question

Hi i'm a 42 student and even if i haven't finish my CC, i ve started looking to this data science piscine. I ve noticed that you are recommending to launch everythin and have the csv file inside a docker images ? Can you explain what do you mean, how to do it and why ? Thanks

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.