Since Postgres is not available on our Macs and we are allergic to VMs, we decided to run everything in Docker.
I recommend moving all of your database csv files into the Docker image instead of trying to access them from outside, which will reduce your headaches a bit, and executing all the queries directly on the Postgres database: creating the 5 tables takes ~35 seconds that way, compared to 8.5 minutes otherwise.
ex 00 and 01
These were about creating the docker-compose file, which runs 2 services: postgres for the database and pgadmin to manage and inspect the db more easily.
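A minimal sketch of what that docker-compose file can look like. The image tags, password, ports, and volume path are assumptions (placeholders); only the container name, user, and database name are taken from the psql alias further down.

```yaml
services:
  postgres:
    image: postgres:15                 # version is an assumption
    container_name: postgres           # matches `docker exec -it postgres` in the aliases
    environment:
      POSTGRES_USER: zstenger          # user/db taken from the connect alias
      POSTGRES_PASSWORD: changeme      # placeholder
      POSTGRES_DB: piscineds
    ports:
      - "5432:5432"
    volumes:
      - ./csv:/csv                     # csv files live inside the container

  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@example.com   # placeholder
      PGADMIN_DEFAULT_PASSWORD: admin            # placeholder
    ports:
      - "8080:80"
```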
ex 02
We created our first data table, where I quickly realized that doing things NOT on the database makes them significantly slower. So first I connect to the Postgres database and run my SQL code directly on it, which brought table creation down to 7 seconds instead of 1.5-2 minutes.
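The speedup comes from letting Postgres do the import itself with a single server-side COPY, instead of streaming inserts from outside. A hedged sketch of the kind of statement we run; the column list and the in-container csv path are assumptions:

```python
# Build a server-side COPY statement for Postgres.
# Column names and the path inside the container are assumptions;
# the table name matches the ones used elsewhere in this repo.
table = "data_2022_oct"
columns = ["event_time", "event_type", "product_id", "price", "user_id"]
csv_path = "/csv/data_2022_oct.csv"

copy_sql = (
    f"COPY {table} ({', '.join(columns)}) "
    f"FROM '{csv_path}' DELIMITER ',' CSV HEADER;"
)
print(copy_sql)
```

Executed inside psql (or via a client connected to the container), this reads the file directly from the container's filesystem, which is why the csv files should live inside the image.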
ex 03
Same as 02, but we need to create all the tables from the provided csv files.
ex 04
We create the items table from the corresponding csv file; technically the same as 02.
ex 00
We create all the tables from the previous day again, plus one extra from this subject.
ex 01
We merge all data from October 2022 until February 2023 into a customers table, keeping all the duplicates.
ex 02
Now we remove all the duplicate rows from the table.
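In SQL this is typically done with `SELECT DISTINCT` into a new table; the same idea as a pure-Python sketch, with rows represented as tuples (the row values are made up):

```python
# Remove exact duplicate rows while preserving first-seen order --
# the same effect as SELECT DISTINCT over the whole row.
rows = [
    ("2022-10-01", "purchase", 5.99, 42),
    ("2022-10-01", "purchase", 5.99, 42),   # exact duplicate
    ("2022-10-02", "view", 0.0, 43),
]

seen = set()
deduped = []
for row in rows:
    if row not in seen:
        seen.add(row)
        deduped.append(row)
print(deduped)
```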
ex 03
We merge the items table into the customers table. Be careful: the items table has duplicates as well, so you should group those duplicates before adding the values to the customers table with the matching ids.
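A sketch of the group-then-merge idea in pure Python, with invented column names: collapse the duplicate item rows per product id first, then attach the result to the customer rows with the matching id.

```python
# items has duplicate product_id rows; collapse to one row per product_id
# first (the GROUP BY step), then merge into customers on the matching id.
items = [
    {"product_id": 1, "category_id": 100},
    {"product_id": 1, "category_id": 100},   # duplicate to be collapsed
    {"product_id": 2, "category_id": 200},
]
grouped = {}
for it in items:
    grouped.setdefault(it["product_id"], it)   # keep one row per product_id

customers = [
    {"user_id": 42, "product_id": 1},
    {"user_id": 43, "product_id": 2},
]
merged = [
    {**c, "category_id": grouped[c["product_id"]]["category_id"]}
    for c in customers
]
print(merged)
```

Without the grouping step, each duplicate item row would multiply the matching customer rows in the join, which is exactly the trap the subject sets here.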
ex 00
We create a pie chart from the event types. For this day the pdf asks to use the data without removing the duplicates, which makes no sense, so we used the cleaned table.
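The pie chart itself is just a plotting call; the part worth sketching is computing each event type's share. A minimal version with made-up event names:

```python
from collections import Counter

# Count each event type, then turn the counts into pie-chart fractions.
events = ["view", "view", "cart", "purchase", "view", "remove_from_cart"]
counts = Counter(events)
total = sum(counts.values())
shares = {event: n / total for event, n in counts.items()}
print(shares)
```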
ex 01
to create a plot for the number of customers over the time period, a histogram for the total sales in each month, and another, filled plot with the average spend per customer per month.
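A sketch of the monthly aggregation that feeds those plots; the row shape (month, user_id, price) is an assumption:

```python
from collections import defaultdict

# purchase rows as (month, user_id, price) -- shape is an assumption
purchases = [
    ("2022-10", 1, 10.0),
    ("2022-10", 2, 30.0),
    ("2022-11", 1, 20.0),
]

totals = defaultdict(float)                       # total sales per month
spend = defaultdict(lambda: defaultdict(float))   # per-customer spend per month
for month, user, price in purchases:
    totals[month] += price
    spend[month][user] += price

# average spend per customer per month
avg = {m: sum(s.values()) / len(s) for m, s in spend.items()}
print(dict(totals), avg)
```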
ex 02
creating a box plot, another one from the interquartile range (IQR) of the first, and at the third one you will rethink your life decisions
ex 03
Instant brain damage.
The frequency will be the count of purchases, grouped by user_id. The monetary value will be the sum of prices grouped by user_id, kept where the sum is below 225 and ordered by the total.
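A pure-Python sketch of those two aggregations; the (user_id, price) row shape is an assumption. In SQL this would be a COUNT and a SUM with GROUP BY, a HAVING on the sum, and an ORDER BY.

```python
from collections import defaultdict

# purchase rows as (user_id, price) -- shape is an assumption
purchases = [(1, 100.0), (1, 50.0), (2, 300.0), (3, 20.0)]

frequency = defaultdict(int)     # COUNT(*) GROUP BY user_id
monetary = defaultdict(float)    # SUM(price) GROUP BY user_id
for user, price in purchases:
    frequency[user] += 1
    monetary[user] += price

# keep users whose total is below 225, ordered by the total
kept = sorted(
    ((user, total) for user, total in monetary.items() if total < 225),
    key=lambda pair: pair[1],
)
print(dict(frequency), kept)
```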
ex 04
Elbow Method
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)
In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.
The elbow method is considered both subjective and unreliable. In many practical applications, the choice of an "elbow" is highly ambiguous as the plot does not contain a sharp elbow.
WSS = Σ (i=1 to k) Σ (j=1 to n_i) ||x_ij - c_i||^2
Where:
- WSS is the within-cluster sum of squares.
- k is the number of clusters.
- n_i is the number of data points in cluster i.
- x_ij is the j-th data point in cluster i.
- c_i is the centroid of cluster i.
ex 05
Clustering algorithms
You are required to use clustering algorithms to create at least 2 plots from the website's customers, grouped by, let's say, their activity on the website: how many times did they purchase something, are they loyal customers or did they just buy something once, etc.
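A minimal k-means sketch in pure Python, with deterministic initialization and made-up 2-D points standing in for customer features (purchase count, total spend). It only shows the mechanics; a real run would use a proper library on the actual customer table.

```python
def kmeans(points, k, iters=20):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute centroids. Deterministic: starts from the first k points."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # index of the nearest centroid (squared euclidean distance)
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            groups[i].append(p)
        centroids = [
            tuple(sum(coords) / len(g) for coords in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids, groups

# one-time buyers vs loyal customers: (purchase_count, total_spend) -- made up
points = [(1, 10), (2, 15), (1, 12), (20, 300), (25, 320), (22, 310)]
centroids, groups = kmeans(points, 2)
print(centroids)
```

With these points the two clusters split cleanly into the low-activity and high-activity customers, which is the kind of grouping the exercise asks to plot.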
alias up='docker-compose up -d'
alias run='docker stop $(docker ps -a -q) && docker rm $(docker ps -a -q) && docker-compose up -d'
alias connect='docker exec -it postgres psql -U zstenger -d piscineds -h localhost -W'
SELECT * FROM data_2022_oct LIMIT 10;
DROP TABLE "data_2022_oct";
SELECT COUNT(*) FROM customers;
list tables: \dt, describe a table: \d
quit: \q