
Technical Test Report

Question 1: Analysis of the data model

Currently, the tech stack includes:

  • A MySQL server deployed with Docker Compose
  • A database with three tables:
    • employee(id, first_name, lastName, email, group_id)
    • groupe(id, nom)
    • badge(id, serial_number, employee_id)

Comments:

  1. Because this problem uses a very simple database, MySQL works well; in a real-world production setting, however, we might prefer Oracle, PostgreSQL, or Microsoft SQL Server.

  2. Problem with docker-compose: the current root password is the literal string password, which is not secure, and it is set directly as a Docker environment variable; that is not a best practice.
    I recommend using MYSQL_RANDOM_ROOT_PASSWORD and saving the generated password in a secure location, or (preferably) MYSQL_ROOT_PASSWORD_FILE to read the password from a file, or setting MYSQL_ROOT_PASSWORD from a host environment variable; see the MySQL Docker image documentation. Another option is Docker Swarm secrets; more advanced setups can use HashiCorp Vault or Bitnami Sealed Secrets (combined with Kubernetes).
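
A minimal docker-compose sketch of the MYSQL_ROOT_PASSWORD_FILE approach (the service name, secret name, and file path are assumptions, not this repo's actual file):

services:
  db:
    image: mysql:8
    environment:
      # The MySQL image reads the root password from this file at startup
      MYSQL_ROOT_PASSWORD_FILE: /run/secrets/mysql_root_password
    secrets:
      - mysql_root_password

secrets:
  mysql_root_password:
    file: ./secrets/mysql_root_password.txt  # keep this file out of version control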

  3. The database is not well designed (based on the naming conventions of the columns). We should change it to the following (a SQL sketch appears after the list):

    • employee(employee_id, first_name, last_name, email, groupe_id)
    • groupe(groupe_id, groupe_name) (I keep the table name groupe rather than group because GROUP is a reserved SQL keyword); we could also name it departement.
    • badge(badge_id, badge_serial_number, employee_id)
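
A sketch of the renamed schema in SQL (the column types are assumptions; the UNIQUE constraint encodes the one-badge-per-employee rule checked in Question 2):

create table groupe (
    groupe_id int primary key auto_increment,
    groupe_name varchar(255)
);

create table employee (
    employee_id int primary key auto_increment,
    first_name varchar(255),
    last_name varchar(255),
    email varchar(255),
    groupe_id int,
    foreign key (groupe_id) references groupe (groupe_id)
);

create table badge (
    badge_id int primary key auto_increment,
    badge_serial_number varchar(255),
    employee_id int unique,
    foreign key (employee_id) references employee (employee_id)
);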

Question 2: Analysis of the quality of existing data

Data analysis

  1. Let's take a look at the row counts of each table:

| table    | count | null values |
|----------|-------|-------------|
| badge    | 1000  | 0           |
| employee | 1171  | 12          |
| groupe   | 8     | 1           |

We can see that we have 1,171 employees in total but only 1,000 badges. The groupe table has one NULL value in its nom column and two 'External' values.
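
One way to produce the count column (a similar per-table query over the column of interest, e.g. nom for groupe, gives the null counts):

select 'badge' as table_name, count(*) as count from badge
union all
select 'employee', count(*) from employee
union all
select 'groupe', count(*) from groupe;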

  2. Check for duplicate data: in the badge table we want to make sure every employee has only one badge, but in the current database some employees have several:
select employee_id, count(employee_id) as count_id from badge
group by employee_id
order by count_id desc
limit 3;

Result:

| employee_id | count_id |
|-------------|----------|
| 529         | 5        |
| 251         | 5        |
| 849         | 4        |
  3. We also want to check for duplicate employees:
SELECT e1.*
FROM employee e1
JOIN employee e2 ON e1.first_name = e2.first_name AND e1.lastName = e2.lastName
WHERE e1.id <> e2.id
order by e1.first_name;
| id   | first_name | last_name | email             | groupe_id |
|------|------------|-----------|-------------------|-----------|
| 1170 | Elian      | Foster    | [email protected] | 1         |
| 808  | Elian      | Foster    | [email protected] | 1         |
| 1171 | Freddie    | Brooks    | [email protected] | 2         |
| 809  | Freddie    | Brooks    | [email protected] | 1         |
| 1169 | Isabella   | Kelly     | [email protected] | 1         |
| 960  | Isabella   | Kelly     | [email protected] | 3         |
| 656  | Lucy       | Baker     | [email protected] | 6         |
| 1168 | Lucy       | Baker     | [email protected] | 3         |
  4. We also want to check the correctness of the data in the employee table: for example, that each email is valid and that both first_name and last_name are present.
SELECT id, email
FROM acme.employee
WHERE email NOT LIKE '%_@__%.__%'
limit 3;
| id  | email                   |
|-----|-------------------------|
| 650 | maria.coleatacme.com    |
| 831 | maddie.warrenvacme.com  |
| 957 | sydney.perkins.acme.com |

or query for rows with a missing first_name or lastName:

SELECT id, first_name, lastName, email
FROM acme.employee
WHERE first_name = '' or lastName = ''
limit 4;
| id  | first_name | lastName | email             |
|-----|------------|----------|-------------------|
| 50  | Alberta    |          | [email protected] |
| 105 | Vincent    |          | [email protected] |
| 148 | Reed       |          | [email protected] |
| 186 |            |          | [email protected] |

=> We can conclude that the database has some data quality issues. One way to keep track of them:

Data Quality Dashboard

Currently, I am thinking of a dashboard that shows the row count of each table and perhaps some of the duplicate data. I created such a dashboard using Redash; you can check it at that link:
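
As a concrete example, one dashboard widget could be backed by a plain SQL query like the following (assuming Redash is connected to the acme database; the metric name is illustrative):

select count(*) as employees_with_multiple_badges
from (
    select employee_id
    from badge
    group by employee_id
    having count(*) > 1
) t;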

Integration of new data

0. Start the database

docker compose up -d

1. Simple way (Python as ETL)

Let's take a look at the employees - infos badge à mettre à jour.csv file ("employees - badge info to update"). It has 1,171 lines, corresponding to the 1,171 employees, and when we compare it with the employee table in the database, the order is the same (first_name, lastName, ...). This means we can derive the employee_id from the line number in this file. So the basic algorithm to update the badge data is:

  1. Read the employee first_name, last_name, and serial_number from the CSV file
  2. Use the (1-based) line index as the employee_id
  3. Insert the records into the new_badge table (I think we can drop the table and write a new one, because only 679 employee_ids have badges and a full rewrite takes less time than deleting from or modifying the existing table)
import pandas as pd
import mysql.connector

# Read the CSV; the 1-based line number doubles as the employee_id
df = pd.read_csv("./employees - infos badge à mettre à jour.csv")
df['index'] = range(1, len(df) + 1)
df_val = df[['employee_first_name', 'employee_last_name', 'badge_serial_number', 'index']].fillna("").values

mydb = mysql.connector.connect(...)  # host, port, user, password, database
mycursor = mydb.cursor()

for x in df_val:
    try:
        # Parameterized INSERT to avoid SQL injection
        sql = "INSERT INTO new_badge (badge_serial_number, employee_id) VALUES (%s, %s)"
        val = (x[2], x[3])
        mycursor.execute(sql, val)
        mydb.commit()
    except Exception as e:
        print(e)

2. Spark


The pipeline for Apache Spark:

  1. Read the data from the CSV file (df1) and from MySQL (df2)
  2. Join the two dataframes on first_name and last_name
  3. Select serial_number and employee_id
  4. Write the result to a new dataframe

When we join the two dataframes on first_name and last_name, some cases produce a lot of duplicate rows, such as people whose first_name and last_name are both "" or the duplicate names found in Question 2, so we only want to keep employees with a unique name. The SQL query below helps us do that:

select id, first_name, lastName
from employee
where (first_name, lastName) in (
    select first_name, lastName
    from employee
    group by first_name, lastName
    having count(*) = 1
);
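
A minimal PySpark sketch of this pipeline (the JDBC URL, connector version, and app name are assumptions; the CSV column names follow the Python section above):

import os
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("badge-ingest")
         .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.33")
         .getOrCreate())

jdbc_opts = {
    "url": "jdbc:mysql://localhost:3306/acme",
    "user": "root",
    "password": os.environ["mysql_password"],
}

# df1: new badge info from the CSV file
df1 = spark.read.csv("./employees - infos badge à mettre à jour.csv", header=True)

# df2: only employees whose (first_name, lastName) is unique, to avoid join fan-out
uniq = """(select id, first_name, lastName from employee
           where (first_name, lastName) in (
               select first_name, lastName from employee
               group by first_name, lastName having count(*) = 1)) as uniq"""
df2 = spark.read.format("jdbc").options(dbtable=uniq, **jdbc_opts).load()

# Join on names, keep the serial number and the resolved employee_id
result = (df1.join(df2, (df1.employee_first_name == df2.first_name) &
                        (df1.employee_last_name == df2.lastName))
             .select(df1.badge_serial_number, df2.id.alias("employee_id")))

# Write the result back to MySQL; overwrite matches the idea of rewriting
# the badge table rather than updating rows in place
result.write.format("jdbc").options(dbtable="new_badge", **jdbc_opts) \
      .mode("overwrite").save()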

How to run:

  1. Configure the MySQL hostname, port, and user in the .env file
  2. pip install -r requirements.txt
  3. Run ingest:
export mysql_password=password
python ingest_pyspark.py

Reading the password from an environment variable keeps it out of the source code (see the sketch below).
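
Inside the ingest script, the password can then be picked up from the environment rather than hard-coded, e.g.:

import os

# Injected via `export mysql_password=...`; never committed to version control
mysql_password = os.environ["mysql_password"]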

3. Apache Beam

With Apache Beam, I have two solutions:

  1. Use ParDo to process each row of the PCollection (a sketch of the join logic follows this list)
    • Connect to the database (with Beam's I/O connectors)
    • Create two PCollections containing the new badge information and the employee information, as in the Spark version
    • Inner join them
    • Write the result to the database
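
A minimal sketch of the join step, shown with the Beam Python SDK for readability (the actual implementation in this repo is the Java starter below; the sample rows are hypothetical placeholders for the CSV and MySQL reads):

import apache_beam as beam

with beam.Pipeline() as p:
    # (first_name, last_name) -> badge serial number, parsed from the CSV
    badges = p | "badges" >> beam.Create([
        (("Jane", "Doe"), "SN-001"),  # hypothetical row
    ])
    # (first_name, last_name) -> employee id, read from MySQL
    employees = p | "employees" >> beam.Create([
        (("Jane", "Doe"), 42),  # hypothetical row
    ])

    # Inner join: CoGroupByKey groups both sides by name; keep keys present in both
    joined = (
        {"badge": badges, "employee": employees}
        | beam.CoGroupByKey()
        | beam.FlatMap(lambda kv: [
            (emp_id, serial)
            for emp_id in kv[1]["employee"]
            for serial in kv[1]["badge"]
        ])
    )
    joined | beam.Map(print)  # replace with a JDBC write in the real pipeline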

How to run: make sure you have Maven and Java SDK 17. You can follow the Apache Beam instructions to install everything.

cd beam/beam-starter-java

mvn package -Dmaven.test.skip

java -jar ./target/beam-java-starter-1-jar-with-dependencies.jar  --inputMySQLHostName='localhost' --inputInfoBadgeCSVPath=../../employees\ -\ infos\ badge\ à\ mettre\ à\ jour.csv --inputMySQLUserName='root' --inputMySQLPassword='password'

Change the hostname, user, password, and path to match your setup.

  2. Convert the PCollections to dataframes and then join them; this method is similar to the Spark dataframe approach (sketch below).
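
A sketch of this second approach using Beam's dataframe API (again the Python SDK for illustration; the beam.Row fields mirror the CSV and table columns, and the sample rows are hypothetical):

import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe, to_pcollection

with beam.Pipeline() as p:
    badges = p | "badges" >> beam.Create([
        beam.Row(first_name="Jane", last_name="Doe", badge_serial_number="SN-001"),
    ])
    employees = p | "employees" >> beam.Create([
        beam.Row(first_name="Jane", last_name="Doe", employee_id=42),
    ])

    # Convert schema-aware PCollections to deferred dataframes, then join as in Spark
    df_badges = to_dataframe(badges)
    df_employees = to_dataframe(employees)
    joined = df_badges.merge(df_employees, on=["first_name", "last_name"])

    # Back to a PCollection for writing out
    to_pcollection(joined) | beam.Map(print)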
