SQL (Structured Query Language) is a programming language used to manage and manipulate relational
database.
As a data scientist, you may find yourself working with large datasets stored in relational databases.
Here are some of the essential SQL commands for data scientist.
SELECT
This is the most fundamental SQL command is used to retrieve data from a database. The Select command is
used to specify the columns and tables to retrieve data from.
Syntax:
SELECT column1, column2, …
FROM table_name;
Example:
it is used to select name and age of students.
SELECT name, age
FROM students;
WHERE
This command is used to filter data based on specified condition. The WHERE clause is used with
the SELECT to retrieve specified data that meets the specified criteria.
Syntax:
SELECT column1, column2
FROM table_name
WHERE column1 = ‘value’;
Example:
This is used to select students who have age greater than 22.
SELECT name, age
FROM students
WHERE age > 22;
DISTINCT
This command is used to retrieve unique values from a specified column. The DISTINCT keyword is used in conjunction with
the SELECT statement.
Syntax:
SELECT DISTINCT column1
FROM table_name;
Example:
SELECT DISTINCT department
FROM employees;
- ORDER BY
This command is used to Sort the data in a specified order based on a column. The ORDER BY clause can sort in ascending
or descending order.
Syntax:
SELECT column1, column2
FROM table_name
ORDER BY column1 ASC;
Example:
it is used to select students name and age being ordered by their age in Descending order.
SELECT name, age
FROM students
ORDER BY age DESC;
- JOIN
This command is used to combine two or more tables based on a common column. The Join command is used to retrieve data that
is spread across multiple tables. To perform join you need to use an alias to refer to the same table with different names.
This is necessary because SQL requires that each table in a join operation has a unique name.
Example:
SELECT column1, column2, column3
FROM table1 AS t1
JOIN table2 AS t2
ON t1.common_column=t2.common_column
Example:
SELECT orders.order_id,
customers.name,
orders.order_date
FROM orders o
JOIN customers c
ON o.customer_id = c.customer_id;
SQL Server supports many kinds of different joins including INNER JOIN, SELF JOIN, CROSS JOIN, and OUTER JOIN. In fact, each join type defines the way two tables are related in a query. OUTER JOINS can further be divided into LEFT OUTER JOINS, RIGHT OUTER JOINS, and FULL OUTER JOINS.
INNER JOIN
creates a result table by combining rows that have matching values in two or more tables.
SELECT *
FROM table1
INNER JOIN table2
ON table1.column = table2.column;
LEFT JOIN
includes in a result table unmatched rows from the table that is specified before the LEFT OUTER JOIN clause.
SELECT *
FROM table1
LEFT JOIN table2
ON table1.column = table2.column;
RIGHT JOIN
creates a result table and includes into it all the records from the right table and only matching rows from the
left table.
SELECT *
FROM table1
RIGHT JOIN table2
ON table1.column = table2.column;
SELF JOIN
A self join is a join operation where a table is joined with itself. In other words, it is a way to combine rows from the same table based on a related column.
SELECT e.employee_name,
m.employee_name AS manager_name
FROM employees e
JOIN employees m
ON e.manager_id = m.employee_id;
For example, consider a table called "employees" that has columns for "employee_id", "employee_name", and "manager_id", where the "manager_id" column references the "employee_id" of the employee's manager. To retrieve the names of all employees and their respective managers, you could use a self join like this:
SELECT e.employee_name, m.employee_name AS manager_name
FROM employees e
JOIN employees m ON e.manager_id = m.employee_id;
CROSS JOIN
A cross join, also known as a Cartesian join, is a type of join operation in SQL where every row from one table is joined with every row from another table. This means that the resulting table contains all possible combinations of rows from the two tables being joined.
SELECT *
FROM table1
CROSS JOIN table2;
Cross joins are not often used in practice, as they can produce very large result sets and can be computationally expensive. However, they can be useful in certain situations, such as when you need to generate test data or when you want to analyze all possible combinations of data.
- GROUP BY
This command is used to group data based on a specified column. The GROUP BY command is often in conjunction with aggregated functions
Example:
SELECT department, COUNT(*)
FROM employees GROUP BY department;
- HAVING
This command is used to filter rows based on conditions on aggregated data. Syntax:
SELECT column1, COUNT(column2)
FROM table_name
GROUP BY column1
HAVING COUNT(column2) > value;
Example:
SELECT department, COUNT()
FROM employees GROUP BY department
HAVING COUNT() > 2;
- UNION
This command is used to combine the results of two or more SELECT statements into a single table.
Syntax:
SELECT column1
FROM table1
UNION
SELECT column1
FROM table2;
Example:
SELECT name
FROM students
WHERE age > 22
UNION
SELECT name
FROM employees
WHERE department = 'Sales';
- LIMIT
This command is used to limit the number of rows returned by a query.
Syntax:
SELECT column1, column2
FROM table_name LIMIT 10;
Example:
This command selects the first 2 students only.
SELECT name
FROM students LIMIT 2;
- INSERT
This command is used to insert new data into a table.
Syntax:
INSERT INTO table_name (column1, column2)
VALUES (value1, value2);
Example:
This command inserts a new row into the students table with the name 'David' and age 24.
INSERT INTO students (name, age)
VALUES ('David', 24);
These are some of the essential SQL commands for data science. By mastering these commands, you can efficiently manage
and manipulate relational databases to extract meaningful insights from data.