Common SQL techniques Part 1

Table of Contents

SQL is pretty much a required skill in all the Data Analytics and/or Data Science jobs, and one of the technical assessment in some of those job interviews is to write SQL query to complete given data tasks. In this blog, I am compiling some common SQL techniques based on my learning and working experiences. This will be the Part 1 of a series which focuses on some big statements/operations/commands for data manipulation in SQL ; the 2nd part will focus more on specific details of manipulating different types of variables.

1. Select #

SELECT is the very basic command in SQL, you basically use it to select columns that you want from a table. If you want to select all columns from the table, you can use SELECT *, but this is not suggested due to the cost of memory and some other various reasons. Check this blog to learn more about why you should avoid select * when not necessary. Basically query format for the select command:

select col1, col2 
-- use 'select DISTINCT' if you want to drop duplicates
from data_base.table_name

2. Join #

Join is used when you are trying to extract information from another table for the data in a given table based on one or multiple related columns - One join happens between two tables (left table and right table, they could be the same in self join) and both tables need to have related column(s) for them to be able to join. There are different types of joins.

left join is used when you want to keep all the data/rows in the left table no matter whether the information you want to extract from the right table is available for those rows or not.

Opposite from left join, right join will keep all the data/rows in the right table

inner join will keep all the matched data - data/rows that exist in both left and right tables.

outer join will keep all the rows from both left and right tables - you should expect the number of rows in the resulting table greater or equal to the number of rows in the left or right table. Basic query format for join:

select l.col1, r.col2
from data_base.left_table l
left join data_base.right_table r 
-- can change left join to right join, inner join or outer join
on l.id = r.id and l.name=r.name
--here we are joining two tables based on id and name columns

In the above query example, the join condition is equal. We can also join two tables with unequal condition(s) like follwing:

select l.col1, r.col2
from data_base.left_table l
left join data_base.right_table r 
-- can change left join to right join, inner join or outer join
on l.id = r.id and l.name=r.name
and l.start_date > r.start_date
-- here we are joining two tables based on id and name columns and also having an unequal condition to make sure the start date in left table is greater than the start date in the right table
-- here the > can be other unequal operations such as <>, <, between

Note that here the unequal condition is within the ON clause in the above sample query. We can also move the unequal condition to a WHERE clause like following, which may return different results depending on what kind of join you are using - this is because when it is within ON cluase, it is a join condition while when it is within WHERE clause, it further filters the data after the join operation (the resulting table would be the same for inner join no matter whether the unequal condition is in ON or WHERE clause).

select l.col1, r.col2
from data_base.left_table l
left join data_base.right_table r 
-- can change left join to right join, inner join or outer join
on l.id = r.id and l.name=r.name
where l.start_date > r.start_date
-- here we are joining two tables based on id and name columns and using an unequal condition to further filter the resulting table from join operation to make sure that the start date in left table is greater than the start date in the right table
-- here the > can be other unequal operations such as <>, <, between

Whenever the ON clause is omitted, the default result is CROSS JOIN or a Cartesian product. This means the query will return every combination of rows from the left table with every combination of rows from the right table, which you should probably avoid since it might be very computationally expensive.

Here we will skip covering too much details of self join and cross join since they are less common.

3. Union #

If join is combining data horizontally, then union is to concatenate/append data vertically. Again, one union operator happens between two tables (left and right, can be the same table) - you should expect the number of rows in the resulting table equal or greater than the number of rows in either left or right table. Basic query format:

select col1, col2 from data_base.left_table
UNION -- or UNION ALL
select col3, col4 from data_base.right_table

A few things to be noted here:

the number of selected columns from left table needs to be the same as that from the right table; the name of those columns can be different (the resulting table will keep the name from the left table) but the data type of the corresponding columns need to be the same.
UNION will remove duplicate rows during the operation
UNION ALL will not remove duplicate rows

4. Aggregation and grouping #

In SQL, we can use GROUP BY clause to perform any aggregation over the grouped column(s) - whenever you select other columns that are not being aggregated and perform some aggregation the same time, you have to use GROUP BY and have those non-aggregated columns specified after the GROUP BY (as grouping columns). Common aggregation functions include:

count(), sum()
avg(), min(), max()
first_value(), last_value()

Basic query format for aggregation and grouping

select col1, col2, sum(col3) as col3_sum
from database.table_name
group by 1, 2
-- here 1,2 refers to col1, col2 for simplicity, you can use group by col1, col2
order by 1, 2 
-- ordering the resulting table by col1 and col2, order by should be used after group by

Some things to be aware when you use the COUNT() aggregation function

select count(1) will not exclude null values because 1 is a non-null expression
count(*) will not exclude null values, it will return the same result as count(1)
count(specified_col) will exclude null values
count(distinct specified_col) will exclude duplicates counting the unique values

5. Case statement #

Case statement is a very commonly used statement in real life. It is to create new column based on some conditions using existing columns in the table. This statement is very easy to use and the basic query format is follwing:

/*Example 1*/
select col1,  -- existing column
case when existing_col1 = 1 then 1 
when existing_col1 = 2 then 2
else 3 
end as new_col2 -- new_col2 will have 3 values: 1,2,3
from data_base.table_name

/*Example 2*/
select count(case when a=1 then a end) as a1_count
-- here the count() will count the occurence of 1 in column a

6. WHERE & HAVING #

Both WHERE and HAVING are used for filtering with specified conditions. HAVING can be used with aggregation while WHERE cannot.

/*WHERE*/
select col1, col2
from database.table_name
where col_A between a1 and a2 --between is inclusive
and col_B is not NULL
and col_C in (1,2,3)
-- NOTE: no aggregation in this query, so we use where for filtering

/*HAVING*/
select col1, count(col2) as col2_cnt
from database.table_name
group by col1
where having count(col2) >4
--order by  -- optional sorting

7. Subqueries & CTEs #

CTE here refers to Common Table Expressions. Both subqueries and CTE are used to create some intermediate querying result/table. CTE is always used in conjunction with WITH clause. Sample queries are as following:

/*CTE*/
with cte_tbl1 as (
    select col1, col2 from tbl1
    where col2 > 2
), -- this comma is necessary
cte_tbl2 as (
    select col3, col4 from tbl2
    where col3 = 4
) -- no comma after the last CTE 
select * from cte_tbl1
union
select * from cte_tbl2;

/*Subquery*/
-- example 1
select * from (select col1, col2 from tbl1 where col2 > 2)
union
select * from (select col3, col4 from tbl2 where col3 = 4);

-- example 2
select col1, col2 from table1
where col3 > (select avg(col3) from tbl1)

As we can see, both CTE and Subqueries can be used to perform the same tasks, but there are still some differences between them.

CTE is defined at the front of the query while subquery are used inline (wherever needed)
CTE must be named while subquery does not have to
CTE is more readable/cleaner than subquery especially with big query
A CTE can be used many times in the query while subquery can only used wherever you defined it - CTE would be much better in queries where you need to use same intermediate data multiple times
subquery can be used with WHERE clause while CTE cannot (see the example 2 above)

8. Window function #

window function is probably one the most advanced techniques in SQL. Window function

will not aggregate the data into smaller table even it is sometimes used with aggregation function, instead it will keep all the rows and give each row the (aggregated) summary value/statistic.
is commonly used with over (PARTITION BY)

Other than the typical aggregate functions, a few common SQL tasks that are usually using window function:

Ranking
- row_number() simply returns the row number
- rank() will skip some ranks if there is a tie.
- dense_rank() does not skip any ranks if there is a tie between the ranks of the preceding
Extact prior value or next value
- lead() to get the next value
- lag() to get the lagged/prior value

/*Example 1*/
select col_a, ROW_NUMBER() 
over (partition BY col_a order BY col_b) as new_col
-- get the ranking of col_b within each group of col_a
from table_name

/*Example 2*/
select lag(col_c)
over (partition BY col_a order BY col_b) as new_col
-- get the ranking of col_b within each group of col_a
from table_name

/*Example 3*/
select sum(col_c) over (partition BY col_a order BY col_b), 
--running/rolling sum of col_c (ordered by col_b) within each group of col_a 
avg(col_c) over (partition BY col_a) 
--average of m within each group of col_a
from table_name

References #

Window function:
- window function in sql
- sql window functions
Practice problems
- sql interview questions you must prepare - the ultimate guide
- top 30 sql query interview questions