A Practical Guide to Complex SQL Queries

Jungle Scout Engineering
7 min readFeb 18, 2021

--

Author: Hon Lam, Engineer, Jungle Scout

When working with SQL, sometimes you can solve a problem by simply retrieving raw data like this:

In other cases, you may need to write complex queries — queries that may involve some intermediate understanding of JOINs and/or incorporated subqueries like this:

But how would you know that a problem requires a complex query? If you do need to write one, what is a systematic approach for planning and writing the query?

That’s the purpose of this guide: to suggest some standard steps for solving these problems. This guide walks you through the problem-solving process by:

  • Using a simple problem as an example.
  • Applying general programming tips for planning and writing code that can be used for different types of problems and programming languages.
  • Exploring the capabilities and significance of SQL clauses like SELECT FROM, and JOIN.

SQL Flavour
The syntax in this article is based on PostgreSQL.

Who is this article for?

You’ll find this guide useful if:

  • You know how to retrieve data in SQL, but want to learn more about manipulating the raw data with multiple steps;
  • You have experience with writing complex queries, but you want to standardize your steps;
  • You already have some standardized procedures, but you want to learn different approaches for solving problems; or
  • You want to see how programming practices can apply to writing complex queries.

Before we start: prerequisite knowledge of SQL

If you are unfamiliar with any SQL syntax, we suggest researching a basic explanation of the following clauses: SELECT, FROM, and JOIN. You don’t need an explanation of all their capabilities — this guide will further detail how the clauses work.

Exploring a problem

Let’s start by looking at a problem. Suppose that you have a database that contains only two tables:

Table: revenues
transaction_date | amount
------------------+--------
2020-01-01 | 4
2020-01-02 | 4
2020-12-31 | 4
Table: expenses
transaction_date | amount
------------------+--------
2020-01-03 | 2
2020-01-04 | 2
2020-12-31 | 2

For this article, the goal is to solve this problem:
Write a query that returns the profit of January 2020.

Framing the problem

When we receive a problem, we should look for the simplest solutions first. If none exist, then we should proceed to more complex solutions.

In terms of time complexity (O(N)), one of the simplest solutions would be to call a literal and not have any further operations like creating variables, manipulating data, etc. Here is an example in Javascript:

The simplest solution, if it exists, would be to find a single cell in our database tables that completes the original problem. The table and its cell would look something like this:

Table name: profit
Columns:
monthly_period | amount
----------------+--------
January 2020 | 999.00

Then the solution would be this:

But we don’t have any tables or data that would produce such a simple solution. Therefore, we need to find a more complex solution. To do so, we need to keep rephrasing the problem into a progressively more complex and/or lower-level version until the entities in the revised problem resemble the available data. Here is the original problem again:

Write a query that returns the profit of January 2020.

And here is a more complex revision of the problem:

Find the total expenses for January 2020, and subtract that from the total revenue for January 2020.

We don’t have to rephrase this any further because the entities in our problem now resemble our available data. The entities are “expenses’’ and “revenue”. We do have some tables that contain these entities.

The fact that the problem mentions “total” and “January 2020” should not warrant further rephrasing because it is safe to assume that SQL is capable of math and filtering data; many programming languages could perform these operations.

Defining the formula’s values

We now have the revised problem, which is essentially a subtraction operation. But we don’t have a single row in our database tables that matches either variable of the formula: 1) total revenue for January 2020 and 2) total expenses for January 2020. We have the individual transactions of revenue and expenses in our database tables like this:

Table: revenues
transaction_date | amount
------------------+--------
2020-01-01 | 4
2020-01-02 | 4
2020-12-31 | 4

But they are not yet filtered by date, and even if they were, they are not yet totalled. Therefore, we will need to manipulate the data.

Manipulating the data inside-out

When we manipulate data, there are at least two levels of data in our query: 1) the raw data (lower-level data) and 2) the result of the processed data (higher-level data). Typically such queries look like this:

Let’s tackle this by going inside-out: solve for the lower-level data first, and then progressively work towards a higher level until we solve the overall problem. Although we could instead go in the opposite direction, the inside-out approach is more productive for two reasons:

  1. Lower-level data is more likely to constrain higher-level data than the other way around. It’s more productive to identify our constraints in the early stages of planning and code writing. If we discover them later (like an outside-in approach), then we may need to scrap much of our progress.
  2. Identifying constraints early in the process would limit our code design options. When you have fewer options, it is easier to choose.

Analogy: technical design documents and product briefs
Another inside-out approach is how we write technical design documents and product briefs. When we write these materials, we recommend identifying the constraints first, i.e. the non-negotiable features users must have, infrastructure limitations, etc. With this information, it’s easier to plan the feature because there are fewer possible implementation designs to select from.

Building subqueries

There are two components of our formula. Let’s build each of them, one at a time (the lower-level part of our code). Then we can subtract one from the other (the higher-level part of our code).

Total revenue is simply (Subquery A):

Total expenses follows suit (Subquery B):

Making subquery results accessible

We now have the two values for the formula. But we can’t compute a subtraction until the two values are accessible in the same context.

Planning the context

The SELECT clause defines which values to output. But not all values are accessible for SELECT. It can only access whatever is in the same context, a concept that is common in many languages like Javascript:

In SQL’s case, SELECT can only access what is produced in the FROM clause if it is in the same query level (or “context”). For example:

That means we need to craft the FROM clause in a way that makes total revenue and total expenses accessible for SELECT.

Our high-level query will resemble this:

Incorporating JOIN

We need to use a type of the JOIN clause to combine total revenue and total expenses into the same context. Let’s briefly evaluate the different options for JOIN.

To understand them, we should first understand the ON argument:

The argument returns a boolean — this is why sometimes a query contains ON TRUE — which represents whether the rows on either table are considered a match. However, different types of JOIN have different ways of treating a non-match. To summarize (because JOIN is a potentially large topic):

  • LEFT JOIN — If the right-side table doesn’t produce any matches, then output the left-side data anyway. The output shows a NULL wherever the right side doesn’t match:
Result:left_id                                          |  right_id
-------------------------------------------------+----------
'some ID that doesn't exist in right-side data' | NULL
  • RIGHT JOIN — Same concept, but with the roles reversed. Therefore, the data on the right side is guaranteed to appear. For unmatched rows, NULL is printed in place of the left-side data.
  • INNER JOIN — Discard all the rows that don’t match. In other words, you won’t see NULLs as placeholders for the left or right table.
  • CROSS JOIN — Include all permutations of both datasets.

From here, we have two options:

Option 1: Use any of the above, but state ON TRUE.

That means regardless of the values in either dataset, the rows will always match; no row will be discarded, nor will they print out NULL as a filler. We want to accomplish this because each of our datasets only contain one cell (see Subquery A and Subquery B), and we always want them combined into the same dataset like this:

total_revenue  |  total_expense
---------------+---------------
some number | some number
  • Notice how neither of the datasets have a column with a matching value: an ID or foreign key. Hence ON TRUE instead of something similar to ON dataset1.foreign_key = dataset2.id.

Option 2: Use CROSS JOIN because it will accomplish the same thing without any ON argument.

Since both accomplish the same result, and Option 2 is more succinct, we will go with that:

Computing the results returned by FROM

We need to use SELECT to process the subtraction in our revised problem. SELECT basically evaluates a value. Some examples of values include:

  • Table columns
  • Function calls
  • Literals
  • Mathematical operations

In other words, if you can evaluate the value in a console like Javascript’s console, then you can do the same in SQL:

Since the values total_revenues and total_expenses are now accessible in the same context, we can call them and subtract them:

Experimenting with dummy data

If you want to experiment, here is the same query with the same data from the original problem. Just paste it into your code editor, edit the dummy data (see the WITH clause) wherever needed, and run it.

Final takeaways:

When we receive a problem, sometimes we can’t answer it directly. We need to break down the problem into multiple parts, solve each of them, and then sum them up.

The same fundamental programming principles that apply to many other languages also apply to writing queries. Here are some takeaways:

  1. Differentiate between low-level data and high-level data. Doing so will reveal constraints, which can save you a lot of time for planning and writing code.
  2. Be aware of context. Some entities (like variables) are accessible, and some aren’t.
  3. Be analytical of data types:
  • When leveraging a function, what are the datatypes of the arguments? SELECT accepts any data type because it simply evaluates statements. The ON argument is just a boolean.
  • What is the datatype of a response? This is an important question for our subqueries because their returned data types will constrain the code design of the higher-level query.

--

--

Jungle Scout Engineering

We are Jungle Scout. Our mission is to provide powerful data and resources to help entrepreneurs and brands grow successful businesses on Amazon