SQL Window Functions: Analyzing Sales Data
1. Comparing First and Last Sale Prices
To analyze how sales data evolves over time, we can compare each record with the first or last record for a common element. Consider the following SQL query:
SELECT
CustomerName,
SaleDate,
TotalSalePrice AS CurrentSale,
FIRST_VALUE(TotalSalePrice)
OVER (PARTITION BY CustomerName
ORDER BY SaleDate)
AS InitialOrder,
LAST_VALUE(TotalSalePrice)
OVER (PARTITION BY CustomerName
ORDER BY SaleDate
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
AS FinalOrder
FROM
allsales;
This query joins the
Customer
and
Sales
tables to access customer and sales data. It outputs the
CustomerName
,
SaleDate
, and
TotalSalePrice
fields. Additionally, it uses the
FIRST_VALUE()
and
LAST_VALUE()
functions to show the first and last sale prices for each customer.
The
FIRST_VALUE()
and
LAST_VALUE()
functions require the following elements to work correctly:
-
OVER
: Starts the windowing function.
-
PARTITION BY
: Subsets the data based on a shared field. In this example, it groups by
CustomerName
to find the first or last value for each customer.
-
ORDER BY
: Sorts the data subset.
-
Field
: Specifies the data to return from the preceding record. Here, it’s the
SalePrice
field.
These functions are useful for avoiding complex subqueries or derived tables. They traverse the data for each element specified in the
PARTITION BY
clause and return the required record.
The following mermaid flowchart shows the general process:
graph TD;
A[Start] --> B[Join Customer and Sales tables];
B --> C[Output basic fields];
C --> D[Apply FIRST_VALUE and LAST_VALUE functions];
D --> E[Group by CustomerName];
E --> F[Sort by SaleDate];
F --> G[Return results];
G --> H[End];
2. Displaying Rolling Averages
The sales director wants to see the rolling average sale value for the last three sales per customer. The following SQL query can achieve this:
SELECT
CustomerName,
SaleDate,
TotalSalePrice,
AVG(TotalSalePrice)
OVER (PARTITION BY CustomerName ORDER BY SaleDate
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
AS AverageSalePrice
FROM
allsales
ORDER BY
CustomerName, SaleDate;
A rolling average is a classic analytical metric. It requires isolating a certain number of records up to and including the current record and calculating the average of a value in those records. The
ROWS
windowing function helps with this.
The windowing function needs the following elements:
-
AVG()
: The aggregation function to apply.
-
OVER
: Indicates a windowing function.
-
PARTITION BY
: Groups records into subgroups. Here, it uses
CustomerName
to calculate the rolling average per customer.
-
ORDER BY
: Sorts the records in the subset, which is essential for defining the sequence for the rolling average.
-
ROWS BETWEEN
: Specifies the range of records relative to the current record.
-
PRECEDING
: Determines how many rows back to start the running average.
-
CURRENT ROW
: Tells the windowing function to stop at the current record and include it in the total.
When using the
ROWS()
windowing function, keep the following points in mind:
- A rolling total can be created similarly, using the
SUM()
function instead of
AVG()
.
- Add an
ORDER BY
clause at the end of the query to ensure the desired output order, as the
ORDER BY
inside the
OVER
function doesn’t guarantee correct final sorting.
- You can apply a different sort order to the final output without affecting the running average calculation.
- To exclude the current row from the running average or total, replace
CURRENT ROW
with
1 PRECEDING
in the SQL.
- You can specify any row range relative to the current row for a running total.
The following table summarizes the key elements for the rolling average calculation:
| Element | Description |
| ---- | ---- |
| AVG() | Aggregation function |
| OVER | Windowing function indicator |
| PARTITION BY | Grouping records |
| ORDER BY | Sorting records in subset |
| ROWS BETWEEN | Defining record range |
| PRECEDING | Starting point for running average |
| CURRENT ROW | Ending point for running average |
3. Showing First and Last Sales per Client
To show the first order and the last four sales for each customer, we can use the following SQL query:
SELECT
CustomerName,
SaleDate,
TotalSalePrice,
FIRST_VALUE(TotalSalePrice) OVER (PARTITION BY CustomerName
ORDER BY SaleDate) AS FirstOrder,
LAG(TotalSalePrice, 3) OVER (PARTITION BY CustomerName
ORDER BY SaleDate) AS
LastButThreeOrder,
LAG(TotalSalePrice, 2) OVER (PARTITION BY CustomerName
ORDER BY SaleDate) AS LastButTwoOrder,
LAG(TotalSalePrice, 1) OVER (PARTITION BY CustomerName
ORDER BY SaleDate) AS LastButOneOrder,
LAST_VALUE(TotalSalePrice) OVER (PARTITION BY CustomerName
ORDER BY SaleDate) AS LatestOrder
FROM
allsales;
This query helps estimate how much each client is spending and how sales are evolving over time. It uses the following functions:
-
FIRST_VALUE
: Shows the sale price for the first purchase by a customer.
-
LAST_VALUE
: Shows the sale price for the latest purchase by a customer.
-
LAG
: Shows a previous specific sale.
These windowing functions require the following keywords:
-
OVER
: Indicates a windowing function.
-
ORDER BY
: Specifies how records are sorted in a subset.
-
PARTITION BY
: Groups records by a criterion for meaningful comparison. Here, it partitions by the client name to see sales per client.
The following mermaid flowchart shows the process of this query:
graph TD;
A[Start] --> B[Select customer and sale data];
B --> C[Apply FIRST_VALUE function];
C --> D[Apply LAG functions];
D --> E[Apply LAST_VALUE function];
E --> F[Group by CustomerName];
F --> G[Sort by SaleDate];
G --> H[Return results];
H --> I[End];
4. Calculating Cumulative Distribution
A salesman wants to know the relative standing of each sale by make. The following SQL query can provide this information:
SELECT
MakeName, ModelName, InvoiceNumber,
SalePrice,
ROUND(CUME_DIST()
OVER (PARTITION BY MakeName ORDER BY SalePrice),2)
AS RelativeStanding
FROM
allsales
ORDER BY
MakeName, SalePrice, RelativeStanding;
The
CUME_DIST()
function calculates the cumulative distribution, showing the relative position of a value compared to other values.
The query requires joining multiple tables to select the required fields. It then adds the
CUME_DIST()
function to the
SELECT
clause, specifying an
ORDER BY
element (the sale price) and a
PARTITION BY
clause (using the make of the vehicle) to group records into subsets.
When using the
CUME_DIST()
function, note the following:
- The
PARTITION BY
clause is optional. Without it, the function will show the percentage of vehicles that sell for less than the current car for all sales, regardless of make.
- The field used in the
ORDER BY
function of the
OVER
clause must be numeric.
- The
ROUND()
function is added to prevent displaying 15 decimal places by default. You can omit it if needed.
- You can sort the dataset by the sale price in descending order to see the most expensive sale per make first. The overall query sort order is independent of the
ORDER BY
in the
CUME_DIST()
function.
The following table summarizes the key points for the
CUME_DIST()
function:
| Point | Description |
| ---- | ---- |
| PARTITION BY | Optional for grouping |
| ORDER BY field | Must be numeric |
| ROUND() | For decimal place control |
| Sort order | Independent of CUME_DIST() ORDER BY |
5. Classifying Data using PERCENT_RANK()
The sales director wants to see vehicle sales for each country by value and the percentile of each vehicle sale in the sale hierarchy. The following SQL query can achieve this:
SELECT
CountryName, SaleDate, InvoiceNumber,
FORMAT_NUMBER(PERCENT_RANK()
OVER (PARTITION BY CountryName
ORDER BY TotalSalePrice), "0.00 %") AS PercentageRanking
FROM
allsales
ORDER BY
CountryName, TotalSalePrice DESC;
The
PERCENT_RANK()
function indicates the relative rank of a row inside a group of rows. It requires the
OVER
clause and an
ORDER BY
function with a numeric field.
The query selects the necessary tables, adds desired fields to display, and uses the
ORDER BY
clause for better comprehensibility. The
PARTITION BY
clause subsets the relative sale position by country.
When using
PERCENT_RANK()
, remember:
- The field used in the
ORDER BY
function of the
OVER
clause must be numeric.
- Adding the field used as the basis for
PERCENT_RANK()
to the
SELECT
clause can make the output easier to understand.
- Formatting the output as a percentage makes it more comprehensible, but it’s not required.
The following mermaid flowchart shows the process of this query:
graph TD;
A[Start] --> B[Select necessary tables];
B --> C[Add fields to display];
C --> D[Apply PERCENT_RANK() function];
D --> E[Partition by CountryName];
E --> F[Sort by TotalSalePrice];
F --> G[Format output];
G --> H[Return results];
H --> I[End];
6. Using the LAG() Function with Alphabetical Data
The marketing director wants a list of all the company’s customers in alphabetical order, grouped by the first letter of their name. The following SQL query can help:
SELECT
CASE
WHEN LEFT(CustomerName, 1) =
LAG (LEFT(CustomerName, 1))
OVER (ORDER BY CustomerName) THEN NULL
ELSE LEFT(CustomerName, 1)
END AS IndexLetter,
CustomerName
FROM
SalesByCountry
GROUP BY
CustomerName
ORDER BY
CustomerName;
Isolating the first letter of each client’s name is easy using the
LEFT()
function. However, showing the index letter only when a new letter occurs requires the
LAG()
function.
The code first applies the
LAG()
function to the
LEFT()
function on the customer name, specifying 1 row to go back. It then uses a
CASE
statement to compare the first letter of the current and previous customer names. If they are the same, it outputs
NULL
; otherwise, it outputs the first letter.
The following table summarizes the steps for this query:
| Step | Action |
| ---- | ---- |
| 1 | Apply LAG() to LEFT() function on customer name |
| 2 | Compare first letters using CASE statement |
| 3 | Output NULL or first letter based on comparison |
| 4 | Group by CustomerName |
| 5 | Order by CustomerName |
In conclusion, these SQL window functions are powerful tools for analyzing sales data. They allow us to create running totals, averages, and counts, segment the output, and show the relative weighting of values in a list. The functions covered in this article are summarized in the following table:
| Function | Description |
| ---- | ---- |
| ROW_NUMBER() | Adds a sequential number to each record |
| LAG() | Reaches back over previous records |
| FIRST_VALUE() | Returns the starting value |
| LAST_VALUE() | Returns the final value |
| ROWS BETWEEN | Defines record range for running calculation |
| PRECEDING | Helps define record range |
| CURRENT ROW | Specifies range including current row |
| CUME_DIST() | Calculates cumulative distribution |
| PERCENT_RANK() | Indicates relative rank as a percentage |
SQL Window Functions: Analyzing Sales Data
7. Summary of Key Concepts
Let’s summarize the key concepts and functions we’ve explored so far:
| Function | Key Elements | Use Case |
|---|---|---|
| FIRST_VALUE() and LAST_VALUE() | OVER, PARTITION BY, ORDER BY, Field | Compare first and last sale prices for each customer |
| AVG() with ROWS BETWEEN | AVG(), OVER, PARTITION BY, ORDER BY, ROWS BETWEEN, PRECEDING, CURRENT ROW | Calculate rolling average sale values |
| LAG() | OVER, ORDER BY, PARTITION BY | Show previous specific sales |
| CUME_DIST() | OVER, ORDER BY, PARTITION BY (optional) | Determine relative standing of sales by make |
| PERCENT_RANK() | OVER, ORDER BY (numeric field), PARTITION BY (optional) | Classify sales by percentile within a group |
| LAG() with alphabetical data | LAG(), LEFT(), CASE statement | Create an alphabetical index of customers |
This table provides a quick reference for understanding the different window functions, their required elements, and typical use cases.
8. Practical Tips for Using Window Functions
When working with SQL window functions, the following practical tips can enhance your efficiency and accuracy:
-
Understand the Data Flow
: Before writing a query, visualize how the data will be partitioned, sorted, and aggregated. This can help you choose the right functions and parameters. For example, when using
PARTITION BY, think about how the data should be grouped to achieve your analysis goal. - Test with Small Datasets : When dealing with complex queries, start by testing them on a small subset of data. This can help you identify errors and understand how the functions work without having to wait for long query execution times on large datasets.
-
Use Descriptive Aliases
: When naming columns in your query results, use descriptive aliases. This makes the output more understandable, especially when multiple functions are involved. For instance, instead of using a generic alias like
col1, use something likeInitialOrderorAverageSalePrice. -
Be Mindful of Performance
: Some window functions can be computationally expensive, especially when used on large datasets. Make sure to use appropriate indexing on the columns used in
PARTITION BYandORDER BYclauses to improve query performance.
9. Advanced Use Cases and Extensions
Beyond the basic use cases we’ve covered, window functions can be used in more advanced scenarios:
- Combining Multiple Window Functions : You can combine multiple window functions in a single query to perform more complex analyses. For example, you could calculate both the rolling average and the cumulative distribution in the same query to gain deeper insights into sales data.
SELECT
CustomerName,
SaleDate,
TotalSalePrice,
AVG(TotalSalePrice)
OVER (PARTITION BY CustomerName ORDER BY SaleDate
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
AS AverageSalePrice,
ROUND(CUME_DIST()
OVER (PARTITION BY CustomerName ORDER BY TotalSalePrice),2)
AS RelativeStanding
FROM
allsales
ORDER BY
CustomerName, SaleDate;
- Using Window Functions in Subqueries : Window functions can also be used in subqueries to perform intermediate calculations. This can be useful when you need to perform multiple levels of aggregation or filtering.
SELECT
CustomerName,
SaleDate,
TotalSalePrice,
AvgSale
FROM (
SELECT
CustomerName,
SaleDate,
TotalSalePrice,
AVG(TotalSalePrice)
OVER (PARTITION BY CustomerName ORDER BY SaleDate
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
AS AvgSale
FROM
allsales
) subquery
WHERE
TotalSalePrice > AvgSale;
The following mermaid flowchart shows the process of the combined window functions query:
graph TD;
A[Start] --> B[Select customer and sale data];
B --> C[Apply AVG() window function];
C --> D[Apply CUME_DIST() window function];
D --> E[Group by CustomerName];
E --> F[Sort by SaleDate and TotalSalePrice];
F --> G[Return results];
G --> H[End];
10. Conclusion
In this article, we’ve explored the power of SQL window functions in analyzing sales data. These functions provide a flexible and efficient way to perform various calculations, such as comparing first and last sales, calculating rolling averages, and determining relative rankings.
By understanding the key elements and use cases of each function, you can leverage them to gain valuable insights from your data. Whether you’re a sales director looking for trends in sales data or a marketing director organizing customer information, SQL window functions can be a valuable addition to your analytical toolkit.
Remember to keep the practical tips in mind when working with these functions, and don’t be afraid to explore advanced use cases and extensions to meet your specific analysis needs. With practice, you’ll be able to write complex and effective SQL queries using window functions.
So, go ahead and start applying these functions to your own sales data. You’ll be surprised at the depth of analysis and insights you can achieve.
超级会员免费看
2141

被折叠的 条评论
为什么被折叠?



