35、SQL Window Functions: Analyzing Sales Data

redis7keeper

于 2025-10-09 14:13:15 发布

阅读量26

点赞数

CC 4.0 BY-SA版权

分类专栏：掌握Spark SQL实战精髓文章标签： SQL Window Functions Sales Data Analysis

本文链接：https://blog.youkuaiyun.com/redis7keeper/article/details/153172388

掌握Spark SQL实战精髓专栏收录该内容

39 篇文章 ¥499.90

订阅专栏¥69.90

会员秒杀 ¥9.9 重磅福利

超级会员免费看

SQL Window Functions: Analyzing Sales Data

1. Comparing First and Last Sale Prices

To analyze how sales data evolves over time, we can compare each record with the first or last record for a common element. Consider the following SQL query:

SELECT
    CustomerName,
    SaleDate,
    TotalSalePrice AS CurrentSale,
    FIRST_VALUE(TotalSalePrice) 
        OVER (PARTITION BY CustomerName 
              ORDER BY SaleDate)
        AS InitialOrder,
    LAST_VALUE(TotalSalePrice) 
        OVER (PARTITION BY CustomerName
              ORDER BY SaleDate
              ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) 
        AS FinalOrder
FROM
    allsales;

This query joins the Customer and Sales tables to access customer and sales data. It outputs the CustomerName , SaleDate , and TotalSalePrice fields. Additionally, it uses the FIRST_VALUE() and LAST_VALUE() functions to show the first and last sale prices for each customer.

The FIRST_VALUE() and LAST_VALUE() functions require the following elements to work correctly:
- OVER : Starts the windowing function.
- PARTITION BY : Subsets the data based on a shared field. In this example, it groups by CustomerName to find the first or last value for each customer.
- ORDER BY : Sorts the data subset.
- Field : Specifies the data to return from the preceding record. Here, it’s the SalePrice field.

These functions are useful for avoiding complex subqueries or derived tables. They traverse the data for each element specified in the PARTITION BY clause and return the required record.

The following mermaid flowchart shows the general process:

graph TD;
    A[Start] --> B[Join Customer and Sales tables];
    B --> C[Output basic fields];
    C --> D[Apply FIRST_VALUE and LAST_VALUE functions];
    D --> E[Group by CustomerName];
    E --> F[Sort by SaleDate];
    F --> G[Return results];
    G --> H[End];

2. Displaying Rolling Averages

The sales director wants to see the rolling average sale value for the last three sales per customer. The following SQL query can achieve this:

SELECT
    CustomerName,
    SaleDate,
    TotalSalePrice,
    AVG(TotalSalePrice) 
        OVER (PARTITION BY CustomerName ORDER BY SaleDate 
              ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
        AS AverageSalePrice
FROM
    allsales
ORDER BY
    CustomerName, SaleDate;

A rolling average is a classic analytical metric. It requires isolating a certain number of records up to and including the current record and calculating the average of a value in those records. The ROWS windowing function helps with this.

The windowing function needs the following elements:
- AVG() : The aggregation function to apply.
- OVER : Indicates a windowing function.
- PARTITION BY : Groups records into subgroups. Here, it uses CustomerName to calculate the rolling average per customer.
- ORDER BY : Sorts the records in the subset, which is essential for defining the sequence for the rolling average.
- ROWS BETWEEN : Specifies the range of records relative to the current record.
- PRECEDING : Determines how many rows back to start the running average.
- CURRENT ROW : Tells the windowing function to stop at the current record and include it in the total.

When using the ROWS() windowing function, keep the following points in mind:
- A rolling total can be created similarly, using the SUM() function instead of AVG() .
- Add an ORDER BY clause at the end of the query to ensure the desired output order, as the ORDER BY inside the OVER function doesn’t guarantee correct final sorting.
- You can apply a different sort order to the final output without affecting the running average calculation.
- To exclude the current row from the running average or total, replace CURRENT ROW with 1 PRECEDING in the SQL.
- You can specify any row range relative to the current row for a running total.

The following table summarizes the key elements for the rolling average calculation:
| Element | Description |
| ---- | ---- |
| AVG() | Aggregation function |
| OVER | Windowing function indicator |
| PARTITION BY | Grouping records |
| ORDER BY | Sorting records in subset |
| ROWS BETWEEN | Defining record range |
| PRECEDING | Starting point for running average |
| CURRENT ROW | Ending point for running average |

3. Showing First and Last Sales per Client

To show the first order and the last four sales for each customer, we can use the following SQL query:

SELECT
    CustomerName,
    SaleDate,
    TotalSalePrice,
    FIRST_VALUE(TotalSalePrice) OVER (PARTITION BY CustomerName
                                       ORDER BY SaleDate) AS FirstOrder,
    LAG(TotalSalePrice, 3) OVER (PARTITION BY CustomerName
                                 ORDER BY SaleDate) AS 
        LastButThreeOrder,
    LAG(TotalSalePrice, 2) OVER (PARTITION BY CustomerName
                                 ORDER BY SaleDate) AS LastButTwoOrder,
    LAG(TotalSalePrice, 1) OVER (PARTITION BY CustomerName
                                 ORDER BY SaleDate) AS LastButOneOrder,
    LAST_VALUE(TotalSalePrice) OVER (PARTITION BY CustomerName
                                     ORDER BY SaleDate) AS LatestOrder
FROM
    allsales;

This query helps estimate how much each client is spending and how sales are evolving over time. It uses the following functions:
- FIRST_VALUE : Shows the sale price for the first purchase by a customer.
- LAST_VALUE : Shows the sale price for the latest purchase by a customer.
- LAG : Shows a previous specific sale.

These windowing functions require the following keywords:
- OVER : Indicates a windowing function.
- ORDER BY : Specifies how records are sorted in a subset.
- PARTITION BY : Groups records by a criterion for meaningful comparison. Here, it partitions by the client name to see sales per client.

The following mermaid flowchart shows the process of this query:

graph TD;
    A[Start] --> B[Select customer and sale data];
    B --> C[Apply FIRST_VALUE function];
    C --> D[Apply LAG functions];
    D --> E[Apply LAST_VALUE function];
    E --> F[Group by CustomerName];
    F --> G[Sort by SaleDate];
    G --> H[Return results];
    H --> I[End];

4. Calculating Cumulative Distribution

A salesman wants to know the relative standing of each sale by make. The following SQL query can provide this information:

SELECT
    MakeName, ModelName, InvoiceNumber, 
    SalePrice,
    ROUND(CUME_DIST() 
        OVER (PARTITION BY MakeName ORDER BY SalePrice),2) 
        AS RelativeStanding
FROM
    allsales
ORDER BY
    MakeName, SalePrice, RelativeStanding;

The CUME_DIST() function calculates the cumulative distribution, showing the relative position of a value compared to other values.

The query requires joining multiple tables to select the required fields. It then adds the CUME_DIST() function to the SELECT clause, specifying an ORDER BY element (the sale price) and a PARTITION BY clause (using the make of the vehicle) to group records into subsets.

When using the CUME_DIST() function, note the following:
- The PARTITION BY clause is optional. Without it, the function will show the percentage of vehicles that sell for less than the current car for all sales, regardless of make.
- The field used in the ORDER BY function of the OVER clause must be numeric.
- The ROUND() function is added to prevent displaying 15 decimal places by default. You can omit it if needed.
- You can sort the dataset by the sale price in descending order to see the most expensive sale per make first. The overall query sort order is independent of the ORDER BY in the CUME_DIST() function.

The following table summarizes the key points for the CUME_DIST() function:
| Point | Description |
| ---- | ---- |
| PARTITION BY | Optional for grouping |
| ORDER BY field | Must be numeric |
| ROUND() | For decimal place control |
| Sort order | Independent of CUME_DIST() ORDER BY |

5. Classifying Data using PERCENT_RANK()

The sales director wants to see vehicle sales for each country by value and the percentile of each vehicle sale in the sale hierarchy. The following SQL query can achieve this:

SELECT
    CountryName, SaleDate, InvoiceNumber,
    FORMAT_NUMBER(PERCENT_RANK() 
        OVER (PARTITION BY CountryName 
              ORDER BY TotalSalePrice), "0.00 %") AS PercentageRanking
FROM
    allsales
ORDER BY
    CountryName, TotalSalePrice DESC;

The PERCENT_RANK() function indicates the relative rank of a row inside a group of rows. It requires the OVER clause and an ORDER BY function with a numeric field.

The query selects the necessary tables, adds desired fields to display, and uses the ORDER BY clause for better comprehensibility. The PARTITION BY clause subsets the relative sale position by country.

When using PERCENT_RANK() , remember:
- The field used in the ORDER BY function of the OVER clause must be numeric.
- Adding the field used as the basis for PERCENT_RANK() to the SELECT clause can make the output easier to understand.
- Formatting the output as a percentage makes it more comprehensible, but it’s not required.

The following mermaid flowchart shows the process of this query:

graph TD;
    A[Start] --> B[Select necessary tables];
    B --> C[Add fields to display];
    C --> D[Apply PERCENT_RANK() function];
    D --> E[Partition by CountryName];
    E --> F[Sort by TotalSalePrice];
    F --> G[Format output];
    G --> H[Return results];
    H --> I[End];

6. Using the LAG() Function with Alphabetical Data

The marketing director wants a list of all the company’s customers in alphabetical order, grouped by the first letter of their name. The following SQL query can help:

SELECT
    CASE
        WHEN LEFT(CustomerName, 1) = 
             LAG (LEFT(CustomerName, 1)) 
             OVER (ORDER BY CustomerName) THEN NULL
        ELSE LEFT(CustomerName, 1)
    END AS IndexLetter,
    CustomerName
FROM
    SalesByCountry
GROUP BY
    CustomerName
ORDER BY
    CustomerName;

Isolating the first letter of each client’s name is easy using the LEFT() function. However, showing the index letter only when a new letter occurs requires the LAG() function.

The code first applies the LAG() function to the LEFT() function on the customer name, specifying 1 row to go back. It then uses a CASE statement to compare the first letter of the current and previous customer names. If they are the same, it outputs NULL ; otherwise, it outputs the first letter.

The following table summarizes the steps for this query:
| Step | Action |
| ---- | ---- |
| 1 | Apply LAG() to LEFT() function on customer name |
| 2 | Compare first letters using CASE statement |
| 3 | Output NULL or first letter based on comparison |
| 4 | Group by CustomerName |
| 5 | Order by CustomerName |

In conclusion, these SQL window functions are powerful tools for analyzing sales data. They allow us to create running totals, averages, and counts, segment the output, and show the relative weighting of values in a list. The functions covered in this article are summarized in the following table:
| Function | Description |
| ---- | ---- |
| ROW_NUMBER() | Adds a sequential number to each record |
| LAG() | Reaches back over previous records |
| FIRST_VALUE() | Returns the starting value |
| LAST_VALUE() | Returns the final value |
| ROWS BETWEEN | Defines record range for running calculation |
| PRECEDING | Helps define record range |
| CURRENT ROW | Specifies range including current row |
| CUME_DIST() | Calculates cumulative distribution |
| PERCENT_RANK() | Indicates relative rank as a percentage |

SQL Window Functions: Analyzing Sales Data

7. Summary of Key Concepts

Let’s summarize the key concepts and functions we’ve explored so far:

Function	Key Elements	Use Case
FIRST_VALUE() and LAST_VALUE()	OVER, PARTITION BY, ORDER BY, Field	Compare first and last sale prices for each customer
AVG() with ROWS BETWEEN	AVG(), OVER, PARTITION BY, ORDER BY, ROWS BETWEEN, PRECEDING, CURRENT ROW	Calculate rolling average sale values
LAG()	OVER, ORDER BY, PARTITION BY	Show previous specific sales
CUME_DIST()	OVER, ORDER BY, PARTITION BY (optional)	Determine relative standing of sales by make
PERCENT_RANK()	OVER, ORDER BY (numeric field), PARTITION BY (optional)	Classify sales by percentile within a group
LAG() with alphabetical data	LAG(), LEFT(), CASE statement	Create an alphabetical index of customers

This table provides a quick reference for understanding the different window functions, their required elements, and typical use cases.

8. Practical Tips for Using Window Functions

When working with SQL window functions, the following practical tips can enhance your efficiency and accuracy:

Understand the Data Flow : Before writing a query, visualize how the data will be partitioned, sorted, and aggregated. This can help you choose the right functions and parameters. For example, when using PARTITION BY , think about how the data should be grouped to achieve your analysis goal.
Test with Small Datasets : When dealing with complex queries, start by testing them on a small subset of data. This can help you identify errors and understand how the functions work without having to wait for long query execution times on large datasets.
Use Descriptive Aliases : When naming columns in your query results, use descriptive aliases. This makes the output more understandable, especially when multiple functions are involved. For instance, instead of using a generic alias like col1 , use something like InitialOrder or AverageSalePrice .
Be Mindful of Performance : Some window functions can be computationally expensive, especially when used on large datasets. Make sure to use appropriate indexing on the columns used in PARTITION BY and ORDER BY clauses to improve query performance.

9. Advanced Use Cases and Extensions

Beyond the basic use cases we’ve covered, window functions can be used in more advanced scenarios:

Combining Multiple Window Functions : You can combine multiple window functions in a single query to perform more complex analyses. For example, you could calculate both the rolling average and the cumulative distribution in the same query to gain deeper insights into sales data.

SELECT
    CustomerName,
    SaleDate,
    TotalSalePrice,
    AVG(TotalSalePrice) 
        OVER (PARTITION BY CustomerName ORDER BY SaleDate 
              ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
        AS AverageSalePrice,
    ROUND(CUME_DIST() 
        OVER (PARTITION BY CustomerName ORDER BY TotalSalePrice),2) 
        AS RelativeStanding
FROM
    allsales
ORDER BY
    CustomerName, SaleDate;

Using Window Functions in Subqueries : Window functions can also be used in subqueries to perform intermediate calculations. This can be useful when you need to perform multiple levels of aggregation or filtering.

SELECT
    CustomerName,
    SaleDate,
    TotalSalePrice,
    AvgSale
FROM (
    SELECT
        CustomerName,
        SaleDate,
        TotalSalePrice,
        AVG(TotalSalePrice) 
            OVER (PARTITION BY CustomerName ORDER BY SaleDate 
                  ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
            AS AvgSale
    FROM
        allsales
) subquery
WHERE
    TotalSalePrice > AvgSale;

The following mermaid flowchart shows the process of the combined window functions query:

graph TD;
    A[Start] --> B[Select customer and sale data];
    B --> C[Apply AVG() window function];
    C --> D[Apply CUME_DIST() window function];
    D --> E[Group by CustomerName];
    E --> F[Sort by SaleDate and TotalSalePrice];
    F --> G[Return results];
    G --> H[End];

10. Conclusion

In this article, we’ve explored the power of SQL window functions in analyzing sales data. These functions provide a flexible and efficient way to perform various calculations, such as comparing first and last sales, calculating rolling averages, and determining relative rankings.

By understanding the key elements and use cases of each function, you can leverage them to gain valuable insights from your data. Whether you’re a sales director looking for trends in sales data or a marketing director organizing customer information, SQL window functions can be a valuable addition to your analytical toolkit.

Remember to keep the practical tips in mind when working with these functions, and don’t be afraid to explore advanced use cases and extensions to meet your specific analysis needs. With practice, you’ll be able to write complex and effective SQL queries using window functions.

So, go ahead and start applying these functions to your own sales data. You’ll be surprised at the depth of analysis and insights you can achieve.