A look at SQL Server's solutions for SCD (leaving aside the SQL Server 2008 CDC approach), excerpted from *Expert SQL Server 2005 Integration Services*.
Note that the SCD Types 0, 1, and 2 introduced here differ somewhat from Kimball's Types 1, 2, and 3, so read carefully. The main differences: Kimball's Type 3 handles a change by adding a new attribute (column), which SSIS does not address here; in its place, SSIS adds Type 0 and inferred members. Type 0 is an attribute that never changes, while an inferred member is a placeholder row added during the fact-table ETL. Ideally such rows would be filtered out within a transaction, but often (in fact, most of the time) the data volumes are too large to permit one, especially a serializable transaction.
SSIS Slowly Changing Dimension Wizard
Now it’s time to look at the built-in support for dimension ETL, called the Slowly Changing Dimension (SCD) Wizard. The SCD Wizard is a data-flow transformation and initially works like all the other transformations: simply drag and drop the transformation into the data flow and connect it to the upstream source or transformation. Figure 4-16 shows the data flow that was used earlier in the chapter, with the SCD transformation now connected to the output of the Union All.
Double-clicking the transformation will invoke the wizard. Like other user-interface wizards, a series of windows prompts you for the configuration needed to build the dimension ETL process. One of the nice advantages of the SCD Wizard is that it allows for very rapid ETL development.
The SCD Wizard supports Type 1 changing attributes, Type 2 historical attributes, inferred members, and Type 0 fixed attributes, all out-of-the-box.
When the source is connected to the SCD component and the wizard is invoked, the first screen prompts you to identify the target dimension table, then the mapping of source columns from the data flow pipeline to the columns in the dimension table, and, finally, the business keys in the source and dimension table. Figure 4-17 shows the mapping between the source rows generated for the product dimension and the dimension columns themselves.

Figure 4-17: Mapping between source rows generated for the product dimension and the dimension columns
Note that the data types must match in order for the source columns to be compared with the dimension table columns, which may require use of the Data Conversion transformation in the upstream data cleansing logic. For the wizard to handle the matching automatically, the column names must be the same. Matching names is not a requirement, because the matching can be done manually. Furthermore, if you have any columns in the dimension table that are not attributes, but rather management or metadata columns (such as StartDate and EndDate columns that identify when the row is active), these will not be matched. Later options in the wizard will give you the opportunity to specify usage of these.
The final step in the first screen is to identify the business key or keys. In Figure 4-17, the ProductAlternateKey is the business key manually matched to the source column Product ID (which has been aliased as ProductAlternateKey). The business keys help identify dimension records that need to be added as new members. The business keys also provide part of the equation for identifying matching records that need to be evaluated for changes.
The next screen of the wizard is about associating the dimension columns that have just been matched with the dimension change type. The wizard does not use the common dimension changing type numbers (Type 0, Type 1, or Type 2); rather, it uses the descriptive terms to identify the type of change (fixed, changing, or historical).
Figure 4-18 shows the matching of the dimension changes. The column on the left contains any non-business key matches identified in the prior screen, and the column on the right is a drop-down of the three different change types.
In this example, the following groupings were made:
-
Fixed attributes: Size (a change in size generates a new product; therefore, this attribute is fixed)
-
Historical attributes: Class, ModelName, ListPrice
-
Changing attributes: Color, DaysToManufacture, EnglishProductName, FinishedGoodsFlag, FrenchProductName, ProductLine, ProductSubcategoryKey, ReorderPoint, SafetyStockLevel, SizeRange, SizeUnitMeasureCode, SpanishProductName, StandardCost, Style, Weight, WeightUnitMeasureCode
Important | The matching shown in Figure 4-18 can be a tiresome process if you have dozens of dimension columns and you are using your mouse to select the column in the drop-down. A quicker way to fill in these values is to use the Tab key and the up and down arrow keys. After finishing a row, the Tab key will send the cursor to the next row and automatically select the next column in the list. |
Now that you have defined the column matches and identified the change types, the next few screens will help you to manage the advanced requirements for the dimension ETL process.
In the next screen, shown in Figure 4-19, some specific questions are asked about Fixed attribute and Changing attribute members.
If you do not identify any Fixed attribute or Changing attribute columns, then the respective detail questions shown in the screen shot will be grayed out.
The option for fixed attribute asks, “If there is a change in a fixed attribute, what should happen?”
-
By leaving the box unchecked, the change will be ignored and the value in the dimension table will stay the same as it was originally.
-
By selecting the box, if a change is detected in a Fixed attribute column, then the transformation will intentionally generate an error, so the data can be reviewed. This may be useful if, when a change is identified, it indicates a bigger problem.
The option for changing (Type 1) attributes identifies which records to update when a change happens. In other words, if the dimension record has any Type 2 historical attributes, there may be multiple records for the same business key.
-
When the check box is selected, all the related records (the current dimension member and the outdated members) are updated.
-
When the option is cleared, only the current record gets updated with the new value of the column. For example, if Color changes from blue to aqua, and the particular business key has five historical records, then only the current record is updated to aqua; the rest of the historical records remain blue.
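As an illustrative sketch (the business key value is hypothetical, and the current-row test assumes the StartDate/EndDate scheme used later in this example), the two settings correspond to UPDATE statements along these lines:

```sql
-- Check box cleared: update only the current record for the business key
-- (EndDate IS NULL marks the current row in this example's scheme).
UPDATE dbo.DimProduct
SET Color = 'aqua'
WHERE ProductAlternateKey = 'BK-R93R-62'  -- hypothetical business key
  AND EndDate IS NULL;

-- Check box selected: update all records (current and expired) for the key.
UPDATE dbo.DimProduct
SET Color = 'aqua'
WHERE ProductAlternateKey = 'BK-R93R-62';
```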
If any historical attributes have been selected, then the next screen will prompt you to specify how the current row is identified. Figure 4-20 shows the Historical Attribute Options screen.
The two choices to identify a current row are:
-
Use a flag column (usually Boolean) to identify whether a row is current or not. Besides just selecting the column that should be used, the SCD Wizard also supports defining what values identify a current record and expired record. Note also that the columns in the drop-down list are any dimension columns that have not been matched from the source. That is because these are considered metadata columns that are used for management purposes like this.
-
Use a StartDate and EndDate combination to manage when a dimension record is active. The StartDate and EndDate column selections need to be dimension table columns defined with a datetime data type. Furthermore, one other option exists when taking the start time and end time approach: choosing which package variable should be used as the value to update the record with. In other words, when a new dimension member needs to be added because there is a change in a Type 2 historical attribute, the prior record needs to first be updated with a new EndDate, and the new record needs to be initiated with a new StartDate. Any system or user variable can be used. One good choice is the System::StartTime variable, which is automatically updated when the package is executed. In this example, the StartDate and EndDate columns are used in conjunction with the System::StartTime variable.
Important | Choosing to use StartDate and EndDate columns will give you the advantage of knowing exactly when a particular dimension member was active, either for reporting purposes or to handle the scenario in which a fact record arrives late and a historical dimension record needs to be selected instead of the current member. |
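To make the Type 2 mechanics concrete, the expire-then-insert pattern under the StartDate/EndDate scheme is roughly equivalent to the following T-SQL sketch; the key and column values are hypothetical, and the column list is abbreviated:

```sql
-- Sketch of a Type 2 historical change under the StartDate/EndDate scheme.
DECLARE @Now datetime;
SET @Now = GETDATE();   -- the wizard uses the System::StartTime variable

-- 1) Expire the current record for the business key.
UPDATE dbo.DimProduct
SET EndDate = @Now
WHERE ProductAlternateKey = 'BK-R93R-62'   -- hypothetical business key
  AND EndDate IS NULL;

-- 2) Insert a new current record (new surrogate key, open-ended EndDate).
INSERT INTO dbo.DimProduct
    (ProductAlternateKey, ModelName, ListPrice, StartDate, EndDate)
VALUES
    ('BK-R93R-62', 'Road-750', 539.99, @Now, NULL);
```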
The SSIS inferred member support is found in the next screen. As a reminder, inferred members are added during the fact ETL and updated during the dimension ETL. The built-in support for inferred members revolves around the dimension update. Figure 4-21 shows the options available for inferred members.
First of all, inferred member support is either enabled or disabled. If enabled, the SCD Wizard needs to know how to identify whether a record is an inferred member. The two choices are to leave all the dimension columns in the dimension table as NULL, or to use an Inferred Member Boolean column in the table that identifies which rows are inferred members.
Since the product dimension table in the AdventureWorksDW database does not have an inferred member column, the first choice is selected.
Important | Using the first choice for inferred members (all attributes contain NULL values) is often not practical because, first of all, it assumes that the columns in the dimension table allow NULL values, and secondly, it makes for difficult querying. Using an unknown value, for example, is often a better way to present data in a reporting system. In addition, if Analysis Services is used for analytics, NULL values are also not a good choice. The best practice is to define an inferred member column and handle the identification by setting the Boolean value to True or False. |
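For context, here is a hedged sketch of how an inferred member comes into being during the fact ETL, assuming the best-practice InferredMemberFlag column described above (this column is not part of the shipped AdventureWorksDW schema, and the key value is hypothetical):

```sql
-- A fact row arrives with a business key not yet in the dimension,
-- so a placeholder (inferred) member is inserted with minimal attributes.
INSERT INTO dbo.DimProduct
    (ProductAlternateKey, EnglishProductName, InferredMemberFlag, StartDate)
VALUES
    ('BK-NEW-001', 'Unknown', 1, GETDATE());

-- Later, the dimension ETL matches on ProductAlternateKey, updates all
-- attributes in place (no new record is generated), and clears the flag:
-- SET InferredMemberFlag = 0
```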
The final screen, not shown, presents a summary of the outputs that will be created. By selecting Finish on this screen, the SCD Wizard will take all the configuration options and create several downstream transformations and destinations. Figure 4-22 shows the end result of the SCD Wizard: the SCD transformation remains, but it now feeds several outputs that handle the different attribute types.
Since SSIS dynamically builds the data flow, the resulting layout in this example may not look exactly like what you see in your own testing. However, the functionality is the same. For this example, on the left are changing attributes. Down the center are new members and historical attributes. On the right are inferred members. Starting with the SCD transformation, when a dimension row is processed, the SCD will determine which (if any) attribute changes occur, and whether there are new members to be added or inferred members to update. Consider the SCD transformation to be like a Conditional Split: it evaluates every row, one at a time, and routes the records to the different outputs.
The simplest output to understand is the Changing Attributes Updates Output, which is linked to the OLE DB Command 2 transformation, connected to the AdventureWorksDW database. The following code shows the SQLCommand property, which defines the UPDATE statement:
UPDATE [dbo].[DimProduct]
SET [Color] = ?
,[DaysToManufacture] = ?
,[EnglishProductName] = ?
,[FinishedGoodsFlag] = ?
,[FrenchProductName] = ?
,[ProductLine] = ?
,[ProductSubcategoryKey] = ?
,[ReorderPoint] = ?
,[SafetyStockLevel] = ?
,[SizeRange] = ?
,[SizeUnitMeasureCode] = ?
,[SpanishProductName] = ?
,[StandardCost] = ?
,[Style] = ?
,[Weight] = ?
,[WeightUnitMeasureCode] = ?
WHERE [ProductAlternateKey] = ?
What you should note in the UPDATE statement is that only the columns that were defined as changing attributes (Type 1) are included, simply because this output is only for the Type 1 changing attributes. Also notice that the SQL statement is an OLE DB parameterized statement with question marks, which is the way the OLE DB provider handles parameterization. Figure 4-23 shows the Column Mappings tab, which maps (in order) the pipeline input columns to the parameterized query.
The order of the question marks defines the order of the mappings.
The second output is the Inferred Member Updates Output. This output is very similar to the Changing Attributes Updates Output because it also performs an UPDATE statement. Just like the first output, the inferred member output uses an OLE DB Command transformation to handle the updates (in this case, the OLE DB Command 1 transformation). The UPDATE statement defined in the SQLCommand property is as follows:
UPDATE [dbo].[DimProduct]
SET [Class] = ?
,[Color] = ?
,[DaysToManufacture] = ?
,[EnglishProductName] = ?
,[FinishedGoodsFlag] = ?
,[FrenchProductName] = ?
,[ListPrice] = ?
,[ModelName] = ?
,[ProductLine] = ?
,[ProductSubcategoryKey] = ?
,[ReorderPoint] = ?
,[SafetyStockLevel] = ?
,[Size] = ?
,[SizeRange] = ?
,[SizeUnitMeasureCode] = ?
,[SpanishProductName] = ?
,[StandardCost] = ?
,[Style] = ?
,[Weight] = ?
,[WeightUnitMeasureCode] = ?
WHERE [ProductAlternateKey] = ?
The difference, as you would expect, is that there are more columns in the UPDATE statement. Not only are the Type 1 changing attributes updated, but also the Type 2 historical attributes. Included in the UPDATE statement are the Class, ModelName, and ListPrice columns, which were defined as Type 2 historical attributes. These are updated because of the nature of an inferred member, which requires updates to all the columns without generating a new record. Furthermore, if you had defined an Inferred Member Flag column, this is where it would also be updated. Just like the first OLE DB Command transformation, the order of the question marks defines the order of the mapping.
The next two outputs to consider are also related. Both the New Output (new members to be added) and the Historical Attribute Inserts Output add rows to the dimension table. What you can see in Figure 4-22 earlier is that a Union All transformation is used to bring these two outputs together for a single destination insert.
If the business key from the source does not exist in the dimension table, it’s identified as a new member that needs to be inserted. The New Output that handles new members goes directly to the Union All transformation. With the Historical Attribute Inserts Output, before the records are brought together in the Union All, a couple of metadata management tasks need to happen. Whether the new historical-attribute record is marked as current through a combination of dates or a separate column, the old record needs to be updated before the insert can happen. Either the EndDate column is updated, or a current flag column is updated, which is handled in a two-step process:
-
The Derived Column transformation that is attached to the Historical Attribute Inserts Output adds either an EndDate column to the data flow (as in this example) or the expired flag value. Figure 4-24 shows the Derived Column editor, which defines an EndDate column and uses the System::StartTime variable as specified in the SCD Wizard.
-
Another OLE DB Command transformation is used to update the EndDate for the expired record, based on the business key and the current indicator (in this case, the record to be expired will currently have an EndDate of NULL). The SQLCommand property of this OLE DB Command transformation is as follows:
UPDATE [dbo].[DimProduct] SET [EndDate] = ? WHERE [ProductAlternateKey] = ? AND [EndDate] IS NULL
One nice feature of the OLE DB Command transformation, as this example shows, is that the records from the pipeline can perform the UPDATE statement and then still be available downstream for other purposes. After the expired record has been updated, the record is ready to be “union-ed” with the new member output.
Before the final insert of the new members and Type 2 historical attributes, one final step is necessary: you need to add the StartDate (or current record flag). This is handled by a Derived Column transformation, which adds a column to the pipeline called StartDate (in this example). Figure 4-25 shows the details of the Derived Column transformation that falls just below the Union All transformation and before the OLE DB destination (from the data flow shown earlier in Figure 4-22).
Just like the EndDate, the StartDate uses the System::StartTime variable for the new record. New records, whether new members or historical-attribute inserts, require the EndDate to be NULL. (If you have specified a current indicator instead, you put the indicator’s current value in this Derived Column transformation.)
When executed, the SCD transformation routes the rows from the source to the different outputs. Notice in Figure 4-26 that not all of the 504 input rows are sent to the outputs. This is because some of the records do not go through any changes, so they are effectively ignored.
Advanced Properties and Additional Outputs of the SCD
There are two advanced properties of the SCD that can be used to customize how the SCD transformation handles the processing:
-
The CurrentRowWhere property identifies how the current dimension row for a source record is found. If you have configured the SCD to use a Current Flag column, then, by default, this property filters on the Current Flag using the value you defined in the wizard as current. Alternately, if you specified StartDate and EndDate columns, then, by default, the property assumes that the EndDate column IS NULL. This property can be changed if you need to redefine how the SCD transformation searches for the current record. For example, if your organizational standards do not allow NULL values, you would have to modify this property to check the EndDate for the default value set for the column (oftentimes, a date far in the future is used to accommodate this, such as 1/1/2050).
-
The SQLCommand property contains the SQL syntax used for the lookup against the dimension table to either determine if a new member needs to be created, or if there have been any changes in a Type 0 fixed attribute, Type 1 changing attribute, or Type 2 historical attribute.
Both of the properties referenced here can be found in the Properties window when the SCD transformation is selected, as shown in Figure 4-27, or by looking at the advanced editor of the SCD transformation.
Furthermore, the SCD transformation contains two additional outputs that are not used by default, but are useful for auditing and data validation.
The first output enables you to capture the rows that have not gone through any change. Very likely, if you are pulling the entire dimension source (as opposed to targeting just new and changed records), you will have many dimension records from the source that are unchanged and completely in synch with the dimension table. Although a change has not happened, you may need to count the number of unchanged rows, or capture them in a table or file for review. The Unchanged Output is accessible by selecting the green path output from the SCD transformation and connecting it to another transformation or destination. When this is done, you will be prompted to choose the correct output from among the remaining unused outputs. In Figure 4-28, a Row Count transformation is used to capture the number of unchanged rows into a variable that is later used for auditing purposes.
The second additional output is the Fixed Attribute Output, which sends out any rows where a fixed attribute column has changed when it should not have. Rather than a Row Count, a better use of this output is to capture the records to a staging table for review, since a change was not supposed to happen.
Tip | Only when the Ignore fixed attribute changes option is selected will this output be used. Otherwise, if a fixed attribute change occurred, the SCD would intentionally fail. |
Slowly Changing Dimension Wizard Advantages and Disadvantages
The SCD Wizard is a very powerful tool, and will be appreciated by ETL developers who commonly deal with managing complicated ETL processes for dimensions. Several benefits will be achieved by using the built-in SCD support. However, there are also a few limitations surrounding the SCD that should be mentioned.
The advantages focus on management, development, and standardization, including the following:
-
Simplicity: The SCD Wizard can handle most dimension scenarios. It makes the often-complicated dimension processing straightforward, and helps standardize ETL development for dimensions.
-
Rapid development: The SCD can save time in the development lifecycle by reducing design and development time, and by easing management and testing. This leaves more availability for other areas of an ETL process.
-
Wizard allows changes: If the inputs entered into the SCD Wizard require changing, the wizard can be re-invoked, and the changes will propagate down to the downstream-generated components automatically. A caveat to this is presented in the limitations section.
-
Customized output transformations: Since the SCD Wizard generates transformations rather than taking a black-box approach, the output can be customized. For example, the OLE DB Command transformation used for Type 1 changing attributes can be removed and replaced with a staging table in order to achieve set-based updates, which often perform faster than row-by-row updates.
-
Beyond dimension processing: The SCD transformation can be used beyond dimension processing, such as for table synchronization. Even though the name suggests that the SCD Wizard focuses exclusively on dimension processing, one alternate use is to leverage just the Type 1 changing attribute support (and the included new member support).
The limitations of the SCD support focus mostly on scalability for large-dimension scenarios:
-
Dimension table lookup scalability: The dimension table is not cached in memory. Therefore, for every row coming in from the source, a separate lookup statement is sent to the dimension table in the underlying relational engine.
-
All updates are row-based: Relational updates are required for the Type 1 changing attribute output, the Inferred Member output, and the Type 2 historical attribute output (to expire the previous record). Because the OLE DB Command transformation is employed, every row coming through these transformations sends a separate UPDATE statement to the dimension table, in a cursor-like fashion. When dealing with several thousand updates, this can be limiting.
-
Customized outputs are overwritten by changes: Although the wizard can be re-run (with the prior run’s values remaining), if you have customized the output and then run through the wizard again, the wizard will overwrite any changes you made when it finishes (the customized transformations will be orphaned by a new set of transformations). Be careful if you are making customizations.
-
Locking issues and inserts: At the same time, data may be queried from the dimension table for comparison, updated in the dimension table to handle Type 1 changes, and inserted for new members and Type 2 historical records. All of this simultaneous activity on the dimension table can slow down the dimension ETL process. Furthermore, the inserts cannot take advantage of the Fast Load option because of the locking contention, resulting in row-by-row inserts.
Optimizing the Built-in Slowly Changing Dimension Support
Later in this chapter, we examine writing an SSIS package for dimension ETL without using the built-in SCD Wizard support. However, if you are dealing with a large dimension table, a couple of techniques can be used to achieve better performance.
Index Optimizations
Since the dimension lookups and updates are both row-by-row operations, be sure to check the indexes on your dimension table to speed up the identification of the current record. If you are seeing very poor performance with the SCD (anything less than approximately 2,000 rows per minute), then chances are the SCD lookups are requiring table scans or bookmark lookups after identifying the record key. For best ETL optimization, create your dimension table’s clustered index on the business key, rather than on the dimension surrogate key. Including the current-indicator flag or end date as the second column in the index will improve performance even more.
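Applied to the product example, the suggestion might look like the following sketch (index names are arbitrary, and it assumes you are free to restructure the table's existing clustered primary key on the surrogate key):

```sql
-- Cluster on the business key plus the current-row column so the SCD's
-- row-by-row lookup of the current record is a seek, not a scan.
CREATE CLUSTERED INDEX IX_DimProduct_BusinessKey
    ON dbo.DimProduct (ProductAlternateKey, EndDate);

-- Keep a unique nonclustered index on the surrogate key for fact joins.
CREATE UNIQUE NONCLUSTERED INDEX IX_DimProduct_ProductKey
    ON dbo.DimProduct (ProductKey);
```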
Important | Index optimization must be balanced between the ETL and the query usage. Optimization for query patterns should take priority over optimizations for ETL. However, some situations may require ETL-focused optimization as a priority to achieve the service-level agreements (SLAs) identified for processing times. Also, be cautious that too many indexes can slow down operations. |
Update Optimizations
Dimension table updates to handle Type 1 changing attributes are a common occurrence in any dimension table ETL process. Although the inferred member output also requires updates, the number of rows will typically be a fraction of the number of Type 1 changing attributes, since inferred members are considered an exception to the rule. Because the updates are row by row (this is the way the OLE DB Command transformation works), dealing with thousands of updates will create a processing bottleneck. One way to improve performance is to replace the OLE DB Command update with a set-based update approach. In Figure 4-29, the OLE DB Command that handles the Type 1 changing output has been replaced with an OLE DB Destination to a staging table.
Using a staging table for this scenario improves performance because it allows a single set-based UPDATE statement to be run. The UPDATE statement is handled with an Execute SQL Task in the control flow.
The set-based statement consists of an inner join between the staging table and the dimension table on the business key, where any matching rows (already identified as requiring a Type 1 changing update) have their attributes updated. The following T-SQL code performs this operation:
UPDATE [dbo].[DimProduct]
SET
[Color] = STG.[Color]
,[DaysToManufacture] = STG.[DaysToManufacture]
,[EnglishProductName] = STG.[EnglishProductName]
,[FinishedGoodsFlag] = STG.[FinishedGoodsFlag]
,[FrenchProductName] = STG.[FrenchProductName]
,[ProductLine] = STG.[ProductLine]
,[ProductSubcategoryKey] = STG.[ProductSubcategoryKey]
,[ReorderPoint] = STG.[ReorderPoint]
,[SafetyStockLevel] = STG.[SafetyStockLevel]
,[SizeRange] = STG.[SizeRange]
,[SizeUnitMeasureCode] = STG.[SizeUnitMeasureCode]
,[SpanishProductName] = STG.[SpanishProductName]
,[StandardCost] = STG.[StandardCost]
,[Style] = STG.[Style]
,[Weight] = STG.[Weight]
,[WeightUnitMeasureCode] = STG.[WeightUnitMeasureCode]
FROM [dbo].[DimProduct]
INNER JOIN [SSISOps].[dbo].[stgDimProductUpdates] STG
ON [DimProduct].[ProductAlternateKey]
= STG.[ProductAlternateKey]
Be sure to truncate your staging table for every ETL run (by adding an Execute SQL Task at the start of the control flow); otherwise, you will be applying updates from stale rows left over from a prior execution.
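The truncation itself is a single statement in that Execute SQL Task:

```sql
-- Clear the staging table so the set-based UPDATE applies only the
-- Type 1 changes captured during the current run.
TRUNCATE TABLE [SSISOps].[dbo].[stgDimProductUpdates];
```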
Handling Advanced Dimension Processing with the Slowly Changing Dimension Support
This section examines how to handle the ETL for advanced dimension forms, combining the functionality of the SCD with other out-of-the-box transformations, focusing on snowflake dimensions, parent-child dimensions, and date dimensions.
Snowflake Dimension Tables
A snowflake table, as briefly described in the beginning of this chapter with the product dimension (see Figure 4-2 earlier), requires some unique ETL handling. In a snowflake dimension, the higher-level tables (subcategory and category in this example) also have surrogate keys. With the product snowflake dimension, the ProductCategoryKey cascades down as a foreign key in the DimProductSubCategory table, and the ProductSubCategoryKey cascades down to a foreign key relationship in the DimProduct table. The ProductKey itself relates directly to the fact table, whereas the surrogate keys in the category and subcategory tables do not relate directly to the reseller fact table.
The design is called a snowflake because when viewed in relationship to the fact table, the table layout looks like a snowflake as opposed to a star. (A star schema has dimension tables one level out from the fact table. A single table dimension is often called a star dimension.) Generally, most dimensions are designed as a star dimension. However, there are two very valuable reasons to break out a dimension table into a snowflake design:
-
When a dimension table has several attributes that relate directly to a higher level within a dimension hierarchy, managing those dimension changes can be a lot easier with a snowflake design. For example, say the product subcategory table contains the English, Spanish, and French names of the subcategory. If these columns were included in the base product table and the subcategory changed for a dimension member, it would be much more difficult to ensure that the Spanish and French names stayed in synch with the subcategory. If they did not, reporting on these attributes would be misleading because of the incorrect association.
-
A second and more compelling reason to use a snowflake is when you have multiple fact tables related to the same dimension at different levels. For example, the Reseller Sales fact table tracks sales at the product level, but the Sales Quota facts are assigned to a product category; if the product dimension had only one table, the category could not be effectively joined. Using a snowflake design, the fact tables can be related to one another because they share a common table at different levels of the product dimension hierarchy. Sharing dimension tables between fact tables is a driving emphasis of dimensional modeling called conformed dimensions.
Important | Analysis Services supports having a standard dimension relate to different measure groups at different levels. (A measure group in Analysis Services is equivalent to a fact table in the database.) The Analysis Services engine naturally understands how to associate this data and perform the right aggregations. |
A straightforward method for processing a snowflake dimension in SSIS is to use multiple SCDs embedded in different data flows, linked by precedence constraints in the control flow, starting at the top level of the snowflake tables and working down to the lowest level. Figure 4-30 shows the control flow of the product dimension package. Note that the very first task is an Execute SQL Task that truncates the staging table used for the set-based update, followed by the three Data Flow Tasks.
The first data flow shown is the product category dimension table, followed by the subcategory dimension table, and concluding with the product data flow. The final Execute SQL Task handles the set-based updates for Type 1 changing attributes as discussed earlier.
The product category data flow is the most straightforward, because it only requires one Lookup transformation to get the category translation attributes and, furthermore, it contains only Type 1 changing attributes. Figure 4-31 shows the product category data flow.
Next, you process the product subcategory. When processing data within intermediate snowflake tables in the hierarchy, a lookup is also required to reference the parent table’s surrogate key. For example, as you process the product subcategory table, include a Lookup transformation to pull the surrogate key of the category, as shown in Figure 4-32.
The ETL processing in the lowest-level product table has already been discussed. Not mentioned, but also included in the data preparation steps for the product dimension, is the need to pull the surrogate key of the subcategory table.
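The reference query behind such a Lookup transformation is simply a select of the parent table's business and surrogate keys, along these lines (the column names are assumed from the AdventureWorksDW naming pattern):

```sql
-- Lookup reference query: the pipeline rows match on the business key,
-- and the surrogate key is returned into the data flow.
SELECT ProductCategoryKey,           -- surrogate key added to the pipeline
       ProductCategoryAlternateKey   -- business key used for the match
FROM dbo.DimProductCategory;
```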
Parent-Child Dimension ETL
The next dimension design to consider is the parent-child dimension. A parent-child dimension is a self-referencing dimension table, and it also has special ETL requirements. Simply put, a parent-child dimension has a surrogate key and a parent surrogate key, which gives the dimension a self-reference. In addition, parent-child dimensions usually have a business key and a parent business key. The self-referencing business key nicely illustrates the parent relationship in the table.
An organizational structure is a great example. An organization typically has common attributes for all the individuals in the organization, such as location, office, salary, and so on, and these attributes relate to higher levels in the hierarchy. A parent-child dimension enables you to build a hierarchy where members at different levels have common attributes. Second, a parent-child dimension also allows the hierarchy to be unbalanced, where not every drill path in the hierarchy goes down to the same level.
The Employee dimension table as shown in the beginning of the chapter (see Figure 4-2) is an example of an organizational structure. Figure 4-33 shows a subset of data and columns within the Employee dimension table.
As you can see, some dimension members relate to other dimension members. For example, Dan Bacon reports to Jean Trenary (Dan’s parent employee key is 44, which is the employee key of Jean). At the top of the table, the Chief Executive Officer, Ken Sanchez, has no parent key and, therefore, no manager. Each member in the table can sit at a different level in the hierarchy. Taking this subset of data and building the parent-child relationship turns into the hierarchy shown in Figure 4-34.
Note a few points about this parent-child dimension:
- This is an unbalanced hierarchy. The levels within this dimension don’t all extend to the lowest level (Laura Norman, for example, has no direct reports). In a USA-based geography dimension, by contrast, everything typically goes down to the same state or city level. Within an unbalanced hierarchy, there are levels that don’t cascade all the way down.
- A parent-child dimension can also be a ragged hierarchy. A ragged hierarchy has holes in it: a member could be at the top level, skip the next level, and go directly to the level below. You must be aware of these variations when processing ETL.
- Parent-child dimensions have shared attributes. Most of the records, except at the top level, share common attributes, such as Employee Address.
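The table shape being described can be sketched in DDL. This is a simplified subset of the AdventureWorksDW DimEmployee columns, not the full table definition; the foreign-key constraint is included only to make the self-reference explicit.

```sql
-- Sketch: a self-referencing parent-child dimension table.
CREATE TABLE dbo.DimEmployee (
    EmployeeKey        INT IDENTITY(1,1) PRIMARY KEY,    -- surrogate key
    ParentEmployeeKey  INT NULL,                         -- self-reference to the manager's surrogate key
    EmployeeNationalIDAlternateKey       NVARCHAR(15),   -- business key
    ParentEmployeeNationalIDAlternateKey NVARCHAR(15),   -- parent business key
    FirstName NVARCHAR(50),
    LastName  NVARCHAR(50),
    CONSTRAINT FK_DimEmployee_Parent
        FOREIGN KEY (ParentEmployeeKey) REFERENCES dbo.DimEmployee (EmployeeKey)
);
```

A NULL ParentEmployeeKey marks the top of the hierarchy (the Chief Executive Officer in the example data).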
When processing parent-child dimensions in SSIS, a couple of methods can be applied for acquiring the parent record key.
The first approach is to use a Lookup transformation to acquire the parent record for the parent-child relationship. Figure 4-35 shows the data flow used to process the Employee dimension table.
This data flow has similarities to the product dimension data flow, as the first several transformations are used to prepare the data for the SCD transformation. Note that the third data flow object is a Lookup transformation used to acquire the parent surrogate key. In this case, the Lookup transformation joins the parent employee business key of the source to the matching employee business key from the employee table. Figure 4-36 shows the Columns tab of the Lookup editor.
In addition to the join being made from the parent business key in the source to the business key in the reference table, the surrogate key that is returned is aliased to match ParentEmployeeKey.
A second approach involves using an Execute SQL Task in the control flow and performing a set-based update to associate the employee with the employee’s manager record. This SQL Task needs to be run after the data flow that adds new dimension records. The following SQL code is used to perform the operation:
UPDATE dbo.DimEmployee
SET ParentEmployeeKey = Parent.EmployeeKey
FROM dbo.DimEmployee
INNER JOIN dbo.DimEmployee Parent
ON DimEmployee.ParentEmployeeNationalIDAlternateKey =
Parent.EmployeeNationalIDAlternateKey
AND Parent.EndDate IS NULL
WHERE DimEmployee.ParentEmployeeKey IS NULL
The query could be handled by a self join or a correlated subquery. In this case, a join is used between the DimEmployee table and itself (aliased as Parent), matching each employee with the parent manager record. The ParentEmployeeKey is updated with the EmployeeKey of the table joined as the parent. Also important to note is that the only records affected are those with ParentEmployeeKey set to NULL, which limits the scope of the update and, therefore, optimizes it.
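For comparison, the correlated-subquery form mentioned above might look like the following sketch; it is logically equivalent to the self join, and which performs better depends on indexing and the optimizer.

```sql
-- Sketch: same update written with a correlated subquery instead of a self join.
UPDATE dbo.DimEmployee
SET ParentEmployeeKey =
    (SELECT Parent.EmployeeKey
     FROM dbo.DimEmployee AS Parent
     WHERE Parent.EmployeeNationalIDAlternateKey =
           DimEmployee.ParentEmployeeNationalIDAlternateKey  -- outer reference to the row being updated
       AND Parent.EndDate IS NULL)                           -- only match the current (active) manager row
WHERE ParentEmployeeKey IS NULL;                             -- only touch rows not yet linked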
A final approach is to use a combination of a Lookup transformation and an Execute SQL Task. The Lookup-only approach may fall short of handling all situations: if an employee is added in the same run as the assigned manager, the parent employee key would not yet be available when the employee row is processed. The combined approach also gives better scalability in higher-volume situations by reducing the rows affected by the update.
Date Dimension ETL
The date dimension is probably the most common and conformed dimension in any data warehouse or data mart structure. In other words, most fact tables have relationships to a Date table. In the sample AdventureWorksDW, the date dimension table is called DimTime. Honestly, it’s poorly named. The word Time has connotations of time of day (hour or minute), not the date grain, which is what is actually stored in the DimTime table. DimDate would have been a better name.
A date dimension can have many different attributes, besides the year, month, and date. It might include the following:
- Day of week
- Week of year
- Holiday
- Fiscal hierarchy
The attributes combine to create different hierarchies. An example of a natural calendar hierarchy would be Year-Quarter-Month-Date or Year-Week-Date. In the natural calendar, weeks do not line up with months (a week can span two months). Therefore, there are two natural calendar hierarchies. Because of this challenge, organizations have come up with different, and often unique, fiscal hierarchies. Your organization may have its own fiscal hierarchy.
A common approach to a custom fiscal hierarchy is to break a quarter into three periods: four weeks in period 1, four weeks in period 2, and five weeks in period 3 (commonly called a 4-4-5 fiscal date hierarchy). A period essentially replaces a month to accommodate the week-month mismatch. One organization we worked with had at least 10 different fiscal hierarchies; every branch of the organization wanted to see the sales data from a different perspective.
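The 4-4-5 period assignment can be derived from the week number within the quarter. The following sketch assumes a @WeekOfQuarter value from 1 to 13 has already been computed; it is an illustration of the rule, not part of any AdventureWorksDW schema.

```sql
-- Sketch: map week-of-quarter (1-13) to a 4-4-5 fiscal period.
DECLARE @WeekOfQuarter INT;
SET @WeekOfQuarter = 9;                   -- example input

SELECT CASE
           WHEN @WeekOfQuarter <= 4 THEN 1    -- first four weeks  -> period 1
           WHEN @WeekOfQuarter <= 8 THEN 2    -- next four weeks   -> period 2
           ELSE 3                             -- last five weeks   -> period 3
       END AS FiscalPeriodOfQuarter;
```

The same CASE expression could be embedded in a Derived Column transformation or in the query that loads the date dimension.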
When it comes to handling the date dimension ETL, the dimension is relatively easy to process because once a new date record is added, it is usually never updated. The Date dimension is commonly not snowflaked into multiple tables.
Important | The date dimension is not typically snowflaked even if you have higher-level fact tables and a multi-grained scenario. For example, account balances might be tracked at a week level, whereas sales come in at a day level, and inventory at a month level. The date dimension almost always uses the day for granularity. You would use the first period of that grain as the key. For example, if inventory is at the month level, use the first day of the month as the key for the month. |
There are a few different ways to process date-dimension ETL. The following are some common mechanisms:
- Handle the date dimension population through a stored procedure. Whether you’re calculating one date at a time every day or a year’s worth of dates into the future once a year, a stored procedure can load all of the records and handle all the date logic.
- Use a data flow with Derived Column transformations and date part expressions. The data flow generates a set of dates, and additional attributes are added using the date functions in SSIS.
- Use a For Loop Container to iterate through a range of dates with SQL inserts. The For Loop Container can be configured to increase a date variable by one on every loop and then leverage an Execute SQL Task to manage the inserts, one row at a time.
All of these approaches are viable. Because the date dimension usually contains a small number of records, the package will process quickly regardless of the approach chosen.
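As an illustration of the stored-procedure approach, the following sketch loops day by day and inserts one row per date. The procedure name and the column subset are assumptions, not the full DimTime schema, and a production version would add the remaining attributes and guard against duplicate dates.

```sql
-- Sketch: populate the date dimension one day at a time in a WHILE loop.
CREATE PROCEDURE dbo.LoadDimDate
    @FromDate DATETIME,
    @ToDate   DATETIME
AS
BEGIN
    WHILE @FromDate <= @ToDate
    BEGIN
        INSERT INTO dbo.DimTime
            (FullDateAlternateKey, CalendarYear, CalendarQuarter,
             MonthNumberOfYear, DayNumberOfWeek)
        VALUES
            (@FromDate,
             DATEPART(yyyy, @FromDate),   -- year
             DATEPART(qq,   @FromDate),   -- quarter
             DATEPART(mm,   @FromDate),   -- month number
             DATEPART(dw,   @FromDate));  -- day of week

        SET @FromDate = DATEADD(dd, 1, @FromDate);   -- advance one day
    END
END
```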
As an example, if the second approach is taken, the process would involve identifying the first new date and the last date to add, followed by a data flow to handle the inserts. Figure 4-37 shows the control flow of a package with an Execute SQL Task followed by a data flow.
Defined in the package are two user variables, FromDate and ToDate, which are populated by the Execute SQL Task using a single-row result set. The FromDate is populated by querying the MAX date in the time dimension plus one day, and the ToDate is populated from the MAX date in the sales header source table. As you would guess, the data flow only needs to run if the ToDate is greater than the FromDate. This is handled through the precedence constraint after the Execute SQL Task. Figure 4-38 shows the Precedence Constraint Editor.
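The two single-row queries behind the Execute SQL Task might look like the following sketch. DimTime.FullDateAlternateKey and Sales.SalesOrderHeader.OrderDate follow the AdventureWorks sample databases, but treat the exact source table as an assumption about this package.

```sql
-- FromDate: the day after the latest date already in the time dimension.
SELECT DATEADD(dd, 1, MAX(FullDateAlternateKey)) AS FromDate
FROM dbo.DimTime;

-- ToDate: the latest order date present in the sales source.
SELECT MAX(OrderDate) AS ToDate
FROM Sales.SalesOrderHeader;
```

Each result is mapped to the corresponding package variable on the Result Set page of the Execute SQL Task.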
If the Execute SQL Task is successful and the @ToDate is greater than @FromDate, then the precedence constraint will allow the Data Flow Task to execute. At this point, if the criteria are met, then there are rows that need to be inserted into the time dimension table. The data flow, shown in Figure 4-39, contains five components:
- A Script Component that generates a row for every day, starting with the FromDate and ending with the ToDate
- A Derived Column transformation that adds the required date attributes
- Two Lookup transformations to pull in some date translations from a translation table
- A destination adapter that inserts the rows into the time dimension table
Since the Script Component can be used as a source (as described in Chapter 2), it is a great candidate to generate rows from the FromDate to the ToDate values. The following code takes the SSIS variables and, using a Do While loop, iterates over the dates from the FromDate to the ToDate, adding a row for every day:
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

Public Class ScriptMain
    Inherits UserComponent

    Public Overrides Sub CreateNewOutputRows()
        ' Start at FromDate and emit one output row per day through ToDate.
        Dim vFromDate As Date = Me.Variables.FromDate
        Do While vFromDate <= Me.Variables.ToDate
            With NewDatesOutputBuffer
                .AddRow()
                .FullDateAlternateKey = vFromDate
            End With
            vFromDate = DateAdd(DateInterval.Day, 1, vFromDate)
        Loop
    End Sub
End Class
The second component to highlight is the Derived Column transformation. Figure 4-40 shows the Derived Column editor, which contains several additional columns using the DATEPART SSIS expression function to pull out common attributes.
Before the OLE DB Destination, which is just an insert into the DimTime table, two Lookups handle the translation, because the time dimension table supports multiple languages for the month- and day-related attributes.
Overall, the package is straightforward, performs well, and leverages SSIS capabilities.