Wednesday, December 23, 2015

Aggregate Transformation

The Aggregate transformation applies aggregate functions, such as Average, to column values and copies the results to the transformation output. 

The transformation provides the GROUP BY clause, which we can use to specify groups to aggregate across.


The Aggregate transformation is asynchronous, which means it performs groupings and aggregations using whole rowset and then publishes the results. 


Only the input columns used in aggregate functions and used for grouping are copied to the transformation output.


We can add multiple outputs. For example, if the Aggregate transformation applies the Sum and the Average functions, each aggregation can be directed to a different output.


We can apply multiple aggregations to a single input column. For example, if we want the sum and average values for an input column named Sales, we can configure the transformation to apply both the Sum and Average functions to the Sales column.


The Aggregate transformation has one input and one or more outputs. It does not support an error output.
 

The following rules apply when handling NULL values:

  • In a GROUP BY clause, if the grouping column contains more than one null value, the null values are put into a single group.

  • In the COUNT (column name) and COUNT (DISTINCT column name) functions, nulls are ignored and the result excludes rows that contain null values in the named column.

  • In the COUNT (*) function, all rows, including rows with null values are counted.


Handling Big Numbers in Aggregates
The Aggregation transformation includes the IsBig property. If a column value may exceed 4 billion or a precision beyond a float data type is required, IsBig should be set to 1. Setting the IsBig property to 1 affects the output in the following ways:

  • The DT_R8 data type is used instead of the DT_R4 data type.

  • Count results are stored as the DT_UI8 data type.

  • Distinct count results are stored as the DT_UI4 data type.


Note: We cannot set IsBig to 1 on columns that are used in the GROUP BY, Maximum, or Minimum operations.
 

Performance Considerations
Properties that we can set to enhance the performance of the transformation. Setting these properties
we improve performance because the transformation is able to allocate adequate memory for the data that the transformation caches.

  • When performing a Group by operation, set the Keys or KeysScale properties of the component and the component outputs. 

  • When performing a Distinct count operation, set the CountDistinctKeys or CountDistinctScale properties of the component. 

 

Custom properties specific to Aggregate Transformation 

  • CountDistinctKeys - It’s a number to tell how many distinct column count to expect. Specifying this number speeds up performance since transformation preallocates required buffer space. If a CountDistinctScale value is specified, the value in CountDistinctKeys takes precedence.

  • CountDistinctScale - It’s an assumed number of distinct column count to expect.
    This property can have one of the following values:

    • Low (1)—indicates up to 500,000 key values 

    • Medium (2)—indicates up to 5 million key values 

    • High (3)—indicates more than 25 million key values. 

    • Unspecified (0)—indicates no CountDistinctScale value is used. Using the Unspecified (0) option may affect performance in large data sets.

  • Keys - A value that specifies the exact number of Group By keys. If a KeyScale value is specified, the value in Keys takes precedence

  • KeyScale - A value that describes approximately how many Group By key values the aggregation can write. This property can have one of the following values: 

    • Low (1)—indicates up to 500,000 key values. 

    • Medium (2)—indicates up to 5 million key values. 

    • High (3)—indicates more than 25 million key values. 

    • Unspecified (0)—indicates that no KeyScale value is used.

  • AutoExtendFactor - A value between 1 and 100 that specifies the percentage by which memory can be extended during the aggregation. The default value of this property is 25.

  • AggregationComparisonFlags - A value that specifies how the Aggregate transformation compares string data in a column.

  • AggregationType - A value that specifies the aggregation operation to be performed on the column. This property can have one of the following values: 

    • Count (1) 

    • Count all (2) 

    • Count distinct (3) 

    • Sum (4) 

    • Average (5) 

    • Maximum (7) 

    • Minimum (6)

    • Group by (0)

 

The following example uses Aggregate Transformation. Before moving ahead with example, refer these earlier articles: 

 

Open SSISStudy / SSISStudy2012 project created earlier. Open Source Explorer and right click on SSIS Packages and select New SSIS Package and rename it to AggregateTransformationExample.


Drag and drop a Data Flow Task on the Control Flow area. Edit Data Flow Task to add an OLE DB Source connected to AdventureWorks database and HumanResources.Employee table. Drag and drop Aggregate Transformation on the Data Flow area.
Connect OLE DB Source to Aggregate Transformation.

Right click on Aggregate and select Edit to open Aggregate Transformation Editor.
Now let’s say, we want to know for each manager, the number of Employee’s working with them, the sum of their VacationHours and Average of their SickLeaveHours. We wish to achieve something of this sort

Under the Aggregations Tab and in Available Input Columns, click in the check box against the column name as shown
The selections made appear in the area below the Input Columns as shown
We can also select the same column multiple times by selecting from the drop down under Input Column in new row
Rename the Output Alias columns to required output column names. Operation defines either Group by or one can select the aggregate to perform.In our case, we require to get SUM of VacationHours and AVG of SickLeaveHours. We can do by selecting the required aggregate under Operation against each Input Column


Click OK. Drag and drop a Derived Column on the Data Flow are and connect Aggregate to Derived Column. Derived Column in this example will serve as a dummy for us to show the visual output. Add a Data Viewer between Aggregate and Derived Column. You will see the output alias created in Aggregate appear under Displayed columns in Data Viewer
  Click OK and return back to the Data Flow area. It should look something like this
Execute the SSIS package and we should in the Data Viewer the aggregations. We can execute the query in SQL Management Studio and compare the result with Data Viewer output.

 

Click for NEXT article in the sequence.  

 

Please feel free to correct me by commenting your suggestions and feedback.

No comments:

Post a Comment