Wednesday, December 23, 2015

CDC Source

article in progress

Aggregate Transformation

The Aggregate transformation applies aggregate functions, such as Average, to column values and copies the results to the transformation output. 

The transformation provides the GROUP BY clause, which we can use to specify groups to aggregate across.


The Aggregate transformation is asynchronous, which means it performs groupings and aggregations using the whole rowset and then publishes the results. 


Only the input columns used in aggregate functions and used for grouping are copied to the transformation output.


We can add multiple outputs. For example, if the Aggregate transformation applies the Sum and the Average functions, each aggregation can be directed to a different output.


We can apply multiple aggregations to a single input column. For example, if we want the sum and average values for an input column named Sales, we can configure the transformation to apply both the Sum and Average functions to the Sales column.


The Aggregate transformation has one input and one or more outputs. It does not support an error output.
 

The following rules apply when handling NULL values (see the T-SQL sketch after this list):

  • In a GROUP BY clause, if the grouping column contains more than one null value, the null values are put into a single group.

  • In the COUNT (column name) and COUNT (DISTINCT column name) functions, nulls are ignored and the result excludes rows that contain null values in the named column.

  • In the COUNT (*) function, all rows, including rows with null values, are counted.
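
A small T-SQL illustration of these rules (the temp table and its values are made up purely for the example):

-- Both NULL Region values fall into a single group; COUNT(Amount) and
-- COUNT(DISTINCT Amount) skip NULL amounts, while COUNT(*) counts every row.
CREATE TABLE #Sales (Region varchar(10) NULL, Amount int NULL);
INSERT INTO #Sales VALUES ('East', 10), ('East', NULL), (NULL, 5), (NULL, NULL);

SELECT Region,
       COUNT(Amount)          AS CountAmount,
       COUNT(DISTINCT Amount) AS CountDistinctAmount,
       COUNT(*)               AS CountAll
FROM #Sales
GROUP BY Region;

DROP TABLE #Sales;

The NULL regions come back as a single group with CountAll = 2 but CountAmount = 1, matching the rules above.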


Handling Big Numbers in Aggregates
The Aggregate transformation includes the IsBig property. If a column value may exceed 4 billion, or a precision beyond the float data type is required, IsBig should be set to 1. Setting the IsBig property to 1 affects the output in the following ways:

  • The DT_R8 data type is used instead of the DT_R4 data type.

  • Count results are stored as the DT_UI8 data type.

  • Distinct count results are stored as the DT_UI4 data type.


Note: We cannot set IsBig to 1 on columns that are used in the GROUP BY, Maximum, or Minimum operations.
 

Performance Considerations
We can set certain properties to enhance the performance of the transformation. Setting these properties improves performance because the transformation can allocate adequate memory for the data it caches.

  • When performing a Group by operation, set the Keys or KeysScale properties of the component and the component outputs. 

  • When performing a Distinct count operation, set the CountDistinctKeys or CountDistinctScale properties of the component. 

 

Custom properties specific to Aggregate Transformation 

  • CountDistinctKeys - A value that specifies the exact number of distinct values the aggregation can write for a Count distinct operation. Specifying this number improves performance because the transformation preallocates the required buffer space. If a CountDistinctScale value is specified, the value in CountDistinctKeys takes precedence.

  • CountDistinctScale - A value that describes approximately how many distinct values the aggregation can write for a Count distinct operation.
    This property can have one of the following values:

    • Low (1)—indicates up to 500,000 key values 

    • Medium (2)—indicates up to 5 million key values 

    • High (3)—indicates more than 25 million key values. 

    • Unspecified (0)—indicates no CountDistinctScale value is used. Using the Unspecified (0) option may affect performance in large data sets.

  • Keys - A value that specifies the exact number of Group By keys. If a KeyScale value is specified, the value in Keys takes precedence.

  • KeyScale - A value that describes approximately how many Group By key values the aggregation can write. This property can have one of the following values: 

    • Low (1)—indicates up to 500,000 key values. 

    • Medium (2)—indicates up to 5 million key values. 

    • High (3)—indicates more than 25 million key values. 

    • Unspecified (0)—indicates that no KeyScale value is used.

  • AutoExtendFactor - A value between 1 and 100 that specifies the percentage by which memory can be extended during the aggregation. The default value of this property is 25.

  • AggregationComparisonFlags - A value that specifies how the Aggregate transformation compares string data in a column.

  • AggregationType - A value that specifies the aggregation operation to be performed on the column. This property can have one of the following values: 

    • Group by (0)

    • Count (1)

    • Count all (2)

    • Count distinct (3)

    • Sum (4)

    • Average (5)

    • Minimum (6)

    • Maximum (7)

 

The following example uses the Aggregate transformation. Before moving ahead with the example, refer to these earlier articles: 

 

Open the SSISStudy / SSISStudy2012 project created earlier. In Solution Explorer, right click on SSIS Packages, select New SSIS Package and rename it to AggregateTransformationExample.


Drag and drop a Data Flow Task on the Control Flow area. Edit Data Flow Task to add an OLE DB Source connected to AdventureWorks database and HumanResources.Employee table. Drag and drop Aggregate Transformation on the Data Flow area.
Connect OLE DB Source to Aggregate Transformation.

Right click on Aggregate and select Edit to open Aggregate Transformation Editor.
Now let's say we want to know, for each manager, the number of Employees working under them, the sum of their VacationHours and the average of their SickLeaveHours. We wish to achieve something of this sort:

Under the Aggregations tab, in Available Input Columns, click the check box against the column name as shown.
The selections made appear in the area below the Input Columns as shown.
We can also select the same column multiple times by choosing it from the drop down under Input Column in a new row.
Rename the Output Alias columns to the required output column names. Operation defines whether the column is used for Group by or which aggregate to perform. In our case, we need the SUM of VacationHours and the AVG of SickLeaveHours. We can do this by selecting the required aggregate under Operation against each Input Column.


Click OK. Drag and drop a Derived Column on the Data Flow area and connect Aggregate to Derived Column. Derived Column in this example will serve as a dummy for us to show the visual output. Add a Data Viewer between Aggregate and Derived Column. You will see the output aliases created in Aggregate appear under Displayed columns in the Data Viewer.
Click OK and return to the Data Flow area. It should look something like this:
Execute the SSIS package and we should see the aggregations in the Data Viewer. We can execute the query in SQL Server Management Studio and compare the result with the Data Viewer output.
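
The comparison query itself is not reproduced in this post; a minimal T-SQL sketch of an equivalent aggregation, assuming the older AdventureWorks schema in which HumanResources.Employee still exposes a ManagerID column (AdventureWorks2012 models the reporting line through OrganizationNode instead), would look something like this:

-- Per-manager head count, total VacationHours and average SickLeaveHours.
-- SickLeaveHours is cast to float so the average is not rounded to an integer,
-- which keeps it comparable with the transformation's Average output.
SELECT ManagerID,
       COUNT(EmployeeID)                  AS NumberOfEmployees,
       SUM(VacationHours)                 AS SumVacationHours,
       AVG(CAST(SickLeaveHours AS float)) AS AvgSickLeaveHours
FROM HumanResources.Employee
GROUP BY ManagerID
ORDER BY ManagerID;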

 

Click for NEXT article in the sequence.  

 

Please feel free to correct me by commenting your suggestions and feedback.

Tuesday, December 22, 2015

XML Source

The XML source reads an XML data file.

 

The data in XML files frequently includes hierarchical relationships.

 

The XML source supports the use of an XML Schema Definition (XSD) file or inline schemas to translate the XML data into a tabular format. Using the XML Source Editor dialog box, the user can generate an XSD from the specified XML data file. 

DTDs are not supported. 

 

The schemas can support a single namespace only; they do not support schema collections. 

 

The XML source does not validate the data in the XML file against the XSD.


The XML source supports three different data access modes. We can specify the XML data file location, a variable containing the file location, or a variable containing the XML data.

 

The XML source includes the XMLData and XMLSchemaDefinition custom properties.

 

The XML source supports multiple regular outputs and multiple error outputs.

 

Using the XML Source Editor

The data in the XML files frequently includes hierarchical relationships. To provide information about the hierarchical relationship of the data in the outputs, the XML source adds a column in the outputs that identifies the parent element for each child element. The XML source creates an output for every element, but it is not required that we use all the outputs.

 

For every output, the XML source adds an error output. By default, the columns in error outputs have Unicode string data type (DT_WSTR) with a length of 255, but we can configure the columns in the error outputs by modifying their data type and length.

 

If the XML data file contains elements that are not in the XSD, these elements are ignored and no output is generated for them. On the other hand, if the XML data file is missing elements that are represented in the XSD, the output will contain columns with null values.

 

When the data is extracted from the XML data file, it is converted to an Integration Services data type. However, the XML data is never converted to the DT_TIME2 or DT_DBTIMESTAMP2 data types, because the source does not support these data types.

 

The XSD or inline schema may specify the data type for elements, but if it does not, the XML Source Editor dialog box assigns the Unicode string data type (DT_WSTR) to the column in the output that contains the element, and sets the column length to 255 characters.

 

If the schema specifies the maximum length of an element, the length of output column is set to this value. If the maximum length is greater than the length supported by the Integration Services data type to which the element is converted, then the data is truncated to the maximum length of the data type. For example, if a string has a length of 5000, it is truncated to 4000 characters because the maximum length of the DT_WSTR data type is 4000 characters; likewise, byte data is truncated to 8000 characters, the maximum length of the DT_BYTES data type. If the schema specifies no maximum length, the default length of columns with either data type is set to 255. Data truncation in the XML source is handled the same way as truncation in other data flow components.

 

Custom properties - All properties are read/write.

  • AccessMode - Integer - The mode used to access the XML data.

  • UseInlineSchema - Boolean - A value that indicates whether to use an inline schema definition within the XML source. The default value of this property is False.

  • XMLData - String - The file or variable from which to retrieve the XML data.

The value of this property can be specified by using a property Expression.

  • XMLSchemaDefinition - String - The path and file name of the schema definition file (.xsd).

Custom properties of the output of the XML source. All properties are read/write.

  • RowsetID - String - A value that identifies the rowset associated with the output.

 

The following example uses the XML Source. In this example, we will create an XML file and use it with the XML Source. Before moving ahead with the example, refer to these earlier articles: 

 

Open the SSISStudy / SSISStudy2012 project created earlier. In Solution Explorer, right click on SSIS Packages, select New SSIS Package and rename it to XMLSourceExample.

 

Before we use the XML Source, we need to create the XML file. For this example, we have queried the AdventureWorks database to get Sales and Territory information in XML format:
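
The exact query is not reproduced here; a hypothetical query along these lines could produce such a file. The aliases H, T and D are chosen to match the node names used later, since FOR XML AUTO nests elements in the order the aliases first appear in the SELECT list:

-- Header > Territory > Detail (Product) hierarchy, returned as elements.
-- TOP is only there to keep the sample file small.
SELECT TOP (10)
       H.SalesOrderID, H.OrderDate,
       T.TerritoryID,  T.Name AS TerritoryName,
       D.ProductID,    D.OrderQty
FROM Sales.SalesOrderHeader AS H
JOIN Sales.SalesTerritory   AS T ON T.TerritoryID  = H.TerritoryID
JOIN Sales.SalesOrderDetail AS D ON D.SalesOrderID = H.SalesOrderID
ORDER BY H.SalesOrderID
FOR XML AUTO, ELEMENTS;

The ROOT('...') option of the FOR XML clause could wrap the result in a root element directly; in this walkthrough the root element is appended by hand, as described next.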

We have modified the output from the above query (i.e. appended a root element) and saved it as XMLSource.xml.

Now, returning to the SSIS designer, drag and drop a Data Flow Task on the Control Flow area. Open the Data Flow tab and drag and drop XML Source on the Data Flow area.

Right click and select Edit to open the XML Source Editor. Under Data access mode, select XML file location to specify the file directly, XML file from variable to read a file path saved in a variable, or XML data from variable to query XML data saved in a variable. For this example, we have selected XML file location. Click on the Browse button next to XML location and select the XMLSource.xml file created earlier. If the XML provided as input has its schema inline, check mark Use inline schema; else, if a schema file is available, click on Browse to select the schema file; otherwise, click on Generate XSD so that the XML Source generates the XSD file required to query the XML. For this example, we have used Generate XSD to generate the XSD schema file and saved it as XMLSource.xsd.

The XMLSource.xml that we have used contains hierarchical information for SalesHeader, Territory and Product. The XML Source output should maintain this hierarchy, and it does so by providing unique RowID values across nodes. The RowID of the parent node is shared with the child node to maintain the relationship, while the child node in turn gets its own RowID if it has a child of its own, and so on, until the last node, which carries its parent's RowID but none of its own.

 

Select Columns. Under Output name, select D for Product node information, select H for Header node information and select T for Territory node information. To maintain the relationship between Header and Territory, the XML Source adds a RowID column to Header and Territory, and the RowID of Header is shared with Territory. Product will have the RowID of Territory but no RowID of its own.

Header RowID = H_Id and Territory RowID = T_Id.

Check mark all the columns required, including the RowIDs.


Click OK.

 

Drag and drop Derived Column on the Data Flow area and connect the XML Source output to it. The Input Output Selection window pops up. Under Output, select the output required and connect it to Derived Column.

Repeat the same step two more times and connect the required outputs. Add a Data Viewer to each path connecting the Source to the Derived Columns. 

Execute the package and you should see the output. Also check the RowID values across outputs to verify that the hierarchy is maintained.


 
Click for NEXT article in the sequence.

Please feel free to correct me by commenting your suggestions and feedback.


Recordset Destination

article in progress

Raw File Source

The Raw File source reads raw data from a file.

The data is native to the source, requiring no translation and almost no parsing. This means that the Raw File source can read data more quickly than other sources such as the Flat File source and the OLE DB source.

The Raw File source is used to retrieve raw data that was previously written by the Raw File destination.


We configure the Raw File source by specifying the name of the file that the source reads. This source does not use a connection manager.

 

This source has one output. It does not support an error output.

 

Custom properties

All properties are read/write.

  • AccessMode: Integer (enumeration) - The mode used to access the raw data. The possible values are File name (0) and File name from variable (1). The default value is File name (0).

  • FileName: String - The path and file name of the source file.

 

The following example uses the Raw File source. In this example, like earlier, we will copy data from a raw file containing data from the Employee table into the Employee_new table. Before moving ahead with the example, refer to these earlier articles:


 Let’s first TRUNCATE or DELETE data from Employee_new table.
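
Either statement below works; this is a trivial sketch that assumes the Employee_new table sits in the dbo schema:

-- TRUNCATE is faster and resets any identity seed, but needs ALTER permission and
-- fails if the table is referenced by a foreign key; DELETE is slower but less restrictive.
TRUNCATE TABLE dbo.Employee_new;
-- DELETE FROM dbo.Employee_new;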

 

Open the SSISStudy / SSISStudy2012 project created earlier. In Solution Explorer, right click on SSIS Packages, select New SSIS Package and rename it to RawFileSourceExample.

 

Before we use Raw File Source, we need to create Raw File using Raw File Destination. For this, drag and drop Data Flow Task on Control Flow area. Open Data Flow tab, drag and drop OLE DB Source on the Data Flow area. Configure to read data from HumanResources.Employee table. Drag and drop Raw File Destination on the Data Flow area.

  Connect OLE DB Source to Raw File Destination.

  Right click on Raw File Destination and select Edit to open the Raw File Destination Editor. Set Access mode as File name, click on the Browse button, select the location where the raw file needs to be created and provide a file name, and under Write option keep the default Create always.

Select Columns and check mark all the columns required as Raw File columns.

Click OK. Execute the package to create raw file RawFileSource. Check if raw file RawFileSource with data is created at the specified location.

 

Now, returning to the Control Flow, drag and drop another Data Flow Task. Connect the output of the first Data Flow to this new Data Flow. For convenience and understanding, we have renamed the first Data Flow as To Raw File and the current Data Flow as From Raw File.

  Open From Raw File and drag and drop Raw File Source on Data Flow area.

  Right click and select Edit to open the Raw File Source Editor. Under Access mode, select File name to specify the file directly or File name from variable to read a file path saved in a variable. For this example, we have selected File name and selected the file created earlier using the Browse button.

Select Columns and check mark all the columns required to be copied to destination table Employee_new.

Click OK.

 

Drag and drop OLE DB Destination on the Data Flow area and connect Raw File Source output to OLE DB Destination. Configure OLE DB Destination to copy data to Employee_new table.

Execute the package and it should run, provided the file is available at the mentioned location.
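
To confirm the copy, a quick row-count comparison can be run in SQL Server Management Studio (a sketch that again assumes Employee_new lives in the dbo schema):

-- The two counts should match once the package has run successfully.
SELECT (SELECT COUNT(*) FROM HumanResources.Employee) AS SourceRows,
       (SELECT COUNT(*) FROM dbo.Employee_new)        AS DestinationRows;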

 

What if the file is deleted by someone who used our system after we went away for some much-needed break? Let's try that by deleting the RawFileSource file and then executing the package. We should run into an error. On closer look, we should find the location and reason for the error.

We have highlighted the location “Error at From Raw File [Raw File Source[1]]” and the reason “the file is not found.”

But our setup first creates the file and only then reads from it, so there should not be any error, right? Package validation does not think that way: it sees that, before package execution, the file does not exist, and hence raises the error. To overcome this, we need to set the DelayValidation property of the Data Flow Task (From Raw File) to True. This delays validation of that task until it actually runs, signaling that we are sure the file will be created before it is read, so the missing file is no longer flagged as a validation error.

Right click the Data Flow Task (From Raw File) and select Properties. Set DelayValidation to True and then execute the package.

 

Click for NEXT article in the sequence.



Please feel free to correct me by commenting your suggestions and feedback.

Wednesday, December 16, 2015

Flat File Source

The Flat File source reads data from a text file. The text file can be in delimited, fixed width, or mixed format.
  • Delimited format uses column and row delimiters to define columns and rows.
  • Fixed width format uses width to define columns and rows. This format also includes a character for padding fields to their maximum width.
  • Ragged right or mixed format uses width to define all columns, except for the last column, which is delimited by the row delimiter.

The Flat File connection manager must be configured to interpret zero-length strings as nulls when using delimited format.



The output columns include the FastParse property. FastParse indicates whether the column uses the quicker, but locale-insensitive, fast parsing routines that Integration Services provides or the locale-sensitive standard parsing routines. By implementing fast parse, a package forfeits its ability to interpret date, time, and numeric data in locale-specific formats, but the package enhances its performance. For example, fast parse supports only the most commonly used date format representations such as YYYYMMDD and YYYY-MM-DD, does not perform locale-specific parsing, does not recognize special characters in currency data, and cannot convert hexadecimal or scientific representation of integers.

If the data flow in the package requires locale-sensitive parsing, standard parse is recommended instead of fast parse.

 

Output columns also include the UseBinaryFormat property. We use this property to implement support for binary data, such as data with the packed decimal format, in files. By default UseBinaryFormat is set to false. If we want to use a binary format, set UseBinaryFormat to true and the data type on the output column to DT_BYTES. When we do this, the Flat File source skips the data conversion and passes the data to the output column as is. We can then use a transformation such as the Derived Column or Data Conversion to cast the DT_BYTES data to a different data type, or we can write custom script in a Script transformation to interpret the data.



This source has one output and one error output.


Custom properties

  • FileNameColumnName - The name of an output column that contains the file name. Can be set by using the Advanced Editor.

  • RetainNulls - A value that specifies whether to retain Null values from the source file as Null values. The default value of this property is False.

  • FastParse - A value that indicates whether the column uses the quicker, but locale-insensitive, fast parsing routines that DTS provides or the locale-sensitive standard parsing routines. Can be set by using the Advanced Editor.


The following example uses the Flat File Source. In this example, we will copy data from a Flat File Source containing data from the Employee table into the Employee_new table. Before moving ahead with the example, refer to these earlier articles:

  • Create a new SSIS project using BIDS / SSDT

 

Let's first TRUNCATE or DELETE data from the Employee_new table. We want to copy data from the Employee table into a new flat file (.txt or text file). This can be done using the Import and Export Wizard in SSMS. We can create Fixed width, Delimited and Ragged right flat files.
For our demonstration, we have copied data from the Employee table to a new fixed width flat file, a new comma delimited flat file and a new ragged right flat file, and renamed the files FixedWidthTextSource, DelimitedTextSource and RaggedRightTextSource respectively.

   

Open the SSISStudy / SSISStudy2012 project created earlier. In Solution Explorer, right click on SSIS Packages, select New SSIS Package and rename it to FlatFileSourceExample.


 

Drag and drop Data Flow Task on Control Flow area. Open Data Flow tab. Drag and drop Flat File Source on Data Flow area.

Rename the Flat File Source to Fixed Width.
Right click Fixed Width and select Edit to open Flat File Source Editor.
Click on New to open Flat File Connection Manager Editor window. This is used to create new Flat File Source Connection manager.
Click on Browse to open File Dialog. Browse to the Fixed Width flat file created earlier. Select and click Open.
We will not change the Locale or Code Page. Under Format, select Fixed Width. Based on the data in your Fixed Width flat file, if the file includes header rows, provide the appropriate values for Header row delimiter and Header rows to skip, and check mark Column names in the first data row.
Select Advanced. Here we need to specify the columns that will appear as output from the source; basically, we will help the Flat File Source interpret the data in the flat file. In this example, we have a total of 14 columns from the Employee table in the flat file. Click on New to create 14 columns.
Type FixedWidth under Connection manager name. Rename the columns. 


 

By default the columns will be of type string [DT_STR] with the default length. We can use Suggest Types to suggest possible column types, but do not completely rely on it. Compare against the schema of the Employee table to get an idea of each column's width and type. Change the InputColumnWidth and DataType properties for each column accordingly. Also, if OutputColumnWidth is enabled, make sure its value is equal to or greater than the InputColumnWidth value.  
Click on Preview and make sure the data from the flat file appears in the correct, expected columns. If any column data appears split across columns, change the column width under Advanced. Click OK.
Flat File Source output columns will by default be of type string [DT_STR]. To avoid any conversion or truncation errors, right click Flat File Source and select Show Advanced Editor. Select the Input and Output Properties tab. Under Inputs and outputs, expand Flat File Source Output and Output Columns, and update the DataType property for each output column. The conversion should be to the corresponding destination column type. For example, the EmployeeID column in the destination table is of type int, so we have converted the column data type to four-byte signed integer (DT_I4). Click OK.
Drag and drop OLE DB Destination. Connect Flat File Source output to OLE DB Destination. Configure OLE DB Destination to map columns from source to Employee_new table.
Execute the package. We should see data copied from Fixed Width Flat File Source to Employee_New table. 
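
Beyond checking the table contents by eye, an EXCEPT comparison can verify the load in Management Studio; this sketch assumes Employee_new has the same columns, in the same order, as HumanResources.Employee:

-- Rows present in the source but missing or altered in the destination;
-- an empty result means every source row was copied intact.
SELECT * FROM HumanResources.Employee
EXCEPT
SELECT * FROM dbo.Employee_new;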


Now let's see an example of a Comma Delimited Flat File Source. Truncate table Employee_New. Drag and drop Flat File Source. Right click and edit. Select New to create a flat file connection manager. Browse to open the comma delimited file created earlier. Under Format, select Delimited. If a text qualifier such as double quotes is used in the source file, specify the same under Text qualifier. Specify Header row delimiter and Header rows to skip according to the source file content. Check mark Column names in the first data row if column names appear as the first row in the source file.
Since we have check marked Column names in the first data row, column names from the source file appear under Columns and Advanced with default DataType and OutputColumnWidth values.
Either change column type and width under DataType and OutputColumnWidth or do so under Advanced Editor for Flat File Source.
Drag and drop OLE DB Destination. Connect Source to destination and execute the package.
We should see data from Comma-Delimited Flat File copied to Employee_New table.



Now let's see an example of a Ragged Right Flat File Source. Truncate table Employee_New. Drag and drop Flat File Source. Right click and edit. Select New to create a flat file connection manager. Browse to open the Ragged Right file created earlier. Under Format, select Ragged right. Specify Header row delimiter and Header rows to skip according to the source file content. Check mark Column names in the first data row if column names appear as the first row in the source file.
Repeat the steps specified under Fixed Width flat file source explanation.

 

Click for NEXT article in the sequence.

 

Please feel free to correct me by commenting your suggestions and feedback.