Tuesday, December 22, 2015

XML Source

The XML source reads an XML data file.

 

The data in XML files frequently includes hierarchical relationships

 

The XML source supports use of a XML Schema Definition (XSD) file or inline schemas to translate the XML data into a tabular format. Using XML Source Editor dialog box, the user can generate an XSD from the specified XML data file. 

DTD's are not supported. 

 

The schemas can support a single namespace only; they do not support schema collections. 

 

The XML source does not validate the data in the XML file against the XSD.


The XML source supports three different data access modes. We can specify the XML data file location, the variable containing file location, or the variable containing XML data

 

The XML source includes the XMLData and XMLSchemaDefinition custom properties.

 

The XML source supports multiple regular outputs and multiple error outputs.

 

Using the XML Source Editor

The data in the XML files frequently includes hierarchical relationships. To provide information about the hierarchical relationship of the data in the outputs, the XML source adds a column in the outputs that identifies the parent element for each child element. The XML source creates an output for every element, but it is not required that we use all the outputs.

 

For every output, the XML source adds an error output. By default, the columns in error outputs have Unicode string data type (DT_WSTR) with a length of 255, but we can configure the columns in the error outputs by modifying their data type and length.

 

If the XML data file contains elements that are not in the XSD, these elements are ignored and no output is generated for them. On the other hand, if the XML data file is missing elements that are represented in the XSD, the output will contain columns with null values.

 

When the data is extracted from the XML data file, it is converted to an Integration Services data type except DT_TIME2 or DT_DBTIMESTAMP2 data types because the source does not support these data types.

 

The XSD or inline schema may specify the data type for elements, but if it does not, the XML Source Editor dialog box assigns the Unicode string data type (DT_WSTR) to the column in the output that contains the element, and sets the column length to 255 characters.

 

If the schema specifies the maximum length of an element, the length of output column is set to this value. If the maximum length is greater than the length supported by the Integration Services data type to which the element is converted, then the data is truncated to the maximum length of the data type. For example, if a string has a length of 5000, it is truncated to 4000 characters because the maximum length of the DT_WSTR data type is 4000 characters; likewise, byte data is truncated to 8000 characters, the maximum length of the DT_BYTES data type. If the schema specifies no maximum length, the default length of columns with either data type is set to 255. Data truncation in the XML source is handled the same way as truncation in other data flow components.

 

Custom properties - All properties are read/write.

  • AccessMode - Integer - The mode used to access the XML data.

  • UseInlineSchema - Boolean - A value that indicates whether to use an inline schema definition within the XML source. The default value of this property is False.

  • XMLData - String - The file or variables from which to retrieve the XML data.

The value of this property can be specified by using a property Expression.

  • XMLSchemaDefinition - String - The path and file name of the schema definition file (.xsd).

Custom properties of the output of the XML source. All properties are read/write.

  • RowsetID - String - A value that identifies the rowset associated with the output.

 

The following example uses XML Source. In this example, we will create XML file and use the same with XML Source. Before moving ahead with example, refer these earlier articles: 

 

Open SSISStudy / SSISStudy2012 project created earlier. Open Source Explorer and right click on SSIS Packages and select New SSIS Package and rename it to XMLSourceExample.

 

Before we use XML Source, we need to create XML file. For this example, we have queried the AdventureWorks database to get information about the Sales and Territory information in XML format:

We have modified (i.e. append root element) and saved the output from above query as XMLSource.xml.

Now returning back to SSIS designer, drag and drop Data Flow Task on the Control Flow area. Open Data Flow tab, drag and drop XML Source on Data Flow area.

Right click and select Edit to open XML Source Editor. Under Data access mode, select XML file location to specify the file directly or XML file from variable to read file path saved in variable or XML data from variable to query XML data saved to variable. For this example, we have selected XML file location. Click on Browse button next to XML location and select the XMLSource.xml file created  earlier. If the XML provided as input has XML Schema inline, check mark Use inline schema, else if schema file is available then click on Browse to select the schema file, else click on Generate XSD so that XML Source generates the XSD file required to query the XML. For this example, we have used the Generate XSD to generate the XSD schema file and saved as XMLSource.xsd.

The XMLSource.xml that we have used has hierarchical information of SalesHeader, Territory and Product. XML Source output should maintain this hierarchy and it does this by providing unique RowID information across nodes. The RowID of the parent node is shared with child node to maintain the relation while child node which itself will have its own RowID provided if it has a child of its own and so on until last node which will have its parents RowID but none of its own.

 

Select Columns. Under Output name, select D for Product node information, select H for Header node information and select T for Territory node information. To maintain relation between Header and Territory, XML Source adds a RowID column in Header and Territory and also RowID of Header is shared with Territory. Product will have RowID of Territory but no RowID of its own.

Header RowID = H_Id and Territory RowID = T_Id.

Check mark all the columns required including the RowID’s.


Click OK.

 

Drag and drop Derived Column on the Data Flow area and connect XML Source output to it. Input Output Selection window pops up. Under Output, select the Output required and connect it to Derived Column.   

Repeat the same step two more times and connect the required outputs. Add Data Viewer to each path connecting Source to Derived Column’s. 

Execute the package and you should see the output. Also check for RowID across outputs to check if hierarchy is maintained.  


 
Click for NEXT article in the sequence.

Please feel free to correct me by commenting your suggestions and feedback.


No comments:

Post a Comment