Sunday, November 29, 2015

No of Data Flow Task executing simultaneously based on the number of Processor and Parallelization setting

In this example, we will see the effect of number of system processors on execution of multiple Data Flow Tasks simultaneously.




We are demonstrating this example using Intel Core i5 system with 4 processors.


We have created a new SSIS package and renamed it to MultipleDFTAndProcessors.dtsx


In this example, we will use multiple Data Flow Tasks. 


We have created replication of HumanResources.Employee table in AdventureWorks2012 database as dbo.Employee. Further we added two columns DF(Number of the Data Flow Task running) and InTime(Time of record insertion to dbo.Employee). 

Let us drag and drop Data Flow Task on the Control Flow area. Using OLE DB Source and Destination, we configured Data Flow Task to copy from HumanResources.Employee and insert to dbo.Employee table. We have used Sort Transformation to sort by BusinessEntityID key field in HumanResources.Employee. We have also used Derived Column Transformation to provide incremental value field DF, to recognize each Data Flow Task, starting with 1.


 


Repeat the same for another 14 times to create in total of 15 Data Flow Tasks using incremental DF values



Now let us experiment by setting Control Flow property MaxConcurrentExecutables. This controls parallelization in Control Flow Task. MaxConcurrentExecutables is -1 by default meaning the number of concurrent Data Flow Tasks to execute will be equal to the number of system processors + 2.

As mentioned earlier, we are experimenting on 4 processors system. If we set MaxConcurrentExecutables to -1, we will be using all the 4 processors in this system.

Execute the package and we should see that simultaneously 6 Data Flow Tasks (i.e. 4 Processors + 2) are executing. Post package execution and on checking dbo.Employee table, we see that InTime for some DF values is same. This means that we can say for sure that parallelization took place and some Data Flow Tasks executed and completed simultaneously. But again parallelization depends on the system load and availability of processors at the time of package execution.

If we set MaxConcurrentExecutables to value greater than 4, Control Flow Engine will try to meet the request for concurrent executions but this again depends on the processor availability and system load. Truncated table dbo.Employee before executing the package with the changes mentioned below.


If we set MaxConcurrentExecutables to value less than 4, we will be using that number of  processors simultaneously. Truncated table dbo.Employee before executing the package with the changes mentioned below.
 
 

 

Thus we can conclude that the number of processors and setting of MaxConcurrentExecutables plays a role in Control Flow Tasks parallelization.

Also, if we set MaxConcurrentExecutables to 1, we can suppress parallelization. The property MaxConcurrentExecutables cannot be set to 0 as it results in error.



Please provide your feedback and suggestions by commenting.  

Saturday, November 28, 2015

T-SQL 2012 Interview questions

Q: Given two tables T1 and T2 with same structure. If T1 has 20 records and T2 has 35 records, then the Full Join should result in how many records?

A: 50 records.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


Q: What is difference between Equi-Join and NonEqui-Join?

A: If all columns in the ON clause are matched on equality ( using operator = ) the join is called "Equi-Join". If one or more columns are using an inequality comparison (using operators like <>, >, >=, <, <=, != ) the join is called "Nonequi-Join".


Q: Given table Country with one column with values India, Australia and England in specified order. Write a T-SQL query to display knockout matches between teams in the specified order: 

India vs Australia

India vs England

England vs Australia


Q: Given table T1 with ID column and row values 3, 10, 19, 20. Find missing values starting with the minimum and maximum from table.


Please leave your suggestions and feedback by commenting

Develop a SSIS package to fetch only Excel files from given folder and read through multiple sheets in each file.

The requirement is to fetch only *.xls/*.xlsx files from given folder. Each file is expected to have random number of sheets but with same metadata i.e. type and number of columns. Read through each of such Worksheet in each file and copy data to common table.


This can be achieved using ForEach Loop Container, OLE DB Source(To read data from Excel 2007 and onwards)/Excel Source(To read data from Excel 97-2003) and any compatible destination in SSIS. 


This example explains steps to read from *.xlsx file with some notes to use *.xls file.

Drag and drop Foreach Loop Container. Configure to read from folder containing *.xlsx files.


Drag and drop another Foreach Loop Container into existing Foreach Loop Container and rename each container according to its purpose. 

In the Connection Managers area, right click and select New ADO.NET Connection...

Select New button under Configure ADO.NET Connection Manager. Under Provider, select Microsoft Office 12.0 Access Database Engine OLE DB Provider and click OK.

Under Connection Manager window, select All. Under Extended Properties, type:

  • EXCEL 12.0; HDR=YES; - For Excel 2007 and onwards

  • EXCEL 8.0; HDR=YES; - For Excel 97-2003

Click OK. 

Click OK on Configure ADO.NET Connection Manager window.

Right click on created ADO.NET Connection Manager and select Properties. 
Selecting Expressions property, click ellipsis button to open Property Expressions Editor window.
Select ConnectionString under Property and click on ellipsis button under Expressions corresponding to ConnectionString property row.

Opens Expression Builder window. Type following under Expression:


"Data Source=" + REPLACE(@[User::ExcelFilePath], "\\\\", "\\") + ";Provider=Microsoft.ACE.OLEDB.12.0;Extended Properties=" + "\""+"EXCEL 12.0;HDR=YES;"+ "\"" + ";"


This expression sets the Data Source to the excel file path from variable ExcelFilePath replacing double backslash with single backslash. Further it sets the Provider and Extended Properties.  

Evaluate Expression and click OK.

Click OK. Note the ADO.NET Connection Manager icon changes to reflect use of expression(fx).


Edit nested Foreach Loop Container. Selecting Foreach ADO.NET Schema Rowset Enumerator, select ADO.NET Connection Manager created.

Under Schema, select Tables.
Just to note, click on Set Restrictions. We see that Table_Name is listed 3rd and for this example, it provides us with SheetNames of the excel file.
Do not check mark on any options here and click Cancel.

Select Variable Mappings.
Create new variable SheetName.
Under Index corresponding to this variable, specify 2 to get TABLE_NAME(WorkSheets for this example).

Click OK. 


Add breakpoints on internal Foreach Loop Container for each iteration. Execute the package. Add Watch on SheetName variable to see the different sheets in excel file.
We can use OLE DB Source/Excel Source to read from these excel sheets based on Excel file version.


If you see package execution failure with following message in Progress tab:


Error: The GetEnumerator method of the ForEach Enumerator has failed with error 0x80131509 "(null)". This occurs when the ForEach Enumerator cannot enumerate.


Rectify this by setting Run64BitRuntime to False under SSIS Project Properties -> Configuration Properties -> Debugging.


Please feel to correct me by commenting your feedback and suggestions.