Wednesday, May 22, 2019

Clustered and Non-Clustered Indexes

Clustered indexes :-
---------------------
Clustered indexes sort and store the data rows in the table based on their key values. There can only be one clustered index per table, because the data rows themselves can only be sorted in one order.
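For example, a clustered index can be created explicitly with T-SQL. A minimal sketch, where the table and column names (dbo.Orders, OrderID) are illustrative:

-- Once this index is created, the rows of dbo.Orders are
-- physically ordered by OrderID.
CREATE CLUSTERED INDEX IX_Orders_OrderID
ON dbo.Orders (OrderID);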

In SQL Server, indexes are organized as B-trees. Each page in an index B-tree is called an index node. The top node of the B-tree is called the root node. The bottom level of nodes in the index is called the leaf nodes. Any index levels between the root and the leaf nodes are collectively known as intermediate levels. In a clustered index, the leaf nodes contain the data pages of the underlying table. The root and intermediate level nodes contain index pages holding index rows. Each index row contains a key value and a pointer to either an intermediate level page in the B-tree, or a data row in the leaf level of the index. The pages in each level of the index are linked in a doubly-linked list.

Clustered indexes have one row in sys.partitions, with index_id = 1 for each partition used by the index. By default, a clustered index has a single partition. When a clustered index has multiple partitions, each partition has a B-tree structure that contains the data for that specific partition. For example, if a clustered index has four partitions, there are four B-tree structures; one in each partition.
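We can verify this in the catalog; a quick sketch, again assuming an illustrative table named dbo.Orders:

-- A clustered index appears with index_id = 1, one row per partition.
SELECT index_id, partition_number, rows
FROM sys.partitions
WHERE object_id = OBJECT_ID('dbo.Orders')
  AND index_id = 1;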

Depending on the data types in the clustered index, each clustered index structure will have one or more allocation units in which to store and manage the data for a specific partition. At a minimum, each clustered index will have one IN_ROW_DATA allocation unit per partition. The clustered index will also have one LOB_DATA allocation unit per partition if it contains large object (LOB) columns. It will also have one ROW_OVERFLOW_DATA allocation unit per partition if it contains variable length columns that exceed the 8,060 byte row size limit. For more information about allocation units, see Table and Index Organization.
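The allocation units themselves can be inspected through the catalog views; a sketch assuming the same illustrative dbo.Orders table:

-- Lists IN_ROW_DATA / LOB_DATA / ROW_OVERFLOW_DATA units per partition.
-- Note: for LOB_DATA units container_id maps to partition_id, for the
-- other types to hobt_id; the two values are usually equal.
SELECT p.index_id, p.partition_number, au.type_desc, au.total_pages
FROM sys.partitions p
JOIN sys.allocation_units au
  ON au.container_id = p.partition_id
WHERE p.object_id = OBJECT_ID('dbo.Orders');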

The pages in the data chain and the rows in them are ordered on the value of the clustered index key. All inserts are made at the point where the key value in the inserted row fits in the ordering sequence among existing rows. The page collections for the B-tree are anchored by page pointers in the sys.system_internals_allocation_units system view.

For a clustered index, the root_page column in sys.system_internals_allocation_units points to the top of the clustered index for a specific partition. SQL Server moves down the index to find the row corresponding to a clustered index key. To find a range of keys, SQL Server moves through the index to find the starting key value in the range and then scans through the data pages using the previous or next pointers. To find the first page in the chain of data pages, SQL Server follows the leftmost pointers from the root node of the index.
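A range query over the clustered index key illustrates this access pattern (illustrative names again):

-- SQL Server seeks down the B-tree to OrderID = 1000, then scans the
-- doubly-linked leaf pages until OrderID exceeds 2000.
SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE OrderID BETWEEN 1000 AND 2000;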

(Figure: Levels of a clustered index)


Non-clustered Index Structures :-
------------------------------------
Non-clustered indexes have the same B-tree structure as clustered indexes, except for the following significant differences:

The data rows of the underlying table are not sorted and stored in order based on their non-clustered keys.

The leaf layer of a nonclustered index is made up of index pages instead of data pages.

Nonclustered indexes can be defined on a table or view with a clustered index or a heap. Each index row in the nonclustered index contains the nonclustered key value and a row locator. This locator points to the data row in the clustered index or heap having the key value.
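For example, a nonclustered index on a non-key column (illustrative names):

-- The leaf pages of this index hold CustomerID key values plus a row
-- locator (RID or clustered key), not the data rows themselves.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
ON dbo.Orders (CustomerID);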

The row locators in nonclustered index rows are either a pointer to a row or are a clustered index key for a row, as described in the following:

If the table is a heap, which means it does not have a clustered index, the row locator is a pointer to the row. The pointer is built from the file identifier (ID), page number, and number of the row on the page. The whole pointer is known as a Row ID (RID).

If the table has a clustered index, or the index is on an indexed view, the row locator is the clustered index key for the row. If the clustered index is not a unique index, SQL Server makes any duplicate keys unique by adding an internally generated value called a uniqueifier. This four-byte value is not visible to users. It is only added when required to make the clustered key unique for use in nonclustered indexes. SQL Server retrieves the data row by searching the clustered index using the clustered index key stored in the leaf row of the nonclustered index.

Nonclustered indexes have one row in sys.partitions with index_id > 1 for each partition used by the index. By default, a nonclustered index has a single partition. When a nonclustered index has multiple partitions, each partition has a B-tree structure that contains the index rows for that specific partition. For example, if a nonclustered index has four partitions, there are four B-tree structures, one in each partition.


Depending on the data types in the nonclustered index, each nonclustered index structure will have one or more allocation units in which to store and manage the data for a specific partition. At a minimum, each nonclustered index will have one IN_ROW_DATA allocation unit per partition that stores the index B-tree pages. The nonclustered index will also have one LOB_DATA allocation unit per partition if it contains large object (LOB) columns. Additionally, it will have one ROW_OVERFLOW_DATA allocation unit per partition if it contains variable length columns that exceed the 8,060 byte row size limit. For more information about allocation units, see Table and Index Organization. The page collections for the B-tree are anchored by root_page pointers in the sys.system_internals_allocation_units system view.

(Figure: Levels of a nonclustered index)

Roles and Responsibilities In My Project

Banking Domain Client :-
---------------------------
My client is ABSA, a South African bank that was part of the Barclays group. We work on customer data and employee data.

Employee Data :
If any employee resigns, joins, gets a salary increment, or transfers to another department, the employee data changes.
So we update the employee dimension target tables with SCD Type II techniques, as sketched below.
We load the fact tables after loading the dimension tables.
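A minimal SCD Type II sketch in ANSI-style SQL, assuming illustrative table and column names (Employee_Dim, emp_id, eff_start_date, eff_end_date, current_flag):

-- Step 1: expire the current version of the changed employee row.
UPDATE Employee_Dim
SET eff_end_date = CURRENT_DATE,
    current_flag = 'N'
WHERE emp_id = 1001
  AND current_flag = 'Y';

-- Step 2: insert the new version as the current row.
INSERT INTO Employee_Dim
    (emp_key, emp_id, dept, salary, eff_start_date, eff_end_date, current_flag)
VALUES
    (50001, 1001, 'FINANCE', 65000, CURRENT_DATE, NULL, 'Y');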

Facts & Dimensions involved :-
Date_Dim, State_Dim, Country_Dim, AccountType_Dim, Company_Dim, Currency_Dim, Bank_Dim
Balance_Fact, AccountID_Fact, CompanyID_Fact, BankID_Fact, CountryID_Fact


Roles and responsibilities in my project :-
---------------------------------------
1) The data we receive comes as flat files or from SQL Server and Oracle databases.

- We build change data capture mappings with a mapping variable whenever the source transactions carry a timestamp.
- Sometimes we follow incremental aggregation logic when we have incremental data.
- We follow the indirect file method when we have multiple flat files.
- We follow the 'configure concurrent workflow' approach when new flat files arrive now and then.
- We also use session partitioning to parallelize large loads.


2) I work on mapping code errors and respond once the problem is resolved.

- For connection problems, such as a source database change, we correct the connection in the parameter files.
- If a workflow runs for a long time, we apply performance tuning techniques.
- If code suddenly stops working, we check the complete mapping code and correct it.
- We use the debugger to debug mapping code.


3) We update dimensions whenever there is a requirement, mostly at the end of every month.

- Using a surrogate key (cust_key) in the target
- Using the MD5 function, as sketched below
- Using a dynamic lookup
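The MD5 idea, sketched in SQL Server syntax (HASHBYTES stands in for Informatica's MD5() expression function; table and column names are illustrative): a row needs an SCD update when the hash of its attributes differs from the current dimension row.

-- Detect changed customers by comparing attribute hashes.
SELECT s.cust_id
FROM stg_customer s
JOIN customer_dim d
  ON d.cust_id = s.cust_id
 AND d.current_flag = 'Y'
WHERE HASHBYTES('MD5', CONCAT(s.name, '|', s.city, '|', s.segment))
   <> HASHBYTES('MD5', CONCAT(d.name, '|', d.city, '|', d.segment));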


4) I apply error handling techniques to ensure no data is missed.


5) I perform performance tuning if a session takes a long time.

6) We execute workflows on UNIX with a wrapper script:
  (pmcmd startworkflow -sv <Integration Service Name> -d <Domain Name> -u <Integration Service Username> -p <Password> -f <Folder Name> <Workflow>)

7) We schedule and monitor jobs in Tivoli Workload Scheduler.

Debugger to Troubleshoot ETL Logic

The debugger is an interface integrated with the Mapping Designer. Informatica executes a mapping based on its transformation logic, but we cannot normally see the intermediate results or how the data is modified from one transformation to the next. Many times we get unexpected data in the target; with the debugger, we can analyse the data row by row.

Before starting the debugger we need to select the session instance: whether we want to use a new session or an existing session.




Whenever we invoke the debugger, a few windows are displayed to show debugging information, such as the Instance window, Target window, Mapping window and Debugger log. We can monitor the debugger through these windows.




Target Window :- This shows how the data is processed into the target. We can see whether each record is going to be inserted, updated, deleted or rejected. If there are multiple target instances, select the instance name from the drop-down list.

Instance Window :- This shows how the data is transformed at a transformation instance; the window refreshes as the debugger progresses from one transformation to the next. We can choose a transformation from the drop-down menu and see what the data looks like at that particular transformation instance for a particular source row.


Breakpoints in the Debugger :-
----------------------------
Whenever we look for particular data, instead of checking every instance of the source data, we can define breakpoints. A breakpoint is a condition defined while configuring the debugger; when the condition becomes true, the debugger halts at that particular instance.



Use the 'Continue' option in the debugger to run until the breakpoint condition is true. With 'Continue', we don't need to press the F10 key manually to step through each row.







Error Handling Techniques

Errors are of two types: fatal errors and non-fatal errors.

A fatal error forces the session to fail.

This happens when the session is unable to access the source, target or repository.

A non-fatal error does not force the session to fail.
It happens while reading data from the source and writing it to the target. Rejected records come under this category.

To capture transformation-level errors, we need to set the 'Error Log Type' property on the Config Object tab of the session properties.


And we choose the DB connection along with the connection information.



As a result, four error tables are created in the database, as below :-


PMERR_DATA : Stores data and metadata about a transformation row error and its corresponding source rows.
PMERR_MSG : Stores metadata about an error and the error message.
PMERR_SESS : Stores metadata about the session.
PMERR_TRANS : Stores metadata about the source and transformation ports, such as name and datatype, when a transformation error occurs. With this we can verify the error due to which the mapping is failing.
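Once these tables are populated, the errors can be queried directly; a sketch, assuming typical PMERR_MSG column names (TRANS_NAME, ERROR_MSG, ERROR_TIMESTAMP), which may differ by PowerCenter version:

-- Most recent transformation errors with the failing transformation name.
SELECT TRANS_NAME, ERROR_MSG, ERROR_TIMESTAMP
FROM PMERR_MSG
ORDER BY ERROR_TIMESTAMP DESC;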


Understanding Error Log File :-
---------------------------------
Error Type : 
1 - Reader Error
2 - Writer Error
3 - Transformation Error

Source Row Type :
0 - Insert
1 - Update
2 - Delete
3 - Reject

Source Data :
D - Valid
O - Overflow
N - Null
T - Truncated
B - Binary
U - Data unavailable

Transformation Data :
D - Valid
O - Overflow
N - Null
T - Truncated
B - Binary
U - Data unavailable


Apart from these, there are user-defined exceptions, which we will cover soon.

Stored Procedure Transformation

The Stored Procedure transformation is a passive transformation.
It can be used as both a connected and an unconnected transformation.
Procedures are stored in the database, and we call a stored procedure through the Stored Procedure transformation in Informatica.
We use the Stored Procedure transformation to perform calculations such as a salary hike or the minimum and maximum salary.

Examples :-
------------
If an application updates the customer table in ten different places, there can be a single stored procedure and a standard procedure call from the application for this functionality.
If a stored procedure has four SQL statements in the code, then there only needs to be a single call to the database instead of four calls for each individual SQL statement.
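For instance, a minimal procedure for the min/max salary calculation mentioned above (Oracle PL/SQL syntax assumed; the procedure and parameter names are illustrative). The IN parameter maps to the transformation's input port and the OUT parameters map to its output ports:

CREATE OR REPLACE PROCEDURE GET_SAL_RANGE (
    P_DEPTNO  IN  NUMBER,   -- input port
    P_MIN_SAL OUT NUMBER,   -- output port
    P_MAX_SAL OUT NUMBER    -- output port
) AS
BEGIN
    SELECT MIN(SAL), MAX(SAL)
    INTO P_MIN_SAL, P_MAX_SAL
    FROM EMP
    WHERE DEPTNO = P_DEPTNO;
END;
/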


Three types of ports :-
----------------------
Input : Used to pass input parameter values to the procedure.
Output : Displays the output values as per the procedure defined in the database.
Return : Used to capture the value the procedure returns.


Stored Procedure Type :-
--------------------------
There are 5 different options available in the Stored Procedure Transformation properties.

- Normal : Runs on a row-by-row basis during the session.
- Source Pre Load : Works like the 'Pre SQL' property of the Source Qualifier; it acts on the database before the data is read from the source.
- Source Post Load : Works like the 'Post SQL' property of the Source Qualifier; it acts on the database after the data is read from the source.
- Target Pre Load : Works like a 'Pre SQL' property for the target; it acts on the database before the data is loaded into the target.
- Target Post Load : Works like a 'Post SQL' property for the target; it acts on the database after the data is loaded into the target.


To create a stored procedure transformation :-

- Go to the Transformation menu and click the Import Stored Procedure option.
- Connect to the database where the stored procedure is stored and select it.
- Now connect the appropriate ports to the input and output ports.




- Select the connection information 



- Now execute the workflow; we get the expected results.

Friday, May 10, 2019

Loops in Stored Procedures

Whenever we execute a procedure, each query or statement executes once.
A loop is a programming construct used to execute a group of statements repeatedly a specific number of times.

3 Types of Loops :-
- Simple Loop
- For Loop
- While Loop
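Minimal sketches of each loop type in Oracle PL/SQL (illustrative only; other databases use different loop syntax):

DECLARE
    i NUMBER := 1;
BEGIN
    -- Simple loop: must exit explicitly.
    LOOP
        EXIT WHEN i > 3;
        i := i + 1;
    END LOOP;

    -- FOR loop: the counter is managed automatically.
    FOR j IN 1..3 LOOP
        DBMS_OUTPUT.PUT_LINE('Iteration ' || j);
    END LOOP;

    -- WHILE loop: the condition is checked before each iteration.
    i := 1;
    WHILE i <= 3 LOOP
        i := i + 1;
    END LOOP;
END;
/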

Thursday, May 9, 2019

Aggregate functions

Aggregate functions :-
The functions SUM, COUNT, AVG, MIN and MAX are the common aggregate functions, and their results do not depend on the order of the records.

An aggregate function returns a single output calculated from the values of a table or column. With a GROUP BY clause, it returns one value per group rather than one per record.

SELECT DEPTNO, COUNT(*) DEPT_COUNT FROM EMP 
WHERE DEPTNO IN (20, 30)
GROUP BY DEPTNO;

DEPTNO                 DEPT_COUNT             
---------------------- ---------------------- 
20                     5                      
30                     6 


SUM() function :-
It returns the sum of a column. The column should be numeric. 

SELECT SUM(SAL) FROM EMP;
SELECT EMPNO, SUM(SAL) FROM EMP GROUP BY EMPNO;


COUNT() function :-
It counts the number of rows in a query. COUNT(*) counts all rows, but COUNT(column) will not count rows where that column is NULL.

SELECT COUNT(*) FROM EMP;
OR 
SELECT COUNT(EMPNO) FROM EMP;
SELECT EMPNO, COUNT(EMPNO) FROM EMP GROUP BY EMPNO;
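The NULL behaviour is easy to see with the COMM column, which is NULL for most employees in the classic EMP table:

-- COUNT(COMM) skips NULLs, so it is smaller than COUNT(*).
SELECT COUNT(*) AS ALL_ROWS, COUNT(COMM) AS ROWS_WITH_COMM FROM EMP;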


AVG() function :-
It returns the average value of a column. The column should be numeric.

SELECT AVG(SAL) FROM EMP;
SELECT EMPNO, AVG(SAL) FROM EMP GROUP BY EMPNO;


MIN() function :-
It returns the minimum or smallest value of a column. It works on numeric, character and date columns.

SELECT MIN(SAL) FROM EMP;
SELECT EMPNO, MIN(SAL) FROM EMP GROUP BY EMPNO;


MAX() function :-
It returns the maximum or biggest value of a column. It works on numeric, character and date columns.

SELECT MAX(SAL) FROM EMP;
SELECT EMPNO, MAX(SAL) FROM EMP GROUP BY EMPNO;


Date Functions in SQL

SELECT GETDATE() AS CurrentDate;
SELECT GETDATE() - 1 AS PreviousDate;
SELECT GETDATE() + 1 AS NextDate;
SELECT DATEADD(dd, 1, GETDATE()) AS NextDate;
...