Wednesday, May 22, 2019

Workflow variables

Workflow variables :-
----------------------
Workflow variables provide the ability to store data so that it can be used in the conditions and actions within a workflow. We use workflow variables when we want to store values or numbers, for example to track the number of times a session has been triggered.

A series of sessions that depend on each other and run in a defined order is called a sequential batch.


A series of sessions that do not depend on each other and are connected individually to the Start task is called a parallel batch.


To create a workflow variable, we edit the workflow, select the Variables tab and add the variables there.




To use a workflow variable, we create an Assignment task, which is used to assign values to the variable before the downstream sessions run.




After assigning the variable, we add a link condition so that the next session triggers only when the condition is satisfied.
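For example (a minimal sketch; the variable and session names are only illustrations), a persistent integer workflow variable $$Run_Count can be incremented in the Assignment task and then checked in the link condition:

Assignment task expression :  $$Run_Count = $$Run_Count + 1
Link condition             :  $s_m_load_emp.Status = SUCCEEDED AND $$Run_Count <= 3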




The final workflow looks like this.












Performance Tuning

When the ETL tool Informatica takes a long time to execute a workflow, we need to perform performance tuning so that it executes faster. There are several steps involved in performance tuning.

When a session is triggered, the Integration Service first starts the Data Transformation Manager (DTM), which is responsible for starting the reader, writer and transformation threads.

The 'reader thread' is responsible for reading data from the sources.
If there is a problem while reading data, it is called a 'source bottleneck'.

The 'writer thread' is responsible for writing data to the targets.
If there is a problem while writing data, it is called a 'target bottleneck'.

The 'transformation thread' is responsible for processing data according to the mapping logic.
If there is a problem while processing data, it is called a 'mapping bottleneck'.

Whenever we run a session, all the thread statistics and details are captured in the session log. To do performance tuning, we check the busy percentage of each thread in the session log.




If the busy percentage of the reader, writer or transformation thread is close to 100, we identify that thread as the performance bottleneck.

Let's look at how to do performance tuning for each of these bottlenecks.

Source Bottleneck :-
---------------------
An incorrect source query, very heavy queries, connection problems or a small database network packet size can cause a source bottleneck.

Performance tuning :-
- Rectify the source query
- Rectify the connections

Target Bottleneck :-
---------------------------
If the problem is in the target, it can be due to a heavy load on the target or connection problems.

Performance tuning :-
- Use the Bulk Load option to insert data.
- Increase the commit interval, for example from the default 10,000 to 50,000.
- Drop indexes before loading and recreate them after loading, using the Pre SQL and Post SQL target properties (next to the truncate target table option); a sketch is given below.
- Increase the database network packet size from the default 4096.
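A minimal sketch of such Pre SQL and Post SQL statements (the table and index names are only examples):

Pre SQL  :  DROP INDEX idx_emp_tgt_deptno
Post SQL :  CREATE INDEX idx_emp_tgt_deptno ON EMP_TGT (DEPTNO)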


Mapping Bottleneck :-
-----------------------
If the problem is in the transformation thread, it is called a mapping bottleneck, and we tune transformations such as the Aggregator, Joiner and Lookup.

Aggregator :
- Filter the data before aggregating it, as this reduces unnecessary aggregation work.
- Limit the number of ports used in the Aggregator; this reduces the volume of data stored in the aggregator cache.
- Use sorted input.
- Group by simple columns; prefer numbers over strings and dates in the Group By ports.

Joiner : 
- Always select the table with fewer rows as the master table.
- Sort the data before joining the tables.

Lookup : 
- Create an index on the columns used in the lookup condition (see the sketch below).
- Remove unnecessary columns from the lookup.
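For example, if the lookup condition is on the DEPTNO column of a DEPT lookup table (names are only illustrations), an index such as the following speeds up the lookup query:

CREATE INDEX idx_dept_deptno ON DEPT (DEPTNO)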

Filter :-
- Use the Filter transformation as early as possible in the mapping so that unwanted data is removed early.

Source Qualifier :
- Bring only the required columns from the source.
- Avoid using an ORDER BY clause in the Source Qualifier SQL override; a sketch is given below.
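A sketch of a lean SQL override that brings only the required columns and has no ORDER BY (the table and column names are only examples):

SELECT EMPNO, ENAME, SAL, DEPTNO FROM EMP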


Session Bottleneck :-
------------------------------
If there is no bottleneck in the source, target or mapping, we then check for a session-level bottleneck.


Mapping Parameters & Mapping variables

Mapping Parameters :-
------------------------
Parameters are used to minimize hard coding in the ETL, since they avoid changing values inside the mapping itself. Mapping parameters are values that remain constant throughout the entire session. We supply the mapping parameter values in a parameter file.

After we create a parameter or variable, it appears in the Expression Editor and can be used in any expression in the mapping. Before we run the session, we define the parameter value in a parameter file for the session; any constant value can be used. If the parameter is not defined in the parameter file, the Integration Service uses the user-defined initial value as the parameter value. If the initial value is not defined either, the Integration Service uses a default value based on the datatype of the mapping parameter.


A parameter file contains two sections :-
------------------------------------
Header section / Global variables :
It identifies the Integration Service, folder, workflow, worklet or session to which the parameters and variables apply.

Definition section / Local variables :
It contains the list of parameters and variables with their assigned values.

So let's create a simple mapping that loads source data into a target table, and pass the value through a parameter file.





Now we give the filter condition in the Source Qualifier as EMP1.DEPTNO = $$DEPTNO.
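Assuming the columns being read are EMPNO, ENAME, SAL and DEPTNO (just an illustration), the query generated by the Source Qualifier looks roughly like this, with $$DEPTNO replaced by the value from the parameter file at run time:

SELECT EMP1.EMPNO, EMP1.ENAME, EMP1.SAL, EMP1.DEPTNO
FROM   EMP1
WHERE  EMP1.DEPTNO = $$DEPTNO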




Now create a parameter file for the mapping 
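A minimal sketch of the parameter file contents (the folder, workflow and session names are hypothetical):

[MyFolder.WF:wf_load_emp.ST:s_m_load_emp]
$$DEPTNO=10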




Now we have to give the parameter file location.

Go to the Mapping tab; on the left we find the option 'Files, Directories and Commands', so click on that. A new screen appears; under the 'Parameter Filename' option, give the path of this parameter file, including the file name.




So let us execute the workflow and see the output.




Mapping variables :-
---------------------
Mapping variables are mainly used for incremental loads when reading the source. If the source table contains timestamped transactions and we want to evaluate only the new transactions on a daily basis, we use mapping variables.



Mapping variables are values that can change between sessions.

The Integration Service saves the latest value of a mapping variable to the repository at the end of each successful session.

The Integration Service holds two different values for a mapping variable during a session run:

- Start value of a mapping variable
- Current value of a mapping variable

Start Value :-
------------
The start value is the value of the variable at the start of the session. The start value could be a value defined in the parameter file for the variable, a value 
assigned in the pre-session variable assignment, a value saved in the repository from the previous run of the session, a user defined initial value for the
variable, or the default value based on the variable datatype. The Integration Service looks for the start value in the following order:

- Value in parameter file

- Value in pre-session variable assignment
- Value saved in the repository
- Initial value
- Datatype default value

For example, you create a mapping variable in a mapping or mapplet and enter an initial value, but you do not define a value for the variable in a parameter file. The first time the Integration Service runs the session, it evaluates the start value of the variable to the configured initial value. The next time the session runs, the Integration Service evaluates the start value of the variable to the value saved in the repository. If you want to override the value saved in the repository before running a session, you need to define a value for the variable in a parameter file. When you define a mapping variable in the parameter file, the Integration Service uses this value instead of the value saved in the repository or the configured initial value for the variable.


When you use a mapping variable ('$$MAPVAR') in an expression, the expression always returns the start value of the mapping variable. If the start value of MAPVAR is 0, then $$MAPVAR returns 0. To return the current value of the mapping variable, use the following expression: SETVARIABLE($$MAPVAR, NULL).


Current Value :-
--------------
The current value is the value of the variable as the session progresses. When a session starts, the current value of a variable is the same as the start value. As the session progresses, the Integration Service calculates the current value using a variable function that you set for the variable. Unlike the start value of a mapping variable, the current value can change as the Integration Service evaluates the current value of a variable as each row passes through the mapping. The final current value for a variable is saved to the repository at the end of a successful session. When a session fails to complete, the Integration Service does not update the value of the variable in the repository. The Integration Service states the value saved to the repository for each mapping variable in the session log.

If a variable function is not used to calculate the current value of a mapping variable, the start value of the variable is saved to the repository.


The transformation language provides the following variable functions to use in a mapping:

- SetMaxVariable
- SetMinVariable
- SetCountVariable
- SetVariable.
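For example, for a timestamp-based incremental load (the variable and port names are hypothetical), an Expression transformation can track the latest transaction timestamp with SetMaxVariable, and the Source Qualifier filter then reads only the newer rows in the next run:

Expression port         :  SETMAXVARIABLE($$LAST_LOAD_TS, TRANSACTION_TS)
Source filter condition :  TRANSACTION_TS > $$LAST_LOAD_TS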

The order in which the Integration Service encounters variable functions in the mapping may not be the same for every session run. This may cause inconsistent results when you use the same variable function multiple times in a mapping.


The Integration Service does not save the final current value of a mapping variable to the repository when any of the following conditions are true:


- The session fails to complete.
- The session is configured for a test load.
- The session is a debug session.
- The session runs in debug mode and is configured to discard session output.

To create a mapping variable, go to the Mappings menu, select the 'Parameters and Variables' option and add the mapping variable there.





Now we give the filter condition in the Source Qualifier as EMP1.DEPTNO > $$DATE.




We set the mapping variable in the Expression Editor of an Expression transformation, as shown in this image.




After executing the workflow, the value is stored in $$DATE.

Each time we execute the workflow, a new value replaces the previous value of $$DATE. As a result we get a persistent value, as in the image.





So with the help of mapping variables we can perform incremental loads.

We can reset the persistent value if we want to load the data from the beginning.
















Clustered and Non-Clustered Indexes

Clustered indexes :-
---------------------
Clustered indexes sort and store the data rows in the table based on their key values. There can only be one clustered index per table, because the data rows themselves can only be sorted in one order.
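A simple SQL Server example (the table and index names are only illustrations):

CREATE CLUSTERED INDEX IX_Employee_EmpID ON dbo.Employee (EmpID)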

In SQL Server, indexes are organized as B-trees. Each page in an index B-tree is called an index node. The top node of the B-tree is called the root node. The bottom level of nodes in the index is called the leaf nodes. Any index levels between the root and the leaf nodes are collectively known as intermediate levels. In a clustered index, the leaf nodes contain the data pages of the underlying table. The root and intermediate level nodes contain index pages holding index rows. Each index row contains a key value and a pointer to either an intermediate level page in the B-tree, or a data row in the leaf level of the index. The pages in each level of the index are linked in a doubly-linked list.

Clustered indexes have one row in sys.partitions, with index_id = 1 for each partition used by the index. By default, a clustered index has a single partition. When a clustered index has multiple partitions, each partition has a B-tree structure that contains the data for that specific partition. For example, if a clustered index has four partitions, there are four B-tree structures; one in each partition.

Depending on the data types in the clustered index, each clustered index structure will have one or more allocation units in which to store and manage the data for a specific partition. At a minimum, each clustered index will have one IN_ROW_DATA allocation unit per partition. The clustered index will also have one LOB_DATA allocation unit per partition if it contains large object (LOB) columns. It will also have one ROW_OVERFLOW_DATA allocation unit per partition if it contains variable length columns that exceed the 8,060 byte row size limit. For more information about allocation units, see Table and Index Organization.

The pages in the data chain and the rows in them are ordered on the value of the clustered index key. All inserts are made at the point where the key value in the inserted row fits in the ordering sequence among existing rows. The page collections for the B-tree are anchored by page pointers in the sys.system_internals_allocation_units system view.

For a clustered index, the root_page column in sys.system_internals_allocation_units points to the top of the clustered index for a specific partition. SQL Server moves down the index to find the row corresponding to a clustered index key. To find a range of keys, SQL Server moves through the index to find the starting key value in the range and then scans through the data pages using the previous or next pointers. To find the first page in the chain of data pages, SQL Server follows the leftmost pointers from the root node of the index.

Levels of a clustered index


Non-clustered Index Structures :-
------------------------------------
Non-clustered indexes have the same B-tree structure as clustered indexes, except for the following significant differences:

The data rows of the underlying table are not sorted and stored in order based on their non-clustered keys.

The leaf layer of a nonclustered index is made up of index pages instead of data pages.

Nonclustered indexes can be defined on a table or view with a clustered index or a heap. Each index row in the nonclustered index contains the nonclustered key value and a row locator. This locator points to the data row in the clustered index or heap having the key value.
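For example, a nonclustered index on the same hypothetical table:

CREATE NONCLUSTERED INDEX IX_Employee_DeptNo ON dbo.Employee (DeptNo)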

The row locators in nonclustered index rows are either a pointer to a row or are a clustered index key for a row, as described in the following:

If the table is a heap, which means it does not have a clustered index, the row locator is a pointer to the row. The pointer is built from the file identifier (ID), page number, and number of the row on the page. The whole pointer is known as a Row ID (RID).

If the table has a clustered index, or the index is on an indexed view, the row locator is the clustered index key for the row. If the clustered index is not a unique index, SQL Server makes any duplicate keys unique by adding an internally generated value called a uniqueifier. This four-byte value is not visible to users. It is only added when required to make the clustered key unique for use in nonclustered indexes. SQL Server retrieves the data row by searching the clustered index using the clustered index key stored in the leaf row of the nonclustered index.

Nonclustered indexes have one row in sys.partitions with index_id > 1 for each partition used by the index. By default, a nonclustered index has a single partition. When a nonclustered index has multiple partitions, each partition has a B-tree structure that contains the index rows for that specific partition. For example, if a nonclustered index has four partitions, there are four B-tree structures, with one in each partition.


Depending on the data types in the nonclustered index, each nonclustered index structure will have one or more allocation units in which to store and manage the data for a specific partition. At a minimum, each nonclustered index will have one IN_ROW_DATA allocation unit per partition that stores the index B-tree pages. The nonclustered index will also have one LOB_DATA allocation unit per partition if it contains large object (LOB) columns . Additionally, it will have one ROW_OVERFLOW_DATA allocation unit per partition if it contains variable length columns that exceed the 8,060 byte row size limit. For more information about allocation units, see Table and Index Organization. The page collections for the B-tree are anchored by root_page pointers in the sys.system_internals_allocation_units system view.

Levels of a nonclustered index

Roles and Responsibilities In My Project

Banking Domain Client :-
---------------------------
My client is ABSA, a South African bank that has been part of the Barclays group. We work on customer data and employee data.

Employee Data :
Whenever an employee resigns, joins, gets a salary increment or transfers to another department, the employee data changes.
So we update the employee dimension target tables using SCD Type 2 techniques.
We load the fact tables after loading the dimension tables.

Facts & Dimension Involved are :
Date_Dim, State_Dim, Country_Dim, AccountType_Dim, Company_Dim, Currency_Dim, Bank_Dim
Balance_Fact, AccountID_Fact, CompanyID_Fact, BankID_Fact, CountryID_Fact


Roles and responsibilities in your project :-
---------------------------------------
1) The data we get is in the form of flat files, SQL Server or Oracle sources.

- We build change data capture mappings with mapping variables whenever the source has timestamped transactions.
- Sometimes we follow incremental aggregation logic when we receive incremental data.
- We follow the indirect file method when we have multiple flat files.
- We configure concurrent workflows when new flat files arrive now and then.
- We also use session partitioning.


2) I work on mapping code errors and respond to them once the problem is resolved.

- For connection problems, such as a source database change, we correct the connection details in the parameter files.
- If a workflow runs for a long time, we apply performance tuning techniques.
- If the code suddenly stops working, we check the complete mapping code and correct it.
- We use the Debugger to debug mapping code.


3) We update dimensions whenever there is a requirement, mostly at the end of every month.

- Using a surrogate key (cust_key) in the target
- Using the MD5 function (a sketch is given below)
- Using a dynamic lookup
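A rough sketch of the MD5 approach (the port names are hypothetical): concatenate the tracked columns into a checksum and compare it with the checksum looked up from the existing dimension row to decide whether the record has changed:

MD5_NEW  = MD5(ENAME || '|' || TO_CHAR(SAL) || '|' || TO_CHAR(DEPTNO))
CHG_FLAG = IIF(MD5_NEW != LKP_MD5_OLD, 'UPDATE', 'NOCHANGE')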


4) I apply error handling techniques to ensure no data is missed.


5) I perform performance tuning if a session is taking a long time.

6) We execute workflows on Unix with a wrapper script :
  (pmcmd startworkflow -sv <Integration Service Name> -d <Domain Name> -u <Integration Service Username> -p <Password> -f <Folder Name> <Workflow>)

7) We schedule and monitor jobs in Tivoli Workload Scheduler.

Debugger to Troubleshoot Etl Logic

The Debugger is an interface integrated with the Mapping Designer. Informatica executes the mapping based on the transformation logic, but normally we cannot see the intermediate results or how the data is modified from one transformation to the next. Often the data that reaches the target is not what we expected; with the help of the Debugger we can analyse the data row by row.

Before starting the Debugger, we need to select the session instance, i.e. whether we want to use a new session or an existing session.




Whenever we invoke the Debugger, a few windows are displayed to show debugging information: the Instance window, the Target window, the Mapping window and the Debugger log. We can monitor the Debugger through these windows.




Target window :- This shows how the data is written to the target. We can see whether a record is going to be inserted, updated, deleted or rejected. If there are multiple target instances, select the instance name from the drop-down list.

Instance window :- This shows how the data is transformed at a transformation instance; the window refreshes as the Debugger progresses from one transformation to another. We can choose a transformation from the drop-down menu and see what the data looks like at that transformation instance for a particular source row.


Break point in Debugger :-
----------------------------
Whenever we are looking for particular data, instead of stepping through every source row, we can define breakpoints. A breakpoint is a condition defined while configuring the Debugger; when the condition becomes true, the Debugger pauses at that particular instance.
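For example (the transformation and port names are hypothetical), a data breakpoint can be defined on an Expression transformation with a condition such as:

Breakpoint on EXP_Validate :  DEPTNO = 10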



Use the 'Continue' option in the Debugger to run until the breakpoint condition becomes true. With 'Continue', we do not need to press the F10 key manually to step to the next row.







Error Handling Techniques

Errors in Informatica are of two types: fatal errors and non-fatal errors.

A fatal error forces the session to fail.
This happens when the Integration Service is unable to access the source, target or repository.

A non-fatal error does not force the session to fail.
It occurs while reading or writing data between the source and the target; rejected records come under this category.

To capture transformation-level errors, we need to set the 'Error Log Type' property in the Config Object tab of the session properties.


We choose a relational database error log type and provide the database connection information.



As a result, four error tables are created in the database, as below :-


PMERR_DATA : Stores data and metadata about a transformation row error and its corresponding source rows.
PMERR_MSG : Stores metadata about an error and the error message.
PMERR_SESS : Stores metadata about the session.
PMERR_TRANS : Stores metadata about the source and transformation ports, such as name and datatype, when a transformation error occurs. Using these tables we can identify the error that is causing the mapping to fail.
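Once a session has run with the error log enabled, a quick way to review the captured errors is to query these tables directly, for example:

SELECT * FROM PMERR_MSG;
SELECT * FROM PMERR_SESS;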


Understanding Error Log File :-
---------------------------------
Error Type : 
1 - Reader Error
2 - Writer Error
3 - Transformation Error

Source Row Type :
0 - Insert
1 - Update
2 - Delete
3 - Reject

Source Data :
D - Valid
O - Overflow
N - Null
T - Truncated
B - Binary
U - Data unavailable

Transformation Data :
D - Valid
O - Overflow
N - Null
T - Truncated
B - Binary
U - Data unavailable


Apart from these, there are user-defined exceptions, which we will cover soon.




















SUBQUERIES PRACTICE QUESTIONS

1. Write a SQL query to find those employees who receive a higher salary than the employee with ID 7369. SELECT * FROM EMP WHERE SAL >  (...