Sunday, May 26, 2019

Partitioning in Informatica

Partitioning is used for parallel processing in order to decrease the time taken to load data into the target. The different partition types are pass-through, round-robin, hash auto-key, hash user key, key range and database partitioning.
Each partition type distributes rows according to its own logic.

- The partitioning option is found in the session: click on the Mapping tab and you will find a 'Partitions' option at the bottom left, beside the Transformations option. Click on it.





- This is how the mapping looks after we develop the mapping code




Pass-through : This is the default partition type. The Integration Service processes data without redistributing rows
among partitions; all rows in a single partition stay in that partition after crossing a pass-through partition point. Choose pass-through partitioning when we want to create an additional pipeline stage to improve performance but do not want to change the distribution of data across partitions.




Round-robin  : The Integration Service distributes data evenly among all partitions. Use round-robin partitioning when we want each partition to process approximately the same number of rows, i.e. for load balancing.




Hash auto-key : The Integration Service generates a hash key itself and uses a hash function to group rows of data among partitions. It groups the data based on a partition key and uses all grouped or sorted ports as a compound partition key. We may need to use hash auto-keys partitioning at Rank, Sorter and unsorted Aggregator transformations.

Hash user key :  The Integration Service uses a hash function to group rows of data among partitions, but here we define the ports used to generate the partition key.




Key range    : We define the key range, i.e. which partition should process which rows; the split is completely according to our logic. The Integration Service
distributes rows of data based on a port or set of ports that we define as the partition key. For each port, we define a range of values, and the Integration Service uses the key and ranges to send rows to the appropriate partition. Use key range partitioning when the sources or targets in the pipeline are partitioned by key range. A small illustration is sketched below.
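For illustration only, a hedged sketch of key ranges on an assumed DEPTNO partition key (the column and boundaries are placeholders):

    Partition #1 : DEPTNO 1  - 10
    Partition #2 : DEPTNO 11 - 20
    Partition #3 : DEPTNO 21 - 30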





Database     : The Integration Service automatically checks how many partitions are available in the database and distributes rows to the partitions accordingly. It queries the database system for table partition information and reads partitioned data from the corresponding nodes in the database.


Points to consider while using Informatica partitions :-
-----------------------------------------------------------
* We cannot specify a partition for the Sequence Generator transformation.

* We should place a Sorter before the Joiner; otherwise the session fails.

* We cannot create a partition key for hash auto-keys, round-robin, or pass-through partitioning.

* If you have a bitmap index defined on the target and you are using pass-through partitioning to, say, update the target table, the session might fail, as the bitmap index creates serious locking problems in this scenario.

* Partitioning considerably increases the total DTM buffer memory requirement for the job. Ensure you have enough free memory in order to avoid memory allocation failures.

* When you do pass-through partitioning, Informatica will try to establish multiple connection requests to the database server. Ensure that the database is configured to accept a high number of connection requests.

* As an alternative to partitioning, you may also use native database options to increase the degree of parallelism of query processing. For example, in an Oracle database you can either specify a PARALLEL hint or alter the degree of parallelism (DOP) of the table in question.

* If required, you can even combine Informatica partitioning with native database-level parallel options - e.g. you create 5 pass-through pipelines, each sending a query to the Oracle database with a PARALLEL hint, as sketched below.
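For illustration, a hedged sketch of the kind of SQL override one such pass-through pipeline might send; the table, columns, filter and parallel degree are placeholder assumptions:

    -- Hypothetical SQL override for one pass-through pipeline
    SELECT /*+ PARALLEL(EMP, 4) */ EMPNO, ENAME, DEPTNO
    FROM EMP
    WHERE DEPTNO BETWEEN 10 AND 20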


Default partition points are created at the Source Qualifier, Aggregator and Target. Default partition types per transformation:
Source Qualifier : Pass-through
Expression       : Pass-through
Router           : Pass-through
Lookup           : Any partition type
Aggregator       : Hash auto-keys
Sorter           : Hash auto-keys
Target           : Round-robin or pass-through
Joiner           : 1:n (it has its own partitioning logic)





Friday, May 24, 2019

Practice Scenarios

1) How to convert a single row from the source into three rows in the target




- In the Normalizer, create column1 and set the Occurs value to 3 so that it creates 3 input ports.



2) How to split the non-key columns into separate tables, with the key column in both






3) How to separate duplicate and non-duplicate records into separate tables





- In the Sorter transformation, sort on the first port in ascending order.

- In the Aggregator transformation, check the Group By option on column1 and create a new output port that takes a count on the same column, e.g. COUNT(ADDRESS1).
- In the Router, create two groups and give the conditions count = 1 and count > 1, as sketched below.
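A minimal sketch of the assumed Aggregator port and Router conditions; the port and group names are placeholders, in Informatica expression syntax:

    -- Aggregator: check Group By on COLUMN1; count output port
    o_count = COUNT(ADDRESS1)

    -- Router group filter conditions
    NON_DUPLICATES : o_count = 1
    DUPLICATES     : o_count > 1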


4) How to retrieve the first and last record from a table/file.






- Give the rank as Top in one Rank transformation and Bottom in the other, and set the number of ranks to 1 in both.

- Create two instances of target.


5) How to remove the footer from your file?





- In the Filter, give the condition INSTR(COL1,'footer') = 0, which filters out the footer record containing the word 'footer'.



6) How to send the first half of the rows to target1 and the second half to target2?





- Generate a row count in the Aggregator, e.g. count_rec = COUNT(empno), without any group-by ports so that it returns the total number of rows.

- Generate a sequence number (o_seqno) in the Expression transformation.
- Use a Joiner to join the Aggregator and Expression pipelines (e.g. on dummy ports).
- In the Router, give the two conditions o_seqno <= count_rec/2 and o_seqno > count_rec/2.


7) How to send alternate records to 2 different targets ?





- Generate sequence and connect to expression

- In the Expression transformation, add two output ports, Even and Odd.
  Give the expression for the Odd port as NEXTVAL % 2 = 1 and for the Even port as NEXTVAL % 2 = 0.
- In the Router transformation, create two groups, Even and Odd. Give the filter condition for Even as Even = 1 and for Odd as Odd = 1.


8) How to send records to the target in cyclic order?





- In the Sequence Generator, set the start value to 1 and the end value to 3, and check the Cycle option so that it repeats the 1-to-3 cycle.

- In the Router, create 3 groups and give the conditions Nextval = 1, Nextval = 2 and Nextval = 3, then connect the group outputs to the targets.



9) How to generate target file names dynamically for flat files?



- Go to the target definition and select the 'F' (add FileName column) option at the top right; a new column called FileName will be generated.




- In the Expression, create a dynamic file name port and give it either a constant name such as 'Test_file' or a derived one such as TO_CHAR(SESSSTARTTIME, 'YYYYMMDDHH24').

- Connect the dynamic file name port to the FileName port in the target, as sketched below.
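For illustration, a hedged sketch of the Expression port that could feed the FileName column; the prefix, format mask and extension are placeholder assumptions:

    -- Output port connected to the target FileName port
    o_FileName = 'EMP_' || TO_CHAR(SESSSTARTTIME, 'YYYYMMDDHH24') || '.dat'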

10) How to process 'CurrentlyProcessedFileName' into the target? 





After importing the source definition, go to the Properties tab and select the option 'Add Currently Processed Flat File Name Port'. As a result, the file name will be loaded into the target.




11) How to remove special characters from a string ?





- In the Expression transformation, create an o_name port and give it the expression REG_REPLACE(NAME, '[^A-Za-z0-9 ]', '').

- This expression removes all special characters, keeping only capital letters, small letters, numbers and spaces.


12) How to load the last 3 rows from a flat file into a target?




- Sort on the Empno port and connect the output to an Expression transformation as well as an Aggregator transformation.
- Create dummy ports in both the Expression and the Aggregator in order to join the two pipelines with a Joiner (and generate a running sequence number, o_seqno, in the Expression).
- Create a count port in the Aggregator transformation.
- Check the Sorted Input option in both the Aggregator and Joiner transformations.
- In the Joiner, give the condition dummy = dummy1.
- In the Router or Filter, give the condition Count - o_seqno <= 2; as a result, the last 3 rows are inserted into the target. A port sketch follows below.
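A minimal sketch of the assumed ports, in Informatica expression syntax; the port names are placeholders:

    -- Expression transformation: running sequence (variable port) and join key
    v_seq   = v_seq + 1
    o_seqno = v_seq
    dummy   = 1

    -- Aggregator (no group-by ports): total row count and join key
    Count  = COUNT(EMPNO)
    dummy1 = 1

    -- Joiner condition : dummy = dummy1
    -- Filter condition : Count - o_seqno <= 2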

OR

A Rank transformation can also be used here to keep it simple.
- Select the Bottom option with the number of ranks set to 3 in the Rank transformation and connect it to the target.



13) How to load the first 3 rows into target1, the last 3 rows into target2 and the remaining rows into target3?



- This logic is almost the same as the previous scenario, but additional conditions and targets are added as per the requirement.
- In the Router, the first condition should be O_SEQ_NO <= 3 and the second condition should be COUNT - O_SEQ_NO <= 2.
- Now connect the groups to targets 1, 2 and 3.
- The default group returns the remaining rows, which are not covered by the conditions.


14) Scenario: if you have a source with 10 records and 3 target tables, target1's output should be records 1-2, target2's output should be records 3-7, and target3's output should be records 8-10.



- Sort on the Empno port and connect the output to an Expression transformation as well as an Aggregator transformation.
- Create dummy ports in both the Expression and the Aggregator in order to join the two pipelines with a Joiner (and generate a running sequence number, o_seqno, in the Expression).
- Create a count port in the Aggregator transformation.
- Check the Sorted Input option in both the Aggregator and Joiner transformations.
- In the Joiner, give the condition dummy = dummy1.
- In the Router, create two different groups and name them First and Last.
- Give the First group the condition o_seqno <= 2 and the Last group the condition Count - o_seqno <= 2, so the first 2 rows go to target1 and the last 3 rows go to target3.
- Finally, connect the default group to target2; the remaining records (rows 3-7) are loaded there.



Scheduling Tools In Informatica

Scheduling tools in Informatica are used to schedule jobs, i.e. to automate workflows so that they run at the time specified by us.

The Informatica scheduler is the basic scheduling tool, which is present as a default feature for scheduling.

- Go to Edit Workflow and select the Scheduler option. Select Non-Reusable or Reusable based on your requirement.

- Click the scheduler option available on the right to set up the automation.


- Fill in all the necessary details for scheduling and save it.


- If we try to execute the workflow manually it fails, because it is scheduled to execute at a particular time.


But there are many drawbacks to this, so third-party schedulers have been introduced to fulfill the requirements. Some of these schedulers are Autosys, Tivoli and Control-M.
Let's learn about the Tivoli scheduling tool.
















Thursday, May 23, 2019

Session parameters & Session variables

Session Parameter :-
----------------------
Session parameters, like mapping parameters, represent values you might want to change between sessions, such as a database connection or source file. 

Use session parameters in the session properties, and then define the parameters in a parameter file. 

You can specify the parameter file for the session to use in the session properties. 

You can also specify it when you use pmcmd to start the session. The Workflow Manager provides one built-in session parameter, $PMSessionLogFile. With $PMSessionLogFile, you can change the name of the session log generated for the session.

The Workflow Manager also allows you to create user-defined session parameters.


Naming Conventions for User-Defined Session Parameters :

Parameter Type          Naming Convention
--------------          -----------------
Database Connection     $DBConnectionName
Source File             $InputFileName
Target File             $OutputFileName
Lookup File             $LookupFileName
Reject File             $BadFileName


Let's create a mapping with session parameters for an Oracle DB and execute it.

- Create any mapping and a workflow.
- Double-click the session, edit the connection 'source_system' and create a parameter connection as in the image.



- Create a parameter file as in the image.
- First we give the folder name, then the session name, followed by the session parameter and its connection value, with no spaces in between. A hedged sketch follows below.
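For illustration, a hedged sketch of such a parameter file; the folder, workflow, session and connection names are placeholders:

    [MyFolder.WF:wf_load_emp.ST:s_m_load_emp]
    $DBConnection_Source=Oracle_Src_Conn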




- Give the parameter file location along with name in the session properties.


  
- Now execute the workflow.


Let's create a mapping with session parameters for flat files and execute it.

- Create any mapping and a workflow.
- Double-click the session and give the source file name as $InputFile.



- Create a parameter file as in the image.
- First we give the folder name, then the session name, followed by the session parameter and the flat file location. A hedged sketch follows below.
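For illustration, a hedged sketch, assuming the session parameter is named $InputFile as above; the folder, workflow and session names and the file path are placeholders:

    [MyFolder.WF:wf_load_emp_file.ST:s_m_load_emp_file]
    $InputFile=/data/source/emp.csv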


Now execute the workflow.


Scenario 2 :-
-------------
Use session parameters to make sessions more flexible. 
For example, you have the same type of transactional data written to two different databases, and you use the database connections TransDB1 and TransDB2 to connect to the databases. 

You want to use the same mapping for both tables. Instead of creating two sessions for the same mapping, you can create a database connection parameter, $DBConnectionSource, and use it as the source database connection for the session. 

When you create a parameter file for the session, you set $DBConnectionSource to TransDB1 and run the session. After the session completes, you set $DBConnectionSource to TransDB2 and run the session again. 

You might use several session parameters together to make session management easier. For example, you might use source file and database connection parameters to configure a session to read data from different source files and write the results to different target databases. You can then use reject file parameters to write the session reject files to the target machine. You can use the session log parameter, $PMSessionLogFile, to write to different session logs in the target machine, as well. 

When you use session parameters, you must define the parameters in the parameter file. Session parameters do not have default values. When the PowerCenter Server cannot find a value for a session parameter, it fails to initialize the session.


Session Variables :-
---------------------


Wednesday, May 22, 2019

Workflow variables

Workflow variables :-
----------------------
Workflow variables provide the ability to store data so that it can be used in conditions and actions within the workflow. We use workflow variables when we want to store values or numbers, for example to track the number of times a session has been triggered.

A series of sessions which are dependent on each other and run in order is called a sequential batch.


A series of sessions which are not dependent on each other and are connected individually to the Start task is called a parallel batch.


To create a workflow variable, we edit the workflow, select the Variables tab and add the variables there.




To use a workflow variable, an Assignment task is created; it is used to assign values to the variables used by the other sessions.




After assigning the variable, we give a link condition so that the next session triggers only after the condition is satisfied, as sketched below.
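For illustration, a hedged sketch of an Assignment task expression and a link condition; the workflow variable and session names are placeholder assumptions:

    -- Assignment task: increment an assumed workflow variable
    $$RUN_COUNT = $$RUN_COUNT + 1

    -- Link condition on the link to the next session
    $s_m_load_stage.Status = SUCCEEDED AND $$RUN_COUNT <= 3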




Final Workflow looks like this












Performance Tuning

When the ETL tool Informatica takes a long time to execute a workflow, we need to perform performance tuning in order to make it execute faster. There are many steps involved in performance tuning.

When a session is triggered, the Integration Service first starts the Data Transformation Manager (DTM), which is responsible for starting the reader, writer and transformation threads.

The 'reader thread' is responsible for reading data from the sources.
If there is any problem while reading data, it is called a 'source bottleneck'.

The 'writer thread' is responsible for writing data to the targets.
If there is any problem while writing data, it is called a 'target bottleneck'.

The 'transformation thread' is responsible for processing data according to the logic.
If there is any problem while processing data, it is called a 'mapping bottleneck'.

Whenever we run a session, it captures all the thread statistics and details in the session log. In order to do performance tuning, we check the busy percentage in the session log.




If the busy percentage is close to 100 in the reader, writer or transformation thread, then we identify that thread as the performance bottleneck.

Let's look at how to do performance tuning for each of these bottlenecks.

Source Bottleneck :-
---------------------
An incorrect source query, huge queries, connection problems or a small database network packet size can cause a source bottleneck.

Performance tuning :-
- Rectify the source query
- Rectify the connections

Target Bottleneck :-
---------------------------
If the problem is in the target, it can be due to a heavy load into the target or connection problems.

Performance tuning :-
- Use the Bulk Load option to insert data.
- Increase the commit interval, for example from 10,000 to 50,000.
- Drop indexes before data loading and recreate them after loading, with the help of the Pre SQL and Post SQL options (next to the truncate target table option), as sketched below.
- Increase the database network packet size from the default 4096.
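For illustration, a hedged sketch of the kind of statements that could go into Pre SQL and Post SQL; the index, table and column names are placeholders:

    -- Pre SQL (runs before the load)
    DROP INDEX idx_tgt_emp_deptno

    -- Post SQL (runs after the load)
    CREATE INDEX idx_tgt_emp_deptno ON tgt_emp (deptno)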


Mapping Bottleneck :-
-----------------------
If the problem is in the transformation thread, it is called a mapping bottleneck, and we typically tune the Aggregator, Joiner and Lookup transformations.

Aggregator :
- Filter the data before applying aggregation, as this reduces unnecessary aggregation operations.
- Limit the number of ports used in the Aggregator; this reduces the volume of data stored in the aggregator cache.
- Use sorted input.
- Group by simple columns; prefer numbers over strings and dates for the Group By ports.

Joiner : 
- Always select the table with the smaller number of rows as the master table.
- Sort the data before joining the tables.

Lookup : 
- Create an index on the column which is used in the lookup condition.
- Delete unnecessary columns from the lookup table.

Filter :-
- Use the Filter transformation as early as possible, as it removes unwanted data early in the mapping.

Source Qualifier :
- Bring only the required columns from the source.
- Avoid using an ORDER BY clause in the Source Qualifier SQL override.


Session Bottleneck :-
------------------------------
If there is no bottleneck in the source, target or mapping, then we check for a session bottleneck.


Mapping Parameters & Mapping variables

Mapping Parameters :-
------------------------
Parameters are used to minimize hard coding in ETL, as they avoid having to change values inside the ETL code. Mapping parameters are values which remain constant throughout the entire session. We give the mapping parameter values in a parameter file.

After we create a parameter or variable, it appears in the Expression Editor and we can use it in any expression in the mapping. Before we run a session, we define the parameter value in a parameter file for the session, using a constant value. If the parameter is not defined in the parameter file, the Integration Service uses the user-defined initial value as the parameter value. If the initial value is not defined either, the Integration Service uses a default value based on the datatype of the mapping parameter.


A parameter file contains two sections:
------------------------------------
Header section / Global variables:
It identifies the Integration Service, folder, workflow, worklet or session to which the parameters and variables apply.

Definition section / Local variables:
It contains the list of parameters and variables with their assigned values.

So let's create a simple mapping that loads source data to a target table and pass the value through the parameter file.





Now we give the condition in the Source Qualifier as EMP1.DEPTNO = $$DEPTNO.




Now create a parameter file for the mapping, as sketched below.
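For illustration, a hedged sketch of the parameter file; the folder, workflow and session names are placeholders, and 20 is just an example DEPTNO value:

    [MyFolder.WF:wf_load_emp.ST:s_m_load_emp]
    $$DEPTNO=20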




Now we have to give the parameter file location.

Go to the Mapping tab; on the left you will find an option called 'Files, Directories and Commands', so click on that. A new screen appears; under the 'Parameter Filename' option, give the path of the parameter file, including the file name.




So let us execute the workflow and see the output.




Mapping variables :-
---------------------
Mapping variables are basically used for incremental loads, i.e. to read the source incrementally. If your source table contains timestamped transactions and you want to evaluate transactions on a daily basis, we use mapping variables.



Mapping variables are values that can change between sessions.

The Integration Service saves the latest value of a mapping variable to the repository at the end of each successful session.

The Integration Service holds two different values for a mapping variable during a session run:

- Start value of a mapping variable
- Current value of a mapping variable

Start Value :-

------------
The start value is the value of the variable at the start of the session. The start value could be a value defined in the parameter file for the variable, a value 
assigned in the pre-session variable assignment, a value saved in the repository from the previous run of the session, a user defined initial value for the
variable, or the default value based on the variable datatype. The Integration Service looks for the start value in the following order:

- Value in parameter file

- Value in pre-session variable assignment
- Value saved in the repository
- Initial value
- Datatype default value

For example, you create a mapping variable in a mapping or mapplet and enter an initial value, but you do not define a value for the variable in a parameter file. The first time the Integration Service runs the session, it evaluates the start value of the variable to the configured initial value. The next time the session runs, the Integration Service evaluates the start value of the variable to the value saved in the repository. If you want to override the value saved in the repository before running a session, you need to define a value for the variable in a parameter file. When you define a mapping variable in the parameter file, the Integration Service uses this value instead of the value saved in the repository or the configured initial value for the variable.


When you use a mapping variable ('$$MAPVAR') in an expression, the expression always returns the start value of the mapping variable. If the start value of MAPVAR is 0, then $$MAPVAR returns 0. To return the current value of the mapping variable, use the following expression: SETVARIABLE($$MAPVAR, NULL).


Current Value :-

--------------
The current value is the value of the variable as the session progresses. When a session starts, the current value of a variable is the same as the start value. As the session progresses, the Integration Service calculates the current value using a variable function that you set for the variable. Unlike the start value of a mapping variable, the current value can change as the Integration Service evaluates the current value of a variable as each row passes through the mapping. The final current value for a variable is saved to the repository at the end of a successful session. When a session fails to complete, the Integration Service does not update the value of the variable in the repository. The Integration Service states the value saved to the repository for each mapping variable in the session log.

If a variable function is not used to calculate the current value of a mapping variable, the start value of the variable is saved to the repository.


The transformation language provides the following variable functions to use in a mapping:

- SetMaxVariable
- SetMinVariable
- SetCountVariable
- SetVariable.

The order in which the Integration Service encounters variable functions in the mapping may not be the same for every session run. This may cause inconsistent results when you use the same variable function multiple times in a mapping.


The Integration Service does not save the final current value of a mapping variable to the repository when any of the following conditions are true:


- The session fails to complete.
- The session is configured for a test load.
- The session is a debug session.
- The session runs in debug mode and is configured to discard session output.

To create a mapping variable, go to the Mappings tab and select the 'Parameters and Variables' option. Add the mapping variable there.





Now we give the condition in the Source Qualifier as EMP1.HIREDATE > $$DATE (comparing a date column from the source with the mapping variable).




We define the mapping variable expression in the Expression Editor of an Expression transformation, as in this image and the sketch below.
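For illustration, a hedged sketch of the variable-port expression, assuming the mapping variable is $$DATE and the source has a HIREDATE column:

    -- Variable port in the Expression transformation
    v_set_date = SETMAXVARIABLE($$DATE, HIREDATE)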




After executing the workflow, the latest value is saved in $$DATE.

Each time we execute the workflow, the new value replaces the previous value of $$DATE. As a result we get a persistent value, as in the image.





So with the help of mapping variables we can perform incremental loads.

We can refresh the persistent value if we want to load data from the beginning.
















Date Functions in SQL

Select Getdate() as CurrentDate;
Select Getdate() - 1 as PreviousDate;
Select Getdate() + 1 as NextDate;
Select DATEADD(dd, 1, getdate()) as NextDate;
...