FINAL INTERVIEW QUESTIONS ( ETL - INFORMATICA) Data .

Transcription

FINAL INTERVIEW QUESTIONS ( ETL - INFORMATICA)Data warehousing Basics1. Definition of data warehousing?Data warehouse is a Subject oriented, Integrated, Time variant, Non volatile collection of data in support ofmanagement's decision making process.Subject OrientedData warehouses are designed to help you analyze data. For example, to learn more about your company's salesdata, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like"Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter,sales in this case makes the data warehouse subject oriented.IntegratedIntegration is closely related to subject orientation. Data warehouses must put data from disparate sources into aconsistent format. They must resolve such problems as naming conflicts and inconsistencies among units ofmeasure. When they achieve this, they are said to be integrated.NonvolatileNonvolatile means that, once entered into the warehouse, data should not change. This is logical because thepurpose of a warehouse is to enable you to analyze what has occurred.Time VariantIn order to discover trends in business, analysts need large amounts of data. This is very much in contrast toonline transaction processing (OLTP) systems, where performance requirements demand that historical data bemoved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.2. How many stages in Datawarehousing?Data warehouse generally includes two stagesETLReport GenerationETLShort for extract, transform, load, three database functions that are combined into one tool Extract -- the process of reading data from a source database.Transform -- the process of converting the extracted data from its previous form into required formLoad -- the process of writing the data into the target database.ETL is used to migrate data from one database to another, to form data marts and data warehouses and also toconvert databases from one format to another format.It is used to retrieve the data from various operational databases and is transformed into useful information andfinally loaded into Datawarehousing system.

1 INFORMATICA2 ABINITO3 DATASTAGE4. BODI5 ORACLE WAREHOUSE BUILDERSReport generationIn report generation, OLAP is used (i.e.) online analytical processing. It is a set of specification which allows theclient applications in retrieving the data for analytical processing.It is a specialized tool that sits between a database and user in order to provide various analyses of the data storedin the database.OLAP Tool is a reporting tool which generates the reports that are useful for Decision support for top levelmanagement.1. Business Objects2. Cognos3. Micro strategy4. Hyperion5. Oracle Express6. Microsoft Analysis Services Different Between OLTP and OLAPOLTPOLAPApplication Oriented (e.g., purchaseorder it is functionality of anapplication)Subject Oriented (subject in the sense2Used to run businessUsed to analyze business3Detailed dataSummarized data4Repetitive accessAd-hoc access5Few Records accessed at a time(tens), simple queryLarge volumes accessed at a time(millions),complex query6Small databaseLarge Database7Current dataHistorical data8Clerical UserKnowledge User1customer, product, item, time)

9Row by Row LoadingBulk Loading10Time invariantTime variant11Normalized dataDe-normalized data12E – R schemaStar schema

3. What are the types of datawarehousing?EDW (Enterprise datawarehousing)It provides a central database for decision support throughout the enterpriseIt is a collection of DATAMARTSDATAMARTIt is a subset of DatawarehousingIt is a subject oriented database which supports the needs of individuals depts. in an organizationsIt is called high performance query structureIt supports particular line of business like sales, marketing etc.ODS (Operational data store)It is defined as an integrated view of operational database designed to support operational monitoringIt is a collection of operational data sources designed to support Transaction processingData is refreshed near real-time and used for business activityIt is an intermediate between the OLTP and OLAP which helps to create an instance reports

4. What are the modeling involved in Data Warehouse Architecture?

5. What are the types of Approach in DWH?Bottom up approach: first we need to develop data mart then we integrate these data mart into EDWTop down approach: first we need to develop EDW then form that EDW we develop data martBottom upOLTP ETL Data mart DWH OLAPTop downOLTP ETL DWH Data mart OLAPTop downCost of initial planning & design is highTakes longer duration of more than an yearBottom upPlanning & Designing the Data Marts without waiting for the Global warehouse designImmediate results from the data martsTends to take less time to implementErrors in critical modules are detected earlier.Benefits are realized in the early phases.It is a Best ApproachData Modeling Types:Conceptual Data ModelingLogical Data ModelingPhysical Data ModelingDimensional Data Modeling1. Conceptual Data ModelingConceptual data model includes all major entities and relationships and does not contain much detailed level ofinformation about attributes and is often used in the INITIAL PLANNING PHASEConceptual data model is created by gathering business requirements from various sources like business documents,discussion with functional teams, business analysts, smart management experts and end users who do thereporting on the database. Data modelers create conceptual data model and forward that model to functionalteam for their review.Conceptual data modeling gives an idea to the functional and technical team about how business requirementswould be projected in the logical data model.

2. Logical Data ModelingThis is the actual implementation and extension of a conceptual data model. Logical data model includes all requiredentities, attributes, key groups, and relationships that represent business information and define business rules.3. Physical Data ModelingPhysical data model includes all required tables, columns, relationships, database properties for the physicalimplementation of databases. Database performance, indexing strategy, physical storage and demoralization areimportant parameters of a physical model.Logical vs. Physical Data ModelingLogical Data ModelPhysical Data ModelRepresents business information and definesbusiness rulesRepresents the physical implementation of the model in adatabase.EntityTableAttributeColumnPrimary KeyPrimary Key ConstraintAlternate KeyUnique Constraint or Unique IndexInversion Key EntryNon Unique IndexRuleCheck Constraint, Default ValueRelationshipForeign KeyDefinitionCommentDimensional Data ModelingDimension model consists of fact and dimension tablesIt is an approach to develop the schema DB designsTypes of Dimensional modelingStar schemaSnow flake schemaStar flake schema (or) Hybrid schema

Multi star schemaWhat is Star Schema?The Star Schema Logical database design which contains a centrally located fact table surrounded by at least one ormore dimension tablesSince the database design looks like a star, hence it is called star schema dbThe Dimension table contains Primary keys and the textual descriptionsIt contain de-normalized business informationA Fact table contains a composite key and measuresThe measure are of types of key performance indicators which are used to evaluate the enterprise performance inthe form of success and failureEg: Total revenue , Product sale , Discount given, no of customersTo generate meaningful report the report should contain at least one dimension and one fact tableThe advantage of star schemaLess number of joinsImprove query performanceSlicing downEasy understanding of data.Disadvantage:Require more storage space

Example of Star Schema:Snowflake SchemaIn star schema, If the dimension tables are spitted into one or more dimension tablesThe de-normalized dimension tables are spitted into a normalized dimension tableExample of Snowflake Schema:In Snowflake schema, the example diagram shown below has 4 dimension tables, 4 lookup tables and 1 fact table. Thereason is that hierarchies (category, branch, state, and month) are being broken out of the dimension tables(PRODUCT, ORGANIZATION, LOCATION, and TIME) respectively and separately.It increases the number of joins and poor performance in retrieval of data.In few organizations, they try to normalize the dimension tables to save space.Since dimension tables hold less space snow flake schema approach may be avoided.Bit map indexes cannot be effectively utilizedImportant aspects of Star Schema & Snow Flake SchemaIn a star schema every dimension will have a primary key.In a star schema, a dimension table will not have any parent table.Whereas in a snow flake schema, a dimension table will have one or more parent tables.Hierarchies for the dimensions are stored in the dimensional table itself in star schema.Whereas hierarchies are broken into separate tables in snow flake schema. These hierarchies help to drill down thedata from topmost hierarchies to the lowermost hierarchies.Star flake schema (or) Hybrid SchemaHybrid schema is a combination of Star and Snowflake schemaMulti Star schemaMultiple fact tables sharing a set of dimension tablesConfirmed Dimensions are nothing but Reusable Dimensions.The dimensions which u r using multiple times or in multiple data marts.Those are common in different data martsMeasure Types (or) Types of Facts

Additive - Measures that can be summed up across all dimensions.o Ex: Sales RevenueSemi Additive - Measures that can be summed up across few dimensions and not with otherso Ex: Current BalanceNon Additive - Measures that cannot be summed up across any of the dimensions.o Ex: Student attendanceSurrogate KeyJoins between fact and dimension tables should be based on surrogate keysUsers should not obtain any information by looking at these keysThese keys should be simple integers

A sample data warehouse schemaWhy need staging area for DWH?Staging area needs to clean operational data before loading into data warehouse.Cleaning in the sense your merging data which comes from different source.It’s the area where most of the ETL is doneData CleansingIt is used to remove duplicationsIt is used to correct wrong email addressesIt is used to identify missing dataIt used to convert the data typesIt is used to capitalize name & addresses.Types of Dimensions:There are three types of DimensionsConfirmed DimensionsJunk Dimensions Garbage DimensionDegenerative DimensionsSlowly changing DimensionsGarbage Dimension or Junk DimensionConfirmed is something which can be shared by multiple Fact Tables or multiple Data Marts.Junk Dimensions is grouping flagged valuesDegenerative Dimension is something dimensional in nature but exist fact table.(Invoice No)Which is neither fact nor strictly dimension attributes. These are useful for some kind of analysis. These arekept as attributes in fact table called degenerated dimensionDegenerate dimension: A column of the key section of the fact table that does not have the associateddimension table but used for reporting and analysis, such column is called degenerate dimension or line itemdimension.For ex, we have a fact table with customer id, product id, branch id, employee id, bill no, and date in keysection and price, quantity, amount in measure section. In this fact table, bill no from key section is a single value;it has no associated dimension table. Instead of creating aSeparate dimension table for that single value, we can Include it in fact table to improve performance. SO here thecolumn, bill no is a degenerate dimension or line item dimension.Informatica Architecture

The Power Center domainIt is a primary unit of the Administration.Can have single and multiple domains.It is a collection of nodes and services.NodesA node is the logical representation of a machine in a domainOne node in the domain acts as a gateway node to receive service requests from clients and route them to theappropriate service and nodeIntegration Service:Integration Service does all the real job. It extracts data from sources, processes it as per the business logic andloads data to targets.Repository Service:Repository Service is used to fetch the data from the repository and sends it back to the requesting components(mostly client tools and integration service)Power Center Repository:Repository is nothing but a relational database which stores all the metadata created in Power Center.Power Center Client Tools:The Power Center Client consists of multiple tools.Power Center Administration Console:This is simply a web-based administration tool you can use to administer the Power Center installation.

Q. How can you define a transformation? What are different types of transformations available in Informatica?A. A transformation is a repository object that generates, modifies, or passes data. The Designer provides a set oftransformations that perform specific functions. For example, an Aggregator transformation performs calculationson groups of data. Below are the various transformations available in Informatica: Aggregator Custom Expression External Procedure Filter Input Joiner Lookup Normalizer Rank Router Sequence Generator Sorter Source Qualifier Stored Procedure Transaction Control Union Update Strategy XML Generator XML Parser

XML Source QualifierQ. What is a source qualifier? What is meant by Query Override?A. Source Qualifier represents the rows that the PowerCenter Server reads from a relational or flat file sourcewhen it runs a session. When a relational or a flat file source definition is added to a mapping, it is connected to aSource Qualifier transformation.PowerCenter Server generates a query for each Source Qualifier Transformation whenever it runs the session. Thedefault

FINAL INTERVIEW QUESTIONS ( ETL - INFORMATICA) Data warehousing Basics 1. Definition of data warehousing? Data warehouse is a Subject oriented, Integrated, Time variant, Non volatile collection of data in support of management's decision making process. Subject Oriented Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you