Parsing A PDF File With PowerCenter

Transcription

Parsing a PDF File with PowerCenter 2010 Informatica

AbstractYou can parse data from a PDF file with a PowerCenter mapping. Define the PDF file as a Data Transformation source. Thisarticle describes how to configure the Data Transformation source to interface with a Data Transformation service.Supported Versions PowerCenter 9.0.1Table of ContentsOverview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2Mapping Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3PDF File Structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3Create the Data Transformation Source. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5Export the XML File Structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7Create the Target Definition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7Data Transformation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8Create the Data Transformation Project. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8Deploy the Data Transformation Project. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9Define the Service Name. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10Configure the Workflow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10Run the Workflow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11OverviewA PDF is a common file format that stores invoices and account statements. You can configure a PowerCenter mapping toextract the data from the PDF when the page layout is the same for each invoice. Configure a Data Transformation source inthe PowerCenter Designer.This article explains how to configure a Data Transformation source that represents a multiple page PDF file. The articleshows how to configure the PowerCenter source with a Data Transformation service to extract the data from the PDF file.The target is a set of relational tables.To parse the data from a PDF file, complete the following tasks: Create the Data Transformation source in the Designer. Export the structure as an XML schema from the Designer. Create a Data Transformation Parser project in the Data Transformation Developer Studio. Import the XML schemathat you created in PowerCenter. Deploy the project as a Data Transformation service. Deploy the project to the Data Transformation repository localto the PowerCenter Client. Deploy another copy of the service to the Data Transformation repository local to thePowerCenter Integration Service. Define the Data Transformation service name in the Data Transformation source. Create and run the PowerCenter workflow.2

Mapping OverviewCreate a PowerCenter mapping to parse the data from the PDF file and pass the data to relational targets.The following figure shows the mapping in the Designer:The mapping contains the following objects:Data Transformation sourceThe Data Transformation source is a PDF file. The Integration Service calls a Data Transformation service toparse the data from the PDF. The Data Transformation service returns XML to the Integration Service. The DataTransformation source passes row data to the pipeline.Application Multi-Group Source QualifierThe Source Qualifier transformation represents the rows that the Integration Service passes to the target. Whenyou add the Data Transformation source to the mapping, the Designer creates a source qualifier by default.TargetsThe targets are the Invoice Header, Buyer Total, and Transaction Detail tables.PDF File StructureThe PDF source file is multiple page invoice. The first page contains the customer name, the address, and the accountnumber. The page includes a summary of the current charges and the total balance due. The first page also containsadvertising and other text that you do not need to extract. The second page contains a list of the charges sorted by buyer.Each buyer has multiple charges.You can view a sample PDF file in the Data Transformation tutorial #3. The PDF file is OrshavaInvoice.pdf.3

The following figure shows the first page of the PDF:The second PDF page contains transactions by buyer name:4

Create the Data Transformation SourceCreate the Data Transformation source in the PowerCenter Source Analyzer.When you create a Data Transformation source, the Designer creates the following default ports:InputFileNameReturns the name of the current input file.OutputBufferOutput port that returns XML from a Data Transformation service if you do not create output ports. When youdefine ports on the Output Hierarchy tab, the OutputBuffer does not return data.5

To pass row data to the relational tables, configure output ports on the Output Hierarchy tab. Create a hierarchy of groups inthe left pane of the Output Hierarchy tab. All groups are under the root group. Each group can contain ports and othergroups. The group structure represents the relationship between target tables. When you define a group within a group, youdefine a parent-child relationship between the groups. The Designer defines a primary key-foreign key relationship betweenthe groups with a generated key.The following figure shows the Output Hierarchy tab:Define the following groups of ports to represent the invoice database tables:Group1 Invoice Header Account. Customer account number. Period Ending. Date of current charges. Current Total. Total amount of purchases for the period. Balance Due. Total amount due including past due charges.Group2 Buyer Total Name. Name of the buyer that purchased the products. Total. Total cost of the products for the buyer.Group3 Transaction Detail Date. Purchase date. Ref. Purchase reference number. Product. Product name. Total. Product price.6

Export the XML File StructureExport the group structure from the Output Hierarchy tab as an XML schema. You can import the .xsd file when you createthe Data Transformation project in the Data Transformation Studio.Click Export to XML Schema on the Output Hierarchy tab. The Designer creates the following .xsd file: ?xml version "1.0" encoding "UTF-8" standalone "no" ? !-- AUTO-GENERATED FILE - DO NOT EDIT -- !-- This file has been generated by Informatica PowerCenter -- xs:schema attributeFormDefault "unqualified" elementFormDefault "qualified"targetNamespace "www.informatica.com/CDET/XSD/mappingName DT pdf SOURCE" xmlns "www.informatica.com/CDET/XSD/mappingName DT pdf SOURCE" xmlns:xs "http://www.w3.org/2001/XMLSchema" xs:element name "PC XSD ROOT" type "PC XSD ROOTT"/ xs:complexType name "PC XSD ROOTT" xs:sequence xs:element maxOccurs "unbounded" minOccurs "0" ref "GROUP1"/ /xs:sequence /xs:complexType xs:element name "GROUP1" type "GROUP1T"/ xs:complexType name "GROUP1T" xs:sequence xs:element maxOccurs "unbounded" minOccurs "0" ref "GROUP2"/ xs:element minOccurs "0" name "Account" type "xs:string"/ xs:element minOccurs "0" name "Period Ending" type "xs:string"/ xs:element minOccurs "0" name "Current Total" type "xs:decimal"/ xs:element minOccurs "0" name "Balance Due" type "xs:decimal"/ /xs:sequence /xs:complexType xs:element name "GROUP2" type "GROUP2T"/ xs:complexType name "GROUP2T" xs:sequence xs:element maxOccurs "unbounded" minOccurs "0" ref "GROUP3"/ xs:element minOccurs "0" name "Name" type "xs:string"/ xs:element minOccurs "0" name "Total" type "xs:decimal"/ /xs:sequence /xs:complexType xs:element name "GROUP3" type "GROUP3T"/ xs:complexType name "GROUP3T" xs:sequence xs:element minOccurs "0" name "Date" type "xs:string"/ xs:element minOccurs "0" name "Ref" type "xs:string"/ xs:element minOccurs "0" name "Product" type "xs:string"/ xs:element minOccurs "0" name "Total" type "xs:string"/ /xs:sequence /xs:complexType /xs:schema Create the Target DefinitionThe target is a billing database that stores the invoice information.The database has three tables that store invoice data. The Invoice Header stores the invoice summary information. TheBuyer Total table stores the total sales by Buyer for each invoice number. The Transaction Detail table stores transactioninformation, including the date, product, and price.The following figure shows the tables in the target definition:7

Data TransformationData Transformation is the application that transforms file formats such as Excel spreadsheets or PDF documents.Create Data Transformation projects in the Data Transformation Studio. Deploy the projects from the Data TransformationStudio to the Data Transformation repository. The Designer accesses the services in the Data Transformation repositorywhen you create a Data Transformation source. The PowerCenter Integration Service accesses a Data Transformationservice when it runs a workflow that has a Data Transformation source, target, or Unstructured Data transformation.Create the Data Transformation ProjectCreate a parser project in the Data Transformation Studio. A Parser project extracts data and returns XML.For this example, you can import the parser project from the Data Transformation tutorial #3. The project files are in thefollowing directory: Data Transformation Installation \tutorials\exercises\Files for Tutorial 3You can import the parser to the Data Transformation Studio from the Results directory. The parser project isPDFInvoiceParser.cmw.The tutorial describes how to create the parser. To interface the project with PowerCenter, use the .xsd file that youexported from the Designer instead of the OrshavaInvoice.xsd file.8

The parser runs a document processor to convert the data from a binary PDF format to text. The parser project usespositional formatting to determine the location of the data in the PDF. You configure the anchors that define the text locationand the content. Define a repeating group for the buyer and a nested repeating group for each buyer transaction. Define aCalculateValue action to add product prices for each buyer and a total for the invoice.You can run the project in the Data Transformation Studio. View results from the sample data. When you call a DataTransformation service from PowerCenter, the Data Transformation Engine passes the XML back to the PowerCenterIntegration Service.When you run the project, Data Transformation returns the following XML: ?xml version "1.0" encoding "windows-1252" ? - Invoice account "12345" Period Ending April 30, 2003 /Period Ending Current Total 351.04 /Current Total Balance Due 475.07 /Balance Due - Buyer name "Molly" total "217.65" - Transaction date "Apr 02" ref "22498" Product large eggs /Product Total 29.07 /Total /Transaction - Transaction date "Apr 08" ref "22536" Product large eggs /Product Total 58.14 /Total /Transaction - Transaction date "Apr 08" ref "22536" Product cheddar cheese /Product Total 43.61 /Total /Transaction - Transaction date "Apr 21" ref "22798" Product cream cheese /Product Total 26.98 /Total /Transaction - Transaction date "Apr 29" ref "22903" Product large eggs /Product Total 59.85 /Total /Transaction /Buyer - Buyer name "Jack" total "133.39" - Transaction date "Apr 12" ref "22570" Product large eggs /Product Total 29.93 /Total /Transaction - Transaction date "Apr 18" ref "22734" Product large eggs /Product Total 59.85 /Total /Transaction - Transaction date "Apr 25" ref "22841" Product cheddar cheese /Product Total 43.61 /Total /Transaction /Buyer /Invoice Deploy the Data Transformation ProjectAfter you design and test the Data Transformation project, deploy the project as a service to a Data Transformation repository.Deploy the Data Transformation project to a Data Transformation repository that is on the same machine as thePowerCenter Client. The PowerCenter Client can access the repository to retrieve Data Transformation service names andport requirements.Deploy the Data Transformation project to the repository that is on the same machine as the Data Integration Service whenyou want to run the workflow. The PowerCenter Integration Service calls the Data Transformation service to transform theData Transformation source.9

Define the Service NameAfter you define the Data Transformation service, update the Data Transformation source with the service name.Add the Data Transformation service name in the Settings tab. The service name must be defined in the DataTransformation source or the mapping is invalid.The following figure shows where to enter the Data Transformation service name in the Data Transformation source:Configure the WorkflowBefore you can run the workflow, deploy the Data Transformation project to the Data Transformation repository that is on thesame machine as the PowerCenter Integration Service.Configure the name of the source PDF in the session properties. If you want to process multiple PDF files, you can use awildcard in the session properties. You can use the following wildcard characters in the session properties:* (asterisk)Match any combination of characters. For example, *.doc matches all files with the doc extension. Or, ab*.txtmatches every file that begins with ab and has txt extension.? (question mark)Match one character. For example, ab?.txt matches any file with ab as the first two characters any character asthe third character. The extension must be txt.The following figure shows how to configure the session to process multiple source PDF files:10

The source file name is *Invoice*.pdf. The session is configured to use wildcards.Run the WorkflowWhen you run the workflow, the PowerCenter Integration Service passes the source PDF to the Data TransformationEngine. The Data Transformation Engine parses the PDF and returns XML to the PowerCenter Integration Service. ThePowerCenter Integration Service writes rows to the target database tables.The Integration Service writes the following row to the Invoice Header table:XPK InvoiceAccountPeriod EndingCurrent TotalBalance Due112345April 30, 2003351.04475.07The Integration Service writes the following row to the Buyer table:XPK BuyerFK InvoiceBuyer NameTotal11Molly217.6521Jack133.3911

The Integration Service writes the following row to the Transaction Detail table:XPK TransactionFK BuyerDateRefProductTotal11Apr 0222498large eggs29.0721Apr 0822536large eggs58.1431Apr 0822536cheddar cheese43.6141Apr 2122798cream cheese26.9851Apr 2922903large eggs59.8562Apr 1222570large eggs29.9372Apr 1822734large eggs59.8582Apr 2522841cheddar cheese43.61AuthorEllen ChandlerPrincipal Technical Writer12

A PDF is a common file format that stores invoices and account statements. You can configure a PowerCenter mapping to extract the data from the PDF when the page layout is the same for each invoice. Configure a Data Transforma