Sharing System (CISS) Replication/ETL August 23, 2012

Transcription

State of ConnecticutCriminal Justice Information SystemConnecticut Information Sharing System (CISS)Technology Workshop 1: Data Replication/ETLAugust 23, 2012

Data Replication/ETL: TermsData Replication: Data Replication is the process of copyingthe data from a certain data source to another source whilemaintaining identical copies of the data that aresynchronized.Extraction, Transformation, and Load (ETL): ETL is theprocess of extracting data from an environment,transforming elements, and loading the data into anenvironment.August 23, 20122

Data EnvironmentsData structures have transformed since the storing of information began. In many cases, earlierfile‐based data still exists today. This is most common with IBM and DEC (Compaq/HP)environments. The ability to access information, from any type of data environment and toreplicate it into a common structure, is vital to the success of CISS. The CISS community currentlysupports data structures ranging from flat and indexed files on IBM and HP mainframe/super‐mini systems to databases including SQL Server, Oracle, Microsoft Access; there is even a LotusNotes application within our world.August 23, 20123

Document Your DataAugust 23, 20124

Move Data to the Right DatabaseAugust 23, 20125

Transforms Data as it Moves itAugust 23, 20126

Replication/ETL Requirements Linking connected/disconnected “Clouds of data” into unified target environment(s) with orwithout naming transformationsSupports direct integration of legacy data into any relational databaseFacilitates Automated Data Synchronizingo Highly customizableo Automated failure controlo Unlimited capabilities including scripting, triggers, external applicationso Immediate integration with Niem naming conventions via templatesEliminates resources (coding)o No legacy coding structureso No user‐developed coding points of failureo Widely utilized technologies (adopted by Federal and State Agencies)o Facilitates RAD conceptso Single point of maintenance/configurationAutomates information streams (indexes, partial, incremental, full)Supports most commonly used DBs and data file structureso Oracle, SQL Server, RMS, DB2, VSAM, ISAM, Sequential, Indexed, Adabase, Lotus NotesGroup entities (agencies, tables, files, etc.)August 23, 20127

Replication/ETL Requirements, cont’d.Data Replication/ETLData replication/ETL is a vital component of the CISS architecture in that it enables theindexing of agency data elements from a common, secured environment. The ability to indexreplicated data supports the ability for Microsoft’s FAST product to build and update anindex using a singular data structure, SQL Server.Relevance to CISSOne of the primary goals of CISS is to enable an individual, with proper credentials, to rapidlyretrieve searched data across a spectrum of diverse information from 200 CISS stakeholderbusiness systems. The implementation of a well‐designed and structured architecture thatreplicates, maps to NIEM, indexes, and presents data from such an expansive environmentwith 5 second response time will ensure this goal is achieved.August 23, 20128

Replication OptionsThere are three options from which Agency Stakeholders can choose tosupport CISS searching their data environments. The three options giveour CISS community flexibility to decide what method to use, each withdiffering levels of complexity for integration.¾ Federated Search¾ Agency Replicates Data¾ Crawling of their Data99/5/2012

Replication – Option 1 Data from a stakeholder’s environment is access from CISS via web service query. Theagency will be required to create Web Service interfaces to be used by CISS. Theseservices will respond to query requests from CISS, which will generate data extractionsvia views, stored procedures, or other methods an agency prefers to use. The selected data will then be returned via the original request, synchronously. Eachtable to be searched by CISS will require a distinct Web Service. This option requires the most effort by the agency and CISS and impacts the ability ofCISS to respond to a query request rapidly. In effect, CISS does not recommend thisoption unless it is absolutely necessary to interface in this manner. For each searchrequest made by a user within CISS, each of the agencies will be required to respond to asearch request via web services and respond with the appropriate data. These queriesare for initial search requests and detail requests. As an example: if there are 5,000 search requests (initial or detail) per hour, the agenciesusing Option 1 will be required to respond to each request – receive the request, querytheir data environments, build an XML message, and send the response via web serviceresponse. This scenario will significantly impact agency storage throughput, both theirsand the State of Connecticut’s network, affects Search response times and is a significantpoint of maintainability and failure in the Search segment of CISS.August 23, 201210

Replication – Option 2 Data from a stakeholder’s environment is replicated (copied) by the agency and eitherput on a common network drive for CISS Access,placed on a FTP site for CISS to retrieve, orput into another data environment where CISS can replicate the data. A schedule tosupport collecting the replicated data will be arranged between CISS and the respectiveagencies.This option requires an effort by an agency to provide a mechanism to extract portions oftheir data into one of several structures (Database, index or flat file, XML, etc.) and to makethe data available to CISS. Disadvantages to this option include delays in indexing and searching stale data, storagerequirements for the agency to “hold” the replicated data and the repetitive process ofbuilding container(s) to retrieve updated data records. If the agency cannot identify the changed records efficiently, then the entire database (onlynecessary fields) will need to be replicated, repetitively, throughout the day. Lastly, the overhead and impact to re‐index all the data from an agency will impact theagency’s and CISS’s systems and the State of Connecticut’s Network.August 23, 201211

Replication – Option 3 This option is the most efficient scenario, and lowest impact to the stakeholder, CISS andthe State of Connecticut’s networks. It offers the ability to support agency governance of data, encryption of data, minimizednetwork traffic, a single point of configuration, auditable, and secure. To support this option, an agency works with CISS to:ooo identify the pertinent data to be replicated (tables, fields, data files, etc.),provides a User ID/Password with Read‐Only access to their data environment(SQL, Oracle, O/S) anddepending on their data environment, may be required to install a listener service. Thelistener service applies to VAX/Alpha systems and IBM environments using VSAM/ISAMfor their data file structures. Oracle and SQL Server environments require only access totheir respective Database’s IP Port Number (Oracle 1521, SQL Server 1433)This option enables CISS to scan the agency’s selected data environments on a pre‐determined schedule. Data environment which have high frequencies of updated data will bescanner frequently and those environments which have minor updates or are stale will bescanned infrequently. The frequency in either scenario depends on the size of the database,the nature of the data and necessity to have visibility to the data.August 23, 201212

¾ Bottom Line: Replicating data has an impact onsystems and networks.¾ To minimize the performance impact caused byreplication, it is imperative that agenciesunderstand these three options.August 23, 201213

SQL Server to SQL Server (remote)Source PC SpecificaionDestination PC SpecificationIntel Core2 Extreme X9650 @ 3 Ghz, 8GB ram, 1TB Hard disk, 1Gbit Network, Windows 2008 64bitIntel Xeon X3565 Extreme X9650 @ 3.2 Ghz, 12GB ram, 1TB Hard disk, 1Gbit Network, Windows 2008R2 64bitSource DatabaseTarget DatabaseSQL Server 2008R2SQL Server 2008R2Table Schema:create table sql2008R2.onemillionrowtable( id char(10), name char(40), createdate date, sold decimal(11,2), ordercount integer, notes varchar(115))create unique index blahone on sql2008R2.onemillionrowtable( id )Table WidthTable RowsTable Columns197 bytes1M5Full SyncIncremental Sync (no changes)Incremental Sync (100 updates) .01% changeIncremental Sync (500 updates) .05% changeIncremental Sync (1000 updates) .1% change1M Elapsed Time (H:MM:SS) DDL for Update0:00:350:00:070:00:09 update SQL2008R2.onemillionrowtable set createdate now() , sold 1234.56 where id between '0000100000' and '0000100099'0:00:08 update SQL2008R2.onemillionrowtable set createdate now() , sold 7234.56 where id between '0000100000' and '0000100499'0:00:09 update SQL2008R2.onemillionrowtable set createdate now() , sold 6234.56 where id between '0000100000' and '0000100999'1M Elapsed Time 71M Elapsed Time (H:MM:SS)0:00:130:00:090:00:040:00:00Full SyncAugust 23, 2012Incremental Sync (no Incremental Sync (100 Incremental Sync (500changes)updates) .01% change updates) .05% changeIncremental Sync(1000 updates) .1%change14

RMS to SQL Server (remote)Source PC SpecificaionDestination PC SpecificationItanium VMS HP rx1600 1GHzIntel Core2 Extreme X9650 @ 3 Ghz, 8GB ram, 1TB Hard disk, 1Gbit Network, Windows 2008 64bitIntel Xeon X3565 Extreme X9650 @ 3.2 Ghz, 12GB ram, 1TB Hard disk, 1Gbit Network, Windows 2008R2 64bitSource DatabaseTarget DatabaseRMS (VMS Itanium)SQL Server 2008R2Table Schema:converted from RMS flat file ‐ unique index on IDTable WidthTable RowsTable Columns196 bytes1M5Full SyncIncremental Sync (no changes)Incremental Sync (100 updates) .01% changeIncremental Sync (500 updates) .05% changeIncremental Sync (1000 updates) .1% changeidnamecreatedatesoldordercountnotesText (Right Space Padded)Text (Right Space Padded)Text Date (YYYYMMDD)Zoned Numeric ‐ DecimalZoned Numeric ‐ DecimalText (Right Space Padded)CharCharDateDecimalDecimalChar010505864711M Elapsed Time (H:MM:SS) DDL for Update0:01:010:00:220:00:22 update rms.onemillionrowtable set createdate now() , sold 1234.56 where id between '0000100000' and '0000100099'0:00:22 update rms.onemillionrowtable set createdate now() , sold 7234.56 where id between '0000100000' and '0000100499'0:00:24 update rms.onemillionrowtable set createdate now() , sold 6234.56 where id between '0000100000' and '0000100999'1M Elapsed Time (H:MM:SS)0:01:090:01:000:00:520:00:430:00:351M Elapsed Time (H:MM:SS)0:00:260:00:170:00:090:00:00Full SyncAugust 23, 2012Incremental Sync (no Incremental Sync (100 Incremental Sync (500changes)updates) .01% change updates) .05% changeIncremental Sync(1000 updates) .1%change151040867125

RMS to SQL Server (remote)Source PC SpecificaionDestination PC SpecificationItanium VMS HP rx1600 1GHzIntel Core2 Extreme X9650 @ 3 Ghz, 8GB ram, 1TB Hard disk, 1Gbit Network, Windows 2008 64bitIntel Xeon X3565 Extreme X9650 @ 3.2 Ghz, 12GB ram, 1TB Hard disk, 1Gbit Network, Windows 2008R2 64bitSource DatabaseTarget DatabaseRMS (VMS Itanium)SQL Server 2008R2Table Schema:converted from RMS flat file ‐ unique index on IDTable WidthTable RowsTable Columns196 bytes50M5Full SyncIncremental Sync (no changes)Incremental Sync (5000 updates) .01% changeIncremental Sync (25000 updates) .05% changeIncremental Sync (50000 updates) .1% changeidnamecreatedatesoldordercountnotesText (Right Space Padded)Text (Right Space Padded)Text Date (YYYYMMDD)Zoned Numeric ‐ DecimalZoned Numeric ‐ DecimalText (Right Space 4086712550M Elapsed Time (H:MM:SS) DDL for Update0:55:510:35:390:35:39 update rms.fiftymillionrowtable set createdate now() , sold 1234 where id between '0010000000' and '0010004999'0:37:40 update rms.fiftymillionrowtable set createdate now() , sold 7234 where id between '0010000000' and '0010024999'0:36:26 update rms.fiftymillionrowtable set createdate now() , sold 6234 where id between '0010000000' and '0010049999'50M Elapsed Time 850M Elapsed Time (H:MM:SS)0:21:360:14:240:07:120:00:00Full SyncAugust 23, 2012Incremental Sync (noIncremental SyncIncremental SyncIncremental Syncchanges)(5000 updates) .01% (25000 updates) .05% (50000 updates) .1%changechangechange16

DB2 To SQL Server (remote)Source PC SpecificaionDestination PC SpecificationIntel Core2 Extreme X9650 @ 3 Ghz, 8GB ram, 1TB Hard disk, 1Gbit Network, Windows 2008 64bitIntel Xeon X3565 Extreme X9650 @ 3.2 Ghz, 12GB ram, 1TB Hard disk, 1Gbit Network, Windows 2008R2 64bitSource DatabaseTarget DatabaseDB2 9.7.1SQL Server 2008R2Table Schema:create table db2.onemillionrowtable( id char(10), name char(40), createdate date, sold decimal(11,2), ordercount integer, notes varchar(115))create unique index blahone on db2.onemillionrowtable( id )Table WidthTable RowsTable Columns197 bytes1M5Full SyncIncremental Sync (no changes)Incremental Sync (100 updates) .01% changeIncremental Sync (500 updates) .05% changeIncremental Sync (1000 updates) .1% change1M Elapsed Time (H:MM:SS) DDL for Update0:00:390:00:080:00:07 update db2.onemillionrowtable set createdate now() , sold 1234.56 where id between '0000100000' and '0000100099'0:00:10 update db2.onemillionrowtable set createdate now() , sold 7234.56 where id between '0000100000' and '0000100499'0:00:09 update db2.onemillionrowtable set createdate now() , sold 6234.56 where id between '0000100000' and '0000100999'1M Elapsed Time 21M Elapsed Time (H:MM:SS)0:00:170:00:130:00:090:00:040:00:00Full SyncIncremental Sync (no Incremental Sync (100 Incremental Sync (500changes)updates) .01% change updates) .05% changeAugust 23, 2012Incremental Sync(1000 updates) .1%change17

VSAM to SQL Server (remote)Source Mainframe SpecificaionDestination PC SpecificationDataSync PC SpecificaionSource DatabaseTarget Databasez/OS version 1 release 12 level 1009, 5 MSU, BatchIntel Xeon X3565 Extreme X9650 @ 3.2 Ghz, 12GB ram, 1TB Hard disk, 1Gbit Network, Windows 2008R2 64bitIntel Core2 Extreme X9650 @ 3 Ghz, 8GB ram, 1TB Hard disk, 1Gbit Network, Windows 2008 64bitVSAMSQL Server 2008R2Table Schema:Repro'd from qsam flat file ‐ unique index on IDTable WidthTable RowsTable Columns200 bytes1M5Full SyncIncremental Sync (no changes)Incremental Sync (100 updates) .01% changeIncremental Sync (500 updates) .05% changeIncremental Sync (1000 updates) .1% changeidnamecreatedatesoldordercountnotesText (Right Space Padded)Text (Right Space Padded)Text Date (YYYYMMDD)Zoned Numeric ‐ DecimalZoned Numeric ‐ DecimalText (Right Space 408671251M Elapsed Time (H:MM:SS) DDL for Update0:04:440:04:230:04:24 update vsam.onemillionrowtable set createdate now() , sold 1234.56 where id between '0000100000' and '0000100099'0:04:25 update vsam.onemillionrowtable set createdate now() , sold 7234.56 where id between '0000100000' and '0000100499'0:04:28 update vsam.onemillionrowtable set createdate now() , sold 6234.56 where id between '0000100000' and '0000100999'1M Elapsed Time 81M Elapsed Time (H:MM:SS)0:04:240:04:190:04:150:04:11Full SyncAugust 23, 2012Incremental Sync (no Incremental Sync (100 Incremental Sync (500changes)updates) .01% change updates) .05% changeIncremental Sync(1000 updates) .1%h18

Oracle to SQL Server (remote)Source PC SpecificaionDestination PC SpecificationIntel Core2 Extreme X9650 @ 3 Ghz, 8GB ram, 1TB Hard disk, 1Gbit Network, Windows 2008 64bitIntel Xeon X3565 Extreme X9650 @ 3.2 Ghz, 12GB ram, 1TB Hard disk, 1Gbit Network, Windows 2008R2 64bitSource DatabaseTarget DatabaseOracle 11iSQL Server 2008R2Table Schema:create table orcl.onemillionrowtable( id char(10), name char(40), createdate date, sold decimal(11,2), ordercount integer, notes varchar(115))create unique index blahone on orcl.onemillionrowtable( id )Table WidthTable RowsTable Columns197 bytes1M5Full SyncIncremental Sync (no changes)Incremental Sync (100 updates) .01% changeIncremental Sync (500 updates) .05% changeIncremental Sync (1000 updates) .1% change1M Elapsed Time (H:MM:SS) DDL for Update0:01:010:00:220:00:22 update orcl.onemillionrowtable set createdate now() , sold 1234.56 where id between '0000100000' and '0000100099'0:00:22 update orcl.onemillionrowtable set createdate now() , sold 7234.56 where id between '0000100000' and '0000100499'0:00:24 update orcl.onemillionrowtable set createdate now() , sold 6234.56 where id between '0000100000' and '0000100999'1M Elapsed Time (H:MM:SS)0:01:090:01:000:00:520:00:430:00:351M Elapsed Time (H:MM:SS)0:00:260:00:170:00:090:00:00Full SyncIncremental Sync (no Incremental Sync (100 Incremental Sync (500changes)updates) .01% change updates) .05% changeIncremental Sync(1000 updates) .1%change

Oracle to SQL Server (remote)Source PC SpecificaionDestination PC SpecificationIntel Core2 Extreme X9650 @ 3 Ghz, 8GB ram, 1TB Hard disk, 1Gbit Network, Windows 2008 64bitIntel Xeon X3565 Extreme X9650 @ 3.2 Ghz, 12GB ram, 1TB Hard disk, 1Gbit Network, Windows 2008R2 64bitSource DatabaseTarget DatabaseOracle 11iSQL Server 2008R2Table Schema:create table orcl.tenmillionrowtable( id char(10), name char(40), createdate date, sold decimal(11,2), ordercount integer, notes varchar(115))create unique index blahfifty on orcl.tenmillionrowtable( id )Table WidthTable RowsTable Columns197 bytes10M5Full SyncIncremental Sync (no changes)Incremental Sync (1000 updates) .01% changeIncremental Sync (5000 updates) .05% changeIncremental Sync (10000 updates) .1% change10M Elapsed Time (H:MM:SS) DDL for Update0:09:310:04:570:05:06 update orcl.tenmillionrowtable set createdate now() , sold 1234.56 where id between '0000100000' and '0000100999'0:05:08 update orcl.tenmillionrowtable set createdate now() , sold 7234.56 where id between '0000100000' and '0000104999'0:05:39 update orcl.tenmillionrowtable set createdate now() , sold 6234.56 where id between '0000100000' and '0000109999'10M Elapsed Time (H:MM:SS)0:10:050:08:380:07:120:05:460:04:1910M Elapsed Time (H:MM:SS)0:02:530:01:260:00:00Full SyncIncremental Sync (noIncremental Syncchanges)(1000 updates) .01%changeIncremental Sync(5000 updates) .05%changeIncremental Sync(10000 updates) .1%change

Oracle to SQL Server (local)Source PC SpecificaionDestination PC SpecificationIntel Core2 Extreme X9650 @ 3 Ghz, 8GB ram, 1TB Hard disk, 1Gbit Network, Windows 2008 64bitSame as Source ‐ Both Source & Target are on same physical PCSource DatabaseTarget DatabaseOracle 11iSQL Server 2008R2Table Schema:create table orcl.fiftymillionrowtable( id char(10), name char(40), createdate date, sold decimal(11,2), ordercount integer, notes varchar(115))create unique index blahfifty on orcl.fiftymillionrowtable( id )Table WidthTable RowsTable Columns197 bytes50M5Full SyncIncremental Sync (no changes)Incremental Sync (5000 updates) .01% changeIncremental Sync (25000 updates) .05% changeIncremental Sync (50000 updates) .1% change50M Elapsed Time (H:MM:SS) DDL for Update0:54:160:18:460:19:04 update orcl.fiftymillionrowtable set createdate now() , sold 1234.56 where id between '10000000' and '10004500'0:19:47 update orcl.fiftymillionrowtable set createdate now() , sold 7234.56 where id between '10000000' and '10022500'0:20:15 update orcl.fiftymillionrowtable set createdate now() , sold 6234.56 where id between '10000000' and '10045000'50M Elapsed Time (H:MM:SS)0:57:360:50:240:43:120:36:000:28:4850M Elapsed Time (H:MM:SS)0:21:360:14:240:07:120:00:00Full SyncAugust 23, 2012Incremental Sync (noIncremental SyncIncremental SyncIncremental Syncchanges)(5000 updates) .01% (25000 updates) .05% (50000 updates) .1%changechangechange21

Oracle to SQL Server (remote)Source PC SpecificaionDestination PC SpecificationIntel Core2 Extreme X9650 @ 3 Ghz, 8GB ram, 1TB Hard disk, 1Gbit Network, Windows 2008 64bitIntel Xeon X3565 Extreme X9650 @ 3.2 Ghz, 12GB ram, 1TB Hard disk, 1Gbit Network, Windows 2008R2 64bitSource DatabaseTarget DatabaseOracle 11iSQL Server 2008R2Table Schema:create table orcl.fiftymillionrowtable( id char(10), name char(40), createdate date, sold decimal(11,2), ordercount integer, notes varchar(115))create unique index blahfifty on orcl.fiftymillionrowtable( id )Table WidthTable RowsTable Columns197 bytes50M5Full SyncIncremental Sync (no changes)Incremental Sync (5000 updates) .01% changeIncremental Sync (25000 updates) .05% changeIncremental Sync (50000 updates) .1% change50M Elapsed Time (H:MM:SS) DDL for Update0:51:570:18:020:18:46 update orcl.fiftymillionrowtable set createdate now() , sold 1234.56 where id between '10000000' and '10004500'0:19:05 update orcl.fiftymillionrowtable set createdate now() , sold 7234.56 where id between '10000000' and '10022500'0:19:59 update orcl.fiftymillionrowtable set createdate now() , sold 6234.56 where id between '10000000' and '10045000'50M Elapsed Time (H:MM:SS)0:57:360:50:240:43:120:36:000:28:4850M Elapsed Time (H:MM:SS)0:21:360:14:240:07:120:00:00Full SyncAugust 23, 2012Incremental SyncIncremental SyncIncremental Sync (noIncremental Syncchanges)(5000 updates) .01% (25000 updates) .05% (50000 updates) .1%changechangechange22

FeedbackWe need your feedback —please send us your comments, questions & suggestions.Sean Thakkar — Sean.Thakkar@ct.govMark Tezaris — Mark.Tezaris@ct.govRick Ladendecker — Rick.Ladendecker@ct.govNance McCauley — Nance.McCauley@ct.govThank youAugust 23, 201223

Appendix: AcronymsAFIS Automated Fingerprint Identification systemAST Application Support SystemBEST Bureau of Enterprise Systems and TechnologyBICE Bureau of Immigration and Customs EnforcementBOPP Board of Pardons and ParolesCAD Computer Aided DispatchCCH Computerized Criminal HistoryCIB Centralized Infraction Bureau (Judicial)CIB Centralized Infractions BureauCIDRIS Conn. Impaired Driver Records Information SystemCISS Conn. Information Sharing SystemCIVLS CT Integrated Vehicle & Licensing SystemCJIS Criminal Justice Information SystemCJPPD Criminal Justice Policy Development and Planning DivisionCMIS (Judicial’s) Case Management Information SystemCOLLECT Connecticut On‐Line Law EnforcementCommunications Teleprocessing networkCPCA Conn. Police Chiefs AssociationCRMVS Criminal and Motor Vehicle System (Judicial)CSSD Court Support Services DivisionDCJ Division of Criminal JusticeDAS Dept. of Administrative ServicesDESPP Department of Emergency Services & Public ProtectionDEMHS Dept of Emergency Management & Homeland SecurityDMV Dept. of Motor VehiclesDOC Department of CorrectionsDOIT Dept. of Information TechnologyDPDS Div. of Public Defender ServicesIST Infrastructure Support TeamAugust 23, 2012JMI Jail Management SystemJUD Judicial BranchLEA Law Enforcement AgencyLAW Local Law Enforcement (e.g., DPS, CPCA)LIMS State Crime Laboratory DatabaseMNI Master Name Index (State Police)OBIS Offender Based Information System (Corrections)OBTS Offender Based Tracking SystemOVA Office of Victim AdvocacyOVS Office of Victim ServicesRMS Records Management System (Police Agency RMSmanages & stores info on arrests, incidents)OSET Office of Statewide Emergency TelecommunicationsPOR Protective Order Registry (Judicial)PRAWN Paperless Re‐Arrest Warrant Network (Judicial)PSDN Public Safety Data NetworkSCO Superior court operationsSOR Sex Offender Registry (Judicial)Technology RelatedCOTS Computer Off The Shelf (e.g., software)ETL Extraction, Transformation, and LoadIEPD Information Exchange Package DeliveryPOC Proof of ConceptRDB Relational DatabaseSDLC Software Development Life CycleSOA Service Oriented ArchitectureSQL Structured Query Language24

Aug 23, 2012 · Extraction, Transformation, and Load (ETL): ETL is the process of extracting data from an environment, transforming elements, and loading the data into an environment. Data