The Data Detective's Toolkit - SAS

Transcription

The correct bibliographic citation for this manual is as follows: Chantala, Kim. 2020. The Data Detective’s Toolkit:Cutting-Edge Techniques and SAS Macros to Clean, Prepare, and Manage Data. Cary, NC: SAS Institute Inc.The Data Detective’s Toolkit: Cutting-Edge Techniques and SAS Macros to Clean, Prepare, and Manage DataCopyright 2020, SAS Institute Inc., Cary, NC, USAISBN 978-1-952363-04-7 (Hardcover)ISBN 978-1-952363-00-9 (Paperback)ISBN 978-1-952363-01-6 (Web PDF)ISBN 978-1-952363-02-3 (EPUB)ISBN 978-1-952363-03-0 (Kindle)All Rights Reserved. Produced in the United States of America.For a hard copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, inany form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permissionof the publisher, SAS Institute Inc.For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendorat the time you acquire this publication.The scanning, uploading, and distribution of this book via the Internet or any other means without the permission ofthe publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do notparticipate in or encourage electronic piracy of copyrighted materials. Your support of others’ rights is appreciated.U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computersoftware developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government.Use, duplication, or disclosure of the Software by the United States Government is subject to the license terms of thisAgreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a), and DFAR 227.7202-4,and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice isrequired to be affixed to the Software or documentation. The Government’s rights in Software and documentationshall be only those set forth in this Agreement.SAS Institute Inc., SAS Campus Drive, Cary, NC 27513-2414December 2020SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS InstituteInc. in the USA and other countries. indicates USA registration.Other brand and product names are trademarks of their respective companies.SAS software may be provided with certain third-party software, including but not limited to open-source software,which is licensed under its applicable third-party software license agreement. For license information about thirdparty software distributed with SAS software, refer to http://support.sas.com/thirdpartylicenses.

C ontentsAbout This Book . vAbout The Author . ixAcknowledgments . xiChapter 1: Advantages of Using the Data Detective’s Toolkit . 1Introduction. 1An Overview of the Data Detective’s Toolkit. 2Summary . 7Chapter 2: The Data Detective’s Toolkit and SAS . 9Introduction. 9Preparing Your SAS Data Set . 9Fundamental SAS Macro Concepts . 15The Output Delivery System. 30Summary . 31Chapter 3: Codebooks: A Roadmap to Your Data . 33Introduction. 33Understanding Codebooks . 33Using the %TK codebook Macro . 36Example 3-1: Create a Codebook with Potential Problem Reports . 40Inside the Toolkit: %TK codebook . 52Summary . 53Chapter 4: Customizing Codebooks . 55Introduction. 55Example 4-1: Embellishing Titles . 55Example 4-2: Add a Logo to Your Codebook . 59Example 4-3: Codebook Output Data Set and Default Design . 61Example 4-4: Create a Custom Design for Your Codebook . 71Summary . 81Chapter 5: Catalog Your Data . 83Introduction. 83Using the %TK inventory Macro . 84Using the %TK xwalk Macro. 89Summary . 95

iv The Data Detective’s ToolkitChapter 6: Detecting and Correcting Data Errors . 97Introduction .97Harmonizing Data Sets: Using the %TK harmony Macro .98Example 6-1: Harmonizing Two Data Sets.99Inside the Toolkit: How %TK harmony Works .103Finding Duplicates: Using the %TK find dups Macro.108Example 6-2: Identifying Duplicates Based on Multiple Variables .108Inside the Toolkit: How %TK find dups Works .111Summary .114Chapter 7: Inspect and Edit Flow through Skip Patterns . 115Introduction .115Understanding Skip Patterns.116Identifying Skip Patterns in a Survey .117Traditional Method of Auditing Skip Patterns .120Example 7-1: Using the %TK skip edit Macro .123A Blueprint to Using %TK skip edit .130Example 7-2: Automated Method of Checking Skip Patterns .132Inside the Toolkit: How %TK skip edit Works .148Summary .154Chapter 8: Create and Validate New Variables . 157Introduction .157Coding Variables .157Easy Ways to Check Variable Construction .162Summary .174Appendix A: Your Part in the Data Life Cycle . 177Introduction .177Understanding the Data Life Cycle .177Summary .185Appendix B: Skip Pattern Data Codebook. 187Introduction .187SAS Program to Create Codebook .187Appendix C: Research Data Codebook . 193Introduction .193SAS Program to Create Codebook .193Index . 197

About This BookWhat Does This Book Cover?Data professionals who survived deep cuts in funding during the financial crisis of 2007–2008 had to develop innovative methods of data preparation. This book presentsinnovative data tools and techniques that helped data managers, practitioners, andprogrammers survive these challenges by reducing the cost and time needed for datamanagement while improving the quality of data prepared with their use. These toolsinclude SAS macros as well as ingenious ways of using SAS procedures and functions.Is This Book for You?This book is designed to help automate many of the tasks performed to turn raw datainto analysis-friendly data. These tasks are often filled with a mix of irksome andstrenuous activities that stand between you and data that can be used. This book willhelp preparers of the data in different ways:Intermediate andAdvanced users:You will reduce your workload and improve the quality of your databy using the SAS macro programs included with this book toautomate error-checking and create documentation for your projectdata. Using these programs included with this book will alleviate thetedious nature of data preparation by automating the identificationof inconsistencies and anomalies in raw data.

vi The Data Detective’s ToolkitNovice users:If you are not familiar with SAS and are just starting to work withdata, you will need to get help from a more experienced programmerto use the SAS macro programs that automatically producecodebooks, reports highlighting problems in the data, inventories ofavailable data sets, and crosswalks showing commonalities ofmultiple data sets. These are covered in Chapters 3 through 6. Oncethe SAS statements are set up to run the SAS programs producingthese reports, you will find it easy to assist in the detective work ofdata preparation. Examining these reports will really help you get toknow your data, and you can help to solve problems identified in thedata. Focusing on the discussion of the output in examples of thisbook will help you learn to interpret these reports and lead to abetter understanding of your data. Skip the sections in each chaptertitled “Inside the Toolkit” that discuss the macro program statementsin detail.Data managers and You will be able to choose from the many automated reports thatResearch staff:function as roadmaps into your data, snapshots of data quality andmonitoring, and use these reports to improve communicationbetween your programmer, practitioners, and the data collectionsponsors.All users:No matter what your level of experience, you should read Chapter 1,“Advantages of Using the Data Detective’s Toolkit” and Appendix A,“Your Part in the Data Life Cycle.”What Are the Prerequisites for This Book?Familiarity with SAS programming (the DATA step and basic rules of the SAS language)as well as manipulating SAS data with procedures such as PROC CONTENTS, PROCMEANS, and PROC FREQ provide adequate prerequisites for working with the SASprograms and techniques discussed in this book. Familiarity with basic features of theSAS macro language would be useful to run the SAS macro programs that accompanythis book. For programmers new to the SAS macro language, detailed instruction isprovided in Chapter 2 with information about using SAS.

About This Book viiWhat Should You Know about the Examples?Software Used to Develop the Book's ContentThe output in this book was created with SAS 9.4. Most programs in this book can berun using BASE SAS on the platform that you typically use. A few of the examples useprocedures found in the SAS/STAT software.Example Code and DataAll data used in the examples in this book was simulated. Any resemblance to actualdata sets is purely coincidental. Errors and other anomalies were purposely added tothe data to illustrate special features described in this book to clean, prepare, andperform quality control checks on your data.You can access the example code and data for this book by linking to its author page athttps://support.sas.com/chantala.Output and GraphicsAll output in this book was created with the SAS Output Delivery System. Your outputmight look slightly different because changes in the appearance of some tables haveoccurred during the formatting of this book.We Want to Hear from YouSAS Press books are written by SAS Users for SAS Users. We welcome your participationin their development and your feedback on SAS Press books that you are using. Pleasevisit sas.com/books to do the following: Sign up to review a book Recommend a topic Request information about how to become a SAS Press author Provide feedback on a bookDo you have questions about a SAS Press book that you are reading? Contact the authorthrough saspress@sas.com or https://support.sas.com/author feedback.SAS has many resources to help you find answers and expand your knowledge. If youneed additional help, see our list of resources: sas.com/books.

viii The Data Detective’s ToolkitLearn more about this author by visiting her author page athttp://support.sas.com/chantala. There you can download free book excerpts, accessexample code and data, read the latest reviews, get updates, and more.

About the AuthorKim Chantala is a Programmer Analyst in the Research ComputingDivision at RTI International with over 25 years of experience inmanaging and analyzing research data. Before joining RTIInternational, she was a data analyst at the University of NorthCarolina at Chapel Hill. In addition to providing data management andanalytical services at the University, she taught workshops onanalyzing survey data, focusing on the problems of sample weightsand design effects. Kim believes that the real challenge in dataanalysis is bridging the gap between raw or acquired data and datathat is ready to analyze. This inspired her to develop computerizeddata management tools revolutionizing the way data is prepared,allowing users to improve the quality of their data while lowering the cost of data preparation.Kim earned a BS in Engineering Physics from the Colorado School of Mines and an MS inBiometrics from the University of Colorado.Learn more about this author by visiting her author page at http://support.sas.com/chantala.There you can download free book excerpts, access example code and data, read the latestreviews, get updates, and more.

x The Data Detective’s Toolkit

Chapter 1: Advantages of Using theData Detective’s ToolkitIntroduction . 1An Overview of the Data Detective’s Toolkit . 2%TK codebook . 3%TK inventory. 4%TK xwalk. 5%TK find dups . 5%TK harmony. 5%TK skip edit . 6%TK max length . 6Summary. 7IntroductionYou will find the right data tools in this book for creating project data that is ready forexploration and analysis. Using these tools will reduce the amount of time needed to clean, edit,validate, and document your data. Advantages of using the techniques in this book include: Accomplishing more while doing less by automating and modernizing the typical datapreparation activities Beginning at the end by creating research-ready data sets and documentation early inthe project with continual updates and improvements throughout collection andpreparation Keeping the sponsor or lead research investigators engaged by pr

All data used in the examples in this book was simulated. Any resemblance to actual data sets is purely coincidental. Errors and other anomalies were purposely added to the data to illustrate special features described in this book to clean, prepare, and perf