Tutorials On Data Management - FOSTER Open Science

Transcription

Tutorials on DataManagementCC image by Cobalt123 on FlickrLesson 4: Data Collection Entry and ManipulationData Entry and Manipulation

Lesson TopicsCC image by JISC on Flickr Best Practices for Creating Data Files Data Entry Options Data Manipulation OptionsData Entry and Manipulation

Learning Objectives Recognize inconsistencies that can make a dataset difficult tounderstand and/or manipulate Describe characteristics of stable data formats and listreasons for using these formats Identify data entry tools Identify validation measures that can be performed as data isentered Describe the basic components of a relational databaseData Entry and Manipulation

The Data Life ribePreserveData Entry and Manipulation

Goals of Data Entry Create data sets that are:–CC image by Travis S on Flickr–ValidOrganized to support ease of useData Entry and Manipulation

Example: Poor Data Entry Data Entry and ManipulationInconsistency between data collection events Location of Date information Inconsistent Date format Column names Order of columns

Example: Poor Data Entry Data Entry and ManipulationInconsistency between data collection events Different site spellings, capitalization, spaces insite names—hard to filter Codes used for site names for some data, butspelled out for others Mean1 value is in Weight column Text and numbers in same column – what is themean of 12, “escaped 15”, and 91?

Best Practices Columns of data are consistent:only numbers, dates, or textConsistent Names, Codes, Formats (date) used in each columnData are all in one table, which is much easier for a statistical program to workwith than multiple small tables which each require human interventionData Entry and Manipulation

Best Practices Create descriptive column names without spaces or specialcharacters––Soil T30 Soil Temp 30cmSpecies-Code Species Code (avoid using -, ,*, in column names.Some software may interpret these symbols as an operator) Use a descriptive file name. For instance, a file namedSEV SmallMammalData v.5.25.2010.csv indicates the projectthe data is associated with (SEV), the theme of the data(SmallMammalData) and also when this version of the datawas created (v.5.25.2010). This name is much more helpfulthan a file named mydata.xls.Data Entry and Manipulation

Best Practices Missing data––––Preferably leave field empty (NULL no value)In numeric fields, use a distinct value such as 9999 to indicate amissing valueIn text fields, use NA (“Not Applicable” or “Not Available”)Use Data flags in a separate column to qualify missing valueDateTimeNO3 N 2008101114300.0182008101115000.001NO3 N Conc FlagM1 missing; no samplecollectedM1Data Entry and ManipulationE1E1 estimated fromgrab sample

Best Practices Enter complete lines of dataSorting anExcel file withempty cells isnot a goodidea!Data Entry and Manipulation

Best Practices For the long term, store data in a consistent format that canbe read well in to the future and that can be used by anyapplication now or in the future Appropriate file types include:––––Non-proprietary: Open, documented standardCommon usage by research community: Standard representation(ASCII, Unicode)UnencryptedUncompressed ASCII formatted files will be readable into the future–Use ASCII (comma-separated) for tabular dataData Entry and Manipulation

References1. Best Practices for Preparing Environmental Data Sets toShare and Archive. September 2010. Les A. Hook, Suresh K.Santhana Vannan, Tammy W. Beaty, Robert B. Cook, andBruce E. pdfData Entry and Manipulation

Data Entry Tools Googledocs Forms SpreadsheetsData Entry and Manipulation

Googledocs FormsData Entry and Manipulation

Googledocs FormsData Entry and Manipulation

Data Entry Tools: ExcelData Entry and Manipulation

Excel: Data Validation20Data Entry and Manipulation

Spreadsheet vs. Relational Database Great for charts, graphs,calculations Flexible about cell contenttype—cells in same columncan contain numbers or text Lack record integrity--cansort a column independentlyof all others) Easy to use – but harder tomaintain as complexity andsize of data growsData Entry and Manipulation Easy to query to selectportions of data Data fields are typed – Forexample, only integers areallowed in integer fields Columns cannot be sortedindependently of each other Steeper learning curve thana spreadsheet

What is a relational database?Sample sites*siteIDsite pleID*sampleIDsiteID siteIDsample datesample ngflag flagcommentscommentsSpecies*speciesIDspecies namecommon namefamilyorder A set of tables Relationships A commandlanguageData Entry and Manipulation

Database Features: Explicit control overdata typesDateSiteHeightFlowering dates only text only real numbers only ‘y’ and ‘n’ only Advantages quality control performanceData Entry and Manipulation

Relationships are defined between tablesDateSpeciesFlowering?2/13/2010 ABOGR2y2/13/2010 BHODRy4/15/2010 BBOER4y4/15/2010 CPLJAnMix andMatchdata onthe 107.5Data Entry and Manipulation

Powerful Command Language calledStructured Query Language (SQL)This table is called SoilTempDatePlotTreatmentSensorDepthSoil 0-02-02CR06.32010-02-02AN015.1SQL examples: Select Date, Plot, Treatment, SensorDepth, Soil Temperature fromSoilTemp where Date ‘2010-02-01’DatePlotTreatmentSensorDepthSoil ect * from SoilTemp where Treatment ‘N’ and SensorDepth ‘0’DatePlotTreatmentSensorDepthSoil Temperature2010-02-02AN015.1Data Entry and Manipulation

Data Entry with a Database Forms can be created that make entering data in to arelational database as easy as entering it in to Excel. Thescreenshot below shows embedded forms that were quicklygenerated in MS Access for adding data to three tables in adatabase of plant cover measurementsData Entry and Manipulation

ConclusionCC image by fo.ol on Flickr Be aware of Best Practices when designing data file structures Choose a data entry method that allows some validation ofdata as it is entered Consider investing time in learning how to use a database ifdatasets are large or complexData Entry and Manipulation

If you want to try a database: Consider trying one of these:––Personal, single-user databases can be developed in MS Access,which is stored as a file on the user’s computer. MS Access comeswith easy GUI tools to create databases, run queries, and writereports.A more robust database that is free, accommodates multiple usersand will run on Windows or Linux is MySQL. GUI interfaces forMySQL include phpMyadmin (free) and Navicat (inexpensive).Data Entry and Manipulation

To learn more about designing a relationaldatabase: Database Design for Mere Mortals: A Hands-On Guide toRelational Database Design (2nd Edition) by Michael J.Hernandez. Addison-Wesley. 2003.Data Entry and Manipulation

Data Manipulation Useful for analyzing, subsetting and transforming data Can be used to quality assure data Options include SAS, SPSS, R, and Matlab–Not Free SAS: Has outstanding supportSPSS: Has a user-friendly GUIMatlab: Analysis and Visualization platform that has “toolboxes”available for different disciplines, such as modeling or genomic analysesData Entry and Manipulation

R Free (http://www.r-project.org/index.html)Produces publication quality graphicsLots of forums from which to get helpSoftware (such as Kepler for developing workflows) willintegrate analytical components written in RData Entry and Manipulation

The full slide deck may be downloaded sted citation:DataONE Education Module: Data Entry and Manipulation.DataONE. Retrieved Nov12, 2012. Fromhttp://www.dataone.org/sites/all/documents/L04 DataEntryManipulation.pptxCopyright license information:No rights reserved; you may enhance and reuse foryour own purposes. We do ask that you provideappropriate citation and attribution to DataONE.Data Entry and Manipulation

Data Entry and Manipulation Missing data - Preferably leave field empty (NULL no value) - In numeric fields, use a distinct value such as 9999 to indicate a missing value - In text fields, use NA ("Not Applicable" or "Not Available") - Use Data flags in a separate column to qualify missing value Best Practices Date Time NO3_N_Conc NO3_N_Conc_Flag