Technical Note


The Development and Evaluation of a Behaviorally Based Rating Form for the Assessment of En Route Air Traffic Controller Performance

Jennifer J. Vardaman, Ph.D., PERI
Earl S. Stein, Ph.D., ACT-530

June 1998

DOT/FAA/CT-TN98/5

Document is available to the public through the National Technical Information Service, Springfield, Virginia 22161

U.S. Department of Transportation
Federal Aviation Administration
William J. Hughes Technical Center
Atlantic City International Airport, NJ 08405

NOTICE

This document is disseminated under the sponsorship of the U.S. Department of Transportation in the interest of information exchange. The United States Government assumes no liability for the contents or use thereof. The United States Government does not endorse products or manufacturers. Trade or manufacturers’ names appear herein solely because they are considered essential to the objective of this report.

Technical Report Documentation Page

1. Report No.: DOT/FAA/CT-TN98/5
2. Government Accession No.:
3. Recipient’s Catalog No.:
4. Title and Subtitle: The Development and Evaluation of a Behaviorally Based Rating Form for the Assessment of En Route Air Traffic Controller Performance
5. Report Date: June 1998
6. Performing Organization Code: ACT-530
7. Author(s): Jennifer J. Vardaman, Ph.D., PERI, and Earl S. Stein, Ph.D., ACT-530
8. Performing Organization Report No.: DOT/FAA/CT-TN98/5
9. Performing Organization Name and Address: Federal Aviation Administration, William J. Hughes Technical Center, Atlantic City International Airport, NJ 08405
10. Work Unit No. (TRAIS):
11. Contract or Grant No.: F2202K
12. Sponsoring Agency Name and Address: Federal Aviation Administration, Human Factors Division, 800 Independence Ave., S.W., Washington, DC 20591
13. Type of Report and Period Covered: Technical Note
14. Sponsoring Agency Code: AAR-100
15. Supplementary Notes:
16. Abstract: This project expanded and evaluated the performance evaluation method developed by Sollenberger, Stein, and Gromelski (1997), a Terminal Radar Approach Control rating form and training package designed to better assess air traffic controller performance. The form is a research-oriented testing and assessment tool designed to measure the efficacy of new air traffic control (ATC) systems, system enhancements, and operational procedures in simulation research. The rating form used in the present study focused on observable behaviors that supervisory air traffic control specialists (SATCSs) use to make behaviorally based ratings of en route controller performance. The present study evaluated the inter-rater and intra-rater reliability of performance ratings made by nine Air Route Traffic Control Center supervisors who viewed videotapes and computerized replays of controllers from a previously recorded en route study. The rating form contained 26 items, which were organized into six major categories. Various observable behaviors, which SATCSs identified as those they consider when assessing controller performance, anchored each performance area. Inter-rater (between rater) reliability of SATCS performance ratings, assessed using intra-class correlations, was somewhat low. Intra-rater (within rater) reliability of SATCS performance ratings was consistent with previous studies and indicated that raters were stable over time in the ratings they assigned. Researchers also investigated the relationship between SATCS performance ratings and personality traits from the Sixteen Personality Factor personality inventory. The results indicated that what SATCSs bring with them to the experimental evaluation setting, in terms of personality traits, may be related to their ratings. Future research efforts should concentrate on distinguishing the sources of measurement error and making whatever changes are necessary to produce a reliable controller performance assessment tool.
17. Key Words: En Route Air Traffic Control; Controller Performance Assessment; Air Traffic Control Simulation
18. Distribution Statement: This document is available to the public through the National Technical Information Service, Springfield, Virginia, 22161
19. Security Classif. (of this report): Unclassified
20. Security Classif. (of this page): Unclassified
21. No. of Pages: 65
22. Price:

Form DOT F 1700.7 (8-72)  Reproduction of completed page authorized

Acknowledgments

The authors wish to acknowledge several people who contributed to this study. Dave Cognata, Supervisory Air Traffic Control Specialist, Jacksonville Air Route Traffic Control Center, served as the subject matter expert on the project. Dr. Laurie Davidson, Princeton Economic Research, Inc. (PERI), audio recorded the training sessions and provided significant feedback on the comments made by the participants during the study. Albert Macias and Mary Delemarre, ACT-510, worked tirelessly to ensure that ATCoach ran to the satisfaction of the researchers. George Rowand, System Resources Corporation, spent countless hours editing videotapes and synchronizing the audio and videotapes each day of the study. Bill Hickman, ACT-510, and Mike Cullum and Bruce Slack, System Resources Corporation, served as the simulation pilots on the project. Dr. Earl Stein, ACT-530, served as the project lead for this study. Dr. Randy Sollenberger, ACT-530, assisted with participant training and drafting the test plan. Finally, Paul Stringer, Vice President - Aviation Division, PERI, also contributed to the design of this study.


Table of Contents

Acknowledgments
Executive Summary
1. Introduction
  1.1 Background
  1.2 Problem Statement
  1.3 Assumptions and Goals
  1.4 Purpose
    1.4.1 Observing and Rating Behavior
    1.4.2 Accommodating Subject Matter Experts
2. Method
  2.1 Participants
  2.2 Rating Form
  2.3 Airspace and Traffic Scenarios
  2.4 Simulation Facility
  2.5 Procedure
    2.5.1 Replay Files
3. Results
  3.1 Participant Ratings
    3.1.1 Inter-Rater Reliability of Participant Ratings
    3.1.2 Intra-Rater Reliability of Participant Ratings
  3.2 Relationship Between Participant Ratings and System Effectiveness Measures
  3.3 Intercorrelations Among Overall Performance Area Ratings
  3.4 Relationship Between Participant Ratings and Scores on the 16PF Personality Inventory
  3.5 Summary of Final Questionnaire
4. Discussion
  4.1 Reliability of Participant Ratings
    4.1.1 Inter-Rater Reliability
    4.1.2 Intra-Rater Reliability
  4.2 Relationship Between Participant Ratings and System Effectiveness Measures
  4.3 Intercorrelations Among Overall Performance Area Ratings
  4.4 Relationship Between Participant Ratings and Scores on the 16PF Personality Inventory
  4.5 Summary of Final Questionnaire
5. Conclusions
6. Recommendations
References

Table of Contents (Cont.)

Appendixes
A - Observer Rating Form -- TRACON
B - Observer Rating Form -- En Route
C - Background Questionnaire
D - Participants’ Air Traffic Control Training and Evaluation Experience
E - Participation Consent Form
F - Final Questionnaire
G - Hourly Schedule of Activities
H - Summary Sheet
I - Presentation Order of Scenarios
J - System Effectiveness Measures
K - 16PF Descriptive Statistics
L - Correlational Analysis Between Participant Ratings and Scores on 16PF Global Factors
M - Correlational Analysis Between Participant Ratings and Scores on 16PF Basic Factors

List of Illustrations

Tables
1. Participant Rating Grand Means
2. Inter-Rater Reliability for the En Route Rating Form
3. Inter-Rater Reliability for Endsley et al. (1997) Condition A and Condition B Scenarios
4. Intra-Rater Reliability for the En Route Rating Form
5. Performance Measures
6. Intercorrelations Among the Overall Performance Areas
7. Mean Weights Assigned to Each Performance Category

Executive Summary

In this second study on performance rating, researchers investigated the process used by supervisory air traffic control specialists (SATCSs) to rate en route air traffic control specialists (ATCSs). This project expanded and evaluated an earlier performance evaluation method developed for Terminal Radar Approach Control (TRACON) ATCSs. This rating form and training package was a testing and assessment tool to measure the efficacy of new air traffic control systems, system enhancements, and operational procedures in simulation research.

The rating form used in the present study focused on observable en route behaviors that SATCSs can use to make behaviorally based ratings of controller performance. The present study evaluated the reliability of the rating process by determining the level of agreement between ratings of air route traffic control center (ARTCC) supervisors who viewed videotapes and computerized graphical replays of controllers from a previously recorded en route study.

The en route rating form contained 26 items. However, participants concluded that they had insufficient information to rate two items. The performance areas were organized into six categories: Maintaining Safe and Efficient Traffic Flow, Maintaining Attention and Situational Awareness, Prioritizing, Providing Control Information, Technical Knowledge, and Communicating. Observable behaviors anchored each performance area. SATCSs identified these behaviors as those they consider when assessing ATCS performance. The rating form used an eight-point rating scale format with statements describing the applicable controller actions for each point. A comment section for each item provided space for participants to explain the ratings they assigned.

The study took place in the Research Development and Human Factors Laboratory (RDHFL) at the Federal Aviation Administration William J. Hughes Technical Center, Atlantic City International Airport, New Jersey. Nine en route SATCSs from five different ARTCCs participated as observers. The RDHFL video projection system presented three views of a previously recorded en route study. The primary view was a graphical playback of the traffic scenario that showed all the information on the controller’s radar display. Another view was an over-the-shoulder video recording of the controller’s upper body that showed interactions with the workstation equipment. The third view was a video recording of the traffic scenario as it appeared on the simulation pilot’s display. All three views were simultaneously presented on different screens and synchronized with an audio recording of the controllers and simulation pilots.

The researchers assessed two types of reliability: inter-rater and intra-rater. Inter-rater reliability refers to the uniformity of the ratings between participants, and intra-rater reliability refers to the uniformity of each participant’s ratings on repeated occasions.

The results of the present study indicated that the inter-rater reliability of the en route rating form ranged from r = .27 to r = .74. The overall ratings for each performance category were generally more reliable than the individual ratings included in each category. The intra-rater reliabilities were higher. Participants were more consistent individually over time than they were with each other when reviewing the same controller behavior.

There are several possible explanations for the low inter-rater reliability coefficients. First, participants concluded that they had specialized knowledge and wanted to take a very active role in the process of developing the rating form and its associated training. Second, the changes made to the form, even though recommended by the en route SATCSs who participated in the present study, may also have had an impact on inter-rater reliability. Finally, there were some problems with the simulation replay technology during the present study.

Researchers also investigated the relationship between participant ratings and selected personality traits. Participants completed the Sixteen Personality Factor (16PF) personality inventory. The results indicated that the personality traits participants bring with them to the experimental setting may be related to their ratings. Such traits are difficult to overcome with only 1 week of training in the experimental environment.

The performance rating form is a research-oriented assessment tool that provides data about controller performance not available from other sources. Future research efforts should focus on identifying the sources of measurement error and making whatever changes are necessary to produce a more reliable instrument.

1. Introduction

This is the second in a series of research studies involved in developing more effective performance rating procedures. The first study involved developing a performance rating form to test and assess simulation research using Terminal Radar Approach Control (TRACON) personnel. The present study concentrated on a performance rating form for en route air traffic control specialists (ATCSs).

1.1 Background

Sollenberger, Stein, and Gromelski (1997) conducted the first study. They developed the TRACON rating form to assess new air traffic control (ATC) systems, system enhancements, and operational procedures. They attempted to (a) build a reliable tool for measuring controller performance in the research setting; (b) improve the quality of ATC performance evaluations; (c) improve the quality, reliability, and comprehensiveness of ATC evaluations and tests in the research setting; and (d) identify criteria for evaluating controller performance.

The Sollenberger et al. (1997) study indicated that the rating process was workable in a TRACON environment. It also identified the performance areas that were more difficult for participants to evaluate consistently, possibly due to misunderstanding rating criteria or overlooking critical controller actions. Finally, the study demonstrated the feasibility of using video and computerized graphical playback technology as a presentation method for evaluating controller performance.

1.2 Problem Statement

Human performance is essential to overall system performance. The decisions humans make and how they act on them directly impact the degree to which the system achieves its goals. There is, however, disagreement on what role the human plays in the system and what makes up human performance. Most systems have some definition of minimum essential performance for their human operators, but they do not distinguish levels of performance quality above the minimum level. The problem, then, is this: if standards of performance are not well defined, how do subject matter experts (SMEs) know what constitutes “acceptable” or “unacceptable” performance?

Researchers at the Federal Aviation Administration (FAA) William J. Hughes Technical Center have studied human performance issues for many years. Much of this research has stressed system effectiveness measures (SEMs) that can be collected in real time during ATC simulations. SEMs are objective measures that can be collected and analyzed to assess the effects of new ATC systems and procedures on controller performance.

1.3 Assumptions and Goals

Sollenberger et al. (1997) conducted a study to determine if SEMs are related to how SMEs evaluate controller performance. The authors investigated whether or not SMEs could be trained to evaluate ATCS performance so that they were looking at the same behaviors and assigning similar values to them. They also investigated whether or not SMEs’ combined performance evaluations are related to the SEMs, assuming that the SMEs’ ratings are reliable.

Sollenberger et al. (1997) believed that it is possible to train supervisory ATCSs (SATCSs) to objectively observe and evaluate controller performance. SATCSs are experienced with FAA Form 3120-25, the ATCT/ARTCC OJT Instructor Evaluation Report. The authors assumed that FAA Form 3120-25 could be improved and that, when supported by a training curriculum, performance-rating quality would also improve. They did not intend to develop a performance evaluation form to replace FAA Form 3120-25. Rather, they intended to develop an observational performance rating system that could be used to validate other measurement systems.

Performance can vary along a continuum of quality based on a variety of variables. One important variable is the human operator, who must complete specific tasks that are assessed in relation to a known standard. If the operator’s performance exceeds that standard, it is labeled “acceptable,” but, if the operator’s performance fails to meet that standard, it is labeled “unacceptable.”

1.4 Purpose

The purpose of the present study was threefold: (1) determine the reliability of participant ratings of controller performance obtained via the en route rating form; (2) determine the relationship between participant ratings and selected personality traits; and (3) further investigate the feasibility of using video and computerized graphical playback technology as a controller performance evaluation method.

1.4.1 Observing and Rating Behavior

SMEs evaluate performance. However, they sometimes apply their personal standard rather than the known standard. Personal standards are often influenced by the SME’s experience, training, peer performance, and organizational standards (Anastasi, 1988). Real-time performance ratings must focus on concrete, observable behaviors. Even though the purpose of the rating should not influence the quality of the rating design or execution, it sometimes does.

Anastasi (1988) discussed using ratings as criterion measures for the verification of principally predictive indicators. The author stated that, despite technical flaws and biases of evaluators, ratings are important sources of criterion information when they are collected under systematic conditions. She emphasized the importance of evaluator/rater training to increase reliability and validity while reducing common judgmental errors. Training can take many forms, but anything that heightens an evaluator’s observational skills will probably improve rating quality, which affects reliability.

This study evaluated two types of reliability: inter-rater and intra-rater. Inter-rater reliability refers to the agreement between two or more independent raters. Intra-rater reliability refers to the consistency of an individual rater over time. Performance ratings can be sources of measurement error, so it is important to evaluate the consistency of such ratings. Inter-rater reliability is often evaluated through intra-class correlations, and researchers evaluate intra-rater reliability with Pearson’s product moment correlations. Some standardized instruments have obtained reliabilities that are considered acceptable, with r = .85 or better (Gay, 1987, p. 141).
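To make the two statistics concrete, the following is a minimal sketch in Python; it is not part of the original report, the rating data are hypothetical, and the one-way ICC shown is only one of several ICC forms a study of this kind might use.

import numpy as np

def icc_oneway(ratings):
    """One-way random-effects intra-class correlation, ICC(1).

    ratings: 2-D array with one row per rated performance (target)
    and one column per rater.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand_mean = ratings.mean()
    target_means = ratings.mean(axis=1)
    # Between-target and within-target mean squares from a one-way ANOVA.
    ms_between = k * np.sum((target_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((ratings - target_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def pearson_r(x, y):
    """Pearson product-moment correlation between two rating passes."""
    return np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1]

# Hypothetical data: 5 controller performances rated by 3 raters on an 8-point scale.
ratings = np.array([[6, 7, 6],
                    [4, 5, 4],
                    [8, 7, 8],
                    [3, 4, 3],
                    [5, 5, 6]])
print(f"Inter-rater ICC(1): {icc_oneway(ratings):.2f}")

# Intra-rater reliability: the same rater scoring the same replays on two occasions.
first_pass = [6, 4, 8, 3, 5]
second_pass = [7, 4, 8, 3, 6]
print(f"Intra-rater r: {pearson_r(first_pass, second_pass):.2f}")

In practice, the choice of ICC form (one-way versus two-way, single versus averaged ratings) affects the coefficient, so the form should match the rating design.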

FAA researchers assess the reliabilities of many types of ratings, including over-the-shoulder (OTS) observational ratings. ATCSs have employed OTS observational ratings since the initiation of the ATC system, and ATCSs believe they are qualified to observe and evaluate each other. However, a controversy exists over the value of observational performance ratings as compared to objective data obtained in the laboratory. One problem is that ATCSs are very decisive, and it can be hard to change their ideas about performance evaluation. When observing the same behavior at the same time under the same conditions, evaluators who have not been trained to observe systematically may produce different results from trained evaluators. Under such circumstances, inter-rater reliability decreases.

OTS observational ratings have, however, often been used in ATC simulation research. Buckley, DeBaryshe, Hitchner, and Kohn (1983) included observational ratings in their performance evaluation system. Two observers completed performance evaluations every 10 minutes during the simulations. They used a 10-point scale to rate two areas: overall system effectiveness and individual controller judgment/technique. Inter-rater reliability ranged from .06 to .72.

1.4.2 Accommodating Subject Matter Experts

There are advantages and disadvantages to accommodating SMEs. The primary advantage is that they can make suggestions for changes to the ATCS performance-rating form that would increase its realism and its applicability to the field setting. Involving SMEs also makes greater participant buy-in possible. However, the corresponding disadvantage is that incorporating such suggestions may render the form facility- or use-specific. That is, if researchers incorporate SATCSs’ suggestions into the form, and some of those suggestions apply only to a participant’s particular facility, the form would be useless elsewhere. The rating form used in this study was intended to be a research tool only, not a replacement for the evaluation form currently used in the field. Therefore, researchers included only those suggestions that related to observable behaviors that could be evaluated both by the form and in the research environment currently in use at the Research Development and Human Factors Laboratory (RDHFL). A related disadvantage is that SMEs bring to the research environment personal and facility biases that can influence the research process. When observing and evaluating controller performance as a group, the goal is for the SMEs to adopt mutual rating criteria in making their evaluations. If SMEs were using the same criteria in making their evaluations, researchers would be better able to assess the validity and usefulness of the rating form. SME biases should be addressed by including comments and suggested items in the form, but not if those items cannot be behaviorally evaluated.

2. Method

2.1 Participants

Nine SATCSs from five different air route traffic control centers (ARTCCs) participated in the present study. They ranged in age from 31 to 54 years (M = 44.56, SD = 7.45). The participants were full performance level SMEs with current experience in controlling traffic at their respective ARTCCs. They had actively controlled traffic during 11 to 12 of the previous 12 months (M = 11.89, SD = 0.33). They had from 9 to 29 years of experience controlling air traffic (M = 20.00, SD = 6.16), including from 1½ to 20 years of experience training and evaluating controllers (M = 13.94, SD = 5.75). Finally, the participants had normal vision correctable to 20/30 with glasses.

2.2 Rating Form

The Sollenberger et al. (1997) TRACON rating form (see Appendix A) was the basis for developing the en route form. The TRACON rating form contained 24 items that assessed different areas of controller performance. The authors organized the performance areas into six categories, with an overall rating scale included for each category. Participants identified various observable behaviors that should be considered when assessing controller performance in each performance area. The form used an eight-point scale format, with statements describing the necessary controller actions for each scale point. A comment section encouraging participants to write as much as possible appeared at the end of the form. This kept them oriented on controller behavior and helped to reduce their dependence on memory when assigning numerical ratings.

The en route rating form (see Appendix B) contained 26 items, including two 2-question items (items 15 and 19). However, participants concluded that they had insufficient information to rate items 13 and 18. The en route SATCSs gave significant input on organizing the rating form, and the researchers revised it according to their suggestions. They changed items 15 and 16 in the TRACON form to items 15A and 15B, added items 16 and 19B, and changed item 19 to item 19A. Further, the en route rating form provided space for comments after each item, with space for general comments at the end. Finally, per technical instructions given to the researchers by the project technical lead, the N/A choice was eliminated from the rating scale in the en route form to discourage avoidance of an item. Instead, participants wrote N/A next to those items that they felt did not apply. The en route rating form included instructions on how to use the form and some assumptions about ATC and controller performance.
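As an illustration only (none of the code or naming below comes from the original study), the form’s organization lends itself to a simple data representation: each item carries an item number, one of the six performance categories, an optional 1-8 rating (with None standing in for a written N/A), and a comment, and category-level summaries average the rated items. A minimal sketch in Python:

from dataclasses import dataclass
from typing import Optional

@dataclass
class RatingItem:
    """One behaviorally anchored item on the en route rating form."""
    number: str                    # e.g., "15A" after the en route renumbering
    category: str                  # one of the six performance categories
    rating: Optional[int] = None   # 1-8 scale; None when the rater wrote "N/A"
    comment: str = ""              # the rater's written justification

CATEGORIES = [
    "Maintaining Safe and Efficient Traffic Flow",
    "Maintaining Attention and Situational Awareness",
    "Prioritizing",
    "Providing Control Information",
    "Technical Knowledge",
    "Communicating",
]

def category_means(items):
    """Average the rated (non-N/A) items within each performance category."""
    means = {}
    for cat in CATEGORIES:
        scores = [i.rating for i in items
                  if i.category == cat and i.rating is not None]
        if scores:
            means[cat] = sum(scores) / len(scores)
    return means

# Hypothetical ratings for three items from one observation.
items = [
    RatingItem("1", CATEGORIES[0], rating=6, comment="Good sequencing"),
    RatingItem("2", CATEGORIES[0], rating=7),
    RatingItem("15A", CATEGORIES[3], rating=5),
]
print(category_means(items))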
2.3 Airspace and Traffic Scenarios

The replay files used in the present study were recorded during a simulation study that investigated the effects of free flight conditions on controller performance, workload, and situation awareness (Endsley, Mogford, Allendoerfer, Snyder, & Stein, 1997). During that study, 10 controllers from the Jacksonville ARTCC (ZJX) worked traffic scenarios using the Greencove/Keystone sector, a combined high altitude sector.

Greencove/Keystone is responsible for altitudes of flight level (FL) 240 and higher and has four primary traffic flows. Southbound aircraft enter Greencove/Keystone from the northeast and northwest and continue south and southeast toward Fort Lauderdale, Miami, and West Palm Beach along the J45 or J79 airways. These aircraft are usually at their final altitude when they enter Greencove/Keystone. Some northbound aircraft leave Orlando International Airport and travel north or northwest along the J53 or J81 airways. They usually contact the sector at about FL 180 while climbing to an interim altitude of FL 230, and they are cleared to their final altitude when feasible. Other northbound aircraft depart from southeast Florida and enter Greencove/Keystone in the south, near Orlando. These aircraft continue north and northwest along the J53 and J81 airways. They are usually at their final altitude when they enter the sector but occasionally may need the controller to clear them to their final altitude. For Endsley et al.’s (1997) purposes, these aircraft were at their final altitude when they reached the sector.

The Greencove/Keystone sector is bordered below by the St. Augustine and St. Johns sectors, on the northeast by the States/Hunter combined sector, on the north-northwest by the Alma/Moultrie combined sector, on the west by the Lake City/Ocala sector, on the southwest by the Mayo sector, on the south by the Miami ARTCC (ZMA) Boyel sector, and on the south-southeast by the ZMA Hobee sector. For Endsley et al.’s (1997) purposes, all adjacent sectors accepted all handoffs and approved all point-outs. Greencove/Keystone is bordered on the east by a warning area that is controlled by the US Navy. Civilian aircraft may enter the warning area only with special permission. For Endsley et al.’s purposes, the warning area was considered to be active, so no civilian aircraft were permitted to enter the area.

Endsley et al. (1997) used four types of scenarios. The present study incorporated only two of the four free flight study scenario types. The “condition A” scenarios included current ATC procedures. The “condition B” scen…
