DevOps Troubleshooting: Linux Server Best Practices




DevOps Troubleshooting
Linux Server Best Practices

Kyle Rankin

Upper Saddle River, NJ • Boston • Indianapolis • San Francisco • New York • Toronto • Montreal • London • Munich • Paris • Madrid • Capetown • Sydney • Tokyo • Singapore • Mexico City

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The author and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

The publisher offers excellent discounts on this book when ordered in quantity for bulk purchases or special sales, which may include electronic versions and/or custom covers and content particular to your business, training goals, marketing focus, and branding interests. For more information, please contact:

U.S. Corporate and Government Sales
(800) 382-3419
corpsales@pearsontechgroup.com

For sales outside the United States, please contact:

International Sales
international@pearson.com

Visit us on the Web:

Data is on file with the Library of Congress.

Editor-in-Chief: Mark Taub
Executive Editor: Debra Williams Cauley
Development Editor: Michael Thurston
Managing Editor: John Fuller
Project Editor: Elizabeth Ryan
Copy Editor: Rebecca Rider
Indexer: Richard Evans
Proofreader: Diane Freed
Technical Reviewer: Bill Childers
Publishing Coordinator: Kim Boedigheimer
Compositor: Kim Arney

Copyright 2013 Pearson Education, Inc.

All rights reserved. Printed in the United States of America. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. To obtain permission to use material from this work, please submit a written request to Pearson Education, Inc., Permissions Department, One Lake Street, Upper Saddle River, New Jersey 07458, or you may fax your request to (201) 236-3290.

ISBN-13: 978-0-321-83204-7
ISBN-10: 0-321-83204-3

Text printed in the United States on recycled paper at RR Donnelley in Crawfordsville, Indiana.

First printing, November 2012

This book wouldn’t be possible without the support of my wife, Joy, who once again helped me manage my time so I could complete the book, only this time while carrying our first child, Gideon. I’d also like to dedicate this book to my son, Gideon, who so far is easier to troubleshoot than any server.


Contents

Preface
Acknowledgments
About the Author

Chapter 1: Troubleshooting Best Practices
    Divide the Problem Space
    Practice Good Communication When Collaborating
    Conference Calls
    Direct Conversation
    Email
    Real-Time Chat Rooms
    Have a Backup Communication Method
    Favor Quick, Simple Tests over Slow, Complex Tests
    Favor Past Solutions
    Document Your Problems and Solutions
    Know What Changed
    Understand How Systems Work
    Use the Internet, but Carefully
    Resist Rebooting

Chapter 2: Why Is the Server So Slow? Running Out of CPU, RAM, and Disk I/O
    System Load
    What Is a High Load Average?
    Diagnose Load Problems with top
    Make Sense of top Output
    Diagnose High User Time
    Diagnose Out-of-Memory Issues
    Diagnose High I/O Wait
    Troubleshoot High Load after the Fact
    Configure sysstat
    View CPU Statistics
    View RAM Statistics
    View Disk Statistics
    View Statistics from Previous Days

Chapter 3: Why Won’t the System Boot? Solving Boot Problems
    The Linux Boot Process
    The BIOS
    GRUB and Linux Boot Loaders
    The Kernel and Initrd
    /sbin/init
    BIOS Boot Order
    Fix GRUB
    No GRUB Prompt
    Stage 1.5 GRUB Prompt
    Misconfigured GRUB Prompt
    Repair GRUB from the Live System
    Repair GRUB with a Rescue Disk
    Disable Splash Screens
    Can’t Mount the Root File System
    The Root Kernel Argument
    The Root Device Changed
    The Root Partition Is Corrupt or Failed
    Can’t Mount Secondary File Systems

Chapter 4: Why Can’t I Write to the Disk? Solving Full or Corrupt Disk Issues
    When the Disk Is Full
    Reserved Blocks
    Track Down the Largest Directories
    Out of Inodes
    The File System Is Read-Only
    Repair Corrupted File Systems
    Repair Software RAID

Chapter 5: Is the Server Down? Tracking Down the Source of Network Problems
    Server A Can’t Talk to Server B
    Client or Server Problem
    Is It Plugged In?
    Is the Interface Up?
    Is It on the Local Network?
    Is DNS Working?
    Can I Route to the Remote Host?
    Is the Remote Port Open?
    Test the Remote Host Locally
    Troubleshoot Slow Networks
    DNS Issues
    Find the Network Slowdown with traceroute
    Find What Is Using Your Bandwidth with iftop
    Packet Captures
    Use the tcpdump Tool
    Use Wireshark

Chapter 6: Why Won’t the Hostnames Resolve? Solving DNS Server Issues
    DNS Client Troubleshooting
    No Name Server Configured or Inaccessible Name Server
    Missing Search Path or Name Server Problem
    DNS Server Troubleshooting
    Understanding dig Output
    Trace a DNS Query
    Recursive Name Server Problems
    When Updates Don’t Take

Chapter 7: Why Didn’t My Email Go Through? Tracing Email Problems
    Trace an Email Request
    Understand Email Headers
    Problems Sending Email
    Client Can’t Communicate with the Outbound Mail Server
    Outbound Mail Server Won’t Allow Relay
    Outbound Mail Server Can’t Communicate with the Destination
    Problems Receiving Email
    Telnet Test Can’t Connect
    Telnet Can Connect, but the Message Is Rejected
    Pore Through the Mail Logs

Chapter 8: Is the Website Down? Tracking Down Web Server Problems
    Is the Server Running?
    Is the Remote Port Open?
    Test the Remote Host Locally
    Test a Web Server from the Command Line
    Test Web Servers with Curl
    Test Web Servers with Telnet
    HTTP Status Codes
    1xx Informational Codes
    2xx Successful Codes
    3xx Redirection Codes
    4xx Client Error Codes
    5xx Server Error Codes
    Parse Web Server Logs
    Get Web Server Statistics
    Solve Common Web Server Problems
    Configuration Problems
    Permissions Problems
    Sluggish or Unavailable Web

Chapter 9: Why Is the Database Slow? Tracking Down Database Problems
    Search Database Logs
    MySQL
    PostgreSQL
    Is the Database Running?
    MySQL
    PostgreSQL
    Get Database Metrics
    MySQL
    PostgreSQL
    Identify Slow

Chapter 10: It’s the Hardware’s Fault! Diagnosing Common Hardware Problems
    The Hard Drive Is Dying
    Test RAM for Errors
    Network Card Failures
    The Server Is Too Hot
    Power Supply Failures

Index


Preface

DevOps describes a world where developers, Quality Assurance (QA), and systems administrators work more closely together than in many traditional environments. Although DevOps is already recognized as a boon to rapid software deployment and automation, an often-overlooked benefit of the DevOps approach is the rapid problem solving that occurs when the whole team can collaborate to troubleshoot a problem on a system. Unfortunately, developers, QA, and sysadmins have gaps in their troubleshooting skills that they often resolve by blaming each other for problems on the system. This book aims to bridge those gaps and guide all groups through a standard set of troubleshooting practices that they can apply as a team to some of the most common Linux server problems.

Although the overall topics covered in the book are traditionally the domain of sysadmins, in a DevOps environment, developers and QA also find themselves troubleshooting network problems, setting up web servers, and diagnosing high load, even if they may not have a background in Linux administration. What makes this book more than just a sysadmin troubleshooting guide is the audience and focus. This book assumes the reader may not be a Linux sysadmin, but instead is a talented developer or QA engineer in a DevOps organization who may not have much system-level Linux experience. That said, if you are a sysadmin, you won’t be left out either. Included are troubleshooting techniques that can supplement the skills of even senior sysadmins, just written in an accessible way.

In a traditional enterprise environment without DevOps principles, troubleshooting is as dysfunctional as development is. When there is a server problem, if you can even get developers and sysadmins on the same call, you can expect everyone to fall into their traditional roles: the sysadmin will only look at server resources and logs; the developers will wait for the inevitable blame to be heaped on them for their “bloated” or “buggy” code, at which point they will complain about the unstable, underpowered server; or maybe everyone will redirect the blame at QA for not finding the problem before it hit production. All the while, the actual problem is not any closer to being solved.

In a DevOps organization, cooperation between all the teams is stressed, but when it comes to troubleshooting, often people still fall into their traditional roles even if there’s no blame game. Why? Well, even if everyone wants to work together, without the same troubleshooting skills and techniques, everyone may still be waiting on everyone else to troubleshoot their part. The goal of this book is to get every member of your DevOps team on the same page when it comes to Linux troubleshooting. When everyone has the same Linux troubleshooting skills, the QA team will be better able to diagnose problems before they hit production, developers will be better at tracking down why that latest check-in doubled the load on the system, and sysadmins can be more confident in their diagnoses, so when a problem strikes, everyone can pitch in to help.

This book is broken into ten chapters based on some of the most common problems you’ll face on Linux systems, and the chapters are ordered so that techniques you learn in some of the earlier chapters (particularly about how to diagnose high load and how to troubleshoot network problems) can be helpful as you get further into the book. That said, I realize you may not read this book cover-to-cover, but instead you will probably just turn to the chapter that’s relevant to your particular problem. So when topics in other chapters are helpful, I will point you to them.

Chapter 1: Troubleshooting Best Practices
Before you learn how to troubleshoot specific problems, it may be best to learn an overall approach to troubleshooting that you can apply to just about any kind of problem, even outside of Linux systems. This chapter talks about general troubleshooting principles that you will use when you try specific troubleshooting steps throughout the rest of the book.

Chapter 2: Why Is the Server So Slow? Running Out of CPU, RAM, and Disk I/O
This chapter introduces troubleshooting principles that you will apply to one of the most common problems you’ll have to solve: Why is the server slow? Whether you are in QA and are trying to figure out why the latest load test is running much slower; you are a developer trying to find out if your program is I/O bound, RAM bound, or CPU bound; or you are a sysadmin who isn’t sure whether a load of 8, 9, or 13 is OK, this chapter will give you all the techniques you need to solve load problems.

Chapter 3: Why Won’t the System Boot? Solving Boot Problems
Any number of different problems can stop a system from booting. Whether you have ever thought about the Linux boot process or not, this chapter helps you track down boot problems by first walking you through a healthy Linux boot process, and then discussing what it looks like when each stage in that boot process fails.

Chapter 4: Why Can’t I Write to the Disk? Solving Full or Corrupt Disk Issues
Just about anyone who has used Linux for a period of time has run across a system where they can’t write to the disk. It could be that you are a developer who enabled debugging in your logs and you accidentally filled the disk, or you could simply be the victim of file system corruption. In either case, this chapter helps you track down what directories are using up the most space on the system and how to repair corrupted file systems.

Chapter 5: Is the Server Down? Tracking Down the Source of Network Problems
No matter where you fit in a DevOps organization, network troubleshooting skills are invaluable. Sometimes it can be difficult to track down networking problems because they often impact a system in strange ways. This chapter walks you through how to isolate and diagnose a network problem step-by-step by testing problems on different network layers. This chapter also lays the groundwork for troubleshooting techniques for specific network services (such as DNS) covered in the rest of the book.

Chapter 6: Why Won’t the Hostnames Resolve? Solving DNS Server Issues
DNS can be one of the trickier services to troubleshoot because even though so much of the network relies on it, many users are unfamiliar with how it works. Whether you are a web developer who gets DNS service for your site on a web GUI via your registrar, or a sysadmin in charge of a full BIND instance, these DNS troubleshooting techniques will prove invaluable. This chapter will trace a normal, successful DNS request and then elaborate on the DNS troubleshooting covered in Chapter 5 with more specific techniques for finding problems in DNS zone transfers, caching issues, and even syntax errors.

Chapter 7: Why Didn’t My Email Go Through? Tracing Email Problems
Email was one of the first services on the Internet and still is an important way to communicate. Whether you are tracing why your automated test emails aren’t being sent, why yo
