Troubleshooting Your SUSE OpenStack Cloud TUT19873

Transcription

Troubleshooting Your SUSE OpenStack Cloud TUT19873
Adam Spiers, SUSE Cloud/HA Senior Software Engineer
Dirk Müller, SUSE OpenStack Senior Software Engineer

SUSE OpenStack Cloud

SUSE OpenStack Cloud: 4653 Parameters

SUSE OpenStack Cloud: 14 Components

SUSE OpenStack Cloud: 2 Hours

SUSE OpenStack Cloud Troubleshooting: 1 Hour

SUSE OpenStack Cloud architecture diagram: OpenStack services (Dashboard/Horizon, Compute/Nova, Heat, Cloud APIs), SUSE-added tools and management, hypervisors (Xen, KVM, VMware), SUSE Linux Enterprise Server 11 SP3, physical infrastructure (x86-64, switches, storage), and partner solutions.

Non-HA SUSE OpenStack Cloud

HA SUSE OpenStack Cloud

Crowbar and Chef

Generic SLES Troubleshooting
All nodes in SUSE OpenStack Cloud are SLES-based. Watch out for typical issues:
- dmesg for hardware-related errors, OOM, interesting kernel messages
- usual syslog targets, e.g. /var/log/messages
Check general node health via:
- top, vmstat, uptime, pstree, free
- core files, zombies, etc.

Supportconfig
- supportconfig can be run on any cloud node
- supportutils-plugin-susecloud.rpm
  - installed on all SUSE OpenStack Cloud nodes automatically
  - collects precious cloud-specific information for further analysis
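
A minimal sketch of collecting a support archive on an affected node; the exact archive name and location depend on your supportconfig version, and the cloud plugin data is included automatically once the RPM above is installed:

  supportconfig              # run on the affected node; writes a compressed archive under /var/log
  ls -lh /var/log/*.tbz      # locate the resulting archive to attach to a support case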

Admin Node: Crowbar UI
Useful Export Page available in the Crowbar UI in order to export various log files.

Cloud Install
- install-suse-cloud runs inside screen
- logs: /var/log/crowbar/barclamp_install/*.log
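
If the installation appears to hang, you can reattach to the screen session it runs in and follow the install logs; a small sketch (the session name may differ on your admin node, and the log path is the one listed above):

  screen -ls                                        # list running screen sessions
  screen -r                                         # reattach to the install session
  tail -f /var/log/crowbar/barclamp_install/*.log   # follow barclamp install progress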

SUSE OpenStack Cloud Admin Node
Diagram: SUSE Cloud add-on (Crowbar UI, Crowbar services, Chef/RabbitMQ, repo mirror) on top of SLES 11 SP3.
- Chef/CouchDB log: .../couchdb/couchdb.log
- Crowbar log: /var/log/crowbar/production.{out,log}

Chef
Cloud uses Chef for almost everything:
- All Cloud and SLES non-core packages
- All config files are overwritten
- All daemons are started
- Database tables are initialized
http://docs.getchef.com/chef_quick_overview.html

Admin Node: Using Chef
  knife node list
  knife node show <nodeid>
  knife search node "*:*"
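
A few hedged examples of how these commands are typically used on the admin node; the node name and role shown are hypothetical, so substitute names from your own knife node list output:

  knife node list
  knife node show d52-54-00-12-34-56.example.com -a run_list    # show a single attribute of one node
  knife search node "roles:nova-multi-controller" -i            # list nodes carrying a given role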

SUSE OpenStack Cloud Admin
- Populate /root/.ssh/authorized_keys prior to install
- Barclamp install logs: /var/log/crowbar/barclamp_install
- Node discovery logs: /var/log/crowbar/sledgehammer/d<macid>.<domain>.log
- Syslog of Crowbar-installed nodes sent via rsyslog to: /var/log/nodes/d<macid>.log

Useful Tricks
- Root login to the Cloud-installed nodes should be possible from the admin node (even in discovery stage)
- If the admin network is reachable, add to ~/.ssh/config:
    host 192.168.124.*
      StrictHostKeyChecking no
      user root

SUSE OpenStack Cloud Admin
- If a proposal is applied, chef-client logs are at: /var/log/crowbar/chef-client/<macid>.<domain>.log
- Useful crowbar commands:
    crowbar machines help
    crowbar transition <node> <state>
    crowbar <barclamp> proposal list|show <name>
    crowbar <barclamp> proposal delete default
    crowbar reset nodes
    crowbar reset proposal <barclamp> default
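
A hedged example of inspecting a proposal with these commands; the barclamp name "network" is only illustrative, so use whichever barclamp you are actually debugging:

  crowbar network proposal list
  crowbar network proposal show default
  crowbar network proposal delete default   # only if you really want to drop and recreate it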

Admin Node: Crowbar Services
- Nodes are deployed via PXE boot: /srv/tftpboot/discovery/pxelinux.cfg/*
- Installed via AutoYaST; profile generated to: /srv/tftpboot/nodes/d<mac>.<domain>/autoyast.xml
- Can delete & rerun chef-client on the admin node
- Can add useful settings to autoyast.xml, for example:
    <confirm config:type="boolean">true</confirm>
  (don't forget to chattr +i the file; see the sketch below)
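
A small sketch of hand-editing a generated profile so that it is not regenerated; the node directory name is hypothetical, and chattr +i makes the file immutable so Chef cannot overwrite it until you remove the flag again:

  f=/srv/tftpboot/nodes/d52-54-00-12-34-56.example.com/autoyast.xml
  chattr -i "$f"        # drop immutability if it was already set
  vi "$f"               # e.g. add <confirm config:type="boolean">true</confirm>
  chattr +i "$f"        # protect the edited file from regeneration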

Admin Node: Crowbar UI
Raw settings in barclamp proposals allow access to "expert" (hidden) options. Most interesting are:
  debug: true
  verbose: true

Admin Node: Crowbar Gotchas

Admin Node: Crowbar Gotchas
- Be patient
  - Only transition one node at a time
  - Only apply one proposal at a time
- Cloud nodes should boot from:
  1. Network
  2. First disk

SUSE OpenStack Cloud Node
Diagram: node-specific services and the Chef client on SLES 11 SP3, all managed via Chef.
- /var/log/chef/client.log
- rcchef-client status
- chef-client can be invoked manually

SUSE OpenStack Cloud Control Node
Diagram: OpenStack API services and the Chef client on SLES 11 SP3. Just like any other cloud node:
- /var/log/chef/client.log
- rcchef-client status
- Chef overwrites all config files it touches: chattr +i is your friend (example below)
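
A hedged illustration of protecting a hand-edited config file from the next chef-client run; keystone.conf is just an example path, and the flag must be removed again before re-applying proposals or they will fail:

  chattr +i /etc/keystone/keystone.conf   # make the file immutable
  lsattr /etc/keystone/keystone.conf      # verify the 'i' flag is set
  chattr -i /etc/keystone/keystone.conf   # undo before applying proposals again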

High Availability

What is High Availability?
- Availability = Uptime / Total Time
  - 99.99% ("4 nines"): ~53 minutes downtime / year
  - 99.999% ("5 nines"): ~5 minutes downtime / year
- High Availability (HA)
  - Typically accounts for mild / moderate failure scenarios
  - e.g. hardware failures and recoverable software errors
  - automated recovery by restarting / migrating services
- HA != Disaster Recovery (DR)
  - Cross-site failover
  - Partially or fully automated
- HA != Fault Tolerance
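
The downtime figures follow directly from the availability definition; a quick sanity check in the shell:

  echo "(1 - 0.9999) * 365.25 * 24 * 60" | bc -l    # "four nines": ~52.6 minutes of downtime per year
  echo "(1 - 0.99999) * 365.25 * 24 * 60" | bc -l   # "five nines": ~5.3 minutes per year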

Internal architecture

Resource Agents
- Executables which start / stop / monitor resources
- RA types:
  - LSB init scripts
  - OCF scripts (LSB + meta-data + monitor action + ...), in /usr/lib/ocf/resource.d/
  - Legacy Heartbeat RAs (ancient, irrelevant)
  - systemd services (in HA for SLE12)

Results of resource failures
- If the fail counter is exceeded, clean-up is required: crm resource cleanup <resource> (example below)
- Failures are expected:
  - when a node dies
  - when storage or network failures occur
- Failures are not expected during normal operation:
  - applying a proposal
  - starting or cleanly stopping resources or nodes
- Unexpected failures usually indicate a bug!
  - Do not get into the habit of cleaning up and ignoring!
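
A hedged sketch of inspecting fail counts before cleaning up; the resource and node names are taken from the examples later in this deck and are only illustrative:

  crm_mon -1 -rf                                                 # one-shot status including fail counts
  crm resource failcount neutron-ha-tool show d52-54-01-77-77-01 # per-resource, per-node counter
  crm resource cleanup neutron-ha-tool                           # only once the cause is understood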

Before diagnosis
- Understand initial state / context
  - crm configure graph is awesome!
  - crm_mon
  - Which fencing devices are in use?
  - What's the network topology?
  - What was done leading up to the failure?
- Look for first (relevant) failure
  - Failures can cascade, so don't confuse cause and effect
  - Watch out for STONITH

crm configure graph FTW!

Diagnosis
- What failed?
  - Resource?
  - Node?
  - Orchestration via Crowbar / chef-client? (cross-cluster ordering)
  - Pacemaker config? (e.g. incorrect constraints)
  - Corosync / cluster communications?
- chef-client logs are usually a good place to start
- More on logging later

Symptoms of resource failures
- Failures reported via Pacemaker UIs:
    Failed actions:
      neutron-ha-tool_start_0 on d52-54-01-77-77-01 'unknown error' (1): call=281,
        status=complete, last-rc-change='Thu Jun  4 16:15:14 2015', queued=0ms, exec=1734ms
      neutron-ha-tool_start_0 on d52-54-02-77-77-02 'unknown error' (1): call=259,
        status=complete, last-rc-change='Thu Jun  4 16:17:50 2015', queued=0ms, exec=392ms
- Services become temporarily or permanently unavailable
- Services migrate to another cluster node

Symptoms of node failures
- Services become temporarily or permanently unavailable, or migrate to another cluster node
- Node is unexpectedly rebooted (STONITH)
- Crowbar web UI may show a red bubble icon next to a controller node
- Hawk web UI stops responding on one of the controller nodes (should still be able to use the others)
- ssh connection to a cluster node freezes

Symptoms of orchestration failures
- Proposal / chef-client failed
- Synchronization time-outs are common and obvious:
    INFO: Processing crowbar-pacemaker_sync_mark[wait-keystone_database] action guess (keystone::server line 232)
    INFO: Checking if cluster founder has set keystone_database to 5.
    FATAL: Cluster founder didn't set keystone_database to 5!
- Find the synchronization mark in the recipe:
    crowbar_pacemaker_sync_mark "wait-keystone_database"
    # Create the Keystone Database
    database "create #{node[:keystone][:db][:database]} database" do
      ...
- So the node timed out waiting for the cluster founder to create the keystone database, i.e. you're looking at the wrong log! So (see the sketch below):
    root@crowbar:~ # knife search node founder:true -i
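
A hedged sketch of following that trail: locate the founder with the command above, then read its chef-client log rather than the log of the node that reported the timeout (the node names are hypothetical):

  knife search node founder:true -i                 # prints the founder's node name
  ssh d52-54-01-77-77-01 tail -n 200 /var/log/chef/client.log
  # or, for proposal runs, on the admin node:
  less /var/log/crowbar/chef-client/d52-54-01-77-77-01.example.com.log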

Logging
- All changes to cluster configuration are driven by chef-client
  - either from application of a barclamp proposal
    - admin node: /var/log/crowbar/chef-client/<NODE>.log
  - or run by the chef-client daemon every 900 seconds
    - /var/log/chef/client.log on each node
- Remember chef-client often runs in parallel across nodes
- All HAE components log to /var/log/messages on each cluster node
  - Nothing Pacemaker-related on the admin node

HAE logs
Which nodes' log files to look at?
- Node failures:
  - /var/log/messages from DC
- Resource failures:
  - /var/log/messages from DC and node with failed resource
  - but remember the DC can move around (elections)
- Use hb_report or crm history or Hawk to assemble a chronological cross-cluster log (example below)
  - Saves a lot of pain – strongly recommended!
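
A hedged example of assembling a cross-cluster report for a known time window; the timestamps and destination path are illustrative, and the command is run on one cluster node with ssh access to the others:

  hb_report -f "2015/06/04 16:00" -t "2015/06/04 17:00" /tmp/cluster-report
  # typically produces /tmp/cluster-report.tar.bz2 with merged logs and CIB history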

Syslog messages to look out for
- Fencing going wrong:
    pengine[16374]: warning: cluster_status: We do not have quorum - fencing and resource management disabled
    pengine[16374]: warning: stage6: Node d52-54-08-77-77-08 is unclean!
    pengine[16374]: warning: stage6: Node d52-54-0a-77-77-0a is unclean!
    pengine[16374]: notice: stage6: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
- Fencing going right:
    crmd[16376]: notice: te_fence_node: Executing reboot fencing operation (66) on d52-54-0a-77-77-0a (timeout=60000)
    stonith-ng[16371]: notice: handle_request: Client crmd.16376.f6100750 wants to fence (reboot) 'd52-54-0a-77-77-0a' with device '(any)'
  - Reason for fencing is almost always earlier in the log
  - Don't forget all the possible reasons for fencing!
- Lots more – get used to reading /var/log/messages!
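
A small helper for finding such messages quickly on a cluster node; the pattern list is a starting point, not exhaustive:

  grep -E "quorum|unclean|fence|stonith" /var/log/messages | less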

Stabilising / recovering a cluster
- Start with a single node
  - Stop all others:
      rcchef-client stop
      rcopenais stop
      rccrowbar_join stop
  - Clean up any failures:
    - crm resource cleanup <resource>
    - crm_resource -C is buggy
    -
        crm_resource -o | \
          awk '/\tStopped|Timed Out/ { print $1 }' | \
          xargs -n1 crm resource cleanup
  - Make sure chef-client is happy

Stabilising / recovering a cluster (cont.)
- Add another node in:
    rm /var/spool/corosync/block_automatic_start
    service openais start
    service crowbar_join start
  - Ensure nothing gets fenced
  - Ensure no resource failures
  - If fencing happens, check /var/log/messages to find out why, then rectify the cause
- Repeat until all nodes are in the cluster

Degrade Cluster for Debugging
  crm configure location fixup-cl-apache cl-apache \
    rule -inf: '#uname' eq HOSTNAME
- Allows degrading an Active/Active resource to only one instance per cluster
- Useful for tracing requests
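
For example, to keep the clone off one specific controller and later undo the change; the hostname is hypothetical and the constraint id matches the one used above:

  crm configure location fixup-cl-apache cl-apache \
    rule -inf: '#uname' eq d52-54-01-77-77-01
  # once done tracing, remove the constraint again:
  crm configure delete fixup-cl-apache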

TL;DR: Just Enough HA
  crm resource list
  crm_mon
  crm resource restart <X>
  crm resource cleanup <X>

Now Coming to OpenStack

OpenStack Architecture Diagram

OpenStack block diagram (annotations: "Accesses almost everything"; "Keystone: SPOF")

OpenStack Architecture
Typically each OpenStack component provides:
- an API daemon / service
- one or many backend daemons that do the actual work
- an openstack / <prj> command line client to access the API
- a <proj>-manage client for admin-only functionality
- a dashboard ("Horizon") Admin tab for a graphical view on the service
- uses an SQL database for storing state
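
A hedged illustration of that pattern using Nova on a SLES-based node; the service names follow the rcopenstack-* init script convention shown later in this deck, so adjust them for the component you are debugging:

  rcopenstack-nova-api status         # the API service
  rcopenstack-nova-scheduler status   # one of the backend daemons
  nova list                           # project CLI talking to the API
  nova-manage service list            # admin-only management client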

OpenStack Packaging Basics
- Packages are usually named: openstack-<codename>
  - usually a subpackage for each service (-api, -scheduler, etc.)
  - log to /var/log/<codename>/<service>.log
  - each service has an init script:
      dde-ad-be-ff-00-01:~ # rcopenstack-glance-api status
      Checking for service glance-api ... running

OpenStack Debugging Basics
- Log files often lack useful information without verbose enabled
- TRACEs of processes are not logged without verbose
- Many reasons for API error messages are not logged unless debug is turned on
- Debug is very verbose (10GB per ...)
- ...openstack.org/icehouse/
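
Where to turn these options on: in SUSE OpenStack Cloud the preferred place is the barclamp's raw settings (debug: true / verbose: true, as shown earlier), since Chef overwrites hand-edited files. For a quick test on an unmanaged box, the equivalent config snippet would look like this; nova.conf is just an example, and other Icehouse-era services use the same [DEFAULT] options:

  # /etc/nova/nova.conf (excerpt)
  [DEFAULT]
  verbose = true
  debug = true

  # then restart the affected services, e.g.:
  # rcopenstack-nova-api restart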

OpenStack architecture (block diagram repeated; annotations: "Accesses almost everything"; "Keystone: SPOF")

OpenStack Dashboard: Horizon
- Logs: /var/log/apache2/openstack-dashboard-error_log
- Get the exact URL it tries to access!
- Enable "debug" in the Horizon barclamp
- Test components individually

OpenStack Identity: Keystone
- Needed to access all services
- Needed by all services for checking authorization
- Use keystone token-get to validate credentials and test service availability
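
A hedged quick check from a controller node; the credentials file location is an assumption and varies per deployment:

  source ~/.openrc          # load OS_USERNAME / OS_PASSWORD / OS_AUTH_URL (path is an assumption)
  keystone token-get        # succeeds only if credentials and the Keystone API are OK
  keystone endpoint-list    # verify the service catalog looks sane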

OpenStack Imaging: Glance
To validate service availability:
  glance image-list
  glance image-download <id> > /dev/null
  glance image-show <id>
Don't forget the hypervisor_type property!

OpenStack Compute: Nova
  nova-manage service list
  nova-manage logs errors
  nova show <id>   (shows the compute node)
  virsh list, virsh dumpxml
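
A hedged end-to-end sequence for tracking down a broken instance; the instance ID, compute node name, and libvirt domain name are placeholders:

  nova list
  nova show 3fa85f64-...                  # note the compute host and any fault message
  nova-manage service list                # is nova-compute alive on that host?
  ssh d52-54-03-77-77-03                  # then inspect the hypervisor directly
  virsh list --all
  virsh dumpxml instance-00000042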

Nova "Launches" go to Scheduler; rest to Conductor

Nova Booting VM Workflow

Nova: Scheduling a VM
The Nova scheduler tries to select a matching compute node for the VM.

Nova Scheduler
Typical errors:
- No suitable compute node can be found
- All suitable compute nodes failed to launch the VM with the required settings
  nova-manage logs errors
  INFO nova.filters [req-299bb909-49bc-4124-8b88-732797250cf5 47a52e6b02a2ef] Filter RamFilter returned 0 hosts
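
A hedged way to find which filter eliminated all hosts; the log path follows the /var/log/<codename>/<service>.log convention above, and verbose/debug should be enabled first or the filter details will be missing:

  grep "returned 0 hosts" /var/log/nova/nova-scheduler.log
  nova-manage logs errors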

OpenStack Volumes: Cinder (architecture diagram: API and scheduler services in front of multiple volume services)

OpenStack Cinder: Volumes
Similar syntax to Nova:
  cinder-manage service list
  cinder-manage logs errors
  cinder-manage host list
  cinder list / show

OpenStack Networking: Neutron
Swiss Army knife for SDN:
  neutron agent-list
  neutron net-list
  neutron port-list
  neutron router-list
There's no neutron-manage.

Basic Network Layout

Networking with OVS: Compute Node
(...nce/content/under_the_hood_openvswitch.html)

Networking with LB: Compute Node

Neutron Troubleshooting
Neutron uses IP networking namespaces on the network node for routing overlapping networks:
  neutron net-list
  ip netns list
  ip netns exec qrouter-<id> bash
  ping ..., arping ..., ip ro ..., curl ...
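
A hedged walk-through of debugging connectivity from inside a router namespace; the router UUID and target address are placeholders taken from ip netns list / neutron router-list output:

  ip netns list
  ip netns exec qrouter-9f3c2a10-1111-2222-3333-444455556666 ip addr
  ip netns exec qrouter-9f3c2a10-1111-2222-3333-444455556666 ip route
  ip netns exec qrouter-9f3c2a10-1111-2222-3333-444455556666 ping -c 3 192.168.126.1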

Q&A
- http://ask.openstack.org/
- http://docs.openstack.org/
Thank you!

Bonus Material

OpenStack Orchestration: Heat

OpenStack Orchestration: Heat
Uses Nova, Cinder, Neutron to assemble complete stacks of resources:
  heat stack-list
  heat resource-list|show <stack>
  heat event-list|show <stack>
Usually necessary to query the actual OpenStack service for further information.
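
A hedged example of drilling down from a failed stack to the underlying service; the stack name is illustrative:

  heat stack-list
  heat resource-list mystack          # find the resource in CREATE_FAILED state
  heat event-list mystack             # read the error message Heat recorded
  # then query the owning service, e.g. for a failed server resource:
  nova show <instance-id>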

OpenStack Imaging: Glance
- Usually issues are in the configured Glance backend itself (e.g. RBD, Swift, filesystem), so debugging concentrates on those
- Filesystem: /var/lib/glance/images
- RBD:
    ceph -w
    rbd -p <pool> ls


Unpublished Work of SUSE. All Rights Reserved.
This work is an unpublished work and contains confidential, proprietary, and trade secret information of SUSE. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.

General Disclaimer
This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.
