Architecting Splunk For High Availability And Disaster Recovery

Transcription

2019 SPLUNK INC. 2019 SPLUNK INC.ArchitectingSplunk for HighAvailability andDisaster RecoverySean DelaneyPrincipal Architect SplunkJustin HardemanPlatform Architect Splunk

2019 SPLUNK INC.Sean DelaneyJustin HardemanPrincipal Architect Splunk Principal Architect Splunk

2019 SPLUNK INC.ForwardLookingStatementsDuring the course of this presentation, we may make forward‐looking statements regardingfuture events or plans of the company. We caution you that such statements reflect ourcurrent expectations and estimates based on factors currently known to us and that actualevents or results may differ materially. The forward-looking statements made in the thispresentation are being made as of the time and date of its live presentation. If reviewed afterits live presentation, it may not contain current or accurate information. We do not assumeany obligation to update any forward‐looking statements made herein.In addition, any information about our roadmap outlines our general product direction and issubject to change at any time without notice. It is for informational purposes only, and shallnot be incorporated into any contract or other commitment. Splunk undertakes no obligationeither to develop the features or functionalities described or to include any such feature orfunctionality in a future release.Splunk, Splunk , Turn Data Into Doing, The Engine for Machine Data, Splunk Cloud, SplunkLight and SPL are trademarks and registered trademarks of Splunk Inc. in the United Statesand other countries. All other brand names, product names, or trademarks belong to theirrespective owners. 2019 Splunk Inc. All rights reserved.

2019 SPLUNK INC.AgendaDisaster RecoveryRecover in the event of a disasterHigh AvailabilityMaintain an acceptable levelof continuous service Data Collection Indexing & SearchingSmart Store and HATop Takeaways

2019 SPLUNK INC.Disaster Recovery (DR)Section subtitle goes here

2019 SPLUNK INC.What is Disaster Recovery?Set of processes necessary to ensurerecovery of service after a disasterDR

2019 SPLUNK INC.Disaster Recovery Steps1Backup necessary data2RestoreBackup to a medium at least as resilient as sourceLocal Backup vs. RemoteEnsure this worksBackup is worthless without restoreDR

2019 SPLUNK INC.Backup1DRabConfigurations SPLUNK HOME/etc/*IndexesBuckets: Hot*, Warm, Cold, Frozen

2019 SPLUNK INC.Backup ConfigurationsSplunk Instance SPLUNK HOME/etc/*DR

2019 SPLUNK INC.Backup DataBucket TypeDRStateCan Backup?Read WriteNo*Read OnlyYesRead OnlyYes*Unless using snapshot aware FS or roll to warm first (which introduces a performance penalty).

2019 SPLUNK INC.Restore ConfigurationsDRNew Splunk Instance SPLUNK HOME/etc/* SPLUNK HOME/etc/*New Splunk Instance

2019 SPLUNK INC.Restore DataDRNew Splunk Instance Indexes Location( SPLUNK HOME/var/lib/splunk)Splunk advises restoring fullyfrom a backup rather thanrestoring on top of a partiallycorrupted datastore. Indexes Location( SPLUNK HOME/var/lib/splunk)New SplunkInstance

2019 SPLUNK INC.Backup KV StoreDR KV store backup scripts added in 7.1, previous versions backup KV Store files Run on the Searchead to take a stateful snapshot of the KV store ./splunk backup /latest/Admin/BackupKVstore( SPLUNK HOME/var/lib/splunk)New Splunk Instance

2019 SPLUNK INC.Backup Clustered DataOption 1:Backup all data on each node Will also result in backups of duplicate dataOption 2:Identify one copy of each bucket onthe cluster and backup only those(requires scripting) Decide whether or not you need to also backupindex filesDRBucket naming conventions Non-clustered buckets:db newest time oldest time localid Clustered original bucket:db newest time oldest time localid guid Clustered replicated bucket copies:rb newest time oldest time localid guid

2019 SPLUNK INC.Putting Restore Together2abcDR(New) Splunk InstanceConfigurationsData/Indexes

2019 SPLUNK INC.ConsiderationsRecovery Time andTolerable LossDRVS.Complexityand Cost

2019 SPLUNK INC.Other elements in your environmentJob Artifacts, DM, Collections etc.Utility/Management Instances: Deployment Server License Master Cluster Master Deployer

2019 SPLUNK INC.Other DR ArchitecturesSmart Store ArchitectureHot/CacheStorageHot/CacheStorageRemote Storage

2019 SPLUNK INC.High Availability (HA)Section subtitle goes here

2019 SPLUNK INC.What is High Availability?HAA design methodology whereby a systemis continuously operational, bounded by aset of predetermined tolerances.Note: “high availability” ! “complete availability”

2019 SPLUNK INC.Splunk High Availability1Data Collection/Reception2Searching3IndexingHA

2019 SPLUNK INC.Data CollectionAHAIndexersB.ForwarderForwarderForwarder

2019 SPLUNK INC.Data Group mygroup.[tcpout:mygroup]server A:9997, B:9997autoLB true.ForwarderForwarderForwarder

2019 SPLUNK INC.Searching2HASearch Head Clustering(SHC)

2019 SPLUNK INC.SearchingHATypical SearchHierarchy.Indexer AIndexer BIndexer N

2019 SPLUNK INC.SearchingHATypical SearchHierarchy.Indexer AIndexer BIndexer N

2019 SPLUNK INC.Search Head Clustering (SHC)Improved horizontal scalingImproved high availabilityNo single point of failureHA

2019 SPLUNK INC.SHCHAReplication protocol syncs:- Configurations- Job ArtifactsABC.Indexer AIndexer BIndexer CIndexer N

2019 SPLUNK INC.SHCHAReplication protocol syncs:- Configurations- Job ArtifactsABCConfigurationsDeployer.Indexer AIndexer BIndexer CIndexer N

2019 SPLUNK INC.SHCHAReplication protocol syncs:- Configurations- Job ArtifactsABCDeployer ensures identicaldeployed configurationsConfigurationsDeployer.Indexer AIndexer BIndexer CIndexer N

2019 SPLUNK INC.SHCHAReplication protocol syncs:- Configurations- Job ArtifactsABCCaptain plays a special role in clusterorchestration and job scheduling.ConfigurationsCaptainDeployer.Indexer AIndexer BIndexer CIndexer N

2019 SPLUNK INC.SHC Operation - High LevelDeployer ensures all SHC members have identical baseline configurations Subsequent UI changes propagated using an internal replication mechanismJob Scheduler gets disabled on all members but the CaptainCaptain selects members to run scheduled jobs based on load Selection based on load statistics. Captain orchestrates job artifact replication to selected members/candidates of the cluster.Transparent job artifact proxying (and eventual replication) if artifact not presenton user’s SH.HA

2019 SPLUNK INC.SHC Operation - High LevelMajority requirement and failure handling Surviving majority ( 51%)Site-awareness gotchas No notion of site in SHC (unlike in index replication) Case for static captain electionLatency and number of nodesHA

2019 SPLUNK INC.Stretched SHCCaptainHADeployerIndexerIndexerSite AIndexerIndexerSite B

2019 SPLUNK INC.Stretched SHCHACaptain must be statically electedas there is no SHC majorityCaptainDeployerIndexerIndexerSite AIndexerIndexerSite B

2019 SPLUNK INC.Stretched SHCHADeployer launched in second site,if SCH updates requiredCaptainDeployerIndexerIndexerSite AIndexerIndexerSite B

2019 SPLUNK INC.Indexing3HAIndexer Clustering

2019 SPLUNK INC.Index ReplicationHACluster a group of search peers (indexers) that replicate each others' bucketsData Availability Availability for ingestion and searchingTrade offsData Fidelity Forwarder Acknowledgement, assuranceDisaster Recovery Site awarenessSearch Affinity Local search preference vs. remote Extra storage Slightly increasedprocessing load.

2019 SPLUNK INC.Cluster ComponentsMaster Node Orchestrates replication/remedial process. Informs the SH where to find searchable data.Helps manage peer configurations.Peer Nodes Receive and index data. Replicate data to/from other peers. Peer Nodes Number RFSearch Head(s) Must use one to search across the cluster.Forwarders Use with auto-lb and indexer acknowledgementHA

2019 SPLUNK INC.Single Site Cluster Architecture– Credit: Splunk Docs TeamCredit: Splunk Docs TeamHA

2019 SPLUNK INC.Single Site Cluster Architecture– Credit: Splunk Docs TeamReplication Factor (RF) Number of copies of data in the cluster.Default RF 3 Cluster can tolerate RF-1 node failuresCredit: Splunk Docs TeamHA

2019 SPLUNK INC.Single Site Cluster Architecture– Credit: Splunk Docs TeamSearch Factor (SF) Number of copies of data in the cluster.Default SF 2 Requires more storage Replicated vs. Searchable BucketCredit: Splunk Docs TeamHA

2019 SPLUNK INC.Clustered IndexingOriginating peer node streams copies of data to other clustered peers. Receiving peers store those copies.Master determines replicated data destination. Instructs peers what peers to stream data to. Does not sit on data path.Master manages all peer-to-peer interactions and coordinates remedial activities.Master keeps track of which peers have searchable data. Ensures that there are always SF copies of searchable data available.HA

2019 SPLUNK INC.Clustered SearchingSearch head coordinates all searches in the clusterSH relies on master to tell it who its peers are. The master keeps track of which peers have searchable dataOnly one replicated bucket is searchable a.k.a primary i.e., searches occur over primary buckets, only.Primary buckets may change over time Peers know their status and therefore know where to searchHA

2019 SPLUNK INC.Multisite ClusteringSite awareness introduced in Splunk 6.1Improved disaster recovery Multisite clusters provide site failover capabilitySearch Affinity Search heads will scope searches to local site, whenever possible Ability to turn off for better thruput vs. X-Site bandwidthHA

2019 SPLUNK INC.Multi Site Cluster ArchitectureDifferences vs. single site Assign a site to each node Specify RF and SF on a site by site basisCredit: Splunk Docs Team

2019 SPLUNK INC.Multisite Clustering Cont’dEach node belongs to an assigned site, except for the Master Node, which controls allsites but it’s not logically a member of anyReplication of bucket copies occurs in a site-aware manner. Multisite replication determines # copies on each site. Ex. 3 site cluster: site replication factor origin:2, site1:1, site2:1, site3:1, total:4Bucket-fixing activities respect site boundaries when applicableSearches are fulfilled by local peers whenever possible (a.k.a search affinity) Each site must have at least a full set of searchable data

2019 SPLUNK INC.Rolling Restartsand UpgradesSection subtitle goes here

2019 SPLUNK INC.Rolling Restarts/UpgradesNew features in 7.1.0Searchable Rolling Restarts Perform a restart of an Indexer Cluster without effecting SearchUpgrade a Splunk Clustered Deployment (SHC and Indexer Cluster) without downtime Rolling upgrades apply to versions 7.1 and above (7.1 - 7.2)

2019 SPLUNK INC.Searchable Rolling Restart The master restarts peers one at a time. The master runs health checks to confirm that the cluster is in a searchable state before it initiates thesearchable rolling restart. The peer waits for in-progress searches to complete, up to a maximum time period, as determined bythe decommission search jobs wait secs attribute in server.conf. The default for this attribute is 180seconds. This covers the majority of searches in most cases. Searchable rolling restart applies to both historical searches and real-time searches. splunk rolling-restart cluster-peers -searchable true

2019 SPLUNK INC.Searchable Rolling UpgradesRolling Upgrade Steps:1. Perform pre-flight indexer cluster health checks splunk show cluster-status --verbose2. Upgrade the cluster master3. Initialize rolling upgrade on the indexer cluster splunk upgrade-init cluster-peers4. Select a cluster peer and gracefully shutdown the cluster-peer splunk offline5. Upgrade the selected cluster-peer and restart6. Repeat steps 4 and 5 for all cluster peers7. Finalize rolling upgrade on the indexer cluster splunk upgrade-finalize cluster-peers

2019 SPLUNK INC.Searchable Rolling UpgradesDeliver business continuity by making search highly op the cluster master2Upgrade and restart cluster master3Rolling upgrade of Search Heads4Upgrade SHC deployer5Upgrade each indexer ensuring searchcontinuity. Repeat for all indexers6Repeat step 5 for multi-site deployments

2019 SPLUNK INC.S2 – Smart StoreSmart option for HA/DR

2019 SPLUNK INC.High AvailabilityData vs. SearchData ResiliencyWill my data be protected if something goes wrong? Smart Store relies on the remote store to provide data resiliency– Smart Store and indexer clustering will not protect data in the remote store– When used on Smart Store enabled indexes, indexer clustering will only protect hot bucketsSearch ResiliencyWill my search results be complete if something goes wrong? Indexer clustering must be used to provide highly available search– Smart Store does not natively provide search HA

2019 SPLUNK INC.Failure ScenariosFailed peers replication factorAvailable copies of a bucketexist in the clusterCluster master initiates a fix-up operationcopy bucketHOTHOTpush metadataCACHEDSTUB– Hot bucket: full copy– Cached bucket: metadata push– Peers don’t need the full contents of cached buckets, as it can be fetchedfrom the remote store– Metadata replication is done with a POST to a REST endpoint on the target peer(s)push metadataSTUB Peers with an existing copy of buckets areinstructed to copy them to other peers untilRF/SF is metSTUB

2019 SPLUNK INC.Failure ScenariosFailure of peers replication factorAvailable copies of a bucket do not existin the cluster1Only buckets that have been stored in theremote store can be recovered. Hot bucketswill be lost.3STUBSTUB21. CM requests a random peer to download themetadata for the missing buckets2. Peer downloads the bucket metadata3. CM requests the peer replicate bucket metadatato other peers to meet rep factorRemote Storage

2019 SPLUNK INC.Failure ScenariosCluster Bootstrap (complete failure)Completely new cluster – No buckets exist11. CM requests an UP peer to discover all bucketsin the remote store for the known indexes2. Peer provides the CM with a list of availablebuckets3523. CM randomly assigns buckets to peers4. Peers download the metadata for the buckets5. CM requests the peers replicate bucketmetadata to other peers to meet rep factor4

2019 SPLUNK INC.Autoscaling in AWSWorking towards automatic recovery

2019 SPLUNK INC.AutoscalingWhat is it?Collection of EC2 instances assembled into a group Maintaining count of instances and automatic scaling are core functionality Policies drive actions Can scale across AZ’sScaling options Maintain current levels (sweet spot for Splunk) Manual up/down Schedule/demand/predictive based (not recommended currently)

2019 SPLUNK INC.Architecture Diagram Splunk Management Servers(CM, MC, LM, etc.)AWS RegionVPCAvailability zone 1Availability zone 2Public subnetPublic subnetNAT gatewayNAT gatewaySplunk ManagementServicesAuto Scaling groupIndexers/Search HeadsAuto Scaling groupSplunk ManagementServicesIndexers/Search Heads– Group of 1 for stateless servers– Use DNS names instead of IP’s Indexers / Search Heads :– Replace damaged instances– By stretching across availabilityzones, you can multisite cluster foradditional HA Tips :– Leverage configuration managementfor replacing instances– Automate when possible– Zero touch replacement

2019 SPLUNK INC.AWS Example (Hint: It’s Splunk Cloud)

2019 SPLUNK INC.Final WordsSection subtitle goes here

2019 SPLUNK INC.Putting it TogetherDeployerMaster. Search Head Clustering. Index Clustering. Forwarding Layer – auto LB

2019 SPLUNK INC.Top TakeawaysDR – Process of backing-up and restoring service in case of disaster Configuration files – copy of SPLUNK HOME/etc/ folder Indexed data – backup and restore buckets– Hot, warm, cold, frozen– Can’t backup hot (without snapshots) but can safely backup warm and coldHA – continuously operational system bounded by a set of tolerances Data collection– Autolb from forwarders to multiple indexers– Use Indexer Acknowledgement to protect in flight data Searching– Search Head Clustering (SHC) Indexing– Use Index Replication

2019 SPLUNK INC.SVAs – Splunk Validated ArchitecturesRecommended and supported Splunk deploymenttopologies based upon the following design pillars: Availability Performance Scalability Security ers/splunk-validated-architectures.pdf

2019 SPLUNK INC.Resources Splunk Validated Architectures– lidated-architectures.pdf Docs––––High test/DistSearch/AboutSHC– Rolling restarts and upgrades:– /DistSearch/SHCrollingupgrade– Smart Store– t/Indexer/SmartStorearchitecture

2019 SPLUNK INC.Q&A

2019 SPLUNK INC.ThankYou!Go to the .conf19 mobile app toRATE THIS SESSION

not be incorporated into any contract or other commitment.Splunk undertakes no obligation either to develop the features or functionalities described or to include any such feature or functionality in a future release. Splunk, Splunk , Turn Data Into Doing, The Engine for Machine Data, Splunk Cloud, Splunk