Elasticsearch: The Definitive Guide


A Distributed Real-Time Search and Analytics Engine

Whether you need full-text search or real-time analytics of structured data (or both), the Elasticsearch distributed search engine is an ideal way to put your data to work. This practical guide not only shows you how to search, analyze, and explore data with Elasticsearch, but also helps you deal with the complexities of human language, geolocation, and relationships.

If you’re a newcomer to both search and distributed systems, you’ll quickly learn how to integrate Elasticsearch into your application. More experienced users will pick up lots of advanced techniques. Throughout the book, you’ll follow a problem-based approach to learn why, when, and how to use Elasticsearch features.

- Understand how Elasticsearch interprets data in your documents
- Index and query your data to take advantage of search concepts such as relevance and word proximity
- Handle human language through the effective use of analyzers and queries
- Summarize and group data to show overall trends, with aggregations and analytics
- Use geo-points and geo-shapes, Elasticsearch’s approaches to geolocation
- Model your data to take advantage of Elasticsearch’s horizontal scalability
- Learn how to configure and monitor your cluster in production

“The book could easily be retitled as 'Understanding search engines using Elasticsearch.' Great job. Way beyond just simply using Elasticsearch.”
Ivan Brusic, Search Consultant

Clinton Gormley was the first user of Elasticsearch and wrote the Perl API back in 2010. When Elasticsearch formed a company in 2012, he joined as a developer and the maintainer of the Perl modules.

Zachary Tong has been working with Elasticsearch since 2011, and has written several tutorials to help beginners using the server. Zach is a developer at Elasticsearch and maintains the PHP client.

DATABASES / WEB
US $49.99, CAN $57.99
ISBN: 978-1-449-35854-9
Twitter: @oreillymedia
facebook.com/oreilly

Elasticsearch: The Definitive Guide
Clinton Gormley and Zachary Tong

Elasticsearch: The Definitive Guide
by Clinton Gormley and Zachary Tong

Copyright © 2015 Elasticsearch. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Mike Loukides and Brian Anderson
Production Editor: Shiny Kalapurakkel
Proofreader: Sharon Wilkey
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Ellie Volkhausen
Illustrator: Rebecca Demarest

January 2015: First Edition

Revision History for the First Edition
2015-01-16: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781449358549 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Elasticsearch: The Definitive Guide, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-449-35854-9
[LSI]

Table of Contents

Foreword
Preface

Part I. Getting Started

1. You Know, for Search
  Installing Elasticsearch
  Installing Marvel
  Running Elasticsearch
  Viewing Marvel and Sense
  Talking to Elasticsearch
  Java API
  RESTful API with JSON over HTTP
  Document Oriented
  JSON
  Finding Your Feet
  Let’s Build an Employee Directory
  Indexing Employee Documents
  Retrieving a Document
  Search Lite
  Search with Query DSL
  More-Complicated Searches
  Full-Text Search
  Phrase Search
  Highlighting Our Searches
  Analytics
  Tutorial Conclusion
  Distributed Nature
  Next Steps

2. Life Inside a Cluster
  An Empty Cluster
  Cluster Health
  Add an Index
  Add Failover
  Scale Horizontally
  Then Scale Some More
  Coping with Failure

3. Data In, Data Out
  What Is a Document?
  Document Metadata
  _index
  _type
  _id
  Other Metadata
  Indexing a Document
  Using Our Own ID
  Autogenerating IDs
  Retrieving a Document
  Retrieving Part of a Document
  Checking Whether a Document Exists
  Updating a Whole Document
  Creating a New Document
  Deleting a Document
  Dealing with Conflicts
  Optimistic Concurrency Control
  Using Versions from an External System
  Partial Updates to Documents
  Using Scripts to Make Partial Updates
  Updating a Document That May Not Yet Exist
  Updates and Conflicts
  Retrieving Multiple Documents
  Cheaper in Bulk
  Don’t Repeat Yourself
  How Big Is Too Big?

4. Distributed Document Store
  Routing a Document to a Shard
  How Primary and Replica Shards Interact
  Creating, Indexing, and Deleting a Document
  Retrieving a Document
  Partial Updates to a Document
  Multidocument Patterns
  Why the Funny Format?

5. Searching—The Basic Tools
  The Empty Search
  hits
  took
  shards
  timeout
  Multi-index, Multitype
  Pagination
  Search Lite
  The _all Field
  More Complicated Queries

6. Mapping and Analysis
  Exact Values Versus Full Text
  Inverted Index
  Analysis and Analyzers
  Built-in Analyzers
  When Analyzers Are Used
  Testing Analyzers
  Specifying Analyzers
  Mapping
  Core Simple Field Types
  Viewing the Mapping
  Customizing Field Mappings
  Updating a Mapping
  Testing the Mapping
  Complex Core Field Types
  Multivalue Fields
  Empty Fields
  Multilevel Objects
  Mapping for Inner Objects
  How Inner Objects are Indexed
  Arrays of Inner Objects

7. Full-Body Search
  Empty Search
  Query DSL
  Structure of a Query Clause
  Combining Multiple Clauses
  Queries and Filters
  Performance Differences
  When to Use Which
  Most Important Queries and Filters
  term Filter
  terms Filter
  range Filter
  exists and missing Filters
  bool Filter
  match_all Query
  match Query
  multi_match Query
  bool Query
  Combining Queries with Filters
  Filtering a Query
  Just a Filter
  A Query as a Filter
  Validating Queries
  Understanding Errors
  Understanding Queries

8. Sorting and Relevance
  Sorting
  Sorting by Field Values
  Multilevel Sorting
  Sorting on Multivalue Fields
  String Sorting and Multifields
  What Is Relevance?
  Understanding the Score
  Understanding Why a Document Matched
  Fielddata

9. Distributed Search Execution
  Query Phase
  Fetch Phase
  Search Options
  preference
  timeout
  routing
  search_type
  scan and scroll

10. Index Management
  Creating an Index
  Deleting an Index
  Index Settings
  Configuring Analyzers
  Custom Analyzers
  Creating a Custom Analyzer
  Types and Mappings
  How Lucene Sees Documents
  How Types Are Implemented
  Avoiding Type Gotchas
  The Root Object
  Properties
  Metadata: _source Field
  Metadata: _all Field
  Metadata: Document Identity
  Dynamic Mapping
  Customizing Dynamic Mapping
  date_detection
  dynamic_templates
  Default Mapping
  Reindexing Your Data
  Index Aliases and Zero Downtime

11. Inside a Shard
  Making Text Searchable
  Immutability
  Dynamically Updatable Indices
  Deletes and Updates
  Near Real-Time Search
  refresh API
  Making Changes Persistent
  flush API
  Segment Merging
  optimize API

Part II. Search in Depth

12. Structured Search
  Finding Exact Values
  term Filter with Numbers
  term Filter with Text
  Internal Filter Operation
  Combining Filters
  Bool Filter
  Nesting Boolean Filters
  Finding Multiple Exact Values
  Contains, but Does Not Equal
  Equals Exactly
  Ranges
  Ranges on Dates
  Ranges on Strings
  Dealing with Null Values
  exists Filter
  missing Filter
  exists/missing on Objects
  All About Caching
  Independent Filter Caching
  Controlling Caching
  Filter Order

13. Full-Text Search
  Term-Based Versus Full-Text
  The match Query
  Index Some Data
  A Single-Word Query
  Multiword Queries
  Improving Precision
  Controlling Precision
  Combining Queries
  Score Calculation
  Controlling Precision
  How match Uses bool
  Boosting Query Clauses
  Controlling Analysis
  Default Analyzers
  Configuring Analyzers in Practice
  Relevance Is Broken!

14. Multifield Search
  Multiple Query Strings
  Prioritizing Clauses
  Single Query String
  Know Your Data
  Best Fields
  dis_max Query
  Tuning Best Fields Queries
  tie_breaker
  multi_match Query
  Using Wildcards in Field Names
  Boosting Individual Fields
  Most Fields
  Multifield Mapping
  Cross-fields Entity Search
  A Naive Approach
  Problems with the most_fields Approach
  Field-Centric Queries
  Problem 1: Matching the Same Word in Multiple Fields
  Problem 2: Trimming the Long Tail
  Problem 3: Term Frequencies
  Solution
  Custom _all Fields
  cross-fields Queries
  Per-Field Boosting
  Exact-Value Fields

15. Proximity Matching
  Phrase Matching
  Term Positions
  What Is a Phrase
  Mixing It Up
  Multivalue Fields
  Closer Is Better
  Proximity for Relevance
  Improving Performance
  Rescoring Results
  Finding Associated Words
  Producing Shingles
  Multifields
  Searching for Shingles
  Performance

16. Partial Matching
  Postcodes and Structured Data
  prefix Query
  wildcard and regexp Queries
  Query-Time Search-as-You-Type
  Index-Time Optimizations
  Ngrams for Partial Matching
  Index-Time Search-as-You-Type
  Preparing the Index
  Querying the Field
  Edge n-grams and Postcodes
  Ngrams for Compound Words

17. Controlling Relevance
  Theory Behind Relevance Scoring
  Boolean Model
  Term Frequency/Inverse Document Frequency (TF/IDF)
  Vector Space Model
  Lucene’s Practical Scoring Function
  Query Normalization Factor
  Query Coordination
  Index-Time Field-Level Boosting
  Query-Time Boosting
  Boosting an Index
  t.getBoost()
  Manipulating Relevance with Query Structure
  Not Quite Not
  boosting Query
  Ignoring TF/IDF
  constant_score Query
  function_score Query
  Boosting by Popularity
  modifier
  factor
  boost_mode
  max_boost
  Boosting Filtered Subsets
  filter Versus query
  functions
  score_mode
  Random Scoring
  The Closer, The Better
  Understanding the price Clause
  Scoring with Scripts
  Pluggable Similarity Algorithms
  Okapi BM25
  Changing Similarities
  Configuring BM25
  Relevance Tuning Is the Last 10%

Part III. Dealing with Human Language

18. Getting Started with Languages
  Using Language Analyzers
  Configuring Language Analyzers
  Pitfalls of Mixing Languages
  At Index Time
  At Query Time
  Identifying Language
  One Language per Document
  Foreign Words
  One Language per Field
  Mixed-Language Fields
  Split into Separate Fields
  Analyze Multiple Times
  Use n-grams

19. Identifying Words
  standard Analyzer
  standard Tokenizer
  Installing the ICU Plug-in
  icu_tokenizer
  Tidying Up Input Text
  Tokenizing HTML
  Tidying Up Punctuation

20. Normalizing Tokens
  In That Case
  You Have an Accent
  Retaining Meaning
  Living in a Unicode World
  Unicode Case Folding
  Unicode Character Folding
  Sorting and Collations
  Case-Insensitive Sorting
  Differences Between Languages
  Unicode Collation Algorithm
  Unicode Sorting
  Specifying a Language
  Customizing Collations

21. Reducing Words to Their Root Form
  Algorithmic Stemmers
  Using an Algorithmic Stemmer
  Dictionary Stemmers
  Hunspell Stemmer
  Installing a Dictionary
  Per-Language Settings
  Creating a Hunspell Token Filter
  Hunspell Dictionary Format
  Choosing a Stemmer
  Stemmer Performance
  Stemmer Quality
  Stemmer Degree
  Making a Choice
  Controlling Stemming
  Preventing Stemming
  Customizing Stemming
  Stemming in situ
  Is Stemming in situ a Good Idea

22. Stopwords: Performance Versus Precision
  Pros and Cons of Stopwords
  Using Stopwords
  Stopwords and the Standard Analyzer
  Maintaining Positions
  Specifying Stopwords
  Using the stop Token Filter
  Updating Stopwords
  Stopwords and Performance
  and Operator
  minimum_should_match
  Divide and Conquer
  Controlling Precision
  Only High-Frequency Terms
  More Control with Common Terms
  Stopwords and Phrase Queries
  Positions Data
  Index Options
  Stopwords
  common_grams Token Filter
  At Index Time
  Unigram Queries
  Bigram Phrase Queries
  Two-Word Phrases
  Stopwords and Relevance

23. Synonyms
  Using Synonyms
  Formatting Synonyms
  Expand or contract
  Simple Expansion
  Simple Contraction
  Genre Expansion
  Synonyms and The Analysis Chain
  Case-Sensitive Synonyms
  Multiword Synonyms and Phrase Queries
  Use Simple Contraction for Phrase Queries
  Synonyms and the query_string Query
  Symbol Synonyms

24. Typoes and Mispelings
  Fuzziness
  Fuzzy Query
  Improving Performance
  Fuzzy match Query
  Scoring Fuzziness
  Phonetic Matching

Part IV. Aggregations

25. High-Level Concepts
  Buckets
  Metrics
  Combining the Two

26. Aggregation Test-Drive
  Adding a Metric to the Mix
  Buckets Inside Buckets
  One Final Modification

27. Building Bar Charts

28. Looking at Time
  Returning Empty Buckets
  Extended Example
  The Sky’s the Limit

29. Scoping Aggregations

30. Filtering Queries and Aggregations
  Filtered Query
  Filter Bucket
  Post Filter
  Recap

31. Sorting Multivalue Buckets
  Intrinsic Sorts
  Sorting by a Metric
  Sorting Based on “Deep” Metrics

32. Approximate Aggregations
  Finding Distinct Counts
  Understanding the Trade-offs
  Optimizing for Speed
  Calculating Percentiles
  Percentile Metric
  Percentile Ranks
  Understanding the Trade-offs

33. Significant Terms
  significant_terms Demo
  Recommending Based on Popularity
  Recommending Based on Statistics

34. Controlling Memory Use and Latency
  Fielddata
  Aggregations and Analysis
  High-Cardinality Memory Implications
  Limiting Memory Usage
  Fielddata Size
  Monitoring fielddata
  Circuit Breaker
  Fielddata Filtering
  Doc Values
  Enabling Doc Values
  Preloading Fielddata
  Eagerly Loading Fielddata
  Global Ordinals
  Index Warmers
  Preventing Combinatorial Explosions
  Depth-First Versus Breadth-First

35. Closing Thoughts

Part V. Geolocation

36. Geo-Points
  Lat/Lon Formats
  Filtering by Geo-Point
  geo_bounding_box Filter
  Optimizing Bounding Boxes
  geo_distance Filter
  Faster Geo-Distance Calculations
  geo_distance_range Filter
  Caching geo-filters
  Reducing Memory Usage
  Sorting by Distance
  Scoring by Distance

37. Geohashes
  Mapping Geohashes
  geohash_cell Filter

38. Geo-aggregations
  geo_distance Aggregation
  geohash_grid Aggregation
  geo_bounds Aggregation

39. Geo-shapes
  Mapping geo-shapes
  precision
  distance_error_pct
  Indexing geo-shapes
  Querying geo-shapes
  Querying with Indexed Shapes
  Geo-shape Filters and Caching

Part VI. Modeling Your Data

40. Handling Relationships
  Application-side Joins
  Denormalizing Your Data
  Field Collapsing
  Denormalization and Concurrency
  Renaming Files and Directories
  Solving Concurrency Issues
  Global Locking
  Document Locking
  Tree Locking

41. Nested Objects
  Nested Object Mapping
  Querying a Nested Object
  Sorting by Nested Fields
  Nested Aggregations
  reverse_nested Aggregation
  When to Use Nested Objects

42. Parent-Child Relationship
  Parent-Child Mapping
  Indexing Parents and Children
  Finding Parents by Their Children
  min_children and max_children
  Finding Children by Their Parents
  Children Aggregation
  Grandparents and Grandchildren
  Practical Considerations
  Memory Use
  Global Ordinals and Latency
  Multigenerations and Concluding Thoughts

43. Designing for Scale
  The Unit of Scale
  Shard Overallocation
  Kagillion Shards
  Capacity Planning
  Replica Shards
  Balancing Load with Replicas
  Multiple Indices
  Time-Based Data
  Index per Time Frame
  Index Templates
  Retiring Data
  Migrate Old Indices
  Optimize Indices
  Closing Old Indices
  Archiving Old Indices
  User-Based Data
  Shared Index
  Faking Index per User with Aliases
  One Big User
  Scale Is Not Infinite

Part VII. Administration, Monitoring, and Deployment

44. Monitoring
  Marvel for Monitoring
  Cluster Health
  Drilling Deeper: Finding Problematic Indices
  Blocking for Status Changes
  Monitoring Individual Nodes
  indices Section
  OS and Process Sections
  JVM Section
  Threadpool Section
  FS and Network Sections
  Circuit Breaker
  Cluster Stats
  Index Stats
  Pending Tasks
  cat API

45. Production Deployment
  Hardware
  Memory
  CPUs
  Disks
  Network
  General Considerations
  Java Virtual Machine
  Transport Client Versus Node Client
  Configuration Management
  Important Configuration Changes
  Assign Names
  Paths
  Minimum Master Nodes
  Recovery Settings
  Prefer Unicast over Multicast
  Don’t Touch These Settings!
  Garbage Collector
  Threadpools
  Heap: Sizing and Swapping
  Give Half Your Memory to Lucene
  Don’t Cross 32 GB!
  Swapping Is the Death of Performance
  File Descriptors and MMap
  Revisit This List Before Production

46. Post-Deployment
  Changing Settings Dynamically
  Logging
  Slowlog
  Indexing Performance Tips
  Test Performance Scientifically
  Using and Sizing Bulk Requests
  Storage
  Segments and Merging
  Other
  Rolling Restarts
  Backing Up Your Cluster
  Creating the Repository
  Snapshotting All Open Indices
  Snapshotting Particular Indices
  Listing Information About Snapshots
  Deleting Snapshots
  Monitoring Snapshot Progress
  Canceling a Snapshot
  Restoring from a Snapshot
  Monitoring Restore Operations
  Canceling a Restore
  Clusters Are Living, Breathing Creatures

Index

Foreword

One of the most nerve-wracking periods when releasing the first version of an open source project occurs when the IRC channel is created. You are all alone, eagerly hoping and wishing for the first user to come along. I still vividly remember those days.

One of the first users that jumped on IRC was Clint, and how excited was I. Well for a brief period, until I found out that Clint was actually a Perl user, no less working on a website that dealt with obituaries. I remember asking myself why couldn’t we get someone from a more “hyped” community, like Ruby or Python (at the time), and a slightly nicer use case.

How wrong I was. Clint ended up being instrumental to the success of Elasticsearch. He was the first user to roll out Elasticsearch into production (version 0.4 no less!), and the interaction with Clint was pivotal during the early days in shaping Elasticsearch into what it is today. Clint has a unique insight into what is simple, and he is very rarely wrong, which has a huge impact on various usability aspects of Elasticsearch, from management, to API design, to day-to-day usability features. It was a no-brainer for us to reach out to Clint and ask if he would join our company immediately after we formed it.

Another one of the first things we did when we formed the company was offer public training. It’s hard to express how nervous we were about whether or not people would even sign up for it.

We were wrong.

The trainings were and still are a rave success with waiting lists in all major cities. One of the people who caught our eye was a young fellow, Zach, who came to one of our trainings. We knew about Zach from his blog posts about using Elasticsearch (and secretly envied his ability to explain complex concepts in a very simple manner) and from a PHP client he wrote for the software. What we found out was that Zach had actually paid to attend the Elasticsearch training out of his own pocket! You can’t really ask for more than that, and we reached out to Zach and asked if he would join our company as well.

Both Clint and Zach are pivotal to the success of Elasticsearch. They are wonderful communicators who can explain Elasticsearch from its high-level simplicity, to its (and Apache Lucene’s) low-level internal complexities. It’s a unique skill that we dearly cherish here at Elasticsearch. Clint is also responsible for the Elasticsearch Perl client, while Zach is responsible for the PHP one - both wonderful pieces of code.

And last, both play an instrumental role in most of what happens daily with the Elasticsearch project itself. One of the main reasons why Elasticsearch is so popular is its ability to communicate empathy to its users, and Clint and Zach are both part of the group that makes this a reality.

Preface

The world is swimming in data. For years we have been simply overwhelmed by the quantity of data flowing through and produced by our systems. Existing technology has focused on how to store and structure warehouses full of data. That’s all well and good, until you actually need to make decisions in real time informed by that data.

Elasticsearch is a distributed, scalable, real-time search and analytics engine. It enables you to search, analyze, and explore your data, often in ways that you did not anticipate at the start of a project. It exists because raw data sitting on a hard drive is just not useful.

Whether you need full-text search, real-time analytics of structured data, or a combination of the two, this book introduces you to the fundamental concepts required to start working with Elasticsearch at a basic level. With these foundations laid, it will move on to more-advanced search techniques, which you will need to shape the search experience
