Mastering ElasticSearch PDF Free Download


Mastering ElasticSearch

Extend your knowledge on ElasticSearch, and querying and data handling, along with its internal workings

Rafał Kuć
Marek Rogoziński

BIRMINGHAM - MUMBAI

Mastering ElasticSearch

Copyright 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2013

Production Reference: 1211013

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78328-143-5

www.packtpub.com

Cover Image by Prashant Timappa Shetty (sparkling.spectrum.123@gmail.com)

Credits

Authors: Rafał Kuć, Marek Rogoziński
Reviewers: Ravindra Bharathi, Surendra Mohan, Marcelo Ochoa
Acquisition Editor: James Jones
Lead Technical Editor: Arun Nadar
Technical Editors: Iram Malik, Krishnaveni Nair, Shruti Rawool
Project Coordinator: Shiksha Chaturvedi
Proofreader: Mario Cecere
Indexer: Priya Subramani
Graphics: Ronak Dhruv
Production Coordinator: Kyle Albuquerque
Cover Work: Kyle Albuquerque

About the Authors

Rafał Kuć is a born team leader and a Software Developer. Working as a Consultant and a Software Engineer at Sematext Group, Inc., he concentrates on open source technologies such as Apache Lucene, Solr, ElasticSearch, and the Hadoop stack. He has more than 11 years of experience in various software branches, from banking software to e-commerce products. He is mainly focused on Java, but open to every tool and programming language that will make the achievement of his goal easier and faster. He is also one of the founders of the solr.pl site, where he tries to share his knowledge and help people to resolve their problems with Solr and Lucene. He is also a speaker at various conferences around the world such as Lucene Eurocon, Berlin Buzzwords, ApacheCon, and Lucene Revolution.

Rafał began his journey with Lucene in 2002 and it wasn't love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then Solr came and this was it. He started working with ElasticSearch in the middle of 2010. Currently, Lucene, Solr, ElasticSearch, and information retrieval are his main points of interest.

Rafał is also the author of Solr 3.1 Cookbook and its update, Solr 4.0 Cookbook, and is a co-author of ElasticSearch Server, all published by Packt Publishing.

The book you are holding in your hands was something that I wanted to write after finishing the ElasticSearch Server book and I got the opportunity. I wanted not to jump from topic to topic, but concentrate on a few of them and write about what I know and share the knowledge. Again, just like the ElasticSearch Server book, I couldn't include all topics I wanted, and some small details that are more or less important, depending on the use case, had to be left aside. Nevertheless, I hope that by reading this book you'll be able to easily get into all the details about ElasticSearch and underlying Apache Lucene, and I also hope that it will let you get the desired knowledge easier and faster.

I would like to thank my family for their support and patience during all those days and evenings when I was sitting in front of a screen instead of being fully with them.

I would also like to thank all the people I'm working with at Sematext, especially Otis, who took his time and convinced me that Sematext is the right company for me.

Finally, I would like to thank all the people involved in creating, developing, and maintaining ElasticSearch and Lucene projects for their work and passion. Without them this book wouldn't have been written and open source search would have been less powerful.

Once again, thank you.

Marek Rogoziński is a Software Architect and a Consultant with more than 10 years of experience. His specialization involves solutions based on open source search engines such as Solr and ElasticSearch and software stack for big data analytics including Hadoop, Hbase, and Twitter Storm.

He is also a co-founder of the solr.pl site which publishes information and tutorials about Solr and Lucene library and is the co-author of the ElasticSearch Server book published by Packt Publishing.

He currently holds a position of Chief Technology Officer in a company building products based on the processing and analysis of large streams of input data.

Just like the previous book, writing Mastering ElasticSearch was a difficult task. To tell the truth, it was much harder not only because of more advanced topics covered in this book, but also because of the constantly introduced changes in the ElasticSearch codebase. The development of it is not going to slow down and literally speaking, every day brings something new. Please remember that this book should be treated as a continuation of the previous book. This means, we have tried to omit all the topics that we had covered before, and we wanted to add everything that was omitted. You can judge for yourself whether we have succeeded.

Now it's time to thank everyone.

Thanks to all the people who have created ElasticSearch, Lucene, and all of those libraries and modules published around these projects.

I would also like to thank the team working on this book. First of all, to the ones who worked on the extermination of all my errors, typos, and ambiguities.

Last but not least, thanks to all the friends, who withstood me during this time.

About the Reviewers

Ravindra Bharathi has worked in the software industry for over a decade in various domains such as education, Digital Media Marketing/Advertising, Enterprise Search, and Energy Management Systems. He has a keen interest in search-based applications that involve data visualization, mashups, and dashboards. He blogs at http://ravindrabharathi.blogspot.com.

I wish to thank my wife, Vidya, for her support in all my endeavors.

Surendra Mohan is currently serving as a Drupal Consultant cum Drupal Architect at a well-known Software Consulting Ltd. organization in India. Prior to joining this organization, he served a few Indian MNCs and a couple of startups in varied roles such as Programmer, Technical Lead, Project Lead, Project Manager, Solution Architect, and Service Delivery Manager. He has around nine years of work experience in web technologies covering media and entertainment, real estate, travel and tours, publishing, e-learning, enterprise architecture, and so on. He is also a well-known speaker who delivers talks on Drupal, Open Source, PHP, Moodle, and so on, along with organizing and delivering TechTalks in Drupal meetups and Drupal Camps in Mumbai, India.

He also reviewed other technical books such as Drupal 7 Multi Site Configuration, by Matt Butcher, Drupal Search Engine Optimization, by Ric Shreves, Building e-commerce Sites with Drupal Commerce Cookbook, by Richard Carter. In addition to technical reviewing activities, he is also writing a book on Apache Solr which is scheduled to be published by the end of October, 2013.

I would like to thank my family and friends who supported and encouraged me in completing my reviews on time with good quality.

Marcelo Ochoa works at the System Laboratory of Facultad de Ciencias Exactas of the Universidad Nacional del Centro de la Provincia de Buenos Aires and is the CTO at Scotas.com, a company specialized in Near Real Time Search solutions using Apache Solr and Oracle. He divides his time between University jobs and external projects related to Oracle and big data technologies. He has worked in several Oracle-related projects such as translation of Oracle manuals and multimedia CBTs. His background is in database, network, web, and Java technologies. In the XML world he is known as the developer of DB Generator for the Apache Cocoon project, the open source projects DBPrism and DBPrism CMS, the Lucene-Oracle integration by using Oracle JVM Directory implementation and in the Restlet.org project the Oracle XDB Restlet Adapter, an alternative to write native REST web services inside the database-resident JVM.

Since 2006, he has been a part of the Oracle ACE program; Oracle ACEs are known for their strong credentials as Oracle community enthusiasts and advocates, with candidates nominated by ACEs in the Oracle Technology and Applications communities.

He is the author of Chapter 17, 360-Degree Programming the Oracle Database of the book, Oracle Database Programming Using Java and Web Services, by Kuassi Mensah, at Digital Press and Chapter 21, DB Prism: A Framework to Generate Dynamic XML from a Database of the book Professional XML Databases, by Kevin Williams, at Wrox Press.

www.PacktPub.com

Support files, eBooks, discount offers and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read, and search across Packt's entire library of books.

Why Subscribe?

Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Table of Contents

Preface
Chapter 1: Introduction to ElasticSearch
Introducing Apache Lucene
Getting familiar with Lucene
Overall architecture
Analyzing your data
Indexing and querying
Lucene query language
Introducing ElasticSearch
Basic concepts
Understanding the basics
Querying fields
Term modifiers
Handling special
Replica
Gateway
Key concepts behind ElasticSearch architecture
Working of ElasticSearch
The bootstrap process
Failure detection
Communicating with

Chapter 2: Power User Query DSL
Default Apache Lucene scoring explained
When a document is matched
The TF/IDF scoring formula
The Lucene conceptual formula
The Lucene practical formula
The ElasticSearch point of view
Query rewrite explained
Prefix query as an example
Getting back to Apache Lucene
Query rewrite properties
Rescore
Understanding rescore
Example Data
Query
Structure of the rescore query
Rescore parameters
To sum up
Bulk Operations
MultiGet
MultiSearch
Sorting data
Sorting with multivalued fields
Sorting with multivalued geo fields
Sorting with nested objects
Update API
Simple field update
Conditional modifications using scripting
Creating and deleting documents using the Update API
Using filters to optimize your queries
Filters and caching
Not all filters are cached by default
Changing ElasticSearch caching behavior
Why bother naming the key for the cache?
When to change the ElasticSearch filter caching behavior
The terms lookup filter
How does it work?
Performance considerations
Loading terms from inner objects
Terms lookup filter cache settings
Filter and scopes in ElasticSearch faceting mechanism
Example data

Faceting and filtering
Filter as a part of the query
The Facet filter
Global scope
Summary
Chapter 3: Low-level Index Control
Altering Apache Lucene scoring
Available similarity models
Setting per-field similarity
Similarity model configuration
Choosing the default similarity model
Configuring the chosen similarity models
Configuring TF/IDF similarity
Configuring Okapi BM25 similarity
Configuring DFR similarity
Configuring IB similarity
Using codecs
Simple use cases
Let's see how it works
Available posting formats
Configuring the codec behavior
Default codec properties
Direct codec properties
Memory codec properties
Pulsing codec properties
Bloom filter-based codec properties
NRT, flush, refresh, and transaction log
Updating index and committing changes
Changing the default refresh time
The transaction log
The transaction log configuration
Near Real Time GET
Looking deeper into data handling
Input is not always analyzed
Example usage
Changing the analyzer during indexing
Changing the analyzer during searching
The pitfall and default analysis
Segment merging under control
Choosing the right merge policy
The tiered merge policy
The log byte size merge policy
The log doc merge policy

Merge policies configuration
The tiered merge policy
The log byte size merge policy
The log doc merge policy
Scheduling
The concurrent merge scheduler
The serial merge scheduler
Setting the desired merge scheduler
Summary
Chapter 4: Index Distribution Architecture
Choosing the right amount of shards and replicas
Sharding and over allocation
A positive example of over allocation
Multiple shards versus multiple indices
Replicas
Routing explained
Shards and data
Let's test routing
Indexing with routing
Indexing with g
Aliases
Multiple routing values
Altering the default shard allocation behavior
Introducing ShardAllocator
The even shard ShardAllocator
The balanced ShardAllocator
The custom sting shard allocation
Allocation iveAllocationDecider
DiskThresholdDecider
Forcing allocation awareness

Filtering
Runtime allocation updating
Defining total shards allowed per node
But what those properties mean?
Index-level updates
Cluster-level updates
Inclusion
Requirements
Exclusion
Additional shard allocation properties
Query execution preference
Introducing the preference parameter
Using our knowledge
Assumptions
Data volume and queries iguration
Changes are coming
Node-level configuration
Indices configuration
The directories layout
Gateway configuration
Recovery
Discovery
Logging slow queries
Logging garbage collector work
Memory setup
One more ng
Multiple Indices
Summary
Chapter 5: ElasticSearch Administration
Choosing the right directory implementation – the store module
Store type
The simple file system store
The new IO filesystem store
The MMap filesystem store
The memory store
The default store type
Discovery configuration
Zen t
Unicast
Minimum master nodes
Zen discovery fault detection

Amazon EC2 discovery
Local gateway
Recovery configuration
EC2 plugin's installation
Gateway and recovery configuration
Gateway recovery process
Configuration properties
Expectations on nodes
Backing up the local gateway
Cluster-level recovery configuration
Index-level recovery settings
Segments statistics
Introducing the segments API
Visualizing segments information
Understanding ElasticSearch caching
The filter cache
The response
Filter cache types
Index-level filter cache configuration
Node-level filter cache configuration
The field data cache
Clearing the caches
Index-level field data cache configuration
Node-level field data cache configuration
Filtering
Index, indices, and all caches clearing
Clearing specific caches
Clearing fields-related caches
Summary
Chapter 6: Fighting with Fire
Knowing the garbage collector
Java memory
The life cycle of Java object and garbage collections
Dealing with garbage collection problems
Turning on logging of garbage collection work
Using JStat
Creating memory dumps
More information on garbage collector work
Adjusting garbage collector work in ElasticSearch
Avoiding swapping on Unix-like systems
When it is too much for I/O – throttling explained
Controlling I/O throttling
Configuration
Throttling type
Maximum throughput per second

Node throttling defaults
Configuration example
Speeding up queries using warmers
Reason for using warmers
Manipulating warmers
Using the PUT Warmer API
Adding warmers during index creation
Adding warmers to templates
Retrieving warmers
Deleting warmers
Disabling warmers
Testing the g without warmers present
Querying with warmer present
Very hot threads
Hot Threads API usage clarification
Hot Threads API response
Real-life scenarios
Slower and slower performance
Heterogeneous environment and load imbalance
My server is under fire
Summary
Chapter 7: Improving the User Search Experience
Correcting user spelling mistakes
Test data
Getting into technical details
Suggesters
Using the suggest REST endpoint
Including suggestions requests in a query
The term suggester
The phrase suggester
Completion suggester
The logic behind completion suggester
Using completion
Improving query relevance
The data
The quest for improving relevance
Summary
The standard query
The Multi match query
Phrases comes into play
Let's throw the garbage away
And now we boost
Making a misspelling-proof search
Drill downs with faceting

Chapter 8: ElasticSearch Java APIs
Introducing the ElasticSearch Java API
The code
Connecting to your cluster
Becoming the ElasticSearch node
Using the transport connection method
Choosing the right connection method
Anatomy of the API
CRUD operations
Fetching documents
Handling errors
Indexing documents
Updating documents
Deleting documents
Querying ElasticSearch
Preparing a query
Building rming multiple cSearch 1.0 and higher
The explain API
Building JSON queries and documents
The administration API
The cluster administration API
Using the match all documents query
The match query
Using the geo shape query
Bulk
The delete by query
Multi GET
Multi Search
The cluster and indices health API
The cluster state API
The update settings API
The reroute API

The nodes information API
The node statistics API
The nodes hot threads API
The nodes shutdown API
The search shards API
The Indices administration API
The index existence API
The Type existence API
The indices stats API
Index status
Segments information API
Creating an index API
Deleting an index
Closing an index
Opening an index
The Refresh API
The Flush API
The Optimize API
The put mapping API
The delete mapping API
The gateway snapshot API
The aliases API
The get aliases API
The aliases exists API
The clear cache API
The update settings API
The analyze API
The put template API
The delete template API
The validate query API
The put warmer API
The delete warmer
Summary
Chapter 9: Developing ElasticSearch Plugins
Implementing the URLChecker class
Implementing the JSONRiver class
Implementing the JSONRiverModule class
Implementing the JSONRiverPlugin class
Informing ElasticSearch about the JSONRiver plugin class
Creating the Apache Maven project structure
Understanding the basics
Structure of the Maven Java project
The idea of POM
Running the build process
Introducing the assembly Maven plugin
Creating a custom river plugin
Implementation details

Testing our river
Building our river
Installing our river
Initializing our river
Checking if our JSON river works
Creating custom analysis plugin
Implementation details
Implementing TokenFilter
Implementing the TokenFilter factory
Implementing custom analyzer
Implementing analyzer provider
Implementing analysis binder
Implementing analyzer indices component
Implementing analyzer module
Implementing analyzer plugin
Informing ElasticSearch about our custom analyzer
Testing our custom analysis plugin
Building our custom analysis plugin
Installing the custom analysis plugin
Checking if our analysis plugin
Summary
Index

Preface

Welcome to the world of ElasticSearch and to the Mastering ElasticSearch book. While reading the book you'll be taken through different topics, all connected to ElasticSearch. We will start with the introduction to Apache Lucene and ElasticSearch, because even if you are familiar with it, it is crucial to have the background in order to fully understand what is going on when you form a cluster, send a document for indexing, or make a query.

You will learn how Apache Lucene scoring works, how to influence it, and how to tell ElasticSearch to choose different scoring algorithms. The book will show you what query rewriting is and why it happens. Apart from that, you'll see how to change your queries to leverage ElasticSearch caching capabilities and make maximum use of them.

After that we will focus on index control. We will learn how to change the way index fields are written by using different posting formats. We will discuss segment merging, why it is important, and how to adjust it when there is a need. We'll take a deeper look at the shard allocation mechanism and routing, and finally we'll learn what to do when the amount of data and the number of queries grow.

The book can't omit a description of the garbage collector: how it works, and when and where to start tuning its behavior. In addition to that, it covers functionalities that allow us to troubleshoot ElasticSearch, such as describing how segments merging works, how to see what ElasticSearch does beneath its high-level interface, and how to limit the I/O operations. But the book doesn't only pay attention to low-level aspects of ElasticSearch; it includes tips for improving the user search experience, such as dealing with spelling mistakes, building a highly effective autocomplete feature, and a tutorial on how you can deal with query-related improvements.

In addition to this, the book you are holding will guide you through the ElasticSearch Java API, showing how to use it, not only when it comes to CRUD operations but also when it comes to cluster and indices maintenance and manipulation. Finally, we will take a deep look at ElasticSearch extensions by developing a custom river plugin for data indexing and a custom analysis plugin for data analysis during query and index time.

What this book covers

Chapter 1, Introduction to ElasticSearch, will guide you through how Apache Lucene works and will reintroduce you to the world of ElasticSearch, describing the basic concepts and showing how ElasticSearch works internally.

Chapter 2, Power User Query DSL, describes how Apache Lucene scoring works, why ElasticSearch rewrites queries, and how the query rescore mechanism works. In addition to that, it explains the batch APIs available in ElasticSearch and shows how to use filters to optimize your queries.

Chapter 3, Low-level Index Control, describes how to alter Apache Lucene scoring and how to alter fields' structure by using different posting formats. It also covers NRT searching and indexing and transaction log usage, and helps you understand segments merging and tune it for your use case.

Chapter 4, Index Distribution Architecture, covers techniques for choosing the right number of shards and replicas, explains how routing works, and describes in depth how shard allocation works and how to alter its behavior. In this chapter, we also discuss how to configure your ElasticSearch cluster in the beginning and what to do when the amount of data and the number of queries increase.

Chapter 5, ElasticSearch Administration, describes how to choose the right directory implementation for your use case, what the Discovery, Gateway, and Recovery modules are, how to configure them, and why you should bother. We also describe how to look at the segments' information provided by ElasticSearch and how to tune and use the ElasticSearch caching mechanism.

Chapter 6, Fighting with Fire, covers how the JVM garbage collector works, why it is so important, and how to start tuning it. It also describes how to control the amount of I/O operations ElasticSearch is using, what warmers are and how to use them, and how to diagnose problems with ElasticSearch.

Chapter 7, Improving the User Search Experience, introduces you to the world of suggesters, which allow us to correct user query spelling mistakes and build efficient autocomplete mechanisms. In addition to that you'll see, on a real-life example, how to improve query relevance by using different queries and ElasticSearch functionalities.

Chapter 8, ElasticSearch Java APIs, covers the ElasticSearch Java API, from basics such as connecting to ElasticSearch, through indexing documents both one by one and in batches and retrieving them afterwards. It also describes different methods exposed by the ElasticSearch Java API that allow us to control the cluster.

Chapter 9, Developing ElasticSearch plugins, covers ElasticSearch plugins development by showing and describing in depth how to write your own river and language plugin.

What you need for this book

This book was written using ElasticSearch server 0.90.x; all the examples and functions should work with it. In addition to that, you'll need a command that allows sending HTTP requests, such as curl, which is available for most operating systems. Please note that all the examples in this book use the mentioned curl tool. If you want to use another tool, please remember to format the request in an appropriate way that is understood by the tool of your choice.

In addition to that, to run examples in Chapter 8, ElasticSearch Java APIs and Chapter 9, Developing ElasticSearch Plugins, you will need a JDK (Java Development Kit) installed and an editor that will allow you to develop your code (or a Java IDE such as Eclipse). In both the mentioned chapters we are also using Apache Maven to build the code.

Who this book is for

This book was written for ElasticSearch users and enthusiasts who are already familiar with the basic concepts of this great search server and want to extend their knowledge when it comes to ElasticSearch itself, but it also deals with topics such as how Apache Lucene or the JVM garbage collector works. In addition to that, readers who want to see how to improve their query relevancy, how to use the ElasticSearch Java API, and how to extend ElasticSearch with their own plugin, may find this book interesting and useful.

If you are new to ElasticSearch and you are not familiar with basic concepts such as querying and data indexing, you may find it hard to use this book as most of the chapters assume that you have this knowledge already. In such cases, we suggest looking at our previous book about ElasticSearch, the ElasticSearch Server book from Packt Publishing.
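As an illustration of the remark about using a tool other than curl, the following is a minimal sketch of how a curl-style indexing request might be expressed with Python's standard library. The host, port, index, type, and document used here mirror the sample command from the Conventions section; the helper name build_index_request is hypothetical and not part of the book or of ElasticSearch.

```python
import json
import urllib.request

# Assumption: a local ElasticSearch node on the default HTTP port.
BASE_URL = "http://localhost:9200"


def build_index_request(index, doc_type, doc_id, document):
    """Build (but do not send) an HTTP request that indexes one document.

    Hypothetical helper for illustration; it corresponds to
    curl -XPOST <host>/<index>/<type>/<id> -d '<json>'.
    """
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/{doc_type}/{doc_id}",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_index_request("test", "test", 1, {"title": "test"})
print(request.get_method(), request.full_url)

# Sending the request requires a running node, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to check the method, URL, and body against the curl form before touching a cluster.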

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "What we would like to do is, use the BM25 similarity model for the name field and the contents field."

A block of code is set as follows:

    {
      "mappings" : {
        "post" : {
          "properties" : {
            "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
            "name" : { "type" : "string", "store" : "yes", "index" : "analyzed" },
            "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }
          }
        }
      }
    }

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    {
      "settings" : {
        "index" : {
          "similarity" : {
            "default" : {
              "type" : "default",
              "discount_overlaps" : false
            }
          }
        }
      },
      ...
    }
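The quoted sentence about using BM25 for the name and contents fields can be sketched against the sample mapping shown above. This is only an illustration under stated assumptions: it assumes that a per-field "similarity" key in the mapping is how ElasticSearch 0.90.x selects a similarity model, the mapping is trimmed to the keys needed here, and the helper with_similarity is hypothetical.

```python
import copy
import json

# Trimmed version of the sample mapping from the Conventions section.
BASE_MAPPING = {
    "mappings": {
        "post": {
            "properties": {
                "id": {"type": "long", "store": "yes"},
                "name": {"type": "string", "store": "yes", "index": "analyzed"},
                "contents": {"type": "string", "store": "no", "index": "analyzed"},
            }
        }
    }
}


def with_similarity(mapping, doc_type, fields, similarity):
    """Return a copy of the mapping with a similarity set on the given fields.

    Hypothetical helper; the "similarity" key is an assumption about the
    0.90.x per-field mapping syntax.
    """
    result = copy.deepcopy(mapping)
    properties = result["mappings"][doc_type]["properties"]
    for field in fields:
        properties[field]["similarity"] = similarity
    return result


bm25_mapping = with_similarity(BASE_MAPPING, "post", ["name", "contents"], "BM25")
print(json.dumps(bm25_mapping, indent=2))
```

Working on a deep copy leaves the base mapping untouched, so the same starting point can be reused to compare different similarity settings.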

Any command-line input or output is written as follows:

    curl -XPOST localhost:9200/test/test/1 -d '{ "title": "test" }'

New terms and important words are shown in bold.

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to feedback@packtpub.com, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help