Lucid Imagination - Indexing Text And HTML Files With Solr

Transcription

Indexing Text and HTML Files with Solr, the Lucene Search Server
A Lucid Imagination Technical Tutorial
By Avi Rappoport, Search Tools Consulting

Abstract

Apache Solr is the popular, blazing fast open source enterprise search platform; it uses Lucene as its core search engine. Solr's major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and complex queries. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites. Lucid Imagination's LucidWorks Certified Distribution for Solr provides a fully open distribution of Apache Solr, with key complements including a full Reference Guide, an installer, and additional functions and utilities. All the core code and many new features are available, for free, at the Lucid Imagination web site (www.lucidimagination.com/downloads).

In the past, examples available for learning Solr were for strictly-formatted XML and database records. This new tutorial provides clear, step-by-step instructions for a more common use case: how to index local text files, local HTML files, and remote HTML files. It is intended for those who have already worked through the Solr Tutorial or equivalent. Familiarity with HTML and a terminal command line are all that is required; no formal experience with Java or other programming languages is needed. System requirements for this tutorial are those of the Startup Tutorial: Unix, Cygwin (Unix on Windows), or Mac OS X; Java 1.5; disk space; permission to run applications; and access to content.

Indexing Text and HTML Files with Solr, A Lucid Imagination Tutorial, February 2010

Contents

Introduction
Part 1: Installing This Tutorial
Part 2: Solr Indexing with cURL
    Using the cURL command to index Solr XML
    Troubleshooting errors with cURL Solr updates
    Viewing the first text file in Solr
Part 3: Using Solr to Index Plain Text Files
    Invoking Solr Cell
    Parameters for more fields
Part 4: Indexing All Text Files in a Directory
    Shell script for indexing all text files
    More robust methods of indexing files
Part 5: Indexing HTML Files
    Simplest HTML indexing
    Storing more metadata from HTML
    Storing body text in a viewable field
Part 6: Using Solr Indexing for Remote HTML Files
    Using cURL to download and index remote files
    File streaming for indexing remote documents
    Spidering tools
Conclusion and Additional Resources
About Lucid Imagination
About the Author

Introduction

Apache Solr is the popular, blazing fast open source enterprise search platform; it uses Lucene as its core search engine. Solr's major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and complex queries. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.[1]

Today, the newly released Solr 1.4 includes a new module called Solr Cell that can access many file formats, including plain text, HTML, zip, OpenDocument, and Microsoft Office formats (both old and new). Solr Cell invokes the Apache Tika extraction toolkit, another part of the Apache Lucene family, integrated into Solr. This tutorial provides a simple introduction to this powerful file access functionality.

In this tutorial, we'll walk you through the steps required for indexing readily accessible sources with simple command-line tools for Solr, using content you are likely to have access to: your own files, local discs, intranets, file servers, and web sites.

Part 1: Installing This Tutorial

As it turns out, the existing examples in the default installation of the Solr Tutorial are for indexing specific formats of XML and JDBC-interface databases. While those formats can be easier for search engines to parse, many people learning Solr do not have access to such content. This new tutorial provides clear, step-by-step instructions for a more common use case: how to index local text files, local HTML files, and remote HTML files.
It is intended for those who have already worked through the Solr Tutorial or equivalent.

This tutorial will add more example entries, using Abraham Lincoln's Gettysburg Address and the United Nations' Universal Declaration of Human Rights as text files and as HTML files, and walk you through getting these document types indexed and searchable.

First, follow the instructions in the Solr Tutorial (rks-for-Solr or http://lucene.apache.org/solr/tutorial.html) from installation to Querying Data (or

[1] Lucene, the Apache search library at the core of Solr, presents its interfaces through a collection of directly callable Java libraries, offering fine-grained control of machine functions and independence from higher-level protocols, and requiring development of a full Java application. Most users building Lucene-based search applications will find they can do so more quickly if they work with Solr, as it adds many of the capabilities needed to turn a core search function into a full-fledged search application.

beyond). When you are done, the Solr index file will have about 22 example entries, most of them about technology gadgets.

Next, use a browser or ftp program to access the tutorial directory on the Lucid Imagination web site (al/). You should find the example .txt and .html files, the post-text.sh script, the remote subdirectory, schema.xml, and thtutorial.zip.

For your convenience, all of the files above are included in nload/thtutorial/thtutorial.zip. Move the zip file to the Solr example directory (which is probably in usr/apache-solr-1.4.0 or /LucidWorks), and unzip it: this will create an example-text-html directory.

Working Directory: example-text-html

This tutorial assumes that the working directory is [Solr home]/examples/examples-text-html; you can check your location by using the Unix command-line utility pwd.

Setting the schema

Before starting, it's important to update the example schema file to work properly with text and HTML files. The schema needs one extra field defined, so that all words in the plain text files, and all HTML body words, go into the default field for searching.

Make a backup by renaming the conf directory file from schema.xml to schema-bak.xml:

% mv ././lucidworks/solr/conf/schema.xml ././lucidworks/solr/conf/schema-bak.xml

Then either copy the text-html version of the schema, or edit the version that's there to include the body text field.

Either copy the new one from the example-text-html directory into the conf directory:

% cp schema.xml ././lucidworks/solr/conf/schema.xml

or (for Apache installs):

% cp schema.xml ./solr/conf/schema.xml

Or edit the schema to add this field:

Open the original schema.xml in your favorite text editor.

Go to line 469 (LucidWorks) or 450 (Apache). This should be the Solr Cell section, with other HTML tags:

<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>

and add the code to create the body field:

<field name="body" type="text" indexed="true" stored="true" multiValued="true"/>

Go to line 558 (LucidWorks) or 540 (Apache), and look for the copyField section:

<copyField source="includes" dest="text"/>
<copyField source="manu" dest="manu_exact"/>

Go to the end of the section, after the manu field, and add the line to copy the body field content into the text field (the default search field):

<copyField source="body" dest="text"/>

Save and close the schema.xml file.

Restarting Solr

Solr will not use the new schema until you restart the search engine. If you haven't done this before, follow these steps:

- Switch to the terminal window in which the Solr engine has been started.
- Press Control-C to end this session: it should show you that Shutdown hook is executing.
- (Apache) Type the command java -jar start.jar to start it again. This only works from the example directory, not from the example-text-html directory.
- (LucidWorks) Start Solr by running the start script, or by clicking on the system tray icon.

Part 2: Solr Indexing with cURL

Plain text seems as though it should be the simplest, but there are a few steps to go through. This tutorial will walk through the steps, using the Unix shell cURL command.

Using the cURL command to index Solr XML

The first step is communicating with Solr. The Solr Startup Tutorial shows how to use the Java tool to index all the .xml files. This tutorial uses the cURL utility available in Unix, within the command-line (terminal) shell.

Telling Solr to index is like sending a POST request from an HTML form, with the appropriate path name (by default /update) and parameters. cURL uses this process, on the command line. This example uses the test file lu-example-1.xml.

To start, be in the solr/example/example-text-html directory. Then, instruct Solr to update (index) an XML file using cURL, and then finish the index update with a commit command:

curl 'http://localhost:8983/solr/update/' -H 'Content-type: text/xml' --data-binary "@lu-example-1.xml"

curl 'http://localhost:8983/solr/update/' -H "Content-Type: text/xml" --data-binary '<commit/>'

Successful calls have a response status of 0:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">000</int></lst>
</response>

Troubleshooting errors with cURL Solr updates

If you have a cURL error, it's usually mis-matched double quotes or single quotes. If you see one of the following, go back and try again:

curl: (26) failed creating formpost data
curl: (3) url malformed
Warning: Illegally formatted input field!
curl: option -F: is badly used here

Common error numbers from the Solr server itself include 400 and 500. These mean that the POST was properly formatted but included parameters that Solr could not identify. When that happens, go back to a previous command that did work, and start building the new one up from there. These errors should not damage your Solr search engine or index.

<title>Error 400</title></head><body><h2>HTTP ERROR: 400</h2><pre>Unexpected character 's' (code 115) in prolog; expected '<'
at [row,col {unknown-source}]: [1,1]</pre>

or

<title>Error 500</title></head><body><h2>HTTP ERROR: 500</h2><pre>org.apache.lucene.store.NoSuchDirectoryException: ta/index' does not exist

If you can't make this work, you may want to follow the instructions in the Solr Startup Tutorial to create a new Solr directory, and confirm using the Java indexing instructions for the exampledocs XML files before continuing.

Viewing the first text file in Solr

Once you have successfully sent the XML file to Solr's update processor, go to your browser, as in the Getting Started tutorial, and search your Solr index for "gettysburg": http://localhost:8983/solr/select?q=gettysburg

The result should be an XML document, which will report one item matching the new test file (rather than the earlier example electronic devices files). The number of matches is on about the eighth line, and looks like this:

<result name="response" numFound="1" start="0">

After that, the Solr raw interface will show the contents of the indexed file.
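When scripting these checks, the status and hit count can be pulled out of the response with ordinary shell tools. The sketch below uses a canned response string so it runs without a live server; against a running Solr you would capture the output of the curl query instead.

```shell
# Canned Solr response; against a live server this would be something like:
#   response=$(curl -s 'http://localhost:8983/solr/select?q=gettysburg')
response='<response><lst name="responseHeader"><int name="status">0</int></lst><result name="response" numFound="1" start="0"></result></response>'

# Extract the status code and the numFound count with sed
status=$(echo "$response" | sed -n 's/.*<int name="status">\([0-9]*\)<\/int>.*/\1/p')
numFound=$(echo "$response" | sed -n 's/.*numFound="\([0-9]*\)".*/\1/p')

echo "status=$status numFound=$numFound"
```

A nonzero status, or a 400/500 HTML page arriving instead of XML, signals the kinds of errors described above.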

Notes

- You must use a browser that can render XML, such as Firefox, Internet Explorer, or Opera (but not Safari).
- The field label <arr> indicates a multiValued field.

Part 3: Using Solr to Index Plain Text Files

Integrated with Solr version 1.4, Solr Cell (also known as the ExtractingRequestHandler) provides access to a wide range of file formats using the integrated Apache Tika toolkit, including untagged plain text files. The test file for this tutorial is lu-example-2.txt. It has no tags or metadata within it, just words and line breaks.

Note

The Apache Tika project reports that extracting the words from plain text files is surprisingly complex, because there is so little information on the language and alphabet used. The text could be in Roman (Western European), Indic, Chinese, or any other character set. Knowing this is important for indexing, in particular for defining the rules of word breaks, a process called tokenization.

Invoking Solr Cell

To trigger the Solr Cell text file processing (as opposed to the Solr XML processing), add extract to the URL path in the POST command: /solr/update/extract.

This example includes three new things: the extract path term, a document ID (because this file doesn't have an ID tag), and an inline commit parameter, to send the update to the index:

curl 'http://localhost:8983/solr/update/extract?literal.id=exid2&commit=true' -F "myfile=@lu-example-2.txt"

The response status of 0 signals success. Your cURL command has added the contents of lu-example-2.txt to the index.

When running the query http://localhost:8983/solr/select?q=gettysburg in the index, both documents are matched:

<result name="response" numFound="2" start="0">

Unlike the indexed XML document, with this text document there are only two fields (content-type and id) that are visible in the search result. The text content, even the word "Gettysburg," all seems to be missing.

How can Solr match words in a file using text that doesn't seem to be there? It's because Solr's default schema.xml is set to index for searching, but not store for viewing. In other words, Solr isn't preset to store, for your viewing, the parts of the documents with no HTML tags or other labels. For plain text files, that's everything, so the next section is about changing that behavior.

Parameters for more fields

Solr Cell provides ways to control the indexing without having to change source code. Parameters in the POST message set the option to save information about the documents in appropriate fields, and then to grab the text itself and save it in a field. The metadata can be extracted without the file contents or with the contents.

Solr Cell external metadata

When Solr Cell reads a document for indexing, it has some information about the file, such as the name and size. This is metadata (information about information), and can be very valuable for search and results pages. Although these fields are not in the schema.xml file, Solr is very flexible, and can put them in dynamic fields that can be searched and displayed.

The operative parameter is uprefix=attr_; when added to the POST command, it will save the file name, file size (in bytes), content type (usually text/plain), and sometimes the content encoding:

curl 'http://localhost:8983/solr/update/extract?literal.id=exid2&uprefix=attr_&commit=true' -F "myfile=@lu-example-2.txt"

Here is an example of the same file, indexed with the uprefix=attr_ parameter:
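Roughly, the stored document now carries attr_-prefixed dynamic fields. The sketch below is illustrative only; the field names are typical of what Tika reports for a plain-text file, and the values are placeholders rather than Solr's exact output.

```shell
# Illustrative sketch only: prints roughly what the stored document looks
# like with uprefix=attr_. Field names are typical Tika metadata for a
# plain-text file; values are placeholders, not real Solr output.
show_doc() {
  cat <<'EOF'
<doc>
  <str name="id">exid2</str>
  <arr name="attr_stream_name"><str>lu-example-2.txt</str></arr>
  <arr name="attr_content_type"><str>text/plain</str></arr>
</doc>
EOF
}

show_doc
```

Note the <arr> wrappers: as in the earlier example, they mark multiValued fields.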

Mapping document content

Once the metadata is extracted, Solr Cell can be configured to grab the text as well. The fmap.content=body parameter stores the file content in the body field, where it can be searched and displayed.

Note

Using the fmap parameter without uprefix will not work. To see the body text, the schema.xml must have a body field, as described in the Install section above.

Here's an example of an index command with both attribute and content parameters:

curl 'http://localhost:8983/solr/update/extract?literal.id=exid3&uprefix=attr_&fmap.content=body&commit=true' -F "txtfile=@lu-example-3.txt"

Searching the Solr index (http://localhost:8983/solr/select?q=gettysburg) will now display all three example files. For lu-example-3.txt, it shows the body text in the body field and metadata in various fields.
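All of the extract commands in this tutorial share one URL shape. The helper below is purely illustrative (extract_url is not part of Solr; only the parameter names literal.id, uprefix, fmap.content, and commit are real) and just makes the pattern explicit:

```shell
# Hypothetical helper, not part of Solr: assemble the extract URL used
# throughout this tutorial from a document ID.
SOLR_EXTRACT="http://localhost:8983/solr/update/extract"

extract_url() {
  # $1 = document ID, assigned via literal.id
  echo "${SOLR_EXTRACT}?literal.id=$1&uprefix=attr_&fmap.content=body&commit=true"
}

extract_url exid3
```

Against a live server, the command would then be run as: curl "$(extract_url exid3)" -F "txtfile=@lu-example-3.txt"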

Part 4: Indexing All Text Files in a Directory

The Solr Startup Tutorial exampledocs directory contains a post.sh file, which is a shell script that uses cURL to send files to the default Solr installation for indexing. This version uses the cURL commands above to send .txt (as opposed to .xml) files to Solr for indexing. The file post-text.sh should be in the ./example/example-text-html/ directory with the test files.

Shell script for indexing all text files

- Set the permissions: chmod +x post-text.sh
- Invoke the script: ./post-text.sh

You should see the response with status 0 and the other lines after each item; if you do not, check each line for exact punctuation and try again.

When you go back to search on Solr, http://localhost:8983/solr/select?q=gettysburg, you will find five text documents and one XML document.

Different doc IDs: adds an additional document

Note that the results include two different copies of the first example, both containing "Four score and seven years ago", because the script loop sent all text files with the generated exid number, while the XML example contains an id starting with exidx.

Identical doc IDs: replaces a document

The second example text file had some text that was indexed but not stored as a text block when we first indexed it. Now it has content in the body field, because the script loop sent it with the same ID (and the new parameters), so Solr updated the copy that was already in the index, using the Doc ID as the definitive identifier.

For more information on IDs, see the LucidWorks Certified Distribution Reference Guide on Unique Key.

More robust methods of indexing files

Sending indexing and other commands to Solr via cURL is an easy way to try new things and share ideas, but cURL is not built to be a production-grade facility. And because Solr's HTTP API is so straightforward, there are many ways to call Solr programmatically. There are libraries for Solr in nearly every language, including Java, Ruby, PHP, JSON, C#, and Perl, among others.
Many content management systems (CMS), publishing, and social media systems have Solr modules, such as Ruby on Rails, Django, Plone, TYPO3, and Drupal; it is also used in cloud computing environments such as Amazon Web Services and Google

Code. For more information, check the Solr wiki and the LucidWorks Solr client API Lineup in the LucidWorks Certified Distribution Reference Guide.

Part 5: Indexing HTML Files

This tutorial uses the same cURL commands and shell scripts for HTML as for text. Solr Cell and Tika already extract many HTML tags, such as title and date modified.

Note

All the work described above on text files also applies to HTML files, so if you've skipped to here, please go back and read the first sections.

Simplest HTML indexing

The first example will index an HTML file with a quote from the Universal Declaration of Human Rights:

curl 'http://localhost:8983/solr/update/extract?literal.id=exid6&commit=true' -F "myfile=@lu-example-6.html"

Doing a query for "universal", http://localhost:8983/solr/select?q=universal, shows us that Solr Cell created the metadata fields title and links, because they are standard HTML constructs.

Again, by default, the body text is indexed but not stored; and changing that is just as easy as changing it with the text files.

Storing more metadata from HTML

As in the text file section of this tutorial, this example uses the uprefix=attr_ parameter to mark those fields that Solr Cell automatically creates but which are not in the schema.xml. This is not a standard, but it's a convention that's widely used:

curl 'http://localhost:8983/solr/update/extract?literal.id=exid7&uprefix=attr_&commit=true' -F "myfile=@lu-example-7.html"

Searching for "universal" now finds both HTML documents. While exid6 has very little stored data, exid7 has the internal metadata of the document, including the title, author, and comments.

Note

Apache Tika uses several methods to identify file formats. These include extensions, like .txt or .html, MIME types such as text/plain or application/pdf, and known file format header patterns. It's always best to have your source files use these labels, rather than relying on Tika to guess.
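You can get a feel for content-based type detection locally with the Unix file utility, which, loosely like Tika, guesses a MIME type from the bytes themselves. (This assumes file with the --mime-type flag, present on most Linux and macOS systems; it is not Tika and may guess differently.)

```shell
# Create a small untagged text file, then ask `file` to guess its MIME type.
# This is an analogy to Tika's detection, not Tika itself.
printf 'plain words, no tags or metadata\n' > /tmp/lu-sample.txt
file --mime-type /tmp/lu-sample.txt
# prints something like: /tmp/lu-sample.txt: text/plain
```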

Storing body text in a viewable field

As in the text file example, indexing Example 8 uses the fmap parameter to map the text from within the body element of the HTML document to the body field which is in this example schema, so it will be both searchable and stored:

curl 'http://localhost:8983/solr/update/extract?uprefix=attr_&fmap.content=body&commit=true&literal.id=exid8' -F "myfile=@lu-example-8.html"

Part 6: Using Solr Indexing for Remote HTML Files

Using cURL to download and index remote files

The cURL utility is a fine way to download a file served by a Web server, which in this tutorial we'll call a remote file. With the -O flag (capital letter O, not the digit zero), cURL will save a copy of the file with the same name into the current working directory. If there's a file with that name already, it will be over-written, so be careful:

curl -O l/lu-example-9.html

Note

If web access is unavailable, there's a copy of the lu-example-9.html file in the remote subdirectory in the zip file.

If you view the files in the local examples-text-html directory, there will be a lu-example-9.html file. The next step is to send it to Solr, which will use Solr Cell to index it:

curl 'http://localhost:8983/solr/update/extract?literal.id=exid9&uprefix=attr_&fmap.content=body&commit=true' -F "exid9=@lu-example-9.html"

This will index and store all the text in the file, including the body, comments, and description.

File streaming for indexing remote documents

Solr also supports a file streaming protocol, sending the remote document URL to be indexed. For more information, see the ExtractingRequestHandler and ContentStream pages in the LucidWorks Certified Distribution Reference Guide for Solr, or the Solr wiki. Note that enabling remote streaming may create an access control security issue; for more information, see the Security page on the wiki.

Spidering tools

This tutorial doesn't cover the step of adding a spider (also known as a crawler or robot) to the indexing process. Spiders are programs that open web pages and follow links on the pages, to index a web site or an intranet server. This is how horizontal consumer web search providers such as Google, Ask, and Bing find so many pages.

Solr doesn't have an integrated spider, but works well with another Apache Lucene open source project, the Nutch crawler. There's a very helpful post on Lucid Imagination's site, Using Nutch with Solr, which explains further how this works.

Alternatives include Heritrix from the Internet Archive, JSpider, WebLech, Spider on Rails, and OpenWebSpider.

Conclusion and Additional Resources

Now that you've had the opportunity to try Solr on HTML content, the opportunities to build a search application with it are as diverse and broad as the content you need to search! Here are some resources you will find useful in building your own search applications:

- Configuring the ExtractingRequestHandler, in Chapter 6 of the LucidWorks for Solr Certified Distribution Reference Guide: idWorks-for-Solr/Reference-Guide
- Solr Wiki: Extracting Request Handler (Solr Handler
- Tika: http://lucene.apache.org/tika/
- Content Extraction with Tika, by Sami a
- Optimizing Findability in Lucene and Solr, by Grant lity-Lucene-and-Solr

About Lucid Imagination

Mission critical enterprise search applications in e-commerce, government, research, media, telecommunications, Web 2.0, and many more use Apache Lucene/Solr to ensure end users can find valuable, accurate information quickly and efficiently across the enterprise. Lucid Imagination complements the strengths of this technology with a foundation of commercial-grade software and services with unmatched expertise. Our software and services solutions help organizations optimize performance and achieve high quality search results with their Lucene/Solr applications. Lucid Imagination customers include AT&T, Nike, Sears, Ford, Verizon, The Guardian, Elsevier, The Motley Fool, Cisco, Macy's and Zappos.

Lucid Imagination is here to help you meet the most demanding search application requirements. Our free LucidWorks Certified Distributions are based on these most popular open source search products, including free documentation. And with our industry-leading services, you can get the support, training, value added software, and high-level consulting and search expertise you need to create your enterprise-class search application with Lucene and Solr.

For more information on how Lucid Imagination can help search application developers, employees, customers, and partners find the information they need, please visit http://www.lucidimagination.com to access blog posts, articles, and reviews of dozens of successful implementations. Please e-mail specific questions to:

- Support and Service: support@lucidimagination.com
- Sales and Commercial: sales@lucidimagination.com
- Consulting: consulting@lucidimagination.com

Or call: 1.650.353.4057

About the Author

Avi Rappoport really likes good search, and Solr is really good. She is the founder of Search Tools Consulting, which has given her the opportunity to work with site, portal, intranet, and Enterprise search engines since 1998.
She also speaks at search conferences, writes on search-related topics for InfoToday and other publishers, co-manages the LinkedIn Enterprise Search Engine Professionals group, and is the editor of
