The Search Engine For PDF Files - Aquaforest

Transcription

Tabula DXThe Search Engine for PDF FilesReference Guide Version 1.12July 2008 Copyright 2008 Aquaforest Limitedhttp://www.aquaforest.com/

CONTENTS1INTRODUCTION . 22INSTALLATION AND INITIAL CONFIGURATION. 32.12.22.32.42.52.62.7SYSTEM REQUIREMENTS . 3INSTALLING TABULA DX. 3TESTING THE INSTALLATION. 3TRIAL LICENSE RESTRICTIONS . 4UNINSTALLING THE PRODUCT . 4THE SAMPLE / DEMO COLLECTIONS . 4USING TABULA DX WITH IIS . 62.7.12.7.22.7.33Setting Up Tabula DX with IIS 5 (Windows XP) . 6Setting Up Tabula DX with IIS 6 (Windows 2003) .10Using Tabula DX with IIS 7 (Windows Vista, Windows 2008). 13SEARCH QUERY EXPRESSIONS . 153.13.24SEARCH FIELDS . 15QUERY EXPRESSIONS . 15CONFIGURING SEARCH COLLECTIONS . 174.14.24.34.4SYSTEM-WIDE SETTINGS . 17COLLECTION SETTINGS . 18COLLECTION ATTRIBUTES. 18CONFIGURING COLLECTION INDEXING . 205COMMAND LINE INDEXING . 216CUSTOMIZATION AND INTEGRATION . 226.16.26.36.4THE SEARCH URL PARAMETERS . 22CUSTOMIZING THE SEARCH INTERFACE. 22XML OUTPUT . 22WEB.CONFIG PARAMETERS . 247TABULA DX DIRECTORIES. 258TABULA DX AND LUCENE . 269PRODUCT VERSION HISTORY. 279.19.210VERSION 1.12 . 27VERSION 1.01 . 27ACKNOWLEDGEMENTS . 271

1INTRODUCTIONDesigned specifically for search-enabling large collections of PDF files via a browser interface, TabulaDX offers the following benefits and features :Complete PDF Searchability - Search on PDF bookmarks, annotations, and metadata including XMPwith no limit on the number of PDF pages that will be indexed.Ease of Use - Present users with a familiar search interface and document thumbnails.Performance and Scalability - Tabula DX is based on the Lucene search API which has been provento robustly support collections of millions of documents.Customizable - Simple user interface customization via XSL.Integration Support - Search results can be returned as pure XML from any web-based method.Designed for IIS and ASP.Net - Tabula DX is built using C# and ASP.Net for simple integration intoa Microsoft-based environment.Simple License Model - Tabula DX is licensed per server and has no limits on the number ofdocuments that can be indexed.Lucene Compatible - The Tabula DX Lucene indexes are compatible with tools supporting Lucene1.4 or later.Simple Web-Based Administration - The administration module allows creation of PDF collections,settings and index scheduling.2

2INSTALLATION AND INITIAL CONFIGURATION2.12.2System Requirements Windows XP, Windows 2003, Vista, Windows 2008 Version 2.0 of the .NET Framework Note : To run batch indexing jobs via the web interface a suitable user id and password will berequired, This can be set using the Settings section of the Tabula DX web interface. Web Server : Tabula DX includes the lightweight UltiDev Cassini web server and the productis initially configured to use this. For production use IIS is recommended as a web server.See section x.x below for configuration details.Installing Tabula DXThe source media zip file contains a setup.exe which will install Tabula DX along with the Cassinilightweight web server. Please contact support@aquaforest.com should you require any assistance.2.3Testing The InstallationTo get started with Tabula DX, access the Tabula DX shortcut that is installed on the desktop andunder the Programs menu.The main Tabula DX administration page3

Installation Test Page2.4Trial License RestrictionsThe trial license does not expire but limits the size of document collections to 1,000 documents. Pleasecontact sales@aquaforest.com for additional assistance with trial licensing.2.5Uninstalling The ProductTabula DX can be uninstalled via Windows Add / Remove Programs. UltiDev Cassini Web ServerExplorer and UltiDev Cassini Web Server for ASP.Net 2.0 can be removed in the same way.2.6The Sample / Demo CollectionsThe product comes installed with a small demonstration collection of around 20 documents which canbe searched as soon as the product has been installed.4

Demo Collection – Sample Search ResultsThe PDF file can be opened in a new browser window by clicking either the document thumbnail or thetitle link. In addition clicking on “Open with Find” will open the PDF document with the search stringpassed to Adobe Reader. This will automatically open the find interface within Adobe Reader usingthe same query parameters. See below for an example.Opening a PDF Document “With Find”5

2.7Using Tabula DX with IISTabula DX is supported under IIS 5 (Windows XP), IIS 6 (Windows 2003) and IIS 7 (Windows Vista,Windows 2008).Please note the following important system requirements : Tabula DX needs to be running under ASP.Net 2.0 Tabula DX does require certain file system privileges which are unlikely to be satisfied byusing the default IUSR account. Therefore the Tabula DX web application should be runusing Integrated Authentication or with an anonymous user configured with sufficientprivilege. Running under IIS7 / Windows Vista will require use of the Classic ASP.Net application pool.2.7.1Setting Up Tabula DX with IIS 5 (Windows XP)Create a new virtual directory called tabuladx that points to the Tabula DX install location (by defaultC:\Program Files\Aquaforest\Tabula DX) with a default document of main.aspx and either integratedauthentication or anonymous authentication with a suitably privileged user. The screen shots belowillustrate the process.6

7

8

Tabula DX administration can then be accessed via http://server/tabuladx :9

2.7.2Setting Up Tabula DX with IIS 6 (Windows 2003)Create a new virtual directory called tabuladx that points to the Tabula DX install location (by defaultC:\Program Files\Aquaforest\Tabula DX) with a default document of main.aspx and either integratedauthentication or anonymous authentication with a suitably privileged user. The screen shots belowillustrate the process.10

11

Tabula DX administration can then be accessed via http://server/tabuladx :12

2.7.3Using Tabula DX with IIS 7 (Windows Vista, Windows 2008)Create a new web application directory called tabuladx that points to the Tabula DX install location (bydefault C:\Program Files\Aquaforest\Tabula DX) and uses the Classic ASP.Net AppPool The screenshots below illustrate the process.13

Set the default document to be main.aspx :Ensure that the application runs with a suitably privileged identity. Not necessarily as administratorbut with enough privilege to write to the Tabula DX collections folder.Tabula DX administration can then be accessed via http://server/tabuladx :14

3SEARCH QUERY EXPRESSIONS3.1Search FieldsWhen documents are indexed, a number of different search index fields are populated depending uponthe collection configuration.FieldContainsContentsThe text of the document. This is the default field.BookmarksAnnotationsPDF Doc Info Fields (or XMP ed3.2The value of these fields can be set via the AcrobatDocument Properties tab.The value of the XMP metatdata field configured withthe search name xmpfield see section X.X for details ofhow this is configured.The collection ID of the document.Time stamp in YYYYMMDDHHMMSS formatPDF file pathPath of the thumbnail image (or blank if thumbails arenot used in the collection)The number of pages in the PDF documentThe size in bytes of the fileTime stamp in YYYYMMDDHHMMSS formatTime stamp in YYYYMMDDHHMMSS formatQuery ExpressionsSearch QueryMatches documents PdfContains the word “pdf” in the contents of thedocument.Pdf searchPdf AND search pdf searchEach of these expressions will find documents withboth of the words “pdf” and “search” in the contents ofthe document.Pdf OR searchThis will find documents with either (or both) of thewords “pdf” and “search” in the contents of thedocument.search AND NOT pdfDocuments that contain the term search but do notcontain the term PDF.Title:china AND –title:indiaThe title field contains the word china but not india15

(pdf or wordperfect) AND searchDocuments contain the word “search” and either “pdf”or “wordperfect”.Title:”search engines”The title field contains the phrase search enginesSearch*Contains terms that begin with search, such assearchable, searching and search.Contains terms that are close to the word search such asStarchContains winmodified values in the range specifiedSearch winmodified:[20070701000000 TO20070731235959]16

44.1CONFIGURING SEARCH COLLECTIONSSystem-Wide SettingsThe “Settings” tab allows a number of system-wide settings to be maintained. These settings are :AttributeDescriptionLicense KeyOnce purchased, a permanent license key may be entered here.Administrator PasswordIf a password is entered, then access to the Autobahn DX admin pageswill require entry of the specified password.Index Username andPasswordTo run indexing jobs via the web interface a suitable user id andpassword will be required which is used to create the job usingWindows Scheduled Tasks.17

4.2Collection SettingsIndividual collections can be configured by clicking on the collection name link in the main collectionspage :4.3Collection AttributesAttributeDescriptionCollection NameThe descriptive name of the document collectionFolder PathsOne or more folders containing documents to be indexed.File patternA pattern to be used to match the files to be indexed. Default *.pdfIndex FolderThe folder to contain the index files.Index Batch SizeThe maximum number of documents that will be indexed in a singlerun of the indexer.18

Results per PageDefault number of results per search results page.XSL FileDefault XSL file.Always OptimizeIf checked, the indexes will be optimized at the end of each run of theindexer.Check for DeletedIf checked, the index process will check each file in the index. If therelated PDF file has been deleted, the index entry will be removed.ThumbnailsIf checked a thumbnail image of the first page of each PDF documentindexed will be produced.Thumbnail FolderThumbnail images will be stored in this folder. Subfolders will beautomatically created to mirror the structure of the source PDFfolders.Index LoggingIf checked, a log file will be created (or appended to) detailingindexing activity each time the indexer is run on this collection.Index Logging FileThe index log file name. The file will be placed inAUTOBAHN/collections/collectionid/indexlog. The string%TIMESTAMP% can be included in the file name to create a uniquefile for each indexer run; the timestamp is of the formatYYYYMMDDHHMMSS.Index Doc InfoIf checked, PDF Doc Info metadata (title, author etc) will be indexedand can be searched using query expressions such asauthor:shakespeareIndex BookmarksIf checked, the text of bookmarks will be indexed and can be searchedusing query expressions such as bookmarks:shakespeareIndex AnnotationsIf checked, PDF annotations will be indexed and can be searchedusing query expressions such as annotations:shakespeareIndex XMPIf checked, XMP metadata will be indexed in accordance with the“XMP Fields” instructions.XMP FieldsThis defines which XMP fields should be indexed, and each linedefines a property of the form :namespace:propertyname,indexnameFor example onformanceThe above instructs Tabula DX to index the property “conformance”in the PDF/A namespace (http://www.aiim.org/pdfa/ns/id/) and indexit under the name pdfaconformance. The field may be searched tofind pdf/a conformant files with a query such as pdfaconformance:BFor further information about XMP refer to the Adobe resources here: http://www.adobe.com/devnet/xmp/pdfs/xmp specification.pdf19

4.4Configuring Collection IndexingThe configuring collection indexing page allows a collection to be indexed according to a set scheduleor immediately. The indexing process analyzes the current collection index and the set of file systemfiles, determining which files need to be indexed or reindexed.20

5COMMAND LINE INDEXINGTabula DX collections can be indexed by using the command line interface. The tabuladx.exeexecutable can be found in the product bin folder.tabuladx.exe /id collectionid /op operation [/debug]ParameterDescriptionIdThe collection ID, eg 1002OpThe operation :index – Index the collection. The indexing process analyzes thecurrent collection index and the set of file system files, determiningwhich files need to be indexed or reindexed.clear – Clear the index collectionThe executable can is also used to set up the demo collection, thisshould be performed automatically by the setup process.Setupdemo – Sets up the demo collection from the template.Adjustdemo – Adjust the collection for the local locationDebugOptional. If specified verbose output is produced.21

6CUSTOMIZATION AND INTEGRATIONTabula DX is designed to be customized and can be easily integrated within larger solutions.6.1The Search URL ParametersTabula DX search may be accessed via the search.aspx ctionid 1001&query office&xslfile xmlParameterDescriptionCollectionidThe numeric collection id to be searched.Default Value ifUnspecifiedN/AQueryThe query stringN/AResultfieldsThe list of fields to be returnedResultstartThe result set document number of the firstdocument to be returned.The maximum number of results to be t to XML to have pure XML returned (see 5.3below) or alternatively specify an alternate XSLfile.True or false. If set to false thumbnails are not tobe displayed.Index files directoryFrom collectionconfiguration.From collectionconfiguration.From collectionconfiguration.From collectionconfiguration.Customizing the search interfaceWhen running a search, Tabula DX generates an XML file with details of the search results. This ispassed to the browser with default of the XSL file to be used to transform the output into the searchresults page. By default the style/results.xsl file is used. This can be customized to suit specific needs.An alternative replaced in the collection settings page6.3XML Output ?xml version "1.0" encoding "utf-8"? searchresults search collectionid 1001 /collectionid query document guidelines /query resultfields highlight,path,title,thumbnailpath,pages /resultfields resultstart 1 /resultstart resultend 9 /resultend resultsperpage 10 /resultsperpage sortorder / thumbnails true /thumbnails indexdirectory c:\dev\scrii\index\ /indexdirectory xslfile / /search hits 9 /hits searchstatus success /searchstatus 22

results result row 1 /row highlight B Guidelines /B for Creating Archival /highlight path c:\docs2\PDFGuideline.pdf /path title PDF Guideline for FDA /title thumbnailpath c:\thumbnails\PDFG.pdf.1.gif /thumbnailpath pages 6 /pages /result result row 2 /row . /result /results /searchresults AttributeDescription search This contains the search attributes collectionid query resultfields resultstart resultend resultsperpage thumbnails indexdirectory hits searchstatus The collection IDThe search queryThe fields that will be included in results The result number of the first document in results The result number of the last document in results The maximum number of resultsTrue or False (see thumbnailpath if true).The location of the collection index directoryThe total number of results from the searchCan be :successblank (if query is blank)nomoreresultserrorThe result document setAn individual resultThe sequence in the result setThe contents of field for the documentDocument fragment, highlighted with search termsThe path of the documentDocument title. If the PDF document does not have a title, the first 60characters of the document content is used.The path of the thumbnail imageThe number of pages in the PDF document. results result row field highlight path title thumbnailpath pages 23

6.4Web.config ParametersThe web.config file contains a number of appSettings that can be adjusted. These parameters aredefined below.SettingDescriptiongeneratedTitleLengthFor PDF documents with no title, Tabula DX generates a title forsearch results purposes using the first generatedTitleLengthcharacters in the file. Initial value : 60.highlighMaxDocBytesToAnaylzeWhen constructing the “highlight” text fragment for resultsdisplay, this parameter determines how many characters of textwill be examined. Initial value : 100000.highlightNumFragmentsDetermines how many text fragments will be used in the highlighttext. Initial value : 2.highlightDelimiterSpecifies a character string to be used to separate the highlight textfragments. Initial value : “ ”highlightLengthDefines the length of highlight text. Initial value 150.defaultOperatorDefines whether “AND” or “OR” is the default search operator.Initial value : “AND”.resizeThumbnailsBy default thumbnails are a width of 100 pixels. They can beresized on the fly if this parameters is set to true. Initial valuefalse.Defines the image width if resizeThumbnails is set to true.resizeThumbnailWidth24

7TABULA DX DIRECTORIESThe Tabula DX folder structure is explained below. The root folder is ionBinContains the DLLs and executables, including tabuladx.exeCollectionsThe root folder for the search collections.Collections/9999The root folder for the search collections 9999. Includes theconfig 9999.xml file which holds the collection configuration.Collections/9999/indexThe default location for the collection index files.Collections/9999/indexlogThe default location for the collection index log files.Collections/9999/tempThe location for the collection index log files.Collections/9999/thumbnailsThe default location for the collection thumbnail files.ConfigContains the Tabula DX config file and the template collectiontemplate file.DocsContains the reference guide and license file.ImgContains the images used in the web interface.SamplesContains the sample collection documents.StyleDefault location for XSL documents. Includes results.xsl.TemplateThe template for the sample collection.25

8TABULA DX AND LUCENEThe indexes created by Tabula DX are compatible with Lucene 1.4 or later. More informationregarding Lucene can be found here : http://lucene.apache.org/java/docs/Included with the product is Luke – the Lucene Index Toolbox. This can be a useful tool for analyzingthe contents of indexes and running queries. Luke can be launched by running lukeall-0.7.1.jar in theproduct bin folder. Luke is Licensed under the Apache License, Version 2.0 (the "License"); you mayYou may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.026

9PRODUCT VERSION HISTORY9.1Version 1.12ReferenceChange1.12-01Updated release includes lightweight UltiDev Cassini web server9.2Version 1.01ReferenceChange1.01-01Initial Limited Release10ACKNOWLEDGEMENTSThis product includes Luke - Lucene Index Toolbox (http://www.getopt.org/luke), Copyright 2008Andrzej Bialecki.27

Pdf AND search pdf search Each of these expressions will find documents with both of the words "pdf" and "search" in the contents of the document. Pdf OR search This will find documents with either (or both) of the words "pdf" and "search" in the contents of the document. search AND NOT pdf Documents that contain the term .