XML Protocol 
------------


Read This First
---------------

This section contains some important notes.

*CDATA Tags in XML Requests. These are optional where the data doesn't contain
any characters that need escaping.  If you are unsure whether your content will
contain these characters then it is best to use CDATA tags.  Doing so will not
affect the LuceneServer.
*CDATA in XML Responses. The LuceneServer _will_ use CDATA tags in Responses.
*Field Naming Conventions. Field names have some limitations.  They must be
valid XML attribute names, i.e. no <,>,&,",' characters and NO SPACES or
underscores (_).  Hyphens are accepted.  The recommended naming convention is
as for Java classes, e.g. FirstName or Address.  Field names are case sensitive.
*Malformed XML. If XML that is not well-formed is sent to the LuceneServer then
an error will be returned.  This will not contain the serial (if present)
because it could not be extracted from the XML.  There is a generic response,
LuceneErrorResponse, for these situations.
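
The field naming rules above can be checked on the client before a request is
sent.  The following is a minimal Python sketch only: the function name
is_valid_field_name and the regular expression are assumptions derived from the
rules listed here (plus the assumption that a name starts with a letter), so
the LuceneServer itself may be stricter or more lenient.

```python
import re

# Assumed pattern: a letter followed by letters, digits or hyphens.
# This rejects spaces, underscores and the characters < > & " '.
_FIELD_NAME = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def is_valid_field_name(name: str) -> bool:
    """Return True if name follows the field naming rules described above."""
    return _FIELD_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("FirstName", "Display-Date", "first_name", "First Name"):
        print(candidate, is_valid_field_name(candidate))
```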

Date Formats Supported
-----------------------
Date formats supported using the Date field type (see index explanation below).

yyyy/MM/dd HH:mm:ss
yyyy-MM-dd HH:mm:ss
yyyy/MM/dd
yyyy-MM-dd
unix timestamp

These dates will be converted into Lucene's internal format (e.g. mx341s12),
which isn't human readable in the logs.
An alternative is to use an ISO formatted date string,
e.g. 2003-12-31 23:59:59
This will be correctly ordered when doing range queries because its alphabetic
ordering matches its chronological ordering.  Lucene does not support numeric
ordering, so numeric fields used for sorting or range queries need to be
left padded with zeros, e.g. 001.
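
As an illustration of the two formatting rules above, here is a small Python
sketch.  The helper names to_sortable_date and pad_number, and the default
width of 10 digits, are assumptions made for this example; choose a width
large enough for the biggest value you expect to index.

```python
from datetime import datetime


def to_sortable_date(value: datetime) -> str:
    """Format a datetime as yyyy-MM-dd HH:mm:ss, which sorts alphabetically."""
    return value.strftime("%Y-%m-%d %H:%M:%S")


def pad_number(value: int, width: int = 10) -> str:
    """Left-pad a non-negative integer so string order matches numeric order."""
    if value < 0:
        raise ValueError("zero padding only preserves order for non-negative numbers")
    return str(value).zfill(width)


if __name__ == "__main__":
    print(to_sortable_date(datetime(2003, 12, 31, 23, 59, 59)))
    print(sorted(pad_number(n, 3) for n in (10, 2, 100)))
```

All values of a field must be padded to the same width, otherwise the
alphabetic comparison no longer matches the numeric one.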


Query XML
=========

The LuceneQueryRequest XML is the easiest so we will start with that.  Below is a fairly comprehensive example of a request:
<?xml version="1.0" encoding="UTF-8" ?>
<LuceneQueryRequest>
	<Application>sos</Application>
	<Serial>123abc</Serial>
	<Query>Name:cat</Query>
	<Sort limit="100">Name:desc</Sort>
	<Offset>10</Offset>
   	<Limit>3</Limit>
	<Range field="DisplayStartDate">
		<From>2002/01/29</From>
		<To>2003/01/29</To>
	</Range>
</LuceneQueryRequest>

Mandatory Tags: Query
Optional Tags:  Offset, Limit, Application, Return, Sort, Serial, 
				Default-Field, Range, Domain

<Query>         The query string to be passed to the indexer. If the query
				contains special XML characters (&<>) then it will need to
				be wrapped in a CDATA section: <![CDATA[ xxx ]]>

                 See: jakarta.apache.org/lucene/docs/queryparsersyntax.html
                 for query syntax.

				default-field attribute. If set this is the field that 
				the query parser uses for any search term that does not 
				specify a field name.  It must be entered in either the 
				query tag as an attribute or the Application defaults.  
				If not specified, then unqualified search terms will be 
				associated with a non-existent field and so will cause 
				the query to find nothing.
				
				default-field example:
				<Query default-field="Name">cat</Query>
				This will get turned into Name:cat

<Return>        The field(s) to return.  If this tag is not specified
                then only the Id, counter and rank are returned. Any
                specified fields are in addition to these three. If an
                Application is specified then the Application defaults
                apply.
                These are returned in the <Results> tag of the 
                LuceneQueryResponse (see below).

<Offset>      	Specify that, after appropriate sorting, all items 
                before this one are discarded and not returned to the
                client.  This will allow the client to perform
                page-at-a-time queries. Similar to offset in PostgreSQL
                queries.

<Limit>         Specify that no more than this many items are
                returned.  If negative, an unlimited number of results
                are returned (overriding any defaults).
                If 0 (zero) then only a count of results is returned.
                This is useful for checking whether a query will return
                lots of hits or to check whether a sort should be performed.

<Sort>          Specify that the results are ordered based on the
                values of specified fields.  If the sort should be on
                multiple fields, then separate the fields with a space.
                By default, the indexer orders results by rank.  
                The special field name 'RANK' may be used to
                include rank-order within a multi-field sort.  To
                sort fields in reverse order, append ':Desc' to the
                sort field name, e.g.
                  <Sort>Author Date:Desc</Sort> - this will sort by author 
                  		from a to z and then by date with newest first.
                  <Sort>RANK:Desc</Sort> - this is the default.
                  
                Sorting is very memory intensive as all the documents 
                must be loaded into memory and then sorted. USE WITH
                CAUTION when handling large result sets.  As a precaution
                there is an optional limit attribute for the sort tag. This 
                will prevent sorting if the number of results exceeds the
                limit and return an error.
                  <Sort limit="100">Date:Desc</Sort>
                  This will return an error if trying to sort more than 100
                  documents.
                Please note that there is also a System sort-limit which
                is specified in the Server.config file.  This will override
                any request sort limit.
                Important Note: IN ORDER TO SORT ON A FIELD IT MUST BE STORED!
                This is because sorting on arbitrary fields is not internally
                supported by Lucene.  This is an important index-time decision:
                if you have not stored the field data for all documents
                then the sort will be inconsistent.

<Application>   Indicate that the default parameters for this query
                are to be taken from the specified application.  The
                application must have been set up previously.

<Range>         Specifies a range query.
				The field attribute defines the Field name for which a Range
                query clause is wanted.  This tag should contain either a
                From or To tag or both.  If only one of From or To is
                specified, then the range is open-ended.
                Further Range tags can be specified for other fields.
                e.g. both specified:
                <Range field="DisplayStartDate">
    				<From>2000/01/29</From>
    				<To>2004/01/29</To>
    			</Range>
    			e.g. only From specified (open-ended, no upper bound):
    			<Range field="DisplayStartDate">
    				<From>2000/01/29</From>
    			</Range>
                 
<Domain>		Optional field which overrides the application's domain.

<Serial>        The serial header causes the server to insert the same Serial header
				in the response.  This will allow clients who process requests and
				responses asynchronously to link a response with a particular request.
				Please note however that it is not always possible to return the 
				serial in the event of an error.  This is especially true with invalid 
				XML or more severe server side errors.

<Fields> 		This can be included if you wish to define on-the-fly fields, to
				override the default behaviour of fields or to define fields when there
				are no fields in the application properties.  Please see the 
				LuceneIndexRequest explanation below for more details.
				The only reason this would be done is if specifying date fields on the 
				fly.  If undefined then the internal Lucene representation of the date
				will be returned.
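
The request described above can also be assembled programmatically.  The
sketch below uses Python's standard xml.etree.ElementTree module; the function
name build_query_request and its page/page_size parameters are hypothetical,
and it assumes that Offset is the number of items to skip (as in PostgreSQL)
and that the server does not depend on the order of the child tags.

```python
import xml.etree.ElementTree as ET


def build_query_request(query, application=None, serial=None,
                        page=1, page_size=10, sort=None, ranges=None):
    """Build a LuceneQueryRequest document and return it as a string.

    ranges maps a field name to a (from_value, to_value) pair; either
    value may be None to leave that end of the range open.
    """
    root = ET.Element("LuceneQueryRequest")
    if application is not None:
        ET.SubElement(root, "Application").text = application
    if serial is not None:
        ET.SubElement(root, "Serial").text = serial
    # ElementTree escapes &, < and > in element text, so CDATA is not needed.
    ET.SubElement(root, "Query").text = query
    if sort is not None:
        ET.SubElement(root, "Sort").text = sort
    # Assumption: Offset counts the items to skip before the first result.
    ET.SubElement(root, "Offset").text = str((page - 1) * page_size)
    ET.SubElement(root, "Limit").text = str(page_size)
    for field, (start, end) in (ranges or {}).items():
        range_el = ET.SubElement(root, "Range", field=field)
        if start is not None:
            ET.SubElement(range_el, "From").text = start
        if end is not None:
            ET.SubElement(range_el, "To").text = end
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_query_request("Name:cat", application="sos", serial="123abc",
                              page=2, page_size=3, sort="Name:Desc",
                              ranges={"DisplayStartDate": ("2002/01/29", None)}))
```

The returned string has no XML declaration; prepend one if your transport
requires it.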

Query-Response XML
------------------

This is returned after a LuceneQueryRequest is received.  

Here are three examples of a LuceneQueryResponse; each is explained below:

Results Example (optional field Name included):
<?xml version="1.0" encoding="UTF-8"?>

<LuceneQueryResponse>
  <Serial>123abc</Serial>
  <Count>2</Count>
  <Results>
    <Result counter="1" rank="1.0">
      <Field name="Id"><![CDATA[zz13]]></Field>
      <Field name="Name"><![CDATA[cat]]></Field>
    </Result>
    <Result counter="2" rank="1.0">
      <Field name="Id"><![CDATA[zz12]]></Field>
      <Field name="Name"><![CDATA[cat]]></Field>
    </Result>
  </Results>
</LuceneQueryResponse>

No Results Example:
<?xml version="1.0" encoding="UTF-8" ?>
<LuceneQueryResponse>
	<Serial>123abc</Serial>
	<Count>0</Count>
</LuceneQueryResponse>

Error Example:
<?xml version="1.0" encoding="UTF-8"?>
<LuceneQueryResponse>
  <Serial>123abc</Serial>
  <Error><![CDATA[Error parsing query "*cat": Lexical error at line 1, column 1.  Encountered: "*" (42), after : ""]]></Error>
</LuceneQueryResponse>

The top example has a result set.  Its size (count) is 2.  This tag is always
present unless an error has occurred.  The results element wraps the individual
result elements.  Results will only be present if count > 0.
There will always be the counter and rank attributes for a result element. The
number of Field elements will vary depending on the return field specified but
the Id will always be returned.  Field data is currently wrapped in CDATA.

Secondly if there are no results (or if the limit has been set to 0) then
only the count will be returned.  There will be no results element.

The third example shows an error that occurred during querying.
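
A client might read the three kinds of response as follows.  This is a sketch
using Python's standard xml.etree.ElementTree module; the names
parse_query_response and LuceneServerError are hypothetical, and the shape of
the returned dictionary is just one possible choice.

```python
import xml.etree.ElementTree as ET


class LuceneServerError(Exception):
    """Raised when a response carries an Error element (hypothetical name)."""


def parse_query_response(xml_text):
    """Parse a LuceneQueryResponse into a dict of serial, count and results."""
    root = ET.fromstring(xml_text)
    error = root.findtext("Error")
    if error is not None:
        raise LuceneServerError(error)
    results = []
    # The Results element is absent when the count is 0, so this may not loop.
    for result in root.iterfind("Results/Result"):
        results.append({
            "counter": int(result.get("counter")),
            "rank": float(result.get("rank")),
            # CDATA sections are returned as ordinary text by the parser.
            "fields": {f.get("name"): f.text or "" for f in result.iterfind("Field")},
        })
    return {
        "serial": root.findtext("Serial"),
        "count": int(root.findtext("Count", default="0")),
        "results": results,
    }
```

Because the serial may be missing from an error response, callers should not
rely on it being present when LuceneServerError is raised.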

--------------------------------------------------------------------------


INDEX XML
=========

The LuceneIndexRequest is used to add new documents to the store or to
replace documents.  Each document must include the mandatory fields
'Domain' and 'Id' which will serve to uniquely identify them.  The
'Domain' field may be defaulted from an Application default, but the
'Id' field must be individually specified on each document.

	<?xml version="1.0" encoding="UTF-8" ?>
   	<LuceneIndexRequest>
   	<Application>sos</Application>
   	<Serial>abc123</Serial>
   	<Id>xmlID</Id>
   	<Fields>
   		<Field name="Name" type="text" indexed="true" stored="false">cat</Field>
   		<Field name="Details">This is the details field.</Field>
   		<Field name="Teaser">This is the teaser</Field>
   		<Field name="Location">10</Field>
   		<Field name="Category">10</Field>
   		<Field name="DisplayStartDate">2002-12-10</Field>
   		<Field name="SaleStartDate">2002-12-10</Field>
   		<Field name="EndDate">2002-12-25</Field>
   		<Field name="Cancelled">N</Field>
   	</Fields>	
   	</LuceneIndexRequest>


Mandatory Headers: Application, Id

Optional Headers: Stop-List, Fields/Field, Serial, Domain
				  Field-Definitions/FieldDefinition

<Application>   See LuceneQueryRequest for explanation.

<Serial>		See LuceneQueryRequest for explanation.

<Domain>        Every document must include a Domain field which may
                be specified or taken from the Application defaults.  
                This will be used to isolate documents into application 
                areas so that one application does not reference another's 
                documents.
                It is often better however to specify a type field and
                manage domains from within the application.

<Id>            Every document must be uniquely identified (within its
                application domain).  When indexing a document any 
                existing document with the same Domain/Id pair is 
                deleted.  This field allows the client to uniquely
                address the actual document in an external database.

<Stop-List>     A space-separated list of words that should be ignored
                when indexing the document.  Generally, this would be
                specified in the Application defaults.

<Fields>		This element wraps the individual fields to be indexed
				for the document.
				
<Field>			This contains the data to be indexed. The element text
				should be in a CDATA tag if it will contain a lot of 
				special characters (&<>).  Alternatively the special 
				characters can be escaped i.e. & == &amp;
				The element text will be ignored when defining fields
				at query time.

@name			The name of the field. 

				If you wish to define on-the-fly fields, to override the 
				default behaviour of fields or to define fields when there
				are no fields in the application properties then use the 
				following optional attributes:

@type      		Defines the type of field.  This affects tokenisation
				and date calculations.
                Options: 'Date' field, a 'Text' field or an
                'Id' field. A 'Text' field is a field which is split
                into words.  An Id field is not split into words: its
                value is treated as a whole word (or "term" in
                Lucene's terminology).  A Date field is an Id field
                which is converted from an external Date
                representation into a form suitable for indexing and
                sorting.
                Default is text.

@indexed   		"true" or "false" indicating whether indexing is required
                for this field.  A field cannot be used in a search
                query unless it is indexed.
                Default is true.

@stored    		"true" or "false" whether the content of a field is
                stored.  For example, this would not be done for the
                body text of a document as it is assumed to be stored
                elsewhere, but it would be required for the ID field
                so that the document ID could be returned as part of
                the query result.
                IMPORTANT NOTE: In order to sort on a field it MUST
                be stored.
                Default is false
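
To show how the elements and attributes above fit together, here is a sketch
that builds a LuceneIndexRequest with Python's standard
xml.etree.ElementTree module.  The function name build_index_request and its
parameters are hypothetical; attribute values are passed through unchanged,
so the caller is responsible for using values the server accepts.

```python
import xml.etree.ElementTree as ET


def build_index_request(application, doc_id, fields,
                        serial=None, domain=None, definitions=None):
    """Build a LuceneIndexRequest document and return it as a string.

    fields maps a field name to its text.  definitions optionally maps a
    field name to extra attributes such as type, indexed and stored.
    """
    root = ET.Element("LuceneIndexRequest")
    ET.SubElement(root, "Application").text = application
    if serial is not None:
        ET.SubElement(root, "Serial").text = serial
    if domain is not None:
        ET.SubElement(root, "Domain").text = domain
    ET.SubElement(root, "Id").text = doc_id
    fields_el = ET.SubElement(root, "Fields")
    for name, text in fields.items():
        attributes = {"name": name}
        attributes.update((definitions or {}).get(name, {}))
        # Special characters in the text are escaped by ElementTree.
        ET.SubElement(fields_el, "Field", attributes).text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_index_request(
        "sos", "xmlID",
        {"Name": "cat", "EndDate": "2002-12-25"},
        serial="abc123",
        definitions={"EndDate": {"type": "Date", "stored": "true"}},
    ))
```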

LuceneIndexResponse XML
-----------------------

The following describes the response received from an XML index request.

<?xml version="1.0" encoding="UTF-8"?>
<LuceneIndexResponse>
  <Serial><![CDATA[abc123]]></Serial>
  <Status><![CDATA[Document indexed successfully]]></Status>
</LuceneIndexResponse>

This response is very simple.  The presence of a Status element implies success.
Below is an index response reporting an error:

<?xml version="1.0" encoding="UTF-8"?>
<LuceneIndexResponse>
  <Serial><![CDATA[abc123]]></Serial>
  <Error><![CDATA[Mandatory 'Id' header is missing]]></Error>
</LuceneIndexResponse>
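
The success rule above (a Status element means success, an Error element means
failure) can be applied with a few lines of Python.  This is a sketch only;
the function name parse_status_response and the returned tuple layout are
assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def parse_status_response(xml_text):
    """Return (ok, serial, message) for a Status/Error style response."""
    root = ET.fromstring(xml_text)
    error = root.findtext("Error")
    status = root.findtext("Status")
    # Success requires a Status element and no Error element.
    ok = error is None and status is not None
    return ok, root.findtext("Serial"), status if ok else error
```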


LuceneUnIndexRequest XML
------------------------
The LuceneUnIndexRequest deletes an indexed document from the index.
There is also a purge function that removes all documents from the supplied
application's domain.  If a Domain is specified in the request then all
documents in that domain are removed instead.  The Application element is
compulsory.

To delete a specific document:
<?xml version="1.0" encoding="UTF-8" ?>
<LuceneUnIndexRequest>
	<Serial><![CDATA[12<"3?a&bc]]></Serial>
	<Application>sos</Application>
	<Id>zz13</Id>
</LuceneUnIndexRequest>

To Purge:
<?xml version="1.0" encoding="UTF-8" ?>
<LuceneUnIndexRequest>
	<Serial><![CDATA[12<"3?a&bc]]></Serial>
	<Application>sos</Application>
	<Domain>new_domain</Domain>
	<Purge/>
</LuceneUnIndexRequest>

See Query XML for an explanation of the Application and Domain tags.
<Id>	This is the Id of the document to remove from the index.

<Purge> This is an empty element and if it exists then the index for the
		specified domain will be purged.
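
Both forms of the request can be produced by one helper.  The sketch below
uses Python's standard xml.etree.ElementTree module; the function name
build_unindex_request is hypothetical, and treating Id and Purge as mutually
exclusive is an assumption based on the two examples above rather than a
documented rule.

```python
import xml.etree.ElementTree as ET


def build_unindex_request(application, doc_id=None, purge=False,
                          domain=None, serial=None):
    """Build a LuceneUnIndexRequest that deletes one document or purges."""
    # Assumption: a request carries either an Id or a Purge element, not both.
    if purge == (doc_id is not None):
        raise ValueError("pass either doc_id or purge=True, but not both")
    root = ET.Element("LuceneUnIndexRequest")
    if serial is not None:
        ET.SubElement(root, "Serial").text = serial
    ET.SubElement(root, "Application").text = application
    if domain is not None:
        ET.SubElement(root, "Domain").text = domain
    if purge:
        ET.SubElement(root, "Purge")  # empty element
    else:
        ET.SubElement(root, "Id").text = doc_id
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_unindex_request("sos", doc_id="zz13", serial="123abc"))
    print(build_unindex_request("sos", purge=True, domain="new_domain"))
```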
		

LuceneUnIndexResponse XML
-------------------------

<?xml version="1.0" encoding="UTF-8"?>
<LuceneUnIndexResponse>
  <Serial><![CDATA[123abc]]></Serial>
  <Status><![CDATA[No documents deleted.]]></Status>
</LuceneUnIndexResponse>

Other Status messages:
	Purge Successful.  Deleted 3 documents
	Successfully deleted document with id: xmlID

LuceneUtilityXML
----------------

This is an umbrella Request that will be expanded with [hopefully] useful
mini-applications and other utility type functions.

OPTIMIZE

Currently you can force an optimize on an Index.

<?xml version="1.0" encoding="UTF-8" ?>
<LuceneUtilityRequest>
	<Serial>123abc</Serial>
	<Utility>OPTIMIZE</Utility>
	<Application>sos</Application>
</LuceneUtilityRequest>

Optimizing an index will merge multiple segments of the index into one.  
This will lower the number of open files and sometimes increases performance.
Performance is usually not affected significantly after optimizing but during
the process there is a significant performance penalty as Lucene merges the 
segments.  This has taken up to 5 minutes for a million-document index.

BACKUP

This will effectively copy the index to another directory and then perform
an optimize (see above) on it.  This is done using internal Lucene commands
rather than a file copy and is preferable.

<?xml version="1.0" encoding="UTF-8" ?>
<LuceneUtilityRequest>
	<Serial>123abc</Serial>
	<Utility>BACKUP</Utility>
	<Application>sos</Application>
	<BackUpTarget>d:/index-backup</BackUpTarget>
</LuceneUtilityRequest>

There are 4 levels of fallback for a backup target.

1. Specified in the XML request as shown above. This overrides all others.
2. Specified as "Lucene-Backup-Directory=c://indexbkp" in the application's 
   properties file.
3. Specified in the LuceneServer's Server.config file.
4. Failing the first three options a "_backup" is added to the index path
   and this directory is created (if possible).
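
The fallback order above can be summarised in a few lines of Python.  This is
an illustrative sketch, not the server's implementation: the function name
resolve_backup_target and its parameters are hypothetical, and it only
computes the path without creating the directory.

```python
def resolve_backup_target(index_path, request_target=None,
                          application_target=None, server_target=None):
    """Pick the backup directory using the four-level fallback described above."""
    # Levels 1 to 3: the first configured target wins, in priority order.
    for candidate in (request_target, application_target, server_target):
        if candidate:
            return candidate
    # Level 4: append "_backup" to the index path.
    return index_path.rstrip("/\\") + "_backup"


if __name__ == "__main__":
    print(resolve_backup_target("d:/index", request_target="d:/index-backup"))
    print(resolve_backup_target("d:/index"))
```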
