Protocol for the Lucene Socket-Server
=====================================

General principle:

Similar to SMTP or HTTP with Command, Headers and optional body.  The
protocol is basically line-oriented with each line being terminated by
a <LF> character.
Currently there is an XML version in development.

Examples
========

We'll start off with some examples to get the ball rolling, and then
follow up with a more tightly-defined specification.

CONTROL Example
---------------

A CONTROL message performs an uncommon task (the only common tasks
are INDEX, for inserting a document, and QUERY, for retrieving).

Currently, two CONTROL messages are officially supported: Application 
and Optimize.  These are defined in detail later.

Please note: It is intended to deprecate the Application sub-command
in favour of .properties file creation.

Initially, when an Application is defined, a number of defaults can be
set.  These can either be set in the config file, or via a CONTROL
dialogue.  For example, the client opens a socket and sends:

      CONTROL
      Sub-Command: APPLICATION
      Application: test
      Stop-List: and if but or
      Stop-List: is was then not
      Field-Name: Domain
      Field-Type: Text
      Field-Indexed: Yes
      Field-Stored: Yes
      Domain: test
      Field-Name: Id
      Field-Type: Id
      Field-Indexed: Yes
      Field-Stored: Yes
      Field-Name: Published
      Field-Type: Date
      Field-Indexed: No
      Field-Stored: Yes
      Field-Name: Text
      Field-Type: Text
      Field-Indexed: Yes
      Field-Stored: No
      Sort:
      Sort: Published
      Sort: RANK
      Query: Domain:test
      Default-Field: Text
      Return: Id
      Return: Published
      Limit: 100
      END

This defines (or updates) an application profile called 'test'.
Documents are indexed with a number of fields which are defined here.
The value of the 'Domain' field will default to 'test'.  None of the
other fields will have default values and will need to be set on a
per-document basis (which makes sense).

The definitions for the 'Domain' and 'Id' fields are known by default
and should not be changed, otherwise the Indexer will fail to operate
correctly!

The blank 'Sort' header will discard any Sort header already defined
for the application and replace it with the following Sort fields.

Queries in this application will want the results sorted by rank
within 'Published' date and the 'Id' and 'Published' fields returned
to the client.  All Queries will include the 'Domain:test' predicate
in the search phrase.

The default field to be searched when none is specified will be the
Default-Field field (Text in the above example).  If a Default-Field
is not specified then a non-existent field is assigned.  A query term
for any other field can be specified by preceding the query term with
the field name plus a colon, e.g. 'Author:Tolkien'.

The Stop-List words are added to any stop-list that may have already
been defined for this application.
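
As a rough illustration of how a client might assemble a header-only
transmission such as the CONTROL request above, here is a minimal Python
sketch.  The helper names encode_value and build_transmission are
hypothetical and are not part of the server; encoding bytes outside
printable ASCII is an assumption, as the protocol only asks for control
characters and "%" to be URL-encoded.

```python
def encode_value(value):
    """Percent-encode control characters and '%' in a header value."""
    encoded = []
    for byte in value.encode("utf-8"):
        # Assumption: bytes outside printable ASCII are encoded as well.
        if byte < 0x20 or byte > 0x7E or byte == 0x25:
            encoded.append("%{:02X}".format(byte))
        else:
            encoded.append(chr(byte))
    return "".join(encoded)


def build_transmission(command, headers):
    """Return the bytes of a body-less transmission terminated by END."""
    lines = [command]
    for keyword, value in headers:
        # A blank value yields a bare "Keyword:" line, which resets a default.
        lines.append(keyword + ":" + (" " + encode_value(value) if value else ""))
    lines.append("END")
    # Every line, including END, is terminated by a single <LF>.
    return ("\n".join(lines) + "\n").encode("ascii")
```

Headers are passed as an ordered list of pairs rather than a dictionary
because the same keyword (Sort, Field-Name, ...) may appear several times
and the order matters.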

INDEX Example
-------------

To index a document into the above framework, the following might be
transmitted:

      INDEX
      Serial: 123456
      Application: test
      Id: 13791624
      Published: 2002-12-25
      Body: Text

      blah blah blah any binary data to be word-indexed
      <socket half-closed>

This will first check for and delete document 13791624 in the test
domain.  In order to delete this document, it must first be found.  In
order to uniquely find it, it must be searchable on its Id field.  The
side effect of this is that a general search may find it based on its
Id.

After deletion, the document is re-indexed and re-inserted into the index.

Note that Lucene converts Date fields into a base thirty-six
representation of a Java time-stamp.  This is padded with leading
zeroes to a fixed length.  The standard Lucene Date Handler can only
deal with dates after the 1970 epoch.  I have replaced this with a
modified Date handler which can handle dates from approximately year
zero.
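
To make the Date representation concrete, the following Python sketch
encodes a Java-style time-stamp (milliseconds since the 1970 epoch) in
base thirty-six with leading zeroes.  The function name to_base36 and the
width of 9 characters are assumptions made for illustration; the sketch
only covers non-negative time-stamps and does not attempt to reproduce
the modified Date handler mentioned above.

```python
import string

# Digits 0-9 followed by a-z, the usual alphabet for radix 36.
DIGITS = string.digits + string.ascii_lowercase


def to_base36(millis, width=9):
    """Encode a non-negative millisecond time-stamp, zero-padded to width."""
    if millis < 0:
        raise ValueError("this sketch only handles dates after the epoch")
    encoded = ""
    while millis:
        millis, remainder = divmod(millis, 36)
        encoded = DIGITS[remainder] + encoded
    if len(encoded) > width:
        raise ValueError("time-stamp does not fit in the chosen width")
    # Fixed-length padding makes string order agree with date order.
    return encoded.rjust(width, "0")
```

Because every encoded value has the same length, comparing two encoded
dates as plain strings gives the same answer as comparing the dates
themselves, which is what makes the form suitable for sorting and range
queries.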

The Serial header causes the server to insert the same Serial header
in the response.  This will allow clients who process requests and
responses asynchronously to link a response with a particular request.

Once the INDEX operation is complete, the server will attempt to respond with:

      INDEX-RESPONSE
      Serial: 123456
      END

If the client has already fully-closed the socket after sending the
body, then the server will get a 'Connection Reset' TCP error.  This
will cause the server to drop the socket connection, but the indexing
operation will have completed successfully.

If the client only half-closed the outgoing stream, then it will
receive the response.  The socket cannot be used for further requests
in this state, so the server will close it after the response.

If the client had used a Content-Length header or not used a Body,
then the client could continue sending further transmissions to the
server.
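
The half-close case can be sketched in Python as follows.  The function
name send_and_half_close is hypothetical; the sketch assumes an already
connected TCP socket and a request whose body runs to end of input.

```python
import socket


def send_and_half_close(sock, request):
    """Send a request with an unbounded body, then read the response."""
    sock.sendall(request)
    # Half-close: signal end of input to the server but keep reading.
    sock.shutdown(socket.SHUT_WR)
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            # The server closes the socket once the response is sent.
            break
        chunks.append(chunk)
    return b"".join(chunks)
```

A client that calls close() instead of shutdown() at this point would be
in the 'Connection Reset' situation described above and would never see
the INDEX-RESPONSE.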

QUERY example
-------------

Once all the documents are in, the client can execute a query:

      QUERY
      Application: test
      First: 101
      Limit: 2
      Query: sex and drugs and rock and roll
      END

The 'test' Application defines the order in which the documents are
returned, the fields to be returned with each document, and a
narrowing query clause (Domain:test).  The server responds with:

      QUERY-RESPONSE
      Count: 1234567
      I: 101
      RANK: 0.2178
      Domain: test
      Id: 00713-216-9904
      Author: Joe Bloggs
      Published: 1036177152
      I: 102
      RANK: 0.2003
      Domain: test
      Id: 01819-110-6101
      Author: Mary Potts
      Published: 971209480
      END

The total number of documents that match the query is always returned,
followed by a number of fields for each document.  Three fields are
always returned for each document in addition to the requested field
list:

I --    The relative document index in the result list (after
        sorting).  The 'I' value for the first document returned will
        always be the number specified in the query 'First' header, or
        1 by default.  The 'I' field for successive documents will
        increment by one.

RANK -- The match ranking assigned by the indexer to this result.

Id --   The Id of the document.

When iterating through a result list, the first item for each result
is always the "I: nnn" line.
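
Since each result starts with an "I" line, a client can split the
response into per-document groups quite simply.  The Python sketch below
shows one way to do it; the function name parse_query_response is
hypothetical, and it assumes that a repeated keyword within one result
simply overwrites the earlier value.  An 'Error' header (described later
in this document) is turned into an exception.

```python
from urllib.parse import unquote


def parse_query_response(text):
    """Return (count, results) where results is a list of dicts."""
    lines = text.splitlines()
    if not lines or lines[0] != "QUERY-RESPONSE":
        raise ValueError("not a QUERY-RESPONSE transmission")
    count = None
    results = []
    for line in lines[1:]:
        if line == "END":
            break
        keyword, _, value = line.partition(":")
        keyword, value = keyword.strip(), unquote(value.strip())
        if keyword == "Error":
            raise RuntimeError(value)
        if keyword == "Count":
            count = int(value)
        elif keyword == "I":
            # An "I" line always opens a new result group.
            results.append({"I": int(value)})
        elif results:
            results[-1][keyword] = value
    return count, results
```

Fed the example response above, this yields a count of 1234567 and two
dictionaries, the first of which has an 'I' of 101.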


Syntax Definition
=================

The communication between the client and the server is called a
"Session".  Each Session is made up of zero or more "Dialogues".  Each
Dialogue consists of a request Transmission and a response Transmission.

Session Definition
==================

A session (which is all the activity over the life of a TCP connection) is
defined as follows:

      Session = [ Dialogue ... ]

      Dialogue = Request Response

      Request = Transmission  (on client's outgoing connection)

      Response = Transmission (on client's incoming connection)

Transmission Definition
=======================

In the following syntax definitions, all terms with the suffix "Line"
on their name are coded on a single <LF>-terminated line.


All request and response transmissions are constructed in the same way:

      Transmission = CommandLine Headers Trailer

      CommandLine = 'INDEX' | 'QUERY' | 'CONTROL' | 'UNINDEX' | 'ERROR'
                  | 'INDEX-RESPONSE' | 'QUERY-RESPONSE' | 'CONTROL-RESPONSE'
                  | 'UNINDEX-RESPONSE'

      Headers = [ HeaderLine ... ]

      HeaderLine = Keyword [ <sp> ... ] ':' [ <sp> ... ] [ Value ] [ <sp> ... ] 

      Keyword = Printable ascii string excluding space and colon

      Value = Printable ascii string.  Control characters (and "%") must
              be URL-encoded.

      Trailer = EndTagLine
              | BlankLine Body

      EndTagLine = 'END'

      BlankLine = solitary <LF> terminator

      Body = Raw data up to specified length or end of input


There is no asynchronous activity over a single connection.  All
requests are processed in the order received and only when the
previous response has been sent.
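
The HeaderLine rule above can be approximated with a regular expression.
The following Python sketch is one possible reading of the grammar, not
the server's own parser; the names HEADER_RE and parse_header_line are
hypothetical.  Only the first colon separates Keyword from Value, so a
value such as 'Domain:test' survives intact.

```python
import re
from urllib.parse import unquote

# Keyword, optional spaces, colon, optional spaces, optional Value,
# optional trailing spaces.
HEADER_RE = re.compile(r"^([^\s:]+) *: *(.*?) *$")


def parse_header_line(line):
    """Split one header line into (keyword, decoded value)."""
    match = HEADER_RE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("not a header line: {!r}".format(line))
    keyword, value = match.groups()
    # Values are URL-encoded on the wire, so decode them here.
    return keyword, unquote(value)
```

A line without a colon, such as the END tag or a command line, does not
match and is reported as an error by this sketch; a real reader would
check for those before trying to parse a header.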

Transmission Bodies
===================

A transmission must be terminated by an 'EndTagLine' (END) or by a body.  
A body is recognised as any data following a blank line.  A body is a
raw binary stream and is not interpreted or filtered in any way.

The end of the body is reached either:
  1) after the number of bytes specified in the 'Content-Length' header, or
  2) at end of input.

Obviously, if method 2) is used to terminate the body, then this
socket cannot be used for any further transmissions.  The server will
never send a body in a response transmission.  This is only useful for
indexing.
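
The two ways of ending a body can be sketched in Python like this.  The
function name read_body is hypothetical, and the stream is assumed to be
a binary file-like object positioned just after the blank line.

```python
import io


def read_body(stream, content_length=None):
    """Read a raw body, bounded by Content-Length or by end of input."""
    if content_length is None:
        # Method 2: no length given, so the body runs to end of input.
        return stream.read()
    chunks = []
    remaining = content_length
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            # End of input also terminates a length-bounded body.
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

With a Content-Length the stream is left positioned at the start of the
next transmission, which is why only method 1) allows the socket to be
reused.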

Headers
=======

Headers are used to assign values to fields, or to control the server
in particular ways.  In general, headers may be specified multiple
times.  In some cases, multiple instances of a header will be combined
in a particular way.  In other cases, the last instance will take
precedence over any earlier instances.

The order of headers in the transmission may be significant in that
headers may interact with each other. e.g. Range headers.

Headers may be specified without any values.  The effect of this is to
delete (for this transmission) any application default which may have
been set.  Further headers of the same type may then be specified to
assign a new value.  This is the only way that an application default
value for a concatenating header (like Sort or Return) may be
overridden.
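
The interaction between application defaults, blank headers and repeated
headers might be modelled as in the Python sketch below.  This is an
interpretation of the rules above rather than the server's code: the
function name effective_headers is hypothetical, and the set of
concatenating headers is assumed to contain only the two named in the
text.

```python
# Assumption: only the headers named in the text are treated as concatenating.
CONCATENATING = {"Sort", "Return"}


def effective_headers(defaults, headers):
    """Combine application defaults with one transmission's headers."""
    merged = {keyword: list(values) for keyword, values in defaults.items()}
    for keyword, value in headers:
        if value == "":
            # A blank header discards whatever has been collected so far.
            merged[keyword] = []
        elif keyword in CONCATENATING:
            merged.setdefault(keyword, []).append(value)
        else:
            # For other headers the last instance takes precedence.
            merged[keyword] = [value]
    return merged
```

The defaults themselves are copied and never modified, which matches the
statement that a blank header only deletes the default for the current
transmission.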

Special Headers
===============

There are two special Headers which affect the interpretation of the
transmission: 'Body' and 'Content-Length'

The 'Content-Length' header indicates the number of bytes in the body.
It is not mandatory to have a body if a 'Content-Length' header is
provided.  As the server does not send bodies in response messages,
the server will also never send Content-Length headers.  The value of
a Content-Length header must be an ascii string made up entirely of
decimal digits representing a decimal number.  The number is the size
of the body in bytes.  After that number of bytes of body have been
read, the Transmission is complete.  End-of-Input will also terminate
the body.  The content length must be a number between 0 and 2**31 - 1
(that is, it must fit in a 32-bit signed integer).

When using a body, a 'Body' header should be used to name the field that
the body will be attached to.  If no 'Body' header is supplied, the
body is discarded.
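
Putting the two special headers together, an INDEX request with a
length-delimited body might be assembled as in this Python sketch.  The
function name build_index_request is hypothetical, and header values are
assumed to be URL-encoded already.

```python
MAX_CONTENT_LENGTH = 2**31 - 1  # must fit in a 32-bit signed integer


def build_index_request(headers, body_field, body):
    """Return the bytes of an INDEX request carrying a raw body."""
    if len(body) > MAX_CONTENT_LENGTH:
        raise ValueError("body too large for a Content-Length header")
    lines = ["INDEX"]
    lines.extend("{}: {}".format(keyword, value) for keyword, value in headers)
    # 'Body' names the field that the body will be attached to.
    lines.append("Body: {}".format(body_field))
    lines.append("Content-Length: {}".format(len(body)))
    # The blank line separates the headers from the raw body.
    head = "\n".join(lines) + "\n\n"
    return head.encode("ascii") + body
```

Because the length is stated up front, no END tag is sent and the same
socket can carry further requests once the response has arrived.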




Definition of Dialogue Types
============================

CONTROL Dialogue
================

The CONTROL dialogue is a catch-all for sub-commands that can be used
to configure the server in various ways.  Following are the
sub-commands that are defined:

APPLICATION --   This is used to set up default indexing and querying
                 parameters for a client application.  For example, a
                 client application may specify that when indexing, it
                 may provide a 'Published' field that is to be
                 represented as a date, but not word-indexed, an
                 'Abstract' field that is to be indexed and stored, an
                 'Id' field that is to be stored but not indexed and a
                 'Text' field that is to be indexed but not stored.

OPTIMIZE --      This will optimize the Index store so that queries
                 will run faster.  This should be run after a number
                 of INDEX operations to reorganise the store.


APPLICATION SUB-COMMAND
-----------------------

Sets or updates Application defaults to be stored on disk in an
Application config file.  The values of any headers specified will be
appended to values already stored, unless a blank header is specified,
in which case all prior values are blanked out.

Mandatory headers: Sub-Command, Application

Optional Headers: Lucene-Index-Directory, All headers used in any
                  other Dialogue.

Sub-Command --     Must be APPLICATION.  This identifies an APPLICATION
                   sub-command.

Application --     The name of the Application being modified.

Lucene-Index-Directory -- An application may specify an alternative,
                   completely separate, Lucene Index directory.  Only
                   one index may be searched in any given query.  For the
                   moment, the decision has been made that all the data
                   will reside in one index, so this should not be
                   specified.

Other headers --   Application default values can be specified and will
                   be stored in an application config file on disk.
                   This will allow application-wide defaults such as
                   the names and types of fields, the result sorting
                   order, and the result fields required.


OPTIMIZE SUB-COMMAND
--------------------

An Optimize command should be executed after some INDEX operations
have completed to rebuild the files so that Query performance is
maximised.  Possibly, the Server should perform optimize operations
itself when it detects some idle time.  It is possible that an
optimize of a very fragmented index will be quite slow: we'll have to
experiment.  A call to optimize after indexing 500,000 documents took
around 4 minutes!  The LuceneServer was pretty much unresponsive 
during that time.

Mandatory headers: Sub-Command

Optional Headers: Application

Sub-Command -- Must be OPTIMIZE

Application -- The Application may override the default index location.



CONTROL-RESPONSE Transmission
-----------------------------

After the completion of a Control function, the Control-Response
transmission will be sent back to the client.  If there was an error
in the operation, the response will include an "Error" header
containing a description of the error.  It may be that this part of
the system will need tightening up by the use of defined error codes,
etc.





QUERY Dialogue
==============

A QUERY dialogue consists of a QUERY request transmission being sent by the
client followed by a QUERY-RESPONSE response transmission from the server.

Mandatory headers: Query

Optional Headers: First, Limit, Application, Return, Sort, Sort-Limit,
                  Default-Field, Range-Field, Range-From, Range-To

Description of headers

Query --        The query string to be passed to the indexer.  As with
                all Header values, this must be URL-encoded to remove
                control characters (unless it is sent as a body).

                To avoid the need to URL-encode the query, it can be
                specified as part of the body by doing something like:
                     Body: Query
                     Content-Length: 15

                     blah blah blah
                     
                See: jakarta.apache.org/lucene/docs/queryparsersyntax.html
                for query syntax.

Return --       The fields to return.  If this header is not specified
                then the Rank, Domain and Id fields are returned. Any 
                specified fields are in addition to these three. If an
                Application is specified then the Application defaults
                apply.

First --        Specify that, after appropriate sorting, all items up
                to this one are discarded and not returned to the
                client.  This will allow the client to perform
                page-at-a-time queries.

Limit --        Specify that no more than this many items are
                returned.  If negative, an unlimited number of results
                are returned.
                If 0 (zero) then only a count of results is returned.

Sort --         Specify that the results are ordered based on the
                values of specified fields.  If the sort should be on
                multiple fields, then multiple sort headers must be
                specified.  By default, the indexer orders results by
                rank.  The special field name 'RANK' may be used to
                include rank-order within a multi-field sort.  To
                sort fields in reverse order, append ':Desc' to the
                sort field name, e.g.
                  Sort Date:Desc RANK:Desc Author
                  Sort RANK:Desc
                Sorting is very memory intensive as all the documents 
                must be loaded into memory and then sorted. USE WITH
                CAUTION when handling large result sets.
                IN ORDER TO SORT ON A FIELD IT MUST BE STORED!

Sort-Limit --   This specifies a user limit on the number of results
                to be sorted.  This is to increase application performance
                as sorting 1000s of results takes considerable time.
                If this value is lower than the Limit value specified 
                above then it will be ignored.
                If the number of results is greater than the Sort-Limit
                then an error will be returned.
                Please note that there is also a System sort-limit which
                is specified in the Server.config file.

Application --  Indicate that the default parameters for this query
                are to be taken from the specified application.  The
                application must have been set up previously.

Default-Field - This is the field that the query parser uses for any
                search term that does not specify a field name.  It
                must be entered in either the Transmission or the
                Application defaults.  If not specified, then unqualified
                search terms will be associated with a non-existent field
                and so will cause the query to find nothing.

Range-Field --  Specifies the Field name for which a Range
                query clause is wanted.  This header should be followed
                by the Range-From and Range-To headers.  Further Range-Field
                headers may then be specified for other fields.
                One or both of Range-From or Range-To must be specified before
                another Range-Field, otherwise this Range-Field is ignored.
                If only one of Range-From or Range-To is specified, then
                the range is open-ended.

Range-From --   Specifies a search start point for the previously-specified
                range field.  If the range field is defined as a Date field
                then the Range-From must specify a date value (in the agreed
                format).  If this field is not specified, the range starts at
                the first possible value.

Range-To --     Specifies a search end point for the previously-specified
                range field.  If the range field is defined as a Date field
                then the Range-To must specify a date value (in the agreed
                format).  If this field is not specified, the range ends at
                the last possible value.
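
To show how the First, Limit and Range headers fit together, here is a
Python sketch that builds the header list for one page of results.  The
function name query_headers and its parameters are hypothetical; 'First'
is taken to be 1-based, in line with the 'I' numbering in the QUERY
example.

```python
def query_headers(query, page=1, page_size=10, ranges=()):
    """Build an ordered header list for one page of a QUERY."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    # Page 1 starts at item 1, page 2 at item page_size + 1, and so on.
    first = (page - 1) * page_size + 1
    headers = [("Query", query), ("First", str(first)), ("Limit", str(page_size))]
    for field, start, end in ranges:
        if start is None and end is None:
            # A Range-Field with neither bound would be ignored anyway.
            continue
        headers.append(("Range-Field", field))
        if start is not None:
            headers.append(("Range-From", start))
        if end is not None:
            headers.append(("Range-To", end))
    return headers
```

Passing None for one bound gives an open-ended range.  Each Range-Field
is emitted immediately before its own Range-From and Range-To so that the
order-dependent pairing described above is preserved.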

QUERY-RESPONSE Transmission
===========================

The response to a query will contain a count of the number of matching
documents followed by zero or more groups of the relative document
number in the result list and the requested fields for the document.

Mandatory Headers: Count

Optional Headers: I, Rank, Domain, Id and any requested fields

If the first header in the Query-Response is not 'Count', but 'Error',
then the Query failed.  A descriptive string will be attached to the
Error header.


INDEX Dialogue
==============

The INDEX dialogue is used to add new documents to the store or to
replace documents.  Each document must include the mandatory fields
'Domain' and 'Id' which will serve to uniquely identify them.  The
'Domain' field may be defaulted from an Application default, but the
'Id' field must be individually specified on each document.

Mandatory Headers: Id

Optional Headers: Application, Domain, Stop-List, Field-Name,
                  Field-Type, Field-Indexed, Field-Stored, and
                  document fields.

Application --  Indicate that the default parameters for this document
                are to be taken from the specified application.  The
                application must have been set up previously.  This
                can usefully be used to set up field definitions and a
                stop-list in advance.

Domain --       Every document must include a Domain field which may
                be specified in the INDEX header or taken from the
                Application defaults.  This will be used to isolate
                documents into application areas so that one
                application does not reference another's documents.

Id --           Every document must be uniquely identified (within its
                domain).  The first act when indexing a document is to
                delete any existing document with the same Domain/Id
                pair.  This field allows the client to uniquely
                address the actual document in an external database.

Stop-List --    A space-separated list of words that should be ignored
                when indexing the document.  Generally, this would be
                specified in the Application defaults.

Field-Name --   The name of the field whose attributes are defined in
                subsequent headers.

Field-Type --   Whether the field named in the previous Field-Name
                definition is a 'Date' field, a 'Text' field or an
                'Id' field. A 'Text' field is a field which is split
                into words.  An Id field is not split into words: its
                value is treated as a whole word (or "term" in
                Lucene's terminology).  A Date field is an Id field
                which is converted from an external Date
                representation into a form suitable for indexing and
                sorting.

Field-Indexed - 'Yes' or 'No' indicating whether indexing is required
                for this field.  A field cannot be used in a search
                query unless it is indexed.

Field-Stored -- 'Yes' or 'No' whether the content of a field is
                stored.  For example, this would not be done for the
                body text of a document as it is assumed to be stored
                elsewhere, but it would be required for the ID field
                so that the document ID could be returned as part of
                the query result.

Doc Fields --   Any other headers are assumed to contain document
                text.  If they are fields that haven't been defined as
                above, then they are assumed to be Text fields that
                are indexed but not stored.  Any fields that have been
                defined in a 'Field-Name' header but not supplied in a
                Field header will not become part of the document.

Body --         It may be useful to define a Body field so that the
                document text need not be URL-encoded.
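
The way Field-Name groups the Field-* headers that follow it, and the
fallback for undefined fields, can be sketched in Python as below.  The
names parse_field_definitions and field_attributes are hypothetical.
Treating an omitted Field-Type, Field-Indexed or Field-Stored header as
taking the same values as an undefined field is an assumption, since the
text does not say what happens in that case.

```python
# Undefined fields are Text fields that are indexed but not stored.
DEFAULT_FIELD = {"type": "Text", "indexed": True, "stored": False}


def parse_field_definitions(headers):
    """Collect field attributes from an ordered list of header pairs."""
    fields = {}
    current = None
    for keyword, value in headers:
        if keyword == "Field-Name":
            # Subsequent Field-* headers apply to this field.
            current = fields.setdefault(value, dict(DEFAULT_FIELD))
        elif current is None:
            continue
        elif keyword == "Field-Type":
            current["type"] = value
        elif keyword == "Field-Indexed":
            current["indexed"] = value == "Yes"
        elif keyword == "Field-Stored":
            current["stored"] = value == "Yes"
    return fields


def field_attributes(fields, name):
    """Look up a field, falling back to the undefined-field defaults."""
    return fields.get(name, dict(DEFAULT_FIELD))
```

Headers that are not Field-* headers, such as a 'Domain' default placed
between two field definitions, are simply skipped by this sketch.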

INDEX-RESPONSE Transmission
===========================

There is not much to be said about an Indexing operation.  I
anticipate that there will be no headers in this dialogue.  The
arrival of the response can be taken as positive confirmation that the
index operation has completed.

Optional Headers: Error

Error --        This indicates that an error has occurred and contains an
                error message.

UNINDEX Transmission
====================

An UNINDEX request deletes an indexed document from the index.

The format of an UNINDEX request is as follows:

      UNINDEX
      Application: your_app_name_here
      Id: id_of_document_to_delete
      END

There is also a development-only purge feature.
DO NOT DEPEND ON THIS TO REMAIN IN THE LUCENESERVER.

It will remove all documents for a given application using the application's
domain.  There is no support for an alternative domain at this time.

To use it, send the following:

      UNINDEX
      Application: your_app_name_here
      Purge: true
      END

UNINDEX-RESPONSE Transmission
=============================

The response never carries any headers, i.e.

      UNINDEX-RESPONSE
      END

ERROR Transmission
==================

If the server receives unexpected input (excluding blank lines) it
will respond with an ERROR transmission.  This will contain a single
Error header describing the problem.  Once an ERROR transmission has
been sent, all erroneous input is ignored (and no further ERROR
transmissions sent) until a valid transmission is received.
