$Id: README,v 1.8 2010/10/25 17:16:00 cm-msk Exp $

INTRODUCTION

This is the development area for OpenDKIM's statistics collection system.

In earlier versions, the statistics system gathered numerous details about
messages and their DKIM signatures in order to produce a report about
the observed implementation of DKIM on the Internet.  The data recorded
included the number of signatures, percent of signature failures, first-party
vs. third-party signatures, ADSP statistics, etc.  These data were collected
and submitted to the Internet Engineering Task Force as part of a
requirement to advance DKIM to Draft Standard status, and this work was
very successful.

As of v2.5.0, the statistics system has been repurposed to support ongoing
research and development into open DKIM-based reputation services.  The data
collected has been drastically reduced to include only information about
message arrival times, From domains, signing domains, sending IP addresses,
and whether or not the message was reported as spam.

You are invited, but certainly not required, to provide this
information to The OpenDKIM Project.  The code that does the reporting is
open source and included in this package, so you can easily verify that
nothing other than the above is being revealed in the data thus submitted.
Participants who submit data will be invited to take part in an experimental
DKIM-based reputation system that should begin to appear in the near future.

This directory contains information and software for generating, receiving
and processing such reports, including commands for creating a MySQL database
to store the reports for later query, and a program that can receive reports
generated by opendkim-stats to put such data directly into that database.


INSTALLATION

This system can be made to work with any SQL-based system, but the provided
scripts and documentation presume MySQL.

1.	Compile OpenDKIM using the "--enable-stats" option:

	% ./configure --enable-stats
	% make

	If you wish to apply local extensions to the statistics reporting
	system, see the EXTENSIONS section below.

2.	Install MySQL.  Create a database called "opendkim", then create
	credentials for an "opendkim" user, optionally with a password.
	Note that this is a MySQL account and does NOT refer to a "real"
	UNIX user and password.  The user you create must be granted SELECT
	and INSERT access to that database.

3.	From inside the MySQL client, source the "mkdb.mysql" script.
	This will create the tables required to store the statistics reports.

4.	Install the "opendkim-stats" program someplace.

5.	Configure OpenDKIM to have statistics enabled, and begin reporting
	them to a file someplace.  See the opendkim(8) and opendkim.conf(5)
	man pages for details.

6.	Restart opendkim.

7a.	To get a human-readable form of the recorded statistics, use:

		opendkim-stats /path/to/stats/file

	...using, of course, the path to the statistics file you configure
	in opendkim.conf.  You can reset the contents of that file at any
	time by simply removing it or copying /dev/null on top of it.  The
	filter will create/append to the file on the next received message.

7b.	To translate records in the recorded statistics file into SQL
	insert operations, use:

		opendkim-importstats /path/to/stats/file

	...again using the path you specified in opendkim.conf.  You will
	probably also need one or more of the following:

		-d dbname	database name (default "opendkim")
		-p dbpasswd	database user's password (no default)
		-s scheme	database scheme (default "mysql")
		-u dbuser	database user (default "opendkim")

	Append "-r" to this to remove the statistics file on completion.

	You can also use the provided opendkim-genstats script to generate
	useful reports from the accumulated data.  Contribution of other
	reports you find useful would be welcome.

7c.	To participate in The OpenDKIM Project's data collection work,
	ask for the submission address you should use and then create this
	script and add it to your crontab:

		mv /path/to/stats/file /path/to/stats/file.OLD
		sendmail submission-address < /path/to/stats/file.OLD
		rm /path/to/stats/file.OLD

	When your data are received, The OpenDKIM Project will use the
	aforementioned "opendkim-importstats" tool to import your data into
	the accumulated database.  You can view regularly generated reports,
	including your data, at:

		http://www.opendkim.org/stats/report.html
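
The opendkim-importstats options listed in step 7b can be assembled
programmatically.  The following is a minimal sketch only: the helper name
importstats_command() is hypothetical, and placing the options before the
file name is an assumption based on common command-line convention, not
something this document specifies.

```python
import shlex


def importstats_command(stats_file, dbname=None, dbpasswd=None,
                        scheme=None, dbuser=None, remove=False):
    """Build an argument list for opendkim-importstats (hypothetical helper).

    Only the flags documented in step 7b are used.  Anything left as None
    is omitted so that the tool's own default applies.
    """
    argv = ["opendkim-importstats"]
    for flag, value in (("-d", dbname), ("-p", dbpasswd),
                        ("-s", scheme), ("-u", dbuser)):
        if value is not None:
            argv += [flag, value]
    if remove:
        argv.append("-r")       # remove the statistics file on completion
    argv.append(stats_file)     # assumption: file name goes last
    return argv


if __name__ == "__main__":
    # Print a shell-quoted command line for inspection; nothing is executed.
    print(shlex.join(importstats_command("/path/to/stats/file",
                                         dbuser="opendkim", remove=True)))
```

The resulting list can be handed to subprocess.run() as-is, which avoids
shell quoting problems with passwords that contain special characters.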

FILE FORMAT

The format of the file written by the opendkim filter is described here.
If demand appears, a stable API for accessing it will be provided.  Until
then, application developers are advised not to rely on this information being
stable between versions.

A statistics file consists of lines of ASCII data which are delimited from
each other by a single LF (ASCII 10).

Empty lines and lines that begin with a non-alphabetic character are ignored
completely.

The capital letter at the beginning of a line identifies the type of record
that line represents.  The first such line also implicitly ends the global
values section.

The following record types are currently defined:

	M	identifies a message
	S	identifies a signature
	U	updates a message
	X	carries extension data (see EXTENSIONS below)

A message record is a tab-separated, ordered sequence of fields, as follows:

	MTA-provided job/envelope ID (string)
	reporter (string; defaults to hostname)
	first domain found in the From: header field (string)
	SMTP client IP address (string)
	UNIX timestamp of message receive time
	message size, in bytes
	signature count
	ATPS status (-1 = not checked, 0 = no, 1 = yes)
	spam status (-1 = not checked, 0 = no, 1 = yes)

A signature record implicitly references the preceding message record.
There may be more than one signature record per message; there could also be
none.  As above, a signature record is a tab-separated, ordered sequence of
fields, as follows:

	domain of the signature
	pass (0 = no, 1 = yes)
	failed due to "bh" mismatch (0 = no, 1 = yes)
	"l=" tag value (-1 = not present)
	error code from signature
	DNSSEC value (see DKIM_DNSSEC_* constants from dkim.h)

Fields for which no value is known or appropriate should be represented as
"-" in the file.

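As an illustration of the message and signature records described above,
here is a minimal parsing sketch.  It is not the code used by
opendkim-importstats.  It assumes that the record-type letter is followed
immediately by the first field, with no separator; verify this against a
real statistics file before relying on it.  The dictionary keys are
hypothetical names chosen for this example, and all values are kept as
strings.

```python
# Hypothetical key names, in the field order given in this section.
MESSAGE_FIELDS = ("jobid", "reporter", "from_domain", "ipaddr",
                  "msgtime", "size", "sigcount", "atps", "spam")
SIGNATURE_FIELDS = ("domain", "passed", "bh_mismatch", "l_tag",
                    "sigerror", "dnssec")


def parse_stats(text):
    """Return a list of message dicts, each with a "signatures" list."""
    messages = []
    for line in text.split("\n"):
        # Empty lines and lines starting with a non-alphabetic
        # character are ignored.
        if not line or not line[0].isalpha():
            continue
        # Assumption: the type letter is directly followed by field one.
        rtype, fields = line[0], line[1:].split("\t")
        # "-" marks a field with no known or appropriate value.
        fields = [None if f == "-" else f for f in fields]
        if rtype == "M":
            msg = dict(zip(MESSAGE_FIELDS, fields))
            msg["signatures"] = []
            messages.append(msg)
        elif rtype == "S" and messages:
            # A signature record refers to the preceding message record.
            sig = dict(zip(SIGNATURE_FIELDS, fields))
            messages[-1]["signatures"].append(sig)
    return messages


SAMPLE = ("# not a record\n"
          "Mo9PH1abc\tmx.example.com\texample.org\t192.0.2.1\t"
          "1288027000\t2048\t1\t-1\t0\n"
          "Sexample.org\t1\t0\t-1\t0\t-\n")

if __name__ == "__main__":
    for message in parse_stats(SAMPLE):
        print(message["jobid"], message["from_domain"],
              len(message["signatures"]))
```

Record types other than "M" and "S" are skipped by this sketch.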

UPDATES

As of v2.5.0, a "spam" column is present in the table tracking per-message
information.  When a batch of new message information is sent, this field
is typically populated with a spam value of 0 (not spam).  An update record
can be sent later that changes the value in this column.

The fields of an update ("U") record are:

	MTA-provided job/envelope ID (string)
	reporter (string; defaults to hostname)
	UNIX timestamp of message receive time (0 if not known)
	spam indicator (-1 = not tested, 0 = not spam, 1 = spam)

The first three fields must match those given for the corresponding "M"
(new message) record at some time in the past.  Where the timestamp is 0,
the latest record matching the first two is updated.  The "spam" column for
that message will be replaced with the value found in the fourth field.
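
The matching rule above can be sketched against an in-memory list of
message records.  This is an illustration only, not the logic used by
opendkim-importstats.  The function name and dictionary keys are
hypothetical, and "latest" is assumed here to mean the record with the
greatest receive timestamp.

```python
def apply_update(messages, jobid, reporter, msgtime, spam):
    """Apply one update record to a list of message dicts, in place.

    Returns the updated message, or None if nothing matched.
    """
    candidates = [m for m in messages
                  if m["jobid"] == jobid and m["reporter"] == reporter]
    if msgtime != 0:
        # A nonzero timestamp must match as well.
        candidates = [m for m in candidates if m["msgtime"] == msgtime]
    if not candidates:
        return None
    # Timestamp 0: fall back to the latest matching record (assumed to
    # be the one with the greatest receive time).
    target = max(candidates, key=lambda m: m["msgtime"])
    target["spam"] = spam
    return target


if __name__ == "__main__":
    msgs = [
        {"jobid": "A1", "reporter": "mx", "msgtime": 100, "spam": 0},
        {"jobid": "A1", "reporter": "mx", "msgtime": 200, "spam": 0},
    ]
    apply_update(msgs, "A1", "mx", 0, 1)   # updates the msgtime=200 entry
    print(msgs)
```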


EXTENSIONS

For the purpose of allowing local experimentation, it is possible to
extend the statistics reporting to include data about a message not covered by
the basic schema distributed with OpenDKIM.

Extension data are collected and stored during execution of the Lua "final"
script and then written to the stats file along with the record for that
message.
These items are written on "X" lines in the form "name value" (i.e. name and
value separated by whitespace).  opendkim-importstats will insert these into
your database using "name" as the name of the column to be updated and "value"
as the value to be placed there.
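
To make the "name value" convention concrete, here is a minimal sketch of
how such a line might be turned into a column update.  It is not the
opendkim-importstats implementation.  It assumes the line begins with the
letter "X"; the function names are hypothetical, as is the "id" key column
in the generated SQL.  Because a column name cannot be passed as a bound SQL
parameter, the sketch restricts names to a conservative identifier pattern.

```python
import re

# Conservative pattern for names that are safe to use as column names.
_COLUMN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def parse_extension(line):
    """Split an "X" line into (name, value); return None if malformed."""
    if not line.startswith("X"):
        return None
    parts = line[1:].split(None, 1)   # name and value, whitespace-separated
    if len(parts) != 2:
        return None
    name, value = parts
    if not _COLUMN_RE.fullmatch(name):
        return None                   # refuse anything unsafe in SQL
    return name, value.rstrip()


def update_statement(name):
    """Build a parameterized UPDATE for a validated column name.

    The "id" column identifying the message row is an assumption.
    """
    return "UPDATE messages SET {} = %s WHERE id = %s".format(name)


if __name__ == "__main__":
    parsed = parse_extension("Xspamscore 7")
    if parsed is not None:
        print(update_statement(parsed[0]), parsed[1])
```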

Choosing the "--enable-statsext" flag at build configuration time adds support
for this.  The mechanism for recording these extra statistics is the
odkim.stats() function, detailed in the opendkim-lua(3) man page.

In this way, one can create whatever supplementary message-specific columns
are desired and populate each in whatever way is appropriate.

Participants who find that particular additional columns produce interesting
correlations are encouraged to share them with the OpenDKIM community.


CONVERTING TO THE IPADDRS SCHEMA

In version 2.3.0 of OpenDKIM, the schema changed to add a new "ipaddrs"
table.  IP addresses, which were previously stored in each row of the
"messages" table (causing duplication), are now stored in that new table.
The opendkim-importstats tool in 2.3.0 expects the new schema and is not
backward-compatible with the schema used by prior releases.

To convert your existing databases, use these steps:

1) Run the stats/mkdb.mysql script from inside your MySQL client to create
   the required new table.  Existing tables will not be altered.

2) Run the following additional MySQL commands (some are wrapped across
   several lines here for readability, but each lettered item is a single
   command):

	a) LOCK TABLES messages WRITE, ipaddrs WRITE;
	b) ALTER TABLE messages ADD COLUMN ip INT UNSIGNED AFTER ipaddr;
	c) INSERT INTO ipaddrs (addr, firstseen)
		SELECT ipaddr, MIN(msgtime) FROM messages
		GROUP BY ipaddr;
	d) UPDATE messages
		SET ip = (SELECT id FROM ipaddrs WHERE addr = messages.ipaddr);
	e) ALTER TABLE messages MODIFY COLUMN ip INT UNSIGNED NOT NULL,
		DROP COLUMN ipaddr,
		ADD CONSTRAINT FOREIGN KEY(ip) REFERENCES ipaddrs(id)
		ON DELETE CASCADE;
	f) UNLOCK TABLES;

   The various ALTER TABLE commands may require a rebuild of the "messages"
   table.  This can take a long time if the table is large.
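
To try the data-moving steps (b through d) on a scratch copy before
touching a production database, something like the following can be used.
It is a sketch under stated assumptions: it uses an in-memory SQLite
database as a stand-in for MySQL, the two table definitions are simplified
and hypothetical (they are not the ones created by mkdb.mysql), and the
locking, column drop and foreign key steps (a, e and f) are omitted.

```python
import sqlite3


def convert(conn):
    """Run simplified equivalents of steps b, c and d."""
    cur = conn.cursor()
    # b) add the new reference column
    cur.execute("ALTER TABLE messages ADD COLUMN ip INTEGER")
    # c) one ipaddrs row per distinct address, with its earliest sighting
    cur.execute("INSERT INTO ipaddrs (addr, firstseen) "
                "SELECT ipaddr, MIN(msgtime) FROM messages GROUP BY ipaddr")
    # d) point each message at its ipaddrs row
    cur.execute("UPDATE messages SET ip = "
                "(SELECT id FROM ipaddrs WHERE addr = messages.ipaddr)")
    conn.commit()


conn = sqlite3.connect(":memory:")
# Simplified, hypothetical stand-ins for the real tables.
conn.executescript("""
    CREATE TABLE ipaddrs (id INTEGER PRIMARY KEY,
                          addr TEXT UNIQUE, firstseen INTEGER);
    CREATE TABLE messages (id INTEGER PRIMARY KEY, jobid TEXT,
                           ipaddr TEXT, msgtime INTEGER);
    INSERT INTO messages (jobid, ipaddr, msgtime) VALUES
        ('a', '192.0.2.1', 300),
        ('b', '192.0.2.1', 100),
        ('c', '198.51.100.7', 200);
""")
convert(conn)

# Each message should now resolve to its address through the new column.
rows = conn.execute(
    "SELECT m.jobid, i.addr, i.firstseen FROM messages m "
    "JOIN ipaddrs i ON i.id = m.ip ORDER BY m.jobid").fetchall()
for row in rows:
    print(row)
```

The join at the end is a quick sanity check: every message should map to
exactly one "ipaddrs" row, and repeated addresses should share one row.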
