Python3 bindings for Xapian

The Python3 bindings for Xapian are packaged in the xapian module, so to use them you need to add this to your code:

import xapian

Since Xapian 1.4.22 these bindings require Python >= 3.3. If you still need support for older Python versions, Xapian <= 1.4.21 supports Python 3.2. If you still need Python 2 support, there are separate bindings for that.

The Python API largely follows the C++ API - the differences and additions are noted below.

Strings

The Xapian C++ API is largely agnostic about character encoding, and uses the std::string type as an opaque container for a sequence of bytes. In places where the bytes represent text (for example, in the Stem, QueryParser and TermGenerator classes), UTF-8 encoding is used. In order to wrap this for Python, std::string is mapped to/from the Python bytes type.

As a convenience, you can also pass Python str objects as parameters where this is appropriate, which will be converted to UTF-8 encoded text. Where std::string is returned, it’s always mapped to bytes in Python, which you can convert to a Python str by calling .decode('utf-8') on it like so:

for i in doc.termlist():
  print(i.term.decode('utf-8'))

Therefore, in order to avoid issues with character encodings, you should always pass text data to Xapian as Python str objects, or as UTF-8 encoded bytes.

There is, however, no requirement for byte strings passed into Xapian to be valid UTF-8 encoded strings, unless they are being passed to a text processing routine (such as the query parser, or the stemming algorithms). For example, it is perfectly valid to pass arbitrary binary data to the xapian.Document.set_data() method.
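The bytes/str distinction above can be handled with a small helper. The following is a minimal sketch rather than part of the bindings: to_text is a hypothetical name, and the byte strings merely stand in for values returned by Xapian.

```python
def to_text(value, errors="strict"):
    """Return a str for a value that may be UTF-8 bytes or already a str."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors)
    return value


# Stand-in for a term returned by the bindings: "café" as UTF-8 bytes.
raw_term = b"caf\xc3\xa9"
term = to_text(raw_term)

# Binary payloads (such as data stored with Document.set_data()) need not
# be valid UTF-8, so decoding them strictly can raise UnicodeDecodeError.
try:
    to_text(b"\xff\xfe")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```

Only decode values you know hold text. Data that was stored as arbitrary bytes is best kept as bytes.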

Unicode

Unicode text is most often in NFC already. If you need to normalise text before passing it to Xapian, the standard Python module “unicodedata” provides support for this. You probably want the “NFKC” normalisation scheme, so normalising a query string prior to parsing it would look something like this:

import unicodedata

def parse_query(query_string):
    # Normalise to NFKC before handing the text to the query parser.
    query_string = unicodedata.normalize('NFKC', query_string)
    qp = xapian.QueryParser()
    query_obj = qp.parse_query(query_string)
    return query_obj
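To see why the choice of normalisation scheme matters, the sketch below compares NFC and NFKC on two sample strings using only the standard unicodedata module. The sample strings and variable names are illustrative and do not come from Xapian.

```python
import unicodedata

# "cafe" followed by a combining acute accent (decomposed form).
decomposed = "cafe\u0301"
# The single-character "fi" ligature (U+FB01) followed by "le".
ligature = "\ufb01le"

# NFC composes the base letter and the accent into one code point.
nfc_accent = unicodedata.normalize("NFC", decomposed)

# NFC leaves compatibility characters such as the ligature untouched,
# while NFKC also folds them to their plain equivalents.
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(ascii(nfc_accent))
print(ascii(nfc_ligature))
print(ascii(nfkc_ligature))
```

If index-time text and query-time text are normalised differently, terms that look identical may fail to match, so it is worth applying the same scheme in both places.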

Exceptions

Xapian-specific exceptions are subclasses of the xapian.Error class, so you can trap all Xapian-specific exceptions like so:

try:
    do_something_with_xapian()
except xapian.Error as e:
    print(str(e))

xapian.Error is a subclass of the built-in Python Exception class, so it will also be caught by except Exception.

Iterators

The iterator classes in the Xapian C++ API are wrapped in a pythonic style. The following are supported (where marked as “default iterator”, it means __iter__() does the right thing so you can for instance use for term in document to iterate over terms in a Document object):

==================  =====================================  ===============================  ====================
Class               Python Method                          Equivalent C++ Method            Python iterator type
==================  =====================================  ===============================  ====================
MSet                default iterator                       begin()                          MSetIter
ESet                default iterator                       begin()                          ESetIter
Enquire             matching_terms()                       get_matching_terms_begin()       TermIter
Query               default iterator                       get_terms_begin()                TermIter
Database            allterms() (also as default iterator)  allterms_begin()                 TermIter
Database            postlist(term)                         postlist_begin(term)             PostingIter
Database            termlist(docid)                        termlist_begin(docid)            TermIter
Database            positionlist(docid, term)              positionlist_begin(docid, term)  PositionIter
Database            metadata_keys(prefix)                  metadata_keys_begin(prefix)      TermIter
Database            spellings()                            spellings_begin()                TermIter
Database            synonyms(term)                         synonyms_begin(term)             TermIter
Database            synonym_keys(prefix)                   synonym_keys_begin(prefix)       TermIter
Document            values()                               values_begin()                   ValueIter
Document            termlist() (also as default iterator)  termlist_begin()                 TermIter
QueryParser         stoplist()                             stoplist_begin()                 TermIter
QueryParser         unstemlist(term)                       unstem_begin(term)               TermIter
ValueCountMatchSpy  values()                               values_begin()                   TermIter
ValueCountMatchSpy  top_values()                           top_values_begin()               TermIter
==================  =====================================  ===============================  ====================

The pythonic iterators generally return Python objects, with properties available as attribute values, with lazy evaluation where appropriate. An exception is PositionIter (as returned by Database.positionlist for example), which returns an integer.

The lazy evaluation is mainly transparent, but does become visible in one situation: if you keep an object returned by an iterator, without evaluating its properties to force the lazy evaluation to happen, and then move the iterator forward, the object may no longer be able to efficiently perform the lazy evaluation. In this situation, an exception will be raised indicating that the information requested wasn’t available. This will only happen for a few of the properties - most are either not evaluated lazily (because the underlying Xapian implementation doesn’t evaluate them lazily, so there’s no advantage in lazy evaluation), or can be accessed even after the iterator has moved. The simplest workaround is to evaluate any properties you wish to use which are affected by this before moving the iterator. The complete set of iterator properties affected by this is:

  • Database.allterms (also accessible as Database.__iter__): termfreq

  • Database.termlist: termfreq and positer

  • Document.termlist (also accessible as Document.__iter__): termfreq and positer

  • Database.postlist: positer
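The workaround described above (read the affected properties before the iterator moves on) can be sketched as a small helper. snapshot_terms is a hypothetical name, not part of the bindings. It relies only on the term and termfreq attributes named in this section and accepts any iterable of such items.

```python
def snapshot_terms(termlist):
    """Copy term and termfreq out of each item before the iterator advances."""
    snapshot = []
    for item in termlist:
        # Reading item.termfreq here forces the lazy lookup while the
        # iterator still points at this item, so the value stays usable
        # after the loop has moved on.
        snapshot.append((item.term.decode("utf-8"), item.termfreq))
    return snapshot
```

With a real database this might be called as snapshot_terms(db.allterms()) or snapshot_terms(doc.termlist()). By contrast, collecting the items themselves with list(db.allterms()) and reading termfreq afterwards is the pattern that can raise the exception described above.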

MSet

MSet objects have some additional methods to simplify access (these work using the C++ array dereferencing):

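==========================  ==============================
Method name                 Explanation
==========================  ==============================
get_hit(i)                  returns MSetItem at index i
get_document_percentage(i)  convert_to_percent(get_hit(i))
get_document(i)             get_hit(i).get_document()
get_docid(i)                get_hit(i).get_docid()
==========================  ==============================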

Two MSet objects are equal if they have the same number and maximum possible number of members, and if every document member of the first MSet exists at the same index in the second MSet, with the same weight.
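As an illustration of the index-based accessors, the sketch below builds a short summary of the top hits. summarise_hits is a hypothetical helper. It assumes only that the object passed in provides size(), get_docid(i) and get_document_percentage(i), as an MSet does.

```python
def summarise_hits(mset, limit=10):
    """Return (rank, docid, percentage) tuples for the first `limit` hits."""
    # Never index past the number of hits actually held by the MSet.
    count = min(limit, mset.size())
    return [
        (i + 1, mset.get_docid(i), mset.get_document_percentage(i))
        for i in range(count)
    ]
```

A typical call might be summarise_hits(enquire.get_mset(0, 10)), where enquire is an xapian.Enquire object that already has a query set.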

Non-Class Functions

The C++ API contains a few non-class functions (the Database factory functions, and some functions reporting version information), which are wrapped like so for Python 3:

  • Xapian::version_string() is wrapped as xapian.version_string()

  • Xapian::major_version() is wrapped as xapian.major_version()

  • Xapian::minor_version() is wrapped as xapian.minor_version()

  • Xapian::revision() is wrapped as xapian.revision()

  • Xapian::Remote::open() is wrapped as xapian.remote_open() (both the TCP and “program” versions are wrapped - the SWIG wrapper checks the parameter list to decide which to call).

  • Xapian::Remote::open_writable() is wrapped as xapian.remote_open_writable() (both the TCP and “program” versions are wrapped - the SWIG wrapper checks the parameter list to decide which to call).

The following were deprecated in the C++ API before the Python 3 bindings saw a stable release, so are not wrapped for Python 3:

  • Xapian::Auto::open_stub()

  • Xapian::Chert::open()

  • Xapian::InMemory::open()

The version of the bindings in use is available as xapian.__version__ (as recommended by PEP 396). This may not be the same as xapian.version_string() as the latter is the version of xapian-core (the C++ library) in use.
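When reporting problems it can help to record both versions. The following is a rough sketch: version_report is a hypothetical helper that takes the imported module as a parameter and uses only the functions listed in this section. The bytes check is purely defensive.

```python
def version_report(xapian_module):
    """Collect the bindings version and the xapian-core version in one dict."""
    core = xapian_module.version_string()
    if isinstance(core, bytes):
        # Defensive: decode in case the string is returned as bytes.
        core = core.decode("utf-8")
    return {
        "bindings": xapian_module.__version__,
        "core": core,
        "core_tuple": (
            xapian_module.major_version(),
            xapian_module.minor_version(),
            xapian_module.revision(),
        ),
        # False when the bindings and xapian-core come from different releases.
        "same_version": xapian_module.__version__ == core,
    }
```

After import xapian, calling version_report(xapian) would give a dictionary suitable for logging or for inclusion in a bug report.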

Query

In C++ there’s a Xapian::Query constructor which takes a query operator and start/end iterators specifying a number of terms or queries, plus an optional parameter. In Python, this is wrapped to accept any Python sequence (for example a list or tuple) of terms or queries (or even a mixture of terms and queries). For example:

subq = xapian.Query(xapian.Query.OP_AND, "hello", "world")
q = xapian.Query(xapian.Query.OP_AND, [subq, "foo", xapian.Query("bar", 2)])

MatchAll and MatchNothing

These are wrapped as xapian.Query.MatchAll and xapian.Query.MatchNothing.

MatchDecider

Custom MatchDeciders can be created in Python - subclass xapian.MatchDecider, ensure you call the super-constructor from your constructor, and define a __call__ method that will do the work. The simplest example (which does nothing useful) would be as follows:

class mymatchdecider(xapian.MatchDecider):
  def __init__(self):
    xapian.MatchDecider.__init__(self)

  def __call__(self, doc):
    # Accept all documents.
    return True

ValueRangeProcessor

The ValueRangeProcessor class is deprecated and will be removed in Xapian 2.0.0. The replacement is RangeProcessor (added in Xapian 1.3.6). Use RangeProcessor instead in new code - it’s more flexible because it can return an arbitrary Query object. This section documenting ValueRangeProcessor is here to aid migrating existing uses.

The ValueRangeProcessor class (and its subclasses) provide an operator() method (which is exposed in Python as a __call__() method, making the class instances into callables). This method checks whether the beginning and end of a range are in a format understood by the ValueRangeProcessor, and if so, converts the beginning and end into strings which sort appropriately. ValueRangeProcessors can be defined in Python (and then passed to the QueryParser), or there are several default built-in ones which can be used.

In C++ the operator() method takes two std::string arguments by reference, which the subclassed method can modify, and returns a value slot number. In Python, we wrap this by passing two bytes objects to __call__ and having it return a tuple of (value_slot, modified_begin, modified_end). For example:

vrp = xapian.NumberValueRangeProcessor(0, '$', True)
a = '$10'
b = '20'
slot, a, b = vrp(a, b)

You can implement your own ValueRangeProcessor in Python. The Python implementation should override the __call__() method with its own implementation, which returns a tuple as above. For example:

class MyVRP(xapian.ValueRangeProcessor):
  def __init__(self):
    xapian.ValueRangeProcessor.__init__(self)
  def __call__(self, begin, end):
    # begin and end arrive as bytes, so prefix them with bytes literals.
    return (7, b"A" + begin, b"B" + end)

The equivalent RangeProcessor subclass to MyVRP would look like this:

class MyRP(xapian.RangeProcessor):
    def __init__(self):
        xapian.RangeProcessor.__init__(self)
    def __call__(self, begin, end):
        # begin and end arrive as bytes; 7 is the same value slot MyVRP uses.
        return xapian.Query(xapian.Query.OP_VALUE_RANGE, 7, b"A" + begin, b"B" + end)

Return xapian.Query(xapian.Query.OP_INVALID) to signal that you don’t want to handle an offered range.

Apache and mod_python/mod_wsgi

Prior to Xapian 1.3.0, you had to tell mod_python and mod_wsgi to run applications which use Xapian in the main interpreter. Xapian 1.3.0 no longer uses the simplified GIL state API, and so this restriction no longer applies.

Test Suite

The Python bindings come with a test suite, consisting of two test files: smoketest.py and pythontest.py. These are run by the make check command, or may be run manually. By default, they will display the names of any tests which failed, and then display a count of tests which ran and which failed. The verbosity may be increased by setting the VERBOSE environment variable, for example:

make check VERBOSE=1

Setting VERBOSE to 1 will display detailed information about failures, and a value of 2 will display further information about the progress of tests.