Synonyms
Introduction
Xapian provides support for storing a synonym dictionary, or thesaurus. This
can be used by the xapian.QueryParser class to expand terms in user query
strings, either automatically, or when requested by the user with an explicit
synonym operator (~).
Note
Xapian doesn’t offer automated generation of the synonym dictionary.
Here is an example of a search program with synonym functionality.
import json

import xapian

import support  # local helper module providing log_matches()


def search(dbpath, querystring, offset=0, pagesize=10):
    # offset - defines starting point within result set
    # pagesize - defines number of records to retrieve

    # Open the database we're going to search.
    # It is opened writable because we add a synonym below.
    db = xapian.WritableDatabase(dbpath)

    # Start of adding synonyms
    db.add_synonym("time", "calendar")
    # End of adding synonyms

    # Set up a QueryParser with a stemmer and suitable prefixes
    queryparser = xapian.QueryParser()
    queryparser.set_stemmer(xapian.Stem("en"))
    queryparser.set_stemming_strategy(queryparser.STEM_SOME)
    queryparser.add_prefix("title", "S")
    queryparser.add_prefix("description", "XD")

    # Start of set database
    queryparser.set_database(db)
    # End of set database

    # And parse the query
    query = queryparser.parse_query(querystring, queryparser.FLAG_SYNONYM)

    # Use an Enquire object on the database to run the query
    enquire = xapian.Enquire(db)
    enquire.set_query(query)

    # And print out something about each match
    matches = []
    for match in enquire.get_mset(offset, pagesize):
        fields = json.loads(match.document.get_data().decode('utf8'))
        print(u"%(rank)i: #%(docid)3.3i %(title)s" % {
            'rank': match.rank + 1,
            'docid': match.docid,
            'title': fields.get('TITLE', u''),
            })
        matches.append(match.docid)

    # Finally, make sure we log the query and displayed results
    support.log_matches(querystring, offset, pagesize, matches)
Here are the search results without the ~ operator:
$ python3 code/python3/search_synonyms.py db time
1: #065 Electric time piece with hands but without dial (no pendulum
2: #058 The "Empire" clock, to show the time at various longitudes,
3: #041 Frequency and time measuring instrument type TSA3436 by Venn
4: #056 Single sandglass in 4 pillared wood mount, running time 15 1
5: #043 Loughborough-Hayes automatic timing apparatus. Used by the R
6: #011 "Timetrunk" by Hines and Co., Glasgow (a sandglass for timin
7: #016 Copy of the gearing of the Byzantine sundial-calendar (1983-
8: #045 Master clock of the "Silent Electric" type made by the Magne
9: #018 Solar/Sidereal verge watch with epicyclic maintaining power
'time'[0:10] = 65 58 41 56 43 11 16 45 18
Notice the difference when the ~ operator is applied to time, for which calendar has been specified as a synonym:
$ python3 code/python3/search_synonyms.py db ~time
1: #016 Copy of the gearing of the Byzantine sundial-calendar (1983-
2: #072 German Perpetual Calendar in gilt metal
3: #065 Electric time piece with hands but without dial (no pendulum
4: #068 Ornate brass Perpetual Calendar
5: #058 The "Empire" clock, to show the time at various longitudes,
6: #041 Frequency and time measuring instrument type TSA3436 by Venn
7: #056 Single sandglass in 4 pillared wood mount, running time 15 1
8: #043 Loughborough-Hayes automatic timing apparatus. Used by the R
9: #026 Sundial and compass with perpetual calendar and lunar circles
10: #036 Universal 'Tri-Compax' chronographic wrist watch
'~time'[0:10] = 16 72 65 68 58 41 56 43 26 36
Model
The model for the synonym dictionary is that a term or group of consecutive terms can have one or more synonym terms. A group of consecutive terms is specified in the dictionary by simply joining them with a single space between each one.
If a term to be synonym-expanded will be stemmed by the xapian.QueryParser,
then synonyms will be checked for the unstemmed form first, and then for the
stemmed form, so you can provide different synonyms for particular unstemmed
forms if you want to.
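The lookup order described above can be modelled in a few lines of plain Python. This is only a sketch of the behaviour, not Xapian's implementation: the `synonyms` dictionary, the toy `stem()` function and `lookup_synonyms()` are hypothetical names introduced for illustration, and it assumes the stemmed form is only consulted when the unstemmed form has no entry.

```python
def stem(word):
    # Toy stand-in for a real stemmer: strips a single trailing "s" only.
    return word[:-1] if word.endswith("s") else word


def lookup_synonyms(synonyms, word):
    """Return synonyms for the unstemmed form, falling back to the stemmed form."""
    if word in synonyms:
        return synonyms[word]
    # Assumption: the stemmed form is checked only if the exact form has no entry.
    return synonyms.get(stem(word), [])


synonyms = {
    "time": ["calendar"],     # used for any form that stems to "time"
    "times": ["multiplied"],  # specific entry for this unstemmed form
}

print(lookup_synonyms(synonyms, "times"))
print(lookup_synonyms(synonyms, "time"))
```

Here the entry for the unstemmed form "times" takes precedence over the entry for its stem "time", which is what lets you give a particular unstemmed form its own synonyms.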
Todo
Discuss interactions with stemming (i.e., should the input and/or output values in the synonym table be stemmed?).
Adding Synonyms
Synonyms can be added with xapian.WritableDatabase.add_synonym(). In the
following example, calendar is specified as a synonym for time. Users may
similarly write a loop to load all the synonyms from a dictionary file.
db.add_synonym("time", "calendar")
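A loop of that kind might look like the sketch below. The file format (one entry per line, tab-separated, with the term first and its synonyms after it) and the helper name `load_synonyms()` are assumptions made for this example; Xapian does not prescribe a dictionary file format. The `db` argument is expected to be an object with an `add_synonym(term, synonym)` method, such as a xapian.WritableDatabase.

```python
def load_synonyms(db, lines):
    """Add every (term, synonym) pair found in `lines` to `db`; return the count."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Hypothetical format: term<TAB>synonym<TAB>synonym...
        term, *synonyms = [field.strip() for field in line.split("\t")]
        for synonym in synonyms:
            if synonym:
                db.add_synonym(term, synonym)
                count += 1
    return count


# Possible usage with a real database (the file name is illustrative):
#
#     with open("synonyms.tsv", encoding="utf8") as f:
#         load_synonyms(db, f)
```

Using a tab as the separator leaves spaces free for multi-term keys such as "stock market", which the dictionary stores as terms joined by a single space.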
QueryParser Integration
In order for any of the synonym features of the QueryParser to work, you must
call xapian.QueryParser.set_database() to specify the database to use.
queryparser.set_database(db)
If FLAG_SYNONYM is passed to xapian.QueryParser.parse_query() then the
xapian.QueryParser will recognise ~ in front of a term as indicating a
request for synonym expansion.
If FLAG_LOVEHATE is also specified, you can use + and - before the ~ to
indicate that you love or hate the synonym-expanded expression.
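Since the flags are bit masks, enabling both features means combining them with a bitwise OR. The helper below is a minimal sketch; `parse_with_synonyms()` is a hypothetical name, and it assumes it is given a QueryParser on which set_database() has already been called.

```python
def parse_with_synonyms(queryparser, querystring):
    """Parse `querystring` with the ~ operator and +/- prefixes enabled."""
    # Combine the two feature flags with bitwise OR.
    flags = queryparser.FLAG_SYNONYM | queryparser.FLAG_LOVEHATE
    return queryparser.parse_query(querystring, flags)


# Possible usage with the queryparser from the example program:
#
#     query = parse_with_synonyms(queryparser, "+~time -clock")
```

Note that flags passed to parse_query() replace the parser's default flags rather than adding to them, so any other feature you rely on needs to be ORed in as well.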
A synonym-expanded term becomes the term itself OP_SYNONYM-ed with any listed
synonyms, so ~truck might expand to truck SYNONYM lorry SYNONYM van. A group
of terms is handled in much the same way.
If FLAG_AUTO_SYNONYMS is passed to xapian.QueryParser.parse_query() then the
xapian.QueryParser will automatically expand any term which has synonyms,
unless the term is in a phrase or similar.
If FLAG_AUTO_MULTIWORD_SYNONYMS is passed to xapian.QueryParser.parse_query()
then the xapian.QueryParser will look at groups of terms separated only by
whitespace and try to expand them as term groups. This is done in a “greedy”
fashion, so the first term which can start a group is expanded first, and the
longest group starting with that term is expanded. After expansion, the
xapian.QueryParser will look for further possible expansions starting with the
term after the last term in the expanded group.
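The greedy strategy can be illustrated with a small pure-Python model. This is a sketch of the behaviour described above, not the QueryParser's actual code: `greedy_expand()`, its `max_group` limit and the sample dictionary are all hypothetical.

```python
def greedy_expand(terms, synonyms, max_group=3):
    """Return (group, synonyms) pairs covering `terms` from left to right."""
    result = []
    i = 0
    while i < len(terms):
        # Try the longest candidate group starting at position i first.
        for size in range(min(max_group, len(terms) - i), 0, -1):
            group = " ".join(terms[i:i + size])
            if group in synonyms:
                result.append((group, synonyms[group]))
                i += size  # resume after the last term of the expanded group
                break
        else:
            # No group starting here has synonyms: keep the term unexpanded.
            result.append((terms[i], []))
            i += 1
    return result


synonyms = {
    "stock": ["inventory"],
    "stock market": ["bourse"],
    "market crash": ["slump"],
}

print(greedy_expand(["stock", "market", "crash"], synonyms))
```

In this model "stock market" is chosen over "stock" because it is the longest group starting at the first term, and "market crash" is never considered because "market" has already been consumed by the earlier group.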
OP_SYNONYM
Todo
Query.OP_SYNONYM, and how that relates to synonym expansion.
Current Limitations
Explicit multi-word synonyms
There ought to be a way to explicitly request expansion of multi-term synonyms,
probably with the syntax ~"stock market". This hasn’t been implemented yet
though.
Backend Support
Currently synonyms are supported by the chert and glass databases. They work
with a single database or multiple databases (use
xapian.Database.add_database() as usual). We’ve no plans to support them for
the InMemory backend, but we do intend to support them for the remote backend
in the future.