Zthes: a Z39.50 Profile for Thesaurus Navigation

Mike Taylor

26th July 1999

Version 0.3b
(SCCSID "@(#)/export/home/staff/mike/src/zthes/SCCS/s.zthes.html 1.11")

WARNING: as of 28th November 2000, version 0.3b of this profile has been superseded by version 0.4. You should probably be reading that document instead of this one.



0. Changes since version 0.2

1. Introduction

1.1. Overview

This document describes an abstract model for representing and searching thesauri - semantic hierarchies of terms as described in ISO 2788 [2] - and specifies how this model may be implemented using the Z39.50 [1] protocol. It also suggests how the model may be implemented using other protocols and formats.

1.2. Status

This document will be reviewed and perhaps altered before being re-released as version 1.0. All feedback is very welcome, and should be emailed to the author ([email protected]).

1.3. Scope

This profile is laid out in two main sections. The first is concerned solely with the abstract representation of thesaurus terms and how they may be searched; and the second with the implementation of these abstract concepts in Z39.50: how thesaurus terms are encoded in the GRS-1 record structure, how searches are encoded in the type-1 query, etc.

It is intended that the abstract model described here is sufficiently general that it can also be implemented by protocols and data formats other than Z39.50. As an example, an appendix defines an XML DTD for thesaurus terms based on the model, and includes an example XML document using that DTD.

This profile does not mandate any relationship between a thesaurus and a database. The model is that terms from any thesaurus database may be used to search any other database (called a target database).

1.4. Acknowledgements

This document represents the consensual outcome of extensive discussions between the members of the informally convened Zthes working group:

2. Abstract Model

2.1. Overview

This profile represents a thesaurus as a database of inter-linked terms. If multiple thesauri are to be supported by a single server, then they must be presented as separate databases.

Each individual term in a thesaurus is represented by a record in the database. In the interests of simplicity and orthogonality, even non-preferred terms must be represented by their own records.

Term records consist of an initial part describing the term itself (with information such as its unique identifier, scope note, etc.), together with sub-records briefly describing related terms. The primary means of navigation from one term to another is by searching for the unique identifiers of the terms related to the first one.

2.2. Schema

In the element tables in this profile, the occurrence columns describe whether the elements are mandatory and/or repeatable as follows:

Value Meaning
1 mandatory, not repeatable
1+ mandatory, repeatable
[0,1] optional, not repeatable
0+ optional, repeatable

The top level term record is composed of the following elements:

Name Occurrence Description
termId 1 an opaque string of characters which uniquely identifies the term within the thesaurus
termName 1 the name of the term in a form which may be displayed to a user or used as a search term in a target database
termQualifier [0,1] an additional string which, if supplied, qualifies the term name such that the combination of term and qualifier is unique within the thesaurus
termType [0,1] an indication of the type of the term, chosen from the controlled vocabulary described below
termLanguage [0,1] the language of the term
termNote [0,1] a scope note for the term: that is, arbitrary prose clarifying the meaning and scope of the term
termCreatedDate [0,1] the date on which the record defining the term was created
termCreatedBy [0,1] the name of the person who created the record defining the term
termModifiedDate [0,1] the date on which the record defining the term was last modified
termModifiedBy [0,1] the name of the person who last modified the record defining the term
postings 0+ a sub-record, in the format described below, indicating the frequency with which the term occurs in a target database
relation 0+ a sub-record, in the format described below, briefly describing a term related to this one

It is recognised that in many thesauri there is no explicit unique identifier field, and the term itself, perhaps in combination with the qualifier, uniquely identifies a record. Thesauri such as these must nevertheless provide a termId field, which may be automatically generated simply by combining the term and qualifier.
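As a minimal sketch of how a server might generate identifiers for such a thesaurus, the following Python models a cut-down top-level term record and derives its termId from the term name and qualifier. The class and function names and the ``|'' separator are illustrative assumptions, not part of this profile; any scheme yielding a unique opaque string would serve equally well.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Term:
    """Cut-down top-level term record (administrative elements omitted)."""
    term_id: str                          # termId, occurrence 1
    term_name: str                        # termName, occurrence 1
    term_qualifier: Optional[str] = None  # termQualifier, occurrence [0,1]
    term_type: Optional[str] = None       # termType, occurrence [0,1]
    term_language: Optional[str] = None   # termLanguage, occurrence [0,1]
    term_note: Optional[str] = None       # termNote, occurrence [0,1]
    postings: List[dict] = field(default_factory=list)   # occurrence 0+
    relations: List[dict] = field(default_factory=list)  # occurrence 0+


def make_term_id(term_name, term_qualifier=None, separator="|"):
    """Hypothetical helper: combine term and qualifier into an opaque termId."""
    if term_qualifier is None:
        return term_name
    return f"{term_name}{separator}{term_qualifier}"


def make_term(term_name, term_qualifier=None):
    """Build a record whose termId is generated rather than stored."""
    return Term(
        term_id=make_term_id(term_name, term_qualifier),
        term_name=term_name,
        term_qualifier=term_qualifier,
    )
```

Clients should still treat the resulting identifier as opaque and should not attempt to split it back into term and qualifier.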

The termType element may take the following values:

``PT''
Preferred term (also known as a descriptor)
``ND''
Non-descriptor: that is, a non-preferred term.
``NL''
Node label: that is, a dummy term not assigned to documents when indexing, but inserted into the hierarchy to indicate the logical basis on which a category has been divided - for example, by function. Also known as a guide term or a facet indicator.

Servers may return other values of termType at their discretion. It is recommended that such extension values begin with the string ``X-''.

Each postings sub-record is composed of the following elements:

Name Occurrence Description
sourceDb 1 the host, port and name of a target database in which the term may be found
fieldName [0,1] if specified, the name of a field in the target database in which the term may be found; otherwise, the sub-record represents a postings count across the entire target database
hitCount 1 the number of occurrences of the term in the target database (in the nominated field only, if specified)

If a server wishes to communicate separate postings counts for a term in more than one field, then multiple postings sub-records with the same value of sourceDb should be used.

Each relation sub-record is composed of the following elements:

Name Occurrence Description
relationType 1 an indication of the type of the relation, chosen from the controlled vocabulary described below
sourceDb [0,1] if specified, the host, port and name of a different Zthes database in which the related term is found; otherwise, the related term is in the same database as the current one
termId 1 the unique identifier of the related term within its database
termName 1 the name of the related term
termQualifier [0,1] the qualifier of the related term
termType [0,1] the type of the related term
termLanguage [0,1] the language of the related term

The relationType element may take the following values:

``NT''
Narrower term: that is, the related term is more specific than the current one.
``BT''
Broader term: that is, the related term is more general than the current one.
``USE''
Use instead: that is, the related term should be used in preference to the current one.
``UF''
Use for: that is, the current term should be used in preference to the related one.
``RT''
Related term.
``LE''
Linguistic equivalent: the current term and the related term are preferred terms representing the same concept - or ``sufficiently close'' concepts - in different languages.

Servers may return other values of relationType at their discretion. It is recommended that such extension values begin with the string ``X-''.

With a single exception, this profile deliberately restricts its set of supported relations to those discussed in ISO 2788 [2], in the belief that it is better for a small set of relations to be used interoperably than for a larger set to be specified, with different servers and clients in practice using different subsets.

That sole exception is the addition to the standard relation types of ``LE'', introduced to model the multilingual links described in ISO 5964 [7].

The ``NT'' and ``BT'' relationships are reciprocal; so are ``USE'' and ``UF''; and ``RT'' and ``LE'' are symmetric. That is, when any term T1 points to another T2 using the relation ``NT'', T2 should point back to T1 using ``BT'' and vice versa; when T1 points to T2 using the relation ``USE'', T2 should point back to T1 using ``UF'' and vice versa; and when T1 points to T2 using the relation ``RT'' or ``LE'', T2 should point back to T1 using the same relation.
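The pairing rules above lend themselves to a mechanical consistency check. The following Python sketch reports every link whose expected back-link is missing. The representation of a thesaurus as a dictionary from termId to a set of (relationType, termId) pairs is an assumption made for illustration, and links into other databases (those carrying a sourceDb) are not considered.

```python
# Relation expected on the way back, per the pairing rules in section 2.2.
INVERSE = {
    "NT": "BT", "BT": "NT",
    "USE": "UF", "UF": "USE",
    "RT": "RT", "LE": "LE",
}


def missing_back_links(relations):
    """Return sorted (term_id, relation_type, related_id) triples that
    have no matching back-link from the related term.

    `relations` maps each termId to a set of (relationType, termId) pairs.
    """
    missing = []
    for term_id, links in relations.items():
        for relation_type, related_id in links:
            inverse = INVERSE.get(relation_type)
            if inverse is None:
                # Extension types such as "X-..." have no defined inverse.
                continue
            if (inverse, term_id) not in relations.get(related_id, set()):
                missing.append((term_id, relation_type, related_id))
    return sorted(missing)
```

For example, given a thesaurus in which ``b'' lists ``c'' as a related term but ``c'' lists nothing, the function returns the single triple (b, RT, c).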

The termType element in a relation sub-record may take the same values as in the top-level record.

2.3. Searching

The following searches must be supported:

Support for additional searches, including the following, may be useful.

3. Z39.50 Specification

3.1. Overview

This profile builds on the work of others by using the Z39.50 Attribute Architecture [3] together with the Utility Attribute Set [4] and the Cross-Domain Attribute Set [5] developed for it.

As such, it requires servers and clients to support version 3 of the Z39.50 protocol.

In designing the Zthes-1 attribute set of additional attributes required for thesaurus navigational searches, we have sought to comply with the guidelines expounded in the Attribute Set Developers Guide [6].

Unusually for a Z39.50 profile, the intention of this profile is that it be used in conjunction with other profiles. It is envisaged that an application will use the Zthes profile to navigate a thesaurus and thereby obtain terms suitable for searching in a target database; and use a second, domain-specific profile such as GILS or CIMI to search in and retrieve from that database.

The Z39.50 objects defined by this profile have the following OIDs.

Object OID
tagSet-Zthes No OID yet assigned by the Z39.50 Maintenance Agency; the private OID 1.2.840.10003.14.1000.136.1 may be used if necessary.
The Zthes Schema 1.2.840.10003.13.8
The Zthes-1 Attribute Set 1.2.840.10003.3.13

3.2. tagSet-Zthes

This profile defines a tag set called tagSet-Zthes, which describes the additional tags needed in the schema beyond those found in the standard tagSet-M and tagSet-G. It contains the following elements, corresponding to the same-named elements in the abstract model schema described above:

Tag Name ASN.1 Datatype
1 termQualifier InternationalString
2 termType InternationalString
3 relationType InternationalString
4 postings structured
5 fieldName InternationalString
6 hitCount INTEGER

3.3. Schema

In the Zthes schema, tag types indicate elements from the following tag sets:

Type Meaning
1 tagSet-M, defined in appendix TAG.2.1 of the Z39.50 standard [1]
2 tagSet-G, defined in appendix TAG.2.2 of the Z39.50 standard [1]
3 application-defined string tags
4 tagSet-Zthes, defined above

The abstract schema described in section 2.2 is represented in Z39.50 by a GRS-1 record encoded with the tag-paths specified in the following table. Where possible, standard tags from tagSet-M and tagSet-G are re-used; in these cases, the generic names of the tags are listed in the right-hand column.

Tag Path Occurrence Element Generic Name
(1,14) 1 termId localControlNumber
(2,1) 1 termName title
(4,1) [0,1] termQualifier  
(4,2) [0,1] termType  
(2,17) [0,1] termNote description
(2,20) [0,1] termLanguage language
(1,15) [0,1] termCreatedDate creation date
(1,27) [0,1] termCreatedBy record created by
(1,16) [0,1] termModifiedDate dateOfLastModification
(1,28) [0,1] termModifiedBy record modified by
(4,4) 0+ postings  
(4,4)(2,36) 1 sourceDb databaseName
(4,4)(4,5) [0,1] fieldName  
(4,4)(4,6) 1 hitCount  
(2,30) 0+ relation relation
(2,30)(4,3) 1 relationType  
(2,30)(2,36) [0,1] sourceDb databaseName
(2,30)(1,14) 1 termId localControlNumber
(2,30)(2,1) 1 termName title
(2,30)(4,1) [0,1] termQualifier  
(2,30)(4,2) [0,1] termType  
(2,30)(2,20) [0,1] termLanguage language

(The numeric value of the tagSet-G element databaseName is assumed since it is the first of two new tagSet-G elements up for consideration at the August 1999 ZIG, and the approved elements currently go up to 35.)

(The numeric values of the tagSet-M elements record created by and record modified by are assumed since they are the first and second new tagSet-M elements up for consideration at the August 1999 ZIG, and the approved elements currently go up to 26.)
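To illustrate how a client might apply the tag-path table, the following Python sketch maps a decoded GRS-1 record onto the element names of the abstract model. The input shape, a list of ((tagType, tagValue), content) pairs with nested lists for sub-records, is an assumption standing in for whatever structure a real Z39.50 toolkit returns; the tag numbers are those listed in the table, including the provisional values noted above.

```python
TOP_LEVEL = {
    (1, 14): "termId", (2, 1): "termName",
    (4, 1): "termQualifier", (4, 2): "termType",
    (2, 17): "termNote", (2, 20): "termLanguage",
    (1, 15): "termCreatedDate", (1, 27): "termCreatedBy",
    (1, 16): "termModifiedDate", (1, 28): "termModifiedBy",
    (4, 4): "postings", (2, 30): "relation",
}
POSTINGS = {(2, 36): "sourceDb", (4, 5): "fieldName", (4, 6): "hitCount"}
RELATION = {
    (4, 3): "relationType", (2, 36): "sourceDb",
    (1, 14): "termId", (2, 1): "termName",
    (4, 1): "termQualifier", (4, 2): "termType",
    (2, 20): "termLanguage",
}
# Structured elements and the table used to decode their children.
SUB_TABLES = {"postings": POSTINGS, "relation": RELATION}


def decode(elements, table=TOP_LEVEL):
    """Map (tag, content) pairs to {element name: [values, ...]}."""
    record = {}
    for tag, content in elements:
        name = table.get(tag)
        if name is None:
            continue  # unknown additional elements are simply ignored here
        if name in SUB_TABLES:
            content = decode(content, SUB_TABLES[name])
        record.setdefault(name, []).append(content)
    return record
```

Every value is held in a list so that repeatable elements such as relation need no special handling.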

The termLanguage element is expressed as one of the standard codes described in RFC 1766 [8] and ISO 639 [9] - for example, ``en'' for English, ``fr'' for French and ``de'' for German.

The administrative date fields should be returned in the ASN.1 GeneralizedTime format. (The working group considered the Z39.50 ASN.1 date/time definition [11], but reached the conclusion that the benefits would be outweighed by the barrier raised to implementation.)
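As a small sketch of producing such a value, the following Python renders a timezone-aware timestamp in the UTC form of GeneralizedTime with whole seconds (YYYYMMDDHHMMSSZ). The choice of that particular form, and the function name, are assumptions; this profile does not say which of the permitted GeneralizedTime forms servers should prefer.

```python
from datetime import datetime, timezone


def to_generalized_time(moment):
    """Render an aware datetime as a UTC GeneralizedTime string."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%SZ")
```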

The person-name elements, termCreatedBy and termModifiedBy, may be returned in whatever format is convenient for the server: this profile does not attempt to address the interpretation of such administrative information across multiple databases.

The sourceDb element should be returned in the form of a z39.50s URL as described in RFC 2056 [10]. For example, if the related term is in the database called ``aat'' on the server running on port 3950 on the host foo.bar.org, then the sourceDb element should have the value z39.50s://foo.bar.org:3950/aat.
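The following Python sketch builds and parses sourceDb values of the simple host, port and database form used in the example above. It is deliberately narrow: RFC 2056 allows further URL components that are not handled here, and the function names and the fallback to port 210 when no port is given are assumptions made for illustration.

```python
import re

# Matches only the simple form z39.50s://host[:port]/database.
_SOURCE_DB = re.compile(r"^z39\.50s://([^/:]+)(?::(\d+))?/(.+)$")


def format_source_db(host, port, database):
    """Build a sourceDb value from its three parts."""
    return f"z39.50s://{host}:{port}/{database}"


def parse_source_db(url, default_port=210):
    """Split a sourceDb value into (host, port, database)."""
    match = _SOURCE_DB.match(url)
    if match is None:
        raise ValueError(f"not a simple z39.50s URL: {url!r}")
    host, port, database = match.groups()
    return host, int(port) if port else default_port, database
```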

Servers may, at their discretion, include additional tagSet-M, tagSet-G and string-tagged (type 3) elements in the records they return; they may include such additional elements at the top level, within relation sub-records, or both. Clients may display any such additional elements as they see fit, or may ignore them.

3.4. Element Sets

3.4.1. Element Set ``f''

Use of the element set ``f'' requests a full record, and so servers should respond by returning a record containing as many as possible of the elements listed above in the table in section 3.3.

3.4.2. Element Set ``b''

Use of the element set ``b'' requests a brief record. Servers should respond by returning a record omitting the administrative fields (termCreatedDate, termCreatedBy, termModifiedDate and termModifiedBy), and all the relation sub-records.

This element set may be useful when constructing a summary of several records found by a search for initial entry points to a thesaurus; it is unlikely to be useful when navigating from term to term.
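The relationship between the two element sets can be sketched as a simple projection. In the following Python, a record is assumed to be a dictionary keyed by the element names of the abstract model; that representation and the function name are illustrative only.

```python
ADMIN_FIELDS = (
    "termCreatedDate", "termCreatedBy",
    "termModifiedDate", "termModifiedBy",
)


def brief_record(full_record):
    """Derive the ``b'' view: drop administrative fields and relations."""
    omitted = set(ADMIN_FIELDS) | {"relation"}
    return {
        name: value
        for name, value in full_record.items()
        if name not in omitted
    }
```

Postings sub-records are retained, since section 3.4.2 names only the administrative fields and the relation sub-records as omitted.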

3.5. The Zthes-1 Attribute Set

This profile defines an attribute set called Zthes-1, which describes the additional access points needed for searching beyond those found in the standard utility and cross-domain attribute sets. It contains the following attributes, all of type Access Point:

Type Value Name Description
1 1 termQualifier searches in the termQualifier element of the top-level term record
1 2 termType searches in the termType element of the top-level term record
1 3 thesAdmin used for a variety of searches related to administrative details of thesaurus structure - see below
1 4 relatedTermID used in conjunction with a semantic qualifier (attribute type 2) with value equal to one of the relationTypes described in section 2.2; searches for all records in the specified relation to the record whose termId is equal to the search term.

For example, a search for abc123 with access point relatedTermID and semantic qualifier ``NT'' finds all the narrower terms of the record whose termId is abc123.
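The shape of such a search can be sketched as plain data. The following Python builds a toolkit-neutral description of a single type-1 query operand; the dictionary layout and function name are assumptions and do not correspond to any particular Z39.50 library. Placing the semantic qualifier in the Zthes-1 set is likewise an assumption, since this profile does not state which set supplies its values.

```python
ZTHES_1 = "1.2.840.10003.3.13"  # Zthes-1 attribute set OID from section 3.1


def related_term_query(term_id, relation_type):
    """Describe a relatedTermID search as attributes plus a search term."""
    return {
        "attributes": [
            # Access Point (type 1), value 4 = relatedTermID
            {"set": ZTHES_1, "type": 1, "value": 4},
            # Semantic Qualifier (type 2), value = a relationType string
            {"set": ZTHES_1, "type": 2, "value": relation_type},
        ],
        "term": term_id,
    }
```

Calling related_term_query("abc123", "NT") describes the narrower-term search given in the example above.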

The thesAdmin access point must be used with one of a small set of well-known strings as the search term. Servers may support the following values:
``start'' Searches for all records considered suitable as starting points for browsing.
``whole'' Searches for a special record describing the thesaurus as a whole, and containing material such as introductory text and revision history that might be front matter in a printed thesaurus. This record, when it exists at all, may not be found in any other search.

This profile does not currently specify the format of this special record: candidate representations would be a pre-formatted SUTRS record or a GRS-1 record using a specialised schema.

The Zthes-1 attribute set conforms to attribute set class 1 as described in the Z39.50 Attribute Architecture. However, it prescribes no rules to resolve conflict between its own semantics and those of another attribute set in the case where attributes for both are used in a single search term of a type-1 query and the top-level attribute-set of that query is Zthes-1.

3.6. Searching

Servers must support type-1 queries which use the following access points in a manner conformant to the definitions of the attribute sets which define them. Where possible, standard attributes from utility and cross-domain sets are re-used; in these cases, the generic names of the attributes are listed in the right-hand column.

Attribute Set Type Value Search For Generic Name
utility 1 4 termId local control number
cross-domain 1 1 termName title
zthes-1 1 1 termQualifier  
utility 1 10 all elements all access points

(In this table and the next, the numeric values of the attributes taken from the utility set are assumed from the position of those attributes in the lists in draft 3 of the utility set document. The numbers may change after the August 1999 ZIG.)

For the purpose of searches on the local control number access point, termId values function as opaque ``magic cookies''. Therefore, such search terms should not include any contentAuthority attribute, even if it happens that for the specific thesaurus in question, the termId values are taken from a well-known source.

The following additional access points may optionally be supported:

Attribute Set Type Value Search For Generic Name
zthes-1 1 3 thesAdmin  
zthes-1 1 4 relatedTermID  
(with a semantic qualifier from the relationType controlled vocabulary)
utility 1 3 termLanguage language
cross-domain 1 4 termNote description
zthes-1 1 2 termType  
utility 1 1 termCreatedDate record date
(with functional qualifier ``date/time created'')
utility 1 2 termCreatedBy record creator
(with functional qualifier ``creator'')
utility 1 1 termModifiedDate record date
(with functional qualifier ``date/time last modified'')
utility 1 2 termModifiedBy record creator
(with functional qualifier ``last modifier'')
utility 1 2 either termCreatedBy or termModifiedBy record creator
(with no functional qualifier)

(The semantic qualifiers to be used with the utility access point 2 (record creator) are not defined anywhere. The utility attribute set says:

May be qualified in the same manner as the Access Point value 'Name' in the Cross Domain set.

but the cross-domain attribute set does not specify any semantic qualifiers, so we have made up some sensible values for the Zthes profile, in the hope that they will be adopted by the utility set.)

3.7. Explain

3.7.1. Overview

(The specification for the use of Explain with Zthes databases was contributed by Denis Lynch.)

A client can gain two kinds of information about Zthes databases from Explain: the fact that a particular database is a Zthes database, and the fact that a Zthes database is relevant to a particular TermList.

These two uses are specified in the next two sections.

3.7.2. Identifying a Zthes Database

Among the many features that a client could use to deduce that a particular database follows the Zthes profile, the profile distinguishes one required indicator. In the DatabaseInfo record for the database, the AccessInfo element must contain a schemas OID specifying the Zthes schema.

Once a client has observed the Zthes schema in the schemas for a database, it may presume that the server observes the behaviour described in this profile. (The client may still need other information from Explain, for example what record syntaxes are supported.)

3.7.3. Locating a Relevant Zthes Database

Any database may use a Zthes thesaurus or other type of authority file for the basis of the vocabulary used for an access point. This is described in Explain as follows: in the access point's TermListDetails record, the commonInfo element must contain an OtherInformation item encoded as an AuthorityFileInfo External. The AuthorityFileInfo External is defined as follows:
AuthorityFileInfo ::= SEQUENCE {
    name        [1] IMPLICIT HumanString,  -- for display
    database    [2] IMPLICIT InternationalString,
                     -- z39.50s URL to the authority database.
                     -- Simplifies to a database name if on the same server.
    exclusive   [3] IMPLICIT NULL OPTIONAL
                     -- If present, all terms in the term
                     -- list come from this authority file.
                     -- If absent, other terms may or may not
                     -- be present in the term list.
}

Note: it may be desirable to include an additional item to indicate the kind of authority file being referenced. If it is desirable, an appropriate identification scheme will be required.

4. Future Directions

This document has already discussed several possible directions for subsequent versions of this profile, or perhaps future companion profiles. Areas for consideration include, but may not be limited to, the following:

Appendix A. References

[1] National Information Standards Organization. ANSI/NISO Z39.50-1995. Information Retrieval (Z39.50): Application Service Definition and Protocol Specification. Bethesda, MD: NISO Press, 1995. Also available at http://www.loc.gov/z3950/agency/document.html
[2] International Organization for Standardization. ISO 2788: Guidelines for the establishment and development of monolingual thesauri, 2nd ed. Geneva: ISO, 1986. For some inexplicable and inexcusable reason, ISO standards are not generally available on-line.
[3] Z39.50 Maintenance Agency. Z39.50 Attribute Architecture, Draft of November 1998. Available at http://www.loc.gov/z3950/agency/attrarch/arch.html
[4] Z39.50 Maintenance Agency. Z39.50 Utility Attribute Set, Draft 3 of July 1999. Available at http://www.loc.gov/z3950/agency/attrarch/util.html
[5] Ralph LeVan. A Cross-Domain Attribute Set, version 1.2 of 1998/11/16. Available at http://www.oclc.org/~levan/docs/crossdomainattributeset.html
[6] George Percivall. Attribute Set Developers Guide, annotated outline of 18th September 1998. Available at http://harp.gsfc.nasa.gov/~eric/attr_set_developers_guide.html
[7] International Organization for Standardization. ISO 5964: Guidelines for the establishment and development of multilingual thesauri. Geneva: ISO, 1985.
[8] H. Alvestrand. RFC 1766: Tags for the Identification of Languages. March 1995. Available at ftp://ftp.uu.net/inet/rfc/rfc1766.Z
[9] International Organization for Standardization. Prepared by ISO/TC 37, Terminology (principles and coordination). ISO 639:1988 (E/F): Code for the representation of names of languages, 1st edition, 1988.
[10] R. Denenberg, J. Kunze, D. Lynch. RFC 2056: Uniform Resource Locators for Z39.50. November 1996. Available at ftp://ftp.uu.net/inet/rfc/rfc2056.Z
[11] Z39.50 Maintenance Agency. Z39.50 Date/Time Definition, April 6, 1998 (amended February 17, 1999.) Available at http://www.loc.gov/z3950/agency/defns/date.html

Appendix B. The Zthes Abstract Model in XML

Appendix B.1. The Zthes DTD for XML

This DTD was supplied by Thomas Place. It is put forward not as a ``good'' XML representation of thesaurus information (whatever that might be construed to mean) but as a pragmatically valuable alternative encoding of the Zthes abstract record. Real Zthes datasets have been exchanged in the form of XML documents conforming to this DTD.

<!-- Zthes DTD
     Based on Z39.50 Profile for Thesaurus Navigation, version 0.1
     (20 Feb 1999)
     Version of DTD: 25 Feb 1999 -->

<!-- #PCDATA: parseable character data = text

     occurrence indicators (default: required, not repeatable):
     ?: zero or one occurrence (optional)
     *: zero or more occurrences (optional, repeatable)
     +: one or more occurrences (required, repeatable)

     |: choice, one or the other, but not both
 -->

<!ENTITY % term            "termId, termName, termQualifier?,
                            termType?, termLanguage?">

<!ENTITY % admin           "termCreatedDate?, termCreatedBy?,
                            termModifiedDate?, termModifiedBy?">

<!ELEMENT Zthes            (%term;, termNote?,
                            %admin;,
                            relation*)>

<!ELEMENT relation         (relationType, sourceDb?, %term;)>

<!ELEMENT termId           (#PCDATA)>
<!ELEMENT termName         (#PCDATA)>
<!ELEMENT termQualifier    (#PCDATA)>
<!ELEMENT termType         (#PCDATA)>
<!ELEMENT termLanguage     (#PCDATA)>
<!ELEMENT termNote         (#PCDATA)>
<!ELEMENT termCreatedDate  (#PCDATA)>
<!ELEMENT termCreatedBy    (#PCDATA)>
<!ELEMENT termModifiedDate (#PCDATA)>
<!ELEMENT termModifiedBy   (#PCDATA)>
<!ELEMENT relationType     (#PCDATA)>
<!ELEMENT sourceDb         (#PCDATA)>

(This appendix should include a crosswalk with any pre-existing thesaurus DTDs if appropriate.)

Appendix B.2. Sample Zthes-in-XML Document

This document was supplied by Thomas Place.

<?xml version="1.0"?>
<!DOCTYPE Zthes SYSTEM "zthes.dtd">
<Zthes>
  <termId>102067</termId>
  <termName>video art</termName>
  <termType>PT</termType>
  <termNote>
    Use for works of art that employ video technology, especially 
videotapes. For the study and practice of the art of producing such 
works, use "video."
  </termNote>
  <relation>
    <relationType>UF</relationType>
    <termId>102067/001</termId>
    <termName>art, video</termName>
    <termType>ND</termType>
  </relation>
  <relation>
    <relationType>BT</relationType>
    <termId>185191</termId>
    <termName>[time-based works]</termName>
    <termType>NL</termType>
  </relation>
  <relation>
    <relationType>RT</relationType>
    <termId>54153</termId>
    <termName>video</termName>
    <termType>PT</termType>
  </relation>
  <relation>
    <relationType>RT</relationType>
    <termId>253827</termId>
    <termName>video artists</termName>
    <termType>PT</termType>
  </relation>
</Zthes>
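As a sketch of how a client might consume a document of this kind, the following Python uses the standard library's xml.etree.ElementTree to list the terms standing in a given relation to the record. The embedded document is an abridged copy of the sample (two relations only, with the XML declaration and DOCTYPE left out), and the function name is an assumption; no validation against the DTD is attempted.

```python
import xml.etree.ElementTree as ET

# Abridged from the sample document in this appendix.
SAMPLE = """<Zthes>
  <termId>102067</termId>
  <termName>video art</termName>
  <termType>PT</termType>
  <relation>
    <relationType>UF</relationType>
    <termId>102067/001</termId>
    <termName>art, video</termName>
    <termType>ND</termType>
  </relation>
  <relation>
    <relationType>RT</relationType>
    <termId>54153</termId>
    <termName>video</termName>
    <termType>PT</termType>
  </relation>
</Zthes>"""


def related_terms(xml_text, relation_type):
    """Return (termId, termName) pairs for relations of the given type."""
    root = ET.fromstring(xml_text)
    return [
        (relation.findtext("termId"), relation.findtext("termName"))
        for relation in root.findall("relation")
        if relation.findtext("relationType") == relation_type
    ]
```

Each termId obtained this way can then serve as the search term for the next step of navigation.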

Appendix C. Implementations

At the time of writing, only one complete implementation each of Zthes client and server is known to exist (but please inform the author of any others!):

There is also an incomplete implementation of a server built on System Simulation's Index+ database management system and the Yaz toolkit, capable of serving terms from the AAT, TGN and MeSH, subject to licensing conditions.

Several other people and organisations have also expressed an interest in implementing this profile.

Appendix D. Applications

At the time of writing, the Zthes profile is known to be in use in the following projects:

(The ELVIL, Decomate II and Elise II projects are all funded by the European Union, perhaps reflecting a European interest in the multilingual applications of thesauri.)