Guidelines for SKOSification
By SKOSification, we mean the process of converting or transforming a terminology into SKOS. We list below some guidelines for carrying out this conversion from a technical and an organisational point of view. From the technical point of view, many of the guidelines provided here are inherent to the SKOS model, but special attention must be paid to these points in order to ensure general consistency within the ATHENA Thesaurus.
Evaluate the main features of the terminology to be migrated
Before starting any procedure for converting a terminology into SKOS, the institution must have defined the purpose of its terminology (e.g. indexing and retrieval, only indexing, or only retrieval). As a second step, and as a consequence of this definition of purpose, the institution must evaluate whether SKOS is the appropriate format considering the content of its terminology. In the case of authority files, for instance, SKOS may not be the most appropriate format. Here are some features that can help with this evaluation:
- Concepts: Is the terminology dealing with objects and abstract things that could be assimilated to concepts? Is the terminology dealing with persons?
=> if the terminology is dealing with persons and not objects or abstract things, a standard like FOAF (Friend Of A Friend, http://www.foaf-project.org) would be more appropriate.
- Semantic relations: Can the descriptors (and hence the concepts) of the terminology be linked together via semantic relations?
=> if the terminology only contains independent descriptors without any semantic relations, a SKOS model is not strictly necessary; a plain RDF representation may be more convenient.
- Interoperability: Can the terminology be linked to another resource dealing with the same subject, domain or scope?
=> if the terminology can be linked to other resources, all the potential links should be considered before the transformation process in order to implement these links in the most efficient way.
Identify your concepts
The W3C defines two main steps for the identification of concepts:
- Creating (or reusing) a Uniform Resource Identifier (URI) to uniquely identify the concept
- Asserting in RDF, using the rdf:type property, that the resource identified by this URI is of type skos:Concept
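The two steps above can be sketched in RDF/XML as follows. This is only a minimal illustration: the URI http://example.org/thesaurus/c_0001 is a hypothetical placeholder and not an identifier from the ATHENA Thesaurus.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">

  <!-- Step 1: rdf:about carries the URI that identifies the concept
       (hypothetical placeholder URI). -->
  <rdf:Description rdf:about="http://example.org/thesaurus/c_0001">
    <!-- Step 2: rdf:type asserts that this resource is a skos:Concept. -->
    <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
  </rdf:Description>

</rdf:RDF>
```

The same statement is often written with the RDF/XML typed-node shorthand, i.e. a skos:Concept element carrying the rdf:about attribute directly.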
Use of a Persistent Identifying System for the definition of the URIs
As described above, we recommend the use of standards for the identification of concepts. Since concepts are identified by HTTP URIs, these URIs should be registered with a standardised persistent identification system such as PURL. This is of great benefit because persistent identifiers are location-independent: if the terminology is moved from one location (hosting server) to another, the URIs identifying its concepts will not have to be modified.
Use of non-explicit URIs
It is highly recommended to use non-explicit URIs, that is, URIs that do not embed a label of the concept, in order to avoid reusing the same URI to identify two different concepts. Since natural languages are by nature ambiguous and polysemous, two different concepts may well have similar or identical labels. Explicit URIs also presuppose that one specific natural language was chosen when the terminology was defined or migrated, which is not convenient in a multilingual context.
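As an illustration of why non-explicit URIs help, the sketch below describes two concepts whose labels are homographs across languages ("chat" means cat in French and conversation in English). An explicit URI built from the label, such as one ending in #chat, could not distinguish them. The base URI and the codes c_0101 and c_0102 are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">

  <!-- The animal: the French label is "chat". -->
  <skos:Concept rdf:about="http://example.org/thesaurus/c_0101">
    <skos:prefLabel xml:lang="fr">chat</skos:prefLabel>
    <skos:prefLabel xml:lang="en">cat</skos:prefLabel>
  </skos:Concept>

  <!-- The conversation: the English label is "chat". -->
  <skos:Concept rdf:about="http://example.org/thesaurus/c_0102">
    <skos:prefLabel xml:lang="en">chat</skos:prefLabel>
    <skos:prefLabel xml:lang="fr">conversation</skos:prefLabel>
  </skos:Concept>

</rdf:RDF>
```

Because the identifiers carry no linguistic content, neither language is privileged and labels can later be corrected without changing any URI.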
Define with precision the labels expressing concepts
Preferred labels must be unique within a concept scheme
As recommended by the W3C SKOS documentation (although the SKOS data model does not formally enforce it), no two concepts from the same concept scheme should have the same preferred label in a given language. However, as natural languages are highly polysemous and full of homographs, the SKOS data model does not forbid one concept from having the same string as its preferred label in two different languages.
Each concept must be expressed with one preferred label per language (mandatory)
As we saw above, the SKOS data model does not forbid the absence of a preferred label, but labels are meant to help users understand and refine the meaning of a concept. This is especially true in a multilingual context, and it is helpful for administration and maintenance purposes. We therefore recommend using one preferred label per language. It is important to note that this also means that it is not possible to have several preferred labels in the same language.
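A minimal sketch of this labelling rule is given below, assuming a hypothetical concept URI. The concept has exactly one preferred label per language; the English and French preferred labels happen to be the same string, which is permitted because they carry different language tags.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">

  <!-- Hypothetical concept URI; one skos:prefLabel per language tag. -->
  <skos:Concept rdf:about="http://example.org/thesaurus/c_0001">
    <skos:prefLabel xml:lang="en">architecture</skos:prefLabel>
    <skos:prefLabel xml:lang="fr">architecture</skos:prefLabel>
    <skos:prefLabel xml:lang="de">Architektur</skos:prefLabel>
  </skos:Concept>

</rdf:RDF>
```

Adding a second skos:prefLabel with xml:lang="en" to this concept would break the rule stated above.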
Avoid the concatenation of several words for a same label
In order to get the most accurate description, we recommend avoiding preferred terms that combine several values. For example, double concepts such as “dwelling/houses” must be considered as two different terms whose link has to be defined explicitly in order to provide the best description. The use of scope notes can help to reinforce the closeness of the two. If we can state that “dwelling” and “houses” are synonyms, the double concept can be modelled as a single concept with “dwelling” as its preferred label and “houses” as an alternative label.
Another possibility in the case of double concepts is to model the two concepts as related concepts.
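The two modelling options can be sketched as follows. The label strings come from the example above; the concept URIs are hypothetical. The two fragments are alternatives, so a real terminology would contain only one of them.

Option 1, synonyms modelled as one concept with a preferred and an alternative label:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">

  <skos:Concept rdf:about="http://example.org/thesaurus/c_0201">
    <skos:prefLabel xml:lang="en">dwelling</skos:prefLabel>
    <skos:altLabel xml:lang="en">houses</skos:altLabel>
  </skos:Concept>

</rdf:RDF>
```

Option 2, two distinct concepts linked by an associative relation:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">

  <skos:Concept rdf:about="http://example.org/thesaurus/c_0202">
    <skos:prefLabel xml:lang="en">dwelling</skos:prefLabel>
    <!-- The link is stated once; skos:related is symmetric. -->
    <skos:related rdf:resource="http://example.org/thesaurus/c_0203"/>
  </skos:Concept>

  <skos:Concept rdf:about="http://example.org/thesaurus/c_0203">
    <skos:prefLabel xml:lang="en">houses</skos:prefLabel>
  </skos:Concept>

</rdf:RDF>
```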
Privilege the use of the lemma for the preferred label and possibly the other labels
The preferred label should consist of a single-word term or a compound term in natural language. This means that no artificial word or code may be used to label a concept. Such codes must be recorded using the skos:notation property. The lemma of a word is its canonical form. We strongly recommend using this form of a term as the preferred label. For instance, in English or in French, the usual form of the lemma of a noun is the singular for number and the masculine for gender.
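The separation between natural-language labels and codes might look like the sketch below. The concept URI, the code A.12.3 and the datatype URI that names the notation system are all hypothetical and only serve the illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">

  <skos:Concept rdf:about="http://example.org/thesaurus/c_0301">
    <!-- Natural-language lemma (singular noun) as preferred label. -->
    <skos:prefLabel xml:lang="en">painting</skos:prefLabel>
    <!-- Classification code kept out of the labels; the datatype URI
         identifies the (hypothetical) notation system. -->
    <skos:notation rdf:datatype="http://example.org/thesaurus/notationSystem">A.12.3</skos:notation>
  </skos:Concept>

</rdf:RDF>
```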
Privilege the typography in use by convention in the languages involved
The labels should respect the typographical rules that are usually in use in the languages of the labels. For instance, in English all words referring to a language or nationality start with an upper-case character, whereas in French these words are written in lower-case characters. We therefore recommend respecting the conventions in use for each language involved. Any exception to this guideline must be documented via the documentation properties of the model.
For verbal forms, the infinitive is preferred. The forms of terms should thus be based on the conventions of the languages involved. If the concept is only expressed with labels in specific forms that do not correspond to the lemma, this must be documented via the documentation properties (skos:note, skos:changeNote, skos:editorialNote or skos:historyNote). In the case of compound terms, the addition of adjectives or verbs to a noun phrase should be limited where possible. In the same spirit, the use of articles and prepositions should be avoided in order not to lengthen the label. From the point of view of computing systems, these guidelines can improve the efficiency of a retrieval system.
Avoid the duplication of information
The SKOS data model consists of classes and properties, as we saw above. Meanings are to be deduced through an efficient use of these properties. As some of the properties available in the SKOS model come in pairs (inverse or symmetric), the use of one property implies the opposite or reverse statement. It is therefore better to avoid duplication and not to repeat the same information in different ways. SKOS terminologies are processed by machines, so the less redundant information there is, the faster the results of a query can be retrieved. The main properties to pay attention to in order to avoid duplication of information are:
- skos:broader and skos:narrower: the use of one implies the inverse statement. Asserting that A has a broader concept B implies that B has a narrower concept A. This is also true for the skos:broaderTransitive and skos:narrowerTransitive properties.
- skos:related: this property is symmetric, so if the assertion that A is related to B is made, there is no need to also assert that B is related to A.
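The sketch below states each relation only once, on one side, and leaves the reverse statements to inference. The concept URIs and labels are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">

  <skos:Concept rdf:about="http://example.org/thesaurus/c_0401">
    <skos:prefLabel xml:lang="en">chair</skos:prefLabel>
    <!-- Stated once: chair has the broader concept furniture. -->
    <skos:broader rdf:resource="http://example.org/thesaurus/c_0400"/>
    <!-- Stated once: chair is related to upholstery. -->
    <skos:related rdf:resource="http://example.org/thesaurus/c_0402"/>
  </skos:Concept>

  <!-- No skos:narrower is written here: "furniture has the narrower
       concept chair" follows from the inverse property. -->
  <skos:Concept rdf:about="http://example.org/thesaurus/c_0400">
    <skos:prefLabel xml:lang="en">furniture</skos:prefLabel>
  </skos:Concept>

  <!-- No skos:related is written here: the symmetric statement
       is implied. -->
  <skos:Concept rdf:about="http://example.org/thesaurus/c_0402">
    <skos:prefLabel xml:lang="en">upholstery</skos:prefLabel>
  </skos:Concept>

</rdf:RDF>
```

Note that this relies on the consuming application applying the SKOS inference rules; a tool that reads only the asserted triples will not see the reverse statements.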
Provide precision to the semantic relations of your concepts
Non-immediate hierarchical relations
In some cases, semantic relations between concepts have to be described precisely in order to avoid a loss of meaning or information, and to avoid producing statements that do not make sense. For example, the skos:broaderTransitive/skos:narrowerTransitive pair of properties makes it possible to describe relations between concepts that are separated by more than one level of hierarchy. The use of these transitive properties is therefore preferred for asserting a non-immediate hierarchical relationship between two concepts. It is also possible to use an extension of the SKOS data model in order to remove the symmetry of a property if this symmetry creates confusion in the meaning of the concepts.
Consistency of the semantic relations
In order to ensure consistency, mixing hierarchical relationships with associative ones should be avoided. For example, a concept A cannot be related (skos:related) to a concept B if A is also a narrower concept of B, whether directly or through intermediate levels. Special attention must therefore be paid when designing the semantic relations between concepts.
Enable the multilingualism
Provide for each concept an equivalent label in the languages involved in your terminology
Special attention must be paid to the multilingual labels expressing the concepts. These labels must be defined correctly in each language of the terminology so that the equivalences can be computed from the SKOS representation of the concepts.
Use the same system of language tags for defining the language of label
Several standardised systems of language tags exist. For example, the three tags “en”, “en-GB” and “en-Latn” all refer to English, with different levels of precision: “en-GB” specifies English as used in Great Britain and “en-Latn” specifies English written in the Latin alphabet. For terminologies involving languages written in different alphabets, the “language-alphabet” system (for example “en-Latn”) may be useful for providing more precision. We recommend using the same system of tags for every language attribute of the terminology. Where no specific language tag system is required, we recommend the language codes defined in ISO 639-1, where the language tags are coded on two letters in lower case.
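As an illustration, the fragment below applies the “language-alphabet” system consistently to every label of a concept. The concept URI is hypothetical; the point is only that all xml:lang values follow the same pattern.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">

  <!-- Every label uses a language subtag followed by a script subtag. -->
  <skos:Concept rdf:about="http://example.org/thesaurus/c_0001">
    <skos:prefLabel xml:lang="en-Latn">architecture</skos:prefLabel>
    <skos:prefLabel xml:lang="el-Grek">αρχιτεκτονική</skos:prefLabel>
    <skos:prefLabel xml:lang="ru-Cyrl">архитектура</skos:prefLabel>
  </skos:Concept>

</rdf:RDF>
```

If the two-letter ISO 639-1 system is chosen instead, the same labels would carry the tags en, el and ru throughout. Mixing the two systems within one terminology is what the guideline advises against.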
Ensure the documentation of concepts and the terminology
Provide documentation for each change that may occur to a concept and its labels
The SKOS data model provides a number of documentation properties for refining the meaning of a concept or keeping track of changes to the label(s) of a concept and/or its meaning. For the purposes of administration and maintenance of the terminology, each change must be reported in the SKOSified terminology using change notes (skos:changeNote) or editorial notes (skos:editorialNote).
Provide as much as possible documentation to concepts with scope notes
As mentioned above, documentation on concepts helps to refine the meaning of a concept. Scope notes (skos:scopeNote) can be very helpful in enabling a better understanding of concepts through contextual information. Examples may also be provided via the skos:example property. Documentation of concepts is especially needed when the labels expressing the concept are homographs or homonyms, whether in the same language or across languages. In such cases, scope notes and examples provide the user with a semantic disambiguation.
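The documentation properties mentioned in this section might be combined as in the sketch below. The concept URI, the wording of the notes and the date in the change note are hypothetical and purely illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">

  <skos:Concept rdf:about="http://example.org/thesaurus/c_0501">
    <skos:prefLabel xml:lang="en">crane</skos:prefLabel>
    <!-- Scope note: disambiguates the homonym (machine, not bird). -->
    <skos:scopeNote xml:lang="en">Lifting machine used on construction sites and in harbours. Not to be used for the bird.</skos:scopeNote>
    <!-- Example of use of the concept. -->
    <skos:example xml:lang="en">Photograph of a harbour crane unloading a cargo ship.</skos:example>
    <!-- Change note: records a modification for maintenance purposes
         (hypothetical date and change). -->
    <skos:changeNote xml:lang="en">2011-03-01: preferred label changed from "cranes" to "crane".</skos:changeNote>
  </skos:Concept>

</rdf:RDF>
```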
Guidelines for mapping
Mapping is an inherent part of the SKOSification of a terminology. The following guidelines emphasize some aspects of the mapping process that may be crucial for general consistency of the terminology and the meanings of concepts.
Pay attention to the identification of your concepts during the mapping process
Use only absolute URIs
This guideline follows on from the one on the identification of concepts in the SKOSification part above. The SKOSification process makes the terminology available in a machine-readable format. In order to make the identification of concepts and the linking between concepts easy to compute, it is recommended to use absolute URIs rather than relative ones.
For example:
<rdf:Description rdf:about="http://www.athenaeurope.org/athenawiki/AthenaThesaurus/RMCA_Keywords#architecture"> is an absolute HTTP URI
<rdf:Description rdf:about="RMCA_Keywords#architecture"> is a relative URI.
Respect the URIs of the original sources
As URIs are defined in order to identify concepts uniquely, the URIs defined within each concept scheme must be kept unchanged when mapping from one concept scheme to another, in order to enable interoperability between the different resources involved.
Avoid the duplication of information
We saw that the structural properties for defining the semantic relations between concepts are either inverse or symmetric. This is also true for the mapping properties.
The mapping properties skos:broadMatch and skos:narrowMatch are each other’s inverse, so there is no need to state the same mapping link twice using both properties for the same pair of concepts.
The mapping properties skos:exactMatch and skos:closeMatch are symmetric, so repeating the mapping link in the other direction can be avoided. The property skos:exactMatch is also transitive, so there is no need to repeat the mapping link across several levels.
For instance:
A skos:exactMatch B
B skos:exactMatch C
The assertion A skos:exactMatch C can be inferred from the two preceding statements.
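In RDF/XML, the chain above could be written as in the following sketch, where A, B and C are concepts from three hypothetical concept schemes (schemeA, schemeB, schemeC). Only two mapping links are asserted.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">

  <!-- A skos:exactMatch B -->
  <skos:Concept rdf:about="http://example.org/schemeA/c_1">
    <skos:exactMatch rdf:resource="http://example.org/schemeB/c_7"/>
  </skos:Concept>

  <!-- B skos:exactMatch C -->
  <skos:Concept rdf:about="http://example.org/schemeB/c_7">
    <skos:exactMatch rdf:resource="http://example.org/schemeC/c_3"/>
  </skos:Concept>

  <!-- Not asserted, but inferable:
       A skos:exactMatch C (transitivity), and the reverse direction
       of every link (symmetry). -->

</rdf:RDF>
```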
Provide precision to the semantic relations of your concepts
Use the appropriate properties to make links between concepts
The SKOS data model provides semantic relations and mapping properties, and does not restrict the use of these properties. However, we strongly recommend modelling the relations between concepts in a homogeneous way in order to ensure the semantic consistency of the terminology. We recommend to:
- Use mapping properties to link concepts from different concept schemes
- Use semantic relation properties to link concepts within the same concept scheme
The SKOS data model does not forbid using semantic relation properties to link concepts from different concept schemes, but it is highly recommended to follow these guidelines.
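The distinction can be illustrated with the sketch below, which assumes two hypothetical concept schemes, museumA and museumB. The hierarchical link stays inside museumA, while the link towards museumB uses a mapping property.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">

  <skos:ConceptScheme rdf:about="http://example.org/museumA/scheme"/>

  <skos:Concept rdf:about="http://example.org/museumA/c_0011">
    <skos:prefLabel xml:lang="en">oil painting</skos:prefLabel>
    <skos:inScheme rdf:resource="http://example.org/museumA/scheme"/>
    <!-- Same scheme: semantic relation property. -->
    <skos:broader rdf:resource="http://example.org/museumA/c_0010"/>
    <!-- Different scheme: mapping property. -->
    <skos:closeMatch rdf:resource="http://example.org/museumB/c_0042"/>
  </skos:Concept>

  <skos:Concept rdf:about="http://example.org/museumA/c_0010">
    <skos:prefLabel xml:lang="en">painting</skos:prefLabel>
    <skos:inScheme rdf:resource="http://example.org/museumA/scheme"/>
  </skos:Concept>

</rdf:RDF>
```

The target concept in museumB is referenced by its own URI and is not redescribed here, in line with the guideline on respecting the URIs of the original sources.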
Enable the multilingualism
Manage multilingualism of the terminology through mapping of concepts and terms
The mapping process can be useful in a monolingual context but is especially relevant in a multilingual context. Equivalences can be derived from the mapping links made between several terminologies in different languages. Equivalences in a multilingual context can be of three kinds: semantic, cultural or structural. The semantic aspect refers to the meaning of the concept; the cultural aspect refers to the use of a term in a given language or culture; and the structural aspect refers to the semantic relations between concepts. This last aspect concerns the mapping itself and makes it possible to define complete equivalence (synonymy), partial equivalence (quasi-synonymy) or non-equivalence. As was the case for the first version of the ATHENA Thesaurus, equivalences between concepts in languages that were not initially involved in the source terminology can be deduced from correct mapping links without translating the concepts.
Ensure the documentation of concepts and the terminology
Make explicit with notes the purpose of a relation
For the purposes of maintenance and administration, it is important to explain the modelling choices that have been made when linking concepts. The use of scope notes can help to make these choices explicit.
Documentation properties can also keep track of history of mapping links.
Validation is an important part of the SKOSification process and of mapping. Special attention must therefore be paid to this final step of the SKOSification. From a technical point of view, in order to check the consistency of your converted terminology with the SKOS model, we recommend using the online web service Pool Party. Pool Party offers a free online tool for validating SKOS files, whether they are already online or stored in your local repositories.
This tool checks the consistency of the SKOSified terminology according to the following points, which refer to our guidelines:
- Valid URIs: the tool checks that the URIs do not contain any unauthorised character. However, if a URI is used twice to identify two different concepts, there will not be any alert or warning.
- Missing language tags: the tool checks that all the labels and notes have a language tag.
- Missing labels: the tool checks that each concept has at least one preferred label.
- Loose concepts: all the concepts that are isolated and not linked to other concepts are pointed out as loose concepts.
- Disjoint OWL classes: the tool checks the consistency of any OWL elements that may be present in the SKOSified terminology.
- Consistent use of labels: the tool checks the rules for the use of labels, for example to avoid using the same label both as a preferred label and as an alternative or hidden label, and to avoid having two preferred labels in the same language.
- Consistent usage of mapping properties: the tool checks the consistency in the mapping relations.
- Consistent usage of semantic relations: the tool checks that there is no mix between hierarchical and associative semantic relationships.
From the content point of view, only the administrators and users of the terminology can validate the final migration of the terminology into SKOS format, at least for an initial transformation, since they are the ones able to confirm or modify the general design of the terminology and its semantic relations with regard to indexing and retrieval efficiency. For further modifications and updates, a set of rules and policies has to be defined in order to enable collaborative moderation of the terminology. These rules and policies have to be agreed on by the community of users.