Controlled vocabulary

Controlled vocabularies provide a way to organize knowledge for subsequent retrieval. They are used in subject indexing schemes, subject headings, thesauri, [1] [2] taxonomies and other forms of knowledge organization systems. Controlled vocabulary schemes mandate the use of predefined, authorized terms that have been preselected by the designers of the schemes, in contrast to natural language vocabularies, which have no such restriction.

In library and information science

In library and information science, a controlled vocabulary is a carefully selected list of words and phrases, which are used to tag units of information (document or work) so that they may be more easily retrieved by a search. [3] [4] Controlled vocabularies solve the problems of homographs, synonyms and polysemes by a bijection between concepts and authorized terms. In short, controlled vocabularies ensure consistency and reduce the ambiguity inherent in normal human languages, where the same concept can be given different names.

For example, in the Library of Congress Subject Headings (a subject heading system), authorized terms (subject headings, in this case) have to be chosen to handle choices between variant spellings of the same word (American versus British), choices among scientific and popular terms (cockroach versus Periplaneta americana), and choices between synonyms (automobile versus car), among other difficulties.
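The mapping from variant terms to a single authorized term can be sketched as a simple lookup table. This is a minimal illustration, not how any real subject heading system is implemented; the table contents, the heading "Automobiles", and the function names are assumptions made up for the example.

```python
from typing import Iterable, Optional, Set

# Hypothetical entry vocabulary: every variant (synonym or spelling)
# points to exactly one authorized term.
ENTRY_TERMS = {
    "car": "Automobiles",
    "cars": "Automobiles",
    "automobile": "Automobiles",
    "motorcar": "Automobiles",
    "color": "Color",
    "colour": "Color",
}


def authorized_term(term: str) -> Optional[str]:
    """Return the authorized term for a variant, or None if it is unknown."""
    return ENTRY_TERMS.get(term.strip().lower())


def tag_document(candidate_terms: Iterable[str]) -> Set[str]:
    """Collapse free-form candidate terms into a set of authorized terms."""
    tags = set()
    for term in candidate_terms:
        heading = authorized_term(term)
        if heading is not None:
            tags.add(heading)
    return tags
```

Calling `tag_document(["car", "Automobile", "colour"])` yields the two authorized terms `Automobiles` and `Color`, however many variants appear in the input.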

Choices of authorized terms are based on the principles of user warrant (what terms users are likely to use), literary warrant (what terms are generally used in the literature and documents), and structural warrant (terms chosen by considering the structure and scope of the controlled vocabulary).

Controlled vocabularies also typically handle the problem of homographs with qualifiers. For example, the term “pool” has to be qualified to refer to either a swimming pool or the game of pool, so that each authorized term or heading refers to only one concept.
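Qualifiers can be pictured as a parenthetical suffix that separates the senses of one word form. The sketch below assumes a "Term (Qualifier)" convention and invents its own headings; real vocabularies differ in how they format and choose qualifiers.

```python
# Hypothetical qualified headings: one plain word form, several concepts.
QUALIFIED_HEADINGS = {
    "pool": ["Pool (Game)", "Pool (Swimming)"],
    "mercury": ["Mercury (Element)", "Mercury (Planet)"],
}


def headings_for(word):
    """List the qualified headings that share a plain word form."""
    return QUALIFIED_HEADINGS.get(word.lower(), [])


def split_heading(heading):
    """Split 'Term (Qualifier)' into (term, qualifier).

    The qualifier is None when the heading carries none.
    """
    if heading.endswith(")") and " (" in heading:
        term, qualifier = heading[:-1].rsplit(" (", 1)
        return term, qualifier
    return heading, None
```

A searcher who types "pool" can then be shown both headings and asked to pick one, instead of receiving documents about both concepts.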

There are two main types of controlled vocabulary tools used in libraries: subject headings and thesauri. While the differences between the two are diminishing, there are still some minor differences.

Historically, subject headings were designed by catalogers to describe books in library catalogs, while thesauri were used by indexers to apply index terms to documents and articles. Subject headings tend to be broader in scope, describing whole books, while thesauri tend to be more specialized, covering specific disciplines. Also, because of the card catalog system, subject headings tend to have terms in indirect order (though this is changing with the rise of automated systems), while thesaurus terms are always in direct order. Subject headings also tend to use more pre-coordination of terms, in which the designer of the controlled vocabulary combines several concepts into one authorized subject heading (e.g., children and terrorism), while thesauri tend to use singular direct terms.

Subject headings have also historically had less syndetic (cross-reference) structure than thesauri. For example, the Library of Congress Subject Headings did not have much syndetic structure until 1943, and it was not until 1985 that it began to adopt the thesaurus-type terms “Broader term” and “Narrower term”.

The terms are chosen and organized by trained professionals (including librarians and information scientists) who possess expertise in the subject area. Controlled vocabulary terms can accurately describe what a given document is actually about, even if the terms themselves do not exist within the document’s text. Well-known subject heading systems include the Library of Congress system, MeSH, and Sears. Well-known thesauri include the Art and Architecture Thesaurus and the ERIC Thesaurus.

Controlled vocabulary elements (terms or phrases) employed as tags to aid in the content identification process of documents or other information system entities (e.g. DBMS, Web Services) qualify as metadata.

Indexing languages

There are three main types of indexing languages.

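  • Controlled indexing language – only approved terms can be used by the indexer to describe the document
  • Natural language indexing language – any term from the document in question can be used to describe the document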
  • Free indexing language – any term (not only from the document) can be used to describe the document
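The three indexing languages differ only in which terms an indexer is allowed to assign. A minimal sketch of that rule follows; the function name, the language labels, and the single-word tokenization are assumptions for illustration, and the controlled case assumes that only terms from an approved list are permitted.

```python
import re


def is_term_allowed(term, language, document_text, approved_terms):
    """Decide whether an indexer may assign `term` under an indexing language.

    language: "controlled", "natural", or "free" (hypothetical labels).
    """
    if language == "controlled":
        # Only terms from the approved list are permitted.
        return term in approved_terms
    if language == "natural":
        # Any term that occurs in the document itself is permitted.
        # Simplification: only single-word terms are checked.
        words = set(re.findall(r"[a-z0-9]+", document_text.lower()))
        return term.lower() in words
    if language == "free":
        # Any term at all is permitted.
        return True
    raise ValueError("unknown indexing language: " + repr(language))
```

For a document whose text never uses the word "soccer", the term `Soccer` is allowed under a controlled language that approves it and under a free language, but not under a natural language indexing language.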

When indexing a document, the indexer also has to choose the level of indexing exhaustivity, the level of detail in which the document is described. For example, using low indexing exhaustivity, minor aspects of the work will not be described with index terms. In general the higher the indexing exhaustivity, the more terms indexed for each document.

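In recent years, free text search has become popular as a means of access to documents. This involves natural language indexing with the indexing exhaustivity set to the maximum (every word in the text is indexed). Many studies have been done to compare the efficiency and effectiveness of free text searches against documents that have been indexed by experts using a few well-chosen controlled vocabulary descriptors.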

Controlled vocabularies are often claimed to improve the accuracy of free text searching, for example by reducing irrelevant items in the retrieval list. These irrelevant items (false positives) are often caused by the inherent ambiguity of natural language. Take the English word football, for example. Football is the name given to a number of different team sports. Worldwide, the most popular of these team sports is association football, which also happens to be called soccer in several countries. The word football is also applied to rugby football (rugby union and rugby league), American football, Australian rules football, Gaelic football, and Canadian football. A search for football will therefore retrieve documents that are about several completely different sports. Controlled vocabulary solves this problem by tagging the documents in such a way that the ambiguities are eliminated.
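The football example can be reproduced on a toy collection. The four documents, their tags, and both search functions below are invented for illustration; they only show how a tag lookup separates sports that a word match lumps together.

```python
import re

# Hypothetical mini-collection: free text plus indexer-assigned tags.
DOCUMENTS = [
    {"id": 1, "text": "The football final went to penalties.",
     "tags": {"Association football"}},
    {"id": 2, "text": "A football quarterback threw three touchdowns.",
     "tags": {"American football"}},
    {"id": 3, "text": "Rugby football clubs met for the league title.",
     "tags": {"Rugby league"}},
    {"id": 4, "text": "The soccer club signed a new striker.",
     "tags": {"Association football"}},
]


def free_text_search(word):
    """Return ids of documents whose text contains the word."""
    word = word.lower()
    return [d["id"] for d in DOCUMENTS
            if word in re.findall(r"[a-z]+", d["text"].lower())]


def tag_search(authorized_term):
    """Return ids of documents tagged with the authorized term."""
    return [d["id"] for d in DOCUMENTS if authorized_term in d["tags"]]


print(free_text_search("football"))        # [1, 2, 3]
print(tag_search("Association football"))  # [1, 4]
```

The free text search mixes three different sports and misses document 4, which says "soccer"; the tag search returns exactly the two association football documents.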

Compared to free text searching, the use of a controlled vocabulary can dramatically increase the performance of an information retrieval system, if performance is measured by precision (the percentage of records in the retrieval list that are actually relevant to the search topic).

In some cases controlled vocabulary can enhance recall as well, because unlike natural language schemes, once the correct authorized term is searched, you do not need to worry about searching for other terms that might be synonyms of that term.

However, a controlled vocabulary search may also lead to unsatisfactory recall, in that it will fail to retrieve some records that are actually relevant to the search question.
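Precision and recall can both be computed from two sets: the records a search retrieved and the records that are actually relevant. The sketch below uses the standard definitions; the record ids and the two example result lists are made up to show the trade-off.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved records that are relevant (0.0 if none retrieved)."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of relevant records that were retrieved (0.0 if none relevant)."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical search results over records numbered 1 to 8.
relevant = {1, 2, 3, 4}
free_text_hits = {1, 2, 3, 4, 5, 6, 7, 8}   # finds everything, plus noise
controlled_hits = {1, 2, 3}                 # clean list, one record missed

print(precision(free_text_hits, relevant), recall(free_text_hits, relevant))    # 0.5 1.0
print(precision(controlled_hits, relevant), recall(controlled_hits, relevant))  # 1.0 0.75
```

In this invented case the controlled vocabulary search has perfect precision but misses one relevant record, while the free text search finds every relevant record at the cost of a noisier list.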

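This is especially problematic when the search question involves terms that are sufficiently tangential to the subject area that the indexer might have tagged the document with a different term, one that the searcher would consider equivalent. Essentially, this can be avoided only by an experienced user of the controlled vocabulary whose understanding of the vocabulary coincides with the way it is used by the indexer.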

Another possibility is that the article is not tagged by the indexer because indexing exhaustivity is low. For example, an article might mention football as a secondary focus, and the indexer might decide not to tag it with “football” because it is not important enough compared to the main focus. But it turns out that for the searcher that article is relevant, and hence recall fails. A free text search would automatically pick up that article regardless.

On the other hand, free text searches have high exhaustivity (you search on every word), so they have the potential for high recall (assuming you solve the problem of synonyms by entering every combination), but they will have much lower precision.

Controlled vocabularies can also become outdated quickly; in fast-developing fields of knowledge, the authorized terms needed may not be available if the vocabulary is not updated regularly. Even in the best-case scenario, controlled language is often not as specific as the words of the text itself. Indexers trying to choose the appropriate index terms could misinterpret the author, while a free text search is in no danger of doing so, because it uses the author’s own words.

The use of controlled vocabularies can be expensive compared to free text searches, because human experts or costly automated systems are needed to index each entry. Moreover, the user has to be familiar with the controlled vocabulary scheme to make best use of the system. But as already mentioned, the control of synonyms and homographs can help increase precision.

Numerous methodologies have been developed to assist in the creation of controlled vocabularies, including faceted classification, which allows a given data record or document to be described in multiple ways.
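Faceted classification can be pictured as giving each record one value per independent facet, so that the same record is reachable along several dimensions. The facet names, values, and records below are hypothetical and chosen only to illustrate the idea.

```python
# Hypothetical records, each described along three independent facets.
RECORDS = [
    {"title": "Rules of the Game",
     "facets": {"subject": "Soccer", "format": "Book", "audience": "Adults"}},
    {"title": "First Kicks",
     "facets": {"subject": "Soccer", "format": "Book", "audience": "Children"}},
    {"title": "Scrum Basics",
     "facets": {"subject": "Rugby union", "format": "Video", "audience": "Children"}},
]


def filter_by_facets(records, **wanted):
    """Return titles of records matching every requested facet value."""
    return [
        record["title"]
        for record in records
        if all(record["facets"].get(facet) == value
               for facet, value in wanted.items())
    ]


print(filter_by_facets(RECORDS, subject="Soccer"))
# ['Rules of the Game', 'First Kicks']
print(filter_by_facets(RECORDS, subject="Soccer", audience="Children"))
# ['First Kicks']
```

Each facet is its own small controlled vocabulary, and combining facets at search time narrows the result without requiring a pre-built heading for every combination.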

Applications

Controlled vocabularies, such as the Library of Congress Subject Headings, are an essential component of bibliography, the study and classification of books. They were initially developed in library and information science. In the 1950s, government agencies began to develop controlled vocabularies for the burgeoning journal literature in specialized fields; an example is the Medical Subject Headings (MeSH) developed by the US National Library of Medicine. Subsequently, for-profit firms (called abstracting and indexing services) emerged to index the fast-growing literature in every field of knowledge. In the 1960s, an online bibliographic database industry developed, based on dialup X.25 networking. These services were seldom made available to the public because they were difficult to use; specialist librarians called search intermediaries handled the searching. In the 1980s, the first full text databases appeared; these databases contain the full text of the indexed articles as well as the bibliographic information. Online bibliographic databases have migrated to the Internet and are now publicly available; however, most are proprietary and can be expensive to use. Students enrolled in colleges and universities may be able to access some of these services without charge; some of these services may also be available without charge at a public library.

In large organizations, controlled vocabularies may be introduced to improve technical communication. The use of a controlled vocabulary ensures that everyone uses the same word to mean the same thing. This consistency of terms is one of the most important concepts in technical writing and knowledge management, where effort is expended to use the same word throughout a document or organization instead of slightly different ones to refer to the same thing.

Web searching could be dramatically improved by the development of a controlled vocabulary for describing web pages; the use of such a vocabulary could culminate in a Semantic Web, in which the content of web pages is described using a machine-readable metadata scheme. One of the first proposals for such a scheme is the Dublin Core Initiative. An example of a controlled vocabulary that is usable for indexing web pages is PSH.

It is unlikely that a single metadata scheme will ever succeed in describing the content of the entire web. [5] To create a Semantic Web, it may be necessary to draw from two or more metadata systems to describe a Web page’s contents. The eXchangeable Faceted Metadata Language (XFML) is designed to enable controlled vocabulary creators to publish and share metadata systems. XFML is designed on faceted classification principles. [6]

Controlled vocabularies of the Semantic Web define the concepts and relationships (terms) used to describe a field of interest or area of concern. For instance, to declare a person in a machine-readable format, a vocabulary is required that has a formal definition of “Person”, such as the Friend of a Friend (FOAF) vocabulary, which defines typical properties of a person such as affiliation, email address, and homepage, or the Person vocabulary of Schema.org. [7] Similarly, a book can be described using the Book vocabulary of Schema.org [8] and general publication terms from the Dublin Core vocabulary, [9] an event with the Event vocabulary of Schema.org, [10] and so on.

To use machine-readable terms from any controlled vocabulary, web designers can choose from a variety of annotation formats, including RDFa, HTML5 Microdata, or JSON-LD in the markup, or RDF serializations (RDF/XML, Turtle, N3, TriG, TriX) in external files.
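As one illustration of the JSON-LD option, the sketch below builds a description of a person with terms from the Schema.org Person vocabulary and wraps it in the script element that JSON-LD markup normally uses in an HTML page. The helper names and all of the values (the name, the example.org address and URL) are placeholders invented for the example.

```python
import json


def person_jsonld(name, email, homepage):
    """Describe a person using Schema.org terms (name, email, url)."""
    return {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": name,
        "email": email,
        "url": homepage,
    }


def as_script_tag(data):
    """Wrap a JSON-LD document in a script element for embedding in HTML."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Placeholder values, not a real person.
person = person_jsonld("Jane Example", "jane@example.org",
                       "https://example.org/jane")
print(as_script_tag(person))
```

Because the property names come from a shared vocabulary instead of being chosen per site, a consumer that understands Schema.org can interpret the block without site-specific rules.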

See also

  • Authority control
  • Controlled natural language
  • IMS Vocabulary Definition Exchange
  • Named-entity recognition
  • Nomenclature
  • Ontology (computer science)
  • Terminology
  • Thesaurus
  • Universal Data Element Framework
  • Vocabulary-based transformation

References

  1. Controlled Vocabularies. Links to examples of thesauri and classification schemes.
  2. Controlled Vocabularies. Links to examples of thesauri and classification schemes used in Agriculture, Fisheries, Forestry etc.
  3. Amy Warner, A taxonomy primer.
  4. Karl Fast, Fred Leise and Mike Steckel, [1]
  5. Cory Doctorow, Metacrap.
  6. Mark Pilgrim, eXchangeable Faceted Metadata Language.
  7. “The Person vocabulary of Schema.org”. Retrieved 13 March 2015.
  8. “The Book vocabulary of Schema.org”. Retrieved 13 March 2015.
  9. “Dublin Core Metadata Element Set, Version 1.1”. Retrieved 13 March 2015.
  10. “The Event vocabulary of Schema.org”. Retrieved 13 March 2015.
