Handbook
Metadata for
Information
Management and
Retrieval
Haynes 4th proof 13 December 2017 13/12/2017 15:37 Page ii
Understanding metadata and its use
Second edition
David Haynes
Except as otherwise permitted under the Copyright, Designs and Patents Act
1988 this publication may only be reproduced, stored or transmitted in any form
or by any means, with the prior permission of the publisher, or, in the case of
reprographic reproduction, in accordance with the terms of a licence issued by
The Copyright Licensing Agency. Enquiries concerning reproduction outside
those terms should be sent to Facet Publishing, 7 Ridgmount Street, London
WC1E 7AE.
Every effort has been made to contact the holders of copyright material
reproduced in this text, and thanks are due to them for permission to reproduce
the material indicated. If there are any queries please contact the publisher.
A catalogue record for this book is available from the British Library.
Typeset from author’s files in 10/13 pt Palatino Linotype and Open Sans by
Flagholme Publishing Services.
Printed and made in Great Britain by CPI Group (UK) Ltd, Croydon, CR0 4YY.
Contents
3 Data modelling
Overview
Metadata models
Unified Modelling Language (UML)
Resource Description Framework (RDF)
Dublin Core
The Library Reference Model (LRM) and the development of RDA
ABC ontology and the semantic web
Indecs – modelling book trade data
4 Metadata standards
Overview
The nature of metadata standards
About standards
Dublin Core – a general-purpose standard
Metadata standards in library and information work
Social media
Non-textual materials
Complex objects
Conclusion
Provenance
Conclusion
References
Index
Figures
1.1 Metadata from the Library of Congress home page
2.1 Example of marked-up text
2.2 Rendered text
2.3 Word document metadata
2.4 Westminster Libraries – catalogue search
2.5 Westminster Libraries catalogue record
2.6 WorldCat search
2.7 WorldCat detailed record
2.8 OpenDOAR search of repositories
2.9 Detailed OpenDOAR record
3.1 An RDF triple
3.2 More complex RDF triple
3.3 A triple expressed as linked data
3.4 DCMI resource model
3.5 Relationships between Work, Expression, Manifestation and Item
3.6 LRM agent relationships
3.7 Publication details using the ABC Ontology
3.8 Indecs model
3.9 OAIS simple model
3.10 OAIS Information Package
3.11 Relationship between Information Packages in OAIS
4.1 BIBFRAME 2.0 model
4.2 Overlap between image metadata formats
4.3 IIIF object
4.4 Relationships between IIIF objects
4.5 Metadata into an institutional repository
4.6 How OAI-PMH works
5.1 Example of relationship between ISTC and ISBN
5.2 Structure of an Archival Resource Key
Tables
1.1 Day’s model of metadata purposes
1.2 Different types of metadata and their functions
4.1 KBART fields
4.2 IIIF resource structure
11.1 Dublin Core to MODS Crosswalk
13.1 Comparison of metadata fields required for data sets in Project Open Data
13.2 Core metadata elements to be provided by content providers
14.1 Metadata standards development
Preface
THIS IS NOT A ‘HOW TO DO IT’ BOOK. There are several excellent guides
about the practical steps for creating and managing metadata. This
book is intended as a tutorial on metadata and arose from my own
need to find out more about how metadata worked and its uses. The original
book came out at a time when there were very few guides of this type
available. Metadata Fundamentals for All Librarians provided a good starting
point which introduced the basic concepts and identified some of the main
standards that were then available (Caplan, 2003). It was an early publication
from a period of tremendous development and in an area that was changing
day to day. Introduction to Metadata, published by the Getty Research Institute,
represented another milestone and provided more comprehensive
background to metadata (Baca, 1998). It is now in its third edition (Baca, 2016).
In my work as an information management consultant many colleagues
and clients kept asking the questions: ‘What is metadata?’, ‘How does it
work?’, and ‘What’s it for?’. The last of these questions particularly resonated
with the analysis and review of information services. This led to the
development of a view of metadata defined by its purposes or uses. Since the
first edition of Metadata for Information Management and Retrieval there have
been many excellent additions to the literature, notably Zeng and Qin’s book,
simply entitled Metadata, which is now in its second edition (Zeng and Qin,
2008; 2015; Haynes, 2004). I also enjoyed Philip Hider’s book, Information
Resource Description, which is substantially about metadata from a subject
retrieval perspective (Hider, 2012). There are many other excellent tomes,
some of which are mentioned in the main body of this book. I hope that this
second edition adds a unique perspective to this burgeoning field.
This book covers the basic concepts of metadata and some of the models
that are used for describing and handling it. The main purpose of this book
is to reveal how metadata operates, from the perspective of the user and the
manager. It is primarily concerned with data about document-based
information content – in the broadest sense. Many of the examples will be for
bibliographic materials such as books, e-journals and journal articles.
However, this book also covers metadata about the documentation associated
with museum objects (thus making them information objects), as well as
digital resources such as research data collections, web resources, digitised
images, digital photographs, electronic records, music, sound recordings and
moving images. It is not a book about databases or data modelling, which is
covered elsewhere (Hay, 2006).
Metadata for Information Management and Retrieval is international in
coverage and sets out to introduce the concepts behind metadata. It focuses
on the ways metadata is used to manage and retrieve information. It
discusses the role of metadata in information governance as well as exploring
its use in the context of social media, linked open data and big data. The book
is intended for museums, libraries, archives and records management
professionals, including academic libraries, publishers, and managers of
institutional repositories and research data sets. It will be directly relevant to
students in the iSchools as well as those who are preparing to work in the
library and information professions. It will be of particular interest to the
knowledge organisation and information architecture communities. Managers
of corporate information resources and informed users who need to know
about metadata will also find much that is relevant to them. Finally, this book
is for researchers who deal with large data sets, either as their creators or as
users who need to understand the ways in which that data is described, its
properties and ways of handling and interrogating that data.
Acknowledgements
PREPARATION OF THIS BOOK would not have been possible without the
support and assistance of many individuals, too numerous to list. I
hope that they will recognise their contributions in this book and will
accept this acknowledgement as thanks. Any shortcomings are entirely my
own.
I would like to thank colleagues at City, University of London. David
Bawden and Lyn Robinson at the Centre for Information Science provided
guidance and encouragement throughout. Andy MacFarlane was an excellent
critic for the early drafts of the chapter on information retrieval. The library
service at City, University of London has been an invaluable resource which,
with the back-up of the British Library, has been essential for the identification
and procurement of relevant literature.
Neil Wilson, Rachael Kotarski, Bill Stockting and Paul Clements at the
British Library, Christopher Hilton at the Wellcome Library and Graham Bell
of EDItEUR all freely gave their time in interviews and follow-up questions.
I would like to acknowledge the contribution made by former colleagues
at CILIP, where I was working when I wrote the first edition. I am also
grateful for the feedback from reviewers, colleagues and students who have
used the book as a text. I am especially grateful for the moral support of the
University of Dundee, where I teach a module on ‘Metadata Standards and
Information Taxonomies’ on their postgraduate course in the Centre for
Archives and Information Studies (CAIS). Teaching that particular course has
helped to shape my thinking and has given me an incentive to read and think
more about metadata.
Many colleagues in the wider library and information profession helped to
clarify specific points about the use of metadata. I would especially like to
thank Gordon Dunsire for going through the manuscript and pointing out
significant issues that I hope have now been addressed.
Finally I would like to thank family, friends and colleagues who have
provided constant encouragement throughout this enterprise.
PART I
Metadata concepts
Part I introduces the concepts that underpin metadata, starting with an
historical perspective. Some examples of metadata that people come across
in their daily life are demonstrated in Chapter 1, along with some alternative
views of metadata and how it might be categorised. This chapter defines the
scope of this book as considering metadata in the context of document
description. Chapter 2 looks at mark-up languages and the development of
schemas as a way of representing metadata standards. It also highlights the
connection between metadata and cataloguing. Chapter 3 looks at different
ways of modelling data with specific reference to the Resource Description
Framework (RDF). It describes the Library Reference Model (LRM) and its
impact on current cataloguing systems. Chapter 4 discusses cataloguing and
metadata standards and ways of representing metadata. It introduces RDA,
MARC and BIBFRAME, as well as standards used in records management, digital
repositories and non-textual materials such as images, video and sound.
CHAPTER 1
Introduction
Overview
This chapter sets out to introduce the concepts behind metadata and illustrate them with
historical examples of metadata use. Some of these uses predate the term ‘metadata’. The
development of metadata is placed in the context of the history of cataloguing, as well as
parallel developments in other disciplines. Indeed, one of the ideas behind this book is that
metadata and cataloguing are strongly related and that there is considerable overlap
between the two. Pomerantz (2015) and Gartner (2016) have made a similar connection,
although Zeng and Qin (2015) emphasise the distinction between cataloguing and
metadata. This leads to discussion of the definitions of ‘metadata’ and a suggested form
of words that is appropriate for this book. Examples of metadata use in e-publishing,
libraries, archives and research data collections are used to illustrate the concept. The
chapter then considers why metadata is important in the wider digital environment and
some of the political issues that arise. This approach provides a way of assessing the
models of metadata in terms of its use and its management. The chapter finally introduces
the idea that metadata can be viewed in terms of the purposes to which it is put.
Why metadata?
If anyone wondered about the importance of metadata, the Snowden
revelations about US government data-gathering activities should leave no
one in any doubt. Stuart Baker, the NSA (National Security Agency) General
Counsel, said ‘Metadata tells you everything about somebody’s life. If you
have enough metadata you don’t really need content’ (Schneier, 2015, 23). The
routine gathering of metadata about telephone calls originating outside the
USA or calls to foreign countries from the USA caused a great deal of concern,
not only among American citizens but also among the US’s strongest allies and
trading partners. The UK’s Investigatory Powers Act (UK Parliament, 2016)
requires communications providers to keep metadata records of
communications via public networks (including the postal network) to facilitate security
surveillance and criminal investigations. As Jacob Appelbaum said when the
Wikileaks controversy first blew up, ‘Metadata in aggregate is content’
(Democracy Now, 2013). His point was that when metadata from different
sources is aggregated it can be used to reconstruct the information content of
communications that have taken place.
Although metadata has only recently become a topic for public discussion,
it pervades our lives in many ways. Anyone who uses a library catalogue is
dealing with metadata. Since the first edition of this book the idea of metadata
librarians or even metadata managers has gained traction. Job advertisements
often focus on making digital resources available to users. Roles that would
have previously been described in terms of cataloguing and indexing are
being expressed in the language of metadata. Re-use of data depends on
metadata standards that allow different data sources to be linked to provide
innovative new services. Many apps on mobile devices depend on combining
location with live data feeds for transportation, air quality or property prices,
for example. They depend on metadata.
The term began to be widely used in the database research community by the
mid-1970s.
A parallel development occurred in the geographical information systems
(GIS) community and in particular the digital spatial information discipline.
In the late 1980s and early 1990s there was considerable activity within the
GIS community to develop metadata standards to encourage interoperability
between systems. Because government (especially local government) activity
often requires data to describe location, there are significant benefits to be
gained from a standard to describe location or spatial position across
databases and agencies. The metadata associated with location data has
allowed organisations to maintain their often considerable internal
investments in geospatial data, while still co-operating with other
organisations and institutions. Metadata is a way of sharing details of their
data in catalogues of geographic information, clearing houses or via vendors
of information. Metadata also gives users the information they need to process
and interpret a particular set of geospatial data.
In the mid-1990s the idea of a core set of semantics for web-based resources
was put forward for categorising the web and to enhance retrieval. This
became known as the Dublin Core Metadata Initiative (DCMI), which has
established a standard for describing web content and which is not discipline-
or language-specific. The DCMI defines a set of data elements which can be
used as containers for metadata. The metadata is embedded in the resource,
or it may be stored separately from the resource. Although developed with
web resources in mind it is widely used for other types of document,
including non-digital resources such as books and pictures. DCMI is an
ongoing initiative which continues to develop tools for using Dublin Core.
This position was questioned by Gorman (2004), who suggested that
metadata schemes such as Dublin Core are merely subsets of much more
sophisticated frameworks such as MARC (Machine Readable Cataloguing).
He suggested that without authority control and use of controlled
vocabularies, Dublin Core and other metadata schemes cannot achieve their aim of
improving the precision and recall from a large database (such as web
resources on the internet). His solution is that existing metadata standards
should be enriched to bring them up to the standards of cataloguing.
However, his arguments depend on a distinction being drawn between ‘full
cataloguing’ and ‘metadata’. An alternative view (and one supported in this
book) is that cataloguing produces metadata. Gorman is certainly right in
suggesting that metadata will not be particularly useful unless it is created in
line with more rigorous cataloguing approaches.
All these metadata traditions have come together as the different
communities have become aware of the others’ activities and have started to
work together. The DCMI involved the database and the LIS communities
from the beginning with the first workshop in 1995 in Dublin, Ohio, and has
gradually drawn in other groups that manage and use metadata.
What is metadata?
Although there is an attractive simplicity in the original definition, ‘Metadata
is data about data’, it does not adequately reflect current usage, nor does it
describe the complexity of the subject.
At this stage it is worth interrogating the idea of metadata more fully. The
concept of metadata has arisen from several different intellectual traditions.
The different usages of metadata reflect the priorities of the communities that
use metadata. One could speculate about whether there is a common
understanding of what metadata is, and whether there is a definition that is
generally applicable.
Metadata was originally referred to as ‘meta-data’, which emphasises the
two word fragments that make up the term. The word fragment ‘meta’, which
comes from the Greek ‘μετα’, translates into several distinct meanings in
English. In this context it can be taken to mean a higher or superior view of
the word it prefixes. In other words, metadata is data about data or data that
describes data (or information). In current usage the ‘data’ in ‘metadata’ is
widely interpreted as information, information resource or
information-containing entity. This allows inclusion of documentary materials in different
formats and on different media.
Although metadata is widely used in the database and programming
professions, the focus in this book is on information resources managed in
the museums, libraries and archives communities. Some in the library and
information community defined metadata in terms of function or purpose.
However, in this context metadata has more wide-ranging purposes,
including retrieval and management of information resources, as we see in
an early definition:
any data that aids in the identification, description and location of networked
electronic resources. . . . Another important function provided by metadata is
control of the electronic resource, whether through ownership and provenance
metadata for validating information and tracking use; rights and permissions
Pomerantz (2015, 21–2) talks about metadata often describing containers for
data, such as books. He also suggests that metadata records are themselves
containers for descriptions of data and its containers and arrives at the
following definition of metadata: ‘a potentially informative object that
describes another potentially informative object’ (Pomerantz, 2015, 26). Zeng
and Qin (2015, 11) talk about metadata in the following terms: ‘metadata
encapsulate the information that describes any information-bearing entity’,
before switching their attention to bibliographic metadata and components
of metadata as described in Dublin Core. Gilliland also talks in terms of
information objects:
Perhaps a more useful, ‘big picture’ way of thinking about metadata is as the
sum total of what one can say about any information object at any level of
aggregation. In this context, an information object is anything that can be
addressed and manipulated as a discrete entity by a human being or an
information system. (Gilliland, 2016)
The field names are highlighted in bold – these are equivalent to the data
elements in a metadata record. The content of each field, the metadata content,
appears alongside the field name. This same cataloguing information can be
displayed in other formats such as MARC 21.
The second example is of metadata from the home page of the Library of
Congress website, Figure 1.1 on the next page. The form displays embedded
metadata using a variety of standards. The top part of the form consists of
metadata automatically extracted from the page coding. The lower part of the
form lists metadata that the page has been tagged with according to various
metadata standards. The ‘dc:’ label refers to Dublin Core. The ‘og:’ tag refers
to Open Graph metadata.
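A short sketch of this kind of embedded tagging may help; the content values below are invented for illustration rather than copied from the Library of Congress page, but the ‘dc:’ and ‘og:’ patterns are the ones described above:

```html
<head>
  <title>Library of Congress</title>
  <!-- Dublin Core metadata -->
  <meta name="dc.title" content="Library of Congress" />
  <meta name="dc.language" content="en" />
  <!-- Open Graph metadata -->
  <meta property="og:title" content="Library of Congress" />
  <meta property="og:type" content="website" />
</head>
```

Tools that extract metadata from a page read name–value pairs of this kind and group them by the standard to which each prefix belongs.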
Purposes of metadata
Metadata is something which you collect for a particular purpose, rather than
being a bunch of data you collect just because it is there or because you have
some public duty to collect (Bell, 2016). One of the main drivers for the
evolution of metadata standards is the use to which the metadata is put, its
purpose. Even within the library and information profession, a wide range
of metadata purposes has been identified. Two of the most useful models
provide a basis for the purposes of metadata described in this book.
In the first model Day (2001) suggested that metadata has seven distinct
purposes. He starts with resource description – identifying and describing
the entity that the metadata is about. The second purpose is focused on
information retrieval – and in the context of web resources this is called
‘resource discovery’. This is one of the primary focuses of the Dublin Core
1 Resource description
2 Resource discovery
3 Administration and management of resources
4 Record of intellectual property rights
5 Documenting software and hardware environments
6 Preservation management of digital resources
7 Providing information on context and authenticity
Gilliland (2016) takes a slightly different approach, although she also classifies
metadata according to purpose. The use of metadata is categorised into more
specific sub-categories. This means that a metadata scheme as well as
individual metadata elements could fall into several different categories
simultaneously. Gilliland provides some useful examples of the metadata that
falls under each type (Table 1.2). There is some common ground with Day, in
that they both identify: administration (equivalent to management and
administration); description (encompassing information retrieval or resource
discovery); and preservation as key purposes of metadata. The technical
metadata in Gilliland corresponds to ‘Documenting hardware and software
environments’ in Day. The ‘Use’ metadata could include transactional data
as would be seen in an e-commerce system or could provide an audit trail for
documents in a records management system.
Table 1.2 Different types of metadata and their functions, extracted from Gilliland (2016)
There is a lot of common ground between these two models and although
neither of them specifically mentions ‘interoperability’ as a purpose, it is
alluded to. For instance, Day’s purpose 5 – ‘documenting software and
hardware environments’, touches on one aspect of interoperability and the
Multiple purposes
Metadata can be used within one application for several different purposes.
The model developed here helps in the analysis of metadata applications and
the understanding of its characteristics in different situations.
It is based on the purposes identified in the first edition and has been extended and modified
to reflect the full range of uses of metadata in the 14 years that have since
passed.
Part III (Chapters 11–14) is about the management of metadata and starts
with well established methods of managing standards, schemas and metadata
quality. It then considers recent developments in taxonomies, encoding
schemes and ontologies and the role that these play in structuring knowledge.
It moves on to big data and the challenges faced by those wishing to exploit
very large data collections. It then considers the starting point of this book,
politics. What are the implications for privacy and national security? The final
chapter also considers the future of metadata – from the empowerment of
users through to professional development – and considers who will be
responsible for managing metadata in the future.
Throughout this book ‘metadata’ is used as a singular collective noun. The
word ‘data’ is used as a mass noun and is treated as a collective singular noun
in accordance with most common current usage (Rosenberg, 2013, 18–19).
This ties in with the gradual disappearance of the word ‘datum’. Even Steven
Pinker, one of the foremost thinkers and writers about linguistics
acknowledges this, although he makes clear his own preferences:
I like to use data as a plural of datum, but I’m in a fussy minority even among
scientists. Data is rarely used as a plural today, just as candelabra and agenda
long ago ceased to be plurals. But I still like it. (Pinker, 2015, 271)
CHAPTER 2
Defining, describing and expressing metadata
Overview
This chapter describes some of the concepts associated with metadata. It considers ways
in which metadata can be expressed and focuses on document mark-up languages. It
then considers schemas as one method of defining metadata standards and data
elements. Databases of metadata are described as an alternative to embedded
metadata. The last section of the chapter shows some examples of how metadata is used
in different contexts such as document creation, records management, library catalogues,
digital repositories and image collections.
Defining metadata
Metadata is used in catalogues of digital and physical information resources.
The requirements for books in a library catalogue might be very different
from the metadata embedded in a web page, but the general concepts of
metadata apply to both. Its use for digital and printed resources provides
some helpful examples. Document mark-up languages such as SGML and
XML are widely used to express metadata standards.
Document mark-up
The development of mark-up languages is an excellent example of the way
in which metadata can be applied to and expressed in documents. Electronic
documents are one of the most common forms of digital object to which
metadata is applied, and range from web pages through to electronic records,
and may incorporate text, images and interactive material.
Commonly used mark-up languages include:
• HTML
• XML
• TEI
• LaTeX.
Figure 2.1 shows: raw text (the data), the text with formatting instructions
(the mark-up) and the text as it would appear to the reader (the rendition).
The marked-up version reads: ‘This is an example of marked-up text that
shows <l>large</l> and <s>small</s> text as well as <b>bold</b> and
<i>italics</i>’. Applying two different stylesheets to the same mark-up
produces two different renditions of the text.
SGML was for a long time an international standard, ISO 8879 (ISO, 1986).
Although now withdrawn, it is still the basis for other mark-up languages.
Hypertext Markup Language (HTML) is an application of SGML. HTML is
used to encode the content of web pages and is widely used to describe web
pages, including the metadata embedded in them. HTML5 recognises
metadata content as a specific category of HTML content:
Metadata content is content that sets up the presentation or behavior of the rest of
the content, or that sets up the relationship of the document with other
documents, or that conveys other ‘out of band’ information.
(W3C, 2014b)
The head of the document holds the metadata, including the title and other
metadata content held in the data element ‘meta’.
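A minimal sketch of an HTML5 document head may make this concrete; the page title and field values are invented for illustration:

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <!-- metadata content lives in the head -->
    <meta charset="utf-8" />
    <title>Example page</title>
    <meta name="description" content="A short summary of the page" />
    <meta name="author" content="A. N. Author" />
  </head>
  <body>
    Page content appears here.
  </body>
</html>
```

The ‘meta’ elements carry name–value pairs that describe the document without forming part of its visible content.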
TEI is a specialist mark-up language widely used in the digital humanities
(TEI Consortium, 2016). The TEI header is where metadata is normally
embedded. However, marking up documents in this way allows other
characteristics of a document to be identified and retrieved or processed. The
Title Statement includes title, author, and funder information. Other
bibliographic information includes edition and publication details. Although
TEI is not a cataloguing standard, its structure facilitates identification and
use of structured metadata.
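A skeletal TEI header along the following lines (the bibliographic values are invented for this sketch) shows where the Title Statement and the publication details sit:

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <!-- Title Statement: title, author and funder information -->
      <title>An Edited Text</title>
      <author>Jane Author</author>
      <funder>A Research Council</funder>
    </titleStmt>
    <publicationStmt>
      <!-- publication details -->
      <publisher>An Academic Press</publisher>
      <date>2017</date>
    </publicationStmt>
  </fileDesc>
</teiHeader>
```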
LaTeX is another specialist mark-up language, developed for scientific and
mathematical publications. Different templates can be applied to a
marked-up document to format it to conform with a variety of academic publications.
The current version is LaTeX2e. LaTeX3 was still in development at the time
of writing (LaTeX3 Project Team, 2001).
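In LaTeX the descriptive metadata is itself expressed as mark-up commands in the document source; a minimal sketch (title and author invented) looks like this:

```latex
\documentclass{article}
% metadata carried as mark-up commands in the preamble
\title{A Sample Article}
\author{A. N. Author}
\date{2017}
\begin{document}
\maketitle
Body text marked up with commands such as \emph{emphasis}.
\end{document}
```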
XML documents consist of data (the content) and mark-up, which encodes the logical structure and other
attributes of the data. Documents are organised into elements which break
the document down into units of meaning, purpose or layout. The elements
correspond to fields in a database, as will be seen in later examples in this
chapter. XML documents can also use entities, which may refer to an external
document or a dynamic database record, or can be used to label a defined
piece of text for re-use within the document.
The DTD for a memo can be used to test the ‘validity’ of the document. In
other words does a document purporting to be a memo have the right
elements appearing in the right order? If it does, the DTD provides the means
for the memo to be expressed in a variety of formats determined by the
appropriate stylesheet. In this example, the ‘Memo’ DTD might have separate
The element may have attributes associated with it – in terms of the encoding
system used for instance, or the type of data that appears in that element. For
example the ‘To’ element could be defined by the following statement:
<!ELEMENT To (#PCDATA)>
This indicates that the ‘To’ data element consists of Parsed (parsable)
Character data (#PCDATA).
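Putting such declarations together, a DTD for the memo example might be sketched as follows; apart from ‘To’, the element names here are assumptions for illustration rather than a published DTD:

```xml
<!-- a Memo consists of these five elements, in this order -->
<!ELEMENT Memo    (To, From, Date, Subject, Body)>
<!ELEMENT To      (#PCDATA)>
<!ELEMENT From    (#PCDATA)>
<!ELEMENT Date    (#PCDATA)>
<!ELEMENT Subject (#PCDATA)>
<!ELEMENT Body    (#PCDATA)>
```

A document claiming to be a memo is valid against this DTD only if it contains exactly these elements in the declared order.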
XML schemas
An alternative way of defining metadata is to use XML schemas. They offer
greater flexibility than DTDs and are widely used for expressing metadata
standards. Schemas are XML languages used for defining similar types of
document in terms of their structure, content and meaning. The W3C website
defines them in the following terms:
XML Schemas express shared vocabularies and allow machines to carry out rules
made by people. They provide a means for defining the structure, content and
semantics of XML documents. (Sperberg-McQueen and Thompson, 2014)
They are described using XSDL (XML Schema Definition Language). The
following extracts are from an example of an XML schema that defines simple
Dublin Core metadata elements (Cole et al., 2008).
The start of the schema contains declarations about the nature of the
schema, including two namespace references ‘xmlns’.
This is followed by annotations from the authors about the background to the
schema and then a namespace reference to the standard for XML.
<xs:annotation>
  <xs:documentation xml:lang="en">
    DCMES 1.1 XML Schema
    XML Schema for the http://purl.org/dc/elements/1.1/ namespace
    Created 2008-02-11
    Created by
      Tim Cole ([email protected])
      Tom Habing ([email protected])
      Jane Hunter ([email protected])
      Pete Johnston ([email protected])
      Carl Lagoze ([email protected])
    This schema declares XML elements for the 15 DC elements
    from the http://purl.org/dc/elements/1.1/ namespace.
    It defines a complexType SimpleLiteral which permits mixed content
    and makes the xml:lang attribute available. It disallows child
    elements by use of minOccurs/maxOccurs.
    However, this complexType does permit the derivation of other
    complexTypes which would permit child elements.
    All elements are declared as substitutable for the abstract element
    any, which means that the default type for all elements is
    dc:SimpleLiteral.
  </xs:documentation>
</xs:annotation>
Schemas are commonly associated with databases, where each data element
corresponds to a field in a database. As with databases, the schema can be set
up to provide semantic and syntactic checks on data. In other words, checks
on the meaning and grammar of an expression can be made. Syntactic checks,
for example, can be applied to the data to ensure that it is of the appropriate
type and is expressed in a format that can be processed by the database
software. For example, dates can be defined using international standard ISO
8601:2004 to get over the problem of differing American and British date order
(ISO, 2004c). For instance, ‘10/12/17’ means ‘10th December 2017’ in Britain
and ‘October 12th 2017’ in the USA. Schemas can also apply semantic checks
to ensure that business rules are followed by requiring the value of an element
(the field content) to fall within a specified range. For instance, the value of
the month element in the data should be between 1 and 12. The
www.schema.org website offers a resource for sharing schemas of this type
and this is described in more detail in Chapter 12.
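These two kinds of check can be illustrated with a short sketch. The function names below are hypothetical; the syntactic check relies on Python's built-in ISO 8601 parser, and the semantic check applies the month business rule described above.

```python
from datetime import date

def check_date(value: str) -> date:
    """Syntactic check: the value must parse as an ISO 8601 date
    (YYYY-MM-DD), avoiding the ambiguity of a form like '10/12/17'."""
    return date.fromisoformat(value)  # raises ValueError if malformed

def check_month(value: int) -> bool:
    """Semantic (business-rule) check: a month number must fall in 1-12."""
    return 1 <= value <= 12

print(check_date("2017-12-10"))  # unambiguously 10 December 2017
print(check_month(13))           # False
```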
Namespace
A namespace is used to locate the definitions of a metadata schema on the internet. This
ensures greater consistency in the terminology used to define metadata elements and
provides a way of sharing elements. In the Dublin Core example the namespace that
provides the original reference to Dublin Core elements is as follows:
xmlns='http://purl.org/dc/elements/1.1/'
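The effect of a namespace declaration can be seen when generating metadata programmatically. The sketch below, using Python's standard xml.etree.ElementTree module, binds the dc prefix to the Dublin Core namespace; the record wrapper element and the title value are illustrative.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)  # bind the conventional 'dc' prefix

# An illustrative wrapper element containing one namespaced DC element.
record = ET.Element("record")
title = ET.SubElement(record, f"{{{DC}}}title")
title.text = "The Hound of the Baskervilles"

xml_out = ET.tostring(record, encoding="unicode")
print(xml_out)
```

Because the namespace URI travels with the element, any application receiving this record can recognise dc:title as the Dublin Core title element, whatever prefix is used locally.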
Databases of metadata
The previous section about the mark-up of documents focused particularly
on embedded metadata. For example, a web resource may have metadata
tags and content embedded in the resource. Electronic documents and other
digital materials often have embedded metadata, allowing other applications
and systems to effectively process them. However, this is not the only way of
handling metadata. In many systems the metadata may be held separately in a
database.
Databases of metadata may be generated at the point of creation of
documents by Enterprise Content Management (ECM) systems, for instance.
ECM systems store the metadata about documents in a central database and
use this data to manage and handle the documents. This allows documents
to be brought forward for review, the workflow to be managed, and access
to be controlled. Institutional repositories and library management systems
operate in a similar fashion, working with central collections of metadata, the
library catalogue or repository database.
Word-processed documents
Applications that are used for preparing documents, such as word-processing
packages, automatically generate metadata when a document is saved for the
first time. In some cases systems can be configured to prompt the author for
metadata when a new document is saved. Metadata associated with the user
such as ‘Author’ and ‘Company’ may be automatically generated. This can be
edited and additional metadata can be added manually. A controlled
vocabulary can be a useful way of ensuring consistent retrieval of documents.
For instance, keywords selected from a thesaurus can be added as metadata
to enrich the subject description of the document.
The screenshot in Figure 2.3 on the next page shows a typical metadata
screen associated with a word-processed document. In Microsoft Word there
are additional tabs, ‘Statistics’ and ‘Contents’, which display metadata
such as document size, time spent editing it, the session number, and the
number of words. The final tab ‘Custom’ allows for additional optional data
associated with document and records management.
Library catalogues
Metadata is particularly useful for large collections of documents or other
materials, where it can be used for managing the resource and for finding
specific items. A catalogue becomes essential for retrieval when there are
more than a few hundred items in a collection. The arrangement of books in
a subject classification is not always sufficient for good subject retrieval. If a
book is about more than one subject, it can only be held in one place
physically. It may also be out on loan, which means that shelf browsing would
not identify it. Early examples of library metadata were held on catalogue
cards. In the 1960s electronic catalogues began to appear and are routinely
used in most libraries today. Library users or patrons can find books by
searching the catalogue by a variety of criteria such as author name, words
in the title, classification code (which determines the arrangement on the
shelves) and keyword (subject).
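Searching a collection of metadata records by different criteria can be sketched as a simple filter. The catalogue records and the search helper below are invented for illustration; a real library system would query an indexed database rather than scan a list.

```python
# Illustrative catalogue records; the metadata values are invented.
CATALOGUE = [
    {"title": "The Hound of the Baskervilles", "author": "Doyle, Arthur Conan",
     "classification": "823.912", "subjects": ["Sherlock Holmes", "Detective fiction"]},
    {"title": "A Study in Scarlet", "author": "Doyle, Arthur Conan",
     "classification": "823.912", "subjects": ["Sherlock Holmes"]},
    {"title": "Sapiens", "author": "Harari, Yuval Noah",
     "classification": "909", "subjects": ["World history"]},
]

def search(records, **criteria):
    """Return records whose metadata matches every criterion (case-insensitive)."""
    def matches(rec, fieldname, term):
        value = rec.get(fieldname, "")
        if isinstance(value, list):
            return any(term.lower() in v.lower() for v in value)
        return term.lower() in value.lower()
    return [r for r in records
            if all(matches(r, f, t) for f, t in criteria.items())]

hits = search(CATALOGUE, subjects="sherlock holmes")
print([r["title"] for r in hits])
```

Because every record carries the same elements, the same query works consistently across the whole collection, which is precisely the point made above.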
Figure 2.4 on the following page shows the results of a search on the subject
‘Sherlock Holmes’ in the catalogue from the City of Westminster Libraries,
London. In this example the metadata is used to store comparable data about
individual items in the collection. This allows users to search consistently
across the whole collection. Other metadata associated with items, such as
location (library branch), author and format are all metadata elements that
can be used to refine the results of the search. Within each item, other
metadata such as title, publication date, abstract and availability are
displayed.
The detailed record shows additional metadata elements, including ISBN,
subject terms, physical description and genre. An even more detailed system
record such as that shown in Figure 2.5 will be available to library staff, which
will contain administrative data such as date of acquisition, accession number
and the status of the item in collection management processes such as
labelling, repair and withdrawal.
Selecting one item from the results list provides a detailed catalogue entry for
that item. Figure 2.7 on the next page is the WorldCat entry with
supplementary information about holdings in participating libraries.
A closer look at individual records (Figure 2.9) shows the metadata about the
repository (repositories, being information resources themselves, are
described by metadata).
Image repositories
Image repositories such as iStockPhoto, Getty Images and Flickr use specialist
metadata as well as keywords to help people to retrieve images (Getty Images,
2017; iStockphoto LP, 2017; Yahoo! Inc., 2017). So, for instance, it is possible
to search Getty Images by the following criteria: image type, orientation,
number of people, colour, image size, age (of people in the image), people
composition, image style, ethnicity, photographers, and royalty-free
collections.
Conclusion
These examples of metadata are based on the principle that metadata may be
embedded in a digital object or held separately from the resource that it
describes. Mark-up languages such as XML provide a way of handling and
exchanging metadata. They also provide a means of describing metadata
standards.
CHAPTER 3
Data modelling
Overview
Metadata models help to give an understanding of the development of metadata
standards. The chapter starts with an overview of data modelling and its relationship to
metadata. It defines some of the terminology used to describe modelling languages,
using the Unified Modelling Language (UML) as an example. Systems such as RDF and
the ABC Ontology are discussed before considering domain-specific modelling
frameworks such as the Library Reference Model (LRM), indecs for the book trade and
OAIS for online exchange of information. Van Hooland and Verborgh (2014) talk about
four types of modelling: tabular, relational, meta mark-up and RDF. This chapter focuses
on the last two types.
Metadata models
A metadata standard is a type of data model which provides a way of
conceptualising the characteristics of an information resource. A data model
may have its own syntax (grammar) and semantics (meaning) and may be
expressible in a mark-up language such as XML (W3C, 2016). One of the
interesting aspects of the development of metadata standards has been the
convergence of different communities of interest. People have recognised the
benefits of working within common frameworks. In order to do so they have
adopted common languages for describing the data that they handle.
Languages such as XML and RDFa have played an important role in equipping
these communities with a set of tools to describe data and relationships
between data elements (Herman et al., 2015).
The book The Hound of the Baskervilles (subject) has creator (predicate)
Arthur Conan Doyle (object).
[Node and arc diagram: ‘The Hound of the Baskervilles’ connected by a ‘Creator’ arc to ‘Arthur Conan Doyle’.]
This structure is recursive, so that the subject of a triple can be another triple.
In other words, there can be metadata about the metadata. It is also possible to
chain statements to produce more detailed metadata records, as illustrated in
Figure 3.2. The author name is represented by a Uniform Resource Identifier
(URI) or an Internationalized Resource Identifier (IRI) with properties of ‘Name’
and ‘Lifespan’ associated with it. In the ‘node and arc’ diagram below the object
of the statement is itself a statement. ‘The Hound of the Baskervilles has creator
. . .’ leads to the statement ‘URI http://viaf.org/viaf/65283845 has name Arthur
Conan Doyle and has lifespan 1859–1930’.
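Triples and chained statements can be sketched directly as data. In the sketch below the book's URI is invented, while the author URI, name and lifespan come from the example above; dc:creator is the Dublin Core property.

```python
# Triples as (subject, predicate, object) tuples.
BOOK = "http://example.org/hound-of-the-baskervilles"  # illustrative URI
AUTHOR = "http://viaf.org/viaf/65283845"               # VIAF URI from the text

triples = [
    (BOOK, "dc:creator", AUTHOR),
    # Chained statements: further metadata about the object of the first triple.
    (AUTHOR, "name", "Arthur Conan Doyle"),
    (AUTHOR, "lifespan", "1859-1930"),
]

def describe(subject):
    """Gather every property-value pair asserted about one subject."""
    return {p: o for s, p, o in triples if s == subject}

print(describe(AUTHOR))
```

Following the object of the first triple into describe() reproduces the chaining shown in the diagram: the creator node is itself the subject of further statements.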
The mandated use of URIs in RDF makes it a powerful tool for creating
linked data. Examples of its use in open data initiatives can be found in
Chapter 13. The author statement in Figure 3.3 can be expressed in the
following terms:
[Diagram: the node http://viaf.org/viaf/181625655 linked by the predicate http://purl.org/dc/terms/creator to the node http://viaf.org/viaf/65283845.]
The opening RDF statement includes the RDF namespace declaration, which
refers to the specific URI. This allows multiple and consistent use of XML
resources, because different documents can refer to the same namespace. It
also ensures that an application can recognise and use the appropriate version
of RDF to interpret the statements that follow.
<rdf:Description rdf:about='http://viaf.org/viaf/181625655'>
  <dc:creator><rdf:Description
  rdf:about='http://viaf.org/viaf/65283845'></rdf:Description></dc:creator>
</rdf:Description>
Dublin Core
The Dublin Core metadata standard, which is described in Chapter 4, is a
widely used metadata standard for describing online resources (DCMI, 2012).
It is underpinned by a data model which can be represented in UML (Powell
et al., 2007). Figure 3.4 shows the DCMI Resource Model, which illustrates the
relationship between a described resource and its description and establishes that each
property-value pair contains one property and one value. For example, the
property ‘dc:creator’ might have the value ‘Arthur Conan Doyle’.
[Diagram: a described resource is a resource; it contains property-value pairs; each property-value pair contains one property and one value; a value may be a literal or a non-literal value.]
Figure 3.4 DCMI resource model (DCMI, 2012, licensed under CC BY 4.0)
For example, the work Sapiens: a brief history of humankind by Yuval Noah
Harari, published in the UK in 2014, was originally published in Hebrew in
[Diagram: a WORK is realized through an EXPRESSION, which is embodied in a MANIFESTATION, which is exemplified by an ITEM; the inverse relationships are realizes, embodies and exemplifies.]
Figure 3.5 Relationships between Work, Expression, Manifestation and Item (based on Riva,
Le Boeuf and Žumer, 2017)
Israel in 2011 and has been widely translated. Each translation could be seen
as a work or as an expression of the original Hebrew publication. If the
English version is seen as a work (with a relationship to the original Hebrew
work) an expression might be the edition published by Harvill Secker in
London in 2014. The hardback edition with ISBN 9781846558238 is a
manifestation of the English-language version. An example of that manifestation is
the item, which is the copy of the hardback English-language edition that is
on my bookshelf at home.
Figure 3.6 overleaf shows in general terms the relationship between a work,
expression, manifestation or item and the responsible agent (which could be
a person or a corporate body). The double-headed arrows indicate that there
may be multiple instances of a relationship between entities. For instance a
work is created by a person (or persons). The reverse relationship is that a
person creates a work (or works).
In the example title Sapiens, the work was created by Yuval Noah Harari.
The English edition in hardback was created, manufactured and distributed
by publisher Harvill Secker. The copy on my bookshelf is owned by me.
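The Work-Expression-Manifestation-Item chain for the Sapiens example can be sketched as nested records. The class and field names below are an illustrative simplification of the LRM entities, not an implementation of the standard.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    location: str                      # where this physical copy lives

@dataclass
class Manifestation:
    isbn: str
    carrier: str                       # e.g. hardback, paperback, e-book
    items: list = field(default_factory=list)

@dataclass
class Expression:
    language: str
    manifestations: list = field(default_factory=list)

@dataclass
class Work:
    title: str
    creator: str
    expressions: list = field(default_factory=list)

# The Sapiens example from the text, modelled top to bottom.
work = Work("Sapiens: a brief history of humankind", "Yuval Noah Harari")
english = Expression("English")
hardback = Manifestation("9781846558238", "hardback")
hardback.items.append(Item("author's bookshelf"))
english.manifestations.append(hardback)
work.expressions.append(english)

print(work.expressions[0].manifestations[0].isbn)  # 9781846558238
```

The one-to-many lists mirror the double-headed arrows in the model: a work may have many expressions, an expression many manifestations, a manifestation many items.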
[Figure 3.6: each of WORK, EXPRESSION and MANIFESTATION was created by an AGENT (and the agent created it); a MANIFESTATION is also distributed by and was manufactured by an AGENT; an ITEM is owned by and was modified by an AGENT.]
and Hunter, 2002). The model is intended to provide a basis for analysing
existing metadata ontologies, to give communities the tools to develop their
own ontologies and to provide a mechanism for mapping between metadata
ontologies. The ABC Ontology was developed to facilitate interoperability
between metadata ontologies from different domains. Its target is to ‘model
physical, digital and analogue objects held in libraries, archives and museums
and on the Internet’ (Lagoze and Hunter, 2002). This includes books, museum
objects, digital images, sound recordings and multimedia resources. It can
model abstract concepts such as intellectual content and time-based events
such as a performance or a lifecycle event that happens to an object such as
publication of a book. The model is based on a primitive category ‘Entity’,
with three categories at the next level: Temporality, Actuality and Abstraction.
The data elements used fall into four main categories, as shown here:
• ENTITY
– Time
– Place
• TEMPORALITY
– Situation
– Event
– Action
• ACTUALITY
– Artifact
– Agent
• ABSTRACTION
– Work
Each category has subcategories that allow for more precise descriptions of
the models. These in turn can be broken down into subclasses specific to a
particular domain, such as libraries, museums or web resources. The ABC
Ontology allows for modelling of time-dependent relationships, which are
particularly important in museums and archives (where the provenance of
an item is key to its integrity), rights management (where it is important to
track who has used a work under what conditions and when) and for events
such as a musical performance. Figure 3.7 on page 44 is a simple
representation of a publication using the ABC Ontology. This is a simplified
representation of part of the publishing process. The work Omeros by the late
Nobel Laureate Derek Walcott is expressed as a book. The book is manifest
as the edition published by Faber & Faber Ltd (Walcott, 1990). A more
complete representation of this would indicate the place of publication and
co-publishing details.
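One plausible reading of this event-centred description can be sketched as a set of statements. The identifiers (EV0, AG0, AG1, WK0, MN0) follow the style of the figure but are otherwise illustrative, as is the exact shape of the statements.

```python
# An event-centred sketch of the Omeros publication in ABC Ontology style.
statements = [
    ("EV0", "type", "authoring"),
    ("EV0", "atTime", "1990"),
    ("EV0", "hasParticipant", "AG0"),   # the author, Derek Walcott
    ("EV0", "hasParticipant", "AG1"),   # the publisher, Faber & Faber
    ("EV0", "creates", "WK0"),
    ("WK0", "title", "Omeros"),
    ("WK0", "hasRealisation", "MN0"),
    ("MN0", "format", "book"),
]

def props(entity):
    """List the property-value pairs asserted about one entity."""
    return [(p, o) for s, p, o in statements if s == entity]

print(props("WK0"))
```

Modelling the event (with its time and participants) as a first-class entity is what lets ABC capture the time-dependent relationships discussed above.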
[Figure 3.7: an authoring event, at time 1990, has participants AG0 (type: author) and AG1 (type: publisher); the event creates WK0, which has title ‘Omeros’ and has realisation MN0, which has format ‘book’.]
The indecs framework has defined metadata elements, each of which has an
indecs identifier or iid. The indecs framework can be used to model the
relationships between entities. Indecs is based on the premises that: ‘People
make stuff’, ‘People use stuff’ and ‘People do deals about stuff’ (Figure 3.8).
[Figure 3.8: people make stuff; stuff is used by people; people do deals about stuff.]
OAIS
[Diagram: a Producer submits material to the Archive and a Consumer retrieves it, under the oversight of Management and Preservation; an information package combines Content Information, Description Information and Package Information.]
[Diagram: the Producer delivers a Submission Information Package (SIP) to the OAIS, which stores an Archival Information Package (AIP); the Consumer sends a query, receives a results set, and places an order to obtain a Dissemination Information Package (DIP).]
Conclusion
This chapter described how standards for metadata developed along different
paths to fit in with the requirements of different communities. A number of
data modelling systems or frameworks have been developed for describing
metadata.
The ABC Ontology is a general framework for developing domain-specific
descriptions and provides a way of describing different ontologies using a
common language. The Resource Description Framework (RDF) is a way of
modelling and describing metadata and can be expressed in a number of
languages, including HTML, XML, and Turtle (Terse RDF Triple Language).
Its syntax is based on triples, subject – predicate – object, so that, for example,
the book The Hound of the Baskervilles (subject) has creator (predicate) ‘Arthur
Conan Doyle’ (object). A third model, the Library Reference Model (LRM), is
more specific, providing a framework for describing products of intellectual
and artistic effort, such as books and sound recordings. Indecs, the fourth
modelling system described, focuses on the entities and transactions that
occur in a commercial publishing environment. OAIS, an information model
for digital archives, is used for online exchange of data.
CHAPTER 4
Metadata standards
Overview
This chapter looks at the structure of metadata standards. Understanding the way in
which standards are created and how they are constructed gives us an insight into their
use and potential applications to different situations. Metadata standards have arisen in
different collections and user communities and some of the main metadata standards
are described. This is not intended to be a comprehensive survey but rather an overview
of the range of metadata standards on offer with pointers to further information about
specialist standards. It starts with a description of Dublin Core, which is probably the
most widely used standard, partly because of its simplicity and partly because it was
designed for web resources. It illustrates some of the main features of metadata
standards and is itself used as the basis for specialist standards and application profiles.
The chapter goes on to consider some of the standards that are used in the library and
information field for bibliographic materials and social media. It also describes metadata
standards used for non-textual materials.
About standards
Metadata standards provide a framework for analysing information resources
so that they can be effectively managed and easily retrieved. The standards
identify the characteristics of a resource that are recorded and sometimes
specify the way in which the metadata content is created (encoding schemes
are covered in Chapter 12). In effect the metadata standard provides
containers for particular types of information. For instance, in Dublin Core
the dc:creator data element, which is defined in the standard for Dublin Core
Metadata Element Set (ISO, 2009b), provides a place for the name of the author or
organisation responsible for the creation of that resource. The resource could
be a web page or it could be a book catalogued in a library, for instance.
the Google search page, or ‘Library of Congress Home’ for the Library of
Congress home page.
• Type – The dc:type data element describes the ‘nature or genre of the
resource’. This might be: ‘web page’, ‘text’ etc. The file format or physical
medium would be described in the dc:format data element.
Some of the above examples show how a data element can be made more
specific by adding a qualifier. This is known as a ‘refinement’. So for instance
the dc:coverage data element can have a temporal or a spatial refinement,
appearing thus: dc:coverage.temporal or dc:coverage.spatial.
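Refinements can be recorded simply by using the dotted element names as keys. In the sketch below the metadata values are invented; it shows how a search on the parent dc:coverage element can fall back to matching any of its refinements.

```python
# Qualified Dublin Core elements in the dotted 'refinement' style shown above.
metadata = {
    "dc:title": "London guidebook",        # illustrative values
    "dc:coverage.spatial": "London, UK",
    "dc:coverage.temporal": "2017",
}

# A refinement narrows its parent element, so a query on dc:coverage can
# match any dc:coverage.* qualifier.
coverage = {k: v for k, v in metadata.items() if k.startswith("dc:coverage")}
print(coverage)
```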
Different communities have emerged in the Dublin Core domain and they
have developed further data elements that extend the Dublin Core Metadata
Element Set (DCMI, 2012). Dublin Core is adaptable and supports the
development of application profiles. A Dublin Core Application Profile
(DCAP) combines Dublin Core elements with data elements from other
metadata standards and namespaces, and with specialised vocabularies, to
create a standard for specific applications or requirements (Coyle and Baker,
2009; Malta and Baptista, 2014). A DCAP is defined by:
• functional requirements
• domain model
• description set profiles and usage guidelines
• syntax guidelines and data formats.
• title
• statement of responsibility
• edition statement
• numbering of serials
• production statement
• publication statement
• distribution statement
• manufacture statement
• copyright date
• series statement
• identifier for the manifestation
• carrier type
• extent.
MARC 21
The MARC standard (MAchine-Readable Cataloguing) emerged in the 1960s,
when libraries needed an efficient method of generating multiple catalogue
cards for each item. The advent of computerised processing of data allowed
for single entry of cataloguing details for multiple outputs for the author-title
catalogue and for the classified and subject catalogues. Individual catalogue
records were marked up with field designators to indicate the content of each field.
• bibliographic data
• holdings data
• authority data
• classification data
• community information.
The 1XX, 4XX, 6XX, 7XX and 8XX tags in the bibliographic format can be
modified by the following digits to give them a more specific meaning:
For instance, the tag 100 contains a personal name – such as an author name –
whereas 110 indicates a corporate author.
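The tag-plus-modifier pattern can be sketched as a small lookup. The mapping below covers the name-heading digits mentioned here (00 personal, 10 corporate), plus 11 for meeting names and the 1XX main entry / 7XX added entry distinction, which are drawn from the MARC 21 bibliographic format rather than this excerpt; a full implementation defines many more tags.

```python
# Partial lookup for MARC 21 name-heading tags: the base digit gives the
# function of the field and the final two digits give the kind of name.
NAME_KIND = {
    "00": "personal name",
    "10": "corporate name",
    "11": "meeting name",
}
BASE_ROLE = {
    "1": "main entry",
    "7": "added entry",
}

def describe_tag(tag: str) -> str:
    base, suffix = tag[0], tag[1:]
    kind = NAME_KIND.get(suffix, "unknown")
    role = BASE_ROLE.get(base, "other")
    return f"{role}: {kind}"

print(describe_tag("100"))  # main entry: personal name
print(describe_tag("710"))  # added entry: corporate name
```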
BIBFRAME
BIBFRAME, the Bibliographic Framework, is described in the following terms:
• general properties
• category properties
• title information
• work identification information
• work description information
• subject term and classification information
• instance description statements
• instance identification information
• instance description information
• carrier description information
• item information
• type information
• cataloguing resource relationships – general
• cataloguing resource relationships – specific
• cataloguing resource relationships – detailed
• agent information
• administration information.
• titleInfo
• name
• typeOfResource
• genre
• originInfo
• language
• physicalDescription
• abstract
• targetAudience
• note
• subject
• classification
• relatedItem
• identifier
• location
• part
• extension
• recordInfo.
Some of these elements are container tags with no content, but serve to group
together sub-elements. For example <titleInfo> is a container tag with the
following sub-elements, which do contain data:
<title>
<subTitle>
<partNumber>
<partName>
<nonSort>
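Container tags like <titleInfo> can be generated with any XML library. The sketch below uses Python's standard ElementTree module; the title is the Conan Doyle example used elsewhere in this chapter, chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

# Build the <titleInfo> container with two of its sub-elements.
title_info = ET.Element("titleInfo")
ET.SubElement(title_info, "nonSort").text = "The"       # leading article, ignored in sorting
ET.SubElement(title_info, "title").text = "Hound of the Baskervilles"

xml_out = ET.tostring(title_info, encoding="unicode")
print(xml_out)
```

The container element itself carries no text; it exists only to group the sub-elements, exactly as described above.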
For instance, the following MODS metadata refers to the English translation
of a text that was originally in another language (Portuguese):
KBART
KBART (Knowledge Bases and Related Tools) is a link-resolving system that
enables libraries to link to appropriate copies of electronic publications such
as e-journal articles (NISO/UKSG KBART Working Group, 2010). The system
The elements of description (fields or data elements) that are applied to the
different levels of archives are grouped together as follows:
Although there are rules for the creation of the content of these data elements,
some of them are more narrative in nature and do not use controlled
vocabularies or specific syntactic encoding schemes. Elements such as
Reference code are based on international country codes followed by a
national reference number and a local reference code and there are clear
guidelines for generating the content of other fields as well.
Social media
There are several standards widely used for online and social media. Viewing
page information by right-clicking in a browser reveals structured metadata
embedded in the page. Some of the more common ones are briefly described
here.
FOAF
The FOAF (friend of a friend) ontology is:
<foaf:Person rdf:ID="me">
  <foaf:name>David Haynes</foaf:name>
  <foaf:title>Dr</foaf:title>
  <foaf:givenname>David</foaf:givenname>
  <foaf:family_name>Haynes</foaf:family_name>
  <foaf:schoolHomepage rdf:resource="www.city.ac.uk"/>
</foaf:Person>
• og:title
• og:type
• og:image
• og:url
• og:audio
• og:description
• og:determiner
• og:locale
• og:site_name
• og:video
Twitter hashtags
Hashtags are widely used on social networks and microblogging sites to
enrich the metadata associated with a posting on social media. For
example, Twitter users can include hashtags in tweets to signify their content
or context. These are indexed by Twitter and are picked up as trending topics.
This allows a conference organiser to publicise a hashtag that all attendees at
a conference can use when tweeting about the conference. For example, the
ISKO UK biennial conference in 2017 had the hashtag #ISKOUK2017.
Although there is no formal control of hashtags on most social media sites,
they serve a useful purpose and continue to be a very popular way of
publicising events and marking specific topics for wider attention.
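Extracting hashtags is a one-line pattern match. The sketch below uses the conference hashtag from the text; the word-character pattern is a simplification of what Twitter actually accepts.

```python
import re

def hashtags(text: str) -> list:
    """Pull hashtag terms out of a post (simplified word-character pattern)."""
    return re.findall(r"#(\w+)", text)

print(hashtags("Great talks today at #ISKOUK2017 - see you tomorrow! #metadata"))
```

Because hashtags are uncontrolled free text, the same topic may appear under several variant tags; the indexing that makes them useful is done by the platform, not by any formal vocabulary.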
Non-textual materials
Standards such as MARC and MODS, and indeed Dublin Core, can be
applied to non-textual materials (Weber and Austin, 2011). However, there
are standards that have been developed specifically for non-textual materials,
such as VRA Core, MIX, and IIIF, which are described here.
VRA Core
The VRA Core describes works of art, cultural objects and their images, and
is maintained by the Library of Congress (Visual Resources Association, 2015).
VRA Core metadata can be embedded in METS (Metadata Encoding and
Transmission Standard) documents. The VRA Core has the following fields:
• technique
• textref
• title
• worktype
• intellectual content
• intellectual property
• extensions
• instantiation.
JPEG2000
JPEG2000 is a group of standards for image coding and compression. Part 2
of the standard deals with extensions to the coding data including metadata
associated with an image (ISO, 2004a). The metadata is divided into four
categories:
• image creation metadata – how the image was created, e.g. the camera and
lens settings
• content description metadata – the subject of the image, what it was about
• history metadata – what processing was done to the image to reach its
final form, and/or links to previous versions of the image
• intellectual property rights metadata – information about the rights owners,
etc.
EXIF
EXIF, developed by the Japan Electronic Industries Development Association,
overlaps in coverage with JPEG2000 but is independent of format (so that it
can also be used for TIFF image files, for instance). EXIF metadata can be
embedded in a JPEG or TIFF image. Figure 4.2 shows the general areas of
overlap between EXIF and two other commonly used metadata formats that
are used for images (Metadata Working Group, 2010, 21).
[Figure 4.2: overlapping coverage of EXIF and other image metadata formats, with elements including GPS, orientation, copyright, description, creator, date/time, rating, keywords and location.]
Figure 4.3 IIIF object (source: IIIF Consortium, 2017)
Figure 4.4 Relationships between IIIF objects (source: IIIF Consortium, 2017)
• descriptive properties
• rights and licensing properties
• technical properties
• linking properties
• paging properties.
• Manifest – The manifest resource represents a single object and any intellectual work or works embodied within that object.
• Sequence – The sequence conveys the ordering of the views of the object.
• Canvas – The canvas represents an individual page or view and acts as a central point for laying out the different content resources that make up the display.
• Image resources – Association of images with their respective canvases is done via annotations. Although normally annotations are used for associating commentary with the thing the annotation’s text is about, the Open Annotation model allows any resource to be associated with any other resource, or parts thereof, and it is reused for both commentary and painting resources on the canvas.
• Annotation list – For some objects, there may be more than just images available to represent the page. Other resources could include the full text of the object, musical notations, musical performances, diagram transcriptions, commentary annotations, tags, video, data and more. These additional resources are included in annotation lists, referenced from the canvas they are associated with.
• Range – It may be important to describe additional structure within an object, such as newspaper articles that span pages, the range of non-content-bearing pages at the beginning of a work, or chapters within a book. These are described using ranges in a similar manner to sequences.
• Layer – Layers represent groupings of annotation lists that should be collected together, regardless of which canvas they target, such as all of the annotations that make up a particular translation of the text of a book.
• Collection – Collections are used to list the manifests available for viewing, and to describe the structures, hierarchies or curated collections that the physical objects are part of. The collections may include both other collections and manifests, in order to form a hierarchy of objects with manifests at the leaf nodes of the tree.
• Paging – In some situations, annotation lists or the list of manifests in a collection may be very long or expensive to create. The latter case is especially likely to occur when responses are generated dynamically. In these situations the server may break up the response using paging properties.
• segments
• embedded content
• choice of alternative resources
• non-rectangular segments
• style
• rotation
• comment annotations
• hotspot linking.
The standard is detailed, with a core structure (data elements listed below)
and an extended structure with more detailed information on intellectual
property and some technical details.
City
Copyright notice
Country
Country code
Creator
Creator’s contact Info
• Address
• City
• Country
• E-mail address
• Phone number
• Postal codes
• State/Province
• Web URL
Complex objects
Other standards such as METS and OAI-PMH also allow for metadata
exchange between repositories and this is discussed in Chapter 6 on
information retrieval.
OAI-PMH
The OAI-PMH (Open Archives Initiative – Protocol for Metadata Harvesting)
standard provides a framework for metadata discovery (Lagoze et al., 2002).
This enables service providers to ‘harvest’ metadata from other metadata
stores such as institutional repositories to create a searchable index and
repository. Some services also harvest content from other repositories to
facilitate faster retrieval. For instance, an institutional repository may collect
metadata in a variety of formats: authors inputting Dublin Core metadata,
MARC records from the library, LOM metadata from the institutional VLE
and DDI records from electronic publications. The institutional repository may also
make METS records available to external services. Figure 4.5 illustrates the process.
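The harvesting exchange itself is a plain HTTP request returning XML. A sketch, assuming a hypothetical repository base URL; the `verb` and `metadataPrefix` arguments and the namespaces follow the OAI-PMH 2.0 specification:

```python
# Sketch of an OAI-PMH harvesting exchange. The repository base URL is
# hypothetical; the verb, arguments and namespaces follow OAI-PMH 2.0.
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "https://repository.example.org/oai"  # hypothetical endpoint

def list_records_url(metadata_prefix="oai_dc"):
    """Build the URL a harvester would fetch to list records."""
    return BASE_URL + "?" + urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix})

# A much-abbreviated example of the XML a repository might return.
SAMPLE_RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Thesis</dc:title>
          <dc:creator>Smith, J.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

root = ET.fromstring(SAMPLE_RESPONSE)
titles = [t.text for t in
          root.iter("{http://purl.org/dc/elements/1.1/}title")]
print(list_records_url())
print(titles)  # ['An Example Thesis']
```

The service provider would repeat such requests (following any paging tokens) to build up its searchable index.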
Dublin Core provides a ‘common currency’ for exchange of data about
internet resources. However, many consider it too crude for dealing with the
type of bibliographic material held in institutional repositories. MODS is
based on MARC and provides more detailed bibliographic data. It can be
generated from MARC records, such as those held on library management
systems, and so facilitate exchange of data between systems. Figure 4.6
illustrates the relationship between institutional repositories and resource
discovery systems. The service provider builds up a database of metadata
(and sometimes the resources themselves) harvested from institutional
repositories. It serves queries to the institutional repositories when an item is
Haynes 4th proof 13 December 2017 13/12/2017 15:37 Page 72
(Figure 4.6: the institutional repository supplies Dublin Core records, together with administrative, structural and preservation metadata.)
IEEE LOM and its derivatives (such as ANZ-LOM) define data elements to
describe a learning object (IEEE Computer Society, 2002). These elements are
arranged into nine categories: general, lifecycle, meta-metadata, technical,
educational, rights, relation, annotation and classification. Learning objects
themselves are intended to offer:
• accessibility
• interoperability
• durability
• re-usability.
The IEEE Learning Object Metadata standard (IEEE Computer Society, 2002)
and similar systems were intended to allow interchangeability of course
material. It was based on an assumption that it should be possible to construct
courses from pre-existing units and course material rather that writing from
scratch. In practice that objective has eluded educators. This may be because
course material is much more dynamic than many people acknowledged.
Most academics update their material at least annually. They also work hard
to make each course a coherent whole rather that an accumulation of
disjointed elements. Where things have changed is the growth of online
learning environments – particularly the freely available courses such as
Coursera, EdX and the Khan Academy. These are examples of MOOCs –
Massive Open Online Courses. Universities and consortia of universities have
set up these online courses, covering a wide range of subjects. A variety of
metadata on the courses is available for searching or display. This means it is
possible to search or navigate by subject, level and institution. Examination
of the landing pages of courses from the three largest providers reveals
extensive use of social media metadata such as Open Graph, Facebook and
Twitter. However, Miranda and Ritrovato (2014) identified Dublin Core, IMS
and IEEE LOM as widely used metadata standards for course material.
Conclusion
This chapter has provided an overview of some of the most commonly used
metadata standards in different domains of activity. It shows the relationships
between metadata standards and the domains in which they are used. The
choice of standards was also dictated by the use of widely accepted standards
as the basis for derived standards or application profiles based on national
need. Dublin Core was introduced as a general-purpose standard, even
though it was designed primarily to describe online resources and web pages
specifically. We then considered standards that are applicable to LIS work –
such as KBART, RDA, MARC 21 and MODS – and to archives – ISAD(G) and
EAD. Metadata standards such as FOAF, the Open Graph Protocol and
Twitter hashtags were discussed in the context of social media standards. An
overview of standards for non-textual resources included VRA Core, MIX,
PBCore, JPEG and EXIF. Finally, complex objects that might include materials
in a variety of digital formats were covered by METS, OAI-PMH (for
exchange of metadata) and learning object metadata such as IEEE LOM.
Although some standards have persisted for a long time with only minor
amendments – Dublin Core since 2006, IEEE LOM since 2002 – other
significant changes are afoot. At the time of writing RDA had been
implemented in several national libraries, including the British Library and
the Library of Congress. Other national libraries and many academic libraries
are in the process of implementing RDA (RDA Steering Committee, 2017).
Long-established mark-up systems such as MARC were under scrutiny
with the proposed BIBFRAME replacement being developed at the Library
of Congress.
PART II
Purposes of metadata
One of the organising principles of the first edition of this book was that
metadata could be categorised by purpose. The original five purposes
reflected the preoccupations of information professionals in the early 2000s.
Many of these purposes have stood up to scrutiny and Part II builds on that
model, but with six purposes. This part of the book starts with resource
identification and description (Chapter 5) as before. Chapter 6 looks at
information retrieval and the impact that metadata has on it. It necessarily
discusses retrieval theory, moving beyond the measures of precision and recall
that were discussed in the first edition. This part then moves on to ‘Managing
information resources’ (Chapter 7) and looks at the role of metadata in
managing the information lifecycle. Chapter 8 considers intellectual property
rights, including provenance. Previously there was a chapter on e-commerce
and this has been developed into a description of the role of metadata
supporting e-commerce and e-government (Chapter 9). It is illustrated with
examples from the book trade (ONIX), e-learning environments and research
data (including ‘big data’). The final chapter in Part II is about information
governance (Chapter 10), dealing with ethical and regulatory issues. Risk is
used as a lens through which to view regulation and governance.
CHAPTER 5
Resource identification and description
(Purpose 1)
Overview
The chapter begins with a discussion of resource identifiers. It then considers how
resource description is used to distinguish between different information resources.
Some widely used identifiers such as ISBNs, DOIs, ISSNs, ISTCs and ISANs are described.
‘Description’ underpins other purposes such as retrieval and rights management. The
chapter then looks at other metadata used for describing information resources by
considering in turn: title; creator; bibliographic citation; date; format; and description.
Identifiers
A fundamental requirement of any description system is to have a way of
uniquely identifying an item, so that it is clear what is being described. This
is a particular concern in online records management, where there are
different levels of aggregation. Is the resource being described as a single
document, a series of documents on a particular topic, or a collection of items?
For a small collection, an identifier could simply be the title of a book or piece
of music. However, with even quite modest collections ambiguity becomes a
significant issue, as when two different books share the same title, or where
the same work may have different versions of the title (as with translated
works). Identifiers such as ISBNs can be used to distinguish between them,
although because ISBNs are assigned to manifestations rather than works,
they are not a reliable way of disambiguating two titles. Additional metadata
such as Author would be needed to distinguish between two works. For
instance, a search for the book title The Outsider in a public library catalogue
might retrieve the following items:
An ISBN alone may not be sufficient to identify an item. The first item in the
above listing predates the ISBN system, for instance. The last item was published
before the 13-digit ISBNs came into effect. It may also be necessary to distinguish
between several copies of a title in a lending library or between individual items
of stock in a publisher’s warehouse. An identification system can be used for
this as well. In both instances the identifier should be unique at some level (title,
edition, or item for instance) and unambiguous. Some works may have several
identifiers such as different identifiers for hardback and paperback editions of
a book. Translations present a particular problem because a translation could
be regarded as a separate work as well as an expression of the work in the
original language. It is important to understand what is being identified. The
FRBR model for bibliographic items allows for different levels of granularity of
information resource based on a multi-layer model comprising: work,
expression, manifestation and item.
In the above example we see identifiers applied at each level from the author–
title catalogue entry for the work, to the manifestations of that work
represented by ISBNs. A digital object identifier (DOI) could play the same
role as identifier for a digital item. At the item level a library accession number
would identify an individual copy.
URLs (uniform resource locators) are commonly used to identify web
pages; they are used throughout this book, for instance, to provide a reference
trail for those seeking further information or background about specific
topics. However, URLs describe the location of an electronic resource on the
internet. In most cases this happens to coincide with the actual resource and
so is effectively used as though it were a resource identifier. However,
websites change and the content at a particular address may disappear or be
replaced. This is one reason for giving the date accessed when citing a URL.
In other words, URLs are not necessarily persistent. The concept of a Uniform
Resource Identifier (URI) incorporates Uniform Resource Names (URNs) and
Uniform Resource Locators (URLs) (Berners-Lee, Fielding and Masinter,
2005). A URI may be a URN, which identifies a specific resource but not how
to access it, and/or a URL, such as a web address, which points to a specific
location on the internet. In other words, an ISBN and a URL are both
examples of URIs.
Name authorities – for many years libraries have developed name
authorities following AACR2, and now RDA (Joint Steering Committee for
Development of RDA, 2014). Archivists have also developed a system for a
name authority, ISAAR (CPF) for Corporate Bodies, Persons and Families
(International Council on Archives, 2004), which is used in several countries.
Name authorities ensure the consistency of catalogues and help to eliminate
ambiguity, one of the reasons for identification systems.
A UUID (universally unique identifier) offers a way of generating an identifier without reference to a central registration authority, for example:
a4739b4f-2077-4594-b8b3-2cfa70a41d5d
(Note: This code was automatically generated from the UUID Generator
website: www.uuidgenerator.net/. Each character (0-9 and a-f) represents
a hexadecimal (base-16) digit.)
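The same kind of identifier can be generated programmatically rather than from a website; a minimal sketch using Python's standard library, where `uuid4` produces a random version-4 UUID of the form shown above:

```python
# Generate a random (version 4) UUID with the Python standard library,
# equivalent to the value obtained from the generator website above.
import uuid

identifier = uuid.uuid4()
print(identifier)          # e.g. a4739b4f-2077-4594-b8b3-2cfa70a41d5d (random each run)
print(identifier.version)  # 4
```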
EAN codes are used for physical objects and consist of a 13-digit code which
can be used to create a barcode (GS1, 2015). The EAN has retained its former
acronym, which stood for European Article Number, even though it is now
international in scope. These are the barcodes commonly seen on books and
magazines, and in the case of books correspond to the 13-digit ISBN, which
starts with the 978 (or, more recently, 979) EAN prefix.
Each format and edition of a published title will have its own ISBN, and the
e-book will have a distinct and separate ISBN. For example, each hardback
edition of Harry Potter and the Chamber of Secrets would have its own ISBN, as
would each paperback edition. However, ISBNs can be mis-assigned or can
be inadvertently re-used, so they are not always reliable identifiers. ISBNs
originally consisted of a 10-digit code. Since 2007 ISBNs have consisted of a
13-digit code with the following elements (International ISBN Agency, 2014):
• prefix element (978 or 979)
• registration group element
• registrant element
• publication element
• check digit.
Each ISBN assignment is also accompanied by descriptive metadata about
the publication, including:
ISBN
Product form
Title
Series
Contributor
Edition
Language(s) of text
Imprint
Publisher name and contact details
Country of publication
Publication date
ISBN of parent publication
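The final digit of a 13-digit ISBN is a check digit computed from the preceding twelve, which are weighted alternately 1 and 3. A sketch; the sample ISBN 978-0-306-40615-7 is a commonly cited valid example, not one drawn from this book:

```python
# Compute the ISBN-13 check digit: weight the first 12 digits alternately
# 1 and 3, sum them, and take the amount needed to reach a multiple of 10.
def isbn13_check_digit(first_twelve: str) -> int:
    total = sum(int(d) * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(first_twelve))
    return (10 - total % 10) % 10

def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a 10-digit ISBN by prefixing 978 and recomputing the check."""
    body = "978" + isbn10.replace("-", "")[:9]
    return body + str(isbn13_check_digit(body))

print(isbn13_check_digit("978030640615"))  # 7 -> ISBN 978-0-306-40615-7
print(isbn10_to_isbn13("0-306-40615-2"))   # 9780306406157
```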
All DOIs start with the prefix 10, from the Handle System, followed by an
alphanumeric string (letters and digits) of any length that identifies the
registrant organisation (Sun, Lannom and Boesch, 2003). A forward slash
separates the prefix from the suffix, which is assigned to the entity or digital
object itself.
The suffix may incorporate existing identifiers such as ISBNs. Once assigned,
a DOI is persistent – in other words it does not change, even if the ownership
changes.
The DOIs are based on three components: resolution, metadata and policy.
A DOI can be resolved into associated values such as URLs, other DOIs and
other metadata. A digital object with a DOI may have an associated URL, an
internet location (which is not necessarily persistent). The entity associated
with the DOI can be moved to another internet location or URL without the
need to change the DOI. The DOI can be resolved into multiple values, as we
see in the following example of four sets of associated data:
DOI: 10.1004/123456
URL: www.pub.com
URL: www.pub2.com
DLS: loc/repository
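Because the prefix and suffix are separated by a forward slash, a DOI can be split mechanically, and a resolvable address formed by prepending the doi.org resolver. A sketch using the illustrative DOI above:

```python
# Split a DOI into its registrant prefix and object suffix, and form the
# standard doi.org resolver URL. The DOI below is an illustrative example.
def split_doi(doi: str):
    doi = doi.removeprefix("DOI:").strip()
    prefix, _, suffix = doi.partition("/")
    return prefix, suffix

def resolver_url(doi: str) -> str:
    prefix, suffix = split_doi(doi)
    return f"https://doi.org/{prefix}/{suffix}"

print(split_doi("DOI:10.1004/123456"))     # ('10.1004', '123456')
print(resolver_url("DOI:10.1004/123456"))  # https://doi.org/10.1004/123456
```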
An ISSN record describing a serial publication includes metadata elements
such as:
title
frequency
publisher’s name
medium, etc.
The ISAN consists of 16 hexadecimal digits, the first 12 of which are unique
to each audiovisual work and the remaining 4 being reserved for part
numbers. Machine-readable versions of the number have a check digit added.
The ISAN is applied to a work and all its manifestations in different media,
unlike ISSNs, which are unique to each form of a serial. A proposed
development is V-ISAN, which will incorporate information about the version
of the audiovisual work.
The ISMN (International Standard Music Number), used for notated music,
is a 13-digit number beginning 979-0. This conforms with the EAN system,
which means that music publications can have barcodes based on the ISMN.
The ISTC (International Standard Text Code) is made up of four elements:
• registration element
• year element
• textual work element
• check digit.
The ISTC provides a way of identifying text which may be incorporated into
a serial publication or which may be manifest as a book. For instance, a work
with its own ISTC may correspond to several publications (manifestations of
the work), each with its own ISBN (Figure 5.1 opposite).
(Figure 5.1: a work by an author, with its own ISTC, realised as several publications, each with its own ISBN.)
Another form of persistent identifier is the ARK (Archival Resource Key), for example: http://ark.cdlib.org/ark:/13030/ft4w10060w
Describing resources
The six-point model of the purposes of metadata introduced in Chapter 1
started with resource description, which is the most fundamental of all
metadata purposes. It has its origins in the emergence of library catalogues
and at its most basic is a way of identifying works. Adequate description is
an essential prerequisite for information retrieval and resource discovery
(Chapter 6). It also underpins the other applications of metadata. Without a
way of identifying and describing a resource, it is impossible to use the
associated metadata for other purposes.
For example, in a web search (known generically as resource discovery or
information retrieval), some kind of description is needed for retrieved items
to evaluate the search results and to have an idea of whether the required
item has been retrieved. Another example would be in a library. A search of
a library catalogue that only yielded accession numbers would not be useful
for most searchers. Descriptive data such as the title, author or the format of
the item would normally be needed in order to evaluate each item, make a
decision about its relevance and so decide whether to order, borrow, reserve
or consult it.
A single data element may not be sufficient to distinguish between items.
A search for the author ‘Maya Angelou’, for instance, would probably bring
up several works. In order to select the appropriate item, a wider description
than just the name of the author would be needed to assess its relevance. The
book title I Know Why the Caged Bird Sings may then provide the additional
descriptive information that helps a reader to evaluate the retrieved item for
relevance and enables that person to distinguish between it and other books
by the same author such as Gather Together in my Name, or The Heart of a
Woman.
It may not always be clear how complete a description is needed in a given
situation. One extreme would be to use the entire item as the description. So
for instance, the entire text of a book could be used to describe the contents
of the book. In effect this is what happens with web pages or repositories of
electronic journals or e-books. The entire text is available for searching.
However, even this may not be complete, because it will not include metadata
elements that describe its context or what has happened to it during its life.
It also may not include external, independent descriptions of the item, which
may themselves be useful sources of data about the book, such as a critical
review, or a third-party abstract in a bibliographic database. The biggest
drawback of the complete text is its length – often making it impractical as a
source of information for rapid evaluation. This is why description metadata
is used as a surrogate for the full item.
Attributes, as they are defined in the model, generally fall into two broad
categories. There are, on the one hand, attributes that are inherent in an entity,
and on the other, those that are externally imputed. The first category includes
not only physical characteristics (e.g. the physical medium and dimensions of an
object) but also features that might be characterized as labelling information (e.g.
statements appearing on the title page, cover, or container). The second category
includes assigned identifiers for an entity (e.g. a thematic catalogue number for a
musical composition).
Descriptive metadata
The following metadata elements (mostly derived from Dublin Core, with
the exception of Bibliographic Citation) are described in terms of their
relevance to describing resources. Dublin Core elements were chosen as the
basis for discussion in this chapter, because of its general nature, widespread
use and relative familiarity. It has been widely used as the basis for
application profiles relevant to specific communities of interest. The
descriptive metadata elements could include:
• Identifier
• Title
• Creator
• ‘Bibliographic citation’
• Date
• Format
• Description.
Title
Although titles are extensively used to identify resources, they are not always
descriptive of the content. In web pages, apart from the URL, the title is probably
the most widely used metadata element; in HTML it is delimited by the tags
<title> and </title>. This mark-up is frequently used by search engines in
results listings and by browsers to establish what is displayed at the top of
the browser window.
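Extracting that element is straightforward with a standard HTML parser; a minimal sketch, much as a search engine or browser would read it:

```python
# Extract the contents of the <title> element from an HTML page using
# Python's standard-library HTML parser.
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

page = ("<html><head><title>Metadata for Information Management</title>"
        "</head><body></body></html>")
parser = TitleExtractor()
parser.feed(page)
print(parser.title)  # Metadata for Information Management
```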
Book titles show considerable variation, depending on whether it is the full
title, a common representation of the title, or a particular translation of the
original title. This variation can cause confusion in the identification of an
information resource unless there is some way to distinguish between them.
Consistent cataloguing rules, for instance RDA and the ISBD, provide rules
on sources of information (Joint Steering Committee for Development of
RDA, 2014; IFLA, 2011). They can be used to establish which version of the
title takes priority, or even how to deal with different title origins. Cataloguers
have to deal with questions such as: ‘Does the title of a series of monographs
appear before or after the title of the individual monograph?’ They also have
to deal with subtitles, which may cause additional confusion in the
identification of a resource.
Creator
Creator covers a wide range of possible relationships and may imply
intellectual property rights such as copyright. For printed publications, the
author is usually the ‘creator’ entity. However, it also applies to editors of
series of compiled works as well as illustrators and translators.
For web pages the situation can become quite complex. For instance, some
organisations do not attribute the content of web pages to named individuals,
but to departments or the organisation itself, often for reasons such as:
• to provide a more reliable point of contact for those who wish to act on
the content of the web page – the individual authors may move on, in
which case the department may be the more helpful point of contact.
RDA provides a good guide for expressing author names in publications. The
citation rules for many refereed journals also have their own conventions for
author names. The rules are not so clearly defined for web pages, and the
permissive metadata standards used in this arena, such as Dublin Core, do
not specify how a creator’s name should be recorded. The question arises:
should it be the surname followed by initials, the full name, or a title followed
by the first name and then the family name? Even in the relatively well
defined area of bibliographic records there are variations in author names
which can cause problems when it comes to reliable identification of a
publication.
Different amounts of data may be available for different publications by the
same author. The author of On the Origin of Species, for example, could be
expressed as:
C. Darwin
Charles Darwin
C. R. Darwin
Charles R. Darwin
Charles Robert Darwin
Charles Darwin 1809–1882
And then there are the inversions of the surname and given names: Darwin,
C. etc. The last example in the list above introduces the dates of Charles
Darwin’s lifespan, affording another way of discriminating between this
particular individual and other authors who may share the same name.
Where transliteration from a different script is concerned, there is an added
level of variation. For instance, is it ‘Tchaikovsky’ or ‘Chaikowski’? Is it ‘Mao
Tse-Tung’ or ‘Mao Zedong’?
A name may not be sufficient to distinguish between different authors. For
example, the author ‘Steve Jones’ comes in at least three distinct varieties,
which becomes clear when the cataloguing data reveals their dates of birth.
There is Steve Jones (b. 1944), the biologist who wrote The Language of the
Genes. Biology, history and the evolutionary future; there is Steve Jones (b.1953),
the sports writer and author of Endless Winter: the inside story of the rugby
revolution; and there is Steve Jones (b. 1961), the music critic and author of
Rock Formation: music, technology and mass communication. Authority lists which
include additional data, such as the date of birth, provide an added level of
specificity and make identification of items (such as books) more reliable.
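An authority entry that carries the date of birth can be sketched as a simple lookup. The records below paraphrase the three Steve Joneses above; the data structure is an invention for illustration, not a real authority file format such as a MARC authority record:

```python
# Illustrative sketch of a name authority list: the birth year qualifies an
# otherwise ambiguous name. The structure is invented for illustration.
authority = [
    {"name": "Jones, Steve", "born": 1944, "field": "biology"},
    {"name": "Jones, Steve", "born": 1953, "field": "sports writing"},
    {"name": "Jones, Steve", "born": 1961, "field": "music criticism"},
]

def resolve(name, born):
    """Disambiguate an author by name plus date of birth."""
    return [r for r in authority if r["name"] == name and r["born"] == born]

print(resolve("Jones, Steve", 1953))
# [{'name': 'Jones, Steve', 'born': 1953, 'field': 'sports writing'}]
```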
Where the creator is an organisation the issue of name change may arise.
For instance my own institution, ‘City, University of London’ was previously
‘The City University, London’ and prior to becoming a university was the
‘Northampton College of Advanced Technology’. These kinds of changes can
lead to problems with identification and accurate description.
Cataloguers deal with items ‘in hand’, so that the publication details reflect
the situation at the point of publication. This can be copy cataloguing, where
records are obtained from an external source and adapted, or original
cataloguing, creating records from scratch in-house (Chan and Salaba, 2016,
69). To some extent authority lists such as those maintained by the Library of
Congress (2017b) can help with making connections between different
manifestations or expressions of the same work. This is one of the issues that
RDA cataloguing is intended to address.
Bibliographic citation
The bibliographic citation includes elements already discussed, such as title
and creator. However it also includes other distinguishing details such as
publisher, place of publication and date of publication. Different types of
bibliographic records such as journal articles will also include details of the
journal and the volume and issue numbers of the journal. Conventions for
citations such as RDA or the multiplicity of conventions used by refereed
journals provide rules for the order and format of the citation details. The
intention of the citation details is to uniquely identify and help in the location
of the resource being described. Again, there must be a consideration of
consistency in citation conventions. Bibliographic reference management
applications such as RefWorks, EndNote, Zotero and Mendeley work with
generic bibliographic records that can be output in a variety of reference
styles, such as Harvard or Modern Language Association, according to the
requirements of the publisher. Even then, these conventions may be limited
to the order in which items are cited and the punctuation that separates the
different data elements.
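The point about order and punctuation can be illustrated by rendering one generic record in two simplified styles. The templates below only gesture at Harvard and numbered conventions; they are inventions for illustration, not the output of any real reference manager:

```python
# Render one generic bibliographic record in two simplified citation
# styles. Real styles (and tools such as Zotero) are far more detailed.
record = {
    "author": "Darwin, C.",
    "year": 1859,
    "title": "On the Origin of Species",
    "publisher": "John Murray",
    "place": "London",
}

def harvard(rec):
    """A simplified Harvard-like rendering: author (year) title. place: publisher."""
    return (f"{rec['author']} ({rec['year']}) {rec['title']}. "
            f"{rec['place']}: {rec['publisher']}.")

def numbered(rec, n):
    """A simplified numbered rendering: [n] author, title, publisher, year."""
    return (f"[{n}] {rec['author']}, {rec['title']}, "
            f"{rec['publisher']}, {rec['year']}.")

print(harvard(record))
print(numbered(record, 1))
```

The same record yields quite different strings, differing only in element order and punctuation.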
Date
Date information occurs in a number of contexts. It may be an intrinsic
property of an information resource – for example, date of creation, date of
publication, date of revision. It may also be an externally imposed data
element that has more to do with the management of the resource, for
example web page revision or expiry dates, or review dates for electronic
records. Date information can also refer to when something was done to the
resource, such as the date on which it was catalogued, reviewed or migrated.
Format
Format information is particularly important for electronic information
resources and may provide the key to future access to the resource. It is
evident in digital images – many of which are created with a great deal of
proprietary metadata about: the format; the application and version used to
create or modify the image; the storage format; and the medium used to store
the data. This descriptive information becomes important when it comes to
reconstructing information by means of migration or by emulating the
original applications.
Format does not only apply to electronic resources. The format of a printed
work may also be relevant and may refer to whether a book is hardcover or
paperback, and the physical size of the document, and whether or not it
contains illustrations. This type of information is particularly helpful for
managing resources. Do the books fit on a standard shelf for instance, or do
they have to be kept with the outsize material?
Description
On the face of it the ‘Description’ data element (in Dublin Core) is most
directly relevant to the purpose of resource description discussed in this
chapter. However, descriptive information may not be an intrinsic property
of the information resource. An author’s abstract in a journal article or an
introduction from a monograph is intrinsic, but an externally produced
abstract or summary is not; it is applied to the resource. This becomes
particularly relevant in describing physical objects or images, as would be the
case in a museum.
There are different approaches to resource description. For example, an
external abstract may be enriched with controlled terms to enhance retrieval.
Alternatively, it may be purely free text – the most likely outcome of using
authors’ abstracts or publishers’ promotional material. The description will
depend on the purpose of the abstract and this will inform the approach that
should be adopted. Many secondary sources specialise in preparing abstracts
on indexed items. The same article may have quite different abstracts which
are geared to different audiences. The questions to ask are:
The description data element can therefore be applied to this purpose even
though it is not necessarily intrinsic to the resource itself.
Conclusion
Description is an application of metadata that underpins other purposes,
including authenticity, finding and retrieving information and describing
what has to be managed. The actual names of the relevant data elements will
vary according to the schema used. Those that relate to description of
information resources and information-containing artefacts (such as museum
objects, digital media and printed documents) fall under the following broad
headings:
The data elements used to illustrate the descriptive purpose of metadata
also fulfil other functions such as information retrieval, interoperability and
rights management. The level of description required will depend on the
context. For instance, a title may be sufficient for a library user to distinguish
between different books by an author. A fuller description (in combination
with other data elements) may be needed if several titles are being evaluated
to inform a purchase decision.
Identifiers are a particularly complex area, there being a variety of different
identification systems that can be applied. For instance, an electronic resource
may have a DOI, a URL and an ISBN. Other descriptive elements such as title
may be applied at the RDA work level or manifestation level. Throughout the
discussion on descriptive metadata elements, the theme of consistency has
recurred. The adoption of consistent cataloguing rules is one way of uniquely
identifying items and forms the basis for the development of authority lists.
In the library and archive fields there has been considerable progress in the
development of name authority lists that can be used to distinguish between
similar-sounding items and to consolidate variations around a party name
(such as an author) or information resource (such as an archive or book) for
consistent retrieval.
A common theme running throughout the description purpose is the need
for consistent encoding (which is covered in Chapter 12), to ensure a degree
of interoperability between items and to help discriminate between items that
are relevant and those that are not.
CHAPTER 6
Retrieving information (Purpose 2)
Overview
Metadata standards such as Dublin Core and MODS were designed to improve the
retrieval of web resources and discoverability of digital information resources. This
chapter considers the role of metadata in information retrieval. It begins with a review
of information retrieval concepts and measures of retrieval performance before
considering the impact of metadata on retrieval. Reference is made to models for
resource description and subject indexing. The final part of the chapter examines the
relationship between subject indexing and computational methods of retrieval.
Significant words
General search engines have moved away from the classic Boolean search
model based on set theory, where exact matches to queries are required in
order to retrieve items. Algorithms based on probabilistic models allow search
results to be ranked by relevance (or closeness to fit). Although search engines
rank results so that the most ‘meaningful’ items appear at the top of the list,
they do not solve the problem of differing weights of search terms. Formal
descriptions extracted from the document or applied by cataloguers or
indexers still play an important role. This is particularly the case for the
semantic web, where the context of a descriptor can have a profound effect
on retrieval.
Another aspect to consider is the level at which retrieval takes place. This
can be at collection level or at the level of individual works, or individual
manifestations of that work, or individual items. Collection-level retrieval
provides ‘a filtering system that helps reduce users’ data overload’ (Zavalina,
2011, 105). In Zavalina’s study the types of data element with the greatest
effect on collection-level retrieval were (in order): Description, Subjects and Title.
Most retrieval is document-based, but Salton, Allan and Buckley (1993)
developed a passage retrieval system based on retrieval of excerpts of
documents rather than the whole document. This is particularly relevant where
a search yields many long documents which users then have to navigate
through to find the relevant material. This research has been carried forward
by exploring different computational techniques to improve the performance
of passage retrieval systems (MacFarlane, Robertson and McCann, 2004).
Passage retrieval can also be used to exploit the structure of XML documents
to narrow down the search results in XML element retrieval (Winter, 2008).
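The passage-retrieval idea can be sketched in a few lines: rather than scoring whole documents, slide a fixed-size window over the text and score each passage by the query terms it contains. This is a minimal illustration, not any of the cited systems, and the sample document and window size are invented:

```python
def best_passage(document, query_terms, window=30):
    """Slide a fixed-size window over the document and score each passage
    by how many distinct query terms it contains."""
    words = document.split()
    best, best_score = "", -1
    for start in range(0, max(1, len(words) - window + 1)):
        passage = words[start:start + window]
        score = len(set(q.lower() for q in query_terms) &
                    set(w.lower().strip(".,") for w in passage))
        if score > best_score:
            best, best_score = " ".join(passage), score
    return best, best_score

doc = ("Chapter one covers cataloguing history. " * 5 +
       "Metadata standards improve retrieval of digital resources. " +
       "Chapter two covers binding. " * 5)
passage, score = best_passage(doc, ["metadata", "retrieval"], window=8)
print(score)  # 2
```

A user is then shown the best-scoring excerpt instead of having to navigate the whole document.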
Information Theory
Shannon’s Information Theory, based on his work at Bell Labs, underpins
digital communications systems today (Shannon, 1948). It looks at the
probability that a particular unit of communication (such as a word or phrase)
will occur. The average quantity of information conveyed by a unit (expressed
as entropy) reaches its maximum when the probabilities of word occurrence
are all equal to one another. In practice word probabilities are far from
equal, so most text-based systems carry a good deal of redundancy. The less
frequently a unit of communication
occurs, the more information it conveys. This can be used to compute the
incremental value of a two-word term over its separate components. In other
words, compound terms (or the collocation of relevant words) can improve the
ranking of a retrieved document. This approach leads to a mathematical
analysis that is independent of linguistic analysis. The entropy, H, is equal to
minus a constant, k, times the sum over terms i of the probability of occurrence
of term i times the log of that probability:

H = -k Σ_i p_i log p_i
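This quantity, H = -k Σ_i p_i log p_i, can be illustrated directly; the term probabilities below are invented for the example (with k = 1 and logs to base 2, so H is in bits):

```python
import math

def entropy(probabilities, k=1.0):
    """Shannon entropy H = -k * sum(p_i * log2(p_i)) over term probabilities."""
    return -k * sum(p * math.log2(p) for p in probabilities if p > 0)

# Four equally likely terms: entropy is at its maximum, log2(4) = 2 bits.
uniform = [0.25, 0.25, 0.25, 0.25]
# A skewed distribution conveys less information per term on average.
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(uniform))  # 2.0
print(entropy(skewed))   # about 1.36
```

The drop from 2.0 to about 1.36 bits is the redundancy described above: unequal word probabilities mean each term carries less information on average.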
• Unstructured text
– Boolean (Fuzzy, Extended Boolean, Set-based)
– Vector (Generalised vector, Latent semantic indexing, Neural
networks)
– Probabilistic (BM25, Language models, Divergence from randomness,
Bayesian networks)
• Semi-structured text
– Proximal nodes
– XML-based
• Web-based
– Page-rank
– Hubs and authorities
• Multimedia
– Image retrieval
The primary concern here will be with retrieval of unstructured text, such as
that typically found in books and journal articles. The description
‘unstructured’ does not mean that there is no structure to the text, but rather
that it does not conform to a standard structure (as defined in a document
type definition or XML schema, for instance). The text in a book will usually
be organised into a title page, contents section, chapters and sections and a
bibliography and index at the end. However these may not be defined in a
way that can be interpreted easily by a machine.
Semi-structured text such as that found in web resources is also of interest
as a great deal of retrieval by online search engines is based on this type of
resource. Search engines exploit both the embedded metadata and the text on
the page. HTML markers may indicate headings, allowing for a degree of
automation in the weighting of terms for retrieval. Titles and section headings
would tend to have greater significance than body text on the page.
Of the retrieval approaches listed above, those most relevant to metadata-based searching are considered below.
Boolean logic
Set theory has developed considerably since George Boole, a 19th-century
mathematician, invented Boolean algebra using logical operators to combine
sets. These basic operators are available on many search interfaces and are a
fundamental part of searching the internet and metadata collections such as
library catalogues. The commonly used operators are AND, OR and NOT.
In Figure 6.2 on the next page a library catalogue contains details on books
about pets. In the first example an enquirer wants books about both cats and
dogs. The area of overlap between the two circles represents the set of books
on CAT AND DOG. Another reader might be less discriminating and may want
anything on either cats or dogs. This is represented by the total area of both
circles CAT OR DOG. In the third example, someone may be looking for books
that are exclusively about cats and which do not mention dogs at all: CAT NOT
DOG. This is represented by the left-hand circle, but excluding that part which
overlaps with the circle for ‘DOG’. Although this type of search facility is
available on many commonly used search engines, most users do not explicitly
use Boolean operators. They tend to be limited to advanced searchers. Google
and other search systems use the ‘AND’ operator implicitly to link two or more
search terms that are entered without operators between them. If the engine
recognises a phrase, it treats the words as a single search term, which is more
specific than any single word. Other search engines,
particularly those found in intranets and on websites, use the OR operator by
default – expanding the search for each term that is entered into the query.
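The three operators map directly onto set operations, which can be shown with Python sets; the book identifiers below are invented for the example:

```python
# Hypothetical catalogue: each set holds identifiers of books indexed
# under a subject term.
cat = {"B1", "B2", "B3", "B5"}
dog = {"B2", "B4", "B5"}

cat_and_dog = cat & dog   # intersection: books about both cats and dogs
cat_or_dog = cat | dog    # union: books about either subject
cat_not_dog = cat - dog   # difference: books about cats that never mention dogs

print(sorted(cat_and_dog))  # ['B2', 'B5']
print(sorted(cat_or_dog))   # ['B1', 'B2', 'B3', 'B4', 'B5']
print(sorted(cat_not_dog))  # ['B1', 'B3']
```

AND narrows the result set, OR broadens it, and NOT excludes the overlap, exactly as in the circle diagrams described for Figure 6.2.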
Fuzzy searching
Rather than a binary condition where a document is either a member of a set
or it is not, extended Boolean search models allow for weights to be attached
to terms and for a degree of membership of a set to be processed. Different
implementations of this approach have shown improved retrieval performance
over simple Boolean retrieval (Colvin and Kraft, 2016). Document retrieval can
then be defined by the intersection and the union of documents with respect
to the weights attached to terms A and B.
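One well-known way of grading set membership is the p-norm extended Boolean model of Salton, Fox and Wu, which is not spelled out in the text; the sketch below uses p = 2 and invented term weights, so it should be read as one possible formulation rather than the implementation evaluated by Colvin and Kraft:

```python
def sim_or(wa, wb, p=2):
    """p-norm extended Boolean similarity for the query A OR B."""
    return ((wa**p + wb**p) / 2) ** (1 / p)

def sim_and(wa, wb, p=2):
    """p-norm extended Boolean similarity for the query A AND B."""
    return 1 - (((1 - wa)**p + (1 - wb)**p) / 2) ** (1 / p)

# A document strongly indexed under A (0.9) and weakly under B (0.2)
# is no longer simply in or out of the AND set: it gets a graded score.
print(round(sim_and(0.9, 0.2), 3))
print(round(sim_or(0.9, 0.2), 3))
```

With p = 1 the formulas reduce to simple averaging, and as p grows they approach strict Boolean behaviour, which is what makes the model a bridge between Boolean and ranked retrieval.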
Salton and Yang (1973) combine Term Frequency (TF) and Inverse Document
Frequency (IDF) weights to produce a vector value that indicates the usefulness
of an indexing term for retrieving and ranking search results.
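The TF x IDF idea can be sketched in a few lines; the toy corpus below is invented, and the base-10 logarithm is one common choice rather than the precise weighting of Salton and Yang:

```python
import math

def tf_idf(term, document, corpus):
    """Weight = term frequency in the document x inverse document frequency."""
    tf = document.count(term)
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log10(len(corpus) / df) if df else 0.0
    return tf * idf

docs = [
    ["metadata", "retrieval", "metadata"],
    ["library", "catalogue"],
    ["metadata", "standards"],
    ["information", "retrieval"],
]
# 'metadata' occurs twice in docs[0] and in 2 of the 4 documents:
# weight = 2 * log10(4/2), about 0.602.
print(round(tf_idf("metadata", docs[0], docs), 3))
```

A term that is frequent in one document but rare across the collection scores highly, which is exactly the property that makes it a useful indexing term.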
But what most of all recommends the solution in this Essay is, that it is compleat
in those cases where information is most wanted, and where Mr. De Moivre’s
solution of the inverse problem can give little or no direction; I mean, in all cases
where either p or q are of no considerable magnitude. [. . .] And tho’ in such
cases the Data are not sufficient to discover the exact probability of an event, yet
it is very agreeable to be able to find the limits between which it is reasonable to
think it must lie, and also to be able to determine the precise degree of assent
which is due to any conclusions or assertions relating to them.
(Bayes and Price, 1763)
These two measures can be expressed in terms of the following contingency
table, where A is the set of relevant documents, B is the set of retrieved
documents (Ā and B̄ being their complements) and N is the total number of
documents:

                  Relevant      Non-relevant     Total
Retrieved         A ∩ B         Ā ∩ B            B
Not retrieved     A ∩ B̄        Ā ∩ B̄           B̄
Total             A             Ā                N

Precision = |A ∩ B| / |B|

Recall = |A ∩ B| / |A|
Precision and recall are widely used for evaluating the effectiveness of
retrieval systems. As the precision increases, the recall often decreases. The
reverse is often true as well. As the number of items retrieved increases, the
precision (the proportion of relevant items in the retrieved set) decreases. In
practice precision and recall are difficult to measure, especially in a dynamic
and diverse environment such as the internet, because it is necessary to know
the total population of relevant items on the system. It can also be difficult to
assess the relevance of a retrieved item, especially if only one item is actually
needed to address the information need. In a web environment that uses
vector analysis and probabilistic searches to produce ranked results, it is
impossible to review the set of all documents that match a particular enquiry
in the Boolean sense. However the precision measure can be modified so that
a fixed number of ranked results is evaluated. For instance, if the first k items
in a retrieved set are evaluated, it is possible to produce a measure such as
precision at k. The first page of a search result may have ten items, and that
typically is as far as most searchers look. So an evaluation of those first ten
items makes sense to most users. This would be expressed as ‘precision at k
where k=10’ or ‘the precision of the first ten results’.
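The three measures can be computed directly from sets of document identifiers; the identifiers below are invented for the example:

```python
def precision(retrieved, relevant):
    """Proportion of retrieved items that are relevant."""
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    """Proportion of relevant items that were retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

def precision_at_k(ranked, relevant, k=10):
    """Precision over only the first k ranked results."""
    return precision(ranked[:k], relevant)

relevant = ["D1", "D3", "D5", "D7"]
ranked = ["D1", "D2", "D3", "D4", "D5", "D6", "D8", "D9", "D10", "D11"]

print(precision(ranked, relevant))          # 3 of 10 retrieved: 0.3
print(recall(ranked, relevant))             # 3 of 4 relevant found: 0.75
print(precision_at_k(ranked, relevant, 5))  # 3 of the first 5: 0.6
```

Note that precision at k only needs relevance judgements for the first k results, which is why it is practical on the open web where the full relevant population is unknowable.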
Another aspect of retrieval performance is recall. It is not always possible
to predict the effectiveness of a particular search query for retrieving relevant
items.
If different terminology is used by the searcher and the creator of the text,
there is likely to be a mismatch. For instance, a news report using the word
‘migrants’ may be about asylum-seekers or refugees as well as so-called
economic migrants. This might mean that a search on ‘refugees’ would miss
this news report. Controlled vocabularies and browsing systems may go some
way to addressing this issue, by associating related terms, or by providing a
navigation route to the preferred term. The role of thesauri and other
controlled vocabulary systems is discussed in Chapter 12.
If the word is in the title of a web page, search engines now tend to
attribute greater weight to it than if it only occurs in the main body of the
page. Matches to the title words push the resource up the ranking of hits.
Metadata can now play a role in putting a term into context. Unfortunately
this is a feature that was exploited unscrupulously by a minority of web authors
who embedded repetitions of keywords in the metadata. This manipulation
was carried to its logical conclusion by putting in the name of competitors in
the metadata fields of their home pages. This meant that searches for a
competitor’s name would retrieve the site indexed in this way - a good way of
alerting competitors’ customers to the existence of your products or services.
Because of the possibility of overt manipulation, most search engines reduced
the weight attached to metadata terms or ignored them altogether.
Search engines have continued to evolve and enhance the quality and utility
of search results by using semantic web features to make results more relevant.
Fact retrieval systems such as Google’s Knowledge Graph also depend on
semantic metadata. Ontologies and vocabularies such as DBpedia, FOAF,
schema.org and Facebook Open Graph are all used to add meaning to search
results. Metadata
is still important for retrieval, especially where users might want to restrict the
results by format, date or other criteria. Apart from social media sites, some
communities still add metadata to target pages to enhance retrieval. Domains
such as government or academic institutions may be more controlled in the
way in which they use subject terms to describe the content of their pages. For
instance, institutional repositories make subject (and other) metadata visible
by using Dublin Core tags. This makes the resources discoverable by OAI-PMH
harvesting systems, which can then compile their own indexes of material (see
Chapter 4 for OAI-PMH). They regularly scan the target repositories for
updates which can then be incorporated into their own indexes.
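An OAI-PMH harvest is driven by simple URL requests; for instance, the ListRecords verb with the oai_dc prefix asks a repository for Dublin Core records. The sketch below only constructs the request URL (the repository address is invented, and a real harvester would fetch and parse the XML response):

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request; 'oai_dc' asks for Dublin Core."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

# The repository address is invented for the example.
url = list_records_url("https://repository.example.org/oai")
print(url)
# https://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

Repeating such requests on a schedule, with a from-date argument, is essentially how harvesting services keep their indexes up to date.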
All too often intranets grow in an uncontrolled manner and do not
have a coherent structure. It is common for each department within an
organisation to have considerable autonomy about what goes on the intranet
and the result can be more like a scaled-down version of the internet than a
structured information resource. Content management systems can help
organisations to manage their intranets and websites more effectively.
However, even with the use of additional metadata elements to describe the
content of a particular page or site, there may be an issue of consistency.
Indexing resources is expensive in terms of human effort, and a lack of suitably
skilled staff can be a limiting factor that affects the quality and
consistency of indexing. Web managers need to be aware of these issues when
they are implementing a metadata strategy.
Image retrieval
Multimedia files present challenges for retrieval because their content is not
composed of text which can be indexed and retrieved. Ponceleón and Slaney
(2011, 589) talk about the ‘semantic gap’ which they define as ‘the gap
between contents of a multimedia signal and its meaning.’ Retrieval by the
characteristics of the images or sounds can be achieved by a variety of
processing techniques and advances in face recognition and speech
recognition allow for subject retrieval in some cases. Content-Based Image
Retrieval (CBIR) has been focused on colour, texture and salient points of
images or multimedia files. This approach represents an alternative to
metadata-based retrieval.
The most important factor affecting what can be done with multimedia assets
(apart from their editorial value) is their intrinsic quality (e.g. the definition of an
image or the encoding format of a video) and the quality of the metadata
associated with them.
Metadata are textual descriptions that accompany a content element; they can
range in quantity and quality, from no description (e.g. Web cam content) to
multilingual data (e.g. closed captions and production metadata of motion
pictures). Metadata can be found:
1. embedded within content (e.g. closed captions);
2. in surrounding Web pages or links (HTML content, link anchors, etc.);
3. in domain-specific databases (e.g. IMDb for feature films);
4. in ontologies (like those listed in the DAML Ontology Library).
(Ceri et al., 2013, 209)
There are widely available apps on smartphones and other devices that allow users
to associate a face with a name. The system then automatically labels similarly
appearing faces in other photos, effectively recognising individuals in photo
albums. At the time of writing there were some experimental auto-captioning
systems such as CaptionBot which processes uploaded images and creates a
caption using artificial intelligence techniques (Microsoft, 2017). This approach
could provide enhanced retrieval from a collection of images that have been
processed in this way. Further work on deep indexing of images provides
more complex descriptions (Karpathy and Fei-Fei, 2017).
Conclusion
Although free-text searching and Boolean logic are powerful tools for
retrieval, more sophisticated statistical methods are widely used for internet
searching to provide a way of ranking search results. Shannon’s Information
Theory and Bayesian Inference have both played an important role in the
development of a new generation of search engines designed for handling
large data sets. The effectiveness of a retrieval set can be measured in terms
of precision and recall.
CHAPTER 7
Managing information resources (Purpose 3)
Overview
Management of information was the third of the purposes of metadata identified in the
six-point model of metadata use. This chapter describes the information lifecycle; a
simplified model of it provides the framework for describing the management of
information resources and the role of metadata. The chapter considers the role of
metadata in each of the main stages of the information lifecycle. This is illustrated with
examples from libraries, archives, records management and research data repositories.
Information lifecycles
One of the purposes of metadata is to manage the capture, storage,
distribution and use of information resources. This can be done in a variety
of contexts: libraries, records collections, archives, research data repositories
and multimedia collections. As well as formal collections, metadata also plays
an important role in the organisation of personal collections, such as
bibliographic references, social media and personal files.
The concept of an information lifecycle is widely used in the management
of digital resources and particularly for preservation. There are identifiable
stages in the life of a digital resource and this provides a basis for managing
those resources. Detlor (2010) puts forward a process-based view of
information management where the number of steps in the information
lifecycle depends on the perspective taken (organisational, library, or
personal). This is an idea that is also summarised in Floridi’s (2010) overview
of Information. Although metadata does not refer exclusively to digital
information (it is also used for books and other physical manifestations of
information) the majority of examples discussed here are electronic. These
include websites, electronic document and records management systems,
data repositories and social media. The lifecycle concept is well developed in
records management and this is reflected in ISO15489-1:2016, the international
standard for records management, which emphasises events in the lifecycle
of records:
• creating records
• capturing records
• records classification and indexing
• access control
• storing records
• use and re-use
• migrating and converting records
• disposition.

Six broad classes of metadata may be used in the management of records. They
may be applied to all entities, or fewer, depending on the complexity of the
implementation.
At the point of creation or capture, metadata is created for the record and
would typically include the date of creation, the owner and business
classification. Metadata can be applied to
paper files or electronic records. Once created, a record may be retrieved and
used but not changed; this is known as fixity. It is classified and indexed to
ensure that it can be retrieved and that it is handled as part of the appropriate
class of records. During its life, access controls may be applied to a record so
that only authorised people are able to retrieve and use it. Preservation and
storage are important considerations, especially in a changing digital
environment. Disposal according to a retention schedule may be triggered by
an event, such as date of creation. The record may also be converted and
migrated to a new environment and this would be recorded in the record’s
metadata. ISO 15489-1:2016 also specifies the metadata that should be captured
for records.
Metadata standards such as ISAD(G) and EAD describe some of the data
elements that are used to manage records (ICA 2000; Library of Congress,
2016a). ISO 23081-2:2009 provides a detailed metadata standard developed
for records management (ISO, 2009d).
The lifecycle model can be developed for digital curation of research data
such as that generated by research groups and typically held by universities.
Description and storage of research data sets has become an important way
of making research data available for further analysis. They may also be
consolidated into larger data sets. A simplified version of the Digital Curation
Centre’s (2010) lifecycle model is illustrated in Figure 7.1 overleaf.
Some of the attributes of this model can be simplified further to give us the
following defined stages in the life of an information resource or digital object
in a managed environment (see Figure 7.2, page 116):
Figure 7.1 The Digital Curation Centre lifecycle model (simplified):
conceptualise; create or receive; appraise and select; ingest; store; preserve;
access and use; transform; dispose

Figure 7.2 Simplified lifecycle stages: create or ingest; preserve and store;
distribute; review and use; transform; dispose
The remainder of this chapter will consider the role of metadata in each of
these stages in the lifecycle of an information resource, illustrated with
examples from librarianship, records management, digital curation and
content management.
Create or ingest
Different levels of aggregation will affect the ingestion of documents. For
instance, in records management systems, most of the metadata associated
with a record or file is generated at the point of creation or capture onto the
records management system. The metadata elements can be applied at
different levels of aggregation: an individual record (which may be an
electronic document or spreadsheet, for instance); at folder level
(corresponding to a paper file); or at class level (a file plan category). The
items at a lower level of aggregation inherit the attributes of their category,
so that for instance a document will inherit the metadata elements that apply
to the folder to which it belongs.
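The inheritance of metadata down the aggregation hierarchy can be sketched as follows; the class, folder and record names, and the metadata elements, are invented for the example:

```python
class Node:
    """A file-plan node: class, folder or record. Children inherit metadata."""
    def __init__(self, name, metadata=None, parent=None):
        self.name = name
        self.own = metadata or {}
        self.parent = parent

    def effective_metadata(self):
        """Walk up the aggregation hierarchy; nearer levels override."""
        inherited = self.parent.effective_metadata() if self.parent else {}
        return {**inherited, **self.own}

# Hypothetical hierarchy: file-plan class -> folder -> record.
cls = Node("Finance", {"retention": "7 years", "access": "internal"})
folder = Node("Invoices 2017", {"access": "finance team"}, parent=cls)
record = Node("INV-0042.pdf", {"created": "2017-03-01"}, parent=folder)

print(record.effective_metadata())
# {'retention': '7 years', 'access': 'finance team', 'created': '2017-03-01'}
```

The record inherits the retention period from the class and the access rule from the folder, while carrying its own creation date, so only genuinely record-specific metadata has to be entered at capture.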
Institutional repositories often depend on data entry by the authors
themselves. This means that quality control issues may be a concern. Use of
drop-down lists can help to address some of these concerns, along with
intervention by cataloguers after the data has been entered.
Libraries have slightly different processes at the point of acquisition.
Notably, retrieval systems predominate and selection, ordering and purchase
of resources are significant factors. Co-operation between the book trade
(publishers, suppliers, retailers) and the library and information community
has opened up a number of possibilities. The acquisitions process can be
handled electronically – ranging from the small-scale ordering of individual
items via internet suppliers, or direct from publisher, through to purchase via
large-scale book suppliers, who may also select materials on behalf of the
library. Basic cataloguing data can be used to identify relevant items (e.g.
author, title information). If already known, identifiers such as ISBNs can also
be used for selecting titles. Publisher-supplied metadata can be made available
as part of the ONIX data, or MARC records can be located from central
cataloguing agencies such as OCLC, the Library of Congress or the British Library.
The ONIX records allow for tracking of order information and verification of
delivery. ONIX is intended to support a fully-integrated e-commerce approach
to acquisition. This will include the delivery and payment details as well as
price and any discounts that may apply. Once the item has been acquired,
further cataloguing may be necessary to make it retrievable. Library
management systems usually import and export data in MARC 21 format,
although this may change with the introduction of BIBFRAME. Imported
records are enhanced with additional proprietary metadata from the library
management system and local data added by the cataloguers. The internal
metadata may include location information, loan records, details about the
management of binding of journals and covering of books and withdrawal and
disposal of items.
• reference
• provenance
• context
• fixity
• access rights.
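Of the elements listed above, fixity is the most directly computable: a checksum stored as metadata can be recomputed later to confirm that the object is unchanged. A minimal sketch (the byte strings are invented for the example):

```python
import hashlib

def fixity(data: bytes, algorithm="sha256"):
    """Compute a checksum to be stored as fixity metadata and recomputed
    later to confirm the object has not changed."""
    digest = hashlib.new(algorithm)
    digest.update(data)
    return digest.hexdigest()

original = b"A preserved digital object"
stored_checksum = fixity(original)

# Later: recompute and compare to detect corruption or tampering.
print(fixity(original) == stored_checksum)                       # True
print(fixity(b"A tampered digital object") == stored_checksum)   # False
```

Recording the algorithm name alongside the checksum is important, since the value is meaningless without knowing how it was produced.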
Figure 7.3 PREMIS data model (based on PREMIS Editorial Committee, 2015).
The model links an Object (a discrete unit of information subject to digital
preservation: an Intellectual Entity, Representation, File or Bitstream) and its
Environment with Agents (a person, organisation or software known to the
system), Events (actions that involve an Object or an Agent) and Rights
Statements (assertions of a right or permission).

1.1 objectIdentifier
1.2 objectCategory
1.3 preservationLevel
1.4 significantProperties
1.5 objectCharacteristics
1.6 originalName
1.7 storage
1.8 signatureInformation
1.9 environmentFunction
1.10 environmentDesignation
1.11 environmentRegistry
1.12 environmentExtension
1.13 relationship
1.14 linkingEventIdentifier
1.15 linkingRightsStatementIdentifier

Figure 7.4 Loan record from Westminster Public Libraries © 2017, Westminster City Council
Libraries and Sirsi Corporation
In a records management system the disposal schedule associated with the file category applies to the individual record as a default.
Depending on the standard, the disposal data element can be divided into
sub-elements to allow for effective management of the process. This will
include ‘Disposal action’, ‘Disposal time period’, ‘Disposal due date’ and
‘Disposal authorised by’. In this way the metadata can effectively trigger a
cascade of events during the record lifecycle.
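The cascade can be sketched as a derivation of the disposal sub-elements from a creation-date trigger; the element names follow the text above, while the retention period, dates and authoriser are invented, and a year is approximated as 365 days:

```python
from datetime import date, timedelta

def disposal_due_date(created, retention_years):
    """Derive 'Disposal due date' from a creation-date trigger event."""
    # Approximate a year as 365 days for this sketch.
    return created + timedelta(days=365 * retention_years)

def disposal_record(created, retention_years, authorised_by):
    """Assemble the disposal sub-elements named in the text."""
    return {
        "Disposal action": "destroy",
        "Disposal time period": f"{retention_years} years",
        "Disposal due date": disposal_due_date(created, retention_years).isoformat(),
        "Disposal authorised by": authorised_by,
    }

meta = disposal_record(date(2017, 3, 1), 7, "Records Manager")
print(meta["Disposal due date"])
```

A records system would periodically compare each record's due date with today's date to generate disposal lists for review.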
Disposal is a necessary and often controversial aspect of library
management. Most library collections are living collections that are managed
with limited space available. Library managers have to decide how to
maintain their collections in a way that reflects the current needs of their users.
This means weeding out-of-date and damaged materials as well as acquiring
new titles. Metadata can be used to implement a retention/disposal policy.
Borrowing and usage patterns captured by the library management system
can reveal which items are not being used and which therefore may be
considered for disposal. Metadata associated with disposal will be of two
types – intention and action. Intention applies to documents that have a
known life or those that have been selected for disposal. Library management
systems can be configured to generate disposal lists (in much the same way
we have seen for records management), which can be reviewed before a final
decision is made. Once an action has been taken and recorded, the metadata
can provide an audit trail.
Transform
The transform step completes the information lifecycle by leading to the
creation of new information resources. Each new document requires its own
set of metadata. This may be embedded
within the document initially. New metadata is created when the document
is captured to the appropriate repository, whether it is a library management
system using MARC (or BIBFRAME) records, a records management system
using ISAD(G), or an enterprise search system based on Dublin Core with
added metadata elements. Research data is designed to be made available for
research and re-use. New data generated by manipulating or combining the
research data with other data sources leads to the creation of new resources
with their own metadata. Exposing metadata of existing resources through
exchange formats such as OAI-PMH, or by making it available in RDF format,
allows for discovery, distribution and re-use. When combined with other
data sources, a data set is transformed into something new.
Conclusion
Metadata is a tool for management of information resources, whether they
are electronic and available on the internet, via a closed system, or physical
and accessible via a library catalogue. Metadata enables lifecycle management
where resources are created, modified, used and disposed of. The metadata
is utilised by software applications to handle transactions and also documents
processes that have taken place during the lifecycle of an information
resource.
For example, records management systems depend on metadata to trigger
events in the lifecycle of a record. The metadata can also be used to anticipate
the fate of an individual record as soon as it is created, rather than after many
years when it is due for review and when the originators may have moved
on or retired. Another example of the use of metadata can be seen in content
management systems, which have their own metadata for describing and
manipulating web or intranet content.
Preservation is a complex area with a range of issues to be addressed,
including digital degradation and technology obsolescence. The use of
metadata becomes particularly important for digital materials because it
provides an avenue for describing the format and technology of a resource,
aiding its management and recovery. Metadata standards such as OAIS and
PREMIS are designed to facilitate preservation management.
Library management systems use acquisitions data to manage the
workflow from ordering and payment for a publication through to
cataloguing and making it available to users. Metadata associated with loans
keep track of individual items and assist with stocktaking exercises. Research
data collections also benefit from controlled metadata use to describe and
make available data collections for further research work, or for combination
with other resources as linked data. These examples demonstrate the wide
use of metadata for managing information resources. This is often based on
the management of a resource’s lifecycle.
CHAPTER 8
Managing intellectual property rights (Purpose 4)
Overview
This chapter considers the ways in which metadata has an impact on intellectual property
rights and information access rights. It goes on to describe the issues arising from
authenticity, ownership, and rights management. A discussion of different models of
intellectual property (IP) rights considers the Open Digital Rights Language as an example
of an information modelling language that deals with intellectual property rights. The
indecs system and PREMIS are referred to, because they both deal with rights, although
they are discussed in more detail elsewhere. There is also a brief discussion of the way
rights are handled by Dublin Core, MPEG-21 and the METS Rights extension to METS. The
chapter goes on to consider provenance, starting with a general definition and then
describing the PROV metadata standard. It then considers provenance in the context of
records management, e-documents, books and printed materials.
Rights management
Protection of intellectual property rights has a major economic impact on
many industries. One of the drivers for the development of metadata
standards in the publishing and book industry has been the need to manage
copyright effectively. Such standards form a key part of the framework for publishing,
while protecting the rights of those involved in creating, performing or
distributing a creative work. In most countries an author has moral rights to
be identified as the creator of a work and consequently to enjoy the benefits
that come with these rights. The World Intellectual Property Organization
(WIPO) in Geneva regulates international treaties to help facilitate
She goes on to tabulate some of the core elements for rights and this provides
a useful basis for identifying rights data in general catalogues or data
collections. Data elements dealing with rights are built into general metadata
standards such as Dublin Core or as extensions such as METSrights. For
instance, in Dublin Core the dc.rights data element provides a home for data
on copyright, licensing arrangements (such as those associated with Creative
Commons licences) and access rights (such as those invoked by freedom of
information legislation in different parts of the world). Dublin Core does not
specify how this data element should be expressed. Commonly the data
element is used to include a copyright statement to indicate ownership of
rights. Although the rights data element does not have any formal
refinements, individual authorities and organisations using Dublin Core have
METSRights
METSRights is an external schema which is endorsed for use with METS
(Library of Congress, 2016b). The schema deals with intellectual property
rights associated with digital objects. It is an extension to METS and has the
data elements RightsDeclaration, RightsHolder, and Context. This allows
encoding of data about the nature of the rights associated with a digital object
as well as who owns the rights or has access for use of the digital object. The
Context provides a container for data about the circumstances in which the
rights apply and any constraints on those rights.
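A rights declaration along these lines could be assembled programmatically; the sketch below uses the element names mentioned above, but the exact attributes, nesting and vocabulary of the METSRights schema are simplified here and should be checked against the schema itself:

```python
import xml.etree.ElementTree as ET

# Element names follow the text above; attributes and nesting are
# simplified for illustration and do not reproduce the full schema.
rights = ET.Element("RightsDeclarationMD")

decl = ET.SubElement(rights, "RightsDeclaration")
decl.text = "Copyright 2017 Example University. All rights reserved."

holder = ET.SubElement(rights, "RightsHolder")
ET.SubElement(holder, "RightsHolderName").text = "Example University"

context = ET.SubElement(rights, "Context")
context.set("CONTEXTCLASS", "ACADEMIC USER")  # who the rights apply to

print(ET.tostring(rights, encoding="unicode"))
```

In a real METS document this fragment would sit inside a metadata wrapper section and be linked to the digital object it governs.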
A number of initiatives have addressed rights metadata. These include indecs
and the Open Digital Rights Language (ODRL), and they have led to the
development of industry-specific metadata standards such as ONIX (publishing
industry), OAI-rights activity (government, museums and libraries) and
MPEG-21 (audiovisual materials).
Modelling systems and languages such as ODRL can be used to mark up or
express rights metadata.
The Open Digital Rights Language (ODRL) is described as ‘a standard
language and vocabulary for the expression of terms and conditions over
assets’ (Iannella, 2002). Terms and conditions include permission, constraints,
requirements, conditions, offers, and agreement with rights holders. ODRL
covers both physical manifestations and digital materials. It is an international
initiative to develop an open standard for digital rights management and is
designed to be compatible with a number of other models and standards for
rights management metadata, including indecs, EBX, DOI, ONIX, MPEG,
PRISM and Dublin Core. It provides cross-sectoral interoperability and is
extensible. In the ODRL language there are three core entities: assets,
rights and parties.
ODRL is a model that describes agreements between parties for rights over
assets and their use. The language can be used to model different types of
relationship and to allow for a range of interactions. The ODRL Foundation
Model is illustrated in Figure 8.1 (Iannella, 2002).
Permissions cover four areas of activity: usage, re-usage, transfer and
asset management. Within each area a number of specific activities are
described.
Figure 8.1 ODRL Foundation Model © 2002 World Wide Web Consortium (MIT, ERCIM,
Keio, Beihang)
Under asset management, for instance, an asset can be installed, backed up,
moved, deleted and restored. These are non-trading activities, but are
necessary for effective maintenance of the resource within a client
organisation.
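The core entities of the model can be sketched as a simplified rights expression. The element names below echo ODRL's asset, permission and party entities, but this is an illustration of the model only, not a schema-valid ODRL 1.1 document, and the identifiers are invented.

```python
# Sketch: building a simplified ODRL-style rights expression with the
# standard library. Element names echo the ODRL core entities (asset,
# permission, party); this illustrates the model, not the full schema.
import xml.etree.ElementTree as ET

rights = ET.Element("rights")
agreement = ET.SubElement(rights, "agreement")
asset = ET.SubElement(agreement, "asset")
ET.SubElement(asset, "uid").text = "urn:example:ebook:1234"  # hypothetical id
permission = ET.SubElement(agreement, "permission")
ET.SubElement(permission, "display")   # a usage permission
ET.SubElement(permission, "print")     # another usage permission
party = ET.SubElement(agreement, "party")
ET.SubElement(party, "uid").text = "urn:example:rightsholder:42"

xml_text = ET.tostring(rights, encoding="unicode")
print(xml_text)
```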
ONIX
The book trade provides a good example of the complexities that arise when
it comes to managing intellectual property rights. Rights-related metadata
includes information on authorship, publishers and territorial rights. The
ONIX metadata framework was developed with this partly in mind
(EDItEUR, 2014). In order to develop ONIX a framework was needed to
analyse the different types of relationship that occur and are necessary for
commercial transactions to take place. The indecs model is just such a
framework, developed with support from the European Commission with a
focus specifically on rights management (Rust and Bide, 2000).
By establishing rights, ONIX allows for automated rights management and
for the use of rights while protecting rights owners and allowing freedom of
legitimate, fair use. There are different views of metadata, including the
intellectual property view (Figure 8.2):
Figure 8.2 The intellectual property view: persons make intellectual
property, which is used by persons; persons own rights in intellectual
property
MPEG-21
The MPEG-21 series of standards provides an interoperable framework for
multimedia. The aim is to work across a range of communities and to facilitate
integration of different models. MPEG-21 encompasses content creation,
production, delivery and consumption (ISO, 2004b). It defines a framework
for Intellectual Property Management and Protection (IPMP). The purpose is
to enable legitimate users to identify and interpret intellectual property rights,
whilst enabling rights holders to protect their rights. The Digital Items
Declaration Language (MPEG-21 DIDL) is an interoperable schema for
declaring digital items. The language can be used to represent the Digital Item
Declaration model and is one element of MPEG-21.
Machine-readable licence metadata generated via the Creative Commons
website can be embedded in the web resource or other digital material. The
title and source metadata are marked up as Dublin Core terms.
Provenance
Provenance – the place of origin or earliest known history of something.
(Pearsall, 1999)
Figure 8.4 PROV metadata model for provenance © 2013 World Wide Web Consortium,
(MIT, ERCIM, Keio, Beihang)
Digital objects
Legal admissibility of digital objects such as born-digital documents and
digitised images depends on the accompanying documentation and
certification attesting to their authenticity. Metadata provides a way of recording
details of the circumstances of creation of a document (date of creation, author,
editor, etc.) and actions that have taken place since – an audit trail of who has
accessed the document and any changes that have taken place – what the
amendments were, who made them and when. As discussed in Chapter 2,
many programs automatically attach their own metadata to electronic docu-
ments when they are created and this can provide an audit trail for the
document as it is drafted and altered.
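The audit trail described above can be sketched as a simple data structure; the field names and events are illustrative rather than drawn from any particular standard.

```python
# Sketch: a minimal audit-trail record for a document, capturing who did
# what and when. Field names are illustrative only.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditEvent:
    actor: str       # who performed the action
    action: str      # e.g. "created", "viewed", "amended"
    detail: str      # what changed, if anything
    timestamp: datetime

@dataclass
class DocumentMetadata:
    title: str
    author: str
    created: datetime
    history: list = field(default_factory=list)

    def record(self, actor, action, detail=""):
        self.history.append(
            AuditEvent(actor, action, detail,
                       datetime.now(timezone.utc)))

doc = DocumentMetadata("Board minutes", "J. Smith",
                       datetime(2017, 3, 1, tzinfo=timezone.utc))
doc.record("J. Smith", "created")
doc.record("A. Jones", "amended", "corrected attendance list")
print([(e.actor, e.action) for e in doc.history])
```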
There is no way to verify the authenticity of a document without
information about its history and what has happened to it since its creation.
For this, metadata is necessary. The authenticity of information can be
determined by means of physical certificates to indicate that the document
has been checked or that a specific procedure has been followed, or via the
metadata embedded in the resource or held separately in a database.
with individual paintings and objects in their collections. Good metadata,
used in conjunction with other tools and scholarship to establish the age
and origin of an item, helps to build the case for its authenticity.
Metadata
Provenance is a key aspect of the preservation data associated with an
information resource. Metadata and metadata schema can be treated as
information objects or digital objects for the purposes of describing
provenance and managing their preservation. This allows the application of
preservation models such as PROV and PREMIS to metadata (Li and
Sugimoto, 2014). This approach would help to record data such as who
created the metadata record, what content rules were used to create it, and
when it was created or amended.
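A minimal sketch of this approach, using terms from the W3C PROV ontology to describe a metadata record as a provenance subject; the record and agent identifiers are hypothetical.

```python
# Sketch: describing a metadata record itself as a provenance subject,
# using W3C PROV ontology terms (prov:wasAttributedTo,
# prov:generatedAtTime, prov:wasGeneratedBy). Identifiers are invented.
record_provenance = {
    "@id": "urn:example:metadata-record:0001",
    "@type": "prov:Entity",
    "prov:wasAttributedTo": "urn:example:cataloguer:jsmith",
    "prov:generatedAtTime": "2017-06-15T10:30:00Z",
    # content rules used to create the record (free-text note)
    "prov:wasGeneratedBy": "cataloguing under RDA content rules",
}

def provenance_summary(p):
    """Answer the questions in the text: who, when, what rules."""
    return (p["prov:wasAttributedTo"],
            p["prov:generatedAtTime"],
            p["prov:wasGeneratedBy"])

who, when, how = provenance_summary(record_provenance)
print(who, when, how)
```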
Conclusion
In looking at rights management we see some similarities between different
models of intellectual property rights. It quickly becomes clear that there are
three main concepts that need to be represented in any model of rights
management: the assets (the intellectual property itself), the rights
associated with them, and the parties who create, own or use them.
The models are capable of a great deal of complexity as we have seen above,
but the use of comparable building blocks allows for a degree of inter-
operability between different schemas arising from the models.
Rights management metadata was developed in response to the need to
protect the intellectual property rights associated with digital resources and
a need to allow for the different types of transaction that take place in creating
and distributing electronic resources. In order to do this, models for
intellectual property rights (IPR) management such as ODRL and indecs were
developed. Specialist metadata schemas such as PREMIS, MPEG-21 and
METSRights are also used for capturing and handling rights data for digital
objects. Another aspect of ownership is provenance, which can affect the
acceptance of the authenticity of an item and therefore its value. It is also
CHAPTER 9
Supporting e-commerce and e-government
(Purpose 5)
Overview
This chapter considers the ways in which metadata is used for e-commerce and e-
government. It describes use of metadata for marketing and online behavioural
advertising. E-commerce is illustrated with an example from the book trade, ONIX, and
with a description of music industry metadata and digital images. It finally looks at e-
government, focusing on the documentary aspects of transactions and the role that
metadata plays in facilitating these transactions.
Electronic transactions
E-commerce and e-government are two sides of the same coin. They are about
human interaction with organisations via the internet that result in
transactions of one kind or another. In the case of e-commerce that interaction
is with commercial organisations. In e-government the interaction is with
public bodies. E-government has a slightly wider definition, in that public
education and information can also be considered a part of e-government
even if it is not a part of a specific transaction.
The main difference between the two is that transactions with individuals
may involve metadata behind the scenes, but do not require overt handling
of metadata by the consumer. For instance if an individual is recording a life
event via the internet (such as registering a death), the ‘Tell Us Once’ service
in the UK effectively allows surviving relatives to complete one death
registration form and the data on that form is used to inform the local
E-commerce
Chaffey (2015, 13) defines e-commerce in the following terms: ‘e-commerce
should be considered as all electronically mediated transactions between an
organisation and a third party it deals with.’ Laudon and Traver (2014, 51)
talk about ‘the use of the Internet, the World Wide Web (Web) and mobile
apps to transact business’. E-commerce now plays a role in most businesses
in their transactions with customers and with other businesses. As well as
direct retail activities, businesses procure services and purchase products
from suppliers using e-commerce applications.
Metadata plays a key role in the revenue-generating activities of social
media giants such as Facebook, Google and Yahoo!. For instance, van Dijck
(2013, 63–4) says: ‘As Facebook owns an unprecedented reservoir of
customised (meta)data, advertising and public relations are becoming a
mixture of science and statistics, and therefore a lucrative business model.’
This points to the enormous potential being realised by control and
management of metadata associated with use of internet resources. Talking
about another major platform, he continues (2013, 93): ‘Needless to say, both
user-added tags and automatic tags added considerably to Flickr’s
commercial potential, especially in the area of app development and
recommendation systems.’
Meta tags commonly used in web pages for retrieval include:
• title
• description
• keywords
• robot behaviour (index, noindex, follow, nofollow).
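A short sketch of how such meta tags might be extracted from a page, using Python's standard-library HTML parser; the page content is invented.

```python
# Sketch: extracting title and meta tags (description, keywords, robots)
# from an HTML page with the standard-library parser.
from html.parser import HTMLParser

class MetaTagExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"].lower()] = attrs.get("content", "")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

page = """<html><head><title>Widget shop</title>
<meta name="description" content="Widgets for every occasion">
<meta name="keywords" content="widgets, gadgets">
<meta name="robots" content="index, follow">
</head><body></body></html>"""

parser = MetaTagExtractor()
parser.feed(page)
print(parser.title, parser.meta)
```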
Image tags may also be used for retrieval, and certainly for display. An
‘alternative description’ is commonly used so that an image description is
given when a cursor is rolled over the image – a feature originally intended
to help users with visual impairments. However, this Alt text provides a
textual description of an image which assists retrieval as well. Metadata about
language or country may also be useful for global sites with different
interfaces for different groups. For instance, invoking www.google.com will
deliver the user to an appropriate country site, based on the IP address of the
user. This may be important where there are different products available,
different pricing, or different regulations that apply to transactions in each
country. So, for instance, following the 2014 European Court of Justice
ruling in Google Spain, all the EU versions of Google display a disclaimer
at the bottom of the results page if they detect a search on a named
individual: ‘Some results may have been removed under data protection law in
Europe.’
Some search engines, such as Google.com and Bing.com, are set up to look
for marked-up meta-tags in the items and to generate snippets for display in
response to searches. So-called ‘rich snippets’ may be used to improve search
results or enhance the display of search listings. These may include
breadcrumb trails, ratings reviews, pricing and meta-descriptions. Rich
snippets are a form of semantic mark-up metadata in web pages that allow
search engines to interpret their content more accurately and deliver relevant
results pages to users. Google uses Schema.org (described in Chapter 12) as
a standard for marking up semantic data.
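A sketch of the kind of Schema.org mark-up, serialised as JSON-LD, that search engines read when building rich snippets; the product and its values are invented for illustration.

```python
# Sketch: Schema.org Product mark-up as JSON-LD, of the sort used to
# generate rich snippets (price, ratings). Values are invented.
import json

snippet = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Widget",
    "description": "A widget for every occasion.",
    "offers": {
        "@type": "Offer",
        "price": "19.99",
        "priceCurrency": "GBP",
    },
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.4",
        "reviewCount": "27",
    },
}

json_ld = json.dumps(snippet, indent=2)
print(json_ld)
```

In practice this block is placed in a `script type="application/ld+json"` element in the page head, where crawlers can parse it independently of the visible content.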
The indecs framework forms the basis of the ONIX e-commerce metadata
standard for handling works such as books, sound recordings, graphic arts
and films.
The metadata may also include marketing material to help retailers sell the
book as well as making the book discoverable by individual purchasers.
Because it is expensive to create and maintain metadata, only data that is
strictly necessary to solve a specific business need is created. EDItEUR has
developed technical standards to allow publishers, distributors, wholesalers,
retailers, and the general public to get the metadata easily.
ONIX is based on the indecs model, which was the result of a three-year
project that culminated in 2000. This model has been used to develop a
number of commercial metadata frameworks, including ONIX (for the book
trade), ONIX-PC for periodicals, DDEX for the recorded music business and
EIDR for the film and entertainment sector (film, TV and video). Although on
the surface DDEX, ONIX and EIDR appear very different, there are similarities
at a deeper level, because they are based on the same data model.
There is a family of ONIX metadata standards. The classic standard is ONIX
for Books. It is a trade metadata standard for communicating information
about books, e-books and other book-like objects. This might include digital
audio, such as a recording on a CD of someone reading a book.
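The shape of such a record can be sketched as follows. The fragment is loosely modelled on ONIX for Books reference names; it is not a valid ONIX message, and the identifiers are placeholders.

```python
# Sketch: a fragment loosely modelled on ONIX for Books reference names
# (Product, RecordReference, ProductIdentifier). This illustrates the
# shape of the metadata only; it is not a valid ONIX message and the
# identifier values are invented.
import xml.etree.ElementTree as ET

product = ET.Element("Product")
ET.SubElement(product, "RecordReference").text = "example.com.0001"
ident = ET.SubElement(product, "ProductIdentifier")
ET.SubElement(ident, "ProductIDType").text = "15"   # ONIX code for ISBN-13
ET.SubElement(ident, "IDValue").text = "9780000000000"  # placeholder ISBN

onix_fragment = ET.tostring(product, encoding="unicode")
print(onix_fragment)
```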
ONIX-PC is the metadata standard for serials. This metadata is passed
between the serial publishers and aggregators who make bundles of e-
journals available for subscription by academic libraries, for instance.
E-government
The Organisation for Economic Co-operation and Development (OECD)
defines e-government as (Field, Muller and Lau, 2003): ‘The use of information
and communication technologies, and particularly the Internet, as a tool to
achieve better government.’ Bhatnagar (2009) suggests that e-government can
be seen as an extension of e-commerce: ‘For those who see it as some form of
extension of e-commerce to the domain of the government, it represents the
use of the Internet to deliver information and services by the government.’ It
encompasses a range of interactions between government and citizens and
between government and businesses. This includes delivery of information,
downloading forms and online form-filling. Examples of e-government in
practice include filing tax returns, registering businesses, applying for
passports and voting. The Federal Government of the USA has identified
metadata elements for description of government digital resources (Central
Information Office, 2016). This schema is primarily for listing data sets in the
Project Open Data directory.
In the UK the move towards e-government prompted the development of
eGMS, the e-Government Metadata Standard (e-Government Unit, 2006), a
Dublin Core application profile. This standard, which is no longer
maintained or mandated for use on government websites, was primarily
designed for retrieval of resources on official government websites.
In Australia the Australian Government Locator Service (National Archives
of Australia, 2010) was designed for a similar purpose and also focused on
discoverability of government online resources. Like eGMS it is also an
application profile of Dublin Core and has become a metadata standard for
use within national government. Because much of government information
has a geographic aspect there is some emphasis on spatial metadata in
government metadata standards. Standards such as ISO 19115 (ISO, 2014a)
are coming to the fore and focus primarily on the location of data collections
and services.
The European Commission (2011) has suggested: ‘Metadata is an important
asset for eGovernment systems development and as such should be carefully
and professionally managed’. It has been responsible for a number of
initiatives to facilitate exchange of data between public sector organisations
within member states as well as between member states, and interoperability
continues to be a priority (Bovalis et al., 2014).
A great deal of e-government’s focus is on Open Data, and activity currently
centres on linked data initiatives, described in Chapters 3 and 13.
This perhaps represents a general move towards greater private sector
participation in the delivery of public services.
Conclusion
E-commerce and e-government have much in common. They are both about
facilitating transactions between suppliers/government and consumers/
citizens. That said, there are some significant differences. E-commerce deals
with transactions between businesses, where there is a need to make the
supply chain work effectively, as exemplified by the book trade system,
ONIX. It also encompasses business–consumer interactions. E-government
tends to focus on electronic transactions between government and citizens or
individuals, although public sector procurement systems are business-to-
business.
CHAPTER 10
Information governance (Purpose 6)
Overview
This chapter considers the ways in which metadata has an impact on information
governance. The first part of the chapter considers the role of metadata in privacy,
freedom of information and legal admissibility of documents. It then goes on to explore
the use of metadata to facilitate regulatory compliance. This demonstrates approaches
to document and information management and to metadata policies which contribute
to the overall ability of organisations to comply with regulatory requirements.
regulations are applied across the board and that potential loopholes are
closed off.
Responses to these pressures by professional bodies in fields such as
computing, information management, document and records management, and
library and information services have resulted in new approaches and practices.
These professional bodies also play an important role in setting standards
and providing training for professionals responsible for compliance. This
chapter identifies these areas and looks at ways in which metadata has been
used as a part of information governance. It could be argued that metadata is
an information resource that itself is subject to governance and this is
discussed in Chapter 11.
Authentication
Records management and good governance depend on being able to
demonstrate the authenticity of a record and to provide documentation about
its history and the way it has been managed. This may include details of
transactions that have taken place: who viewed a particular document and
when, what changes were made to the document during its history and the
measures adopted to ensure that unauthorised changes have not taken place.
This provides the basis for legislation on the legal admissibility of electronic
documents and whether they can be used as evidence in legal proceedings.
Provenance and preservation metadata has an important role to play in
authenticating digital documents, as well as providing an audit trail of actions
that have been performed on documents (such as access, amendments, or
deletions). Duranti (1989) puts forward a new role for ‘diplomatics’ (the
authentication of documents – diplomas, certificates or diplomatic docu-
ments). Duranti and Rogers have gone on to develop the idea of diplomatics
adapted for authenticating records stored in the cloud and using metadata to
achieve this:
Information governance
Information governance is an important part of the corporate agenda. Public-
sector and third-sector organisations are also under increasing scrutiny to
demonstrate transparency and to counter perceptions of corruption.
Information governance is a wide term that is taken to mean governance of
information technology, data governance and governance of information held
in documents (whatever their form). Here we concentrate on the last of these
definitions. However, it is important to recognise the overlap between
definitions. Information governance is sometimes seen as part of IT
governance. Definitions are important because to some extent they determine
who is responsible for information governance: lawyers, records managers,
librarians or the IT department. The predominating professional culture will
determine the way in which this issue is handled. The corporate context is
also very important. For instance, regulated industries such as pharma-
ceuticals or financial services have specific reporting requirements that affect
the way in which they handle information.
Although the role of metadata in ensuring information governance is
recognised, there are few practical guides. Blackburn, Smallwood and Earley
(2014) consider some of the questions that arise in information governance
and suggest that metadata may be a way to address some of these questions.
Information governance may in some organisations be closely tied in with
records management and with information security. Both areas are subject to
compliance issues and meeting regulatory standards is one of the major
focuses for information governance activity. The role of metadata in records
management was discussed in Chapter 7, which looked at the way in which
metadata is used to manage and track records throughout their lifecycle. This
is particularly the case for electronic records and, by extension, digital assets.
Information governance may be driven by the management of information
risk, such as the risks associated with data breaches, data loss, disaster
recovery and non-compliance with regulations. In order to get a handle on
this we shall break down information governance into several distinct areas:
information compliance, e-discovery, information risk and sectoral
compliance.
This has had some effect on legislators and regulators and an impact on trade
relations between major trading blocs such as the USA and the European
Union. Notably, the EU–US Safe Harbour agreement, which allowed US
companies to process personal data of EU citizens, was struck down in 2015,
in part because of the systematic and routine gathering of metadata about
phone calls by the US government.
The European Union is characterised by general privacy legislation that
applies across all sectors. The EU General Data Protection Regulation is
principles-based (European Parliament and European Council, 2016). The
European legislation is enforced by national data protection authorities in the
member states. In the USA privacy protection is industry-based, covering
consumers interacting with specific industries such as health (Health
Insurance Portability and Accountability Act), credit agencies (Fair Credit
Reporting Act) and Federal agencies (Privacy Act of 1974).
Where privacy legislation applies, it is important for information managers
to demonstrate that personal data is handled appropriately. This may mean
codifying the personal data according to how sensitive it is and controlling
who has access to it when and how. For instance, Article 9 of the EU General
Data Protection Regulation prohibits the processing of sensitive personal data
unless strict conditions are met:
In the context of records management, file plans are a well established way
of determining how documents containing personal data should be handled.
The alternative of allocating security levels to individual documents or classes
of document and then restricting access to those who have the appropriate
level of security clearance can be complicated to administer. Individual
managers may need access to personal details of their staff, but would not
normally have access to personal data of staff in other departments.
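The clearance-based approach described above can be sketched as a simple access check; the levels, departments and rules are illustrative, including the departmental restriction from the manager example above.

```python
# Sketch: restricting access to documents by security level and
# clearance. Levels and department names are invented for illustration.
LEVELS = {"public": 0, "internal": 1, "confidential": 2, "secret": 3}

def may_access(user_clearance, doc_level, user_dept=None, doc_dept=None):
    """A user needs sufficient clearance; documents holding personal
    data are further restricted to the owning department."""
    if LEVELS[user_clearance] < LEVELS[doc_level]:
        return False
    if doc_dept is not None and user_dept != doc_dept:
        return False
    return True

print(may_access("confidential", "internal",
                 user_dept="HR", doc_dept="HR"))        # same department
print(may_access("confidential", "internal",
                 user_dept="HR", doc_dept="Finance"))   # other department
```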
Privacy also arises in the case of social media, where individuals create
personal profiles. These are supplemented with behavioural data which is
gathered by the provider and may be made available at varying levels to other
users, and third parties such as online advertisers. Privacy and surveillance
is discussed further in Chapter 14.
Data breaches
Data breaches present a number of challenges to the organisation, including
loss of reputation, loss of customers, regulatory penalties such as fines and
waste of time and effort dealing with the breach. The likelihood of a data
breach may be reduced by an active information security programme in
which metadata plays a part.
Information security
Information security will depend on a number of approaches, including
physical security, hardware, firewalls and communications measures, as well
as procedures and effective management of the data. Using metadata to log
the location of sensitive data, who has access to it and how it is used provides
a mechanism for control and for detecting data breaches. For instance,
metadata can be used for forensic analysis of database attacks (Khanuja and
Suratkar, 2014). The previously mentioned use of metadata to establish the
authenticity and provenance of data is also a significant contribution to data
security (Jansen, 2014).
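A minimal sketch of how access-log metadata might feed breach detection; all the names, datasets and the authorisation list are invented.

```python
# Sketch: metadata logging of access to sensitive data, with a simple
# check that flags accesses by users not on the authorised list - one
# way such logs can feed breach detection. All values are invented.
access_log = [
    {"user": "jsmith", "dataset": "payroll", "action": "read"},
    {"user": "ajones", "dataset": "payroll", "action": "read"},
    {"user": "intruder", "dataset": "payroll", "action": "export"},
]

authorised = {"payroll": {"jsmith", "ajones"}}

def suspicious_events(log, authorised):
    """Return log entries by users with no authorisation record."""
    return [e for e in log
            if e["user"] not in authorised.get(e["dataset"], set())]

flagged = suspicious_events(access_log, authorised)
print(flagged)
```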
Description of metadata
In the early 2000s some researchers were already developing metadata
vocabularies with the explicit purpose of controlling quality and security of
data. The HIDDEL (Health Information Disclosure, Description and Evaluation
Language) project created a data model describing health websites and health
information providers (Eysenbach et al., 2001). This allows evaluation of
websites and providers but does not go into a great deal of detail about the
individual data elements. De Vries et al. (2014) propose a system of ethical
metadata which provides a context for medical and ethnic data gathered in
the course of malaria research in Africa. Although they do not develop a
specific standard for this type of metadata they describe its potential role and
the benefits:
By way of solution we would propose that at least some information about the
normative context of sample collection and data sharing – what we called ethical
metadata – needs to be taken into account when data sharing decisions are to be
made. This may particularly be the case where research is conducted on
identifiable population groups where stigma or discrimination are of concern.
(de Vries et al., 2014)
Skinner, Han and Chang (2005) describe a concept of Meta Privacy, which
not only encompasses secure metadata, but also uses metadata to manage the
security of a data collection:
Sectoral compliance
There is very little reported research on the role of metadata in compliance.
Kerrigan and Law (2005) report on the development of an engineering
application to extract compliance metadata about environmental regulations
which can then be applied to documents for automatic logic processing of
engineering documentation. Singh and Kumar (2014) consider ways of
complying with data regulations affecting cloud computing. They propose a
four-layer architecture: identification, classification, routing and storage. In
this proposed system data is routed to the appropriate data centres depending
on the type of regulation that applies to it. They talk about metadata
associated with Virtual Appliances (VAP) used to process data so that it ends
up in the appropriate category of data centre.
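The routing step in this proposal might be sketched as follows; the regulation names and the regulation-to-region mapping are invented for illustration.

```python
# Sketch of the routing idea in Singh and Kumar's proposal: after the
# identification and classification layers have tagged an item with the
# regulation that applies to it, routing reads that metadata and sends
# the item to a data centre in an appropriate jurisdiction. The mapping
# here is invented for illustration.
ROUTING = {
    "eu-data-protection": "eu-datacentre",
    "us-hipaa": "us-datacentre",
}

def route(item):
    """Choose a storage destination from the item's regulation tag."""
    return ROUTING.get(item["regulation"], "default-datacentre")

item = {"id": "rec-001", "regulation": "eu-data-protection"}
print(route(item))
```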
The REGNET research project in the USA developed a number of tools and
methods ‘to facilitate access, compliance and analysis of government
regulations’ (Law et al., 2014). Among the methods developed, the researchers
Conclusion
Metadata contributes to the authentication of documents and data and this is
probably the main way in which metadata is currently used for facilitating
information governance. Information retrieval (or e-discovery) is also
important for access to regulations and this is the other major role for
metadata. Projects such as REGNET have explored ways in which metadata
can play a role in interfaces between regulators and companies through expert
systems, for instance. Using metadata to describe access requirements for data
and to identify sensitive data elements has also been tried. Neither of these
approaches has been widely adopted at the time of writing, but they may
develop into more widely available products and services.
PART III
Managing metadata
Part III looks at metadata as a resource to be managed, rather than as the
tool for management seen in Part II. Chapter 11 refers back to the metadata
concepts in Part I and identifies some of the issues that arise when developing
and implementing metadata standards, such as quality and security. One way
of addressing the quality issue is to have some control over the way in which
metadata content is created. Chapter 12 considers the ways in which
taxonomies and other controlled vocabularies can be used to improve
metadata quality. Cataloguing rules are also important in this context as are
authority files. Chapter 13 looks at very large collections of data, especially
research data and official data released by public authorities. These require
special consideration because of expansion of linked data and the emphasis
on re-usability of public data. This raises ethical and political issues about the
control and management of information as well as privacy and human rights,
the topic for Chapter 14. This last chapter also peers into the future and
speculates on which professional groups will be responsible for metadata
management and use.
CHAPTER 11
Managing metadata
Overview
This chapter considers the issues surrounding the management of metadata and
describes some of the techniques that are used for metadata management. The project
lifecycle concept is used as the framework for discussion of metadata management. The
management of metadata starts with analysing metadata requirements and moves on
to the development and selection of metadata schemas. There is then a discussion about
encoding metadata and the use of controlled vocabulary before turning to content rules.
Interoperability of metadata schemas focuses on crosswalks and metadata registries.
Quality management covers the use of administrative metadata and reviews issues such
as security of information. The final part of the chapter looks at user education and the
presentation and use of search aids to make metadata more accessible. The chapter
concludes with a view on convergence of management practice for metadata across the
domains.
This ties in with a wider concept of the information lifecycle which has been
expressed in a number of contexts, such as records management, digital
curation and electronic publications. To some extent digital lifecycles can be
applied to metadata, which is itself a type of digital resource. The Digital
Curation Centre’s (2010) digital lifecycle model discussed in Chapter 7 can
be applied to data collections, including metadata collections. The Create or
Receive, Ingest, Preserve, Store, Access and Re-use, and Dispose steps are
most directly applicable to metadata. The Transform step may apply to
conversion of metadata to other formats for export or exchange, or could refer
to the addition and modification of metadata records to reflect the life of a
document or digital resource.
Project approach
An alternative approach is a project lifecycle (Figure 11.1 over the page),
which can be adopted for the development and management of metadata as
described in the previous edition of this book (Haynes, 2004).
In this model the analysis of metadata requirements sets the criteria for
selecting an appropriate scheme or developing a schema or application
profile. The selection may be constrained by issues such as who else is using
this standard, and practical issues of cost of development of a purpose-made
metadata schema. The next stage is to define the vocabulary used in each of
the fields (in database terms, a data dictionary). The metadata is then applied
to items or it may be imported from a third party, which introduces further
issues of cataloguing standards. The quality management processes help to
ensure a consistently indexed resource that is suitable for searching and other
user interactions.
Figure 11.1 Metadata project lifecycle: select/develop and maintain
controlled vocabularies (encoding); importing metadata (interoperability);
applying metadata (content rules); quality assurance; search aids and user
interface
others will need to be taken into account. For instance, metadata may be used
primarily for retrieval, but description and identification may be other
important requirements.
In practice most metadata schemas have resource description as one of their
purposes but they are usually used for a variety of purposes. For instance,
library standards such as MARC 21, while dealing with resource description,
are also used for information retrieval in library catalogues and resource
management in library management systems. Metadata associated with
records management and preservation is an example of management of
information, but there is often an information retrieval aspect as well.
Metadata associated with ONIX is used for rights management and e-
commerce.
The analysis of requirements has to take into account what is being
described, the systems it needs to interact with, the level of granularity that
has to be supported, the user community, existing standards and the format
of pre-existing metadata. It will also be necessary to take into account the
software environment under which the metadata will operate.
Who will be using the metadata, and how? The profile of users (including
managers and information staff, as well as the eventual audience) is arguably
the most important consideration. What is its overall purpose? This needs to
be taken into account in developing a management approach for a metadata
collection. A requirements checklist might include the following:
Importing metadata
Metadata has long been imported from other sources or repositories. Many
libraries import bibliographic records rather than cataloguing new
acquisitions. Cataloguing authorities such as the Library of Congress or the
British Library, or bibliographic service organisations such as OCLC, Nielsen,
or Bibliographic Data Services Ltd, sell records to libraries. Importing
metadata requires good selection procedures, quality control and adherence
to a common data standard. Additional work may be needed to clean the data
and to reconfigure it to fit the destination system.
1 metatag extraction
2 content extraction
3 automatic indexing
4 text and data mining
5 extrinsic data auto-generation
6 social tagging.
They conclude that there is considerable potential for this approach, given the
large volume of material that needs to be processed and the cost of processing
it manually. They accept that some human intervention is necessary, hence
the designation ‘semi-automatic’. Many of the tools that they reviewed use a
combination of these methods. While the 39 tools reviewed offered many
potential benefits, a major barrier to implementation is their narrow focus:
most were designed for a very specific domain or data set.
Tools for auto-generation of metadata provide only partial coverage of
metadata elements, which means that human intervention is inevitable.
Automated metadata generation will require considerable investment to
integrate the tools and to make the resulting product more generally
applicable and capable of dealing with a wider variety of sources, document
formats and metadata standards.
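The first method listed above, metatag extraction, can be sketched with the Python standard library; the sample page and the field names in it are invented for the example:

```python
from html.parser import HTMLParser

class MetaTagExtractor(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.metadata[attrs["name"]] = attrs["content"]

html = """<html><head>
<meta name="author" content="D. Haynes">
<meta name="keywords" content="metadata, retrieval">
</head><body>...</body></html>"""

extractor = MetaTagExtractor()
extractor.feed(html)
print(extractor.metadata)
# {'author': 'D. Haynes', 'keywords': 'metadata, retrieval'}
```

In practice the extracted pairs would still need the ‘human intervention’ noted above, since embedded metatags are often missing or unreliable.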
Application profiles
Many metadata schemas encourage users to adopt standard metadata
elements that are appropriate to their needs. However, additional data
elements can be created to fulfil specific requirements of the application. It is
also possible to adopt metadata elements from different schemas using a ‘mix
and match’ approach. Application profiles provide a way of re-using existing
metadata standards (or data elements from within those standards) and
facilitate interoperability by the common use of existing standards. Nilsson,
Baker and Johnston (2008) define application profiles in terms of a process to
build functional requirements.
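By way of illustration, a ‘mix and match’ application profile can be represented as a simple registry of borrowed and local elements. This is a sketch only: the ‘myorg:projectCode’ element and the obligation values are invented for the example.

```python
# Re-used Dublin Core elements plus one locally defined element.
# Each entry records where the element is borrowed from and whether
# it is mandatory in this (hypothetical) profile.
application_profile = {
    "dc:title":          {"source": "Dublin Core",     "obligation": "mandatory"},
    "dc:creator":        {"source": "Dublin Core",     "obligation": "mandatory"},
    "dc:date":           {"source": "Dublin Core",     "obligation": "optional"},
    "myorg:projectCode": {"source": "local extension", "obligation": "mandatory"},
}

def validate(record, profile):
    """Return the mandatory elements missing from a metadata record."""
    return [e for e, rules in profile.items()
            if rules["obligation"] == "mandatory" and e not in record]

record = {"dc:title": "Annual report", "dc:creator": "J. Smith"}
print(validate(record, application_profile))  # ['myorg:projectCode']
```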
For example, the Singapore Framework for Dublin Core Application
Profiles was developed within the Dublin Core community. It can be used
with other metadata standards, schemas and encoding schemes (Nilsson,
Baker and Johnston, 2008). It has the following components, which are
illustrated in Figure 11.2:
[Figure 11.2 The Singapore Framework for Dublin Core Application Profiles, in which application profiles annotate community domain models and use metadata vocabularies (the domain standards), which in turn are built on the DCMI Abstract Model and DCMI Syntax Guidelines (the foundation standards)]
Interoperability of metadata
What is interoperability?
It is important to have a clear view of what is meant by interoperability, before
explaining the role of metadata in this context. Some definitions focus on the
storage of data in a standard format. A good example can be found in libraries,
where the MARC 21 format is used to exchange bibliographic records between
systems. This does not mean that the library management systems themselves
have to store data internally in MARC 21 format. Indeed, many of these systems
have additional proprietary metadata elements. The internal architecture of the
library management system may make a proprietary data structure more
appropriate. However, the ability to generate output in a standard format and
to import records in an agreed format allows the exchange of data between
systems. For instance, in a relational database system a bibliographic record is
created at the point of querying the system. The different fields comprising that
virtual record are stored in separate tables. A bibliographic standard that is
based on discrete records in a flat-file structure may not translate easily into
a relational system. The indecs initiative defines interoperability
as (Rust and Bide, 2000): ‘enabling information that originates in one context
to be used in another in ways that are as highly automated as possible’. This
definition focuses on the information aspect and the requirement to use
information in different contexts from its origin. It also highlights the
automated nature of transactions.
The above definitions suggest that metadata may be used to facilitate the
exchange of information between systems. However, the data must be capable
of being used by other systems. The implication is that the data is used by
different systems to achieve a common end (such as the successful purchase
of a product). Nilsson, Baker and Johnston (2009) have developed an
interoperability model. It defines four levels of compatibility that can be used
to assess the interoperability of applications with Dublin Core:
There are two contexts for metadata and interoperability: metadata as a tool
to facilitate exchange of information between interoperating systems, and
interoperability of metadata schemas themselves. Weibel (1998) suggests that
there are three different types of interoperability: semantic, structural and
syntactic interoperability. They are defined as follows:
manage the resource and to process transactions connected with the data
entities being described. The ONIX standard and MARC 21 are good
examples of this more comprehensive approach to defining metadata
elements.
Management issues
Use of metadata to enable interoperability brings up a number of
management issues:
• content standards
• suppliers’ interests versus customers’ interests
• cost versus functionality.
Content standards
In order to exchange data there has to be a commonly recognised format for
describing that data, a metadata standard. These standards cover not only
what data is expressed but also how that data is expressed. For instance, an
information resource may have a date field associated with it. An example
would be the date that a recording was made. An agreement on how date is
expressed would be needed between two applications, even if their internal
date representations were different. Frameworks such as Dublin Core suggest
encoding schemes, but they are not mandatory, so it is important that the
scheme used is made explicit in the data itself. Using different date conventions is not of
itself a problem, so long as the convention is explicit and there is a way of
converting from one format into another. In a wider context, content
standards need to be agreed between metadata systems so that like is
compared with like and so that the content is interpreted in the appropriate
way.
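A minimal sketch of such a date conversion, assuming each system declares its convention explicitly (the format labels here are invented for the example):

```python
from datetime import datetime

# Two systems exchange a recording date using different, but explicitly
# declared, conventions. Knowing the convention makes conversion routine.
SOURCE_FORMATS = {
    "W3CDTF": "%Y-%m-%d",  # e.g. 2017-12-13
    "UK": "%d/%m/%Y",      # e.g. 13/12/2017 (assumed local convention)
}

def to_w3cdtf(value, declared_format):
    """Convert a date string into the W3CDTF (ISO 8601) convention."""
    parsed = datetime.strptime(value, SOURCE_FORMATS[declared_format])
    return parsed.strftime("%Y-%m-%d")

print(to_w3cdtf("13/12/2017", "UK"))      # 2017-12-13
print(to_w3cdtf("2017-12-13", "W3CDTF"))  # 2017-12-13
```

Without the declared format, ‘13/12/2017’ and ‘12/13/2017’ would be ambiguous, which is exactly why the convention must be explicit in the data.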
Normalising data
With the proliferation of resource discovery services and collections of
metadata, consistency of metadata has become a major issue. One response
to this is to normalise metadata from different sources. This means that it will
be necessary to use the least specific data available. Although there is a loss
of precision, this is compensated for by the wider range of potential sources
that can be called upon.
A second approach is to require everyone to adhere to the same standard.
This makes sense in communities that have very specific requirements and
where there are benefits to be gained from the additional effort required.
However, this approach is not appropriate for a heterogeneous community
where requirements and purposes may differ quite radically. Importing
metadata from other repositories does raise a number of issues. The iLumina
project (McClelland et al., 2002) identified the following issues:
Some of these issues are addressed in well developed markets for exchange
of metadata and where there are widely accepted standards. For bibliographic
records there is a well established market and reputable suppliers that
provide good-quality data. Even so, there can be variations in the level of
cataloguing undertaken. In other fields it will be necessary to work with some
sample data to establish the feasibility of importing it and to assess its quality
and suitability before undertaking a full-scale import project.
Crosswalks
Reconciling metadata created in different environments is a major challenge
and some effort has been devoted to mapping equivalent metadata elements
between different metadata schemas. These mappings can be displayed as
tables and are known as crosswalks. They can be used within systems to effect
transformations between metadata objects. In the area of bibliographic
standards, BIBFRAME provides a model for bibliographic data that can help
with the creation of crosswalks between schemes. Crosswalks have been
published between Dublin Core and other major metadata schemas such as
MODS. Table 11.1 over the page shows an extract from a Dublin Core to
MODS crosswalk (Library of Congress, 2012).
More complex transformations can be achieved by use of a central
metadata schema for interchange between different schemas. This is similar
to the idea of a key language for translations between many languages. The
advantage of this approach is that there are fewer transformations necessary
to cover the whole range of possibilities.
Table 11.1 Extract from a Dublin Core to MODS crosswalk (Library of Congress, 2012)

Dublin Core element   MODS element(s)
Subject               <subject><topic>; <classification>
Description           <abstract>; <note>; <tableOfContents>
Publisher             <originInfo><publisher>
Contributor           <name><namePart>
Date                  <originInfo><dateIssued>; <originInfo><dateCreated>; <originInfo><dateCaptured>; <originInfo><dateOther>
Type                  <typeOfResource>; <genre>
Format                <physicalDescription><internetMediaType>; <physicalDescription><extent>; <physicalDescription><form>
Identifier            <identifier>; <location><url>
Source                <relatedItem type=‘original’> + <titleInfo><title> or <location><url>
Language              <language><languageTerm type=‘text’>; <language><languageTerm type=‘code’>
Relation              <relatedItem> + <titleInfo><title> or <location><url>
Coverage              <subject><temporal>; <subject><geographic>; <subject><hierarchicalGeographic>; <subject><cartographics><coordinates>
Rights                <accessCondition>
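A crosswalk table like this can drive a mechanical transformation. The sketch below uses a simplified subset of the mapping, arbitrarily choosing a single MODS target for each Dublin Core element; a real implementation would need rules for selecting among the alternative targets the published crosswalk offers.

```python
# Simplified one-to-one subset of the Dublin Core to MODS crosswalk.
DC_TO_MODS = {
    "publisher":  "originInfo/publisher",
    "subject":    "subject/topic",
    "identifier": "identifier",
    "rights":     "accessCondition",
}

def crosswalk(dc_record):
    """Re-key a flat Dublin Core record using the mapping above."""
    return {DC_TO_MODS[k]: v for k, v in dc_record.items() if k in DC_TO_MODS}

dc = {"publisher": "Facet Publishing", "subject": "Metadata"}
print(crosswalk(dc))
# {'originInfo/publisher': 'Facet Publishing', 'subject/topic': 'Metadata'}
```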
Figures 11.3 and 11.4 opposite illustrate the concept. The number of direct
translations can be expressed by the formula y = x(x-1), where y is the number
of possible connections (translations) and x is the number of different schemas
in operation. This number rapidly escalates as the number of schemas
increases. In a star configuration, on the other hand, each schema needs only
one crosswalk to and one from the key schema at the centre, so there are 2x
crosswalks in total (i.e. y = 2x). The disadvantage is that, except for
conversions involving the key schema itself, any crosswalk between two
schemas will require two steps rather than one.
It will be necessary to have a crosswalk to the key schema in the middle and
then another crosswalk from the key schema to the destination.
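A short calculation confirms these counts, treating each direction of conversion as a separate crosswalk:

```python
def direct_crosswalks(x):
    """Direct pairwise crosswalks between x schemas: y = x(x - 1)."""
    return x * (x - 1)

def star_crosswalks(x):
    """Star configuration with a key schema at the centre: each of the
    x schemas needs one crosswalk to and one from the key, so y = 2x."""
    return 2 * x

for x in (4, 10):
    print(f"{x} schemas: {direct_crosswalks(x)} direct v. {star_crosswalks(x)} via key")
# 4 schemas: 12 direct v. 8 via key
# 10 schemas: 90 direct v. 20 via key
```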
Figure 11.3 shows the possible crosswalks between four schemas using a key
schema compared with direct crosswalks between schemas. There is a total
of eight possible crosswalks between four schemas in the star configuration
(with a key schema in the centre) compared with 12 possible direct
unidirectional connections. As crosswalks are directional, each edge in these
diagrams represents two crosswalks (one in either direction). The star
configuration shows a slight advantage in the number of crosswalks, offset
by the fact that each schema conversion is a two-step process: a crosswalk to
the key schema and then a crosswalk to the destination schema. By contrast
Figure 11.4 demonstrates a much greater advantage, with 20 possible crosswalks
via a key schema compared with 90 possible direct crosswalks between 10
schemas. The J. Paul Getty Trust adopts a variation of the star configuration
using its own metadata standard, Categories for the Description of Works of
Art (CDWA), as the reference or key schema in the first column of the
crosswalk table. The crosswalk compares data elements from 12 other
Figure 11.5 Data Catalog Vocabulary Data Model © 2014 World Wide Web Consortium
(MIT, ERCIM, Keio, Beihang)
Metadata registries
Metadata registries provide a resource where metadata definitions and
specifications can be stored and maintained. Many of them conform to the
ISO/IEC 11179 model of metadata registries (ISO/IEC, 2015). They may be
domain-specific or may be maintained by a public authority. A good example
is the METeOR Metadata Online Registry (Australian Institute of Health and
Welfare, 2017). This contains data models used by national health, safety and
welfare agencies and authorities in Australia. Another example is the EPA’s
System of Registers (US Environmental Protection Agency, 2017), which lists
the data standards, data elements and vocabularies used by the EPA. The
Open Metadata Registry (formerly the NSDL Registry) supports metadata
interoperability by providing access to details of 420 vocabularies and 158
element sets (at the time of writing) that have been entered by members of
the registry (Metadata Management Associates, 2017). DataCite metadata
(described in Chapter 13) is accessed via individual research repositories.
Details of the repositories can be viewed via the re3data.org website – the
registry of research data repositories.
Quality considerations
Quality management
The quality management process ensures that the metadata is consistent,
accurate and complete. There are many measures of information quality that
can be applied to metadata. The concept of quality can be applied to the content
of metadata elements as well as to the administrative metadata. The emphasis
Administrative metadata
Administrative metadata shows when the metadata was created or updated
and its origin. Its purpose is to provide a means of managing metadata (as
opposed to the resources described by the metadata). In the early 2000s the
Dublin Core community recognised the need for metadata to describe
metadata, to facilitate interoperability and exchange of metadata. This
culminated in a specialised data element set (Hansen and Andresen, 2003). It
defines the data elements, which are grouped in the following categories:
[Figure: the A-Core metadata describes the content metadata, which in turn describes the resource]
Although this original Internet-Draft has not been updated, many of the
ideas in this paper have been carried forward in the Dublin Core DCMI
Administrative Metadata, described earlier. The A-Core Elements are divided
into four components:
Metadata security
Security is a major consideration in an interoperable environment. A useful
analysis by the New Zealand government suggests that security is a key issue
(New Zealand E-Government Unit, 2001):
At a basic level a security strategy for metadata will need to ensure that the
metadata maintains its physical integrity, by being stored securely on a
system with regular back-ups. The storage medium will be subject to all the
same considerations that would apply to any kind of electronic data:
robustness of the storage medium, corruption of data by decay of the
medium, storage conditions for the medium, durability of the medium,
technology used to read the medium. A strategy for back-up and migration
of the metadata will go some way towards addressing these concerns.
It will be necessary to restrict editorial access to metadata to authorised
personnel. The access is usually controlled by the operating system. At a crude
level it can be used to allow only certain people access to the metadata
management application. For example, different levels of access might
include:
• Read – The user can view metadata and print it off. In some cases this
will extend to the issue of whether or not the user even knows that the
record exists.
• Create – The user can create new metadata records.
• Edit – The user can amend or edit existing records – normally the date of
any changes and the name of the person making the change is recorded.
• Delete – The user can remove a record from the system – although an
audit trail should indicate that this has been done.
The levels of access can be fine-tuned so that individual records or even data
elements may have their own security levels. Users are then assigned a
security authorisation that allows them appropriate access to records or data
elements. These measures depend on the ability to identify individual users
and to control their access to the system. Most commonly a user ID and
password provide a basic level of security. More sophisticated systems
may require some kind of physical verification such as a key. This may be an
electronic key such as a swipe card, or could be based on a physical attribute
of the user such as a finger-print or iris image.
Another aspect of security is data privacy. If the metadata is being stored
as a back-up on a removable medium for instance, or being transmitted from
one location to another over the internet, it may be necessary to encrypt it.
There has been a lot of discussion about the balance between national security
and privacy in light of the extensive use of communications metadata by
bodies such as the US National Security Agency (Solove, 2011; Morrison, 2014;
Greenwald, 2013). Metadata itself needs to be kept securely as well as playing
a part in the security of the data that it describes. Skinner, Han and Chang
(2005) introduced the concept of ‘meta privacy’ to deal with the issue of
protecting metadata, as distinct from using metadata to protect data. They
talk about meta privacy in terms of benefits and risks associated with secure
metadata. They advocate the use of privacy tags that are attached to metadata
elements, which govern access to the contents of that data element.
Conclusion
Metadata is an information resource that needs to be managed. A lifecycle
approach can be adopted to handle metadata throughout its life. An alternative
approach is to view metadata creation as a project and to apply project
management principles to it. This means analysing metadata requirements,
selecting and developing a schema and then importing metadata. Application
profiles represent another option for development of suitable metadata
standards. The Singapore Framework for the development of application
CHAPTER 12
Taxonomies and encoding schemes
Overview
This chapter is all about the content of metadata elements. Permissive standards such
as Dublin Core describe what each field or data element is for, but do not specify how
the content of that data element is generated. For instance, the ‘dc:creator’ data element
might contain the name of an organisation as it is known to the website manager, it might
be taken from an authority file, or it may be created according to a set of cataloguing
rules such as AACR2. This chapter is about the techniques or mechanisms that are used
to manage and control the content of individual data elements. This is important for
consistency, quality of retrieval and efficiency of operation. Controlled vocabularies,
authorities and cataloguing rules all come under the heading of encoding schemes. A
more detailed treatment of cataloguing can be found in Welsh and Batley (2012). The
development and use of controlled vocabularies are covered in classic works such as
Aitchison, Gilchrist and Bawden (2000) and more recently in Broughton (2006). Lambe
(2007) and Broughton (2015) deal with aspects of classification and taxonomies – also
important sources of terms for metadata elements.
SharePoint 2013. White (2016) makes a strong case for the use of taxonomies
and controlled vocabularies in an enterprise search environment. Bloggers
such as Earley (2017) and the Metadata Research Center (Drexel University,
2017) have also contributed to discussion about metadata and taxonomies.
In the case of subject retrieval an indexer may have to select terms from a
controlled vocabulary such as a thesaurus or from classes in a classification
scheme or taxonomy. This is especially important when dealing with a
structured collection of material where it is necessary to reliably and
consistently retrieve relevant material according to search criteria established
at the point of need. Using a controlled vocabulary ensures more consistent
retrieval. It directs the searcher to a preferred term rather than leaving them
to guess which synonyms might describe the concept being searched for. In
records management systems a file plan provides a similar mechanism,
allowing users to select files according to a designated category which may
be subject-based or based on a functional analysis. The selection of terms or
categories can be presented as drop-down lists, as searchable databases, or
as navigable networks of terms. Many specialist organisations have
developed their own thesauri tailored to their needs. This approach has also
extended to EDRM (electronic document and records management) systems,
where subject retrieval is a key consideration. A thesaurus allows a range of
relationships between terms to be included. A full treatment of thesaurus
There are many tools for developing and maintaining controlled vocabularies
such as thesauri and taxonomies. A good starting point is the Thesaurus
Software Directory (originally on the WillPower website and now maintained
by Taxobank) and the links to online resources mentioned in Heather
Hedden’s book The Accidental Taxonomist (Will and TaxoBank, 2013; Hedden,
2016). Other lists have been produced periodically but have not been kept up
to date. There are also professional groups and discussion lists that have an
interest in taxonomies or classification schemes, such as the American Society
for Information Science and Technology (ASIS&T, 2017), the Special Libraries
Association Taxonomy Division (SLA, 2017) and the International Society for
Knowledge Organization (ISKO, 2017).
Synonym rings
Many search engines support synonym rings. These can be created from a
simplified thesaurus, which associates terms that are synonyms or quasi-
synonyms (words which have the same meaning or similar or related
meanings). For instance, the following terms could be associated with one
another in a synonym ring:
Thesaurus relationships
Thesaurus relationships are defined in ISO 25964-1 (ISO, 2011) and an excellent and
detailed description of them is given in Aitchison, Gilchrist and Bawden (2000). A term
may be associated with other terms, defined by relationships. If we take ‘Bacteria’ as
our lead in term (Haynes, Huckle and Elliot, 2002) the following relationships can be
defined:
Bacteria
BT: Microorganisms
BT: Pathogens
NT: E coli
NT: Legionella
NT: Listeria
NT: Salmonella
BT – Broader Term. This is a more general term and is higher up the hierarchy. The
thesaurus may have several levels of hierarchy – which can provide a useful navigation
tool. In this example ‘Bacteria’ has two Broader Terms, ‘Microorganisms’ and
‘Pathogens’.
NT – Narrower Term. A more specific term, lower down the hierarchy. A term may have
more than one narrower term. The inverse relationship is a broader term. The
narrower terms of ‘Bacteria’ are: ‘E coli’, ‘Legionella’, ‘Listeria’ and ‘Salmonella’.
The next relationship is illustrated by the example of ‘Addiction’ from the HSE
Thesaurus (Haynes, Huckle and Elliot, 2002).
Addiction
BT: Psychiatric disorders
RT: Alcohol abuse
RT: Drug abuse
RT: Smoking
RT: Substance abuse
RT – Related Term. This is for terms that are associated with the term in question. This
is a useful way of broadening the search or providing a route to alternative search
terms (or indexing terms). This feature can be particularly helpful for generating drop-
down lists of alternative search terms. In this example entering ‘Addiction’ would
produce a drop-down list of alternative search terms including ‘Alcohol abuse’, ‘Drug
abuse’, ‘Smoking’ and ‘Substance abuse’.
The final relationships, USE and UF (Use For), are illustrated with the following example:
Personnel managers
UF: Human resources managers
UF: Industrial relations managers
UF: Training managers
BT: Functional managers
USE – preferred term. This points from a non-preferred term to the preferred term. A
thesaurus represents a ‘controlled vocabulary’ to ensure consistency of indexing (and
retrieval). The entry for ‘Training managers’ in the thesaurus would therefore read
‘USE: Personnel managers’.
UF – Use For, i.e. non-preferred term. This points to the synonyms of a preferred term.
In this example ‘Personnel managers’ is the preferred term and the UF relationships
point to the non-preferred terms that would be synonyms.
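The entries above can be held in a simple data structure. This sketch stores the ‘Bacteria’ example and navigates its broader terms; USE/UF entries could be stored in the same way:

```python
# The 'Bacteria' entry from the HSE Thesaurus example, held as a dict of
# relationship types (BT = broader, NT = narrower, RT = related).
thesaurus = {
    "Bacteria": {
        "BT": ["Microorganisms", "Pathogens"],
        "NT": ["E coli", "Legionella", "Listeria", "Salmonella"],
        "RT": [],
    },
}

def broader(term):
    """Navigate up the hierarchy from a term."""
    return thesaurus.get(term, {}).get("BT", [])

print(broader("Bacteria"))  # ['Microorganisms', 'Pathogens']
```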
USE
USE FOR
RT (related term)
NT (narrower term)
A search for any one of these terms would retrieve all the terms in the
synonym ring. This improves recall at the expense of precision. Precision can
be improved by being more selective in the relationships included in the
synonym ring – for example by limiting the synonym ring to true synonyms,
defined by the USE and USE FOR relationships. An alternative approach is
to be more inclusive (by using quasi-synonyms defined by RT and NT
relationships as well as USE and USE FOR), but to generate drop-down lists
in response to queries and allowing users to explicitly select related terms.
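In implementation terms a synonym ring reduces to query expansion over sets of interchangeable terms. The sketch below restricts the rings to true synonyms; the term pairs are illustrative:

```python
# Each ring is a set of terms treated as interchangeable at search time.
SYNONYM_RINGS = [
    {"sulfur", "sulphur"},
    {"epinephrine", "adrenaline"},
]

def expand(query_term):
    """Return the query term plus every member of its synonym ring."""
    for ring in SYNONYM_RINGS:
        if query_term in ring:
            return ring
    return {query_term}

print(sorted(expand("sulphur")))  # ['sulfur', 'sulphur']
```

Searching on the expanded set improves recall; adding RT and NT terms to the rings would widen recall further at the cost of precision, as described above.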
For example, an index in a recipe book may be organised by dishes, e.g. Carrot cake, which
would be a pre-coordinated index term. In contrast, a searchable database of
recipes may have an index term ‘Carrots’ and a separate term for ‘Cakes’. This
is a post-coordinate system, because the terms are put together (or
coordinated) at the search stage rather than the indexing stage.
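The recipe example can be sketched as a post-coordinate search, where coordination is a set intersection performed at query time; the record identifiers are invented:

```python
# Post-coordinate retrieval: each term is indexed separately and the
# coordination happens at the search stage, by set intersection.
index = {
    "Carrots": {"recipe-07", "recipe-12", "recipe-31"},
    "Cakes":   {"recipe-12", "recipe-44"},
}

def search(*terms):
    """Return the items indexed under every one of the given terms."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(search("Carrots", "Cakes"))  # {'recipe-12'}
```

A pre-coordinated index, by contrast, would already contain the compound term ‘Carrot cake’ as a single entry.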
Figure 12.1 Extract from an authority file from the Library of Congress
[Figure 12.2 Conceptual model for authority data (based on IFLA, 2013): bibliographic entities are known by names and/or identifiers, which are the basis for controlled access points; controlled access points are governed by rules, which are applied by an agency; the agency creates or modifies the controlled access points]
[Figure: example metadata record with the keywords ‘pinot gris’ and ‘sea bream’]
Ontologies
There is some discussion about the difference between ontologies and other
types of classification system. ‘An ontology is a set of precise descriptive
statements about some part of the world (usually referred to as the domain
of interest or the subject matter of the ontology)’ (W3C, 2012). Gruber talks
about ontologies in the following terms:
This means that they can be expressed as RDF triples. They are designed
for processing on computers and allow for the creation of new relationships
based on existing relationships and inferences that can be drawn from them.
Corcho, Poveda-Villalón and Gómez-Pérez (2015) talk about ‘lightweight’ and
‘heavyweight’ ontologies. Ontologies contain concepts and the relationships
between them. In these terms a thesaurus would count as a lightweight
ontology. However, a heavyweight ontology would contain more complex
relationships than those typically found in a thesaurus. They would also
contain axioms that enable applications to define new relationships. This has
proven useful in specific areas of research such as genetics.
The Web Ontology Language (OWL) is now in its second release, OWL 2, and
has the following features (W3C, 2012):
Schema.org
Schema.org is an ontology system based on semantic web technology with
controlled vocabularies for digital objects such as web pages, digital sound
recordings, images and electronic publications (Sponsors of Schema.org, 2017).
Schema.org metadata can be expressed in RDFa, Microdata and JSON-LD. It is
widely used by search engines and is the result of a collaborative effort by
Google, Microsoft, Yahoo! and Yandex. The vocabularies have been developed
by an open community process and are continually developing. Schema.org
can be extended either as a hosted extension (reviewed and managed by
schema.org), or as an external extension (managed by other groups). In effect
schema.org allows for tagging of content using a common vocabulary for
entities, relationships and actions. At the time of writing, Schema.org contains 589 types, 860
properties and 114 enumeration values. Commonly used item types include:
The schema.org website uses the film Avatar to demonstrate how the mark-
up works (Sponsors of Schema.org, 2017). It declares Avatar to be item type
‘Movie’ (expressed using Microdata tags in HTML 5):
Schema.org can be used to specify the properties associated with the item:
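The same statement can be sketched in JSON-LD, another of the serialisations mentioned above. This example is illustrative rather than copied from the Schema.org documentation: ‘name’, ‘director’ and ‘genre’ are standard Schema.org Movie properties, and the embedding pattern shown in the comment is one common approach.

```python
import json

# Schema.org 'Movie' mark-up expressed as JSON-LD.
movie = {
    "@context": "https://schema.org",
    "@type": "Movie",
    "name": "Avatar",
    "director": {"@type": "Person", "name": "James Cameron"},
    "genre": "Science fiction",
}

# This JSON would typically be embedded in the web page inside a
# <script type="application/ld+json"> element for search engines to read.
print(json.dumps(movie, indent=2))
```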
Schema.org vocabularies are used by the main search engines to improve the
relevance of search result rankings. Tagged content using Schema.org
vocabularies and mark-up enables users to search in context and obtain more
precise search results. For instance, Google uses content that is tagged using
the Schema.org vocabularies (as well as other vocabularies) to populate its
Knowledge Graph database (Singhal, 2012). It provides factual answers to
queries without having to go to the external websites themselves.
For instance, a museum could tag its home page as a museum, which is
listed as Thing-Place in Schema.org. A search on google.co.uk yields at the
top of the search results the result in Figure 12.4.
Clicking on ‘The British Museum’ then shows structured data derived from
the schema.org metadata for that museum, without having to go to the
museum’s website (Figure 12.5).
Annotation Dominance (AD). This considers the number of times a particular tag is
applied to a resource relative to the total number of users who have used tags. This is
a measure of the level of consensus or agreement about the application of a particular
tag to a specific resource. For example, a photograph on a social media site might have
the tag ‘sunset’. The AD measure would be an indication of the proportion of taggers
that have used the tag ‘sunset’ for that particular picture. The value of AD will be
between 0 and 1.
AD = Count(TAi, Rj) / Count(U, Rj)
Here, the numerator, Count(TAi, Rj), indicates the number of tag sets that contains a
tag Ai (TAi) assigned to a resource Rj, and the denominator, Count(U, Rj), is the
number of all users who bookmarked a resource Rj with a tag set. Therefore, AD is a
measure of how much a tag is agreed by users to represent a given resource.
(Syn and Spring, 2013, 969)
Haynes 4th proof 13 December 2017 13/12/2017 15:38 Page 200
CRAD = log(Count(R) / Count(R, Ai)) / log(Count(R))

Here Count(R) is the total number of resources in the collection and Count(R, Ai) is the number of resources to which the tag Ai has been assigned, so CRAD is highest for tags that discriminate well between resources and approaches zero for tags applied indiscriminately across the collection.
When the CRAD and AD values are multiplied together they provide a measure of the
degree to which a tag can act ‘as semantic and classificatory metadata terms’ (Syn and
Spring, 2013, 971).
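The two measures can be expressed in a few lines of Python (a sketch; the function names and toy figures are illustrative, not drawn from Syn and Spring's data):

```python
from math import log

def annotation_dominance(tag_count, user_count):
    """AD = Count(TAi, Rj) / Count(U, Rj): the share of a resource's
    taggers who applied a particular tag, between 0 and 1."""
    return tag_count / user_count

def crad(total_resources, resources_with_tag):
    """CRAD = log(Count(R) / Count(R, Ai)) / log(Count(R)):
    an IDF-style measure of how well a tag discriminates
    between resources across the whole collection."""
    return log(total_resources / resources_with_tag) / log(total_resources)

# Toy figures: 40 of the 50 users who tagged a photograph used 'sunset',
# and 'sunset' appears on 100 of the 10,000 resources in the collection.
ad = annotation_dominance(40, 50)        # 0.8
discrimination = crad(10_000, 100)       # ~0.5
combined = ad * discrimination           # ~0.4 as a classificatory score
```

Multiplying the two values, as Syn and Spring suggest, rewards tags that are both widely agreed for a resource and selective across the collection.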
• Weeding to remove bad tags such as spam, spelling errors and variations on standard spelling (e.g. standardising on the US spelling 'sulfur' or the British spelling 'sulphur' and removing errors such as 'sulpur' and 'slufur'). Another example would be deciding whether the preferred term is 'epinephrine' or 'adrenaline' and automatically re-indexing the non-preferred terms.
• Seeding with ‘previously allocated tags’ to ensure that frequently used
tags are considered for each new document. For example, WordPress
lists previously used tags when a blog author is making a new entry. The
size of the tag text indicates which tags have been used most frequently.
• Vocabulary control to conflate synonyms and to disambiguate
homographs. For instance, on Mendeley, books about ‘information
retrieval’ may have the following tags: ‘IR’; ‘info retrieval’; and
‘information retrieval’. Standardising on one of these is desirable. An
image on iStock with the tag ‘plant’ could refer to an industrial facility or
a living thing and needs to be disambiguated to make it clear which
meaning is intended.
• Fertilizing by offering semantically related items to a user. A search on
‘plant’ also yields results for ‘tree’, ‘flower’ and ‘leaf’ on iStock.
• Harvesting to identify and use the most popular tags or ‘power tags’ to
index documents. For instance, catalogue entries for books on
LibraryThing display the tags that have been assigned by other users.
The size of the tag text indicates which ones are most popular and thus
directs other users to them.
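The first and third of these steps – weeding and vocabulary control – can be sketched as a simple normalisation pass over a tag list (the word lists below are illustrative, not from any real folksonomy):

```python
# Standardise on the British spelling 'sulphur' and repair misspellings
# (weeding), then conflate synonyms onto a preferred term (vocabulary control).
MISSPELLINGS = {"sulpur": "sulphur", "slufur": "sulphur", "sulfur": "sulphur"}
SYNONYMS = {"ir": "information retrieval", "info retrieval": "information retrieval"}

def garden(tags):
    cleaned = []
    for tag in tags:
        tag = tag.strip().lower()
        tag = MISSPELLINGS.get(tag, tag)   # weeding: repair spelling variants
        tag = SYNONYMS.get(tag, tag)       # vocabulary control: preferred term
        if tag and tag not in cleaned:     # drop duplicates after conflation
            cleaned.append(tag)
    return cleaned

print(garden(["IR", "info retrieval", "slufur", "Sulfur"]))
# ['information retrieval', 'sulphur']
```

Disambiguating homographs such as 'plant' cannot be handled this mechanically, since it requires context about the resource being tagged.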
Conclusion
In order to be useful for retrieval, management or interoperation, not only is
there a need for an agreed metadata standard, but also the content of the
metadata elements needs to be managed. Encoding schemes such as
controlled vocabularies, authority lists and cataloguing rules are all well
established methods of achieving this. However, in the era of linked data a
more sophisticated approach is required to allow for more complex
relationships between concepts and to facilitate the processing of data
elements to create new data. The development of languages such as OWL provides a mechanism for this, and has resulted in the creation of general-purpose vocabularies and ontologies such as FOAF, SKOS and schema.org. A large number of specialist ontologies have also been developed
in specific areas such as genetics.
The establishment of the semantic web and the growth of social networks
that allow interaction between users and systems have led to the proliferation
of social tagging and the growth of folksonomies. There are emerging
approaches for harnessing global tagging to enhance the quality of online
data and to apply some level of control and consistency in their use.
CHAPTER 13
Very large data collections
Overview
This chapter concentrates on aspects of retrieval and management that are particular to
big data. This book originally set out to consider metadata about documents and
document collections, using a wide definition of documents to include images, sound,
museum objects, broadcast material, as well as text-based resources such as books,
journal articles and web pages. Social media activity has been included in this, because
it involves a permanent (usually text-based) record of social interactions or online
behaviour. The type of metadata associated with each of these types of big data will vary
considerably, as will the use to which it is put. Transactional data has largely been
excluded from this scope, unless those transactions relate to documents. This chapter
also describes linked data, an approach that expands the scope of data sets enormously,
because it provides a mechanism for combining data sets from different repositories or
collections – mediated by the internet.
Type
Definition: The type of data set or document being shared.
• Individual Participant Data Set
• Study Protocol
• Statistical Analysis Plan
• Informed Consent Form
• Clinical Study Report
• Analytic Code
• Other (specify)
URL
Definition: The Web address used to request or access the data set or document.
Identifier
Definition: The unique identifier used by a data repository for the data set or
document.
Comments
Definition: Additional information including the name of the data repository or
other location where the data set or document is available. Provide any
additional explanations about the data set or document and instructions for
obtaining access, particularly if a URL is not provided.
(US National Institutes of Health, 2017)
There has been a great deal of commentary about the growth of big data
(Mayer-Schönberger and Cukier, 2013; Davenport, 2014; Kitchin, 2014). It
encompasses many different areas and includes transactional data, research
data collections, unstructured data in an organisational context (mostly
documents), as well as large bibliographic collections. Each of these areas has
its own challenges of complexity, volume and quality. Metadata provides an
important means for accessing information in big data collections, and it
needs to be managed to do so effectively. The status of that metadata will depend to
a great extent on the nature of the ‘big data’ being interrogated. Metadata is
also needed to manage large data collections. Some aspects such as
preservation, rights management and retrieval are covered in earlier chapters.
(or documents), and to metadata itself. As more and more organisations are
migrating information repositories to the cloud there are opportunities to
break down silos and offer access to a broad range of data via a common
interface. Text and data mining techniques have been promoted as one of the
benefits of cloud services, alongside resilience, accessibility and cost savings.
However, they also introduce a level of complexity and the need for
description of resources. At the time of writing there do not appear to be any
established metadata standards for describing content held on cloud services,
although services such as Cloud Foundry suggest metadata for service development projects on their platforms (Cloud Foundry Foundation, 2017).
Commercial services such as Amazon Web Services and Google Cloud
Platform handle metadata about applications and support imported metadata
associated with applications. Daconta (2013) suggests metadata attributes that
should be considered, in very general terms. He suggests that each resource
should have a unique identifier that allows a range of attributes to be
associated with that resource. The attributes (described using metadata) may
be constant (e.g. title, creator, date of creation) or dynamic (such as usage and
other transactional data). Taxonomies may be used to categorise the resources
and linked data provides connections to other resources. However, none of
this is specified in terms of actual metadata standards. Part of the problem is
that cloud services encompass a very diverse range of data collections and
resource types. It is unlikely that a single metadata standard would
adequately address them all. That said, identifiers such as DOI are suited to
wide-ranging resources and linked data has proven to be extremely flexible
and accommodating of different data types. Other researchers have also
recognised the need to manage metadata associated with ‘big data’ (Grunzke
et al., 2014; Sweet and Moulaison, 2013).
link to reviews of the book posted to a social media site, helping an individual
reader or a library to make a purchasing decision.
A key strand of open government initiatives is making data gathered by
public sector organisations freely available to the public. There are two
requirements – the first is to make the public aware of the data sources, by
means of resource discovery sources. The second requirement is to make the
data usable, which can be achieved by providing the data in RDF triples. The
open government initiative in the European Union, for instance, has resulted
in large data collections becoming freely available for use by commercial
organisations, academic researchers and even individuals. This has resulted
in a metadata strategy and recommendations to public authorities responsible
for publishing government data sets (European Commission, 2011). Services
that combine geographic data with public transport data, for instance, provide
live departure and arrival boards for commuters in large cities around the
world. Open government initiatives also improve accountability by making
the operation of government more transparent to their populations.
To make the data sets discoverable, some national governments have
created data portals. For example:
https://data.gov.uk UK
www.dati.gov.it Italy
www.data.gouv.fr France
www.data.gov USA
There are also international portals such as the European Union Open Data
Portal covering over 10,500 data sets, which are described using DCAT
metadata (Publications Office of the European Union, 2017; W3C, 2014a). The
data sets are grouped in the following broad categories:
The European Data Portal, funded by the EU, contains details of over 600,000
data sets from the public sector in Europe (European Commission, 2017a). It
harvests data from other catalogues of data sets, including national catalogues
Figure 13.1 Screenshot of search results from the European Data Portal © 1995–2016,
European Union
Another service, Data Portals, has details of 520 data portals worldwide and
uses the CKAN software for making data sets available (Open Knowledge
International, 2017). Many of these are local or national portals, some are
subject-specific, and some resources describe ontologies or vocabularies that
can be used for describing data sets. The CKAN system has the following
metadata fields built in:
• title
• unique identifier
• groups
• description
• data preview
• revision history
• licence
• tags – uncontrolled, although they can be organised into tag vocabularies
such as country, composer etc.
• format(s)
• API key
• extra fields.
This is one of a number of data standards used by data portals. Others include
Dublin Core, Project Open Data (POD), Data Catalog Vocabulary (DCAT) and
Schema.org. The US Government Project Open Data analysis of the different
metadata standards provides a useful table for comparison of metadata field
equivalents (Table 13.1).
Table 13.1 Comparison of metadata fields required for data sets in Project Open Data (source:
Federal CIO Council, 2017)
Label               | POD          | CKAN                 | DCAT              | Schema.org
Title               | title        | title                | dct:title         | schema:name
Description         | description  | notes                | dct:description   | schema:description
Tags                | keyword      | tags                 | dcat:keyword      | schema:keywords
Last Update         | modified     | n/a                  | dct:modified      | schema:dateModified
Publisher           | publisher    | organisation → title | dct:publisher     | schema:publisher
Contact Name        | contactPoint | maintainer           | dcat:contactPoint | n/a
Contact Email       | mbox         | maintainer_email     | foaf:mbox         | n/a
Unique Identifier   | identifier   | id                   | dct:identifier    | n/a
Public Access Level | accessLevel  | n/a                  | n/a               | n/a
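The equivalences in Table 13.1 lend themselves to simple, mechanical crosswalks between standards. The sketch below renames the fields of a POD-style record to their DCAT equivalents; the record content is invented and the mapping covers only the rows of the table:

```python
# Field equivalences taken from Table 13.1 (POD field -> DCAT term).
POD_TO_DCAT = {
    "title": "dct:title",
    "description": "dct:description",
    "keyword": "dcat:keyword",
    "modified": "dct:modified",
    "publisher": "dct:publisher",
    "contactPoint": "dcat:contactPoint",
    "mbox": "foaf:mbox",
    "identifier": "dct:identifier",
}

def pod_to_dcat(record):
    """Rename POD fields to their DCAT equivalents, keeping unmapped
    fields (such as accessLevel, which has no DCAT counterpart) as-is."""
    return {POD_TO_DCAT.get(field, field): value for field, value in record.items()}

pod_record = {"title": "Air quality 2016", "keyword": ["environment"], "accessLevel": "public"}
print(pod_to_dcat(pod_record))
# {'dct:title': 'Air quality 2016', 'dcat:keyword': ['environment'], 'accessLevel': 'public'}
```

Real crosswalks also have to reconcile differences in content rules and cardinality, which a field-renaming step like this does not address.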
There may be very good operational reasons for keeping separate data sets – for
instance, protecting sensitive personal data from unauthorised access. More
often the data sets arise from specialist software applications used to run
operations such as finance, customer relationships, marketing, logistics,
human resources, transactional processing – to name a few. Each of those
specialist systems will have its own data structures and will handle data in
ways that are specific to that application. Sometimes there will be common
data standards for interchange of data between different systems, but the
internal handling of the data may use proprietary standards. For instance, in
the library management field, MARC 21 is often used for exchange of
catalogue data between systems, but the internal data format may be more
complex to reflect internal management and auditing requirements.
The bigger challenge is unstructured content. The availability of powerful
software and services such as Google Enterprise Search, SharePoint and
Oracle allows organisations to search very large repositories of documents
that may have minimal structure. Although retrieval may be by means of
probabilistic searching and ranking rather than matching exactly to Boolean
operators, there are many challenges including duplication of content (all too
common and exacerbated by the lack of reliable retrieval), different
terminology or even language to describe the same concept or topic and
mixed media content. Individual documents may have some structure,
imposed by the software system or the organisational style manual, but there
may be little direct correlation between different applications. Even within
the same family such as Microsoft Office, each application has its own
document properties interface. The challenge is then to search across all these
different formats.
Several researchers have considered the use of metadata as a solution to
retrieval and management of unstructured document and content collections
within an organisation. Initiatives include the DMS Mark-up Language, a
metadata standard for multimedia content management systems (CMS), and
the Darwin Information Typing Architecture (Paganelli, Pettenati and Giuli,
2006; Anderson and Eberlein, 2015; Sheriff et al., 2011; Bailie and Urbina,
2012). The success of these approaches depends on the degree to which they
can be incorporated into document or enterprise information management
systems. Sheriff et al. (2011) identify the following sources for CMS metadata,
which could form the basis for a general approach to organisational content:
[Figure: the digital advertising ecosystem – advertiser/agency, ad exchange, ad network and publisher]
Table 13.2 Core metadata elements to be provided by content providers (from Open
Discovery Initiative Working Group, 2014, 16)
* It is recognized that many content providers merge Content Type and Content Format
in their systems. Providing separate fields for this data is preferred, but the current
practice of a single field may continue if separating the data is too burdensome.
[Figure: a pyramid labelled 'Successful Data' – levels from base to apex: Saved; Stored; Preserved; Accessible; Shared; Discoverable; Citable; Comprehensible; Reviewed; Reproducible; Trusted; Reusable]
Figure 13.3 A ‘pyramid’ of requirements for reusable data (de Waard, 2016)
DataCite
The DataCite metadata standard is for describing and disseminating research
data and has been developed with strong input from the research and
academic communities. Its goals are to facilitate access to research data via
the internet, to make research data citable in the scholarly record, and to support data archiving (DataCite Metadata Working Group, 2016). The data sets
described by DataCite metadata include numerical and other types of
research data. DataCite is a member of the International DOI Foundation,
which means that its member institutions mint DOIs for their data clients.
There is a small mandatory set of metadata required to register research data:
• DOI
• Title
• Creator
• Publisher
• Publication Year
• ResourceType
Recommended properties:
• Subject
• Contributor
• Date
• RelatedIdentifier
• Description
• GeoLocation
Optional properties:
• Language
• AlternativeIdentifier
• Size
• Format
• Version
• Rights
• FundingReference
One of the challenges of maintaining dynamic data sets is the need to have a
stable referencing system, but to be able to incorporate additional data as it
is produced. Researchers can use the RelatedIdentifier and Version data
elements to specify updates to a data set.
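As a sketch, a minimal DataCite record using the mandatory properties plus RelatedIdentifier and Version might look as follows (the DOI values, names and dates are invented for illustration):

```xml
<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.1234/example.dataset</identifier>
  <creators>
    <creator><creatorName>Example, Researcher</creatorName></creator>
  </creators>
  <titles><title>Example Survey Data</title></titles>
  <publisher>Example University</publisher>
  <publicationYear>2017</publicationYear>
  <resourceType resourceTypeGeneral="Dataset">Survey data</resourceType>
  <relatedIdentifiers>
    <!-- points back to the superseded version of the data set -->
    <relatedIdentifier relatedIdentifierType="DOI"
        relationType="IsNewVersionOf">10.1234/example.dataset.v1</relatedIdentifier>
  </relatedIdentifiers>
  <version>2.0</version>
</resource>
```

The DOI remains a stable reference for citation while the Version element and the IsNewVersionOf relation record how the data set has evolved.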
Institutional repositories
Institutional repositories grew rapidly in the academic sector in the early
2000s. Academic institutions around the world realised that there were
benefits from putting their research outputs and some research data and
primary research resources on a system that would facilitate use and sharing
internally and by the global academic community. They have become an
important part of the preparation of bids for funding and for assessment of
research quality. Access to research repository data is available via directory
services such as OpenDOAR (University of Nottingham, 2014). This type of
service allows users to search across institutional repositories around the
globe and to filter results by criteria such as subject, institution, software used
and country. It is possible to search by institution, by collection or by
[Figure: two architectures for cross-repository searching – a federated search service that passes the user's query to multiple source repositories in real time, and a central metadata index built by harvesting metadata from the source repositories, which the user then searches directly]
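Metadata harvesting of the kind contrasted with federated search above is commonly implemented with the OAI-PMH protocol, in which the indexing service repeatedly issues ListRecords requests to each source repository. A minimal sketch of building such requests (the repository URL is invented):

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL. When the previous
    response supplied a resumptionToken, it continues a partial harvest
    and must then be the only argument besides the verb."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)

print(list_records_url("https://repository.example.org/oai"))
# https://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

A harvester fetches each URL, stores the Dublin Core records from the response, and loops while the repository keeps returning resumption tokens.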
Conclusion
Recent growth in interest about ‘big data’, data mining and very large data
repositories has been paralleled by developments in metadata harvesting and
search architectures. Metadata standards such as DCAT have been developed
to help structure information about large collections (Open Archives
Initiative, 2017). This enhances interoperability between systems and helps
to address some of the issues of data quality and consistency. Even in areas
where there is formal control of the metadata, such as bibliographic collec-
tions, the variety and variance in the use of cataloguing standards creates
challenges for services that span different collections or institutions. This
problem is magnified when searching across data repositories where
individual data collections may be highly structured but there is little or no
commonality between collections. Within organisations there is a separate
problem of bringing together data and information items from multiple silos
in different formats and often with little consistent internal structure. A good
example is the proliferation of word-processed documents that may be made
available via a document or records management system. These tend to rely
on categorisation by purpose or tagging by authors. Powerful text retrieval
techniques have improved access to documents, but do not fully address the
problems of consistency, precision or recall. Social media generates huge
volumes of transactional data, which is used to sell advertising to users. The digital advertising industry builds user profiles from the metadata associated with online transactions, as well as from the personal data that is gathered or inferred from users' social media profiles. Research data
collections are a major enterprise, although this tends to be publicly funded.
Standards such as DataCite produce metadata that allows data sets in
repositories to be discoverable. Institutional repositories tend to focus more
on bibliographic data associated with research publications, although some
also contain primary data gathered in the course of research.
CHAPTER 14
Politics and ethics of metadata
Overview
This final chapter considers metadata in a political context. It considers three aspects of
its role in society and speculates on possible future developments. There is the ethical
strand, an increasingly important consideration for those involved anywhere in the
information communication chain (Robinson, 2009). It also considers where the power
lies in both professional and subject domain terms and which professional groups are
best equipped to develop and implement metadata standards. This section speculates
on the role of metadata in the creation of new knowledge – a holy grail that has so far
eluded the most advanced machine learning environments. Finally it considers the
practicalities of funding. There is a huge industry dependent on metadata about online
transactions, for instance. This forms the basis of digital marketing and the revenue
streams for some of the largest incorporated companies, such as Alphabet (Google) and
Facebook. This also raises the issue of who pays for the creation of new metadata
standards, and who funds the creation of metadata on a massive scale (all those
digitisation projects). Throughout, the chapter speculates on the future development and
role of metadata.
Ethics
An examination of the role of metadata raises many issues about privacy,
security, ownership and control. It also raises issues about the digital divide
and its possible role in making information accessible to wider audiences. It
has the potential to empower the marginalised, hold government to account
and improve individuals’ quality of life. Understanding metadata is
Security
Metadata has become a political issue. Anyone who asked the question 'What does metadata matter?' prior to 2013 would have been startled by the revelations about the US National Security Agency's routine downloading of
metadata about telephone conversations that involve non-US citizens
(Greenwald, 2013). The Fourth Amendment to the US Constitution protects
‘The right of the people to be secure in their persons, houses, papers, and effects,
against unreasonable searches and seizures’ (United States, 1791). A lot hangs
on the interpretation of privacy, as Solove (2011) has so eloquently discussed
in his book Nothing to Hide. Individuals are monitored continually by CCTV
cameras, through their communications and particularly their online activity,
by security agencies, crime prevention and investigation agencies and by the
digital advertising industry. They all exploit metadata that reveals information
about us as individuals, whether it be online grocery shopping or hidden
activity on the ‘dark net’. The metadata about these activities and transactions
are the surrogate resource that is used to filter and aggregate data to find and
act on leads. Privacy International has identified the following types of
metadata that are gathered or could be gathered by security agencies:
• the location that it originated from, e.g. home address of the telephone,
subscription information, nearest cell tower
• the device that sent or made the communication, e.g. telephone identifier,
IMEI of the mobile phone, relatively unique data from the computer that sent
a message
• the times at which the message was made and sent
• the recipient of the communication and their location and device, and time
received
• information related to the sender and recipients of a communication, e.g.
email address, address book entry information, email providers, ISPs and IP
address, and
• the length of a continuous interaction or the size of a message, e.g. how long
was a phone call? how many bits in a message?
(Privacy International, 2017)
Improving discovery
Effective presentation of metadata enhances its usability. A key aspect of
presentation is the ease of navigation and searching. The navigation system
and search facilities have to accommodate the needs of different kinds of
users. Some people interact with systems when they create a new document
and they will need to create the metadata. Other users will be primarily
interested in using metadata to retrieve electronic resources as searchers.
When entering metadata an author may need access to a controlled
vocabulary, in order to select appropriate keywords. The terms can be
presented in a number of ways:
General users or searchers can interact with electronic systems using the first
three options to identify relevant search terms or selection criteria for the
metadata, and ultimately the resource that is described by the metadata.
User education
Although there is increasing awareness of the existence of metadata, many
users do not understand how metadata works. At the most basic level there
is a need to identify which metadata fields are available for searching. If the
content of the fields are controlled (either by cataloguing rules or by use of
controlled vocabulary), the user needs to know where they can browse the
available keywords or terms. More sophisticated searching will require an
understanding of ways in which search queries can be combined so that, for
instance, it is possible to search for the author and title of a book on Amazon,
or to search for the subject category and date of creation of a web page on a
government portal. Commentators such as Phil Bradley have written
extensively about this (Bradley, 2013).
The idea of selective content has been around for some time, but was
brought to prominence in The Filter Bubble (Pariser, 2011). Unexpected
electoral results in 2016 such as the Brexit referendum in the UK and the
Presidential elections in the USA highlighted the problem of pollsters
themselves living in a filter bubble. This made it difficult for them to
understand or even acknowledge dissonant views from parts of the electorate
that they did not usually consider significant. The other phenomenon has been
‘fact-free news’ which has affected the political discourse in the USA, Russia
and the European Union in particular. In a former age this might have been
called propaganda. Material is easily generated in response to the economic
model that depends on driving traffic to websites and thereby generating
income from advertising revenue. The wilder or more outrageous the
assertion the higher the traffic. This turns out to be an effective political
campaigning strategy as well. The distinction between speculation and
established fact backed by evidence is still as stark as it ever was, but the usage
of these two types of approach is becoming blurred. Services such as
Facebook do not discriminate between evidenced material and
unsubstantiated speculations when they serve up the toxic mix of fact and
fictions to their subscribers.
Service providers tailor content to suit our opinions as revealed by previous
online activity. Ad blocking and cookie blocking limit the amount of data
gathered about your online activity. Blocking also stops some of the benefits
of cookies, such as continuity of sessions, tailoring of content and the ability
to make purchases online.
What role does metadata play in all of this? At least part of the basis for
selection of news stories to serve to users is based on matching a profile of
past interests and internet behaviour with new content. It seems that we are
more comfortable with news and commentary that reflects our own opinions
and preconceptions (Norris, 2001, 18–19). This is why so few liberals read
Fields were extracted from Twitter using NodeXL social analysis software
(Hansen et al., 2011).
Power
Who owns the metadata space?
There have been many successful collaborations, such as the one between
librarians and the IT community to create Dublin Core as a metadata standard
that could be applied to web resources. However, there are distinct
communities with their own perspectives on metadata, such as: librarians,
the adoption of RDA (Resource Description and Access). There have been
separate initiatives on identifiers – particularly those to do with people, and
separate development of metadata used in the social media space. Although
parts of the archives and records management communities have moved
closer together, they continue to remain independent of the LIS community
at large. It could be argued that the role of archives and records collections
are very different and that therefore the descriptive requirements and
management needs are different. Even within the bibliographic community,
the book trade has developed its own standards for metadata to facilitate e-
commerce. At this juncture it seems that this fragmentation of metadata will
continue and that a coherent metadata community is unlikely to emerge.
Looking to metadata initiatives in other areas, such as the geospatial
community, we see even further fragmentation and the possibility of a
common framework for metadata receding over the horizon as the flotilla of
initiatives disperses.
The more dynamic parts of the information disciplines are changing to
incorporate metadata as a part of their range of knowledge and skills.
Metadata is an established part of many i-school syllabuses and other LIS
academic courses. Job ads often feature the word ‘metadata’, as in ‘metadata
librarian’ where previously ‘cataloguer’ might have sufficed. This suggests
wider awareness of metadata and an appreciation of its role in LIS. The
content management community has started to consider a systematic
approach to documentation and use of metadata standards such as DITA
(Bailie and Urbina, 2012). Although it is beyond the scope of this book, it is
an interesting area to watch and it may have an impact on the adoption of
metadata standards generally.
Exploitation
The major search engines have been another focus of metadata activity as new
products based on the semantic web emerge. The Google Knowledge Base
and the development of Schema.org are examples. Open data initiatives
around the world have resulted in the development of new products and
services that combine data sets. Linked data technology (for example using
RDF triples to describe data elements) has facilitated the combination and
exchange of data. However, consistency of descriptions is a major problem.
If different encoding schemes are used for the content of data elements, there
may be a problem linking them or with information retrieval later on.
Nonetheless, there has been rapid growth of this sector and that seems set to
continue. The first linked open data set was published in 2007. By the end of
that year there were 28 data sets available which had grown to 203 data sets
by 2010 and 570 by 2014. In February 2017 there were 1139 linked open data
sets available (Abele et al., 2017). Knowledge creation has been an ambition
at least since the 1980s, when the Alvey Programme in the UK and Japan’s
Fifth Generation Computer Systems project were in full swing (Rutherford
Appleton Laboratory, 2015; ICOT, 1999). The development of the semantic
web in more recent years has provided an avenue for linking individual data
elements based on their meaning. The EU-funded LOD2 project has explored
advances in semantic web developments but has not addressed the creation
of new knowledge (Auer, Bryl and Tramp, 2014). It may be that knowledge
creation cannot be separated from consciousness. If that is the case, then it
would seem that the early dream of systems that could exploit existing data
sets to create new knowledge is still a long way off. Whatever the agent
(human or artificial intelligence), the idea that metadata could be used to
navigate existing knowledge as a first step to synthesising new knowledge
and insights remains an attractive one.
Money
Who pays?
One of the major challenges for any public initiative is finding the funding
and resources to make it happen. Many public bodies are good at creating
initiatives that address an issue of the moment, but then attention moves on
to other topics. The infosphere is littered with abandoned databases,
protocols, frameworks and information services. Creating metadata is
expensive and research continues into the automatic generation of metadata.
Approaches include automatic indexing, metadata extraction from data sets
and use of specific tools to generate metadata automatically (Golub et al.,
2016; Greenberg, 2009; Park and Brenza, 2015).
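As a rough illustration of the simplest of these approaches, automatic indexing by term frequency might be sketched as follows. The stopword list and length threshold are arbitrary choices for the example, not drawn from the studies cited:

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "on", "that"}

def extract_keywords(text, n=5):
    """Crude automatic indexing: rank terms by raw frequency after
    discarding stopwords and very short tokens."""
    terms = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in terms if t not in STOPWORDS and len(t) > 2)
    return [term for term, _ in counts.most_common(n)]

sample = ("Metadata describes resources. Creating metadata by hand is "
          "expensive, so metadata is often generated automatically.")
print(extract_keywords(sample, 3))
```

Production tools add weighting, stemming and controlled-vocabulary mapping on top of this basic frequency idea.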
Volunteer cataloguing is an alternative approach and there have been
useful initiatives where a few dedicated individuals have done an enormous
amount of indexing and cataloguing. Some clever initiatives have also
allowed general users to contribute to indexing materials. This is particularly
interesting for images and for some kinds of manuscript which cannot easily
be converted into machine-readable text with current technologies (Konkova et al., 2014).
For instance, the British Library Labs enables volunteers to transcribe images
of catalogue cards into machine-readable records by crowdsourcing tasks
(British Library Labs, 2017).
Standards
The standards development process is often prompted by the proliferation of
incompatible systems that arise in response to a practical problem. As
accepted good practices emerge, standards are often developed to codify that
practice. Standards then provide a common approach. This is fundamental
to the successful operation of metadata systems.
One of the most widely used metadata schemes, the Dublin Core Metadata
Element Set is now an international standard, ISO 15836 (ISO, 2009b). It
provides a starting point for many application profiles developed by specific
communities and individual organisations. It forms the basis of many other
standards, such as FOAF, AGLS, eGMS, Europeana Data Model and Pundit
(National Archives of Australia, 2008; e-Government Unit, 2006; Brickley and
Miller, 2014; Net7 srl, 2017; Europeana Labs, 2016).
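The relationship between the element set and an application profile can be sketched in a few lines of Python. The list of mandatory elements below is a hypothetical profile for illustration, not any published one:

```python
# The 15 Dublin Core Metadata Element Set elements (ISO 15836).
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# A hypothetical application profile: a community narrows the general
# standard by declaring which elements are mandatory for its records.
PROFILE_MANDATORY = {"title", "creator", "date", "identifier"}

def validate(record):
    """Return (unknown elements, missing mandatory elements)."""
    unknown = set(record) - DC_ELEMENTS
    missing = PROFILE_MANDATORY - set(record)
    return unknown, missing

record = {
    "title": "Metadata for Information Management and Retrieval",
    "creator": "Haynes, David",
    "date": "2018",
}
print(validate(record))  # → (set(), {'identifier'})
```

This is the pattern behind profiles such as AGLS and eGMS: the general-purpose element set supplies the vocabulary, and the profile supplies the local rules.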
Standards development is a helpful way of negotiating a system that all
parties can work with. This is evident in the publishing industry and book
trade, which is made up of conflicting and competing interests and yet has
co-operated to develop the ONIX metadata system, because of the need of
the different parties (book retailers and publishers) to exchange data
(EDItEUR, 2014). Another example is the evolution of the national MARC
standards into MARC 21, a unified standard adopted internationally (Library
of Congress, 2017c). Standards development is often seen as a common good.
Publicly funded organisations such as the Library of Congress in the USA or
Jisc in the UK often lead standards development and provide the support and
infrastructure for implementation. Commercial organisations such as data
service providers and systems providers get involved to influence eventual
standards that are used to define the market. A second source of standards
development comes from commercial organisations that want to develop
proprietary standards which tie customers into their products or services.
Some of these, such as Adobe’s PDF format, have reached a wider
community. One of the tasks of bodies that sponsor national and international
standards is to reconcile these different interests to produce a standard that
can be widely adopted. This is an expensive process, which may be funded
by private sector organisations with an interest in the standard. Professional
bodies, trade associations and public bodies (particularly regulators) will all
have a say in standards development and ratification. In this book we have
seen a wide range of organisations involved in standards development,
including those listed in Table 14.1 opposite.
A closer look at the new, six-point model demonstrates why it works and
provides some indication of how it might develop in the future.
Conclusion
In the first edition of this book, I speculated on whether metadata as a concept
was here to stay. Metadata has been around for at least 2500 years in the form
of library catalogues. During that time it has been transformed into something
with a wide range of applications and operating in very complex
environments. Since 2004 it has become established as a label for job titles and
it has also emerged into the public awareness following high-profile news
stories. It is not a passing fad, but nor has it become a single discipline with
a coherent body of knowledge. The usage of the term ‘metadata’ is still very
varied and cuts across a number of distinct professional communities.
If there is a message from all of this, it is that the purposes of metadata
continue to be relevant and provide a useful insight into the way in which
metadata standards operate. Recent events have also demonstrated the wide
relevance of metadata to the everyday activity of all of us as users of the
internet or of telecommunications networks. The data and metadata
generated by our activity is a powerful marketing and monitoring tool that
can enhance and enrich our lives. However, we have to be aware of the harm
that can result from misuse and be vigilant about encroachments on privacy
and human rights.
References
Abele, A., McCrae, J. P., Buitelaar, P., Jentzsch, A. and Cyganiak, R. (2017) The Linking
Open Data Cloud Diagram, http://lod-cloud.net [accessed 24 March 2017].
Aitchison, J., Gilchrist, A. and Bawden, D. (2000) Thesaurus Construction and Use: a
practical manual, 4th edn, London, Aslib/IMI.
Anderson, R. D. and Eberlein, K. J. (2015) Darwin Information Typing Architecture
(DITA) Version 1.3, Part 0: Overview, 10, http://docs.oasis-open.org/dita/dita/
v1.3/os/part0-overview/dita-v1.3-os-part0-overview.pdf [accessed 6 March 2017].
Arms, W. Y., Hillmann, D., Lagoze, C., Krafft, D., Marisa, R., Saylor, J., Terrizzi, C.
and Van de Sompel, H. (2002) A Spectrum of Interoperability: the site for science
prototype for the NSDL, D-Lib Magazine, 8 (1).
ASIS&T (2017) Classification Research SIG, www.asist.org/groups/classification-
research-cr [accessed 2 February 2017].
Auer, S., Bryl, V. and Tramp, S. (eds) (2014) Linked Open Data: creating knowledge out
of interlinked data. Results of the LOD2 project, Springer International Publishing.
Australian Institute of Health and Welfare (2017) METeOR,
http://meteor.aihw.gov.au/content/index.phtml/itemId/181162 [accessed 18
January 2017].
Baca, M. (ed.) (1998) Introduction to Metadata: pathways to digital information, 1st edn,
Los Angeles, CA, Getty Information Institute.
Baca, M. (ed.) (2016) Introduction to Metadata, 3rd edn,
www.getty.edu/publications/intrometadata [accessed 4 November 2016].
Baeza-Yates, R. and Ribeiro-Neto, B. (2011) Modern Information Retrieval: the concepts
and technology behind search, 2nd edn, Harlow, Pearson Education.
Bailie, R. A. and Urbina, N. (2012) Content Strategy: connecting the dots between
business, brand, and benefits, XML Press.
Baker, T., Coyle, K. and Petiya, S. (2014) Multi-entity Models of Resource
Description in the Semantic Web: a comparison of FRBR, RDA and BIBFRAME,
Library Hi Tech, 32 (4), 562–82.
in the Era of Linked Data, Bulletin of the Association for Information Science and
Technology, 41 (4), 13–17.
Corporation for Public Broadcasting (2011) PBCore Schema,
http://pbcore.org/schema/ [accessed 27 July 2015].
Coyle, K. and Baker, T. (2009) Guidelines for Dublin Core Application Profiles,
http://dublincore.org/documents/2009/05/18/profile-guidelines [accessed 21 July
2015].
Creative Commons (2016) Creative Commons, http://creativecommons.org [accessed 5
February 2016].
Daconta, M. C. (2013) Big Metadata: 7 ways to leverage your data in the cloud, GCN,
https://gcn.com/blogs/reality-check/2013/12/metadata.aspx [accessed 3 March
2017].
DataCite Metadata Working Group (2016) DataCite Metadata Schema Documentation
for the Publication and Citation of Research Data, Version 4.0,
http://schema.datacite.org/meta/kernel-4.0/doc/DataCite-
MetadataKernel_v4.0.pdf [accessed 2 October 2017].
Davenport, T. H. (2014) Big Data at Work: dispelling the myths, uncovering the
opportunities, Boston, MA, Harvard Business Review Press.
Day, M. (1999) Metadata for Digital Preservation: an update, Ariadne, 22,
www.ariadne.ac.uk/issue22/metadata [accessed 4 October 2017].
Day, M. (2001) Metadata in a Nutshell, Information Europe, 6 (2), 11.
Day, M. (2004) Preservation Metadata. In Gorman, G. E. (ed.) Information Yearbook for
Library and Information Management 2003–2004: metadata applications and
management, London, Facet Publishing, 253–73.
DCMI (2012) Dublin Core Metadata Element Set v1.1,
http://dublincore.org/documents/dces [accessed 24 April 2017].
DCMI (2015) DCMI Mission and Principles, http://dublincore.org/about-us [accessed
21 July 2015].
DCMI Usage Board (2012) DCMI Metadata Terms,
http://dublincore.org/documents/2012/06/14/dcmi-terms [accessed 18 August
2015].
de Vries, J., Williams, T. N., Bojang, K., Kwiatkowski, D. P., Fitzpatrick, R. and
Parker, M. (2014) Knowing Who to Trust: exploring the role of ‘ethical metadata’
in mediating risk of harm in collaborative genomics research in Africa, BMC
Medical Ethics, 15 (62), http://www.biomedcentral.com/1472-6939/15/62 [accessed
2 October 2017]
de Waard, A. (2016) Research Data Management at Elsevier: supporting networks of
data and workflows, Information Services & Use, 36 (1–2), 49–55.
Democracy Now (2013) Court: Gov’t Can Secretly Obtain Email, Twitter Info from
Ex-WikiLeaks Volunteer Jacob Appelbaum,
www.democracynow.org/2013/2/5/court_govt_can_secretly_obtain_email
2017].
Hirata, K. and Kato, T. (1992) Query by Visual Example. In Pirotte, A., Delobel, C.
and Gottlob, G. (eds) Advances in Database Technology – EDBT ’92: 3rd International
Conference on Extending Database Technology Vienna, Austria, March 23–27, 1992
Proceedings, Berlin, Heidelberg, Springer, 56–71.
Hudgins, J., Agnew, G. and Brown, E. (1999) Getting Mileage out of Metadata:
applications for the library, LITA Guides No. 5, Chicago, IL, American Library
Association.
Iannella, R. (2002) Open Digital Rights Language (ODRL) Version 1.1, Cambridge MA,
W3C, https://www.w3.org/TR/2002/NOTE-odrl-20020919/ [accessed 2 October
2017].
Iannella, R. and Campbell, D. (1999) The A-Core: metadata about content metadata,
http://metadata.net/admin/draft-iannella-admin-01.txt [accessed 2 March 2004].
ICA (2000) ISAD(G): General International Standard Archival Description, 2nd edn,
www.icacds.org.uk/eng/ISAD(G).pdf, 91 [accessed 12 April 2016].
ICOT (1999) What is FGCS Technologies?,
https://web.archive.org/web/20090217105259/www.icot.or.jp/ARCHIVE/
Museum/ICOT/FGCS-E.html [accessed 24 March 2017].
IEEE Computer Society (2002) 1484.12.1 IEEE Standard for Learning Object Metadata,
New York, NY, IEEE.
IIIF Consortium (2017) IIIF Presentation API 2.1.1,
http://iiif.io/api/presentation/2.1/#b-summary-of-metadata-requirements
[accessed 17 August 2017].
IFLA (1998) Functional Requirements for Bibliographic Records Final Report: IFLA Study
Group on the Functional Requirements for Bibliographic Records, Munich, K. G. Saur.
IFLA (2011) ISBD – International Standard Bibliographic Description Consolidated
Edition, Berlin, De Gruyter Saur.
IFLA (2013) Functional Requirements for Authority Data: a conceptual model, The Hague.
Information Commissioner’s Office (2012) Anonymisation: managing data protection
risk code of practice, Wilmslow.
International Council on Archives (2004) ISAAR (CPF) International Standard Archival
Authority Record for Corporate Bodies, Persons and Families, Paris.
International DOI Foundation (2012) DOI Handbook, www.doi.org/hb.html [accessed
31 March 2016].
International ISBN Agency (2012) ISBN Users’ Manual, 6th edn, London.
International ISBN Agency (2014) What is an ISBN?,
www.isbn-international.org/content/what-isbn [accessed 8 May 2017].
International ISTC Agency (2010) International Standard Text Code (ISTC) User
Manual, Version 1.2, London.
IPTC (2014) IPTC Photo Metadata Standard, London.
ISKO (2017) International Society for Knowledge Organization, www.isko.org [accessed
2 February 2017].
ISO (1986) ISO 8879:1986 Information Processing – Text and Office Systems – Standard
Generalized Markup Language (SGML), Geneva.
ISO (2002) ISO 639-1:2002 Codes for the Representation of Names of Languages – Part 1:
Alpha-2 Code, Geneva.
ISO (2004a) ISO/IEC 15444-2:2004 Information Technology – JPEG 2000 Image Coding
System – Part 2: Extensions, Geneva.
ISO (2004b) ISO/IEC TR 21000-1:2004 Information Technology – Multimedia Framework (MPEG-
21) – Part 1: Vision, Technologies and Strategy, Geneva.
ISO (2004c) ISO 8601:2004 Data Elements and Interchange Formats – Information
Interchange – Representation of Dates and Times, Geneva.
ISO (2005) ISO 2108:2005 Information and Documentation – International Standard Book
Number (ISBN), Geneva.
ISO (2007) ISO 3297:2007 Information and Documentation – International Standard Serial
Number (ISSN), Geneva.
ISO (2008) ISO15706-1:2002+A1:2008 Information and Documentation – International
Standard Audiovisual Number (ISAN), Geneva.
ISO (2009a) ISO 10957:2009 Information and Documentation – International Standard
Music Number (ISMN), Geneva.
ISO (2009b) ISO 15836:2009 Information and Documentation – The Dublin Core Metadata
Element Set, Geneva.
ISO (2009c) ISO 21047:2009 Information and Documentation – International Standard
Text Code (ISTC), Geneva.
ISO (2009d) ISO 23081-2:2009 Information and Documentation – Records Management
Processes – Metadata for Records – Part 2: Conceptual and Implementation Issues,
Geneva.
ISO (2009e) ISO 31000:2009 Risk Management – Principles and Guidelines, Geneva.
ISO (2011) ISO 25964-1:2011 – Information and Documentation – Thesauri and
Interoperability with other Vocabularies. Part 1: Thesauri for Information Retrieval,
Geneva.
ISO (2012a) ISO/IEC 19505-1: 2012 Information technology – Object Management Group
Unified Modeling Language (OMG UML) – Part 1: Infrastructure, Geneva.
ISO (2012b) ISO 26324:2012 Information and Documentation – Digital Object Identifier
System, Geneva.
ISO (2013) ISO 3166-1:2013 Codes for the Representation of Names of Countries and Their
Subdivisions Part 1: Country Codes, Geneva.
ISO (2014a) EN ISO 19115-1 Geographic information – Metadata. Part 1: Fundamentals,
Geneva.
ISO (2014b) ISO 28560-2:2014 Information and Documentation – RFID in Libraries – Part
2: Encoding of RFID Data Elements Based on Rules from ISO/IEC 15962, Geneva.
ISO (2015) Standards, www.iso.org/iso/home/standards.htm [accessed 6 July 2015].
Lagoze, C. and Hunter, J. (2002) The ABC Ontology and Model, Journal of Digital
Information, 2 (2), https://journals.tdl.org/jodi/index.php/jodi/article/view/44/47
[accessed 2 October 2017].
Lambe, P. (2007) Organising Knowledge, Oxford, Chandos Publishing.
LaTeX3 Project Team (2001) LaTeX2e for Authors, https://www.latex-
project.org/help/documentation/usrguide.pdf [accessed 2 October 2017].
Latham, K. F. (2012) Museum Object as Document: using Buckland’s information
concepts to understand museum experiences, Journal of Documentation, 68 (1),
45–71.
Laudon, K. C. and Traver, C. G. (2014) E-commerce: business, technology, society,
Global edn, 10th edn, Harlow, Pearson Education.
Lavoie, B. F. (2004) The Open Archival Information System Reference Model: introductory
guide, Dublin, OH, OCLC.
Law, K. H., Lau, G., Kerrigan, S. and Ekstrom, J. A. (2014) REGNET: regulatory
information management, compliance and analysis, Government Information
Quarterly, 31, S37–S48.
Li, C. and Sugimoto, S. (2014) Provenance Description of Metadata using PROV
with PREMIS for Long-term Use of Metadata. In International Conference on
Dublin Core and Metadata Applications, 8–11 October 2014, Austin, TX, 147–56.
Library of Congress (2011) MIX NISO Metadata for Images in XML Schema,
www.loc.gov/standards/mix [accessed 27 July 2015].
Library of Congress (2012) Dublin Core Metadata Element Set Mapping to MODS
Version 3, www.loc.gov/standards/mods/dcsimple-mods.html [accessed 19
January 2017].
Library of Congress (2015a) Metadata Encoding and Transmission Standard (METS)
Official Web Site, www.loc.gov/standards/mets [accessed 29 July 2015].
Library of Congress (2015b) Metadata Object Description Schema (MODS),
www.loc.gov/standards/mods [accessed 23 July 2015].
Library of Congress (2016a) Encoded Archival Description (EAD),
www.loc.gov/ead [accessed 24 March 2016].
Library of Congress (2016b) External Schemas for Use with METS,
www.loc.gov/standards/mets/mets-extenders.html [accessed 26 May 2017].
Library of Congress (2017a) Bibliographic Framework Initiative (BIBFRAME),
www.loc.gov/bibframe [accessed 17 August 2017].
Library of Congress (2017b) Library of Congress Names,
http://id.loc.gov/authorities/names.html [accessed 8 May 2017].
Library of Congress (2017c) MARC Standards, www.loc.gov/marc [accessed 28
March 2017].
Library of Congress and Stock Artists Alliance (2009) PhotoMetadata Project,
www.photometadata.org [accessed 24 September 2015].
MacFarlane, A., Robertson, S. E. and McCann, J. A. (2004) Parallel Computing for
Tools: a survey of the current state of the art, Information Technology & Libraries, 34
(3), 22–42.
Pearsall, J. (ed.) (1999) The New Oxford Dictionary of English, Oxford, Oxford
University Press.
Peters, I. and Weller, K. (2008) Tag Gardening for Folksonomy Enrichment and
Maintenance, Webology, 5 (3), http://www.webology.org/2008/v5n3/a58.html
[accessed 6 February 2017]
Pinker, S. (2015) The Sense of Style: the thinking person’s guide to writing in the 21st
century, Penguin Books.
Pomerantz, J. (2015) Metadata, Cambridge, MA, MIT Press.
Ponceleón, D. and Slaney, M. (2011) Multimedia Information Retrieval. In Baeza-
Yates, R. and Ribeiro-Neto, B. (eds) Modern Information Retrieval: the concepts and
technology behind search, Harlow, Pearson Education, 587–639.
Powell, A., Nilsson, M., Naeve, A., Johnston, P. and Baker, T. (2007) DCMI Abstract
Model, http://dublincore.org/documents/abstract-model [accessed 19 April 2016].
PREMIS Editorial Committee (2015) PREMIS Data Dictionary for Preservation
Metadata, Version 3.0, http://www.loc.gov/standards/premis/v3/
premis-3-0-final.pdf [accessed 25 May 2017].
Privacy International (2017) Privacy 101, Metadata,
www.privacyinternational.org/node/53 [accessed 23 March 2017].
Publications Office of the European Union (2017) EU Open Data Portal,
http://data.europa.eu/euodp/en/data [accessed 14 February 2017].
RDA Steering Committee (2015) Map from ISBD Properties to Unconstrained RDA
Properties, www.rdaregistry.info/Maps/mapISBD2RDAU.html [accessed 17 July
2017].
RDA Steering Committee (2017) RSC Website, www.rda-rsc.org [accessed 16 July
2017].
RDA Steering Committee, Metadata Management Associates & ALA Digital
Reference (2016) RDA Registry Examples, www.rdaregistry.info/Examples
[accessed 30 July 2017].
Riva, P., Le Boeuf, P. and Žumer, M. (2017) IFLA Library Reference Model, The Hague.
Robinson, L. (2009) Information Science: communication chain and domain analysis,
Journal of Documentation, 65 (4), 578–91.
Robinson, L. (2015) Multisensory, Pervasive, Immersive: towards a new generation
of documents, Journal of the Association for Information Science and Technology, 66
(8), 1734–7.
Rodriguez, M. A., Bollen, J. and Van de Sompel, H. (2009) Automatic Metadata
Generation Using Associative Networks, ACM Transactions on Information
Systems, 27 (2), 7:1–7:20.
Rosenberg, D. (2013) Data Before the Fact. In Gitelman, L. (ed.) ‘Raw Data’ is an
Oxymoron, Cambridge, MA, Massachusetts Institute of Technology, 15–40.
Rousidis, D., Garoufallou, E., Balatsoukas, P. and Sicilia, M-A. (2014) Metadata for
Big Data: a preliminary investigation of metadata quality issues in research data
repositories, Information Services and Use, 34 (3–4), 279–86.
Rust, G. and Bide, M. (2000) The <indecs> Metadata Framework: principles, model and
data dictionary June 2000 WP1a-006-2.0,
http://www.doi.org/topics/indecs/indecs_framework_2000.pdf [accessed 2
October 2017]
Rutherford Appleton Laboratory (2015) The Alvey Programme,
www.chilton-computing.org.uk/inf/alvey/overview.htm [accessed 24 March 2017].
Salton, G., Allan, J. and Buckley, C. (1993) Approaches to Passage Retrieval in Full
Text Information Systems. In Proceedings of the 16th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’93,
New York, NY, ACM, 49–58.
Salton, G. and Yang, C. S. (1973) On the Specification of Term Values in Automatic
Indexing, Journal of Documentation, 29 (4), 351–72.
Schneier, B. (2015) Data and Goliath: the hidden battles to collect your data and control
your world, New York, NY, W. W. Norton.
Schreiber, G. and Raimond, Y. (2014) RDF 1.1 Primer, W3C,
www.w3.org/TR/rdf11-primer [accessed 26 April 2017].
Shannon, C. E. (1948). A Mathematical Theory of Communication, Bell System
Technical Journal, 27 (3), 379–423.
Shannon, C. E. and Weaver, W. (1949) The Mathematical Theory of Communication,
Urbana, IL, University of Illinois Press.
Sheriff, A. M., Bouchlaghem, D., El-Hamalawi, A. and Yeomans, S. (2011)
Developing a Metadata Standard for Multimedia Content Management: a case
study, Architectural Engineering and Design Management, 7, 157–76.
Singh, J. and Kumar, V. (2014) Virtual Appliances-Based Framework for Regulatory
Compliances in Cloud Data Centers, IUP Journal of Information Technology, 10 (1),
30–47.
Singhal, A. (2012) Introducing the Knowledge Graph: things, not strings, Google Official
Blog, https://googleblog.blogspot.co.uk/2012/05/introducing-knowledge-graph-
things-not.html [accessed 4 August 2017].
Skinner, G., Han, S. and Chang, E. (2005) A New Conceptual Framework within
Information Privacy: meta privacy. In Hao, Y., Liu, J., Wang, Y., Cheung, Y.-M.,
Yin, H., Jiao, L., Ma, J. and Jiao, Y-C. (eds) Computational Intelligence and Security,
Pt 2, Lecture Notes in Artificial Intelligence, Springer, 55–61.
SLA (2017) SLA Taxonomy Division, http://taxonomy.sla.org/ [accessed 2 February
2017].
Smiraglia, R. (2005) Introducing Metadata. In Smiraglia, R. (ed.) Metadata: a
cataloger’s primer, Binghamton, NY, Haworth Information Press, 1–15.
Solove, D. J. (2011) Nothing to Hide: the false tradeoff between privacy and security, New
Index
data portals 106 (see also open data repositories and portals)
data protection see privacy and data protection
data retrieval 95
Date data element 26, 52, 59, 91–2, 93
DCAP (Dublin Core Application Profile) 54
DCAT (Data Catalog vocabulary) 178, 207, 209 (see also application profiles)
de facto standards 50, 148, 168, 233
Description data element 92–3, 94
descriptive metadata 14, 69–70, 71, 86–7, 88–93, 94
DIDL (Digital Items Declaration Language) 132–3
Digital Curation Centre 167
  lifecycle model 116, 118, 165
Digital Items Declaration Language see DIDL
Digital Object Identifiers see DOIs
digital objects 19, 67, 70–1, 82, 133, 218
  METSRights 129
  PREMIS 120, 121, 129
  provenance 136, 137, 138
  Schema.org 196
  (see also electronic documents)
digital resources 113, 128, 137, 194
  preservation 118–19
DIP (Dissemination Information Package) 47
Directory of Open Access Repositories see OpenDOAR
disposal of library materials 118, 123–4
  records 16, 28, 29, 115
Dissemination Information Package see DIP
Document Type Definitions see DTDs
document mark-up 19–22 (see also mark-up languages)
DOIs (Digital Object Identifiers) 44, 79, 81–3, 206, 215
Domesday Project see BBC Domesday Book Project
DTDs (Document Type Definitions) 23–4, 62
Dublin Core 7, 39–40, 65, 71–2
  application profiles 54, 149, 170
  crosswalks 175–6, 178
  data elements 24–5, 52–4
  DCMI 51–4
  rights management 128–9
Dublin Core Application Profile see DCAP
Dublin Core Metadata Initiative see Dublin Core – DCMI
EAD (Encoded Archival Description) 62, 74, 115
EAN codes (European Article Number) 80, 81, 84 (see also barcodes)
EBUCore 65
ECM (Enterprise Content Management) systems 26–7
e-commerce 16, 141, 235
  electronic transactions 139–40, 211–12
  images 146–7
  indecs 44–5, 143–4
  music industry 147–8
  ONIX 118, 143–4, 145–6
  publishing and the book trade 48, 144–8
e-discovery 156, 159
EDRM (Electronic Document and Records Management) 16, 27, 29, 186
  creating metadata 193