DCC Approach to
Digital Curation - under Development
The DCC Development team is working closely with the CASPAR project and further work on this document is being carried out in association with that project. In particular the CASPAR Conceptual Model follows on from this document and documents which describe its implementation may be obtained from http://www.casparpreserves.eu/publications/deliverables.
Changes
| Date | By | Change |
| 31 May 2004 | DavidGiaretta | Added comments on GGF Persistent Archives in introduction |
| 08 Jun 2004 | DavidGiaretta | Added a few more words on Data Grids |
| 15 Jun 2004 | DavidGiaretta | Added text to make the document easier to understand, emphasising a pragmatic rather than dogmatic approach |
| 23 Jun 2004 | DavidGiaretta | Add discussion of definition of Digital Curation |
| 15 Aug 2004 | DavidGiaretta | Response to Neil Beagrie's comments |
| 10 Oct 2004 | DavidGiaretta | Add suggestions for services |
| 11 Nov 2004 | DavidGiaretta | Add reference to CEDARS project |
| 04 Jan 2005 | DavidGiaretta | Add OAIS layered information model |
| 22 Jan 2005 | DavidGiaretta | Add info about RI Label Schema |
| 06 Feb 2005 | DavidGiaretta | Add focus on Preservation of Registry/Repository holdings i.e. it should be an OAIS |
| 16 Feb 2005 | DavidGiaretta | Add summary bullets to Introduction |
| 19 Apr 2005 | DavidGiaretta | Add clarification about scope of curation wrt preservation and "usability" in the context of OAIS |
| 28 May 2005 | DavidGiaretta | Change title of the document to make it clear that it is the view mainly of the DCC Development Team |
| 21 Sep 2005 | DavidGiaretta | Integrate with CASPAR bid |
| 11 Dec 2005 | DavidGiaretta | Tidy up comments |
| 31 Jul 2006 | DavidGiaretta | Ref. to CASPAR project collaboration |
##section#. Introduction and Motivation
The motivation for this document is a belief that an overall "Approach to Curation", which includes a conceptual model and an outline architecture, must be agreed in order to allow the DCC to make progress. The approach to Curation presented here is a pragmatic one which aims at providing the basis for a project plan which allows us to show some real added-value services in addition to the advisory services; we expect that the approach will evolve along with our understanding of the issues.
In summary:
- Digital Curation covers many issues including financial, scientific, technical, legal and sociological ones.
- Preservation, a subset of curation, is a particularly significant issue not least because preservation is a necessary activity, although of course it is not sufficient. Moreover because preservation of digital information is much more than bit preservation, if it is done properly it will contribute to the other technical aspects of curation.
- The ISO standard OAIS Reference Model, plus its follow-on standards work, provides the foundation - concepts and terminology - for information preservation.
- OAIS has several preservation strategies including migration and emulation. However ensuring long-term usability of the information focusses on the information content and this particular aspect of preservation links to contemporaneous usage.
- Representation Information is a key concept for information preservation and a (distributed) Registry/Repository of Representation Information is an essential service.
- The Registry/Repository must itself be an exemplar trustworthy OAIS repository, for long-term preservation of the Representation Information which it holds, and it will be OAIS certified in due course.
- A number of tools and services can be built on the Registry service itself and on the Repository's design and implementation.
- The tools and services will aim to promote interoperability and automated use as far as possible, and will support information preservation over the long term as well as current and future usability of information held in repositories of all kinds.
- The tools and services must be easily integrated into many of the other UK and international projects which are addressing the issue of digital curation.
- Migration and Emulation preservation strategies must also be supported.
- In addition to Representation Information, OAIS defines Preservation Description Information and Packaging Information. These types of information, and several other kinds, must be supported.
- Legal and sociological issues are extremely important for curation in general, although one can imagine circumstances in which these are not significant. It is likely that we can provide support in terms of advice and checklist tools.
- Financial issues are always likely to be important and are likely to be supported by such things as advice on cost-effectiveness of hardware and software, and associated cost models for a variety of strategies.
- The ability to certify repositories according to an international standard will be an important service.
##section#.. Definition of Digital Curation
The Digital Curation Centre's current definition of the term is:
Digital curation, broadly interpreted, is about maintaining and adding value to, a trusted body of digital information for current and future use.
Some history of the term
The term Digital Curation is a rather recent invention.
The Digital Data Curation Task Force - Report of the Task Force Strategy Discussion Day (2002) states
- "Tony Hey took up the term which had been used by Dr John Taylor, Director General of the Research Councils, to distinguish the actions involved in caring for digital data beyond its original use, from digital preservation. The concept's reach extends beyond libraries."
The e-Science Curation Report (2003) by Lord and Macdonald proposed the following distinctions:
- Curation : The activity of, managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose, and available for discovery and re-use. For dynamic datasets this may mean continuous enrichment or updating to keep it fit for purpose. Higher levels of curation will also involve maintaining links with annotation and with other published materials.
- Archiving : A curation activity which ensures that data is properly selected, stored, can be accessed and that its logical and physical integrity is maintained over time, including security and authenticity.
- Preservation : An activity within archiving in which specific items of data are maintained over time so that they can still be accessed and understood through changes in technology.
Peter Burnhill has suggested:
- Digital Curation = Data Curation plus Digital Longevity
There has been a lack of clarity, at least partly because a number of terms, used singly or in combination, have more or less clear definitions but are used without adequate care. The danger is that the reader (or listener) may believe that something extra is implied. Some terms that are worth distinguishing are:
- data preservation : a general term probably equivalent to digital preservation in this context
- digital preservation : although this is sometimes interpreted as simply ensuring the original bits and bytes are accessible.
- digital information preservation : this is what is referred to in the OAIS standard - what is important is not the original "bits and bytes" but the content. An OAIS ensures that the content is accessible, understandable and usable.
- curation : general term - taking care of things
- data curation : looking after and adding value to data
- digital curation : looking after and somehow "adding value" to digital data, ensuring its current and future usefulness. This probably implies creating some new data from the existing, in order to make the latter more useful and "fit for purpose".
- information curation : not seen in the wild
On the importance of this definition
Peter Buneman has referred to two historical roots for our project (and for curation in the world outside), as preservers and publishers. To caricature this view, the preservers (including archivists and librarians) will take action to ensure their information is usable into the far future, even if this means putting it into some (almost) dark repository, wrapped up with specialist metadata. Meanwhile the publishers are solely concerned with getting their information out there and used, value-added with annotations, tracking its provenance and lineage, and perhaps providing access to some past states for information objects that change rapidly (such as many databases).
He points out that the two groups might seem to have orthogonal concerns, but that there is or should be more commonality between them. The preservers must be concerned with access, use, changing objects, etc, while the publishers must be concerned with making sure their objects will remain accessible far into the future.
The problem is that each group sees pressing problems within its own frame of reference. While each may be intellectually interested in the other's point of view, neither group has accepted both views as internally essential.
Clearly these extremes are indeed caricatures and we have to ensure that proponents of different views make strong efforts to express themselves in ways that are relevant to the other's frame of reference.
In particular this document is a vehicle to bring together the ideas of "preservation", "publication" and "use", to identify key areas which will benefit multiple aspects of curation simultaneously, recognising that resource is limited.
##section#. Foundations
The Development Team's approach is to begin using the OAIS Reference Model and its view of information preservation as the basis for this document, then add in ideas from many other sources in order to cover areas which are not within the remit of OAIS. This starting point is a pragmatic one: OAIS is recognised as a significant standard in this area; it is one with which there is considerable familiarity by several members of the consortium; this starting point does, as should become clear through this document, provide some of the foundations for the DCC. Nevertheless the OAIS Reference Model clearly does not provide all the answers and so additional ideas will need to be brought in.
In addition to information preservation, the term Curation has been used to indicate support for current research activities using that information ("The term 'digital curation' is increasingly being used for the actions needed to maintain and utilise digital data and research results over their entire life-cycle for current and future generations of users."; see JISC circular 6/03 (Revised)), including using information in new ways and also publishing results based on the information. Much of the research being undertaken within the DCC by the Database Group falls into this category. It is perfectly possible, and there are many examples, for digital repositories supporting current research not to have any long term preservation aspirations. Similarly it is possible to have a focus on preservation without much regard for supporting current research. However neither of these would be the right course for the DCC.
There are very many existing projects to support current research activities including many e-Science projects and the JISC Information Environment Architecture - all of which have particular relevance to the DCC. It does not seem sensible to duplicate or re-invent these. However it does seem likely that an architecture which tackles preservation issues - ensuring that information is usable in the future where users are unfamiliar with the data - will also be significant in supporting current usage, especially where, again, users are unfamiliar with the data.
We therefore take the view that we should be guided by long-term management and long-term preservation aspects, and try to ensure that components of the preservation architecture can supplement other "current use" architectures. To promote this we emphasise "interoperability" and "automated use" as far as possible. Clearly we need to find common ground or a fusion of ideas between the two approaches.
By "interoperability" we mean here the ability of separate systems (possibly covering different disciplines and built using different architectures) to exchange and use each other's data. This is important because we cannot dictate information architectures for "current use" being developed now, much less those developed in the future. The term "automated use" denotes the desire for systems to be able to deal with information without the need for human intervention, and in particular without the need for a human to read and interpret documentation associated with the information - especially important as we deal with increasing amounts of information from more and more sources. This aspect is important to both publishers and preservers.
The Global Grid Forum (GGF) has a Persistent Archives Working Group (PAWG) (http://www.gridforum.org/6_DATA/persist.htm), chaired by Reagan Moore and Kerstin Kleese, which is looking at Persistent Archives. These are defined as providing the mechanisms needed to manage technology evolution while preserving records and their context, and basing these on virtual data grid technology, focussing on "the management of the evolution of the software and hardware infrastructure over time". Persistent Archives may be viewed within the context of the OAIS Reference Model, covering bit preservation, migration, media renewal etc, but omitting the Designated Community and many aspects of Preservation Planning which are essential to ensuring that information is preserved in a usable way. Nevertheless it seems clear that the Data Grid technology would be a very valuable area of work which we should be able to use, and perhaps contribute to.
Other preservation architectures are being developed, notably the NDIIPP (http://www.digitalpreservation.gov/index.php?nav=3&subnav=12) from the US Library of Congress. We believe that that architecture addresses current use and has little to do with preservation but hope that we can work with them to come to a common view.
##section#.. Project Implications
- DCC should attempt to understand the current and future use aspects of curation in more detail, including the impact of future time on "publishers" and of current access on "preservers"
- DCC should initially focus on Preservation, emphasising interoperability and automated use
- We should use the OAIS Reference Model as our initial guide
- outreach should be put in place to work with "current use" projects
- DCC should attempt to work with the NDIIPP and the GGF PAWG on architecture
- work actively with Reagan Moore on Data Grid and data virtualisation
- work in very close collaboration with the EU funded *CASPAR* project
##section#. OAIS Reference Model
We take as the basis of this architecture the OAIS Reference Model and its view of information preservation. The Reference Model is accessible at http://ssdoo.gsfc.nasa.gov/nost/wwwclassic/documents/pdf/CCSDS-650.0-B-1.pdf. Its genesis from the space science community, which may be significant in understanding it, is apparent from http://ssdoo.gsfc.nasa.gov/nost/isoas/. A simple description is available at http://www.rlg.org/en/pdfs/rlgnews/news56.pdf, which is one of the resources accessible from RLG's OAIS pages, anchored at http://www.rlg.org/en/page.php?Page_ID=3201.
OAIS describes several preservation strategies including Migration and Emulation; these are discussed later. Now we focus on the aspects of the preservation of the usability of information which can assist in the other side of curation, namely the usability of digital information.
##section#.. Key Concepts from OAIS
The OAIS Reference Model has many aspects, including an Information Model, a Functional Model, and an Information Flow Model. All these play their part. However two key concepts drive this document:
- Representation Net
- Designated Community and its Knowledge Base
Designated Community is defined as an identified group of potential Consumers who should be able to understand a particular set of information. The Designated Community may be composed of multiple user communities.
Knowledge Base is defined as a set of information, incorporated by a person or system, that allows that person or system to understand received information.
These are discussed in detail next.
##section#. OAIS: Representation Net
A basic concept of the OAIS Reference Model (ISO 14721, http://www.ccsds.org/documents/650x0b1.pdf) is the concept of information being a combination of data and Representation Information. The UML diagram in Figure 1 (OAIS Information Object) illustrates this concept. The Information Object is composed of a Data Object that is either physical or digital, and the Representation Information that allows for the full interpretation of the data into meaningful information. This model is valid for all the types of information in an OAIS.
Figure 1 OAIS Information Object
This UML diagram means that
- an Information Object is made up of a Data Object and Representation Information
- a Data Object can be either a Physical Object or a Digital Object. An example of the former is a piece of paper or a rock sample.
- a Digital Object is made up of one or more Bits.
- A Data Object is interpreted using Representation Information
- Representation Information is itself interpreted using further Representation Information
This figure shows that Representation Information may contain references to other Representation Information. When this is coupled with the fact that Representation Information is an Information Object that may have its own Digital Object and other Representation Information associated with understanding each Digital Object, as shown in a compact form by the "interpreted using" association, the resulting set of objects can be referred to as a Representation Network.
Figure 2 (Representation Information Object) shows the Representation Information component in more detail; in particular it breaks out the semantic and structural information and recognises that there may be "Other" Representation Information such as software.
Figure 2 Representation Information Object
The recursion of the Representation Information will ultimately stop at a physical object such as a printed document (ISO standard, informal standard, notes, publications etc) but use of things like paper documentation would tend to prevent "automated use" and "interoperability", and complete resolution of the whole Representation Net to this level would be an almost impossible task. Therefore we would prefer to stop earlier. In particular we can stop for a particular Designated Community when the Representation Information can be understood with that Designated Community's Knowledge Base.
For example a science file in FITS format would be directly understood by someone who knew how to handle this format - someone whose Knowledge Base includes FITS - for example because they have some appropriate software. Someone whose Knowledge Base does not include FITS would need additional Representation Information - for example they would have to be provided with some software or the written FITS standard.
The CEDARS project referred to "Godel ends" (see http://www.leeds.ac.uk/cedars/guideto/cdap/guidetocdap.pdf) to capture this concept.
A problem with Representation Information is that the amount needed for a particular object could be vast and impractical to do anything with in reality. It is for that reason that the concept of the Designated Community is so important. It allows us to limit the Representation Information required to be captured at any one time.
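The way a Knowledge Base cuts the Representation Net short can be illustrated with a small traversal. This is only a sketch under stated assumptions: the `RepInfo` class, the `required_rep_info` function and the example net (a FITS standard document that needs PDF and English to be read) are hypothetical names invented here, not part of OAIS or of any DCC design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class RepInfo:
    """One node in a Representation Net (hypothetical structure)."""
    name: str
    # Names of further Representation Information needed to interpret this node.
    interpreted_using: List[str] = field(default_factory=list)


def required_rep_info(start: str, net: Dict[str, RepInfo],
                      knowledge_base: Set[str]) -> List[str]:
    """Walk the net from `start`, stopping wherever the Knowledge Base
    already covers a node."""
    needed: List[str] = []
    seen: Set[str] = set()
    stack = [start]
    while stack:
        name = stack.pop()
        if name in seen or name in knowledge_base:
            continue  # already visited, or understood without further help
        seen.add(name)
        needed.append(name)
        node = net.get(name)
        if node is not None:
            # Follow the "interpreted using" links in their listed order.
            stack.extend(reversed(node.interpreted_using))
    return needed


# Illustrative net: the entries are assumptions made for this example.
net = {
    "FITS standard": RepInfo("FITS standard", ["PDF", "English"]),
    "PDF": RepInfo("PDF", ["PDF specification"]),
    "PDF specification": RepInfo("PDF specification", ["English"]),
    "English": RepInfo("English"),
}

astronomers = {"FITS standard"}     # Knowledge Base already includes FITS
general_public = {"English", "PDF"}

print(required_rep_info("FITS standard", net, astronomers))     # []
print(required_rep_info("FITS standard", net, general_public))  # ['FITS standard']
```

With an empty Knowledge Base the same call returns every node in the net, which is the "vast and impractical" case described above.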
##section#.. Preservation Issues
Given a file or a stream of bits how does one know what Representation Information is needed (this question applies to Representation Information itself as well as to the digital objects we are primarily interested in preserving and using); how does one know, for example, if this thing is in FITS format?
1. Someone may simply "know" what it is and how to deal with it i.e. the bits are within the Knowledge Base
2. One may have an associated label which points to the appropriate Representation Information.
3. One may be able to recognise the format by looking for various types of patterns.
4. One may feed the bits into all available interpreters to see which accept the data as valid
5. Other means...
Of the above, if (1) does not apply then only (2) is reliable because the others rely on some form or other of pattern recognition and there is no guarantee that any pattern is unique. Even if the File Format is unique the possible associated semantics will almost certainly not be so.
However if no label is available then one of the other methods must be used, as would be the case for data rescue (in the sense of data inherited without adequate metadata, but not itself corrupt).
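A minimal sketch of the fallback from label to pattern recognition follows. The signature table is a small illustrative sample (the gzip, compress, tar and FITS byte patterns are the commonly documented ones), and the function names are hypothetical. Note that the pattern-based result is a list of candidates, reflecting the point above that no pattern is guaranteed to be unique.

```python
from typing import List, Optional

# (format name, byte offset, expected bytes); illustrative sample only.
SIGNATURES = [
    ("gzip", 0, b"\x1f\x8b"),
    ("compress", 0, b"\x1f\x9d"),
    ("tar", 257, b"ustar"),
    ("FITS", 0, b"SIMPLE  ="),
]


def candidate_formats(data: bytes) -> List[str]:
    """Return every format whose signature matches; a guess, not proof."""
    return [name for name, offset, magic in SIGNATURES
            if data[offset:offset + len(magic)] == magic]


def identify(data: bytes, label: Optional[str] = None) -> List[str]:
    """Prefer an explicit label; fall back to pattern matching without one."""
    if label is not None:
        return [label]
    return candidate_formats(data)


print(identify(b"SIMPLE  =                    T"))  # ['FITS']
print(identify(b"no recognisable signature"))       # []
```

Even a successful match says nothing about the semantics of the content, so the result can only start the search for Representation Information.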
##section#... Representation Information vs Format
To simply give the format of a piece of digital information is inadequate to communicate information as a simple counter-example shows. Suppose that I give you a piece of digital data and tell you that it is MS Word version 6 format. This enables you to find the right software to display the contents. However when you do that you see the following text:
sfqsftfoubujpo jogpsnbujpo svmft
To understand what this means I must supply you with the additional information that I have used a simple alphabetic substitution cipher (a->b, b->c etc) with spaces unchanged.
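The counter-example can be reproduced in a few lines. The function name `shift_letters` is invented for this sketch; the cipher rule is the one stated above (each lower-case letter moved one place forward, spaces unchanged), so decoding shifts each letter one place back.

```python
def shift_letters(text: str, shift: int) -> str:
    """Shift lower-case letters by `shift` places, leaving other characters alone."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            out.append(ch)
    return "".join(out)


rendered = "sfqsftfoubujpo jogpsnbujpo svmft"  # what the word processor displays
print(shift_letters(rendered, -1))             # representation information rules
```

The format told the reader how to render the text; only the extra rule, which is itself Representation Information, makes it understandable.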
##section#.. Implications
- A label must be attached to each digital object as a necessary (but not sufficient) condition for long-term preservation - note that this is some kind of logical attachment or packaging TBD by the DCC.
- The label should at least identify Representation Information. For long-term preservation this label must therefore be a DCC persistent identifier.
- In order to allow some normalisation e.g. in the case of a compressed tar file containing a FITS file we may wish to identify "compress", "tar" and "FITS" separately. To allow for this the label may have some structure itself - in fact it may itself be a digital object... however we would probably want to prevent too much diversity. On the other hand we would probably want to cope with a variety of labelling since we would need to support a variety of standards.
- In order for the Representation Information to be persistent it should either be held with the data object itself or be part of a central repository - part of the DCC. Thus the DCC needs a DCC Representation Information Repository. Because the long-term curation of this Representation Information would have to be guaranteed, adequate succession planning would have to be put in place, for example with a body of guaranteed longevity such as the National Archives or British Library (or an international body). In fact we would hope that the Representation Information Repository would develop into a distributed, global, collaboration.
This (global, distributed, persistent) repository would include
- a Structure Repository (includes file format information)
  - automated use would be supported by use of formal description languages such as EAST (ISO 15889, http://east.cnes.fr/ ) or DFDL (http://forge.gridforum.org/projects/dfdl-wg/)
  - a backstop would be human readable documents such as the appropriate ISO, or other, standard underpinning the format - for example the FITS standard documents, the ISO 9660 standard etc
- a Semantic Repository with, for example, Data Dictionaries and Ontologies
  - a backstop would be human readable documents such as code books, human language grammar books and dictionaries etc
- a Repository for other Representation Information, including Software - with appropriate emulation capabilities
- Each piece of digital Representation Information is also a digital object - which is understood either by the users' Knowledge Base OR by further Representation Information. Therefore each piece of Representation Information also has a label pointing to further Representation Information.
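One possible shape for the structured label mentioned above, for the compressed tar file containing a FITS file, is an ordered list of identifiers with the outermost encoding first. This is purely a sketch: the `Label` class and the `ri:` identifiers are placeholders invented for illustration, and do not represent DCC persistent identifiers or any agreed label schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Label:
    """Ordered Representation Information identifiers, outermost layer first."""
    layers: Tuple[str, ...]

    def outermost(self) -> str:
        return self.layers[0]

    def unwrap(self) -> Optional["Label"]:
        """Label for the content left once the outermost layer is decoded."""
        rest = self.layers[1:]
        return Label(rest) if rest else None


# Hypothetical identifiers standing in for persistent identifiers.
label = Label(("ri:compress", "ri:tar", "ri:FITS"))

steps = []
current: Optional[Label] = label
while current is not None:
    steps.append(current.outermost())
    current = current.unwrap()

print(steps)  # ['ri:compress', 'ri:tar', 'ri:FITS']
```

Keeping each layer as a separate identifier means "compress", "tar" and "FITS" each need to be described only once in the repository, whatever combinations they appear in.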
At any particular time the Representation Net for a given digital object need not be complete - it can be terminated at a point determined by the Knowledge Base - but which Knowledge Base? The answer to this comes in a subsequent section.
The following sections discuss
- aspects of a Representation Information Registry/Repository
- implications for the Designated Community
- types of Representation Information
- the long-term preservation of the Representation Information held
##section#. Registry/Repository
##section#.. Registry/Repository Use Cases
See Appendix 1
##section#.. Registry/Repository Architecture
The underlying model is the following: a curator of digital information can provide the digital information as well as associated Representation Information. The latter can, in principle, be provided in any form. However consistency and ease of use would suggest that a common mechanism be used and the label described in this document is a candidate standard for such a mechanism. This label allows the categorisation of Representation Information and links to an external registry/repository of such information - for example the DCC Registry/Repository.
Thus the curator would have some kind of "label" which could be supplied with the requested digital data. This label would allow access to further Representation Information, as much as required to enable the requestor to use that digital object.
Figure 3 Representation Information (RI) Architecture
This figure shows the Representation Information being held in a Registry/Repository. This does not exclude the possibility that it is held with the original digital object - in that case the Archive and the Registry/Repository are one and the same. The RI may even be packaged with the original digital object.
The Representation Information stack.
##section#.. Registry/Repository Design
See Appendix 1
##section#. OAIS: Designated Community
A Designated Community has, at any particular time, a particular Knowledge Base. For a specific Designated Community this Knowledge Base will evolve with time; in addition the definition of the appropriate Designated Community for a dataset may be changed.
The importance of identifying the Designated Community for a data object or more likely a collection in an archive is that it allows the archive to limit the amount of Representation Information required for any particular digital object. Without this limitation the archive would, in principle, have the impossible task of collecting all possible Representation Information.
##section#.. Implications
Techniques must be created for
- defining a Knowledge Base
- linking a Knowledge Base to a Designated Community
- linking Representation Information to a Knowledge Base if possible
##section#. Types of Representation Information
The OAIS Reference Model standard has a great deal to say about Information Modelling and a number of these ideas are used in this section. The OAIS layered information model gives a high level view.
This model is in an appendix of the OAIS Reference Model and as such is not part of that standard. However it contains a number of useful ideas, including:
- The Media Layer simply models the fact that the bit strings are stored on physical or communications media as magnetic domains or as voltages. The function of this layer is to convert that bit representation to the bit representation that can be used in higher levels (i.e., 1 and 0). This layer has a single interface, which enables higher layers to specify the location and size of the bitstream of interest and receive the bits as a string of 1 and 0 bits. In modern computing systems device drivers and chips built into the physical storage interface provide much of this functionality.
- The Stream Layer hides the unique characteristics of the transport medium by stripping any artifacts of the storage or transmission process (such as packet formats, block sizes, inter-record gaps, and error-correction codes) and provides the higher levels with a consistent view of data that is independent of its medium. The interface between the Stream Layer and higher layers allows the higher layers to request Data Blocks by name and receive a bit/byte string representing those Data Blocks. The term name here means any unique key for locating the data stream of interest. Examples include path names for files or message identifiers for telecommunication messages. In modern computing systems, operating system file systems often provide this layer of functionality.
- The Structure Layer converts the bit/byte streams from the Stream Layer interface into addressable structures of primitive data types that can be recognized and operated on by computer processors and operating systems. For any implementation, the Structure Layer defines the primitive data types and aggregations that are recognized. This usually means at least characters and integer and real numbers. The aggregation types typically supported include a record (i.e., a structure that can hold more than one data type) and an array (where each element consists of the same data type). Issues relating to the representation of primitive data types are resolved in this layer. The interface from the Structure Layer to higher levels allows the higher levels to request labeled aggregations of primitive data types and receive them in a structured form that may be internally addressable. In modern computing systems programming language compilers and interpreters generally provide this layer of functionality.
- The Object Layer converts the labeled aggregates of primitive data types into information, represented as objects that are recognizable and meaningful in the application domain. In the scientific domain, this includes objects such as images, spectra, and histograms. The Object Layer adds semantic meaning to the data treated by the lower layers of the model. Some specific functions of this layer include the following:
  - Defines data types based on information content rather than on the representation of those data at the Structure Layer. For example, many different kinds of objects - images, maps, and tables - can be implemented at the structure level using arrays or records. Within the Object Layer, images, maps, and tables are recognized and treated as distinct types of information.
  - Presents applications with a consistent interface to similar kinds of information objects, regardless of their underlying representations. The interface defines the operations that can be performed on the object, the inputs required for each operation and the output data types from each.
  - Provides a mechanism to identify the characteristics of objects that are visible to users, operations that may be applied to an object, and the relationships between objects. The interface between the Object Layer and the Application Layer allows the higher levels to specify the operation that is to be applied to an object, the parameters needed for that operation and the form in which results of the operations will be returned. One special interface allows the user to discover the semantics of the objects such as operations available, and relationships to other objects. In modern computing systems subroutine libraries or object repositories and interfaces supply this functionality.
- The Application Layer contains customized programs to analyze the Data Objects and present the analysis or the data object in a form that a Data Consumer can understand. In modern computing systems application programs supply this functionality.
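The division of labour between the Stream, Structure and Object Layers can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the in-memory `STORE`, the invented byte layout (two big-endian 16-bit integers for width and height followed by one byte per pixel) and the `Image` class are not taken from OAIS. The Media Layer is left to the operating system.

```python
import struct
from dataclasses import dataclass
from typing import Dict, List

# Stream Layer: return the bytes of a Data Block requested by name.
# The layout below is invented: width, height (uint16, big-endian), then pixels.
STORE: Dict[str, bytes] = {
    "tiny-image": struct.pack(">HH6B", 3, 2, 0, 64, 128, 255, 32, 16),
}


def read_stream(name: str) -> bytes:
    return STORE[name]


# Structure Layer: turn the byte stream into primitive types and aggregations.
def parse_structure(raw: bytes) -> Dict[str, object]:
    width, height = struct.unpack_from(">HH", raw, 0)
    pixels = list(struct.unpack_from(f">{width * height}B", raw, 4))
    return {"width": width, "height": height, "pixels": pixels}


# Object Layer: present the aggregation as a domain object with operations.
@dataclass
class Image:
    width: int
    height: int
    rows: List[List[int]]

    def brightest(self) -> int:
        return max(max(row) for row in self.rows)


def as_image(record: Dict[str, object]) -> Image:
    width, height, pixels = record["width"], record["height"], record["pixels"]
    rows = [pixels[r * width:(r + 1) * width] for r in range(height)]
    return Image(width, height, rows)


# Application Layer: use the object without knowing its byte layout.
image = as_image(parse_structure(read_stream("tiny-image")))
print(image.rows)         # [[0, 64, 128], [255, 32, 16]]
print(image.brightest())  # 255
```

In this sketch the knowledge encoded in `parse_structure` corresponds to structure Representation Information, and the meaning attached by `Image` corresponds to semantic Representation Information.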
##section#.. Structure - including Formats
We believe that it is useful to distinguish between
- formats which are used mainly for rendering - to be used by a human being rather than some computer process, and
- formats used for automated processing
The former include many commercially based formats such as the succession of Microsoft Word formats. The details of current commercial formats are likely to be proprietary and difficult or impossible to obtain. On the other hand the format is more likely to be available when that format is no longer the "current" version - which is when we would actually need it.
The latter are more likely to be simpler, with Open Source access software and more easily independently describable.
It is proposed that we focus initially on this latter set of formats, although we should not neglect the former type, at least to the level of collecting enough information about them to start the associated Representation Net for each format.
There are tools, noted above, which can describe digital information in a way suitable for automated processing. EAST and DFDL currently give access to individual components such as numbers or arrays of numbers within a particular format. It may be useful to define useful scientific objects in order to facilitate automated processing.
In a similar way it may be useful to define some "humanities objects" for the same reason. Some of these, for example images - simple or multispectral - might be the same as their scientific counterparts. Others might be completely new, for example virtual reality cityscapes to capture archaeological surveys.
PRONOM (http://www.records.pro.gov.uk/pronom/) and the Global Digital Format Registry (GDFR, http://hul.harvard.edu/gdfr/) both aim to provide information about file formats, although both focus on document-type formats such as Word. Each has its own data model and plans for service delivery. Neither seems to put its plans into the wider context of complete Representation Information. A trial GDFR implementation provides some transformation services based on the Typed Object Model (TOM, http://tom.library.upenn.edu/). Perhaps the most functional project is the Presidential Electronic Records Project Operational System (PERPOS), which has collected 250+ legacy formats recognised by magic numbers and file structure rather than by file extensions, has associated viewers, and aims to provide services for file format recognition and transformation.
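Recognition by magic numbers, as opposed to file extensions, can be illustrated with a minimal sketch. The signature table and the function name `identify_format` below are illustrative assumptions and are not taken from PERPOS, PRONOM or GDFR; a real registry would hold far more signatures and would also inspect file structure.

```python
from typing import Optional

# Illustrative signature table (an assumption for this sketch, not a registry export).
# Each entry pairs the leading bytes of a file with a format name.
MAGIC_SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP"),
    (b"SIMPLE  =", "FITS"),
]


def identify_format(data: bytes) -> Optional[str]:
    """Return a format name if the leading bytes match a known signature."""
    for signature, name in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return name
    # Unknown content: a fuller service would fall back to structural analysis.
    return None
```

The file name plays no part in the decision, which is the property that makes this approach useful for legacy material whose extensions are missing or misleading.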
Initial discussions with David Ryan from the National Archives (TNA) give some hope that the PRONOM repository and what is described here will converge. As an initial step we expect to be asked to review the new PRONOM data model.
Discussions with colleagues (Bill Underwood) lead us to believe that we can also expect cooperation from PERPOS.
##section#... Implications
- DCC should focus initially on formats supporting "automated use" rather than on those used largely for rendering.
- DCC should work with other groups such as PRONOM, GDFR and PERPOS on file formats and other forms of Representation Information, with the aim of developing a global, distributed system
- Representation Information Repository should define selected file formats using EAST and DFDL
- The EAST and DFDL tools are themselves Representation Information which in due course will have to be fully defined - the closure of their Representation Nets will be the EAST standard and the DFDL documentation
- Definitions should include scientific objects and humanities objects
- Data Grid and data virtualisation techniques should be valuable here
##section#.. Semantics
In principle we could plunge into developing tools for Ontologies. This is an active area of research by many groups and we are not likely to make rapid progress, although a survey of the current work would be in order. A more tractable topic, and an important one from a pragmatic point of view, would be the simpler one of Data Dictionaries. A number of ISO standards exist already (ISO 11179 and the set consisting of ISO 21961, 21962 & 22643 - http://www.ccsds.org/documents/647x3b1.pdf).
##section#... Implications
- we should review and report on tools available for dealing with Ontologies
- the Representation Information Repository should include Data Dictionaries, followed by more general semantics
##section#.. Time Dependent Information
Many, some would say most, datasets change over time and the state at each particular moment in time may be important. This is an important area requiring further research; however, from the point of view of this document it may be useful to break the issue into separate parts.
- at each moment in time we could, in principle, take a snapshot and store it. That snapshot has its associated Representation Net.
- efficient storage of a series of snapshots may lead one to store differences or include time tags in the data (see for example Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan. On the Propagation of Deletions and Annotations through Views. In Proceedings of 21st ACM Symposium on Principles of Database Systems.). Additional Representation Information would be needed which describes how to get to a particular time's snapshot from the efficiently encoded version.
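The relationship between an efficiently encoded version and a particular snapshot can be sketched as follows. This is a minimal illustration assuming a simple key/value dataset and a list of time-tagged changes; the names `Change` and `snapshot_at` are hypothetical and the encoding is not the one used in the work cited above. The function plays the role of the additional Representation Information that describes how to get from the encoded form to the snapshot for a given time.

```python
from typing import Dict, List, Optional, Tuple

# Each change is (time_tag, key, new_value); a new_value of None records a deletion.
Change = Tuple[int, str, Optional[str]]


def snapshot_at(base: Dict[str, str], changes: List[Change], when: int) -> Dict[str, str]:
    """Rebuild the state of the dataset at time `when` from a base snapshot plus differences."""
    state = dict(base)  # never modify the stored base snapshot
    for time_tag, key, value in sorted(changes, key=lambda change: change[0]):
        if time_tag > when:
            break  # later changes are not part of the requested snapshot
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state
```

Only the base snapshot and the differences need to be stored; every intermediate snapshot can be regenerated on demand.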
##section#... Annotation and other time dependent metadata
Annotation was identified in our bid as an area in itself requiring significant research effort. Insofar as annotations and other metadata are themselves data, the considerations of time dependence noted above apply also. Within the OAIS Information Model, Annotation, Provenance etc are part of Preservation Description Information which is discussed in more detail below.
##section#... Implications
These are areas of active research within the consortium and the DCC should be able to provide
- advice and well tested tools for certain forms of efficient encoding of time dependent information
- advice on annotation
- identifiers and Representation, perhaps in the form of software, for the associated encodings
##section#.. Actions and Processes
Some information has, as an integral part of its content, an implicit or explicit process associated with it - this could be argued to be a type of semantics; however, it is probably sufficiently different to need special classification. Examples of this include databases or other time dependent or reactive systems such as Neural Nets.
The process may be implicitly encoded in the data, for example with the scheme for encoding time dependence in XML data as noted above. Alternatively the process may be held in the Representation Information - possibly as software.
Amongst many other possibilities under this topic, Software and Software Emulation are among the most interesting (http://www.dlib.org/dlib/october00/granger/10granger.html).
It may be possible to develop a Universal Virtual Computer (UVC) as outlined by Lorie (http://www.rlg.org/preserv/diginews/diginews5-3.html#feature2). However, recognising that one of the prime desirable features of a UVC is that it is well defined and can be implemented on numerous architectures, it may be possible to use something already in place, namely the Java Virtual Machine (JVM, http://java.sun.com/docs/books/vmspec/).
##section#... Implications
- Support Software emulation via a UVC (possibly based on JVM)
- Support time dependent or reactive systems
The main problem with emulation is not writing a portable software emulator for a particular architecture, but preserving that software and the knowledge it contains afterwards. So the problem of emulation reduces to a problem of software preservation, which applies to software written for a UVC, JVM, .NET and any other system. The only advantage of the UVC is that, if its architecture remains fixed for all time, at least some base software libraries written for it would continue to run. But as soon as software starts to require other software dependencies and specific versions, specifying those dependencies becomes as much of a problem for the UVC as it is for any other system. Software maintenance is also a problem: in the future you may need a lot of Representation Information to understand and use some software source code or a binary.
-- SteveRankin - 12 May 2005
##section#. Persistent IDs
Persistent Identifiers must be defined if we are to use a Registry/Repository. See Appendix 3.
##section#. Long-term Preservation of Registry/Repository holdings
##section#.. Archival Information Package
An Archival Information Package (AIP) is defined to provide a concise way of referring to a set of information that has, in principle, all the qualities needed for permanent, or indefinite, Long Term Preservation of a designated Information Object. Since the AIP is ready for long-term preservation it can be used as the basis for defining several important structures.
The AIP is a logical definition. For practical use an encoding must be provided. METS packaging (http://www.loc.gov/standards/mets/) is an XML Schema which is based on the AIP. However, concern has been expressed about weakness in the way the Representation Information is dealt with by METS, and other schemes are being investigated (http://www.ccsds.org).
Figure 4 Full Archival Information Package
The Package Description Information is as follows:
##section#.. Metadata for preservation
OAIS defines Preservation Description Information as:
while for collections the Archive Information Collection is defined:
Simply taking the AIP components, OCLC has put together a definition of metadata for preservation (http://www.oclc.org/research/projects/pmwg/pm_framework.pdf). This work could form the starting point for a DCC definition of preservation metadata, taking into account the additional components in these diagrams. Of particular importance are the following.
##section#... Reference Information
Reference Information is
the information that identifies, and if necessary describes, one or more mechanisms used to provide assigned identifiers for the Content Information. It also provides identifiers that allow outside systems to refer, unambiguously, to a particular Content Information. An example of Reference Information is an ISBN.
The Persistent ID discussed above could fulfill this role.
##section#... Provenance Information
Provenance Information is
the information that documents the history of the Content Information. This information tells the origin or source of the Content Information, any changes that may have taken place since it was originated, and who has had custody of it since it was originated. Examples of Provenance Information are the principal investigator who recorded the data, and the information concerning its storage, handling, and migration.
This is an active area of research, as already identified in the DCC bid.
##section#... Context Information
Context Information is
the information that documents the relationships of the Content Information to its environment. This includes why the Content Information was created and how it relates to other Content Information objects.
##section#... Fixity Information
Fixity Information is defined as
the information which documents the authentication mechanisms and provides authentication keys to ensure that the Content Information object has not been altered in an undocumented manner. An example is a Cyclical Redundancy Check (CRC) code for a file.
Experience has shown that even with a single computer system, error-free transfers of data cannot be taken for granted. The Atlas Petabyte Store at CCLRC uses a sophisticated checksum regime in order to confirm the correct transcription of data. No doubt other data stores have similar systems in place. The DCC could compare the techniques used and recommend a standard checking regime.
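As a minimal sketch of what a checking regime involves, the following computes a digest when data is stored and compares it again after a transfer. It uses Python's standard `hashlib`; the function names and the choice of SHA-256 as the default are assumptions made for illustration, not a description of the Atlas Petabyte Store regime or a DCC recommendation.

```python
import hashlib


def compute_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hexadecimal digest of `data` using the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def verify_fixity(data: bytes, recorded_digest: str, algorithm: str = "sha256") -> bool:
    """Check that `data` still matches the digest recorded as Fixity Information."""
    return compute_digest(data, algorithm) == recorded_digest
```

The algorithm name has to be recorded alongside the digest, since a digest is meaningless without it; that pairing is itself a small piece of Representation Information.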
Additional checks, up to and including a full authentication system, may be required. The DCC could, for example, recommend a public/private key system and keep the keys stored in perpetuity in order for authenticity to be confirmed. Over the long term increasingly sophisticated systems are likely to be required, together with an associated process for strengthening authentication methods.
##section#.. Implications
The DCC could:
- define standard Preservation Metadata - based initially on OCLC work
- define adequate Packaging technique - almost certainly XML based
- define recommended tools and procedures for creating Fixity Information such as checksums and digests, together with associated Representation Information
- investigate authentication systems with a view to preparing recommendations for users and consider offering, for example, a (fee-based) key storage service.
##section#. Audit and Certification Framework
The DCC is expected to develop other funding streams to grow its service, supplementing JISC funding and, at the extreme, becoming self-funding in the long term. One of the services it might offer is Audit and Certification of Digital Repositories.
In order to have a certification process there needs to be in place a Standard (preferably one or more ISO standards) against which to do the certification. The RLG Digital Certification Task Force is working to produce a certification process. The documents produced will probably become ISO standards via CCSDS (the body which produced the OAIS Reference Model). Indeed this work can be seen as a part of a suite of OAIS-related standards.
Following the production of the standard, accreditation bodies and certification bodies would need to be set up. Contact would have to be made with an overall body such as the International Accreditation Forum (IAF, http://www.iaf.nu/) or International Laboratory Accreditation Cooperation (ILAC, http://www.ilac.org/) or a regional body such as the European cooperation for Accreditation (http://www.european-accreditation.org/).
Accreditation could also be applicable to third party software and services which would have to be appropriately tested.
##section#.. Project Implications
- facilitate production of standard(s) on which a certification program can be based
- work to establish accreditation and certification body in preparation for offering audit and certification services
- audit, certification and accreditation are potential sources of long term funding for the DCC
- software certification will require testbeds and testing procedures.
- Hardware and software systems will need to be purchased, hired or borrowed. The CANDO associates may be useful sources.
- We might expect Commercial software to be offered to us by the manufacturer for testing
- The cost of testing non-commercial software would be, at least initially, covered by DCC resources.
##section#. Outreach and Services
The DCC will provide a number of services. These form the "sharp end" of the DCC and will be the main contact with the world.
##section#.. Services
The DCC can provide a number of direct services and can coordinate or provide tools to support a number more.
- Registry/Repository - as discussed in earlier sections. This could be a distributed system, but coordination would be required in various areas including
- a persistent ID and name resolution mechanism - as discussed above
- a preservation strategy for each Registry/Repository
- Notification services - based on Registry/Repository, assuming there is a notification service by which one could ask for:
- updates to RepInfo
- Preservation Planning reminders
- Tools
- Migration services e.g. file format conversion
- presumably built on top of Registry/Repository service
- could use "GRID" type services for dealing with large numbers of files.
- Certification services - discussed above
- OAIS level (i.e. whole archive) certification
- Certification of swappable components of OAIS - if/when these are defined
- commercial and non-commercial tools
- hardware
##section#.. Advice
- Curation manual
- Training
- Evaluation reports on tools
##section#. Prioritisation
##section#.. Easy starts
Besides the activities identified in the rest of this document a number of other activities can and should be started. - TBD
##section#... Initial lists
List of
- archive requirements as currently perceived
- tools used by archives - commercial and free
- "current use" projects
- collaborators/competitors
- services we can delegate to or replicate
- information that we can point to or replicate
- relevant standards existing or due for completion in the next 3 years
- scientific formats (assume we can get lots of rendered formats from several sources)
- list of curation requirements specified in grant conditions from UK research funders, analogous to ROMEO in the JISC FAIR programme (http://www.jisc.ac.uk/index.cfm?name=project_fair_romeo)
- ...
Lay plans for
- outreach/publicity
- Registry/Repository of Representation Information and other metadata
- install tools such as EAST, DFDL etc
- research needed to support services and development
- ...
##section#. Some additional topics
##section#.. Risk Management
- costs/benefits/risks/risk sensitivity
##section#.. Legal Aspects
Legal aspects could pose many problems for curation. The DCC has good advice on UK legal aspects but international issues may be more difficult. We must be sure that the DCC has an appropriate international stance.
##section#.. Monitoring
##section#.. Maturity level / Improvements
##section#.. Capturing and Disseminating Best Practice
##section#.. Evaluation of the suitability of Data Formats for preservation and use
A number of groups are working on this, see for example
Appendix ##app#. Representation Information
Appendix ##app#.. RI Use Cases
Use Cases for Representation Information Registry/Repository
There will be some overlap with the Use Cases for software and for standards
See
http://www.agilemodeling.com/style/useCaseDiagram.htm,
http://www.andrew.cmu.edu/course/90-754/umlucdfaq.html,
GDFR Use Cases
Concatenated Use Cases may be seen
here
GetDistributedRepInfoUseCase
GetPreviousVersionRepInfoUseCase
AddAlternativeRepInfoUseCase
NotifyObsoleteUseCase
Notes:
- "user" includes human user or computer process
-- DavidGiaretta - 04 Jan 2005
Appendix ##app#.. Registry/Repository Design
High Level Design for Registry/Repository
We are planning to use freebXML as the Registry/Repository. This implements the ebXML standard. However we should try not to tie ourselves in to ebXML.
The ebXML set of standards includes a Registry Information Model and a Registry Service Specification amongst others.
- ebXML Registry Information Model:
- Inheritance in ebXML Info Model:
Extending ebXML for the DCC Registry
The natural way to extend the Information Model is by making each RepInfo entry (with its associated DCCInfoLabel) an ExtrinsicObject. Although the Representation Information may be something like a PDF file, for example, it may make sense to define a DccRepInfoObject as an extension of ExtrinsicObject where the additional methods are
- getDccInfoLabel - to retrieve the DCCInfoLabel associated with the Representation Information
- getRepInfo - get the actual Representation Information (or, if that is a physical object such as a piece of paper, then some information on how to get that physical object)
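The proposed extension can be sketched as a small class hierarchy. This is only an illustration of the shape of the idea: the `ExtrinsicObject` class below is a simplified stand-in written for this sketch and is not the ebXML Registry Information Model class or any registry implementation's API, and the field names `info_label`, `content` and `retrieval_note` are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ExtrinsicObject:
    """Simplified stand-in for the ebXML ExtrinsicObject (assumption for this sketch)."""
    id: str
    mime_type: str = "application/octet-stream"


@dataclass
class DccRepInfoObject(ExtrinsicObject):
    """A RepInfo entry together with its associated DCCInfoLabel."""
    info_label: str = ""              # the DCCInfoLabel, e.g. as an XML string
    content: Optional[bytes] = None   # the digital Representation Information, if any
    retrieval_note: str = ""          # how to obtain a physical object instead

    def getDccInfoLabel(self) -> str:
        """Retrieve the DCCInfoLabel associated with the Representation Information."""
        return self.info_label

    def getRepInfo(self) -> Union[bytes, str]:
        """Return the Representation Information, or instructions for obtaining it."""
        if self.content is not None:
            return self.content
        return self.retrieval_note
```

Returning a retrieval note when there is no digital content covers the case of Representation Information that exists only as a physical object.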
In addition to this we need a ClassificationScheme with ClassificationNodes. The natural ClassificationScheme for us would start with the OAIS Information Model, with the option of adding some extra models such as PRONOM, the CLRC Scientific Metadata Model and the NERC DataGrid data model.
Thus the DCCRepInfoClassificationScheme would start with
- OAIS Representation Information network (extended):
(see RepNet-extended.ppt for PowerPoint file for this extended Representation Net)
- Structure Information includes such things as EAST descriptions.
- Semantic Information includes Data Dictionaries
- Other Representation Information includes
- Software including for example EAST software - and perhaps PRONOM
The software (binary) classification will ultimately need some form of generic namespace to resolve libraries and other software requirements, such as operating system, shells and environment setup etc. Having such a namespace defined in the registry will help make any software stored in the repository usable and useful to someone in the future. We need some form of software model that includes enough information so that the software becomes useful. Also, does software source code come under Software, or something else? Source code has a namespace problem similar to that of the binary, but also depends on compilers and system specific code. -- SteveRankin - 30 Sep 2004
There are two examples of existing models for resolving software dependencies and preserving the details necessary for a piece of software to run. These are: 1. CORBA, http://www.omg.org/library/wpjava.html "Simply put, Java allows you to create portable objects and easily distribute them; CORBA allows you to connect them together and to integrate them with rest of your computing environment-databases, legacy systems, objects or applications written in other languages, what have you." 2. .NET Assemblies, see http://www.scit.wlv.ac.uk/~cm1918/DotNet/Assembly.html and http://msdn.microsoft.com/net/ecma/. Some investigation into these technologies is required to see how general they are: can they fully describe any software for any system such that all the requirements for that software to run are preserved? -- SteveRankin - 04 Oct 2004
The work of the ANSI Standards Registry Committee (see http://www.ansi.org/internet_resources/standards_registry_committee/stdsreg.aspx?menuid=12) in defining a simple specification for metadata that describes both in-progress and completed technical work created by Standards Developing Organisations (see http://public.ansi.org/ansionline/Documents/Other%20Services/Standards%20Registry%20Committee/Standards%20Reg%20Metadata%20Def%20v3.0.pdf) will be useful in the DCCRR Classification scheme. The specification aims to enable searching across registries hosted by various organizations.
Registry Classification Nodes
The following shows the tree structure for the Classification Scheme - please edit as appropriate. The Classification Scheme is used to classify entries in the registry in order to assist searching. Note that there may be multiple Classification Schemes and any registry entry may have multiple classifications. An example of this in ebXML would be classifying a business by geographical location and by type of service. It is not clear at the moment that multiple Classification Schemes are required for Representation Information; for now we will develop a single Classification Scheme, but additional schemes should not be excluded in future.
- DCC
  - Structure
    - File Formats
      - Text (Email, Articles and Papers, Theses, Reports, Letters, Journals)
      - Image
      - Sound
      - Moving images
      - Data-sets
      - 3D models
      - Time-varying or Dynamic data
  - Semantics
    - Data Dictionary
    - Knowledge Organisation Systems
      - Schema
      - Ontology
      - Metadata Vocabulary
      - Thesaurus
  - Other
    - Software
      - Title
      - Version
      - Description
      - Licence
      - Operating System
      - Hardware Environment
      - Application Type. Suggest this may take: Representation Rendering Software, Access Software (etc), but enhanced to further detail the functionality of the application, e.g. Converter, Renderer, Modifier, Packaging, Emulation, etc.
      - Topic. These could be selected and refined over time to enable browsing, e.g. drill-down categories: Science -> Astronomy, Biology, etc. Databases -> APIs, Engines/Servers, etc. Software Development -> Libraries, etc.
      - Development Status
      - Representation Rendering Software. This is what is covered by the KB Preservation Planner I think. Includes what we have called a "viewer" in some of the Label docs.
      - Access Software
    - Standards
      - Organisation
      - Version
      - Type
        - Standard. Official standard or specification which has been through a standardisation process.
        - Emerging Standard. Undergoing the standardisation process.
        - De facto standard. Widely adopted and has community support, but has not been through a standardisation process.
        - Guidelines for best practice
    - Algorithms
PreservationDescriptionInformation contains a report and proposals on the specific attributes we should include.
Additional Models
OAIS Layered Information Model
This model is in an appendix of the OAIS Reference Model and as such is not part of that standard. However it contains a number of useful ideas, including:
- The Media Layer simply models the fact that the bit strings are stored on physical or communications media as magnetic domains or as voltages. The function of this layer is to convert that bit representation to the bit representation that can be used in higher levels (i.e., 1 and 0). This layer has a single interface, which enables higher layers to specify the location and size of the bitstream of interest and receive the bits as a string of 1 and 0 bits. In modern computing systems device drivers and chips built into the physical storage interface provide much of this functionality.
- The Stream Layer hides the unique characteristics of the transport medium by stripping any artifacts of the storage or transmission process (such as packet formats, block sizes, inter-record gaps, and error-correction codes) and provides the higher levels with a consistent view of data that is independent of its medium. The interface between the Stream Layer and higher layers allows the higher layers to request Data Blocks by name and receive a bit/byte string representing those Data Blocks. The term name here means any unique key for locating the data stream of interest. Examples include path names for files or message identifiers for telecommunication messages. In modern computing systems, operating system file systems often provide this layer of functionality.
- The Structure Layer converts the bit/byte streams from the Stream Layer interface into addressable structures of primitive data types that can be recognized and operated on by computer processors and operating systems. For any implementation, the structure layer defines the primitive data types and aggregations that are recognized. This usually means at least characters and integer and real numbers. The aggregation types typically supported include a record (i.e., a structure that can hold more than one data type) and an array (where each element consists of the same data type). Issues relating to the representation of primitive data types are resolved in this layer. The interface from the Structure Layer to higher levels allows the higher levels to request labeled aggregations of primitive data types and receive them in a structured form that may be internally addressable. In modern computing systems programming language compilers and interpreters generally provide this layer of functionality.
- The Object Layer converts the labeled aggregates of primitive data types into information, represented as objects that are recognizable and meaningful in the application domain. In the scientific domain, this includes objects such as images, spectra, and histograms. The object layer adds semantic meaning to the data treated by the lower layers of the model. Some specific functions of this layer include the following:
- Defines data types based on information content rather than on the representation of those data at the structure layer. For example, many different kinds of objects (images, maps, and tables) can be implemented at the structure level using arrays or records. Within the object layer, images, maps, and tables are recognized and treated as distinct types of information.
- Presents applications with a consistent interface to similar kinds of information objects, regardless of their underlying representations. The interface defines the operations that can be performed on the object, the inputs required for each operation and the output data types from each.
- Provides a mechanism to identify the characteristics of objects that are visible to users, operations that may be applied to an object, and the relationships between objects. The interface between the Object Layer and the Application Layer allows the higher levels to specify the operation that is to be applied to an object, the parameters needed for that operation and the form in which the results of the operation will be returned. One special interface allows the user to discover the semantics of the objects, such as the operations available and relationships to other objects. In modern computing systems subroutine libraries or object repositories and interfaces supply this functionality.
- The Application Layer contains customized programs to analyze the Data Objects and present the analysis or the data object in a form that a Data Consumer can understand. In modern computing systems application programs supply this functionality.
NERC Datagrid Data Model:
RepInfo Network - ending the recursion
Each piece of RepInfo will have its own DCCInfoLabel associated with it. The label will contain reference to other RepInfo. This recursion ends for example when one gets to something marked as something like "ASSUMED TO BE IN KNOWLEDGE BASE xxxxx" - this would allow an explicit link to one of a set of assumed Knowledge Bases.
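How the recursion ends can be shown with a minimal traversal sketch. It assumes, purely for illustration, that each label is reduced to a list of references held in a dictionary and that a terminating entry is a string beginning with the marker text; the function name `resolve_rep_net` and this data layout are hypothetical.

```python
from typing import Dict, List, Set

# Marker text taken from the description above; the suffix names the Knowledge Base.
KNOWLEDGE_BASE_PREFIX = "ASSUMED TO BE IN KNOWLEDGE BASE"


def resolve_rep_net(start: str, labels: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Follow label references from `start` until every branch reaches a Knowledge Base marker."""
    rep_info: Set[str] = set()
    knowledge_bases: Set[str] = set()
    pending = [start]
    while pending:
        current = pending.pop()
        if current.startswith(KNOWLEDGE_BASE_PREFIX):
            knowledge_bases.add(current)  # the recursion ends here
            continue
        if current in rep_info:
            continue  # already visited; guards against circular references
        rep_info.add(current)
        pending.extend(labels.get(current, []))
    return {"rep_info": rep_info, "knowledge_bases": knowledge_bases}
```

The result lists both the RepInfo that must be held in the registry and the Knowledge Bases a future user is assumed to possess, which makes those assumptions explicit and checkable.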
-- DavidGiaretta - 22 Dec 2004
DCC Registry/Repository v0.3
DCCRegRepV04
-- SteveRankin - 18 Jul 2005
Appendix ##app#. RI Label
Appendix ##app#.. RI Label Use Cases
Use Cases for Labels for the DCC
- User receives label for some data (having requested a label upon submission of data)
- User verifies label is appropriate for digital object.
- User updates label to detect new RepInfo that has been released and added to the RR.
- User uses label to retrieve RI from the DCCRR.
- User attaches label to digital object
- User has digital object composed of multiple file types and wishes to identify contents with a label. What is the label's structure? What does the label identify?
- Data producer has a data/document product which packages together, compresses, encodes or otherwise changes the original bit streams. A label must be attached to allow these processes to be reversed so that the original bit streams can be obtained by a user.
-- AdamRusbridge - 23 Nov 2004
Appendix ##app#.. RI Label Considerations
Information Labels
Introduction
One fundamental idea for the DCC is that associated with any digital information which is to be preserved there is some kind of pointer to Representation Information (RepInfo). This RepInfo may, in at least some cases, be made up of a combination of components which may be combined in different ways in different instances. For example there may be a compressed tarred FITS file, a compressed FITS file, a tarred FITS file and a FITS file (all the same FITS file) - do we need separate RepInfo for each of these? The alternative would be to have RepInfo for each of "compress", "tar", "FITS" and use these in various combinations.
The DCC RepInfo Label provides the mechanism for combining individual RepInfo components.
RepInfo Label Requirements
- A label must be attached to each digital object as a necessary (but not sufficient) condition for long-term preservation – note that this is some kind of logical attachment or packaging TBD by the DCC.
- The label should at least identify Representation Information. For long-term preservation this label must therefore be a DCC persistent identifier.
- In order to allow some normalisation, e.g. in the case of a compressed tar file containing a FITS file we may wish to identify “compress”, “tar” and “FITS” separately, the label may have some structure itself – in fact it may itself be a digital object. However, we would probably want to prevent too much diversity. On the other hand we would probably want to cope with a variety of labelling since we would need to support a variety of standards.
- There are potentially a very great number of ways of combining these kinds of RepInfo. The simplest is sequential - e.g. compressed, tarred, FITS, but it could clearly become very complex.
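The simplest, sequential case can be sketched as a list of layer names applied outermost first. The sketch below is an assumption-laden illustration: the function `unwrap` and the `DECODERS` table are hypothetical, gzip stands in for Unix "compress" because the Python standard library has no decoder for the latter, and the tar layer is assumed to hold a single file.

```python
import gzip
import io
import tarfile
from typing import Callable, Dict, List


def _gunzip(data: bytes) -> bytes:
    """Reverse the compression layer (gzip used as a stand-in for 'compress')."""
    return gzip.decompress(data)


def _untar_single(data: bytes) -> bytes:
    """Reverse the tar layer, assuming the archive holds exactly one regular file."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as archive:
        member = archive.getmembers()[0]
        extracted = archive.extractfile(member)
        if extracted is None:
            raise ValueError("first archive member is not a regular file")
        return extracted.read()


# Layer name -> function that reverses that layer.
DECODERS: Dict[str, Callable[[bytes], bytes]] = {"gzip": _gunzip, "tar": _untar_single}


def unwrap(data: bytes, layers: List[str]) -> bytes:
    """Apply decoders in label order; stop at the first layer that names a payload format."""
    for layer in layers:
        decoder = DECODERS.get(layer)
        if decoder is None:
            break  # e.g. "FITS": describes the payload, nothing left to reverse
        data = decoder(data)
    return data
```

With this arrangement the four variants mentioned in the introduction differ only in the list passed as `layers`, so separate RepInfo for each combination is not needed.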
Potential implementations
Assuming the namespace "DCC" is defined, the label could take one of the forms below.
A simple sequential list such as the following is probably too simple:
<DCC:infoLabel>
<DCC:RepInfoID>xxxx</DCC:RepInfoID>
<DCC:RepInfoID>yyyy</DCC:RepInfoID>
<DCC:RepInfoID>zzzz</DCC:RepInfoID>
</DCC:infoLabel>
or
<DCC:infoLabel>
<DCC:RepInfoID type="Software">
<DCC:PID>xxxx</DCC:PID>
<DCC:RepInfoID type="Software">
<DCC:PID>yyyy</DCC:PID>
<DCC:RepInfoID>
<DCC:PID>zzzz</DCC:PID>
</DCC:RepInfoID>
</DCC:RepInfoID>
</DCC:RepInfoID>
</DCC:infoLabel>
However, we should remember that if we do not right now have something which supports automated processing, then something like the following is good enough:
<DCC:infoLabel>
The data is in FITS format - PID = xxxx - which has been tarred - PID = yyyy - and then compressed - PID = zzzz.
</DCC:infoLabel>
Ideally some already standard process markup language can be used.
We also need a mechanism to combine RepInfo, e.g. given a bitstream which in fact is a FITS file we need to supply
- the Syntax for FITS
- an associated Dictionary
<DCC:infoLabel>
<DCC:RepInfo type="syntax">
<DCC:PID>zzzz</DCC:PID>
</DCC:RepInfo>
<DCC:RepInfo type="dictionary">
<DCC:PID>dddd</DCC:PID>
</DCC:RepInfo>
</DCC:infoLabel>
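The nested and the combined forms above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the namespace URI `http://example.org/dcc`, the helper names and the placeholder PIDs are all hypothetical, and the nested form uses `DCC:RepInfo` elements holding a `DCC:PID`, as in the syntax/dictionary example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the text only assumes that a "DCC" prefix is defined.
DCC_NS = "http://example.org/dcc"
ET.register_namespace("DCC", DCC_NS)


def q(tag):
    """Return the namespace-qualified form of a tag name."""
    return f"{{{DCC_NS}}}{tag}"


def nested_label(pids):
    """Nest RepInfo elements, outermost processing step first."""
    label = ET.Element(q("infoLabel"))
    parent = label
    for pid in pids:
        rep_info = ET.SubElement(parent, q("RepInfo"))
        ET.SubElement(rep_info, q("PID")).text = pid
        parent = rep_info
    return label


def combined_label(parts):
    """Place typed RepInfo elements side by side, e.g. syntax plus dictionary."""
    label = ET.Element(q("infoLabel"))
    for rep_type, pid in parts.items():
        rep_info = ET.SubElement(label, q("RepInfo"), {"type": rep_type})
        ET.SubElement(rep_info, q("PID")).text = pid
    return label


def unwrap_order(label):
    """Read a nested label back as the list of PIDs, outermost first."""
    order = []
    node = label.find(q("RepInfo"))  # find() only looks at direct children
    while node is not None:
        order.append(node.find(q("PID")).text)
        node = node.find(q("RepInfo"))
    return order


# Placeholder PIDs, invented for the example.
label = nested_label(["pid-compress", "pid-tar", "pid-fits"])
print(ET.tostring(label, encoding="unicode"))
print(unwrap_order(label))
```

With this arrangement only three pieces of RepInfo ("compress", "tar", "FITS") need to be registered, and each combination is expressed by the label rather than by a new registry entry.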
Further implementation considerations
Associating Labels to Data
A label is not explicitly tied to a specific data object. Rather, a data object is associated with one or more labels (for example, through the object packaging file, such as METS). To identify alternate labels, and as Representation Information may change over time, some form of versioning or timestamp is required.
A label contains no reference to a data object as there is not necessarily any guarantee that the object itself is persistent or stable.
If the data is modified, it will be necessary to validate the label to confirm it remains appropriate, and if not create a new label and update the association accordingly. The label is used to identify the persistent and stable contents of the registry. This allows the same label to be associated with more than one data object. There may also be a need to associate more than one label to a piece of data requiring multiple pieces of Representation Information.
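The direction of the association can be made concrete with a small data model. The following is a sketch only: the class and field names are hypothetical, and choosing "the most recent label" is just one possible reading of how versioned alternate labels might be used.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Label:
    """A label names RepInfo only; it holds no reference to any data object."""
    label_id: str
    rep_info_ids: tuple
    timestamp: datetime


@dataclass
class DataObject:
    """The object (or its packaging file, e.g. METS) records which labels apply."""
    name: str
    label_ids: list = field(default_factory=list)


def current_label(obj, labels):
    """Pick the most recently timestamped label associated with the object."""
    candidates = [labels[label_id] for label_id in obj.label_ids]
    return max(candidates, key=lambda label: label.timestamp)


labels = {
    "L1": Label("L1", ("pid-fits-v1",), datetime(2004, 11, 1, tzinfo=timezone.utc)),
    "L2": Label("L2", ("pid-fits-v2",), datetime(2005, 3, 1, tzinfo=timezone.utc)),
}
# The same labels can be shared by any number of data objects.
image_a = DataObject("image_a.fits", ["L1", "L2"])
image_b = DataObject("image_b.fits", ["L1"])
print(current_label(image_a, labels).label_id)
print(current_label(image_b, labels).label_id)
```

Because the label never points back at the data, a change to the data only requires the association held on the object side to be re-validated and, if necessary, updated.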
Internal Usage
As detailed elsewhere, the Representation Net is composed of multiple layers of Representation Information. Associating a label with each Representation Information will help ensure we are able to understand the Representation Information appropriately, and will help our construction of the Representation Net.
External Usage
A repository implementation is able to implement a label according to its own guidelines, or adopt our own label format for convenience and consistency. Alternatively, a repository may want to use XML but create its own Schema, to include various administrative and preservation details; it could then use our elements from our namespace.
Multiple Objects
The problem still remains with the packaging of multiple objects. Scenarios causing problems:
- Single object.
- Simply create a label that can be applied to this object.
- Multiple packaging schemes (e.g. tarred and gzipped file)
- Either attach a single label for tar (which can perform both unpacking tasks)
- Or work from top down to unpack (e.g. Attach gzip label, which will result in a tar file).
- Object containing embedded objects. E.g. email containing .png and .fits
- As yet unsure. This needs to be addressed further.
- To quote David:
- A couple of possible answers:
- 1) there could be a packaging structure already e.g. METS or the CCSDS Packaging - each time a component is referred to there is a label which applies to that component.
- 2) if there is no such "external" packaging then our label would have to have a first level RepInfo which describes the fact that there are both Word and Video - and this description could be just free text (although of course we would prefer not to leave it like this). This could indeed eventually be a "package" of labels.
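The "work from top down" option for multiple packaging schemes can be sketched as follows. The handler table and the step names "gzip" and "tar" stand in for RepInfo identifiers and are hypothetical; the sketch only shows that a label listing the steps outermost first is enough to drive the unpacking.

```python
import gzip
import io
import tarfile


def undo_gzip(data):
    """Reverse the compression step; the result is a single unnamed byte stream."""
    return {"": gzip.decompress(data)}


def undo_tar(data):
    """Reverse the packaging step; the result is one byte stream per archive member."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as archive:
        return {
            member.name: archive.extractfile(member).read()
            for member in archive.getmembers()
            if member.isfile()
        }


# Hypothetical mapping from a label entry to the code that reverses that step.
HANDLERS = {"gzip": undo_gzip, "tar": undo_tar}


def unpack(data, steps):
    """Apply the labelled steps from the outside in; returns {name: bytes}."""
    items = {"": data}
    for step in steps:
        unpacked = {}
        for content in items.values():
            unpacked.update(HANDLERS[step](content))
        items = unpacked
    return items


def make_sample():
    """Build a gzipped tar file in memory holding one small member."""
    payload = b"SIMPLE  =                    T"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        info = tarfile.TarInfo(name="image.fits")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return gzip.compress(buffer.getvalue())


print(unpack(make_sample(), ["gzip", "tar"]))
```

An object with embedded objects of different kinds would need a label per extracted member, which is the open question noted above.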
Labelling institute and data specific information
We need to agree what information should be labelled and what should not. For example, some piece of tabular data may have variables for column headings. Such a variable may be institution specific; it may even be file specific. An annotation is required to explain the semantics of these variables. In this example, the annotation is a composite part of the data object, rather than being representation information that would be useful to many people. Here, this annotation should be retained with the object as it contains information content directly needed to understand one specific data object, rather than being a generic description applicable to many data objects. To summarise, the data object comprises both the tabular data and the associated annotation – and both of these items should be stored and associated in the institutional data repository.
I think we have to do whatever is needed to supply Representation Information. However we can supply advice on best practice. In the example above one would need of course to provide additional RepInfo to describe the associated annotation and the relationship between that and the tabular data.
The point here is to establish what is Representation Information (our stuff) and what is not. I suggest that the distinction might be that Representation Information provides connected and replaceable methods for understanding the content of an object, but does not actually contain (object-specific) information itself. The reason for making this distinction is that it could be easy for the DCCRR to become bloated with information that ought to be retained alongside institutional content. In this way, it should be clear that the DCCRR is not an archive of primary research data.
Proposed Schema for RI label:
- Diagram of proposed dcc-rilabel.xsd Schema:
This Schema should support the use cases and the activities discussed above.
- ri-label - top level: The high level view of the label - it must have a timestamp but all else is optional. The elements structure, semantics, standard, algorithm etc all are of type curelType.
- rilabel - curation element type: "curel" is a contraction of "curation element" and consists of an identifier and a text description. The identifier is a unique persistent identifier.
- rilabel - description: text of anything else - can appear throughout the label to describe any element.
- rilabel - other rep. info.: can contain a mixture of different elements, some of which can be nested.
- rilabel - algorithm:
- rilabel - processing software: allows nesting of processing software, for example encoding then compressing a data file.
- rilabel - package type: allows nesting of packaging methods: e.g. consider using the tar application to create a tar file which is then further processed using the zip application.
HEALTH WARNING - work in progress - this ri-template template does not validate against the dcc-rilabel schema - issue with "wraps" and "package" recursion
Just a couple of points here. The naming of some of the tags in the label does not match the naming in the OAIS reference model standard:
Structure should be Structure Information.
Other should be Other Representation Information.
Semantics should be Semantic Information.
This may seem picky, and I am not usually so when it comes to the naming of things, but we should make it clear to other people that we are in fact using OAIS as a basis for our representation information classification and matching the names to what appears in the standard does make it clearer.
The use of acronyms in the label should be a definite no-no. We are trying to preserve meaning, so rilabel and otherri would more than likely mean nothing to someone in the future without digging further to see what ri means. There are other important reasons why the naming of the tags in the label should be meaningful; one is that it is good programming practice to use meaningful names without any acronyms or abbreviations.
To be consistent between the registry and the label, we need to decide on some type of naming convention for tags and classes in the registry. I prefer things like "StructureInformation" as an example for a class/tag name, and "structureInformation" for a variable/attribute name; that is what I have used in the registry model. But I will change to using things like "Structure Information", as the OAIS standard uses spaces in its class names and we need to be consistent with that.
--
SteveRankin - 09 Mar 2005
Appendix ##app#.. RI Label Schema
<?xml version="1.0" encoding="UTF-8"?>
<rilabel xmlns="http://dev.dcc.ac.uk/dcc-rilabel.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://dev.dcc.ac.uk/dcc-rilabel.xsd
dcc-rilabel.xsd">
<description>High level description of the label</description>
<timestamp>2005-01-01T09:00:00.000Z</timestamp>
<semantics>
<description>Semantic description(s) - can be multiple referenced</description>
<cpid>MISSING</cpid>
<cpid>MISSING</cpid>
</semantics>
<structure>
<description>Structural description(s)</description>
<cpid>MISSING</cpid>
<cpid>MISSING</cpid>
</structure>
<otherri>
<knowledgebase>
<description>Knowledge base(s) which is (are) assumed - stops the Representation Network recursion</description>
<cpid>MISSING</cpid>
<cpid>MISSING</cpid>
</knowledgebase>
<standard>
<description>Standard(s) used</description>
<cpid>MISSING</cpid>
<cpid>MISSING</cpid>
</standard>
<software>
<renderingsoftware>
<description>Rendering software - one or more references</description>
<cpid>MISSING</cpid>
<cpid>MISSING</cpid>
</renderingsoftware>
<accesssoftware>
<description>Access software reference(s)</description>
<cpid>MISSING</cpid>
<cpid>MISSING</cpid>
</accesssoftware>
<processingsoftware>
<description>first level of processing software</description>
<cpid>MISSING</cpid>
<wraps>
<description>next level down of processing software</description>
<cpid>MISSING</cpid>
<wraps>
<description>next level down of processing software</description>
<cpid>MISSING</cpid>
</wraps>
</wraps>
</processingsoftware>
</software>
<algorithm>
<description>Algorithm(s) used</description>
<cpid>MISSING</cpid>
<cpid>MISSING</cpid>
</algorithm>
<package>
<description>outermost packaging</description>
<cpid>MISSING</cpid>
<packs>
<description>next level down of packaging</description>
<cpid>MISSING</cpid>
<packs>
<description>next level down of packaging</description>
<cpid>MISSING</cpid>
</packs>
</packs>
</package>
</otherri>
</rilabel>
--
DavidGiaretta - 19 Jan 2005
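As a rough check on how a consumer might read such a label, the sketch below parses a cut-down copy of the template and follows the recursive "wraps" elements from the outermost level inwards. The function name `chain` is hypothetical, and nothing here validates against dcc-rilabel.xsd; it only shows that the nesting can be walked with a standard XML parser.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the template above.
NS = {"ri": "http://dev.dcc.ac.uk/dcc-rilabel.xsd"}

# A cut-down copy of the template, keeping only the processing software branch.
SAMPLE = """<rilabel xmlns="http://dev.dcc.ac.uk/dcc-rilabel.xsd">
  <timestamp>2005-01-01T09:00:00.000Z</timestamp>
  <otherri>
    <software>
      <processingsoftware>
        <description>first level of processing software</description>
        <cpid>MISSING</cpid>
        <wraps>
          <description>next level down of processing software</description>
          <cpid>MISSING</cpid>
        </wraps>
      </processingsoftware>
    </software>
  </otherri>
</rilabel>"""


def chain(element, child_tag):
    """Follow a recursive element (wraps or packs) from the outermost level inwards."""
    levels = []
    while element is not None:
        levels.append((
            element.findtext("ri:description", namespaces=NS),
            element.findtext("ri:cpid", namespaces=NS),
        ))
        element = element.find(f"ri:{child_tag}", NS)
    return levels


root = ET.fromstring(SAMPLE)
outer = root.find("ri:otherri/ri:software/ri:processingsoftware", NS)
for description, cpid in chain(outer, "wraps"):
    print(cpid, "-", description)
```

The same function applied to the `package` element with `child_tag="packs"` would list the packaging levels in the same way.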
Appendix ##app#. Persistent IDs
Appendix ##app#.. Persistent ID Use Cases
Use Cases for Persistent Identifiers for DCC
- Users (in a general sense) need some way to refer to Representation Info held by the DCC - some kind of Identifier - which has to be
- persistent i.e. can be relied upon to be useful in the future
- probably distributed in order to bring in GDFR and PRONOM
- A user receives digital information which has associated with it (in some way) a label. The label points (in some way see ...) to one or more pieces of Representation Information (RepInfo). The user somehow then retrieves the RepInfo.
- A repository receives a request from a user for some digital information. The repository transmits the digital information to the user (as a DIP of some sort), including a label identifying appropriate RepInfo.
- Software producers of generic software wish to deal with arbitrary digital information - assuming it has an appropriate label. The software has to retrieve RepInfo - possibly recursing several times until it can know what needs to be done.
- A Registry/Repository of RepInfo, e.g. the DCC itself, receives some RepInfo. It needs to assign an Identifier to it and make that available.
- A Registry/Repository receives an ID in a request for some RepInfo. The Registry/Repository packages up the RepInfo and sends it to the requestor.
- ID could indicate relationship between RepInfo - but this could be a hostage to fortune. The inter-relations are indicated by associated labels
--
DavidGiaretta - 24 Sep 2004
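The use case in which generic software retrieves RepInfo, "possibly recursing several times", might look like the following sketch. The in-memory `REGISTRY`, its identifiers and the `known` set (standing for what the software or its user already understands) are all hypothetical; a real client would fetch each entry from a Registry/Repository.

```python
# Hypothetical registry: identifier -> identifiers of the further RepInfo
# needed in order to understand that piece of RepInfo.
REGISTRY = {
    "fits-syntax": ["fits-standard-pdf"],
    "fits-dictionary": ["fits-standard-pdf"],
    "fits-standard-pdf": ["pdf-syntax"],
    "pdf-syntax": [],
}


def collect_rep_info(label_ids, registry, known=frozenset()):
    """Return every RepInfo identifier needed, stopping at what is already known."""
    needed = []
    seen = set(known)
    stack = list(reversed(label_ids))
    while stack:
        rep_id = stack.pop()
        if rep_id in seen:
            continue  # already understood or already collected
        seen.add(rep_id)
        needed.append(rep_id)
        stack.extend(reversed(registry[rep_id]))
    return needed


# The software already understands PDF, so the recursion stops there.
print(collect_rep_info(["fits-syntax", "fits-dictionary"], REGISTRY, {"pdf-syntax"}))
```

The `seen` set also protects against cycles in the network of RepInfo, which a naive recursive lookup would not.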
Appendix ##app#.. Persistent ID Proposals
Persistent IDs for the DCC
Introduction
To assist people dealing with digital information, whether as a user or as a curator, the assumption here is that attached, logically or physically, to the digital data in which the information is encoded there is a label, discussed in detail elsewhere, which allows the user to gain access to appropriate Representation Information (RepInfo). Since the same information may be encoded in many different ways, it seems reasonable to assume some kind of normalisation is possible, and the label may be assumed to hold, in some way TBD, one or more identifiers, for example one identifier "pointing to" RepInfo about syntax and another "pointing to" semantics.
The focus of this report is on these identifiers to individual pieces of RepInfo. The identifiers are assumed to be persistent because they support digital curation which implies use over the long term. If the identifiers are totally within the management domain, persistence can be assured by that management domain. If the identifiers refer (or are referred from?) outside the management domain, then some guarantee of persistence is needed. References [1], [2] have details of several identifier schemes.
It is relatively easy to generate a unique identifier by having a hierarchical namespace,
_x.y.z_
in which each segment or namespace (i.e. each of x, y, z) forms part of a hierarchy of naming authorities. Where necessary, unique strings can be generated using some algorithm such as that used by the UUID (http://www.dsps.net/uuid.html). A UUID is a Universally Unique IDentifier: a 128 bit number which can be assigned to any object and which is, for practical purposes, unique. Uniqueness is achieved through combinations of hardware addresses, time stamps and random seeds.
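For illustration, Python's standard `uuid` module generates such identifiers. The sketch uses the random-based variant (`uuid4`), whose uniqueness is probabilistic rather than guaranteed; the helper name `new_name` is hypothetical.

```python
import uuid


def new_name():
    """Return a fresh random UUID as 32 hexadecimal characters (128 bits)."""
    return uuid.uuid4().hex


identifier = uuid.uuid4()
print(identifier)      # canonical form: five hyphen-separated groups
print(identifier.hex)  # the same 128 bits as 32 hexadecimal characters
print(new_name())
```

The time-and-hardware-address variant mentioned above corresponds to `uuid.uuid1()` in the same module.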
The difficult task is to make the link from the identifier (as a character string) to the object to which it points. In particular the bootstrap procedure must be in place: in other words, given a string, how does one know what to do with it, and where does one start?
The steps involved would be
- given "x.y.z" one somehow knows (i.e. the bootstrap step) that one uses some service "A" with which one can find out what "x" means, i.e. which tells one where to go to look up some service ("B") associated with "x". "A" will be referred to here as the bootstrap resolver service
- using service "B" we then find out something about "y" - in particular some service "C"
- using service "C" we then find out something about "z" - in particular some service "D" which will point, at last, to the object wanted. This will be referred to here as the terminal resolver service
We presumably can say something about the last service "D". On the other hand we may have no control over the others in the hierarchy.
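The resolution steps can be mimicked with nested lookups. This is a toy model only: the four dictionaries stand in for network services "A" to "D", and the assumption that the terminal resolver is keyed by the full identifier is ours.

```python
# Terminal resolver "D": maps the full identifier to the object itself.
SERVICE_D = {"x.y.z": "RepInfo object"}
# "C" knows which terminal resolver handles "z", "B" knows "C", and so on.
SERVICE_C = {"z": SERVICE_D}
SERVICE_B = {"y": SERVICE_C}
# Bootstrap resolver "A": the one service the client must know in advance.
SERVICE_A = {"x": SERVICE_B}


def resolve(identifier, bootstrap):
    """Walk from the bootstrap resolver to the terminal resolver, then look up the object."""
    service = bootstrap
    for segment in identifier.split("."):
        # A KeyError here means some resolver in the chain no longer knows the segment.
        service = service[segment]
    return service[identifier]


print(resolve("x.y.z", SERVICE_A))
```

Every dictionary in the chain is a point of failure, which is why the persistence of each name resolver appears as a separate issue.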
Thus, in the specific case of Persistent Identifiers for Representation Information (RepInfo), we have the issues of
- the bootstrap into the name resolution system
- the persistence of each of the name resolvers
- the DCC's own RepInfo registry/repository
Options
Fixed root: e.g. ISO-based
A number of persistent IDs are available, including DoI (http://www.doi.org/) and the CCSDS Unique ID (http://www.ccsds.org/documents/A31x0y1.pdf), which uses ISO/IEC 6523-1:1998, Information technology - Structure for the identification of organizations and organization parts - Part 1: Identification of organization identification schemes, 1998, and ISO/IEC 8824-1:1998, Information technology - Abstract Syntax Notation One (ASN.1): Specification of basic notation, 1998. This identifier will be unique among all "ISO and CCITT" identifiers.
Mutable root: e.g. ARK
The Archival Resource Key (ARK, see http://www.cdlib.org/inside/diglib/ark/) proposes an approach which tries to recognise the transitory nature of many of the things, from institutions to protocols, that we rely on today, and separates the unique identifier from the name resolver.
The form of the ARK system identifier is
[http://NMAH/]ark:/NAAN/Name
Of particular interest is the distinction that is made between the (optional) Name Mapping Authority Hostport (NMAH), which is mutable and replaceable, and the Name Assigning Authority Number (NAAN), which identifies the authority that actually assigns the "Name". The combination of "NAAN/Name" is a unique identifier. The NMAH provides the lookup capabilities which resolve the unique identifier. There is an associated bootstrap protocol which allows one to find an active NMAH given a unique identifier.
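The separation between the replaceable NMAH and the persistent "NAAN/Name" part can be shown with a small parser. The regular expression is a simplified reading of the form given above, not a full implementation of the ARK specification, and the function name is hypothetical.

```python
import re

# Optional "http://NMAH/" prefix, then "ark:/NAAN/Name".
ARK_PATTERN = re.compile(
    r"^(?:https?://(?P<nmah>[^/]+)/)?ark:/(?P<naan>[^/]+)/(?P<name>.+)$"
)


def parse_ark(text):
    """Split an ARK into its NMAH (may be None), NAAN and Name parts."""
    match = ARK_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not an ARK: {text!r}")
    parts = match.groupdict()
    # The unique identifier is independent of whichever NMAH was used.
    parts["unique_id"] = f"{parts['naan']}/{parts['name']}"
    return parts


print(parse_ark("http://foobar.zaf.org/ark:/12345/273mn382"))
print(parse_ark("ark:/12345/273mn382"))
```

Both calls yield the same `unique_id`, which is the property that lets the NMAH be replaced without invalidating the identifier.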
The guidance for the names includes:
- No ARK shall be re-assigned; that is, once an ARK-to-object binding has been published, that binding shall be considered unique into the indefinite future.
- to help them age and travel well, CDL-assigned ARKs shall contain no widely recognizable semantic information (to the extent possible).
- the ARK group provide a tool for generating opaque ids - see noid.pdf
- ARKs are generated with a terminal check character that guarantees them against single character errors and transposition errors.
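The terminal check character can be illustrated as below. This follows our understanding of the scheme described in the NOID documentation (a position-weighted sum over a 29-character alphabet) and should be treated as an assumption to be checked against noid.pdf rather than as a reference implementation.

```python
# Digits plus the consonants used for opaque names: 29 characters in all.
ALPHABET = "0123456789bcdfghjkmnpqrstvwxz"


def check_character(identifier):
    """Compute the check character for a "NAAN/Name" string without its check character."""
    total = 0
    for position, char in enumerate(identifier, start=1):
        value = ALPHABET.find(char)  # -1 for characters such as "/"
        total += position * max(value, 0)  # characters outside the alphabet count as 0
    return ALPHABET[total % len(ALPHABET)]


def is_valid(identifier_with_check):
    """True if the final character matches the check character of the rest."""
    body, check = identifier_with_check[:-1], identifier_with_check[-1]
    return check_character(body) == check


print(check_character("13030/xf93gt2"))
print(is_valid("13030/xf93gt2q"), is_valid("13030/xf93tg2q"))
```

Because each character is weighted by its position and 29 is prime, changing a single character or swapping two adjacent characters changes the sum modulo 29, which is what allows such errors to be detected.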
Further interesting ideas include providing not just a pointer to the digital object; each ARK links end-users to three things:
- Digital object metadata
- Digital object content files
- A commitment statement made by the host concerning the digital object, e.g. availability, mutability, etc
Discussion
The DCC could sign up to something like the DoI or could set up an independent system.
The DCC can define its own part of a unique identifier in many different ways. The UUID is one way if one expresses the 128 bits as 32 Hex characters.
This can be pre-pended by some hierarchy of names for bootstrap and name resolution:
- in the ARK system we need to be assigned a NAAN and our Unique ID is the "Name".
- with an ISO 8824 system we pre-pend by an appropriate sequence and we need to register our Naming Authority
For any particular digital object both these could be simultaneously valid. As long as both persistence routes end at the DCC name resolver they should both end up pointing at the same object. In fact a multiplicity of idempotent persistent identifiers could be simultaneously active.
This implies the use of existing, or creation of new, infrastructure (standards, protocols, servers etc) for persistent IDs with adequate flexibility and longevity. As part of the succession planning, agreement would be needed with, for example, the National Archives, to act as backup and inheritor of DCC data.
An alternative unique identifier for an object could be generated from the characteristics of the object in question rather than in some arbitrary way. One possibility (although not perfect, see http://eprint.iacr.org/2004/199.pdf) is to calculate the md5sum of an object in the DCCRR: this could be the md5sum of a document in the repository and/or the md5sum of the concatenated string formed from the metadata (attributes) of the object instance. The DCCRR itself could be identified via an md5sum generated from metadata that makes the RR unique, such as information about the DCC, address, names etc. The "address" of an object in the RR would then just be a string consisting of the RR identifier plus the object identifier.
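A sketch of this content-derived scheme using Python's `hashlib` follows. The way metadata attributes are concatenated (sorted `key=value;` pairs) and the metadata values themselves are assumptions made for the example; the text above does not fix either.

```python
import hashlib


def md5_hex(data):
    """Return the md5sum of a byte string as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()


def metadata_id(attributes):
    """md5sum of the concatenated metadata, with keys sorted so the order is stable."""
    joined = "".join(f"{key}={value};" for key, value in sorted(attributes.items()))
    return md5_hex(joined.encode("utf-8"))


def address(rr_attributes, document):
    """The RR identifier followed by the object identifier."""
    return metadata_id(rr_attributes) + md5_hex(document)


# Placeholder metadata, invented for the example.
rr = {"organisation": "DCC", "service": "Registry/Repository"}
print(address(rr, b"contents of some RepInfo document"))
```

Any change to the document or to the metadata yields a different identifier, so an identifier of this kind names one exact version; the cited weakness of MD5 means collisions can be constructed deliberately.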
I am wondering whether we need to consider other characteristics of identifiers besides persistence e.g.
- global vs local scope
- persistence vs transience
- uniqueness
- opacity vs intelligence
- actionability
Manjula -14/7/05
by "the identifier" I mean the "local" identifier - which is assumed here to be a UUID:
- global vs local does not come in to it
- persistence vs transience - not sure what you have in mind
- uniqueness - UUIDs claim some level of uniqueness. I'm not sure that a UUID is actually Universally unique but it probably could be regarded as unique within a registry AND is probably OK if registries merge their contents i.e. if a registry is no longer funded it may hand over its holdings to another registry.
- opacity vs intelligence - opacity seems easier and less "transient"
- actionability - that's what the thing pre-pended allows. Or do you mean something else?
...David
Implementation Issues
The list below contains a set of considerations for the implementation of a Persistent Identifier scheme.
1. Should we be concerned about the semantics behind the NAAN (and by extension, have a NAAN)? Does the NAAN imply custodianship of a particular object (custodianship which may change over time), or is this custodianship to be conveyed through the object's metadata? The latter option is far more logical, and is something we should specify clearly.
2. Are objects resolved by taking a route through the object's commitment statement? For example, dark archived material such as official MS Word specifications. Can we embed information within the commitment statement ensuring this will not be accessible until some specified date? It may be necessary to separate access from identification. Should the identifier for a "dark" object be known? This suggests a need for additional management services.
3. Hierarchical Items: The ARK specification recommends using a '/' character to separate an object hierarchy, and the use of the '.' character enables the identification of object variants. We may be able to avoid this for most items of RepInfo, as it is likely that each piece of RepInfo will have its own internal identifier which can easily be made into an external identifier when necessary. The hierarchical difficulty may arise from annotations upon RepInfo.
4. Time Dependency Issues: Perhaps the variant indicator ('.') can identify time issues, with the suffix equal to a UNIX timestamp. I guess the object's metadata/commitment statement would contain the set of variant times, and leaving the suffix off would resolve to the most recent version.
5. Equality of Objects:
Each of these refer to ark:/12345/273mn382. How do we ensure that each of these URLs resolves to the DCC terminal resolver service (thereby guaranteeing the contents of the object)?
6. An identifier is an association between a unique string and an object. There needs to be some level of commitment for associated services such as resolution. How will this be explicitly detailed?
--
AdamRusbridge - 17 Nov 2004
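The suggestion in item 4, a '.' suffix carrying a UNIX timestamp with the bare identifier resolving to the most recent version, could work along the following lines. The `VERSIONS` table and its contents are hypothetical placeholders for whatever the metadata or commitment statement would record.

```python
# Hypothetical record of variants: base identifier -> {UNIX timestamp: object}.
VERSIONS = {
    "ark:/12345/273mn382": {
        1100649600: "RepInfo as registered in November 2004",
        1106352000: "RepInfo as revised in January 2005",
    }
}


def resolve_variant(identifier, versions):
    """Resolve "base.timestamp" to that variant, or a bare identifier to the latest one."""
    base, dot, suffix = identifier.rpartition(".")
    if dot and suffix.isdigit():
        return versions[base][int(suffix)]
    latest = max(versions[identifier])
    return versions[identifier][latest]


print(resolve_variant("ark:/12345/273mn382.1100649600", VERSIONS))
print(resolve_variant("ark:/12345/273mn382", VERSIONS))
```

One consequence of this design is that a bare identifier no longer guarantees fixed content over time, so anything that needs reproducibility would have to cite the suffixed form.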
Proposal
- We define our own terminal resolver service.
- We register with multiple "name resolvers" including "http-URI", ARK, DOI etc
- We already have an "http-URI" - http://regrep.dcc.ac.uk probably
- DCC has been assigned 64269 as its NAAN
- We use an ARK identifier as our pilot identifier - but make it clear that this is not the only bootstrap resolver service.
- However we use UUIDs as the "Name" i.e. the last part of the Identifier
- We could simultaneously use a number of bootstrap resolver services, i.e. keep our own "Name" and terminal resolver service, but pre-pend the appropriate string for each of the bootstrap resolvers. This opens the possibility of allowing a number of routes through to the DCC terminal resolver service, thereby increasing the possibility of at least one route surviving over the long term.
<cpid>
<value>e1fe9271-cd48-4418-a63e-b112ebf792c7</value>
<resolver resolverType="URN">urn:uuid:</resolver>
<resolver resolverType="ark">http://foobar.zaf.org/ark:/64269/</resolver>
<resolver resolverType="doi">10.123456/</resolver>
<resolver resolverType="URL">http://regrep.dcc.ac.uk/</resolver>
<description>For example the ARK identifier is created by appending the string in "value" to that in the resolver of resolverType="ark". For completeness we should include references to (paper) documents describing these resolvers.</description>
</cpid>
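The rule given in the description element, appending "value" to each resolver string, can be applied mechanically. The sketch below parses the example above with the standard library; the function name `expand` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The example identifier from above, without the description element.
CPID = """<cpid>
<value>e1fe9271-cd48-4418-a63e-b112ebf792c7</value>
<resolver resolverType="URN">urn:uuid:</resolver>
<resolver resolverType="ark">http://foobar.zaf.org/ark:/64269/</resolver>
<resolver resolverType="doi">10.123456/</resolver>
<resolver resolverType="URL">http://regrep.dcc.ac.uk/</resolver>
</cpid>"""


def expand(cpid_xml):
    """Return {resolverType: full identifier} by appending the value to each resolver prefix."""
    root = ET.fromstring(cpid_xml)
    value = root.findtext("value")
    return {
        resolver.get("resolverType"): resolver.text + value
        for resolver in root.findall("resolver")
    }


for resolver_type, full_identifier in expand(CPID).items():
    print(resolver_type, full_identifier)
```

Each resulting string is a separate route to the same terminal "Name", so the loss of any one bootstrap resolver leaves the others usable.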
As for the long-term repository to which the Persistent IDs point, that could initially be provided by the DCC itself; however, succession planning is needed, and agreement should be reached with an organisation of guaranteed long-term existence, such as the National Archives, to act as backup and inheritor of DCC data.
The IEEE LOM (see ref [1]) has a similar construct, with the "catalog" tag providing information about the naming scheme e.g.:
<identifier>
<catalog>URI</catalog>
<entry>http://www.ukoln.ac.uk/</entry>
</identifier>
References
1. Guidelines for using resource identifiers in Dublin Core metadata and IEEE LOM
2. Life Sciences Identifiers
3. Persistent Identifier scheme for digital collections at the National Library of Australia
4. PADI report on Persistent identifiers
5. ERPANet meeting on Persistent Identifiers
6. ISO 11179-5 Part 5: Naming and Identification Principles
7. DCC Workshop on Persistent Identifiers
8. OASIS Extensible Resource Identifier
--
DavidGiaretta - 22 Jan 2005
Appendix ##app#.. Implications
- Use of existing, or creation of new, infrastructure (standards, protocols, servers etc) for persistent IDs with adequate flexibility and longevity
- as part of the succession planning, agreement would be needed with, for example, the National Archives, to act as backup and inheritor of DCC data.
Note that many of the features here have strong resonance with the CEDARS project with its view of Representation Information, with the Persistent Identifiers similar to the CEDARS Reference ID (CRID), as well as the CCSDS ADID and Control Authority system.