Keywords: Digital Libraries, Digital Archives, Virtual Communities, Internet, WWW, DIENST.
1 Introduction
A recently emerging issue is the redefinition of the role of libraries, which raises the question of what a "digital library" is in a global networked world. This paper begins by discussing and defining the concept of a networked digital library, and then describes a related demonstration project: ARQUITEC. The purpose of ARQUITEC is to set up a prototype of a networked digital library for the Portuguese research and academic community, which will be used to test the concept and the technology.
"The net isn't 30 million people,
it's tens of thousands of overlapping groups ranging from a few
people to perhaps a couple of hundred thousand at the largest"
(O'Reilly, 1996).
It has been widely pointed out that information technology in general, and the Internet in particular, supports the existence of virtual communities, defined as communities of individuals who share common interests but are not geographically confined. Evidence of this is the thousands of existing newsgroups and electronic mailing lists, dedicated to almost every cultural, professional and political perspective.
This is the perspective from which we will define the concept of a networked digital library. Table 1 summarizes our vision of this new paradigm.
In what we call the traditional library, the subject is the book. Its value is "sacred" (otherwise it wouldn't have been purchased) and it is stored "for ever". In this scenario authors decide what to write and when to edit their books, while the librarian decides whether to buy them or not. Finally, the librarians expect the patrons to come to the library and request the book. It was more or less like that until the middle of this century, when industrial development changed it.
Industrial development reduced printing costs, illiteracy and physical distances, while at the same time it increased the amount of information produced. It is no longer possible for an individual to absorb all the knowledge produced by mankind, so specialization is necessary. Specialization brought thematic magazines, journals, reports and conferences. A new subject emerging from this reality is the "paper", representing a new kind of knowledge. It is no longer "sacred", but it remains formal, validated by the credibility of an editor or a review committee. This knowledge is not intended to be valid "for ever", but to be discussed during a period of time and refined; what survives in the end is then sanctified in books.
| Paradigms | Traditional Library | Specialized Library | Networked Digital Library |
| --- | --- | --- | --- |
| Subject | The Book | The Paper | The Idea |
| Knowledge | Sacred | Formal | Informal |
| Memory | Persistent | Semi-persistent | Volatile |
| Actors | Author, Librarian | Community, Editor | Community |
| Dissemination | Very Slow | Fast / Slow | Very Fast |
| Library role | Passive | Active | Interactive |
Table 1: The library paradigms.
It is difficult for the traditional library to follow this specialization, so the library itself becomes specialized, with the mission of serving specific communities. Those communities now control the library content in their own interest, in the sense that they decide which periodicals to subscribe to and what to buy.
In this scenario the library is asked to perform a more active role. Since the communities are well identified, it is now possible to anticipate their needs and to provide customized services, such as notifying readers of new issues of periodicals, announcing new relevant publications, etc.
But the scenario has been changing again with the arrival of the computer. With computer networks, electronic mail and newsgroups, communities intensify their interactions, while with desktop publishing tools and the WWW everyone becomes a potential publisher. The process gains speed, and the subject is now the idea. To produce fast results, ideas are submitted as pre-prints or presented for discussion as position papers in informal workshops. Ideas that succeed in this process are then formally published in journals and presented at conferences. The question now is what impact this new reality will have on the library world.
With electronic mail and the WWW, it is now easier for the library to identify and reach the communities and to provide them with new services. For the same reason, it is now easier for users to interact with the library, not only to access OPAC services but, in an extreme scenario, also to contribute new kinds of meta-knowledge that can notably enrich the library. Examples of such contributions are the tuning and completing of thesauri and catalogues (allowing dynamic and collaborative cataloguing), the attachment of annotations and comments to the stored documents (allowing collaborative refereeing, for example), the participation in discussions supported by thematic electronic mailing lists, etc.
We conclude this discussion with our vision and a definition of the concept of a networked digital library:
A networked digital library is defined not only as an organized repository of data and information, with the traditional mission to preserve, organize and provide access to the associated knowledge, but also as a system with the mission to stimulate, support and register the process by which that knowledge is created.
It is now our mission to demonstrate how to turn this vision into reality.
ARQUITEC will provide support for a three-step workflow in the production of information, comprising: informal documents, such as position papers, drafts and preprints; refereed documents, such as papers published in refereed journals; and formal documents, such as theses, dissertations and reports.
New realities, such as increasing scholarly and scientific activity, have resulted in a growing number of publications rich in new interdisciplinary perspectives. This kind of content raises serious classification problems for traditional libraries, where collections have been classified with catalogues usually defined by static structures. In order to deal with this dynamic classification problem, ARQUITEC will provide users with an interactive catalog of the documents.
The catalog will be supported by a document index, a multi-context and multi-lingual thesaurus, and the results of user interactions. Users will be able to contribute to the catalog directly, by suggesting new keywords for documents or questioning existing ones, or indirectly, by suggesting new relationships for the thesaurus or questioning the existing ones.
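The interaction between user suggestions, the catalog and the thesaurus can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the class `InteractiveCatalog`, its method names, and the rule that an accepted "question" removes the keyword are hypothetical and are not taken from ARQUITEC or MCF. Questioning thesaurus relationships and the review policy itself are not modelled.

```python
from collections import defaultdict


class InteractiveCatalog:
    """Hypothetical catalog: accepted keywords plus pending user suggestions."""

    def __init__(self):
        self.keywords = defaultdict(set)  # document id -> accepted keywords
        self.related = defaultdict(set)   # thesaurus term -> related terms
        self.suggestions = []             # (user, action, subject, value)

    def suggest_keyword(self, user, doc_id, keyword):
        # Direct contribution: propose a new keyword for a document.
        self.suggestions.append((user, "add-keyword", doc_id, keyword))

    def question_keyword(self, user, doc_id, keyword):
        # Direct contribution: question a keyword already on a document.
        self.suggestions.append((user, "question-keyword", doc_id, keyword))

    def suggest_relation(self, user, term, other):
        # Indirect contribution: propose a new thesaurus relationship.
        self.suggestions.append((user, "add-relation", term, other))

    def accept(self, index):
        """Apply one pending suggestion (who reviews it is not modelled)."""
        _, action, subject, value = self.suggestions.pop(index)
        if action == "add-keyword":
            self.keywords[subject].add(value)
        elif action == "question-keyword":
            # Assumption: an accepted question removes the keyword.
            self.keywords[subject].discard(value)
        elif action == "add-relation":
            self.related[subject].add(value)
            self.related[value].add(subject)

    def find(self, term):
        """Return documents indexed under the term or any related term."""
        terms = {term} | self.related[term]
        return sorted(d for d, kws in self.keywords.items() if kws & terms)
```

In this sketch a thesaurus relationship widens a search: once "thesaurus" and "controlled vocabulary" are related, a document indexed under either term is found by a query for the other.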
To develop the catalog, the thesaurus and the interaction between them, experimental work has been done with MCF (Gutha, 1996), a recent language for meta-content representation. The thesaurus structure follows the ISO-5964 standard (ISO, 1995).
Users can access ARQUITEC in one of two modes: anonymous or identified.
An identified user has a profile, composed of explicitly provided data and of data implicitly extracted from the history of that user's interactions with the system. For example, if a user retrieves a document related to a specific subject that is not in their explicit profile, that subject is implicitly added to the profile. At any moment users can access their own profile and change this implicit data.
User profiles serve three main purposes: searching (to rank search results), filtering (the profile is used to notify the user of new documents of potential interest) and collaboration (any identified user can contribute annotations to the documents, as well as suggestions about their classification).
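The profile mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `UserProfile`, `rank` and `should_notify` are hypothetical, and the ranking rule (count of shared subjects) is a simple stand-in, since the paper does not specify how ARQUITEC scores results.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Hypothetical profile: explicit subjects plus subjects inferred from history."""
    explicit: set[str] = field(default_factory=set)
    implicit: set[str] = field(default_factory=set)

    def record_retrieval(self, document_subjects: set[str]) -> None:
        # Subjects the user did not declare are added to the implicit part.
        self.implicit |= document_subjects - self.explicit

    def forget(self, subject: str) -> None:
        # Users may inspect and change their implicit data.
        self.implicit.discard(subject)

    def interests(self) -> set[str]:
        return self.explicit | self.implicit


def rank(results: dict[str, set[str]], profile: UserProfile) -> list[str]:
    """Searching: order document ids by subjects shared with the profile."""
    wanted = profile.interests()
    return sorted(results, key=lambda doc: (-len(results[doc] & wanted), doc))


def should_notify(document_subjects: set[str], profile: UserProfile) -> bool:
    """Filtering: notify when a new document shares any subject with the profile."""
    return bool(document_subjects & profile.interests())
```

Keeping the explicit and implicit parts in separate sets lets a user delete an inferred subject without touching what they declared themselves.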
Registered users will be managed by a distributed directory based on the X.500 model, with an LDAP interface (Yeong et al., 1995).
3.3 The official archive
A central archive will be maintained at the National Library, with a copy of the formal documents (refereed documents can also be archived, after copyright has been secured from their producers). This archive will automatically harvest new documents from the local servers, storing and cataloguing them in a central repository.
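The harvesting rule can be sketched as follows. The record fields `kind` and `copyright_cleared`, the function names, and the in-memory dictionaries standing in for the local servers and the archive are all assumptions made for this example; the paper does not describe the actual harvesting interface.

```python
def is_archivable(record):
    """Formal documents always qualify; refereed ones need cleared copyright."""
    if record["kind"] == "formal":
        return True
    return record["kind"] == "refereed" and record.get("copyright_cleared", False)


def harvest(local_servers, archive):
    """Copy new archivable documents into the archive and return their keys.

    local_servers: mapping of server name -> {document id: metadata record}.
    archive: mapping of "server/document id" -> stored copy of the record.
    """
    added = []
    for server, documents in local_servers.items():
        for doc_id, record in documents.items():
            key = f"{server}/{doc_id}"
            if key not in archive and is_archivable(record):
                archive[key] = dict(record)  # store a copy, not a reference
                added.append(key)
    return added
```

Because documents already in the archive are skipped, running the harvest repeatedly only picks up what is new since the previous run.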
An important requirement for the documents is the persistence of their names. Depending on whether they are serial publications or isolated books, printed documents are usually identified by ISSN or ISBN numbers, but for digital publications no such mechanism exists yet. It is usual to register CD-ROM publications with ISSN or ISBN numbers, especially if they are related to printed publications (such as the CD-ROMs distributed with magazines), but for on-line publications this is not of great help. In fact, the publication of an on-line document is an almost instantaneous process (it basically requires the time to store and index it in an FTP or HTTP server), and there is no expeditious way to request an ISBN number for that document that is compatible with this workflow.
Another important problem raised by on-line publications is that the name, or reference, of a document should not only be a unique reference identifying that object in a specific name space, but should also provide a way to access it (the reference must "say" where the object is and how to get it).
The problem of naming objects in a digital library was reported in the "Kahn/Wilensky Report", from which emerged the concept of a handle as a URN (Kahn & Wilensky, 1995). A first and simplified version of that concept was implemented by OCLC in the PURL - Persistent URL service.
In its structure, a PURL is a normal URL - Uniform Resource Locator, with a generic structure like:
http://DNS of the PURL server…/document name...
A PURL has a logical meaning: when it is used, it implies an access to the PURL server, which acts as a proxy. The server automatically translates the PURL to the real URL of the document and then performs a simple HTTP redirect.
A PURL service, for all the persistent
documents with copies archived at the National Library, will be
provided in ARQUITEC.
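The translate-and-redirect step of a PURL server can be sketched with Python's standard `http.server` module. This is not OCLC's implementation: the table contents, the example host names, the `resolve` and `serve` helpers, and the choice of status code 302 for the redirect are assumptions made for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical resolution table: persistent name -> current location.
PURL_TABLE = {
    "/arquitec/report-0001": "http://repository.example.org/docs/report-0001.ps",
}


def resolve(purl_path: str) -> tuple[int, dict[str, str]]:
    """Return the HTTP status and headers used to answer a PURL request."""
    target = PURL_TABLE.get(purl_path)
    if target is None:
        return 404, {}
    # Assumption: a temporary redirect, so clients keep using the PURL.
    return 302, {"Location": target}


class PurlHandler(BaseHTTPRequestHandler):
    """Answers every GET with a redirect to the document's real URL."""

    def do_GET(self) -> None:
        status, headers = resolve(self.path)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Run the resolver until interrupted (not called automatically)."""
    HTTPServer(("localhost", port), PurlHandler).serve_forever()
```

When a document moves, only its entry in the table changes; the PURL that readers have cited stays the same, which is the persistence property the text asks for.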
NCSTRL is a network of servers providing three kinds of services: repository, indexing and user interface (Davis, 1995). Currently NCSTRL is a worldwide service, with repositories installed in over 60 universities and research centers across the world.
INESC has been experimenting with the NCSTRL technology since mid-1996. We were impressed by its capabilities as a potential framework for future work, especially its open architecture model and its ability to handle documents in several formats. Therefore we decided to use it as the core technology for ARQUITEC.
The NCSTRL architecture is based on a network of DIENST servers, each one managing a repository of documents, the respective index and a user interface (Davis & Lagoze, 1994). The user interface is implemented in HTML and provided through an HTTP server (the DIENST server is written in Perl, and its interface to the HTTP server uses CGI). A user can access any server from any user interface, since user searches can be performed on all the indexes.
Optionally, the repositories can be accessed via lite servers. In this case each site only has to provide a metadata description file and have its documents accessible by FTP or HTTP. A remote lite server converts that metadata to the DIENST format, indexes it, and provides a normal DIENST interface for the users and for the other DIENST servers. In the specific case of the NCSTRL service, it has only one central server for all the registered lite repositories.
A backup server can maintain a copy of all the indexes, which is useful if one of the servers becomes inaccessible. In that case users will not be able to perform retrievals, but at least they will be able to search and find references to the desired documents.
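The fallback behaviour described above, where searching still works but retrieval does not, can be sketched as follows. The function `search_all`, the `ServerUnavailable` exception, and the representation of servers as plain callables are hypothetical; the sketch does not use the DIENST protocol itself.

```python
class ServerUnavailable(Exception):
    """Raised by a hypothetical server stub that cannot be reached."""


def search_all(query, servers, backup_index):
    """Search every server's index; fall back to the backup copy on failure.

    servers: mapping of server name -> callable(query) returning document ids.
    backup_index: mapping of server name -> {document id: set of keywords}.
    Returns a list of (document id, retrievable) pairs.
    """
    hits = []
    for name, search in servers.items():
        try:
            hits += [(doc, True) for doc in search(query)]
        except ServerUnavailable:
            # The reference is still found, but the document cannot be fetched.
            copy = backup_index.get(name, {})
            hits += [(doc, False) for doc, words in copy.items() if query in words]
    return hits
```

The boolean in each pair lets a user interface show a reference found through the backup index while marking the document itself as temporarily unavailable.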
The core of ARQUITEC is based on a modified and extended version of DIENST (version 4.0). The required modifications were made mainly in three modules of NCSTRL, corresponding to three different tasks of ARQUITEC: replacement of the indexing and searching tool, modification of the repository management and modification of the interface.
The original DIENST indexing and search tool has been replaced by a more powerful catalog, as described. The new requirements implied modifications at the NCSTRL repository interface level, in order to perform full-text indexing of as many document formats as possible (such as ASCII, PostScript, MS-Word, etc.), as well as in different languages.
Concerning the management of information, the main generic problems were the procedures for submission of the documents, their classification and search, as well as the creation and management of the central archive. The core of the NCSTRL system was also modified in order to allow the automatic creation and management of the official central archive.
The NCSTRL user interface was modified in order to support all the described requirements, new functions and services. The modifications were made essentially in the submission of documents (which can be done remotely), as well as in the support of the search task. All the interface components were redesigned to support multi-lingual access (Portuguese and English in the first release).
Medium term work will be concerned with the integration of other information spaces, accessible by new interfaces at lite DIENST servers. Examples will be interfaces for Z39.50 servers, useful for the integration of OPAC systems such as the catalogs of conventional libraries, and HARVEST brokers, useful for the support of informal publications and other similar material such as archives of mailing lists, public source code, etc.
Examples of other identified research issues requiring our attention in the medium/long term are:
Document structuring: research will be done on using SGML and other alternative solutions for structuring the information objects;
Natural language: trials will be done in the classification and search of documents with natural language techniques, with a special concern for the Portuguese language;
Authentication and certification authorities: the requirements for secure authentication and certification authorities, for both documents and users, will be addressed in the medium term;
Long term preservation: how will the official repository survive the evolution of hardware and software (such as the storage technology, document formats, etc.)?
Davis, J. R. (1995). Creating a Networked Computer Science Technical Report Library. D-Lib Magazine, September 1995. Available at http://www.dlib.org/dlib/september95/09davis.html
Davis, J. R.; Lagoze, C. (1994). A protocol and server for a distributed digital technical report library. Technical Report TR94-1418, Computer Science Department, Cornell University, 1994.
Gutha, R. V. (1996). Meta-Content Format. Apple Computer. Available at http://mcf.research.apple.com/hs/mcf.html
ISO - International Organization for Standardization (1995). ISO-5964: Documentation Guidelines for the establishment and development of multilingual thesaurus. ISO, Geneva, 1985.
Kahn, R.; Wilensky, R. (1995). A Framework for Distributed Digital Object Services. CNRI. Available at http://WWW.CNRI.Reston.VA.US/home/cstr/arch/k-w.html
O'Reilly, T. (1996). Publishing Models for Internet Commerce. Communications of the ACM, June 1996, Vol. 39, No 6, 79-86.
Yeong, W.; Howes, T.; Kille, S. (1995). RFC 1777: Lightweight Directory Access Protocol. IETF Network Working Group. Available at http://www.umich.edu/~rsug/ldap/doc/rfc/rfc1777.txt