Version : 1
Date : September 4th, 1997
Author : SIEMENS
Confidentiality : Public
Status : Draft
Networking and Digitization
Archiving and Managing Digital Information
Archiving digitised multimedia information offers many advantages: network access to any piece of information within a very short time, no danger of losing or destroying the original material, lossless copying once the material has been digitised, and automatic repair functions provided by hierarchical storage management systems.
But there are also challenges: experience with the lifetime of magnetic media is mostly theoretical, access times to the stored material vary widely, and the limitations of the public network also limit access. However, there are many configuration options for designing an appropriate digital archive solution consisting of different storage hierarchies with network access.
The following paragraphs give an overview of the existing and planned digital archive technologies.
Online storage systems give immediate access to the data, both locally and remotely via the network. Local hard disks provide immediate access to stored multimedia material such as audio, video or still images. Because local storage capacity is limited, the connection to an online archive is managed via a local or wide area network (LAN, WAN). Online storage systems use directly accessible hard disks ranging from a few GB up to 2 TB. Owing to the steady decline in price and growth in capacity, the configuration of large online systems is becoming more and more cost effective. RAID technology supports this level of the archive hierarchy in an ideal way.
Applications with real-time streams of large binary objects such as audio or video place heavier burdens on fixed disk drives. This creates a strong need to bridge the gap between processor performance and input/output rates.
One solution that offers significant advantages is RAID (redundant array of inexpensive drives). RAID technology turns several inexpensive drives into one big drive with multiple actuators. The controller can manipulate these actuators either in parallel, to share the work on reads or writes of large files, or independently, to perform multiple simultaneous reads or writes of small files.
Simply put, whenever the DODRaid (Data On Demand) RAID controller receives instructions from the host to write a block of data to the array, it "stripes" or breaks that block of data into smaller pieces and apportions the pieces among the drives in the array. This decreases the time needed to complete the write operation. Similarly, when the host asks to read a block of data from the array, the DODRaid determines which drives are involved and simultaneously fetches the data from each drive, again making use of the multiple actuators at its disposal.
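As an illustration, the striping principle can be sketched in a few lines of Python (a minimal sketch, not the actual DODRaid firmware; the chunk size and drive count are assumed values):

    def stripe(block, n_drives, chunk=4096):
        # Break the block into chunk-sized pieces and deal them round-robin
        # across the drives, so all actuators work on the transfer at once.
        pieces = [block[i:i + chunk] for i in range(0, len(block), chunk)]
        drives = [[] for _ in range(n_drives)]
        for i, piece in enumerate(pieces):
            drives[i % n_drives].append(piece)
        return drives

    def unstripe(drives):
        # Reassemble the block: piece i was written to drive i % n, slot i // n.
        n = len(drives)
        total = sum(len(d) for d in drives)
        return b"".join(drives[i % n][i // n] for i in range(total))

    assert unstripe(stripe(b"x" * 10000, n_drives=4)) == b"x" * 10000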
There are several different methods of RAID in existence. The main differences between methods are in the way data is striped and if and how parity is implemented. These different methods are called "RAID Levels."
RAID 0 Striped Disk Array without Fault Tolerance
implements a striped disk array: the data is broken down into blocks and each block is written to a separate disk drive. I/O performance is greatly improved by spreading the I/O load across many channels and drives.
RAID 0 is not "true" RAID because it is not fault tolerant; the failure of just one drive will result in all data in the array being lost. Therefore RAID 0 is not suitable for archive solutions.
RAID 1 Mirroring and Duplexing
supports twice the read transaction rate of single disks and the same write transaction rate as single disks. 100% redundancy of data means no rebuild is necessary in case of disk failure, just a copy to the replacement disk. It offers the highest ECC (error checking/correction), but the 100% disk overhead makes the solution expensive. As typically implemented, it may not support hot swap of a failed disk.
RAID 2 Hamming Code ECC
Each bit of the data word is written to a data disk drive, and each data word has its Hamming code ECC word recorded on the ECC disks. On read, the ECC code verifies correct data or corrects single disk errors, making "on the fly" error correction possible. Entry level costs are very high, and it requires a very high transfer rate.
RAID 3 Parallel transfer with parity
The data block is subdivided ("striped") and written on the data disks. Stripe parity is generated on writes, recorded on the parity disk and checked on reads. RAID 3 supports very high read and write data transfer rates, and disk failure has a low impact on throughput. The transaction rate is at best equal to that of a single disk drive.
RAID 5 Independent Data disks with distributed parity blocks
Each entire data block is written on a data disk; parity for blocks in the same rank is generated on writes, recorded in a distributed location and checked on reads. Disk failure has a medium impact on throughput.
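The parity mechanism used by RAID 3 and RAID 5 can be illustrated with plain XOR (a minimal sketch with toy block values):

    from functools import reduce

    def parity(blocks):
        # XOR all blocks byte-wise; the result is the parity block.
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    data = [b"AAAA", b"BBBB", b"CCCC"]  # data blocks in the same rank
    p = parity(data)                    # recorded in a distributed location
    # If the disk holding data[1] fails, XOR-ing the surviving blocks with
    # the parity block reconstructs the lost block:
    assert parity([data[0], data[2], p]) == data[1]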
RAID 6 Independent Data disks with two independent distributed parity schemes
RAID 6 is essentially an extension of RAID level 5 which allows for additional fault tolerance by using a second independent distributed parity scheme. Data is striped on a block level across a set of drives, just as in level 5, and a second set of parity is calculated and written across all the drives. RAID 6 provides extremely high data fault tolerance and can sustain multiple simultaneous drive failures. It is therefore a perfect solution for mission critical applications, but due to its very poor write performance it is not suitable for multimedia archive solutions.
RAID 10 Very High Reliability combined with High Performance
RAID 10 is implemented as a striped array whose segments are RAID 1 arrays. It has the same fault tolerance as RAID level 1 and the same fault-tolerance overhead as mirroring alone. It is an excellent solution for sites that would otherwise have gone with RAID 1 but need an additional performance boost; however, it is very expensive.
RAID 53 High I/O Rates and Data Transfer Performance
RAID 53 should really be called "RAID 03" because it is implemented as a striped (level 0) array whose segments are RAID 3 arrays. It has the same fault tolerance as RAID 3 as well as the same fault tolerance overhead.
It may be a good solution for sites that would otherwise have gone with RAID 3 but need an additional performance boost; however, it is very expensive to implement.
Multimedia information is described with so-called metadata. In classical archive systems these data describe the multimedia information with keywords, which are assigned according to specific thesauri or word lists. More advanced metadata also includes content-based information such as key images. These metadata are stored in a database and are needed to reference the media information; such database applications are therefore crucial for digital libraries. Traditionally centralised to ensure performance, availability, and manageability, these applications often run on expensive, proprietary platforms. Database platforms must meet certain requirements:
Transaction performance scalability
Very high levels of data availability
Maintenance of data integrity despite component failures
Easy-to-use tools that facilitate the management of high performance, high availability environments
Several companies provide specially designed servers optimised for database environments. A server architecture for database applications should support high transaction performance. It must combine high scalability with high reliability, availability, and serviceability to deliver the highest transaction performance and an easy-to-administer parallel database server solution.
For storing the media data, high I/O performance is the main requirement. Servers for storing media data should offer high performance on the I/O channels and a large number of SCSI channels. Media servers must move hundreds of millions of bits per second; manipulating a large number of multimedia data streams requires huge I/O bandwidth. Some vendors support very fast I/O throughput; Silicon Graphics, for instance, uses the POWER Channel(TM)-2, the industry's highest performing I/O bus at 320MB/second per controller, with a sustained maximum bandwidth of 1.2GB/second when configured with four controllers. They also provide a new standard for file systems, XFS(TM). XFS is a guaranteed-rate, deterministic 64-bit file system which allows a specific transfer rate and duration to be guaranteed for video files.
For very large archive systems, online archives cannot hold all the media material. Therefore data migration is used to store the media data in different hierarchy levels. These storage levels consist of different storage media, which will be discussed in a later chapter. This chapter gives an overview of the existing hierarchical storage management (HSM) systems usable for this critical application in multimedia archive solutions.
The concept of enlarging the primary storage (RAM) with secondary storage (disk) by integrating the latter into the memory system (disk caching) is well known and standard in all computer systems. An HSM extends this concept to tertiary storage such as tape or optical disks with huge storage capacity: secondary storage is made to appear much larger by integrating tertiary storage behind it. Just as the virtual memory system makes it convenient to write large applications, HSM makes it convenient to handle large amounts of data. The term `Virtual Disk' has often been used to summarise what an HSM means.
HSM is not a backup strategy, at least not in the traditional sense: an HSM does not take a snapshot-rollback approach to data protection. Rather, an HSM's continuous data protection offers a cost-effective alternative.
HSM is not an archive or a volume manager, nor is it similar to products that offer the user a way to keep track of "what files are on which tapes." While such products can be said to extend the secondary storage capacity, there is a fundamental difference: transparency. With an archive solution, the user must consciously designate and move data deemed archive candidates. Once the data is archived and deleted from disk, any further access to that data is through a cataloguing tool, a departure from typical methods of data access. From the user perspective, there is a definite gulf between "active" and "archive" data. An HSM, however, moves data between disk and tape automatically, while presenting to users and their application software a seamless view of their data.
The key concept behind an HSM is a speed/space balancing act. The need for speed is met by keeping active data on fast storage media (disks), while the need for space is met by keeping inactive data on less expensive storage media (tapes). New data entering the system is kept on fast media for fast access. Over time, with less frequent access, the need for this data to remain on fast media diminishes, and the data is migrated down to lower levels of the media hierarchy.
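Such a migration policy can be sketched as follows (a minimal sketch; the water-level marks and the purely age-based ordering are assumed example values, real products weigh several parameters):

    from collections import namedtuple

    File = namedtuple("File", "name size last_access")

    HIGH_WATER, LOW_WATER = 0.90, 0.70  # assumed water-level marks

    def migrate(files, used, size, move_to_tape):
        # Migrate the least recently accessed files first (leaving a stub on
        # disk) until usage falls back below the low water mark.
        if used / size < HIGH_WATER:
            return used
        for f in sorted(files, key=lambda f: f.last_access):
            move_to_tape(f)
            used -= f.size
            if used / size <= LOW_WATER:
                break
        return used

    files = [File("a.mpg", 30, 100), File("b.wav", 50, 5), File("c.mov", 15, 60)]
    used = migrate(files, used=95, size=100, move_to_tape=print)
    # b.wav, the least recently used file, is migrated; usage drops to 45%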
Today, there is a wide range of approaches to the HSM solution. From transparent data access methods to robotics control and data formats on tertiary storage media, the industry is far from agreement on the best way. Some HSMs reside in the virtual file system (vfs) layer of the server operating system and, by means of kernel hooks and stub files, present a file system to the clients. Others reside in the user application space, negotiating the same file requests without kernel hooks. Some favour the tar format when storing data to tapes while others, for performance or tape wear reasons, take a different tack. All these differences aside, however, the solution, as it looks to the user, is an expanded storage capacity with no obvious file system distinctions between active and inactive data.
HSMs with the features needed for high quality archive solutions are not available for NT systems at the moment, but some vendors claim to be working on them.
UniTree is the open systems HSM engine from UniTree Software Inc. It is designed to support dynamic, rapidly expanding systems with large data storage requirements, placing no logical limitation on file size, number of files, or total amount of data managed. Based on the IEEE Mass Storage System Reference Model, UniTree is implemented as a set of co-operating message-passing processes, using the client/server model. UniTree's distributed architecture allows each process to run on separate networked computers for maximum system flexibility. UniTree enables parallel migration and staging as well as multiple storage hierarchies, and it improves FTP and NFS performance. The migration strategy follows defined rules. UniTree also supports disk partitions on Sun 2GB and the 64K family. The media and drives are configurable, and UniTree is compatible with most backup software (Legato, CAM, SMArch).
HIARC HSM runs on Solaris 2.4 and above. It supports 4mm, 8mm, 3480, DLT, VHS, D-1 and D-2 tape drives, and appropriate robotics. Migration runs automatically, with total granularity of control if desired, using four parameters for migration, each weighted and measured against four water-level marks.
AMASS is a high performance UNIX file system that gives applications transparent access to files stored in robotic systems; the user's application software need not be modified. Data moves block-based between the hard disk cache and tape or optical tertiary storage. The relevant difference to other HSMs is the support of mixed-media archives: AMASS supports a huge range of devices, autochangers, and operating systems. Systems run from a few gigabytes up to hundreds of TB, and since AMASS is block based, files can be larger than a piece of media. The separate product DataMgr migrates files from client machines to the AMASS server automatically, started periodically outside peak times; when a predefined threshold is reached, the migration starts automatically. The migration control follows different strategies such as age, size or location. It supports different numbers of copies for data security. The file content remains readable during reloading. Direct reloading from every hierarchy level is supported, as are distributed storage hierarchies. For multimedia information like audio and video with very large files, multi-volume support is essential.
The near-line storage system maintains the media information on storage devices which offer a very good price/capacity ratio but need more time to access the data. Such storage media are tapes, MODs, DVD, or CD-ROM. The best price/capacity ratio is certainly offered by magnetic tapes; the access time depends on the transfer rate, the size of the storage medium and its type.
We use "near-line" in contrast to off-line because the media information is still available via the network without any manual interaction; only the access time is longer than from the online archive and can last several minutes. The off-line archive really means off-line: the material is archived in a separate room, not automatically accessible, and manual interaction is needed to insert the material into the archive robotics system. The administration of such an off-line archive is still maintained by the database.
The transfer into and from the near-line archive is managed by the HSM. The near-line archive can consist of various levels represented by different storage media. More than one near-line archive level is necessary when different access times to different content are desired, for example for a browsing archive versus a high quality archive. Browsing archives hold highly compressed information which gives a good representation of the content but needs only a small bandwidth for transfer. With the browsing information, retrieval becomes very meaningful and enables a content-based search. The browsing archive needs a smaller amount of storage capacity, and for such files fast access is important. Therefore the browsing archive can be stored on fast accessible devices with limited storage capacity.
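A rough calculation shows why the browsing copy is so much cheaper to store and to transfer than the high quality original (the bitrates are assumed example values, not measurements):

    HQ_MBIT_S = 25.0     # assumed bitrate of the high quality copy
    BROWSE_MBIT_S = 1.5  # assumed bitrate of the compressed browsing copy

    def gb_per_hour(mbit_per_s):
        # megabit/s * seconds per hour -> megabytes -> gigabytes
        return mbit_per_s * 3600 / 8 / 1000

    print(f"high quality: {gb_per_hour(HQ_MBIT_S):.2f} GB/h")     # ~11.25 GB/h
    print(f"browsing:     {gb_per_hour(BROWSE_MBIT_S):.2f} GB/h") # ~0.68 GB/h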
Nowadays we no longer speak of "permanent storage material" but of "permanent data" ("forever young"). This is possible because data migration systems are able to check the quality of the digitally stored data, correct errors, and even make a new copy - that means a new original - when the life cycle of the storage medium comes to an end. Nevertheless, the longer the medium itself lasts, the cheaper it is, and the faster the access to the stored data, the better the medium can be used. To decide which storage medium has the most value for a specific solution, all three factors - durability, cost, and access speed - have to be taken into account. In heavy-duty online applications, where data is typically written once and read many times, high performance and durability are critical. To meet these user needs, today's storage media must offer:
greater durability and data integrity in heavy-duty-cycle applications
flexibility and compatibility in complex computing environments
ability to handle leading-edge HSM and real-time data acquisition applications
lowest possible cost-per-GB stored
With more data on fewer cartridges, data can be accessed faster because fewer tape changes are needed. On the other hand, the more information is stored on one tape, the longer the positioning to the appropriate place on the tape takes. The cost of an entire tape storage operation can be lowered by reducing both the capital equipment investment and the amount of operator intervention required. Higher cartridge capacity also allows more data to be online for a given number of drives, which is particularly valuable in applications such as multimedia, where more image files can be online simultaneously.
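This trade-off can be made concrete with a simple model (all constants are assumed, purely illustrative values):

    EXCHANGE_S = 60.0    # robot exchange plus load time per cartridge (assumed)
    SEEK_S_PER_GB = 1.0  # positioning time grows with tape length (assumed)

    def mean_access_s(cartridge_gb):
        # One cartridge exchange plus, on average, a seek over half the tape.
        return EXCHANGE_S + SEEK_S_PER_GB * cartridge_gb / 2

    for gb in (5, 20, 35):
        print(f"{gb:2d} GB cartridge -> ~{mean_access_s(gb):.0f} s to first byte")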
A very good overview of existing storage media can be found on the Internet (http://alumni.caltech.edu/). In this chapter we will discuss only those media that can be used for digital archive solutions.
Tapes offer the best price/capacity ratio. Two different types can be distinguished: cartridges and cassettes. The difference lies in the physical construction: cartridges like DLT have only one reel, while cassettes (e.g., 8mm, 4mm, 19mm, VHS) have two reels and may not need to be rewound to dismount.
Another distinction is made by the technology used to write on the tape. Longitudinal recording works with heads that write bit streams parallel to the edge of the tape. Serpentine recording is longitudinal recording that writes the full length of the tape, then turns around and writes the length of the tape in the opposite direction with the heads in a slightly different position; this process may repeat many times. The third way to write is helical scan: like a VCR, a rotating head mounted at an angle writes "swipes" that are not parallel to the edge of the tape, and the tape is moved only slightly between swipes. Two longitudinal tracks may also be used for fast positioning purposes.
Drives like the 3480/3490/3490E/3590 traditionally come from the mainframe vendors - IBM, Fujitsu, Storage Tek, etc. The data transfer rate is normally poor, as is the storage capacity. Another type of tape drive from IBM, based on the Magstar technology and known as Coyote or 3570, offers 2.2 MB/sec and 5GB in a two-reel cassette. Unlike the 3590, the tape never leaves the cassette.
DLT is a widely used tape drive for backup systems as well as for archive solutions. Different types of the DLT family offer different possibilities; for archive solutions two types are used, the DLT 4000 and the DLT 7000. The DLT is designed for heavy-duty-cycle tape storage environments.
The DLT 7000 offers a high data transfer rate (5 MB/sec) as well as high capacity (35 GB). Multi-level error detection makes it very secure, with a tape lifetime rated at 1,000,000 passes, a head life of 30,000 hours, and a 200,000 hour MTBF.
Other tape drive technologies originated from broadcast and/or data recorder applications, where the data/signal was analogue in nature. They have been modified for digital use, with error correcting capabilities added (D1 and D2, 19mm). They support very high data transfer rates up to the 45 MByte/sec range, with storage capacities in the 25-175 GB range in physically different sized cartridges with different tape lengths, all fitting into the same tape drive unit.
Another very useful tape for digital archive solutions is the DTF cartridge and recorder from Sony. The medium was primarily directed at the professional digital video market, but it is also available for the computer data storage market. It is based on a 19mm-wide tape with helical scan recording and special horizontal tracks for control, annotations and other information, and follows the ANSI ID-1 standard. The tape system includes a directory table which is updated every time the tape is unloaded. The capacity is very high (12100GB/volume uncompressed) with a very high data transfer rate (16MB/s) and a reasonable search speed. The price range is higher than for DLT, but this technology can be incorporated in huge tape libraries to accommodate between 6.4TB and 2.6PB (where PB stands for PetaByte = 10^15 bytes).
Robotics systems are an important factor in a digital archive solution; you will find them in the near-line archive hierarchy. Robotics systems make a huge amount of multimedia information accessible in an automated way. The storage media are housed in a big cabinet and one or more robotic arms bring them to the recorders. The speed of such automated access depends heavily on the speed of the robotics, the distance between storage slots, and the number of recorders/players available in the robotics unit.
The usability of robotics in a digital archive therefore depends on the configurability of storage media, the floor space needed per terabyte, and the speed of the robotics. As with any mechanical unit, maintenance is another important aspect; it matters greatly whether the system needs to be shut down during the maintenance period or whether at least one part remains available.
Robotics systems are incorporated into the whole archive system via the HSM systems, which not only handle the file migration strategy but also provide the drivers for the robotics and players/recorders.
Many manufacturers of tapes like D-1, Betacam and broadcast autochangers also provide solutions for storage use. Most of these manufacturers provide robotics which can handle only one type of storage medium.
This overview shows only some of the solutions which can be used in an archive offering a reasonable amount of storage capacity.
A very good overview of the existing robotics can be found on the Internet (http://alumni.caltech.edu/).
Huge autochangers, referred to as silos, are round and several (~5) meters in diameter (Storage Tek). Using 3480-style tapes with capacities going up to 20 GB/cartridge, they reach 120 TB/silo.
For digital archive solutions, the mixed-media robotics from GRAU (E-MASS) offer many advantages. These robotics are a high-end solution with the possibility of building very large capacity mixed-media autochangers, known as the ABBA series, working with many different servers. They support 3480, D-2, MO, VHS, DLT and 8mm all in one robot. This enables the archivist to configure the appropriate storage medium for the right purpose. Different library sizes can be built, from small solutions up to very large systems.
For the data broadcast tapes, for example Ampex with the DD-2 tape drive, specific solutions with a huge amount of storage capacity also exist. Sony sells three autochangers for their ID-1 line of tape drives, based on their broadcast line of autochangers. These are known as the DMS series, models 24, 300M, and 700M, and offer storage capacities of up to 30 terabytes.
Nowadays petabyte storage solutions are already available. Odetics makes an expandable library known as the DataLibrary, with a maximum capacity of ten petabytes (ten million gigabytes); a robot handler runs on a track down an aisle lined with cartridges, with tape drives at one end of the aisle. Sony also has the PetaSite robotic solution with up to 3 petabytes, supporting both ID-1 and DTF in a single system.
Jukeboxes for optical disks are another possibility to provide smaller storage units for decentralised archives. These jukeboxes handle MODs and WORM disks and are used mainly for browsing archives. Jukebox solutions are not very fast in transfer rate, but they do not need to wind and rewind a tape; with true direct access to the right position, this does not really limit their usage for audio material.
CD-ROM jukeboxes are manufactured by a large number of vendors.
A very important part of a digital archive solution is the network itself. The archive has to serve different user groups: internal and external users. For internal users an Intranet solution with higher bandwidth (100Mbit switched Ethernet or higher) is often available. In these networks the limitations are minor and multimedia information of good quality can be transferred.
A much bigger challenge is network access via WAN. Although the network providers claim to have installed a lot of glass fibre cable with very high bandwidth, two problems still exist in Europe: the prices for high quality network access (as well as for lower bandwidths) are too high for heavy use by private users, and the high-bandwidth last mile to the customer is almost entirely missing. Only cable TV providers have started to upgrade their access lines to the customer with high bandwidth solutions.
To overcome this situation, some content providers (Reuters) have started to build up their own "network" using transmission via digital satellite. This works quite well for closed user groups (each user has to have their own satellite antenna) and near-on-demand solutions. For truly interactive solutions this concept is not very comfortable: it is mainly a one-way connection, and the way back from the user to the provider has to go another way, such as a telephone line.
For closed user groups it is also possible to arrange special tariff agreements with the network providers. The PTTs in some European countries, for example, run an ATM network trial where high bandwidth access is available. The limitation of this network at the moment is that there is no real switched solution as is usual with the normal telephone line; the lines have to be preconfigured for a specific time (and paid for as well, even when not used). This also gives no real chance to use it in an "on demand" interactive style.
For open user groups with unknown partners, none of these high bandwidth solutions are available. The Internet via the normal telephone lines will remain the most used network access; usage in private households is still poor because of the high cost of even the normal telephone lines. But with growing usage in the business environment, usage in households will grow as well. However, network providers have to rethink their tariff policy to open the market for real remote access to digital libraries.
It is essential for the success of digital archives that not only the preservation aspects discussed in the former chapters are covered but also open usage. Technically speaking, the network is no longer a bottleneck: the technology to provide fast access is available, which is a necessary condition for usage.
Storage solutions where the media material itself cannot be accessed via the network are called off-line storage systems. However, the metadata for that material is held in the digital archive, and it is absolutely necessary that the meta database is able to keep track of the off-line storage. To integrate the off-line storage more closely with the digital archive, the browsing information is kept in the digital archive while the original material remains in the off-line storage on the existing carrier. This is an advantage for retrieval but restricts the usage of the high quality material.
However, for cost reasons such solutions are widely used. Another reason for an off-line storage system is to have a cheap backup which is kept at another location as a disaster recovery solution.
A big advantage of digital archives is that the material can be error-corrected with mechanisms that are standard in computer technology. As stated before, error correction features exist on different levels. Some tape drives support this already, and at the least an HSM system can correct the data during periodic check readings. Some HSMs also offer the possibility of doing the correction with a water-level approach: when error correction has to be applied too often, the HSM itself starts to copy the material to a new tape. As we are on the digital level, the new copy means producing a new original without any loss of quality. There is no longer a generation loss.
Parallel Database Applications
There are two primary horizontal-market applications that benefit from parallel database processing: on-line transaction processing (OLTP) and data warehouse/decision support systems (DW/DSS). The former is characterised by databases that are constantly being updated by multiple concurrently executing transactions; the latter by extremely large databases that are extensively queried but seldom updated.
These applications are often at the core of enterprise-wide, mission critical services in industries as diverse as financial services, manufacturing, and telecommunications. Because of its ability to provide high performance and high availability, parallel database technology is a natural choice for such applications.
Shared Memory SMP
The Shared Memory Symmetric Multiprocessing (SMP) architecture consists of a single database image running on a server with one operating system and multiple processors. Performance can be increased by adding additional processors, memory, and disk storage. However, contention for shared resources (e.g., the system bus, disk controllers, and operating system data) means that there is a diminishing return on investment as components are added. Furthermore, a single system has potential points of failure that can bring down the database server.
Shared Disk
The Shared Disk architecture runs a single database image on multiple operating systems. In such a system all nodes share the same disks. Performance can be increased through the addition of processors, memory, disk storage, or nodes (systems). The Shared Disk architecture does not suffer from the scalability limitations of the Shared Memory SMP architecture; furthermore, a Shared Disk system can be designed so that each system functions as a backup for the other in the event of a failure. Finally, because the entire database is accessible from a central location, administration is as easy as it is in the Shared Memory SMP architecture. Shared disk architectures are particularly well-suited for OLTP applications with large numbers of users that would overwhelm a single SMP system.
Shared Nothing
As in a Shared Disk system, the Shared Nothing architecture uses a single database image across multiple operating systems. In the Shared Nothing architecture, however, systems do not share disk storage. The database is partitioned, with each system "owning" a partition. All access to a given partition of the database is made through the system that owns it. For applications with well-partitioned workloads and access patterns, the Shared Nothing architecture offers excellent scalability. Thus, it is particularly well-suited for DW/DSS applications that involve large, mostly-static databases, responding to very complex queries that require a lot of processing power.
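The ownership principle can be sketched as follows (hash partitioning is one common scheme; the partitioning function and node count here are illustrative assumptions):

    from zlib import crc32

    NODES = 4  # assumed number of database nodes

    def owner(key):
        # Every key maps to exactly one node, which "owns" its partition;
        # all access to that key is routed through the owning node.
        return crc32(key.encode()) % NODES

    for key in ("cust:17", "cust:42", "order:9"):
        print(key, "-> node", owner(key))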
Parallel database environments are intended to provide two key benefits: high availability and high performance. In a database cluster, multiple systems, or nodes, run against a single image of the database. If one of those nodes should fail, the remaining system(s) can pick up the workload of the failed node, allowing processing to continue. Since an instance of the database is logically present in each node, the database can remain available as long as one node remains active. Performance is scaleable across nodes, offering a high-end growth path: Additional processors, memory, and storage can be added as they are required. And manageability is improved because the cluster can be managed as a single database.
High availability is often confused with fault tolerance. Fault tolerant computers are specially designed to provide uninterrupted service even after catastrophic system or environmental failures that would completely shut down other configurations. Applications requiring fault tolerance, such as telephone switching, traffic control, and life support systems, cannot sustain interruptions in service, even if recovery is fully automated and occurs within minutes, as it does in high availability systems.
High availability, on the other hand, implies that the uninterruptability of fault tolerant systems is not needed, but rather, a much higher degree of service is required than is normally expected from a single system, or even systems with redundant data services. In high availability environments, full redundancy is provided, and recovery from a failure takes only seconds or minutes. For many users and applications, such capabilities are more than adequate, particularly if they are available at a much lower cost than fault tolerant systems.
While fault tolerance provides a very high level of availability, such systems are expensive and complex. Because of their complexity, fault tolerant systems also make trade-offs in other areas, such as system scalability and adherence to industry standards, making them unsuitable for use in general-purpose applications. High availability systems, with their fast recovery from failure, can provide nearly the same levels of total availability as fault tolerant systems, but at a much lower cost and without compromising other features important to mainstream computing environments.
Digital archives host the treasury of the content owner. An important point is therefore to watch the intellectual property rights. The advantage of digitised information, that copies are as good as the originals, turns into a disadvantage with regard to copyright: once one has access to the digital data, copying is quite easy, and so are changes. A digital archive solution needs different strategies to protect the content from unauthorised copies as well as from unauthorised changes to the content.
It is important to protect the multimedia information against unauthorised copies when putting the information on a public network; at the least, a means of identifying the media ownership is needed. A very common way to protect, for example, digital audio material is to add a digital signal outside the listening range. But once the frequency range is discovered, it is easy to edit the signal and delete the "watermark". More advanced technologies as well as legal regulation are needed. A lot of research work as well as standardisation discussions are under way to find possibilities to protect the copyright of digital material. Some interesting investigations regarding this feature can already be found in the literature.
Many approaches are available for protecting digital data, such as encryption, authentication and time stamping. Digital watermarks are invisible (or inaudible) structures or noise laid over the multimedia information. Prof. Delp, for example, is working on a project with the Constant Watermark Two-Dimensional (CW2D) and Variable Watermark Two-Dimensional (VW2D) schemes; in this project the robustness of the watermarks will be checked, and further development for watermarking video is intended. Besides singular projects with proprietary watermarking, some EC projects for standardisation are in progress. For digital libraries the use of standard procedures for watermarking is essential, and as soon as the standards are defined they should be incorporated.
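The basic idea of an invisible watermark can be shown with a minimal least-significant-bit sketch (a generic illustration only, not the CW2D/VW2D schemes mentioned above):

    def embed(pixels, bits):
        # Overwrite the least significant bit of each pixel with a mark bit;
        # the change is at most one grey level and therefore invisible.
        marked = [(p & ~1) | b for p, b in zip(pixels, bits)]
        return marked + pixels[len(bits):]

    def extract(pixels, length):
        # Read the mark back out of the least significant bits.
        return [p & 1 for p in pixels[:length]]

    image = [200, 113, 47, 88, 254, 3]  # toy grayscale pixel values
    mark = [1, 0, 1, 1]
    assert extract(embed(image, mark), len(mark)) == mark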
R. B. Wolfgang and E. J. Delp, "A Watermarking Technique for Digital Imagery: Further Studies," Proceedings of the International Conference on Imaging Science, Systems, and Technology, June 30 - July 3, 1997, Las Vegas, p. 287.
R. B. Wolfgang and E. J. Delp, "A Watermark for Digital Images," Proceedings of the IEEE International Conference on Image Processing, September 16-19, 1996, Lausanne, Switzerland, pp. 219-222.
Content Protection by Generation
To prevent unauthorised changes of the content, the digital archive solution has differentiated user access rights. But even that feature is not enough to prevent unwanted or unauthorised changes. The digital archive needs a mechanism whereby restored material automatically gets a new ID or generation number. This ensures that no unnoticed changes are possible.
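Such a generation mechanism can be sketched as follows (names and structure are assumed for illustration):

    import itertools

    _next_id = itertools.count(1)

    class Asset:
        # Every store or restore produces a new ID and generation number, so
        # no change can enter the archive unnoticed and lineage stays traceable.
        def __init__(self, content, parent=None):
            self.id = next(_next_id)
            self.generation = 0 if parent is None else parent.generation + 1
            self.parent_id = None if parent is None else parent.id
            self.content = content

    original = Asset(b"...video essence...")
    restored = Asset(original.content, parent=original)
    assert restored.generation == original.generation + 1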
At the moment no really secure way of billing for access via the Internet is known, but a lot of work is in progress; Microsoft is working on encryption procedures, as are smart card providers.
For Austria the universities have defined a common catalogue system with online access. More information about BIBOS can be found under
Z39.50 is an applications-layer protocol within the OSI reference model developed by the International Standards Organisation (ISO). Its purpose is to allow one computer operating in client mode to perform information retrieval queries against another computer acting as an information server. More information about Z39.50 can be found under
When establishing a new multimedia archive, the database of that archive should support this standard as well as connections to other existing databases like BIBOS (the Austrian university catalogue system).
For advanced multimedia systems, not only a bibliographic description will be stored in the database but also content-related information. These databases are implemented on top of a standard DBMS but offer several features that help content owners achieve accurate and reliable descriptions. The CARAT system by Siemens realises such a multimedia database, where the domain can be specified by the administrator by defining a specific word list with a predefined set of values. This allows the information owner to control the vocabulary being used for these descriptors, eliminating problems related to the misspelling of descriptors. The database represents domain specific semantics and utilises these to structure the logging information into units having specific semantic meaning. It represents the structure of audio and video objects in a hierarchical structure and allows the domain manager to define the various levels in the structure and the semantics associated with the levels for this specific domain.
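The controlled vocabulary principle can be sketched as follows (a hypothetical illustration, not the actual CARAT interface; the word list values are assumed):

    WORDLIST = {  # vocabulary defined by the domain administrator
        "genre": {"news", "sport", "documentary"},
        "medium": {"audio", "video", "still image"},
    }

    def validate(descriptor, value):
        # Only values from the predefined word list are accepted, so
        # misspelled descriptors can never enter the database.
        allowed = WORDLIST.get(descriptor)
        if allowed is None:
            raise ValueError(f"unknown descriptor: {descriptor}")
        if value not in allowed:
            raise ValueError(f"{value!r} is not allowed for {descriptor!r}")

    validate("genre", "sport")    # accepted
    # validate("genre", "sprot")  # would raise ValueError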
Besides the common way of defining the query with keywords in a predefined user interface, more advanced methods like browsing or navigating are available. The purpose of these access methods is to offer an intuitive way to explore the stored material quickly and comprehensively.
Content-based browsing is an important innovation in using multimedia content. Since the meta database can store more than keywords to describe the material, content-based browsing has a much higher impact on finding relevant information. The browsing information can be presented in different ways: video content can be presented with key images that represent a specific scene. CARAT by Siemens uses a patented way to give a very fast overview of a whole film using a cross-section representation: the film is "cut" vertically and horizontally and represented as a coloured band. This technique gives a good overview of scene changes, fading and zooming, and is quite useful during logging to build up the right structure of the video. Thumbnails are widely used to represent images in a fast overview. If the content-based meta database is connected with the media server, then the high quality material can be displayed immediately, just by point and click. For audio and video it is also common to compress the data very strongly and to store it on a browsing server to provide good support during the search. The methods of creating useful browsing information, especially for high quality music, are not agreed on by all audio engineers; in some solutions audio engineers want to store just the first and last few seconds of the music piece instead of highly compressed data. To find an agreed solution for these specialists, some more research work is needed.
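The general idea of such a cross-section overview can be sketched as follows (a simplified illustration using numpy, not the patented CARAT algorithm):

    import numpy as np

    def cross_section(frames):
        # Take the centre column of every frame and place the columns side by
        # side; cuts, fades and zooms appear as visible patterns in the band.
        w = frames.shape[2]
        return np.stack([f[:, w // 2] for f in frames], axis=1)

    video = np.random.randint(0, 256, size=(120, 72, 96, 3), dtype=np.uint8)
    band = cross_section(video)  # toy 120-frame clip of 72x96 RGB frames
    print(band.shape)            # (72, 120, 3): height x frames x RGB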
Advanced retrieval systems are an interesting subject of research work, and innovative new products on top of normal query systems already exist.
Different methods for retrieval with the relevance-feedback method can be found in a work by Alexander Kaiser, "Computer-unterstütztes Indexieren in Intelligenten Information Retrieval Systemen. Ein Relevanz-Feedback orientierter Ansatz zur Informationserschließung in unformatierten Datenbanken". Using a multilingual thesaurus with semantic tolerance, like ConSearch by Readware, is another interesting possibility.
All these retrieval tools support the user in defining the query in a proper way. The representation of the query results is still a challenge. An interesting way to visualise a large number of results with relevant content-related information is KOAN, a context analyser by Siemens. The data, descriptors and relations are represented in a three-dimensional space. The context analyser exploits the human eye's tendency to interpret nearby objects in space as one unit; context-related documents will therefore be perceived as connected. By moving through this three-dimensional space with the mouse, the user navigates through the retrieval results, and the whole result set can be explored with this navigation tool in a very easy and meaningful way.
T. Fuehring, K. Jacoby, R. Michelis, J. Panyr:
Kontextgestaltgebung: Eine Metapher zur Visualisierung von und
Interaktion mit komplexen Wissensbestaenden
Nachtrag zu: W. Rauch, F. Strohmeier, H. Hiller, C. Schloegl (Hrsg.):
Mehrwert von Information - Professionalisierung der Informationsarbeit
Proceedings des 4. Internationalen Symposiums fuer Informationswissenschaft (ISI '94)
Universitaetsverlag Konstanz, Konstanz 1994
Applied Research Papers
Kontextuelle Visualisierung von Information
In: B. Markscheffel, H.-J. Manecke (Hrsg.):
Die Informationsvermittlungsstelle im Wandel
Proceedings des 19. Internationalen Kolloquiums ueber Information und Dokumentation
Deutsche Gesellschaft fuer Dokumentation, Frankfurt/Main 1996
T. Fuehring, J. Panyr, U. Preiser:
3D-Visualisierung von Prozessinformation: Ein Ansatz zur Unterstuetzung der
Stoerungsaufklaerung in Kraftwerken
In: J. Krause, M. Herfurth, J. Marx (Hrsg.): Herausforderung an die Informationswirtschaft -
Informationsverdichtung, Informationsvisualisierung und Datenvisualisierung Proceedings des 5.
Internationalen Symposiums fuer Informationswissenschaft (ISI'96) Universitaetsverlag Konstanz,
T. Fuehring, T. Lauxmann:
3D-Visualisierung als Methode zur graphischen Analyse in der Szenariotechnik
In: H. Kremers, W. Pillmann (Hrsg.): Raum und Zeit in Umweltinformationssystemen
Proceedings des 9. Internationalen Symposiums Informatik fuer den Umweltschutz
Metropolisverlag, Marburg 1995
J. Panyr (1):
Automatische Klassifikation und Information Retrieval Systeme -
Niemeyer, Tuebingen 1986
Relevanzproblematik in Information-Retrieval-Systemen
In: Nachrichten fuer Dokumentation 37/1986, Nr.1
VCH Verlagsgesellschaft, Weinheim
Probabilistische Modelle in Information-Retrieval-Systemen
In: Nachrichten fuer Dokumentation 37/1986, Nr. 2
VCH Verlagsgesellschaft, Weinheim
Die Theorie der Fuzzy-Mengen und Information-Retrieval-Systeme
In: Nachrichten fuer Dokumentation 37/1986, Nr. 3
VCH Verlagsgesellschaft, Weinheim
Vektorraum-Modell und Clusteranalyse in Information-Retrieval-Systemen
In: Nachrichten fuer Dokumentation 38/1987, S. 13-20
VCH Verlagsgesellschaft, Weinheim
Interaktive Retrievalstrategien: Relevanzfeedback-Ansaetze
In: Nachrichten fuer Dokumentation 38/1987, S. 145-152
VCH Verlagsgesellschaft, Weinheim
A. Mueller, U. Thiel:
Query Expansion in an Abductive Information Retrieval System
Arbeitspapier der GMD IPSI, Darmstadt
M. Hemmje, C. Kunkel, A. Willet:
LyberWorld - A Visualization User Interface supporting Fulltext Retrieval
In: Proceedings der SIGIR 1994
J.P. Callan, W.B. Croft, S.M. Harding:
The INQUERY Retrieval System
In: Proc. 3rd Int. Conf. on Database and Expert Systems Applications
G. Salton:
Automatic Information Organization and Retrieval
McGraw Hill Book Company, New York 1968
N. J. Belkin:
Intelligent Information Retrieval: Whose Intelligence?
In: J. Krause, M. Herfurth, J. Marx (Hrsg.): Herausforderung an die Informationswirtschaft -
Informationsverdichtung, Informationsvisualisierung und Datenvisualisierung Proceedings des 5.
Internationalen Symposiums fuer Informationswissenschaft (ISI'96) Universitaetsverlag Konstanz,
Selection of Standards (Thesauri)
DIN 1463, Teil 1
DIN 31623, Teil 1
J. Viegener, A. Maurer:
Ein Ansatz zur Dynamisierung von Thesauri in Informationssystemen
In: Nachrichten fuer Dokumentation 44/1993, S. 285-292
Objektzentrierte Wissensrepraesentation und Information Retrieval Methoden
In: H.H. Zimmermann, H.-D. Luckhardt, A. Schulz (Hrsg.):
Mensch und Maschine - Infomationelle Schnittstellen der Kommunikation
Proceedings des 3. Internationalen Symposiums fuer Informationswissenschaft (ISI'92)
Universitaetsverlag Konstanz, Konstanz 1992
Das System der Branchen für den Alltagsgebrauch - Ein Thesaurus und sein Umfeld
(Vortragsgrundlage und Preprint zur ISKO-Konferenz 1993 in Weilburg)
M. Agosti, P.G.Marchetti:
User Navigation in the IRS Conceptual Structure through a Semantic Association Function
In: The Computer Journal 35/1992, Nr. 3