Video and Audio: Organization and Retrieval in the WWW

Zhigang Chen
See-Mong Tan
Dong Xie
Aamod Sane
Yongcheng Li
Roy H. Campbell

Abstract:

The current World Wide Web does not adequately support the retrieval and exchange of continuous media, such as video and audio data. Inappropriate network transmission protocols, the lack of flexible access methods, as well as the lack of architectures that encourage the efficient reuse of continuous media, have been major factors preventing the widespread use of video and audio. This paper addresses this problem and proposes a continuous media model which integrates meta-information in continuous media in order to support the flexible and efficient reuse of video and audio data in the World Wide Web. Several classes of meta-information are identified to facilitate the hierarchical access, browsing, search, and dynamic composition of continuous media. Implementations of tools for the extraction and construction of continuous media meta-information are described.

Keywords:

Video, Audio, World Wide Web, On-demand retrieval

1 Introduction

World Wide Web development has traditionally focused on methods for the fast retrieval of documents consisting of static text and images. A wide spectrum of information retrieval methods is supported on the current World Wide Web. For example:

These efficient and flexible access methods, together with the easy-to-use, point-and-click user interfaces of WWW browsers, such as Mosaic and Netscape, have been some of the main factors contributing to the success of the Web.

Continuous media, such as on-demand video and audio, play important roles in applications such as entertainment, distance learning and training, home shopping, and on-line publishing. The integration of continuous media into the World Wide Web enhances the expressive power of the Web. However, there is limited support for continuous media in today's World Wide Web. The wealth of well-defined, flexible access methods for static document retrieval and navigation outlined above does not apply to continuous media. This is evidenced in:

Several commercial products, such as VDOLive[6] and StreamWorks[7], allow users to retrieve and view video and audio in real time over the World Wide Web. However, these products suffer from the drawbacks outlined above. The products use either vanilla TCP or UDP for network transmission. Without resource reservation protocols in use within the Internet, TCP or UDP alone will not suffice for continuous media. Adaptable and media-specific protocols, such as the Video Datagram Protocol (VDP)[4], are required. Video and audio can also be viewed only in a primitive, linear VCR mode. The issues of content preparation and reuse are also not addressed.

The goal of the Vosaic project[4] is to seamlessly integrate the organization, retrieval and navigation of continuous media into the World Wide Web. The issues of efficient network transmission, server management, and client side access to continuous media have been investigated. The network transmission protocol (VDP, the Video Datagram Protocol) and other aspects of the system were described in [4]. This paper focuses on the organization of continuous media, including flexible access methods and the efficient reuse of video and audio data. In more detail, the objectives are:

This paper proposes a model for continuous media organization, storage and retrieval. The model treats continuous media as consisting of physical media and meta-information. Several classes of meta-information are identified in order to support flexible access and efficient reuse of continuous media. The meta-information encompasses the inherent properties of the media, hierarchical information, semantic description, as well as annotations that provide support for hierarchical access, browsing, searching, and dynamic composition of continuous media.

The rest of this paper is organized as follows: Section 2 describes the basic storage model for continuous media. Section 3 discusses how different access methods can be supported using meta-information in continuous media. Section 4 describes our implementation of these ideas within the Vosaic project. Section 5 surveys related work. Section 6 concludes this paper and describes on-going and future work.

2 A Model for Continuous Media Organization


Figure 1: A model for continuous media

Our model of continuous media integrates video and audio documents with their meta-information. That is, the meta-information is stored together with the encoded video and audio. Several classes of meta-information are included in the model. These are:

The structure of this model is shown in Figure 1. The following subsections describe the meta-information and its use in detail.

2.1 Inherent Properties

Inherent properties include the specification of the encoding scheme, encoding parameters and frame access points. For example, for a video clip encoded in the MPEG format[8], the encoding scheme is MPEG, and the encoding parameters include the frame rate, bit rate, encoding pattern, and picture size. The access points are the file offsets of important frames.

Inherent properties assist in the network transmission of continuous media. They also provide random access points into the document. For example, [4] describes an adaptive scheme for transmitting video and audio over packet-switched networks with no quality of service guarantees. The scheme adapts to the network and processor load by adjusting the transmission rate. The scheme relies on knowledge of the encoding parameters, such as the bit rate, frame rate and encoding pattern.

Information about frame access points enables frame-based addressing. Frame addressing allows access to video and audio by frame number. For example, a user can request a portion of a video document from frame number 1000 to frame number 2000. Frame addressing makes frames the basic access unit. Higher level meta-information, such as structural information and semantic descriptions, can be built by associating a description with a range of frames.
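
As a minimal sketch of frame addressing, the following Python fragment maps a requested frame range to a byte range in the stored file. The table layout (one start offset and size per frame), the function name `byte_range`, and the sample values are illustrative assumptions, not the Vosaic implementation.

```python
from typing import List, Tuple

# Hypothetical access-point table: one (start_offset, frame_size) pair per
# frame, in frame-number order. The values below are made up for illustration.
AccessPoints = List[Tuple[int, int]]


def byte_range(points: AccessPoints, first: int, last: int) -> Tuple[int, int]:
    """Return (start, end) byte offsets covering frames first..last inclusive.

    The end offset is exclusive, so end - start is the number of bytes to send.
    """
    if not 0 <= first <= last < len(points):
        raise ValueError("frame range out of bounds")
    start = points[first][0]
    last_offset, last_size = points[last]
    return start, last_offset + last_size


points = [(12, 2234), (2246, 510), (2756, 76), (2832, 76)]
print(byte_range(points, 1, 2))  # prints (2246, 2832)
```

In an MPEG stream, predicted frames depend on other frames, so a real server would probably also align the start of the range to a suitable access point; the sketch ignores that step.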

The encoded media stream itself often carries several of the inherent properties. These parameters are extracted ahead of time and stored separately, because on-the-fly extraction is expensive: it unnecessarily burdens the server and limits the number of requests that the server can serve concurrently.

2.2 Hierarchical Structure


Figure 2: Hierarchical organization and indexing of a movie

A video or audio document often possesses a hierarchical structure. For example, a movie often consists of a sequence of clips. Each clip is made of a sequence of shots (scenes), while each shot includes a group of frames. An example of hierarchical information in a movie is shown in Figure 2. The example movie ``Engineering College and CS Department at UIUC'' consists of the clips ``Engineering College Overview'' and ``CS Department Overview''. Each of these clips is composed of a sequence of shots; for example, ``Engineering College Overview'' consists of ``Campus Overview'', ``Message from Dean'', and others.

The hierarchical structure describes the organizational structure of continuous media. It makes hierarchical access and non-linear views of continuous media possible.
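
One way to picture this structure is as a tree whose nodes each cover a range of frames. The sketch below is only an illustration: the `Node` class, the `outline` helper, and all frame numbers are assumptions, while the titles are taken from the example movie.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """A movie, clip, or shot covering frames first_frame..last_frame."""
    title: str
    first_frame: int
    last_frame: int
    children: List["Node"] = field(default_factory=list)


def outline(node: Node, depth: int = 0) -> Iterator[Tuple[int, str, int, int]]:
    """Walk the tree depth-first, yielding (depth, title, first, last)."""
    yield depth, node.title, node.first_frame, node.last_frame
    for child in node.children:
        yield from outline(child, depth + 1)


# Frame numbers are invented for illustration only.
movie = Node("Engineering College and CS Department at UIUC", 0, 5999, [
    Node("Engineering College Overview", 0, 2999, [
        Node("Campus Overview", 0, 1499),
        Node("Message from Dean", 1500, 2999),
    ]),
    Node("CS Department Overview", 3000, 5999),
])

for depth, title, first, last in outline(movie):
    print("  " * depth + f"{title} [{first}-{last}]")
```

Because every node carries its own frame range, selecting a node is enough to request the corresponding portion of the movie by frame number.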

2.3 Semantic Descriptions and Annotations


Figure 3: Keyword descriptions of a movie

Semantic Descriptions

Semantic descriptions describe part or all of a video or audio document. A range of frames can be associated with a description. As shown in Figure 3, the shots in the example movie are associated (indexed) with keywords. Semantic descriptions facilitate search. Searching through large video and audio clips is hard without semantic description support.

Annotations

Annotations describe how a certain object within a continuous media stream is related to some other object. Hyperlinks can be embedded to indicate this relationship. For example, a hyperlink can be made for an interesting object in a movie which leads to related information. Annotation information allows the browsing of continuous media and can integrate video and audio with static data types like text and images.

The model allows multiple annotations and semantic descriptions. Different users can describe and annotate in different ways. This is essential in supporting multiple views on the same physical media. For example, a user may describe the campus overview shot in the example movie as ``UIUC campus'', while another user may associate it with ``Georgian style architecture in the United States Midwest''. The first user may have a link from his presentation to introduce the UIUC campus, while the other user may use the relevant frames of the same video segment to describe Georgian-style architecture.

Supporting multiple views considerably simplifies content preparation. This is because only one copy of the physical media is needed. Users can use part or the whole copy for different purposes.

3 Flexible Access and Efficient Reuse

The meta-information is essential in supporting flexible access and efficient reuse. This section describes how flexible access and efficient reuse are achieved through the use of continuous media meta-information.

3.1 Hierarchical Access

The hierarchical information can be displayed along with the video to give the user a view of the overall structure of the video. It allows the user to access any desired clip and any desired shot. Figure 4 shows what we envision as an implementation of the video player in Vosaic. The movie is shown along with its hierarchical structure. Each node is associated with a description. The user can click on a node of the structure, and that portion of the movie will be shown in the movie window.


Figure 4: Display of a movie along with its hierarchical architecture.

Hierarchical access enables a non-linear view of video and audio, and greatly facilitates the browsing of video and audio materials. Video and audio documents have traditionally been organized linearly. Even though traditional access methods, such as VCR-type operations or the slide bar operation, allow arbitrary positioning inside video and audio streams, finding the interesting parts within a video presentation is difficult without strong contextual knowledge, since video and audio express meaning through the temporal dimension. In other words, a user cannot easily understand the meaning of one frame without seeing related frames and shots. Displaying the hierarchical structure and descriptions gives users a global picture of what the movie and each of its parts is about.

3.2 Search


Figure 5: Key word search result

Search can be supported by searching through the semantic descriptions. For example, the keyword descriptions in Figure 3 can be queried. A search for the keyword "tour" returns all the tours in the movie, e.g., One Lab Tour, DCL Tour, and Instructional Lab Tour. One implementation of search is shown in Figure 5.
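
A keyword search of this kind can be sketched as an inverted index from keywords to described frame ranges. The `Description` record, the helper names, the keyword sets, and the frame numbers below are assumptions made for illustration; only the shot titles come from the example movie.

```python
from typing import Dict, FrozenSet, Iterable, List, NamedTuple


class Description(NamedTuple):
    """A semantic description attached to frames first_frame..last_frame."""
    title: str
    first_frame: int
    last_frame: int
    keywords: FrozenSet[str]


def build_index(descriptions: Iterable[Description]) -> Dict[str, List[Description]]:
    """Map each lower-cased keyword to the descriptions that carry it."""
    index: Dict[str, List[Description]] = {}
    for d in descriptions:
        for kw in d.keywords:
            index.setdefault(kw.lower(), []).append(d)
    return index


def search(index: Dict[str, List[Description]], keyword: str) -> List[Description]:
    """Return all descriptions indexed under the keyword (case-insensitive)."""
    return index.get(keyword.lower(), [])


# Keywords and frame ranges are invented for illustration only.
shots = [
    Description("Campus Overview", 0, 1499, frozenset({"campus", "overview"})),
    Description("One Lab Tour", 3000, 3499, frozenset({"lab", "tour"})),
    Description("DCL Tour", 3500, 3999, frozenset({"dcl", "tour"})),
    Description("Instructional Lab Tour", 4000, 4499,
                frozenset({"instructional", "lab", "tour"})),
]
index = build_index(shots)
print([d.title for d in search(index, "tour")])
# prints ['One Lab Tour', 'DCL Tour', 'Instructional Lab Tour']
```

Each hit carries a frame range, so a search result can be turned directly into frame-addressed requests for the matching shots.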

3.3 Browsing


Figure 6: Hyperlinks embedded within video stream

Browsing is supported through hyperlinks embedded within video streams and through hierarchical access. Hyperlinks within video streams are an extension of the general hyperlink principle. They make objects within video streams anchors for other documents. As shown in Figure 6, the rectangle outlining a black hole object indicates that it is an anchor; upon clicking the outline, the document it is linked to is fetched and displayed (in this case an HTML document about black holes). Hyperlinks within video streams integrate video streams with traditional static text and images and facilitate inter-operation between them.

3.4 Composition

The continuous media model also allows dynamic composition. A video presentation can use parts of existing movies as components. For example, a presentation of Urbana-Champaign can be a video composed of several segments from other movies. As shown in Figure 7, the campus overview segment can be used in the composition. The specification of this composition is done through hyperlinks.


Figure 7: Dynamic composition of video streams
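
A composition of this kind can be thought of as an ordered list of references into existing movies rather than a copy of their frames. The sketch below assumes a hypothetical `Segment` record and made-up source names and frame numbers; it does not reproduce the hyperlink syntax used in Vosaic.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """A reference to frames first_frame..last_frame of an existing movie."""
    source: str        # hypothetical identifier of the source movie
    first_frame: int
    last_frame: int    # inclusive

    @property
    def length(self) -> int:
        return self.last_frame - self.first_frame + 1


def timeline(segments: List[Segment]) -> List[Tuple[int, Segment]]:
    """Pair each segment with its starting frame in the composed presentation."""
    placed, position = [], 0
    for seg in segments:
        placed.append((position, seg))
        position += seg.length
    return placed


# Source names and frame ranges are invented for illustration only.
presentation = [
    Segment("uiuc-overview.mpg", 0, 1499),   # e.g. a campus overview segment
    Segment("downtown.mpg", 200, 699),
]
print([start for start, _ in timeline(presentation)])  # prints [0, 1500]
```

Only the references are stored, so the same physical segment can appear in any number of presentations without being duplicated.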

4 Implementations

Project Vosaic bases its architecture on the continuous media model outlined above. Meta-information is stored on the server side together with the media clips. Inherent properties are used by the server in order to adapt the network transmission of continuous media to network conditions and client processor load. Semantic descriptions and annotations are used for searching video material and hyperlinking inside video streams. We designed and implemented tools for the extraction and construction of continuous media meta-information. A parser was developed to extract inherent properties from encoded MPEG video and audio streams. A link editor was implemented for the specification of hyperlinks within video streams. We are also designing tools for video segmentation and a semantic description editor. Implementation of the movie player in Figure 4 is also being carried out. This section describes some of these tools, as well as the implementation of frame addressing and client side video hyperlinking.

4.1 Frame Addressing

Frame addressing uses the video frame and the audio sample as the basic data access units for video and audio, respectively. During the initial connection phase between the Vosaic server and client, the start and end frames for specific video and audio segments are specified. The default settings are the start and the end frame of the whole clip. The server transmits only the specified segment of video and audio to the client. For example, for a movie that is digitized as a whole and is stored on the server, the system allows a user to request frame number 2567 to frame number 4333. The server identifies and retrieves this segment, and transmits the appropriate frames to the client.

4.2 Parsing

We have developed a parser for extracting inherent properties from MPEG video and audio streams. The parsing is done off-line. The parse file contains:

in the clip file.

An example parse file is shown below:

#
#
# ------------------------------------------------------------------
# cs.mpg.par
#
# Parse file for MPEG stream file
# This file is generated by mparse, a parse tool for MPEG stream file.
# For more information, send mail to:
#
# zchen@cs.uiuc.edu
# Zhigang Chen, Department of Computer Science
# University of Illinois at Urbana-Champaign
#
# format:
# i1 h_size v_size frame rate bit rate frames total size
# i2 ave_size i_size p_size b_size ave_time i_time, p_time, b_time
# p1 pattern of first sequence
# p2 pattern of the rest of the  sequence
# hd header_start header_end
# frame_number frame_type start_offset frame_size frame_time
# ed end start
# ------------------------------------------------------------------
#
i1 160 112 15 262143 12216 8941060
i2 731 2152 510 76 12511 20911 10443 8826
p1 7 ipbbibb
p2 7 ipbbibb
hd 0 12
0 1 12 2234 20377
...
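
To illustrate how such a parse file might be consumed, the following Python sketch reads the tagged records and the per-frame records. The function name `parse_par` is hypothetical, and the reading of the `i1` fields (horizontal size, vertical size, frame rate, bit rate, frame count, total size) is an assumption based on the format comment in the file header; the `ed` record is stored as raw fields without interpretation.

```python
from typing import Dict, List, Tuple

FrameRecord = Tuple[int, int, int, int, int]  # number, type, offset, size, time


def parse_par(text: str) -> Tuple[Dict[str, List[str]], List[FrameRecord]]:
    """Split parse-file text into tagged records and per-frame records."""
    tagged: Dict[str, List[str]] = {}
    frames: List[FrameRecord] = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comment lines, and the elision marker.
        if not fields or fields[0].startswith("#") or fields[0] == "...":
            continue
        if fields[0].isdigit():
            # Frame record: frame_number frame_type start_offset frame_size frame_time
            number, ftype, offset, size, time = map(int, fields[:5])
            frames.append((number, ftype, offset, size, time))
        else:
            # Tagged record such as i1, i2, p1, p2, hd, ed: keep the raw fields.
            tagged[fields[0]] = fields[1:]
    return tagged, frames


sample = """\
# comment lines start with '#'
i1 160 112 15 262143 12216 8941060
i2 731 2152 510 76 12511 20911 10443 8826
p1 7 ipbbibb
p2 7 ipbbibb
hd 0 12
0 1 12 2234 20377
"""
tagged, frames = parse_par(sample)
h_size, v_size, frame_rate, bit_rate, n_frames, total_size = map(int, tagged["i1"])
print(h_size, v_size, frame_rate, frames[0])
# prints 160 112 15 (0, 1, 12, 2234, 20377)
```

Because the file is generated off-line, a server could load it once per clip and answer frame-range requests from the in-memory frame records.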

4.3 Hyperlinks and the Link Editor

A link editor was implemented that enables the user to embed hyperlinks into video streams. The specification of a hyperlink for an object within video streams includes several parameters:

The positions of the object outline are interpolated for the frames between the first and last frames specified. A simple scheme using linear interpolation is shown in Figure 8. The positions of the outline in the start frame (frame 1) and the end frame (frame 100) are specified by the user; for the frames in between, the position is interpolated, as shown for frame 50.


Figure 8: Interpolation of hyperlinks in the video streams

Linear interpolation is currently employed in our system. Our experience indicates that it works well for objects with linear movement. However, for better motion tracking, sophisticated interpolation methods, such as spline interpolation, may be desirable. This is currently being implemented.
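
The linear scheme can be sketched in a few lines of Python. The rectangle layout (x, y, width, height), the function name `interpolate`, and the coordinates are assumptions made for illustration; only the frame numbers 1, 50, and 100 follow the example in Figure 8.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # x, y, width, height (assumed layout)


def interpolate(start_frame: int, start: Rect,
                end_frame: int, end: Rect, frame: int) -> Rect:
    """Linearly interpolate the outline rectangle for a frame in the range."""
    if end_frame <= start_frame:
        raise ValueError("end_frame must be after start_frame")
    if not start_frame <= frame <= end_frame:
        raise ValueError("frame lies outside the annotated range")
    # t runs from 0.0 at the start frame to 1.0 at the end frame.
    t = (frame - start_frame) / (end_frame - start_frame)
    x, y, w, h = (a + t * (b - a) for a, b in zip(start, end))
    return (x, y, w, h)


# Made-up outline positions for the user-specified start and end frames.
first = (10.0, 20.0, 40.0, 30.0)
last = (109.0, 20.0, 40.0, 30.0)
print(interpolate(1, first, 100, last, 50))
```

Each coordinate moves at a constant rate between the two user-specified outlines, which is why the scheme suits objects with roughly linear motion.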

4.4 Dynamic Composition

We are also experimenting with dynamic composition of video. For example, Figure 5 illustrates the result of a search on a video database. The search result is a server-generated dynamic composition of the matched clips. The resulting presentation is a movie made up of the video clips in the search result.

In general, users may use the dynamic composition facilities to author continuous media presentations by reusing existing video segments. Organizing video through dynamic composition reduces the need to copy large video and audio documents.

4.5 Video Segmentation and Semantic Description Editing

Video segmentation and semantic description editing are currently performed manually. Video frames are grouped and descriptions are associated with the groups. The descriptions are stored and used for search and hierarchical structure presentation. Figure 5 is the result of one such experiment. We are studying the algorithms proposed in [18] and [19] for automatic video segmentation, and an authoring tool is being built to assist the user in generating and embedding descriptions.

5 Related Work

Meta-information and continuous media have been the subject of several studies. Video indexing and editing methods have been studied in [12], [16], [11], [3] and [10]. The Informedia project at CMU proposes the use of automatic video segmentation and audio transcript generation for building large video libraries. [18] and [19] propose algorithms for video segmentation. Hyperlinks in video streams have been proposed and implemented in the Hyper-G distributed information system[2][13], as well as in a World Wide Web context in Vosaic.

While previous work focuses on a particular aspect of meta-information, for example, support for search only or for hyperlinking only, our model categorizes and integrates continuous media meta-information in order to support continuous media network transmission, access methods, and authoring. The model can be generalized for static data. The generalized model encourages the integration of continuous media with static media, and of document retrieval with document authoring. Multiple views of the same physical media are possible in our system.

6 Conclusions

By integrating meta-information in the continuous media model, flexible access and efficient reuse of continuous media in the World Wide Web are achieved. Several classes of meta-information are included in our continuous media model. Inherent properties help network transmission of and provide random access to continuous media. Structural information provides hierarchical access and browsing. Semantic descriptions allow search in continuous media. Annotations enable hyperlinks within video streams, and therefore facilitate the browsing and organization of irregular information in continuous media and static media through hyperlinks. The support of multiple semantic descriptions and annotations makes multiple views of the same material possible. Dynamic composition of video and audio is made possible by frame addressing and hyperlinks.

We are building tools for the extraction and construction of meta-information in continuous media documents. An authoring tool is being implemented to assist in creating compositions. Synchronization issues and server pre-fetch and caching are also under investigation as part of the larger Vosaic effort.

Acknowledgments

We would like to thank Joseph Hardin of NCSA for his advice and encouragement on this work. Jeff Terstriep and David Curtis from NCSA provided us with an abundance of hardware resources and video material. We also would like to thank several people we got to know on the Web for their insight and assistance, notably Amy Ayers, Hans Kugler, and Nick Bicanic.

Software

Vosaic software is available from http://choices.cs.uiuc.edu/research/Vosaic/vosaic2.html.

References

1
B. Alberti, F. Anklesaria, P. Linder, M. McCahill, and D. Torrey. Gopher+: Proposed enhancements to the internet gopher protocol. gopher://boombox.micro.umn.edu:70/11/gopher/gopher_protocol/Gopher%2b, 1992.

2
K. Andrews, F. Kappe, and H. Maurer. The Hyper-G Network Information System. J. UCS, 1(4), April 1995.

3
S. Loeb. Delivering Interactive Multimedia Documents over Networks. IEEE Communications Magazine, May, 1992.

4
Z. Chen, S. Tan, R. Campbell, and Y. Li. Real time video and audio in the World Wide Web. In Proc. Fourth International World Wide Web Conference, 1995.

5
World Wide Web Consortium. Hypertext Markup Language. Available on the WWW via http://www.w3.org/hypertext/WWW/MarkUp.

6
VDOnet Corporation. VDOLive Internet Video Servers and Players. http://www.vdolive.com/, 1996.

7
Xing Technology Corporation. StreamWorks. http://www.xingtech.com/, 1996.

8
D. Le Gall. MPEG: A Video Compression Standard for Multimedia Applications. Communications of the ACM, 34(4):46--58, April 1991.

9
F. Davies, B. Kahle, H. Morris, J. Salem, T. Shen, R. Wang, J. Sui, and M. Grinbaum. WAIS Interface Protocol. ftp://quake.think.com/wais/doc/protspec.txt, April 1990.

10
C. Federighi, J. Boreczky, and L. Rowe. Indexes for User Access to Large Video Databases. In Proc. IS&T/SPIE Symp. on Elec. Imaging Sci. & Tech., San Jose, CA, February 1994.

11
C. Federighi and L. Rowe. A Distributed Hierarchical Storage Manager for a Video-on-Demand System. In Proc. IS&T/SPIE Symp. on Elec. Imaging Sci. & Tech., San Jose, CA, February 1994.

12
W. Mackay and G. Davenport. Virtual video editing in interactive multimedia applications. Communications of the ACM, pages 802--810, July 1989.

13
B. Marschall. Integration of digital video into distributed hypermedia systems. M.S. Thesis, Institute for Information Processing and Computer Supported New Media (IICM), Graz University of Technology, 1995.

14
Sun Microsystems. The Java Language Specification, 1995.

15
Sun Microsystems. The Java(tm) Language Environment: A White Paper, 1995.

16
R. Weiss, A. Duda, and D. Gifford. Composition and search with a video algebra. IEEE Multimedia, 2(1), Spring, 1995.

17
World Wide Web Consortium. Common Gateway Interface. Available on the WWW via http://www.w3.org/hypertext/WWW/Overview.html.

18
B. Yeo and B. Liu. A Unified Approach to Temporal Segmentation of Motion JPEG and MPEG Compressed Video. In Proc. Intl. Conf. on Multimedia Computing and Systems, Washington, D.C., May 1995.

19
H. Zhang, Y. Gong, S. Smoliar, and S.Y. Tan. Automatic Parsing of News Video. In Proc. Intl. Conf. on Multimedia Computing and Systems, 1994.