Web Page Dynamics and Their Impact on Document Collections

Wallace Koehler

School of Library and Information Studies, University of Oklahoma, Norman, Oklahoma, USA

The World Wide Web undergoes constant change in size and content from the individual Web page through the Web as a whole. These changes are in some ways qualitatively different and in some ways similar to the dynamics one encounters in the print world. This paper explores changes to Web pages and considers some of the implications of those changes for librarians and other information managers.

Introduction

Since its invention in 1991, the World Wide Web has grown and changed dramatically. Reports of the size of the WWW, measured either as sites or as pages, sometimes conflict, but all agree that it continues to increase, at times at geometric rates. This growth has significant consequences for those who seek to bring some kind of webographic control to this decade-old information resource.

The growth in Web size is not the only challenge to webographers. Not only do new Web pages and Web sites come into existence; extant Web pages and Web sites often change, move, and go away. This paper focuses on two questions:

1. How stable is the information content of the WWW? How often do Web pages change? What changes?

2. How stable are Web pages? What is the death rate?

There are both practical and theoretical implications for these questions. Most designers of Web-based collections are concerned with questions of Web site or Web page migration or disappearance. These phenomena have been approached in at least four ways: ignore them, wait for user complaints then correct them, scan and remove broken links, and scan and recheck broken links periodically (Koehler 2000a).

Practitioners may also design for stability.

There have been few studies of Web site stability. Kitchens and Mosley (2000) question the utility of printed Internet guides, since the Web references are far too ephemeral. Germain (2000) questions URLs as citations for scholarly literature for the same reason. McMillan (2000) argues that content analysis tools can be brought to the Web, but that there are problems unique to the method because of the ephemeral nature of the target. OCLC's Web Characterization Project is an annual capture of publicly available Web sites to analyze trends in the size and content of the Web (Office of Research, OCLC 2001).

The Problem

This paper revisits many of the questions posed in its 1999 precursor (Koehler 1999a). Web site behaviors are not reported here except to reemphasize their variable and ephemeral nature. The paper addresses the same set of URLs reported in the 1999 article. These were all on-line at the time of the original URL data harvest. No new URLs have been added since that date. These therefore represent a static and aging collection of URLs. This is done to ascertain the behavior of such a collection over time.

The addition of newly collected URLs could bias the collection's behavior, particularly if an aging set of Web pages establishes an equilibrium or stability over time. Finally, since publication of the 1999 study, I have described a new behavior: the phantom Web page (Koehler 2000b). Phantoms are comatose Web pages that appear to be live to screening software. They result from the proliferation of non-standard or expanded error messages, once rare but now commonly generated by server software.
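To make the phantom page idea concrete, the following is a minimal sketch of one way screening software might flag a candidate phantom. It is not the method used in this study. The assumptions are mine: that a phantom is served with a success status (HTTP 200) while its body carries an error message, and that the marker phrases in `ERROR_MARKERS` are typical of such messages. The function name `looks_like_phantom` is hypothetical.

```python
# Hypothetical marker phrases; real server error pages vary widely.
ERROR_MARKERS = (
    "not found",
    "no longer available",
    "page cannot be found",
)


def looks_like_phantom(status_code: int, body: str) -> bool:
    """Flag a response that reports success but reads like an error page.

    Assumption: a phantom answers with status 200, so a naive link
    checker treats it as live, while the body is an error message.
    """
    if status_code != 200:
        # A real error status is an ordinary dead link, not a phantom.
        return False
    text = body.lower()
    return any(marker in text for marker in ERROR_MARKERS)
```

A phrase match of this kind can misfire on a live page that merely discusses missing pages, so flagged URLs would still need the manual check described in the methodology.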

Methodology

Data were collected weekly between December 1996 and February 2001 to map Web page change over time. The data include page size (in kilobytes) and link changes for Web pages. A number of attributes were examined to assess the growth, change, and death of those Web pages. FlashSite 1.01, a software product of Incontext, was employed throughout the study for data capture.

Selection of URLs

A random selection of 360 URLs was made in the last two weeks of December 1996. Table 1 shows the distribution of URLs, by top-level domain, collected between December 10, 1996 and January 9, 1997. These URLs have since been monitored weekly through February 2001. For a variety of reasons, including a local power outage, a fire, and a server crash, we were unable to collect data in weeks 162, 165, 169, 178, 182, and 207. These are represented as gaps in the graphics.

Table 1. Sample Distribution (Web page sample)

TLD Type          Total   Percent

gTLD
  com                94     26.0%
  edu                69     19.1%
  gov                12      3.3%
  mil                11      3.0%
  net                32      8.9%
  org                 9      2.5%

IP Number             1      0.3%

ccTLD
  Africa              1      0.3%
  Asia                7      1.9%
  Europe             90     24.9%
  Middle East         1      0.3%
  North America      18      5.0%
  Pacific            11      3.0%
  South America       5      1.4%

TOTAL               361    100.0%

The selection process produced a set of 360 URLs. These URLs ranged from zero-level (server-level) domain addresses to the fourth directory level. That is, the URLs returned by the random search engine process ranged from those with no directory structure (http://aaa.bbb.ccc) to those at the sub-subfile level (http://aaa.bbb.ccc/www/xxx/yyy/zzz.html). The Web page URLs were retained as returned to test the proposition that the further down the directory structure a URL lay, the less stable it was likely to be. That is, not only would it be more likely to disappear sooner, it would also experience greater content and structural change. Retaining URLs at a variety of directory structure levels made it possible to test that assumption.
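The directory level of a URL, as used above, can be illustrated with a short sketch. It assumes that "level" means the number of non-empty path segments after the host name, which is consistent with the two example URLs given (level zero and level four). The function name `url_level` is hypothetical and is not part of the study's tooling.

```python
from urllib.parse import urlsplit


def url_level(url: str) -> int:
    """Count the non-empty path segments of a URL.

    A server-level address has level 0. Each directory or file name
    below the host adds one level.
    """
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])
```

Under this reading, `url_level("http://aaa.bbb.ccc")` is 0 and `url_level("http://aaa.bbb.ccc/www/xxx/yyy/zzz.html")` is 4.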

Measures of Change

Once the URLs were selected, each URL was entered into FlashSite 1.01. The FlashSite report is in three parts. First, it reports in kilobytes (kb) the size of the current document download. Second, it reports the number of new links from the target Web document. Third, it lists changed items linked from the target document. These three measures can be used to track Web document metamorphosis. The first measure (size in kb) captures changes in target document content, while the other two capture changes in the structure of that document.
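The three measures can be approximated by comparing two successive snapshots of a page. The sketch below is an illustration only and does not reproduce FlashSite's internals, which are not described here. It assumes that content change is detected as a change in byte size, that new links are hrefs present only in the current snapshot, and that deleted or changed links are hrefs present only in the previous one. The names `LinkCollector`, `extract_links`, and `compare_snapshots` are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href values of anchor tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def extract_links(html: str) -> set:
    """Return the set of link targets found in the document."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def compare_snapshots(previous_html: str, current_html: str) -> dict:
    """Compare two downloads of the same page.

    Content change is inferred from a difference in byte size, which
    says nothing about the magnitude or kind of change.
    """
    previous_bytes = len(previous_html.encode("utf-8"))
    current_bytes = len(current_html.encode("utf-8"))
    previous_links = extract_links(previous_html)
    current_links = extract_links(current_html)
    return {
        "size_kb": current_bytes / 1024,
        "content_changed": previous_bytes != current_bytes,
        "new_links": sorted(current_links - previous_links),
        "dropped_links": sorted(previous_links - current_links),
    }
```

A size comparison misses edits that leave the byte count unchanged, so a checksum would be a stricter test. Size is used here because it is the measure the study reports.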

FlashSite offers a non-exportable spreadsheet-like presentation of all URLs, including the status of the most recent download attempt. Those status messages include complete and network error. The network error message occurs whenever and for whatever reason FlashSite is unable to access, download, and assess content and structural changes. These reasons include slow response, no DNS entry (that is, the server is absent), file-not-found (the specific page is gone), and idiopathic causes. All URLs that returned a network error message were resubmitted through FlashSite up to twice more each week, if necessary. Those URLs that still did not download successfully were manually checked for status. Comatose URLs were retained in the FlashSite file and rechecked weekly at the same time as the others. This was done to determine the resurrection rate of the comatose sites. The term comatose is chosen rather than dead because there can be no absolute certainty that a URL will not at some time resurface.
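The weekly resubmission rule described above (one attempt plus up to two more) can be sketched as a small retry loop. This is an illustration under stated assumptions, not FlashSite's behavior: the caller supplies a `fetch` function that raises `OSError` on any network failure, and the status strings simply mirror the two messages named in the text. The name `check_url` is hypothetical.

```python
from typing import Callable


def check_url(url: str, fetch: Callable[[str], bytes], max_attempts: int = 3) -> str:
    """Try to download a URL, resubmitting after each network failure.

    Returns "complete" as soon as one attempt succeeds, or
    "network error" once all attempts have failed. A URL in the second
    state would then be checked by hand before being treated as comatose.
    """
    for _ in range(max_attempts):
        try:
            fetch(url)
            return "complete"
        except OSError:
            # Slow response, missing DNS entry, and similar failures.
            continue
    return "network error"
```

Passing the download function in as a parameter keeps the retry rule separate from any particular HTTP library.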

Discussion

This paper presents an overview of the Web page change phenomenon. I argue that Web pages become more stable over time. Web page stability must, however, be considered in context, and it is a relative term. Web documents by their very nature can never approach the degree of permanence manifested by traditional media. By traditional media I mean both print (ink on paper and microfilm) and electronic or optical storage systems (CD-ROM, for example).

Figure 1 is a plot of Web page demise, including phantom pages, from December 1996 through February 2001. It represents 213 weeks of data collection. Over that period, as Figure 1 suggests, Web page demise patterns have undergone a three-phase and perhaps four-phase change. The numbers 1 through 4 on Figure 1 indicate these phases. Phase one was the period of greatest loss. Well before the end of two years, more than half of the sample had disappeared. Phase two represents a correction and then a slowing in the rate of demise. Phase three suggests a leveling and a period when there was no decline in the number of extant Web pages. Phase four, for which there are now only tentative data, may indicate a return to Web page demise.

There may be no phase four, at least not yet, for the Week 213 data point has roughly the same value as the Week 165 point. It may be that the increase is but a correction for the short-term downturn and the level period will continue. Only further data collection and time will tell.

Figure 1

These data reinforce my earlier estimate (Koehler 1999a) for the half-life of Web pages of something less than two years. Approximately 65% of the original Web page sample were still in existence at week 104. At week 213, about 35% of the sample were still responsive. This 35% of the original sample represents just over 50% of the sample remaining at week 104. The half-life of this segment between weeks 104 and 213 was also about two years. It is far too early to posit that there is a repeating two-year half-life cycle to Web page collections, but the idea is tantalizing.
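The arithmetic behind the second-segment comparison can be sketched as follows. The retained share uses only the two survival figures quoted above (about 65% at week 104 and about 35% at week 213). The half-life function additionally assumes a constant proportional loss rate (exponential decay) between the two observations. That is my simplifying assumption, not a claim made in this paper, and the name `implied_half_life` is hypothetical.

```python
import math


def implied_half_life(weeks_elapsed: float, fraction_start: float, fraction_end: float) -> float:
    """Half-life in weeks implied by two survival fractions.

    Assumes exponential decay between the two observations:
        fraction_end = fraction_start * 0.5 ** (weeks_elapsed / half_life)
    """
    if not 0 < fraction_end < fraction_start <= 1:
        raise ValueError("fractions must satisfy 0 < end < start <= 1")
    return weeks_elapsed * math.log(2) / math.log(fraction_start / fraction_end)


# Share of the week-104 survivors that were still responsive at week 213.
retained_share = 0.35 / 0.65

# Half-life implied for the segment between weeks 104 and 213.
segment_half_life_weeks = implied_half_life(213 - 104, 0.65, 0.35)
```

Because the survival figures are approximate readings, the result should be treated as an order-of-magnitude check rather than a fitted estimate.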

Figures 2, 3, and 4 represent changes other than demise for Web pages. Figure 2 shows, for the Web pages still active at any given time, the percent that experienced content change. Content change is measured by a change in page size, in kilobytes. The measure captures any change and thus does not in any way imply the degree or magnitude of change. The curvilinear line is a trend line. The trend line suggests that the weekly rate of change was fairly steady for the first 150 weeks or so, with something on the order of 20% of Web pages changing from one week to the next. Thereafter, the rate of change declined.
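The weekly content-change measure can be written out as a short calculation. This sketch assumes that each week's harvest is held as a mapping from URL to page size in kilobytes, with `None` marking a page that did not respond, and that a page counts as active only if it responded in both weeks. The data layout and the name `percent_changed` are hypothetical and are not drawn from the study's files.

```python
from typing import Dict, Optional

Sizes = Dict[str, Optional[float]]


def percent_changed(previous: Sizes, current: Sizes) -> float:
    """Percent of active pages whose size differs from the prior week.

    A page is active if it returned a size in both weeks. Comatose
    pages are left out of the denominator.
    """
    active = [
        url
        for url, size in current.items()
        if size is not None and previous.get(url) is not None
    ]
    if not active:
        return 0.0
    changed = sum(1 for url in active if current[url] != previous[url])
    return 100.0 * changed / len(active)
```

Restricting the denominator to active pages matters here: as the sample shrinks, a percentage of the original sample would understate how often the surviving pages change.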

Figure 2

Figures 3 and 4 chart changes in the hypertext link structures from the Web page sample. Figure 3 graphs weekly changes in existing hypertext links from Web pages as a percent of the active sample. That is to say, it is concerned with either deleted links or changed links.

Figure 3

Figure 4 charts newly added links in the Web page sample, again as a percent of the active sample. Over the first 150 weeks, changes to existing links occurred in about 15-20% of all Web pages on a weekly basis, and new links were added to between 5 and 10% of the sample over the same period. In both cases, changes increased thereafter.

Figure 4

An important pattern emerges after an examination of Figures 1 through 4. At the transition point between phases 2 and 3 for Web page demise, where the number of Web pages extant in the sample levels off, content change in those existing Web pages begins to decline. At the same time, both measures of structural change begin to increase in parallel with one another. This is, I believe, too systematic to be coincidental.

These data point to a slow stabilization in Web pages and in Web page content over time. At the same time, changes to the hypertext link structure of these same pages are increasing. Changes to the hypertext structures of Web pages modify their meaning, but not to the same extent as page demise or changes to the content of the page. There are several possible explanations for these phenomena. Two seem plausible: (1) Web authors become satisfied with page content over time, and (2) Web authors ignore old content. The increase in link changes favors reason one over reason two. Changing link structures imply that Web documents are screened for broken links and, when such links are found, edited.

Implications and Conclusions

There may be a decline in the Web page death or comatoseness rate over time such that as a Web page collection ages, it tends to become more stable. In addition, as the collection ages, the frequency and type of change to the page also changes toward stability. It may therefore be that collections of young Web pages are inherently unstable, but as the collection ages, it is less likely that the Web author will make modifications to the same degree as s/he once did. This may be the result of satisfaction with the product or in the end an increasing disinterest.

There are also important implications for the webographer. It has been demonstrated that the ephemeral nature of Web documents has a detrimental effect on catalogs and collections that point to them. These findings imply that, for library catalogs and other collections of Web documents that either cannot be monitored (print) or are not monitored for broken links, it may be desirable to incorporate older rather than newer resources. They also suggest that collecting Web documents for libraries and other information resources may not be so futile as it once seemed.

References

  1. Germain, C.A. (2000). URLs: Uniform resource locators or unreliable resource locators? College and Research Libraries 61 (4), 359-365.
  2. Kitchens, J.D. & Mosley, P.A. (2000). Error 404: Or, what is the shelf-life of printed Internet guides? Library Collections, Acquisitions & Technical Services 24 (4), 467-478.
  3. Koehler, W.C. (1999a). An Analysis of Web Page and Web Site Constancy and Permanence. Journal of the American Society for Information Science 50 (2), 162-180.
  4. Koehler, W.C. (1999b). Classifying Websites and Webpages: The Use of Metrics and URL Characteristics as Markers. Journal of Librarianship and Information Studies 31 (1), 21-31.
  5. Koehler, W.C. (2000a). Keeping the Web garden weeded: Managing the elusive URL. Searcher 8 (4), 43-45.
  6. Koehler, W.C. (2000b). The Management of Web Page Dynamics in Web Catalogs: The Phantom Page Problem. Conference on Libraries and Associations in the Transient World Proceedings, vol. 1, pp. 211-214. Sudak, Ukraine. Available: http://www.gpntb.ru/win/inter-events/crimea2000/doc/tom1/444/Doc3.HTML
  7. McDonnell, J., Koehler, W. & Carroll, B. (2000). Cataloging Challenges in an Area Studies Virtual Library Catalog (ASVLC): Results of a Case Study. Journal of Internet Cataloging 2 (2), 15-42.
  8. McMillan, S.J. (2000). The microscope and the moving target: The challenge of applying content analysis to the World Wide Web. Journalism and Mass Communication Quarterly 77 (1), 80-98.
  9. Office of Research, OCLC (2001) http://wcp.oclc.org/, http://wcp.oclc.org/pubs.htm
  10. OCLC CORC http://www.oclc.org/corc/
  11. O'Neill, E.T. & Lavoie, B. F. (2000). Bibliographic control for the Web. Serials Librarian 37 (3), 53-69.