The previous section provided a breakdown of the concepts involved in preservation via digital imaging. In this section, a methodology is detailed for choosing how to proceed in the conversion of a given collection for preservation.
The goal is to determine the characteristics or format of the digital images used for preservation. In the terminology of the previous section, this digital image format would be the aggregate of three characteristics: its electronic content(s) and the two aspects of its electronic form, namely the encoding/compression and the file format.
The optimal choice of a digital image format for a collection depends on several factors:
In outline form, here are some guidelines on how to proceed. Note that this decision tree cannot be tackled unidirectionally; constraints extend both from the top down and from the bottom up, limiting some choices made in the middle.
As an example, assume a collection of brittle books, which contain the following key classes of material content:
In our example, we might find that some items in the collection are:
These factors affect our conversion strategy and may raise the need for special scanners or processing techniques.
Here one must choose the electronic content necessary to preserve the features present in a region of material content.
A spatial resolution must be chosen which offers a sufficient number of samples across the smallest significant feature. One or two samples per feature is an acceptable minimum.
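The sampling rule above can be turned into a small calculation. The following is a minimal sketch, not a prescribed procedure: the function name `required_dpi` and the 0.1 mm feature size in the usage example are illustrative assumptions, not values from this report.

```python
MM_PER_INCH = 25.4


def required_dpi(smallest_feature_mm: float, samples_per_feature: float = 2.0) -> float:
    """Return the spatial resolution (dots per inch) that places the requested
    number of samples across the smallest significant feature."""
    if smallest_feature_mm <= 0:
        raise ValueError("feature size must be positive")
    if samples_per_feature <= 0:
        raise ValueError("samples per feature must be positive")
    # One sample every (feature / samples) millimetres, converted to per-inch.
    return samples_per_feature * MM_PER_INCH / smallest_feature_mm


if __name__ == "__main__":
    # Hypothetical smallest feature: a 0.1 mm pen stroke.
    for samples in (1.0, 2.0):
        dpi = required_dpi(0.1, samples)
        print(f"{samples:.0f} sample(s) per feature -> about {dpi:.0f} dpi")
```

The result is a lower bound on spatial resolution; in practice one would round up to the nearest resolution the scanner actually offers.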
Samples beyond the minimum required across the smallest feature are disproportionately costly: the added benefit is minimal and the added cost is high. This is because the total data size (cost) grows as the square of the spatial resolution. At a spatial resolution producing one or two samples per smallest feature, the feature will be preserved (high benefit). More samples across this feature will only fix its position more accurately -- perhaps imperceptibly so (minimal additional benefit).
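The quadratic growth in data size can be checked with simple arithmetic. The sketch below assumes a hypothetical 6 x 9 inch page and uncompressed storage; the function names are illustrative only.

```python
def uncompressed_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed image size in bytes for a page scanned at the given resolution."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    # Round up to whole bytes.
    return (pixels * bits_per_pixel + 7) // 8


def relative_cost(dpi: float, base_dpi: float) -> float:
    """Data size at `dpi` relative to `base_dpi`; grows as the square of the ratio."""
    return (dpi / base_dpi) ** 2


if __name__ == "__main__":
    # Assumed page size: 6 x 9 inches, binary (1 bit per pixel).
    for dpi in (300, 600, 1200):
        size = uncompressed_bytes(6, 9, dpi, 1)
        print(f"{dpi:5d} dpi: {size:>10,d} bytes ({relative_cost(dpi, 300):.0f}x the 300 dpi size)")
```

Doubling the spatial resolution quadruples the data, while the feature that was already captured at one or two samples gains little.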
In general, more samples are required across physical features having low contrast or significant tonal range. Conversely, an electronic representation having greater tonal range (more bits per pixel) can, in general, be a good match to the smallest physical features at a somewhat lower spatial resolution.
With our example collection, we might choose the following possible mapping of material content areas into electronic content types. Note that other choices are possible and that at this stage we are ignoring some practical issues such as whether commonly available encoding/compression types and file formats can convey such a wide variety of electronic content types.
| material content region | corresponding electronic content |
scan at 600 dpi binary or 300 dpi 8-bit grayscale
Although an access image would most often be derived from its corresponding preservation image, different electronic content types could apply to the preservation image and the access image if dual processing pathways are active during capture.
For the current example collection, the following mapping of each electronic content type into a compatible encoding/compression method could be an appropriate one:
| electronic content | encoding/compression method |
| 600 dpi binary | CCITT Group 4 (T.6) compression in a single strip of data |
Note that this choice may be guided by constraints downstream in this decision tree. For example, the file format dictated by the intended use may not support more than one encoding or compression method.
Appropriate file formats are next chosen which both permit the inclusion of the required encoding/compression types and support all the anticipated variations of preservation and access usage.
In our example, the preservation file format would need to support Group 4 compressed binary data, JPEG compressed grayscale data and JPEG compressed color data. The access file format would perhaps not need to support all these types.
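This compatibility check between required encodings and candidate file formats can be expressed as a small set comparison. The sketch below is illustrative only: `format_a` and `format_b` are hypothetical placeholders, not real file formats, and their capability sets are invented for the example.

```python
# Encoding required for each electronic content type in the example collection.
REQUIRED_ENCODINGS = {
    "binary": "CCITT Group 4",
    "grayscale": "JPEG",
    "color": "JPEG",
}

# Hypothetical candidate formats and the encodings each is assumed to support.
CANDIDATE_FORMATS = {
    "format_a": {"CCITT Group 4", "JPEG"},
    "format_b": {"JPEG"},
}


def missing_encodings(supported, required=REQUIRED_ENCODINGS):
    """Return the required encodings that a format does not support."""
    return set(required.values()) - set(supported)


def suitable_formats(candidates, required=REQUIRED_ENCODINGS):
    """Return the names of candidate formats that support every required encoding."""
    return sorted(
        name
        for name, supported in candidates.items()
        if not missing_encodings(supported, required)
    )


if __name__ == "__main__":
    print("Suitable for preservation:", suitable_formats(CANDIDATE_FORMATS))
    print("format_b lacks:", missing_encodings(CANDIDATE_FORMATS["format_b"]))
```

An access format could be checked the same way against a smaller `required` mapping, since it may not need every encoding.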
The preservation or archival file could be destined for use outside the institution or could be intended for use only in-house. A proprietary or unique file format could be acceptable in the latter case (if required to support a highly innovative mix of encoding/compression types), but likely would not be acceptable in the former. Interchange requires standard file formats. If, however, only access files are to be interchanged, the data could be removed from the proprietary format, processed appropriately and formatted into a widely accepted format when the access files are created.
Access files clearly need to be in a standard format, but the choice of format may be driven by the realities of the marketplace: are the intended users likely to have viewing and manipulation tools which support the chosen format? Extensible viewers which allow downloading of the viewing software along with the data will someday ease this constraint -- one could then use any innovative proprietary format and still enable people to view it. This system architecture hinges on the acceptance of the newly proposed standard platform-independent interpreted software languages.
If the likely use is viewing or reference to a screen image, one class of file formats might best apply, whereas if printing is the more common need, another class might be more appropriate. Note that this consideration similarly trickles back up to step 3 and its choice of electronic content types as well; printing devices are primarily binary devices.
It may be that one's first pass through this decision tree produces a result which is overly aggressive. The resulting approach may produce a conversion project which is too labor intensive. It may require more flexibility of file formats than is currently available.
In these cases, a second pass through the guidelines may permit some compromises which increase the likelihood of creating a successful conversion project.
Among the elements of electronic content, scanning resolution -- both spatial (in dots per inch) and tonal (in bits per pixel) -- ranks as the most important choice to be made in preservation using digital imaging. The scanning decision tree found in Appendix B, discussed in this section, offers additional guidance on the treatment of this most important digital image characteristic.
In choosing the scanning resolution, it is best to attempt to respect the judgment of the original creator of the material, who made similar choices (consciously or not) in its creation.
If an author chose to highlight some of the words in a manuscript by typing them in a different color, we can assume that the intent was to better communicate some information to the reader through this use of color. Thus, for the preservation copy, this color information should be preserved. The decision to do a color scan in this case should not require any judgment call by the scanner operator, because the policy should be that whenever the author used multiple colors, a color scan should be performed, perhaps automatically.
While decisions regarding what constitutes "information" in the original document are to a large extent a matter of policy, those decisions have such a strong impact on the economic viability of a conversion project that some pragmatic guidelines must and do emerge.
By assuming that the important information content to be preserved is the information that originated with the author, and that the author intended the recipient to be able to see that information using normal lighting with an unaided eye, we can set some reasonable limits for the scanning process.
For example, we would not resort to color scanning to pick up the brown color of a coffee stain, because the stain is not part of the information that the author intended to convey to the recipient. Nor would we, according to these guidelines, use color scanning to pick up the brown color of aged paper.
With today's technology, the additional cost of storing a color image instead of a monochrome image is low, making it easy to adopt the policy of scanning all images that have significant color content with a color scanner. Even though the uncompressed file size of a color image is three times larger than that of a monochrome image, the JPEG compression algorithm compresses the additional color information very efficiently, so the compressed file size is likely to be only 20 to 25% larger.
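The storage comparison can be made concrete with a short calculation. This sketch assumes that "monochrome" here means 8-bit grayscale and that color means 24 bits per pixel; the 500,000-byte compressed grayscale size is a hypothetical figure chosen only for illustration.

```python
def uncompressed_ratio(color_bits: int = 24, mono_bits: int = 8) -> float:
    """Ratio of uncompressed color size to uncompressed monochrome size."""
    return color_bits / mono_bits


def compressed_color_size(mono_compressed_bytes: float, overhead: float) -> float:
    """Estimated compressed color size, given the fractional overhead
    (0.20 to 0.25 per the text) over the compressed monochrome size."""
    return mono_compressed_bytes * (1.0 + overhead)


if __name__ == "__main__":
    mono = 500_000  # hypothetical compressed grayscale size in bytes
    print(f"Uncompressed color/monochrome ratio: {uncompressed_ratio():.0f}x")
    for overhead in (0.20, 0.25):
        size = compressed_color_size(mono, overhead)
        print(f"Compressed color at +{overhead:.0%}: about {size:,.0f} bytes")
```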
The additional labor cost of scanning the image in color can be more significant, because current high speed scanners (e.g. those with autofeeders) tend to be monochrome only, requiring the use of a manual feed scanner for these documents. A monochrome access copy of the image may be desirable for printing because today's laser printers are almost all monochrome only.
In the scanning decision tree (Appendix B), the first branch depends on the presence of significant color content in the source document. This should be easy to determine using the criteria presented above. Choosing color scanning obviates all further consideration of the tonal resolution requirements, thus simplifying the remaining decision processes.
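The first branch, combined with the author-intent criteria above, could be written as a policy check that requires no judgment call from the scanner operator. This is a minimal sketch: the `ColorOrigin` categories and the function name are assumptions made for illustration, and the remaining branches of the Appendix B tree are not reproduced here.

```python
from enum import Enum


class ColorOrigin(Enum):
    """Where a color observed on the page came from (illustrative categories)."""
    AUTHOR = "author"          # e.g. words typed in a second color
    STAIN = "stain"            # e.g. a coffee stain
    AGED_PAPER = "aged paper"  # browning of the paper itself


def needs_color_scan(color_origins) -> bool:
    """Return True only if some color on the page originated with the author."""
    return any(origin is ColorOrigin.AUTHOR for origin in color_origins)


if __name__ == "__main__":
    stained_page = [ColorOrigin.STAIN, ColorOrigin.AGED_PAPER]
    highlighted_page = [ColorOrigin.AGED_PAPER, ColorOrigin.AUTHOR]
    print("stained page needs color scan:", needs_color_scan(stained_page))
    print("highlighted page needs color scan:", needs_color_scan(highlighted_page))
```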
Because the tonal resolution required for adequate color rendition (at least 16 bits/pixel) is greater than that required by any other type of source material, other content types that may be present within the page will not be harmed by color scanning. The spatial resolution required for the brightness component in color scanning is the same as the spatial resolution required in monochrome scanning. Assuming a properly designed scanner, a monochrome image can always be created from the color image, with quality as good as or better than that obtained from an original monochrome scan.
Monochrome sources that include continuous tone or halftone photographs also require significant tonal resolution. While halftone photographs could theoretically be scanned at a high enough spatial resolution to accurately preserve the sizes of the halftone dots, even a resolution of 1200 pixels/inch would not be sufficient to render individual dots within 5% of their original area. This approach would not work at all for continuous tone photographs, and requiring the scanner operator to discriminate between the two types of photographs during scanning would be error-prone.
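A rough calculation suggests why even 1200 pixels/inch falls short for halftone dots. The sketch below is not the derivation behind the 5% figure; it assumes a halftone screen ruling of 133 lines per inch (the text does not state one) and models only the effect of gaining or losing a single pixel at a dot's boundary.

```python
def pixels_per_dot(scan_ppi: float, screen_lpi: float, dot_coverage: float) -> float:
    """Approximate number of scan pixels covering one halftone dot.

    dot_coverage is the fraction of the halftone cell filled by the dot (0 to 1).
    """
    cell_pixels = (scan_ppi / screen_lpi) ** 2  # pixels in one halftone cell
    return cell_pixels * dot_coverage


def one_pixel_area_error(scan_ppi: float, screen_lpi: float, dot_coverage: float) -> float:
    """Fractional change in dot area caused by gaining or losing one pixel."""
    return 1.0 / pixels_per_dot(scan_ppi, screen_lpi, dot_coverage)


if __name__ == "__main__":
    SCREEN_LPI = 133  # assumed screen ruling, not taken from the text
    for coverage in (0.05, 0.10, 0.20):
        err = one_pixel_area_error(1200, SCREEN_LPI, coverage)
        print(f"{coverage:.0%} dot at 1200 ppi: one pixel is {err:.1%} of its area")
```

Under these assumptions a small highlight dot spans fewer than twenty pixels, so a single misclassified boundary pixel already changes its area by more than 5%.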