Sustainability of Digital Formats: Planning for Library of Congress Collections
Introduction to Geospatial Resources and Formats

Table of Contents
• Scope
• Characteristics of Geospatial Formats
• Georeferencing
• GIS Metadata and Data Documentation
• Content Standards
• Quality Factors
• Levels of Data Quality
• Provenance and Lineage
• GIS Functionality: Introduction
• GIS Functionality: Types of Analysis and Uses for GIS Resources
• Basic Spatial Analysis
• Spatial Interpolation and Estimation
• Grid-based Analysis
• Geospatial Datasets
• Functionality & Other Categories
• Summary
• References

Scope

The intended audience for this web site is the librarian, archivist, and/or data manager responsible for preserving digital resources. This essay is an introduction to geospatial formats and GIS functionality for the generalist or specialist geospatial data manager rather than for a GIS domain expert actively using the geospatial data. See also the summary overview Geospatial Content: Quality and Functionality Factors.

The descriptions of geospatial formats on this web site are intended to support the preservation of the data and its documentation and metadata, as received by a digital archive. The preservation goal is to facilitate future viewing or rendering of the data, and, to the extent known and agreed upon by the geospatial community, enable the re-use, re-analysis, and/or re-compilation of the data in the future.

Characteristics of Geospatial Formats

Geospatial formats have been and are continuously being specified and adopted by governmental organizations, software vendors, and standards-making bodies. These formats are often based on specifications or standards that are more general, e.g., for still images (raster and vector) and for datasets. For information about such formats and associated factors for assessing quality and functionality, see Still Images: Quality and Functionality Factors and Datasets: Quality and Functionality Factors.
Common to geospatial formats are the capabilities for accurate representation of the described resource’s location on the Earth using basic and inherent conceptual mechanisms such as georeferencing, scale, precision, and accuracy.

Georeferencing. Georeferencing has been defined as the establishment of a relationship between information (e.g., documents, datasets, maps, images, biographical information) and geographic locations through mechanisms such as the addition of place labels (e.g., place codes or toponyms) or the assignment of geographic coordinates.* (See the glossary in Linda Hill's Georeferencing: The Geographic Associations of Information [2], p. 228.) Georeferencing must be understood as a multi-part process involving the concepts of geographic coordinates and two- or three-dimensional map projections. All of the terms in the following list bear on georeferencing.
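To make the relationship between place labels, geographic coordinates, and map projections concrete, here is a minimal Python sketch. It is an illustration only: the one-entry gazetteer, its approximate coordinates, and the function names are assumptions introduced here, and the spherical Mercator projection is chosen simply because its formula is short, not because the source recommends it.

```python
import math

# Sphere radius in metres (the WGS 84 semi-major axis is commonly used
# for the spherical Mercator projection).
EARTH_RADIUS_M = 6378137.0

# Hypothetical gazetteer: a place label (toponym) mapped to approximate
# geographic coordinates as (longitude, latitude) in decimal degrees.
GAZETTEER = {"Washington, D.C.": (-77.0369, 38.9072)}


def to_spherical_mercator(lon_deg, lat_deg):
    """Project geographic coordinates (degrees) to planar x, y in metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


def from_spherical_mercator(x, y):
    """Invert the projection: planar metres back to degrees."""
    lon_deg = math.degrees(x / EARTH_RADIUS_M)
    lat_deg = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon_deg, lat_deg


# Step 1: place label -> geographic coordinates.
lon, lat = GAZETTEER["Washington, D.C."]
# Step 2: geographic coordinates -> two-dimensional projected coordinates.
x, y = to_spherical_mercator(lon, lat)
# Round trip back to degrees to confirm the two functions are inverses.
lon_back, lat_back = from_spherical_mercator(x, y)
```

The same coordinates give different planar values under a different projection or datum, which is why a format needs to record both alongside the coordinates themselves.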
In analyzing geospatial formats, it is important to understand how or whether a given format provides the means for documenting and calculating the accuracy and precision of measurement, description, and placement of the feature. Particularly when data are to be re-used, re-computed, or appended in time series or by similar or different instrumentation, it is of primary importance that such measures of the quality of the data be documented. Data quality is important to assess from the perspective of a preservation goal of being able to reproduce or replicate the data for purposes of re-use, and to further scientific experimentation.

Contributing to an assessment of data quality are the concepts of provenance and lineage for data. The term provenance concerns the factual establishment of data authorship and is important in assessing the authenticity and accuracy of data. The term lineage goes beyond provenance to include discussion of the source, methods, and timing of the data. All three characteristics of geospatial resources (data quality, provenance, and lineage), discussed in more detail below, should be recorded as metadata to enable future users to assess the fitness of a geospatial resource for a particular purpose.

GIS Metadata and Data Documentation

Content Standards. Community-based content standards exist for geospatial data, such as the U.S. Federal Geographic Data Committee’s Content Standard for Digital Geospatial Metadata (FGDC) [7] and the broader ISO standard for geographic information, ISO 19115:2003 [8]. In the U.S., FGDC’s content standard elements have been adopted more widely as a result of being incorporated into commonly used software products by such domain giants as ESRI and GeoMedia. (ESRI uses a "profile" of the FGDC standard rather than a native FGDC XML schema.) ISO 19115:2003 is slowly being adopted by more U.S. federal agencies and international agencies and is beginning to be incorporated into common software packages.
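The distinction drawn above between accuracy and precision, and the advice to record quality, provenance, and lineage as metadata, can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the readings, the reference value, and the record's field names are hypothetical and are not FGDC or ISO 19115 element names.

```python
import json
import math
import statistics


def rmse(measured, reference):
    """Root-mean-square error against a trusted reference (a measure of accuracy)."""
    return math.sqrt(sum((m - reference) ** 2 for m in measured) / len(measured))


# Hypothetical repeated elevation readings (metres) of one surveyed point.
readings = [101.2, 101.4, 101.3, 101.5, 101.1]
# Assumed "true" elevation taken from a higher-order survey.
reference_elevation = 100.0

accuracy_rmse = rmse(readings, reference_elevation)  # closeness to the truth
precision_sd = statistics.stdev(readings)            # repeatability of readings

# Hypothetical metadata record; field names are illustrative only.
record = {
    "data_quality": {
        "accuracy_rmse_m": round(accuracy_rmse, 3),
        "precision_stdev_m": round(precision_sd, 3),
    },
    "provenance": {"creator": "Example Survey Office"},
    "lineage": {
        "source": "field survey",
        "method": "repeated GNSS readings",
        "collected": "2011-04-08",
    },
}

serialized = json.dumps(record, indent=2)
```

In this made-up sample the readings agree closely with one another (small standard deviation) but sit well above the reference value (large RMSE), the kind of systematic offset a future user could only detect if both measures were documented.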
Both the FGDC and the ISO 19115 content standards describe a number of characteristics of data including descriptive, technical, source, and preservation metadata. Descriptive metadata is necessary for identification, citation, and currency assessment of the data. Technical metadata describes the spatial references (projection, datum, and geographic coordinates). Preservation metadata includes the environment characteristics associated with the creation of the data, processing history, and provenance / lineage of the data sources and final output.

Within the geospatial community, there is increased use of community-derived ontologies for controlled values of semantic concepts such as provenance and data quality, instrument descriptions, algorithm expressions, and workflow processes. Content-specific metadata and data documentation can be expressed or noted within a given data format in terms of community-based content standards (such as ISO 19115, FGDC, SensorML [9], and UncertML [10]) and community-built ontologies. Such information is useful not only for sharing and comparing by subsequent data users, but also for purposes of replicative compilation and/or computation to prove or extend scientific research, and for data extension in the case of open, serial, or time series additions to a data set.

In addition, it’s useful to know the form of expression for the metadata and documentation, i.e., whether it is expressed in a well-known XML schema or RDF ontology, as CSV spreadsheets, as relational database tables and attributes included with the data, or as links to external reports, ontologies, or web-based services. Noting whether standards-based metadata can be included or referred to within a format is useful for sharing and comparing between subsequent data users. To ascertain which software products are compliant with OGC standards, see the OGC Product Registry [11].

Quality Factors.
For some geospatial resources, such as satellite data, proper understanding and use of the data depends on information about the quality of the data as well as its provenance and lineage. For example, for satellite images the percentage of cloud cover is a significant quality characteristic. Such information is critical not only to understanding what the data says, but also to understanding appropriate and inappropriate uses of the data.

Levels of Data Quality. There can be many levels at which data quality is and should be documented. For example, at the product level, it is key to know how closely the data represents the actual geophysical state given the output from different instruments. Another quality level would be the pixel level, where the algorithms used to create the data points are noted as well as an assessment of the usability of those data points. At the granule level, a statistical roll-up of pixel-level data is compiled. This kind of computation could be important to validate the model used. For example, climate change data models can have grids of contiguous data tagged with uncertainty statistics for each grid cell, thus providing the means to assign quantitative risk factors or uncertainty levels to different mitigation scenarios. Examples of data quality reports for a data set can be found in NASA Surface Meteorology and Solar Energy: Accuracy [12].

The assessment of bias is a key data quality factor, i.e., bias that is generated from the instruments used (instrumental bias), or from the type of sampling or observations made that provide the view of the data produced. In addition, an assessment of appropriate and/or inappropriate use is often considered to be an important data quality consideration.

Provenance and Lineage.
Documentation about the provenance of data in terms of factual establishment of its authorship is usually considered to be quite important in order to determine the authenticity of the data, and to some extent its accuracy. For example, knowing the name of the organization and/or person(s) responsible for the creation and/or collection of data may help ascertain whether or why certain features are or are not present, such as roads or buildings on a map of a city. A data consumer would have more confidence in the accuracy of such a map if it had been created by the city’s data center rather than by a student at a local college or university. Another example of describing the provenance of data is the tracking of what instrumentation was used to generate or record the data and the algorithms used to calculate the data output.

The term data lineage is often considered to encompass provenance in the sense of authorship, but can include discussion of source, methods, and timing of the data as it has been created, derived, and/or subset over time and into different products. For example, MODIS data is characterized as being generated at various levels, starting with Level 1, which is closest to the raw data output by satellites and other remote sensing instruments, and is rarely used by itself. Level 2 data is derived from Level 1, and may involve a subset of data from certain instruments, or for certain time periods or locations. The crunching or compilation of Level 2 data often results in more specific products that can be used by themselves for various purposes (educational, policymaking, etc.), and are considered Level 3. Ideally, a Level 3 product would include documentation about its lineage going back to the Level 1 data.

GIS Functionality: Introduction

One important feature of GIS systems is the display or printing of maps. Vector data such as geographic coordinates, points, lines, and polygons that describe areas are stored in mathematical form.
This data can be used by GIS systems, or by vector graphics software, to print or to display on screen using scalable shapes, labels, and legends. When a user wishes to transform vector data into a raster format, the data structure of the original vector format should facilitate that basic need.

GIS Functionality: Types of Analysis and Uses for GIS Resources

Basic Spatial Analysis. A fundamental activity associated with normal functionality for a geospatial resource involves the reconciling of multiple data sources to the same or compatible geo-referenced locations as represented by the spatial data (coordinate information that describes the resource’s geography), and the attribute data (the non-spatial characteristics describing the resource), and documented by the GIS metadata. In determining the fitness of a format for basic spatial analysis, we must bear in mind that some of the basic analysis techniques described below are only appropriate for vector or attribute data. Typically, the reconciliation will take the form of performing operations on both the spatial data and the attribute data, if necessary, that allow the resource to become associated or converted to a (different) datum, map projection, and measurement units. Once any necessary reconciliation is done, the data are ready for further spatial analysis including sorting/selection, classification, and other operations. Brief descriptions of some types of spatial analysis considered to be part of normal functionality follow.
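As a small illustration of the reconcile-then-analyze sequence described above, the following Python sketch merges two hypothetical sources that record elevation in different units, then applies a Boolean selection and a simple classification to the attribute data. The feature records, the thresholds, and the function names are assumptions for the example; reconciling datums and map projections is a larger task and is deliberately left out.

```python
FEET_TO_METRES = 0.3048  # exact definition of the international foot

# Two hypothetical sources describing features with different elevation units.
source_a = [
    {"id": "A1", "elev": 1200.0, "unit": "ft", "land_use": "forest"},
    {"id": "A2", "elev": 2000.0, "unit": "ft", "land_use": "urban"},
]
source_b = [
    {"id": "B1", "elev": 410.0, "unit": "m", "land_use": "forest"},
    {"id": "B2", "elev": 250.0, "unit": "m", "land_use": "forest"},
]


def to_metres(feature):
    """Return a copy of the feature with its elevation expressed in metres."""
    factor = FEET_TO_METRES if feature["unit"] == "ft" else 1.0
    return {**feature, "elev": feature["elev"] * factor, "unit": "m"}


def classify(elev_m, threshold_m=500.0):
    """Assign an illustrative elevation class; the threshold is arbitrary."""
    return "high" if elev_m >= threshold_m else "low"


# Reconciliation: bring both sources to the same measurement unit.
merged = [to_metres(f) for f in source_a + source_b]

# Selection: Boolean query on attribute data (forest AND above 300 m).
selected = [f["id"] for f in merged if f["land_use"] == "forest" and f["elev"] > 300.0]

# Classification: derive a new attribute from the reconciled values.
classes = {f["id"]: classify(f["elev"]) for f in merged}
```

Had the selection been run before the unit conversion, feature B2 (250 m) and feature A1 (1200 ft, about 366 m) would have been compared on incompatible scales, which is why the reconciliation step comes first.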
Spatial Interpolation and Estimation. Some geospatial formats support more specialized spatial analyses that use statistical techniques to provide additional data points, especially when the full extent of data points within a resource is unknown because data are sparse, lost, or unobserved. These techniques are also used when changing the size of a grid, especially to a smaller cell size. The analysis can provide an estimate of a fuller extent of data points within a given sample. Some formats support the generation and writing of derived data from these statistical techniques back into a resource, usually with a calculated error or accuracy rate included. Some of the statistical methods include the following:
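One widely used interpolation technique is inverse distance weighting (IDW), in which an unobserved location is estimated as a weighted average of nearby samples, with closer samples weighted more heavily. The Python sketch below is offered as an illustration of the general idea, not as a method named by the source; the sample values and function names are hypothetical, and the leave-one-out check is one simple way to attach an error estimate to the derived values.

```python
import math


def idw(samples, x, y, power=2.0):
    """Estimate a value at (x, y) from (sx, sy, value) samples using
    inverse distance weighting."""
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in samples:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return value  # exactly on a sample: return the observed value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


def leave_one_out_rmse(samples, power=2.0):
    """Estimate interpolation error by predicting each sample from the others."""
    squared_errors = []
    for i, (sx, sy, value) in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        squared_errors.append((idw(others, sx, sy, power) - value) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical observations at the corners of a 2 x 2 square (x, y, value).
samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0), (0.0, 2.0, 30.0), (2.0, 2.0, 40.0)]

centre_estimate = idw(samples, 1.0, 1.0)    # equidistant from all four samples
error_estimate = leave_one_out_rmse(samples)
```

Writing both the estimate and the error figure back into a resource corresponds to the "calculated error or accuracy rate" mentioned above.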
Grid-based Analysis. Grid-based analysis is a GIS functionality that begins by identifying an area of interest and dividing it into rectangular cells based on geo-location (using a known datum and projection). The cells contain data values from a variety of sources and are stored in a format designed to hold gridded data. The values are then available for various forms of spatial and statistical analysis. Grids contain information that can range from geographic coordinates to reflectance values from solar radiation hitting surface features. Since grid capability enhances the utility of geospatial data, format descriptions document when and how a format supports the use of gridded data. Increasingly, geospatial data users transform their data to formats with more capability for grid-based analysis. For instance, a vector format containing point data may be transformed into a grid-capable format in order to perform area analysis over a broader geographic spectrum. To fully understand the implications of these kinds of data transformations, it is critical that the GIS metadata for the output of the grid-based analysis describe the factors that might impact the potential accuracy (error rate) of the data output such as the cell size and unit of measurement.
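The transformation of point data into gridded data described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the points are taken to be already in one projected coordinate system with metre units, the origin and cell size are arbitrary, and the function and field names are introduced here for the example.

```python
import math
from collections import defaultdict


def grid_cell(x, y, origin_x, origin_y, cell_size):
    """Return the (row, col) of the square cell containing point (x, y)."""
    col = math.floor((x - origin_x) / cell_size)
    row = math.floor((y - origin_y) / cell_size)
    return row, col


def grid_means(points, origin_x, origin_y, cell_size):
    """Aggregate (x, y, value) points into per-cell mean values."""
    cells = defaultdict(list)
    for x, y, value in points:
        cells[grid_cell(x, y, origin_x, origin_y, cell_size)].append(value)
    return {cell: sum(values) / len(values) for cell, values in cells.items()}


# Hypothetical point observations in projected metres (x, y, value).
points = [(5.0, 5.0, 1.0), (7.0, 2.0, 3.0), (15.0, 5.0, 10.0), (5.0, 15.0, 4.0)]

grid = grid_means(points, origin_x=0.0, origin_y=0.0, cell_size=10.0)

# Factors that affect the accuracy of the output, kept with the grid so
# that later users can judge it (field names are illustrative).
grid_metadata = {"cell_size": 10.0, "unit": "metre", "aggregation": "mean"}
```

Averaging two points into one cell discards their individual positions, so the cell size recorded in the metadata bounds how precisely any feature in the output can be located.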
Geospatial Datasets. For formats that support geospatial datasets, the means used to establish and maintain the relationships among the constructs of a dataset are important. These relationships are useful to know for the full extent of the data within the dataset (as well as selected subsets of the data) such as the attribute tables and features they describe, and the location information that places them on the Earth. In addition, very complex geospatial datasets may be open, and thus a capability may be needed to append data on a rolling or ongoing basis, or to set up relationships among data series based on time or instrumentation sources. It is important to document the extent and mechanism a format uses to maintain relationships among the parts of and the output from a geospatial dataset. In addition, because the primary and secondary or related parts of a geospatial data resource may not necessarily be connected by a given format, our format description documents will note when a community-based aggregation format is warranted to keep the dataset together. Examples of community-based aggregation formats include SDTS (Spatial Data Transfer Standard [13]), SAFE (Standard Archive Format for Europe [14]), XFDU (XML Formatted Data Unit [15]), DDI (Data Documentation Initiative [16]), METS (Metadata Encoding and Transmission Standard [17]) or one of the various flavors of ZIP [18].
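To illustrate how an aggregation format can keep the parts of a dataset together and record the relationships among them, the Python sketch below packs several files into a ZIP archive with a manifest. The manifest layout, file names, and relationship labels are hypothetical and far simpler than SDTS, SAFE, XFDU, or METS; only the standard-library `zipfile`, `hashlib`, `json`, and `io` modules are used.

```python
import hashlib
import io
import json
import zipfile


def package_dataset(parts, relationships):
    """Pack dataset parts (name -> bytes) into an in-memory ZIP archive,
    adding a manifest of checksums and declared relationships."""
    manifest = {
        "files": {name: hashlib.sha256(data).hexdigest() for name, data in parts.items()},
        "relationships": relationships,
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in parts.items():
            archive.writestr(name, data)
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
    return buffer.getvalue()


# Hypothetical dataset parts: features, their attribute table, and metadata.
parts = {
    "features.csv": b"id,x,y\n1,5.0,5.0\n2,15.0,5.0\n",
    "attributes.csv": b"id,land_use\n1,forest\n2,urban\n",
    "metadata.json": b'{"datum": "WGS 84"}',
}
# Hypothetical statements of how the parts relate to one another.
relationships = [
    {"from": "attributes.csv", "to": "features.csv", "type": "describes", "key": "id"},
    {"from": "metadata.json", "to": "features.csv", "type": "documents"},
]

blob = package_dataset(parts, relationships)

# Reading the archive back shows the manifest travelling with the data.
with zipfile.ZipFile(io.BytesIO(blob)) as archive:
    names = sorted(archive.namelist())
    manifest = json.loads(archive.read("manifest.json"))
```

ZIP on its own only bundles the files; it is the manifest, or an equivalent community-defined structure, that carries the relationships the paragraph above asks to have documented.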
Functionality Shared with Other Categories. The functionality associated with different geospatial formats will vary depending upon whether the format is used for raster data, vector data, or attribute data. For example, using a raster still image, one could identify general locations within an image, but precise location is facilitated by using Boolean logic to query attribute data associated with the raster image. The following sections discuss some of the functionality shared with other formats:
Summary.

References

1. Geospatial Innovation Facility, University of California, Berkeley. GIS Data Types: Vector vs. Raster, accessed April 8, 2011. http://gif.berkeley.edu/documents/GIS_Data_Formats.pdf.
2. Hill, Linda L. Georeferencing: The Geographic Associations of Information. MIT Press: Cambridge, Massachusetts, 2006.
3. Geospatial Innovation Facility, University of California, Berkeley. Projections: What You Need to Know for GIS. Page 1 of document; accessed April 8, 2011. http://gif.berkeley.edu/documents/Projections_Datums.pdf.
4. Geospatial Innovation Facility, University of California, Berkeley. Datum: What You Need to Know for GIS. Page 2 of document; accessed April 8, 2011. http://gif.berkeley.edu/documents/Projections_Datums.pdf.
5. Geospatial Innovation Facility, University of California, Berkeley. Scale in GIS: What You Need to Know for GIS, accessed April 8, 2011. http://gif.berkeley.edu/documents/Scale_in_GIS.pdf.
6. Bolstad, Paul. GIS Fundamentals: a First Text on Geographic Information Systems. White Bear Lake, Minn: Eider Press, 2008.
7. Federal Geographic Data Committee. Content Standard for Digital Geospatial Metadata. FGDC-STD-001-1998, accessed April 8, 2011. http://www.fgdc.gov/standards/projects/FGDC-standards-projects/metadata/base-metadata/index_html.
8. ISO (International Organization for Standardization). Geographic information -- Metadata. ISO 19115:2003, accessed April 11, 2011. http://www.iso.org/iso/catalogue_detail.htm?csnumber=26020.
9. Open Geospatial Consortium Inc., Mike Botts, editor. OpenGIS Sensor Model Language (SensorML) Implementation Specification. OGC 07-000, July 17, 2007; accessed April 11, 2011. http://www.opengeospatial.org/standards/sensorml.
10. Williams, Matthew, Dan Cornford, Lucy Bastin, and Edzer Pebesma. Uncertainty Markup Language (UncertML): OpenGIS Discussion Paper. 08-122r2, accessed January 16, 2012. http://portal.opengeospatial.org/files/?artifact_id=33234.
11. Open Geospatial Consortium Inc. All Registered Products. OGC Product Registry, accessed April 8, 2011. http://www.opengeospatial.org/resource/products.
12. NASA Langley Research Center. NASA Surface Meteorology and Solar Energy: Accuracy. Page accessed April 8, 2011. http://power.larc.nasa.gov/cgi-bin/cgiwrap/solar/print.cgi?accuracy.txt.
13. American National Standards Institute (ANSI). Spatial Data Transfer Standard (SDTS). ANSI NCITS 320-1998, June 9, 1998, accessed April 11, 2011. http://mcmcweb.er.usgs.gov/sdts/standard.html.
14. European Space Agency. Standard Archive Format for Europe (SAFE). Page accessed April 8, 2011. http://earth.esa.int/SAFE/index.html.
15. ISO (International Organization for Standardization). Space Data And Information Transfer Systems -- XML Formatted Data Unit (XFDU) Structure And Construction Rules. ISO 13527:2010, 2010, accessed April 11, 2011. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=53985.
16. DDI Alliance. Data Documentation Initiative (DDI) Technical Specification. Version 3.1, 2009, accessed April 8, 2011. http://www.ddialliance.org/Specification/.
17. Digital Library Federation. METS Metadata Encoding & Transmission Standard. Page accessed April 8, 2011. //www.loc.gov/standards/mets/.
18. Wikipedia. ZIP (file format), accessed April 8, 2011. http://en.wikipedia.org/wiki/ZIP_%28file_format%29.
19. Open Geospatial Consortium, Inc. Welcome to the OGC Website. Page accessed April 8, 2011. http://www.opengeospatial.org/.

* Georeferencing ought not be confused with georegistration and geocoding. Georegistration is the process of adjusting one drawing or image (the "target component") so that its features match the geographic locations of the same features on a "reference component," i.e., a drawing, image, surface, or map that is known to be correct. Geocoding is the process of determining geographic coordinates from other data, such as street addresses or place names.