Sitemaps

An XML file listing URLs and their associated metadata.

What are sitemaps?

A sitemap provides information on the relationships between the pages, videos, images, and other resources on a website. Sitemaps are primarily used to inform search engines about the pages that are available for crawling. A sitemap is expressed as an XML file listing URLs and their associated metadata. Additional information about the sitemap protocol is available at https://www.sitemaps.org/protocol.html.

The LOC.gov sitemap structure

The top-level sitemap for loc.gov is located at loc.gov/sitemap.xml and has the following structure:


<sitemapindex>
    <sitemap>
        <loc>http://www.loc.gov/exhibitions/sitemap.xml</loc>
    </sitemap>
    ...
    <sitemap>
        <loc>http://www.loc.gov/librarians/sitemap.xml</loc>
    </sitemap>
</sitemapindex>

At this level, each sitemap tag describes a portal (page) within loc.gov (e.g. loc.gov/newspapers/, loc.gov/events/) and contains a link to that portal's (page's) sitemap. Similarly, the sitemap for a specific portal (page) contains links to the sitemaps of its related resources. For example, the sitemap for loc.gov/exhibitions is:


<sitemapindex>
    <sitemap>
        <loc>http://www.loc.gov/exhibitions/rosa-parks-in-her-own-words/sitemap.xml</loc>
    </sitemap>
    ...
    <sitemap>
        <loc>http://www.loc.gov/exhibitions/drawing-justice-courtroom-illustrations/sitemap.xml</loc>
    </sitemap>
</sitemapindex>
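
As a rough illustration of reading such an index, the following Python sketch extracts the child sitemap links from a sitemap index document. The function names (local_name, child_sitemap_urls) are hypothetical, and the sample string contains only the two entries shown above, not the full live index. Because sitemaps served on the web typically declare an XML namespace that the samples here omit, the sketch ignores any namespace prefix on tag names.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def child_sitemap_urls(xml_text):
    """Return the <loc> value of every <sitemap> entry in a sitemap index."""
    root = ET.fromstring(xml_text.strip())
    urls = []
    for entry in root:
        if local_name(entry.tag) != "sitemap":
            continue
        for field in entry:
            if local_name(field.tag) == "loc" and field.text:
                urls.append(field.text.strip())
    return urls


# Sample input: only the two entries shown in the example above.
EXHIBITIONS_INDEX = """
<sitemapindex>
    <sitemap>
        <loc>http://www.loc.gov/exhibitions/rosa-parks-in-her-own-words/sitemap.xml</loc>
    </sitemap>
    <sitemap>
        <loc>http://www.loc.gov/exhibitions/drawing-justice-courtroom-illustrations/sitemap.xml</loc>
    </sitemap>
</sitemapindex>
"""

for url in child_sitemap_urls(EXHIBITIONS_INDEX):
    print(url)
```

Each printed URL is itself a sitemap that can be read in the same way.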

Use case

By traversing the nodes of the top-level sitemap, one can crawl the pages on the website and map the relationships between pages and other resources. Furthermore, when one encounters a terminal node (i.e. a page that does not link to any other resources), the sitemap reveals useful metadata. For example, the sitemap for the page loc.gov/exhibitions/rosa-parks-in-her-own-words/,


<urlset>
    <url>
        <loc>
            http://www.loc.gov/exhibitions/rosa-parks-in-her-own-words/about-this-exhibition/
        </loc>
        <changefreq>weekly</changefreq>
        <priority>0.5</priority>
    </url>
</urlset>

contains the URL, its expected change frequency, and its priority (a hint to web crawlers and search engines about the importance of the page relative to other pages on the site).
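
The traversal described in this section could be sketched in Python roughly as follows. This is a minimal sketch, not an official client: walk_sitemap, local_name, and the PAGES dictionary are hypothetical names, and PAGES stands in for the network by mapping two of the URLs shown above to abbreviated copies of their sample XML. The fetch argument is assumed to be any callable that takes a sitemap URL and returns its XML text.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def walk_sitemap(url, fetch, seen=None):
    """Yield one metadata dict per <url> entry reachable from `url`.

    `fetch` is any callable that maps a sitemap URL to its XML text.
    """
    seen = set() if seen is None else seen
    if url in seen:  # guard against sitemaps that reference each other
        return
    seen.add(url)
    root = ET.fromstring(fetch(url).strip())
    for entry in root:
        # Collect the entry's child tags (loc, changefreq, priority, ...).
        fields = {local_name(f.tag): (f.text or "").strip() for f in entry}
        kind = local_name(entry.tag)
        if kind == "sitemap" and fields.get("loc"):
            # Non-terminal node: follow the link to the child sitemap.
            yield from walk_sitemap(fields["loc"], fetch, seen)
        elif kind == "url":
            # Terminal node: report the page metadata.
            yield fields


# Stand-in for the network, abbreviated from the samples in this section.
PAGES = {
    "http://www.loc.gov/exhibitions/sitemap.xml": """
        <sitemapindex>
            <sitemap>
                <loc>http://www.loc.gov/exhibitions/rosa-parks-in-her-own-words/sitemap.xml</loc>
            </sitemap>
        </sitemapindex>
    """,
    "http://www.loc.gov/exhibitions/rosa-parks-in-her-own-words/sitemap.xml": """
        <urlset>
            <url>
                <loc>
                    http://www.loc.gov/exhibitions/rosa-parks-in-her-own-words/about-this-exhibition/
                </loc>
                <changefreq>weekly</changefreq>
                <priority>0.5</priority>
            </url>
        </urlset>
    """,
}

for page in walk_sitemap("http://www.loc.gov/exhibitions/sitemap.xml", PAGES.__getitem__):
    print(page["loc"], page["changefreq"], page["priority"])
```

To run the same traversal against the live site, one could pass a fetch function built on urllib.request.urlopen in place of PAGES.__getitem__. That variant is untested here, and a polite crawler would also want rate limiting and error handling.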