Week 6 Outline: Web Scraping & Parsing XML

Exercise: Download XML-formatted finding aids from the Library of Congress and extract metadata fields

Open Terminal in macOS and launch our Docker container:

docker rm -f pcda_ubuntu
docker pull pcda17/ubuntu-container
docker run --name pcda_ubuntu -ti -p 8889:8889 --volume ~/Desktop/sharedfolder/:/sharedfolder/ pcda17/ubuntu-container

In Windows 10, open PowerShell and enter the following to launch the Docker container:

docker rm -f pcda_ubuntu
docker pull pcda17/ubuntu-container
docker run --name pcda_ubuntu -ti -p 8889:8889 --volume C:\Users\***username_here***\Desktop\sharedfolder:/sharedfolder/ pcda17/ubuntu-container

Open a new browser window and navigate to the Library of Congress’s list of XML finding aids by collection: http://findingaids.loc.gov/source/main.

Choose a collection you would like to work with for today’s class. For instance, the “Recorded Sound” collection is located at http://findingaids.loc.gov/source/RS. Copy the URL of the page you’ve chosen.

Navigate to localhost:8889 in your browser to open Jupyter. In the New drop-down menu near the top right of the window, select Terminal to open a bash shell session in your browser.

Make a new directory in /sharedfolder/ using the mkdir command, then cd into the directory.

mkdir LOC_Recorded_Sound

cd LOC_Recorded_Sound/

Download the LOC page you’ve chosen using wget. The --adjust-extension option adds “.html” to the end of the filename.

wget http://findingaids.loc.gov/source/RS --adjust-extension

Two more ways to download the contents of a web page:

With the page open in your browser, go to File > Save Page As ... in the toolbar. In the window that pops up, select Webpage, HTML Only as your format, then save the file wherever you like.

Right-click anywhere in the browser window and select View Page Source. A new browser tab will open to display the page’s HTML source. Copy and paste the HTML into an empty text file.

In the macOS Finder or Windows Explorer, navigate to the sharedfolder directory on your desktop. Open the HTML file you just downloaded in Atom, or a text editor of your choice.

Scroll through the file and locate the list of links to finding aids. Each XML finding aid URL looks something like this: http://hdl.loc.gov/loc.mbrsrs/eadmbrs.rs010001.2

Our goal is to get each finding aid URL onto a separate line, using the text editor’s “Find and Replace” feature.

Because the same series of characters appears before and after each URL (href=" before and " target= after), use “Replace All” to replace each of these sequences with a newline.
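The same replacement can be sketched in Python, in case you would rather not edit the file by hand. This is only an illustration on a made-up HTML fragment: the sample string and the function name split_urls_onto_lines are assumptions, not part of the exercise, and the sketch assumes the two marker strings appear exactly as described above.

```python
def split_urls_onto_lines(html_text):
    """Put each linked URL on its own line by replacing the text around it."""
    # Same two replacements as the editor's "Replace All" step
    text = html_text.replace('href="', '\n')
    text = text.replace('" target=', '\n')
    return text

# Hypothetical fragment modeled on the link markup described above
sample = '<a href="http://hdl.loc.gov/loc.mbrsrs/eadmbrs.rs010001.2" target="_blank">Finding aid</a>'

for line in split_urls_onto_lines(sample).splitlines():
    print(line)
```

The URL ends up alone on the middle line, with leftover markup on the lines before and after it. The grep steps later in the exercise filter out that leftover markup.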

Now save the HTML file (which is no longer proper HTML) and return to the terminal session in your browser.

We will now use the grep tool to search through the HTML file and extract lines containing URLs. The following command will write all lines in RS.html that include “http://” to a new file called url_list_1.txt.

grep "http://" RS.html > url_list_1.txt

Open url_list_1.txt in your text editor and take a look. Note that the file still contains lines we don’t need, including links to METS records, which end in .4. Since all finding aid URLs that we want end in .2, we can use grep again to extract just those URLs.

grep "\.2" url_list_1.txt > url_list_2.txt

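Note that the . character in our grep search term needs to be escaped with a backslash. An unescaped . matches any single character, so the pattern would also match lines that contain a 2 preceded by anything at all.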
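To see what the backslash changes, here is a rough Python analogue of the grep filter. It uses Python's re module rather than grep itself, and the helper name grep_lines and the three sample lines are made up for illustration. For a pattern this simple the two tools should behave the same way, though their regular expression syntax differs in other cases.

```python
import re

def grep_lines(pattern, lines):
    """Return the lines that contain a match for the regular expression."""
    return [line for line in lines if re.search(pattern, line)]

# Hypothetical lines standing in for the contents of url_list_1.txt
lines = [
    'http://hdl.loc.gov/loc.mbrsrs/eadmbrs.rs010001.2',
    'http://hdl.loc.gov/loc.mbrsrs/eadmbrs.rs010001.4',
    'http://example.org/page/12',
]

print(grep_lines(r'\.2', lines))  # escaped: matches only a literal ".2"
print(grep_lines(r'.2', lines))   # unescaped: "." matches any character
```

The escaped pattern keeps only the first line. The unescaped pattern also keeps the third, because the "1" in "12" counts as "any character" before the 2.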

Open url_list_2.txt in your text editor. If the file still contains any text other than the URLs we want, delete it by hand and save the file.

Now we’re ready to download our collection of XML finding aids with wget. The -i option specifies that we want to download the files at every URL in a text file, with one URL per line.

wget -i url_list_2.txt
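If wget is not available, a similar batch download can be sketched with Python's standard library. The function names below are invented for this example, and naming each file after the last segment of its URL is an assumption meant to mirror the filenames used later in this exercise. The download call itself is left commented out because it needs network access.

```python
import os
import urllib.request

def read_url_list(path):
    """Read one URL per line, skipping blank lines."""
    with open(path) as url_file:
        return [line.strip() for line in url_file if line.strip()]

def filename_for(url):
    """Name the local file after the last path segment of the URL."""
    return url.rstrip('/').split('/')[-1]

def download_all(path, dest_dir='.'):
    """Download every URL listed in the file at `path` into dest_dir."""
    for url in read_url_list(path):
        target = os.path.join(dest_dir, filename_for(url))
        urllib.request.urlretrieve(url, target)
        print('Saved', target)

# Example call (requires network access):
# download_all('url_list_2.txt')
```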

Navigate to Jupyter Home at http://localhost:8889 and create a new Python 3 notebook.

In the first cell of your notebook, run the following commands to change your working directory to /sharedfolder/LOC_Recorded_Sound and display a list of the filenames in that directory.

import os

os.chdir('/sharedfolder/LOC_Recorded_Sound')

os.listdir('./')

Next we will use the BeautifulSoup package to parse an XML file. Insert one of your XML filenames in the snippet below and run it.

from bs4 import BeautifulSoup

xml_filename = 'eadmbrs.rs009003.2'

xml_text = open(xml_filename).read()

soup = BeautifulSoup(xml_text, 'lxml')

Open the same file in your text editor. Notice the tree structure of the XML file, in which each level of the XML tree is indented further than the one above it.

In case you’re working with an XML or HTML file that isn’t so neatly organized, this snippet will display a prettified version of the file.

print(soup.prettify())

The following will locate the author element in the XML tree and display its contents.

author = soup.ead.filedesc.titlestmt.author.get_text()

print(author)

Or, more succinctly:

author = soup.find('author').get_text()

print(author)
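To compare the two kinds of lookup without a downloaded file, here is a small sketch that uses Python's built-in xml.etree.ElementTree module instead of BeautifulSoup. That library is swapped in only so the example runs with no extra packages. The XML string is a made-up, heavily simplified stand-in for a finding aid; real files are much larger and may declare XML namespaces, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a finding aid
sample_xml = """
<ead>
  <eadheader>
    <filedesc>
      <titlestmt>
        <titleproper>Sample Collection</titleproper>
        <author>Prepared by A. Archivist</author>
      </titlestmt>
    </filedesc>
  </eadheader>
</ead>
"""

root = ET.fromstring(sample_xml)

# Explicit path from the root element, one level at a time
author_by_path = root.find('eadheader/filedesc/titlestmt/author').text

# Search all descendants and take the first <author> element found
author_anywhere = next(root.iter('author')).text

print(author_by_path)
print(author_anywhere)
```

Both lookups return the same text here. The explicit path documents where the element sits in the tree, while the descendant search is shorter and does not break if an intermediate level is renamed or missing.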

The following snippet will print the author field for each file in your collection of finding aids:

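for filename in sorted(os.listdir('./')):
    # Only process finding aids, whose filenames end in .2
    if not filename.endswith('.2'):
        continue
    with open(filename) as xml_file:
        soup = BeautifulSoup(xml_file.read(), 'lxml')
    author = soup.find('author')
    # find() returns None if a file has no author element
    print(author.get_text() if author else 'No author found in ' + filename)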

If an XML element type appears multiple times in a file, use soup.find_all() to return them in a list:

titles = soup.find_all('title')

print(titles)

To extract text from each element, you can use a for loop or, as below, a list comprehension:

titles = [item.get_text() for item in soup.find_all('title')]

print(titles)

Crossref API

Crossref API format:

https://search.crossref.org/dois?q=10.5555%2F12345678
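In the URL above, %2F is the percent-encoded form of the slash in the DOI 10.5555/12345678. A minimal sketch of building such a URL in Python follows. The function name crossref_search_url is made up for this example, and the sketch only constructs the string; it does not send a request or confirm that the endpoint currently responds in this format.

```python
from urllib.parse import quote

def crossref_search_url(doi):
    """Build a search URL in the format shown above for a given DOI."""
    # safe='' tells quote() to encode "/" as %2F as well
    return 'https://search.crossref.org/dois?q=' + quote(doi, safe='')

print(crossref_search_url('10.5555/12345678'))
```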

pcda17.github.io

Course materials for 'Critical Perspectives in Cultural Data Analysis' at UT Austin's iSchool