X.1.2: Protocol for Corpus Management and Metadata

All files have been collected into a corpus on HathiTrust. The original link can be found here.
Each file in the corpus has been downloaded as both a .pdf file and a .txt file.
Some decisions about the selection of texts were made ahead of time (see X.1.1).
Each file is entered into an interactive Airtable spreadsheet.
1. The documents are broken down further into categories: monographs (which include pamphlets and other singular works); reports, society meetings, conferences, and other collected documents; and articles and chapters (which refer back to either the monographs or the collected document sources).
2. The key information recorded for each source includes: author name, publisher information, date of release, any editors or translators, any institutional affiliation, and markers for the kind of document (scientific, governmental, public consumption, or otherwise).
3. Further documentation, when available, is made concerning the author’s institutional affiliation and where that individual lived.
4. All documents are tagged with information about whether they have illustrations or other graphic representation.
5. Entry into the spreadsheet also produces an identifier based on this information. A rough outline is as follows:
Each .pdf and .txt file is labeled with the above identifier.
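The labeling step can be sketched in Python as a rename of the downloaded pair. This is a minimal sketch, not the procedure actually used: the function name label_pair, the file stem "download", and the placeholder identifier are all hypothetical, and the identifier is simply passed in as a string because its format is not reproduced here.

```python
from pathlib import Path


def label_pair(folder, old_stem, identifier):
    """Rename <old_stem>.pdf and <old_stem>.txt to <identifier>.pdf/.txt.

    `identifier` is whatever the spreadsheet produced; this sketch does
    not attempt to build it.
    """
    folder = Path(folder)
    pairs = [
        (folder / f"{old_stem}{suffix}", folder / f"{identifier}{suffix}")
        for suffix in (".pdf", ".txt")
    ]
    # Check both files before touching either, so the pair stays in sync.
    for source, target in pairs:
        if not source.is_file():
            raise FileNotFoundError(source)
        if target.exists():
            raise FileExistsError(target)
    return [source.rename(target) for source, target in pairs]
```

Calling label_pair("downloads", "download", "EXAMPLE-ID") would rename download.pdf and download.txt in the (hypothetical) downloads folder to EXAMPLE-ID.pdf and EXAMPLE-ID.txt.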
Once labeled, the .pdf is opened and scanned for images to extract.
1. Images are selected according to their kind: photographs, illustrations, and graphic representations are chosen.
2. More abstract illustrations have been ignored, specifically bar graphs, scatter plots, and line graphs. Some other representations, such as maps, were chosen or ignored on a case-by-case basis.
3. Some texts have no images. If there are no images, make sure “0” is marked in the “Images” column of the respective spreadsheet. If there are images, change the “0” to a “1”.
4. The purpose of this marker is to provide a quick tally of texts with images in relation to the entire corpus.
5. Once the pages for each image are selected, those pages are extracted using the Adobe Acrobat Extract function: Extract Pages > Check “Extract Pages as Separate Files” > Click “Ok” > Create a folder named with the identifier in the “0TB_Primary Sources/5 Images” folder > Click “Choose”.
6. This operation exports each page with an image on it as a single file (for analysis later).
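The tally described in steps 3 and 4 can be sketched as follows. This assumes the spreadsheet has been exported to CSV with the “Images” column intact; the function name image_tally and the “Identifier” column in the sample data are hypothetical, and the sample rows are invented purely for illustration.

```python
import csv
import io


def image_tally(rows, column="Images"):
    """Return (texts_with_images, total_texts, share) from spreadsheet rows.

    Each row is a mapping such as those yielded by csv.DictReader.
    A cell counts as "has images" only when it holds the marker "1".
    """
    total = 0
    with_images = 0
    for row in rows:
        total += 1
        if str(row.get(column, "0")).strip() == "1":
            with_images += 1
    share = with_images / total if total else 0.0
    return with_images, total, share


# Invented sample standing in for a CSV export of the spreadsheet.
sample = io.StringIO("Identifier,Images\nA,1\nB,0\nC,1\nD,0\n")
print(image_tally(csv.DictReader(sample)))  # (2, 4, 0.5)
```

Keeping the marker as a plain 0 or 1 is what makes this count trivial: the share of texts with images is just the number of 1s divided by the number of rows.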
After the documents have been entered into the spreadsheet and checked for images, the .pdf file is moved into the “0 TB_Primary Sources/3 PDF_Fullsource/_Recorded” folder and the .txt file is moved into the “0TB_Primary Sources/4 TXT_Fullsource” folder.
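The final filing step could be scripted along these lines. This is only a sketch under stated assumptions: the function name file_recorded_source is hypothetical, both destination folders are passed in as arguments rather than hard-coded, and the folders are assumed to exist already.

```python
import shutil
from pathlib import Path


def file_recorded_source(working_dir, identifier, pdf_dest, txt_dest):
    """Move <identifier>.pdf and <identifier>.txt into their archive folders."""
    working_dir = Path(working_dir)
    plan = {
        working_dir / f"{identifier}.pdf": Path(pdf_dest),
        working_dir / f"{identifier}.txt": Path(txt_dest),
    }
    # Validate everything first so a failure never leaves the pair half-moved.
    for source, dest in plan.items():
        if not source.is_file():
            raise FileNotFoundError(source)
        if not dest.is_dir():
            raise NotADirectoryError(dest)
        if (dest / source.name).exists():
            raise FileExistsError(dest / source.name)
    return [
        Path(shutil.move(str(source), str(dest / source.name)))
        for source, dest in plan.items()
    ]
```

Here pdf_dest would point at the “_Recorded” folder and txt_dest at the “4 TXT_Fullsource” folder named above. Refusing to overwrite an existing file is a design choice of this sketch, meant to guard against two sources sharing an identifier.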
