The Tuberculosis Specimen

X.1.2: Protocol for Corpus Management and Metadata

The following protocol was developed to create the corpus’ metadata and file names.

  1. All files have been made into a corpus on HathiTrust. The original link can be found here.
  2. Each file in the corpus has been downloaded as both a .pdf file and .txt file.
  3. Some decisions about the selection of texts were made ahead of time (see X.1.1).
  4. Each file is entered into an interactive Airtable spreadsheet.
    1. The documents are broken down further into categories: monographs (which include pamphlets and other singular works); reports, society meetings, conferences, and other collected documents; and articles and chapters (which reference back to either the monographs or the collected document sources).
    2. The key information recorded for each source includes: author name, publisher information, date of release, any editors or translators, any institutional affiliation, and markers for the kind of document (scientific, governmental, public consumption, or otherwise).
    3. When available, further documentation records the author’s institutional affiliation and where that individual lived.
    4. All documents are tagged with information about whether they have illustrations or other graphic representation.
    5. Entry into the spreadsheet also produces an identifier based on this information. A rough outline is as follows:
  5. Each .pdf and .txt file is labeled with the above identifier.
  6. Once labeled, the .pdf is opened and scanned for images to extract.
    1. Images are selected based on their type: photographs, illustrations, and graphic representations are chosen.
    2. More abstract illustrations were ignored, specifically bar graphs, scatter plots, and line graphs. Some other representations, such as maps, were chosen or ignored on a case-by-case basis.
    3. Some texts have no images. If there are no images, make sure “0” is marked in the “Images” column of the respective spreadsheet. If there are images, change the “0” to a “1”.
    4. The purpose of this is to have a quick tally of texts with images in relation to the entire corpus.
    5. Once the pages for each image are selected, those pages are extracted using the Adobe Acrobat Extract function: Extract Pages > Check “Extract Pages as Separate Files” > Click “Ok” > Create a folder with the identifier in the “0TB_Primary Sources/5 Images” folder > Click “Choose”.
    6. This operation allows for each page with an image on it to be exported as a single file (for analysis later).
  7. After the documents have been entered into the spreadsheet and checked for images, the .pdf file is moved into the “0 TB_Primary Sources/3 PDF_Fullsource/_Recorded” folder and the .txt file is moved into the “0TB_Primary Sources/4 TXT_Fullsource” folder.
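
The labeling, tallying, and filing steps above were carried out by hand, but they could be sketched in a short script. The example below is only an illustration under stated assumptions: the function names `image_tally` and `file_source`, the `root` argument (standing in for the primary sources folder), and the `EXAMPLE_ID` identifier are all hypothetical, and the spreadsheet rows are assumed to be available as a list of dictionaries with an “Images” key. The subfolder names are taken from the protocol.

```python
from pathlib import Path
import shutil


def image_tally(records):
    """Return (texts_with_images, total_texts) from spreadsheet rows.

    Each row is assumed to be a dict whose "Images" value is 0 or 1,
    mirroring the "Images" column described in the protocol.
    """
    with_images = sum(
        1 for row in records if str(row.get("Images", "0")).strip() == "1"
    )
    return with_images, len(records)


def file_source(identifier, pdf_path, txt_path, root, has_images=False):
    """Label a downloaded .pdf/.txt pair with its identifier and file both.

    `root` is a hypothetical stand-in for the primary sources folder.
    The subfolder names below follow the protocol.
    """
    root = Path(root)
    pdf_dest = root / "3 PDF_Fullsource" / "_Recorded" / f"{identifier}.pdf"
    txt_dest = root / "4 TXT_Fullsource" / f"{identifier}.txt"

    # Create the destination folders if they do not exist yet.
    pdf_dest.parent.mkdir(parents=True, exist_ok=True)
    txt_dest.parent.mkdir(parents=True, exist_ok=True)

    # Only texts with images get a folder for their extracted pages.
    image_dir = None
    if has_images:
        image_dir = root / "5 Images" / identifier
        image_dir.mkdir(parents=True, exist_ok=True)

    # Moving to the new name renames and files the source in one step.
    shutil.move(str(pdf_path), str(pdf_dest))
    shutil.move(str(txt_path), str(txt_dest))
    return pdf_dest, txt_dest, image_dir
```

A call such as `file_source("EXAMPLE_ID", "download.pdf", "download.txt", "corpus", has_images=True)` would rename and move both files and create an empty image folder for the extracted pages. The page extraction itself still happens in Adobe Acrobat, as described in step 6.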

Sean Purcell, 2023 - 2025. Community-Archive Jekyll Theme by Kalani Craig is licensed under CC BY-NC-SA 4.0. Framework: Foundation 6.