Digital Methods: The Dataset and Corpus

The first two chapters of this dissertation offer arguments regarding medicine’s visual culture (1.1.2; 2.2.2). These chapters depend on a pair of collections developed during the process of research. Corpora are collections of written materials, and they are used regularly in text-based digital humanities research methods, ranging from textual analysis, which counts the instances of a word throughout the selected text or corpus,¹ to sentiment analysis, which looks for words that carry positive or negative valences in the language, to topic modelling, which looks for words that group together as topics across the selected written material.² The tuberculosis corpus is a collection of over 700 books published between 1882 and 1926, which were available on HathiTrust (X.1.1). These materials speak to the breadth of research published in English about tuberculosis during this period. They are, for the most part, monographs, research journals, and pamphlets. I also included tuberculosis-specific publications like The American Review of Tuberculosis and public-facing journals like The Journal of Outdoor Life. I would estimate that they represent around 65 to 75% of the total books on tuberculosis printed in English from the period.³ Research published in more general journals like The Lancet, or in regional medical research journals, was omitted. I chose to exclude these collections because they speak to a more general medical audience, and because reviewing them to extract tuberculosis-specific articles would have been too time-consuming. More information about the method and description of the tuberculosis corpus can be found in the appendix (X.1.1; X.1.2).
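The simplest of the text-based methods described above, counting the instances of a word across a corpus, can be sketched in a few lines of Python. This is only an illustration and not the workflow used for the tuberculosis corpus: the function name `count_term` and the two sample texts are hypothetical stand-ins for the full texts of the corpus.

```python
import re

def count_term(texts, term):
    """Count case-insensitive, whole-word occurrences of `term` in each text.

    `texts` maps a title to the full text of that publication.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return {title: len(pattern.findall(body)) for title, body in texts.items()}

# Hypothetical sample texts standing in for publications in a corpus.
sample_corpus = {
    "pamphlet_a": "The sanatorium offered rest. Each sanatorium stood in open air.",
    "report_b": "Sputum was stained and examined; the Sanatorium kept records.",
}

per_text = count_term(sample_corpus, "sanatorium")
total = sum(per_text.values())
```

Counting per publication rather than only in aggregate keeps it visible which texts drive a given total, which matters when a few long volumes dominate a collection.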

From the tuberculosis corpus, I developed a hand-coded image dataset, which describes each photograph and illustration in these publications. At present, I have coded around 40% of the total images extracted from the corpus, which is enough to speak to broad trends in the period of interest. The specific method and protocols used to generate the image dataset have been collected in a pair of appendices (X.1.4; X.1.5).

When I write about the tuberculosis corpus, I am referring to the full collection of texts I developed for this research study (X.1.1; X.1.2), and when I describe the image dataset, I am referring to the images which were analyzed within the larger corpus (X.1.3; X.1.4). I often use these terms together, because they depend on the same selection of texts, but describe different aspects of them.

I developed these materials for a few methodological reasons. First, I wanted to have a collection of materials that would be open for textual analysis. While I did not rely on any big data or machine learning approaches for my conclusions, I did touch on these to double-check my assumptions. Early drafts of this dissertation also included a Text Encoding Initiative (TEI) project, marking up copies of the Henry Phipps Institute’s annual reports (0.1.1; 2.3.1; X.3.2). TEI projects encode texts in Extensible Markup Language (XML), which affords a range of computational approaches once the text has been encoded. TEI can be used for extremely detailed analyses, like counting each verb and noun used in a text, but my hope was to look for instances of case studies and descriptions of autopsies, and to generate keywords from those selections with which to search the larger corpus. This was shelved due to time limitations.
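The kind of query that TEI encoding affords can be sketched as follows. This is a minimal illustration under stated assumptions, not the shelved project itself: the TEI namespace is the standard one, but the sample document, the `type` values on the `<div>` elements ("case", "autopsy", "report"), and the function name `words_in_divs` are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Standard TEI namespace; the markup below is an invented sample.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="case">
        <p>The patient reported cough and fever.</p>
      </div>
      <div type="autopsy">
        <p>The left lung showed cavities. The right lung showed adhesions.</p>
      </div>
      <div type="report">
        <p>The dispensary treated many patients this year.</p>
      </div>
    </body>
  </text>
</TEI>"""

def words_in_divs(tei_xml, div_type):
    """Count the words inside every <div> whose type attribute is `div_type`."""
    root = ET.fromstring(tei_xml)
    counts = Counter()
    for div in root.iterfind(f".//tei:div[@type='{div_type}']", TEI_NS):
        text = " ".join(div.itertext())
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

autopsy_words = words_in_divs(SAMPLE_TEI, "autopsy")
```

Frequent words from the selected passages (after filtering out common function words like "the") could then serve as candidate search terms for the larger corpus.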

Second, I wanted to create a broad collection of materials for further research. Early in the project, I had hoped the image dataset might be useful for other historians of medicine, although I am not sure whether I should distribute the final image dataset once it is completed. This is because I began working on this project before I started thinking about ethical approaches to research into the lives of unwilling subjects, and I have not yet been able to think through an opaque approach to historical datasets: that is, an approach that tries to respect a subject’s right not to be included in a research project (0.1.5; 0.2.2; 4.1.3).⁴

The development of the dataset has proven to be very helpful for the work of the first half of this dissertation (1.0.0; 2.0.0), but it also served as the basis of the broader critical problems which guided the second half’s commitment to ethics. This is because the data-driven approach I have taken can be seen as antithetical to the loose, subjective, and embedded assumptions which guide humanistic methods.⁵

A social scientist might see the image dataset as a poorly thought-through, unreproducible bastardization of content analysis.⁶ They would not be wrong, at least in the most technical sense. Content analysis needs a team of coders who devise a set structure and approach to their materials, so that their observations can be extrapolated beyond their sample. They develop a rigid codebook, and work to reduce error and noise between team members. Much of this is guided by a principle of positivism and extrapolation: it assumes that the sample set can represent the printed, imaged, or otherwise distributed media beyond itself. A couple of hundred news articles examined through content analysis can statistically point to trends in the hundreds of thousands of other news articles printed during that period on the same subject.⁷
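One common way that content analysis teams check for noise between coders is a chance-corrected agreement statistic. The sketch below uses Cohen’s kappa for two coders as an illustration; the sources cited here are not quoted as prescribing this particular measure, and the category labels ("portrait", "xray") are hypothetical.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders labelling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty set of items.")
    n = len(coder_a)
    # Share of items on which the two coders chose the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical codes assigned to the same four images by two coders.
coder_a = ["portrait", "xray", "xray", "portrait"]
coder_b = ["portrait", "xray", "portrait", "portrait"]
kappa = cohens_kappa(coder_a, coder_b)
```

A value near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance. A single coder working alone, as in this project, has no second set of codes to compare against, which is part of what makes the dataset unreproducible in the technical sense.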

I am less interested in this statistical certainty, partly because there is a rather small limit to tuberculous publications in English in the period of inquiry. More importantly, though, these approaches tend to rely on a hypothesis-observation model of experiment, which has some flexibility when generating a codebook, but loses that flexibility once large-scale coding begins in earnest. When the codebook is being developed, the coding team can negotiate how they will address the muddier examples in their selection, but once the coding begins at a larger scale any additional changes are much more difficult to execute. This dissertation instead depends on a recursive relationship with primary and secondary materials. The term ‘recursive’ is common in mathematics and computer programming, where it describes a method applied repeatedly to its own earlier results. I use it here to stress the transformation of a project over time, adopting methods at different times and for different reasons based on the outcomes of earlier explorations (3.1.3; 4.1.2).

For this dissertation I applied a mix of different methods: I built the corpus and image dataset; I created arts-based installations (3.2.1; 3.2.4; 3.3.1); and I guided the development of a digital dissertation platform (4.1.1). Each of these practices emerged out of differing personal and epistemic goals, and they changed as I worked on them. Each method afforded a different framework to think through my research questions, and each provided a new set of observations, limitations, and problems. Each new method bent the dissertation in oblong ways, distorting the relationship between myself and my research objects (0.1.4).⁸

In making this distinction between content analysis and the methods employed in this dissertation, my intent is not to belittle social scientific methods, which are in some ways very similar to my own, but to recognize their relation to this project and to demarcate this project’s methodological differences (0.1.5). The process of coding and generating the dataset was, most simply, a process of better understanding the materials with which I worked. While I did not read every one of the hundreds of books in the corpus, I did have to enter the metadata for each book by hand into my organizational Airtable document; I also had to look at each image to describe it within the working schema I had adopted (X.1.4). Adopting a digital method is often assumed to displace labor onto the computational tool; however, for those tools to be useful, a great deal of work often needs to be done by hand to produce legible results. This hand-coded approach brought with it a nuanced understanding of the broader discourses around medical science and the study of tuberculosis (4.3.4).

1. The most prominent scholar in text analysis has since been accused of sexual assault, so you will not find anything about him in this dissertation. Liu, Fangzhou, and Hannah Knowles. “Retired English Professor Accused of Sexual Assault by Former Graduate Student.” The Stanford Daily, November 9, 2017. https://stanforddaily.com/2017/11/09/english-professor-accused-of-sexual-assault-by-former-graduate-student/; Mangan, Katherine. “2 Women Say Stanford Professors Raped Them Years Ago.” The Chronicle of Higher Education, November 11, 2017. https://www.chronicle.com/article/2-women-say-stanford-professors-raped-them-years-ago/?sra=true&cid=gen_sign_in.
2. Guldi, Jo. “Critical Search: A Procedure for Guided Reading in Large-Scale Textual Corpora.” History Faculty Publications 12 (2018): 1–35.
3. This is based on the collections of the New York Academy of Medicine’s Rare Books Library, which had a number of books that are not included in the main corpus.
4. This is a big problem in history, as described by scholars of the transatlantic slave trade. Hartman, Saidiya. “Venus in Two Acts.” Small Axe 12, no. 2 (2008): 1–14; Johnson, Jessica Marie. “Markup Bodies: Black [Life] Studies and Slavery [Death] Studies at the Digital Crossroads.” Social Text 36, no. 4 (2018): 57–79.
5. Posner, Miriam. “The Radical Potential of the Digital Humanities: The Most Challenging Computing Problem Is the Interrogation of Power.” LSE Impact Blog (blog), July 2015. https://blogs.lse.ac.uk/impactofsocialsciences/2015/08/12/the-radical-unrealized-potential-of-digital-humanities/; Drucker, Johanna. “Humanities Approaches to Graphical Display.” Digital Humanities Quarterly 5, no. 1 (2011).
6. Holsti, Ole R. Content Analysis for the Social Sciences and Humanities. Reading: Addison-Wesley, 1969; Krippendorff, Klaus. Content Analysis: An Introduction to Its Methodology. Newbury Park: Sage, 1980; Tracy, Sarah J. Qualitative Research Methods: Collecting Evidence, Crafting Analysis, Communicating Impact. Malden: Wiley-Blackwell, 2013.
7. There are a number of articles that cover content analysis in The International Encyclopedia of Communication Research Methods. Matthes, Jörg, Christine S. Davis, and Robert F. Potter, eds. The International Encyclopedia of Communication Research Methods. Hoboken: Wiley-Blackwell, 2017.
8. Add a note about grounded theory.
