Yan, H., Das, S., Lavoie, A., Li, S., & Sinclair, B. 2019. “The congressional classification challenge: Domain specificity and partisan intensity.” In Proceedings of the 2019 ACM Conference on Economics and Computation, Pp. 71-89.
In this paper, we study the effectiveness and generalizability of techniques for classifying partisanship and ideology from text in the context of US politics. In particular, we are interested in how well measures of partisanship transfer across domains as well as the potential to rely upon measures of partisan intensity as a proxy for political ideology. We construct novel datasets of English texts from (1) the Congressional Record, (2) prominent conservative and liberal media websites, and (3) conservative and liberal wikis, and apply text classification algorithms to evaluate domain specificity via a domain adaptation technique. Surprisingly, we find that the cross-domain learning performance, benchmarking the ability to generalize from one of these datasets to another, is in general poor, even though the algorithms perform very well in within-dataset cross-validation tests. While party affiliation of legislators is not predictable based on models learned from other sources, we do find some ability to predict the leanings of the media and crowdsourced websites based on models learned from the Congressional Record. This predictivity is different across topics, and itself a priori predictable based on within-topic cross-validation results. Temporally, phrases tend to move from politicians to the media, helping to explain this predictivity. Finally, when we compare legislators themselves across different media (the Congressional Record and press releases), we find that while party affiliation is highly predictable, within-party ideology is completely unpredictable. Legislators are communicating different messages through different channels while clearly signaling party identity systematically across all channels. Choice of language is a clearly strategic act, among both legislators and the media, and we must therefore proceed with extreme caution in extrapolating from language to partisanship or ideology across domains.