Speech and dialogue are the heart of politics: nearly every political institution in the world involves verbal communication. Yet vast literatures on political communication focus almost exclusively on what words were spoken, largely ignoring how they were delivered: the auditory cues that convey emotion, signal positions, and establish reputation. We develop a model that opens this information to principled statistical inquiry: the model of audio and speech structure (MASS). Our approach models political speech as a stochastic process shaped by fixed and time-varying covariates, including the history of the conversation itself. In an application to Supreme Court oral arguments, we demonstrate how vocal tone signals crucial information, namely skepticism of legal arguments, that is indecipherable to text models. Results show that justices do not use questioning to strategically manipulate their peers but rather engage sincerely with the presented arguments. Our easy-to-use R package, communication, implements the model and many more tools for audio analysis.