By: Patrick W. Zimmerman
The Donald is a different kind of politician. He is the Teflon Man, the politician who can admit to grabbing women by their privates, call his opponent a nasty woman, have a full-on bromance with the Most Interesting ex-KGB Agent in the World, and get in his morning tweets instead of attending intelligence briefings.
We started the Trump Watch specifically to monitor how often the Tweeter in Chief inched the world closer to nuclear catharsis by insulting a nuclear power. It’s terrifying, but it made me wonder who else he was talking about. More importantly, how can we tease out the subjects he discusses that are truly orange-flavored crazy from those that are simply the news of the day that everyone (or at least a garden-variety politician) is tweeting about?
Computational linguistics to the rescue!
The question
How unusual, compared to other politicians who tweet, is Trump’s Twitter activity? That is to say, what terms (n-grams, for the purposes of this study) are important to his tweets but not commonly used by other politicians?
Term significance map
For those of you hoping that much of the hype about our new President-Elect’s social media habits was simply a function of the outsized attention he received, you’re out of luck. Thanks for playing. The Donald is who we thought he was. Insults, an obsession with the mainstream news media and his political rivals, and his repeated bullying of big companies to pull out of Mexico all feature prominently.
His most important term (by some distance) is….wait for it…dishonest.
Yeah. That seems about right.
As a note, terms with a wildcard like hack* are stems; the asterisked n-gram also includes “hacks”, “hacking”, “hacked”, and so on.
Digging deeper, we see a number of terms that hit the magic upper-right quarter of the graph, being both very important within @realDonaldTrump’s corpus and relatively unique within the context of all congressional and senatorial Twitter accounts.
Few other politicians were as concerned with the DNC, as only 3 other accounts mentioned it at all (The Donald tweets about it 4 times). I expected the media to score relatively high, as there’s a prevailing perception that he’s a bit obsessed with it. And, yup, it’s the 2nd-most frequent term in his corpus (but mentioned by a relatively low 26 other accounts out of the 512 active on the CSPAN list). Rounding out the (completely unsurprising) list of terms unique and important to Trump, we see mexic*. While he argues about whether or not he can magically get them to pay for construction of a border wall, the rest of Washington seems less enthusiastic (only 20 accounts using the term).
plant makes a somewhat surprising appearance in the upper-right corner of the graph, a result of the recent targeting of automakers who are building, built, or considering building auto plants abroad (specifically, in Mexico). So if you file it under the category of “jobs and trade” it fits into expectations much better. Also, while it’s not that significant since he didn’t use it that often, it seems rather appropriate for his Twitter persona that literally no one else in the corpus used :/, the “meh” emoji.
His top 25 most significant terms
Here’s a single-axis breakdown by TF-IDF (the diagonal axis in the above graph).
Since I know you are wondering, Make America Great is #35. He has used the term only twice in the last couple of weeks, and there are 18 other accounts that also used it.
Some notable absences from the top of his list
- great. It is, indeed, one of his favorite adjectives, but it’s not remotely unique. 233 other accounts in the dataset mention it. It’s also possible that he uses it in speech more than in writing (or tweeting, which is almost the same thing).
- russia*. He mentions it a fair amount but so do 115 other accounts.
- obamacare. Same as above. He talks about it, but so do 155 others.
- obama. He surprisingly doesn’t mention the outgoing president nearly as much as he does the dems, the dnc, or hillary. As expected, 152 congressional accounts mention the President by name.
- ISIS, terror*, muslim, or islam*. In a major change from his pointedly anti-Muslim and tough-on-terrorism talk on the campaign trail, he has mentioned these terms, combined, just once (ISIS). Maybe it’s part of not telling them his strategy before he bombs the shit out of them.
The methodology
I wanted to wade into the Trump Twitterverse looking for term importance rather than simply term volume, and also to compare @realDonaldTrump to other Twitter accounts run by politicians. Or, more accurately, a combination of politicians and their staffers. Comparing him to Joe Schmoe (or even Joe the Plumber) wouldn’t be a fair test, since it’s expected that someone encountering the minutiae of government in their everyday life would tweet about those things at a higher rate than, say, about their labradoodle (or maybe not. It’s Trump, after all).
Thus, I used a relatively common computational linguistics tool called Term Frequency – Inverse Document Frequency (TF-IDF) to weigh the most frequent n-grams in Trump’s Twitter account (Term Frequency) modified by how many of the congressional accounts also use the same term (Inverse Document Frequency).
For TF, I used maximum TF normalization to express each term’s frequency relative to the frequency of the most common term in the document (to reduce the bias toward longer documents, which tend to repeat the same words more often). Thus, our TFnorm = 0.4 + (1 - 0.4) x (TFraw / TFmax)
I used a standard IDF function, so IDF = Log10(# docs / # docs using term)
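To make the two formulas concrete, here is a minimal Python sketch of the scoring. It is not the code behind the dashboard: the function names, the toy corpus, and the choice to treat each account as one document (a list of already-tokenized terms, with Trump's account counted among the documents) are all assumptions for illustration.

```python
import math
from collections import Counter

def normalized_tf(term_counts, k=0.4):
    """Maximum TF normalization: k + (1 - k) * (TFraw / TFmax)."""
    tf_max = max(term_counts.values())
    return {term: k + (1 - k) * (count / tf_max)
            for term, count in term_counts.items()}

def idf(term, documents):
    """IDF = log10(# docs / # docs using term)."""
    docs_using = sum(1 for doc in documents if term in doc)
    return math.log10(len(documents) / docs_using)

def tf_idf(target, documents):
    """Score every term in `target`, one of the token lists in `documents`."""
    # Assumption: `target` is itself in `documents`, so docs_using >= 1.
    doc_sets = [set(doc) for doc in documents]
    tf = normalized_tf(Counter(target))
    return {term: tf[term] * idf(term, doc_sets) for term in tf}

# Toy corpus (made up, not real tweet data): one target account and three peers.
target = ["dishonest", "dishonest", "media", "great"]
peers = [["great", "jobs"], ["great", "media"], ["jobs"]]
scores = tf_idf(target, [target] + peers)
```

In the toy corpus, "dishonest" wins on both axes (most frequent in the target, used by no one else), while "great" is dragged down because most of the other accounts use it too, which is the same effect described for the real data above.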
I pruned the list of stopwords (English grammatical words like “and”, “but”, “a”, and “the”) as well as a few terms which aren’t terribly informative (“said”, “make”, “today”, etc.), then took the top 20 most important terms (to spare you all the task of endless scrolling).
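The pruning and ranking step might look something like the sketch below. The word lists contain only the examples named above (the full lists aren't given here), the function name is made up, and whether pruning happens before or after scoring is an assumption; this version filters an already-scored dictionary.

```python
# Only the examples named in the text; the real lists are longer.
STOPWORDS = {"and", "but", "a", "the"}
UNINFORMATIVE = {"said", "make", "today"}

def top_terms(scores, n=20, exclude=STOPWORDS | UNINFORMATIVE):
    """Drop excluded terms, then return the n highest-scoring (term, score) pairs."""
    kept = {term: score for term, score in scores.items()
            if term.lower() not in exclude}
    return sorted(kept.items(), key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical scores, for illustration only.
example = {"the": 0.9, "dishonest": 0.6, "said": 0.5, "media": 0.2}
ranked = top_terms(example, n=2)
```

Here `ranked` keeps "dishonest" and "media" and drops "the" and "said", even though both dropped terms scored higher than "media".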
What’s next?
We keep watching. We’ve only been scraping Trump since just after election day, and the corpus of his peers for a little over 2 weeks. Inauguration day is 10 days away.
Note: Within 12 hours of this article’s publication, fakenews, witch hunt, and small business entered the dataset, kind of proving this point. One assumes the events that triggered those tweets are unrelated. Probably.
Buckle up. I predict Trump isn’t going to rein in the Great Twitter Fountain any time soon.
48 hours later, a term which didn’t appear in the corpus at all now has the 3rd-highest TF-IDF score. fake news is the big news of the last few days.
He’s mentioned it 7 times since Tuesday.
For the continually updated version of the term significance dashboard on this page, see the Trump Watch page.