New Dataset – @realDonaldTrump Tweet corpus

  1. Presenting the Trump Watch (tick…tick…tick..)
  2. New Dataset – @realDonaldTrump Tweet corpus
  3. Just how oddball is @realdonaldtrump compared to other politicians?
  4. The anti-Trump shield: A defensive charity finder to help you put your money where his mouth is
  5. Breaking down the Women’s March: Even small-town marchers out-numbered the inauguration
  6. Trump cabinet bingo card generator
  7. Trump Watch historical: How has @realDonaldTrump’s use of Twitter evolved since Election Day?
  8. The tweet rate monitor: Is there any rhythm to Trump’s twitter blasts?
  9. Mapping the pep rally president
  10. Let’s play Trump Twitter bingo!
  11. The anti-Trump charity guide: What has Trump been effective at attacking in his first 9 months?
  12. Trump Watch retrospective: What did we learn in 2017?
  13. Fear and loathing of a very stable genius: What causes Trump’s rage tweets?
  14. Methodology and madness: How Trump’s tweet subjects have evolved over time
  15. Trump’s Twitter account charts a history of his presidency
  16. Trump’s tweet frequency is not dependent on his work schedule
  17. Trump’s Twitter pulpit is slowly losing its effectiveness
  18. How did President’s Trump language influence the El Paso shooter’s manifesto?

By: Richard W. Sharp

Release the tweets! We’ve made the Trump Watch’s database of tweets available on our new downloads page. These include all tweets from @realDonaldTrump going back to November 10, 2016. They were collected with Twitter’s public API using the query from:realDonaldTrump. Both the raw tweets and the labels we have added are available. Each raw tweet file has a name in the format tweets_by_realDonaldTrump_yyyymmdd.json, where the date is the day the tweets were collected, not the day they were posted. The time a tweet was posted is recorded in its “created_at” field. A complete description of the information contained in a tweet is maintained by Twitter.
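To keep the two dates apart when working with the raw files, a small helper along these lines may be useful. This is a sketch, not part of the published dataset: the function names are hypothetical, and the timestamp pattern assumes the “created_at” string follows the format Twitter's standard API uses (for example, "Sat Dec 17 12:30:00 +0000 2016").

```python
import re
from datetime import date, datetime

# File names follow tweets_by_realDonaldTrump_yyyymmdd.json (from the post).
FILENAME_RE = re.compile(r"tweets_by_realDonaldTrump_(\d{8})\.json$")

# Assumed layout of Twitter's "created_at" string.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def collection_date(filename: str) -> date:
    """Return the day a raw file was collected, taken from its name."""
    match = FILENAME_RE.search(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return datetime.strptime(match.group(1), "%Y%m%d").date()


def created_at(tweet: dict) -> datetime:
    """Return the time a tweet was posted, taken from the tweet itself."""
    return datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
```

For a file named tweets_by_realDonaldTrump_20161217.json, collection_date gives 17 December 2016, while created_at reads the posting time from each tweet inside the file.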

Since a search for tweets with the public API returns results from roughly the past week, we end up collecting the same tweet for several consecutive days. Each of the raw files contains at most one copy of each tweet (if we collected tweets more than once in a day, it’s the most recent version). However, the same tweet will typically appear in several of the files. Why keep duplicates? Because they’re not duplicates. Some features of a tweet change over time, such as the retweet count, which can give us insight into some short-lived trends. Sadly, we did not capture the recent “unpresidented” tweet, because it appeared and was corrected (in 27 minutes) faster than our collection updates, but it provides a good example of why it’s useful to archive the statements of public figures.
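One way to put the repeated copies to work is to line them up by collection day and watch the retweet count change. The sketch below makes two assumptions that the post does not confirm: that each raw file parses as a JSON array of tweet objects, and that each tweet carries the “id_str” and “retweet_count” fields from Twitter's standard tweet format. The function names are hypothetical.

```python
import json
from collections import defaultdict


def load_snapshot(path):
    """Load one raw file (assumed here to hold a JSON array of tweets)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def retweet_history(snapshots):
    """Map each tweet id to its (collection day, retweet count) pairs.

    `snapshots` maps a collection day string (yyyymmdd, as in the file
    names) to the list of tweets collected that day.
    """
    history = defaultdict(list)
    # yyyymmdd strings sort chronologically, so plain sorting is enough.
    for day in sorted(snapshots):
        for tweet in snapshots[day]:
            history[tweet["id_str"]].append((day, tweet["retweet_count"]))
    return dict(history)
```

Each tweet id ends up with one entry per file it appeared in, ordered by collection day, which is enough to plot how quickly a tweet spread during the week it stayed in the search window.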

For the Trump Watch, we classify each tweet for sentiment and whether or not it’s an insult. The file trump_dump.csv contains the unique id and text of each tweet, as well as the tags we use for classification and any notes. Please note that although this is a .csv file, it uses the | character rather than a comma as the delimiter between fields, which simplifies parsing since commas are so common in the tweet text field.
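Most CSV readers need to be told about the | delimiter. A minimal sketch with Python's standard csv module follows. The sample row is made up for illustration, and the column order, the presence of a header row, and the quoting rules of the real file are not specified in this post, so check them against trump_dump.csv before relying on this.

```python
import csv
import io


def read_pipe_delimited(stream):
    """Read rows from a text stream whose fields are separated by '|'."""
    return list(csv.reader(stream, delimiter="|"))


# Hypothetical sample row: id, text, tags. Not taken from the real file.
sample = io.StringIO("1|Great meeting, really great, with many people!|#StaCHN\n")
rows = read_pipe_delimited(sample)

# Commas inside the tweet text stay inside a single field.
print(rows[0][1])
```

To read the real file, open it with open("trump_dump.csv", newline="", encoding="utf-8") and pass the handle to read_pipe_delimited. If tweet text containing quotation marks is parsed incorrectly, passing quoting=csv.QUOTE_NONE to csv.reader may help, depending on how the file was written.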


Here is how we categorized tweetID 810121703288410112:

| Tag | Definition | Example |
|---|---|---|
| State | All references to a country or similar entity (e.g., the United Nations, ISIS), as represented by the official apparatus of government (e.g., until 20 Jan 2017, “USA” implies the Obama administration). Uses ISO-standard 3-letter country codes. | #StaCHN |
| State Sentiment | The sentiment (in the eye of the tweeter) implied by each state reference. This can be positive, negative, or neutral. | #SsnCHNNeg |
| State Insult | Whether the reference to each state is an insult or a compliment (in the eye of the target state). | #SinCHNIns |
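The three example tags share a pattern: a three-letter prefix (Sta, Ssn, Sin), a three-letter code, and an optional value such as Neg or Ins. The sketch below splits tags on that pattern. It is inferred only from the three examples above, so the full set of value suffixes and the codes used for non-country entities are unknown here, and the names StateTag and parse_tags are hypothetical.

```python
import re
from typing import NamedTuple

# Pattern inferred from #StaCHN, #SsnCHNNeg and #SinCHNIns; the real tag
# vocabulary may include forms this does not cover.
TAG_RE = re.compile(r"#(Sta|Ssn|Sin)([A-Z]{3})([A-Za-z]*)")


class StateTag(NamedTuple):
    kind: str   # "Sta" (state), "Ssn" (sentiment) or "Sin" (insult)
    code: str   # three-letter code, e.g. "CHN"
    value: str  # e.g. "Neg" or "Ins"; empty for plain state tags


def parse_tags(text):
    """Extract every state-related tag found in a tag string."""
    return [StateTag(*match.groups()) for match in TAG_RE.finditer(text)]


for tag in parse_tags("#StaCHN #SsnCHNNeg #SinCHNIns"):
    print(tag.kind, tag.code, tag.value or "-")
```

Grouping the parsed tags by code then gives, for each state mentioned in a tweet, its sentiment and insult labels side by side.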


We will continue to regularly update and add to the collection. 

About The Author

Richard is a Seattle area data scientist who builds predictive models and the services that deliver them. He earned a PhD in Applied and Computational Math from Princeton University, and left academia for the dark side of science (industry) in 2010, following his wife to the land of flannel. Fan of coffee, beer, backpacking and puns. Enjoys a day on the lake fishing, and, better, cooking up the catch for a crowd.
