I’ve spent the better part of the last few hours reinstalling my Ubuntu virtual machine from scratch after completely botching my previous install’s configuration. How? I was attempting to get Python 2.5 up and running with a few custom packages, and ended up accidentally removing everything Python-related, which included many rather important system packages. Synaptic then froze up, and most of the basic system operations stopped working.
Freaking awesome. Just how I wanted to spend my Saturday evening.
But now that I have it working again, I wanted to delve into my latest project: a semi-intelligent Twitterbot! There are three core components to this project:
- Read the public Twitter timeline to accumulate posts.
- Build a Markov Model out of all accumulated posts.
- Use a cronjob to control how often the first two steps run.
I’ve posted about Hidden Markov Models before, and this is an example of theory put into practice. Granted, the utility of this application is questionable, but if for no other reason, it sure is entertaining. In fact, since activating my Twitterbot a little over a week ago, it’s already garnered a decent response. Here are some of my favorites thus far:
It’s endlessly amusing to me that so many people seem to think my bot is actually a person. At least a few also seem to be amused by its antics. Still others respond as though nothing is amiss. It’s also managed to flag down multinational Twitter users. And it’s even attracted the attention of other bots!
How does it work, ya say? Welllllll…
The underlying assumption of HMMs is that there is a hidden state that influences the output we actually observe. In this context, it means we’re assuming there’s an unobservable pattern behind the sentences Twitter users post, and that this pattern produces the actual words we can see. Thus, if we observe enough of these posts, we should, in theory, be able to infer those hidden states.
Yeah yeah, that wasn’t very simply put. Nevertheless, let’s move on.
The assumption my bot makes is pretty straightforward: each word that is observed depends only on the word before it. Put another way, this means that, given a single word, there is only a certain number of words that can come after it. Of this finite set of possible next words, some are much more likely than others.
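To make that concrete, here is a minimal sketch of how the "which words can follow this word, and how likely are they" table could be built. It is not the bot's actual code: the function names `build_transitions` and `next_word_probabilities` and the three sample posts are made up for illustration, and splitting on whitespace is an assumption about tokenization. Strictly speaking, what it builds is a plain Markov chain over the observed words.

```python
from collections import Counter, defaultdict

def build_transitions(posts):
    """Count, for every word, which words follow it across all posts."""
    transitions = defaultdict(Counter)
    for post in posts:
        words = post.split()  # assumption: whitespace tokenization
        for current, following in zip(words, words[1:]):
            transitions[current][following] += 1
    return transitions

def next_word_probabilities(transitions, word):
    """Turn the raw follower counts for one word into probabilities."""
    counts = transitions.get(word, Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Hypothetical sample posts, standing in for the accumulated timeline.
posts = [
    "i love my cat",
    "i love coffee",
    "i need coffee",
]
transitions = build_transitions(posts)
probs = next_word_probabilities(transitions, "i")
print(probs)
```

In this toy data only "love" and "need" ever follow "i", and "love" shows up twice as often, so it gets twice the probability.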
This makes intuitive sense. Take any one of the sentences in this post, for example: after you read one word, you’re already expecting a certain word or number of words that could follow it (it’s how we read, in fact; ever heard that humans only actively read about 70% of the words on a page? all the others are inferred by this same method). It’s basically a primitive form of contextual analysis.
From a technical standpoint, this dependence on only a single previous word is called a “first-order” Markov Model. The order can go as high as you’d like. There is another similar Twitterbot built by a friend of mine which uses a “second-order” Markov Model, in that each word depends on the two previous words, resulting in sentences that probably make more sense than mine will. But for those of you who are ahead of me, this also means much more of the original posts used to build the model will show up in the generated posts.
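The first-order versus second-order tradeoff can be sketched in a few lines. This is only an illustration under assumptions: `build_model`, `average_choices`, and the sample posts are hypothetical names and data, not taken from either bot. The idea is that a higher order leaves fewer distinct choices after each state, so the output sticks closer to the original posts.

```python
from collections import defaultdict

def build_model(posts, order):
    """Map each run of `order` consecutive words to the words seen after it."""
    model = defaultdict(list)
    for post in posts:
        words = post.split()
        for i in range(len(words) - order):
            state = tuple(words[i:i + order])
            model[state].append(words[i + order])
    return model

def average_choices(model):
    """Average number of distinct next words per state (lower = less random)."""
    return sum(len(set(followers)) for followers in model.values()) / len(model)

# Hypothetical sample posts.
posts = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat ate the fish",
]
first = build_model(posts, order=1)
second = build_model(posts, order=2)
print(average_choices(first), average_choices(second))
```

With this toy data the first-order model has more distinct options per state than the second-order one, which is the "more random, less sensible" behavior described above.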
And honestly, I wanted my bot’s posts to be as random as possible while still kind of making sense 😛 Hence I purposefully implemented a first-order model.
My bot accumulates 800 posts from the public feed over 20 minutes, then uses those posts to build a first-order HMM. It then constructs a new post by sampling from that model, and posts it to the Twitter account.
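Here is a rough sketch of the build-then-sample step. It leaves out fetching the timeline and posting to Twitter entirely, and it makes several assumptions that are not from the bot itself: the `START`/`END` sentinel tokens, the function names, the sample posts, and a 140-character cap on the generated post.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentinels marking post boundaries

def build_model(posts):
    """First-order model: each word maps to the list of words seen after it."""
    model = defaultdict(list)
    for post in posts:
        words = [START] + post.split() + [END]
        for current, following in zip(words, words[1:]):
            model[current].append(following)
    return model

def generate_post(model, rng, max_chars=140):
    """Walk the chain from START until END is drawn or the cap is reached."""
    words = []
    length = 0
    current = START
    while True:
        # Duplicates in the list make frequent followers more likely.
        current = rng.choice(model[current])
        if current == END:
            break
        added = len(current) + (1 if words else 0)  # +1 for the joining space
        if length + added > max_chars:
            break
        words.append(current)
        length += added
    return " ".join(words)

# Hypothetical sample posts, standing in for the 800 accumulated ones.
posts = [
    "just had the best coffee ever",
    "the best part of my day is coffee",
    "my cat just knocked over my coffee",
]
model = build_model(posts)
post = generate_post(model, random.Random(42))
print(post)
```

Storing followers as a list with repeats, rather than as explicit probabilities, keeps the sampling step down to a single `choice` call.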
If you’re interested in following the bot, you can find it here.
I’m in the process of refining the current model, perhaps into a hybrid first/second-order HMM. I may also weight some topics more heavily than others, so the generated posts more accurately reflect the trending topics. And of course, I’m open to suggestions!
Yes, this bot provides a wonderful source of amusement, particularly since I am in way over my head with schoolwork. Applying for jobs, applying to PhD programs, conducting research with Dr Murphy, and actually keeping up with my coursework are all proving very difficult to juggle these last few weeks of the semester. So it’s nice to have a new joke to read every 20 minutes!
I also want to mention that, a few days ago, my total hits on this blog surpassed 20,000. Thank you again to all those who seem to find something on this blog interesting 🙂