Language Detection of Social Media Data

When you go to a website or open an email in another language, you'll often see a message like this:

Translate

In some cases you'll think, "Wow how did they know?!" And in others, it's more like, "Why does Google think my message is in Malaysian?"

Many tools have been developed to detect the language of a piece of text. However, these tend to fail on short texts as well as social media text. With the rise of platforms such as Twitter and Facebook, it's important to be able to accurately detect the language of generally unstructured and noisy data.

Screen_shot_2012-02-09_at_6

Since our beginning we used the Python guess_language package to filter out English posts, and it worked great on longer texts such as blog posts but tended to fail on short posts, especially Twitter posts. As a result, we decided to develop our own language detection package trained on social media data. 

Using the Amazon Mechanical Turk system we built, we gathered over 15,000 labeled posts. We also used the Twitter API to gather over a million tweets. We used a Naive Bayes model to build a classifier. Since we did not have enough labeled foreign data (especially before we built our Mechanical Turk system), we chose to use an Expectation-Maximization algorithm to cluster the data into language classes. We trained the classifier by creating a set of ngrams from each post. 

Algorithm Details

The Naive Bayes Model

Our language model is constructed as follows:

Eq1_real

That is, to determine the language of a post, we pick the language that maximizes the probability of being that language given the n-grams in the post. In this equation, z is a normalization constant equal to the probability of the evidence or in this case the ngrams.

We can break 1/zP(ngrams) up like this (Law of Total Probability):

Eq2_real

Due to independence assumptions of the Naive Bayes model, we can rewrite the original equation as follows: 

Eq1_rewrite

In the equation above, the unnormalized value gets really small, so to account for underflow issues, we perform our calculations as a sum of logs:

Eq3_small

Constructing the N-grams

In this case, by n-gram we mean an n-letter slice of a word. We use non-overlapping n-grams from each word (including an additional space the end of each word). 

For example, trigram construction for the string "hello she loves cats" becomes ["hel", "llo", "she", "lov", "es ", "cat"]. 

Expectation-Maximization

We use the following algorithm for calculating the final probabilities:

1) Assign random probabilities to each of the posts in the training data

2) E-step: using the inital probabilities from 1) , we recalculate the priorP(language) and likelihood P(post | language), applying the normalization.

3) M-step: using probabilities calculated in step 2), we recalcuate the posterior probailities, P(post | language).

4) Repeat E & M steps until convergence.

Results

We experimented with lots of things, such as the size of the n-grams, using a purely supervised method of training data (with the small amount of labeled data we had), and combining both approaches. In the end we tested our classifier against guess_language and Google's Chromium language detector.

In a set of 4000 tweets (half English, half non-English), the performance of the three methods were as follows: 

Table3

Demo

We made a demo of the classification of Tweets, comparing the three methods. The tweets are retrieved from the Buzzient Post API. Check it out here

Social CRM Engagement API Now Available

Celebrate-hippies

We are happy to announce the release of our API and a Developers Site to go with it!

Our Engagement API allows you to integrate social data into your applications, allowing you to listen to and engage with your customers on social media from within your CRM, Contact Center, ERP or other software application. As our CEO Tim Jones has said, "Social CRM belongs to the customer, not the application vendor" -- our aim is to democratize Social CRM.

We already have a few large application vendors working on building social channels into their workflows and are excited at the prospect of more.

Check out our Developers Site to get your hands on a key today.

 

Using Django outside of Django

So you have this great new Django app you wrote, and it is super shiny and new. However, you've decided that you want to interact with your Django models from outside of Django. It may appear to be a daunting task at first but once you're set up, it can be just as easy as using your models from a view within your app.

 

Todo

For this post, I'm marrying a Supybot plugin to a simple Django application that manages a simple TODO list. Supybot is an IRC bot written in python. The end goal here is the ability to post our TODOs to the application from the bot and retrieve them as well.

The Setup.

We need to make our app comfortable with Django's paths. Django tends to be super finicky about paths especially here. So in order to appease Django, we need to setup its environment in the way similar to that standard applications do.

Let's go step by step here. We're going to toy with the path so that the Django application is happy. The application lives at '/home/user/laterapp'.

 
if "/home/user/laterapp/" in sys.path:   
    pass 
else:    
     sys.path.insert(0,'/home/user/laterapp/')  

We add the path to the Django application if it does not exist. Initially I did not have the if statement and had the problem that each time the plugin was reloaded, the laterapp path would be added to supybot's path each time. The if statement solved that issue.

 
from django.core.management import setup_environ 
import settings 
setup_environ(settings) 
 

Next, we import the setup_environ method from Django which sets up the Django application environment. import settings brings in the settings file from our application. Finally, setup_environ(settings) sets up our application environment.

 
os.environ['PYTHONPATH'] = ' /home/user/laterapp/' 
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings' 
 

Next, we add the application path to the python path environment variable, and the settings file to the Django settings module environment variable.

Note: The recommended way is to do this outside of the application, but for the purpose of this demonstration, it was done inside of the application

 
from web.models import LaterObject
 
   

At this point we are able to import a model from our application successfully

 
#django related imports 
import sys,os 
if "/home/user/laterapp/" in sys.path:
    pass 
else:    
sys.path.insert(0,'/home/user/laterapp/') 
from django.core.management import setup_environ 
import settings 
setup_environ(settings) 
os.environ['PYTHONPATH'] = ' /home/user/laterapp/' 
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings' 

from web.models import LaterObject,IRCUser 
#
 

Boom. Right there, that's the entirety of the code related to preparing our environment.

The Posting

So we have our environment setup and ready to get our data. Let's have a look at our models.

 
class IRCUser(models.Model):     
    ircuser_botuser = models.CharField(max_length=32)     
    ircuser_user = models.ForeignKey(User)     
    def __unicode__(self):         
        return self.ircuser_botuser 
class LaterObject(models.Model):
     later_user = models.ForeignKey(User)
     later_fullstring = models.TextField()
     later_postdate = models.DateTimeField()
     later_startdate = models.DateTimeField(blank=True)     
    def __unicode__(self):
         return self.later_fullstring[0:32] 
 
These are our simplified django models for an IRCUser and LaterObject. The IRCUser model ties a registered bot user to django's user model. The LaterObject is a todo post, with a user, textstring, postdate, and startdate (which tells when to actually show the post)

 
def later(self,irc,msg,args,text):
     #check if the user is registered with the bot
     try:
         caller= ircdb.users.getUser(msg.prefix)
     except KeyError:
         caller = None
     if not caller:
         irc.reply("You should be registered with the bot to use this feature")
     else:
         try:
             djuser = IRCUser.objects.filter(ircuser_botuser=caller.name).get()
         except:
             djuser = registerNewUserBasedOn(caller.name)
          datereveal = datetime.datetime.now()
          newlater = LaterObject(later_user=djuser.ircuser_user,later_fullstring=msg,later_postdate=datetime.datetime.now(),later_special_id=1,later_startdate=datereveal)  
        newlater.save()
         irc.reply("Saved new todo item")
 later = wrap(later, ['text']) 

The code above is the supybot plugin function attached to the later command. I'm not going to dive into writing supybot plugins, and I'm just going to say that its just slightly more involved than this is. All of the irc related functions are supybots and help with various things, such as the user who called the command and all the arguments that were passed along.

The first few lines in the code are about making sure the user is registered with the bot before writing things to the database. The djuser try: except pair looks up an IRCUser object from the database and if it does not exist, creates a new one and sends that user their site password.

For the purpose of this example, the datereveal is being set to immediate, such that the post is revealed and can be looked up immediately in the next function.

A new LaterObject is created, populated, and saved. That completes the posting function for the plugin.

 

and the Retrieval

What good is being able to post without actually being able to get things to do? Not very.

 
def getlater(self,irc,msg,args,text):
     #check if the user is registered with the bot
     try:
         caller= ircdb.users.getUser(msg.prefix)
     except KeyError:
         caller = None
     if not caller:
         irc.reply("You should be registered with the bot to use this feature")
     else:
         try:
             djuser = IRCUser.objects.filter(ircuser_botuser=caller.name).get()
         except:
             djuser = registerNewUserBasedOn(caller.name)
          post = LaterObject.objects.filter(later_user=djuser.ircuser_user).filter(later_startdate__gte=datetime.now()).order_by('?')[0]
         irc.reply(caller.name + ': '+ post.later_fullstring) 
 getlater = wrap(getlater, ['text']) 
 

Shazam!, we grab all entries from a user, filter them by their showdates, and grab a random one. We then pass it back to irc via irc.reply()

In action

 
< gmorell> !later do something 
<@Flexo> Saved new todo item 
< gmorell> !later do something else 
<@Flexo> Saved new todo item  
< gmorell> !getlater 
<@Flexo> gmorell: do something else 
 

 

 

Facebook News Feed Word Cloud

We do a lot of wicked enterprise social media analytics and integrations at buzzient, and it's given me a chance to explore the many intricacies of the facebook graph API.

This past week, for fun, I wanted to analyze some of the my own Facebook feed. So, I built a word cloud generator for your Facebook news feed. the output is an HTML page that looks like this:

Let's walk through how to actually build this. For simple applications like this, the Facebook graph API is incredibly easy to use.

First, we'll need some way to load a JSON resource from a URL. Everything the graph API returns is JSON.

def get_json(url):
    """ Return the parsed JSON from a url. """
    # print "GET %s" % url
    f = urllib2.urlopen(url)
    return json.loads(f.read())

 

Next, we can build a function that talks to the graph API and downloads the message content from your Facebook news feed. It takes an access_token, which you can get from the Graph API explorer, and the number of pages deep to pull messages from. The higher /pages/ is, the more content will be pulled from your feed and the more interesting your tag cloud will be. By default, a request returns 25 news feed entries, which may or may not contain a 'message' attribute. Photos, links and application updates often lack the message attribute.

def fb_feed_url(access_token):
    """ The graph api url for your facebook feed. access_token should be 
    a valid facebook access token as a string.
    """
    return "https://graph.facebook.com/me/home?access_token=%s" % access_token


def fb_news_feed(access_token, pages):
    """ Returns a set number of pages from a facebook news feed as a 
    list of parsed json. 
    """
    feed_url = fb_feed_url(access_token)
    page_json = []
    
    for x in range(0, pages):
        json = get_json(feed_url)
        page_json.append(json)
        feed_url = json['paging']['next']

    return page_json

 

This method returns an array of "pages" - the parsed JSON output directly from Facebook. But we're really only interested in the message content of the posts on each of those pages. We can pull those out with a list comprehension.

pages = fb_news_feed(access_token, 6)
    messages = [post['message'] for page in pages 
                                    for post in page['data']
                                        if 'message' in post]

 

This code is also a great chance to point out how awesome list comprehensions in python are. I must confess, before working at buzzient, I was pretty strongly in the Ruby camp. Programming in python took a while for me to get used to, especially significant white space. However, features like list comprehensions are starting to win me over. If you haven't given Python a try, use it for your next fun project. You just might like it.

 

Now that we've got a list of messages from the news feed, we can transform those into a word cloud. First,we'll turn the messages into one long array of words.

tokens = [token for msg in messages for token in word_tokenize(msg)]

 

This snippet uses the word_tokenize function from NLTK. If you're interested in text comprehension or natural language processsing, you need to check out this project. There's a free book about NLTK at http://www.nltk.org/book. We use the toolkit internally at buzzient as part of our sentiment analysis system.

 

Now that we've got a list of tokens, a word count will transform this into a dict where the key is a word, and the value is its count.

def word_count(words):
    """ Count the occurrences of each word. Words should be a list of strings
    that you might get from str.split() or nltk.word_tokenize().
    """
    seen = defaultdict(lambda: 0)
    
    for w in words:
        if w.lower() not in IGNORED_WORDS:
            seen[w] += 1
    
    return seen

 

We normalize the word count by dividing the count of each word by the count of the word that appears the most. This allows us to give each word an importance between 1 and 5, which will nicely map into some CSS classes to make the more important words stand out. I put in a extra check here to ignore words that only appear once. The word cloud should only be composed of interesting words, which tend to appear multiple times. This makes the word cloud contain fewer words, but the words that remain are much more interesting.

def word_cloud_sizes(counts):
    max_count = max(counts.values())
    min_size, max_size = (1, 5)
    word_sizes = defaultdict(lambda : 1)
    
    for word in counts:
        if counts[word] > 1:
            size = int(float(counts[word]) * (max_size - 1) / max_count) + 1
            word_sizes[word] = size
    
    return word_sizes

 

Finally, we can wrap this result to HTML and print it to stdout. When you run the program with your own access key, redirect this to a file, view it in a browser, and you should see your own Facebook news feed tag cloud!

Take a look at the full source for this at gist.github.com/1796048. You will need NLTK installed to run it.

 

Happy hacking!

An Intro to boto's MTurk interface

We recently started using Amazon Mechanical Turk for crowdsourcing. Through the boto API, we built a tool to help upload, process, and delete HITs in Python.  

Mechanical-turk

There are lots of docs online that are useful to read if you aren't familiar with the AMT data structures, etc. Here are some of the parts of the boto API we found useful for automating HIT uploads to mturk, allowing us to tag thousands of pieces of data in a matter of hours.

Creating a Connection

To open an mturk connection, have your Amazon access ID and secret access key ready. Before putting HITs into production, it's good to try things out on the sandbox environment. To do this, specify the host to be 'mechanicalturk.sandbox.amazonaws.com.'

>>> from boto.mturk.connection import MTurkConnection
>>> conn = MTurkConnection(aws_access_key_id=ACCESS_ID,
...   aws_secret_access_key=SECRET_KEY, host=HOST)

To test that your connection is working properly:

>>> conn.get_account_balance()
10000.00

If you're working in the sandbox environment, this should return $10,000. Otherwise, it'll give the amount in your account.

Creating HITs

To create a HIT, we first take a look at the create_hit() function. You need a couple things for a HIT. First, a QuestionForm object. Currently boto only supports selection answers and free text answers.

Here's an example.

>>> from boto.mturk.question import Question, QuestionForm,
...   Overview, AnswerSpecification, SelectionAnswers,

>>> overview = Overview()
>>> overview.append_field('Title', 'Your Pets')

>>> # the question form is what you pass to the create_hit() function
>>> qf = QuestionForm()
>>> qf.append(overview)

>>> # these are the answer choices
>>> choices = [('Cat', '0'), ('Dog', '1'), ('Fish', '2'),
...   ('Hamster', '3')]

>>> # this is the question that workers are providing an answer for
>>> qcontent = QuestionContent().
>>> append_field('Title', 'What's your favorite pet?')

>>> # selection answer form where the choices are radio buttons
>>> form = SelectionAnswer(min=1, max=1, style='radiobutton',
...   selections=choices, type='text', other=False)

>>> # here we require an answer from the worker
>>> q1 = Question(identifier='petquestion1', content=qcontent,
...   answer_spec=AnswerSpecification(form), is_required=True)
>>> qf.append(q1)

To create a free text answer:

>>> # create a free text answer
>>> qcontent = QuestionContent().
>>> append_field('Title', 'What's your favorite pet's name?')
>>> form = FreeTextAnswer()

>>> # here we don't require an answer
>>> q2 = Question(identifier='petquestion2', content=qcontent,
...   answer_spec=AnswerSpecification(form), is_required=False)
>>> qf.append(q2)

The question identifier is something you'll want to put a unique id for because it helps you identify what question the worker responded to when you fetch the results.

You'll also need to title the HIT and provide a description and keywords that match the HIT. Then you need to decide how to price the HIT.

To post HITs to Mturk, use the create_hit() command:

>>> create_hit(question=qf,
...   max_assignments=num_workers,
...   title='Pet Survey',
...   description='Answer questions about your pets',
...   keywords='pets, cats, dogs, hamsters, survey',
...   reward=0.05)

Other parameters you can change include the lifetime of the HIT, the duration or amount of time a worker can spend on the HIT, a qualification form with a series of questions workers have to get a certain score before they can accept the HIT.

The create_hit() function returns a ResultSet object, from which you can extract information about the HIT.

Getting HIT Data

To the get the answers from workers, you'll want to look at the function get_reviewable_hits(). This will return all the hits that have the 'Reviewable' status, meaning that all assignments for the HIT have been completed or the HIT has expired. You'll need to specify a page size and if necessary, continue to increment the page number.

>>> page_size = 10
>>> page_num = 1
>>> hits = conn.get_reviewable_hits(page_size=page_size,
...   page_number=page_num)
>>> total = ceil(float(hits.TotalNumResults) / page_size)
>>> while page_num < total:
...   page_num += 1
...   temp = conn.get_reviewable_hits(page_size=page_size, page_number=page_num)
...   hits.extend(temp)

To extract the results from the HITs returned, you can get all assignments from a HIT with get_assignments(hitid) and grab the worker's answers from each assignment:

>>> for hit in hits:
...   assts = conn.get_assignments(hit.HITId, page_size=10)
...   for asst in assts:
...     answers = []
...     for answer in worker.answers[0]:
...       for key, val in answer.fields:
...         contentid = int(key)
...         answerval = int(val)

Here the key is the question identifier you specified when creating the question form, and the value is the answer given by the worker.

Paying Workers

To pay workers, you want to accept the assignment:

>>> conn.approve_assignment(assignment.AssignmentId)

Otherwise, you can reject the assignment:

>>> conn.reject_assignment(assignment.AssignmentId)

Expiring and Deleting HITs

To delete HITs, there are two relevant functions to look at.

The dispose_hit() function deletes HITs that are in the reviewable state. If you want to delete the HIT but it is not yet reviewable, you need to expire the hit first with expire_hit().

>>> conn.expire_hit(hit.HITId)
>>> conn.dispose_hit(hit.HITId)

The disable_hit() function approves all submitted assignments that haven't been approved or rejected and then disposes of the HIT, so you'll get charged for all the assignments submitted. 

>>> conn.disable_hit(hit.HITId)

Those are the basics!

We are using this tagged data to improve our Language Detection ability. Look for a follow up post soon.

Friends Don't Let Friends Not Use Github

I spent 6 months coming up with reasons not to switch from SVN to Git. Two of our developers kept pushing and pushing. Then we did it and I can't imagine life without it. Yet at Buzzient, the number one reason hasn't even been Git -- but a GitHub feature: its Code Review capabilities

Code_review_github

When someone wants to merge their code, they make a Pull Request to the main branch (or development branch, or experimental branch, or even just a random branch.) Instead of receiving a nasty SVN UPDATE merge clusterflack you get this beautifully wrapped code package complete with a description of the changes, a summary of the code/file update and a visually stimulating diff. On this diff you can leave comments line by line, collaborate with other members of the team and even link to other commits.

Code_review_github1

To be honest, the experience of a GitHub Pull Request vs. an SVN UPDATE, is the difference between getting overnight mail delivery from FedEx vs. a flaming brown bag full of dog sh*t on your doorstep. The code review history is saved, the code is improved, and your sneakers are clean. Everyone wins. I spend more time every day wondering how pre-GitHub code written by employees/interns got into the codebase unchecked, than I did probably writing this post.

A STRONG mea culpa for not having a solid reviewing tool in place before that and I am sure there are similar tools for SVN that I just never explored. But WOW! Thanks GitHub. 2012 looks that much better with you as a tool in our arsenal.

If you or someone you know is on the verge of a Git/GitHub transition: force them, fight them. I feel like weeks of my life have been lost with large SVN merges conflicts and crazy .bashrc aliases to simplify things. I might go as far as to say in bad english: Software dev manager friends don't let software dev manager friends not use GitHub.

If you are looking for reasons on Git vs SVN, try searching the 1,890,000 posts about why its better (and the thousands of others why its not.) 

 

How not Supporting IE7, is actually Supporting IE7

This week, Facebook announced that their new Timeline view would not be supported for users of IE7. This announcement comes more than a year after they decided to no longer support Facebook Chat for IE6, and a few months after Microsoft itself launched their campaign to kill the use of IE6.

Unfortunately this is having a negative effect on the phasing out of these browsers. Many people are resistant to the new Facebook layout changes and are opting to use old outdated browsers to maintain the more familiar Facebook page layout. In addition, companies frustrated with their employees' wasting precious work hours on social media sites are using these outdated browsers to enforce employee compliance.

Like all companies that live in the enterprise software space, we at Buzzient are forced to develop our apps for the user base that lives under the iron fist of the CIO, who has never been fired for not upgrading the entire company to a more modern piece of software. Yet, in 2011, we made the painstaking decision to no longer support IE6 and our lives and development speed have improved dramatically because of it. Still, many of the applications we integrate with only support ancient browsers. Oracle's Siebel software, for example, only works in IE6, IE7 or IE8. Oracle has had little incentive to change this, and the CIOs (who buy thousands of licenses of this stuff) don't either.

These software behemoths have always been slow to upgrade -- for security, brand loyalty, logistical reasons, or what have you -- but it's unfortunate that the consumer is now switching to old browsers as well.

Simpsons-angry-mob

I, for one, am not a huge fan of new Timeline view, but I hate doubling down on dev time to support Internet Explorer even more. Hopefully Facebook doesn't decide to cave to an angry mob of old-page-layout-loyalists. Now that Facebook, Twitter, et al. control a large percentage of the world's pageviews, they have the ability to make some decisions that shape the browser habits of the world. I hope these emerging software giants will continue to make the right decisions.

Frictionless Deployment with Fabric and Git - How Do You Deploy?

Continuous deployment brings many benefits to the overall stability of your code base. Inspired by the Velocity Culture slide-deck from Amazon, we have been striving to bring our code base and deployment infrastructure closer to this reality. 

Presentation5

This is exactly how we feel about the new code we write and we want to be able to push it out there as quickly and efficiently as possible.

A recent HN post on deploying early and often caused us to step back and examine our own deployment process. We are running at least 15 EC2 servers at all times and want to make sure that we have a sustainable way of updating them consistently either all at once, or at least few at time.

We use Git for version control so updating to the latest code is beautifully simple


$ git pull 

But there are a few things we need to do at each deployment as well, including updating settings files upon code pull, logging deployments, restarting processes and running some post deployment tests.

We can wrap all that in a Fabric function and add just about any sort of build or deploy functionality we want into it.


 22  def deploy_code()
 23 
 24      run('git stash save "pullsave `date`"', pty=True)
 25      run('git pull', pty=True)
 26 
 27      host = env.host_string.split('@',1)[1]
 28      server = lookupInstance(dns=host)
 29 
 30      with settings(warn_only=True):
 31          sudo('find . -name \'*.pyc\' | xargs rm')
 32 
 33      if server.server_type == WEB:
 34          gunicorn_restart()
 35 
 36      run_deployment_checks(server_type)

Now we want to be able to run this on multiple servers so we create a Fabric function that appends all the wanted hosts ahead of time. The getServerDict() function returns a dict of user@X.1.X.1 IP addresses indexed by the server name.

 36 def remote(server='serv1',servers=''):
 37 
 38         servlist = getServerDict()
 39 
 40         env.hosts = []
 41 
 42         if servers:
 43            servers = servers.split(';')
 44            for serv in servers:
 45                    if serv in servlist:
 46                       env.hosts.append(servlist[serv])
 47         else:
 48                 env.hosts.append(servlist[server])

This then allows us to run on whatever servers we want. The below command will update one of our processing servers and one of our web servers.

$ fab remote:servers='proc27;web12' deploy_code

On top of this we have created Fabric functions to roll the latest commits back in case the deploy didn't go as smoothly as we had hoped. 

One thing we would like to improve is how we handle our Gunicorn restart. If someone is mid-page load and we restart Gunicorn with Supervisor, the worker is killed and the request is lost. The Gunicorn worker's lifespan is fixed so one way would be to somehow use up the remainder of their requests as quickly as possible in order to force their demise and start up new workers, but we haven't  yet come up with a solid way of doing this. You can guess their is a GitHub issue for it though. We will post once we figure it out.

We welcome any ideas or suggestions from your build processes in the comments below.

 UPDATE: Another Solid Presentation on Continuous Deployment by @bdurrett of IMVU

 

 

 

Lessons in Alchemy

A Bottom Up Introduction to SQLAlchemy – Lesson 1

Buzzient has been moving towards using the SQLAlchemy library for handling our database interactions. During the conversion, we have run into some areas we feel the documentation’s format doesn’t clearly address. Since we love SQLAlchemy we wanted to help smooth out the learning curve for others! This post is part of a three part series of posts about SQLAlchemy.

What is SQLAlchemy?

SQLAlchemy is a Python database library. It would be inaccurate to describe it any other way. SQLAlchemy supports ORM, but is not only an ORM library. SQLAlchemy can be a database abstraction layer, but it is not only, nor is it necessarily a database abstraction layer. SQLAlchemy contains a stack of technologies which allow you to do everything from make direct SQL calls to your database to dealing in objects and inheritence using its ORM layer.

Where Do We Start?

When I started learning about SQLAlchemy I took a top-down approach, meaning I started at the top of the SQLAlchemy stack and eventually worked my way down into the bottom layers as I needed to use more features. Learning SQLAlchemy top-down will result in you doing things without quite understanding why you are doing them, and as a result having problems you don’t understand. To start with SQLAlchemy, it is best to start at the bottom and work your way up. This means starting with the description of the database.

How to describe a Database

Describing a database schema in SQLAlchemy is fairly straight-forward. Essentially, you define Columns which you add to Tables which you add to a MetaData variable. A Column object only requires the column name and its type. The type can either be a SQLAlchemy generic type e.g., String or the type from the underlying database connector e.g., VARCHAR from MySQLDb. The Table initializer takes a table name, metadata variable, and argument list of Columns as its definition. The metadata variable is built by specifiying it as the second argument for tables. How about an example!

Example schema

>>> # Import the necessary items
>>> from sqlalchemy import MetaData
>>> from sqlalchemy import Table
>>> from sqlalchemy import Column
>>>
>>> # SQLAlchemy Generic Types
>>> from sqlalchemy import Integer
>>> from sqlalchemy import String
>>>
>>> metadata = MetaData()
>>>
>>> client('client', metadata,
>>>     Column('client_id', Integer, primary_key=True),
>>>     Column('name', String(255)),
>>>     Column('start_date', DateTime))

That’s it! That sets you up to get started using SQLAlchemy. Before you start querying though, it’s important to create an engine and actually create the schema in a database. For now, don’t worry about what the engine is, that will all be explained in depth in the next post!

Create an engine and create the database schema

>>> from sqlalchemy import create_engine
>>>
>>> # We'll create a SQLite engine
>>> engine = create_engine("sqlite:///demo.sqlite")
>>>
>>> # Then we'll populate the database with the schema
>>> metadata.create_all(engine)

Making Queries

The Table objects provide an entry point for making queries. Table objects has methods for doing things equivalent to all the basic SQL functions, such as select, insert, delete, join, and update. So let’s just do a basic example where we insert an entry into the client table defined above and retrieve it.

Insert and Select

>>> import datetime
>>>
>>> # Create the insert statement
>>> insert_statement = \
...     client.insert(
...     {
...         'name': 'Zosimos', 
...         'start_date': datetime.datetime.utcnow()
...     })
>>>
>>> # Get a connection to the engine (more on this later!  I promise!)
>>> conn = engine.connect()
>>>
>>> # Execute the insert statement
>>> conn.execute(insert_statement)
>>>
>>> # Now let's get that value!
>>> # This gets a ResultProxy from the database
>>> # ResultProxy's basically just give us 
>>> # different ways to access the result
>>> result = conn.execute(client.select())
>>>
>>> # Let's get the first value
>>> first_val = result.first()
>>>
>>> # Check it out!  Data!
>>> first_val
(1, 'Zosimos', datetime.datetime(2011, 10, 20, 22, 59, 35, 556273))
>>>

This concludes the first lesson on using the fantastic SQLAlchemy library. In the next lesson, I’ll write about connection pools, dialects, and the engine.

Using Fabric without stdin

At Buzzient we love Fabric. Fabric is an automation tool for Python, providing a layer of abstraction above the Paramiko library.  Fabric provides basic functionality for (among other things) running commands on a remote server, uploading, and downloading files.  We use Fabric everyday for things as simple as SSHing to our servers and as complex as deploying new servers.

We recently ran into a limitation in Fabric: Fabric will fail when you try to run a Fabric task without a stdin handler.  This limitation came up when we re-wrote our database backup jobs using some of our existing Fabric functionality.  Our backup scripts are fired off by a job dispatcher as their own processes.  Those processes would fail deep in Fabric's IO code. 

After researching and discussing the issue with some helpful people in the #fabric IRC channel, I was able to create a patch that supports this use case.  To take advantage of the new functionality, the env.input_enabled variable must be set to False.  This can be done programmatically, or through the --disable-input option at the command line.

Working with the Fabric code was straightforward.  The developers have done a great job of making the code accessible for someone who has never looked at it before.

I forked Fabric on github and pushed the patch.