Thursday, October 11, 2012

Building a Search Application (well, kind of...)

    
    When I finished CS 101 on Udacity, I didn't realize the potential of the things I had learned. I was happy that I got to learn some Python constructs and got to know how search engines work. It was only when the results of Udacity's contest were announced that I realized the awesome things I could do with what I learned in CS 101. Udacians had developed some great apps. Then in June I took CS 253 and learned to use Google App Engine (GAE). It was a great experience. Though Steve Huffman isn't a professional teacher, he did a great job (better than most of my college professors). Later, in September, I decided to build something to reinforce my knowledge of Python and GAE. That's how Find Dialogue happened.

    Find Dialogue is an app that lets you search the transcripts of The Big Bang Theory (seasons 1-5). It is built in Python using the webapp2 framework and the Jinja2 template engine. Most of the styling is done using Twitter Bootstrap. I have also used CSS examples from various other sources where Bootstrap didn't give the desired results. Initially I was thinking about adding transcripts of multiple sitcoms, but after some googling and reading I came to the conclusion that it would be better to concentrate on a single sitcom. So I chose The Big Bang Theory.

    I started looking for transcripts and found this blog, which was perfect for my purpose. I used BeautifulSoup to get the transcripts and Python to parse them and build the search index. Parsing involved adding line numbers to the transcripts so that it would be possible to search for and extract the exact line where certain words occur. The search index maps each word to its occurrences in the transcripts, which makes lookups faster. Python was also used to convert the index and transcripts into the CSV format required by the GAE bulk loader. Figuring out how the bulk loader works took a while, but it was worth it.
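
To make the idea of the search index concrete, here is a minimal sketch of how numbered transcript lines can be turned into a word-to-occurrences map. It is not the code from Find Dialogue: the function names (number_lines, build_index), the episode ids, the tokenizing regex and the sample dialogue are all made up for illustration, and blank lines are assumed to be skipped when numbering.

```python
import re
from collections import defaultdict

# Assumption: a "word" is a run of lowercase letters, digits or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def number_lines(transcript):
    """Return (line_number, text) pairs for the non-blank lines, numbered from 1."""
    lines = (line.strip() for line in transcript.splitlines())
    return list(enumerate((line for line in lines if line), start=1))


def build_index(transcripts):
    """Map each word to a sorted list of (episode_id, line_number) occurrences."""
    index = defaultdict(set)
    for episode_id, transcript in transcripts.items():
        for line_no, text in number_lines(transcript):
            for word in WORD_RE.findall(text.lower()):
                index[word].add((episode_id, line_no))
    return {word: sorted(locations) for word, locations in index.items()}


# Made-up sample data, not real transcript text.
transcripts = {
    "s01e01": "Speaker A: I am not crazy.\n\nSpeaker B: Sure you are not.",
    "s01e02": "Speaker C: Are you crazy?",
}
index = build_index(transcripts)
print(index["crazy"])
```

Looking up a word is then a single dictionary access, which is why the index makes searches faster than scanning every transcript for every query.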

    Once I had the transcripts and index in my local datastore, I wrote the search logic. The process of obtaining results after a user submits a search query is as follows:
Upon receiving a search query, the program filters it to remove some elements and divides it into words. If more than 10 words remain after filtering, only the first 10 are kept and the rest are discarded. The list of occurrences of each word is obtained from the index. These lists are combined and processed to obtain a list of results sorted by relevance. Only the 10 most relevant results are considered for further processing. Using these results, snippets of conversation are obtained from the datastore (or memcache) and passed to the template engine. The template engine replaces \n with <br/>, inserts <mark> tags (with a little help from Python) and generates HTML. Additionally, each result can be clicked to view the entire transcript in a nice format.
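
The query steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Find Dialogue code: the post does not say how relevance is computed, so this sketch simply scores each transcript line by how many distinct query words it contains. The names parse_query and search, the constants, and the sample index are hypothetical.

```python
import re
from collections import Counter

MAX_WORDS = 10    # at most 10 query words are used
MAX_RESULTS = 10  # only the 10 most relevant results are kept
WORD_RE = re.compile(r"[a-z0-9']+")


def parse_query(query):
    """Lowercase the query, drop punctuation and duplicates, keep the first MAX_WORDS words."""
    words = []
    for word in WORD_RE.findall(query.lower()):
        if word not in words:
            words.append(word)
    return words[:MAX_WORDS]


def search(index, query):
    """Return up to MAX_RESULTS (episode_id, line_number) pairs, best match first."""
    scores = Counter()
    for word in parse_query(query):
        for location in index.get(word, []):
            # Assumed relevance: one point per distinct query word on the line.
            scores[location] += 1
    # Highest score first; ties are broken by episode id and line number.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [location for location, _ in ranked[:MAX_RESULTS]]


# Made-up index in the word -> [(episode_id, line_number), ...] shape.
sample_index = {
    "crazy": [("s01e01", 1), ("s01e02", 1)],
    "not": [("s01e01", 1), ("s01e01", 2)],
}
results = search(sample_index, "Not crazy!")
print(results)
```

Each returned pair identifies a line in a transcript, so the surrounding snippet can then be fetched from the datastore (or memcache) and handed to the template.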

    Building Find Dialogue was a great learning experience. Dealing with non-ASCII characters and the GAE bulk loader was difficult, but I learned things that will help me in future projects. I was greatly inspired by Connor Mendenhall's DaveDaveFind. Big thanks to Ash for the transcripts. If you want to build something like this, I would suggest taking CS 101 and CS 253. The source code of my application, along with the code I used for pre-processing, is available on GitHub. Feel free to edit/improve/use it for whatever you want. :)