Did you know that Twitter, one of the world’s largest micro-blogging sites, has switched to a new and more scalable search architecture powered by Lucene? Twitter’s original real-time search was based on the Summize search system, which it acquired in 2008. However, with the site’s estimated 1,000 TPS (tweets/sec) and 12,000 QPS (queries/sec), it needed a system that could keep up with growing demand and still leave room for further modification. So Twitter decided to go with Lucene.
So what exactly is Lucene and what makes it so efficient?
Lucene is a Java-based, open source search engine library. What makes it efficient is that it is built around a highly efficient inverted index scheme rather than a relational database. An inverted index maps each term to the list of documents that contain it, so a query looks up its terms directly instead of scanning every document. At the core of Lucene’s logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene’s API to be independent of the file format. Text from PDFs, HTML, Microsoft Word and many other formats can all be indexed, as long as the textual information can be extracted.
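To make the idea of documents, fields and an inverted index a little more concrete, here is a minimal sketch in Python. It is not Lucene’s API or its on-disk format; the names `TinyIndex`, `tokenize`, `add` and `search` are made up for illustration, and the tokenizer is a deliberately naive assumption (lowercase letters and digits only).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: maps (field, term) to the ids of matching documents."""

    def __init__(self):
        self.docs = []                     # doc id -> original document (dict of fields)
        self.postings = defaultdict(list)  # (field, term) -> doc ids in insertion order

    def add(self, doc):
        """Index a document given as a dict of field name -> text; return its id."""
        doc_id = len(self.docs)
        self.docs.append(doc)
        for field, text in doc.items():
            # set() so each doc id appears at most once per (field, term)
            for term in set(tokenize(text)):
                self.postings[(field, term)].append(doc_id)
        return doc_id

    def search(self, field, term):
        """Return ids of documents whose given field contains the term."""
        return list(self.postings.get((field, term.lower()), []))
```

Because the index only ever sees field names and extracted text, it does not matter whether that text originally came from a PDF, an HTML page or a Tweet.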
Twitter needed a system with extremely low indexing latency (the time between when a Tweet is posted and when it becomes searchable), which meant the indexer itself had to have sub-second latency. While Lucene served this purpose, Twitter had to do some tweaking to incorporate it into its system. The changes Twitter made to Lucene resulted in:
- efficient early query termination
- lock-free data structures and algorithms
- posting lists that are traversable in reverse order
- significantly improved garbage collection performance
Twitter has also promised to contribute these additions to the open source Lucene project.
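Two of the changes listed above, reverse-traversable posting lists and early query termination, fit together nicely for real-time search, where the newest Tweets usually matter most. The sketch below is only an illustration of the general idea, not Twitter’s implementation: it assumes a posting list is a plain Python list of document ids appended in arrival order, and the function name `newest_hits` and its `accept` parameter are hypothetical.

```python
def newest_hits(posting_list, k, accept=lambda doc_id: True):
    """Walk a posting list from newest to oldest and stop after k hits.

    posting_list: doc ids in arrival order (oldest first), an assumption
                  made for this sketch.
    k:            number of results wanted.
    accept:       optional extra filter applied to each doc id.

    Returns (hits, scanned): the newest matching doc ids and how many
    entries were examined before the scan stopped.
    """
    hits = []
    scanned = 0
    # reversed() walks the list back to front without copying it
    for doc_id in reversed(posting_list):
        scanned += 1
        if accept(doc_id):
            hits.append(doc_id)
            if len(hits) == k:
                break  # early termination: older entries are never touched
    return hits, scanned
```

With a posting list of six entries and `k=2`, the scan stops after looking at only the two newest entries, which is the point of terminating early: the cost depends on how many results are wanted, not on how long the list has grown.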
Integrating Lucene into Twitter search has made it more scalable: it can handle more tweets per second while using fewer of Twitter’s system resources. It also supports an index twice as large as before, which gives users the advantage of searching tweets further back in time.
With all these improvements, it’s being speculated that Twitter’s developers will introduce some cool new search features in the near future. As Twitter is the most preferred platform among users for broadcasting short status updates and for following news and the latest trends, its search engine is a key component of microblogging. With Lucene at its core, Twitter aims to make advances in search in the near future. So for now, it’s the users’ turn to try out Twitter’s search engine and give their take on it. Feel free!