Building a full-text search engine in “ONE” day

stanaka, a developer at Hatena Inc., wrote a full-text search engine in a day. He wrote the code as an exercise after reading Introduction to Information Retrieval.

It tokenizes multi-byte text into bigrams (overlapping two-character sequences) and splits single-byte text on delimiters.
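To make the tokenization rule concrete, here is a minimal sketch in Python. It is not stanaka's code: the function name `tokenize` and the regular expression are assumptions made for illustration, "multi-byte" is approximated as "non-ASCII", and ASCII words are taken to be runs of letters and digits.

```python
import re

def tokenize(text):
    """Return bigrams for non-ASCII runs and whole words for ASCII runs.

    Assumption: "multi-byte" is approximated as "any non-ASCII character",
    and every non-alphanumeric ASCII character acts as a delimiter.
    """
    tokens = []
    # Each match is either a run of ASCII letters/digits or a run of non-ASCII characters.
    for run in re.findall(r"[A-Za-z0-9]+|[^\x00-\x7f]+", text):
        if run.isascii() or len(run) == 1:
            # ASCII word, or a lone non-ASCII character too short for a bigram.
            tokens.append(run)
        else:
            # Overlapping two-character windows over the non-ASCII run.
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens

print(tokenize("full-text 検索エンジン"))
```

Under these assumptions, a Japanese word such as 検索エンジン becomes the bigrams 検索, 索エ, エン, ンジ, ジン, while "full-text" splits into "full" and "text". Non-ASCII punctuation is not treated specially in this sketch, so it ends up inside bigrams.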

The indexer, which is the first piece of example code, stores tokens in a hash. For a multi-byte keyword, the searcher looks up each token recursively and then merges the results. For a single-byte keyword, it simply reads the results from the hash.
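The indexer and searcher described above can be sketched roughly as follows. This is an illustration in Python, not the published code: the names `build_index` and `search` are hypothetical, the hash is modeled as a dict from token to a set of document ids, and "merge" is assumed to mean intersecting the per-token results (the original may merge differently). A loop stands in for the recursion.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Bigrams for non-ASCII runs, whole words for ASCII runs (an assumed rule).
    tokens = []
    for run in re.findall(r"[A-Za-z0-9]+|[^\x00-\x7f]+", text):
        if run.isascii() or len(run) == 1:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens

def build_index(docs):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, keyword):
    """Look up every token of the keyword and intersect the results."""
    tokens = tokenize(keyword)
    if not tokens:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

docs = {1: "全文検索エンジン", 2: "検索 search engine", 3: "エンジン engine"}
index = build_index(docs)
print(search(index, "検索エンジン"))
print(search(index, "engine"))
```

A single-byte keyword like "engine" yields one token, so the search reduces to a single hash lookup. A multi-byte keyword yields several bigrams, and only documents containing all of them survive the intersection. Note that this sketch does not check that the bigrams are adjacent in the document, so it can return false positives.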

The sample data contains the number of bookmarks (bcount) and the number of stars (scount) for each entry. The bookmark count comes from Hatena Bookmark, the most popular social bookmarking service in Japan. The star count comes from Hatena Star, a kind of rating service that lets you express your appreciation for an entry.

The scoring of this search engine is based on “bcount”, “scount”, and “date”, and you can change the weight given to each.
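The post does not give the scoring formula, so the following is only one plausible shape for it: a weighted sum of the three fields. Everything here is an assumption made for illustration, including the `WEIGHTS` values, the `score` and `rank` names, the entry layout, and the way "date" is turned into a recency value.

```python
from datetime import date

# Hypothetical weights; the original lets you tune them, but the values are not given.
WEIGHTS = {"bcount": 1.0, "scount": 0.5, "date": 10.0}

def score(entry, today, weights=WEIGHTS):
    """Weighted sum of bookmark count, star count, and recency (assumed formula)."""
    age_days = max((today - entry["date"]).days, 0)
    # Assumption: newer entries score higher; recency is 1.0 today and decays with age.
    recency = 1.0 / (1.0 + age_days)
    return (weights["bcount"] * entry["bcount"]
            + weights["scount"] * entry["scount"]
            + weights["date"] * recency)

def rank(entries, today, weights=WEIGHTS):
    """Return entries sorted from highest to lowest score."""
    return sorted(entries, key=lambda e: score(e, today, weights), reverse=True)
```

Passing a different `weights` dict changes the ordering. For example, setting the "date" weight to 0 ranks entries purely by bookmarks and stars.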

The source code is published at http://github.com/stanaka/one-day-fulltext-search/
