It tokenizes strings into bigrams for multibyte characters and splits on delimiters for single-byte characters.
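A minimal sketch of that tokenization in Python may make it concrete. It is not the project's actual code: the function name `tokenize` is hypothetical, "multibyte" is approximated as "non-ASCII", and any character that is not an ASCII letter or digit is assumed to be a delimiter.

```python
import re

# Assumption: a "single-byte" run is ASCII letters/digits, and a
# "multibyte" run is any sequence of non-ASCII characters.
_RUN = re.compile(r"[A-Za-z0-9]+|[^\x00-\x7f]+")


def tokenize(text):
    """Split text into delimiter-separated ASCII words and non-ASCII bigrams."""
    tokens = []
    for run in _RUN.findall(text):
        if run[0] < "\x80":
            # Single-byte run: the delimiters around it already split it out.
            tokens.append(run)
        elif len(run) == 1:
            # A lone multibyte character cannot form a bigram; keep it as-is.
            tokens.append(run)
        else:
            # Multibyte run: emit every overlapping pair of characters.
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens
```

For example, `tokenize("full text 全文検索")` yields the two ASCII words followed by the three overlapping bigrams of the Japanese run. Non-ASCII punctuation is treated like any other multibyte character here, which a real tokenizer would probably want to filter out.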
The indexer, which is the first code example, stores tokens in a hash. If the keyword is multibyte, the searcher searches for each of its tokens recursively and then merges the results. If the keyword is single-byte, it simply fetches the results from the hash.
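The indexing and lookup steps can be sketched roughly as follows, assuming documents and queries are already tokenized. The names `build_index` and `search` are hypothetical, the loop is iterative rather than recursive, and "merging" is assumed here to mean intersecting the per-token results; the original code may combine them differently.

```python
from collections import defaultdict


def build_index(docs):
    """Build a hash from token to the set of document ids containing it.

    docs: mapping of document id -> list of tokens.
    """
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def search(index, query_tokens):
    """Return ids of documents that contain every query token.

    A single-byte keyword is one token, so this is a plain hash lookup.
    A multibyte keyword is several bigrams, so the per-token results
    are merged (assumption: by set intersection).
    """
    result = None
    for token in query_tokens:
        postings = index.get(token, set())
        result = set(postings) if result is None else result & postings
    return result or set()
```

Intersecting bigram results does not check that the bigrams are adjacent in the document, so a sketch like this can return false positives for long multibyte keywords.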
The sample data contains the number of bookmarks (bcount) and the number of stars (scount). The number of bookmarks comes from Hatena Bookmark, which is the most popular social bookmarking service in Japan. The number of stars comes from Hatena Star, which is a kind of rating service. It’s a tool to express your appreciation.
Basically, the scoring of this search engine is based on “bcount”, “scount”, and “date”. You can change the weight of each.
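One way such a weighted score could look is sketched below. The source only says that the three fields are combined with adjustable weights, so the formula is an assumption: the function `score`, the weight parameters, the linear combination, and the `1 / (1 + age_days)` recency term are all illustrative rather than taken from the project.

```python
from datetime import date


def score(entry, today, w_bcount=1.0, w_scount=1.0, w_date=1.0):
    """Combine bcount, scount and date into one ranking score.

    entry: dict with "bcount" (int), "scount" (int) and "date" (datetime.date).
    Assumption: newer entries should rank higher, so the date is turned
    into a recency value that decays with age in days.
    """
    age_days = max((today - entry["date"]).days, 0)
    recency = 1.0 / (1.0 + age_days)
    return (w_bcount * entry["bcount"]
            + w_scount * entry["scount"]
            + w_date * recency)
```

Search results could then be sorted with `sorted(entries, key=lambda e: score(e, date.today()), reverse=True)`, and raising or lowering a weight changes how much that field influences the ranking.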
The source code is published at http://github.com/stanaka/one-day-fulltext-search/