Search Engine to search Indian Language Documents

Introduction
Problem Definition
Issues
Our Solution

Introduction

This project aims to provide a search engine for Indian language documents. Most available search engines search only English documents, and very few cover documents in even a handful of Indian languages. Our aim is to provide search across Indian language documents.

Problem Definition

The problem is to build a multi-font search engine, that is, one that searches documents written in different fonts.

Issues

    Regular search engines handle words made of 7-bit (ASCII) characters. To search for a Tamil word, however, the search engine must support a multi-byte character set. Search on the web is further complicated by the fact that each Tamil website uses its own font-based encoding scheme, so the many Tamil websites on the net are written in different fonts and charsets. In addition, search in the vernacular requires the user to type the query using a phonetic or Inscript keyboard. For a lay user, typing in the vernacular is laborious, as each character may require more than one keystroke. To address these issues, a freely available search engine that supports multi-byte charsets (mnoGoSearch) is modified and enabled with handwriting recognition (HWR).
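The gap between 7-bit ASCII and Tamil text can be made concrete with a short Python sketch. The sample word and the variable names are illustrative assumptions, not part of the project; the sketch only shows that Tamil code points fall outside the ASCII range and need several bytes each in a multi-byte encoding such as UTF-8.

```python
# Illustrative sketch; the sample word and names are assumptions.
# The word "Tamil" in Tamil script, spelled out as Unicode escapes
# (U+0BA4, U+0BAE, U+0BBF, U+0BB4, U+0BCD).
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

code_points = [ord(ch) for ch in word]
utf8_bytes = word.encode("utf-8")

# 7-bit ASCII covers only the values 0..127.
fits_in_ascii = all(cp < 128 for cp in code_points)

print(len(code_points), "code points,", len(utf8_bytes), "bytes in UTF-8")
print("fits in 7-bit ASCII:", fits_in_ascii)
```

A font-based encoding avoids multi-byte characters by reusing 8-bit values for Tamil glyphs, which is why the same byte can mean different things on different websites.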

Our Solution

    Today there are a large number of Indian language websites, such as www.vikatan.com, www.kumudam.com, www.eenadu.net and www.tarahaat.com. Each website designs its web pages using its own font, so a site caters not only to a particular language but also to a specific font.
To address these issues, a search engine that supports search in multiple languages and multiple fonts is a necessity. For ease of use, a font converter must be integrated into the search engine. An HWR interface is also integrated, to address the difficulty of typing in an Indian language using a QWERTY keyboard.

Search Engine

A search engine enables a user to search for documents on a specific topic using keywords. The mnoGoSearch engine has two parts:
On the server side, a filter and a converter are integrated. The filter enables indexing of Tamil documents, while the converter converts a font-based document to a Unicode-based document.

On the client side, a standard search page is provided so that the user can write or type a query and submit it. The converter is invoked when the user clicks on a search result.

1. Server side:

A tool called the indexer resides on the server. The indexer collects words from web pages and stores them in the database. The functions of the indexer are the following:
1.1 Downloading:
The indexer first downloads a page into a temporary buffer and then sends it to the parser.
1.2 Parsing:
The parser parses each of the HTML tags. The filter and the converter are integrated at this stage. The filter checks whether the downloaded document is Tamil by checking the FONT tag of the page. If the page is not a Tamil page, the indexer skips it without further processing and continues by downloading the next page. If the page is a Tamil document, the indexer checks the font used in the page and converts the page into Unicode with the help of the converter. After conversion, the words are collected by the indexer and stored in the database.
1.3 Storing:
Words and their information are stored in a MySQL database. Words are stored in one table and their corresponding information is stored in another table. Words are stored in Unicode.
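The parse, filter and store steps above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the modified mnoGoSearch code: the font name in TAMIL_FONTS, the function index_page, the two-table schema and the use of an in-memory SQLite database (standing in for MySQL) are all hypothetical, and the convert argument is a placeholder for the font-to-Unicode converter.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical font name; a real setup would list the fonts it can convert.
TAMIL_FONTS = {"exampletamilfont"}

class FontTextParser(HTMLParser):
    """Collect the FONT face names and the text content of a page."""
    def __init__(self):
        super().__init__()
        self.fonts = set()
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            face = dict(attrs).get("face")
            if face:
                # A face attribute may list several comma-separated fonts.
                self.fonts.update(f.strip().lower() for f in face.split(","))

    def handle_data(self, data):
        self.text.append(data)

def index_page(db, url, html, convert=lambda text, font: text):
    """Return True if the page was indexed, False if the filter skipped it."""
    parser = FontTextParser()
    parser.feed(html)
    matched = parser.fonts & TAMIL_FONTS
    if not matched:
        return False  # not a Tamil page: skip without further processing
    font = sorted(matched)[0]
    words = convert(" ".join(parser.text), font).split()
    cur = db.execute("INSERT INTO url (url) VALUES (?)", (url,))
    db.executemany("INSERT INTO word (word, url_id) VALUES (?, ?)",
                   [(w, cur.lastrowid) for w in words])
    return True

# Assumed schema: one table for words, one for per-page information.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE url (id INTEGER PRIMARY KEY, url TEXT)")
db.execute("CREATE TABLE word (word TEXT, url_id INTEGER)")
```

Calling index_page(db, url, html) on a page whose FONT tag names a listed font stores its words; any other page is skipped and the function returns False.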

1.2.1 Converter

The purpose of the converter is to convert a text or HTML page from one font/encoding to another. The converter converts all coding schemes to Unicode, and the Unicode indices are stored in the database.
The conversion is driven by a map file, which is essentially a configuration file that maps font indices of a source font/encoding to those of a target font/encoding. The map file is an editable text file. Each entry in the map file is defined as follows:

x:y

Here x corresponds to the source index/indices and y to the target index/indices. If there is more than one index, the indices are separated by a space:

x1 x2 x3:y1
x1 x2:y1 y2

New map files may be added to the converter to enable other fonts and encoding schemes.
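A map file in this format could be read and applied along the following lines. This is a hedged Python sketch, not the project's converter: it assumes the indices are written as decimal integers, that the longest matching source sequence wins when several entries could apply, and that unmapped indices pass through unchanged. The function names and the sample entries are hypothetical.

```python
def parse_map(text):
    """Parse 'x1 x2:y1 y2' lines into a dict {(x1, x2): (y1, y2)}."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        src, dst = line.split(":", 1)
        key = tuple(int(tok) for tok in src.split())
        table[key] = tuple(int(tok) for tok in dst.split())
    return table

def convert(indices, table):
    """Map source indices to target indices, trying the longest match first."""
    longest = max((len(k) for k in table), default=1)
    out = []
    i = 0
    while i < len(indices):
        for n in range(min(longest, len(indices) - i), 0, -1):
            key = tuple(indices[i:i + n])
            if key in table:
                out.extend(table[key])
                i += n
                break
        else:
            out.append(indices[i])  # unmapped index: pass through unchanged
            i += 1
    return out

# Hypothetical entries: the source values are invented font indices,
# the targets are Unicode code points from the Tamil block.
sample_map = """
200:2980
201:2990
201 202:2990 3007
"""
table = parse_map(sample_map)
converted = convert([200, 201, 202, 65], table)
text = "".join(chr(cp) for cp in converted)
```

Trying the longest sequence first lets a multi-index entry such as 201 202 take precedence over the single-index entry for 201, which matters when a font draws one Unicode sequence with a combined glyph or a pair of glyphs.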

2. Client Side:

A graphical user interface (GUI) is provided where a user can submit a query. The search results sent by the server (where the search engine resides) are displayed. Ordinarily, when a user clicks on a displayed result, the page is fetched from its corresponding server and displayed. The page is displayed correctly only if the user has the font installed on their system; otherwise it appears as junk characters. To overcome this limitation, the browser is enabled with the converter: when a user clicks on a displayed result, the page is fetched and stored on the server where the search engine resides, converted to Unicode, and then sent to the client.



Screenshots

display results




Enhancements

HWR with Search Engine

Future Perspective

In the current search engine, a word to be searched must be typed in full, including the desired inflected form, and each inflected form requires a separate query. We are currently in the process of including a morphological analyser in the search engine, so that the indices correspond only to root words. In addition, if the HWR does not recognise a handwritten query correctly, the user has to rewrite it, which can be an inconvenience. To address this, the HWR must be augmented with a spell checker.

Links & References

1. http://www.mnogo.ru
2. http://ns.annauniv.edu/rctamil/html/Search.htm

Contacts

indlinux.lantana.tenet.res.in