Engine to search Indian Language Documents
This project aims to provide a search engine for Indian language
documents. Most available search engines search across English
documents, and very few provide search across even a few Indian
languages. Our aim is to provide search across Indian language documents.
A multifont search engine searches documents that are available in
Regular search engines take words that are encoded in 7-bit
ASCII. To search for a Tamil word, however, the search engine must
support multi-byte character sets. Search on the web is complicated by
the fact that each Tamil language website uses its own font-based
encoding scheme. There are many Tamil websites on the net that
are written using different fonts and charsets. Providing
search in the vernacular requires the user to type using a
phonetic keyboard or an InScript keyboard. For a lay user, typing in the
vernacular is a laborious task, as each character may require more than
one keystroke. To address these issues, a freely available search
engine that supports multi-byte charsets (mnoGoSearch) is modified and
enabled with handwriting recognition (HWR).
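The gap between 7-bit ASCII and a multi-byte encoding can be illustrated with a short Python sketch. The sample word and the choice of UTF-8 as the multi-byte encoding are assumptions made for illustration; they are not taken from the project's code.

```python
# Illustrative only: a Tamil word cannot be represented in 7-bit ASCII,
# and each of its code points needs several bytes in UTF-8.

# The word "Tamil" in Tamil script, written as explicit code points
# (U+0BA4, U+0BAE, U+0BBF, U+0BB4, U+0BCD).
word = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"


def fits_in_ascii(text: str) -> bool:
    """Return True if every character of text is a 7-bit ASCII character."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


code_points = len(word)                 # number of Unicode code points
utf8_bytes = len(word.encode("utf-8"))  # number of bytes in UTF-8

print(fits_in_ascii("search"))   # an English query
print(fits_in_ascii(word))       # a Tamil query
print(code_points, utf8_bytes)
```

A search engine that assumes one byte per character would split or mangle such a word, which is why multi-byte charset support is needed.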
Today there are a large number of Indian language
websites, for example www.vikatan.com, www.kumudam.com,
www.eenadu.net and www.tarahaat.com. Each website designs
its webpages using its own font. Each website thus caters not only
to a particular language, but also to a specific font.
To address these issues, a search engine that supports search in
multiple languages and multiple fonts is a necessity. For ease of
use of the search engine, a converter must be integrated. An
HWR interface is integrated to address the difficulty of typing in an
Indian language using the QWERTY keyboard.
A search engine enables a user to search for documents on a specific
topic using keywords. The mnoGoSearch engine has two parts:
- A client side (GUI) for typing the word to be queried in Tamil
- A server side (database) with a tool called the indexer that
crawls over HTML pages through hyperlinks and stores all words and
corresponding information from the documents in the database.
On the server side, a filter and a converter are integrated. The
filter enables indexing of Tamil documents, while the converter
converts the font-based document to a Unicode-based document.
On the client side, a usual search page is provided so that the user can
write or type the query and submit it. The converter is invoked when a user
clicks on one of the search results.
1. Server side:
A tool called the indexer resides on the server. The indexer collects
words from the web pages and stores them in the database. The functions
of the indexer are the following:
The indexer first downloads a page into a temporary buffer and then sends
it to the parser.
The parser parses each of the HTML tags. The filter and the converter
are integrated at this stage. The filter code checks whether the
downloaded document is Tamil by checking the FONT tag of the
page. If the page is a non-Tamil page, the indexer skips it
without further processing and continues by downloading
the next page. If the page is a Tamil document, the indexer checks
which font is used in the page and converts the page into Unicode with
the help of the converter. After conversion, the words are collected
by the indexer and stored in the database.
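The filter and conversion steps described above can be sketched in Python as follows. This is a minimal illustration and not the project's code: the font names in `TAMIL_FONTS`, the function `index_page`, and the `convert` and `store` callbacks are all hypothetical names introduced here.

```python
from html.parser import HTMLParser

# Hypothetical font names that mark a page as Tamil (assumption).
TAMIL_FONTS = {"examplefont-a", "examplefont-b"}


class FontTagParser(HTMLParser):
    """Collects FONT face names and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.faces = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "font":
            for name, value in attrs:
                if name == "face" and value:
                    for face in value.split(","):
                        self.faces.append(face.strip().lower())

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def index_page(html, convert, store):
    """Filter, convert and index one downloaded page.

    Returns False if the page is skipped as non-Tamil, True otherwise.
    """
    parser = FontTagParser()
    parser.feed(html)
    tamil_faces = [f for f in parser.faces if f in TAMIL_FONTS]
    if not tamil_faces:
        return False                      # non-Tamil page: skip it
    font = tamil_faces[0]
    for part in parser.text_parts:
        unicode_text = convert(part, font)  # font encoding -> Unicode
        for word in unicode_text.split():
            store(word)                     # hand each word to the database
    return True


# Usage with a placeholder converter that returns the text unchanged.
stored = []
page = '<html><body><font face="ExampleFont-A">abc def</font></body></html>'
index_page(page, lambda text, font: text, stored.append)
print(stored)
```

Passing the converter and the storage step in as callbacks keeps the filter logic separate from any particular font map or database.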
Words and their information are stored in a MySQL database. Words are stored
in one table and their corresponding information is stored in
another table.
Words are stored in Unicode.
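The two-table layout can be sketched as follows. The table and column names are hypothetical and are not the schema used by mnoGoSearch; Python's built-in sqlite3 module stands in for MySQL here so that the sketch runs without a database server.

```python
import sqlite3


def create_index(conn):
    """Create one table for words and one for their information."""
    conn.execute(
        "CREATE TABLE word ("
        " word_id INTEGER PRIMARY KEY,"
        " word TEXT UNIQUE NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE word_info ("
        " word_id INTEGER NOT NULL REFERENCES word(word_id),"
        " url TEXT NOT NULL,"
        " position INTEGER NOT NULL)"
    )


def add_occurrence(conn, word, url, position):
    """Record that word occurs in the document at url."""
    # SQLite syntax; the MySQL equivalent is INSERT IGNORE.
    conn.execute("INSERT OR IGNORE INTO word (word) VALUES (?)", (word,))
    (word_id,) = conn.execute(
        "SELECT word_id FROM word WHERE word = ?", (word,)
    ).fetchone()
    conn.execute(
        "INSERT INTO word_info (word_id, url, position) VALUES (?, ?, ?)",
        (word_id, url, position),
    )


def lookup(conn, word):
    """Return (url, position) pairs for every occurrence of word."""
    rows = conn.execute(
        "SELECT i.url, i.position FROM word w"
        " JOIN word_info i ON i.word_id = w.word_id"
        " WHERE w.word = ? ORDER BY i.url, i.position",
        (word,),
    )
    return rows.fetchall()


# Usage: the word is stored once, its occurrences are stored separately.
conn = sqlite3.connect(":memory:")
create_index(conn)
tamil_word = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"  # a Unicode Tamil word
add_occurrence(conn, tamil_word, "http://example.org/a", 3)
add_occurrence(conn, tamil_word, "http://example.org/b", 7)
print(lookup(conn, tamil_word))
```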
The purpose of the converter is to convert a text or HTML page from
one font/encoding to another font/encoding. The converter converts all
coding schemes to Unicode. The Unicode indices are stored
A map file is essentially a configuration file that maps font indices
of a source font/encoding to those of a target font/encoding.
The map file is an editable
text file. Each entry in the map file is defined as follows:
x:y
where x corresponds to the source index/indices and y corresponds to
the target index/indices. If there is more than one
index, the indices are separated by a space. For example:
x1 x2 x3:y1
x1 x2:y1 y2
New map files may be added to the converter to enable other fonts and
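The map file format described above can be read and applied with a short Python sketch. It assumes that indices are written as decimal integers and that the longest matching source sequence wins; neither is stated in the description, and the function names and the sample source indices are hypothetical.

```python
def parse_map(text):
    """Parse entries of the form 'x1 x2:y1 y2' into a dictionary.

    Keys are tuples of source indices, values are tuples of target indices.
    """
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue                      # ignore blank or malformed lines
        source, target = line.split(":", 1)
        src = tuple(int(x) for x in source.split())
        dst = tuple(int(y) for y in target.split())
        if src:
            mapping[src] = dst
    return mapping


def convert(indices, mapping):
    """Convert source indices to target indices, longest match first."""
    longest = max((len(key) for key in mapping), default=1)
    out = []
    i = 0
    while i < len(indices):
        for size in range(min(longest, len(indices) - i), 0, -1):
            key = tuple(indices[i:i + size])
            if key in mapping:
                out.extend(mapping[key])
                i += size
                break
        else:
            out.append(indices[i])        # unmapped index passes through
            i += 1
    return out


# Usage: the source indices 200, 201 and 170 are made up; the targets are
# the Unicode code points U+0BA4, U+0BAE and U+0BBF in decimal.
sample_map = """
200:2980
201 170:2990 3007
"""
mapping = parse_map(sample_map)
print(convert([200, 201, 170, 65], mapping))
```

Because the mapping is plain data, supporting another font only requires a new map file and no change to the conversion routine.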
2. Client Side:
A Graphical User Interface (GUI) is provided where a user can submit a
query. The search results sent by the server (where the search engine
resides) are displayed. When a user clicks on one of the displayed
results, the page is fetched from its corresponding server and is
displayed. The page is displayed correctly only if the user has
the font installed; otherwise it is displayed as junk characters.
To overcome this limitation, the browser is enabled with the
converter. Now, when a user clicks on one of the displayed results, the
page is fetched and stored on the server where the search engine
resides. The page is converted to Unicode and is sent to the client.
HWR with Search Engine
In the current search engine, a word to be searched
must be typed completely, including the desired inflected form.
Each inflected form requires a different query.
We are currently in the process of including a morphological analyser in the
search engine, so that the indices correspond only to root words.
If the HWR does not recognise the query correctly, the user has to rewrite it,
which might be an inconvenience. To address this, the HWR must be
augmented with a spell checker.
Links & References