Dimensionality Reduction and Feature Selection Methods for Script Identification on Document Images


Bruce Poon, Rahman Saami, M. Ashraful Amin, Hong Yan


The goal of this research is to explore the effects of dimensionality reduction and feature selection on the problem of script identification from images of printed documents. k-adjacent segment features are well suited to this task because of their ability to capture visual patterns. We used principal component analysis to reduce our feature matrix to a more manageable size that can be trained on easily, and experimented with varying combinations of dimensions of the super feature set. A modular neural network approach was used to classify seven languages: Arabic, Chinese, English, Japanese, Tamil, Thai and Korean.
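
As an illustration of the dimensionality reduction step described above, the following is a minimal sketch of principal component analysis applied to a feature matrix. It is not the authors' implementation: the function name `pca_reduce`, the synthetic data, the matrix sizes, and the choice of five retained components are all assumptions made for the example, and the random matrix merely stands in for real k-adjacent segment features.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project the rows of `features` onto the top principal components.

    Hypothetical helper for illustration; returns the reduced matrix,
    the principal directions, and the fraction of variance each explains.
    """
    X = np.asarray(features, dtype=float)
    centered = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # SVD of the centered data: rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    variance_ratio = singular_values**2 / np.sum(singular_values**2)
    reduced = centered @ components.T
    return reduced, components, variance_ratio[:n_components]


# Synthetic stand-in for a feature matrix (assumption): 200 document
# images, 50 features each, generated from 5 underlying factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
features = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

reduced, components, explained = pca_reduce(features, n_components=5)
print(reduced.shape)    # (200, 5)
print(explained.sum())  # fraction of total variance retained
```

The reduced matrix can then be passed to a classifier in place of the full feature set. In practice the number of retained components would be chosen by inspecting how much variance is kept, which is one way to experiment with different combinations of dimensions.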
