A Weighted TF-IDF Uni-Gram Model for Automated Feature Extraction in Multi-Dimensional and Unstructured Big Data

Authors

  • A. Jebamalai Robinson
  • Dr. V Saravanan

Abstract

In the Big Data era, large volumes of mostly unstructured data are produced in a variety of forms such as text, images, audio, and video. Making effective use of these data is a tedious and challenging task. Feature extraction methods are widely used to extract meaningful information from these large data sources. Data dimensionality is a critical issue when data mining algorithms are applied to such large data sets. Although much research has been conducted on feature engineering for unstructured data to address dimensional complexity, most methods fall short on one metric or another. This paper proposes an automated feature extraction method based on a weighted TF-IDF model using a uni-gram vector space method for a large text corpus. Document similarity features are calculated using cosine similarity, and document clustering is performed to group similar features from the corpus together. Experimental analysis shows that the proposed methodology outperforms other state-of-the-art embedded methods for text feature extraction.

Keywords: Big data, Feature Extraction, Dimensionality, TF-IDF model, Topic modelling and LDA
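
The pipeline outlined in the abstract (uni-gram TF-IDF vectors, cosine similarity, then clustering of similar documents) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the weighting scheme or the clustering algorithm, so the sketch assumes a plain TF-IDF with smoothed IDF and a greedy threshold-based grouping. The function names, the threshold value, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-alphanumeric characters to get uni-grams.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (dict: term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(vectors, threshold=0.2):
    # Assumption: greedy single-pass grouping. A document joins the first
    # cluster whose seed (first member) is at least `threshold` similar,
    # otherwise it starts a new cluster.
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical toy corpus: two topics, two documents each.
docs = [
    "big data feature extraction",
    "feature extraction for big data text",
    "audio and video signals",
    "video and audio streams",
]
vectors = tfidf_vectors(docs)
clusters = cluster(vectors)
print(clusters)
```

On this toy corpus the two feature-extraction documents and the two audio/video documents end up in separate groups. A real corpus would need a proper clustering algorithm and a tuned threshold; the greedy grouping here only shows how the similarity scores feed the clustering step.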

Published

2020-01-02

Section

Articles