Web Page Genre Identification and Categorization using Single-Label and Multi-Label Corpuses in English and Telugu Languages

  • K. Pranitha Kumari
  • K. Srinivasa Reddy

Abstract

Because the web is fluid, new web page genres continually appear; these are known as emerging genres. Genre-based searches can yield better results for the user than topic-based searches. In this paper, the Refined Adjustable Centroid Classification (RACC) algorithm is proposed to classify web page genres, including emerging genres, in multiple languages. Seven Telugu web page genres (TELPoetry, TELEntertainment, TELFAQ, TELChildren, TELSocial, TELetiquette and TELE-genre) are identified using the method of annotation by objective sources. A Telugu web page genre corpus (10-genre) is developed, which contains the seven newly identified Telugu web page genres and three existing Telugu web page genres. The 7-genre, 10-genre, 20-genre, 23-genre and newly formed Telugu 10-genre corpuses are classified using the RACC algorithm. The classification results show that the RACC algorithm outperformed existing classification techniques on all five corpuses. The experimental results are statistically significant (p < 0.05).

Index Terms: Telugu web page genres, emerging genres, syllable extraction, genre threshold, web page genre classification, genre corpus, identification of genres

Published
2022-01-01