ML-rank: A Language Independent Method For Keyword Extraction from Webpages

Authors

Himat SHAH, Radu MARIESCU-ISTODOR, Pasi FRANTI
School of Computing, University of Eastern Finland, Joensuu 80110, Finland.

Abstract

We present a supervised method for keyword extraction from webpages. The method divides the HTML page into meaningful segments using the document object model (DOM) [...] on these, we generate a classification model that gives the likelihood of a word being a keyword. The most likely words are then selected as keywords. We analyze the usefulness of the features on different datasets (news articles and service web pages) and compare different classification methods for the task. Results show that random forest performs best and provides up to 27.8% improvement compared to the best existing method.
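As a rough illustration of the first stage of such a pipeline, the sketch below segments a page by tag context and derives a few per-word features. It is only a sketch under stated assumptions: the segment names, the tag-to-segment mapping, and the feature names (`tf`, `in_title`, `in_header`, `in_anchor`) are hypothetical and are not taken from the paper, and the open-tag stack of Python's `html.parser` stands in for a full DOM traversal. A trained classifier (for instance a random forest) would consume these feature vectors to produce keyword likelihoods; that step is not shown.

```python
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to page segments (an assumption,
# not the segmentation used in the paper).
SEGMENT_OF_TAG = {"title": "title", "h1": "header", "h2": "header",
                  "h3": "header", "a": "anchor"}
SKIPPED_TAGS = {"script", "style"}


class SegmentCollector(HTMLParser):
    """Count lower-cased words per segment while tracking open tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.counts = defaultdict(Counter)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching tag so unclosed tags do not accumulate.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.open_tags):
            return
        # The innermost mapped ancestor decides the segment; default is body.
        segment = "body"
        for t in reversed(self.open_tags):
            if t in SEGMENT_OF_TAG:
                segment = SEGMENT_OF_TAG[t]
                break
        self.counts[segment].update(re.findall(r"\w+", data.lower()))


def word_features(html):
    """Return {word: feature dict} for every word found in the page."""
    collector = SegmentCollector()
    collector.feed(html)
    collector.close()
    total = Counter()
    for segment_counts in collector.counts.values():
        total.update(segment_counts)
    n_words = sum(total.values())
    return {
        word: {
            "tf": freq / n_words,  # relative frequency over the whole page
            "in_title": int(word in collector.counts["title"]),
            "in_header": int(word in collector.counts["header"]),
            "in_anchor": int(word in collector.counts["anchor"]),
        }
        for word, freq in total.items()
    }
```

Calling `word_features(page_html)` on a page's HTML string yields one feature dictionary per distinct word. In a supervised setting these would be paired with keyword / non-keyword labels to train the classifier, and the words would then be ranked by predicted likelihood.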