Himat SHAH, Radu MARIESCU-ISTODOR, Pasi FRANTI
School of Computing, University of Eastern Finland, Joensuu 80110, Finland.
ML-rank: A Language Independent Method For Keyword Extraction from Webpages
Abstract
We present a supervised method for keyword extraction from webpages. The method divides the HTML page into
meaningful segments using the document object model (DOM)
on these, we generate a classification model that gives a likelihood for a word to be a keyword. The most likely words are th
We analyze the usefulness of the features on different datasets (news articles and service web pages) and compare different
classification methods for the task. Results show that random forest performs best and provides up to 27.8 %
compared to the best existing method.
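
As a rough illustration of the segmentation idea described above, the following minimal sketch counts how often each word occurs in a few coarse page segments and returns one count vector per word. The segment labels (`title`, `heading`, `anchor`, `body`), the tag mapping, and the names `SegmentParser` and `word_features` are assumptions made for this example only; they are not the features or the implementation of the method itself. Vectors of this kind could serve as input to a classifier such as a random forest.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to coarse segment labels (an assumption).
SEGMENT_TAGS = {"title": "title", "h1": "heading", "h2": "heading", "a": "anchor"}
SKIP_TAGS = {"script", "style"}  # text inside these is never visible content
SEGMENTS = ["title", "heading", "anchor", "body"]


class SegmentParser(HTMLParser):
    """Collects lowercase word counts per coarse page segment."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, innermost last
        self.counts = defaultdict(lambda: defaultdict(int))  # word -> segment -> count

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags such as <br>.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIP_TAGS for t in self.stack):
            return
        # The innermost enclosing tag with a known label decides the segment.
        segment = "body"
        for t in reversed(self.stack):
            if t in SEGMENT_TAGS:
                segment = SEGMENT_TAGS[t]
                break
        # \w+ matches Unicode letters, so no language-specific tokenizer is needed.
        for word in re.findall(r"\w+", data.lower()):
            self.counts[word][segment] += 1


def word_features(html):
    """Return {word: [title, heading, anchor, body] occurrence counts}."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    return {w: [c[s] for s in SEGMENTS] for w, c in parser.counts.items()}
```

In a supervised setting, each vector would be paired with a keyword or non-keyword label from annotated pages, and the words of a new page would be ranked by the classifier's predicted likelihood.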