Scientific social web mining


Scientific social network analysis seeks to discover global patterns in the network of researchers working in a particular field. Common approaches use bibliographic/scholarly data as the basis for this analysis. In the Textrend project, we explore the potential of other information sources, such as researchers' homepages. A researcher's homepage contains several useful pieces of scientific social information, such as the name of their supervisor, their affiliations and their academic rank.

The information on homepages may be present in structured form or as natural-language text. We focus on detecting and analysing the full-text regions of homepages, as they may contain a large amount of information, although they require more sophisticated analysis than structured regions.

We manually annotated a corpus of researchers' homepages for scientific social information. The corpus is extensively annotated, has a hierarchical label structure and is freely available (along with the HTML annotation tool) for research purposes.
Later, as a case study, we chose one particular type of scientific social information and sought to extract information tuples concerning the previous and current affiliations of the researcher in question.

The annotation tool

We developed a WYSIWYG HTML annotation tool which fulfils all of the following criteria:

  • Annotators should work on pages in their original appearance; hence they should not work on the HTML source, and the labelling must not modify the appearance of a page. Moreover, as the corpus contains downloadable subsites, the tool has to be compatible with hyperlinks.
  • The labelled parts of the document need not match the DOM tree of the page, so a labelled region may cross element boundaries.
  • The tool has to support and automatically verify the consistency of the annotation hierarchy.
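
The kind of consistency check named in the last criterion can be illustrated with a minimal sketch. It assumes that each label may require an enclosing span of a parent label; the label names, the PARENT mapping and the Span type are hypothetical and are not taken from the actual tool or corpus.

```python
from dataclasses import dataclass

# Hypothetical label hierarchy: child label -> required parent label.
# The real corpus defines its own labels; these are placeholders.
PARENT = {"AFFILIATION": "CAREER", "SUPERVISOR": "EDUCATION"}


@dataclass(frozen=True)
class Span:
    label: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def find_violations(spans):
    """Return the spans that are not enclosed by a span of their parent label."""
    violations = []
    for s in spans:
        parent = PARENT.get(s.label)
        if parent is None:
            continue  # top-level label, nothing to check
        enclosed = any(
            p.label == parent and p.start <= s.start and s.end <= p.end
            for p in spans
        )
        if not enclosed:
            violations.append(s)
    return violations
```

An annotation tool could run a check of this kind after every new selection and reject labels that lack the enclosing parent region.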

The tool places a special HTML comment tag at the beginning and at the end of the selection. Using comment tags rather than some other kind of HTML tag preserves the page's appearance and provides out-of-DOM labelling. The tool is a Firefox extension; hence its installation and usage are both very simple.
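
To illustrate how comment-based labels can be read back from an annotated page, here is a small sketch. The marker syntax (ann:start / ann:end with a label attribute) and the function name extract_spans are assumptions made for this example only; the comment format actually written by the tool is not specified here.

```python
import re

# Assumed marker format, e.g. <!-- ann:start label=AFFILIATION -->.
# The real tool's comment syntax may differ.
MARKER = re.compile(r"<!--\s*ann:(start|end)\s+label=(\w+)\s*-->")
TAG = re.compile(r"<[^>]+>")  # matches element tags and comments alike


def extract_spans(html):
    """Return (label, text) pairs for every matched start/end marker pair."""
    open_spans = {}  # label -> stack of offsets where labelled content begins
    result = []
    for m in MARKER.finditer(html):
        kind, label = m.group(1), m.group(2)
        if kind == "start":
            open_spans.setdefault(label, []).append(m.end())
        elif open_spans.get(label):
            begin = open_spans[label].pop()
            raw = html[begin:m.start()]
            # Strip markup and normalise whitespace to get the plain text.
            text = " ".join(TAG.sub(" ", raw).split())
            result.append((label, text))
    return result


page = (
    "<p>She worked at <!-- ann:start label=AFFILIATION -->MIT</p>"
    "<p>until 2005<!-- ann:end label=AFFILIATION --></p>"
)
print(extract_spans(page))  # [('AFFILIATION', 'MIT until 2005')]
```

The labelled region in this example starts in one paragraph and ends in another, which a wrapping element could not express without changing the page's structure; comment markers do not have this limitation.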


Publications

  • István Nagy, Richárd Farkas, Márk Jelasity: Researcher affiliation extraction from homepages. In Proceedings of the NLPIR4DL Workshop at ACL 2009.
  • Richárd Farkas, Róbert Ormándi, Márk Jelasity, János Csirik: A Manually Annotated HTML Corpus for a Novel Scientific Trend Analysis. In The Eighth IAPR Workshop on Document Analysis Systems, Nara, 2008.

For further information please contact István Nagy (nistvan AT