User-oriented Semantic Indexer is the name of an efficient algorithm for annotating documents of any type.
The main motivation behind this work is to provide a kNN-based approach for annotating entities, be it textual documents, songs or movies. While other methods often combine machine learning and feature analysis of a given document (e.g. textual features), USI’s approach is completely independent of the document content. The only requirement in order to guarantee an accurate annotation is to provide an accurate already annotated neighborhood. The search of a good neighborhood is an independent task, related to information retrieval, for which an extensive list of tools already exist.
Thanks to the rise of thesauri, ontologies and knowledge representations in general, there are more and more data that can be annotated by concepts. The semantic indexing process has been initiated in the biomedicine field but much more content can now benefit from conceptual indexing thanks to DBPedia or Freebase. USI aims to do so, whatever is the content, whatever is the thesaurus.
USI is presented as a heuristic algorithm optimizing an objective function. We propose an algorithmic optimization of this heuristic to make it fast enough, implemented in the USI java library. This library is also hosted on GitHub and it can be freely downloaded to be implemented in your project.
- Dedicated website
- Source code on GitHub
- Demo on biomedical papers: try out USI by annotating biomedical papers (mostly about tumors)
- Demo on movies: try out USI by annotating movies thanks to Freebase annotations
- Evaluation datasets: contains the L1000 dataset (and 2 others) to evaluate your results – and those of USI
- Results: the results of USI for the L1000 dataset
- USI-MeSH: use this build to reproduce our results
- MeSH-validation: use this build to validate MeSH-based results – such as the ones provided above
- JAR build: include USI in your Java project as a library