The limitations of keyword-based search are well known in the information retrieval field. They are more evident in the life sciences, where most of the reliable scientific information is spread across the biomedical literature in the form of textual journal articles. Unlike the Web, these journal articles are devoid of hyperlinks, so users must perform multiple keyword-based searches while aggregating and organizing the results they find interesting. This makes literature search a tedious task in the life sciences.
Knowledge-based search systems have been proposed as an improvement over conventional search and have gained popularity, especially given the availability of many expert-curated vocabularies and taxonomies in the biomedical domains. The classes in a given taxonomy are used to provide faceted search over articles that contain instances of these classes. These taxonomies and other forms of ontologies are mostly static blocks of well-accepted consensual knowledge. Also, most of these standard ontologies have a limited number of predicates (or relationship types), such as "part of" and "is a". We believe the search process can benefit from recently published results that are not yet well known in the research community, and from relationship types that go beyond the taxonomic ones. Scooner is a knowledge-based literature search and exploration system built upon this intuition. We are working on providing more powerful knowledge-based search in which recently published results are computationally extracted and used as a background knowledge base (KB) to guide the search process. The key here is that the knowledge base that guides search is extracted from the same universe of literature that is being explored.
Search Process: In Scooner, search is modeled as an interactive process where, besides a search box for keyword input, the points of interaction are based on domain-specific assertions (or triples) of the form: subject -> predicate -> object (e.g., muscarinic activation -> facilitates -> long-term potentiation). Raw text results are input to a spotter module that annotates them with entities found in the triples used as background knowledge. Clicking on an annotated entity displays all triples in which it participates as a subject or object. Clicking on the corresponding object or subject then brings up articles that potentially contain that triple; in most cases the original abstract from which the triple was extracted is listed among the top 2 or 3 articles. This way the triples can be browsed in the context of the abstracts in which they were found. New implicit knowledge can also be discovered by building trails from individual triples. Furthermore, users can filter search results or triples based on their association with MeSH terms assigned to the abstracts by the NLM.
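The browsing and trail-building steps can be illustrated with a minimal sketch. It assumes, as a simplification not stated above, that a trail is a chain in which the object of one triple is the subject of the next. The function names and the second triple in the example are hypothetical and are not taken from Scooner itself.

```python
from typing import List, Tuple

# A triple is (subject, predicate, object), as in the assertions described above.
Triple = Tuple[str, str, str]


def triples_for_entity(entity: str, kb: List[Triple]) -> List[Triple]:
    """All triples in which the entity participates as subject or object."""
    return [t for t in kb if entity in (t[0], t[2])]


def trail_continuations(trail: List[Triple], kb: List[Triple]) -> List[Triple]:
    """Triples that could extend the trail.

    Assumed chaining rule: the next triple's subject equals the object
    of the last triple already in the trail.
    """
    if not trail:
        return list(kb)
    last_object = trail[-1][2]
    return [t for t in kb if t[0] == last_object and t not in trail]


kb = [
    ("muscarinic activation", "facilitates", "long-term potentiation"),
    # Hypothetical placeholder triple, for illustration only.
    ("long-term potentiation", "related to", "entity X"),
]

trail = [kb[0]]
trail += trail_continuations(trail, kb)[:1]  # extend by one step
print(" -> ".join([trail[0][0]] + [t[2] for t in trail]))
```

The endpoints of such a chain suggest an implicit connection between the first subject and the last object, which a user can then check against the abstracts behind each triple.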
Collaborative Extensions: Scooner combines these ideas of triple-based search and exploration with persistent search sessions. Users can create search projects, store their search history (including the abstracts they found important and the triples they found useful), and collaborate with colleagues. The workbench in Scooner provides a central place to aggregate important abstracts imported for further review. The workbench can be filtered to show only those abstracts that pertain to a selected set of triples or trails. Additionally, users can write comments on abstracts they find relevant and share their projects or sub-projects with other users on a public dashboard.
Currently, Scooner's KB comes from the Human Performance and Cognition (HPC) ontology project: it covers the domain of human performance and cognition and was extracted from PubMed articles published through August 2008. The literature explored is the set of all abstracts available via PubMed as of June 2011. Initial evaluations of Scooner by researchers at the AFRL indicate that Scooner does better than NLM's PubMed search tool.
Undergraduate Students: Alan Smith, Paul Fultz III
Graduate Students: Delroy Cameron, Christopher Thomas, Wenbo Wang
Postdocs: Ramakanth Kavuluru
Faculty: Amit Sheth
Former students who contributed to previous incarnations of Scooner: Pablo Mendes, Cartic Ramakrishnan
Architecture and Components
The following picture shows various components of Scooner
0. Knowledge base: We assume the presence of a knowledge base in the form of triples, preferably in RDF. A good schema for the data helps considerably with Scooner's functionality. Other formats can also be converted to RDF with some pre-processing. Currently we host two data sets, both related to the domain of human performance and cognition. One was built at the Kno.e.sis Center using NLP and pattern-based extraction techniques and is serialized as a Lucene index. The second comes from shallow parsing and rule-based extraction techniques developed at the NLM and is stored in a Virtuoso triple store.
1. PubMed full text index: PubMed is NLM's service that facilitates access to article citation information from a number of biomedical journals, including those from the MEDLINE database. Every year around December, PubMed releases a consolidated baseline list of abstracts (taking into account deletions and revisions) whose status has been thoroughly verified by officials at the NLM. After that, an update is released every few days to account for new articles that are added as the year progresses. Our full text index consists of abstracts released through PubMed until June 2011. We are working on automatically indexing new updates. The IndexWrapper service supports two types of queries. The first is a traditional query with fields including title, abstract, author, year, and pmid. The second is a phrasal range query that looks for the co-located presence of subject, predicate, and object labels in the abstract field to retrieve abstracts that contain (information relevant to) the triple. Indexing is done using the Lucene API: the SnowballAnalyzer is used for the abstract and title fields to incorporate stemming, while the KeywordAnalyzer is used for the remaining fields for exact matching. Also, boosting on the year of publication is performed at query time so that relevant articles that are more recent are ranked higher.
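The idea behind the phrasal range query can be sketched in plain Python, without Lucene. This is only an illustration of the co-location test, not the IndexWrapper implementation: the function names, the token window size, and the sample sentence are assumptions, and stemming (which the real index applies through the SnowballAnalyzer) is omitted.

```python
import re
from itertools import product
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_spans(tokens: List[str], phrase: str) -> List[Tuple[int, int]]:
    """(start, end) token offsets of every exact occurrence of the phrase."""
    p = tokenize(phrase)
    if not p:
        return []
    n = len(p)
    return [(i, i + n) for i in range(len(tokens) - n + 1) if tokens[i:i + n] == p]


def triple_colocated(abstract: str, subject: str, predicate: str, obj: str,
                     window: int = 20) -> bool:
    """True if all three labels occur within `window` tokens of each other.

    The window size is a hypothetical parameter chosen for illustration.
    """
    tokens = tokenize(abstract)
    spans = [phrase_spans(tokens, label) for label in (subject, predicate, obj)]
    if not all(spans):
        return False  # at least one label is missing from the abstract
    for combo in product(*spans):
        start = min(s for s, _ in combo)
        end = max(e for _, e in combo)
        if end - start <= window:
            return True
    return False


# Toy sentence written for this example, not a real abstract.
text = ("We show that muscarinic activation facilitates "
        "long-term potentiation in the hippocampus.")
print(triple_colocated(text, "muscarinic activation", "facilitates",
                       "long-term potentiation"))
```

In a Lucene-based index the same effect would normally be obtained with proximity-aware queries rather than by scanning tokens, so this sketch only conveys what the query is looking for.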
2. Triples Model Interface: The triples in the original HPC knowledge base extracted at the Kno.e.sis Center are serialized using Lucene for efficient programmatic access through Java. The other set of triples, obtained from NLM's BKR data set, is hosted on a Virtuoso RDF triple store and is accessed programmatically using a Java interface to the SPARQL endpoint that provides triple access. Both the SPARQL endpoint and the Lucene index are abstracted behind the same interface, which takes a set of entity IDs and returns the associated triples and inverse triples.
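The shape of such a shared interface might look like the sketch below. Scooner's actual interface is in Java and is not shown in this page, so every name here (TripleSource, TripleResult, InMemoryTripleSource, the entity IDs) is hypothetical. An in-memory backend stands in for the Lucene and SPARQL backends, which would implement the same method.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    object: str


@dataclass
class TripleResult:
    outgoing: List[Triple]  # entity appears as subject
    incoming: List[Triple]  # entity appears as object (inverse triples)


class TripleSource(ABC):
    """Common interface that hides where the triples are stored."""

    @abstractmethod
    def triples_for(self, entity_ids: Iterable[str]) -> Dict[str, TripleResult]:
        """Map each entity ID to its triples and inverse triples."""


class InMemoryTripleSource(TripleSource):
    """Stand-in backend; a Lucene- or SPARQL-backed class would mirror it."""

    def __init__(self, triples: Iterable[Triple]) -> None:
        self._by_subject: Dict[str, List[Triple]] = {}
        self._by_object: Dict[str, List[Triple]] = {}
        for t in triples:
            self._by_subject.setdefault(t.subject, []).append(t)
            self._by_object.setdefault(t.object, []).append(t)

    def triples_for(self, entity_ids: Iterable[str]) -> Dict[str, TripleResult]:
        return {
            eid: TripleResult(
                outgoing=list(self._by_subject.get(eid, [])),
                incoming=list(self._by_object.get(eid, [])),
            )
            for eid in entity_ids
        }


# Hypothetical entity IDs, used only to exercise the interface.
source: TripleSource = InMemoryTripleSource([
    Triple("e1", "facilitates", "e2"),
    Triple("e3", "part of", "e1"),
])
result = source.triples_for(["e1"])
print(len(result["e1"].outgoing), len(result["e1"].incoming))
```

Because callers depend only on TripleSource, the storage backend can be swapped without touching the code that renders triples for a clicked entity.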
3. Spotter Service: Named entities in either data set are spotted using an in-memory prefix tree populated with the entity labels when the server is set up to host Scooner. The prefix tree algorithm spots the longest available label among the entities of the knowledge base. For example, when it encounters the phrase "long term memory formation", and both "memory formation" and "long term memory formation" are valid entities in the data set, it spots the longest label, that is, "long term memory formation". However, the next time it sees the same phrase in the abstract, "memory formation" is spotted instead, because an entity is spotted only once in each abstract.
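The longest-match behaviour described for the spotter can be sketched with a word-level prefix tree. This is a minimal illustration under assumptions: the tokenization, class and function names, and the sample text are not from Scooner, and the real service may differ in how it tokenizes and normalizes labels.

```python
import re
from typing import Dict, List, Optional, Set, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.label: Optional[str] = None  # set if a full label ends here


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def build_trie(labels: List[str]) -> TrieNode:
    """Insert each entity label into the tree, one token per level."""
    root = TrieNode()
    for label in labels:
        node = root
        for token in tokenize(label):
            node = node.children.setdefault(token, TrieNode())
        node.label = label
    return root


def spot(text: str, root: TrieNode) -> List[Tuple[int, str]]:
    """Return (token offset, label) pairs, longest label first,
    spotting each entity at most once per text."""
    tokens = tokenize(text)
    spotted: Set[str] = set()
    results: List[Tuple[int, str]] = []
    i = 0
    while i < len(tokens):
        node, j = root, i
        best: Optional[Tuple[str, int]] = None
        while j < len(tokens) and tokens[j] in node.children:
            node = node.children[tokens[j]]
            j += 1
            if node.label is not None and node.label not in spotted:
                best = (node.label, j)  # keep the longest unspotted label
        if best is not None:
            label, end = best
            spotted.add(label)
            results.append((i, label))
            i = end  # continue after the spotted phrase
        else:
            i += 1
    return results


trie = build_trie(["memory formation", "long term memory formation"])
# Toy text written for this example.
sample = ("Long term memory formation depends on sleep. "
          "Later, long term memory formation was impaired.")
print(spot(sample, trie))
```

On the second occurrence the longer label has already been used, so the scan moves forward token by token until "memory formation" matches, which reproduces the behaviour described above.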
- R. Kavuluru, C. Thomas, A. Sheth, V. Chan, W. Wang, A. Smith, A. Sato and A. Walters. An Up-to-date Knowledge-Based Literature Search and Exploration Framework for Focused Bioscience Domains. IHI 2012 - 2nd ACM SIGHIT International Health Informatics Symposium, January 28-30, 2012.
- Christopher J. Thomas, Amit Sheth, Web Wisdom: An Essay on How Web 2.0 and Semantic Web can foster a Global Knowledge Society, In Book: Computers in Human Behavior, Volume:27 Issue:4, Robert Tennyson (Ed.) Elsevier Ltd. 2011, pp. 1285-1293, doi: 10.1016/j.chb.2010.07.023.
- D. Cameron, R. Kavuluru, O. Bodenreider, P. N. Mendes, A. P. Sheth, K. Thirunarayan, Semantic Predications for Complex Information Needs in Biomedical Literature, 5th International Conference on Bioinformatics and Biomedicine BIBM11, Atlanta GA, November 12-15, 2011
- D. Cameron, P. N. Mendes, A. P. Sheth, V. Chan, Semantics-Empowered Text Exploration for Knowledge Discovery, 48th ACM Southeast Conference, ACMSE2010, Oxford Mississippi, April 15-17, 2010.
- Wenbo Wang, Christopher Thomas, Amit Sheth, Victor Chan, Pattern-Based Synonym and Antonym Extraction, 48th ACM Southeast Conference, ACMSE2010, Oxford Mississippi, April 15-17, 2010.
- Christopher J. Thomas, Wenbo Wang, Pankaj Mehra, Delroy Cameron, Pablo N. Mendes, and Amit P. Sheth. What Goes Around Comes Around - Improving Linked Open Data through On-Demand Model Creation. In Web Science, 2010.
- Amit Sheth, Christopher Thomas, Pankaj Mehra, Continuous Semantics to Analyze Real-Time Data, IEEE Internet Computing, vol. 14, no. 6, pp. 84-89, Nov./Dec. 2010, doi:10.1109/MIC.2010.137
- Christopher J. Thomas, Pankaj Mehra, Wenbo Wang, Amit P. Sheth, Gerhard Weikum, and Victor Chan. Automatic Domain Model Creation Using Pattern-Based Fact Extraction, Kno.e.sis Center Technical Report 2010