CALIFORNIA INSTITUTE OF TECHNOLOGY
This supplement will accelerate Aim 1C of the original application ("We will efficiently curate information derived from the literature and user submission"), thereby helping achieve Aim 1 (Increase Database Content), and, in a new Aim 3 described in this document, will pilot extension of the methods and tools to curation at other databases. Reasonably comprehensive, well-structured databases are critical to modern biomedical research, but only a few groups have such resources. While many model organism researchers have access to such databases, these projects struggle to keep up with the expanding literature. As part of the parent grant, WormBase has begun to use automation and semi-automation in our curation pipeline. This automation will be accelerated within WormBase and extended to other model organism databases (MODs). As part of this project, we will compare data models and curation strategies at each of eleven database projects (including Mouse Genome Database, FlyBase,
WormBase, Saccharomyces Genome Database, ZebraFish Information Network, and UniProtKB), prioritize the development of tools according to joint needs and opportunities, and implement automated curation pipelines at a few sites. A generic curation workflow includes paper identification (triage), first-pass curation (indexing data types), and retrieval or extraction of facts related to specific data types; this workflow is well suited to automation. Various statistical NLP methods for classifying and indexing papers will be investigated as training sets are developed; a Support Vector Machine (SVM) approach is promising, but Hidden Markov Models and Conditional Random Fields will also be evaluated. The Textpresso Search Engine has been adapted by many MODs, at least in pilot form, and will be used in some curation tasks to identify individual sentences containing relevant facts. Curators will evaluate the search results by analyzing true and false positives and negatives using the standard metrics of recall and precision. Their evaluations will serve as the basis for improving recall and precision. All data extracted from papers will be available in freely accessible component databases; annotated training sets will be freely available; all software will be open source and freely available for anonymous download.
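
The curator evaluation described above rests on the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The following is a minimal sketch of that calculation, not the project's actual tooling. The names `Counts`, `count_outcomes`, `precision`, and `recall`, and the sentence identifiers, are hypothetical and chosen only for illustration. It assumes a curator has marked which retrieved sentences are truly relevant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Outcome counts for one evaluation run (hypothetical helper type)."""
    tp: int  # retrieved and judged relevant by the curator
    fp: int  # retrieved but judged not relevant
    fn: int  # relevant but not retrieved


def count_outcomes(retrieved, relevant):
    """Compare retrieved sentence IDs against curator-judged relevant IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    return Counts(
        tp=len(retrieved & relevant),
        fp=len(retrieved - relevant),
        fn=len(relevant - retrieved),
    )


def precision(c):
    """Fraction of retrieved items that are relevant; 0.0 if nothing was retrieved."""
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c):
    """Fraction of relevant items that were retrieved; 0.0 if nothing is relevant."""
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


if __name__ == "__main__":
    # Illustrative IDs only: four sentences retrieved, three judged relevant.
    counts = count_outcomes(["s1", "s2", "s3", "s4"], ["s2", "s3", "s5"])
    print(counts, precision(counts), recall(counts))
```

True negatives do not enter either metric, which is convenient for literature search, where the set of correctly ignored sentences is very large and rarely enumerated.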