April 29th, 2022
Candidate: Juan Trelles Trabucco
Date: Friday, April 29, 2022
Time: 11 am CT
Committee Members: Dr. G. Elisabeta Marai, Dr. Andy Johnson, Dr. Wei Tang, Dr. Steven Drucker, Dr. Cecilia Arighi
Biomedical researchers need to search over an increasing number of publications to find knowledge. Among these researchers, biocurators specialize in identifying relevant publications and extracting information to populate structured databases. Efforts from biocuration groups benefit a much larger community that accesses this curated information daily. To deal effectively with the extensive collection of documents, the biocuration workflow uses text-mining approaches to automate tasks such as predicting the relevance of a publication. While exploiting patterns in text data has yielded significant breakthroughs for biocuration, images in publications also contain important cues for determining a document's relevance. Still, integrating this image content with textual features is underexplored due to difficulties in collecting images, defining classification schemes, and accessing domain experts for data labeling.
Rooted in a years-long collaboration with text-mining researchers and biocurators, I propose strategies for harvesting labels from biomedical publications and modeling image classifiers to support biocuration. First, I present two approaches for training deep learning image classifiers on small and unbalanced datasets. Then, I characterize the labeling task in biocuration and introduce a labeling system tailored for domain experts. Next, I build on lessons from this labeling system to develop a labeling approach for model builders that combines visual analytics and machine learning techniques. These works provide the building blocks for the future design of search interfaces that leverage text- and image-based data for biomedical experts. In addition, they support the development of novel multi-modal approaches for detecting a publication's relevance and interpreting these predictions.