August 24th, 2023
Candidate: Juan Trelles Trabucco
- Dr. Liz Marai
- Dr. Andy Johnson
- Dr. Wei Tang
- Dr. Cecilia Arighi (University of Delaware)
- Dr. Steven Drucker (Microsoft Research)
Date/time: Aug 24th, 11 am CT
Location: ERF 3036 and Zoom
https://uic.zoom.us/j/84392472211?pwd=RTVwQjZaSlVBcERHcUloc1N6THQ0QT09 Meeting ID: 843 9247 2211 Passcode: UA6mfGMR
Images in scientific publications communicate essential information, such as experiments performed and results obtained, that researchers can use as a proxy for the publication'’s relevance to their research interests. Domains like biocuration, where biomedical experts analyze the literature to extract and organize information for future retrieval and analysis, can benefit from computationally leveraging these images in their workflows. Prior work in biomedical document classification for biocuration has shown that combining features from images (e.g., acquisition modality) with text features can improve precision and recall. However, the considerations and benefits obtained from computationally integrating such image-based features in biocuration workflows are less clear.
Based on a multi-year collaboration with text-mining researchers and biocurators, this dissertation identifies and tackles several challenges to computationally using image-data in the biocuration process. These challenges include the scarcity of labeled datasets for training image classifiers, the lack of characterization for the needed modality classes in biocuration, the lack of available labeling systems to label extensive collections of images, and the lack of support for image-based features in academic search engines. This dissertation addresses these issues by proposing two taxonomies for image modalities found in biomedical publications; introducing training strategies using shallower neural networks and hierarchical organization of classifiers; describing two labeling systems, one for domain experts and one for model builders who require support for data understanding; and concludes by merging several building blocks into document search systems that successfully leverage image-based data.