FYI!
-------- Forwarded Message --------
Subject: [Bburg-fac] Invitation to final defense of Ph.D. candidate
Yufeng Ma, soon to go to Yahoo!
Date: Fri, 15 Feb 2019 10:41:14 -0500
From: EdFox via Bburg-fac <bburg-fac@cs.vt.edu>
Reply-To: EdFox <fox@vt.edu>
To: CS Faculty <faculty@cs.vt.edu>
All are invited to the final defense of Ph.D. candidate Yufeng Ma, soon
to go to Yahoo!
*Time*: 2:30-4:30 PM EST, 02/20 (Wednesday)
*Location*: 2030E Torgersen Hall (DLRL)
*Title:* Going Deeper with Images and Natural Language
*Abstract:* One aim in the area of artificial intelligence (AI) is to
develop a smart agent with high intelligence that is able to perceive
and understand the complex visual environment around us. More
ambitiously, it should be able to interact with us about its
surroundings in natural language. Thanks to the progress made in deep
learning, we have seen huge breakthroughs toward this goal. Progress in
visual recognition has been extremely rapid: machines can now categorize
images into multiple classes and detect various objects within an image,
with ability that rivals or even surpasses that of humans. Meanwhile, we
have witnessed similar strides in natural language processing (NLP),
where computers now perform tasks such as text classification and
machine translation almost perfectly. However, despite this inspiring
progress, most achievements remain confined to a single domain rather
than bridging domains. The interaction between
the visual and textual areas is still quite limited, although there has
been progress in image captioning, visual question answering, etc.
In this dissertation, we design models and algorithms that enable us to
build in-depth connections between images and natural language, which
help us to better understand their inner structures. In particular,
first we augment the conventional image captioning model using
adversarial loss, to boost its performance on various metrics (e.g.,
BLEU, CIDEr). Generative adversarial networks (GANs) are connected with
discrete caption data through the Gumbel-Softmax trick. The results of
our experiments with the MSCOCO caption dataset show that our model is
capable of generating captions with improved quality and higher
readability. Second, we develop a model (Quasi-Supervised Learning
network) for measuring review congruence, which takes an image and
the review text as input and quantifies the relevance of each sentence
to the image. The whole model is trained in a purely unsupervised way,
much like GANs. Although there are no manual tags involved in training
the model, we build pseudo labels with the help of attention mechanisms.
Experimental results on the Yelp Restaurant dataset show that the model
accurately estimates the relevance of sentences to an image. Lastly, we
go deeper with fine-grained image understanding by
identifying aspect level ratings and by detecting corresponding aspects
from the image within the review context, which is considered a novel AI
task. This would help merchants provide better service and readers make
choices more easily. Adaptive models are proposed that automatically
adjust the weights placed on images and text for overall rating
prediction. These models show significant improvement over our baseline,
measured by the Mean Squared Error (MSE) metric, for both overall and
aspect-level ratings.
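The Gumbel-Softmax trick mentioned above (for passing gradients through discrete caption words to a GAN loss) can be sketched as follows. This is a minimal NumPy illustration, not the dissertation's actual implementation; the function name and temperature value are assumptions for the example:

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=None):
    """Draw a differentiable, relaxed sample from the categorical
    distribution defined by `logits` (the Gumbel-Softmax trick).
    Lower `tau` pushes the sample closer to a discrete one-hot vector."""
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    u = rng.uniform(size=np.shape(logits))
    g = -np.log(-np.log(u))
    # Softmax over the temperature-scaled, noise-perturbed logits
    z = (np.asarray(logits, dtype=float) + g) / tau
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Example: (hypothetical) vocabulary logits for one caption word
probs = gumbel_softmax([2.0, 1.0, 0.1], tau=0.5)
# `probs` is a valid probability vector; because the sampling is a
# smooth function of the logits, the caption generator's word choice
# remains differentiable for the adversarial loss.
```

In a real training loop this relaxation would replace the non-differentiable argmax/sampling step of the caption decoder, letting discriminator gradients flow back into the generator.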
On the theoretical side, this research contributes to multiple areas:
Computer Vision (CV), Natural Language Processing (NLP), the interaction
between CV and NLP, and Deep Learning. Regarding impact, these
techniques will benefit related