Leveraging the Latent Space for Model Understanding and Optimization
Abstract
In recent years, the field of machine learning has seen massive growth in both the size of models and their performance on tasks such as classification and image generation. However, these models are typically limited by two key factors. First, models such as those used for text-to-image generation lack interpretability. Second, models that leverage the latent space to represent data struggle to capture fine-grained details, which often results in reconstructions that do not accurately represent the original data. To address the issue of interpretability in text-to-image models, we introduce WINOVIS, a novel dataset designed to probe these models' ability to interpret textual prompts. This approach reframes the task of pronoun disambiguation from a purely textual problem into a multi-modal one involving both visual and textual understanding. Second, we turn our focus to image-generation models such as the VQ-VAE, which often struggle to reconstruct images that capture the finer details of the original input. By introducing lightweight and straightforward modifications to the VQ-VAE's loss function and dictionary selection process, we enable the reconstruction of images that retain fine-grained details often absent from the reconstructions produced by the traditional VQ-VAE.
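As background for the second contribution (this is the standard VQ-VAE formulation, not the thesis's specific modifications), the dictionary selection step maps each encoder output to its nearest codebook entry, and the training objective combines a reconstruction term with codebook and commitment terms. A minimal NumPy sketch, with the stop-gradient of the original formulation omitted for simplicity:

```python
import numpy as np

def quantize(z_e, codebook):
    """Dictionary selection: map each encoder output vector to its
    nearest codebook entry under squared L2 distance.
    z_e: (N, D) encoder outputs; codebook: (K, D) dictionary entries."""
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    idx = dists.argmin(axis=1)          # index of nearest code per vector
    return codebook[idx], idx

def vqvae_loss(x, x_rec, z_e, z_q, beta=0.25):
    """Standard VQ-VAE objective: reconstruction + codebook + commitment.
    (A sketch: in practice the codebook and commitment terms use
    stop-gradients, which plain NumPy does not model.)"""
    rec = ((x - x_rec) ** 2).mean()                 # reconstruction error
    codebook_loss = ((z_q - z_e) ** 2).mean()       # pulls codes toward encoder outputs
    commitment = beta * ((z_e - z_q) ** 2).mean()   # keeps encoder near chosen codes
    return rec + codebook_loss + commitment

# Illustrative usage with random data (shapes only; not trained weights).
rng = np.random.default_rng(0)
z_e = rng.normal(size=(8, 4))        # 8 encoder outputs of dimension 4
codebook = rng.normal(size=(16, 4))  # dictionary of 16 codes
z_q, idx = quantize(z_e, codebook)
```

The thesis's modifications target exactly these two components: how entries are selected from the dictionary and how the loss terms are weighted, which is why small changes there can recover detail lost by the standard objective.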