This repo provides implementations of Slot Attention, Vector Quantization, Visual GPT, and Image PCFG.
Image PCFG is a straightforward extension of probabilistic context-free grammars (PCFGs) from language to 2-dimensional images. Here are algorithms and implementations. Motivations and technical details are summarized in my thesis (pp. 18-21). The key idea is to use a pre-trained vector quantization model to tokenize images into
Check out run_ipcfg.sh and run_ipcfg_eval.sh for training and evaluation.
Inspired by Image GPT and DALL·E, I combined Vector Quantization and GPT to solve an abstract visual reasoning task. Below is an example of the task: what is the most likely image to follow the given sequence of images (have a guess :))? My approach consists of (1) using a pre-trained vector quantization model to tokenize the prefix images, (2) formulating the task as causal language modeling, and (3) generating the most likely image using GPT.
Check out run_raven_solver.sh and run_raven_eval.sh for training and evaluation.
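The three steps above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the code in this repo: the function names (quantize, build_prefix, greedy_generate), the toy 2-dimensional codebook, and the next_token_scores callback that stands in for GPT are all assumptions made for the example.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def quantize(patches: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Step 1: map each patch embedding to the index of its nearest codebook vector."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(patch, codebook[k]))
        for patch in patches
    ]


def build_prefix(images: Sequence[Sequence[Vector]], codebook: Sequence[Vector]) -> List[int]:
    """Step 2: concatenate the token ids of all prefix images into one sequence,
    so that predicting the next image becomes next-token prediction."""
    tokens: List[int] = []
    for patches in images:
        tokens.extend(quantize(patches, codebook))
    return tokens


def greedy_generate(
    prefix: Sequence[int],
    next_token_scores: Callable[[List[int]], List[float]],
    num_new_tokens: int,
) -> List[int]:
    """Step 3: repeatedly append the highest-scoring next token (greedy decoding).

    next_token_scores is a hypothetical stand-in for a trained GPT: it takes the
    sequence so far and returns one score per codebook entry.
    """
    seq = list(prefix)
    for _ in range(num_new_tokens):
        scores = next_token_scores(seq)
        seq.append(max(range(len(scores)), key=scores.__getitem__))
    return seq[len(prefix):]


# Toy usage: a 2-entry codebook and two "images" of two patches each.
codebook = [[0.0, 0.0], [1.0, 1.0]]
images = [
    [[0.1, 0.0], [0.9, 1.0]],
    [[0.2, 0.1], [1.1, 0.8]],
]
prefix = build_prefix(images, codebook)  # [0, 1, 0, 1]


def repeat_period_two(seq: List[int]) -> List[float]:
    """Toy scorer that favors the token seen two positions earlier."""
    return [1.0 if k == seq[-2] else 0.0 for k in range(len(codebook))]


next_image_tokens = greedy_generate(prefix, repeat_period_two, num_new_tokens=2)
print(next_image_tokens)  # [0, 1]
```

In a real setup the generated token ids would be passed back through the vector quantization decoder to render the predicted image; greedy decoding is only one of several possible decoding choices.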
For Slot Attention, see this paper for technical details. I trained and evaluated models on AbstractScenes and CLEVR. Below are some illustrations:
Check out run_slot_abscene.sh and run_slot_clevr.sh for training and evaluation.
MIT