Visually Grounded Reasoning
What would you do if you were tired and saw that there happened to be a chair nearby? Sit on that chair, right? According to an analysis by Harvard researchers, vision accounts for about 83% of the external information the human brain receives across all senses. In other words, human beings can quickly extract almost all the necessary environmental elements from visual input alone. No matter what type of chair is in front of us (e.g., a bench or a cradle), we can quickly judge where to sit and where to place our hands. Contrary to popular belief, an artificial intelligence given images as its only form of information struggles to infer the position of each part of the chair from the arrangement of RGB pixels, and it is even harder for it to infer the function each part serves. For an agent to observe and perceive the world as we do, its underlying algorithm should be not only more sensitive but also more thoughtful. Being sensitive means memorizing the features of target objects and perceiving those features from visual input, whereas being thoughtful means being motivated to think beyond the pixels and reason about the properties of the objects that matter (e.g., can I sit there?). Extending the chair setting to broader indoor and outdoor environments, we hope to build AI that provides visual task guidance in the wild and brings expert knowledge to challenging scenarios.