It appears that Google might not be the best source of data for training a machine vision system (a baby) in its early stages. I use the word 'baby' loosely to refer to the current state of the art in machine vision. Before a teenage vision algorithm can learn to navigate the nasty world on its own, it must learn the basics of vision and object classification in elementary school.
What I'm trying to say is that there is a time for everything; there is a time for unsupervised learning. Try searching Google Images for 'shoe' and you will find that the third result is a 10-foot-tall high-heeled shoe. A human will understand that this is still a shoe even though its scale is out of whack. But when trying to teach a baby what the word 'shoe' means, it is a bad idea to show it 10-foot-high statues of shoes.
By the way, Google Images also returns too many synthetic and manually edited images of objects. Such unrealistic images are not good for training babies.