
Wednesday, May 16, 2018

DeepFakes: AI-powered deception machines

Driven by computer vision and deep learning techniques, a new wave of imaging attacks has recently emerged which lets anyone easily create highly realistic "fake" videos. These false videos are known as DeepFakes. While highly entertaining at times, DeepFakes can be used to disrupt society, and some would argue that the pre-shocks have already begun. A rogue DeepFake which goes viral can spread misinformation across the internet like wildfire.

"The ability to effortlessly create visually plausible editing of faces in videos has the potential to severely undermine trust in any form of digital communication. "
--Rössler et al. FaceForensics [3]

Because DeepFakes contain a unique combination of realism and novelty, they are more difficult to detect on social networks than traditional "bad" content like pornography and copyrighted movies. Video hashing might work for finding duplicates or copyright-infringing content, but it is not good enough for DeepFakes. To fight face-manipulating DeepFake AI, one needs an even stronger AI.

As today's DeepFakes are based on Deep Learning, and Deep Learning tools like TensorFlow and PyTorch are accessible to anybody with a modern GPU, such face manipulation tools are particularly disruptive. The democratization of Artificial Intelligence has brought us near infinite use-cases. From the DeepDream phenomenon of 2015 to the Deep Style Transfer Art apps of 2016, 2018 is the year of the DeepFake. Today's computer vision technology allows a hobbyist to create a Deep Fake video of just about any person they want performing any action they want, in a matter of hours, using commodity computer hardware.
Fig 1. DeepFakes generate "false impressions" which are attacks on the human mind.

What is a Deep Fake?

A deep fake is a video generated by a modern computer vision puppeteering face-swap algorithm which can be used to produce a video of target person X performing action A, usually given a video of another person Y performing action A. The underlying system learns two face models, one of target person X and one of person Y, the person in the original video. It then learns a mapping between the two faces, which can be used to create the resulting "fake" video. Techniques for facial reenactment have been pioneered by movie studios for driving character animations from real actors' faces, but these techniques are now emerging as deep learning-based software packages, letting the deep convolutional neural networks do most of the work during model training.
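To make the face-swap idea concrete, here is a minimal PyTorch sketch of the shared-encoder / per-identity-decoder design that popular open-source face-swap tools are commonly described as using. The layer sizes and training details are illustrative assumptions, not the architecture of any specific tool.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # downsample by 2x at each stage
    return nn.Sequential(nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.LeakyReLU(0.1))

def deconv_block(c_in, c_out):
    # upsample by 2x at each stage
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.ReLU())

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(conv_block(3, 64), conv_block(64, 128),
                                 conv_block(128, 256), conv_block(256, 512))
    def forward(self, x):       # x: (B, 3, 64, 64) aligned face crops
        return self.net(x)      # shared, identity-agnostic face code

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(deconv_block(512, 256), deconv_block(256, 128),
                                 deconv_block(128, 64),
                                 nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
                                 nn.Sigmoid())
    def forward(self, z):
        return self.net(z)

encoder, decoder_x, decoder_y = Encoder(), Decoder(), Decoder()
# Training: reconstruct X's faces through decoder_x and Y's faces through decoder_y,
# sharing the encoder. Swapping: decoder_x(encoder(frame_of_Y)) renders person X
# mimicking person Y's expression and pose.
```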

Consider the following collage of faces. Can you guess which ones are real and which ones are DeepFakes?

Fig 2. Can you tell which faces are real and which ones are fake? 
Figure from Face Forensics[3]

It is not so easy to tell which image is modified and which one is unadulterated. And if you do a little bit of searching for DeepFakes (warning: unless you are careful, you will encounter lots of pornographic content), you will notice that the faces in those videos look very realistic.

How are Deep Fakes made?
While there are conceptually many different ways to make Deep Fakes, today we'll focus on two key underlying techniques: face detection from videos, and deep learning for creating frame alignments between source face X and target face Y.
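As a hedged sketch of the first ingredient, here is one way to pull face crops out of a video using OpenCV's stock Haar cascade detector. Real pipelines typically use stronger detectors plus facial-landmark alignment; this is just the minimal version, and the video path below is a placeholder.

```python
import cv2

def extract_faces(video_path, out_size=(256, 256)):
    """Crop detected faces from every frame of a video (hypothetical helper)."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    crops = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
            crops.append(cv2.resize(frame[y:y+h, x:x+w], out_size))
    cap.release()
    return crops

# faces = extract_faces("some_interview.mp4")  # path is a placeholder
```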

A lot of this research started with the Face2face work [1] presented at CVPR 2016. This paper was a modernization of the group's earlier SIGGRAPH paper and focused a lot more on the computer vision details. At this time the tools were good enough to create SIGGRAPH-quality videos, but it took a lot of work to put together a facial reenactment rig. In addition, the underlying algorithms did not use any deep learning, so a lot of domain-knowledge (i.e., face modeling expertise) went into making these algorithms work robustly. The TUM/Stanford guys filed their Real-time facial reenactment patent in 2016 [4], and have more recently worked on FaceForensics[3] to detect such manipulated imagery.

Fig 3. Face2Face technique from 2016. It is 2018 now, so just imagine how much better this works now!

In addition to the Face2face guys (who now have a handful of similarly themed papers), it is interesting to note that a lot of key early ideas in face puppeteering were pioneered by Ira Kemelmacher-Shlizerman, who is now a computer vision and graphics assistant professor at the University of Washington. She worked on early face puppeteering technology for the 2010 Being John Malkovich paper, continued with the Photobios work, and later founded Dreambit (based on a SIGGRAPH 2016 paper), which was acquired by Facebook. :-)


Fig 4. Ira's early work on face swapping in 2010. See the Being John Malkovich paper[2].

Take a look at Ira's Dreambit video, which shows the high "entertainment" value of rapidly produced, non-malicious DeepFakes!

Fig 5. Ira's Dreambit system. It lets her imagine herself in different eras, with different hairstyles, etc.

The origin of Ira's Dreambit system is the Transfiguring Portraits SIGGRAPH 2016 paper[6]. What's important to note is that this is 2016 and we're starting to see some use of Deep Learning. The Transfiguring Portraits work used a big mix of features, including some CNN features computed from early Caffe networks. It was not an entirely easy-to-use system at this point, but it was good enough to make SIGGRAPH videos, fast enough to generate other cool outputs in about a minute, and definitely cool enough for Facebook to acquire.


Fig 6. Transfiguring Portraits. The system used lots of features, but Deep Learning-based CNN features are starting to show up.

Fighting against DeepFakes
There are now published algorithms which try to battle DeepFakes by determining if faces/videos are fake or not. FaceForensics[3] introduces a large DeepFake dataset based on their earlier Face2face work. This dataset contains both real and "fake" Face2face output videos. More importantly, the new dataset is big enough to train a deep learning system to determine if an image is counterfeit. In addition, they are able to both 1.) determine which pixels have likely been manipulated, and 2.) perform a deep cleanup stage to make even better DeepFakes.
Fig 7. The "fakeness" masks in FaceForensics[3] are based on XceptionNet
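On the detection side, a minimal sketch of the idea looks something like the following: fine-tune an ImageNet-pretrained CNN as a binary real/fake classifier on face crops. FaceForensics reports results with XceptionNet; a torchvision ResNet-18 stands in below purely for illustration, and the training loop is schematic.

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in backbone (the paper uses XceptionNet, which is not in torchvision).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)      # classes: 0 = real, 1 = fake
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, labels):
    """images: (B, 3, 224, 224) face crops; labels: LongTensor of 0/1."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```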

Another fake detection approach, this time from a Berkeley AI Research group, is Image Splice Detection [5], which focuses on detecting where an image was spliced to create a fake composite image. This allows them to determine which part of the image was likely "photoshopped", and the technique is not specific to faces. And because this is a 2018 paper, it should not be a surprise that this kind of work is all based on deep learning techniques.

Fig 8. Fighting Fake News: Image Splice Detection[5]. Response maps are aggregated to determine the combined probability mask.[5]

From the Fighting Fake News paper,
"As new advances in computer vision and image-editing emerge, there is an increasingly urgent need for effective visual forensics methods. We see our approach, which successfully detects manipulations without seeing examples of manipulated images, as being an initial step toward building general-purpose forensics tools."

Concluding Remarks
The first DeepFake-style tools were pioneered in the early 2010s and were producing SIGGRAPH-quality results by 2015. It was only a matter of years until DeepFake generators became publicly available. 2018's DeepFake generators, being written on top of open-source Deep Learning libraries, are much easier to use than the researchy systems from only a few years back. Today, just about any hobbyist with minimal computer programming knowledge and a GPU can build their own DeepFakes.

Just as Deep Fakes are getting better, Generative Adversarial Networks are showing more promise for photorealistic image generation. It is likely that we will soon see lots of exciting new work on both the generative side (deep fake generation) and the discriminative side (deep fake detection and image forensics) which incorporate more and more ideas from the machine learning community.


References

[1] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. "Face2face: Real-time face capture and reenactment of rgb videos." In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pp. 2387-2395. IEEE, 2016.

[2] Ira Kemelmacher-Shlizerman, Aditya Sankar, Eli Shechtman, and Steven M. Seitz. "Being john malkovich." In European Conference on Computer Vision, pp. 341-353. Springer, 2010.

[3] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. "FaceForensics: A Large-scale Video Dataset for Forgery Detection in Human Faces." arXiv preprint arXiv:1803.09179, 2018.

[4] Christian Theobalt, Michael Zollhöfer, Marc Stamminger, Justus Thies, and Matthias Nießner. "Real-time Expression Transfer for Facial Reenactment." Patent Application No. 15256710, published 2018/3/8.

[5] Minyoung Huh, Andrew Liu, Andrew Owens, Alexei A. Efros, "Fighting Fake News: Image Splice Detection via Learned Self-Consistency." arXiv preprint arXiv:1805.04096, 2018

[6] Ira Kemelmacher-Shlizerman, "Transfiguring portraits." ACM Transactions on Graphics (TOG), 35(4), p.94. 2016




Friday, April 24, 2015

Making Visual Data a First-Class Citizen

"Above all, don't lie to yourself. The man who lies to himself and listens to his own lie comes to a point that he cannot distinguish the truth within him, or around him, and so loses all respect for himself and for others. And having no respect he ceases to love." ― Fyodor Dostoyevsky, The Brothers Karamazov


City Forensics: Using Visual Elements to Predict Non-Visual City Attributes

To respect the power and beauty of machine learning algorithms, especially when they are applied to the visual world, let's take a look at three recent applications of learning-based "computer vision" to computer graphics. Researchers in computer graphics are known for producing truly captivating illustrations of their results, so this post is going to be very visual. Now is your chance to sit back and let the pictures do the talking.

Can you predict things simply by looking at street-view images?

Let's say you're going to visit an old friend in a foreign country for the first time. You've never visited this country before and have no idea what kind of city/neighborhood your friend lives in. So you decide to get a sneak peek -- you enter your friend's address into Google Street View.

Most people can look at Google Street View images in a given location and estimate attributes such as "sketchy," "rural," "slum-like," or "noisy" for the given neighborhood. TL;DR: a person is a pretty good visual recommendation engine.

Can you predict if this looks like a safe location? 
(Screenshot of Street view for Manizales, Colombia on Google Earth)

Can a computer program predict things by looking at images? If so, then these kinds of computer programs could be used to automatically generate semantic map overlays (see the crime prediction overlay from the first figure), help organize fast-growing cities (computer vision meets urban planning?), and ultimately bring about a new generation of match-making "visual recommendation engines" (a whole suite of new startups).

Before I discuss the research paper behind this idea, here are two cool things you could do (in theory) with a non-visual data prediction algorithm. There are plenty of great product ideas in this space -- just be creative.

Startup Idea #1: Avoiding sketchy areas when traveling abroad 
A personalized location recommendation engine could be used to find locations in a city that I might find interesting (a techie coffee shop for entrepreneurs, a park good for frisbee) subject to my constraints (near my current location, in a low-danger area, low traffic).  Below is the kind of place you want to avoid if you're looking for a coffee and a place to open up your laptop to do some work.

Google Street Maps, Morumbi São Paulo: slum housing (image from geographyfieldwork.com)

Startup Idea #2: Apartment Pricing and Marketing from Images
Visual recommendation engines could be used to predict the best images to represent an apartment for an Airbnb listing.  It would be great if Airbnb had a feature that let you upload videos of your apartment and predicted the set of static images that best depict your apartment to maximize earning potential. I'm sure Airbnb users would pay for this feature if it were available for a small extra charge. The same computer vision prediction idea can be applied to home pricing on Zillow, Craigslist, and anywhere else that pictures of for-sale items are shared.

Google image search result for "Good looking apartment". Can computer vision be used to automatically select pictures that will make your apartment listing successful on Airbnb?

Part I. City Forensics: Using Visual Elements to Predict Non-Visual City Attributes


The Berkeley Computer Graphics Group has been working on predicting non-visual attributes from images, so before I describe their approach, let me discuss how Berkeley's Visual Elements relate to Deep Learning.

Predicting Chicago Thefts from San Francisco data. Predicting Philadelphia Housing Prices from Boston data. From City Forensics paper.



Deep Learning vs Mid-level Patch Discovery (Technical Discussion)
You might think that non-visual data prediction from images (if even possible) will require a deep understanding of the image and thus these approaches must be based on a recent ConvNet deep learning method. Obviously, knowing the locations and categories associated with each object in a scene could benefit any computer vision algorithm.  The problem is that such general purpose CNN recognition systems aren't powerful enough to parse Google Street View images, at least not yet.

Another extreme is to train classifiers on entire images.  This was initially done when researchers were using GIST, but there are just too many nuisance pixels inside a typical image, so it is better to focus your machine learning on a subset of the image.  But how do you choose the subset of the image to focus on?

There exist computer vision algorithms that can mine a large dataset of images and automatically extract meaningful, repeatable, and detectable mid-level visual patterns. These methods are not label-based and work really well when there is an underlying theme tying together a collection of images. The set of all Google Street View Images from Paris satisfies this criterion.  Large collections of random images from the internet must be labeled before they can be used to produce the kind of stellar results we all expect out of deep learning.

The Berkeley Group uses visual elements automatically mined from images as the core representation.  Mid-level visual patterns are simply chunks of the image which correspond to repeatable configurations -- they sometimes contain entire objects, parts of objects, and popular multiple object configurations. (See Figure below)  The mid-level visual patterns form a visual dictionary which can be used to represent the set of images. Different sets of images (e.g., images from two different US cities) will have different mid-level dictionaries. These dictionaries are similar to "Visual Words" but their creation uses more SVM-like machinery.

The patch mining algorithm is known as mid-level patch discovery. You can think of mid-level patch discovery as a visually intelligent K-means clustering algorithm, but for really really large datasets. Here's a figure from the ECCV 2012 paper which introduced mid-level discriminative patches.

Unsupervised Discovery of Mid-Level Discriminative Patches

Unsupervised Discovery of Mid-Level Discriminative Patches. Saurabh Singh, Abhinav Gupta and Alexei A. Efros. In European Conference on Computer Vision (2012).
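For the curious, here is a compact sketch of the patch-mining loop in the spirit of the paper above: sample patches, cluster them, and train one linear SVM per cluster against a generic negative set, keeping only the clusters whose detectors separate cleanly. Patch sizes, feature settings, and the purity score are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np
from skimage.feature import hog
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def random_patch_features(images, size=64, per_image=20, seed=0):
    """Sample square patches from grayscale images and describe them with HOG."""
    rng = np.random.default_rng(seed)
    feats = []
    for img in images:                      # each img: 2D array larger than `size`
        H, W = img.shape
        for _ in range(per_image):
            y, x = rng.integers(0, H - size), rng.integers(0, W - size)
            feats.append(hog(img[y:y+size, x:x+size],
                             pixels_per_cell=(8, 8), cells_per_block=(2, 2)))
    return np.array(feats)

def mine_discriminative_patches(discovery_imgs, negative_imgs, n_clusters=50, keep=10):
    pos = random_patch_features(discovery_imgs)
    neg = random_patch_features(negative_imgs)
    labels = KMeans(n_clusters=n_clusters, n_init=4).fit(pos).labels_
    detectors = []
    for k in range(n_clusters):
        members = pos[labels == k]
        if len(members) < 3:                # skip tiny clusters
            continue
        X = np.vstack([members, neg])
        y = np.r_[np.ones(len(members), dtype=int), np.zeros(len(neg), dtype=int)]
        svm = LinearSVC(C=0.1).fit(X, y)
        purity = svm.decision_function(members).mean()   # crude cluster quality score
        detectors.append((purity, svm))
    detectors.sort(key=lambda t: -t[0])
    return [svm for _, svm in detectors[:keep]]
```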

I should also point out that non-final layers in a pre-trained CNN could also be used for representing images, without the need to use a descriptor such as HOG. I would expect the performance to improve, so the question is perhaps: how long until somebody publishes an awesome unsupervised CNN-based patch discovery algorithm? I bet a handful of researchers are already working on it. :-)

Related Blog Post: From feature descriptors to deep learning: 20 years of computer vision
The City Forensics paper from Berkeley tries to map the visual appearance of cities (as obtained from Google Street View Images) to non-visual data like crime statistics, housing prices and population density.  The basic idea is to 1.) mine discriminative patches from images and 2.) train a predictor which can map these visual primitives to non-visual data. While the underlying technique is that of mid-level patch discovery combined with Support Vector Regression (SVR), the result is an attribute-specific distribution over GPS coordinates.  Such a distribution should be appreciated for its own aesthetic value. I personally love custom data overlays.

City Forensics: Using Visual Elements to Predict Non-Visual City Attributes. Sean Arietta, Alexei A. Efros, Ravi Ramamoorthi, Maneesh Agrawala. In IEEE Transactions on Visualization and Computer Graphics (TVCG), 2014.
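A hedged sketch of the second stage (visual evidence to non-visual attribute) might look like the following: build a feature vector of visual-element detector responses for each GPS cell and fit a Support Vector Regressor against the attribute of interest. The synthetic data below only illustrates the plumbing, not the paper's actual features or targets.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_cells, n_elements = 500, 200                 # GPS cells x visual elements (placeholders)
X = rng.random((n_cells, n_elements))          # detector responses per cell (stand-in data)
y = X[:, :10].sum(axis=1) + 0.1 * rng.standard_normal(n_cells)   # synthetic attribute

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
reg = SVR(kernel="rbf", C=10.0).fit(X_tr, y_tr)
print("held-out R^2:", reg.score(X_te, y_te))
```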


Part II. The Selfie 2.0: Computer Vision as a Sidekick


Sometimes you just want the algorithm to be your sidekick. Let's talk about a new and improved method for using vision algorithms and the wisdom of the crowds to select better pictures of your face. While you might think of an improved selfie as a silly application, you do want to look "professional" in your professional photos, sexy in your "selfies" and "friendly" in your family pictures. An algorithm that helps you get the desired picture is an algorithm the whole world can get behind.



Attractiveness versus Time. From MirrorMirror Paper.

The basic idea is to collect a large video of a single person which spans different emotions, times of day, different days, or whatever condition you would like to vary.  Given this video, you can use crowdsourcing to label frames based on a property like attractiveness or seriousness.  Given these labeled frames, you can then train a standard HOG detector and predict one of these attributes on new data. Below is a figure which shows the 10 best shots of the child (lots of smiling and eye contact) and the 10 worst shots (bad lighting, blur, red-eye, no eye contact).


10 good shots, 10 worst shots. From MirrorMirror Paper.
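A minimal sketch of this pipeline, assuming you already have crowdsourced good/bad labels for a set of frames, could look like this: describe each frame with HOG and fit a linear classifier whose decision score ranks new frames. This is a simplification of the paper's method, not its exact model.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import LinearSVC

def frame_features(frames, size=(128, 128)):
    # frames: list of 2D grayscale arrays
    return np.array([hog(resize(f, size), pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for f in frames])

def train_frame_scorer(frames, labels):
    """labels: 1 = good shot, 0 = bad shot (from crowdsourcing)."""
    return LinearSVC(C=1.0).fit(frame_features(frames), labels)

def rank_frames(clf, frames):
    scores = clf.decision_function(frame_features(frames))
    return np.argsort(-scores)      # indices of the best-scoring frames first
```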

You can also collect a video of yourself as you go through a sequence of different emotions, get people to label frames, and build a system which can predict an attribute such as "seriousness".

Faces ranked from Most serious to least serious. From MirrorMirror Paper.


In this work, labeling was necessary for taking better selfies.  But if half of the world is taking pictures, while the other half is voting pictures up and down (or Tinder-style swiping left and right), then I think the data collection and data labeling effort won't be a big issue in years to come. Nevertheless, this is a cool way of scoring your photos. Regarding consumer applications, this is something that Google, Snapchat, and Facebook will probably integrate into their products very soon.

Mirror Mirror: Crowdsourcing Better Portraits. Jun-Yan Zhu, Aseem Agarwala, Alexei A. Efros, Eli Shechtman and Jue Wang. In ACM Transactions on Graphics (SIGGRAPH Asia), 2014.

Part III. What does it all mean? I'm ready for the cat pictures.


This final section revisits an old, simple, and powerful trick in computer vision and graphics. If you know how to compute the average of a sequence of numbers, then you'll have no problem understanding what an average image (or "mean image") is all about. And if you've read this far, don't worry, the cat picture is coming soon.

Computing average images (or "mean" images) is one of those tricks that I was introduced to very soon after I started working at CMU.  Antonio Torralba, who has always had "a few more visualization tricks" up his sleeve, started computing average images (in the early 2000s) to analyze scenes as well as datasets collected as part of the LabelMe project at MIT. There's really nothing more to the basic idea beyond simply averaging a bunch of pictures.
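The whole trick fits in a few lines. Here is a minimal sketch (file paths are placeholders); any alignment step would happen before the averaging.

```python
import numpy as np
from PIL import Image

def average_image(paths, size=(256, 256)):
    acc = np.zeros((size[1], size[0], 3), dtype=np.float64)
    for p in paths:
        acc += np.asarray(Image.open(p).convert("RGB").resize(size), dtype=np.float64)
    return Image.fromarray((acc / len(paths)).astype(np.uint8))

# import glob; average_image(glob.glob("cats/*.jpg")).save("mean_cat.png")
```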

Teaser Image from AverageExplorer paper.

Usually this kind of averaging is done informally in research, to make some throwaway graphic, or make cool web-ready renderings.  It's great seeing an entire paper dedicated to a system which explores the concept of averaging even further. It took about 15 years of use until somebody was bold enough to write a paper about it. When you perform a little bit of alignment, the mean pictures look really awesome. Check out these cats!



Aligned cat images from the AverageExplorer paper. 
I want one! (Both the algorithm and a Platonic cat)

The AverageExplorer paper extends simple image averaging with some new tricks which make the operations much more effective. I won't say much about the paper (the link is below); just take a peek at some of the coolest mean cats I've ever seen (visualized above) or a jaw-dropping way to look at community-collected landmark photos (the Oxford bridge mean image, visualized below).

Aligned bridges from AverageExplorer paper. 
I wish Google would make all of Street View look like this.


Averaging images is a really powerful idea.  Want to know what your magical classifier is tuned to detect?  Compute the top detections and average them.  Soon enough you'll have a good idea of what's going on behind the scenes.

Conclusion


Allow me to mention the mastermind that helped bring most of these vision+graphics+learning applications to life.  There's an inimitable charm present in all of the works of Prof. Alyosha Efros -- a certain aesthetic that is missing from 2015's overly empirical zeitgeist.  He used to be at CMU, but recently moved back to Berkeley.

Being able to summarize several of years worth of research into a single computer generated graphic can go a long way to making your work memorable and inspirational. And maybe our lives don't need that much automation.  Maybe general purpose object recognition is too much? Maybe all we need is a little art? I want to leave you with a YouTube video from a recent 2015 lecture by Professor A.A. Efros titled "Making Visual Data a First-Class Citizen." If you want to hear the story in the master's own words, grab a drink and enjoy the lecture.

"Visual data is the biggest Big Data there is (Cisco projects that it will soon account for over 90% of internet traffic), but currently, the main way we can access it is via associated keywords. I will talk about some efforts towards indexing, retrieving, and mining visual data directly, without the use of keywords." ― A.A. Efros, Making Visual Data a First-Class Citizen



Tuesday, December 06, 2011

Graphics meets Big Data meets Machine Learning

We've all played Where's Waldo as children, and at least for me it was quite a fun game.  So today let's play an image-based Big Data version of Where's Waldo.  I will give you a picture, and you have to find it in a large collection of images!  This is a form of image retrieval, and this particular formulation is also commonly called "image matching."


The only catch is that you are only given one picture, and I am free to replace the picture with a painting or a sketch.  Any two-dimensional pattern is a valid query image, but the key thing to note is that there is only a single input image. Life would be awesome if Google's Picasa had this feature built in!


The classical way of solving this problem is via a brute-force nearest-neighbor algorithm -- one which won't match pixel patterns directly, but will instead compare images using a state-of-the-art descriptor such as GIST.  Back in 2007, at SIGGRAPH, James Hays and Alexei Efros showed this to work quite well once you have a very large database of images!  But the reason why the database had to be so large is because a naive Nearest Neighbor algorithm is actually quite dumb.  The descriptor might be cleverer than matching raw pixel intensities, but for a machine, an image is nothing but a matrix of numbers, and nobody told the machine which patterns in the matrix are meaningful and which ones aren't.  In short, the brute-force algorithm works if there are similar enough images such that all parts of the input image will match a retrieved image.  But ideally we would like the algorithm to get better matches by automatically figuring out which parts of the query image are meaningful (e.g., the fountain in the painting) and which parts aren't (e.g., the reflections in the water).
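For reference, the brute-force baseline is only a few lines: describe every image with a global descriptor and return the nearest database entries. GIST is used in the original work; a contrast-normalized "tiny image" descriptor stands in below so the sketch stays self-contained.

```python
import numpy as np
from PIL import Image

def tiny_descriptor(path, size=(32, 32)):
    """Contrast-normalized 'tiny image' as a cheap global descriptor."""
    v = np.asarray(Image.open(path).convert("L").resize(size), dtype=np.float64).ravel()
    return (v - v.mean()) / (v.std() + 1e-8)

def retrieve(query_path, database_paths, k=5):
    q = tiny_descriptor(query_path)
    D = np.array([tiny_descriptor(p) for p in database_paths])
    order = np.argsort(np.linalg.norm(D - q, axis=1))
    return [database_paths[i] for i in order[:k]]
```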

A modern approach to this issue is to collect a large set of related "positive images" and a large set of unrelated "negative images" and then train a powerful classifier which can hopefully figure out the meaningful bits of the image. But here the problem is twofold.  First, with only a single input image, it is not clear whether standard machine learning tools will have a chance of learning anything meaningful.  The second issue, a significantly worse problem, is that without a category label or tag, how are we supposed to create a negative set?!?  Exemplar-SVMs to the rescue!  We can use a large collection of images from the target domain (the domain we want to find matches from) as the negative set -- as long as the "negative set" contains only a small fraction of potentially related images, learning a linear SVM with a single positive still works.
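Here is a hedged sketch of that recipe: one positive feature vector (the query) against thousands of negatives from the target domain, with a heavily asymmetric cost so the lone positive isn't swamped. Feature extraction is abstracted away, and the class-weight trick below only approximates the per-class regularization used in the actual Exemplar-SVM formulation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_exemplar_svm(query_feat, negative_feats, w_pos=50.0, w_neg=0.01):
    """query_feat: (d,) feature of the single positive; negative_feats: (N, d)."""
    X = np.vstack([query_feat[None, :], negative_feats])
    y = np.r_[1, np.zeros(len(negative_feats), dtype=int)]
    # Asymmetric class weights stand in for the separate positive/negative costs.
    return LinearSVC(C=1.0, class_weight={1: w_pos, 0: w_neg}).fit(X, y)

def best_matches(svm, database_feats, k=10):
    scores = svm.decision_function(database_feats)
    return np.argsort(-scores)[:k]      # indices of the top cross-domain matches
```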




Here is an excerpt from a Techcrunch article which summarizes the project concisely:

"Instead of comparing a given image head to head with other images and trying to determine a degree of similarity, they turned the problem around. They compared the target image with a great number of random images and recorded the ways in which it differed the most from them. If another image differs in similar ways, chances are it’s similar to the first image. " -- Techcrunch


Abhinav Shrivastava, Tomasz Malisiewicz, Abhinav Gupta, Alexei A. Efros. Data-driven Visual Similarity for Cross-domain Image Matching. In SIGGRAPH ASIA, December 2011. Project Page



Here is a short listing of some articles which mention our research (thanks, Abhinav!).




Monday, December 05, 2011

An accidental face detector

Disclaimer #1: I don't specialize in faces.  When it comes to learning, I like my objectives to be convex.  When it comes to hacking on vision systems, I like to tackle entry-level object categories.

Fun fact #1: Faces are probably the easiest objects in the world for a machine to localize/detect/recognize.

Note #1: I supplied the images, my algorithm supplied the red boxes.

Note #2: Sorry to all my friends who failed to get detected by my accidental face detector! (see below)

So I was hackplaying with some of my PhD thesis code over Thanksgiving, and I accidentally made a face detector.  oops!  I immediately ran to my screenshot capture tool and ran my code on my Mac desktop while browsing Google Images and Facebook.  It seems to work pretty well on real faces as well as sketches/paintings of faces (see below)!  I even caught two Berkeleyites (an Alyosha and a Jianbo), but you gotta find them for yourself.  The detector is definitely tuned to frontal faces, but runs pretty fast and produces few false positives.  Not too shabby for some midnight hackerdom.

Yes, I'm doing dense multiscale sliding windows here.  Yes, I'm HoGGing the hell outta these images. Yes, I'm using a single frontal-face tuned template.  And yes, I only used faces of myself to train this accidental face detector.
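For anyone curious what "dense multiscale sliding windows" boils down to, here is a minimal sketch with a single learned HOG template w (assumed to already exist and to match the window's HOG dimensionality); non-maximum suppression would normally follow.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import rescale

def detect(image, w, win=(128, 128), stride=16, scales=(1.0, 0.75, 0.5), thresh=0.5):
    """image: 2D grayscale array; w: learned HOG template weights (assumed)."""
    boxes = []
    for s in scales:
        img = rescale(image, s, anti_aliasing=True)
        H, W = img.shape
        for y in range(0, H - win[0], stride):
            for x in range(0, W - win[1], stride):
                feat = hog(img[y:y+win[0], x:x+win[1]],
                           pixels_per_cell=(8, 8), cells_per_block=(2, 2))
                score = float(np.dot(w, feat))
                if score > thresh:
                    # map box back to original image coordinates
                    boxes.append((x / s, y / s, win[1] / s, win[0] / s, score))
    return boxes
```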

Note: If I've used one of your pictures without permission, and you would like a link back to your home on the interwebs, please leave a comment indicating the image and link to original.