Tuesday, August 16, 2011

Question: What makes an object recognition system great?

Today, instead of discussing my own perspectives on object recognition or sharing some useful links, I would like to ask a general question geared towards anybody working in the field of computer vision:

What makes an object recognition system great?

In particular, I would like to hear a broad range of perspectives regarding what is necessary to provide an impact-creating open-source object recognition system for the research community to use.  As a graduate student you might be interested in building your own recognition system, as a researcher you might be interested in extending or comparing against a current system, and as an educator you might want to direct your students to a fully-functional object recognition system which could be used to bootstrap their research.



To start the discussion I would like to first enumerate a few elements which I find important in making an object recognition system great.

Open Source
In order for object recognition to progress, I think releasing binary executables is simply not enough.  Allowing others to see your source code means that you gain more scientific credibility and you let others extend your system -- this means letting others both train and test variants of your system. More people using an object recognition system also translates to a high citation count, which is favorable for researchers seeking career advancement.  Felzenszwalb et al. have released multiple open-source versions of their Discriminatively Trained Deformable Part Model -- each time we see a new release it gets better!  Such continual development means that we know the authors really care about this problem.  I feel Github, with its distributed version control and social-coding features, is a powerful tool the community should adopt, something which I believe is very much needed to take the community's ideas to the next level.  In my own research (e.g., the Ensemble of Exemplar-SVMs approach), I have started using Github (for both private and public development) and I love it. Linux might have been started by a single individual, but it took a community to make it great.  Just look at where Linux is now.

Ease of use
For ease of use, it is important that the system is implemented in a popular language which is known by a large fraction of the vision community.  Matlab, Python, C++, and Java are such popular languages, and many good implementations combine Matlab with some highly-optimized routines in C++.  Good documentation is also important since one cannot expect only experts to be using such a system.

Strong research results
The YaRS approach, which is the "yet-another-recognition-system" approach, doesn't translate to high usage unless the system actually performs well on a well-accepted object recognition task.  Every year at vision conferences, many new recognition frameworks are introduced, but really only a few of them ever pass the test of time.  Usually an idea withstands time because it is a conceptual contribution to science, but systems such as the HOG-based pedestrian detector of Dalal-Triggs and the Latent Deformable Part Model of Felzenszwalb et al. are actually being used by many other researchers.  The ideas in these works are not only good, but the recognition systems are great.

Question:
So what would you like to see in the next generation of object recognition systems?  I will try my best to reply to any comments posted below.  Any really great comment might even trigger a significant discussion; enough to warrant its own blog post.  Anybody is welcome to comment/argue/speculate below, either using their real name or anonymously.



9 comments:

  1. Petter S 4:47 PM

    Agreed. Github has great potential as a tool for sharing research source code.

  2. I think you hit the nail on the head when you talk about the fact that Source Code should come with any good work. If we all just propose half-baked ideas and wave our hands, we won't get there. Many great ideas are out there; now we have to stop re-inventing the wheel all the time and get hacking, because much of the work will come down to engineering, scaling up, getting faster, etc.

    I think what I would also like is some type of recognition system to which we can always add. Let's say we agree that HOG is the right way to go for these systems and that it works well. Okay, now how can we get millions of training examples? How can a random person just snap a picture of an object, trace its boundary, and simply click a button to make your classifier better? In the background, some kind of features should be extracted, sent to a master server, and added to a huge database. In other words, we need to build a large system for object recognition, and think more about scaling way up. As a catch phrase, I'd label this a crowd-sourced detector framework, or something. It should only get smarter over time, and it should learn online.
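
A minimal Python sketch of the "learn online, get smarter over time" part of this idea. It assumes feature vectors (for example HOG) are extracted on the client and reach the server one at a time with a +1/-1 label. The class name `OnlineLinearDetector`, its `submit` method, and the choice of a plain perceptron update are illustrative assumptions, not something specified in the comment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OnlineLinearDetector:
    """Linear classifier updated one example at a time (perceptron rule)."""

    dim: int
    learning_rate: float = 1.0
    weights: List[float] = field(default_factory=list)
    bias: float = 0.0
    num_updates: int = 0

    def __post_init__(self) -> None:
        # Start from an all-zero model unless weights were supplied.
        if not self.weights:
            self.weights = [0.0] * self.dim

    def score(self, features: List[float]) -> float:
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias

    def predict(self, features: List[float]) -> int:
        return 1 if self.score(features) > 0 else -1

    def submit(self, features: List[float], label: int) -> bool:
        """Handle one crowd-sourced example (label is +1 or -1).

        Returns True if the example was misclassified and the model changed.
        """
        if len(features) != self.dim:
            raise ValueError("feature vector has the wrong length")
        if label * self.score(features) > 0:
            return False  # already classified correctly, nothing to learn
        for i, x in enumerate(features):
            self.weights[i] += self.learning_rate * label * x
        self.bias += self.learning_rate * label
        self.num_updates += 1
        return True


# Hypothetical stream of (features, label) pairs arriving from users.
detector = OnlineLinearDetector(dim=2)
stream = [([2.0, 1.0], 1), ([-1.0, -2.0], -1), ([1.5, 0.5], 1), ([-2.0, -0.5], -1)]
for features, label in stream:
    detector.submit(features, label)
```

The model only changes when an incoming example is misclassified, so the server never has to retrain from scratch; a real system would still need storage, spam filtering, and a stronger learner than a perceptron.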

    As my last point, I would add that you should be posting these also to your Google+ account, I would love to reshare there, and I'm sure you'd get much more interaction from others too. It's more accessible.

    Cheers!

  3. Hi Tomasz, my name is Babak Rasolzadeh and I'm the CTO of a computer vision startup in Sweden. I won't mention the company name here as I don't want to advertise. What I want to add is that from the industrial point-of-view a good object recognition system needs to incorporate at least two other important aspects:
    1. Scalability: meaning large scale learning in the sense of what Fei-Fei and Lazebnik and Grauman talked about at the latest CVPR.
    2. Good API/SDK: this is perhaps even more important if your system is going to get traction and adoption from the rest of the IT community. Computer Vision will be a ubiquitous and integrated part of Web 3.0, and the interface to these systems will be THE defining factor in whether or not a particular system becomes successful. I have plenty more to say about this but will spare you :-)

  4. Hey Andrej and Babak,

    Thanks for your comments. It seems that everybody I talked to (EXCEPT PROFESSORS) agrees that the way people interact with the code (the API) is as important as what the code actually does.

    Unfortunately, the standards in academia are often reversed, where most academics I met only care about making their CVs longer. This means writing lots of throw-away-code.

    Andrej:
    #1: regarding posting to Google+, I am a bit of a Google+ newbie, so I haven't yet integrated it well into my blogging/posting/sharing lifestyle. I'll do my best to make sure my circles get to hear what I have to say.

    #2: regarding your crowd-sourced detector framework, I will probably group together my ideas regarding this, and make a separate post about this. While doing something like this is quite noble, it is not easy, and definitely beyond the scope of a single individual's effort.

    Is there any place where you've talked about something similar (so I can link to your ideas)? If I don't hear from you or there really isn't anywhere else you talked about this, then I'll just link to your Google public profile.

  5. Regarding your point about the tendency to just write some throw-away-code and move on, I think things are changing a bit. I see several high-profile researchers now encouraging their students to put up their code, and even clean it up.

    For example, one of the reasons, I would argue, that the Felzenszwalb detector is so popular is that it is just so easy to use in your project. It was all available, documented, etc. Surely, this increases the number of citations and fame, and surely this can lead to a better CV. Another piece of nice recent work along these lines was the Predator framework by Zdenek. I'm sure there are more, and I'm expecting even more, as people are slowly figuring this out.

    I didn't discuss my crowd-sourced detector framework ideas anywhere yet, I just randomly mentioned it here. I agree that it's a project beyond an individual. I also haven't fully ironed out the details of it in my head. One of the biggest problems I still struggle with conceptually is that the idea of an 'object' in a discriminative sense is flawed. Every classifier we build is in some sense a slice through a hierarchy of parts and attributes that make up an object. A car is a car, but it's also a specific brand, a specific type, a specific make, a specific color, it's made up of other objects like doors, wheels, windows etc., and it's also a vehicle. How are we ever going to reconcile all this? I don't know how to resolve these issues, but I have a feeling that they will become more troublesome when we start to collect larger and larger datasets. I should note that I don't like putting objects in hierarchies, however, as IMAGENET has done (though it's a good attempt #1). Instead of forcing a hierarchy all the time, I think it's more appropriate to sometimes assign a tag or an attribute instead -- like a dog can be male/female, or brown/black -- these shouldn't be different branches in a tree.
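
One way to picture the tag-and-attribute alternative to a single hierarchy is the Python sketch below. The `ObjectLabel` record, its field names, and the `query` helper are hypothetical and only illustrate the car and dog examples from the paragraph above; they are not a proposal for how any existing dataset is organized.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ObjectLabel:
    """One annotated object described by tags rather than one tree position."""

    name: str
    kinds: Set[str] = field(default_factory=set)              # "is-a" tags, any number
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. color, sex
    parts: List[str] = field(default_factory=list)            # "made-up-of" objects


def query(labels: List[ObjectLabel], kind: Optional[str] = None,
          **attributes: str) -> List[ObjectLabel]:
    """Return labels that carry the given kind tag and all given attributes."""
    return [
        label for label in labels
        if (kind is None or kind in label.kinds)
        and all(label.attributes.get(k) == v for k, v in attributes.items())
    ]


# Illustrative annotations only.
LABELS = [
    ObjectLabel("car", kinds={"car", "vehicle"},
                attributes={"color": "red", "body_type": "sedan"},
                parts=["door", "wheel", "window"]),
    ObjectLabel("dog", kinds={"dog", "animal"},
                attributes={"color": "brown", "sex": "female"}),
]

vehicles = query(LABELS, kind="vehicle")
brown_things = query(LABELS, color="brown")
```

Here "brown" and "female" are attributes that cut across categories, so a brown dog never needs its own branch; the open question raised in the comment (which slice a given classifier should learn) is untouched by this representation.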

    Anyway I'm rambling -- I'm looking forward to hearing your thoughts on this! Cheers
    Andrej

  6. I don't like throw-away-code either. I encourage everybody around me to become a badass software engineer (e.g., intern at Google, learn git, use github, master several languages, and keep learning), but I'm not convinced that academia as a whole supports this attitude. Of course things are changing, but it is the PhD students that need to step up and learn when to say "no" and when to say "yes" when it comes to good software practices.

    By the way, the crowd-sourced detection idea is similar to Vijayanarasimhan's project at UT. He is now at Google (no surprise).

    Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds by Sudheendra Vijayanarasimhan and Kristen Grauman

    As for your doubts regarding the "one-hierarchy-to-rule-them-all" approach, I cannot agree more. Something gets lost when we place an object into an ontology. Unfortunately, these problems regarding categories/ontologies/instances/attributes/etc were addressed more than 50 years ago in Philosophy and Psychology -- computer science is a bit naive when it comes to this. I will come back to this point when I summarize the contributions of my PhD dissertation (a post coming in the next few weeks).

  7. Anonymous 11:16 PM

    IMO a good system should be a fusion of (or at least be verified by) several different methods.

  8. Tomasz, I think we should come up with a precise definition of the problem first. Even PASCAL VOC has different categories. If we pick one of them as a standard, we are at risk of meta-overfitting, i.e. creating a system that works only for those data. There are only 20 classes, so the system might be able to detect chairs and motorbikes, but be useless for detecting other important stuff. That's why I think the thing the community needs is a framework where the descriptors, learning algorithms, and other stuff are implemented, so it is easy to combine them into the object recognition system one needs. To this extent, OpenCV is close to this general-purpose framework. It is open-source, free (BSD license), and the API is fine. New algorithms are added when they prove to work, which usually means a few years after the original paper. This is probably not acceptable, so we may need an OpenCV-cutting-edge module, which would not typically be used by engineers, but researchers would use it as a sandbox. Ideally, the implementations will move to the library core.
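
A rough Python sketch of what "easy to combine descriptors and learning algorithms" could look like. Everything here is an assumption made for illustration: the `RecognitionPipeline` class, the toy `intensity_histogram` descriptor, and the `NearestCentroid` learner are stand-ins and are not OpenCV's API.

```python
from typing import Callable, Dict, Hashable, List, Sequence

Image = Sequence[Sequence[int]]             # grayscale image, rows of 0..255 values
Descriptor = Callable[[Image], List[float]]


def intensity_histogram(image: Image, bins: int = 4) -> List[float]:
    """Toy descriptor: normalized histogram of pixel intensities."""
    counts = [0] * bins
    total = 0
    for row in image:
        for pixel in row:
            counts[min(pixel * bins // 256, bins - 1)] += 1
            total += 1
    if total == 0:
        raise ValueError("empty image")
    return [c / total for c in counts]


class NearestCentroid:
    """Toy learner: assign the label whose mean feature vector is closest."""

    def fit(self, features: List[List[float]], labels: List[Hashable]) -> "NearestCentroid":
        sums: Dict[Hashable, List[float]] = {}
        counts: Dict[Hashable, int] = {}
        for feature, label in zip(features, labels):
            acc = sums.setdefault(label, [0.0] * len(feature))
            for i, value in enumerate(feature):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {y: [v / counts[y] for v in acc] for y, acc in sums.items()}
        return self

    def predict(self, feature: List[float]) -> Hashable:
        def squared_distance(centroid: List[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(feature, centroid))
        return min(self.centroids, key=lambda y: squared_distance(self.centroids[y]))


class RecognitionPipeline:
    """Glue object: any descriptor callable plus any learner with fit/predict."""

    def __init__(self, descriptor: Descriptor, learner) -> None:
        self.descriptor = descriptor
        self.learner = learner

    def train(self, images: List[Image], labels: List[Hashable]) -> None:
        self.learner.fit([self.descriptor(image) for image in images], labels)

    def recognize(self, image: Image) -> Hashable:
        return self.learner.predict(self.descriptor(image))


pipeline = RecognitionPipeline(intensity_histogram, NearestCentroid())
pipeline.train([[[0, 10], [20, 30]], [[250, 240], [230, 220]]], ["dark", "bright"])
```

Swapping in a different descriptor or learner means passing a different object to `RecognitionPipeline`; the rest of the code stays the same, which is the property the comment asks of a shared framework.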

    BTW, there is an EMMCVPR paper that predicts ImageNet categories by means of structured output learning.

  9. Hi all,
    I couldn't agree more... too much "research" effort is wasted trying to reproduce the results of others, just in order to improve on them. I too think that we need something reusable and generic like OpenCV, but less low-level.

    The greatness of OpenCV is that it gives a common platform for many algorithms to cooperate; this seems hard to achieve with object recognition, as we don't seem to know what exactly the inputs and outputs ought to be (still images? videos? regions? categories? masks? probability fields? Answering these is a significant part of the problem itself...)

    An example to follow might be ROS, from robotics. As far as I know, it's a distributed system with components written in arbitrary languages, capable of communicating through messages (including video feeds, sensor data and sinks for driving actuators). That way fusion of multiple systems would be much easier to implement.
