Corinna Cortes is a computer scientist who heads Google Research, after working for more than 10 years at AT&T Labs – Research (formerly AT&T Bell Labs). Her research focuses on machine learning algorithms, a field with applications in tools such as search engines, automatic handwriting recognition systems, and speech processing. In particular, she has made important contributions to the theoretical foundations of support vector machines and to data mining (that is, extracting knowledge from data) for very large datasets. For her work she has received, among other honors, the AT&T Science and Technology Medal in 2000 and, jointly with Vladimir Vapnik, the Paris Kanellakis Theory and Practice Award in 2008, two major prizes in computer science. Interview conducted on the occasion of Maths A Venir 2009.
Could you briefly describe your academic path and your career, from your first position as a researcher to the present?
My path as a student was a long and winding road. Just like many other girls in high school, I took math and physics at the highest level, and really liked math. But, due to some peer pressure and other areas of interest of mine, I took a detour through theology and later Danish literature! During a break, I felt a need to identify with the working class (this was the very early eighties) and I also took a job at a factory, where I met a number of engineering students. That was a great experience. I eventually changed to math and physics, and completed a Master’s degree in Denmark.
On my way to a Ph.D. program in computer science in the US, I got a summer internship at AT&T Bell Laboratories, back then one of the most prestigious research labs in the world. I didn’t really leave it for the next 14 years. I was lucky enough to spend most of my student years there, only occasionally visiting my university, and after my Ph.D. degree, I stayed on as a full-time researcher.
With the rise of the Internet, new possibilities opened up, and, six years ago, I moved to Google Research, one of the most fascinating and energizing workplaces in the US.
What was your very first subject of research?
My first years at Bell Labs were spent in a machine learning group, which shaped my later career. A year after I got there, Vladimir Vapnik joined the group, and together we developed Support Vector Machines. That became a part of my thesis and has remained one of my main interests since then. Support Vector Machines inspired work on many other machine learning algorithms, and today I would rather say I work on “Kernel Methods”, but my work is still closely related to my original thesis work.
Which problems are you currently working on?
The problem settings I am working on have changed a bit over the years.
Learning theory and algorithms were originally designed for a flawless world, somewhat perturbed by random noise. But the real world we deal with at Google is often far from that ideal. In practice, data is poorly labeled or not labeled at all, data can be biased or drawn from slightly different problems than the one we are trying to learn, and the distributions may drift with time. Machine learning at Google must address all these issues to be effective and obtain high-quality solutions. Disregarding them can lead to dramatically poor performance or to unpredictable results or behavior, thereby harming quality.
Another challenge facing learning applications at Google is the staggering size of the datasets, which exceeds several million and can reach billions of high-dimensional data points. Scaling existing learning algorithms to deal with such magnitudes using massively distributed systems, or devising novel techniques that take advantage of these systems without sacrificing learning guarantees, is an important question in many of our applications.
Could you describe to us in simple words a typical problem raised in your field of research?
Sure, let us take image search as an example. Image search on all the major search engines is still primarily text-based. We input text, and the images returned are chosen primarily based on the text surrounding the images. We clearly have to move towards better image understanding, building learning algorithms that can reliably recognize objects in the images. But just think of the scale of the internet, how many images are out there and how many object recognizers are needed. Think of the quality of the images. The object recognizers will have to be very robust against noise and occlusion, and be able to work on low- as well as high-resolution images, color and black and white, etc. These are hard problems for machine learning.
Are the problems you deal with raised by practical needs or is it pure theory?
The fun thing about being at Google is that we have access to real problems and lots of data. Hence, naturally, the problems we are working on are inspired by practical needs. Otherwise, we could just as well be in academia. But we are striving to find general solutions that can work across a number of applications and produce publishable results. A good project has two major outlets: a publication and a launch. Both exciting.
Could you describe a typical application of your current work?
To solve problems in machine learning, one typically has to build a classifier from labelled data. Support Vector Machines are still one of the most popular techniques for doing so. SVMs are widely available in a number of software packages and used by most machine learning practitioners.
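As a rough illustration of what building a classifier from labelled data can look like, here is a minimal sketch of a linear SVM trained by stochastic subgradient descent on the regularized hinge loss. This is a simplified stand-in, not the formulation from the original SVM work (which solves a quadratic program and supports kernels). The function names `train_linear_svm` and `predict`, the toy data, and the hyperparameter values are all assumptions chosen for illustration.

```python
import random


def train_linear_svm(points, labels, lam=0.01, lr=0.01, epochs=100, seed=0):
    """Fit w, b by stochastic subgradient descent on the objective
    lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b))).
    Labels must be +1 or -1. Hyperparameters are illustrative guesses."""
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    b = 0.0
    order = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = points[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # Subgradient of the regularizer: shrink w slightly.
            w = [(1.0 - lr * lam) * wj for wj in w]
            # Subgradient of the hinge loss: active only inside the margin.
            if margin < 1.0:
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy, linearly separable labelled data (hypothetical).
points = [(2, 2), (3, 2), (2, 3), (3, 3), (-2, -2), (-3, -2), (-2, -3), (-3, -3)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
print(predict(w, b, (4, 4)), predict(w, b, (-4, -4)))
```

In practice one would more likely reach for one of the existing software packages mentioned above than hand-roll the optimizer; the sketch is only meant to make the "labelled data in, classifier out" workflow concrete.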
Do you consider yourself as a mathematician or as a computer scientist?
I am trained as a mathematician, a physicist and a computer scientist, and I think of myself as all of the above. It is really important to have good math skills to be able to prove properties about an algorithm (Will it converge? Is there a unique solution? What guarantees does it give for performance on a new test point?). It requires a lot of knowledge of optimization to effectively deal with the scale of our applications, it requires good intuition about data to know what to pursue, and it requires good computer science skills to independently develop a solution. In my group, we have primarily computer scientists, but we also have optimization people and statisticians.
Which mathematics are “hidden” (for the average user) in tools you work on?
The large-scale problems in machine learning have inspired a lot of mathematicians to work on approximation algorithms and optimization techniques. Good knowledge of algorithms is another hidden field.
What kinds of unsolved problems do you find in your field of research?
How do you effectively and relatively accurately invert a million-dimensional dense matrix in seconds on an ordinary desktop computer?
Do you think that the average person needs to know, or should know, some of the science behind the computing tools they use every day?
Maybe not the average person, but I wish young people better realized the opportunities that lie ahead of them if they master computers and can enhance the flow of information.
Just think of a visit to the hospital and how much of our treatment is tied to recent developments in computer science: the image processing behind the MRI or X-ray we get to locate the problem, the monitoring and alerting of our vital signs, the developments in computational biology and the drug industry where one can simulate and experiment with new medications using computers, the electronic records that make our chart available to all functions of the hospital, the statistics that are aggregated to better understand the outcome of the treatment. It is sad that girls especially are not more attracted to the area.