One thing I noticed talking to CS graduates from Harvard this year is -- entirely anecdotally -- more seem to be going to small start-ups. This has been an increasing trend the last few years, and strikes me as a big change. When I was a student, most CS grads went one of the safe and successful routes of going to Microsoft, Wall Street, or one of the name brand consulting firms. (Some losers went to grad school.) When I started as faculty, this basic framework still seemed pretty much in place -- the most notable change was that over the years Google took up a good chunk of Microsoft's previous mind-space.
Assuming it's true (and I think it is), there are many things that could be bringing about the change. Harvard in many ways is trying to promote it -- in CS our classes are much more "project-based", there's a student Hack Harvard group arranging sponsored Hack Nights, and there are institutional developments like the Harvard Innovation Lab to help promote an innovation culture. Part of it has to be attributed to the "Facebook Effect" -- we're bringing in more CS-interested people as students, and more of them are interested in startups, probably because some want to change the world, and because "a million dollars isn't cool."
One other thing that I think might be helping? Harvard's financial aid policies, which changed significantly around 2006, making Harvard much more affordable for lower and middle class attendees. I imagine it's a lot easier to take a risk right out of school when you're not deep in a debt hole. At the very least, it's easier to justify to your parents why you're not (immediately) taking that high-paying job after they've paid for four years of private college.
Overall I think this is all to the good. While startups aren't for everyone, I think Harvard grads as a group have historically been too "conservative" in choosing career paths; it's exciting to see what may be a change in worldview. A downside? We've been sending noticeably fewer of our best and brightest to grad school, as they're seeking other opportunities.
I'd be eager to hear from recent (and not-so-recent) grads their thoughts on the current startup culture...
Tuesday, May 29, 2012
Saturday, May 26, 2012
CS 124 -- Some Student Comments
Overall, I feel I've had a very successful year, but I do have to admit: my undergraduate class, CS 124, could have been better this spring. I wasn't prepared for the class doubling in size from the previous year, and I had too many other things going on to make modifications on the fly. The most common complaint was that the turnaround time to get back assignments was too long, and I agree. I hope to find time to make some changes before next year rolls around that will help with that. (Lots of perennial complaints about timing of things like the midterm, too -- see this previous post if you care.)
Some interesting takeaways from the comments, which, as always, have high variance. Answers below come from the question: "What would you like to tell future students about this class?" My favorite so far:
"CS124 isn't too bad for an introduction to some basic algorithms, but it could be a lot more rigorous. Fairly light workload."
What's funny if I can't tell if the student is being serious or sarcastic, the comment is such an outlier. It's either from someone who really did well in the class and felt insufficiently challenged (there's usually a few), or someone who is making some kind of joke. Other comments are more realistic, with different levels of "appreciation" for the difficulty:
"It is hard."
"A lot of hard work but it's worth the effort."
"The hardest class taken this far in college; painful (to be fair though I did lean a lot but not sure how efficient the process is...)"
"Often pretty difficult, but totally worth it."
"It's incredibly difficult (on a scale from 1 to impossible, it's harder than impossible)..."
"Start the psets early, and focus on them, because they are immensely rewarding. However, they are extremely brutal."
Brutal, nice adjective there.
Some people feel algorithms is part of your health foot diet:
"This course is computer science vegetables."
"It's broccoli; you just gotta take it."
Hmm. I do make my kids eat their vegetables, including broccoli. Still, perhaps not the most appetizing comparison.
Several people mention that the class is good preparation for job interviews.
"The material from this course is so valuable to have for tech recruitment season."
"124 is no doubt a tough class but it is also super useful. In fact, if you want a job in CS you MUST take this course. Multiple problem set questions appeared on my interviews and the algorithmic thinking helped me solve any other problem they asked. "
"Programming interview questions focus on algorithms. This class has been so helpful in getting me a job."
"Very difficult, but very rewarding. Teaches you how to do algorithms and helps with job interviews."
I am absolutely unembarrassed about CS 124 being (in part) vocational training. I didn't design the class to prep students for job interviews, but I'm glad to hear that it does, and very happy it plays that role. Maybe we need to somehow move the class to the fall instead of the spring, so students can get the full class in before interview season.
Finally, of course, some of the highest variance comes in students' feelings toward the actual teacher:
"Mitzenmacher is actually a really great lecturer and motivates everything that we learn in class."
"Professor Mitzenmacher is an excellent lecturer and he really does seem to care about his students."
"Mitzenmacher.... definitely generates enthusiasm for the material, and lectures were very engaging and were well-motivated by relevant real world examples."
All of you, come find me after class so I can give you extra credit. On the other hand, there are also plenty of comments like....
"Mitzenmacher's approach to teaching seems to be "present the material and not care about students' struggles.".... The material itself is quite interesting, but just be prepared for a professor like Mitzenmacher."
Ouch.
"I am baffled as to why Prof. Mitzenmacher is still allowed to teach this class. His lazy attitude toward teaching was frustrating and not conducive to learning."
Double-ouch.
(I have all sorts of funny, sarcastic things to say here about not being "allowed to teach this class", but I'm sure they'd just get me in trouble later, so other faculty can mentally fill in their own joke here.)
To close it out, these seem like the best things to tell students who are thinking of taking my course:
"Good luck and enjoy the ride! Even though it almost killed me, I'm genuinely sad it's over."
"Brace yourself."
Indeed.
Some interesting takeaways from the comments, which, as always, have high variance. Answers below come from the question: "What would you like to tell future students about this class?" My favorite so far:
"CS124 isn't too bad for an introduction to some basic algorithms, but it could be a lot more rigorous. Fairly light workload."
What's funny if I can't tell if the student is being serious or sarcastic, the comment is such an outlier. It's either from someone who really did well in the class and felt insufficiently challenged (there's usually a few), or someone who is making some kind of joke. Other comments are more realistic, with different levels of "appreciation" for the difficulty:
"It is hard."
"A lot of hard work but it's worth the effort."
"The hardest class taken this far in college; painful (to be fair though I did lean a lot but not sure how efficient the process is...)"
"Often pretty difficult, but totally worth it."
"It's incredibly difficult (on a scale from 1 to impossible, it's harder than impossible)..."
"Start the psets early, and focus on them, because they are immensely rewarding. However, they are extremely brutal."
Brutal, nice adjective there.
Some people feel algorithms is part of your health foot diet:
"This course is computer science vegetables."
"It's broccoli; you just gotta take it."
Hmm. I do make my kids eat their vegetables, including broccoli. Still, perhaps not the most appetizing comparison.
Several people mention that the class is good preparation for job interviews.
"The material from this course is so valuable to have for tech recruitment season."
"124 is no doubt a tough class but it is also super useful. In fact, if you want a job in CS you MUST take this course. Multiple problem set questions appeared on my interviews and the algorithmic thinking helped me solve any other problem they asked. "
"Programming interview questions focus on algorithms. This class has been so helpful in getting me a job."
"Very difficult, but very rewarding. Teaches you how to do algorithms and helps with job interviews."
I am absolutely unembarrassed about CS 124 being (in part) vocational training. I didn't design the class to prep students for job interviews, but I'm glad to hear that it does, and very happy it plays that role. Maybe we need to somehow move the class to the fall instead of the spring, so students can get the full class in before interview season.
Finally, of course, some of the highest variance comes in students' feelings toward the actual teacher:
"Mitzenmacher is actually a really great lecturer and motivates everything that we learn in class."
"Professor Mitzenmacher is an excellent lecturer and he really does seem to care about his students."
"Mitzenmacher.... definitely generates enthusiasm for the material, and lectures were very engaging and were well-motivated by relevant real world examples."
All of you, come find me after class so I can give you extra credit. On the other hand, there are also plenty of comments like....
"Mitzenmacher's approach to teaching seems to be "present the material and not care about students' struggles.".... The material itself is quite interesting, but just be prepared for a professor like Mitzenmacher."
Ouch.
"I am baffled as to why Prof. Mitzenmacher is still allowed to teach this class. His lazy attitude toward teaching was frustrating and not conducive to learning."
Double-ouch.
(I have all sorts of funny, sarcastic things to say here about not being "allowed to teach this class", but I'm sure they'd just get me in trouble later, so other faculty can mentally fill in their own joke here.)
To close it out, these seem like the best things to tell students who are thinking of taking my course:
"Good luck and enjoy the ride! Even though it almost killed me, I'm genuinely sad it's over."
"Brace yourself."
Indeed.
Tuesday, May 22, 2012
Non-travel
For the past few years, I've been the chair for one of the international review panels for Country X's research funding bodies. (It's probably all a matter of public record, but no need to name Country X for this.) I agreed to serve (and chair) the committee with one restriction -- I didn't want to fly out for an all day panel, which is apparently their standard approach. Not that Country X isn't a wonderful destination -- were I single and/or without 3 children, I might well enjoy the chance to fly thousands of miles to Country X on their dime and spend a day or two hanging out and seeing the sights with nothing more stressful on the agenda than going through reviews for some research proposals. But that's not the case.
Year 1 I think they were a little concerned, but I think now they're happy I'm willing to do it and are OK with the process. Recently we had this year's meeting, over several hours on Skype, and it all worked just fine. The size is about that of a small-ish NSF panel -- about a half dozen panelists, usually about ten or so proposals to review. That's a very nice size for an electronic meeting -- I recognize it can get harder for multiple reasons as things get bigger. I don't believe the lack of face-to-face presence changes the outcomes significantly -- whatever the sources of noise are, I don't think the noise is necessarily bigger with this approach.
On the plus side, we must have saved the funding agency on the order of $10,000 or more. Air fare, hotel, etc. isn't cheap for an "international panel". I'm sure they can find better uses for the money -- like supporting research!
I keep hoping to hear that at some point the NSF will experiment with an electronic as opposed to face-to-face panel, if only to see how it goes. From Boston I can deal with taking an early flight to DC for a 1-day panel, but I really try to avoid 2-day panels now, and I'm aware what a drag the trip can be for west coast colleagues. I think the NSF could corral more reviewers (like me) if serving on a panel was easier and less time-consuming, by which I really mean doesn't require a flight.
Year 1 I think they were a little concerned, but I think now they're happy I'm willing to do it and are OK with the process. Recently we had this year's meeting, over several hours on Skype, and it all worked just fine. The size is about that of a small-ish NSF panel -- about a half dozen panelists, usually about ten or so proposals to review. That's a very nice size for an electronic meeting -- I recognize it can get harder for multiple reasons as things get bigger. I don't believe the lack of face-to-face presence changes the outcomes significantly -- whatever the sources of noise are, I don't think the noise is necessarily bigger with this approach.
On the plus side, we must have saved the funding agency on the order of $10,000 or more. Air fare, hotel, etc. isn't cheap for an "international panel". I'm sure they can find better uses for the money -- like supporting research!
I keep hoping to hear that at some point the NSF will experiment with an electronic as opposed to face-to-face panel, if only to see how it goes. From Boston I can deal with taking an early flight to DC for a 1-day panel, but I really try to avoid 2-day panels now, and I'm aware what a drag the trip can be for west coast colleagues. I think the NSF could corral more reviewers (like me) if serving on a panel was easier and less time-consuming, by which I really mean doesn't require a flight.
Monday, May 21, 2012
Summer School on Massive Data Mining in Copenhagen
Rasmus Pagh asked me to advertise the following summer school he is organizing. Looks like a good deal to me:
SUMMER SCHOOL ON MASSIVE DATA MINING
August 8-10, 2012, IT University of Copenhagen, Denmark
The summer school is aimed at PhD students and young researchers both from the algorithms community and the data mining community. A typical participant will be working in a group that aims at publishing in algorithms conferences such as ESA and SODA, and/or in data mining conferences such as ICDM and KDD. Speakers:
Michael Mahoney, Stanford University
Toon Calders, Eindhoven University of Technology
Suresh Venkatasubramanian, University of Utah
Aris Gionis, Yahoo! Research
Early registration fee (before June 15) is €90.
Organizing chair: Rasmus Pagh
Website: www.itu.dk/people/pagh/mdm12/
SUMMER SCHOOL ON MASSIVE DATA MINING
August 8-10, 2012, IT University of Copenhagen, Denmark
The summer school is aimed at PhD students and young researchers both from the algorithms community and the data mining community. A typical participant will be working in a group that aims at publishing in algorithms conferences such as ESA and SODA, and/or in data mining conferences such as ICDM and KDD. Speakers:
Michael Mahoney, Stanford University
Toon Calders, Eindhoven University of Technology
Suresh Venkatasubramanian, University of Utah
Aris Gionis, Yahoo! Research
Early registration fee (before June 15) is €90.
Organizing chair: Rasmus Pagh
Website: www.itu.dk/people/pagh/mdm12/
Friday, May 18, 2012
That Time of Year : Annual NSF Reports
I received my notice a few weeks back that it's time to fill out my annual grant reports for NSF.
I'd like to complain about it; it's a pretty boring process, and their interface isn't that great.
But, really, I can't complain. The NSF provides me (and my students) with money to do our work. In return, they ask for what seems like very reasonable reporting of how that money is being used. What have I been working on? Who I have been working with? Where has the work been published? What other outcomes of the research -- in industry or education -- has there been? These seems like very worthwhile questions for them to be asking.
One thing I don't really know is what is done with these reports after I enter them. I hope they're useful. I have had an NSF program manager discuss my reports with me, so I do have evidence they're being read. (I must admit, after a program manager let me know they actually read these reports some years ago, I started taking them more seriously!)
So, strangely, this year, as I'm going through the mundane task of filling out my annual reports, I actually find that it's leaving me with a feeling a gratitude. Thanks, NSF, for letting me and my students do all this really fun stuff! I know I complain about the NSF when I feel there's something worth complaining about, so it only seems fair to express the gratitude when I'm feeling thankful for all they do.
I'd like to complain about it; it's a pretty boring process, and their interface isn't that great.
But, really, I can't complain. The NSF provides me (and my students) with money to do our work. In return, they ask for what seems like very reasonable reporting of how that money is being used. What have I been working on? Who I have been working with? Where has the work been published? What other outcomes of the research -- in industry or education -- has there been? These seems like very worthwhile questions for them to be asking.
One thing I don't really know is what is done with these reports after I enter them. I hope they're useful. I have had an NSF program manager discuss my reports with me, so I do have evidence they're being read. (I must admit, after a program manager let me know they actually read these reports some years ago, I started taking them more seriously!)
So, strangely, this year, as I'm going through the mundane task of filling out my annual reports, I actually find that it's leaving me with a feeling a gratitude. Thanks, NSF, for letting me and my students do all this really fun stuff! I know I complain about the NSF when I feel there's something worth complaining about, so it only seems fair to express the gratitude when I'm feeling thankful for all they do.
Thursday, May 17, 2012
A Little Truly Offensive Bragging (On Behalf of Harvard CS)
So which US school was the only one in the top ten for the ACM International Collegiate Programming Contest this year?
No, not MIT, Princeton, or Carnegie Mellon (tied for 18th), nor Stanford (good effort!, tied for 13th) - but Harvard, at number 7. (Results here.)
Congratulation to our team Spencer Liang, Alex Zhai, and Neal Wu. (Neal has also been busy as a TF for CS 124 this year...)
Of course, also congratulations to the many other teams, in particular the top two (who both solved 9 of the problems; the winners won on time):
St. Petersburg State University of IT, Mechanics and Optics
University of Warsaw
No, not MIT, Princeton, or Carnegie Mellon (tied for 18th), nor Stanford (good effort!, tied for 13th) - but Harvard, at number 7. (Results here.)
Congratulation to our team Spencer Liang, Alex Zhai, and Neal Wu. (Neal has also been busy as a TF for CS 124 this year...)
Of course, also congratulations to the many other teams, in particular the top two (who both solved 9 of the problems; the winners won on time):
St. Petersburg State University of IT, Mechanics and Optics
University of Warsaw
Tuesday, May 15, 2012
Grading, Correlation vs. Causation
Final grades are finally in, thank goodnees. With 100+ students, even entering the grades is non-trivial.
Once again, freshman in my undergraduate Algorithms and Data Structures class receive A's at a substantially higher rate than other students. (Way, way higher.) My conclusion: it's best for your grade if you take my class as a freshman.
Once again, freshman in my undergraduate Algorithms and Data Structures class receive A's at a substantially higher rate than other students. (Way, way higher.) My conclusion: it's best for your grade if you take my class as a freshman.
Wednesday, May 9, 2012
Grading is No Fun
I'm told my exams are not nearly as much fun as I like to think they are. (It's a very rare occurrence when someone turns in their exam early at my 3-hour final.) I can say that this year they weren't much fun to grade.
The issue was the large class size -- the final number was 110 (about twice the size of last year). You'd think with more TAs it might not take too much longer to grade... but apparently there are non-linearities in there somewhere. Anyhow, grading took about 6 hours today. For any students who happen to be reading, I figure they'd appreciate knowing that, at least temporally speaking, I had to suffer about twice as long as they did.
For faculty types out there, I'm curious... how much grading do you do? For my undergrad class, I help grade the midterm and final, and take on grading the first programming assignment myself (1 of 8 class assignments). Is this a lot, a little, more or less normal? Comments welcome.
The issue was the large class size -- the final number was 110 (about twice the size of last year). You'd think with more TAs it might not take too much longer to grade... but apparently there are non-linearities in there somewhere. Anyhow, grading took about 6 hours today. For any students who happen to be reading, I figure they'd appreciate knowing that, at least temporally speaking, I had to suffer about twice as long as they did.
For faculty types out there, I'm curious... how much grading do you do? For my undergrad class, I help grade the midterm and final, and take on grading the first programming assignment myself (1 of 8 class assignments). Is this a lot, a little, more or less normal? Comments welcome.
Monday, May 7, 2012
Attribute-Efficient Learning (Guest Post by Justin Thaler)
[Editor's comment: Justin has been having quite a year; after a paper in ITCS, he's recently had papers accepted to ISIT, ICALP, HotCloud, and COLT. In this guest post, he tells us about the COLT result.]
Rocco Servedio, Li-Yang Tan, and I recently had a paper accepted to COLT on attribute-efficient learning, which is a clean framework introduced in 1990 by Avrim Blum that captures the challenging and important problem of learning in the presence of irrelevant information. Think of a infant trying to learn whether or not an object is safe to touch. The infant's brain may be inundated with sensory information, but there may only be a few characteristics of the object that matter (is it sharp? Is it sitting on a heat source?). Or think of a scientist trying to identify the genetic causes of a certain disease that depends on the (possibly complicated) interactions of a small number of genes: the scientist may collect a massive amount of genetic data from study participants, yet only a small amount of this information is actually relevant to the function being learned (i.e., the mapping of genes to a subject's phenotype).
Thus, the goal of an algorithm for attribute-efficient learning is to run in time polynomial in the total number of attributes, n, using a number of examples which is polynomial in the description length of the function f to be learned. The latter can be substantially smaller than n if most of the attributes are irrelevant.
In its most general form, the problem of learning in the presence of irrelevant information is captured by the "junta problem". In this setting, you make no assumption other than that the function being learned depends on only k << n of its n attributes, and your goal is to learn the function in time substantially smaller than the trivial O(n^k) obtained by exhaustively checking all k-element subsets of the attributes. The uniform-distribution variant of this problem has been called the "most important open question in uniform distribution learning" in the groundbreaking paper of Mossel, O'Donnell, and Servedio (see also Lipton's blog post on the subject here). Greg Valiant has recently made progress on this problem here.
Our goal is simultaneously more and less ambitious than MOS. On the one hand, we want to attribute-efficiently learn under arbitrary distributions, not just the uniform distribution. On the other hand, we are willing to assume not just that our functions depend on few attributes, but that the relevant attributes interact in structured ways. Specifically, we focus on attribute-efficient learning of decision lists, a problem first posed by Blum in his 1990 paper. This natural function class is (one of the few) known to be efficiently PAC-learnable, but it seems to lie on the boundary of tractability in the more challenging attribute-efficient setting. The problem has already been studied by a number of authors, see e.g. here, here, here, and here.
What did we manage to show? We give both positive and negative results. On the positive side, we build on a construction of Klivans and Servedio to give an algorithm achieving new tradeoffs between the running time and sample complexity for learning length-k decision lists over n Boolean attributes. Like most known positive results on PAC-learning under arbitrary distributions, we achieve this by constructing "uncomplicated" polynomial threshold functions (PTFs) computing any decision list. (For the uninitiated, a polynomial threshold function is exactly what it sounds like: it is a function of the form sign(p(x)), where p is an n-variate polynomial, and the complicatedness of a PTF refers to the degree of p, and the size of its coefficients). For technical reasons we actually construct something slightly more expressive than a PTF, which we call a generalized PTF, and which is just as useful for learning purposes.
We have two results on the negative side. First, we give new lower bounds on the complexity of PTFs that compute a specific and important decision list called ODD-MAX-BIT, building upon an elegant argument due to Beigel. Our lower bound strongly suggests that brand new techniques will be needed to improve on our positive result. (Parenthetically, the ODD-MAX-BIT function has played important roles in complexity theory in the past, from oracle separations to circuit lower bounds -- you can learn about many of these prior results just by typing ODD-MAX-BIT into Google). In brief, Beigel's lower bound is tight for relatively low-degree PTFs computing ODD-MAX-BIT, but is loose or vacuous for higher-degree PTFs. Our new lower bound significantly improves upon Beigel's at higher degrees and comes close to matching our upper bound (the main gap is that our upper bound uses generalized PTFs, while our lower bound applies only to honest-to-goodness PTFs).
Our final result is to extend Beigel's lower bound, which applied only to thresholds-of-monotone-conjunctions, to thresholds of significantly more general classes of functions.
In closing, I would like to say a little about about the ideas underlying our constructions, as there is a common intuition underlying both our positive and negative results. The positive result works by breaking the decision list into "chunks", closely approximating each chunk (in the $L_{\infty}$ norm) with a low-degree polynomial, and then putting the approximations together to get a PTF for the entire decision list. The negative result shows that this approach is intrinsic: it also breaks the decision list into chunks, and shows that you can take any PTF p for ODD-MAX-BIT and turn it into a low-degree polynomial q closely approximating each chunk. We then use a new Markov-type inequality to conclude that q has to be "complicated", and hence p has to be complicated as well.
Our Markov-type inequality seems quite natural, and similar statements have apparently been known to approximation theorists for some time now. I wanted to state the result here in the hopes that it will find some other applications in TCS.
Markov's classical inequality is a statement about the derivative of a univariate polynomial in terms of its degree. In contrast, we give a bound on the derivative of a univariate polynomial in terms of both its degree and the *size of its coefficients*. For comparison:
Markov's Theorem: Let p : [-1,1] --> [-1,1] be a real polynomial with deg(p) <= d. Then max |p'(x)| <= d^2, where the maximum is taken over all x in the real interval [-1, 1].
Our Markov-type Inequality (what I state here is a bit of a simplification of what we actually show): Let p : [-1,1] -> [-1,1] be a real polynomial with deg(p) <= d and coefficients of absolute value at most W. If max |p(x)| >= 1/2, then max |p'(x)| = O(d*log(W)), where again the maximums are taken over all x in [-1, 1].
Notice our bound is tighter than Markov's when W is smaller than 2^d.
The way I like to think about our Markov-type inequality is as follows. Since the Chebyshev polynomials are a tight example for Markov's inequality, one can interpret Markov as saying that, in order to maximize the derivative of a polynomial P of degree d which stays bounded on the interval [-1, 1], the best approach is to "spend all the allowed degree on a Chebyshev polynomial." There is also a simple example showing that our new inequality is asymptotically tight for a wide range of W's -- the tight example involves composing a Chebyshev polynomial with a monomial. So our new Markov-type inequality is saying that in order to maximize the derivative of a polynomial of degree d and weight W which stays bounded on the interval [-1,1], the best approach is to "spend all of the allowed weight on a Chebyshev polynomial T, and then spend any remaining allowed degree by composing T with a low-weight, high-degree polynomial (namely, a monomial)." It actually turns out that this same intuition underlies our positive result, which allowed us to improve over the Klivans-Servedio construction when the degree of the (generalized) PTF is relatively high.
Rocco Servedio, Li-Yang Tan, and I recently had a paper accepted to COLT on attribute-efficient learning, which is a clean framework introduced in 1990 by Avrim Blum that captures the challenging and important problem of learning in the presence of irrelevant information. Think of a infant trying to learn whether or not an object is safe to touch. The infant's brain may be inundated with sensory information, but there may only be a few characteristics of the object that matter (is it sharp? Is it sitting on a heat source?). Or think of a scientist trying to identify the genetic causes of a certain disease that depends on the (possibly complicated) interactions of a small number of genes: the scientist may collect a massive amount of genetic data from study participants, yet only a small amount of this information is actually relevant to the function being learned (i.e., the mapping of genes to a subject's phenotype).
Thus, the goal of an algorithm for attribute-efficient learning is to run in time polynomial in the total number of attributes, n, using a number of examples which is polynomial in the description length of the function f to be learned. The latter can be substantially smaller than n if most of the attributes are irrelevant.
In its most general form, the problem of learning in the presence of irrelevant information is captured by the "junta problem". In this setting, you make no assumption other than that the function being learned depends on only k << n of its n attributes, and your goal is to learn the function in time substantially smaller than the trivial O(n^k) obtained by exhaustively checking all k-element subsets of the attributes. The uniform-distribution variant of this problem has been called the "most important open question in uniform distribution learning" in the groundbreaking paper of Mossel, O'Donnell, and Servedio (see also Lipton's blog post on the subject here). Greg Valiant has recently made progress on this problem here.
Our goal is simultaneously more and less ambitious than MOS. On the one hand, we want to attribute-efficiently learn under arbitrary distributions, not just the uniform distribution. On the other hand, we are willing to assume not just that our functions depend on few attributes, but that the relevant attributes interact in structured ways. Specifically, we focus on attribute-efficient learning of decision lists, a problem first posed by Blum in his 1990 paper. This natural function class is (one of the few) known to be efficiently PAC-learnable, but it seems to lie on the boundary of tractability in the more challenging attribute-efficient setting. The problem has already been studied by a number of authors, see e.g. here, here, here, and here.
What did we manage to show? We give both positive and negative results. On the positive side, we build on a construction of Klivans and Servedio to give an algorithm achieving new tradeoffs between the running time and sample complexity for learning length-k decision lists over n Boolean attributes. Like most known positive results on PAC-learning under arbitrary distributions, we achieve this by constructing "uncomplicated" polynomial threshold functions (PTFs) computing any decision list. (For the uninitiated, a polynomial threshold function is exactly what it sounds like: it is a function of the form sign(p(x)), where p is an n-variate polynomial, and the complicatedness of a PTF refers to the degree of p, and the size of its coefficients). For technical reasons we actually construct something slightly more expressive than a PTF, which we call a generalized PTF, and which is just as useful for learning purposes.
We have two results on the negative side. First, we give new lower bounds on the complexity of PTFs that compute a specific and important decision list called ODD-MAX-BIT, building upon an elegant argument due to Beigel. Our lower bound strongly suggests that brand new techniques will be needed to improve on our positive result. (Parenthetically, the ODD-MAX-BIT function has played important roles in complexity theory in the past, from oracle separations to circuit lower bounds -- you can learn about many of these prior results just by typing ODD-MAX-BIT into Google). In brief, Beigel's lower bound is tight for relatively low-degree PTFs computing ODD-MAX-BIT, but is loose or vacuous for higher-degree PTFs. Our new lower bound significantly improves upon Beigel's at higher degrees and comes close to matching our upper bound (the main gap is that our upper bound uses generalized PTFs, while our lower bound applies only to honest-to-goodness PTFs).
Our final result is to extend Beigel's lower bound, which applied only to thresholds-of-monotone-conjunctions, to thresholds of significantly more general classes of functions.
In closing, I would like to say a little about about the ideas underlying our constructions, as there is a common intuition underlying both our positive and negative results. The positive result works by breaking the decision list into "chunks", closely approximating each chunk (in the $L_{\infty}$ norm) with a low-degree polynomial, and then putting the approximations together to get a PTF for the entire decision list. The negative result shows that this approach is intrinsic: it also breaks the decision list into chunks, and shows that you can take any PTF p for ODD-MAX-BIT and turn it into a low-degree polynomial q closely approximating each chunk. We then use a new Markov-type inequality to conclude that q has to be "complicated", and hence p has to be complicated as well.
Our Markov-type inequality seems quite natural, and similar statements have apparently been known to approximation theorists for some time now. I wanted to state the result here in the hopes that it will find some other applications in TCS.
Markov's classical inequality is a statement about the derivative of a univariate polynomial in terms of its degree. In contrast, we give a bound on the derivative of a univariate polynomial in terms of both its degree and the *size of its coefficients*. For comparison:
Markov's Theorem: Let p : [-1,1] --> [-1,1] be a real polynomial with deg(p) <= d. Then max |p'(x)| <= d^2, where the maximum is taken over all x in the real interval [-1, 1].
Our Markov-type Inequality (what I state here is a bit of a simplification of what we actually show): Let p : [-1,1] -> [-1,1] be a real polynomial with deg(p) <= d and coefficients of absolute value at most W. If max |p(x)| >= 1/2, then max |p'(x)| = O(d*log(W)), where again the maximums are taken over all x in [-1, 1].
Notice our bound is tighter than Markov's when W is smaller than 2^d.
The way I like to think about our Markov-type inequality is as follows. Since the Chebyshev polynomials are a tight example for Markov's inequality, one can interpret Markov as saying that, in order to maximize the derivative of a polynomial P of degree d which stays bounded on the interval [-1, 1], the best approach is to "spend all the allowed degree on a Chebyshev polynomial." There is also a simple example showing that our new inequality is asymptotically tight for a wide range of W's -- the tight example involves composing a Chebyshev polynomial with a monomial. So our new Markov-type inequality is saying that in order to maximize the derivative of a polynomial of degree d and weight W which stays bounded on the interval [-1,1], the best approach is to "spend all of the allowed weight on a Chebyshev polynomial T, and then spend any remaining allowed degree by composing T with a low-weight, high-degree polynomial (namely, a monomial)." It actually turns out that this same intuition underlies our positive result, which allowed us to improve over the Klivans-Servedio construction when the degree of the (generalized) PTF is relatively high.
Subscribe to:
Posts (Atom)