Miles Brundage

Review of my 2017 Forecasts

1/4/2018

TL;DR: I got a lot of stuff right and a lot of stuff wrong. My Brier score (a measure of forecasting skill) was .22, which is slightly better than chance but not great. My forecasts were most accurate in areas that I’m familiar with, and areas where quantitative extrapolation was possible; they were less accurate in cases where I relied on discussions with a small number of bullish experts or on a vague sense of the “hotness” of certain sub-fields of AI.

Introduction

I’ve written a fair amount about the issue of progress in AI and forecasts thereof, including this paper and this blog post last year. This topic is intrinsically interesting to me, and seems neglected relative to its interestingness--most AI-related papers are about how to make capabilities a bit better rather than trying to understand the relative contributions of hardware, algorithmic improvements, data, fine-tuning, etc. to past performance increases, or projecting these into the future.

The topic also seems important, because lots of people are rightly concerned about the societal implications of AI, which in turn hinges partially on how quickly certain developments will arise. And in light of substantial expert disagreement on the future of AI (including both the timing of developments and its social implications), it seems like it should be a topic of a lot of scholarly interest. There has been some movement towards taking it more seriously in the last year or so, including the EFF AI Progress Measurement Project and the AI Index, but as researchers like me are wont to say, more research is still needed.

Besides my prior paper on this topic and some ongoing research, I’ve tried to contribute to this discussion via blog posts and Twitter, by putting specific falsifiable AI forecasts on record, so that they can be evaluated in the future by myself and others. I also encourage others to do the same. I focus on short-term forecasts so that there is a feedback loop in a reasonable period of time regarding the reliability and calibration of one’s forecasts, and because, a priori, short-term forecasts seem more likely to be reliable than long-term ones. In this blog post, I’ll try to resolve all of my forecasts made in early 2017 and discuss some takeaways I’ve arrived at from analyzing them.

As alluded to in the TL;DR above (no need to read it if you haven’t already), they’re a mixed bag. Overall, I’ve updated a bit in the direction of not trusting my own short-term forecasts, or those of others about whom I have even less information regarding their extent of overconfidence (a pervasive problem in forecasting and human judgment more generally, as documented in other domains).

Methods

Resolving my 2017 forecasts was a lot easier than resolving my 2016 forecasts, because I put most of the 2017 ones in a single blog post and they were mostly pretty objective. But a lot of ambiguities arose, some expected (e.g. a forecast about “impressive” progress) and some unexpected (e.g. I forgot to specify some aspects of how quantitative Atari forecasts should be evaluated). To help with my evaluation process, I solicited help from relevant experts, especially Sebastian Ruder on the topic of transfer learning and Marc Bellemare on the topic of Atari AI. Thank you, Seb and Marc, for your help!

I will try to be explicit below about my decision process for arriving at True/False designations for specific forecasts, and about my overall Brier score (a measure of forecasting skill), but I don’t expect to totally eliminate ambiguity. I also don’t want this blog post to be super boring, so I’ll inevitably leave some things out; feel free to get in touch with me to discuss further. One can pretty easily plug in different values for my forecasts and arrive at different conclusions about how well I did in 2017.

Overview of Results

Without further ado, here’s a rapid-fire summary of how the forecasts fared. In the next section, I’ll elaborate on each forecast, what actually happened in 2017 in the related technical area, and how a true/false determination was made.

Best median Atari score between 300 and 500% at end of 2017 (confidence: 80%)
TRUE
Comment: See caveats regarding compute requirements.

Best median Atari score between 400 and 500% at end of 2017 (confidence: 60%)
TRUE
Comment: See caveats regarding compute requirements.

Best mean Atari score between 1500% and 4000% at end of 2017 (confidence: 60%)
TRUE
Comment: See caveat regarding compute requirements below.

No human-level or superintelligent AI (confidence: 95%)
TRUE
Comment: Freebie.

Superhuman Montezuma’s Revenge AI (confidence: 70%)
FALSE
Comment: Resolved as false despite mild ambiguity.

Superhuman Labyrinth performance (confidence: 60%)
TRUE
Comment: Could have been broken into two separate predictions (mean and median), but I opted not to do that to avoid favoring myself.

Impressive transfer learning (confidence: 70%)
FALSE
Comment: Under-specified; resolved as false based on expert feedback.

(sub-prediction of the above) Impressive transfer learning aided by EWC or Progressive Nets (confidence: 60%)
N/A
Comment: Conditional on the above being false, the probability of this one being true is zero, so it is not being counted towards the Brier score calculation.

Speech recognition essentially solved (confidence: 60%)
FALSE
Comment: Did not check with experts but seems unambiguously false for a broad sense of “human-level” speech recognition (especially noisy, accented, and/or multispeaker environments).

No defeat of AlphaGo by human (confidence: 90%)
TRUE
Comment: Seems like a freebie in retrospect given AlphaGo Zero and AlphaZero but perhaps wasn’t obvious at the time in light of Lee Sedol’s game 4 performance.

StarCraft progress via deep learning (confidence: 60%)
FALSE
Comment: Under a somewhat plausible interpretation of the original forecast, this could have been resolved as true, since by some metrics it occurred (good performance attained on StarCraft sub-tasks with deep learning). Using the specific measure of progress I mentioned (performance in the annual StarCraft AI tournament), the forecast was barely false (4th place bot in the annual competition used deep learning, vs. 3rd or above as I said).

Professional StarCraft player beaten by AI system by end of 2018 (confidence: 50%)
TOO SOON TO RESOLVE

More efficient Atari learning (confidence: 70%)
TRUE
Comment: Some ambiguity based on the available data on frames used, performance at different stages of training, etc. but leaning towards true. See comments below.

Discussion of Individual Forecasts

As the above rapid-fire summary suggests, there were several trues and several falses, but that doesn’t give a complete picture of forecasting ability, because one also has to take the confidence level for each forecast into account. I’ll return to that when I calculate my Brier score below. But first, I’ll go through each forecast one by one and comment briefly on what the forecast meant, what actually happened, and how I resolved the forecast. If you’re not interested in a specific forecast, feel free to skip each subsection below or the whole section, as I’ll return to general takeaways later.

Best median Atari score between 300 and 500% at end of 2017 (confidence: 80%)

What I meant: I was referring here (as is pretty clear from the original blog post and the relevant literature) to the best median score, normalized to a human scale (where 0 is random play and 100 is a professional game tester working for DeepMind), attained by a single machine learning system playing ~49-57 Atari games with a single set of hyperparameters across games. This does not mean a single trained system that can play a bunch of games, but a single untrained system which is copied and then individually trained on a bunch of games in parallel. The median performance is arguably more meaningful as a measure than the mean, due to the possibility of crazily high scores on individual games. But mean scores will be covered below, too.
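To make the metric concrete, here is a minimal sketch (my own illustration, with made-up raw scores, not code from any of the papers) of the normalization this implies: each game’s score is mapped so that random play is 0 and the professional game tester is 100, and the forecast concerns the median of that quantity across the game suite.

```python
from statistics import median

def human_normalized(agent, random_play, human):
    """Human-normalized score: 0 = random play, 100 = the professional game tester."""
    return 100.0 * (agent - random_play) / (human - random_play)

# Made-up (agent, random, human) raw-score triples for three games; the forecasts
# concern the median of this quantity across the ~49-57 game suite.
games = [(400, 100, 300), (5000, 200, 1200), (50, 10, 90)]
print(median(human_normalized(a, r, h) for a, r, h in games))  # 150.0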

Why did I say 300-500%? Simple: I just extrapolated the mean and median trends into the future on a graph and eyeballed it. 300-500% seemed reasonably conservative (despite representing a significant improvement over the previous state of the art in 2016 of ~150%, or ~250% with per-game hyperparameter tuning). 400-500% was a more specific forecast, so I gave that range less confidence than 300-500% (see next forecast below).
[Figure: median projection figure taken from my blog post last year]

What happened: There was a lot of progress in Atari AI this year. For a more comprehensive summary of algorithmic improvements bearing on Atari progress in the past few years, as well as remaining limitations, see e.g. this, this, and this. I’ll mention two developments of particular relevance here.

First, the Rainbow system published in October showed the power of combining many recent developments in deep reinforcement learning, including some first published in 2016 and some from 2017. The Rainbow paper had one of the prettiest figures of the year, in my view, which I’m reproducing below, but Rainbow fell short of 300% (though I think it’s pretty reasonable to guess that running it longer, or adding a few more bells and whistles like those discussed in the paper, might have gotten it there). While not settling this forecast, Rainbow will play a role in resolving another forecast later in this post.
[Figure: median human-normalized Atari performance over training, reproduced from the Rainbow paper]
Second, the recently announced Ape-X system blew away previous results by scaling up earlier agent elements, but with a different approach for decentralization (decentralized actors send useful memories, rather than gradients, to a central learner). I’ll return to whether this should count below, because it uses a lot of compute, but taking it at face value, it clearly validates the first and second forecasts by attaining a 434% median score. See below:
[Figure: Ape-X results, including the 434% median score]
How the forecast was resolved: My own interpretation of the papers plus the independent conclusion of Marc Bellemare (with a big caveat to be discussed further below in the Sidenote on Compute). This resolution method applies to all of the Atari forecasts discussed below.

Score: True

Best median Atari score between 400 and 500% at end of 2017 (confidence: 60%)

See above.

Score: True

Best mean Atari score between 1500% and 4000% at end of 2017 (confidence: 60%)

See above for general discussion, and here for an example of some >1500% mean scores reported in 2017. I believe there might be others, and for several recent papers the mean was not reported but would likely be very high given the median score.

Score: True

Sidenote on Compute

For all of the above 3 forecasts, the forecast was resolved as true, but could have been resolved as false if one only considered results that used the same amount of compute as earlier papers. Marc Bellemare writes (personal communication):

“Ape-X results are the best available, as far as I know. This said, [asynchronous] methods use significantly more computational power, so comparing to previous approaches to determine AI technology progress is probably not completely informative -- it's more measuring the availability of resources to AI researchers. On the other hand, there is a case to be made for making these methods work on large distributed systems. …
If comparing compute-for-compute, I believe the answer to be no to all of these, although we have seen significantly progress in all of these metrics (except maybe 3 [Montezuma’s Revenge*]).”
*discussed below

Elaborating on Marc’s point, Ape-X, which plausibly resolved the first two forecasts positively, used a large amount of computing power and a large amount of frames (experiences) per game. The closest median score to 300% that I’m aware of using comparable resources to 2016 papers is Rainbow, at 223% (I exclude UNREAL due to per-game hyperparameter tuning). Likewise, the high mean scores mentioned above use a bit more data than is typical (300 million game frames vs. e.g. 100 or 200 million).

So, as originally stated (without a compute caveat), the forecasts above were all true; but absent that caveat, one might be misled into thinking more progress occurred than actually did. Thanks to Marc for raising this issue.

We will return to the efficiency issue below in the context of another forecast. Note that I will not always interpret my forecasts so charitably (see e.g. Montezuma’s Revenge and StarCraft below); I hope I’m being reasonably objective on average.

No human-level or superintelligent AI (confidence: 95%)

What I meant: Specifically, I said:

“By the end of 2017, there will still be no broadly human-level AI. No leader of a major AI lab will claim to have developed such a thing, there will be recognized deficiencies in common sense reasoning (among other things) in existing AI systems, fluent all-purpose natural language will still not have been achieved, etc.”

This was a bit of a freebie, unless you’re an extreme outlier in views about the future of AI. But I think it’s good to have a range of forecasts in the mix, including confident ones.

What happened: Nothing that would call this forecast’s truth (over the 2017 timeframe) into question, though AI progress occurred, including many efforts aimed at making more general-purpose AI systems.

How it was resolved: Paying attention to the field.

Score: True

Superhuman Montezuma’s Revenge AI (confidence: 70%)

What I meant: I said: “[B]y the end of the year, there will be algorithms that achieve significantly greater than DeepMind’s “human-level” threshold for performance on Montezuma’s Revenge (75% of a professional game tester’s score). Already, there are scores in that ballpark. By superhuman, let’s say that the score will be over 120%.”

Notably, I did not say anything about average vs. best case performance, which bears on the resolution question below.

What happened: A bit of progress, but less than I expected. I noticed after making this forecast that even in 2016, there was a paper (by Marc, actually) in which a single seed of a system achieved well over 120%, but not in a way that was reproducible across different random seeds and hyperparameters - the hard exploration problem here remains unsolved, for now. Typically people report some sort of average across runs, so I won’t count that single-seed result, and nothing else reliably surpassed 120% on Montezuma’s Revenge this year to my knowledge.

How it was resolved: See Atari resolution methods above (my and Marc’s judgment).

Score: False

Superhuman Labyrinth performance (confidence: 60%)

What I meant: “Labyrinth is another environment that DeepMind uses for AI evaluation, and which affords human-normalized performance evaluation. Already, the UNREAL agent tests at 92% median and 87% mean. So I’ll use the same metric as above for Montezuma’s Revenge (superhuman=120%) and say that both mean and median will be superhuman for the tasks DeepMind has historically used.”

What happened: A few papers reported improved Labyrinth (aka DeepMind Lab) results. I believe the highest results were attained with the Population-Based Training (PBT) method layered on top of the UNREAL agent. My understanding from the relevant papers is that the PBT paper uses 2 Labyrinth tasks that the UNREAL paper did not. After subtracting those 2 tasks, so that we’re comparing the same set of 13 tasks, the mean and median (my calculations) are 165 and 133.5%, respectively. Note that this differs from the “average” (presumably median) figure used in the PBT paper, which includes 2 tasks I excluded which were not widely used when I made my forecast.

How it was resolved: I analyzed the relevant papers as described above. Note that I could have counted this as two separate forecasts (mean and median), but didn’t as it was originally listed as one.

Score: True

Impressive transfer learning (confidence: 70%)

What I meant: “Something really impressive in transfer learning will be achieved in 2017, possibly involving some of the domains above, possibly involving Universe. Sufficient measures of “really impressive” include Science or Nature papers, keynote talks at ICML or ICLR on the achievement, widespread tech media coverage, or 7 out of 10 experts (chosen by someone other than me) agreeing that it’s really impressive.”

What happened: A lot of cool results were published in the transfer learning area in 2017, but no one development in particular rose to the level of “really impressive” to a large swathe of the relevant community. This is my understanding based on talking to some folks and especially the comments of Sebastian Ruder and his estimate of what other experts would say. In retrospect, as discussed further below, I think I based this forecast too much on transfer learning being much-hyped as an area for immediate work in 2016--it’s a harder problem than I realized.

How it was resolved: I asked Sebastian to evaluate this, and he said many thoughtful things--I recommend you read the thread in full: https://twitter.com/seb_ruder/status/948403531765067776 In further discussion, Sebastian wrote, “I think that most would agree that there was significant progress, but as that's true for most areas in DL and as I think it'd be harder for them to agree on a particular achievement IMO, I'd also suggest failure [of the forecast].”

Score: False

(sub-prediction of the above) Impressive transfer learning aided by EWC or Progressive Nets (confidence: 60%)

This was a minor sub-prediction of the above - that the source of transfer learning progress would be improvements building on elastic weight consolidation or progressive neural networks, two 2016 developments. While both papers were cited in subsequent developments, nothing earth shattering happened. I don’t feel that my Brier score should be further penalized by this failure, though, since it was implicitly conditional on impressive transfer learning happening at all.

Score: N/A

Speech recognition essentially solved (confidence: 60%)

What I meant: “I think progress in speech recognition is very fast, and think that by the end of 2017, that for most recognized benchmarks (say, 7 out of 10 of those suggested by asking relevant experts), greater than human results will have been achieved. This doesn’t imply perfect speech recognition, but better than the average human, and competitive with teams of humans.”

What happened: There has been further progress in speech recognition, but it remains unsolved in its full generality, as argued recently--noisy environments, accents, multiple speakers talking simultaneously, and other factors continue to make this difficult to solve satisfactorily, although there are some additional benchmarks along which superhuman performance has been claimed. Overall, I think this may have been harder than I thought, and that I updated too much on a relevant expert’s bullishness on progress about a year and a half ago. However, it also seems somewhat plausible to me that this was achievable this year with a bigger push on the data front, and I haven’t seen much progress beyond 2016 era size datasets (~20,000 hours or so of transcribed speech). As documented by researchers at Baidu recently in some detail, many deep learning problems seem to exhibit predictable performance improvement as a function of data, and speech might be such a problem, but we won’t know until we have (perhaps) a few hundred thousand hours of data or more. In any case, as was the case last year, I don’t know much about speech recognition, and am not very confident about any of this.

How it was resolved: Absence of any claims to this effect by top labs; the blog post linked to above.

Score: False

No defeat of AlphaGo by human (confidence: 90%)

What I meant: “It has been announced that there will be something new happening related to AlphaGo in the future, and I’m not sure what that looks like. But I’d be surprised if anything very similar to the Seoul version of AlphaGo (that is, one trained with expert data and then self-play—as opposed to one that only uses self-play which may be harder), using similar amounts of hardware, is ever defeated in a 5 game match by a human.”

What happened: The “something new” I mentioned ended up being a match with Ke Jie, among other events that DeepMind put on in China. Ke Jie lost 3-0 to a version pretty similar to the Seoul version of AlphaGo, although at the time AlphaGo Zero, trained only from self-play, also existed. There is now talk of some sort of Tencent-hosted human versus machine Go match in 2018, but this seems more about publicity than any real prospect that humans will regain supremacy. If a Tencent-designed AI loses to a human, I am pretty sure it will be the humans behind the machine who are at fault, as strongly superhuman Go performance has been definitively demonstrated by AlphaGo Fan, AlphaGo Lee, AlphaGo Master, AlphaGo Zero, and AlphaZero.

How it was resolved: Nothing fancy was needed.

Score: True

StarCraft progress via deep learning (confidence: 60%)

What I meant: “Early results from researchers at Alberta suggest that deep learning can help with StarCraft, though historically this hasn’t played much of if any role in StarCraft competitions. I expect this will change: in the annual StarCraft competition, I expect one of the 3 top performing bots to use deep learning in some way.”

What happened: I’m disappointed in how I phrased this forecast, because it’s ambiguous whether the StarCraft competition condition was meant as necessary, sufficient, or neither with respect to the overall “StarCraft progress via deep learning” forecast. For the purpose of being unbiased, I will assume that I meant it as a necessary condition - that is, since the top-3 condition didn’t happen, I was wrong. Note, however, that there was progress on sub-tasks of StarCraft in 2017 with deep learning, and the 4th place contestant in the tournament did use deep learning.

How it was resolved: Looking at the slides describing the tournament results, and not seeing anything about deep learning for the top 3 bots, but seeing something about deep learning for the 4th place contestant.

Score: False

Professional StarCraft player beaten by AI system by end of 2018 (confidence: 50%)

This forecast refers to the end of 2018, and has since been converted into a bet. It hasn’t happened yet, at least based on public information, which is unsurprising (I would obviously have assigned less than 50% to it happening a year before the forecast’s end-of-2018 deadline). So this doesn’t affect my Brier score.

More efficient Atari learning (confidence: 70%)

What I meant: “an [AI] agent will be able to get A3C’s [A3C is an algorithm introduced in 2016] score using 5% as much data as A3C”. This would be a 2x improvement over UNREAL, which in turn was a big improvement over previous approaches in terms of the metric in question (median points attained per unit of training frames).

What happened: There was a fair amount of progress in 2017 in the area of efficient deep reinforcement learning, including some work specifically tested in Atari in particular. For example, the Neural Episodic Control algorithm was specifically motivated by a desire to extract maximum value out of early experiences.

How it was resolved: Resolving this was hard based on the data available (see Marc’s comments below), and some eyeballing of graphs and cross-referencing of multiple papers was required. This is partly because papers rarely report exactly what the performance of an algorithm was at many points in the training process, and sometimes other details are left out.

But ultimately, Marc and I both concluded that this threshold was reached with Rainbow. Interestingly, unlike Neural Episodic Control, Rainbow was not primarily designed for fast learning--it was intended to be a well-rounded system with both rapid learning and good final performance, and to determine how complementary several different methods were (it turns out they’re pretty complementary). Here’s the same figure from above, comparing Rainbow to A3C among other algorithms:

[Figure: the same Rainbow figure as above, comparing Rainbow to A3C among other algorithms]
Marc and I ended up interpreting my forecast and resolving it in different ways, though with the same ultimate conclusion. I looked at the above figure and tried to eyeball a point on the Rainbow curve about 5% of the way through the 200 million frames - that is, at 10 million frames, a little bit after the dotted vertical line at 7%. Based on this eyeballing of the Rainbow figure, I concluded that the rainbow-colored line was above the horizontal line (roughly corresponding to A3C’s final performance) by the time 10 million frames is reached - somewhere around the point where the rainbow-colored line is yellow.

Marc, on the other hand, started from A3C’s score (reported elsewhere) and tried to figure out how many frames it used, which was surprisingly difficult (sidenote: Marc is a coauthor on a paper which touches on related evaluation issues). After some back and forth, he wrote:

“I was going by the assumption that A3C consumed about 320M frames of data, which from squinting at the Rainbow graph would put us just short of the 5% mark (but around 7-8%). However, I can't find that # frames figure in the paper. The UNREAL paper has # frames for A3C, but reports mean scores, not median scores. However, if we cross reference the UNREAL learning curve with the A3C mean score reported in Vlad Mnih's 2016 paper, we get a result that suggests about 1 billion frames for A3C, or 180 frames per second -- which sounds about right for the 4 days of training reported in Mnih et al. So you might be right, & I'll adjust my statement to...yes [for this forecast].”

Score: True
_____

Brier Score Calculation

To introduce Brier scores, the means by which I grade my forecasts, I’ll start by excerpting Wikipedia at length:

“The Brier score is a proper score function that measures the accuracy of probabilistic predictions. It is applicable to tasks in which predictions must assign probabilities to a set of mutually exclusive discrete outcomes. The set of possible outcomes can be either binary or categorical in nature, and the probabilities assigned to this set of outcomes must sum to one (where each individual probability is in the range of 0 to 1). It was proposed by Glenn W. Brier in 1950.[1]
...the Brier score measures the mean squared difference between:
  • The predicted probability assigned to the possible outcomes ...
  • The actual outcome ...
Therefore, the lower the Brier score is for a set of predictions, the better the predictions are calibrated. Note that the Brier score, in its most common formulation, takes on a value between zero and one, since this is the largest possible difference between a predicted probability (which must be between zero and one) and the actual outcome (which can take on values of only 0 or 1).”

Brier scores are calculated in a few different ways, but here I will use the common formulation mentioned above, in which 0 is the best, 1 is the worst, and .25 means you’re essentially forecasting randomly.

Using the common formulation of the Brier score and the scores above, we get:

*drum-roll*

.22

The calculation I plugged into Wolfram Alpha was, if you want to try variations on it:

((.8-1)^2+(.6-1)^2+(.6-1)^2+(.95-1)^2+(.7-0)^2+(.6-1)^2+(.7-0)^2+(.6-0)^2+(.6-0)^2+(.9-1)^2+(.6-0)^2+(.7-1)^2) /12
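For those who want to reproduce or tweak the number without Wolfram Alpha, here is the same calculation as a small Python sketch; the (confidence, outcome) pairs simply mirror the terms in the expression above.

```python
def brier(forecasts):
    """Mean squared difference between predicted probability and actual outcome (1 or 0)."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

# (confidence, outcome) pairs corresponding to the terms above.
pairs = [(.8, 1), (.6, 1), (.6, 1), (.95, 1), (.7, 0), (.6, 1),
         (.7, 0), (.6, 0), (.6, 0), (.9, 1), (.6, 0), (.7, 1)]
print(round(brier(pairs), 2))  # 0.22
```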

So, I did better than chance, but barely. Sad! If a few forecast resolutions had gone the other way, the score would have crossed over into worse-than-chance territory, and several would have had to flip in my favor for me to be very (meta-)confident about the future of AI.

Note that Brier scores give an incomplete picture of how good one is at forecasting, however, so being close to .25 doesn't mean I did a horrible job, though I might have. Two other factors are key.

First, some domains are harder to forecast than others - without more analysis of past data and past forecasts, we don’t know how random or predictable AI progress is. I'd be curious to see people do this with datasets like the AI Index and the EFF AI Progress Measurement Project, and I'd like to see more people make and evaluate more forecasts.

Second, Brier scores are about the forecasts you actually made, not other possible forecasts. I could have made many easy forecasts to boost my score; arguably, the only such freebie I included was the "no human-level or superintelligent AI" one. With many virtually certain forecasts that turned out correct, my score would have been closer to 0.

In forecasting tournaments, it's typical to have a range of difficulty levels for questions, and to have many people answer the same questions. That way, you can know whose forecasts are more or less impressive. It also helps to have some base rates of predictability for the domain in question, which, again, we (/I) don't have yet.


Perhaps more important than the overall Brier score and the considerations above, though, is what can be learned from the errors.

Patterns

Over the course of the year, as I started to see how forecasts would resolve, I reflected a bit on the sources of errors, and I’ve noticed a few patterns.

First, all of my errors were in the direction of overestimating short-term AI progress. This isn’t super (meta-)surprising to me, since I consider myself pretty bullish about AI, and would be surprised to find myself in the opposite camp. And it is also at least potentially consistent with (Roy) Amara’s Law: “We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run.”

I am not sure I should radically update in response to this, though, for two reasons. First, there is an asymmetry in some of my forecasts such that it would have been impossible to end up underestimating progress (because I didn’t give an upper bound, e.g. with the “superhuman” forecasts for Labyrinth and Montezuma’s Revenge). Second, I generally believe there are bigger risks from underestimating technical progress than from overestimating it, so if there is an inevitable false positive/false negative tradeoff, I'd like to be slightly on the overestimating side. Most of the things we should do in response to expectations of rapid future progress are pretty robustly beneficial (e.g. more safety, ethics, and policy research, or improving social safety nets). But there are cases where that isn’t true, and I made some actual errors that aren’t explainable by the asymmetry point, so I will try to update at least somewhat in response to this pattern and move towards a more calibrated estimate of progress.

Second, in the case of speech recognition and Montezuma’s Revenge, I deferred too much to a small number of relevant experts who were (and are) bullish about AI progress. I suspect part of the reason this happened is that I spend a lot of time talking to people who are concerned about the societal implications of AI, and some such people arrive at their concern based on views about how quickly the technology is progressing relative to society’s speed of preparation (though not all people concerned about societal implications can be described that way or are bullish). I’ll try to take this into account in the future, and more deliberately seek out and associate with skeptics about progress in particular technical areas and with regard to AI in general.

Third, I generally did better in areas that I am familiar with relative to areas I am unfamiliar with (e.g. Atari=familiar, speech recognition=unfamiliar). This isn’t surprising, but worth reflecting more on. Sometimes these factors overlapped (e.g. speech recognition is also an area I’m not that familiar with, and I deferred to someone on it). So I’ll try to limit my forecasts in the future to areas I’ve spent a non-trivial amount of time understanding, like deep reinforcement learning applied to control problems like Atari. I was thinking about making some forecasts for deep reinforcement learning for continuous control problems (like robotics) this year, and now am not so sure about that, because I’ve read such papers less carefully and extensively than Atari-related discrete control papers.

Fourth, the forecasts I made based on quantitative trends did well relative to my forecasts based on other factors like (small n) expert judgment, my gut feelings, and my perception of the hotness of a given research area. There is some evidence from other forecasting domains that quantitative forecasts often fare better than expert judgment. And it’s (in retrospect, at least) unsurprising that hotness isn’t a great signal, since often people work on problems that aren’t just important and timely but also difficult and intellectually interesting, like transfer learning.

Fifth, the problems with evaluating AI progress (which is notoriously difficult) also apply to forecasting AI progress. Issues with underreporting of evaluation conditions, for example, made it tricky to determine whether some of the forecasts were right. Some of the ambiguities in my forecast resolutions could have been avoided had I been more specific and spent more time laying out the details, but that would have been more likely to happen in the first place if widely agreed upon evaluation standards were used in AI, so I can’t take all the blame for that.

Conclusion

Overall, I found this retrospective analysis to be humbling, but also helpful. I encourage others to make falsifiable predictions and then look back on them and see what their error sources are.

I’ve definitely grown less confident as a result of doing this Brier score calculation, though I also have become generally more skeptical of other people’s forecasts, since, first, most people don’t make many falsifiable public predictions based on which they can be held accountable and improve their calibration over time, and second, several of my failures were based in part on trusting other people’s judgments. As noted above in the discussion of Brier scores, it's not clear how good or bad my score was, because we (or at least I) don't know what the underlying randomness is in AI progress, and I didn't give myself many freebies.

However, I continue to be impressed, as I was last year, with the power of simple extrapolations, and may make more of these in the future, perhaps using methods derived from the Baidu paper mentioned above, or along the lines I discussed but didn’t follow up on in last year’s post, or using other approaches.

I’m not sure yet whether or when I will write a similar blog post of forecasts for 2018. If I do, I’ll try harder to be more specific about them, since, despite trying hard to make my forecasts falsifiable in 2017, there was still a fair amount of unanticipated wiggle room.

To be continued*!

*confidence level: 70%

____

Recommended Reading

In addition to the various links above, I’m including pretty much the same reading list that I gave last year, for those who want to know more about issues of technological forecasting in general, AI progress measurement in particular, and related issues. I’m adding an additional reference by Morgan on the promise and peril of using expert judgments for policy purposes, and the aforementioned expert survey by some of my colleagues on the longer-term future of AI.
____

Anthony Aguirre et al., Metaculus, website for aggregating forecasts, with a growing number of AI events to be forecasted: http://www.metaculus.com/questions/#?show-welcome=true

Stuart Armstrong et al., “The errors, insights and lessons of famous AI predictions – and what they mean for the future”:  www.fhi.ox.ac.uk/wp-content/uploads/FAIC.pdf

Miles Brundage, “Modeling Progress in AI”: https://arxiv.org/abs/1512.05849

Jose Hernandez-Orallo, The Measure of All Minds: Evaluating Natural and Artificial Intelligence: https://www.amazon.com/Measure-All-Minds-Evaluating-Intelligence/dp/1107153018

Doyne Farmer and Francois Lafond, “How predictable is technological progress?”: http://www.sciencedirect.com/science/article/pii/S0048733315001699

Katja Grace et al., “When Will AI Exceed Human Performance? Evidence from AI Experts,” https://arxiv.org/abs/1705.08807

Katja Grace and Paul Christiano et al., AI Impacts (blog on various topics related to the future of AI):  http://aiimpacts.org/

M. Granger Morgan, “Use (and abuse) of expert elicitation in support of decision making for public policy,” PNAS, http://www.pnas.org/content/111/20/7176

Luke Muehlhauser, “What should we learn from past AI forecasts?”: http://www.openphilanthropy.org/focus/global-catastrophic-risks/potential-risks-advanced-artificial-intelligence/what-should-we-learn-past-ai-forecasts

Alan Porter et al., Forecasting and Management of Technology (second edition): https://www.amazon.com/Forecasting-Management-Technology-Alan-Porter/dp/0470440902

Tom Schaul et al., “Measuring Intelligence through Games”: https://arxiv.org/abs/1109.1314

My AI Forecasts--Past, Present, and Future (Main Post)

1/4/2017

I have a long-standing interest in understanding how predictable AI progress is, and occasionally make my own forecasts. In this post, I’ll review some of my previous forecasts, make new ones for 2017, and suggest ways that I and others could make better predictions in the future.  The purpose of this post is to gather all of the forecasts in one place, keep myself honest/accountable regarding AI forecasting (which was part of the point of making the forecasts in the first place), see what if anything can be learned so far, and encourage others to do more of the above.

For those most interested in the stuff on Atari (my more quantitative forecasts) and my new predictions, and less interested in how my other miscellaneous forecasts fared, just read this blog post. If you want to know more about the process I went through to review all of my other forecasts, and to see how I did on predicting non-Atari things, see this supplemental page.


Atari Forecasts

I’m focusing on these because I’ve thought about them a lot more than the other forecasts linked to above, and because they’re more specific and data-based. I also have a lot of data on Atari performance, which will be made public soon.

In early 2016, I made a simple extrapolation of trends in mean and median Atari performance (that is, the best single algorithm’s mean and median score across several dozen games). For both mean and median performance, I made a linear extrapolation and an exponential extrapolation.




Extrapolation of mean/median Atari trends (https://t.co/UYzh8bleNF) - to do seriously, would include error bars... pic.twitter.com/LIZR53MMDI

— Miles Brundage (@Miles_Brundage) April 23, 2016
I also said:

Sidenote: if Atari AI prog. is fairly predictable, median score in the no-op condition will be in the vicinity of 190-250% by end of year.

— Miles Brundage (@Miles_Brundage) May 19, 2016
As the use of the word “sidenote” suggests, this wasn’t that rigorous of a forecast. I just took the available data and assumed the trend was either linear or exponential, and that the future data would be between those two lines. I’ll mention some ways I could have done this better later in this post. But it turned out to be fairly accurate, which I find interesting because it’s often claimed that progress in general AI is nonexistent or impossible to predict. In contrast, I think that general game playing is one of the better (though still partial) measures of cross-domain learning and control we have, and it’s fairly steady over time.
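For what it’s worth, that kind of eyeball extrapolation amounts to something like the following sketch; the data points here are placeholders rather than the actual scores I plotted.

```python
import numpy as np

# Placeholder (year, best median human-normalized Atari score in %) points,
# illustrative only, not the actual data I plotted.
years = np.array([2013.0, 2014.0, 2015.0, 2016.0])
scores = np.array([40.0, 80.0, 115.0, 150.0])

# Linear trend: score ~ a * year + b.
a, b = np.polyfit(years, scores, 1)
linear_next = a * 2017.0 + b

# Exponential trend: log(score) ~ c * year + d.
c, d = np.polyfit(years, np.log(scores), 1)
exp_next = np.exp(c * 2017.0 + d)

# Treat the two extrapolations as bracketing the forecast range.
print(sorted([linear_next, exp_next]))
```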

Here is the same plot, with two recent scores added, along with the range I forecasted. The light blue ovals are the data I had as of April, the lines are the same as those I plotted in April, the dark blue stars are very recent scores, and the red oval is the range I forecasted before. Two of DeepMind’s ICLR submissions are roughly at the bottom and top of the range I expected based on eyeballing the graph in April. Obviously, this could have been done more rigorously, but it seems to have been about right.
[Figure: updated median Atari extrapolation plot, as described above]
Note that there is eventually some upper bound due to the way evaluation is done (with a fixed amount of time per game), but it may not be reached for a while. And other metrics can be developed (e.g. learning speed, perhaps explored in a future post) which allow for other measures of progress to be projected, even if final scores max out, so I don't see any reason why we couldn't keep making short-term forecasts of benchmarks like this.

Based on the recent data, I think that we might be seeing an exponential improvement in median scores. The range I gave before was agnostic regarding linear vs. exponential, and recent data points were in the ballpark of both of those lines, but only the higher one really counts since we’re interested in the highest reported score. Using the same sort of simple extrapolation I used before, I pretty strongly (80% confidence) expect that median scores will be between 300 and 500% at the end of 2017 (a range that covers linear progress and a certain speed of exponential progress), and not quite as strongly (60%) expect it to be at the higher end of this, that is, 400-500%, reflecting exponential progress before some eventual asymptote well above human performance.

For mean scores (which I didn’t make a prediction for before), here is what the most recent data (the PGQ paper submitted for ICLR) looks like when added to the graph from April.
[Figure: mean Atari score extrapolation from April, updated with the PGQ result]
It turns out that mean scores could have been more accurately predicted by an exponential curve—and more specifically, a faster exponential curve than I had come up with. It makes sense that mean scores would grow faster than median scores, but I’m somewhat surprised by how fast mean progress has been. I didn’t make a forecast in April, though, so I’ll rectify that now: by the end of the year, I weakly (60% confidence) expect mean scores to be between 1500% and 4000%. Obviously, that’s a pretty wide range, reflecting uncertainty about the exponent, but even at the low end, it’d be a lot higher than where we are today (877.23%).

Finally, note that these are pretty simple extrapolations and it might turn out that scores asymptote at some level before the end of the year. It seems plausible to me that you could figure out a rough upper bound based on detailed knowledge of the 57 games in question, but I haven't done this.


Conclusion re: Past Forecasts

Overall, I think my forecasts for Atari and the other domains covered (see the supplemental post for more examples of forecasts) were decent and reasonably well-calibrated, but I’m perhaps biased in my interpretation. I haven’t calculated a Brier score for my previous forecasts, but this would be an interesting exercise. Among other things, to do this, I’d have to quantify my implicit levels of confidence in earlier predictions. Perhaps I could have others assign these numbers in order to reduce bias. Since I’m giving confidence levels for my forecasts below, it will be easier to calculate the Brier score for my 2017 predictions.

Also, I think that the success of the median Atari forecast, and the plausibility that the mean forecast could have been better via e.g. error bars, suggests that there may be high marginal returns on efforts to quantify and extrapolate AI progress over the short-term.

Finally, it was a pain to find all of my old forecasts, so in the future I’ll put them in blog posts or use a specific hashtag to make them more easily discoverable.

Present Forecasts

Below are forecasts that I’ve either thought about a lot already, or just came up with in a few minutes for the purpose of this post. These are labeled “present” forecasts because while they’re about the future, they’re relatively weak and shoddy compared to what I or others might do in the future, e.g. with theoretically well-motivated error bars, more rigorous data collection, a wider range of tasks/domains covered, etc. I’ll say a bit about such future forecasts later, but for now I’ll just list a bunch of present forecasts.

First, I’ll just repeat what I said above about Atari.

Best median Atari score between 300 and 500% at end of 2017

Confidence level: 80%

Best mean Atari score between 1500% and 4000% at end of 2017

Confidence level: 60%

No human-level or superintelligent AI

By the end of 2017, there will still be no broadly human-level AI. No leader of a major AI lab will claim to have developed such a thing, there will be recognized deficiencies in common sense reasoning (among other things) in existing AI systems, fluent all-purpose natural language will still not have been achieved, etc.

Confidence level: 95%

Superhuman Montezuma’s Revenge AI

I don’t think this is that provocative to those who follow Atari AI super closely, versus how it may seem to those who are casual observers and have heard that Montezuma’s Revenge is hard for AIs, but I think by the end of the year, there will be algorithms that achieve significantly greater than DeepMind’s “human-level” threshold for performance on Montezuma’s Revenge (75% of a professional game tester’s score). Already, there are scores in that ballpark. By superhuman, let’s say that the score will be over 120%.

Confidence level: 70%

Superhuman Labyrinth performance

Labyrinth is another environment that DeepMind uses for AI evaluation, and which affords human-normalized performance evaluation. Already, the UNREAL agent tests at 92% median and 87% mean. So I’ll use the same metric as above for Montezuma’s Revenge (superhuman=120%) and say that both mean and median will be superhuman for the tasks DeepMind has historically used. I’m not as familiar with Labyrinth as Atari, so am not as confident in this.

Confidence level: 60%

Impressive transfer learning

Something really impressive in transfer learning will be achieved in 2017, possibly involving some of the domains above, possibly involving Universe. Sufficient measures of “really impressive” include Science or Nature papers, keynote talks at ICML or ICLR on the achievement, widespread tech media coverage, or 7 out of 10 experts (chosen by someone other than me) agreeing that it’s really impressive.

Confidence level: 70%

I also weakly predict that progressive neural networks and/or elastic weight consolidation (thanks to Jonathan Yan for suggesting the latter to me) will help with this (60%).

Speech recognition essentially solved

I think progress in speech recognition is very fast, and think that by the end of 2017, that for most recognized benchmarks (say, 7 out of 10 of those suggested by asking relevant experts), greater than human results will have been achieved. This doesn’t imply perfect speech recognition, but better than the average human, and competitive with teams of humans.

Confidence level: 60%

No defeat of AlphaGo by human

It has been announced that there will be something new happening related to AlphaGo in the future, and I’m not sure what that looks like. But I’d be surprised if anything very similar to the Seoul version of AlphaGo (that is, one trained with expert data and then self-play—as opposed to one that only uses self-play which may be harder), using similar amounts of hardware, is ever defeated in a 5 game match by a human.

Confidence level: 90%

StarCraft progress via deep learning

Early results from researchers at Alberta suggest that deep learning can help with StarCraft, though historically this hasn’t played much of if any role in StarCraft competitions. I expect this will change: in the annual StarCraft competition, I expect one of the 3 top performing bots to use deep learning in some way.

Confidence level: 60%

Professional StarCraft player beaten by AI system

I don’t know what the best metric for this is, as there are many ways such a match could occur. I’m also not that confident it will happen next year, but I think I’d be less surprised by it than some people. So partly because I think it’s plausible, and partly because it’s a more interesting prediction than some of the others here, I’ll say that it’ll happen by the end of 2018. I think it is plausible that such an achievement could happen through a combination of deep RL, recent advances in hierarchical learning, scaling up of hardware and researcher effort, and other factors soon-ish, but it's also plausible that other big, longer-term breakthroughs are needed.

Confidence level: 50%

More efficient Atari learning

I haven’t looked super closely at the data on this, but I think there’s pretty fast progress happening in Atari learning with less computational resources. See e.g. this graph of several papers’ hardware-type-adjusted score efficiency (how many points produced per day of CPU, with GPUs counting as 5 units of CPU).


[Figure: hardware-adjusted Atari score efficiency for several papers]
The big jump is from A3C, which learned relatively quickly using CPUs, vs. days of GPUs on earlier systems. Moreover, the UNREAL agent learns approximately 10x faster than A3C. So by the end of 2017, I’ll say that learning efficiency will be twice as good as that: an agent will be able to get A3C’s score using 5% as much data as A3C. Considering how big a jump happened with just one paper (UNREAL), this seems conservative, but as with the mean score forecast above, it’s still a big jump over what exists today so is arguably a non-trivial prediction.

Confidence level: 70%
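As a footnote on the efficiency metric behind the graph above, here is a minimal sketch of the hardware-type adjustment as I described it (points per CPU-day, with each GPU-day counted as 5 CPU-days); the run numbers are made up.

```python
def hardware_adjusted_efficiency(total_points, cpu_days, gpu_days, gpu_weight=5.0):
    """Points produced per CPU-day-equivalent, counting each GPU-day as 5 CPU-days."""
    return total_points / (cpu_days + gpu_weight * gpu_days)

# Made-up runs: a CPU-only agent vs. a GPU-trained agent.
print(hardware_adjusted_efficiency(total_points=12000, cpu_days=16, gpu_days=0))
print(hardware_adjusted_efficiency(total_points=15000, cpu_days=0, gpu_days=8))
```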

Future Forecasts


There is a lot of room for improvement in the methodology and scale of AI forecasting.

One can use error bars based on the variance of technological progress rates and the number of data points available, as suggested by Farmer and Lafond (that paper is included in the list of resources below).

There are also many more tasks for which one could gather data and make forecasts. For example, one area that I think is worth looking at is progress in continuous control. It’s an area of real-world importance (specifically, robotics for manufacturing and service applications), and there’s a lot of data available for tasks in MuJoCo in terms of scores, data efficiency, etc.  That’s a case where further research and forecasting/subsequent evaluation of forecasts could be valuable not only for improving our knowledge of AI’s predictability, but also our early warning system for economic impacts of AI. Likewise for some NLP tasks, possibly, but I’m less familiar with the nature of those tasks.

A lot of my forecasts are about technical things rather than the social impact of AI, and the latter is also ripe for well-grounded forecasting. Right now, the people making forecasts of AI adoption are people like Forrester Research, who sell $500 reports, and aren’t transparent about methods (or at least, I don’t know how transparent they are since I can’t afford their reports). It might be useful to have better vetted, and/or crowdsourced, free alternatives to such analyses. Topics on which one could make forecasts include AI adoption, publication rates, relative performance of labs/companies/countries, dataset sizes, job displacement scales/types, etc.

The literature on AI forecasting is pretty sparse at the moment, though there are many resources to draw on (listed below). A lot of things can be improved. But in the future, besides growing this literature on its own terms, I think it’d be good for there to be stronger connections between AI forecasts and the literature on technological change in general. For example, Bloom et al. had a very interesting paper recently called “Are Ideas Getting Harder to Find?” which suggested that fast technological improvement has occurred alongside fast growth in the inputs to that improvement (researcher hours). One could ask the question of how much AI progress we’re getting for a given amount of input, how much those inputs (researchers, data, hardware, etc.) are growing, and why/under what conditions AI progress is predictable at all.

Recommended Reading:

Stuart Armstrong et al., “The errors, insights and lessons of famous AI predictions – and what they mean for the future”:  www.fhi.ox.ac.uk/wp-content/uploads/FAIC.pdf

Miles Brundage, “Modeling Progress in AI”: https://arxiv.org/abs/1512.05849

Jose Hernandez-Orallo, The Measure of All Minds: Evaluating Natural and Artificial Intelligence: https://www.amazon.com/Measure-All-Minds-Evaluating-Intelligence/dp/1107153018

Doyne Farmer and Francois Lafond, “How predictable is technological progress?”: http://www.sciencedirect.com/science/article/pii/S0048733315001699

Katja Grace and Paul Christiano et al., AI Impacts (blog on various topics related to the future of AI):  http://aiimpacts.org/

Anthony Aguirre et al., Metaculus, website for aggregating forecasts, with a growing number of AI events to be forecasted: http://www.metaculus.com/questions/#?show-welcome=true

Luke Muehlhauser, “What should we learn from past AI forecasts?”: http://www.openphilanthropy.org/focus/global-catastrophic-risks/potential-risks-advanced-artificial-intelligence/what-should-we-learn-past-ai-forecasts

Alan Porter et al., Forecasting and Management of Technology (second edition): https://www.amazon.com/Forecasting-Management-Technology-Alan-Porter/dp/0470440902

Tom Schaul et al., “Measuring Intelligence through Games”: https://arxiv.org/abs/1109.1314

Acknowledgments: Thanks to various commenters on Twitter for suggesting different considerations for new forecasts, various people for encouraging me to keep doing AI forecasting and writing it up (sorry it took so long to make this post!), and Allan Dafoe for comments on an earlier version of this post.

My AI Forecasts--Past, Present, and Future (Supplement)

1/3/2017

Warning: less well-written than main post

Methodology for Past Forecast Review

I downloaded a CSV file with all of my tweets and searched for all tweets with the strings forecast*, predict*, extrapolat*, state of the art*, SOTA*, and expect*.  This may have missed a few predictions, and there are some forecasts that I’ve made in places other than Twitter, but this method has probably covered the vast majority of predictions, as I’m pretty tweet-prone.  
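For transparency, the search amounted to something like the following sketch; the file name ("tweets.csv") and the "text" column are assumptions based on the standard Twitter archive export.

```python
import csv
import re

# Patterns corresponding to the wildcard strings listed above.
pattern = re.compile(
    r"forecast|predict|extrapolat|state of the art|SOTA|expect",
    re.IGNORECASE,
)

# "tweets.csv" and its "text" column are assumptions about the archive export format.
with open("tweets.csv", newline="", encoding="utf-8") as f:
    candidates = [row for row in csv.DictReader(f) if pattern.search(row["text"])]

print(len(candidates), "candidate forecast tweets")
```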

It turns out that there were a lot more than I thought (I forgot about a lot of the less rigorous ones), and the forecasts have different implicit (and sometimes explicit) confidence levels and focuses (e.g. quantifiable technical achievements vs. social adoption of/responses to AI).   

For each of the forecasts below, which are arranged in chronological order, I’ll reproduce the text of the tweet, and then say something about how it fared. I didn’t reproduce every single forecast-y tweet here because some are extremely vague or otherwise uninteresting, but here is a link to the spreadsheet on which this blog post was based if you’re interested/want to check (and my entire tweet history if you’re super skeptical about data missing from that curated spreadsheet).  

Annotated List of Forecasts 

I expected that CMU, a NASA-related team, and the Institute for Human and Machine Cognition (IHMC) would do well in the first (virtual) round of the DARPA Robotics Challenge:

Looking forward to the first DARPA Robotics Challenge results on Thursday. My bet is CMU, one of the NASA-related teams, and IHMC do well.

— Miles Brundage (@Miles_Brundage) June 26, 2013
This was a decent forecast (much better than chance under some interpretations of what I meant, though I was pretty vague): IHMC got first place out of 28 teams and a JPL-related team got fifth place. The DARPA Robotics Challenge website is no longer live, so I am having trouble verifying how CMU did in this round. I assume they weren’t in the top six, based on what I later tweeted:

I was two out of three with my DARPA Robotics Challenge predictions...IHMC did the best - not surprised.

— Miles Brundage (@Miles_Brundage) June 27, 2013
I had previously done an internship at IHMC and had personally seen that they were putting a lot of effort into the DRC, so I probably don’t deserve much credit for this forecast. I also didn’t put much work into making it.

Later, I doubled down on this IHMC-boosterism:

If you're in Florida, consider checking out the DARPA Robotics Challenge live on Dec. 20-21 http://t.co/L41D4Wi5A4 My money is on IHMC!

— Miles Brundage (@Miles_Brundage) December 9, 2013

DARPA Robotics Challenge is Friday and Saturday! My bet is still on IHMC. Anyone else have a favorite?

— Miles Brundage (@Miles_Brundage) December 19, 2013
They got second place, but due to events I did not predict (SCHAFT, the winner, being bought by Google and dropping out), my forecast retroactively improved:

So now that Google-SCHAFT is out of the DRC, my prediction of IHMC doing well has retroactively improved, they won rounds 1 and 2! ;-)

— Miles Brundage (@Miles_Brundage) June 26, 2014
Again, I don’t think I get much credit for this.

In early 2015, I said some things about DeepMind's likely work in 2015:


In 2015 I think DeepMind will prob demo some sort of mind blowing learning thing in a 3D world or at least much-richer-than-Atari 2D world.

— Miles Brundage (@Miles_Brundage) January 1, 2015
I don’t know what evidence I based this on, if any, or what counts as a “mind blowing learning thing in a 3D world,” but I think this basically happened a bit later than I expected: the A3C paper showing early impressive results in Labyrinth came out in early 2016. Fortunately, this was within my vague confidence interval:

On error bars: wouldn't be stunned if what I said re: DeepMind demo happened in 2016 not 2015, but if not in 2016 then my model is v. wrong.

— Miles Brundage (@Miles_Brundage) January 1, 2015
In early 2015, I had a vague and pretty incorrect model of what DeepMind and others were trying to do with games – roughly, move forward through time/game complexity space (with newer games generally being harder for AI) and show impressive learning across a wide variety of games for that point in time/complexity space. Based on this, I said:

Mode prediction for where in videogame chronology/complexity space DeepMind will have impressively dominated many hard games in 2016 is 2000

— Miles Brundage (@Miles_Brundage) January 1, 2015
This model of what DeepMind and others are up to turned out to be a bit misguided, since they’re still publishing a lot of results with (old) Atari games, and making brand new environments that don’t easily map onto the metric above (since those environments have highly variable difficulty/complexity). I was wrong to think that they’d try to move on before more definitively solving Atari, and the metric doesn’t capture games like Go (thousands of years old, but still pretty hard). Nevertheless, if you wanted to be generous and take “DeepMind” to refer to the broader AI community, you could say that OpenAI’s Universe covers a lot of Flash games from the early 2000s, some of which deep reinforcement learning (RL) works pretty well on. But overall, I’d say this was a misguided and vague forecast. I did caveat it a bit:

DeepMind *could* focus on playing higher fraction of old games w/o input, but they're also simultaneously moving forward in time game-wise.

— Miles Brundage (@Miles_Brundage) January 6, 2015
Regarding non-game stuff, I said:

DeepMind will prob someday (if they haven't already) do non-game stuff, but for now that's their metric, with some reason - it's very hard!

— Miles Brundage (@Miles_Brundage) January 1, 2015
DeepMind has since applied deep RL to data center energy management and deep learning to healthcare. They have also used non-game domains for benchmarks in research (e.g. MuJoCo). But this was a pretty uninteresting/banal prediction (it’s pretty obvious they would have done something non-game-related eventually).

Anti-prediction for DeepMind 2015-2016: them playing Destiny or other current video game. Way too hard/not worth their time except for fun.

— Miles Brundage (@Miles_Brundage) January 6, 2015
As far as I know, this was correct, unless you count StarCraft 2 as a “current” video game.

Another key pt on DeepMind's near-term game stuff: suspect some of the impressive results they show will *not* be fully autonomous learners.

— Miles Brundage (@Miles_Brundage) January 6, 2015
Arguably, this was ultimately true of AlphaGo – its learning was kickstarted with a dataset of human play, though they have said they’ll explore learning from scratch in the future.

Elaboration on previous 2015-2016 DeepMind predictions: simultaneous to video game stuff, they will prob make some big progress on Go. (1/2)

— Miles Brundage (@Miles_Brundage) January 6, 2015
This was based on the early results from Maddison et al. (including some DeepMind authors) in late 2014 that seemed to suggest to me that they might work more on it in the future and that deep learning could help a lot.

My money is on IHMC doing well in, if not winning, DARPA Robotics Challenge finals. Will be v. interesting to see how the Chinese team does.

— Miles Brundage (@Miles_Brundage) March 21, 2015
IHMC got second (same number of points as the winner, KAIST, but with a slower time) and the Chinese team did poorly.

Regarding speech recognition, in late 2015, I said:

2. Think 2016 will be year in which it's pretty clear that speech recognition is now of broad utility. Also, note role hardware played in...

— Miles Brundage (@Miles_Brundage) December 17, 2015
There wasn’t a clear metric for this. There was a lot of coverage of speech recognition in the tech press, and some impressive (nearly) human-level results, but I’m not sure whether 2016 represented any sort of shift in terms of wide adoption. Anecdotally, it seems more widely used in Beijing than in Western countries, but I don’t know for sure.

As part of a longer rant in 2016, I said:

6. And I no longer think massive progress in AI in, say, 10 years is implausible - now seems plausible enough to plan for possibility of it.

— Miles Brundage (@Miles_Brundage) January 8, 2016
And:

8. I expect enough prog that "human-level AI" will be more clearly revealed as a problematic threshold, and in many domains, long surpassed.

— Miles Brundage (@Miles_Brundage) January 8, 2016

10. access to the Internet is allowed, a la https://t.co/Vs9KX98v3p

— Miles Brundage (@Miles_Brundage) January 8, 2016
This was pretty vague, and the timeline in question is still ongoing, so I can’t evaluate it yet.

Regarding hardware and neural network training speeds, I said:

2. This would affect, as prior hardware improvements have affected, three things: attainable performance, speed thereof, and iteration pace.

— Miles Brundage (@Miles_Brundage) January 15, 2016

3. And that's all just from hardware - algorithmic advances have also been rapid in recent years, though I haven't yet quantified that rate.

— Miles Brundage (@Miles_Brundage) January 15, 2016

4. Seems like a not too crazy projection is that in, say, 3 years, neural nets will be 100x faster to train, w/ big impacts on applications.

— Miles Brundage (@Miles_Brundage) January 15, 2016
(sorry for the bad formatting here)

6. These are just rough ideas currently - may do more rigorous calculation with error bars at some point. Point is, expect much NN progress.

— Miles Brundage (@Miles_Brundage) January 15, 2016
I’m still pretty confident that hardware is speeding up and will speed up neural net training a lot, but we’ll have to wait until early 2019 to evaluate the 100x thing. I’ll try to specify it a bit better now: for a set of 10 benchmarks in image recognition and NLP suggested by multiple experts, you will be able to achieve the same performance, using new hardware and algorithms, in 100x less training time (wall time) vs. results reported in early 2016, on at least 8 of those benchmarks. This is a rough, intuitive guess, so I have less confidence in it than in some of my more quantitative extrapolations of Atari results discussed below.
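To make that resolution criterion concrete, here is a minimal sketch of how it could be scored in early 2019; the benchmark names and all of the timing numbers below are placeholders, not real measurements:

```python
# Placeholder data: wall-clock training hours to reach a fixed target score
# on each benchmark, in early 2016 vs. early 2019. All numbers are invented
# purely for illustration.
training_hours = {
    # benchmark: (hours_in_2016, hours_in_2019)
    "benchmark_01": (240.0, 1.9),
    "benchmark_02": (120.0, 1.5),
    # ... eight more expert-suggested benchmarks would go here ...
}

SPEEDUP_TARGET = 100.0   # 100x faster to train (wall time)
REQUIRED_PASSES = 8      # on at least 8 of the 10 benchmarks

def resolves_true(data, target=SPEEDUP_TARGET, required=REQUIRED_PASSES):
    """Check whether the 100x training-speed forecast resolves true."""
    passes = sum(1 for old, new in data.values() if old / new >= target)
    return passes, passes >= required

print(resolves_true(training_hours))
```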

Regarding AlphaGo’s success against Lee Sedol, I said in the middle of the match:

Predicted AlphaGo victory w/ 65% confidence and 4-1/5-0 for whichever victor w/ 90% confidence, so not too late for me to be very wrong.. :)

— Miles Brundage (@Miles_Brundage) March 12, 2016
This reference to a prior prediction was based on a Facebook comment I made before the match began, which in turn elaborated on views expressed in a blog post I wrote. The comment (not publicly linkable, unfortunately), on March 1, said:
[Screenshot of the Facebook comment]
I think my reasoning at the time was essentially correct: deep RL is in fact very effective and scalable for well-defined zero-sum games, and my conclusions in the aforementioned blog post (about the importance of hardware in this case, and about humans occupying a small band in the space of possible intelligence levels) are still correct. But lots of people thought AlphaGo would win, and I wasn’t extremely confident, so I don’t get much credit for this.
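For reference, since the main post scores forecasts with Brier scores, these two probabilities are easy to score the same way: AlphaGo won, and won 4-1, so both forecasts resolved true. A quick illustrative calculation (not part of the 2017 tally):

```python
def brier(probability, outcome):
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (probability - outcome) ** 2

# Both forecasts resolved true: AlphaGo won, and the final score was 4-1.
alphago_wins = brier(0.65, 1)    # (1 - 0.65)^2 = 0.1225
lopsided_score = brier(0.90, 1)  # (1 - 0.90)^2 = 0.01

print(alphago_wins, lopsided_score)
```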

Regarding dialogue systems:

In few years, I expect impressive (by today's standards, though maybe not future revised ones) limited dialogue AIs from Goog, IBM, FB, etc.

— Miles Brundage (@Miles_Brundage) March 12, 2016
I still believe this, but “a few years” haven’t yet passed, so I don’t have much more to say about it right now, other than that it is probably too vague.

Regarding Google’s business model for AI:

@samim my expectation is that they will gradually, over next 10 years, introduce more, better, and more integrated cognition-as-a-service.

— Miles Brundage (@Miles_Brundage) March 24, 2016
Again, it’s early for this, but this seems pretty plausible to me.

Atari Forecasts

See main blog post.





The White House AI workshops and public engagement in science and technology

5/4/2016


 
Yesterday the White House Office of Science and Technology Policy (OSTP) announced a series of workshops on AI. This is great news: I’m excited to attend one or two of these workshops, since I do related research, and it’s a good sign that OSTP is taking the societal issues of AI so seriously. It’s also encouraging that they have a good lineup of speakers and organizers from academia and industry so far.
 
Still, I think it is important for those interested in these issues, and people in general, to not take the White House and powerful corporations at face value, since we all have a stake in AI policy being done well. So I made what I thought were fairly supportive but mildly critical comments about this last night on Twitter. I’m copying them below for reference, and then I’ll say a bit more about some of the themes in them and my response to a reply from Ryan Calo, who is involved in organizing two of these events (Ryan's scholarly and advocacy work is great, by the way, and I applaud him for taking this on!). My tweets said:
 
"1. Some thoughts on the White House AI initiative: first, this is very good news overall IMO. Important issue getting high level attention.
 
2. Also, the thematic grouping of the workshops makes sense as does livestreaming them, and they seem to have good speakers lined up so far.
 
3. Last point of praise: the task force on AI for government is much needed. U.S. Digital Service, etc. are good but AI seems underutilized.
 
4. Now on to some mild critiques/caveats - it would be a mistake to accept the framing of this as a real public engagement effort. It's not.
 
5. It will have lots of value, IMO, but to actually engage the public on AI, one would need to look at and learn from prior similar efforts.
 
6. Among many other reasons why this isn't on the cutting edge of such things, self-selection will play a huge role in participation.
 
7. And even if all were interested in it, not all could go. This is why robust such efforts do things like compensate people for their time.
 
8. For examples of serious efforts at engagement, see e.g. https://www.nasa.gov/sites/default/files/atoms/files/ecast-informing-nasa-asteroid-initiative_tagged.pdf … (asteroids, in which I participated as a facilitator), ..
 
9. or the National Citizens' Technology Forum: http://pus.sagepub.com/content/23/1/53.abstract … That sort of thing may follow the workshops but they're not the same.
 
10. One final area of caveating/mild critique regards the report to be produced after the workshops. This could be a great report, but...
 
11. it is also clear that as in similar cases (see, e.g. Krimsky's Genetic Alchemy on Asilomar) this will hardly be an apolitical document.
 
12. It will reflect not just what happens at the workshops but various political constraints/biases of those involved at and around OSTP.
 
13. This is not intrinsically bad - some of those biases may be good, and science being political isn't unheard of - but it should be known.
 
14. Among other potential biases here are those for low/no regulation, and with many corporations involved here, advisory capture is a risk.
 
15. So, as in a lot of things sci./tech policy-related, processes are key - and beyond the livestreaming, there is ~ no accountability here.
 
16. What is *done* with this report after it's produced will also be critical, and again, this is not immune from biases/constraints/etc.
 
17. So, in short, good turn of events but I want to emphasize that this is not a panacea from either an AI policy or democratic perspective."
 
In reply to this and a subsequent tweet that CC’d him and the apparent lead on this at OSTP (Ed Felten, whose work I’m less familiar with but also seems smart and well-intentioned here), Ryan wrote:
 
"Thanks, Miles. Seems a little odd/hard to prejudge here but we'll watch for these dangers."
 
This caused me to reflect a bit more on the topic, and while I responded briefly to the above right after, I wanted to write up some of my thoughts at more length, and say in particular why I don’t think it’s necessarily odd or hard to prejudge the limitations of the proposed workshops in at least a few ways. Indeed, people who study public engagement in science and technology (and there are many such people – I just got home from a conference of a hundred or so of them, itself only a small fraction of the field) already know a lot about when and why public engagement works, and based on the publicly available information, these workshops do not obviously circumvent known issues. That would be fine if this were billed merely as a process for gathering input from experts, but in Felten’s post on the topic (which I recommend reading here), he indicated some fairly ambitious aims for the process, such as “engaging with the public” and “spur[ring] public dialogue.” These are good goals, but in practice they’re non-trivial to do well.
 
There’s a large body of practice and scholarship on what the intrinsic limitations of science and technology public engagement are, and there is plenty already known about this topic that informs my (again, mildly) skeptical reaction to these workshops and the associated report that will stem from them. Below I’ll summarize a small amount of what’s known on this, which I think is sufficient to raise some questions and suggest possible solutions to the issues these workshops may run into. At the end I’ll give some specific concrete suggestions for consideration.
 
Why public engagement with science and technology is hard
 
There’s a long history of theory and practice on public engagement with science and technology, and making various cases for why it might or might not matter, may or may not be feasible to do in a meaningful way, etc. See, for example, this paper by Jack Stilgoe on some of the key issues explored in that literature. I can’t do justice to the breadth and depth of that literature here but I’ll briefly share a very selective/biased perspective on it from my point of view.
 
There are at least two key issues at play when one wants to engage the public on science and technology. One is representation, and the other is expertise. With regard to representation, efforts at public engagement (even ones much more involved than the apparent White House approach) have frequently been criticized as limited in their effective inclusion of diverse perspectives. This includes not just demographic diversity but also institutional and ideological diversity (e.g. overly corporate participation with minimal non-profit, union, or other perspectives). This happens not out of malice (usually, at least) but because it’s difficult. People are busy, distracted, committed, etc. and taking time out of their week to engage with an unfamiliar and complex issue like an emerging technology isn’t always high on people’s list of priorities. Hence, self-selection and exclusion of perspectives is a common critique of various public engagement approaches. I’ll give examples of efforts to overcome this in the next section.
 
The second issue, expertise, has also been a matter of much discussion. On the one hand, the very people whose perspectives aren’t yet reflected in science and technology policy discussions rarely have expertise in the topics in question. This means that their inputs at such events may be poorly informed (or at least perceived as such) and immediately disregarded by expert organizers. Again, ways to address this exist to some extent. But there is also a relationship here between expertise and power. In ways both direct/explicit (e.g. experts lecturing to non-experts) and indirect/implicit (e.g. reluctance of non-expert participants to voice dissenting views in the face of apparent expertise), there can be an imbalance of power between groups in the room. This raises the importance of process design and learning from prior similar experiences in order to facilitate constructive, informed dialogue. Lastly, expertise in the engagement context has theoretical issues—who is really an expert in the ethics of AI, or the future of work? In both these areas there is a lot of disagreement among “experts,” and while certainly some are more familiar with the contours of these debates than others, there is a lot of fundamental uncertainty and value disagreement involved, and efforts to convey the state of the art in such areas often result in overly narrow framing of problems and the exclusion of important uncertainties from the discussion. Efforts to inform the public also often go awry when they are, as is common, misinformed by a long-debunked “deficit model” of science communication--that by informing people of the “truth” about science and technology, they will be more supportive of the science/technology. In fact, the opposite sometimes happens, and in some areas like nanotechnology, scientists have been shown to be more concerned about some issues like health impacts than non-experts are. Science dialogue and policy are complex and tricky to do well.
 
I’ve raised a few thorny issues and trade-offs, but there are ways to address these to some extent. I’ll turn to examples of such efforts now.
 
What people have tried before
 
There have been many, many public engagement efforts in the U.S. and around the world in the past several decades. Europe (and a few countries there in particular) has been a leader in this area in sheer number, but the U.S. also has been innovative in this regard, especially in the last decade. Here I’ll give two examples of such efforts, one of which I participated in and another that I didn’t, and I’ll say a few words about what was interesting about these approaches. 
 
One notable effort (which I did not participate in) is the National Citizens’ Technology Forum (NCTF), a series of events held in 2008. Organized by the Center for Nanotechnology in Society at ASU, the events represented a substantial investment of time, effort, and thought in how to engage a broad swathe of the public on the subject of nanotechnology-enabled human enhancement. This involved, among other things, a substantial effort in producing background materials to inform lay participants beforehand (a 61-page background report). It engaged members of the public who exhibited significant geographic diversity (the events were held at six sites across the U.S.), as well as socio-economic and ethnic diversity. This effort explicitly addressed some of the above concerns regarding self-selection and representation by compensating participants for their time with $500 at the end. And lastly, it was not a one-shot event – it spanned a month of virtual and face-to-face interactions. There is literature one can read about this (e.g. this and this) to learn more about it as well as subsequent events with even larger scope such as World Wide Views, but broadly speaking, a key takeaway from the experience is summarized by Dave Guston (full disclosure: my advisor) in the former paper:
 
"The general portrait of deliberation that emerged from the NCTF strongly supports the contention that lay citizens can deliberate in a thoughtful way across a continent about emerging technologies (Philbrick and Barandiaran, 2009) – with a few caveats [relating to the virtual component and the significance of the financial incentive]."

Guston also notes that “The participants mastered technical aspects of nanotechnology presented to them, and they engaged content experts in active, informed, and critical questioning.” I’m not referring to this because it’s a perfect model, and as noted above, it has been improved upon subsequently. The literature on this speaks to various “wins” as well as new considerations. But it’s still a stark contrast to the depth of participation that seems likely at the AI workshops, which is why I bring it up – to emphasize the vast range of possible public engagement models, and the value of learning from history in this area.

The second case I’ll discuss is one that I participated in directly, specifically as a discussion facilitator. This event was known as the NASA Asteroid Initiative Citizens Forum, and you can read all about it in this report. In contrast to the NCTF, the asteroid forum was limited to one day each in two physical locations (Phoenix and Boston) and an online component, with different participants in each city drawn from the local community. The event was sponsored by NASA and represented a novel collaboration between NASA (which provided the funding), the ECAST Network centered at ASU (ECAST stands for “Expert & Citizen Assessment of Science & Technology), the Boston Science Museum, and others. It also represented a very significant investment of effort, including an extremely informative and well put together planetarium show introducing participants to the nature of asteroids, their risks and opportunities in relation to Earth impacts, science, and space exploration, and a number of possible next steps for NASA’s asteroid initiatives. NASA funded this event because it wanted to get feedback about a specific set of options it was choosing between, and from what I understand, it was very pleased with the outcome and announced its decision a few months later.

Also, the participants were very diverse. At the table where I facilitated discussion (essentially, just encouraging people to talk when they were quiet but looked like they had something to say, and taking notes), there were people from a variety of ethnicities, occupations, incomes, and educational backgrounds, and the event was generally considered successful, entertaining, and useful. Of particular note for the issues above is the way that expertise was managed. The planetarium show and subsequent materials were the result of negotiation between the various parties involved, and they were well caveated regarding uncertainties and values issues at stake. The activities of the day were also both entertaining (e.g. scenario planning involving cards dealt that had to do with different asteroid sizes, time until impact, and expected impact locations) and relevant (we discussed the impacts of different technologies and asteroid governance regimes).
Again, my point is not to suggest that any of the prior approaches is entirely unproblematic. Public engagement is difficult and multi-faceted. But I hope these are useful reference cases for considering the White House AI workshops.

Interlude regarding the final AI report and bias
 
I’ll briefly discuss one issue I mentioned in my original tweets, which is the political complexity of report production in this sort of context. As is well established in the scholarly literature on this topic, expert committees providing authoritative reports are not immune from bias and politics. And this isn’t necessarily a bad thing – some of those biases may be good, and politics is not inherently bad. But it’s worth noting here one specific aspect of the AI report which will be produced later this year – namely, it will not obviously bear a specific connection to each and every topic of discussion at all four workshops. There will be a need to exclude some comments from the report, or even entire topics, given the diversity of the likely discussions. This is inevitable but could be handled more or less transparently, so I’ll suggest some possible mechanisms for encouraging transparency and fair characterization of uncertainty and disagreement at the end of this post. But for now, I’ll just emphasize that this is an area in which, for better or worse (probably both), the public doesn’t have a direct say in the final policy process, and there is a very significant responsibility on those involved to solicit diverse opinions before and during the process of writing this report.
 
Now, back to the public engagement issues and whether it’s early to say that the White House approach is limited…
 
Why it looks like the White House approach may be limited in important respects
 
I don’t have any insider knowledge of the planning of these events, and there is certainly a possibility that many of the above issues will be well handled. But based on the publicly available information, it seems that the events will be limited in two ways that bear on the issues of representation and expertise discussed earlier.
 
First, regarding diversity: there are many types of diversity, and the events discussed here seem to be doing well in certain respects, e.g. representation of women and people of color among speakers. This is great and to be applauded. Still, as discussed earlier, the value of the feedback the White House gets at these events (and the extent to which they can credibly claim to have “heard” from the public on this issue) may be impaired by a few factors. There is the lack (at least as far as I can tell from the websites) of any compensation for participation, which may limit economically diverse representation. There is an apparent heavy emphasis on corporate and academic participants, with little in the way of, e.g. unions or non-profits in general, raising the issue of ideological and institutional diversity. Lastly, there is also limited geographic diversity: all four events will take place in major metropolitan areas, none in rural locations and all on or near the coasts. I understand there are limitations of time and resources for these events (which seem perhaps a bit rushed, with one event taking place in less than a month, presumably to be completed before the end of the administration). Still, these are factors to consider in future follow-on events, if there are to be any, and some of the above considerations are potentially actionable for these events if organizers act soon.
 
Second, timing and format: Two issues related to the timing of the events are noteworthy. First, they are all one day long (with the exception of at least one technical workshop taking place the day before), which offers only limited opportunities for substantial learning and input. There is also a timing dimension to consider within the day of each workshop: namely, the distribution of different activities. With the number of listed speakers, it seems like a significant amount of time will be allocated to lectures, and with finite time, that leaves less opportunity for rich engagement with the participants (many of whom will be experts). In short, I fear much of the events will consist of participants listening to lectures, learning a thing or two, and standing in line to ask questions/make comments at the end. This has value, for sure, but it may fall short of the potential for rich public engagement on AI which would be possible in longer or more creatively structured events.
 
Again, I’ll emphasize that I have limited knowledge here so these are meant as provocations, not final conclusions. I don’t know what the precise plans are for each of the events, but given the history of events like these in a wide variety of scientific and technological domains, it does not seem unreasonable to me to be concerned about the events’ diversity, richness of dialogue, and productivity.
 
Concrete suggestions
 
Some of these may be already underway or have been considered and discarded, but nonetheless I’d like to offer several suggestions for the organizers to consider. On any of these points, I’d be happy to elaborate, discuss further, or facilitate connections to people who are more knowledgeable about these topics than me.
 
1. Reach out to groups with different opinions on priorities and assumptions

 
There is a lot of diversity of opinion among AI experts and AI/ethics/society experts and diversity of interests among those who could be affected by AI. While it is not possible to represent all possible views in a finite amount of time, or to report all such disagreements in a report, diversity might be fostered by reaching out to invite participation from organizations such as unions (regarding job impacts), consumer advocacy organizations, privacy advocacy organizations, and organizations concerned specifically with long-term AI safety (e.g. the Machine Intelligence Research Institute). On the latter, I have already heard some concern voiced from people in that world who think that the safety and control workshop will be positioned to exclude such concerns, which, contra a somewhat popular belief, are shared by at least some AI experts (e.g. Stuart Russell would be a good participant, if he’s not already involved, and has a very different perspective on these issues than another currently confirmed participant, Oren Etzioni).
 
2. Incentivize non-expert participation
 
I raised this issue above already – namely, the exclusionary effect of uncompensated, time-consuming events. There are complexities to this, and I’d be happy to refer the organizers to people who have gone through such a process before. Which leads me to another point…
 
3. Reach out to the experts on non-experts (I realize the irony!)
 
There are many experts on the issue of lay participation in science and technology discourse. If they haven’t already, the organizers of these workshops could get more detailed feedback and suggestions from organizations like the ECAST Network. This is a difficult thing to get right and it’d be a shame if the relevant lessons learned from similar events were not brought to bear on making these AI events as successful as possible.
 
4. Precommit to conveying disagreements
 
This relates to the issue of the final report above. While the events will be livestreamed, not everyone will take the time to watch four days of video. So this makes it incumbent upon the organizers of this process to fairly represent the breadth of discussion at these events. One way to ensure this would be to commit beforehand to doing so in some fashion (e.g. through inclusion of dissenting perspectives to the report’s conclusion in appendices, or in a supplementary website), thus bringing attentive portions of the public’s scrutiny to bear on whether these commitments were followed up on.
 
5. Gather and precommit to reporting data on inclusion
 
Another thing that the organizers could precommit to is gathering and reporting data on the diversity and representativeness of workshop participants. This would create some accountability for fairly representing the opinions raised in the workshop as being as (un)representative as they actually are, and it would provide an incentive for organizers to expedite efforts to ensure such diversity if they know that this information will subsequently be reported.
 
6. Consider follow-on events with other institutional partners
 
Finally, given the intrinsic limitations of any single event or series of events in informing, eliciting, and representing public opinion, these workshops should be seen and characterized by the White House as the beginning of a dialogue, not its apotheosis. Felten has done this well already in his opening blog post on the topic, saying that he seeks to spark dialogue. One way to ensure that dialogue continues would be to begin working with a group like ECAST to envision subsequent events beyond this summer which could involve richer, lengthier forms of participation by a wider group of people. I don’t know the constraints Felten et al. are under and this may not be possible at this time, but it seems worth at least considering. Perhaps this could be an initiative to be taken up by another institution besides OSTP, such as NSF or the Domestic Policy Council.
 
I hope this context is helpful for some people and provided at least a few provocations! I look forward to anyone’s thoughts on these matters.


How Far to AI-topia?

4/10/2016


 

The title of this post is deliberately provocative. My goal here is to give some rigor and structure to discussions of the utopian opportunities afforded by AI, and to stimulate discussion on whether and how to realize those opportunities. It’s not my view that AI is sufficient to realize utopia, though it may be necessary.
 
Also, I deliberately used the phrase “far to” instead of “long until” in the title of the post. How soon, if ever, certain things discussed here happen depends not just on clock time but also the amount of effort applied to achieving them. In addition to the temporal distance to these scenarios, we should also think about AI opportunities in terms of technical distance, political distance, etc. Many factors are involved in bringing good things about, as well as analyzing the likelihood of them coming about.
 
One of the reasons I work on AI-related issues is that I think it has enormous positive potential, much of which hasn’t yet been realized (even with the current state of the technology, let alone how it may be in the future). At the same time, I think a lot of discussions regarding that potential are simplistic, vague, or otherwise problematic. So I’ll try to do a better and more critical job of analyzing the positive potential of AI here in order to encourage more of the same, but this post certainly shouldn’t be interpreted as a prediction that all these good things will necessarily happen. If such a future is possible, it’ll happen because people work actively to bring it about, not because it’s inevitable. And I’m also not downplaying the negative potentials of AI, which are also very important to think rigorously about, but which I’m not focusing on in this blog post.
 
Without further ado, let’s get into what “utopia” and “utopianism” mean, and then explore several different connections between AI and utopia(nism).
 
Defining utopia and utopianism
 
There’s a lot of scholarly work on utopia and utopianism – indeed, there’s even a Journal of Utopian Studies. It’d be impossible to do a comprehensive survey of utopian thought and its various critiques in this blog post (though you can find pointers to some of the key texts in a paper I wrote for a class on utopianism). Here I’ll just give a very brief summary of what some people who study this sort of thing say about it.
 
Defining utopia and utopianism is tricky for some obvious as well as non-obvious reasons. While noting the many debates on this, I’ll simply define utopia as a space (past, present, or future) in which justice is realized, and utopianism as a system of belief that considers the pursuit of utopia as a valuable political project.  Note that there is no presumption here that the (lack of) realization of justice is binary – one could imagine calling a certain state of affairs a utopia, while still finding room for improvement, though this doesn’t necessarily fit how the term is commonly used in public discourse (where it is often used in a derogatory fashion, for reasons we may better understand shortly). Indeed, the question of perfection and perfectibility looms large in debates on the meaning of utopia, but in the context of AI, what I’ll suggest below is simply that AI may at least help us attain a world that is much more just in many respects.
 
So, why are utopia and utopianism important and controversial? Besides the perfection stuff mentioned above, and the common historical association of the term with particular (failed) attempts at utopias, what some scholars of utopianism say is that utopianism is necessarily controversial, in some sense, at least when it’s discussed or pursued in unjust societies. Why? Because it serves implicitly or explicitly as a critique of the status quo. One of the foremost theorists in the history of utopian thought is Karl Mannheim, who theorized “utopia” as having a critical relation to the prevailing “ideology” of the times. Clashes between utopia and ideology are clashes about the nature of the world and what’s possible in it. He wrote:
 
“What in a given case appears as utopian, and what as ideological, is dependent, essentially, on the stage and degree of reality to which one applies this standard. It is clear that those social strata which represent the prevailing social and intellectual order will experience as reality that structure of relationships of which they are the bearers, while the groups driven into opposition to the present order will be oriented towards the first stirrings of the social order for which they are striving and which is being realized through them. The representatives of a given order will label as utopian all conceptions of existence which from their point of view can never be realized.” (Mannheim, 1932, p. 164, original emphasis).
 
Mannheim’s seminal early work is hardly the last word on utopianism, or ideology for that matter. But his idea of utopia as having a critical orientation towards the status quo is quite relevant for our discussion here, since some utopian visions for AI call into question many aspects of our society that are often taken for granted—to give just one example, the necessity of paid labor, which I’ll have a lot more to say about below.
 
AI and utopia
 
So, now that we have at least a vague sense of what utopian thought is and why it’s important, let’s get into some of the specific ways in which AI relates to utopia. I’ll discuss four specific relationships in this post, but this is not necessarily exhaustive, and some of these could be broken down into further sub-categories. The four I’ll explore are: AI as problem solver, AI as work improver, AI as work remover, and AI as equalizer. In most of these cases, I’ll discuss the various forms of (technical/social/policy/etc.) distances involved before they can be fully realized, examples of how they might work, issues in their implementation or ultimate desirability, and whether they actually count as being properly utopian. At the end of the post, I’ll try to tie some of these threads together and discuss the political function of AI-topia and the immediate challenges we face in achieving these goals, if we want to do so.
 
AI as problem solver
 
This one is very common. Over 8,600 AI researchers and other interested parties (myself included) have signed an open letter which, among other things, contains this brief summary of the case for AI as problem solver:
 
“The potential benefits are huge, since everything that civilization has to offer is a product of human intelligence; we cannot predict what we might achieve when this intelligence is magnified by the tools AI may provide, but the eradication of disease and poverty are not unfathomable.” (FLI open letter, 2015)
 
Eradication of disease and poverty would be very big deals, and arguably could serve as building blocks of or stepping stones toward utopia under some definitions. Addressing climate change would also be a big deal, as would addressing all sorts of other problems in society that have been claimed by various people to be (at least partially) solvable with AI. But how plausible is this? And can we parse out the “at least partially” part a bit more? I’ll try to do that a little bit in this section, while noting that it’s a very big question in need of more than a blog post to answer.
 
First, let’s note that this is a distinct sort of claimed benefit from AI from the others I’ll discuss below (e.g. AI as work remover). Eliminating the need for paid work in order to attain a decent (or high) standard of living would both require and enable a very sweeping change to society, with myriad consequences. In contrast, or so the AI as problem solver story goes, using AI to cure a disease is a more straightforward technical fix—we identify some sort of problem like a disease, apply AI to it, and, if it is a problem amenable to the application of intelligence, presto, we have a solution (say, a cure for a disease). The range of issues claimed to be solvable with AI is very broad, though—they are not all narrow problems amenable to technical fixes. Borrowing from Sarewitz and Nelson’s 2008 Nature article, “Three rules for technical fixes,” we might apply three criteria to various problems in order to assess the plausibility of AI as a solution to them:

“I. The technology must largely embody the cause–effect relationship connecting problem to solution. …
II. The effects of the technological fix must be assessable using relatively unambiguous or uncontroversial criteria. …
III. Research and development is most likely to contribute decisively to solving a social problem when it focuses on improving a standardized technical core that already exists.” (Sarewitz and Nelson, 2008)

This is just one set of criteria, and you can look at the original article to assess whether their reasoning makes sense in the context of AI, but it is at the very least a useful reminder that not all problems are equally solvable through technical means, and that there might be some structure to that relative solvability. For example, one might argue that developing a cure for a disease would be easier for AI to accomplish than ensuring its effective distribution. Or, one might argue that “providing everyone with access to X standard of living” (whatever level that may be) is easier to accomplish than a comprehensive solution to the complex, multi-faceted, multi-level phenomena of racial, economic, gender, and other forms of inequality, in which poverty is intricately intertwined. The latter (at least absent a superintelligent AI imposing fundamental changes in society against many of its members’ wills) seems less amenable to an AI-based “solution,” though it’s plausible that AI could help on at least some of these fronts in at least some ways.

We should also appreciate here that different levels of technical capability in AI may enable different extents or likelihoods of “solution” to different problems. With a sufficiently advanced AI, it may be straightforward to tell it to find a cure for a certain disease. Today, however, AI is not able to easily solve any such problems without a significant amount of human input and fine-tuning. A comprehensive analysis of different possible states of the art in AI and how they might enable solutions to different sorts of problems is beyond the scope of this post, but it seems like it could be an interesting exercise and relevant to cultivating more nuanced-while-still-ambitious utopian visions for AI and society.

Finally, when looking at AI through a utopian lens, we should consider the political and policy dimensions involved in determining what counts as a problem in need of an AI-based solution. There is a legitimate concern here about excessive technocratic management of society which needs to be taken into account, as well as the more general question of whether there is currently a good mapping between the application of AI technologies and the most important problems. While various organizations (claim to) work on applying AI to important problems, it’s far from obvious, to me at least, that, for example, the current distribution of U.S. federal funding for AI and robotics is being well-targeted to addressing societal problems as opposed to merely military problems and a handful of miscellaneous (mostly health-related) other ones. So, it may be that not only is AI as problem solver in need of more critical technical/social assessment, but its actual realization may require more political and policy changes than have heretofore been discussed in the literature. For example, what would it look like if governments took a comprehensive approach to funding AI research applied to grand challenges, or implemented an AI equivalent of agricultural extension services (provision of technical expertise to farmers), actively seeking out NGOs, local governments, and others who might benefit from AI but need technical experts to help? What problems are not currently likely to be solved by profit-motivated companies, which employ an increasing portion of global AI talent? These are just some of the questions a robust AI-topian project would need to answer.

AI as work improver
 
Another common claim made in support of (possibly) utopian visions for AI is that AI and robotics will enable the elimination of “dull, dirty, and dangerous” aspects of jobs, allowing work to be more stimulating, enjoyable, fulfilling, etc. This is potentially plausible, but not at all guaranteed, and probably the least interesting of the four AI/utopia connections I’ll discuss in this post. I discuss this claim in more detail in the context of Brynjolfsson and McAfee’s book The Second Machine Age, here, but in this section I’ll briefly recapitulate the key points of that article and explain why I find this to be not that interesting or utopian.
 
There’s definitely something to the idea that work could be better for a lot of people—indeed, it’s currently better than it used to be for a lot of people (see, e.g. the chapter on work conditions in Gordon’s book The Rise and Fall of American Growth). But AI does not by any means ensure this outcome. The reasons I give for this in the article mentioned above are:
 
“First, the tasks that are easy to automate aren't necessarily the boring and repetitive ones, and the tasks that are hard to automate aren't necessarily the fun and interesting ones. …
Second, even when enjoyable and harmonious jobs are technologically and socially feasible, companies may not face strong incentives to design them like that. …
Third, one person's meaningless, repetitive labor is another person's satisfying, hard day's work and a big part of her identity.” (Brundage, 2014).
 
These considerations suggest that increasing the quality of work is probably not exclusively, or even primarily, a matter of technology, as social science research has already shown for a long time. Political and policy measures are probably more important there, some of which I mention in that article – for example, I wrote: “Nobel Prize–winning economist Edmund Phelps has proposed subsidizing companies to hire low-wage workers. When businesses have to compete against one another to attract employees, instead of the other way around, they may be more inclined to foster satisfying, and not merely productive, work environments.”
 
Lastly, this doesn’t strike me as particularly utopian, or at least sufficiently utopian for my tastes, though it may be a very important impact of AI and one that should be actively pursued. Why? Because it leaves essentially unchallenged the presumption that people should have to spend a huge fraction of their lives working for others at jobs they may not want to do in order to have a decent standard of living, to the exclusion of other activities that they may find more fulfilling. Under some conceptions of justice, that requirement is itself unjust. Better work may be important, but less work may be a more utopian ambition, so we turn to that next.
 
AI as work remover
 
For a very long time, people have imagined an end of work achieved through political or technical means, or some combination of the two. And indeed, this ambition has been realized in many ways, and in many places. Child labor laws have led to big improvements in child welfare and educational advancement, and in many European countries there exists a much heavier emphasis on work/leisure balance, enforced by law. Historian Benjamin Hunnicutt goes so far as to say that free time is the forgotten American Dream, one that was actively sought for generations but was steadily eroded by consumerism (fueling higher material demands and thus higher desire to work) and a societal endorsement of the work ethic.
 
Moreover, an end of work has figured prominently in many specific conceptions of utopia, both in philosophical/political treatises and in (science) fiction. And yet, the end of work also features prominently in many modern fears associated with AI – namely, socially destructive technological unemployment. So, where does this rich history leave us regarding our discussion of AI and utopianism, and are we actually close to the dream (or nightmare) of an end of work? And “close” in what sense?
 
The history of debates on technological unemployment should, first of all, be a reason for skepticism about current claims about present or future technological unemployment. Not only is it the case that people in the past worried about permanent broad-based technological unemployment, and were ultimately proven wrong, but moreover, they even made many of the same arguments that are being made today. In the 30’s, people talked about the replacement of phone switchboard operators with automated switchboards in pretty much the same way people today talk about the possibility of AI substituting for white collar jobs—namely, that back in the day, machines just substituted for muscles, but now they’re substituting for brains. So, we should look carefully at the specific technical developments in question and their relationship to human cognitive abilities, and where AI could (or could not) plausibly soon substitute for humans. We should also think about whether and in what sense modern AI is different from earlier waves of automation –is machine learning the critical distinction, or the range of cognitive and manual tasks that are now being automated? I think a compelling case can be made that AI is likely to be able to automate a wide range of tasks, but my point here is simply that we need some historical perspective here, and to remember that many new jobs will be created, too.
 
It’s also important to note that, on the other hand, earlier prophets of technological unemployment were right in a sense—many people were, in fact, fired after being replaced by machines, entire careers disappeared, and substantial human suffering resulted. Overall, economies continued to grow and many (though not all) people eventually got new jobs, but it was not usually a smooth transition.
 
Another common talking point made in the 30’s, as well as the 60’s, regarding technological unemployment (or lack thereof) is also made today—that machines will augment, rather than substitute for, human labor. I’ve thought about this distinction a lot and am still not convinced it’s always meaningful or easy to discern in the case of AI—augmentation of person A may result in the substitution of person B. But it points to an important issue, namely that in many cases AI is not able to replace the full range of tasks performed by people, so visions of the future of work should take this into account. Indeed, a recent article by researchers at McKinsey argued that AI is more likely to substitute for individual tasks than entire occupations, and will change the nature of many jobs more than eliminating them. But even if this is true, is this a good thing? Do we want to ensure that a large fraction of the population is employed/employable in the future?
 
I’m not sure, especially since there is a lot of evidence that a lot of people get various benefits (besides just money) from paid work. But my current thinking is that a future with less need for work is a critical component of realizing justice. Here I’ve been influenced a lot by thinkers such as Philippe van Parijs, who argue persuasively, in my view, that a robust conception of justice requires real freedom for all, that is, not merely a lack of external forces preventing one from achieving one’s goals, but the material and social capabilities to actually achieve them. This, in turn, requires the ability to choose one’s distribution of work and leisure, and to be able to say no to jobs that are demeaning, or insufficiently compensated, or otherwise undesirable. Achieving this, van Parijs argues, requires a maximum sustainable basic income, and indeed he thinks such a basic income is the way that capitalism as an economic system can be morally redeemed.
 
These are big issues, and I don’t expect to resolve them here, but I’ll briefly say a few words about the relationship between basic income, freedom from (required) work, and AI progress. Today, it is possible for people on average to work at least somewhat less than they do. We can see this from the variation across countries in hours worked, with e.g. citizens of many European countries working much fewer hours than Americans. This, in turn, hinges on policy. But it’s also in part a technical question of what level of living standards society can afford to provide to all people. In the U.S., for example, the levels of basic income that have been analyzed are on the order of a few thousand to ten thousand dollars. Giving much more than that would require some combination of massively more taxes, massive changes to current welfare state policies, or a massive increase in economic productivity.
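To give a rough sense of the magnitudes involved, here is a back-of-envelope sketch using deliberately round numbers (circa 2016), and ignoring the offsetting savings from existing transfer programs:

```python
# Back-of-envelope arithmetic with deliberately round numbers (circa 2016).
US_POPULATION = 320e6           # roughly 320 million people
US_GDP = 18.5e12                # roughly $18.5 trillion
BASIC_INCOME_PER_PERSON = 10e3  # the upper end of levels commonly analyzed

gross_cost = US_POPULATION * BASIC_INCOME_PER_PERSON  # ~$3.2 trillion per year
share_of_gdp = gross_cost / US_GDP                     # ~17% of GDP

print(f"Gross cost: ${gross_cost/1e12:.1f} trillion (~{share_of_gdp:.0%} of GDP)")
```

Even before netting out existing programs, a $10,000-per-person basic income is on the order of a sixth of GDP, which is why the “massive” qualifiers above seem warranted.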
 
So, could we increase economic productivity a lot with AI in order to support a steadily rising basic income over time, perhaps even eventually being so productive as to support a global basic income? I think so, but how exactly to do this is still an unresolved question being explored by entrepreneurs and large tech companies right now. In the limit, as Robin Hanson has argued, AI could support extremely massive, nearly unimaginable increases in the rate of economic growth. But in the near future, the key questions that arise are: what range of goods and services can be automated sufficiently effectively to allow them to be distributed affordably today, and in the foreseeable future? What level of broadly-supplied living standards, if any, would count as achieving utopia? This obviously raises a host of technical, political, economic, ethical, and other questions. In addition, these issues intersect with other policy questions like the minimum wage. While small increases in the minimum wage have been shown to not have a huge effect on employment, very large ones plausibly could increase the rate of automation. Whether this is a good or bad thing depends, again, on your values, and arguments have been made by some, e.g. Srnicek and Williams in their book Inventing the Future, that this is desirable, and that a combination of basic income, increased automation, increased minimum wage, and diminishment of societal valorization of work should be pursued in tandem.
 
One final point worth considering when evaluating AI as work reducer through a utopian lens is that it’s very important to be clear about how we define “work.” The way it’s commonly used in discussions of technological unemployment, work just means paid labor. But if we consider feminist arguments to the effect that care work (e.g. raising children, caring for elderly people, etc.) is often either unjustly unpaid or insufficiently paid, yet equally vital to the maintenance of our society as other forms of work, then we might arrive at different perspectives. So we should be wary of a rush to automate all “(paid) work” in the pursuit of utopia while ignoring the systematic inequalities that may remain in place, often along race and gender lines, in these other domains. And we should think about what it would mean, and if it’d be valuable, to automate many aspects of those forms of work.
 
AI as equalizer
 
The final utopian vision for AI that I want to discuss here is probably the least discussed among the ones considered here, but it may also be extremely important. By “AI as equalizer,” I mean the idea that AI may help enable a greater level of equality in society above and beyond its impact on paid work, by providing people with access to either cheap or free cognition that levels the playing field between people in some way. To some extent, this has already occurred. A large fraction of the world has access to search engines, which in turn give them access to a large amount of information that is equally available to others, and which would have been prohibitively expensive, difficult, or impossible to acquire before. But I’m not just talking about search engines—I’m thinking about wide-ranging capabilities of personal assistant AIs, which we may not see in full form for some time but will happen progressively in the coming years and decades.
 
Who cares about personal assistant AIs? Perhaps we should care more about this than we currently do. In the world today, the ability to have a generally intelligent personal assistant is associated with privilege—it requires either a lot of money or a high-status job, and, of course, a human to do the work. But if personal assistant AIs were to develop to the point that they could offer a very large fraction, all, or even more of the capabilities a human assistant provides, they could have a great leveling effect on society. Consider one dimension of that impact—the role assistant AIs could play in mitigating the effects of differences in human intelligence levels. People vary naturally (and unnaturally, due to environmental factors) in their levels of intelligence, and there is abundant evidence that these differences are strongly linked to large differences in various life outcomes, ranging from health to job outcomes, and more generally, the ability to cope with the (rising) complexity of everyday life. If utopia is, at least in part, about enabling people to develop and pursue their own goals, then enabling them to do that effectively through access to high-quality cognition on the cheap (or for free) could be critically important.
 
We can flesh out this argument a bit using the terminology introduced by Sendhil Mullainathan and Eldar Shafir in their book Scarcity. In the book, they present various forms of evidence to show how scarcity broadly construed (scarcity of time, of money, of bandwidth, etc.) has a common set of effects on people and functions, among other things, as a tax on the poor. When people experience scarcity, they are more likely to make bad decisions because their cognitive bandwidth is tied up in dealing with immediate crises, and we only have so much bandwidth. The opposite of scarcity is slack. If AI could democratize cognitive slack, i.e. enable people to focus their mental resources on what they want to focus them on while reliably outsourcing other tasks to an AI, for free or very cheaply, we may see more equal outcomes in various aspects of life. This argument, like many of the ones above, is necessarily speculative at this point, but it seems to me to deserve somewhat more consideration in the context of the utopian potentials of AI.
 
Conclusion: how close are we to AI-topia?
 
As I mentioned at the beginning of this post, there are multiple forms of distance between where we are today and the full realization of the positive potentials of AI. Some of these are technical, others are political, and still others are laden with so much uncertainty that we don’t yet know how to think about them. But if I had to suggest some takeaway messages from this blog post, they would be these. First, we should be more deliberate in how we use today’s AI to address major societal problems, and it’s not obvious we’re doing this optimally right now. Second, there seems to be at least some reason to think that AI could be a critical component of a transition to a (much more) just world, and as such it probably deserves all the attention it’s currently getting, and then some. Third, we should be conscious of the role of utopian visions in our thinking about the future of AI. In the case of work, for example, talking about a post-work future can, in part, function as a critique of the political as well as technical conditions that make work necessary today. In the case of AI as a problem solver, highlighting the potential of AI to help address our grand challenges is not only an important endeavor in its own right, but also a reminder that technology does not progress and apply itself inevitably and independently of human input—rather, we should deliberately seek to use this powerful tool at our disposal to make our world a better place.
 
Thanks for reading and I look forward to your thoughts!



AlphaGo and AI Progress

2/27/2016

8 Comments

 
Introduction
 
AlphaGo’s victory over Fan Hui has gotten a lot of press attention, and relevant experts in AI and Go have generally agreed that it is a significant milestone. For example, Jon Diamond, President of the British Go Association, called the victory a “large, sudden jump in strength,” and AI researchers Francesca Rossi, Stuart Russell, and Bart Selman called it “important,” “impressive,” and “significant,” respectively.
 
How large/sudden and important/impressive/significant was AlphaGo’s victory? Here, I’ll try to at least partially answer this by putting it in a larger context of recent computer Go history, AI progress in general, and technological forecasting. In short, it’s an impressive achievement, but considering it in this larger context should cause us to at least slightly decrease our assessment of its size/suddenness/significance in isolation. Still, it is an enlightening episode in AI history in other ways, and merits some additional commentary/analysis beyond the brief snippets of praise in the news so far. So in addition to comparing the reality to the hype, I’ll try to distill some general lessons from AlphaGo’s first victory about the pace/nature of AI progress and how we should think about its upcoming match against Lee Sedol.
 
What happened
 
AlphaGo, a system designed by a team of 15-20 people[1] at Google DeepMind, beat Fan Hui, three-time European Go champion, in 5 out of 5 formal games of Go. Fan Hui did, however, win 2 out of 5 informal games played with less time per move (for more interesting details often unreported in press accounts, see also the relevant Nature paper). The program is stronger at Go than all previous Go engines (more on the question of how much stronger below).
 
How it was done
 
AlphaGo was developed by a relatively large team (compared to those associated with other computer Go programs), using significant computing resources (more on this below). The program combines neural networks and Monte Carlo tree search (MCTS) in a novel way, and was trained in multiple phases involving both supervised learning and self-play. Notably from the perspective of evaluating its relation to AI progress, it was not trained end-to-end (though according to Demis Hassabis at AAAI 2016, they may try to do this in the future). It also used some hand-crafted features for the MCTS component (another point often missed by observers). The claimed contributions of the relevant paper are the ideas of value and policy networks, and the way they are integrated with MCTS. Data in the paper indicate that the system was stronger with these elements than without them.
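For readers who want a concrete sense of how a policy prior and a value estimate can plug into tree search, here is a minimal sketch in the spirit of the selection rule described in the paper. The class and constant names are mine, the constants are arbitrary, and the real system additionally mixes value-network estimates with fast rollouts at the leaves; treat this as an illustration, not the actual implementation.

```python
import math

class Node:
    """One (state, move) edge in the search tree."""
    def __init__(self, prior):
        self.prior = prior        # P(s, a): move probability from the policy network
        self.visits = 0           # N(s, a): how often this edge was traversed
        self.value_sum = 0.0      # accumulated leaf evaluations (value net and/or rollouts)
        self.children = {}        # move -> Node

    def q(self):
        """Mean action value Q(s, a)."""
        return self.value_sum / self.visits if self.visits else 0.0

def select_move(node, c_puct=1.0):
    """PUCT-style selection: prefer moves with high value, or with high prior
    probability and few visits (the exploration bonus shrinks as visits grow)."""
    total_visits = sum(child.visits for child in node.children.values())
    def score(child):
        # The +1 keeps the bonus nonzero before any visits -- a small departure
        # from the paper's exact formula, for simplicity here.
        bonus = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visits)
        return child.q() + bonus
    return max(node.children.items(), key=lambda item: score(item[1]))
```

After many such simulations, the move with the most visits at the root is the one actually played; the claimed novelty lies less in this selection step than in how the policy and value networks that feed it are trained and combined.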
 
Overall AI performance vs. algorithm-specific progress
 
Among other insights that can be gleaned from a careful study of the AlphaGo Nature paper, one is particularly relevant for assessing the broader significance of this result: the critical role that hardware played in improving AlphaGo’s performance. Consider the figures below, which I’ll try to contextualize.

[Figure: estimated Elo ratings and ranks of several computer Go programs and of Fan Hui]
This figure shows the estimated Elo rating and rank of a few different computer Go programs and Fan Hui. Elo ratings indicate the expected probability of defeating higher/lower ranking opponents – so, e.g. a player with 200 points more than her opponent is expected to win about three quarters of the time. Already, we can note some interesting things. Ignoring the pink bars (which indicate performance with the advantage of extra stones), we can see that AlphaGo, distributed or otherwise, is significantly stronger than Crazy Stone and Zen, previously among the best Go programs. AlphaGo is in the low professional range (“p” on the right hand side) and the others are in the high amateur range (“d” for “dan” on the right hand side). Also, we can see that while distributed AlphaGo is just barely above the range of estimated skill levels for Fan Hui, non-distributed AlphaGo is not (distributed AlphaGo is the one that actually played against Fan Hui). It looks like Fan Hui may have won at least some, if not all, games against non-distributed AlphaGo.
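For those unfamiliar with the conversion, the standard logistic Elo formula behind claims like “about three quarters of the time” is easy to check directly (a quick sketch of the textbook formula; the paper’s ratings use its own calibration):

```python
def elo_expected_score(delta):
    """Expected score for a player rated `delta` Elo points above the opponent."""
    return 1.0 / (1.0 + 10 ** (-delta / 400))

print(round(elo_expected_score(200), 2))  # ~0.76: roughly three wins in four
print(round(elo_expected_score(800), 2))  # ~0.99: the size of the gap discussed further below
```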
 
I’ll say more about the differences between these two, and other AlphaGo variants, below, but for now, note one thing that’s missing from this figure: very recent Go programs. In the weeks and months leading up to AlphaGo’s victory, there was significant activity and enthusiasm (though by much smaller teams, e.g. 1-2 people at Facebook) in the Go community around two Go engines – darkforest (and its variants, the best being darkfmcts3), made by researchers at Facebook, and Zen19X, a new and experimental version of the highly ranked Zen program. Note that in January of this year, Zen19X was briefly ranked in the 7d range on the KGS Server (used for human and computer Go), reportedly due to the incorporation of neural networks. Darkfmcts3 achieved a solid 5d ranking, a 2-3 dan improvement over where it was just a few months earlier, and the researchers behind it indicated in papers that there were various readily available ways to improve it. Indeed, according to the most recent version of their paper on these programs, Tian and Zhu report that darkfmcts3 would have won against a Zen variant in the most recent KGS Computer Go tournament if not for a glitch (contra Hassabis, who said darkfmcts3 lost to Zen - he may not have read the relevant footnote!). Computer Go, to summarize, was already seeing a lot of progress via the incorporation of deep learning prior to AlphaGo. This would slightly reduce the delta in the figure above (which was probably produced a few months ago), but not eliminate it entirely.
 
So, back to the hardware issue. Silver and Huang et al. at DeepMind evaluated many variants of AlphaGo, summarized as AlphaGo and AlphaGo Distributed in the figure above. But this does not give a complete picture of the variation driven by hardware differences, which the next figure (also from the paper) sheds light on.

[Figure: estimated Elo ratings of AlphaGo variants across different hardware configurations, single machine vs. distributed]
This figure shows the estimated Elo rating of several variants of AlphaGo. The 11 light blue bars are from “single machine” variants, and the dark blue ones involve distributing AlphaGo across multiple machines. But what is this machine exactly? The “threads” indicated here are search threads, and by looking in a later figure in the paper, we can find that the least computationally intensive AlphaGo version (the shortest bar shown here) used 48 CPUs and 1 GPU. For reference, Crazy Stone does not use any GPUs, and uses slightly fewer CPUs. After a brief search into the clusters currently used for different Go programs, I was unable to find any using more than 36 or so CPUs. Facebook’s darkfmcts3 is the only version I know of that definitely uses GPUs, and it uses 64 GPUs in the biggest version and 8 CPUs (so, more GPUs than single machine AlphaGo, but fewer CPUs).  The single machine AlphaGo bar used in the previous figure, which indicated a large delta over prior programs, was based on the 40 search thread/48 CPU/8 GPU variant. If it were to show the 48 CPU/1 GPU version, it would be only slightly higher than Crazy Stone and Zen - and possibly not even higher than the very latest Zen19X version, which may have improved since January.
 
Perhaps the best comparison to evaluate AlphaGo against would be darkfmcts3 on equivalent hardware, but they use different configurations of CPUs/GPUs and darkfmcts3 is currently offline following AlphaGo’s victory. It would also be interesting to try scaling up Crazy Stone or Zen19X to a cluster comparable to AlphaGo Distributed, to further parse the relative gains in hardware-adjusted performance discussed earlier. In short, it’s not clear how much of a gain in performance there was over earlier Go programs for equivalent hardware – probably some, but certainly not as great as between earlier Go programs on small clusters and AlphaGo on the massive cluster ultimately used, which we turn to next.
 
AlphaGo Distributed, in its largest variant, used 280 GPUs and 1920 CPUs. This is significantly more computational power than any prior reported Go program used, and a lot of hardware in absolute terms. The size of this cluster is noteworthy for two reasons. First, it calls into question the extent of the hardware-adjusted algorithmic progress that AlphaGo represents, and relatedly, the importance of the value and policy networks. If, as I’ve suggested in a recent AAAI workshop paper, “Modeling Progress in AI,” we should keep track of multiple states of the art in AI as opposed to a singular state of the art, then comparing AlphaGo Distributed to, e.g. CrazyStone, is to compare two distinct states of the art – performance given small computational power (and a small team, for that matter) and performance given massive computational power and the efforts of over a dozen of the best AI researchers in the world.
 
Second, it is notable that hardware alone enabled AlphaGo to span a very large range of skill levels (in human terms) – from around an Elo score of 2200 at the lowest reported level, up to well over 3000, which is the difference between amateur and pro level skills. This may suggest (an issue I’ll return to below) that in the space of possible skill levels, humans occupy a fairly small band. It seems possible that if this project had been carried out, say, 10 or 20 years from now, the skill level gap traversed thanks to hardware could have been from amateur to superhuman (beyond pro level) in one leap, with the same algorithmic foundation. Conversely, 10 or 20 years ago, it would likely not have been possible to develop a superhuman Go agent using this same set of algorithms. Perhaps it was only around now that the AlphaGo project made sense to undertake, given progress in hardware (though other developments in recent years also made a difference, like neural network improvements and MCTS).
 
Additionally, as also discussed briefly in “Modeling Progress in AI,” we should take into account the relationship between AI performance and the data used for training when assessing the rate of progress. AlphaGo used a large game dataset from the KGS servers – I have not yet looked carefully at what data other comparable AIs have used to train on in the past, but it seems possible that this dataset, too, helped enable AlphaGo’s performance. Hassabis at AAAI indicated DeepMind’s intent to try to train AlphaGo entirely with self-play. This would be more impressive, but until that happens, we may not know how much of AlphaGo’s performance depended on the availability of this dataset, which DeepMind gathered on its own from the KGS servers.
 
Finally, in addition to adjusting for hardware and data, we should also adjust for effort in assessing how significant an AI milestone is. With Deep Blue, for example, significant domain expertise was used to develop the AI that beat Garry Kasparov, rather than a system learning from scratch and thus demonstrating domain-general intelligence. Hassabis at AAAI and elsewhere has argued that AlphaGo represents more general progress in AI than did Deep Blue, and that the techniques used were general purpose. However, the very development of the policy and value network ideas for this project, as well as the specific training regimen used (a sequence of supervised learning and self-play, rather than end-to-end learning), was itself informed by the domain-specific expertise of researchers like David Silver and Aja Huang, who have substantial computer Go and Go expertise. While AlphaGo ultimately exceeded their skill levels, the search for algorithms in this case was informed by this specific domain (and, as mentioned earlier, part of the algorithm encoded domain-specific knowledge – namely, the hand-crafted features in the MCTS component). Also, the team was large – 15-20 people, significantly more than for any prior Go engine that I’m aware of, and more comparable in terms of effort to large projects like Deep Blue or Watson than to anything else in computer Go history. So, if we should reasonably expect a large team of some of the smartest, most expert people in a given area working on a problem to yield progress on that problem, then the scale of this effort suggests we should slightly update downwards our impression of the significance of the AlphaGo milestone. This is in contrast to what we should have thought if, e.g., DeepMind had simply taken their existing DQN algorithm, applied it to Go, and achieved the same result. At the same time, innovations inspired by a specific domain may have broad relevance, and value/policy networks may be a case of this. It's still a bit early to say.
 
In conclusion, while it may turn out that value and policy networks represent significant progress towards more general and powerful AI systems, we cannot necessarily infer that just from AlphaGo having performed well, without first adjusting for hardware, data, and effort. Also, regardless of whether we see the algorithmic innovations as particularly significant, we should still interpret these results as signs of the scalability of deep reinforcement learning to larger hardware and more data, as well as the tractability of previously-seen-as-difficult problems in the face of substantial AI expert effort, which themselves are important facts about the world to be aware of.
 
Expert judgment and forecasting in AI and Go
 
In the wake of AlphaGo’s victory against Fan Hui, much was made of the purported suddenness of this victory relative to expected computer Go progress. In particular, people at DeepMind and elsewhere have made comments to the effect that experts didn’t think this would happen for another decade or more. One person who said such a thing is Remi Coulom, designer of CrazyStone, in a piece in Wired magazine. However, I’m aware of no rigorous effort to elicit expert opinion on the future of computer Go, and it was hardly unanimous that this milestone was that long off. I and others, well before AlphaGo’s victory was announced, said on Twitter and elsewhere that Coulom’s pessimism wasn’t justified. Alex Champandard noted that at a gathering of game AI experts a year or so ago, it was generally agreed that Go AI progress could be accelerated by a concerted effort by Google or others. At AAAI last year, I also asked Michael Bowling, who knows a thing or two about game AI milestones (having developed the AI that essentially solved limit heads-up Texas Hold Em), how long it would take before superhuman Go AI existed, and he gave it a maximum of five years. So, again, this victory being sudden was not unanimously agreed upon, and claims that it was long off are arguably based on cherry-picked and unscientific expert polls.
 
Still, it did in fact surprise some people, including AI experts, and people like Remi Coulom are hardly ignorant of Go AI. So, if this was a surprise to experts, should that itself be surprising? No. Expert opinion on the future of AI has long been known to be unreliable. I survey some relevant literatures on this issue in “Modeling Progress in AI,” but briefly, we already knew that model-based forecasts beat intuitive judgments, that quantitative technology forecasts generally beat qualitative ones, and various other things that should have led us to not take specific gut feelings (as opposed to formal models/extrapolations thereof) about the future of Go AI that seriously. And among the few actual empirical extrapolations that were made of this, they weren’t that far off.
 
Hiroshi Yamashita extrapolated the trend of computer Go progress as of 2011 into the future and predicted a crossover point to superhuman Go in 4 years, which was only one year off. In recent years, there was a slowdown in the trend (based on highest KGS rank achieved) that probably would have led Yamashita or others to adjust their calculations if they had redone them, say, a year ago; but in the weeks leading up to AlphaGo’s victory, again, there was another burst of rapid computer Go progress. I haven’t looked closely at what such forecasts would have looked like at various points in time, but I doubt they would have suggested 10 years or more to a crossover point, especially taking into account developments in the last year. Perhaps AlphaGo’s victory was a few years ahead of schedule based on reported performance, but it should always have been possible to anticipate some improvement beyond the (small team/data/hardware-based) trend once significant new effort, data, and hardware were thrown at the problem. Whether AlphaGo deviated from the appropriately-adjusted trend isn’t obvious, especially since there isn’t much effort going into rigorously modeling such trends today. Until that changes and regular forecasts are made of possible ranges of future progress in different domains given different effort/data/hardware levels, “breakthroughs” may seem more surprising than they really should be.
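The mechanics of such an extrapolation are simple enough to show in a few lines. The data points below are placeholders rather than Yamashita’s actual numbers; the point is only to illustrate fitting a trend and solving for a crossover date:

```python
import numpy as np

# Hypothetical (year, strength) points for the best computer Go program on a single
# scale (e.g. KGS dan ranks) -- NOT Yamashita's actual data.
years = np.array([2007, 2008, 2009, 2010, 2011], dtype=float)
strength = np.array([2.0, 3.0, 3.5, 4.5, 5.0])

slope, intercept = np.polyfit(years, strength, 1)   # simple linear trend
target = 9.5                                        # stand-in for "top professional" strength
crossover_year = (target - intercept) / slope
print(f"~{slope:.2f} ranks/year; projected crossover around {crossover_year:.0f}")
```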
 
Lessons re: the nature/pace of AI progress in general
 
The above suggested that we should at least slightly downgrade our surprise at, and impressedness with, the AlphaGo victory. However, I still think it is an impressive achievement, even if it wasn’t sudden or shocking. Rather, it is yet another sign of all that has already been achieved in AI, and of the power of the various methods now being used.
 
Neural networks play a key role in AlphaGo. That they are applicable to Go isn’t all that surprising, since they’re broadly applicable – a neural network can in principle represent any computable function. But AlphaGo is another sign that they can not only in principle learn to do a wide range of things, but can do so relatively efficiently, i.e. in a human-relevant amount of time, with the hardware that currently exists, on tasks that are often considered to require significant human intelligence. Moreover, they are able to not just do things commonly (and sometimes dismissively) referred to as “pattern recognition” but also represent high level strategies, like those required to excel at Go. This scalability of neural networks (not just to larger data/computational power but to different domains of cognition) is indicated by not just AlphaGo but various other recent AI results. Indeed, even without MCTS, AlphaGo outperformed all existing systems with MCTS, one of the most interesting findings here and one that has been omitted in some analyses of AlphaGo's victory. AlphaGo is not alone in showing the potential of neural networks to do things generally agreed upon as being "cognitive" - another very recent paper showed neural networks being applied to other planning tasks.
 
It’s too soon to say whether AlphaGo can be trained just with self-play, or how much of its performance can be traced to the specific training regimen used. But the hardware scaling studies shown in the paper give us additional reason to think that AI can, with sufficient hardware and data, extend significantly beyond human performance. We already knew this from recent ImageNet computer vision results, where human level performance in some benchmarks has been exceeded, along with some measures of speech recognition and many other results. But AlphaGo is an important reminder that “human-level” is not a magical stopping point for intelligence, and that many existing AI techniques are highly scalable, perhaps especially the growing range of techniques researchers at DeepMind and elsewhere have branded as “deep reinforcement learning.”
 
I’ve also looked in some detail at progress in Atari AI (perhaps a topic for a future blog post), which has led me to a similar conclusion: there was only a very short period of time when Atari AI was roughly in the ballpark of human performance, namely around 2014/2015. Now, median human-scaled performance across games is well above 100%, and the mean is much higher – around 600%. There are only a small number of games in which human-level performance has not yet been demonstrated, and in those where it has, super-human performance has usually followed soon after.
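For reference, the “human-scaled” numbers here use the standard human-normalized score from the Atari literature, where 0% corresponds to random play and 100% to the human reference score; a small sketch with made-up numbers:

```python
def human_normalized(agent_score, random_score, human_score):
    """Human-normalized score: 0% = random play, 100% = the human reference."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# Made-up numbers for a single hypothetical game:
print(human_normalized(agent_score=8000, random_score=200, human_score=1500))  # 600.0
```

The median and mean quoted above are just these per-game values aggregated in two ways; the mean is pulled well above the median by games where agents exceed human scores by very large margins.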

In addition to lessons we may draw from AlphaGo's victory, there are also some questions raised: e.g. what areas of cognition are not amenable to substantial gains in performance through huge computational resources, data, and expert effort? Theories of what's easy/hard to automate in the economy abound, but such theories rarely look beyond the superficial question of where AI progress has already been to the harder question of what we can say in a principled way about easy/hard cognitive problems in general. In addition, there's the empirical question of which domains already have sufficient data and computational resources for (super)human-level performance, or soon will. For example, should we be surprised if Google soon announced that they have a highly linguistically competent personal assistant, trained in part on their massive datasets and with the latest deep (reinforcement) learning techniques? That's difficult to answer. These and other questions, including long-term AI safety, in my view call for more rigorous modeling of AI progress across cognitive/economically-relevant domains.
 
The Lee Sedol match and other future updates

[Figure: naive extrapolation of AlphaGo’s Elo rating vs. number of GPUs, based on the scaling study in the Nature paper]
In the spirit of model-based extrapolation versus intuitive judgments, I made the above figure using the apparent relationship between GPUs and Elo scores in DeepMind’s scaling study (the graph for CPUs looks similar). I extended the trend out to the rough equivalent of 5 minutes of calculation per move, closer to what will be the case in the Lee Sedol match, as opposed to 2 seconds per move as used in the scaling study. This assumes returns to hardware remain about the same at higher levels of skill (which may not be the case, but as indicated in the technology forecasting literature, naive models often beat no models!). This projection indicates that just scaling up hardware/giving AlphaGo more time to think may be sufficient to reach Lee Sedol-like performance (in the upper right, around 3500). However, this is hardly the approach DeepMind is banking on – in addition to more time for AlphaGo to compute the best move than in their scaling study, there will also be significant algorithmic improvements. Hassabis said at AAAI that they are working on improving AlphaGo in every way. Indeed, they’ve hired Fan Hui to help them. These and other considerations such as Hassabis’s apparent confidence (and he has access to relevant data, like current-AlphaGo’s performance against October-AlphaGo) suggest AlphaGo has a very good chance of beating Lee Sedol. If this happens, we should further update our confidence regarding the scalability of deep reinforcement learning, and perhaps of value/policy networks. If not, it may suggest some aspects of cognition are less amenable to deep reinforcement learning and hardware scaling than we thought. Likewise if self-play is ever shown to be sufficient to enable comparable performance, and/or if value/policy networks enable superhuman performance in other games, we should similarly increase our assessment of the scalability and generality of modern AI techniques.
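For the curious, the kind of extrapolation behind that figure looks roughly like the sketch below. The (GPUs, Elo) pairs are illustrative stand-ins, not the paper’s reported numbers, and the key assumption is that several minutes of thinking per move buys roughly the same benefit as a proportional increase in hardware:

```python
import numpy as np

# Illustrative (GPUs, Elo) pairs standing in for the paper's scaling study.
gpus = np.array([1, 2, 4, 8, 64, 176, 280], dtype=float)
elo  = np.array([2200, 2350, 2500, 2650, 2900, 3000, 3050], dtype=float)

slope, intercept = np.polyfit(np.log2(gpus), elo, 1)  # assume Elo ~ a*log2(compute) + b

# Treat ~5 minutes/move vs ~2 seconds/move as ~150x more compute at the largest scale
# (a strong, possibly false assumption about returns to hardware at high skill levels).
projected_elo = slope * np.log2(280 * 150) + intercept
print(f"Naively projected Elo with ~150x more compute per move: {projected_elo:.0f}")
```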

One final note on the question of "general AI." As noted earlier, Hassabis emphasized the purported generality of value/policy networks over the purported narrowness of Deep Blue's design. While the truth is more complex than this dichotomy (remember, AlphaGo used some hand-crafted features for MCTS), there is still the point above about the generality of deep reinforcement learning. Since DeepMind's seminal 2013 paper on Atari, deep reinforcement learning has been applied to a wide range of tasks in real-world robotics as well as dialogue. There is reason to think that these methods are fairly general purpose, given the range of domains to which they have been successfully applied with minimal or no hand-tuning of the algorithms. However, in all the cases discussed here, progress so far has largely been toward demonstrating general approaches for building narrow systems rather than general approaches for building general systems. Progress toward the former does not entail substantial progress toward the latter. The latter, which requires transfer learning among other elements, has yet to have its Atari/AlphaGo moment, but is an important area to keep an eye on going forward, and may be especially relevant for economic/safety purposes. This suggests that an important element of rigorously modeling AI progress may be formalizing the idea of different levels of generality of operating AI systems (as opposed to the generality of the methods that produce them, though that is also important). This is something I'm interested in possibly investigating more in the future and I'd be curious to hear people's thoughts on it and the other issues raised above.



[1] The 15 number comes from a remark by David Silver in one of the videos on DeepMind’s website. The 20 number comes from the number of authors on the relevant Nature paper.

Why OpenAI Matters

12/12/2015

2 Comments

 

[Note: these comments assume some familiarity with OpenAI and the motivations behind it – see e.g. their website, this article, and this detailed interview].
 
The announcement of OpenAI has justifiably gotten a lot of attention, both in the media and in the relevant expert community. I am currently flying home from NIPS 2015, a major machine learning conference, and right after OpenAI was announced, I noticed several people around me checking out the website on their phones and computers to learn more, and I expect that OpenAI’s information/recruiting session at NIPS later today will be extremely popular. Here, I will summarize some preliminary, sleep-deprived thoughts on why this is getting so much attention and why that is appropriate.
 
The AI Arms Race
 
Over the past few years, many billions of dollars have been poured into AI and robotics research and development. OpenAI is in fact the second billion-dollar research center announced very recently (the other being Toyota’s). Stuart Russell, lead author of the main textbook in the field, has claimed that more money has been invested by corporations in AI in the past few years than was invested by governments in the field’s entire prior history. This is largely a result, as pointed out in the open letter on AI signed by thousands of researchers and other interested parties, of AI methods (especially but not limited to deep learning) having just recently reached levels of performance that make them commercially lucrative, which in turn leads to more funding to reach higher levels of performance, and so forth.
 
This has led to an extremely competitive race between technology companies to snatch up talent in especially “hot” sub-fields of AI. Walking around at NIPS, one would see badges listing the names of not just stereotypical “tech” companies like Google and Facebook but also miscellaneous hedge funds and other large companies seeking to cash in on the AI bonanza. Tellingly, at NIPS I saw a flyer for an academic job that started with “If you’re still interested in academic jobs…” Leaving aside the question of whether all this investment constitutes a bubble (I don’t think so), superficially, at least, things are extremely “hot” in areas like deep learning – for example, at NIPS, the workshop on deep reinforcement learning (a combination of deep learning and reinforcement learning that has led to impressive results in various domains recently) was so popular that people were sitting on the floor, standing along the walls, and getting shut out entirely over fire hazard concerns (including Rich Sutton, author of “the book” on reinforcement learning – I gather he eventually got in, though).
 
This arms race is not just a matter of accelerated deployment of AI technologies, or massive spending by tech giants, though that is happening—it’s also a battle for the hearts and minds of AI researchers, the most accomplished of whom are being sought after like star athletes. Already, several years ago, AI researchers at the likes of Google were making plenty of money for a reasonable person’s purposes. Now, with talk of salaries of six or more figures being offered to researchers, we may be at a point of diminishing returns for fat paychecks—many researchers would also ideally like the ability to publish (and talk openly about) all their research and to work towards a mission (such as OpenAI’s, to benefit humanity through AI) that resonates with them. In this competitive context, OpenAI is stepping in with a third alternative beyond partially-secretive industry and often bureaucratic and under-resourced academia. It also comes at a critical time in discussions of the future of AI and the prospects for benefiting all of humanity through it. Speaking of arms races…

AI Safety and Ethics
 
In recent years, concerns about the short- and long-term implications of AI, and safety concerns in particular, have gone from being a fringe topic mostly researched outside the AI community to a mainstream topic of discussion. On Thursday at NIPS, there was a major symposium on the social impacts of machine learning attended by hundreds of researchers. Speakers such as Nick Bostrom (whose book Superintelligence helped catalyze the particular form of current concerns around the future of AI) and Erik Brynjolfsson (whose co-authored book The Second Machine Age helped spark discussions on the impact of AI on the economy) and panelists from industry and academia discussed short and long term issues with AI and its social implications, as well as what the AI community can do about it.
 
While a range of opinions was expressed on the urgency of different issues, there was no dispute in the symposium about the enormous consequences that AI is likely to have. Even Andrew Ng, a top deep learning researcher who leads AI research for Chinese search engine Baidu and who has compared concerns about existential risk from AI (of the sort Bostrom writes about, and Musk has echoed) to concerns about overpopulation on Mars, backpedaled his comments to some extent when he said that he thinks it makes sense for at least some people to be working on understanding potential long-term AI safety risks. The last panel of the symposium went into some detail about research priorities in that area, an area that Musk has previously invested $10 million in through the Future of Life Institute. There is certainly still much disagreement on the topic of long-term AI risks, as evinced by the diverse audience reactions at the symposium, some clapping at Ng downplaying long-term risks, and others responding favorably to DeepMind’s Shane Legg comparing AI safety research to engineers on the Apollo Project investigating ways that the mission could go wrong as a matter of common sense and simple responsibility. However, there has been a noticeable change in the tenor and number of people involved in these discussions in recent years.
 
In this context, OpenAI represents not only a big investment in AI safety (this is among the areas Musk et al. have indicated the institution will study), but also a sign of the mainstreaming of the issue, given the talent and prestige of those affiliated with the new organization, including Ilya Sutskever of Google deep learning fame. Along with safety research in particular, the funders and chairs of OpenAI have indicated that the organization will be concerned with shorter-term issues such as the economic implications of AI, the need for AI to be designed in a way that is usable and complementary to human skills, and, of course, as the name implies, openness. This latter factor is particularly notable, not just for recruiting purposes as discussed above, but also because it represents a significant decision on the part of Musk and the other funders. As Musk’s recommended book Superintelligence attests, diffusing AI technologies as widely as possible is not straightforwardly preferable to scenarios in which, say, the technology is more concentrated in the hands of governments or corporations with particular sets of safety constraints. Indeed, Musk’s latest comments (for example, in the Medium interview linked to above) suggest a shift in his thinking from when he compared AI to nuclear weapons and implied favoring a less open approach to AI safety. It will be interesting to see how the growing AI safety research community reacts to this move and to the specific arguments put forth by Musk et al., and to see how OpenAI will interpret the noteworthy last word in its website’s mission statement of ensuring that AI technology is “as broadly and evenly distributed as possible safely” (emphasis added). Thus, OpenAI has a lot of symbolic significance, but how will its mission be realized, if it is to be?
 
Institutional Innovation
 
It’s not yet clear how exactly OpenAI will operate, but its status as a non-profit is of particular significance. This means that there will be no pesky shareholders questioning the return on investment of their research, or the particular balance struck between emphasis on AI performance progress and safety (a delicate issue again dealt with at some length in Superintelligence), in the event that OpenAI begins to focus more on the latter. Additionally, standing outside of traditional corporations will give OpenAI the ability to think more outside the box about potential applications of its AI approaches, instead of working around existing value streams as, e.g. Google and Facebook do to an extent. This, among other factors, will likely lead to a big influx of talented researchers interested in applying their talent to grand challenges, potentially leading OpenAI to grow to hundreds of people (or, someday, even more). It will also be interesting to see how OpenAI differentiates itself from a comparable (at least in stated ambition if not, yet, committed funds) organization, the Allen Institute for Artificial Intelligence, a non-profit funded by Paul Allen, which seems to have focused so far on applying AI to improving scientific productivity and on particular methods like natural language processing. I hope that OpenAI will differentiate itself by tackling truly difficult challenges like applying AI to, e.g. addressing poverty, mental health, or sustainability, or if they focus more on research, that they give very deep thought to what human-complementary AI means and what sorts of research won’t otherwise be funded by industry. 
 
Scale is one of the ways that OpenAI may be distinct from academia, though it is not obvious in what other ways it is intended to be different. Mark Riedl noted on Twitter that academia has many of the characteristics OpenAI is purported to have—though, I would add, academic research too often fails to have maximum positive impact and is too often bogged down in grant applications and other paperwork. Relative to industry, I’d expect OpenAI salaries to be on the order of 50-90% (or more) lower than what some of the best researchers could make, so the organization will need to stay true to its stated principles if it is to continue recruiting successfully over the long term. To continue to recruit the best talent, I suspect, as was the case with another instance of institutional innovation I’m familiar with (ARPA-E), that the top leadership of OpenAI will want to not only pore through the hundreds or thousands of applications they will soon receive, but also aggressively recruit top talent in the field and bring in the likes of Musk to help with such recruitment.
 
The role of data at OpenAI also raises interesting questions from an institutional perspective, as Beau Cronin and others have noted on Twitter. With Amazon’s investment and Tesla’s indirect involvement through Musk, it would seem that OpenAI will potentially have access to a lot of the data needed to make deep learning and other AI approaches work well. But these are different sorts of datasets than Google and Facebook have, and may lend themselves to different technical approaches. They also raise the question of proprietary data – how will OpenAI balance its push for openness with this proprietary access? Will it release code but not data? How will its experiments be replicable if they are tailored to particular data streams? How will privacy be addressed? Some of these questions are unique to OpenAI and others aren’t. However, they’re the sorts of questions OpenAI will need to answer.
 
Conclusion/Open Questions
 
As I and others have written elsewhere, there are many open questions about the future of AI in general: how to innovate responsibly in this area, AI and the future of work, how AI progress may proceed, and so on. These questions precede OpenAI, but its launch makes them particularly concrete. Here I suggested some preliminary formulations of these questions in the context of a particular organization that ostensibly will address many of them. I look forward to thoughts people have on whether I formulated these issues correctly, how OpenAI could realize its goals, and what ripple effects its launch could have on the broader AI arms race/safety/ethics conversation.

