BI 123 Irina Rish: Continual Learning

Brain Inspired
December 26, 2021 | 01:18:59

Show Notes

Support the show to get full episodes and join the Discord community.

Irina is a faculty member at Mila-Quebec AI Institute and a professor at Université de Montréal. She has worked from both ends of the neuroscience/AI interface, using AI for neuroscience applications and using neural principles to help improve AI. We discuss her work on biologically plausible alternatives to backpropagation, using “auxiliary variables” in addition to the normal connection weight updates. We also discuss the world of lifelong learning, which seeks to train networks in an online manner so they improve on any task as it is introduced. Catastrophic forgetting is an obstacle in modern deep learning, where a network forgets old tasks when it is trained on new tasks. Lifelong learning strategies, like continual learning, transfer learning, and meta-learning, seek to overcome catastrophic forgetting, and we talk about some of the inspirations from neuroscience being used to help lifelong learning in networks.

Transcript

Irina    00:00:03    We are not the first ones asking the question about what intelligence is. People think they’re the first ones to ask the question or to build something, and it’s not the first time, to put it mildly. It’s always a trade-off, capacity versus complexity, and you want to find the minimum-cost, minimum-capacity agent that’s capable of conquering the complexity of whatever future tasks that agent will be exposed to. But if the agent is in the wild, the agent will have to have the ability to expand itself and continue learning. What I’ve learned from two years trying to do the neuro-AI project: the idea was, first of all, much less well-defined than AI for neuro. Here you’re, like, searching for a black cat in a dark room, and you’re not sure if the cat is there. That’s

Speaker 2    00:01:11    This is brain inspired.  

Paul    00:01:25    Hey everyone, it’s Paul. Happy holidays, I hope you’re well. Today I speak with Irina Rish, who is currently at the University of Montreal and also a faculty member at Mila, the Quebec AI Institute. And I wanted to have Irina on for multiple reasons, one of which is her interesting history, uh, having been kind of on both sides of the AI and neuroscience coin. So she’s also worked, uh, at IBM, as you’ll hear, working on healthcare and also on neuroscience-inspired AI. And we have a pretty wide-ranging conversation about much of her, uh, previous and current work. So we talk about, uh, her work on alternatives to the backpropagation algorithm. And we talk about her ongoing work on continual learning, which is kind of a big topic in AI these days. So as you probably know, uh, deep learning models suffer from what’s called catastrophic forgetting, where when you train the model to do one thing really well, uh, and then you train it to do another thing,

Paul    00:02:25    it forgets the first thing, and humans don’t suffer from this problem. And it’s an important problem to tackle moving forward, uh, in deep learning. And we discuss many of the methods being used to try to, uh, solve continual learning and some of the inspirations from neuroscience along those lines. We also talk a little bit about scaling laws, which is roughly the relationship between how big and complex a model is and how well it performs over a range of tasks. We also talk about definitions and Irina’s definition of artificial general intelligence and how she views the relationship between AGI and continual learning. And we talk about a lot more. So you can learn more about Irina in the show notes at braininspired.co/podcast/123 on the website. You can also learn about how to support the show on Patreon and possibly join the Discord community of Patreon supporters, who are awesome.

Paul    00:03:21    Thank you guys. And thank you for listening. Enjoy. You’re actually kind of a perfect fit for this podcast, uh, because on the one hand you have a background in a lot of, uh, computer science, and I guess your early work was in applied mathematics. So you kind of come from that side, but I know that you’re interested, among the many things that you’re interested in, in using some principles and ideas from neuroscience to help, uh, build better AI. So could you just, um, talk a little bit about your background and how you came to be interested in, uh, being inspired by brain facts, et cetera?

Irina    00:04:00    That’s a very interesting question. Indeed, sometimes I ask myself, and I try to dig into the past; the question is how far into the past we want to go. Uh, indeed, um, a couple of years ago I joined Mila and the University of Montreal. But before that I was at IBM Research, and I was there for quite a long time, initially in the department of computational biology, where I did indeed focus on neuroscience, neuroimaging, and applying statistical methods, machine learning, AI to the analysis of brain data. So that’s mainly where I kind of really got, I guess, deeper into neuroscience, psychology, psychiatry types of topics. And I’m still collaborating with that group, my kind of long-term collaborators in computational psychiatry and neuroimaging. So that was really exciting. And that’s, I guess, where I actively was, uh, pursuing this idea of the intersection between AI and neuro. But I think the interest in that intersection started a long time before I even joined IBM. And I realized that I could track it back to my, uh, I think, uh, elementary or even middle school years. I used to go to math Olympiads in Russia. I don’t know if this is interesting or if this is too long of an answer.

Paul    00:05:30    Well, let me ask you this. So, so when I was in college, there wasn’t even a neuroscience program, um, uh, or degree, uh, and I don’t know that I would have chosen it. You know, I started in aerospace engineering and then moved on to molecular biology. And I don’t know, if neuroscience had been available as a program, whether I would have actually chosen it. But, uh, so I was going to ask, if that’s what was limiting, if you had that kind of kernel of interest, why then go into applied mathematics? Uh, that was your first degree, right?

Irina    00:05:58    Right. Yeah. I mean, I probably should explain, indeed, uh, what I wanted to mention. Uh, the reason was, from a practical perspective, I was going to math Olympiads and I was quickly realizing that, um, you don’t have that much control over your brain. Like, you want to solve a problem, and then you kind of hit the wall. It’s like, what’s going on there? And sometimes it works, sometimes it doesn’t; you want to understand why, and how to make it work better. And then you see people around you, some of them struggling, unable to solve anything, some of them able to solve, like, more than you do. And again, you wonder what the difference is, what does it take? Then you start reading books like, uh, Pólya’s How to Solve It, how to learn how to solve problems, and this and that. But it really came from a very practical goal.

Irina    00:06:49    Like, I need to figure out how to solve all those problems quickly because I want to do well at the Olympiad. So what do I do? My brain doesn’t seem to be cooperating, so what should I do? Like, how do I make it work? So you start digging into how to make the brain work. And then you run into books accidentally which ask, well, whether machines can think; it was a Russian translation, I guess, of the famous Turing work. That gets me into thinking about AI when I’m, like, 14 or something. That’s why I go to computer science. Uh, the closest to computer science at that point, uh, kind of in Russia, was applied math, essentially applied math slash computer science, but formally applied math. Um, and that’s kind of, that’s how it goes from there. So, like, my focus on computer science and AI actually came from a really very practical goal. Like, I need to understand how this brain works so I can make it work better. That’s pretty much it. Then you realize that it’s biology, psychiatry, neuroscience, and many other things that study the brain, whatever it takes.

Paul    00:07:57    Okay. So, so at IBM then, you were part of a team, I think you said you were in the computational biology division, and so you were part of a team that was sympathetic to, uh, using principles from biology to help make machines better. Right?

Irina    00:08:13    Okay. The focus of the department was not on machines. The focus was on health care. So the focus was on how to help, uh, humans prosper. Uh, that was the computational biology, neuroscience, and neuroimaging, kind of, um, groups’ focus. So the focus of that group was not really on AI. And I was kind of back and forth between a focus on AI and computer science and machine learning, to biology, and back. Originally I started in a machine learning group, uh, for distributed systems, uh, which changed names a few times. And I moved to this computational biology center. And then in the past few years before moving to Mila, I moved again from the computational biology department to the AI department of IBM. So, as I said, I was kind of iterating between the two multiple times. The focus in those past couple of years before moving to Mila was indeed to bring, uh, kind of neuroscience inspirations and ideas to improve AI. So my latest focus at IBM was indeed on what you’re asking about. And that was a neuro-AI, uh, kind of, uh, project between IBM and MIT, uh, that was going on. That kind of remained part of my focus when I joined Mila and is part of the direction for that, um, seven-year program, uh, that I’m leading, the Canada Excellence Research Chair. The neuro-AI component is one of the kind of, uh, axes along which things are supposed to be developing.

Paul    00:09:56    I see. Okay. So this is what I wanted to get to because, um, I’m curious about your experience, about your colleagues and their sort of opinions about using, uh, neuroscience or neuro-inspired, um, tools, uh, to build better AI, because that’s very much the industry side of it. And there are a lot of, like, passionate opinions about whether we should be looking to the brain to build better AI or whether we should just continue to scale with what we have. Right. All right.

Irina    00:10:24    I know both sides really well. We just ran our scaling laws workshop last weekend. Yes.

Paul    00:10:31    Okay. So, so what did your IBM colleagues, if you feel free enough to talk about it, what was kind of like their reception of your focus on using neuroscience?

Irina    00:10:40    Uh, again, the colleagues in the AI department, or colleagues in the bio department? Well, I think actually this whole neuro-AI, neuro for AI and AI for neuro, all these ideas actually grew out of my multiple interactions and discussions with my friends in the computational psychiatry and neuroscience department, and primarily with one colleague there. And I think what really also helped to shape my views was those interactions, which involved not just discussing, uh, technical aspects of AI or technical aspects of brain imaging, or even neuroscience, but there was a lot of philosophy in those discussions, because luckily he had a first degree in philosophy and a second in physics and then went to neuroscience. And I think that really made him kind of stand apart from, and ahead of, many colleagues. So I’m really, really, really grateful for those discussions because they helped me to shape my views as well.

Irina    00:11:44    So what I’m trying to say is that while the healthcare department, the comp-bio department, was focused on healthcare applications, the idea of using neuroscience and biology inspirations for improving AI, um, was very exciting for at least several people there. And it’s still exciting, and we would like to kind of do more along those lines. When I moved to the AI department at IBM, uh, again, it was kind of a mix of opinions, because, just like in the field in general, um, and I agree with that view, it may not necessarily be the case that the only path to intelligent, uh, artificial intelligence systems is, um, mimicking the brain. Moreover, even my colleagues the neuroscientists never said that we have to mimic the brain. The key question is, like, what is the minimum part of those inspirations that might be necessary to transfer? And that’s just like the example about airplanes not flapping their wings, right?

Irina    00:12:54    So you don’t have to copy the biology exactly, and yet you want to come up with some common laws that govern the flight, right? Um, some aerodynamics of intelligence, in our case. And that’s the tricky part. And that’s, I think, where everyone agrees. So nature found a solution. Uh, there are some properties of that solution that might be specific to evolution and nature, and perhaps we can abstract away from them, but there are some parts, and the question is which ones, that are probably invariant or necessary for any intelligence. Finding those invariant properties is a good open question. And I think subconsciously everybody doing AI is trying to do that. But I definitely am not in the camp of: first you need to completely understand how the brain works and only then you can create artificial intelligence. I don’t think so. Just like with airplanes, the engineers there don’t have to be, like, biologists understanding birds perfectly. Right, right. You need to understand enough, but the good question is what is enough.

Paul    00:14:10    Interesting, because we’re going to talk about a few of the, um, neural inspirations that you have focused on. And in some sense, um, and I don’t mean this as an insult at all, but it’s almost like, uh, we are sometimes cherry-picking and just trying kind of one thing at a time: we think this might be an important thing, that might be an important thing. Um, when what we could do, which you say is not the right path, and I agree, is, uh, instead of cherry-picking, you know, these facts that we’re going to discuss in a little bit, um, you really could go more all in and try to, you know, there are people trying to, quote unquote, build a brain, right? And they’re still having, you know, to abstract out a lot of things. But, um, that push would be to build in more of what we know about the brain, rather than less, it seems. Um, but I want to ask you, uh, before we move on, about philosophy. I happened to see a panel that you were on. I don’t remember the source. It may have been a,

Irina    00:15:11    The main conference. We had an interesting discussion about philosophy there.

Paul    00:15:17    It didn’t go that far, but you, you got into a back and forth a little bit with Surya Ganguli, uh, who finds philosophy useless, and you made the push that, uh, in fact it is useful. So I just wanted to,

Irina    00:15:31    It’s not useless; you can learn from anything. But let’s not go there.

Paul    00:15:38    Okay. So, so you don’t want to, um, make a, uh, a case for philosophy.  

Irina    00:15:44    I can make a case for philosophy. I, um, I think what happened there, maybe, it was as usual; by the way, it’s not specific to the panel. Uh, people mainly disagree because of differences in their kind of definitions or interpretations of terminology. And unfortunately that’s a universal problem in the field: many concepts are not well-defined. And in general, I mean, that’s the main reason people argue, because when people actually nail down the details of what they are arguing for or against, in surprisingly many cases they all agree. So I think the problem was what people, uh, understood as philosophy. When I say the word philosophy, it means something to me; it probably meant different things to different people. So they were really arguing not with my point, but with their own understanding of the word. Yeah. That’s why, because I don’t think Surya or anyone else would disagree in general that there are different disciplines, whether it’s philosophy or neuroscience or psychology or psychiatry, any type of discipline that has studied the mind and its function, how it works and what the mechanisms are, at different levels, in different ways, and philosophy is one of them.

Irina    00:17:13    And even more, it’s not just philosophy. I mean, you can think about Buddhism, and I brought up this example: to me, it’s an empirical science of how the mind works, which has several thousand years of knowledge accumulated, in very different terminology and so on. Uh, but that’s data, that’s knowledge accumulated by people from observations, right? So there is some truth to it. And the question is, like, how to dig that truth out. Since we’re coming from different fields and use different terminology, again, how do you translate philosophy and Buddhism into machine learning slang, in a sense, so people will understand it? Not everything there might be relevant, but we are not the first ones asking the question about what intelligence is. What usually amazes me, and again, I don’t mean this as an insult, and plus it’s very natural, it always happens, is that people think they’re the first ones to ask the question or to build something. And it’s not the first time, to put it mildly. There are many bright minds that for many years were facing similar types of questions, just in different circumstances. So I think it might be useful to learn more about what they found.

Paul    00:18:42    It does seem to be a recurring theme that, um, there’ll be a hot new trend, and then it turns out a hundred years ago someone had already written basically the answer, you know, and laid out the groundwork for it. And then, then we go back and find, uh, that something we thought we’d resolved, uh, had already essentially been solved.

Irina    00:19:03    It is. I mean, it’s not specific to our field or our time, right? It’s always been like that; it’s probably always going to be like that. Uh, but, uh, that’s just why I mentioned philosophy. And, uh, also, I mean, I know I essentially meant the same thing that Surya was saying himself, that we are trying to, um, kind of discover the common laws behind intelligence, uh, whether biological or artificial, and kind of pushing it forward: common laws behind how the mind works, or how it could work, and, um, how you can kind of affect it in different ways so it works differently. And I think any source of knowledge, like people asking similar types of questions and finding whatever answers, any information like that, you can learn from all this data. All I was actually suggesting is that, yeah, let’s try to learn from all the data, the input data being different disciplines,

Paul    00:20:13    But okay. So there, there’s a problem here, right? Where, um, throughout the years, all the different disciplines have continued to progress, and it is essentially impossible to be an expert in all disciplines. So how, you know, what’s the right,

Irina    00:20:27    That’s why we need AGI, and it’ll be an expert in all of that.

Paul    00:20:32    And it can tell us which disciplines we need to learn, but we won’t be

Irina    00:20:35    It will distill the knowledge for us and convey it to us in an understandable manner. I’m just quoting that, uh, short sci-fi story from Nature, but it’s only half joking.

Paul    00:20:49    Yeah. Yeah. Well, so, so you are interested in, um, that, that’s a goal, to build AGI, and we’re going to talk, uh, about lifelong learning in a little bit. I want to ask you about backpropagation first, but would you say that’s one of your, uh,

Irina    00:21:03    Uh, yes and no. AGI is not the final goal in itself. It’s an instrumental goal. The final goal, as I was always putting it, is AI as augmented rather than artificial intelligence. To me, just the goal of building AGI never felt truly motivating. Like, why do I care about machines?

Paul    00:21:32    Well, do you know what AGI even is? I don’t really know what AGI is because that’s another thing where people have different definitions.  

Irina    00:21:39    Yes. It’s one of those terms in machine learning which is not well-defined. And I know that creates lots of confusion, and we’ve had two debates at Mila on the topic of AGI. There are different definitions, and different people, again, mean different things. One practical definition could be to just stick to the words; let’s say it’s artificial general intelligence. So general means capable of solving multiple, really multiple, problems. To me that means general, broad, versatile, which relates to continual learning or lifelong learning or transfer learning, but kind of pushed to extremes. So, like, truly versatile AI that can do, well, at least pretty much anything we can do; not narrow, the opposite of narrow: broad, general. So that can be just a relatively clear definition, at least to me, of what AGI would stand for. There are many other definitions, and we probably could write a list of different ones, but I think, yeah, you’re absolutely right. It’s not a mathematical term.

Paul    00:22:59    Do definitions matter?

Irina    00:23:01    Definitions matter. I mean, yes and no, again. So you can have different definitions; what matters is for people, before they start kind of working together on something or discussing something, to agree on definitions. Because again, the main reason for debates, sometimes unending debates, is at the core that people did not agree on definitions. And what comes to my mind whenever I listen to machine learning people debating something, or pretty much anyone debating anything, is the picture of the elephant and seven blind men touching different parts of the elephant and saying, no, the elephant is this; no, you’re wrong, the elephant is that. And they’re all right, and nobody’s wrong, but they didn’t agree on definitions and they don’t see the full picture.

Paul    00:23:51    That’s all I’ve come to think that the purpose of debates is to talk past one another and not progress at all.  

Irina    00:23:58    That’s not the purpose; to me, it’s a sad reality. Yeah, you can do that. You will probably have some fun, um, maybe for some limited amount of time, and then pretty much you just wasted the time and everybody moved on. So what was the point? I don’t know. I mean, yes, if you try to learn something or converge to something or make some progress, then probably not.

Paul    00:24:28    Okay. So you and I agree. That’s good. We don’t need to debate that issue then. Okay. So, um, you’ve done work, one of the ways that you have sought to use neuroscience is, uh, on the question of backpropagation, and, um, you’ve done work on what are called auxiliary, uh, variables, like a backpropagation alternative. Um, so I’d like you to describe that and, you know, where that came from. But before doing that, um, could I, because we’ve talked about backpropagation multiple times now on the podcast, I had Blake Richards on way back when, um, you know, who, um, uses the morphology of neurons and the, uh, electrically decoupled, um, apical dendrites, and blah, blah, blah, burst firing, et cetera, as an alternative. And now, you know, there’s this huge family of alternatives to backpropagation. Um, I’m curious about your overall view on, uh, that progress, that literature.

Irina    00:25:29    Yes. Uh, yeah, that’s a very good question. And actually, in fact, we are working right now with a group of people at Mila and outside of Mila, um, on, uh, extending difference target propagation. So basically that line of work is still going on, although it was a bit on the back burner for a while. And there are, as usual, at least two motivations here. Either you come from neuroscience and you try to come up with a good model of how, uh, essentially learning happens in the brain, um, basically how the credit assignment for mistakes happens in the brain, and whether backpropagation is a good model for that, or whether you can come up with a better model. So this is one motivation, and many people who are kind of less concerned with, uh, competitive performance of alternatives to backpropagation are more concerned with really understanding how it works in the brain.

Irina    00:26:26    They focus on that. And I also totally agree with that view. As I said, I mean, there is no contradiction; once you clearly state what your objective is, you cannot say that they are wrong or you are right, or vice versa, because they’re just optimizing a different function. They want to answer the question: how do we best model what happens in the brain? Their objective is not to beat you on MNIST. So as long as we all agree on what the objective is, it’s not wrong. It’s an interesting line of research, and that’s kind of initially what motivated, uh, also work on, um, beyond backprop, kind of just trying to understand things better. And Blake is definitely, uh, doing lots of things in this direction, and other people. But on the other hand, there is, uh, another objective. Like, if you come from the point of view of an AI person who says, okay, I want my analogies with the brain if and only if they help me build a more effective, more efficient algorithm.

Irina    00:27:31    So when you come from that objective, you can start wondering, purely computationally, what are the limitations of backpropagation, and, um, what could you do differently or better, and how to solve those kinds of shortcomings? And usually people were always claiming that, yeah, there is the problem of vanishing gradients or exploding gradients. Yeah, there is the problem that, uh, basically backpropagation is inherently sequential, because you have to compute this chain of gradients and you have to do it sequentially. But on one hand, brain processing is massively parallel; second, in computers, if we were able to do it in parallel, it probably would be more efficient and would scale better as well. So you want this parallelism, you want to avoid possible gradient, uh, issues. Uh, so what do you do? And that’s where many optimization techniques came in, starting from, essentially, Yann LeCun’s own thesis, which mentioned an alternative to backpropagation that later was called target propagation.

Irina    00:28:38    And all it meant was another view of the problem. So basically, instead of just optimizing the objective of the neural net with respect to the weights of the neural net being the unknown variables you’re trying to find, you introduce auxiliary variables, which go under different names. Uh, it all comes from just, kind of, the, uh, equality constraints there: that those activations of the hidden units, uh, are equal to, uh, a linear function of the previous layer transformed by some nonlinearity. And you can play with it. You can introduce extra auxiliary variables, just the linear ones, and another set, the nonlinear transformations of those ones. Of course you can write it purely as a constrained optimization problem; you can modify the constrained optimization into just, like, a Lagrangian or whatever. So yeah, he mentioned that in the thesis and kind of was looking into that later.

Irina    00:29:39    So it’s not something that was not kind of considered before. It’s just that people didn’t really try to push, uh, directly, optimization algorithms that would, um, take into account those auxiliary variables explicitly. And to me, the work from 2014, there was a paper, I’m forgetting the name right now, I mean, that basically motivated us to start looking into auxiliary variable approaches. And then there was a whole wave of these optimization approaches anyway. So they all try to do the same thing: they try to introduce activations, or linear pre-activations, as explicit variables to optimize for. That would reformulate your objective function for neural net learning in terms of two sets of variables, one being your usual weights and the second being those activations. And that has some pluses and minuses, as everything does. The plus would be that once you fix the activations, it completely decouples the problem into local, layer-wise subproblems.

Irina    00:30:52    So you can solve for those weights completely in parallel. Um, basically the chain of gradients is broken, and that’s good. So you don’t have, by definition, any vanishing gradients or exploding gradients, because there is no chain. And second, you can do things in parallel. So those two things are good. There is also some similarity and more kind of biological plausibility, because you take into account now the activations in this neural net explicitly as variables, and essentially one interpretation of that is you view them as noisy activations, which they are, unlike in classical neural nets, where they’re always deterministic variables, uh, deterministic versions of the real neurons. The real neurons are not fully deterministic functions, right?

Irina    00:31:48    So the nonlinearity is a separate thing, but even just the fact that in an artificial neural net they’re totally deterministic, that’s also quite a simplification. So, uh, there are other kinds of, um, I mean, there are other flavors of these auxiliary variable and kind of target propagation methods. In our, uh, kind of approach, which is essentially in the line of the, um, auxiliary variable optimizations, you can write the joint objective in terms of activations and weights. Uh, the thing here is we still use the same weights for, kind of, forward propagation, or basically computing output given input, as well as for the optimization, or in a sense the backward pass. There are other flavors, like, uh, target propagation by Yann LeCun and difference target propagation by Yoshua and his students, and all flavors of methods on top of that; they use two sets of weights, the forward weights and backward weights, which may be even more biologically plausible than those auxiliary methods that I mentioned.
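To make the auxiliary-variable idea concrete, here is a minimal sketch (an illustration under simple assumptions, not Irina's exact algorithm): the hidden activations are treated as explicit variables tied to the layer computation by a soft equality penalty, so with the activations fixed, each layer's weights can be updated locally, with no chain of gradients through the whole network. All sizes and hyperparameters below are made up.

```python
# Minimal sketch of a penalty-based auxiliary-variable formulation for a two-layer net.
# The constraint A = tanh(X W1) is relaxed into a quadratic penalty, so for fixed A the
# two weight updates are local and independent (no gradient chain through the network).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                    # toy inputs
Y = np.tanh(X @ rng.normal(size=(10, 1)))         # toy regression targets

W1 = rng.normal(scale=0.1, size=(10, 32))
W2 = rng.normal(scale=0.1, size=(32, 1))
A = np.tanh(X @ W1)                               # auxiliary activations, explicit variables
rho, lr = 1.0, 0.05                               # penalty weight, step size

for step in range(500):
    # Joint objective: ||Y - A W2||^2 + rho * ||A - tanh(X W1)||^2
    h = np.tanh(X @ W1)
    # 1) layer-wise weight updates, each depending only on the fixed A
    W1 += lr * rho * X.T @ ((A - h) * (1 - h ** 2)) / len(X)
    W2 += lr * A.T @ (Y - A @ W2) / len(X)
    # 2) auxiliary-variable update: pull A toward both the output loss and the constraint
    A -= lr * (-(Y - A @ W2) @ W2.T + rho * (A - np.tanh(X @ W1))) / len(X)

print("feedforward loss:", np.mean((Y - np.tanh(X @ W1) @ W2) ** 2))
```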

Irina    00:32:57    And then there are lots of kinds of flavors and variations on that. And actually, it’s nice to see this subfield expanding recently; there were some papers at NeurIPS last year, and so on. So it’s all interesting. It has its pluses, and those things, such as parallelization and, uh, by definition, a lack of vanishing gradients and exploding gradients no matter how deep the network is or how long the sequence is in a recurrent net or something, those are good. But you move into a different optimization space, and empirically, whenever you try these methods on standard benchmarks and standard architectures, they are not always performing as well as classical backpropagation. And that was one of the issues with the whole field of alternatives to backprop: how to make them competitive. There are multiple successes, but they’re not, like, completely kind of putting backprop out of the picture. Plus, we didn’t aim to do so.

Irina    00:34:11    We had some successes on fully connected architectures. We’ve had successes on CIFAR and MNIST. We’ve had some successes even on RNNs and even on simple convnets. But what we learned, I mean, that was good, because it was the first time when you actually do see those improvements, um, in, um, uh, in the paper at NeurIPS, uh, by, um, Sergey. But, um, yeah, and, um, uh, Geoff Hinton and others, uh, they also kind of were trying different alternatives, like target prop and various sorts of that, and unfortunately they couldn’t show that it would be comparable with standard backprop on, uh, ImageNet. So there were these kinds of, uh, unsuccessful attempts. There were some limited successes, and the question is still open whether such alternatives can become true state of the art. And the hunch is, I think you shouldn’t beat your head against the wall trying to use alternative optimizations like that on architectures like convnets and so on, which were so well optimized to work with standard backprop; you need different architectures.

Irina    00:35:25    And maybe the fact that we really tried hard to beat backprop on classical ResNets and it didn’t really work, maybe that’s the problem: ResNets may not go well with it, but something else will go well with auxiliary variables and target prop. It’s a hypothesis, but I think it’s kind of something to try. Anyway, I think, I think if you make those methods work, you will get the benefits of much better parallelization and scalability. You will not have these nagging issues of potentially exploding or vanishing gradients, but you will have other problems. You will possibly have convergence problems in alternating minimization types of, uh, algorithms, and so on and so forth. I mean, uh, there is no free lunch.

Paul    00:36:19    So since you mentioned architectures, I want to pause and ask you, um, if you think looking to the brain for architectural inspiration makes any sense at all, or, you know, because it’s a whole system with lots of different architectures interacting. And if you’re thinking, like, different optimization methods might work better or worse with different architectures, if that is, uh, another avenue where we should look,

Irina    00:36:47    I think it would be useful to explore it. Again, uh, there is this, uh, heated debate within the field: inductive biases versus scaling the most generic architectures, say transformers, or even scaling a multilayer perceptron. Ultimately, it’s a universal function approximator; if you scale it enough, it probably will do anything, right? It’s probably just not going to scale very efficiently, to put it mildly. So yeah, that’s why maybe transformers are better. But the question, okay, here is the thing: inductive biases or priors, versus scaling very generic types of networks. Uh, I might be wrong, it’s my personal opinion, but I think, just like historically, whenever you don’t have enough data, so in brain imaging, sometimes it’s small datasets, or in medical applications and so on and so forth, when you don’t have enough data, then using priors or inductive biases from the domain is extremely helpful. They take the role of regularization, uh, constraints.

Irina    00:37:59    And if those regularization constraints are priors, so inductive biases, right, they help immensely, and you can perform really well despite having small amounts of data. And that’s where you could kind of use specific architectures and so on. And by the way, that’s why, say, convolutional networks were so successful for such a long time, right? But then you start scaling, right, and the amounts of data, if you have those amounts of data, they go way beyond what you had before. Your model size goes way beyond what you had before. You scale the number of parameters while maintaining, like, some kind of structure of the network, like scaling the width and depth and so on; there are many kind of important caveats here about how to do it so the scaled model will actually capture the amount of information while you scale the data. So, I mean, there are smarter ways to do that and less smart ways, but say you do it right.

Irina    00:39:08    Uh, now it looks like those priors, inductive biases, become less and less important. And we do have empirical evidence; say, vision transformers at scale, in terms of data, start outperforming convnets. And by the way, that’s why I think looking at scaling laws is so important. You have two competing architectures, you see how they scale, and you see that in the lower data regime convnets are so much better, and in the higher data regime it’s vice versa. And that approach, that kind of empirical evaluation of different methods, architectures, or whatever you compare, by looking at the whole curve rather than a point, one architecture versus another architecture, one dataset versus another dataset, that’s not that informative, all those kinds of tables. The plots, the scaling laws, uh, give you a much fuller picture and give you better ideas, if you can scale, of what types of methods you should invest in.

Irina    00:40:14    And apparently, if you can scale, vision transformers will do better. But still the question remains: what if there are inductive biases, such as maybe those brain architectures and so on, that can improve your scaling exponent? Essentially, what it means when we talk about the scaling exponent is that, um, empirically it’s found that, uh, the performance of models, expressed as, um, uh, basically the cross-entropy loss on the test data, or a loss or classification accuracy on downstream tasks, uh, usually seems to improve according to a power law. You’ve probably seen those papers by Jared Kaplan and his colleagues from OpenAI, and now, I guess, Anthropic, and so on and so forth. And, uh, all those papers show you power laws, which are straight lines in a log-log plot, and the exponent of the power law corresponds to the slope of that line in the log-log plot. And the whole billion-dollar question in the scaling laws field is what kinds of things improve that slope.
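As a rough illustration of what reading off a scaling exponent looks like in practice (synthetic numbers, not results from any of the papers mentioned): if test loss follows a power law in model or data size, the exponent is just the negative slope of a straight-line fit in log-log space.

```python
# Synthetic illustration: loss L(N) = c * N^(-alpha) appears as a straight line in a
# log-log plot, and the scaling exponent alpha is the (negative) slope of that line.
import numpy as np

N = np.array([1e6, 3e6, 1e7, 3e7, 1e8])            # e.g. number of parameters
noise = 1 + 0.01 * np.random.default_rng(0).normal(size=N.size)
loss = 5.0 * N ** -0.07 * noise                    # made-up measurements, alpha = 0.07

slope, intercept = np.polyfit(np.log(N), np.log(loss), deg=1)
print(f"estimated scaling exponent: {-slope:.3f}")  # ~0.07 by construction
```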

Irina    00:41:22    So you get a better improvement in performance for a smaller amount of scaling, therefore cheaper. And scaling, by the way, involves here not just scaling the data and scaling the model size, but also scaling the amount of compute, because you may just even keep the model fixed and the data fixed, but let your algorithm compute more. And sometimes you see very interesting behavior, like the grokking paper from a workshop at ICLR last spring, where they just ran their method for a long time, just kind of forgot to kill it. And then there was a certain phase transition from almost zero to almost one accuracy. They just managed to find some point in the search space. Yeah, it was not intentional apparently, but it happened to be that for that type of benchmark and, uh, architecture they used, it was the case that somewhere in the search space there was a place with an extremely good solution surrounded by not-so-good solutions.

Irina    00:42:28    And if you find that place, you can jump there, and that’s your phase transition from zero to one. Anyway, we are kind of trying to explore those phase transitions recently, uh, with my colleagues at Mila as well. So back to your question, the question of inductive biases versus scaling. As usual, um, I would say maybe inductive biases plus scaling, because certain inductive biases maybe can improve the exponent, and at least you want to explore that; they might be useful. I wouldn’t kind of throw the baby out with the bathwater, as the saying goes. Ah, okay, let’s just scale a multilayer perceptron; yes, of course you can do that, it will just be extremely inefficient. Don’t you want more advantageous scaling laws? And maybe inductive biases can help you get there. So it doesn’t have to be the two camps fighting each other, although I understand it’s more fun to fight than collaborate.

Paul    00:43:36    Sure. It’s more fun to do both, right? Just like scaling and inductive biases. The answer is always both.

Irina    00:43:42    Yeah, it doesn’t have to be either-or; it could be both. The question is what inductive biases help to improve scaling. And that’s a good question. It might be, it depends. Like, um, Jared Kaplan was presenting at our workshop a couple of times; we had two workshops so far, one in October, one just now last week. And again, he mentioned that in his experience, uh, again for that setting, that problem, for GPT-3, um, improvements due to architecture did not seem to be as significant as just kind of scaling, and that’s totally fine. It doesn’t completely kind of exclude the situation where inductive biases maybe, for some other problems, would be much more important.

Paul    00:44:32    How do you explore the full space of architectures, though? You have some, some limited amount of exploration, right?

Irina    00:44:39    Right. Um, that’s a good question. I mean, it’s just, like with everything neuroscience-inspired, it’s such a huge space on its own. And to be completely honest, you cannot just go and ask the neuroscientists, what, in your opinion, is the most important inductive bias that AI people should use? It doesn’t work this way, at least not in my experience, because they say, we don’t know; like, you tell us what you need and then maybe we can think and suggest what kind of inductive bias can best help you with what you need. So what I’ve learned from two years trying to do the neuro-AI project: the idea was, first of all, much less well-defined than AI for neuro, where you take methods and analyze the data; that we know how to do. Here, you, like, search for a black cat in a dark room, and you’re not sure if the cat is there. But I think what helps is interaction with those neuroscientists.

Irina    00:45:49    Let’s say, look, okay, you’re asking what I need for AI: I’d like my system to be able to continue to learn. I mean, well, it has to be learning how to do new tasks and work on new datasets, and it should keep doing this, because I might have my robot walking in the wild and it has to adapt, or I might have my personal assistant chatbot and it has to adapt to my changing mental states, or it has to adapt to other people. So it has to keep looking into different data, environments, tasks, and doing them well. At the same time, I don’t want it to completely forget what it learned before, because it may have to return to it and may have to remember. Um, I don’t really push for an absolute lack of, uh, catastrophic forgetting; fast remembering is fine.

Irina    00:46:45    So, just basically few-shot learning of new things and few-shot remembering of old things would be just fine, after all. I don’t remember, myself, the courses I took in the years of my undergrad, but I could remember them, hopefully. So you want that. How does the brain do it? And then maybe more specific questions: you say, okay, how does continual learning happen? What are the tricks? And then people in AI are actually using those inspirations. I mean, a whole bunch of continual learning methods were inspired by some kind of phenomena in neuroscience. For example, um, kind of freezing or slowing down the change of certain weights in the network if those weights are particularly important for some previous tasks; uh, that was kind of more formalized in work like EWC, elastic weight consolidation, or synaptic intelligence. Also, uh, there are many of those regularization-based approaches, uh, essentially, in continual learning.
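A bare-bones sketch of what such a regularization-based method can look like in code (a simplified EWC-style penalty, assuming a PyTorch model and approximating the diagonal Fisher information by squared gradients of the old-task loss; an illustration, not the exact published algorithm):

```python
# Simplified EWC-style regularizer: weights important for an old task (large Fisher
# values) are effectively "frozen" by penalizing movement away from their old values.
import torch

def fisher_diagonal(model, loss_fn, old_task_data):
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in old_task_data:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(old_task_data) for n, f in fisher.items()}

def ewc_penalty(model, fisher, old_params, lam=100.0):
    penalty = 0.0
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return lam / 2.0 * penalty

# While training on a new task:
#   total_loss = new_task_loss + ewc_penalty(model, fisher, old_params)
# where old_params is a detached copy of the parameters after finishing the old task.
```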

Irina    00:47:53    And that’s kind of one flavor of, uh, well, very abstracted inspirations, I would say, but from phenomena that do happen in the brain. Or replay methods, again. So there’s this classical example of, um, having the hippocampus that essentially forms memories very quickly, but then they get consolidated, say during sleep, and you kind of have this longer-term knowledge in the prefrontal cortex. So, having this complementary learning systems approach. Yeah. So that is another example. Again, it’s very much simplified and abstracted, it has its roots in neuroscience, but then it kind of gives rise to machine learning algorithms like, uh, rehearsal from a memory buffer, so replay, generative replay, and so on and so forth. Um, there is also a third kind of direction that, uh, people usually take in continual learning, um, more like architecture-based approaches, where you essentially expand your network model, expand your architecture, depending on, uh, the needs of your learner.
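As a minimal illustration of the replay direction just described (hypothetical names, plain reservoir sampling; not any specific published method): a small buffer of past examples is mixed into every training batch for the current task, loosely analogous to hippocampal replay consolidating older knowledge.

```python
# Minimal rehearsal/replay sketch for continual learning (illustrative only).
import random
import torch

class ReplayBuffer:
    def __init__(self, capacity=1000):
        self.capacity, self.data, self.seen = capacity, [], 0

    def add(self, example):
        # Reservoir sampling keeps a uniform sample over everything seen so far.
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))

def train_step(model, optimizer, loss_fn, batch, buffer, replay_k=32):
    # Mix replayed old examples into the current-task batch, then do one gradient step.
    mixed = list(batch) + buffer.sample(replay_k)
    xs = torch.stack([x for x, _ in mixed])
    ys = torch.stack([y for _, y in mixed])
    optimizer.zero_grad()
    loss = loss_fn(model(xs), ys)
    loss.backward()
    optimizer.step()
    for example in batch:
        buffer.add(example)
    return loss.item()
```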

Irina    00:49:10    And that architecture-based direction also has its roots, you can connect it to things like adult neurogenesis: that even adult brains apparently do grow new neurons, and when they do, it’s in specific places like the hippocampus, the dentate gyrus of the hippocampus; it’s still happening there. It doesn’t stop, um, at the beginning of adulthood, and yes, there was a dogma until some years ago that adult brains do not grow new neurons. Well, apparently they do in mammals; in rats, they do it in the hippocampus and the olfactory bulb, as you can imagine, because the olfactory bulb is very important for rats and other mammals, because they do use smell quite a bit. And in humans, apparently the olfactory bulb doesn’t matter as much anymore, but we still have some neurogenesis happening in the hippocampus. And what was interesting, we kind of did some work on top of that, a paper about a neurogenetic kind of model, or a very simple version of that.

Irina    00:50:17    But the idea was, all the empirical evidence about neurogenesis in the literature suggests that there is more neurogenesis happening when the animal or the human is exposed to a radically changing environment, like different tasks or different environments in continual learning. So that is associated with more, uh, neurogenesis. If the environment is not very complex and it’s not changing, you kind of don’t really seem to need to expand the capacity of your model. Like, you have some new neurons being born, but apparently they die. So it’s like use it or lose it: if you don’t need extra capacity, you won’t have extra capacity. If you keep challenging yourself and you keep kind of pushing yourself to extremes, to totally new situations, the new neurons will be incorporated and your hippocampus will expand somewhat, I mean, to some degree, of course, as with everything. So it’s, it’s an interesting observation, and it’s associated with possible ideas of expanding architectures in continual learning to accommodate new information that cannot possibly be represented well using the existing model. So yes.
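A hypothetical sketch of the neurogenesis-inspired, expanding-architecture idea (illustrative only; the function names and the growth rule are assumptions, not the model from the paper she mentions): grow extra hidden units when the current network cannot fit a changed environment, and otherwise keep capacity fixed.

```python
# Hypothetical "use it or lose it" expansion rule: add hidden units ("new neurons")
# only when recent loss on new data stays high, keeping the old weights intact.
import torch
import torch.nn as nn

def expand_hidden_layer(layer_in: nn.Linear, layer_out: nn.Linear, n_new: int):
    """Return copies of two adjacent layers with n_new extra hidden units, old weights preserved."""
    new_in = nn.Linear(layer_in.in_features, layer_in.out_features + n_new)
    new_out = nn.Linear(layer_out.in_features + n_new, layer_out.out_features)
    with torch.no_grad():
        new_in.weight[: layer_in.out_features] = layer_in.weight
        new_in.bias[: layer_in.out_features] = layer_in.bias
        new_out.weight[:, : layer_out.in_features] = layer_out.weight
        new_out.bias.copy_(layer_out.bias)
    return new_in, new_out

def maybe_grow(layers, recent_loss, threshold=0.5, n_new=8):
    # Grow only when the environment has changed enough that the loss stays high.
    if recent_loss > threshold:
        layers[0], layers[1] = expand_hidden_layer(layers[0], layers[1], n_new)
    return layers
```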

Paul    00:51:37    So let me just recap what you’ve said there, because you, you covered a lot of ground. So, um, you kind of just transitioned; I was going to ask you about continual, lifelong learning, and you just transitioned into it. And you talked about the three, um, neuroscience principles that, uh, have been, uh, implemented. And the whole point of lifelong learning is that there’s this huge problem in deep learning called catastrophic forgetting, where once the network is trained on one task, if you train it to learn a new task, it completely forgets the old task. Right? And so there’s been this explosion in, uh, lifelong learning methods, one of which is continual learning. Is that under the umbrella of lifelong learning? Because there’s transfer learning, meta-learning, continual learning, and now there’s meta-continual, continual-meta, and on and on. Right. Okay.

Irina    00:52:28    Yes. Okay. So it’s again a question of terminology, like kind of stepping on the same rake of machine learning terminology. Again, uh, I gave a tutorial at ICML last summer. Anyway, actually to me, and I guess to many people, lifelong learning and continual learning are synonyms. And they both just mean that you would like your model to, uh, learn in an online scenario, where online means you get your samples as a stream; I mean, you can get them as a sequence of datasets or a sequence of batches or a sequence of just samples. But the point is you have that sequence of, uh, data, you keep learning, and you do not have the option of keeping the whole history of datasets, uh, or you might even have the option, but you might not want to constantly retrain because it’s not so efficient.

Irina    00:53:31    So continual and lifelong learning, in this approach, are synonyms. Meta-continual, continual-meta, and so on and so forth are still within the umbrella of continual learning, but kind of different formulations of how you might go about training your system to do that. Continual learning, again, as I said, uh, I’m sure people have different definitions. So in my mind, there is, uh, a particular definition of continual learning, which just means online non-stationary learning. By non-stationary, I mean any change to the environment or input data, whether it’s a change of data distribution, a change of task, or both. So, um, as to transfer. Okay, so transfer learning, again. I have the slides in the tutorial, as well as in the class, uh, slides for the continual learning class; they’re all online on my web page. Uh, I gave it for two years in a row, it was a winter semester class, uh, 2020 and 2021, but I’m not giving it this year because I will be teaching neural scaling laws.

Paul    00:54:50    That’s your new thing. Okay.  

Irina    00:54:51    Well, continual learning is related to that. It’s not like, uh, we completely jumped to something unrelated. It is related, uh, but with more focus on, uh, scaling models, data, and compute, with continual learning being a problem that you’re also trying to solve. But back to your question about transfer, meta, and the variations, and so on. First of all, in my slides, well, the slides are based on a very, very nice tutorial, uh, from, I think, 2020, anyway. So the picture defines each of those, uh, problems and shows how they are different and how they are similar. Transfer usually assumes that you have two problems, and by learning on one, you’re trying to, kind of, uh, be able to use that knowledge to do better on the second problem. There is not necessarily any notion of remembering or being equally interested in doing both problems.

Irina    00:56:00    It’s, like, more unidirectional. It’s a question of terminology: to me, transfer is a property of your model and algorithm, and continual learning is a setting in which you would like transfer to happen. Uh, which means, while learning, I always would like to improve, or at least not make worse, my performance on the past, which means backward positive transfer, or at least backward non-negative transfer. At the same time, I’d like to hopefully learn better and faster in the future because I already learned so much. So ideally I would like to have some positive transfer to the future. And that view of, uh, not equating continual learning with the catastrophic forgetting issue, but rather a more general view of continual learning as a problem of maximizing transfer to both the past and the future, that kind of also came out of our joint paper, um, on meta-experience replay from 2019 with Matt Riemer.
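These notions of backward and forward transfer are often made concrete with a simple accuracy matrix; here is a minimal sketch (standard-style metrics with made-up numbers, not results from the paper just mentioned), where R[i][j] is performance on task j after training on tasks 0 through i.

```python
# Backward transfer: how performance on old tasks changed by the end of training
# (negative values mean forgetting). Forward transfer: how much earlier training
# helped on a task before it was ever trained on, relative to a naive baseline.
import numpy as np

def backward_transfer(R):
    R = np.asarray(R)
    T = R.shape[0]
    return np.mean([R[T - 1, j] - R[j, j] for j in range(T - 1)])

def forward_transfer(R, baseline):
    R = np.asarray(R)
    return np.mean([R[j - 1, j] - baseline[j] for j in range(1, R.shape[0])])

# Made-up example: 3 tasks, row i = accuracies measured after training on task i.
R = [[0.90, 0.40, 0.35],
     [0.85, 0.92, 0.45],
     [0.80, 0.88, 0.91]]
print(backward_transfer(R))            # about -0.07  -> mild forgetting
print(forward_transfer(R, [0.33] * 3)) # about  0.095 -> earlier tasks helped a bit
```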

Irina    00:57:10    And I very much kind of support that view, the more general view of continual learning, and especially when it comes in the context of not just supervised or unsupervised, but continual reinforcement learning as the ultimate continual learning setting. Yeah. So that’s kind of where these different aspects may align and come together, uh, to build, kind of, the same, I dunno, one big picture of what we’re trying to achieve: an agent that can adapt and still be able to return back to what it learned, maybe, right, retraining slightly, in a few-shot way. So, like, being flexible, avoiding, uh, forgetting, and adaptation that is fast, with aspects such as transfer playing the key role. Meta-learning is essentially very similar to continual learning, but it does not assume that your different datasets came in a sequence; they’re available to you at the same time. And you try to learn from them a common model that will be very easily adaptable to any new environment or dataset.

Irina    00:58:27    Um, presumably from the same distribution of datasets. So that’s essentially meta-learning. And by the way, out-of-distribution generalization is another related field, which is, to me, essentially zero-shot meta-learning, because the out-of-distribution setting is: give me multiple datasets or environments, and I will try to learn a model that, um, basically distills some common invariant robust features, or in general a common invariant robust predictor. So that next time you give me a dataset for testing that is different from my training, it’s out of distribution, but it shares those invariant relationships, which are essentially closely related to the causal relationships; if you give me that, I will do well on it. So it’s an extreme case of meta-learning, because meta-learning will tell you: I want to do well on that out-of-distribution dataset, just give me a few samples. So all this terminology, in a sense, comes together and has many shared, uh, aspects. It’s just, uh, as I said, an unfortunate Tower of Babel situation in machine learning that makes it difficult. The set of ideas is much lower-dimensional than the ambient dimension in terms of our terminology. So some machine learning terminology compression is long overdue.

Paul    00:59:54    Well, it’s also, yeah, it’s also the variety of problem statements, but I want to ask you about that in a second. Because, so, just backtracking once more, you talked about three inspirations from neuroscience to help with continual learning. One was, uh, the variability in plasticity: say you, uh, you want to remember a certain task, uh, there’s evidence that those, um, those synapses form stronger connections and are less likely to, uh, change, right? And so you lower the plasticity moving forward so that you can, uh, maintain good skills on a certain task. Right? You also talked about the complementary learning systems approach, uh, using, you know, these two different learning and memory mechanisms. One is fast and, uh, very specific, and that’s associated with the hippocampus, and then there’s this, uh, slow, um, generalizable learning happening, and that’s associated with the, the neocortex. And of course, uh, you mentioned replay, which has been used, and, you know, I think originally with DQN to solve the Atari games, um, and

Irina    01:00:55    So much much before  

Paul    01:00:59    That’s. Okay. No, thank you for correcting me. I knew I was right when I said originally, I was like, Ooh, that’s not  

Irina    01:01:07    Good,  

Paul    01:01:09    But, uh, okay. And then the third was, um, essentially inspired from neurogenesis, which is the idea that, uh, especially in the hippocampus of an adult, you can continue to form new neurons and that this might help us informing new memories. So what I wanted to ask you about, um, is the issue of the facts in neuroscience, continuing to change because it’s an empirical science, right? So I thought that recently there was backlash against this idea of neurogenesis and adults. And, you know, it’s like one year milk is good for you the next year. It’s really terrible for you, uh, based on new evidence. So there are these, I didn’t know that the evidence was still pointing toward neurogenesis as a concrete fact, right. It’s because we’re always finding out new things. So the story is always being revised in neuroscience. And, um, I’m wondering if that affects, uh, your own approach, because if we hang our hat on neurogenesis and it turns out we actually don’t create new neurons or something, um, you know, then how do we move forward from there? Well,  

Irina    01:02:08    If you think about the computational, uh, I don’t know, machine learning kind of side of that: even if it happens that in nature there was no neurogenesis to start with, we don’t care. We have an algorithm that works better.

Paul    01:02:24    Yeah. You can always go back to that. Now you’ve, now you have your AI hat back on. So

Irina    01:02:28    Exactly. That’s why I never have only one hat; I have multiple. And I actually don’t even think it’s effective to have a single hat, ever. Although I understand it’s kind of, people tend to have one hat because we are, so, I don’t know, we’re so kind of tied to the notion of our identities, including identity in terms of scientific views, which is so dear to our hearts. But in reality, the notion of identity in general is much more vague than we might think, but I don’t want to go into philosophy here because it takes a long time and we don’t have the time; it’s a separate conversation anyway.

Paul    01:03:21    Okay. So, um, so then coming back to all of these different problem statements, and how you were saying, you know, there are different definitions, but really they’re all kind of converging, um, onto, like, a, uh, a lower-dimensional problem space, given the various approaches to training and testing and meta-learning, you know, all the things that you just covered. Do we actually understand how humans learn well enough, the developmental literature, like psychology, like, uh, do humans do all of these different problem statements, or, is there? So I know that there’s a little bit of work, at least, on, you know, humans actually showing a little bit of catastrophic forgetting in certain circumstances. But do we actually understand human behavior enough, and how humans learn, not at the neural level but at the behavioral level, um, to map that onto these different continual learning, meta-learning, transfer learning settings? Do humans do all of that? Do humans do some of it? You know, does that make sense?

Irina    01:04:21    It definitely makes sense. Well, first of all, I mean, there are many people who are kind of focusing on this type of studies in developmental, uh, neuroscience. Um, so I really would love to read more about that; as you just mentioned earlier, uh, no human being, unfortunately, can be on top of all the literature in all the fields related to one’s own. There is lots of interesting recent work as well, and I have some colleagues at UdeM working on that, uh, for example in the psychiatry department, and then, uh, Elaine Miller in the neuroscience department. Uh, there is a lot there in terms of, okay, not looking at the neural level, just looking at the behavior and asking the question whether humans do transfer learning or continual learning in particular settings. I think yes, because in a sense the notions of particular settings in continual learning, they came out of, uh, researchers thinking about, like, what do we do in real life, in different situations?  

Irina    01:05:24    Say we have robotics, uh, what are the kind of scenarios there? The robot moves from one room to another room, the environment changes. All the settings, and specific versions of them, like, oh, now it’s not just that the room changed, but I gave it totally different tasks. They all came out of our anecdotal observations of our own behavior, of, like, just the common sense knowledge in our heads about what people do. So in a sense, yeah. I mean, whatever you have right now in the continual learning field, for example, or transfer, it came out of our knowledge about human behavior, because in a sense, where else would it come from?  

Paul    01:06:06    Well,  

Irina    01:06:08    Yeah. So in this, in this sense, yeah. But, uh, more on that, like, more specific studies about how it’s being done and what affects that, what kind of makes it better or worse; for that, we would need to talk to our colleagues in psychology and neuroscience and read more about that. And, as I said, I mean, I’m not claiming I’m completely on top of any literature  

Paul    01:06:34    That’s infinite hats. Well,  

Irina    01:06:37    That would be, that would be the ultimate goal. But for that, first I need to create AGI, then you connect with the AGI, and then you have your augmented brain, which can finally stay on top of all that literature. That’s my true motivation.  

Paul    01:06:55    All right. All right. Okay. So I have one more question about, um, about continual learning and then we’ll begin to wrap up here. Someone in the Brain Inspired Discord community, uh, asked whether the learning trajectories of artificial networks have an impact on continual learning, right? How much a network retains and how much it forgets over successive tasks and training. Have you studied the learning trajectories at all? Because that’s something that’s being looked into in deep learning theory these days, uh, as, uh, something that matters.  

Irina    01:07:30    Yeah. The learning trajectory. Uh, okay. It depends on several things. It depends on, say, the sequence of tasks, like curriculum learning, which can matter a lot, indeed. Um, actually it matters not just for continual learning; it matters for, say, adversarial robustness even. It matters for various aspects of the end model, like what it consists of, what data was given to it and how it arrived there. But of course it also depends on, say, the particular optimization algorithm, because basically a different trajectory leads you to a different state in the weight space. It leads you to a different model, or think about it as a different, yeah, different artificial brain. Uh, and, uh, of course they will have different properties in terms of forgetting, and so on. I guess when we were looking into, again, I’m thinking about this meta experience replay paper with Matt Riemer and others, that obviously, obviously the trajectory matters.  

Irina    01:08:34    The question is, how do you, how do you know what local kind of constraints, or local kind of objectives, you should use, uh, to push the trajectory in the desired, uh, kind of direction? Because, like, all we can do is just use kind of, uh, some local information, just like a gradient, right? And I guess, things that change trajectories: I mean, as I said, data can change the trajectory, all kinds of regularizations can change the trajectory. So, among the things that change the trajectory a lot: you have the standard objective you’re training on, and you’re trying to optimize for that, and then you add things. One example: say you say, I really would like to make sure that I have positive transfer, or at least not negative transfer. Let me add, as a constraint, the dot product of gradients on new and old samples.  

Irina    01:09:36    And I want the things to be aligned. I do not want my new gradient to point in the direction opposite of the gradient on previous samples, because that would mean I will be decreasing performance on the past task; I will be forgetting. So I’ll try to add, at least locally, a regularizer (again, like in the meta experience replay paper with Matt Riemer, for example, just one example) that will push my weight trajectory in that direction, and so on and so forth. So basically, any regularizer you put there embodies some desirable features, but without, of course, any guarantee, because it’s all local; it will change the trajectory. So in a sense, the whole field of continual learning, at least the regularization-based part of the field, it’s all about changing trajectories, and thereby changing the final solution.  
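
For readers who want to see that gradient-alignment idea in concrete form, below is a minimal sketch in the spirit of GEM/A-GEM-style constraints: if the gradient on the new batch points against the gradient on replayed old samples, the conflicting component is projected out. It is an illustration under assumed names (model, loss_fn, new_batch, and memory_batch are hypothetical), not code from any particular paper discussed in the episode.

```python
# Illustrative sketch only: an A-GEM-style gradient alignment step.
# All names (model, loss_fn, new_batch, memory_batch) are hypothetical.
import torch

def flat_grad(model, loss):
    # Flatten all parameter gradients of `loss` into one vector.
    grads = torch.autograd.grad(loss, model.parameters())
    return torch.cat([g.reshape(-1) for g in grads])

def aligned_step(model, loss_fn, new_batch, memory_batch, lr=0.01):
    x_new, y_new = new_batch          # current-task samples
    x_old, y_old = memory_batch       # replayed samples from past tasks

    g_new = flat_grad(model, loss_fn(model(x_new), y_new))
    g_old = flat_grad(model, loss_fn(model(x_old), y_old))

    # If the new gradient points opposite the old-task gradient,
    # project out the conflicting component so the update does not
    # increase the loss on past samples (a local fix, no global guarantee).
    dot = torch.dot(g_new, g_old)
    if dot < 0:
        g_new = g_new - (dot / torch.dot(g_old, g_old)) * g_old

    # Manual SGD update with the (possibly projected) gradient.
    offset = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            p -= lr * g_new[offset:offset + n].view_as(p)
            offset += n
```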

Paul    01:10:33    So, is there good theory about how to do that, really?  

Irina    01:10:37    Again, I don’t want to claim... I mean, there is various work; there were papers that tried to theoretically understand continual learning algorithms, but for specific types of them. There’s this, um, what was the name of it... gradient descent, and, um, okay. There is various work, but, uh, to me, continual learning is still a field that is lagging behind quite a lot in terms of theoretical understanding.  

Paul    01:11:06    I was going to ask what your outlook is for continual learning. Is solving continual slash lifelong learning the same as solving AGI? And, you know, are you optimistic about it? What is the normal number that people say, 20 years? And that’s when we’ll have solved everything; it’s always 20 years or so, right? Oh,  

Irina    01:11:26    Some say less than that.  

Paul    01:11:29    What’s your number,  

Irina    01:11:30    Ten.  

Paul    01:11:32    10 years for lifelong learning. Yeah.  

Irina    01:11:35    Okay. And AGI  

Paul    01:11:37    Does that equate to AGI, or when we solve lifelong learning, is that solving AGI? Okay.  

Irina    01:11:43    Again, differences in terminology. I apologize to people who might, like, really strongly disagree with me. And I know some people who are all going to say...  

Paul    01:11:56    Let’s drop some names.  

Irina    01:11:57    I know, no, no, no, no, we’re not going to do that. But those people know. Well, there is also this kind of view, uh, which is still an open question in the field, about what it takes to solve continual learning and whether it equates to AGI and so forth. So if we assume that AGI, for all practical purposes, is general artificial intelligence, where by general we mean it’s versatile, broad, it can kind of learn to solve pretty much any task that is, as people often put it, economically valuable. So say AGI is a kind of model that can solve all economically valuable tasks, say as good as, or better than, a human. Uh, something like that. Uh, the question is, if you kind of put that agent in the wild, it will have to do continual learning, right? So that needs to be kind of solved. The question is whether you approach solving it by trying to train that agent in a continual learning manner, or, as the scaling, um, crowd will tell us, or at least part of the scaling crowd, not to overgeneralize, maybe it’s enough just to really pre-train a humongous, well, foundation model on multimodal data: not just language, not just video, uh, not just images, multimedia, or perhaps even all kinds of time series data. And once you pre-train it, it has essentially solved continual learning.  

Irina    01:13:31    We had this question discussed during the workshop, and it’s an ongoing debate. What I would say is that for any fixed set of possible tasks that you will give a continual learner (like, for example, a recent submission to ICLR, uh, on, um, scaling and continual learning), for a fixed set of tasks, yeah, sure: scaling the model, scaling the amount and, uh, diversity and complexity, or information content, of the pre-training data will at some point exceed the complexity of that fixed set of tasks, and yes, you will solve catastrophic forgetting. You can capture the information of all those tasks and do all of them well. But if the stream of tasks in continual learning continues growing, right, infinitely, will your pre-trained model hit the wall at some point or not? That’s a good question. And I think it’s an interplay between the model capacity (I always keep saying that, also in my tutorial on continual learning), the capacity of the pre-trained model that you learned, and the complexity of the unseen part of the universe that your agent will have to adapt to.  

Irina    01:14:50    And I would say that what you really need to look into would be relative scaling: how your model capacity, which depends on size, architecture, and the information content of the data you’re trained on (which in turn depends on the amount or number of samples, but other things too), scales with respect to the complexity of the downstream tasks. So to me, relative scaling laws would be the most interesting thing to dive into. And I think it makes sense. It’s always a trade-off of capacity versus complexity, just like rate distortion theory and information theory and so on. And you want to find the minimum cost and minimum capacity agent that’s capable of, kind of, working well, conquering the complexity of whatever future tasks that agent will be exposed to. But if the agent hits the wall, the agent will have to have the ability to expand itself and continue learning. So to me, continual learning is the ultimate test for anything that is called AGI.  
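
To make the “relative scaling” point a bit more tangible, here is a toy sketch of the curve-level comparison discussed above: fit a power law to loss-versus-model-size measurements and compare methods by their fitted exponents rather than at a single model size. The numbers and variable names are invented purely for illustration; nothing here is data from the episode or from any specific paper.

```python
# Toy illustration only: fitting a power-law scaling curve to made-up data.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, c):
    # Loss as a function of model size n: a * n^(-alpha) + c, where c is an irreducible loss.
    return a * n ** (-alpha) + c

# Hypothetical (model size, test loss) measurements for one method.
sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses = np.array([3.10, 2.74, 2.45, 2.21, 2.02])

params, _ = curve_fit(power_law, sizes, losses, p0=[10.0, 0.1, 1.0], maxfev=10000)
a, alpha, c = params
# A larger alpha means the method improves faster as it is scaled up.
print(f"fitted scaling exponent alpha = {alpha:.3f}")
```

Comparing two architectures would then mean fitting such a curve for each and asking which exponent (and offset) wins in the data and compute regime you care about, rather than comparing them at one point.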

Paul    01:16:03    Say, that sounds like incorporating principles of evolution into it.  

Irina    01:16:09    Okay. So any pre-trained model may hit the wall, and I believe it will have to keep evolving, and if it won’t be able to evolve itself, too bad.  

Paul    01:16:22    Okay. Irina, this has been fun. I have one more question for you, and that is, uh, considering your own trajectory: if you could go back and start over again, would you change anything? Would you, um, change the order in which you learned things, or the order of your jobs? How would you start again?  

Irina    01:16:42    Ah, that’s a very interesting question. I’m not sure I have an immediate answer to that, but in one of those realities, I might have taken a totally different trajectory from the one I’m on right now. I probably would have been skiing somewhere in Colorado, working as a ski instructor.  

Paul    01:17:10    Oh, I do all of that, except work as a ski instructor. You should try what I did. Okay, I just, uh, just went, uh, two days ago. Yeah.  

Irina    01:17:20    Look, Tremblant is pretty good too, so it doesn’t matter, some good mountain, and just,  

Paul    01:17:27    I know. Why don’t you come visit me and we’ll go ski, and we’ll see if we can change your trajectory.  

Irina    01:17:32    And you’re welcome to visit   

Paul    01:17:36    All right. Very good. Well, I really appreciate the time. Uh, so thanks for talking with me.  

Irina    01:17:41    Thank you so much for inviting me. It was fun.  


0:00 – Intro
3:26 – AI for Neuro, Neuro for AI
14:59 – Utility of philosophy
20:51 – Artificial general intelligence
24:34 – Back-propagation alternatives
35:10 – Inductive bias vs. scaling generic architectures
45:51 – Continual learning
59:54 – Neuro-inspired continual learning
1:06:57 – Learning trajectories

Speaker 1 00:00:03 We are not the first one asking the question about what intelligence is. People think there's a first one to ask the question or to build something for them, something, and it's not the first time to put it mildly. It's always trained on capacity versus complexity, and you want to find the minimum cost and minimum capacity agent that's capable to Concord the complexity or whatever future tasks that agent will be exposed to. But if the agent feeds the wall, the agent will have to have the ability to expand itself and continue learning what they've learned from two years, trying to do the new AI project. The idea that it was first of all, much less well-defined then AI for new era here, you like search for a black cat in the black room, and you're not sure if the cat is there. That's Speaker 2 00:01:11 This is brain inspired. Speaker 3 00:01:25 Hey everyone, it's Paul happy holidays. I hope you're. Well today I speak with Irina Reesh, who is currently at the university of Montreal and also a faculty member at Mila Quebec AI Institute. And I wanted to have Irina on for multiple reasons. One of which is her interesting history, uh, having been kind of on both sides of the AI and neuroscience coin. So she's also worked, uh, at IBM as you'll hear working on healthcare and also in neuroscience inspired AI. And we have a pretty wide ranging conversation about much of her, uh, previous and current work. So we talk about, uh, her work on alternatives to the backpropagation algorithm. And we talked about her ongoing work on continual learning, which is kind of a big topic in AI these days. So as you probably know, uh, deep learning models suffer from what's called catastrophic forgetting where when you train the model to do one thing really well, uh, and then you train it to do another thing. Speaker 3 00:02:25 It forgets the first thing and humans don't suffer from this problem. And it's an important problem to tackle moving forward, uh, in deep learning. And we discussed many of the methods being used to try to, uh, solve continual learning and some of the inspirations from neuroscience along those lines. We also talk a little bit about scaling laws, which is roughly the relationship between how big and complex a model is and how well it performs over a range of tasks. We also talk about definitions and Irina's definition of artificial general intelligence and how she views the relationship between AGI and continual learning. And we talk about a lot more, so you can learn more about Irina in the show notes at brain inspired.co/podcast/ 1 23 on the website. You can also learn about how to support the show on Patrion and possibly join the discord community of Patrion supporters who are awesome. Speaker 3 00:03:21 Thank you guys. And thank you for listening. Enjoy, you're actually kind of a perfect fit for this podcast, uh, because on the one hand you have a background in a lot of, uh, computer science and I guess your early work was in applied mathematics. So you kind of come from that side, but I know that you're interested in using, among other things among the many things that you're interested in in using some principles and ideas in neuroscience to help, uh, build better AI. So could you just, um, talk a little bit about your background and how you came to be interested in, uh, being inspired by, by brain facts and et cetera? Speaker 1 00:04:00 That's a very interesting question. Indeed. Sometimes I ask myself and I tried to dig into the past. 
The question is called for in the past. We want to go, uh, indeed, um, a couple of years ago I joined Mila and university of Montreal. But before that I was at DBM research and I was there for quite a long time initially in their department of computational biology where I did indeed focus on neuroscience, neuroimaging and applying statistical methods, machine learning, AI to analysis of brain data. So that's mainly where I kind of really got, I guess, deeper into neuroscience, psychology psychiatry type of topics. And I'm still collaborating with a group that's my kind of long-term collaborators in computational psychiatry and neuro imaging, uh and his, um, uh, his friends. So that was really exciting. And that's, I guess, where I actively was, uh, pursuing this idea of the intersection between AI and neuro. But I think the interest in that intersection started long time before I even joined IBM. And I usually I realized that I could track it back to my, uh, I think, uh, elementary or even middle school years. I used to go to muscle Olympics in Russia. I don't know if it's interesting it's at this too long of an answer. Speaker 3 00:05:30 Well, let me ask you this. So, so when I was in college, there wasn't even a neuroscience program, um, uh, degree, uh, and I don't know that I would have been about, you know, I started in aerospace engineering and then moved on to molecular biology. And I don't know if neuroscience was available as a program, whether I would have actually chosen it. But, uh, so I was going to ask if that's what was limiting, if you had that kind of kernel of interest, why then go into applied mathematics? Uh, that was your first degree, right? Speaker 1 00:05:58 Right. Yeah. I mean, I probably should explain and did, uh, what I wanted to mention, uh, the reason was from practical perspective, I was going to mass Olympics and I was quickly realizing that, um, you don't have that much control over your brain and like, you want to solve a problem. And then you kind of hitting the wall. It's like, what's going on there. And sometimes it works. Sometimes it doesn't, you want to understand like why and how to make it work better. And then you see people around you, some of them struggling unable to solve anything. Some of them are able to solve like more than you do. And again, you wonder what the difference, what does it take? Then you start reading books, like, uh, polio, how to solve it, how to learn, how to solve problems and this and that. But it really came from very practical goal. Speaker 1 00:06:49 Like I need to figure out how to solve all those problems quickly because I want to be in Delhi. So what do I do? My brain doesn't seem to be cooperating. So what should I do? Like how do I make it work? So you start digging into how to make brain work. And then you run into books accidentally, which say, well, whether machines can sing, it was Russian translation. I guess, of the famous student's work. That gets me into thinking about AI when I'm like 14 or something. That's why I go to computer science. Uh, the closest to computer science at that point, uh, kind of in Russia was like applied math, essentially applied math slash computer science, but formally applied maths. Um, and that's kind of, that's how it goes from there. So like my focus on computer science and AI actually came from them really very practical goal. Like, I need to understand how this brain works, so I make it work better. That's pretty much it. 
Then you realize that it's like biology, psychiatry, neuroscience, and many other things that study brain, like whatever goes. Speaker 3 00:07:57 Okay. So, so at IBM then you were part of a team, I think you said you were in computational biology division and say you were a part of a team that was sympathetic to, uh, using principles from biology to help make machines better. Right. Speaker 1 00:08:13 Okay. The focus of the department was not on machines. The focus was on health care. So the focus was on how to make, uh, humans cross here. Uh, that was conditional biology in your science and neuro imaging, kind of, um, um, groups focus. So it's not the focus of that group was not really on AI. And I was kind of back and forth between focus on AI and computer science and machine learning to biology. And back originally I started in machine learning group, uh, for distributed systems, uh, changed names a few times. And I moved to this competitional biology center. And then in the past few years before moving to Mila, I moved again from computational biology department to AI department of IBM. So, as I said, I was kind of iterating between the two multiple times. The focus in those past couple of years before moving to Mila was indeed to bring, uh, kind of neuroscience, inspirations and ideas to improve AI. So that was my, my latest focus on IBM was indeed on what you're asking about. And that was new AI, uh, kind of, uh, project between IBM and MIT. Uh, that was going on. That kind of remained part of my focus when I joined Mila and was part of their direction for that, um, seven year program, uh, that I'm leading the Canada, excellence is a chair. Then your AI component is one of the kind of, uh, excess along, which things are supposed to be developing. Speaker 3 00:09:56 We'll see. Okay. So this is what I wanted to get to because, um, I, I'm curious about your experience about your colleagues and their sort of opinions about using, uh, neuroscience or neuro inspired, um, tools, uh, to build better AI because that's very much the industry side of it. And you have a lot of like passionate opinions about whether we should be looking to the brain to build better AI or whether we should just continue to scale with what we have. Right. All right. Speaker 1 00:10:24 I know both sides really well. We just ran our scaling laws workshop last weekend from blond. Yes. Speaker 3 00:10:31 Okay. So, so what is your IBM colleagues, if you feel free enough to talk about it, what was kind of like their reception of your focus on using neuroscience? Speaker 1 00:10:40 Uh, again, the colleagues in AI department, her colleagues in the bio department? Well, I think actually this whole new era AI new era for AI and AI for are all these ideas actually glue out my multiple interactions and discussions with my friends at a competitional psychiatry and neuroscience department. And primarily you share machete. And I think what really also helped to shape my views was the introductions, which enrolled, not just discussing, uh, technical aspects of AI or technical aspects of brain imaging, or even your science, but there was most of philosophy in those discussions because luckily Jeremiah had first degree in philosophy and the second in physics and then went to neuroscience. And I think that really made him like stay stand kind of apart and had of many colleagues. So I'm really, really, really grateful for those discussions because they helped me to shape my views as well. 
Speaker 1 00:11:44 So what I'm trying to say that while the healthcare department, the combined department was focused on healthcare applications, the idea of using neuroscience and biology inspirations for improving AI, um, was very exciting for at least several people there. And it's still exciting. And we would like to kind of do more along those lines. When I moved to AI department at IBM, uh, again, it was kind of a mix of opinions because just like in the field in general. Um, and I agree with that view, it may not necessarily be the case that the only path to intelligent, uh, artificial intelligence systems is, um, mimicking the brain more over like even my colleagues, neuroscientists never said that we have to mimic the brain. Their freaking question is like, what is the minimum part of those inspirations that might be necessary to transfer? And that's just like, gives us example about airplanes, not flooding their wings, right? Speaker 1 00:12:54 So you don't have to copy the biology. Exactly. And yet you want to come up with some common laws that govern the flight, right? Um, some aerodynamics of intelligence in our case. And that's the tricky part. And that's, I think where everyone agrees. So nature found the solution. Uh, there are some properties of that solution that might be specific to evolution and nature, and perhaps we can obstruct from them, but there are some parts. The question is which ones that are probably in variant or necessary for any intelligence, finding those invariant properties is a good open question. And I think that's subconsciously everybody doing AI is trying to do that, but I definitely, I'm not in the camp of first. You need to completely understand how brain works and only then you can create artificial intelligence. I don't think so. Just like with airplanes and engineers there, they don't have to be like a biologists understanding birds perfectly. Right, right. You need to understand enough, but the good question is what is enough Speaker 3 00:14:10 Interesting because we're going to talk about a few of the, um, neural inspirations that you have focused on. And in some sense, um, I don't mean this is not an insult at all, but it's almost like, uh, we are sometimes cherry picking and just trying kind of one thing at a time. We think this might be important thing that might be important. Um, when what we, what we could do, which you say is not the right path. And I agree as we could, uh, instead of cherry picking, you know, these, these facts that we're going to discuss in a little bit, um, you really could go more all in and try to, you know, there are people trying to quote unquote, build a brain, right. And they're still having, you know, to abstract out a lot of things. But, um, but that push would be to build in more that we know about the brain that rather than less, it seems so. Um, but I want to ask you, uh, before we move on about philosophy, I happened to see a panel that you were on. I don't remember the source. It may have been a, Speaker 1 00:15:11 The main conference. We had an interesting discussion about philosophy therapy. Speaker 3 00:15:17 It didn't go that that far, but you, you, you got into a back and forth a little bit with Syria Ganguli and, uh, who finds philosophy useless, and you made the push that, uh, in fact it is useful. So I just wanted to, Speaker 1 00:15:31 It's useless. You can learn from anything, but let's not go there. Speaker 3 00:15:38 Okay. So, so you don't want to, um, make a, uh, a case for philosophy. 
Speaker 1 00:15:44 I can make useful philosophy. I, um, I think what happened there, maybe it was as usual, by the way, it's not specific to the panel. Uh, people mainly disagree because of differences in their kind of definitions or interpretations of terminology. And unfortunately that's a universal problem. And the field, like many concepts are not well-defined. And in general, I mean, that's the main reason people argue because when people actually nail down details of what they are being for, or against surprisingly many cases, they all agree. So I think the, the problem was what people, uh, understood as philosophy when I say the word philosophy and it means something to me, it probably meant different things to different people. So they were really arguing, not with my point, but they were arguing with their own understanding of the world. Yeah. That's why, because I don't think Syria or anyone else will disagree in general that if you have different disciplines, whether it's philosophy or neuroscience or psychology or psychiatry or any, any, any type of discipline that studied mind and its function, how it works and what are the mechanisms at different levels in different ways and philosophy is one of them. Speaker 1 00:17:13 And even more, it's not just the whole Sophie. I mean, you can think about Buddhism. And I brought this example to me, it's empirical science of how mind works, which has several thousand years of knowledge accumulated in very different terminology and so on. Uh, but that's, that's a data that's that's knowledge accumulated by people from observations, right? So there is some truth to it. And the question is like how to deep that through out, since we coming from different fields, use different terminology again, how do you translate philosophy and Buddhism to machine learning slang and sense? So people will understand it, not everything there might be relevant, but we are not the first one asking the question about what intelligence is, what usually amazes me. And again, I don't mean that doesn't solve, but, and plus it's very natural. It's always happens, but people think there's a first one to ask the question or to build something for something. And it's not the first time to put it mildly. There are many bright minds that for many years, we're facing similar type of questions just in different circumstance. So I think it might be useful to learn more about what they found. Speaker 3 00:18:42 It does seem to be a recurring theme that, um, there'll be a hot new trend. And then it turns out a hundred years ago, someone already had written basically the answer, you know, and laid out the groundwork, the groundwork for it that, uh, then, then we go back and, uh, something that we resolved, uh, had already essentially been solved. Speaker 1 00:19:03 It is. I mean, it's not specific to our field or our time. Right. It's always been like, that's probably always going to be like that. Uh, but, uh, that's just why I mentioned philosophy. And, uh, also, I mean, I know, I know I essentially meant the same thing that Supriya was saying himself, that we are trying to, um, kind of discover the Kuvan laws behind intelligence, uh, whether biological or artificial and kind of pushing it forward common laws behind how mind works or how it could work and, um, how you can kind of affect it in different ways. So it works differently. 
And I think 80, any source of knowledge about like people asking similar type of questions and finding whatever answers, any information like that you can learn from all these data. All I actually was suggesting that, yeah, let's try to learn from all data input data being different disciplines, Speaker 3 00:20:13 But okay. So there, there's a problem here, right? Where, um, throughout the years, all the different dif disciplines have continued to progress and it is essentially impossible to be an expert in all disciplines. So how, you know, what's the right, Speaker 1 00:20:27 That's why we need AGI and he'll be an expert in all of that. Speaker 3 00:20:32 And they can tell us which disciplines we need to learn, but we won't be Speaker 1 00:20:35 The knowledge for us and conveyed to us in understandable manner. I'm just quoting that young, short scifi story from nature, but it's only half joking. Speaker 3 00:20:49 Yeah. Yeah. Well, so, so you are interested in, um, that that's a goal to build AGI and we're going to talk, uh, about lifelong learning and a little bit, I want to ask you about backpropagation first, but would you say that's one of your, uh, Speaker 1 00:21:03 Uh, yes and no. AGI is not the final goal in itself. It's an instrumental goal. The final goal, as I was always putting AI as augmented, rather than artificial intelligence, to me, just the goal of building AGI never felt truly motivating. Like why do I care about machines? Speaker 3 00:21:32 Well, do you know what AGI even is? I don't really know what AGI is because that's another thing where people have different definitions. Speaker 1 00:21:39 Yes. It's one of those terms and machine learning, which is not well-defined. And I know that's a that's creates lots of confusion and there is had two debates in the Mila on the topic of AGI. There are different definitions and different people again, mean different things. One practical definition could be just stick to the words, let's say it's artificial general intelligence. So junior role means capable of solving multiple really multiple problems. To me that means general broad, versatile, which relates to continue learning or the learning or transfer learning, but kind of push to extremes. So like truly versatile AI that can do while, at least pretty much anything we can do, not a narrow opposite of narrow, broad general. So that can be just a relatively clear definition, at least to me of what AGI would stand for. There are many other definitions and we probably could write like a list of different ones, but I think, yeah, you're absolutely right. It's not the term. It's not the mathematical term. Speaker 3 00:22:59 Do definitions matter Speaker 1 00:23:01 Definitions matter. I mean, yes and no again, so you can have different definitions. What matters is for people before they start kind of working together on something or discussing something to agree on definitions. Because again, the main reason for debates, sometimes unending debates is at the core that people did not agree on definitions. And what comes to my mind whenever I listened to machine learning, people debating something or pretty much anyone debating anything, the picture of the elephant and seven blind men touching different parts of the elephant and saying, no elephant disease, no you're wrong. And funders that no, it's, you're wrong. And they're all right, and nobody's wrong, but they didn't agree on definitions and they don't see the full, full picture. 
Speaker 3 00:23:51 That's all I've come to think that the purpose of debates is to talk past one another and not progress at all. Speaker 1 00:23:58 That's not the purpose to me. It's a sad reality. Yeah. You can do that. You will probably have some fun, um, maybe for some limited amount of time and then pretty much you just wasted the time and everybody moved on. So what was the point? I don't know. I mean, Yes. If you try to learn something or converge to something or make some progress, then probably not. Speaker 3 00:24:28 Okay. So you and I agree. That's good. We don't need to debate that issue then. Okay. So, um, you've done work. One of the ways that you have fought to use neuroscience is, uh, on the question of backpropagation and, um, maybe before, because you had, you've done work on what's called auxiliary, uh, variables like a backpropagation alternative. Um, so I'd like you to describe that and you know, where that came from, but before doing that, um, could I, cause we've talked about backpropagation multiple times now on the podcast, I had Blake Richards on way back when, um, you know, um, uses the morphology of neurons and the, uh, electrically decoupled, uncoupled, um, apical, dendrites, and blah, blah, blah, burst, firing, et cetera, as an alternative. And now, you know, there's this huge family now of alternatives to backpropagation. Um, I'm curious about your overall view on, uh, that progress, that literature. Speaker 1 00:25:29 Yes. Uh, yeah, that's very good question. And actually, in fact, we are working right now with a group of people at Mila and outside of Mila, um, on, uh, extending different target propagation. So basically that line of work is still going on, although it was a bit in the back burner for awhile, and there are as usual, at least two motivations here, whether you come from neuroscience and you try to come up with a good model of how, uh, essentially learning happens at the brain, um, basically how they are created assignment for mistakes happens in the brain. And whether by propagation is a good model for that, or you can come up with a better model. So this is one motivation and many people who kind of are less concerned with, uh, competitive performance of alternatives to backpropagation and more concerned with really understanding how it works in the brain. Speaker 1 00:26:26 They focus on that. And I also totally agree with that view, as I said, I mean, there is no contradiction once you clearly state what your objective is, you cannot say that they are wrong or you are right, the vice versa because they just optimizing different function. They want to answer the question, how we best model what happens in the brain. Their objective is not to beat you on amnesty, insofar as long as we all agree on what objective is, it's not wrong. It's interesting line of research and that's kind of initially what motivated, uh, also work on, um, beyond back probe kind of just trying to understand things better. And Blake is definitely, uh, doing lots of things in this direction and other people, but on the other hand, there is, uh, another objective. Like if you come from the point of view of AI person who says, okay, I understand I want my analogies with brain, if, and only if it helps me build more effective, more efficient algorithm. 
Speaker 1 00:27:31 So when you come from that objective, you can start wondering purely computationally, what are the limitations of backpropagation and, um, what could you do differently or better and how to solve those kinds of shortcomings? And usually people were always claiming that, yeah, there is problem of vanishing gradients or exploding gradients. Yeah. There is a problem that, uh, basically backpropagation is inherently sequential because you have to compute this chain of gradients and you have to do it sequentially. But again, one hand in brain processing is purely parallel second and computers. If we were able to do it in parallel, it probably would have been more efficient and better would scale better as well. So you want this parallelism, you want to avoid possible gradient, uh, issues. Uh, so what do you do? And that's where many optimization techniques came starting from this. Um, essentially Yummly con's own seizes mentioned alternative tobacco propagation that later was called target propagation. Speaker 1 00:28:38 And all it meant is another view of the problem. So basically instead of just optimizing the objective of the neural net with respect to the weights of the neural net being unknown variables, you're trying to find you introduce an auxiliary variables or different names. Uh, it all comes from the just three wheels kind of, uh, equality constraint there that those activations. So he then units, uh, they are equal to what do that, uh, linear function of previous led layer transformed by some non-linearity and you can play with it. You can introduce extra auxiliary variables, just the linear ones and another one, uh, another set, nonlinear transformation of those songs. Of course you can write it purely as a constraint optimization problem. You can modify constraint optimization and to just like this like ranch and whatever. So yeah, I mentioned that in the thesis and kind of was looking into that later. Speaker 1 00:29:39 So it's not something that was not kind of considered before. Just people didn't really try to push, uh, directly optimization algorithms that would, um, take into account those exhilarated variables explicitly. And to me, the work from 2014, uh, was ass paper for getting the name right now. I mean, that basically motivated us to start looking into auxiliary variable approaches. And then there was a whole wave of this optimization approaches anyway. So they all try to do the same thing. They try to introduce activations or linear pre activations as explicit variables to optimize for that would reformulate your objective function for neural net learning in terms of two sets of variables, one being your usual Bates and the second being those activations, and that had some pluses and minuses as everything. The pluses would be that once you fix activations aid completely, decouples their problem into local layer wise sub problems. Speaker 1 00:30:52 So you can solve for those weights completely in parallel. Um, basically the chain of gradients is broken and it's good. So you don't have by definition, any vanishing gradients or exploding gradients, because there is no chain. And second thing, you can do things in parallel. So those two things are good. 
There is also some similarity and more kind of biological plausibility because you take into Cod now activations and this neural net explicitly as variables and essentially interpretation of that is also, you view them as a noisy activations, which they are unlike classical neural nets, where they always deterministic variables, uh, deterministic conditions of the real neurons. Their real neurons are not fully deterministic functions, right? Speaker 1 00:31:48 So the nonlinearity is a separate thing, but even just the fact that in artificial neural net, they're totally deterministic. That's also quite an kind of simplification. So, uh, there are other kind of, um, I mean, there are other flavors of this auxiliary variables and kind of target replication methods in our, uh, kind of approach, which is essentially in line of the, um, uh, the subsidiary variable optimizations, where you can write the joint objective in terms of activations and weights. Uh, there think here is we still use the same baits for kind of, for work, um, propagation or basically computing output given input, as well as for the optimization or in a sense like backward pass. There are other flavors like, uh, target propagation by Jaan electron and a different Stargate propagation by your show and his students and all flavors of methods. On top of that, they use two sets of eight, the forward maids and backhand weights, which may be even more biological plausible than those exhilarated methods that I mentioned. Speaker 1 00:32:57 And then there is lots of kind of flavors and variations on that. And actually, it's nice to see this subfield expanding recently, and there were some papers that new leaps last year and so on and support. So it's all interesting. It has its pluses and those things such as personalization and, uh, by definition, lack of vanishing, gradients and exporting gradients, no matter how deep the network is or how long is a sequence in a recurrent or something, those are good, but you move into different optimization space and empirically, whenever you try this methods on standard benchmarks and standard architectures, they are not all this performing as well as a classical backpropagation. And that was one of the issues with the whole field of alternatives to tobacco problem, how to make them competitive. There are multiple successes, but they're are not like completely kind of putting back prop out of the picture, plus we didn't aim to do so. Speaker 1 00:34:11 We had some successes on fully connected architectures. We've had successes on Sofar and MDs. We've had some successes even on our Nance and even on simple cognitive, but what we learned, I mean, that was good because it was the first time when you actually do see those improvements, um, in, um, uh, in the paper at new rapes, uh, by, um, Sergei. But, um, yeah, and, um, uh, Jeff Hinton and others, uh, they also kind of were trying different alternatives, like target and various sorts of that. And unfortunately they couldn't show that it would be, or be compatible with a standard back prop on Ponce image net. So there was this kind of mini, uh, unsuccessful attempts. There were some limited successes, and the question is still open, whether such alternatives can become true state of art. And the hunch is, I think you shouldn't beat your head against the wall, trying to use alternative optimizations like that on architectures like convenance and so on, which were so well optimized to work with standard pro you need different architectures. 
Speaker 1 00:35:25 And maybe the fact that we really tried hard to beat backdrop on classical resonates, and it didn't really work. Maybe that's a problem, but has not go well with, but something else will go well with exhilarating variables and target probe. It's a hypothesis, but I think it's kind of something to try. Anyway. I think, I think if you make those methods work, you will get benefits of much better personalization and scalability. You will not have this nagging issues of potentially exploding or vanish ingredients, but you will have other problems. You will possibly have convergence problems and automating minimization type of, uh, algorithms and so on and so forth. I mean, uh, there is no free lunch. Speaker 3 00:36:19 So since you mentioned architectures, I want to pause and ask you, um, if you think looking to the brain for architectural inspiration makes any sense at all, or, you know, because it's a whole system with lots of different architectures interacting. And if you're thinking like different optimization methods might work better or worse with different architectures, if, if that is, uh, another avenue where it, where we should look, Speaker 1 00:36:47 I think it would be useful to explore it again. Uh, there is this, uh, cheated debate within the field inductive biases versus scaling the most generic architectures say transformers, or even scaling multilayer. Perceptron ultimately, it's a universal function. Approximator if you scale it enough, it probably will do anything, right. It's probably just not going to scale very efficiently to put it mildly. So yeah, that's why maybe transformers are better, but the question, okay, here is the sync inductive biases or priors versus scaling very generic type of networks. Uh, I might be wrong, my personal opinion, but I think just like historically, whenever you have not enough data. So in brain imaging, sometimes it's small data sets or in medical applications. So on so forth when they don't have enough data, then using priors or inductive biases from the domain is extremely helpful. They take role of regularization, uh, constraints. Speaker 1 00:37:59 And if those civilization constraints are prior, so inductive biases, right, they help them in the sleep and you can perform really well despite having small amounts of data. And that's where you could kind of use specific architectures and so on. And by the way, that's why say convolutional networks were so successful for such a long time, right? But that you start scaling, right. And the amounts of data, if you have those amounts of data, they go way beyond what you had before. Your model size goes way beyond what you had before you scale the number of parameters while maintaining like some kind of structure of the network, like to scale with some depth on something, there are many kind of important caveats here about how to do it. So, so scaling model will actually capture the amount of information while you scale data. So, I mean, there are smarter ways to do that and less smart ways, but say you do it right now. Speaker 1 00:39:08 Uh, it looks like those priors inductive biases become less and less important. And we do have empirical evidence say visual transformers at scale in terms of data start outperforming convenance. And by the way, that's why I think looking at scaling laws is so important. You have two competing architectures, you see how the scale and you see that in lower data regime, covenants are so much better in higher data regime, it's vice versa. 
And that approach that kind of empirical evaluation of different methods architectures or whatever you compare, uh, by looking at the whole curve rather than point that one architecture, another architecture, one data set, and other dataset. It's not that informative, all those kinds of tables there, plots, scaling those flows, uh, giving you much fuller picture and give you better ideas. See if you can scale what type of methods you should invest into. Speaker 1 00:40:14 And apparently if you can scale visual transformants would do better than that. But still the question remains. What if there are inductive biases such as maybe those brain architectures and so on that can improve your scaling exponent, essentially what it means when we talk about scaling exponent is that, um, empirically it's fun that, uh, the performance of models expressed as, um, uh, basically the call center pillows on the test data and, or a loss or classification accuracy on downstream tasks. Uh, they usually seem to improve according to power law. You've seen probably those papers by Jared Kaplan and his colleagues from OpenAI. And now, I guess in Tropic and so on and so forth. And, uh, all those papers show you power laws, which are straight lines and logo blood, and the exponent of power law responds to the slope of that line and Lola plot and the whole billion dollar question in the scaling laws field is what kind of things improve that slope. Speaker 1 00:41:22 So you get better improvement in performance for smaller amount of scaling, therefore cheap and scaling, by the way, involves here, not just scaling the data and scaly bottle size, but also scaling amount of compute because you may just even keep the model fixed and data fixed, but let your algorithm compute more. And sometimes you see very interesting behavior like grokking paper from workshop a day clear last spring, where they just ran their method for long time, just good to kind of kill it. And then there was certain phase transition from almost zero to almost one accuracy. They just managed to find some point in the search space. Yeah, but it was not intentional apparently, but it happened to be that for that type of benchmark and, uh, architecture, they used, it was a case that somewhere in the search space, there was a place, it was extremely good solution surrounded by not so good solutions. Speaker 1 00:42:28 And if you find that place, you can jump there. And that's your face transition from zero to one. Anyway, we kind of trying to explore those face transitions recently, uh, with my colleagues at meal as well. So back to your question, the question inductive biases versus scaling as usual, um, I would say maybe inductive biases plus scaling because certain inductive biases maybe can improve the exponent and at least you want to explore that they might be useful. I wouldn't kind of throw the water. I wouldn't throw the baby together with the water, as I say the saying, ah, okay, let's just scale multi-layer perceptron yes, of course you can do that. It will be just extremely inefficient. Don't you want more advantages scaling laws and maybe inductive biases can help you reset it. Doesn't have to begin the two camps fighting each other. Although I understand it's more fun to fight than collaborate. Speaker 3 00:43:36 Sure. It's more fun to do both. Right. Just like a scaling in inductive biases. The answer is always both. Speaker 1 00:43:42 Yeah, it doesn't have to be either, or it could be both. 
The question is what inductive biases help to improve scaling. And that's a good question. It might be, it depends like, um, Jared Kaplan was presenting at our workshop a couple of times we had two workshops so far one in October, one just now last week. And again, he mentioned that in his experience, uh, again for that setting, that problem for GPT three, um, improvements due to architecture did not seem to be as significant as just kind of scaling and that's totally fine. It doesn't completely kind of excludes the station when inductive bias has maybe some, for some other problems would be much more important. Speaker 3 00:44:32 How do you explore the full space of architecture as though you have some, some limited amount of exploration, right, Speaker 1 00:44:39 Right. Um, that's a good question. I mean, it just like miss everything, neuroscience inspired, it's such a huge space on its own. And to be completely honest, you cannot just go and ask the scientists. What, in your opinion is the most important inactive bias that AI people should use? It doesn't work this way. At least not in my experience because they say, we don't know, like you tell us what you need and then maybe we can think and suggest what kind of inductive bias can best help you with what you need. So what I've learned from two years trying to do the new AI project, the idea that it was first of all, much less well-defined than AI for new era, where you take methods to analyze the data. That's we know how to do here. You like search for a black cat in the black room, and you're not sure if the characters there that you would have for AI, but I think what has interaction with those neuroscientists? Speaker 1 00:45:49 Let's say, look, I need, okay, you're asking what I need for a I'd like my system to be able to continue to learn. I mean, well, it has to be learning how to do new tasks and work on new datasets and it should keep doing this because I might have my robot walking in the wild and it has to adapt, or I might have my personal assistant chatbot and it has to adapt to my changing mental states or it has to adapt to other people. So I don't have to keep looking into different data environments tasks and doing them well at the same time. I don't want it to completely forget what it learned before, because it may have to return to it and may not have to remember. Um, I don't really push for absolute lack of support. Catastrophic forgetting fast remembering is fine. Speaker 1 00:46:45 So just like basically a few short learning of new things and few short remembering of all things would be just fine after all. I don't remember myself, the courses I took in the years of my undergrad, but I could remember them hopefully. So you want that? How does brain do it? And that may be more specific questions you say, okay, how does continue learning happen? What are the tricks? And then people in AI actually using those inspirations. I mean, this whole, a bunch of continual learning methods were inspired by some kind of some phenomena and neuroscience, for example, um, kind of freezing or slowing down the change of certain weights and the network. If those weights are particularly important for some previous tasks in, uh, it was kind of more formalized in word like EWC, um, consolidation or synaptic intelligence. Also, uh, there are many of those regularization based approaches, uh, essentially in continual learning. 
Speaker 1 00:47:53 And it's kind of one flavor of, uh, well, very obstructed inspirations, I would say, but from this phenomenon that does happen in the brain or replay methods again. So this classical example of, um, having hippocampus that essentially forms very quickly, um, kind of memories, but then they've consolidated say during sleep and you kind of have this longer term knowledge and prefrontal cortex. So having this learning systems approach. Yeah. Yeah. So that is another example. Again, it's very much simplified and obstructed. It has its roots in neuroscience, but then it kind of gives rise to machine learning algorithms like, uh, rehearsals to the recorder. So they play junior different plays so on and so forth, but they deal. Yeah. Um, there is also the third kind of direction that, uh, people take usually in continual learning. Um, more like architectural based approaches when you essentially expand your network model, expand your architecture depending on, uh, the needs of your learner. Speaker 1 00:49:10 And that also has its roots and you can connect it to things like adult neurogenesis that even adult brains apparently do grow new neurons when they do so in specific places like hippocampus, the dangers of hippocampus, it's still happening there. It doesn't stop, um, at the beginning of adulthood and yes, there was dogma. So two years ago or more that adult brains do not grow new neurons. Well, apparently they do in mammals, in rats, they do it in the hippocampus and olfactory bulb, as you can imagine, or factory bulb is very important for rats and other memos because they do use smell quite a bit. And humans, apparently factory wealth doesn't matter as much anymore, but we still have some neurogenesis happening in hippocampus. And what was interesting and the kind of did some work on top of that at the paper about neurogenetic kind of model or the very simple version of that. Speaker 1 00:50:17 But the idea was all the empirical evidence about neurogenesis in the literature suggests that there is more neurogenesis happening when the animal or a human is exposed to radically changing environment like different tasks or different environments in continual learning. So then it is associated with more junior neurogenesis. If the environment is not very complex and it's not changing, you kind of, don't really seem to need to expand capacity of your model. Like you have some new neurons being born, but they die apparently pirates. So it's like use it or lose it. If you don't need extra capacity, you won't have extra capacity. If you keep challenging yourself and you keep kind of pushing yourself to extreme, so totally new situations, the new neurons will be incorporated and your hippocampus will expand somewhat. I mean, to some degree, of course, as everything. So it's, it's an interesting observation and it associated with possible ideas of expanding architectures in continual learning to accommodate for new information that cannot be possibly represented well using existing model. So yes. Speaker 3 00:51:37 So let me just recap what you've said there, because you, you covered a lot of ground. So, um, you kind of just transitioned. I was going to ask you about continual lifelong learning and you just transitioned into it. 
Speaker 3 00:51:37 So let me just recap what you've said there, because you covered a lot of ground. You kind of just transitioned — I was going to ask you about continual, lifelong learning, and you just transitioned into it. And you talked about the three neuroscience principles that have been implemented. And the whole point of lifelong learning is that there's this huge problem in deep learning called catastrophic forgetting, where once the network is trained on one task, if you train it to learn a new task, it completely forgets the old task. Right? And so there's been this explosion in lifelong learning methods, one of which is continual learning — is that under the umbrella of lifelong learning? Because there's transfer learning, meta-learning, continual learning, and now there's meta-continual, continual-meta, and on and on. Right? Okay. Speaker 1 00:52:28 Yes. Okay. So it's again a question of terminology — kind of stepping on the same rake of machine learning terminology again. I gave a tutorial at ICML last summer. Anyway, to me, and I guess to many people, lifelong learning and continual learning are synonyms. They both just mean that you would like your model to learn in an online scenario, where online means you get your samples as a stream — you can get them as a sequence of datasets, or a sequence of batches, or a sequence of individual samples. But the point is that you have that sequence of data, you keep learning, and you do not have the option of keeping the whole history of datasets — or you might even have the option, but you might not want to constantly retrain, because it's not efficient. Speaker 1 00:53:31 So continual and lifelong learning, in this view, are synonyms. Meta-continual, continual-meta, and so on and so forth are still within the umbrella of continual learning, but they're different formulations of how you might go about training your system to do that. Continual learning, again — I'm sure people have different definitions — in my mind has a particular definition, which just means online non-stationary learning. By non-stationary, I mean any change to the environment or input data: a change of data distribution, a change of task, or both. As to transfer learning — I have the slides in the tutorial, as well as the slides for the continual learning class; they're all online on my web page. I gave the class for two years in a row — it was a winter semester class, in 2020 and 2021 — but I'm not giving it this year because I will be teaching neural scaling laws. Speaker 3 00:54:50 That's your new thing. Okay. Speaker 1 00:54:51 Well, continual learning is related to that. It's not like we completely jumped to something unrelated. It is related, but with more focus on scaling models, data, and compute, with continual learning being a problem that you're also trying to solve. But back to your question about transfer, meta, and the variations. First of all, my slides are based on a very, very nice tutorial, I think from 2020, anyway. The picture there defines each of those problems and shows how they are different and how they are similar. Transfer usually assumes that you have two problems, and by learning on one, you're trying to be able to use that knowledge to do better on the second problem. There is not necessarily any notion of remembering, or of being equally interested in doing both problems.
Speaker 1 00:56:00 It's more unidirectional. It's a question of terminology: to me, transfer is a property of your model and algorithm, and continual learning is a setting in which you would like transfer to happen. Which means: while learning, I would always like to improve, or at least not make worse, my performance on the past — that is, positive backward transfer, or at least non-negative backward transfer. At the same time, I'd like to hopefully learn better and faster in the future, because I've already learned so much; so ideally I would like some positive transfer to the future. And that view — of not equating continual learning with the catastrophic forgetting issue, but rather taking the more general view of continual learning as a problem of maximizing transfer to both the past and the future — also came out of our joint paper on meta-experience replay from 2019 with Matt Riemer. Speaker 1 00:57:10 And I very much support that more general view of continual learning, especially in the context of not just supervised or unsupervised learning but continual reinforcement learning, as the ultimate continual learning setting. So that's where these different aspects may align and come together to build one big picture of what we're trying to achieve: an agent that can adapt and still be able to return to what it learned, maybe retraining slightly, in a few-shot way — flexibly avoiding forgetting, with adaptation that is fast, and with aspects such as transfer playing the key role. Meta-learning is essentially very similar to continual learning, but it does not assume that your different datasets came in a sequence; they are available to you at the same time, and you try to learn from them a common model that will be easily adaptable to any new environment or dataset, presumably from the same distribution of datasets. Speaker 1 00:58:27 So that's essentially meta-learning. And by the way, out-of-distribution generalization is another related field, which to me is essentially zero-shot meta-learning, because the out-of-distribution setting is: give me multiple datasets or environments, and I will try to learn a model that distills some common, invariant, robust features — or, in general, a common, invariant, robust predictor — so that next time you give me a dataset for testing that is different from my training data, that is out of distribution but shares those invariant relationships (which are essentially closely related to causal relationships), I will do well on it. So it's an extreme case of meta-learning, because meta-learning will tell you, "I want to do well on that out-of-distribution dataset, just give me a few samples." So all this terminology, in a sense, comes together and has many shared aspects. It's just, as I said, an unfortunate Tower of Babel situation in machine learning that makes it difficult. The set of ideas is much lower-dimensional than the ambient dimension of our terminology, so some compression of machine learning terminology is long overdue.
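One common way the backward and forward transfer Irina describes gets quantified is with a task-accuracy matrix; conventions vary across papers, and this sketch simply assumes `R[i][j]` is accuracy on task `j` after training through task `i`, and `b[j]` is the accuracy of a freshly initialized model on task `j`.

```python
# Sketch of backward/forward transfer metrics from an accuracy matrix
# (assumed inputs; definitions vary slightly across the literature).
# Assumes at least two tasks (T >= 2).

def backward_transfer(R):
    T = len(R)
    # Positive BWT: learning later tasks preserved or improved performance
    # on earlier ones; negative BWT indicates forgetting.
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)

def forward_transfer(R, b):
    T = len(R)
    # Positive FWT: earlier learning made not-yet-seen tasks easier
    # than training from scratch.
    return sum(R[j - 1][j] - b[j] for j in range(1, T)) / (T - 1)
```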
Speaker 3 00:59:54 Well, it's also, yeah, it's also the variety of problem statements — but I want to ask you about that in a second, because, just backtracking once more: you talked about three inspirations from neuroscience to help with continual learning. One was the variability in plasticity: say you want to remember a certain task — there's evidence that those synapses form stronger connections and are less likely to change, right? So you lower the plasticity moving forward so that you can maintain good skills on a certain task. Right? You also talked about the complementary learning systems approach, using these two different learning and memory mechanisms: one is fast and very specific, and that's associated with the hippocampus; and then there's this slow, generalizable learning happening, and that's associated with the neocortex. And of course you use replay, which has been used — you know, I think originally with DQN to solve the Atari games — and Speaker 1 01:00:55 Oh, much, much before. Speaker 3 01:00:59 That's — okay. No, thank you for correcting me. I knew when I said "originally," I was like, ooh, that's not Speaker 1 01:01:07 Good. Speaker 3 01:01:09 But, okay. And then the third was essentially inspired by neurogenesis, which is the idea that, especially in the hippocampus of an adult, you can continue to form new neurons, and that this might help us in forming new memories. So what I wanted to ask you about is the issue of the facts in neuroscience continuing to change, because it's an empirical science, right? I thought that recently there was a backlash against this idea of neurogenesis in adults — you know, it's like one year milk is good for you, the next year it's really terrible for you, based on new evidence. I didn't know that the evidence was still pointing toward neurogenesis as a concrete fact, because we're always finding out new things; the story is always being revised in neuroscience. And I'm wondering if that affects your own approach, because if we hang our hat on neurogenesis and it turns out we actually don't create new neurons or something, then how do we move forward from there? Speaker 1 01:02:08 Well, if you think about the computational, machine learning side of that: even if it turns out that in nature there was no neurogenesis to start with, we don't care — we have an algorithm that works better. Speaker 3 01:02:24 Yeah. You can always go back to that — now you have your AI hat back on. Speaker 1 01:02:28 Exactly. That's why I never have only one hat; I have multiple, and I actually don't even think it's effective to ever have a single hat. Although I understand people tend to have one hat, because we are so tied to the notion of our identities, including identity in terms of scientific views — it's so common, it's so dear to our hearts. But in reality, the notion of identity in general is much vaguer than we might think. But I don't want to go into philosophy here, because it takes a long time and we don't have the time; it's a separate conversation anyway. Speaker 3 01:03:21 Okay. So then, coming back to all of these different problem statements, and how you were saying there are different definitions but really they're all kind of converging onto a lower-dimensional problem space, given the various approaches to training and testing and meta-learning — all the things that you just covered.
Do we actually understand how humans learn well enough — the developmental literature, psychology — do humans do all of these different problem statements? I know that there's at least a little bit of work showing that humans actually exhibit some catastrophic forgetting in certain circumstances. But do we understand human behavior and how humans learn — not at the neural level, but at the behavioral level — well enough to map it onto these different settings: continual learning, meta-learning, transfer learning? Do humans do all of that? Do humans do some of it? Does that make sense? Speaker 1 01:04:21 It definitely makes sense. Well, first of all, there are many people focusing on these types of studies in developmental neuroscience. I really would love to read more about that — as you just mentioned earlier, no human being, unfortunately, can be on top of all the literature in all the fields related to their own. There is lots of interesting recent work, and I have some colleagues at UdeM working on that, for example in the psychiatry department, and Elaine Miller in the neuroscience department. There is a lot there in terms of, okay, not looking at neurons, just looking at the behavior and asking whether humans do transfer learning or continual learning in particular settings. I think yes, because in a sense the notions of particular settings in continual learning came out of researchers thinking about what we do in real life in different situations. Say we have robotics: what are the scenarios there? The robot moves from one room to another, the environment changes — all those settings, and specific versions of them: now it's not just that the room changed, but I gave it a different task. They all came out of anecdotal observations of our own behavior, just common sense that is in our heads about what people do. So in a sense, yeah — whatever you have right now in the continual learning field, for example, or in transfer, came out of our knowledge about human behavior, because where else would it come from? Speaker 3 01:06:06 Well — Speaker 1 01:06:08 Yeah. So in this sense, yes. But more specifically — studies about how it's being done, what affects it, what makes it better or worse — for that we would need to talk to our colleagues in psychology and neuroscience and read more about it. And I'm pretty sure, as I said, I'm not claiming I'm completely on top of any literature. Speaker 3 01:06:34 That's infinite hats. Speaker 1 01:06:37 Well, that would be the ultimate goal. But for that, first I need to create AGI, then you connect with the AGI, and then you have your augmented brain, which can finally stay on top of all that literature. That's my true motivation. Speaker 3 01:06:55 All right. All right. Okay. So I have one more question about continual learning, and then we'll begin to wrap up here. Someone in the Brain Inspired Discord community asked whether the learning trajectories of artificial networks have an impact in continual learning — how much a network retains and how much it forgets over successive tasks and training. Have you studied the learning trajectories at all?
Because that's something that's being looked into in deep learning theory these days as something that matters. Speaker 1 01:07:30 Yeah, the learning trajectory. Okay. It depends on several things. It depends on, say, the sequence of tasks — curriculum learning — which can matter a lot, indeed. Actually, it matters not just for continual learning; it matters for, say, adversarial robustness even. It matters for various aspects of the end model, like what data it was given and how it arrived there. But of course it also depends on, say, the particular optimization algorithm. Basically, a different trajectory leads you to a different state in the weight space; it leads you to a different model — think of it as a different artificial brain — and of course those will differ in their properties in terms of forgetting and so on. I guess when we were looking into this — again, I'm thinking about this MER paper and others — obviously the trajectory matters. Speaker 1 01:08:34 The question is, how do you know what local constraints or local additions you should use to push the trajectory in the desired direction? Because all we can do is use some local information, just as the gradient is local, right? And things that change trajectories — as I said, data can change the trajectory, and all kinds of regularizations can change the trajectory. Regularizers are precisely, among other things, what changes the trajectory a lot: you have the standard objective of your learner, you're trying to optimize for that, and then you add things. One example: say I really would like to make sure that I have positive transfer, or at least not negative. Let me add, as a constraint, the dot product of the gradients on new and old samples. Speaker 1 01:09:36 And I want those to be aligned — I do not want my new gradient to point in the direction opposite of the gradient on previous samples, because that would mean I would be decreasing performance on the past task; I would be forgetting. So I'll try to add, at least locally, a regularizer — again, like in the MER paper, for example, just one example — that will push my weight trajectory in that direction, and so on and so forth. So basically any regularizer you put there to buy some desirable features — without, of course, any guarantee, because it's all local — will change the trajectory. So in a sense, the whole field of continual learning, at least the regularization-based part of it, is all about changing trajectories, and thereby changing the final solution. Speaker 3 01:10:33 So there isn't good theory about how to do that, really? Speaker 1 01:10:37 Again, I don't want to claim — I mean, there is various work; there are papers that try to theoretically understand continual learning algorithms, but for specific types of them — there's this, what was the name, some variant of gradient descent — okay, there is various work. But to me, continual learning is still a field that is lagging behind quite a lot in terms of theoretical understanding.
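A concrete instantiation of the gradient-alignment idea Irina describes above — a sketch in the spirit of A-GEM-style projection, not her exact method from the MER paper: if the gradient on new data conflicts with the gradient on replayed old data (negative dot product), remove the conflicting component so performance on old tasks is not locally decreased.

```python
# Sketch of gradient alignment via projection (A-GEM-style; illustrative only).
import torch

def align_gradient(g_new: torch.Tensor, g_old: torch.Tensor) -> torch.Tensor:
    """g_new, g_old: flattened 1-D gradient vectors on new and old samples."""
    dot = torch.dot(g_new, g_old)
    if dot >= 0:
        return g_new  # already aligned with the old-task gradient; keep as is
    # Remove the component of g_new that points against g_old.
    return g_new - (dot / torch.dot(g_old, g_old)) * g_old

# Usage sketch: flatten per-parameter gradients into one vector, call
# align_gradient, then copy the result back into .grad before optimizer.step().
```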
Speaker 3 01:11:06 I was going to ask what your outlook is for continual learning. If we solve continual-slash-lifelong learning, is that the same as solving AGI? And, you know, are you optimistic? What is the normal number that people say — 20 years, and that's when we'll have solved everything? It's always 20 years or so, right? Speaker 1 01:11:26 Some say less than that. Speaker 3 01:11:29 What's your number? Speaker 1 01:11:30 Ten. Speaker 3 01:11:32 Ten years for lifelong learning. Yeah. Speaker 1 01:11:35 Okay. And AGI? Speaker 3 01:11:37 Does that equate to AGI? When we solve lifelong learning, is that solving AGI? Okay. Speaker 1 01:11:43 Again, differences in terminology. I apologize to people who might really strongly disagree with me — and I know some people who will. Speaker 3 01:11:56 Let's drop some names. Speaker 1 01:11:57 No, no, no, no, we're not going to do that. But those people know. Well, there is also this view — and it's still an open question — about what it takes to solve continual learning and whether it equates to AGI and so forth. So if we assume that AGI, for all practical purposes, is general artificial intelligence — by "general" I mean it's versatile, broad; it can learn to solve pretty much any task that is, as people often put it, economically valuable — so say AGI is a kind of model that can solve all economically valuable tasks, say as well as or better than a human, something like that. The question is: if you put that agent in the wild, it will have to do continual learning, right? So continual learning needs to be solved. The question is whether you approach solving it by trying to train that agent in a continual learning manner, or — as the scaling crowd will tell us, or at least part of the scaling crowd, not to overgeneralize — maybe it's enough to just pre-train a humongous foundation model on multimodal data: not just language, not just video, not just images, but multimedia, or perhaps even all kinds of time series data — and once you pre-train it, it has essentially solved continual learning. Speaker 1 01:13:31 We had this question discussed during the workshop, and it's an ongoing debate. What I would say is that for any fixed set of possible tasks that you give a continual learner — like, for example, a recent submission to ICLR on scaling and continual learning — for a fixed set of tasks, yes, sure: scaling the model, and scaling the amount, diversity, and complexity — the information content — of the pre-training data will at some point cover the complexity of that fixed set of tasks, and yes, you will solve catastrophic forgetting; you can capture the information of all the tasks and do all of them well. But if the stream of tasks in continual learning keeps growing, right, indefinitely, will your pre-trained model hit the wall at some point or not? That's a good question. I think it's an interplay between the capacity of the pre-trained model that you learned — I keep saying that, also in my tutorial on continual learning — and the complexity of the unseen part of the universe that your agent will have to adapt to. Speaker 1 01:14:50 And I would say that what you really need to look into is relative scaling: how your model capacity — which depends on size, architecture, and the information content of the data you're trained on, which in turn depends on the number of samples, but on other things too — scales with respect to the complexity of downstream tasks.
So to me, relative scaling laws would be the most interesting thing to dive into. And I think it makes sense: it's always a trade-off of capacity versus complexity, just like rate-distortion theory in information theory and so on. You want to find the minimum-cost, minimum-capacity agent that's capable of conquering the complexity of whatever future tasks that agent will be exposed to. But if the agent hits the wall, the agent will have to have the ability to expand itself and continue learning. So to me, continual learning is the ultimate test for anything that is called AGI. Speaker 3 01:16:03 That sounds like incorporating principles of evolution into it. Speaker 1 01:16:09 Okay. So any pre-trained model may hit the wall, and I believe it will have to keep evolving — and if it won't be able to evolve itself, too bad. Speaker 3 01:16:22 Okay. Irina, this has been fun. I have one more question for you, and that is: considering your own trajectory, if you could go back and start over again, would you change anything? Would you change the order in which you learned things, or the order of your jobs? How would you start again? Speaker 1 01:16:42 Ah, that's a very interesting question. I'm not sure I have an immediate answer to that, but in one of those realities I might have taken a totally different trajectory from the one I'm on right now. I probably would have been skiing somewhere in Colorado, working as a ski instructor. Speaker 3 01:17:10 Oh, I do all of that except work as a ski instructor. You should try what I did — I just went two days ago. Yeah. Speaker 1 01:17:20 Look, Mont Tremblant is pretty good too, so it doesn't matter — some good mountain, and just... Speaker 3 01:17:27 I know. Why don't you come visit me and we'll go ski, and we'll see if we can change your trajectory. Speaker 1 01:17:32 And you're welcome to visit. Speaker 3 01:17:36 All right. Very good. Well, I really appreciate the time. So thanks for talking with me. Speaker 1 01:17:41 Thank you so much for inviting me. It was fun. Speaker 3 01:17:49 Brain Inspired is a production of me and you. I don't do advertisements. You can support the show through Patreon for a trifling amount and get access to the full versions of all the episodes, plus bonus episodes that focus more on the cultural side but still have science. Go to braininspired.co and find the red Patreon button there. To get in touch with me, [email protected]. The music you hear is by The New Year. Find [email protected]. Thank you for your support. See you next time.
