AI in Big Pharma - Associate Director of AI and Data Science, Bristol Myers Squibb
Christian Merril is a data scientist and AI engineer with over 10 years of professional experience. He is currently an Associate Director of AI and Data Science at Bristol Myers Squibb, where he leads a team of data engineers and data scientists in developing and deploying AI solutions to help improve the drug development process.
Christian has a strong background in machine learning, natural language processing, and computer vision. In his previous roles, he has developed AI solutions for a variety of applications, including:
• Automating information extraction from unstructured documents
• Implementing semantic search on domain-specific documentation
• Implementing conversational agents
• Automating data analysis and discovery
Christian is passionate about using AI to solve real-world problems and make a positive impact on the world.
Host:
Hello, Christian. How are you doing?
Christian Merril:
Doing well. How about yourself?
Host:
Doing very well. So, I'm going to introduce you here, and correct me if I say this incorrectly: this is Christian Merril. He is many things, but first and foremost an AI researcher; he was doing that for many years. He was in the US Army for four years, and for even longer he's been working at Bristol Myers Squibb, which, if you don't know that company, sits under this famous umbrella called Big Pharma. He's been there quite a few years, and currently he is the Associate Director of AI and Data Science.
Christian Merril:
Yeah, no, that's definitely the title I currently have.
Host:
Well, you know, with that title being said, my first question is like, you know, what do you feel like you do every day? What's it like to be Christian?
Christian Merril:
So it's actually kind of fun, in my opinion. Part of the reason I've stuck around the same organization, even while bouncing around inside the company, is that a lot of my day-to-day is being able to solve interesting problems, right? Being able to look at what's currently going on in the open-source community and then trying to bring that to bear outside of just the confines of an academic paper.
And right now it's even more exciting, because the community has kind of exploded. Right now it kind of feels like every day is a good day in data science. Or every day is like a good year, sorry, in data science. There are hundreds of papers being published, the technology is rapidly changing, especially in the space that I work in, which is computational linguistics, or NLP.
But yeah, I think it's just exciting, right? It's like every day is Christmas: there's a new technology that I get to play with or try to implement, and everyone seems to be very excited about doing it right now. Historically, the space doesn't move very quickly, so the demand was never, "Hey, we need a thousand things done now." You'd get lucky if people wanted to spend much time with you.
Host:
I mean, you sort of mentioned the space, computational linguistics. So if I—I'd like to hook on that, uh, just, uh, give the audience an idea of, you know, what it is. So could you just explain that really quickly, and then we can sort of use that as a basis for the rest of the conversation?
Christian Merril:
Yeah, sure. So, computational linguistics is kind of the formal name, the umbrella term, for what a lot of people call NLP, or natural language processing. It could be anything from looking at how you can parse through natural-language text to do something with it, to building out algorithms for things like sentiment analysis, to some of the newer, exciting work on building conversational agents.
But it's more of a giant blanket term for looking at natural language and then doing stuff with computers to work with it. At least that's how I've been defining it, as opposed to NLP, which is a narrower set of subtasks inside of it. It's kind of like how, in machine learning, there's deep learning, and there are different types of supervised and unsupervised learning. But yeah, it's just an umbrella term.
Host:
Alright, so, uh, so how does NLP then fit into the computational linguistics? Um, like, how does it fit into the umbrella exactly?
Christian Merril:
So, I mean, it's kind of synonymous nowadays, but NLP itself is typically trying to do something with language in terms of parsing it or doing some sort of processing with it. Broadly speaking, though, you could be looking at a whole host of different things. A good example would be parsing out earnings-call transcripts: okay, this is a person's name, and identifying it as a name; this is a title, or this is content. So yeah, it's all kind of synonymous nowadays. It's gotten a lot different from how it was at the very beginning.
Host:
What are some of the cousins of NLP that sit inside that umbrella? I figure we could go through some of the spokes of the umbrella. What would you list?
Christian Merril:
Yeah, so you'd see symbolic NLP, and you'll see things like statistical NLP. Statistical NLP is a lot of what we were doing prior to 2010; a lot of tools like spaCy use statistical models to do different tasks, whether that's trying to determine the next best word, kind of like type-ahead on your phone, or something else. And nowadays we're in what you might call neural NLP, which is more the Transformer models using deep learning. Recurrent neural networks were the big thing originally, then those evolved into LSTMs, and now the Transformer architecture seems to be the most prominent.
Host:
Alright, isn't a Transformer just like a higher-order version of statistical NLP?
Christian Merril:
Yeah, but at the end of the day, yes, it is all, again, averaging across the probability distribution. But the actual approach of what you're doing is going to be slightly different.
Host:
Okay, okay. So moving, uh, sort of away—so I guess I want to tie this back to like what you do every day. So how does all this—how does all this help you, you know, at your job and what you do? Like, how does that—how do you use that to bring out some value for your organization?
Christian Merril:
I mean, it depends on the use case, but broadly speaking, with some of the things that have really opened up in the past couple of years: being able to do things like abstractive summarization, basically breaking down large amounts of text to then serve to somebody, has been super helpful. Other things have been building out ontologies, so being able to understand what concepts are inside different bits of text and how that all rolls up into the broader organization. It's also super helpful with diseases. When you're looking at medical text, it's good to know that a heart attack rolls up into a very specific subset of diseases: it maps to myocardial infarction, which is death of heart tissue. Or that you have a specific type of cancer, and being able to pin down that it is cancer, yes, and that it relates to these specific tissues or these specific organs. Then you can actually use that for things like search, semantically searching documentation, or for being a little more targeted in what you're doing with large amounts of unstructured data.
Host:
What are some other types of use cases?
Christian Merril:
So one of the things I've done recently, basically out of desperation, was building a recommendation engine for myself. It was a little project I did over a weekend, just for me, because I needed to get in control of all the academic papers being released. It's a good problem to have, but historically I'd spend a couple of hours a week just reading, and that became impossible. So I built a recommendation engine that basically reads through the abstracts and papers for me and then comes up with a recommendation of whether I should spend time there. These are the types of things that are very powerful in different organizations, because when we think of pharma, a lot of people will picture things like, "Oh, you're doing protein folding," or "You're trying to design drugs in silico," or a whole host of other use cases people might associate with it. And folks do do that, but that's not been the work that I've been doing.
Host:
Not the work you've been doing. Before you go on, because we could talk about the research and some of the things you're seeing that are interesting, and I think that's going to be really exciting to hear about, I want to save that for a little later in the pod. First I'd like to talk about some of the structures you hinted at: this idea that when you're working with NLP, you have a bigger knowledge set that you're using as a filter for the value of a particular word. The way I interpreted it, if you have a myocardial infarction, or I suppose if you want to classify a heart attack in a report on a patient's well-being as a myocardial infarction, you're trying to get your machine to understand that this belongs in that category. But also, based on the context of the writing, the person, or the doctor, isn't necessarily going to say the patient has that, yet the way it's written, it has all the symptoms of a myocardial infarction. So could you talk a bit about those kinds of machines and the experience of building them?
Christian Merril:
Yeah, so I'm not actually building medical devices or anything like that. One way of looking at this from an intuition perspective is that a word tends to be known by the company it keeps. So when you look at, let's say, a medical journal that's going to be talking about a specific type of disease, you can start from a pure statistics perspective: which terms show up together, and how often. Over the past couple of years that has shifted from simply tracking term frequencies to using encoder-only Transformer models, things like OpenAI's ada embeddings or MiniLM, to encode text into a richer, multi-dimensional representation.
What these models let you do, instead of manually assessing term frequencies, is encode text into n-dimensional features, which you can then use to classify and re-rank documents. So you can match a concept like "heart attack" against relevant medical literature with something like cosine similarity: you encode your documents and your search query, and then match them using those vectors.
You'll see this approach in retrieval-augmented generation, or RAG, which is laid out in OpenAI's cookbook, where folks set up vector databases such as Pinecone and run a vectorization exercise to improve retrieval. When you embed your documents, whether they're research papers or physician notes, you end up with a system where similarity lookups can answer domain-specific questions. Open-source tools like LangChain and LlamaIndex are also making this more accessible by providing frameworks to manage those vectors.
It’s a little bit— I mean, Langchain has a whole host of different types of features that it proposes. LlamaIndex, I believe, is more geared towards, like, the actual semantic lookups or search-related use cases. I don’t use either of them a lot right now, simply because I found that the packages are a little bit volatile. They're very rapidly changing. It’s volatile in a good way, right? There's a lot of interest, the community is moving very fast, but I've had situations, you know, internally, where we’ll try to use something like Langchain, and then they might change, like, a method call or something, a little bit deprecated, and we didn’t recognize that it got deprecated and then it may cause issues so we haven't haven't used a lot of that. I primarily been using Lang chain for their recursive text splitter which is like really really great it's been a huge Time Saver.
It'll actually split out sections of text, but with some level of overlap. You can encode that, and then when you do your recall you'll get those sections back, you can kind of reconstruct things together, and then pass that to your GPT-4 or whatever to give you a summary, or provide you a recommendation, or whatever you want to do.

Host:
It's recursive in the sense that the text that's associated with each other gets smaller and smaller as you go deeper and deeper down the tree? Is that how I should visualize it?

Christian Merril:
I think it's primarily a relatively naive text-splitting algorithm. You have a specific chunk size that you want and a specific amount of overlap that you want, and then it just sits there and breaks down your, let's say, 100-page document into chunks of the size you dictated.
But for each chunk there's a little bit of overlap, basically, and it's as simple as that. Although I'm sure you could get very creative, and we have. In certain instances we've looked at doing full-blown document layout and then trying to associate sections of documents.
Because then, you know, hey, this section was the introduction or the abstract and this one was the methods, and from there you can have better enrichment of metadata. So you can say, "Oh well, I know I'm only interested in the abstract, or I'm only interested in the methodology, so I only want to search on the methodology." Obviously that's going to be a little bit more performant than trying to look across every bit of the document.
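For the splitter itself, a minimal usage sketch looks something like the following; the chunk size and overlap are arbitrary numbers, and the import path has moved between LangChain releases, so treat this as illustrative rather than canonical.

```python
# Chunk a long document with a fixed target size and some overlap, so that
# neighbouring chunks share context and can be stitched back together later.
from langchain.text_splitter import RecursiveCharacterTextSplitter

long_text = "Methods. We measured outcomes over twelve weeks. " * 500  # stand-in document

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # target characters per chunk
    chunk_overlap=200,  # characters shared between adjacent chunks
)
chunks = splitter.split_text(long_text)
print(len(chunks), "chunks;", repr(chunks[0][:60]))
```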
Christian Merril:
And so that kind of gets into what you want to do. Again, this is a huge space, there's a lot that one can do inside of it, but the recursive splitter just helps you split things apart, because a lot of the models have a token limit, including the embedding models, right? I'm sure you're familiar with how you can only talk to ChatGPT for so long before it kind of starts forgetting things.
Christian Merril:
And that has to do with the context window. A token is basically a portion of a word, or it could be a full word, or it could be punctuation. And you'll get, let's say, 4,000-odd tokens that you can fit into the model before the model has issues and you need to get creative. With some of the open-source models, if you go too big you might run into indexing errors.
Christian Merril:
Or the generative model will just start spewing absolute gibberish.
Host:
So it's more than hallucinating, it's absolute gibberish?
Christian Merril:
Yeah, yeah, it's worse. But that's only if you go beyond it. OpenAI's APIs, as well as any commercial service, will probably stop you; it'll give you an error.
Host:
Yeah, I think it just gives you an error at that point, right?
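One practical habit that follows from this is counting tokens before sending anything, so you know whether a prompt fits the window. A small sketch, assuming OpenAI's tiktoken tokenizer and an illustrative 4k limit:

```python
# Count tokens up front instead of waiting for the API to reject the request.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by recent OpenAI models

prompt = "Summarize the methods section of the attached study."  # placeholder text
n_tokens = len(enc.encode(prompt))

CONTEXT_LIMIT = 4096  # e.g. an older ~4k-token model
if n_tokens > CONTEXT_LIMIT:
    print("Too long: split or truncate before sending.")
else:
    print(f"{n_tokens} tokens, fits inside the {CONTEXT_LIMIT}-token window.")
```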
Christian Merril:
Yeah, so a lot of the reason you split this stuff up is so that you can get those segments and put more relevant stuff into your model, so that it's able to do something. On the embedding side, what typically ends up happening is you'll get an indexing error, and something you can do to get around it is to truncate at your max length. So let's say you have 512...
Christian Merril:
...let's say words, or tokens, you truncate at that. Or what you can do is split it up into those segments and then do some sort of averaging technique.
Host:
The downside to that is that, typically, if you do this...
Christian Merril:
Yeah, you get data loss, or dilution, right? So you won't get as high a quality of semantic search.
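A sketch of the two workarounds being described (truncate at the model's limit, or embed chunks and average them), assuming sentence-transformers and using word counts as a rough stand-in for real token counts:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder
MAX_WORDS = 512  # stand-in for the model's actual token limit

def embed_truncated(text: str) -> np.ndarray:
    """Keep only the first MAX_WORDS words and embed those."""
    return model.encode(" ".join(text.split()[:MAX_WORDS]))

def embed_averaged(text: str) -> np.ndarray:
    """Embed each chunk separately and average the vectors (some dilution)."""
    words = text.split()
    chunks = [" ".join(words[i:i + MAX_WORDS]) for i in range(0, len(words), MAX_WORDS)]
    return np.mean(model.encode(chunks), axis=0)
```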
Christian Merril:
And so there are a lot of engineering challenges to all of this. I'm sure there are solutions out of the box where you can kind of feed stuff in and it'll do a lot of this for you.
Christian Merril:
But I'm not 100% certain what the best approach is. If you wanted me to give you one approach, the approach that everyone should be doing, I don't think I have that.

Host:
It seems like it's a very powerful tool, but its power almost creates an illusion of a bigger boundary than it actually has. Like, even just what you spoke about now: summarizing a whole book, say. With a four-thousand-token limit you just can't really do that. You can only summarize a chapter at a time, and then you can sort of cheat by summarizing all the chapters and then summarizing the summaries, which is what you talked about. But then you have a dilution issue, where you don't have much control over what context you lose, or you have a tough time controlling it, and by the time you've fixed all that up, you may as well have read the book.

Christian Merril:
Well, I don't know if you necessarily need to give up on it. There are definitely approaches, right? There's been a whole lot of research into this. Some of it is, hey, maybe I'm going to have it take notes. So maybe I can fit a chapter at a time, and I'll have it take notes on each of the chapters, and then I'll compile that together and pass that in, right? That way you have control of the context.
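Here is a bare-bones sketch of that note-taking, summarize-the-summaries pattern. The `call_llm` function is a hypothetical placeholder for whatever model you would actually call (GPT-4, a local Llama, and so on).

```python
# Map-reduce style summarization: notes per chapter, then a summary of the notes.
def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real client call to your model of choice.
    return f"[model output for a {len(prompt)}-character prompt]"

def summarize_book(chapters: list[str]) -> str:
    notes = []
    for i, chapter in enumerate(chapters, start=1):
        notes.append(call_llm(f"Take concise notes on chapter {i}:\n\n{chapter}"))
    compiled = "\n\n".join(notes)
    return call_llm("Write an overall summary from these chapter notes:\n\n" + compiled)
```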
Christian Merril:
And then there are other things I've seen beyond just note-taking or storing it. There's a lot of research right now on how we can increase the token lengths. I know there was a model called MPT; I'd have to double-check the exact number, but they trained a 7-billion-parameter model to have this absolutely absurd context length of 65,000 tokens. And what they did as one of their tech demos, I think, is they put the entire Great Gatsby into it and then had it write an epilogue. And it did well. But yeah, there's definitely a lot of research into this problem.
Host:
What were they using to make this work? I'm assuming they just poured more power into the problem. Was that the answer, just throw more compute at it?

Christian Merril:
It's twofold, right? Historically, the reason that token lengths have been short has been that you would need an exponential amount of compute in order to continue to extend them.
Christian Merril:
We've gotten around that using different types of attention mechanisms, or making them more efficient, and so it's gotten a little bit closer to a linear relationship. And in the case of the MPT one, I believe they really did just throw a lot of compute at the problem.
It's a 7-billion-parameter model, so it's a little bit on the smaller side. In terms of scale, you've got Llama 2 at, what, 70 billion? I think OpenAI's is 100-and-something billion, and GPT-4 is supposed to be, I think, a set of 220-billion-parameter models, although they don't really release those details, so you don't know for sure.
Christian Merril: But yeah, they took a smaller model and they basically threw a lot of compute at it. And it's the same even with, you know, Llama. I think the original Llama, out of the box, is 2,048 tokens, and I believe, I'd have to double-check, should I just check for you real quick?
Host: Yeah.
Christian Merril:
But Llama 2, I believe, is 4,000-some tokens. Yeah, and people have already started extending it. I know there's a LongLLaMA project where they've extended it to a much larger number of tokens. So there's a lot of work being done. And in terms of the compute piece, there are some things we can do to make things a little more tolerable, right? When we think about large language models, typically, or at least historically, the weights were in 32-bit floating point, and they're huge, right? You can get 70-gig, 100-and-some-odd-gig binaries of just the model weights.
And so one of the things that has been kind of trending recently, which is super exciting and something I've been following a lot, is this idea of quantization. What we've found is that you can actually reduce the precision of the weights and continue to get really good performance. I won't say exactly the same, but without a significant degradation in the behavior of the model.

Host:
Okay, so that's an order-of-magnitude reduction in precision with every place value you remove. So how are you still maintaining accuracy, or maintaining the result, even though you've lost that? I might be using the wrong word: the result, with that level of loss.
Christian Merril:
Yeah, that's a great question, and it's one that I'm not entirely certain about: why one can get away with dragging it all the way down to four-bit and still get really good performance. It seems that four-bit is kind of the minimum. There was a paper that explored this called SpQR, and the same group released a package called QLoRA. What that paper discusses is, okay, how low can you go? What's the bottom? So they started just reducing the precision of the weights. I believe they started at 32-bit precision, and then they went down below four. And they found that around three-bit precision it started having issues.
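For reference, loading a model with 4-bit weights is now mostly a configuration flag in the Hugging Face stack. A sketch along the lines the QLoRA work popularized, with the model name purely as an example:

```python
# Load a causal LM with 4-bit quantized weights via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in 16-bit
)

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across the available GPU(s)
)
```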
Host:
Yeah, but that's a lot, though. Dropping all the way down to four, and then three, is a big jump.
Christian Merril:
Yeah, it is. And it has really good implications for folks like myself who want to do my own research at home, as well as for large enterprises. It's probably even good for Google and OpenAI, Microsoft, Amazon, whoever, right? Because you can fit a 13-billion or 16-billion-parameter model on a single A10 GPU instead of needing, like, four of them, or a large set of A100s, to do something.
Host:
Now, there are some downsides, right? One is that you have to have a relatively modern GPU. If you try to use some of the old ones, what would be like the K40s or K80s, or anything below the 30 series or so, they aren't as efficient...
Christian Merril:
Yeah, because they can't support the matrix multiplication out of the box. And when they first did the four-bit, like, I remember when bitsandbytes first released their four-bit implementation, you could train the four-bit model, and it would train really well. You could do a fine-tuning exercise using LoRA, which is low-rank adapters: basically, you put weights on top of specific modules in your model and then train those instead of trying to train the entire model. And so we were doing that with four-bit, and it would go really well. And then you'd try to do inference with four-bit, and it would take something like 50 times longer just to spit out the same answer, which, you know, obviously is not very good.
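A minimal sketch of what attaching those low-rank adapters looks like with the peft library; the target module names are typical for Llama-style models and would differ on other architectures, and `model` here could be the 4-bit model from the previous sketch.

```python
# Wrap a base model with LoRA adapters so only a small set of extra weights trains.
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # a tiny fraction of the total weights
```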
Christian Merril:
But they recently released a CUDA kernel to fix this particular problem, and now it's very fast, just as fast as eight-bit or sixteen-bit.
Host:
So, it sounds like, and this might be fundamentally wrong, but it trains at the same speed, it just takes more time to get the result out? That's what you're saying. The fundamental thing that does the prediction doesn't really need...
How do I put this? It just needs more space to make the prediction itself, to generate the result it lands on. But the result itself has a minimum size.
Yeah, so I think I understand. You're saying that the weights themselves have some level of minimum viability, they have to be at least a certain size in order to execute the way we're expecting them to execute, right? But the rest of the space isn't really for the model itself; it's just extra space for it to make a prediction, or something. Prediction might be the wrong word. It's almost like you need a big table to make a big breakfast, but you can make it on the small table, right? It's just going to take more time.
Christian Merril:
Yeah, a lot of that. So, what's interesting is that a lot of the slowness was just... at least...
Christian Merril:
Originally, it was just because you were basically trying to do something in software instead of on hardware, trying to get around something, so you were multiplying your matrices in a very inefficient way. Once they patched that, once they enabled native support on the actual GPU through whatever it was they created (I believe it's a CUDA kernel), they were no longer trying to use some PyTorch abstraction to finesse that piece, and it got a lot faster. It's actually almost as fast as 16-bit or even 32-bit. It's ridiculously fast now.
Host:
So it's the same thing, it's just happening on a lower level and that's what's giving you the speed. So that's really what the value was, they just put it lower down the stack.
Christian Merril:
Yeah, I believe so. It’s basically just creating efficiencies where they can.
Yeah, because it’s— I mean, Python’s notorious for being slow. I don’t think anyone would argue that Python’s super fast in comparison to like C++.
Host:
Well, the only reason why we’re using Python for anything in data science is because it’s using C underneath, which is—everyone knows that.
Christian Merril:
Yeah, and it’s a really nice language to work with. It’s easy to write, and you’re not putting semicolons at the end of every line.
Host:
(chuckles) So, sorry, we got so deep into this I almost forgot we're recording. But let me go back a little bit, all the way back. One of the things I wanted you to explain was what the different measures of distance in your vector databases are and how they affect your results. So, could you talk a little about cosine similarity and other ways of using different concepts of distance, and what kind of results you get when you use one or the other?
Christian Merril:
Yeah, so the one that most people default to, I think, is cosine similarity. The other one that we've used, or at least that I've used, is the L2 norm, which is Euclidean distance, if I'm pronouncing that right.
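For anyone unfamiliar with the two measures, a tiny numpy sketch: cosine similarity compares the angle between vectors and ignores magnitude, while the L2 norm is the straight-line distance between them. The example vectors are arbitrary.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.5])
print(cosine_similarity(a, b), l2_distance(a, b))
```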
I mean, they're just ever so slightly different ways of comparing the two vectors; I believe it's essentially a projection across your vector spaces, because you have two different vectors. Honestly, being super honest, I haven't seen significant differences between the two methodologies. And maybe that's just because I like to use a blended retrieval. I don't just rely on a vector retrieval; I do both a vector retrieval and a BM25 retrieval, which is a more traditional keyword, term-frequency, inverse-document-frequency kind of retrieval.
I've found that helps balance out some of the noise you might get with something like a vector lookup. And not all vector lookups are the same either, right? You have a couple of options. You could brute force it, going through your entire corpus, getting a score, and pulling it back. But typically what I've been doing is an approximate nearest neighbor approach, and I really like it. There's a piece of open-source software, actually forked by Amazon, called OpenSearch. I've been enjoying that product because it's very user-friendly and allows for a lot of customization. It also now has Hugging Face support directly integrated, so you can embed your model within OpenSearch rather than needing an external setup.
With modern encoder models, you're typically working with one of three broad classifications of Transformers: encoder-only models like BERT, encoder-decoder models like T5, and decoder-only models like GPT. The encoder-only models are the ones most commonly used for embedding generation, since they create the structured vector spaces you do these similarity lookups in.
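A toy version of that blended retrieval, assuming the rank_bm25 and sentence-transformers packages; the documents, query, and 50/50 weighting are arbitrary choices for illustration rather than anything from a production setup.

```python
# Blend BM25 keyword scores with embedding cosine similarity and re-rank.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "Myocardial infarction is the death of heart muscle tissue.",
    "The study measured blood pressure over twelve weeks.",
    "Patients with a prior heart attack were excluded from the trial.",
]
query = "heart attack outcomes"

# Keyword side: BM25 over whitespace tokens.
bm25 = BM25Okapi([d.lower().split() for d in docs])
kw_scores = np.array(bm25.get_scores(query.lower().split()))

# Vector side: cosine similarity of embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(docs)
q_emb = model.encode(query)
vec_scores = doc_emb @ q_emb / (np.linalg.norm(doc_emb, axis=1) * np.linalg.norm(q_emb))

def rescale(x: np.ndarray) -> np.ndarray:
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

blended = 0.5 * rescale(kw_scores) + 0.5 * rescale(vec_scores)
for score, doc in sorted(zip(blended, docs), reverse=True):
    print(f"{score:.2f}  {doc}")
```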
Host:
Why those particular names? Is that just a label for the way they do it, and maybe they have different algorithms to do the encoding, or no?
Christian Merril:
So the reason is that these are just the terms that have kind of evolved over time. There's actually a really nice paper with a historical tree; it might even be a GitHub repository, I'd have to double-check. But it has a very nice tree of the lineage of the large language models, and it shows you the three different branches. I think even Yann LeCun has criticized the naming convention. Yann LeCun, for folks who don't know, is one of the heads of AI, if not the chief AI scientist, at Meta; a key player in the space, a key thought leader. And even he was like, "Yeah, it's kind of unfortunate how we've named these, because the naming is confusing." Even though it says it's an encoder-only model, there really is a decoder there; then you have the encoder-decoder one, which is fine; and even the decoder-only one really still has an encoder.
Christian Merril:
It seems like the people doing natural language processing have a thing for naming...
Host:
Well, yeah, and we also have a penchant for cute names. I mean, Google was naming them after Sesame Street characters for a while. So you had BERT, and then someone did ELMo, and there's ERNIE. And now they've got even more fun names. One of the more interesting things that's come out recently is called Platypus.
Host:
So, uh, my buddy and I actually have a joke that if we come out with our own language model, we’re going to call it Snuffleupagus just because it’s a lot of fun for people to spell, right?
Christian Merril:
Um, but yeah, so you’ll definitely get a monopoly on Google rankings.
Host:
Yeah, I guess.
Christian Merril:
But yeah, so you have these different branches of models, and they're good at different things, right? Historically, and again there are caveats to all of this, your encoder-only models are better at tasks like creating an embedding that you can use to do a semantic search.
Your encoder-decoder models are highly efficient for sequence-to-sequence tasks, things like translation or summarization, where you take one piece of text and transform it into another. And the decoder-only models are traditionally the ones doing next-word prediction, so text generation.
With the growth of larger and more complex models, though, many of them now perform tasks across those previously defined boundaries. The landscape has shifted quite a bit, and those blurred lines mean the same family of models can support both generalized and specialized applications.
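The three families map onto different Auto classes in the Hugging Face transformers library; the checkpoints below are just common, small examples of each branch.

```python
from transformers import (
    AutoModel,              # encoder-only, e.g. BERT -> embeddings
    AutoModelForSeq2SeqLM,  # encoder-decoder, e.g. T5 -> sequence-to-sequence
    AutoModelForCausalLM,   # decoder-only, e.g. GPT-2 -> next-word generation
    AutoTokenizer,
)

encoder = AutoModel.from_pretrained("bert-base-uncased")
seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
decoder = AutoModelForCausalLM.from_pretrained("gpt2")

# e.g. a crude sentence embedding from the encoder-only model (mean-pooled last layer)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tok("heart attack", return_tensors="pt")
embedding = encoder(**inputs).last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```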
Host:
Yeah, I mean, it would be...
Christian Merril:
But yeah, you're not incorrect. I think it just depends on what it is that you're trying to do, but historically the encoder-only models have been better at creating the embeddings than some of the other ones. It seems like the lines of delineation are more about utility than anything set in stone: a specialization at that time, relative to the power of these things. Then, as things got a bit more powerful, the lines weren't as necessary anymore, because it could all be sort of nested in one box. Now to the point where you could have that box be four bits in size.
Host:
Yeah, well, and I mean, like anyone else in this space, I am still merely a student, right? And you know, we're all still...
Christian Merril:
Yeah, every day is kind of different, right?
Host:
So, the thing that's crazy is that you never know if you're going to wake up in the morning and all of a sudden there's a new way to do something, like, "Oh, here's some non-autoregressive model that's going to be better at doing this thing than anything else you could possibly throw at it." And so now you should learn about this and how to implement it and how to use it.
Christian Merril:
Yeah, and so, yeah, you never know. But like I said, historically it was, "Oh, you want to do text embeddings? Okay, I'm going to go grab a BERT model, or a Longformer model, or MiniLM." Oh, you want to do text generation? Well, let's look at GPT-2, or BLOOM, which was one of the big ones out for a while, or Llama 2, or something, right? So, yeah, that's typically how it's gone.
Host:
But at the same time, I wish I could create just like a, you know, a fixed flowchart to be like, "Hey, you want to do this use case? Follow all of these steps and here you go!" You could kind of get close to that, but it's just changing so rapidly that sometimes it's challenging to adapt to it.
Host:
What are some of the things you think are interesting coming down the pipeline? Because you're in a rare position where you're able to actually build with these advancements. So, what do you think are some seminal papers, seminal being a relative term, that have come out?

Christian Merril:
So, a lot of the stuff I've been looking at recently has been around quantization, right? How do you get more efficient at training? How do you get more efficient at hosting? Because at the end of the day, I'm usually asked to deliver some kind of a solution for somebody to use. So keeping costs low, making sure we get the best performance, and then, obviously, with pharma, and I'm sure other enterprises, there are always some concerns about using a commercial product that you don't completely have control of.
Christian Merril:
Right, so I don't know how OpenAI is going to version GPT-3.5 or 4, and that might have implications for what I want to do. So that's why it's like, "Oh well, if I can host it myself in my own environment and I don't have any privacy or security concerns, I want to do that." But I also don't want to have to go talk to my boss's boss and be like, "I need 10 million dollars for compute."
Host:
Only 10 million?
Christian Merril:
Yeah, well, for one thing, or for a handful of things, right? I don't want to be doing that if I can avoid it. And I don't want to have to say, for this project we're going to need an entire data center just to support it.
Christian Merril:
So a lot of the stuff I've been really excited about has been PEFT, which is a Hugging Face library, and the SpQR paper.

Host:
Is that just an optimization of the parameters themselves, or is it just making fine-tuning faster, or something?

Christian Merril:
So PEFT stands for state-of-the-art parameter-efficient fine-tuning. Yeah, it's about how you get more efficient at training these models. And there are different approaches: there's LoRA, the low-rank adapters we kind of talked about, there's prefix tuning, there's P-tuning, prompt tuning, multitask prompt tuning; there's a whole host of different ways one could go about it.
Christian Merril:
And there's a library that's actually been released called PEFT, which is this nice kind of one-stop shop where you can get access to some higher-level methods, so you're not having to spend a lot of time implementing things yourself. I believe Hugging Face released that one.
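The library exposes the other adapter styles through the same pattern: pick a config, wrap the model. As a sketch, prefix tuning with an arbitrary GPT-2 base and 20 virtual tokens looks roughly like this:

```python
from peft import PrefixTuningConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative base model

prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,  # learned "prefix" vectors prepended at each layer
)
model = get_peft_model(base, prefix_config)
model.print_trainable_parameters()  # only the prefix parameters are trainable
```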
Christian Merril:
And then there are other packages that go along with PEFT, like bitsandbytes, which is Tim Dettmers'. He was an author on the SpQR paper, I believe. His research interests tend to be around this quantization: how to be more efficient at inference and training, and basically how do you get these onto a Raspberry Pi, or how do I start using one on a CPU instead of needing a GPU?
Host:
So are there, are there techniques and optimizations that aren't necessarily really about, you know, just playing with the structure of chips, but some sort of like, um, even mathematical or just some sort of abstraction leveraging that, that can sort of shift things around a bit more efficiently?
Christian Merril:
Yeah, so bitsandbytes is definitely not about chip optimization or rearranging a chip; it's all software, right? There are definitely organizations, is it Cerebras? There are a couple of smaller organizations popping up that are looking into chips: how do I make a better chip for training a large language model?
Host:
But sorry, you said bits and bytes was the other one that did the four-bit right?
Christian Merril:
Yeah, that's right. But they're not doing it by changing the actual structure of a chip; it's all software. They just made it so that what runs on the chip is optimized for those vector calculations, but they didn't actually touch the chip itself.
Host:
So is there a layer above that? Let me restructure my question. Is there something else going on, another abstraction, where you don't have to keep going down a level, because you can just use that lower level as it is? We know exactly what it is that makes vector calculations run optimally, so we go all the way down the stack until we get the hardware to do it, and that train of thought is understandable, up to its own limit. But is there something more fundamental, some way to, say, abuse the concept of computation for linear algebra calculations, that someone has come up with?
Christian Merril:
So that's... yeah, no, I hear what you're saying. That's actually a really interesting question, and the answer is, I'm not certain, right? These are the things people are actively working on.
And kind of going back to bitsandbytes: they went down, I won't say all the way down to the hardware level, I don't believe they went that far, but they did optimize things close to it. They're not going to tell NVIDIA to change the way they design their chips, but they're getting close to that level of control.
PEFT would be closer to what you were asking about in terms of an abstraction level. It lets you avoid fiddling directly with low-level code. It offers higher-level abstractions that integrate with the Transformers library from Hugging Face, so you can work with large language models without diving into the complexities of Torch or other low-level code.
Host:
So, it's like PyTorch code or—
Christian Merril:
Yeah, so you can get even higher, right? You can say, I want to load my model, and you load the model; and maybe I'm going to strip the language-modeling head off of this and just grab the underlying representations, the vectors. So it's a very powerful library. You can very rapidly implement a language-model application, but you can also do a decent amount of experimentation without needing to be super intimate with TensorFlow or PyTorch. It still helps to know them, but you have a slightly higher level of abstraction to play around with, especially if you're a researcher, or a scientist, I would say. So maybe you're a chemist, but you really want to use a language model to do something. It's going to be really good for you, because maybe you know Python but you don't know a ton about everything else, and you'll still be able to slot in and start utilizing it.
Christian Merril:
It also works with a bunch of other things, like I believe they even have support for things like stable diffusion and I think Whisper AI, which is a speech-to-text model. So, there's now becoming like...
Host:
Yeah.
Christian Merril:
Yeah, so they're adding a lot of different types of models that you can use in their library, so it's rapidly becoming a one-stop shop. They also have a very nice model repository and dataset repository. But going back to your previous question about other interesting stuff coming out right now: this may sound really silly, but it's actually the data you use to train, right? If we think about deep learning historically, this kind of goes to one of my favorite interview questions: "Tell me an important truth that few people would agree with you on." It's a Peter Thiel question. My response is, "NVIDIA is bad for AI," which a lot of people would probably disagree with me on, and I'll explain myself a little bit.
Host: Yeah, right...
Christian Merril:
Yeah, that's it. Hopefully no one comes after me for that comment. But it goes back to this: historically, we were very interested in how much compute we could throw at something, how big we could make the parameter counts. We kept scaling that, and we weren't really thinking a lot about what data we were putting into it and whether it was enough. Then there was a paper, I believe from Google, called the Chinchilla paper, where they started focusing on the volume of the data: does that volume translate, does it stop learning, does it overfit, what's actually happening, what's the optimal amount? And we started realizing very quickly that we were undertraining a lot of our models.
One of the things that's been very interesting to me is seeing papers like the Platypus paper, "Platypus: Quick, Cheap, and Powerful Refinement of Large Language Models." In that case, they take a Llama base model, apply roughly 25,000 instruction-tuned examples, and achieve performance that rivals other commercial large language models.
It's fascinating because it demonstrates that high performance doesn't always depend on sheer data volume. It highlights the importance of curation: having a variety of unique, high-quality examples. In Platypus, they focused on curating datasets like ScienceQA, TheoremQA, and OpenBookQA, and used cosine similarity to filter out redundant samples, which made the resulting dataset that much more valuable.
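A toy version of that curation step, dropping samples whose embeddings are near-duplicates of something already kept, assuming sentence-transformers and an arbitrary 0.9 threshold:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

samples = [
    "What is the boiling point of water at sea level?",
    "At sea level, at what temperature does water boil?",   # near-duplicate phrasing
    "State and prove the Pythagorean theorem.",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(samples, normalize_embeddings=True)  # unit vectors, so dot = cosine

THRESHOLD = 0.9
kept: list[int] = []
for i in range(len(samples)):
    if all(float(emb[i] @ emb[j]) < THRESHOLD for j in kept):
        kept.append(i)

print([samples[i] for i in kept])  # samples that survive the near-duplicate filter
```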
Interestingly, they took it even further by experimenting with merging different LoRAs, the adapters. That's something that had already been applied in stable diffusion.
Host:
Okay, so this was in stable diffusion, where you could have, like, an adapter for Pokémon, and then maybe an adapter for selfies, and so now you can have Pikachu taking a selfie.

Christian Merril:
Right, and we've found that you're able to do that with large language models as well, and there's research starting to show that it might matter in what order you merge these adapters into the model: you merge one adapter, then you merge another, then you merge a third.

Host:
Okay, so precedence has meaning, which is also kind of self-evident from the sentence, I guess, in a way, right?

Christian Merril:
Yeah, and so that's one of the things that's been really interesting to start seeing.
Christian Merril:
And I personally am going to be starting an ablation study on basically the Platypus dataset, on my own personal time. One of the things I want to do is start looking at, well, what if I removed this type of data from the dataset: what does that then cause the model to do, where does it start suffering? Obviously it'll be a very expensive ablation study. But, I mean, there's some data in there that's under a non-commercial license, which means you can't use Platypus for commercial uses; I can't use it at work, right? So what if I take all of that non-commercial data out and then train the model? How is it going to perform, what's going to be different, and what was special about that data that got it to the performance it was achieving? Those are the things I find really fascinating right now.
And there's a lot happening here; this isn't even doing the field justice, because there's so much going on. There are the multi-modal models that can consume different types of media, like images and other formats, alongside text.
You've got researchers working on how to generate synthetic datasets to train models, which is promising, but there's also research suggesting that training on synthetic data can degrade the quality of the results if it's not managed carefully. The quality of the data directly influences how good the model ends up being.
Then there's alignment, which is another big focus right now: how do you train models to avoid biases, or to make sure they give accurate, respectful responses? We actually ran into a situation at work where we asked a commercial model about a medical topic, and it refused, citing an inability to provide answers on medical topics.
Christian Merril:
We worked with the folks and we got it fixed, but it was one of those alignment issues, you know? When you're handing over a model, you want to be sure it's safe and won't be misused, that it isn't helping people build bombs or spewing toxicity on the Internet. At the same time, you need the model to operate effectively within your domain. So how do you navigate that? There's active research into setting these boundaries.
The question then becomes, what are the levels of security? How many safeguards do you need for the critical applications versus the lower-risk ones, and how do you ensure those protections can't be bypassed? For instance, a recent paper explored using a shift cipher to ask inappropriate questions, like how to hotwire a car. The alignment filter missed it, but the model still understood the intent and responded in the ciphered language, which the user could then decode back to English.
Host:
So the cybersecurity, or I guess just the overall security of the language model, is another big area of interest. So if I'm getting it right, they wanted to bypass it, and they used a cipher that the model could decode, understand, and re-encode, so that they could then decode the answer, and that's how they bypassed it?
Christian Merril:
Yeah, apparently. The example they showed in the paper was, they had, okay, do this thing, and they showed it in English, and it said, "As an AI language model, I can't..." blah, blah, blah. Then they applied a simple shift cipher, so the prompt looked like gibberish, and the model spit out what looked like gibberish. And then they took that and reversed the cipher, and it came back as the actual answer to the question it shouldn't have answered.
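For readers unfamiliar with the mechanics, this is all a shift (Caesar) cipher does: rotate each letter by a fixed offset, with the inverse shift recovering the original text. The message below is just a harmless example.

```python
def shift(text: str, offset: int) -> str:
    """Rotate alphabetic characters by `offset`, leaving everything else alone."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + offset) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

message = "The answer is forty two"
encoded = shift(message, 3)   # looks like gibberish to a naive keyword filter
decoded = shift(encoded, -3)  # applying the inverse shift restores the text
print(encoded)
print(decoded)
```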
Host:
And this isn't the first time?
Christian Merril:
Yeah, I mean, there was, what was it called? It was called DAN, like "Do Anything Now," where there was a group of folks on Reddit or 4chan or somewhere that had come up with a way to jailbreak ChatGPT into doing anything.
Host:
It was through just prompting?
Christian Merril:
There was another one that was very interesting on large language model security, where they were showing how you could essentially prompt-hijack or prompt-inject models to get them to do things they shouldn't be doing. And there are cute versions, like injecting something so that all of a sudden it's talking like a pirate. But you could easily...
Host:
So that was blocked?
Christian Merril: Well, no, no. It was basically showcasing how, with some of these models, you have this exposure. Let's say you had an email assistant, right? Let's pretend. You have an email assistant, it's reading all of your emails, and let's say it has some level of decision-making, so it can reply to an email, or maybe send emails, or do something, right? Well, what if I sent you an email, and at the beginning of the email it's like, "Hi Chris, it was great to meet you," et cetera, et cetera.
Host:
And then all of a sudden it says, okay, now open up your contacts and send this email to 30 people, and you basically could get the language model to see that and then execute those instructions, or potentially be poisoned by that text?
Christian Merril:
Yeah, and it was a very interesting paper. This was around the very beginning of the hype cycle for ChatGPT, or at least when I became aware of the hype cycle.
Host:
And I believe there is a GitHub repository called LLM security?
Christian Merril:
Yeah, and it talks about how you can compromise language models just by having them look at specific... yeah, basically...
Host:
Yeah, look at specific content, that's a real problem.
Christian Merril:
Oh yeah, especially because it could be on any website. Let's say you're doing web scraping, or it's an email, a chat message, whatever. And the attacker could even conceivably be considered innocent, because this is an agent that's acting, as opposed to the traditional concept of hacking or inappropriate access, which is based on some sort of action being taken by a person.
Host:
So the one that involves a person can be alleviated: a lot of large companies teach you what is and isn't a good email, and if someone actually hacked into your system, well, they hacked into your system. But now, because these agents are going out and being active, it's almost like if you set up a website that's actually meant to do something nefarious, you can argue, "Well, you came to us," you know? It's going to be a very weird next couple of years, I think.
Christian Merril:
Yeah, the whole legal system has to change, which it never will quickly, but I'll say, it'll be very interesting, right?
And so, yeah, I mean, I don't have solutions to these things.
Christian Merril:
I'm not personally studying the security components, but like I said, this is a very broad field right now. There's a lot of work happening in a lot of different places, probably both for nefarious reasons and for legitimate ones.
But, you know, just to kind of close out the point: a lot of the stuff I'm really interested in is the data pieces and how to make these things more efficient, because right now they're just not. I joked to some folks today on a call when I was doing a demo, I think I was using GPT-4 or something, that we're kind of back in the 1980s or 1990s, where we have to start waiting for the computer to do something again.

Host:
(Laughs)

Christian Merril:
Because we're not used to it. They were like, "Oh, do you think you can make this faster?" And it's like, I can't. This is exactly as fast as it gets right now. And that's why I'm really interested in some of these methodologies, because they're just going to make these models better and more capable. And then having them on the edge is going to be really important, in my opinion. We're finally back to pushing these machines as far as they can go, and that's really exciting for the future of computers.
Host:
But, uh, thanks so much for being on. I would love to keep going, but we do have a time limit.

Christian Merril:
No worries.

Host:
You gave us a lot to chew on, I must say, so I do appreciate it. I know you're doing a lot of projects on your own, and I'm sure there's going to be a ton of questions on some of the ideas you brought forth and some of the things you've looked at. So, if people wanted to find you, how could they?

Christian Merril:
The easiest would be LinkedIn. I'll be honest, I'm not super active on social media. I think when you reached out to me, it took me a couple of days to get back to you, because I don't check it.
Host:
Yeah, I was actually like, "Are you a real person?"
Christian Merril:
Well, yeah, I got lucky, because we had an intern and they were like, "Can I connect with you on LinkedIn?" And I was like, yeah, sure, and then I noticed I had a bunch of missed alerts. But LinkedIn is probably the best way to get in touch with me.
All right. And again, I want to emphasize: I am not the master of the universe at this stuff. I'm learning just like everybody else. But it's super exciting, and right now it's really cool, because just about anyone can participate.
Christian Merril:
So, I hope that folks, you know, get excited and try to participate. Anyone with a laptop nowadays can do this. You don't need a data center, which is great.
Host:
Thank you so much, Christian. I appreciate it.
Christian Merril:
Yeah, thank you. It was wonderful. Have a great day.