Gen AI Summit 2024 | Panel Discussion: AI Operating Systems: Infrastructure, Data, Foundation Models and Frameworks
Speakers
Alexander Ratner, Co-Founder and CEO, Snorkel AI
Ram Sriharsha, CTO, Pinecone
Vijay Karunamurthy, Field CTO, Scale AI
Raghvender Arni, “Arni”, Director of Worldwide GenAI GTM, AWS
Vijay Narayanan, Partner at Fellows Fund and former Chief AI Officer (moderator)
Summary
This panel discussion featured experts Alexander Ratner (Co-Founder and CEO, Snorkel AI), Ram Sriharsha (CTO, Pinecone), Vijay Karunamurthy (Field CTO, Scale AI), Raghvender Arni (Director of Worldwide GenAI GTM, AWS), and moderator Vijay Narayanan (Partner at Fellows Fund). The panel discussed the emerging AI stack, spanning the hardware, data, foundation model, and application layers. The panelists highlighted the complementary roles of GPUs and other accelerators like TPUs, IPUs, RDUs, and LPUs. They debated the complementarity of large and small foundation models, emphasizing the need for domain-specific, high-fidelity data to train efficient AI systems. The panelists discussed the increasing use of advanced Retrieval-Augmented Generation (RAG) combined with fine-tuning to enhance AI performance in enterprise use cases. They touched on synthetic datasets and their role in training smaller foundation models. On the topic of evaluation, they stressed the importance of enterprise use-case-specific testing and the pitfalls of overfitting to public benchmarks. The session concluded with a call for a nuanced approach to AI deployment, balancing innovation with practical application to enterprise use cases and robust evaluation methods.
Full Transcript
Vijay Narayanan: Welcome, everyone, to this exciting panel on AI operating systems. As AI, and more specifically Gen AI, starts to pervade our personal and professional lives, there is a new stack that's emerging for generative AI, all the way from the hardware layer, the data layers, model layers, application layers, and a whole slew of services for building trustworthy, reliable applications. In that context, today, we will cover, at least at a high level, some pertinent questions across each of the layers in this stack. Before I do that, I would like to invite the panelists to take about 20 seconds each to introduce themselves and share with us what is the one generative AI product or service or app that you have used most commonly in the last 12 months? And something that's not part of your company or part of what you do at work? Let's kick off with that.
Raghvender Arni: Hello, everyone. My name is Arni. I lead the generative AI go-to-market within AWS. What that really means is, I spend a lot of time with customers, trying to understand what they are trying to do with generative AI across a range of industries—what's working, what's not working—and then work closely with our product and service teams to make sure the roadmap is headed in the right direction. Are there partners we need to work with to augment our own roadmap and ensure that our customers' needs are being met? It has been a fun, extremely busy last two years, working both with customers and with partners. A product I practically live in is Claude, which is very similar to GPT but has deeper reasoning, in my opinion. I'm really bad at writing. My grammar is terrible, but thanks to LLMs, my writing sounds much, much better now. So I always go to Claude to validate and make sure that whatever I'm writing makes sense, which is even more important in a heavy writing culture like Amazon's.
Vijay Karunamurthy: I'm Vijay Karunamurthy, and I'm the CTO at Scale. At Scale, we're known for doing testing and evaluation of large models, both for the model providers, those training new foundation models, and enterprises using this for all sorts of purposes: financial services, healthcare, insurance. Governments and public sector institutions need to know the risks and safety implications with some of the models coming out. For me, it has been amazing this last couple of years seeing these new capabilities come online. I still really enjoy using ChatGPT to write code. I still ask a lot of questions about code that I see in the wild. I ask it to explain it to me, and it’s fascinating to me—stuff like the Linux kernel that I never understood 20 years ago, I can now ask a question and get something informative. I feel like I understand a bit more than I did before.
Ram Sriharsha: I'm Ram, the CTO of Pinecone. At Pinecone, I lead all aspects of engineering and research. Pinecone is a vector database company, and for those who don't know us, we act as the knowledge layer of the emerging generative AI stack. I was going to say ChatGPT, as well, since I think practically all of us use it all the time. But if I had to pick something else, lately someone turned me on to this email productivity app called Shortwave, which, if you haven't used it, I highly recommend. It makes dealing with email a lot easier. It actually uses RAG and generative AI to achieve this. It’s super cool to see something as productive as that come from a space like email being innovated as we speak.
Alexander Ratner: Awesome. I'm Alex, one of the co-founders and CEO at Snorkel AI, also on faculty at the University of Washington. What's my most-used GenAI app? As a company, we subsidize the use of GitHub Copilot, so by volume that's definitely the one. As an individual, I'm probably the most tech-backwards person on the panel. By volume, it's still various forms of embedded autocomplete, like email autocomplete, which I kind of like because that was the original use case of language models before they got the "large" prefix. So that has definitely entered my life. Given that my day job is all about developing the data to tune models so they stop failing in basic enterprise settings, I'm maybe a little more conservative about trusting some of my higher cognitive functions and tasks to GenAI. Still an optimist, it's a big leap forward, but there's generally a lot of work left to close the last mile. So I guess that reflects in my daily life.
Vijay Narayanan: Thank you. Let's get started on the core thesis of the panel. Maybe we can start with the bottom-most layer, the hardware layer. GPUs have been the workhorse of the AI revolution for a number of years now, and they continue to be the underpinning of the GenAI revolution as well. But we're starting to see a number of other accelerators come in. Obviously, the most famous one is Google's TPU, but there are a lot of others like IPUs, SambaNova's RDU, and Groq's LPU as well. Where do you all think this is heading in the next three to five years? Are we going to be primarily a GPU game, or do you see any other innovation, any other promising accelerators that will exist alongside GPUs? You can take a crack at it on the training side, the inference side, or both.
Raghvender Arni: I think first and foremost, GPUs are really gold right now; the G might as well stand for gold rather than graphics. It's hard to get access to GPUs, no matter which kind. But I think GPUs in their current form will be here for a very long time because they do what they're supposed to do really well. The number of frameworks and software libraries available for taking existing models or new models and running them on CUDA kernels means they will be here for a very long time. So there's not going to be a dramatic shift for a while, and for good reason. But one thing we certainly are seeing is more and more customers, either for training or for inference, saying, "Look, we think we can do more. We have to do more," because of the power consumption and the thermal footprint of training and serving these models. Just yesterday, one of the cloud providers announced that their power consumption has gone up by 30%. Anything around carbon neutrality they had targeted for 2030 is at risk. This will be a common theme we'll see with almost every single player out there. There's no shortage of startups working on both sides of the neural network. Going forward, if you look at classic matrix multiplication, pretty much all these accelerators do is take matrix multiplication and burn it down to the silicon level. These boxes are nothing but high-bandwidth memory connected to matrix multipliers in silicon. That's one thing. On the flip side, we're also seeing some early startups come in for doing backpropagation using photonics and other systems to lower the thermal footprint. Within AWS itself, we've been making deep investments in this space. We have two specific chipsets, Trainium and Inferentia, which we use extensively internally and also make available for customers. If you want to run Llama or Mistral, you can certainly run them on GPUs. Nvidia is an incredible partner, with deep support from the A100s to the H100s, and soon the Blackwell series. But for those specific use cases where you have hard requirements around extreme performance, either at a lower cost or improved latency or a lower thermal footprint, you can use Trainium and Inferentia to help lower the burden you would actually put on these data centers from a cooling standpoint.
Vijay Narayanan: So your point is GPUs are here to stay.
Raghvender Arni: Yeah, I think GPUs are here to stay, but GPUs will not be the only ones as this market evolves, which includes accelerators like we have.
Alexander Ratner: I agree with a lot of that. I don't think GPUs are going to disappear, but I do think we'll see some easing of supply and demand dynamics. Things are going to renormalize a bit. We have new supply coming in with these new innovations and new chipsets. Whenever you get standardization of model architectures, which we've seen with the transformer architecture, it makes sense that you should be able to build custom chipsets that optimize for that. It seems quite natural that you'll have that trend through chipset specialization. Things will be more diverse, even though GPUs won't go away. One thing worth commenting on is this other bifurcation between big models and small models. There's this statement I've heard: "You don't necessarily need a rocket ship to drive to the grocery store." There are probably a lot of safety issues and cost issues with taking a rocket ship to the grocery store. I think we're going to see a big split between compute optimized around these large models and a lot of specialized or just standard architectures that work fine for smaller specialist models. We're already seeing that lift-and-shift phenomenon of getting something to work on the biggest model and then slicing it up into smaller models. That has a very different compute footprint.
Vijay Narayanan: That's a good segue to our next topic. In 2022-23, we saw this race to large models, monolithic LLMs, with model sizes going up by one or two orders of magnitude very quickly. In the last 12 months, we've seen that model-size growth curve start to plateau. There have been some interesting approaches with ensembles of smaller models. Now, new generations of models, for instance the Phi models from Microsoft, are becoming competitive with GPT-3.5 while being almost two orders of magnitude smaller. In the next 12 months, if we were to take a stab at this, where do you think the most innovation is going to come from? Are we saturating on the first L in LLMs? Is the L going to flip to an S? Or are we going to see another order of magnitude bigger models come up?
Ram Sriharsha: First of all, I don't think we are close to saturating large models yet. We can clearly see custom domains and hard problems where large models are still better, even now, especially the proprietary ones. The gap is closing, but it's not going to close in the next few months. That said, not everything needs large models. There are a lot of tasks that don't need complex reasoning. There are a lot of tasks that can be done by simpler models. The two types of models you mentioned—you can think of them as mixture-of-experts models or performance-oriented models—are trying to make things more cost-effective and performant at an acceptable trade-off in quality. I'm more excited about the Phi models. They're going after the idea that you don't need these large general-purpose models and a ton of data if you can find better, higher-fidelity data. The definition of higher fidelity is complex and still evolving, but I think there's a lot of promise there. You can build smaller models that can go very far on a large number of tasks by choosing your data better. You can also have bigger models teach smaller models.
Alexander Ratner: The notion of higher-fidelity data is interesting. With higher-fidelity data for general-purpose models, you have fundamental information-theoretic limits on how small you can go. But if you mean higher fidelity with respect to a specific task, which is where I think a lot of the enterprise value is going to live, then you have even further to go in terms of distilling, shrinking, and specializing, which is intuitive. We're entering a kind of put-up-or-shut-up moment in enterprise AI: let's get a couple of real use cases across the line, narrower ones that will ship.
Vijay Karunamurthy: There's an interesting convergence on the large and small side. Large models are now mixture-of-experts models, with domain-specific knowledge that one or two experts within the mixture are really well-tuned to handle. On the small side, a lot of those are now trained by distillation of the larger models. For example, Gemini Flash, announced yesterday by Google, is a distillation of one of their larger Gemini models. The universal challenge now is coming up with robust testing and evaluation across all the disparate domain knowledge the larger models are expected to have and that you're using a mixture of experts or other ensemble methods to capture. On the smaller models, if you're using distillation, how do you prove that the distillation captures the nuance of what you're aiming the models to do? For example, if you have a text-to-SQL model and distill it down from a larger model, how do you have a realistic text-to-SQL benchmark that lives up to real-world use cases? So you don't just say, "Hey, my smaller model does great on the Spider text-to-SQL benchmark," and then realize that in the real world, "Hey, my tables are more complicated than what that benchmark covers," and maybe my smaller model wasn't capable enough to handle that.
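To make that benchmark concern concrete, here is a minimal sketch of execution-match evaluation for a distilled text-to-SQL model; the `generate_sql` callable, the example format, and the use of in-memory SQLite are illustrative assumptions, not anything the panel prescribed:

```python
import sqlite3
from typing import Callable, Iterable

def execution_accuracy(
    generate_sql: Callable[[str, str], str],   # hypothetical: (question, schema_ddl) -> SQL string
    examples: Iterable[dict],                  # each: {"question", "schema_ddl", "gold_sql"}
) -> float:
    """Fraction of examples where the predicted SQL returns the same rows as the gold SQL."""
    hits, total = 0, 0
    for ex in examples:
        conn = sqlite3.connect(":memory:")
        conn.executescript(ex["schema_ddl"])   # build (and optionally seed) the schema
        try:
            pred = conn.execute(generate_sql(ex["question"], ex["schema_ddl"])).fetchall()
            gold = conn.execute(ex["gold_sql"]).fetchall()
            hits += int(sorted(map(repr, pred)) == sorted(map(repr, gold)))
        except sqlite3.Error:
            pass                               # malformed or invalid SQL counts as a miss
        finally:
            conn.close()
        total += 1
    return hits / max(total, 1)

# Run the same harness twice and compare: once on a public benchmark-style set,
# once on examples drawn from your own, messier schemas.
# public_score  = execution_accuracy(distilled_model_fn, spider_style_examples)
# private_score = execution_accuracy(distilled_model_fn, internal_schema_examples)
```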
Vijay Narayanan: So what I'm hearing is, it's use-case specific. In some cases, smaller models might be sufficient, whereas if you need more generalization, you might need to think about larger models.
Raghvender Arni: One thing I'd add is, I know you focused on the first L in LLM—the large—but I think the second L is more important. Most of the new models are not just language; they're multimodal by nature. As scaling runs into the limits of how much text data there is to read, the only way these models can keep growing is by reading images, video, and other multimodal formats. Large models will be multimodal by default. The small models will be text and custom domain: custom text or custom data. We'll see that divergence more.
Vijay Narayanan: That's a good point to switch gears. Let's move on to data. Let's start with training and then go to inference. We've seen a lot of model releases and performance improvements in the last six months claiming better results because more time was spent curating the data. It's still the same data, but curated better, with noise taken out, and so on. Are we as a community at risk of running out of good publicly accessible datasets? At this point, all these models are training on largely the same pool of high-quality data that is publicly available on the internet. Are we at risk of saturating there? If that well runs dry, where do we turn next? Where is the next set of innovations coming from?
Alexander Ratner: I guess I'm consistently on the less cool side, pitching for small versus large. I was trying to pitch a reporter on "small models and small datasets" recently. There are both large and small datasets. There are many interesting areas, especially multimodal ones, where lots of scale-up is possible, particularly where synthetic generation is feasible, like videos from video game engines. There's scaling, but there's also saturation, especially with public text data. A lot of the interesting areas now are about scaling down to more specific datasets for specific enterprises, domains, use cases, and so on. I'll leave the point about curating data at inference time, like in RAG, to others, but we do some work on tuning that. A lot of our work is on fine-tuning, aligning, and curating the data for that. There's an immense amount of alpha in enterprises' own data and knowledge: the data in their data lakes, but also the knowledge of their lawyers, underwriters, and doctors who know how to specify what is good. I don't think we've exhausted that, but I don't think that's a volume game. That's a specificity game. In some of our public-facing research, we had a consortium project with a bunch of tech companies last year called DataComp that basically showed that if you curate down the data, you beat state of the art on a multimodal CLIP model. There's a lot of "less is more": less data that's higher quality and more specific to the use case you're trying to do well on, versus just pure scale-up. I'll end on a metaphor: one way I think about the LLM leap of the last couple of years is going from kindergarten to college. If you're trying to teach a kindergartener, you may want to optimize for volume, just hearing lots of language to pick up the grammar. If you're trying to teach a college student to become an underwriter, you don't want to just dump a big pile of books on their desk. You want to give them the most minimal, curated curriculum for that domain and task possible. That's not black or white, but that's the transition phase we're entering.
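As a rough illustration of that "curate down, less is more" idea, here is a minimal sketch; the quality scorer and the 30% keep fraction are placeholder assumptions, standing in for something like an image-text similarity score or a domain relevance classifier:

```python
from typing import Callable, Iterable, List, Tuple

def curate_top_fraction(
    pool: Iterable[str],
    quality_score: Callable[[str], float],  # hypothetical scorer for each raw example
    keep_fraction: float = 0.3,
) -> List[str]:
    """Keep only the highest-scoring slice of a large raw pool ("less is more")."""
    scored: List[Tuple[float, str]] = sorted(
        ((quality_score(x), x) for x in pool),
        key=lambda pair: pair[0],
        reverse=True,
    )
    keep_n = max(1, int(len(scored) * keep_fraction))
    return [example for _, example in scored[:keep_n]]

# curated = curate_top_fraction(raw_web_pool, my_domain_scorer, keep_fraction=0.3)
# The curated subset, not the full pool, then feeds pre-training or fine-tuning.
```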
Vijay Narayanan: Do we see a situation where synthetic datasets generated by these larger models become the training data for smaller models (SLMs)? Would you envisage a situation like that happening in the next 12 months?
Vijay Karunamurthy: There are very specific examples where that works really well. For example, for text-to-SQL and natural-language-to-SQL, creating synthetic datasets based on your real-world schema is incredibly valuable for fine-tuning a model. That fine-tuned model is likely to outperform a general-purpose model, given it understands how to disambiguate multiple tables with the same name and understands your particular patterns, naming conventions, and join structures. It's going to generate not just correct SQL, but probably more efficient SQL for your specific schema than a general-purpose model would. That's a great example where synthetic data can pay huge dividends in enterprises. In lots of other cases, enterprises struggle to test the real-world implications of these models and how they're trained. If you're using a model as a copilot to assist a wealth advisor in talking to customers, do you have any public benchmarks you can turn to that say, "Hey, this copilot is going to speak in a way that's nuanced and sophisticated and understands the cultural implications of telling someone how to save for retirement?" Probably not. You should start by thinking about that benchmark and how you can translate your internal insights into it. Only after that can you go about thinking, "Here are synthetic data examples that can improve that performance." Generating synthetic data examples blindly, without any measurable improvement in performance, is a fool's errand.
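A minimal sketch of the schema-grounded synthetic data generation described here; the `teacher_complete` callable, the prompt wording, and the JSON output contract are hypothetical stand-ins for whatever larger "teacher" model and prompt an enterprise actually uses:

```python
import json
from typing import Callable, List

PROMPT = """You are generating training data for a text-to-SQL model.
Schema (DDL):
{schema_ddl}

Write {n} diverse, realistic analyst questions about this schema and the SQL that
answers each one, using the real table and column names. Return a JSON list of
objects with "question" and "sql" fields."""

def synthesize_sql_pairs(
    teacher_complete: Callable[[str], str],  # hypothetical call into a larger teacher model
    schema_ddl: str,
    n: int = 25,
) -> List[dict]:
    raw = teacher_complete(PROMPT.format(schema_ddl=schema_ddl, n=n))
    pairs = json.loads(raw)
    # Before fine-tuning on these pairs, validate that each generated query actually
    # parses and executes against the schema (see the execution-accuracy sketch above),
    # so blindly generated examples don't degrade the dataset.
    return [p for p in pairs if "question" in p and "sql" in p]
```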
Vijay Narayanan: Let’s switch gears to the different patterns emerging for leveraging foundation models in enterprises: simple prompting, sophisticated prompting methodologies, optimization techniques, prompt evolution techniques, RAG, fine-tuning models, adapters, and pre-training from scratch. Your team works with a lot of enterprises. What are your thoughts on which one or two design patterns do you see dominating? Or is it still always coming back to "try everything, take the easiest one for the use case at hand?" Is there a framework for this choice?
Raghvender Arni: Last year, in the second half and the early part of this year, RAG was the dominant pattern. Classic retrieval-augmented generation was very common. But naive or classic RAG is not sufficient. Most enterprises have figured out that RAG on its own is not going to work. If you've not read the paper "Seven Failure Points of RAG," look for it, and also look for the RAG survey on arXiv. They speak about the challenges of RAG, from chunking to indexing to retrieval strategies. There's a lot of rethinking, even of RAG, that customers are going through. RAG is a dominant pattern, but advanced RAG without a doubt. That's point number one. Point number two, as both Vijay and Alex mentioned, in the last six months there has been a dramatic increase in fine-tuning. Not custom model building, but fine-tuning. For three simple reasons. First, the availability of high-quality base models, especially from open source. Second, the number of fine-tuning techniques making it more approachable, including parameter-efficient fine-tuning, which lowers the cost: what would have cost a million dollars or more initially has now come down to a small fraction. The third reason is customers realizing that their data is important. They do have custom data and don't want to keep passing it into RAG every time, because they'll pay a fortune in tokens. They look at their dataset and say, "What percentage of this data doesn't change much?" It could be the ontology if you're in pharma, or specific datasets in financial services that don't change much. So they create a fine-tuned model and then run RAG on top of that. We've seen that combination of fine-tuning plus advanced RAG become the dominant pattern in a lot of enterprises.
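A rough sketch of this "parameter-efficient fine-tune, then RAG on top" pattern, assuming the Hugging Face transformers and peft libraries; the base model name, LoRA hyperparameters, and the `retrieve` function are illustrative only, not a prescription from the panel:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_ID = "mistralai/Mistral-7B-v0.1"          # illustrative open base model

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
model = AutoModelForCausalLM.from_pretrained(BASE_ID)

# Parameter-efficient fine-tuning: train small LoRA adapters on the slow-changing,
# domain-specific corpus (an ontology, stable reference data), keeping base weights frozen.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()             # typically well under 1% of the base model

# ...run a standard supervised fine-tuning loop here on the domain corpus...

def answer(question: str, retrieve) -> str:
    """`retrieve` is a hypothetical vector-database query returning relevant text chunks.
    RAG supplies the fast-changing knowledge the adapters were never trained on."""
    context = "\n\n".join(retrieve(question, top_k=5))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```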
Vijay Narayanan: So advanced RAG plus, as needed, fine-tuning.
Alexander Ratner: One thing worth noting: there were a lot of fine-tuning vs. RAG debates early on. They are really complementary tools. I'm going to pigeonhole myself as the dumb-metaphor guy, but think about this as a med student going to do a complex diagnosis. No matter how much access to the patient's chart and the medical literature they have, if they don't have training in the subdomain, they won't do well. Conversely, even an expert oncologist won't do well if they can't access the patient's chart and the latest literature. They're complementary tools. The original RAG paper did fine-tuning of the retrieval and embedding model. We're doing a lot of work now to label data and tune embedding and retrieval models to improve RAG, as well as to fine-tune the LLM. They're different kinds of tools. Some of the trade-offs mentioned are: is this data changing very frequently, so I want to call it at inference time, or is it fairly fixed, so I'd rather encode it? There's a rich trade-off space there. We'll see the methods of getting data into models become lower-level trade-offs based on what kind of setting you're in, versus big holy-war debates. That will bring the focus to how I get my data in the right place to both tune and evaluate. Then the mechanism for how you get it in and how you serve it becomes an architecture decision you figure out downstream.
Raghvender Arni: One quick thing. With the rise of large context windows, a lot of customers initially think, "Do I even need to do RAG, let alone fine-tune?" They think they can send everything to the model and forget about indexing and vectors. But when they see the first bill come back, they realize maybe they should rethink what they send to these models. The way attention works, there's no magic; you're going to pay a cost at some point. So from that perspective, RAG is not going away. Even with fine-tuning, you're going to combine them in various ways. The large context window opens up opportunities to reason across many documents; it doesn't remove the need for RAG or fine-tuning. The mental model needs to shift for a lot of customers.
Ram Sriharsha: There's some confusion about RAG and fine-tuning. The original RAG paper took a parametrized model, thought of your knowledge as the non-parametric part, put the two together, and then fine-tuned the whole system. You get high-quality retrieval, knowledge into the language model, fewer hallucinations, and more grounding. What has happened is that when people say RAG now, they mean a simplified system with no feedback loop in most implementations. You could take the original RAG paper to its logical conclusion and think of the LLM architecture as a reasoning engine with just enough memory to understand language and reason, with the actual knowledge stored in a vector database that it can retrieve from. This also addresses making things cheaper, by not having large language models store all their knowledge in parameters. How do you make data more accessible and attributable? Don't put it in the model's parameters, because there it's much harder to inspect and attribute. RAG as a foundational architecture has a lot of value, but RAG in its current implementation is very much an art rather than a science.
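As a concrete reference point for the "reasoning engine plus external knowledge" framing, here is a minimal RAG sketch using the Pinecone Python client; the index name, the `embed` and `generate` callables are hypothetical, and response field names may differ across client versions:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")      # assumes an index already populated with
index = pc.Index("enterprise-docs")        # embedded document chunks and their text metadata

def rag_answer(question: str, embed, generate, top_k: int = 5) -> str:
    """embed(text) -> vector and generate(prompt) -> str are hypothetical model calls."""
    # Retrieval: the vector database acts as the non-parametric knowledge store.
    results = index.query(vector=embed(question), top_k=top_k, include_metadata=True)
    context = "\n\n".join(m.metadata["text"] for m in results.matches)
    # Generation: the LLM acts as the reasoning engine, grounded in what was retrieved.
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```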
Vijay Narayanan: That's a very nuanced point. The original RAG work distinguishes knowledge and language generation as two separate components but trains them holistically. Now these are being applied as two separate, independently optimized components, just plugging a retriever in front of the LLM. Let's shift to something at the tail end of why we are doing all of this: creating and measuring the value of GenAI-powered applications. Testing and evaluation of these systems come with a host of issues: hallucinations, trustworthiness issues, fairness issues, and so on. As these systems start taking hold in enterprises, how should we be thinking about the overall testing and evaluation framework? Are there any standard frameworks emerging?
Vijay Karunamurthy: It's a really interesting time, because every month you see a new large model come out, and sometimes small models. They always publish superior performance on benchmarks, which are important for measuring things like coding performance and the ability to incorporate human knowledge in various domains. But every month that happens, the actual human-based evaluation of these models shows they aren't improving as quickly as the benchmark numbers suggest; the benchmarks are breaking down. One reason is that models are now being trained on examples from the evaluation set. For example, there's a common grade-school math set, GSM8K, that has helped contribute to model improvement over time. But we found that models have gotten sophisticated at recognizing patterns present in the training set. Our lab, the Scale Evaluation and Analysis Lab (SEAL), just published the first open report on a holdout dataset called GSM1K, which resembles those 8,000 public examples but is different enough in distribution. We saw a huge change in performance across models. The Phi models, for example, saw a significant performance decrease. We're realizing a lot of prompting techniques have led to overfitting against specific benchmarks. We need to do more work, publicly and transparently, on better evaluations. You're going to see a future where every public benchmark has a holdout set that provides a much more effective way of evaluating performance over time.
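A minimal sketch of the public-versus-holdout comparison described above; exact-match scoring and the example format are simplifying assumptions, and GSM-style grading would normally also need answer extraction:

```python
from typing import Callable, Iterable

def accuracy(solve: Callable[[str], str], examples: Iterable[dict]) -> float:
    """examples are {"question": ..., "answer": ...} pairs; `solve` is the model under test."""
    examples = list(examples)
    correct = sum(solve(ex["question"]).strip() == ex["answer"].strip() for ex in examples)
    return correct / max(len(examples), 1)

# A large gap between the public test set and a freshly written, same-distribution
# holdout set (the GSM8K vs. GSM1K pattern) is evidence of overfitting to the benchmark.
# public_acc  = accuracy(model_fn, public_benchmark_examples)
# holdout_acc = accuracy(model_fn, private_holdout_examples)
# overfit_gap = public_acc - holdout_acc
```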
Alexander Ratner: Enterprise-specific evaluation is critical. The stuff you guys put out on that is awesome. How do we do public benchmarks better? Some of that is domain specificity, getting closer to what people actually care about. Some of it is about hygiene, so you don't leak answers that can be cheated on. A side note: thinking as an AI/ML educator, I think we made a mistake over the last decade. Many data scientists were taught, "Never peek at the test set. That's cheating." I think that turned into, "Don't think about the test set." There's not a lot of engineering put into constructing the test set so that it's both clean and intentionally built to be relevant to real domains. There's been a bonanza of auto-eval techniques, basically using GPT-4 or another model under the hood. If your gold standard is that model, that's great. But for settings where you're trying to go beyond the off-the-shelf model, auto-eval techniques don't make sense. It's like training an undergrad to be an underwriter, but then pulling another undergrad off the street to grade the final underwriting exam. There's a human-in-the-loop element and a domain-, problem-, and enterprise-specific element to evaluation that needs to happen for meaningful test and eval, and we're just not mature around that yet.
Vijay Narayanan: The takeaway for me is: don't just blindly take the leader on a leaderboard, but test it for yourself on your use case, with your data, and on whatever applications you care about. With that, unfortunately, we have run out of time. Thank you all for a fantastic conversation.