Reasoning With Large Language Model Powered Products From The Ground Up

It is an exciting time to build products powered by large language models (LLMs). The best commercial models (such as OpenAI Davinci and Cohere Generate) exhibit excellent properties, such as recognizing the CAP theorem (from distributed computing) from a very vague description [Fig1] or explaining the Black-Scholes formula in a unique voice [Fig2]. These "breakthrough capabilities," or "emergent abilities" as described by Jason Wei, are not yet widely reflected in academic benchmarks.

[Fig1] Breakthrough properties exhibited by the GPT-3.5 series / Claude feel poorly represented by existing benchmarks.
[Fig2] New GPT-3 version (text-davinci-003) explains the Black-Scholes formula in a unique voice.

This short opinion piece assesses the feasibility of replicating these capabilities within a 6-month timeframe and suggests the most practical approach for product builders working with large language models. In short, we need internet-scale data and collaborative training power to get breakthrough capabilities. Data collection relies on near-data-center storage, scraping, and curation capabilities. Storage requires specialized skills and a complex setup. Scraping is even more challenging since it is inherently adversarial. Curation has a high barrier to entry and is generally underexplored in industry and academia. Collaborative training in this context refers to synchronized gradient descent across a mostly static training set. It demands significant raw computing power and bandwidth. Raw computing power is readily available at retail, but training a GPT-3.5-level model without adequate bandwidth has yet to be achieved (even considering developments such as Together's GPT-JT-6B-v1).
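
To make "synchronized gradient descent" concrete, here is a minimal toy sketch in plain Python, not a description of how any particular lab trains. Workers are simulated in one process, the model is a single weight, and every function name is illustrative. The helper at the end uses the standard ring all-reduce cost of 2 * (n - 1) / n times the gradient payload per worker.

```python
# Toy simulation of synchronous data-parallel gradient descent.
# All names are illustrative; real systems use a framework such as PyTorch.

def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for a network all-reduce: every worker ends up with the mean."""
    return sum(values) / len(values)

def train(shards, steps=200, lr=0.01):
    """Each step, all workers compute a gradient, average it, and apply the same update."""
    w = 0.0
    for _ in range(steps):
        grads = [local_gradient(w, shard) for shard in shards]  # one per worker
        w -= lr * all_reduce_mean(grads)  # synchronization point
    return w

def ring_allreduce_bytes_per_worker(num_params, bytes_per_param, num_workers):
    """Bytes each worker sends in one ring all-reduce of the full gradient."""
    payload = num_params * bytes_per_param
    return 2 * (num_workers - 1) / num_workers * payload

# A static training set for y = 3x, split across 4 simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
w = train(shards)  # converges to a weight very close to 3.0

# Communication volume for a 6B-parameter model with 16-bit gradients on 8 workers.
bytes_per_step = ring_allreduce_bytes_per_worker(6_000_000_000, 2, 8)
```

Under these assumptions each worker sends about 21 GB per step. Over a hypothetical 10 Gbit/s link that is roughly 17 seconds of communication per step before any overlap or compression, which illustrates why bandwidth, rather than raw compute, tends to be the constraint.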

Currently, techniques similar to MosaicML's have yet to reach broad adoption, so a partnership with cloud providers or colocation centers is necessary. Colocation centers typically need bespoke relationships and contract values of at least single-digit millions, and may take up to three months to set up. Meanwhile, cloud providers such as Google lack high-bandwidth clusters, while Amazon's Elastic Fabric Adapter (EFA) stack is challenging to set up and inferior to Mellanox. Azure is reportedly only interested in deals with double-digit millions of dedicated spending, while Oracle has limited capacity beyond 64 compute nodes. This leaves second-tier providers such as Lambda Labs and CoreWeave, but even there, finding enough capacity to train GPT-3.5-level models may still be challenging.

PyTorch is arguably the most commonly used framework for effectively utilizing collaborative computing power. However, its distributed training component is challenging. The ecosystems around Megatron-LM, Horovod, and DeepSpeed show little growth, meaning the supply of engineers experienced with them remains flat. Even startups using PyTorch may not attract enough infrastructure engineers to maintain proper breakthrough-LLM training jobs. Using specialized hardware (such as Tenstorrent, Graphcore, Cerebras, and Google TPU) requires even more specialized training. While academic institutions have access to supercomputers with ample computing power, it is harder for them to convert that into proprietary models readily useful for product building.

A preliminary conclusion is that most startups claiming to build new models are likely able to train only on a limited number of hardware nodes (probably just one), mainly by fine-tuning existing large models or training smaller models from scratch. The most probable fine-tunable models in this category are GPT-J-6B, GPT-NeoX-20B, OPT, BART, BLOOM, and Flan-T5, among others. Per scaling laws, it is hard for these startups to compete with "breakthrough" models like Bard, Claude, and ChatGPT, whose developers have better access to highly curated data, more collaborative computing power, and talented teams to ensure consistent development progress.
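
A rough back-of-the-envelope calculation shows the size of this gap. The sketch below uses the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. The hardware figures (one node of 8 accelerators, each sustaining 100 TFLOP/s) are assumptions chosen for illustration, not measurements, and the 6-billion-parameter run reuses the roughly 300 billion tokens reported for GPT-3 purely for comparison.

```python
# Back-of-the-envelope training cost using the ~6 * N * D FLOPs rule of thumb.
# Hardware numbers below are illustrative assumptions, not benchmarks.

SECONDS_PER_DAY = 86_400

def training_flops(num_params, num_tokens):
    """Approximate total training FLOPs for a dense transformer."""
    return 6 * num_params * num_tokens

def training_days(num_params, num_tokens, num_accelerators, flops_per_accelerator):
    """Wall-clock days, assuming perfect scaling at the given sustained throughput."""
    total = training_flops(num_params, num_tokens)
    return total / (num_accelerators * flops_per_accelerator) / SECONDS_PER_DAY

ASSUMED_SUSTAINED_FLOPS = 1e14  # 100 TFLOP/s per accelerator (assumption)
TOKENS = 300_000_000_000        # roughly the token count reported for GPT-3

# One node of 8 accelerators: a 6B model versus a 175B (GPT-3-scale) model.
small_days = training_days(6_000_000_000, TOKENS, 8, ASSUMED_SUSTAINED_FLOPS)
large_days = training_days(175_000_000_000, TOKENS, 8, ASSUMED_SUSTAINED_FLOPS)
```

Under these assumed figures, the 6-billion-parameter run takes about 156 days on a single node, while the 175-billion-parameter run takes about 4,557 days (more than twelve years). Real throughput, token budgets, and parallel efficiency will differ, but the order of magnitude is the point.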

Another minor comment is that scaling and serving large language models is still challenging. While frameworks like Ray Serve, Alpa, and Colossal-AI exist, there is little shared knowledge on how to use them effectively in production products. Most new models, released as a reaction to ChatGPT, are still in the "research preview" stage.

This evaluation is likely inaccurate in three ways:

1. Vision and other multimodal models may have more favorable scaling laws, allowing unique properties to emerge with just 1-8 nodes of computing power, which are more abundant in the market now.
2. Some companies have been privately training language models for specific use cases for a very long time (such as Character AI and Cresta), and their expertise and first-mover advantage will be valuable.
3. Specialized training companies like MosaicML may have discovered, or may soon discover, how to consistently replicate innovative abilities in limited use cases with reduced computing power.

Open-source initiatives like the Pile or LAION may also challenge this conclusion. However, at present, their data quality still needs to improve.

Given the challenges in moving along the scaling-law curve under labor and hardware constraints, a more practical approach for product builders may be to assemble existing open and closed-source LLMs rather than training from scratch or even fine-tuning. This is frequently called the "composability" hypothesis in recent discussions. LangChain is a noteworthy example of this effort, with non-LLM APIs such as Google Search and Wolfram Alpha also becoming involved. Automated and learned prompt engineering with either black-box or white-box models is another exciting possibility.

It is essential to be transparent about whether a product is built using models trained from scratch, fine-tuned models, or composable APIs. We should avoid the notion that only models trained from scratch are "impressive" or "technically defendable." LLMs still produce false information, have limited memory, struggle with multimodality, and produce outputs that are hard to formalize. We should therefore be cool-headed about making LLMs perform tasks that require high precision, and think more deeply about how to use their eloquence, cross-lingual contextual understanding, and exceptional fuzzy retrieval abilities.
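
As a rough illustration of the composability idea, the sketch below wires a language model to two non-LLM tools. It does not use LangChain or any real API: stub_llm, stub_search, and stub_calculator are hypothetical stand-ins that return canned results, and the routing rule (digits mean "calculator") is a deliberately crude placeholder for a model's own decision.

```python
# Minimal sketch of composing an LLM with non-LLM tools.
# Every function here is a hypothetical stub; no real service is called.
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def stub_llm(prompt):
    """Stand-in for a hosted LLM: picks a tool for ROUTE prompts, else writes an answer."""
    if prompt.startswith("ROUTE:"):
        question = prompt[len("ROUTE:"):]
        return "calculator" if any(ch.isdigit() for ch in question) else "search"
    return "Answer based on tool output: " + prompt

def stub_search(query):
    """Stand-in for a web search API."""
    return f"[top search result for '{query}']"

def stub_calculator(expression):
    """Evaluates 'a op b' where op is +, -, or *, with tokens separated by spaces."""
    left, op, right = expression.split()
    return str(OPS[op](float(left), float(right)))

TOOLS = {"search": stub_search, "calculator": stub_calculator}

def answer(question, llm, tools):
    """Ask the model which tool to use, run the tool, then let the model phrase the result."""
    tool_name = llm("ROUTE:" + question).strip()
    tool_output = tools[tool_name](question)
    return llm(tool_output)

math_reply = answer("2 * 21", stub_llm, TOOLS)
fact_reply = answer("CAP theorem", stub_llm, TOOLS)
```

The design choice worth noting is that the product logic lives in answer, which only depends on callables. Swapping a stub for a real model or API, whether open or closed-source, does not change the composition itself.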