
How Tokenomics is Shaping the Economics of Generative AI

For the last several years, the big news in Generative AI has been the emergence of bigger and more capable models. From Google’s BERT Large at a mere 340M parameters in 2018 to OpenAI’s GPT-4 in 2023 at a reported 1.8T parameters, LLMs have scaled by more than 5,000x with impressive improvements in performance. This was made possible by the development of GPU-based systems from Nvidia that can scale to clusters of more than 10,000 GPUs to train these enormous models.

While recent reports suggest that the benefits of scaling to larger and larger models are slowing, there is no slowdown in innovation in the AI space. At Scale AI’s Leadership Summit in November, CEO Alex Wang described the period from 2018 to 2024 as the “scaling phase”, with about $200B going into training. But he considers the release of OpenAI o1-preview in September 2024 the kick-off of the “innovation era”, which will include deeper developments in advanced reasoning, test-time compute, and other advances on the path to superintelligence.

And while the details of o1 are not public, Alan D. Thompson speculates that it may actually be a much smaller model than GPT-4 that gets much better results by spending more time thinking (inferencing).

Training models is an internal investment in anticipation of future revenue through deployment of those models. Inferencing is the path for realizing the benefits of generative AI by making models available to millions of users. From a financial perspective, think of training as a cost center and inferencing as a profit center, one which must achieve sustainable profitability to deliver on the promise of generative AI.

Already there are predictions that inferencing will exceed 80% of the market in the next few years, while training, which previously made up the vast majority of the market, will capture a much smaller percentage. Since innovations are also shifting toward increased inference compute, we can expect to see enormous growth in inference system demand and deployment in the next few years.

And so whether you are a developer deploying your own model, a hyperscaler providing models for millions of users, or a CSP hosting models, choosing a GenAI system for inference deployment is an extremely important decision with enormous potential to impact your bottom line.

Microsoft CEO Satya Nadella said it this way in his keynote at the Microsoft AI Tour London in October, “... one of the things that I think a lot about is performance. You could even say tokens per dollar per watt. That’s the new sort of currency.”

So how can you evaluate different systems? Currently there are some great resources like Artificial Analysis that help to compare some performance metrics as well as the latest pricing for different models. But there are limited resources to help understand the actual costs.

So why should we be concerned about actual costs? After all, isn’t the price one pays what matters? In reality, it is critical for developers to determine whether a particular provider’s pricing is sustainable. Otherwise they might get locked in and be at risk of future price increases or the transition costs of moving to another provider. And of course those who provide compute as a business, e.g. CSPs, need to have confidence that they will be able to generate a good return on their investments.

Tokenomics

The term “Tokenomics” has been repurposed from crypto and popularized by Dylan Patel of SemiAnalysis to describe the economics of Tokens as a Service (TaaS) for generative AI. SemiAnalysis is a great resource and has a number of reports that help provide an understanding of the tokenomics of different systems.

At Recogni, we also regularly evaluate other systems to make sure that we are on track with our mission to build the most efficient, accurate, and economical multimodal GenAI inference systems, so that the world can use GenAI profitably, sustainably, and with confidence.

In our experience, being able to quickly complete an initial first-order evaluation of new systems is crucial to sort through the marketing fluff. Sometimes a provider has impressive output speed for single users, but it quickly becomes obvious that the pricing is completely unsustainable, with losses that will easily exceed 10x revenue.

Today we would like to share with you a high-level approach to quickly get a first-order understanding of tokenomics and identify some of the key contributors on the road to sustainable profitability.

Key Contributors

The three main variables impacting tokenomics are system price, power, and performance (total tokens/sec).  For a more accurate comparison, be sure to collect results for the same workloads, context lengths, and data precision.

Capital costs for generative AI systems are 50 to 80% of the total five-year expenditures. Energy accounts for around 10%, and datacenter hosting accounts for the remainder. But this may be shifting due to significant increases in energy costs in locations where growth is outpacing the ability to put the needed energy capacity in place.

Artificial Analysis and MLCommons are useful sources for system performance. Artificial Analysis seems to be more focused on user experience, reporting latency and output speed for single users, while MLPerf is more focused on total tokens, which may be more useful for understanding overall system performance.

We’ve made available a fairly simple spreadsheet that you can download to get a head start on making your own comparisons.

If you already know how many racks are used, there are only a few pieces of information that you need to collect:

  • Total tokens/sec
  • Power per Rack
  • Price per Rack

If you don’t know the number of racks, you can estimate it with a bit of additional information. See “Estimating the Number of Racks”.

Lastly, other financial and datacenter-related parameters can also be entered if known, or default values can be used. These include items like depreciation life, weighted average cost of capital (WACC), co-location or datacenter hosting costs, energy cost, power usage effectiveness (PUE), and utilization. Since you would normally use the same values across comparisons, they have less impact on the differences between providers.
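To make this concrete, here is a minimal Python sketch of the kind of first-order calculation the spreadsheet performs: annualize the capital cost, add energy and hosting, and divide by tokens produced. The function name, parameter names, and default values (WACC, energy price, hosting rate, PUE, utilization) are illustrative assumptions only, not the spreadsheet’s actual fields, so substitute your own numbers.

  # First-order tokenomics sketch. Defaults are illustrative assumptions.
  def cost_per_million_tokens(
      racks: int,
      price_per_rack: float,                # USD
      power_per_rack_kw: float,             # kW at the rack
      tokens_per_sec: float,                # total system throughput
      depreciation_years: float = 5.0,
      wacc: float = 0.10,                   # weighted average cost of capital
      energy_cost_per_kwh: float = 0.10,    # USD per kWh
      hosting_per_kw_month: float = 150.0,  # co-location cost per kW per month
      pue: float = 1.3,                     # power usage effectiveness
      utilization: float = 0.7,             # fraction of time serving tokens
  ) -> float:
      # Annualized capital: straight-line depreciation plus cost of capital.
      capex = racks * price_per_rack
      annual_capital = capex / depreciation_years + capex * wacc

      # Facility power includes PUE overhead; 8760 hours per year.
      facility_kw = racks * power_per_rack_kw * pue
      annual_energy = facility_kw * 8760 * energy_cost_per_kwh
      annual_hosting = facility_kw * hosting_per_kw_month * 12

      annual_tokens = tokens_per_sec * utilization * 8760 * 3600
      annual_cost = annual_capital + annual_energy + annual_hosting
      return annual_cost / (annual_tokens / 1e6)

  # Example: 6 racks at $3M each, 40 kW per rack, 200,000 total tokens/sec.
  print(f"${cost_per_million_tokens(6, 3e6, 40, 200_000):.2f} per 1M tokens")

With these made-up inputs, capital dominates the annual cost, matching the 50 to 80% share noted above, and the result works out to roughly $1.40 per million tokens.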

Estimating the Number of Racks

Chip and system memory specifications are the most useful parameters to project the number of racks that are needed. Memory is needed for storing the model parameters (aka weights) and intermediate results (aka key-value (KV) cache in LLMs).

In addition to this, here is the other information that is needed:

  • Users
  • Context length - how many input and output tokens?
  • Model precision - weights and activations
  • Memory per chip
  • Chips per rack

So if you know the memory that is needed and the memory available in each rack, you can determine the number of racks required for your use case.

Let’s look at how to estimate the memory needed for a use case consisting of a single instance of Llama 3.1-70B using the FP16 datatype with 16 users (i.e., a batch size of 16) and a context length of 10,000 tokens.

Calculating the memory needed for weights is straightforward:

  • Weights memory = model_param * datatype_bytes * num_instances
  • ~130 GB = 70B param * 2 bytes (FP16) * 1 model

Calculating the memory needed for KV cache is more complicated and is different for each LLM. This can require a little research. For Llama 3.1-70B, we can find the information we need in Table 3 of “The Llama 3 Herd of Models”:

  • KV-cache memory = 2 * kv-heads * layers * model_dim / att_heads * datatype_bytes * context_length * users
  • ~49 GB = 2 * 8 * 80 * 8192 / 64 * 2 (FP16) * 10,000 * 16

This means our use case will require ~179 GB of total memory.
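As a quick check, the two formulas above can be expressed in a few lines of Python. Note that the figures in this example correspond to binary units (GiB): 70B parameters at 2 bytes each is 140 GB in decimal terms, or ~130 GiB.

  # Memory estimate for a single Llama 3.1-70B instance (FP16, 16 users,
  # 10,000-token context). Model dimensions are from Table 3 of
  # "The Llama 3 Herd of Models".
  GiB = 1024**3

  def weights_memory(params: float, datatype_bytes: int, instances: int = 1) -> float:
      return params * datatype_bytes * instances

  def kv_cache_memory(kv_heads: int, layers: int, model_dim: int, att_heads: int,
                      datatype_bytes: int, context_length: int, users: int) -> float:
      # 2x for key and value tensors; model_dim // att_heads is the head dimension.
      return (2 * kv_heads * layers * (model_dim // att_heads)
              * datatype_bytes * context_length * users)

  weights = weights_memory(70e9, 2)                     # ~130 GiB
  kv = kv_cache_memory(8, 80, 8192, 64, 2, 10_000, 16)  # ~49 GiB
  print(f"weights {weights / GiB:.0f} GiB, KV cache {kv / GiB:.0f} GiB, "
        f"total {(weights + kv) / GiB:.0f} GiB")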

Next we will need to calculate the amount of high-speed memory per rack from the HBM or on-chip SRAM available.

Note:  Providers rely on HBM or on-chip SRAM to optimize benchmarks, since the lower bandwidth of standard DRAM would significantly degrade performance.

For chips with HBM, we usually only need to consider the HBM capacity, since it is typically around 1,000x larger than the SRAM. For example, the Nvidia GH200 has up to 144 GB of HBM but only 114 MB of shared SRAM. In this case, rounding up, we require only two chips to support our use case:

  • Number of chips = Memory required / memory per chip
  • 1.24 chips = 179 GB / 144 GB

And for chips without HBM, we consider only the SRAM available. Let’s assume a rack that has 128 chips with 256 MB (0.25 GB) of SRAM each. In this case we require 716 chips and six racks:

  • Number of chips = Memory required / memory per chip
  • Racks = Number of chips / chips per rack
  • 716 chips = 179 GB / 0.25 GB
  • 5.6 racks = 716 / 128
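Here is the same arithmetic as a short Python sketch covering both cases. The 8 chips per rack in the HBM example is an assumed value for illustration, since only the SRAM-based rack configuration (128 chips) is specified above; rounding up at each step gives the whole chips and racks you would actually deploy.

  import math

  def racks_needed(total_memory_gb: float, memory_per_chip_gb: float,
                   chips_per_rack: int):
      # Divide the memory requirement by memory per chip, then by chips per
      # rack, rounding up at each step.
      chips = math.ceil(total_memory_gb / memory_per_chip_gb)
      return chips, math.ceil(chips / chips_per_rack)

  # HBM-based chip (144 GB each, assumed 8 chips per rack): 2 chips, 1 rack.
  print(racks_needed(179, 144, 8))     # -> (2, 1)

  # SRAM-only chip (0.25 GB each, 128 chips per rack): 716 chips, 6 racks.
  print(racks_needed(179, 0.25, 128))  # -> (716, 6)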


----

"Published under our former name, Recogni. As of 09/08/2025, Recogni is now Tensordyne."
