Token Economics Calculator Explained

TLDR
Tensordyne’s Token Economics Calculator was developed to compare a broad range of AI inference systems across key metrics, based on publicly available data.
- Initially created to address investor and customer questions
- Normalizes data to key metrics: tokens per second, per dollar, and per kWh.
- Explains the approach including gaps in available data
- Lets users test assumptions
The Motivation
Generative AI is bringing fundamental change to industry and society. Hardware makers, neoclouds, and service providers are racing to power the next wave of intelligence. But is anyone really making money besides Nvidia? According to outspoken generative AI critic Ed Zitron, the answer is a resounding no. Since Nvidia controls >90% of the market, everyone wants to know who else is out there and how they perform.
Here at Tensordyne, we are taking a unique approach to developing an AI inferencing system built on highly efficient logarithmic-math compute. Our approach includes more SRAM, large HBM, and a high-throughput scale-up interconnect, implemented in a custom ASIC and built into a rack-level system. And our simulations show impressive metrics.
But potential investors and customers all ask us similar questions. How do you compare to Nvidia? How are you different from other startups? Will you actually be any better?
It was our need to answer these questions that drove us to develop the Tensordyne tokenomics calculator: a way to look across key metrics and evaluate long-term profitability, not just relative to Nvidia, but across a broad range of alternative rack-level AI systems.
While our initial focus was on internal needs, the response when we shared it with others was overwhelmingly positive and included numerous requests for access to the tool.
And so this is what we share with you today.
The Challenge
If you’ve ever tried to evaluate products of any reasonable complexity, then you probably know it isn’t always easy. And AI inferencing systems are no exception.
A fair evaluation starts by running the same AI model on the system. But even with the same model, there are a huge number of parameters you could control - input (ISL) and output sequence lengths (OSL), batch size and users, tensor (TP) and pipeline (PP) parallelism, just to name a few.
And the important metrics reported by different sources vary. They may be for one user or for the full system. Or may not even be provided. And what is missing is sometimes as important as what is provided.
Let’s take a brief look at a few of the publicly available sources:
Artificial Analysis benchmark results are often quoted by TaaS (Tokens-as-a-Service) providers like Cerebras, Groq, and SambaNova running models on their own custom hardware. Results also include numerous neoclouds running Nvidia GPUs. The Artificial Analysis key metrics are more focused on end-user experience and include latency or time-to-first-token (TTFT), output speed or 1/TPOT (time-per-output-token), and price (not cost) per million tokens. While this is super interesting, it doesn’t give insight into the actual system implementation - i.e. how many racks are needed to support a model, the number of users and total throughput per system, or the actual cost to deploy.
Nvidia shares a lot of great data on their AI Inference developer page. They provide specific system configuration information like ISL, OSL, TP, PP, and details of the type and number of GPUs, with throughput (total output tokens/sec by Nvidia’s definition) as the key metric. But batch size or the number of concurrent users is consistently missing here.
MLCommons’ MLPerf® Inference: Datacenter (see End Note) benchmark suite is very useful. It defines a standardized set of requirements covering a variety of models with corresponding datasets and quality targets. Some scenarios also include server latency constraints (TTFT and TPOT). This is designed to promote fairer evaluations based on more realistic real-life scenarios. Submissions from hardware and system providers, neoclouds, and others include the system configuration and total throughput - but not users. Almost all are for Nvidia or AMD based systems with startups noticeably missing. Submitting power measurements is optional here but there are very few submissions.
You may have noticed some common themes across the different sources. They are all missing the number of concurrent users 1 supported and the power consumed by the system. And each source covers only a subset of the systems available in the market.
Trying to assess the performance of multiple systems demands normalizing the data, necessitates assumptions to fill the gaps, and requires compromise.
However, even with these limitations, the feedback has been clear that people find it useful and want access.
The Metrics
Even though Generative AI has made enormous progress, we are still in the early days with incredible growth ahead. So it’s not surprising at all that players are jockeying for position and trading margins for market share.
But longer term, AI will only meet its lofty goals by providing value and being efficient and profitable. For AI Inference System providers, this translates into the following key metrics (see the sketch after this list):
- Total Tokens per Second - generating more tokens means generating more value
- Total Tokens per USD 2 - profitability requires covering the cost of every token processed, so more tokens per dollar of cost means better margins
- Total Tokens per kWh - more tokens for less power lowers strain on the power distribution infrastructure and the environment
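To make the relationship between these metrics concrete, here is a minimal Python sketch under assumed inputs. It amortizes the hardware over a fixed period and counts only rack cost and energy; the real calculator also takes operating parameters such as colocation cost into account, and all function names and example numbers below are our own illustrative assumptions.

```python
# Minimal sketch of the three key metrics.
# All names and example values are illustrative assumptions,
# not figures taken from the calculator.

def token_metrics(total_tokens_per_sec, rack_cost_usd, racks_needed,
                  rack_power_kw, energy_cost_per_kwh, amortization_years=5.0):
    """Normalize a system to tokens/sec, tokens/USD, and tokens/kWh."""
    hours = amortization_years * 365 * 24
    lifetime_tokens = total_tokens_per_sec * hours * 3600

    # Energy drawn over the amortization period and its cost
    # (colocation and other operating costs are ignored here).
    kwh = rack_power_kw * racks_needed * hours
    total_cost_usd = rack_cost_usd * racks_needed + kwh * energy_cost_per_kwh

    return {
        "tokens_per_sec": total_tokens_per_sec,
        "tokens_per_usd": lifetime_tokens / total_cost_usd,
        "tokens_per_kwh": lifetime_tokens / kwh,
    }

# Made-up example: one rack at $3M, 40 kW, $0.10/kWh, 50k total tokens/sec.
print(token_metrics(50_000, 3_000_000, 1, 40.0, 0.10))
```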
It’s important to emphasize that the focus of this calculator is on the cost to generate tokens, and not on market price. Many TaaS and GPUaaS players are focused on capturing market share at the expense of thinner and sometimes even negative margins. Looking at actual costs helps to level the playing field between system providers.
It’s also worth noting that especially in the USA and EU, power constraints are becoming more and more important. Whether replacing or upgrading existing capacity or adding new capacity, higher Tokens per kWh means more value (tokens) and increased revenue for the available power. And it may turn out that this is the key to higher profitability.
Satya Nadella, CEO of Microsoft put it this way, “When it comes to AI, we are continuing to build out these new data center intelligence factories. We're extending all Azure as the world's computer to basically be these intelligence factories. Tokens per watt plus dollar is the best way to think about the new currency of performance. It's all about maximizing that value and doing it in the most efficient way.”
The Approach
The Tokenomics Calculator uses publicly available data across many sources to develop estimates of performance which are normalized to common metrics. Sources may include media or supplier blogs, press releases, conference papers, video blogs, etc.
All metrics are based on actual published system performance, except for Tensordyne, which is based on our simulation results. Where results are believed to include speculative decoding, performance has been either normalized or excluded.
Models
The initial release of the calculator includes Llama 3 70B and Llama 3 405B, but we plan to include some of the popular MoE models in future updates. The two main reasons we chose these models were:
- Availability of data: Llama 3 70B was extremely popular when it was initially released and key metrics are publicly available from almost every AI system provider. While others have since passed it in usage, it remains the most popular model for reporting benchmarking performance across a broad range of providers.
- Size of the model: Smaller models have their place, but of the 20 most popular open source models at openrouter.ai accounting for >1T tokens per week, 15 have at least 70B parameters. Llama 3 70B represents the lower end and is a reasonable size that can often be run on a single chip, while Llama 3 405B is toward the higher end of active parameters in memory and requires multiple chips. For SRAM-only systems, these models may require multiple racks.
Both of these models have MLPerf® submissions 3 for a number of different systems. However, since MLCommons policies do not allow MLPerf® results to be compared against non-MLPerf® results, we have separate configurations for systems that submitted and those that did not.
For both Llama 3 70B configurations and for Llama 3 405B - 1k_1k, we assumed a 2k context length for the purpose of calculating the racks required to support the Concurrent Users selected. For Llama 3 405B - MLPerf® we assumed 32k.
Racks Needed
Racks Needed is calculated by dividing the memory required to run the selected model with the number of Concurrent Users by the available memory per rack. The calculator assumes that model parameters and KV-cache must reside in fast memory (SRAM, HBM, GDDR) to achieve best performance.
Required memory is determined from the Model Selected, Concurrent Users, and the Quantization inputs. This simplified approach does not include other memory requirements and may underestimate the number of racks required, e.g. according to some sources, Groq requires 9 racks to run Llama 3 70B, but the calculator estimates 6 racks.
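As an illustration of that simplified approach, here is a rough Python sketch. The KV-cache footprint per token, the rack memory figure, and the example numbers are our own assumptions chosen for illustration; the calculator's internal values may differ.

```python
import math

def racks_required(params_billions, weight_bytes_per_param,
                   kv_bytes_per_token, context_len, concurrent_users,
                   rack_fast_memory_gb):
    """Estimate racks from model weights plus KV cache vs. fast memory per rack."""
    # Weights: 1B parameters at N bytes each is roughly N GB.
    weights_gb = params_billions * weight_bytes_per_param
    # KV cache: per-token footprint * assumed context length * users.
    kv_gb = kv_bytes_per_token * context_len * concurrent_users / 1e9
    return math.ceil((weights_gb + kv_gb) / rack_fast_memory_gb)

# Illustrative only: a Llama-3-70B-like model with FP8 weights (~70 GB),
# an assumed ~160 KB of KV cache per token, the 2k context mentioned above,
# 512 concurrent users, and an assumed 10 TB of fast memory per rack.
print(racks_required(70, 1.0, 160_000, 2048, 512, 10_000))
```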
Quantization
Currently, the Quantization inputs are used only for calculating Racks Needed; they do not impact throughput estimates.
The actual quantizations used may be 16, 8, or 4-bit and are highly dependent on specific system compute capabilities and limitations. Lower precision typically enables higher throughput, but at the cost of accuracy, which is often not reported.
Nvidia typically provides FP4 results for systems that support it, resulting in higher throughput than they would achieve using FP8. In some cases they use a mixture of FP8/FP4, e.g. for MLPerf®. It’s not obvious why, but it could be in order to meet accuracy goals - your guess is as good as ours.
All Tensordyne results are for FP8 weights and activations.
Power
There is extremely limited publicly available information on actual power consumption. So for most systems, we use TDP or maximum power datasheet values. This can be modified by the user.
Rack Cost
This is a key input where the user should bring their own knowledge to make up for the lack of publicly available information. In some cases, system pricing is available from OEM partners’ or system integrators’ websites. But in many cases it is confidential. So beyond speculative pricing in the media, there is little to go on.
We almost didn't include default Rack Costs because of the lack of data. But that would have defeated the whole purpose of a calculator. So instead we tried to be generous on Rack Cost wherever we were lacking information.
We estimated the default Rack Costs to be higher (on a per chip basis) for systems with AI processors using HBM and more state-of-the-art process nodes and lower for those using SRAM and more mature process nodes.
And for providers offering TaaS running on their own custom hardware (Cerebras, Groq, and SambaNova), Rack Cost represents an estimate of internal cost without adding any margin, with the presumption that margins are made on the sale of tokens. This results in system costs that are significantly lower than system pricing previously reported by the media. But it can be useful when trying to estimate the potential margins for TaaS offerings.
For all others, the Rack Cost is our estimate of the price you might pay on the open market to buy the equipment for the purpose of running models.
Again, we encourage you to provide your own values here.
Concurrent Users
Most public sources do not divulge the number of concurrent users supported for a specific configuration and result. We have arrived at the estimated default values by reviewing many different sources of information.
For most systems, this control is only used to determine the number of racks needed as described earlier.
However, for a few providers - Cerebras, Groq, and SambaNova - there is also no reliable total throughput data. In these cases, this input is also used to determine the total throughput. Essentially, per-user throughput is first derived from TTFT and TPOT and then multiplied by the number of concurrent users to get total throughput (sketched below). So for these providers, modifying this value will directly change the total throughput and may also impact the racks needed.
There is no check for viability. To confirm that the results are trustworthy, it is incumbent on the user to validate whether the number of concurrent users can actually be supported.
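A minimal sketch of that derivation might look like the following, assuming per-user speed is averaged over a full request (the prefill wait plus one TPOT per subsequent output token). The exact formula used in the calculator may differ, and the example numbers are made up.

```python
def per_user_tokens_per_sec(ttft_s, tpot_s, output_tokens):
    """Average output tokens/sec for one user over a full request."""
    total_time_s = ttft_s + (output_tokens - 1) * tpot_s
    return output_tokens / total_time_s

def estimated_total_tokens_per_sec(ttft_s, tpot_s, output_tokens, concurrent_users):
    """Scale the per-user rate by the assumed number of concurrent users."""
    return per_user_tokens_per_sec(ttft_s, tpot_s, output_tokens) * concurrent_users

# Made-up example: 0.2 s TTFT, 5 ms TPOT (200 tokens/s per user steady state),
# 1k output tokens, 512 concurrent users.
print(estimated_total_tokens_per_sec(0.2, 0.005, 1000, 512))
```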
Operating Parameters
These are mostly self explanatory with additional information in the help screens.
We would welcome contributions of links for publicly available sources of colocation cost or energy costs for other regions. Especially any that will be regularly updated from trusted sources.
Note: Lastly, we want to make it clear that we do not collect or retain any data from the use of the calculator, other than from specific requests made through the “Contact” and “Contribute” links.
The Results
Here are some things to look for:
Sensitivity Analysis
Input your own values and see how they impact the results, especially for inputs requiring estimates or assumptions like Rack Power, Rack Cost, and Concurrent Users. How are the results impacted by a 2x increase or a 50% decrease?
Commercial
How does the cost per million tokens change if the system is paid off? Check cost vs. market pricing to get an indication of potential margins.
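One way to explore that question is with a small sketch like the one below, which splits the hourly cost into amortized hardware and power, then drops the hardware term once the system is paid off. The cost components, amortization period, and example values are our own assumptions for illustration, not the calculator's internal formula.

```python
def cost_per_million_tokens(total_tokens_per_sec, rack_cost_usd, racks_needed,
                            rack_power_kw, energy_cost_per_kwh,
                            amortization_years=5.0, paid_off=False):
    """USD per 1M tokens; drop the amortized hardware cost once paid off."""
    hours_per_year = 365 * 24
    capex_per_hour = 0.0 if paid_off else (
        rack_cost_usd * racks_needed / (amortization_years * hours_per_year))
    power_per_hour = rack_power_kw * racks_needed * energy_cost_per_kwh
    tokens_per_hour = total_tokens_per_sec * 3600
    return (capex_per_hour + power_per_hour) / tokens_per_hour * 1e6

# Made-up comparison: same system while amortizing vs. after payoff.
system = dict(total_tokens_per_sec=50_000, rack_cost_usd=3_000_000,
              racks_needed=1, rack_power_kw=40.0, energy_cost_per_kwh=0.10)
print(cost_per_million_tokens(**system))                 # still amortizing
print(cost_per_million_tokens(**system, paid_off=True))  # system paid off
```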
System Architectures
What impact does increasing the Concurrent Users have on Racks Needed for SRAM-only systems vs. those that also have HBM?
Use Case Impact
For Llama 3 405B, check how Total Tokens per Second varies between the two different use cases for the same model to see the importance of assessing similar scenarios:
Table 1: Total Tokens per Second - same model, different scenario
The Future
There is a lot more complexity we could have added to the calculator, but we wanted to keep it as easy to use as possible.
What would you like to see? Additional providers? Additional metrics? Additional features?
We’d welcome your feedback. Please hit Contribute on the calculator page below the Results and share your thoughts.
End Note:
MLPerf® Inference: Datacenter benchmark results for Llama 2 70B 99.9% and Llama 3.1 405B were retrieved from https://mlcommons.org/benchmarks/ on April 10, 2025 (v5.0 Closed) and September 15, 2025 (v5.1 Closed) from entries 5.0-0047, 5.0-0060, 5.1-0003, 5.1-0051, 5.1-0069, 5.1-0071, 5.1-0075. Total Tokens per Second were determined by dividing total throughput by the number of reported chips and multiplying by the number of chips for the Racks Needed. Tensordyne’s simulation results of MLPerf® configurations are unverified and have not been through an MLPerf® review and may use measurement methodologies and/or workload implementations that are inconsistent with the MLPerf® specification for verified results. Tensordyne plans to submit results for verification as soon as possible after system availability.
Total Tokens per USD and Total Tokens per kWh are not primary metrics of MLPerf® Inference: Datacenter. Results not verified by MLCommons Association.
The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.