Using Steam’s basic usage statistics to see which games have staying power

Author

Kari Silvennoinen

Published

February 6, 2026

library(tidyverse)
library(broom)
library(ggrepel)

Sys.setlocale("LC_TIME", "C") # For English month names
options("lubridate.week.start" = 1)

steam_stats <- readRDS("steam_stats.rds")

Introduction

A long time ago I saw a presentation by Eric Seufert on Using (Free!) App Annie data to optimize your next game1. The idea was to do market research by taking the daily iOS Top Grossing chart positions of games and then do a linear regression of them and so filter down to games that have a positive coefficient over time. It’s really rough but does give some indication about games’ life cycles and potential. One downside is that you only get the chart position when ideally you would want to know the daily player count or daily revenue (although you can get estimates of these if you pay for them).

Similar analysis can also be done on Steam, using the data available at Steam & Game Stats page. This page shows the current and peak players for top 100 games by current or daily players. The daily players is a more recent addition, back when I did my analysis below this option wasn’t available. Instead of peak or current, ideally you would like to have a daily active user (DAU) figure, but either of these should work as a proxy for it. Also, ideally, it would be nice to be able to browse this data over time, but Steam doesn’t provide this. This data is collected for example by SteamCharts, but they don’t have a public API for the historical data.

For your own purposes, you would need either have a script collect the data through Steam Web API like SteamCharts, and try to use services like Internet Archive to get historical data and scrape the HTML.

Data quality and limitations

In this analysis I’ll use a data set ranging between January 2015 and September 2022, with (almost) daily data from mid-August 2016 onwards (except around May 2018). The data collection method was a bit spotty in the beginning, because I was trying to get historical data through the Internet Archive’s Wayback Machine.

dates <- unique(steam_stats$date)
expected <- seq(min(dates), max(dates), by="day")

cal <- tibble(
  date = expected,
  data_exists = if_else(expected %in% dates, "Yes", "No")
  ) |> 
  mutate(
  year = year(date),
  month = month(date, label = TRUE),
  wday = wday(date, label = TRUE, abbr = TRUE),
  isoweek = isoweek(date)
  ) |>
  group_by(year, month) |>
  mutate(week = if_else(
    month(date) == 1 & isoweek(date) >= 52,
    1,
    if_else(
      month(date) == 12 & isoweek(date) == 1,
      max(isoweek(date)) + 1,
      isoweek(date)
      )
    ) - min(
      if_else(month(date) == 12 & isoweek(date) == 1, 
              max(isoweek(date))+1, 
              isoweek(date)))+1
    ) |>
  ungroup()

ggplot(cal, aes(y = week, x = wday, fill = data_exists)) +
  geom_tile(color = "white") +
  coord_equal() +
  facet_grid(year ~ month) +
  scale_y_reverse(labels = NULL, breaks = NULL) +
  scale_x_discrete(labels = NULL, breaks = NULL) +
  labs(title = "Data availability", x = NULL, y = NULL, fill = "Data exists") +
  guides(fill = guide_legend(position = "bottom"))

Just to remind, this is data from Steam only, so it gives no indication of these games’ performance on consoles or on other launchers on PC (like Epic Games or GOG). We are also only looking at games that managed to get at least once to Top 100 list. There are tens of thousands of games on Steam, and we are limited to looking at barely 1264 of them.

Also note that this analysis does not tell anything about a game’s revenue performance - we are looking at player counts, not revenue. However, what it tells is what kind of games are prime candidates for service-based business models. In fact, looking at the data it is pretty easy to identify which games are (or act like) traditional AAA games with little post-launch planning and which games can sustain and grow their audience over time - be it through marketing efforts or pure virality or a combination of these.

Data prep and cleanup

Once we have parsed the data and have the raw data set, let’s check the condition of the data we have. First we explore the uniqueness and existence of both Game IDs and game names and the cardinality between the two.

Football Manager together with Call of Duty have some entries missing the Game ID in the raw data. Fortunately, these can be found from elsewhere in the data.

steam_stats |>
  filter(is.na(game_id)) |>
  group_by(game) |>
  summarise(n = n()) |>
  arrange(-n) |>
  left_join(steam_stats |>
              filter(!is.na(game_id)) |>
              group_by(game_id) |>
              summarise(game=first(game)), by = "game") |>
  group_by(game) |>
  summarise(
    n = sum(n), 
    game_id = paste(unique(game_id), collapse = ", ")
    ) |>
  knitr::kable(col.names = c("Game", "N", "Game ID"), align = "lcc")
Game N Game ID
Call of Duty: Advanced Warfare - Multiplayer 85 209660
Call of Duty: Black Ops II - Multiplayer 244 202990
Call of Duty: WWII - Multiplayer 2 476600, 476620
Football Manager 2012 27 71270
Football Manager 2013 91 207890
Football Manager 2014 304 231670

Games with missing Game IDs

steam_stats <- steam_stats |>
  group_by(game) |>
  mutate(game_id = if_else(is.na(game_id), min(game_id, na.rm = TRUE), game_id)) |>
  ungroup()

In the above table, you’ll notice Call of Duty: WWII - Multiplayer matches two Game IDs. There’s also some more non-uniqueness with game names in the data set.

steam_stats |>
  group_by(game) |>
  summarise(
    count = length(unique(game_id)),
    ids = paste(unique(game_id), collapse = ", ")) |>
  filter(count > 1) |>
  arrange(-count) |>
  knitr::kable(col.names = c("Game", "Count", "Game IDs"), align = "lcc")
Game Count Game IDs
Black Desert 2 836620, 582660
Call of Duty: Modern Warfare 2 - Multiplayer 2 10190, 10180
Call of Duty: Modern Warfare 3 - Multiplayer 2 42690, 115300
Call of Duty: WWII - Multiplayer 2 476600, 476620
H1Z1 2 295110, 433850
Total War: SHOGUN 2 2 34330, 201270

Games with multiple Game IDs

steam_stats <- steam_stats |>
  group_by(game) |>
  mutate(game_id = min(game_id)) |>
  ungroup()

In addition there are also multiple names for the same Game ID. Name changes are somewhat common, usually these are weird punctuation changes (Elite: Dangerous -> Elite Dangerous) or in case of Asian games, addition of an English name. Or, in the case of for example H1Z1, evolution of the game. There was also some weirdness with addition and removal of trademark symbols, but those were filtered out already in the parsing of the raw data2. This suggests that following the lifecycle of all games programmatically can get really complicated.

steam_stats |>
  filter(!is.na(game_id)) |>
  group_by(game_id) |>
  summarise(
    count = length(unique(game)),
    names = paste(unique(game), collapse = ", ")
    ) |>
  filter(count > 1) |> 
  arrange(-count) |> 
  head(25) |>
  knitr::kable(col.names = c("Game ID", "Count", "Game names"), align = "lcr")
Game ID Count Game names
1205550 3 New World Confidential Test, New World Preview, New World Closed Beta
268850 3 EVGA PrecisionX 16, EVGA PrecisionX OC, EVGA Precision XOC
295110 3 H1Z1, H1Z1: Just Survive, Just Survive
677620 3 Splitgate: Arena Warfare, Splitgate, Splitgate (Beta)
10500 2 Empire: Total War, Total War: EMPIRE - Definitive Edition
1056640 2 Phantasy Star Online 2, Phantasy Star Online 2 New Genesis
1097150 2 Fall Guys, Fall Guys: Ultimate Knockout
1203220 2 NARAKA: BLADEPOINT Demo, NARAKA: BLADEPOINT
1301210 2 Knockout City Block Party Trial, Knockout City Trial
1599340 2 Lost Ark Closed Technical Alpha, Lost Ark
1619990 2 SUPER PEOPLE CBT, SUPER PEOPLE FINAL BETA
208580 2 Star Wars: Knights of the Old Republic II, STAR WARS Knights of the Old Republic II: The Sith Lords
221380 2 Age of Empires II: HD Edition, Age of Empires II (2013)
226320 2 Marvel Heroes 2015, Marvel Heroes 2016
228200 2 Company of Heroes (New Steam Version), Company of Heroes
238960 2 Path of Exile, PATH OF EXILE: ROYALE
24010 2 Train Simulator 2015, Train Simulator
273350 2 Evolve, Evolve Stage 2
306130 2 The Elder Scrolls Online, The Elder Scrolls Online: Tamriel Unlimited
322330 2 Don’t Starve Together Beta, Don’t Starve Together
359320 2 Elite: Dangerous, Elite Dangerous
363150 2 Romance of the Three Kingdoms 13 / 三國志13, Romance of the Three Kingdoms 13
368230 2 Kingdom, Kingdom: Classic
39210 2 FINAL FANTASY XIV: A Realm Reborn, FINAL FANTASY XIV Online
433850 2 H1Z1: King of the Kill, Z1 Battle Royale

Top 25 games with multiple names

steam_stats <- steam_stats |>
  group_by(game_id) |>
  mutate(game = last(game)) |>
  ungroup()

With some annual franchises like Football Manager and 2K’s NBA - the player base quickly migrates to the next installation a(nd sometimes like with the first franchise, the old game is removed from the Steam storefront). These installments could be grouped to one game for our purposes, but where should we set the limit? Should we do similar treatment for Call of Duty or just for annual sports franchises?

ggplot(steam_stats |> 
         select(peak_today, game, date) |> 
         filter(str_starts(game, "Football Manager")) |> 
         mutate(game = str_replace(game, "Football Manager", "FM")), 
       aes(x = date, y = peak_today, color = game)) +
  geom_line() +
  labs(title = "Daily peak CCU for Football Manager games", x = "Date", y = "Peak CCU last 24 hours", color = "Installment") +
  scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale()))

steam_stats <- steam_stats |>
  mutate(game = if_else(str_starts(game, "Football Manager"), "Football Manager", game)) |>
  group_by(game) |>
  mutate(game_id = if_else(game == "Football Manager", min(game_id), game_id)) |>
  group_by(game, game_id, date) |>
  # NOTE: We assume distinct players across games, there can be overlap, we could also take max
  summarise(peak_today = sum(peak_today), current_players = sum(current_players), .groups = "drop") 
ggplot(steam_stats |> 
         select(peak_today, game, date) |> 
         filter(str_starts(game, "NBA 2K")), 
       aes(x = date, y = peak_today, color = game)) +
  geom_line() +
  scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
  labs(title = "Daily peak CCU for 2K's NBA games", x = "Date", y = "Peak CCU last 24 hours", color = "Installment") 

steam_stats <- steam_stats |>
  mutate(game = if_else(str_starts(game, "NBA 2K"), "NBA 2K", game)) |>
  group_by(game) |>
  mutate(game_id = if_else(game == "NBA 2K", min(game_id), game_id)) |>
  group_by(game, game_id, date) |>
  # NOTE: We assume distinct players across games, there can be overlap, we could also take max
  summarise(peak_today = sum(peak_today), current_players = sum(current_players), .groups = "drop") 

Now our data is in a good enough shape to continue with the actual modeling.

Modeling engagement and growth

Seufert’s idea was to summarise a game’s activity graph as a simple linear regression: we can use the trend line’s slope as a summary of how well the game’s player base is doing. As an example, let’s check what the Steam activity looks like for Warframe and Rainbow Six Siege3. The idea is to do this for all games in the dataset and find the games that go to the up and the right, ie. have a positive trend line.

steam_stats |> 
  filter(game %in% c("Tom Clancy's Rainbow Six Siege", "Warframe")) |>
  ggplot(aes(x = date, y = peak_today, color = game)) + 
  labs(title = "Daily peak CCU of Siege and Warframe", y = "Peak CCU last 24 hours", color = NULL) +
  geom_line(aes(color = game)) + 
  guides(color = guide_legend(position = "bottom")) +
  geom_smooth(method = 'lm') +
  scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale()))
`geom_smooth()` using formula = 'y ~ x'

games <- steam_stats |>
  group_by(game) |>
  nest() |>
  mutate(
    first_date = map_vec(data, \(x) min(x$date)),
    model = map(data, \(x) lm(peak_today ~ date, data = x)),
    glance = map(model, broom::glance),
    tidy = map(model, broom::tidy)
    ) |>
  unnest(c(glance, tidy), names_repair = "unique") |>
  filter(term == "date") |>
  select(game, first_date, r.squared, nobs, estimate) |>
  rename(coeff = estimate) |>
  filter(!is.na(coeff)) |>
  ungroup()
games |>
  filter(nobs > 90) |>
  arrange(-coeff) |> 
  head(25) |>
  knitr::kable(col.names = c("Game", "First date", "$R^2$", "Observations", "Coefficient"), align = "lcc", digits = 2)
Game First date \(R^2\) Observations Coefficient
Apex Legends 2020-11-06 0.77 685 387.70
Counter-Strike: Global Offensive 2015-01-01 0.47 2475 170.20
MONSTER HUNTER RISE 2022-01-13 0.04 211 128.77
Source SDK Base 2013 Multiplayer 2017-05-25 0.87 1245 120.31
Stumble Guys 2022-02-25 0.78 202 111.62
FIFA 22 2021-10-01 0.34 357 77.28
Total War: WARHAMMER III 2022-02-18 0.03 119 75.08
Wallpaper Engine 2017-01-06 0.86 2027 42.02
Z1 Battle Royale 2016-02-22 0.05 531 40.46
Rust 2015-01-01 0.64 2475 37.47
Grand Theft Auto V 2015-04-14 0.39 2426 36.63
Tom Clancy’s Rainbow Six Siege 2015-12-01 0.19 2299 24.31
Raft 2018-06-13 0.29 382 23.79
eFootball PES 2020 2019-07-31 0.38 410 22.47
World of Warships 2017-11-19 0.10 578 21.57
Dead by Daylight 2016-06-15 0.63 2187 20.77
Splitgate 2019-05-25 0.06 99 19.91
tModLoader 2020-05-17 0.63 857 19.73
Phantasy Star Online 2 New Genesis 2020-08-06 0.09 204 19.48
Tomb Raider 2015-01-01 0.52 95 17.68
Source SDK Base 2007 2015-03-24 0.07 462 16.49
Hunt: Showdown 2018-02-23 0.64 1051 15.40
Hearts of Iron IV 2016-06-07 0.84 2193 14.97
The Sims 4 2020-09-07 0.32 675 14.88
Team Fortress 2 2015-01-01 0.32 2475 14.85

Only games with at least 90 observations

Top 25 games by slope coefficient

Many games only a quick appearance in the Top 100, either by having a big launch with little follow-up or due to a free weekend, full game giveaway or an open beta. We can filter these out by only looking at games that have been on the top list for at least 90 (non-consecutive) days. It’s also worth to note that this is not a list of top played games played on Steam, this list is sorted by how much we have seen these games’ player counts increase over time.

Above you’ll notice Source SDK Base 2013 Multiplayer, it is used as the game identifier for some GTA V mod. Likewise, a game called Spacewar might also pop up, it is a game included as a sample app in Steam SDK and often used as a placeholder by pirated games. You might also wonder about Wallpaper Engine, and I’ll let this MIT Technology Review article eplain that one. We’ll filter these out.

games <- games |>
  filter(!game %in% c("Source SDK Base 2013 Multiplayer", "Spacewar", "Wallpaper Engine"))

Data exploration

Interesting cases of game’s growth are for example Grand Theft Auto V. We see a traditional launch spike, but followed with a solid growth thanks to the GTA Online feature. The mid-year and year end spikes are related to Steam’s biannual sales when the game has been available for a steep discount. The game was given away free on Epic Games Store in May 2020, and this giveaway shows as increased player activity also on Steam.

steam_stats |>
  filter(game == "Grand Theft Auto V") |>
  ggplot(aes(x = date, y = peak_today, color = game)) +   
  labs(title = "Daily peak CCU of Grand Theft Auto V", y = "Peak CCU last 24 hours") +
  geom_line() + 
  scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
  geom_smooth(method = 'lm', color = "black") +
  guides(color = "none")
`geom_smooth()` using formula = 'y ~ x'

Destiny 2 was released on Steam as a free-to-play title in October 2019. It’s normal that the initial interest wanes off for this kind of a AAA title, but the crucial thing for service based game is if it can recover from the “death valley”.

steam_stats |>
  filter(game == "Destiny 2") |>
  ggplot(aes(x = date, y = peak_today, color = game)) + 
  labs(title = "Daily peak CCU of Destiny 2", y = "Peak CCU last 24 hours") +
  guides(color = "none") +
  geom_line() + 
  scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
  geom_smooth(method = 'lm', color = "black")
`geom_smooth()` using formula = 'y ~ x'

Have we looked at the data early 2018, the top performer would have been PUBG. It was a massive beast during 2018 but it looks like Fortnite won the battle. Fornite, obviously, isn’t on Steam so we can’t compare it against PUBG here, but charts comparing their viewership on Twitch show Fortnite bypassing PUBG around February 2018, which is when we see PUBG’s activity on Steam start to crater.

steam_stats |>
  filter(game == "PUBG: BATTLEGROUNDS") |>
  ggplot(aes(x = date, y = peak_today, color = game)) + 
  labs(title = "Daily peak CCU of PUBG: BATTLEGROUNDS", y = "Peak CCU last 24 hours") +
  guides(color = guide_legend(position = "bottom")) +
  geom_line() + 
  geom_smooth(method = 'lm') +
  scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
  geom_line(aes(color = "PUBG until 2019"), data = steam_stats |> filter(game == "PUBG: BATTLEGROUNDS", date < as.Date("2019-01-01"))) +
  geom_smooth(aes(color = "PUBG until 2019"), data = steam_stats |> filter(game == "PUBG: BATTLEGROUNDS", date < as.Date("2019-01-01")), method = 'lm')
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

In our analysis we are looking at all the data, but we could limit the outcome window of games and only look at first X months after launch as well. The way we are doing our analysis here is a bit sensitive to the point of time we are looking at and biased toward new games that have yet to reach their plateau.

The true value of this analysis is to find those “smaller” games that are still growing, year on year. Especially the stories of niche games like Euro Truck Simulator 2 are very interesting. The higher variance in Path of Exile could indicate it seems to really depend on regular content updates to drive players back. This can be a dangerous content treadmill. ETS2 regularly adds new regions as DLCs, but it user activity seems to be growing in a more stable fashion.

steam_stats |> 
  filter(game %in% c("Euro Truck Simulator 2", "Path of Exile")) |>
  ggplot(aes(x = date, y = peak_today, color  = game)) + 
  labs(title = "Daily peak CCU of Path of Exile and ETS2", y = "Peak CCU last 24 hours", color = NULL) +
  scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
  guides(color = guide_legend(position = "bottom")) +
  geom_line() + 
  geom_smooth(method = 'lm')
`geom_smooth()` using formula = 'y ~ x'

We can also look at “product-like” launches, with huge intial spikes and eventual fall off the charts - what I would call traditional games as products. This games are often single player experiences with little replayability. These games usually benefit from new player acquisition spikes during Steam’s winter and summer sales.

steam_stats |> 
  filter(game == "Assassin's Creed Odyssey") |>
  ggplot(aes(x = date, y = peak_today, color = game)) + 
  labs(title = "Daily peak CCU of Assassin's Creed Odyssey", y = "Peak CCU last 24 hours") +
  guides(color = "none") +
  geom_line() + 
  scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
  geom_smooth(method = 'lm', color = "black")
`geom_smooth()` using formula = 'y ~ x'

Clustering the games

In Seufert’s presentation, he wanted to map the “opportunities”, or games with an upward trajectory, on a matrix with “Mass appeal / Niche appeal” and “Male / Female” axes. I’ll chart the games a little bit differently, based on the elements that correlate with the idea of “staying power”. Below we’ll just cluster based on the linear regression coefficients and top chart data. We could further enrich the data with furhter metadata from Steam, for example publisher-set genres and user-defined tags that could also be used as features for clustering or other segmentation.

ggplot(data = games |> filter(coeff > 0, nobs > 90),
       aes(x = r.squared, y = coeff, label = game)) +
  labs(
    title = "Top games by slope that have more than 90 observations", 
    subtitle = "Games in top-right have grown most and most consistently",
    x = "R^2", 
    y = "Coefficient (log scale)") +
  geom_point() +
  scale_y_log10() +
  geom_text_repel(max.overlaps = 9, size = 3)

Above is a chart of all the games that have been on top list on more than 90 occasions (~days) and which have in general seen their player base growing since the launch of the game. Y-axis (in logarithmic scale) is the coefficient (~avg. player growth/day) and X-axis is the \(R^2\).

We can try to quickly group these games a bit by clustering. The attributes used in clustering are the slope, days in top list and the \(R^2\). We’ll also this time include all the games, not just those that have a positive trend to also identify games that have more product-like activity curves.

set.seed(42)

cluster_games <- games |> 
  filter(nobs > 90 & !is.na(coeff)) |>
  mutate(
    age = as.numeric(max(steam_stats$date) - first_date),
    logcoeff = if_else(coeff > 0, log10(coeff), -log10(-coeff))
    )

games_matrix <- cluster_games |> 
  select(-c(game, first_date, coeff)) |> 
  as.matrix() |> 
  scale()

rownames(games_matrix) <- cluster_games |> 
  pull(game)

km <- kmeans(games_matrix, centers = 4, nstart = 25)

# rescale cluster centers
km$centers |> 
  sweep(2, attr(games_matrix, "scaled:scale"), "*") |> 
  sweep(2, attr(games_matrix, "scaled:center"), "+") |> 
  as_tibble() |>
  knitr::kable(col.names = c("$R^2$", "Observations", "Age", "Log of Coefficient"), align = "c", digits = c(2, 0, 0, 4), row.names = TRUE)
\(R^2\) Observations Age Log of Coefficient
1 0.11 521 2225 0.0436
2 0.23 443 871 -1.6580
3 0.62 638 2203 -0.0300
4 0.36 2240 2670 0.6216

Cluster centers

We can make rough explanations for the clusters,

  1. Games that do not have a strong explanatory power
  2. Product-like games with a downward trend
  3. Games with strong explanatory power, some of the opportunities can be found here
  4. Service-like games that have upward trend and are mainstays on the top charts
cluster_games <- cluster_games |> 
  mutate(cluster = km$cluster)

ggplot(
  data = cluster_games, 
  aes(x = r.squared, y = logcoeff, label = game, color = as.factor(cluster))
  ) +
  labs(
    title = "Clustering of games",
    subtitle = "Games that have more than 90 observations", 
    x = "$R^2$", 
    color = "Cluster", 
    y = "Coefficient (log scale)"
  ) +
  geom_point() +
  geom_text_repel(max.overlaps = 12, size = 3)

Conclusion

This is just a quick clustering, and very likely far from optimal. We still want to focus on the top-right corner of this map to find the winners and opportunities. We might want to segment the data based on the age of the game and as mentioned before, limit the outcome window of the games. We are also working with left-censored data, some of the big games had already plateaued by the start date of our data set, like Dota 2.

This analysis can be used as a quick way to get a better sense of the games on Steam’s top charts and what their trajectories look like. It can’t be overstated that this analysis is by necessity backward looking and it might be late to try to join the hype train as the market trends might change fast.

Footnotes

  1. App Annie is now part of SensorTower, and this same analysis could be done with any source that gives you historical Top Grossing positions. App Annie just provided that information for free, although not for long after this presentation was shared.↩︎

  2. Not shown here.↩︎

  3. These are poor examples in that they are multiplatform games that even on PC have their own launchers, so very much not Steam exclusive games.↩︎