Recently, as part of a home project, I was looking for market intelligence on CDNs, web traffic performance and user activity behaviour.
As it turns out, most market intelligence reports on CDNs and web traffic are either paid subscriptions or seem unreliable (do let me know if you find a good one though!). Incidentally, I bumped into the whole new world of the HTTP Archive and the Chrome User Experience Report. These are publicly available datasets that give as many data points about the Web as possible. A classic DIY (Do It Yourself) case!
To start with, I wanted to answer some basic questions —
- What share of web traffic in India directly hits backend origin servers vs CDNs?
- How many sites switch CDNs?
In this post, I’ve addressed only #1 for the most part and #2 briefly. #2 may be covered in detail in my subsequent posts.
So here’s what I used for my analysis
- BigQuery: Google’s cloud-based, serverless, highly scalable enterprise data warehouse
- Data Source 1: HTTP Archive (Public Dataset)
- Data Source 2: Chrome User Experience Report (Public Dataset)
- Google Data Studio: To visualise the result set
Why BigQuery?
- Set up your data warehouse in seconds: BigQuery runs blazing-fast SQL queries on GBs to PBs of data
- Scales seamlessly: it uses managed columnar storage and massively parallel execution. It's worth noting that BigQuery is the public implementation of Dremel, the internal service that handles big data every second of every day to power Search, Gmail, YouTube and Google Docs. For some perspective, Dremel can scan 35 billion rows without an index in tens of seconds. That's really huge!
- Public datasets: Makes it easy to join public or commercial datasets with your data (E.g. Chrome User Experience Report used in this example)
- Integration with existing BI and visualisation tools like Tableau and Data Studio to further accelerate your insights
- Cost: you pay only for the storage and compute resources you use!
Why BigQuery for this exercise?
- Because the tables we are going to query run to several hundred GBs and contain about 10 million rows in total. I don't need to rack up new hardware or install specialised software to operate at this scale; BigQuery does all the heavy lifting for me in the cloud. You also get to enrich your insights with stunning visualisations in Data Studio, which is available off the shelf in Google Cloud Platform (see the scale-check sketch after this list)
- You get to understand the "State of the Web" by analysing publicly available datasets, which is exactly what we're trying to do in this case
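Before running anything, you can sanity-check that scale from BigQuery's dataset metadata. A minimal sketch, assuming the httparchive.pages dataset exposes the standard __TABLES__ metadata view and that the August 2019 desktop and mobile crawl tables share the 2019_08_01 prefix (the same prefix the wildcard in the main query below relies on):

SELECT
  table_id,
  ROUND(size_bytes / POW(10, 9), 1) AS size_gb,  -- table size in GB
  row_count
FROM
  `httparchive.pages.__TABLES__`
WHERE
  table_id LIKE '2019_08_01%'  -- desktop and mobile crawls for August 2019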
A couple of things about these datasets
- chrome-ux-report.country_in.201908: This will help us pull out the top origins in a given country (India, in this case). The Chrome User Experience Report provides user experience metrics for how real-world Chrome users experience popular destinations on the web
- httparchive.pages.2019_08_01_* (desktop and mobile): HTTP Archive crawls the full list of desktop and mobile origins from the Chrome User Experience Report. This lets us match the origins we pull from the CrUX dataset against the list of crawled pages (URLs) and check whether a CDN is enabled for each page/URL (see the join sketch after this list)
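A quick sketch of how the two datasets line up: CrUX stores bare origins while HTTP Archive stores full page URLs, hence the CONCAT(origin, '/') = url join key used throughout. This counts the Indian CrUX origins that have a matching page in the August 2019 crawl:

SELECT
  COUNT(DISTINCT origin) AS matched_origins  -- CrUX origins that appear in the HTTP Archive crawl
FROM
  `chrome-ux-report.country_in.201908`
JOIN
  `httparchive.pages.2019_08_01_*`
ON
  CONCAT(origin, '/') = url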
Results — Donut Graph view
Inference from the analysis
- Around 33% of the origins with user activity in August 2019 use a CDN
- Around 1% show up as both direct origin and CDN enabled, i.e. they appear to have switched between the two
Note —
- The data is skewed towards sites that get visits from Chrome, so some long-tail sites may be under-represented
- The results are only as accurate as WebPageTest's CDN detection and may miss CDNs it can't identify, but they should be in the ballpark
- I’ve been wrong before and it will probably happen again. You can look at all the raw data available and question all my assumptions. It will be cool to see what results you get
Standard SQL query that was used
SELECT
  COUNT(url) AS num_pages,
  cdn
FROM (
  SELECT
    url,
    STRING_AGG(
      DISTINCT (
        CASE
          WHEN JSON_EXTRACT_SCALAR(payload, "$._base_page_cdn") = '' THEN "Direct Origin"
          WHEN JSON_EXTRACT_SCALAR(payload, "$._base_page_cdn") = 'null' THEN "Direct Origin"
          WHEN JSON_EXTRACT_SCALAR(payload, "$._base_page_cdn") IS NULL THEN "Direct Origin"
          ELSE "CDN Enabled"
        END
      ),
      " | "
      ORDER BY
        CASE
          WHEN JSON_EXTRACT_SCALAR(payload, "$._base_page_cdn") = '' THEN "Direct Origin"
          WHEN JSON_EXTRACT_SCALAR(payload, "$._base_page_cdn") = 'null' THEN "Direct Origin"
          WHEN JSON_EXTRACT_SCALAR(payload, "$._base_page_cdn") IS NULL THEN "Direct Origin"
          ELSE "CDN Enabled"
        END
    ) AS cdn
  FROM
    `httparchive.pages.2019_08_01_*`
  JOIN
    `chrome-ux-report.country_in.201908`
  ON
    CONCAT(origin, '/') = url
  GROUP BY
    url
)
GROUP BY
  cdn
ORDER BY
  num_pages DESC
LIMIT 50
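A note on where the ~1% "switched" bucket above comes from: the inner STRING_AGG concatenates the distinct labels per URL, so a URL that shows up as direct origin in one crawl and behind a CDN in another (e.g. desktop vs mobile) ends up with both labels. To list those URLs explicitly, a compacted variant of the inner query works; the IF/COALESCE form below is my own shorthand for the original CASE logic, not part of the query above:

SELECT url, cdn
FROM (
  SELECT
    url,
    STRING_AGG(DISTINCT
      IF(COALESCE(JSON_EXTRACT_SCALAR(payload, "$._base_page_cdn"), '') IN ('', 'null'),
         "Direct Origin", "CDN Enabled"),
      " | ") AS cdn
  FROM `httparchive.pages.2019_08_01_*`
  JOIN `chrome-ux-report.country_in.201908`
  ON CONCAT(origin, '/') = url
  GROUP BY url
)
WHERE cdn LIKE '% | %'  -- keep only URLs that carry both labels
ORDER BY url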
Results — Tabular View
That was neat. Going forward, it will be interesting to find out:
- How many redirects occur for each of the top 100 websites?
- How many connections are opened per website across the top 100 websites?
- What's the form factor distribution like? E.g. mobile vs desktop vs tablet (see the sketch after this list)
- What's the connection type distribution like across the top 100 websites?
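The last two questions can be answered from CrUX alone. A minimal sketch of the form factor split for India, assuming the standard CrUX schema where each row carries density-weighted histograms per origin, form factor and connection type; the summed densities are only a rough proxy for relative share, not a true traffic weighting. Swap form_factor.name for effective_connection_type.name to get the connection type split:

SELECT
  form_factor.name AS device,
  ROUND(SUM(fcp.density), 2) AS total_density  -- rough proxy for relative share
FROM
  `chrome-ux-report.country_in.201908`,
  UNNEST(first_contentful_paint.histogram.bin) AS fcp
GROUP BY
  device
ORDER BY
  total_density DESC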
With the availability of public datasets like HTTP archive and Chrome User Experience and an analytics engine like BigQuery, the possibilities of exploration are endless. Stay tuned for more posts and insights!
UPDATE #1: This post has been updated to cover only the analysis pertaining to the presence and coverage of CDNs broadly. The parts mentioning specific CDNs have been removed due to certain policy guidelines.