Deploy Blockchain Data Warehouse like Dune Analyst on AWS from 2000$ per month

- 10 mins

Before we get started, you need to note that this is just my suggestion based on the minimum needs for the system to work, it takes more things and of course more money to make it work on production, however it would be a good starting point.

Content

  1. Questions & Answers
  2. System architecture
  3. Deployment infrastructure
  4. Cost calculation
  5. Conclusion

:pray: I am currently consulting, designing and implementing data analysis infrastructure, Data Warehouse, Lakehouse for individuals and organizations in need. You can see and try a system I have built here. Please contact me via email: lakechain.nguyen@gmail.com. Thank you!

Questions & Answers

Why “like Dune Analyst”?

Why do I want to have my own data warehouse instead of using it directly on Dune?

System architecture

System Architecture

The figure above describes the system’s component architecture and data flow. Now I’m going to show you the details of this system

Data warehouse

This is the infrastructure that stores all of the system’s data, You can learn more about data warehouse in my article here.

From a technical perspective, there are two points to note:

Blockchain DWH

About data table, the tables in our data warehouse are divided into 3 groups:

RPC Service

RPC service is not part of our system, it acts as the raw data provider of each blockchain, you can learn more about RPC API here. Now I’ll tell you which RPC API each of our raw tables comes from.

Note: The data in the raw tables in some cases may not represent what actually happened on the blockchain (when the transaction execution failed, then both logs and traces in that transaction will not be recognized). To know whether a transaction has been executed successfully or not, we need to rely on the status of the transaction in the transaction receipt.

Other components and technical stack

Technical components

Both collector and decoder are Spark jobs, this allows me to configure data processing of multiple blocks concurrently, thereby increasing data processing capacity.

Deployment infrastructure

Below is a diagram of Three-tier infrastructure deployment on AWS for this system. You can read more about the Three-tier architecture in here.

Technical components

Cost calculation

The cost will depend on usage needs, I will take a specific case of building a datawarehouse for Ethereum blockchain for this estimate.

Ethereum information:

  • Block height: 19.651.256
  • Total transactions: 2.333,51 Millions
  • Number transactions per date (avg): 1.207.914

According to my practical experience, for every 1 million transactions we will need about 4 GB of storage and processing capacity. So with ethereum we will need to process and store 9.3 TB of data and every day after that we need to process and store another 5 GB of data. I will base on this data to determine the size of our warehouse system.

First, of course, we need 9.3 TB of storage on S3 for all Ethereum data. Regarding computing ability, I will divide it into 2 categories as follows:

All calculations will be deployed on 2 EMR clusters as I presented in the infrastructure architecture section. For each EMR cluster, we need at least 3 EC2 nodes, of which there will be 1 primary node responsible for managing the entire cluster and not responsible for calculation. Since the data we have stored on S3 instead of EMR’s HDFS, we will only need 1 core node. Computation and data processing work will be performed on core nodes and task nodes. You can see more about EMR estimate capacity here. To ensure system stability and save costs, I recommend using on-demand EC2 for Primary and Core nodes, and Sport instances for Task nodes. I’m using the benchmark results here to estimate that 1 CPU on EMR can process 26 GB of data in 1 hour.

To calculate infrastructure costs, I use the AWS Calculator tool. My estimated total cost is about 1800$~2000$ per month, you can see the details here. With this configuration we can collect, process and store all historical data of Ethereum in about 7 days (according to theoretical calculations).

Conclusion

In this article, I have proposed a proposal to deploy a data warehouse infrastructure for the Ethereum blockchain. Thank you for reading, see you in the next articles.

Lake Nguyen

Lake Nguyen

Founder of Chainslake

comments powered by Disqus