Project overview

Objective

- Identify data sources for Ethereum analysis.
- Build a data store to host data collected from various sources.
- Create a unified dashboard that extracts insights by combining data from multiple sources.

The [EthLooker](https://ethlooker.streamlitapp.com/) tool was built as part of this challenge.

![EthLooker Dashboard](/images/tasks/eth_merge_challenge.png)[Dashboard Link](https://ethlooker.streamlitapp.com/)


Code Location

- [ethdash-ui](https://github.com/kikura3/ethdash-ui) - Front end code components
- [ethdash](https://github.com/kikura3/ethdash) - Backend data processing code components

📖 In this **blog post**, we will go over:

- Data source identification and collection process.
- Data curation challenges.
- Insights and actionable items. 
- Limitations and next steps.

![Eth Looker Architecture](/images/tasks/ethlooker_architecture.png)EthLooker Architecture

Introduction

Eth Merge Data Challenge

Client Analysis

Objective

- Retrieve network data about nodes (such as client, IP, hosting provider, and peers).
- Analyze EL-CL pairings, client diversity, client effectiveness, and peering differences between clients.
- Set up a data pipeline to ingest data into a data store for further analysis.

Challenges

- Without setting up a node or a crawler, there was no way to query the required information.

Steps taken

1. Identified and experimented with the open source network crawler implementations listed below:

 - node-crawler (Eth1) : 🧡 The setup documentation was good, but it could only parse 100 mainnet Eth1 nodes/day, and we found no guide for tuning this.
 - node-watch (Eth2) : 💚 Easy to set up. It crawled Eth2 nodes and stored them in MongoDB. We wrote scripts to extract data from MongoDB and load it into an AWS RDS MySQL instance.
 - CrawlEth (Eth1) : ❤️ It appears to be a useful tool, but it was difficult to set up due to a lack of documentation.

2. We were able to set up nodewatch on an AWS instance and parse the following information for Eth2 nodes:
 - Client
 - IP
 - Fork Digest
 - Sync Status 
 - Code: [nodewatch-to-db.py](https://github.com/kikura3/ethdash/blob/main/services/nodewatch-to-db/lambda_function.py)

3. We were unsuccessful in retrieving some of the desired information (such as peers of the nodes). This would be one of the action items in our roadmap.

4. We were also unsuccessful in setting up an Eth1 crawler.

5. We ended up crawling ethernodes to retrieve the required Eth1 information.

 - Code: [ethernode-to-db.py](https://github.com/kikura3/ethdash/blob/main/services/el-mapper/lambda_function.py)

6. To retrieve hosting provider information, we used the free [Maxmind](https://dev.maxmind.com/geoip/geolite2-free-geolocation-data?lang=en) database. We had to manually filter out ISPs without hosting infrastructure to remove noise.

 - Code: [geoip-to-db.py](https://github.com/kikura3/ethdash/blob/main/services/geoip-to-db/lambda_function.py)
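The manual ISP filtering in step 6 can be sketched roughly as follows. This is a minimal illustration, not the code in `geoip-to-db.py`: the `RESIDENTIAL_ISPS` denylist, the `hosting_share` function, and the sample IP-to-organisation mapping are all hypothetical, and the organisation names are assumed to come from a GeoIP lookup done elsewhere.

```python
from collections import Counter

# Hypothetical, hand-curated denylist of consumer ISPs that do not offer
# hosting infrastructure. The real list would be maintained manually.
RESIDENTIAL_ISPS = {"Comcast Cable", "Deutsche Telekom AG"}


def hosting_share(ip_to_org):
    """Return each hosting provider's share of nodes, ignoring residential ISPs."""
    hosted = [org for org in ip_to_org.values() if org not in RESIDENTIAL_ISPS]
    counts = Counter(hosted)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {org: n / total for org, n in counts.items()}


# Illustrative input only: IP -> organisation name from a GeoIP lookup.
sample = {
    "203.0.113.1": "Hetzner Online GmbH",
    "203.0.113.2": "Amazon.com, Inc.",
    "203.0.113.3": "Comcast Cable",
    "203.0.113.4": "Hetzner Online GmbH",
}
print(hosting_share(sample))
```

Keeping the denylist as plain data makes it easy to review and extend as new noisy ISPs show up in the crawl.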

Data Collection

Client Analysis - Data Collection

1. Consensus Client Distribution

![Consensus Distribution](/images/tasks/eth_consensus_distribution.png)[Dashboard Link](https://ethlooker.streamlitapp.com/)

Comparing this with [clientdiversity.org](https://clientdiversity.org/):

- Migalabs and Blockprint estimate a higher share for the Teku client than the NodeWatch crawler does.
- The sources show similar distributions, but they are not identical.

ℹ️ Takeaway: Collaborate with the other teams and compare the results of different crawlers to get a more accurate estimate.

2. EL CL pairing

![EL-CL Pairing](/images/tasks/eth_el_cl_pairing.png)[Dashboard Link](https://ethlooker.streamlitapp.com/)

We were able to match EL and CL nodes based on IP for ~3500 nodes.
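IP-based matching of this kind can be sketched as a simple join. The snippet below is an illustration under assumptions, not the project's actual code: the `pair_clients` function and the `ip` / `client` record fields are hypothetical, and it assumes at most one EL node per IP address.

```python
from collections import Counter


def pair_clients(el_nodes, cl_nodes):
    """Count (EL client, CL client) pairs for nodes that share an IP address."""
    # If several EL nodes share an IP, the last one seen wins (a simplification).
    el_by_ip = {node["ip"]: node["client"] for node in el_nodes}
    pairs = Counter()
    for node in cl_nodes:
        el_client = el_by_ip.get(node["ip"])
        if el_client is not None:  # skip CL nodes with no EL node on the same IP
            pairs[(el_client, node["client"])] += 1
    return pairs


# Illustrative records only.
el = [{"ip": "198.51.100.1", "client": "Geth"},
      {"ip": "198.51.100.2", "client": "Nethermind"}]
cl = [{"ip": "198.51.100.1", "client": "Prysm"},
      {"ip": "198.51.100.2", "client": "Teku"},
      {"ip": "198.51.100.3", "client": "Lighthouse"}]
for combo, count in pair_clients(el, cl).most_common():
    print(combo, count)
```

Nodes behind shared or rotating IPs would be mismatched or dropped by this approach, which is one reason only a subset of nodes can be paired.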

The goal is to identify the most popular client combinations. This would assist stakers in selecting a different client combination to maximize client diversity.

For example, even if the staker prefers Geth as EL (despite the fact that it is the dominant one), selecting a different consensus client (such as Teku) would be a better option than selecting Lighthouse or Prysm.

Furthermore, the above information would aid in the creation of installation guides for the least popular client combinations, making it easier for stakers to select them over the popular ones.

3. Consensus Hosting Diversity

![Hosting Diversity](/images/tasks/eth_hosting_diversity.png)[Dashboard Link](https://ethlooker.streamlitapp.com/)

This chart provides an overview of hosting providers' network penetration.

Despite having anti-crypto policies, Hetzner still hosts more than 10% of consensus nodes.

If Hetzner begins to enforce these policies, more than 10% of the nodes could be at risk (we are unsure what percentage of validators this represents).

It would be fantastic to have a staking guide that offers alternatives to risky hosting providers.

4. CL - Hosting Provider Affinity

![Hosting CL Affinity](/images/tasks/eth_hosting_cl_affinity.png)[Dashboard Link](https://ethlooker.streamlitapp.com/)

The objective is to determine whether there is an affinity between hosting providers and clients. It appears so: Google and OVH predominantly run Prysm nodes.

We are unsure of the actionable steps based on the information presented above. We'd love to hear from the community.

5. Trending Clients 

Though we know the overall distribution of clients, is there any way to understand the trend?
Is there a shift in client adoption?

To answer these questions, we need to know the new validators and their client information.
However, the information pulled from the nodewatch crawler does not contain any information about validators.

Fortunately, SigmaPrime has built an ML system that can predict a validator's client based on the block fingerprint.
We pulled information from the blockprint API to map validators to clients. We identified the first instance of each validator making a block proposal and extracted its (estimated) client info to understand the client adoption trend.
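The "first proposal per validator" step can be sketched as below. This is a hedged illustration rather than the project's code: the `first_proposal_trend` function and the record fields `slot`, `proposer_index`, and `client` are assumed names, not the confirmed blockprint response schema, and weeks are counted from slot 0 using 12-second slots.

```python
from collections import Counter, defaultdict

SLOTS_PER_WEEK = 7 * 24 * 60 * 60 // 12  # 12-second slots on mainnet


def first_proposal_trend(blocks):
    """Count, per week, the validators proposing their first block, by client."""
    first_seen = {}
    # Walk blocks in slot order so the first record kept per validator
    # is its earliest proposal.
    for block in sorted(blocks, key=lambda b: b["slot"]):
        first_seen.setdefault(block["proposer_index"], block)

    trend = defaultdict(Counter)
    for block in first_seen.values():
        week = block["slot"] // SLOTS_PER_WEEK
        trend[week][block["client"]] += 1
    return trend


# Illustrative records only.
blocks = [
    {"slot": 10, "proposer_index": 1, "client": "Lighthouse"},
    {"slot": 60000, "proposer_index": 1, "client": "Prysm"},  # not a first proposal
    {"slot": 50400, "proposer_index": 2, "client": "Lodestar"},
]
for week, counts in sorted(first_proposal_trend(blocks).items()):
    print(week, dict(counts))
```

A validator's first proposal can lag its activation considerably, so this measures when new validators first become visible rather than when they joined.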

![Client Trend](/images/tasks/eth_client_trend.png)[Dashboard Link](https://ethlooker.streamlitapp.com/)

We can also filter by individual client to understand its trend.
Despite the low numbers, the chart below shows an upward trend in Lodestar adoption.

![Client Trend](/images/tasks/eth_lodestar.png)[Dashboard Link](https://ethlooker.streamlitapp.com/)

We hope that this chart will provide the community with insight into client trends and enable the community to take any proactive measures.

6. Client Effectiveness 

Is one client performing better than the others?

APR serves as a proxy for client behaviour. We can better understand client effectiveness by comparing APR across clients.

We pulled each validator's consensus rewards from the beaconchain API. Using the blockprint data, we mapped each validator's rewards to its client.
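As a rough sketch of how rewards could be mapped to clients and turned into an APR figure: the post does not state its exact formula, so the annualisation below (reward divided by a 32 ETH stake, scaled to 365 days) is an assumption, and `apr_by_client` and its inputs are hypothetical names.

```python
from collections import defaultdict
from statistics import mean

STAKE_ETH = 32  # assumed stake per validator


def apr_by_client(rewards_eth, validator_client, days):
    """Average annualised reward rate per client, in percent.

    rewards_eth: validator index -> consensus reward (ETH) earned over `days`.
    validator_client: validator index -> estimated client name.
    """
    per_client = defaultdict(list)
    for validator, reward in rewards_eth.items():
        client = validator_client.get(validator)
        if client is None:
            continue  # no client estimate available for this validator
        apr = reward / STAKE_ETH * (365 / days) * 100
        per_client[client].append(apr)
    return {client: mean(values) for client, values in per_client.items()}


# Illustrative numbers only.
rewards = {1: 0.032, 2: 0.064, 3: 0.016, 4: 0.1}
clients = {1: "Prysm", 2: "Prysm", 3: "Teku"}
print(apr_by_client(rewards, clients, days=365))
```

With few validators per client, a plain mean is sensitive to outliers; a median or a minimum sample size may give a fairer comparison.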

![Client Performance](/images/tasks/eth_client_performance.png)[Dashboard Link](https://ethlooker.streamlitapp.com/)

We observed no significant differences other than Lodestar's performance, which could be because Lodestar runs on a very small number of validators.

Insights

Client Analysis - Insights

Staking Analysis

Objective

- Retrieve depositor, validator, and staking data.
- Examine the distribution, effectiveness, and diversity of depositors.
- Create a data pipeline to load data into a data store for analysis.

Challenges

- Depositor labelling was the hardest part, as there is no single source of truth.

Steps taken

1. Extracted validator and depositor information from Dune Analytics.

 - Code: [dune-to-db.py](https://github.com/kikura3/ethdash/blob/main/services/dune-to-db/lambda_function.py)

2. Obtained labels for some of the depositors from Etherscan.

3. Wrote rules for other depositors (such as Coinbase and RocketPool) who do not deposit from a static address. Coinbase, for example, transfers funds from its known wallets to a new wallet, sends 32 ETH from that wallet to the ETH deposit contract, and then sends the excess back to a known wallet. Coinbase has over 60k deposit addresses.

4. Used blockprint data to map depositors to clients in order to understand depositor client diversity.

 - Code: [sigp-to-db.py](https://github.com/kikura3/ethdash/blob/main/services/sigp-to-db/lambda_function.py)

5. Analyzed depositors' consensus reward performance using the beaconchain API.

 - Code: [beaconchain-to-db.py](https://github.com/kikura3/ethdash/blob/main/services/beaconchain-to-db/lambda_function.py)
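The Coinbase-style rule from step 3 can be sketched as a small heuristic over transfers. This is a simplified illustration, not the rules actually used: the `label_depositors` function, the transfer record fields, and the wallet addresses in the example are hypothetical. It labels an address when it was funded by a known wallet and later sent at least 32 ETH to the deposit contract, with the excess-return check left as an optional extra condition.

```python
# Mainnet beacon chain deposit contract (lowercased).
DEPOSIT_CONTRACT = "0x00000000219ab540356cbb839cbe05303d7705fa"


def label_depositors(transfers, known_wallets, label, require_refund=False):
    """Label intermediate deposit addresses that follow the funding pattern."""
    funded_by_known = set()    # addresses that received funds from a known wallet
    deposited = set()          # addresses that sent >= 32 ETH to the deposit contract
    returned_to_known = set()  # addresses that sent funds back to a known wallet
    for t in transfers:
        if t["from"] in known_wallets:
            funded_by_known.add(t["to"])
        if t["to"] == DEPOSIT_CONTRACT and t["value_eth"] >= 32:
            deposited.add(t["from"])
        if t["to"] in known_wallets:
            returned_to_known.add(t["from"])

    candidates = funded_by_known & deposited
    if require_refund:
        candidates &= returned_to_known
    return {address: label for address in candidates}


# Illustrative transfers with placeholder addresses.
transfers = [
    {"from": "0xknown", "to": "0xaaa", "value_eth": 32.1},
    {"from": "0xaaa", "to": DEPOSIT_CONTRACT, "value_eth": 32},
    {"from": "0xaaa", "to": "0xknown", "value_eth": 0.09},
    {"from": "0xother", "to": "0xbbb", "value_eth": 32.1},
    {"from": "0xbbb", "to": DEPOSIT_CONTRACT, "value_eth": 32},
]
print(label_depositors(transfers, {"0xknown"}, "Coinbase"))
```

The sketch ignores transaction ordering and multi-hop funding, both of which a production rule would likely need to handle.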

Staking Analysis - Data Collection

1. Staking Overview

![Staking Overview](/images/tasks/staking_overview.png)[Dashboard Link](https://ethlooker.streamlitapp.com/)

The above chart provides an overview of the staking deposits.

It also depicts the weekly trend of validator and depositor signups. There is a small spike in new validators and depositors around the time of the merge.


2. Staking Distribution

![Staking Distribution](/images/tasks/staking_distribution.png)[Dashboard Link](https://ethlooker.streamlitapp.com/)

Staking pools and centralized exchanges (CEXs) are the major stakers in the Ethereum network.

The above chart shows the distribution of their stakes as well as their growth.

We hope this will help the community understand more about the staking entities.

3. Staking Entity Effectiveness

Another key question is how effective each of the staking entities is.

![Staking Performance](/images/tasks/staking_performance.png)[Dashboard Link](https://ethlooker.streamlitapp.com/)

Although there are no red flags other than neukind and stereum, this chart allows staking pools to compare their performance with that of others.


4. Staking Entity Diversity Leaderboard

![Staking Client Diversity](/images/tasks/eth_client_diversity.png)[Dashboard Link](https://ethlooker.streamlitapp.com/)

This leaderboard ranks staking entities by their diversity index.
It also indicates each entity's majority and minority clients.

We hope this will encourage the entities to take action to improve their diversity.
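The post does not define how the diversity index is computed. As one possible sketch, the snippet below uses the Gini-Simpson index (1 minus the sum of squared client shares), which is an assumption swapped in for illustration and not necessarily the dashboard's metric. The `diversity_index` and `leaderboard` functions and the sample entities are hypothetical.

```python
def diversity_index(client_counts):
    """Gini-Simpson index: 0 for a single client, approaching 1 as usage spreads out."""
    total = sum(client_counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in client_counts.values())


def leaderboard(entities):
    """Rank entities by diversity; each entity needs at least one client count."""
    rows = []
    for name, counts in entities.items():
        rows.append({
            "entity": name,
            "diversity": diversity_index(counts),
            "major_client": max(counts, key=counts.get),
            "minor_client": min(counts, key=counts.get),
        })
    return sorted(rows, key=lambda row: row["diversity"], reverse=True)


# Illustrative validator counts per client for two made-up entities.
entities = {
    "PoolA": {"Prysm": 50, "Lighthouse": 50},
    "PoolB": {"Prysm": 100},
}
for row in leaderboard(entities):
    print(row)
```

Under this definition an entity splitting validators evenly across two clients scores 0.5, while an entity running a single client scores 0.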

Staking Analysis - Insights

Roadmap


As part of this exercise, we identified the following challenges in terms of data analysis:

1. Data is in silos (crawlers, nodes, onchain data, labels, tools).

2. To produce deep insights, the data must be combined and made openly accessible so that data analysts and scientists can build analytics and ML models on top of it.

3. There are different stakeholders (researchers, devs, stakers, staking pools, etc.) within the ecosystem, each with different analytic needs.

![Eth Data](/images/tasks/eth_next_steps.png)


We'd like to continue building the [Ethlooker dashboard](https://ethlooker.streamlitapp.com/) and the Ethlooker database with guidance and support from the Ethereum community.

Credits:

1. [Eth2 Book](https://eth2book.info/)
2. [Dune Analytics](https://dune.com/browse/dashboards)
3. Blockprint API by [michaelsproul.eth](https://twitter.com/sproulM_)
4. [Beaconchain API](https://beaconcha.in/api/v1/docs/index.html)
5. [Etherscan labels](https://info.etherscan.com/public-name-tags-labels/)
6. [Node Watch](https://github.com/ChainSafe/nodewatch-ui) by ChainSafe
7. [EtherNodes](https://ethernodes.org/)
8. [Maxmind Geo Data](https://dev.maxmind.com/geoip/geolite2-free-geolocation-data?lang=en)