No single tool, framework, or package works well in every blockchain situation. You must define your project’s requirements, consider the resources available to you, and then select the best collection of tools for your analytics project toolbox. Here, you learn about ten common tools that analysts use for blockchain analytics projects. This article includes an assortment of tools that address a wide range of requirements. These tools will help you get a jumpstart toward delivering quality blockchain analytics results.
Develop blockchain data analytics models with Anaconda
You should download and install the Anaconda environment because of its value in any analytics project. Anaconda is the first tool you should be using because of the many ways it makes analytics easier.
You can get Anaconda for small teams or for enterprise analytics development and deployment. The team and enterprise Anaconda licenses aren’t free, but in exchange for the licensing fee you get lots of collaboration capabilities that will make team analytics development easier, including tools to extract and organize data, prototype models, develop analytics solutions, and deploy those solutions.
The Anaconda environment promotes “an integrated, end-to-end data experience,” where analytics project team members can easily collaborate and share project artifacts. Anaconda Navigator, shown below, is the default user interface, but you can use the conda command-line interface if you prefer a text-based interface.
In the image above, note that only some tools are installed. When you install Anaconda, the install process searches your computer to see whether any tools in Anaconda Navigator are already installed. Any tools that are recommended as part of the Anaconda environment but haven’t been installed have an Install button under their icons. To install any new tool, just click or tap the Install button.
Anaconda is far more than just a collection of tools. One of the most valuable aspects of Anaconda is that it automatically installs many of the analytics libraries you’ll use when building models.
And if highly productive tools and pre-installed libraries aren’t enough, Anaconda also provides lots of entry points for product documentation and tutorials to help you get up to speed in record time. If you choose only one tool to install to supercharge your analytics projects, choose Anaconda.
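If you want to confirm which analytics libraries are actually present in your environment, a few lines of Python will tell you. The following is a minimal sketch that uses only the standard library. The list of library names is an illustrative assumption, not an inventory of what Anaconda installs, and `available_libraries` is a hypothetical helper name.

```python
from importlib.util import find_spec

# Import names of libraries often used in analytics work. This list is an
# illustrative assumption, not an inventory of what Anaconda ships.
CANDIDATE_LIBRARIES = ["numpy", "pandas", "sklearn", "matplotlib", "scipy"]


def available_libraries(names):
    """Map each top-level import name to True if it can be imported here."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in available_libraries(CANDIDATE_LIBRARIES).items():
        print(f"{name:12s} {'installed' if present else 'missing'}")
```

Run it from the Python interpreter that belongs to the environment you want to check. Any library reported as missing can then be added through Anaconda Navigator or the conda command-line interface.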
Write code in Visual Studio Code
When writing software for nearly any environment (in nearly any language), try using the Visual Studio Code integrated development environment (IDE). Visual Studio Code, commonly called VS Code, is a freely available code editor and IDE from Microsoft that includes support for debugging, task execution, and version control. Microsoft provides VS Code for Windows, Linux, and macOS.
Although technically a lightweight alternative to the flagship product, Visual Studio IDE, VS Code brings a ton of functionality to the table. VS Code is free for private and commercial use and gives developers a great environment for developing code.
In addition to being free, VS Code is extremely functional and developer friendly. VS Code has its own marketplace with hundreds of free extensions. VS Code extensions provide support for multiple languages (syntax checking and inline help), handling different types of file formats, and integration with many other tools. If you use VS Code and want some additional feature, there’s a good chance you can find an extension that does what you want.
The following image shows VS Code in the editor window. This version of VS Code includes a Python extension, so VS Code automatically checks any Python code for syntax errors. Because you don’t see any red squiggly underlines in the following image, the code you see is syntactically correct.
Although other good IDEs for code development are available, VS Code is one of the most popular choices for software developers, which is why it's one of the default tools in the Anaconda Navigator.
Prototype blockchain data analytics models with Jupyter
Jupyter Notebook and JupyterLab are popular products from Project Jupyter, an open-source and open-standards group dedicated to providing interactive programming support for many languages. Jupyter Notebook and JupyterLab are both included in the default Anaconda Navigator due to their popularity with data analysts and machine-learning model developers. Both tools are web applications that allow developers and analysts to build and populate models in a shared environment.
Jupyter tools are popular choices when learning about data analytics and machine learning because the online design of the tools makes it easy to share code and data, called notebooks, with others. Anyone who wants to share a model, data, or any examples can just share a notebook. This next image shows the kmeans.py Python program in Jupyter Notebook.
Building on the popularity of Jupyter Notebook, JupyterLab is the next generation of Jupyter’s web interface for notebooks, code, and data. The image below shows the kmeans.py Python program in JupyterLab. Jupyter products support over 40 languages.
Develop blockchain data models in the R language with RStudio
Throughout this book, you learn about building analytics models with the Python language. But Python isn’t the only language commonly used to build analytics models. The R language is another popular language for data modeling and analysis. Like Python, R can import many libraries, called packages in R, to provide access to hundreds of analytics functions.
One of the most popular IDEs for working with the R language is RStudio. You can use VS Code for R development, but RStudio is a strong alternative and a favorite of R developers. In fact, you can use RStudio for both R and Python code development.
RStudio is available as a standalone IDE and a web-based server interface. Both are open-source products. RStudio also offers a range of professional for-fee products designed for teams of analysts and developers who need collaboration features.
The following image shows an R program that analyzes a dataset of income records by zip code. The RStudio IDE displays the R code, console messages, a list of items in memory, and the final visual output.
Before you install RStudio, you must install the R language. If you try to install and then launch RStudio and get a message that R needs to be installed, you forgot to install the R language first.
Interact with blockchain data with web3.py
You need a blockchain client to interact with data stored in your blockchain. Each blockchain implementation is different, but the overall concepts are similar. After you learn how to access and analyze data from one blockchain implementation, mapping that knowledge to another environment is relatively easy.
You can use the web3.py Ethereum blockchain client to access blockchain data. You’ll need this critical library to examine and extract the blockchain data required by your analytics models.
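To make that concrete, here is a minimal sketch of reading one block with web3.py. It assumes a recent web3.py release (one that provides the snake_case `eth.get_block` call) and a node listening at `http://127.0.0.1:7545`. That URL is an assumption you should replace with your own node’s address. The helper names `connect` and `summarize_block` are hypothetical and exist only for this illustration.

```python
def connect(node_url="http://127.0.0.1:7545"):
    """Return a Web3 client for node_url.

    The default URL is an assumption; point it at your own node.
    web3 is imported lazily so the rest of this module loads without it.
    """
    from web3 import Web3  # third-party package: pip install web3

    return Web3(Web3.HTTPProvider(node_url))


def summarize_block(w3, block_id="latest"):
    """Fetch one block and keep only a few fields of analytic interest."""
    block = w3.eth.get_block(block_id)
    return {
        "number": block["number"],
        "timestamp": block["timestamp"],
        "tx_count": len(block["transactions"]),
    }
```

With a node running, `summarize_block(connect())` returns a small dictionary describing the latest block. Because `summarize_block` only needs an object that offers `eth.get_block`, you can also exercise it against a stand-in object while you develop, before you connect to a real network.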
This image shows the web3.py project website and several options you can use to install the web3.py library.
But web3.py isn’t the only option. There are a few options for the Ethereum blockchain, and a quick Internet search will show you multiple options for other blockchains.
Extract blockchain data to a database
Throughout this book you learn how to identify blockchain data of interest and extract that data for use in analytics models. In some cases, you might need to extract blockchain data first and explore it later. Because you may not know what data you’ll need up front, you may find it more efficient to extract blockchain data to an off-chain repository for later analysis. By extracting blockchain data and storing it in a high-performance database management system, you can decrease data access times.
You can write your own extraction code, but several generic products are already available to extract blockchain data and store it in a database.
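If you do write your own extraction code, the storage half can be small. The following is a minimal sketch that uses Python’s built-in sqlite3 module to store block and transaction records in two relational tables. The table layout, the column names, and the shape of the `block` dictionary are assumptions made for this illustration. They are not the schema used by any of the products described in this section.

```python
import sqlite3

# Hypothetical two-table layout: one row per block, one row per transaction.
SCHEMA = """
CREATE TABLE IF NOT EXISTS blocks (
    number     INTEGER PRIMARY KEY,
    hash       TEXT NOT NULL,
    timestamp  INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS transactions (
    hash          TEXT PRIMARY KEY,
    block_number  INTEGER NOT NULL REFERENCES blocks(number),
    sender        TEXT NOT NULL,
    recipient     TEXT,            -- NULL for contract-creation transactions
    value_wei     TEXT NOT NULL    -- text, because wei amounts can exceed 64 bits
);
"""


def open_store(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def store_block(conn, block):
    """Insert one block and its transactions in a single database transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO blocks VALUES (?, ?, ?)",
            (block["number"], block["hash"], block["timestamp"]),
        )
        conn.executemany(
            "INSERT OR REPLACE INTO transactions VALUES (?, ?, ?, ?, ?)",
            [
                (tx["hash"], block["number"], tx["from"], tx.get("to"), str(tx["value"]))
                for tx in block["transactions"]
            ],
        )
```

After the data is stored, ordinary SQL does the exploring. For example, `SELECT block_number, COUNT(*) FROM transactions GROUP BY block_number` counts transactions per block without touching the blockchain again.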
Extracting blockchain data with EthereumDB
EthereumDB is an open-source product that extracts Ethereum blockchain data and stores it in a SQLite database. EthereumDB is a quick and simple method for extracting summary data, transaction details, and block information into separate relational database tables. You can use EthereumDB as is or as a tutorial on how to extract Ethereum blockchain data.
Storing blockchain data in a database using Ethereum-etl
Ethereum-etl is another open-source product you can use to extract Ethereum blockchain data. Ethereum-etl is more complex and flexible than EthereumDB. Using Ethereum-etl, you can output extracted data to text files or database tables.
You also have a wider range of blockchain data you can extract, including block data, token transfers, and event logs. If you want to be able to tailor the data you extract from an Ethereum blockchain, Ethereum-etl is a good option to explore.
Access Ethereum networks at scale with Infura
All examples in this book use local blockchains provided by Ganache. Although Ganache is a great tool for learning blockchain concepts and developing your own blockchain code, it isn’t a live blockchain network. Real analytics projects will need to interact with real blockchain networks. Your organization may implement its own blockchain network; if not, you’ll need to interact with Ethereum’s mainnet or some other public blockchain.
Interacting with a public blockchain comes with some constraints and obstacles. First, to get to all of a blockchain’s data, you need to connect to a full node. Running a full blockchain node requires an investment of infrastructure. Specifically, you need to dedicate disk space to store the blockchain data, a device to run the blockchain client, and sufficient network access to initially download all the blockchain data and then to process new blocks.
Interacting with one blockchain may be feasible, but as you add more public blockchains to your data universe, the infrastructure requirements may become untenable. One common solution to increasing infrastructure investment is to use someone else’s infrastructure, and one of the most popular services for Ethereum blockchain access is Infura.
An Infura account provides API access over HTTPS and WebSockets to multiple Ethereum networks, as well as to InterPlanetary File System (IPFS) resources. Using Infura can take one large obstacle (setting up your own Ethereum node) off the table and let you focus on building analytics models. The next image shows Infura’s architecture for accessing Ethereum and IPFS resources.
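API access over HTTPS to an Ethereum node means sending JSON-RPC requests, which you can do with nothing more than the Python standard library. The following is a minimal sketch. The endpoint pattern built by `infura_url` is an assumption, so check it against your Infura dashboard, and the project ID is a placeholder you must supply yourself. The function names are hypothetical. `eth_blockNumber` is a standard Ethereum JSON-RPC method that returns the latest block number as a hexadecimal string.

```python
import json
import urllib.request


def infura_url(project_id, network="mainnet"):
    """Build an HTTPS endpoint URL (assumed pattern; verify in your dashboard)."""
    return f"https://{network}.infura.io/v3/{project_id}"


def rpc_payload(method, params=None, request_id=1):
    """Build a JSON-RPC 2.0 request body as a dictionary."""
    return {"jsonrpc": "2.0", "method": method, "params": params or [], "id": request_id}


def call(url, method, params=None):
    """POST one JSON-RPC request and return its 'result' field."""
    body = json.dumps(rpc_payload(method, params)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        reply = json.load(response)
    if "error" in reply:
        raise RuntimeError(reply["error"])
    return reply["result"]


def hex_to_int(quantity):
    """Convert a hex quantity such as '0x10' to an integer."""
    return int(quantity, 16)
```

With a valid project ID, `hex_to_int(call(infura_url(project_id), "eth_blockNumber"))` gives you the current block height. If you prefer web3.py, the same endpoint URL can be handed to its HTTP provider instead of a local node address.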
Analyze very large blockchain datasets in Python with Vaex
Regardless of where you get your data, there is likely to be lots of it. One common obstacle to operationalizing data analytics models is the size of the datasets you need to analyze. Most model types increase accuracy with more data. But at some point, datasets become so large that they become difficult to manage. Even though your organization’s infrastructure may have lots of servers with lots of memory, you may not always be able to provision huge amounts of resources every time you need to run a model.
To scale models to available hardware, many developers or analysts run models on partitions of their data or employ distributed processing. Partitioning your data can cut out important information, and distributing analytics can take a lot of work. However, another choice is available.
Vaex is an open-source library that implements out-of-core dataframes, which allows you to write code that explores and visualizes datasets far bigger than your computer’s memory. With Vaex, shown below, you can run analytics models on datasets hundreds of gigabytes in size, even on a laptop computer!
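If the term out-of-core is new to you, it simply means processing data that doesn’t fit in memory by reading it in pieces and keeping only small running results. The following sketch illustrates that general idea with the Python standard library. It is not Vaex code and doesn’t describe how Vaex works internally. The CSV layout and the `gas_used` column name are assumptions made for this illustration.

```python
import csv
import io


def streaming_mean(rows, column):
    """Average one column while holding only a count and a sum in memory."""
    count = 0
    total = 0.0
    for row in rows:  # rows are consumed one at a time, never all at once
        total += float(row[column])
        count += 1
    return total / count if count else None


def mean_of_csv_column(path, column):
    """Stream a CSV file from disk and average the named column."""
    with open(path, newline="") as handle:
        return streaming_mean(csv.DictReader(handle), column)


if __name__ == "__main__":
    # Tiny in-memory stand-in for a file that would normally be far larger.
    sample = io.StringIO("tx_hash,gas_used\n0x1,21000\n0x2,42000\n0x3,63000\n")
    print(streaming_mean(csv.DictReader(sample), "gas_used"))
```

Memory use here stays flat no matter how many rows the file holds. A library such as Vaex gives you that property behind a dataframe interface, so you don’t have to hand-write a loop for every statistic.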
Examine blockchain data
One of the most important early steps in any analytics project is to identify the data your models need. You must take inventory of the data available to you and then explore sources for other data that your models require. When working in blockchain environments, the most common tool used to examine available data is a blockchain explorer. Most blockchain explorers are web applications that provide an easy interface for accessing data stored in a blockchain.
Many blockchain explorer options are available, and each blockchain implementation has its own options. Here, you discover three popular options for exploring data on Ethereum and Bitcoin blockchains.
Explore Ethereum with Etherscan.io
Etherscan.io is the most popular blockchain explorer for Ethereum networks. Using Etherscan.io, you can explore blockchain data from Ethereum’s mainnet or any of the most popular Ethereum test networks. You can look at blocks, transactions, event logs, or any data related to your selected network.
Etherscan.io makes it easy to examine your blockchain data to identify the source data your models require. The following image shows the main Etherscan.io web page.
Peruse multiple blockchains with Blockchain.com
Some blockchain explorers support access to multiple blockchain networks. For example, Block Explorer from Blockchain.com provides visibility similar to that of Etherscan.io but across more blockchain network types.
Block Explorer provides an interface to block data from the mainnets of Bitcoin, Bitcoin Cash, and Ethereum, as well as the testnets for Bitcoin and Bitcoin Cash. This next image shows the main Block Explorer interface for the Bitcoin network.
View cryptocurrency details with ColossusXT
Some blockchain explorers, such as ColossusXT, focus on cryptocurrency transactions. Instead of providing generic block access, ColossusXT identifies blocks that contain specific cryptocurrency transactions. If your analytics queries focus on cryptocurrency transactions, ColossusXT may help you find the data you need. The image you see below shows the ColossusXT main interface for Bitcoin cryptocurrency transactions.
Preserve privacy in blockchain analytics with MADANA
A core concern when handling data, including in the context of analytics projects, is maintaining compliance with privacy regulations. Privacy is a growing concern for governing bodies. The old, naive perception that encryption enforces privacy has been shown to be false. Privacy isn’t about the data; privacy is about the individual.
Data analytics queries often provide aggregate results that simplify classification or prediction. If your models enable the audience to associate an individual with its results, you’ve violated that individual’s privacy.
To avoid publishing results that inadvertently leak granular data that could be used to identify an individual, you have two main options. The first option is to apply good privacy-preserving techniques to your models, which means learning about k-anonymity, l-diversity, t-closeness, and differential privacy. The second option is to use a framework such as MADANA, which does it for you.
MADANA provides a framework that helps you protect confidentiality and privacy. If compliance is a concern for your organization, a framework like MADANA can help you stay compliant without having to design privacy-preserving models yourself. The image below shows the MADANA website, with some of its benefits.
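To give you a feel for the first of the privacy-preserving techniques named above, here is a minimal sketch of a k-anonymity check. A dataset is k-anonymous when every combination of quasi-identifier values (attributes that could identify someone when combined) is shared by at least k records. This sketch is not part of MADANA and says nothing about how MADANA works. The function names and the sample field names (`zip`, `age_band`) are hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the same
    quasi-identifier values (0 for an empty dataset)."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination appears at least k times."""
    return k_anonymity(records, quasi_identifiers) >= k


if __name__ == "__main__":
    sample = [
        {"zip": "12345", "age_band": "30-39", "balance": 5},
        {"zip": "12345", "age_band": "30-39", "balance": 9},
        {"zip": "54321", "age_band": "40-49", "balance": 2},
    ]
    # The third record is alone in its group, so it could be singled out.
    print(is_k_anonymous(sample, ["zip", "age_band"], k=2))
```

A check like this only tells you that a result set is risky. Fixing it usually means generalizing or suppressing values until each group is large enough, and k-anonymity alone doesn’t address the weaknesses that l-diversity, t-closeness, and differential privacy were designed to cover.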