Articles From Michael G. Solomon
Cheat Sheet / Updated 07-24-2023
Blockchain technology is much more than just another way to store data. It's a radical new method of storing validated data and transaction information in an indelible, trusted repository. Blockchain has the potential to disrupt business as we know it, and in the process, provide a rich new source of behavioral data. Data analysts have long found valuable insights from historical data, and blockchain can expose new and reliable data to drive business strategy. To best leverage the value that blockchain data offers, become familiar with blockchain technology and how it stores data, and learn how to extract and analyze this data.
Article / Updated 07-24-2023
Ethereum is a comprehensive, decentralized application platform that expands the range of capabilities beyond what was possible before blockchain technology. So, what sets it apart from other decentralized platforms? Here’s a bit of Ethereum background. Introducing Ethereum Bitcoin was the first blockchain technology application. It was revolutionary and defined the first widely used digital currency, called cryptocurrency. The crypto part of the name refers to the use of cryptographic hashes to ensure the integrity of the blockchain. The shared ledger literally keeps a copy of every cryptocurrency transaction that gets verified by all nodes. Using this approach, bitcoin created a permanent record of every exchange of their cryptocurrency. And, because account owners are identified only by an address, bitcoin has always enjoyed a measure of anonymity. Although bitcoin addresses aren’t linked directly to people, many exchanges have records of identities that are related to addresses. At some point, you have to exchange your cryptocurrency for real currency. That switchover point is where many law enforcement officials focus when they’re trying to track down criminals using cryptocurrency. As bitcoin became more and more popular, researchers began to see more applications for blockchain technology beyond cryptocurrency. In 2013, Vitalik Buterin, the cofounder of Bitcoin Magazine, published a whitepaper that proposed a new, more functional blockchain implementation. This new proposal was for the Ethereum blockchain. After gaining interest and attracting technical and financial support, the Ethereum Foundation, a Swiss non-profit organization, was founded and became the developer of Ethereum. Ethereum wasn’t created just to exchange cryptocurrency. In fact, it was designed from the beginning to be different. The core features of Ethereum are the smart contract and ether. 
Ether is the native cryptocurrency that Ethereum supports, although you can create your own tokens to exchange value in many other forms. Smart contracts provide an execution environment that ensures integrity across all nodes. Any code that executes on one node executes the same way on all nodes. This guarantee makes it possible to deploy a wide range of applications across untrusted environments. The foundational guarantees Ethereum provides support many types of value exchanges without the concern about fraud, censorship, or any involvement by a third party. When you interact with an Ethereum application, you don’t have to rely on any intermediary to broker your transactions. You don’t need a bank, wholesaler, or transaction broker to provide trust. As a result of Ethereum’s disintermediation, you can often complete transactions faster, with far lower service fees and without requiring approval from external authorities. Whereas legacy solutions to data and process sharing required third-party authorities to enforce integrity, Ethereum provides process and data integrity, along with disintermediation. The possibilities are just beginning to be explored. Exploring Ethereum’s consensus, mining, and smart contracts Ethereum provides integrity in the way it implements immutability and smart contracts. Immutability isn’t actually a blockchain guarantee. You can change data in any block — even after other blocks are added to the blockchain. However, as soon as you change a block, that block and all subsequent blocks fail integrity checks and your node is out of sync. Instead of saying that the blockchain is immutable, it is more accurate to say that any changes (mutations) to the blockchain are easily and immediately detected. Ethereum is based on democracy. Each node gets an equal vote. Every time nodes get a new block to add to the blockchain, they validate the block and its transactions, and then vote whether to accept or reject the block. 
If several different blocks are submitted by different nodes, only one of the blocks can receive votes from a majority. The block that gets more than half of the network nodes’ votes joins the blockchain as its newest block. One of the first problems is to determine when a new block is ready for the blockchain. When too many conflicting blocks are submitted, the voting process slows down. Ethereum makes it hard to add new blocks to keep the number of new block collisions low and to make voting faster. Ethereum uses a consensus protocol called Proof of Work (PoW), which sets the rules for validating and adding new blocks. PoW makes adding blocks to the blockchain difficult but profitable. Ethereum defines ether as its cryptocurrency. You can transfer ether between accounts or earn it by doing the hard work of adding blocks to the Ethereum blockchain. The Ethereum PoW mechanism requires that nodes find a number that, when combined with the block’s header data, produces a cryptographic hash value that matches the current target, which is a value that is adjusted to keep new block production at a steady rate. Finding a hash value that matches the current target is hard. You have to try on average more than a quadrillion values to find the right one. That’s the point. Using a PoW mechanism makes it so hard to submit a block that fewer blocks are submitted, which reduces the number of collisions. The node that finds the right value gets a small ether payment for the effort. This process is called mining, and the node that wins the prize is that block’s miner. Mining regulates the speed at which new blocks get submitted as candidate blocks, and results in a number that is easy to validate. Finding the right number to solve the puzzle is difficult, but verifying the number is fast and easy. Another interesting aspect of mining is that each block’s header contains a hash of the previous block. Ethereum nodes use the hash to easily detect unauthorized block changes.
If a block changes, the hash result doesn’t match and the block becomes invalid. Mining cryptocurrency is also a way to make money using blockchain technology. Mining has become competitive, and most of today’s miners invest in high-performance hardware with multiple GPUs to carry out the complex operations. To keep the mining process fair, Ethereum uses a complexity value that makes the mining process even harder as miners get faster. Adjusting the complexity allows Ethereum to regulate the new block frequency to an average of one new block every 14 seconds. The glue that holds the Ethereum environment together is the smart contract. Ethereum is much more than just a financial ledger, and smart contracts provide much of its rich functionality. Each Ethereum node runs a copy of the Ethereum virtual machine (EVM). The EVM runs smart contract code in a way that guarantees that smart contracts execute the same way on all nodes and produce the same output. Running smart contract code is not optional. Smart contracts execute based on specific rules and cannot be subverted or halted. The EVM smart contract guarantees provide a stable platform for automated transaction processing that you can trust. Smart contracts provide the primary power of the Ethereum environment. One of the known weaknesses of software is that attackers can sometimes bypass its controls and carry out unintended actions. That type of attack is more difficult in Ethereum, primarily due to its smart contract implementation. Attackers can’t directly attack the blockchain and make unauthorized changes because any such changes will be immediately detected. The next most likely attack vector is the smart contract interface to the blockchain data. Ethereum guarantees that smart contract code, which is translated into bytecode before it is written to the blockchain, executes on every EVM instance the same way. Also, the EVM determines when code executes and what code executes.
Attackers have few opportunities to leverage smart contract code, which makes Ethereum an even more secure environment. The Ethereum platform as a whole offers possibilities that extend beyond the current uses of blockchain.
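The mining puzzle and the previous-block hash described in this article can be sketched in a few lines of Python. This is a toy illustration only, under loud assumptions: it uses SHA-256 and a "leading zeros" target rather than Ethereum's actual hashing and target rules, the difficulty is set low so it runs in well under a second, and the block fields and function names (block_hash, mine, chain_is_valid) are hypothetical. Ethereum itself has also since moved from Proof of Work to Proof of Stake, so read this as a picture of the idea rather than of the current network.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """Hash a block's full contents, including its nonce and previous-block hash."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def mine(block: dict, difficulty: int = 3) -> dict:
    """Try nonces until the block hash starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    candidate = dict(block, nonce=0)
    while not block_hash(candidate).startswith(prefix):
        candidate["nonce"] += 1  # finding the nonce is the slow part
    return candidate


def chain_is_valid(chain: list, difficulty: int = 3) -> bool:
    """Verify each block's puzzle solution and its link to the previous block."""
    prefix = "0" * difficulty
    for i, block in enumerate(chain):
        if not block_hash(block).startswith(prefix):
            return False  # the nonce no longer solves the puzzle
        if i > 0 and block["prev_hash"] != block_hash(chain[i - 1]):
            return False  # the link to the previous block is broken
    return True


# Build a two-block toy chain (hypothetical transaction data).
genesis = mine({"index": 0, "prev_hash": "0" * 64, "data": "genesis"})
second = mine({"index": 1, "prev_hash": block_hash(genesis), "data": "Alice pays Bob 5"})
chain = [genesis, second]
print(chain_is_valid(chain))

# Changing an earlier block breaks the link stored in every later block.
tampered = [dict(genesis, data="Alice pays Mallory 500"), second]
print(chain_is_valid(tampered))
```

Mining loops over many nonces, but validation needs only one hash per block. That asymmetry is what the article means by a puzzle that is hard to solve and easy to verify.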
Article / Updated 07-24-2023
In 2008, Bitcoin was the only blockchain implementation. At that time, Bitcoin and blockchain were synonymous. Now hundreds of different blockchain implementations exist. Each new blockchain implementation emerges to address a particular need and each one is unique. However, blockchains tend to share many features with other blockchains. Before examining blockchain applications and data, it helps to look at their similarities. Check out this article to learn how blockchains work.
Categorizing blockchain implementations
One of the most common ways to evaluate blockchains is to consider the underlying data visibility, that is, who can see and access the blockchain data. And just as important, who can participate in the decision (consensus) to add new blocks to the blockchain? The three primary blockchain models are public, private, and hybrid.
Opening blockchain to everyone
Nakamoto’s original blockchain proposal described a public blockchain. After all, blockchain technology is all about providing trusted transactions among untrusted participants. Sharing a ledger of transactions among nodes in a public network provides a classic untrusted network. If anyone can join the network, you have no criteria on which to base your trust. It’s almost like throwing a $20 bill out your window and trusting that only the person you intend to pick it up will do so. Public blockchain implementations, including Bitcoin and Ethereum, depend on a consensus algorithm that makes it hard to mine blocks but easy to validate them. PoW is the most common consensus algorithm in use today for public blockchains, but that may change. Ethereum is in the process of transitioning to the Proof of Stake (PoS) consensus algorithm, which requires less computation and depends on how much blockchain currency a node holds. The idea is that a node with more blockchain currency would be affected negatively if it participates in unethical behavior.
The higher the stake you have in something, the greater the chance that you’ll care about its integrity. Because public blockchains are open to anyone (anyone can become a node on the network), no permission is needed to join. For this reason, a public blockchain is also called a permissionless blockchain. Public (permissionless) blockchains are most often used for new apps that interact with the public in general. A public blockchain is like a retail store, in that anyone can walk into the store and shop. Limiting blockchain access The opposite of a public blockchain is a private blockchain, such as Hyperledger Fabric. In a private blockchain, also called a permissioned blockchain, the entity that owns and controls the blockchain grants and revokes access to the blockchain data. Because most enterprises manage sensitive or private data, private blockchains are commonly used because they can limit access to that data. The blockchain data is still transparent and readily available but is subject to the owning entity’s access requirements. Some have argued that private blockchains violate data transparency, the original intent of blockchain technology. Although private blockchains can limit data access (and go against the philosophy of the original blockchain in Bitcoin), limited transparency also allows enterprises to consider blockchain technology for new apps in a private environment. Without the private blockchain option, the technology likely would never be considered for most enterprise applications. Combining the best of both worlds A classic blockchain use case is a supply chain app, which manages a product from its production all the way through its consumption. The beginning of the supply chain is when a product is manufactured, harvested, caught, or otherwise provisioned to send to an eventual customer. 
The supply chain app then tracks and manages each transfer of ownership as the product makes its way to the physical location where the consumer purchases it. Supply chain apps manage product movement, process payment at each stage in the movement lifecycle, and create an audit trail that can be used to investigate the actions of each owner along the supply chain. Blockchain technology is well suited to support the transfer of ownership and maintain an indelible record of each step in the process. Many supply chains are complex and consist of multiple organizations. In such cases, data suffers as it is exported from one participant, transmitted to the next participant, and then imported into their data system. A single blockchain would simplify the export/transport/import cycle and auditing. An additional benefit of blockchain technology in supply chain apps is the ease with which a product’s provenance (a trace of owners back to its origin) is readily available. Many of today’s supply chains are made up of several enterprises that enter into agreements to work together for mutual benefit. Although the participants in a supply chain are business partners, they do not fully trust one another. A blockchain can provide the level of transactional and data trust that the enterprises need. The best solution is a semi-private blockchain – that is, the blockchain is public for supply chain participants but not to anyone else. This type of blockchain (one that is owned by a group of entities) is called a hybrid, or consortium, blockchain. The participants jointly own the blockchain and agree on policies to govern access. Describing basic blockchain type features Each type of blockchain has specific strengths and weaknesses. Which one to use depends on the goals and target environment. You have to know why you need blockchain and what you expect to get from it before you can make an informed decision as to what type of blockchain would be best. 
The best solution for one organization may not be the best solution for another. The table below shows how blockchain types compare and why you might choose one over the other.
Differences in Types of Blockchain
Feature | Public | Private | Hybrid
Permission | Permissionless | Permissioned (limited to organization members) | Permissioned (limited to consortium members)
Consensus | PoW, PoS, and so on | Authorized participants | Varies; can use any method
Performance | Slow (due to consensus) | Fast (relatively) | Generally fast
Identity | Virtually anonymous | Validated identity | Validated identity
The primary differences among the types of blockchain are the consensus algorithm used and whether participants are known or anonymous. These two concepts are related. An unknown (and therefore completely untrusted) participant will require an environment with a more rigorous consensus algorithm. On the other hand, if you know the transaction participants, you can use a less rigorous consensus algorithm.
Contrasting popular enterprise blockchain implementations
Dozens of blockchain implementations are available today, and soon there will be hundreds. Each new blockchain implementation targets a specific market and offers unique features. There isn’t room in this article to cover even a fair number of blockchain implementations, but you should be aware of some of the most popular. Remember that you’ll be learning about blockchain analytics in this book. Although organizations of all sizes are starting to leverage the power of analytics, enterprises were early adopters and have the most mature approach to extracting value from data. The What Matrix website provides a comprehensive comparison of top enterprise blockchains. Visit whatmatrix.com for up-to-date blockchain information.
Following are the top enterprise blockchain implementations and some of their strengths and weaknesses (ranking is based on the What Matrix website):
Hyperledger Fabric: The flagship blockchain implementation from the Linux Foundation. Hyperledger is an open-source project backed by a diverse consortium of large corporations. Hyperledger’s modular architecture and rich support make it the highest-rated enterprise blockchain.
VeChain: Currently more popular than Hyperledger, having the highest number of enterprise use cases among products reviewed by What Matrix. VeChain includes support for two native cryptocurrencies and states that its focus is on efficient enterprise collaboration.
Ripple Transaction Protocol: A blockchain that focuses on financial markets. Instead of appealing to general use cases, Ripple caters to organizations that want to implement financial transaction blockchain apps. Ripple was the first commercially available blockchain focused on financial solutions.
Ethereum: The most popular general-purpose, public blockchain implementation. Although Ethereum is not technically an enterprise solution, it's in use in multiple proof-of-concept projects.
The preceding list is just a brief overview of a small sample of blockchain implementations. If you’re just beginning to learn about blockchain technology in general, start out with Ethereum, which is one of the easier blockchain implementations to learn. After that, you can progress to another blockchain that may be better aligned with your organization. Want to learn more? Check out our Blockchain Data Analytics Cheat Sheet.
Article / Updated 06-09-2023
Blockchain technology alone cannot provide rich analytics results. For all that blockchain is, it can’t magically provide more data than other technologies. Before selecting blockchain technology for any new development or analytics project, clearly justify why such a decision makes sense. If you already depend on blockchain technology to store data, the decision to use that data for analysis is a lot easier to justify. Here, you examine some reasons why blockchain-supported analytics may allow you to leverage your data in interesting ways.
Leveraging newly accessible decentralized tools to analyze blockchain data
You’ll want to learn how to manually access and analyze blockchain data. But although it's important to understand how to exercise granular control over your data throughout the analytics process, higher-level tools make the task easier. The growing number of decentralized data analytics solutions means more opportunities to build analytics models with less effort. Third-party tools may reduce the amount of control you have over the models you deploy, but they can dramatically increase analytics productivity. The following list of blockchain analytics solutions is not exhaustive and is likely to change rapidly. Take a few minutes to conduct your own internet search for blockchain analytics tools. You’ll likely find even more software and services:
Endor: A blockchain-based AI prediction platform that has the goal of making the technology accessible to organizations of all sizes. Endor is both a blockchain analytics protocol and a prediction engine that integrates on-chain and off-chain data for analysis.
Crystal: A blockchain analytics platform that integrates with the Bitcoin and Ethereum blockchains and focuses on cryptocurrency transaction analytics. Different Crystal products cater to small organizations, enterprises, and law enforcement agencies.
OXT: The most focused of the three products listed, OXT is an analytics and visualization explorer tool for the Bitcoin blockchain. Although OXT doesn’t provide analytics support for a variety of blockchains, it attempts to provide a wide range of analytics options for Bitcoin. Monetizing blockchain data Today’s economy is driven by data, and the amount of data being collected about individuals and their behavior is staggering. Think of the last time you accessed your favorite shopping site. Chances are, you saw an ad that you found relevant. Those targeted ads seem to be getting better and better at figuring out what would interest you. The capability to align ads with user preferences depends on an analytics engine acquiring enough data about the user to reliably predict products or services of interest. Blockchain data can represent the next logical phase of data’s value to the enterprise. As more and more consumers realize the value of their personal data, interest is growing in the capability to control that data. Consumers now want to control how their data is being used and demand incentives or compensation for the use of their data. Blockchain technology can provide a central point of presence for personal data and the ability for the data’s owner to authorize access to that data. Removing personal data from common central data stores, such as Google and Facebook, has the potential to revolutionize marketing and advertising. Smaller organizations could access valuable marketing information by asking permission from the data owner as opposed to the large data aggregators. Circumventing big players such as Google and Facebook could reduce marketing costs and allow incentives to flow directly to individuals. There is a long way to go to move away from current personal data usage practices, but blockchain technology makes it possible. This process may be accelerated by emerging regulations that protect individual rights to control private data. 
For example, the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) both strengthen an individual’s ability to control access to, and use of, their personal data.
Exchanging and integrating blockchain data effectively
Much of the value of blockchain data is in its capability to relate to off-chain data. Most blockchain apps refer to some data stored in off-chain repositories. It doesn’t make sense to store every type of data in a blockchain. Reference data, which is commonly data that gets updated to reflect changing conditions, may not be a good candidate for storing in a blockchain. Blockchain technology excels at recording value transfers between owners. All applications define and maintain additional information that supports and provides details for transactions but doesn’t directly participate in transactions. Information such as product descriptions or customer notes may make more sense to store in an off-chain repository. Any time blockchain apps rely on on-chain and off-chain data, integration methods become a concern. Even if your app uses only on-chain data, it is likely that analytics models will integrate with off-chain data. For example, owners in blockchain environments are identified by addresses. These addresses have no context external to the blockchain. Any association between an address and a real-world identity is likely stored in an off-chain repository. Another example of the need for off-chain data arises when analyzing aircraft safety trends. Perhaps your analysis correlates blockchain-based incident and accident data with weather conditions. Although each blockchain transaction contains a timestamp, you’d have to consult an external weather database to determine prevailing weather conditions at the time of the transaction. Many examples of the need to integrate off-chain data with on-chain transactions exist.
Part of the data acquisition phase of any analytics project is to identify data sources and access methods. In a blockchain analytics project, that process means identifying off-chain data you need to satisfy the goals of your project and how to get that data. Want to learn more? Check out our Blockchain Data Analytics Cheat Sheet.
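To make the integration idea concrete, here is a minimal Python sketch that enriches on-chain transactions with two off-chain sources: an address-to-identity lookup and a weather log matched by timestamp. Everything in it is hypothetical (the transaction records, the addresses, the operator names, the weather observations, and the function names weather_at and enrich). A real project would pull these from a blockchain node or explorer and from external databases.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Hypothetical on-chain incident transactions (timestamps are Unix seconds, UTC).
on_chain = [
    {"tx": "0xaaa", "address": "0x01", "timestamp": 1577838600},
    {"tx": "0xbbb", "address": "0x02", "timestamp": 1577842200},
]

# Hypothetical off-chain repository mapping addresses to real-world identities.
identities = {"0x01": "Operator A"}

# Hypothetical off-chain weather observations, sorted by observation time.
weather = [
    (1577836800, "clear"),
    (1577840400, "fog"),
    (1577844000, "rain"),
]
weather_times = [t for t, _ in weather]


def weather_at(timestamp: int):
    """Return the most recent observation at or before the given timestamp."""
    i = bisect_right(weather_times, timestamp) - 1
    return weather[i][1] if i >= 0 else None


def enrich(tx: dict) -> dict:
    """Attach off-chain identity and weather context to one on-chain transaction."""
    return {
        **tx,
        "owner": identities.get(tx["address"], "unknown"),
        "weather": weather_at(tx["timestamp"]),
        "time_utc": datetime.fromtimestamp(tx["timestamp"], tz=timezone.utc).isoformat(),
    }


enriched = [enrich(tx) for tx in on_chain]
for row in enriched:
    print(row)
```

The address with no identity record is reported as "unknown" rather than dropped, which keeps gaps in the off-chain data visible to the analyst.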
Article / Updated 08-04-2022
A common question from management when first considering data analytics and again in the specific context of blockchain is “Why do we need this?” Your organization will have to answer that question, in general, and you’ll need to explain why building and executing analytics models on your blockchain data will benefit your organization. Without an expected return on investment (ROI), management probably won't authorize and fund any analytics efforts. The good news is that you aren’t the pioneer in blockchain analytics. Other organizations of all sizes have seen the value of formal analysis of blockchain data. Examining what other organizations have done can be encouraging and insightful. You’ll probably find some fresh ideas as you familiarize yourself with what others have accomplished with their blockchain analytics projects. Here, you learn about ten ways in which blockchain analytics can be useful to today’s (and tomorrow’s) organizations. Blockchain analytics focuses on analyzing what happened in the past, explaining what's happening now, and even preparing for what's expected to come in the future. Analytics can help any organization react, understand, prepare, and lower overall risk. Accessing public financial transaction data The first blockchain implementation, Bitcoin, is all about cryptocurrency, so it stands to reason that examining financial transactions would be an obvious use of blockchain analytics. If tracking transactions was your first thought of how to use blockchain analytics, you’d be right. Bitcoin and other blockchain cryptocurrencies used to be viewed as completely anonymous methods of executing financial transactions. The flawed perception of complete anonymity enticed criminals to use the new type of currency to conduct illegal business. 
Since cryptocurrency accounts aren’t directly associated with real-world identities (at least on the blockchain), any users who wanted to conduct secret business warmed up to Bitcoin and other cryptocurrencies. When law enforcement noticed the growth in cryptocurrency transactions, they began looking for ways to re-identify transactions of interest. It turns out that with a little effort and proper legal authority, it isn’t that hard to figure out who owns a cryptocurrency account. When a cryptocurrency account is converted and transferred to a traditional account, many criminals are unmasked. Law enforcement became an early adopter of blockchain analytics and still uses models today to help identify suspected criminal and fraudulent activity. Chainalysis is a company that specializes in cryptocurrency investigations. Their product, Chainalysis Reactor, allows users to conduct cryptocurrency forensics to connect transactions to real-world identities. The image shows the Chainalysis Reactor tool. But blockchain technology isn’t just for criminals, and blockchain analytics isn’t just to catch bad guys. The growing popularity of blockchain and cryptocurrencies could lead to new ways to evaluate entire industries, P2P transactions, currency flow, the wealth of nation-states, and a variety of other market valuations with this new area of analysis. For example, Ethereum has emerged as a major avenue of fundraising for tech startups, and its analysis could lend a deeper look into the industry. Connecting with the Internet of Things (IoT) The Internet of Things (IoT) is loosely defined as the collection of devices of all sizes that are connected to the internet and operate at some level with little human interaction. IoT devices include doorbell cameras, remote temperature sensors, undersea oil leak detectors, refrigerators, and vehicle components. The list is almost endless, as is the number of devices connecting to the internet. 
Each IoT device has a unique identity and produces and consumes data. All of these devices need some entity that manages data exchange and the device’s operation. Although most IoT devices are autonomous (they operate without the need for external guidance), all devices eventually need to request or send data to someone. But that someone doesn’t have to be a human. Currently, the centralized nature of traditional IoT systems reduces their scalability and can create bottlenecks. A central management entity can handle only a limited number of devices. Many companies working in the IoT space are looking to leverage the smart contracts in blockchain networks to allow IoT devices to work more securely and autonomously. These smart contracts are becoming increasingly attractive as the number of IoT devices exceeds 20 billion worldwide in 2020. The figure below shows how IoT has matured from a purely centralized network in the past to a distributed network (which still had some central hubs) to a vision of the future without the need for central managers. The applications of IoT data are endless, and if the industry does shift in this direction, knowing and understanding blockchain analytics will be necessary to truly unlock its potential. Using blockchain technology to manage IoT devices is only the beginning. Without the application of analytics to really understand the huge volume of data IoT devices will be generating, much of the value of having so many autonomous devices will be lost. Ensuring data and document authenticity The Lenovo Group is a multinational technology company that manufactures and distributes consumer electronics. During a business process review, Lenovo identified several areas of inefficiency in their supply chain. After analyzing the issues, they decided to incorporate blockchain technology to increase visibility, consistency, and autonomy, and to decrease waste and process delays. 
Lenovo published a paper, “Blockchain Technology for Business: A Lenovo Point of View,” detailing their efforts and results. In addition to describing their supply chain application of blockchain technology in their paper, Lenovo cited examples of how the New York Times uses blockchain to prove that photos are authentic. They also described how the city of Dubai is working to have all its government documents on blockchain by the end of 2020 in an effort to crack down on corruption and the misuse of funds. In the era of deep fakes, manipulated photos and consistently evolving methods of corruption and misappropriation of funds, blockchain can help identify cases of data fraud and misuse. Blockchain’s inherent transparency and immutability means that data cannot be retroactively manipulated to support a narrative. Facts in a blockchain are recorded as unchangeable facts. Analytics models can help researchers understand how data of any type originated, who the original owner was, how it gets amended over time, and if any amendments are coordinated. Controlling secure document integrity As just mentioned, blockchain technology can be used to ensure document authenticity, but it can be used also to ensure document integrity. In areas where documents should not be able to be altered, such as the legal and healthcare industries, blockchain can help make documents and changes to them transparent and immutable, as well as increase the power the owner of the data has to control and manage it. Documents do not have to be stored in the blockchain to benefit from the technology. Documents can be stored in off-chain repositories, with a hash stored in a block on the blockchain. Each transaction (required to write to a new block) contains the owner’s account and a timestamp of the action. The integrity of any document at a specific point in time can be validated simply by comparing the on-chain hash with the calculated hash value of the document. 
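The comparison just described can be sketched in Python. This is a minimal illustration under stated assumptions: SHA-256 stands in for whatever hash function a given service uses, the recorded value stands in for a hash that would actually be read from a blockchain transaction, and the function names document_hash and is_unchanged are hypothetical.

```python
import hashlib


def document_hash(content: bytes) -> str:
    """Compute the hex digest that would be recorded on-chain for a document."""
    return hashlib.sha256(content).hexdigest()


def is_unchanged(content: bytes, on_chain_hash: str) -> bool:
    """Compare the document's current hash with the hash recorded on-chain."""
    return document_hash(content) == on_chain_hash


# The document itself stays in an off-chain repository.
original = b"Lease agreement, version 1"

# Stand-in for the hash stored in a blockchain transaction (an assumption here).
recorded = document_hash(original)

print(is_unchanged(original, recorded))
print(is_unchanged(b"Lease agreement, version 2", recorded))
```

Only the fixed-length hash needs to go on-chain, so the check works for documents of any size without exposing their contents.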
If the hash values match, the document has not changed since the blockchain transaction was created. The company DocStamp has implemented a novel use for blockchain document management. Using DocStamp, anyone can self-notarize any document. The document owner maintains control of the document while storing a hash of the document on an Ethereum blockchain. Services such as DocStamp provide the capability to ensure document integrity using blockchain technology. However, assessing document integrity and its use is up to analytics models. The DocStamp model is not generally recognized by courts of law to be as strong as a traditional notary. For that to change, analysts will need to provide model results that show how the approach works and how blockchain can help provide evidence that document integrity is ensured.
Tracking supply chain items
In the Lenovo blockchain paper, the author described how Lenovo replaced printed paperwork in its supply chain with processes managed through smart contracts. The switch to blockchain-based process management greatly decreased the potential for human error and removed many human-related process delays. Replacing human interaction with electronic transactions increased auditability and gave all parties more transparency in the movement of goods. The Lenovo supply chain became more efficient and easier to investigate. Blockchain-based supply chain solutions are one of the most popular ways to implement blockchain technology. Blockchain technology makes it easy to track items along the supply chain, both forward and backward. The capability to track an item makes it easy to determine where an item is and where that item has been. Tracing an item’s provenance, or origin, makes root cause analysis possible. Because the blockchain keeps all history of movement through the supply chain, many types of analysis are easier than with traditional data stores, which can overwrite data.
The US Food and Drug Administration is working with several private firms to evaluate using blockchain technology supply chain applications to identify, track, and trace prescription drugs. Analysis of the blockchain data can provide evidence for identifying counterfeit drugs and delivery paths criminals use to get those drugs to market. Empowering predictive analytics You can build several models that allow you to predict future behavior based on past observations. Predictive analytics is often one of the goals of an organization’s analytics projects. Large organizations may already have a collection of data that supports prediction. Smaller organizations, however, probably lack enough data to make accurate predictions. Even large organizations would still benefit from datasets that extend beyond their own customers and partners. In the past, a common approach to acquiring enough data for meaningful analysis was to purchase data from an aggregator. Each data acquisition request costs money, and the data you receive may still be limited in scope. The prospect of using public blockchains has the potential to change the way we all access public data. If a majority of supply chain interactions, for example, use a public blockchain, that data is available to anyone — for free. As more organizations incorporate blockchains into their operations, analysts could leverage the additional data to empower more companies to use predictive analytics with less reliance on localized data. Analyzing real-time data Blockchain transactions happen in real time, across intranational and international borders. Not only are banks and innovators in financial technology pursuing blockchain for the speed it offers to transactions, but data scientists and analysts are observing blockchain data changes and additions in real time, greatly increasing the potential for fast decision-making. To view how dynamic blockchain data really is, visit the Ethviewer Ethereum blockchain monitor’s website. 
The following image shows the Ethviewer website. Each small circle in the blob near the lower-left corner of the web page is a distinct transaction waiting to make it into a new block. You can see how dynamic the Ethereum blockchain is — it changes constantly. And when the blockchain changes, so does the blockchain data that your models use to provide accurate results. Supercharging business strategy Companies big and small — marketing firms, financial technology giants, small local retailers, and many more — can fine-tune their strategies to keep up with, and even get ahead of, shifts in the market, the economy, and their customer base. How? By utilizing the results of analytics models built on the organization’s blockchain data. The ultimate goal for any analytics project is to provide ROI for the sponsoring organization. Blockchain analytics projects provide a unique opportunity to provide value. New blockchain implementations are only recently becoming common in organizations, and now is the time to view those sources of data as new opportunities to provide value. Analytics can help identify potential sources of ROI. Managing data sharing Blockchain technology is often referred to as a disruptive technology, and there is some truth to that characterization. Blockchain does disrupt many things. In the context of data analytics, blockchain changes the way analysts acquire at least some of their data. If a public or consortium blockchain is the source for an analytics model, it's a near certainty that the sponsoring organization does not own all the data. Much of the data in a non-private blockchain comes from other entities that decided to place the data in a shared repository, the blockchain. Blockchain can aid in the storage of data in a distributed network and make that data easily accessible to project teams. Easy access to data makes the whole analytics process easier. 
There still may be a lot of work to do, but you can always count on the facts that blockchain data is accessible and it hasn’t changed since it was written. Blockchain makes collaboration among data analysts and other data consumers easier than with more traditional data repositories. Standardizing collaboration forms Blockchain technology empowers analytics in more ways than just providing access to more data. Regardless of whether blockchain technology is deployed in the healthcare, legal, government, or other organizational domain, blockchain can lead to more efficient process automation. Also, blockchain’s revolutionary approach to how data is generated and shared among parties can lead to better and greater standardization in how end users populate forms and how other data gets collected. Blockchains can help encourage adherence to agreed-upon standards for data handling. The use of data-handling standards will greatly decrease the amount of time necessary for data cleaning and management. Because cleansing data commonly requires a large time investment in the analytics process, standardization through the use of blockchain can make it easier to build and modify models with a short time-to-market.
Article / Updated 08-04-2022
The main purpose of data analytics is to uncover hidden meaning in data. If it were easy to look at raw data and interpret what it means, there wouldn’t be a need for sophisticated data analytics. Although a well-trained analyst can look at a model’s mathematical output and make inferences about the data, those inferences aren’t always easy to explain to others. To clearly explain the results of most models’ output, you need to draw a picture. Visualizing data isn’t just a nice thing to know; it's critical to conveying meaning to other people. Technical and non-technical people alike benefit from a good data visualization. Sometimes a bar chart most clearly explains data visually; other times a pie chart is better. Knowing how to visualize your data for the biggest effect is an important skill that improves with experience. One of the most critical parts of any analytics project is presenting the results. Choosing the right visualizations for presenting your results can make or break your presentation. In this article, you discover ten tips for visualizing data. These tips will help you assess your data and choose a visualization technique that will most clearly convey the story your data wants to tell. Checking the landscape around you Just like the great scientists of our age stand on the shoulders of the giants who came before them, you should take the opportunity to learn from existing visualizations. A quick Internet search on visualizing data will give you many ideas on what kinds of visualizations others have used, pointers on how they were done, and even some potential pitfalls. In many cases, you can visualize a specific type of data in several ways, and seeing how others have done it might give you some ideas. And if you’ve already created visualizations of your data, seeing someone else's approach might inspire you to improve your work. To get started, look at an example from the king of data, Google. 
This image shows a visualization of the Ethereum blockchain from BigQuery, Google’s big data analytics platform. You can read about BigQuery and its blockchain visualizations. Regardless of the source, taking time to look over how others have visualized their data can be both instructive and enlightening. Leveraging the Blockchain community Many analysts and data scientists of all skill levels are online and willing to help point aspiring data visualizers to the right datasets and tools. Stack Overflow, Reddit (and appropriate subreddits, such as the one for data visualization and predictive analysis), and Kaggle are all great places to network online, ask questions, and learn how to build first-rate visualizations quickly. Many tools have active communities. Don’t ignore the value of asking questions of people who are more experienced than you. Chances are, they had lots of questions at some point in the past as well. User communities are great places to learn. This image shows the results when the term techniques for visualizing data was searched for on Stack Overflow. The image you see below shows the community and subreddit results of searching for visualizing data on Reddit. The following image shows the Kaggle website. You’ll find lots of resources on Stack Overflow, Reddit, and Kaggle, and all are worth bookmarking for later reference. Make friends with network visualizations One of the many data visualizations in computer science is the directed acyclic graph (DAG). DAGs have many uses and applications, and it's easy to dive deep in a short period of time. For our use, let’s stick with a simple explanation of DAGs. A DAG, also sometimes called a network graph, is a directed graph of vertices and edges that contains no cycles. Vertices are generally states, and edges are transitions from one state to another. If you’re wondering how DAGs remotely relate to blockchain data, remember that blockchain technology excels at handling transfers of ownership. 
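One way to picture transfers of ownership as a graph is to treat each account as a vertex and each transfer as a directed edge labeled with the amount. The sketch below builds that structure with plain Python dictionaries. The account labels, the amounts, and the build_graph function are made up for illustration; a charting or graph library could then draw the result.

```python
from typing import Dict, List, Tuple

# Each tuple is (from_account, to_account, amount); the values are invented.
transactions: List[Tuple[str, str, float]] = [
    ("0xA", "0xB", 5.0),
    ("0xA", "0xC", 2.0),
    ("0xB", "0xC", 1.5),
    ("0xA", "0xB", 1.0),
]


def build_graph(txs: List[Tuple[str, str, float]]) -> Dict[str, Dict[str, float]]:
    """Build an adjacency map: sender -> {receiver: total amount transferred}."""
    graph: Dict[str, Dict[str, float]] = {}
    for sender, receiver, amount in txs:
        graph.setdefault(sender, {})    # make sure both vertices exist
        graph.setdefault(receiver, {})
        graph[sender][receiver] = graph[sender].get(receiver, 0.0) + amount
    return graph


graph = build_graph(transactions)
for sender, edges in graph.items():
    for receiver, total in edges.items():
        print(f"{sender} -> {receiver}: {total}")
```

Summing repeated transfers into one edge keeps the picture readable; keeping each transfer as its own edge is an equally valid choice when individual movements matter.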
You can represent a blockchain transaction as two vertices (from account and to account), and an edge (amount of transfer). Using a DAG (network graph), you can visually show how assets are transferred from one account to another. Network graphs make it possible to visualize any transfer, such as in a supply chain blockchain. Visualizing data using network graphs isn’t new. For example, the GIGRAPH application makes it easy to turn spreadsheet data into a network graph. You could do the same thing with any type of blockchain data. The following image shows an example of a network graph generated from tabular data in an Excel spreadsheet. Recognize subjectivity when visualizing Blockchain data Whenever you engage in cryptocurrency or other blockchain data analysis and visualizations, you should recognize that legacy systems often calculate value differently than new systems, especially new systems that incorporate cryptocurrency-based transactions. The value of transactions and the currency itself is subject to at least some degree of subjectivity. For instance, it's common to explain how blockchain transaction fees are far cheaper than the real-life processing fees they should replace. This may be true today, but if the value of cryptocurrency changes dramatically with respect to fiat currency, the relative values may change as well. A blockchain transaction fee today may seem very low, but worldwide financial turmoil coupled with a global strengthening of trust in cryptocurrency could invert today’s value perception. When you analyze and especially when you visualize, make sure you deal with any ambiguity that relative valuation may cause and communicate it clearly to the audience of your visualizations. Likewise, if your visualizations are built on any assumptions or constraints, be sure to note those as well. You want your visualizations to stand on their own as much as possible, not open to wildly different interpretations by the audience. 
Use scale, text, and the information you need to visualize your data Blockchain analysis is a data-rich environment, so you need to make sure you don’t overwhelm your audience with too much information. Providing too many nodes or colors or excessively specific visual markers can make visualizations confusing, which misses the point of visuals. Determining what is “too much” is a bit of an art form. In general, use your best judgement and make sure you include only the information you need and present it clearly. Tableau Gurus published a nice article on how to avoid clutter in your visuals. The data visualization recommendations in this article are timeless and worth incorporating into your own work. The suggestions are simple and straightforward. The following image shows an example suggestion from Tableau Gurus to simplify visualizations. If your data is either isolated to a narrow band in your visualization or varies widely, consider changing the scale. Decreasing the scale can cause narrowly depicted data to show more variance, and a log scale can show relative changes more clearly. If your data doesn’t tell a story clearly, try changing its scale to see if that exposes interesting information. Consider frequent updates for volatile blockchain data Although it's true that data in a blockchain block never changes, new blocks are added every few minutes or seconds. Regardless of when you execute an analytics model on blockchain data, the volatility of the blockchain makes your analysis stale almost immediately. New transactions are submitted in a nearly continuous stream, and any of those transactions could affect your models. Your choice is to either frequently update your model and its associated datasets to be relatively current with the live blockchain or clearly state the highest block represented in your model. The latter approach tends to be easier but more confusing. 
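If you take the second option, one simple habit is to carry the highest block number along with the dataset and report how far behind the live chain it is. The sketch below assumes you already know the current chain head from your own node or data provider. The Snapshot class and the block numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Records the highest block included in a dataset (hypothetical helper)."""
    highest_block: int

    def blocks_behind(self, chain_head: int) -> int:
        """Number of blocks added to the chain since this snapshot was taken."""
        return max(0, chain_head - self.highest_block)

    def label(self, chain_head: int) -> str:
        """Caption text suitable for a chart footnote."""
        behind = self.blocks_behind(chain_head)
        return f"Data through block {self.highest_block} ({behind} blocks behind)"


# Invented numbers for illustration only.
snapshot = Snapshot(highest_block=1_000)
print(snapshot.label(chain_head=1_250))
```

Putting the label directly on each visualization makes the age of the data visible to the audience instead of leaving it in a footnote elsewhere.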
Just reminding your audience that a model is based on outdated data generally doesn’t communicate the potential risk of relying on old data. In most cases, frequent updates mean more accurate results. To get an idea of the dynamic nature of blockchains, visit Ethviewer, a real-time Ethereum blockchain monitor shown below. You don’t have to look at the Ethviewer web page long to get an appreciation of how quickly transactions are submitted and make it into a new block. Get ready for big data Blockchain analysis gives analysts access to massive amounts of information. If you want to successfully analyze and visualize large sets of data in compelling ways, both your visualization tools and the hardware that runs them must be capable of handling the load. Hadoop is one of the most popular options for big-data analysis. On the visualization side, Jupyter, Tableau, D3.js, and Google Charts can help. A little research into the right tools goes a long way. As far as hardware, make sure your CPU and memory are up to the task; you’ll want at least a quad-core CPU and 16 GB of RAM. You can run analytics on big data with less, but your performance might suffer. Visit the following websites to get more information on visualization tools that are ready to handle big-data analysis: Jupyter: This extremely useful toolset supports visualizations of datasets from small to extremely large. Learn about the products from the Jupyter Project; you’ll be glad you did. Tableau: Tableau is a market leader in big data analysis and visualization. This product is mature and integrates with most large-scale data-handling and high-performance processing platforms. For an enterprise-class analytics framework, Tableau is hard to beat. Google Charts: The Google Charts website says it all: “Google chart tools are powerful, simple to use, and free.” D3.js: The Data-Driven Documents JavaScript library (D3.js) provides the capability to visualize big data using many techniques in JavaScript programs. 
If you’re using JavaScript to build analytics models, D3.js should be on your evaluation list. Protect privacy in your data visualizations In today’s hyper-regulated and privacy-sensitive business environment, you must ensure that you're using a large enough dataset or partitions to avoid the possibility of associating any unique individual with the data your audience views. To make matters worse, even large datasets or partitions may not be enough to protect privacy. Sophisticated re-identification capabilities can infer unique identities with what seems to be a minimal amount of data. In addition to taking care to preserve privacy when you build datasets, your models must also be built to preserve privacy in the results they produce. Blockchain might seem immune to privacy issues because no real-life identities are associated with transactions. But Peter Szilagyi, a core Ethereum developer, has talked about various sites capable of creating links between a user’s IP address and an Ethereum transaction address. Although the ability he describes has generally been blocked in many apps, other attacks on privacy will arise. As with all data analysis and visualization efforts, it’s better to be safe than sorry. Always pay attention to privacy as you build datasets and the models that analyze your data. Let your data visualizations tell your story Any time you attempt to digest a large amount of data and present results, it’s easy to overwhelm your audience with too much information and complex visualizations. Just as important as creating easy-to-understand visualizations is ensuring that they contribute to what you are trying to say. This point is true for any visualizations, not just those associated with blockchain. Keep in mind the big picture you’re creating. Go back to the beginning of your analytics project. Remind yourself of the original goals of the project. 
Then, as you work toward building visualizations for each model, revisit the goals for each model. As long as each visualization conveys the message you want to convey and meets one or more of the project’s goals, you've created a useful visualization. Only include useful visuals. Extra visuals, no matter how flashy they may be, detract from the project’s primary goal. Stay focused on what you've been asked to do. Challenge yourself! Blockchain is an emerging technology and its uses are still being discovered and fleshed out. Keep up with the latest research, papers, and competitions on sites such as Kaggle to keep your analysis and visualization skills sharp. Take online courses on visualization topics and tools and just keep learning! Remember that if a picture really is worth a thousand words, strive to use those thousand words better with each new project. Want to learn more? Check out this article to learn what makes a good data visualization.
Article / Updated 08-04-2022
The information age offers many new opportunities and just as many (if not more) challenges. The vast amount of data available to organizations of all types empowers advanced decision-making and raises new questions of privacy and ethics. Whether you are undertaking a blockchain data analytics project or engaging with data in any way, there are certain regulation and data privacy laws you should be aware of. Consumer protection groups have long been voicing concerns about how personal data is being used. In response to discovered abuses and the recognition of potential future abuses, governing bodies around the world have passed regulations and legislation to limit how data is collected and used. Although collecting a few pieces of information about a customer may seem innocent, it doesn’t take long for accumulated data to paint a picture of an individual’s personal characteristics and behavior. Knowing the past behavior of someone makes it relatively easy to predict the person's future actions and choices. Predicting actions has value for marketing but also poses a danger to an individual’s privacy. Classifying individuals in data The concern is that personal data has been, and will continue to be, used to classify individuals based on their past behavior. Classifying individuals can be great for marketing and sales purposes. For example, any retailer that can identify engaged couples can target them with ads and coupons for wedding-related items. This type of targeted advertising is generally more productive than general marketing. Advertising budget can be focused on target markets that provide the greatest ROI. On the other hand, knowing too much about individuals may violate a person’s privacy. One instance of a privacy violation was a result of the Target Corporation’s astute data analysis. Target’s analysts were able to identify expectant mothers early in their pregnancy based on their changing purchasing habits. 
When a new expectant mother was identified, Target would send unsolicited coupons for baby-related items. In one case, the coupons arrived in the mail before the mother had shared that she was pregnant; her family found out about the pregnancy from a retailer. Privacy is such a difficult issue because legitimate actions can violate a person’s privacy. Identifying criminals Another aspect of privacy is when criminals, or other individuals who deliberately want to operate anonymously, hide their identities from exposure. Privacy may be important to the general population, but it's a necessity for criminal activity. The ability to deny, or repudiate, some action is crucial in avoiding discovery and capture, and to any subsequent defense. Money laundering and fraud are two activities in which privacy and anonymity are desired to obfuscate illegal activity. On the other hand, law enforcement needs the ability to associate actions with individuals. That’s why laws exist that protect the general public but allow law enforcement to conduct investigations and identify alleged perpetrators. Protecting the privacy of law-abiding individuals while identifying criminals has become important across a spectrum of organizations. To enable law enforcement to deal with online privacy issues, legislative bodies have passed various laws to address those issues directly. Common privacy laws Here are a few of the most important privacy-related laws you’ll likely encounter and may be compelled to satisfy: Children’s Online Privacy Protection Act (COPPA): Passed in 1998, COPPA requires parental or guardian consent before collecting or using private information about children under the age of 13. Health Insurance Portability and Accountability Act (HIPAA): Passed in 1996, HIPAA modernized the flow of health care information and contains specific stipulations on protecting the privacy of protected health information (PHI). 
Family Educational Rights and Privacy Act (FERPA): Passed in 1974, FERPA protects access to educational information, including protection for the privacy of student records. General Data Protection Regulation (GDPR): Passed in 2016 (and implemented in 2018), GDPR is a comprehensive regulation from the European Union (EU) protecting the private data of EU citizens. Every organization, regardless of location, must comply with GDPR to conduct business with EU citizens. The EU citizen must retain control over his or her own data, its collection, and its use. California Consumer Privacy Act (CCPA): Passed in 2018, CCPA has been called “GDPR lite” to imply that it includes many of the requirements of GDPR. CCPA requires covered organizations that conduct business in California to protect consumer data privacy. Anti-Money Laundering Act (AML): AML is a set of laws and regulations that assists law enforcement investigations by requiring financial transactions to be associated with validated identities. AML imposes requirements and procedures on financial institutions that essentially make it very difficult to transfer money without leaving a clear audit trail. Know Your Customer (KYC): KYC laws and regulations work with AML to ensure that businesses expend reasonable effort to verify the identity of each customer and business partner. KYC helps to discourage money laundering, bribery, and other financial-based criminal activities that rely on anonymity. Want to learn more? Read our article to learn how to prevent data privacy disasters.
Article / Updated 08-04-2022
Although understanding the blockchain data available through transactions, events, and contract state is important, you must understand what that data represents before you can make much sense out of it. An important part of any blockchain (or traditional) data analytics project is to align data with the real world. In a blockchain environment, that understanding starts with smart contracts. Understanding smart contract functions You can think of smart contracts as programs that contain data and the functions to manipulate that data. One way to help understand smart contracts is to think of state data as nouns and functions as verbs. Associating smart contract elements with parts of speech helps to understand each element’s purpose. You store data that represents something in the real world, such as an order, a product, or a letter of credit. Functions provide the actions that applications take on data, such as creating an order, createOrder(), shipping a product, shipProduct(), or requesting a letter of credit, requestLoC(). Data analytics is focused on extracting meaningful and actionable information from data. It is important to understand the data available to you, along with how that data was created and what real-world things and processes it represents. Smart contract functions provide the roadmap to how data gets added to the blockchain and what that data means. Assessing smart contract event logs One process early in any data analytics project is assessing your available data. In a blockchain environment, that step should include assessing any events related to the smart contracts you’ll examine. One way to view events is as documentation of internal operations. These microtransaction artifacts often provide a level of granular data that you can’t get anywhere else. Don’t ignore the event logs; they may provide your best description of blockchain data and what it really represents. 
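The nouns-and-verbs view, and the role of event logs, can be modeled with a toy Python class. This is not smart contract code and not how any blockchain runtime works; it is only a stand-in to show how state, functions, and events relate. The function names createOrder() and shipProduct() come from the text above, while the OrderContract class, the event names, and the field names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


class OrderContract:
    """Toy stand-in for a smart contract: state (nouns), functions (verbs), events."""

    def __init__(self) -> None:
        self.orders: Dict[str, Dict[str, str]] = {}  # state data: the "nouns"
        self.events: List[Tuple[str, str]] = []      # append-only event log

    def createOrder(self, order_id: str, product: str) -> None:
        """Verb: record a new order and log that it happened."""
        self.orders[order_id] = {"product": product, "status": "created"}
        self.events.append(("OrderCreated", order_id))

    def shipProduct(self, order_id: str) -> None:
        """Verb: update the order's status and log the change."""
        self.orders[order_id]["status"] = "shipped"
        self.events.append(("ProductShipped", order_id))


contract = OrderContract()
contract.createOrder("A-1", "widget")
contract.shipProduct("A-1")

print(contract.orders["A-1"])  # current state shows only the latest status
print(contract.events)         # the event log shows how the state got there
```

The state answers "what is true now," while the event log answers "what happened and in what order," which is why both are worth cataloging.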
Ranking blockchain transaction and event data by its effect After you have a catalog of the data available to you, rank each data item’s importance by its effect. A data item has greater effect when it corresponds to some entity attribute or action in the real world. Data that represents a letter of credit’s approval status change is likely more important than the field that records the page count of the letter of credit document. All data is not equal. It is always up to you, the data analyst, to focus on the important data and not spend too much time on data with little value. Properly ranking data value by its effect is a learned skill, and one that takes practice. Want to learn more? Check out our Blockchain Data Analytics Cheat Sheet.
Article / Updated 08-04-2022
Blockchain technology is viewed as a disruptive technology due to the promise of removing intermediaries and changing the way business is conducted. That promise is a big one for blockchain, but it is possible. Removing even some of the intermediaries in existing business processes has the potential of streamlining and economizing workflows at all levels. On the other hand, changing a business process to blockchain technology is not a simple switch. For widespread implementation of blockchain technology, new business and software products that integrate with existing software and data are required. The challenge of moving from concept to deployment poses the greatest current difficulty for blockchain adoption. Finding a good blockchain fit for your business The first step in successfully implementing blockchain technology in any environment is finding a good-fit use case. It doesn’t make any sense to jump into blockchain just because it’s new and cool. It has to make sense for you and your organization. That statement sounds obvious, but you’d be surprised how many organizations want to chase the shiny object that is blockchain. Blockchain has many benefits, but three of the most common are data transparency, process disintermediation (removing middlemen), and persistent transaction history. The best-fit use cases for blockchain generally focus on one of these benefits. If you have to look hard at how blockchain technology can meet the needs of your organization, it may be best to wait until there is a clear need. The most successful blockchain implementations are those that start with clear goals that align with blockchain. For example, suppose a seafood supplier wants to be able to trace its seafood back to the source to determine whether it was caught or harvested in the wild using humane and sustainable methods. A blockchain app would make it possible to manage seafood from the point of collection all the way to the consumer’s purchase. 
Any participant along the way, including the consumer, can scan a tag on the seafood and find out when and where it was originally caught. To increase the probability of a successful blockchain project, start with a clear description of how the technology aligns with project goals. Trying to fit blockchain to an ill-suited use case leads to frustration and ultimate failure. Integrating blockchain technology with legacy artifacts After you determine that blockchain is a good fit for your environment, the next step is to determine where it fits in the workflow. Unless you're building a new app and workflow, you’ll have to integrate with existing software and infrastructure. If you are creating something new, the only considerations revolve around how your app stores the data it needs. Will you store everything on the blockchain? It may not make sense to do that. For example, blockchain does a great job at handling transactional data and keeping permanent audit trails of changes to data. Do you need that for customer information? You may find that only part of your app data should be stored on the blockchain. It may make more sense to store supporting data in off-chain data repositories. (Now that we’re in the blockchain era, legacy databases are called off-chain repositories.) If this is the case, your app will have to integrate with the blockchain and the off-chain repository. In many cases, people are integrating new blockchain functionality with legacy applications and data. This integration effort could include introducing both new blockchain functionality and moving existing functionality to a blockchain environment. Although this task may sound straightforward, integrating with legacy systems involves many subtle implications. Legacy systems define notions of identity, transaction scoping (defining how much work is accomplished in a single transaction), and performance expectations. 
Some questions to consider: How will your new app associate legacy identities with blockchain accounts? How will you adhere to your existing application’s notion of traditional transactions? If your application supports rolling back a transaction, how will your blockchain do this? Will the legacy application’s users have to wait for blockchain transactions, or will they be able to carry out work like they did before the blockchain implementation? And lastly, will the integration of blockchain maintain sufficient performance or will it slow down the legacy application? Scaling blockchain to the enterprise The last question above leads well into one of the biggest current obstacles to blockchain adoption. Scaling performance to an enterprise scale is an ongoing pursuit that hasn’t been completely resolved. Most enterprise applications use legacy database management systems to store and retrieve data. These data repositories have been around for decades and have become efficient at handling vast amounts of data. According to Changpeng Zhao (CEO of the cryptocurrency exchange Binance), a blockchain implementation must be able to support 40,000 transactions per second to be viable as a core technology in a global cryptocurrency exchange. Currently, only four popular blockchain implementations claim to be capable of more than 1,000 transactions per second: Futurepia, EOS, Ripple, and NEO. The most popular public blockchain, Ethereum, currently can handle about 25 transactions per second. Future releases of Ethereum, however, are focusing on raising the transaction throughput substantially. The technology is getting better but has a long way to go to be ready for the volume that enterprises require. Performance isn’t the only limiting factor when assessing blockchain for the enterprise. Integration with legacy artifacts and the ease with which the blockchain infrastructure fits into the existing enterprise IT infrastructure are concerns as well. 
Do all blockchain nodes require new virtual or physical hardware? Can the new nodes run on existing servers? What about network connectivity? Will existing network infrastructure support the new blockchain network? These are only a few of the many questions that enterprises must answer before deploying a blockchain integration project. Want to learn more? Check out our Blockchain Data Analytics Cheat Sheet.
Article / Updated 08-04-2022
Knowing how to access blockchain data and use it in analytics models are only the first steps toward creating useful results. The next step is to actually do these tasks. Although you can develop models using a simple text editor, having the right tools will speed the process and make you far more productive. The right tool for each part of the blockchain data analytics project can dramatically increase the probability that your results will have value to your organization. No single tool, framework, or package works well in every blockchain situation. You must define your project’s requirements, consider the resources available to you, and then select the best collection of tools for your analytics project toolbox. Here, you learn about ten common tools that analysts use for blockchain analytics projects. This article includes an assortment of tools that address a wide range of requirements. These tools will help you get a jumpstart toward delivering quality blockchain analytics results. Develop blockchain data analytics models with Anaconda You should download and install the Anaconda environment because of its value in any analytics project. Anaconda is the first tool you should be using because of the many ways it makes analytics easier. You can get Anaconda for small teams or for enterprise analytics development and deployment. The team and enterprise Anaconda licenses aren’t free, but in exchange for the licensing fee you get lots of collaboration capabilities that will make team analytics development easier, including tools to extract and organize data, prototype models, develop analytics solutions, and deploy those solutions. The Anaconda environment promotes “an integrated, end-to-end data experience,” where analytics project team members can easily collaborate and share project artifacts. Anaconda Navigator, shown below, is the default user interface, but you can use the conda command-line interface if you prefer a text-based interface. 
In the image above, note that only some tools are installed. When you install Anaconda, the install process searches your computer to see if any tools in Anaconda Navigator are already installed. Any tools that are recommended as part of the Anaconda environment but haven’t been installed have an Install button under their icons. To install any new tool, just click or tap the Install button. Anaconda is far more than just a collection of tools. One of the most valuable aspects of Anaconda is that it automatically installs many of the analytics libraries you’ll use when building models. And if highly productive tools and pre-installed libraries aren’t enough, Anaconda also provides lots of entry points for product documentation and tutorials to help you get up to speed in record time. If you choose only one tool to install to supercharge your analytics projects, choose Anaconda. Write code in Visual Studio Code When writing software for nearly any environment (in nearly any language), try using the Visual Studio Code integrated development environment (IDE). Visual Studio Code, commonly called VS Code, is a freely available code editor and IDE from Microsoft that includes support for debugging, task execution, and version control. Microsoft provides VS Code for Windows, Linux, and macOS. Although technically a lightweight alternative to the flagship product, Visual Studio IDE, VS Code brings a ton of functionality to the table. VS Code is free for private and commercial use and gives developers a great environment for developing code. In addition to being free, VS Code is extremely functional and developer friendly. VS Code has its own marketplace with hundreds of free extensions. VS Code extensions provide support for multiple languages (syntax checking and inline help), handling different types of file formats, and integration with many other tools. If you use VS Code and want some additional feature, there’s a good chance you can find an extension that does what you want. 
The following image shows VS Code in the editor window. This version of VS Code includes a Python extension, so VS Code automatically checks any Python code for syntax errors. Because you don’t see any red squiggly underlines in the following image, the code you see is syntactically correct. Although other good IDEs for code development are available, VS Code is one of the most popular choices for software developers, which is why it's one of the default tools in the Anaconda Navigator. Prototype blockchain data analytics models with Jupyter Jupyter Notebook and JupyterLab are popular products from Project Jupyter, an open-source and open-standards group dedicated to providing interactive programming support for many languages. Jupyter Notebook and JupyterLab are both included in the default Anaconda Navigator due to their popularity with data analysts and machine-learning model developers. Both tools are web applications that allow developers and analysts to build and populate models in a shared environment. Jupyter tools are popular choices when learning about data analytics and machine learning because the online design of the tools makes it easy to share code and data, called notebooks, with others. Anyone who wants to share a model, data, or any examples can just share a notebook. This next image shows the kmeans.py Python program in Jupyter Notebook. Building on the popularity of Jupyter Notebook, JupyterLab is the next generation of Jupyter’s web interface for notebooks, code, and data. The image below shows the kmeans.py Python program in JupyterLab. Jupyter products support over 40 languages. Develop blockchain data models in the R language with RStudio Throughout this book, you learn about building analytics models with the Python language. But Python isn’t the only language commonly used to build analytics models. The R language is another popular language for data modeling and analysis. 
Like Python, R can import many libraries, called packages in R, to provide access to hundreds of analytics functions. One of the most popular IDEs for working with the R language is RStudio. You can use VS Code for R development, but RStudio is a strong alternative and a favorite of R developers. In fact, you can use RStudio for both R and Python code development. RStudio is available as a standalone IDE and a web-based server interface. Both are open-source products. RStudio also offers a range of professional for-fee products designed for teams of analysts and developers who need collaboration features. The following image shows an R program that analyzes a dataset of income records by zip code. The RStudio IDE displays the R code, console messages, a list of items in memory, and the final visual output. Before you install RStudio, you must install the R language. If you try to install and then launch RStudio and get a message that R needs to be installed, you forgot to install the R language first. Interact with blockchain data with web3.py You need a blockchain client to interact with data stored in your blockchain. Each blockchain implementation is different, but the overall concepts are similar. After you learn how to access and analyze data from one blockchain implementation, mapping that knowledge to another environment is relatively easy. You can use the web3.py Ethereum blockchain client to access blockchain data. You’ll need this critical library to examine and extract the blockchain data required by your analytics models. This image shows the web3.py project website and several options you can use to install the web3.py library. But web3.py isn’t the only option. There are a few options for the Ethereum blockchain, and a quick Internet search will show you multiple options for other blockchains. 
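As a rough illustration of the web3.py workflow described above, here is a minimal Python sketch. It assumes web3.py version 6 or later is installed and that an Ethereum node is reachable at the placeholder URL shown; the helper names `summarize_block` and `fetch_latest_block` are hypothetical and not part of web3.py. The sample block at the bottom is hand-made so the summary logic can be tried without a node.

```python
from typing import Any, Mapping


def summarize_block(block: Mapping[str, Any]) -> dict:
    """Reduce a block record to a few analytics-friendly fields."""
    transactions = block.get("transactions", [])
    return {
        "number": block["number"],
        "timestamp": block["timestamp"],
        "tx_count": len(transactions),
        "gas_used": block["gasUsed"],
    }


def fetch_latest_block(node_url: str = "http://127.0.0.1:7545") -> dict:
    """Fetch and summarize the latest block from an Ethereum node.

    Assumptions: web3.py v6+ is installed, and a node is listening at
    node_url (the default here is only a placeholder for a local node).
    """
    # Imported lazily so summarize_block works even without web3 installed.
    from web3 import Web3

    w3 = Web3(Web3.HTTPProvider(node_url))
    if not w3.is_connected():
        raise ConnectionError(f"No Ethereum node reachable at {node_url}")
    return summarize_block(w3.eth.get_block("latest"))


# Offline demonstration with a hand-made block record (not real chain data).
sample_block = {
    "number": 12,
    "timestamp": 1_700_000_000,
    "gasUsed": 42_000,
    "transactions": ["0xaaa", "0xbbb"],
}
summary = summarize_block(sample_block)
print(summary)
```

Calling `fetch_latest_block()` against a running node should return the same four fields for the most recent block. Keeping the summary step separate from the network call makes the extraction logic easy to test and to reuse with other clients.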
Extract blockchain data to a database Throughout this book you learn how to identify blockchain data of interest and extract that data for use in analytics models. In some cases, you might need to extract blockchain data first and explore it later. Because you may not know what data you’ll need up front, you may find it more efficient to extract blockchain data to an off-chain repository for later analysis. By extracting blockchain data and storing it in a high-performance database management system, you can decrease data access times. You can write your own extraction code, but several generic products are already available to extract blockchain data and store it in a database. Extracting blockchain data with EthereumDB EthereumDB is an open-source product that extracts Ethereum blockchain data and stores it in a SQLite database. EthereumDB is a quick and simple method for extracting summary data, transaction details, and block information into separate relational database tables. You can use EthereumDB as is or as a tutorial on how to extract Ethereum blockchain data. Storing blockchain data in a database using Ethereum-etl Ethereum-etl is another open-source product you can use to extract Ethereum blockchain data. Ethereum-etl is more complex and flexible than EthereumDB. Using Ethereum-etl, you can output extracted data to text files or database tables. You also have a wider range of blockchain data you can extract, including block data, token transfers, and event logs. If you want to be able to tailor the data you extract from an Ethereum blockchain, Ethereum-etl is a good option to explore. Access Ethereum networks at scale with Infura All examples in this book use local blockchains provided by Ganache. Although Ganache is a great tool for learning blockchain concepts and developing your own blockchain code, it isn’t a live blockchain network. Real analytics projects will need to interact with real blockchain networks. 
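Returning to the extract-first, analyze-later approach described above, the following is a minimal sketch of storing block and transaction records in SQLite using Python's standard `sqlite3` module. It is not how EthereumDB or Ethereum-etl are implemented; the table layout, column names, and sample records are all hypothetical, and the records are hand-made stand-ins for data you would pull from a node.

```python
import sqlite3

# Hypothetical, hand-made records standing in for data pulled from a node.
blocks = [
    {"number": 1, "timestamp": 1_700_000_000, "gas_used": 21_000},
    {"number": 2, "timestamp": 1_700_000_012, "gas_used": 63_000},
]
transactions = [
    {"hash": "0xa1", "block_number": 1, "sender": "0x01", "recipient": "0x02"},
    {"hash": "0xb2", "block_number": 2, "sender": "0x02", "recipient": "0x03"},
    {"hash": "0xc3", "block_number": 2, "sender": "0x01", "recipient": "0x03"},
]


def load_into_sqlite(conn, blocks, transactions):
    """Create two relational tables and load the extracted records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blocks ("
        "number INTEGER PRIMARY KEY, timestamp INTEGER, gas_used INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions ("
        "hash TEXT PRIMARY KEY, block_number INTEGER REFERENCES blocks(number), "
        "sender TEXT, recipient TEXT)"
    )
    # INSERT OR REPLACE keeps the load idempotent if a block is extracted twice.
    conn.executemany(
        "INSERT OR REPLACE INTO blocks VALUES (:number, :timestamp, :gas_used)",
        blocks,
    )
    conn.executemany(
        "INSERT OR REPLACE INTO transactions "
        "VALUES (:hash, :block_number, :sender, :recipient)",
        transactions,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path to persist the data
load_into_sqlite(conn, blocks, transactions)

# Once the data is off-chain, ordinary SQL answers exploratory questions.
tx_per_block = conn.execute(
    "SELECT block_number, COUNT(*) FROM transactions "
    "GROUP BY block_number ORDER BY block_number"
).fetchall()
print(tx_per_block)
```

The point of the sketch is the division of labor: extraction writes once to a local store, and later analysis runs against that store instead of querying the blockchain repeatedly.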
Your organization may implement its own blockchain network; if not, you’ll need to interact with Ethereum’s mainnet or some other public blockchain. Interacting with a public blockchain comes with some constraints and obstacles. First, to get to all of a blockchain’s data, you need to connect to a full node. Running a full blockchain node requires an investment in infrastructure. Specifically, you need to dedicate disk space to store the blockchain data, a device to run the blockchain client, and sufficient network access to initially download all the blockchain data and then to process new blocks. Interacting with one blockchain may be feasible, but as you add more public blockchains to your data universe, the infrastructure requirements may become untenable. One common way to avoid that growing infrastructure investment is to use someone else’s infrastructure, and one of the most popular services for Ethereum blockchain access is Infura. An Infura account provides API access over HTTPS and WebSockets to multiple Ethereum networks and to InterPlanetary File System (IPFS) resources as well. Using Infura can take one large obstacle (setting up your own Ethereum node) off the table and let you focus on building analytics models. The next image shows Infura’s architecture for accessing Ethereum and IPFS resources. Analyze very large blockchain datasets in Python with Vaex Regardless of where you get your data, there is likely to be lots of it. One common obstacle to operationalizing data analytics models is the size of datasets you need to analyze. Most model types increase accuracy with more data. But at some point, datasets become so large that they become difficult to manage. Even though your organization’s infrastructure may have lots of servers with lots of memory, you may not always be able to provision huge amounts of resources every time you need to run a model. 
To scale models to available hardware, many developers or analysts run models on partitions of their data or employ distributed processing. Partitioning your data can cut out important information, and distributing analytics can take a lot of work. However, another choice is available. Vaex is an open-source library that implements out-of-core dataframes, which allow you to write code that explores and visualizes datasets far bigger than your computer’s memory. With Vaex, shown below, you can run analytics models on datasets hundreds of gigabytes in size, even on a laptop computer! Examine blockchain data One of the most important early steps in any analytics project is to identify the data your models need. You must take inventory of the data available to you and then explore sources for other data that your models require. When working in blockchain environments, the most common tool used to examine available data is a blockchain explorer. Most blockchain explorers are web applications that provide an easy interface for accessing data stored in a blockchain. Many blockchain explorer options are available, and each blockchain implementation has its own options. Here, you discover three popular options for exploring data on Ethereum and Bitcoin blockchains. Explore Ethereum with Etherscan.io Etherscan.io is the most popular blockchain explorer for Ethereum networks. Using Etherscan.io, you can explore blockchain data from Ethereum’s mainnet or any of the most popular test Ethereum networks. You can look at blocks, transactions, event logs, or any data related to your selected network. Etherscan.io makes it easy to examine your blockchain data to identify the source data your models require. The following image shows the main Etherscan.io web page. Peruse multiple blockchains with Blockchain.com Some blockchain explorers support access to multiple blockchain networks. 
For example, Block Explorer from Blockchain.com provides visibility similar to that of Etherscan.io but for more blockchain network types. Block Explorer provides an interface to block data from the main nets of Bitcoin, Bitcoin Cash, and Ethereum, as well as the test nets for Bitcoin and Bitcoin Cash. This next image shows the main Block Explorer interface for the Bitcoin network. View cryptocurrency details with ColossusXT Some blockchain explorers, such as ColossusXT, focus on cryptocurrency transactions. Instead of providing generic block access, ColossusXT identifies blocks that contain specific cryptocurrency transactions. If your analytics queries focus on cryptocurrency transactions, ColossusXT may help you find the data you need. The image you see below shows the ColossusXT main interface for Bitcoin cryptocurrency transactions. Preserve privacy in blockchain analytics with MADANA A core concern for handling data, including in the context of analytics projects, is maintaining compliance with privacy regulations. Privacy is a growing concern with governing bodies. The old, naive perception that encryption enforces privacy has been shown to be false. Privacy isn’t about the data; privacy is about the individual. Data analytics queries often provide aggregate results that simplify classification or prediction. If your models enable the audience to associate an individual with the results, you’ve violated that individual’s privacy. To avoid publishing any data that might inadvertently leak granular data that could be used to identify an individual, you have two main options. The first option is to apply good privacy-preserving techniques to your models. You’ll have to learn about k-anonymity, l-diversity, t-closeness, and differential privacy. Or you can use a framework such as MADANA, which does it for you. MADANA provides a framework that helps you protect confidentiality and privacy. 
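To make the first of the privacy techniques named above more concrete, here is a minimal Python sketch of a k-anonymity check. It is not MADANA's method; it only illustrates the general idea that every combination of quasi-identifier values should be shared by at least k records. The field names and sample records are hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values.

    An empty dataset returns 0, meaning there is nothing to assess.
    """
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination appears in at least k records."""
    return k_anonymity(records, quasi_identifiers) >= k


# Hypothetical, already-generalized records (zip codes masked, ages banded).
records = [
    {"zip": "303**", "age_band": "30-39", "balance_eth": 1.2},
    {"zip": "303**", "age_band": "30-39", "balance_eth": 0.4},
    {"zip": "303**", "age_band": "40-49", "balance_eth": 3.1},
    {"zip": "303**", "age_band": "40-49", "balance_eth": 2.2},
    {"zip": "303**", "age_band": "40-49", "balance_eth": 0.9},
]

level = k_anonymity(records, ["zip", "age_band"])
print(f"Dataset is {level}-anonymous on (zip, age_band)")
```

A check like this only covers k-anonymity; l-diversity, t-closeness, and differential privacy address weaknesses it leaves open, such as every record in a group sharing the same sensitive value.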
If compliance is a concern for your organization, a framework like MADANA can help you stay compliant without having to design privacy-preserving models yourself. The image below shows the MADANA website, with some of its benefits. Want to learn more? Check out our Blockchain Data Analytics Cheat Sheet.