978-1-098-10592-

[LSI]

Data Fabric as Modern Data Architecture by Alice LaPlante

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (oreilly). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.

Acquisitions Editor: Jessica Haberman
Development Editor: Gary O’Brien
Production Editor: Kate Galloway
Copyeditor: Audrey Doyle
Proofreader: Christina Edwards
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

June 2021: First Edition

Revision History for the First Edition

2021-06-02: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Fabric as Modern Data Architecture, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O’Reilly and TIBCO. See our statement of editorial independence.

Introduction

1. Why Build a Data Fabric?
The Limits of Existing Data Architectures
What Success Looks Like

2. What Is a Data Fabric?
The Architectural Pattern of a Data Fabric
Building a Data Fabric Is a Journey

3. How to Get Started
Five Pieces of Advice for Getting Started on a Data Fabric
Best Practices When Managing and Growing Your Data Fabric
Conclusion: It’s Time to Act

1 “Data: The Strategic Asset,” Financial Executives Research Foundation, Inc., November 2019, oreil/uEflr.

Introduction

Your business faces major changes due to digital transformation. The only way to thrive during the complex transitions that are the inevitable part of your transformation journey is through data. By treating your data as the strategic asset that it is, you can successfully complete your journey in a way that differentiates you from the competition.

The good news is that most businesses today understand this: according to an Ernst & Young survey, more than 80% of organizations view data as a strategic asset. 1 But this doesn’t mean they’re acting in a way that allows them to get the most from their data: less than half (49%) have put a formal data strategy in place, and less than 30% of financial executives said they have fully weighed the costs of poor-quality data.

This is because becoming data driven and accomplishing the admittedly ambitious goal of fully democratizing your data is not particularly easy. Each of the “five Vs” of big data (volume, velocity, variety, veracity, and value) has its own challenges. But the one we’re going to focus on in this report is arguably one of the top challenges to getting the most out of your data: variety. And not just the variety of data structures, formats, and types, but the variety of data meanings (i.e., semantics).

Of the five Vs, variety references the different types of data that can exist. When data variety is high, the complexity of the data increases, which is the chief reason businesses are seeking data fabrics: they have X sources of data, and every source has hundreds of tables, each with dozens of columns. At the same time, with all these sources of data they must serve Y users or use cases, each requiring slightly different data.
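To make the scale of that "X sources, Y users" problem concrete, here is a rough back-of-the-envelope sketch. The numbers (20 sources, 300 tables per source, 40 columns per table, 50 use cases) are purely illustrative assumptions, not figures from any survey cited in this report, and the function names are hypothetical.

```python
def column_count(sources: int, tables_per_source: int, columns_per_table: int) -> int:
    """Total number of columns that exist across all sources."""
    return sources * tables_per_source * columns_per_table


def source_consumer_pairs(sources: int, consumers: int) -> int:
    """Upper bound on source-to-consumer combinations someone may have to serve."""
    return sources * consumers


# Illustrative assumptions only: X = 20 sources, Y = 50 users or use cases.
total_columns = column_count(sources=20, tables_per_source=300, columns_per_table=40)
pairs = source_consumer_pairs(sources=20, consumers=50)

print(total_columns)  # 240000
print(pairs)          # 1000
```

Even with modest inputs, the number of columns someone has to understand quickly reaches the hundreds of thousands, which is the kind of variety the rest of this report is concerned with.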

Whether data is structured or unstructured is only the beginning of the complexity facing businesses today. Most are familiar with these two categories (three if you add semistructured data) and have figured out ways to integrate them. But there are a number of other challenges specifically concerning the variety of data. Chief among them is performing analytics with mixed-modal data, since traditional analytics is designed to work with highly formatted data and doesn’t like inconsistent or noisy data. This makes it hard to integrate different types of data, which is why data lakes are notoriously difficult to manage. Finally, the quality of data that exhibits a lot of variety can be low.

A subset of data variety is data distribution. That is, we would argue that it’s not only the different types, but the number of sources that raise challenges, especially when considering how much data is being created and stored in the cloud.

Essentially, data is everywhere, and it is all different. This includes Internet of Things (IoT) data from a distribution warehouse, real-time SAP transactions, and Salesforce or other software-as-a-service (SaaS) datasets. All of these sources may involve customer data of some kind, but each has a different purpose and different data consumers.

All the silos in all the departments, each with its own set of tools and techniques, business rules, and definitions that must be orchestrated, also add to the complexity.

Questions arise. Where is the data? What kind of data is it? How can I get the data to the users who need it?

Centralization implies control, and some companies are still pursuing the goal of having only one, centralized source of data (we’ll explain why this is not necessarily such a good idea later in this report). Unsurprisingly (to us), only 6% of companies have achieved


4 “Big Data and AI Executive Survey 2019: Executive Summary of Findings,” NewVantage Partners LLC, accessed May 6, 2021, oreil/AnyH8.

66% report less operational efficiency as a result of broken pipelines.
59% report delayed decisions or lost opportunities because of broken pipelines.

Finally, data complexity can also be caused by data naming conventions. When businesses use technical data specifications such as table names and column names instead of the business terminology users are familiar with, miscommunications and inconsistencies invariably arise. If a certain kind of data is called different names by different systems (for example, if the definitions of order entry and receivables in Salesforce are different from those in SAP), you’ve got additional complexity to factor in.
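One common way to tame naming differences is a shared glossary that maps each system's technical column names onto a single business term. The sketch below illustrates the idea only; the system names (`crm`, `erp`), column names, and business terms are hypothetical placeholders and are not taken from Salesforce, SAP, or any product discussed in this report.

```python
from typing import Any

# Hypothetical glossary: (system, technical column name) -> business term.
GLOSSARY: dict[tuple[str, str], str] = {
    ("crm", "ord_amt"): "order_value",
    ("erp", "NETVAL"): "order_value",
    ("crm", "cust_id"): "customer_id",
    ("erp", "KUNDE"): "customer_id",
}


def to_business_terms(system: str, record: dict[str, Any]) -> dict[str, Any]:
    """Rename technical columns to business terms.

    Columns without a glossary entry keep their original name.
    """
    return {GLOSSARY.get((system, col), col): value for col, value in record.items()}


crm_row = to_business_terms("crm", {"ord_amt": 120.0, "cust_id": "C-7"})
erp_row = to_business_terms("erp", {"NETVAL": 120.0, "KUNDE": "C-7", "WAERK": "USD"})

print(crm_row)  # {'order_value': 120.0, 'customer_id': 'C-7'}
print(erp_row)  # {'order_value': 120.0, 'customer_id': 'C-7', 'WAERK': 'USD'}
```

Renaming only solves the labeling half of the problem. If two systems define "order value" differently (for instance, with or without tax), the glossary also has to record that difference, which a simple name map like this one does not do.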

A Changing World That Needs Data Democratization

In addition to these challenges with the data itself, we’re also challenged by a world that is in the middle of a major organizational transition. With the onset of the COVID-19 pandemic, office workers began working remotely, and many may continue to do so once social distancing restrictions are fully lifted. In addition, we’re seeing more mobile workers, and even people working nomadically with no fixed office.

Indeed, mobile users and so-called digital nomads are causing businesses to think in new ways about the user experience. Data analysts who sometimes work from home, sometimes on the road, and sometimes from a café need the same secure access to data that they need at the office. Easier, simpler tools are required.

But sometimes it can feel like we’re taking two steps forward and one step back. By 2019, almost half (48%) of businesses said they competed using their data, according to the 2019 NewVantage Partners survey on big data. 4 This showed progress; in NewVantage’s 2006 survey, only 5% of large organizations said this.


5 “NewVantage Partners Releases 2020 Big Data and AI Executive Survey,” Business Wire, Jan. 6, 2020, oreil/SPTy5.

In NewVantage’s 2020 report, however, the news was not particularly good. Although investment in data was up, showing that companies generally realize data’s importance, the pace of that investment was losing momentum. The percentage of companies investing more than $50 million in data was 65% in 2020, compared to just 40% in 2019. But only 52% of companies were increasing their rate of investment, compared to the 92% that were doing this in 2019. 5

Worse, only 38% reported that they had created a data-driven organization. Even fewer (only 27%) had built a data culture. This tells us that the all-important goal of data democratization is not being reached. And it’s not necessarily the technology that is holding firms back. Nine out of 10 companies point to people and process challenges as the biggest barriers to data democratization.

Opportunities Abound, with the Help of a Data Fabric

By enabling a distributed, mobile workforce and democratizing data, businesses today can do the following:

Increase operational efficiencies
Better calibrate the right pricing for their goods and services
Personalize sales and marketing initiatives
Improve the customer experience
Identify fraudulent transactions

...and much, much more.

Until fairly recently, data scientists and analysts squandered 80% of their time wrestling with data and spent just 20% exploring it. That used to be the rule. But IDC’s research director of data integration and data intelligence software, Stewart Bond, reported last year that this rule is starting to bend. IDC’s December 2019 data culture survey found that knowledge workers are spending closer to 30% of


1 Tanner Luxner, “Cloud Computing Trends: 2021 State of the Cloud Report,” Flexera, March 15, 2021, oreil/skemo.

CHAPTER 1

Why Build a Data Fabric?

Why do you need this thing called a data fabric? It’s not just because of the sheer size of your data. You also face access and integration challenges because of where the data is coming from, where it’s stored, and in what form. You’ve got data on premises. In the public cloud. In private clouds. You have data in multicloud and hybrid cloud ecosystems. Within these various silos, some of the data is structured but most is unstructured, which raises challenges. And don’t forget streaming data, which is an important part of the picture, too.

What’s the state of enterprise data, then? Fragmented. A full 93% of enterprises have a multicloud strategy, with 87% having a hybrid cloud environment in place, according to Flexera’s 2020 State of the Cloud survey. 1 On average, companies have data stored in two public and two private clouds, as well as in various on-premises data repositories (see Figure 1-1).

Businesses are pushing the limits of what they can do with existing data management tools.

2 Adam DeMattia, John McKnight, Jennifer Gahm, and Monya Keane, “Research Proves IT Transformation’s Persistent Link to Agility, Innovation, and Business Value,” The Enterprise Strategy Group, Inc., March 2018, oreil/sAZUW.

Figure 1-1. The fragmented state of enterprise data (Source: Flexera)

The reasons for this fragmentation are varied, and include the following:

Time-to-data-insight is a competitive differentiator
Today nearly every business transformation (whether aiming for greater customer intimacy, more optimized operations, or faster innovation) is fueled by data-driven insights. The days when business users would patiently wait weeks or even months for IT to deliver new datasets are gone. Not only are your users demanding rapid responses to their queries, but the competitive nature of today’s markets requires it. The dilemma is that queries on databases with billions of records can take hours to return. The need to change this is urgent, as companies with data intelligence shared in real time or near-real time are 18 times more likely to make better and faster decisions than their competitors. 2

Demand for self-service data continues to explode
Enabled by easier-to-use, more powerful analytics tools such as Power BI and Spotfire, business users are demanding more data, delivered more swiftly. Whether you consider this data democratization or data chaos, the trend is very real, and data users’ needs must be satisfied for your organization to maintain a competitive edge.


You need a new, flexible solution to cope with all of this—one that can achieve the following, arguably difficult-to-hit, objectives:

Simplify data democratization
Unify your data environment
Eliminate data silos
Centrally coordinate data flows
Scale easily, to keep up with increasing data volumes
Span all datatypes
Align IT with the business
Empower remote and mobile workers

The Limits of Existing Data Architectures

Current methods of managing data that attempt to meet all the objectives using data warehouses and data lakes frequently don’t succeed, because they never include all the data that is needed. But they still remain important components in a larger distributed data landscape.

Although data warehouses can solve your integration challenges for much of your data, they never actually integrate all the data. Additionally, they’re inflexible. You won’t get the agility you need to respond to your users’ requirements. Finally, applying AI technologies like machine learning (ML) is a more demanding task than most data warehouses can cope with, in terms of both the volume of data required and the complexity of the integrations.

Alternatively, data lakes can hold unstructured as well as structured data, but it can be difficult to actually find and integrate different datasets as a lake continues to grow. The more data that is placed into a data lake, the more difficult it is to manage it, much less squeeze value from the vast quantities. The popular term for this scenario is data swamp, and it’s something you definitely want to avoid. Although data lakes can be good options for inexpensively processing large and relatively simple datasets, they are constrained from effectively managing today’s complex, multifaceted data that businesses want to locate and analyze swiftly for immediate insights.


What Success Looks Like

If you manage to address all the challenges, your rewards will be substantial. Here’s a taste of what’s to come. With a data fabric you will get the opportunity to do the following:

Fuel your data-driven business
Support multiple, diverse users and use cases with a modern, distributed data architecture, shared data assets, and optimized data management and integration processes.

Accelerate value realization
Accelerate time to value by unlocking your distributed on-premises, cloud, and hybrid cloud data, no matter where it resides, and delivering it at the pace of your business.

Empower your people with timely, consistent, and trusted data
Democratize data access to arm business users with all the data required to make faster and more accurate business decisions. Empower remote and distributed workers as much as your traditional office workers.

Benefit from technology innovation sooner
Embrace new data and analytics technology advancements such as data science, real-time data, and the cloud faster to stay ahead of your competition.

Save time and money
Streamline data management and integration processes and pipelines via an optimized combination of intelligent, converged data management and integration capabilities that embed AI/ML and business self-service.

Govern and comply with confidence
Ensure proper data governance and control so that you can deliver the right data at the right time, securely, and in compliance with your ever-changing regulatory landscape.

To achieve all this you need a data fabric. We’re going to define a data fabric more precisely in Chapter 2, as there are various conflicting definitions for it. Although it is a relatively new term, the importance of what it does is not new. For years, enterprises have struggled to integrate all their data into a single, scalable platform. A data fabric describes a comprehensive way to achieve that goal.


CHAPTER 2

What Is a Data Fabric?

Let’s start with what a data fabric isn’t. It is not a single product or even a single platform. You can’t buy and deploy it overnight. It is an architecture. And a journey.

The good news is that you don’t have to rip and replace your existing technology. A data fabric encompasses the data ecosystem you have in place. Neither do you need to be beholden to a single vendor. You can choose best-of-breed solutions and (in theory at least) they should all work together within your data fabric.

To summarize what we discussed in Chapter 1, with a data fabric your users will get to spend more time analyzing their data than wrangling with it. And other consumers of data—think systems and applications—will get access to integrated data. It’s as simple as that. The data fabric is there to make it easier to find data in a way that’s trusted and gives access to anyone. This is the frame for our entire data fabric discussion: that a data fabric will drive the old 80/20 rule (now 70/30) to increasingly favorable proportions.

Some people call it data intelligence rather than data fabric, because it makes it easier for users and systems/applications to intelligently find, work with, and clean data, and apply AI models to it.

So what is a data fabric?

A data fabric is a modern, distributed data architecture that includes shared data assets and optimized data management and integration processes that you can use to address today’s data challenges in a unified way.

Despite what many vendors might claim, a data fabric is not a single product or specific platform that you can simply buy and insert into your existing data architecture. It includes architecture, shared data assets, and data management and integration technology.

A data fabric supports the following:

Data for all users and use cases
Provides timely, trusted, reusable data for a wide range of analytical, operational, and governance use cases, as well as business self-service users

Data from any and all sources
Accesses, combines, and transforms both in-motion and at-rest data from across a diverse, distributed data landscape using metadata, models, and pipelines

Data that spans any environment
Flexibly spans distributed on-premises, hybrid, and multicloud environments
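As a rough illustration of "accesses, combines, and transforms both in-motion and at-rest data ... using metadata," the toy sketch below registers two sources with a little descriptive metadata and merges their records into one view per key. Everything in it is a hypothetical stand-in: the `Source` class, the `combine` function, and the sample readers are assumptions made for illustration, not the API of any data fabric product, and a real fabric would add governance, security, and far more capable pipelines.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

Record = dict[str, object]


@dataclass
class Source:
    """Minimal metadata about one data source (hypothetical structure)."""
    name: str
    kind: str  # "at-rest" or "in-motion"
    read: Callable[[], Iterable[Record]]


def combine(sources: list[Source], key: str) -> dict[object, Record]:
    """Merge records from every source into a single view per key value."""
    view: dict[object, Record] = {}
    for source in sources:
        for record in source.read():
            # Later sources add to (or overwrite) fields from earlier ones.
            view.setdefault(record[key], {}).update(record)
    return view


# Hypothetical stand-ins for a warehouse table and an event stream.
def warehouse_rows() -> Iterable[Record]:
    return [{"customer_id": "C-7", "segment": "retail"}]


def order_events() -> Iterable[Record]:
    yield {"customer_id": "C-7", "last_order_value": 120.0}


sources = [
    Source("warehouse", "at-rest", warehouse_rows),
    Source("order_stream", "in-motion", order_events),
]
view = combine(sources, key="customer_id")
print(view)
# {'C-7': {'customer_id': 'C-7', 'segment': 'retail', 'last_order_value': 120.0}}
```

The design point is that consumers ask for data by a shared key and never need to know which environment a record came from; the registry of sources carries that knowledge instead.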

In short, a data fabric’s job is to connect any kind of data to anywhere and anyone (or anything). That’s admittedly a tall order, as IT systems are getting more complex as users demand simplicity for easier, faster decision-making. A data fabric addresses both needs.

Let’s be very clear that many of the components that make up a data fabric are not new. They’re constantly evolving, true, especially when the cloud is involved. But it’s the combination of them that creates this new thing, this data fabric.

Here are some of the components of a typical data fabric:

Data catalog
Allows you to categorize, access, and collaborate around company data across multiple data sources, while enforcing strong governance and access management.

Master data management
Involves creating a single master record for all business data from across both internal and external data sources.

Metadata management
How you manage the data that describes other data (the metadata). It involves establishing policies and processes that ensure
