Building a Modern Data Architecture for the Data Driven Enterprise

Published
December 10, 2022
Author
Mike Ferguson, Big Data LDN
Summary
Solution architecture will limit data architecture to only what is needed for a specific solution

Big Data LDN 2019

[Autogenerated] Okay, so many thanks for taking time at the show to come and listen to this session. I'm Mike Ferguson, your conference chairman of Big Data LDN, but my session today is about modern data architecture: how we can build it, given all the technology we have out here, and how it all comes together in an end-to-end architecture that delivers value in your organization.

What I want to talk about is what data architecture is, the key elements of it, example types of it, and the ways in which we're seeing it emerge within organizations today; then the kind of hybrid computing environment we're trying to implement this on these days and what's happening there; the emergence of different kinds of database, which is a key part of data architecture, and how we're using them; and then how they all fit together in an end-to-end offering and how data should flow, if you like, to try and shorten the time to value and give you a more unified approach towards data delivery.

If you just look at Wikipedia's definition of data architecture, you can see it covers the models, policies, rules, and standards that govern the data you collect: how you store it, arrange it, integrate it, and put it to use across your organization. It's a pretty broad definition, covering things like methodologies, database technologies, the processes or pipelines with which you process that data, and any administrative efforts to manage it all across the environment. Solution architecture, on the other hand, restricts data architecture to a specific solution. Many of you may be solution architects, particularly focused on a particular business problem and looking at data architecture in that context. We've had plenty of examples of that over many, many years, things like the classic data warehouse architecture, which hasn't changed in a very, very long time.

Another example might be a streaming architecture, where you've got streaming data coming in on something like Kafka or Kinesis or Google Pub/Sub. You've got to filter something out of that, maybe combine it with other data at rest, which could be master data of some kind, and then feed that into a trained model to drive scores and predictions, or maybe even automate decisioning, driving alerts or actions, and then maybe filter some of that data off. You're doing it in real time: you're not even storing it, you're acting on it before you store it. And you could be using something like Spark or Flink for that kind of approach. But it's still a very specific architecture for a particular problem.
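To make that streaming pattern concrete, here is a minimal sketch in PySpark Structured Streaming. It is not anything shown in the talk: the Kafka topic, the paths, the two-field schema, and the saved model are all illustrative assumptions. It filters a live stream, joins it with master data at rest, scores it with a previously trained model, and raises alerts before anything is stored.

```python
# Sketch: filter a live stream, enrich it with master data at rest,
# score it with a trained model, and raise alerts before anything is stored.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("streaming-scoring").getOrCreate()

# Master data at rest (e.g. customers) -- a static DataFrame.
customers = spark.read.parquet("/data/master/customers")   # illustrative path

# Live events arriving on Kafka (broker and topic are assumptions).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "payments")
          .load()
          .selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", "customer_id STRING, amount DOUBLE").alias("e"))
          .select("e.*"))

# Filter the stream, then combine it with data at rest (stream-static join).
enriched = (events.where(F.col("amount") > 0)
            .join(customers, "customer_id"))

# Feed the enriched stream into a previously trained model to drive scores.
model = PipelineModel.load("/models/fraud")                 # illustrative path
scored = model.transform(enriched)

# Act on predictions in flight; only the interesting records go anywhere.
alerts = scored.where(F.col("prediction") == 1.0)
query = (alerts.writeStream
         .format("console")          # stand-in for a real alert/action sink
         .outputMode("append")
         .start())
query.awaitTermination()
```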
So that's one example, but I think most of us are building these silos today, and for me that means we're not running particularly efficiently. If you look at the green boxes across the bottom, I look at those and think: how many tools and scripts and how much code, in whatever language, is out there across your organization where people are trying to clean and integrate data? I'm sure it's more than one; it's going to be multiple. And the problem with that is that sharing metadata across those tools is nearly impossible, so the chance that whatever people have invented in any one of those silos is being shared with what other people are doing in any other silo is almost zero, because they have no way of knowing what has been created. One of my clients said, "Everybody's blindly integrating data with no attempt to share what they create." I think that's a pretty interesting statement, but obviously the danger is that we end up with silos and we don't do as efficient a job as we could.

And we're trying to put a data architecture on top of a modern operating environment these days, and that environment is turning out not to be centralized anymore. We're well beyond the data centre: we're in multiple clouds, we're at the edge. You might only be in one cloud, but how many of you are now looking at multiple clouds across your organization today? Okay, quite a lot of you. So we have this challenge of managing data and putting an architecture on top of all of this. And then the question really is: how are you going to capture data, store it, integrate it, and analyze it at the edge, in multiple clouds, and in your data centres, or across all of it? Do you have to keep building these silos for each of these operating environments, or could we come up with a common approach that pulls this stuff together, so that we can manage it across the whole setup?

Obviously, when we look at that, we have to think about types of databases. We've seen an explosion in databases, and other kinds of data store, over the last five, six, seven years. If I look at the spectrum, your general-purpose databases sit in the middle: your traditional relational databases, which have done a great job as a kind of universal category. But the further you go to either end of the spectrum, whether towards extreme transaction processing at one end or extreme analytical processing at the other, the more you move into specialist databases. What we're seeing at both ends of that spectrum is NoSQL and relational systems. There's NewSQL in the area of extreme transaction processing, like VoltDB or NuoDB, and NoSQL databases have been around for years, all the way back to IMS, which is still out there, believe it or not, through to the MongoDBs and HBases and Cassandras and Couchbases and whatnot. On the analytical side, you see analytical relational databases, which are widely used in our organizations, and several of those vendors are here this week, or other NoSQL analytical systems like graph databases, which typically don't understand the SQL language. You've also got Hadoop systems, or even Spark being used as an in-memory, massively parallel capability for analytics and machine learning. And if I look at NoSQL databases, there are different categories, and for me at least the main category for analytics is graph. The others are more optimized for operational processing: maybe you can do some operational analysis on live data, but generally speaking I think graph databases are definitely analytical databases, perhaps for fraud or something like cyber crime, which are real sweet spots for them. Hadoop, too, is an analytical system. It's got a file system.
It's got multiple execution engines, probably dominated these days by Spark. But of course Spark doesn't just run on Hadoop; it can run without Hadoop, it doesn't need it. We've got other engines on there as well. MapReduce is pretty obsolete these days, but it's still out there, running on the Cloudera platform now as well. So we can run batch processing over there, and we can run a streaming capability there too, with Spark. Spark has come to dominate: you build Python or Java or Scala or R applications, and you submit your application to a driver, which brings your data, retrieved either from cloud storage or maybe a file system, up into memory and does massively parallel compute on it. And obviously I can implement a combination of streaming and batch processing with Spark, and so implement the so-called Lambda architecture in that kind of environment.

Streaming is increasingly popular, whether you're using something like Spark Streaming to analyze and process live data, maybe preparing and analyzing it while it's streaming, before you've landed it anywhere, all of that happening in a Spark environment, and then, if you want, taking actions on the back of it or spilling it off to disk for offline analysis at some later stage. Of course that's not the only streaming engine; there are lots of them out there on the floor, and probably the other very popular one, from an open-source perspective, is Flink.

But I think when you look at what's happened with Spark and Hadoop, what we've seen in the last year has been a seismic shift from "should you do this on Hadoop?" to "should I just land my data on cloud storage and use Spark as a service?", given that all the cloud vendors have put HDFS APIs on top of cloud storage to make it smell like, taste like, look like HDFS, even though it isn't. You can effectively just run Spark as a service; a good example would be Microsoft Azure Databricks running on top of Azure Data Lake Storage. But we've also got Hadoop itself as a service in the cloud as well, from several vendors.
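As a rough illustration of those two points — one engine covering both layers of a Lambda-style architecture, and cloud storage reached through an HDFS-compatible API — here is a hedged PySpark sketch. The abfss:// path, broker, topic, and schema are illustrative assumptions, not anything from the talk.

```python
# One transformation, reused for the batch layer and the speed layer.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

def summarise(df: DataFrame) -> DataFrame:
    # Shared business logic: totals per customer.
    return df.groupBy("customer_id").agg(F.sum("amount").alias("total"))

# Batch layer: cloud storage exposed through an HDFS-compatible API
# (an illustrative Azure path; S3 or GCS paths work the same way).
history = spark.read.parquet("abfss://sales@account.dfs.core.windows.net/orders")
summarise(history).write.mode("overwrite").parquet("/serving/batch_totals")

# Speed layer: the same logic over a live stream of new orders.
live = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "orders")
        .load()
        .selectExpr("CAST(value AS STRING) AS json")
        .select(F.from_json("json", "customer_id STRING, amount DOUBLE").alias("o"))
        .select("o.*"))

query = (summarise(live).writeStream
         .outputMode("complete")   # streaming aggregations need complete/update mode
         .format("console")
         .start())
query.awaitTermination()
```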
Then we've got relational databases, specifically analytical relational databases. These are databases that were relational from the ground up, but they execute SQL queries in parallel. It's not single-threaded SQL query processing: a query comes in, it's optimized, broken up into different parts, and run on different nodes. That's massively parallel SQL query processing, probably pioneered by Teradata as much as nearly 40 years ago now, and these days implemented by a lot of database vendors out there, some of them doing it in memory, some of them doing it on GPUs, like Kinetica and BlazingDB, the GPU-based databases, and some more broadly out there, including the new Azure Synapse Analytics, which Rohan Kumar announced right on this stage yesterday. With massively parallel query processing you can get very, very large relational databases, into the hundreds of terabytes and in some cases beyond. So for me it's a complete myth that relational databases don't scale; it's about how you use them and what for. For data warehousing this is an absolute sweet spot, where massively parallel databases have for years been very, very effective.

And now we've got vendors releasing databases at the edge, because it's just not going to be practical to bring all the data from all of those devices out there all the way to the cloud, or all the way to a data centre, before you can process and analyze it. You have to do it sooner than that. So if we can't bring the data to where we want to process and analyze it, we've got to go the other way: we're seeing databases going out there, we're seeing analytics going out there, we're even seeing pipelines to process and integrate data being pushed all the way out into that landscape beyond the cloud. And the question is: do we have to have multiple different tools, some for the edge, others for the cloud, others for the data centre, or can we move to a common approach, so it doesn't really matter where you want to deploy this stuff? I think that's what's going on at the moment. We've ended up with a number of pieces, and big enterprises are saying: look, buying all these bits and stitching them together is challenging; is there a way we could move towards a common approach and organize teams to handle that?

What we are definitely seeing is all of those different workloads now running in the cloud. But we're also seeing vendors looking at managing it all from a single console, whether it's in one cloud or multiple clouds. Again, a good example would be what Cloudera showed this morning here on the stage in the first talk: the emergence of CDP, which is not Hadoop; it's a facility to spin up multiple different types of Kubernetes clusters for different use cases, all off a common place, potentially accessing data on cloud storage. At the moment it's only on AWS, and I think it's only a matter of time before it rolls out on Azure and others.

So what we're seeing is a whole range of different data stores here. There's stuff at the edge, and there are NoSQL databases coming in, particularly being deployed in customer-facing applications to speed up the ability to read and write and give good response times in the front office. But there's a side effect of that: data quality has deteriorated, I think, a lot because of NoSQL databases, because you no longer have a relational database to check the integrity of the data; it's up to the programmer whether he or she checks the integrity of the data when they write it into the database. Nevertheless, we're seeing major adoption and growth in NoSQL databases for operational applications, particularly as they start to introduce ACID properties to guarantee transaction processing.
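On that last point — that with NoSQL the integrity checks a relational database used to enforce now fall to the programmer — here is a minimal sketch of what that application-side validation can look like, assuming a MongoDB collection; the field names and rules are illustrative.

```python
# With no schema enforced by the database, the writer must validate the record.
from pymongo import MongoClient

REQUIRED = {"order_id", "customer_id", "amount"}

def validate(order: dict) -> None:
    missing = REQUIRED - order.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if order["amount"] <= 0:
        raise ValueError("amount must be positive")

client = MongoClient("mongodb://localhost:27017")   # illustrative connection
orders = client["shop"]["orders"]

order = {"order_id": "o-1001", "customer_id": "c-42", "amount": 19.99}
validate(order)              # the check the RDBMS would have done for us
orders.insert_one(order)
```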
We've seen graph databases, we've seen Hadoop systems, and we're now seeing edge databases. So the question is: how does it all come together? I think the first thing is going to be: can you somehow build a setup for your business where data and analytics sit in the middle and we can wire them into everything? Because if you listened to, say, Rob Thomas's session this morning, which I thought was very interesting, he talked about AI, and today AI is a complete game changer for all of us. And then the question becomes not just about building models but about using them. We're not all going to use them just in the cloud, or just in a data centre, or just at the edge. I want to build models that are going to be deployed at the edge; I also want to build models that are going to run in the data centre; and I'm going to be running models in the cloud. I don't need all of it in one place. And the question is: for what purpose are we building these models? If I'm trying to solve a business problem like reducing fraud, I can build real-time streaming analytical models to stop a live transaction in flight, before it commits, by detecting it as a fraudulent transaction. That's real-time analytics. But I can also go into a graph database, load data in there, and try to find a fraud ring. It's still a fraud problem; it's just a different platform on which we're solving some of those fraud issues. So it's a collection of things that we're building across this operating environment, deployed where they need to be in order to solve big business problems, and it doesn't all have to happen on one physical platform. Data architecture really matters if we're going to deliver trusted data and trusted analytics across that operating environment.

So how do you go about doing that? Let's start with the traditional data warehouse architecture, something that's pretty familiar to most of us. This has been, for a very long time, the kind of place where analytical relational databases have lived: massively parallel databases for staging areas and for the data warehouse itself. For your data marts you may get OLAP databases, that is, cube-oriented databases, something like Microsoft SQL Server Analysis Services or the Oracle Hyperion stuff as examples, or even virtual cubes like AtScale. But the point is that it's very often the case that I load data into staging tables in a data warehouse and then run ETL jobs that transform data between the staging tables and production. So in this case the staging and production tables are both in a relational database; that's very, very common in data warehousing today. It's also very common to see analytical, massively parallel relational databases underpinning data marts.

Master data is hardly talked about, yet it's probably the most widely used data in any business: customers, products, suppliers, materials, assets. Everybody needs master data, because we need it in analytical systems and we need it in operational systems, and MDM is becoming important. We've put that primarily on SQL databases, not necessarily massively parallel ones, just ordinary relational databases. Similarly, we want to synchronize the data coming out of there with both the analytical systems, feeding it into the data warehouse, and back into the operational systems to make them consistent as well.

NoSQL databases we've talked about: there are operational NoSQL databases and analytical NoSQL databases, and I said the analytical category is graph and the operational category is the other types. So here you see those other types as a data source to our analytical environment: we may be capturing data in that operational world, and now we're going to take that data and move it into our analytical world. And then I can add graph databases in here and start doing things like bringing in data and using a graph database as a data mart. Or I could introduce graph to accelerate master data management and understand all the relationships across those key entities. And I could also build up a graph database that's just pulling data straight in, to do some kind of analysis totally independently of any data warehouse, such as a fraud use case or a cyber-crime use case. In that case you may need massively parallel graph databases, several of which are out there on the floor; I noticed vendors like TigerGraph, for example.
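To make the graph side of that fraud example concrete, here is a small sketch using networkx — my choice of library, not one named in the talk. Accounts that share an identifying attribute, such as a phone number or address, become connected, and an unusually large connected component is a candidate fraud ring; the data and threshold are illustrative.

```python
# Sketch: find candidate fraud rings as large connected components of
# an account-to-attribute graph (shared phones/addresses link accounts).
import networkx as nx

links = [  # (account, shared attribute) pairs -- illustrative data
    ("acct1", "phone:555-0100"), ("acct2", "phone:555-0100"),
    ("acct2", "addr:12 High St"), ("acct3", "addr:12 High St"),
    ("acct4", "phone:555-0199"),
]

g = nx.Graph()
g.add_edges_from(links)

for component in nx.connected_components(g):
    accounts = {n for n in component if n.startswith("acct")}
    if len(accounts) >= 3:                 # ring-size threshold is an assumption
        print("possible fraud ring:", sorted(accounts))
```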
And then I've got to put some kind of data lake in place. Now, there seems to be a myth that a data lake has to be Hadoop. Absolutely not: it could be cloud storage, to give two examples of a centralized lake, but I have customers with logical data lakes, where they haven't got one data store. They've got multiple data stores dedicated to ingestion, multiple data stores dedicated to curation, and then multiple data stores dedicated to the trusted data they're producing. So it's a logical data lake, consisting of multiple data stores; it doesn't have to be a centralized one. You can't ask a company that operates in 100 countries to move all its data to one data store; it's just not practical, so we've got to find a way around that. But zoning within a data lake, whether it's cloud storage, Hadoop, or a logical data lake, means organizing it so you can understand where the data is, with a catalog to know where the data sits within it. Then we can put data science on top of that, so data scientists can access trusted data, if we can manage it, to speed them up, using Jupyter notebooks or RStudio or whatever you want on top of something like Spark at scale.

Then what about streaming? Well, we can do streaming analytics at the edge, or I can push streaming data into Spark, for example, and do it there. Now I'm working on in-flight data: I'm not landing it anywhere; I'm analyzing the data and taking action on it long before it's stored. Whether you do that with Spark or with Flink, the point I'm trying to make is: how does streaming data fit into the data architecture? You can't stuff streaming data into a staging area and go through a batch process of getting it out of there and putting it somewhere else, getting it out of there and putting it somewhere else, and then finally giving it to a user, because it takes too long; there are too many hops. I have to act on live data in either near real time or true real time; I can't stick it in a staging area and use a classic data warehouse architecture. But what I might want to do is analyze it, filter out what's interesting, and, having already taken action, filter that data into my data lake and then merge it with other stuff and use it offline, either feeding my data warehouse or in further data science projects, building models and so on.
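A minimal sketch of that last pattern, again with an illustrative topic, schema, filter rule, and lake paths: keep only the interesting slice of the stream and land it in the lake for offline use later.

```python
# Sketch: act on the stream elsewhere, but persist only the interesting
# slice of it into the data lake for offline merging and model building.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-to-lake").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")                   # illustrative topic
          .load()
          .selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", "device STRING, reading DOUBLE").alias("e"))
          .select("e.*"))

interesting = events.where(F.col("reading") > 100.0)       # filter rule is an assumption

# Land the filtered stream in the lake; checkpointing makes it restartable.
(interesting.writeStream
 .format("parquet")
 .option("path", "/lake/ingestion/hot_readings")
 .option("checkpointLocation", "/lake/_checkpoints/hot_readings")
 .outputMode("append")
 .start()
 .awaitTermination())
```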
And then I've got too many data marts. I've got lots of customers with too many data marts: it's copies of data, and it slows you down. More and more copies all over the place means more and more things to change if something changes in the warehouse. So the question is: how do we get agility into the data warehouse? Yes, we can use massively parallel SQL databases, but I'm also seeing new data modeling techniques, like Data Vault instead of, say, the classic Inmon kind of approach, giving us the ability to change the data warehouse quickly. But how do you get more agility into your data marts? We still want star schemas there, we've still got Kimball there, but could I not have virtual data marts? Use data virtualization and get rid of the physical data stores. You can now get vendors, like TIBCO on the floor here, offering data virtualization with a massively parallel in-memory database inside, with Spark inside it. That's out there today, and when we have that kind of capability we have the potential to simplify the architecture by getting rid of physical data stores.
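A full data virtualization product is well beyond a snippet, but the shape of the idea — a virtual data mart defined as a view over the underlying tables rather than copied into another physical mart — can be sketched in Spark SQL; the table and column names are illustrative assumptions.

```python
# A "virtual data mart": a view over warehouse tables, no physical copy.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("virtual-mart").getOrCreate()

spark.read.parquet("/warehouse/fact_sales").createOrReplaceTempView("fact_sales")
spark.read.parquet("/warehouse/dim_product").createOrReplaceTempView("dim_product")

spark.sql("""
    CREATE OR REPLACE TEMP VIEW sales_by_product AS
    SELECT p.product_name, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.product_name
""")

# Consumers query the view as if it were a mart; nothing was materialized.
spark.sql("SELECT * FROM sales_by_product ORDER BY revenue DESC").show()
```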
So what you're now seeing is the beginning of using these different kinds of data store in an end-to-end architecture. Whether or not they're in the cloud should make no difference; whether they're physically in the data centre should make no difference. We can link these things together to make them work for the overall enterprise and shorten the time to value.

But even if we do put those data stores in place, how do you get the data flowing? It's back to silos again. We can't continue like this, because the cost of data integration is too high if everybody's using their own tools in every silo. We can't share, because you can't just assume that if I buy this tool and you buy that tool we can share anything; there's no standard for metadata. The only metadata standard that's emerging is an open-source one, and it's visible here at the show: on the IBM booth there's a bunch of guys with "ODPi Egeria" on their backs. That's an emerging open-source metadata standard, and it's a very interesting one. It's very young, only maybe a year old, and it's gathering momentum fast. You should go and talk to those guys; what's happening there is very, very interesting. Maybe we might finally get a metadata standard in our industry, so we can share things and understand what exists across multiple tools. I only see one option for that right now, and that's the Egeria project.

But for me the question is: how do you rationalize this thing? If we can use all of these data stores in an architecture, what else do we need to do? We need a common approach to preparing and integrating data. It shouldn't matter whether it's a business user who needs to do it, a business analyst, a data scientist, or an IT professional: couldn't you have a role-based common platform to do that? That's what we call a data fabric, going across multiple kinds of data store in multiple clouds. And we want a catalog, to know what's out there; again, there have been lots of catalog vendors out on the floor over the last three days. This fabric has to go across multiple clouds, it has to go across the edge, and it has to go across data centres. We need to discover what data is out there in those data stores, and we need multiple people using the different user interfaces of a common fabric to prepare and integrate data for whatever environment it needs to happen in. And the explosion of data fabric vendors in the marketplace is happening at a pace: this year alone we've seen major new vendors enter the game, vendors like Amazon and vendors like Google.

We need to collect metadata from the multiple underlying systems and populate the catalog so we know what data is out there, and we need to organize our teams on top of this common environment to deliver value: build ready-made products, ready-made data assets, and publish them in your catalog, so that consumers can accelerate the last mile and drive new value. Shop for data ready-made. If you can get 60 or 70 or 80 per cent of what you need ready-made, and it's trusted, it's got lineage, and it's in a catalog, how much time does that save you? I can get a curated process going with a data lake, deliver trusted assets, give you virtual views of them, and invite people to come in, consume that data, and go and drive value with it. That's what I call accelerating the last mile: people can pick up ready-made assets and drive value on a trusted base. And if they build new assets with those things, they publish them back into the catalog, as long as everything goes through a governance process first, so that everything we publish in this catalog is trusted, is documented, has got lineage, et cetera. That way we build up, incrementally, a completely new set of valuable assets, so the next people who come along find not just 60 per cent of what they need but 70 or 80 per cent, and they get faster and faster at delivering. That, for me, is really where we've got to get to.

We're dealing with lots of technology and lots of data stores, but we've got to find a way to put all of these different kinds of data store together, make them work for us, and still deliver value. The expectation, as the Americans would say, is that we can turn on a dime, and everybody's going on about agile. We're not going to get agile unless we can figure out how to deploy these kinds of data store, put them together, and create a common approach to preparing and integrating data that's shareable, understandable, and trusted, and then accelerate the last mile by giving people ready-made assets, rather than handing everyone self-service tools and raw data and saying, "You figure it out." For me, that's not going to work. When I go to lunch I'd like a cooked chicken dish and somebody to help me, not raw chicken and some ingredients and "here you are, you figure it out." I want ready-made, and I want to speed up my ability to deliver the value.

So I hope that's given you an idea of what you might be able to do with all the technologies on the floor, stringing them together across a landscape that involves the edge, multiple clouds, and your data centre. I'd just like to thank you again for your time. Thanks again. Cheers.