Modern Enterprise Data Engineering

Published
December 9, 2022
Author
Suki Dhupar, Big Data LDN
summary
Big Data LDN 2019 | Modern Enterprise Data Engineering

Modern Enterprise Data Engineering

We envision a world where enterprise data customers have ready access to high-quality, cross-silo, unified enterprise data for all of their core logical entities. Data Operations (DataOps) is a methodology consisting of people, processes, tools, and services that lets enterprises rapidly, repeatedly, and reliably deliver production-ready data from the vast array of enterprise data sources. Learn how and why implementing these key ingredients can help a business achieve the analytic velocity necessary to create a competitive advantage.
 

Big Data LDN 2019

 
[Autogenerated] Thank you for turning up to this talk. I know it's the last one, so people probably want to go home; I promise, or try to promise, to make it a little bit interesting at least. So what we're talking about today is data engineering: modern enterprise data engineering. I just want to ask a question first of all: how many people here have heard of DataOps? Hands up. It's amazing, because I asked the same question on this stage this time last year, and three people put their hands up. So at least people know what I'm going to be talking about this time around, right?
So, getting into data engineering: this has come out of the whole idea of DataOps. We started off talking about it as a process, and where it originated from. If you think about software engineering and how products have been built, the process is now so much more slick and automated, whether we look at the apps on our phones these days or even the enterprise products that are coming out and being used by large organizations. It's quick, it's fast, and a lot of automation has been put in. Originally there was a set of companies, and Tamr was one of the first, that looked at this. When we looked at DataOps in the first place, we asked: how can we simplify and get data out to end users as quickly as possible?
A lot of it was focused on process. What we forgot about were the people involved in trying to bring the data out. Traditionally, when we talk about the people, we've got this idea that they are either DBA-type people or data scientists. Actually, there's a myriad of people in between who are involved in the process of getting good-quality data out, data that can be used and from which value can be determined.
So one of the things that we talk about quite a lot, and as you will see, the message out there is all about AI and machine learning, is that a lot of companies and enterprises are tempted enormously to put the AI or ML cart before the horse. What we mean by that is, generally: let's go and get some tools that have AI and ML built in, and forget about what actually powers all of that, which is the data itself. That's been one of the key things that we've seen in organizations: people are ready to invest in tools and technology and in people, but forget about the data itself. Data engineering is foundational. The terminology used here, everybody's heard of it: garbage in, garbage out.
When we divide it down into what is done today, the people asking questions (analysts, third-party consultants) tend to just want to know the answers to the questions being asked. How many customers do you have? A simple question, very tough to answer, because if you ask finance how many customers the organization has got, they will give you one answer. If you ask sales how many customers they've got, they'll give you a second answer. It depends on what a customer is.
When we look at data scientists these days, the reality is that value is not being extracted from the skill set they've got, purely because they're having to do a lot of the janitorial work, which is cleaning up data, as opposed to what they really want to do, which is building models and looking at the statistics, the visualization and the analytics around the data itself. As for data engineers, generally what we've done is try to sub in people who have got their jobs already, which is DBAs or people looking after systems. Again, they're into the IT part of the technology.
They may not know the data as well, so those are generally the challenges that we are looking at. What's the difference between data science and data engineering? There's a big overlap when it comes to skills and responsibilities, but there's a big difference in terms of what value they bring and what their roles entail. When you look at data scientists, the key metrics they're measured on tend to be around reporting, visualization, statistical modeling and machine learning, not around data movement, data cleansing, or the performance or optimization of databases. That's really a data engineering role. And it's important, when we look at organizations, that we look at it as a team. There are many members of a team; get them together and you get a fast, effective way to operate and to bring out the data apps that are important to the organization.
So who's doing what? We've got the data suppliers, and generally they tend to be the hardest people to get hold of: people that own the source systems such as CRMs, DBAs, or the IT part of the business. You have the data consumers: the data scientists and the people who are using the data on a day-to-day basis, creating reports or visualizations, or reporting up to senior levels, or the senior level themselves. The bit that we're missing at the moment, or that we're trying to address, is the data engineers: the curators, the stewards, the people that understand the movement of data from system to end user.
When you look at companies like Netflix, they've done this really well. I mean, they're a newer company, born with data in their hands, but what they provide is portals, which the engineers have built, for end users to be able to do analytics very, very quickly. And even though people think of Netflix as a new company, originally they started off selling DVDs, right?
Or renting DVDs. So they started off in a traditional world. But what they did was bring together data very, very well and set up things like DataOps, and a DataOps pipeline for people to be able to use. Hence they've been successful very quickly.
Some of the customers that we've worked with: what we know about all of them is that, regardless of what industry they're from, they have got the same data challenges, and they are actually looking at using the same mechanisms, even though the use cases are very different, to be able to bring together data to use. As you can see, some of these companies are very, very traditional companies that have been around for over 100 years, and some of them are new, but still trying to modernize the data enterprise network that they've got.
Why is it so difficult? Well, many of you have seen variations of this slide, and it has many guises. It's the way the data is connected together. We tried to address this a while back by moving from databases to the data lake; many of you will have heard the term that the data lake is now a data swamp. But the challenges you've got are things like politics, mergers and acquisitions, data hoarding, restructuring, or just the legacy in place: the legacy burden from systems, which causes some of the challenges that organizations have today in bringing their data together. The consequences of this are the amount of time spent on data prep versus data analytics, the high rates of failure when it comes to business intelligence and analytics projects, and people just giving up after a period. Some of that is because people just don't want everybody to have access to data.
Another company that does this really well is Airbnb. As you can see, it's well presented: you've got the metadata on one side, which people will recognize from Wikipedia, and then the key attributes of the data on the left-hand side. So it's really easy to pinpoint what you want from the data very, very quickly.
Ultimately, every organization can do this, because we've got the data available for us to be able to do it, and the big challenge that we have is behavior, right? It's humans. It's being afraid to share data, the hoarding of data. The other challenges, which are not just human, are around data complexity and privacy, and limiting the data to small numbers of people actually using it.
Yeah, this is an eyesore, sorry for this, but this is what you're dealing with today, and actually it's an example of what you see out there. Everybody has the same message: everybody's doing AI, everybody's doing machine learning. These are the technologies that you're being presented with today, and you have to walk through each and every one to work out what's right for you. A DataOps pipeline consists of a number of different things. A lot of people, and we're part of this, will say we can do all of it, or we can do some of it. The important thing is being able to navigate through all of this noise to pick the right ones, which will make up a great data pipeline and give you in your organization what companies like Netflix, Airbnb and Wikipedia have done already.
So how can you do this? [inaudible] Everybody has the source data assets, whether it's Excel sheets, CSVs, tabular systems, a CRM system, whatever you want. We're also starting to engage with external data a lot more, whether it's from places like Dun & Bradstreet and Thomson Reuters, or scraped data. So you're getting lots and lots of information in different places. But the key components of data engineering and DataOps are around governance, mastering and data quality, movement, catalogs, being able to publish that data, storage and compute, and a feedback loop.
So, having a human involved who's updating as that data is changing on a regular basis. Other methods that are used in data engineering: it's really important that we look at the traditional methods that have been used, which, in my opinion anyway, have failed in particular ways. When we look at standardization, that doesn't work all the time. Trying to get this perfect model, well, we could be here for years. We've been trying to do this for the last 20 to 25 years, and it doesn't work. Let's work in small pieces that we can deliver, and not have this one schema to rule them all, right? Aggregation: when we're doing aggregation, it just creates more silos, more sets of data, and we forget about the granularity. Then there's data federation, which creates performance challenges. MDM, which we talk about, is extremely important, but let's look at modern ways of doing MDM, not the very traditional approach of getting everything together and boiling the ocean. Then there's rationalizing systems, single vendor versus many: the slide I showed before has lots of vendors, and what you'll find is a stream of people who say, we do all of this. Yes, people can do all of this, but they may not be able to do it right in your particular way. And just throwing bodies at it with traditional rules is time-consuming and hasn't worked.
So some of the key processes and tools that we talk about, and that we think are really important, start with putting cloud first. Everybody knows that even the most traditional banks and organizations in industry today are moving to the cloud. It's happening; it's already happened. And when we look at the big vendors like Microsoft, Google and AWS, they're getting a fair share of those organizations moving there.
So cloud is really, really important, as are a number of different areas here. We should be able to build creative apps using these kinds of processes in the DataOps world, very quickly, to get the word out there that we can do this, and do it really well. As I said earlier on when I asked the question, DataOps is a thing. Well, Wikipedia has got an entry for it, for what it's worth. But it is becoming a big movement, and the data engineer is an important part of that movement.
So what value do you expect from a data engineer? We look at chief data officers in the same ilk as a chief financial officer, except that a chief financial officer is looking for a return on assets and a chief data officer is looking for a return on data. It's the value you're getting out of it. They tend to look for things like analytics and information: where is the data coming from? Where are my sources? How many employees, parts, products, contacts and customers do I have in the organization? In an ideal world, this is the outcome that we're looking for. This is very similar to the Airbnb pages I showed you before: it's the metadata about your data, which can then be drilled into and used by different individuals.
It's transformational. A lot of people have a lot of source systems with the same data residing in all of them, so mastering those source systems to understand your before and after is extremely important. And, of course, salespeople are looking for distribution by sales or different types of analytical outcome, and that can only be done after you have cleaned up and transformed that data.
And what is important now? Well, seven years ago, everybody wanted data scientists, literally everybody out there. It was the best job to go and get. We've got the stats here from elsewhere.
If you look at the job listings for data scientists over the past seven or eight years, they've gone up significantly. So now we have data scientists, but what we're asking them to do is mainly janitorial work, which is cleaning the data up. That's not their role. We need people who are specialized and skilled in that particular role to do that job, and that's why we need data engineers now.
It's a unique moment. The landscape of technology is changing: ML and AI are coming in and have become extremely useful. Where data resides is changing; we're using cloud computing a lot more. And organizations are finding that data is extremely important to them. It's an asset, and they're agreeing to this. So you're getting economic buy-in from the people at the top, you're getting the tools, you're getting process change, and now is the time for us to take that action and make that change. A lot of the Global 2000 companies out there are realizing this and making this change within their organizations.
Time's running out and I want to take some questions, but I just want to leave you with what not to do when we're doing data engineering. As I mentioned before, let's not boil the ocean. Let's not go back to the traditional habits that we've formed in terms of delivering projects: boil the ocean, waterfall. Let's try and be agile. Don't think that one system or one single platform is going to solve the problem. That's not what DataOps is about; it's about using the best of breed for an individual task. As I showed you on the eyesore slide, there are lots of technologies out there that you can try. Let's not stick to a single vendor, and think about bringing the best of breed together. Don't underestimate the effort required.
As it says here, just because Google doesn't do it, it doesn't mean it won't work for you. Try something that works for you as an organization, and don't just look at what other people are doing. Don't underestimate the behavioral change that will be required in the organization when you want to introduce these methodologies. And avoid that kind of hubris that says data scientists are data engineers really. They're two different roles. You cannot get two people doing the same role; let them do what they're really, really good at.
What's next? The opportunity. With the opportunity we've got, we can create great outcomes for our organizations. We can create great repositories where everybody can start using data well. As individual consumers of systems, we're consuming a lot more data than we ever did before, and that should be available to everybody within the organization, and there are great technologies out there to be able to do this.
That's it from me. If there are any questions, I'm happy to take them now, or you can come to our booth, which is just down the aisle here.
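
The question raised in the talk, how many customers an organization has when finance and sales each give a different answer, can be made concrete with a small example. The following is a minimal Python sketch and was not shown in the talk: the records, field names, suffix list and the `match_key` and `master` helpers are all hypothetical, and the matching rule is deliberately crude compared with real mastering tools.

```python
import re

# Hypothetical records from two silos; all names and fields are invented.
finance = [
    {"name": "Acme Ltd.", "postcode": "EC1A 1BB"},
    {"name": "Globex Corporation", "postcode": "SW1A 2AA"},
]
sales = [
    {"name": "ACME Limited", "postcode": "ec1a1bb"},
    {"name": "Globex Corp", "postcode": "SW1A 2AA"},
    {"name": "Initech", "postcode": "M1 1AE"},
]

# Legal suffixes ignored when comparing company names (an assumed, incomplete list).
SUFFIXES = {"ltd", "limited", "corp", "corporation", "inc", "plc"}


def match_key(record):
    """Build a crude matching key: lowercased name without legal suffixes,
    plus the postcode with spaces removed."""
    words = re.findall(r"[a-z0-9]+", record["name"].lower())
    core = " ".join(w for w in words if w not in SUFFIXES)
    postcode = record["postcode"].replace(" ", "").lower()
    return (core, postcode)


def master(*sources):
    """Group records from every source under their shared matching key."""
    groups = {}
    for source in sources:
        for record in source:
            groups.setdefault(match_key(record), []).append(record)
    return groups


mastered = master(finance, sales)
print("finance:", len(finance))
print("sales:", len(sales))
print("naive total:", len(finance) + len(sales))
print("mastered:", len(mastered))
```

Here finance holds 2 records and sales holds 3, yet the mastered view contains 3 distinct customers rather than the naive total of 5, which is why the answer depends on agreeing what a customer is before counting.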