Address Critical Shortcomings of ETL Tools for Better Data Analytics

Improve Data Quality to Gain Real Business Insights

Sarah-Lynn: Good morning and welcome to our webinar, Cloud Native Data Pipelines and ETL Techniques. My name is Sarah-Lynn Brunner and I will be your host for today.

Today’s presenter is Maulik Parikh. Maulik helps enterprises accelerate their digital initiatives with data analytics and state-of-the-art cloud enablement of enterprise applications. He brings more than 11 years of experience in the architecture and engineering of large-scale, high-performance cloud-based applications. Let’s welcome Maulik.

Maulik Parikh: Thanks, Sarah. Thanks for having me. Good morning everyone.

Sarah-Lynn: Thanks for joining us. Before we begin, let me review some housekeeping items. First, this webcast is being recorded and will be distributed via email, allowing you to share it with your internal teams or watch it again later. Second, your line is currently muted. Third, please feel free to submit any questions during the call by utilizing the chat function on the bottom of your screen. We’ll answer all questions towards the end of the presentation and we’ll do our best to keep this webinar to the 30-minute time allotment. Now at this time, I’d like to turn this presentation over to Maulik.

Maulik: Hi everyone. Good morning. Today’s conversation will be focused on the more technical aspects of data warehouse and analytics techniques. We’ll look at how data pipelines can be built in a cloud-native world and what kind of modernization objectives can be achieved. In the conversation, we’ll look at traditional warehouse and analytics techniques. We’ll look at our objectives and how we can modernize.

We’ll also touch upon our journey to modern data analytics. Specifically, we’ll take a deep dive into AWS’s big data products, certain details and best practices around them, and then we’ll touch upon the questions which you will have.

At this point, I will jump into the initial part of the conversation, which is around traditional data warehouse and analytical reporting techniques. Over the years we have seen that traditional data warehouse or ETL techniques have essentially been catering to relational data. The way these ETL techniques have been designed is that you have a set of infrastructure which is already planned and provisioned, and then you have server-side applications which act as brokers.

Those brokers interact with relational stores. They have their in-memory computation which extracts the data, and within the staging area of your data zone you load that data. That becomes the extraction and load. After that, in the next set of processes, you pick up the data from the staging area, transform it, cleanse it, and put it into data tables, which is essentially a periodic load.
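The extract, stage, transform, and load cycle described above can be sketched in a few lines. This is only an illustration using Python's built-in `sqlite3` module as a stand-in for the relational source and the warehouse; the table names (`source_orders`, `staging_orders`, `dw_orders`) and the cleansing rules are assumptions, not something shown in the webinar.

```python
import sqlite3


def run_periodic_load(conn):
    """Extract source rows into staging, then cleanse and load the target table."""
    cur = conn.cursor()
    # Extract and load: copy the source rows as-is into the staging area.
    cur.execute("DELETE FROM staging_orders")
    cur.execute(
        "INSERT INTO staging_orders SELECT id, customer, amount FROM source_orders"
    )
    # Transform and cleanse: normalize names, drop rows without an amount,
    # then load the result into the warehouse table.
    cur.execute(
        "INSERT OR REPLACE INTO dw_orders (id, customer, amount) "
        "SELECT id, UPPER(TRIM(customer)), amount FROM staging_orders "
        "WHERE amount IS NOT NULL"
    )
    conn.commit()
    return cur.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    CREATE TABLE staging_orders (id INTEGER, customer TEXT, amount REAL);
    CREATE TABLE dw_orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO source_orders VALUES
        (1, ' acme ', 10.0), (2, 'globex', NULL), (3, 'initech', 7.5);
    """
)
loaded = run_periodic_load(conn)
```

Everything here runs on one pre-provisioned machine, which is the property the rest of the talk contrasts with server-less execution.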

Now, from a conversation standpoint, this looks pretty simple and easy, but there are a lot of challenges around it. For example, you have to provision your infrastructure up front. You have to have your capacity measured up front, and you have to pay for the licenses of certain proprietary software as well as for infrastructure and so on.

With that, the additional challenge which comes across is how I continue to scale out to cater to the demand of the volume, velocity, and variety of the data. As you see here, certain ETL solutions usually cater to relational stores. Now, when you come across data which is semi-structured, how would you handle those challenges? Let’s say you are a manufacturing organization and you have unstructured data, or dark data as they call it, in your organization.

How would you leverage that data, put it into more of an analytics format, and visualize it and give insights to different consumers within or outside of your organization? With such challenges, there is an imminent need to move to more cloud-native data analytics techniques. What kind of motivations do we come across while working with our customers? First and foremost, a challenge which our customers have is, how do I optimize the cost which I am investing as part of our solutions?

How do I handle fault tolerance and have high availability for all of my analytics users? How can I reduce my operational overhead so that when I move my application from development to production, the maintenance around it is very well managed? I would also want to stay on a leading-edge or cutting-edge technology stack alongside having a scalable and expandable solution. Then, given these organizations have several integration channels with their partners or within the organization, and will have different solutions, these data pipelines should cater to all of those integration channels and enable them for the users.

Besides that, one of the important aspects in recent conversations is how we can bring organizations to a level where they are more data driven, have a lot of insight into their data, and can expose that data in a meaningful manner. Last but not least, a very important aspect is how I can overcome the challenges of legacy ETL solutions. With these considerations, we looked into a variety of different options at different customers. In today’s conversation we will touch upon a specific comparative study. What we are comparing is ODI and OWB, which are Oracle Data Integrator and Oracle Warehouse Builder, against AWS Glue’s server-less offering.

There are certain criteria which you have to evaluate as part of this comparative study. First and foremost, you look at what your infrastructure procurement would look like and how you would want to maintain it. If you are on, let’s say, an ODI and OWB based solution, you have to procure your infrastructure, you have to provision a lot of licensing, and you will have to scale up as you go further if you are trying to cater to a lot of data.

As compared to that, AWS Glue runs on a server-less environment, and with AWS we are able to leverage a pay-as-you-go payment model. That is amazing, because you don’t have to provision your infrastructure up front. You are able to spin up your cluster as needed, execute your ETL jobs, and optimize on the cost. The second aspect is scalability and availability. With a packaged deployment of ODI and OWB, when you try to scale you have to automate a lot of things by hand, and there is a lot of tooling and technique which you have to look into in order to cater to that. From an availability standpoint, you will definitely have to provision a lot of nodes or multiple instances to make it highly available, which leads to cost and maintenance.

AWS Glue is essentially a highly available and scaled-out infrastructure. When you build your jobs, you can specify that a job is supposed to cater to a certain data load. Based on that assessment, you can determine how many data processing units, which are in a way virtual CPUs, you would want to allocate. As needed, you can also scale out using a lot of automation techniques.
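To make the pay-as-you-go point concrete, a rough cost estimate for a single job run can be written as data processing units multiplied by runtime and a per-unit hourly rate. The function below is a hypothetical helper, and the rate passed to it is an illustrative placeholder rather than a quoted AWS price; minimum billing durations and other pricing details are ignored.

```python
def glue_job_cost(dpus, runtime_minutes, rate_per_dpu_hour):
    """Estimate the cost of one job run as DPU-hours times an hourly rate."""
    if dpus <= 0 or runtime_minutes < 0 or rate_per_dpu_hour < 0:
        raise ValueError("dpus must be positive; runtime and rate must be non-negative")
    dpu_hours = dpus * runtime_minutes / 60
    return round(dpu_hours * rate_per_dpu_hour, 2)


# Assumed rate, for illustration only.
ASSUMED_RATE = 0.44

small_cluster = glue_job_cost(dpus=2, runtime_minutes=90, rate_per_dpu_hour=ASSUMED_RATE)
large_cluster = glue_job_cost(dpus=10, runtime_minutes=20, rate_per_dpu_hour=ASSUMED_RATE)
```

Comparing a small allocation that runs longer with a large allocation that finishes quickly is one way to reason about how many units to assign to a job.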

Licensing is an important factor when considering this. OWB and ODI come with Oracle licensing, which is a hefty amount. As compared to that, AWS Glue brings us more of an open source environment, wherein the licensing is pre-built or managed by AWS and the implementation is over Apache Spark, which is again an open source offering. Now it comes to a few features which I think modern data is always looking for. If you have a higher volume of data and a variety or veracity of data, then you would also want to look at what kind of data formats and data types you have. One way to look at it is, can I crawl this source data when it is getting ingested? Can I classify the data and say that this data is valuable for me, while other data is less valuable but I would still like to have it and process it? With ODI and OWB, there is no such provision. AWS Glue brings data crawlers and classifiers out of the box.

Security of data at rest is very important. With ODI and OWB, whoever is the provisioner or the owner has to manage the security. With AWS, we get AWS KMS with AES-256 out of the box. Execution cost is a very important aspect which a lot of organizations look into. In our experience, we have observed that if certain ETL solutions have not been built to cater to the variety and volume of the data, the execution time of a given computation instance will be really high, and in turn it uses a lot of compute units.

As compared to that, with AWS Glue, as you allocate server-less compute and you have enough data processing units allocated to it, the cost is really minimal. I have a comparative analysis for this use case coming further in the slides, so we’ll touch upon that as well. Another aspect which is considered as part of the evaluation is being able to monitor and audit. With ODI and OWB, one has to embed a lot of monitoring and auditing solutions. With AWS Glue, the platform itself offers CloudWatch, CloudTrail, and Macie.

CloudWatch allows you to monitor your executions, monitor your infrastructure. With this, we’ll jump into modern data lakes architectures and how we are essentially looking at data lake pipelines for our actionable intelligence. This is a blueprint which, essentially, we look at when we try to build cloud native data pipelines. Specifically, this is looking at the AWS as an infrastructure provider.

From left to right, organizations will have different types of data sources. They may have cloud-native data storage, they may have on-premise data storage, and they may have different media channels such as integrations with Facebook, Google, and other media providers. Besides that, within all the applications and the infrastructure suite, they will have a lot of logs and other associated information. All of these different data sources bring the challenge of how we handle the amount of data load, as well as the variety and velocity with which this data comes across.

Now, as part of our data engineering practice, we look into several aspects of data solutions, and the ingestion mechanisms for these data sources are essentially categorized into two different kinds. One is the batch or semi-batch process and the other is the streaming process. Batch and semi-batch would probably cater to a lot of relational data storage or semi-structured data which is ingested as more transactional computational units, and to cater to that we would be using AWS Glue as our ETL infrastructure. From left to right, the way it usually works is your data sources are connected using AWS Glue, you crawl that data, and you land that data into S3. At that point in time, the data is in a true raw format. You would also want to employ different types of schemas on top of it, and then massage that data, manage that data, and utilize it for further purposes.

From that perspective, the life cycle would be that after you ingest raw data into S3, you run your data processing and cleansing jobs on AWS Glue, and then you can set that data in more of a staging environment and so on. With this life cycle, you have a lot more cleansed data which can be leveraged for data analysis purposes, and a lot of data scientists would also be interested in such data. To make it more business-ready and make it available to different business users and personas, you would probably want to have a dimensionality to that data, and that dimensionality can be achieved by different data warehousing solutions. AWS offers Redshift as its data warehousing solution. It’s an out-of-the-box, petabyte-scale data warehousing solution which gives us a lot of MPP capabilities. Those MPP capabilities help deliver a lot of the business insights which we are looking for.

From left to right, as we look at that, we can serve this data to a variety of personas using Redshift as a mechanism, as well as data analysis techniques. If you are an engineer who would like to explore this data in an ad hoc manner, you would probably be using something like AWS Athena to query it. This is also a server-less offering by Amazon Web Services. It’s an implementation over Presto. If you are familiar with Presto or with ANSI SQL, then all those queries which you implement in your true SQL environment, you would be able to execute here using Athena as well.

With this life cycle, you are catering to a variety of personas: data scientists, data analysts, and business users. Then, once you have this data in a more meaningful form and you want your partners to leverage it and realize its value, you can achieve that using several engagement platforms. It could be using solutions like API Gateway, where you build Lambda functions and make those APIs available through API Gateway, or you could have more of an interactive mechanism using NLP, which you could also look at.

Now, we talked about the batch ingestion mechanism, but what happens if there is a true real-time or near real-time stream of data which you have to cater to? To solve that problem, AWS’s offering comes out of the box with Amazon Kinesis, and Kinesis has different products as well. Those are essentially Kinesis Firehose, Kinesis Analytics, Kinesis Streams, et cetera. With Streams, we are able to ingest that data into one of the streams, and that data can also be analyzed over EMR using Spark. Once you analyze that data, you can query it, or you can pump it in a raw format into AWS S3, and then further channels can be served from there.

A few of the customers have also come across situations where their data is behind their data centers, and the first copy of the data should stay behind the data centers. In order to address that use case, we can have an agent sitting within their infrastructure. That agent can communicate with AWS’s infrastructure, and using either the SDKs or the CLIs, we can address this challenge as well.

With this, I’ll take a deep dive into AWS Glue’s different components and features, what it avails. Glue, as I mentioned as a product, it’s a server-less execution platform for a lot of ETL techniques. It comes with three different predominantly fundamental aspects of discovering your data, developing your transformation technique and deploying those ETL solutions.

To discover the data which is associated with a variety of data sources, there is a Glue data catalog which is available out-of-box. Glue Data catalog is nothing but an Apache Hive compatible metastore which essentially crawls through. As soon as you have established a connection with your source, it crawls through your source data. Essentially, it harvests a lot of schema and other meta-details which gets stored into data catalog.

As I mentioned, there are automated data source crawlers available. These crawlers can also be run after you land your data into S3 buckets or some other destinations as well. Similarly, you can actually build a schema. Important feature is with Glue, we are achieving schema on read concept. The way it works is it goes through first 5MB worth of the data which resides either on your AWS infrastructure or outside of AWS infrastructure. I mean, it prescribes the schema to us. One can also build our own external schema on top of hive compatible metastore, as well.

Another important aspect which was of interest for us is data catalog can be leveraged across multiple AWS services. For an example, EMR can also leverage data catalog for the Presto solution as well. With the availability of data catalog across multiple services, we are actually able to empower organizations to build their microservices implementation as well. That caters to data discovery mechanism and repository of the data which you are essentially looking at.

Second part which is, how do I author my job. It truly relates to developing your ETL jobs in a true sense. As with a variety of ETL tools, there are a lot of out-of-box mapping mechanisms available. With Glue, that’s also one of the features which is available. As you crawl through your sources, we have already harvested your schemas and you are actually able to generate a scaffolding base code with it.

Now of course, that’s source and sink mechanism through lift and shift of the data. If you would want to massage your data in between and you would probably want to leverage a subset of your data tables, then you would want to implement Spark’s DataFrame implementation and so on, you can have the mapping.

The important factor for developer over here is as we look at, it supports two different programming languages: Python and Scala, alongside the Apache Spark environment. One can implement in either of the programming language. Edit, debug and share are the features which are available out-of-box. A lot of ETL tools would allow you to edit and debug, but being able to share your scripts within the environment to your team members is an important aspect of any project life cycle. That’s one of the important factor which we have considered so far as looking at AWS Glue as our engineering environment.

With development, the next need is always, how do I deploy my ETL jobs? From a deployment and execution standpoint, Glue comes with first-class server-less execution. Once you have built and tested your jobs, you can promote them to your environment by provisioning them with CloudFormation templates. When you have scheduled your jobs, that particular cluster environment is deployed and the nodes are allocated to it.

Now, this in a way doesn’t ask us to provision the infrastructure, but behind the scenes, Glue has an entire automation suite built on top of its implementation, which deploys your Spark jobs or Glue jobs and executes them. With Glue, we get the capability of scheduling our jobs how we want. These schedules are essentially directed acyclic graphs.
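The idea of scheduling jobs as an acyclic graph can be illustrated without any AWS service. The sketch below uses Python's standard-library `graphlib` module (Python 3.9 or later) to derive a valid execution order from job dependencies. The job names are hypothetical, and this is not how Glue itself defines triggers; it only shows the ordering concept.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it can start.
job_dependencies = {
    "crawl_raw": set(),
    "cleanse": {"crawl_raw"},
    "build_dimensions": {"cleanse"},
    "build_facts": {"cleanse"},
    "load_warehouse": {"build_dimensions", "build_facts"},
}


def execution_order(dependencies):
    """Return one valid run order; graphlib raises CycleError if a cycle exists."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(job_dependencies)
```

Because the graph has no cycles, every job appears after all of the jobs it depends on, which is the guarantee a scheduler needs.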

There is a real-time monitoring and alerting mechanism available from Glue. Out of the box, it connects with CloudWatch. There are job states which are maintained, and based on those job states, CloudWatch can monitor the jobs. If there is a failure in your job execution, you can have your orchestration built so that it notifies the different personas who are interested in the execution of the jobs. That’s a little deeper dive into AWS Glue.
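A notification step driven by job states might look like the following sketch. The event shape, with a `detail` object carrying `jobName`, `state`, `jobRunId`, and `message`, is an assumption modeled on the job state change events that Glue emits to CloudWatch, and should be checked against the current AWS documentation. The function only builds the alert text; delivering it, for example through SNS, is left out.

```python
# States treated as failures in this sketch (an assumption, adjust as needed).
FAILURE_STATES = {"FAILED", "TIMEOUT", "STOPPED"}


def build_alert(event):
    """Return an alert message for a failed job run, or None when no alert is needed."""
    detail = event.get("detail", {})
    state = detail.get("state")
    if state not in FAILURE_STATES:
        return None
    return "Job {} run {} ended in state {}: {}".format(
        detail.get("jobName", "unknown"),
        detail.get("jobRunId", "unknown"),
        state,
        detail.get("message", "no message"),
    )


# Hypothetical event for illustration.
sample_event = {
    "source": "aws.glue",
    "detail-type": "Glue Job State Change",
    "detail": {
        "jobName": "cleanse_orders",
        "state": "FAILED",
        "jobRunId": "jr_example",
        "message": "Out of memory",
    },
}
alert = build_alert(sample_event)
```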

Out of the several customers with whom we work, there is one case study which I wanted to touch upon today. This is one of our largest customers, a global solutions provider. They were on a legacy ODI, OWB solution, and there were certain challenges: one of their executions used to take about 19 hours to extract, load, and transform the data, which is too inefficient to make the data available in a timely manner to different users.

With such cloud-native data pipelines, we were able to achieve an 88% more efficient historical load. Alongside that, our tests also confirmed that in a production-to-production comparison, we had 53% more efficient incremental runs. Since, as I mentioned, AWS Glue is an implementation over Apache Spark, we were able to build our scripts in such a manner that we could reuse that code in a governed environment, or we can say behind the firewalls, when we were to put that solution in a more governed infrastructure.

As you know, larger organizations have a presence in multiple geographies, and those geographies have their own specific features which they want to adopt and leverage. To cater to such challenges, we introduced the concept of global feature factories, where we were able to organize our functions, mark them available for different geographies, and decide when we can schedule those functions and what kind of features we can make available to them.

As I mentioned, we were able to build reusable code, and this code, along with the infrastructure, could also be deployed in a variety of environments. We were able to achieve a true cloud-agnostic solution with that. Last but not least, the most interesting part is that whenever we build our solution, we consider a lot of infrastructure automation aspects. With this solution, we introduced CloudFormation infrastructure provisioning techniques, and our deployments took less than five minutes once we had that cloud automation.

Around this area, we also organized our solution with different DevOps techniques, which I would love to touch upon if there is an interest. Looking at this and the amount of efficiency which we had, we were able to reduce the cost for our customer by 89%. What we found is that these solutions have a separation of concerns. Your data storage is on S3 as well as on Redshift, which alleviates the challenge of having a data storage infrastructure embedded in the data engines.

We were able to build a lot of important data transformations which were very efficient. Since these jobs were server-less jobs, they accounted for a very limited time of execution. With limited time of execution, we were able to save a lot of cost. With that, we also built a lot of Lambda functions, which allowed us to run sub-minute or sub-30-second executions for several data pipeline techniques.

Ease of management and monitoring is what we could achieve with this solution. With that, I’ll touch upon some practices which we followed using AWS Glue. As we know, different providers and different ETL solutions come with their own challenges. We wanted to share some of the challenges which we came across with everyone here. The first one is essentially, how do I achieve incremental loads with changed data?

Our understanding is Glue’s dynamic frame essentially can extract and process the entire data, but if you were to only extract the data between a given time interval, you would probably have challenges essentially because you have to load an entire data into memory. The amount of memory which is allocated could be sufficient or not sufficient to cater to the data which we are looking at. The need was essentially you only ingest the changed data. To achieve that, we were able to use Spark’s dataframe concept along with Spark’s SQL implementation.
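The principle behind the incremental load, filtering for the changed rows in the query itself instead of loading everything and slicing it in memory, can be shown with a small sketch. It uses Python's built-in `sqlite3` in place of Spark SQL so that it runs anywhere; the `orders` table, its `updated_at` column, and the watermark handling are assumptions for illustration.

```python
import sqlite3


def extract_changed_rows(conn, since, until):
    """Fetch only rows changed in the window (since, until], filtered at the source."""
    query = (
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? AND updated_at <= ? "
        "ORDER BY updated_at"
    )
    return conn.execute(query, (since, until)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT);
    INSERT INTO orders VALUES
        (1, 'new',     '2019-01-01T09:00:00'),
        (2, 'shipped', '2019-01-02T10:30:00'),
        (3, 'new',     '2019-01-02T23:59:59'),
        (4, 'new',     '2019-01-03T08:00:00');
    """
)

since = "2019-01-02T00:00:00"  # watermark saved by the previous run
until = "2019-01-03T00:00:00"
changed = extract_changed_rows(conn, since, until)
# The next run starts from the newest timestamp that was actually processed.
next_watermark = changed[-1][2] if changed else since
```

ISO 8601 timestamps stored as text compare correctly as strings, which is why the range filter works here without date parsing.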

Second and one of the very important questions which we come across is, how do I cater to humongous historical load in a batch processing environment when there is a limited memory allocation available? With different executions in Spark environment, we have to manage it in an efficient manner. The challenge is usually with higher volume, we have to factor in a lot of memory allocation and how we distribute that data across different nodes within the cluster.

The practice which we follow is, since we know that this is batch processing, we assess the data size, we apply a data chunking mechanism, and then we perform incremental runs within the historical run. Another good practice which we follow is to start with a smaller number of data processing units, which means fewer virtual CPUs. Then, based on the data or the size we are catering to, we increase our DPUs using certain automation techniques. Another challenge which we have looked into is data type mismatches.
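The chunking practice for a historical load can be sketched as splitting the full date range into fixed-size windows and running each window as its own incremental run. The helper below is hypothetical, and the 30-day chunk size is an arbitrary choice for illustration.

```python
from datetime import date, timedelta


def date_chunks(start, end, days_per_chunk):
    """Yield (chunk_start, chunk_end) pairs covering [start, end), end exclusive."""
    if days_per_chunk <= 0:
        raise ValueError("days_per_chunk must be positive")
    step = timedelta(days=days_per_chunk)
    current = start
    while current < end:
        chunk_end = min(current + step, end)
        yield current, chunk_end
        current = chunk_end


# Two months of history processed in windows of at most 30 days each.
chunks = list(date_chunks(date(2018, 1, 1), date(2018, 3, 1), 30))
```

Each pair can then be passed to the same time-window extraction used for incremental runs, so the memory needed per run is bounded by the chunk size rather than by the full history.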

Imagine a use case where your source has a generic character or text data type, and within that you are storing multiple values or different types of information. In a historical load you may have ingested that data with a consistent data type, but in the incremental load you may come across the challenge that the data type has changed. How do you handle that using AWS Glue? The practice which we follow is to use pre-defined Hive schemas and apply them along with Glue’s crawler methodology.
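The type drift described above, and the effect of pinning a schema up front, can be reproduced with plain Python. The `infer_type` function is a deliberately naive stand-in for automatic inference and is not Glue's crawler logic; the column names and sample values are illustrative assumptions.

```python
# Pre-defined schema: every column is pinned to one type up front.
PREDEFINED_SCHEMA = {"request_time": str, "region_ip": str, "response_data": str}


def infer_type(values):
    """Naive inference: 'bigint' if every value parses as an integer, else 'string'."""
    try:
        for value in values:
            int(value)
    except ValueError:
        return "string"
    return "bigint"


def apply_schema(record, schema):
    """Cast each column to the type named in the pre-defined schema."""
    return {column: cast(record[column]) for column, cast in schema.items()}


first_batch = [
    {"request_time": "2019-01-01T09:00:00", "region_ip": "10.0.0.1", "response_data": "200"},
    {"request_time": "2019-01-01T09:00:05", "region_ip": "10.0.0.2", "response_data": "301"},
]
second_batch = [
    {"request_time": "2019-01-02T09:00:00", "region_ip": "10.0.0.1", "response_data": "200"},
    {"request_time": "2019-01-02T09:00:05", "region_ip": "10.0.0.3", "response_data": "OK"},
]

# Inference per batch disagrees: the same column gets two different types.
inferred_first = infer_type([r["response_data"] for r in first_batch])
inferred_second = infer_type([r["response_data"] for r in second_batch])

# With the pre-defined schema, both batches land with one consistent type.
typed = [apply_schema(r, PREDEFINED_SCHEMA) for r in first_batch + second_batch]
```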

Another important factor which we look at is, how do we handle slowly changing dimensions in dimensional data? The solution which we built is a columnar structure using the Parquet file format. With Parquet, as we know in the big data world, data is stored in a compressed, columnar structure to support the OLAP model and fast, efficient queries. Since these are immutable data stores, how do I handle the changing dimension and change the value of certain dimensional attributes? Our practice is essentially that there is no in-line change. We continue to append a new value, and when we extract or visualize that data, we consider the latest information for that particular dimension. Assessing the cluster needs is another important factor which one has to consider.
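The append-and-take-latest approach to slowly changing dimensions can be sketched with a list standing in for the immutable store. The field names (`customer_id`, `tier`, `updated_at`) are hypothetical; in the setup described above the rows would live in Parquet files and the same resolution would happen at query time.

```python
def latest_dimension_rows(rows, key="customer_id", version="updated_at"):
    """Resolve the current value of each dimension member from append-only history."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[version] > current[version]:
            latest[row[key]] = row
    return latest


# History is only ever appended to; existing rows are never updated in place.
history = [
    {"customer_id": 1, "tier": "silver", "updated_at": "2019-01-01"},
    {"customer_id": 2, "tier": "bronze", "updated_at": "2019-01-05"},
]
history.append({"customer_id": 1, "tier": "gold", "updated_at": "2019-02-10"})

current_view = latest_dimension_rows(history)
```

The older row for customer 1 stays in the history, so earlier states remain queryable while readers of `current_view` see only the latest value.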

The way we practice is, as I mentioned, we start with small compute units, prepare our production-equivalent data sets, and then test with the volume, velocity, and adjustable compute units. With AWS Glue, we also came across a challenge where, if we have a multi-region or multi-availability-zone deployment requirement for global solutions, we have to use a lot of cloud automation techniques to stage our code into a particular set of buckets globally. Those buckets can be replicated and then leveraged when executing the Glue jobs.

This fundamental need comes from one of the requirements which Glue has: its clusters are organized in a particular environment as a cluster-managed group. To set up nodes using cluster-managed groups, they have to sit in the same region where these nodes are situated. Because of that, the code which is picked up from S3 buckets has to be in the same region where the Glue jobs are executed. From there, we have different learnings. Some of the best practices which we follow: we automate data pipelines, and we compress, model, and catalog our physical data.

We always try to optimize with server-less transforms. Infrastructure as code is our practice. This allows us to lower our resource usage as well. We try to build feature factory settings. We schedule with acyclic graphs. Our philosophy is always to monitor, notify, and roll forward. Essentially, you can always catch up. We are reusing not only the code but the architecture blueprints as well, so you can implement the same blueprint across different solutions. By working really closely with AWS, we were able to provide feedback and get a lot of product features out to the market to solve certain challenges of ours as well. With this, I’ll switch over to a quick demo of how we catered to certain challenges.
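One way to respect the same-region requirement in deployment automation is to resolve the script location from the region in which the job will run. The mapping below is purely hypothetical: the bucket names and the `scripts/` prefix are placeholders, and replication of the buckets is assumed to be set up separately.

```python
# Hypothetical replicated buckets, one per region where jobs are executed.
REPLICATED_SCRIPT_BUCKETS = {
    "us-east-1": "example-etl-scripts-us-east-1",
    "eu-west-1": "example-etl-scripts-eu-west-1",
    "ap-south-1": "example-etl-scripts-ap-south-1",
}


def script_location(job_region, script_name, buckets=REPLICATED_SCRIPT_BUCKETS):
    """Return the S3 path of a job script in the same region as the job itself."""
    if job_region not in buckets:
        raise KeyError("no script bucket staged for region " + job_region)
    return "s3://{}/scripts/{}".format(buckets[job_region], script_name)


eu_script = script_location("eu-west-1", "cleanse_orders.py")
```

Failing fast on an unknown region keeps a job from being created with a script path that lives in a different region.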

First challenge which we touched upon was essentially, what if your source has data types which are fairly generic. For an example, what we are considering here is a more of a network or application monitoring infrastructure, wherein the packets which are being ingested in your data environment has different data types, essentially. The way it is specified it’s a text as a generic data type. Now, when we ingest this data, and the pattern which we see here, the response data as a column has the numeric values. When you crawl through these data and ingest that in, we would only be able to see that you have only one data type ingested in. In the subsequent job run, if you were to actually cater to a different data type, how would you achieve that? That’s essentially using pre-defined schemas. As you see here, we have

Sarah-Lynn: Hey, Maulik. Hi, Maulik, it’s Sarah-Lynn. I just want to let you know that the demo is not showing on your screen right now.

Maulik: Is it? Just give me a moment.

Sarah-Lynn: There you go. Thank you.

Maulik: Okay. All right. No problem. Let me go back to the original discussion which I was having, essentially. For an example, you have your network monitoring tool which is ingesting the data. As you see, the data types here are fairly simple data types which we are specifying. You have request time which is text, region IP is also text and response data is also text type. Initially when your requests are going through, these are the response data sets which you are looking at and those response data sets are often numeric value. Now, in the next ingestion, for an example, if you have a different data ingestion.

These response data could be from a 301 to essentially you could say that, “Okay, the value is okay.” From a numeric value, we are on a character value as such. With auto-ingestion mechanism or auto-inference mechanism which Glue provides out-of-box. Your first ingestion catered to after you crawl data to the data type as big integer and the next ingestion of cycle would say that here is the mix of data which I have. Now, with the mix of data that you add in this, it goes across multiple partitions and, essentially, creates multiple data bases for us. It creates certain data type, database tables with big integer as a data type and certain tables as a string data type.

Now, for you to crawl through our, look through this data, you are to hop across multiple tables in the same database. That challenge can be solved using three different schema. The way it works is, essentially, you define your scheme on a bucket with, essentially, properties file. That schema, you can infer in your Glue’s data catalogue and that essentially builds up pre-defined schema so when you ingest data type across the board. This way, you can achieve the consistency of the data types. Now, the second problem which we touched upon is essentially, what if I have allocated limited amount of data processing units?

I’m using Spark’s dataframe concept to execute my jobs. With Spark’s dataframe concept along with Glue, the data distribution mechanism is inconsistent across the node and we end up with the challenge that certain nodes get a lot of data load and those nodes may not have enough memory to cater. As you see in the logs here, this particular node essentially had about five and a half-gigs worth of memory and the amount of data which we are trying to move was more than five and a half-gigs. How do we achieve the data distribution consistency across the board? Glue has its own black magic which essentially they call dynamic tree which really allows them to organize the data which we throw across to their implementation.

That essentially caters to the data distribution requirements which we have. The third problem which we talked about is incremental load. Out of the box, Glue provides us the dynamic frame capabilities. As you see here, we’re building a dynamic frame, and the data which we extract through it is the entire data chunk which we have at the source. We have to manage it in memory, chunk it by the date and time interval which we are specifying, and then, in turn, we only use a slice or subset of the data to push through to S3. This essentially needs a lot of memory or compute allocation just to cater to that challenge. In order to reduce this memory and processing unit allocation, we can also build this as a Spark implementation, building Spark DataFrames and deploying them to our cloud infrastructure. That’s it for my demo. Over to you, Sarah.
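One way to picture the incremental load discussed in the demo is to ask the source only for the rows inside a time window, instead of loading everything and slicing it in memory. The sketch below is an assumption about how that could look, not the code shown in the demo: the `events` table, the `updated_at` column and both helper functions are hypothetical.

```python
from datetime import datetime, timedelta


def time_windows(start, end, step):
    """Yield consecutive half-open [lo, hi) windows that cover start..end."""
    lo = start
    while lo < end:
        hi = min(lo + step, end)
        yield lo, hi
        lo = hi


def incremental_query(table, column, lo, hi):
    """Build a SQL query that asks the source only for rows changed in [lo, hi)."""
    # Table and column names come from trusted config here; real code should
    # not build SQL from untrusted input this way.
    return (
        f"SELECT * FROM {table} "
        f"WHERE {column} >= '{lo.isoformat()}' AND {column} < '{hi.isoformat()}'"
    )


windows = list(time_windows(datetime(2019, 1, 1), datetime(2019, 1, 2), timedelta(hours=8)))
queries = [incremental_query("events", "updated_at", lo, hi) for lo, hi in windows]
for query in queries:
    print(query)
```

Each query could then be handed to a Spark read against the source, so that a job only ever holds one window of data in memory.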

Sarah-Lynn: Great. Thanks, Maulik. I’m turning back over to the PowerPoint presentation right now. At this time, we would like to get your questions answered. As a reminder, you can at any time ask a question utilizing the chat feature on the bottom of your screen. Maulik, we do have a couple of questions for you that came in. First one is, how did you or do you plan to handle near real-time data consumption using Glue?

Maulik: It’s a very good question. A real-time or near real-time data ingestion mechanism needs dedicated compute capacity and dedicated transformation capacity. To allocate that capacity, as I mentioned in my presentation, we would probably use Amazon Kinesis or a Kafka-based implementation, which will allow us to ingest the data in a raw format into S3 buckets and then essentially apply Glue’s transformation logic on top of it.

That way, we are catering to the incoming data load with its velocity and volume. With Glue’s server-less transforms, we are transforming a subset of the data or the entire data set, but essentially we are allocating only a limited number of memory and processing units to it.
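The split described in this answer, a stream that lands raw data and a separate transform that runs later, can be illustrated with a small buffering sketch. It is not the Kinesis API. The `RawBuffer` class, its limits and the list standing in for an S3 prefix are hypothetical, and the flush-on-size-or-age rule is an assumption about how such a landing step is typically configured.

```python
class RawBuffer:
    """Collect incoming records and flush them as one raw batch when a limit is hit."""

    def __init__(self, max_records, max_age_seconds, sink):
        self.max_records = max_records
        self.max_age_seconds = max_age_seconds
        self.sink = sink  # callable that receives a list of records (stand-in for an S3 write)
        self.records = []
        self.opened_at = None

    def add(self, record, now):
        """Buffer one record; now is a timestamp in seconds supplied by the caller."""
        if not self.records:
            self.opened_at = now
        self.records.append(record)
        too_many = len(self.records) >= self.max_records
        too_old = now - self.opened_at >= self.max_age_seconds
        if too_many or too_old:
            self.flush()

    def flush(self):
        """Hand the buffered batch to the sink and start a new batch."""
        if self.records:
            self.sink(list(self.records))
            self.records = []
            self.opened_at = None


raw_batches = []  # stand-in for objects landed in a raw S3 prefix
buffer = RawBuffer(max_records=3, max_age_seconds=60, sink=raw_batches.append)
for second, event in enumerate(["a", "b", "c", "d"]):
    buffer.add(event, now=second)
buffer.add("e", now=120)  # arrives late, so the age limit triggers a flush
print(raw_batches)
```

The transform step then only sees finished batches, which is what lets it run with a limited allocation.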

Sarah-Lynn: Great. Thank you. The next question, how can we check data lineage at each touch point?

Maulik: Excellent question. Data catalog is the feature which you use. As I mentioned, the data catalog is essentially a Hive metastore. Because the data catalog can talk to multiple sources, multiple stages and multiple destinations, you are able to gather all the different data sources and, from that perspective, manage the schema. Out of the box, the data catalog also comes with a feature where it maintains the versions of your schema.

You would be able to trace back through different versions of the schema and check that your source infrastructure, your staging infrastructure and your destination infrastructure had the correct data mapping at any given point in time. You can validate that across multiple versions of the schema. That is how you can also maintain or manage your lineage. This Hive metastore can also be made available to different Hadoop infrastructures as well. If you’re looking more at data governance aspects, you can also look at the different data lineage solutions which are available in the Hadoop ecosystem.
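Tracing a mapping across schema versions comes down to comparing the column definitions of two versions. A minimal sketch, assuming each version has already been reduced to a column-name-to-type mapping; the `diff_schema_versions` helper and both sample versions are hypothetical:

```python
def diff_schema_versions(old_columns, new_columns):
    """Compare two schema versions given as {column_name: data_type} mappings."""
    added = {c: t for c, t in new_columns.items() if c not in old_columns}
    removed = {c: t for c, t in old_columns.items() if c not in new_columns}
    retyped = {
        c: (old_columns[c], new_columns[c])
        for c in old_columns
        if c in new_columns and old_columns[c] != new_columns[c]
    }
    return {"added": added, "removed": removed, "retyped": retyped}


# Made-up versions: response_data changes type and a status column appears.
version_1 = {"request_time": "string", "region_ip": "string", "response_data": "bigint"}
version_2 = {"request_time": "string", "region_ip": "string", "response_data": "string", "status": "string"}

changes = diff_schema_versions(version_1, version_2)
print(changes)
```

In practice the two mappings would be built from the versions stored in the catalog, for example from what the SDK’s `get_table_versions` call returns.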

Sarah-Lynn: Perfect. We only have time for one more question so here’s the last one. Can you provide an example of where else in the application ecosystem you have used data catalog?

Maulik: Nice. Really interesting question. The data catalog essentially harvests the different schemas which are available across your data infrastructure. The way you would use it is you would use Glue’s SDK APIs to scan through your catalog schemas. Then, this becomes a kitchen sink for you to manage, and you can organize it across your organization’s microservices infrastructure. You can perform discovery on the metadata of your data and leverage that across your applications, so that you now have a centralized mechanism to maintain your schemas. This goes in line with more of a data discovery and data-as-a-service concept.
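Scanning the catalog through the SDK could look roughly like the sketch below. It assumes the Python SDK (boto3), whose Glue client offers a `get_tables` paginator. The `harvest_schemas` helper, the database name and the offline `FakeGlueClient` stand-in are hypothetical, and the stand-in only mimics the part of the response shape that the helper reads.

```python
def harvest_schemas(glue_client, database_name):
    """Return {table_name: {column_name: data_type}} for one catalog database."""
    schemas = {}
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            schemas[table["Name"]] = {col["Name"]: col.get("Type") for col in columns}
    return schemas


class FakeGlueClient:
    """Offline stand-in so the sketch runs without AWS credentials."""

    def __init__(self, pages):
        self._pages = pages

    def get_paginator(self, operation_name):
        assert operation_name == "get_tables"
        return self

    def paginate(self, **kwargs):
        return iter(self._pages)


# Made-up catalog content, split over two pages.
pages = [
    {"TableList": [{"Name": "requests", "StorageDescriptor": {"Columns": [
        {"Name": "request_time", "Type": "string"},
        {"Name": "response_data", "Type": "string"},
    ]}}]},
    {"TableList": [{"Name": "regions", "StorageDescriptor": {"Columns": [
        {"Name": "region_ip", "Type": "string"},
    ]}}]},
]
schemas = harvest_schemas(FakeGlueClient(pages), "network_monitoring")
print(schemas)

# With AWS credentials configured, a real client would be passed in instead:
#   import boto3
#   schemas = harvest_schemas(boto3.client("glue"), "network_monitoring")
```

A microservice could expose the returned mapping as its schema lookup, which is the centralized mechanism the answer describes.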

Sarah-Lynn: Great. Thanks, Maulik. I think that’s all the questions we were able to answer at this time. Also be sure to check out and subscribe to DTV, a new digital transformation channel that brings in industry experts. Many thanks to our speaker, Maulik. Thank you everyone for joining us today. You’ll receive an email in the next 24 hours with this live presentation and a link to the webcast replay. If you have any questions, please contact us at info@infostretch.com or call 1-408-727-1100 to speak with a representative. Again, thank you all and enjoy the rest of your day.

Maulik: Thank you very much. Thanks everyone.

Improve Data Quality to Gain Real Business Insights

Sarah-Lynn: Good morning and welcome to our Webinar, Cloud Native data Pipelines and ETL Techniques. My name is Sarah-Lynn Brunner and I will be your host for today.

Today’s presenter is Maulik Parikh. Maulik helps Enterprises accelerate their digital initiative with data analytics, state of the art cloud enablement in Enterprise applications. He brings more than 11 years of experience in Architectural Engineering of large-scale and high-performance cloud fact application. Let’s welcome Maulik.

Maulik Parikh: Thanks, Sarah. Thanks for having me. Good morning everyone.

Sarah-Lynn: Thanks for joining us but before we begin, let me review some housekeeping items so, first this webcast is being recorded and will be distributed via email to allowing you to share it with your internal teams to watch again for later. Second, your line is currently muted. Third, please feel free to submit any questions during the call by utilizing the chat function on the bottom of your screen. We’ll answer all questions towards the end of the presentation and we’ll do our best to keep this webinar to the 30-minute time allotment. Now at this time, I’d like to turn this presentation over to Maulik.

Maulik: Hi everyone. Good morning and for today’s conversation will be focused on a little more technical aspects on data warehouse and analytics techniques. We’ll look at how in cloud-native world the data pipelines can be built and what kind of modernization objectives can be achieved and so on. In the conversation, we’ll look at traditional warehouse and analytics techniques. We’ll look at our objectives and how we can modernize.

We’ll also touch upon our journey to modern data analytics and specifically, we’ll take a deep dive into AWS’s big data product view, certain details and best practices around it and then we’ll touch upon certain questions and answers which you will have.

At this point, I will jump into the initial set of conversation which is around traditional data warehouse and analytical reporting techniques so, over the period of years we have seen that traditional data warehouse or ETL techniques have been essentially catering to relational data so, the way this ETL techniques have been designed is essentially you have a set of infrastructure which is already planned and provisioned and then essentially you have server-side applications which has the broker.

Those brokers will interact with relational stores, they essentially have their in-memory computation which will extract the data and essentially in your data zone, within the staging area you will load that data, that becomes an extraction and load and after that the next set of processes essentially you pick up the data from staging area and transform that, cleanse them and put them into data tables which is essentially a periodic load.

Now, from a conversation standpoint, this looks pretty simple and easy but there are a lot of challenges around that for an example, the challenges would be that you have to provision your infrastructure upfront. You have to actually have your capacity, a measured upfront, you have to pay for the licenses of certain proprietary software as well as infrastructures and so on.

With that, the additional challenge which comes across is essentially how I continue to scale out to cater to the demand of the volume, velocity, and variety of the data. As you see here, certain ELT solutions usually cater to relational stores. Now when you come across data which is a semi-structured data, how would you handle those challenges if you have, if you are let’s say a manufacturing organization and if you have unstructured data or dark data as they say in their organization.

How would you leverage them and put them into more of an analytics format and essentially how you can visualize and give the insights to different consumers within your organization or outside of your organization. So, with such challenges essentially there is an imminent need to go to more of a cloud-native data analytics techniques. What kind of motivation which we essentially come across while working with our customers? First and foremost a challenge which our customers have is how do I optimize on the cost which I am essentially investing as part of our solutions?

How do I handle the fault tolerance and have the high availability for all of my analytics users? How can I reduce my operational overhead so that when I put my application from a development to production and the maintenance around it is essentially very well managed? I would also want to stay to leading-edge or cutting edge technology stack alongside having scalable and expandable solution, then given these organizations have several integration channels with their partners are essentially within the organization, they will have different solutions, these data pipeline should cater to all of those integration channels and enable them for the users.

Besides that, one of the important aspects in recent conversations which we have seen is how we can bring organizations to a level where they are a little more data driven organization and they have a lot of insights about their data and how they can actually expose that data in a meaningful manner and last but not the least, the very important aspect is, how I can overcome the challenges of the legacy ETL solutions. Looking back in consideration, we essentially looked into a variety of different options at different customers. In today’s conversation we will touch upon a specific comparative study we can say, what we are comparing is ODI, OWB which is Oracle data integrator and Oracle workbench as compared to AWS Glue’s surveillance offering as such.

There are certain criteria’s which you have to evaluate as part of this comparative study. First and foremost, you look at how your infrastructure procurement would look like and how you would want to maintain. If you are on let’s say ODI and OWB based solution you have to procure your infrastructure, you have to actually provision a lot of licensing aspect as well as you will have to scale up as you go further if you are trying to cater to a lot of data as such.

As compared to that when you come to AWS Glue, it runs on a server-less environment essentially and with AWS we are actually able to leverage pay as you go payment model which is amazing because then you don’t have to provision your infrastructure upfront and essentially you are able to spin your cluster as needed and you can execute your ETL jobs and you can optimize on the cost. Second aspect, scalability and availability with having package deployment of ODI and OWB when you try to scale you have to actually automate a lot of things by your hand and essentially there are a lot of tooling and techniques which you have to look into in order to cater to that and from an availability standpoint you will definitely have to provision a lot of nodes or multiple instances to make it highly available which is essentially leading to cost and maintenance.

On AWS Glue, it’s essentially highly available and scaled out infrastructure then you actually build your jobs you can specify that this job is supposed to cater to a certain data load and based on that you can have your assessment and you would probably be able to determine whether how many data processing unit which in a way virtual CPUs is what you are looking at how you would want to allocate them and as needed, you can also scale out using a lot of automation techniques along with that.

Licensing is important factor when considering this with OWB and ODI it comes with an Oracle licensing which is a hefty amount as such and as compared to that, AWS Glue brings us more of an open source environment where in the licensing is a pre-built or managed by AWS and the implementation is over Apache Spark which is again an open source offering. Now it comes to a little bit of features which I think modern data is always looking for. If you have higher volume of the data and a variety or veracity of the data, then you would also want to look at what kind of data, pharma data types you have. One way to look at it is whether can I crawl this source data when they are actually getting ingested? Can I classify a data and can I say that this data is valuable for me though other data is essentially less valuable but I would still like to have that data, process that data. With ODI, OWB, there is no provision like such. With AWS Glue, essentially, it brings data crawlers and classifiers on a box.

Data at rest, security is very important. With ODI and OWB, whoever is the provisioner they have or whoever is the owner, they have to manage the security. With AWS, we get AWS TMS with this 256 out of box. Execution cost, this is a very important aspect which a lot of organizations look into. In our experience, we have observed that certain ETL solutions if they have not been built essentially to cater to the variety of data and the volume of the data, the execution of a given computation instance will be already high. The execution time will be very high and in turn, essentially, it uses a lot of compute units.

As compared to that, with AWS Glue as you allocate server-less compute with it and you have enough data processing unit allocated to it, the cost is really minimal and I have a comparative analysis for this use case coming further in the slides. We’ll touch upon that as well. Another aspect which is also being considered as part of the evaluation is being able to monitor and audit. With ODI and OWB, one has to actually embed a lot of monitoring and auditing solution, essentially, with AWS Glue, the products with itself offers CloudWatch, CloudTrail, and Missy.

CloudWatch allows you to monitor your executions, monitor your infrastructure. With this, we’ll jump into modern data lakes architectures and how we are essentially looking at data lake pipelines for our actionable intelligence. This is a blueprint which, essentially, we look at when we try to build cloud native data pipelines. Specifically, this is looking at the AWS as an infrastructure provider.

From left to right, if we look at the organizations will have different type of data sources. They may have cloud native data storage, they may have on-premise data storage, they may have different media channels such as integration with Facebook, Google, and other media providers. Also, besides that, within all the applications and infrastructure suite, they will have a lot of logs and other information which is associated with that. All of these different data sources actually brings a challenge of, how do we handle the amount of data load as well as what is the variety and velocity of this data comes across?

Now, as part of our data engineering practice, we look into several aspects of data solutions and the ingestion mechanism of these data sources are essentially categorized into two different aspects. One is the batch or semi-batch process and the other is the string process as such. Essentially, batch and semi-batch would probably cater to a lot of relational data storage or semi-structured data which is being ingested as more of a transactional computational units and to cater to that essentially, we would be using AWS Glue as an RETA infrastructure. From left to right, the way it usually works is your data sources are connected using AWS Glue, you crawl that data and land that data into S3. Now, at that point in time, the data is in a true raw format and you would also want to employ different types of schemas on top of it and then massage that data, manage that data and utilize it for further purposes.

From that perspective, the life cycle of the event would be, essentially, after you ingest raw data into S3, you can run your data processing, cleansing jobs on AWSQ and then you can set that data in a more of a staging environment and so on. With this lifecycle event, essentially, you have a lot more cleansed data which could be leveraged for a lot of data analysis purposes and a lot of data scientists would also be interested in such data. To make it more business-ready and essentially avail it to different business users and personas, you would probably want to have a dimensionality of that data and that dimensionality can be achieved by different data warehousing solutions. From AWS standpoint, they offer Redshift as their data warehousing solution, it’s out of box petabyte-scale data warehousing solution, which allows us to perform a lot of MPP capabilities. Those MPP capabilities essentially achieve a lot of business insights which we would like to perform.

From left to right as we look that, we can serve this data to a variety of personas using Redshift as a mechanism as well as data analysis techniques. If you are an engineer who would like to explore this data have an adhoc exploration and so on, you would probably be using something like AWS Athena to query that data. This is also a server-less offering by Amazon Web Services. It’s an implementation over Presto. If there is a familiarity of Presto or if you are familiar with the ANSI SQL, then all those queries which we implement in your true sequel environment, you would be able to execute them here using Athena as well.

With this life cycle, you are actually catering to a variety of personas, essentially data scientist, data analyst, business users. Then, essentially, once you have this data in a more meaningful manner and you would want to have your partners leverage this data and essentially realize the value of that data, you can actually achieve that using several of engagement platforms. It could be essentially using solutions like EPA gateway, you build the Lambda functions and avail those APA’s using EPA gateway or you can also have more of an interactive mechanism using NLPs which you could also look at.

Now, we talked about batch ingestion mechanism, but what happens if there is a true real-time or near real-time stream of data which you have to cater to? To solve that problem, as part of the AWS’s offering out-of-box, it comes with Amazon Kinesis and Kinesis has different product Suites as well. Those Product Suites are essentially Kinesis Firehose, Kinesis Analytics and Kinesis itself, Kinesis Streams et cetera. With the Streams, essentially, we are able to ingest that data into one of the streams and that data can also be analyzed over EMR using Spark solution. Once you analyze that data, you can actually query that data or you can also pump that data into a raw format to AWS S3 and then further channels can be achieved with that.

A few of the customers have also come across solutions where their data is behind the Data Centers, as such. The first copy of the data should stay behind the Data Centers. In order to achieve that or address that use case, we can also have an agent sitting behind their infrastructure. That agent, essentially, can communicate with the AWS’s infrastructure and using either SDK’s or CLI’s of the implementation, we can also achieve this challenge as well.

With this, I’ll take a deep dive into AWS Glue’s different components and features, what it avails. Glue, as I mentioned as a product, it’s a server-less execution platform for a lot of ETL techniques. It comes with three different predominantly fundamental aspects of discovering your data, developing your transformation technique and deploying those ETL solutions.

To discover the data which is associated with a variety of data sources, there is a Glue data catalog which is available out-of-box. Glue Data catalog is nothing but an Apache Hive compatible metastore which essentially crawls through. As soon as you have established a connection with your source, it crawls through your source data. Essentially, it harvests a lot of schema and other meta-details which gets stored into data catalog.

As I mentioned, there are automated data source crawlers available. These crawlers can also be run after you land your data into S3 buckets or some other destinations as well. Similarly, you can actually build a schema. Important feature is with Glue, we are achieving schema on read concept. The way it works is it goes through first 5MB worth of the data which resides either on your AWS infrastructure or outside of AWS infrastructure. I mean, it prescribes the schema to us. One can also build our own external schema on top of hive compatible metastore, as well.

Another important aspect which was of interest for us is data catalog can be leveraged across multiple AWS services. For an example, EMR can also leverage data catalog for the Presto solution as well. With the availability of data catalog across multiple services, we are actually able to empower organizations to build their microservices implementation as well. That caters to data discovery mechanism and repository of the data which you are essentially looking at.

Second part which is, how do I author my job. It truly relates to developing your ETL jobs in a true sense. As with a variety of ETL tools, there are a lot of out-of-box mapping mechanisms available. With Glue, that’s also one of the features which is available. As you crawl through your sources, we have already harvested your schemas and you are actually able to generate a scaffolding base code with it.

Now of course, that’s source and sink mechanism through lift and shift of the data. If you would want to massage your data in between and you would probably want to leverage a subset of your data tables, then you would want to implement Spark’s DataFrame implementation and so on, you can have the mapping.

The important factor for developer over here is as we look at, it supports two different programming languages: Python and Scala, alongside the Apache Spark environment. One can implement in either of the programming language. Edit, debug and share are the features which are available out-of-box. A lot of ETL tools would allow you to edit and debug, but being able to share your scripts within the environment to your team members is an important aspect of any project life cycle. That’s one of the important factor which we have considered so far as looking at AWS Glue as our engineering environment.

With development, the need would always be that how do I deploy my ETL jobs. From a deployment standpoint and execution standpoint, Glue comes with first class server-less execution. As soon as you have built your jobs and once you have tested your jobs for you to actually promote your environment, you would be able to provision that using cloud formation templates. When you have scheduled your jobs, you would be able to deploy that particular cluster environment and essentially the nodes will be allocated to that.

Now, this in a way doesn’t ask for us to provision the infrastructure, but behind the scenes, Glue has an entire automation suite built on top of their implementation, which deploys your spark jobs or Glue jobs, and they essentially execute these jobs. With Glue, we get a capability of scheduling our jobs how we want. This is essentially acyclic graphs, dynamic acyclic graphs is what we look at.

There is a real-time monitoring and alerting mechanism available from Glue. Out-of-box, it connects with CloudWatch. So there are job states, which can be actually maintained and based on those jobs states, CloudWatch can actually monitor. If there is a failure in your job execution, and if there is a failure, you can actually have your orchestration built up which in a way can notify different personas who are interested in your execution of the jobs. That’s a little deeper dive in AWS Glue.

Out of several customers with whom we work, one of these case study which I wanted to touch upon today. This is one of our largest customers who is a Global Solutions provider DVR on legacy ODI, OWB solution, and there are certain challenges where one of their executions used to take about 19 hours to extract, load, and transform that data which is an inefficient execution in order to avail the data in a timely manner to different users.

With such cloud native data pipelines, we were able to actually achieve 88% efficient historical load. Alongside that, our test also confirmed that in a production to production environment, we were able to have 53% efficient incremental runs. Since I mentioned that AWS Glue essentially is an implementation over Apache Spark. We were able to build our scripts or our code in such a manner that we were able to reuse that code in a governed environment or we can say behind the firewalls when we were to put that solution in a more governed infrastructure.

As you know, larger organizations would have a presence into multi Geo’s as such, and those Geo’s would have their own specific features which they would want to adopt and leverage. To cater to such challenges, we essentially introduced the concept of global future factories where essentially we were able to organize our functions and we were able to mark them available for different geographies and when we can schedule those functions, and what kind of features we can avail them.

As I mentioned, we were able to build reusable code and this code can essentially along with infrastructure, the code could be also deployed in variety of environments. We were able to achieve a true cloud agnostic solution with that and last but not the least, most interesting part is whenever we build our solution, we essentially consider a lot of infrastructure automation aspects. With this solution, we essentially introduce the cloud formation, infrastructure provisioning techniques and our deployments were less than five minutes when we had that cloud automation.

Around this area, essentially we also organized our solution with having different DevOps techniques, which essentially I would love to touch upon if there is an interest. Looking at this, and the amount of efficiency which we had, we were able to actually reduce 89% of the cost for our customer and we evaluated that, is essentially these solutions have a separation of concern. Your data storage is essentially on S3 as well as on Redshift which alleviates the challenge of having an embedded data storage infrastructure as compared to data engines as such.

We were able to build a lot of important data transformations which were very efficient. Since these jobs were server-less jobs, this essentially accounted for a very limited type of execution. With limited time of execution, we were able to actually save a lot of costs. With that, we also built a lot of lambda functions which essentially allowed us to execute for a sub-minute or sub-30-second execution on several data pipeline techniques.

Ease of management and monitoring is what we could achieve with this solution. With that, I’ll touch upon some practices which we followed using AWS Glue. As we know, different providers and different ETL solutions do come with their own challenges. Some of those challenges which we came across, we wanted to share across with everyone here. I’ll touch upon them. The first one is essentially, how do I achieve incremental loads with the data chain as such?

Our understanding is Glue’s dynamic frame essentially can extract and process the entire data, but if you were to only extract the data between a given time interval, you would probably have challenges essentially because you have to load an entire data into memory. The amount of memory which is allocated could be sufficient or not sufficient to cater to the data which we are looking at. The need was essentially you only ingest the changed data. To achieve that, we were able to use Spark’s dataframe concept along with Spark’s SQL implementation.

Second and one of the very important questions which we come across is, how do I cater to humongous historical load in a batch processing environment when there is a limited memory allocation available? With different executions in Spark environment, we have to manage it in an efficient manner. The challenge is usually with higher volume, we have to factor in a lot of memory allocation and how we distribute that data across different nodes within the cluster.

The practice which we follow is, since we know that this is a batch processing, we asses the data size, we apply the data chunking mechanism and then essentially we perform an incremental run within the historical run. Another good practice which we follow is we start with smaller data processing unit which is smaller virtual CPUs. Then based on what data we are catering to or the size we are catering to, we increase our DPU from using certain automation techniques. Another important challenge or another challenge which we have looked into is data type mismatches.

Imagine a use case where your source has a generic character or text data type and within that, you are storing multiple values or multiple information or different types of information within your source. Since in a historical load you may have ingested that data with the data type which is a consistent data type, but in the incremental load, you may come across a challenge that a data type has changed. How do you handle that using AWS Glue? The practice which we follow is use pre-defined hide schemas and apply that along with Glue’s crawler methodology.

Another important factor which we look at is how do we handle the slowly changing dimensions in a dimensional data? The solution which we had built is essentially a columnar structure using parquet file format. Parquet as we know in a big data world, data is stored in a more compressed and columnar structure to achieve the OLAP model and fast and efficient queries.. Essentially, since these are immutable data stores, how do I achieve the changing dimension and change the value of certain dimensional aspects? Our practice is essentially– there’s a really in-line change. We can continue to append with a new value and essentially consider that when we extract the data, when we actually realize that data, or visualize that data, we are able to consider the latest information of that particular dimension. Assessing the cluster needs is an important factor which one has to consider.

The way we practice is, as I mentioned, we start with small compute units, prepare our production equivalent data sets, and then test with the volume, velocity, and adjustable compute units. With AWS Glue, we also come across a challenge where, if I have a multi-AZ, basically, multi-region deployment on multiple availability zone requirements for global solutions, if we would have to use a lot of cloud automation techniques to stage our code into a particular set of buckets globally, and those buckets could be replicated, and then can be leveraged when executing the Glue jobs.

This fundamental need is essentially catering to one of the requirements which Glue has is essentially those clusters are well organized in a particular environment, and it’s a cluster managed group. To set up those nodes by using cluster managed groups, they have to sit into the same region where these nodes are situated. Because of that, the code which is being picked up from S3 buckets have to be in the same region where the Glue jobs are being executed. From there, we have different learnings. Essentially, some of the best practices which we follow, we automate data pipelines, we compress, model, and catalog our physical data.

We always try to optimize with serverless transforms. Infrastructure as code is our practice. This allows us to lower our resource usage as well. We try to build feature factory settings. We schedule with directed acyclic graphs. Our philosophy is always to monitor, notify, and roll forward. Essentially, you can always catch up. We are reusing not only the code but the architecture build-ups as well, so you can implement the same blueprint across different solutions. Working really closely with AWS, we were able to provide feedback and get a lot of product features out to the market to solve certain challenges of ours as well. With this, I’ll switch over to a quick demo of how we catered to certain challenges.
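
Scheduling with acyclic graphs, one of the practices listed above, amounts to running each pipeline step only after the steps it depends on. A minimal sketch using Python's standard `graphlib` module follows. The step names are hypothetical and the talk does not say which scheduler was used.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps, each mapped to the steps it depends on.
pipeline = {
    "crawl_raw": set(),
    "transform": {"crawl_raw"},
    "catalog": {"transform"},
    "notify": {"transform", "catalog"},
}

# static_order() yields every step after all of its dependencies and
# raises graphlib.CycleError if the graph contains a cycle.
run_order = list(TopologicalSorter(pipeline).static_order())
```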

The first challenge which we touched upon was essentially: what if your source has data types which are fairly generic? As an example, what we are considering here is a network or application monitoring infrastructure, wherein the packets being ingested into your data environment have different data types. The way it is specified, text is the generic data type. Now, when we ingest this data, the pattern which we see here is that the response data column has numeric values. When you crawl through this data and ingest it, you would only see one data type ingested. In a subsequent job run, if you were to cater to a different data type, how would you achieve that? That’s essentially by using predefined schemas. As you see here, we have

Sarah-Lynn: Hey, Maulik. Hi, Maulik, it’s Sarah-Lynn. I just want to let you know that the demo is not showing on your screen right now.

Maulik: Is it? Just give me a moment.

Sarah-Lynn: There you go. Thank you.

Maulik: Okay. All right. No problem. Let me go back to the original discussion which I was having. As an example, you have your network monitoring tool which is ingesting the data. As you see, the data types which we are specifying here are fairly simple. You have request time, which is text; region IP is also text; and response data is also text. Initially, when your requests are going through, these are the response data sets which you are looking at, and those response data sets are often numeric values. Now, in the next ingestion, for example, you have different data.

This response data could go from a 301 to, essentially, “Okay, the value is okay.” From a numeric value, we move to a character value. With the auto-ingestion or auto-inference mechanism which Glue provides out of the box, after you crawl the data, your first ingestion sets the data type as big integer, and the next ingestion cycle would say, here is the mix of data which I have. Now, with this mix of data, it goes across multiple partitions and, essentially, creates multiple tables for us. It creates certain tables with big integer as the data type and certain tables with string as the data type.
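
The drift described above can be reproduced with a toy inference rule. This is not Glue's crawler logic. `infer_column_type` is a hypothetical stand-in that only shows why the same text column can be typed as big integer in one run and as string in the next.

```python
def infer_column_type(values):
    """Toy inference: 'bigint' if every value parses as an integer, else 'string'."""
    def is_int(value):
        try:
            int(value)
            return True
        except ValueError:
            return False

    return "bigint" if all(is_int(v) for v in values) else "string"


# First ingestion: the response data happens to be all numeric.
first_batch = ["200", "301", "404"]
# Next ingestion: the same column now carries a character value as well.
second_batch = ["200", "OK", "301"]

first_type = infer_column_type(first_batch)
second_type = infer_column_type(second_batch)
```

Because each run infers from only the data it sees, the two runs disagree on the type of the same column.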

Now, to crawl or look through this data, you have to hop across multiple tables in the same database. That challenge can be solved using a predefined schema. The way it works is, essentially, you define your schema in a bucket with a properties file. You can bring that schema into Glue’s Data Catalog, and that essentially builds up a predefined schema which is applied when you ingest data across the board. This way, you can achieve consistency of the data types. Now, the second problem which we touched upon is essentially: what if I have allocated a limited amount of data processing units?
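
The predefined-schema idea can be sketched as a properties-style definition that every ingestion applies, instead of inferring types per run. This is an assumption-laden illustration: the `column=type` file layout, the snake_case column names, and the `parse_schema` and `apply_schema` helpers are hypothetical, and the talk does not show the actual properties file or how Glue loads it.

```python
# Hypothetical properties-style schema, as it might be stored in a bucket.
SCHEMA_PROPERTIES = """
# column=type
request_time=string
region_ip=string
response_data=string
"""

# Supported type names and the Python callables used to coerce values.
CASTERS = {"string": str, "bigint": int}


def parse_schema(text):
    """Parse 'column=type' lines, skipping blank lines and comments."""
    schema = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, type_name = line.partition("=")
        schema[name.strip()] = type_name.strip()
    return schema


def apply_schema(record, schema):
    """Coerce every declared column of a record to its declared type."""
    return {name: CASTERS[type_name](record[name]) for name, type_name in schema.items()}


schema = parse_schema(SCHEMA_PROPERTIES)
numeric_row = apply_schema(
    {"request_time": "2019-05-01T10:00:00", "region_ip": "10.0.0.1", "response_data": 301},
    schema,
)
text_row = apply_schema(
    {"request_time": "2019-05-01T10:05:00", "region_ip": "10.0.0.2", "response_data": "OK"},
    schema,
)
```

Both rows end up with `response_data` as a string, so numeric and character values land in one consistently typed column.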

I’m using Spark’s DataFrame concept to execute my jobs. With Spark’s DataFrame concept along with Glue, the data distribution is inconsistent across the nodes, and we end up with the challenge that certain nodes get a lot of data load and may not have enough memory to cater to it. As you see in the logs here, this particular node had about five and a half gigs of memory, and the amount of data which we were trying to move was more than five and a half gigs. How do we achieve data distribution consistency across the board? Glue has its own black magic, which they call the dynamic frame, which really allows them to organize the data which we throw at their implementation.

That essentially caters to the data distribution requirements which we have. The third problem is the incremental load which we talked about. Out of the box, Glue provides us the dynamic frame capabilities. As you see here, we’re building a dynamic frame, and from the dynamic frame we are trying to ingest the data, and the data which we extract is the entire data chunk which we have from the source. We have to manage it in memory, chunk it by the date and time interval which we are specifying, and then, in turn, use only a slice or subset of the data to push through to S3. This needs a lot of memory or compute allocation just to cater to that challenge. To reduce this memory and processing unit allocation, we can also build this as a Spark implementation: we can build Spark DataFrames and deploy them to our cloud infrastructure. That’s it for my demo. Over to you, Sarah.
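
The in-memory slicing step of the incremental load described in the demo can be sketched in plain Python. This assumes hypothetical records with an `event_time` field and a hypothetical `slice_by_interval` helper. The demo itself works on a Glue dynamic frame, which is not reproduced here.

```python
from datetime import datetime


def slice_by_interval(records, start, end, ts_field="event_time"):
    """Keep only records whose timestamp falls in the half-open interval [start, end)."""
    return [r for r in records if start <= r[ts_field] < end]


# The whole extract is held in memory first, which is the cost the speaker points out.
full_extract = [
    {"id": 1, "event_time": datetime(2019, 5, 1, 9, 0)},
    {"id": 2, "event_time": datetime(2019, 5, 1, 10, 30)},
    {"id": 3, "event_time": datetime(2019, 5, 1, 12, 0)},
]

window_start = datetime(2019, 5, 1, 10, 0)
window_end = datetime(2019, 5, 1, 12, 0)

# Only this slice would be pushed on to the destination.
increment = slice_by_interval(full_extract, window_start, window_end)
```

A half-open interval keeps consecutive windows from loading the same record twice.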

Sarah-Lynn: Great. Thanks, Maulik. I’m turning back over to the PowerPoint presentation right now. At this time, we would like to get your questions answered. As a reminder, you can at any time ask a question utilizing the chat feature on the bottom of your screen. Maulik, we do have a couple of questions for you that came in. First one is, how did you or do you plan to handle near real-time data consumption using Glue?

Maulik: It’s a very good question. A real-time or near real-time data ingestion mechanism needs dedicated compute capacity and dedicated transformation capacity. To allocate that dedicated compute and transform capacity, as I mentioned in my presentation, we would probably use Amazon Kinesis or a Kafka-based implementation, which will allow us to ingest the data in a raw format into S3 buckets and then apply Glue’s transformation logic on top of it.

That way, we cater to the incoming data load with its velocity and volume. With Glue’s serverless transforms, we are transforming a subset of the data or the entire data set, but essentially we are allocating only limited memory and processing units.

Sarah-Lynn: Great. Thank you. The next question, how can we check data lineage at each touch point?

Maulik: Excellent question. Data Catalog is the feature which you use. As I mentioned, the Data Catalog is essentially a Hive metastore. Because the Data Catalog can talk to multiple sources, multiple stages, and multiple destinations, you are able to gather all the different data sources, and from that perspective you are able to manage the schema. Out of the box, the Data Catalog also comes with a feature where it maintains the versions of your schema.

You would be able to trace back to different versions of the schema, and you can verify that your source infrastructure, your staging infrastructure, and your destination infrastructure had the correct data mapping at any given point in time. You can validate that across multiple versions of the schema. That is how you can also maintain or manage your lineage. This Hive metastore can also be made available to other Hadoop infrastructure. If you’re looking at more of the data governance aspects, you can also look at the different data lineage solutions which are available in the Hadoop ecosystem.

Sarah-Lynn: Perfect. We only have time for one more question so here’s the last one. Can you provide an example of where else in the application ecosystem you have used data catalog?

Maulik: Nice. Really interesting question. Data catalog essentially harvests different schemas which are available across your data infrastructure. The way you would use it is you would use Glue’s SDK APIs and you would leverage or scan through your catalog schemas. Then, this becomes a kitchen sink for you to manage and you can organize this across your organization’s microservices infrastructure. You can perform the discovery of the metadata of your data and leverage that across your applications so that you now have a centralized mechanism to maintain your schemas. This goes in line with more of a data discovery and data as a service concept.
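
Scanning the catalog through the SDK, as described above, could look roughly like the following with the AWS SDK for Python. It is a sketch under assumptions: `collect_catalog_schemas` and the database name in the usage comment are hypothetical, and the function takes the Glue client as a parameter so that it stays independent of how credentials and regions are configured.

```python
def collect_catalog_schemas(glue_client, database_name):
    """Return {table_name: {column_name: column_type}} for one catalog database.

    glue_client is expected to behave like boto3.client("glue").
    """
    schemas = {}
    # get_tables is paginated, so walk every page of the table list.
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            schemas[table["Name"]] = {c["Name"]: c["Type"] for c in columns}
    return schemas


# Usage (needs boto3 and AWS credentials; "network_monitoring" is a hypothetical database):
#   import boto3
#   schemas = collect_catalog_schemas(boto3.client("glue"), "network_monitoring")
```

A service could expose the returned mapping to other applications as the centralized view of schemas that the answer describes.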

Sarah-Lynn: Great. Thanks, Maulik. I think that’s all the questions we were able to answer at this time. Also, be sure to check out and subscribe to DTV, a new digital transformation channel that brings in industry experts. Many thanks to our speaker, Maulik. Thank you everyone for joining us today. You’ll receive an email in the next 24 hours with this live presentation and a link to the webcast replay. If you have any questions, please contact us at info@infostretch.com or call 1-408-727-1100 to speak with a representative. Again, thank you all and enjoy the rest of your day.

Maulik: Thank you very much. Thanks everyone.