Agile Analytics for Actionable Intelligence

Learn how to streamline your organization’s disparate data

Praveen: Hello and welcome, everybody. First of all, I’d just like to make sure, because of the initial logistical problem: Sarah, can you hear me?
Praveen: Right. I’m going to take control. I’m Praveen, and I do everything data for a living. I have 15 years in computing overall, but for the past five years I’ve been doing engineering and data science in a rather selective, intensive mode, where I’m totally focused on data analytics projects. Before that I was pretty much a generalist in the technology field, and ended up being everywhere until I realized I belong here. In retrospect, it wasn’t a badly planned career. Here at Infostretch, I’m part of an extremely dynamic team of smart and creative engineers working on innovative big data and machine learning solutions for our different clients. During the rest of the presentation, you’re going to see me giving references to a lot of the work that’s brewing in our labs for different clients. Because we have a rather comprehensive topic at hand, I’ll come straight to the point. Today we’re going to talk data. Don’t be too serious about the title of the webinar. Now that you are here, that ends there. Not really. Let’s have a look at what we are going to cover. We’ll try to cover the entire rainbow from start to finish, from traditional data methods to the most contemporary and creative usage of data, which is bringing the most surprising changes in the way we live our lives. In part one we’re going to deal with some of the traditional data solutions, which I’m sure you are mostly familiar with. We’re going to talk about what they are, why they are still relevant and, to a small extent, how it is done. We’re also going to talk about some of the business contexts where they fit in.
But because this is something extremely familiar, we’re not going to spend much time on it, because we have something very interesting waiting in parts two and three. In part two, we’re going to talk about big data analytics, which is one of the most happening things at the moment. Everybody, at some point, will come across it. We’re going to capitalize on this very good opportunity to learn about it in the kind of detail that it deserves. In a while we’re going to move on to that. Lastly, we’re going to end with some very interesting and creative usage of data, in how a system statistically learns, and we’re going to incline more towards the machine learning part and the AI part. We are going to discuss some very interesting stories there. Let’s move forward, shall we? I’m going to go on to the next slide. There you go. I think it’s important to discuss what data analytics is. Just for academic reasons, I am going to read the definition. Data analytics is the process of converting data into useful information. That’s a no-brainer, but there are multiple methods, and this is just a start. A typical data analytics process follows this cycle, starting from the top box: acquire, explore, visualize, test and study. Under normal circumstances, data resides in small pockets spread across the organization. What we do is acquire it and bring it to a state where we can do some sort of analysis on top of it, because until we do that we cannot do any kind of visualization; it’s only then that it’s going to make sense. Then we test to make sure that whatever visualization we’ve done is meeting its purpose, and that it’s going to help us study the information that we started acquiring in the first place.
This is a straightforward process; many of us would have done it. I think we can clearly remember payslip information, HR-related information, and logistical information that many of you would have analyzed using some of the conventional tools that we’re going to talk about in the next slide. One of the most popular tools is Microsoft Excel, and the reason it’s the most popular is that it allows us to do so many things on the same platform. It can be installed on any system, it is handy, and it gives us the opportunity to do data exploration. Pivot tables are one of the most used features, enabling out-of-the-box analysis. We can do quick calculations and charting. I’m not spending too much time on it, with the assumption that this is extremely familiar. The next level of analytics, as many of you would have recognized, is databases, and though this is called traditional, it is still relevant. They are computer systems designed to store and organize data for transactional and analytical purposes. Let’s get into the details of the kinds of databases we are talking about. Normally databases are relational in nature, which means that we normalize the data, we organize the data around entities, and we create relationships between them. The basic information about an employee can go into one entity, the employee’s personal details can go in another table, and the employee’s activities can go in yet another table. What we are doing is creating logical boundaries of information which are interconnected, and such a system is called an RDBMS, a Relational Database Management System. It is a widely used system for holding information, allowing us to run ad hoc queries on top of it, or to use it as a back end for many applications.
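To make the relational idea above concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table names, column names and sample rows are hypothetical and chosen only for illustration; they follow the employee, personal details and activities split described in the talk, and the query shows how the separate entities are joined back together for an ad hoc question.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Three normalized tables (hypothetical names), linked by emp_id.
conn.executescript("""
CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE personal_details (
    emp_id INTEGER REFERENCES employee(emp_id),
    email  TEXT
);
CREATE TABLE activity (
    emp_id      INTEGER REFERENCES employee(emp_id),
    description TEXT
);
INSERT INTO employee VALUES (1, 'Asha'), (2, 'Ben');
INSERT INTO personal_details VALUES (1, 'asha@example.com'), (2, 'ben@example.com');
INSERT INTO activity VALUES
    (1, 'Filed expense report'),
    (1, 'Completed training'),
    (2, 'Joined project');
""")

# Ad hoc query: join the entities and count activities per employee.
rows = conn.execute("""
SELECT e.name, p.email, COUNT(a.description) AS activities
FROM employee e
JOIN personal_details p ON p.emp_id = e.emp_id
LEFT JOIN activity a ON a.emp_id = e.emp_id
GROUP BY e.emp_id, e.name, p.email
ORDER BY e.emp_id
""").fetchall()

for row in rows:
    print(row)
```

The join is the cost of normalization: each fact is stored once, but a question that spans entities has to stitch the tables back together.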
Examples are many, like MySQL, SQL Server, and Oracle, which comes up with many such databases. The second category is analytical databases. These are specialized forms of database released by companies like Oracle and some other organizations with products like Informatica and so on. What’s different about analytical databases is the way the data is kept. Like I told you, a classic RDBMS, which is in the first column, works on the concept of organizing data around entities and interconnecting them, but for analytics you’d rather keep all the data together, because we would like to do analysis which might incorporate data from multiple pockets. To make it easy to understand: that kind of data is called denormalized data, and it’s more human-oriented. Think of it like this: if your data is kept in six different branches and you have to get it to one place, that’s going to be hard work. The idea of an analytical database is that you keep all the data together in a form which doesn’t demand much of the person trying to do the analysis. If it is together, it’s going to be far easier to analyze, and the queries are simple. You’re going to find these kinds of analytical databases in use mostly in data warehouse and data mart situations. The third category is NoSQL. There could be an entire presentation on NoSQL, but the way I’d put it is that it’s a kind of database which gives you a lot of flexibility and is distributed, which means that it is quite capable of handling data at huge scale. Say you have plenty of call records, for example a telecom company with an enormous number of call records, or a hospital with a great many electronic medical records.
What they can do is store that data in a distributed fashion. The flexibility part comes in because in normal databases you create a structure first, and then you put in data which conforms to that structure. But in NoSQL, you can move the data in, and it will allow you to store data which may not be in a consistent form, and you can still run queries and do analysis on top of it. Those are the different categories of databases, and all three of them have their own business context. I’m going to move forward. Like I told you, this is the first section with familiar stuff, so we’re going to move faster through it. One thing I mentioned was data warehousing. When we talk about data analytics this becomes extremely crucial, because this is the way the world was moving forward before we came across the more contemporary technologies like big data and machine learning. It is expensive stuff; a lot of bigger companies invested millions of dollars in these kinds of solutions until they realized that they have limitations, which is where some breakthrough technologies come in. You can think of a data warehouse as a kind of superstore which has different shelves, but everything that you need to buy you can get under one roof. That’s what a data warehouse is all about. An organization might have different areas where data comes in. For example, it might have financial data, logistics data, marketing data, organizational data, and so on. If you decide to build a warehouse, the first thing you’re going to do is bring all that data to one place so that you can analyze it collectively. Think of it like a superstore, which still has different aisles and different shelves where the respective data is kept.
That is the reason why a non-technical icon has been used here. It can’t be complete unless I give you a visual representation of what normally happens. The best way to understand this diagram is to imagine the posh and swanky control room of a top company where the C-level executives are sitting, looking over huge shiny dashboards where all the reports are coming in. Think of that as the rightmost area, which covers strategic decisions, with business minds creating and targeting customers. But how do those reports come about? In the leftmost column, as you can see, the data can come from disparate sources, and it all comes down to one single container. It’s logically represented here as a back office, which simply means all the data comes to one point in a data store. Then we have to analyze it, which is usually done by the front office using transformations. In technical parlance, it’s called ETL, because we extract the information from different sources, we transform it to make it suitable for consumption by the reports, and then we load it. Essentially, what it means is that you get all the information together and produce the all-important reports for the people at the top who set the strategy. They want to take decisions; that’s what it’s all about. That’s why it pays: by collecting data from multiple sources and organizing it into simple structures, analysts and programs can leverage it for better decision making. Let’s take this opportunity to get into the details of big data analytics solutions. It logically follows the area that I just described. We talked about the traditional data solutions, and then their limitations. The limitation was largely in terms of scale.
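The extract, transform, load flow described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two CSV snippets, their column names and the in-memory list standing in for the warehouse are all hypothetical, and a real warehouse would use a dedicated ETL tool and database rather than Python lists.

```python
import csv
import io
from collections import defaultdict

# Two hypothetical source systems with inconsistent column names and casing.
FINANCE_CSV = "region,amount\nnorth,1200.50\nsouth,800\n"
MARKETING_CSV = "Region,Spend\nNorth,300\nSouth,150.25\n"

def extract(csv_text):
    """Extract: read raw rows from one source."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows, department, region_key, amount_key):
    """Transform: map each source's layout onto one common structure."""
    return [
        {
            "department": department,
            "region": row[region_key].strip().lower(),
            "amount": float(row[amount_key]),
        }
        for row in rows
    ]

def load(warehouse, records):
    """Load: append the unified records to the single store."""
    warehouse.extend(records)

warehouse = []  # stand-in for the data warehouse
load(warehouse, transform(extract(FINANCE_CSV), "finance", "region", "amount"))
load(warehouse, transform(extract(MARKETING_CSV), "marketing", "Region", "Spend"))

# A simple report over the combined data: total amount per region.
totals_by_region = defaultdict(float)
for record in warehouse:
    totals_by_region[record["region"]] += record["amount"]

print(dict(totals_by_region))
```

Once everything sits in one structure, the report is a single loop, which is the point of bringing the data under one roof.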
In the past five years, the world’s data has doubled, and a lot of challenges have come in as far as the scale of data is concerned. Let’s talk about it and the ways forward. Bear with me for a second. There you go. What is big data? In simple words, it’s a lot of data. To explain it in empirical terms, you must have heard of terabytes or petabytes, and trust me, as we talk now, even these terms are quite elementary; the dimensions have gone to the next level. If you’re talking about this kind of data, yes, you’re talking about big data. There are some other aspects as well that we are going to speak about. Big data is a term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications like the ones we spoke about in the previous slides. It’s an ecosystem which involves many things that we’re going to take the opportunity to understand in the coming slides. To mention them, we have challenges related to capture, curation, storage, search, sharing, transfer, analysis and visualization. That’s a lot, but for the moment you can take it as simple English, because we have other slides coming up. I believe I’ve gone a bit ahead. Just to give you an idea of what we mean by big data, by terabytes and petabytes, and what kinds of challenges there are, I’m going to give you examples of a couple of organizations which are already dealing with this kind of data. We’re just going to scratch the surface, right? Just to give you an estimate, Facebook at the moment deals with 300 petabytes of data, and I’m sure this figure will be out of date even as we talk. It is dealing with the processing of 600 petabytes of data every day. It has over 1 billion users per month, and the rest of it you can read.
I could have put in any kind of number and you would have had no option but to believe it, but trust me, these are real numbers. If you talk about the NSA, you can see the humongous dimensions of the data they are dealing with, and they don’t have any option apart from dealing with this much traffic; this is what they do. I want to talk about one of the hubs of data, which is Google: 15 petabytes in storage, 100 petabytes processed per day, 60 trillion pages indexed, and trust me, that’s data as well. Unique searches per month are greater than a billion, with 2.3 million searches per second, and that’s just the tip of the iceberg. There are a lot of words there, but the entire gist of this, including this particular graphic, is this: people say that data is increasing every second, and yes, the world’s data is going to double, but what has caused this data to double and double again and grow exponentially? The kind of data that we used to deal with in the past is probably not increasing at a rate as alarming as the overall data. The part that’s increasing is actually unstructured data, and one of the reasons behind it is that the number of devices has increased. We have so many different information sources. Don’t forget we’re living in a selfie age. Every now and then people are taking their pictures; that is data. People are uploading videos to YouTube; that’s data. Everybody has a phone, be it Android, iOS or whatever it may be, generating data day in, day out, and that is unstructured data. Let me quickly tell you what unstructured, structured and semi-structured data are, which are precisely the categories into which I can divide data. Structured data is very similar to what we’ve talked about in the past: data which has a schema, something that we can relate, something which has a structure.
Think of a table which has three columns: the first is employee ID, the second is employee email and the third is employee name. Whatever data comes into that container is going to come in that particular structure. That is structured data. Of course, that’s a very simple definition of it. Semi-structured data is something which has a structure, but within the details its format can change. A classic example is an XML file or a JSON file. When we talk about unstructured data, we’re talking about data like voice, images, videos and so on, because classically these are binary data. They don’t have any schema; they don’t have any structure. But they have a lot of hidden intelligence in them, and that’s precisely the kind of data which is increasing at an exponential rate and causing a tremendous amount of stress. That’s not a kind of data which the world can ignore. I’m also doing my share of increasing the unstructured data, because I’m quite sure this presentation is being recorded. That’s my contribution to the unstructured data of the world, and to technology. Now, to understand big data from a technology standpoint, though not entirely: this is how IBM defines big data, and trust me, this time the entire world agrees with it as well. Largely, big data can be described by the following acceptance criteria, if I may say so. Volume is one thing that we talked about. Data has to be extraordinarily big in order to qualify as a big data problem: terabytes, petabytes, zettabytes and so on. Popularly they call these the four V’s, right? The second V is velocity: if the rate of generation or accumulation of data is high, the way you denote that is velocity.
The understanding is that any data which is large and is being produced at a tremendous pace has a feature which qualifies it to be called big data. The third thing is that it has to have variety. The data may not necessarily be in one form; like I told you, there is the difference between structured, semi-structured and unstructured. Data can come in all forms and at all levels: images, videos, audio, and a number of other formats as well. Then there is veracity. I believe this is very, very important. It is a recently added V. What it means is that we cannot take data which is not trustworthy, which is not from genuine sources, because sometimes we have programs and applications which can produce data on their own. The authenticity of data is what is being emphasized here. There’s a lot of research going on as to how to truly define big data. A lot of new V’s are coming, and if more get added, I’m sure the technology is going to be even more interesting. These are essentially the big data system requirements. There are plenty of frameworks, and all of them have to meet the following requirements: they should be able to store data, they should have the capability to process the data, and they should be flexible enough to scale, because the amount of data can increase. In case the scale increases in future, that particular system, or big data ecosystem, should be able to scale itself out. Let’s move on to the next slide. Let’s bring together all the complexity that I told you about, which relates to processing, scaling and storing all different kinds of data, the velocity and all that. The traditional data technology that we spoke about earlier probably won’t be able to cut it anymore.
The entire idea of explaining the complexities was to show that under these kinds of circumstances you probably have to move to a big data system. There are a couple of others as well, but Hadoop is the best example if you have to walk through what a big data system is all about. It can fulfil all three of the requirements that we mentioned. Hadoop has the capability to store any kind of data, of any scale and of any variety. It has the capability to process that data as well. And it has the capability and flexibility to scale if the case requires it. It’s a framework that allows for distributed processing of large data sets. How that is distributed we’re going to talk about very clearly; just watch this space. Just to mention, it’s an open source data management tool, which means a lot of developers from around the world contribute to its development. It’s currently under the Apache Software Foundation. If you are interested in reading more about it, there’s a page linked there that you can click on. We’re going to understand how it works anyway. Let me just grab some water. In order to understand and appreciate the true power of Hadoop, we need to understand how it’s built. There are two ways to build a system: one is monolithic, the other is distributed. When I say monolithic, I’m talking about a system which is like a very smart star player who can dribble and shoot. It’s almost like a solo Iron Man kind of guy who can do any damn thing. In technology parlance, it corresponds to a very powerful server which is extremely spruced up: memory, hard disk, processing power, plenty of network bandwidth and all that. You will agree that these kinds of systems have their limits.
It’s pretty much like the player who might be powerful but is still a single player at the end of the day. There’s another methodology in which you create a system out of a combination of many not-so-smart machines which together can work in parallel. That’s called the distributed way to build a system. The analogy here is a team of good players who can pass well. It’s important to understand this philosophy because this is how Hadoop is built as well. Just a clarification of a term that I’m going to use a lot: cluster. A combination of individual systems forms a cluster. An individual computer, or rather server, is called a node. To create a Hadoop cluster, we stack up a lot of commodity servers, which are typical rack-based servers. Every company has plenty of them. We can put them to work in parallel to create a cluster, which forms the basis of Hadoop itself. That’s what most companies are doing. Companies like Facebook, Amazon and Google are all creating vast server farms. As I remember, Yahoo has a 40,000-node cluster for processing its huge data sets. It’s important to understand why we create a cluster of so many individual nodes. They are not going to work unless there is one single coordinating software. Think of this software as the master, and all the individual nodes which collectively form the cluster as slaves. Systems like Hadoop therefore very popularly follow a master-slave architecture. Now let’s understand what the single coordinating software does. Because we are talking about huge data, we need to partition the data and keep track of what data is kept where. That’s done by this single coordinating software. When we have to process that data, we have to collectively use the resources of all the individual nodes.
We have to run the jobs on specified data which is distributed across all the nodes. Somebody has to coordinate the computing tasks. In simple words, somebody has to understand that I want to run this particular job on this data, kept on these nodes, and check whether those nodes are alive or not. One more important thing: because we’re going to use cheap machines, that is, normal commodity rack-based servers, there’s a chance that one of them is going to become faulty. But Hadoop systems, or for that matter distributed systems in general, are designed in such a way that even if one piece of commodity hardware fails, the others carry on. The reason they can do that is that every piece of data is replicated multiple times, three times by default, though that’s configurable. It’s important for you to understand that there is tremendous fault tolerance. You’re not going to lose the data, so you’re not going to lose the processing, and you’re not going to lose the business intelligence either. Fantastic. The next takeaway is that it’s a master-slave architecture which allows you to process and store the data in a parallelized manner. That’s why it is powerful. That’s why it can be scaled. The way it happened is that Google actually did this first, because they were, logically, the first ones to deal with such huge data sets. They had to store the data, so they created the Google File System. They had to process the data, so they created a programming environment called MapReduce, based on the same concept that I talked about, parallel processing. Very simple. Then, once they reached a logical end, they contributed the entire code base to the Apache Software Foundation. Apache took it up and opened the doors for developers from around the world.
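The fault tolerance that comes from replication, as described above, can be illustrated with a toy simulation. This is a deliberately simplified sketch and not how HDFS actually places blocks (real placement is rack-aware and managed by the master); the node names, block names and round-robin placement rule are all assumptions made for the example.

```python
def place_blocks(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes).
    """
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement

def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one copy on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }

nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = ["b0", "b1", "b2", "b3", "b4", "b5"]
placement = place_blocks(blocks, nodes, replication=3)

# One commodity machine fails: every block is still readable.
after_one_failure = readable_blocks(placement, {"node2"})

# All three holders of a block fail at once: that block is lost.
after_three_failures = readable_blocks(placement, {"node1", "node2", "node3"})

print(sorted(after_one_failure))
print(sorted(after_three_failures))
```

With three copies of each block, data is lost only if every node holding a given block fails together, which is why a single faulty machine does not interrupt the processing.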
They called the programming environment part, which is MapReduce on your right, precisely what Google called it, but they changed the name of the file system from Google File System to HDFS. HDFS for storage and MapReduce for processing collectively formed Apache Hadoop, until a new member came in, called YARN, shown in dark orange in the middle. We’ll go through them quickly. MapReduce allows developers to write programs that decide how we’re going to process the data. They define the map and the reduce tasks using the MapReduce API and trigger them. In simple words, how do we process data as human beings? If somebody asks us to process some data, we first go as divergent as possible to open up the problem. Then we go as convergent as possible to get some results out of it. You can think of map and reduce in precisely the same way. Map allows you to open up the data in order to analyze it, and reduce brings it down to a final result. That’s what MapReduce is. HDFS, we all know what it is: a simplified distributed file system which allows you to keep the data intact on different nodes and still keep track of what is kept where. YARN is the one that sits on top. It knows where the data is kept. It knows how many nodes we have, how many nodes are alive, which nodes have memory available, which nodes have hard disk and processing capacity remaining; it knows everything. Basically, it’s the heart of the master. That’s what runs the show. These are the primary components of the recent version of Hadoop, with YARN coming in with MapReduce 2 and Hadoop 2. I’m sure you can find more details on that; Hadoop is such a comprehensive ecosystem that we could end up reading about it for six months or so.
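The diverge-then-converge idea behind map and reduce, as described above, can be shown with the classic word count. This is a plain-Python sketch that runs on one machine and does not use the Hadoop MapReduce API; the function names and sample lines are hypothetical, and the shuffle step stands in for the grouping that the framework would perform between the two phases across nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: open the data up into many small (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: converge each key's values into one final result."""
    return key, sum(values)

lines = ["big data is big", "data is everywhere"]

# In a cluster, each line (or file block) could be mapped on a different node.
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)
```

Because each call to `map_phase` looks at one line only, and each call to `reduce_phase` looks at one key only, both phases can be spread over many machines, which is what makes the model suit a cluster.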
It’s important to understand that it’s distributed. The ecosystem also contains the following components. Every component has its own essence, and in the interest of time I’m just going to give a one-liner for each, but I’ll try to explain as transparently as possible. MapReduce is a difficult thing to write. It’s not one of the easiest things to write. That’s why developers who had been dealing with databases using SQL, before scale came along, got scared. I should not use the word scared, but they got intimidated when they were exposed to a much more complex way of programming. To make their lives easy, companies like Yahoo and Facebook decided to create an abstraction so that developers could still write their transformation queries in an SQL-like way, and internally these get converted to MapReduce. Yahoo went on to create Pig and Facebook went on to create Hive. What does that mean? It means that using Pig and Hive, you can run all those transformations in a distributed environment using statements which are pretty similar to SQL. That’s how they made the lives of data developers easy, and they made the lives of people like me easy as well, because I learned them and I use them. HBase: think of HBase as a NoSQL database. Flume and Sqoop allow you to bring data in and take data out. You have to bring in the data for processing, and once you’ve processed it you move it out as well. Those are like messaging things. Spark... I’ll just go back. All right, so Spark and Oozie are some other components, but it would take a while for me to explain them. Spark is another way to process the data, using in-memory techniques, and Oozie is a scheduling engine. Sometimes you have to run jobs which are complex and you have to tie them together. There are many common big data application scenarios. You can go through them and study the examples.
eBay uses it for a recommendation engine. Verizon, of course, is dealing with calls all the time. They have plenty of information, and there are so many different cases in which they end up using big data. It’s the same with Nextbio and JPMorgan Chase, and trust me, these are only a few examples. There is hardly any company which is not dealing with it. This is the final section, which is Data Learning Solutions. It’s one thing to acquire the data, deal with the problem of having too much of it, process it and get some intelligence out of it. The true power of data comes in when data takes care of itself and helps the system learn like a human being. In case you’re not able to connect with what I’m saying, let’s get into the details. I want to go to the next slide. I’m going to take you there in an easy-to-understand manner. If you talk about Amazon and Netflix, these are services we love to use. Why so? Because they are like our alter egos. They know us as well as our wives do, and trust me, I’m not going to say more about that, but it’s close, right? They help us, and we really love them. I can’t imagine living my life without Google Maps. Why do we love to use these services? Because they are almost indispensable for us. They help us, and they help us in the way we want them to. Here’s a third example. Our lives would have been really difficult if there were no email system which could separate the useful information from all the fluff that we keep getting from so many different people. As you can see, there is a clear classification of important messages versus spam. With all these examples, and a lot of the innovations that are being ushered in, it’s not difficult to see that machine learning is the invisible hand behind all of this. These are not happening as standalone technologies.
They’re all connected by a common factor, which is machine learning, and we’re going to take this opportunity to talk about it. This is a rather cryptic definition, and it’s a lot of English: machine learning is a method of data analysis that automates analytical model building, using algorithms that iteratively learn from data. Machine learning allows computers to find hidden insights without being explicitly programmed. Let’s understand this in a simple way. To understand and appreciate the true power of machine learning, I think we need to draw some parallels between machine learning and human beings, and all through the next slides I’m going to keep drawing that parallel. Just the way human beings learn, we create systems where the rules are not explicitly written; those rules get formed based on the experiences that the machine gets. When I say experiences, in the technology world we can understand experiences as data. It is pretty much like a human being as a child: rules are based on experience, and that is how belief systems get formed, right? When I talk about experiences in the human sense, you can safely map them to the data that an application comes across. That experience can be users’ clicks or views. It can be a sample of past questions and answers, when you talk about data coming from the likes of Stack Overflow, Quora.com and so on. It can be users’ past travel data, which enables services like Google Maps. These are all experiences which we feed to a machine, and that machine grows pretty much like human beings grow. In these kinds of systems, the rules are updated automatically based on data, very much as all human beings are susceptible to the formation of beliefs, and to changes in the belief system, owing to the experiences that we have.
That’s exactly what machine learning is all about. Now, this is the typical machine learning workflow: first of all we pick the problem, then we represent our data, and then we apply an algorithm. What it means is that when we identify a problem, we have to understand what sort of problem it is. Is it a classification problem? Is it a regression problem? Is it about making some kind of recommendation? Or is it a clustering problem? We’re going to visit each of those problems in detail, and then you’ll probably be able to understand a bit more. Once we’ve decided what the problem is, we have to feed the data to an algorithm which we assume is going to be the best solution for it. The middle part, represent your data, is the entire process of bringing truly chaotic data into a form which can be provided as input to the algorithm. Usually the data is dirty and it is everywhere. We have to slice and dice it, process it, unify it and bring it into a form which can then be consumed by an algorithm to give us the behavior we want.
Let’s go through each step of that machine learning workflow. There are classically four buckets, and this is quite an accepted classification of most of the problems that machine learning normally encounters: the problem belongs to classification, regression, recommendation or clustering. Then, like I told you, represent your data. Data is going to be in different forms: unstructured data, images, and videos. Just for your information, this step of the machine learning workflow is something which consumes 80% of our time. Once we understand the problem, and once we understand the algorithm we have to run, the majority of the effort goes into bringing the data to a form where it can be fed to the algorithm.
The third one is applying the algorithm. It’s very simple to understand: when we provide the data to the algorithm, it forms a belief system, which is what we call a model. There are plenty of such examples. We can safely assume that the customers who spend the most money in my superstore are my best customers. That’s the initial rule I might freeze in my brain, but as I provide more data, the definition probably gets more fine-tuned. It’s an iterative process: with more and more data coming in, the rules keep getting updated, and that results in a model to which I can provide newer data to get information out. That’s what this is all about at a high level.
I spoke about the types of machine learning problems. We’re going to look at them one by one. Classification is always going to be in the form of questions. Whenever you have a problem in the form of a question, you can safely assume that it might be a classification problem. For example: is this email spam or ham? Or, in sentiment analysis, is this tweet positive or negative? Is this trading day going to be an up-day or a down-day? If you look at it, in every such question there is a problem instance: in the first case it’s the email, in the second it’s the tweet, and in the third it’s the trading day. Then we have the class labels, like spam or ham, positive or negative, up-day or down-day. There’s a way to go about this. It is one of my favorite topics and I would really like to spend time on it, but since we don’t have much time left I’m just going to go over it. Is there something wrong with this? I want to go back to regression.
The second form is regression, and it’s about predicting future patterns based on historic data. For example, some of the problems are like: what will be the price of the stock on a given date?
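A regression question like the stock price one can be sketched as fitting a straight line through past observations and extending it forward. The following is a minimal illustration in plain Python, not something shown in the webinar: the day numbers and prices are made-up values, and `fit_line` and `predict` are hypothetical helper names.

```python
# Hypothetical history: day number and closing price (made-up values).
days = [1, 2, 3, 4, 5]
prices = [10.0, 12.0, 14.0, 16.0, 18.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Evaluate the fitted line at a new point."""
    return slope * x + intercept


slope, intercept = fit_line(days, prices)
forecast_day_7 = predict(7, slope, intercept)
print(f"slope={slope}, intercept={intercept}, day 7 forecast={forecast_day_7}")
```

Real price series are rarely this linear, so this only shows the mechanics of learning a rule from historic data and applying it to a new input.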
In order to do that, of course, we’ll have to take the past data, plot it, and find a pattern in the spread of that information. Based on that pattern, using statistics and mathematics, we predict what the future variation is going to be. Or, for that matter, questions like how long it will take to commute from point A to point B; all of those fall in the same bucket. Again we’ll have to consider a lot of historical data. For example, say I leave my home at 9:00 and reach the office at 9:30 every day. Google Maps can work out that the place where I start from is in all likelihood my home, and the place where I reach at 9:30 is in all likelihood my office. What it does internally is a linear regression, to give you a small example. What will be the sales of this product in a given week? Again, it falls in the same category. These are all regression problems, just to give you a hint.
The third one is recommendation. Some examples: what kind of artists will this user like? What are the top 10 book picks for this user? I’m sure this must be ringing a bell, especially for the Amazon followers. If a user buys this phone, what else will they buy? There are plenty of such use cases, especially in the e-commerce space, all based on the behavioral study of users. If one million users go on to buy product A and product B, you can safely assume that they are alike, and if 100 customers out of those one million buy product C, then you can safely recommend product C to the rest of those customers. This is a small example to help you understand this complex topic. I am not going into the mathematics of all these things, but there’s a lot of material available and I’ll be happy to answer your questions in case you want more detail.
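The product A, B and C reasoning can be sketched as a simple co-purchase count: find users whose baskets overlap with yours, then suggest what they bought that you have not. This is only an illustrative sketch; the baskets are made-up data and `recommend` is a hypothetical helper, not the method any particular retailer uses.

```python
from collections import Counter

# Hypothetical purchase histories (user -> set of products bought).
baskets = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"A", "B", "C"},
    "u4": {"A", "D"},
    "u5": {"B"},
}


def recommend(user, baskets, top_n=1):
    """Suggest products bought by users whose baskets overlap with this user's."""
    own = baskets[user]
    scores = Counter()
    for other, items in baskets.items():
        if other == user:
            continue
        overlap = len(own & items)  # how alike the two users are
        if overlap == 0:
            continue
        for item in items - own:  # only products the user does not have yet
            scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]


suggestion = recommend("u2", baskets)
print(suggestion)
```

Here user `u2` bought A and B, just like `u1` and `u3`, who also bought C, so C scores highest.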
The final one which can be categorized as a machine learning problem is clustering. Clustering means grouping, the formation of groups which are based on some sort of logical deduction. For example, you would like to divide your social media users into groups for effective advertising, maybe by age, by certain likes, or by what they follow. You can divide customers into different categories for personalized offers, and I’m sure your interactions with banks and telecom providers can give you a hint about the sort of algorithms they run at the back in order to place you into certain groups. Once they find those groups, they can make a policy which applies to a particular group, and in turn that gets applied to you as well. How simple. Without it, it would be really difficult for us to imagine what really happens. Division of employees into different sections for targeted policies, once again, falls in the same group. There are plenty of use cases which are based on the same sort of problems that I discussed and the same sort of algorithms. But this is not even the tip of the iceberg. We are not even scratching the surface here. Machine learning is applied to a variety of business problems, and going forward, it’s going to be one of the major indicators of innovation.
You can go through it. The first line sums it up: machine learning is complex, but so are human beings. What we are trying to do is draw parallels with human beings, and we’re making applications which can behave like human beings and perform like human beings, which is not a simple thing.
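To make the clustering idea of grouping users by age concrete, here is a small sketch of k-means on one-dimensional data. The talk does not name a specific algorithm, so k-means is an assumption chosen for illustration; the ages and starting centers are made up, and `kmeans_1d` is a hypothetical helper.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group numbers around the nearest center, then move each center to its group's mean."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Index of the center closest to this value.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ended up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical ages of social media users.
ages = [18, 19, 21, 22, 45, 47, 50, 52]
centers, groups = kmeans_1d(ages, centers=[20.0, 40.0])
print(centers, groups)
```

Nobody tells the code where the boundary between the two age groups lies; it emerges from the data, which is the point of clustering.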
At times it becomes complex, and this is the single most happening thing at the moment, and all the bigger companies are putting their stakes there. In case it strikes you as a complex thing, trust me, with patience it’s going to start making sense gradually. I think that’s a strange tip from our side. I’d just like to tell you that we are part of a team which is working on some exciting solutions for clients from different verticals, especially in the fields of image analysis, video content analysis and clustering algorithms, primarily in the space of retailing, queue management, heat map analysis and all that kind of stuff. If I get an opportunity to talk with you, we can go into more detail about it. That’s pretty much it from my side. Sarah, do you want to take over?
Sarah: Hi, Praveen. Yes, thank you so much. We do have a couple of questions for you, and a couple of audience members have a few questions as well. Number one, according to you, what would be the biggest challenges and opportunities related to future trends in the world of data?
Praveen: It’s not a simple question to answer, because one of the biggest problems is the change in culture. We’re talking about big data. There are some companies, like Google and Facebook, which have taken their data quite seriously from the very beginning. They understood early in the cycle that data is a hot property, it’s an intellectual thing. But there are a lot of companies which are generating a lot of data but simply don’t have the means to capture it. First of all they need to create a shift in their culture so that they respect the data and capture it, and only then can all the magic that I spoke about, all the different technologies, work on top of it. I think that’s going to be the number one challenge. The number two challenge is that with the advent of these technologies, you’re asking for a different kind of human resources.
We’re talking not just about programmers, we’re talking about scientific computation programmers, and there’s a huge skill shortage for that. Unless there is a team which has got the expertise and the experience of multiple implementations, chances are that such a big step, which is going to involve a lot of cost, may be a risky proposition. That is also one of the big challenges.
Sarah: Okay. Thank you, Praveen. Perfect. Just in the interest of time, we want to be respectful of everyone else’s time as well, so we’re just going to ask one more question for you. How can an organization know if machine learning and AI is for them?
Praveen: I think it’s a rather simple question to answer. If an organization is making too many human judgments, then I think that organization must certainly involve machine learning. You can be a smart retailer who has about 25 years of experience, and you’ve seen many winters and many summers. You’ve got a fair bit of an idea that in the coming summer this kind of cooling is going to be in demand, and you can probably throw up a figure and sound like a person who makes excellent decisions. But there’s a lot of risk in making those kinds of decisions. What machine learning can do in those situations is augment those fantastic experience-based judgment calls with the backing of mathematically calculated, statistically digitized, data-driven decisions.
Sarah: Great. Thank you so much for being in, and I just want to thank everyone

This is a straightforward process; many of us would have done it. I think we can clearly remember payslip information, HR-related information and logistical information that you would have processed many times using some of the conventional tools we’re going to talk about in the next slide. One of the most popular tools is Microsoft Excel, and the reason it’s the most popular is that it allows us to do so many things on the same platform. It can be installed on any system at hand, it is handy, and it gives us the opportunity to do data exploration. Pivot tables are one of the most used features, enabling out-of-the-box analysis. We can do quick calculations and charting. I’m not spending too much time on it, on the assumption that this is extremely familiar.
The next level of analytics, as many of you would have recognized, is databases, and though they are called traditional, they are still relevant. They are computer systems designed to store and organize data for transactional and analytical purposes. Let’s get into the details of the kinds of databases we are talking about. Normally databases are relational in nature, which means that we normalize the data, we organize the data around entities, and we create relationships between them. The information about an employee can go into one table, the employee’s personal details can go into another table, and the employee’s activities can go into yet another table. What we are doing is creating logical boundaries of information which are interconnected, and such a system is called an RDBMS, a Relational Database Management System. It’s a widely used system for holding information, allowing us to run ad hoc queries on top of it or use it as a back-end for many applications.
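The employee example can be sketched with Python’s built-in `sqlite3` module: three normalized tables tied together by the employee ID, and one ad hoc query that joins them. The table names, column names and rows below are made up for illustration only.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE personal_details (
        employee_id INTEGER REFERENCES employee(id),
        city TEXT
    );
    CREATE TABLE activity (
        employee_id INTEGER REFERENCES employee(id),
        description TEXT
    );
    INSERT INTO employee VALUES (1, 'Asha'), (2, 'Ben');
    INSERT INTO personal_details VALUES (1, 'Pune'), (2, 'Austin');
    INSERT INTO activity VALUES (1, 'code review'), (1, 'deployment'), (2, 'testing');
""")

# Ad hoc query: one row per employee, pulling facts from all three tables.
rows = conn.execute("""
    SELECT e.name, p.city, COUNT(a.description) AS activities
    FROM employee e
    JOIN personal_details p ON p.employee_id = e.id
    LEFT JOIN activity a ON a.employee_id = e.id
    GROUP BY e.id, e.name, p.city
    ORDER BY e.id
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

Each fact lives in exactly one table, and the join reassembles the picture only when a question is asked.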
Examples are many, like MySQL and SQL Server, and Oracle normally comes up with many such databases. The second category of databases is analytical databases. These are specialized forms of database released by companies like Oracle and by other organizations with products like Informatica and all that stuff. What is different about these analytical databases is the way the data is kept there. Like I told you initially, a classic RDBMS, which is in the first column, works on the concept of organizing data around entities and interconnecting them. But for analytics you’d rather keep all the data together, because we would like to do analysis which might incorporate data from multiple pockets. To make it easy for you to understand, that kind of data is called de-normalized data, and it’s more human-oriented. Think of it like this: if your data is kept in six different branches and you had to get it to one place, that’s going to be hard work. The idea of an analytical database is that you keep all the data together in a form which doesn’t take much effort for the person doing the analysis to bring together. If it is already together, it’s going to be far easier to analyze. The queries are simple, and you’re going to find these kinds of analytical databases in use mostly in data warehouse and data mart kinds of situations.
The third category is NoSQL, and there could be an entire presentation on NoSQL, but the way I’m going to put it is that it’s a kind of database which gives you a lot of flexibility and is distributed, which means it is quite capable of handling huge scales of data. Say you have plenty of call records, for example a telecom company with an enormous number of call records, or a hospital with a great many electronic medical records.
What they can do is store that data in a distributed fashion. The flexibility part comes in because in normal databases you create a structure first, and after that you put in data which conforms to that structure. But in NoSQL, you can move the data in, and it will allow you to store data which may not be in a consistent form, and still you can end up running queries on it and doing analysis on top of it. These are the different categories of databases, and all three of them have their own business context. I’m going to move forward. Like I told you, this is the first section, it’s familiar stuff, and we’re going to move faster through it.
One thing that I mentioned to you is data warehousing. When we talk about data analytics this becomes extremely crucial, because this is the way the world was moving forward before we came across contemporary technologies like big data and machine learning. It’s expensive stuff. A lot of bigger companies invested millions of dollars in these kinds of solutions until they realized that they have limitations, which is where some breakthrough technologies come in. You can think of a data warehouse as a kind of superstore which has got different shelves, but everything that you need to buy you can get under one roof. That’s what a data warehouse is all about. An organization might have different areas where data comes in. For example, it might have financial data, logistics data, marketing data, organizational data, any number of kinds of data. If you decide to build a warehouse, the first thing you’re going to do is bring all that data to one place so that you can analyze it collectively. Think of it like a superstore which still has different aisles and different shelves where the respective data is kept.
That is the reason why a non-technical icon has been used here. It can’t be complete unless I give you a visual representation of what normally happens. The best way to understand this diagram is to imagine the posh and swanky control room of a top company, where the C-level executives are sitting and lounging over huge shiny dashboards where all the reports are coming in. Think of that as the rightmost area, where strategic business decisions are made and customers are targeted. But how do those reports come by? In the leftmost column, as you can see, the data can come from disparate sources, and it all comes down to one single container. It’s logically represented here as a back office, which simply means all the data comes to one point in a data store. Then we have to analyze it, which is usually done by the front office using transformations. In technical parlance it’s called ETL, because we extract the information from different sources, we transform it to make it suitable for consumption by reports, and then we load it. Essentially it means that you get all the information together and produce some important reports for the people at the top who set the strategy and want to take decisions. That’s what it’s all about. That’s why it pays: by collecting data from multiple sources and organizing it into simple structures, analysts and programs can leverage it for better decision making.
Let’s take this opportunity to get into the details of big data analytics solutions. It logically follows the area that I just described. We talked about the traditional data solutions and then their limitations. The limitation was largely in terms of scale.
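The extract, transform and load steps described for the warehouse can be sketched in a few lines of Python. Everything here is an assumption for illustration: the two sources are inline strings standing in for a finance export and a marketing feed, and a plain dictionary stands in for the warehouse.

```python
import csv
import io
import json

# Hypothetical source systems, each with its own format and naming.
finance_csv = "customer,amount\nacme,100.50\nglobex,20\n"
marketing_json = '[{"Customer": "ACME", "clicks": 7}, {"Customer": "Initech", "clicks": 3}]'


def extract():
    """Pull raw records out of each source as-is."""
    finance = list(csv.DictReader(io.StringIO(finance_csv)))
    marketing = json.loads(marketing_json)
    return finance, marketing


def transform(finance, marketing):
    """Unify keys and types so both sources describe the same customers."""
    unified = {}
    for row in finance:
        key = row["customer"].strip().lower()
        record = unified.setdefault(key, {"revenue": 0.0, "clicks": 0})
        record["revenue"] += float(row["amount"])
    for row in marketing:
        key = row["Customer"].strip().lower()
        record = unified.setdefault(key, {"revenue": 0.0, "clicks": 0})
        record["clicks"] += int(row["clicks"])
    return unified


def load(unified, store):
    """Write the cleaned records into the single reporting store."""
    store.update(unified)
    return store


warehouse = load(transform(*extract()), {})
print(warehouse)
```

Once the records sit in one store under one naming scheme, a report such as revenue per click for each customer becomes a single lookup rather than a cross-department exercise.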
In the past five years, the world’s data has doubled, and a lot of challenges have come in as far as the scale of data is concerned. Let’s talk about that and the ways forward. Bear with me for a second. There you go. What is big data? In simple words, it’s a lot of data. To explain it in empirical terms, you must have heard of terabytes and petabytes, and trust me, as we talk now, even these terms are quite elementary; the dimensions have gone to the next level. If you’re talking about this kind of data, yes, you’re talking about big data. Well, there are some other things as well that we are going to speak about. Big data is a term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications like the ones we spoke about in the previous slides. It’s an ecosystem which involves many things that we’re going to take the opportunity to understand in the coming slides. To mention them, we have challenges related to capture, curation, storage, search, sharing, transfer, analysis and visualization. That’s a lot; for now take it as simple English, because we have other slides coming. I believe I’ve gone a bit ahead.
Just to give you an idea of what we mean when we say big data, when we talk about terabytes and petabytes, and of the kinds of challenges, I’m going to give you examples of a couple of organizations which are already dealing with this kind of data. We’re just going to scratch the surface, right? Just to give you an estimate, Facebook at the moment deals with 300 petabytes of data, and I’m sure this figure, even as we talk, is going out of date. It is dealing with the processing of 600 petabytes of data every day. Over 1 billion users per month, and the rest of it you can read.
I could have put in any kind of number and you would have had no option but to believe it, but trust me, these are real numbers. If you talk about the NSA, you can see the humongous dimension of the data they are dealing with, and they don’t have any option apart from dealing with this much traffic; this is what they do. I want to talk about one of the hubs of data, which is Google: 15 petabytes in storage, 100 petabytes processed per day, 60 trillion pages indexed, and trust me, that’s data as well. Unique searches per month greater than a billion, 2.3 million searches per second, and that’s just the tip of the iceberg. There are so many words there, but the gist of this, including this graphic, is the following. People say that data is increasing every second, and yes, the world’s data is going to double, but what has caused this data to double and double again and grow exponentially? The kind of data that we used to deal with in the past is probably not increasing at a rate as alarming as the overall data. The part that’s increasing is actually unstructured data, and one of the reasons is that the number of devices has increased. We have so many different information sources. Don’t forget we’re living in a selfie age: every now and then people are taking their pictures, and that is data. People are uploading videos to YouTube, and that’s data. Everybody has got a phone, be it Android, iOS or whatever it may be, and they are generating data day in, day out, and that is unstructured data.
Let me quickly tell you what unstructured, structured and semi-structured data are, which are precisely the categories into which I can divide data. Structured data is very similar to what we’ve talked about in the past: data which has got a schema, something that we can relate, something which has got a structure.
Think of a table which has got three columns: the first is employee ID, the second is employee email and the third is employee name. Whatever data comes into that container is going to come in that particular structure. That is structured data. But trust me, that’s a very simple definition of it. Semi-structured data is something which has a structure, but within the details its format can change. A classic example can be an XML file or a JSON file. When we talk about unstructured data, we’re talking about data like voice, images, videos and all that stuff, because classically these are binary data. They don’t have any schema, they don’t have any structure. But they have a lot of hidden intelligence in them, and that’s precisely the kind of data which is increasing at an exponential rate and causing a tremendous amount of stress. It’s not a kind of data which the world can ignore. I’m also doing my share of increasing unstructured data, because I’m quite sure this presentation is getting recorded. That’s my contribution to the unstructured data of the world and to technology.
Now let’s understand big data from a technology standpoint, though not entirely. This is how IBM defines big data, and trust me, this time the entire world agrees with it as well. Largely, big data can be described by the following acceptance criteria, if I may say so. Popularly they call them the four V’s, right? Volume is the one thing that we’ve talked about. Data has to be extraordinarily big in order to qualify as a big data problem: terabytes, petabytes, zettabytes and all that stuff. The second V is velocity: if the rate of generation or accumulation of data is high, the way you denote that is velocity.
The understanding is that any such data which is large and is getting produced at a tremendous pace has a feature which qualifies it as big data. The third thing is that it has to have variety. The data may not necessarily be in one form; like I told you, there is the difference between structured, semi-structured and unstructured. Data can come in all forms: data, images, videos, audio and a number of other formats as well. Then veracity. I believe this is very, very important. It’s a recently added V. What it means is that we cannot take data which is not trustworthy, which is not from genuine sources, because sometimes we have programs and applications which can produce data on their own. The authenticity of data is what is being emphasized here. There’s a lot of research going on as to how to truly define big data. A lot of new V’s are coming up. If a new V also comes in from your side, I’m sure technology is going to be even more interesting.
These are essentially the big data system requirements. There are plenty of frameworks, and all of them have to meet the following requirements. They should be able to store data, they should have the capability to process the data, and they should be flexible enough to scale, because the amount of data can increase. In case the scale increases in future, that particular big data ecosystem should be able to scale itself out. Let’s move on to the next slide. Bring together all the complexity that I told you about, which relates to the processing, the scaling and the storing of all different kinds of data, the velocity and all that stuff. The traditional data technologies that we spoke about earlier probably won’t be able to cut it anymore.
The entire idea of explaining the complexities to you was to show that under these kinds of circumstances you probably have to move to a big data system. There are a couple of others as well, but Hadoop is one of the most established and is the best example if you have to walk through what a big data system is all about. It can conform to all three of the requirements that we mentioned. Hadoop has got the capability to store any kind of data, of any scale and of any variety. It has the capability to process that data as well. It has the capability and flexibility to scale if the case requires it. It’s a framework that allows for distributed processing of large datasets. How that is distributed we’re going to talk about very, very clearly. Just watch this space. Just to mention, it’s an open source data management tool, which means a lot of developers from around the world contribute to its development. It’s currently under the Apache Software Foundation. If you’re interested in reading more about it, there’s probably a link to a page in there. You can just click on it. We’re anyway going to understand how it works. Let me just grab some water.
In order to understand and appreciate the true power of Hadoop, we need to understand how it’s built. There are two ways to build a system. One is monolithic, the other is distributed. When I say monolithic, I’m talking about a system which is like a very smart star player who can dribble and shoot. It’s almost like a solo Iron Man kind of guy who can do any damn thing. In technology parlance, it corresponds to a very powerful server which is extremely spruced up: memory, hard disk, processing power, lots of network bandwidth and all that stuff. You will agree that these kinds of systems often have their limits.
Pretty much like the player who might be powerful but is a single player at the end of the day. There’s another methodology in which you create a system out of a combination of many not-so-smart machines which together can work in parallel. That’s called the distributed way to build a system. I liken this to a lot of good players who can pass well. It’s important to understand this philosophy because this is how Hadoop is made as well. Just a clarification of some of the terms, because this is a term that I’m going to use a lot: cluster. A combination of individual systems forms a cluster. An individual computer, or rather server, is called a node. In order to create a Hadoop cluster, we stack up a lot of commodity servers, which are typical rack-based servers. Every company has plenty of them. We can put them to work in parallel to create a cluster, which forms the basis of Hadoop itself. That’s what most of the companies are doing. Companies like Facebook, Amazon, and Google are all creating vast server farms. As I remember, Yahoo has a 40,000-node cluster for processing their huge data sets. It’s important to understand why we have created a cluster of so many individual nodes. They are not going to work unless there is one single coordinating software. Think of this software as the master, and all the individual nodes, which collectively form the cluster, as slaves. Very popularly, systems like Hadoop are therefore said to follow a master-slave architecture. Now let’s understand what the single coordinating software does. Because we will be talking about huge data, we need to partition the data as well, so that we understand what sort of data is kept where. That’s done by this single coordinating software. When we have to process that data, we have to collectively use the resources of all the individual nodes.
We have to run the jobs on specified data which is distributed across all the nodes. Somebody has to coordinate the computing tasks. In simple words, somebody has to understand that I want to run this particular job on this data, kept on these nodes, and I want to see whether those nodes are alive or not. The good thing about this is that, because we’re going to use cheap machines, that is, the normal commodity rack-based servers, there’s a chance that one of them is going to get faulty, but these Hadoop systems, or for that matter these distributed systems, are designed in such a way that even if one piece of commodity hardware fails, the rest carry on. The reason they can do that is that every piece of data is replicated multiple times, three times by default, but that’s configurable. It’s important for you to understand that there is tremendous fault tolerance. You’re not going to miss out on the data, so you’re not going to miss out on the processing as well. You’re not going to miss out on the business intelligence as well. Fantastic. The next takeaway is that it’s a master-slave architecture, which allows you to process the data and store the data in a parallelized manner. That’s why it is powerful. That’s why it can be scaled. The way it happened is that Google did this particular thing, because they were the first ones, logically so, to deal with huge data sets. They had to store the data, so they created the Google File System. They had to process the data, so they created a programming environment called MapReduce, based on the same concept that I talked about, parallel processing. Very simple. Then once they reached a logical end, they published the design. Apache took it up and opened the doors for all the developers from around the world.
They called the programming environment part, which is on your right, MapReduce, precisely the same as what Google called it, but they changed the name of the file system from Google File System to HDFS. HDFS for storage and MapReduce for processing collectively formed Apache Hadoop, until a new member came in, which is called YARN, in the dark orange in the middle. We’ll quickly go through these. MapReduce allows developers to write programs that decide how we’re going to process the data. What they do is define the map and reduce tasks using the MapReduce API and trigger them. In simple words, how do we process data in a human way? If somebody asks us to process some data, we first of all go as divergent as possible to open up the problem. Then we go as convergent as possible in order to get some results out of it. You can think of map and reduce as precisely the same thing. Map allows you to open up the data in order to analyze it, and reduce finally boils it down to a final result. That’s MapReduce for you. HDFS, we all know what it is. It’s a simplified distributed file system which allows you to keep the data intact on different nodes and still keep track of what is kept where. YARN is something which sits on top of all this. It knows where the data is kept. It knows how many nodes we have, how many nodes are alive, which nodes have memory available, which node has hard disk remaining, and the processing. It knows everything. Basically, it’s the heart of the master. That’s what runs the show. These are the primary components of the recent version of Hadoop, with YARN coming in alongside MapReduce in Hadoop 2. I’m sure you can grab more details on that. Hadoop is such a comprehensive ecosystem that we could probably end up reading about it for six months or so.
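To make the map and reduce idea above a little more concrete, here is a minimal sketch in plain Python. It does not use the Hadoop MapReduce API at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs on one machine. It only shows the shape of the idea: map opens the data up into many small key-value pairs, and reduce converges them into a final result per key.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # In a real cluster each line (or block) could be mapped on a different node;
    # here we simply loop locally to show the data flow.
    emitted = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

In a distributed setting the map calls would run in parallel next to the data, and only the grouped pairs would travel over the network to the reducers.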
It’s important to understand it’s distributed. It also contains the following components. Every component has its own essence, and in the interest of time I’m just going to give a one-liner for each, but I’ll try to explain as transparently as possible. MapReduce is a difficult thing to write. It’s not one of the easiest things to write. That’s why the developers who were dealing with databases earlier, using SQL and all that stuff, got scared. I should not use the word scared, but they got intimidated when they were exposed to one of the more complex ways of programming. In order to make their life easy, companies like Yahoo and Facebook decided to create an abstraction so that developers can still write their transformation queries in something like SQL. Internally these get converted to MapReduce. Yahoo went on to create Pig and Facebook went on to create Hive. What does that mean? It means that using Pig and Hive, you can run all those transformations in a distributed environment using statements which are pretty similar to SQL. That’s how they made the life of data developers easy, and they made the life of people like me easy as well, because I learn them and I use them. HBase: think of HBase as a NoSQL database. Flume and Sqoop are what allow you to bring the data in and take the data out. You have to bring in the data for processing, and once you process it you move it out as well. Those are like messaging things. Spark. I’ll just go back. All right, so Spark and Oozie are some other components, but it’ll take a while for me to explain. Spark is another way you can process the data, but using in-memory techniques, and Oozie is something like a scheduling engine. Sometimes you have to run jobs which are too complex and you have to tie them together. There are many common big data application scenarios. You can go through them. Study the examples.
eBay uses it for its recommendation engine. Verizon, of course, is dealing with calls all the time. They have plenty of information, and there are so many different cases in which they end up using big data. Same with NextBio and JPMorgan Chase, and trust me, these are only a few examples. There’s hardly any company which isn’t dealing with it. This is the final section, which is Data Learning Solutions. It’s important to understand that it’s one thing to acquire the data and deal with the problem of too much of it, and then be able to process it and get some intelligence out of it. The true power of data comes in when data takes care of itself and helps the system learn like a human being. In case you’re not able to connect with what I’m saying, let’s get into the details. I want to go to the next slide. I’m going to drive you there in an easy-to-understand manner. If you talk about Amazon and Netflix, these are the services we love to use. Why so? Because they are like our alter egos. They know us like our wives do, and trust me, I’m not going to talk more about it, but this is close, right? They help us and we really love them. I can’t imagine living my life without Google Maps. Why do we love to use them? Because they are almost indispensable for us. They help us, and they help us in the way that we want them to. The third one, here’s another example. Our life would have been really difficult if there was no email system which could separate the information from all the fluff that we keep getting from so many different people. As you can see, there is a clear classification of important messages from the spam. With all these examples and a lot of innovations that are being ushered in, it’s not difficult to imagine that machine learning is the invisible hand behind all of this. These are not happening as standalone technologies.
They’re all connected by this common factor, which is machine learning, and we’re going to take this opportunity to talk a bit about it. This is a rather cryptic definition of it. Seems like a bit of English. Machine learning is a method of data analysis that automates analytical model building using algorithms that iteratively learn from data. Machine learning allows computers to find hidden insights without being explicitly programmed. Let’s understand what this is in a simple way. In order to understand and appreciate the true power of machine learning, I think we need to draw some parallels between machine learning and human beings, and through the next slides I’m going to keep drawing that parallel. Just the way human beings learn, we create systems where the rules are not explicitly written, but those rules get formed based on the kind of experiences that the machine gets. When I say experiences in the technology world, we can understand experiences as data. Pretty much like a human being as a child: rules are based on experience, and that’s how belief systems get formed, right? When I talk about experiences in the human form, you can safely map that to the data that an application comes across. That experience can be user clicks or views. It can be a sample of past questions and answers, when you talk about data coming from the likes of Stack Overflow and Quora.com and all that stuff. It can be a user’s past travel data, to enable usage of Google Maps and all that stuff. These are all experiences which we feed to a machine, and that machine grows pretty much like human beings grow. In these kinds of systems, the rules are updated automatically based on data. Very much like all human beings are susceptible to forming beliefs and also to changing their belief system owing to the experiences that they have.
That’s exactly what machine learning is all about. Now, this is the typical machine learning workflow, where we first of all have to pick the problem, represent our data, and apply an algorithm. What does it mean? It means that when we identify a problem, we have to understand what sort of problem it is. Is it a classification problem? Is it a regression problem? Is it about making some kind of recommendation? Or is it a clustering problem? We’re going to visit each and every one of those problems in detail, and then you’ll be able to understand a bit more. Once we’ve decided what the problem is all about, we have to feed the data to an algorithm which we assume is going to be the best solution for it. The middle part, which is represent your data, is the entire process of bringing truly chaotic data into a form which can be provided as an input to the algorithm. Usually the data is dirty, it is everywhere. We have to slice and dice and process and unify it, and bring it into a form which can then be consumed by an algorithm to give us that kind of behavior. Let’s go through each and every step of that machine learning workflow. There are classically four buckets, and this is quite an accepted classification of most of the problems that machine learning will normally encounter, where the problem is going to belong to either classification, regression, recommendation, or clustering. Then, like I told you, represent your data. Data is going to be in different forms: unstructured data, images, and videos. Just for your information, this step of the machine learning workflow is something which consumes 80% of our time. Once we understand the problem, and once we understand the algorithm that we have to run over these things, the majority of the effort is going to be in bringing the data to the form where it can be fed to the algorithm.
The third one is applying the algorithm. It’s very simple to understand that when we provide the data to that algorithm, it forms a belief system, which is what we call a model. There are plenty of such examples. We can safely assume that the customers who spend the most money in my superstore are my best customers. That’s the initial rule that I might freeze in my brain, but then as I provide more data to it, the definition probably gets even more fine-tuned and all that. It’s an iterative process. With more and more data coming in, the rules keep getting updated, and that results in a model to which I can provide newer data to get information out. That’s what this is all about at a high level. I spoke about the types of machine learning problems. We’re going to look at them one by one. When I talk about classification, it’s always going to be in the form of questions. Whenever you have a problem in the form of a question, you can safely assume that it might be a classification problem. For example, is this email spam or ham? Or sentiment analysis: is this tweet positive or negative? Is this day going to be an up-day or a down-day? Now if you look at it, in every such question there is a problem instance. In the first case it’s the email, in the second one it’s the tweet, and in the third one it’s the trading day. Then we have the labels, like spam or ham, positive or negative, up-day or down-day and all that stuff. There’s a way to go about this particular thing. It is one of my favorite topics and I really would like to spend time on it, but since we don’t have too much time left I’m just going to go over it. Is there something wrong with this? I want to go back to regression. The second form is regression, and it’s about predicting future patterns based on historic data. For example, some of the problems like, what will be the price of the stock on a given date?
In order to do that, of course, we’ll have to take the past data, plot it, and then find a pattern in the spread of that particular information. Find the pattern and, based on that, using statistics and mathematics, predict what the future variation is going to be. Or for that matter, questions like how long it will take to commute from point A to point B. All of those fall in the same bucket. Again we’ll have to consider a lot of historical data. For example, say I leave my home at 9:00 and reach the office at 9:30 every day. Google Maps does a linear regression to understand that the place where I start from is in all likelihood my home, and the place where I reach at 9:30 is in all likelihood my office. What it does internally is a linear regression, to give you a small example. What will be the sales of this product in a given week? Again it falls in the same category. These are all regression problems, just to give you a hint. The third one is recommendation. Some of the examples for that are: what kind of artists will this user like? What are the top 10 book picks for this user? I’m sure this must be ringing a bell, especially for some of the Amazon followers. If a user buys this phone, what else will they buy? There are plenty of such use cases that you’re going to find, especially in the e-commerce space, and they are solved based on a behavioral study of the user. If one million users go on to buy product A and product B, you can safely assume that they are alike, and if 100 customers out of those one million buy product C, then you can safely recommend product C to the rest of those customers. This is a small example which I can give you in order for you to understand this complex topic. I am not going into the mathematics of all these things, which gets really tricky, but there’s a lot of material available and I’ll be happy to answer your questions in case you want it detailed further.
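The product A, B, and C example above can be sketched in a few lines of Python. This is only an illustration of the "users who bought the same things are alike" idea under simple assumptions, not how any particular e-commerce site does it; the `recommend` function, the `baskets` layout, and the user names are all hypothetical.

```python
def recommend(baskets, target_user, shared_items):
    """Suggest items bought by users who share `shared_items` with the target user.

    `baskets` maps a user name to the set of products that user has bought.
    """
    shared = set(shared_items)
    target_basket = baskets[target_user]
    if not shared <= target_basket:
        return []  # the target user is not part of the "alike" group

    counts = {}
    for user, basket in baskets.items():
        if user == target_user or not shared <= basket:
            continue  # skip the target and anyone outside the "alike" group
        for item in basket - target_basket:
            counts[item] = counts.get(item, 0) + 1

    # Most frequently co-purchased items first; ties broken alphabetically.
    return sorted(counts, key=lambda item: (-counts[item], item))


if __name__ == "__main__":
    baskets = {
        "u1": {"A", "B", "C"},
        "u2": {"A", "B", "C", "D"},
        "u3": {"A", "B"},
        "u4": {"A"},
    }
    print(recommend(baskets, "u3", ["A", "B"]))
```

Here user `u3` bought A and B, so the sketch looks at what other A-and-B buyers also bought and ranks those items by how often they appear.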
The final one which can be categorized as a machine learning problem is clustering. Clustering means grouping. Clustering means formation of groups, groups which are based on some sort of logical deduction. For example, you would like to divide your social media users into groups for effective advertising, maybe by age, by certain kinds of likes, or by certain follows and all that stuff. You can divide customers into different categories for personalized offers, and I’m sure your interactions with banks and telecom providers can give you a hint about what sort of algorithms they will be running at the back in order to place you into certain groups. Once they find those groups, what they can do is make a policy which applies to that particular group, and in turn that gets applied to you as well. How simple. Without it, it would be really difficult for us to imagine what really happens. Division of employees into different sections for targeted policies. Once again it falls in the same group. There are plenty of use cases which are based on the same sort of problems that I discussed and the same sort of algorithms. But this is not even the tip of the iceberg. We are not even scratching the surface here. Machine learning is widely applied in a variety of business problems, and going forward, it’s going to be one of the major indicators of innovation. You can go through it. The first line sums it up. Machine learning is complex, but so are human beings. What we are trying to do is draw parallels with human beings, and we’re making applications which can behave like human beings, which can talk like humans, sorry, not talk. Behave like human beings and perform like human beings, which is not a simple thing.
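Going back to the clustering idea described above, grouping users by age can be sketched with a tiny one-dimensional k-means in plain Python. k-means is used here as an assumption for illustration only; the talk does not say which clustering algorithm is meant. The `kmeans_1d` function, the sample ages, and the starting centroids are all made up.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Assign each number to its nearest centroid, then move each centroid to its group's mean."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a group ends up empty.
        centroids = [
            sum(group) / len(group) if group else centroids[i]
            for i, group in enumerate(clusters)
        ]
    return centroids, clusters


if __name__ == "__main__":
    ages = [18, 21, 22, 45, 47, 50]
    centers, groups = kmeans_1d(ages, centroids=[20, 40])
    print(centers)
    print(groups)
```

Each resulting group could then get its own offer or policy, which is the "make a policy for the group" step mentioned above. Real customer segmentation would use many more features than age.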
At times it becomes complex, and this is the single most happening thing at the moment, and all the bigger companies are putting their stakes there. In case it comes across to you as a complex thing, trust me, with patience it’s going to start making sense gradually. I think that’s a small tip from our side. I’d just like to tell you that we are part of a team which is working on some exciting solutions for clients from different verticals, especially in the fields of image analysis, video content analysis, and clustering algorithms, primarily in the space of retail, queue management, heat map analysis and all that kind of stuff. If I get an opportunity to talk with you, we can detail more about it. That’s pretty much it from my side. Sarah, do you want to take it over? Sarah: Hi, Praveen. Yes, thank you so much. We do have a couple of questions for you, and the audience has a few questions as well. Number one, according to you, what would be the biggest challenges and opportunities related to future trends in the world of data? Praveen: It’s not a simple question to answer, because one of the biggest problems is the change in culture. We’re talking about big data. There are some companies, like Google and Facebook, that have taken their data quite seriously from time immemorial. They understood early in the cycle that data is a hot property, it’s an intellectual asset. But there are a lot of companies who are generating a lot of data but simply don’t have the means to capture it. First of all they need to create a shift in their culture so that they respect the data and capture it, and only then can all the magic that I spoke about, all the different technologies, work on top of it. I think that’s going to be the number one challenge. The number two challenge is going to be that, with the advent of these technologies, you’re asking for a different kind of human resources.
We’re talking not just about programmers, we’re talking about scientific computation programmers, and there’s a huge skill shortage for that. Unless there is a team which has the expertise and the experience of multiple implementations, chances are that such a big step, which is going to involve a lot of cost, may be a risky proposition. That’s also one of the big challenges. Sarah: Okay. Thank you, Praveen. Perfect. Just in the interest of time, we want to be respectful of everyone else’s time as well, so we’re just going to ask one more question for you. How can an organization know if machine learning and AI is for them? Praveen: I think it’s a rather simple question to answer. If any organization is making too many human judgments, then I think that organization must certainly involve machine learning. You can be a smart retailer who has about 25 years of experience, and you’ve seen many winters and many summers. You have a fair idea that in the coming summer, this kind of product is going to be in demand, and you can probably throw out a figure and sound like a person who makes excellent decisions. But there’s a lot of risk in making those kinds of decisions. What machine learning can do in those kinds of situations is augment those fantastic experience-based judgmental decisions with the backing of mathematically calculated, statistically derived, data-driven decisions. Sarah: Great. Thank you so much for being in, and I just want to thank everyone