Understanding Machine Learning: Quick takeaways for CRM
Updated: Mar 9
Machine Learning is a method of data analysis that automates analytical model building. It enables computer systems to learn and improve progressively over time, using a wide range of algorithms that learn from iterations over a training dataset to solve problems that can be remodelled into broad, well-defined general problem categories. The approach stands in contrast to the traditional style of problem-solving with explicitly programmed (coded), well-defined instructions.
Artificial Intelligence (or AI), in the simplest sense, is the intelligence exhibited by a machine. Machine Learning falls under this broad umbrella of AI. The key goals of Artificial Intelligence are the human-like abilities to reason, seek knowledge, perceive, communicate, move and manipulate objects in the surroundings.
Here, we will walk through Machine Learning’s broad categorization of algorithms with their typical real-life use cases, business implementation challenges, key industry leaders and breakthroughs, ways to quickly get hands-on with building models for optimized solutions, and the limitations the technology faces today. Finally, we shall go over the implementation of Artificial Intelligence in the current CRM domain and the potential it possesses to reform the industry.
Machine Learning algorithms can be broadly categorized, on the basis of the type of inputs and outputs, into these four generic flavors.
Supervised Learning learns from labelled data to classify a new data point into one of the annotated groups, or to predict a value from known features using regression.
Semi-Supervised Learning trains on a small fraction of annotated data (which requires human intuition) combined with a large chunk of un-annotated data.
Unsupervised Learning discovers groups in unannotated data, clustering unlabeled data points by similarities in their known features.
Reinforcement Learning teaches an algorithm to perform actions based on experience gathered from past activity.
Visualizing Machine Learning Classification and Broad Applications 
Deep Learning falls under the broad family of Machine Learning algorithms which:
Comprise multiple layers of non-linear processing units (neural networks) forming deep-layered architectures
Learn multiple levels of representation corresponding to different levels or layers of abstraction, each level contributing to the conceptual hierarchy
Learn in both supervised (for classification problems) and unsupervised manners (typically for pattern analysis or cluster formation)
Machine Learning algorithms find wide use in diverse scenarios across a range of problems. In the Machine Learning domain, the well-known ‘No Free Lunch’ theorem broadly condenses to the fact that no single algorithm works best for all scenarios, so multiple algorithms must be tried on the data to find the best performer. This is especially true for Supervised Learning. The factors at play when deciding the best algorithm for a specific problem are generally the size and structure of the data set (input type) and the result (output) desired from the available data. Certain complex cases might require devising a custom algorithm that incorporates select features from other algorithms.
A simple analogy for the challenge of choosing the best algorithm: to clean the floor of a house effectively, the choice between a vacuum, broom or mop depends on the total floor area to be cleaned, the type of floor and the available resources (both cleaning accessories and personnel). The cleaning tools are analogous to the algorithms, the floor size to the data size, and the type of floor to the data set’s type (or structure).
Supervised Machine Learning for predictive modelling can be generalized as learning a target function (f) that maps the input variables (X) to the output variable (Y). Suppose we need to make future predictions (Y) for new input variables (X), but we do not know the function’s form or exact nature; had we known it, we would have used it directly and Machine Learning would not be needed. Predictive modelling, or predictive analytics, mapping Y = f(X) to make predictions of Y for new values of X, is the most commonly used form of Machine Learning, and the goal is to make the predictions as accurate as possible by fine-tuning the learned function (f).
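The idea of learning an approximation of the unknown f from example (X, Y) pairs can be sketched in a few lines. This is a minimal illustration with made-up data, where the "unknown" target function happens to be f(x) = 2x + 1 and we recover it with an ordinary least-squares fit:

```python
# Sketch: learning an approximation of an unknown target function f
# from example (x, y) pairs. Hypothetical, noise-free data for clarity.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]          # generated by the "unknown" f(x) = 2x + 1
a, b = fit_line(xs, ys)
prediction = a * 10 + b       # apply the learned approximation to a new input
print(a, b, prediction)       # → 2.0 1.0 21.0
```

The learned (a, b) recovers the target function exactly here because the data are noise-free; with real data the fit only approximates f, which is why accuracy must be tuned and validated.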
For Machine Learning enthusiasts eager to understand the basics, what follows is a quick insight into 10 popular Machine Learning algorithms (from both the Supervised and Unsupervised Learning groups) as used by ML Engineers, Data Scientists and AI experts.
Logistic Regression is a Supervised Learning model borrowed from the discipline of statistics to perform classification, and is especially used for binary classification problems. The goal is to learn coefficient values for each input variable so that the non-linear, S-shaped logistic (sigmoid) function maps any input to a value between 0 and 1, which is then rounded to 0 or 1 corresponding to the two output classes. A new data point is predicted to fall under one of these output classes, and the extrapolation of a linear regression model can likewise help with forecasting over varying features.
Linear and Logistic Regression 
Logistic Regression behaves similarly to Linear Regression, especially after removing attributes unrelated to the output variable and attributes that are highly correlated with each other, but it improves on basic Linear Regression because of its non-linearity.
Application: A good use case of regression in the CRM domain is market forecasting. It can also be used specifically to forecast a company’s quarterly growth and revenue from Sales-team data, using various predictors and independent variables. Another use case is stock price forecasting for the stock market.
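The coefficient-learning process described above can be sketched with plain gradient descent on the log-loss. The data here are hypothetical (e.g. hours of product usage vs. whether a lead converted), chosen only to show the sigmoid thresholding at 0.5:

```python
import math

# Minimal logistic-regression sketch: learn a weight w and bias b so that
# sigmoid(w*x + b) separates two classes. Hypothetical one-feature data.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.5, epochs=2000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            # gradient of the log-loss for a single example
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

# e.g. "hours of product usage" vs. "converted to paying customer"
xs = [0.5, 1.0, 1.5, 4.0, 4.5, 5.0]
ys = [0,   0,   0,   1,   1,   1]
w, b = train(xs, ys)
predict = lambda x: 1 if sigmoid(w * x + b) >= 0.5 else 0
print(predict(1.0), predict(4.8))  # separable data → 0 1
```

The sigmoid output is a probability between 0 and 1; rounding at 0.5 produces the binary class, exactly as described in the paragraph above.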
Decision Trees are another Supervised Learning model, used to split a data set into smaller and smaller groups (called nodes) based on one feature at a time. They fall under the category of classifiers. Each time a subset of the data is split, the predictions become more accurate if the resulting sub-groups are more homogeneous (contain similar labels) than before. The advantage of using a computer is that it is exhaustive and more precise than manual data exploration, especially when classifying huge data sets.
Decision Trees 
Application: A good example of decision trees in the CRM domain is case classification. They can be used to group cases by type for customer representatives, and then to escalate priority cases for immediate attention.
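The core step of tree building, choosing the split that makes the resulting sub-groups most homogeneous, can be sketched with a single-feature split search. The data and the Gini impurity criterion here are illustrative assumptions (CART-style trees use Gini; other variants use entropy):

```python
# Sketch of the core decision-tree step: pick the split threshold that makes
# the two resulting groups most homogeneous, measured with Gini impurity.
# Hypothetical (feature value, class label) pairs.

def gini(labels):
    """Gini impurity of a binary-labelled group (0 = perfectly homogeneous)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(points):
    """Try every observed value as a threshold; minimise weighted impurity."""
    best_t, best_score = None, float("inf")
    for t, _ in points:
        left = [y for x, y in points if x <= t]
        right = [y for x, y in points if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(points)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# e.g. case "age in days" vs. "escalated (1) or not (0)"
points = [(1, 0), (2, 0), (3, 0), (10, 1), (12, 1), (15, 1)]
print(best_split(points))  # → 3 (perfectly separates the two groups)
```

A full tree simply applies this search recursively to each resulting sub-group until the nodes are homogeneous enough or a depth limit is reached.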
The Naive Bayes algorithm is a Supervised Learning classification method based on computing the likelihood of an event under the assumption that all events are independent of each other. It can be trained quickly on small data subsets because of this basic assumption that all the features needed to correctly classify the data are “independent” of each other. Even when the independence assumption does not hold (when features have relationships between them), the algorithm often still performs very well, classifying by the maximum likelihood of a label given the set of features rather than by the probability of each feature.
Bayes Theorem 
The Gaussian Naive Bayes algorithm is a modification of Naive Bayes for continuous, Gaussian-distributed input data.
Application: The Gaussian Naive Bayes algorithm is commonly used in industry for text classification, especially for filtering spam emails based on the probability that certain words appear in spam rather than non-spam. In the CRM business, it has the potential to be modelled for understanding and rating customer interest from email conversations, prompting Sales representatives to work on unsatisfied customers and close satisfied leads or opportunities.
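The per-class Gaussian likelihood idea can be sketched in a few lines. This is a one-feature illustration with made-up numbers (e.g. a count of "spammy" words per email); with one feature the independence assumption is trivially satisfied:

```python
import math

# Gaussian Naive Bayes sketch: classify by the maximum of
# prior * likelihood, with a per-class normal distribution per feature.
# Hypothetical one-feature data.

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(samples):
    """samples: {label: [feature values]} -> {label: (mean, var, prior)}."""
    total = sum(len(v) for v in samples.values())
    params = {}
    for label, xs in samples.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        params[label] = (mean, var, len(xs) / total)
    return params

def predict(params, x):
    # pick the class with the highest prior-weighted likelihood
    return max(params,
               key=lambda c: params[c][2] * gaussian_pdf(x, params[c][0],
                                                         params[c][1]))

# e.g. count of flagged words per email: ham vs. spam (made-up numbers)
params = fit({"ham": [1.0, 2.0, 1.5, 2.5], "spam": [8.0, 9.0, 10.0, 11.0]})
print(predict(params, 2.2), predict(params, 9.5))  # → ham spam
```

With several features, the naive assumption means the per-feature likelihoods are simply multiplied together, which is what makes training so fast.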
Boosting is an ensemble technique under Supervised Learning that attempts to create a strong classifier from a number of weak classifiers. This is done by building a model from the training data, then creating a second model that attempts to correct the errors of the first. Models are added until the training set is predicted perfectly or a maximum number of models is reached. The resulting classification model can then be used to make predictions.
AdaBoost (or Adaptive Boosting) underpins modern boosting methods, most notably Stochastic Gradient Boosting Machines. It is typically used with short decision trees. After the first tree is created, its performance on each training instance is used to weight how much attention the next tree should pay to each instance. Training data that is difficult to predict receives higher weight, whereas easy-to-predict instances receive less. Models are created sequentially, each updating the weights on the training instances that shape the learning of the next tree in the sequence. After all the trees are built, predictions are made for new data, with each tree weighted by how accurate it was on the training data.
Adaboost Representation 
Application: AdaBoost finds use in identifying donors for a charity based on income, hours worked per week and various other features. It has the potential to predict customers’ interest and likelihood of buying a company’s products based on their revenue, past purchases, similarity to current customers using the product and other useful features, for Sales representatives to know before contacting them. It can support lead and opportunity scoring.
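The sequential re-weighting described above can be sketched with one-feature decision stumps as the weak learners. The data are hypothetical (e.g. income vs. likely donor), and labels follow the +1/-1 convention standard for AdaBoost:

```python
import math

# AdaBoost sketch with one-feature decision stumps as weak learners.
# Hypothetical data; labels are +1 / -1 as is conventional for AdaBoost.

def stump_predict(x, threshold, polarity):
    return polarity if x <= threshold else -polarity

def best_stump(xs, ys, weights):
    """Pick the (threshold, polarity) with the lowest weighted error."""
    best = None
    for t in xs:
        for pol in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, pol) != y)
            if best is None or err < best[0]:
                best = (err, t, pol)
    return best

def adaboost(xs, ys, rounds=5):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []                      # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        err, t, pol = best_stump(xs, ys, weights)
        err = max(err, 1e-10)          # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, pol))
        # up-weight the misclassified points for the next round
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, pol))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    # each stump votes, weighted by its accuracy-derived alpha
    score = sum(a * stump_predict(x, t, pol) for a, t, pol in ensemble)
    return 1 if score >= 0 else -1

# e.g. "annual income (k$)" vs. "likely donor (+1) or not (-1)"
xs = [10, 20, 30, 60, 70, 80]
ys = [-1, -1, -1, 1, 1, 1]
model = adaboost(xs, ys)
print(predict(model, 15), predict(model, 75))  # → -1 1
```

The alpha weights implement the final step described above: each tree's vote counts in proportion to how accurate it was on the weighted training data.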
Stochastic Gradient Descent Classifier (SGDC) is an incremental gradient descent method under Supervised Learning that converges quickly, making it very fast compared to other iterative models capable of similar performance. This is because the gradient is calculated over a ‘mini-batch’ (a small sample of data points) instead of a single example, so vectorization libraries can be applied to batches of the training data rather than computing each step separately. It also yields smoother convergence, as the gradient computed at each step is averaged over more training examples. SGDC can also help prevent overfitting.
Application: In the area of CRM, Stochastic Gradient Descent Classifier can be used to group customers eligible for good discounts from all accounts for a particular product or service while not making loss and increasing chances to grow revenue from Sales.
SGDC is used in geophysics, specifically for applications of Full Waveform Inversion (FWI). Stochastic gradient descent is also the de facto standard for training artificial neural networks in combination with the back-propagation algorithm, and a least-mean-squares filter based on it is capable of mimicking an adaptive filter.
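The mini-batch averaging described above can be sketched directly: instead of updating on one example at a time, the gradient is averaged over a small batch before each step. The data are hypothetical (e.g. months since last purchase vs. discount eligibility), and a logistic loss is assumed for the linear classifier:

```python
import math
import random

# Mini-batch SGD sketch for a linear classifier with logistic loss.
# The gradient is averaged over a small batch rather than a single example,
# giving the smoother convergence described above. Hypothetical data.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_train(data, lr=0.5, epochs=500, batch_size=2, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)                       # stochastic: random order
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # average the per-example gradients over the mini-batch
            gw = sum((sigmoid(w * x + b) - y) * x for x, y in batch) / len(batch)
            gb = sum((sigmoid(w * x + b) - y) for x, y in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b

# e.g. "months since last purchase" vs. "eligible for discount" (made up)
data = [(0.5, 0), (1.0, 0), (1.5, 0), (4.0, 1), (4.5, 1), (5.0, 1)]
w, b = sgd_train(list(data))
classify = lambda x: 1 if sigmoid(w * x + b) >= 0.5 else 0
print(classify(1.0), classify(4.8))  # separable data → 0 1
```

With a vectorized library the two `sum(...)` lines become a single matrix operation over the batch, which is where the speed advantage comes from.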
Support Vector Machine (SVM) is a very popular Supervised Learning technique for classification. It finds the data boundary (a hyperplane) that best separates the data points in the input variable space by their class, maintaining the maximum distance from the points of either class, based on learned coefficients.
Support Vector Machine (SVM) 
Application: In the area of CRM, SVM can be very useful for identifying customers or accounts with no communication over a period of time, so that Sales representatives can follow up and sustain business relationships.
SVM is also typically used in text classification and in handwritten character recognition.
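The maximum-margin idea can be sketched for a one-feature linear SVM by minimising the hinge loss plus an L2 penalty with sub-gradient descent (a simplified, Pegasos-style training loop; the data and labels are made up, using the +1/-1 convention usual for SVMs):

```python
# Linear SVM sketch: minimise hinge loss + L2 penalty by sub-gradient
# descent to find a maximum-margin boundary. One feature, hypothetical data.

def svm_train(xs, ys, lr=0.01, lam=0.01, epochs=5000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            if y * (w * x + b) < 1:     # inside the margin: hinge is active
                w += lr * (y * x - lam * w)
                b += lr * y
            else:                        # outside: only the L2 penalty acts
                w -= lr * lam * w
    return w, b

# e.g. "weeks since last contact" vs. "account at risk (+1) or healthy (-1)"
xs = [1, 2, 3, 8, 9, 10]
ys = [-1, -1, -1, 1, 1, 1]
w, b = svm_train(xs, ys)
classify = lambda x: 1 if w * x + b >= 0 else -1
print(classify(2), classify(9))  # → -1 1
```

The `y * (w*x + b) < 1` test is the margin condition: points closer than the margin (the support vectors) are the only ones that push the boundary, which is the defining property of SVMs.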
K-means Clustering is an Unsupervised Learning technique for cluster formation, borrowed from signal processing as a method of vector quantization. It iterates two simple steps: 1. Assignment; 2. Optimization. Once the centroids are initialized based on the data distribution, each point is assigned to its nearest centroid, and the points falling in each cluster are then used to update that cluster’s centroid. These two steps continue iteratively until the clusters stabilize.
K-Means Clustering 
Application: In the CRM world, K-means clustering can be used for a wide range of clustering applications, such as grouping leads by their company’s revenue, or grouping accounts on various criteria for analytics, giving representatives insight about the accounts (organizations).
Netflix uses a complex system fundamentally based on K-means clustering to group movies based on genre tastes and other criteria of users to provide them with recommendations. It is believed that about 2000 groups exist for the Netflix users based on feature correlation.
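The two alternating steps described above can be sketched in one dimension with k = 2. The revenue figures are hypothetical, chosen so the two clusters (small vs. enterprise accounts) are obvious:

```python
# K-means sketch (one-dimensional, k = 2) showing the two alternating steps:
# assign each point to its nearest centroid, then move each centroid to the
# mean of its assigned points. Hypothetical "company revenue" values.

def kmeans(points, c1, c2, iterations=10):
    for _ in range(iterations):
        # 1. Assignment step: nearest centroid wins each point
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        # 2. Optimization (update) step: centroid moves to the group mean
        c1 = sum(g1) / len(g1)
        c2 = sum(g2) / len(g2)
    return c1, c2

revenues = [1, 2, 3, 100, 110, 120]   # e.g. small vs. enterprise accounts
c1, c2 = kmeans(revenues, c1=0.0, c2=50.0)
print(c1, c2)  # → 2.0 110.0
```

In practice the result depends on the initial centroids, which is why production systems run K-means several times from different starting points and keep the best clustering.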
Density-based spatial clustering of applications with noise (DBSCAN) is a type of Unsupervised Learning model used for data clustering. The approach finds groups as regions of high density separated by regions of low density, with cluster formation controlled by a configurable density. It is especially useful for finding density-based, arbitrarily shaped patterns. Its one drawback is that very sparsely located data points are identified as noise and do not fall into any cluster.
Cluster formation using DBScan 
Application: An application of DBSCAN in the CRM domain is customer segmentation. It is commonly used to group accounts and frequent customers by paradigms such as buying strength, taste or preference into defined clusters, in order to suggest products or services similar to what other customers in the group prefer buying.
Amazon uses a complex system based on such clustering to form customer segments for product recommendations. Similar techniques are used by pizza chains to identify good parlor locations based on where customers live.
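The density-based cluster growth, including the noise labelling mentioned above, can be sketched in one dimension. The parameters `eps` (neighbourhood radius) and `min_pts` (minimum neighbours for a core point) are DBSCAN's two standard knobs; the location values are made up:

```python
# DBSCAN sketch (one-dimensional): grow clusters from "core" points that
# have at least min_pts neighbours within eps; points reachable from no
# core point are labelled noise (-1). Hypothetical customer locations.

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)
    cluster = -1

    def neighbours(i):
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1                 # noise (may be claimed later)
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:     # j is also a core point: expand
                queue.extend(j_nbrs)
    return labels

points = [1.0, 1.2, 1.4, 10.0, 10.1, 10.3, 50.0]   # last point is isolated
print(dbscan(points, eps=0.5, min_pts=2))  # → [0, 0, 0, 1, 1, 1, -1]
```

Note how the isolated point at 50.0 ends up labelled -1: this is exactly the "sparse points become noise" behaviour described as the algorithm's drawback.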
Principal Component Analysis is the main dimensionality reduction technique among the numerous Unsupervised Learning models. It derives fewer variables (in decreasing order of importance) that are linear combinations of the original variables and are mutually uncorrelated. One of the reasons for performing a principal component analysis is to find a smaller group of underlying variables out of the vast number of variables that describe huge datasets.
Principal Component Analysis with 2 features
Application: PCA is a well-known feature extraction (or transformation) technique, which finds use in facial recognition from image processing.
In the CRM business, Principal Component Analysis is used in customer churn analysis and prediction. Because of discrepancies across collection channels and data-gathering methods, raw customer data tends to be imprecise, unbalanced and high-dimensional, which degrades model performance.
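The derivation of the first, most important component can be sketched for 2-D data: centre the data, form the covariance matrix, and extract its leading eigenvector (here via simple power iteration). The data are hypothetical and perfectly correlated, so the component should point along the y = x diagonal:

```python
import math

# PCA sketch: the first principal component of 2-D data is the leading
# eigenvector of the covariance matrix (found here by power iteration).
# Hypothetical, perfectly correlated data.

def first_component(data, iterations=100):
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centred = [(x - mx, y - my) for x, y in data]
    # entries of the 2x2 covariance matrix
    cxx = sum(x * x for x, _ in centred) / n
    cyy = sum(y * y for _, y in centred) / n
    cxy = sum(x * y for x, y in centred) / n
    v = (1.0, 0.0)
    for _ in range(iterations):                # power iteration
        v = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(v[0], v[1])
        v = (v[0] / norm, v[1] / norm)
    return v

data = [(1, 1), (2, 2), (3, 3), (4, 4)]
vx, vy = first_component(data)
print(round(abs(vx), 4), round(abs(vy), 4))  # → 0.7071 0.7071
```

Projecting each point onto this direction replaces two correlated variables with one uncorrelated score, which is the dimensionality reduction PCA delivers at scale for high-dimensional customer data.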
End-to-end Reinforcement Learning falls under the newest of the ML categories, Reinforcement Learning. It is an end-to-end technique (from perception through to action) that teaches a model to perform actions over a course of iterations based on defined rules and experience. The technique involves a single layered or recurrent neural network, without modularization, trained using reinforcement learning. It draws on various aspects of Artificial Intelligence to imitate human intelligence and actions.
Application: Google DeepMind successfully used End-to-end Reinforcement Learning in AlphaGo (2016) to teach a computer to master the game of Go.
Some day it might be useful for CRM to have virtual representatives functioning in areas of Sales, Marketing, Service and more, developed and trained for specific industries using End-to-end Reinforcement Learning.
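The core learning rule behind reinforcement learning can be sketched with tabular Q-learning on a toy problem, a far simpler setting than the end-to-end deep networks discussed above, but using the same reward-driven update. The corridor environment and all parameters here are illustrative assumptions:

```python
import random

# Reinforcement-learning sketch: tabular Q-learning on a tiny 1-D corridor.
# States 0..4; the agent is rewarded only on reaching state 4 and learns,
# from repeated episodes, that moving right is the optimal policy.

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(5) for a in (-1, +1)}  # left / right
    for _ in range(episodes):
        s = 0
        while s != 4:
            # epsilon-greedy action choice: mostly exploit, sometimes explore
            if rng.random() < epsilon:
                a = rng.choice((-1, +1))
            else:
                a = max((-1, +1), key=lambda act: q[(s, act)])
            s2 = min(max(s + a, 0), 4)
            reward = 1.0 if s2 == 4 else 0.0
            # the Q-learning update rule
            best_next = 0.0 if s2 == 4 else max(q[(s2, -1)], q[(s2, +1)])
            q[(s, a)] += alpha * (reward + gamma * best_next - q[(s, a)])
            s = s2
    return q

q = q_learning()
policy = [max((-1, +1), key=lambda a: q[(s, a)]) for s in range(4)]
print(policy)  # → [1, 1, 1, 1]  (always move right)
```

End-to-end systems like AlphaGo replace the table `q` with a deep neural network mapping raw perception to action values, but the experience-driven update is conceptually the same.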
Concluding the discussion, the key takeaway for everyone, especially beginners, is that the choice of algorithm depends mainly on: 1. The size, quality and nature of the data sets; 2. The desired computational time; 3. Task urgency; and 4. The desired output. Even experienced Data Scientists cannot reliably pick the best-performing algorithm without implementing and comparing several.
Although there are many other Machine Learning algorithms, these are the most popular ones. If you are just taking off with Machine Learning, they will give you the essence and flavor of Machine Learning and its usability for improving CRM functions.
Deng, L.; Yu, D. (2014). “Deep Learning: Methods and Applications”. Foundations and Trends in Signal Processing. 7 (3–4): pp. 1–199.
A Smola and S.V.N. Vishwanathan, Introduction to Machine Learning, Cambridge University Press, 2010, Page 21
Udacity : Machine Learning Engineer Nanodegree
Andrew R. Webb, Statistical Pattern Recognition, Second Edition, QinetiQ Ltd., Malvern, UK, John Wiley and Sons Ltd
Silver, David; Huang, Aja; Maddison, Chris J.; Guez, Arthur; Sifre, Laurent; Driessche, George van den; Schrittwieser, Julian; Antonoglou, Ioannis; Panneershelvam, Veda; Lanctot, Marc; Dieleman, Sander; Grewe, Dominik; Nham, John; Kalchbrenner, Nal; Sutskever, Ilya; Lillicrap, Timothy; Leach, Madeleine; Kavukcuoglu, Koray; Graepel, Thore; Hassabis, Demis (28 January 2016). “Mastering the game of Go with deep neural networks and tree search”. Nature.