Big data analytics is a crucial component of the big data paradigm and refers to the process of extracting useful knowledge from large datasets or data streams. Because such data are enormous, high-dimensional, heterogeneous, and distributed, traditional data mining techniques may be unsuitable for big data.
In this lecture, different big data tools and machine learning algorithms are introduced, discussed, and analyzed. Depending on how they learn, machine learning algorithms can be categorized as supervised, unsupervised, or reinforcement learning.
- Supervised learning
Supervised learning approaches optimize the model's parameters so as to minimize a criterion function that describes the difference between the desired and the estimated output. Frequently used supervised techniques are linear regression (estimates a continuous output under a presupposed linear model with respect to the predictors), logistic regression (a classification approach that estimates the probability that the input belongs to the considered class), support vector machines and their extension to regression problems, support vector regression (both based on margin calculation), Gaussian process regression (a non-parametric statistical approach used for regression problems), discriminant analysis and naive Bayes (also statistical methods, but used for classification), neural networks (able to extract highly informative features even from extraordinarily complex problems), ensemble methods (improve performance by combining several models), decision trees (preferable for classification; the nodes represent the attributes, while the branches represent their values), etc.
By contrast, commonly used approaches in the second group, unsupervised learning, include clustering methods such as K-means, K-medoids, fuzzy C-means, hierarchical clustering, and Gaussian mixture models. In addition to these clustering methods, Hidden Markov Models and their extensions are widely used as probabilistic techniques based on Markov chains, and unsupervised neural networks are used for the same feature-extraction ability mentioned above, etc.
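To make the idea of minimizing a criterion function concrete, the following is a minimal sketch of linear regression fitted by gradient descent on the mean squared error. It is an illustration under stated assumptions, not a reference implementation: the function name `fit_linear`, the learning rate, the number of epochs, and the toy data are all chosen here for the example and do not come from the lecture.

```python
def fit_linear(xs, ys, lr=0.05, epochs=2000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error.

    lr and epochs are illustrative defaults, assumed for this toy example.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Residuals: estimated output minus desired output.
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the criterion (1/n) * sum(residual^2) w.r.t. w and b.
        grad_w = sum(2 * r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(2 * r for r in residuals) / n
        # Move the parameters against the gradient to reduce the criterion.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy labeled data (assumed): outputs generated exactly by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear(xs, ys)
```

On this noise-free toy data the fitted `w` and `b` should approach the generating slope and intercept. On real data, a library solver or the closed-form least-squares solution would usually be preferred over a hand-written loop.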
- Unsupervised learning
Unsupervised learning is a type of machine learning used to draw inferences from datasets consisting of input data without labeled responses. The most common unsupervised learning method is cluster analysis, which is used in exploratory data analysis to find hidden patterns or groupings in data.
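As an illustration of cluster analysis on unlabeled data, here is a small sketch of K-means (Lloyd's algorithm) in plain Python. The function names `kmeans` and `sq_dist`, the fixed iteration count, the seeded random initialization, and the toy points are assumptions made for this example only.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20, seed=0):
    """Plain K-means: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    # Assumed initialization: pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters


# Toy unlabeled data (assumed): two well-separated groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
```

No labels are supplied at any point. The grouping emerges only from the distances between the inputs, which is what distinguishes this from the supervised setting.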
- Reinforcement learning
Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize some notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
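The agent, environment, and reward loop can be sketched with tabular Q-learning, which is one possible reinforcement learning algorithm, chosen here for illustration and not named in the lecture. The environment is a hypothetical five-cell corridor in which the agent starts at the left end and receives a reward of 1 on reaching the right end. All names (`step`, `q_learning`, `greedy_policy`) and hyperparameters are assumptions for this example.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal (assumed toy setup)
ACTIONS = (-1, +1)    # action index 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical environment: move within the corridor, reward 1 at the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(2)  # explore: random action
            else:
                best = max(q[state])  # exploit, breaking ties at random
                a = rng.choice([i for i in range(2) if q[state][i] == best])
            next_state, reward, done = step(state, ACTIONS[a])
            # Target: immediate reward plus discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


def greedy_policy(q):
    """Best action index for each non-terminal state."""
    return [max(range(2), key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


q = q_learning()
policy = greedy_policy(q)
```

In this deterministic toy corridor the learned greedy policy should select the move-right action in every non-terminal cell, since that maximizes the discounted cumulative reward.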