Supervised Learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and the desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. In the optimal scenario, the algorithm correctly determines the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a “reasonable” way (see inductive bias). To solve a supervised learning problem, one has to perform the following steps: determine the type of training examples, gather a training set, determine the input feature representation of the learned function, determine the structure of the learned function and the corresponding learning algorithm, complete the design by running the learning algorithm on the training set, and evaluate the accuracy of the learned function. A wide range of supervised learning algorithms is available, each with its strengths and weaknesses. There is no single learning algorithm that works best on all supervised learning problems.
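As a rough illustration of inferring a function from labeled pairs, the sketch below uses a 1-nearest-neighbor classifier. The algorithm choice, the function names fit_1nn and predict_1nn, and the toy data are assumptions made for this example only; they do not come from the glossary.

```python
import math

def fit_1nn(examples):
    """'Training' for 1-nearest-neighbor simply stores the labeled pairs."""
    return list(examples)

def predict_1nn(model, x):
    """Return the label of the stored input vector closest to x."""
    _, nearest_label = min(model, key=lambda pair: math.dist(pair[0], x))
    return nearest_label

# Each training example is a pair: (input vector, desired output value).
# The numbers and labels are made up for illustration.
training_set = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"),
    ((9.0, 7.5), "large"),
]

model = fit_1nn(training_set)

# Map new, unseen examples with the inferred function.
print(predict_1nn(model, (2.0, 1.0)))
print(predict_1nn(model, (7.0, 9.0)))
```

Other algorithms would generalize differently from the same four examples, which is one way to see the inductive bias mentioned above.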
Statistical Significance in statistical hypothesis testing is attained whenever the observed p-value of a test statistic is less than the significance level defined for the study. The p-value is the probability of obtaining results at least as extreme as those observed, given that the null hypothesis is true. The significance level, α, is the probability of rejecting the null hypothesis, given that it is true. In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone. But if the p-value of an observed effect is less than the significance level, an investigator may conclude that that effect reflects the characteristics of the whole population, thereby rejecting the null hypothesis. A significance level is chosen before data collection and typically set to 5% or much lower, depending on the field of study. This technique for testing the significance of results was developed in the early 20th century. The term significance does not imply importance here, and the term statistical significance is not the same as research, theoretical, or practical significance. For example, the term clinical significance refers to the practical importance of a treatment effect.
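The comparison between a p-value and a significance level can be sketched with an exact two-sided binomial test. The scenario (16 heads in 20 tosses, null hypothesis of a fair coin) and the function name binomial_p_value are assumptions chosen for illustration.

```python
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """Exact two-sided p-value: total probability, under the null hypothesis,
    of every outcome that is no more likely than the observed one."""
    pmf = [
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in pmf if p <= observed * (1 + 1e-9))

alpha = 0.05                          # significance level, chosen before data collection
p_value = binomial_p_value(16, 20)    # about 0.0118
print(p_value, p_value < alpha)
```

Here the p-value falls below α, so the result would be called statistically significant at the 5% level. That says nothing about whether the size of the effect matters in practice.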
Statistical Power of any test of statistical significance is defined as the probability that it will reject a false null hypothesis. Statistical power is inversely related to beta (β), the probability of making a Type II error. The power is a function of the possible distributions, often determined by a parameter, under the alternative hypothesis. As the power increases, the chance of a Type II error, also referred to as the false negative rate (β), decreases, since the power is equal to 1−β under the alternative hypothesis. A related concept is the Type I error rate, which is the level of a test under the null hypothesis. Power analysis can be used to calculate the minimum sample size required so that one can be reasonably likely to detect an effect of a given size. For example: “how many times do I need to toss a coin to conclude it is rigged?” Power analysis can also be used to calculate the minimum effect size that is likely to be detected in a study using a given sample size. In addition, the concept of power is used to make comparisons between different statistical testing procedures: for example, between a parametric and a nonparametric test of the same hypothesis.
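The coin-toss question can be sketched as a small power analysis. The example assumes a one-sided exact binomial test, a rigged coin that lands heads with probability 0.75, a significance level of 0.05, and a target power of 0.8. These values and all function names are illustrative assumptions.

```python
from math import comb

def upper_tail(n, c, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))

def critical_value(n, alpha):
    """Smallest head count c with P(X >= c | fair coin) <= alpha."""
    for c in range(n + 2):
        if upper_tail(n, c, 0.5) <= alpha:
            return c

def power(n, p_alt, alpha=0.05):
    """Probability of rejecting 'the coin is fair' when P(heads) = p_alt."""
    return upper_tail(n, critical_value(n, alpha), p_alt)

def first_sufficient_n(p_alt, target=0.8, alpha=0.05, max_n=2000):
    """First number of tosses at which the power reaches the target."""
    for n in range(1, max_n + 1):
        if power(n, p_alt, alpha) >= target:
            return n
    return None

n_tosses = first_sufficient_n(0.75)
print(n_tosses, power(n_tosses, 0.75))
```

Because the binomial distribution is discrete, the power does not rise smoothly with the number of tosses, so the function reports the first sample size that reaches the target and not a guarantee for every larger one.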
Sentiment Analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subjects with respect to some topic or the overall contextual polarity or emotional reaction to a document, interaction, or event. The attitude may be a judgment or evaluation, affective state (the emotional state of the author or speaker), or the intended emotional communication (the emotional effect intended by the author or interlocutor). A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level—whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Advanced, “beyond polarity” sentiment classification looks, for instance, at emotional states such as “angry”, “sad”, and “happy”.
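A minimal sketch of document-level polarity classification is shown below, assuming a tiny hand-made word list. The lexicon, the function name polarity, and the counting rule are hypothetical. Practical systems typically rely on much larger lexicons or trained models.

```python
# Hypothetical toy lexicon, for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "sad"}

def polarity(text):
    """Classify text as positive, negative, or neutral by counting lexicon hits."""
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this phone, the screen is great!"))
print(polarity("Terrible battery and poor support."))
print(polarity("The package arrived on Tuesday."))
```

A word-counting rule like this ignores negation and context ("not good" would count as positive), which is one reason real sentiment analysis is harder than the sketch suggests.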
Semi-Supervised Learning is a class of supervised learning tasks that also make use of unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data. Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning. Methods of semi-supervised learning include generative methods, low-density separation, graph-based methods, and heuristic approaches.
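One simple way to use unlabeled data is self-training, sketched below on one-dimensional points. The choice of self-training, the nearest-neighbor labeling rule, the function names, and the data are all assumptions made for illustration.

```python
def nearest_labeled(labeled, x):
    """Return the (point, label) pair whose point is closest to x."""
    return min(labeled, key=lambda pair: abs(pair[0] - x))

def self_train(labeled, unlabeled):
    """Repeatedly label the unlabeled point that lies closest to any
    already-labeled point, then treat it as labeled in later rounds."""
    labeled = list(labeled)
    pool = list(unlabeled)
    while pool:
        x = min(pool, key=lambda u: abs(nearest_labeled(labeled, u)[0] - u))
        labeled.append((x, nearest_labeled(labeled, x)[1]))
        pool.remove(x)
    return labeled

# Two labeled points and a larger set of unlabeled ones (made-up values).
labeled = [(0.0, "a"), (10.0, "b")]
unlabeled = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 9.5]

labels = dict(self_train(labeled, unlabeled))
print(labels[6.0])                             # label after self-training
print(nearest_labeled(labeled, 6.0)[1])        # label from the two labeled points alone
```

In this toy data the point 6.0 is closer to the labeled "b" example, but the chain of unlabeled points connects it to the "a" cluster, so the two approaches disagree. Whether the self-trained label is the better one depends on whether the cluster structure actually reflects the classes.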
Semantic Indexing or Latent Semantic Indexing (LSI) is a mathematical method used to determine the relationship between terms and concepts in content. The contents of a web page are crawled by a search engine, and the most common words and phrases are collated and identified as the keywords for the page. LSI looks for synonyms related to the title of your page. Latent Semantic Indexing came as a direct reaction to people trying to cheat search engines by cramming Meta keyword tags full of hundreds of keywords, Meta descriptions full of more keywords, and page content full of nothing more than random keywords and no subject-related material or worthwhile content. LSI will not affect a squeeze page that has no intention of achieving a search engine rank anyway, due to its minimalistic content. But site owners or bloggers hoping to get on the search engines' good side should pay attention to LSI.
Self-Organizing Map (SOM) is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, and is, therefore, a method to do dimensionality reduction. Self-organizing maps differ from other artificial neural networks as they apply competitive learning as opposed to error-correction learning (such as backpropagation with gradient descent), and in the sense that they use a neighborhood function to preserve the topological properties of the input space. This makes SOMs useful for visualizing low-dimensional views of high-dimensional data. Like most artificial neural networks, SOMs operate in two modes: training and mapping. “Training” builds the map using input, while “mapping” automatically classifies a new input vector. A self-organizing map consists of components called nodes or neurons. Associated with each node are a weight vector of the same dimension as the input data vectors and a position in the map space. The usual arrangement of nodes is a two-dimensional regular spacing in a hexagonal or rectangular grid. The procedure for placing a vector from data space onto the map is to find the node with the closest weight vector to the data space vector.
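The mapping step (finding the node with the closest weight vector) and a single training update can be sketched as follows. The 2x2 grid, the Gaussian neighborhood function, the learning rate, and the function names are assumptions for illustration, not a reference implementation.

```python
import math

def best_matching_unit(weights, x):
    """Mapping: return the grid position whose weight vector is closest to x."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))

def train_step(weights, x, learning_rate=0.5, radius=1.0):
    """One competitive-learning update: pull the winning node and its grid
    neighbors toward x, weighted by an assumed Gaussian neighborhood."""
    bmu = best_matching_unit(weights, x)
    for pos, w in weights.items():
        grid_dist = math.dist(pos, bmu)   # distance in map space, not data space
        influence = math.exp(-(grid_dist**2) / (2 * radius**2))
        weights[pos] = [
            wi + learning_rate * influence * (xi - wi) for wi, xi in zip(w, x)
        ]
    return bmu

# A 2x2 rectangular grid; each node has a position and a 3-D weight vector.
weights = {
    (0, 0): [0.0, 0.0, 0.0],
    (0, 1): [0.0, 1.0, 0.0],
    (1, 0): [1.0, 0.0, 0.0],
    (1, 1): [1.0, 1.0, 1.0],
}
x = [0.9, 0.1, 0.0]

bmu = best_matching_unit(weights, x)
before = math.dist(weights[bmu], x)
train_step(weights, x)
after = math.dist(weights[bmu], x)
print(bmu, before, after)
```

Full training would repeat this update over many input vectors while shrinking the learning rate and the neighborhood radius over time.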
Selection Bias is the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, so that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. The phrase “selection bias” most often refers to the distortion of a statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate. There are many types of possible selection bias, including sampling bias (systematic error due to a non-random sample of a population, causing some members of the population to be less likely to be included than others), time interval, and exposure. An assessment of the degree of selection bias can be made by examining correlations between background variables and a treatment indicator.
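The last point, checking correlations between a background variable and a treatment indicator, can be sketched as below. The ages and the two assignment patterns are made-up illustrative data, and the function name pearson_r is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Background variable: participant age (made-up values).
ages = [25, 30, 35, 40, 45, 50, 55, 60]

# Treatment indicator under two hypothetical assignment schemes.
treated_by_age = [0, 0, 0, 0, 1, 1, 1, 1]       # older participants were treated
treated_alternating = [0, 1, 0, 1, 0, 1, 0, 1]  # assignment unrelated to age group

r_selected = pearson_r(ages, treated_by_age)
r_alternating = pearson_r(ages, treated_alternating)
print(r_selected, r_alternating)
```

A strong correlation such as the first one suggests that treated and untreated groups differ systematically on the background variable, which is a warning sign for selection bias. A weak correlation on one variable does not rule it out.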
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. R-squared is the percentage of the response variable variation that is explained by the model. For a least-squares linear model with an intercept, it is always between 0% and 100%:
0% indicates that the model explains none of the variability of the response data around its mean
100% indicates that the model explains all the variability of the response data around its mean
In general, the higher the R-squared, the better the model fits your data. The biggest limitations are that R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why the residual plots need to be assessed, and that R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high value for a model that does not fit the data.
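The two endpoints described above can be checked with a short computation of R-squared as one minus the ratio of residual variation to total variation around the mean. The function name r_squared and the sample values are assumptions for illustration.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    if ss_total == 0:
        raise ValueError("R-squared is undefined when the observed values are constant")
    return 1 - ss_residual / ss_total

observed = [1.0, 2.0, 3.0, 4.0]

# Predictions that match the data exactly explain all of the variability.
print(r_squared(observed, [1.0, 2.0, 3.0, 4.0]))   # 1.0

# Always predicting the mean explains none of the variability.
print(r_squared(observed, [2.5, 2.5, 2.5, 2.5]))   # 0.0
```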
Root Mean Square Error (RMSE), also called root-mean-square deviation (RMSD), is a frequently used measure of the differences between values (sample and population values) predicted by a model or an estimator and the values actually observed. The RMSE is the square root of the mean of the squared differences between predicted values and observed values. These individual differences are called residuals when the calculations are performed over the data sample that was used for estimation and are called prediction errors when computed out-of-sample. The RMSE serves to aggregate the magnitudes of the errors in predictions for various times into a single measure of predictive power. RMSE is a good measure of accuracy, but only for comparing forecasting errors of different models for a particular variable and not between variables, as it is scale-dependent.
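A short sketch of the calculation, which also shows the scale dependence mentioned above. The function name rmse and the sample values are assumptions for illustration.

```python
import math

def rmse(observed, predicted):
    """Square root of the mean squared difference between predictions and observations."""
    squared_errors = [(p - y) ** 2 for y, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

observed_m = [10.0, 12.0, 15.0]     # e.g. lengths in meters (made-up values)
predicted_m = [11.0, 12.0, 13.0]

error_m = rmse(observed_m, predicted_m)

# The same data expressed in millimeters gives an error 1000 times larger,
# which is why RMSE should not be compared across variables or units.
error_mm = rmse([v * 1000 for v in observed_m], [v * 1000 for v in predicted_m])

print(error_m, error_mm)
```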