What precision and recall are? After the predictive model has been finished, the most important question is: How good is it? Does it predict well? Evaluating the model is one of the most important tasks in the data science project, it indicates how good predictions are. Very often for classification problems we look at metrics called precision and recall, to define them in detail let’s quickly introduce confusion matrix first. Confusion Matrix for binary classification is made of four simple ratios: True Negative(TN): case was true negative and predicted negative True Positive(TP): case was true positive and predicted positive False […]

## What is Supervised Learning?

Supervised Learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and the desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data […]

## What is Statistical Significance?

Statistical Significance in statistical hypothesis testing is attained whenever the observed p-value of a test statistic is less than the significance level defined for the study. The p-value is the probability of obtaining results at least as extreme as those observed, given that the null hypothesis is true. The significance level, α, is the probability of rejecting the null hypothesis, given that it is true. In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone. But if the p-value of […]

## What is Statistical Power?

Statistical Power of any test of statistical significance is defined as the probability that it will reject a false null hypothesis. Statistical power is inversely related to beta or the probability of making a Type II error. The power is a function of the possible distributions, often determined by a parameter, under the alternative hypothesis. As the power increases, there are decreasing chances of a Type II error, which are also referred to as the false negative rate (β) since the power is equal to 1−β, again, under the alternative hypothesis. A similar concept is Type I error or the […]

## What is Sentiment Analysis?

Sentiment Analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subjects with respect to some topic or the overall contextual polarity or emotional reaction to a document, interaction, or event. The […]

## What is Semi-Supervised Learning?

Semi-Supervised Learning is a class of supervised learning tasks that also make use of unlabeled data for training – typically a small amount of labeled data with a large amount of unlabelled data. Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine-learning researchers have found that unlabelled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning. Methods of semi-supervise learning include […]

## What is Semantic Indexing or Latent Semantic Indexing (LSI)?

Semantic Indexing or Latent Semantic Indexing (LSI) is a mathematical method used to determine the relationship between terms and concepts in content. The contents of a web page are crawled by a search engine and the most common words and phrases are collated and identified as the keywords for the page. LSI looks for synonyms related to the title of your page. Latent Semantic Indexing came as a direct reaction to people trying to cheat search engines by cramming Meta keyword tags full of hundreds of keywords, Meta description full of more keywords, and page content full of nothing more […]

## What is Self-Organizing Map (SOM)?

Self-Organizing Map (SOM) is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, and is, therefore, a method to do dimensionality reduction. Self-organizing maps differ from other artificial neural networks as they apply competitive learning as opposed to error-correction learning (such as backpropagation with gradient descent), and in the sense that they use a neighborhood function to preserve the topological properties of the input space. This makes SOMs useful for visualizing low-dimensional views of high-dimensional data. Like most […]

## What is Selection Bias?

Selection Bias is the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. The phrase “selection bias” most often refers to the distortion of a statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate. There are many types of possible selection bias, including sampling bias ( systematic […]

## What is R-squared?

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. R-squared is the percentage of the response variable variation that is explained by the model, it is always between 0 and 100%: 0% indicates that the model explains none of the variability of the response data around its mean 100% indicates that the model explains all the variability of the response data around its mean In general, the higher the R-squared, the better the model fits […]