What precision and recall are?

What precision and recall are?


After the predictive model has been finished, the most important question is: How good is it? Does it predict well?

Evaluating the model is one of the most important tasks in the data science project,  it indicates how good predictions are. Very often for classification problems we look at metrics called precision and recall, to define them in detail let’s quickly introduce confusion matrix first.

Confusion Matrix for binary classification is made of four simple ratios:

  • True Negative(TN): case was true negative and predicted negative
  • True Positive(TP): case was true positive and predicted positive
  • False Negative(FN): case was true positive but predicted negative
  • False Positive(FP): case was true negative but predicted positive




Understanding the confusion matrix, calculating precision and recall is easy.


Precision – is the ratio of correctly predicted positive observations to the total predicted positive observations, or what percent of positive predictions were correct?

Precision = TP/TP+FP


Recall – also called sensitivity, is the ratio of correctly predicted positive observations to all observations in actual class – yes, or what percent of the positive cases did you catch?

Recall = TP/TP+FN


There are also two more useful matrices coming from confusion matrix,  Accuracy – correctly predicted observation to the total observations and F1 score the weighted average of Precision and Recall. Although intuitively it is not as easy to understand as accuracy, the F1 score is usually more useful than accuracy, especially if you have an uneven class distribution.

Example Python Code to get Precision and Recall:


from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.metrics import precision_recall_fscore_support as score

data = datasets.load_iris()
X = data['data']
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression()
preds = model.predict(X_test)

precision, recall, fscore, support = score(y_test, preds)


Was the above useful? Please share with others on social media.

If you want to look for more information, check some free online courses available at   coursera.orgedx.org or udemy.com.

Recommended reading list:


Data Science from Scratch: First Principles with Python

Data science libraries, frameworks, modules, and toolkits are great for doing data science, but they’re also a good way to dive into the discipline without actually understanding data science. In this book, you’ll learn how many of the most fundamental data science tools and algorithms work by implementing them from scratch.

If you have an aptitude for mathematics and some programming skills, author Joel Grus will help you get comfortable with the math and statistics at the core of data science, and with hacking skills you need to get started as a data scientist. Today’s messy glut of data holds answers to questions no one’s even thought to ask. This book provides you with the know-how to dig those answers out.

Get a crash course in Python
Learn the basics of linear algebra, statistics, and probability—and understand how and when they're used in data science
Collect, explore, clean, munge, and manipulate data
Dive into the fundamentals of machine learning
Implement models such as k-nearest Neighbors, Naive Bayes, linear and logistic regression, decision trees, neural networks, and clustering
Explore recommender systems, natural language processing, network analysis, MapReduce, and databases
Practical Statistics for Data Scientists: 50 Essential Concepts

Statistical methods are a key part of of data science, yet very few data scientists have any formal statistics training. Courses and books on basic statistics rarely cover the topic from a data science perspective. This practical guide explains how to apply various statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what's important and what's not.

Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R programming language, and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.

With this book, you’ll learn:

Why exploratory data analysis is a key preliminary step in data science
How random sampling can reduce bias and yield a higher quality dataset, even with big data
How the principles of experimental design yield definitive answers to questions
How to use regression to estimate outcomes and detect anomalies
Key classification techniques for predicting which categories a record belongs to
Statistical machine learning methods that “learn” from data
Unsupervised learning methods for extracting meaning from unlabeled data
Doing Data Science: Straight Talk from the Frontline

Now that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that’s so clouded in hype? This insightful book, based on Columbia University’s Introduction to Data Science class, tells you what you need to know.

In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.

Topics include:

Statistical inference, exploratory data analysis, and the data science process
Spam filters, Naive Bayes, and data wrangling
Logistic regression
Financial modeling
Recommendation engines and causality
Data visualization
Social networks and data journalism
Data engineering, MapReduce, Pregel, and Hadoop
The Data Science Handbook: Advice and Insights from 25 Amazing Data Scientists

The Data Science Handbook contains interviews with 25 of the world s best data scientists. We sat down with them, had in-depth conversations about their careers, personal stories, perspectives on data science and life advice. In The Data Science Handbook, you will find war stories from DJ Patil, US Chief Data Officer and one of the founders of the field. You ll learn industry veterans such as Kevin Novak and Riley Newman, who head the data science teams at Uber and Airbnb respectively. You ll also read about rising data scientists such as Clare Corthell, who crafted her own open source data science masters program. This book is perfect for aspiring or current data scientists to learn from the best. It s a reference book packed full of strategies, suggestions and recipes to launch and grow your own data science career.
Introduction to Machine Learning with Python: A Guide for Data Scientists

Machine learning has become an integral part of many commercial applications and research projects, but this field is not exclusive to large companies with extensive research teams. If you use Python, even as a beginner, this book will teach you practical ways to build your own machine learning solutions. With all the data available today, machine learning applications are limited only by your imagination.

You’ll learn the steps necessary to create a successful machine-learning application with Python and the scikit-learn library. Authors Andreas Müller and Sarah Guido focus on the practical aspects of using machine learning algorithms, rather than the math behind them. Familiarity with the NumPy and matplotlib libraries will help you get even more from this book.

With this book, you’ll learn:

Fundamental concepts and applications of machine learning
Advantages and shortcomings of widely used machine learning algorithms
How to represent data processed by machine learning, including which data aspects to focus on
Advanced methods for model evaluation and parameter tuning
The concept of pipelines for chaining models and encapsulating your workflow
Methods for working with text data, including text-specific processing techniques
Suggestions for improving your machine learning and data science skills

What is Supervised Learning?

Supervised Learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and the desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a “reasonable” way (see inductive bias). In order to solve the supervised learning problem, one has to perform following steps: determine the type of training examples, gather a training set, determine the input feature representation of the learned function, determine the structure of the learned function and corresponding learning algorithm, complete the design, and evaluate the accuracy of the learned function. A wide range of supervised learning algorithms is available, each with its strengths and weaknesses. There is no single learning algorithm that works best on all supervised learning problems.

What is Statistical Significance?

Statistical Significance in statistical hypothesis testing is attained whenever the observed p-value of a test statistic is less than the significance level defined for the study. The p-value is the probability of obtaining results at least as extreme as those observed, given that the null hypothesis is true. The significance level, α, is the probability of rejecting the null hypothesis, given that it is true. In any experiment or observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone. But if the p-value of an observed effect is less than the significance level, an investigator may conclude that that effect reflects the characteristics of the whole population, thereby rejecting the null hypothesis. A significance level is chosen before data collection and typically set to 5% or much lower, depending on the field of study. This technique for testing the significance of results was developed in the early 20th century. The term significance does not imply importance here, and the term statistical significance is not the same as research, theoretical, or practical significance. For example, the term clinical significance refers to the practical importance of a treatment effect.

What is Statistical Power?

Statistical Power of any test of statistical significance is defined as the probability that it will reject a false null hypothesis. Statistical power is inversely related to beta or the probability of making a Type II error. The power is a function of the possible distributions, often determined by a parameter, under the alternative hypothesis. As the power increases, there are decreasing chances of a Type II error, which are also referred to as the false negative rate (β) since the power is equal to 1−β, again, under the alternative hypothesis. A similar concept is Type I error or the level of a test under the null hypothesis. Power analysis can be used to calculate the minimum sample size required so that one can be reasonably likely to detect an effect of a given size. For example: “how many times do I need to toss a coin to conclude it is rigged?” Power analysis can also be used to calculate the minimum effect size that is likely to be detected in a study using a given sample size. In addition, the concept of power is used to make comparisons between different statistical testing procedures: for example, between a parametric and a nonparametric test of the same hypothesis.

What is Sentiment Analysis?

Sentiment Analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subjects with respect to some topic or the overall contextual polarity or emotional reaction to a document, interaction, or event. The attitude may be a judgment or evaluation, affective state (the emotional state of the author or speaker), or the intended emotional communication (the emotional effect intended by the author or interlocutor). A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level—whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Advanced, “beyond polarity” sentiment classification looks, for instance, at emotional states such as “angry”, “sad”, and “happy”.

What is Semi-Supervised Learning?

Semi-Supervised Learning is a class of supervised learning tasks that also make use of unlabeled data for training – typically a small amount of labeled data with a large amount of unlabelled data. Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine-learning researchers have found that unlabelled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning. Methods of semi-supervise learning include generative methods, low-density separation, graph-based methods, heuristic approaches.

What is Semantic Indexing or Latent Semantic Indexing (LSI)?

Semantic Indexing or Latent Semantic Indexing (LSI) is a mathematical method used to determine the relationship between terms and concepts in content. The contents of a web page are crawled by a search engine and the most common words and phrases are collated and identified as the keywords for the page. LSI looks for synonyms related to the title of your page. Latent Semantic Indexing came as a direct reaction to people trying to cheat search engines by cramming Meta keyword tags full of hundreds of keywords, Meta description full of more keywords, and page content full of nothing more than random keywords and no subject-related material or worthwhile content. LSI will not affect a squeeze page that has no intention of achieving a search engine rank anyway, due to its minimalistic content. But for site owners or bloggers hoping to get on the search engines good side, pay attention to LSI.

What is Self-Organizing Map (SOM)?

Self-Organizing Map (SOM) is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, and is, therefore, a method to do dimensionality reduction. Self-organizing maps differ from other artificial neural networks as they apply competitive learning as opposed to error-correction learning (such as backpropagation with gradient descent), and in the sense that they use a neighborhood function to preserve the topological properties of the input space. This makes SOMs useful for visualizing low-dimensional views of high-dimensional data. Like most artificial neural networks, SOMs operate in two modes: training and mapping. “Training” builds the map using input, while “mapping” automatically classifies a new input vector. A self-organizing map consists of components called nodes or neurons. Associated with each node are a weight vector of the same dimension as the input data vectors and a position in the map space. The usual arrangement of nodes is a two-dimensional regular spacing in a hexagonal or rectangular grid. The procedure for placing a vector from data space onto the map is to find the node with the closest weight vector to the data space vector.

What is Selection Bias?

Selection Bias is the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. The phrase “selection bias” most often refers to the distortion of a statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate. There are many types of possible selection bias, including sampling bias ( systematic error due to a non-random sample of a population, causing some members of the population to be less likely to be included than others), time interval, exposure. An assessment of the degree of selection bias can be made by examining correlations between background variables and a treatment indicator.

What is R-squared?

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. R-squared is the percentage of the response variable variation that is explained by the model, it is always between 0 and 100%:
0% indicates that the model explains none of the variability of the response data around its mean
100% indicates that the model explains all the variability of the response data around its mean
In general, the higher the R-squared, the better the model fits your data. The biggest limitations are: R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you the residual plots need to be assessed, R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high value for a model that does not fit the data.