Essential Statistics for NonSTEM Data Analysts Copyright 2020 Packt Publishing All rights reserved No part of this book may be reproduced stored in a retrieval system or transmitted in any form or by any means without the prior written permission of the publisher except in the case of brief quotations embedded in critical articles or reviews Every effort has been made in the preparation of this book to ensure the accuracy of the information presented However the information contained in this book is sold without warranty either express or implied Neither the author nor Packt Publishing or its dealers and distributors will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals However Packt Publishing cannot guarantee the accuracy of this information Commissioning Editor Sunith Shetty Acquisition Editor Devika Battike Senior Editor Roshan Kumar Content Development Editor Sean Lobo Technical Editor Sonam Pandey Copy Editor Safis Editing Project Coordinator Aishwarya Mohan Proofreader Safis Editing Indexer Pratik Shirodkar Production Designer Roshan Kawale First published November 2020 Production reference 1111120 Published by Packt Publishing Ltd Livery Place 35 Livery Street Birmingham B3 2PB UK ISBN 9781838984847 wwwpacktcom Packtcom Subscribe to our online digital library for full access to over 7000 books and videos as well as industry leading tools to help you plan your personal development and advance your career For more information please visit our website Why subscribe Spend less time learning and more time coding with practical eBooks and Videos from over 4000 industry professionals Improve your learning with Skill Plans built especially for you Get a free eBook or video every month Fully searchable for easy access to vital information Copy and paste print and bookmark content Did you know that Packt offers eBook versions of every book published with PDF and ePub files available You can upgrade to the eBook version at packtcom and as a print book customer you are entitled to a discount on the eBook copy Get in touch with us at customercarepacktpubcom for more details At wwwpacktcom you can also read a collection of free technical articles sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks Contributors About the author Rongpeng Li is a data science instructor and a senior data scientist at Galvanize Inc He has previously been a research programmer at Information Sciences Institute working on knowledge graphs and artificial intelligence He has also been the host and organizer of the Data Analysis Workshop Designed for NonSTEM Busy Professionals at LA Michael Hansen httpswwwlinkedincominmichaelnhansen a friend of mine provided invaluable English language editing suggestions for this book Michael has great attention to detail which made him a great language reviewer Thank you Michael About the reviewers James Mott PhD is a senior education consultant with extensive experience in teaching statistical analysis modeling data mining and predictive analytics He has over 30 years of experience using SPSS products in his own research including IBM SPSS Statistics IBM SPSS Modeler and IBM SPSS Amos He has also been actively teaching about these products to IBMSPSS customers for over 30 years In addition he is an experienced 
historian with expertise in the research and teaching of 20th century United States political history and quantitative methods His specialties are data mining quantitative methods statistical analysis teaching and consulting Yidan Pan obtained her PhD in system synthetic and physical biology from Rice University Her research interest is profiling mutagenesis at genomic and transcriptional levels with molecular biology wet labs bioinformatics statistical analysis and machine learning models She believes that this book will give its readers a lot of practical skills for data analysis Packt is searching for authors like you If youre interested in becoming an author for Packt please visit authors packtpubcom and apply today We have worked with thousands of developers and tech professionals just like you to help them share their insight with the global tech community You can make a general application apply for a specific hot topic that we are recruiting an author for or submit your own idea Table of Contents Preface Section 1 Getting Started with Statistics for Data Science 1 Fundamentals of Data Collection Cleaning and Preprocessing Technical requirements 4 Collecting data from various data sources 5 Reading data directly from files 5 Obtaining data from an API 6 Obtaining data from scratch 9 Data imputation 11 Preparing the dataset for imputation 11 Imputation with mean or median values 16 Imputation with the modemost frequent value 18 Outlier removal 20 Data standardization when and how 21 Examples involving the scikit learn preprocessing module 23 Imputation 23 Standardization 23 Summary 24 2 Essential Statistics for Data Assessment Classifying numerical and categorical variables 26 Distinguishing between numerical and categorical variables 26 Understanding mean median and mode 30 Mean 30 Median 31 Mode 32 ii Table of Contents Learning about variance standard deviation quartiles percentiles and skewness 33 Variance 33 Standard deviation 36 Quartiles 37 Skewness 39 Knowing how to handle categorical variables and mixed data types 43 Frequencies and proportions 43 Transforming a continuous variable to a categorical one 46 Using bivariate and multivariate descriptive statistics 47 Covariance 48 Crosstabulation 50 Summary 51 3 Visualization with Statistical Graphs Basic examples with the Python Matplotlib package 54 Elements of a statistical graph 54 Exploring important types of plotting in Matplotlib 56 Advanced visualization customization 65 Customizing the geometry 65 Customizing the aesthetics 70 Queryoriented statistical plotting 72 Example 1 preparing data to fit the plotting function API 73 Example 2 combining analysis with plain plotting 76 Presentationready plotting tips 78 Use styling 78 Font matters a lot 80 Summary 80 Section 2 Essentials of Statistical Analysis 4 Sampling and Inferential Statistics Understanding fundamental concepts in sampling techniques 84 Performing proper sampling under different scenarios 86 The dangers associated with non probability sampling 86 Probability sampling the safer approach 88 Understanding statistics associated with sampling 98 Table of Contents iii Sampling distribution of the sample mean 98 Standard error of the sample mean 103 The central limit theorem 107 Summary 108 5 Common Probability Distributions Understanding important concepts in probability 110 Events and sample space 110 The probability mass function and the probability density function 111 Subjective probability and empirical probability 116 Understanding common discrete probability 
distributions 116 Bernoulli distribution 117 Binomial distribution 118 Poisson distribution 120 Understanding the common continuous probability distribution 121 Uniform distribution 122 Exponential distribution 122 Normal distribution 124 Learning about joint and conditional distribution 126 Independency and conditional distribution 127 Understanding the power law and black swan 127 The ubiquitous power law 128 Be aware of the black swan 129 Summary 130 6 Parametric Estimation Understanding the concepts of parameter estimation and the features of estimators 132 Evaluation of estimators 133 Using the method of moments to estimate parameters 136 Example 1 the number of 911 phone calls in a day 137 Example 2 the bounds of uniform distribution 139 Applying the maximum likelihood approach with Python 141 Likelihood function 141 MLE for uniform distribution boundaries 144 MLE for modeling noise 145 MLE and the Bayesian theorem 155 Summary 160 iv Table of Contents 7 Statistical Hypothesis Testing An overview of hypothesis testing 162 Understanding Pvalues test statistics and significance levels 164 Making sense of confidence intervals and Pvalues from visual examples 167 Calculating the Pvalue from discrete events 168 Calculating the Pvalue from the continuous PDF 170 Significance levels in tdistribution 174 The power of a hypothesis test 179 Using SciPy for common hypothesis testing 180 The paradigm 180 Ttest 181 The normality hypothesis test 185 The goodnessoffit test 189 A simple ANOVA model 192 Stationarity tests for time series 197 Examples of stationary and non stationary time series 198 Appreciating AB testing with a realworld example 206 Conducting an AB test 206 Randomization and blocking 207 Common test statistics 210 Common mistakes in AB tests 211 Summary 212 Section 3 Statistics for Machine Learning 8 Statistics for Regression Understanding a simple linear regression model and its rich content 216 Least squared error linear regression and variance decomposition 220 The coefficient of determination 227 Hypothesis testing 230 Connecting the relationship between regression and estimators 230 Simple linear regression as an estimator 232 Having handson experience with multivariate linear regression and collinearity analysis 233 Collinearity 239 Learning regularization from logistic regression examples 241 Summary 246 Table of Contents v 9 Statistics for Classification Understanding how a logistic regression classifier works 248 The formulation of a classification problem 250 Implementing logistic regression from scratch 251 Evaluating the performance of the logistic regression classifier 256 Building a naïve Bayes classifier from scratch 259 Underfitting overfitting and crossvalidation 267 Summary 272 10 Statistics for TreeBased Methods Overviewing treebased methods for classification tasks 274 Growing and pruning a classification tree 278 Understanding how splitting works 279 Evaluating decision tree performance 287 Exploring regression tree 289 Using tree models in scikitlearn 296 Summary 298 11 Statistics for Ensemble Methods Revisiting bias variance and memorization 300 Understanding the bootstrapping and bagging techniques 303 Understanding and using the boosting module 311 Exploring random forests with scikitlearn 316 Summary 318 vi Table of Contents Section 4 Appendix 12 A Collection of Best Practices Understanding the importance of data quality 322 Understanding why data can be problematic 322 Avoiding the use of misleading graphs 326 Example 1 COVID19 trend 326 Example 2 Bar plot 
cropping 328 Fighting against false arguments 334 Summary 335 13 Exercises and Projects Exercises 338 Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing 338 Chapter 2 Essential Statistics for Data Assessment 339 Chapter 3 Visualization with Statistical Graphs 340 Chapter 4 Sampling and Inferential Statistics 341 Chapter 5 Common Probability Distributions 342 Chapter 6 Parameter Estimation 344 Chapter 7 Statistical Hypothesis Testing 346 Chapter 8 Statistics for Regression 348 Chapter 9 Statistics for Classification 349 Chapter 10 Statistics for TreeBased Methods 351 Chapter 11 Statistics for Ensemble Methods 353 Project suggestions 355 Nontabular data 355 Realtime weather data 356 Goodness of fit for discrete distributions 358 Building a weather prediction web app 359 Building a typing suggestion app 360 Further reading 360 Textbooks 361 Visualization 361 Exercising your mind 361 Summary 362 Other Books You May Enjoy Index Preface Data science has been trending for several years and demand in the market is now really on the increase as companies governments and nonprofit organizations have shifted toward a datadriven approach Many new graduates as well as people who have been working for years are now trying to add data science as a new skill to their resumes One significant barrier for stepping into the realm of data science is statistics especially for people who do not have a science technology engineering and mathematics STEM background or left the classroom years ago This book is designed to fill the gap for those people While writing this book I tried to explore the scattered concepts in a dotconnecting fashion such that readers feel that new concepts and techniques are needed rather than simply being created from thin air By the end of this book you will be able to comfortably deal with common statistical concepts and computation in data science from fundamental descriptive statistics and inferential statistics to advanced topics such as statistics using treebased methods and ensemble methods This book is also particularly handy if you are preparing for a data scientist or data analyst job interview The nice interleaving of conceptual contents and code examples will prepare you well Who this book is for This book is for people who are looking for materials to fill the gaps in their statistics knowledge It should also serve experienced data scientists as an enjoyable read The book assumes minimal mathematics knowledge and it may appear verbose as it is designed so that novices can use it as a selfcontained book and follow the book chapter by chapter smoothly to build a knowledge base on statistics from the ground up viii Preface What this book covers Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing introduces basic concepts in data collection cleaning and simple preprocessing Chapter 2 Essential Statistics for Data Assessment talks about descriptive statistics which are handy for the assessment of data quality and exploratory data analysis EDA Chapter 3 Visualization with Statistical Graphs introduces common graphs that suit different visualization scenarios Chapter 4 Sampling and Inferential Statistics introduces the fundamental concepts and methodologies in sampling and the inference techniques associated with it Chapter 5 Common Probability Distributions goes through the most common discrete and continuous distributions which are the building blocks for more sophisticated real life empirical distributions Chapter 6 Parametric Estimation covers 
a classic and rich topic that solidifies your knowledge of statistics and probability by having you estimate parameters from accessible datasets Chapter 7 Statistical Hypothesis Testing looks at a musthave skill for any data scientist or data analyst We will cover the full life cycle of hypothesis testing from assumptions to interpretation Chapter 8 Statistics for Regression discusses statistics for regression problems starting with simple linear regression Chapter 9 Statistics for Classification explores statistics for classification problems starting with logistic regression Chapter 10 Statistics for TreeBased Methods delves into statistics for treebased methods with a detailed walk through of building a decision tree from first principles Chapter 11 Statistics for Ensemble Methods moves on to ensemble methods which are metaalgorithms built on top of basic machine learning or statistical algorithms This chapter is dedicated to methods such as bagging and boosting Chapter 12 Best Practice Collection introduces several important practice tips based on the authors data science mentoring and practicing experience Chapter 13 Exercises and Projects includes exercises and project suggestions grouped by chapter Preface ix To get the most out of this book As Jupyter notebooks can run on Google Colab a computer connected to the internet and a Google account should be sufficient If you are using the digital version of this book we advise you to type the code yourself or access the code via the GitHub repository link available in the next section Doing so will help you avoid any potential errors related to the copying and pasting of code Download the example code files You can download the example code files for this book from GitHub at https githubcomPacktPublishingEssentialStatisticsforNonSTEM DataAnalysts In case theres an update to the code it will be updated on the existing GitHub repository We also have other code bundles from our rich catalog of books and videos available at httpsgithubcomPacktPublishing Check them out Download the color images We also provide a PDF file that has color images of the screenshotsdiagrams used in this book You can download it here httpsstaticpacktcdncom downloads9781838984847ColorImagespdf Conventions used There are a number of text conventions used throughout this book Code in text Indicates code words in text database table names folder names filenames file extensions pathnames dummy URLs user input and Twitter handles Here is an example You can use pltrcytick labelsizex medium x Preface A block of code is set as follows import pandas as pd df pdreadexcelPopulationEstimatesxlsskiprows2 dfhead8 margin 0 Any commandline input or output is written as follows pip install pandas Bold Indicates a new term an important word or words that you see onscreen For example words in menus or dialog boxes appear in the text like this Here is an example seaborn is another popular Python visualization library With it you can write less code to obtain more professionallooking plots Tips or important notes R is another famous programming language for data science and statistical analysis There are also successful R packages The counterpart of Matplotlib is the R ggplot2 package I mentioned above Get in touch Feedback from our readers is always welcome General feedback If you have questions about any aspect of this book mention the book title in the subject of your message and email us at customercarepacktpubcom Errata Although we have taken every care to ensure the accuracy of 
our content mistakes do happen If you have found a mistake in this book we would be grateful if you would report this to us Please visit wwwpacktpubcomsupporterrata selecting your book clicking on the Errata Submission Form link and entering the details Piracy If you come across any illegal copies of our works in any form on the Internet we would be grateful if you would provide us with the location address or website name Please contact us at copyrightpacktcom with a link to the material If you are interested in becoming an author If there is a topic that you have expertise in and you are interested in either writing or contributing to a book please visit authors packtpubcom Preface xi Reviews Please leave a review Once you have read and used this book why not leave a review on the site that you purchased it from Potential readers can then see and use your unbiased opinion to make purchase decisions we at Packt can understand what you think about our products and our authors can see your feedback on their book Thank you For more information about Packt please visit packtcom Section 1 Getting Started with Statistics for Data Science In this section you will learn how to preprocess data and inspect distributions and correlations from a statistical perspective This section consists of the following chapters Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing Chapter 2 Essential Statistics for Data Assessment Chapter 3 Visualization with Statistical Graphs 1 Fundamentals of Data Collection Cleaning and Preprocessing Thank you for purchasing this book and welcome to a journal of exploration and excitement Whether you are already a data scientist preparing for an interview or just starting learning this book will serve you well as a companion You may already be familiar with common Python toolkits and have followed trending tutorials online However there is a lack of a systematic approach to the statistical side of data science This book is designed and written to close this gap for you As the first chapter in the book we start with the very first step of a data science project collecting cleaning data and performing some initial preprocessing It is like preparing fish for cooking You get the fish from the water or from the fish market examine it and process it a little bit before bringing it to the chef 4 Fundamentals of Data Collection Cleaning and Preprocessing You are going to learn five key topics in this chapter They are correlated with other topics such as visualization and basic statistics concepts For example outlier removal will be very hard to conduct without a scatter plot Data standardization clearly requires an understanding of statistics such as standard deviation We prepared a GitHub repository that contains readytorun codes from this chapter as well as the rest Here are the topics that will be covered in this chapter Collecting data from various data sources with a focus on data quality Data imputation with an assessment of downstream task requirements Outlier removal Data standardization when and how Examples involving the scikitlearn preprocessing module The role of this chapter is as a primer It is not possible to cover the topics in an entirely sequential fashion For example to remove outliers necessary techniques such as statistical plotting specifically a box plot and scatter plot will be used We will come back to those techniques in detail in future chapters of course but you must bear with it now Sometimes in order to learn new topics bootstrapping may 
be one of a few ways to break the shell You will enjoy it because the more topics you learn along the way the higher your confidence will be Technical requirements The best environment for running the Python code in the book is on Google Colaboratory httpscolabresearchgooglecom Google Colaboratory is a product that runs Jupyter Notebook in the cloud It has common Python packages that are preinstalled and runs in a browser It can also communicate with a disk so that you can upload local files to Google Drive The recommended browsers are the latest versions of Chrome and Firefox For more information about Colaboratory check out their official notebooks https colabresearchgooglecom You can find the code for this chapter in the following GitHub repository https githubcomPacktPublishingEssentialStatisticsforNonSTEM DataAnalysts Collecting data from various data sources 5 Collecting data from various data sources There are three major ways to collect and gather data It is crucial to keep in mind that data doesnt have to be wellformatted tables Obtaining structured tabulated data directly For example the Federal Reserve httpswwwfederalreservegovdatahtm releases wellstructured and welldocumented data in various formats including CSV so that pandas can read the file into a DataFrame format Requesting data from an API For example the Google Map API https developersgooglecommapsdocumentation allows developers to request data from the Google API at a capped rate depending on the pricing plan The returned format is usually JSON or XML Building a dataset from scratch For example social scientists often perform surveys and collect participants answers to build proprietary data Lets look at some examples involving these three approaches You will use the UCI machine learning repository the Google Map API and USC Presidents Office websites as data sources respectively Reading data directly from files Reading data from local files or remote files through a URL usually requires a good source of publicly accessible data archives For example the University of California Irvine maintains a data repository for machine learning We will be reading the air quality dataset with pandas The latest URL will be updated in the books official GitHub repository in case the following code fails You may obtain the file from https archiveicsuciedumlmachinelearningdatabasesheart disease From the datasets we are using the processedhungariandata file You need to upload the file to the same folder where the notebook resides 6 Fundamentals of Data Collection Cleaning and Preprocessing The following code snippet reads the data and displays the first several rows of the datasets import pandas as pd df pdreadcsvprocessedhungariandata sep names agesexcptrestbps cholfbsrestecgthalach exangoldpeakslopeca thalnum dfhead This produces the following output Figure 11 Head of the Hungarian heart disease dataset In the following section you will learn how to obtain data from an API Obtaining data from an API In plain English an Application Programming Interface API defines protocols agreements or treaties between applications or parts of applications You need to pass requests to an API and obtain returned data in JSON or other formats specified in the API documentation Then you can extract the data you want Note When working with an API you need to follow the guidelines and restrictions regarding API usage Improper usage of an API will result in the suspension of an account or even legal issues Collecting data from various data sources 7 Lets 
take the Google Map Place API as an example The Place API https developersgooglecomplaceswebserviceintro is one of many Google Map APIs that Google offers Developers can use HTTP requests to obtain information about certain geographic locations the opening hours of establishments and the types of establishment such as schools government offices and police stations In terms of using external APIs Like many APIs the Google Map Place API requires you to create an account on its platform the Google Cloud Platform It is free but still requires a credit card account for some services it provides Please pay attention so that you wont be mistakenly charged After obtaining and activating the API credentials the developer can build standard HTTP requests to query the endpoints For example the textsearch endpoint is used to query places based on text Here you will use the API to query information about libraries in Culver City Los Angeles 1 First lets import the necessary libraries import requests import json 2 Initialize the API key and endpoints We need to replace APIKEY with a real API key to make the code work APIKEY Your API key goes here TEXTSEARCHURL httpsmapsgoogleapiscommapsapi placetextsearchjson query Culver City Library 3 Obtain the response returned and parse the returned data into JSON format Lets examine it response requestsgetTEXTSEARCH URLqueryquerykeyAPIKEY jsonobject responsejson printjsonobject 8 Fundamentals of Data Collection Cleaning and Preprocessing This is a oneresult response Otherwise the results fields will have multiple entries You can index the multientry results fields as a normal Python list object htmlattributions results formattedaddress 4975 Overland Ave Culver City CA 90230 United States geometry location lat 340075635 lng 1183969651 viewport northeast lat 3400909257989272 lng 1183955611701073 southwest lat 3400639292010727 lng 1183982608298927 icon httpsmapsgstaticcommapfilesplaceapi iconscivicbuilding71png id ccdd10b4f04fb117909897264c78ace0fa45c771 name Culver City Julian Dixon Library openinghours opennow True photos height 3024 htmlattributions a hrefhttpsmapsgooglecom mapscontrib102344423129359752463Khaled Alabeda photoreference CmRaAAAANT4Td01h1tkI7dTn35vAkZhx mg3PjgKvjHiyh80M5UlI3wVw1cer4vkOksYR68NM9aw33ZPYGQzzXTE 8bkOwQYuSChXAWlJUtz8atPhmRht4hP4dwFgqfbJULmG5f1EhAfWlF cpLz76sD81fns1OGhT4KUzWTbuNY544XozE02pLNWw width 4032 placeid ChIJrUqREx6woARFrQdyscOZ8 pluscode compoundcode 2J5326 Culver City California globalcode 85632J5326 rating 42 reference ChIJrUqREx6woARFrQdyscOZ8 types library pointofinterest establishment userratingstotal 49 status OK The address and name of the library can be obtained as follows printjsonobjectresults0formattedaddress printjsonobjectresults0name Collecting data from various data sources 9 The result reads as follows 4975 Overland Ave Culver City CA 90230 United States Culver City Julian Dixon Library Information An API can be especially helpful for data augmentation For example if you have a list of addresses that are corrupted or mislabeled using the Google Map API may help you correct wrong data Obtaining data from scratch There are instances where you would need to build your own dataset from scratch One way of building data is to crawl and parse the internet On the internet a lot of public resources are open to the public and free to use Googles spiders crawl the internet relentlessly 247 to keep its search results up to date You can write your own code to gather information online instead of opening a web browser to do it 
manually Doing a survey and obtaining feedback whether explicitly or implicitly is another way to obtain private data Companies such as Google and Amazon gather tons of data from user profiling Such data builds the core of their dominating power in ads and ecommerce We wont be covering this method however Legal issue of crawling Notice that in some cases web crawling is highly controversial Before crawling a website do check their user agreement Some websites explicitly forbid web crawling Even if a website is open to web crawling intensive requests may dramatically slow down the website disabling its normal functionality to serve other users It is a courtesy not only to respect their policy but also the law Here is a simple example that uses regular expression to obtain all the phone numbers from the web page of the presidents office University of Southern California httpdepartmentsdirectoryuscedupresoffhtml 1 First lets import the necessary libraries re is the Python builtin regular expression library requests is an HTTP client that enables communication with the internet through the http protocol import re import requests 10 Fundamentals of Data Collection Cleaning and Preprocessing 2 If you look at the web page you will notice that there is a pattern within the phone numbers All the phone numbers start with three digits followed by a hyphen and then four digits Our objective now is to compile such a pattern pattern recompiled3d4 3 The next step is to create an http client and obtain the response from the GET call response requestsgethttpdepartmentsdirectoryusc edupresoffhtml 4 The data attribute of response can be converted into a long string and fed to the findall method patternfindallstrresponsedata The results contain all the phone numbers on the web page 7402111 8211342 7402111 7402111 7402111 7402111 7402111 7402111 7409749 7402505 7406942 8211340 8216292 In this section we introduced three different ways of collecting data reading tabulated data from data files provided by others obtaining data from APIs and building data from scratch In the rest of the book we will focus on the first option and mainly use collected data from the UCI Machine Learning Repository In most cases API data and scraped data will be integrated into tabulated datasets for production usage Data imputation 11 Data imputation Missing data is ubiquitous and data imputation techniques will help us to alleviate its influence In this section we are going to use the heart disease data to examine the pros and cons of basic data imputation I recommend you read the dataset description beforehand to understand the meaning of each column Preparing the dataset for imputation The heart disease dataset is the same one we used earlier in the Collecting data from various data sources section It should give you a real red flag that you shouldnt take data integrity for granted The following screenshot shows missing data denoted by question marks Figure 12 The head of Hungarian heart disease data in VS Code CSV rainbow extension enabled First lets do an info call that lists column data type information dfinfo Note dfinfo is a very helpful function that provides you with pointers for your next move It should be the first function call when given an unknown dataset 12 Fundamentals of Data Collection Cleaning and Preprocessing The following screenshot shows the output obtained from the preceding function Figure 13 Output of the info function call If pandas cant infer the data type of a column it will interpret it as objects For 
example, the chol (cholesterol) column contains missing data. The missing data is a question mark treated as a string, while the remainder of the data is of the float type, so pandas interprets the records collectively as objects.

Python's type tolerance
As Python is pretty error-tolerant, it is good practice to introduce the necessary type checks. For example, if a column mixes types, do not rely on truthiness checks on numerical values; explicitly check the type and write two branches. It is also advisable to avoid type conversion on columns with the object data type. Remember to make your code completely deterministic and future-proof.

Now let's replace the question marks with NaN values. The following code snippet declares a function that handles three different cases and treats each appropriately. The three cases are:

- The record value is "?".
- The record value is of the integer type. This is treated independently because columns such as num should stay binary; floating-point numbers would lose the essence of the 0/1 encoding.
- The rest, which includes valid strings that can be converted to float numbers, as well as original float numbers.

The code snippet is as follows:

import numpy as np

def replace_question_mark(val):
    if val == "?":
        return np.NaN
    elif type(val) == int:
        return val
    else:
        return float(val)

df2 = df.copy()
for columnName, _ in df2.iteritems():
    df2[columnName] = df2[columnName].apply(replace_question_mark)

Now we call the info function and the head function as shown here:

df2.info()

You should expect that all fields are now either floats or integers, as shown in the following output:

Figure 1.4: Output of info after data type conversion

Now you can check the number of non-null entries for each column; different columns have different levels of completeness. age and sex don't contain missing values, but ca contains almost no valid data. This should guide your choice of data imputation. For example, strictly dropping all rows with missing values (which is also considered a form of data imputation) would remove almost the complete dataset. Let's check the shape of the DataFrame after the default missing-value drop. You will see that there is only one row left, which we don't want:

df2.dropna().shape

A screenshot of the output is as follows:

Figure 1.5: Removing records containing NaN values leaves only one entry

Before moving on to other, more mainstream imputation methods, let's perform a quick review of our processed DataFrame. Check the head of the new DataFrame; you should see that all question marks have been replaced by NaN values. NaN values are treated as legitimate numerical values, so native NumPy functions can be used on them:

df2.head()

The output should look as follows:

Figure 1.6: The head of the updated DataFrame

Now let's call the describe function, which generates a table of statistics. It is a very helpful and handy function for a quick peek at the common statistics of our dataset:

df2.describe()

Here is a screenshot of the output:

Figure 1.7: Output from the describe call

Understanding the describe limitation
Note that the describe function only considers valid values. In this sample, the average age value is more trustworthy than the average thal value. Also pay attention to the metadata: a numerical value doesn't necessarily have a numerical meaning. For example, a thal value is encoded to integers with given meanings.
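The describe output only aggregates the valid entries, so it can hide how much data is actually missing. A quick per-column count of NaN values makes the trade-offs between imputation strategies concrete. This is a minimal sketch of my own (not from the book); it assumes the df2 DataFrame created above:

missing_ratio = df2.isna().sum() / len(df2)   # fraction of NaN values per column
print(missing_ratio.sort_values(ascending=False).head())

# A possible rule of thumb (an assumption, not the book's recommendation):
# columns that are almost entirely missing, such as ca, are candidates for
# dropping rather than imputing.
print(missing_ratio[missing_ratio > 0.9].index.tolist())

Columns with only a few gaps are good candidates for the mean, median, or mode imputation methods that follow.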
Now let's examine the two most common ways of imputation.

Imputation with mean or median values
Imputation with mean or median values only works on numerical datasets. Categorical variables don't have structure such as one label being larger than another, so the concepts of mean and median don't apply.

There are several advantages associated with mean/median imputation:

- It is easy to implement.
- Mean/median imputation doesn't introduce extreme values.
- It does not have any time limit.

However, there are statistical consequences of mean/median imputation: the statistics of the dataset will change. For example, the histogram for cholesterol prior to imputation is provided here:

Figure 1.8: Cholesterol concentration distribution

The following code snippet performs the imputation with the mean. Following imputation with the mean, the histogram shifts to the right a little bit:

chol = df2["chol"]
plt.hist(chol.apply(lambda x: np.mean(chol) if np.isnan(x) else x),
         bins=range(0, 630, 30))
plt.xlabel("cholesterol imputation")
plt.ylabel("count")

Figure 1.9: Cholesterol concentration distribution with mean imputation

Imputation with the median will shift the peak to the left, because the median is smaller than the mean. However, this won't be obvious if you enlarge the bin size: the median and mean will then likely fall into the same bin.

Figure 1.10: Cholesterol imputation with median imputation

The good news is that the shape of the distribution looks rather similar. The bad news is that we probably increased the level of concentration a little bit. We will cover such statistics in Chapter 3, Visualization with Statistical Graphs.

Note
In other cases, where the distribution is not centered or contains a substantial ratio of missing data, such imputation can be disastrous. For example, if the waiting time in a restaurant follows an exponential distribution, imputation with mean values will probably break the characteristics of the distribution.

Imputation with the mode/most frequent value
The advantage of using the most frequent value is that it works with categorical features; without a doubt, it will introduce bias as well. The slope field is categorical in nature although it looks numerical: it represents the three statuses of a slope value as positive, flat, or negative. The following code snippet reveals this:

plt.hist(df2["slope"], bins=5)
plt.xlabel("slope")
plt.ylabel("count")

Here is the output:

Figure 1.11: Counting of the slope variable

Without a doubt, the mode is 2. Following imputation with the mode, we obtain the following new distribution:

plt.hist(df2["slope"].apply(lambda x: 2 if np.isnan(x) else x), bins=5)
plt.xlabel("slope mode imputation")
plt.ylabel("count")

In the following graph, pay attention to the scale on the y axis:

Figure 1.12: Counting of the slope variable after mode imputation

Replacing missing values with the mode is disastrous in this case. If the positive and negative values of slope have medical consequences, performing prediction tasks on the preprocessed dataset will depress their weights and significance.

Different imputation methods have their own pros and cons. The prerequisite is to fully understand your business goals and downstream tasks. If key statistics are important, you should try to avoid distorting them. Also, remember that collecting more data is always an option.

Outlier removal
Outliers can stem from two possibilities: they either come from mistakes, or they have a story behind them. In principle, outliers should be very rare; otherwise, the experiment or survey that generated the dataset is intrinsically flawed. The definition of an
outlier is tricky Outliers can be legitimate because they fall into the long tail end of the population For example a team working on financial crisis prediction establishes that a financial crisis occurs in one out of 1000 simulations Of course the result is not an outlier that should be discarded It is often good to keep original mysterious outliers from the raw data if possible In other words the reason to remove outliers should only come from outside the dataset only when you already know the originals For example if the heart rate data is strangely fast and you know there is something wrong with the medical equipment then you can remove the bad data The fact that you know the sensorequipment is wrong cant be deduced from the dataset itself Perhaps the best example for including outliers in data is the discovery of Neptune In 1821 Alexis Bouvard discovered substantial deviations in Uranus orbit based on observations This led him to hypothesize that another planet may be affecting Uranus orbit which was found to be Neptune Otherwise discarding mysterious outliers is risky for downstream tasks For example some regression tasks are sensitive to extreme values It takes further experiments to decide whether the outliers exist for a reason In such cases dont remove or correct outliers in the data preprocessing steps The following graph generates a scatter plot for the trestbps and chol fields The highlighted data points are possible outliers but I probably will keep them for now Figure 113 A scatter plot of two fields in heart disease dataset Data standardization when and how 21 Like missing data imputation outlier removal is tricky and depends on the quality of data and your understanding of the data It is hard to discuss systemized outlier removal without talking about concepts such as quartiles and box plots In this section we looked at the background information pertaining to outlier removal We will talk about the implementation based on statistical criteria in the corresponding sections in Chapter 2 Essential Statistics for Data Assessment and Chapter 3 Visualization with Statistical Graphs Data standardization when and how Data standardization is a common preprocessing step I use the terms standardization and normalization interchangeably You may also encounter the concept of rescaling in literature or blogs Standardization often means shifting the data to be zerocentered with a standard deviation of 1 The goal is to bring variables with different unitsranges down to the same range Many machine learning tasks are sensitive to data magnitudes Standardization is supposed to remove such factors Rescaling doesnt necessarily bring the variables to a common range This is done by means of customized mapping usually linear to scale original data to a different range However the common approach of minmax scaling does transform different variables into a common range 0 1 People may argue about the difference between standardization and normalization When comparing their differences normalization will refer to normalizing different variables to the same range 0 1 and minmax scaling is considered a normalization algorithm However there are other normalization algorithms as well Standardization cares more about the mean and standard deviation Standardization also transforms the original distribution closer to a Gaussian distribution In the event that the original distribution is indeed Gaussian standardization outputs a standard Gaussian distribution When to perform standardization Perform 
standardization when your downstream tasks require it For example the knearest neighbors method is sensitive to variable magnitudes so you should standardize the data On the other hand treebased methods are not sensitive to different ranges of variables so standardization is not required 22 Fundamentals of Data Collection Cleaning and Preprocessing There are mature libraries to perform standardization We first calculate the standard deviation and mean of the data subtract the mean from every entry and then divide by the standard deviation Standard deviation describes the level of variety in data that will be discussed more in Chapter 2 Essential Statistics for Data Assessment Here is an example involving vanilla Python stdChol npstdchol meanChol npmeanchol chol2 cholapplylambda x xmeanCholstdChol plthistchol2binsrangeintminchol2 intmaxchol21 1 The output is as follows Figure 114 Standardized cholesterol data Note that the standardized distribution looks more like a Gaussian distribution now Data standardization is irreversible Information will be lost in standardization It is only recommended to do so when no original information such as magnitudes or original standard deviation will be required later In most cases standardization is a safe choice for most downstream data science tasks In the next section we will use the scikitlearn preprocessing module to demonstrate tasks involving standardization Examples involving the scikitlearn preprocessing module 23 Examples involving the scikitlearn preprocessing module For both imputation and standardization scikitlearn offers similar APIs 1 First fit the data to learn the imputer or standardizer 2 Then use the fitted object to transform new data In this section I will demonstrate two examples one for imputation and another for standardization Note Scikitlearn uses the same syntax of fit and predict for predictive models This is a very good practice for keeping the interface consistent We will cover the machine learning methods in later chapters Imputation First create an imputer from the SimpleImputer class The initialization of the instance allows you to choose missing value forms It is handy as we can feed our original data into it by treating the question mark as a missing value from sklearnimpute import SimpleImputer imputer SimpleImputermissingvaluesnpnan strategymean Note that fit and transform can accept the same input imputerfitdf2 df3 pdDataFrameimputertransformdf2 Now check the number of missing values the result should be 0 npsumnpsumnpisnandf3 Standardization Standardization can be implemented in a similar fashion from sklearn import preprocessing 24 Fundamentals of Data Collection Cleaning and Preprocessing The scale function provides the default zeromean onestandard deviation transformation df4 pdDataFramepreprocessingscaledf2 Note In this example categorical variables represented by integers are also zero mean which should be avoided in production Lets check the standard deviation and mean The following line outputs infinitesimal values df4meanaxis0 The following line outputs values close to 1 df4stdaxis0 Lets look at an example of MinMaxScaler which transforms every variable into the range 0 1 The following code fits and transforms the heart disease dataset in one step It is left to you to examine its validity minMaxScaler preprocessingMinMaxScaler df5 pdDataFrameminMaxScalerfittransformdf2 Lets now summarize what we have learned in this chapter Summary In this chapter we covered several important topics that usually emerge at 
the earliest stage of a data science project We examined their applicable scenarios and conservatively checked some consequences either numerically or visually Many arguments made here will be more prominent when we cover other more sophisticated topics later In the next chapter we will review probabilities and statistical concepts including the mean the median quartiles standard deviation and skewness I am sure you will then have a deeper understanding of concepts such as outliers 2 Essential Statistics for Data Assessment In Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing we learned about data collection basic data imputation outlier removal and standardization Hence this will provide you with a good foundation to understand this chapter In this chapter you are going to learn how to examine the essential statistics for data assessment Essential statistics are also often referred to as descriptive statistics Descriptive statistics provide simple quantitative summaries of datasets usually combined with descriptive graphics For example descriptive statistics can demonstrate the tendency of centralization or measures of the variability of features and so on Descriptive statistics are important Correctly represented descriptive statistics give you a precise summary of the datasets at your disposal In this chapter we will learn to extract information and make quantitative judgements from descriptive statistics Just a headsup at this point Besides descriptive statistics another kind of statistics is known as inferential statistics which tries to learn information from the distribution of the population that the dataset was generated or sampled from In this chapter we assume the data covers a whole population rather than a subset sampled from a distribution We will see the differences between the two statistics in later chapters as well For now dont worry 26 Essential Statistics for Data Assessment The following topics will be covered in this chapter Classifying numerical and categorical variables Understanding mean median and mode Learning about variance standard deviation percentiles and skewness Knowing how to handle categorical variables and mixed data types Using bivariate and multivariate descriptive statistics Classifying numerical and categorical variables Descriptive statistics are all about variables You must know what you are describing to define corresponding descriptive statistics A variable is also referred to as a feature or attribute in other literature They all mean the same thing a single column in a tabulated dataset In this section you will examine the two most important variable types numerical and categorical and learn to distinguish between them Categorical variables are discrete and usually represent a classification property of entry Numerical variables are continuous and descriptive quantitatively Descriptive statistics that can be applied to one kind of variable may not be applied to another one hence distinguishing between them precedes analytics Distinguishing between numerical and categorical variables In order to understand the differences between the two types of variables with the help of an example I will be using the population estimates dataset released by the United States Department of Agriculture by way of a demonstration It contains the estimated population data at county level for the United States from 2010 to 2018 You can obtain the data from the official website httpswwwersusdagovdataproducts countyleveldatasetsdownloaddata or the 
books GitHub repository The following code snippet loads the data and examines the first several rows import pandas as pd df pdreadexcelPopulationEstimatesxlsskiprows2 dfhead8 Classifying numerical and categorical variables 27 The output is a table with more than 140 columns Here are two screenshots showing the beginning and trailing columns Figure 21 First 6 columns of the dfad output In the preceding dataset there is a variable called RuralurbanContinuum Code2013 It takes the value of integers This leads to pandas autointerpreting this variable pandas autointerprets it as numerical Instead however the variable is actually categorical Should you always trust libraries Dont always trust what functions from Python libraries give you They may be wrong and the developer which is you has to make the final decision After some research we found the variable description on this page https wwwersusdagovdataproductsruralurbancontinuumcodes According to the code standard published in 2013 the RuralurbanContinuum Code2013 variable indicates how urbanized an area is 28 Essential Statistics for Data Assessment The meaning of RuralurbanContinuum Code2013 is shown in Figure 22 Figure 22 Interpretation of RuralurbanContinuum Code2013 Note Pandas makes intelligent autointerpretations of variable types but oftentimes it is wrong It is up to the data scientist to investigate the exact meaning of the variable type and then change it Many datasets use integers to represent categorical variables Treating them as numerical values may result in serious consequences in terms of downstream tasks such as machine learning mainly because artificial distances between numerical values will be introduced On the other hand numerical variables often have a direct quantitative meaning For example RNETMIG2013 means the rate of net immigration in 2013 for a specific area A histogram plot of this numerical variable gives a more descriptive summary of immigration trends in the States but it makes little sense plotting the code beyond simple counting Lets check the net immigration rate for the year 2013 with the following code snippet pltfigurefigsize86 pltrcParamsupdatefontsize 22 plthistdfRNETMIG2013binsnplinspacenpnanmindfR NETMIG2013npnanmaxdfRNETMIG2013num100 plttitleRate of Net Immigration Distribution for All Records 2013 Classifying numerical and categorical variables 29 The result appears as follows Figure 23 Distribution of the immigration rate for all records in datasets Here are the observations drawn from Figure 23 In either categorical or numerical variables structures can be introduced to construct special cases A typical example is date or time Depending on the scenarios date and time can be treated as categorical variables as well as numerical variables with a semicontinuous structure It is common to convert numerical variables to categorical variables on the basis of a number of rules The ruralurban code is a typical example Such a conversion is easy for conveying a first impression 30 Essential Statistics for Data Assessment Now that we have learned how to distinguish between numerical and categorical variables lets move on to understanding a few essential concepts of statistics namely mean median and mode Understanding mean median and mode Mean median and mode describe the central tendency in some way Mean and median are only applicable to numerical variables whereas mode is applicable to both categorical and numerical variables In this section we will be focusing on mean median and mode for numerical 
variables, as their numerical interactions usually convey interesting observations.

Mean
The mean, or arithmetic mean, measures the weighted center of a variable. Let's use n to denote the total number of entries and i as the index. The mean reads as follows:

\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}

The mean is influenced by the value of every entry in the population. Let me give an example. In the following code, I generate 1,000 random numbers distributed uniformly between 0 and 1, plot them, and calculate their mean:

import random
random.seed(2019)
plt.figure(figsize=(8, 6))
rvs = [random.random() for _ in range(1000)]
plt.hist(rvs, bins=50)
plt.title("Histogram of Uniformly Distributed RV")

The resulting histogram plot appears as follows:

Figure 2.4: Histogram of uniformly distributed variables between 0 and 1

The mean is around 0.505477, pretty close to what we surmised.

Median
The median measures the unweighted center of a variable. If there is an odd number of entries, the median takes the value of the central one; if there is an even number of entries, the median takes the mean of the central two entries. The median may not be influenced by every entry's value. Because of this property, the median is more robust, or representative, than the mean. I will use the same set of entries as in the previous section as an example. The following code calculates the median:

np.median(rvs)

The result is 0.5136755026003803. Now I will change one entry to 1000, which is 1,000 times larger than the maximal possible value in the dataset, and repeat the calculation:

rvs[1] = 1000
print(np.mean(rvs))
print(np.median(rvs))

The results are 1.5054701085937803 and 0.5150437661964872. The mean increased by roughly 1, but the median is robust.

The relationship between the mean and the median is usually interesting and worth investigating. A larger median combined with a smaller mean usually indicates that there are more points on the bigger-value side, but that extremely small values also exist. The reverse is true when the median is smaller than the mean. We will demonstrate this with some examples later.

Mode
The mode of a set of values is the most frequent element in the set. In a histogram plot, it shows up as a peak. If the distribution has only one mode, we call it unimodal. Distributions with two peaks, which don't have to have equal heights, are referred to as bimodal.

Bimodals and the bimodal distribution
Sometimes the definition of bimodal is corrupted. The property of being bimodal usually refers to having two modes, which, according to the definition of mode, requires peaks of the same height. However, the term bimodal distribution often refers to a distribution with two local maxima. Double-check your distribution and state the modes clearly.

The following code snippet demonstrates two distributions with unimodal and bimodal shapes, respectively:

r1 = [random.normalvariate(0.5, 0.2) for _ in range(10000)]
r2 = [random.normalvariate(0.2, 0.1) for _ in range(5000)]
r3 = [random.normalvariate(0.8, 0.2) for _ in range(5000)]
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].hist(r1, bins=100)
axes[0].set_title("Unimodal")
axes[1].hist(r2 + r3, bins=100)
axes[1].set_title("Bimodal")

The resulting two subplots appear as follows:

Figure 2.5: Histograms of unimodal and bimodal datasets, with one mode and two modes

So far, we have talked about the mean, median, and mode, which are the first three statistics of a dataset. They are the start of almost all exploratory data analysis.
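Before moving on, note that unlike the mean and the median, the mode is also defined for categorical data. A small illustration of my own (not from the book): pandas' Series.mode() returns every value that attains the highest frequency, so a sample with two equally frequent values yields two modes.

import pandas as pd

s_numeric = pd.Series([1, 2, 2, 3, 4])
s_categorical = pd.Series(["flat", "flat", "up", "up", "down"])
print(s_numeric.mode().tolist())       # [2]
print(s_categorical.mode().tolist())   # ['flat', 'up'] - two modes

This is also what makes mode imputation straightforward for categorical columns, as we saw in Chapter 1.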
Learning about variance, standard deviation, quartiles, percentiles, and skewness

In the previous section, we studied the mean, median, and mode. They all describe, to a certain degree, the properties of the central part of the dataset. In this section, we will learn how to describe the spreading behavior of data.

Variance

With the same notation, the variance for the population is defined as follows:

$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$

Intuitively, the further away the elements are from the mean, the larger the variance. Here, I plotted the histograms of two datasets with different variances. The one on the left subplot has a variance of 0.09 and the one on the right subplot has a variance of 0.01, roughly 10 times smaller.

The following code snippet generates samples from the two distributions and plots them:

```python
r1 = [random.normalvariate(0.5, 0.3) for _ in range(10000)]
r2 = [random.normalvariate(0.5, 0.1) for _ in range(10000)]
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].hist(r1, bins=100)
axes[0].set_xlim(-1, 2)
axes[0].set_title("Big Variance")
axes[1].hist(r2, bins=100)
axes[1].set_title("Small Variance")
axes[1].set_xlim(-1, 2)
```

The results appear as follows:

Figure 2.6: Big and small variances with the same mean at 0.5

The following code snippet generates a scatter plot that demonstrates the difference more clearly. The variable on the x axis spreads more widely:

```python
plt.figure(figsize=(8, 8))
plt.scatter(r1, r2, alpha=0.2)
plt.xlim(-1, 2)
plt.ylim(-1, 2)
plt.xlabel("Big Variance Variable")
plt.ylabel("Small Variance Variable")
plt.title("Variables With Different Variances")
```

The result looks as follows:

Figure 2.7: Scatter plot of large-variance and small-variance variables

The spread on the x axis is significantly larger than the spread on the y axis, which indicates the difference in variance magnitude. A common mistake is not getting the range correct: Matplotlib will, by default, try to determine the ranges for you. You need to use a call such as plt.xlim to force the range; otherwise, the result is misleading.

Standard deviation

The standard deviation is the square root of the variance. It is used more commonly to measure the level of dispersion, since it has the same unit as the original data. The formula for the standard deviation of a population reads as follows:

$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$

The standard deviation is extremely important in scientific graphing. A standard deviation is often plotted together with the data and represents an estimate of variability. For this chapter, I will be using the net immigration rate for Texas counties from 2011 to 2018 as an example. In the following code snippet, I first extract the county-level data, append the means and standard deviations to lists, and then plot them at the end. The standard deviation is obtained using numpy.std and the error bar is plotted using matplotlib.pyplot.errorbar:

```python
# df holds the population estimates table loaded earlier in the chapter
dfTX = df[df["State"] == "TX"].tail(-1)
YEARS = [year for year in range(2011, 2019)]
MEANS = []
STDS = []
for i in range(2011, 2019):
    year = "R_NET_MIG_" + str(i)
    MEANS.append(np.mean(dfTX[year]))
    STDS.append(np.std(dfTX[year]))
plt.figure(figsize=(10, 8))
plt.errorbar(YEARS, MEANS, yerr=STDS)
plt.xlabel("Year")
plt.ylabel("Net Immigration Rate")
```

The output appears as shown in the following figure:

Figure 2.8: Net immigration rate across counties in Texas from 2011 to 2018

We can see in Figure 2.8 that although the net immigration in Texas is only slightly positive, the standard deviation is huge. Some counties may have a big positive net rate while others may potentially suffer from a loss of human resources.

Quartiles

Quartiles are a special kind of quantile; quantiles divide data into a number of equal portions. For
example quartiles divide data into four equal parts with the ½ quartile as the median Deciles and percentiles divide data into 10 and 100 equal parts respectively The first quartile also known as the lower quartile 1 takes the value such that 25 of all the data lies below it The second quartile is the median The third quartile 3 is also known as the upper quartile and 25 of all values lie above it Quartiles are probably the most commonly used quantiles because they are associated with a statistical graph called a boxplot Lets use the same set of Texas net immigration data to study it 38 Essential Statistics for Data Assessment The function in NumPy is quantile and we specify a list of quantiles as an argument for the quantiles we want to calculate as in the following singleline code snippet npquantiledfTXRNETMIG201302505075 The output reads as follows array783469971 087919226 884040759 The following code snippet visualizes the quartiles pltfigurefigsize125 plthistdfTXRNETMIG2013bins50alpha06 for quartile in npquantiledfTXRNET MIG201302505075 pltaxvlinequartilelinestylelinewidth4 As you can see from the following output the vertical dotted lines indicate the three quartiles Figure 29 Quartiles of the net immigration data in 2013 The lower and upper quartiles keep exactly 50 of the data values in between 3 1 is referred to as the interquartile range called Interquartile Range IQR and it plays an important role in outlier detection We will see more about this soon Learning about variance standard deviation quartiles percentiles and skewness 39 Skewness Skewness differs from the three measures of variability we discussed in the previous subsections It measures the direction the data takes and the extent to which the data distribution tilts Skewness is given as shown in the following equation σ Various definitions of skewness The skewness we defined earlier is precisely referred to as Pearsons first skewness coefficient It is defined through the mode but there are other definitions of skewness For example skewness can be defined through the median Skewness is unitless If the mean is larger than the mode skewness is positive and we say the data is skewed to the right Otherwise the data is skewed to the left Here is the code snippet that generates two sets of skewed data and plots them r1 randomnormalvariate0504 for in range10000 r2 randomnormalvariate0102 for in range10000 r3 randomnormalvariate1102 for in range10000 fig axes pltsubplots12figsize125 axes0histr1r2bins100alpha05 axes0axvlinenpmeanr1r2 linestylelinewidth4 axes0settitleSkewed To Right axes1histr1r3bins100alpha05 axes1axvlinenpmeanr1r3linestylelinewidth4 axes1settitleSkewed to Left 40 Essential Statistics for Data Assessment The vertical dotted line indicates the position of the mean as follows Figure 210 Skewness demonstration Think about the problem of income inequality Lets say you have a plot of the histogram of the population with different amounts of wealth A larger value just like where the x axis value indicates the amount of wealth and the y axis value indicates the portion of the population that falls into a certain wealth amount range A larger x value means more wealth A larger y value means a greater percentage of the population falls into that range of wealth possession Positive skewness the left subplot in Figure 210 means that even though the average income looks good this may be driven up by a very small number of super rich individuals when the majority of people earn a relatively small income Negative skewness the 
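As a compact illustration of the quartile and skewness ideas in this section, the following sketch computes the three quartiles, the IQR, and a median-based skewness coefficient (Pearson's second coefficient, 3 × (mean − median) / σ, one of the median-based alternatives mentioned above). It assumes NumPy as np and uses a synthetic right-skewed sample rather than the immigration data.

```python
import numpy as np

rng = np.random.default_rng(2020)
sample = rng.gamma(shape=2.0, scale=1.0, size=10_000)  # right-skewed sample

q1, q2, q3 = np.quantile(sample, [0.25, 0.50, 0.75])
iqr = q3 - q1

# Pearson's second (median-based) skewness coefficient.
skew = 3 * (np.mean(sample) - np.median(sample)) / np.std(sample)

print("Q1, median, Q3:", q1, q2, q3)
print("IQR           :", iqr)
print("skewness      :", skew)   # positive, because the tail points right
```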
right subplot in Figure 210 indicates that the majority may have an income above the mean value so there might be some very poor people who may need help A revisit of outlier detection Now lets use what we have learned to revisit the outlier detection problem The zscore also known as the standard score is a good criterion for detecting outliers It measures the distance between an entry and the population mean taking the population variance into consideration z σ Learning about variance standard deviation quartiles percentiles and skewness 41 If the underlying distribution is normal a situation where a zscore is greater than 3 or less than 0 only has a probability of roughly 027 Even if the underlying distribution is not normal Chebyshevs theorem guarantees a strong claim such that at most 1k2 where k is an integer of the total population can fall outside k standard deviations As an example the following code snippet generates 10000 data points that follow a normal distribution randomseed2020 x randomnormalvariate1 05 for in range10000 pltfigurefigsize108 plthistxbins100alpha05 styles for i in range3 pltaxvlinenpmeanx i1npstdx linestylestylesi linewidth4 pltaxvlinenpmeanx i1npstdx linestylestylesi linewidth4 plttitleInteger Z values for symmetric distributions In the generated histogram plot the dotted line indicates the location where 1 The dashed line indicates the location of 2 The dashed dotted line indicates the location of 3 Figure 211 Integer z value boundaries for normally distributed symmetric data 42 Essential Statistics for Data Assessment If we change the data points the distribution will change but the zscore criteria will remain valid As you can see in the following code snippet an asymmetric distribution is generated rather than a normal distribution x randomnormalvariate1 05 randomexpovariate2 for in range10000 This produces the following output Figure 212 Integer z value boundaries for asymmetric data Note on the influence of extreme outliers A drawback of the zscore is that the mean itself is also influenced by extreme outliers The median can replace a mean to remove this effect It is flexible to set different criteria in different production cases We have covered several of the most important statistics to model variances in a dataset In the next section lets work on the data types of features Knowing how to handle categorical variables and mixed data types 43 Knowing how to handle categorical variables and mixed data types Categorical variables usually have simpler structures or descriptive statistics than continuous variables Here we introduce the two main descriptive statistics and talk about some interesting cases when converting continuous variables to categorical ones Frequencies and proportions When we discussed the mode for categorical variables we introduced Counter which outputs a dictionary structure whose keyvalue pair is the elementcounting pair The following is an example of a counter Counter20 394 30 369 60 597 10 472 90 425 70 434 80 220 40 217 50 92 The following code snippet illustrates frequency as a bar plot where the absolute values of counting become intuitive counter CounterdfRuralurbanContinuum Code2013 dropna labels x for key val in counteritems labelsappendstrkey xappendval pltfigurefigsize108 pltbarlabelsx plttitleBar plot of frequency 44 Essential Statistics for Data Assessment What you will get is the bar plot that follows Figure 213 Bar plot of ruralurban continuum code For proportionality simply divide each count by the summation of counting 
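Returning briefly to the z-score rule above, the filter can be written in a few lines. This is a sketch rather than the book's code: it assumes NumPy as np, uses a synthetic sample, and flags values whose absolute z-score exceeds 3. For a normal distribution, only about 0.27% of values fall more than three standard deviations from the mean, and Chebyshev's inequality bounds the fraction outside k standard deviations by 1/k² for any distribution.

```python
import numpy as np

rng = np.random.default_rng(2020)
x = rng.normal(loc=1.0, scale=0.5, size=10_000)
x[:5] = [10, -8, 12, 9, -7]           # plant a few obvious outliers

z = (x - np.mean(x)) / np.std(x)      # z-score of every entry
outliers = x[np.abs(z) > 3]

print("flagged outliers:", np.sort(outliers))
# A more robust variant replaces np.mean with np.median, as noted above.
```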
as shown in the following code snippet x nparrayxsumx The shape of the bar plot remains the same but the y axis ticks change To better check the relative size of components I have plotted a pie plot with the help of the following code snippet pltfigurefigsize1010 pltpiexxlabelslabels plttitlePie plot for ruralurban continuum code Knowing how to handle categorical variables and mixed data types 45 What you get is a nice pie chart as follows Figure 214 Pie plot of ruralurban continuum code It becomes evident that code 20 contains about twice as many samples as code 80 does Unlike the mean and median categorical data does have a mode We are going to reuse the same data CounterdfRuralurbanContinuum Code2013dropna The output reads as follows Counter20 394 30 369 60 597 10 472 90 425 70 434 80 220 40 217 50 92 46 Essential Statistics for Data Assessment The mode is 60 Note The mode means that the counties with urban populations of 2500 to 19999 adjacent to a metro area are most prevalent in the United States and not the number 60 Transforming a continuous variable to a categorical one Occasionally we may need to convert a continuous variable to a categorical one Lets take lifespan as an example The 80 age group is supposed to be very small Each of them will represent a negligible data point in classification tasks If they can be grouped together the noise introduced by the sparsity of this age groups individual points will be reduced A common way to perform categorization is to use quantiles For example quartiles will divide the datasets into four parts with an equal number of entries This avoids issues such as data imbalance For example the following code indicates the cutoffs for the categorization of the continuous variable net immigration rate series dfRNETMIG2013dropna quantiles npquantileseries02i for i in range15 pltfigurefigsize108 plthistseriesbins100alpha05 pltxlim5050 for i in rangelenquantiles pltaxvlinequantilesilinestyle linewidth4 plttitleQuantiles for net immigration data Using bivariate and multivariate descriptive statistics 47 As you can see in the following output the dotted vertical lines split the data into 5 equal sets which are hard to spot with the naked eye I truncated the x axis to select the part between 50 and 50 The result looks as follows Figure 215 Quantiles for the net immigration rate Note on the loss of information Categorization destroys the rich structure in continuous variables Only use it when you absolutely need to Using bivariate and multivariate descriptive statistics In this section we briefly talk about bivariate descriptive statistics Bivariate descriptive statistics apply two variables rather than one We are going to focus on correlation for continuous variables and crosstabulation for categorical variables 48 Essential Statistics for Data Assessment Covariance The word covariance is often incorrectly used as correlation However there are a number of fundamental differences Covariance usually measures the joint variability of two variables while correlation focuses more on the strength of variability Correlation coefficients have several definitions in different use cases The most common descriptive statistic is the Pearson correlation coefficient We will also be using it to describe the covariance of two variables The correlation coefficient for variables x and y from a population is defined as follows Lets first examine the expressions sign The coefficient becomes positive when x is greater than its mean and y is also greater than its own mean 
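Because the sign argument is the heart of the Pearson coefficient, it can help to compute it once directly from the definition given above (the sum of the products of deviations, normalized by n and by each standard deviation) and compare it against NumPy's built-in result. This is a minimal sketch with synthetic data, not the county dataset, and it assumes NumPy as np.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=1_000)
y = 0.8 * x + rng.normal(scale=0.5, size=1_000)   # positively related to x

n = len(x)
rho = np.sum((x - x.mean()) * (y - y.mean())) / (n * x.std() * y.std())

print("manual Pearson rho:", rho)
print("np.corrcoef       :", np.corrcoef(x, y)[0, 1])   # should agree
```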
Another case is when x and y are both smaller than their means respectively The products sum together and then get normalized by the standard deviation of each variable So a positive coefficient indicates that x and y vary jointly in the same direction You can make a similar argument about negative coefficients In the following code snippet we select the net immigration rates for counties in Texas as our datasets and use the corr function to inspect the correlation coefficient across years corrs dfTXRNETMIG2011RNETMIG2012 RNET MIG2013 RNETMIG2014 RNETMIG2015RNETMIG2016 RNETMIG2017 RNETMIG2018corr The output is a socalled correlation matrix whose diagonal elements are the self correlation coefficients which are just 1 Figure 216 Correlation matrix for the net immigration rate ρ 1 1 σxσy Using bivariate and multivariate descriptive statistics 49 A good way to visualize this matrix is to use the heatmap function from the Seaborn library The following code snippet generates a nice heatmap import seaborn as sns pltfigurefigsize108 pltrcParamsupdatefontsize 12 snsheatmapcorrscmapYlGnBu The result looks as follows Figure 217 Heatmap of a correlation matrix for net immigration rates in Texas We do see an interesting pattern that odd years correlate with one another more strongly and even years correlate with each other more strongly too However that is not the case between even and odd numbered years Perhaps there is a 2year cyclic pattern and the heatmap of the correlation matrix just helped us discover it 50 Essential Statistics for Data Assessment Crosstabulation Crosstabulation can be treated as a discrete version of correlation detection for categorical variables It helps derive innumerable insights and sheds light on downstream task designs Here is an example I am creating a list of weather information and another list of a golfers decisions on whether to go golfing The crosstab function generates the following table weather rainysunnyrainywindywindy sunnyrainywindysunnyrainy sunnywindywindy golfing YesYesNoNoYesYesNoNo YesNoYesNoNo dfGolf pdDataFrameweatherweathergolfinggolfing pdcrosstabdfGolfweather dfGolfgolfing marginsTrue Figure 218 Crosstabulation for golfing decisions As you can see the columns and rows give the exact counts which are identified by the column name and row name For a dataset with a limited number of features this is a handy way to inspect imbalance or bias We can tell that the golfer goes golfing if the weather is sunny and that they seldom go golfing on rainy or windy days With that we have come to the end of the chapter Summary 51 Summary Statistics or tools to assess datasets were introduced and demonstrated in this chapter You should be able to identify different kinds of variables compute corresponding statistics and detect outliers We do see graphing as an essential part of descriptive statistics In the next chapter we will cover the basics of Python plotting the advanced customization of aesthetics and professional plotting techniques 3 Visualization with Statistical Graphs A picture is worth a thousand words Humans rely on visual input for more than 90 of all information obtained A statistical graph can demonstrate trends explain reasons or predict futures much better than words if done right Python data ecosystems come with a lot of great tools for visualization The three most important ones are Matplotlib seaborn and plotly The first two are mainly for static plotting while plotly is capable of interactive plotting and is gaining in popularity gradually In 
this chapter you will focus on static plotting which is the backbone of data visualization We have already extensively used some plots in previous chapters to illustrate concepts In this chapter we will approach them in a systematic way The topics that will be covered in this chapter are as follows Picking the right plotting types for different tasks Improving and customizing visualization with advanced aesthetic customization Performing statistical plotting tailored for business queries Building stylish and professional plots for presentations or reports 54 Visualization with Statistical Graphs Lets start with the basic Matplotlib library Basic examples with the Python Matplotlib package In this chapter we will start with the most basic functionalities of the Matplotlib package Lets first understand the elements to make a perfect statistical graph Elements of a statistical graph Before we dive into Python code l will give you an overview of how to decompose the components of a statistical graph I personally think the philosophy that embeds the R ggplot2 package is very concise and clear Note R is another famous programming language for data science and statistical analysis There are also successful R packages The counterpart of Matplotlib is the R ggplot2 package mentioned previously ggplot2 is a very successful visualization tool developed by Hadley Wickman It decomposes a statistical plot into the following three components Data The data must have the information to display otherwise the plotting becomes totally misleading The data can be transformed such as with categorization before being visualized Geometries Geometry here means the types of plotting For example bar plot pie plot boxplot and scatter plot are all different types of plotting Different geometries are suitable for different visualization purposes Aesthetics The size shape color and positioning of visual elements such as the title ticks and legends all belong to aesthetics A coherent collection of aesthetic elements can be bundled together as a theme For example Facebook and The Economist have very distinguishable graphical themes Basic examples with the Python Matplotlib package 55 Lets use the birth rate and death rate data for Texas counties grouped by urbanization level as an example Before that lets relate this data with the three components mentioned previously The data is the birth rate and death rate data which determines the location of the scattered points The geometry is a scatter plot If you use a line plot you are using the wrong type of plot because there isnt a natural ordering structure in the dataset There are many aesthetic elements but the most important ones are the size and the color of the spots How they are determined will be detailed when we reach the second section of this chapter Incorporating this data into a graph gives a result that would look something like this Figure 31 Example for elements of statistical graphing 56 Visualization with Statistical Graphs Geometry is built upon data and the aesthetics will only make sense if you have the right data and geometry In this chapter you can assume we already have the right data If you have the wrong data you will end up with graphs that make no sense and are oftentimes misleading In this section lets focus mainly on geometry In the following sections I will talk about how to transform data and customize aesthetics Exploring important types of plotting in Matplotlib Lets first explore the most important plotting types one by one Simple line plots A 
simple line plot is the easiest type of plotting It represents only a binary mapping relationship between two ordered sets Stock price versus date is an example temperature versus time is another The following code snippet generates a list of evenly spaced numbers and their sine and plots them Please note that the libraries only need to be imported once import numpy as np import matplotlibpyplot as plt matplotlib inline fig pltfigure x nplinspace0 10 1000 pltplotx npsinx This generates the following output Figure 32 A simple line plot of the sine function Basic examples with the Python Matplotlib package 57 You can add one or two more simple line plots Matplotlib will decide the default color of the lines for you The following snippet will add two more trigonometric functions fig pltfigurefigsize108 x nplinspace0 10 100 pltplotx npsinxlinestylelinewidth4 pltplotxnpcosxlinestylelinewidth4 pltplotxnpcos2xlinestylelinewidth4 Different sets of data are plotted with dashed lines dotted lines and dasheddotted lines as shown in the following figure Figure 33 Multiple simple line plots Now that we have understood a simple line plot lets move on to the next type of plotting a histogram plot 58 Visualization with Statistical Graphs Histogram plots We used a histogram plot extensively in the previous chapter This type of plot groups data into bins and shows the counts of data points in each bin with neighboring bars The following code snippet demonstrates a traditional onedimensional histogram plot x1 nprandomlaplace0 08 500 x2 nprandomnormal3 2 500 plthistx1 alpha05 densityTrue bins20 plthistx2 alpha05 densityTrue bins20 The following output shows the histogram plots overlapping each other Figure 34 A onedimensional histogram Here density is normalized so the histogram is no longer a frequency count but a probability count The transparency level the alpha value is set to 05 so the histogram underline is displayed properly Boxplot and outlier detection A twodimensional histogram plot is especially helpful for visualizing correlations between two quantities We will be using the immigration data we used in the Classifying numerical and categorical variables section in Chapter 2 Essential Statistics for Data Assessment as an example Basic examples with the Python Matplotlib package 59 The good thing about a boxplot is that it gives us a very good estimation of the existence of outliers The following code snippet plots the Texas counties net immigration rate of 2017 in a boxplot import pandas as pd df pdreadexcelPopulationEstimatesxlsskiprows2 dfTX dfdfStateTXtail1 pltboxplotdfTXRNETMIG2017 The plot looks as in the following figure Figure 35 A boxplot of the 2017 net immigration rate of Texas counties What we generated is a simple boxplot It has a box with a horizontal line in between There are minimum and maximum data points which are represented as short horizontal lines However there are also data points above the maximum and below the minimum You may also wonder what they are since there are already maximum and minimum data points We will solve these issues one by one Lets understand the box first The top and bottom of the box are the ¾ quartile and the ¼ quartile respectively This means exactly 50 of the data is in the box The distance between the ¼ quartile and the ¾ quartile is called the Interquartile Range IQR Clearly the shorter the box is the more centralized the data points are The orange line in the middle represents the median 60 Visualization with Statistical Graphs The position of the 
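The box's whiskers and the outlier points follow a simple quantile rule that can be reproduced by hand, which makes the outlier logic explicit. Here is a minimal sketch, assuming NumPy as np, synthetic data, and the conventional 1.5 × IQR cutoff that plt.boxplot also uses by default.

```python
import numpy as np

rng = np.random.default_rng(2020)
data = np.concatenate([rng.normal(0, 1, 500), [6.0, 7.5, -5.0]])  # three planted outliers

q1, q3 = np.quantile(data, [0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr      # points below this are drawn individually
upper_fence = q3 + 1.5 * iqr      # points above this are drawn individually

outliers = data[(data < lower_fence) | (data > upper_fence)]
print("fences  :", lower_fence, upper_fence)
print("outliers:", np.sort(outliers))
```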
maximum is worked out as the sum of the ¾ quartile and 15 times the IQR The minimum is worked out as the difference between the ¼ quartile and 15 times the IQR What still lies outside of the range are considered outliers In the preceding boxplot there are four outliers For example if the distribution is normal a data point being an outlier has a probability of roughly 07 which is small Note A boxplot offers you a visual approach to detect outliers In the preceding example 15 times the IQR is not a fixed rule and you can choose a cutoff for specific tasks Scatter plots A scatter plot is very useful for visually inspecting correlations between variables It is especially helpful to display data at a different time or date from different locations in the same graph Readers usually find it difficult to tell minute distribution differences from numerical values but a scatter plot makes them easy to spot For example lets plot the birth rate and death rate for all the Texas counties in 2013 and 2017 It becomes somewhat clear that from 2013 to 2017 some data points with the highest death rate disappear while the birth rates remain unchanged The following code snippet does the job pltfigurefigsize86 pltscatterdfTXRbirth2013dfTXR death2013alpha05label2013 pltscatterdfTXRbirth2017dfTXR death2017alpha05label2017 pltlegend pltxlabelBirth Rate pltylabelDeath Rate plttitleTexas Counties BirthDeath Rates Basic examples with the Python Matplotlib package 61 The output looks as in the following figure Figure 36 A scatter plot of the birth rate and death rate in Texas counties Note The scatter plot shown in Figure 36 doesnt reveal onetoone dynamics For example we dont know the change in the birth rate or death rate of a specific county and it is possible though unlikely that county A and county B exchanged their positions in the scatter plot Therefore a basic scatter plot only gives us distributionwise information but no more than that 62 Visualization with Statistical Graphs Bar plots A bar plot is another common plot to demonstrate trends and compare several quantities side by side It is better than a simple line chart because sometimes line charts can be misleading without careful interpretation For example say I want to see the birth rate and death rate data for Anderson County in Texas from 2011 to 2018 The following short code snippet would prepare the column masks to select features and examine the first row of the DataFrame which is the data for Anderson County birthRates listfilterlambda x xstartswithR birthdfTXcolumns deathRates listfilterlambda x xstartswithR deathdfTXcolumns years nparraylistmaplambda x intx4 birthRates The Anderson County information can be obtained by using the iloc method as shown in the following snippet dfTXiloc0 Figure 37 shows the first several columns and the last several ones of the Anderson County data Figure 37 Anderson County data Note DataFrameiloc in pandas allows you to slice a DataFrame by the index field Basic examples with the Python Matplotlib package 63 The following code snippet generates a simple line plot pltfigurefigsize106 width04 pltplotyearswidth2 dfTXiloc0birthRates label birth rate pltplotyearswidth2 dfTXiloc0deathRateslabeldeath rate pltxlabelyears pltylabelrate pltlegend plttitleAnderson County birth rate and death rate The following figure shows the output which is a simple line plot with the dotted line being the birth rate and the dashed line being the death rate by default Figure 38 A line chart of the birth rate and death rate Without carefully 
reading it you can derive two pieces of information from the plot The death rates change dramatically across the years The death rates are much higher than the birth rates 64 Visualization with Statistical Graphs Even though the y axis tick doesnt support the two claims presented admit it this is the first impression we get without careful observation However with a bar plot this illusion can be eliminated early The following code snippet will help in generating a bar plot pltfigurefigsize106 width04 pltbaryearswidth2 dfTXiloc0birthRates widthwidth label birth rate alpha 1 pltbaryearswidth2 dfTXiloc0deathRates widthwidthlabeldeath rate alpha 1 pltxlabelyears pltylabelrate pltlegend plttitleAnderson County birth rate and death rate I slightly shifted year to be the X value and selected birthRates and deathRates with the iloc method we introduced earlier The result will look as shown in Figure 39 Figure 39 A bar plot of the Anderson County data Advanced visualization customization 65 The following is now much clearer The death rate is higher than the birth rate but not as dramatically as the line plot suggests The rates do not change dramatically across the years except in 2014 The bar plot will by default show the whole scale of the data therefore eliminating the earlier illusion Note how I used the width parameter to shift the two sets of bars so that they can be properly positioned Advanced visualization customization In this section you are going to learn how to customize the plots from two perspectives the geometry and the aesthetics You will see examples and understand how the customization works Customizing the geometry There isnt enough time nor space to cover every detail of geometry customization Lets learn by understanding and following examples instead Example 1 axissharing and subplots Continuing from the previous example lets say you want the birth rate and the population change to be plotted on the same graph However the numerical values of the two quantities are drastically different making the birth rate basically indistinguishable There are two ways to solve this issue Lets look at each of the ways individually Axissharing We can make use of both the lefthand y axis and the righthand Y axis to represent different scales The following code snippet copies the axes with the twinx function which is the key of the whole code block figure ax1 pltsubplotsfigsize106 ax1plotyears dfTXiloc0birthRates label birth ratecred ax2 ax1twinx ax2plotyears dfTXiloc0popChanges1 labelpopulation change 66 Visualization with Statistical Graphs ax1setxlabelyears ax1setylabelbirth rate ax2setylabelpopulation change ax1legend ax2legend plttitleAnderson County birth rate and population change As you can see the preceding code snippet does three things in order 1 Creates a figure instance and an axis instance ax1 2 Creates a twin of ax1 and plots the two sets of data on two different axes 3 Creates labels for two different axes shows the legend sets the title and so on The following is the output Figure 310 Double Y axes example The output looks nice and both trends are clearly visible Subplots With subplots we can also split the two graphs into two subplots The following code snippet creates two stacked subplots and plots the dataset on them separately Advanced visualization customization 67 figure axes pltsubplots21figsize106 axes0plotyears dfTXiloc0birthRates label birth ratecred axes1plotyears dfTXiloc0popChanges1 labelpopulation change axes1setxlabelyears axes0setylabelbirth rate 
axes1setylabelpopulation change axes0legend axes1legend axes0settitleAnderson County birth rate and population change Note The subplots function takes 2 and 1 as two arguments This means the layout will have 2 rows but 1 column So the axes will be a twoelement list The output of the previous code will look as follows Figure 311 Birth rate and population subplots example 68 Visualization with Statistical Graphs The two plots will adjust the scale of the Y axis automatically The advantage of using subplots over a shared axis is that subplots can support the addition of more complicated markups while a shared axis is already crowded Example 2 scale change In this second example we will be using the dataset for the total number of coronavirus cases in the world published by WHO At the time of writing this book the latest data I could obtain was from March 15 2020 You can also obtain the data from the official repository of this book The following code snippet loads the data and formats the date column into a date data type coronaCases pdreadcsvtotalcases03152020csv from datetime import datetime coronaCasesdate coronaCasesdateapplylambda x datetimestrptimex Ymd Then we plot the data for the world and the US The output of the previous code snippet will look like this Advanced visualization customization 69 Figure 312 Coronavirus cases in linear and log scales Note how I changed the second subplot from a linear scale to a log scale Can you work out the advantage of doing so On a linear scale because the cases in the world are much larger than the cases in the US the representation of cases in the US is basically a horizontal line and the details in the total case curve at the early stage are not clear In the logscale plot the Y axis changes to a logarithm scale so exponential growth becomes a somewhat linear line and the numbers in the US are visible now 70 Visualization with Statistical Graphs Customizing the aesthetics Details are important and they can guide us to focus on the right spot Here I use one example to show the importance of aesthetics specifically the markers A good choice of markers can help readers notice the most important information you want to convey Example markers Suppose you want to visualize the birth rate and death rate for counties in Texas but also want to inspect the rates against the total population and ruralurbancontinuum code for a specific year In short you have four quantities to inspect so which geometry will you choose and how will you represent the quantities Note The continuum code is a discrete variable but the other three are continuous variables To represent a discrete variable that doesnt have numerical relationships between categories you should choose colors or markers over others which may suggest numerical differences A naïve choice is a scatter plot as we did earlier This is shown in the following code snippet pltfigurefigsize126 pltscatterdfTXRbirth2013 dfTXRdeath2013 alpha04 s dfTXPOPESTIMATE20131000 pltxlabelBirth Rate pltylabelDeath Rate plttitleTexas Counties BirthDeath Rates in 2013 Note that I set the s parameter the size of the default marker to be 1 unit for every 1000 of the population Advanced visualization customization 71 The output already looks very informative Figure 313 The birth rate and death rate in Texas However this is probably not enough because we cant tell whether the region is a rural area or an urban area To do this we need to introduce a color map Note A color map maps a feature to a set of colors In Matplotlib there are 
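The linear-versus-log trick is easiest to see on a toy series. Here is a stand-alone sketch using synthetic exponential growth rather than the WHO file, assuming NumPy as np and Matplotlib as plt.

```python
import numpy as np
import matplotlib.pyplot as plt

days = np.arange(60)
world = 100 * 1.15 ** days          # fast exponential growth, large starting value
us = 5 * 1.18 ** days               # starts tiny, grows even faster

fig, axes = plt.subplots(2, 1, figsize=(10, 6))
for ax in axes:
    ax.plot(days, world, label="world")
    ax.plot(days, us, label="US")
    ax.legend()

axes[0].set_title("Linear scale")
axes[1].set_yscale("log")           # exponential growth becomes a straight line
axes[1].set_title("Log scale")
plt.tight_layout()
plt.show()
```

On the linear panel the smaller series hugs the x axis, while on the log panel both curves and their growth rates are readable.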
many different color maps For a complete list of maps check the official document at httpsmatplotliborg320tutorialscolors colormapshtml The following code snippet maps ruralurbancontinuumcode to colors and plots the color bar Although the code itself is numerical the color bar ticks contain no numerical meaning pltfigurefigsize126 pltscatterdfTXRbirth2013 dfTXRdeath2013 alpha04 s dfTXPOPESTIMATE20131000 c dfTXRuralurbanContinuum Code2003 cmap Dark2 72 Visualization with Statistical Graphs pltcolorbar pltxlabelBirth Rate pltylabelDeath Rate plttitleTexas Counties BirthDeath Rates in 2013 The output looks much easier to interpret Figure 314 The birth rate and death rate in Texas revised From this plotting counties with smaller code numbers have a bigger population a relatively moderate birth rate but a lower death rate This is possibly due to the age structure because cities are more likely to attract younger people This information cant be revealed without adjusting the aesthetics of the graph Queryoriented statistical plotting The visualization should always be guided by business queries In the previous section we saw the relationship between birth and death rates population and code and with that we designed how the graph should look Queryoriented statistical plotting 73 In this section we will see two more examples The first example is about preprocessing data to meet the requirement of the plotting API in the seaborn library In the second example we will integrate simple statistical analysis into plotting which will also serve as a teaser for our next chapter Example 1 preparing data to fit the plotting function API seaborn is another popular Python visualization library With it you can write less code to obtain more professionallooking plots Some APIs are different though Lets plot a boxplot You can check the official documentation at httpsseaborn pydataorggeneratedseabornboxplothtml Lets try to use it to plot the birth rates from different years for Texas counties However if you look at the DataFrame that the seaborn library imported it looks different from what we used earlier import seaborn as sns tips snsloaddatasettips tipshead The output looks as follows Figure 315 Head of tips a seaborn builtin dataset The syntax of plotting a boxplot is shown in the following snippet ax snsboxplotxday ytotalbill datatips 74 Visualization with Statistical Graphs The output is as follows Figure 316 The seaborn tips boxplot Note It is hard to generate such a beautiful boxplot with oneline code using the Matplotlib library There is always a tradeoff between control and easiness In my opinion seaborn is a good choice if you have limited time for your tasks Notice that the x parameter day is a column name in the tips DataFrame and it can take several values Thur Fri Sat and Sun However in the Texas county data records for each year are separated as different columns which is much wider than the tidy tips DataFrame To convert a wide table into a long table we need the pandas melt function https pandaspydataorgpandasdocsstablereferenceapipandas melthtml Queryoriented statistical plotting 75 The following code snippet selects the birth raterelated columns and transforms the table into a longer thinner format birthRatesDF dfTXbirthRates birthRatesDFindex birthRatesDFindex birthRatesDFLong pdmeltbirthRatesDFid varsindexvaluevars birthRatesDFcolumns1 birthRatesDFLongvariable birthRatesDFLongvariable applylambda x intx4 The longformat table now looks as in the following figure Figure 317 Long format of the 
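The wide-to-long reshaping step can be tried on a tiny, self-contained frame before applying it to the county table. The column names below are made up for illustration (they only mimic the yearly columns), and pandas is assumed as pd.

```python
import pandas as pd

wide = pd.DataFrame({
    "county": ["A", "B", "C"],
    "R_birth_2011": [12.1, 10.4, 11.8],
    "R_birth_2012": [11.9, 10.7, 12.0],
})

long = pd.melt(
    wide,
    id_vars="county",                       # kept as an identifier column
    value_vars=["R_birth_2011", "R_birth_2012"],
    var_name="year",
    value_name="birth_rate",
)
long["year"] = long["year"].str[-4:].astype(int)   # keep just the year
print(long)
```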
birth rates data Now the seaborn API can be used directly as follows pltfigurefigsize108 snsboxplotxvariable yvalue databirthRatesDFLong pltxlabelYear pltylabelBirth Rates 76 Visualization with Statistical Graphs The following will be the output Figure 318 Texas counties birth rates boxplot with the seaborn API Nice isnt it Youve learned how to properly transform the data into the formats that the library APIs accept Good job Example 2 combining analysis with plain plotting In the second example you will see how oneline code can add inference flavor to your plots Suppose you want to examine the birth rate and the natural population increase rate in the year 2017 individually but you also want to check whether there is some correlation between the two To summarize we need to do the following 1 Examine the individual distributions of each quantity 2 Examine the correlation between these two quantities 3 Obtain a mathematical visual representation of the two quantities Queryoriented statistical plotting 77 seaborn offers the jointplot function which you can make use of It enables you to combine univariate plots and bivariate plots It also allows you to add annotations with statistical implications The following code snippet shows the univariate distribution bivariate scatter plot an estimate of univariate density and bivariate linear regression information in one command g snsjointplotRNATURALINC2017 Rbirth2017 datadfTX kindregheight10 The following graph shows the output Figure 319 Joint plot of a scatter plot and histogram plot example 78 Visualization with Statistical Graphs Tip By adding inference information density estimation and the linear regression part to an exploratory graph we can make the visualization very professional Presentationready plotting tips Here are some tips if you plan to use plots in your professional work Use styling Consider using the following tips to style plots You should consider using a style that accommodates your PowerPoint or slides For example if your presentation contains a lot of grayscale elements you shouldnt use colorful plots You should keep styling consistent across the presentation or report You should avoid using markups that are too fancy Be aware of the fact that sometimes people only have grayscale printing so red and green may be indistinguishable Use different markers and textures in this case For example the following code replots the joint plot in grayscale style with pltstylecontextgrayscale pltfigurefigsize126 g snsjointplotRNATURALINC2017 Rbirth2017 datadfTX kindregheight10 Presentationready plotting tips 79 The result is as follows Figure 320 Replot with grayscale style 80 Visualization with Statistical Graphs Font matters a lot Before the end of this chapter I would like to share my tips for font choice aesthetics Font size is very important It makes a huge difference What you see on a screen can be very different from what you see on paper or on a projector screen For example you can use pltrcytick labelsizexmedium to specify the xtick size of your graph Be aware that the font size usually wont scale when the graph scales You should test it and set it explicitly if necessary Font family is also important The font family of graphs should match the font of the paper Serif is the most common one Use the following code to change the default fonts to serif pltrcfont familyserif Lets summarize what we have learned in this chapter Summary In this chapter we discussed the most important plots in Python Different plots suit different purposes 
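Those font settings can be collected in one place at the top of a plotting script. The following is a sketch of that kind of setup, assuming Matplotlib as plt; the specific sizes are arbitrary choices, and note that Matplotlib's named sizes are values such as 'small', 'medium', 'large', and 'x-large'.

```python
import matplotlib.pyplot as plt

# Apply once, before plotting; affects every subsequent figure.
plt.rc("font", family="serif", size=12)
plt.rc("axes", titlesize="x-large", labelsize="large")
plt.rc("xtick", labelsize="medium")
plt.rc("ytick", labelsize="medium")

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_title("Title set in serif at x-large")
ax.set_xlabel("x")
ax.set_ylabel("y")
plt.show()
```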
and you should choose them accordingly The default settings of each plot may not be perfect for your needs so customizations are necessary You also learned the importance of choosing the right geometries and aesthetics to avoid problems in your dataset such as significant quantity imbalance or highlighting features to make an exploratory argument Business queries are the starting point of designing a statistical plot We discussed the necessity of transforming data to fit a function API and choosing proper plotting functions to answer queries without hassle In the next chapter lets look at some probability distributions After all both the histogram plot and the density estimation plot in a joint plot try to uncover the probability distributions behind the dataset Section 2 Essentials of Statistical Analysis Section 2 covers the most fundamental and classical contents of statistical analysis at the undergraduate level However the statistical analysis well get into is applied to messy realword datasets This section will give you a taste of statistical analysis as well as sharpening your math skills for further chapters This section consists of the following chapters Chapter 4 Sampling and Inferential Statistics Chapter 5 Common Probability Distributions Chapter 6 Parametric Estimation Chapter 7 Statistical Hypothesis Testing 4 Sampling and Inferential Statistics In this chapter we focus on several difficult sampling techniques and basic inferential statistics associated with each of them This chapter is crucial because in real life the data we have is most likely only a small portion of a whole set Sometimes we also need to perform sampling on a given large dataset Common reasons for sampling are listed as follows The analysis can run quicker when the dataset is small Your model doesnt benefit much from having gazillions of pieces of data Sometimes you also dont want sampling For example sampling a small dataset with subcategories may be detrimental Understanding how sampling works will help you to avoid various kinds of pitfalls The following topics will be covered in this chapter Understanding fundamental concepts in sampling techniques Performing proper sampling under different scenarios Understanding statistics associated with sampling 84 Sampling and Inferential Statistics We begin by clarifying the concepts Understanding fundamental concepts in sampling techniques In Chapter 2 Essential Statistics for Data Assessment I emphasized that statistics such as mean and variance were used to describe the population The intent is to help you distinguish between the population and samples With a population at hand the information is complete which means all statistics you calculated will be authentic since you have everything With a sample the information you have only relates to a small portion or a subset of the population What exactly is a population A population is the whole set of entities under study If you want to study the average monthly income of all American women then the population includes every woman in the United States Population will change if the study or the question changes If the study is about finding the average monthly income of all Los Angeles women then a subset of the population for the previous study becomes the whole population of the current study Certain populations are accessible for a study For example it probably only takes 1 hour to measure kids weights in a single kindergarten However it is both economically and temporally impractical to obtain income 
information for American women or even Los Angeles women In order to get a good estimate of such an answer sampling is required A sample is a subset of the population under study The process of obtaining a sample is called sampling For example you could select 1000 Los Angeles women and make this your sample By collecting their income information you can infer the average income of all Los Angeles women As you may imagine selecting 1000 people will likely give us a more confident estimation of the statistics The sampling size matters because more entries will increase the likelihood of representing more characteristics of the original population What is more important is the way how sampling is done if you randomly select people walking on the street in Hollywood you probably will significantly overestimate the true average income If you go to a college campus to interview students you will likely find an underestimated statistic because students wont have a high income in general Understanding fundamental concepts in sampling techniques 85 Another related concept is the accessible population The whole population under study is also referred to as the target population which is supposed to be the set to study However sometimes only part of it is accessible The key characteristic is that the sampling process is restricted by accessibility As regards a study of the income of all Los Angeles women an accessible population may be very small Even for a small accessible population researchers or survey conductors can only sample a small portion of it This makes the sampling process crucially important Failed sampling Failed sampling can lead to disastrous decision making For example in earlier times when phones were not very accessible to every family if political polling was conducted based on phone directories the result could be wildly inaccurate The fact of having a phone indicated a higher household income and their political choices may not reveal the characteristics of the whole community or region In the 1936 Presidential election between Roosevelt and Landon such a mistake resulted in an infamous false Republican victory prediction by Literary Digest In the next section you will learn some of the most important sampling methods We will still be using the Texas population data For your reference the following code snippet reads the dataset and creates the dfTX DataFrame import pandas as pd df pdreadexcelPopulationEstimatesxlsskiprows2 dfTX dfdfStateTXtail1 The first several columns of the dfTX DataFrame appear as follows Figure 41 First several columns of the dfTX DataFrame Next lets see how different samplings are done 86 Sampling and Inferential Statistics Performing proper sampling under different scenarios The previous section introduced an example of misleading sampling in political polling The correctness of a sampling approach will change depending on its content When telephones were not accessible polling by phone was a bad practice However now everyone has a phone number associated with them and in general the phone number is largely random If a polling agency generates a random phone number and makes calls the bias is likely to be small You should keep in mind that the standard of judging a sampling method as right or wrong should always depend on the scenario There are two major ways of sampling probability sampling and nonprobability sampling Refer to the following details Probability sampling as the name suggests involves random selection In probability sampling each member 
has an equal and known chance of being selected This theoretically guarantees that the results obtained will ultimately reveal the behavior of the population Nonprobability sampling where subjective sampling decisions are made by the researchers The sampling process is usually more convenient though The dangers associated with nonprobability sampling Yes Here I do indeed refer to nonprobability sampling as being dangerous and I am not wrong Here I list the common ways of performing nonprobability sampling and we will discuss each in detail Convenience sampling Volunteer sampling Purposive sampling There are two practical reasons why people turn to nonprobability sampling Nonprobability sampling is convenient It usually costs much less to obtain an initial exploratory result with nonprobability sampling than probability sampling For example you can distribute shopping surveys in a supermarket parking lot to get a sense of peoples shopping habits on a Saturday evening But your results will likely change if you do it on a Monday morning For example people might tend to buy more alcohol at the weekend Such sampling is called convenience sampling Performing proper sampling under different scenarios 87 Convenience sampling is widely used in a pilot experimentstudy It can avoid wasting study resources on improper directions or find hidden issues at the early stages of study It is considered unsuitable for a major study Two other common nonprobability sampling methods are volunteer sampling and purposive sampling Volunteer sampling relies on the participants own selfselection to join the sampling A purposive selection is highly judgmental and subjective such that researchers will manually choose participants as part of the sample A typical example of volunteer sampling is a survey conducted by a political figure with strong left or rightwing tendencies Usually only their supporters will volunteer to spend time taking the survey and the results will be highly biased tending to support this persons political ideas An exaggerated or even hilarious example of purposive sampling is asking people whether they successfully booked a plane ticket on the plane The result is obvious because it is done on the plane You may notice that such sampling techniques are widely and deliberately used in everyday life such as in commercials or political campaigns in an extremely disrespectful way Many surveys conclude the results before they were performed Be careful with nonprobability sampling Nonprobability sampling is not wrong It is widely used For inexperienced researchers or data scientists who are not familiar with the domain knowledge it is very easy to make mistakes with nonprobability sampling The non probability sampling method should be justified carefully to avoid mistakes such as ignoring the fact that people who own a car or a telephone in 1936 were likely Republican In the next section you are going to learn how to perform sampling safely Also since probability sampling doesnt involve much subjective judgement you are going to see some working code again 88 Sampling and Inferential Statistics Probability sampling the safer approach I refer to probability sampling as a safer sampling approach because it avoids serious distribution distortion due to human intervention in most cases Here I introduce three ways of probability sampling They are systematic and objective and are therefore more likely to lead to unbiased results We will spend more time on them As before I will list them first Simple random 
sampling Stratified random sampling Systematic random sampling Lets start with simple random sampling Simple random sampling The first probability sampling is Simple Random Sampling SRS Lets say we have a study that aims to find the mean and standard deviation of the counties populations in Texas If it is not possible to perform this in all counties in Texas simple random sampling can be done to select a certain percentage of counties in Texas The following code shows the total number of counties and plots their population distributions The following code selects 10 which is 25 counties of all of all the counties populations in 2018 First lets take a look at our whole datasets distribution pltfigurefigsize106 pltrcParamsupdatefontsize 22 plthistdfTXPOPESTIMATE2018bins100 plttitleTotal number of counties formatlendfTXPOP ESTIMATE2018 pltaxvlinenpmeandfTXPOP ESTIMATE2018crlinestyle pltxlabelPopulation pltylabelCount Performing proper sampling under different scenarios 89 The result is a highly skewed distribution with few very large population outliers The dashed line indicates the position of the mean Figure 42 Population histogram plotting of all 254 Texas counties Most counties have populations of below half a million and fewer than 5 counties have population in excess of 2 million The population mean is 112999 according to the following oneline code npmeandfTXPOPESTIMATE2018 Now lets use the randomsample function from the random module to select 25 nonrepetitive samples and plot the distribution To make the result reproducible I set the random seed to be 2020 Note on reproducibility To make your analysis such that it involves reproducible randomness set a random seed so that randomness becomes deterministic The following code snippet selects 25 counties data and calculates the mean population figure randomseed2020 pltfigurefigsize106 sample randomsampledfTXPOPESTIMATE2018tolist25 plthistsamplebins100 pltaxvlinenpmeansamplecr plttitleMean of sample population formatnp meansample 90 Sampling and Inferential Statistics pltxlabelPopulation pltylabelCount The result appears as follows Notice that the samples mean is about 50 smaller than the populations mean Figure 43 Simple random sample results We can do this several more times to check the results since the sampling will be different each time The following code snippet calculates the mean of the sample 100 times and visualizes the distribution of the sampled mean I initialize the random seed so that it becomes reproducible The following code snippet repeats the SRS process 100 times and calculates the mean for each repetition Then plot the histogram of the means I call the number of occasions 100 trials and the size of each sample 25 numSample numSample 25 trials 100 randomseed2020 sampleMeans for i in rangetrials sample randomsampledfTXPOPESTIMATE2018to listnumSample sampleMeansappendnpmeansample pltfigurefigsize108 plthistsampleMeansbins25 plttitleDistribution of the sample means for sample size Performing proper sampling under different scenarios 91 of formattrials numSample pltgcaxaxissettickparamsrotation45 pltxlabelSample Mean pltylabelCount The result looks somewhat like the original distribution of the population Figure 44 Distribution of the mean of 100 SRS processes However the distribution shape will change drastically if you modify the sample size or number of trials Let me first demonstrate the change in sample size In the following code snippet the number of samples takes values of 25 and 100 and the number of trials is 1000 
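Before running that comparison, the effect of sample size on the spread of the sample mean can be previewed with a much smaller sketch. It assumes NumPy as np and uses a synthetic skewed population as a stand-in for the county populations, not the Texas file itself.

```python
import numpy as np

rng = np.random.default_rng(2020)
population = rng.lognormal(mean=10, sigma=1.5, size=254)   # skewed, county-like scale

for n in (25, 100):
    sample_means = [rng.choice(population, size=n, replace=False).mean()
                    for _ in range(1000)]
    print(f"sample size {n:>3}: mean of means = {np.mean(sample_means):,.0f}, "
          f"spread (std) = {np.std(sample_means):,.0f}")
```

The spread of the sample means shrinks roughly with the square root of the sample size, which is exactly what the following comparison shows graphically.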
Note that the distribution is normed so the scale becomes comparable numSamples 25100 colors rb trials 1000 randomseed2020 pltfigurefigsize108 sampleMeans for j in rangelennumSamples for i in rangetrials sample randomsampledfTXPOPESTIMATE2018to listnumSamplesj sampleMeansappendnpmeansample plthistsampleMeanscolorcolorsj alpha05bins25labelsample size 92 Sampling and Inferential Statistics formatnumSamplesjdensityTrue pltlegend plttitleDistribution density of means of 1000 SRS with respect to sample sizes pltxlabelSample Mean pltylabelDensity You can clearly see the influence of sample size on the result in the following graph Figure 45 Demonstration of the influence of sample size In short if you choose a bigger sample size it is more likely that you will obtain a larger estimation of the mean of the population data It is not counterintuitive because the mean is very susceptible to extreme values With a larger sample size the extreme values those 1 million are more likely to be selected and therefore increase the chance that the sample mean is large Performing proper sampling under different scenarios 93 I will leave it to you to examine the influence of the number of trials You can run the following code snippet to find out The number of trials should only influence accuracy numSample 100 colors rb trials 10005000 randomseed2020 pltfigurefigsize108 sampleMeans for j in rangelentrials for i in rangetrialsj sample randomsampledfTXPOPESTIMATE2018to listnumSample sampleMeansappendnpmeansample plthistsampleMeanscolorcolorsj alpha05bins25labeltrials formattrialsjdensityTrue pltlegend plttitleDistribution density of means of 1000 SRS and 5000 SRS pltxlabelSample Mean pltylabelDensity Most of the code is the same as the previous one except the number of trials now takes another value that is 5000 in line 3 Stratified random sampling Another common method of probability sampling is stratified random sampling Stratifying is a process of aligning or arranging something into categories or groups In stratified random sampling you should first classify or group the population into categories and then select elements from each group randomly The advantage of stratified random sampling is that every group is guaranteed to be represented in the final sample Sometimes this is important For example if you want to sample the income of American women without SRS it is likely that most samples will fall into highpopulation states such as California and Texas Information about small states will be completely lost Sometimes you want to sacrifice the absolute equal chance to include the representativeness 94 Sampling and Inferential Statistics For the Texas county population data we want to include all counties from different urbanization levels The following code snippet examines the urbanization code level distribution from collections import Counter CounterdfTXRuralurbanContinuum Code2013 The result shows some imbalance Counter70 39 60 65 50 6 20 25 30 22 10 35 80 20 90 29 40 13 If we want equal representativeness from each urbanization group such as two elements from each group stratified random sampling is likely the only way to do it In SRS the level 5 data will have a very low chance of being sampled Think about the choice between sampling equal numbers of entries in each levelstrata or a proportional number of entries as a choice between selecting senators and House representatives Note In the United States each state has two senators regardless of the population and state size The number of 
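Group-wise sampling in pandas gives the same guarantee that every stratum is represented. Here is a sketch on a made-up frame; it assumes pandas as pd, the column names are illustrative, and DataFrameGroupBy.sample requires pandas 1.1 or newer.

```python
import pandas as pd

df = pd.DataFrame({
    "county": [f"county_{i}" for i in range(12)],
    "urban_code": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3],
    "population": [900, 850, 780, 300, 260, 240, 40, 35, 30, 28, 25, 22],
})

# Two counties from every urbanization level, regardless of level size.
stratified = df.groupby("urban_code", group_keys=False).sample(n=2, random_state=2020)
print(stratified)
print("stratified sample mean:", stratified["population"].mean())
```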
representatives in the House reflects how large the states population is The larger the population the more representatives the state has in the House The following code snippet samples four representatives from each urbanization level and prints out the mean Note that the code is not optimized for performance but for readability randomseed2020 samples for level in sortednpuniquedfTXRuralurbanContinuum Code2013 Performing proper sampling under different scenarios 95 samples randomsampledfTXdfTXRuralurbanContinuum Code2013levelPOPESTIMATE2018tolist4 printnpmeansamples The result is about 144010 so not bad Lets do this four more times and check the distribution of the sample mean The following code snippet performs stratified random sampling 1000 times and plots the distribution of means pltfigurefigsize108 plthistsampleMeansbins25 plttitleSample mean distribution with stratified random sampling pltgcaxaxissettickparamsrotation45 pltxlabelSample Mean pltylabelCount The following results convey some important information As you can tell the sampled means are pretty much centered around the true mean of the population Figure 46 Distribution of sample means from stratified random sampling 96 Sampling and Inferential Statistics To clarify the origin of this odd shape you need to check the mean of each group The following code snippet does the job pltfigurefigsize108 levels codeMeans for level in sortednpuniquedfTXRuralurbanContinuum Code2013 codeMean npmeandfTXdfTXRuralurbanContinuum Code2013levelPOPESTIMATE2018 levelsappendlevel codeMeansappendcodeMean pltplotlevelscodeMeansmarker10markersize20 plttitleUrbanization level code versus mean population pltxlabelUrbanization level code 2013 pltylabelPopulation mean The result looks like the following Figure 47 Urbanization level code versus the mean population Note that the larger the urbanization level code the smaller the mean population Stratified random sampling takes samples from each group so an improved performance is not surprising Performing proper sampling under different scenarios 97 Recall that the urbanization level is a categorical variable as we introduced it in Chapter 2 Essential Statistics for Data Assessment The previous graph is for visualization purposes only It doesnt tell us information such as that the urbanization difference between levels 3 and 2 is the same as the difference between levels 5 and 6 Also notice that it is important to choose a correct stratifying criterion For example classifying counties into different levels based on the first letter in a county name doesnt make sense here Systematic random sampling The last probability sampling method is likely the easiest one If the population has an order structure you can first select one at random and then select every nth member after it For example you can sample the students by ID on campus or select households by address number The following code snippet takes every tenth of the Texas dataset and calculates the mean randomseed2020 idx randomrandint010 populations dfTXPOPESTIMATE2018tolist samples samplesappendpopulationsidx while idx 10 lenpopulations idx 10 samplesappendpopulationsidx printnpmeansamples The result is 158799 so not bad Systematic random sampling is easy to implement and understand It naturally avoids potential clustering in the data However it assumes a natural randomness in the dataset such that manipulation of the data may cause false results Also you have to know the size of the population beforehand in order to determine a sampling interval We 
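The while loop above works, but Python slicing expresses systematic sampling more compactly. The following is a sketch rather than the book's code; it assumes df_TX is loaded, uses a sampling interval of 10 as in the chapter, and draws the starting index from 0 to interval - 1 so that there are exactly interval possible starting points.

```python
import random
import numpy as np

def systematic_sample(values, interval=10, seed=2020):
    """Pick a random starting index in [0, interval) and then take
    every `interval`-th element after it."""
    random.seed(seed)
    start = random.randint(0, interval - 1)   # randint is inclusive on both ends
    return list(values)[start::interval]      # slicing does the stepping for us

# Usage with the chapter's data (df_TX is assumed to be loaded):
# sample = systematic_sample(df_TX['POPESTIMATE2018'].tolist(), interval=10)
# print(np.mean(sample))
```

Slicing keeps the stepping logic in one line and makes the sampling interval explicit, which is convenient when experimenting with different intervals.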
have covered three ways of probability sampling Combining previous nonprobability sampling techniques you have six methods at your disposal Each of the sampling techniques has its own pros and cons Choose wisely in different cases In the next section we will study some statistics associated with sampling techniques that will help us make such decisions 98 Sampling and Inferential Statistics Understanding statistics associated with sampling In the previous section you saw something like a histogram plot of the samples means We used the histogram to show the quality of the sampled mean If the distribution of the mean is centered around the true mean I claim it has a better quality In this section we will go deeper into it Instead of using Texas population data I will be using artificial uniform distributions as examples It should be easier for you to grasp the quantitative intuition if the distribution underlining the population is clear Sampling distribution of the sample mean You have seen the distribution of the sampled mean in the previous section There are some questions remaining For example what is the systematic relationship between the sample size and the sample mean What is the relationship between the number of times of sampling and the sample means distribution Assume we have a population that can only take values from integers 1 to 10 with equal probability The population is very large so we can sample as many as we want Let me perform an experiment by setting the sample size to 4 and then calculate the sample mean Lets do the sampling 100 times and check the distribution of the sample mean We did similar computational experiments for the Texas population data but here you can obtain the theoretical mean and standard deviation of the uniform distribution beforehand The theoretical mean and standard deviation of the distribution can be calculated in one line printnpmeani for i in range111 printnpsqrtnpmeani552 for i in range111 The mean is 55 and the standard deviation is about 287 The following code snippet performs the computational experiment and plots the distribution of the sample mean trials 100 sampleSize 4 randomseed2020 sampleMeans candidates i for i in range111 Understanding statistics associated with sampling 99 pltrcParamsupdatefontsize 18 for i in rangetrials sampleMean npmeanrandomchoicecandidates for in rangesampleSize sampleMeansappendsampleMean pltfigurefigsize106 plthistsampleMeans bins25 pltaxvline55cr linestyle plttitleSample mean distribution trial sample size formattrials sampleSize pltxlabelSample mean pltylabelCount I used the dashed vertical line to highlight the location of the true population mean The visualization can be seen here Figure 48 Sample mean distribution for 100 trials with a sample size of 4 Lets also take a note of the sample means standard deviation npmeansampleMeans The result is 59575 Now lets repeat the process by increasing the number of trials to 4 16 64 and 100 keeping sampleSize 4 unchanged I am going to use a subplot to do this 100 Sampling and Inferential Statistics The follow code snippet first declares a function that returns the sample means as a list def obtainSampleMeanstrials 100 sampleSize 4 sampleMeans candidates i for i in range111 for i in rangetrials sampleMean npmeanrandomchoicecandidates for in rangesampleSize sampleMeansappendsampleMean return sampleMeans The following code snippet makes use of the function we declared and plots the result of the experiments randomseed2020 figure axes pltsubplots41figsize816 
figuretightlayout times 41664100 for i in rangelentimes sampleMeans obtainSampleMeans100timesi4 axesihistsampleMeansbins40density True axesiaxvline55cr axesisettitleSample mean distribution trial sample size format100timesi 4 printmean std formatnpmeansampleMeansnp stdsampleMeans Understanding statistics associated with sampling 101 You may observe an interesting trend where the distributions assume an increasingly smooth shape Note that the skipping is due to the fact that the possible sample values are all integers Figure 49 Sample mean distribution with 400 1600 6400 and 10000 trials 102 Sampling and Inferential Statistics There are two discoveries here As the number of trials increases the sample means distribution becomes smoother and bellshaped When the number of trials reaches a certain level the standard deviation doesnt seem to change To verify the second claim you need to compute the standard deviation of the sample means I will leave this to you The result is listed here As regards the four different numbers of trials the standard deviations are all around 144 Here is the output of the code snippet showing no significant decrease in standard deviation trials 400 mean 564 std 14078218992472025 trials 1600 mean 553390625 std 14563112832464553 trials 6400 mean 54877734375 std 14309896472527093 trials 10000 mean 551135 std 14457899838842432 Next lets study the influence of the number of trials The following code snippet does the trick In order to obtain more data points for future analysis I am going to skip some plotting results but you can always check the official notebook of the book for yourself I am going to stick to trials 6400 for this experiment randomseed2020 sizes 2k for k in range19 figure axes pltsubplots81figsize848 figuretightlayout for i in rangelensizes sampleMeans obtainSampleMeans6400sizesi axesihistsampleMeansbinsnplinspacenp minsampleMeansnpmaxsampleMeans40density True axesiaxvline55cr linestyle axesisettitleSample mean distribution trial sample size format6400 sizesi axesisetxlim010 printmean std formatnpmeansampleMeansnp stdsampleMeans Understanding statistics associated with sampling 103 Lets check the sampleSize 16 result Figure 410 Sample mean distribution sample size 16 If the sample size increases eightfold you obtain the following Figure 411 Sample mean distribution sample size 128 We do see that the standard error of the sample mean shrinks when the sample size increases The estimates of the population mean are more precise hence a tighter and tighter histogram We will study this topic more quantitatively in the next subsection Standard error of the sample mean The sample means standard error decreases when the sample size increases We will do some visualization to find out the exact relationship The standard error is the standard deviation of a statistic of a sampling distribution Here the statistic is the mean 104 Sampling and Inferential Statistics The thought experiment tip A useful technique for checking a monotonic relationship is to perform a thought experiment Imagine the sample size is 1 then when the number of trials increases to infinity basically we will be calculating the statistics of the population itself On the other hand if the sample size increases to a very large number then every sample mean will be very close to the true population mean which leads to small variance Here I plotted the relationship between the size of the sample and the standard deviation of the sample mean The following code snippet does the job randomseed2020 sizes 
2k for k in range19 ses figure axes pltsubplots81figsize848 for i in rangelensizes sampleMeans obtainSampleMeans6400sizesi sesappendnpstdsampleMeans Due to space limitations here we only show two of the eight subfigures Figure 412 Sample mean distribution when the sample size is 2 Understanding statistics associated with sampling 105 With a larger sample size we have the following diagram Figure 413 Sample mean distribution when the sample size is 256 Then we plot the relationship in a simple line chart pltfigurefigsize86 pltplotsizesses plttitleStandard Error of Sample Mean Versus Sample Size pltxlabelSample Size pltylabelStandard Error of Sample Mean What you get is a following curve Figure 414 The sample mean standard error decreases with the sample size 106 Sampling and Inferential Statistics Now lets perform a transformation of the standard error so the relationship becomes clear pltfigurefigsize86 pltplotsizes1ele2 for ele in ses plttitleInverse of the Square of Standard Error versus Sample Size pltxlabelSample Size pltylabelTransformed Standard Error of Sample Mean The output becomes a straight linear line Figure 415 Standard error transformation There is a linear relationship between the sample size and the inverse of the square of the standard error Lets use n to denote the sample size σ𝑛𝑛 1 𝑛𝑛 Understanding statistics associated with sampling 107 Now recall that if the sample size is 1 we are basically calculating the population itself Therefore the relationship is exactly the following This equation is useful for estimating the true population standard deviation Note on replacement I used the randomchoice function for sampleSize times in this example This suggests that I am sampling from an infinitely large population or sampling with replacements However in the first section when sampling Texas population data I used the randomsample sampleSize function to sample a finite dataset without replacements The analysis of the sample mean will still apply but the standard errors coefficient will be different You will pick up a finite population correction factor that is related to the population size We wont go deeper into this topic due to content limitation The central limit theorem One last topic to discuss in this chapter is probably one of the most important theorems in statistics You may notice that the shape of the sample mean distribution tends to a bell shaped distribution indeed a normal distribution This is due to one of the most famous and important theorems in statistics the Central Limit Theorem CLT The CLT states that given a sufficiently large sample size the sampling distribution of the mean for a variable will approximate a normal distribution regardless of that variables distribution in the population Recall that the example distribution I used is the simplest discrete uniform distribution You can already see that the sample mean follows the bellshaped distribution which is equivalent to checking the sample sum The CLT is very strong Normal distribution is the most important distribution among many others as we will cover in the next chapter Mathematicians developed a lot of theories and tools relating to normal distribution The CLT enables us to apply those tools to other distributions as well Proving the CLT is beyond the scope of this introductory book However you are encouraged to perform the following thought experiment and do a computational experiment to verify it σ𝑛𝑛 σ 𝑛𝑛 108 Sampling and Inferential Statistics You toss an unfair coin that favors heads and 
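The relationship stated here, namely that the standard error of the sample mean equals the population standard deviation divided by the square root of the sample size, is easy to check numerically. The sketch below reuses the chapter's discrete uniform population on the integers 1 to 10 (standard deviation roughly 2.87) and samples with replacement via random.choice, as the chapter does; the particular sample sizes are my own choice.

```python
import random
import numpy as np

random.seed(2020)
candidates = list(range(1, 11))                                   # discrete uniform on 1..10
sigma = np.sqrt(np.mean([(c - 5.5) ** 2 for c in candidates]))    # about 2.87

trials = 6400
for n in [4, 16, 64, 256]:
    sample_means = [np.mean([random.choice(candidates) for _ in range(n)])
                    for _ in range(trials)]
    observed = np.std(sample_means)        # empirical standard error
    predicted = sigma / np.sqrt(n)         # sigma divided by the square root of n
    print(f"n={n:4d}  observed SE={observed:.3f}  predicted SE={predicted:.3f}")
```

The observed and predicted columns should agree to within simulation noise, and both shrink by a factor of 2 every time the sample size quadruples.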
record the heads as 1 and tails as 0 A set of tossings contains n tossings If n equals 1 you can do m sets of tossing and count the sum of each sets results What you can get is m binary numbers either 0 or 1 with 1 more likely than 0 If n 10 the sum of a set can now take values from 0 to 10 and will likely have an average greater than 5 because the coin is unfair However now you have a spread of the possible outcomes no longer binary As n increases the sum of the tossing set keeps increasing but it will likely increase at a more stable rate probably around 07 Casually speaking the sum hides the intrinsic structure of the original distribution To verify this amazing phenomenon perform some computational experiments The code snippets in this chapter provides useful skeletons Summary In this chapter you learned important but often undervalued concepts such as population samples and sampling methods You learned the right ways to perform sampling as well as the pitfalls of dangerous sampling methods We also made use of several important distributions in this chapter In the next chapter you are going to systematically learn some common important distributions With these background concepts solidified we can then move on to inferential statistics with confidence 5 Common Probability Distributions In the previous chapter we discussed the concepts of population and sampling In most cases it is not likely that you will find a dataset that perfectly obeys a welldefined distribution However common probability distributions are the backbone of data science and serve as the first approximation of realworld distributions The following topics will be covered in this chapter Understanding important concepts in probability Understanding common discrete probability distributions Understanding common continuous probability distributions Learning about joint and conditional distribution Understanding the power law and black swan Recall the famous saying there is nothing more practical than good theory The theory of probability is beyond greatness Lets get started 110 Common Probability Distributions Understanding important concepts in probability First of all we need to clarify some fundamental concepts in probability theory Events and sample space The easiest and most intuitive way to understand probability is probably through the idea of counting When tossing a fair coin the probability of getting a heads is one half You count two possible results and associate the probability of one half with each of them And the sum of all the associated nonoverlapping events not including having a coin standing on its edge must be unity Generally probability is associated with events within a sample space S In the coin tossing example tossing the coin is considered a random experiment it has two possible outcomes and the collection of all outcomes is the sample space The outcome of having a headstails is an event Note that an event is not necessarily singleoutcome for example tossing a dice and defining an event as having a result larger than 4 The event contains a subset of the sixoutcome sample space If the dice is fair it is intuitive to say that such an event having a result larger than 4 is associated with the probability PA 1 3 The probability of the whole sample space is 1 and any probability lies between 0 and 1 If an event A contains no outcomes the probability is 0 Such intuition doesnt only apply to discrete outcomes For continuous cases such as the arrival time of a bus between 8 AM and 9 AM you can define the 
sample space S as the wholetime interval from 8 AM to 9 AM An event A can be a bus arriving between 830 and 840 while another event B can be a bus arriving later than 850 A has a probability of 𝑃𝑃𝐴𝐴 1 6 and B has a probability of 𝑃𝑃𝐵𝐵 1 6 as well Lets use and to denote the union and intersection of two events The following three axioms for probability calculation will hold 𝑃𝑃𝐴𝐴 0 for any event 𝐴𝐴 𝑆𝑆 𝑃𝑃𝑆𝑆 1 If A B are mutually exclusive then 𝑃𝑃𝐴𝐴 𝐵𝐵 𝑃𝑃𝐴𝐴 𝑃𝑃𝐵𝐵 This leaves to you to verify that if A B are not mutually exclusive the following relationship holds 𝑃𝑃𝐴𝐴 𝐵𝐵 𝑃𝑃𝐴𝐴 𝑃𝑃𝐵𝐵 𝑃𝑃𝐴𝐴 𝐵𝐵 Understanding important concepts in probability 111 The probability mass function and the probability density function Both the Probability Mass Function PMF and the Probability Density Function PDF we are invented to describe the point density of a distribution PMF can be used to describe the probability of discrete events whereas PDF can be used to describe continuous cases Lets look at some examples to understand these functions better PMF PMF associates each single outcome of a discrete probability with a probability For example the following table represents a PMF for our cointossing experiment Figure 51 Probability of coin tossing outcomes If the coin is biased toward heads then the probability of having heads will be larger than 05 but the sum will remain as 1 Lets say you toss two fair dice What is the PMF for the sum of the outcomes We can achieve the result by counting The table cells contain the sum of the two outcomes Figure 52 Sum of two dicetossing outcomes 112 Common Probability Distributions We can then build a PMF table as shown in the following table As you can see the probability associated with each outcome is different Also note that the sample space changes its definition when we change the random experiment In these twodice cases the sample space S becomes all the outcomes of the possible sums Figure 53 Probability of the sum of dice tossing Lets denote the one dices outcome as A and another as B You can verify that PAB PAPB In this case the easiest example is the case of that sum being 2 and 12 Understanding important concepts in probability 113 The following code snippet can simulate the experiment as we know that each dice generates the possible outcome equally First generate all the possible outcomes import random import numpy as np dice 123456 probs 161616161616 sums sortednpuniquedicei dicej for i in range6 for j in range6 The following code then calculates all the associated probabilities I iterated every possible pair of outcomes and added the probability product to the corresponding result Here we make use of the third axiom declared earlier and the relationship we just claimed from collections import OrderedDict res OrderedDict for s in sums ress 0 for i in range6 for j in range6 if diceidicejs ress probsiprobsj Note on code performance The code is not optimized for performance but for readability OrderedDict creates a dictionary that maintains the order of the key as the order in which the keys are created Lets check the results and plot them with a bar plot Since the dictionary is ordered it is OK to plot keys and values directly as x and height as per the functions API pltfigurefigsize86 pltrcParamsupdatefontsize 22 pltbarreskeysresvalues plttitleProbabilities of Two Dice Sum pltxlabelSum Value pltylabelProbability 114 Common Probability Distributions Lets check out the beautiful symmetric result Figure 54 Probabilities of two dice sums You can check the sum of the values by 
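The nested loops above enumerate all 36 outcome pairs. An equivalent route, shown here as a sketch rather than the book's approach, is to convolve the two single-die PMFs, because the distribution of a sum of independent discrete variables is the convolution of their individual distributions.

```python
import numpy as np

die = np.ones(6) / 6               # PMF of a single fair die (faces 1..6)
sum_pmf = np.convolve(die, die)    # PMF of the sum of two independent dice
sums = np.arange(2, 13)            # possible sums: 2..12

for s, p in zip(sums, sum_pmf):
    print(f"P(sum = {s:2d}) = {p:.4f}")
print("total probability:", sum_pmf.sum())   # should print 1.0, up to rounding
```

np.convolve returns the 11 probabilities for the sums 2 through 12 in order, matching the table built by counting.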
using sumresvalues PDF PDF is the equivalent of PMF for continuous distribution For example a uniform distribution at interval 01 will have a PDF of 𝑓𝑓𝑋𝑋𝑥𝑥 𝑃𝑃𝑥𝑥 1 for any x in the range The requirements for a PDF to be valid are straightforward from the axioms for a valid probability A PDF must be nonnegative and integrates to 1 in the range where it takes a value Lets check a simple example Suppose the PDF of the bus arrival time looks as follows Figure 55 PDF of the bus arrival time Understanding important concepts in probability 115 You may check that the shaded region does have an area of 1 The bus has the highest probability of arriving at 830 and a lower probability of arriving too early or too late This is a terrible bus service anyway Unlike PMF a PDFs value can take an arbitrarily high number The highest probability for the twodice outcome is but the highest value on the PDF graph is 2 This is one crucial difference between PDF and PMF A single point on the PDF function doesnt hold the same meaning as a value in the PMF table Only the integrated Area Under the Curve AUC represents a meaningful probability For example in the previous PDF the probability that the bus will arrive between 824 AM 84 on the xaxis and 836 AM 86 on the xaxis is the area of the central lightly shaded part as shown in the following graph Figure 56 Probability that a bus will arrive between 824 AM and 836 AM Note on the difference between PMFPDF plots and histogram plots Dont confuse the PMF or PDF plots with the histogram plots you saw in previous chapters A histogram shows data distribution whereas the PMF and PDF are not backed by any data but theoretical claims It is not possible to schedule a bus that obeys the preceding weird PDF strictly but when we estimate the mean arrival time of the bus the PDF can be used as a simple approachable tool Integrating a PDF up to a certain value x gives you the Cumulative Distribution Function CDF which is shown as the following formula 𝐹𝐹𝑋𝑋𝑥𝑥 𝑓𝑓𝑋𝑋𝑥𝑥 𝑥𝑥 It takes values between 0 and 1 A CDF contains all the information a PDF contains Sometimes it is easier to use a CDF instead of a PDF in order to solve certain problems 116 Common Probability Distributions Subjective probability and empirical probability From another perspective you can classify probabilities into two types Subjective probability or theoretical probability Empirical probability or objective probability Lets look at each of these classifications in detail Subjective probability stems from a theoretical argument without any observation of the data You check the coin and think it is a fair one and then you come up with an equal probability of heads and tails You dont require any observations or random experiments All you have is a priori knowledge On the other hand empirical probability is deduced from observations or experiments For example you observed the performance of an NBA player and estimated his probability of 3point success for next season Your conclusion comes from a posteriori knowledge If theoretical probability exists by the law of large numbers given sufficient observations the observed frequency will approximate the theoretical probability infinitely closely For the content of this book we wont go into details of the proof but the intuition is clear In a realworld project you may wish to build a robust model to obtain a subjective probability Then during the process of observing random experiment results you adjust the probability to reflect the observed correction Understanding common 
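The law of large numbers mentioned here can be seen in a quick simulation. The sketch below is my own example rather than the book's: it tosses a fair coin many times and tracks the running frequency of heads, which drifts toward the theoretical probability of 0.5 as the number of tosses grows.

```python
import numpy as np

rng = np.random.default_rng(2020)
tosses = rng.integers(0, 2, size=100_000)                 # 0 = tails, 1 = heads, fair coin
running_freq = np.cumsum(tosses) / np.arange(1, len(tosses) + 1)

for n in [10, 100, 1_000, 10_000, 100_000]:
    print(f"after {n:>6d} tosses, observed frequency of heads = {running_freq[n - 1]:.4f}")
```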
discrete probability distributions In this section we will introduce you to some of the most important and common distributions I will first demonstrate some examples and the mechanism behind them that exhibits corresponding probability Then I will calculate the expectation and variance of the distribution show you samples that generated from the probability and plot its histogram plot and boxplot The expectation of X that follows a distribution is the mean value that X can take For example with PDF 𝑓𝑓𝑋𝑋𝑥𝑥 the mean is calculated as follows 𝐸𝐸𝑥𝑥 𝑓𝑓𝑋𝑋𝑥𝑥𝑥𝑥𝑥𝑥𝑥𝑥 Understanding common discrete probability distributions 117 The variance measures the spreading behavior of the distribution and is calculated as follows μ and σ2 are the common symbols for expectation and variance X is called a random variable Note that it is the outcome of a random experiment However not all random variables represent outcomes of events For example you can take Y expX and Y is also a random variable but not an outcome of a random experiment You can calculate the expectation and variance of any random variable We have three discrete distributions to cover They are as follows Bernoulli distribution Binomial distribution Poisson distribution Now lets look at these in detail one by one Bernoulli distribution Bernoulli distribution is the simplest discrete distribution that originates from a Bernoulli experiment A Bernoulli experiment resembles a general cointossing scenario The name comes from Jakob I Bernoulli a famous mathematician in the 1600s A Bernoulli experiment has two outcomes and the answer is usually binary that one outcome excludes another For example the following are all valid Bernoulli experiments Randomly ask a person whether they are married Buy a lottery ticket to win the lottery Whether a person will vote for Trump in 2020 If one event is denoted as a success with probability p then the opposite is denoted as a failure with probability 1 p Using X to denote the outcome and x to denote the outcomes realization where we set x 1 to be a success and x 0 to be a failure the PMF can be concisely written as follows 𝑉𝑉𝑉𝑉𝑉𝑉𝑥𝑥 𝑓𝑓𝑋𝑋𝑥𝑥𝑥𝑥 𝐸𝐸𝑥𝑥 2𝑑𝑑𝑥𝑥 𝑓𝑓𝑋𝑋𝑥𝑥 𝑝𝑝𝑥𝑥1 𝑝𝑝1𝑥𝑥 118 Common Probability Distributions Note on notations As you may have already noticed the uppercase letter is used to denote the outcome or event itself whereas the lowercase letter represents a specific realization of the outcome For example x can represent the realization that X takes the value of married in one experiment X denotes the event itself Given the definitions the mean is as follows The variance is as follows The following code performs a computational experiment with p 07 and sample size 1000 p 07 samples randomrandom 07 for in range1000 printnpmeansamplesnpvarsamples Since the result is straightforward we wont get into it You are welcome to examine it Binomial distribution Binomial distribution is built upon the Bernoulli distribution Its outcome is the sum of a collection of independent Bernoulli experiments Note on the concept of independency This is the first time that we have used the word independent explicitly In a twocoin experiment it is easy to imagine the fact that tossing one coin wont influence the result of another in any way It is enough to understand independent this way and we will go into its mathematical details later Lets say you do n Bernoulli experiments each with a probability of success p Then we say the outcome X follows a binomial distribution parametrized by n and p The outcome of the experiment can take any 
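The Bernoulli check that the chapter leaves to you can be written out as follows. This is a sketch with p = 0.7 and 1,000 draws, comparing the sample mean and variance against the theoretical values p and p(1 - p).

```python
import random
import numpy as np

random.seed(2020)
p = 0.7
samples = [1 if random.random() < p else 0 for _ in range(1000)]

print('sample mean:', np.mean(samples), '   theoretical mean:', p)
print('sample variance:', np.var(samples), '   theoretical variance:', p * (1 - p))
```

Both printed pairs should land close to 0.7 and 0.21, respectively.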
value k as long as k is smaller than or equal to n The PMF reads as follows 𝑝𝑝 1 1 𝑝𝑝 0 𝑝𝑝 0 𝑝𝑝21 𝑝𝑝 1 𝑝𝑝2𝑝𝑝 𝑝𝑝1 𝑝𝑝 𝑓𝑓𝑘𝑘 𝑛𝑛 𝑝𝑝 𝑃𝑃𝑋𝑋 𝑘𝑘 𝑛𝑛 𝑘𝑘 𝑛𝑛 𝑘𝑘 𝑝𝑝𝑘𝑘1 𝑝𝑝𝑛𝑛𝑘𝑘 Understanding common discrete probability distributions 119 The first term represents the combination of selecting k successful experiments out of n The second term is merely a product of independent Bernoulli distribution PMF The expectationmean of the binomial distribution is np and the variance is np1 p This fact follows the results of the sums of independent random variables The mean is the sum of means and the variance is the sum of the variance Lets do a simple computational example tossing a biased coin with PHead 08 100 times and plotting the distribution of the sum with 1000 trials The following code snippet first generates the theoretical data points X i for i in range1101 p 08 Fx npmathfactorialnnpmathfactorialnknpmath factorialkpk1pnk for k in X Then the following code snippet conducts the experiment and the plotting randomseed2020 n 100 K for trial in range1000 k npsumrandomrandom p for in rangen Kappendk pltfigurefigsize86 plthistKbins30densityTruelabelComputational Experiment PMF pltplotXFxcolorrlabelTheoretical PMFlinestyle pltlegend 120 Common Probability Distributions The result looks as follows Figure 57 Binomial distribution theoretical and simulation results The simulated values and the theoretical values agree pretty well Now well move on to another important distribution Poisson distribution The last discrete distribution we will cover is Poisson distribution It has a PMF as shown in the following equation where λ is a parameter k can take values of positive integers This distribution looks rather odd but it appears in nature everywhere Poisson distribution can describe the times a random event happens during a unit of time For example the number of people calling 911 in the United States every minute will obey a Poisson distribution The count of gene mutations per unit of time also follows a Poisson distribution Lets first examine the influence of the value λ The following code snippet plots the theoretical PMF for different values of λ lambdas 24616 K k for k in range30 pltfigurefigsize86 for il in enumeratelambdas pltplotKnpexpllknpmathfactorialk for k in K labelstrl 𝑃𝑃𝑋𝑋 𝑘𝑘 𝑒𝑒λλ𝑘𝑘 𝑘𝑘 Understanding the common continuous probability distribution 121 markeri2 pltlegend pltylabelProbability pltxlabelValues plttitleParameterized Poisson Distributions The result looks as follows Figure 58 Poisson distribution with various λ values The trend is that the larger λ is the larger the mean and variance will become This observation is true Indeed the mean and variance of Poisson distribution are both λ The numpyrandompoisson function can easily generate Poisson distribution samples The computational experiment is left to you You can try and conduct computational experiments on your own for further practice Understanding the common continuous probability distribution In this section you will see the three most important continuous distributions Uniform distribution Exponential distribution Gaussiannormal distribution Lets look at each of these in detail 122 Common Probability Distributions Uniform distribution Uniform distribution is an important uniform distribution It is useful computationally because many other distributions can be simulated with uniform distribution In earlier code examples I used randomrandom in the simulation of the Bernoulli distribution which itself generates a uniform random variable in the 
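The Poisson computational experiment left to you might look like the following sketch; the choice of lambda = 6 is taken from the plotted examples, and the sample size of 10,000 is arbitrary. It checks that the sample mean and variance are both close to lambda and compares a few empirical frequencies with the theoretical PMF.

```python
import math
import numpy as np

np.random.seed(2020)
lam = 6
samples = np.random.poisson(lam=lam, size=10_000)

print("sample mean:", samples.mean(), "   sample variance:", samples.var())  # both close to lam

# Compare empirical frequencies with the theoretical PMF exp(-lam) * lam**k / k!
for k in range(4, 9):
    empirical = np.mean(samples == k)
    theoretical = math.exp(-lam) * lam ** k / math.factorial(k)
    print(f"k={k}: empirical {empirical:.4f}   theoretical {theoretical:.4f}")
```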
range 01 For a uniformly distributed random variable on 01 the mean is 05 and the variance is 1 12 This is a good number to remember for a data scientist role interview For a general uniform distribution If the range is ab the PDF reads as 𝑃𝑃𝑋𝑋 𝑥𝑥 1 𝑏𝑏 𝑎𝑎 if x is in the range ab The mean and variance become 2 and 2 12 respectively If you remember calculus check it yourself We will skip the computational experiments part for simplicity and move on to exponential distribution Exponential distribution The exponential distribution function is another important continuous distribution function In nature it mostly describes the time difference between independent random distribution For example the time between two episodes of lightning in a thunderstorm or the time between two 911 calls Recall that the number of 911 calls in a unit of time follows the Poisson distribution Exponential distribution and Poisson distribution do have similarities The PDF for exponential distribution is also parameterized by λ The value x can only take nonnegative values Its PDF observes the following form Because of the monotonicity of the PDF the maximal value always happens at x 0 where f0λ λ The following code snippet plots the PDF for different λ lambdas 02040810 K 05k for k in range15 pltfigurefigsize86 for il in enumeratelambdas pltplotKnpexplkl for k in K labelstrl markeri2 pltlegend pltylabelProbability 𝑓𝑓𝑥𝑥 λ λ𝑒𝑒λ𝑥𝑥 Understanding the common continuous probability distribution 123 pltxlabelValues plttitleParameterized Exponential Distributions The result looks as follows Figure 59 Exponential distribution with various λ The larger λ is the higher the peak at 0 is but the faster the distribution decays A smaller λ gives a lower peak but a fatter tail Integrating the product of x and PDF gives us the expectation and variance First the expectation reads as follows The variance reads as follows The result agrees with the graphical story The larger λ is the thinner the tail is and hence the smaller the expectation is Meanwhile the peakier shape brings down the variance Next we will investigate normal distribution 𝐸𝐸𝑥𝑥 𝑥𝑥λ𝑒𝑒λxdx 0 1 λ 𝑉𝑉𝑉𝑉𝑉𝑉𝑥𝑥 𝑥𝑥 𝐸𝐸𝑥𝑥 2λ𝑒𝑒λxdx 0 1 λ2 124 Common Probability Distributions Normal distribution We have used the term normal distribution quite often in previous chapters without defining it precisely A onedimensional normal distribution has a PDF as follows μ and σ2 are the parameters as expectation and standard deviation A standard normal function has an expectation of 0 and a variance of 1 Therefore its PDF reads as follows in a simpler form Qualitative argument of a normal distribution PDF The standard normal distribution PDF is an even function so it is symmetric with a symmetric axis x 0 Its PDF also monotonically decays from its peak at a faster rate than the exponential distribution PDF because it has a squared form Transforming the standard PDF to a general PDF the expectation μ on the exponent shifts the position of the symmetric axis and the variance σ2 determines how quick the decay is The universality of normal distribution is phenomenal For example the human populations height and weight roughly follow a normal distribution The number of leaves on trees in a forest will also roughly follow a normal distribution From the CLT we know that the sample sum from a population of any distribution will ultimately tend to follow a normal distribution Take the tree leaves example and imagine that the probability of growing a leaf follows a very sophisticated probability The total number 
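The formulas just given, a mean of 1/lambda and a variance of 1/lambda squared, can be verified with a short simulation. This sketch is not the book's code; note that NumPy parameterizes the exponential sampler by the scale, which is 1/lambda, rather than by the rate lambda itself, and lambda = 0.8 is taken from the plotted examples.

```python
import numpy as np

rng = np.random.default_rng(2020)
lam = 0.8
# NumPy's exponential sampler takes the scale, which equals 1 / lambda
samples = rng.exponential(scale=1 / lam, size=100_000)

print("sample mean:", samples.mean(), "   theoretical mean:", 1 / lam)
print("sample variance:", samples.var(), "   theoretical variance:", 1 / lam ** 2)
```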
of leaves on a tree will however cloud the details of the sophistication of the probability but gives you a normal distribution A lot of phenomena in nature follow a similar pattern what we observe is a collection or summation of lowerlevel mechanisms This is how the CLT makes normal distribution so universal and important For now lets focus on a onedimensional one 𝑓𝑓𝑥𝑥 1 𝜎𝜎2𝜋𝜋 𝑒𝑒𝑥𝑥𝜇𝜇2 2𝜎𝜎2 𝑓𝑓𝑥𝑥 1 2π 𝑒𝑒𝑥𝑥2 2 Understanding the common continuous probability distribution 125 Lets plot several normal distribution PDFs with different expectations 1 0 and 1 with the following code snippet mus 101 K 02k5 for k in range50 pltfigurefigsize86 for imu in enumeratemus pltplotK 1npsqrt2nppinpexpkmu22 for k in K labelstrmu markeri2 pltlegend pltylabelProbability pltxlabelValues plttitleParameterized Normal Distributions The result looks as follows Figure 510 Normal distribution with different expectations I will leave the exercise with a different variance to you Normal distribution has a deep connection with various statistical tests We will cover the details in Chapter 7 Statistical Hypothesis Testing 126 Common Probability Distributions Learning about joint and conditional distribution We have covered basic examples from discrete probability distributions and continuous probability distributions Note that all of them describe the distribution of a single experiment outcome How about the probability of the simultaneous occurrence of two eventsoutcomes The proper mathematical language is joint distribution Suppose random variables X and Y denote the height and weight of a person The following probability records the probability that X x and Y y simultaneously which is called a joint distribution A joint distribution is usually represented as shown in the following equation For a population we may have PX 170cm Y 75kg 025 You may ask the question What is the probability of a person being 170 cm while weighing 75 kg So you see that there is a condition that we already know this person weighs 75 kg The expression for a conditional distribution is a ratio as follows The notation PXY represents the conditional distribution of X given Y Conditional distributions are everywhere and people often misread them by ignoring the conditions For example does the following argument make sense Most people have accidents within 5 miles of their home therefore the further away you drive the safer you are Whats wrong with this claim It claims the probability Pdeath more than 5 miles away to be small as compared to Pdeath less than 5 miles away given the fact that Pdeath less than 5 miles away is larger than Pdeath more than 5 miles away It intentionally ignores the fact that the majority of commutes take place within short distances of the home in its phrasing of the claim The denominator is simply too small on the righthand side in the following equation I hope you are able to spot such tricks and understand the essence of these concepts 𝑃𝑃𝑋𝑋 𝑥𝑥 𝑌𝑌 𝑦𝑦 𝑃𝑃𝑋𝑋 170𝑐𝑐𝑐𝑐𝑌𝑌 75𝑘𝑘𝑘𝑘 𝑃𝑃𝑋𝑋 170𝑐𝑐𝑐𝑐 𝑌𝑌 75𝑘𝑘𝑘𝑘 𝑃𝑃𝑌𝑌 75𝑘𝑘𝑘𝑘 Pdeath more than 5 miles away 𝑃𝑃𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑ℎ 𝑚𝑚𝑚𝑚𝑚𝑚𝑑𝑑 𝑑𝑑ℎ𝑑𝑑𝑎𝑎 5 𝑚𝑚𝑚𝑚𝑚𝑚𝑑𝑑𝑚𝑚 𝑑𝑑𝑎𝑎𝑑𝑑𝑎𝑎 𝑃𝑃𝑚𝑚𝑚𝑚𝑚𝑚𝑑𝑑 𝑑𝑑ℎ𝑑𝑑𝑎𝑎 5 𝑚𝑚𝑚𝑚𝑚𝑚𝑑𝑑𝑚𝑚 𝑑𝑑𝑎𝑎𝑑𝑑𝑎𝑎 Understanding the power law and black swan 127 Independency and conditional distribution Now we can explore the true meaning of independency In short there are two equivalent ways to declare two random variables independent PX x Y y PX xPY y for any xy and PX xY y PX x for any xy You can check that they are indeed equivalent We see an independent relationship between X and Y implies that a 
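The exercise left to you, repeating the plot with different variances instead of different expectations, could look like the following sketch. The mean is fixed at 0, the standard deviations 0.5, 1, and 2 are my own choices, and the grid and plotting style mirror the chapter's snippet.

```python
import numpy as np
import matplotlib.pyplot as plt

sigmas = [0.5, 1, 2]
x = np.array([0.2 * k - 5 for k in range(50)])   # same grid style as the chapter

plt.figure(figsize=(8, 6))
for i, sigma in enumerate(sigmas):
    pdf = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-x ** 2 / (2 * sigma ** 2))
    plt.plot(x, pdf, label="sigma = " + str(sigma), marker=i + 2)
plt.legend()
plt.xlabel("Values")
plt.ylabel("Probability density")
plt.title("Normal distributions with different variances")
```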
conditional distribution of a random variable X over Y doesnt really depend on the random variable Y If you can decompose the PDFPMF of a joint probability into a product of two PDFs PMFs that one only contains one random variable and another only contains another random variable then the two random variables are independent Lets look at a quick example You toss a coin three times X denotes the event when you see two or more heads and Y denotes the event when the sum of heads is odd Are X and Y independent Lets do a quick calculation to obtain the probabilities 𝑃𝑃𝑋𝑋 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇 1 2 𝑃𝑃𝑌𝑌 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇 1 2 𝑃𝑃𝑋𝑋 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇 𝑌𝑌 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇 1 4 You can verify the remaining three cases to check that X and Y are indeed independent The idea of conditional distribution is key to understanding many behaviors for example the survival bias For all the planes returning from the battlefield if the commander reinforces those parts where the plane got shot will the air force benefit from it The answer is probably no The commander is not looking at the whole probability but a conditional probability distribution based on the fact the plane did return To make the correct judgement the commander should reinforce those parts where the plane was not shot It is likely that a plane that got shot in those areas didnt make it back Conditional probability is also crucial to understanding the classical classification algorithm the Bayesbased classifier Adding an independence requirement to the features of the data we simplify the algorithm further to the naïve Bayes classifier We will cover these topics in Chapter 9 Working with Statistics for Classification Tasks Understanding the power law and black swan In this last section I want to give you a brief overview of the socalled power law and black swan events 128 Common Probability Distributions The ubiquitous power law What is the power law If you have two quantities such that one varies according to a power relationship of another and independent of the initial sizes then you have a power law relationship Many distributions have a power law shape rather than normal distributions 𝑃𝑃𝑋𝑋 𝑥𝑥 𝐶𝐶𝑥𝑥α The exponential distribution we saw previously is one such example For a realword example the frequency of words in most languages follows a power law The English letter frequencies also roughly follow a power law e appears the most often with a frequency of 11 The following graph taken from Wikipedia https enwikipediaorgwikiLetterfrequency shows a typical example of such a power law Figure 511 Frequency of English letters Understanding the power law and black swan 129 Whats amazing about a power law is not only its universality but also its lack of welldefined mean and variance in some cases For example when α is larger than 2 the expectation of such a power law distribution will explode to infinity while when α is larger than 3 the variance will also explode Be aware of the black swan Simply put a lack of welldefined variance implications is known as black swan behavior A black swan event is a rare event that is hard to predict or compute scientifically However black swan events have a huge impact on history science or finance and make people rationalize black swan events in hindsight Note Before people found black swans in Australia Europeans believed that all swans were white The term black swan was coined to represent ideas that were considered impossible Here are some of the typical examples of black swan events The 2020 coronavirus outbreak The 2008 financial crisis The 
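Verifying the remaining cases is easiest by enumerating all eight equally likely toss sequences, as in the sketch below. One caution, offered as my own reading rather than the book's: with X defined as two or more heads, the joint probability of X and Y comes out to 1/8 rather than 1/4, so that pair is not independent; the classic independent pair takes X to be the event that the first toss lands heads, and the code checks both versions.

```python
from itertools import product

outcomes = list(product("HT", repeat=3))        # the 8 equally likely toss sequences

def prob(event):
    """Probability of an event, computed by counting outcomes."""
    return sum(1 for o in outcomes if event(o)) / len(outcomes)

first_is_head = lambda o: o[0] == "H"
odd_heads = lambda o: o.count("H") % 2 == 1
two_or_more = lambda o: o.count("H") >= 2

# This pair is independent: the joint probability equals the product.
print(prob(lambda o: first_is_head(o) and odd_heads(o)),
      prob(first_is_head) * prob(odd_heads))        # 0.25 0.25

# This pair is not: 1/8 on the left versus 1/4 on the right.
print(prob(lambda o: two_or_more(o) and odd_heads(o)),
      prob(two_or_more) * prob(odd_heads))          # 0.125 0.25
```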
assassination of Franz Ferdinand which sparked World War I People can justify those events afterward which make black swans totally inescapable but no one was prepared to prevent the occurrence of those black swans beforehand Beware of black swan events in the distribution you are working on Remember in the case of our Texas county population data in Chapter 4 Sampling and Inference Statistics that most of the counties have small populations but quite a few have populations that are 10 times above the average If you have sampled your data on a fair number of occasions and didnt see these outliers you may be inclined toward making incorrect estimations 130 Common Probability Distributions Summary In this chapter we covered the basics of common discrete and continuous probability distributions examined their statistics and also visualized the PDFs We also talked about joint distribution conditional distribution and independency We also briefly covered power law and black swan behavior Many distributions contain parameters that dictate the behavior of the probability distribution Suppose we know a sample comes from a population that follows a certain distribution how do you find the parameter of the distribution This will be the topic of our next chapter parametric estimation 6 Parametric Estimation One big challenge when working with probability distributions is identifying the parameters in the distributions For example the exponential distribution has a parameter λ and you can estimate it to get an idea of the mean and the variance of the distribution Parametric estimation is the process of estimating the underlying parameters that govern the distribution of a dataset Parameters are not limited to those that define the shape of the distribution but also the locations For example if you know that a dataset comes from a uniform distribution but you dont know the lower bound a and upper bound b of the distribution you can also estimate the values of a and b as they are also considered legitimate parameters Parametric estimation is important because it gives you a good idea of the dataset with a handful of parameters for example the distributions and associated descriptive statistics Although reallife examples wont exactly follow a distribution parameter estimation does serve as a benchmark for building more complicated models to model the underlying distribution 132 Parametric Estimation After finishing this chapter you will be able to complete the following tasks independently Understanding the concepts of parameter estimation and the features of estimators Using the method of moments to estimate parameters Applying the maximum likelihood approach to estimate parameters with Python Understanding the concepts of parameter estimation and the features of estimators A introduction to estimation theory requires a good mathematical understanding and careful derivation Here I am going to use laymans terms to give you a brief but adequate introduction so that we can move on to concrete examples quickly Estimation in statistics refers to the process of estimating unknown values from empirical data that involves random components Sometimes people confuse estimation with prediction Estimation usually deals with hidden parameters that are embodied in a known dataset things that already happened while prediction tries to predict values that are explicitly not in the dataset things that havent happened For example estimating the population of the world 1000 years ago is an estimation problem You can use 
various kinds of data that may contain information about the population The population is a number that will not change but is unknown On the other hand predicting the population in the year 2050 is a prediction problem because the number is essentially unknown and we have no data that contains it explicitly What parameters can we estimate In principle you can estimate any parameter as long as it is involved in the creation or generation of the random datasets Lets use the symbol θ as the set of all unknown parameters and x as the set of data For example θ1 and x1 will represent one of the parameters and a single data point respectively here indexed by 1 If pxθ depends on θ we can estimate that θ exists in the dataset A note on the exchangeable use of terms Estimation and prediction can sometimes be exchangeable Often estimation doesnt assert the values of new data while prediction does but ambiguity always exists For example if you want to know the trajectory of a missile given its current position the old positions are indeed unknown data but they can also be treated as hidden parameters that will determine the positions observed later We will not go down this rabbit hole here You are good to go if you understand whats going on Understanding the concepts of parameter estimation and the features of estimators 133 An estimator is required to obtain the estimated values of the parameters that were interested in An estimator is also static and the underlined parameter is called the estimand A particular value that an estimator takes is called an estimate a noun Too many concepts Lets look at a realworld case Lets take the 2016 US general election as an example The voting rate is an estimand because it is a parameter to model voting behavior A valid straightforward strategy is to take the average voting rates of a sample of counties in the US regardless of their population and other demographic characteristics This strategy can be treated as an estimator Lets say that a set of sampled counties gives a value of 034 which means 34 of the population vote Then the value 034 is an estimate as the realization of our naive estimator Take another sample of counties the same estimator may give the value of 04 as another estimate Note the estimator can be simple The estimator is not necessarily complicated or fancy It is just a way of determining unknown parameters You can claim the unknown parameter to be a constant regardless of whatever you observe A constant is a valid estimator but it is a horribly wrong one For the same estimand you can have as many estimators as you want To determine which one is better we require more quantitative analysis Without further specifications estimators in this chapter refer to the point estimator The point estimator offers the single best guess of the parameter while the socalled interval estimator gives an interval of the best guesses In the next section lets review the criteria for evaluating estimators Evaluation of estimators For the same estimand how do we evaluate the qualities of different estimators Think of the election example we want the estimation to be as accurate as possible as robust as possible and so on The properties of being accurate and robust have specific mathematical definitions and they are crucial for picking the right estimator for the right tasks The following is a list of the criteria Biasness Consistency Efficiency 134 Parametric Estimation Lets look at each of these criteria in detail The first criterion biasness Recall that an 
estimator is also a random variable that will take different values depending on the observed sample from the population Lets use θ to denote the estimator and θ to denote the true value of the parameter variable which our estimator tries to estimate The expected value of the difference between θ and θ is said to be the bias of the estimator which can be formulated as follows Note that the expectation is calculated over the distribution Pxθ as varied θ is supposed to change the sampled sets of x An estimator is said to be unbiased if its bias is 0 for all values of parameter θ Often we prefer an unbiased estimator over a biased estimator For example political analysts want an accurate voting rate and marketing analysts want a precise customer satisfaction rate for strategy development If the bias is a constant we can subtract that constant to obtain an unbiased estimator if the bias depends on θ it is not easy to fix in general For example the sample mean from a set using simple random sampling is an unbiased estimator This is rather intuitive because simple random sampling gives equal opportunity for every member of the set to be selected Next lets move on to the second criterion which is consistency or asymptotical consistency The second criterion consistency As the number of data points increases indefinitely the resulting sequence of the estimates converges to the true value θ in probability This is called consistency Lets say θn is the estimate given n data points then for any case of ϵ0 consistency gives the following formula For those of you who are not familiar with the language of calculus think about it this way no matter how small you choose the infinitesimal threshold ϵ you can also choose a number of data points n large enough such that the probability that θn is different from θ is going to be 0 On the other hand an inconsistent estimator will fail to estimate the parameter unbiasedly no matter how much data you use Bias θ Eθ θ lim 𝑛𝑛 𝑃𝑃 θ𝑛𝑛 θ ϵ 0 Understanding the concepts of parameter estimation and the features of estimators 135 Note on convergence There are two main types of convergence when we talk about probability and distribution One is called convergence in probability and another is called convergence in distribution There are differences if you plan to dig deeper into the mathematical definitions and implications of the two kinds of convergence All you need to know now is that the convergence in the context of consistency is convergence in probability The third criterion efficiency The last criterion I want to introduce is relative If two estimators are both unbiased which one should we choose The most commonly used quantity is called Mean Squared Error MSE MSE measures the expected value of the square of the difference between the estimator and the true value of the parameter The formal definition reads as follows Note that the MSE says nothing about the biasness of the estimator Lets say estimators A and B both have an MSE of 10 It is possible that estimator A is unbiased but the estimates are scattered around the true value of θ while estimator B is highly biased but concentrated around a point away from the bulls eye What we seek is an unbiased estimator with minimal MSE This is usually hard to achieve However take a step back among the unbiased estimators there often exists one estimator with the least MSE This estimator is called the Minimum Variance Unbiased Estimator MVUE Here we will touch on the concept of variancebias tradeoff The MSE contains two parts 
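Bias and consistency are easiest to see with a concrete estimator. The following sketch is my own example, not the book's: it compares the variance estimator that divides by n (biased) with the one that divides by n - 1 (unbiased) on samples from a normal population whose true variance is 4, and the bias of the first shrinks as the sample size grows, in line with the consistency behavior described here.

```python
import numpy as np

rng = np.random.default_rng(2020)
true_var = 4.0                      # normal population with sigma = 2, so variance 4
trials = 10_000

for n in [5, 20, 100]:
    biased, unbiased = [], []
    for _ in range(trials):
        x = rng.normal(loc=0, scale=2, size=n)
        biased.append(np.var(x))            # divides by n, so it is biased
        unbiased.append(np.var(x, ddof=1))  # divides by n - 1, so it is unbiased
    print(f"n={n:3d}  mean of biased estimates: {np.mean(biased):.3f}   "
          f"mean of unbiased estimates: {np.mean(unbiased):.3f}   (true variance: {true_var})")
```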
the bias part and the variance part If the bias part is 0 therefore unbiased then the MSE only contains the variance part So we call this estimator the MVUE We will cover the concepts of bias and variance again in the sense of machine learning for example in Chapter 8 Statistics for Regression and Chapter 9 Statistics for Classification If an estimator has a smaller MSE than another estimator we say the first estimator is more efficient Efficiency is a relative concept and it is defined in terms of a measuring metric Here we use MSE as the measuring metric the smaller the MSE is the more efficient the estimator will be 𝑀𝑀𝑀𝑀𝑀𝑀θ 𝑀𝑀 θ𝑋𝑋 θ 2 136 Parametric Estimation The concept of estimators beyond statistics If you extend the concept of estimators beyond the statistical concept to a real life methodology the accessibility of data is vitally important An estimator may have all the advantages but the difficulty of obtaining data becomes a concern For example to estimate the temperature of the suns surface it is definitively a great idea to send a proxy to do it but this is probably not a costefficient nor timeefficient way Scientists have used other measurements that can be done on Earth to do the estimation Some business scenarios share a similar characteristic where unbiasedness is not a big issue but data accessibility is In the next two sections I will introduce the two most important methods of parameter estimation Using the method of moments to estimate parameters The method of moments associates moments with the estimand What is a moment A moment is a special statistic of a distribution The most commonly used moment is the nth moment of a realvalued continuous function Lets use M to denote the moment and it is defined as follows where the order of the moment is reflected as the value of the exponent 𝑀𝑀𝑛𝑛 𝑥𝑥 𝑐𝑐𝑛𝑛 𝑓𝑓𝑥𝑥𝑑𝑑𝑥𝑥 This is said to be the moment about the value c Often we set c to be 0 𝑀𝑀𝑛𝑛 𝑥𝑥𝑛𝑛𝑓𝑓𝑥𝑥𝑑𝑑𝑥𝑥 Some results are immediately available for example because the integration of a valid Probability Density Function PDF always gives 1 Therefore we have M0 1 Also M1 is the expectation value therefore the mean A note on central moments For highorder moments where c is often set to be the mean these moments are called central moments In this setting the second moment M2 becomes the variance 138 Parametric Estimation Lets first generate a set of artificial data using the following code snippet The true parameter is 20 We will plot the histogram too nprandomseed2020 calls nprandompoissonlam20 size365 plthistcalls bins20 The result looks as follows Figure 61 Histogram plot of the artificial Poisson distribution data Now lets express the first moment with the unknown parameter In Chapter 5 Common Probability Distributions we saw that the expectation value is just λ itself Next lets express the first moment with the data which is just the mean of the data The npmeancalls function call gives the value 19989 This is our estimation and it is very close to the real parameter 20 In short the logic is the following The npmeancalls function call gives the sample mean and we use it to estimate the population mean The population mean is represented by moments Here it is just the first moment For a welldefined distribution the population mean is an expression of unknown parameters In this example the population mean happens to have a very simple expression the unknown parameter λ itself But make sure that you understand the whole chain of logic A more sophisticated example is given next Using the 
method of moments to estimate parameters 139 Example 2 the bounds of uniform distribution Lets see another example of continuous distribution We have a set of points that we assume comes from a uniform distribution However we dont know the lower bound α and upper bound β We would love to estimate them The assumed distribution has a uniform PDF on the legitimate domain Here is the complete form of the distribution 𝑃𝑃𝑋𝑋 𝑥𝑥 1 β α The following code snippet generates artificial data with a true parameter of 0 and 10 nprandomseed2020 data nprandomuniform0102000 Lets take a look at its distribution It clearly shows that the 2000 randomly generated data points are quite uniform Each bin contains roughly the same amount of data points Figure 62 Histogram plot of artificial uniform distribution data Next we perform the representation of moments with the unknown parameters 1 First lets express the first and second moments with the parameters The first moment is easy as it is the average of α and β 𝑀𝑀1 05α β 140 Parametric Estimation The second moment requires some calculation It is the integration of the product of x2 and the PDF according to the definition of moments 2 Then we calculate M1 and M2 from the data by using the following code snippet M1 npmeandata M2 npmeandata2 3 The next step is to express the parameters with the moments After solving the two equations that represent M1 and M2 we obtain the following α 𝑀𝑀1 3𝑀𝑀2 𝑀𝑀1 2 β 2𝑀𝑀1 α Substituting the values of the moments we obtain that α 0096 and β 100011 This is a pretty good estimation since the generation of the random variables has a lower bound of 0 and an upper bound of 10 Wait a second What will happen if we are unlucky and the data is highly skewed We may have an unreasonable estimation Here is an exercise You can try to substitute the generated dataset with 1999 values being 10 and only 1 value being 0 Now the data is unreasonably unlikely because it is supposed to contain 2000 data points randomly uniformly selected from the range 0 to 10 Do the analysis again and you will find that α is unrealistically wrong such that a uniform distribution starting from α cannot generate the dataset we coined itself How ridiculous However if you observe 1999 out of 2000 values aggregated at one single data point will you still assume the underlying distribution to be uniform Probably not This is a good example of why the naked eye should be the first safeguard of your statistical analysis You have estimated two sets of parameters in two different problems using the method of moments Next we will move on to the maximum likelihood approach 𝑀𝑀2 𝑥𝑥2 β α 𝑑𝑑𝑥𝑥 β α 1 3α2 αβ β2 Applying the maximum likelihood approach with Python 141 Applying the maximum likelihood approach with Python Maximum Likelihood Estimation MLE is the most widely used estimation method It estimates the probability parameters by maximizing a likelihood function The obtained extremum estimator is called the maximum likelihood estimator The MLE approach is both intuitive and flexible It has the following advantages MLE is consistent This is guaranteed In many practices a good MLE means the job that is left is simply to collect more data MLE is functionally invariant The likelihood function can take various transformations before maximizing the functional form We will see examples in the next section MLE is efficient Efficiency means when the sample size tends to infinity no other consistent estimator has a lower asymptotic MSE than MLE With that power in MLE I bet you just cant wait 
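The two expressions just derived for alpha and beta drop straight into code. The following sketch regenerates the chapter's artificial dataset with the same seed and applies the formulas; the estimates should land very close to the true bounds of 0 and 10.

```python
import numpy as np

np.random.seed(2020)
data = np.random.uniform(0, 10, 2000)        # same artificial dataset as the chapter

M1 = np.mean(data)                           # first moment
M2 = np.mean(data ** 2)                      # second moment

alpha_hat = M1 - np.sqrt(3 * (M2 - M1 ** 2))
beta_hat = 2 * M1 - alpha_hat

print("estimated bounds:", alpha_hat, beta_hat)   # close to 0 and 10
```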
to try it Before maximizing the likelihood we need to define the likelihood function first Likelihood function A likelihood function is a conditional probability distribution function that conditions upon the hidden parameter As the name suggests it measures how likely it is that our observation comes from a distribution with the hidden parameters by assuming the hidden parameters are essentially true When you change the hidden parameters the likelihood function changes value In another words the likelihood function is a function of hidden parameters The difference between a conditional distribution function and a likelihood function is that we focus on different variables For a conditional distribution Pevent parameter we focus on the event and predict how likely it is that an event will happen So we are interested in the fevent Pevent parameter λ function where λ is known We treat the likelihood function as a function over the parameter domain where all the events are already observed fparameter Pevent E parameter where the collection of events E is known You can think of it as the opposite of the standard conditional distribution defined in the preceding paragraph Lets take coin flipping as an example Suppose we have a coin but we are not sure whether it is fair or not However what we know is that if it is unfair getting heads is more likely with a probability of Phead 06 142 Parametric Estimation Now you toss it 20 times and get 11 heads Is it more likely to be a fair coin or an unfair coin What we want to find is which of the following is more likely to be true P11 out of 20 is head fair or P11 out of 20 is head unfair Lets calculate the two possibilities The distribution we are interested in is a binomial distribution If the coin is fair then we have the following likelihood function value If the coin is biased toward heads then the likelihood function reads as follows It is more likely that the coin is fair I deliberately picked such a number so that the difference is subtle The essence of MLE is to maximize the likelihood function with respect to the unknown parameter A note on the fact that likelihood functions dont sum to 1 You may observe a fact that likelihood functions even enumerating all possible cases here two do not necessarily sum to unity This is due to the fact that likelihood functions are essentially not legitimate PDFs The likelihood function is a function of fairness where the probability of getting heads can take any value between 0 and 1 What gives a maximal value Lets do an analytical calculation and then plot it Lets use p to denote the probability of getting heads and Lp to denote the likelihood function You can do a thought experiment here The value of p changes from 0 to 1 but both p 0 and p 1 make the expression equal to 0 Somewhere in between there is a p value that maximizes the expression Lets find it Note that the value of p that gives the maximum of this function doesnt depend on the combinatorial factor We can therefore remove it to have the following expression 20 1191 2 20 01602 20 1193 5 1 1 2 5 9 01597 𝐿𝐿𝑝𝑝 20 11 9 𝑝𝑝111 𝑝𝑝9 𝐿𝐿𝑝𝑝 𝑝𝑝111 𝑝𝑝9 Applying the maximum likelihood approach with Python 143 You can further take the logarithm of the likelihood function to obtain the famous loglikelihood function 𝑙𝑙𝑙𝑙𝑙𝑙𝑙𝑙𝑝𝑝 11𝑙𝑙𝑙𝑙𝑙𝑙𝑝𝑝 9𝑙𝑙𝑙𝑙𝑙𝑙1 𝑝𝑝 The format is much cleaner now In order to obtain this expression we used the formulas 𝑙𝑙𝑙𝑙𝑙𝑙𝑎𝑎𝑏𝑏 𝑏𝑏 𝑙𝑙𝑙𝑙𝑙𝑙𝑎𝑎 and 𝑙𝑙𝑜𝑜𝑔𝑔𝑎𝑎 𝑏𝑏 𝑙𝑙𝑜𝑜𝑔𝑔𝑎𝑎 𝑙𝑙𝑜𝑜𝑔𝑔𝑏𝑏 Transformation invariance The transformation suggests that the likelihood 
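As a quick sanity check of the two likelihood values quoted above, the following short snippet evaluates both of them with scipy.stats.binom. This is not part of the book's own code; it only assumes that SciPy is installed.

from scipy.stats import binom

# Likelihood of observing 11 heads in 20 tosses under each hypothesis
l_fair = binom.pmf(11, 20, 0.5)     # fair coin, P(head) = 0.5
l_unfair = binom.pmf(11, 20, 0.6)   # biased coin, P(head) = 0.6

print(round(l_fair, 4), round(l_unfair, 4))   # roughly 0.1602 and 0.1597

The two values differ only in the third decimal place, which is exactly why the example was chosen: the evidence is deliberately subtle.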
function is not fixed nor unique You can remove the global constant or transform it with a monotonic function such as logarithm to fit your needs The next step is to obtain the maximal of the loglikelihood function The derivative of logx is just 1 𝑥𝑥 so the result is simple to obtain by setting the derivative to 0 You can verify that the function only has one extremum by yourself P 055 is the right answer This agrees with our intuition since we observe 11 heads among 20 experiments The most likely guess is just 11 20 For completeness I will plot the original likelihood distribution before moving to more complex examples The following code snippet demonstrates this point pltfigurefigsize106 pltplotP factornppowerP11nppower1P9 linestyle linewidth4 labellikelihood function pltaxvline055 linestyle linewidth4 labelmost likely p colorr pltxlabelProbability of obtaining head pltylabelLikelihood function value pltlegendloc0006 𝑑𝑑𝑑𝑑𝑑𝑑𝑔𝑔𝐿𝐿𝑝𝑝 𝑑𝑑𝑝𝑝 11 𝑝𝑝 9 1 𝑝𝑝 0 144 Parametric Estimation The result is as follows The vertical dashed line indicates the maximum where p 055 Figure 63 The likelihood function of the cointossing experiment You can verify visually that the likelihood function only has one maximum MLE for uniform distribution boundaries Lets first revisit our previous example introduced in the Using the method of moments to estimate parameters section where the data is uniformly sampled from the range ab but both a and b are unknown We dont need to do any hardcore calculation or computational simulation to obtain the result with MLE Suppose the data is denoted as x1 x2 up to xn n is a large number as we saw in the previous section 2000 The likelihood function is therefore as follows The logarithm will bring down the exponent n which is a constant So what we want to maximize is 𝑙𝑙𝑙𝑙𝑙𝑙 1 𝑏𝑏 𝑎𝑎 𝑙𝑙𝑙𝑙𝑙𝑙𝑏𝑏 𝑎𝑎 which further means we wish to minimize logba 𝐿𝐿𝑎𝑎 𝑏𝑏 1 𝑏𝑏 𝑎𝑎 𝑛𝑛 Applying the maximum likelihood approach with Python 145 Transformation of logarithm Note that 𝑙𝑙𝑙𝑙𝑙𝑙 1 𝑥𝑥 is essentially 𝑙𝑙𝑙𝑙𝑙𝑙𝑥𝑥1 You can pull the 1 out of the logarithm Alright the result is simple enough such that we dont need derivatives When b becomes smaller logba is smaller When a is larger but must be smaller than b logba is smaller However b is the upper bound so it cant be smaller than the largest value of the dataset maxxi and by the same token cant be larger than the smallest value of the dataset minxi Therefore the result reads as follows A minxi b maxxi This agrees with our intuition and we have fully exploited the information we can get from such a dataset Also note that this is much more computationally cheaper than the method of moments approach MLE for modeling noise Lets check another example that is deeply connected with regression models which we are going to see in future chapters Here we will approach it from the perspective of estimators Regression and correlation A regression model detects relationships between variables usually the dependent variables outcome and the independent variables features The regression model studies the direction of the correlation and the strength of the correlation Lets say we anticipate that there is a correlation between random variable X and random variable Y For simplicity we anticipate the relationship between them is just proportional namely Y k X Here k is an unknown constant the coefficient of proportionality However there is always some noise in a realworld example The exact relationship between X and Y is therefore Y k X ε where ε stands for the noise random 
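Before we start modeling the noise, here is a minimal sketch that checks the min/max result for the uniform bounds and compares it with the method-of-moments estimates from the earlier example. It assumes NumPy and reuses the same seed and sample size as before.

import numpy as np

np.random.seed(2020)
data = np.random.uniform(0, 10, 2000)   # the same artificial sample as before

# MLE for the uniform bounds: simply the sample minimum and maximum
a_mle, b_mle = data.min(), data.max()

# Method of moments, for comparison (formulas from the previous section)
M1, M2 = np.mean(data), np.mean(data**2)
a_mom = M1 - np.sqrt(3 * (M2 - M1**2))
b_mom = 2 * M1 - a_mom

print(a_mle, b_mle)   # very close to the true bounds 0 and 10
print(a_mom, b_mom)   # also close to 0 and 10, as reported earlier

Both approaches recover the true bounds closely, but the MLE version needs no moment algebra and no integration at all.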
variable 146 Parametric Estimation Lets say we have collected the n independent data pairs xi and yi at our disposal The following code snippet creates a scatter plot for these data points For the demonstration I will choose n 100 pltfigurefigsize106 pltscatterXY pltxlabelX pltylabelY plttitleLinearly correlated variables with noise The result looks like the following Figure 64 Scatter plot of X and Y Instead of modeling the distribution of the data as in the previous example we are going to model the noise since the linear relationship between X and Y is actually known We will see two cases where the modeling choices change the estimation of the coefficient of proportionality k In the first case the noise follows a standard normal distribution and can be shown as N01 In the second case the noise follows a standard Laplace distribution Applying the maximum likelihood approach with Python 147 A note on the two candidate distributions of noise Recall that a standard normal distribution has a PDF that is shown as fx 1 2π ex2 2 Laplace distribution is very similar to the standard normal distribution It has a PDF that can be presented as f𝑥𝑥 1 2 𝑒𝑒𝑥𝑥 The big difference is that one decays faster while the other decays slower The positive half of the standard Laplace distribution is just half of the exponential distribution with λ 1 The following code snippet plots the two distributions in one graph xs nplinspace55100 normalvariables 1npsqrt2nppinpexp05xs2 laplacevariables 05npexpnpabsxs pltfigurefigsize108 pltplotxsnormalvariableslabelstandard normallinestylelinewidth4 pltplotxslaplacevariableslabelstandard Laplacelinestylelinewidth4 pltlegend The result looks as in the following figure The dashed line is the standard normal PDF and the dotted line is the standard Laplace PDF Figure 65 Standard normal and standard Laplace distribution around 0 148 Parametric Estimation I will model the noises according to the two distributions However the noise is indeed generated from a standard normal distribution Lets first examine case 1 Suppose the noise Є follows the standard normal distribution This means Єi yi kxi as you use other variables to represent Єi follows the given distribution Therefore we have fϵik 1 2π eyikxi2 2 for every such random noise data point Now think of k is the hidden parameter We just obtained our likelihood function for one data point As each data point is independent the likelihood function can therefore be aggregated to obtain the overall likelihood function as shown in the following formula What we want to find is k such that it maximizes the likelihood function Lets introduce the mathematical way to express this idea Lets take the logarithm of the likelihood function and make use of the rule that the logarithm of a product is the sum of each terms logarithm Note on simplifying the logarithm of a sum The rule of simplifying the likelihood functions expression is called the product rule of logarithm It is shown as 𝑙𝑙𝑙𝑙 𝑓𝑓𝑖𝑖 𝑙𝑙𝑙𝑙𝑓𝑓𝑖𝑖 Therefore using the product rule to decompose the loglikelihood function we have 𝑙𝑙𝑜𝑜𝑜𝑜𝑜𝑜𝑘𝑘 𝑙𝑙𝑜𝑜𝑜𝑜𝑓𝑓ϵ𝑖𝑖𝑘𝑘 𝑖𝑖 For each term we have the following Substitute the expression into the loglikelihood function Note that the optimal k wont depend on the first term which is a constant So we can drop it and focus on the second part L𝑘𝑘 fϵ0 ϵ𝑛𝑛𝑘𝑘 𝑓𝑓ϵ𝑖𝑖𝑘𝑘 𝑖𝑖 kMLE argmaxkLk log𝑓𝑓ϵ𝑖𝑖𝑘𝑘 05log2π 𝑦𝑦𝑖𝑖𝑘𝑘 𝑥𝑥𝑖𝑖2 2 logL𝑘𝑘 𝑛𝑛 2 log2π 1 2𝑦𝑦𝑖𝑖 𝑘𝑘 𝑥𝑥𝑖𝑖2 Applying the maximum likelihood approach with Python 149 Now lets calculate the derivative with respect to k At the 
maximum of the function the expression of the derivative will become 0 Equating this to 0 we find the optimal value of k On verifying maximality If you are familiar with calculus you may wonder why I didnt calculate the second derivative to verify that the value is indeed a maximum rather than a possible minimum You are right that in principle such a calculation is needed However in our examples in this chapter the functional forms are quite simple like a simple quadratic form with only one maximum The calculation is therefore omitted The following code snippet does the calculation Note that if a variable is a NumPy array you can perform elementwise calculation directly npsumXYnpsumXX The result is 040105608977245294 Lets visualize it to check how well this estimation does The following code snippet adds the estimated y values to the original scatter plot as a dashed bold line pltfigurefigsize106 pltscatterXYalpha 04 pltxlabelX pltylabelY pltplotXXk1linewidth4linestylecr labelfitted line pltlegend 𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑘𝑘 𝑑𝑑𝑘𝑘 𝑦𝑦𝑖𝑖 𝑘𝑘 𝑥𝑥𝑖𝑖𝑥𝑥𝑖𝑖 𝑖𝑖 𝑘𝑘 𝑥𝑥𝑖𝑖𝑦𝑦𝑖𝑖 𝑥𝑥𝑖𝑖𝑥𝑥𝑖𝑖 150 Parametric Estimation The result looks as follows Figure 66 Estimated Y value according to normal noise assumption The result looks quite reasonable Now lets try the other case The logic of MLE remains the same until we take the exact form of the likelihood function fЄik The likelihood function for the second Laplacedistributed noise case is the following Taking the logarithm the loglikelihood has a form as follows I have removed the irrelevant constant 05 as I did in the cointossing example to simplify the calculation In order to maximize the loglikelihood function we need to minimize the summation yikxi This summation involves absolute values and the sign of each term depends on the value of k Put the book down and think for a while about the minimization Lk fϵ𝑖𝑖𝑘𝑘 1 2 𝑒𝑒𝑦𝑦𝑖𝑖𝑘𝑘𝑥𝑥𝑖𝑖 logLk yi 𝑘𝑘𝑥𝑥𝑖𝑖 Applying the maximum likelihood approach with Python 151 Lets define a function signx which gives us the sign of x If x is positive signx 1 if x is negative signx 1 otherwise it is 0 Then the derivative of the preceding summation with respect to k is essentially the following Because the signx function jumps it is still not easy to get a hint Lets do a computational experiment I will pick k between 0 and 1 then create a graph of the loglikelihood function values and the derivatives so that you can have a visual impression The following code snippet creates the data I needed Ks nplinspace0206100 def calloglikelihoodX Y k return npsumnpabsYkX def calderivativeXYk return npsumXnpsignYkX Likelihoods calloglikelihoodXYk for k in Ks Derivatives calderivativeXYk for k in Ks I picked the range 0206 and selected k values for a 004 increment Why did I pick this range This is just a first guess from the scatter plot The following code snippet plots the results pltfigurefigsize108 pltplotKsLikelihoodslabelLikelihood functionlinestylelinewidth4 pltplotKsDerivativeslabel Derivativelinestylelinewidth4 pltlegend pltxlabelK 𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑘𝑘 𝑑𝑑𝑘𝑘 𝑥𝑥𝑖𝑖 sign𝑦𝑦𝑖𝑖𝑘𝑘𝑥𝑥𝑖𝑖 0 152 Parametric Estimation The dashed line is the likelihood function value and the dotted line is the derivative of the likelihood function Figure 67 Searching for the optimal k point through computational experimentation You may notice that there seems to be a plateau for the functions This is true Lets zoom in to the range where k takes the value 038042 The following code snippet does the job Ks nplinspace038042100 Likelihoods calloglikelihoodXYk for k in Ks Derivatives calderivativeXYk for k in 
Ks pltfigurefigsize108 pltplotKsLikelihoodslabelLikelihood functionlinestylelinewidth4 pltplotKsDerivativeslabel Derivativelinestylelinewidth4 pltlegend pltxlabelK Applying the maximum likelihood approach with Python 153 The result looks like the following Figure 68 The plateau of derivatives This strange behavior comes from the fact that taking the derivative of the absolute value function lost information about the value itself We only have information about the sign of the value left However we can still obtain an estimation of the optimal value by using the numpyargmax function This function returns the index of the maximal value in an array We can then use this index to index the array for our k values The following oneline code snippet does the job KsnpargmaxLikelihoods The result is about 0397 Then we know the real k values are in the range 0397 00004 0397 00004 which is smaller than our result from case 1 Why 00004 We divide a range of 004 into 100 parts equally so each grid is 00004 154 Parametric Estimation Lets plot both results together They are almost indistinguishable The following code snippet plots them together pltfigurefigsize108 pltscatterXYalpha 04 pltxlabelX pltylabelY pltplotXXk1linewidth4linestylecr labelestimation from normal noise assumption pltplotXXk2linewidth4linestylecg labelestimation from Laplace noise assumption pltlegend The result looks as follows Figure 69 Estimations of the coefficient of proportionality Alright we are almost done Lets find out how the data is generated The following code snippet tells you that it is indeed generated from the normal distribution randomseed2020 X nplinspace1010100 Y X 04 nparrayrandomnormalvariate01 for in range100 Applying the maximum likelihood approach with Python 155 In this example we modeled the random noise with two different distributions However rather than a simple unknown parameter our unknown parameter now carries correlation information between data points This is a common technique to use MLE to estimate unknown parameters in a model by assuming a distribution We used two different distributions to model the noise However the results are not very different This is not always the case Sometimes poorly modeled noise will lead to the wrong parameter estimation MLE and the Bayesian theorem Another important question remaining in a realworld case is how to build comprehensive modeling with simply the raw data The likelihood function is important but it may miss another important factor the prior distribution of the parameter itself which may be independent of the observed data To have a solid discussion on this extended topic let me first introduce the Bayesian theorem for general events A and B The Bayesian theorem builds the connection between PAB and PBA through the following rule This mathematical equation is derived from the definition of the conditional distribution 𝑃𝑃𝐴𝐴𝐵𝐵 𝑃𝑃𝐴𝐴 𝐵𝐵 𝑃𝑃𝐵𝐵 and 𝑃𝑃𝐵𝐵𝐴𝐴 𝑃𝑃𝐴𝐴 𝐵𝐵 𝑃𝑃𝐴𝐴 are both true statements therefore 𝑃𝑃𝐴𝐴𝐵𝐵𝑃𝑃𝐵𝐵 𝑃𝑃𝐵𝐵𝐴𝐴𝑃𝑃𝐴𝐴 which gives the Bayesians rule Why is the Bayesian rule important To answer this question lets replace A with observation O and B with the hidden unknown parameter λ Now lets assign reallife meaning to parts of the equation P0λ is the likelihood function we wish to maximize Pλ0 is a posterior probability of λ Pλ is a prior probability of λ PO is essentially 1 because it is observed therefore determined P𝐴𝐴𝐵𝐵 𝑃𝑃𝐵𝐵𝐴𝐴 𝑃𝑃𝐴𝐴 𝑃𝑃𝐵𝐵 P𝑂𝑂λ 𝑃𝑃λ𝑂𝑂 𝑃𝑃𝑂𝑂 𝑃𝑃λ 156 Parametric Estimation If you have forgotten the concepts of posterior probability and prior 
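For reference, here is a compact, self-contained version of the whole noise-modeling example. It is a sketch that assumes the data-generation scheme revealed above and uses a slightly finer search grid than the one in the text.

import random
import numpy as np

# Regenerate the data as revealed above: seed 2020, k = 0.4, standard normal noise
random.seed(2020)
X = np.linspace(-10, 10, 100)
Y = X * 0.4 + np.array([random.normalvariate(0, 1) for _ in range(100)])

# Closed-form MLE under the normal-noise assumption
k_normal = np.sum(X * Y) / np.sum(X * X)

# Grid-search MLE under the Laplace-noise assumption (maximize -sum|y - kx|)
Ks = np.linspace(0.2, 0.6, 1000)
log_likelihoods = [-np.sum(np.abs(Y - k * X)) for k in Ks]
k_laplace = Ks[np.argmax(log_likelihoods)]

print(k_normal, k_laplace)   # both land close to the true coefficient 0.4

Both estimates land near the true value of 0.4, in line with the numbers reported above.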
probability please refer to the Understanding the important concepts section in Chapter 5 Common Probability Distributions The Bayesian theorem basically says that the likelihood probability is just the ratio of posterior probability and prior probability Recall that the posterior distribution is a corrected or adjusted probability of the unknown parameter The stronger our observation O suggests a particular value of λ the bigger the ratio of 𝑃𝑃λ𝑂𝑂 𝑃𝑃λ is In both of our previous cases Pλ is invisible because we dont have any prior knowledge about λ Why is prior knowledge of the unknown parameter important Lets again use the coin tossing game not an experiment anymore as an example Suppose the game can be done by one of two persons Bob or Curry Bob is a serious guy who is not so playful Bob prefers using a fair coin over an unfair coin 80 of the time Bob uses the fair coin Curry is a naughty boy He randomly picks a fair coin or unfair coin for the game fifty fifty Will you take the fact of who you are playing the game with into consideration Of course you will If you play with Curry you will end up with the same analysis of solving MLE problems as coin tossing and noise modeling Earlier we didnt assume anything about the prior information of p However if you play with Bob you know that he is more serious and honest therefore he is unlikely to use an unfair coin You need to factor this into your decision One common mistake that data scientists make is the ignorance of prior distribution of the unknown parameter which leads to the wrong likelihood function The calculation of the modified coin tossing is left to you as an exercise The Bayesian theorem is often utilized together with the law of total probability It has an expression as follows We wont prove it mathematically here but you may think of it in the following way 1 For the first equals sign for the enumeration of all mutually exclusive events Bi each Bi overlaps with event A to some extent therefore the aggregation of the events Bi will complete the whole set of A 2 For the second equals sign we apply the definition of conditional probability to each term of the summation P𝐴𝐴 𝑃𝑃𝐴𝐴 𝐵𝐵𝑖𝑖 𝑖𝑖 𝑃𝑃𝐴𝐴𝐵𝐵𝑖𝑖𝑃𝑃𝐵𝐵𝑖𝑖 𝑖𝑖 Applying the maximum likelihood approach with Python 157 Example of the law of total probability Suppose you want to know the probability that a babys name is Adrian in the US Since Adrian is a genderneutral baby name you can calculate it using the law of total probability as PAdrian PAdrianboy Pboy PAdriangirl Pgirl The last example in this chapter is the famous Monty Hall question It will deepen your understanding of the Bayesian rule You are on a show to win a prize There is a huge prize behind one of three doors Now you pick door A the host opens door B and finds it empty You are offered a second chance to switch to door C or stick to door A what should you do The prior probability for each door A B or C is 𝑃𝑃𝐴𝐴 𝑃𝑃𝐵𝐵 𝑃𝑃𝐶𝐶 1 3 The host will always try to open an empty door after you select if you selected a door without the prize the host will open another empty door If you select the door with the prize the host will open one of the remaining doors randomly Lets use EB to denote the event that door B is opened by the host which is already observed Which pair of the following probabilities should you calculate and compare The likelihood of PEBA and PEBC The posterior probability PAEB and PCEB The answer is the posterior probability Without calculation you know PEBA is 1 2 Why Because if the prize is in fact behind door A door B 
is simply selected randomly by the host However this information alone will not give us guidance on the next action The posterior probability is what we want because it instructs us what to do after the observation Lets calculate PAEB first According to the Bayesian rule and the law of total probability we have the following equation 𝑃𝑃𝐴𝐴𝐸𝐸𝐵𝐵 𝑃𝑃𝐸𝐸𝐵𝐵𝐴𝐴𝑃𝑃𝐴𝐴 𝑃𝑃𝐸𝐸𝐵𝐵 𝑃𝑃𝐸𝐸𝐵𝐵𝐴𝐴𝑃𝑃𝐴𝐴 𝑃𝑃𝐸𝐸𝐵𝐵𝐴𝐴𝑃𝑃𝐴𝐴 𝑃𝑃𝐸𝐸𝐵𝐵𝐵𝐵𝑃𝑃𝐵𝐵 𝑃𝑃𝐸𝐸𝐵𝐵𝐶𝐶𝑃𝑃𝐶𝐶 158 Parametric Estimation Now what is PEBC It is 1 because the host is sure to pick the empty door to confuse you Lets go over the several conditional probabilities together PEBA The prize is behind door A so the host has an equal chance of randomly selecting from door B and door C This probability is 1 2 PEBB The prize is behind door B so the host will not open door B at all This probability is 0 PEBC The prize is behind door C so the host will definitely open door B Otherwise opening door C will reveal the prize This probability is essentially 1 By the same token 𝑃𝑃𝐶𝐶𝐸𝐸𝐵𝐵 2 3 The calculation is left to you as a small exercise This is counterintuitive You should switch rather than sticking to your original choice From the first impression since the door the host opens is empty there should be an equal chance that the prize will be in one of the two remaining two doors equally The devil is in the details The host has to open a door that is empty You will see from a computational sense about the breaking of symmetry in terms of choices First lets ask the question of whether we can do a computational experiment to verify the results The answer is yes but the setup is a little tricky The following code does the job and I will explain it in detail import random randomseed2020 doors ABC count stick switch 0 0 0 trials for i in range10000 prize randomchoicedoors pick randomchoicedoors reveal randomchoicedoors trial 1 while reveal prize or reveal pick P𝐴𝐴𝐸𝐸𝐵𝐵 1 2 1 3 1 2 1 3 0 1 3 1 1 3 1 3 Applying the maximum likelihood approach with Python 159 reveal randomchoicedoors trial1 trialsappendtrial if reveal pick and reveal prize count 1 if pick prize stick 1 else switch 1 printtotal experiment formatcount printtimes of switch formatswitch printtimes of stick formatstick Run it and you will see the following results total experiment 10000 times of switch 6597 times of stick 3403 Indeed you should switch The code follows the following logic 1 For 10000 experiments the prize is preselected randomly The users pick is also randomly selected The user may or may not pick the prize 2 Then the host reveals one of the doors However we know that the host will reveal one empty door from the remaining two doors for sure We use the trial variable to keep track of the times that we try to generate a random selection to meet this condition This variable is also appended to a list object whose name is trials 3 At last we decide whether to switch or not The symmetry is broken when the host tries to pick the empty door Lets use the following code to show the distribution of trials pltfigurefigsize106 plthisttrialsbins 40 160 Parametric Estimation The plot looks like the following Figure 610 Number of trials in the computer simulation In our plain simulation in order to meet the condition that the host wants to satisfy we must do random selection more than one time and sometimes even more than 10 times This is where the bizarreness hides Enough said on the Bayesian theorem you have grasped the foundation of MLE MLE is a simplified scenario of the Bayesian approach to estimation by assuming the prior 
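The modified coin-tossing calculation was left as an exercise earlier. Here is one possible way to set it up, as a sketch rather than the book's own solution; it reuses the likelihood values from the coin example and the priors described for Bob and Curry, and the function name is mine.

from scipy.stats import binom

# Likelihood of 11 heads in 20 tosses under each coin
l_fair = binom.pmf(11, 20, 0.5)
l_unfair = binom.pmf(11, 20, 0.6)

def posterior_fair(prior_fair):
    """P(fair | 11 heads) via Bayes' rule and the law of total probability."""
    evidence = l_fair * prior_fair + l_unfair * (1 - prior_fair)
    return l_fair * prior_fair / evidence

print(posterior_fair(0.8))   # playing with Bob: prior P(fair) = 0.8
print(posterior_fair(0.5))   # playing with Curry: prior P(fair) = 0.5

With such a weakly informative observation (11 heads out of 20), the posterior barely moves away from the prior, which is exactly why ignoring the prior can mislead you.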
distribution of the unknown parameter is uniform Summary In this chapter we covered two important methods of parameter estimation the method of moments and MLE You then learned the background of MLE the Bayesian way of modeling the likelihood function and so on However we dont know how well our estimators perform yet In general it requires a pipeline of hypothesis testing with a quantitative argument to verify a claim We will explore the rich world of hypothesis testing in the next chapter where we will put our hypothesesassumptions to the test 7 Statistical Hypothesis Testing In Chapter 6 Parametric Estimation you learned two important parameter estimation methods namely the method of moments and MLE The underlying assumption for parameter estimation is that we know that the data follows a specific distribution but we do not know the details of the parameters and so we estimate the parameters Parametric estimation offers an estimation but most of the time we also want a quantitative argument of confidence For example if the sample mean from one population is larger than the sample mean from another population is it enough to say the mean of the first population is larger than that of the second one To obtain an answer to this question you need statistical hypothesis testing which is another method of statistical inference of massive power In this chapter you are going to learn about the following topics An overview of hypothesis testing Making sense of confidence intervals and P values from visual examples Using the SciPy Python library to do common hypothesis testing 162 Statistical Hypothesis Testing Understanding the ANOVA model and corresponding testing Applying hypothesis testing to time series problems Appreciating AB testing with realworld examples Some concepts in this chapter are going to be subtle Buckle up and lets get started An overview of hypothesis testing To begin with the overview I would like to share an ongoing example from while I was writing this book As the coronavirus spread throughout the world pharmaceutical and biotechnology companies worked around the clock to develop drugs and vaccines Scientists estimated that it would take at least a year for a vaccine to be available To verify the effectiveness and safety of a vaccine or drug clinical trials needed to be done cautiously and thoroughly at different stages It is a wellknown fact that most drugs and vaccines wont reach the later trial stages and only a handful of them ultimately reach the market How do clinical trials work In short the process of screening medicines is a process of hypothesis testing A hypothesis is just a statement or claim about the statistics or parameters describing a studied population In clinical trials the hypothesis that a medicine is effective or safe is being tested The simplest scenario includes two groups of patients selected randomly You treat one group with the drug and another without the drug and control the rest of the factors Then the trial conductors will measure specific signals to compare the differences for example the concentration of the virus in the respiratory system or the number of days to full recovery The trial conductors then decide whether the differences or the statistics calculated are significant enough You can preselect a significance level α to check whether the trial results meet the expected level of significance Now imagine you observe the average math course scores for 9th grade students in a school You naturally assume that the distribution of each 
years math course scores follows a normal distribution However you find this years sample average score μ2020 is slightly below last years population average score 75 Do you think this finding just comes from the randomness of sampling or is the fundamental level of math skills of all students deteriorating You can use MLE or the methods of the moment to fit this years data to a normal distribution and compare the fitted mean However this still doesnt give you a quantitative argument of confidence In a scorebased a score out of 100 grading system lets say last years average score for all students is 75 and the average score for your 2020 class a 900student sample is 73 Is the decrease real How small or big is the twopoint difference An overview of hypothesis testing 163 To answer these questions we first need to clarify what constitutes a valid hypothesis Here are two conditions The statement must be expressed mathematically For example this years average score is the same as last years is a valid statement This statement can be expressed as 𝐻𝐻0 μ2020 75 with no ambiguity However this years average score is roughly the same as last years is not a valid statement because different people have different assessments of what is roughly the same The statement should be testable A valid statement is about the statistics of observed data If the statement requires data other than the observed data the statement is not testable For example you cant test the differences in students English scores if only math scores are given This famous saying by the statistician Fisher summarizes the requirements for a hypothesis well although he was talking about the null hypothesis specifically The null hypothesis must be exact that is free of vagueness and ambiguity because it must supply the basis of the problem of distribution of which the test of significance is the solution Now we are ready to proceed more mathematically Lets rephrase the math score problem as follows The average math score for the previous years 9th grade students is 75 This year you randomly sample 900 students and find the sample mean is 73 You want to know whether the average score for this years students is lower than last years To begin hypothesis testing the following three steps are required 1 Formulate a null hypothesis A null hypothesis basically says there is nothing special going on In our math score example it means there is no difference between this years score and last years score A null hypothesis is denoted by H0 On the other hand the corresponding alternative hypothesis states the opposite of the null hypothesis It is denoted by H1 or Hα In our example you can use 𝐻𝐻1 μ2020 75 Note that different choices of null hypothesis and alternative hypothesis will lead to different results in terms of accepting or rejecting the null hypothesis 164 Statistical Hypothesis Testing 2 Pick a test statistic that can be used to assess how well the null hypothesis holds and calculate it A test statistic is a random variable that you will calculate from the sampled data under the assumption that the null hypothesis is true Then you calculate the Pvalue according to the known distribution of this test statistic 3 Compute the Pvalue from the test statistic and compare it with an acceptable significance level α Gosh So many new terms Dont worry lets go back to our examples and you will see this unfold gradually After that you will be able to understand these concepts in a coherent way You will be able to follow these three steps to approach 
various kinds of hypothesis testing problems in a unified setting Understanding Pvalues test statistics and significance levels To explain the concepts with the math example lets first get a visual impression of the data I will be using two sample math scores for 900 students in 2019 and sample math scores for 900 students in 2020 Note on given facts The dataset for 2019 is not necessary for hypothesis testing because we are given the fact that the average score for 2019 is exactly 75 I will generate the datasets to provide you with a clear comparable visualization At this point I am not going to tell you how I generated the 2020 data otherwise the ground truth would be revealed to you beforehand I do assure you that the data for 2019 is generated from sampling a normal distribution with a mean of 75 and a variance of 25 The two datasets are called math2020 and math2019 Each of them contains 900 data points Let me plot them with histogram plots so you know roughly what they look like The following code snippet does the job pltfigurefigsize106 plthistmath2020binsnp linspace5010050alpha05labelmath2020 plthistmath2019binsnp linspace5010050alpha05labelmath2019 pltlegend An overview of hypothesis testing 165 Note that I explicitly set the bins to make sure the bin boundaries are fully determined The result looks as follows Figure 71 Histogram plot of math scores from 2020 and 2019 Note that the scores from 2020 do seem to have a smaller mean than the scores from 2019 which is supposed to be very close to 75 Instead of calling the corresponding numpy functions I will just use the following describe function from the scipy librarys stats module from scipy import stats statsdescribemath2020 The result is the following DescribeResultnobs900 minmax5361680120097629 9329408158813376 mean7289645796453996 variance2481446705891462 skewness0007960630504578523 kurtosis03444548003252992 Do the same thing for the year 2019 I find that the mean for the year 2019 is around 75 In Chapter 4 Sampling and Inferential Statistics we discussed the issue of sampling which itself involves randomness Is the difference in means an artifact of randomness or is it real To answer this question lets first embed some definitions into our example starting with the null hypothesis A null hypothesis basically says YES everything is due to randomness An alternative hypothesis says NO to randomness and claims that there are fundamental differences A hypothesis test is tested against the null hypothesis to see whether there is evidence to reject it or not 166 Statistical Hypothesis Testing Back to the math score problem We can pick the null hypothesis 𝐻𝐻0 μ2020 75 and set the alternative hypothesis 𝐻𝐻1 μ2020 75 Or we can choose 𝐻𝐻0 μ2020 75 and set 𝐻𝐻1 μ2020 75 Note that the combination of the null hypothesis and the alternative hypothesis must cover all possible cases The first case is called a twotailed hypothesis because either μ2020 75 or μ2020 75 will be a rebuttal of our null hypothesis The alternative is a onetailed hypothesis Since we only want to test whether our mean score is less than or equal to 75 we choose the onetailed hypothesis for our example Even in the null hypothesis the mean can be larger than 75 but we know this is going to have a negligible likelihood On the choice of onetailed or twotailed hypotheses Whether to use a onetailed hypothesis or a twotailed hypothesis depends on the task at hand One big difference is that choosing a twotailed alternative hypothesis requires an equal split of the significance 
level on both sides What we have now is a dataset a null hypothesis and an alternative hypothesis The next step is to find evidence to test the null hypothesis The null hypothesis reads 𝐻𝐻0 𝜇𝜇2020 75 whereas the alternative hypothesis reads 𝐻𝐻1 μ2020 75 After setting up the hypothesis the next step is to find a rule to measure the strength of the evidence or in other words to quantify the risk of making mistakes that reject a true null hypothesis The significance level α is essentially an index of the likelihood of making the mistake of rejecting a true null hypothesis For example if a medicine under trial is useless α 005 means that we set a threshold that at the chance of less than or equal to 5 we may incorrectly conclude that the medicine is useful If we set the bar way lower at α 0001 it means that we are very picky about the evidence In other words we want to minimize the cases where we are so unlucky that randomness leads us to the wrong conclusion As we continue to talk about significance all hypothesis testing to be discussed in this chapter will be done with statistical significance tests A statistical significance test assumes the null hypothesis is correct until evidence that contradicts the null hypothesis shows up Another perspective of hypothesis testing treats the null hypothesis and the alternative hypothesis equally and tests which one fits the statistical model better I only mention it here for completeness Making sense of confidence intervals and Pvalues from visual examples 167 To summarize if the evidence and test statistics show contradictions against the null hypothesis with high statistical significance smaller α values we reject the null hypothesis Otherwise we fail to reject the hypothesis Whether you reject the null hypothesis or not there is a chance that you will make mistakes Note Hypothesis testing includes the test of correlation or independence In this chapter we mainly focus on the test of differences However claiming two variables are correlated or independent is a legitimate statementhypothesis that can be tested Making sense of confidence intervals and Pvalues from visual examples Pvalues determine whether a research proposal will be funded whether a publication will be accepted or at least whether an experiment is interesting or not To start with let me give you some bullet points about Pvalues properties The Pvalue is a magical probability but it is not the probability that the null hypothesis will be accepted Statisticians tend to search for supportive evidence for the alternative hypothesis because the null hypothesis is boring Nobody wants to hear that there is nothing interesting going on The Pvalue is the probability of making mistakes if you reject the null hypothesis If the Pvalue is very small it means that you can safely reject the null hypothesis without worrying too much that you made mistakes because randomness tricked you If the Pvalue is 1 it means that you have absolutely no reason to reject the null hypothesis because what you get from the test statistic is the most typical results under the null hypothesis The Pvalue is defined in the way it is so that it can be comparable to the significance level α If we obtain a Pvalue smaller than the significance level we say the result is significant at significance level α The risk of making a mistake that rejects the null hypothesis wrongly is acceptable If the Pvalue is not smaller than α the result is not significant 168 Statistical Hypothesis Testing From first principles the Pvalue of an 
event is also the summed probability of observing the event and all events with equal or smaller probability This definition doesnt contradict the point about contradicting the null hypothesis Note that under the assumption of a true null hypothesis the cumulative probability of observing our test statistics and all other equal or rarer values of test statistics is the probability of mistakenly rejecting the null hypothesis The importance of firstprinciples thinking Firstprinciples thinking is very important in studying statistics and programming It is advised that you resist the temptation to use rules and procedures to get things done quickly but instead learn the definitions and concepts so you have a foundation in terms of first principles Please read the definition of the Pvalue carefully to make sure you fully understand it Before moving on to a concrete example of test statistics lets have a look at the Pvalue from two examples from first principles The importance of correctly understanding the Pvalue cannot be stressed enough Calculating the Pvalue from discrete events In our first example we will study the probability and the Pvalue of events in cointossing experiments Lets toss a fair coin 6 times and count the total number of heads There are 7 possibilities from 0 heads to 6 heads We can calculate the probability either theoretically or computationally I will just do a quick experiment with the following lines of code and compare the results with the theoretical values The following code snippet generates the experiment results for 1 million tosses and stores the results in the results variable randomseed2020 results for in range1000000 resultsappendsumrandomrandom 05 for i in range6 The following code snippet normalizes the results and lists them alongside the theoretical results from collections import Counter from math import factorial as factorial counter Counterresults Making sense of confidence intervals and Pvalues from visual examples 169 for head in sortedcounterkeys comput counterhead1000000 theory 056factorial6factorialheadfactorial6 head printheads Computational Theoretical formatheadcomput theory The results look as follows The computational results agree with the theoretical results pretty well heads 0 Computational 0015913 Theoretical 0015625 heads 1 Computational 0093367 Theoretical 009375 heads 2 Computational 0234098 Theoretical 0234375 heads 3 Computational 0312343 Theoretical 03125 heads 4 Computational 0234654 Theoretical 0234375 heads 5 Computational 0093995 Theoretical 009375 heads 6 Computational 001563 Theoretical 0015625 Lets answer the following questions to help us clarify the definition of the Pvalue The answers should be based on theoretical results 1 What is the probability of getting 5 heads what about Pvalue The probability is 009375 However the Pvalue is the sum of 009375 009375 0015625 0015625 021875 The Pvalue is the probability of you seeing such events with equal probability or rarer probability Getting 1 head is equally likely as getting 5 heads Getting 6 heads or 0 heads is more extreme With a Pvalue of roughly 0 we say that the event of observing 5 heads is quite typical The Pvalue for observing 6 heads is about 0031 The calculation is left to you as an exercise 2 What is the Pvalue of getting 3 heads The surprising answer here is 1 Among all 7 kinds of possibilities getting 3 heads is the most likely outcome therefore the rest of the outcomes are all rarer than getting 3 heads Another implication is that there are no other events that 
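The same logic can be wrapped into a small helper function. This is a sketch assuming SciPy, and the function name is mine, not the book's.

from scipy.stats import binom

def p_value(heads, n=6, p=0.5):
    """Sum the probabilities of all outcomes that are as likely or less likely
    than the observed number of heads."""
    probs = [binom.pmf(k, n, p) for k in range(n + 1)]
    observed = binom.pmf(heads, n, p)
    # small tolerance guards against floating-point comparison issues
    return sum(pr for pr in probs if pr <= observed + 1e-12)

print(p_value(5))   # about 0.21875
print(p_value(6))   # about 0.03125
print(p_value(3))   # 1.0, since no outcome is more typical than 3 heads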
are more typical than observing 3 heads Now you should have a better understanding of the Pvalue by having treated it as a measurement of typicalness In the next section lets move on to a case involving the continuous Probability Density Function PDF case where we need some integration 170 Statistical Hypothesis Testing Calculating the Pvalue from the continuous PDF We just calculated Pvalues from discrete events Now lets examine a continuous distribution The distribution I am going to use is the Fdistribution The Fdistribution is the distribution we are going to use in the analysis of the variance test later so it is good to have a first impression here The analytical form of the Fdistribution is parameterized by two degrees of freedom d1 and d2 Fd1 d2 If x is greater than 0 the PDF is as follows f𝑥𝑥 𝑑𝑑1 𝑑𝑑2 1 𝐵𝐵 𝑑𝑑1 2 𝑑𝑑2 2 𝑑𝑑1 𝑑𝑑2 𝑑𝑑1 2 𝑥𝑥 𝑑𝑑1 2 1 1 𝑑𝑑1 𝑑𝑑2 𝑥𝑥 𝑑𝑑1𝑑𝑑2 2 The Bxy function is called the beta function and its a special kind of function If you are familiar with calculus it has the following definition as an integration B𝑥𝑥 𝑦𝑦 𝑡𝑡𝑥𝑥11 𝑡𝑡𝑦𝑦1dt 1 0 Fortunately we dont need to write our own function to generate these samples The scipy library provides another handy function f for us to use The following code snippet generates the PDFs with four pairs of parameters and plots them from scipystats import f pltfigurefigsize108 styles for i dfn dfd in enumerate2030206050305060 x nplinspacefppf0001 dfn dfd fppf0999 dfn dfd 100 pltplotx fpdfx dfn dfd linestyle stylesi lw4 alpha06 label formatdfndfd pltlegend Making sense of confidence intervals and Pvalues from visual examples 171 The plotted graph looks like this Figure 72 The Fdistribution PDF The probability distribution function of the Fdistribution is not symmetrical it is right skewed with a long tail Lets say you have a random variable x following the distribution F2060 If you observe x to be 15 what is the Pvalue for this observation The following code snippet highlights the region where the equal or rare events are highlighted in red I generated 100 linearly spaced data points and stored them in the x variable and selected those rarer observations on the right and those on the left pltfigurefigsize108 dfn dfd 2060 x nplinspacefppf0001 dfn dfd fppf0999 dfn dfd 100 pltplotx fpdfx dfn dfd linestyle lw4 alpha06 label formatdfndfd right xx15 left xfpdfx dfn dfd fpdfrightdfndfd008 pltfillbetweenrightf pdfrightdfndfdalpha04colorr pltfillbetweenleftfpdfleftdfndfdalpha04colorr pltlegend 172 Statistical Hypothesis Testing There is a little bit of hardcoding here where I manually selected the left part of the shaded area You are free to inspect the expression of the left variable The result looks as follows Figure 73 A rarer observation than 15 in F2060 The integration of the shaded area gives us the Pvalue for observing the value 15 The following code snippet uses the Cumulative Distribution Function CDF to calculate the value fcdfleft1dfndfd 1fcdfright0dfndfd The Pvalue is about 0138 so not very bad It is somewhat typical to observe a 15 from such an Fdistribution If your preselected significance level is α 005 then this observation is not significant enough By now you should understand the definition and implication of the Pvalue from first principles The remaining question is what exactly is the Pvalue in a hypothesis test The answer involves test statistics In the second step of hypothesis testing we calculate the best kind of statistic and check its Pvalue against a preselected significance level In the math score example we want to 
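Going back to the F-distribution example for a moment: instead of hand-picking the left boundary of the shaded region, you can solve for the point on the left whose density equals the density at the observed value of 1.5 and then use the CDF. This is a sketch of an alternative computation, not the book's snippet; the use of brentq and the mode formula are my additions.

import numpy as np
from scipy.stats import f
from scipy.optimize import brentq

dfn, dfd = 20, 60
obs = 1.5
density_at_obs = f.pdf(obs, dfn, dfd)

# Mode of the F distribution (valid for dfn > 2); the PDF rises up to it, then falls
mode = (dfn - 2) / dfn * dfd / (dfd + 2)

# Point left of the mode whose density equals the density at the observation
x_left = brentq(lambda x: f.pdf(x, dfn, dfd) - density_at_obs, 1e-6, mode)

# P-value: mass of the left region plus mass of the right tail beyond 1.5
p_value = f.cdf(x_left, dfn, dfd) + (1 - f.cdf(obs, dfn, dfd))
print(p_value)   # compare with the ~0.138 obtained from the hardcoded region above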
compare a sample mean against a constant this is a onesample onetailed test The statistic we want to use is the tstatistic The specific hypothesis test we want to apply is Students ttest Please bear with me on the new concepts The tstatistic is nothing special its just another random variable that follows a specific distribution which follows Students tdistribution We will cover both specific and Students tdistribution shortly with clear definitions and visualizations Making sense of confidence intervals and Pvalues from visual examples 173 Tests and test statistics Different problems require different test statistics If you want to test the differences in samples across several categories or groups you should use the Analysis of Variance ANOVA Ftest If you want to test the independence of two variables in a population you should use the Chisquare test which we will cover very soon Under the null hypothesis the tstatistic t is calculated as follows t μ2020 75 𝑠𝑠𝑛𝑛 n is the sample size and s is the sample standard deviation The random variable t follows Students tdistribution with a degree of freedom of n1 Students tdistribution Students tdistribution is a continuous probability distribution used when estimating the mean of a normally distributed distribution with an unknown population standard deviation and a small sample size It has a complicated PDF with a parameter called the Degree of Freedom DOF We wont go into the formula of the PDF as its convoluted but I will show you the relationship between the DOF and the shape of the tdistribution PDF The following code snippet plots the tdistributions with various DOFs alongside the standard normal distribution functions Here I use the scipystats module from scipystats import t norm pltfigurefigsize126 DOFs 248 linestyles for i df in enumerateDOFs x nplinspace4 4 100 rv tdf pltplotx rvpdfx k lw2 label DOF strdflinestylelinestylesi pltplotxnorm01pdfxk lw2 labelStandard Normal pltlegend 174 Statistical Hypothesis Testing The result looks like the following Pay attention to the line styles As you see when the DOF increases the tdistribution PDF tends to approach the standard normal distribution with larger and larger centrality Figure 74 The Students tdistribution PDF and a standard normal PDF Alright our statistic t μ2020 75 𝑠𝑠𝑛𝑛 follows the tdistribution with a DOF of 899 By substituting the numbers we can get the value of our tstatistic using the following code npmeanmath202075npstdmath202030 The result is about 126758 Replacing the tdistribution with a normal distribution With a large DOF 899 in our case the tdistribution will be completely indistinguishable from a normal distribution In practice you can use a normal distribution to do the test safely Significance levels in tdistribution Lets say we selected a significance level of α 001 Is our result significant enough We need to find out whether our result exceeds the threshold of the significance level 001 The tdistribution doesnt have an easytocalculate PDF so given a significance level of α 001 how do we easily find the thresholds Before the advent of easytouse libraries or programs people used to build tstatistics tables to solve this issue For a given significance level you can basically look up the table and find the corresponding tstatistic value Making sense of confidence intervals and Pvalues from visual examples 175 As the tdistribution is symmetric the importance of whether you are doing a onetail test or a twotailed test increases Lets first check a tdistribution table for a 
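The tabulated thresholds discussed next are easy to reproduce with SciPy's percent-point function (the inverse CDF). The following sketch, which assumes scipy.stats, regenerates the one-tailed critical values for a DOF of 5.

from scipy.stats import t

# One-tailed critical values for DOF = 5 at the usual significance levels
alphas = [0.10, 0.05, 0.025, 0.01, 0.005, 0.001, 0.0005]
critical_values = [t.ppf(1 - a, df=5) for a in alphas]
print([round(c, 3) for c in critical_values])
# roughly [1.476, 2.015, 2.571, 3.365, 4.032, 5.894, 6.869]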
onetail test For example with a DOF of 5 to be significant at the 005 level the tstatistic needs to be 2015 Figure 75 The tdistribution table for onetailed significance levels For a more intuitive impression the following code snippet plots the different thresholds of the tdistribution PDF pltfigurefigsize106 df 5 x nplinspace8 8 200 rv tdf pltplotx rvpdfx k lw4linestyle alphas 0100500250010005000100005 threasholds 1476201525713365403258946869 for thre alpha in zipthreasholdsalphas pltplotthrethre0rvpdfthre label formatstralphalinewidth4 pltlegend The result looks as follows Figure 76 The significance levels for a tstatistics distribution with DOF5 176 Statistical Hypothesis Testing Adding the following two lines will zoom into the range that we are interested in pltxlim28 pltylim0015 The result looks as follows If you cant distinguish the colors just remember that the smaller the significance level is the further away the threshold is from the origin Figure 77 A zoomedin tdistribution showing different significance levels As the significance level decreases that is as α decreases we tend toward keeping the null hypothesis because it becomes increasingly harder to observe a sample with such low probability Next lets check the twotailed tdistribution table The twotailed case means that we must consider both ends of the symmetric distribution The summation of both gives us the significance level Figure 78 The tdistribution table for twotailed significance levels Making sense of confidence intervals and Pvalues from visual examples 177 Notice that for α 02 the tstatistic is the same as α 01 for a onetailed test The following code snippet illustrates the relationship between tstatistic and onetailed test using α 001 as an example I picked the most important region to show pltfigurefigsize106 df 5 x nplinspace8 8 200 rv tdf pltplotx rvpdfx k lw4linestyle alpha001 onetail 3365 twotail 4032 pltplotonetailonetail0rvpdfonetail label onetaillinewidth4linestyle pltplottwotailtwotail0rvpdftwotail label two taillinewidth4colorrlinestyle pltplottwotailtwotail0rvpdftwotail label two taillinewidth4colorrlinestyle pltfillbetweennplinspace8twotail200 rvpdfnplinspace8two tail200colorg pltfillbetweennplinspaceonetailtwotail200 rvpdfnplinspaceonetailtwo tail200colorg pltylim0002 pltlegend 178 Statistical Hypothesis Testing The result looks as follows Figure 79 A comparison of twotailed and onetailed results for the same significance level You need to trust me that the shaded parts have the same area The onetailed case only covers the region to the right of the vertical dashed line the left edge of the right shaded area but the twotailed case covers both sides symmetrically the outer portion of the two dotted vertical lines Since the significance levels are the same they should both cover the same area under the curve AUC which leads to the equal area of the two shaded regions For our onesided test our tstatistic is less than 10 It is equivalent to the threshold for the positive value because of the symmetry of the problem If you look up the tdistribution table with such a large DOF of 899 the difference between large DOFs is quite small For example the following two rows are found at the end of the tdistribution table Figure 710 A tdistribution table with very large DOFs In the math score example the absolute value 1267 for our tstatistic is far away from both 2358 and 2330 We have enough confidence to reject the null hypothesis which means that the alternative hypothesis is true indeed students math skills 
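As a small check of the earlier remark that the t-distribution with a DOF of 899 is practically a normal distribution, the following sketch (assuming SciPy) compares the two sets of one-tailed critical values.

from scipy.stats import t, norm

# With DOF = 899 the t thresholds are nearly identical to the normal ones
for alpha in [0.01, 0.001]:
    print(alpha, round(t.ppf(1 - alpha, df=899), 3), round(norm.ppf(1 - alpha), 3))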
have declined Making sense of confidence intervals and Pvalues from visual examples 179 The following code snippet reveals how I generated the score data randomseed2020 math2019 randomnormalvariate755 for in range900 math2020 randomnormalvariate735 for in range900 Feel free to reproduce the random data generated to visualize it yourself Next lets examine another concept in hypothesis testing power The power of a hypothesis test I will briefly mention another concept you may see from other materials the power of a hypothesis test The power of hypothesis testing is the probability of making the correct decision if the alternative hypothesis is true It is easier to approach this concept from its complementary part The opposite side of this is the probability of failing to reject the null hypothesis H0 while the alternative hypothesis H1 is true This is called a type II error The smaller the type II error is the greater the power will be Intuitively speaking greater power means the test is more likely to detect something interesting going on On the other hand everything comes with a cost If the type II error is low then we will inevitably reject the null hypothesis based on some observations that indeed originate from pure randomness This is an error thats called a type I error The type I error is the mistake of rejecting the null hypothesis H0 while H0 is indeed true Does this definition ring a bell When you choose a significance level α you are choosing your highest acceptable type I error rate As you can imagine the type I error and type II error will compensate for each other in most cases We will come back to this topic again and again when we talk about machine learning Examples of type I and type II errors A classic example of type I and type II errors has to do with radar detection Say that a radar system is reporting no incoming enemy aircraft the null hypothesis is that there are no incoming enemy aircraft and the alternative hypothesis is that there actually are incoming enemy aircraft A type I error is reporting the enemy aircraft when there are no aircraft in the area A type II error would be when there are indeed incoming enemy aircraft but none were reported 180 Statistical Hypothesis Testing In the next section we are going to use the SciPy library to apply what we have learned so far to various kinds of hypothesis testing problems You will be amazed at how much you can do Using SciPy for common hypothesis testing The previous section went over a ttest and the basic concepts in general hypothesis testing In this section we are going to fully embrace the powerful idea of the paradigm of hypothesis testing and use the SciPy library to solve various hypothesis testing problems The paradigm The powerful idea behind the hypothesis testing paradigm is that if you know that your assumption when hypothesis testing is roughly satisfied you can just invoke a well written function and examine the Pvalue to interpret the results Tip I encourage you to understand why a test statistic is built in a specific way and why it follows a specific distribution For example for the tdistribution you should understand what the DOF is However this will require a deeper understanding of mathematical statistics If you just want to use hypothesis testing to gain insights knowing the paradigm is enough If you want to apply hypothesis testing to your dataset follow this paradigm 1 Identify the problems you are interested in exploring What are you going to test A difference correlation or independence 2 Find 
the correct hypothesis test and assumption Examine whether the assumption is satisfied carefully 3 Choose a significance level and perform the hypothesis test with a software package Recall that in the previous section we did this part manually but now it is all left to the software In this section I will follow the paradigm and do three different examples in SciPy Using SciPy for common hypothesis testing 181 Ttest First I will redo the ttest with SciPy The default API for a single sample ttest from scipystats only provides for twotailed tests We have already seen an example of interpreting and connecting twotailed and onetailed significance levels so this isnt an issue anymore The function we are going to use is called scipystatsttest1samp The following code snippet applies this function to our math score data from scipy import stats statsttest1sampmath2020750 The result reads as follows Ttest1sampResultstatistic12668347669098846 pvalue5842470780196407e34 The first value statistic is the tstatistic which agrees with our calculation The second term is the Pvalue it is so small that if the null hypothesis is true and you drew a 900student sample every second it would take longer than the amount of time the universe has existed for you to observe a sample as rare as we have here Lets do a twosample ttest The twosample ttest will test whether the means of two samples are the same For a twosample test the significance level is twotailed as our hypothesis is 𝐻𝐻0 μ1 μ2 There are two cases for the twosample ttest depending on the variance in each sample If the two variances are the same it is called a standard independent twosample ttest if the variances are unequal the test is called Welchs ttest Lets first examine the standard ttest The following code snippet generates and plots two normally distributed samples one at mean 2 and another at 21 with an equal population variance of 1 nprandomseed2020 sample1 nprandomnormal21400 sample2 nprandomnormal211400 pltfigurefigsize106 plthistsample1binsnplinspace1510alpha05labelsa mple1 182 Statistical Hypothesis Testing plthistsample2binsnplinspace1510alpha05labelsa mple2 pltlegend The result looks as follows Figure 711 Two samples with unequal means Lets call the ttestind ttest function directly with the following line of code statsttestindsample1sample2 The result looks as follows TtestindResultstatistic17765855804956159 pvalue007601736167057595 Our tstatistic is about 18 If our significance level is set to 005 we will fail to reject our null hypothesis How about increasing the number of samples Will it help Intuitively we know that more data contains more information about the population therefore it is expected that well see a smaller Pvalue The following code snippet does the job nprandomseed2020 sample1 nprandomnormal21900 sample2 nprandomnormal211900 statsttestindsample1sample2 Using SciPy for common hypothesis testing 183 The result shows a smaller Pvalue TtestindResultstatistic3211755683955914 pvalue00013425868478419776 Note that Pvalues can vary significantly from sample to sample In the following code snippet I sampled the two distributions and conducted the twosample ttest 100 times nprandomseed2020 pvalues for in range100 sample1 nprandomnormal21900 sample2 nprandomnormal211900 pvaluesappendstatsttestindsample1sample21 Lets see how the Pvalue itself distributes in a boxplot pltfigurefigsize106 pltboxplotpvalues The boxplot of the Pvalues will look as follows Figure 712 A boxplot of Pvalues for 100 standard twosample ttests The majority of 
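The repeated-sampling experiment above can be pushed one step further to estimate the power of the test discussed earlier, that is, how often the test correctly rejects the null hypothesis for each sample size. This sketch is not in the book; the function name and the number of repetitions are mine.

import numpy as np
from scipy import stats

def rejection_rate(n, repeats=1000, alpha=0.05):
    """Fraction of repetitions in which the two-sample t-test rejects H0 when
    the true means really differ (2.0 vs 2.1) - an empirical power estimate."""
    np.random.seed(2020)
    rejections = 0
    for _ in range(repeats):
        sample1 = np.random.normal(2, 1, n)
        sample2 = np.random.normal(2.1, 1, n)
        if stats.ttest_ind(sample1, sample2).pvalue < alpha:
            rejections += 1
    return rejections / repeats

print(rejection_rate(400))   # noticeably lower power with the smaller samples
print(rejection_rate(900))   # higher power with the larger samples

This makes the earlier point concrete: more data means more information about the population, so the same true difference is detected far more often.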
the Pvalues do fall into the region of 0 02 but there are a handful of outliers as well 184 Statistical Hypothesis Testing Next if you dont know whether the two samples have the same variance you can use Welchs ttest First lets use the following code snippet to generate two samples from two different uniform distributions with different sample sizes as well Note Our null hypothesis remains unchanged which means that the twosample means are the same nprandomseed2020 sample1 nprandomuniform210400 sample2 nprandomuniform112900 pltfigurefigsize106 plthistsample1binsnp linspace01520alpha05labelsample1 plthistsample2binsnp linspace01520alpha05labelsample2 pltlegend The result looks as follows Figure 713 Two uniformly distributed samples with different means variances and sample sizes Lets call the same SciPy function but this time well tell it that the variances are not equal by setting the equalvar parameter to False statsttestindsample1sample2equalvarFalse Using SciPy for common hypothesis testing 185 The result shows quite a small Pvalue TtestindResultstatistic31364786834852163 pvalue00017579405400172416 With a significance level of 001 we will have enough confidence to reject the null hypothesis You dont have to know what Welchs tstatistic distributes this is the gift that the Python community gives to you The normality hypothesis test Our next test is a normality test In a normality test the null hypothesis H0 is that the sample comes from a normal distribution The alternative hypothesis H1 is that the sample doesnt come from a normal distribution There are several ways to do normality tests not a hypothesis test You can visually examine the datas histogram plot or check its boxplot or QQ plot However we will refer to the statisticians toolsets in this section A note on QQ plots QQ plots are not covered in this book They are used to compare two distributions You can plot data from a distribution against data from an ideal normal distribution and compare distributions There are several major tests for normality The most important ones are the Shapiro Wilk test and the AndersonDarling test Again we wont have time or space to go over the mathematical foundation of either test all we need to do is check their assumptions and call the right function in a given scenario How large should a random sample be to suit a normality test As you may have guessed if the size of the sample is small it really doesnt make much sense to say whether it comes from a normal distribution or not The extreme case is the sample size being 1 It is possible that it comes from any distribution There is no exact rule on how big is big enough The literature mentions that 50 is a good threshold beyond which the normality test is applicable I will first generate a set of data from Chisquared distributions with different parameters and use the two tests from SciPy to obtain the Pvalues 186 Statistical Hypothesis Testing A note on Chisquare distributions The Chisquared distribution or x2 distribution is a very important distribution in statistics The sum of the square of k independent standard normal random variables follows a Chisquared distribution with a DOF of k The following code snippet plots the real PDFs of Chisquared distributions so that you get an idea about the DOFs influence over the shape of the PDF from scipystats import chi2 pltfigurefigsize106 DOFs 481632 linestyles for i df in enumerateDOFs x nplinspacechi2ppf001 dfchi2ppf099 df 100 rv chi2df pltplotx rvpdfx k lw4 label DOF strdflinestylelinestylesi 
pltlegend The result looks as follows Figure 714 Chisquared distributions with different DOFs Next lets generate two sets of data of sample size 400 and plot them nprandomseed2020 sample1 nprandomchisquare8400 sample2 nprandomchisquare32400 pltfigurefigsize106 Using SciPy for common hypothesis testing 187 plthistsample1binsnp linspace06020alpha05labelsample1 plthistsample2binsnp linspace06020alpha05labelsample2 pltlegend The histogram plot looks as follows Sample one has a DOF of 8 while sample two has a DOF of 32 Figure 715 The Chisquared distributions with different DOFs Now lets call the shapiro and anderson test functions in SciPyStats to test the normality The following code snippet prints out the results The anderson function can be used to test fitness to other distributions but defaults to a normal distribution printResults for ShapiroWilk Test printSample 1 shapirosample1 printSample 2 shapirosample2 print printResults for AndersonDarling Test printSample 1 andersonsample1 printSample 2 andersonsample2 The results for the ShapiroWilk test read as follows Sample 1 09361660480499268 4538336286635802e12 Sample 2 09820653796195984 7246905443025753e05 188 Statistical Hypothesis Testing The results for the AndersonDarling Test read as follows Sample 1 AndersonResultstatistic6007815329566711 criticalvaluesarray057 065 0779 0909 1081 significancelevelarray15 10 5 25 1 Sample 2 AndersonResultstatistic18332323421475962 criticalvaluesarray057 065 0779 0909 1081 significancelevelarray15 10 5 25 1 The results for ShapiroWilk follow the format of statistic Pvalue so it is easy to see that for both sample one and sample two the null hypothesis should be rejected The results for the AndersonDarling test gives the statistic but you need to determine the corresponding critical value The significance level list is in percentages so 1 means 1 The corresponding critical value is 1081 For both cases the statistic is larger than the critical value which also leads to rejection of the null hypothesis Different test statistics cant be compared directly For the same hypothesis test if you choose a different test statistic you cannot compare the Pvalues from different methods directly As you can see from the preceding example the ShapiroWilk test has a much smaller Pvalue than the AndersonDarling test for both samples Before moving on to the next test lets generate a sample from a normal distribution and test the normality The following code snippet uses a sample from a standard normal distribution with a sample size of 400 sample3 nprandomnormal01400 printResults for ShapiroWilk Test printSample 3 shapirosample3 print printResults for AndersonDarling Test printSample 3 andersonsample3 The results read as follows Note that the function call may have a different output from the provided Jupyter notebook which is normal Results for ShapiroWilk Test Sample 3 0995371401309967 02820892035961151 Results for AndersonDarling Test Using SciPy for common hypothesis testing 189 Sample 3 AndersonResultstatistic046812258253402206 criticalvaluesarray057 065 0779 0909 1081 significancelevelarray15 10 5 25 1 It is true that we cant reject the null hypothesis even with a significance level as high as α 015 The goodnessoffit test In a normality test we tested whether a sample comes from a continuous normal distribution or not It is a fitness test which means we want to know how well our observation agrees with a preselected distribution Lets examine another goodnessoffit test for a discrete case the Chisquared goodnessoffit 
test Suppose you go to a casino and encounter a new game The new game involves drawing cards from a deck of cards three times the deck doesnt contain jokers You will win the game if two or three of the cards drawn belong to the suit of hearts otherwise you lose your bet You are a cautious gambler so you sit there watch and count After a whole day of walking around the casino and memorizing you observe the following results of card draws There are four cases in total Ive tabulated the outcomes here Figure 716 Counting the number of hearts Tip Note that in real life casinos will not allow you to count cards like this It is not in their interests and you will most likely be asked to leave if you are caught First lets calculategiven a 52card deck where there are 13 hearts what the expected observation look like For example picking 2 hearts would mean picking 2 hearts from the 13 hearts of the deck and pick 1 card from the remaining 39 cards which yields a total number of 13 11 2 39 381 So the total combination of choosing 3 hearts out of 52 cards is 52 3 39 Taking the ratio of those two instances we have the probability of obtaining 2 hearts being about 138 190 Statistical Hypothesis Testing The number of all observations is 1023 so in a fairgame scenario we should observe roughly 10231380 which gives 141 observations of 2 hearts cards being picked Based on this calculation you probably have enough evidence to question the casino owner The following code snippet calculates the fairgame probability and expected observations I used the comb function from the SciPy library from scipyspecial import comb P comb393icomb13icomb523 for i in range4 expected 1023p for p in P observed 46045110210 The index in the P and expected arrays means the number of observed hearts For example P0 represents the probability of observing 0 hearts Lets use a bar plot to see the differences between the expected values and the observed values The following code snippet plots the expected values and the observed values back to back x nparray0123 pltfigurefigsize106 pltbarx02expectedwidth04labelExpected pltbarx02observedwidth04 label Observed pltlegend pltxticksticks0123 The output of the code is as shown in the following graph Figure 717 The expected number of hearts and the observed number of hearts Using SciPy for common hypothesis testing 191 We do see that it is somewhat more probable to get fewer hearts than other cards Is this result significant Say we have a null hypothesis H0 that the game is fair How likely is it that our observation is consistent with the null hypothesis The Chisquare goodnessof fit test answers this question Tip Chisquare and Chisquared are often used interchangeably In this case the x2 statistic is calculated as 𝑂𝑂𝑖𝑖 𝐸𝐸𝑖𝑖2 𝐸𝐸𝑖𝑖 4 𝑖𝑖 where 0i is the number of observations for category i and Ei is the number of expected observations for category i We have four categories and the DOF for this x2 distribution is 4 1 3 Think about the expressive meaning of the summation If the deviation from the expectation and the observation is large the corresponding term will also be large If the expectation is small the ratio will become large which puts more weight on the smallexpectation terms Since the deviation for 2heart and 3heart cases is somewhat large we do expect that the statistic will be largely intuitive The following code snippet calls the chisquare function in SciPy to test the goodness of the fit from scipystats import chisquare chisquareobservedexpected The result reads as follows 
PowerdivergenceResultstatistic14777716323788255 pvalue0002016803916729754 With a significance level of 001 we reject the null hypothesis that the game cant be fair 192 Statistical Hypothesis Testing The following PDF of the x2 distribution with a DOF of 3 can give you a visual idea of how unfair the game is Figure 718 A Chisquared distribution with and DOF of 3 The code for generating the distribution is as follows pltfigurefigsize106 x nplinspacechi2ppf0001 3chi2ppf0999 3 100 rv chi23 pltplotx rvpdfx k lw4 label DOF str3linestyle Next lets move on to the next topic ANOVA A simple ANOVA model The ANOVA model is actually a collection of models We will cover the basics of ANOVA in this section ANOVA was invented by British statistician RA Fisher It is widely used to test the statistical significance of the difference between means of two or more samples In the previous ttest you saw the twosample ttest which is a generalized ANOVA test Using SciPy for common hypothesis testing 193 Before moving on let me clarify some terms In ANOVA you may often see the terms factor group and treatment Factor group and treatment basically mean the same thing For example if you want to study the average income of four cities then city is the factor or group It defines the criteria how you would love to classify your datasets You can also classify the data with the highest degree earned therefore you can get another factorgroup degree The term treatment originates from clinical trials which have a similar concept of factors and groups You may also hear the word level Level means the realizations that a factor can be For example San Francisco Los Angeles Boston and New York are four levels for the factorgroup city Some literature doesnt distinguish between groups or levels when it is clear that there is only one facet to a whole dataset When the total number of samples extends beyond two lets say g groups with group i having ni data points the null hypothesis can be formulated as follows 𝐻𝐻0 μ1 μ2 μ𝑔𝑔 In general you can do a sequence of ttests to test each pair of samples You will have gg 12 ttests to do For two different groups group i and group j you have the null hypothesis 𝐻𝐻0 μ1 μ2 μ𝑔𝑔 This approach has two problems You need to do more than one hypothesis test and the number of tests needed doesnt scale well The results require additional analysis Now lets examine the principles of ANOVA and see how it approaches these problems I will use the average income question as an example The sample data is as follows Figure 719 Income data samples from four cities Assumptions for ANOVA ANOVA has three assumptions The first assumption is that the data from each group must be distributed normally The second assumption is that samples from each group must have the same variance The third assumption is that each sample should be randomly selected In our example I assume that these three conditions are met However the imbalance of income does violate normality in real life Just to let you know 194 Statistical Hypothesis Testing The ANOVA test relies on the fact that the summation of variances is decomposable The total variance of all the data can be partitioned into variance between groups and variance within groups VAR𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 VAR𝑏𝑏𝑏𝑏𝑡𝑡𝑏𝑏𝑏𝑏𝑏𝑏 VAR𝑏𝑏𝑤𝑤𝑡𝑡ℎ𝑤𝑤𝑖𝑖 Heres that again but using notation common in the literature 𝑆𝑆𝑇𝑇 2 𝑆𝑆𝐵𝐵 2 𝑆𝑆𝑊𝑊 2 Now we define the three terms Let me use Xij to denote the j th data point from the j th group μ to denote the mean of the whole dataset in our example the mean of all the income 
data from all four groups and μi to denote the mean from group i For example μ1 is the average income for people living in San Francisco The total variance 𝑆𝑆𝑇𝑇 2 can be defined as follows 𝑆𝑆𝑇𝑇 2 𝑋𝑋𝑖𝑖𝑖𝑖 μ 2 𝑛𝑛𝑖𝑖 𝑖𝑖1 𝑔𝑔 𝑖𝑖1 The variance within groups 𝑆𝑆𝑊𝑊 2 is defined as follows 𝑆𝑆𝑊𝑊 2 𝑋𝑋𝑖𝑖𝑖𝑖 μi 2 𝑛𝑛𝑖𝑖 𝑖𝑖1 𝑔𝑔 𝑖𝑖1 The only difference is that the data point in each group now subtracts the group mean rather than the total mean The variance between groups is defined as follows 𝑆𝑆𝐴𝐴 2 𝑛𝑛𝑖𝑖μ𝑖𝑖 μ2 𝑔𝑔 𝑖𝑖1 The square of the difference between the group mean and the total mean is weighted by the number of group members The reason why this partition holds comes from the fact that a data point Xij can be decomposed as follows 𝑋𝑋𝑖𝑖𝑖𝑖 μ μ𝑖𝑖 μ 𝑋𝑋𝑖𝑖𝑖𝑖 μ𝑖𝑖 Using SciPy for common hypothesis testing 195 The first term on the righthand side of the equation is the total mean The second term is the difference of means across the group and the third term is the difference within the group I encourage you to substitute this expression into the formula of 𝑆𝑆𝑇𝑇 2 and collect terms to rediscover it as the sum of 𝑆𝑆𝐵𝐵 2 and 𝑆𝑆𝑊𝑊 2 It is a good algebraic exercise The following code snippet does the calculation to verify the equation First I create the following four numpy arrays SF nparray1200001103001278006890079040208000 15900089000 LA nparray6570088340240000190000450802590069000120300 BO nparray8799986340980001240001138009800010800078080 NY nparray30000062010450001300002380005600089000123000 Next the following code snippet calculates 𝑆𝑆𝑇𝑇 2 𝑆𝑆𝐵𝐵 2 and 𝑆𝑆𝑊𝑊 2 mu npmeannpconcatenateSFLABONY ST npsumnpconcatenateSFLABONY mu2 SW npsumSFnpmeanSF2 npsumLAnpmeanLA2 npsumBOnpmeanBO2 npsumNYnpmeanNY2 SB 8npmeanSFmu2 8npmeanLAmu2 8npmeanBOmu2 8npmeanNYmu2 Now lets verify that ST SW SB ST SWSB The answer is True So indeed we have this relationship How is this relationship useful Let me first denote the variance for each group with σ2 they are the same because this is one of the assumptions and then check each term carefully 196 Statistical Hypothesis Testing The question is what is the distribution of the statistic 𝑆𝑆𝑊𝑊 2 σ2 Recall that for group i the sum of the squared differences is just a Chisquare distribution with a DOF of ni 1 𝑋𝑋𝑖𝑖𝑖𝑖 μ𝑖𝑖 2 𝑛𝑛𝑖𝑖 𝑖𝑖 σ2χ𝑛𝑛𝑖𝑖1 2 When the null hypothesis holds namely μi μj for arbitrary i and j 𝑆𝑆𝑊𝑊 2 𝜎𝜎2 is just the summation of the statistic Because each group is independent we have 𝑆𝑆𝑊𝑊 2 σ2 χ𝑛𝑛𝑔𝑔 2 where n is the total number of the samples and g is the number of groups How about 𝑆𝑆𝐵𝐵 2 𝜎𝜎2 When the null hypothesis holds each observation no matter which group it comes from can be treated as a realization of 𝑁𝑁μ σ2 therefore 𝑆𝑆𝑇𝑇 2 will follow a Chisquare distribution with a DOF of n 1 However we have the equation 𝑆𝑆𝑇𝑇 2 𝑆𝑆𝐵𝐵 2 𝑆𝑆𝑊𝑊 2 so 𝑆𝑆𝐵𝐵 2 𝜎𝜎2 must follow an x2 distribution with a DOF of g 1 where g 1 n 1 n g The test statistic F is further defined as the ratio of the two equations where σ2 is canceled 𝐹𝐹 𝑆𝑆𝐵𝐵 2𝑔𝑔 1 𝑆𝑆𝑊𝑊 2 𝑛𝑛 𝑔𝑔 The statistic F follows an Fdistribution of Fg 1 n g If the null hypothesis doesnt hold the variance between groups 𝑆𝑆𝐵𝐵 2 will be large so F will be large If the null hypothesis is true 𝑆𝑆𝐵𝐵 2 will be small so F will also be small Lets manually calculate our test statistic for the income problem and compare it with the functionality provided by SciPy The following code snippet computes the F statistic F SB41SW484 F The result is about 0388 Before we do the Ftest lets also look at the PDF of the Fdistribution The following code snippet plots the Fdistribution 
with DOFs of 3 and 28 pltfigurefigsize106 x nplinspacefppf0001 3 28fppf0999 3 28 100 rv fdfn3 dfd28 pltplotx rvpdfx k lw4linestyle Using SciPy for common hypothesis testing 197 The plot looks as follows Figure 720 The Fdistribution with DOFs of 3 and 28 You can estimate that such a small statistic 0388 will probably have a very large Pvalue Now lets use the foneway function from sciPystats to do the Ftest The following code snippet gives us the statistic and the Pvalue from scipystats import foneway fonewayLANYSFBO Here is the result FonewayResultstatistic038810442907126874 pvalue07624301696455358 The statistic agrees with our own calculation and the Pvalue suggests that we cant reject the null hypothesis even at a very high significance value Beyond simple ANOVA After the Ftest when the means are different various versions of ANOVA can be used further to analyze the factors behind the difference and even their interactions But due to the limitations of space and time we wont cover this Stationarity tests for time series In this section we are going to discuss how to test the stationarity of an autoregression time series First lets understand what a time series is Using SciPy for common hypothesis testing 199 A note on the name of white noise The name of white noise actually comes from white light White light is a mixture of lights of all colors White noise is a mixture of sounds with different frequencies White noise is an important sound because the ideal white noise will have equal power or energy throughout the frequency spectrum Due to the limitation of space we wont go deep into this topic The following code snippet generates a white noise time series You can verify that there is no time dependence between Xt and Xtk The covariance is 0 for arbitrary k nprandomseed2020 pltfigurefigsize106 whitenoise nprandomnormal for in range100 pltxlabelTime step pltylabelValue pltplotwhitenoise The results look as follows You can see that it is stationary Figure 721 The white noise time series Another simple time series is random walk It is defined as the addition of the previous term in a sequence and a white noise term ϵ𝑡𝑡𝑁𝑁0 σ2 You can define X0 to be a constant or another white noise term 𝑋𝑋𝑡𝑡 𝑋𝑋𝑡𝑡1 ϵ𝑡𝑡 200 Statistical Hypothesis Testing Just as for white noise time series the random walk time series has a consistent expectation However the variance is different Because of the addition of independent normally distributed white noises the variance will keep increasing 𝑋𝑋𝑡𝑡 𝑋𝑋0 ϵ1 ϵ2 ϵ𝑡𝑡 Therefore you have the variance expressed as follows 𝑉𝑉𝑉𝑉𝑉𝑉𝑋𝑋𝑡𝑡 𝑉𝑉𝑉𝑉𝑉𝑉ϵ𝑖𝑖 𝑖𝑖 𝑡𝑡σ2 This is a little bit surprising because you might have expected the white noises to cancel each other out because they essentially symmetrical around 0 The white noises do cancel each other out in the mean sense but not in the variance sense The following code snippet uses the same set of random variables to show the differences between white noise time series and random walk time series pltfigurefigsize106 nprandomseed2020 whitenoise nprandomnormal for in range1000 randomwalk npcumsumwhitenoise pltplotwhitenoise label white noise pltplotrandomwalk label standard random walk pltlegend Here I used the cumsum function from numpy to calculate a cumulative sum of a numpy array or list The result looks as follows Figure 722 Stationary white noise and nonstationary random walk Using SciPy for common hypothesis testing 201 Say you took the difference to define a new time series δXt δ𝑋𝑋𝑡𝑡 𝑋𝑋𝑡𝑡𝑋𝑋𝑡𝑡1 ϵ𝑡𝑡 Then the new time series would become 
stationary In general a nonstationary time series can be reduced to a stationary time series by continuously taking differences Now lets talk about the concept of autoregression Autoregression describes the property of a model where future observations can be predicted or modeled with earlier observations plus some noise For example the random walk can be treated as a sequence of observations from a firstorder autoregressive process as shown in the following equation 𝑋𝑋𝑡𝑡 𝑋𝑋𝑡𝑡1 ϵ𝑡𝑡 The observation at timestamp t can be constructed from its onestepback value Xt1 Generally you can define an autoregressive time series with order n as follows where instances of Φi are real numbers 𝑋𝑋𝑡𝑡 ϕ1𝑋𝑋𝑡𝑡1 ϕ2𝑋𝑋𝑡𝑡2 ϕ𝑛𝑛𝑋𝑋𝑡𝑡𝑛𝑛 ϵ𝑡𝑡 Without formal mathematical proof I would like to show you the following results The autoregressive process given previously has a characteristic equation as follows 𝑓𝑓𝑠𝑠 1 ϕ1𝑠𝑠 ϕ2𝑠𝑠2 ϕ𝑛𝑛𝑠𝑠𝑛𝑛 0 In the domain of complex numbers this equation will surely have n roots Here is the theorem about these roots If all the roots have an absolute value larger than 1 then the time series is stationary Let me show that to you with two examples Our random walk model has the following characteristic function fs 1 s 0 which has a root equal to 0 so it is not stationary How about the following modified random walk Lets see 𝑋𝑋𝑡𝑡 08𝑋𝑋𝑡𝑡1 ϵ𝑡𝑡 It has a characteristic function of fs 1 08s 0 which has a root of 125 By our theorem this time series should be stationary 202 Statistical Hypothesis Testing The influence of Xt1 is reduced by a ratio of 08 and this effect will be compounding and fading away The following code snippet uses the exact same data we have for white noise and random walk to demonstrate this fading behavior I picked the first 500 data points so linestyles can be distinguishable for different lines pltfigurefigsize106 nprandomseed2020 whitenoise nprandomnormal for in range500 randomwalkmodified whitenoise0 for i in range1500 randomwalkmodifiedappendrandomwalkmodified108 whitenoisei randomwalk npcumsumwhitenoise pltplotwhitenoise label white noiselinestyle pltplotrandomwalk label standard random walk pltplotrandomwalkmodified label modified random walklinestyle pltlegend The graph looks as follows Figure 723 A comparison of a modified random walk and a standard random walk Lets try a more complicated example Is the time series obeying the following the autoregressive relationship 𝑋𝑋𝑡𝑡 06𝑋𝑋𝑡𝑡1 12𝑋𝑋𝑡𝑡2 ϵ𝑡𝑡 Using SciPy for common hypothesis testing 203 The characteristic equation reads 𝑓𝑓𝑠𝑠 1 06𝑠𝑠 12𝑠𝑠2 It has two roots Both roots are complex numbers with nonzero imaginary parts The roots absolute values are also smaller than 1 on the complex plane The following code snippet plots the two roots on the complex plane You can see that they are just inside the unit circle as shown in Figure 724 for root in nproots12061 pltpolar0npangleroot0absrootmarkero The graph looks as follows Figure 724 A polar plot of roots inside a unit circle You should expect the time series to be nonstationary because both roots have absolute values smaller than 1 Lets take a look at it with the following code snippet pltfigurefigsize106 nprandomseed2020 whitenoise nprandomnormal for in range200 series whitenoise0whitenoise1 for i in range2200 seriesappendseriesi106seriesi212 white noisei pltplotseries label oscillating pltxlabelTime step pltylabelValue pltlegend 204 Statistical Hypothesis Testing The result looks as follows Figure 725 A nonstationary secondorder autoregressive time series example Check the scales on 
the yaxis and you will be surprised The oscillation seems to come from nowhere The exercise of visualizing this time series in the log scale is left to you as an exercise In most cases given a time series the Augmented DickeyFuller ADF unit root test in the statsmodels library can be used to test whether a unit root is present or not The null hypothesis is that there exists a unit root which means the time series is not stationary The following code snippet applies the ADF unit root test on the white noise time series the random walk and the modified random walk You can leave the optional arguments of this function as their defaults from statsmodelstsastattools import adfuller adfullerwhitenoise The result is as follows 13456517599662801 35984059677945306e25 0 199 1 34636447617687436 5 28761761179270766 10 257457158581854 5161905447452475 Using SciPy for common hypothesis testing 205 You need to focus on the first two highlighted values terms in the result the statistic and the Pvalue The dictionary contains the significance levels In this case the Pvalue is very small and we can safely reject the null hypothesis There is no unit root so the time series is stationary For the random walk time series the result of adfuller randomwalkmodified is as follows 14609492394159564 05527332285592418 0 499 1 34435228622952065 5 2867349510566146 10 2569864247011056 13744481241324318 The Pvalue is very large therefore we cant reject the null hypothesis a unit root might exist The time series is not stationary The result for the modified random walk is shown in the following code block The Pvalue is also very small It is a stationary time series 7700113158114325 13463404483644221e11 0 499 1 34435228622952065 5 2867349510566146 10 2569864247011056 13756034107373926 Forcefully applying a test is dangerous How about our wildly jumping time series If you try to use the adfuller function on it you will find a wild statistic and a Pvalue of 0 The ADF test simply fails because the underlying assumptions are violated Because of the limitations of space and the complexity of it I omitted coverage of this You are encouraged to explore the roots of the cause and the mechanism of ADF tests from first principles by yourself 206 Statistical Hypothesis Testing We have covered enough hypothesis tests it is time to move on to AB testing where we will introduce cool concepts such as randomization and blocking Appreciating AB testing with a realworld example In the last section of this chapter lets talk about AB testing Unlike previous topics AB testing is a very general concept AB testing is something of a geeky engineers word for statistical hypothesis testing At the most basic level it simply means a way of finding out which setting or treatment performs better in a singlevariable experiment Most AB testing can be classified as a simple Randomized Controlled Trial RCT What randomized control means will be clear soon Lets take a realworld example a consulting company proposes a new workinghours schedule for a factory claiming that the new schedule will improve the workers efficiency as well as their satisfaction The cost of abruptly shifting the workinghours schedule may be big and the factory does not want the risk involved Therefore the consulting company proposes an AB test Consultants propose selecting two groups of workers group A and group B These groups have controlled variables such as workers wages occupations and so on such that the two groups are as similar as possible in terms of those variables The only 
difference is that one group follows the old workinghours schedule and the other group follows the new schedule After a certain amount of time the consultants measure the efficiency and level of satisfaction through counting outputs handing out quantitative questionnaires or taking other surveys If you are preparing for a data scientist interview the AB test you will likely encounter would be about the users of your website application or product For example landing page optimization is a typical AB test scenario What kind of front page will increase users click rates and conversion The content the UI the loading time and many other factors may influence users behavior Now that we have understood the importance of AB testing lets dive into the details of its steps Conducting an AB test To conduct an AB test you should know the variables that fall into the following three categories The metric This is a dependent variable that you want to measure In an experiment you can choose one or more such variables In the previous example workers efficiency is a metric Appreciating AB testing with a realworld example 207 The control variables These are variables that you can control A control variable is independent For example the font and color scheme of a landing page are controllable You want to find out how such variables influence your metric Other factors These are variables that may influence the metric but you have no direct control over them For example the wages of workers are not under your control The devices that users use to load your landing page are also not under your control However wages surely influence the level of satisfaction of workers and device screen sizes influence users clicking behavior Those factors must be identified and handled properly Lets look at an experiment on landing page optimization Lets say we want to find out about the color schemes influence on users clicking behavior We have two choices a warm color scheme and a cold color scheme We also want group A to have the same size as group B Here is a short list of variables that will influence the users clicking rate You are free to brainstorm more such variables The device the user is using for example mobile versus desktop The time at which the user opens the landing page The browser type and version for example Chrome versus Internet Explorer IE The battery level or WiFi signal level How do we deal with such variables To understand their influence on users click rates we need to first eliminate the influence of other factors so that if there is a difference in click rates we can confidently attribute the difference to the four variables we selected This is why we introduce randomization and blocking Randomization and blocking The most common way to eliminate or minimize the effect of unwanted variables is through blocking and randomization In a completely randomized experiment the individual test case will be assigned a treatmentcontrol variable value randomly In the landing page scenario this means that regardless of the device browser or the time a user opens the page a random choice of a warm color scheme or a cold color scheme is made for the user 208 Statistical Hypothesis Testing Imagine the scenario that the number of participants of the experiment is very large the effect of those unwanted variables would diminish as the sample size approaches infinity This is true because in a completely randomized experiment the larger the sample size is the smaller the effect that randomness has on our choices When 
the sample size is large enough we expect the number of IE users who see the warm color scheme to be close to the number of IE users who see the cold scheme The following computational experiment will give you a better idea of how randomization works I chose three variables in this computational experiment device browser and WiFi signal First lets assume that 60 of the users use mobile 90 of them use Chrome and 80 of them visit the website using a strong WiFi signal We also assume that there are no interactions among those variables for instance we do not assume that Chrome users have a strong preference to stick to a strong WiFi connection The following code snippet will assign a color scheme to a random combination of our three variables def buildsample device mobile if nprandomrandom 06 else desktop browser chrome if nprandomrandom 09 else IE wifi strong if nprandomrandom 08 else weak scheme warm if nprandomrandom 05 else cold return device browser wifi scheme Lets first generate 100 sample points and sort the results by the number of appearances from collections import Counter results buildsample for in range100 counter Counterresults for key in sortedcounter key lambda x counterx printkey counterkey The result looks as follows You can see that some combinations dont show up desktop IE strong warm 1 mobile IE weak cold 1 mobile IE strong cold 2 mobile chrome weak warm 3 mobile chrome weak cold 4 desktop chrome weak warm 4 Appreciating AB testing with a realworld example 209 desktop chrome weak cold 5 desktop IE strong cold 6 desktop chrome strong warm 10 desktop chrome strong cold 19 mobile chrome strong warm 20 mobile chrome strong cold 25 If you check each pair with the same setting for example users who use the mobile Chrome browser with strong WiFi signal have a roughly 5050 chance of getting the cold or warm color scheme landing page Lets try another 10000 samples The only change in the code snippet is changing 100 to 10000 The result looks like this desktop IE weak cold 41 desktop IE weak warm 45 mobile IE weak warm 55 mobile IE weak cold 66 desktop IE strong warm 152 desktop IE strong cold 189 mobile IE strong cold 200 mobile IE strong warm 228 desktop chrome weak cold 359 desktop chrome weak warm 370 mobile chrome weak cold 511 mobile chrome weak warm 578 desktop chrome strong warm 1442 desktop chrome strong cold 1489 mobile chrome strong warm 2115 mobile chrome strong cold 2160 Now you see even with the two most unlikely combinations we have about 30 to 40 data points There is although we tried to mitigate it an imbalance between the highlighted two combinations we have more cold scheme users than warm scheme users This is the benefit that randomization brings to us However this usually comes at a high cost It is not easy to obtain such large data samples in most cases There is also a risk that if the warm color scheme or cold color scheme is very bad for the users conversion rates such a largescale AB test will be regrettable 210 Statistical Hypothesis Testing With a small sample size issues of being unlucky can arise For example it is possible that IE desktop users with weak WiFi signals are all assigned the warm color scheme Given how AB testing is done there is no easy way to reverse such bad luck Blocking on the other hand arranges samples into blocks according to the unwanted variables first then randomly assigns block members to different control variable values Lets look at the landing page optimization example Instead of grouping users after providing them with random 
color schemes we group the users according to the device browser or WiFi signal before making the decision as to which color scheme to show to them Inside the block of desktop IE users we can intervene such that randomly half of them will see the warm color scheme and the other half will see the cold scheme Since all the unwanted variables are the same in each block the effect of the unwanted variables will be limited or homogeneous Further comparisons can also be done across blocks just like for complete randomization You may think of blocking as a kind of restricted randomization We want to utilize the benefit of randomization but we dont want to fall into a trap such as a specific group of candidates purely being associated with one control variable value Another example is that in a clinical trial you dont want complete randomization to lead to all aged people using a placebo which may happen You must force randomization somehow by grouping candidates first Common test statistics So an AB test has given you some data whats next You can do the following for a start Use visualization to demonstrate differences This is also called testing by visualization The deviation of the results can be obtained by running AB tests for several rounds and calculating the variance Apply a statistical hypothesis test Many of the statistical hypothesis tests we have covered can be used For example we have covered ttests a test for testing differences of means between two groups it is indeed one of the most important AB test statistics When the sizes of group A and group B are different or the variances are different we can use Welchs ttest which has the fewest assumptions involved Appreciating AB testing with a realworld example 211 For the clicking behavior of users Fishers exact test is good to use It is based on the binomial distribution I will provide you with an exercise on it in Chapter 13 Exercises and Projects For the work efficiency question we mentioned at the very beginning of this section ANOVA or a ttest can be used For a summary of when to use which hypothesis test here is a good resource httpswwwscribbrcomstatisticsstatisticaltests Tip Try to include both informative visualizations and statistical hypothesis tests in your reports This way you have visual elements to show your results intuitively as well as solid statistical analysis to justify your claims Make sure you blend them coherently to tell a complete story Common mistakes in AB tests In my opinion several common mistakes can lead to misleading AB test data collection or interpretations Firstly a careless AB test may miss important hidden variables For example say you want to randomly select users in the United States to do an AB test and you decide to do the randomization by partitioning the names For example people whose first name starts with the letters AF are grouped into a group those with GP go into another and so on What can go wrong Although this choice seems to be OK there are some pitfalls For example popular American names have changed significantly throughout the years The most popular female names in the 1960s and 1970s are Lisa Mary and Jennifer In the 2000s and 2010s the most popular female names become Emily Isabella Emma and Ava You may think that you are selecting random names but you are actually introducing biases to do with age Also different states have different popular names as well Another common mistake is making decisions too quickly Different from academic research where rigorousness is above all 
managers in the corporate world prefer to jump to conclusions and move on to the next sales goals If you only have half or even onethird of the tested data available you should hold on and wait until all the data is collected 212 Statistical Hypothesis Testing The last mistake is focusing on too many metrics or control variables at the same time It is true that several metrics can depend on common control variables and a metric can depend on several control variables Introducing too many metrics and control variables will include higherorder interactions and make the analysis less robust with low confidence If possible you should avoid tracking too many variables at the same time Higherorder interaction Higherorder interaction refers to the joint effect of three or more independent variables on the dependent variable For example obesity smoking and high blood pressure may contribute to heart issues much more severely if all three of them happen together When people refer to the main effect of something they often mean the effect of one independent variable and the interaction effect refers to the joint effect of two variables Lets summarize what we have learned in this chapter Summary This chapter was an intense one Congratulations on finishing it First we covered the concept of the hypothesis including the basic concepts of hypotheses such as the null hypothesis the alternative hypothesis and the Pvalue I spent quite a bit of time going over example content to ensure that you understood the concept of the Pvalue and significance levels correctly Next we looked at the paradigm of hypothesis testing and used corresponding library functions to do testing on various scenarios We also covered the ANOVA test and testing on time series Toward the end we briefly covered AB testing We demonstrated the idea with a classic click rate example and also pointed out some common mistakes One additional takeaway for this chapter is that in many cases new knowledge is needed to understand how a task is done in unfamiliar fields For example if you were not familiar with time series before reading this chapter now you should know how to use the unit root test to test whether an autoregressive time series is stationary or not Isnt this amazing In the next chapter we will begin our analysis of regression models Section 3 Statistics for Machine Learning Section 3 introduces two statistical learning categories regression and classification Concepts in machine learning are introduced Statistics with respect to learning models are developed and examined Methods such as boosting and bagging are explained This section consists of the following chapters Chapter 8 Statistics for Regression Chapter 9 Statistics for Classification Chapter 10 Statistics for TreeBased Methods Chapter 11 Statistics for Ensemble Methods 8 Statistics for Regression In this chapter we are going to cover one of the most important techniquesand likely the most frequently used technique in data science which is regression Regression in laymans terms is to build or find relationships between variables features or any other entities The word regression originates from the Latin regressus which means a return Usually in a regression problem you have two kinds of variables Independent variables also referred to as features or predictors Dependent variables also known as response variables or outcome variables Our goal is to try to find a relationship between dependent and independent variables Note It is quite helpful to understand word origins or 
how the scientific community chose a name for a concept. It may not help you understand the concept directly, but it will help you memorize the concept more vividly.

Regression can be used to explain phenomena or to predict unknown values. In Chapter 7, Statistical Hypothesis Testing, in the Stationarity tests for time series section, we saw examples of time series data to which regression models generally fit well. If you are predicting the stock price of a company, you can use various independent variables, such as the fundamentals of the company and macroeconomic indexes, to run a regression analysis against the stock price, and then use the regression model you obtained to predict the future stock price of that company, assuming the relationship you found persists. Of course, such simple regression models were used decades ago and likely will not make you rich. In this chapter, you are still going to learn a lot from these classical models, which are the baselines of more sophisticated models. Understanding basic models will grant you the intuition to understand more complicated ones.

The following topics will be covered in this chapter:

- Understanding a simple linear regression model and its rich content
- Connecting the relationship between regression and estimators
- Having hands-on experience with multivariate linear regression and collinearity analysis
- Learning regularization from logistic regression examples

In this chapter, we are going to use real financial data, so prepare to get your hands dirty.

Understanding a simple linear regression model and its rich content

Simple linear regression is the simplest regression model. You only have two variables: a dependent variable, usually denoted by y, and an independent variable, usually denoted by x. The relationship is linear, so the model only contains two parameters. The relationship can be formulated with the following formula, where k is the slope, b is the intercept, and $\epsilon$ is the noise term:

$y = kx + b + \epsilon$

Note: Proportionality is different from linearity. Proportionality implies linearity, and it is a stronger requirement: b must be 0 in the formula. Graphically, linearity means that the relationship between the two variables can be represented as a straight line, but the strict mathematical requirement is additivity and homogeneity. If a relationship function f is linear, then for any inputs x1 and x2 and any scalar k, we must have the following equations: $f(x_1 + x_2) = f(x_1) + f(x_2)$ and $f(kx_1) = kf(x_1)$. (We will run a short numerical check of these two conditions at the end of this introduction.)

Here is the code snippet that utilizes the yfinance library to obtain Netflix's stock price data between 2016 and 2018. You can use pip3 install yfinance to install the library. If you are using Google Colab, use !pip3 install yfinance to run it as a shell command; pay attention to the ! symbol at the beginning.

The following code snippet imports the libraries:

import numpy as np
import matplotlib.pyplot as plt
import random
import yfinance as yf

The following code snippet creates a Ticker instance and retrieves the daily stock price information. The ticker is a symbol for the stock; Netflix's ticker is NFLX:

import yfinance as yf
netflix = yf.Ticker("NFLX")
start = "2016-01-01"
end = "2018-01-01"
df = netflix.history(interval="1d", start=start, end=end)
df

The result is a pandas DataFrame, as shown in the following figure:

Figure 8.1: Historical data for Netflix stock in 2016 and 2017

The next step in our analysis is to get an idea of what the data looks like. The common visualization for a two-variable relationship is a scatter plot.
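Here is the small numerical check of the additivity and homogeneity conditions promised in the note above. It is only an illustrative sketch; the function values and test inputs are arbitrary choices and are not part of the Netflix analysis:

def is_strictly_linear(f, x1=2.0, x2=5.0, k=3.0):
    # Additivity: f(x1 + x2) must equal f(x1) + f(x2)
    additive = abs(f(x1 + x2) - (f(x1) + f(x2))) < 1e-9
    # Homogeneity: f(k * x1) must equal k * f(x1)
    homogeneous = abs(f(k * x1) - k * f(x1)) < 1e-9
    return additive and homogeneous

print(is_strictly_linear(lambda x: 0.16 * x))         # True: proportional, b = 0
print(is_strictly_linear(lambda x: 0.16 * x + 74.8))  # False: a nonzero intercept breaks both conditions

As expected, only the proportional function passes. An affine function y = kx + b with a nonzero b plots as a straight line, yet it is not linear in the strict additivity and homogeneity sense.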
We are not particularly interested in picking the open price or the close price of the stock I will just pick the opening price as the price we are going to run regression against the Date column Date is not a normal column as other columns it is the index of the DataFrame You can use dfindex to access it When you convert a date to numerical values Matplotlib may throw a warning You can use the following instructions to suppress the warning The following code snippet suppresses the warning and plots the data from pandasplotting import registermatplotlibconverters registermatplotlibconverters pltfigurefigsize108 pltscatterdfindex dfOpen Understanding a simple linear regression model and its rich content 219 The result looks as shown in the following figure Figure 82 Scatter plot of Netflix stock price data Note that there are some jumps in the stock prices which may indicate stock price surges driven by good news Also note that the graph scales will significantly change how you perceive the data You are welcome to change the figure size to 103 and you may be less impressed by the performance of the stock price For this time period the stock price of Netflix seems to be linear with respect to time We shall investigate the relationship using our twoparameter simple linear regression model However before that we must do some transformation The first transformation is to convert a sequence of date objects the DataFrame index to a list of integers I redefined two variables x and y which represent the number of days since January 4 2016 and the opening stock price of that day The following code snippet creates two such variables I first created a timedelta object by subtracting the first element in the index January 4 2016 and then converted it to the number of days x dfindex dfindex0daystonumpy y dfOpentonumpy 220 Statistics for Regression Note If you checked the Netflix stock prices in the past 2 years you would surely agree with me that simple linear regression would be likely to fail We will try to use more sophisticated regression models in later chapters on such data Why dont we use standardization The reason is that in simple linear regression the slope k and intercept b when data is at its original scale have meanings For example k is the daily average stock price change Adding one more day to variable x the stock price will change accordingly Such meanings would be lost if we standardized the data Next lets take a look at how to use the SciPy library to perform the simplest linear regression based on least squared error minimization Least squared error linear regression and variance decomposition Lets first run the scipystatslinregress function to gain some intuition and I will then explain linear regression from the perspective of ANOVA specifically variance decomposition The following code snippet runs the regression from scipystats import linregress linregressxy The result looks as follows LinregressResultslope01621439447698934 intercept7483816138860539 rvalue09447803151619397 pvalue6807230675594974e245 stderr0002512657375708363 The result contains the slope and the intercept It also contains an Rvalue a Pvalue and a standard error Based on our knowledge from Chapter 7 Statistical Hypothesis Testing even without knowing the underlined hypothesis test such a small Pvalue tells you that you can reject whatever the null hypothesis is The R value is called the correlation coefficient whose squared value R2 is more wellknown the coefficient of determination Understanding a simple linear 
regression model and its rich content 221 There are two major things that the linregress function offers A correlation coefficient is calculated to quantitatively present the relationship between dependent and independent variables A hypothesis is conducted and a Pvalue is calculated In this section we focus on the calculation of the correlation coefficient and briefly talk about the hypothesis testing at the end Regression uses independent variables to explain dependent variables In the most boring case if the stock price of Netflix is a horizontal line no more explanation from the independent variable is needed The slope k can take value 0 and the intercept b can take the value of the motionless stock price If the relationship between the stock price and the date is perfectly linear then the independent variable fully explains the dependent variable in a linear sense What we want to explain quantitatively is the variance of the dependent variable npvaryleny calculates the sum of squares total SST of the stock prices The result is about 653922 The following code snippet adds the horizontal line that represents the mean of the stock prices and the differences between stock prices and their mean as vertical segments This is equivalent to estimating the stock prices using the mean stock price This is the best we can do with the dependent variable only pltfigurefigsize208 pltscatterx y ymean npmeany plthlinesymean npminx npmaxxcolorr sst 0 for x y in zipxy pltplotxxymeanycolorblacklinestyle sst y ymean2 printsst The total variance intuitively is the summed square of the differences between the stock price of the mean following the following formula You can verify that the SST variable is indeed about 653922 𝑦𝑦𝑖𝑖 𝑦𝑦2 𝑖𝑖 222 Statistics for Regression As you may expect the differences between the stock price and the mean are not symmetrically distributed along time due to the increase in Netflix stock The difference has a name residuals The result looks as shown in the following figure Figure 83 Visualization of SST and residuals If we have a known independent variable x in our case the number of days since the first data point we prefer a sloped line to estimate the stock price rather than the naïve horizontal line now Will the variance change Can we decrease the summed square of residuals Regardless of the nature of the additional independent variable we can first approach this case from a pure errorminimizing perspective I am going to rotate the line around the point npmeanxnpmeany Lets say now we have a slope of 010 The following code snippet recalculates the variance and replots the residuals Note that I used the variable sse sum of squared errors SSE to denote the total squared errors as shown in the following example pltfigurefigsize208 pltscatterx y ymean npmeany xmean npmeanx pltplotnpminxnpmaxx x2y01xmeanymeannpminxx2y01x meanymeannpmaxx colorr sse 0 for x y in zipxy yonline x2y01xmeanymeanx Understanding a simple linear regression model and its rich content 223 pltplotxxyon lineycolorblacklinestyle sse yonline y2 printsse The SSE is about 155964 much smaller than the SST Lets check the plot generated from the preceding code snippet Figure 84 Visualization of SSE and residuals It is visually clear that the differences for the data points shrink in general Is there a minimal value for SSE with respect to the slope k The following code snippet loops through the slope from 0 to 03 and plots it against SSE ymean npmeany xmean npmeanx slopes nplinspace00320 sses 0 for i in rangelenslopes 
for x y in zipxy for i in rangelensses yonline x2yslopesixmeanymeanx ssesi yonline y2 pltfigurefigsize208 pltrcxticklabelsize18 pltrcyticklabelsize18 pltplotslopessses 224 Statistics for Regression The result looks as shown in the following graph Figure 85 Slope versus the SSE This visualization demonstrates the exact idea of Least Square Error LSE When we change the slope the SSE changes and at some point it reaches its minimum In linear regression the sum of the squared error is parabolic which guarantees the existence of such a unique minimum Note The intercept is also an undetermined parameter However the intercept is usually of less interest because it is just a shift along the y axis which doesnt reflect how strongly the independent variable correlates with the dependent variable The following code snippet considers the influence of the intercept To find the minimum with respect to the two parameters we need a 3D plot You are free to skip this code snippet and it wont block you from learning further materials in this chapter Understanding a simple linear regression model and its rich content 225 The following code snippet prepares the data for the visualization def calsseslopeintercept x y sse 0 for x y in zipxy yonline x2yslope0interceptx sse yonline y2 return sse slopes nplinspace1120 intercepts nplinspace20040020 slopes intercepts npmeshgridslopesintercepts sses npzerosinterceptsshape for i in rangessesshape0 for j in rangessesshape1 ssesij calsseslopesijinterceptsijxy The following code snippet plots the 3D surface namely SSE versus slope and intercept from mpltoolkitsmplot3d import Axes3D from matplotlib import cm fig pltfigurefigsize1410 ax figgcaprojection3d axviewinit40 30 axsetxlabelslope axsetylabelintercept axsetzlabelsse pltrcxticklabelsize8 pltrcyticklabelsize8 surf axplotsurfaceslopes intercepts sses cmapcm coolwarm linewidth0 antialiasedTrue figcolorbarsurf shrink05 aspect5 pltshow 226 Statistics for Regression The result looks as shown in the following figure Figure 86 SSE as a function of slope and intercept The combination of the optimal values of slope and intercept gives us the minimal SSE It is a good time to answer a natural question how much of the variance in the independent variable can be attributed to the independent variable The answer is given by the R2 value It is defined as follows 𝑅𝑅2 𝑆𝑆𝑆𝑆𝑆𝑆 𝑆𝑆𝑆𝑆𝑆𝑆 𝑆𝑆𝑆𝑆𝑆𝑆 It can also be defined as shown in the following equation 𝑅𝑅2 𝑆𝑆𝑆𝑆𝑅𝑅 𝑆𝑆𝑆𝑆𝑆𝑆 Understanding a simple linear regression model and its rich content 227 where 𝑆𝑆𝑆𝑆𝑆𝑆 𝑆𝑆𝑆𝑆𝑆𝑆 𝑆𝑆𝑆𝑆𝑆𝑆 is called the sum of squared regression or regression sum of squares Do a thought experiment with me if R2 0 it means we have no error after regression All the changes of our dependent variable can be attributed to the change of the independent variable up to a proportionality coefficient k and a shift value b This is too good to be true in general and it is also the best that simple linear regression can do The slope and intercept given by the linregress function gives us an R2 value of 089094472 The verification of this value is left to you as an exercise In the next section we are going to learn about the limitations of R2 The coefficient of determination R2 is a key indicator of the quality of the regression model If SSR is large it means we captured enough information in the change of the dependent variable with the change of the independent variable If you have a very large R2 and your model is simple the story can end here However beyond simple linear regression sometimes 
R2 can be misleading Lets take multivariable polynomial regression as an example Lets say we have two independent variables x1 and x2 and we are free to use variables such as x1 2 as new predictors The expected expression of the dependent variable y will look like the following 𝑦𝑦 β α11𝑥𝑥1 α21𝑥𝑥2 α12𝑥𝑥1 2 α𝑘𝑘𝑘𝑘𝑥𝑥𝑘𝑘 𝑘𝑘 In the stock price example you can pick an additional independent variable such as the unemployment rate of the United States Although there is little meaning in taking the square of the number of days or the unemployment rate nothing stops you from doing it anyway Note In a simple linear model you often see r2 rather than R2 to indicate the coefficient of determination r2 is only used in the context of simple linear regression R2 will always increase when you add additional 𝛼𝛼𝑘𝑘𝑘𝑘𝑥𝑥𝑘𝑘 𝑘𝑘 terms Given a dataset R2 represents the power of explainability on this dataset You can even regress the stock price on your weight if you measure it daily during that time period and you are going to find a better R2 An increased R2 alone doesnt necessarily indicate a better model 228 Statistics for Regression A large R2 doesnt indicate any causeeffect relationship For example the change of time doesnt drive the stock price of Netflix high as it is not the cause of change of the dependent variable This is a common logic fault and a large R2 just magnifies it in many cases It is always risky to conclude causeeffect relationships without thorough experiments R2 is very sensitive to a single data point For example I created a set of data to demonstrate this point The following code snippet does the job nprandomseed2020 x nplinspace0220 y 3x nprandomnormalsizelenx xnew npappendxnparray0 ynew npappendynparray10 pltscatterxy pltscatter010 linregressxnewynew The plot looks as shown in the following figure Pay attention to the one outlier at the top left Figure 87 Linear data with an outlier Understanding a simple linear regression model and its rich content 229 The R2 value is less than 03 However removing the outlier updates the R2 value to around 088 A small R2 may indicate that you are using the wrong model in the first place Take the following as an example Simple linear regression is not suitable to fit the parabolic data nprandomseed2020 x nplinspace0220 y 4x28x nprandomnormalscale05sizelenx pltscatterxy linregressxy The R2 is less than 001 It is not correct to apply simple linear regression on such a dataset where nonlinearity is obvious You can see the failure of such a dataset where simple linear regression is applied in the following figure This is also why exploratory data analysis should be carried out before building models We discussed related techniques in Chapter 2 Essential Statistics for Data Assessment and Chapter 3 Visualization with Statistical Graphs Figure 88 A dataset where simple linear regression will fail Connecting the relationship between regression and estimators 231 We also obtain the following equation 𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑘𝑘 𝑏𝑏 𝑑𝑑𝑏𝑏 𝑦𝑦𝑖𝑖 𝑘𝑘 𝑥𝑥𝑖𝑖 𝑖𝑖 Note The loglikelihood function only depends on each data point through 𝑦𝑦𝑖𝑖 𝑘𝑘𝑥𝑥𝑖𝑖 𝑏𝑏2 whose sum is exactly SSE Maximizing the loglikelihood is equivalent to minimizing the squared error Now we have two unknowns k and b and two equations We can solve it algebraically Indeed we have already solved the problem graphically through the 3D visualization but it is nice to have an algebraic solution We have the following formula for the algebraic solution 𝑘𝑘 𝑥𝑥𝑖𝑖 𝑥𝑥𝑦𝑦𝑖𝑖 𝑦𝑦 𝑖𝑖 𝑥𝑥𝑖𝑖 𝑥𝑥2 𝑖𝑖 and 𝑏𝑏 𝑦𝑦 𝑘𝑘𝑥𝑥 Now lets calculate the slope and 
intercept for the Netflix data with this formula and check them against the linregress results The following code does the job x dfindex dfindex0daystonumpy y dfOpentonumpy xmean npmeanx ymean npmeany k npsumxxmeanyymeannpsumxxmean2 b ymean k xmean printkb The results are about 016214 and 74838 which agree with the linregress results perfectly A computational approach is not illuminating in the sense of mathematical intuition Next lets try to understand simple linear regression from the estimation perspective Having handson experience with multivariate linear regression and collinearity analysis 233 You might have noticed that the expression does give you k In the linear model rxy connects the standard deviations of dependent and independent variables Due to an inequality restriction rxy can only take values between 1 and 1 and equality is reached when x and y are perfectly correlated negatively or positively The square of rxy gives us R2 rxy can be either positive or negative but R2 doesnt contain the directional information but the strength of explanation The standard error is associated with the estimated value of k Due to space limitations we cant go over the mathematics here However knowing that an estimator is also a random variable with variance is enough to continue this chapter A smaller variance of estimator means it is more robust and stable A corresponding concept is the efficiency of an estimator Two unbiased estimators may have different efficiencies A more efficient estimator has a smaller variance for all possible values of the estimated parameters In general the variance cant be infinitely small The socalled CramérRao lower bound restricts the minimal variance that could be achieved by an unbiased estimator Note I would like to suggest an interesting read on this crossvalidated question which you will find here httpsstatsstackexchangecom questions64195howdoicalculatethevariance oftheolsestimatorbeta0conditionalon Having handson experience with multivariate linear regression and collinearity analysis Simple linear regression is rarely useful because in reality many factors will contribute to certain outcomes We want to increase the complexity of our model to capture more sophisticated onetomany relationships In this section well study multivariate linear regression and collinearity analysis First we want to add more terms into the equation as follows 𝑦𝑦 𝑘𝑘1𝑥𝑥1 𝑘𝑘2𝑥𝑥2 𝑘𝑘𝑛𝑛𝑥𝑥𝑛𝑛 ϵ There is no nonlinear term and there are independent variables that contribute to the dependent variable collectively For example peoples wages can be a dependent variable and their age and number of employment years can be good explanatory independent variables 234 Statistics for Regression Note on multiple regression and multivariate regression You may see interchangeable usage of multiple linear regression and multivariate linear regression Strictly speaking they are different Multiple linear regression means that there are multiple independent variables while multivariate linear regression means the responsedependent variable is a vector multiple which means you must do regression on each element of it I will be using an exam dataset for demonstration purposes in this section The dataset is provided in the official GitHub repository of this book The following code snippet reads the data import pandas as pd exam pdreadcsvexamscsv exam Lets inspect the data Figure 89 Exam data for multivariate linear regression Having handson experience with multivariate linear regression and collinearity analysis 
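The closed-form estimates can be verified on any dataset, not just the Netflix prices. The sketch below uses synthetic stand-in data and also checks the related identity k = r_xy * (s_y / s_x) noted in this section:

import numpy as np
from scipy.stats import linregress, pearsonr

rng = np.random.default_rng(1)
x = np.linspace(0, 100, 200)                             # hypothetical predictor (days)
y = 0.16 * x + 75 + rng.normal(scale=3, size=x.size)     # hypothetical response (prices)

x_mean, y_mean = np.mean(x), np.mean(y)
k = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - k * x_mean

r_xy, _ = pearsonr(x, y)
k_from_r = r_xy * np.std(y, ddof=1) / np.std(x, ddof=1)  # slope as r_xy * s_y / s_x

res = linregress(x, y)
print(k, k_from_r, res.slope)     # all three should agree
print(b, res.intercept)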
235 The dataset contains three insemester exams the independent variable and one final exam the dependent variable Lets first do some exploratory graphing Here I introduce a new kind of plot a violin plot It is like a boxplot but it gives a better idea of how the data is distributed inside the first and third quartiles Dont hesitate to try something new when you are learning with Python First we need to transform our exam DataFrame to the long format such that it only contains two columns the score column and the examname column We covered a similar example for the boxplot in Chapter 2 Essential Statistics for Data Assessment Feel free to review that part The following code snippet does the transformation examindex examindex examlong pdmeltexamidvarsindexvaluevars exam columns1variablevalue examlongcolumns examnamescore I will sample 10 rows with examlongsample10 from the new DataFrame for a peek Figure 810 The exam DataFrame in long format The following code snippet displays the violin plot You will see why it is called a violin plot import seaborn as sns snssetstylewhitegrid pltfigurefigsize86 snsviolinplotxexamname yscore dataexamlong 236 Statistics for Regression The result looks as shown in the following figure Figure 811 Violin plot of the exam scores We see that the score distributions are somewhat alike for the first three exams whereas the final exam has a longer tail Next lets do a set of scatter plots for pairs of exams with the following code snippet fig ax pltsubplots13figsize126 ax0scatterexamEXAM1examFINALcolorgreen ax1scatterexamEXAM2examFINALcolorred ax2scatterexamEXAM3examFINAL ax0setxlabelExam 1 score ax1setxlabelExam 2 score ax2setxlabelExam 3 score ax0setylabelFinal exam score Having handson experience with multivariate linear regression and collinearity analysis 237 The result looks as shown in the following figure Figure 812 Final exams versus the other three exams The linear model seems to be a great choice for our dataset because visually the numbered exam scores are strongly linearly correlated with the final exam score From simple linear regression to multivariate regression the idea is the same we would like to minimize the sum of squared errors Let me use the statsmodels library to run ordinary least square OLS regression The following code snippet does the job import statsmodelsapi as sm X examEXAM1EXAM2EXAM3tonumpy X smaddconstantX y examFINALtonumpy smOLSyXfitsummary The result looks as shown in the following figures Although there is a lot of information here we will be covering only the essential parts 238 Statistics for Regression First the regression result or summary is listed in the following figure Figure 813 OLS regression result Secondly the characteristics of the dataset and the model are also provided in the following figure Figure 814 OLS model characterization Lastly the coefficients and statistics for each predictor feature the numbered exam scores are provided in the following table Figure 815 Coefficients and statistics for each predictor feature First R2 is close to 1 which is a good sign that the regression on independent variables successfully captured almost all the variance of the dependent variable Then we have an adjusted R2 The adjusted R2 is defined as shown in the following equation 𝑅𝑅𝑎𝑎𝑎𝑎𝑎𝑎 2 1 1 𝑅𝑅2𝑛𝑛 1 𝑛𝑛 𝑑𝑑𝑑𝑑 1 Having handson experience with multivariate linear regression and collinearity analysis 239 df is the degree of freedom here it is 3 because we have 3 independent variables and n is the number of points in the data Note 
on the sign of adjusted R2 The adjusted R2 penalizes the performance when adding more independent variables The adjusted R2 can be negative if the original R2 is not large and you try to add meaningless independent variables Collinearity The linear model we just built seems to be good but the warning message says that there are multicollinearity issues From the scatter plot we see that the final exam score seems to be predictable from either of the exams Lets check the correlation coefficients between the exams examEXAM1EXAM2EXAM3corr The result looks as shown in the following figure Figure 816 Correlation between numbered exams EXAM 1 and EXAM 2 have a correlation coefficient of more than 090 and the smallest coefficient is around 085 Strong collinearity between independent variables becomes a problem because it tends to inflate the variance of the estimated regression coefficient This will become clearer if we just regress the final score on EXAM 1 linregressexamEXAM1 examFINAL The result is as follows LinregressResultslope18524548489068682 intercept15621968742401123 rvalue09460708318102032 pvalue9543660489160869e13 stderr013226692073027208 240 Statistics for Regression Note that with a slightly smaller R2 value the standard error of the slope is less than 10 of the estimated value However if you check the output for the threeindependent variables case the standard error for the coefficients of EXAM 1 and EXAM 2 is about onethird and onefifth of the values respectively How so An intuitive and vivid argument is that the model is confused about which independent variable it should pick to attribute the variance to The EXAM 1 score alone explains 94 of the total variance and the EXAM 2 score can explain almost 93 of the total variance too The model can either assign a more deterministic slope to either the EXAM 1 score or EXAM 2 but when they exist simultaneously the model is confused which numerically inflates the standard error of the regression coefficients In some numerical algorithms where randomness plays a role running the same program twice might give different sets of coefficients Sometimes the coefficient can even be negative when you already know it should be a positive value Are their quantitative ways to detect collinearity There are two common methods They are listed here The first one preexamines variables and the second one checks the Variance Inflation Factor VIF You can check the correlation coefficient between pairs of independent variables as we just did in the example A large absolute value for the correlation coefficient is usually a bad sign The second method of calculating the VIF is more systematic and unbiased in general To calculate the VIF of a coefficient we run a regression against its corresponding independent variable xi using the rest of the corresponding variables obtain the R2 and calculate VIFi using the following equation 𝑉𝑉𝑉𝑉𝐹𝐹𝑖𝑖 1 1 𝑅𝑅𝑖𝑖 2 Lets do an example I will use the EXAM 2 score and the EXAM 3 score as dependent variables and the EXAM 1 score as an independent variable X examEXAM2EXAM3tonumpy X smaddconstantX y examEXAM1tonumpy smOLSyXfitrsquared The result is around 0872 Therefore the VIF is about 78 This is already a big value A VIF greater than 10 suggests serious collinearity Learning regularization from logistic regression examples 241 Is collinearity an issue The answer is yes and no It depends on our goals If our goal is to predict the independent variable as accurately as possible then it is not an issue However in most cases we dont 
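Both quantities discussed here are easy to compute by hand. The following sketch defines the adjusted R2 correction and a VIF helper (regress each predictor on the others and take 1/(1 − R2)), using deliberately collinear synthetic predictors as stand-ins for the exam scores; statsmodels also provides a ready-made variance_inflation_factor helper in statsmodels.stats.outliers_influence if you prefer not to roll your own:

import numpy as np
import pandas as pd
import statsmodels.api as sm

def adjusted_r2(r2, n, df):
    # R^2_adj = 1 - (1 - R^2) * (n - 1) / (n - df - 1)
    return 1 - (1 - r2) * (n - 1) / (n - df - 1)

def vif(predictors, column):
    # Regress one predictor on the remaining ones and return 1 / (1 - R^2)
    X = sm.add_constant(predictors.drop(columns=[column]).to_numpy())
    r2 = sm.OLS(predictors[column].to_numpy(), X).fit().rsquared
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
base = rng.normal(70, 10, size=100)          # a shared component creates the collinearity
predictors = pd.DataFrame({
    "EXAM1": base + rng.normal(scale=3, size=100),
    "EXAM2": base + rng.normal(scale=3, size=100),
    "EXAM3": base + rng.normal(scale=5, size=100),
})

for col in predictors.columns:
    print(col, round(vif(predictors, col), 2))   # values above roughly 10 flag serious collinearity
print(adjusted_r2(r2=0.99, n=25, df=3))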
want to carry unnecessary complexity and redundancy in the model There are several ways to get rid of collinearity Some of those are as follows Select independent variables and drop the rest This may lose information Obtain more data More data brings diversity into the model and will reduce the variance Use Principle Component Analysis PCA to transform the independent variables into fewer new variables We will not cover it here because of space limitations The idea is to bundle the variance explainability of independent variables together in a new variable Use lasso regression Lasso regression is regression with regularization of L1 norm In the next section we will see how it is done and what exactly L1norm means Learning regularization from logistic regression examples L1 norm regularization which penalizes the complexity of a model is also called lasso regularization The basic idea of regularization in a linear model is that parameters in a model cant be too large such that too many factors contribute to the predicted outcomes However lasso does one more thing It not only penalizes the magnitude but also the parameters existence We will see how it works soon The name lasso comes from least absolute shrinkage and selection operator It will shrink the values of parameters in a model Because it uses the absolute value form it also helps with selecting explanatory variables We will see how it works soon Lasso regression is just like linear regression but instead of minimizing the sum of squared errors it minimizes the following function The index i loops over all data points where j loops over all coefficients Unlike standard OLS this function no longer has an intuitive graphic representation It is an objective function An objective function is a term from optimization We choose input values to maximize or minimize the value of an objective function y 𝑘𝑘𝑗𝑗𝑥𝑥1𝑖𝑖 𝑘𝑘2𝑥𝑥2𝑖𝑖 𝑘𝑘𝑣𝑣𝑥𝑥𝑣𝑣𝑖𝑖 β 2 i λ 𝑘𝑘𝑗𝑗 𝑗𝑗 242 Statistics for Regression The squared term on the left in the objective function is the OLS sum of squared error The term on the right is the regularization term λ a positive number is called the regularization coefficient It controls the strength of the penalization The regularization term is artificial For example the regression coefficientsslopes share the same coefficient but it is perfectly okay if you assign a different regularization coefficient to different regression coefficients When λ 0 we get back OLS As λ increases more and more the coefficient will shrink and eventually reach 0 If you change the regularization term from 𝜆𝜆 𝑘𝑘𝑗𝑗 𝑗𝑗 to λ kj 2 j just like the OLS term you will get ridge regression Ridge regression also helps control the complexity of a model but it doesnt help with selecting explanatory variables We will compare the effects with examples We will run the lasso regression ridge regression and normal linear regression again with modules from the sklearn library Note It is a good habit to check the same function offered from different libraries so you can compare them meaningfully For example in the sklearn library the objective function is defined such that the sum of squared error is reduced by 1 2𝑛𝑛 If you dont check the document and simply compare results from your own calculation you may end up with confusing conclusions about regularization coefficient choice This is also why in the code that follows the regularization coefficient for the ridge model multiplies 2n The APIs for two models are not consistent in the sklearn library The following code snippet prepares 
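If the two penalized objectives are easier to read as code, here is a small sketch of both; the toy data and the lambda value are arbitrary and only serve to make the expressions concrete:

import numpy as np

def lasso_objective(X, y, coef, intercept, lam):
    # OLS sum of squared errors plus an L1 penalty on the coefficients
    residuals = y - (X @ coef + intercept)
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(coef))

def ridge_objective(X, y, coef, intercept, lam):
    # Same squared-error term, but the penalty is the sum of squared coefficients
    residuals = y - (X @ coef + intercept)
    return np.sum(residuals ** 2) + lam * np.sum(coef ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
coef = np.array([0.4, 0.5, 1.2])
y = X @ coef + rng.normal(scale=0.1, size=50)
print(lasso_objective(X, y, coef, 0.0, lam=1.0))
print(ridge_objective(X, y, coef, 0.0, lam=1.0))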
the data like earlier from sklearn import linearmodel X examEXAM1EXAM2EXAM3tonumpy y examFINALtonumpy In sklearn the regularization coefficient is defined as α so I am going to use α instead of λ First I choose α to be 01 alpha 01 linearregressor linearmodelLinearRegression linearregressorfitXy lassoregressor linearmodelLassoalphaalpha lassoregressorfitXy Learning regularization from logistic regression examples 243 ridgeregressor linearmodelRidgealphaalphaleny2 ridgeregressorfitXy printlinear model coefficient linearregressorcoef printlasso model coefficient lassoregressorcoef printridge model coefficient ridgeregressorcoef The result reads as follows linear model coefficient 035593822 054251876 116744422 lasso model coefficient 035537305 054236992 116735218 ridge model coefficient 03609811 054233219 116116573 Note that there isnt much difference in the values Our regularization term is still too small compared to the sum of squared error term Next I will generate a set of data varying α I will plot the scale of the three coefficients with respect to increasing α linearregressor linearmodelLinearRegression linearregressorfitXy linearcoefficient nparraylinearregressorcoef 20T lassocoefficient ridgecoefficient alphas nplinspace140020 for alpha in alphas lassoregressor linearmodelLassoalphaalpha lassoregressorfitXy ridgeregressor linearmodelRidgealphaalphaleny2 ridgeregressorfitXy lassocoefficientappendlassoregressorcoef ridgecoefficientappendridgeregressorcoef lassocoefficient nparraylassocoefficientT ridgecoefficient nparrayridgecoefficientT 244 Statistics for Regression Note that the T method is very handy it transposes a twodimensional NumPy array The following code snippet plots all the coefficients against the regularization coefficient Note how I use the loc parameter to position the legends pltfigurefigsize128 for i in range3 pltplotalphas linearcoefficienti label linear coefficient formati cr linestylelinewidth6 pltplotalphas lassocoefficienti label lasso coefficient formati c blinestylelinewidth6 pltplotalphas ridgecoefficienti label ridge coefficient formati cglinestylelinewidth6 pltlegendloc0705fontsize14 pltxlabelAlpha pltylabelCoefficient magnitude The result looks as shown in the following figure Note that different line styles indicate different regression models Note Different colors if you are reading a grayscale book check the Jupyter notebook indicate different coefficients Figure 817 Coefficient magnitudes versus the regularization coefficient Learning regularization from logistic regression examples 245 Note that the dotted line doesnt change with respect to the regularization coefficient because it is not regularized The lasso regression coefficients and the ridge regression coefficients start roughly at the same levels of their corresponding multiple linear counterparts The ridge regression coefficients decrease toward roughly the same scale and reach about 02 when α 400 The lasso regression coefficients on the other hand decrease to 0 one by one around α 250 When the coefficient is smaller than 1 the squared value is smaller than the absolute value This is true to the fact that lasso regression coefficients decreasing to 0 doesnt depend on this You can do an experiment by multiplying all independent variables by 01 to amplify the coefficients and you will find similar behavior This is left to you as an exercise So when α is large why does lasso regression tend to penalize the number of coefficients while ridge regression tends to drive coefficients at roughly the 
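Here is a condensed, hedged reconstruction of the alpha sweep described in this section, using synthetic stand-ins for X and y and the same rescaling of the ridge alpha (multiplying by 2n) so that the two sklearn objectives remain comparable:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

rng = np.random.default_rng(0)
X = rng.normal(70, 10, size=(25, 3))                        # hypothetical exam scores
y = X @ np.array([0.36, 0.54, 1.17]) + rng.normal(scale=3, size=25)

alphas = np.linspace(1, 400, 20)
lasso_coefs, ridge_coefs = [], []
for alpha in alphas:
    lasso_coefs.append(linear_model.Lasso(alpha=alpha).fit(X, y).coef_)
    ridge_coefs.append(linear_model.Ridge(alpha=alpha * len(y) * 2).fit(X, y).coef_)

lasso_coefs = np.array(lasso_coefs).T    # one row per coefficient
ridge_coefs = np.array(ridge_coefs).T

for i in range(3):
    plt.plot(alphas, lasso_coefs[i], linestyle="--", label="lasso coefficient {}".format(i))
    plt.plot(alphas, ridge_coefs[i], linestyle="-.", label="ridge coefficient {}".format(i))
plt.xlabel("Alpha")
plt.ylabel("Coefficient magnitude")
plt.legend()
plt.show()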
same magnitude Lets do one last thought experiment to end this chapter Consider the scenario that we have two positive coefficients k1 and k2 where k1 is larger than k2 Under the lasso penalization decreasing either coefficient by a small value δ will decrease the objective function by δ No secret there However in ridge regression decreasing the larger value k1 will always decrease the objective function more as shown in the following equation For k2 you can do the same calculation as the following Δ𝑘𝑘2 𝑘𝑘2 2 𝑘𝑘2 𝛿𝛿2 2𝑘𝑘2𝛿𝛿 𝛿𝛿2 Because k1 is greater than k2 decreasing the larger value benefits the minimization more The ridge regression discourages the elimination of smaller coefficients but prefers decreasing larger coefficients The lasso regression on the other hand is capable of generating a sparse model with fewer coefficients These regularizations especially the ridge regression are particularly useful to handle multicollinearity For readers interested in exploring this further I recommend you check out the corresponding chapter in the classical book Elements of Statistical Learning by Jerome H Friedman Robert Tibshirani and Trevor Hastie Δ𝑘𝑘1 𝑘𝑘1 2 𝑘𝑘1 δ2 2𝑘𝑘1δ δ2 246 Statistics for Regression Summary In this chapter we thoroughly went through basic simple linear regression demystified some core concepts in linear regression and inspected the linear regression model from several perspectives We also studied the problem of collinearity in multiple linear regression and proposed solutions At the end of the chapter we covered two more advanced and widely used regression models lasso regression and ridge regression The concepts introduced in this chapter will be helpful for our future endeavors In the next chapter we are going to study another important family of machine learning algorithms and the statistics behind it classification problems 9 Statistics for Classification In the previous chapter we covered regression problems where correlations in the form of a numerical relationship between independent variables and dependent variables are established Different from regression problems classification problems aim to predict the categorical dependent variable from independent variables For example with the same Netflix stock price data and other potential data we can build a model to use historical data that predicts whether the stock price will rise or fall after a fixed amount of time In this case the dependent variable is binary rise or fall lets ignore the possibility of having the same value for simplicity Therefore this is a typical binary classification problem We will look at similar problems in this chapter In this chapter we will cover the following topics Understanding how a logistic regression classifier works Learning how to evaluate the performance of a classifier Building a naïve Bayesian classification model from scratch Learning the mechanisms of a support vector classifier Applying crossvalidation to avoid classification model overfitting We have a lot of concepts and coding to cover So lets get started 248 Statistics for Classification Understanding how a logistic regression classifier works Although this section name sounds a bit unheard it is correct Logistic regression is indeed a regression model but it is mostly used for classification tasks A classifier is a model that contains sets of rules or formulas sometimes millions or more to perform the classification task In a simple logistic regression classifier we only need one rule built on a single feature to 
perform the classification Logistic regression is very popular in both traditional statistics as well as machine learning The name logistic originates from the name of the function used in logistic regression logistic function Logistic regression is the Generalized Linear Model GLM The GLM is not a single model but an extended group of models of Ordinary Least Squares OLS models Roughly speaking the linear part of the model in GLM is similar to OLS but various kinds of transformation and interpretations are introduced so GLM models can be applied to problems that simple OLS models cant be used for directly You will see what this means in logistic regression in the following section The logistic function and the logit function Its easier to look at the logit function first because it has a more intuitive physical meaning The logit function has another name the logodds function which makes much more sense The standard logit function takes the form log 𝑝𝑝 1 𝑝𝑝 where p is between 0 and 1 indicating the probability of one possibility happening in a binary outcome The logistic function is the inverse of the logit function A standard logistic function takes the form 1 1 𝑒𝑒𝑥𝑥 where x can take a value from ꝏ to ꝏ and the function takes a value between 0 and 1 The task for this section is to predict whether a stock index such as SPX will rise or fall from another index called the fear and greedy index The fear and greedy index is an artificial index that represents the sentiment of the stock market When most people are greedy the index is high and the overall stock index is likely to rise On the other hand when most people are fearful the index is low and the stock index is likely to fall There are various kinds of fear and greedy indexes The one composed by CNN Money is a 100point scale and contains influences from seven other economic and financial indicators 50 represents neutral whereas larger values display the greediness of the market Understanding how a logistic regression classifier works 249 We are not going to use real data though Instead I will use a set of artificial data as shown in the following code snippet As we did in Chapter 7 Statistical Hypothesis Testing I will hide the generation of the data from you until the end of the section The following code snippet creates the scatter plot of the fear and greedy index and the stock index change of the corresponding day pltfigurefigsize106 pltscatterfgindexstockindexchange 0 stockindexchangestockindexchange 0 s200 marker6 labelUp pltscatterfgindexstockindexchange 0 stockindexchangestockindexchange 0 s200 marker7 labelDown plthlines00100labelNeutral line pltxlabelFear Greedy Indexfontsize20 pltylabelStock Index Changefontsize20 pltlegendncol3 The graph looks as in the following figure Figure 91 Fear and greedy index versus stock index change 250 Statistics for Classification This might be suitable for simple linear regression but this time we are not interested in the exact values of stock index change but rather the direction of the market The horizontal neutral line bisects the data into two categories the stock index either goes up or goes down Our goal is to predict whether a stock index will rise or fall given a fear and greedy index value The formulation of a classification problem The goal of the classification task is to predict the binary outcome from a single numerical independent variable If we use the OLS model directly the outcome of a classification is binary but the normal OLS model offers continuous numerical outcomes This 
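The two functions are one-liners in NumPy, and composing them shows that they are inverses of each other. A minimal sketch (the probabilities are arbitrary):

import numpy as np

def logit(p):
    # Log-odds of a probability p strictly between 0 and 1
    return np.log(p / (1 - p))

def logistic(x):
    # Inverse of the logit: maps any real number back into (0, 1)
    return 1 / (1 + np.exp(-x))

p = np.array([0.1, 0.25, 0.5, 0.9])
print(logit(p))              # negative below 0.5, zero at 0.5, positive above
print(logistic(logit(p)))    # recovers the original probabilities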
problem leads to the core of logistic regression Instead of predicting the probability we predict the odds Instead of predicting the hardcut binary outcome we can predict the probability of one of the outcomes We can predict the probability that the stock index will rise as p therefore 1 p becomes the probability that the stock price will fall no matter how small the scale is Notes on the term odds The odds of an event out of possible outcomes is the ratio of the events probability and the rest You might have heard the phrase against all odds which means doing something when the odds of success are slim to none Although probability is limited to 01 the odds can take an arbitrary value from 0 to infinity By applying a shift to the odds we get negative values to suit our needs By running a regression against the odds we have an intermediate dependent variable which is numerical and unbounded However there is one final question How do we choose the parameter of the regression equation We want to maximize our likelihood function Given a set of parameters and corresponding predicted probabilities we want the predictions to maximize our likelihood function In our stock price example it means the data points from the up group have probabilities of being up that are as large as possible and data points from the down group have probabilities of being down that are as large as possible too You can review Chapter 6 Parametric Estimation to refresh your memory of the maximal likelihood estimator Understanding how a logistic regression classifier works 251 Implementing logistic regression from scratch Make sure you understand the chain of logic before we start from a regression line to go over the process Then we talk about how to find the optimal values of this regression lines parameters Due to the limit of space I will omit some code and you can find them in this books official GitHub repository httpsgithub comPacktPublishingEssentialStatisticsforNonSTEMData Analysts The following is the stepbystep implementation and corresponding implementation of logistic regression We used our stock price prediction example We start by predicting the odds as a numerical outcome 1 First l will draw a sloped line to be our first guess of the regression against the odds of the stock index rising 2 Then I will project the corresponding data points on this regressed line 3 Lets look at the results The following graph has two y axes The left axis represents the odds value and the right axis represents the original stock index change I plotted one arrow to indicate how the projection is done The smaller markers as shown in the following figure are for the right axis and the large markers on the inclined line indicate the regressed odds Figure 92 Fear and greedy index versus odds The regression parameters I chose are very simpleno intercept and only a slope of 01 Notice that one of the up data points has smaller odds than one of the down data points 252 Statistics for Classification 4 Now we transform the odds into probability This is where the logistic function comes into play Probability 1 1 𝑒𝑒𝑘𝑘 𝑜𝑜𝑜𝑜𝑜𝑜𝑜𝑜 Note that the k parameter can be absorbed into the slope and intercept so we still have two parameters However since the odds are always positive the probability will only be larger than 1 2 We need to apply a shift to the odds whose value can be observed into the parameter intercept I apply an intercept of 5 and we get the following Figure 93 Fear and greedy index versus shifted odds Note that the shifted values lose 
the meaning of the odds because negative odds dont make sense Next we need to use the logistic function to transform the regressed shifted odds into probability The following code snippet defines a handy logistic function You may also see the name sigmoid function used in other materials which is the same thing The word sigmoid means shaped like the character S The following code block defines the logistic function def logisticx return 1 1 npexpx The following two code snippets plot the shifted odds and transformed probability on the same graph I also defined a new function called calshiftedodds for clarity We plot the odds with the first code snippet def calshiftedoddsval slope intercept return valslope intercept Understanding how a logistic regression classifier works 253 slope intercept 01 5 fig ax1 pltsubplotsfigsize106 shiftedodds calshiftedoddsfgindexslopeintercept ax1scatterfgindexstockindexchange 0 shiftedoddsstockindexchange 0 s200 marker6 label Up ax1scatterfgindexstockindexchange 0 shiftedoddsstockindexchange 0 s200 marker7 label Down ax1plotfgindex shiftedodds linewidth2 cred The following code snippet continues to plot the probability ax2 ax1twinx ax2scatterfgindexstockindexchange 0 logisticshiftedoddsstockindexchange 0 s100 marker6 labelUp ax2scatterfgindexstockindexchange 0 logisticshiftedoddsstockindexchange 0 s100 marker7 labelDown 254 Statistics for Classification ax2plotfggrids logisticcalshiftedoddsfg gridsslopeintercept linewidth4 linestyle cgreen ax1setxlabelFear Greedy Indexfontsize20 ax1setylabelOdds 5fontsize20 ax2setylabelProbability of Going Upfontsize20 pltlegendfontsize20 The result is a nice graph that shows the shifted odds and the transformed probability side by side The dotted line corresponds to the right axis it has an S shape and data points projected onto it are assigned probabilities Figure 94 Transformed probability and shifted odds 5 Now you can pick a threshold of the probability to classify the data points For example a natural choice is 05 Check out the following graph where I use circles to mark out the up data points I am going to call those points positive The term comes from clinical testing where clinical experiments are done Understanding how a logistic regression classifier works 255 Figure 95 Threshold and positive data points As you can see there is a negative data point that was misclassified as a positive data point which means we misclassified a day that the stock index is going down as a day that the stock index goes up If you buy a stock index on that day you are going to lose money Positive and negative The term positive is relative as it depends on the problem In general when something interesting or significant happens we call it positive For example if radar detects an incoming airplane it is a positive event if you test positive for a virus it means you carry the virus Since our threshold is a linear line we cant reach perfect classification The following classifier with threshold 08 gets the misclassified negative data point right but wont fix the misclassified positive data point below it Figure 96 Threshold 08 256 Statistics for Classification Which one is better In this section we have converted a regression model into a classification model However we dont have the metrics to evaluate the performances of different choices of threshold In the next section lets examine the performance of the logistic regression classifier Evaluating the performance of the logistic regression classifier In this section we will approach the 
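Putting the pieces of this walkthrough together, here is a hedged sketch of the full scoring pipeline with made-up fear-and-greed readings; the slope of 0.1 and intercept of -5 are the hand-picked values used in the walkthrough, and the threshold can be varied:

import numpy as np

def logistic(x):
    return 1 / (1 + np.exp(-x))

def cal_shifted_odds(val, slope, intercept):
    # Linear score used as the (shifted) odds before the logistic transform
    return val * slope + intercept

fg_index = np.array([10, 25, 40, 55, 70, 90])                  # hypothetical index readings
went_up = np.array([False, False, False, True, True, True])    # hypothetical ground truth

probs = logistic(cal_shifted_odds(fg_index, slope=0.1, intercept=-5))
predicted_up = probs >= 0.5                                    # classification threshold

print(probs.round(3))
print(predicted_up)
print(went_up)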
evaluation of our logistic regression classifiers in two ways The first way is to use the socalled confusion matrix and the second way is to use the F1 score To introduce the F1 score we also need to introduce several other metrics as well which will all be covered in this section Lets see what a confusion matrix looks like and define some terms For that lets take an example The following table is a 2by2 confusion matrix for the threshold 05 case The 2 in the topleft cell means that there are two positive cases that we successfully classify as positive Therefore it is called True Positive TP Correspondingly the 1 in the bottomleft cell means that one positive case was misclassified as False Negative FN We also have True Negative TN and False Positive FP by similar definition Note A perfect classifier will have false positive and false negative being 0 The false positive error is also called a type 1 error and the false negative error is also called a type 2 error As an example if a doctor is going to claim a man is pregnant it is a false positive error if the doctor says a laboring woman is not pregnant it is a false negative error In addition the recall or sensitivity or true positive rate is defined as follows 𝑇𝑇𝑇𝑇 𝑇𝑇𝑇𝑇 𝐹𝐹𝐹𝐹 Understanding how a logistic regression classifier works 257 This means the ability of the classifier to correctly identify the positive ones from all the ground truthpositive examples In our stock index example if we set the threshold to 0 we reach a sensitivity of 1 because indeed we pick out all the positive ones The allpositive classifier is n ot acceptable though On the other hand the precision or positive predictive value is defined as follows 𝑇𝑇𝑇𝑇 𝑇𝑇𝑇𝑇 𝐹𝐹𝑇𝑇 This means among all the claimed positive results how many of them are indeed positive In our stock index example setting the threshold to 1 will reach a precision of 1 because there wont be any false positive if there are no positive predictions at all A balance must be made The F1 score is the balance it is the harmonic mean of the precision and recall The harmonic mean of a and b is defined as 2𝑎𝑎𝑎𝑎 𝑎𝑎 𝑎𝑎 Notes on the harmonic mean If two values are different the harmonic mean is smaller than the geometric mean which is smaller than the most common arithmetic mean We can calculate the metrics for the preceding confusion matrix Recall and precision are both 2 3 2 2 1 The F1 score is therefore also 2 3 If we pick a threshold of 08 the confusion matrix will look as follows Then the recall will still be 2 3 but the precision will be 2 2 We reach a higher F1 score 08 We obtained a better result by simply changing the threshold However changing the threshold doesnt change the logistic function To evaluate the model itself we need to maximize the likelihood of our observation based on the regressed model Lets take a look at the regressed probabilities with code logisticcalshiftedoddsfgindexslopeintercept F1 2 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 𝑝𝑝𝑝𝑝𝑝𝑝𝑟𝑟𝑟𝑟𝑟𝑟 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 𝑝𝑝𝑝𝑝𝑝𝑝𝑟𝑟𝑟𝑟𝑟𝑟 258 Statistics for Classification The result looks as follows array000669285 004742587 026894142 073105858 095257413 099330715 Here I made the probabilities corresponding to positive data points bold In this case the likelihood function can be defined as follows Here index i loops through the positive indexes and index j loops through the negative indexes Since we regress against the probabilities that the stock index will go up we want the negative data points probabilities to be small which gives us the form 1 𝑃𝑃𝑗𝑗 In practice we often 
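The two confusion matrices discussed in this section can be checked with a few lines of arithmetic:

def precision_recall_f1(tp, fp, fn):
    # Precision, recall, and their harmonic mean (the F1 score)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Threshold 0.5: two true positives, one false positive, one false negative
print(precision_recall_f1(tp=2, fp=1, fn=1))    # precision, recall, and F1 are all 2/3

# Threshold 0.8: the false positive disappears but the false negative remains
print(precision_recall_f1(tp=2, fp=0, fn=1))    # precision 1.0, recall 2/3, F1 0.8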
calculate the log likelihood Summation is easier to handle than multiplication numerically logL𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 𝑖𝑖𝑖𝑖𝑖𝑖𝑠𝑠𝑖𝑖𝑖𝑖𝑠𝑠𝑠𝑠𝑖𝑖 𝑠𝑠𝑠𝑠𝑙𝑙𝑃𝑃𝑖𝑖 𝑖𝑖356 𝑠𝑠𝑠𝑠𝑙𝑙1 𝑃𝑃𝑗𝑗 𝑗𝑗124 Lets calculate our log likelihood function for slope 01 and intercept 5 with the following code snippet npprodprobsstockindexchange0npprod1probsstock indexchange0 The result is about 0065 Lets try another set of parameters with the following code snippet probs logisticcalshiftedoddsfgindex slope011intercept55 npprodprobsstockindexchange0npprod1probsstock indexchange0 The result is about 0058 Our original choice set of parameters is actually better Lslopeintercept Pi 1 Pj j124 i356 Building a naïve Bayes classifier from scratch 259 To find the parameters that maximize the likelihood function exactly lets use the sklearn library The following code snippet fits a regressor on our data points from sklearnlinearmodel import LogisticRegression regressor LogisticRegressionpenaltynone solvernewtoncgfitfgindex reshape11 stockindexchange0 printslope regressorcoef00 printintercept regressorintercept0 The best possible slope and intercept are about 006 and 304 respectively You can verify that this is true by plotting the likelihood function value against a grid of slopes and intercepts I will leave the calculation to you as an exercise Note on the LogisticRegression function Note that I explicitly set the penalty to none in the initialization of the LogisticRegression instance By default sklearn will set an L2 penalty term and use another solver a solver is a numerical algorithm to find the maximum that doesnt support the nopenalty setting I have to change these two arguments to make it match our approach in this section The newtoncg solver uses the Newton conjugate gradient algorithm If you are interested in finding out more about this you can refer to a numerical mathematics textbook The last thing I would like you to pay attention to is the reshaping of the input data to comply with the API Building a naïve Bayes classifier from scratch In this section we will study one of the most classic and important classification algorithms the naïve Bayes classification We covered Bayes theorem in previous chapters several times but now is a good time to revisit its form Suppose A and B are two random events the following relationship holds as long as PB 0 P𝐴𝐴𝐵𝐵 𝑃𝑃𝐵𝐵𝐴𝐴𝑃𝑃𝐴𝐴 𝑃𝑃𝐵𝐵 Some terminologies to review PAB is called the posterior probability as it is the probability of event A after knowing the outcome of event B PA on another hand is called the prior probability because it contains no information about event B 260 Statistics for Classification Simply put the idea of the Bayes classifier is to set the classification category variable as our A and the features there can be many of them as our B We predict the classification results as posterior probabilities Then why the naïve Bayes classifier The naïve Bayes classifier assumes that different features are mutually independent and Bayes theorem can be applied to them independently This is a very strong assumption and likely incorrect For example to predict whether someone has a risk of stroke or obesity problems or predicting their smoking habits and diet habits are all valid However they are not independent The naïve Bayes classifier assumes they are independent Surprisingly the simplest setting works well on many occasions such as detecting spam emails Note Features can be discrete or continuous We will only cover a discrete version example Continuous features can be naively assumed to have a 
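Here is a hedged sketch of the two steps above: evaluating the likelihood for a candidate slope and intercept, then letting sklearn find the maximum-likelihood pair with the penalty switched off. The fear-and-greed values are made up, and note that recent sklearn versions expect penalty=None rather than the string "none":

import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic(x):
    return 1 / (1 + np.exp(-x))

def likelihood(fg_index, went_up, slope, intercept):
    # Product of P(up) over the up days and (1 - P(up)) over the down days
    probs = logistic(fg_index * slope + intercept)
    return np.prod(probs[went_up]) * np.prod(1 - probs[~went_up])

fg_index = np.array([10.0, 25.0, 40.0, 55.0, 70.0, 90.0])       # hypothetical readings
went_up = np.array([False, False, False, True, True, True])

print(likelihood(fg_index, went_up, slope=0.1, intercept=-5))

clf = LogisticRegression(penalty=None, solver="newton-cg")      # unpenalized maximum likelihood
clf.fit(fg_index.reshape(-1, 1), went_up.astype(int))
print(clf.coef_[0][0], clf.intercept_[0])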
Gaussian distribution I created a set of sample data as shown here which you can find in the books official GitHub repository Each row represents a set of information about a person The weight feature has three levels the highoildiet feature also has three and the smoking feature has two levels Our goal is to predict strokerisk which has three levels The following table shows the profile of 15 patients Figure 97 Stroke risk data Building a naïve Bayes classifier from scratch 261 Lets start with the first feature weight Lets calculate P𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑟𝑟𝑠𝑠𝑠𝑠𝑤𝑤𝑠𝑠𝑟𝑟𝑤𝑤ℎ𝑠𝑠 According to Bayes theorem we have the following P𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑒𝑒𝑟𝑟𝑖𝑖𝑠𝑠𝑠𝑠𝑤𝑤𝑒𝑒𝑖𝑖𝑤𝑤ℎ𝑠𝑠 𝑃𝑃𝑤𝑤𝑒𝑒𝑖𝑖𝑤𝑤ℎ𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑒𝑒𝑟𝑟𝑖𝑖𝑠𝑠𝑠𝑠𝑃𝑃𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑒𝑒𝑠𝑠𝑖𝑖𝑠𝑠𝑠𝑠 𝑃𝑃𝑤𝑤𝑒𝑒𝑖𝑖𝑤𝑤ℎ𝑠𝑠 Lets calculate the prior probabilities first since they will be used again and again P𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑟𝑟𝑠𝑠𝑠𝑠 𝑙𝑙𝑠𝑠𝑙𝑙 8 15 P𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑟𝑟𝑠𝑠𝑠𝑠 𝑚𝑚𝑟𝑟𝑚𝑚𝑚𝑚𝑚𝑚𝑠𝑠 3 15 P𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑟𝑟𝑠𝑠𝑠𝑠 ℎ𝑟𝑟𝑖𝑖ℎ 4 15 For the weight feature we have the following P𝑤𝑤𝑤𝑤𝑤𝑤𝑤𝑤ℎ𝑡𝑡 𝑙𝑙𝑙𝑙𝑤𝑤 5 15 P𝑤𝑤𝑤𝑤𝑤𝑤𝑤𝑤ℎ𝑡𝑡 𝑚𝑚𝑤𝑤𝑖𝑖𝑖𝑖𝑖𝑖𝑤𝑤 7 15 P𝑤𝑤𝑤𝑤𝑤𝑤𝑤𝑤ℎ𝑡𝑡 ℎ𝑤𝑤𝑤𝑤ℎ 3 15 Now lets find the 3by3 matrix of the conditional probability of P𝑤𝑤𝑤𝑤𝑤𝑤𝑤𝑤ℎ𝑡𝑡𝑠𝑠𝑡𝑡𝑠𝑠𝑠𝑠𝑠𝑠𝑤𝑤𝑟𝑟𝑤𝑤𝑠𝑠𝑠𝑠 The column index is for the stroke risk and the row index is for the weight The numbers in the cells are the conditional probabilities To understand the table with an example the first 0 in the last row means that P𝑤𝑤𝑤𝑤𝑤𝑤𝑤𝑤ℎ𝑡𝑡 𝑙𝑙𝑙𝑙𝑤𝑤𝑠𝑠𝑡𝑡𝑠𝑠𝑙𝑙𝑠𝑠𝑤𝑤𝑟𝑟𝑤𝑤𝑠𝑠𝑠𝑠 ℎ𝑤𝑤𝑤𝑤ℎ 0 If you count the table you will find that among the four highrisk persons none of them have a low weight 262 Statistics for Classification Lets do the same thing for the other two features The last one is for smoking which is also binary OK too many numbers we will use Python to calculate them later in this chapter but for now lets look at an example What is the best stroke risk prediction if a person has middling weight a highoil diet but no smoking habit We need to determine which of the following values is the highest To simplify the expression I will use abbreviations to represent the quantities For example st stands for strokerisk and oil stands for a highoil diet 𝑃𝑃𝑠𝑠𝑠𝑠 ℎ𝑖𝑖𝑖𝑖ℎ𝑤𝑤 𝑚𝑚𝑖𝑖𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚 𝑜𝑜𝑖𝑖𝑚𝑚 𝑦𝑦𝑚𝑚𝑠𝑠 𝑠𝑠𝑚𝑚 𝑛𝑛𝑜𝑜 or 𝑃𝑃𝑠𝑠𝑠𝑠 𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑤𝑤 𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚 𝑜𝑜𝑚𝑚𝑚𝑚 𝑦𝑦𝑚𝑚𝑠𝑠 𝑠𝑠𝑚𝑚 𝑛𝑛𝑜𝑜 or 𝑃𝑃𝑠𝑠𝑠𝑠 𝑙𝑙𝑙𝑙𝑙𝑙𝑙𝑙 𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑙𝑙𝑚𝑚 𝑙𝑙𝑚𝑚𝑙𝑙 𝑦𝑦𝑚𝑚𝑠𝑠 𝑠𝑠𝑚𝑚 𝑛𝑛𝑙𝑙 In the following example I will use the high stroke risk case With Bayes theorem we have the following P𝑠𝑠𝑠𝑠 ℎ𝑖𝑖𝑖𝑖ℎ𝑤𝑤 𝑚𝑚𝑖𝑖𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚 𝑜𝑜𝑖𝑖𝑚𝑚 𝑦𝑦𝑚𝑚𝑠𝑠 𝑠𝑠𝑚𝑚 𝑛𝑛𝑜𝑜 Pst ℎ𝑖𝑖𝑖𝑖ℎ𝑃𝑃𝑤𝑤 𝑚𝑚𝑖𝑖𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚 𝑜𝑜𝑖𝑖𝑚𝑚 𝑦𝑦𝑚𝑚𝑠𝑠 𝑠𝑠𝑚𝑚 𝑛𝑛𝑜𝑜𝑠𝑠𝑠𝑠 ℎ𝑖𝑖𝑖𝑖ℎ 𝑃𝑃𝑤𝑤 𝑚𝑚𝑖𝑖𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚 𝑜𝑜𝑖𝑖𝑚𝑚 𝑦𝑦𝑚𝑚𝑠𝑠 𝑠𝑠𝑚𝑚 𝑛𝑛𝑜𝑜 Here is an interesting discovery To get comparable quantitative values we only care about the numerator because the denominator is the same for all classes The numerator is nothing but the joint probability of both the features and the category variable Building a naïve Bayes classifier from scratch 263 Next we use the assumption of independent features to decompose the numerator as follows P𝑠𝑠𝑠𝑠 ℎ𝑖𝑖𝑖𝑖ℎ𝑤𝑤 𝑚𝑚𝑖𝑖𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚 𝑜𝑜𝑖𝑖𝑚𝑚 𝑦𝑦𝑚𝑚𝑠𝑠 𝑠𝑠𝑚𝑚 𝑛𝑛𝑜𝑜 𝑃𝑃𝑠𝑠𝑠𝑠 ℎ𝑖𝑖𝑖𝑖ℎP𝑤𝑤 𝑚𝑚𝑖𝑖𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚 𝑜𝑜𝑖𝑖𝑚𝑚 𝑦𝑦𝑚𝑚𝑠𝑠 𝑠𝑠𝑚𝑚 𝑛𝑛𝑜𝑜𝑠𝑠𝑠𝑠 ℎ𝑖𝑖𝑖𝑖ℎ We can further reduce the expression as follows Thus the comparison of poster distributions boils down to the comparison of the preceding expression Note It is always necessary to check the rules with intuition to check whether they make sense The preceding expression says that we should consider the prior probability and the conditional probabilities of specific feature values Lets get some real numbers For the strokerisk high case the expression gives us the following The terms are in order You can check the preceding 
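The prior and conditional tables are quick to build with pandas. The sketch below uses a small hypothetical sample in the same shape as the stroke-risk table; the real 15-row table ships with the book's repository:

import pandas as pd

df = pd.DataFrame({
    "weight":      ["high", "middle", "low", "high", "middle", "low", "middle", "high"],
    "smoking":     ["yes",  "yes",    "no",  "yes",  "no",     "no",  "yes",    "no"],
    "stroke_risk": ["high", "middle", "low", "high", "low",    "low", "middle", "middle"],
})

# Prior probabilities P(stroke_risk)
print(df["stroke_risk"].value_counts(normalize=True))

# Conditional probabilities P(weight | stroke_risk): each row sums to 1
print(pd.crosstab(df["stroke_risk"], df["weight"], normalize="index"))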
tables to verify them 4 15 1 4 2 4 0 4 0 The good habit of not smoking eliminates the possibility that this person has a high risk of getting a stroke How about strokerisk middle The expression is as follows 7 15 2 3 2 3 2 3 0138 Note that this value is only meaningful when comparing it with other options since we omitted the denominator in the posterior probabilitys expression earlier How about strokerisk low The expression is as follows 8 15 4 8 2 8 6 8 005 P𝑠𝑠𝑠𝑠 ℎ𝑖𝑖𝑖𝑖ℎP𝑤𝑤 𝑚𝑚𝑖𝑖𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑠𝑠𝑠𝑠 ℎ𝑖𝑖𝑖𝑖ℎP𝑜𝑜𝑖𝑖𝑚𝑚 𝑦𝑦𝑚𝑚𝑠𝑠𝑠𝑠𝑠𝑠 ℎ𝑖𝑖𝑖𝑖ℎP𝑠𝑠𝑚𝑚 𝑛𝑛𝑜𝑜𝑠𝑠𝑠𝑠 ℎ𝑖𝑖𝑖𝑖ℎ 264 Statistics for Classification The probabilities can therefore be normalized to a unit Therefore according to our Bayes classifier the person does not have a high risk of getting a stroke but has a middle or low stroke risk with a ratio of 3 to 1 after normalizing the probability Next lets write code to automate this The following code snippet builds the required prior probability for the category variable and the conditional probability for the features It takes a pandas DataFrame and corresponding column names as input def buildprobabilitiesdffeaturecolumnslist category variablestr priorprobability Counterdfcategoryvariable conditionalprobabilities for key in priorprobability conditionalprobabilitieskey for feature in featurecolumns featurekinds setnpuniquedffeature featuredict Counterdfdfcategory variablekeyfeature for possiblefeature in featurekinds if possiblefeature not in featuredict featuredictpossiblefeature 0 total sumfeaturedictvalues for featurelevel in featuredict featuredictfeaturelevel total conditionalprobabilitieskey feature feature dict return priorprobability conditionalprobabilities Building a naïve Bayes classifier from scratch 265 Lets see what we get by calling this function on our stroke risk dataset with the following code snippet priorprob conditionalprob buildprobabilitiesstrokerisk featurecolumnsweighthighoildietsmokingcategory variablestrokerisk I used the pprint module to print the conditional probabilities as shown from pprint import pprint pprintconditionalprob The result is as follows high highoildiet Counteryes 05 no 05 smoking Counteryes 10 no 00 weight Counterhigh 075 middle 025 low 00 low highoildiet Counterno 075 yes 025 smoking Counterno 075 yes 025 weight Counterlow 05 middle 05 high 00 middle highoildiet Counteryes 06666666666666666 no 03333333333333333 smoking Counterno 06666666666666666 yes 03333333333333333 weight Countermiddle 06666666666666666 low 03333333333333333 high 00 I highlighted a number the way to interpret 075 is by reading the dictionary keys as the event we are conditioned on and the event itself You can verify that this does agree with our previous table counting It corresponds to the following conditional probability expression 𝑃𝑃𝑤𝑤𝑤𝑤𝑤𝑤𝑤𝑤ℎ𝑡𝑡 ℎ𝑤𝑤𝑤𝑤ℎ𝑠𝑠𝑡𝑡𝑠𝑠𝑠𝑠𝑠𝑠𝑤𝑤𝑠𝑠𝑤𝑤𝑠𝑠𝑠𝑠 ℎ𝑤𝑤𝑤𝑤ℎ 266 Statistics for Classification Next lets write another function to make the predictions displayed in the following code block def predictpriorprob conditionalprob featurevaluesdict probs total sumpriorprobvalues for key in priorprob probskey priorprobkeytotal for key in probs posteriordict conditionalprobkey for featurename featurelevel in featurevalues items probskey posteriordictfeaturenamefeature level total sumprobsvalues if total 0 printUndetermined else for key in probs probskey total return probs Note that it is totally possible that the probabilities are all 0 in the naïve Bayes classifier This is usually due to an illposed dataset or an insufficient dataset I will show you a couple of examples to 
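The three unnormalized posterior scores worked out above take one line each to verify:

# prior * P(weight=middle | class) * P(high-oil diet=yes | class) * P(smoking=no | class)
score_high = 4/15 * 1/4 * 2/4 * 0          # the zero smoking term rules this class out
score_middle = 7/15 * 2/3 * 2/3 * 2/3      # roughly 0.138
score_low = 8/15 * 4/8 * 2/8 * 6/8         # 0.05

total = score_high + score_middle + score_low
for name, score in [("high", score_high), ("middle", score_middle), ("low", score_low)]:
    print(name, round(score, 3), round(score / total, 3))   # middle to low is roughly 3 to 1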
demonstrate this The first example is as follows predictpriorprobconditional probweightmiddlehighoil dietnosmokingyes The result is shown next which indicates that the person is probably in the lowrisk group low 05094339622641509 high 033962264150943394 middle 015094339622641506 Underfitting overfitting and crossvalidation 267 The second example is as follows predictpriorprob conditional probweighthighhighoil dietnosmokingno The result is undetermined If you check the conditional probabilities you will find that the contradiction in the features and the insufficiency of the dataset lead to all zeros in the posterior probabilities This is left to you as an exercise In the next section lets look at another important concept in machine learning especially classification tasks crossvalidation Underfitting overfitting and crossvalidation What is crossvalidation and why is it needed To talk about crossvalidation we must formally introduce two other important concepts first underfitting and overfitting In order to obtain a good model for either a regression problem or a classification problem we must fit the model with the data The fitting process is usually referred to as training In the training process the model captures characteristics of the data establishes numerical rules and applies formulas or expressions Note The training process is used to establish a mapping between the data and the output classification regression we want For example when a baby learns how to distinguish an apple and a lemon they may learn how to associate the colors of those fruits with the taste Therefore they will make the right decision to grab a sweet red apple rather than a sour yellow lemon Everything we have discussed so far is about the training technique On the other hand putting a model into a real job is called testing Here is a little ambiguity that people often use carelessly In principle we should have no expectation of the models output on the testing dataset because that is the goal of the model we need the model to predict or generate results on the testing set However you may also hear the term testing set in a training process Here the word testing actually means an evaluation process of the trained model Strictly speaking a testing set is reserved for testing after the model is built In this case a model is trained on a training set then applied on a socalled testing set which we know the ground truth is to get a benchmark of the models performance So be aware of the two meanings of testing 268 Statistics for Classification In the following content I will refer to testing in the training process For example if the baby we mentioned previously learned that red means sweetness say one day the baby sees a red pepper for the first time and thinks it is sweet what will happen The babys colortosweetness model will likely fail the testing on the testing set a red pepper What the baby learned is an overfitted model An overfitted model learns too much about the characteristics of the training data for the baby it is the apple such that it cannot be generalized to unseen data easily How about an underfitted model An underfitted model can be constructed this way If the baby learns another feature that density is also a factor to indicate whether a fruit vegetable is sweet or not the baby may likely avoid the red pepper Compared to this model involving the fruits density the babys simple coloronly model is underfitting An underfitted model doesnt learn enough from the training data It can be improved to 
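Here is a hedged, self-contained reconstruction of the predict helper, with small hand-made tables in the same nested-dictionary shape that build_probabilities returns; the numbers are taken from the conditional tables printed earlier, trimmed to two features for brevity:

def predict(prior_prob, conditional_prob, feature_values):
    # Naive Bayes scoring: prior times the product of per-feature conditionals, then normalize
    total_prior = sum(prior_prob.values())
    probs = {label: count / total_prior for label, count in prior_prob.items()}
    for label in probs:
        for feature_name, feature_level in feature_values.items():
            probs[label] *= conditional_prob[label][feature_name].get(feature_level, 0.0)
    total = sum(probs.values())
    if total == 0:
        print("Undetermined")
        return probs
    return {label: p / total for label, p in probs.items()}

prior_prob = {"low": 8, "middle": 3, "high": 4}
conditional_prob = {
    "low":    {"weight": {"low": 0.5, "middle": 0.5, "high": 0.0},
               "smoking": {"no": 0.75, "yes": 0.25}},
    "middle": {"weight": {"low": 1/3, "middle": 2/3, "high": 0.0},
               "smoking": {"no": 2/3, "yes": 1/3}},
    "high":   {"weight": {"low": 0.0, "middle": 0.25, "high": 0.75},
               "smoking": {"no": 0.0, "yes": 1.0}},
}
print(predict(prior_prob, conditional_prob, {"weight": "middle", "smoking": "yes"}))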
perform better on the training data without potentially damaging its generalization capacity As you may have guessed overfitting and underfitting are two cases that may not have clear boundaries Here is a vivid example Suppose we have the following data and we would like to have a polynomial regression model to fit it A polynomial regression model uses a polynomial rather than a linear line to fit the data The degree of the polynomial is a parameter we should choose for the model Lets see which one we should choose The following code snippet plots the artificial data pltfigurefigsize106 xcoor 1234567 ycoor 385710915 pltscatterxcoorycoor The result looks as follows Figure 98 Artificial data for polynomial fitting Underfitting overfitting and crossvalidation 269 Now let me use 1st 3rd and 5thorder polynomials to fit the data points The two functions I used are numpypolyfit and numpypolyval The following code snippet plots the graph styles pltfigurefigsize106 x nplinspace1720 for idx degree in enumeraterange162 coef nppolyfitxcoorycoordegree y nppolyvalcoefx pltplotxy linewidth4 linestylestylesidx labeldegree formatstrdegree pltscatterxcoorycoor s400 labelOriginal Data markero pltlegend The result looks as in the following figure Note that I made the original data points exceptionally large Figure 99 Polynomial fitting of artificial data points 270 Statistics for Classification Well it looks like the highdegree fitting almost overlaps with every data point and the linear regression line passes in between However with a degree of 5 the polynomial in principle can fit any five points in the plane and we merely have seven points This is clearly overfitting Lets enlarge our vision a little bit In the next figure I slightly modified the range of the x variable from 17 to 08 Lets see what happens The modification is easy so the code is omitted Figure 910 Fitting polynomials in an extended range Wow See the penalty we pay to fit our training data The higherorder polynomial just goes wild Imagine if we have a testing data point between 0 and 1 what a counterintuitive result we will get The question is how do we prevent overfitting We have already seen one tool regularization The other tool is called crossvalidation Crossvalidating requires another dataset called the validation set to validate the model before the model is applied to the testing set Crossvalidation can help to reduce the overfitting and reduce bias in the model learned in the training set early on For example the most common kfold crossvalidation splits the training set into k parts and leaves one part out of the training set to be the validation set After the training is done that validation set is used to evaluate the performance of the model The same procedure is iterated k times Bias can be detected early on if the model learned too much from the limited training set Note Crossvalidation can also be used to select parameters of the model Underfitting overfitting and crossvalidation 271 Some sklearn classifiers have crossvalidation built into the model classes Since we have reached the end of the chapter lets look at a logistic regression crossvalidation example in sklearn Here I am going to use the stroke risk data for logistic regression crossvalidation I am going to convert some categorical variables into numerical variables Recall that this is usually a bad practice as we discussed in Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing However it is doable here for a simple model such as logistic regression 
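Cross-validation is exactly the tool that exposes the wild degree-5 fit from the polynomial example. A minimal sketch, assuming a scikit-learn pipeline of PolynomialFeatures plus LinearRegression and scoring each candidate degree with k-fold cross-validation; the seven data points are read off the earlier example as best the text allows:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.array([1, 2, 3, 4, 5, 6, 7]).reshape(-1, 1)
y = np.array([3, 8, 5, 7, 10, 9, 15])

for degree in (1, 3, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x, y, cv=KFold(n_splits=3),
                             scoring="neg_mean_squared_error")
    print(degree, scores.mean())    # the overfitted high degrees validate noticeably worse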
because there is an ordered structure in the categories For example I can map low weight to 1 middle weight to 2 and high weight to 3 The logistic regression classifier will automatically learn the parameters to distinguish them Another point is that the target stroke risk will now have three choices rather than two This multiclass classification is also achievable by training more than two logistic regression classifiers and using them together to partition the outcome space The code that does the categorytonumeric mapping is omitted due to space limitation you can find it in the Jupyter notebook The code that invokes logistic regression crossvalidation reads as follows from sklearnlinearmodel import LogisticRegressionCV X strokeriskweighthighoildietsmoking y strokeriskstrokerisk classifier LogisticRegressionCVcv3randomstate2020multi classautofitXy Note that the kfold cross validation has a k value of 3 We shouldnt choose a k value larger than the total number of records If we choose 15 which is exactly the number of records it is called leaveoneout crossvalidation You can obtain some parameters of the crossvalidated classifier by calling classifiergetparams The result reads as follows bound method BaseEstimatorgetparams of LogisticRegressionCVCs10 classweightNone cv3 dualFalse fitinterceptTrue interceptscaling10 l1ratiosNone maxiter100 multiclassauto n jobsNone penaltyl2 randomstate2020 refitTrue scoringNone solverlbfgs tol00001 verbose0 272 Statistics for Classification Note that a regularization term is automatically introduced because of the Cs parameter For more details you can refer to the API of the function Now lets call the predictprob function to predict the probabilities Lets say the person is slightly overweight so they have a weight value of 15 Recall that 1 means middle and 2 means high for weight This person also eats slightly more fatty foods but smokes a lot So they have 05 and 2 on another two features respectively The code reads as follows classifierpredictprobanparray15052 The results read as follows array020456731 015382072 064161197 So this person likely falls into the high stroke risk group Note that this model is very coarse due to the categorical variabletonumerical variable conversion but it gives you the capability to estimate on data which is beyond the previous observations Summary In this chapter we thoroughly studied the logistic regression classifier and corresponding classification task concepts Then we built a naïve Bayes classifier from scratch In the last part of this chapter we discussed the concepts of underfitting and overfitting and used sklearn to use crossvalidation functions In the next chapter we are going to study another big branch of machine learning models treebased models 10 Statistics for TreeBased Methods In the previous chapter we covered some important concepts in classification models We also built a naïve Bayes classifier from scratch which is very important because it requires you to understand every aspect of the details In this chapter we are going to dive into another family of statistical models that are also widely used in statistical analysis as well as machine learning treebased models Treebased models can be used for both classification tasks and regression tasks By the end of this chapter you will have achieved the following Gained an overview of treebased classification Understood the details of classification tree building Understood the mechanisms of regression trees Know how to use the scikitlearn library to build and 
regularize a treebased method Lets get started All the code snippets used in this chapter can be found in the official GitHub repository here httpsgithubcomPacktPublishingEssential StatisticsforNonSTEMDataAnalysts 274 Statistics for TreeBased Methods Overviewing treebased methods for classification tasks Treebased methods have two major varieties classification trees and regression trees A classification tree predicts categorical outcomes from a finite set of possibilities while a regression tree predicts numerical outcomes Lets first look at the classification tree especially the quality that makes it more popular and easy to use compared to other classification methods such as the simple logistic regression classifier and the naïve Bayes classifier A classification tree creates a set of rules and partitions the data into various subspaces in the feature space or feature domain in an optimal way First question what is a feature space Lets take our stroke risk data that we used in Chapter 9 Statistics for Classification as sample data Heres the dataset from the previous chapter for your reference Each row is a profile for a patient that records their weight diet habit smoking habit and corresponding stroke risk level Figure 101 Stroke risk data Overviewing treebased methods for classification tasks 275 We have three features for each record If we only look at the weight feature it can take three different levels low middle and high Imagine in a onedimensional line representing weight that there are only three discrete points a value can take namely the three levels This is a onedimensional feature space or feature domain On the other hand highoil diet and smoking habit are other twofeature dimensions with two possibilities Therefore a person can be on one of 12 322 combinations of all features in this threedimensional feature space A classification tree is built with rules to map these 12 points in the feature space to the outcome space which has three possible outcomes Each rule is a yesno question and the answer will be nonambiguous so each data record has a certain path to go down the tree The following is an example of such a classification tree Figure 102 An example of a classification tree for stroke risk data Lets look at one example to better understand the tree Suppose you are a guy who smokes but doesnt have a highoil diet Then starting at the top of the tree you will first go down to the left branch and then go right to the Middle stroke risk box The decision tree classifies you as a patient with middle stroke risk Now is a good time to introduce some terminology to mathematically describe a decision tree rather than using casual terms such as box A tree is usually drawn upside down but this is a good thing as you follow down a chain of decisions to reach the final status Here is some few important terminology that you need to be aware of Root node A root node is the only nodeblock that only has outgoing arrows In the tree shown in the previous figure it is the one at the top with the Smoking text The root node contains all the records and they havent been divided into subcategories which corresponds to partitions of feature space 276 Statistics for TreeBased Methods Decision node A decision node is one node with both incoming and outgoing arrows It splits the data feed into two groups For example the two nodes on the second level of the tree High oil diet and High weight are decision nodes The one on the left splits the smoking group further into the smoking and highoil diet 
group and the smoking and nonhighoil diet group The one on the right splits the nonsmoking group further into the nonsmoking and high weight and nonsmoking and nonhigh weight groups Leaf node A leaf node or a leaf is a node with only incoming arrows A leaf node represents the final terminal of a classification process where no further splitting is needed or allowed For example the node at the bottom left is a leaf that indicates that people who smoke and have a highoil diet are classified to have a high risk of stroke It is not necessary for a leaf to only contain pure results In this case it is alright to have only low stroke risk and high stroke risk people in the leaf What we optimized is the pureness of the classes in the node as the goal of classification is to reach unambiguous labeling The label for the records in a leaf node is the majority label If there is a tie a common solution is to pick a random label of the tied candidates to make it the majority label Parent node and children nodes The node at the start of an arrow is the parent node of the nodes at the end of the arrows which are called the child nodes A node can simultaneously be a parent node and a child node except the root node and the leaf The process of determining which feature or criteria to use to generate children nodes is called splitting It is common practice to do binary splitting which means a parent node will have two child nodes Depth and pruning The depth of a decision tree is defined as the length of the chain from the root node to the furthest leaf In the stroke risk case the depth is 2 It is not necessary for a decision tree to be balanced One branch of the tree can have more depth than another branch if accuracy requires The operation of removing children nodes including grandchild nodes and more is called pruning just like pruning a biological tree From now on we will use the rigorous terms we just learned to describe a decision tree Note One of the benefits of a decision tree is its universality The features dont necessarily take discrete values they can also take continuous numerical values For example if weight is replaced with continuous numerical values the splitting on high weight or not will be replaced by a node with criteria such as weight 200 pounds Overviewing treebased methods for classification tasks 277 Now lets go over the advantages of decision trees The biggest advantage of decision trees is that they are easy to understand For a person without any statistics or machine learning background decision trees are the easiest classification algorithms to understand The decision tree is not sensitive to data preprocessing and data incompletion For many machine learning algorithms data preprocessing is vital For example the units of a feature in grams or kilograms will influence the coefficient values of logistic regression However decision trees are not sensitive to data preprocessing The selection of the criteria will adjust automatically when the scale of the original data changes but the splitting results will remain unchanged If we apply logistic regression to the stroke risk data a missing value of a feature will break the algorithm However decision trees are more robust to achieve relatively stable results For example if a person who doesnt smoke misses the weight data they can be classified into the lowrisk or middlerisk groups randomly of course there are better ways to decide such as selecting the mode of records similar to it but they wont be classified into highrisk groups This 
result is sometimes good enough for practical use Explainability When a decision tree is trained you are not only getting a model but you also get a set of rules that you can explain to your boss or supervisor This is also why I love the decision tree the most The importance of features can also be extracted For example in general the closer the feature is to the root the more important the feature is in the model In the stroke risk example smoking is the root node that enjoys the highest feature importance We will talk about how the positions of the features are decided in the next chapter Now lets also talk about a few disadvantages of the decision tree It is easy to overfit Without control or penalization decision trees can be very complex How Imagine that unless there are two records with exactly the same features but different outcome variables the decision tree can actually build one leaf node for every record to reach 100 accuracy on the training set However the model will very likely not be generalized to another dataset Pruning is a common approach to remove overcomplex subbranches There are also constraints on the splitting step which we will discuss soon 278 Statistics for TreeBased Methods The greedy approach doesnt necessarily give the best model A single decision tree is built by greedily selecting the best splitting feature sequentially As a combination problem with an exponential number of possibilities the greedy approach doesnt necessarily give the best model In most cases this isnt a problem In some cases a small change in the training dataset might generate a completely different decision tree and give a different set of rules Make sure you doublecheck it before presenting it to your boss Note To understand why building a decision tree involves selecting rules from a combination of choices lets build a decision tree with a depth of 3 trained on a dataset of three continuous variable features We have three decision nodes including the root node to generate four leaves Each decision node can choose from three features for splitting therefore resulting in a total of 27 possibilities Yes one child node can choose the same feature as its parent Imagine we have four features then the total number of choices becomes 64 If the depth of the tree increases by 1 then we add four more decision nodes Therefore the total number of splitting feature choices is 16384 which is huge for such a fourfeature dataset Most trees will obviously be useless but the greedy approach doesnt guarantee the generation of the best decision tree We have covered the terminology advantages and disadvantages of decision trees In the next section we will dive deeper into decision trees specifically how branches of a tree are grown and pruned Growing and pruning a classification tree Lets start by examining the dataset one more time We will first simplify our problem to a binary case so that the demonstration of decision tree growing is simpler Lets examine Figure 101 again For the purpose of this demonstration I will just group the middlerisk and highrisk patients into the highrisk group This way the classification problem becomes a binary classification problem which is easier to explain After going through this section you can try the exercises on the original threecategory problem for practice The following code snippet generates the new dataset that groups middlerisk and highrisk patients together dfstrokerisk dfstrokeriskapplylambda x low if x low else high Growing and pruning a classification tree 279 
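Because the preceding snippet lost its formatting in this text, here is a minimal runnable sketch of the same grouping step, together with a small Gini impurity helper that can be used to check the split calculations worked through in the next section. It assumes the 15-record stroke risk DataFrame df from Chapter 9 is already loaded (for example, from the book's Jupyter notebook); the column names weight, high_oil_diet, smoking, and stroke_risk are assumptions used for illustration only.

from collections import Counter

# Assumed: df is the stroke risk DataFrame from Chapter 9, with the columns
# 'weight', 'high_oil_diet', 'smoking', and 'stroke_risk' (names are assumptions).
df['stroke_risk'] = df['stroke_risk'].apply(
    lambda x: 'low' if x == 'low' else 'high'
)

def gini(labels):
    # Gini impurity of a collection of class labels:
    # 1 minus the sum of squared class proportions.
    counts = Counter(labels)
    total = sum(counts.values())
    return 1 - sum((n / total) ** 2 for n in counts.values())

def weighted_gini(frame, feature, value, target='stroke_risk'):
    # Weighted Gini impurity after a yes/no split on feature == value.
    yes = frame[frame[feature] == value][target]
    no = frame[frame[feature] != value][target]
    n = len(frame)
    return len(yes) / n * gini(yes) + len(no) / n * gini(no)

print(gini(df['stroke_risk']))              # impurity at the root node
print(weighted_gini(df, 'smoking', 'yes'))  # impurity after splitting on smoking

If the DataFrame matches the counts used in the next section, the two printed values should come out near 0.498 and 0.390, the figures derived by hand there.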
The new dataset will then look as follows:

Figure 10.3 - Binary stroke risk data

Now let's think about the root node. Which feature, and what kind of criterion, should we choose to generate two children nodes from the root node that contains all the records (data points)? We will explore this topic in the next section.

Understanding how splitting works

The principle of splitting is that splitting on a feature must get us closer to a completely correct classification, so we need a numerical metric to compare different choices of splitting features. The goal of classification is to sort records into pure states, such that each leaf contains records that are as pure as possible. Therefore, pureness (or impureness) becomes a natural choice of metric.

The most common metric is called Gini impurity. It measures how impure a set of data is. For a binary-class set of data with class labels A and B, the definition of Gini impurity is the following:

Gini impurity = 1 - P(A)² - P(B)²

If the set only contains A or only contains B, the Gini impurity is 0. The maximum impurity is 0.5, reached when half of the records are A and the other half are B. For a three-class dataset, the minimum is still 0, but the maximum becomes 2/3.

Note
Gini impurity is named after the Italian demographer and statistician Corrado Gini. Another well-known index named after him is the Gini index, which measures the inequality of wealth distribution in a society.

Let's see how this unfolds at the root node, without any splitting. The Gini impurity is calculated as 1 - (8/15)² - (7/15)², because we have eight low-risk records and seven high-risk records. The value is about 0.498, close to the highest possible impurity.

After splitting by one criterion, we have two children nodes. The way to obtain the new, lower impurity is to calculate the weighted Gini impurity of the two children nodes.

First, let's take the high-oil diet group as an example and examine its partition. The following code snippet does the counting:

Counter(df[df['high_oil_diet'] == 'yes']['stroke_risk'])

There is a total of six records, with two low-risk records and four high-risk records. Therefore, the impurity for the high-oil diet group is 1 - (1/3)² - (2/3)² = 4/9.

Meanwhile, we can calculate the non-high-oil diet group's statistics. Let's select and count them using the following code snippet:

Counter(df[df['high_oil_diet'] == 'no']['stroke_risk'])

There is a total of nine records, with six low-risk records and three high-risk records. Note that the proportions are the same as in the high-oil diet group, but with the classes exchanged. Therefore, the Gini impurity is also 4/9. The weighted Gini impurity remains 4/9, because (6/15)·(4/9) + (9/15)·(4/9) = 4/9. It is about 0.444.

So, what do we get from such a split? We have reduced the Gini impurity from 0.498 to 0.444, which is just a slight decrease, but better than nothing. Next, let's examine the smoking behavior.

By the same token, let's first check the smoking group's statistics. The following code snippet does the counting:

Counter(df[df['smoking'] == 'yes']['stroke_risk'])

There is a total of seven smoking cases. Five of them are of high stroke risk and two of them are of low stroke risk. The Gini impurity is therefore 1 - (5/7)² - (2/7)² ≈ 0.408. Let's check the non-smokers:

Counter(df[df['smoking'] == 'no']['stroke_risk'])

There are eight non-smokers; six of them are of low stroke risk and two of them are of high stroke risk. Therefore, the Gini impurity is 1 - (6/8)² - (2/8)² = 0.375. The weighted impurity is about 0.408·(7/15) + 0.375·(8/15) ≈ 0.390. This is a 0.108 decrease from the original impurity without splitting, and it is better than
the splitting on the highoil diet group I will omit the calculation for the other feature weight but I will list the result for you in the following table Note that the weight feature has three levels so there can be multiple rules for splitting the feature Here I list all of them In the yes and no group statistics I list the number of highstroke risk records the number of lowrisk records and the Gini impurity value separated by commas Figure 104 The Gini impurity evaluation table for different splitting features at the root node 282 Statistics for TreeBased Methods Note that I highlighted the Gini impurity for the highweight group and the weighted Gini impurity for the last splitting choice All highweight patients have a high stroke risk and this drives the weighted impurity down to 0356 the lowest of all possible splitting rules Therefore we choose the last rule to build our decision tree After the first splitting the decision tree now looks like the following Figure 105 The decision tree after the first splitting Note that the left branch now contains a pure node which becomes a leaf Therefore our next stop only focuses on the right branch We naturally have an imbalanced tree now Now we have four choices for the splitting of 12 records First I will select these 12 records out with the following oneline code snippet dfright dfdfweighthigh The result looks as follows The Gini impurity for the right splitting node is 0444 as calculated previously This will become our new baseline Figure 106 The lowweight and middleweight group Growing and pruning a classification tree 283 As we did earlier lets build a table to compare different splitting choices for the splitting node on the right The ordering of the numbers is the same as in the previous table Figure 107 The Gini impurity evaluation table for different splitting features at the right splitting node We essentially only have three choices because the two splitting rules on the feature weight are mirrors of each other Now we have a tie We can randomly select one criterion for building the trees further This is one reason why decision trees dont theoretically generate the best results Note An intuitive way to solve this issue is to build both possibilities and even more possible trees which violates the greedy approach and let them vote on the prediction results This is a common method to build a more stable model or an ensemble of models We will cover related techniques in the next chapter Lets say I choose highoil diet as the criteria The tree now looks like the following Figure 108 The decision tree after the second splitting 284 Statistics for TreeBased Methods Now lets look at the two newly generated nodes The first one at a depth of 2 contains two high stroke risk records and two low stroke risk records They dont have a heavy weight but do have a highoil diet Lets check out their profile with this line of code dfrightdfrighthighoildietyes The result looks like the following Figure 109 Records classified into the first node at a depth of 2 Note that the lowweight category contains one low stroke risk record and a high stroke risk example The same situation happens with the middleweight category This makes the decision tree incapable of further splitting on any feature There wont be any Gini impurity decreasing for splitting Therefore we can stop here for this node Note Well what if we want to continue improving the classification results As you just discovered there is no way that the decision tree can classify these four records and no 
other machine learning method can do it either The problem is in the data not in the model There are two main approaches to solve this issue The first option is to try to obtain more data With more data we may find that low weight is positively correlated with low stroke risk and further splitting on the weight feature might benefit decreasing the Gini impurity Obtaining more data is always better because your training model gets to see more data which therefore reduces possible bias Another option is to introduce more features This essentially expands the feature space by more dimensions For example blood pressure might be another useful feature that might help us further increase the accuracy of the decision tree Growing and pruning a classification tree 285 Now lets look at the second node at depth 2 The records classified into this node are the following given by the dfrightdfrighthighoildietyes code Figure 1010 Records classified into the second node at a depth of 2 Note that only two high stroke risk records are in this node If we stop here the Gini impurity is 1 0252 0752 0375 which is quite a low value If we further split on the smoking feature note that out of all the nonsmokers four of them have a low stroke risk Half of the smokers have a high stroke risk and the other half have a low stroke risk This will give us a weighted Gini impurity of 025 if were splitting on smoking If we further split on the weight feature all the lowweight patients are at low stroke risk Two out of five middleweight records are at high risk This will give us a weighted Gini impurity of 5 8 1 2 5 2 3 5 2 03 which is also not bad 286 Statistics for TreeBased Methods For the two cases the final decision trees look as follows The following decision tree has smoking as the last splitting feature Figure 1011 Final decision tree version 1 Growing and pruning a classification tree 287 The other choice is splitting on weight again at the second node at depth 2 The following tree will be obtained Figure 1012 Final decision tree version 2 Now we need to make some hard choices to decide the final shape of our decision trees Evaluating decision tree performance In this section lets evaluate the performance of the decision tree classifiers If we stop at depth 2 we have the following confusion matrix Note that for the unclassifiable first node at depth 2 we can randomly assign it a label Here I assign it as high stroke risk The performance of our classifier can be summarized in the following table concisely Figure 1013 Confusion matrix of a decision tree of depth 2 288 Statistics for TreeBased Methods Generally we identify high risk as positive so the precision recall and F1 score are all 5 7 If you have forgotten these concepts you can review previous chapters If we dont stop at a depth of 2 the two finer decisions trees will have the following confusion matrices Again we assign the unclassifiable first node at depth 2 the label high stroke risk However the first node at depth 3 is also unclassifiable because it contains equal high stroke risk and low stroke risk records If they are classified as lowrisk ones then we essentially obtain the same result as the depthof2 one Therefore we assign the first leaf node at depth 3 a high stroke risk value The new confusion matrix will look as follows Figure 1014 Confusion matrix of a decision tree of depth 3 version 1 Note that we will have perfect recall but the precision will be just slightly better than a random guess 7 12 The F1 score is 14 19 Next lets check our final 
version If we split with weight the corresponding confusion matrix looks as shown in the following table Figure 1015 Confusion matrix of a decision tree of depth 3 version 2 The precision recall and F1 score will be identical to the depth 2 decision tree In real life we usually prefer the simplest model possible if it is as good or almost as good as the complicated ones Although the first depth 3 decision tree has a better F1 score it also introduces one more unclassifiable node and one more rule The second depth 3 decision tree does no better than the depth 2 one To constrain the complexity of the decision tree there are usually three methods Constrain the depth of the tree This is probably the most direct way of constraining the complexity of the decision tree Constrain the lower bound of the number of records classified into a node For example if after splitting one child node will only contain very few data points then it is likely not a good splitting Exploring regression tree 289 Constrain the lower bound of information gain In our case the information gain means lower Gini impurity For example if we set a criterion that each splitting must lower the information gain by 01 then the splitting will likely stop soon therefore confining the depth of the decision tree We will see algorithmic examples on a more complex dataset later in this chapter Note When the number of records in a splitting node is small the Gini impurity reduction is no longer as representative as before It is the same idea as in statistical significance The larger the sample size is the more confident we are about the derived statistics You may also hear the term size of the decision tree Usually the size is not the same as the depth The size refers to the total number of nodes in a decision tree For a symmetric decision tree the relationship is exponential Exploring regression tree The regression tree is very similar to a classification tree A regression tree takes numerical features as input and predicts another numerical variable Note It is perfectly fine to have mixtype features for example some of them are discrete and some of them are continuous We wont cover these examples due to space limitations but they are straightforward There are two very important visible differences The output is not discrete labels but rather numerical values The splitting rules are not similar to yesorno questions They are usually inequalities for values of certain features In this section we will just use a onefeature dataset to build a regression tree that the logistic regression classifier wont be able to classify I created an artificial dataset with the following code snippet def price2revenueprice if price 85 return 70 absprice 75 elif price 95 290 Statistics for TreeBased Methods return 10 80 else return 80 105 price prices nplinspace801008 revenue nparrayprice2revenueprice for price in prices pltrcParamsupdatefontsize 22 pltfigurefigsize108 pltscatterpricesrevenues300 pltxlabelprice pltylabeltotal revenue plttitlePrice versus Revenue Lets say we want to investigate the relationship between the price of an item and its total revenue a day If the price is set too low the revenue will be lower because the price is low If the price is too high the revenue will also be low due to fewer amounts of the item being sold The DataFrame looks as follows Figure 1016 Price and total revenue DataFrame Exploring regression tree 291 The following visualization makes this scenario clearer Figure 1017 The relationship between price and revenue 
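The code that generates this artificial dataset did not survive the formatting of this text, so the sketch below shows one way to produce a similar piecewise price-revenue relationship. The breakpoints and formulas inside price2revenue are assumptions chosen only to reproduce the general shape of the figure above (low revenue at both ends, higher revenue in between, with three distinct regions); they are not the book's original constants.

import numpy as np
import matplotlib.pyplot as plt

def price2revenue(price):
    # Three pricing regimes; the thresholds and formulas are illustrative assumptions.
    if price <= 85:
        return 40 + 2 * (price - 80)      # revenue rises as the price increases
    elif price <= 95:
        return 70 - abs(price - 90)       # revenue peaks around the middle prices
    else:
        return 55 - 3 * (price - 95)      # revenue falls once the price is too high

prices = np.linspace(80, 100, 8)
revenue = np.array([price2revenue(p) for p in prices])

plt.figure(figsize=(10, 8))
plt.scatter(prices, revenue, s=300)
plt.xlabel('price')
plt.ylabel('total revenue')
plt.title('Price versus Revenue')
plt.show()

Any piecewise function with a similar shape will do; the point is that a single straight line or a logistic curve cannot capture the three regions, which is exactly the situation a regression tree handles well.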
The relationship between price and revenue is clearly nonlinear and logistic regression wont be able to classify it A linear regression will likely become a horizontal line There are clearly three regions where different relationships between revenue and price apply Now lets build a regression tree Like the deduction of Gini impurity in the classification tree we need a metric to measure the benefit of splitting A natural choice is still the sum of squared residuals Lets start from the root node We have eight data points so there are essentially seven intervals where we can put the first splitting criteria into For example we can split at price 85 Then we use the average revenue on both sides to be our prediction as follows The code snippet for the visualization reads as follows pltrcParamsupdatefontsize 22 pltfigurefigsize128 pltscatterpricesrevenues300 pltxlabelprice pltylabeltotal revenue plttitlePrice versus Revenue threshold 85 numleft sumprices threshold aveleft npmeanrevenueprices threshold 292 Statistics for TreeBased Methods numright sumprices threshold averight npmeanrevenueprices threshold pltaxvlinethresholdcolorredlinewidth6 pltplotpricesprices threshold aveleft for in rangenumleft linewidth6linestylecorange label average revenue on the left half pltplotpricesprices threshold averight for in rangenumright linewidth6linestylecgreen labelaverage revenue on the right half pltrcParamsupdatefontsize 16 pltlegendloc040 In the following figure the dotted line represents the average price for the scenario when the price is lower than 850 The dashed line represents the average price for the scenario when the price is higher than 850 Figure 1018 Splitting at price 850 Exploring regression tree 293 If we stop here the regression tree will have a depth of 1 and looks like the following Figure 1019 A regression tree of depth 1 However we havent tested the other six splitting choices Any splitting choice will have a corresponding sum of squared residuals and we would like to go over all the possibilities to determine the splitting that gives the minimal sum of squared residuals Note Unlike Gini impurity where we need to take a weighted average the total sum of squared residuals is a simple summation Gini impurity is not additive because it only takes a value between 0 and 1 Squared residuals are additive because each residual corresponds to one data point The following code snippet plots the sum of squared residuals against different choices of splitting For completion I plotted more than seven splitting values to visualize the stepped pattern def calssrarr if lenarr0 return 0 ave npmeanarr return npsumarrave2 splittingvalues nplinspace8010020 ssrvalues for splittingvalue in splittingvalues ssr calssrrevenueprices splittingvalue cal ssrrevenueprices splittingvalue ssrvaluesappendssr 294 Statistics for TreeBased Methods pltrcParamsupdatefontsize 22 pltfigurefigsize128 pltxlabelsplitting prices pltylabelsum of squared residuals plttitleSplitting Price versus Sum of Squared Residuals pltplotsplittingvaluesssrvalues The result looks as in the following figure Figure 1020 Splitting value for the root node versus the sum of squared residuals The visualization in the preceding figure indicates that 850 or any value between the second point and the third point is the best splitting value for the root node There are only two records in the first node with a depth of 1 so we focus on the second node and repeat the process explained here The code is omitted due to space limitations The visualization 
of the sum of squared residuals is the following Exploring regression tree 295 Figure 1021 Splitting choices at the second node at depth 1 Now in order to achieve the minimum sum of squared error we should put the last data point into one child node However you see that if we split at 98 the penalty we pay is not increasing much If we include another one such as splitting at 96 the penalty will soar It may be a good idea to split at 96 rather than 98 because a leaf node containing too few records is not representative in general and often indicates overfitting Here is the final look of our regression tree You can calculate the regressed average prices in each region easily The final regression tree looks as follows Figure 1022 Final regression tree 296 Statistics for TreeBased Methods The following figure shows a visualization for the partition of the regions Figure 1023 Regressed values and region partitioning In multifeature cases we will have more than one feature The scanning of the best splitting value should include all the features but the idea is the same Using tree models in scikitlearn Before ending this chapter lets try out the scikitlearn API You can verify that the results agree with our models built from scratch The following code snippet builds a regression tree with a maximum depth of 1 on the pricerevenue data from sklearntree import DecisionTreeRegressor from sklearn import tree prices revenue pricesreshape11 revenuereshape11 regressor DecisionTreeRegressorrandomstate0maxdepth1 regressorfitpricesrevenue Using tree models in scikitlearn 297 Now we can visualize the tree with the following code snippet pltfigurefigsize128 treeplottreeregressor The tree structure looks as follows Figure 1024 Regression tree visualization of depth 1 Next we limit the maximum depth to 2 and require the minimal number of records samples in a leaf node to be 2 The code only requires a small change in the following line regressor DecisionTreeRegressorrandomstate0max depth2minsamplesleaf2 After running the code we obtain the following tree structure Figure 1025 Regression tree visualization of depth 2 298 Statistics for TreeBased Methods As you can see this produces exactly the same results as the one we built from scratch Note The scikitlearn decision tree API cant explicitly handle categorical variables There are various options such as onehot encoding that you can use to bypass this limitation You are welcome to explore the solutions on your own Summary In this chapter we started with the fundamental concepts of decision trees and then built a simple classification tree and a regression tree from scratch We went over the details and checked the consistency with the scikitlearn library API You may notice that tree methods do tend to overfit and might fail to reach the optimal model In the next chapter we will explore the socalled ensemble learning They are metaalgorithms that can be used on top of many other machine learning algorithms as well 11 Statistics for Ensemble Methods In this chapter we are going to investigate the ensemble method in terms of statistics and machine learning The English word ensemble means a group of actors or musicians that work together as a whole The ensemble method or ensemble learning in machine learning is not a specific machine learning algorithm but a meta learning algorithm that builds on top of concrete machine learning algorithms to bundle them together to achieve better performance The ensemble method is not a single method but a collection of many In this 
chapter we will cover the most important and representative ones We are going to cover the following in this chapter Revisiting bias variance and memorization Understanding the bootstrapping and bagging techniques Understanding and using the boosting module Exploring random forests with scikitlearn Lets get started 300 Statistics for Ensemble Methods Revisiting bias variance and memorization Ensemble methods can improve the result of regression or classification tasks in that they can be applied to a group of classifiers or regressors to help build a final augmented model Since we are talking about performance we must have a metric for improving performance Ensemble methods are designed to either reduce the variance or the bias of the model Sometimes we want to reduce both to reach a balanced point somewhere on the biasvariance tradeoff curve We mentioned the concepts of bias and variance several times in earlier chapters To help you understand how the idea of ensemble learning originated I will revisit these concepts from the perspective of data memorization Lets say the following schematic visualization represents the relationship between the training dataset and the realworld total dataset The solid line shown in the following diagram separates the seen world and the unseen part Figure 111 A schematic representation of the observed data Suppose we want to build a classifier that distinguishes between the circles and the squares Unfortunately our observed data is only a poor subset of the original data In most cases we do not know the entire set of realworld data so we dont know how representative our accessible dataset is We want to train a model to classify the two classes that is square and circle However since our trained model will only be exposed to the limited observed data different choices regarding which model we choose as well as its complexity will give us different results Lets check out the following two decision boundaries Revisiting bias variance and memorization 301 First we can draw a decision boundary as a horizontal line as shown in the following diagram This way one square data point is misclassified as a round one Figure 112 A simple decision boundary Alternatively we can draw a decision boundary the other way as shown in the following diagram This zigzagging boundary will correctly classify both the square data points and the round data points Figure 113 A more complex decision boundary Can you tell which classification method is better With our hindsight of knowing what the entire dataset looks like we can tell that neither is great However the difference is how much we want our model to learn from the known data The structure of the training dataset will be memorized by the model The question is how much Note on memorization Data memorization means that when a model is being trained it is exposed to the training set so it remembers the characteristics or structure of the training data This is a good thing when the model has high bias because we want it to learn but it becomes notoriously bad when its memory gets stuck in the training data and fails to generalize Simply put when a model memorizes and learns too little of the training data it has high bias When it learns too much it has high variance 302 Statistics for Ensemble Methods Because of this we have the following famous curve of the relationship between model complexity and error This is probably the most important graph for any data scientist interview Figure 114 The relationship between model complexity 
and error When model complexity increases the error in terms of mean squared error or any other form will always decrease monotonically Recall that when we discussed R2 we said that adding any predictor feature will increase the R2 rate On the other hand the performance of the learned model will start to decrease on the test dataset or other unseen datasets This is because the model learns too many ungeneralizable characteristics such as the random noise of the training data In the preceding example the zigzagging boundary doesnt apply to the rest of the dataset To summarize underfitting means that the model is biased toward its original assumption which means theres information thats missing from the training set On the other hand overfitting means that too many training setspecific properties were learned so the models complexity is too high Underfitting and overfitting High bias corresponds to underfitting while high variance corresponds to overfitting Humans also fall into similar traps For example a CEO is very busy so heshe does not have a lot of free time to spend with hisher kids What is the most likely job of the kids mother Most people will likely say homemaker However I didnt specify the gender of the CEO Well the CEO is the mother Understanding the bootstrapping and bagging techniques 303 As the power of machine learning algorithms grows the necessity to curb overfitting and find a balance between bias and variance is prominent Next well learn about the bootstrapping and bagging techniques both of which can help us solve these issues Understanding the bootstrapping and bagging techniques Bootstrapping is a pictorial word It allows us to imagine someone pulling themselves up by their bootstraps In other words if no one is going to help us then we need to help ourselves In statistics however this is a sampling method If there is not enough data we help ourselves by creating more data Imagine that you have a small dataset and you want to build a classifierestimator with this limited amount of data In this case you can perform crossvalidation Crossvalidation techniques such as 10fold crossvalidation will decrease the number of records in each fold even further We can take all the data as the training data but you likely will end up with a model with very high variance What should we do then The bootstrapping method says that if the dataset being used is a sample of the unknown data in the dataset why not try resampling again The bootstrap method creates new training sets by uniformly sampling from the dataset and then replacing it This process can be repeated as many times as its necessary to create many new training datasets Each new training set can be used to train a classifierregressor Besides the magic of creating training datasets from thin air bootstrapping has two significant advantages Bootstrapping increases the randomness in the training set It is likely that such randomness will help us avoid capturing the intrinsic random noise in the original training set Bootstrapping can help build the confidence interval of calculated statistics plural form of statistic Suppose we run bootstrapping N times and obtain N new samples By doing this we can calculate the standard variation of a selected statistic which is not possible without bootstrapping Before we move on lets examine how bootstrapping works on a real dataset We are going to use the Boston Housing Dataset which you can find in its official GitHub repository httpsgithubcomPacktPublishingEssentialStatisticsfor 
NonSTEMDataAnalysts You can also find the meanings of each column in the respective Jupyter notebook It contains information regarding the per capita crime rate by town average number of rooms per dwelling and so on 304 Statistics for Ensemble Methods Later in this chapter we will use these features to predict the target feature that is the median value of owneroccupied homes medv Turning a regression problem into a classification problem I am going to build a classifier for demonstration purposes so I will transform the continuous variable medv into a binary variable that indicates whether a houses price is in the upper 50 or lower 50 of the market The first few lines of records in the original dataset look as follows Due to space limitations most of the code except for the crucial pieces will be omitted here Figure 115 Main section of the Boston Housing Dataset First lets plot the distribution of the index of accessibility to radial highways variable Here you can see that this distribution is quite messy no simple distribution can model it parametrically Lets say that this is our entire dataset for selecting the training dataset in this demo Figure 116 Index of accessibility to radial highways distribution the dataset contains 506 records Understanding the bootstrapping and bagging techniques 305 Now lets select 50 pieces of data to be our trainingobserved data This distribution looks as follows Notice that the yaxis has a different scale The functions well be using to perform sampling can be found in the scikitlearn library Please refer to the relevant Jupyter notebook for details Figure 117 Index of accessibility to radial highways distribution our original sample which contains 50 records Next lets run bootstrapping 1000 times For each round well sample 25 records Then we will plot these new samples on the same histogram plot Here you can see that the overlapping behavior drops some characteristics of the 50record sample such as the very high peak on the lefthand side Figure 118 Overlapping 1000 times with our bootstrapped sample containing 25 records 306 Statistics for Ensemble Methods Next we will learn how this will help decrease the variance of our classifiers by aggregating the classifiers that were trained on the bootstrapped sets The word bagging is essentially an abbreviation of bootstrap aggregation This is what we will study in the remainder of this section The premise of bagging is to train some weak classifiers or regressors on the newly bootstrapped training sets and then aggregating them together through a majority vote averaging mechanism to obtain the final prediction The following code performs preprocessing and dataset splitting I am going to use a decision tree as our default weak classifier so feature normalization wont be necessary from sklearnmodelselection import traintestsplit import numpy as np bostonbag bostoncopy bostonbagmedv bostonbagmedvapplylambda x intx npmedianbostonbagmedv bostonbagtrain bostonbagtest traintestsplitboston bagtrainsize07shuffleTruerandomstate1 bostonbagtrainX bostonbagtrainy bostonbagtrain dropmedvaxis1tonumpy bostonbagtrainmedv tonumpy bostonbagtestX bostonbagtesty bostonbagtest dropmedvaxis1tonumpy bostonbagtestmedv tonumpy Note that I explicitly made a copy of the boston DataFrame Now Im going to try and reproduce something to show the overfitting on the test set with a single decision tree I will control the maximal depth of the single decision tree in order to control the complexity of the model Then Ill plot the F1 score with 
respect to the tree depth both on the training set and the test set Without any constraints regarding model complexity lets take a look at how the classification tree performs on the training dataset The following code snippet plots the unconstrained classification tree from sklearntree import DecisionTreeClassifier from sklearnmetrics import f1score from sklearn import tree clf DecisionTreeClassifierrandomstate0 clffitbostonbagtrainX bostonbagtrainy Understanding the bootstrapping and bagging techniques 307 pltfigurefigsize128 treeplottreeclf By doing this we obtain a huge decision tree with a depth of 10 It is hard to see the details clearly but the visualization of this is shown in the following diagram Figure 119 Unconstrained single decision tree The F1 score can be calculated by running the following f1scorebostonbagtrainyclfpredictbostonbagtrainX This is exactly 1 This means weve obtained the best performance possible Next Ill build a sequence of classifiers and limit the maximal depth it can span Ill calculate the F1 score of the model on the train set and test set in the same way The code for this is as follows trainf1 testf1 depths range111 for depth in depths clf DecisionTreeClassifierrandomstate0max 308 Statistics for Ensemble Methods depthdepth clffitbostonbagtrainX bostonbagtrainy trainf1appendf1scorebostonbagtrainyclf predictbostonbagtrainX testf1appendf1scorebostonbagtestyclf predictbostonbagtestX The following code snippet plots the two curves pltfigurefigsize106 pltrcParamsupdatefontsize 22 pltplotdepthstrainf1labelTrain Set F1 Score pltplotdepthstestf1 labelTest Set F1 Score pltlegend pltxlabelModel Complexity pltylabelF1 Score plttitleF1 Score on Train and Test Set The following graph is what we get Note that the F1 score is an inverse indicatormetric of the error The higher the F1 score is the better the model is in general Figure 1110 F1 score versus model complexity Although the F1 score on the training set continues increasing to reach the maximum the F1 score on the test set stops increasing once the depth reaches 4 and gradually decreases beyond that It is clear that after a depth of 4 we are basically overfitting the decision tree Understanding the bootstrapping and bagging techniques 309 Next lets see how bagging would help We are going to utilize the BaggingClassifier API in scikitlearn First since we roughly know that the critical depth is 4 well build a base estimator of such depth before creating a bagging classifier marked out of 10 for it Each time well draw samples from the training dataset to build a base estimator The code for this reads as follows from sklearnensemble import BaggingClassifier baseestimator DecisionTreeClassifierrandomstate 0 max depth 4 baggingclf BaggingClassifierbaseestimatorbaseestimator nestimators10 njobs20 maxsamples07 randomstate0 baggingclffitbostonbagtrainX bostonbagtrainy Next lets plot the relationship between the F1 score and the number of base estimators trainf1 testf1 nestimators 2i1 for i in range18 for nestimator in nestimators baggingclf BaggingClassifierbaseestimatorbase estimator nestimatorsnestimator njobs20 randomstate0 baggingclffitbostonbagtrainX bostonbagtrainy trainf1appendf1scorebostonbagtrainy baggingclf predictbostonbagtrainX testf1appendf1scorebostonbagtesty baggingclf predictbostonbagtestX pltfigurefigsize106 pltplotnestimatorstrainf1labelTrain Set F1 Score pltplotnestimatorstestf1 labelTest Set F1 Score pltxscalelog pltlegend 310 Statistics for Ensemble Methods pltxlabelNumber of Estimators pltylabelF1 
Score plttitleF1 Score on Train and Test Set The resulting of running the preceding code is displayed in the following graph Note that the xaxis is on the log scale Figure 1111 F1 score versus number of estimators As you can see even on the training set the F1 score stops increasing and begins to saturate and decline a little bit pay attention to the yaxis which indicates that there is still an intrinsic difference between the training set and the test set There may be several reasons for this performance and I will point out two here The first reason is that it is possible that we are intrinsically unlucky that is there was some significant difference between the training set and the test set in the splitting step The second reason is that our depth restriction doesnt successfully constrain the complexity of the decision tree What we can do here is try another random seed and impose more constraints on the splitting condition For example changing the following two lines will produce different results First well change how the dataset is split bostonbagtrain bostonbagtest traintest splitbostonbagtrainsize07shuffleTruerandom state1 Understanding and using the boosting module 311 Second well impose a new constraint on the base estimator so that a node must be large enough to be split baseestimator DecisionTreeClassifierrandomstate 0 maxdepth 4minsamplessplit30 Due to further imposed regularization it is somewhat clearer that the performance of both the training set and the test set is consistent Figure 1112 F1 score under more regularization Note that the scale on the yaxis is much smaller than the previous one Dealing with inconsistent results It is totally normal if the result is not very consistent when youre training evaluating machine learning models In such cases try different sets of data to eliminate the effect of randomness or run crossvalidation If your result is inconsistent with the expected behavior take steps to examine the whole pipeline Start with dataset completeness and tidiness then preprocessing and model assumptions It is also possible that the way the results are being visualized or presented is flawed Understanding and using the boosting module Unlike bagging which focuses on reducing variance the goal of boosting is to reduce bias without increasing variance 312 Statistics for Ensemble Methods Bagging creates a bunch of base estimators with equal importance or weights in terms of determining the final prediction The data thats fed into the base estimators is also uniformly resampled from the training set Determining the possibility of parallel processing From the description of bagging we provided you may imagine that it is relatively easy to run bagging algorithms Each process can independently perform sampling and model training Aggregation is only performed at the last step when all the base estimators have been trained In the preceding code snippet I set njobs 20 to build the bagging classifier When it is being trained 20 cores on the host machine will be used at most Boosting solves a different problem The primary goal is to create an estimator with low bias In the world of boosting both the samples and the weak classifiers are not equal During the training process some will be more important than others Here is how it works 1 First we assign a weight to each record in the training data Without special prior knowledge all the weights are equal 2 We then train our first baseweak estimator on the entire dataset After training we increase the weights of those 
records that are predicted with the wrong labels.
3. Once the weights have been updated, we create the next base estimator. The difference here is that the records that were misclassified by previous estimators, or whose values seriously deviated from the true value in regression, now receive higher attention. Misclassifying them again will result in higher penalties, which will in turn increase their weight.
4. Finally, we iterate step 3 until the preset iteration limit is reached or accuracy stops improving.

Note that boosting is also a meta-algorithm. This means that at different iterations, the base estimator can be completely different from the previous one. You can use logistic regression in the first iteration, a neural network in the second iteration, and a decision tree in the final iteration. There are two classical boosting methods we can use: adaptive boosting (AdaBoost) and gradient descent boosting.

First, we'll look at AdaBoost, where there are two kinds of weights: the weights of the training set records, which change at every iteration, and the weights of the estimators, which are determined inversely by their training errors. The weight of a record indicates the probability that this record will be selected in the training set for the next base estimator. For an estimator, the lower its training error is, the more voting power it has in the final weighted classifier. The following is the pseudo-algorithm for AdaBoost:

1. Initialize the training data with equal weights w_i = 1/N and set a maximal number of iterations, K.
2. At round k, sample the data according to the weights w_i and build a weak classifier f_k.
3. Obtain the error ε_k of f_k and calculate its corresponding estimator weight, α_k = (1/2)·ln((1 - ε_k)/ε_k).
4. Update the weight of each record for round k + 1: w_i is multiplied by exp(-α_k) if the record is classified correctly, otherwise by exp(α_k), and the weights are then renormalized.
5. Repeat steps 2 to 4 K times and obtain the final classifier, which is proportional to the weighted sum of the weak classifiers, Σ_k α_k·f_k(x).

Intuition behind the AdaBoost algorithm
At step 3, if there are no errors (ε_k = 0), f_k will have an infinitely large weight. This intuitively makes sense: if one weak classifier does the job, why bother creating more? At step 4, if a record is misclassified, its weight will be increased exponentially; otherwise, its weight will be decreased exponentially. Do a thought experiment here: if a classifier correctly classifies most records but misses only a few, then the records that have been misclassified will be exp(2·α_k) times more likely to be selected in the next round. This also makes sense.

AdaBoost is only suitable for classification tasks. For regression tasks, we can use the Gradient Descent Boost (GDB) algorithm. However, please note that GDB can also be used to perform classification tasks or even ranking tasks.

Let's take a look at the intuition behind GDB. In regression tasks, we often want to minimize the mean squared error, which takes the form (1/2)·Σ_i (y_i - ŷ_i)², where y_i is the true value and ŷ_i is the regressed value. The weak estimator at round k leaves a residual y_i - ŷ_i for each record, and sequentially, at iteration k + 1, we want to build another base estimator to remove such residuals. If you know calculus and look closely, you'll see that the residual is proportional to the negative derivative of the mean squared error, also known as the loss function, with respect to the prediction: the derivative of (1/2)·(y_i - ŷ_i)² with respect to ŷ_i is -(y_i - ŷ_i). This is the key: improving the model is not achieved by weighing records as in AdaBoost, but by deliberately constructing the next base estimator to predict the residuals, which happen to be the gradient of the MSE. If a different loss function is used, the residual argument won't be valid, but we still want
to predict the gradients The math here is beyond the scope of this book but you get the point Does GDB work with the logistic regression base learner The answer is a conditional no In general GDB doesnt work on simple logistic regression or other simple linear models The reason is that the addition of linear models is still a linear mode This is basically the essence of being linear If a linear model misclassifies some records another linear model will likely misclassify them If it doesnt the effect will be smeared at the last averaging step too This is probably the reason behind the illusion that ensemble algorithms are only applicable to treebased methods Most examples are only given with tree base estimatorslearners People just dont use them on linear models As an example lets see how GDB works on the regression task of predicting the median price of Boston housing The following code snippet builds the training dataset the test dataset and the regressor from sklearnensemble import GradientBoostingRegressor from sklearnmetrics import meansquarederror bostonboost bostoncopy bostonboosttrain bostonboosttest traintest splitbostonboost train size07 shuffleTrue Understanding and using the boosting module 315 random state1 bostonboosttrainX bostonboosttrainy bostonboost traindropmedvaxis1tonumpy bostonboost trainmedvtonumpy bostonboosttestX bostonboosttesty bostonboosttest dropmedvaxis1tonumpy bostonboosttestmedv tonumpy gdbreg GradientBoostingRegressorrandomstate0 regfitbostonboosttrainX bostonboosttrainy printmeansquarederrorregpredictbostonboosttrainX bostonboosttrainy Here the MSE on the training set is about 15 On the test set it is about 71 which is likely due to overfitting Now lets limit the number of iterations so that we can inspect the turning point The following code snippet will help us visualize this trainmse testmse nestimators range1030020 for nestimator in nestimators gdbreg GradientBoostingRegressorrandomstate0n estimatorsnestimator gdbregfitbostonboosttrainX bostonboosttrainy trainmseappendmeansquarederrorgdbregpredictboston boosttrainX bostonboosttrainy testmseappendmeansquarederrorgdbregpredictboston boosttestX bostonboosttesty pltfigurefigsize106 pltplotnestimatorstrainmselabelTrain Set MSE pltplotnestimatorstestmse labelTest Set MSE pltlegend pltxlabelNumber of Estimators pltylabelMSE plttitleMSE Score on Train and Test Set 316 Statistics for Ensemble Methods What weve obtained here is the classic behavior of biasvariance tradeoff as shown in the following graph Note that the error doesnt grow significantly after the turning point This is the idea behind decreasing bias without exploding the variance of boosting Figure 1113 Number of estimators in terms of boosting and MSE At this point you have seen how boosting works In the next section we will examine a model so that you fully understand the what bagging is Exploring random forests with scikitlearn Now that were near the end of this chapter I would like to briefly discuss random forests Random forests are not strictly ensemble algorithms because they are an extension of tree methods However unlike bagging decision trees they are different in an important way In Chapter 10 Statistical Techniques for TreeBased Methods we discussed how splitting the nodes in a decision tree is a greedy approach The greedy approach doesnt always yield the best possible tree and its easy to overfit without proper penalization The random forest algorithm does not only bootstrap the samples but also the features Lets take our stroke risk 
dataset as an example The heavy weight is the optimal feature to split on but this rules out 80 of all possible trees along with the other features of the root node The random forest algorithm at every splitting decision point samples a subset of the features and picks the best among them This way it is possible for the suboptimal features to be selected Exploring random forests with scikitlearn 317 Nongreedy algorithms The idea of not using a greedy algorithm to achieve the potential optimal is a key concept in AI For example in the game Go performing a shortterm optimal move may lead to a longterm strategic disadvantage A human Go master is capable of farseeing such consequences The most advanced AI can also make decisions regarding what a human does but at the cost of exponentially expensive computation power The balance between shortterm gain and longterm gain is also a key concept in reinforcement learning Lets take a look a code example of random forest regression to understand how the corresponding API in scikitlearn is called trainmse testmse nestimators range1030020 for nestimator in nestimators regr RandomForestRegressormaxdepth6 randomstate0 nestimatorsnestimator maxfeaturessqrt regrfitbostonboosttrainX bostonboosttrainy trainmseappendmeansquarederrorregrpredictboston boosttrainX bostonboosttrainy testmseappendmeansquarederrorregrpredictboston boosttestX bostonboosttesty pltfigurefigsize106 pltplotnestimatorstrainmselabelTrain Set MSE pltplotnestimatorstestmse labelTest Set MSE pltlegend pltxlabelNumber of Estimators pltylabelMSE plttitleMSE Score on Train and Test Set 318 Statistics for Ensemble Methods The visualization we receive is also a typical biasvariance tradeoff Note that the limitation of the max depth for each individual decision tree being set to 6 can significantly decrease the power of the model anyway Figure 1114 Estimators versus MSE in random forest regression One of the key features of random forests is their robustness against overfitting The relative flat curve of the test sets MSE is proof of this claim Summary In this chapter we discussed several important ensemble learning algorithms including bootstrapping for creating more training sets bagging for aggregating weak estimators boosting for improving accuracy without increasing variance too much and the random forest algorithm Ensemble algorithms are very powerful as they are models that build on top of basic models Understanding them will benefit you in the long run in terms of your data science career In the next chapter we will examine some common mistakes and go over some best practices in data science Section 4 Appendix Section 4 covers some realworld best practices that I have collected in my experience It also identifies common pitfalls that you should avoid Exercises projects and instructions for further learning are also provided This section consists of the following chapters Chapter 12 A Collection of Best Practices Chapter 13 Exercises and Projects 12 A Collection of Best Practices This chapter serves as a special chapter to investigate three important topics that are prevalent in data science nowadays data source quality data visualization quality and causality interpretation This has generally been a missing chapter in peer publications but I consider it essential to stress the following topics I want to affirm that you as a future data scientist will practice data science while following the best practice tips as introduced in this chapter After finishing this chapter you will be able to do 
the following Understand the importance of data quality Avoid using misleading data visualization Spot common errors in causality arguments First lets start with the beginning of any data science project the data itself 322 A Collection of Best Practices Understanding the importance of data quality Remember the old adage that says garbage in garbage out This is especially true in data science The quality of data will influence the entire downstream project It is difficult for people who work on the downstream tasks to identify the sources of possible issues In the following section I will present three examples in which poor data quality causes difficulties Understanding why data can be problematic The three examples fall into three different categories that represent three different problems Inherent bias in data Miscommunication in largescale projects Insufficient documentation and irreversible preprocessing Lets start with the first example which is quite a recent one and is pretty much a hot topicface generation Bias in data sources The first example we are going to look at is bias in data FaceDepixelizer is a tool that is capable of significantly increasing the resolution of a human face in an image You are recommended to give it a try on the Colab file the developers released It is impressive that Generative Adversarial Network GAN is able to create faces of human in images that are indistinguishable from real photos Generative adversarial learning Generative adversarial learning is a class of machine learning frameworks that enable the algorithms to compete with each other One algorithm creates new data such as sounds or images by imitating original data while another algorithm tries to distinguish the original data and the created data This adversarial process can result in powerful machine learning models that can create images sounds or even videos where humans are unable to tell whether they are real However people soon started encountering this issue within the model Among all of them I found the following example discovered by Twitter user Chicken3egg to be the most disturbing one The image on the left is the original picture with low resolution and the one on the right is the picture that FaceDepixelizer generated Understanding the importance of data quality 323 Figure 121 FaceDepixelizer example on the Obama picture If you are familiar with American politics you know the picture on the left is former President Barack Obama However the generated picture turns out to be a completely different guy For a discussion on this behavior please refer to the original tweet httpstwittercomChicken3ggstatus1274314622447820801 This issue is nothing new in todays machine learning research and has attracted the attention of the community A machine learning model is nothing but a digestor of data which outputs what it learns from the data Nowadays there is little diversity in human facial datasets especially in the case of people from minority backgrounds such as African Americans Not only will the characteristics of the human face be learned by the model but also its inherent bias This is a good example where flawed data may cause issues If such models are deployed in systems such as CCTV closedcircuit television the ethical consequences could be problematic To minimize such effects we need to scrutinize our data before feeding it into machine learning models The machine learning community has been working to address the data bias issue As the author of this book I fully support the ethical 
progress in data science and machine learning For example as of October 2020 the author of FaceDepixelizer has addressed the data bias issue You can find the latest updates in the official repository at https githubcomtgbomzeFaceDepixelizer Miscommunication in largescale projects When a project increases in size miscommunication can lead to inconsistent data and cause difficulties Here size may refer to code base size team size or the complexity of the organizational structure The most famous example is the loss of the 125 milliondollar Mars climate orbiter from NASA in September 1999 324 A Collection of Best Practices Back then NASAs Jet Propulsion Laboratory and Lockheed Martin collaborated and built a Mars orbiter However the engineering team at Lockheed Martin used English units of measurement and the team at Jet Propulsion Laboratory used the conventional metric system For readers who are not familiar with the English unit here is an example Miles are used to measure long distances where 1 mile is about 1600 meters which is 16 kilometers The orbiter took more than 280 days to reach Mars but failed to function The reason was later identified to be a mistake in unit usage Lorelle Young president of the US Metric Association commented that two measurement systems should not be used as the metric system is the standard measurement language of all sophisticated science Units in the United States Technically speaking the unit system in the United States is also different from the English unit system There are some subtle differences Some states choose the socalled United States customary while others have adopted metric units as official units This may not be a perfect example for our data science project since none of our projects will ever be as grand or important as NASAs Mars mission The point is that as projects become more complex and teams grow larger it is also easier to introduce data inconsistency into projects The weakest stage of a project A point when a team upgrades their systems dependencies or algorithms is the easiest stage to make mistakes and it is imperative that you pay the utmost attention at this stage Insufficient documentation and irreversible preprocessing The third most common reason for poor quality data is the absence of documentation and irreversible preprocessing We briefly talked about this in Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing Data documentation is sometimes referred to as metadata It is the data about the dataset itself for example information about how the data was obtained who is responsible for queries with respect to the data and the meanings of abbreviations in the dataset Understanding the importance of data quality 325 In data science teams especially for crossteam communication such information is often omitted based on my observations but they are actually very important You cannot assume that the data speaks for itself For example I have used the Texas county dataset throughout this book but the meaning of the rural code cant be revealed unless you read the specs carefully Even if the original dataset is accompanied by metadata irreversible preprocessing still has the potential to damage the datas quality One example I introduced earlier in Chapter 2 Essential Statistics for Data Assessment is the categorization of numerical values Such preprocessing results in the loss of information and there isnt a way for people who take the data from you to recover it Similar processing includes imputation and minmax scaling 
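To make the point concrete, here is a minimal sketch (not taken from the book's repository; the ages column and its values are made up purely for illustration) of why such transformations cannot be undone by whoever receives the processed data:

    import pandas as pd

    # A hypothetical numerical column
    ages = pd.Series([23, 35, 35, 47, 61, 78])

    # Irreversible step 1: categorization (binning). Anyone in "30-49"
    # could have been 30, 42, or 49; the exact values are gone.
    age_groups = pd.cut(ages, bins=[0, 30, 50, 100],
                        labels=["<30", "30-49", "50+"])

    # Irreversible step 2 (in practice): min-max scaling whose min and max
    # were never recorded anywhere.
    scaled = (ages - ages.min()) / (ages.max() - ages.min())

    # A downstream user who only receives age_groups, or scaled without the
    # original min/max documented, cannot reconstruct the original ages.
    print(age_groups.tolist())
    print(scaled.round(2).tolist())

This is exactly the kind of information loss that good documentation, or sharing the raw data alongside the processed version, protects against.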
The key to solving such issues is to embrace a culture of documentation and reproducibility. In a data science team, it is not enough to share a result or a presentation; it is important to share a reproducible result with well-written, easy-to-understand documentation. In such instances, a Jupyter notebook is better than a script because you can put text and code together nicely. For the R ecosystem, there is R Markdown, which is similar to a Jupyter notebook. You can demonstrate the preprocessing pipeline and how the algorithm works in an interactive fashion.

The idea of painless reproducibility applies at different levels, not only within a data science project. For a general Python project, a requirements.txt file specifying the versions of dependencies can ensure the consistency of a Python environment. For this book, in order to avoid possible hassles for readers who are not familiar with pip or virtualenv (the Python package management and virtual environment management tools), I have chosen Google Colab so that you can run the accompanying code directly in the browser.

A general idea of reproducibility
A common developer joke you might hear is "This works on my machine." Reproducibility has been a true headache at different levels. For data science projects, Jupyter Notebooks, R Markdown, and many other tools were developed to make code and presentations reproducible. In terms of the consistency of libraries and packages, we have package management tools such as pip for Python and npm for JavaScript. To enable large systems to work across different hardware and environments, Docker was created to isolate the configuration of a running instance from its host machine. All these technologies solve the same problem: painlessly reproducing a result or a level of performance, consistently. This is a philosophical concept in engineering that you should keep in mind.

In the next section, we'll look at another aspect of common pitfalls in data science: misleading graphs.

Avoiding the use of misleading graphs
Graphics convey much more information than words. Not everyone understands P-values or statistical arguments, but almost everyone can tell whether one piece of a pie plot is larger than another, or whether two line plots share a similar trend. However, there are many ways in which graphs can damage the quality of a visualization or mislead readers. In this section, we will examine two examples. Let's start with the first one.

Example 1: COVID-19 trend
The following graph is a screenshot taken in early April 2020. A news channel showed this graph of new COVID-19 cases per day in the United States. Do you spot anything strange?

Figure 12.2: A screenshot of COVID-19 coverage on a news channel

The issue is on the y axis. If you look closely, the y-axis ticks are not separated equally but follow a strange pattern. For example, the space between 30 and 60 is the same as the space between 240 and 250; the gaps the ticks represent vary from 10 up to 50. Now, I will regenerate this plot without distorting the y-axis ticks, using the following code snippet:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Daily new cases starting from March 18, 2020
    dates = pd.date_range("2020-03-18", periods=15, freq="D")
    daily_cases = [33, 61, 86, 112, 116, 129, 192, 174,
                   344, 304, 327, 246, 320, 339, 376]

    plt.rcParams.update({"font.size": 22})
    plt.figure(figsize=(10, 6))
    plt.plot(dates, daily_cases, label="Daily Cases", marker="o")
    plt.legend()
    plt.gca().tick_params(axis="x", which="major", labelsize=14)
    plt.xlabel("Dates")
    plt.ylabel("New Daily Cases")
    plt.title("New Daily Cases Versus Dates")
    plt.xticks(rotation=45)
    # Annotate each point with its value
    for x, y in zip(dates, daily_cases):
        plt.text(x, y, str(y))
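Before looking at the corrected output, it is worth seeing how the distorted version can be reproduced mechanically. The following sketch is not from the book's repository, and the tick values are assumed purely for illustration (the text only tells us the gaps range from 10 to 50); it redraws the same data at the heights implied by unevenly spaced tick labels drawn at equal intervals, which visually flattens the late surge.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    dates = pd.date_range("2020-03-18", periods=15, freq="D")
    daily_cases = [33, 61, 86, 112, 116, 129, 192, 174,
                   344, 304, 327, 246, 320, 339, 376]

    # Assumed tick values loosely mimicking the news graphic; the actual
    # on-air values are not fully known, so this is only illustrative.
    tick_values = [0, 30, 60, 90, 120, 150, 170, 190,
                   210, 240, 250, 300, 350, 400]
    positions = np.arange(len(tick_values))  # ticks drawn at equal intervals

    # Place each data point at the height implied by the uneven tick scale
    distorted = np.interp(daily_cases, tick_values, positions)

    plt.figure(figsize=(10, 6))
    plt.plot(dates, distorted, marker="o", label="Daily Cases")
    plt.yticks(positions, tick_values)  # equal spacing, unequal values
    plt.xticks(rotation=45)
    plt.title("Same Data, Unevenly Spaced y-Axis Ticks")
    plt.legend()
    plt.show()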
The first snippet, which keeps the y-axis ticks evenly spaced, produces the following visualization:

Figure 12.3: New daily cases without modifying the y-axis ticks

What's the difference between this one and the one the news channel showed to its audience? The jump from 174 to 344 is much more significant, and the increase from 246 to 376 is also more dramatic. The news channel manipulated the spacing so that the distance used to represent a gap of 10 or 30 was also used to represent a gap of 50 once the numbers grew large. In this way, the visual impression of the rapid growth is much weaker. Next, let's look at another example, one that can confuse readers.

Example 2: Bar plot cropping
We are going to use the US county data for this example. Here, I am loading the data with the code snippet we used in Chapter 4, Sampling and Inferential Statistics. The difference is that this time I am looking at the urban influence code of all counties in the United States, not limited to just Texas. The following code snippet performs the visualization:

    from collections import Counter

    df = pd.read_excel("PopulationEstimates.xls", skiprows=2)
    # Drop the first (aggregate) row and count counties per urban influence code
    counter = Counter(df.tail(-1)["Urban_Influence_Code_2003"])
    codes = [key for key in counter.keys() if str(key) != "nan"]
    heights = [counter[code] for code in codes]

    plt.figure(figsize=(10, 6))
    plt.bar(list(map(lambda x: str(x), codes)), heights)
    plt.xticks(rotation=45)
    plt.title("Urban Influence Code for All Counties in the US")
    plt.xlabel("Urban Influence Code")
    plt.ylabel("Count")

The result looks like the following:

Figure 12.4: Urban influence code counting for all counties in the US

Note that I deliberately converted the urban influence code to a string to indicate that it is a categorical variable and not a numerical one. According to the definition, the urban influence code is a 12-level classification of metropolitan and nonmetropolitan counties developed by the United States Department of Agriculture.

Now, this is what happens if I add one more line to the previous code snippet:

    plt.gca().set_ylim(80, 800)

We then obtain the plot shown in the following diagram:

Figure 12.5: Urban influence code counting with limited y-axis values

The new graph uses the same data and the exact same kind of bar plot. This visualization is not wrong, but it is as confusing as it is misleading. There are more than 100 counties with an urban influence code of 3.0 (the third bar from the right-hand side), but the second visualization suggests that there are probably no such counties. The difference between being confusing and being misleading is that misleading graphs are usually crafted carefully and deliberately to convey the wrong message; confusing graphs may not be. The visualizer might not realize the confusion that such a data transformation or capping will cause.

There are other causes of bad visualizations, for example, the improper use of fonts and color. The more intense a color is, the greater the importance we place on that element. Opacity is an important factor, too. A linear change in opacity doesn't always produce a linear impression of quantities, so it is not safe to rely purely on visual elements to make a quantitative judgement.

In the next section, we will talk about another good practice: you should always question causality arguments.

Spot the common errors in this causality argument
A popular conspiracy theory in early 2020 was that the coronavirus is caused by 5G towers being built around the world. People who support such a theory have a powerful-looking graph to back up their argument. I can't trace the origin of such a widespread theory, but here are the
two popular visualizations Figure 126 Map of the world showing a 5G conspiracy theory 332 A Collection of Best Practices The following map is similar but this time limited to the United States Figure 127 Map of the United States showing a 5G conspiracy theory The top portion shows the number of COVID19 cases in the United States while the lower portion shows the installation of 5G towers in the United States These two graphs are used to support the idea that 5G causes the spread of COVID19 Do you believe it Avoiding the use of misleading graphs 333 Lets study this problem step by step from the following perspectives Is the data behind the graphics accurate Do the graphics correctly represent the data Does the visualization support the argument that 5G is resulting in the spread of COVID19 For the first question following some research you will find that Korea China and the United States are indeed the leading players in 5G installation For example as of February 2020 Korea has 5G coverage in 85 cities while China has 5G coverage in 57 cities However Russias first 5G zone was only deployed in central Moscow in August 2019 For the first question the answer can roughly be true However the second question is definitively false All of Russia and Brazil are colored to indicate the seriousness of the spread of COVID19 and 5G rollout The visual elements do not represent the data proportionally Note that the maps first appeared online long before cases in the United States and Brazil exploded People cant tell the quantitative relationship between the rolloutcoverage of 5G and the number of COVID19 cases The graphic is both misleading and confusing Onto the last question The answer to this is also definitively false However the misrepresentation issue got confused with the interpretation of the data for the world map so lets focus on the map of the United States There are many ways to generate maps such as the COVID19 cases map or the 5G tower installation map for example by means of population or urbanization maps Since COVID19 is transmitted mainly in the air and by touching droplets containing the virus population density and lifestyles play a key role in its diffusion This explains why areas with a high population are also areas where there are more confirmed cases From a business standpoint ATT and Verizon will focus heavily on offering 5G services to high population density regions such as New York This explains the density map concerning 5G tower installation Factors such as population density are called confounding factors A confounding factor is a factor that is a true cause behind another two or more factors The causal relationships are between the confounding factor and the other factors not between the nonconfounding factors It is a common trick to use visualization to suggest or represent a causal relationship between two variables without stating the underlying reasons That said how do we detect and rebut the entire causal argument Lets understand this in the next section 334 A Collection of Best Practices Fighting against false arguments To refute false arguments you need to do the following Maintain your curiosity to dig deeper into domain knowledge Gather different perspectives from credible experts To refute false arguments domain knowledge is the key because domain knowledge can reveal the details that loose causal arguments often hide from their audience Take the case of COVID19 for example To ascertain the possible confounding factor of population density you need to know how 
the virus spreads which falls into the domain of epidemiology The first question you need to ask is whether there is a more science based explanation During the process of finding such an explanation you need to learn domain knowledge and can often easily poke a hole in the false arguments Proving causal relations scientifically Proving a causal relation between variables is hard but to fake one is quite easy In a rigorous academic research environment to prove a cause and effect relationship you need to control variables that leave the target variable as being the only explanation of the observed behavior Often such experiments must have the ability to be reproduced in other labs with different groups of researchers However this is sometimes hard or even impossible to reproduce in the case of social science research A second great way of refuting such false arguments is to gather opinions from credible experts A credible expert is someone who has verifiable knowledge and experience in specific domains that is trustworthy As the saying goes given enough eyes on the codes there wont be any bugs Seeking opinions from true experts will often easily reveal the fallacy in false causal arguments In your data science team pair coding and coding reviews will help you to detect errors including but not limited to causal relation arguments An even better way to do this is to show your work to the world put your code on GitHub or build a website to show it to anyone on the internet This is how academic publishing works and includes two important elementspeer review and open access to other researchers Summary 335 Summary In this chapter we discussed three best practices in data science They are also three warnings I give to you always be cautious about data quality always be vigilant about visualization and pay more attention to detect and thereby help avoid false cause and effect relationship claims In the next and final chapter of this book we will use what you have learned so far to solve the exercises and problems 13 Exercises and Projects This chapter is dedicated to exercises and projects that will enhance your understanding of statistics as well as your practical programming skills This chapter contains three sections The first section contains exercises that are direct derivatives of the code examples you saw throughout this book The second section contains some projects I would recommend you try out some of these will be partially independent of what we covered concretely in previous chapters The last section is for those of you who want to dive more into the theory and math aspects of this book Each section is organized according to the contents of the corresponding chapter Once youve finished this final chapter you will be able to do the following Reinforce your basic knowledge about the concepts and knowledge points that were covered in this book Gain working experience of a projects life cycle Understand the math and theoretical foundations at a higher level Lets get started 338 Exercises and Projects Note on the usage of this chapter You are not expected to read or use this chapter sequentially You may use this chapter to help you review the topics that were covered in a certain chapter once youve finished it You can also use it as reference for your data scientist interview Exercises Most of the exercises in each chapter dont depend on each other However if the exercises do depend on each other this relationship will be stated clearly Chapter 1 Fundamentals of Data Collection Cleaning 
and Preprocessing Exercises related to Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing are listed in this section Exercise 1 Loading data Load the autompg data as a pandas DataFrame by using the pandasreadcsv function This data can be found at httpsarchiveicsucieduml machinelearningdatabasesautompg Hint You may find that the default argument fails to load the data properly Search the document of the readcsv function identify the problem and find the solution Exercise 2 Preprocessing Once youve loaded the autompg data preprocess the data like so 1 Identify the type of each columnfeature as numerical or categorical 2 Perform minmax scaling for the numerical variables 3 Impute the data with the median value Exercises 339 Hint This dataset which I chose on purpose can be ambiguous in terms of determining the variable type Think about different choices and their possible consequences for downstream tasks Exercise 3 Pandas and API calling Sign up for a free account at httpsopenweathermaporgapi obtain an API key and read the API documentation carefully Make some API calls to obtain the hourly temperature for the city you live in for the next 24 hours Build a pandas DataFrame and plot a time versus temperature graph Hint You may need Pythons datetime module to convert a string into a valid datetime object Chapter 2 Essential Statistics for Data Assessment Exercises related to Chapter 2 Essential Statistics for Data Assessment are listed in this section Exercise 1 Other definitions of skewness There are several different definitions of skewness Use the data that we used to examine skewness in Chapter 2 Essential Statistics for Data Assessment to calculate the following versions of skewness according to Wikipedia httpsenwikipediaorgwiki Skewness Pearsons second skewness coefficient Quantilebased measures Groeneveld and Meedens coefficient Exercise 2 Bivariate statistics Load the autompg dataset that we introduced in Exercise 1 Loading data for Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing Identify the bivariate correlation with all the numerical variables Which variables are positively correlated and which ones are negatively correlated Do these correlations make reallife sense to you 340 Exercises and Projects Exercise 3 The crosstab function Can you implement your own crosstab function It can take two lists of equal length as input and generate a DataFrame as output You can also set an optional parameter such as the name of the input lists Hint Use the zip function for a onepass loop Chapter 3 Visualization with Statistical Graphs Exercises related to Chapter 3 Visualization with Statistical Graphs are listed in this section Exercise 1 Identifying the components This is an open question Try to identify the three components shown on Seaborns gallery page httpsseabornpydataorgexamplesindexhtml The three components of any statistical graph are data geometry and aesthetics You may encounter new kinds of graph types here which makes this a great opportunity to learn Exercise 2 Queryoriented transformation Use the autompg data we introduced in Exercise 1 Loading data for Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing transform it into the long format and generate a boxplot called the Seaborn tips boxplot Namely you want the xaxis to be the cylinders variable and the yaxis to be the mpg data Exercise 3 Overlaying two graphs For the birth rate and death rate of the Anderson county overlay the line plot and the bar plot in the same graph 
Some possible enhancements you can make here are as follows Choose a proper font and indicate the value of the data with numbers Choose different symbols for the death rate and the birth rate Exercises 341 Exercise 4 Layout Create a 2x2 layout and create a scatter plot of the birth rate and death rate of your four favorite states in the US Properly set the opacity and marker size so that it is visually appealing Exercise 5 The pairplot function The pairplot function of the seaborn library is a very powerful function Use it to visualize all the numerical variables of the autompg dataset we introduced in Exercise 1 Loading data of Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing Study its documentation so that you know how to add a regression line to the offdiagonal plots Does the regression line make sense to you Compare the result with the correlation result you obtained in Exercise 2 Bivariate statistics of Chapter 2 Essential Statistics for Data Assessment Hint Read the documentation regarding pairplot for more information httpsseabornpydataorggeneratedseaborn pairplothtml Chapter 4 Sampling and Inferential Statistics Exercises related to Chapter 4 Sampling and Inferential Statistics are listed in this section Exercise 1 Simple sampling with replacement Create a set of data that follows a standard normal distribution of size 1000 Run simple random sampling with and without replacement Increase the sampling size What happens if the sampling size of your replacement sampling exceeds 1000 Exercise 2 Stratified sampling Run stratified sampling on the population of county stratified with respect to states rather than ruralurban Continuum Code2013 Sample two data points from each state and compare the results by running the sampling multiple times 342 Exercises and Projects Exercise 3 Central limit theorem Verify the central limit theorem by summing nonnormal random variables by following the distributions listed here just pick the easiest form for each If you are not familiar with these distributions please refer to Chapter 5 Common Probability Distributions Binomial distribution Uniform distribution Poisson distribution Exponential distribution Can you visually examine the number of random variables that need to be summed together to approach a normal distribution for each of the aforementioned distributions What is the intuition behind your observation Chapter 5 Common Probability Distributions Exercises related to Chapter 5 Common Probability Distributions are listed in this section Exercise 1 Identify the sample space and the event corresponding to the probability being asked in the following statements By tossing four fair coins find the probability of at least getting two heads The probability that a bus arrives between 800 AM and 830 AM A battleship fires three missiles The battleships target will be destroyed if at least two missiles hit its target Find the probability that the battleships target will be destroyed Assume that the likelihood of a woman giving birth to a boy or a girl are the same If we know a family that has three kids has a girl in the family find the probability that the family has at least one girl How about the probability of having at least a boy Exercise 2 Proof of probability equations Prove the following equation 𝑃𝑃𝐴𝐴 𝐵𝐵 𝑃𝑃𝐴𝐴 𝑃𝑃𝐵𝐵 𝑃𝑃𝐴𝐴 𝐵𝐵 Exercises 343 Hint Enumerate all possibilities of both the lefthand side and the righthand side of the equation Prove that if an event is in the lefthand side that is the union of A and B then it is in the 
expression of the righthand side In the same regard if it is in the righthand side then prove it is also in the lefthand side Exercise 3 Proof of De Morgans law De Morgans law states two important transformation rules as stated here https enwikipediaorgwikiDeMorgan27slaws Use the same technique you used in the previous exercise to prove them Exercise 4 Three dice sum Write a program that calculates the distribution in a sum of three fair dice Write a program that will simulate the case where the dice are not fair so that for each dice it is two times more likely to have an even outcome than an odd outcome Hint Write some general simulation code so that you can specify the probability associated with each face of the dice and the total number of dice Then you are free to recover the central limit theorem easily Exercise 5 Approximate the binomial distribution with the Poisson distribution The binomial distribution with a large n can sometimes be very difficult to calculate because of the involvement of factorials However you can approximate this with Poisson distribution The condition for this is as follows n p 0 np λ where λ is a constant If all three conditions are met then the binomial distribution can be approximated by the Poisson distribution with the corresponding parameter λ Prove this visually Most of the time n must be larger than 100 and p must be smaller than 001 344 Exercises and Projects Exercise 6 Conditional distribution for a discrete case In the following table the first column indicates the grades of all the students in an English class while the first row indicates the math grades for the same class The values shown are the count of students and the corresponding students Figure 131 Scores Answer the following questions Some of these question descriptions are quite complicated so read them carefully What is the probability of a randomly selected student having a B in math What is the probability of a randomly selected student having a math grade thats no worse than a C and an English grade thats no worse than a B If we know that a randomly selected student has a D in math whats the probability that this student has no worse than a B in English Whats the minimal math grade a randomly selected student should have where you have the confidence to say that this student has at least a 70 chance of having no worse than a B in English Chapter 6 Parameter Estimation Exercises related to Chapter 6 Parameter Estimation are listed in this section Exercise 1 Distinguishing between estimation and prediction Which of the following scenarios belong to the estimation process and which belong to the prediction process Find out the weather next week Find out the battery capacity of your phone based on your usage profile Find out the arrival time of the next train based on the previous trains arrival time Exercises 345 Exercise 2 Properties of estimators If you were to use the method of moments to estimate a uniform distributions boundaries is the estimator thats obtained unbiased Is the estimator thats obtained consistent Exercise 3 Method of moments Randomly generate two variables between 0 and 1 without checking their values Set them as the μ and σ of a Gaussian random distribution Generate 1000 samples and use the method of moments to estimate these two variables Do they agree Exercise 4 Maximum likelihood I We toss a coin 100 times and for 60 cases we get heads Its possible for the coin to be either fair or biased to getting heads with a probability of 70 Which is more likely Exercise 
5 Maximum likelihood II In this chapter we discussed an example in which we used normal distribution or Laplace distribution to model the noise in a dataset What if the noise follows a uniform distribution between 1 and 1 What result will be yielded Does it make sense Exercise 6 Law of total probability Lets say the weather tomorrow has a 40 chance of being windy and a 60 chance of being sunny On a windy day you have a 50 chance of going hiking while on a sunny day the probability goes up to 80 Whats the probability of you going hiking tomorrow without knowing tomorrows weather What about after knowing that its 90 likely to be sunny tomorrow Exercise 7 Monty Hall question calculation Calculate the quantity 𝑃𝑃𝐶𝐶𝐸𝐸𝐵𝐵 that we left in the Monty Hall question Check its value with the provided answer 346 Exercises and Projects Chapter 7 Statistical Hypothesis Testing Exercises related to Chapter 7 Statistical Hypothesis Testing are listed in this section Exercise 1 Onetailed and twotailed tests Is the following statement correct The significant level for a onetailed test will be twice as large as it is for the twotailed test Exercise 2 Pvalue concept I Is the following statement correct For a discrete random variable where every outcome shares an equal probability any outcome has a Pvalue of 1 Exercise 3 Pvalue concept II Are the following statements correct The value of the Pvalue is obtained by assuming the null hypothesis is correct The Pvalue by definition has a falsepositive ratio Hint Falsepositive naively means something isnt positive but you misclassify or mistreat it as being positive Exercise 4 Calculating the Pvalue Calculate the Pvalue of observing five heads when tossing an independent fair coin six times Exercise 5 Table looking Find the corresponding value in a onetailed tdistribution table where the degree of freedom is 5 and the significance level is 0002 Exercise 6 Decomposition of variance Prove the following formula mathematically 𝑆𝑆𝑇𝑇 2 𝑆𝑆𝐵𝐵 2 𝑆𝑆𝑊𝑊 2 Exercises 347 Exercise 7 Fishers exact test For the linkclicking example we discussed in this chapter read through the Fishers exact test page httpsenwikipediaorgwikiFisher27sexacttest Run a Fishers exact test on the device and browser variables Hint Build a crosstab first It may be easier to refer to the accompanying notebook for reusable code Exercise 8 Normality test with central limit theorem In Exercise 3 Central limit theorem of Chapter 4 Sampling and Inferential Statistics we tested the central limit theorem visually In this exercise use the normality test provided in Chapter 7 Statistical Hypothesis Testing to test the normality of the random variable that was generated from summing nonnormal random variables Exercise 9 Goodness of fit In Chapter 7 Statistical Hypothesis Testing we ran a goodness of fit test on the casino card game data Now lets assume the number of hearts no longer follows a binomial distribution but a Poisson distribution Now run a goodness of fit test by doing the following 1 Find the most likely parameter for the Poisson distribution by maximizing the likelihood function 2 Run the goodness of fit test Suggestion This question is a little tricky You may need to review the procedure of building and maximizing a likelihood function to complete this exercise Exercise 10 Stationary test Find the data for the total number of COVID19 deaths in the United States from the date when the first death happened to July 29th where the number of patients that died had reached 150000 Run a stationary test on the data 
and the first difference of the data 348 Exercises and Projects Chapter 8 Statistics for Regression Exercises related to Chapter 8 Statistics for Regression are listed in this section Exercise 1 Rsquared Are the following statements correct A bigger R2 is always a good indicator of good fit for a singlevariable linear model For a multivariable linear model an adjusted R2 is a better choice when evaluating the quality of the model Exercise 2 Polynomial regression Is the following statement correct To run a regression variable y over single variable x an 8th order polynomial can fit an arbitrary dataset of size 8 If yes why Exercise 3 Doubled R2 Is the following statement correct If a regression model has an R2 of 09 suppose we obtained another set of data that happens to match the original dataset exactly similar to a duplicate What will happen to the regression coefficients What will happen to the R2 Exercise 4 Linear regression on the autompg dataset Run simple linear regression on the autompg dataset Obtain the coefficients between the make year and mpg variables Try to do this by using different methods as we did in the example provided in this chapter Exercise 5 Collinearity Run multivariable linear regression to fit the mpg in the autompg dataset to the rest of the numerical variables Are the other variables highly collinear Calculate VIF to eliminate two variables and run the model again Is the adjusted R2 decreasing or increasing Exercises 349 Exercise 6 Lasso regression and ridge regression Repeat Exercise 5 Collinearity but this time with lasso regularization or ridge regularization Change the regularization coefficient to control the strength of the regularization and plot a set of line plots regarding the regularization parameter versus the coefficient magnitude Chapter 9 Statistics for Classification Exercises related to Chapter 9 Statistics for Classification are listed in this section Exercise 1 Odds and log of odds Determine the correctness of the following statements The odds of an event can take any value between 0 and infinity The log of odds has the meaning as a probability Hint Plot the relationship between the probability and the log of odds Exercise 2 Confusion matrix Determine the proper quadrant for the following scenarios in the coefficient matrix Diagnose a man as pregnant If there are incoming planes and the detector failed to find them A COVID19 patient was correctly diagnosed as positive Exercise 3 F1 score Calculate the F1 score for the following confusion matrix Figure 132 Confusion matrix 350 Exercises and Projects Exercise 4 Grid search for optimal logistic regression coefficients When we maximized the loglikelihood function of the stock prediction example I suggested that you use grid search to find the optimal set of slopes and intercepts Write a function that will find the optimal values and compare them with the ones I provided there Exercise 5 The linregress function Use the linregress function to run linear regression on the Netflix stock data Then verify the R2 values agreement with our manually calculated values Exercise 6 Insufficient data issue with the Bayes classifier For a naïve Bayes classifier if the data is categorical and theres not enough of it we may encounter an issue where the prediction encounters a tie between two or even more possibilities For the stroke risk prediction example verify that the following data gives us an undetermined prediction weighthighhighoildietnosmokingno Exercise 7 Laplace smoothing One solution for solving the 
insufficient data problem is to use Laplace smoothing also known as additive smoothing Please read the wiki page at httpsenwikipedia orgwikiAdditivesmoothing and the lecture note from Toronto university at httpwwwcstorontoedubonnercourses2007scsc411 lectures03bayeszemelpdf before resolving the issue that was raised in Exercise 4 Grid search for optimal logistic regression coefficients Exercise 8 Crossvalidation For the autompg data we introduced in Exercise 1 Loading data of Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing use 5fold crossvalidation to train multivariable linear regression models and evaluate their performance using the mean squared error metric Exercises 351 Exercise 9 ROC curve One important concept I skipped due to space limitations is the ROC curve However it is easy to replicate For the stock prediction logistic regression model pick a series of equally spaced thresholds between 0 and 1 and then create a scatter plot of the true positive rates and the true positive rates Examine the result What you will have obtained is an ROC curve You can find more information about the ROC curve at https enwikipediaorgwikiReceiveroperatingcharacteristic Exercise 10 Predicting cylinders Use the mpg horsepower and displacement variables of the autompg dataset to classify the cylinder variable using a Gaussian Naïve classifier Check out the documentation at httpsscikitlearnorgstablemodulesnaive bayeshtmlgaussiannaivebayes to find out more Hint Gaussian Naïve Bayes is the continuous version of the categorical Naïve Bayes method However it is not the only option Feel free to explore other Naïve Bayes classifiers and compare their performance You are also permitted to remove oddnumber cylinder samples since they are very rare and not informative in general Chapter 10 Statistics for TreeBased Methods Exercises related to Chapter 10 Statistics for TreeBased Methods are listed in this section Exercise 1 Tree concepts Determine the correctness of the following statements A tree can only have a single root node A decision node can only have at most two child nodes A node can only have one parent node 352 Exercises and Projects Exercise 2 Gini impurity visualized For a threecategory dataset with categories A B and C produce a 2D visualization of the Gini impurity as a function of 𝑃𝑃𝐴𝐴 and 𝑃𝑃𝐵𝐵 where 0 𝑃𝑃𝐴𝐴 𝑃𝑃𝐵𝐵 1 Hint Although we have three categories the requirement of summing to one leaves us with two degrees of freedom Exercise 3 Comparing Gini impurity with entropy Another criterion for tree node splitting is known as entropy Read the Wikipedia page at httpsenwikipediaorgwikiEntropyinformationtheory and write functions that will help you redo Exercise 2 Gini impurity revisited but with entropy being used instead What do you find In terms of splitting nodes which method is more aggressiveconservative What if you increase the number of possible categories Can you perform a Monte Carlo simulation Exercise 4 Entropybased tree building Use entropy instead of Gini impurity to rebuild the stroke risk decision tree Exercise 5 Nonbinary tree Without grouping lowrisk and middlerisk groups build a threecategory decision tree from scratch for the stroke risk dataset You can use either Gini impurity or entropy for this Exercise 6 Regression tree concepts Determine the correctness of the following statements The number of possible outputs a regression tree has is the same as the number of partitions it has for the feature space Using absolute error rather than MSE will yield the same 
partition result To split over a continuous variable the algorithm has to try all possible values so for each splitting the time complexity of the naïve approach will be ONM where N is the number of continuous features and M is the number of samples in the node Exercises 353 Exercise 7 Efficiency of regression tree building Write some pseudocode to demonstrate how the tree partition process can be paralleled If you cant please explain which step or steps prohibit this Exercise 8 sklearn example Use sklearn to build a regression tree that predicts the value of mpg in the autompg dataset with the rest of the features Then build a classification tree that predicts the number of cylinders alongside the rest of the features Exercise 9 Overfitting and pruning This is a hard question that may involve higher coding requirements The sklearn API provides a parameter that helps us control the depth of the tree which prevents overfitting Another way we can do this is to build the tree so that its as deep as needed first then prune the tree backward from the leaves Can you implement a helperutility function to achieve this You may need to dive into the details of the DecisionTreeClassifier class to directly manipulate the tree object Chapter 11 Statistics for Ensemble Methods Exercises related to Chapter 11 Statistics for Ensemble Methods are listed in this section Exercise 1 Biasvariance tradeoff concepts Determine the correctness of the following statements Train set and test set splitting will prevent both underfitting and overfitting Nonrandomized sampling will likely cause overfitting Variance in terms of the concept of biasvariance tradeoff means the variance of the model not the variance of the data Exercise 2 Bootstrapping concept Determine the correctness of the following statements If there is any ambiguity please illustrate your answers with examples The bigger the variance of the sample data the more performant bootstrapping will be Bootstrapping doesnt generate new information from the original dataset 354 Exercises and Projects Exercise 3 Bagging concept Determine the correctness of the following statements Each aggregated weak learner is the same weight BaggingClassifier can be set so that its trained in parallel The differences between weak learners are caused by the differences in the sampled data that they are exposed to Exercise 4 From using a bagging tree classifier to random forests You may have heard about random forests before They are treebased machine learning models that are known for their robustness to overfitting The key difference between bagging decision trees and random forests is that a random forest not only samples the records but also samples the features For example lets say were performing a regression task where were using the autompg dataset Here the cylinder feature may not be available during one iteration of the node splitting process Implement your own simple random forest class Compare its performance with the sklearn bagging classifier Exercise 5 Boosting concepts Determine the correctness of the following statements In principle boosting is not trainable in parallel Boosting in principle can decrease the bias of the training set indefinitely Boosting linear weak learners is not efficient because the linear combinations of a linear model is also a linear model Exercise 6 AdaBoost Use AdaBoost to predict the number of cylinders in the autompg dataset against the rest of the variables Exercise 7 Gradient boosting tree visualization Using the tree 
visualization code that was introduced in Chapter 12 Statistics for Tree Based Methods visualize the decision rules of the weak learnerstrees for a gradient descent model Select trees from the first 10 iterations first 40 iterations and then every 100 iterations Do you notice any patterns Project suggestions 355 Everything up to this point has all been exercises Ive prepared for you The next section is dedicated to the projects you can carry out Each project is associated with a chapter and a topic but its recommended that you integrate these projects to build a comprehensive project that you can show to future employers Project suggestions These projects will be classified into three different categories as listed here Elementary projects Elementary projects are ones where you only need knowledge from one or two chapters and are easy to complete Elementary projects only require that you have basic Python programming skills Comprehensive projects Comprehensive projects are ones that require you to review knowledge from several chapters Having a thorough understanding of the example code provided in this book is required to complete a comprehensive project Capstone projects Capstone projects are projects that involve almost all the contents of this book In addition to the examples provided in this book you are expected to learn a significant amount of new knowledge and programming skills to complete the task at hand Lets get started Nontabular data This is an elementary project The knowledge points in this project can be found in Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing and Chapter 2 Essential Statistics for Data Assessment The university dataset in the UCI machine learning repository is stored in a nontabular format httpsarchiveicsuciedumldatasetsUniversity Please examine its format and perform the following tasks 1 Examine the data format visually and then write down some patterns to see whether such patterns can be used to extract the data at specific lines 2 Write a function that will systematically read the data file and store the data contained within in a pandas DataFrame 3 The data description mentioned the existence of both missing data and duplicate data Identify the missing data and deduplicate the duplicated data 356 Exercises and Projects 4 Classify the features into numerical features and categorical features 5 Apply minmax normalization to all the numerical variables 6 Apply median imputation to the missing data Nontabular format Legacy data may be stored in nontabular format for historical reasons The format the university data is stored in is a LISPreadable format which is a powerful old programming language that was invented more than 60 years ago Realtime weather data This is a comprehensive project The knowledge points are mainly from Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing and Chapter 3 Visualization with Statistical Graphs The free weather API provides current weather data for more than 200000 cities in the world You can apply for a free trial here httpsopenweathermaporgapi In this example you will build a visualization of the temperature for major US cities Refer to the following instructions 1 Read the API documentation for the current endpoint https openweathermaporgcurrent Write some short scripts to test the validity of your API key by querying the current weather in New York If you dont have one apply for a free trial 2 Write another function that will parse the returned data into tabular format 3 Query the 
current weather in Los Angeles Chicago Miami and Denver as well You may want to store their zip codes in a dictionary for reference Properly set the fonts legends and size of markers The color of the line will be determined by the main field of the returned JSON object You are encouraged to choose a color map that associates warmer colors with higher temperatures for example 4 Write another function that requeries the weather information in each city and replots the visualization every 20 minutes Realtime weather data 357 Colormap In data sciencerelated visualization a colormap is a mapping from ordered quantities or indices to an array of colors Different colormaps significantly change the feeling of viewers For more information please refer to httpsmatplotliborg311tutorialscolors colormapshtml I also pointed out the main field in the returned JSON for you Figure 133 The main field of the returned json object We will use the functions and code you developed in this project for a capstone project later in this chapter The rest of this project is quite flexible and up to you 358 Exercises and Projects Goodness of fit for discrete distributions This is an elementary project The topics involved are from Chapter 3 Visualization with Statistical Graphs to Chapter 7 Statistical Hypothesis Testing The description is fancy but you should be able to divide it into small parts This is also a project where many details are ambiguous and you should define your own questions specifically In this project you will write a bot that can guess the parameters of discrete distributions You also need to visualize this process Suppose there is a program that randomly selects integer λ from the list 51020 first then generates Poissondistributed samples based on the pregenerated λ The program will also recreate λ after generating n samples where n is another random integer variable thats uniformly distributed between 500 and 600 By doing this you will have obtained 100000 data points generated by this program Write a function that will calculate the possibilities of λ behind every data point Then visualize this You can approach this problem by using the following instructions 1 First you need to clarify the definition Lets say you can calculate the goodness of fit tests Pvalue and then use this Pvalue to indicate how likely a specific λ is for the parameter of a distribution You can also calculate the likelihoods and compare them The point is that you need to define quantities that can describe the ambiguous term possibilities in the question This is up to you but I used the goodness of fit test as an example 2 Then you should write a program that will generate 100000 pieces of data 3 Next define a window size The window will be moved to capture the data and run the goodness of fit test on the windowed data Choose a window size and justify your choice Whats the problem with a window size being too small and whats the problem with a window size being too large 4 Calculate the Pvalues for the goodness of fit results Plot them along with the original data Choose your aesthetics wisely The idea behind this exercise There are many observations that can be modeled as a process thats controlled by a hidden parameter Its key to model such a process and determine the parameters In this project you already know that the mechanism behind the random variable generation is a Poisson process but in most cases you wont know this This project is a simplified scenario Realtime weather data 359 Building a weather prediction web 
app This is a capstone project To complete this project you should get your hands on the Dash framework https dashgalleryplotlyhostPortal Dash is a framework that you can use to quickly turn a Python script into a deployable data science application You should at least go over the first four sessions of the tutorial and focus on the Dash map demo to learn how to render maps httpsdashgalleryplotlyhostdashmapd demo To finish this project follow these steps 1 Finish the prerequisite mentioned in the project description 2 Write a function that will map a list of city zip codes to markers on a US map and render them Learn how to change the contents on the map when hovering over it with your mouse 3 Use the weather API to obtain a weather report for the last 7 days for the major cities we listed in earlier projects Build a linear regressor that will predict the temperature for the following data Write a function that will automate this step so that every time you invoke the function new data will be queried 4 Design a layout where the page is split into two parts 5 The left panel should be a map where each city is highlighted with todays temperature value 6 The right panel should be a line chart where regression is performed so that tomorrows temperature is predicted and distinguished from known previous temperature 7 This step is optional You are encouraged to use other machine learning algorithms rather than simple linear regression to allow the users to switch between different algorithms For example you can use both simple linear regression and regression trees 8 To train a regression tree you might need more historical data Explore the API options and give it a try 9 You need a UI component to allow the users to change between algorithms The toggle button may be what you need httpsdashplotlycomdash daqtoggleswitch 360 Exercises and Projects This project is particularly useful because as a data scientist an interactive app is probably the best way to demonstrate your skills in terms of both statistics and programming The last project is similar in this regard Building a typing suggestion app This is a capstone project In this project you are going to build a web app that predicts what the users typing by training them on large word corpus There are three components in this app but they are not strongly correlated You can start from any of them Processing word data The typing suggestion is based on Bayes theorem where the most likely or topk predictions are made by examining a large set of sentences You can obtain an English text corpus from the English Corpora website https wwwenglishcorporaorghistoryasp You can start from a small corpus such as the manually annotated subcorpus project httpwwwanc orgdatamascdownloadsdatadownload You need to tokenize the data by sequencing words You are encouraged to start with tutorials from SpaCy or NLTK You should find material online such as httpsrealpythoncom naturallanguageprocessingspacypython to help with this Building the model You need to build a twogram model which means counting the number of appearances of neighboring word pairs For example I eat is likely to be more frequent than I fly so the word eat will be more likely to show up than fly You need to create a model or module that can perform such a task quickly Also you may want to save your data in a local disk persistently so that you dont need to run the model building process every time your app starts Building the app The last step is to create a UI The documentation for the input 
component can be found here httpsdashplotlycomdashcore componentsinput You need to decide on how you want your user to see your suggestions You can achieve this by creating a new UI component One additional feature you may wish to add to your app is a spam filter By doing this you can inform your user of how likely the input text looks like a spam message in real time Further reading With that you have reached the last part of this book In this section I am going to recommend some of the best books on data science statistics and machine learning Ive found all of which can act as companions to this book I have grouped them into categories and shared my personal thoughts on them Further reading 361 Textbooks Books that fall into this category are read like textbooks and are often used as textbooks or at least reference books in universities Their quality has been proven and their value is timeless The first one is Statistical Inference by George Casella 2nd Edition which book covers the first several chapters of this book It contains a multitude of useful exercises and practices all of which are explained in detail It is hard to get lost when reading this book The second book is The Elements of Statistical Learning by Trevor Hastie Robert Tibshirani and Jerome Friedman 2nd Edition This book is the bible of traditional statistical learning Its not easy for beginners who are not comfortable with the conciseness of math proof There is another book An Introduction to Statistical Learning With Application in R by Gareth James and Daniela Witten that is simpler and easier to digest Both books cover all the topics in this book starting from Chapter 6 Parametric Estimation Visualization The first book I recommend about visualization is The Visual Display of Quantitative Information by Edward R Tufte This book will not teach you how to code or plot a real visualization instead teaching you the philosophy surrounding visualization The second book I recommend is also by Edward R Tufte It is called Visual and Statistical Thinking Displays of Evidence for Making Decisions It contains many examples where visualizations are done correctly and incorrectly It is also very entertaining to read I wont recommend any specific books that dedicate full content to coding examples for visualization here The easiest way to learn about visualization is by referring to this books GitHub repository and replicating the examples provided Of course it would be great if you were to get a hard copy so that you can look up information quickly and review it frequently Exercising your mind This category contains books that dont read like textbooks but also require significant effort to read think about and digest The first book is Common Errors in Statistics and How to Avoid Them by Phillip I Good and James W Hardin This book contains concepts surrounding the usage of visualizations that are widely misunderstood 362 Exercises and Projects The second book is Stat Labs Mathematical Statistics Through Applications Springer Texts in Statistics by Deborah Nolan and Terry P Speed This book is unique because it starts every topic with realworld data and meaningful questions that should be asked You will find that this books reading difficulty increases quickly I highly recommend this book You may need to use a pen and paper to perform any calculations and tabulation thats required of you Summary Congratulations on reaching the end of this book In this chapter we introduced exercises and projects of varying difficulty You were also 
provided with a list of additional books that will help you as you progress through the exercises and projects mentioned in this chapter. I hope these additional materials will boost your statistical learning progress and make you an even better data scientist. If you have any questions about the content of this book, please feel free to open an issue on the official GitHub repository for this book: https://github.com/PacktPublishing/Essential-Statistics-for-Non-STEM-Data-Analysts. We are always happy to answer your questions there.
bootstrapping about 303305 advantages 303 boxplot about 37 59 reference link 73 C capstone projects 355 categorical variables classifying 26 handling 43 versus numerical variables 2629 causeeffect relationship 228 central limit theorem CLT 107 108 central moments 136 children nodes 276 Chisquared distributions 186 Chisquared goodnessoffit test 189 classification tasks used for overviewing tree based methods 274278 classification tree about 274 growing 278 279 pruning 278 279 coefficient of determination 221 227229 collinearity analysis about 239 240 handson experience 233 235239 colormaps reference link 357 comprehensive projects 355 conditional distribution 126 127 confounding factors 333 consistency 134 continuous PDF Pvalue calculating from 170174 continuous probability distribution about 121 exponential distribution 122 123 normal distribution 124 125 types 121 uniform distribution 122 continuous variable transforming to categorical variable 46 correlation 48 correlation coefficient 220 correlation matrix 48 49 covariance 48 crosstabulation 50 crossvalidation 267 269272 Cumulative Distribution Function CDF 115 172 D Dash framework reference link 359 Dash map reference link 359 data collecting from various data sources 5 obtaining from API 69 obtaining from scratch 9 10 preparing to fit plotting function API 7376 reading directly from files 5 data imputation about 11 dataset preparing for 1115 with mean or median values 16 17 with modemost frequent value 18 19 Index 369 data memorization 300 data quality 322 data quality problems about 322 bias in data sources 322 323 insufficient documentation 324 325 irreversible preprocessing 324 325 miscommunication in large scale projects 323 data standardization about 21 22 performing 21 decision node 276 decision tree advantages 277 methods 288 performance evaluating 287289 decision tree terminologies children node 276 decision node 276 depth 276 explainability 277 leaf node 276 overfit 277 parent node 276 pruning 276 root node 275 Degree of Freedom DOF 173 depth 276 discrete distributions 358 discrete events Pvalue calculating from 168 169 discrete probability distributions about 116 117 Bernoulli distribution 117 118 Binomial distribution 118120 Poisson distribution 120 121 types 117 E efficiency 135 efficiency of an estimator 233 elementary projects 355 empirical probability 116 estimand 133 estimate 133 estimation 132 estimator about 133 evaluating 133 features 132 133 relationship connecting between simple linear regression model 230 231 estimator criteria list biasness 134 consistency 134 efficiency 135 explainability 277 exponential distribution 122 123 extreme outliers 42 F False Negative FN 256 False Positive FP 256 Fdistribution 170 fear and greedy index 248 Federal Reserve URL 5 firstprinciples thinking 168 Fishers exact test 211 frequencies 43 370 Index G Generalized Linear Model GLM 248 Generative Adversarial Learning GAN 322 geometry customizing 65 geometry example axissharing 65 scale change 68 69 subplots 65 ggplot2 components aesthetics 54 data 54 geometries 54 Gini impurity 279 goodnessoffit test 189 Google Map API URL 5 Google Map Place API URL 7 Gradient Descent Boost GDB about 313 working 314316 H harmonic mean 257 higherorder interactions 212 histogram plot 58 hypothesis 162 hypothesis test 179 hypothesis testing 162 230 I independency 127 Interquartile Range IQR 38 59 J joint distribution 126 K kfold crossvalidation 270 L Laplace distribution 146 lasso regression 241 lasso regularization 241 law of total probability 156 
leaf node 276 least squared error linear regression 220227 Least Square Error LSE 224 leaveoneout crossvalidation 271 likelihood function 141143 250 logistic regression used for learning regularization 241245 logistic regression classifier classification problem 250 implementing from scratch 251256 performance evaluating 256259 working 248250 loss function 314 M markers 7072 math example concepts 164 165 Maximum Likelihood Estimation MLE 141 230 Maximum Likelihood Estimation MLE approach Index 371 about 141 155160 advantages 141 applying with Python 141 for modeling noise 145155 for uniform distribution boundaries 144 likelihood function 141143 mean 30 31 Mean Squared Error MSE 135 median 31 59 meta algorithm 312 Minimum Variance Unbiased Estimator MVUE 135 misleading graphs false arguments 334 usage avoiding 326 misleading graphs example Bar plot cropping 328333 COVID19 trend 326328 mixed data types handling 43 mode 32 33 modeling noise using in maximum likelihood estimator approach 145155 multivariate descriptive statistics using 47 multivariate linear regression handson experience 233239 N naïve Bayes classifier building from scratch 259267 nonprobability sampling about 86 risks 86 87 nonprobability sampling methods purposive sampling 87 volunteer sampling 87 nonstationarity time series example 201205 nontabular data 355 356 normal distribution 124 125 null hypothesis 163 numerical variables classifying 26 versus categorical variables 2629 O objective function 241 onesample test 172 onetailed hypothesis 166 onetailed test 172 Ordinary Least Squares OLS 232 248 outlier about 20 removing 20 21 outlier detection 4042 59 outliers 60 overfit 277 overfitted model 268 overfitting 267 269272 302 P pandas melt function reference link 74 parameter estimation concepts 132 133 methods using 136 137 parameter estimation examples 911 phone calls 137 138 uniform distribution 139 140 parent node 276 Poisson distribution 120 121 372 Index positive predictive value 257 posterior probability 259 precision 257 prediction 132 presentationready plotting font 80 styling 78 tips 78 Principle Component Analysis PCA 241 prior probability 259 probability types 116 Probability Density Function PDF 114 115 136 probability fundamental concepts events 110 sample space 110 Probability Mass Function PMF 111114 137 probability sampling about 86 safer approach 88 simple random sampling SRS 8893 stratified random sampling 9397 systematic random sampling 97 probability types empirical probability 116 subjective probability 116 proportions 44 45 pruning 276 pure node 282 purposive sampling 87 Pvalue calculating from continuous PDF 170174 calculating from discrete events 168 169 Pvalues properties 167 Python Matplotlib package examples 54 plotting types exploring 56 statistical graph elements 5456 Python Matplotlib package plotting types about 56 bar plot 6265 boxplot 58 59 histogram plot 58 outlier detection 58 59 scatter plot 60 simple line plot 56 57 Q QQ plots 185 quantiles 37 quartiles 37 38 queryoriented statistical plotting about 72 analysis combining with plain plotting 76 77 data preparing to fit plotting function API 7376 R random forests exploring with scikitlearn 316318 randomization 207210 Randomized Controlled Trial RCT 206 random walk time series 199 realtime weather data about 356 357 discrete distributions 358 typing suggestion app building 360 weather prediction web app Index 373 building 359 360 regression tree 274 exploring 289296 regularization learning from logistic regression 241245 regularization 
coefficient 242 regularization term 242 ridge regression 242 root node 275 280 S sample correlation coefficient 232 sample mean sampling distribution 98103 standard error 103 105107 sampling approach nonprobability sampling 86 performing in different scenarios 86 probability sampling 86 sampling distribution of sample mean 98103 sampling techniques fundamental concepts 84 85 used for learning statistics 98 scale change 68 69 scatter plot 60 scikitlearn random forests exploring with 316318 scikitlearn API tree models using 296 297 scikitlearn preprocessing module examples 23 imputation 23 standardization implementing 23 24 SciPy using for hypothesis testing about 180 ANOVA model 192197 goodnessoffit test 189192 normality hypothesis test 185189 paradigm 180 stationarity tests for time series 197 198 ttest 181185 ShapiroWilk test 185 simple linear regression model about 216 218220 as estimator 232 233 coefficient of determination 227229 content 216 218220 hypothesis testing 230 least squared error linear regression 220227 relationship connecting between estimators 230 231 variance decomposition 220227 simple line plot 56 57 simple random sampling SRS 8893 skewness 39 40 splitting features working 279285 287 standard deviation 36 37 standard error of sample mean 103 105107 standard independent twosample ttest 181 standard score 40 stationarity time series 198 example 198200 statistical graph elements 5456 statistical significance tests 166 statistics learning with sampling techniques 98 stratified random sampling 9396 stratifying 93 Students tdistribution 173 subjective probability 116 subplots 66 68 sum of squared errors SSE 222 sum of squares total SST 221 systematic random sampling 97 T tdistribution significance levels 174178 test statistic 164 test statistics 210 time series 198 treebased methods classification tree 274 overviewing for classification tasks 274278 regression tree 274 tree models using in scikitlearn API 296 298 True Negative TN 256 True Positive TP 256 twotailed hypothesis 166 type 1 error 256 type 2 error 256 type I error 179 type II error 179 typing suggestion app building 360 model building 360 word data processing 360 U ubiquitous power law 128 129 uncorrelated sample standard deviation 232 underfitting 267 269272 302 uniform distribution 122 uniform distribution boundaries using in maximum likelihood estimator approach 144 university dataset reference link 355 V variables 26 variance 3335 300303 variance decomposition 220227 Variance Inflation Factor VIF 240 volunteer sampling 87 W weather API reference link 356 weather prediction web app building 359 360 Welchs ttest 181 white noise time series 198 Z zscore 40
historian with expertise in the research and teaching of 20th century United States political history and quantitative methods His specialties are data mining quantitative methods statistical analysis teaching and consulting Yidan Pan obtained her PhD in system synthetic and physical biology from Rice University Her research interest is profiling mutagenesis at genomic and transcriptional levels with molecular biology wet labs bioinformatics statistical analysis and machine learning models She believes that this book will give its readers a lot of practical skills for data analysis Packt is searching for authors like you If youre interested in becoming an author for Packt please visit authors packtpubcom and apply today We have worked with thousands of developers and tech professionals just like you to help them share their insight with the global tech community You can make a general application apply for a specific hot topic that we are recruiting an author for or submit your own idea Table of Contents Preface Section 1 Getting Started with Statistics for Data Science 1 Fundamentals of Data Collection Cleaning and Preprocessing Technical requirements 4 Collecting data from various data sources 5 Reading data directly from files 5 Obtaining data from an API 6 Obtaining data from scratch 9 Data imputation 11 Preparing the dataset for imputation 11 Imputation with mean or median values 16 Imputation with the modemost frequent value 18 Outlier removal 20 Data standardization when and how 21 Examples involving the scikit learn preprocessing module 23 Imputation 23 Standardization 23 Summary 24 2 Essential Statistics for Data Assessment Classifying numerical and categorical variables 26 Distinguishing between numerical and categorical variables 26 Understanding mean median and mode 30 Mean 30 Median 31 Mode 32 ii Table of Contents Learning about variance standard deviation quartiles percentiles and skewness 33 Variance 33 Standard deviation 36 Quartiles 37 Skewness 39 Knowing how to handle categorical variables and mixed data types 43 Frequencies and proportions 43 Transforming a continuous variable to a categorical one 46 Using bivariate and multivariate descriptive statistics 47 Covariance 48 Crosstabulation 50 Summary 51 3 Visualization with Statistical Graphs Basic examples with the Python Matplotlib package 54 Elements of a statistical graph 54 Exploring important types of plotting in Matplotlib 56 Advanced visualization customization 65 Customizing the geometry 65 Customizing the aesthetics 70 Queryoriented statistical plotting 72 Example 1 preparing data to fit the plotting function API 73 Example 2 combining analysis with plain plotting 76 Presentationready plotting tips 78 Use styling 78 Font matters a lot 80 Summary 80 Section 2 Essentials of Statistical Analysis 4 Sampling and Inferential Statistics Understanding fundamental concepts in sampling techniques 84 Performing proper sampling under different scenarios 86 The dangers associated with non probability sampling 86 Probability sampling the safer approach 88 Understanding statistics associated with sampling 98 Table of Contents iii Sampling distribution of the sample mean 98 Standard error of the sample mean 103 The central limit theorem 107 Summary 108 5 Common Probability Distributions Understanding important concepts in probability 110 Events and sample space 110 The probability mass function and the probability density function 111 Subjective probability and empirical probability 116 Understanding common discrete probability 
distributions 116 Bernoulli distribution 117 Binomial distribution 118 Poisson distribution 120 Understanding the common continuous probability distribution 121 Uniform distribution 122 Exponential distribution 122 Normal distribution 124 Learning about joint and conditional distribution 126 Independency and conditional distribution 127 Understanding the power law and black swan 127 The ubiquitous power law 128 Be aware of the black swan 129 Summary 130 6 Parametric Estimation Understanding the concepts of parameter estimation and the features of estimators 132 Evaluation of estimators 133 Using the method of moments to estimate parameters 136 Example 1 the number of 911 phone calls in a day 137 Example 2 the bounds of uniform distribution 139 Applying the maximum likelihood approach with Python 141 Likelihood function 141 MLE for uniform distribution boundaries 144 MLE for modeling noise 145 MLE and the Bayesian theorem 155 Summary 160 iv Table of Contents 7 Statistical Hypothesis Testing An overview of hypothesis testing 162 Understanding Pvalues test statistics and significance levels 164 Making sense of confidence intervals and Pvalues from visual examples 167 Calculating the Pvalue from discrete events 168 Calculating the Pvalue from the continuous PDF 170 Significance levels in tdistribution 174 The power of a hypothesis test 179 Using SciPy for common hypothesis testing 180 The paradigm 180 Ttest 181 The normality hypothesis test 185 The goodnessoffit test 189 A simple ANOVA model 192 Stationarity tests for time series 197 Examples of stationary and non stationary time series 198 Appreciating AB testing with a realworld example 206 Conducting an AB test 206 Randomization and blocking 207 Common test statistics 210 Common mistakes in AB tests 211 Summary 212 Section 3 Statistics for Machine Learning 8 Statistics for Regression Understanding a simple linear regression model and its rich content 216 Least squared error linear regression and variance decomposition 220 The coefficient of determination 227 Hypothesis testing 230 Connecting the relationship between regression and estimators 230 Simple linear regression as an estimator 232 Having handson experience with multivariate linear regression and collinearity analysis 233 Collinearity 239 Learning regularization from logistic regression examples 241 Summary 246 Table of Contents v 9 Statistics for Classification Understanding how a logistic regression classifier works 248 The formulation of a classification problem 250 Implementing logistic regression from scratch 251 Evaluating the performance of the logistic regression classifier 256 Building a naïve Bayes classifier from scratch 259 Underfitting overfitting and crossvalidation 267 Summary 272 10 Statistics for TreeBased Methods Overviewing treebased methods for classification tasks 274 Growing and pruning a classification tree 278 Understanding how splitting works 279 Evaluating decision tree performance 287 Exploring regression tree 289 Using tree models in scikitlearn 296 Summary 298 11 Statistics for Ensemble Methods Revisiting bias variance and memorization 300 Understanding the bootstrapping and bagging techniques 303 Understanding and using the boosting module 311 Exploring random forests with scikitlearn 316 Summary 318 vi Table of Contents Section 4 Appendix 12 A Collection of Best Practices Understanding the importance of data quality 322 Understanding why data can be problematic 322 Avoiding the use of misleading graphs 326 Example 1 COVID19 trend 326 Example 2 Bar plot 
cropping 328 Fighting against false arguments 334 Summary 335 13 Exercises and Projects Exercises 338 Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing 338 Chapter 2 Essential Statistics for Data Assessment 339 Chapter 3 Visualization with Statistical Graphs 340 Chapter 4 Sampling and Inferential Statistics 341 Chapter 5 Common Probability Distributions 342 Chapter 6 Parameter Estimation 344 Chapter 7 Statistical Hypothesis Testing 346 Chapter 8 Statistics for Regression 348 Chapter 9 Statistics for Classification 349 Chapter 10 Statistics for TreeBased Methods 351 Chapter 11 Statistics for Ensemble Methods 353 Project suggestions 355 Nontabular data 355 Realtime weather data 356 Goodness of fit for discrete distributions 358 Building a weather prediction web app 359 Building a typing suggestion app 360 Further reading 360 Textbooks 361 Visualization 361 Exercising your mind 361 Summary 362 Other Books You May Enjoy Index Preface Data science has been trending for several years and demand in the market is now really on the increase as companies governments and nonprofit organizations have shifted toward a datadriven approach Many new graduates as well as people who have been working for years are now trying to add data science as a new skill to their resumes One significant barrier for stepping into the realm of data science is statistics especially for people who do not have a science technology engineering and mathematics STEM background or left the classroom years ago This book is designed to fill the gap for those people While writing this book I tried to explore the scattered concepts in a dotconnecting fashion such that readers feel that new concepts and techniques are needed rather than simply being created from thin air By the end of this book you will be able to comfortably deal with common statistical concepts and computation in data science from fundamental descriptive statistics and inferential statistics to advanced topics such as statistics using treebased methods and ensemble methods This book is also particularly handy if you are preparing for a data scientist or data analyst job interview The nice interleaving of conceptual contents and code examples will prepare you well Who this book is for This book is for people who are looking for materials to fill the gaps in their statistics knowledge It should also serve experienced data scientists as an enjoyable read The book assumes minimal mathematics knowledge and it may appear verbose as it is designed so that novices can use it as a selfcontained book and follow the book chapter by chapter smoothly to build a knowledge base on statistics from the ground up viii Preface What this book covers Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing introduces basic concepts in data collection cleaning and simple preprocessing Chapter 2 Essential Statistics for Data Assessment talks about descriptive statistics which are handy for the assessment of data quality and exploratory data analysis EDA Chapter 3 Visualization with Statistical Graphs introduces common graphs that suit different visualization scenarios Chapter 4 Sampling and Inferential Statistics introduces the fundamental concepts and methodologies in sampling and the inference techniques associated with it Chapter 5 Common Probability Distributions goes through the most common discrete and continuous distributions which are the building blocks for more sophisticated real life empirical distributions Chapter 6 Parametric Estimation covers 
a classic and rich topic that solidifies your knowledge of statistics and probability by having you estimate parameters from accessible datasets Chapter 7 Statistical Hypothesis Testing looks at a musthave skill for any data scientist or data analyst We will cover the full life cycle of hypothesis testing from assumptions to interpretation Chapter 8 Statistics for Regression discusses statistics for regression problems starting with simple linear regression Chapter 9 Statistics for Classification explores statistics for classification problems starting with logistic regression Chapter 10 Statistics for TreeBased Methods delves into statistics for treebased methods with a detailed walk through of building a decision tree from first principles Chapter 11 Statistics for Ensemble Methods moves on to ensemble methods which are metaalgorithms built on top of basic machine learning or statistical algorithms This chapter is dedicated to methods such as bagging and boosting Chapter 12 Best Practice Collection introduces several important practice tips based on the authors data science mentoring and practicing experience Chapter 13 Exercises and Projects includes exercises and project suggestions grouped by chapter Preface ix To get the most out of this book As Jupyter notebooks can run on Google Colab a computer connected to the internet and a Google account should be sufficient If you are using the digital version of this book we advise you to type the code yourself or access the code via the GitHub repository link available in the next section Doing so will help you avoid any potential errors related to the copying and pasting of code Download the example code files You can download the example code files for this book from GitHub at https githubcomPacktPublishingEssentialStatisticsforNonSTEM DataAnalysts In case theres an update to the code it will be updated on the existing GitHub repository We also have other code bundles from our rich catalog of books and videos available at httpsgithubcomPacktPublishing Check them out Download the color images We also provide a PDF file that has color images of the screenshotsdiagrams used in this book You can download it here httpsstaticpacktcdncom downloads9781838984847ColorImagespdf Conventions used There are a number of text conventions used throughout this book Code in text Indicates code words in text database table names folder names filenames file extensions pathnames dummy URLs user input and Twitter handles Here is an example You can use pltrcytick labelsizex medium x Preface A block of code is set as follows import pandas as pd df pdreadexcelPopulationEstimatesxlsskiprows2 dfhead8 margin 0 Any commandline input or output is written as follows pip install pandas Bold Indicates a new term an important word or words that you see onscreen For example words in menus or dialog boxes appear in the text like this Here is an example seaborn is another popular Python visualization library With it you can write less code to obtain more professionallooking plots Tips or important notes R is another famous programming language for data science and statistical analysis There are also successful R packages The counterpart of Matplotlib is the R ggplot2 package I mentioned above Get in touch Feedback from our readers is always welcome General feedback If you have questions about any aspect of this book mention the book title in the subject of your message and email us at customercarepacktpubcom Errata Although we have taken every care to ensure the accuracy of 
our content mistakes do happen If you have found a mistake in this book we would be grateful if you would report this to us Please visit wwwpacktpubcomsupporterrata selecting your book clicking on the Errata Submission Form link and entering the details Piracy If you come across any illegal copies of our works in any form on the Internet we would be grateful if you would provide us with the location address or website name Please contact us at copyrightpacktcom with a link to the material If you are interested in becoming an author If there is a topic that you have expertise in and you are interested in either writing or contributing to a book please visit authors packtpubcom Preface xi Reviews Please leave a review Once you have read and used this book why not leave a review on the site that you purchased it from Potential readers can then see and use your unbiased opinion to make purchase decisions we at Packt can understand what you think about our products and our authors can see your feedback on their book Thank you For more information about Packt please visit packtcom Section 1 Getting Started with Statistics for Data Science In this section you will learn how to preprocess data and inspect distributions and correlations from a statistical perspective This section consists of the following chapters Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing Chapter 2 Essential Statistics for Data Assessment Chapter 3 Visualization with Statistical Graphs 1 Fundamentals of Data Collection Cleaning and Preprocessing Thank you for purchasing this book and welcome to a journal of exploration and excitement Whether you are already a data scientist preparing for an interview or just starting learning this book will serve you well as a companion You may already be familiar with common Python toolkits and have followed trending tutorials online However there is a lack of a systematic approach to the statistical side of data science This book is designed and written to close this gap for you As the first chapter in the book we start with the very first step of a data science project collecting cleaning data and performing some initial preprocessing It is like preparing fish for cooking You get the fish from the water or from the fish market examine it and process it a little bit before bringing it to the chef 4 Fundamentals of Data Collection Cleaning and Preprocessing You are going to learn five key topics in this chapter They are correlated with other topics such as visualization and basic statistics concepts For example outlier removal will be very hard to conduct without a scatter plot Data standardization clearly requires an understanding of statistics such as standard deviation We prepared a GitHub repository that contains readytorun codes from this chapter as well as the rest Here are the topics that will be covered in this chapter Collecting data from various data sources with a focus on data quality Data imputation with an assessment of downstream task requirements Outlier removal Data standardization when and how Examples involving the scikitlearn preprocessing module The role of this chapter is as a primer It is not possible to cover the topics in an entirely sequential fashion For example to remove outliers necessary techniques such as statistical plotting specifically a box plot and scatter plot will be used We will come back to those techniques in detail in future chapters of course but you must bear with it now Sometimes in order to learn new topics bootstrapping may 
be one of a few ways to break the shell You will enjoy it because the more topics you learn along the way the higher your confidence will be Technical requirements The best environment for running the Python code in the book is on Google Colaboratory httpscolabresearchgooglecom Google Colaboratory is a product that runs Jupyter Notebook in the cloud It has common Python packages that are preinstalled and runs in a browser It can also communicate with a disk so that you can upload local files to Google Drive The recommended browsers are the latest versions of Chrome and Firefox For more information about Colaboratory check out their official notebooks https colabresearchgooglecom You can find the code for this chapter in the following GitHub repository https githubcomPacktPublishingEssentialStatisticsforNonSTEM DataAnalysts Collecting data from various data sources 5 Collecting data from various data sources There are three major ways to collect and gather data It is crucial to keep in mind that data doesnt have to be wellformatted tables Obtaining structured tabulated data directly For example the Federal Reserve httpswwwfederalreservegovdatahtm releases wellstructured and welldocumented data in various formats including CSV so that pandas can read the file into a DataFrame format Requesting data from an API For example the Google Map API https developersgooglecommapsdocumentation allows developers to request data from the Google API at a capped rate depending on the pricing plan The returned format is usually JSON or XML Building a dataset from scratch For example social scientists often perform surveys and collect participants answers to build proprietary data Lets look at some examples involving these three approaches You will use the UCI machine learning repository the Google Map API and USC Presidents Office websites as data sources respectively Reading data directly from files Reading data from local files or remote files through a URL usually requires a good source of publicly accessible data archives For example the University of California Irvine maintains a data repository for machine learning We will be reading the air quality dataset with pandas The latest URL will be updated in the books official GitHub repository in case the following code fails You may obtain the file from https archiveicsuciedumlmachinelearningdatabasesheart disease From the datasets we are using the processedhungariandata file You need to upload the file to the same folder where the notebook resides 6 Fundamentals of Data Collection Cleaning and Preprocessing The following code snippet reads the data and displays the first several rows of the datasets import pandas as pd df pdreadcsvprocessedhungariandata sep names agesexcptrestbps cholfbsrestecgthalach exangoldpeakslopeca thalnum dfhead This produces the following output Figure 11 Head of the Hungarian heart disease dataset In the following section you will learn how to obtain data from an API Obtaining data from an API In plain English an Application Programming Interface API defines protocols agreements or treaties between applications or parts of applications You need to pass requests to an API and obtain returned data in JSON or other formats specified in the API documentation Then you can extract the data you want Note When working with an API you need to follow the guidelines and restrictions regarding API usage Improper usage of an API will result in the suspension of an account or even legal issues Collecting data from various data sources 7 Lets 
take the Google Map Place API as an example The Place API https developersgooglecomplaceswebserviceintro is one of many Google Map APIs that Google offers Developers can use HTTP requests to obtain information about certain geographic locations the opening hours of establishments and the types of establishment such as schools government offices and police stations In terms of using external APIs Like many APIs the Google Map Place API requires you to create an account on its platform the Google Cloud Platform It is free but still requires a credit card account for some services it provides Please pay attention so that you wont be mistakenly charged After obtaining and activating the API credentials the developer can build standard HTTP requests to query the endpoints For example the textsearch endpoint is used to query places based on text Here you will use the API to query information about libraries in Culver City Los Angeles 1 First lets import the necessary libraries import requests import json 2 Initialize the API key and endpoints We need to replace APIKEY with a real API key to make the code work APIKEY Your API key goes here TEXTSEARCHURL httpsmapsgoogleapiscommapsapi placetextsearchjson query Culver City Library 3 Obtain the response returned and parse the returned data into JSON format Lets examine it response requestsgetTEXTSEARCH URLqueryquerykeyAPIKEY jsonobject responsejson printjsonobject 8 Fundamentals of Data Collection Cleaning and Preprocessing This is a oneresult response Otherwise the results fields will have multiple entries You can index the multientry results fields as a normal Python list object htmlattributions results formattedaddress 4975 Overland Ave Culver City CA 90230 United States geometry location lat 340075635 lng 1183969651 viewport northeast lat 3400909257989272 lng 1183955611701073 southwest lat 3400639292010727 lng 1183982608298927 icon httpsmapsgstaticcommapfilesplaceapi iconscivicbuilding71png id ccdd10b4f04fb117909897264c78ace0fa45c771 name Culver City Julian Dixon Library openinghours opennow True photos height 3024 htmlattributions a hrefhttpsmapsgooglecom mapscontrib102344423129359752463Khaled Alabeda photoreference CmRaAAAANT4Td01h1tkI7dTn35vAkZhx mg3PjgKvjHiyh80M5UlI3wVw1cer4vkOksYR68NM9aw33ZPYGQzzXTE 8bkOwQYuSChXAWlJUtz8atPhmRht4hP4dwFgqfbJULmG5f1EhAfWlF cpLz76sD81fns1OGhT4KUzWTbuNY544XozE02pLNWw width 4032 placeid ChIJrUqREx6woARFrQdyscOZ8 pluscode compoundcode 2J5326 Culver City California globalcode 85632J5326 rating 42 reference ChIJrUqREx6woARFrQdyscOZ8 types library pointofinterest establishment userratingstotal 49 status OK The address and name of the library can be obtained as follows printjsonobjectresults0formattedaddress printjsonobjectresults0name Collecting data from various data sources 9 The result reads as follows 4975 Overland Ave Culver City CA 90230 United States Culver City Julian Dixon Library Information An API can be especially helpful for data augmentation For example if you have a list of addresses that are corrupted or mislabeled using the Google Map API may help you correct wrong data Obtaining data from scratch There are instances where you would need to build your own dataset from scratch One way of building data is to crawl and parse the internet On the internet a lot of public resources are open to the public and free to use Googles spiders crawl the internet relentlessly 247 to keep its search results up to date You can write your own code to gather information online instead of opening a web browser to do it 
manually Doing a survey and obtaining feedback whether explicitly or implicitly is another way to obtain private data Companies such as Google and Amazon gather tons of data from user profiling Such data builds the core of their dominating power in ads and ecommerce We wont be covering this method however Legal issue of crawling Notice that in some cases web crawling is highly controversial Before crawling a website do check their user agreement Some websites explicitly forbid web crawling Even if a website is open to web crawling intensive requests may dramatically slow down the website disabling its normal functionality to serve other users It is a courtesy not only to respect their policy but also the law Here is a simple example that uses regular expression to obtain all the phone numbers from the web page of the presidents office University of Southern California httpdepartmentsdirectoryuscedupresoffhtml 1 First lets import the necessary libraries re is the Python builtin regular expression library requests is an HTTP client that enables communication with the internet through the http protocol import re import requests 10 Fundamentals of Data Collection Cleaning and Preprocessing 2 If you look at the web page you will notice that there is a pattern within the phone numbers All the phone numbers start with three digits followed by a hyphen and then four digits Our objective now is to compile such a pattern pattern recompiled3d4 3 The next step is to create an http client and obtain the response from the GET call response requestsgethttpdepartmentsdirectoryusc edupresoffhtml 4 The data attribute of response can be converted into a long string and fed to the findall method patternfindallstrresponsedata The results contain all the phone numbers on the web page 7402111 8211342 7402111 7402111 7402111 7402111 7402111 7402111 7409749 7402505 7406942 8211340 8216292 In this section we introduced three different ways of collecting data reading tabulated data from data files provided by others obtaining data from APIs and building data from scratch In the rest of the book we will focus on the first option and mainly use collected data from the UCI Machine Learning Repository In most cases API data and scraped data will be integrated into tabulated datasets for production usage Data imputation 11 Data imputation Missing data is ubiquitous and data imputation techniques will help us to alleviate its influence In this section we are going to use the heart disease data to examine the pros and cons of basic data imputation I recommend you read the dataset description beforehand to understand the meaning of each column Preparing the dataset for imputation The heart disease dataset is the same one we used earlier in the Collecting data from various data sources section It should give you a real red flag that you shouldnt take data integrity for granted The following screenshot shows missing data denoted by question marks Figure 12 The head of Hungarian heart disease data in VS Code CSV rainbow extension enabled First lets do an info call that lists column data type information dfinfo Note dfinfo is a very helpful function that provides you with pointers for your next move It should be the first function call when given an unknown dataset 12 Fundamentals of Data Collection Cleaning and Preprocessing The following screenshot shows the output obtained from the preceding function Figure 13 Output of the info function call If pandas cant infer the data type of a column it will interpret it as objects For 
example, the chol (cholesterol) column contains missing data. The missing data is a question mark, treated as a string, while the remainder of the data is of the float type. The records are collectively called objects.

Python's type tolerance
As Python is pretty error-tolerant, it is good practice to introduce a necessary type check. For example, if a column mixes types, instead of using numerical values to check truth, explicitly check the type and write two branches. It is also advised to avoid type conversion on columns with the object data type. Remember to make your code completely deterministic and future-proof.

Now let's replace the question marks with NaN values. The following code snippet declares a function that handles three different cases and treats them appropriately. The three cases are listed here:

- The record value is "?".
- The record value is of the integer type. This is treated independently because columns such as num should remain binary; floating-point numbers would lose the essence of the 0/1 encoding.
- The rest, which includes valid strings that can be converted to float numbers as well as original float numbers.

The code snippet is as follows:

import numpy as np

def replace_question_mark(val):
    if val == "?":
        return np.NaN
    elif type(val) == int:
        return val
    else:
        return float(val)

df2 = df.copy()
for columnName, _ in df2.iteritems():
    df2[columnName] = df2[columnName].apply(replace_question_mark)

Now we call the info function and the head function, as shown here:

df2.info()

You should expect that all fields are now either floats or integers, as shown in the following output:

Figure 1.4 – Output of info after data type conversion

Now you can check the number of non-null entries for each column; different columns have different levels of completeness. age and sex don't contain missing values, but ca contains almost no valid data. This should guide your choice of data imputation. For example, strictly dropping all the missing values (which is also considered a form of data imputation) would remove almost the entire dataset. Let's check the shape of the DataFrame after the default missing-value drop. You will see that there is only one row left, which is not what we want:

df2.dropna().shape

A screenshot of the output is as follows:

Figure 1.5 – Removing records containing NaN values leaves only one entry

Before moving on to other, more mainstream imputation methods, let's perform a quick review of our processed DataFrame. Check the head of the new DataFrame. You should see that all question marks have been replaced by NaN values. NaN values are treated as legitimate numerical values, so native NumPy functions can be used on them:

df2.head()

The output should look as follows:

Figure 1.6 – The head of the updated DataFrame

Now let's call the describe function, which generates a table of statistics. It is a very helpful and handy function for a quick peek at common statistics in our dataset:

df2.describe()

Here is a screenshot of the output:

Figure 1.7 – Output from the describe call

Understanding the describe limitation
Note that the describe function only considers valid values. In this sample, the average age value is more trustworthy than the average thal value. Also pay attention to the metadata: a numerical value doesn't necessarily have a numerical meaning. For example, a thal value is encoded as an integer with a given meaning.
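If you only need missing-value counts rather than the full info output, a per-column tally is a handy complement. Here is a minimal sketch, assuming the df and df2 DataFrames created above; the replace/to_numeric route at the end is an alternative, vectorized way to achieve the same "?"-to-NaN conversion, and the df2_alt name is mine for illustration.

import numpy as np
import pandas as pd

# Tally the remaining missing values in each column of df2.
print(df2.isna().sum().sort_values(ascending=False))

# A vectorized alternative to replace_question_mark: swap every "?"
# for NaN, then coerce every column to a numeric dtype.
df2_alt = df.replace("?", np.nan).apply(pd.to_numeric, errors="coerce")
print(df2_alt.dtypes)

Both routes should leave you with the same NaN pattern; the loop-based version simply makes the three cases explicit.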
Now let's examine the two most common ways of imputation.

Imputation with mean or median values
Imputation with mean or median values only works on numerical datasets. Categorical variables don't contain structures such as one label being larger than another, so the concepts of mean and median won't apply.

There are several advantages associated with mean/median imputation:

- It is easy to implement.
- Mean/median imputation doesn't introduce extreme values.
- It does not have any time limit.

However, there are some statistical consequences of mean/median imputation: the statistics of the dataset will change. For example, the histogram for cholesterol prior to imputation is provided here:

Figure 1.8 – Cholesterol concentration distribution

The following code snippet does the imputation with the mean. Following imputation with the mean, the histogram shifts to the right a little bit:

chol = df2["chol"]
plt.hist(chol.apply(lambda x: np.mean(chol) if np.isnan(x) else x),
         bins=range(0, 630, 30))
plt.xlabel("cholesterol imputation")
plt.ylabel("count")

Figure 1.9 – Cholesterol concentration distribution with mean imputation

Imputation with the median will shift the peak to the left, because the median is smaller than the mean. However, the shift won't be obvious if you enlarge the bin size; the median and mean values will likely fall into the same bin in that eventuality.

Figure 1.10 – Cholesterol imputation with median imputation

The good news is that the shape of the distribution looks rather similar. The bad news is that we probably increased the level of concentration a little bit. We will cover such statistics in Chapter 3, Visualization with Statistical Graphs.

Note
In other cases, where the distribution is not centered or contains a substantial ratio of missing data, such imputation can be disastrous. For example, if the waiting time in a restaurant follows an exponential distribution, imputation with mean values will probably break the characteristics of the distribution.

Imputation with the mode/most frequent value
The advantage of using the most frequent value is that it works with categorical features. Without a doubt, it will introduce bias as well. The slope field is categorical in nature, although it looks numerical: it represents three statuses of a slope value, namely positive, flat, or negative. The following code snippet will reveal our observation:

plt.hist(df2["slope"], bins=5)
plt.xlabel("slope")
plt.ylabel("count")

Here is the output:

Figure 1.11 – Counting of the slope variable

Without a doubt, the mode is 2. Following imputation with the mode, we obtain the following new distribution:

plt.hist(df2["slope"].apply(lambda x: 2 if np.isnan(x) else x), bins=5)
plt.xlabel("slope mode imputation")
plt.ylabel("count")

In the following graph, pay attention to the scale on the y axis:

Figure 1.12 – Counting of the slope variable after mode imputation

Replacing missing values with the mode in this case is disastrous. If positive and negative values of slope have medical consequences, performing prediction tasks on the preprocessed dataset will depress their weights and significance.

Different imputation methods have their own pros and cons. The prerequisite is to fully understand your business goals and downstream tasks. If key statistics are important, you should try to avoid distorting them. Also, do remember that collecting more data is always an option.
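The lambda-based snippets above make the mechanics of imputation explicit. In everyday pandas code, the same ideas are usually written with fillna. Here is a minimal sketch, assuming the df2 DataFrame prepared earlier in this chapter; the *_imputed variable names are mine for illustration.

# Mean and median imputation for a numerical column.
chol = df2["chol"]
chol_mean_imputed = chol.fillna(chol.mean())      # Series.mean() skips NaN by default
chol_median_imputed = chol.fillna(chol.median())

# Mode (most frequent value) imputation for a categorical-looking column.
slope = df2["slope"]
slope_mode_imputed = slope.fillna(slope.mode().iloc[0])  # mode() returns a Series

# Sanity check: no NaN values should remain in the imputed columns.
print(chol_mean_imputed.isna().sum(),
      chol_median_imputed.isna().sum(),
      slope_mode_imputed.isna().sum())

The statistical caveats discussed above apply equally here; fillna only changes how the imputation is written, not its consequences.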
Outlier removal
Outliers can stem from two possibilities: they either come from mistakes, or they have a story behind them. In principle, outliers should be very rare; otherwise, the experiment or survey that generated the dataset is intrinsically flawed.

The definition of an outlier is tricky. Outliers can be legitimate, because they may simply fall into the long tail of the population. For example, suppose a team working on financial crisis prediction establishes that a financial crisis occurs in one out of 1,000 simulations. That result is, of course, not an outlier that should be discarded.

It is often good to keep mysterious outliers from the raw data if possible. In other words, the reason to remove an outlier should come from outside the dataset, that is, only when you already know how the anomalous values originated. For example, if the heart rate data is strangely fast and you know there is something wrong with the medical equipment, then you can remove the bad data. The fact that the sensor or equipment is faulty can't be deduced from the dataset itself.

Perhaps the best example of keeping outliers in data is the discovery of Neptune. In 1821, Alexis Bouvard discovered substantial deviations in Uranus' orbit based on observations. This led him to hypothesize that another planet might be affecting Uranus' orbit, and that planet was later found to be Neptune.

Otherwise, discarding mysterious outliers is risky for downstream tasks. For example, some regression tasks are sensitive to extreme values. It takes further experiments to decide whether the outliers exist for a reason. In such cases, don't remove or correct outliers in the data preprocessing steps.

The following graph shows a scatter plot of the trestbps and chol fields. The highlighted data points are possible outliers, but I will probably keep them for now.

Figure 1.13 – A scatter plot of two fields in the heart disease dataset

Like missing data imputation, outlier removal is tricky and depends on the quality of the data and your understanding of it. It is hard to discuss systematic outlier removal without talking about concepts such as quartiles and box plots. In this section, we looked at the background information pertaining to outlier removal; we will talk about implementations based on statistical criteria in the corresponding sections of Chapter 2, Essential Statistics for Data Assessment, and Chapter 3, Visualization with Statistical Graphs.
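As a small preview of the statistical criteria deferred to Chapter 2 and Chapter 3, the common 1.5 * IQR rule used by box plot whiskers can already flag candidate outliers programmatically. Here is a minimal sketch, assuming the df2 DataFrame from earlier; the iqr_outlier_mask helper and the choice of the chol column are mine for illustration.

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True outside the [Q1 - k*IQR, Q3 + k*IQR] fences."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

chol_valid = df2["chol"].dropna()
mask = iqr_outlier_mask(chol_valid)
print("flagged points:", int(mask.sum()))
print(chol_valid[mask].sort_values().head())

Note that this only flags points for inspection; in line with the advice above, nothing is removed automatically.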
Data standardization: when and how
Data standardization is a common preprocessing step. I use the terms standardization and normalization interchangeably; you may also encounter the concept of rescaling in literature or blogs.

Standardization usually means shifting the data to be zero-centered with a standard deviation of 1. The goal is to bring variables with different units and ranges down to the same range. Many machine learning tasks are sensitive to data magnitudes, and standardization is supposed to remove such factors.

Rescaling doesn't necessarily bring the variables to a common range. It is done by means of a customized mapping, usually linear, that scales the original data to a different range. However, the common approach of min-max scaling does transform different variables into a common range, [0, 1].

People may argue about the difference between standardization and normalization. When comparing the two, normalization refers to normalizing different variables to the same range, [0, 1], and min-max scaling is considered a normalization algorithm; there are other normalization algorithms as well. Standardization cares more about the mean and standard deviation. Standardization does not change the shape of the distribution; it only shifts and rescales it, so in the event that the original distribution is indeed Gaussian, standardization outputs a standard Gaussian distribution.

When to perform standardization
Perform standardization when your downstream tasks require it. For example, the k-nearest neighbors method is sensitive to variable magnitudes, so you should standardize the data. On the other hand, tree-based methods are not sensitive to different ranges of variables, so standardization is not required.

There are mature libraries that perform standardization. We first calculate the standard deviation and mean of the data, subtract the mean from every entry, and then divide by the standard deviation. Standard deviation describes the level of variety in the data; it will be discussed further in Chapter 2, Essential Statistics for Data Assessment. Here is an example involving vanilla Python:

stdChol = np.std(chol)
meanChol = np.mean(chol)
chol2 = chol.apply(lambda x: (x - meanChol) / stdChol)
plt.hist(chol2, bins=range(int(min(chol2)), int(max(chol2)) + 1, 1))

The output is as follows:

Figure 1.14 – Standardized cholesterol data

Note that the standardized data is now zero-centered with a standard deviation of 1, while the shape of the distribution itself is preserved.

Data standardization loses information about the original magnitudes unless you keep the original mean and standard deviation. It is therefore only recommended when no such information, such as the original magnitudes or the original standard deviation, will be required later. In most cases, standardization is a safe choice for downstream data science tasks. In the next section, we will use the scikit-learn preprocessing module to demonstrate tasks involving standardization.

Examples involving the scikit-learn preprocessing module
For both imputation and standardization, scikit-learn offers similar APIs:

1. First, fit the data to learn the imputer or standardizer.
2. Then, use the fitted object to transform new data.

In this section, I will demonstrate two examples, one for imputation and another for standardization.

Note
Scikit-learn uses the same fit and predict syntax for predictive models. This is a very good practice for keeping the interface consistent. We will cover the machine learning methods in later chapters.

Imputation
First, create an imputer from the SimpleImputer class. The initialization of the instance allows you to choose the form of the missing values. This is handy: since we already replaced the question marks with NaN, we can feed our data straight into it:

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

Note that fit and transform can accept the same input:

imputer.fit(df2)
df3 = pd.DataFrame(imputer.transform(df2))

Now check the number of missing values; the result should be 0:

np.sum(np.sum(np.isnan(df3)))

Standardization
Standardization can be implemented in a similar fashion:

from sklearn import preprocessing

The scale function provides the default zero-mean, one-standard-deviation transformation:

df4 = pd.DataFrame(preprocessing.scale(df2))

Note
In this example, categorical variables represented by integers are also transformed to zero mean, which should be avoided in production.

Let's check the standard deviation and mean. The following line outputs infinitesimal values:

df4.mean(axis=0)

The following line outputs values close to 1:

df4.std(axis=0)

Let's look at an example of MinMaxScaler, which transforms every variable into the range [0, 1]. The following code fits and transforms the heart disease dataset in one step; it is left to you to examine its validity:

minMaxScaler = preprocessing.MinMaxScaler()
df5 = pd.DataFrame(minMaxScaler.fit_transform(df2))

Let's now summarize what we have learned in this chapter.

Summary
In this chapter, we covered several important topics that usually emerge at
the earliest stage of a data science project We examined their applicable scenarios and conservatively checked some consequences either numerically or visually Many arguments made here will be more prominent when we cover other more sophisticated topics later In the next chapter we will review probabilities and statistical concepts including the mean the median quartiles standard deviation and skewness I am sure you will then have a deeper understanding of concepts such as outliers 2 Essential Statistics for Data Assessment In Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing we learned about data collection basic data imputation outlier removal and standardization Hence this will provide you with a good foundation to understand this chapter In this chapter you are going to learn how to examine the essential statistics for data assessment Essential statistics are also often referred to as descriptive statistics Descriptive statistics provide simple quantitative summaries of datasets usually combined with descriptive graphics For example descriptive statistics can demonstrate the tendency of centralization or measures of the variability of features and so on Descriptive statistics are important Correctly represented descriptive statistics give you a precise summary of the datasets at your disposal In this chapter we will learn to extract information and make quantitative judgements from descriptive statistics Just a headsup at this point Besides descriptive statistics another kind of statistics is known as inferential statistics which tries to learn information from the distribution of the population that the dataset was generated or sampled from In this chapter we assume the data covers a whole population rather than a subset sampled from a distribution We will see the differences between the two statistics in later chapters as well For now dont worry 26 Essential Statistics for Data Assessment The following topics will be covered in this chapter Classifying numerical and categorical variables Understanding mean median and mode Learning about variance standard deviation percentiles and skewness Knowing how to handle categorical variables and mixed data types Using bivariate and multivariate descriptive statistics Classifying numerical and categorical variables Descriptive statistics are all about variables You must know what you are describing to define corresponding descriptive statistics A variable is also referred to as a feature or attribute in other literature They all mean the same thing a single column in a tabulated dataset In this section you will examine the two most important variable types numerical and categorical and learn to distinguish between them Categorical variables are discrete and usually represent a classification property of entry Numerical variables are continuous and descriptive quantitatively Descriptive statistics that can be applied to one kind of variable may not be applied to another one hence distinguishing between them precedes analytics Distinguishing between numerical and categorical variables In order to understand the differences between the two types of variables with the help of an example I will be using the population estimates dataset released by the United States Department of Agriculture by way of a demonstration It contains the estimated population data at county level for the United States from 2010 to 2018 You can obtain the data from the official website httpswwwersusdagovdataproducts countyleveldatasetsdownloaddata or the 
books GitHub repository The following code snippet loads the data and examines the first several rows import pandas as pd df pdreadexcelPopulationEstimatesxlsskiprows2 dfhead8 Classifying numerical and categorical variables 27 The output is a table with more than 140 columns Here are two screenshots showing the beginning and trailing columns Figure 21 First 6 columns of the dfad output In the preceding dataset there is a variable called RuralurbanContinuum Code2013 It takes the value of integers This leads to pandas autointerpreting this variable pandas autointerprets it as numerical Instead however the variable is actually categorical Should you always trust libraries Dont always trust what functions from Python libraries give you They may be wrong and the developer which is you has to make the final decision After some research we found the variable description on this page https wwwersusdagovdataproductsruralurbancontinuumcodes According to the code standard published in 2013 the RuralurbanContinuum Code2013 variable indicates how urbanized an area is 28 Essential Statistics for Data Assessment The meaning of RuralurbanContinuum Code2013 is shown in Figure 22 Figure 22 Interpretation of RuralurbanContinuum Code2013 Note Pandas makes intelligent autointerpretations of variable types but oftentimes it is wrong It is up to the data scientist to investigate the exact meaning of the variable type and then change it Many datasets use integers to represent categorical variables Treating them as numerical values may result in serious consequences in terms of downstream tasks such as machine learning mainly because artificial distances between numerical values will be introduced On the other hand numerical variables often have a direct quantitative meaning For example RNETMIG2013 means the rate of net immigration in 2013 for a specific area A histogram plot of this numerical variable gives a more descriptive summary of immigration trends in the States but it makes little sense plotting the code beyond simple counting Lets check the net immigration rate for the year 2013 with the following code snippet pltfigurefigsize86 pltrcParamsupdatefontsize 22 plthistdfRNETMIG2013binsnplinspacenpnanmindfR NETMIG2013npnanmaxdfRNETMIG2013num100 plttitleRate of Net Immigration Distribution for All Records 2013 Classifying numerical and categorical variables 29 The result appears as follows Figure 23 Distribution of the immigration rate for all records in datasets Here are the observations drawn from Figure 23 In either categorical or numerical variables structures can be introduced to construct special cases A typical example is date or time Depending on the scenarios date and time can be treated as categorical variables as well as numerical variables with a semicontinuous structure It is common to convert numerical variables to categorical variables on the basis of a number of rules The ruralurban code is a typical example Such a conversion is easy for conveying a first impression 30 Essential Statistics for Data Assessment Now that we have learned how to distinguish between numerical and categorical variables lets move on to understanding a few essential concepts of statistics namely mean median and mode Understanding mean median and mode Mean median and mode describe the central tendency in some way Mean and median are only applicable to numerical variables whereas mode is applicable to both categorical and numerical variables In this section we will be focusing on mean median and mode for numerical 
variables, as their numerical interactions usually convey interesting observations.

Mean

The mean, or arithmetic mean, measures the weighted center of a variable. Let's use n to denote the total number of entries and i as the index. The mean reads as follows:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

The mean is influenced by the value of every entry in the population. Let me give an example. In the following code, I will generate 1,000 random numbers distributed uniformly between 0 and 1, plot them, and calculate their mean:

import random
import numpy as np
import matplotlib.pyplot as plt

random.seed(2019)
plt.figure(figsize=(8, 6))
rvs = [random.random() for _ in range(1000)]
plt.hist(rvs, bins=50)
plt.title("Histogram of Uniformly Distributed RV")

The resulting histogram plot appears as follows:

Figure 2.4 – Histogram of uniformly distributed variables between 0 and 1

The mean is around 0.505477, pretty close to what we surmised.

Median

The median measures the unweighted center of a variable. If there is an odd number of entries, the median takes the value of the central one. If there is an even number of entries, the median takes the mean of the central two entries. The median may not be influenced by every entry's value. On account of this property, the median is more robust, or representative, than the mean. I will use the same set of entries as in the previous section as an example. The following code calculates the median:

np.median(rvs)

The result is 0.5136755026003803. Now I will change one entry to 1,000, which is 1,000 times larger than the maximal possible value in the dataset, and repeat the calculation:

rvs[-1] = 1000
print(np.mean(rvs))
print(np.median(rvs))

The results are 1.5054701085937803 and 0.5150437661964872. The mean increased by roughly 1, but the median barely moved. The relationship between the mean and the median is usually interesting and worth investigating. A larger median combined with a smaller mean usually indicates that there are more points on the bigger-value side, but that some extremely small values also exist. The reverse is true when the median is smaller than the mean. We will demonstrate this with some examples later.

Mode

The mode of a set of values is the most frequent element in the set. In a histogram plot, it shows up as the peak. If the distribution has only one peak, we call it unimodal. Distributions with two peaks, which don't have to have equal heights, are referred to as bimodal.

Note: Bimodality and the bimodal distribution
Sometimes the definition of bimodal gets corrupted. Strictly speaking, having two modes requires, by the definition of the mode, two peaks of equal height. However, the term bimodal distribution often refers to any distribution with two local maxima. Double-check your distribution and state the modes clearly.

The following code snippet demonstrates two distributions with unimodal and bimodal shapes, respectively:

r1 = [random.normalvariate(0.5, 0.2) for _ in range(10000)]
r2 = [random.normalvariate(0.2, 0.1) for _ in range(5000)]
r3 = [random.normalvariate(0.8, 0.2) for _ in range(5000)]
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].hist(r1, bins=100)
axes[0].set_title("Unimodal")
axes[1].hist(r2 + r3, bins=100)
axes[1].set_title("Bimodal")

The resulting two subplots appear as follows:

Figure 2.5 – Histograms of unimodal and bimodal datasets, with one mode and two modes

So far, we have talked about the mean, the median, and the mode, which are the first three statistics of a dataset. They are the starting point of almost all exploratory data analysis.
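The chapter computes the mean and the median in code but only describes the mode through histogram peaks. As a small supplement, here is a minimal sketch of how a mode can be obtained in practice: exactly, for discrete data, with collections.Counter, and approximately, for a continuous sample, as the center of the tallest histogram bin. The sample and the bin count of 100 below are arbitrary choices that mirror the unimodal example above.

from collections import Counter
import random
import numpy as np

# Exact mode for discrete data: the most frequent element.
codes = [2, 3, 3, 6, 1, 3, 6, 6, 6]
print(Counter(codes).most_common(1))     # [(6, 4)]: the value 6 appears 4 times

# Approximate mode for a continuous sample: center of the tallest histogram bin.
random.seed(2019)
sample = [random.normalvariate(0.5, 0.2) for _ in range(10000)]
counts, edges = np.histogram(sample, bins=100)
k = np.argmax(counts)
print((edges[k] + edges[k + 1]) / 2)     # close to 0.5, where the distribution peaks

Note that the histogram-based estimate depends on the choice of bins; it is a rough indicator of where the data concentrates, not a precise statistic.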
Learning about variance, standard deviation, quartiles, percentiles, and skewness

In the previous section, we studied the mean, the median, and the mode. They all describe, to a certain degree, the properties of the central part of the dataset. In this section, we will learn how to describe the spreading behavior of the data.

Variance

With the same notation, the variance of a population is defined as follows:

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2

Intuitively, the further away the elements are from the mean, the larger the variance. Here, I plot the histograms of two datasets with different variances. The one in the left subplot has a variance of 0.09, and the one in the right subplot has a variance of 0.01, roughly 10 times smaller. The following code snippet generates samples from the two distributions and plots them:

r1 = [random.normalvariate(0.5, 0.3) for _ in range(10000)]
r2 = [random.normalvariate(0.5, 0.1) for _ in range(10000)]
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].hist(r1, bins=100)
axes[0].set_xlim(-1, 2)
axes[0].set_title("Big Variance")
axes[1].hist(r2, bins=100)
axes[1].set_title("Small Variance")
axes[1].set_xlim(-1, 2)

The results appear as follows:

Figure 2.6 – Big and small variances with the same mean at 0.5

The following code snippet generates a scatter plot that demonstrates the difference more clearly; the variable on the x axis spreads more widely:

plt.figure(figsize=(8, 8))
plt.scatter(r1, r2, alpha=0.2)
plt.xlim(-1, 2)
plt.ylim(-1, 2)
plt.xlabel("Big Variance Variable")
plt.ylabel("Small Variance Variable")
plt.title("Variables With Different Variances")

The result looks as follows:

Figure 2.7 – Scatter plot of the large-variance and small-variance variables

The spread along the x axis is significantly larger than the spread along the y axis, which reflects the difference in variance magnitude. A common mistake is not getting the axis ranges right. Matplotlib will, by default, try to determine the ranges for you; you need to call something such as plt.xlim to force them, otherwise the result is misleading.

Standard deviation

The standard deviation is the square root of the variance. It is used more commonly to measure the level of dispersion, since it has the same unit as the original data. The formula for the standard deviation of a population reads as follows:

\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}

The standard deviation is extremely important in scientific graphing. It is often plotted together with the data as an estimate of variability. For this chapter, I will use the net immigration rates for Texas counties from 2011 to 2018 as an example. In the following code snippet, I first extract the county-level data, append the means and standard deviations to lists, and then plot them. The standard deviation is obtained using numpy.std, and the error bars are plotted using matplotlib.pyplot.errorbar:

dfTX = df[df["State"] == "TX"].tail(-1)
YEARS = [year for year in range(2011, 2019)]
MEANS, STDS = [], []
for i in range(2011, 2019):
    year = "R_NET_MIG_" + str(i)
    MEANS.append(np.mean(dfTX[year]))
    STDS.append(np.std(dfTX[year]))
plt.figure(figsize=(10, 8))
plt.errorbar(YEARS, MEANS, yerr=STDS)
plt.xlabel("Year")
plt.ylabel("Net Immigration Rate")

The output appears as shown in the following figure:

Figure 2.8 – Net immigration rate across counties in Texas from 2011 to 2018

We can see in Figure 2.8 that, although net immigration in Texas is only slightly positive, the standard deviation is huge. Some counties may have a large positive net rate, while others may potentially suffer from a loss of human resources.
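One detail worth keeping in mind when reproducing the error-bar figure: the formulas in this section divide by n, which is the population convention, and that is also NumPy's default. When the data is treated as a sample rather than a whole population, the divisor n - 1 is used instead, which you can request with the ddof argument. A minimal sketch on made-up numbers (the values are arbitrary, chosen only so the arithmetic is easy to check):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Population variance and standard deviation: divide by n (ddof=0, the default).
print(np.var(x), np.std(x))                    # 2.0  1.4142...

# Sample variance and standard deviation: divide by n - 1 (ddof=1).
# This distinction matters once data is treated as a sample in the inference chapters.
print(np.var(x, ddof=1), np.std(x, ddof=1))    # 2.5  1.5811...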
Quartiles

Quartiles are a special case of quantiles, which divide the data into a number of equal portions. Quartiles divide the data into four equal parts, with the second quartile (the 0.5 quantile) being the median; deciles and percentiles divide the data into 10 and 100 equal parts, respectively. The first quartile, Q1, also known as the lower quartile, takes the value such that 25% of all the data lies below it. The second quartile is the median. The third quartile, Q3, is also known as the upper quartile, and 25% of all values lie above it. Quartiles are probably the most commonly used quantiles because they are associated with a statistical graph called a boxplot. Let's use the same Texas net immigration data to study them. The function in NumPy is quantile, and we pass a list of the quantiles we want to calculate as an argument, as in the following single-line code snippet:

np.quantile(dfTX["R_NET_MIG_2013"], [0.25, 0.5, 0.75])

The output reads as follows:

array([-7.83469971,  0.87919226,  8.84040759])

The following code snippet visualizes the quartiles:

plt.figure(figsize=(12, 5))
plt.hist(dfTX["R_NET_MIG_2013"], bins=50, alpha=0.6)
for quartile in np.quantile(dfTX["R_NET_MIG_2013"], [0.25, 0.5, 0.75]):
    plt.axvline(quartile, linestyle=":", linewidth=4)

As you can see from the following output, the vertical dotted lines indicate the three quartiles:

Figure 2.9 – Quartiles of the net immigration data in 2013

The lower and upper quartiles keep exactly 50% of the data values in between. Q3 - Q1 is referred to as the Interquartile Range (IQR), and it plays an important role in outlier detection. We will see more about this soon.

Skewness

Skewness differs from the three measures of variability we discussed in the previous subsections. It measures the direction in which the data tilts and the extent of that tilt. Skewness is given by the following equation:

\text{skewness} = \frac{\bar{x} - \text{mode}}{\sigma}

Note: Various definitions of skewness
The skewness defined here is precisely referred to as Pearson's first skewness coefficient. It is defined through the mode, but there are other definitions of skewness; for example, skewness can also be defined through the median.

Skewness is unitless. If the mean is larger than the mode, the skewness is positive and we say the data is skewed to the right; otherwise, the data is skewed to the left. Here is the code snippet that generates two sets of skewed data and plots them:

r1 = [random.normalvariate(0.5, 0.4) for _ in range(10000)]
r2 = [random.normalvariate(0.1, 0.2) for _ in range(10000)]
r3 = [random.normalvariate(1.1, 0.2) for _ in range(10000)]
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].hist(r1 + r2, bins=100, alpha=0.5)
axes[0].axvline(np.mean(r1 + r2), linestyle=":", linewidth=4)
axes[0].set_title("Skewed To Right")
axes[1].hist(r1 + r3, bins=100, alpha=0.5)
axes[1].axvline(np.mean(r1 + r3), linestyle=":", linewidth=4)
axes[1].set_title("Skewed to Left")

The vertical dotted lines indicate the position of the mean, as follows:

Figure 2.10 – Skewness demonstration

Think about the problem of income inequality. Suppose you plot a histogram of the population against wealth: the x axis indicates the amount of wealth, and the y axis indicates the portion of the population that falls into a certain wealth range. A larger x value means more wealth; a larger y value means a greater percentage of the population falls into that wealth range. Positive skewness (the left subplot in Figure 2.10) means that even though the average income looks good, it may be driven up by a very small number of super-rich individuals while the majority of people earn relatively little. Negative skewness, shown in the
right subplot in Figure 210 indicates that the majority may have an income above the mean value so there might be some very poor people who may need help A revisit of outlier detection Now lets use what we have learned to revisit the outlier detection problem The zscore also known as the standard score is a good criterion for detecting outliers It measures the distance between an entry and the population mean taking the population variance into consideration z σ Learning about variance standard deviation quartiles percentiles and skewness 41 If the underlying distribution is normal a situation where a zscore is greater than 3 or less than 0 only has a probability of roughly 027 Even if the underlying distribution is not normal Chebyshevs theorem guarantees a strong claim such that at most 1k2 where k is an integer of the total population can fall outside k standard deviations As an example the following code snippet generates 10000 data points that follow a normal distribution randomseed2020 x randomnormalvariate1 05 for in range10000 pltfigurefigsize108 plthistxbins100alpha05 styles for i in range3 pltaxvlinenpmeanx i1npstdx linestylestylesi linewidth4 pltaxvlinenpmeanx i1npstdx linestylestylesi linewidth4 plttitleInteger Z values for symmetric distributions In the generated histogram plot the dotted line indicates the location where 1 The dashed line indicates the location of 2 The dashed dotted line indicates the location of 3 Figure 211 Integer z value boundaries for normally distributed symmetric data 42 Essential Statistics for Data Assessment If we change the data points the distribution will change but the zscore criteria will remain valid As you can see in the following code snippet an asymmetric distribution is generated rather than a normal distribution x randomnormalvariate1 05 randomexpovariate2 for in range10000 This produces the following output Figure 212 Integer z value boundaries for asymmetric data Note on the influence of extreme outliers A drawback of the zscore is that the mean itself is also influenced by extreme outliers The median can replace a mean to remove this effect It is flexible to set different criteria in different production cases We have covered several of the most important statistics to model variances in a dataset In the next section lets work on the data types of features Knowing how to handle categorical variables and mixed data types 43 Knowing how to handle categorical variables and mixed data types Categorical variables usually have simpler structures or descriptive statistics than continuous variables Here we introduce the two main descriptive statistics and talk about some interesting cases when converting continuous variables to categorical ones Frequencies and proportions When we discussed the mode for categorical variables we introduced Counter which outputs a dictionary structure whose keyvalue pair is the elementcounting pair The following is an example of a counter Counter20 394 30 369 60 597 10 472 90 425 70 434 80 220 40 217 50 92 The following code snippet illustrates frequency as a bar plot where the absolute values of counting become intuitive counter CounterdfRuralurbanContinuum Code2013 dropna labels x for key val in counteritems labelsappendstrkey xappendval pltfigurefigsize108 pltbarlabelsx plttitleBar plot of frequency 44 Essential Statistics for Data Assessment What you will get is the bar plot that follows Figure 213 Bar plot of ruralurban continuum code For proportionality simply divide each count by the summation of counting 
as shown in the following code snippet x nparrayxsumx The shape of the bar plot remains the same but the y axis ticks change To better check the relative size of components I have plotted a pie plot with the help of the following code snippet pltfigurefigsize1010 pltpiexxlabelslabels plttitlePie plot for ruralurban continuum code Knowing how to handle categorical variables and mixed data types 45 What you get is a nice pie chart as follows Figure 214 Pie plot of ruralurban continuum code It becomes evident that code 20 contains about twice as many samples as code 80 does Unlike the mean and median categorical data does have a mode We are going to reuse the same data CounterdfRuralurbanContinuum Code2013dropna The output reads as follows Counter20 394 30 369 60 597 10 472 90 425 70 434 80 220 40 217 50 92 46 Essential Statistics for Data Assessment The mode is 60 Note The mode means that the counties with urban populations of 2500 to 19999 adjacent to a metro area are most prevalent in the United States and not the number 60 Transforming a continuous variable to a categorical one Occasionally we may need to convert a continuous variable to a categorical one Lets take lifespan as an example The 80 age group is supposed to be very small Each of them will represent a negligible data point in classification tasks If they can be grouped together the noise introduced by the sparsity of this age groups individual points will be reduced A common way to perform categorization is to use quantiles For example quartiles will divide the datasets into four parts with an equal number of entries This avoids issues such as data imbalance For example the following code indicates the cutoffs for the categorization of the continuous variable net immigration rate series dfRNETMIG2013dropna quantiles npquantileseries02i for i in range15 pltfigurefigsize108 plthistseriesbins100alpha05 pltxlim5050 for i in rangelenquantiles pltaxvlinequantilesilinestyle linewidth4 plttitleQuantiles for net immigration data Using bivariate and multivariate descriptive statistics 47 As you can see in the following output the dotted vertical lines split the data into 5 equal sets which are hard to spot with the naked eye I truncated the x axis to select the part between 50 and 50 The result looks as follows Figure 215 Quantiles for the net immigration rate Note on the loss of information Categorization destroys the rich structure in continuous variables Only use it when you absolutely need to Using bivariate and multivariate descriptive statistics In this section we briefly talk about bivariate descriptive statistics Bivariate descriptive statistics apply two variables rather than one We are going to focus on correlation for continuous variables and crosstabulation for categorical variables 48 Essential Statistics for Data Assessment Covariance The word covariance is often incorrectly used as correlation However there are a number of fundamental differences Covariance usually measures the joint variability of two variables while correlation focuses more on the strength of variability Correlation coefficients have several definitions in different use cases The most common descriptive statistic is the Pearson correlation coefficient We will also be using it to describe the covariance of two variables The correlation coefficient for variables x and y from a population is defined as follows Lets first examine the expressions sign The coefficient becomes positive when x is greater than its mean and y is also greater than its own mean 
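As a quick numerical illustration of this sign argument, here is a minimal sketch on made-up numbers; the two arrays are arbitrary, and in practice pandas' corr method or np.corrcoef computes the same quantity for you.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 7.0])

# Products of deviations: mostly positive when x and y move together.
dev_products = (x - x.mean()) * (y - y.mean())

# Sum them, then normalize by n and by each standard deviation (population convention).
rho = dev_products.sum() / (len(x) * x.std() * y.std())

print(rho)                        # manual Pearson coefficient, about 0.87
print(np.corrcoef(x, y)[0, 1])    # NumPy's built-in version agrees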
Another case is when x and y are both smaller than their means respectively The products sum together and then get normalized by the standard deviation of each variable So a positive coefficient indicates that x and y vary jointly in the same direction You can make a similar argument about negative coefficients In the following code snippet we select the net immigration rates for counties in Texas as our datasets and use the corr function to inspect the correlation coefficient across years corrs dfTXRNETMIG2011RNETMIG2012 RNET MIG2013 RNETMIG2014 RNETMIG2015RNETMIG2016 RNETMIG2017 RNETMIG2018corr The output is a socalled correlation matrix whose diagonal elements are the self correlation coefficients which are just 1 Figure 216 Correlation matrix for the net immigration rate ρ 1 1 σxσy Using bivariate and multivariate descriptive statistics 49 A good way to visualize this matrix is to use the heatmap function from the Seaborn library The following code snippet generates a nice heatmap import seaborn as sns pltfigurefigsize108 pltrcParamsupdatefontsize 12 snsheatmapcorrscmapYlGnBu The result looks as follows Figure 217 Heatmap of a correlation matrix for net immigration rates in Texas We do see an interesting pattern that odd years correlate with one another more strongly and even years correlate with each other more strongly too However that is not the case between even and odd numbered years Perhaps there is a 2year cyclic pattern and the heatmap of the correlation matrix just helped us discover it 50 Essential Statistics for Data Assessment Crosstabulation Crosstabulation can be treated as a discrete version of correlation detection for categorical variables It helps derive innumerable insights and sheds light on downstream task designs Here is an example I am creating a list of weather information and another list of a golfers decisions on whether to go golfing The crosstab function generates the following table weather rainysunnyrainywindywindy sunnyrainywindysunnyrainy sunnywindywindy golfing YesYesNoNoYesYesNoNo YesNoYesNoNo dfGolf pdDataFrameweatherweathergolfinggolfing pdcrosstabdfGolfweather dfGolfgolfing marginsTrue Figure 218 Crosstabulation for golfing decisions As you can see the columns and rows give the exact counts which are identified by the column name and row name For a dataset with a limited number of features this is a handy way to inspect imbalance or bias We can tell that the golfer goes golfing if the weather is sunny and that they seldom go golfing on rainy or windy days With that we have come to the end of the chapter Summary 51 Summary Statistics or tools to assess datasets were introduced and demonstrated in this chapter You should be able to identify different kinds of variables compute corresponding statistics and detect outliers We do see graphing as an essential part of descriptive statistics In the next chapter we will cover the basics of Python plotting the advanced customization of aesthetics and professional plotting techniques 3 Visualization with Statistical Graphs A picture is worth a thousand words Humans rely on visual input for more than 90 of all information obtained A statistical graph can demonstrate trends explain reasons or predict futures much better than words if done right Python data ecosystems come with a lot of great tools for visualization The three most important ones are Matplotlib seaborn and plotly The first two are mainly for static plotting while plotly is capable of interactive plotting and is gaining in popularity gradually In 
this chapter you will focus on static plotting which is the backbone of data visualization We have already extensively used some plots in previous chapters to illustrate concepts In this chapter we will approach them in a systematic way The topics that will be covered in this chapter are as follows Picking the right plotting types for different tasks Improving and customizing visualization with advanced aesthetic customization Performing statistical plotting tailored for business queries Building stylish and professional plots for presentations or reports 54 Visualization with Statistical Graphs Lets start with the basic Matplotlib library Basic examples with the Python Matplotlib package In this chapter we will start with the most basic functionalities of the Matplotlib package Lets first understand the elements to make a perfect statistical graph Elements of a statistical graph Before we dive into Python code l will give you an overview of how to decompose the components of a statistical graph I personally think the philosophy that embeds the R ggplot2 package is very concise and clear Note R is another famous programming language for data science and statistical analysis There are also successful R packages The counterpart of Matplotlib is the R ggplot2 package mentioned previously ggplot2 is a very successful visualization tool developed by Hadley Wickman It decomposes a statistical plot into the following three components Data The data must have the information to display otherwise the plotting becomes totally misleading The data can be transformed such as with categorization before being visualized Geometries Geometry here means the types of plotting For example bar plot pie plot boxplot and scatter plot are all different types of plotting Different geometries are suitable for different visualization purposes Aesthetics The size shape color and positioning of visual elements such as the title ticks and legends all belong to aesthetics A coherent collection of aesthetic elements can be bundled together as a theme For example Facebook and The Economist have very distinguishable graphical themes Basic examples with the Python Matplotlib package 55 Lets use the birth rate and death rate data for Texas counties grouped by urbanization level as an example Before that lets relate this data with the three components mentioned previously The data is the birth rate and death rate data which determines the location of the scattered points The geometry is a scatter plot If you use a line plot you are using the wrong type of plot because there isnt a natural ordering structure in the dataset There are many aesthetic elements but the most important ones are the size and the color of the spots How they are determined will be detailed when we reach the second section of this chapter Incorporating this data into a graph gives a result that would look something like this Figure 31 Example for elements of statistical graphing 56 Visualization with Statistical Graphs Geometry is built upon data and the aesthetics will only make sense if you have the right data and geometry In this chapter you can assume we already have the right data If you have the wrong data you will end up with graphs that make no sense and are oftentimes misleading In this section lets focus mainly on geometry In the following sections I will talk about how to transform data and customize aesthetics Exploring important types of plotting in Matplotlib Lets first explore the most important plotting types one by one Simple line plots A 
simple line plot is the easiest type of plotting It represents only a binary mapping relationship between two ordered sets Stock price versus date is an example temperature versus time is another The following code snippet generates a list of evenly spaced numbers and their sine and plots them Please note that the libraries only need to be imported once import numpy as np import matplotlibpyplot as plt matplotlib inline fig pltfigure x nplinspace0 10 1000 pltplotx npsinx This generates the following output Figure 32 A simple line plot of the sine function Basic examples with the Python Matplotlib package 57 You can add one or two more simple line plots Matplotlib will decide the default color of the lines for you The following snippet will add two more trigonometric functions fig pltfigurefigsize108 x nplinspace0 10 100 pltplotx npsinxlinestylelinewidth4 pltplotxnpcosxlinestylelinewidth4 pltplotxnpcos2xlinestylelinewidth4 Different sets of data are plotted with dashed lines dotted lines and dasheddotted lines as shown in the following figure Figure 33 Multiple simple line plots Now that we have understood a simple line plot lets move on to the next type of plotting a histogram plot 58 Visualization with Statistical Graphs Histogram plots We used a histogram plot extensively in the previous chapter This type of plot groups data into bins and shows the counts of data points in each bin with neighboring bars The following code snippet demonstrates a traditional onedimensional histogram plot x1 nprandomlaplace0 08 500 x2 nprandomnormal3 2 500 plthistx1 alpha05 densityTrue bins20 plthistx2 alpha05 densityTrue bins20 The following output shows the histogram plots overlapping each other Figure 34 A onedimensional histogram Here density is normalized so the histogram is no longer a frequency count but a probability count The transparency level the alpha value is set to 05 so the histogram underline is displayed properly Boxplot and outlier detection A twodimensional histogram plot is especially helpful for visualizing correlations between two quantities We will be using the immigration data we used in the Classifying numerical and categorical variables section in Chapter 2 Essential Statistics for Data Assessment as an example Basic examples with the Python Matplotlib package 59 The good thing about a boxplot is that it gives us a very good estimation of the existence of outliers The following code snippet plots the Texas counties net immigration rate of 2017 in a boxplot import pandas as pd df pdreadexcelPopulationEstimatesxlsskiprows2 dfTX dfdfStateTXtail1 pltboxplotdfTXRNETMIG2017 The plot looks as in the following figure Figure 35 A boxplot of the 2017 net immigration rate of Texas counties What we generated is a simple boxplot It has a box with a horizontal line in between There are minimum and maximum data points which are represented as short horizontal lines However there are also data points above the maximum and below the minimum You may also wonder what they are since there are already maximum and minimum data points We will solve these issues one by one Lets understand the box first The top and bottom of the box are the ¾ quartile and the ¼ quartile respectively This means exactly 50 of the data is in the box The distance between the ¼ quartile and the ¾ quartile is called the Interquartile Range IQR Clearly the shorter the box is the more centralized the data points are The orange line in the middle represents the median 60 Visualization with Statistical Graphs The position of the 
maximum is worked out as the sum of the ¾ quartile and 15 times the IQR The minimum is worked out as the difference between the ¼ quartile and 15 times the IQR What still lies outside of the range are considered outliers In the preceding boxplot there are four outliers For example if the distribution is normal a data point being an outlier has a probability of roughly 07 which is small Note A boxplot offers you a visual approach to detect outliers In the preceding example 15 times the IQR is not a fixed rule and you can choose a cutoff for specific tasks Scatter plots A scatter plot is very useful for visually inspecting correlations between variables It is especially helpful to display data at a different time or date from different locations in the same graph Readers usually find it difficult to tell minute distribution differences from numerical values but a scatter plot makes them easy to spot For example lets plot the birth rate and death rate for all the Texas counties in 2013 and 2017 It becomes somewhat clear that from 2013 to 2017 some data points with the highest death rate disappear while the birth rates remain unchanged The following code snippet does the job pltfigurefigsize86 pltscatterdfTXRbirth2013dfTXR death2013alpha05label2013 pltscatterdfTXRbirth2017dfTXR death2017alpha05label2017 pltlegend pltxlabelBirth Rate pltylabelDeath Rate plttitleTexas Counties BirthDeath Rates Basic examples with the Python Matplotlib package 61 The output looks as in the following figure Figure 36 A scatter plot of the birth rate and death rate in Texas counties Note The scatter plot shown in Figure 36 doesnt reveal onetoone dynamics For example we dont know the change in the birth rate or death rate of a specific county and it is possible though unlikely that county A and county B exchanged their positions in the scatter plot Therefore a basic scatter plot only gives us distributionwise information but no more than that 62 Visualization with Statistical Graphs Bar plots A bar plot is another common plot to demonstrate trends and compare several quantities side by side It is better than a simple line chart because sometimes line charts can be misleading without careful interpretation For example say I want to see the birth rate and death rate data for Anderson County in Texas from 2011 to 2018 The following short code snippet would prepare the column masks to select features and examine the first row of the DataFrame which is the data for Anderson County birthRates listfilterlambda x xstartswithR birthdfTXcolumns deathRates listfilterlambda x xstartswithR deathdfTXcolumns years nparraylistmaplambda x intx4 birthRates The Anderson County information can be obtained by using the iloc method as shown in the following snippet dfTXiloc0 Figure 37 shows the first several columns and the last several ones of the Anderson County data Figure 37 Anderson County data Note DataFrameiloc in pandas allows you to slice a DataFrame by the index field Basic examples with the Python Matplotlib package 63 The following code snippet generates a simple line plot pltfigurefigsize106 width04 pltplotyearswidth2 dfTXiloc0birthRates label birth rate pltplotyearswidth2 dfTXiloc0deathRateslabeldeath rate pltxlabelyears pltylabelrate pltlegend plttitleAnderson County birth rate and death rate The following figure shows the output which is a simple line plot with the dotted line being the birth rate and the dashed line being the death rate by default Figure 38 A line chart of the birth rate and death rate Without carefully 
reading it you can derive two pieces of information from the plot The death rates change dramatically across the years The death rates are much higher than the birth rates 64 Visualization with Statistical Graphs Even though the y axis tick doesnt support the two claims presented admit it this is the first impression we get without careful observation However with a bar plot this illusion can be eliminated early The following code snippet will help in generating a bar plot pltfigurefigsize106 width04 pltbaryearswidth2 dfTXiloc0birthRates widthwidth label birth rate alpha 1 pltbaryearswidth2 dfTXiloc0deathRates widthwidthlabeldeath rate alpha 1 pltxlabelyears pltylabelrate pltlegend plttitleAnderson County birth rate and death rate I slightly shifted year to be the X value and selected birthRates and deathRates with the iloc method we introduced earlier The result will look as shown in Figure 39 Figure 39 A bar plot of the Anderson County data Advanced visualization customization 65 The following is now much clearer The death rate is higher than the birth rate but not as dramatically as the line plot suggests The rates do not change dramatically across the years except in 2014 The bar plot will by default show the whole scale of the data therefore eliminating the earlier illusion Note how I used the width parameter to shift the two sets of bars so that they can be properly positioned Advanced visualization customization In this section you are going to learn how to customize the plots from two perspectives the geometry and the aesthetics You will see examples and understand how the customization works Customizing the geometry There isnt enough time nor space to cover every detail of geometry customization Lets learn by understanding and following examples instead Example 1 axissharing and subplots Continuing from the previous example lets say you want the birth rate and the population change to be plotted on the same graph However the numerical values of the two quantities are drastically different making the birth rate basically indistinguishable There are two ways to solve this issue Lets look at each of the ways individually Axissharing We can make use of both the lefthand y axis and the righthand Y axis to represent different scales The following code snippet copies the axes with the twinx function which is the key of the whole code block figure ax1 pltsubplotsfigsize106 ax1plotyears dfTXiloc0birthRates label birth ratecred ax2 ax1twinx ax2plotyears dfTXiloc0popChanges1 labelpopulation change 66 Visualization with Statistical Graphs ax1setxlabelyears ax1setylabelbirth rate ax2setylabelpopulation change ax1legend ax2legend plttitleAnderson County birth rate and population change As you can see the preceding code snippet does three things in order 1 Creates a figure instance and an axis instance ax1 2 Creates a twin of ax1 and plots the two sets of data on two different axes 3 Creates labels for two different axes shows the legend sets the title and so on The following is the output Figure 310 Double Y axes example The output looks nice and both trends are clearly visible Subplots With subplots we can also split the two graphs into two subplots The following code snippet creates two stacked subplots and plots the dataset on them separately Advanced visualization customization 67 figure axes pltsubplots21figsize106 axes0plotyears dfTXiloc0birthRates label birth ratecred axes1plotyears dfTXiloc0popChanges1 labelpopulation change axes1setxlabelyears axes0setylabelbirth rate 
axes1setylabelpopulation change axes0legend axes1legend axes0settitleAnderson County birth rate and population change Note The subplots function takes 2 and 1 as two arguments This means the layout will have 2 rows but 1 column So the axes will be a twoelement list The output of the previous code will look as follows Figure 311 Birth rate and population subplots example 68 Visualization with Statistical Graphs The two plots will adjust the scale of the Y axis automatically The advantage of using subplots over a shared axis is that subplots can support the addition of more complicated markups while a shared axis is already crowded Example 2 scale change In this second example we will be using the dataset for the total number of coronavirus cases in the world published by WHO At the time of writing this book the latest data I could obtain was from March 15 2020 You can also obtain the data from the official repository of this book The following code snippet loads the data and formats the date column into a date data type coronaCases pdreadcsvtotalcases03152020csv from datetime import datetime coronaCasesdate coronaCasesdateapplylambda x datetimestrptimex Ymd Then we plot the data for the world and the US The output of the previous code snippet will look like this Advanced visualization customization 69 Figure 312 Coronavirus cases in linear and log scales Note how I changed the second subplot from a linear scale to a log scale Can you work out the advantage of doing so On a linear scale because the cases in the world are much larger than the cases in the US the representation of cases in the US is basically a horizontal line and the details in the total case curve at the early stage are not clear In the logscale plot the Y axis changes to a logarithm scale so exponential growth becomes a somewhat linear line and the numbers in the US are visible now 70 Visualization with Statistical Graphs Customizing the aesthetics Details are important and they can guide us to focus on the right spot Here I use one example to show the importance of aesthetics specifically the markers A good choice of markers can help readers notice the most important information you want to convey Example markers Suppose you want to visualize the birth rate and death rate for counties in Texas but also want to inspect the rates against the total population and ruralurbancontinuum code for a specific year In short you have four quantities to inspect so which geometry will you choose and how will you represent the quantities Note The continuum code is a discrete variable but the other three are continuous variables To represent a discrete variable that doesnt have numerical relationships between categories you should choose colors or markers over others which may suggest numerical differences A naïve choice is a scatter plot as we did earlier This is shown in the following code snippet pltfigurefigsize126 pltscatterdfTXRbirth2013 dfTXRdeath2013 alpha04 s dfTXPOPESTIMATE20131000 pltxlabelBirth Rate pltylabelDeath Rate plttitleTexas Counties BirthDeath Rates in 2013 Note that I set the s parameter the size of the default marker to be 1 unit for every 1000 of the population Advanced visualization customization 71 The output already looks very informative Figure 313 The birth rate and death rate in Texas However this is probably not enough because we cant tell whether the region is a rural area or an urban area To do this we need to introduce a color map Note A color map maps a feature to a set of colors In Matplotlib there are 
many different color maps For a complete list of maps check the official document at httpsmatplotliborg320tutorialscolors colormapshtml The following code snippet maps ruralurbancontinuumcode to colors and plots the color bar Although the code itself is numerical the color bar ticks contain no numerical meaning pltfigurefigsize126 pltscatterdfTXRbirth2013 dfTXRdeath2013 alpha04 s dfTXPOPESTIMATE20131000 c dfTXRuralurbanContinuum Code2003 cmap Dark2 72 Visualization with Statistical Graphs pltcolorbar pltxlabelBirth Rate pltylabelDeath Rate plttitleTexas Counties BirthDeath Rates in 2013 The output looks much easier to interpret Figure 314 The birth rate and death rate in Texas revised From this plotting counties with smaller code numbers have a bigger population a relatively moderate birth rate but a lower death rate This is possibly due to the age structure because cities are more likely to attract younger people This information cant be revealed without adjusting the aesthetics of the graph Queryoriented statistical plotting The visualization should always be guided by business queries In the previous section we saw the relationship between birth and death rates population and code and with that we designed how the graph should look Queryoriented statistical plotting 73 In this section we will see two more examples The first example is about preprocessing data to meet the requirement of the plotting API in the seaborn library In the second example we will integrate simple statistical analysis into plotting which will also serve as a teaser for our next chapter Example 1 preparing data to fit the plotting function API seaborn is another popular Python visualization library With it you can write less code to obtain more professionallooking plots Some APIs are different though Lets plot a boxplot You can check the official documentation at httpsseaborn pydataorggeneratedseabornboxplothtml Lets try to use it to plot the birth rates from different years for Texas counties However if you look at the DataFrame that the seaborn library imported it looks different from what we used earlier import seaborn as sns tips snsloaddatasettips tipshead The output looks as follows Figure 315 Head of tips a seaborn builtin dataset The syntax of plotting a boxplot is shown in the following snippet ax snsboxplotxday ytotalbill datatips 74 Visualization with Statistical Graphs The output is as follows Figure 316 The seaborn tips boxplot Note It is hard to generate such a beautiful boxplot with oneline code using the Matplotlib library There is always a tradeoff between control and easiness In my opinion seaborn is a good choice if you have limited time for your tasks Notice that the x parameter day is a column name in the tips DataFrame and it can take several values Thur Fri Sat and Sun However in the Texas county data records for each year are separated as different columns which is much wider than the tidy tips DataFrame To convert a wide table into a long table we need the pandas melt function https pandaspydataorgpandasdocsstablereferenceapipandas melthtml Queryoriented statistical plotting 75 The following code snippet selects the birth raterelated columns and transforms the table into a longer thinner format birthRatesDF dfTXbirthRates birthRatesDFindex birthRatesDFindex birthRatesDFLong pdmeltbirthRatesDFid varsindexvaluevars birthRatesDFcolumns1 birthRatesDFLongvariable birthRatesDFLongvariable applylambda x intx4 The longformat table now looks as in the following figure Figure 317 Long format of the 
birth rates data Now the seaborn API can be used directly as follows pltfigurefigsize108 snsboxplotxvariable yvalue databirthRatesDFLong pltxlabelYear pltylabelBirth Rates 76 Visualization with Statistical Graphs The following will be the output Figure 318 Texas counties birth rates boxplot with the seaborn API Nice isnt it Youve learned how to properly transform the data into the formats that the library APIs accept Good job Example 2 combining analysis with plain plotting In the second example you will see how oneline code can add inference flavor to your plots Suppose you want to examine the birth rate and the natural population increase rate in the year 2017 individually but you also want to check whether there is some correlation between the two To summarize we need to do the following 1 Examine the individual distributions of each quantity 2 Examine the correlation between these two quantities 3 Obtain a mathematical visual representation of the two quantities Queryoriented statistical plotting 77 seaborn offers the jointplot function which you can make use of It enables you to combine univariate plots and bivariate plots It also allows you to add annotations with statistical implications The following code snippet shows the univariate distribution bivariate scatter plot an estimate of univariate density and bivariate linear regression information in one command g snsjointplotRNATURALINC2017 Rbirth2017 datadfTX kindregheight10 The following graph shows the output Figure 319 Joint plot of a scatter plot and histogram plot example 78 Visualization with Statistical Graphs Tip By adding inference information density estimation and the linear regression part to an exploratory graph we can make the visualization very professional Presentationready plotting tips Here are some tips if you plan to use plots in your professional work Use styling Consider using the following tips to style plots You should consider using a style that accommodates your PowerPoint or slides For example if your presentation contains a lot of grayscale elements you shouldnt use colorful plots You should keep styling consistent across the presentation or report You should avoid using markups that are too fancy Be aware of the fact that sometimes people only have grayscale printing so red and green may be indistinguishable Use different markers and textures in this case For example the following code replots the joint plot in grayscale style with pltstylecontextgrayscale pltfigurefigsize126 g snsjointplotRNATURALINC2017 Rbirth2017 datadfTX kindregheight10 Presentationready plotting tips 79 The result is as follows Figure 320 Replot with grayscale style 80 Visualization with Statistical Graphs Font matters a lot Before the end of this chapter I would like to share my tips for font choice aesthetics Font size is very important It makes a huge difference What you see on a screen can be very different from what you see on paper or on a projector screen For example you can use pltrcytick labelsizexmedium to specify the xtick size of your graph Be aware that the font size usually wont scale when the graph scales You should test it and set it explicitly if necessary Font family is also important The font family of graphs should match the font of the paper Serif is the most common one Use the following code to change the default fonts to serif pltrcfont familyserif Lets summarize what we have learned in this chapter Summary In this chapter we discussed the most important plots in Python Different plots suit different purposes 
and you should choose them accordingly The default settings of each plot may not be perfect for your needs so customizations are necessary You also learned the importance of choosing the right geometries and aesthetics to avoid problems in your dataset such as significant quantity imbalance or highlighting features to make an exploratory argument Business queries are the starting point of designing a statistical plot We discussed the necessity of transforming data to fit a function API and choosing proper plotting functions to answer queries without hassle In the next chapter lets look at some probability distributions After all both the histogram plot and the density estimation plot in a joint plot try to uncover the probability distributions behind the dataset Section 2 Essentials of Statistical Analysis Section 2 covers the most fundamental and classical contents of statistical analysis at the undergraduate level However the statistical analysis well get into is applied to messy realword datasets This section will give you a taste of statistical analysis as well as sharpening your math skills for further chapters This section consists of the following chapters Chapter 4 Sampling and Inferential Statistics Chapter 5 Common Probability Distributions Chapter 6 Parametric Estimation Chapter 7 Statistical Hypothesis Testing 4 Sampling and Inferential Statistics In this chapter we focus on several difficult sampling techniques and basic inferential statistics associated with each of them This chapter is crucial because in real life the data we have is most likely only a small portion of a whole set Sometimes we also need to perform sampling on a given large dataset Common reasons for sampling are listed as follows The analysis can run quicker when the dataset is small Your model doesnt benefit much from having gazillions of pieces of data Sometimes you also dont want sampling For example sampling a small dataset with subcategories may be detrimental Understanding how sampling works will help you to avoid various kinds of pitfalls The following topics will be covered in this chapter Understanding fundamental concepts in sampling techniques Performing proper sampling under different scenarios Understanding statistics associated with sampling 84 Sampling and Inferential Statistics We begin by clarifying the concepts Understanding fundamental concepts in sampling techniques In Chapter 2 Essential Statistics for Data Assessment I emphasized that statistics such as mean and variance were used to describe the population The intent is to help you distinguish between the population and samples With a population at hand the information is complete which means all statistics you calculated will be authentic since you have everything With a sample the information you have only relates to a small portion or a subset of the population What exactly is a population A population is the whole set of entities under study If you want to study the average monthly income of all American women then the population includes every woman in the United States Population will change if the study or the question changes If the study is about finding the average monthly income of all Los Angeles women then a subset of the population for the previous study becomes the whole population of the current study Certain populations are accessible for a study For example it probably only takes 1 hour to measure kids weights in a single kindergarten However it is both economically and temporally impractical to obtain income 
information for American women or even Los Angeles women In order to get a good estimate of such an answer sampling is required A sample is a subset of the population under study The process of obtaining a sample is called sampling For example you could select 1000 Los Angeles women and make this your sample By collecting their income information you can infer the average income of all Los Angeles women As you may imagine selecting 1000 people will likely give us a more confident estimation of the statistics The sampling size matters because more entries will increase the likelihood of representing more characteristics of the original population What is more important is the way how sampling is done if you randomly select people walking on the street in Hollywood you probably will significantly overestimate the true average income If you go to a college campus to interview students you will likely find an underestimated statistic because students wont have a high income in general Understanding fundamental concepts in sampling techniques 85 Another related concept is the accessible population The whole population under study is also referred to as the target population which is supposed to be the set to study However sometimes only part of it is accessible The key characteristic is that the sampling process is restricted by accessibility As regards a study of the income of all Los Angeles women an accessible population may be very small Even for a small accessible population researchers or survey conductors can only sample a small portion of it This makes the sampling process crucially important Failed sampling Failed sampling can lead to disastrous decision making For example in earlier times when phones were not very accessible to every family if political polling was conducted based on phone directories the result could be wildly inaccurate The fact of having a phone indicated a higher household income and their political choices may not reveal the characteristics of the whole community or region In the 1936 Presidential election between Roosevelt and Landon such a mistake resulted in an infamous false Republican victory prediction by Literary Digest In the next section you will learn some of the most important sampling methods We will still be using the Texas population data For your reference the following code snippet reads the dataset and creates the dfTX DataFrame import pandas as pd df pdreadexcelPopulationEstimatesxlsskiprows2 dfTX dfdfStateTXtail1 The first several columns of the dfTX DataFrame appear as follows Figure 41 First several columns of the dfTX DataFrame Next lets see how different samplings are done 86 Sampling and Inferential Statistics Performing proper sampling under different scenarios The previous section introduced an example of misleading sampling in political polling The correctness of a sampling approach will change depending on its content When telephones were not accessible polling by phone was a bad practice However now everyone has a phone number associated with them and in general the phone number is largely random If a polling agency generates a random phone number and makes calls the bias is likely to be small You should keep in mind that the standard of judging a sampling method as right or wrong should always depend on the scenario There are two major ways of sampling probability sampling and nonprobability sampling Refer to the following details Probability sampling as the name suggests involves random selection In probability sampling each member 
has an equal and known chance of being selected This theoretically guarantees that the results obtained will ultimately reveal the behavior of the population Nonprobability sampling where subjective sampling decisions are made by the researchers The sampling process is usually more convenient though The dangers associated with nonprobability sampling Yes Here I do indeed refer to nonprobability sampling as being dangerous and I am not wrong Here I list the common ways of performing nonprobability sampling and we will discuss each in detail Convenience sampling Volunteer sampling Purposive sampling There are two practical reasons why people turn to nonprobability sampling Nonprobability sampling is convenient It usually costs much less to obtain an initial exploratory result with nonprobability sampling than probability sampling For example you can distribute shopping surveys in a supermarket parking lot to get a sense of peoples shopping habits on a Saturday evening But your results will likely change if you do it on a Monday morning For example people might tend to buy more alcohol at the weekend Such sampling is called convenience sampling Performing proper sampling under different scenarios 87 Convenience sampling is widely used in a pilot experimentstudy It can avoid wasting study resources on improper directions or find hidden issues at the early stages of study It is considered unsuitable for a major study Two other common nonprobability sampling methods are volunteer sampling and purposive sampling Volunteer sampling relies on the participants own selfselection to join the sampling A purposive selection is highly judgmental and subjective such that researchers will manually choose participants as part of the sample A typical example of volunteer sampling is a survey conducted by a political figure with strong left or rightwing tendencies Usually only their supporters will volunteer to spend time taking the survey and the results will be highly biased tending to support this persons political ideas An exaggerated or even hilarious example of purposive sampling is asking people whether they successfully booked a plane ticket on the plane The result is obvious because it is done on the plane You may notice that such sampling techniques are widely and deliberately used in everyday life such as in commercials or political campaigns in an extremely disrespectful way Many surveys conclude the results before they were performed Be careful with nonprobability sampling Nonprobability sampling is not wrong It is widely used For inexperienced researchers or data scientists who are not familiar with the domain knowledge it is very easy to make mistakes with nonprobability sampling The non probability sampling method should be justified carefully to avoid mistakes such as ignoring the fact that people who own a car or a telephone in 1936 were likely Republican In the next section you are going to learn how to perform sampling safely Also since probability sampling doesnt involve much subjective judgement you are going to see some working code again 88 Sampling and Inferential Statistics Probability sampling the safer approach I refer to probability sampling as a safer sampling approach because it avoids serious distribution distortion due to human intervention in most cases Here I introduce three ways of probability sampling They are systematic and objective and are therefore more likely to lead to unbiased results We will spend more time on them As before I will list them first Simple random 
sampling Stratified random sampling Systematic random sampling Lets start with simple random sampling Simple random sampling The first probability sampling is Simple Random Sampling SRS Lets say we have a study that aims to find the mean and standard deviation of the counties populations in Texas If it is not possible to perform this in all counties in Texas simple random sampling can be done to select a certain percentage of counties in Texas The following code shows the total number of counties and plots their population distributions The following code selects 10 which is 25 counties of all of all the counties populations in 2018 First lets take a look at our whole datasets distribution pltfigurefigsize106 pltrcParamsupdatefontsize 22 plthistdfTXPOPESTIMATE2018bins100 plttitleTotal number of counties formatlendfTXPOP ESTIMATE2018 pltaxvlinenpmeandfTXPOP ESTIMATE2018crlinestyle pltxlabelPopulation pltylabelCount Performing proper sampling under different scenarios 89 The result is a highly skewed distribution with few very large population outliers The dashed line indicates the position of the mean Figure 42 Population histogram plotting of all 254 Texas counties Most counties have populations of below half a million and fewer than 5 counties have population in excess of 2 million The population mean is 112999 according to the following oneline code npmeandfTXPOPESTIMATE2018 Now lets use the randomsample function from the random module to select 25 nonrepetitive samples and plot the distribution To make the result reproducible I set the random seed to be 2020 Note on reproducibility To make your analysis such that it involves reproducible randomness set a random seed so that randomness becomes deterministic The following code snippet selects 25 counties data and calculates the mean population figure randomseed2020 pltfigurefigsize106 sample randomsampledfTXPOPESTIMATE2018tolist25 plthistsamplebins100 pltaxvlinenpmeansamplecr plttitleMean of sample population formatnp meansample 90 Sampling and Inferential Statistics pltxlabelPopulation pltylabelCount The result appears as follows Notice that the samples mean is about 50 smaller than the populations mean Figure 43 Simple random sample results We can do this several more times to check the results since the sampling will be different each time The following code snippet calculates the mean of the sample 100 times and visualizes the distribution of the sampled mean I initialize the random seed so that it becomes reproducible The following code snippet repeats the SRS process 100 times and calculates the mean for each repetition Then plot the histogram of the means I call the number of occasions 100 trials and the size of each sample 25 numSample numSample 25 trials 100 randomseed2020 sampleMeans for i in rangetrials sample randomsampledfTXPOPESTIMATE2018to listnumSample sampleMeansappendnpmeansample pltfigurefigsize108 plthistsampleMeansbins25 plttitleDistribution of the sample means for sample size Performing proper sampling under different scenarios 91 of formattrials numSample pltgcaxaxissettickparamsrotation45 pltxlabelSample Mean pltylabelCount The result looks somewhat like the original distribution of the population Figure 44 Distribution of the mean of 100 SRS processes However the distribution shape will change drastically if you modify the sample size or number of trials Let me first demonstrate the change in sample size In the following code snippet the number of samples takes values of 25 and 100 and the number of trials is 1000 
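If you want to try this kind of comparison without the Texas dataset at hand, the short sketch below runs a similar experiment on a synthetic right-skewed population. None of it comes from the book: the helper name srsMean, the log-normal parameters, and the stand-in population of 254 values are illustrative assumptions chosen only so the code is self-contained; with the real data you would pass the county population column converted to a list instead.

import random
import numpy as np

def srsMean(values, sampleSize):
    # One simple random sample without replacement, reduced to its mean
    return np.mean(random.sample(values, sampleSize))

random.seed(2020)
np.random.seed(2020)

# A right-skewed stand-in for the 254 county populations (log-normal, heavy upper tail)
population = list(np.random.lognormal(mean=10, sigma=1.5, size=254))

for n in (25, 100):
    means = [srsMean(population, n) for _ in range(1000)]
    print(f"sample size {n}: mean of sample means {np.mean(means):,.0f}, "
          f"standard deviation of sample means {np.std(means):,.0f}")

With the larger sample size, the sample means spread less around the population mean, which is the effect the following snippet visualizes on the county data.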
Note that the distribution is normed so the scale becomes comparable numSamples 25100 colors rb trials 1000 randomseed2020 pltfigurefigsize108 sampleMeans for j in rangelennumSamples for i in rangetrials sample randomsampledfTXPOPESTIMATE2018to listnumSamplesj sampleMeansappendnpmeansample plthistsampleMeanscolorcolorsj alpha05bins25labelsample size 92 Sampling and Inferential Statistics formatnumSamplesjdensityTrue pltlegend plttitleDistribution density of means of 1000 SRS with respect to sample sizes pltxlabelSample Mean pltylabelDensity You can clearly see the influence of sample size on the result in the following graph Figure 45 Demonstration of the influence of sample size In short if you choose a bigger sample size it is more likely that you will obtain a larger estimation of the mean of the population data It is not counterintuitive because the mean is very susceptible to extreme values With a larger sample size the extreme values those 1 million are more likely to be selected and therefore increase the chance that the sample mean is large Performing proper sampling under different scenarios 93 I will leave it to you to examine the influence of the number of trials You can run the following code snippet to find out The number of trials should only influence accuracy numSample 100 colors rb trials 10005000 randomseed2020 pltfigurefigsize108 sampleMeans for j in rangelentrials for i in rangetrialsj sample randomsampledfTXPOPESTIMATE2018to listnumSample sampleMeansappendnpmeansample plthistsampleMeanscolorcolorsj alpha05bins25labeltrials formattrialsjdensityTrue pltlegend plttitleDistribution density of means of 1000 SRS and 5000 SRS pltxlabelSample Mean pltylabelDensity Most of the code is the same as the previous one except the number of trials now takes another value that is 5000 in line 3 Stratified random sampling Another common method of probability sampling is stratified random sampling Stratifying is a process of aligning or arranging something into categories or groups In stratified random sampling you should first classify or group the population into categories and then select elements from each group randomly The advantage of stratified random sampling is that every group is guaranteed to be represented in the final sample Sometimes this is important For example if you want to sample the income of American women without SRS it is likely that most samples will fall into highpopulation states such as California and Texas Information about small states will be completely lost Sometimes you want to sacrifice the absolute equal chance to include the representativeness 94 Sampling and Inferential Statistics For the Texas county population data we want to include all counties from different urbanization levels The following code snippet examines the urbanization code level distribution from collections import Counter CounterdfTXRuralurbanContinuum Code2013 The result shows some imbalance Counter70 39 60 65 50 6 20 25 30 22 10 35 80 20 90 29 40 13 If we want equal representativeness from each urbanization group such as two elements from each group stratified random sampling is likely the only way to do it In SRS the level 5 data will have a very low chance of being sampled Think about the choice between sampling equal numbers of entries in each levelstrata or a proportional number of entries as a choice between selecting senators and House representatives Note In the United States each state has two senators regardless of the population and state size The number of 
representatives in the House reflects how large the states population is The larger the population the more representatives the state has in the House The following code snippet samples four representatives from each urbanization level and prints out the mean Note that the code is not optimized for performance but for readability randomseed2020 samples for level in sortednpuniquedfTXRuralurbanContinuum Code2013 Performing proper sampling under different scenarios 95 samples randomsampledfTXdfTXRuralurbanContinuum Code2013levelPOPESTIMATE2018tolist4 printnpmeansamples The result is about 144010 so not bad Lets do this four more times and check the distribution of the sample mean The following code snippet performs stratified random sampling 1000 times and plots the distribution of means pltfigurefigsize108 plthistsampleMeansbins25 plttitleSample mean distribution with stratified random sampling pltgcaxaxissettickparamsrotation45 pltxlabelSample Mean pltylabelCount The following results convey some important information As you can tell the sampled means are pretty much centered around the true mean of the population Figure 46 Distribution of sample means from stratified random sampling 96 Sampling and Inferential Statistics To clarify the origin of this odd shape you need to check the mean of each group The following code snippet does the job pltfigurefigsize108 levels codeMeans for level in sortednpuniquedfTXRuralurbanContinuum Code2013 codeMean npmeandfTXdfTXRuralurbanContinuum Code2013levelPOPESTIMATE2018 levelsappendlevel codeMeansappendcodeMean pltplotlevelscodeMeansmarker10markersize20 plttitleUrbanization level code versus mean population pltxlabelUrbanization level code 2013 pltylabelPopulation mean The result looks like the following Figure 47 Urbanization level code versus the mean population Note that the larger the urbanization level code the smaller the mean population Stratified random sampling takes samples from each group so an improved performance is not surprising Performing proper sampling under different scenarios 97 Recall that the urbanization level is a categorical variable as we introduced it in Chapter 2 Essential Statistics for Data Assessment The previous graph is for visualization purposes only It doesnt tell us information such as that the urbanization difference between levels 3 and 2 is the same as the difference between levels 5 and 6 Also notice that it is important to choose a correct stratifying criterion For example classifying counties into different levels based on the first letter in a county name doesnt make sense here Systematic random sampling The last probability sampling method is likely the easiest one If the population has an order structure you can first select one at random and then select every nth member after it For example you can sample the students by ID on campus or select households by address number The following code snippet takes every tenth of the Texas dataset and calculates the mean randomseed2020 idx randomrandint010 populations dfTXPOPESTIMATE2018tolist samples samplesappendpopulationsidx while idx 10 lenpopulations idx 10 samplesappendpopulationsidx printnpmeansamples The result is 158799 so not bad Systematic random sampling is easy to implement and understand It naturally avoids potential clustering in the data However it assumes a natural randomness in the dataset such that manipulation of the data may cause false results Also you have to know the size of the population beforehand in order to determine a sampling interval We 
have covered three ways of probability sampling Combining previous nonprobability sampling techniques you have six methods at your disposal Each of the sampling techniques has its own pros and cons Choose wisely in different cases In the next section we will study some statistics associated with sampling techniques that will help us make such decisions 98 Sampling and Inferential Statistics Understanding statistics associated with sampling In the previous section you saw something like a histogram plot of the samples means We used the histogram to show the quality of the sampled mean If the distribution of the mean is centered around the true mean I claim it has a better quality In this section we will go deeper into it Instead of using Texas population data I will be using artificial uniform distributions as examples It should be easier for you to grasp the quantitative intuition if the distribution underlining the population is clear Sampling distribution of the sample mean You have seen the distribution of the sampled mean in the previous section There are some questions remaining For example what is the systematic relationship between the sample size and the sample mean What is the relationship between the number of times of sampling and the sample means distribution Assume we have a population that can only take values from integers 1 to 10 with equal probability The population is very large so we can sample as many as we want Let me perform an experiment by setting the sample size to 4 and then calculate the sample mean Lets do the sampling 100 times and check the distribution of the sample mean We did similar computational experiments for the Texas population data but here you can obtain the theoretical mean and standard deviation of the uniform distribution beforehand The theoretical mean and standard deviation of the distribution can be calculated in one line printnpmeani for i in range111 printnpsqrtnpmeani552 for i in range111 The mean is 55 and the standard deviation is about 287 The following code snippet performs the computational experiment and plots the distribution of the sample mean trials 100 sampleSize 4 randomseed2020 sampleMeans candidates i for i in range111 Understanding statistics associated with sampling 99 pltrcParamsupdatefontsize 18 for i in rangetrials sampleMean npmeanrandomchoicecandidates for in rangesampleSize sampleMeansappendsampleMean pltfigurefigsize106 plthistsampleMeans bins25 pltaxvline55cr linestyle plttitleSample mean distribution trial sample size formattrials sampleSize pltxlabelSample mean pltylabelCount I used the dashed vertical line to highlight the location of the true population mean The visualization can be seen here Figure 48 Sample mean distribution for 100 trials with a sample size of 4 Lets also take a note of the sample means standard deviation npmeansampleMeans The result is 59575 Now lets repeat the process by increasing the number of trials to 4 16 64 and 100 keeping sampleSize 4 unchanged I am going to use a subplot to do this 100 Sampling and Inferential Statistics The follow code snippet first declares a function that returns the sample means as a list def obtainSampleMeanstrials 100 sampleSize 4 sampleMeans candidates i for i in range111 for i in rangetrials sampleMean npmeanrandomchoicecandidates for in rangesampleSize sampleMeansappendsampleMean return sampleMeans The following code snippet makes use of the function we declared and plots the result of the experiments randomseed2020 figure axes pltsubplots41figsize816 
figuretightlayout times 41664100 for i in rangelentimes sampleMeans obtainSampleMeans100timesi4 axesihistsampleMeansbins40density True axesiaxvline55cr axesisettitleSample mean distribution trial sample size format100timesi 4 printmean std formatnpmeansampleMeansnp stdsampleMeans Understanding statistics associated with sampling 101 You may observe an interesting trend where the distributions assume an increasingly smooth shape Note that the skipping is due to the fact that the possible sample values are all integers Figure 49 Sample mean distribution with 400 1600 6400 and 10000 trials 102 Sampling and Inferential Statistics There are two discoveries here As the number of trials increases the sample means distribution becomes smoother and bellshaped When the number of trials reaches a certain level the standard deviation doesnt seem to change To verify the second claim you need to compute the standard deviation of the sample means I will leave this to you The result is listed here As regards the four different numbers of trials the standard deviations are all around 144 Here is the output of the code snippet showing no significant decrease in standard deviation trials 400 mean 564 std 14078218992472025 trials 1600 mean 553390625 std 14563112832464553 trials 6400 mean 54877734375 std 14309896472527093 trials 10000 mean 551135 std 14457899838842432 Next lets study the influence of the number of trials The following code snippet does the trick In order to obtain more data points for future analysis I am going to skip some plotting results but you can always check the official notebook of the book for yourself I am going to stick to trials 6400 for this experiment randomseed2020 sizes 2k for k in range19 figure axes pltsubplots81figsize848 figuretightlayout for i in rangelensizes sampleMeans obtainSampleMeans6400sizesi axesihistsampleMeansbinsnplinspacenp minsampleMeansnpmaxsampleMeans40density True axesiaxvline55cr linestyle axesisettitleSample mean distribution trial sample size format6400 sizesi axesisetxlim010 printmean std formatnpmeansampleMeansnp stdsampleMeans Understanding statistics associated with sampling 103 Lets check the sampleSize 16 result Figure 410 Sample mean distribution sample size 16 If the sample size increases eightfold you obtain the following Figure 411 Sample mean distribution sample size 128 We do see that the standard error of the sample mean shrinks when the sample size increases The estimates of the population mean are more precise hence a tighter and tighter histogram We will study this topic more quantitatively in the next subsection Standard error of the sample mean The sample means standard error decreases when the sample size increases We will do some visualization to find out the exact relationship The standard error is the standard deviation of a statistic of a sampling distribution Here the statistic is the mean 104 Sampling and Inferential Statistics The thought experiment tip A useful technique for checking a monotonic relationship is to perform a thought experiment Imagine the sample size is 1 then when the number of trials increases to infinity basically we will be calculating the statistics of the population itself On the other hand if the sample size increases to a very large number then every sample mean will be very close to the true population mean which leads to small variance Here I plotted the relationship between the size of the sample and the standard deviation of the sample mean The following code snippet does the job randomseed2020 sizes 
= [2**k for k in range(1, 9)]
ses = []
figure, axes = plt.subplots(8, 1, figsize=(8, 48))
for i in range(len(sizes)):
    sampleMeans = obtainSampleMeans(6400, sizes[i])
    ses.append(np.std(sampleMeans))

Due to space limitations, here we only show two of the eight subfigures.

Figure 4.12: Sample mean distribution when the sample size is 2

With a larger sample size, we have the following diagram:

Figure 4.13: Sample mean distribution when the sample size is 256

Then we plot the relationship in a simple line chart:

plt.figure(figsize=(8, 6))
plt.plot(sizes, ses)
plt.title("Standard Error of Sample Mean Versus Sample Size")
plt.xlabel("Sample Size")
plt.ylabel("Standard Error of Sample Mean")

What you get is the following curve:

Figure 4.14: The sample mean standard error decreases with the sample size

Now let's perform a transformation of the standard error so that the relationship becomes clear:

plt.figure(figsize=(8, 6))
plt.plot(sizes, [1 / ele**2 for ele in ses])
plt.title("Inverse of the Square of Standard Error versus Sample Size")
plt.xlabel("Sample Size")
plt.ylabel("Transformed Standard Error of Sample Mean")

The output becomes a straight line:

Figure 4.15: Standard error transformation

There is a linear relationship between the sample size and the inverse of the square of the standard error. Let's use n to denote the sample size; then

$\sigma_n \propto \frac{1}{\sqrt{n}}$

Now recall that if the sample size is 1, we are basically calculating the population itself. Therefore, the relationship is exactly the following:

$\sigma_n = \frac{\sigma}{\sqrt{n}}$

This equation is useful for estimating the true population standard deviation.

Note on replacement
I used the random.choice() function sampleSize times in this example. This amounts to sampling from an infinitely large population, or sampling with replacement. However, in the first section, when sampling the Texas population data, I used the random.sample() function to sample a finite dataset without replacement. The analysis of the sample mean still applies, but the standard error's coefficient will be different: you pick up a finite population correction factor that depends on the population size. We won't go deeper into this topic due to content limitations.

The central limit theorem
One last topic to discuss in this chapter is probably one of the most important theorems in statistics. You may notice that the shape of the sample mean distribution tends toward a bell-shaped distribution, indeed a normal distribution. This is due to one of the most famous and important theorems in statistics: the Central Limit Theorem (CLT). The CLT states that, given a sufficiently large sample size, the sampling distribution of the mean of a variable will approximate a normal distribution, regardless of that variable's distribution in the population. Recall that the example distribution I used is the simplest discrete uniform distribution, and you can already see that the sample mean follows a bell-shaped distribution (checking the sample mean is equivalent to checking the sample sum). The CLT is very strong. The normal distribution is the most important distribution among many others, as we will cover in the next chapter. Mathematicians have developed a lot of theories and tools relating to the normal distribution, and the CLT enables us to apply those tools to other distributions as well. Proving the CLT is beyond the scope of this introductory book. However, you are encouraged to work through the thought experiment described next and to run a computational experiment to verify it; a minimal sketch of such an experiment follows.
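As a quick computational check of the CLT, the following sketch reuses the discrete uniform population from 1 to 10 and overlays the histogram of many sample means with the normal curve the CLT predicts, namely a mean of 5.5 and a standard deviation of sigma divided by the square root of the sample size. This snippet is not from the book; the choices of sampleSize = 64 and trials = 5000, and the hand-coded normal PDF, are illustrative assumptions.

import random
import numpy as np
import matplotlib.pyplot as plt

random.seed(2020)
candidates = [i for i in range(1, 11)]   # the discrete uniform population: 1 to 10
sampleSize = 64                          # illustrative choice
trials = 5000                            # illustrative choice

# Draw `trials` sample means, each computed from `sampleSize` draws with replacement
sampleMeans = [np.mean([random.choice(candidates) for _ in range(sampleSize)])
               for _ in range(trials)]

# CLT prediction: approximately normal with mean 5.5 and standard deviation sigma / sqrt(n)
mu = 5.5
sigma = np.sqrt(np.mean([(i - mu) ** 2 for i in candidates]))   # about 2.87
x = np.linspace(4, 7, 200)
cltPdf = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2 / sampleSize)) / \
         (sigma / np.sqrt(sampleSize) * np.sqrt(2 * np.pi))

plt.figure(figsize=(8, 6))
plt.hist(sampleMeans, bins=40, density=True, label="sample means")
plt.plot(x, cltPdf, "r--", label="normal PDF predicted by the CLT")
plt.legend()
plt.xlabel("Sample mean")
plt.ylabel("Density")
plt.show()

The histogram should track the dashed curve closely; try smaller values of sampleSize to watch the approximation degrade. The coin-tossing thought experiment below gives the same intuition without any code.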
You toss an unfair coin that favors heads and record the heads as 1 and the tails as 0. A set of tosses contains n tosses. If n equals 1, you can do m sets of tosses and count the sum of each set's results. What you get is m binary numbers, either 0 or 1, with 1 more likely than 0. If n = 10, the sum of a set can now take values from 0 to 10 and will likely have an average greater than 5, because the coin is unfair. However, you now have a spread of possible outcomes, no longer binary. As n increases, the sum of the tossing set keeps increasing, but it will likely increase at a more stable rate, probably around 0.7 per additional toss. Casually speaking, the sum hides the intrinsic structure of the original distribution. To verify this amazing phenomenon, perform some computational experiments; the code snippets in this chapter provide useful skeletons.

Summary
In this chapter, you learned important but often undervalued concepts, such as population, samples, and sampling methods. You learned the right ways to perform sampling, as well as the pitfalls of dangerous sampling methods. We also made use of several important distributions in this chapter. In the next chapter, you are going to systematically learn some common, important distributions. With these background concepts solidified, we can then move on to inferential statistics with confidence.

5
Common Probability Distributions
In the previous chapter, we discussed the concepts of population and sampling. In most cases, it is not likely that you will find a dataset that perfectly obeys a well-defined distribution. However, common probability distributions are the backbone of data science and serve as the first approximation of real-world distributions. The following topics will be covered in this chapter:

Understanding important concepts in probability
Understanding common discrete probability distributions
Understanding common continuous probability distributions
Learning about joint and conditional distribution
Understanding the power law and black swan

Recall the famous saying: there is nothing more practical than a good theory. The theory of probability is beyond greatness. Let's get started!

Understanding important concepts in probability
First of all, we need to clarify some fundamental concepts in probability theory.

Events and sample space
The easiest and most intuitive way to understand probability is probably through the idea of counting. When tossing a fair coin, the probability of getting heads is one half: you count two possible results and associate a probability of one half with each of them, and the sum of the probabilities of all the associated non-overlapping events (not including having the coin stand on its edge) must be unity.

Generally, probability is associated with events within a sample space S. In the coin-tossing example, tossing the coin is considered a random experiment; it has two possible outcomes, and the collection of all outcomes is the sample space. The outcome of having heads (or tails) is an event. Note that an event is not necessarily a single outcome, for example, tossing a dice and defining an event as having a result larger than 4. The event contains a subset of the six-outcome sample space. If the dice is fair, it is intuitive to say that such an event, having a result larger than 4, is associated with the probability $P(A) = \frac{1}{3}$. The probability of the whole sample space is 1, and any probability lies between 0 and 1. If an event A contains no outcomes, its probability is 0. Such intuition doesn't only apply to discrete outcomes. For continuous cases, such as the arrival time of a bus between 8 A.M. and 9 A.M., you can define the
sample space S as the wholetime interval from 8 AM to 9 AM An event A can be a bus arriving between 830 and 840 while another event B can be a bus arriving later than 850 A has a probability of 𝑃𝑃𝐴𝐴 1 6 and B has a probability of 𝑃𝑃𝐵𝐵 1 6 as well Lets use and to denote the union and intersection of two events The following three axioms for probability calculation will hold 𝑃𝑃𝐴𝐴 0 for any event 𝐴𝐴 𝑆𝑆 𝑃𝑃𝑆𝑆 1 If A B are mutually exclusive then 𝑃𝑃𝐴𝐴 𝐵𝐵 𝑃𝑃𝐴𝐴 𝑃𝑃𝐵𝐵 This leaves to you to verify that if A B are not mutually exclusive the following relationship holds 𝑃𝑃𝐴𝐴 𝐵𝐵 𝑃𝑃𝐴𝐴 𝑃𝑃𝐵𝐵 𝑃𝑃𝐴𝐴 𝐵𝐵 Understanding important concepts in probability 111 The probability mass function and the probability density function Both the Probability Mass Function PMF and the Probability Density Function PDF we are invented to describe the point density of a distribution PMF can be used to describe the probability of discrete events whereas PDF can be used to describe continuous cases Lets look at some examples to understand these functions better PMF PMF associates each single outcome of a discrete probability with a probability For example the following table represents a PMF for our cointossing experiment Figure 51 Probability of coin tossing outcomes If the coin is biased toward heads then the probability of having heads will be larger than 05 but the sum will remain as 1 Lets say you toss two fair dice What is the PMF for the sum of the outcomes We can achieve the result by counting The table cells contain the sum of the two outcomes Figure 52 Sum of two dicetossing outcomes 112 Common Probability Distributions We can then build a PMF table as shown in the following table As you can see the probability associated with each outcome is different Also note that the sample space changes its definition when we change the random experiment In these twodice cases the sample space S becomes all the outcomes of the possible sums Figure 53 Probability of the sum of dice tossing Lets denote the one dices outcome as A and another as B You can verify that PAB PAPB In this case the easiest example is the case of that sum being 2 and 12 Understanding important concepts in probability 113 The following code snippet can simulate the experiment as we know that each dice generates the possible outcome equally First generate all the possible outcomes import random import numpy as np dice 123456 probs 161616161616 sums sortednpuniquedicei dicej for i in range6 for j in range6 The following code then calculates all the associated probabilities I iterated every possible pair of outcomes and added the probability product to the corresponding result Here we make use of the third axiom declared earlier and the relationship we just claimed from collections import OrderedDict res OrderedDict for s in sums ress 0 for i in range6 for j in range6 if diceidicejs ress probsiprobsj Note on code performance The code is not optimized for performance but for readability OrderedDict creates a dictionary that maintains the order of the key as the order in which the keys are created Lets check the results and plot them with a bar plot Since the dictionary is ordered it is OK to plot keys and values directly as x and height as per the functions API pltfigurefigsize86 pltrcParamsupdatefontsize 22 pltbarreskeysresvalues plttitleProbabilities of Two Dice Sum pltxlabelSum Value pltylabelProbability 114 Common Probability Distributions Lets check out the beautiful symmetric result Figure 54 Probabilities of two dice sums You can check the sum of the values by 
using sumresvalues PDF PDF is the equivalent of PMF for continuous distribution For example a uniform distribution at interval 01 will have a PDF of 𝑓𝑓𝑋𝑋𝑥𝑥 𝑃𝑃𝑥𝑥 1 for any x in the range The requirements for a PDF to be valid are straightforward from the axioms for a valid probability A PDF must be nonnegative and integrates to 1 in the range where it takes a value Lets check a simple example Suppose the PDF of the bus arrival time looks as follows Figure 55 PDF of the bus arrival time Understanding important concepts in probability 115 You may check that the shaded region does have an area of 1 The bus has the highest probability of arriving at 830 and a lower probability of arriving too early or too late This is a terrible bus service anyway Unlike PMF a PDFs value can take an arbitrarily high number The highest probability for the twodice outcome is but the highest value on the PDF graph is 2 This is one crucial difference between PDF and PMF A single point on the PDF function doesnt hold the same meaning as a value in the PMF table Only the integrated Area Under the Curve AUC represents a meaningful probability For example in the previous PDF the probability that the bus will arrive between 824 AM 84 on the xaxis and 836 AM 86 on the xaxis is the area of the central lightly shaded part as shown in the following graph Figure 56 Probability that a bus will arrive between 824 AM and 836 AM Note on the difference between PMFPDF plots and histogram plots Dont confuse the PMF or PDF plots with the histogram plots you saw in previous chapters A histogram shows data distribution whereas the PMF and PDF are not backed by any data but theoretical claims It is not possible to schedule a bus that obeys the preceding weird PDF strictly but when we estimate the mean arrival time of the bus the PDF can be used as a simple approachable tool Integrating a PDF up to a certain value x gives you the Cumulative Distribution Function CDF which is shown as the following formula 𝐹𝐹𝑋𝑋𝑥𝑥 𝑓𝑓𝑋𝑋𝑥𝑥 𝑥𝑥 It takes values between 0 and 1 A CDF contains all the information a PDF contains Sometimes it is easier to use a CDF instead of a PDF in order to solve certain problems 116 Common Probability Distributions Subjective probability and empirical probability From another perspective you can classify probabilities into two types Subjective probability or theoretical probability Empirical probability or objective probability Lets look at each of these classifications in detail Subjective probability stems from a theoretical argument without any observation of the data You check the coin and think it is a fair one and then you come up with an equal probability of heads and tails You dont require any observations or random experiments All you have is a priori knowledge On the other hand empirical probability is deduced from observations or experiments For example you observed the performance of an NBA player and estimated his probability of 3point success for next season Your conclusion comes from a posteriori knowledge If theoretical probability exists by the law of large numbers given sufficient observations the observed frequency will approximate the theoretical probability infinitely closely For the content of this book we wont go into details of the proof but the intuition is clear In a realworld project you may wish to build a robust model to obtain a subjective probability Then during the process of observing random experiment results you adjust the probability to reflect the observed correction Understanding common 
discrete probability distributions In this section we will introduce you to some of the most important and common distributions I will first demonstrate some examples and the mechanism behind them that exhibits corresponding probability Then I will calculate the expectation and variance of the distribution show you samples that generated from the probability and plot its histogram plot and boxplot The expectation of X that follows a distribution is the mean value that X can take For example with PDF 𝑓𝑓𝑋𝑋𝑥𝑥 the mean is calculated as follows 𝐸𝐸𝑥𝑥 𝑓𝑓𝑋𝑋𝑥𝑥𝑥𝑥𝑥𝑥𝑥𝑥 Understanding common discrete probability distributions 117 The variance measures the spreading behavior of the distribution and is calculated as follows μ and σ2 are the common symbols for expectation and variance X is called a random variable Note that it is the outcome of a random experiment However not all random variables represent outcomes of events For example you can take Y expX and Y is also a random variable but not an outcome of a random experiment You can calculate the expectation and variance of any random variable We have three discrete distributions to cover They are as follows Bernoulli distribution Binomial distribution Poisson distribution Now lets look at these in detail one by one Bernoulli distribution Bernoulli distribution is the simplest discrete distribution that originates from a Bernoulli experiment A Bernoulli experiment resembles a general cointossing scenario The name comes from Jakob I Bernoulli a famous mathematician in the 1600s A Bernoulli experiment has two outcomes and the answer is usually binary that one outcome excludes another For example the following are all valid Bernoulli experiments Randomly ask a person whether they are married Buy a lottery ticket to win the lottery Whether a person will vote for Trump in 2020 If one event is denoted as a success with probability p then the opposite is denoted as a failure with probability 1 p Using X to denote the outcome and x to denote the outcomes realization where we set x 1 to be a success and x 0 to be a failure the PMF can be concisely written as follows 𝑉𝑉𝑉𝑉𝑉𝑉𝑥𝑥 𝑓𝑓𝑋𝑋𝑥𝑥𝑥𝑥 𝐸𝐸𝑥𝑥 2𝑑𝑑𝑥𝑥 𝑓𝑓𝑋𝑋𝑥𝑥 𝑝𝑝𝑥𝑥1 𝑝𝑝1𝑥𝑥 118 Common Probability Distributions Note on notations As you may have already noticed the uppercase letter is used to denote the outcome or event itself whereas the lowercase letter represents a specific realization of the outcome For example x can represent the realization that X takes the value of married in one experiment X denotes the event itself Given the definitions the mean is as follows The variance is as follows The following code performs a computational experiment with p 07 and sample size 1000 p 07 samples randomrandom 07 for in range1000 printnpmeansamplesnpvarsamples Since the result is straightforward we wont get into it You are welcome to examine it Binomial distribution Binomial distribution is built upon the Bernoulli distribution Its outcome is the sum of a collection of independent Bernoulli experiments Note on the concept of independency This is the first time that we have used the word independent explicitly In a twocoin experiment it is easy to imagine the fact that tossing one coin wont influence the result of another in any way It is enough to understand independent this way and we will go into its mathematical details later Lets say you do n Bernoulli experiments each with a probability of success p Then we say the outcome X follows a binomial distribution parametrized by n and p The outcome of the experiment can take any 
value k as long as k is smaller than or equal to n The PMF reads as follows 𝑝𝑝 1 1 𝑝𝑝 0 𝑝𝑝 0 𝑝𝑝21 𝑝𝑝 1 𝑝𝑝2𝑝𝑝 𝑝𝑝1 𝑝𝑝 𝑓𝑓𝑘𝑘 𝑛𝑛 𝑝𝑝 𝑃𝑃𝑋𝑋 𝑘𝑘 𝑛𝑛 𝑘𝑘 𝑛𝑛 𝑘𝑘 𝑝𝑝𝑘𝑘1 𝑝𝑝𝑛𝑛𝑘𝑘 Understanding common discrete probability distributions 119 The first term represents the combination of selecting k successful experiments out of n The second term is merely a product of independent Bernoulli distribution PMF The expectationmean of the binomial distribution is np and the variance is np1 p This fact follows the results of the sums of independent random variables The mean is the sum of means and the variance is the sum of the variance Lets do a simple computational example tossing a biased coin with PHead 08 100 times and plotting the distribution of the sum with 1000 trials The following code snippet first generates the theoretical data points X i for i in range1101 p 08 Fx npmathfactorialnnpmathfactorialnknpmath factorialkpk1pnk for k in X Then the following code snippet conducts the experiment and the plotting randomseed2020 n 100 K for trial in range1000 k npsumrandomrandom p for in rangen Kappendk pltfigurefigsize86 plthistKbins30densityTruelabelComputational Experiment PMF pltplotXFxcolorrlabelTheoretical PMFlinestyle pltlegend 120 Common Probability Distributions The result looks as follows Figure 57 Binomial distribution theoretical and simulation results The simulated values and the theoretical values agree pretty well Now well move on to another important distribution Poisson distribution The last discrete distribution we will cover is Poisson distribution It has a PMF as shown in the following equation where λ is a parameter k can take values of positive integers This distribution looks rather odd but it appears in nature everywhere Poisson distribution can describe the times a random event happens during a unit of time For example the number of people calling 911 in the United States every minute will obey a Poisson distribution The count of gene mutations per unit of time also follows a Poisson distribution Lets first examine the influence of the value λ The following code snippet plots the theoretical PMF for different values of λ lambdas 24616 K k for k in range30 pltfigurefigsize86 for il in enumeratelambdas pltplotKnpexpllknpmathfactorialk for k in K labelstrl 𝑃𝑃𝑋𝑋 𝑘𝑘 𝑒𝑒λλ𝑘𝑘 𝑘𝑘 Understanding the common continuous probability distribution 121 markeri2 pltlegend pltylabelProbability pltxlabelValues plttitleParameterized Poisson Distributions The result looks as follows Figure 58 Poisson distribution with various λ values The trend is that the larger λ is the larger the mean and variance will become This observation is true Indeed the mean and variance of Poisson distribution are both λ The numpyrandompoisson function can easily generate Poisson distribution samples The computational experiment is left to you You can try and conduct computational experiments on your own for further practice Understanding the common continuous probability distribution In this section you will see the three most important continuous distributions Uniform distribution Exponential distribution Gaussiannormal distribution Lets look at each of these in detail 122 Common Probability Distributions Uniform distribution Uniform distribution is an important uniform distribution It is useful computationally because many other distributions can be simulated with uniform distribution In earlier code examples I used randomrandom in the simulation of the Bernoulli distribution which itself generates a uniform random variable in the 
range 01 For a uniformly distributed random variable on 01 the mean is 05 and the variance is 1 12 This is a good number to remember for a data scientist role interview For a general uniform distribution If the range is ab the PDF reads as 𝑃𝑃𝑋𝑋 𝑥𝑥 1 𝑏𝑏 𝑎𝑎 if x is in the range ab The mean and variance become 2 and 2 12 respectively If you remember calculus check it yourself We will skip the computational experiments part for simplicity and move on to exponential distribution Exponential distribution The exponential distribution function is another important continuous distribution function In nature it mostly describes the time difference between independent random distribution For example the time between two episodes of lightning in a thunderstorm or the time between two 911 calls Recall that the number of 911 calls in a unit of time follows the Poisson distribution Exponential distribution and Poisson distribution do have similarities The PDF for exponential distribution is also parameterized by λ The value x can only take nonnegative values Its PDF observes the following form Because of the monotonicity of the PDF the maximal value always happens at x 0 where f0λ λ The following code snippet plots the PDF for different λ lambdas 02040810 K 05k for k in range15 pltfigurefigsize86 for il in enumeratelambdas pltplotKnpexplkl for k in K labelstrl markeri2 pltlegend pltylabelProbability 𝑓𝑓𝑥𝑥 λ λ𝑒𝑒λ𝑥𝑥 Understanding the common continuous probability distribution 123 pltxlabelValues plttitleParameterized Exponential Distributions The result looks as follows Figure 59 Exponential distribution with various λ The larger λ is the higher the peak at 0 is but the faster the distribution decays A smaller λ gives a lower peak but a fatter tail Integrating the product of x and PDF gives us the expectation and variance First the expectation reads as follows The variance reads as follows The result agrees with the graphical story The larger λ is the thinner the tail is and hence the smaller the expectation is Meanwhile the peakier shape brings down the variance Next we will investigate normal distribution 𝐸𝐸𝑥𝑥 𝑥𝑥λ𝑒𝑒λxdx 0 1 λ 𝑉𝑉𝑉𝑉𝑉𝑉𝑥𝑥 𝑥𝑥 𝐸𝐸𝑥𝑥 2λ𝑒𝑒λxdx 0 1 λ2 124 Common Probability Distributions Normal distribution We have used the term normal distribution quite often in previous chapters without defining it precisely A onedimensional normal distribution has a PDF as follows μ and σ2 are the parameters as expectation and standard deviation A standard normal function has an expectation of 0 and a variance of 1 Therefore its PDF reads as follows in a simpler form Qualitative argument of a normal distribution PDF The standard normal distribution PDF is an even function so it is symmetric with a symmetric axis x 0 Its PDF also monotonically decays from its peak at a faster rate than the exponential distribution PDF because it has a squared form Transforming the standard PDF to a general PDF the expectation μ on the exponent shifts the position of the symmetric axis and the variance σ2 determines how quick the decay is The universality of normal distribution is phenomenal For example the human populations height and weight roughly follow a normal distribution The number of leaves on trees in a forest will also roughly follow a normal distribution From the CLT we know that the sample sum from a population of any distribution will ultimately tend to follow a normal distribution Take the tree leaves example and imagine that the probability of growing a leaf follows a very sophisticated probability The total number 
of leaves on a tree will however cloud the details of the sophistication of the probability but gives you a normal distribution A lot of phenomena in nature follow a similar pattern what we observe is a collection or summation of lowerlevel mechanisms This is how the CLT makes normal distribution so universal and important For now lets focus on a onedimensional one 𝑓𝑓𝑥𝑥 1 𝜎𝜎2𝜋𝜋 𝑒𝑒𝑥𝑥𝜇𝜇2 2𝜎𝜎2 𝑓𝑓𝑥𝑥 1 2π 𝑒𝑒𝑥𝑥2 2 Understanding the common continuous probability distribution 125 Lets plot several normal distribution PDFs with different expectations 1 0 and 1 with the following code snippet mus 101 K 02k5 for k in range50 pltfigurefigsize86 for imu in enumeratemus pltplotK 1npsqrt2nppinpexpkmu22 for k in K labelstrmu markeri2 pltlegend pltylabelProbability pltxlabelValues plttitleParameterized Normal Distributions The result looks as follows Figure 510 Normal distribution with different expectations I will leave the exercise with a different variance to you Normal distribution has a deep connection with various statistical tests We will cover the details in Chapter 7 Statistical Hypothesis Testing 126 Common Probability Distributions Learning about joint and conditional distribution We have covered basic examples from discrete probability distributions and continuous probability distributions Note that all of them describe the distribution of a single experiment outcome How about the probability of the simultaneous occurrence of two eventsoutcomes The proper mathematical language is joint distribution Suppose random variables X and Y denote the height and weight of a person The following probability records the probability that X x and Y y simultaneously which is called a joint distribution A joint distribution is usually represented as shown in the following equation For a population we may have PX 170cm Y 75kg 025 You may ask the question What is the probability of a person being 170 cm while weighing 75 kg So you see that there is a condition that we already know this person weighs 75 kg The expression for a conditional distribution is a ratio as follows The notation PXY represents the conditional distribution of X given Y Conditional distributions are everywhere and people often misread them by ignoring the conditions For example does the following argument make sense Most people have accidents within 5 miles of their home therefore the further away you drive the safer you are Whats wrong with this claim It claims the probability Pdeath more than 5 miles away to be small as compared to Pdeath less than 5 miles away given the fact that Pdeath less than 5 miles away is larger than Pdeath more than 5 miles away It intentionally ignores the fact that the majority of commutes take place within short distances of the home in its phrasing of the claim The denominator is simply too small on the righthand side in the following equation I hope you are able to spot such tricks and understand the essence of these concepts 𝑃𝑃𝑋𝑋 𝑥𝑥 𝑌𝑌 𝑦𝑦 𝑃𝑃𝑋𝑋 170𝑐𝑐𝑐𝑐𝑌𝑌 75𝑘𝑘𝑘𝑘 𝑃𝑃𝑋𝑋 170𝑐𝑐𝑐𝑐 𝑌𝑌 75𝑘𝑘𝑘𝑘 𝑃𝑃𝑌𝑌 75𝑘𝑘𝑘𝑘 Pdeath more than 5 miles away 𝑃𝑃𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑ℎ 𝑚𝑚𝑚𝑚𝑚𝑚𝑑𝑑 𝑑𝑑ℎ𝑑𝑑𝑎𝑎 5 𝑚𝑚𝑚𝑚𝑚𝑚𝑑𝑑𝑚𝑚 𝑑𝑑𝑎𝑎𝑑𝑑𝑎𝑎 𝑃𝑃𝑚𝑚𝑚𝑚𝑚𝑚𝑑𝑑 𝑑𝑑ℎ𝑑𝑑𝑎𝑎 5 𝑚𝑚𝑚𝑚𝑚𝑚𝑑𝑑𝑚𝑚 𝑑𝑑𝑎𝑎𝑑𝑑𝑎𝑎 Understanding the power law and black swan 127 Independency and conditional distribution Now we can explore the true meaning of independency In short there are two equivalent ways to declare two random variables independent PX x Y y PX xPY y for any xy and PX xY y PX x for any xy You can check that they are indeed equivalent We see an independent relationship between X and Y implies that a 
conditional distribution of a random variable X over Y doesnt really depend on the random variable Y If you can decompose the PDFPMF of a joint probability into a product of two PDFs PMFs that one only contains one random variable and another only contains another random variable then the two random variables are independent Lets look at a quick example You toss a coin three times X denotes the event when you see two or more heads and Y denotes the event when the sum of heads is odd Are X and Y independent Lets do a quick calculation to obtain the probabilities 𝑃𝑃𝑋𝑋 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇 1 2 𝑃𝑃𝑌𝑌 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇 1 2 𝑃𝑃𝑋𝑋 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇 𝑌𝑌 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇 1 4 You can verify the remaining three cases to check that X and Y are indeed independent The idea of conditional distribution is key to understanding many behaviors for example the survival bias For all the planes returning from the battlefield if the commander reinforces those parts where the plane got shot will the air force benefit from it The answer is probably no The commander is not looking at the whole probability but a conditional probability distribution based on the fact the plane did return To make the correct judgement the commander should reinforce those parts where the plane was not shot It is likely that a plane that got shot in those areas didnt make it back Conditional probability is also crucial to understanding the classical classification algorithm the Bayesbased classifier Adding an independence requirement to the features of the data we simplify the algorithm further to the naïve Bayes classifier We will cover these topics in Chapter 9 Working with Statistics for Classification Tasks Understanding the power law and black swan In this last section I want to give you a brief overview of the socalled power law and black swan events 128 Common Probability Distributions The ubiquitous power law What is the power law If you have two quantities such that one varies according to a power relationship of another and independent of the initial sizes then you have a power law relationship Many distributions have a power law shape rather than normal distributions 𝑃𝑃𝑋𝑋 𝑥𝑥 𝐶𝐶𝑥𝑥α The exponential distribution we saw previously is one such example For a realword example the frequency of words in most languages follows a power law The English letter frequencies also roughly follow a power law e appears the most often with a frequency of 11 The following graph taken from Wikipedia https enwikipediaorgwikiLetterfrequency shows a typical example of such a power law Figure 511 Frequency of English letters Understanding the power law and black swan 129 Whats amazing about a power law is not only its universality but also its lack of welldefined mean and variance in some cases For example when α is larger than 2 the expectation of such a power law distribution will explode to infinity while when α is larger than 3 the variance will also explode Be aware of the black swan Simply put a lack of welldefined variance implications is known as black swan behavior A black swan event is a rare event that is hard to predict or compute scientifically However black swan events have a huge impact on history science or finance and make people rationalize black swan events in hindsight Note Before people found black swans in Australia Europeans believed that all swans were white The term black swan was coined to represent ideas that were considered impossible Here are some of the typical examples of black swan events The 2020 coronavirus outbreak The 2008 financial crisis The 
assassination of Franz Ferdinand which sparked World War I People can justify those events afterward which make black swans totally inescapable but no one was prepared to prevent the occurrence of those black swans beforehand Beware of black swan events in the distribution you are working on Remember in the case of our Texas county population data in Chapter 4 Sampling and Inference Statistics that most of the counties have small populations but quite a few have populations that are 10 times above the average If you have sampled your data on a fair number of occasions and didnt see these outliers you may be inclined toward making incorrect estimations 130 Common Probability Distributions Summary In this chapter we covered the basics of common discrete and continuous probability distributions examined their statistics and also visualized the PDFs We also talked about joint distribution conditional distribution and independency We also briefly covered power law and black swan behavior Many distributions contain parameters that dictate the behavior of the probability distribution Suppose we know a sample comes from a population that follows a certain distribution how do you find the parameter of the distribution This will be the topic of our next chapter parametric estimation 6 Parametric Estimation One big challenge when working with probability distributions is identifying the parameters in the distributions For example the exponential distribution has a parameter λ and you can estimate it to get an idea of the mean and the variance of the distribution Parametric estimation is the process of estimating the underlying parameters that govern the distribution of a dataset Parameters are not limited to those that define the shape of the distribution but also the locations For example if you know that a dataset comes from a uniform distribution but you dont know the lower bound a and upper bound b of the distribution you can also estimate the values of a and b as they are also considered legitimate parameters Parametric estimation is important because it gives you a good idea of the dataset with a handful of parameters for example the distributions and associated descriptive statistics Although reallife examples wont exactly follow a distribution parameter estimation does serve as a benchmark for building more complicated models to model the underlying distribution 132 Parametric Estimation After finishing this chapter you will be able to complete the following tasks independently Understanding the concepts of parameter estimation and the features of estimators Using the method of moments to estimate parameters Applying the maximum likelihood approach to estimate parameters with Python Understanding the concepts of parameter estimation and the features of estimators A introduction to estimation theory requires a good mathematical understanding and careful derivation Here I am going to use laymans terms to give you a brief but adequate introduction so that we can move on to concrete examples quickly Estimation in statistics refers to the process of estimating unknown values from empirical data that involves random components Sometimes people confuse estimation with prediction Estimation usually deals with hidden parameters that are embodied in a known dataset things that already happened while prediction tries to predict values that are explicitly not in the dataset things that havent happened For example estimating the population of the world 1000 years ago is an estimation problem You can use 
various kinds of data that may contain information about the population The population is a number that will not change but is unknown On the other hand predicting the population in the year 2050 is a prediction problem because the number is essentially unknown and we have no data that contains it explicitly What parameters can we estimate In principle you can estimate any parameter as long as it is involved in the creation or generation of the random datasets Lets use the symbol θ as the set of all unknown parameters and x as the set of data For example θ1 and x1 will represent one of the parameters and a single data point respectively here indexed by 1 If pxθ depends on θ we can estimate that θ exists in the dataset A note on the exchangeable use of terms Estimation and prediction can sometimes be exchangeable Often estimation doesnt assert the values of new data while prediction does but ambiguity always exists For example if you want to know the trajectory of a missile given its current position the old positions are indeed unknown data but they can also be treated as hidden parameters that will determine the positions observed later We will not go down this rabbit hole here You are good to go if you understand whats going on Understanding the concepts of parameter estimation and the features of estimators 133 An estimator is required to obtain the estimated values of the parameters that were interested in An estimator is also static and the underlined parameter is called the estimand A particular value that an estimator takes is called an estimate a noun Too many concepts Lets look at a realworld case Lets take the 2016 US general election as an example The voting rate is an estimand because it is a parameter to model voting behavior A valid straightforward strategy is to take the average voting rates of a sample of counties in the US regardless of their population and other demographic characteristics This strategy can be treated as an estimator Lets say that a set of sampled counties gives a value of 034 which means 34 of the population vote Then the value 034 is an estimate as the realization of our naive estimator Take another sample of counties the same estimator may give the value of 04 as another estimate Note the estimator can be simple The estimator is not necessarily complicated or fancy It is just a way of determining unknown parameters You can claim the unknown parameter to be a constant regardless of whatever you observe A constant is a valid estimator but it is a horribly wrong one For the same estimand you can have as many estimators as you want To determine which one is better we require more quantitative analysis Without further specifications estimators in this chapter refer to the point estimator The point estimator offers the single best guess of the parameter while the socalled interval estimator gives an interval of the best guesses In the next section lets review the criteria for evaluating estimators Evaluation of estimators For the same estimand how do we evaluate the qualities of different estimators Think of the election example we want the estimation to be as accurate as possible as robust as possible and so on The properties of being accurate and robust have specific mathematical definitions and they are crucial for picking the right estimator for the right tasks The following is a list of the criteria Biasness Consistency Efficiency 134 Parametric Estimation Lets look at each of these criteria in detail The first criterion biasness Recall that an 
estimator is also a random variable that will take different values depending on the observed sample from the population Lets use θ to denote the estimator and θ to denote the true value of the parameter variable which our estimator tries to estimate The expected value of the difference between θ and θ is said to be the bias of the estimator which can be formulated as follows Note that the expectation is calculated over the distribution Pxθ as varied θ is supposed to change the sampled sets of x An estimator is said to be unbiased if its bias is 0 for all values of parameter θ Often we prefer an unbiased estimator over a biased estimator For example political analysts want an accurate voting rate and marketing analysts want a precise customer satisfaction rate for strategy development If the bias is a constant we can subtract that constant to obtain an unbiased estimator if the bias depends on θ it is not easy to fix in general For example the sample mean from a set using simple random sampling is an unbiased estimator This is rather intuitive because simple random sampling gives equal opportunity for every member of the set to be selected Next lets move on to the second criterion which is consistency or asymptotical consistency The second criterion consistency As the number of data points increases indefinitely the resulting sequence of the estimates converges to the true value θ in probability This is called consistency Lets say θn is the estimate given n data points then for any case of ϵ0 consistency gives the following formula For those of you who are not familiar with the language of calculus think about it this way no matter how small you choose the infinitesimal threshold ϵ you can also choose a number of data points n large enough such that the probability that θn is different from θ is going to be 0 On the other hand an inconsistent estimator will fail to estimate the parameter unbiasedly no matter how much data you use Bias θ Eθ θ lim 𝑛𝑛 𝑃𝑃 θ𝑛𝑛 θ ϵ 0 Understanding the concepts of parameter estimation and the features of estimators 135 Note on convergence There are two main types of convergence when we talk about probability and distribution One is called convergence in probability and another is called convergence in distribution There are differences if you plan to dig deeper into the mathematical definitions and implications of the two kinds of convergence All you need to know now is that the convergence in the context of consistency is convergence in probability The third criterion efficiency The last criterion I want to introduce is relative If two estimators are both unbiased which one should we choose The most commonly used quantity is called Mean Squared Error MSE MSE measures the expected value of the square of the difference between the estimator and the true value of the parameter The formal definition reads as follows Note that the MSE says nothing about the biasness of the estimator Lets say estimators A and B both have an MSE of 10 It is possible that estimator A is unbiased but the estimates are scattered around the true value of θ while estimator B is highly biased but concentrated around a point away from the bulls eye What we seek is an unbiased estimator with minimal MSE This is usually hard to achieve However take a step back among the unbiased estimators there often exists one estimator with the least MSE This estimator is called the Minimum Variance Unbiased Estimator MVUE Here we will touch on the concept of variancebias tradeoff The MSE contains two parts 
the bias part and the variance part If the bias part is 0 therefore unbiased then the MSE only contains the variance part So we call this estimator the MVUE We will cover the concepts of bias and variance again in the sense of machine learning for example in Chapter 8 Statistics for Regression and Chapter 9 Statistics for Classification If an estimator has a smaller MSE than another estimator we say the first estimator is more efficient Efficiency is a relative concept and it is defined in terms of a measuring metric Here we use MSE as the measuring metric the smaller the MSE is the more efficient the estimator will be 𝑀𝑀𝑀𝑀𝑀𝑀θ 𝑀𝑀 θ𝑋𝑋 θ 2 136 Parametric Estimation The concept of estimators beyond statistics If you extend the concept of estimators beyond the statistical concept to a real life methodology the accessibility of data is vitally important An estimator may have all the advantages but the difficulty of obtaining data becomes a concern For example to estimate the temperature of the suns surface it is definitively a great idea to send a proxy to do it but this is probably not a costefficient nor timeefficient way Scientists have used other measurements that can be done on Earth to do the estimation Some business scenarios share a similar characteristic where unbiasedness is not a big issue but data accessibility is In the next two sections I will introduce the two most important methods of parameter estimation Using the method of moments to estimate parameters The method of moments associates moments with the estimand What is a moment A moment is a special statistic of a distribution The most commonly used moment is the nth moment of a realvalued continuous function Lets use M to denote the moment and it is defined as follows where the order of the moment is reflected as the value of the exponent 𝑀𝑀𝑛𝑛 𝑥𝑥 𝑐𝑐𝑛𝑛 𝑓𝑓𝑥𝑥𝑑𝑑𝑥𝑥 This is said to be the moment about the value c Often we set c to be 0 𝑀𝑀𝑛𝑛 𝑥𝑥𝑛𝑛𝑓𝑓𝑥𝑥𝑑𝑑𝑥𝑥 Some results are immediately available for example because the integration of a valid Probability Density Function PDF always gives 1 Therefore we have M0 1 Also M1 is the expectation value therefore the mean A note on central moments For highorder moments where c is often set to be the mean these moments are called central moments In this setting the second moment M2 becomes the variance 138 Parametric Estimation Lets first generate a set of artificial data using the following code snippet The true parameter is 20 We will plot the histogram too nprandomseed2020 calls nprandompoissonlam20 size365 plthistcalls bins20 The result looks as follows Figure 61 Histogram plot of the artificial Poisson distribution data Now lets express the first moment with the unknown parameter In Chapter 5 Common Probability Distributions we saw that the expectation value is just λ itself Next lets express the first moment with the data which is just the mean of the data The npmeancalls function call gives the value 19989 This is our estimation and it is very close to the real parameter 20 In short the logic is the following The npmeancalls function call gives the sample mean and we use it to estimate the population mean The population mean is represented by moments Here it is just the first moment For a welldefined distribution the population mean is an expression of unknown parameters In this example the population mean happens to have a very simple expression the unknown parameter λ itself But make sure that you understand the whole chain of logic A more sophisticated example is given next Using the 
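The Poisson snippet above has lost its punctuation in this copy. A runnable version of the same moment-matching example, assuming numpy and matplotlib.pyplot are imported as np and plt as elsewhere in the chapter, looks roughly like this:

np.random.seed(2020)
calls = np.random.poisson(lam=20, size=365)   # artificial data, true lambda = 20
plt.hist(calls, bins=20)                      # the histogram shown in Figure 6.1
plt.show()

# The first moment of a Poisson distribution is lambda itself, so matching
# it to the sample mean gives the estimate directly
lambda_hat = np.mean(calls)
print(lambda_hat)   # the book reports a value close to 19.989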
method of moments to estimate parameters 139 Example 2 the bounds of uniform distribution Lets see another example of continuous distribution We have a set of points that we assume comes from a uniform distribution However we dont know the lower bound α and upper bound β We would love to estimate them The assumed distribution has a uniform PDF on the legitimate domain Here is the complete form of the distribution 𝑃𝑃𝑋𝑋 𝑥𝑥 1 β α The following code snippet generates artificial data with a true parameter of 0 and 10 nprandomseed2020 data nprandomuniform0102000 Lets take a look at its distribution It clearly shows that the 2000 randomly generated data points are quite uniform Each bin contains roughly the same amount of data points Figure 62 Histogram plot of artificial uniform distribution data Next we perform the representation of moments with the unknown parameters 1 First lets express the first and second moments with the parameters The first moment is easy as it is the average of α and β 𝑀𝑀1 05α β 140 Parametric Estimation The second moment requires some calculation It is the integration of the product of x2 and the PDF according to the definition of moments 2 Then we calculate M1 and M2 from the data by using the following code snippet M1 npmeandata M2 npmeandata2 3 The next step is to express the parameters with the moments After solving the two equations that represent M1 and M2 we obtain the following α 𝑀𝑀1 3𝑀𝑀2 𝑀𝑀1 2 β 2𝑀𝑀1 α Substituting the values of the moments we obtain that α 0096 and β 100011 This is a pretty good estimation since the generation of the random variables has a lower bound of 0 and an upper bound of 10 Wait a second What will happen if we are unlucky and the data is highly skewed We may have an unreasonable estimation Here is an exercise You can try to substitute the generated dataset with 1999 values being 10 and only 1 value being 0 Now the data is unreasonably unlikely because it is supposed to contain 2000 data points randomly uniformly selected from the range 0 to 10 Do the analysis again and you will find that α is unrealistically wrong such that a uniform distribution starting from α cannot generate the dataset we coined itself How ridiculous However if you observe 1999 out of 2000 values aggregated at one single data point will you still assume the underlying distribution to be uniform Probably not This is a good example of why the naked eye should be the first safeguard of your statistical analysis You have estimated two sets of parameters in two different problems using the method of moments Next we will move on to the maximum likelihood approach 𝑀𝑀2 𝑥𝑥2 β α 𝑑𝑑𝑥𝑥 β α 1 3α2 αβ β2 Applying the maximum likelihood approach with Python 141 Applying the maximum likelihood approach with Python Maximum Likelihood Estimation MLE is the most widely used estimation method It estimates the probability parameters by maximizing a likelihood function The obtained extremum estimator is called the maximum likelihood estimator The MLE approach is both intuitive and flexible It has the following advantages MLE is consistent This is guaranteed In many practices a good MLE means the job that is left is simply to collect more data MLE is functionally invariant The likelihood function can take various transformations before maximizing the functional form We will see examples in the next section MLE is efficient Efficiency means when the sample size tends to infinity no other consistent estimator has a lower asymptotic MSE than MLE With that power in MLE I bet you just cant wait 
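Here is a runnable sketch of the uniform-bounds calculation just described, again assuming np is NumPy; the solved expressions for alpha and beta come from equating the first two moments of Uniform(alpha, beta) with the sample moments.

np.random.seed(2020)
data = np.random.uniform(0, 10, 2000)     # true bounds are 0 and 10

M1 = np.mean(data)         # first sample moment
M2 = np.mean(data ** 2)    # second sample moment

# For Uniform(alpha, beta): M1 = (alpha + beta) / 2 and
# M2 = (alpha**2 + alpha*beta + beta**2) / 3, which solve to:
alpha_hat = M1 - np.sqrt(3 * (M2 - M1 ** 2))
beta_hat = 2 * M1 - alpha_hat
print(alpha_hat, beta_hat)   # both land very close to the true bounds 0 and 10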
to try it Before maximizing the likelihood we need to define the likelihood function first Likelihood function A likelihood function is a conditional probability distribution function that conditions upon the hidden parameter As the name suggests it measures how likely it is that our observation comes from a distribution with the hidden parameters by assuming the hidden parameters are essentially true When you change the hidden parameters the likelihood function changes value In another words the likelihood function is a function of hidden parameters The difference between a conditional distribution function and a likelihood function is that we focus on different variables For a conditional distribution Pevent parameter we focus on the event and predict how likely it is that an event will happen So we are interested in the fevent Pevent parameter λ function where λ is known We treat the likelihood function as a function over the parameter domain where all the events are already observed fparameter Pevent E parameter where the collection of events E is known You can think of it as the opposite of the standard conditional distribution defined in the preceding paragraph Lets take coin flipping as an example Suppose we have a coin but we are not sure whether it is fair or not However what we know is that if it is unfair getting heads is more likely with a probability of Phead 06 142 Parametric Estimation Now you toss it 20 times and get 11 heads Is it more likely to be a fair coin or an unfair coin What we want to find is which of the following is more likely to be true P11 out of 20 is head fair or P11 out of 20 is head unfair Lets calculate the two possibilities The distribution we are interested in is a binomial distribution If the coin is fair then we have the following likelihood function value If the coin is biased toward heads then the likelihood function reads as follows It is more likely that the coin is fair I deliberately picked such a number so that the difference is subtle The essence of MLE is to maximize the likelihood function with respect to the unknown parameter A note on the fact that likelihood functions dont sum to 1 You may observe a fact that likelihood functions even enumerating all possible cases here two do not necessarily sum to unity This is due to the fact that likelihood functions are essentially not legitimate PDFs The likelihood function is a function of fairness where the probability of getting heads can take any value between 0 and 1 What gives a maximal value Lets do an analytical calculation and then plot it Lets use p to denote the probability of getting heads and Lp to denote the likelihood function You can do a thought experiment here The value of p changes from 0 to 1 but both p 0 and p 1 make the expression equal to 0 Somewhere in between there is a p value that maximizes the expression Lets find it Note that the value of p that gives the maximum of this function doesnt depend on the combinatorial factor We can therefore remove it to have the following expression 20 1191 2 20 01602 20 1193 5 1 1 2 5 9 01597 𝐿𝐿𝑝𝑝 20 11 9 𝑝𝑝111 𝑝𝑝9 𝐿𝐿𝑝𝑝 𝑝𝑝111 𝑝𝑝9 Applying the maximum likelihood approach with Python 143 You can further take the logarithm of the likelihood function to obtain the famous loglikelihood function 𝑙𝑙𝑙𝑙𝑙𝑙𝑙𝑙𝑝𝑝 11𝑙𝑙𝑙𝑙𝑙𝑙𝑝𝑝 9𝑙𝑙𝑙𝑙𝑙𝑙1 𝑝𝑝 The format is much cleaner now In order to obtain this expression we used the formulas 𝑙𝑙𝑙𝑙𝑙𝑙𝑎𝑎𝑏𝑏 𝑏𝑏 𝑙𝑙𝑙𝑙𝑙𝑙𝑎𝑎 and 𝑙𝑙𝑜𝑜𝑔𝑔𝑎𝑎 𝑏𝑏 𝑙𝑙𝑜𝑜𝑔𝑔𝑎𝑎 𝑙𝑙𝑜𝑜𝑔𝑔𝑏𝑏 Transformation invariance The transformation suggests that the likelihood 
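The two likelihood values quoted above can be reproduced with scipy.stats.binom, and a coarse grid search already locates the maximizing p; this sketch assumes np is NumPy and that SciPy is installed.

from scipy.stats import binom

# Likelihood of 11 heads in 20 tosses under each candidate value of p
print(binom.pmf(11, 20, 0.5))   # fair coin, roughly 0.1602
print(binom.pmf(11, 20, 0.6))   # biased coin, roughly 0.1597

# Treating p as unknown, the log-likelihood (up to a constant) is
# 11*log(p) + 9*log(1 - p); its maximum sits at p = 11/20
p_grid = np.linspace(0.01, 0.99, 981)
log_lik = 11 * np.log(p_grid) + 9 * np.log(1 - p_grid)
print(p_grid[np.argmax(log_lik)])   # roughly 0.55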
function is not fixed nor unique You can remove the global constant or transform it with a monotonic function such as logarithm to fit your needs The next step is to obtain the maximal of the loglikelihood function The derivative of logx is just 1 𝑥𝑥 so the result is simple to obtain by setting the derivative to 0 You can verify that the function only has one extremum by yourself P 055 is the right answer This agrees with our intuition since we observe 11 heads among 20 experiments The most likely guess is just 11 20 For completeness I will plot the original likelihood distribution before moving to more complex examples The following code snippet demonstrates this point pltfigurefigsize106 pltplotP factornppowerP11nppower1P9 linestyle linewidth4 labellikelihood function pltaxvline055 linestyle linewidth4 labelmost likely p colorr pltxlabelProbability of obtaining head pltylabelLikelihood function value pltlegendloc0006 𝑑𝑑𝑑𝑑𝑑𝑑𝑔𝑔𝐿𝐿𝑝𝑝 𝑑𝑑𝑝𝑝 11 𝑝𝑝 9 1 𝑝𝑝 0 144 Parametric Estimation The result is as follows The vertical dashed line indicates the maximum where p 055 Figure 63 The likelihood function of the cointossing experiment You can verify visually that the likelihood function only has one maximum MLE for uniform distribution boundaries Lets first revisit our previous example introduced in the Using the method of moments to estimate parameters section where the data is uniformly sampled from the range ab but both a and b are unknown We dont need to do any hardcore calculation or computational simulation to obtain the result with MLE Suppose the data is denoted as x1 x2 up to xn n is a large number as we saw in the previous section 2000 The likelihood function is therefore as follows The logarithm will bring down the exponent n which is a constant So what we want to maximize is 𝑙𝑙𝑙𝑙𝑙𝑙 1 𝑏𝑏 𝑎𝑎 𝑙𝑙𝑙𝑙𝑙𝑙𝑏𝑏 𝑎𝑎 which further means we wish to minimize logba 𝐿𝐿𝑎𝑎 𝑏𝑏 1 𝑏𝑏 𝑎𝑎 𝑛𝑛 Applying the maximum likelihood approach with Python 145 Transformation of logarithm Note that 𝑙𝑙𝑙𝑙𝑙𝑙 1 𝑥𝑥 is essentially 𝑙𝑙𝑙𝑙𝑙𝑙𝑥𝑥1 You can pull the 1 out of the logarithm Alright the result is simple enough such that we dont need derivatives When b becomes smaller logba is smaller When a is larger but must be smaller than b logba is smaller However b is the upper bound so it cant be smaller than the largest value of the dataset maxxi and by the same token cant be larger than the smallest value of the dataset minxi Therefore the result reads as follows A minxi b maxxi This agrees with our intuition and we have fully exploited the information we can get from such a dataset Also note that this is much more computationally cheaper than the method of moments approach MLE for modeling noise Lets check another example that is deeply connected with regression models which we are going to see in future chapters Here we will approach it from the perspective of estimators Regression and correlation A regression model detects relationships between variables usually the dependent variables outcome and the independent variables features The regression model studies the direction of the correlation and the strength of the correlation Lets say we anticipate that there is a correlation between random variable X and random variable Y For simplicity we anticipate the relationship between them is just proportional namely Y k X Here k is an unknown constant the coefficient of proportionality However there is always some noise in a realworld example The exact relationship between X and Y is therefore Y k X ε where ε stands for the noise random 
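For the uniform example, the MLE result stated above needs no optimization at all. A two-line sketch, assuming the data array from the method of moments section is still in scope:

# MLE for Uniform(a, b): the boundary estimates are the sample extremes
a_hat, b_hat = np.min(data), np.max(data)
print(a_hat, b_hat)   # both sit just inside the true bounds 0 and 10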
variable 146 Parametric Estimation Lets say we have collected the n independent data pairs xi and yi at our disposal The following code snippet creates a scatter plot for these data points For the demonstration I will choose n 100 pltfigurefigsize106 pltscatterXY pltxlabelX pltylabelY plttitleLinearly correlated variables with noise The result looks like the following Figure 64 Scatter plot of X and Y Instead of modeling the distribution of the data as in the previous example we are going to model the noise since the linear relationship between X and Y is actually known We will see two cases where the modeling choices change the estimation of the coefficient of proportionality k In the first case the noise follows a standard normal distribution and can be shown as N01 In the second case the noise follows a standard Laplace distribution Applying the maximum likelihood approach with Python 147 A note on the two candidate distributions of noise Recall that a standard normal distribution has a PDF that is shown as fx 1 2π ex2 2 Laplace distribution is very similar to the standard normal distribution It has a PDF that can be presented as f𝑥𝑥 1 2 𝑒𝑒𝑥𝑥 The big difference is that one decays faster while the other decays slower The positive half of the standard Laplace distribution is just half of the exponential distribution with λ 1 The following code snippet plots the two distributions in one graph xs nplinspace55100 normalvariables 1npsqrt2nppinpexp05xs2 laplacevariables 05npexpnpabsxs pltfigurefigsize108 pltplotxsnormalvariableslabelstandard normallinestylelinewidth4 pltplotxslaplacevariableslabelstandard Laplacelinestylelinewidth4 pltlegend The result looks as in the following figure The dashed line is the standard normal PDF and the dotted line is the standard Laplace PDF Figure 65 Standard normal and standard Laplace distribution around 0 148 Parametric Estimation I will model the noises according to the two distributions However the noise is indeed generated from a standard normal distribution Lets first examine case 1 Suppose the noise Є follows the standard normal distribution This means Єi yi kxi as you use other variables to represent Єi follows the given distribution Therefore we have fϵik 1 2π eyikxi2 2 for every such random noise data point Now think of k is the hidden parameter We just obtained our likelihood function for one data point As each data point is independent the likelihood function can therefore be aggregated to obtain the overall likelihood function as shown in the following formula What we want to find is k such that it maximizes the likelihood function Lets introduce the mathematical way to express this idea Lets take the logarithm of the likelihood function and make use of the rule that the logarithm of a product is the sum of each terms logarithm Note on simplifying the logarithm of a sum The rule of simplifying the likelihood functions expression is called the product rule of logarithm It is shown as 𝑙𝑙𝑙𝑙 𝑓𝑓𝑖𝑖 𝑙𝑙𝑙𝑙𝑓𝑓𝑖𝑖 Therefore using the product rule to decompose the loglikelihood function we have 𝑙𝑙𝑜𝑜𝑜𝑜𝑜𝑜𝑘𝑘 𝑙𝑙𝑜𝑜𝑜𝑜𝑓𝑓ϵ𝑖𝑖𝑘𝑘 𝑖𝑖 For each term we have the following Substitute the expression into the loglikelihood function Note that the optimal k wont depend on the first term which is a constant So we can drop it and focus on the second part L𝑘𝑘 fϵ0 ϵ𝑛𝑛𝑘𝑘 𝑓𝑓ϵ𝑖𝑖𝑘𝑘 𝑖𝑖 kMLE argmaxkLk log𝑓𝑓ϵ𝑖𝑖𝑘𝑘 05log2π 𝑦𝑦𝑖𝑖𝑘𝑘 𝑥𝑥𝑖𝑖2 2 logL𝑘𝑘 𝑛𝑛 2 log2π 1 2𝑦𝑦𝑖𝑖 𝑘𝑘 𝑥𝑥𝑖𝑖2 Applying the maximum likelihood approach with Python 149 Now lets calculate the derivative with respect to k At the 
maximum of the function the expression of the derivative will become 0 Equating this to 0 we find the optimal value of k On verifying maximality If you are familiar with calculus you may wonder why I didnt calculate the second derivative to verify that the value is indeed a maximum rather than a possible minimum You are right that in principle such a calculation is needed However in our examples in this chapter the functional forms are quite simple like a simple quadratic form with only one maximum The calculation is therefore omitted The following code snippet does the calculation Note that if a variable is a NumPy array you can perform elementwise calculation directly npsumXYnpsumXX The result is 040105608977245294 Lets visualize it to check how well this estimation does The following code snippet adds the estimated y values to the original scatter plot as a dashed bold line pltfigurefigsize106 pltscatterXYalpha 04 pltxlabelX pltylabelY pltplotXXk1linewidth4linestylecr labelfitted line pltlegend 𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑘𝑘 𝑑𝑑𝑘𝑘 𝑦𝑦𝑖𝑖 𝑘𝑘 𝑥𝑥𝑖𝑖𝑥𝑥𝑖𝑖 𝑖𝑖 𝑘𝑘 𝑥𝑥𝑖𝑖𝑦𝑦𝑖𝑖 𝑥𝑥𝑖𝑖𝑥𝑥𝑖𝑖 150 Parametric Estimation The result looks as follows Figure 66 Estimated Y value according to normal noise assumption The result looks quite reasonable Now lets try the other case The logic of MLE remains the same until we take the exact form of the likelihood function fЄik The likelihood function for the second Laplacedistributed noise case is the following Taking the logarithm the loglikelihood has a form as follows I have removed the irrelevant constant 05 as I did in the cointossing example to simplify the calculation In order to maximize the loglikelihood function we need to minimize the summation yikxi This summation involves absolute values and the sign of each term depends on the value of k Put the book down and think for a while about the minimization Lk fϵ𝑖𝑖𝑘𝑘 1 2 𝑒𝑒𝑦𝑦𝑖𝑖𝑘𝑘𝑥𝑥𝑖𝑖 logLk yi 𝑘𝑘𝑥𝑥𝑖𝑖 Applying the maximum likelihood approach with Python 151 Lets define a function signx which gives us the sign of x If x is positive signx 1 if x is negative signx 1 otherwise it is 0 Then the derivative of the preceding summation with respect to k is essentially the following Because the signx function jumps it is still not easy to get a hint Lets do a computational experiment I will pick k between 0 and 1 then create a graph of the loglikelihood function values and the derivatives so that you can have a visual impression The following code snippet creates the data I needed Ks nplinspace0206100 def calloglikelihoodX Y k return npsumnpabsYkX def calderivativeXYk return npsumXnpsignYkX Likelihoods calloglikelihoodXYk for k in Ks Derivatives calderivativeXYk for k in Ks I picked the range 0206 and selected k values for a 004 increment Why did I pick this range This is just a first guess from the scatter plot The following code snippet plots the results pltfigurefigsize108 pltplotKsLikelihoodslabelLikelihood functionlinestylelinewidth4 pltplotKsDerivativeslabel Derivativelinestylelinewidth4 pltlegend pltxlabelK 𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑘𝑘 𝑑𝑑𝑘𝑘 𝑥𝑥𝑖𝑖 sign𝑦𝑦𝑖𝑖𝑘𝑘𝑥𝑥𝑖𝑖 0 152 Parametric Estimation The dashed line is the likelihood function value and the dotted line is the derivative of the likelihood function Figure 67 Searching for the optimal k point through computational experimentation You may notice that there seems to be a plateau for the functions This is true Lets zoom in to the range where k takes the value 038042 The following code snippet does the job Ks nplinspace038042100 Likelihoods calloglikelihoodXYk for k in Ks Derivatives calderivativeXYk for k in 
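The grid-search listing that starts above (and continues below) has lost its line breaks, and most likely a minus sign as well, since the code later takes an argmax. A runnable reconstruction under that assumption, with X and Y being the arrays from the scatter plot, is:

Ks = np.linspace(0.2, 0.6, 100)

def cal_log_likelihood(X, Y, k):
    # Laplace-noise log-likelihood up to a constant: -sum(|y_i - k*x_i|)
    return -np.sum(np.abs(Y - k * X))

def cal_derivative(X, Y, k):
    # Derivative of that log-likelihood with respect to k
    return np.sum(X * np.sign(Y - k * X))

likelihoods = [cal_log_likelihood(X, Y, k) for k in Ks]
derivatives = [cal_derivative(X, Y, k) for k in Ks]

k_best = Ks[np.argmax(likelihoods)]
print(k_best)   # close to 0.4; the book zooms into [0.38, 0.42] and reports about 0.397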
Ks pltfigurefigsize108 pltplotKsLikelihoodslabelLikelihood functionlinestylelinewidth4 pltplotKsDerivativeslabel Derivativelinestylelinewidth4 pltlegend pltxlabelK Applying the maximum likelihood approach with Python 153 The result looks like the following Figure 68 The plateau of derivatives This strange behavior comes from the fact that taking the derivative of the absolute value function lost information about the value itself We only have information about the sign of the value left However we can still obtain an estimation of the optimal value by using the numpyargmax function This function returns the index of the maximal value in an array We can then use this index to index the array for our k values The following oneline code snippet does the job KsnpargmaxLikelihoods The result is about 0397 Then we know the real k values are in the range 0397 00004 0397 00004 which is smaller than our result from case 1 Why 00004 We divide a range of 004 into 100 parts equally so each grid is 00004 154 Parametric Estimation Lets plot both results together They are almost indistinguishable The following code snippet plots them together pltfigurefigsize108 pltscatterXYalpha 04 pltxlabelX pltylabelY pltplotXXk1linewidth4linestylecr labelestimation from normal noise assumption pltplotXXk2linewidth4linestylecg labelestimation from Laplace noise assumption pltlegend The result looks as follows Figure 69 Estimations of the coefficient of proportionality Alright we are almost done Lets find out how the data is generated The following code snippet tells you that it is indeed generated from the normal distribution randomseed2020 X nplinspace1010100 Y X 04 nparrayrandomnormalvariate01 for in range100 Applying the maximum likelihood approach with Python 155 In this example we modeled the random noise with two different distributions However rather than a simple unknown parameter our unknown parameter now carries correlation information between data points This is a common technique to use MLE to estimate unknown parameters in a model by assuming a distribution We used two different distributions to model the noise However the results are not very different This is not always the case Sometimes poorly modeled noise will lead to the wrong parameter estimation MLE and the Bayesian theorem Another important question remaining in a realworld case is how to build comprehensive modeling with simply the raw data The likelihood function is important but it may miss another important factor the prior distribution of the parameter itself which may be independent of the observed data To have a solid discussion on this extended topic let me first introduce the Bayesian theorem for general events A and B The Bayesian theorem builds the connection between PAB and PBA through the following rule This mathematical equation is derived from the definition of the conditional distribution 𝑃𝑃𝐴𝐴𝐵𝐵 𝑃𝑃𝐴𝐴 𝐵𝐵 𝑃𝑃𝐵𝐵 and 𝑃𝑃𝐵𝐵𝐴𝐴 𝑃𝑃𝐴𝐴 𝐵𝐵 𝑃𝑃𝐴𝐴 are both true statements therefore 𝑃𝑃𝐴𝐴𝐵𝐵𝑃𝑃𝐵𝐵 𝑃𝑃𝐵𝐵𝐴𝐴𝑃𝑃𝐴𝐴 which gives the Bayesians rule Why is the Bayesian rule important To answer this question lets replace A with observation O and B with the hidden unknown parameter λ Now lets assign reallife meaning to parts of the equation P0λ is the likelihood function we wish to maximize Pλ0 is a posterior probability of λ Pλ is a prior probability of λ PO is essentially 1 because it is observed therefore determined P𝐴𝐴𝐵𝐵 𝑃𝑃𝐵𝐵𝐴𝐴 𝑃𝑃𝐴𝐴 𝑃𝑃𝐵𝐵 P𝑂𝑂λ 𝑃𝑃λ𝑂𝑂 𝑃𝑃𝑂𝑂 𝑃𝑃λ 156 Parametric Estimation If you have forgotten the concepts of posterior probability and prior 
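The data-generation snippet revealed above is also flattened in this copy. It appears to correspond to the following runnable form; the negative lower limit in linspace is my assumption, made because the flattened text only shows the digits 10, 10, 100:

import random

random.seed(2020)
X = np.linspace(-10, 10, 100)
# True coefficient k = 0.4, plus standard normal noise on every point
Y = X * 0.4 + np.array([random.normalvariate(0, 1) for _ in range(100)])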
probability please refer to the Understanding the important concepts section in Chapter 5 Common Probability Distributions The Bayesian theorem basically says that the likelihood probability is just the ratio of posterior probability and prior probability Recall that the posterior distribution is a corrected or adjusted probability of the unknown parameter The stronger our observation O suggests a particular value of λ the bigger the ratio of 𝑃𝑃λ𝑂𝑂 𝑃𝑃λ is In both of our previous cases Pλ is invisible because we dont have any prior knowledge about λ Why is prior knowledge of the unknown parameter important Lets again use the coin tossing game not an experiment anymore as an example Suppose the game can be done by one of two persons Bob or Curry Bob is a serious guy who is not so playful Bob prefers using a fair coin over an unfair coin 80 of the time Bob uses the fair coin Curry is a naughty boy He randomly picks a fair coin or unfair coin for the game fifty fifty Will you take the fact of who you are playing the game with into consideration Of course you will If you play with Curry you will end up with the same analysis of solving MLE problems as coin tossing and noise modeling Earlier we didnt assume anything about the prior information of p However if you play with Bob you know that he is more serious and honest therefore he is unlikely to use an unfair coin You need to factor this into your decision One common mistake that data scientists make is the ignorance of prior distribution of the unknown parameter which leads to the wrong likelihood function The calculation of the modified coin tossing is left to you as an exercise The Bayesian theorem is often utilized together with the law of total probability It has an expression as follows We wont prove it mathematically here but you may think of it in the following way 1 For the first equals sign for the enumeration of all mutually exclusive events Bi each Bi overlaps with event A to some extent therefore the aggregation of the events Bi will complete the whole set of A 2 For the second equals sign we apply the definition of conditional probability to each term of the summation P𝐴𝐴 𝑃𝑃𝐴𝐴 𝐵𝐵𝑖𝑖 𝑖𝑖 𝑃𝑃𝐴𝐴𝐵𝐵𝑖𝑖𝑃𝑃𝐵𝐵𝑖𝑖 𝑖𝑖 Applying the maximum likelihood approach with Python 157 Example of the law of total probability Suppose you want to know the probability that a babys name is Adrian in the US Since Adrian is a genderneutral baby name you can calculate it using the law of total probability as PAdrian PAdrianboy Pboy PAdriangirl Pgirl The last example in this chapter is the famous Monty Hall question It will deepen your understanding of the Bayesian rule You are on a show to win a prize There is a huge prize behind one of three doors Now you pick door A the host opens door B and finds it empty You are offered a second chance to switch to door C or stick to door A what should you do The prior probability for each door A B or C is 𝑃𝑃𝐴𝐴 𝑃𝑃𝐵𝐵 𝑃𝑃𝐶𝐶 1 3 The host will always try to open an empty door after you select if you selected a door without the prize the host will open another empty door If you select the door with the prize the host will open one of the remaining doors randomly Lets use EB to denote the event that door B is opened by the host which is already observed Which pair of the following probabilities should you calculate and compare The likelihood of PEBA and PEBC The posterior probability PAEB and PCEB The answer is the posterior probability Without calculation you know PEBA is 1 2 Why Because if the prize is in fact behind door A door B 
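As an aside before the Monty Hall analysis continues, here is one possible sketch of the modified coin-tossing calculation that was left as an exercise; it reuses the 11-heads-in-20-tosses observation and the 0.6 head probability of the unfair coin, and the only new ingredient is the prior on which coin each player uses.

from scipy.stats import binom

p_data_fair = binom.pmf(11, 20, 0.5)
p_data_unfair = binom.pmf(11, 20, 0.6)

for player, prior_fair in (("Bob", 0.8), ("Curry", 0.5)):
    # Bayes' rule: P(fair | data) is proportional to P(data | fair) * P(fair)
    numerator = p_data_fair * prior_fair
    denominator = numerator + p_data_unfair * (1 - prior_fair)
    print(player, numerator / denominator)

With Curry the posterior probability of a fair coin stays near one half, while Bob's strong prior keeps it near 0.8, which is the point being made about not ignoring priors.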
is simply selected randomly by the host However this information alone will not give us guidance on the next action The posterior probability is what we want because it instructs us what to do after the observation Lets calculate PAEB first According to the Bayesian rule and the law of total probability we have the following equation 𝑃𝑃𝐴𝐴𝐸𝐸𝐵𝐵 𝑃𝑃𝐸𝐸𝐵𝐵𝐴𝐴𝑃𝑃𝐴𝐴 𝑃𝑃𝐸𝐸𝐵𝐵 𝑃𝑃𝐸𝐸𝐵𝐵𝐴𝐴𝑃𝑃𝐴𝐴 𝑃𝑃𝐸𝐸𝐵𝐵𝐴𝐴𝑃𝑃𝐴𝐴 𝑃𝑃𝐸𝐸𝐵𝐵𝐵𝐵𝑃𝑃𝐵𝐵 𝑃𝑃𝐸𝐸𝐵𝐵𝐶𝐶𝑃𝑃𝐶𝐶 158 Parametric Estimation Now what is PEBC It is 1 because the host is sure to pick the empty door to confuse you Lets go over the several conditional probabilities together PEBA The prize is behind door A so the host has an equal chance of randomly selecting from door B and door C This probability is 1 2 PEBB The prize is behind door B so the host will not open door B at all This probability is 0 PEBC The prize is behind door C so the host will definitely open door B Otherwise opening door C will reveal the prize This probability is essentially 1 By the same token 𝑃𝑃𝐶𝐶𝐸𝐸𝐵𝐵 2 3 The calculation is left to you as a small exercise This is counterintuitive You should switch rather than sticking to your original choice From the first impression since the door the host opens is empty there should be an equal chance that the prize will be in one of the two remaining two doors equally The devil is in the details The host has to open a door that is empty You will see from a computational sense about the breaking of symmetry in terms of choices First lets ask the question of whether we can do a computational experiment to verify the results The answer is yes but the setup is a little tricky The following code does the job and I will explain it in detail import random randomseed2020 doors ABC count stick switch 0 0 0 trials for i in range10000 prize randomchoicedoors pick randomchoicedoors reveal randomchoicedoors trial 1 while reveal prize or reveal pick P𝐴𝐴𝐸𝐸𝐵𝐵 1 2 1 3 1 2 1 3 0 1 3 1 1 3 1 3 Applying the maximum likelihood approach with Python 159 reveal randomchoicedoors trial1 trialsappendtrial if reveal pick and reveal prize count 1 if pick prize stick 1 else switch 1 printtotal experiment formatcount printtimes of switch formatswitch printtimes of stick formatstick Run it and you will see the following results total experiment 10000 times of switch 6597 times of stick 3403 Indeed you should switch The code follows the following logic 1 For 10000 experiments the prize is preselected randomly The users pick is also randomly selected The user may or may not pick the prize 2 Then the host reveals one of the doors However we know that the host will reveal one empty door from the remaining two doors for sure We use the trial variable to keep track of the times that we try to generate a random selection to meet this condition This variable is also appended to a list object whose name is trials 3 At last we decide whether to switch or not The symmetry is broken when the host tries to pick the empty door Lets use the following code to show the distribution of trials pltfigurefigsize106 plthisttrialsbins 40 160 Parametric Estimation The plot looks like the following Figure 610 Number of trials in the computer simulation In our plain simulation in order to meet the condition that the host wants to satisfy we must do random selection more than one time and sometimes even more than 10 times This is where the bizarreness hides Enough said on the Bayesian theorem you have grasped the foundation of MLE MLE is a simplified scenario of the Bayesian approach to estimation by assuming the prior 
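The simulation listing above is hard to read after the formatting loss, so here is a reconstruction; the comparison operators are assumptions on my part, since they were stripped out, but the control flow follows the description of re-drawing the revealed door until it is neither the pick nor the prize.

import random

random.seed(2020)
doors = ["A", "B", "C"]
count, stick, switch = 0, 0, 0
trials = []
for i in range(10000):
    prize = random.choice(doors)
    pick = random.choice(doors)
    reveal = random.choice(doors)
    trial = 1
    # Keep re-drawing until the revealed door hides no prize and is not the pick
    while reveal == prize or reveal == pick:
        reveal = random.choice(doors)
        trial += 1
    trials.append(trial)
    if reveal != pick and reveal != prize:
        count += 1
        if pick == prize:
            stick += 1      # sticking would have won
        else:
            switch += 1     # switching would have won
print("total experiment {}".format(count))
print("times of switch {}".format(switch))
print("times of stick {}".format(stick))

Run this way, the split should land near the two-thirds versus one-third proportion reported in the text (6597 switch wins against 3403 stick wins out of 10000).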
distribution of the unknown parameter is uniform Summary In this chapter we covered two important methods of parameter estimation the method of moments and MLE You then learned the background of MLE the Bayesian way of modeling the likelihood function and so on However we dont know how well our estimators perform yet In general it requires a pipeline of hypothesis testing with a quantitative argument to verify a claim We will explore the rich world of hypothesis testing in the next chapter where we will put our hypothesesassumptions to the test 7 Statistical Hypothesis Testing In Chapter 6 Parametric Estimation you learned two important parameter estimation methods namely the method of moments and MLE The underlying assumption for parameter estimation is that we know that the data follows a specific distribution but we do not know the details of the parameters and so we estimate the parameters Parametric estimation offers an estimation but most of the time we also want a quantitative argument of confidence For example if the sample mean from one population is larger than the sample mean from another population is it enough to say the mean of the first population is larger than that of the second one To obtain an answer to this question you need statistical hypothesis testing which is another method of statistical inference of massive power In this chapter you are going to learn about the following topics An overview of hypothesis testing Making sense of confidence intervals and P values from visual examples Using the SciPy Python library to do common hypothesis testing 162 Statistical Hypothesis Testing Understanding the ANOVA model and corresponding testing Applying hypothesis testing to time series problems Appreciating AB testing with realworld examples Some concepts in this chapter are going to be subtle Buckle up and lets get started An overview of hypothesis testing To begin with the overview I would like to share an ongoing example from while I was writing this book As the coronavirus spread throughout the world pharmaceutical and biotechnology companies worked around the clock to develop drugs and vaccines Scientists estimated that it would take at least a year for a vaccine to be available To verify the effectiveness and safety of a vaccine or drug clinical trials needed to be done cautiously and thoroughly at different stages It is a wellknown fact that most drugs and vaccines wont reach the later trial stages and only a handful of them ultimately reach the market How do clinical trials work In short the process of screening medicines is a process of hypothesis testing A hypothesis is just a statement or claim about the statistics or parameters describing a studied population In clinical trials the hypothesis that a medicine is effective or safe is being tested The simplest scenario includes two groups of patients selected randomly You treat one group with the drug and another without the drug and control the rest of the factors Then the trial conductors will measure specific signals to compare the differences for example the concentration of the virus in the respiratory system or the number of days to full recovery The trial conductors then decide whether the differences or the statistics calculated are significant enough You can preselect a significance level α to check whether the trial results meet the expected level of significance Now imagine you observe the average math course scores for 9th grade students in a school You naturally assume that the distribution of each 
years math course scores follows a normal distribution However you find this years sample average score μ2020 is slightly below last years population average score 75 Do you think this finding just comes from the randomness of sampling or is the fundamental level of math skills of all students deteriorating You can use MLE or the methods of the moment to fit this years data to a normal distribution and compare the fitted mean However this still doesnt give you a quantitative argument of confidence In a scorebased a score out of 100 grading system lets say last years average score for all students is 75 and the average score for your 2020 class a 900student sample is 73 Is the decrease real How small or big is the twopoint difference An overview of hypothesis testing 163 To answer these questions we first need to clarify what constitutes a valid hypothesis Here are two conditions The statement must be expressed mathematically For example this years average score is the same as last years is a valid statement This statement can be expressed as 𝐻𝐻0 μ2020 75 with no ambiguity However this years average score is roughly the same as last years is not a valid statement because different people have different assessments of what is roughly the same The statement should be testable A valid statement is about the statistics of observed data If the statement requires data other than the observed data the statement is not testable For example you cant test the differences in students English scores if only math scores are given This famous saying by the statistician Fisher summarizes the requirements for a hypothesis well although he was talking about the null hypothesis specifically The null hypothesis must be exact that is free of vagueness and ambiguity because it must supply the basis of the problem of distribution of which the test of significance is the solution Now we are ready to proceed more mathematically Lets rephrase the math score problem as follows The average math score for the previous years 9th grade students is 75 This year you randomly sample 900 students and find the sample mean is 73 You want to know whether the average score for this years students is lower than last years To begin hypothesis testing the following three steps are required 1 Formulate a null hypothesis A null hypothesis basically says there is nothing special going on In our math score example it means there is no difference between this years score and last years score A null hypothesis is denoted by H0 On the other hand the corresponding alternative hypothesis states the opposite of the null hypothesis It is denoted by H1 or Hα In our example you can use 𝐻𝐻1 μ2020 75 Note that different choices of null hypothesis and alternative hypothesis will lead to different results in terms of accepting or rejecting the null hypothesis 164 Statistical Hypothesis Testing 2 Pick a test statistic that can be used to assess how well the null hypothesis holds and calculate it A test statistic is a random variable that you will calculate from the sampled data under the assumption that the null hypothesis is true Then you calculate the Pvalue according to the known distribution of this test statistic 3 Compute the Pvalue from the test statistic and compare it with an acceptable significance level α Gosh So many new terms Dont worry lets go back to our examples and you will see this unfold gradually After that you will be able to understand these concepts in a coherent way You will be able to follow these three steps to approach 
various kinds of hypothesis testing problems in a unified setting Understanding Pvalues test statistics and significance levels To explain the concepts with the math example lets first get a visual impression of the data I will be using two sample math scores for 900 students in 2019 and sample math scores for 900 students in 2020 Note on given facts The dataset for 2019 is not necessary for hypothesis testing because we are given the fact that the average score for 2019 is exactly 75 I will generate the datasets to provide you with a clear comparable visualization At this point I am not going to tell you how I generated the 2020 data otherwise the ground truth would be revealed to you beforehand I do assure you that the data for 2019 is generated from sampling a normal distribution with a mean of 75 and a variance of 25 The two datasets are called math2020 and math2019 Each of them contains 900 data points Let me plot them with histogram plots so you know roughly what they look like The following code snippet does the job pltfigurefigsize106 plthistmath2020binsnp linspace5010050alpha05labelmath2020 plthistmath2019binsnp linspace5010050alpha05labelmath2019 pltlegend An overview of hypothesis testing 165 Note that I explicitly set the bins to make sure the bin boundaries are fully determined The result looks as follows Figure 71 Histogram plot of math scores from 2020 and 2019 Note that the scores from 2020 do seem to have a smaller mean than the scores from 2019 which is supposed to be very close to 75 Instead of calling the corresponding numpy functions I will just use the following describe function from the scipy librarys stats module from scipy import stats statsdescribemath2020 The result is the following DescribeResultnobs900 minmax5361680120097629 9329408158813376 mean7289645796453996 variance2481446705891462 skewness0007960630504578523 kurtosis03444548003252992 Do the same thing for the year 2019 I find that the mean for the year 2019 is around 75 In Chapter 4 Sampling and Inferential Statistics we discussed the issue of sampling which itself involves randomness Is the difference in means an artifact of randomness or is it real To answer this question lets first embed some definitions into our example starting with the null hypothesis A null hypothesis basically says YES everything is due to randomness An alternative hypothesis says NO to randomness and claims that there are fundamental differences A hypothesis test is tested against the null hypothesis to see whether there is evidence to reject it or not 166 Statistical Hypothesis Testing Back to the math score problem We can pick the null hypothesis 𝐻𝐻0 μ2020 75 and set the alternative hypothesis 𝐻𝐻1 μ2020 75 Or we can choose 𝐻𝐻0 μ2020 75 and set 𝐻𝐻1 μ2020 75 Note that the combination of the null hypothesis and the alternative hypothesis must cover all possible cases The first case is called a twotailed hypothesis because either μ2020 75 or μ2020 75 will be a rebuttal of our null hypothesis The alternative is a onetailed hypothesis Since we only want to test whether our mean score is less than or equal to 75 we choose the onetailed hypothesis for our example Even in the null hypothesis the mean can be larger than 75 but we know this is going to have a negligible likelihood On the choice of onetailed or twotailed hypotheses Whether to use a onetailed hypothesis or a twotailed hypothesis depends on the task at hand One big difference is that choosing a twotailed alternative hypothesis requires an equal split of the significance 
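For reference, a runnable form of the plotting and summary calls above, assuming the two arrays math2020 and math2019 each hold the 900 sampled scores:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

plt.figure(figsize=(10, 6))
plt.hist(math2020, bins=np.linspace(50, 100, 50), alpha=0.5, label="math2020")
plt.hist(math2019, bins=np.linspace(50, 100, 50), alpha=0.5, label="math2019")
plt.legend()
plt.show()

print(stats.describe(math2020))   # mean about 72.9, variance about 24.8
print(stats.describe(math2019))   # mean close to 75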
level on both sides What we have now is a dataset a null hypothesis and an alternative hypothesis The next step is to find evidence to test the null hypothesis The null hypothesis reads 𝐻𝐻0 𝜇𝜇2020 75 whereas the alternative hypothesis reads 𝐻𝐻1 μ2020 75 After setting up the hypothesis the next step is to find a rule to measure the strength of the evidence or in other words to quantify the risk of making mistakes that reject a true null hypothesis The significance level α is essentially an index of the likelihood of making the mistake of rejecting a true null hypothesis For example if a medicine under trial is useless α 005 means that we set a threshold that at the chance of less than or equal to 5 we may incorrectly conclude that the medicine is useful If we set the bar way lower at α 0001 it means that we are very picky about the evidence In other words we want to minimize the cases where we are so unlucky that randomness leads us to the wrong conclusion As we continue to talk about significance all hypothesis testing to be discussed in this chapter will be done with statistical significance tests A statistical significance test assumes the null hypothesis is correct until evidence that contradicts the null hypothesis shows up Another perspective of hypothesis testing treats the null hypothesis and the alternative hypothesis equally and tests which one fits the statistical model better I only mention it here for completeness Making sense of confidence intervals and Pvalues from visual examples 167 To summarize if the evidence and test statistics show contradictions against the null hypothesis with high statistical significance smaller α values we reject the null hypothesis Otherwise we fail to reject the hypothesis Whether you reject the null hypothesis or not there is a chance that you will make mistakes Note Hypothesis testing includes the test of correlation or independence In this chapter we mainly focus on the test of differences However claiming two variables are correlated or independent is a legitimate statementhypothesis that can be tested Making sense of confidence intervals and Pvalues from visual examples Pvalues determine whether a research proposal will be funded whether a publication will be accepted or at least whether an experiment is interesting or not To start with let me give you some bullet points about Pvalues properties The Pvalue is a magical probability but it is not the probability that the null hypothesis will be accepted Statisticians tend to search for supportive evidence for the alternative hypothesis because the null hypothesis is boring Nobody wants to hear that there is nothing interesting going on The Pvalue is the probability of making mistakes if you reject the null hypothesis If the Pvalue is very small it means that you can safely reject the null hypothesis without worrying too much that you made mistakes because randomness tricked you If the Pvalue is 1 it means that you have absolutely no reason to reject the null hypothesis because what you get from the test statistic is the most typical results under the null hypothesis The Pvalue is defined in the way it is so that it can be comparable to the significance level α If we obtain a Pvalue smaller than the significance level we say the result is significant at significance level α The risk of making a mistake that rejects the null hypothesis wrongly is acceptable If the Pvalue is not smaller than α the result is not significant 168 Statistical Hypothesis Testing From first principles the Pvalue of an 
event is also the summed probability of observing the event and all events with equal or smaller probability This definition doesnt contradict the point about contradicting the null hypothesis Note that under the assumption of a true null hypothesis the cumulative probability of observing our test statistics and all other equal or rarer values of test statistics is the probability of mistakenly rejecting the null hypothesis The importance of firstprinciples thinking Firstprinciples thinking is very important in studying statistics and programming It is advised that you resist the temptation to use rules and procedures to get things done quickly but instead learn the definitions and concepts so you have a foundation in terms of first principles Please read the definition of the Pvalue carefully to make sure you fully understand it Before moving on to a concrete example of test statistics lets have a look at the Pvalue from two examples from first principles The importance of correctly understanding the Pvalue cannot be stressed enough Calculating the Pvalue from discrete events In our first example we will study the probability and the Pvalue of events in cointossing experiments Lets toss a fair coin 6 times and count the total number of heads There are 7 possibilities from 0 heads to 6 heads We can calculate the probability either theoretically or computationally I will just do a quick experiment with the following lines of code and compare the results with the theoretical values The following code snippet generates the experiment results for 1 million tosses and stores the results in the results variable randomseed2020 results for in range1000000 resultsappendsumrandomrandom 05 for i in range6 The following code snippet normalizes the results and lists them alongside the theoretical results from collections import Counter from math import factorial as factorial counter Counterresults Making sense of confidence intervals and Pvalues from visual examples 169 for head in sortedcounterkeys comput counterhead1000000 theory 056factorial6factorialheadfactorial6 head printheads Computational Theoretical formatheadcomput theory The results look as follows The computational results agree with the theoretical results pretty well heads 0 Computational 0015913 Theoretical 0015625 heads 1 Computational 0093367 Theoretical 009375 heads 2 Computational 0234098 Theoretical 0234375 heads 3 Computational 0312343 Theoretical 03125 heads 4 Computational 0234654 Theoretical 0234375 heads 5 Computational 0093995 Theoretical 009375 heads 6 Computational 001563 Theoretical 0015625 Lets answer the following questions to help us clarify the definition of the Pvalue The answers should be based on theoretical results 1 What is the probability of getting 5 heads what about Pvalue The probability is 009375 However the Pvalue is the sum of 009375 009375 0015625 0015625 021875 The Pvalue is the probability of you seeing such events with equal probability or rarer probability Getting 1 head is equally likely as getting 5 heads Getting 6 heads or 0 heads is more extreme With a Pvalue of roughly 0 we say that the event of observing 5 heads is quite typical The Pvalue for observing 6 heads is about 0031 The calculation is left to you as an exercise 2 What is the Pvalue of getting 3 heads The surprising answer here is 1 Among all 7 kinds of possibilities getting 3 heads is the most likely outcome therefore the rest of the outcomes are all rarer than getting 3 heads Another implication is that there are no other events that 
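The same numbers can be checked without a million simulated tosses by summing binomial probabilities directly; this sketch uses scipy.stats.binom and NumPy and follows the definition of the P-value as the total probability of outcomes that are no more likely than the observed one.

from scipy.stats import binom
import numpy as np

heads = np.arange(7)                 # 0 to 6 heads in 6 tosses
pmf = binom.pmf(heads, 6, 0.5)       # theoretical probabilities

def p_value(observed):
    # Add up every outcome whose probability does not exceed the observed one;
    # the tiny epsilon guards against floating-point ties
    return pmf[pmf <= pmf[observed] + 1e-12].sum()

print(p_value(5))   # 0.21875, matching the sum worked out above
print(p_value(6))   # 0.03125, the exercise value
print(p_value(3))   # 1.0, since no outcome is more typical than 3 heads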
are more typical than observing 3 heads Now you should have a better understanding of the Pvalue by having treated it as a measurement of typicalness In the next section lets move on to a case involving the continuous Probability Density Function PDF case where we need some integration 170 Statistical Hypothesis Testing Calculating the Pvalue from the continuous PDF We just calculated Pvalues from discrete events Now lets examine a continuous distribution The distribution I am going to use is the Fdistribution The Fdistribution is the distribution we are going to use in the analysis of the variance test later so it is good to have a first impression here The analytical form of the Fdistribution is parameterized by two degrees of freedom d1 and d2 Fd1 d2 If x is greater than 0 the PDF is as follows f𝑥𝑥 𝑑𝑑1 𝑑𝑑2 1 𝐵𝐵 𝑑𝑑1 2 𝑑𝑑2 2 𝑑𝑑1 𝑑𝑑2 𝑑𝑑1 2 𝑥𝑥 𝑑𝑑1 2 1 1 𝑑𝑑1 𝑑𝑑2 𝑥𝑥 𝑑𝑑1𝑑𝑑2 2 The Bxy function is called the beta function and its a special kind of function If you are familiar with calculus it has the following definition as an integration B𝑥𝑥 𝑦𝑦 𝑡𝑡𝑥𝑥11 𝑡𝑡𝑦𝑦1dt 1 0 Fortunately we dont need to write our own function to generate these samples The scipy library provides another handy function f for us to use The following code snippet generates the PDFs with four pairs of parameters and plots them from scipystats import f pltfigurefigsize108 styles for i dfn dfd in enumerate2030206050305060 x nplinspacefppf0001 dfn dfd fppf0999 dfn dfd 100 pltplotx fpdfx dfn dfd linestyle stylesi lw4 alpha06 label formatdfndfd pltlegend Making sense of confidence intervals and Pvalues from visual examples 171 The plotted graph looks like this Figure 72 The Fdistribution PDF The probability distribution function of the Fdistribution is not symmetrical it is right skewed with a long tail Lets say you have a random variable x following the distribution F2060 If you observe x to be 15 what is the Pvalue for this observation The following code snippet highlights the region where the equal or rare events are highlighted in red I generated 100 linearly spaced data points and stored them in the x variable and selected those rarer observations on the right and those on the left pltfigurefigsize108 dfn dfd 2060 x nplinspacefppf0001 dfn dfd fppf0999 dfn dfd 100 pltplotx fpdfx dfn dfd linestyle lw4 alpha06 label formatdfndfd right xx15 left xfpdfx dfn dfd fpdfrightdfndfd008 pltfillbetweenrightf pdfrightdfndfdalpha04colorr pltfillbetweenleftfpdfleftdfndfdalpha04colorr pltlegend 172 Statistical Hypothesis Testing There is a little bit of hardcoding here where I manually selected the left part of the shaded area You are free to inspect the expression of the left variable The result looks as follows Figure 73 A rarer observation than 15 in F2060 The integration of the shaded area gives us the Pvalue for observing the value 15 The following code snippet uses the Cumulative Distribution Function CDF to calculate the value fcdfleft1dfndfd 1fcdfright0dfndfd The Pvalue is about 0138 so not very bad It is somewhat typical to observe a 15 from such an Fdistribution If your preselected significance level is α 005 then this observation is not significant enough By now you should understand the definition and implication of the Pvalue from first principles The remaining question is what exactly is the Pvalue in a hypothesis test The answer involves test statistics In the second step of hypothesis testing we calculate the best kind of statistic and check its Pvalue against a preselected significance level In the math score example we want to 
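The shaded-area calculation above hardcodes the left region by eye. A less manual sketch, using scipy.stats.f together with a root finder to locate the point left of the mode whose density equals the density at 1.5, would look like this (the helper names are mine):

from scipy.stats import f
from scipy.optimize import brentq

dfn, dfd = 20, 60
x_obs = 1.5
mode = ((dfn - 2) / dfn) * (dfd / (dfd + 2))   # peak of the F(20, 60) density

# Point left of the mode with the same density as the observation
x_left = brentq(lambda x: f.pdf(x, dfn, dfd) - f.pdf(x_obs, dfn, dfd), 1e-6, mode)

p_value = f.cdf(x_left, dfn, dfd) + f.sf(x_obs, dfn, dfd)
print(p_value)   # should land close to the 0.138 quoted above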
compare a sample mean against a constant this is a onesample onetailed test The statistic we want to use is the tstatistic The specific hypothesis test we want to apply is Students ttest Please bear with me on the new concepts The tstatistic is nothing special its just another random variable that follows a specific distribution which follows Students tdistribution We will cover both specific and Students tdistribution shortly with clear definitions and visualizations Making sense of confidence intervals and Pvalues from visual examples 173 Tests and test statistics Different problems require different test statistics If you want to test the differences in samples across several categories or groups you should use the Analysis of Variance ANOVA Ftest If you want to test the independence of two variables in a population you should use the Chisquare test which we will cover very soon Under the null hypothesis the tstatistic t is calculated as follows t μ2020 75 𝑠𝑠𝑛𝑛 n is the sample size and s is the sample standard deviation The random variable t follows Students tdistribution with a degree of freedom of n1 Students tdistribution Students tdistribution is a continuous probability distribution used when estimating the mean of a normally distributed distribution with an unknown population standard deviation and a small sample size It has a complicated PDF with a parameter called the Degree of Freedom DOF We wont go into the formula of the PDF as its convoluted but I will show you the relationship between the DOF and the shape of the tdistribution PDF The following code snippet plots the tdistributions with various DOFs alongside the standard normal distribution functions Here I use the scipystats module from scipystats import t norm pltfigurefigsize126 DOFs 248 linestyles for i df in enumerateDOFs x nplinspace4 4 100 rv tdf pltplotx rvpdfx k lw2 label DOF strdflinestylelinestylesi pltplotxnorm01pdfxk lw2 labelStandard Normal pltlegend 174 Statistical Hypothesis Testing The result looks like the following Pay attention to the line styles As you see when the DOF increases the tdistribution PDF tends to approach the standard normal distribution with larger and larger centrality Figure 74 The Students tdistribution PDF and a standard normal PDF Alright our statistic t μ2020 75 𝑠𝑠𝑛𝑛 follows the tdistribution with a DOF of 899 By substituting the numbers we can get the value of our tstatistic using the following code npmeanmath202075npstdmath202030 The result is about 126758 Replacing the tdistribution with a normal distribution With a large DOF 899 in our case the tdistribution will be completely indistinguishable from a normal distribution In practice you can use a normal distribution to do the test safely Significance levels in tdistribution Lets say we selected a significance level of α 001 Is our result significant enough We need to find out whether our result exceeds the threshold of the significance level 001 The tdistribution doesnt have an easytocalculate PDF so given a significance level of α 001 how do we easily find the thresholds Before the advent of easytouse libraries or programs people used to build tstatistics tables to solve this issue For a given significance level you can basically look up the table and find the corresponding tstatistic value Making sense of confidence intervals and Pvalues from visual examples 175 As the tdistribution is symmetric the importance of whether you are doing a onetail test or a twotailed test increases Lets first check a tdistribution table for a 
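In practice the whole test can be delegated to SciPy. The sketch below assumes the math2020 array is available and that the installed SciPy is recent enough (1.6 or later) to accept the alternative argument for a one-tailed test.

import numpy as np
from scipy import stats

# Manual t-statistic: (sample mean - 75) / (sample std / sqrt(n))
t_stat = (np.mean(math2020) - 75) / (np.std(math2020, ddof=1) / np.sqrt(900))
print(t_stat)   # its absolute value is close to the 12.67 or so quoted in the text

# The same test in one call, with H1: mean < 75
result = stats.ttest_1samp(math2020, 75, alternative="less")
print(result.statistic, result.pvalue)   # the P-value comes out far below 0.01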
onetail test For example with a DOF of 5 to be significant at the 005 level the tstatistic needs to be 2015 Figure 75 The tdistribution table for onetailed significance levels For a more intuitive impression the following code snippet plots the different thresholds of the tdistribution PDF pltfigurefigsize106 df 5 x nplinspace8 8 200 rv tdf pltplotx rvpdfx k lw4linestyle alphas 0100500250010005000100005 threasholds 1476201525713365403258946869 for thre alpha in zipthreasholdsalphas pltplotthrethre0rvpdfthre label formatstralphalinewidth4 pltlegend The result looks as follows Figure 76 The significance levels for a tstatistics distribution with DOF5 176 Statistical Hypothesis Testing Adding the following two lines will zoom into the range that we are interested in pltxlim28 pltylim0015 The result looks as follows If you cant distinguish the colors just remember that the smaller the significance level is the further away the threshold is from the origin Figure 77 A zoomedin tdistribution showing different significance levels As the significance level decreases that is as α decreases we tend toward keeping the null hypothesis because it becomes increasingly harder to observe a sample with such low probability Next lets check the twotailed tdistribution table The twotailed case means that we must consider both ends of the symmetric distribution The summation of both gives us the significance level Figure 78 The tdistribution table for twotailed significance levels Making sense of confidence intervals and Pvalues from visual examples 177 Notice that for α 02 the tstatistic is the same as α 01 for a onetailed test The following code snippet illustrates the relationship between tstatistic and onetailed test using α 001 as an example I picked the most important region to show pltfigurefigsize106 df 5 x nplinspace8 8 200 rv tdf pltplotx rvpdfx k lw4linestyle alpha001 onetail 3365 twotail 4032 pltplotonetailonetail0rvpdfonetail label onetaillinewidth4linestyle pltplottwotailtwotail0rvpdftwotail label two taillinewidth4colorrlinestyle pltplottwotailtwotail0rvpdftwotail label two taillinewidth4colorrlinestyle pltfillbetweennplinspace8twotail200 rvpdfnplinspace8two tail200colorg pltfillbetweennplinspaceonetailtwotail200 rvpdfnplinspaceonetailtwo tail200colorg pltylim0002 pltlegend 178 Statistical Hypothesis Testing The result looks as follows Figure 79 A comparison of twotailed and onetailed results for the same significance level You need to trust me that the shaded parts have the same area The onetailed case only covers the region to the right of the vertical dashed line the left edge of the right shaded area but the twotailed case covers both sides symmetrically the outer portion of the two dotted vertical lines Since the significance levels are the same they should both cover the same area under the curve AUC which leads to the equal area of the two shaded regions For our onesided test our tstatistic is less than 10 It is equivalent to the threshold for the positive value because of the symmetry of the problem If you look up the tdistribution table with such a large DOF of 899 the difference between large DOFs is quite small For example the following two rows are found at the end of the tdistribution table Figure 710 A tdistribution table with very large DOFs In the math score example the absolute value 1267 for our tstatistic is far away from both 2358 and 2330 We have enough confidence to reject the null hypothesis which means that the alternative hypothesis is true indeed students math skills 
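Rather than reading thresholds off a printed table, the percent point function (inverse CDF) in scipy.stats.t reproduces them directly; a short check of the numbers used above:

from scipy.stats import t

print(t.ppf(0.95, df=5))     # about 2.015, the one-tailed 0.05 threshold at DOF 5
print(t.ppf(0.99, df=5))     # about 3.365, the one-tailed 0.01 threshold
print(t.ppf(0.995, df=5))    # about 4.032, the two-tailed 0.01 threshold
print(t.ppf(0.99, df=899))   # about 2.33, already very close to the normal quantile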
have declined Making sense of confidence intervals and Pvalues from visual examples 179 The following code snippet reveals how I generated the score data randomseed2020 math2019 randomnormalvariate755 for in range900 math2020 randomnormalvariate735 for in range900 Feel free to reproduce the random data generated to visualize it yourself Next lets examine another concept in hypothesis testing power The power of a hypothesis test I will briefly mention another concept you may see from other materials the power of a hypothesis test The power of hypothesis testing is the probability of making the correct decision if the alternative hypothesis is true It is easier to approach this concept from its complementary part The opposite side of this is the probability of failing to reject the null hypothesis H0 while the alternative hypothesis H1 is true This is called a type II error The smaller the type II error is the greater the power will be Intuitively speaking greater power means the test is more likely to detect something interesting going on On the other hand everything comes with a cost If the type II error is low then we will inevitably reject the null hypothesis based on some observations that indeed originate from pure randomness This is an error thats called a type I error The type I error is the mistake of rejecting the null hypothesis H0 while H0 is indeed true Does this definition ring a bell When you choose a significance level α you are choosing your highest acceptable type I error rate As you can imagine the type I error and type II error will compensate for each other in most cases We will come back to this topic again and again when we talk about machine learning Examples of type I and type II errors A classic example of type I and type II errors has to do with radar detection Say that a radar system is reporting no incoming enemy aircraft the null hypothesis is that there are no incoming enemy aircraft and the alternative hypothesis is that there actually are incoming enemy aircraft A type I error is reporting the enemy aircraft when there are no aircraft in the area A type II error would be when there are indeed incoming enemy aircraft but none were reported 180 Statistical Hypothesis Testing In the next section we are going to use the SciPy library to apply what we have learned so far to various kinds of hypothesis testing problems You will be amazed at how much you can do Using SciPy for common hypothesis testing The previous section went over a ttest and the basic concepts in general hypothesis testing In this section we are going to fully embrace the powerful idea of the paradigm of hypothesis testing and use the SciPy library to solve various hypothesis testing problems The paradigm The powerful idea behind the hypothesis testing paradigm is that if you know that your assumption when hypothesis testing is roughly satisfied you can just invoke a well written function and examine the Pvalue to interpret the results Tip I encourage you to understand why a test statistic is built in a specific way and why it follows a specific distribution For example for the tdistribution you should understand what the DOF is However this will require a deeper understanding of mathematical statistics If you just want to use hypothesis testing to gain insights knowing the paradigm is enough If you want to apply hypothesis testing to your dataset follow this paradigm 1 Identify the problems you are interested in exploring What are you going to test A difference correlation or independence 2 Find 
the correct hypothesis test and assumption Examine whether the assumption is satisfied carefully 3 Choose a significance level and perform the hypothesis test with a software package Recall that in the previous section we did this part manually but now it is all left to the software In this section I will follow the paradigm and do three different examples in SciPy Using SciPy for common hypothesis testing 181 Ttest First I will redo the ttest with SciPy The default API for a single sample ttest from scipystats only provides for twotailed tests We have already seen an example of interpreting and connecting twotailed and onetailed significance levels so this isnt an issue anymore The function we are going to use is called scipystatsttest1samp The following code snippet applies this function to our math score data from scipy import stats statsttest1sampmath2020750 The result reads as follows Ttest1sampResultstatistic12668347669098846 pvalue5842470780196407e34 The first value statistic is the tstatistic which agrees with our calculation The second term is the Pvalue it is so small that if the null hypothesis is true and you drew a 900student sample every second it would take longer than the amount of time the universe has existed for you to observe a sample as rare as we have here Lets do a twosample ttest The twosample ttest will test whether the means of two samples are the same For a twosample test the significance level is twotailed as our hypothesis is 𝐻𝐻0 μ1 μ2 There are two cases for the twosample ttest depending on the variance in each sample If the two variances are the same it is called a standard independent twosample ttest if the variances are unequal the test is called Welchs ttest Lets first examine the standard ttest The following code snippet generates and plots two normally distributed samples one at mean 2 and another at 21 with an equal population variance of 1 nprandomseed2020 sample1 nprandomnormal21400 sample2 nprandomnormal211400 pltfigurefigsize106 plthistsample1binsnplinspace1510alpha05labelsa mple1 182 Statistical Hypothesis Testing plthistsample2binsnplinspace1510alpha05labelsa mple2 pltlegend The result looks as follows Figure 711 Two samples with unequal means Lets call the ttestind ttest function directly with the following line of code statsttestindsample1sample2 The result looks as follows TtestindResultstatistic17765855804956159 pvalue007601736167057595 Our tstatistic is about 18 If our significance level is set to 005 we will fail to reject our null hypothesis How about increasing the number of samples Will it help Intuitively we know that more data contains more information about the population therefore it is expected that well see a smaller Pvalue The following code snippet does the job nprandomseed2020 sample1 nprandomnormal21900 sample2 nprandomnormal211900 statsttestindsample1sample2 Using SciPy for common hypothesis testing 183 The result shows a smaller Pvalue TtestindResultstatistic3211755683955914 pvalue00013425868478419776 Note that Pvalues can vary significantly from sample to sample In the following code snippet I sampled the two distributions and conducted the twosample ttest 100 times nprandomseed2020 pvalues for in range100 sample1 nprandomnormal21900 sample2 nprandomnormal211900 pvaluesappendstatsttestindsample1sample21 Lets see how the Pvalue itself distributes in a boxplot pltfigurefigsize106 pltboxplotpvalues The boxplot of the Pvalues will look as follows Figure 712 A boxplot of Pvalues for 100 standard twosample ttests The majority of 
the Pvalues do fall into the region of 0 02 but there are a handful of outliers as well 184 Statistical Hypothesis Testing Next if you dont know whether the two samples have the same variance you can use Welchs ttest First lets use the following code snippet to generate two samples from two different uniform distributions with different sample sizes as well Note Our null hypothesis remains unchanged which means that the twosample means are the same nprandomseed2020 sample1 nprandomuniform210400 sample2 nprandomuniform112900 pltfigurefigsize106 plthistsample1binsnp linspace01520alpha05labelsample1 plthistsample2binsnp linspace01520alpha05labelsample2 pltlegend The result looks as follows Figure 713 Two uniformly distributed samples with different means variances and sample sizes Lets call the same SciPy function but this time well tell it that the variances are not equal by setting the equalvar parameter to False statsttestindsample1sample2equalvarFalse Using SciPy for common hypothesis testing 185 The result shows quite a small Pvalue TtestindResultstatistic31364786834852163 pvalue00017579405400172416 With a significance level of 001 we will have enough confidence to reject the null hypothesis You dont have to know what Welchs tstatistic distributes this is the gift that the Python community gives to you The normality hypothesis test Our next test is a normality test In a normality test the null hypothesis H0 is that the sample comes from a normal distribution The alternative hypothesis H1 is that the sample doesnt come from a normal distribution There are several ways to do normality tests not a hypothesis test You can visually examine the datas histogram plot or check its boxplot or QQ plot However we will refer to the statisticians toolsets in this section A note on QQ plots QQ plots are not covered in this book They are used to compare two distributions You can plot data from a distribution against data from an ideal normal distribution and compare distributions There are several major tests for normality The most important ones are the Shapiro Wilk test and the AndersonDarling test Again we wont have time or space to go over the mathematical foundation of either test all we need to do is check their assumptions and call the right function in a given scenario How large should a random sample be to suit a normality test As you may have guessed if the size of the sample is small it really doesnt make much sense to say whether it comes from a normal distribution or not The extreme case is the sample size being 1 It is possible that it comes from any distribution There is no exact rule on how big is big enough The literature mentions that 50 is a good threshold beyond which the normality test is applicable I will first generate a set of data from Chisquared distributions with different parameters and use the two tests from SciPy to obtain the Pvalues 186 Statistical Hypothesis Testing A note on Chisquare distributions The Chisquared distribution or x2 distribution is a very important distribution in statistics The sum of the square of k independent standard normal random variables follows a Chisquared distribution with a DOF of k The following code snippet plots the real PDFs of Chisquared distributions so that you get an idea about the DOFs influence over the shape of the PDF from scipystats import chi2 pltfigurefigsize106 DOFs 481632 linestyles for i df in enumerateDOFs x nplinspacechi2ppf001 dfchi2ppf099 df 100 rv chi2df pltplotx rvpdfx k lw4 label DOF strdflinestylelinestylesi 
pltlegend The result looks as follows Figure 714 Chisquared distributions with different DOFs Next lets generate two sets of data of sample size 400 and plot them nprandomseed2020 sample1 nprandomchisquare8400 sample2 nprandomchisquare32400 pltfigurefigsize106 Using SciPy for common hypothesis testing 187 plthistsample1binsnp linspace06020alpha05labelsample1 plthistsample2binsnp linspace06020alpha05labelsample2 pltlegend The histogram plot looks as follows Sample one has a DOF of 8 while sample two has a DOF of 32 Figure 715 The Chisquared distributions with different DOFs Now lets call the shapiro and anderson test functions in SciPyStats to test the normality The following code snippet prints out the results The anderson function can be used to test fitness to other distributions but defaults to a normal distribution printResults for ShapiroWilk Test printSample 1 shapirosample1 printSample 2 shapirosample2 print printResults for AndersonDarling Test printSample 1 andersonsample1 printSample 2 andersonsample2 The results for the ShapiroWilk test read as follows Sample 1 09361660480499268 4538336286635802e12 Sample 2 09820653796195984 7246905443025753e05 188 Statistical Hypothesis Testing The results for the AndersonDarling Test read as follows Sample 1 AndersonResultstatistic6007815329566711 criticalvaluesarray057 065 0779 0909 1081 significancelevelarray15 10 5 25 1 Sample 2 AndersonResultstatistic18332323421475962 criticalvaluesarray057 065 0779 0909 1081 significancelevelarray15 10 5 25 1 The results for ShapiroWilk follow the format of statistic Pvalue so it is easy to see that for both sample one and sample two the null hypothesis should be rejected The results for the AndersonDarling test gives the statistic but you need to determine the corresponding critical value The significance level list is in percentages so 1 means 1 The corresponding critical value is 1081 For both cases the statistic is larger than the critical value which also leads to rejection of the null hypothesis Different test statistics cant be compared directly For the same hypothesis test if you choose a different test statistic you cannot compare the Pvalues from different methods directly As you can see from the preceding example the ShapiroWilk test has a much smaller Pvalue than the AndersonDarling test for both samples Before moving on to the next test lets generate a sample from a normal distribution and test the normality The following code snippet uses a sample from a standard normal distribution with a sample size of 400 sample3 nprandomnormal01400 printResults for ShapiroWilk Test printSample 3 shapirosample3 print printResults for AndersonDarling Test printSample 3 andersonsample3 The results read as follows Note that the function call may have a different output from the provided Jupyter notebook which is normal Results for ShapiroWilk Test Sample 3 0995371401309967 02820892035961151 Results for AndersonDarling Test Using SciPy for common hypothesis testing 189 Sample 3 AndersonResultstatistic046812258253402206 criticalvaluesarray057 065 0779 0909 1081 significancelevelarray15 10 5 25 1 It is true that we cant reject the null hypothesis even with a significance level as high as α 015 The goodnessoffit test In a normality test we tested whether a sample comes from a continuous normal distribution or not It is a fitness test which means we want to know how well our observation agrees with a preselected distribution Lets examine another goodnessoffit test for a discrete case the Chisquared goodnessoffit 
test Suppose you go to a casino and encounter a new game The new game involves drawing cards from a deck of cards three times the deck doesnt contain jokers You will win the game if two or three of the cards drawn belong to the suit of hearts otherwise you lose your bet You are a cautious gambler so you sit there watch and count After a whole day of walking around the casino and memorizing you observe the following results of card draws There are four cases in total Ive tabulated the outcomes here Figure 716 Counting the number of hearts Tip Note that in real life casinos will not allow you to count cards like this It is not in their interests and you will most likely be asked to leave if you are caught First lets calculategiven a 52card deck where there are 13 hearts what the expected observation look like For example picking 2 hearts would mean picking 2 hearts from the 13 hearts of the deck and pick 1 card from the remaining 39 cards which yields a total number of 13 11 2 39 381 So the total combination of choosing 3 hearts out of 52 cards is 52 3 39 Taking the ratio of those two instances we have the probability of obtaining 2 hearts being about 138 190 Statistical Hypothesis Testing The number of all observations is 1023 so in a fairgame scenario we should observe roughly 10231380 which gives 141 observations of 2 hearts cards being picked Based on this calculation you probably have enough evidence to question the casino owner The following code snippet calculates the fairgame probability and expected observations I used the comb function from the SciPy library from scipyspecial import comb P comb393icomb13icomb523 for i in range4 expected 1023p for p in P observed 46045110210 The index in the P and expected arrays means the number of observed hearts For example P0 represents the probability of observing 0 hearts Lets use a bar plot to see the differences between the expected values and the observed values The following code snippet plots the expected values and the observed values back to back x nparray0123 pltfigurefigsize106 pltbarx02expectedwidth04labelExpected pltbarx02observedwidth04 label Observed pltlegend pltxticksticks0123 The output of the code is as shown in the following graph Figure 717 The expected number of hearts and the observed number of hearts Using SciPy for common hypothesis testing 191 We do see that it is somewhat more probable to get fewer hearts than other cards Is this result significant Say we have a null hypothesis H0 that the game is fair How likely is it that our observation is consistent with the null hypothesis The Chisquare goodnessof fit test answers this question Tip Chisquare and Chisquared are often used interchangeably In this case the x2 statistic is calculated as 𝑂𝑂𝑖𝑖 𝐸𝐸𝑖𝑖2 𝐸𝐸𝑖𝑖 4 𝑖𝑖 where 0i is the number of observations for category i and Ei is the number of expected observations for category i We have four categories and the DOF for this x2 distribution is 4 1 3 Think about the expressive meaning of the summation If the deviation from the expectation and the observation is large the corresponding term will also be large If the expectation is small the ratio will become large which puts more weight on the smallexpectation terms Since the deviation for 2heart and 3heart cases is somewhat large we do expect that the statistic will be largely intuitive The following code snippet calls the chisquare function in SciPy to test the goodness of the fit from scipystats import chisquare chisquareobservedexpected The result reads as follows 
PowerdivergenceResultstatistic14777716323788255 pvalue0002016803916729754 With a significance level of 001 we reject the null hypothesis that the game cant be fair 192 Statistical Hypothesis Testing The following PDF of the x2 distribution with a DOF of 3 can give you a visual idea of how unfair the game is Figure 718 A Chisquared distribution with and DOF of 3 The code for generating the distribution is as follows pltfigurefigsize106 x nplinspacechi2ppf0001 3chi2ppf0999 3 100 rv chi23 pltplotx rvpdfx k lw4 label DOF str3linestyle Next lets move on to the next topic ANOVA A simple ANOVA model The ANOVA model is actually a collection of models We will cover the basics of ANOVA in this section ANOVA was invented by British statistician RA Fisher It is widely used to test the statistical significance of the difference between means of two or more samples In the previous ttest you saw the twosample ttest which is a generalized ANOVA test Using SciPy for common hypothesis testing 193 Before moving on let me clarify some terms In ANOVA you may often see the terms factor group and treatment Factor group and treatment basically mean the same thing For example if you want to study the average income of four cities then city is the factor or group It defines the criteria how you would love to classify your datasets You can also classify the data with the highest degree earned therefore you can get another factorgroup degree The term treatment originates from clinical trials which have a similar concept of factors and groups You may also hear the word level Level means the realizations that a factor can be For example San Francisco Los Angeles Boston and New York are four levels for the factorgroup city Some literature doesnt distinguish between groups or levels when it is clear that there is only one facet to a whole dataset When the total number of samples extends beyond two lets say g groups with group i having ni data points the null hypothesis can be formulated as follows 𝐻𝐻0 μ1 μ2 μ𝑔𝑔 In general you can do a sequence of ttests to test each pair of samples You will have gg 12 ttests to do For two different groups group i and group j you have the null hypothesis 𝐻𝐻0 μ1 μ2 μ𝑔𝑔 This approach has two problems You need to do more than one hypothesis test and the number of tests needed doesnt scale well The results require additional analysis Now lets examine the principles of ANOVA and see how it approaches these problems I will use the average income question as an example The sample data is as follows Figure 719 Income data samples from four cities Assumptions for ANOVA ANOVA has three assumptions The first assumption is that the data from each group must be distributed normally The second assumption is that samples from each group must have the same variance The third assumption is that each sample should be randomly selected In our example I assume that these three conditions are met However the imbalance of income does violate normality in real life Just to let you know 194 Statistical Hypothesis Testing The ANOVA test relies on the fact that the summation of variances is decomposable The total variance of all the data can be partitioned into variance between groups and variance within groups VAR𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 VAR𝑏𝑏𝑏𝑏𝑡𝑡𝑏𝑏𝑏𝑏𝑏𝑏 VAR𝑏𝑏𝑤𝑤𝑡𝑡ℎ𝑤𝑤𝑖𝑖 Heres that again but using notation common in the literature 𝑆𝑆𝑇𝑇 2 𝑆𝑆𝐵𝐵 2 𝑆𝑆𝑊𝑊 2 Now we define the three terms Let me use Xij to denote the j th data point from the j th group μ to denote the mean of the whole dataset in our example the mean of all the income 
data from all four groups and μi to denote the mean from group i For example μ1 is the average income for people living in San Francisco The total variance 𝑆𝑆𝑇𝑇 2 can be defined as follows 𝑆𝑆𝑇𝑇 2 𝑋𝑋𝑖𝑖𝑖𝑖 μ 2 𝑛𝑛𝑖𝑖 𝑖𝑖1 𝑔𝑔 𝑖𝑖1 The variance within groups 𝑆𝑆𝑊𝑊 2 is defined as follows 𝑆𝑆𝑊𝑊 2 𝑋𝑋𝑖𝑖𝑖𝑖 μi 2 𝑛𝑛𝑖𝑖 𝑖𝑖1 𝑔𝑔 𝑖𝑖1 The only difference is that the data point in each group now subtracts the group mean rather than the total mean The variance between groups is defined as follows 𝑆𝑆𝐴𝐴 2 𝑛𝑛𝑖𝑖μ𝑖𝑖 μ2 𝑔𝑔 𝑖𝑖1 The square of the difference between the group mean and the total mean is weighted by the number of group members The reason why this partition holds comes from the fact that a data point Xij can be decomposed as follows 𝑋𝑋𝑖𝑖𝑖𝑖 μ μ𝑖𝑖 μ 𝑋𝑋𝑖𝑖𝑖𝑖 μ𝑖𝑖 Using SciPy for common hypothesis testing 195 The first term on the righthand side of the equation is the total mean The second term is the difference of means across the group and the third term is the difference within the group I encourage you to substitute this expression into the formula of 𝑆𝑆𝑇𝑇 2 and collect terms to rediscover it as the sum of 𝑆𝑆𝐵𝐵 2 and 𝑆𝑆𝑊𝑊 2 It is a good algebraic exercise The following code snippet does the calculation to verify the equation First I create the following four numpy arrays SF nparray1200001103001278006890079040208000 15900089000 LA nparray6570088340240000190000450802590069000120300 BO nparray8799986340980001240001138009800010800078080 NY nparray30000062010450001300002380005600089000123000 Next the following code snippet calculates 𝑆𝑆𝑇𝑇 2 𝑆𝑆𝐵𝐵 2 and 𝑆𝑆𝑊𝑊 2 mu npmeannpconcatenateSFLABONY ST npsumnpconcatenateSFLABONY mu2 SW npsumSFnpmeanSF2 npsumLAnpmeanLA2 npsumBOnpmeanBO2 npsumNYnpmeanNY2 SB 8npmeanSFmu2 8npmeanLAmu2 8npmeanBOmu2 8npmeanNYmu2 Now lets verify that ST SW SB ST SWSB The answer is True So indeed we have this relationship How is this relationship useful Let me first denote the variance for each group with σ2 they are the same because this is one of the assumptions and then check each term carefully 196 Statistical Hypothesis Testing The question is what is the distribution of the statistic 𝑆𝑆𝑊𝑊 2 σ2 Recall that for group i the sum of the squared differences is just a Chisquare distribution with a DOF of ni 1 𝑋𝑋𝑖𝑖𝑖𝑖 μ𝑖𝑖 2 𝑛𝑛𝑖𝑖 𝑖𝑖 σ2χ𝑛𝑛𝑖𝑖1 2 When the null hypothesis holds namely μi μj for arbitrary i and j 𝑆𝑆𝑊𝑊 2 𝜎𝜎2 is just the summation of the statistic Because each group is independent we have 𝑆𝑆𝑊𝑊 2 σ2 χ𝑛𝑛𝑔𝑔 2 where n is the total number of the samples and g is the number of groups How about 𝑆𝑆𝐵𝐵 2 𝜎𝜎2 When the null hypothesis holds each observation no matter which group it comes from can be treated as a realization of 𝑁𝑁μ σ2 therefore 𝑆𝑆𝑇𝑇 2 will follow a Chisquare distribution with a DOF of n 1 However we have the equation 𝑆𝑆𝑇𝑇 2 𝑆𝑆𝐵𝐵 2 𝑆𝑆𝑊𝑊 2 so 𝑆𝑆𝐵𝐵 2 𝜎𝜎2 must follow an x2 distribution with a DOF of g 1 where g 1 n 1 n g The test statistic F is further defined as the ratio of the two equations where σ2 is canceled 𝐹𝐹 𝑆𝑆𝐵𝐵 2𝑔𝑔 1 𝑆𝑆𝑊𝑊 2 𝑛𝑛 𝑔𝑔 The statistic F follows an Fdistribution of Fg 1 n g If the null hypothesis doesnt hold the variance between groups 𝑆𝑆𝐵𝐵 2 will be large so F will be large If the null hypothesis is true 𝑆𝑆𝐵𝐵 2 will be small so F will also be small Lets manually calculate our test statistic for the income problem and compare it with the functionality provided by SciPy The following code snippet computes the F statistic F SB41SW484 F The result is about 0388 Before we do the Ftest lets also look at the PDF of the Fdistribution The following code snippet plots the Fdistribution 
with DOFs of 3 and 28 pltfigurefigsize106 x nplinspacefppf0001 3 28fppf0999 3 28 100 rv fdfn3 dfd28 pltplotx rvpdfx k lw4linestyle Using SciPy for common hypothesis testing 197 The plot looks as follows Figure 720 The Fdistribution with DOFs of 3 and 28 You can estimate that such a small statistic 0388 will probably have a very large Pvalue Now lets use the foneway function from sciPystats to do the Ftest The following code snippet gives us the statistic and the Pvalue from scipystats import foneway fonewayLANYSFBO Here is the result FonewayResultstatistic038810442907126874 pvalue07624301696455358 The statistic agrees with our own calculation and the Pvalue suggests that we cant reject the null hypothesis even at a very high significance value Beyond simple ANOVA After the Ftest when the means are different various versions of ANOVA can be used further to analyze the factors behind the difference and even their interactions But due to the limitations of space and time we wont cover this Stationarity tests for time series In this section we are going to discuss how to test the stationarity of an autoregression time series First lets understand what a time series is Using SciPy for common hypothesis testing 199 A note on the name of white noise The name of white noise actually comes from white light White light is a mixture of lights of all colors White noise is a mixture of sounds with different frequencies White noise is an important sound because the ideal white noise will have equal power or energy throughout the frequency spectrum Due to the limitation of space we wont go deep into this topic The following code snippet generates a white noise time series You can verify that there is no time dependence between Xt and Xtk The covariance is 0 for arbitrary k nprandomseed2020 pltfigurefigsize106 whitenoise nprandomnormal for in range100 pltxlabelTime step pltylabelValue pltplotwhitenoise The results look as follows You can see that it is stationary Figure 721 The white noise time series Another simple time series is random walk It is defined as the addition of the previous term in a sequence and a white noise term ϵ𝑡𝑡𝑁𝑁0 σ2 You can define X0 to be a constant or another white noise term 𝑋𝑋𝑡𝑡 𝑋𝑋𝑡𝑡1 ϵ𝑡𝑡 200 Statistical Hypothesis Testing Just as for white noise time series the random walk time series has a consistent expectation However the variance is different Because of the addition of independent normally distributed white noises the variance will keep increasing 𝑋𝑋𝑡𝑡 𝑋𝑋0 ϵ1 ϵ2 ϵ𝑡𝑡 Therefore you have the variance expressed as follows 𝑉𝑉𝑉𝑉𝑉𝑉𝑋𝑋𝑡𝑡 𝑉𝑉𝑉𝑉𝑉𝑉ϵ𝑖𝑖 𝑖𝑖 𝑡𝑡σ2 This is a little bit surprising because you might have expected the white noises to cancel each other out because they essentially symmetrical around 0 The white noises do cancel each other out in the mean sense but not in the variance sense The following code snippet uses the same set of random variables to show the differences between white noise time series and random walk time series pltfigurefigsize106 nprandomseed2020 whitenoise nprandomnormal for in range1000 randomwalk npcumsumwhitenoise pltplotwhitenoise label white noise pltplotrandomwalk label standard random walk pltlegend Here I used the cumsum function from numpy to calculate a cumulative sum of a numpy array or list The result looks as follows Figure 722 Stationary white noise and nonstationary random walk Using SciPy for common hypothesis testing 201 Say you took the difference to define a new time series δXt δ𝑋𝑋𝑡𝑡 𝑋𝑋𝑡𝑡𝑋𝑋𝑡𝑡1 ϵ𝑡𝑡 Then the new time series would become 
stationary In general a nonstationary time series can be reduced to a stationary time series by continuously taking differences Now lets talk about the concept of autoregression Autoregression describes the property of a model where future observations can be predicted or modeled with earlier observations plus some noise For example the random walk can be treated as a sequence of observations from a firstorder autoregressive process as shown in the following equation 𝑋𝑋𝑡𝑡 𝑋𝑋𝑡𝑡1 ϵ𝑡𝑡 The observation at timestamp t can be constructed from its onestepback value Xt1 Generally you can define an autoregressive time series with order n as follows where instances of Φi are real numbers 𝑋𝑋𝑡𝑡 ϕ1𝑋𝑋𝑡𝑡1 ϕ2𝑋𝑋𝑡𝑡2 ϕ𝑛𝑛𝑋𝑋𝑡𝑡𝑛𝑛 ϵ𝑡𝑡 Without formal mathematical proof I would like to show you the following results The autoregressive process given previously has a characteristic equation as follows 𝑓𝑓𝑠𝑠 1 ϕ1𝑠𝑠 ϕ2𝑠𝑠2 ϕ𝑛𝑛𝑠𝑠𝑛𝑛 0 In the domain of complex numbers this equation will surely have n roots Here is the theorem about these roots If all the roots have an absolute value larger than 1 then the time series is stationary Let me show that to you with two examples Our random walk model has the following characteristic function fs 1 s 0 which has a root equal to 0 so it is not stationary How about the following modified random walk Lets see 𝑋𝑋𝑡𝑡 08𝑋𝑋𝑡𝑡1 ϵ𝑡𝑡 It has a characteristic function of fs 1 08s 0 which has a root of 125 By our theorem this time series should be stationary 202 Statistical Hypothesis Testing The influence of Xt1 is reduced by a ratio of 08 and this effect will be compounding and fading away The following code snippet uses the exact same data we have for white noise and random walk to demonstrate this fading behavior I picked the first 500 data points so linestyles can be distinguishable for different lines pltfigurefigsize106 nprandomseed2020 whitenoise nprandomnormal for in range500 randomwalkmodified whitenoise0 for i in range1500 randomwalkmodifiedappendrandomwalkmodified108 whitenoisei randomwalk npcumsumwhitenoise pltplotwhitenoise label white noiselinestyle pltplotrandomwalk label standard random walk pltplotrandomwalkmodified label modified random walklinestyle pltlegend The graph looks as follows Figure 723 A comparison of a modified random walk and a standard random walk Lets try a more complicated example Is the time series obeying the following the autoregressive relationship 𝑋𝑋𝑡𝑡 06𝑋𝑋𝑡𝑡1 12𝑋𝑋𝑡𝑡2 ϵ𝑡𝑡 Using SciPy for common hypothesis testing 203 The characteristic equation reads 𝑓𝑓𝑠𝑠 1 06𝑠𝑠 12𝑠𝑠2 It has two roots Both roots are complex numbers with nonzero imaginary parts The roots absolute values are also smaller than 1 on the complex plane The following code snippet plots the two roots on the complex plane You can see that they are just inside the unit circle as shown in Figure 724 for root in nproots12061 pltpolar0npangleroot0absrootmarkero The graph looks as follows Figure 724 A polar plot of roots inside a unit circle You should expect the time series to be nonstationary because both roots have absolute values smaller than 1 Lets take a look at it with the following code snippet pltfigurefigsize106 nprandomseed2020 whitenoise nprandomnormal for in range200 series whitenoise0whitenoise1 for i in range2200 seriesappendseriesi106seriesi212 white noisei pltplotseries label oscillating pltxlabelTime step pltylabelValue pltlegend 204 Statistical Hypothesis Testing The result looks as follows Figure 725 A nonstationary secondorder autoregressive time series example Check the scales on 
the yaxis and you will be surprised The oscillation seems to come from nowhere The exercise of visualizing this time series in the log scale is left to you as an exercise In most cases given a time series the Augmented DickeyFuller ADF unit root test in the statsmodels library can be used to test whether a unit root is present or not The null hypothesis is that there exists a unit root which means the time series is not stationary The following code snippet applies the ADF unit root test on the white noise time series the random walk and the modified random walk You can leave the optional arguments of this function as their defaults from statsmodelstsastattools import adfuller adfullerwhitenoise The result is as follows 13456517599662801 35984059677945306e25 0 199 1 34636447617687436 5 28761761179270766 10 257457158581854 5161905447452475 Using SciPy for common hypothesis testing 205 You need to focus on the first two highlighted values terms in the result the statistic and the Pvalue The dictionary contains the significance levels In this case the Pvalue is very small and we can safely reject the null hypothesis There is no unit root so the time series is stationary For the random walk time series the result of adfuller randomwalkmodified is as follows 14609492394159564 05527332285592418 0 499 1 34435228622952065 5 2867349510566146 10 2569864247011056 13744481241324318 The Pvalue is very large therefore we cant reject the null hypothesis a unit root might exist The time series is not stationary The result for the modified random walk is shown in the following code block The Pvalue is also very small It is a stationary time series 7700113158114325 13463404483644221e11 0 499 1 34435228622952065 5 2867349510566146 10 2569864247011056 13756034107373926 Forcefully applying a test is dangerous How about our wildly jumping time series If you try to use the adfuller function on it you will find a wild statistic and a Pvalue of 0 The ADF test simply fails because the underlying assumptions are violated Because of the limitations of space and the complexity of it I omitted coverage of this You are encouraged to explore the roots of the cause and the mechanism of ADF tests from first principles by yourself 206 Statistical Hypothesis Testing We have covered enough hypothesis tests it is time to move on to AB testing where we will introduce cool concepts such as randomization and blocking Appreciating AB testing with a realworld example In the last section of this chapter lets talk about AB testing Unlike previous topics AB testing is a very general concept AB testing is something of a geeky engineers word for statistical hypothesis testing At the most basic level it simply means a way of finding out which setting or treatment performs better in a singlevariable experiment Most AB testing can be classified as a simple Randomized Controlled Trial RCT What randomized control means will be clear soon Lets take a realworld example a consulting company proposes a new workinghours schedule for a factory claiming that the new schedule will improve the workers efficiency as well as their satisfaction The cost of abruptly shifting the workinghours schedule may be big and the factory does not want the risk involved Therefore the consulting company proposes an AB test Consultants propose selecting two groups of workers group A and group B These groups have controlled variables such as workers wages occupations and so on such that the two groups are as similar as possible in terms of those variables The only 
difference is that one group follows the old workinghours schedule and the other group follows the new schedule After a certain amount of time the consultants measure the efficiency and level of satisfaction through counting outputs handing out quantitative questionnaires or taking other surveys If you are preparing for a data scientist interview the AB test you will likely encounter would be about the users of your website application or product For example landing page optimization is a typical AB test scenario What kind of front page will increase users click rates and conversion The content the UI the loading time and many other factors may influence users behavior Now that we have understood the importance of AB testing lets dive into the details of its steps Conducting an AB test To conduct an AB test you should know the variables that fall into the following three categories The metric This is a dependent variable that you want to measure In an experiment you can choose one or more such variables In the previous example workers efficiency is a metric Appreciating AB testing with a realworld example 207 The control variables These are variables that you can control A control variable is independent For example the font and color scheme of a landing page are controllable You want to find out how such variables influence your metric Other factors These are variables that may influence the metric but you have no direct control over them For example the wages of workers are not under your control The devices that users use to load your landing page are also not under your control However wages surely influence the level of satisfaction of workers and device screen sizes influence users clicking behavior Those factors must be identified and handled properly Lets look at an experiment on landing page optimization Lets say we want to find out about the color schemes influence on users clicking behavior We have two choices a warm color scheme and a cold color scheme We also want group A to have the same size as group B Here is a short list of variables that will influence the users clicking rate You are free to brainstorm more such variables The device the user is using for example mobile versus desktop The time at which the user opens the landing page The browser type and version for example Chrome versus Internet Explorer IE The battery level or WiFi signal level How do we deal with such variables To understand their influence on users click rates we need to first eliminate the influence of other factors so that if there is a difference in click rates we can confidently attribute the difference to the four variables we selected This is why we introduce randomization and blocking Randomization and blocking The most common way to eliminate or minimize the effect of unwanted variables is through blocking and randomization In a completely randomized experiment the individual test case will be assigned a treatmentcontrol variable value randomly In the landing page scenario this means that regardless of the device browser or the time a user opens the page a random choice of a warm color scheme or a cold color scheme is made for the user 208 Statistical Hypothesis Testing Imagine the scenario that the number of participants of the experiment is very large the effect of those unwanted variables would diminish as the sample size approaches infinity This is true because in a completely randomized experiment the larger the sample size is the smaller the effect that randomness has on our choices When 
the sample size is large enough we expect the number of IE users who see the warm color scheme to be close to the number of IE users who see the cold scheme The following computational experiment will give you a better idea of how randomization works I chose three variables in this computational experiment device browser and WiFi signal First lets assume that 60 of the users use mobile 90 of them use Chrome and 80 of them visit the website using a strong WiFi signal We also assume that there are no interactions among those variables for instance we do not assume that Chrome users have a strong preference to stick to a strong WiFi connection The following code snippet will assign a color scheme to a random combination of our three variables def buildsample device mobile if nprandomrandom 06 else desktop browser chrome if nprandomrandom 09 else IE wifi strong if nprandomrandom 08 else weak scheme warm if nprandomrandom 05 else cold return device browser wifi scheme Lets first generate 100 sample points and sort the results by the number of appearances from collections import Counter results buildsample for in range100 counter Counterresults for key in sortedcounter key lambda x counterx printkey counterkey The result looks as follows You can see that some combinations dont show up desktop IE strong warm 1 mobile IE weak cold 1 mobile IE strong cold 2 mobile chrome weak warm 3 mobile chrome weak cold 4 desktop chrome weak warm 4 Appreciating AB testing with a realworld example 209 desktop chrome weak cold 5 desktop IE strong cold 6 desktop chrome strong warm 10 desktop chrome strong cold 19 mobile chrome strong warm 20 mobile chrome strong cold 25 If you check each pair with the same setting for example users who use the mobile Chrome browser with strong WiFi signal have a roughly 5050 chance of getting the cold or warm color scheme landing page Lets try another 10000 samples The only change in the code snippet is changing 100 to 10000 The result looks like this desktop IE weak cold 41 desktop IE weak warm 45 mobile IE weak warm 55 mobile IE weak cold 66 desktop IE strong warm 152 desktop IE strong cold 189 mobile IE strong cold 200 mobile IE strong warm 228 desktop chrome weak cold 359 desktop chrome weak warm 370 mobile chrome weak cold 511 mobile chrome weak warm 578 desktop chrome strong warm 1442 desktop chrome strong cold 1489 mobile chrome strong warm 2115 mobile chrome strong cold 2160 Now you see even with the two most unlikely combinations we have about 30 to 40 data points There is although we tried to mitigate it an imbalance between the highlighted two combinations we have more cold scheme users than warm scheme users This is the benefit that randomization brings to us However this usually comes at a high cost It is not easy to obtain such large data samples in most cases There is also a risk that if the warm color scheme or cold color scheme is very bad for the users conversion rates such a largescale AB test will be regrettable 210 Statistical Hypothesis Testing With a small sample size issues of being unlucky can arise For example it is possible that IE desktop users with weak WiFi signals are all assigned the warm color scheme Given how AB testing is done there is no easy way to reverse such bad luck Blocking on the other hand arranges samples into blocks according to the unwanted variables first then randomly assigns block members to different control variable values Lets look at the landing page optimization example Instead of grouping users after providing them with random 
color schemes we group the users according to the device browser or WiFi signal before making the decision as to which color scheme to show to them Inside the block of desktop IE users we can intervene such that randomly half of them will see the warm color scheme and the other half will see the cold scheme Since all the unwanted variables are the same in each block the effect of the unwanted variables will be limited or homogeneous Further comparisons can also be done across blocks just like for complete randomization You may think of blocking as a kind of restricted randomization We want to utilize the benefit of randomization but we dont want to fall into a trap such as a specific group of candidates purely being associated with one control variable value Another example is that in a clinical trial you dont want complete randomization to lead to all aged people using a placebo which may happen You must force randomization somehow by grouping candidates first Common test statistics So an AB test has given you some data whats next You can do the following for a start Use visualization to demonstrate differences This is also called testing by visualization The deviation of the results can be obtained by running AB tests for several rounds and calculating the variance Apply a statistical hypothesis test Many of the statistical hypothesis tests we have covered can be used For example we have covered ttests a test for testing differences of means between two groups it is indeed one of the most important AB test statistics When the sizes of group A and group B are different or the variances are different we can use Welchs ttest which has the fewest assumptions involved Appreciating AB testing with a realworld example 211 For the clicking behavior of users Fishers exact test is good to use It is based on the binomial distribution I will provide you with an exercise on it in Chapter 13 Exercises and Projects For the work efficiency question we mentioned at the very beginning of this section ANOVA or a ttest can be used For a summary of when to use which hypothesis test here is a good resource httpswwwscribbrcomstatisticsstatisticaltests Tip Try to include both informative visualizations and statistical hypothesis tests in your reports This way you have visual elements to show your results intuitively as well as solid statistical analysis to justify your claims Make sure you blend them coherently to tell a complete story Common mistakes in AB tests In my opinion several common mistakes can lead to misleading AB test data collection or interpretations Firstly a careless AB test may miss important hidden variables For example say you want to randomly select users in the United States to do an AB test and you decide to do the randomization by partitioning the names For example people whose first name starts with the letters AF are grouped into a group those with GP go into another and so on What can go wrong Although this choice seems to be OK there are some pitfalls For example popular American names have changed significantly throughout the years The most popular female names in the 1960s and 1970s are Lisa Mary and Jennifer In the 2000s and 2010s the most popular female names become Emily Isabella Emma and Ava You may think that you are selecting random names but you are actually introducing biases to do with age Also different states have different popular names as well Another common mistake is making decisions too quickly Different from academic research where rigorousness is above all 
managers in the corporate world prefer to jump to conclusions and move on to the next sales goals If you only have half or even onethird of the tested data available you should hold on and wait until all the data is collected 212 Statistical Hypothesis Testing The last mistake is focusing on too many metrics or control variables at the same time It is true that several metrics can depend on common control variables and a metric can depend on several control variables Introducing too many metrics and control variables will include higherorder interactions and make the analysis less robust with low confidence If possible you should avoid tracking too many variables at the same time Higherorder interaction Higherorder interaction refers to the joint effect of three or more independent variables on the dependent variable For example obesity smoking and high blood pressure may contribute to heart issues much more severely if all three of them happen together When people refer to the main effect of something they often mean the effect of one independent variable and the interaction effect refers to the joint effect of two variables Lets summarize what we have learned in this chapter Summary This chapter was an intense one Congratulations on finishing it First we covered the concept of the hypothesis including the basic concepts of hypotheses such as the null hypothesis the alternative hypothesis and the Pvalue I spent quite a bit of time going over example content to ensure that you understood the concept of the Pvalue and significance levels correctly Next we looked at the paradigm of hypothesis testing and used corresponding library functions to do testing on various scenarios We also covered the ANOVA test and testing on time series Toward the end we briefly covered AB testing We demonstrated the idea with a classic click rate example and also pointed out some common mistakes One additional takeaway for this chapter is that in many cases new knowledge is needed to understand how a task is done in unfamiliar fields For example if you were not familiar with time series before reading this chapter now you should know how to use the unit root test to test whether an autoregressive time series is stationary or not Isnt this amazing In the next chapter we will begin our analysis of regression models Section 3 Statistics for Machine Learning Section 3 introduces two statistical learning categories regression and classification Concepts in machine learning are introduced Statistics with respect to learning models are developed and examined Methods such as boosting and bagging are explained This section consists of the following chapters Chapter 8 Statistics for Regression Chapter 9 Statistics for Classification Chapter 10 Statistics for TreeBased Methods Chapter 11 Statistics for Ensemble Methods 8 Statistics for Regression In this chapter we are going to cover one of the most important techniquesand likely the most frequently used technique in data science which is regression Regression in laymans terms is to build or find relationships between variables features or any other entities The word regression originates from the Latin regressus which means a return Usually in a regression problem you have two kinds of variables Independent variables also referred to as features or predictors Dependent variables also known as response variables or outcome variables Our goal is to try to find a relationship between dependent and independent variables Note It is quite helpful to understand word origins or 
how the scientific community chose a name for a concept It may not help you understand the concept directly but it will help you memorize the concepts more vividly 216 Statistics for Regression Regression can be used to explain phenomena or to predict unknown values In Chapter 7 Statistical Hypothesis Testing we saw examples in the Stationarity test for time series section of time series data to which regression models generally fit well If you are predicting the stock price of a company you can use various independent variables such as the fundamentals of the company and macroeconomic indexes to do a regression analysis against the stock price then use the regression model you obtained to predict the future stock price of that company if you assume the relationship you found will persist Of course such simple regression models were used decades ago and likely will not make you rich In this chapter you are still going to learn a lot from those classical models which are the baselines of more sophisticated models Understanding basic models will grant you the intuition to understand more complicated ones The following topics will be covered in this chapter Understanding a simple linear regression model and its rich content Connecting the relationship between regression and estimators Having handson experience with multivariate linear regression and collinearity analysis Learning regularization from logistic regression examples In this chapter we are going to use real financial data so prepare to get your hands dirty Understanding a simple linear regression model and its rich content Simple linear regression is the simplest regression model You only have two variables one dependent variable usually denoted by y and an independent variable usually denoted by x The relationship is linear so the model only contains two parameters The relationship can be formulated with the following formula k is the slope and b is the intercept Є is the noise term Understanding a simple linear regression model and its rich content 217 Note Proportionality is different from linearity Proportionality implies linearity and it is a stronger requirement that b must be 0 in the formula Linearity graphically means that the relationship between two variables can be represented as a straight but strict mathematical requirement of additivity and homogeneity If a relationship function f is linear then for any input x1 and x2 and scaler k we must have the following equations 𝑓𝑓𝑥𝑥1 𝑥𝑥2 𝑓𝑓𝑥𝑥1 𝑓𝑓𝑥𝑥2 and 𝑓𝑓𝑘𝑘𝑥𝑥1 𝑘𝑘𝑓𝑓𝑥𝑥1 Here is the code snippet that utilizes the yfinance library to obtain Netflixs stock price data between 2016 and 2018 You can use pip3 install yfinance to install the library If you are using Google Colab use pip3 install yfinance to run a shell command Pay attention to the symbol at the beginning The following code snippet imports the libraries import numpy as np import matplotlibpyplot as plt import random import yfinance as yf The following code snippet creates a Ticker instance and retrieves the daily stock price information The Ticker is a symbol for the stock Netflixs ticker is NFLX import yfinance as yf netflix yfTickerNFLX start 20160101 end 20180101 df netflixhistoryinterval1dstart startend end df 218 Statistics for Regression The result is a Pandas DataFrame as shown in the following figure Figure 81 Historical data for Netflix stock in 2016 and 2017 The next step for our analysis is to get an idea of what the data looks like The common visualization for the twovariable relationship is a scatter plot 
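If you would like a single self-contained block to run while following along, the sketch below gathers the download and the scatter plot into one place. It is an added convenience, not part of the original listings; it assumes the yfinance Ticker.history API used above, requires network access, and the prices Yahoo returns today may differ slightly from the figures printed here.

import matplotlib.pyplot as plt
import yfinance as yf

# Download daily Netflix prices for 2016-2017 and scatter-plot the opening price over time
netflix = yf.Ticker("NFLX")
df = netflix.history(interval="1d", start="2016-01-01", end="2018-01-01")

plt.figure(figsize=(10, 8))
plt.scatter(df.index, df["Open"])
plt.xlabel("Date")
plt.ylabel("Opening price")
plt.show()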
We are not particularly interested in picking the open price or the close price of the stock I will just pick the opening price as the price we are going to run regression against the Date column Date is not a normal column as other columns it is the index of the DataFrame You can use dfindex to access it When you convert a date to numerical values Matplotlib may throw a warning You can use the following instructions to suppress the warning The following code snippet suppresses the warning and plots the data from pandasplotting import registermatplotlibconverters registermatplotlibconverters pltfigurefigsize108 pltscatterdfindex dfOpen Understanding a simple linear regression model and its rich content 219 The result looks as shown in the following figure Figure 82 Scatter plot of Netflix stock price data Note that there are some jumps in the stock prices which may indicate stock price surges driven by good news Also note that the graph scales will significantly change how you perceive the data You are welcome to change the figure size to 103 and you may be less impressed by the performance of the stock price For this time period the stock price of Netflix seems to be linear with respect to time We shall investigate the relationship using our twoparameter simple linear regression model However before that we must do some transformation The first transformation is to convert a sequence of date objects the DataFrame index to a list of integers I redefined two variables x and y which represent the number of days since January 4 2016 and the opening stock price of that day The following code snippet creates two such variables I first created a timedelta object by subtracting the first element in the index January 4 2016 and then converted it to the number of days x dfindex dfindex0daystonumpy y dfOpentonumpy 220 Statistics for Regression Note If you checked the Netflix stock prices in the past 2 years you would surely agree with me that simple linear regression would be likely to fail We will try to use more sophisticated regression models in later chapters on such data Why dont we use standardization The reason is that in simple linear regression the slope k and intercept b when data is at its original scale have meanings For example k is the daily average stock price change Adding one more day to variable x the stock price will change accordingly Such meanings would be lost if we standardized the data Next lets take a look at how to use the SciPy library to perform the simplest linear regression based on least squared error minimization Least squared error linear regression and variance decomposition Lets first run the scipystatslinregress function to gain some intuition and I will then explain linear regression from the perspective of ANOVA specifically variance decomposition The following code snippet runs the regression from scipystats import linregress linregressxy The result looks as follows LinregressResultslope01621439447698934 intercept7483816138860539 rvalue09447803151619397 pvalue6807230675594974e245 stderr0002512657375708363 The result contains the slope and the intercept It also contains an Rvalue a Pvalue and a standard error Based on our knowledge from Chapter 7 Statistical Hypothesis Testing even without knowing the underlined hypothesis test such a small Pvalue tells you that you can reject whatever the null hypothesis is The R value is called the correlation coefficient whose squared value R2 is more wellknown the coefficient of determination Understanding a simple linear 
regression model and its rich content 221 There are two major things that the linregress function offers A correlation coefficient is calculated to quantitatively present the relationship between dependent and independent variables A hypothesis is conducted and a Pvalue is calculated In this section we focus on the calculation of the correlation coefficient and briefly talk about the hypothesis testing at the end Regression uses independent variables to explain dependent variables In the most boring case if the stock price of Netflix is a horizontal line no more explanation from the independent variable is needed The slope k can take value 0 and the intercept b can take the value of the motionless stock price If the relationship between the stock price and the date is perfectly linear then the independent variable fully explains the dependent variable in a linear sense What we want to explain quantitatively is the variance of the dependent variable npvaryleny calculates the sum of squares total SST of the stock prices The result is about 653922 The following code snippet adds the horizontal line that represents the mean of the stock prices and the differences between stock prices and their mean as vertical segments This is equivalent to estimating the stock prices using the mean stock price This is the best we can do with the dependent variable only pltfigurefigsize208 pltscatterx y ymean npmeany plthlinesymean npminx npmaxxcolorr sst 0 for x y in zipxy pltplotxxymeanycolorblacklinestyle sst y ymean2 printsst The total variance intuitively is the summed square of the differences between the stock price of the mean following the following formula You can verify that the SST variable is indeed about 653922 𝑦𝑦𝑖𝑖 𝑦𝑦2 𝑖𝑖 222 Statistics for Regression As you may expect the differences between the stock price and the mean are not symmetrically distributed along time due to the increase in Netflix stock The difference has a name residuals The result looks as shown in the following figure Figure 83 Visualization of SST and residuals If we have a known independent variable x in our case the number of days since the first data point we prefer a sloped line to estimate the stock price rather than the naïve horizontal line now Will the variance change Can we decrease the summed square of residuals Regardless of the nature of the additional independent variable we can first approach this case from a pure errorminimizing perspective I am going to rotate the line around the point npmeanxnpmeany Lets say now we have a slope of 010 The following code snippet recalculates the variance and replots the residuals Note that I used the variable sse sum of squared errors SSE to denote the total squared errors as shown in the following example pltfigurefigsize208 pltscatterx y ymean npmeany xmean npmeanx pltplotnpminxnpmaxx x2y01xmeanymeannpminxx2y01x meanymeannpmaxx colorr sse 0 for x y in zipxy yonline x2y01xmeanymeanx Understanding a simple linear regression model and its rich content 223 pltplotxxyon lineycolorblacklinestyle sse yonline y2 printsse The SSE is about 155964 much smaller than the SST Lets check the plot generated from the preceding code snippet Figure 84 Visualization of SSE and residuals It is visually clear that the differences for the data points shrink in general Is there a minimal value for SSE with respect to the slope k The following code snippet loops through the slope from 0 to 03 and plots it against SSE ymean npmeany xmean npmeanx slopes nplinspace00320 sses 0 for i in rangelenslopes 
Is there a minimal value of the SSE with respect to the slope k? The following code snippet loops through slopes from 0 to 0.3 and plots them against the SSE:

    ymean = np.mean(y)
    xmean = np.mean(x)
    slopes = np.linspace(0, 0.3, 20)
    sses = [0 for i in range(len(slopes))]
    for x_, y_ in zip(x, y):
        for i in range(len(sses)):
            y_on_line = x2y(slopes[i], xmean, ymean)(x_)
            sses[i] += (y_on_line - y_) ** 2
    plt.figure(figsize=(20, 8))
    plt.rc("xtick", labelsize=18)
    plt.rc("ytick", labelsize=18)
    plt.plot(slopes, sses)

The result looks as shown in the following graph.

Figure 8.5: Slope versus the SSE

This visualization demonstrates the exact idea of Least Squared Error (LSE): as we change the slope, the SSE changes, and at some point it reaches its minimum. In linear regression, the sum of squared errors is parabolic in the parameters, which guarantees the existence of such a unique minimum.

Note
The intercept is also an undetermined parameter. However, the intercept is usually of less interest because it is just a shift along the y axis, which doesn't reflect how strongly the independent variable correlates with the dependent variable.

The following code snippets consider the influence of the intercept as well. To find the minimum with respect to the two parameters, we need a 3D plot. You are free to skip this code; it won't block you from learning the further materials in this chapter. The first snippet prepares the data for the visualization:

    def cal_sse(slope, intercept, x, y):
        sse = 0
        for x_, y_ in zip(x, y):
            y_on_line = x2y(slope, 0, intercept)(x_)
            sse += (y_on_line - y_) ** 2
        return sse

    slopes = np.linspace(-1, 1, 20)
    intercepts = np.linspace(-200, 400, 20)
    slopes, intercepts = np.meshgrid(slopes, intercepts)
    sses = np.zeros(intercepts.shape)
    for i in range(sses.shape[0]):
        for j in range(sses.shape[1]):
            sses[i][j] = cal_sse(slopes[i][j], intercepts[i][j], x, y)

The next snippet plots the 3D surface, namely the SSE versus the slope and the intercept:

    from mpl_toolkits.mplot3d import Axes3D
    from matplotlib import cm
    fig = plt.figure(figsize=(14, 10))
    ax = fig.gca(projection="3d")
    ax.view_init(40, 30)
    ax.set_xlabel("slope")
    ax.set_ylabel("intercept")
    ax.set_zlabel("sse")
    plt.rc("xtick", labelsize=8)
    plt.rc("ytick", labelsize=8)
    surf = ax.plot_surface(slopes, intercepts, sses, cmap=cm.coolwarm,
                           linewidth=0, antialiased=True)
    fig.colorbar(surf, shrink=0.5, aspect=5)
    plt.show()

The result looks as shown in the following figure.

Figure 8.6: SSE as a function of slope and intercept

The combination of the optimal values of the slope and the intercept gives us the minimal SSE. This is a good time to answer a natural question: how much of the variance in the dependent variable can be attributed to the independent variable? The answer is given by the R² value, defined as follows:

$$R^2 = \frac{SST - SSE}{SST}$$

It can also be written as shown in the following equation:

$$R^2 = \frac{SSR}{SST}$$

where SSR = SST − SSE is called the sum of squared regression, or the regression sum of squares.

Do a thought experiment with me: if R² = 1, it means we have no error left after regression; all the change in the dependent variable can be attributed to the change in the independent variable, up to a proportionality coefficient k and a shift value b. This is too good to be true in general, and it is the best that simple linear regression can do. The slope and intercept given by the linregress function give us an R² value of about 0.89. The verification of this value is left to you as an exercise.

Next, we are going to learn about the limitations of R². The coefficient of determination R² is a key indicator of the quality of a regression model: if SSR is large, it means we captured most of the change of the dependent variable with the change of the independent variable. If you have a very large R² and your model is simple, the story can end here.
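As a side check of the earlier claim that the SSE surface has a single minimum, a general-purpose optimizer should land on essentially the same slope and intercept that linregress reports. The following is a minimal sketch, assuming the x and y arrays from above; using scipy.optimize.minimize here is my own choice, not something the chapter's code relies on:

    import numpy as np
    from scipy.optimize import minimize

    def sse_of(params, x, y):
        slope, intercept = params
        return np.sum((y - (slope * x + intercept)) ** 2)

    result = minimize(sse_of, x0=[0.0, np.mean(y)], args=(x, y), method="Nelder-Mead")
    print(result.x)  # should be close to (0.162, 74.8)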
However, beyond simple linear regression, R² can sometimes be misleading. Let's take multivariable polynomial regression as an example. Say we have two independent variables, x1 and x2, and we are free to use derived variables such as x1² as new predictors. The expected expression of the dependent variable y will look like the following:

$$y = \beta + \alpha_{11}x_1 + \alpha_{21}x_2 + \alpha_{12}x_1^2 + \cdots + \alpha_{kl}x_k^l$$

In the stock price example, you could pick an additional independent variable such as the unemployment rate of the United States. Although there is little meaning in taking the square of the number of days or of the unemployment rate, nothing stops you from doing it anyway.

Note
In a simple linear model, you often see r² rather than R² used to denote the coefficient of determination; r² is only used in the context of simple linear regression.

R² will always increase when you add additional $\alpha_{kl}x_k^l$ terms. Given a dataset, R² represents the power of explainability on that dataset only. You could even regress the stock price on your body weight, if you measured it daily during that time period, and you might find a better R². An increased R² alone doesn't necessarily indicate a better model.

A large R² doesn't indicate any cause-effect relationship either. For example, the passage of time doesn't drive the stock price of Netflix higher; it is not the cause of the change in the dependent variable. This is a common logical fault, and a large R² just magnifies it in many cases. It is always risky to conclude cause-effect relationships without thorough experiments.

R² is also very sensitive to a single data point. For example, I created a set of data to demonstrate this. The following code snippet does the job:

    np.random.seed(2020)
    x = np.linspace(0, 2, 20)
    y = 3 * x + np.random.normal(size=len(x))
    x_new = np.append(x, np.array([0]))
    y_new = np.append(y, np.array([10]))
    plt.scatter(x, y)
    plt.scatter(0, 10)
    linregress(x_new, y_new)

The plot looks as shown in the following figure; pay attention to the one outlier at the top left.

Figure 8.7: Linear data with an outlier

The R² value is less than 0.3. However, removing the outlier brings the R² value up to around 0.88.

A small R² may also indicate that you are using the wrong model in the first place. Take the following as an example: simple linear regression is not suitable for fitting parabolic data.

    np.random.seed(2020)
    x = np.linspace(0, 2, 20)
    y = 4 * x**2 - 8 * x + np.random.normal(scale=0.5, size=len(x))
    plt.scatter(x, y)
    linregress(x, y)

The R² is less than 0.01. It is not correct to apply simple linear regression to such a dataset, where the nonlinearity is obvious. You can see the failure of simple linear regression on such a dataset in the following figure. This is also why exploratory data analysis should be carried out before building models; we discussed the related techniques in Chapter 2, Essential Statistics for Data Assessment, and Chapter 3, Visualization with Statistical Graphs.

Figure 8.8: A dataset where simple linear regression will fail

Connecting the relationship between regression and estimators

Setting the derivative of the log-likelihood with respect to b to zero, we also obtain the following equation:

$$\frac{\partial \log L(k, b)}{\partial b} \propto \sum_i (y_i - k x_i - b) = 0$$

Note
The log-likelihood function depends on each data point only through $(y_i - k x_i - b)^2$, whose sum is exactly the SSE. Maximizing the log-likelihood is therefore equivalent to minimizing the squared error.

Now we have two unknowns, k and b, and two equations, so we can solve the system algebraically. Indeed, we have already solved the problem graphically through the 3D visualization, but it is nice to have an algebraic solution. We have the following formulas:

$$k = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b = \bar{y} - k\bar{x}$$
Now let's calculate the slope and intercept for the Netflix data with these formulas and check them against the linregress results. The following code does the job:

    x = (df.index - df.index[0]).days.to_numpy()
    y = df["Open"].to_numpy()
    xmean = np.mean(x)
    ymean = np.mean(y)
    k = np.sum((x - xmean) * (y - ymean)) / np.sum((x - xmean) ** 2)
    b = ymean - k * xmean
    print(k, b)

The results are about 0.16214 and 74.838, which agree with the linregress results perfectly. A purely computational approach, however, is not illuminating in the sense of mathematical intuition. Next, let's try to understand simple linear regression from the estimation perspective.

You might have noticed that the expression essentially gives you k: in the linear model, the correlation coefficient r_xy connects k to the standard deviations of the dependent and independent variables. Due to an inequality restriction, r_xy can only take values between −1 and 1, and equality is reached when x and y are perfectly correlated, negatively or positively. The square of r_xy gives us R². r_xy can be either positive or negative, whereas R² carries no directional information, only the strength of the explanation.

The standard error reported by linregress is associated with the estimated value of k. Due to space limitations, we can't go over the mathematics here; however, knowing that an estimator is itself a random variable with a variance is enough to continue this chapter. A smaller variance of an estimator means it is more robust and stable. A corresponding concept is the efficiency of an estimator: two unbiased estimators may have different efficiencies, and a more efficient estimator has a smaller variance for all possible values of the estimated parameters. In general, the variance can't be made arbitrarily small; the so-called Cramér–Rao lower bound restricts the minimal variance that can be achieved by an unbiased estimator.

Note
I would like to suggest an interesting read on this Cross Validated question, which you will find here: https://stats.stackexchange.com/questions/64195/how-do-i-calculate-the-variance-of-the-ols-estimator-beta-0-conditional-on

Having hands-on experience with multivariate linear regression and collinearity analysis

Simple linear regression is rarely sufficient on its own, because in reality many factors contribute to a given outcome. We want to increase the complexity of our model to capture more sophisticated one-to-many relationships. In this section, we'll study multivariate linear regression and collinearity analysis.

First, we add more terms into the equation, as follows:

$$y = k_1 x_1 + k_2 x_2 + \cdots + k_n x_n + \epsilon$$

There is no nonlinear term, and the independent variables contribute to the dependent variable collectively. For example, people's wages can be a dependent variable, and their age and number of years of employment can be good explanatory independent variables.

Note on multiple regression versus multivariate regression
You may see multiple linear regression and multivariate linear regression used interchangeably. Strictly speaking, they are different: multiple linear regression means that there are multiple independent variables, while multivariate linear regression means the response (dependent) variable is a vector, so you must run a regression for each of its elements.

I will be using an exam dataset for demonstration purposes in this section. The dataset is provided in the official GitHub repository of this book. The following code snippet reads the data:

    import pandas as pd
    exam = pd.read_csv("exams.csv")
    exam

Let's inspect the data.

Figure 8.9: Exam data for multivariate linear regression
The dataset contains three in-semester exams (the independent variables) and one final exam (the dependent variable). Let's first do some exploratory graphing. Here I introduce a new kind of plot, the violin plot. It is like a boxplot, but it gives a better idea of how the data is distributed between the first and third quartiles. Don't hesitate to try something new when you are learning with Python.

First, we need to transform our exam DataFrame into the long format, such that it only contains two columns: the score column and the exam_name column. We covered a similar example for the boxplot in Chapter 2, Essential Statistics for Data Assessment; feel free to review that part. The following code snippet does the transformation:

    exam["index"] = exam.index
    exam_long = pd.melt(exam, id_vars="index",
                        value_vars=exam.columns[:-1])[["variable", "value"]]
    exam_long.columns = ["exam_name", "score"]

I will sample 10 rows with exam_long.sample(10) from the new DataFrame for a peek.

Figure 8.10: The exam DataFrame in long format

The following code snippet displays the violin plot; you will see why it is called a violin plot:

    import seaborn as sns
    sns.set_style("whitegrid")
    plt.figure(figsize=(8, 6))
    sns.violinplot(x="exam_name", y="score", data=exam_long)

The result looks as shown in the following figure.

Figure 8.11: Violin plot of the exam scores

We see that the score distributions are somewhat alike for the first three exams, whereas the final exam has a longer tail. Next, let's do a set of scatter plots of the final exam against each of the other exams with the following code snippet:

    fig, ax = plt.subplots(1, 3, figsize=(12, 6))
    ax[0].scatter(exam["EXAM1"], exam["FINAL"], color="green")
    ax[1].scatter(exam["EXAM2"], exam["FINAL"], color="red")
    ax[2].scatter(exam["EXAM3"], exam["FINAL"])
    ax[0].set_xlabel("Exam 1 score")
    ax[1].set_xlabel("Exam 2 score")
    ax[2].set_xlabel("Exam 3 score")
    ax[0].set_ylabel("Final exam score")

The result looks as shown in the following figure.

Figure 8.12: Final exam versus the other three exams

The linear model seems to be a great choice for our dataset, because visually the numbered exam scores are strongly linearly correlated with the final exam score. From simple linear regression to multivariate regression, the idea is the same: we want to minimize the sum of squared errors. Let me use the statsmodels library to run an Ordinary Least Squares (OLS) regression. The following code snippet does the job:

    import statsmodels.api as sm
    X = exam[["EXAM1", "EXAM2", "EXAM3"]].to_numpy()
    X = sm.add_constant(X)
    y = exam["FINAL"].to_numpy()
    sm.OLS(y, X).fit().summary()

The result looks as shown in the following figures. Although there is a lot of information here, we will only cover the essential parts. First, the regression result summary is listed in the following figure.

Figure 8.13: OLS regression result

Secondly, the characteristics of the dataset and the model are also provided.

Figure 8.14: OLS model characterization

Lastly, the coefficients and statistics for each predictor feature (the numbered exam scores) are provided in the following table.

Figure 8.15: Coefficients and statistics for each predictor feature

First, R² is close to 1, which is a good sign that the regression on the independent variables successfully captured almost all the variance of the dependent variable. Then we have an adjusted R², defined as shown in the following equation:

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - df - 1}$$

Here, df is the degrees of freedom of the model (3 in our case, because we have 3 independent variables) and n is the number of data points.
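The adjusted R² shown in the summary can be reproduced directly from R², n, and df. The following minimal sketch assumes the X and y arrays built above, and simply keeps the fitted results in a variable instead of only printing the summary:

    results = sm.OLS(y, X).fit()
    n, df_model = len(y), 3
    r2_adj = 1 - (1 - results.rsquared) * (n - 1) / (n - df_model - 1)
    print(r2_adj, results.rsquared_adj)  # the two numbers should match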
Note on the sign of the adjusted R²
The adjusted R² penalizes the model for adding more independent variables. It can even become negative if the original R² is not large and you add meaningless independent variables.

Collinearity

The linear model we just built seems good, but the warning message in the summary says that there are multicollinearity issues. From the scatter plots, we saw that the final exam score seems to be predictable from any one of the exams. Let's check the correlation coefficients between the exams:

    exam[["EXAM1", "EXAM2", "EXAM3"]].corr()

The result looks as shown in the following figure.

Figure 8.16: Correlation between the numbered exams

EXAM 1 and EXAM 2 have a correlation coefficient of more than 0.90, and the smallest coefficient is around 0.85. Strong collinearity between independent variables is a problem because it tends to inflate the variance of the estimated regression coefficients. This becomes clearer if we regress the final score on EXAM 1 alone:

    linregress(exam["EXAM1"], exam["FINAL"])

The result is as follows:

    LinregressResult(slope=1.8524548489068682, intercept=15.621968742401123,
                     rvalue=0.9460708318102032, pvalue=9.543660489160869e-13,
                     stderr=0.13226692073027208)

Note that, with a slightly smaller R² value, the standard error of the slope is less than 10% of the estimated value. However, if you check the output for the three-independent-variable case, the standard errors for the coefficients of EXAM 1 and EXAM 2 are about one third and one fifth of their estimated values, respectively. How so? An intuitive and vivid argument is that the model is confused about which independent variable it should attribute the variance to. The EXAM 1 score alone explains 94% of the total variance, and the EXAM 2 score can explain almost 93% of the total variance too. The model could assign a more deterministic slope to either the EXAM 1 score or the EXAM 2 score, but when both are present simultaneously, the model is confused, which numerically inflates the standard errors of the regression coefficients. In some numerical algorithms where randomness plays a role, running the same program twice might even give different sets of coefficients; sometimes a coefficient can come out negative when you already know it should be positive.

Are there quantitative ways to detect collinearity? There are two common methods: the first pre-examines the variables, and the second checks the Variance Inflation Factor (VIF).

- You can check the correlation coefficients between pairs of independent variables, as we just did in the example. A large absolute value of the correlation coefficient is usually a bad sign.
- Calculating the VIF is more systematic and unbiased in general. To calculate the VIF of a coefficient, we regress its corresponding independent variable x_i on the rest of the independent variables, obtain the resulting R_i², and calculate VIF_i using the following equation:

$$VIF_i = \frac{1}{1 - R_i^2}$$

Let's do an example. I will use the EXAM 2 and EXAM 3 scores as the independent variables and the EXAM 1 score as the dependent variable:

    X = exam[["EXAM2", "EXAM3"]].to_numpy()
    X = sm.add_constant(X)
    y = exam["EXAM1"].to_numpy()
    sm.OLS(y, X).fit().rsquared

The result is around 0.872; therefore, the VIF is about 7.8. This is already a big value: a VIF greater than 10 suggests serious collinearity.
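statsmodels also ships a helper that computes the same quantity for every column at once. The following is a sketch of that cross-check, assuming the exam DataFrame and the column names used above; the constant column is included in the design matrix but skipped when printing:

    from statsmodels.stats.outliers_influence import variance_inflation_factor

    X_vif = sm.add_constant(exam[["EXAM1", "EXAM2", "EXAM3"]].to_numpy())
    for i in range(1, X_vif.shape[1]):          # column 0 is the constant
        print(variance_inflation_factor(X_vif, i))

The value printed for the EXAM 1 column should be close to the 7.8 we computed by hand.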
Is collinearity an issue? The answer is yes and no; it depends on our goals. If our goal is only to predict the dependent variable as accurately as possible, then it is not an issue. However, in most cases we don't want to carry unnecessary complexity and redundancy in the model. There are several ways to get rid of collinearity; some of them are as follows:

- Select some of the independent variables and drop the rest. This may lose information.
- Obtain more data. More data brings diversity into the model and will reduce the variance.
- Use Principal Component Analysis (PCA) to transform the independent variables into fewer new variables. We will not cover it here because of space limitations; the idea is to bundle the explainability of the independent variables together into new variables.
- Use lasso regression. Lasso regression is regression with L1-norm regularization. In the next section, we will see how it is done and what exactly the L1 norm means.

Learning regularization from logistic regression examples

L1-norm regularization, which penalizes the complexity of a model, is also called lasso regularization. The basic idea of regularization in a linear model is that the parameters of the model can't be too large, so that not too many factors contribute strongly to the predicted outcomes. Lasso does one more thing: it penalizes not only the magnitude of the parameters but also, effectively, their very existence.

The name lasso comes from least absolute shrinkage and selection operator. It shrinks the values of the parameters in a model and, because it uses the absolute-value form, it also helps with selecting explanatory variables. We will see how this works soon.

Lasso regression is just like linear regression, but instead of minimizing the sum of squared errors, it minimizes the following function, where the index i loops over all data points and j loops over all coefficients:

$$\sum_i \left(y_i - k_1 x_{1i} - k_2 x_{2i} - \cdots - k_v x_{vi} - \beta\right)^2 + \lambda \sum_j |k_j|$$

Unlike standard OLS, this function no longer has an intuitive graphical representation; it is an objective function. An objective function is a term from optimization: we choose input values to maximize or minimize its value.

The squared term on the left of the objective function is the OLS sum of squared errors. The term on the right is the regularization term. λ, a positive number, is called the regularization coefficient; it controls the strength of the penalization. The regularization term is artificial: here all the regression coefficients (slopes) share the same regularization coefficient, but it is perfectly okay to assign different regularization coefficients to different regression coefficients. When λ = 0, we get back OLS. As λ increases, the coefficients shrink more and more and eventually reach 0.

If you change the regularization term from $\lambda \sum_j |k_j|$ to $\lambda \sum_j k_j^2$, just like the OLS term, you get ridge regression. Ridge regression also helps control the complexity of a model, but it doesn't help with selecting explanatory variables. We will compare the effects with examples, running lasso regression, ridge regression, and ordinary linear regression with modules from the sklearn library.

Note
It is a good habit to check how the same functionality is defined in different libraries, so that you can compare results meaningfully. For example, in the sklearn library, the Lasso objective is defined with the sum of squared errors scaled down by a factor of 1/(2n). If you don't check the documentation and simply compare results with your own calculation, you may end up with confusing conclusions about the choice of regularization coefficient. This is also why, in the code that follows, the regularization coefficient for the ridge model is multiplied by 2n: the APIs of the two models are not consistent in the sklearn library.
The following code snippet prepares the data as before:

    from sklearn import linear_model
    X = exam[["EXAM1", "EXAM2", "EXAM3"]].to_numpy()
    y = exam["FINAL"].to_numpy()

In sklearn, the regularization coefficient is called α, so I am going to use α instead of λ. First, I choose α to be 0.1:

    alpha = 0.1
    linear_regressor = linear_model.LinearRegression()
    linear_regressor.fit(X, y)
    lasso_regressor = linear_model.Lasso(alpha=alpha)
    lasso_regressor.fit(X, y)
    ridge_regressor = linear_model.Ridge(alpha=alpha * len(y) * 2)
    ridge_regressor.fit(X, y)
    print("linear model coefficient: ", linear_regressor.coef_)
    print("lasso model coefficient: ", lasso_regressor.coef_)
    print("ridge model coefficient: ", ridge_regressor.coef_)

The result reads as follows:

    linear model coefficient:  [0.35593822 0.54251876 1.16744422]
    lasso model coefficient:  [0.35537305 0.54236992 1.16735218]
    ridge model coefficient:  [0.3609811  0.54233219 1.16116573]

Note that there isn't much difference in the values: our regularization term is still small compared to the sum-of-squared-error term. Next, I will vary α over a grid of values and record the magnitude of the three sets of coefficients with respect to increasing α:

    linear_regressor = linear_model.LinearRegression()
    linear_regressor.fit(X, y)
    linear_coefficient = np.array([linear_regressor.coef_] * 20).T
    lasso_coefficient = []
    ridge_coefficient = []
    alphas = np.linspace(1, 400, 20)
    for alpha in alphas:
        lasso_regressor = linear_model.Lasso(alpha=alpha)
        lasso_regressor.fit(X, y)
        ridge_regressor = linear_model.Ridge(alpha=alpha * len(y) * 2)
        ridge_regressor.fit(X, y)
        lasso_coefficient.append(lasso_regressor.coef_)
        ridge_coefficient.append(ridge_regressor.coef_)
    lasso_coefficient = np.array(lasso_coefficient).T
    ridge_coefficient = np.array(ridge_coefficient).T

Note that the T method is very handy: it transposes a two-dimensional NumPy array. The following code snippet plots all the coefficients against the regularization coefficient; note how I use the loc parameter to position the legend:

    plt.figure(figsize=(12, 8))
    for i in range(3):
        plt.plot(alphas, linear_coefficient[i],
                 label="linear coefficient {}".format(i),
                 c="r", linestyle=":", linewidth=6)
        plt.plot(alphas, lasso_coefficient[i],
                 label="lasso coefficient {}".format(i),
                 c="b", linestyle="--", linewidth=6)
        plt.plot(alphas, ridge_coefficient[i],
                 label="ridge coefficient {}".format(i),
                 c="g", linestyle="-.", linewidth=6)
    plt.legend(loc=(0.7, 0.5), fontsize=14)
    plt.xlabel("Alpha")
    plt.ylabel("Coefficient magnitude")

The result looks as shown in the following figure. Note that different line styles indicate different regression models.

Note
Different colors (if you are reading a grayscale book, check the Jupyter notebook) indicate different coefficients.

Figure 8.17: Coefficient magnitudes versus the regularization coefficient

Note that the dotted lines don't change with respect to the regularization coefficient, because plain linear regression is not regularized. The lasso and ridge coefficients start at roughly the same levels as their multiple-linear-regression counterparts. The ridge coefficients decrease toward roughly the same scale and reach about 0.2 when α = 400. The lasso coefficients, on the other hand, decrease to 0 one by one, the last of them around α = 250.

You might object that when a coefficient is smaller than 1, its squared value is smaller than its absolute value. This is true, but the fact that the lasso coefficients decrease all the way to 0 doesn't depend on it: you can do an experiment by multiplying all the independent variables by 0.1 to amplify the coefficients, and you will find similar behavior. This is left to you as an exercise.
So when α is large, why does lasso regression tend to eliminate coefficients entirely, while ridge regression tends to drive all coefficients toward roughly the same magnitude? Let's do one last thought experiment to end this chapter.

Consider a scenario where we have two positive coefficients, k1 and k2, with k1 larger than k2. Under the lasso penalization, decreasing either coefficient by a small value δ decreases the penalty term by the same amount, λδ; no secret there. In ridge regression, however, decreasing the larger value k1 always decreases the penalty more, as shown in the following equation:

$$\Delta_{k_1} = k_1^2 - (k_1 - \delta)^2 = 2k_1\delta - \delta^2$$

For k2, you can do the same calculation:

$$\Delta_{k_2} = k_2^2 - (k_2 - \delta)^2 = 2k_2\delta - \delta^2$$

Because k1 is greater than k2, decreasing the larger value benefits the minimization more. Ridge regression therefore discourages the elimination of smaller coefficients and prefers shrinking larger ones, while lasso regression is capable of generating a sparse model with fewer active coefficients. These regularizations, especially ridge regression, are particularly useful for handling multicollinearity. For readers interested in exploring this further, I recommend the corresponding chapters of the classic book The Elements of Statistical Learning by Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie.

Summary

In this chapter, we thoroughly went through basic simple linear regression, demystified some core concepts of linear regression, and inspected the linear regression model from several perspectives. We also studied the problem of collinearity in multiple linear regression and proposed solutions. At the end of the chapter, we covered two more advanced and widely used regression models: lasso regression and ridge regression. The concepts introduced in this chapter will be helpful in our future endeavors.

In the next chapter, we are going to study another important family of machine learning problems and the statistics behind them: classification problems.

9
Statistics for Classification

In the previous chapter, we covered regression problems, where a numerical relationship between independent variables and dependent variables is established. Different from regression problems, classification problems aim to predict a categorical dependent variable from independent variables. For example, with the same Netflix stock price data (and other potential data), we could build a model that uses historical data to predict whether the stock price will rise or fall after a fixed amount of time. In this case, the dependent variable is binary, rise or fall (let's ignore the possibility of an unchanged price for simplicity), so this is a typical binary classification problem. We will look at similar problems in this chapter.

In this chapter, we will cover the following topics:

- Understanding how a logistic regression classifier works
- Learning how to evaluate the performance of a classifier
- Building a naïve Bayesian classification model from scratch
- Learning the mechanisms of a support vector classifier
- Applying cross-validation to avoid classification model overfitting

We have a lot of concepts and coding to cover, so let's get started.

Understanding how a logistic regression classifier works

Although this section's name may sound odd, it is correct: logistic regression is indeed a regression model, but it is mostly used for classification tasks. A classifier is a model that contains a set of rules or formulas (sometimes millions or more) used to perform the classification task.
In a simple logistic regression classifier, we only need one rule, built on a single feature, to perform the classification. Logistic regression is very popular in both traditional statistics and machine learning. The name logistic originates from the function used in logistic regression, the logistic function. Logistic regression is a Generalized Linear Model (GLM). GLM is not a single model but an extended group of models built around Ordinary Least Squares (OLS): roughly speaking, the linear part of a GLM is similar to OLS, but various kinds of transformations and interpretations are introduced, so GLM models can be applied to problems that simple OLS models can't handle directly. You will see what this means for logistic regression in the following section.

The logistic function and the logit function

It's easier to look at the logit function first, because it has a more intuitive physical meaning. The logit function has another name, the log-odds function, which makes much more sense. The standard logit function takes the form

$$\operatorname{logit}(p) = \log\frac{p}{1-p}$$

where p is between 0 and 1, indicating the probability of one of the two possibilities in a binary outcome. The logistic function is the inverse of the logit function. A standard logistic function takes the form

$$\frac{1}{1 + e^{-x}}$$

where x can take any value from −∞ to ∞, and the function takes values between 0 and 1.

The task for this section is to predict whether a stock index, such as SPX, will rise or fall, from another index called the fear and greedy index. The fear and greedy index is an artificial index that represents the sentiment of the stock market: when most people are greedy, the index is high and the overall stock index is likely to rise; when most people are fearful, the index is low and the stock index is likely to fall. There are various kinds of fear and greedy indexes. The one composed by CNN Money is on a 100-point scale and combines influences from seven other economic and financial indicators; 50 represents neutral sentiment, whereas larger values indicate the greediness of the market.

We are not going to use real data, though. Instead, I will use a set of artificial data, as shown in the following code snippet. As we did in Chapter 7, Statistical Hypothesis Testing, I will hide the generation of the data from you until the end of the section. The following code snippet creates the scatter plot of the fear and greedy index against the stock index change of the corresponding day:

    plt.figure(figsize=(10, 6))
    plt.scatter(fg_index[stock_index_change >= 0],
                stock_index_change[stock_index_change >= 0],
                s=200, marker=6, label="Up")
    plt.scatter(fg_index[stock_index_change < 0],
                stock_index_change[stock_index_change < 0],
                s=200, marker=7, label="Down")
    plt.hlines(0, 0, 100, label="Neutral line")
    plt.xlabel("Fear & Greedy Index", fontsize=20)
    plt.ylabel("Stock Index Change", fontsize=20)
    plt.legend(ncol=3)

The graph looks as in the following figure.

Figure 9.1: Fear and greedy index versus stock index change
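Before going further, here is a quick, self-contained sanity check that the logistic function really is the inverse of the logit function. This is my own sketch, separate from the chapter's running example:

    import numpy as np

    def logit(p):
        return np.log(p / (1 - p))

    def logistic(x):
        return 1 / (1 + np.exp(-x))

    p = np.linspace(0.01, 0.99, 5)
    print(np.allclose(logistic(logit(p)), p))  # True: the two functions undo each other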
problem leads to the core of logistic regression Instead of predicting the probability we predict the odds Instead of predicting the hardcut binary outcome we can predict the probability of one of the outcomes We can predict the probability that the stock index will rise as p therefore 1 p becomes the probability that the stock price will fall no matter how small the scale is Notes on the term odds The odds of an event out of possible outcomes is the ratio of the events probability and the rest You might have heard the phrase against all odds which means doing something when the odds of success are slim to none Although probability is limited to 01 the odds can take an arbitrary value from 0 to infinity By applying a shift to the odds we get negative values to suit our needs By running a regression against the odds we have an intermediate dependent variable which is numerical and unbounded However there is one final question How do we choose the parameter of the regression equation We want to maximize our likelihood function Given a set of parameters and corresponding predicted probabilities we want the predictions to maximize our likelihood function In our stock price example it means the data points from the up group have probabilities of being up that are as large as possible and data points from the down group have probabilities of being down that are as large as possible too You can review Chapter 6 Parametric Estimation to refresh your memory of the maximal likelihood estimator Understanding how a logistic regression classifier works 251 Implementing logistic regression from scratch Make sure you understand the chain of logic before we start from a regression line to go over the process Then we talk about how to find the optimal values of this regression lines parameters Due to the limit of space I will omit some code and you can find them in this books official GitHub repository httpsgithub comPacktPublishingEssentialStatisticsforNonSTEMData Analysts The following is the stepbystep implementation and corresponding implementation of logistic regression We used our stock price prediction example We start by predicting the odds as a numerical outcome 1 First l will draw a sloped line to be our first guess of the regression against the odds of the stock index rising 2 Then I will project the corresponding data points on this regressed line 3 Lets look at the results The following graph has two y axes The left axis represents the odds value and the right axis represents the original stock index change I plotted one arrow to indicate how the projection is done The smaller markers as shown in the following figure are for the right axis and the large markers on the inclined line indicate the regressed odds Figure 92 Fear and greedy index versus odds The regression parameters I chose are very simpleno intercept and only a slope of 01 Notice that one of the up data points has smaller odds than one of the down data points 252 Statistics for Classification 4 Now we transform the odds into probability This is where the logistic function comes into play Probability 1 1 𝑒𝑒𝑘𝑘 𝑜𝑜𝑜𝑜𝑜𝑜𝑜𝑜 Note that the k parameter can be absorbed into the slope and intercept so we still have two parameters However since the odds are always positive the probability will only be larger than 1 2 We need to apply a shift to the odds whose value can be observed into the parameter intercept I apply an intercept of 5 and we get the following Figure 93 Fear and greedy index versus shifted odds Note that the shifted values lose 
Note that the shifted values lose the meaning of odds, because negative odds don't make sense. Next, we need to use the logistic function to transform the regressed, shifted odds into a probability. The following code snippet defines a handy logistic function; you may also see the name sigmoid function in other materials, which is the same thing (the word sigmoid means shaped like the character S):

    def logistic(x):
        return 1 / (1 + np.exp(-x))

The following two code snippets plot the shifted odds and the transformed probability on the same graph. I also defined a new function called cal_shifted_odds for clarity. We plot the odds with the first snippet:

    def cal_shifted_odds(val, slope, intercept):
        return val * slope + intercept

    slope, intercept = 0.1, -5
    fig, ax1 = plt.subplots(figsize=(10, 6))
    shifted_odds = cal_shifted_odds(fg_index, slope, intercept)
    ax1.scatter(fg_index[stock_index_change >= 0],
                shifted_odds[stock_index_change >= 0],
                s=200, marker=6, label="Up")
    ax1.scatter(fg_index[stock_index_change < 0],
                shifted_odds[stock_index_change < 0],
                s=200, marker=7, label="Down")
    ax1.plot(fg_index, shifted_odds, linewidth=2, c="red")

The following code snippet continues by plotting the probability:

    ax2 = ax1.twinx()
    ax2.scatter(fg_index[stock_index_change >= 0],
                logistic(shifted_odds[stock_index_change >= 0]),
                s=100, marker=6, label="Up")
    ax2.scatter(fg_index[stock_index_change < 0],
                logistic(shifted_odds[stock_index_change < 0]),
                s=100, marker=7, label="Down")
    ax2.plot(fg_grids,
             logistic(cal_shifted_odds(fg_grids, slope, intercept)),
             linewidth=4, linestyle=":", c="green")
    ax1.set_xlabel("Fear & Greedy Index", fontsize=20)
    ax1.set_ylabel("Odds - 5", fontsize=20)
    ax2.set_ylabel("Probability of Going Up", fontsize=20)
    plt.legend(fontsize=20)

The result is a nice graph that shows the shifted odds and the transformed probability side by side. The dotted curve corresponds to the right axis; it has an S shape, and the data points projected onto it are assigned probabilities.

Figure 9.4: Transformed probability and shifted odds

5. Now you can pick a probability threshold to classify the data points. A natural choice is 0.5. Check out the following graph, where I use circles to mark the up data points; I am going to call those points positive. The term comes from clinical testing, where clinical experiments are done.

Figure 9.5: Threshold and positive data points

As you can see, one negative data point was misclassified as positive, which means we misclassified a day on which the stock index went down as a day on which it went up. If you bought the stock index on that day, you would lose money.

Positive and negative
The term positive is relative, as it depends on the problem. In general, when something interesting or significant happens, we call it positive. For example, if a radar detects an incoming airplane, that is a positive event; if you test positive for a virus, it means you carry the virus.

Since our threshold amounts to a single linear cut, we can't reach a perfect classification here. The following classifier with a threshold of 0.8 gets the previously misclassified negative data point right, but it won't fix the misclassified positive data point below it.

Figure 9.6: Threshold 0.8

Which one is better? In this section, we converted a regression model into a classification model; however, we don't yet have metrics to evaluate the performance of different threshold choices. In the next section, let's examine the performance of the logistic regression classifier.

Evaluating the performance of the logistic regression classifier
In this section, we will approach the evaluation of our logistic regression classifier in two ways. The first way is to use the so-called confusion matrix, and the second way is to use the F1 score. To introduce the F1 score, we also need to introduce several other metrics, which will all be covered in this section.

Let's see what a confusion matrix looks like and define some terms using an example. The following is the 2-by-2 confusion matrix for the threshold-0.5 case:

                      Actually up    Actually down
    Predicted up           2               1
    Predicted down         1               2

The 2 in the top-left cell means that there are two positive cases that we successfully classified as positive; they are therefore called True Positives (TP). Correspondingly, the 1 in the bottom-left cell means that one positive case was misclassified; it is a False Negative (FN). We also have True Negatives (TN) and False Positives (FP) by similar definitions.

Note
A perfect classifier has zero false positives and zero false negatives. A false positive is also called a type 1 error, and a false negative is also called a type 2 error. As an example, if a doctor claims a man is pregnant, it is a false positive error; if the doctor says a laboring woman is not pregnant, it is a false negative error.

The recall, also called the sensitivity or the true positive rate, is defined as follows:

$$\text{recall} = \frac{TP}{TP + FN}$$

It measures the ability of the classifier to correctly identify the positive cases among all the ground-truth positives. In our stock index example, if we set the threshold to 0, we reach a recall of 1, because we indeed pick out all the positive cases; such an all-positive classifier is not acceptable, though.

The precision, or positive predictive value, is defined as follows:

$$\text{precision} = \frac{TP}{TP + FP}$$

It measures how many of the claimed positives are indeed positive. In our stock index example, setting the threshold to 1 would trivially avoid false positives, because there would be no positive predictions at all. A balance must be struck, and the F1 score is that balance: it is the harmonic mean of the precision and the recall,

$$F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

where the harmonic mean of a and b is defined as $\frac{2ab}{a + b}$.

Note on the harmonic mean
If two values are different, their harmonic mean is smaller than their geometric mean, which in turn is smaller than the familiar arithmetic mean.

We can now calculate the metrics for the preceding confusion matrix. Recall and precision are both 2 / (2 + 1) = 2/3, so the F1 score is also 2/3. If we instead pick a threshold of 0.8, the confusion matrix looks as follows:

                      Actually up    Actually down
    Predicted up           2               0
    Predicted down         1               3

The recall is still 2/3, but the precision becomes 2/2 = 1, and we reach a higher F1 score of 0.8. We obtained a better result by simply changing the threshold.
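For readers who prefer not to count cells by hand, scikit-learn's metrics module reproduces these numbers. The following is a sketch that types in the six regressed probabilities reported just below, together with the up/down labels described above (1 marks a day the index actually went up); both arrays are my own transcription of those values:

    import numpy as np
    from sklearn.metrics import precision_score, recall_score, f1_score

    probs = np.array([0.0067, 0.0474, 0.2689, 0.7311, 0.9526, 0.9933])
    y_true = np.array([0, 0, 1, 0, 1, 1])
    for threshold in (0.5, 0.8):
        y_pred = (probs >= threshold).astype(int)
        print(threshold,
              precision_score(y_true, y_pred),
              recall_score(y_true, y_pred),
              f1_score(y_true, y_pred))
    # threshold 0.5 -> precision 0.67, recall 0.67, F1 0.67
    # threshold 0.8 -> precision 1.0,  recall 0.67, F1 0.8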
However, changing the threshold doesn't change the logistic function itself. To evaluate the model itself, we need to maximize the likelihood of our observations under the regressed model. Let's take a look at the regressed probabilities:

    logistic(cal_shifted_odds(fg_index, slope, intercept))

The result looks as follows; the probabilities corresponding to the positive (up) data points are the third, fifth, and sixth values:

    array([0.00669285, 0.04742587, 0.26894142, 0.73105858, 0.95257413, 0.99330715])

In this case, the likelihood function can be defined as follows, where the index i loops over the positive data points and the index j loops over the negative ones:

$$L(\text{slope}, \text{intercept}) = \prod_{i \in \{3,5,6\}} P_i \prod_{j \in \{1,2,4\}} (1 - P_j)$$

Since we regress against the probability that the stock index goes up, we want the probabilities of the negative data points to be small, which gives us the (1 − P_j) form. In practice, we often work with the log-likelihood, because a summation is easier to handle numerically than a product:

$$\log L(\text{slope}, \text{intercept}) = \sum_{i \in \{3,5,6\}} \log P_i + \sum_{j \in \{1,2,4\}} \log(1 - P_j)$$

Let's calculate the likelihood for slope 0.1 and intercept −5 with the following code snippet:

    probs = logistic(cal_shifted_odds(fg_index, slope, intercept))
    np.prod(probs[stock_index_change >= 0]) * np.prod(1 - probs[stock_index_change < 0])

The result is about 0.065. Let's try another set of parameters:

    probs = logistic(cal_shifted_odds(fg_index, slope=0.11, intercept=-5.5))
    np.prod(probs[stock_index_change >= 0]) * np.prod(1 - probs[stock_index_change < 0])

The result is about 0.058, so our original choice of parameters is actually better.

To find the parameters that maximize the likelihood function exactly, let's use the sklearn library. The following code snippet fits a regressor on our data points:

    from sklearn.linear_model import LogisticRegression
    regressor = LogisticRegression(penalty="none", solver="newton-cg").fit(
        fg_index.reshape(-1, 1), stock_index_change >= 0)
    print("slope: ", regressor.coef_[0][0])
    print("intercept: ", regressor.intercept_[0])

The best possible slope and intercept are about 0.06 and −3.04, respectively. You can verify this by plotting the likelihood function values against a grid of slopes and intercepts; I will leave that calculation to you as an exercise.

Note on the LogisticRegression function
Note that I explicitly set the penalty to none when initializing the LogisticRegression instance. By default, sklearn adds an L2 penalty term and uses another solver (a solver is a numerical algorithm for finding the maximum) that doesn't support the no-penalty setting, so I had to change these two arguments to match our approach in this section. The newton-cg solver uses the Newton conjugate gradient algorithm; if you are interested in the details, you can refer to a numerical mathematics textbook. The last thing I would like you to pay attention to is the reshaping of the input data to comply with the API.

Building a naïve Bayes classifier from scratch

In this section, we will study one of the most classic and important classification algorithms: naïve Bayes classification. We covered Bayes' theorem several times in previous chapters, but now is a good time to revisit its form. Suppose A and B are two random events; the following relationship holds as long as P(B) > 0:

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

Some terminology to review: P(A|B) is called the posterior probability, as it is the probability of event A after knowing the outcome of event B. P(A), on the other hand, is called the prior probability, because it contains no information about event B.

Simply put, the idea of the Bayes classifier is to take the classification category variable as our A and the features (there can be many of them) as our B; we then predict the classification result from the posterior probabilities. Why the naïve Bayes classifier, then? The naïve Bayes classifier assumes that the different features are mutually independent, so that Bayes' theorem can be applied to them independently. This is a very strong assumption and likely incorrect. For example, to predict whether someone is at risk of a stroke, their obesity, smoking habits, and diet habits are all valid predictors, but they are not independent; the naïve Bayes classifier assumes they are. Surprisingly, this simplest setting works well on many occasions, such as detecting spam emails.

Note
Features can be discrete or continuous. We will only cover an example with discrete features; continuous features can be naïvely assumed to follow a
Gaussian distribution.

I created a set of sample data, as shown here, which you can find in the book's official GitHub repository. Each row represents a set of information about a person: the weight feature has three levels, the high-oil-diet and smoking features each have two, and our goal is to predict strokerisk, which has three levels. The following table shows the profiles of the 15 patients.

Figure 9.7: Stroke risk data

Let's start with the first feature, weight, and calculate P(strokerisk | weight). According to Bayes' theorem, we have the following:

$$P(\text{strokerisk}\,|\,\text{weight}) = \frac{P(\text{weight}\,|\,\text{strokerisk})\,P(\text{strokerisk})}{P(\text{weight})}$$

Let's calculate the prior probabilities first, since they will be used again and again:

$$P(\text{strokerisk}=\text{low}) = \frac{8}{15},\quad P(\text{strokerisk}=\text{middle}) = \frac{3}{15},\quad P(\text{strokerisk}=\text{high}) = \frac{4}{15}$$

For the weight feature, we have the following:

$$P(\text{weight}=\text{low}) = \frac{5}{15},\quad P(\text{weight}=\text{middle}) = \frac{7}{15},\quad P(\text{weight}=\text{high}) = \frac{3}{15}$$

Now let's build the 3-by-3 matrix of conditional probabilities P(weight | strokerisk), where the columns index the stroke risk, the rows index the weight, and the cells hold the conditional probabilities. To understand it with one example, one of the zero entries means that P(weight = low | strokerisk = high) = 0: if you count through the data, among the four high-risk persons, none of them has a low weight.

We can do the same thing for the other two features; the last one, smoking, is binary. OK, too many numbers; we will use Python to calculate them later in this chapter, but for now let's look at an example. What is the best stroke risk prediction for a person who has middling weight and a high-oil diet, but no smoking habit? We need to determine which of the following values is the highest. To simplify the expressions, I will use abbreviations: st stands for strokerisk, w for weight, oil for the high-oil diet, and sm for smoking.

$$P(\text{st}=\text{high}\,|\,\text{w}=\text{middle}, \text{oil}=\text{yes}, \text{sm}=\text{no})$$ or
$$P(\text{st}=\text{middle}\,|\,\text{w}=\text{middle}, \text{oil}=\text{yes}, \text{sm}=\text{no})$$ or
$$P(\text{st}=\text{low}\,|\,\text{w}=\text{middle}, \text{oil}=\text{yes}, \text{sm}=\text{no})$$

In the following, I will work through the high-stroke-risk case. With Bayes' theorem, we have the following:

$$P(\text{st}=\text{high}\,|\,\text{w}, \text{oil}, \text{sm}) = \frac{P(\text{st}=\text{high})\,P(\text{w}=\text{middle}, \text{oil}=\text{yes}, \text{sm}=\text{no}\,|\,\text{st}=\text{high})}{P(\text{w}=\text{middle}, \text{oil}=\text{yes}, \text{sm}=\text{no})}$$

Here is an interesting observation: to compare the classes quantitatively, we only care about the numerator, because the denominator is the same for all classes. The numerator is nothing but the joint probability of the features and the category variable.

Next, we use the assumption of independent features to decompose the numerator. The joint probability factors as the prior times the conditional,

$$P(\text{st}=\text{high}, \text{w}=\text{middle}, \text{oil}=\text{yes}, \text{sm}=\text{no}) = P(\text{st}=\text{high})\,P(\text{w}=\text{middle}, \text{oil}=\text{yes}, \text{sm}=\text{no}\,|\,\text{st}=\text{high})$$

and we can further reduce the expression as follows:

$$P(\text{st}=\text{high})\,P(\text{w}=\text{middle}\,|\,\text{st}=\text{high})\,P(\text{oil}=\text{yes}\,|\,\text{st}=\text{high})\,P(\text{sm}=\text{no}\,|\,\text{st}=\text{high})$$

Thus, the comparison of the posterior distributions boils down to comparing this expression across the three classes.

Note
It is always worth checking rules like this against intuition, to see whether they make sense. The preceding expression says that we should consider the prior probability and the conditional probabilities of the specific feature values.

Let's get some real numbers. For the strokerisk = high case, the expression gives us the following; the terms are in the same order as above, and you can check them against the preceding counts:
$$\frac{4}{15} \cdot \frac{1}{4} \cdot \frac{2}{4} \cdot \frac{0}{4} = 0$$

The good habit of not smoking eliminates the possibility that this person is in the high-risk group. How about strokerisk = middle? The expression is as follows:

$$\frac{3}{15} \cdot \frac{2}{3} \cdot \frac{2}{3} \cdot \frac{2}{3} \approx 0.059$$

Note that this value is only meaningful when compared with the other options, since we omitted the denominator of the posterior probability. How about strokerisk = low? The expression is as follows:

$$\frac{8}{15} \cdot \frac{4}{8} \cdot \frac{2}{8} \cdot \frac{6}{8} = 0.05$$

The scores can then be normalized so that they sum to one. According to our Bayes classifier, the person does not have a high risk of getting a stroke; middle and low risk remain, with middle slightly more likely than low (a ratio of roughly 1.2 to 1 after normalization).

Next, let's write code to automate this. The following code snippet builds the required prior probabilities for the category variable and the conditional probabilities for the features. It takes a pandas DataFrame and the corresponding column names as input:

    from collections import Counter

    def build_probabilities(df, feature_columns: list, category_variable: str):
        prior_probability = Counter(df[category_variable])
        conditional_probabilities = {}
        for key in prior_probability:
            conditional_probabilities[key] = {}
            for feature in feature_columns:
                feature_kinds = set(np.unique(df[feature]))
                feature_dict = Counter(df[df[category_variable] == key][feature])
                for possible_feature in feature_kinds:
                    if possible_feature not in feature_dict:
                        feature_dict[possible_feature] = 0
                total = sum(feature_dict.values())
                for feature_level in feature_dict:
                    feature_dict[feature_level] /= total
                conditional_probabilities[key][feature] = feature_dict
        return prior_probability, conditional_probabilities

Let's see what we get by calling this function on our stroke risk dataset:

    prior_prob, conditional_prob = build_probabilities(
        stroke_risk,
        feature_columns=["weight", "highoildiet", "smoking"],
        category_variable="strokerisk")

I used the pprint module to print the conditional probabilities:

    from pprint import pprint
    pprint(conditional_prob)

The result is as follows:

    {'high': {'highoildiet': Counter({'yes': 0.5, 'no': 0.5}),
              'smoking': Counter({'yes': 1.0, 'no': 0.0}),
              'weight': Counter({'high': 0.75, 'middle': 0.25, 'low': 0.0})},
     'low': {'highoildiet': Counter({'no': 0.75, 'yes': 0.25}),
             'smoking': Counter({'no': 0.75, 'yes': 0.25}),
             'weight': Counter({'low': 0.5, 'middle': 0.5, 'high': 0.0})},
     'middle': {'highoildiet': Counter({'yes': 0.6666666666666666, 'no': 0.3333333333333333}),
                'smoking': Counter({'no': 0.6666666666666666, 'yes': 0.3333333333333333}),
                'weight': Counter({'middle': 0.6666666666666666, 'low': 0.3333333333333333, 'high': 0.0})}}

Take the value 0.75 under 'high' and 'weight': the way to interpret it is to read the outer dictionary key as the event we condition on and the inner key as the event itself. You can verify that this agrees with our earlier counting; it corresponds to the following conditional probability:

$$P(\text{weight}=\text{high}\,|\,\text{strokerisk}=\text{high}) = 0.75$$

Next, let's write another function to make the predictions, displayed in the following code block:

    def predict(prior_prob, conditional_prob, feature_values: dict):
        probs = {}
        total = sum(prior_prob.values())
        for key in prior_prob:
            probs[key] = prior_prob[key] / total
        for key in probs:
            posterior_dict = conditional_prob[key]
            for feature_name, feature_level in feature_values.items():
                probs[key] *= posterior_dict[feature_name][feature_level]
        total = sum(probs.values())
        if total == 0:
            print("Undetermined")
        else:
            for key in probs:
                probs[key] /= total
            return probs

Note that it is entirely possible for all the probabilities to be 0 in a naïve Bayes classifier. This is usually due to an ill-posed or insufficient dataset. I will show you a couple of examples to demonstrate this.
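Before those examples, here is a quick check that the two functions reproduce the hand calculation above for the person with middling weight, a high-oil diet, and no smoking habit. This is a sketch that assumes prior_prob and conditional_prob were built from the stroke risk DataFrame exactly as shown:

    print(predict(prior_prob, conditional_prob,
                  {"weight": "middle", "highoildiet": "yes", "smoking": "no"}))
    # roughly {'low': 0.46, 'middle': 0.54, 'high': 0.0}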
The first example is as follows:

    predict(prior_prob, conditional_prob,
            {"weight": "middle", "highoildiet": "no", "smoking": "yes"})

The result is shown next; it indicates that the person is probably in the low-risk group:

    {'low': 0.5094339622641509, 'high': 0.33962264150943394, 'middle': 0.15094339622641506}

The second example is as follows:

    predict(prior_prob, conditional_prob,
            {"weight": "high", "highoildiet": "no", "smoking": "no"})

The result is undetermined. If you check the conditional probabilities, you will find that the contradiction between the features and the insufficiency of the dataset lead to all zeros in the posterior scores. Verifying this is left to you as an exercise.

In the next section, let's look at another important concept in machine learning, especially for classification tasks: cross-validation.

Underfitting, overfitting, and cross-validation

What is cross-validation, and why is it needed? To talk about cross-validation, we must first formally introduce two other important concepts: underfitting and overfitting.

To obtain a good model, for either a regression problem or a classification problem, we must fit the model to the data. The fitting process is usually referred to as training. During training, the model captures characteristics of the data, establishes numerical rules, and applies formulas or expressions.

Note
The training process establishes a mapping between the data and the output (classification or regression) we want. For example, when a baby learns how to distinguish an apple from a lemon, they may learn to associate the colors of those fruits with their taste. They will therefore make the right decision and grab a sweet red apple rather than a sour yellow lemon.

Everything we have discussed so far is about the training technique. Putting a model to work on new data, on the other hand, is called testing. Here is an ambiguity that people often gloss over: in principle, we should have no expectation about the model's output on the testing dataset, because producing those predictions is the whole point of the model. However, you may also hear the term testing set used within a training process; there, the word testing actually means an evaluation of the trained model. Strictly speaking, a testing set is reserved for testing after the model is built; in the looser usage, a model is trained on a training set and then applied to a so-called testing set, for which we know the ground truth, to benchmark the model's performance. So be aware of the two meanings of testing. In the following content, I will refer to testing within the training process.

For example, suppose the baby mentioned previously has learned that red means sweet, and one day the baby sees a red pepper for the first time and thinks it is sweet. What will happen? The baby's color-to-sweetness model will likely fail the test on this testing set (a red pepper). What the baby learned is an overfitted model: an overfitted model learns too much about the characteristics of the training data (for the baby, the apples), so it cannot easily generalize to unseen data.

How about an underfitted model? An underfitted model can be constructed this way: if the baby also learned that density is a factor indicating whether a fruit or vegetable is sweet, the baby would likely avoid the red pepper. Compared to this model involving density, the baby's simple color-only model is underfitting.
An underfitted model doesn't learn enough from the training data; it can still be improved to perform better on the training data without damaging its generalization capacity.

As you may have guessed, overfitting and underfitting are two regimes without a clear boundary between them. Here is a vivid example. Suppose we have the following data and we would like to fit it with a polynomial regression model. A polynomial regression model uses a polynomial rather than a straight line to fit the data, and the degree of the polynomial is a parameter we must choose. Let's see which one we should choose. The following code snippet plots the artificial data:

    plt.figure(figsize=(10, 6))
    x_coor = [1, 2, 3, 4, 5, 6, 7]
    y_coor = [3, 8, 5, 7, 10, 9, 15]
    plt.scatter(x_coor, y_coor)

The result looks as follows.

Figure 9.8: Artificial data for polynomial fitting

Now let me use 1st-, 3rd-, and 5th-order polynomials to fit the data points. The two functions I use are numpy.polyfit and numpy.polyval. The following code snippet plots the fits; the styles list simply holds a distinct line style for each degree:

    styles = ["-", "--", ":"]
    x = np.linspace(1, 7, 20)
    plt.figure(figsize=(10, 6))
    for idx, degree in enumerate(range(1, 6, 2)):
        coef = np.polyfit(x_coor, y_coor, degree)
        y = np.polyval(coef, x)
        plt.plot(x, y, linewidth=4, linestyle=styles[idx],
                 label="degree {}".format(str(degree)))
    plt.scatter(x_coor, y_coor, s=400, label="Original Data", marker="o")
    plt.legend()

The result looks as in the following figure; note that I made the original data points exceptionally large.

Figure 9.9: Polynomial fitting of artificial data points

The high-degree fit almost overlaps every data point, while the linear regression line passes in between. However, a degree-5 polynomial can in principle fit any six points in the plane exactly, and we merely have seven points; this is clearly overfitting. Let's enlarge our view a little. In the next figure, I slightly modified the range of the x variable from (1, 7) to (0, 8); the modification is easy, so the code is omitted.

Figure 9.10: Fitting polynomials in an extended range

Wow. See the penalty we pay for fitting our training data so closely: the higher-order polynomial just goes wild. Imagine we had a testing data point between 0 and 1; what a counterintuitive prediction we would get.

The question is, how do we prevent overfitting? We have already seen one tool: regularization. The other tool is called cross-validation. Cross-validation requires another dataset, called the validation set, to validate the model before the model is applied to the testing set. Cross-validation helps reduce overfitting and detect, early on, bias in the model learned from the training set. For example, the most common scheme, k-fold cross-validation, splits the training set into k parts and leaves one part out of training to act as the validation set; after training, that validation set is used to evaluate the model's performance, and the same procedure is iterated k times. Bias can be detected early if the model has learned too much from the limited training set.

Note
Cross-validation can also be used to select the parameters of a model.
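To make the k-fold bookkeeping concrete, here is a minimal, self-contained sketch of how scikit-learn's KFold splits a toy dataset of six samples into three train/validation partitions; the numbers are arbitrary and serve only as placeholders:

    import numpy as np
    from sklearn.model_selection import KFold

    data = np.arange(6)
    for train_idx, val_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(data):
        print("train:", data[train_idx], "validate:", data[val_idx])

Each sample lands in exactly one validation fold, so every data point plays the role of unseen data once.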
because there is an ordered structure in the categories For example I can map low weight to 1 middle weight to 2 and high weight to 3 The logistic regression classifier will automatically learn the parameters to distinguish them Another point is that the target stroke risk will now have three choices rather than two This multiclass classification is also achievable by training more than two logistic regression classifiers and using them together to partition the outcome space The code that does the categorytonumeric mapping is omitted due to space limitation you can find it in the Jupyter notebook The code that invokes logistic regression crossvalidation reads as follows from sklearnlinearmodel import LogisticRegressionCV X strokeriskweighthighoildietsmoking y strokeriskstrokerisk classifier LogisticRegressionCVcv3randomstate2020multi classautofitXy Note that the kfold cross validation has a k value of 3 We shouldnt choose a k value larger than the total number of records If we choose 15 which is exactly the number of records it is called leaveoneout crossvalidation You can obtain some parameters of the crossvalidated classifier by calling classifiergetparams The result reads as follows bound method BaseEstimatorgetparams of LogisticRegressionCVCs10 classweightNone cv3 dualFalse fitinterceptTrue interceptscaling10 l1ratiosNone maxiter100 multiclassauto n jobsNone penaltyl2 randomstate2020 refitTrue scoringNone solverlbfgs tol00001 verbose0 272 Statistics for Classification Note that a regularization term is automatically introduced because of the Cs parameter For more details you can refer to the API of the function Now lets call the predictprob function to predict the probabilities Lets say the person is slightly overweight so they have a weight value of 15 Recall that 1 means middle and 2 means high for weight This person also eats slightly more fatty foods but smokes a lot So they have 05 and 2 on another two features respectively The code reads as follows classifierpredictprobanparray15052 The results read as follows array020456731 015382072 064161197 So this person likely falls into the high stroke risk group Note that this model is very coarse due to the categorical variabletonumerical variable conversion but it gives you the capability to estimate on data which is beyond the previous observations Summary In this chapter we thoroughly studied the logistic regression classifier and corresponding classification task concepts Then we built a naïve Bayes classifier from scratch In the last part of this chapter we discussed the concepts of underfitting and overfitting and used sklearn to use crossvalidation functions In the next chapter we are going to study another big branch of machine learning models treebased models 10 Statistics for TreeBased Methods In the previous chapter we covered some important concepts in classification models We also built a naïve Bayes classifier from scratch which is very important because it requires you to understand every aspect of the details In this chapter we are going to dive into another family of statistical models that are also widely used in statistical analysis as well as machine learning treebased models Treebased models can be used for both classification tasks and regression tasks By the end of this chapter you will have achieved the following Gained an overview of treebased classification Understood the details of classification tree building Understood the mechanisms of regression trees Know how to use the scikitlearn library to build and 
regularize a treebased method Lets get started All the code snippets used in this chapter can be found in the official GitHub repository here httpsgithubcomPacktPublishingEssential StatisticsforNonSTEMDataAnalysts 274 Statistics for TreeBased Methods Overviewing treebased methods for classification tasks Treebased methods have two major varieties classification trees and regression trees A classification tree predicts categorical outcomes from a finite set of possibilities while a regression tree predicts numerical outcomes Lets first look at the classification tree especially the quality that makes it more popular and easy to use compared to other classification methods such as the simple logistic regression classifier and the naïve Bayes classifier A classification tree creates a set of rules and partitions the data into various subspaces in the feature space or feature domain in an optimal way First question what is a feature space Lets take our stroke risk data that we used in Chapter 9 Statistics for Classification as sample data Heres the dataset from the previous chapter for your reference Each row is a profile for a patient that records their weight diet habit smoking habit and corresponding stroke risk level Figure 101 Stroke risk data Overviewing treebased methods for classification tasks 275 We have three features for each record If we only look at the weight feature it can take three different levels low middle and high Imagine in a onedimensional line representing weight that there are only three discrete points a value can take namely the three levels This is a onedimensional feature space or feature domain On the other hand highoil diet and smoking habit are other twofeature dimensions with two possibilities Therefore a person can be on one of 12 322 combinations of all features in this threedimensional feature space A classification tree is built with rules to map these 12 points in the feature space to the outcome space which has three possible outcomes Each rule is a yesno question and the answer will be nonambiguous so each data record has a certain path to go down the tree The following is an example of such a classification tree Figure 102 An example of a classification tree for stroke risk data Lets look at one example to better understand the tree Suppose you are a guy who smokes but doesnt have a highoil diet Then starting at the top of the tree you will first go down to the left branch and then go right to the Middle stroke risk box The decision tree classifies you as a patient with middle stroke risk Now is a good time to introduce some terminology to mathematically describe a decision tree rather than using casual terms such as box A tree is usually drawn upside down but this is a good thing as you follow down a chain of decisions to reach the final status Here is some few important terminology that you need to be aware of Root node A root node is the only nodeblock that only has outgoing arrows In the tree shown in the previous figure it is the one at the top with the Smoking text The root node contains all the records and they havent been divided into subcategories which corresponds to partitions of feature space 276 Statistics for TreeBased Methods Decision node A decision node is one node with both incoming and outgoing arrows It splits the data feed into two groups For example the two nodes on the second level of the tree High oil diet and High weight are decision nodes The one on the left splits the smoking group further into the smoking and highoil diet 
group and the smoking and nonhighoil diet group The one on the right splits the nonsmoking group further into the nonsmoking and high weight and nonsmoking and nonhigh weight groups Leaf node A leaf node or a leaf is a node with only incoming arrows A leaf node represents the final terminal of a classification process where no further splitting is needed or allowed For example the node at the bottom left is a leaf that indicates that people who smoke and have a highoil diet are classified to have a high risk of stroke It is not necessary for a leaf to only contain pure results In this case it is alright to have only low stroke risk and high stroke risk people in the leaf What we optimized is the pureness of the classes in the node as the goal of classification is to reach unambiguous labeling The label for the records in a leaf node is the majority label If there is a tie a common solution is to pick a random label of the tied candidates to make it the majority label Parent node and children nodes The node at the start of an arrow is the parent node of the nodes at the end of the arrows which are called the child nodes A node can simultaneously be a parent node and a child node except the root node and the leaf The process of determining which feature or criteria to use to generate children nodes is called splitting It is common practice to do binary splitting which means a parent node will have two child nodes Depth and pruning The depth of a decision tree is defined as the length of the chain from the root node to the furthest leaf In the stroke risk case the depth is 2 It is not necessary for a decision tree to be balanced One branch of the tree can have more depth than another branch if accuracy requires The operation of removing children nodes including grandchild nodes and more is called pruning just like pruning a biological tree From now on we will use the rigorous terms we just learned to describe a decision tree Note One of the benefits of a decision tree is its universality The features dont necessarily take discrete values they can also take continuous numerical values For example if weight is replaced with continuous numerical values the splitting on high weight or not will be replaced by a node with criteria such as weight 200 pounds Overviewing treebased methods for classification tasks 277 Now lets go over the advantages of decision trees The biggest advantage of decision trees is that they are easy to understand For a person without any statistics or machine learning background decision trees are the easiest classification algorithms to understand The decision tree is not sensitive to data preprocessing and data incompletion For many machine learning algorithms data preprocessing is vital For example the units of a feature in grams or kilograms will influence the coefficient values of logistic regression However decision trees are not sensitive to data preprocessing The selection of the criteria will adjust automatically when the scale of the original data changes but the splitting results will remain unchanged If we apply logistic regression to the stroke risk data a missing value of a feature will break the algorithm However decision trees are more robust to achieve relatively stable results For example if a person who doesnt smoke misses the weight data they can be classified into the lowrisk or middlerisk groups randomly of course there are better ways to decide such as selecting the mode of records similar to it but they wont be classified into highrisk groups This 
result is sometimes good enough for practical use Explainability When a decision tree is trained you are not only getting a model but you also get a set of rules that you can explain to your boss or supervisor This is also why I love the decision tree the most The importance of features can also be extracted For example in general the closer the feature is to the root the more important the feature is in the model In the stroke risk example smoking is the root node that enjoys the highest feature importance We will talk about how the positions of the features are decided in the next chapter Now lets also talk about a few disadvantages of the decision tree It is easy to overfit Without control or penalization decision trees can be very complex How Imagine that unless there are two records with exactly the same features but different outcome variables the decision tree can actually build one leaf node for every record to reach 100 accuracy on the training set However the model will very likely not be generalized to another dataset Pruning is a common approach to remove overcomplex subbranches There are also constraints on the splitting step which we will discuss soon 278 Statistics for TreeBased Methods The greedy approach doesnt necessarily give the best model A single decision tree is built by greedily selecting the best splitting feature sequentially As a combination problem with an exponential number of possibilities the greedy approach doesnt necessarily give the best model In most cases this isnt a problem In some cases a small change in the training dataset might generate a completely different decision tree and give a different set of rules Make sure you doublecheck it before presenting it to your boss Note To understand why building a decision tree involves selecting rules from a combination of choices lets build a decision tree with a depth of 3 trained on a dataset of three continuous variable features We have three decision nodes including the root node to generate four leaves Each decision node can choose from three features for splitting therefore resulting in a total of 27 possibilities Yes one child node can choose the same feature as its parent Imagine we have four features then the total number of choices becomes 64 If the depth of the tree increases by 1 then we add four more decision nodes Therefore the total number of splitting feature choices is 16384 which is huge for such a fourfeature dataset Most trees will obviously be useless but the greedy approach doesnt guarantee the generation of the best decision tree We have covered the terminology advantages and disadvantages of decision trees In the next section we will dive deeper into decision trees specifically how branches of a tree are grown and pruned Growing and pruning a classification tree Lets start by examining the dataset one more time We will first simplify our problem to a binary case so that the demonstration of decision tree growing is simpler Lets examine Figure 101 again For the purpose of this demonstration I will just group the middlerisk and highrisk patients into the highrisk group This way the classification problem becomes a binary classification problem which is easier to explain After going through this section you can try the exercises on the original threecategory problem for practice The following code snippet generates the new dataset that groups middlerisk and highrisk patients together dfstrokerisk dfstrokeriskapplylambda x low if x low else high Growing and pruning a classification tree 279 
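For reference, here is a runnable version of the grouping step just shown. It is a minimal sketch, not the author's verbatim code: it assumes the data has already been loaded into a pandas DataFrame named df with a stroke_risk column holding the values low, middle, and high; the variable and column names are assumptions based on the surrounding snippets.

import pandas as pd

# Collapse the three risk levels into two: anything that is not 'low' becomes 'high'.
df['stroke_risk'] = df['stroke_risk'].apply(lambda x: 'low' if x == 'low' else 'high')

# Sanity check: count the records in each of the two classes.
print(df['stroke_risk'].value_counts())

With the data shown in the next figure, the counts come out to eight low-risk and seven high-risk records, which is the starting point for the impurity calculations in the next section.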
The new dataset will then look as follows:

Figure 10.3: Binary stroke risk data

Now let's think about the root node. Which feature, and what kind of criterion, should we choose to generate two children nodes from the root node, which contains all the records (data points)? We will explore this topic in the next section.

Understanding how splitting works

The principle of splitting is that splitting on a feature must get us closer to a completely correct classification. We need a numerical metric to compare different choices of splitting features. The goal of classification is to sort records into pure states, such that each leaf contains records that are as pure as possible. Therefore, pureness (or impureness) becomes a natural choice of metric.

The most common metric is called Gini impurity. It measures how impure a set of data is. For a binary-class set of data with class labels A and B, the definition of Gini impurity is the following:

Gini impurity = 1 − P(A)² − P(B)²

If the set contains only A or only B, the Gini impurity is 0. The maximum impurity is 0.5, reached when half of the records are A and the other half are B. For a three-class dataset, the minimum is still 0, but the maximum becomes 2/3.

Note
Gini impurity is named after the Italian demographer and statistician Corrado Gini. Another well-known index named after him is the Gini index, which measures the inequality of wealth distribution in a society.

Let's see how this unfolds at the root node, before any splitting. The Gini impurity is 1 − (8/15)² − (7/15)², because we have eight low-risk records and seven high-risk records. The value is about 0.498, close to the highest possible impurity.

After splitting by one criterion, we have two children nodes. The new, hopefully lower, impurity is obtained by calculating the weighted Gini impurity of the two children nodes. First, let's take the high-oil diet feature as an example.

Let's examine the partition of the high-oil diet group. The following code snippet does the counting:

Counter(df[df['high_oil_diet'] == 'yes']['stroke_risk'])

There is a total of six records, with two low-risk records and four high-risk records. Therefore, the impurity of the high-oil diet group is 1 − (1/3)² − (2/3)² = 4/9.

Meanwhile, we can calculate the non-high-oil diet group's statistics. Let's select and count them using the following code snippet:

Counter(df[df['high_oil_diet'] == 'no']['stroke_risk'])

There is a total of nine records, with six low-risk records and three high-risk records. Note that the proportions are the same as in the high-oil diet group, just with the classes exchanged, so the Gini impurity is also 4/9. The weighted Gini impurity remains 4/9, because (6/15)(4/9) + (9/15)(4/9) = 4/9. It is about 0.444.

So what do we gain from such a split? We have reduced the Gini impurity from 0.498 to 0.444, which is only a slight decrease, but better than nothing. Next, let's examine the smoking feature.

By the same token, let's first check the smokers' statistics. The following code snippet does the counting:

Counter(df[df['smoking'] == 'yes']['stroke_risk'])

There is a total of seven smoking cases. Five of them are high stroke risk and two of them are low stroke risk. The Gini impurity is therefore 1 − (5/7)² − (2/7)² ≈ 0.408. Let's check the non-smokers:

Counter(df[df['smoking'] == 'no']['stroke_risk'])

There are eight non-smokers; six of them are low stroke risk and two of them are high stroke risk. Therefore, the Gini impurity is 1 − (6/8)² − (2/8)² = 0.375. The weighted impurity is about (7/15)(0.408) + (8/15)(0.375) ≈ 0.390. This is a 0.108 decrease from the original, unsplit impurity, and it is better than the decrease we got from splitting on the high-oil diet.
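The impurity arithmetic above is easy to script. The following is a minimal sketch, not the author's original code: it assumes the binarized DataFrame is still named df and that the columns are called weight, high_oil_diet, smoking, and stroke_risk (column names are assumptions, and the helper names gini and weighted_gini are mine). Applied to the data in Figure 10.3, it should reproduce the values derived above.

from collections import Counter

def gini(labels):
    # Gini impurity of a collection of class labels.
    counts = Counter(labels)
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_gini(df, feature, value, target='stroke_risk'):
    # Weighted Gini impurity after splitting on feature == value.
    left = df[df[feature] == value][target]
    right = df[df[feature] != value][target]
    n = len(df)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(df['stroke_risk']))                    # about 0.498 at the root
print(weighted_gini(df, 'high_oil_diet', 'yes'))  # about 0.444
print(weighted_gini(df, 'smoking', 'yes'))        # about 0.390

The same helper can be reused for the candidate rules on the weight feature, which are evaluated next.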
I will omit the calculation for the other feature, weight, but I will list the results for you in the following table. Note that the weight feature has three levels, so there can be multiple rules for splitting on this feature; here I list all of them. In the yes and no group statistics, I list the number of high-stroke-risk records, the number of low-risk records, and the Gini impurity value, separated by commas.

Figure 10.4: The Gini impurity evaluation table for different splitting features at the root node

Note that I highlighted the Gini impurity for the high-weight group and the weighted Gini impurity for the last splitting choice. All high-weight patients have a high stroke risk, and this drives the weighted impurity down to 0.356, the lowest of all possible splitting rules. Therefore, we choose the last rule to build our decision tree. After the first split, the decision tree looks like the following:

Figure 10.5: The decision tree after the first splitting

Note that the left branch now contains a pure node, which becomes a leaf. Therefore, our next step only focuses on the right branch. We naturally have an imbalanced tree now.

We now have four choices for splitting the remaining 12 records. First, I will select these 12 records with the following one-line code snippet:

df_right = df[df['weight'] != 'high']

The result looks as follows. The Gini impurity of the right splitting node is 0.444, as calculated previously. This will become our new baseline.

Figure 10.6: The low-weight and middle-weight group

As we did earlier, let's build a table to compare the different splitting choices for the splitting node on the right. The ordering of the numbers is the same as in the previous table.

Figure 10.7: The Gini impurity evaluation table for different splitting features at the right splitting node

We essentially only have three choices, because the two splitting rules on the weight feature are mirrors of each other. Now we have a tie. We can randomly select one criterion to build the tree further. This is one reason why decision trees don't theoretically generate the best results.

Note
An intuitive way to solve this issue is to build both possibilities (and even more possible trees), which violates the greedy approach, and let them vote on the prediction results. This is a common method to build a more stable model, or an ensemble of models. We will cover related techniques in the next chapter.

Let's say I choose the high-oil diet as the criterion. The tree now looks like the following:

Figure 10.8: The decision tree after the second splitting

Now let's look at the two newly generated nodes. The first one, at a depth of 2, contains two high-stroke-risk records and two low-stroke-risk records. They don't have a high weight but do have a high-oil diet. Let's check out their profile with this line of code:

df_right[df_right['high_oil_diet'] == 'yes']

The result looks like the following:

Figure 10.9: Records classified into the first node at a depth of 2

Note that the low-weight category contains one low-stroke-risk record and one high-stroke-risk record. The same situation happens with the middle-weight category. This makes the decision tree incapable of any further useful split: no split on any feature will decrease the Gini impurity. Therefore, we can stop here for this node.

Note
Well, what if we want to continue improving the classification results? As you just discovered, there is no way that the decision tree can classify these four records, and no
other machine learning method can do it either The problem is in the data not in the model There are two main approaches to solve this issue The first option is to try to obtain more data With more data we may find that low weight is positively correlated with low stroke risk and further splitting on the weight feature might benefit decreasing the Gini impurity Obtaining more data is always better because your training model gets to see more data which therefore reduces possible bias Another option is to introduce more features This essentially expands the feature space by more dimensions For example blood pressure might be another useful feature that might help us further increase the accuracy of the decision tree Growing and pruning a classification tree 285 Now lets look at the second node at depth 2 The records classified into this node are the following given by the dfrightdfrighthighoildietyes code Figure 1010 Records classified into the second node at a depth of 2 Note that only two high stroke risk records are in this node If we stop here the Gini impurity is 1 0252 0752 0375 which is quite a low value If we further split on the smoking feature note that out of all the nonsmokers four of them have a low stroke risk Half of the smokers have a high stroke risk and the other half have a low stroke risk This will give us a weighted Gini impurity of 025 if were splitting on smoking If we further split on the weight feature all the lowweight patients are at low stroke risk Two out of five middleweight records are at high risk This will give us a weighted Gini impurity of 5 8 1 2 5 2 3 5 2 03 which is also not bad 286 Statistics for TreeBased Methods For the two cases the final decision trees look as follows The following decision tree has smoking as the last splitting feature Figure 1011 Final decision tree version 1 Growing and pruning a classification tree 287 The other choice is splitting on weight again at the second node at depth 2 The following tree will be obtained Figure 1012 Final decision tree version 2 Now we need to make some hard choices to decide the final shape of our decision trees Evaluating decision tree performance In this section lets evaluate the performance of the decision tree classifiers If we stop at depth 2 we have the following confusion matrix Note that for the unclassifiable first node at depth 2 we can randomly assign it a label Here I assign it as high stroke risk The performance of our classifier can be summarized in the following table concisely Figure 1013 Confusion matrix of a decision tree of depth 2 288 Statistics for TreeBased Methods Generally we identify high risk as positive so the precision recall and F1 score are all 5 7 If you have forgotten these concepts you can review previous chapters If we dont stop at a depth of 2 the two finer decisions trees will have the following confusion matrices Again we assign the unclassifiable first node at depth 2 the label high stroke risk However the first node at depth 3 is also unclassifiable because it contains equal high stroke risk and low stroke risk records If they are classified as lowrisk ones then we essentially obtain the same result as the depthof2 one Therefore we assign the first leaf node at depth 3 a high stroke risk value The new confusion matrix will look as follows Figure 1014 Confusion matrix of a decision tree of depth 3 version 1 Note that we will have perfect recall but the precision will be just slightly better than a random guess 7 12 The F1 score is 14 19 Next lets check our final 
version If we split with weight the corresponding confusion matrix looks as shown in the following table Figure 1015 Confusion matrix of a decision tree of depth 3 version 2 The precision recall and F1 score will be identical to the depth 2 decision tree In real life we usually prefer the simplest model possible if it is as good or almost as good as the complicated ones Although the first depth 3 decision tree has a better F1 score it also introduces one more unclassifiable node and one more rule The second depth 3 decision tree does no better than the depth 2 one To constrain the complexity of the decision tree there are usually three methods Constrain the depth of the tree This is probably the most direct way of constraining the complexity of the decision tree Constrain the lower bound of the number of records classified into a node For example if after splitting one child node will only contain very few data points then it is likely not a good splitting Exploring regression tree 289 Constrain the lower bound of information gain In our case the information gain means lower Gini impurity For example if we set a criterion that each splitting must lower the information gain by 01 then the splitting will likely stop soon therefore confining the depth of the decision tree We will see algorithmic examples on a more complex dataset later in this chapter Note When the number of records in a splitting node is small the Gini impurity reduction is no longer as representative as before It is the same idea as in statistical significance The larger the sample size is the more confident we are about the derived statistics You may also hear the term size of the decision tree Usually the size is not the same as the depth The size refers to the total number of nodes in a decision tree For a symmetric decision tree the relationship is exponential Exploring regression tree The regression tree is very similar to a classification tree A regression tree takes numerical features as input and predicts another numerical variable Note It is perfectly fine to have mixtype features for example some of them are discrete and some of them are continuous We wont cover these examples due to space limitations but they are straightforward There are two very important visible differences The output is not discrete labels but rather numerical values The splitting rules are not similar to yesorno questions They are usually inequalities for values of certain features In this section we will just use a onefeature dataset to build a regression tree that the logistic regression classifier wont be able to classify I created an artificial dataset with the following code snippet def price2revenueprice if price 85 return 70 absprice 75 elif price 95 290 Statistics for TreeBased Methods return 10 80 else return 80 105 price prices nplinspace801008 revenue nparrayprice2revenueprice for price in prices pltrcParamsupdatefontsize 22 pltfigurefigsize108 pltscatterpricesrevenues300 pltxlabelprice pltylabeltotal revenue plttitlePrice versus Revenue Lets say we want to investigate the relationship between the price of an item and its total revenue a day If the price is set too low the revenue will be lower because the price is low If the price is too high the revenue will also be low due to fewer amounts of the item being sold The DataFrame looks as follows Figure 1016 Price and total revenue DataFrame Exploring regression tree 291 The following visualization makes this scenario clearer Figure 1017 The relationship between price and revenue 
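Before walking through the numbers, here is a minimal sketch of the split search described over the next few pages: for every candidate threshold, predict each side by its mean revenue, add up the squared residuals, and keep the threshold with the smallest total. It assumes the prices and revenue NumPy arrays built above; the helper names total_ssr and best_split are mine, not part of the original code.

import numpy as np

def total_ssr(y_left, y_right):
    # Total sum of squared residuals when each side is predicted by its own mean.
    ssr = 0.0
    for y in (y_left, y_right):
        if len(y) > 0:
            ssr += np.sum((y - np.mean(y)) ** 2)
    return ssr

def best_split(x, y):
    # Scan thresholds halfway between consecutive x values and return the best one.
    order = np.argsort(x)
    x, y = x[order], y[order]
    thresholds = (x[:-1] + x[1:]) / 2
    scores = [total_ssr(y[x <= t], y[x > t]) for t in thresholds]
    best = int(np.argmin(scores))
    return thresholds[best], scores[best]

# threshold, ssr = best_split(prices, revenue)
# Per the discussion that follows, the best threshold for the root node falls
# between the second and third price points, at around 85.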
The relationship between price and revenue is clearly nonlinear and logistic regression wont be able to classify it A linear regression will likely become a horizontal line There are clearly three regions where different relationships between revenue and price apply Now lets build a regression tree Like the deduction of Gini impurity in the classification tree we need a metric to measure the benefit of splitting A natural choice is still the sum of squared residuals Lets start from the root node We have eight data points so there are essentially seven intervals where we can put the first splitting criteria into For example we can split at price 85 Then we use the average revenue on both sides to be our prediction as follows The code snippet for the visualization reads as follows pltrcParamsupdatefontsize 22 pltfigurefigsize128 pltscatterpricesrevenues300 pltxlabelprice pltylabeltotal revenue plttitlePrice versus Revenue threshold 85 numleft sumprices threshold aveleft npmeanrevenueprices threshold 292 Statistics for TreeBased Methods numright sumprices threshold averight npmeanrevenueprices threshold pltaxvlinethresholdcolorredlinewidth6 pltplotpricesprices threshold aveleft for in rangenumleft linewidth6linestylecorange label average revenue on the left half pltplotpricesprices threshold averight for in rangenumright linewidth6linestylecgreen labelaverage revenue on the right half pltrcParamsupdatefontsize 16 pltlegendloc040 In the following figure the dotted line represents the average price for the scenario when the price is lower than 850 The dashed line represents the average price for the scenario when the price is higher than 850 Figure 1018 Splitting at price 850 Exploring regression tree 293 If we stop here the regression tree will have a depth of 1 and looks like the following Figure 1019 A regression tree of depth 1 However we havent tested the other six splitting choices Any splitting choice will have a corresponding sum of squared residuals and we would like to go over all the possibilities to determine the splitting that gives the minimal sum of squared residuals Note Unlike Gini impurity where we need to take a weighted average the total sum of squared residuals is a simple summation Gini impurity is not additive because it only takes a value between 0 and 1 Squared residuals are additive because each residual corresponds to one data point The following code snippet plots the sum of squared residuals against different choices of splitting For completion I plotted more than seven splitting values to visualize the stepped pattern def calssrarr if lenarr0 return 0 ave npmeanarr return npsumarrave2 splittingvalues nplinspace8010020 ssrvalues for splittingvalue in splittingvalues ssr calssrrevenueprices splittingvalue cal ssrrevenueprices splittingvalue ssrvaluesappendssr 294 Statistics for TreeBased Methods pltrcParamsupdatefontsize 22 pltfigurefigsize128 pltxlabelsplitting prices pltylabelsum of squared residuals plttitleSplitting Price versus Sum of Squared Residuals pltplotsplittingvaluesssrvalues The result looks as in the following figure Figure 1020 Splitting value for the root node versus the sum of squared residuals The visualization in the preceding figure indicates that 850 or any value between the second point and the third point is the best splitting value for the root node There are only two records in the first node with a depth of 1 so we focus on the second node and repeat the process explained here The code is omitted due to space limitations The visualization 
of the sum of squared residuals is the following Exploring regression tree 295 Figure 1021 Splitting choices at the second node at depth 1 Now in order to achieve the minimum sum of squared error we should put the last data point into one child node However you see that if we split at 98 the penalty we pay is not increasing much If we include another one such as splitting at 96 the penalty will soar It may be a good idea to split at 96 rather than 98 because a leaf node containing too few records is not representative in general and often indicates overfitting Here is the final look of our regression tree You can calculate the regressed average prices in each region easily The final regression tree looks as follows Figure 1022 Final regression tree 296 Statistics for TreeBased Methods The following figure shows a visualization for the partition of the regions Figure 1023 Regressed values and region partitioning In multifeature cases we will have more than one feature The scanning of the best splitting value should include all the features but the idea is the same Using tree models in scikitlearn Before ending this chapter lets try out the scikitlearn API You can verify that the results agree with our models built from scratch The following code snippet builds a regression tree with a maximum depth of 1 on the pricerevenue data from sklearntree import DecisionTreeRegressor from sklearn import tree prices revenue pricesreshape11 revenuereshape11 regressor DecisionTreeRegressorrandomstate0maxdepth1 regressorfitpricesrevenue Using tree models in scikitlearn 297 Now we can visualize the tree with the following code snippet pltfigurefigsize128 treeplottreeregressor The tree structure looks as follows Figure 1024 Regression tree visualization of depth 1 Next we limit the maximum depth to 2 and require the minimal number of records samples in a leaf node to be 2 The code only requires a small change in the following line regressor DecisionTreeRegressorrandomstate0max depth2minsamplesleaf2 After running the code we obtain the following tree structure Figure 1025 Regression tree visualization of depth 2 298 Statistics for TreeBased Methods As you can see this produces exactly the same results as the one we built from scratch Note The scikitlearn decision tree API cant explicitly handle categorical variables There are various options such as onehot encoding that you can use to bypass this limitation You are welcome to explore the solutions on your own Summary In this chapter we started with the fundamental concepts of decision trees and then built a simple classification tree and a regression tree from scratch We went over the details and checked the consistency with the scikitlearn library API You may notice that tree methods do tend to overfit and might fail to reach the optimal model In the next chapter we will explore the socalled ensemble learning They are metaalgorithms that can be used on top of many other machine learning algorithms as well 11 Statistics for Ensemble Methods In this chapter we are going to investigate the ensemble method in terms of statistics and machine learning The English word ensemble means a group of actors or musicians that work together as a whole The ensemble method or ensemble learning in machine learning is not a specific machine learning algorithm but a meta learning algorithm that builds on top of concrete machine learning algorithms to bundle them together to achieve better performance The ensemble method is not a single method but a collection of many In this 
chapter we will cover the most important and representative ones We are going to cover the following in this chapter Revisiting bias variance and memorization Understanding the bootstrapping and bagging techniques Understanding and using the boosting module Exploring random forests with scikitlearn Lets get started 300 Statistics for Ensemble Methods Revisiting bias variance and memorization Ensemble methods can improve the result of regression or classification tasks in that they can be applied to a group of classifiers or regressors to help build a final augmented model Since we are talking about performance we must have a metric for improving performance Ensemble methods are designed to either reduce the variance or the bias of the model Sometimes we want to reduce both to reach a balanced point somewhere on the biasvariance tradeoff curve We mentioned the concepts of bias and variance several times in earlier chapters To help you understand how the idea of ensemble learning originated I will revisit these concepts from the perspective of data memorization Lets say the following schematic visualization represents the relationship between the training dataset and the realworld total dataset The solid line shown in the following diagram separates the seen world and the unseen part Figure 111 A schematic representation of the observed data Suppose we want to build a classifier that distinguishes between the circles and the squares Unfortunately our observed data is only a poor subset of the original data In most cases we do not know the entire set of realworld data so we dont know how representative our accessible dataset is We want to train a model to classify the two classes that is square and circle However since our trained model will only be exposed to the limited observed data different choices regarding which model we choose as well as its complexity will give us different results Lets check out the following two decision boundaries Revisiting bias variance and memorization 301 First we can draw a decision boundary as a horizontal line as shown in the following diagram This way one square data point is misclassified as a round one Figure 112 A simple decision boundary Alternatively we can draw a decision boundary the other way as shown in the following diagram This zigzagging boundary will correctly classify both the square data points and the round data points Figure 113 A more complex decision boundary Can you tell which classification method is better With our hindsight of knowing what the entire dataset looks like we can tell that neither is great However the difference is how much we want our model to learn from the known data The structure of the training dataset will be memorized by the model The question is how much Note on memorization Data memorization means that when a model is being trained it is exposed to the training set so it remembers the characteristics or structure of the training data This is a good thing when the model has high bias because we want it to learn but it becomes notoriously bad when its memory gets stuck in the training data and fails to generalize Simply put when a model memorizes and learns too little of the training data it has high bias When it learns too much it has high variance 302 Statistics for Ensemble Methods Because of this we have the following famous curve of the relationship between model complexity and error This is probably the most important graph for any data scientist interview Figure 114 The relationship between model complexity 
and error When model complexity increases the error in terms of mean squared error or any other form will always decrease monotonically Recall that when we discussed R2 we said that adding any predictor feature will increase the R2 rate On the other hand the performance of the learned model will start to decrease on the test dataset or other unseen datasets This is because the model learns too many ungeneralizable characteristics such as the random noise of the training data In the preceding example the zigzagging boundary doesnt apply to the rest of the dataset To summarize underfitting means that the model is biased toward its original assumption which means theres information thats missing from the training set On the other hand overfitting means that too many training setspecific properties were learned so the models complexity is too high Underfitting and overfitting High bias corresponds to underfitting while high variance corresponds to overfitting Humans also fall into similar traps For example a CEO is very busy so heshe does not have a lot of free time to spend with hisher kids What is the most likely job of the kids mother Most people will likely say homemaker However I didnt specify the gender of the CEO Well the CEO is the mother Understanding the bootstrapping and bagging techniques 303 As the power of machine learning algorithms grows the necessity to curb overfitting and find a balance between bias and variance is prominent Next well learn about the bootstrapping and bagging techniques both of which can help us solve these issues Understanding the bootstrapping and bagging techniques Bootstrapping is a pictorial word It allows us to imagine someone pulling themselves up by their bootstraps In other words if no one is going to help us then we need to help ourselves In statistics however this is a sampling method If there is not enough data we help ourselves by creating more data Imagine that you have a small dataset and you want to build a classifierestimator with this limited amount of data In this case you can perform crossvalidation Crossvalidation techniques such as 10fold crossvalidation will decrease the number of records in each fold even further We can take all the data as the training data but you likely will end up with a model with very high variance What should we do then The bootstrapping method says that if the dataset being used is a sample of the unknown data in the dataset why not try resampling again The bootstrap method creates new training sets by uniformly sampling from the dataset and then replacing it This process can be repeated as many times as its necessary to create many new training datasets Each new training set can be used to train a classifierregressor Besides the magic of creating training datasets from thin air bootstrapping has two significant advantages Bootstrapping increases the randomness in the training set It is likely that such randomness will help us avoid capturing the intrinsic random noise in the original training set Bootstrapping can help build the confidence interval of calculated statistics plural form of statistic Suppose we run bootstrapping N times and obtain N new samples By doing this we can calculate the standard variation of a selected statistic which is not possible without bootstrapping Before we move on lets examine how bootstrapping works on a real dataset We are going to use the Boston Housing Dataset which you can find in its official GitHub repository httpsgithubcomPacktPublishingEssentialStatisticsfor 
NonSTEMDataAnalysts You can also find the meanings of each column in the respective Jupyter notebook It contains information regarding the per capita crime rate by town average number of rooms per dwelling and so on 304 Statistics for Ensemble Methods Later in this chapter we will use these features to predict the target feature that is the median value of owneroccupied homes medv Turning a regression problem into a classification problem I am going to build a classifier for demonstration purposes so I will transform the continuous variable medv into a binary variable that indicates whether a houses price is in the upper 50 or lower 50 of the market The first few lines of records in the original dataset look as follows Due to space limitations most of the code except for the crucial pieces will be omitted here Figure 115 Main section of the Boston Housing Dataset First lets plot the distribution of the index of accessibility to radial highways variable Here you can see that this distribution is quite messy no simple distribution can model it parametrically Lets say that this is our entire dataset for selecting the training dataset in this demo Figure 116 Index of accessibility to radial highways distribution the dataset contains 506 records Understanding the bootstrapping and bagging techniques 305 Now lets select 50 pieces of data to be our trainingobserved data This distribution looks as follows Notice that the yaxis has a different scale The functions well be using to perform sampling can be found in the scikitlearn library Please refer to the relevant Jupyter notebook for details Figure 117 Index of accessibility to radial highways distribution our original sample which contains 50 records Next lets run bootstrapping 1000 times For each round well sample 25 records Then we will plot these new samples on the same histogram plot Here you can see that the overlapping behavior drops some characteristics of the 50record sample such as the very high peak on the lefthand side Figure 118 Overlapping 1000 times with our bootstrapped sample containing 25 records 306 Statistics for Ensemble Methods Next we will learn how this will help decrease the variance of our classifiers by aggregating the classifiers that were trained on the bootstrapped sets The word bagging is essentially an abbreviation of bootstrap aggregation This is what we will study in the remainder of this section The premise of bagging is to train some weak classifiers or regressors on the newly bootstrapped training sets and then aggregating them together through a majority vote averaging mechanism to obtain the final prediction The following code performs preprocessing and dataset splitting I am going to use a decision tree as our default weak classifier so feature normalization wont be necessary from sklearnmodelselection import traintestsplit import numpy as np bostonbag bostoncopy bostonbagmedv bostonbagmedvapplylambda x intx npmedianbostonbagmedv bostonbagtrain bostonbagtest traintestsplitboston bagtrainsize07shuffleTruerandomstate1 bostonbagtrainX bostonbagtrainy bostonbagtrain dropmedvaxis1tonumpy bostonbagtrainmedv tonumpy bostonbagtestX bostonbagtesty bostonbagtest dropmedvaxis1tonumpy bostonbagtestmedv tonumpy Note that I explicitly made a copy of the boston DataFrame Now Im going to try and reproduce something to show the overfitting on the test set with a single decision tree I will control the maximal depth of the single decision tree in order to control the complexity of the model Then Ill plot the F1 score with 
respect to the tree depth both on the training set and the test set Without any constraints regarding model complexity lets take a look at how the classification tree performs on the training dataset The following code snippet plots the unconstrained classification tree from sklearntree import DecisionTreeClassifier from sklearnmetrics import f1score from sklearn import tree clf DecisionTreeClassifierrandomstate0 clffitbostonbagtrainX bostonbagtrainy Understanding the bootstrapping and bagging techniques 307 pltfigurefigsize128 treeplottreeclf By doing this we obtain a huge decision tree with a depth of 10 It is hard to see the details clearly but the visualization of this is shown in the following diagram Figure 119 Unconstrained single decision tree The F1 score can be calculated by running the following f1scorebostonbagtrainyclfpredictbostonbagtrainX This is exactly 1 This means weve obtained the best performance possible Next Ill build a sequence of classifiers and limit the maximal depth it can span Ill calculate the F1 score of the model on the train set and test set in the same way The code for this is as follows trainf1 testf1 depths range111 for depth in depths clf DecisionTreeClassifierrandomstate0max 308 Statistics for Ensemble Methods depthdepth clffitbostonbagtrainX bostonbagtrainy trainf1appendf1scorebostonbagtrainyclf predictbostonbagtrainX testf1appendf1scorebostonbagtestyclf predictbostonbagtestX The following code snippet plots the two curves pltfigurefigsize106 pltrcParamsupdatefontsize 22 pltplotdepthstrainf1labelTrain Set F1 Score pltplotdepthstestf1 labelTest Set F1 Score pltlegend pltxlabelModel Complexity pltylabelF1 Score plttitleF1 Score on Train and Test Set The following graph is what we get Note that the F1 score is an inverse indicatormetric of the error The higher the F1 score is the better the model is in general Figure 1110 F1 score versus model complexity Although the F1 score on the training set continues increasing to reach the maximum the F1 score on the test set stops increasing once the depth reaches 4 and gradually decreases beyond that It is clear that after a depth of 4 we are basically overfitting the decision tree Understanding the bootstrapping and bagging techniques 309 Next lets see how bagging would help We are going to utilize the BaggingClassifier API in scikitlearn First since we roughly know that the critical depth is 4 well build a base estimator of such depth before creating a bagging classifier marked out of 10 for it Each time well draw samples from the training dataset to build a base estimator The code for this reads as follows from sklearnensemble import BaggingClassifier baseestimator DecisionTreeClassifierrandomstate 0 max depth 4 baggingclf BaggingClassifierbaseestimatorbaseestimator nestimators10 njobs20 maxsamples07 randomstate0 baggingclffitbostonbagtrainX bostonbagtrainy Next lets plot the relationship between the F1 score and the number of base estimators trainf1 testf1 nestimators 2i1 for i in range18 for nestimator in nestimators baggingclf BaggingClassifierbaseestimatorbase estimator nestimatorsnestimator njobs20 randomstate0 baggingclffitbostonbagtrainX bostonbagtrainy trainf1appendf1scorebostonbagtrainy baggingclf predictbostonbagtrainX testf1appendf1scorebostonbagtesty baggingclf predictbostonbagtestX pltfigurefigsize106 pltplotnestimatorstrainf1labelTrain Set F1 Score pltplotnestimatorstestf1 labelTest Set F1 Score pltxscalelog pltlegend 310 Statistics for Ensemble Methods pltxlabelNumber of Estimators pltylabelF1 
Score plttitleF1 Score on Train and Test Set The resulting of running the preceding code is displayed in the following graph Note that the xaxis is on the log scale Figure 1111 F1 score versus number of estimators As you can see even on the training set the F1 score stops increasing and begins to saturate and decline a little bit pay attention to the yaxis which indicates that there is still an intrinsic difference between the training set and the test set There may be several reasons for this performance and I will point out two here The first reason is that it is possible that we are intrinsically unlucky that is there was some significant difference between the training set and the test set in the splitting step The second reason is that our depth restriction doesnt successfully constrain the complexity of the decision tree What we can do here is try another random seed and impose more constraints on the splitting condition For example changing the following two lines will produce different results First well change how the dataset is split bostonbagtrain bostonbagtest traintest splitbostonbagtrainsize07shuffleTruerandom state1 Understanding and using the boosting module 311 Second well impose a new constraint on the base estimator so that a node must be large enough to be split baseestimator DecisionTreeClassifierrandomstate 0 maxdepth 4minsamplessplit30 Due to further imposed regularization it is somewhat clearer that the performance of both the training set and the test set is consistent Figure 1112 F1 score under more regularization Note that the scale on the yaxis is much smaller than the previous one Dealing with inconsistent results It is totally normal if the result is not very consistent when youre training evaluating machine learning models In such cases try different sets of data to eliminate the effect of randomness or run crossvalidation If your result is inconsistent with the expected behavior take steps to examine the whole pipeline Start with dataset completeness and tidiness then preprocessing and model assumptions It is also possible that the way the results are being visualized or presented is flawed Understanding and using the boosting module Unlike bagging which focuses on reducing variance the goal of boosting is to reduce bias without increasing variance 312 Statistics for Ensemble Methods Bagging creates a bunch of base estimators with equal importance or weights in terms of determining the final prediction The data thats fed into the base estimators is also uniformly resampled from the training set Determining the possibility of parallel processing From the description of bagging we provided you may imagine that it is relatively easy to run bagging algorithms Each process can independently perform sampling and model training Aggregation is only performed at the last step when all the base estimators have been trained In the preceding code snippet I set njobs 20 to build the bagging classifier When it is being trained 20 cores on the host machine will be used at most Boosting solves a different problem The primary goal is to create an estimator with low bias In the world of boosting both the samples and the weak classifiers are not equal During the training process some will be more important than others Here is how it works 1 First we assign a weight to each record in the training data Without special prior knowledge all the weights are equal 2 We then train our first baseweak estimator on the entire dataset After training we increase the weights of those 
records that are predicted with wrong labels.

3. Once the weights have been updated, we create the next base estimator. The difference here is that the records that were misclassified by previous estimators, or whose values seriously deviated from the true value in a regression, now receive more attention. Misclassifying them again results in higher penalties, which in turn increases their weights.

4. Finally, we iterate step 3 until the preset iteration limit is reached or accuracy stops improving.

Note that boosting is also a meta-algorithm. This means that at different iterations, the base estimator can be completely different from the previous one. You can use logistic regression in the first iteration, a neural network in the second iteration, and a decision tree in the final iteration.

There are two classical boosting methods we can use: adaptive boosting (AdaBoost) and gradient descent boosting.

First, we'll look at AdaBoost, where there are two kinds of weights: the weights of the training set records, which change at every iteration, and the weights of the estimators, which are determined inversely by their training errors. The weight of a record indicates the probability that this record will be selected into the training set for the next base estimator. For the estimators, the lower an estimator's training error is, the more voting power it has in the final weighted classifier. The following is the pseudo-algorithm for AdaBoost:

1. Initialize the training data with equal weights, w_i = 1/n, and set a maximal number of iterations, k.
2. At round k, sample the data according to the weights and build a weak classifier f_k.
3. Obtain the error ε_k of f_k and calculate the classifier's corresponding weight, that is, α_k = (1/2) · ln((1 − ε_k)/ε_k).
4. Update the weight of each record for round k + 1: w_i ← w_i · e^(−α_k) if the record is classified correctly, otherwise w_i ← w_i · e^(α_k). The weights are then renormalized so that they sum to 1.
5. Repeat steps 2 to 4 k times and obtain the final classifier, whose decision is proportional to the weighted vote of the weak classifiers, Σ_k α_k f_k(x).

Intuition behind the AdaBoost algorithm
At step 3, if there are no errors (ε_k = 0), α_k becomes infinitely large. This intuitively makes sense: if one weak classifier does the job, why bother creating more? At step 4, if a record is misclassified, its weight is increased exponentially; otherwise, its weight is decreased exponentially. Do a thought experiment here: if a classifier correctly classifies most records and misses only a few, then the misclassified records become e^(2α_k) times more likely to be selected in the next round. This also makes sense.

AdaBoost is only suitable for classification tasks. For regression tasks, we can use the Gradient Descent Boost (GDB) algorithm. However, please note that GDB can also be used to perform classification tasks, or even ranking tasks.

Let's take a look at the intuition behind GDB. In regression tasks, we often want to minimize the mean squared error, which takes the form (1/2) Σ_i (y_i − ŷ_i)², where y_i is the true value and ŷ_i is the regressed value. The weak estimator at round k leaves residuals y_i − ŷ_i. Sequentially, at iteration k + 1, we want to build another base estimator to remove these residuals. If you know calculus and look closely, you'll see that the residual y_i − ŷ_i is exactly proportional to the (negative) derivative of the squared-error loss with respect to the prediction ŷ_i. This is the key: improving the ensemble is not achieved by reweighting records as in AdaBoost, but by deliberately constructing the next estimator to predict the residuals, which happen to be the (negative) gradients of the MSE with respect to the predictions. If a different loss function is used, the residual argument won't be valid, but we still want
to predict the gradients The math here is beyond the scope of this book but you get the point Does GDB work with the logistic regression base learner The answer is a conditional no In general GDB doesnt work on simple logistic regression or other simple linear models The reason is that the addition of linear models is still a linear mode This is basically the essence of being linear If a linear model misclassifies some records another linear model will likely misclassify them If it doesnt the effect will be smeared at the last averaging step too This is probably the reason behind the illusion that ensemble algorithms are only applicable to treebased methods Most examples are only given with tree base estimatorslearners People just dont use them on linear models As an example lets see how GDB works on the regression task of predicting the median price of Boston housing The following code snippet builds the training dataset the test dataset and the regressor from sklearnensemble import GradientBoostingRegressor from sklearnmetrics import meansquarederror bostonboost bostoncopy bostonboosttrain bostonboosttest traintest splitbostonboost train size07 shuffleTrue Understanding and using the boosting module 315 random state1 bostonboosttrainX bostonboosttrainy bostonboost traindropmedvaxis1tonumpy bostonboost trainmedvtonumpy bostonboosttestX bostonboosttesty bostonboosttest dropmedvaxis1tonumpy bostonboosttestmedv tonumpy gdbreg GradientBoostingRegressorrandomstate0 regfitbostonboosttrainX bostonboosttrainy printmeansquarederrorregpredictbostonboosttrainX bostonboosttrainy Here the MSE on the training set is about 15 On the test set it is about 71 which is likely due to overfitting Now lets limit the number of iterations so that we can inspect the turning point The following code snippet will help us visualize this trainmse testmse nestimators range1030020 for nestimator in nestimators gdbreg GradientBoostingRegressorrandomstate0n estimatorsnestimator gdbregfitbostonboosttrainX bostonboosttrainy trainmseappendmeansquarederrorgdbregpredictboston boosttrainX bostonboosttrainy testmseappendmeansquarederrorgdbregpredictboston boosttestX bostonboosttesty pltfigurefigsize106 pltplotnestimatorstrainmselabelTrain Set MSE pltplotnestimatorstestmse labelTest Set MSE pltlegend pltxlabelNumber of Estimators pltylabelMSE plttitleMSE Score on Train and Test Set 316 Statistics for Ensemble Methods What weve obtained here is the classic behavior of biasvariance tradeoff as shown in the following graph Note that the error doesnt grow significantly after the turning point This is the idea behind decreasing bias without exploding the variance of boosting Figure 1113 Number of estimators in terms of boosting and MSE At this point you have seen how boosting works In the next section we will examine a model so that you fully understand the what bagging is Exploring random forests with scikitlearn Now that were near the end of this chapter I would like to briefly discuss random forests Random forests are not strictly ensemble algorithms because they are an extension of tree methods However unlike bagging decision trees they are different in an important way In Chapter 10 Statistical Techniques for TreeBased Methods we discussed how splitting the nodes in a decision tree is a greedy approach The greedy approach doesnt always yield the best possible tree and its easy to overfit without proper penalization The random forest algorithm does not only bootstrap the samples but also the features Lets take our stroke risk 
At this point, you have seen how boosting works. In the next section, we will examine one more model so that you can fully understand what bagging is capable of.

Exploring random forests with scikit-learn

Now that we're near the end of this chapter, I would like to briefly discuss random forests. Random forests are not strictly new ensemble algorithms because they are an extension of tree methods; however, they differ from simply bagging decision trees in an important way. In Chapter 10, Statistics for Tree-Based Methods, we discussed how splitting the nodes in a decision tree is a greedy approach. The greedy approach doesn't always yield the best possible tree, and it's easy to overfit without proper penalization. The random forest algorithm bootstraps not only the samples but also the features. Let's take our stroke risk dataset as an example. The heavy weight feature is the optimal feature to split on, but committing to it rules out the 80% of all possible trees that start with one of the other features at the root node. The random forest algorithm, at every splitting decision point, samples a subset of the features and picks the best among them. This way, it is possible for suboptimal features to be selected.

Non-greedy algorithms

The idea of not using a greedy algorithm to reach the potential optimum is a key concept in AI. For example, in the game of Go, performing a short-term optimal move may lead to a long-term strategic disadvantage. A human Go master is capable of foreseeing such consequences. The most advanced AI can make such far-sighted decisions as well, but at the cost of exponentially expensive computational power. The balance between short-term gain and long-term gain is also a key concept in reinforcement learning.

Let's take a look at a code example of random forest regression to understand how the corresponding API in scikit-learn is called:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

train_mse, test_mse = [], []
n_estimators = range(10, 300, 20)
for n_estimator in n_estimators:
    regr = RandomForestRegressor(max_depth=6, random_state=0,
                                 n_estimators=n_estimator,
                                 max_features="sqrt")
    regr.fit(boston_boost_train_X, boston_boost_train_y)
    train_mse.append(mean_squared_error(
        regr.predict(boston_boost_train_X), boston_boost_train_y))
    test_mse.append(mean_squared_error(
        regr.predict(boston_boost_test_X), boston_boost_test_y))

plt.figure(figsize=(10, 6))
plt.plot(n_estimators, train_mse, label="Train Set MSE")
plt.plot(n_estimators, test_mse, label="Test Set MSE")
plt.legend()
plt.xlabel("Number of Estimators")
plt.ylabel("MSE")
plt.title("MSE Score on Train and Test Set")

The visualization we receive also shows a typical bias-variance tradeoff. Note that limiting the max depth of each individual decision tree to 6 significantly restricts the power of the model.

Figure 11.14 – Estimators versus MSE in random forest regression

One of the key features of random forests is their robustness against overfitting. The relatively flat curve of the test set's MSE is proof of this claim.

Summary

In this chapter, we discussed several important ensemble learning algorithms, including bootstrapping for creating more training sets, bagging for aggregating weak estimators, boosting for improving accuracy without increasing variance too much, and the random forest algorithm. Ensemble algorithms are very powerful, as they are models that build on top of basic models. Understanding them will benefit you in the long run in your data science career.

In the next chapter, we will examine some common mistakes and go over some best practices in data science.

Section 4: Appendix

Section 4 covers some real-world best practices that I have collected in my experience. It also identifies common pitfalls that you should avoid. Exercises, projects, and instructions for further learning are also provided. This section consists of the following chapters:

- Chapter 12, A Collection of Best Practices
- Chapter 13, Exercises and Projects

12 A Collection of Best Practices

This chapter is a special chapter that investigates three important topics that are prevalent in data science nowadays: data source quality, data visualization quality, and causality interpretation. Such a chapter has generally been missing in peer publications, but I consider it essential to stress the following topics. I want to make sure that you, as a future data scientist, will practice data science while following the best practice tips introduced in this chapter. After finishing this chapter, you will be able to do
the following Understand the importance of data quality Avoid using misleading data visualization Spot common errors in causality arguments First lets start with the beginning of any data science project the data itself 322 A Collection of Best Practices Understanding the importance of data quality Remember the old adage that says garbage in garbage out This is especially true in data science The quality of data will influence the entire downstream project It is difficult for people who work on the downstream tasks to identify the sources of possible issues In the following section I will present three examples in which poor data quality causes difficulties Understanding why data can be problematic The three examples fall into three different categories that represent three different problems Inherent bias in data Miscommunication in largescale projects Insufficient documentation and irreversible preprocessing Lets start with the first example which is quite a recent one and is pretty much a hot topicface generation Bias in data sources The first example we are going to look at is bias in data FaceDepixelizer is a tool that is capable of significantly increasing the resolution of a human face in an image You are recommended to give it a try on the Colab file the developers released It is impressive that Generative Adversarial Network GAN is able to create faces of human in images that are indistinguishable from real photos Generative adversarial learning Generative adversarial learning is a class of machine learning frameworks that enable the algorithms to compete with each other One algorithm creates new data such as sounds or images by imitating original data while another algorithm tries to distinguish the original data and the created data This adversarial process can result in powerful machine learning models that can create images sounds or even videos where humans are unable to tell whether they are real However people soon started encountering this issue within the model Among all of them I found the following example discovered by Twitter user Chicken3egg to be the most disturbing one The image on the left is the original picture with low resolution and the one on the right is the picture that FaceDepixelizer generated Understanding the importance of data quality 323 Figure 121 FaceDepixelizer example on the Obama picture If you are familiar with American politics you know the picture on the left is former President Barack Obama However the generated picture turns out to be a completely different guy For a discussion on this behavior please refer to the original tweet httpstwittercomChicken3ggstatus1274314622447820801 This issue is nothing new in todays machine learning research and has attracted the attention of the community A machine learning model is nothing but a digestor of data which outputs what it learns from the data Nowadays there is little diversity in human facial datasets especially in the case of people from minority backgrounds such as African Americans Not only will the characteristics of the human face be learned by the model but also its inherent bias This is a good example where flawed data may cause issues If such models are deployed in systems such as CCTV closedcircuit television the ethical consequences could be problematic To minimize such effects we need to scrutinize our data before feeding it into machine learning models The machine learning community has been working to address the data bias issue As the author of this book I fully support the ethical 
progress in data science and machine learning For example as of October 2020 the author of FaceDepixelizer has addressed the data bias issue You can find the latest updates in the official repository at https githubcomtgbomzeFaceDepixelizer Miscommunication in largescale projects When a project increases in size miscommunication can lead to inconsistent data and cause difficulties Here size may refer to code base size team size or the complexity of the organizational structure The most famous example is the loss of the 125 milliondollar Mars climate orbiter from NASA in September 1999 324 A Collection of Best Practices Back then NASAs Jet Propulsion Laboratory and Lockheed Martin collaborated and built a Mars orbiter However the engineering team at Lockheed Martin used English units of measurement and the team at Jet Propulsion Laboratory used the conventional metric system For readers who are not familiar with the English unit here is an example Miles are used to measure long distances where 1 mile is about 1600 meters which is 16 kilometers The orbiter took more than 280 days to reach Mars but failed to function The reason was later identified to be a mistake in unit usage Lorelle Young president of the US Metric Association commented that two measurement systems should not be used as the metric system is the standard measurement language of all sophisticated science Units in the United States Technically speaking the unit system in the United States is also different from the English unit system There are some subtle differences Some states choose the socalled United States customary while others have adopted metric units as official units This may not be a perfect example for our data science project since none of our projects will ever be as grand or important as NASAs Mars mission The point is that as projects become more complex and teams grow larger it is also easier to introduce data inconsistency into projects The weakest stage of a project A point when a team upgrades their systems dependencies or algorithms is the easiest stage to make mistakes and it is imperative that you pay the utmost attention at this stage Insufficient documentation and irreversible preprocessing The third most common reason for poor quality data is the absence of documentation and irreversible preprocessing We briefly talked about this in Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing Data documentation is sometimes referred to as metadata It is the data about the dataset itself for example information about how the data was obtained who is responsible for queries with respect to the data and the meanings of abbreviations in the dataset Understanding the importance of data quality 325 In data science teams especially for crossteam communication such information is often omitted based on my observations but they are actually very important You cannot assume that the data speaks for itself For example I have used the Texas county dataset throughout this book but the meaning of the rural code cant be revealed unless you read the specs carefully Even if the original dataset is accompanied by metadata irreversible preprocessing still has the potential to damage the datas quality One example I introduced earlier in Chapter 2 Essential Statistics for Data Assessment is the categorization of numerical values Such preprocessing results in the loss of information and there isnt a way for people who take the data from you to recover it Similar processing includes imputation and minmax scaling 
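As one concrete illustration (a sketch under assumed column names, not the book's code), you can make min-max scaling both documented and reversible by persisting the fitted scaler parameters as metadata next to the processed data:

import json
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative DataFrame; in practice these would be your numerical columns
df = pd.DataFrame({"age": [23, 45, 31, 60],
                   "income": [40_000, 85_000, 52_000, 120_000]})

scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Persist the fitted parameters as metadata so the transformation can be inverted later
metadata = {
    "columns": list(df.columns),
    "data_min": scaler.data_min_.tolist(),
    "data_max": scaler.data_max_.tolist(),
}
with open("scaling_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)

# Anyone downstream can now recover the original values
recovered = scaler.inverse_transform(scaled)

Whoever receives scaling_metadata.json along with the scaled table can reverse the transformation or apply the exact same scaling to new data, instead of being stuck with an irreversible preprocessing step.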
The key to solving such issues is to embrace the culture of documentation and reproducibility. In a data science team, it is not enough to share a result or a presentation; it is important to share a reproducible result with well-written and easy-to-understand documentation. In such instances, a Jupyter notebook is better than a script because you can put text and code together nicely. For the R ecosystem, there is R Markdown, which is similar to a Jupyter notebook. You can demonstrate the pipeline of preprocessing and the functioning of algorithms in an interactive fashion.

The idea of painless reproducibility applies at different levels, not only in a data science project. For a general Python project, a requirements.txt file specifying the versions of dependencies can ensure the consistency of a Python environment. For our book, in order to avoid possible hassles for readers who are not familiar with pip or virtualenv (the Python package management and virtual environment management tools), I have chosen Google Colab so that you can run the accompanying code directly in the browser.

A general idea of reproducibility

A common developer joke you might hear is "This works on my machine." Reproducibility has been a true headache at different levels. For data science projects, Jupyter Notebooks, R Markdown, and many other tools were developed to make code and presentations reproducible. In terms of the consistency of libraries and packages, we have package management tools such as pip for Python and npm for JavaScript. In order to enable large systems to work across different hardware and environments, Docker was created to isolate the configuration of a running instance from its host machine. All these technologies solve the same problem of painlessly reproducing a result or performance consistently. This is a philosophical concept in engineering that you should keep in mind.
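For example, a minimal requirements.txt for a project like the ones in this book might look like the following (the package list and version numbers are illustrative; pin whatever your project actually imports):

# requirements.txt -- recreate the environment with: pip install -r requirements.txt
pandas==1.1.4
numpy==1.19.4
matplotlib==3.3.3
scikit-learn==0.23.2
statsmodels==0.12.1

Checking this file into version control alongside your notebooks means a collaborator can rebuild the same environment before rerunning the analysis.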
In the next section, we'll look at another aspect of common pitfalls in data science: misleading graphs.

Avoiding the use of misleading graphs

Graphics convey much more information than words. Not everyone understands P-values or statistical arguments, but almost everyone can tell whether one piece of a pie plot is larger than another piece, or whether two line plots share a similar trend. However, there are many ways in which graphs can damage the quality of a visualization or mislead readers. In this section, we will examine two examples. Let's start with the first example of misleading graphs.

Example 1: COVID-19 trend

The following graph is a screenshot taken in early April 2020. A news channel showed this graph of new COVID-19 cases per day in the United States. Do you spot anything strange?

Figure 12.2 – A screenshot of COVID-19 coverage of a news channel

The issue is on the y axis. If you look closely, the y axis tickers are not separated equally but follow a strange pattern. For example, the space between 30 and 60 is the same as the space between 240 and 250; the tick distances vary from 10 up to 50. Now, I will regenerate this plot without mashing the y axis tickers with the following code snippet:

import pandas as pd
import matplotlib.pyplot as plt

dates = pd.date_range("2020-03-18", periods=15, freq="D")
daily_cases = [33, 61, 86, 112, 116, 129, 192, 174,
               344, 304, 327, 246, 320, 339, 376]

plt.rcParams.update({"font.size": 22})
plt.figure(figsize=(10, 6))
plt.plot(dates, daily_cases, label="Daily Cases", marker="o")
plt.legend()
plt.gca().tick_params(axis="x", which="major", labelsize=14)
plt.xlabel("Dates")
plt.ylabel("New Daily Cases")
plt.title("New Daily Cases Versus Dates")
plt.xticks(rotation=45)
for x, y in zip(dates, daily_cases):
    plt.text(x, y, str(y))

You will see the following visualization:

Figure 12.3 – New daily cases without modifying the y-axis tickers

What's the difference between this one and the one that the news channel showed to its audience? The jump from 174 to 344 is much more significant, while the increase from 246 to 376 is also more dramatic. The news channel used the space that represents 10 or 30 at the bottom of the axis to represent 50 when the numbers grew large. This way, the visual impression of the increase is much weaker.

Next, let's look at another example that has the ability to confuse readers.

Example 2: Bar plot cropping

We are going to use the US county data for this example. Here, I am loading the data with the same code snippet we used in Chapter 4, Sampling and Inferential Statistics. The difference is that this time I am looking at the urban influence code of all counties in the United States, not limited to just Texas. The following code snippet performs the visualization:

from collections import Counter

df = pd.read_excel("PopulationEstimates.xls", skiprows=2)
counter = Counter(df.tail(-1)["Urban_Influence_Code_2003"])
codes = [key for key in counter.keys() if str(key) != "nan"]
heights = [counter[code] for code in codes]

plt.figure(figsize=(10, 6))
plt.bar(list(map(lambda x: str(x), codes)), heights)
plt.xticks(rotation=45)
plt.title("Urban Influence Code for All Counties in the US")
plt.xlabel("Urban Influence Code")
plt.ylabel("Count")

The result looks like the following:

Figure 12.4 – Urban influence code counting for all counties in the US

Note that I deliberately changed the urban influence code to a string to indicate that it is a categorical variable and not a numerical one. According to the definition, the urban influence code is a 12-level classification of metropolitan counties developed by the United States Department of Agriculture.

Now, this is what happens if I add one more line to the previous code snippet:

plt.gca().set_ylim(80, 800)

We then obtain the data as shown in the following diagram:

Figure 12.5 – Urban influence code counting with limited y axis values

The new graph uses the same data and the exact same kind of bar plot. This visualization is not wrong, but it is confusing as much as it is misleading. There are more than 100 counties with an urban influence code of 3.0 (the third bar from the right-hand side), but the second visualization suggests that there are probably no such counties. The difference between being confusing and misleading is that misleading graphs are usually crafted carefully and deliberately to convey the wrong message; confusing graphs may not be. The visualizer might not realize the confusion that such data transformation or capping will cause.

There are other causes that may result in bad visualizations, for example, the improper use of fonts and color. The more intense a color is, the greater the importance we place on that element. Opacity is an important factor, too. A linear change in opacity doesn't always result in a linear impression of quantities. It is not safe to rely purely on visual elements to make a quantitative judgement.

In the next section, we will talk about another good practice: you should always question causality arguments.

Spot the common errors in this causality argument

A popular conspiracy theory in early 2020 was that the coronavirus is caused by 5G towers being built around the world. People who support such a theory have a powerful-looking graph to back their argument. I can't trace the origin of such a widespread theory, but here are the
two popular visualizations Figure 126 Map of the world showing a 5G conspiracy theory 332 A Collection of Best Practices The following map is similar but this time limited to the United States Figure 127 Map of the United States showing a 5G conspiracy theory The top portion shows the number of COVID19 cases in the United States while the lower portion shows the installation of 5G towers in the United States These two graphs are used to support the idea that 5G causes the spread of COVID19 Do you believe it Avoiding the use of misleading graphs 333 Lets study this problem step by step from the following perspectives Is the data behind the graphics accurate Do the graphics correctly represent the data Does the visualization support the argument that 5G is resulting in the spread of COVID19 For the first question following some research you will find that Korea China and the United States are indeed the leading players in 5G installation For example as of February 2020 Korea has 5G coverage in 85 cities while China has 5G coverage in 57 cities However Russias first 5G zone was only deployed in central Moscow in August 2019 For the first question the answer can roughly be true However the second question is definitively false All of Russia and Brazil are colored to indicate the seriousness of the spread of COVID19 and 5G rollout The visual elements do not represent the data proportionally Note that the maps first appeared online long before cases in the United States and Brazil exploded People cant tell the quantitative relationship between the rolloutcoverage of 5G and the number of COVID19 cases The graphic is both misleading and confusing Onto the last question The answer to this is also definitively false However the misrepresentation issue got confused with the interpretation of the data for the world map so lets focus on the map of the United States There are many ways to generate maps such as the COVID19 cases map or the 5G tower installation map for example by means of population or urbanization maps Since COVID19 is transmitted mainly in the air and by touching droplets containing the virus population density and lifestyles play a key role in its diffusion This explains why areas with a high population are also areas where there are more confirmed cases From a business standpoint ATT and Verizon will focus heavily on offering 5G services to high population density regions such as New York This explains the density map concerning 5G tower installation Factors such as population density are called confounding factors A confounding factor is a factor that is a true cause behind another two or more factors The causal relationships are between the confounding factor and the other factors not between the nonconfounding factors It is a common trick to use visualization to suggest or represent a causal relationship between two variables without stating the underlying reasons That said how do we detect and rebut the entire causal argument Lets understand this in the next section 334 A Collection of Best Practices Fighting against false arguments To refute false arguments you need to do the following Maintain your curiosity to dig deeper into domain knowledge Gather different perspectives from credible experts To refute false arguments domain knowledge is the key because domain knowledge can reveal the details that loose causal arguments often hide from their audience Take the case of COVID19 for example To ascertain the possible confounding factor of population density you need to know how 
the virus spreads which falls into the domain of epidemiology The first question you need to ask is whether there is a more science based explanation During the process of finding such an explanation you need to learn domain knowledge and can often easily poke a hole in the false arguments Proving causal relations scientifically Proving a causal relation between variables is hard but to fake one is quite easy In a rigorous academic research environment to prove a cause and effect relationship you need to control variables that leave the target variable as being the only explanation of the observed behavior Often such experiments must have the ability to be reproduced in other labs with different groups of researchers However this is sometimes hard or even impossible to reproduce in the case of social science research A second great way of refuting such false arguments is to gather opinions from credible experts A credible expert is someone who has verifiable knowledge and experience in specific domains that is trustworthy As the saying goes given enough eyes on the codes there wont be any bugs Seeking opinions from true experts will often easily reveal the fallacy in false causal arguments In your data science team pair coding and coding reviews will help you to detect errors including but not limited to causal relation arguments An even better way to do this is to show your work to the world put your code on GitHub or build a website to show it to anyone on the internet This is how academic publishing works and includes two important elementspeer review and open access to other researchers Summary 335 Summary In this chapter we discussed three best practices in data science They are also three warnings I give to you always be cautious about data quality always be vigilant about visualization and pay more attention to detect and thereby help avoid false cause and effect relationship claims In the next and final chapter of this book we will use what you have learned so far to solve the exercises and problems 13 Exercises and Projects This chapter is dedicated to exercises and projects that will enhance your understanding of statistics as well as your practical programming skills This chapter contains three sections The first section contains exercises that are direct derivatives of the code examples you saw throughout this book The second section contains some projects I would recommend you try out some of these will be partially independent of what we covered concretely in previous chapters The last section is for those of you who want to dive more into the theory and math aspects of this book Each section is organized according to the contents of the corresponding chapter Once youve finished this final chapter you will be able to do the following Reinforce your basic knowledge about the concepts and knowledge points that were covered in this book Gain working experience of a projects life cycle Understand the math and theoretical foundations at a higher level Lets get started 338 Exercises and Projects Note on the usage of this chapter You are not expected to read or use this chapter sequentially You may use this chapter to help you review the topics that were covered in a certain chapter once youve finished it You can also use it as reference for your data scientist interview Exercises Most of the exercises in each chapter dont depend on each other However if the exercises do depend on each other this relationship will be stated clearly Chapter 1 Fundamentals of Data Collection Cleaning 
and Preprocessing Exercises related to Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing are listed in this section Exercise 1 Loading data Load the autompg data as a pandas DataFrame by using the pandasreadcsv function This data can be found at httpsarchiveicsucieduml machinelearningdatabasesautompg Hint You may find that the default argument fails to load the data properly Search the document of the readcsv function identify the problem and find the solution Exercise 2 Preprocessing Once youve loaded the autompg data preprocess the data like so 1 Identify the type of each columnfeature as numerical or categorical 2 Perform minmax scaling for the numerical variables 3 Impute the data with the median value Exercises 339 Hint This dataset which I chose on purpose can be ambiguous in terms of determining the variable type Think about different choices and their possible consequences for downstream tasks Exercise 3 Pandas and API calling Sign up for a free account at httpsopenweathermaporgapi obtain an API key and read the API documentation carefully Make some API calls to obtain the hourly temperature for the city you live in for the next 24 hours Build a pandas DataFrame and plot a time versus temperature graph Hint You may need Pythons datetime module to convert a string into a valid datetime object Chapter 2 Essential Statistics for Data Assessment Exercises related to Chapter 2 Essential Statistics for Data Assessment are listed in this section Exercise 1 Other definitions of skewness There are several different definitions of skewness Use the data that we used to examine skewness in Chapter 2 Essential Statistics for Data Assessment to calculate the following versions of skewness according to Wikipedia httpsenwikipediaorgwiki Skewness Pearsons second skewness coefficient Quantilebased measures Groeneveld and Meedens coefficient Exercise 2 Bivariate statistics Load the autompg dataset that we introduced in Exercise 1 Loading data for Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing Identify the bivariate correlation with all the numerical variables Which variables are positively correlated and which ones are negatively correlated Do these correlations make reallife sense to you 340 Exercises and Projects Exercise 3 The crosstab function Can you implement your own crosstab function It can take two lists of equal length as input and generate a DataFrame as output You can also set an optional parameter such as the name of the input lists Hint Use the zip function for a onepass loop Chapter 3 Visualization with Statistical Graphs Exercises related to Chapter 3 Visualization with Statistical Graphs are listed in this section Exercise 1 Identifying the components This is an open question Try to identify the three components shown on Seaborns gallery page httpsseabornpydataorgexamplesindexhtml The three components of any statistical graph are data geometry and aesthetics You may encounter new kinds of graph types here which makes this a great opportunity to learn Exercise 2 Queryoriented transformation Use the autompg data we introduced in Exercise 1 Loading data for Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing transform it into the long format and generate a boxplot called the Seaborn tips boxplot Namely you want the xaxis to be the cylinders variable and the yaxis to be the mpg data Exercise 3 Overlaying two graphs For the birth rate and death rate of the Anderson county overlay the line plot and the bar plot in the same graph 
Some possible enhancements you can make here are as follows Choose a proper font and indicate the value of the data with numbers Choose different symbols for the death rate and the birth rate Exercises 341 Exercise 4 Layout Create a 2x2 layout and create a scatter plot of the birth rate and death rate of your four favorite states in the US Properly set the opacity and marker size so that it is visually appealing Exercise 5 The pairplot function The pairplot function of the seaborn library is a very powerful function Use it to visualize all the numerical variables of the autompg dataset we introduced in Exercise 1 Loading data of Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing Study its documentation so that you know how to add a regression line to the offdiagonal plots Does the regression line make sense to you Compare the result with the correlation result you obtained in Exercise 2 Bivariate statistics of Chapter 2 Essential Statistics for Data Assessment Hint Read the documentation regarding pairplot for more information httpsseabornpydataorggeneratedseaborn pairplothtml Chapter 4 Sampling and Inferential Statistics Exercises related to Chapter 4 Sampling and Inferential Statistics are listed in this section Exercise 1 Simple sampling with replacement Create a set of data that follows a standard normal distribution of size 1000 Run simple random sampling with and without replacement Increase the sampling size What happens if the sampling size of your replacement sampling exceeds 1000 Exercise 2 Stratified sampling Run stratified sampling on the population of county stratified with respect to states rather than ruralurban Continuum Code2013 Sample two data points from each state and compare the results by running the sampling multiple times 342 Exercises and Projects Exercise 3 Central limit theorem Verify the central limit theorem by summing nonnormal random variables by following the distributions listed here just pick the easiest form for each If you are not familiar with these distributions please refer to Chapter 5 Common Probability Distributions Binomial distribution Uniform distribution Poisson distribution Exponential distribution Can you visually examine the number of random variables that need to be summed together to approach a normal distribution for each of the aforementioned distributions What is the intuition behind your observation Chapter 5 Common Probability Distributions Exercises related to Chapter 5 Common Probability Distributions are listed in this section Exercise 1 Identify the sample space and the event corresponding to the probability being asked in the following statements By tossing four fair coins find the probability of at least getting two heads The probability that a bus arrives between 800 AM and 830 AM A battleship fires three missiles The battleships target will be destroyed if at least two missiles hit its target Find the probability that the battleships target will be destroyed Assume that the likelihood of a woman giving birth to a boy or a girl are the same If we know a family that has three kids has a girl in the family find the probability that the family has at least one girl How about the probability of having at least a boy Exercise 2 Proof of probability equations Prove the following equation 𝑃𝑃𝐴𝐴 𝐵𝐵 𝑃𝑃𝐴𝐴 𝑃𝑃𝐵𝐵 𝑃𝑃𝐴𝐴 𝐵𝐵 Exercises 343 Hint Enumerate all possibilities of both the lefthand side and the righthand side of the equation Prove that if an event is in the lefthand side that is the union of A and B then it is in the 
expression of the righthand side In the same regard if it is in the righthand side then prove it is also in the lefthand side Exercise 3 Proof of De Morgans law De Morgans law states two important transformation rules as stated here https enwikipediaorgwikiDeMorgan27slaws Use the same technique you used in the previous exercise to prove them Exercise 4 Three dice sum Write a program that calculates the distribution in a sum of three fair dice Write a program that will simulate the case where the dice are not fair so that for each dice it is two times more likely to have an even outcome than an odd outcome Hint Write some general simulation code so that you can specify the probability associated with each face of the dice and the total number of dice Then you are free to recover the central limit theorem easily Exercise 5 Approximate the binomial distribution with the Poisson distribution The binomial distribution with a large n can sometimes be very difficult to calculate because of the involvement of factorials However you can approximate this with Poisson distribution The condition for this is as follows n p 0 np λ where λ is a constant If all three conditions are met then the binomial distribution can be approximated by the Poisson distribution with the corresponding parameter λ Prove this visually Most of the time n must be larger than 100 and p must be smaller than 001 344 Exercises and Projects Exercise 6 Conditional distribution for a discrete case In the following table the first column indicates the grades of all the students in an English class while the first row indicates the math grades for the same class The values shown are the count of students and the corresponding students Figure 131 Scores Answer the following questions Some of these question descriptions are quite complicated so read them carefully What is the probability of a randomly selected student having a B in math What is the probability of a randomly selected student having a math grade thats no worse than a C and an English grade thats no worse than a B If we know that a randomly selected student has a D in math whats the probability that this student has no worse than a B in English Whats the minimal math grade a randomly selected student should have where you have the confidence to say that this student has at least a 70 chance of having no worse than a B in English Chapter 6 Parameter Estimation Exercises related to Chapter 6 Parameter Estimation are listed in this section Exercise 1 Distinguishing between estimation and prediction Which of the following scenarios belong to the estimation process and which belong to the prediction process Find out the weather next week Find out the battery capacity of your phone based on your usage profile Find out the arrival time of the next train based on the previous trains arrival time Exercises 345 Exercise 2 Properties of estimators If you were to use the method of moments to estimate a uniform distributions boundaries is the estimator thats obtained unbiased Is the estimator thats obtained consistent Exercise 3 Method of moments Randomly generate two variables between 0 and 1 without checking their values Set them as the μ and σ of a Gaussian random distribution Generate 1000 samples and use the method of moments to estimate these two variables Do they agree Exercise 4 Maximum likelihood I We toss a coin 100 times and for 60 cases we get heads Its possible for the coin to be either fair or biased to getting heads with a probability of 70 Which is more likely Exercise 
5 Maximum likelihood II In this chapter we discussed an example in which we used normal distribution or Laplace distribution to model the noise in a dataset What if the noise follows a uniform distribution between 1 and 1 What result will be yielded Does it make sense Exercise 6 Law of total probability Lets say the weather tomorrow has a 40 chance of being windy and a 60 chance of being sunny On a windy day you have a 50 chance of going hiking while on a sunny day the probability goes up to 80 Whats the probability of you going hiking tomorrow without knowing tomorrows weather What about after knowing that its 90 likely to be sunny tomorrow Exercise 7 Monty Hall question calculation Calculate the quantity 𝑃𝑃𝐶𝐶𝐸𝐸𝐵𝐵 that we left in the Monty Hall question Check its value with the provided answer 346 Exercises and Projects Chapter 7 Statistical Hypothesis Testing Exercises related to Chapter 7 Statistical Hypothesis Testing are listed in this section Exercise 1 Onetailed and twotailed tests Is the following statement correct The significant level for a onetailed test will be twice as large as it is for the twotailed test Exercise 2 Pvalue concept I Is the following statement correct For a discrete random variable where every outcome shares an equal probability any outcome has a Pvalue of 1 Exercise 3 Pvalue concept II Are the following statements correct The value of the Pvalue is obtained by assuming the null hypothesis is correct The Pvalue by definition has a falsepositive ratio Hint Falsepositive naively means something isnt positive but you misclassify or mistreat it as being positive Exercise 4 Calculating the Pvalue Calculate the Pvalue of observing five heads when tossing an independent fair coin six times Exercise 5 Table looking Find the corresponding value in a onetailed tdistribution table where the degree of freedom is 5 and the significance level is 0002 Exercise 6 Decomposition of variance Prove the following formula mathematically 𝑆𝑆𝑇𝑇 2 𝑆𝑆𝐵𝐵 2 𝑆𝑆𝑊𝑊 2 Exercises 347 Exercise 7 Fishers exact test For the linkclicking example we discussed in this chapter read through the Fishers exact test page httpsenwikipediaorgwikiFisher27sexacttest Run a Fishers exact test on the device and browser variables Hint Build a crosstab first It may be easier to refer to the accompanying notebook for reusable code Exercise 8 Normality test with central limit theorem In Exercise 3 Central limit theorem of Chapter 4 Sampling and Inferential Statistics we tested the central limit theorem visually In this exercise use the normality test provided in Chapter 7 Statistical Hypothesis Testing to test the normality of the random variable that was generated from summing nonnormal random variables Exercise 9 Goodness of fit In Chapter 7 Statistical Hypothesis Testing we ran a goodness of fit test on the casino card game data Now lets assume the number of hearts no longer follows a binomial distribution but a Poisson distribution Now run a goodness of fit test by doing the following 1 Find the most likely parameter for the Poisson distribution by maximizing the likelihood function 2 Run the goodness of fit test Suggestion This question is a little tricky You may need to review the procedure of building and maximizing a likelihood function to complete this exercise Exercise 10 Stationary test Find the data for the total number of COVID19 deaths in the United States from the date when the first death happened to July 29th where the number of patients that died had reached 150000 Run a stationary test on the data 
and the first difference of the data 348 Exercises and Projects Chapter 8 Statistics for Regression Exercises related to Chapter 8 Statistics for Regression are listed in this section Exercise 1 Rsquared Are the following statements correct A bigger R2 is always a good indicator of good fit for a singlevariable linear model For a multivariable linear model an adjusted R2 is a better choice when evaluating the quality of the model Exercise 2 Polynomial regression Is the following statement correct To run a regression variable y over single variable x an 8th order polynomial can fit an arbitrary dataset of size 8 If yes why Exercise 3 Doubled R2 Is the following statement correct If a regression model has an R2 of 09 suppose we obtained another set of data that happens to match the original dataset exactly similar to a duplicate What will happen to the regression coefficients What will happen to the R2 Exercise 4 Linear regression on the autompg dataset Run simple linear regression on the autompg dataset Obtain the coefficients between the make year and mpg variables Try to do this by using different methods as we did in the example provided in this chapter Exercise 5 Collinearity Run multivariable linear regression to fit the mpg in the autompg dataset to the rest of the numerical variables Are the other variables highly collinear Calculate VIF to eliminate two variables and run the model again Is the adjusted R2 decreasing or increasing Exercises 349 Exercise 6 Lasso regression and ridge regression Repeat Exercise 5 Collinearity but this time with lasso regularization or ridge regularization Change the regularization coefficient to control the strength of the regularization and plot a set of line plots regarding the regularization parameter versus the coefficient magnitude Chapter 9 Statistics for Classification Exercises related to Chapter 9 Statistics for Classification are listed in this section Exercise 1 Odds and log of odds Determine the correctness of the following statements The odds of an event can take any value between 0 and infinity The log of odds has the meaning as a probability Hint Plot the relationship between the probability and the log of odds Exercise 2 Confusion matrix Determine the proper quadrant for the following scenarios in the coefficient matrix Diagnose a man as pregnant If there are incoming planes and the detector failed to find them A COVID19 patient was correctly diagnosed as positive Exercise 3 F1 score Calculate the F1 score for the following confusion matrix Figure 132 Confusion matrix 350 Exercises and Projects Exercise 4 Grid search for optimal logistic regression coefficients When we maximized the loglikelihood function of the stock prediction example I suggested that you use grid search to find the optimal set of slopes and intercepts Write a function that will find the optimal values and compare them with the ones I provided there Exercise 5 The linregress function Use the linregress function to run linear regression on the Netflix stock data Then verify the R2 values agreement with our manually calculated values Exercise 6 Insufficient data issue with the Bayes classifier For a naïve Bayes classifier if the data is categorical and theres not enough of it we may encounter an issue where the prediction encounters a tie between two or even more possibilities For the stroke risk prediction example verify that the following data gives us an undetermined prediction weighthighhighoildietnosmokingno Exercise 7 Laplace smoothing One solution for solving the 
insufficient data problem is to use Laplace smoothing also known as additive smoothing Please read the wiki page at httpsenwikipedia orgwikiAdditivesmoothing and the lecture note from Toronto university at httpwwwcstorontoedubonnercourses2007scsc411 lectures03bayeszemelpdf before resolving the issue that was raised in Exercise 4 Grid search for optimal logistic regression coefficients Exercise 8 Crossvalidation For the autompg data we introduced in Exercise 1 Loading data of Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing use 5fold crossvalidation to train multivariable linear regression models and evaluate their performance using the mean squared error metric Exercises 351 Exercise 9 ROC curve One important concept I skipped due to space limitations is the ROC curve However it is easy to replicate For the stock prediction logistic regression model pick a series of equally spaced thresholds between 0 and 1 and then create a scatter plot of the true positive rates and the true positive rates Examine the result What you will have obtained is an ROC curve You can find more information about the ROC curve at https enwikipediaorgwikiReceiveroperatingcharacteristic Exercise 10 Predicting cylinders Use the mpg horsepower and displacement variables of the autompg dataset to classify the cylinder variable using a Gaussian Naïve classifier Check out the documentation at httpsscikitlearnorgstablemodulesnaive bayeshtmlgaussiannaivebayes to find out more Hint Gaussian Naïve Bayes is the continuous version of the categorical Naïve Bayes method However it is not the only option Feel free to explore other Naïve Bayes classifiers and compare their performance You are also permitted to remove oddnumber cylinder samples since they are very rare and not informative in general Chapter 10 Statistics for TreeBased Methods Exercises related to Chapter 10 Statistics for TreeBased Methods are listed in this section Exercise 1 Tree concepts Determine the correctness of the following statements A tree can only have a single root node A decision node can only have at most two child nodes A node can only have one parent node 352 Exercises and Projects Exercise 2 Gini impurity visualized For a threecategory dataset with categories A B and C produce a 2D visualization of the Gini impurity as a function of 𝑃𝑃𝐴𝐴 and 𝑃𝑃𝐵𝐵 where 0 𝑃𝑃𝐴𝐴 𝑃𝑃𝐵𝐵 1 Hint Although we have three categories the requirement of summing to one leaves us with two degrees of freedom Exercise 3 Comparing Gini impurity with entropy Another criterion for tree node splitting is known as entropy Read the Wikipedia page at httpsenwikipediaorgwikiEntropyinformationtheory and write functions that will help you redo Exercise 2 Gini impurity revisited but with entropy being used instead What do you find In terms of splitting nodes which method is more aggressiveconservative What if you increase the number of possible categories Can you perform a Monte Carlo simulation Exercise 4 Entropybased tree building Use entropy instead of Gini impurity to rebuild the stroke risk decision tree Exercise 5 Nonbinary tree Without grouping lowrisk and middlerisk groups build a threecategory decision tree from scratch for the stroke risk dataset You can use either Gini impurity or entropy for this Exercise 6 Regression tree concepts Determine the correctness of the following statements The number of possible outputs a regression tree has is the same as the number of partitions it has for the feature space Using absolute error rather than MSE will yield the same 
partition result To split over a continuous variable the algorithm has to try all possible values so for each splitting the time complexity of the naïve approach will be ONM where N is the number of continuous features and M is the number of samples in the node Exercises 353 Exercise 7 Efficiency of regression tree building Write some pseudocode to demonstrate how the tree partition process can be paralleled If you cant please explain which step or steps prohibit this Exercise 8 sklearn example Use sklearn to build a regression tree that predicts the value of mpg in the autompg dataset with the rest of the features Then build a classification tree that predicts the number of cylinders alongside the rest of the features Exercise 9 Overfitting and pruning This is a hard question that may involve higher coding requirements The sklearn API provides a parameter that helps us control the depth of the tree which prevents overfitting Another way we can do this is to build the tree so that its as deep as needed first then prune the tree backward from the leaves Can you implement a helperutility function to achieve this You may need to dive into the details of the DecisionTreeClassifier class to directly manipulate the tree object Chapter 11 Statistics for Ensemble Methods Exercises related to Chapter 11 Statistics for Ensemble Methods are listed in this section Exercise 1 Biasvariance tradeoff concepts Determine the correctness of the following statements Train set and test set splitting will prevent both underfitting and overfitting Nonrandomized sampling will likely cause overfitting Variance in terms of the concept of biasvariance tradeoff means the variance of the model not the variance of the data Exercise 2 Bootstrapping concept Determine the correctness of the following statements If there is any ambiguity please illustrate your answers with examples The bigger the variance of the sample data the more performant bootstrapping will be Bootstrapping doesnt generate new information from the original dataset 354 Exercises and Projects Exercise 3 Bagging concept Determine the correctness of the following statements Each aggregated weak learner is the same weight BaggingClassifier can be set so that its trained in parallel The differences between weak learners are caused by the differences in the sampled data that they are exposed to Exercise 4 From using a bagging tree classifier to random forests You may have heard about random forests before They are treebased machine learning models that are known for their robustness to overfitting The key difference between bagging decision trees and random forests is that a random forest not only samples the records but also samples the features For example lets say were performing a regression task where were using the autompg dataset Here the cylinder feature may not be available during one iteration of the node splitting process Implement your own simple random forest class Compare its performance with the sklearn bagging classifier Exercise 5 Boosting concepts Determine the correctness of the following statements In principle boosting is not trainable in parallel Boosting in principle can decrease the bias of the training set indefinitely Boosting linear weak learners is not efficient because the linear combinations of a linear model is also a linear model Exercise 6 AdaBoost Use AdaBoost to predict the number of cylinders in the autompg dataset against the rest of the variables Exercise 7 Gradient boosting tree visualization Using the tree 
visualization code that was introduced in Chapter 12 Statistics for Tree Based Methods visualize the decision rules of the weak learnerstrees for a gradient descent model Select trees from the first 10 iterations first 40 iterations and then every 100 iterations Do you notice any patterns Project suggestions 355 Everything up to this point has all been exercises Ive prepared for you The next section is dedicated to the projects you can carry out Each project is associated with a chapter and a topic but its recommended that you integrate these projects to build a comprehensive project that you can show to future employers Project suggestions These projects will be classified into three different categories as listed here Elementary projects Elementary projects are ones where you only need knowledge from one or two chapters and are easy to complete Elementary projects only require that you have basic Python programming skills Comprehensive projects Comprehensive projects are ones that require you to review knowledge from several chapters Having a thorough understanding of the example code provided in this book is required to complete a comprehensive project Capstone projects Capstone projects are projects that involve almost all the contents of this book In addition to the examples provided in this book you are expected to learn a significant amount of new knowledge and programming skills to complete the task at hand Lets get started Nontabular data This is an elementary project The knowledge points in this project can be found in Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing and Chapter 2 Essential Statistics for Data Assessment The university dataset in the UCI machine learning repository is stored in a nontabular format httpsarchiveicsuciedumldatasetsUniversity Please examine its format and perform the following tasks 1 Examine the data format visually and then write down some patterns to see whether such patterns can be used to extract the data at specific lines 2 Write a function that will systematically read the data file and store the data contained within in a pandas DataFrame 3 The data description mentioned the existence of both missing data and duplicate data Identify the missing data and deduplicate the duplicated data 356 Exercises and Projects 4 Classify the features into numerical features and categorical features 5 Apply minmax normalization to all the numerical variables 6 Apply median imputation to the missing data Nontabular format Legacy data may be stored in nontabular format for historical reasons The format the university data is stored in is a LISPreadable format which is a powerful old programming language that was invented more than 60 years ago Realtime weather data This is a comprehensive project The knowledge points are mainly from Chapter 1 Fundamentals of Data Collection Cleaning and Preprocessing and Chapter 3 Visualization with Statistical Graphs The free weather API provides current weather data for more than 200000 cities in the world You can apply for a free trial here httpsopenweathermaporgapi In this example you will build a visualization of the temperature for major US cities Refer to the following instructions 1 Read the API documentation for the current endpoint https openweathermaporgcurrent Write some short scripts to test the validity of your API key by querying the current weather in New York If you dont have one apply for a free trial 2 Write another function that will parse the returned data into tabular format 3 Query the 
current weather in Los Angeles Chicago Miami and Denver as well You may want to store their zip codes in a dictionary for reference Properly set the fonts legends and size of markers The color of the line will be determined by the main field of the returned JSON object You are encouraged to choose a color map that associates warmer colors with higher temperatures for example 4 Write another function that requeries the weather information in each city and replots the visualization every 20 minutes Realtime weather data 357 Colormap In data sciencerelated visualization a colormap is a mapping from ordered quantities or indices to an array of colors Different colormaps significantly change the feeling of viewers For more information please refer to httpsmatplotliborg311tutorialscolors colormapshtml I also pointed out the main field in the returned JSON for you Figure 133 The main field of the returned json object We will use the functions and code you developed in this project for a capstone project later in this chapter The rest of this project is quite flexible and up to you 358 Exercises and Projects Goodness of fit for discrete distributions This is an elementary project The topics involved are from Chapter 3 Visualization with Statistical Graphs to Chapter 7 Statistical Hypothesis Testing The description is fancy but you should be able to divide it into small parts This is also a project where many details are ambiguous and you should define your own questions specifically In this project you will write a bot that can guess the parameters of discrete distributions You also need to visualize this process Suppose there is a program that randomly selects integer λ from the list 51020 first then generates Poissondistributed samples based on the pregenerated λ The program will also recreate λ after generating n samples where n is another random integer variable thats uniformly distributed between 500 and 600 By doing this you will have obtained 100000 data points generated by this program Write a function that will calculate the possibilities of λ behind every data point Then visualize this You can approach this problem by using the following instructions 1 First you need to clarify the definition Lets say you can calculate the goodness of fit tests Pvalue and then use this Pvalue to indicate how likely a specific λ is for the parameter of a distribution You can also calculate the likelihoods and compare them The point is that you need to define quantities that can describe the ambiguous term possibilities in the question This is up to you but I used the goodness of fit test as an example 2 Then you should write a program that will generate 100000 pieces of data 3 Next define a window size The window will be moved to capture the data and run the goodness of fit test on the windowed data Choose a window size and justify your choice Whats the problem with a window size being too small and whats the problem with a window size being too large 4 Calculate the Pvalues for the goodness of fit results Plot them along with the original data Choose your aesthetics wisely The idea behind this exercise There are many observations that can be modeled as a process thats controlled by a hidden parameter Its key to model such a process and determine the parameters In this project you already know that the mechanism behind the random variable generation is a Poisson process but in most cases you wont know this This project is a simplified scenario Realtime weather data 359 Building a weather prediction web 
app This is a capstone project To complete this project you should get your hands on the Dash framework https dashgalleryplotlyhostPortal Dash is a framework that you can use to quickly turn a Python script into a deployable data science application You should at least go over the first four sessions of the tutorial and focus on the Dash map demo to learn how to render maps httpsdashgalleryplotlyhostdashmapd demo To finish this project follow these steps 1 Finish the prerequisite mentioned in the project description 2 Write a function that will map a list of city zip codes to markers on a US map and render them Learn how to change the contents on the map when hovering over it with your mouse 3 Use the weather API to obtain a weather report for the last 7 days for the major cities we listed in earlier projects Build a linear regressor that will predict the temperature for the following data Write a function that will automate this step so that every time you invoke the function new data will be queried 4 Design a layout where the page is split into two parts 5 The left panel should be a map where each city is highlighted with todays temperature value 6 The right panel should be a line chart where regression is performed so that tomorrows temperature is predicted and distinguished from known previous temperature 7 This step is optional You are encouraged to use other machine learning algorithms rather than simple linear regression to allow the users to switch between different algorithms For example you can use both simple linear regression and regression trees 8 To train a regression tree you might need more historical data Explore the API options and give it a try 9 You need a UI component to allow the users to change between algorithms The toggle button may be what you need httpsdashplotlycomdash daqtoggleswitch 360 Exercises and Projects This project is particularly useful because as a data scientist an interactive app is probably the best way to demonstrate your skills in terms of both statistics and programming The last project is similar in this regard Building a typing suggestion app This is a capstone project In this project you are going to build a web app that predicts what the users typing by training them on large word corpus There are three components in this app but they are not strongly correlated You can start from any of them Processing word data The typing suggestion is based on Bayes theorem where the most likely or topk predictions are made by examining a large set of sentences You can obtain an English text corpus from the English Corpora website https wwwenglishcorporaorghistoryasp You can start from a small corpus such as the manually annotated subcorpus project httpwwwanc orgdatamascdownloadsdatadownload You need to tokenize the data by sequencing words You are encouraged to start with tutorials from SpaCy or NLTK You should find material online such as httpsrealpythoncom naturallanguageprocessingspacypython to help with this Building the model You need to build a twogram model which means counting the number of appearances of neighboring word pairs For example I eat is likely to be more frequent than I fly so the word eat will be more likely to show up than fly You need to create a model or module that can perform such a task quickly Also you may want to save your data in a local disk persistently so that you dont need to run the model building process every time your app starts Building the app The last step is to create a UI The documentation for the input 
Building the app
The last step is to create a UI. The documentation for the Input component can be found here: https://dash.plotly.com/dash-core-components/input. You need to decide how you want your users to see your suggestions; you can achieve this by creating a new UI component.

One additional feature you may wish to add to your app is a spam filter. With it, you can inform your user, in real time, of how likely the input text is to look like a spam message. A minimal sketch of wiring the Input component to a suggestion model follows this project description.
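As a starting point for the UI, here is a minimal sketch that wires a Dash Input component to a stand-in suggestion model. The component IDs, the toy BIGRAMS dictionary, and the layout are illustrative assumptions of mine; in your app, you would load the persisted model you built in the previous component. The import style assumes a recent Dash release (older releases import dash_core_components and dash_html_components as separate packages).

```python
from dash import Dash, Input, Output, dcc, html

# Stand-in for the persisted two-gram model from the previous component (assumption).
BIGRAMS = {"i": ["eat", "fly"], "eat": ["rice", "bread"]}

app = Dash(__name__)
app.layout = html.Div([
    dcc.Input(id="typed-text", type="text", placeholder="Start typing..."),
    html.Div(id="suggestions"),
])

@app.callback(Output("suggestions", "children"), Input("typed-text", "value"))
def show_suggestions(value):
    """Re-render the suggestion line every time the input text changes."""
    words = (value or "").lower().split()
    if not words:
        return "Suggestions will appear here."
    candidates = BIGRAMS.get(words[-1], [])
    return "Suggestions: " + (", ".join(candidates) if candidates else "(none)")

if __name__ == "__main__":
    app.run_server(debug=True)
```

A spam score could be surfaced in the same way: add a second html.Div and a callback that reports, say, a naïve Bayes estimate of how spam-like the current text is.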
Further reading
With that, you have reached the last part of this book. In this section, I am going to recommend some of the best books on data science, statistics, and machine learning I've found, all of which can act as companions to this book. I have grouped them into categories and shared my personal thoughts on them.

Textbooks
Books that fall into this category read like textbooks and are often used as textbooks, or at least reference books, in universities. Their quality has been proven and their value is timeless.

The first one is Statistical Inference by George Casella (2nd Edition), which covers the material in the first several chapters of this book. It contains a multitude of useful exercises and practice problems, all of which are explained in detail. It is hard to get lost when reading this book.

The second book is The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2nd Edition). This book is the bible of traditional statistical learning. It's not easy for beginners who are not comfortable with the conciseness of mathematical proofs. There is another book, An Introduction to Statistical Learning: With Applications in R by Gareth James and Daniela Witten, that is simpler and easier to digest. Both books cover all the topics in this book from Chapter 6, Parametric Estimation, onward.

Visualization
The first book I recommend about visualization is The Visual Display of Quantitative Information by Edward R. Tufte. This book will not teach you how to code or plot a real visualization; instead, it teaches you the philosophy surrounding visualization.

The second book I recommend is also by Edward R. Tufte. It is called Visual and Statistical Thinking: Displays of Evidence for Making Decisions. It contains many examples where visualizations are done correctly and incorrectly, and it is also very entertaining to read.

I won't recommend any specific books dedicated entirely to coding examples for visualization here. The easiest way to learn about visualization is to refer to this book's GitHub repository and replicate the examples provided. Of course, it would be great if you were to get a hard copy so that you can look up information quickly and review it frequently.

Exercising your mind
This category contains books that don't read like textbooks but still require significant effort to read, think about, and digest.

The first book is Common Errors in Statistics (and How to Avoid Them) by Phillip I. Good and James W. Hardin. This book covers concepts surrounding the usage of visualizations that are widely misunderstood.

The second book is Stat Labs: Mathematical Statistics Through Applications (Springer Texts in Statistics) by Deborah Nolan and Terry P. Speed. This book is unique because it starts every topic with real-world data and meaningful questions that should be asked. You will find that this book's reading difficulty increases quickly. I highly recommend it. You may need to use a pen and paper to perform the calculations and tabulations that are required of you.

Summary
Congratulations on reaching the end of this book! In this chapter, we introduced exercises and projects of varying difficulty. You were also provided with a list of additional books that will help you as you progress through the exercises and projects mentioned in this chapter.

I hope these additional materials will boost your statistical learning progress and make you an even better data scientist. If you have any questions about the content of this book, please feel free to open an issue on the official GitHub repository for this book: https://github.com/PacktPublishing/Essential-Statistics-for-Non-STEM-Data-Analysts. We are always happy to answer your questions there.

Other Books You May Enjoy
If you enjoyed this book, you may be interested in these other books by Packt:

Practical Data Analysis Using Jupyter Notebook
Marc Wintjen
ISBN: 9781838826031
- Understand the importance of data literacy and how to communicate effectively using data
- Find out how to use Python packages such as NumPy, pandas, Matplotlib, and the Natural Language Toolkit (NLTK) for data analysis
- Wrangle data and create DataFrames using pandas
- Produce charts and data visualizations using time-series datasets
- Discover relationships and how to join data together using SQL
- Use NLP techniques to work with unstructured data to create sentiment analysis models

Hands-On Mathematics for Deep Learning
Jay Dawani
ISBN: 9781838647292
- Understand the key mathematical concepts for building neural network models
- Discover core multivariable calculus concepts
- Improve the performance of deep learning models using optimization techniques
- Cover optimization algorithms, from basic stochastic gradient descent (SGD) to the advanced Adam optimizer
- Understand computational graphs and their importance in DL
- Explore the backpropagation algorithm to reduce output error
- Cover DL algorithms such as convolutional neural networks (CNNs), sequence models, and generative adversarial networks (GANs)

Leave a review - let other readers know what you think
Please share your thoughts on this book with others by leaving a review on the site that you bought it from. If you purchased the book from Amazon, please leave us an honest review on this book's Amazon page. This is vital so that other potential readers can see and use your unbiased opinion to make purchasing decisions, we can understand what our customers think about our products, and our authors can see your feedback on the title that they have worked with Packt to create. It will only take a few minutes of your time, but it is valuable to other potential customers, our authors, and Packt. Thank you!