
Course notes: Statistical Thinking in Python (Part 1)

2017-09-23 20:37
Some basic techniques for exploratory data analysis (EDA).

Import matplotlib.pyplot and seaborn as their usual aliases (plt and sns).
Use seaborn to set the plotting defaults.
Plot a histogram of the Iris versicolor petal lengths using plt.hist() and the provided NumPy array versicolor_petal_length.
Show the histogram using plt.show().

# Import plotting modules
import matplotlib.pyplot as plt
import seaborn as sns

# Set default Seaborn style
sns.set()

# Plot histogram of versicolor petal lengths
plt.hist(versicolor_petal_length)

# Show histogram
plt.show()

Label the axes. Don't forget that you should always include units in your axis labels. Your y-axis label is just 'count'. Your x-axis label is 'petal length (cm)'. The units are essential!
Display the plot constructed in the above steps using plt.show().

# Plot histogram of versicolor petal lengths
_ = plt.hist(versicolor_petal_length)

# Label axes
plt.xlabel("petal length (cm)")
plt.ylabel("count")

# Show histogram
plt.show()

Rule of thumb: the number of bins in a histogram is usually taken to be the square root of the number of data points.

Import numpy as np. This gives access to the square root function, np.sqrt().
Determine how many data points you have using len().
Compute the number of bins using the square root rule.
Convert the number of bins to an integer using the built-in int() function.
Generate the histogram and make sure to use the bins keyword argument.
Hit 'Submit Answer' to plot the figure and see the fruit of your labors!

# Import numpy
import numpy as np

# Compute number of data points: n_data
n_data = len(versicolor_petal_length)

# Number of bins is the square root of number of data points: n_bins
n_bins = np.sqrt(n_data)

# Convert number of bins to integer: n_bins
n_bins = int(n_bins)

# Plot the histogram
plt.hist(versicolor_petal_length, bins=n_bins)

# Label axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('count')

# Show histogram
plt.show()

A bee swarm plot example from the lecture, using the swing-state election data in df_swing:

_ = sns.swarmplot(x='state', y='dem_share', data=df_swing)
_ = plt.xlabel('state')
_ = plt.ylabel('percent of vote for Obama')
plt.show()


In the IPython Shell, inspect the DataFrame df using df.head(). This will let you identify which column names you need to pass as the x and y keyword arguments in your call to sns.swarmplot().
Use sns.swarmplot() to make a bee swarm plot from the DataFrame containing the Fisher iris data set, df.
The x-axis should contain each of the three species, and the y-axis should contain the petal lengths.
Label the axes.
Show your plot.

# Create bee swarm plot with Seaborn's default settings
sns.swarmplot(x='species', y='petal length (cm)', data=df)

# Label the axes
plt.xlabel('species')
plt.ylabel('petal length (cm)')

# Show the plot
plt.show()

Define a function with the signature ecdf(data). Within the function definition:
Compute the number of data points, n, using the len() function.
The x-values are the sorted data. Use the np.sort() function to perform the sorting.
The y-data of the ECDF go from 1/n to 1 in equally spaced increments. You can construct this using np.arange(). Remember, however, that the end value in np.arange() is not inclusive. Therefore, np.arange() will need to go from 1 to n+1. Be sure to divide this by n.
The function returns the values x and y.

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n

    return x, y

Use ecdf() to compute the ECDF of versicolor_petal_length. Unpack the output into x_vers and y_vers.
Plot the ECDF as dots. Remember to include marker='.' and linestyle='none' in addition to x_vers and y_vers as arguments inside plt.plot().
Set the margins of the plot with plt.margins() so that no data points are cut off. Use a 2% margin.
Label the axes. You can label the y-axis 'ECDF'.
Show your plot.

# Compute ECDF for versicolor data: x_vers, y_vers
x_vers, y_vers = ecdf(versicolor_petal_length)

# Generate plot
plt.plot(x_vers, y_vers, marker='.', linestyle='none')

# Make the margins nice
plt.margins(0.02)

# Label the axes
plt.ylabel('ECDF')
plt.xlabel('petal length (cm)')

# Display the plot
plt.show()

Compute ECDFs for each of the three species using your ecdf() function. The variables setosa_petal_length, versicolor_petal_length, and virginica_petal_length are all in your namespace. Unpack the ECDFs into x_set, y_set, x_vers, y_vers, and x_virg, y_virg, respectively.
Plot all three ECDFs on the same plot as dots. To do this, you will need three plt.plot() commands. Assign the result of each to _.
Specify 2% margins.
A legend and axis labels have been added for you, so hit 'Submit Answer' to see all the ECDFs!

# Compute ECDFs
x_set, y_set = ecdf(setosa_petal_length)
x_vers, y_vers = ecdf(versicolor_petal_length)
x_virg, y_virg = ecdf(virginica_petal_length)

# Plot all ECDFs on the same plot
_ = plt.plot(x_set, y_set, marker='.', linestyle='none')
_ = plt.plot(x_vers, y_vers, marker='.', linestyle='none')
_ = plt.plot(x_virg, y_virg, marker='.', linestyle='none')

# Make nice margins
plt.margins(0.02)

# Annotate the plot
plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right')
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()

Computing percentiles of the data

Create percentiles, a NumPy array of the percentiles you want to compute: the 2.5th, 25th, 50th, 75th, and 97.5th. You can do so by creating a list containing these ints/floats and converting the list to a NumPy array using np.array(). For example, np.array([30, 50]) would create an array consisting of the 30th and 50th percentiles.
Use np.percentile() to compute the percentiles of the petal lengths from the Iris versicolor samples. The variable versicolor_petal_length is in your namespace.
Print the percentiles.

# Specify array of percentiles: percentiles
percentiles = np.array([2.5, 25, 50, 75, 97.5])

# Compute percentiles: ptiles_vers
ptiles_vers = np.percentile(versicolor_petal_length, percentiles)

# Print the result
print(ptiles_vers)

Plot the ECDF and mark the percentile points on it:

Plot the percentiles as red diamonds on the ECDF. Pass the x and y co-ordinates, ptiles_vers and percentiles/100, as positional arguments and specify the marker='D', color='red', and linestyle='none' keyword arguments. The argument for the y-axis, percentiles/100, has been specified for you.
Display the plot.

# Plot the ECDF
_ = plt.plot(x_vers, y_vers, '.')
plt.margins(0.02)
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Overlay percentiles as red diamonds
_ = plt.plot(ptiles_vers, percentiles/100, marker='D', color='red',
             linestyle='none')

# Show the plot
plt.show()

Computing the variance

Create an array called differences that is the difference between the petal lengths (versicolor_petal_length) and the mean petal length. The variable versicolor_petal_length is already in your namespace as a NumPy array, so you can take advantage of NumPy's vectorized operations.
Square each element in this array. For example, x**2 squares each element in the array x. Store the result as diff_sq.
Compute the mean of the elements in diff_sq using np.mean(). Store the result as variance_explicit.
Compute the variance of versicolor_petal_length using np.var(). Store the result as variance_np.
Print both variance_explicit and variance_np in one print call to make sure they are consistent.

# Array of differences to mean: differences
differences = versicolor_petal_length - np.mean(versicolor_petal_length)

# Square the differences: diff_sq
diff_sq = differences ** 2

# Compute the mean square difference: variance_explicit
variance_explicit = np.mean(diff_sq)

# Compute the variance using NumPy: variance_np
variance_np = np.var(versicolor_petal_length)

# Print the results
print(variance_explicit, variance_np)

Computing the Pearson correlation coefficient

Define a function with signature pearson_r(x, y).
Use np.corrcoef() to compute the correlation matrix of x and y (pass them to np.corrcoef() in that order).
The function returns entry [0,1] of the correlation matrix.
Compute the Pearson correlation between the data in the arrays versicolor_petal_length and versicolor_petal_width. Assign the result to r.
Print the result.

def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    # Compute correlation matrix: corr_mat
    corr_mat = np.corrcoef(x, y)

    # Return entry [0,1]
    return corr_mat[0,1]

# Compute Pearson correlation coefficient for I. versicolor: r
r = pearson_r(versicolor_petal_length, versicolor_petal_width)

# Print the result
print(r)

Draw samples out of the Binomial distribution using np.random.binomial(). You should use parameters n=100 and p=0.05, and set the size keyword argument to 10000.
Compute the CDF using your previously-written ecdf() function.
Plot the CDF with axis labels. The x-axis here is the number of defaults out of 100 loans, while the y-axis is the CDF.
Show the plot.
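The draws below are random, so exact output differs from run to run; the course exercises usually seed NumPy's random number generator first. The seed value used here is only the conventional choice from the course, an assumption in these notes:

# Seed the random number generator for reproducibility (seed value assumed)
np.random.seed(42)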

# Take 10,000 samples out of the binomial distribution: n_defaults
n_defaults = np.random.binomial(100, 0.05, size=10000)

# Compute CDF: x, y
x, y = ecdf(n_defaults)

# Plot the CDF with axis labels
plt.plot(x, y, marker='.', linestyle='none')
plt.xlabel('the number of defaults out of 100 loans')
plt.ylabel('CDF')

# Show the plot
plt.show()

Sampling from and plotting the Normal distribution

Draw 100,000 samples from a Normal distribution that has a mean of 20 and a standard deviation of 1. Do the same for Normal distributions with standard deviations of 3 and 10, each still with a mean of 20. Assign the results to samples_std1, samples_std3, and samples_std10, respectively.
Plot a histogram of each of the samples; for each, use 100 bins, also using the keyword arguments normed=True and histtype='step'. The latter keyword argument makes the plot look much like the smooth theoretical PDF. You will need to make 3 plt.hist() calls.
Hit 'Submit Answer' to make a legend, showing which standard deviations you used, and show your plot! There is no need to label the axes because we have not defined what is being described by the Normal distribution; we are just looking at shapes of PDFs.

# Draw 100000 samples from Normal distribution with stds of interest: samples_std1, samples_std3, samples_std10
samples_std1 = np.random.normal(20, 1, 100000)
samples_std3 = np.random.normal(20, 3, 100000)
samples_std10 = np.random.normal(20, 10, 100000)

# Make histograms
plt.hist(samples_std1, normed=True, histtype='step', bins=100)
plt.hist(samples_std3, normed=True, histtype='step', bins=100)
plt.hist(samples_std10, normed=True, histtype='step', bins=100)

# Make a legend, set limits and show plot
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'))
plt.ylim(-0.01, 0.42)
plt.show()
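Note that the normed keyword was removed in Matplotlib 3.1; on current versions the equivalent keyword is density=True, for example (using the samples_std1 array drawn above):

# Equivalent call on Matplotlib >= 3.1, where `normed` no longer exists
plt.hist(samples_std1, bins=100, density=True, histtype='step')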

  
Fitting a Normal distribution to the data and comparing it with the empirical distribution

Compute mean and standard deviation of Belmont winners' times with the two outliers removed. The NumPy array belmont_no_outliers has these data.
Take 10,000 samples out of a normal distribution with this mean and standard deviation using np.random.normal().
Compute the CDF of the theoretical samples and the ECDF of the Belmont winners' data, assigning the results to x_theor, y_theor and x, y, respectively.
Hit submit to plot the CDF of your samples with the ECDF, label your axes and show the plot.

# Compute mean and standard deviation: mu, sigma
mu = np.mean(belmont_no_outliers)
sigma = np.std(belmont_no_outliers)

# Sample out of a normal distribution with this mu and sigma: samples
samples = np.random.normal(mu, sigma, 10000)

# Get the CDF of the samples and of the data
x_theor, y_theor = ecdf(samples)
x, y = ecdf(belmont_no_outliers)

# Plot the CDFs and show the plot
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('Belmont winning time (sec.)')
_ = plt.ylabel('CDF')
plt.show()

The Exponential distribution

Define a function with call signature successive_poisson(tau1, tau2, size=1) that samples the waiting time for a no-hitter and a hit of the cycle.
Draw waiting times tau1 (size number of samples) for the no-hitter out of an exponential distribution and assign to t1.
Draw waiting times tau2 (size number of samples) for hitting the cycle out of an exponential distribution and assign to t2.
The function returns the sum of the waiting times for the two events.

def successive_poisson(tau1, tau2, size=1):
    """Sample total waiting time for two successive Poisson processes."""
    # Draw samples out of first exponential distribution: t1
    t1 = np.random.exponential(tau1, size)

    # Draw samples out of second exponential distribution: t2
    t2 = np.random.exponential(tau2, size)

    return t1 + t2

Use your successive_poisson() function to draw 100,000 samples out of the distribution of waiting times for observing a no-hitter and a hitting of the cycle.
Plot the PDF of the waiting times using the step histogram technique of a previous exercise. Don't forget the necessary keyword arguments. You should use bins=100, normed=True, and histtype='step'.
Label the axes.
Show your plot.

# Draw samples of waiting times: waiting_times
waiting_times = successive_poisson(764, 715, size=100000)

# Make the histogram
plt.hist(waiting_times, bins=100, normed=True, histtype='step')

# Label axes
plt.xlabel('total waiting time (games)')
plt.ylabel('PDF')

# Show the plot
plt.show()

Polynomial fitting (linear regression with np.polyfit)

Compute the slope and intercept of the regression line using np.polyfit(). Remember, fertility is on the y-axis and illiteracy on the x-axis.
Print out the slope and intercept from the linear regression.
To plot the best fit line, create an array x that consists of 0 and 100 using np.array(). Then, compute the theoretical values of y based on your regression parameters, i.e., y = a * x + b.
Plot the data and the regression line on the same plot. Be sure to label your axes.
Hit 'Submit Answer' to display your plot.

# Plot the illiteracy rate versus fertility
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')

# Perform a linear regression using np.polyfit(): a, b
a, b = np.polyfit(illiteracy, fertility, 1)

# Print the results to the screen
print('slope =', a, 'children per woman / percent illiterate')
print('intercept =', b, 'children per woman')

# Make theoretical line to plot
x = np.array([0, 100])
y = a * x + b

# Add regression line to your plot
_ = plt.plot(x, y)

# Draw the plot
plt.show()

Specify the values of the slope for which to compute the RSS. Use np.linspace() to get 200 points in the range between 0 and 0.1. For example, to get 100 points in the range between 0 and 0.5, you could use np.linspace() like so: np.linspace(0, 0.5, 100).
Initialize an array, rss, to contain the RSS using np.empty_like() and the array you created above. The empty_like() function returns a new array with the same shape and type as a given array (in this case, a_vals).
Write a for loop to compute the RSS for each slope value. Hint: the RSS is given by np.sum((y_data - a * x_data - b)**2). The variable b you computed in the last exercise is already in your namespace. Here, fertility is the y_data and illiteracy the x_data.
Plot the RSS (rss) versus slope (a_vals).
Hit 'Submit Answer' to see the plot!

# Specify slopes to consider: a_vals
a_vals = np.linspace(0, 0.1, 200)

# Initialize sum of square of residuals: rss
rss = np.empty_like(a_vals)

# Compute sum of square of residuals for each value of a_vals
for i, a in enumerate(a_vals):
    rss[i] = np.sum((fertility - a*illiteracy - b)**2)

# Plot the RSS
plt.plot(a_vals, rss, '-')
plt.xlabel('slope (children per woman / percent illiterate)')
plt.ylabel('sum of square of residuals')
plt.show()

Compute the parameters for the slope and intercept using np.polyfit(). The Anscombe data are stored in the arrays x and y.
Print the slope a and intercept b.
Generate theoretical x and y data from the linear regression. Your x array, which you can create with np.array(), should consist of 3 and 15. To generate the y data, multiply the slope by x_theor and add the intercept.
Plot the Anscombe data as a scatter plot and then plot the theoretical line. Remember to include the marker='.' and linestyle='none' keyword arguments in addition to x and y when you plot the Anscombe data as a scatter plot. You do not need these arguments when plotting the theoretical line.
Hit 'Submit Answer' to see the plot!

# Perform linear regression: a, b
a, b = np.polyfit(x, y, 1)

# Print the slope and intercept
print(a, b)

# Generate theoretical x and y data: x_theor, y_theor
x_theor = np.array([3, 15])
y_theor = x_theor * a + b

# Plot the Anscombe data and theoretical line
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')

# Label the axes
plt.xlabel('x')
plt.ylabel('y')

# Show the plot
plt.show()

A resampling method (the bootstrap) for when the sample is small

Write a for loop to acquire 50 bootstrap samples of the rainfall data and plot their ECDFs.
Use np.random.choice() to generate a bootstrap sample from the NumPy array rainfall. Be sure that the size of the resampled array is len(rainfall).
Use the function ecdf() that you wrote in the prequel to this course to generate the x and y values for the ECDF of the bootstrap sample bs_sample.
Plot the ECDF values. Specify color='gray' (to make gray dots) and alpha=0.1 (to make them semi-transparent, since we are overlaying so many) in addition to the marker='.' and linestyle='none' keyword arguments.
Use ecdf() to generate x and y values for the ECDF of the original rainfall data available in the array rainfall.
Plot the ECDF values of the original data.
Hit 'Submit Answer' to visualize the samples!

for _ in range(50):
    # Generate bootstrap sample: bs_sample
    bs_sample = np.random.choice(rainfall, size=len(rainfall))

    # Compute and plot ECDF from bootstrap sample
    x, y = ecdf(bs_sample)
    _ = plt.plot(x, y, marker='.', linestyle='none',
                 color='gray', alpha=0.1)

# Compute and plot ECDF from original data
x, y = ecdf(rainfall)
_ = plt.plot(x, y, marker='.')

# Make margins and label axes
plt.margins(0.02)
_ = plt.xlabel('yearly rainfall (mm)')
_ = plt.ylabel('ECDF')

# Show the plot
plt.show()

Draw 10000 bootstrap replicates of the mean annual rainfall using your draw_bs_reps() function and the rainfall array. Hint: pass in np.mean for func to compute the mean. As a reminder, draw_bs_reps() accepts 3 arguments: data, func, and size (a sketch of this helper is given after these instructions).
Compute and print the standard error of the mean of rainfall. The formula to compute this is np.std(data) / np.sqrt(len(data)).
Compute and print the standard deviation of your bootstrap replicates bs_replicates.
Make a histogram of the replicates using the normed=True keyword argument and 50 bins.
Hit 'Submit Answer' to see the plot!
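The definition of draw_bs_reps() comes from an earlier exercise that these notes skip; a minimal sketch consistent with how it is called here (data, func, size) would be:

def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates of func applied to resampled data (sketch)."""
    bs_replicates = np.empty(size)
    for i in range(size):
        # Resample with replacement and apply the summary function
        bs_sample = np.random.choice(data, size=len(data))
        bs_replicates[i] = func(bs_sample)
    return bs_replicates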

# Take 10,000 bootstrap replicates of the mean: bs_replicates
bs_replicates = draw_bs_reps(rainfall, np.mean, 10000)

# Compute and print SEM
sem = np.std(rainfall) / np.sqrt(len(rainfall))
print(sem)

# Compute and print standard deviation of bootstrap replicates
bs_std = np.std(bs_replicates)
print(bs_std)

# Make a histogram of the results
_ = plt.hist(bs_replicates, bins=50, normed=True)
_ = plt.xlabel('mean annual rainfall (mm)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

Computing confidence intervals

Generate 10000 bootstrap replicates of τ from the nohitter_times data using your draw_bs_reps() function. Recall that the optimal τ is calculated as the mean of the data.
Compute the 95% confidence interval using np.percentile() and passing in two arguments: the array bs_replicates, and the list of percentiles, in this case 2.5 and 97.5.
Print the confidence interval.
Plot a histogram of your bootstrap replicates. This has been done for you, so hit 'Submit Answer' to see the plot!

# Draw bootstrap replicates of the mean no-hitter time (equal to tau): bs_replicates
bs_replicates = draw_bs_reps(nohitter_times, np.mean, 10000)

# Compute the 95% confidence interval: conf_int
conf_int = np.percentile(bs_replicates, [2.5, 97.5])

# Print the confidence interval
print('95% confidence interval =', conf_int, 'games')

# Plot the histogram of the replicates
_ = plt.hist(bs_replicates, bins=50, normed=True)
_ = plt.xlabel(r'$\tau$ (games)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

Pairs bootstrap for linear regression

Define a function with call signature draw_bs_pairs_linreg(x, y, size=1) to perform pairs bootstrap estimates on linear regression parameters.
Use np.arange() to set up an array of indices going from 0 to len(x). These are what you will resample and use to pick values out of the x and y arrays.
Use np.empty() to initialize the slope and intercept replicate arrays to be of size size.
Write a for loop to:
resample the indices inds, using np.random.choice() to do this;
make new x and y arrays bs_x and bs_y using the resampled indices bs_inds (to do this, slice x and y with bs_inds);
use np.polyfit() on the new x and y arrays and store the computed slope and intercept.
Return the pairs bootstrap replicates of the slope and intercept.

def draw_bs_pairs_linreg(x, y, size=1):
    """Perform pairs bootstrap for linear regression."""
    # Set up array of indices to sample from: inds
    inds = np.arange(len(x))

    # Initialize replicates: bs_slope_reps, bs_intercept_reps
    bs_slope_reps = np.empty(size)
    bs_intercept_reps = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_x, bs_y = x[bs_inds], y[bs_inds]
        bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, 1)

    return bs_slope_reps, bs_intercept_reps

Generate an array of x-values consisting of 0 and 100 for the plot of the regression lines. Use the np.array() function for this.
Write a for loop in which you plot a regression line with a slope and intercept given by the pairs bootstrap replicates. Do this for 100 lines. When plotting the regression lines in each iteration of the for loop, recall the regression equation y = a*x + b. Here, a is bs_slope_reps[i] and b is bs_intercept_reps[i]. Specify the keyword arguments linewidth=0.5, alpha=0.2, and color='red' in your call to plt.plot().
Make a scatter plot with illiteracy on the x-axis and fertility on the y-axis. Remember to specify the marker='.' and linestyle='none' keyword arguments.
Label the axes, set a 2% margin, and show the plot. This has been done for you, so hit 'Submit Answer' to visualize the bootstrap regressions!

# Generate array of x-values for bootstrap lines: x
x = np.array([0, 100])

# Plot the bootstrap lines
for i in range(100):
    _ = plt.plot(x, bs_slope_reps[i]*x + bs_intercept_reps[i],
                 linewidth=0.5, alpha=0.2, color='red')

# Plot the data
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')

# Label axes, set the margins, and show the plot
_ = plt.xlabel('illiteracy')
_ = plt.ylabel('fertility')
plt.margins(0.02)
plt.show()


Visualizing permutation sampling

Write a for loop to generate 50 permutation samples, compute their ECDFs, and plot them.
Generate a permutation sample pair from rain_july and rain_november using your permutation_sample() function (sketched below).
Generate the x and y values for an ECDF for each of the two permutation samples using your ecdf() function.
Plot the ECDF of the first permutation sample (x_1 and y_1) as dots. Do the same for the second permutation sample (x_2 and y_2).
Generate x and y values for ECDFs for the rain_july and rain_november data and plot the ECDFs using respectively the keyword arguments color='red' and color='blue'.
Label your axes, set a 2% margin, and show your plot. This has been done for you, so just hit 'Submit Answer' to view the plot!
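The permutation_sample() helper was written in an earlier exercise that is not reproduced in these notes; a minimal sketch of what it does (pool the two data sets, shuffle, split) would be:

def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets (sketch)."""
    # Pool the two data sets, scramble their order, and split them back in two
    data = np.concatenate((data1, data2))
    permuted_data = np.random.permutation(data)
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]
    return perm_sample_1, perm_sample_2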

for _ in range(50):
    # Generate permutation samples
    perm_sample_1, perm_sample_2 = permutation_sample(rain_july, rain_november)

    # Compute ECDFs
    x_1, y_1 = ecdf(perm_sample_1)
    x_2, y_2 = ecdf(perm_sample_2)

    # Plot ECDFs of permutation sample
    _ = plt.plot(x_1, y_1, marker='.', linestyle='none',
                 color='red', alpha=0.02)
    _ = plt.plot(x_2, y_2, marker='.', linestyle='none',
                 color='blue', alpha=0.02)

# Create and plot ECDFs from original data
x_1, y_1 = ecdf(rain_july)
x_2, y_2 = ecdf(rain_november)
_ = plt.plot(x_1, y_1, marker='.', linestyle='none', color='red')
_ = plt.plot(x_2, y_2, marker='.', linestyle='none', color='blue')

# Label axes, set margin, and show plot
plt.margins(0.02)
_ = plt.xlabel('monthly rainfall (mm)')
_ = plt.ylabel('ECDF')
plt.show()

Computing p-values:

Define a function with call signature diff_of_means(data_1, data_2) that returns the difference in means between two data sets: the mean of data_1 minus the mean of data_2.
Use this function to compute the empirical difference of means that was observed in the frogs.
Draw 10,000 permutation replicates of the difference of means.
Compute the p-value.
Print the p-value.
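The replicate-drawing code below relies on draw_perm_reps(), which is also not defined in these notes; a minimal sketch, assuming it builds on the permutation_sample() helper sketched above, could be:

def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate permutation replicates of func applied to permuted data sets (sketch)."""
    perm_replicates = np.empty(size)
    for i in range(size):
        # Re-shuffle the pooled data and recompute the test statistic
        perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2)
        perm_replicates[i] = func(perm_sample_1, perm_sample_2)
    return perm_replicates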

def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""
    # The difference of means of data_1, data_2: diff
    diff = np.mean(data_1) - np.mean(data_2)
    return diff

# Compute difference of mean impact force from experiment: empirical_diff_means
empirical_diff_means = diff_of_means(force_a, force_b)

# Draw 10,000 permutation replicates: perm_replicates
perm_replicates = draw_perm_reps(force_a, force_b,
                                 diff_of_means, size=10000)

# Compute p-value: p
p = np.sum(perm_replicates >= empirical_diff_means) / len(perm_replicates)

# Print the result
print('p-value =', p)

Translate the impact forces of Frog B such that its mean is 0.55 N.
Use your draw_bs_reps() function to take 10,000 bootstrap replicates of the mean of your translated forces.
Compute the p-value by finding the fraction of your bootstrap replicates that are less than the observed mean impact force of Frog B. Note that the variable of interest here is force_b.
Print your p-value.

# Make an array of translated impact forces: translated_force_b
translated_force_b = force_b - np.mean(force_b) + 0.55

# Take bootstrap replicates of Frog B's translated impact forces: bs_replicates
bs_replicates = draw_bs_reps(translated_force_b, np.mean, 10000)

# Compute fraction of replicates that are less than the observed Frog B force: p
p = np.sum(bs_replicates <= np.mean(force_b)) / 10000

# Print the p-value
print('p = ', p)

Construct Boolean arrays, dems and reps, that contain the votes of the respective parties; e.g., dems has 153 True entries and 91 False entries.
Write a function, frac_yay_dems(dems, reps), that returns the fraction of Democrats that voted yay. The first input is an array of Booleans. (Two inputs are required to use your draw_perm_reps() function, but the second is not used.)
Use your draw_perm_reps() function to draw 10,000 permutation replicates of the fraction of Democrat yay votes.
Compute and print the p-value.

# Construct arrays of data: dems, reps
dems = np.array([True] * 153 + [False] * 91)
reps = np.array([True] * 136 + [False] * 35)

def frac_yay_dems(dems, reps):
    """Compute fraction of Democrat yay votes."""
    frac = np.sum(dems) / len(dems)
    return frac

# Acquire permutation samples: perm_replicates
perm_replicates = draw_perm_reps(dems, reps, frac_yay_dems, 10000)

# Compute and print p-value: p
p = np.sum(perm_replicates <= 153/244) / len(perm_replicates)
print('p-value =', p)

Compute the observed difference in mean inter-no-hitter time using diff_of_means().
Generate 10,000 permutation replicates of the difference of means using draw_perm_reps().
Compute and print the p-value.

# Compute the observed difference in mean inter-no-hitter times: nht_diff_obs
nht_diff_obs = diff_of_means(nht_dead, nht_live)

# Acquire 10,000 permutation replicates of difference in mean no-hitter time: perm_replicates
perm_replicates = draw_perm_reps(nht_dead, nht_live,
                                 diff_of_means, size=10000)

# Compute and print the p-value: p
p = np.sum(perm_replicates <= nht_diff_obs) / len(perm_replicates)
print('p-val =', p)

Compute the observed Pearson correlation between illiteracy and fertility.
Initialize an array to store your permutation replicates.
Write a for loop to draw 10,000 replicates:
Permute the illiteracy measurements using np.random.permutation().
Compute the Pearson correlation between the permuted illiteracy array, illiteracy_permuted, and fertility.
Compute and print the p-value from the replicates.

# Compute observed correlation: r_obs
r_obs = pearson_r(illiteracy, fertility)

# Initialize permutation replicates: perm_replicates
perm_replicates = np.empty(10000)

# Draw replicates
for i in range(10000):
    # Permute illiteracy measurements: illiteracy_permuted
    illiteracy_permuted = np.random.permutation(illiteracy)

    # Compute Pearson correlation
    perm_replicates[i] = pearson_r(illiteracy_permuted, fertility)

# Compute p-value: p
p = np.sum(perm_replicates >= r_obs) / len(perm_replicates)
print('p-val =', p)

Use your ecdf() function to generate x, y values from the control and treated arrays for plotting the ECDFs.
Plot the ECDFs on the same plot.
The margins have been set for you, along with the legend and axis labels. Hit 'Submit Answer' to see the result!

# Compute x,y values for ECDFs
x_control, y_control = ecdf(control)
x_treated, y_treated = ecdf(treated)

# Plot the ECDFs
plt.plot(x_control, y_control, marker='.', linestyle='none')
plt.plot(x_treated, y_treated, marker='.', linestyle='none')

# Set the margins
plt.margins(0.02)

# Add a legend
plt.legend(('control', 'treated'), loc='lower right')

# Label axes and show plot
plt.xlabel('millions of alive sperm per mL')
plt.ylabel('ECDF')
plt.show()

Compute the mean alive sperm count of control minus that of treated.
Compute the mean of all alive sperm counts. To do this, first concatenate control and treated and take the mean of the concatenated array.
Generate shifted data sets for both control and treated such that the shifted data sets have the same mean. This has already been done for you.
Generate 10,000 bootstrap replicates of the mean for each of the two shifted arrays. Use your draw_bs_reps() function.
Compute the bootstrap replicates of the difference of means.
The code to compute and print the p-value has been written for you. Hit 'Submit Answer' to see the result!

# Compute the difference in mean sperm count: diff_means
diff_means = np.mean(control) - np.mean(treated)

# Compute mean of pooled data: mean_count
mean_count = np.mean(np.concatenate((control, treated)))

# Generate shifted data sets
control_shifted = control - np.mean(control) + mean_count
treated_shifted = treated - np.mean(treated) + mean_count

# Generate bootstrap replicates
bs_reps_control = draw_bs_reps(control_shifted,
                               np.mean, size=10000)
bs_reps_treated = draw_bs_reps(treated_shifted,
                               np.mean, size=10000)

# Get replicates of difference of means: bs_replicates
bs_replicates = bs_reps_control - bs_reps_treated

# Compute and print p-value: p
p = np.sum(bs_replicates >= np.mean(control) - np.mean(treated)) \
            / len(bs_replicates)
print('p-value =', p)