
Course notes: Statistical Thinking in Python (Part 1)

2017-09-23 20:37
Some basic techniques for exploratory data analysis (EDA).

Import matplotlib.pyplot and seaborn as their usual aliases (plt and sns).
Use seaborn to set the plotting defaults.
Plot a histogram of the Iris versicolor petal lengths using plt.hist() and the provided NumPy array versicolor_petal_length.
Show the histogram using plt.show().

# Import plotting modules
import matplotlib.pyplot as plt
import seaborn as sns

# Set default Seaborn style
sns.set()

# Plot histogram of versicolor petal lengths
plt.hist(versicolor_petal_length)

# Show histogram
plt.show()

Label the axes. Don't forget that you should always include units in your axis labels. Your y-axis label is just 'count'. Your x-axis label is 'petal length (cm)'. The units are essential!
Display the plot constructed in the above steps using plt.show().

# Plot histogram of versicolor petal lengths
_ = plt.hist(versicolor_petal_length)

# Label axes
plt.xlabel("petal length (cm)")
plt.ylabel("count")

# Show histogram
plt.show()

Rule of thumb: the number of bins in a histogram is usually taken to be the square root of the number of data points.

Import numpy as np. This gives access to the square root function, np.sqrt().
Determine how many data points you have using len().
Compute the number of bins using the square root rule.
Convert the number of bins to an integer using the built-in int() function.
Generate the histogram and make sure to use the bins keyword argument.
Hit 'Submit Answer' to plot the figure and see the fruit of your labors!

# Import numpy
import numpy as np

# Compute number of data points: n_data
n_data = len(versicolor_petal_length)

# Number of bins is the square root of number of data points: n_bins
n_bins = np.sqrt(n_data)

# Convert number of bins to integer: n_bins
n_bins = int(n_bins)

# Plot the histogram
plt.hist(versicolor_petal_length, bins=n_bins)

# Label axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('count')

# Show histogram
plt.show()

A bee swarm plot example from the lecture, using the swing-state election data in df_swing:

_ = sns.swarmplot(x='state', y='dem_share', data=df_swing)
_ = plt.xlabel('state')
_ = plt.ylabel('percent of vote for Obama')
plt.show()


In the IPython Shell, inspect the DataFrame df using df.head(). This will let you identify which column names you need to pass as the x and y keyword arguments in your call to sns.swarmplot().
Use sns.swarmplot() to make a bee swarm plot from the DataFrame containing the Fisher iris data set, df.
The x-axis should contain each of the three species, and the y-axis should contain the petal lengths.
Label the axes.
Show your plot.

# Create bee swarm plot with Seaborn's default settings
sns.swarmplot(x='species', y='petal length (cm)', data=df)

# Label the axes
plt.xlabel('species')
plt.ylabel('petal length (cm)')

# Show the plot
plt.show()

Define a function with the signature ecdf(data). Within the function definition:
Compute the number of data points, n, using the len() function.
The x-values are the sorted data. Use the np.sort() function to perform the sorting.
The y-data of the ECDF go from 1/n to 1 in equally spaced increments. You can construct this using np.arange(). Remember, however, that the end value in np.arange() is not inclusive. Therefore, np.arange() will need to go from 1 to n+1. Be sure to divide this by n.
The function returns the values x and y.

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n

    return x, y

Use ecdf() to compute the ECDF of versicolor_petal_length. Unpack the output into x_vers and y_vers.
Plot the ECDF as dots. Remember to include marker='.' and linestyle='none' in addition to x_vers and y_vers as arguments inside plt.plot().
Set the margins of the plot with plt.margins() so that no data points are cut off. Use a 2% margin.
Label the axes. You can label the y-axis 'ECDF'.
Show your plot.

# Compute ECDF for versicolor data: x_vers, y_vers
x_vers, y_vers = ecdf(versicolor_petal_length)

# Generate plot
plt.plot(x_vers, y_vers, marker='.', linestyle='none')

# Make the margins nice
plt.margins(0.02)

# Label the axes
plt.ylabel('ECDF')
plt.xlabel('petal length (cm)')

# Display the plot
plt.show()

Compute ECDFs for each of the three species using your ecdf() function. The variables setosa_petal_length, versicolor_petal_length, and virginica_petal_length are all in your namespace. Unpack the ECDFs into x_set, y_set, x_vers, y_vers, and x_virg, y_virg, respectively.
Plot all three ECDFs on the same plot as dots. To do this, you will need three plt.plot() commands. Assign the result of each to _.
Specify 2% margins.
A legend and axis labels have been added for you, so hit 'Submit Answer' to see all the ECDFs!

# Compute ECDFs
x_set, y_set = ecdf(setosa_petal_length)
x_vers, y_vers = ecdf(versicolor_petal_length)
x_virg, y_virg = ecdf(virginica_petal_length)

# Plot all ECDFs on the same plot
_ = plt.plot(x_set, y_set, marker='.', linestyle='none')
_ = plt.plot(x_vers, y_vers, marker='.', linestyle='none')
_ = plt.plot(x_virg, y_virg, marker='.', linestyle='none')

# Make nice margins
plt.margins(0.02)

# Annotate the plot
plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right')
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()

Computing percentiles of the data

Create percentiles, a NumPy array of the percentiles you want to compute: the 2.5th, 25th, 50th, 75th, and 97.5th. You can do so by creating a list containing these ints/floats and converting the list to a NumPy array using np.array(). For example, np.array([30, 50]) would create an array consisting of the 30th and 50th percentiles.
Use np.percentile() to compute the percentiles of the petal lengths from the Iris versicolor samples. The variable versicolor_petal_length is in your namespace.
Print the percentiles.

# Specify array of percentiles: percentiles
percentiles = np.array([2.5, 25, 50, 75, 97.5])

# Compute percentiles: ptiles_vers
ptiles_vers = np.percentile(versicolor_petal_length, percentiles)

# Print the result
print(ptiles_vers)

Plot the ECDF and mark the percentile points on it:

Plot the percentiles as red diamonds on the ECDF. Pass the x and y co-ordinates, ptiles_vers and percentiles/100, as positional arguments and specify the marker='D', color='red', and linestyle='none' keyword arguments. The argument for the y-axis, percentiles/100, has been specified for you.
Display the plot.

# Plot the ECDF
_ = plt.plot(x_vers, y_vers, '.')
plt.margins(0.02)
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Overlay percentiles as red diamonds
_ = plt.plot(ptiles_vers, percentiles/100, marker='D', color='red',
             linestyle='none')

# Show the plot
plt.show()

Computing the variance

Create an array called differences that is the difference between the petal lengths (versicolor_petal_length) and the mean petal length. The variable versicolor_petal_length is already in your namespace as a NumPy array, so you can take advantage of NumPy's vectorized operations.
Square each element in this array. For example, x**2 squares each element in the array x. Store the result as diff_sq.
Compute the mean of the elements in diff_sq using np.mean(). Store the result as variance_explicit.
Compute the variance of versicolor_petal_length using np.var(). Store the result as variance_np.
Print both variance_explicit and variance_np in one print call to make sure they are consistent.

# Array of differences to mean: differences
differences = versicolor_petal_length - np.mean(versicolor_petal_length)

# Square the differences: diff_sq
diff_sq = differences ** 2

# Compute the mean square difference: variance_explicit
variance_explicit = np.mean(diff_sq)

# Compute the variance using NumPy: variance_np
variance_np = np.var(versicolor_petal_length)

# Print the results
print(variance_explicit, variance_np)

Computing the Pearson correlation coefficient

Define a function with signature pearson_r(x, y).
Use np.corrcoef() to compute the correlation matrix of x and y (pass them to np.corrcoef() in that order).
The function returns entry [0,1] of the correlation matrix.
Compute the Pearson correlation between the data in the arrays versicolor_petal_length and versicolor_petal_width. Assign the result to r.
Print the result.

def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    # Compute correlation matrix: corr_mat
    corr_mat = np.corrcoef(x, y)

    # Return entry [0,1]
    return corr_mat[0,1]

# Compute Pearson correlation coefficient for I. versicolor: r
r = pearson_r(versicolor_petal_length, versicolor_petal_width)

# Print the result
print(r)

Draw samples out of the Binomial distribution using np.random.binomial(). You should use parameters n=100 and p=0.05, and set the size keyword argument to 10000.
Compute the CDF using your previously-written ecdf() function.
Plot the CDF with axis labels. The x-axis here is the number of defaults out of 100 loans, while the y-axis is the CDF.
Show the plot.
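The draws below are random, so exact output differs from run to run; the course exercises usually seed NumPy's random number generator first. The seed value used here is only the conventional choice from the course, an assumption in these notes:

# Seed the random number generator for reproducibility (seed value assumed)
np.random.seed(42)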

# Take 10,000 samples out of the binomial distribution: n_defaults
n_defaults = np.random.binomial(100, 0.05, size=10000)

# Compute CDF: x, y
x, y = ecdf(n_defaults)

# Plot the CDF with axis labels
plt.plot(x, y, marker='.', linestyle='none')
plt.xlabel('the number of defaults out of 100 loans')
plt.ylabel('CDF')

# Show the plot
plt.show()

Sampling from and plotting the Normal distribution

Draw 100,000 samples from a Normal distribution that has a mean of 20 and a standard deviation of 1. Do the same for Normal distributions with standard deviations of 3 and 10, each still with a mean of 20. Assign the results to samples_std1, samples_std3, and samples_std10, respectively.
Plot a histogram of each of the samples; for each, use 100 bins, also using the keyword arguments normed=True and histtype='step'. The latter keyword argument makes the plot look much like the smooth theoretical PDF. You will need to make 3 plt.hist() calls.
Hit 'Submit Answer' to make a legend, showing which standard deviations you used, and show your plot! There is no need to label the axes because we have not defined what is being described by the Normal distribution; we are just looking at shapes of PDFs.

# Draw 100000 samples from Normal distribution with stds of interest: samples_std1, samples_std3, samples_std10
samples_std1 = np.random.normal(20, 1, 100000)
samples_std3 = np.random.normal(20, 3, 100000)
samples_std10 = np.random.normal(20, 10, 100000)

# Make histograms
plt.hist(samples_std1, normed=True, histtype='step', bins=100)
plt.hist(samples_std3, normed=True, histtype='step', bins=100)
plt.hist(samples_std10, normed=True, histtype='step', bins=100)

# Make a legend, set limits and show plot
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'))
plt.ylim(-0.01, 0.42)
plt.show()
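Note that the normed keyword was removed in Matplotlib 3.1; on current versions the equivalent keyword is density=True, for example (using the samples_std1 array drawn above):

# Equivalent call on Matplotlib >= 3.1, where `normed` no longer exists
plt.hist(samples_std1, bins=100, density=True, histtype='step')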

  
Fitting a Normal distribution to the data and comparing it with the empirical distribution

Compute mean and standard deviation of Belmont winners' times with the two outliers removed. The NumPy array belmont_no_outliers has these data.
Take 10,000 samples out of a normal distribution with this mean and standard deviation using np.random.normal().
Compute the CDF of the theoretical samples and the ECDF of the Belmont winners' data, assigning the results to x_theor, y_theor and x, y, respectively.
Hit submit to plot the CDF of your samples with the ECDF, label your axes and show the plot.

# Compute mean and standard deviation: mu, sigma
mu = np.mean(belmont_no_outliers)
sigma = np.std(belmont_no_outliers)

# Sample out of a normal distribution with this mu and sigma: samples
samples = np.random.normal(mu, sigma, 10000)

# Get the CDF of the samples and of the data
x_theor, y_theor = ecdf(samples)
x, y = ecdf(belmont_no_outliers)

# Plot the CDFs and show the plot
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('Belmont winning time (sec.)')
_ = plt.ylabel('CDF')
plt.show()

The Exponential distribution

Define a function with call signature successive_poisson(tau1, tau2, size=1) that samples the waiting time for a no-hitter and a hit of the cycle.
Draw waiting times tau1 (size number of samples) for the no-hitter out of an exponential distribution and assign to t1.
Draw waiting times tau2 (size number of samples) for hitting the cycle out of an exponential distribution and assign to t2.
The function returns the sum of the waiting times for the two events.

def successive_poisson(tau1, tau2, size=1):
    """Sample total waiting time for two successive Poisson processes."""
    # Draw samples out of first exponential distribution: t1
    t1 = np.random.exponential(tau1, size)

    # Draw samples out of second exponential distribution: t2
    t2 = np.random.exponential(tau2, size)

    return t1 + t2

Use your successive_poisson() function to draw 100,000 samples out of the distribution of waiting times for observing a no-hitter and a hitting of the cycle.
Plot the PDF of the waiting times using the step histogram technique of a previous exercise. Don't forget the necessary keyword arguments. You should use bins=100, normed=True, and histtype='step'.
Label the axes.
Show your plot.

# Draw samples of waiting times: waiting_times
waiting_times = successive_poisson(764, 715, size=100000)

# Make the histogram
plt.hist(waiting_times, bins=100, normed=True, histtype='step')

# Label axes
plt.xlabel('total waiting time (games)')
plt.ylabel('PDF')

# Show the plot
plt.show()

Polynomial fitting (linear regression with np.polyfit)

Compute the slope and intercept of the regression line using np.polyfit(). Remember, fertility is on the y-axis and illiteracy on the x-axis.
Print out the slope and intercept from the linear regression.
To plot the best fit line, create an array x that consists of 0 and 100 using np.array(). Then, compute the theoretical values of y based on your regression parameters, i.e., y = a * x + b.
Plot the data and the regression line on the same plot. Be sure to label your axes.
Hit 'Submit Answer' to display your plot.

# Plot the illiteracy rate versus fertility
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')

# Perform a linear regression using np.polyfit(): a, b
a, b = np.polyfit(illiteracy, fertility, 1)

# Print the results to the screen
print('slope =', a, 'children per woman / percent illiterate')
print('intercept =', b, 'children per woman')

# Make theoretical line to plot
x = np.array([0, 100])
y = a * x + b

# Add regression line to your plot
_ = plt.plot(x, y)

# Draw the plot
plt.show()

Specify the values of the slope for which to compute the RSS. Use np.linspace() to get 200 points in the range between 0 and 0.1. For example, to get 100 points in the range between 0 and 0.5, you could use np.linspace() like so: np.linspace(0, 0.5, 100).
Initialize an array, rss, to contain the RSS using np.empty_like() and the array you created above. The empty_like() function returns a new array with the same shape and type as a given array (in this case, a_vals).
Write a for loop to compute the RSS for each slope value. Hint: the RSS is given by np.sum((y_data - a * x_data - b)**2). The variable b you computed in the last exercise is already in your namespace. Here, fertility is the y_data and illiteracy the x_data.
Plot the RSS (rss) versus slope (a_vals).
Hit 'Submit Answer' to see the plot!

# Specify slopes to consider: a_vals
a_vals = np.linspace(0, 0.1, 200)

# Initialize sum of square of residuals: rss
rss = np.empty_like(a_vals)

# Compute sum of square of residuals for each value of a_vals
for i, a in enumerate(a_vals):
    rss[i] = np.sum((fertility - a*illiteracy - b)**2)

# Plot the RSS
plt.plot(a_vals, rss, '-')
plt.xlabel('slope (children per woman / percent illiterate)')
plt.ylabel('sum of square of residuals')
plt.show()

Compute the parameters for the slope and intercept using np.polyfit(). The Anscombe data are stored in the arrays x and y.
Print the slope a and intercept b.
Generate theoretical x and y data from the linear regression. Your x array, which you can create with np.array(), should consist of 3 and 15. To generate the y data, multiply the slope by x_theor and add the intercept.
Plot the Anscombe data as a scatter plot and then plot the theoretical line. Remember to include the marker='.' and linestyle='none' keyword arguments in addition to x and y when you plot the Anscombe data as a scatter plot. You do not need these arguments when plotting the theoretical line.
Hit 'Submit Answer' to see the plot!

# Perform linear regression: a, b
a, b = np.polyfit(x, y, 1)

# Print the slope and intercept
print(a, b)

# Generate theoretical x and y data: x_theor, y_theor
x_theor = np.array([3, 15])
y_theor = x_theor * a + b

# Plot the Anscombe data and theoretical line
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')

# Label the axes
plt.xlabel('x')
plt.ylabel('y')

# Show the plot
plt.show()

A resampling method (the bootstrap) for when the sample is small

Write a for loop to acquire 50 bootstrap samples of the rainfall data and plot their ECDFs.
Use np.random.choice() to generate a bootstrap sample from the NumPy array rainfall. Be sure that the size of the resampled array is len(rainfall).
Use the function ecdf() that you wrote in the prequel to this course to generate the x and y values for the ECDF of the bootstrap sample bs_sample.
Plot the ECDF values. Specify color='gray' (to make gray dots) and alpha=0.1 (to make them semi-transparent, since we are overlaying so many) in addition to the marker='.' and linestyle='none' keyword arguments.
Use ecdf() to generate x and y values for the ECDF of the original rainfall data available in the array rainfall.
Plot the ECDF values of the original data.
Hit 'Submit Answer' to visualize the samples!

for _ in range(50):
    # Generate bootstrap sample: bs_sample
    bs_sample = np.random.choice(rainfall, size=len(rainfall))

    # Compute and plot ECDF from bootstrap sample
    x, y = ecdf(bs_sample)
    _ = plt.plot(x, y, marker='.', linestyle='none',
                 color='gray', alpha=0.1)

# Compute and plot ECDF from original data
x, y = ecdf(rainfall)
_ = plt.plot(x, y, marker='.')

# Make margins and label axes
plt.margins(0.02)
_ = plt.xlabel('yearly rainfall (mm)')
_ = plt.ylabel('ECDF')

# Show the plot
plt.show()

Draw 10000 bootstrap replicates of the mean annual rainfall using your draw_bs_reps() function and the rainfall array. Hint: pass in np.mean for func to compute the mean. As a reminder, draw_bs_reps() accepts 3 arguments: data, func, and size (a sketch of this helper is given after these instructions).
Compute and print the standard error of the mean of rainfall. The formula to compute this is np.std(data) / np.sqrt(len(data)).
Compute and print the standard deviation of your bootstrap replicates bs_replicates.
Make a histogram of the replicates using the normed=True keyword argument and 50 bins.
Hit 'Submit Answer' to see the plot!
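The definition of draw_bs_reps() comes from an earlier exercise that these notes skip; a minimal sketch consistent with how it is called here (data, func, size) would be:

def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates of func applied to resampled data (sketch)."""
    bs_replicates = np.empty(size)
    for i in range(size):
        # Resample with replacement and apply the summary function
        bs_sample = np.random.choice(data, size=len(data))
        bs_replicates[i] = func(bs_sample)
    return bs_replicates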

# Take 10,000 bootstrap replicates of the mean: bs_replicates
bs_replicates = draw_bs_reps(rainfall, np.mean, 10000)

# Compute and print SEM
sem = np.std(rainfall) / np.sqrt(len(rainfall))
print(sem)

# Compute and print standard deviation of bootstrap replicates
bs_std = np.std(bs_replicates)
print(bs_std)

# Make a histogram of the results
_ = plt.hist(bs_replicates, bins=50, normed=True)
_ = plt.xlabel('mean annual rainfall (mm)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

Computing confidence intervals

Generate 10000 bootstrap replicates of τ from the nohitter_times data using your draw_bs_reps() function. Recall that the optimal τ is calculated as the mean of the data.
Compute the 95% confidence interval using np.percentile() and passing in two arguments: the array bs_replicates, and the list of percentiles, in this case 2.5 and 97.5.
Print the confidence interval.
Plot a histogram of your bootstrap replicates. This has been done for you, so hit 'Submit Answer' to see the plot!

# Draw bootstrap replicates of the mean no-hitter time (equal to tau): bs_replicates
bs_replicates = draw_bs_reps(nohitter_times, np.mean, 10000)

# Compute the 95% confidence interval: conf_int
conf_int = np.percentile(bs_replicates, [2.5, 97.5])

# Print the confidence interval
print('95% confidence interval =', conf_int, 'games')

# Plot the histogram of the replicates
_ = plt.hist(bs_replicates, bins=50, normed=True)
_ = plt.xlabel(r'$\tau$ (games)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

Pairs bootstrap for linear regression

Define a function with call signature draw_bs_pairs_linreg(x, y, size=1) to perform pairs bootstrap estimates on linear regression parameters.
Use np.arange() to set up an array of indices going from 0 to len(x). These are what you will resample and use to pick values out of the x and y arrays.
Use np.empty() to initialize the slope and intercept replicate arrays to be of size size.
Write a for loop to:
resample the indices inds, using np.random.choice() to do this;
make new x and y arrays bs_x and bs_y using the resampled indices bs_inds (to do this, slice x and y with bs_inds);
use np.polyfit() on the new x and y arrays and store the computed slope and intercept.
Return the pairs bootstrap replicates of the slope and intercept.

def draw_bs_pairs_linreg(x, y, size=1):
    """Perform pairs bootstrap for linear regression."""
    # Set up array of indices to sample from: inds
    inds = np.arange(len(x))

    # Initialize replicates: bs_slope_reps, bs_intercept_reps
    bs_slope_reps = np.empty(size)
    bs_intercept_reps = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_x, bs_y = x[bs_inds], y[bs_inds]
        bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, 1)

    return bs_slope_reps, bs_intercept_reps

Generate an array of x-values consisting of 0 and 100 for the plot of the regression lines. Use the np.array() function for this.
Write a for loop in which you plot a regression line with a slope and intercept given by the pairs bootstrap replicates. Do this for 100 lines. When plotting the regression lines in each iteration of the for loop, recall the regression equation y = a*x + b. Here, a is bs_slope_reps[i] and b is bs_intercept_reps[i]. Specify the keyword arguments linewidth=0.5, alpha=0.2, and color='red' in your call to plt.plot().
Make a scatter plot with illiteracy on the x-axis and fertility on the y-axis. Remember to specify the marker='.' and linestyle='none' keyword arguments.
Label the axes, set a 2% margin, and show the plot. This has been done for you, so hit 'Submit Answer' to visualize the bootstrap regressions!

# Generate array of x-values for bootstrap lines: x
x = np.array([0, 100])

# Plot the bootstrap lines
for i in range(100):
    _ = plt.plot(x, bs_slope_reps[i]*x + bs_intercept_reps[i],
                 linewidth=0.5, alpha=0.2, color='red')

# Plot the data
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')

# Label axes, set the margins, and show the plot
_ = plt.xlabel('illiteracy')
_ = plt.ylabel('fertility')
plt.margins(0.02)
plt.show()


Visualizing permutation sampling

Write a for loop to generate 50 permutation samples, compute their ECDFs, and plot them.
Generate a permutation sample pair from rain_july and rain_november using your permutation_sample() function (sketched below).
Generate the x and y values for an ECDF for each of the two permutation samples using your ecdf() function.
Plot the ECDF of the first permutation sample (x_1 and y_1) as dots. Do the same for the second permutation sample (x_2 and y_2).
Generate x and y values for ECDFs for the rain_july and rain_november data and plot the ECDFs using respectively the keyword arguments color='red' and color='blue'.
Label your axes, set a 2% margin, and show your plot. This has been done for you, so just hit 'Submit Answer' to view the plot!
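The permutation_sample() helper was written in an earlier exercise that is not reproduced in these notes; a minimal sketch of what it does (pool the two data sets, shuffle, split) would be:

def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets (sketch)."""
    # Pool the two data sets, scramble their order, and split them back in two
    data = np.concatenate((data1, data2))
    permuted_data = np.random.permutation(data)
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]
    return perm_sample_1, perm_sample_2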

for _ in range(50):
    # Generate permutation samples
    perm_sample_1, perm_sample_2 = permutation_sample(rain_july, rain_november)

    # Compute ECDFs
    x_1, y_1 = ecdf(perm_sample_1)
    x_2, y_2 = ecdf(perm_sample_2)

    # Plot ECDFs of permutation sample
    _ = plt.plot(x_1, y_1, marker='.', linestyle='none',
                 color='red', alpha=0.02)
    _ = plt.plot(x_2, y_2, marker='.', linestyle='none',
                 color='blue', alpha=0.02)

# Create and plot ECDFs from original data
x_1, y_1 = ecdf(rain_july)
x_2, y_2 = ecdf(rain_november)
_ = plt.plot(x_1, y_1, marker='.', linestyle='none', color='red')
_ = plt.plot(x_2, y_2, marker='.', linestyle='none', color='blue')

# Label axes, set margin, and show plot
plt.margins(0.02)
_ = plt.xlabel('monthly rainfall (mm)')
_ = plt.ylabel('ECDF')
plt.show()

Computing p-values:

Define a function with call signature diff_of_means(data_1, data_2) that returns the difference in means between two data sets: the mean of data_1 minus the mean of data_2.
Use this function to compute the empirical difference of means that was observed in the frogs.
Draw 10,000 permutation replicates of the difference of means.
Compute the p-value.
Print the p-value.
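The replicate-drawing code below relies on draw_perm_reps(), which is also not defined in these notes; a minimal sketch, assuming it builds on the permutation_sample() helper sketched above, could be:

def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate permutation replicates of func applied to permuted data sets (sketch)."""
    perm_replicates = np.empty(size)
    for i in range(size):
        # Re-shuffle the pooled data and recompute the test statistic
        perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2)
        perm_replicates[i] = func(perm_sample_1, perm_sample_2)
    return perm_replicates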

def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""
    # The difference of means of data_1, data_2: diff
    diff = np.mean(data_1) - np.mean(data_2)
    return diff

# Compute difference of mean impact force from experiment: empirical_diff_means
empirical_diff_means = diff_of_means(force_a, force_b)

# Draw 10,000 permutation replicates: perm_replicates
perm_replicates = draw_perm_reps(force_a, force_b,
                                 diff_of_means, size=10000)

# Compute p-value: p
p = np.sum(perm_replicates >= empirical_diff_means) / len(perm_replicates)

# Print the result
print('p-value =', p)

Translate the impact forces of Frog B such that its mean is 0.55 N.
Use your draw_bs_reps() function to take 10,000 bootstrap replicates of the mean of your translated forces.
Compute the p-value by finding the fraction of your bootstrap replicates that are less than the observed mean impact force of Frog B. Note that the variable of interest here is force_b.
Print your p-value.

# Make an array of translated impact forces: translated_force_b
translated_force_b = force_b - np.mean(force_b) + 0.55

# Take bootstrap replicates of Frog B's translated impact forces: bs_replicates
bs_replicates = draw_bs_reps(translated_force_b, np.mean, 10000)

# Compute fraction of replicates that are less than the observed Frog B force: p
p = np.sum(bs_replicates <= np.mean(force_b)) / 10000

# Print the p-value
print('p = ', p)

Construct Boolean arrays, dems and reps, that contain the votes of the respective parties; e.g., dems has 153 True entries and 91 False entries.
Write a function, frac_yay_dems(dems, reps), that returns the fraction of Democrats that voted yay. The first input is an array of Booleans. (Two inputs are required to use your draw_perm_reps() function, but the second is not used.)
Use your draw_perm_reps() function to draw 10,000 permutation replicates of the fraction of Democrat yay votes.
Compute and print the p-value.

# Construct arrays of data: dems, reps
dems = np.array([True] * 153 + [False] * 91)
reps = np.array([True] * 136 + [False] * 35)

def frac_yay_dems(dems, reps):
    """Compute fraction of Democrat yay votes."""
    frac = np.sum(dems) / len(dems)
    return frac

# Acquire permutation samples: perm_replicates
perm_replicates = draw_perm_reps(dems, reps, frac_yay_dems, 10000)

# Compute and print p-value: p
p = np.sum(perm_replicates <= 153/244) / len(perm_replicates)
print('p-value =', p)

Compute the observed difference in mean inter-no-hitter time using diff_of_means().
Generate 10,000 permutation replicates of the difference of means using draw_perm_reps().
Compute and print the p-value.

# Compute the observed difference in mean inter-no-hitter times: nht_diff_obs
nht_diff_obs = diff_of_means(nht_dead, nht_live)

# Acquire 10,000 permutation replicates of difference in mean no-hitter time: perm_replicates
perm_replicates = draw_perm_reps(nht_dead, nht_live,
                                 diff_of_means, size=10000)

# Compute and print the p-value: p
p = np.sum(perm_replicates <= nht_diff_obs) / len(perm_replicates)
print('p-val =', p)

Compute the observed Pearson correlation between illiteracy and fertility.
Initialize an array to store your permutation replicates.
Write a for loop to draw 10,000 replicates:
Permute the illiteracy measurements using np.random.permutation().
Compute the Pearson correlation between the permuted illiteracy array, illiteracy_permuted, and fertility.
Compute and print the p-value from the replicates.

# Compute observed correlation: r_obs
r_obs = pearson_r(illiteracy, fertility)

# Initialize permutation replicates: perm_replicates
perm_replicates = np.empty(10000)

# Draw replicates
for i in range(10000):
    # Permute illiteracy measurements: illiteracy_permuted
    illiteracy_permuted = np.random.permutation(illiteracy)

    # Compute Pearson correlation
    perm_replicates[i] = pearson_r(illiteracy_permuted, fertility)

# Compute p-value: p
p = np.sum(perm_replicates >= r_obs) / len(perm_replicates)
print('p-val =', p)

Use your ecdf() function to generate x, y values from the control and treated arrays for plotting the ECDFs.
Plot the ECDFs on the same plot.
The margins have been set for you, along with the legend and axis labels. Hit 'Submit Answer' to see the result!

# Compute x,y values for ECDFs
x_control, y_control = ecdf(control)
x_treated, y_treated = ecdf(treated)

# Plot the ECDFs
plt.plot(x_control, y_control, marker='.', linestyle='none')
plt.plot(x_treated, y_treated, marker='.', linestyle='none')

# Set the margins
plt.margins(0.02)

# Add a legend
plt.legend(('control', 'treated'), loc='lower right')

# Label axes and show plot
plt.xlabel('millions of alive sperm per mL')
plt.ylabel('ECDF')
plt.show()

Compute the mean alive sperm count of control minus that of treated.
Compute the mean of all alive sperm counts. To do this, first concatenate control and treated and take the mean of the concatenated array.
Generate shifted data sets for both control and treated such that the shifted data sets have the same mean. This has already been done for you.
Generate 10,000 bootstrap replicates of the mean for each of the two shifted arrays. Use your draw_bs_reps() function.
Compute the bootstrap replicates of the difference of means.
The code to compute and print the p-value has been written for you. Hit 'Submit Answer' to see the result!

# Compute the difference in mean sperm count: diff_means
diff_means = np.mean(control) - np.mean(treated)

# Compute mean of pooled data: mean_count
mean_count = np.mean(np.concatenate((control, treated)))

# Generate shifted data sets
control_shifted = control - np.mean(control) + mean_count
treated_shifted = treated - np.mean(treated) + mean_count

# Generate bootstrap replicates
bs_reps_control = draw_bs_reps(control_shifted,
                               np.mean, size=10000)
bs_reps_treated = draw_bs_reps(treated_shifted,
                               np.mean, size=10000)

# Get replicates of difference of means: bs_replicates
bs_replicates = bs_reps_control - bs_reps_treated

# Compute and print p-value: p
p = np.sum(bs_replicates >= np.mean(control) - np.mean(treated)) \
            / len(bs_replicates)
print('p-value =', p)