class Statistical Thinking in Python (Part 1)
2017-09-23 20:37
561 查看
一些EDA数据的基本技术
Import
their usual aliases (
Use
Plot a histogram of the Iris versicolor petal lengths using
provided NumPy array
Show the histogram using
# Import plotting modules
import matplotlib.pyplot as plt
import seaborn as sns
# Set default Seaborn style
sns.set()
# Plot histogram of versicolor petal lengths
plt.hist(versicolor_petal_length)
# Show histogram
plt.show()
Label the axes. Don't forget that you should always include units in your axis labels. Your yy-axis
label is just
label is
Display the plot constructed in the above steps using
# Plot histogram of versicolor petal lengths
_ = plt.hist(versicolor_petal_length)
# Label axes
plt.xlabel("petal length (cm)")
plt.ylabel("count")
# Show histogram
plt.show()
hist直方图的bins数目一般是数据数量的开根号
Import
This gives access to the square root function,
Determine how many data points you have using
Compute the number of bins using the square root rule.
Convert the number of bins to an integer using the built in
Generate the histogram and make sure to use the
Hit 'Submit Answer' to plot the figure and see the fruit of your labors!
# Import numpy
import numpy as np
# Compute number of data points: n_data
n_data = len(versicolor_petal_length)
# Number of bins is the square root of number of data points: n_bins
n_bins = np.sqrt(n_data)
# Convert number of bins to integer: n_bins
n_bins = int(n_bins)
# Plot the histogram
plt.hist(versicolor_petal_length, bins=n_bins)
# Label axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('count')
# Show histogram
plt.show()
In the IPython Shell, inspect the DataFrame
This will let you identify which column names you need to pass as the
arguments in your call to
Use
The x-axis should contain each of the three species, and the y-axis should contain the petal lengths.
Label the axes.
Show your plot.
# Create bee swarm plot with Seaborn's default settings
sns.swarmplot(x='species', y='petal length (cm)', data=df)
# Label the axes
plt.xlabel('species')
plt.ylabel('petal length (cm)')
# Show the plot
plt.show()
Define a function with the signature
Within the function definition,
Compute the number of data points,
The xx-values
are the sorted data. Use the
The yy data
of the ECDF go from
equally spaced increments. You can construct this using
not inclusive. Therefore,
Be sure to divide this by
The function returns the values
def ecdf(data):
"""Compute ECDF for a one-dimensional array of measurements."""
# Number of data points: n
n = len(data)
# x-data for the ECDF: x
x = np.sort(data)
# y-data for the ECDF: y
y = np.arange(1, n+1) / n
return x, y
Use
Unpack the output into
Plot the ECDF as dots. Remember to include
arguments inside
Set the margins of the plot with
Label the axes. You can label the y-axis
Show your plot.
Use
Unpack the output into
Plot the ECDF as dots. Remember to include
arguments inside
Set the margins of the plot with
Label the axes. You can label the y-axis
Show your plot.
# Compute ECDF for versicolor data: x_vers, y_vers
x_vers, y_vers = ecdf(versicolor_petal_length)
# Generate plot
plt.plot(x_vers, y_vers, marker='.', linestyle='none')
# Make the margins nice
plt.margins(0.02)
# Label the axes
plt.ylabel('ECDF')
plt.xlabel('versicolor_petal_length')
# Display the plot
plt.show()
Compute ECDFs for each of the three species using your
and
Plot all three ECDFs on the same plot as dots. To do this, you will need three
Assign the result of each to
Specify 2% margins.
A legend and axis labels have been added for you, so hit 'Submit Answer' to see all the ECDFs!
# Compute ECDFs
x_set, y_set = ecdf(setosa_petal_length)
x_vers, y_vers = ecdf(versicolor_petal_length)
x_virg, y_virg = ecdf(virginica_petal_length)
# Plot all ECDFs on the same plot
_ = plt.plot(x_set, y_set, marker='.', linestyle='none', )
_ = plt.plot(x_vers, y_vers, marker='.', linestyle='none', )
_ = plt.plot(x_virg, y_virg, marker='.', linestyle='none', )
# Make nice margins
plt.margins(0.02)
# Annotate the plot
plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right')
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')
# Display the plot
plt.show()
dataframe计算分位数
Create
75th, and 97.5th. You can do so by creating a list containing these ints/floats and convert the list to a NumPy array using
For example,
Use
The variable
Print the percentiles.
# Specify array of percentiles: percentiles
percentiles = np.array([2.5, 25, 50, 75, 97.5])
# Compute percentiles: ptiles_vers
ptiles_vers = np.percentile(versicolor_petal_length, percentiles)
# Print the result
print(ptiles_vers)
画图并且标记处分位数点:
Plot the percentiles as red diamonds on the ECDF. Pass the x and y co-ordinates -
as positional arguments and specify the
arguments. The argument for the y-axis -
Display the plot.
# Plot the ECDF
_ = plt.plot(x_vers, y_vers, '.')
plt.margins(0.02)
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')
# Overlay percentiles as red diamonds.
_ = plt.plot(ptiles_vers, percentiles/100, marker='D', color='red',
linestyle='none')
# Show the plot
plt.show()
计算方差
Create an array called
and the mean petal length. The variable
you can take advantage of NumPy's vectorized operations.
Square each element in this array. For example,
Store the result as
Compute the mean of the elements in
Store the result as
Compute the variance of
Store the result as
Print both
one
# Array of differences to mean: differences
differences = versicolor_petal_length - np.mean(versicolor_petal_length)
# Square the differences: diff_sq
diff_sq = differences ** 2
# Compute the mean square difference: variance_explicit
variance_explicit = np.mean(diff_sq)
# Compute the variance using NumPy: variance_np
variance_np = np.var(versicolor_petal_length)
# Print the results
print(variance_explicit,variance_np)
计算皮尔逊相关系数
Define a function with signature
Use
them to
The function returns entry
Compute the Pearson correlation between the data in the arrays
Assign the result to
Print the result.
def pearson_r(x, y):
"""Compute Pearson correlation coefficient between two arrays."""
# Compute correlation matrix: corr_mat
corr_mat = np.corrcoef(x, y)
# Return entry [0,1]
return corr_mat[0,1]
# Compute Pearson correlation coefficient for I. versicolor: r
r = pearson_r(versicolor_petal_length,versicolor_petal_width)
# Print the result
print(r)
Draw samples out of the Binomial distribution using
argument to
Compute the CDF using your previously-written
Plot the CDF with axis labels. The x-axis here is the number of defaults out of 100 loans, while the y-axis is the CDF.
Show the plot.
# Take 10,000 samples out of the binomial distribution: n_defaults
n_defaults = np.random.binomial(100, 0.05, size = 10000)
# Compute CDF: x, y
x, y = ecdf(n_defaults)
# Plot the CDF with axis labels
plt.plot(x, y, marker = '.', linestyle = 'none' )
plt.xlabel('the number of defaults out of 100 loans')
plt.ylabel('CDF')
# Show the plot
plt.show()
正态分布的构建和绘制
Draw 100,000 samples from a Normal distribution that has a mean of
deviation of
each still with a mean of
respectively.
Plot a histograms of each of the samples; for each, use 100 bins, also using the keyword arguments
The latter keyword argument makes the plot look much like the smooth theoretical PDF. You will need to make 3
Hit 'Submit Answer' to make a legend, showing which standard deviations you used, and show your plot! There is no need to label the axes because we have not defined what is being described by the Normal distribution; we are just looking at shapes
of PDFs.
# Draw 100000 samples from Normal distribution with stds of interest: samples_std1, samples_std3, samples_std10
samples_std1 = np.random.normal(20,1,100000)
samples_std3 = np.random.normal(20,3,100000)
samples_std10 = np.random.normal(20,10,100000)
# Make histograms
plt.hist(samples_std1,normed=True,histtype='step',bins=100)
plt.hist(samples_std3,normed=True,histtype='step',bins=100)
plt.hist(samples_std10,normed=True,histtype='step',bins=100)
# Make a legend, set limits and show plot
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'))
plt.ylim(-0.01, 0.42)
plt.show()
数据的标准正态分布拟合和原数据的拟合
Compute mean and standard deviation of Belmont winners' times with the two outliers removed. The NumPy array
these data.
Take 10,000 samples out of a normal distribution with this mean and standard deviation using
Compute the CDF of the theoretical samples and the ECDF of the Belmont winners' data, assigning the results to
Hit submit to plot the CDF of your samples with the ECDF, label your axes and show the plot.
# Compute mean and standard deviation: mu, sigma
mu = np.mean(belmont_no_outliers)
sigma = np.std(belmont_no_outliers)
# Sample out of a normal distribution with this mu and sigma: samples
samples = np.random.normal(mu,sigma,10000)
# Get the CDF of the samples and of the data
x_theor,y_theor = ecdf(samples)
x, y = ecdf(belmont_no_outliers)
# Plot the CDFs and show the plot
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('Belmont winning time (sec.)')
_ = plt.ylabel('CDF')
plt.show()
指数分布
Define a function with call signature
Draw waiting times
of samples) for the no-hitter out of an exponential distribution and assign to
Draw waiting times
of samples) for hitting the cycle out of an exponential distribution and assign to
The function returns the sum of the waiting times for the two events.
def successive_poisson(tau1, tau2, size=1):
# Draw samples out of first exponential distribution: t1
t1 = np.random.exponential(tau1, size)
# Draw samples out of second exponential distribution: t2
t2 = np.random.exponential(tau2, size)
return t1 + t2
Use your
of waiting times for observing a no-hitter and a hitting of the cycle.
Plot the PDF of the waiting times using the step histogram technique of a previous exercise. Don't forget the necessary keyword arguments. You should use
and
Label the axes.
Show your plot.
# Draw samples of waiting times: waiting_times
waiting_times = successive_poisson(764, 715, size=100000)
# Make the histogram
plt.hist(waiting_times, bins=100, normed=True, histtype='step')
# Label axes
plt.xlabel('time')
plt.ylabel('waiting_times')
# Show the plot
plt.show()
多项式拟合
Compute the slope and intercept of the regression line using
on the y-axis and
Print out the slope and intercept from the linear regression.
To plot the best fit line, create an array
Then, compute the theoretical values of
Plot the data and the regression line on the same plot. Be sure to label your axes.
Hit 'Submit Answer' to display your plot.
# Plot the illiteracy rate versus fertility
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')
# Perform a linear regression using np.polyfit(): a, b
a, b = np.polyfit(illiteracy, fertility,1)
# Print the results to the screen
print('slope =', a, 'children per woman / percent illiterate')
print('intercept =', b, 'children per woman')
# Make theoretical line to plot
x = np.array([0,100])
y = a * x + b
# Add regression line to your plot
_ = plt.plot(x, y)
# Draw the plot
plt.show()
Specify the values of the slope for which to compute the RSS. Use
in the range between
For example, to get
you could use
Initialize an array,
the array you created above. The
array (in this case,
Write a
by
computed in the last exercise is already in your namespace. Here,
Plot the RSS (
Hit 'Submit Answer' to see the plot!
# Specify slopes to consider: a_vals
a_vals = np.linspace(0, 0.1, 200)
# Initialize sum of square of residuals: rss
rss = np.empty_like(a_vals)
# Compute sum of square of residuals for each value of a_vals
for i, a in enumerate(a_vals):
rss[i] = np.sum((fertility - a*illiteracy - b)**2)
# Plot the RSS
plt.plot(a_vals, rss, '-')
plt.xlabel('slope (children per woman / percent illiterate)')
plt.ylabel('sum of square of residuals')
plt.show()
Compute the parameters for the slope and intercept using
data are stored in the arrays
Print the slope
Generate theoretical xx and yy data
from the linear regression. Your xx array,
which you can create with
To generate the yy data,
multiply the slope by
Plot the Anscombe data as a scatter plot and then plot the theoretical line. Remember to include the
arguments in addition to
to plot the Anscombe data as a scatter plot. You do not need these arguments when plotting the theoretical line.
Hit 'Submit Answer' to see the plot!
# Perform linear regression: a, b
a, b = np.polyfit(x,y,1)
# Print the slope and intercept
print(a, b)
# Generate theoretical x and y data: x_theor, y_theor
x_theor = np.array([3, 15])
y_theor = x_theor * a + b
# Plot the Anscombe data and theoretical line
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x,y,marker='.',linestyle='none')
# Label the axes
plt.xlabel('x')
plt.ylabel('y')
# Show the plot
plt.show()
样本小的时候进行重复采样的方法
Write a
samples of the rainfall data and plot their ECDF.
Use
Be sure that the
Use the function
for the ECDF of the bootstrap sample
Plot the ECDF values. Specify
make them semi-transparent, since we are overlaying so many) in addition to the
arguments.
Use
for the ECDF of the original rainfall data available in the array
Plot the ECDF values of the original data.
Hit 'Submit Answer' to visualize the samples!
for _ in range(50):
# Generate bootstrap sample: bs_sample
bs_sample = np.random.choice(rainfall, size=len(rainfall))
# Compute and plot ECDF from bootstrap sample
x, y = ecdf(bs_sample)
_ = plt.plot(x, y, marker='.', linestyle='none',
color='gray', alpha=0.1)
# Compute and plot ECDF from original data
x, y = ecdf(rainfall)
_ = plt.plot(x, y, marker='.')
# Make margins and label axes
plt.margins(0.02)
_ = plt.xlabel('yearly rainfall (mm)')
_ = plt.ylabel('ECDF')
# Show the plot
plt.show()
Draw
your
Pass in
compute the mean.
As a reminder,
and
Compute and print the standard error of the mean of
The formula to compute this is
Compute and print the standard deviation of your bootstrap replicates
Make a histogram of the replicates using the
Hit 'Submit Answer' to see the plot!
# Take 10,000 bootstrap replicates of the mean: bs_replicates
bs_replicates = draw_bs_reps(rainfall, np.mean, 10000)
# Compute and print SEM
sem = np.std(rainfall) / np.sqrt(len(rainfall))
print(sem)
# Compute and print standard deviation of bootstrap replicates
bs_std = np.std(bs_replicates)
print(bs_std)
# Make a histogram of the results
_ = plt.hist(bs_replicates, bins=50, normed=True)
_ = plt.xlabel('mean annual rainfall (mm)')
_ = plt.ylabel('PDF')
# Show the plot
plt.show()
计算置信区间
Generate
the
Recall that the the optimal ττ is
calculated as the mean of the data.
Compute the 95% confidence interval using
and the list of percentiles - in this case
Print the confidence interval.
Plot a histogram of your bootstrap replicates. This has been done for you, so hit 'Submit Answer' to see the plot!
# Draw bootstrap replicates of the mean no-hitter time (equal to tau): bs_replicates
bs_replicates = draw_bs_reps(nohitter_times,np.mean,10000)
# Compute the 95% confidence interval: conf_int
conf_int = np.percentile(bs_replicates,[2.5,97.5])
# Print the confidence interval
print('95% confidence interval =', conf_int, 'games')
# Plot the histogram of the replicates
_ = plt.hist(bs_replicates, bins=50, normed=True)
_ = plt.xlabel(r'$\tau$ (games)')
_ = plt.ylabel('PDF')
# Show the plot
plt.show()
重复抽取数据和进行线性拟合
Define a function with call signature
Use
These are what you will resample and use them to pick values out of the
Use
Write a
Resample the indices
do this.
Make new xx and yy arrays
the the resampled indices
Use
and store the computed slope and intercept.
Return the pair bootstrap replicates of the slope and intercept.
def draw_bs_pairs_linreg(x, y, size=1):
"""Perform pairs bootstrap for linear regression."""
# Set up array of indices to sample from: inds
inds = np.arange(len(x))
# Initialize replicates: bs_slope_reps, bs_intercept_reps
bs_slope_reps = np.empty(size)
bs_intercept_reps = np.empty(size)
# Generate replicates
for i in range(size):
bs_inds = np.random.choice(inds, size=len(inds))
bs_x, bs_y = x[bs_inds], y[bs_inds]
bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, 1)
return bs_slope_reps, bs_intercept_reps
Generate an array of xx-values
consisting of
the plot of the regression lines. Use the
Write a
bootstrap replicates. Do this for
When plotting the regression lines in each iteration of the
Specify the keyword arguments
and
Make a scatter plot with
the y-axis. Remember to specify the
arguments.
Label the axes, set a 2% margin, and show the plot. This has been done for you, so hit 'Submit Answer' to visualize the bootstrap regressions!
# Generate array of x-values for bootstrap lines: x
x = np.array([0,100])
# Plot the bootstrap lines
for i in range(100):
_ = plt.plot(x, bs_slope_reps[i]*x + bs_intercept_reps[i],
linewidth=0.5, alpha=0.2, color='red')
# Plot the data
_ = plt.plot()
# Label axes, set the margins, and show the plot
_ = plt.xlabel('illiteracy')
_ = plt.ylabel('fertility')
plt.margins(0.02)
plt.show()
Write a
Generate a permutation sample pair from
your
Generate the
for an ECDF for each of the two permutation samples for the ECDF using your
Plot the ECDF of the first permutation sample (
as dots. Do the same for the second permutation sample (
Generate
for ECDFs for the
and plot the ECDFs using respectively the keyword arguments
Label your axes, set a 2% margin, and show your plot. This has been done for you, so just hit 'Submit Answer' to view the plot!
for _ in range(50):
# Generate permutation samples
perm_sample_1, perm_sample_2 = permutation_sample(rain_july, rain_november)
# Compute ECDFs
x_1, y_1 = ecdf(perm_sample_1)
x_2, y_2 = ecdf(perm_sample_2)
# Plot ECDFs of permutation sample
_ = plt.plot(x_1, y_1, marker='.', linestyle='none',
color='red', alpha=0.02)
_ = plt.plot(x_2, y_2, marker='.', linestyle='none',
color='blue', alpha=0.02)
# Create and plot ECDFs from original data
x_1, y_1 = ecdf(rain_july)
x_2, y_2 = ecdf(rain_november)
_ = plt.plot(x_1, y_1, marker='.', linestyle='none', color='red')
_ = plt.plot(x_2, y_2, marker='.', linestyle='none', color='blue')
# Label axes, set margin, and show plot
plt.margins(0.02)
_ = plt.xlabel('monthly rainfall (mm)')
_ = plt.ylabel('ECDF')
plt.show()
P值的计算:
Define a function with call signature
means between two data sets, mean of
Use this function to compute the empirical difference of means that was observed in the frogs.
Draw 10,000 permutation replicates of the difference of means.
Compute the p-value.
Print the p-value.
def diff_of_means(data_1, data_2):
"""Difference in means of two arrays."""
# The difference of means of data_1, data_2: diff
diff = np.mean(data_1) - np.mean(data_2)
return diff
# Compute difference of mean impact force from experiment: empirical_diff_means
empirical_diff_means = diff_of_means(force_a, force_b)
# Draw 10,000 permutation replicates: perm_replicates
perm_replicates = draw_perm_reps(force_a, force_b,
diff_of_means, size=10000)
# Compute p-value: p
p = np.sum( perm_replicates >= empirical_diff_means) / len(perm_replicates)
# Print the result
print('p-value =', p)
Translate the impact forces of Frog B such that its mean is 0.55 N.
Use your
forces.
Compute the p-value by finding the fraction of your bootstrap replicates that are less than the observed mean impact force of Frog B. Note that the variable of interest here is
Print your p-value.
# Make an array of translated impact forces: translated_force_b
translated_force_b = force_b - np.mean(force_b) + 0.55
# Take bootstrap replicates of Frog B's translated impact forces: bs_replicates
bs_replicates = draw_bs_reps(translated_force_b, np.mean, 10000)
# Compute fraction of replicates that are less than the observed Frog B force: p
p = np.sum(bs_replicates <= np.mean(force_b)) / 10000
# Print the p-value
print('p = ', p)
Construct Boolean arrays,
contain the votes of the respective parties; e.g.,
and 91
Write a function,
The first input is an array of Booleans, Two inputs are required to use your
second is not used.
Use your
yay votes.
Compute and print the p-value.
# Construct arrays of data: dems, reps
dems = np.array([True] * 153 + [False] * 91)
reps = np.array([True] * 136 + [False] * 35)
def frac_yay_dems(dems, reps):
"""Compute fraction of Democrat yay votes."""
frac = np.sum(dems) / len(dems)
return frac
# Acquire permutation samples: perm_replicates
perm_replicates = draw_perm_reps(dems, reps, frac_yay_dems, 10000)
# Compute and print p-value: p
p = np.sum(perm_replicates <= 153/244) / len(perm_replicates)
print('p-value =', p)
Compute the observed difference in mean inter-nohitter time using
Generate 10,000 permutation replicates of the difference of means using
Compute and print the p-value.
# Compute the observed difference in mean inter-no-hitter times: nht_diff_obs
nht_diff_obs = diff_of_means(nht_dead, nht_live)
# Acquire 10,000 permutation replicates of difference in mean no-hitter time: perm_replicates
perm_replicates = draw_perm_reps(nht_dead, nht_live,
diff_of_means, size=10000)
# Compute and print the p-value: p
p = np.sum(perm_replicates <= nht_diff_obs) / len(perm_replicates)
print('p-val =',p)
Compute the observed Pearson correlation between
Initialize an array to store your permutation replicates.
Write a
Permute the
Compute the Pearson correlation between the permuted illiteracy array,
Compute and print the p-value from the replicates.
# Compute observed correlation: r_obs
r_obs = pearson_r(illiteracy, fertility)
# Initialize permutation replicates: perm_replicates
perm_replicates = np.empty(10000)
# Draw replicates
for i in range(10000):
# Permute illiteracy measurments: illiteracy_permuted
illiteracy_permuted = np.random.permutation(illiteracy)
# Compute Pearson correlation
perm_replicates[i] = pearson_r(illiteracy_permuted, fertility)
# Compute p-value: p
p = np.sum(perm_replicates >= r_obs) / len(perm_replicates)
print('p-val =', p)
Use your
from the
for plotting the ECDFs.
Plot the ECDFs on the same plot.
The margins have been set for you, along with the legend and axis labels. Hit 'Submit Answer' to see the result!
# Compute x,y values for ECDFs
x_control, y_control = ecdf(control)
x_treated, y_treated = ecdf(treated)
# Plot the ECDFs
plt.plot(x_control, y_control, marker='.', linestyle='none')
plt.plot(x_treated, y_treated, marker='.', linestyle='none')
# Set the margins
plt.margins(0.02)
# Add a legend
plt.legend(('control', 'treated'), loc='lower right')
# Label axes and show plot
plt.xlabel('millions of alive sperm per mL')
plt.ylabel('ECDF')
plt.show()
Compute the mean alive sperm count of
Compute the mean of all alive sperm counts. To do this, first concatenate
take the mean of the concatenated array.
Generate shifted data sets for both
that the shifted data sets have the same mean. This has already been done for you.
Generate 10,000 bootstrap replicates of the mean each for the two shifted arrays. Use your
Compute the bootstrap replicates of the difference of means.
The code to compute and print the p-value has been written for you. Hit 'Submit Answer' to see the result!
# Compute the difference in mean sperm count: diff_means
diff_means = np.mean(control) - np.mean(treated)
# Compute mean of pooled data: mean_count
mean_count = np.mean(np.concatenate((control, treated)))
# Generate shifted data sets
control_shifted = control - np.mean(control) + mean_count
treated_shifted = treated - np.mean(treated) + mean_count
# Generate bootstrap replicates
bs_reps_control = draw_bs_reps(control_shifted,
np.mean, size=10000)
bs_reps_treated = draw_bs_reps(treated_shifted,
np.mean, size=10000)
# Get replicates of difference of means: bs_replicates
bs_replicates = bs_reps_control - bs_reps_treated
# Compute and print p-value: p
p = np.sum(bs_replicates >= np.mean(control) - np.mean(treated)) \
/ len(bs_replicates)
print('p-value =', p)
Import
matplotlib.pyplotand
seabornas
their usual aliases (
pltand
sns).
Use
seabornto set the plotting defaults.
Plot a histogram of the Iris versicolor petal lengths using
plt.hist()and the
provided NumPy array
versicolor_petal_length.
Show the histogram using
plt.show().
# Import plotting modules
import matplotlib.pyplot as plt
import seaborn as sns
# Set default Seaborn style
sns.set()
# Plot histogram of versicolor petal lengths
plt.hist(versicolor_petal_length)
# Show histogram
plt.show()
Label the axes. Don't forget that you should always include units in your axis labels. Your yy-axis
label is just
'count'. Your xx-axis
label is
'petal length (cm)'. The units are essential!
Display the plot constructed in the above steps using
plt.show().
# Plot histogram of versicolor petal lengths
_ = plt.hist(versicolor_petal_length)
# Label axes
plt.xlabel("petal length (cm)")
plt.ylabel("count")
# Show histogram
plt.show()
hist直方图的bins数目一般是数据数量的开根号
Import
numpyas
np.
This gives access to the square root function,
np.sqrt().
Determine how many data points you have using
len().
Compute the number of bins using the square root rule.
Convert the number of bins to an integer using the built in
int()function.
Generate the histogram and make sure to use the
binskeyword argument.
Hit 'Submit Answer' to plot the figure and see the fruit of your labors!
# Import numpy
import numpy as np
# Compute number of data points: n_data
n_data = len(versicolor_petal_length)
# Number of bins is the square root of number of data points: n_bins
n_bins = np.sqrt(n_data)
# Convert number of bins to integer: n_bins
n_bins = int(n_bins)
# Plot the histogram
plt.hist(versicolor_petal_length, bins=n_bins)
# Label axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('count')
# Show histogram
plt.show()
_ = sns.swarmplot(x='state', y='dem_share', data=df_swing) _ = plt.xlabel('state') _ = plt.ylabel('percent of vote for Obama') plt.show()
In the IPython Shell, inspect the DataFrame
dfusing
df.head().
This will let you identify which column names you need to pass as the
xand
ykeyword
arguments in your call to
sns.swarmplot().
Use
sns.swarmplot()to make a bee swarm plot from the DataFrame containing the Fisher iris data set,
df.
The x-axis should contain each of the three species, and the y-axis should contain the petal lengths.
Label the axes.
Show your plot.
# Create bee swarm plot with Seaborn's default settings
sns.swarmplot(x='species', y='petal length (cm)', data=df)
# Label the axes
plt.xlabel('species')
plt.ylabel('petal length (cm)')
# Show the plot
plt.show()
Define a function with the signature
ecdf(data).
Within the function definition,
Compute the number of data points,
n, using the
len()function.
The xx-values
are the sorted data. Use the
np.sort()function to perform the sorting.
The yy data
of the ECDF go from
1/nto
1in
equally spaced increments. You can construct this using
np.arange(). Remember, however, that the end value in
np.arange()is
not inclusive. Therefore,
np.arange()will need to go from
1to
n+1.
Be sure to divide this by
n.
The function returns the values
xand
y.
def ecdf(data):
"""Compute ECDF for a one-dimensional array of measurements."""
# Number of data points: n
n = len(data)
# x-data for the ECDF: x
x = np.sort(data)
# y-data for the ECDF: y
y = np.arange(1, n+1) / n
return x, y
Use
ecdf()to compute the ECDF of
versicolor_petal_length.
Unpack the output into
x_versand
y_vers.
Plot the ECDF as dots. Remember to include
marker = '.'and
linestyle = 'none'in addition to
x_versand
y_versas
arguments inside
plt.plot().
Set the margins of the plot with
plt.margins()so that no data points are cut off. Use a 2% margin.
Label the axes. You can label the y-axis
'ECDF'.
Show your plot.
Use
ecdf()to compute the ECDF of
versicolor_petal_length.
Unpack the output into
x_versand
y_vers.
Plot the ECDF as dots. Remember to include
marker = '.'and
linestyle = 'none'in addition to
x_versand
y_versas
arguments inside
plt.plot().
Set the margins of the plot with
plt.margins()so that no data points are cut off. Use a 2% margin.
Label the axes. You can label the y-axis
'ECDF'.
Show your plot.
# Compute ECDF for versicolor data: x_vers, y_vers
x_vers, y_vers = ecdf(versicolor_petal_length)
# Generate plot
plt.plot(x_vers, y_vers, marker='.', linestyle='none')
# Make the margins nice
plt.margins(0.02)
# Label the axes
plt.ylabel('ECDF')
plt.xlabel('versicolor_petal_length')
# Display the plot
plt.show()
Compute ECDFs for each of the three species using your
ecdf()function. The variables
setosa_petal_length,
versicolor_petal_length,
and
virginica_petal_lengthare all in your namespace. Unpack the ECDFs into
x_set, y_set,
x_vers, y_versand
x_virg, y_virg, respectively.
Plot all three ECDFs on the same plot as dots. To do this, you will need three
plt.plot()commands.
Assign the result of each to
_.
Specify 2% margins.
A legend and axis labels have been added for you, so hit 'Submit Answer' to see all the ECDFs!
# Compute ECDFs
x_set, y_set = ecdf(setosa_petal_length)
x_vers, y_vers = ecdf(versicolor_petal_length)
x_virg, y_virg = ecdf(virginica_petal_length)
# Plot all ECDFs on the same plot
_ = plt.plot(x_set, y_set, marker='.', linestyle='none', )
_ = plt.plot(x_vers, y_vers, marker='.', linestyle='none', )
_ = plt.plot(x_virg, y_virg, marker='.', linestyle='none', )
# Make nice margins
plt.margins(0.02)
# Annotate the plot
plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right')
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')
# Display the plot
plt.show()
dataframe计算分位数
Create
percentiles, a NumPy array of percentiles you want to compute. These are the 2.5th, 25th, 50th,
75th, and 97.5th. You can do so by creating a list containing these ints/floats and convert the list to a NumPy array using
np.array().
For example,
np.array([30, 50])would create an array consisting of the 30th and 50th percentiles.
Use
np.percentile()to compute the percentiles of the petal lengths from the Iris versicolor samples.
The variable
versicolor_petal_lengthis in your namespace.
Print the percentiles.
# Specify array of percentiles: percentiles
percentiles = np.array([2.5, 25, 50, 75, 97.5])
# Compute percentiles: ptiles_vers
ptiles_vers = np.percentile(versicolor_petal_length, percentiles)
# Print the result
print(ptiles_vers)
画图并且标记处分位数点:
Plot the percentiles as red diamonds on the ECDF. Pass the x and y co-ordinates -
ptiles_versand
percentiles/100-
as positional arguments and specify the
marker='D',
color='red'and
linestyle='none'keyword
arguments. The argument for the y-axis -
percentiles/100has been specified for you.
Display the plot.
# Plot the ECDF
_ = plt.plot(x_vers, y_vers, '.')
plt.margins(0.02)
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')
# Overlay percentiles as red diamonds.
_ = plt.plot(ptiles_vers, percentiles/100, marker='D', color='red',
linestyle='none')
# Show the plot
plt.show()
计算方差
Create an array called
differencesthat is the difference between the petal lengths (
versicolor_petal_length)
and the mean petal length. The variable
versicolor_petal_lengthis already in your namespace as a NumPy array so
you can take advantage of NumPy's vectorized operations.
Square each element in this array. For example,
x**2squares each element in the array
x.
Store the result as
diff_sq.
Compute the mean of the elements in
diff_squsing
np.mean().
Store the result as
variance_explicit.
Compute the variance of
versicolor_petal_lengthusing
np.var().
Store the result as
variance_np.
Print both
variance_explicitand
variance_npin
one
# Array of differences to mean: differences
differences = versicolor_petal_length - np.mean(versicolor_petal_length)
# Square the differences: diff_sq
diff_sq = differences ** 2
# Compute the mean square difference: variance_explicit
variance_explicit = np.mean(diff_sq)
# Compute the variance using NumPy: variance_np
variance_np = np.var(versicolor_petal_length)
# Print the results
print(variance_explicit,variance_np)
计算皮尔逊相关系数
Define a function with signature
pearson_r(x, y).
Use
np.corrcoef()to compute the correlation matrix of
xand
y(pass
them to
np.corrcoef()in that order).
The function returns entry
[0,1]of the correlation matrix.
Compute the Pearson correlation between the data in the arrays
versicolor_petal_lengthand
versicolor_petal_width.
Assign the result to
r.
Print the result.
def pearson_r(x, y):
"""Compute Pearson correlation coefficient between two arrays."""
# Compute correlation matrix: corr_mat
corr_mat = np.corrcoef(x, y)
# Return entry [0,1]
return corr_mat[0,1]
# Compute Pearson correlation coefficient for I. versicolor: r
r = pearson_r(versicolor_petal_length,versicolor_petal_width)
# Print the result
print(r)
Draw samples out of the Binomial distribution using
np.random.binomial(). You should use parameters
n = 100and
p = 0.05, and set the
sizekeyword
argument to
10000.
Compute the CDF using your previously-written
ecdf()function.
Plot the CDF with axis labels. The x-axis here is the number of defaults out of 100 loans, while the y-axis is the CDF.
Show the plot.
# Take 10,000 samples out of the binomial distribution: n_defaults
n_defaults = np.random.binomial(100, 0.05, size = 10000)
# Compute CDF: x, y
x, y = ecdf(n_defaults)
# Plot the CDF with axis labels
plt.plot(x, y, marker = '.', linestyle = 'none' )
plt.xlabel('the number of defaults out of 100 loans')
plt.ylabel('CDF')
# Show the plot
plt.show()
正态分布的构建和绘制
Draw 100,000 samples from a Normal distribution that has a mean of
20and a standard
deviation of
1. Do the same for Normal distributions with standard deviations of
3and
10,
each still with a mean of
20. Assign the results to
samples_std1,
samples_std3and
samples_std10,
respectively.
Plot a histograms of each of the samples; for each, use 100 bins, also using the keyword arguments
normed=Trueand
histtype='step'.
The latter keyword argument makes the plot look much like the smooth theoretical PDF. You will need to make 3
plt.hist()calls.
Hit 'Submit Answer' to make a legend, showing which standard deviations you used, and show your plot! There is no need to label the axes because we have not defined what is being described by the Normal distribution; we are just looking at shapes
of PDFs.
# Draw 100000 samples from Normal distribution with stds of interest: samples_std1, samples_std3, samples_std10
samples_std1 = np.random.normal(20,1,100000)
samples_std3 = np.random.normal(20,3,100000)
samples_std10 = np.random.normal(20,10,100000)
# Make histograms
plt.hist(samples_std1,normed=True,histtype='step',bins=100)
plt.hist(samples_std3,normed=True,histtype='step',bins=100)
plt.hist(samples_std10,normed=True,histtype='step',bins=100)
# Make a legend, set limits and show plot
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'))
plt.ylim(-0.01, 0.42)
plt.show()
数据的标准正态分布拟合和原数据的拟合
Compute mean and standard deviation of Belmont winners' times with the two outliers removed. The NumPy array
belmont_no_outliershas
these data.
Take 10,000 samples out of a normal distribution with this mean and standard deviation using
np.random.normal().
Compute the CDF of the theoretical samples and the ECDF of the Belmont winners' data, assigning the results to
x_theor, y_theorand
x, y, respectively.
Hit submit to plot the CDF of your samples with the ECDF, label your axes and show the plot.
# Compute mean and standard deviation: mu, sigma
mu = np.mean(belmont_no_outliers)
sigma = np.std(belmont_no_outliers)
# Sample out of a normal distribution with this mu and sigma: samples
samples = np.random.normal(mu,sigma,10000)
# Get the CDF of the samples and of the data
x_theor,y_theor = ecdf(samples)
x, y = ecdf(belmont_no_outliers)
# Plot the CDFs and show the plot
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('Belmont winning time (sec.)')
_ = plt.ylabel('CDF')
plt.show()
指数分布
Define a function with call signature
successive_poisson(tau1, tau2, size=1)that samples the waiting time for a no-hitter and a hit of the cycle.
Draw waiting times
tau1(
sizenumber
of samples) for the no-hitter out of an exponential distribution and assign to
t1.
Draw waiting times
tau2(
sizenumber
of samples) for hitting the cycle out of an exponential distribution and assign to
t2.
The function returns the sum of the waiting times for the two events.
def successive_poisson(tau1, tau2, size=1):
# Draw samples out of first exponential distribution: t1
t1 = np.random.exponential(tau1, size)
# Draw samples out of second exponential distribution: t2
t2 = np.random.exponential(tau2, size)
return t1 + t2
Use your
successive_poisson()function to draw 100,000 out of the distribution
of waiting times for observing a no-hitter and a hitting of the cycle.
Plot the PDF of the waiting times using the step histogram technique of a previous exercise. Don't forget the necessary keyword arguments. You should use
bins=100,
normed=True,
and
histtype='step'.
Label the axes.
Show your plot.
# Draw samples of waiting times: waiting_times
waiting_times = successive_poisson(764, 715, size=100000)
# Make the histogram
plt.hist(waiting_times, bins=100, normed=True, histtype='step')
# Label axes
plt.xlabel('time')
plt.ylabel('waiting_times')
# Show the plot
plt.show()
多项式拟合
Compute the slope and intercept of the regression line using
np.polyfit(). Remember,
fertilityis
on the y-axis and
illiteracyon the x-axis.
Print out the slope and intercept from the linear regression.
To plot the best fit line, create an array
xthat consists of 0 and 100 using
np.array().
Then, compute the theoretical values of
ybased on your regression parameters. I.e.,
y = a * x + b.
Plot the data and the regression line on the same plot. Be sure to label your axes.
Hit 'Submit Answer' to display your plot.
# Plot the illiteracy rate versus fertility
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')
# Perform a linear regression using np.polyfit(): a, b
a, b = np.polyfit(illiteracy, fertility,1)
# Print the results to the screen
print('slope =', a, 'children per woman / percent illiterate')
print('intercept =', b, 'children per woman')
# Make theoretical line to plot
x = np.array([0,100])
y = a * x + b
# Add regression line to your plot
_ = plt.plot(x, y)
# Draw the plot
plt.show()
Specify the values of the slope for which to compute the RSS. Use
np.linspace()to get
200points
in the range between
0and
0.1.
For example, to get
100points in the range between
0and
0.5,
you could use
np.linspace()like so:
np.linspace(0, 0.5, 100).
Initialize an array,
rss, to contain the RSS using
np.empty_like()and
the array you created above. The
empty_like()function returns a new array with the same shape and type as a given
array (in this case,
a_vals).
Write a
forloop to compute the sum of RSS of the slope. Hint: the RSS is given
by
np.sum((y_data - a * x_data - b)**2). The variable
byou
computed in the last exercise is already in your namespace. Here,
fertilityis the
y_dataand
illiteracythe
x_data.
Plot the RSS (
rss) versus slope (
a_vals).
Hit 'Submit Answer' to see the plot!
# Specify slopes to consider: a_vals
a_vals = np.linspace(0, 0.1, 200)
# Initialize sum of square of residuals: rss
rss = np.empty_like(a_vals)
# Compute sum of square of residuals for each value of a_vals
for i, a in enumerate(a_vals):
rss[i] = np.sum((fertility - a*illiteracy - b)**2)
# Plot the RSS
plt.plot(a_vals, rss, '-')
plt.xlabel('slope (children per woman / percent illiterate)')
plt.ylabel('sum of square of residuals')
plt.show()
Compute the parameters for the slope and intercept using
np.polyfit(). The Anscombe
data are stored in the arrays
xand
y.
Print the slope
aand intercept
b.
Generate theoretical xx and yy data
from the linear regression. Your xx array,
which you can create with
np.array(), should consist of
3and
15.
To generate the yy data,
multiply the slope by
x_theorand add the intercept.
Plot the Anscombe data as a scatter plot and then plot the theoretical line. Remember to include the
marker='.'and
linestyle='none'keyword
arguments in addition to
xand
ywhen
to plot the Anscombe data as a scatter plot. You do not need these arguments when plotting the theoretical line.
Hit 'Submit Answer' to see the plot!
# Perform linear regression: a, b
a, b = np.polyfit(x,y,1)
# Print the slope and intercept
print(a, b)
# Generate theoretical x and y data: x_theor, y_theor
x_theor = np.array([3, 15])
y_theor = x_theor * a + b
# Plot the Anscombe data and theoretical line
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x,y,marker='.',linestyle='none')
# Label the axes
plt.xlabel('x')
plt.ylabel('y')
# Show the plot
plt.show()
样本小的时候进行重复采样的方法
Write a
forloop to acquire
50bootstrap
samples of the rainfall data and plot their ECDF.
Use
np.random.choice()to generate a bootstrap sample from the NumPy array
rainfall.
Be sure that the
sizeof the resampled array is
len(rainfall).
Use the function
ecdf()that you wrote in the prequel to this course to generate the
xand
yvalues
for the ECDF of the bootstrap sample
bs_sample.
Plot the ECDF values. Specify
color='gray'(to make gray dots) and
alpha=0.1(to
make them semi-transparent, since we are overlaying so many) in addition to the
marker='.'and
linestyle='none'keyword
arguments.
Use
ecdf()to generate
xand
yvalues
for the ECDF of the original rainfall data available in the array
rainfall.
Plot the ECDF values of the original data.
Hit 'Submit Answer' to visualize the samples!
for _ in range(50):
# Generate bootstrap sample: bs_sample
bs_sample = np.random.choice(rainfall, size=len(rainfall))
# Compute and plot ECDF from bootstrap sample
x, y = ecdf(bs_sample)
_ = plt.plot(x, y, marker='.', linestyle='none',
color='gray', alpha=0.1)
# Compute and plot ECDF from original data
x, y = ecdf(rainfall)
_ = plt.plot(x, y, marker='.')
# Make margins and label axes
plt.margins(0.02)
_ = plt.xlabel('yearly rainfall (mm)')
_ = plt.ylabel('ECDF')
# Show the plot
plt.show()
Draw
10000bootstrap replicates of the mean annual rainfall using
your
draw_bs_reps()function and the
rainfallarray. Hint:
Pass in
np.meanfor
functo
compute the mean.
As a reminder,
draw_bs_reps()accepts 3 arguments:
data,
func,
and
size.
Compute and print the standard error of the mean of
rainfall.
The formula to compute this is
np.std(data) / np.sqrt(len(data)).
Compute and print the standard deviation of your bootstrap replicates
bs_replicates.
Make a histogram of the replicates using the
normed=Truekeyword argument and
50bins.
Hit 'Submit Answer' to see the plot!
# Take 10,000 bootstrap replicates of the mean: bs_replicates
bs_replicates = draw_bs_reps(rainfall, np.mean, 10000)
# Compute and print SEM
sem = np.std(rainfall) / np.sqrt(len(rainfall))
print(sem)
# Compute and print standard deviation of bootstrap replicates
bs_std = np.std(bs_replicates)
print(bs_std)
# Make a histogram of the results
_ = plt.hist(bs_replicates, bins=50, normed=True)
_ = plt.xlabel('mean annual rainfall (mm)')
_ = plt.ylabel('PDF')
# Show the plot
plt.show()
计算置信区间
Generate
10000bootstrap replicates of ττ from
the
nohitter_timesdata using your
draw_bs_reps()function.
Recall that the the optimal ττ is
calculated as the mean of the data.
Compute the 95% confidence interval using
np.percentile()and passing in two arguments: The array
bs_replicates,
and the list of percentiles - in this case
2.5and
97.5.
Print the confidence interval.
Plot a histogram of your bootstrap replicates. This has been done for you, so hit 'Submit Answer' to see the plot!
# Draw bootstrap replicates of the mean no-hitter time (equal to tau): bs_replicates
bs_replicates = draw_bs_reps(nohitter_times,np.mean,10000)
# Compute the 95% confidence interval: conf_int
conf_int = np.percentile(bs_replicates,[2.5,97.5])
# Print the confidence interval
print('95% confidence interval =', conf_int, 'games')
# Plot the histogram of the replicates
_ = plt.hist(bs_replicates, bins=50, normed=True)
_ = plt.xlabel(r'$\tau$ (games)')
_ = plt.ylabel('PDF')
# Show the plot
plt.show()
重复抽取数据和进行线性拟合
Define a function with call signature
draw_bs_pairs_linreg(x, y, size=1)to perform pairs bootstrap estimates on linear regression parameters.
Use
np.arange()to set up an array of indices going from
0to
len(x).
These are what you will resample and use them to pick values out of the
xand
yarrays.
Use
np.empty()to initialize the slope and intercept replicate arrays to be of size
size.
Write a
forloop to:
Resample the indices
inds. Use
np.random.choice()to
do this.
Make new xx and yy arrays
bs_xand
bs_yusing
the the resampled indices
bs_inds. To do this, slice
xand
ywith
bs_inds.
Use
np.polyfit()on the new xx and yyarrays
and store the computed slope and intercept.
Return the pair bootstrap replicates of the slope and intercept.
def draw_bs_pairs_linreg(x, y, size=1):
"""Perform pairs bootstrap for linear regression."""
# Set up array of indices to sample from: inds
inds = np.arange(len(x))
# Initialize replicates: bs_slope_reps, bs_intercept_reps
bs_slope_reps = np.empty(size)
bs_intercept_reps = np.empty(size)
# Generate replicates
for i in range(size):
bs_inds = np.random.choice(inds, size=len(inds))
bs_x, bs_y = x[bs_inds], y[bs_inds]
bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, 1)
return bs_slope_reps, bs_intercept_reps
Generate an array of xx-values
consisting of
0and
100for
the plot of the regression lines. Use the
np.array()function for this.
Write a
forloop in which you plot a regression line with a slope and intercept given by the pairs
bootstrap replicates. Do this for
100lines.
When plotting the regression lines in each iteration of the
forloop, recall the regression equation
y = a*x + b. Here,
ais
bs_slope_reps[i]and
bis
bs_intercept_reps[i].
Specify the keyword arguments
linewidth=0.5,
alpha=0.2,
and
color='red'in your call to
plt.plot().
Make a scatter plot with
illiteracyon the x-axis and
fertilityon
the y-axis. Remember to specify the
marker='.'and
linestyle='none'keyword
arguments.
Label the axes, set a 2% margin, and show the plot. This has been done for you, so hit 'Submit Answer' to visualize the bootstrap regressions!
# Generate array of x-values for bootstrap lines: x
x = np.array([0,100])
# Plot the bootstrap lines
for i in range(100):
_ = plt.plot(x, bs_slope_reps[i]*x + bs_intercept_reps[i],
linewidth=0.5, alpha=0.2, color='red')
# Plot the data
_ = plt.plot()
# Label axes, set the margins, and show the plot
_ = plt.xlabel('illiteracy')
_ = plt.ylabel('fertility')
plt.margins(0.02)
plt.show()
Visualizing permutation sampling
Write a forloop to 50 generate permutation samples, compute their ECDFs, and plot them.
Generate a permutation sample pair from
rain_julyand
rain_novemberusing
your
permutation_sample()function.
Generate the
xand
yvalues
for an ECDF for each of the two permutation samples for the ECDF using your
ecdf()function.
Plot the ECDF of the first permutation sample (
x_1and
y_1)
as dots. Do the same for the second permutation sample (
x_2and
y_2).
Generate
xand
yvalues
for ECDFs for the
rain_julyand
rain_novemberdata
and plot the ECDFs using respectively the keyword arguments
color='red'and
color='blue'.
Label your axes, set a 2% margin, and show your plot. This has been done for you, so just hit 'Submit Answer' to view the plot!
for _ in range(50):
# Generate permutation samples
perm_sample_1, perm_sample_2 = permutation_sample(rain_july, rain_november)
# Compute ECDFs
x_1, y_1 = ecdf(perm_sample_1)
x_2, y_2 = ecdf(perm_sample_2)
# Plot ECDFs of permutation sample
_ = plt.plot(x_1, y_1, marker='.', linestyle='none',
color='red', alpha=0.02)
_ = plt.plot(x_2, y_2, marker='.', linestyle='none',
color='blue', alpha=0.02)
# Create and plot ECDFs from original data
x_1, y_1 = ecdf(rain_july)
x_2, y_2 = ecdf(rain_november)
_ = plt.plot(x_1, y_1, marker='.', linestyle='none', color='red')
_ = plt.plot(x_2, y_2, marker='.', linestyle='none', color='blue')
# Label axes, set margin, and show plot
plt.margins(0.02)
_ = plt.xlabel('monthly rainfall (mm)')
_ = plt.ylabel('ECDF')
plt.show()
P值的计算:
Define a function with call signature
diff_of_means(data_1, data_2)that returns the differences in
means between two data sets, mean of
data_1minus mean of
data_2.
Use this function to compute the empirical difference of means that was observed in the frogs.
Draw 10,000 permutation replicates of the difference of means.
Compute the p-value.
Print the p-value.
def diff_of_means(data_1, data_2):
"""Difference in means of two arrays."""
# The difference of means of data_1, data_2: diff
diff = np.mean(data_1) - np.mean(data_2)
return diff
# Compute difference of mean impact force from experiment: empirical_diff_means
empirical_diff_means = diff_of_means(force_a, force_b)
# Draw 10,000 permutation replicates: perm_replicates
perm_replicates = draw_perm_reps(force_a, force_b,
diff_of_means, size=10000)
# Compute p-value: p
p = np.sum( perm_replicates >= empirical_diff_means) / len(perm_replicates)
# Print the result
print('p-value =', p)
Translate the impact forces of Frog B such that its mean is 0.55 N.
Use your
draw_bs_reps()function to take 10,000 bootstrap replicates of the mean of your translated
forces.
Compute the p-value by finding the fraction of your bootstrap replicates that are less than the observed mean impact force of Frog B. Note that the variable of interest here is
force_b.
Print your p-value.
# Make an array of translated impact forces: translated_force_b
translated_force_b = force_b - np.mean(force_b) + 0.55
# Take bootstrap replicates of Frog B's translated impact forces: bs_replicates
bs_replicates = draw_bs_reps(translated_force_b, np.mean, 10000)
# Compute fraction of replicates that are less than the observed Frog B force: p
p = np.sum(bs_replicates <= np.mean(force_b)) / 10000
# Print the p-value
print('p = ', p)
Construct Boolean arrays,
demsand
repsthat
contain the votes of the respective parties; e.g.,
demshas 153
Trueentries
and 91
Falseentries.
Write a function,
frac_yay_dems(dems, reps)that returns the fraction of Democrats that voted yay.
The first input is an array of Booleans, Two inputs are required to use your
draw_perm_reps()function, but the
second is not used.
Use your
draw_perm_reps()function to draw 10,000 permutation replicates of the fraction of Democrat
yay votes.
Compute and print the p-value.
# Construct arrays of data: dems, reps
dems = np.array([True] * 153 + [False] * 91)
reps = np.array([True] * 136 + [False] * 35)
def frac_yay_dems(dems, reps):
"""Compute fraction of Democrat yay votes."""
frac = np.sum(dems) / len(dems)
return frac
# Acquire permutation samples: perm_replicates
perm_replicates = draw_perm_reps(dems, reps, frac_yay_dems, 10000)
# Compute and print p-value: p
p = np.sum(perm_replicates <= 153/244) / len(perm_replicates)
print('p-value =', p)
Compute the observed difference in mean inter-nohitter time using
diff_of_means().
Generate 10,000 permutation replicates of the difference of means using
draw_perm_reps().
Compute and print the p-value.
# Compute the observed difference in mean inter-no-hitter times: nht_diff_obs
nht_diff_obs = diff_of_means(nht_dead, nht_live)
# Acquire 10,000 permutation replicates of difference in mean no-hitter time: perm_replicates
perm_replicates = draw_perm_reps(nht_dead, nht_live,
diff_of_means, size=10000)
# Compute and print the p-value: p
p = np.sum(perm_replicates <= nht_diff_obs) / len(perm_replicates)
print('p-val =',p)
Compute the observed Pearson correlation between
illiteracyand
fertility.
Initialize an array to store your permutation replicates.
Write a
forloop to draw 10,000 replicates:
Permute the
illiteracymeasurements using
np.random.permutation().
Compute the Pearson correlation between the permuted illiteracy array,
illiteracy_permuted, and
fertility.
Compute and print the p-value from the replicates.
# Compute observed correlation: r_obs
r_obs = pearson_r(illiteracy, fertility)
# Initialize permutation replicates: perm_replicates
perm_replicates = np.empty(10000)
# Draw replicates
for i in range(10000):
# Permute illiteracy measurments: illiteracy_permuted
illiteracy_permuted = np.random.permutation(illiteracy)
# Compute Pearson correlation
perm_replicates[i] = pearson_r(illiteracy_permuted, fertility)
# Compute p-value: p
p = np.sum(perm_replicates >= r_obs) / len(perm_replicates)
print('p-val =', p)
Use your
ecdf()function to generate
x,yvalues
from the
controland
treatedarrays
for plotting the ECDFs.
Plot the ECDFs on the same plot.
The margins have been set for you, along with the legend and axis labels. Hit 'Submit Answer' to see the result!
# Compute x,y values for ECDFs
x_control, y_control = ecdf(control)
x_treated, y_treated = ecdf(treated)
# Plot the ECDFs
plt.plot(x_control, y_control, marker='.', linestyle='none')
plt.plot(x_treated, y_treated, marker='.', linestyle='none')
# Set the margins
plt.margins(0.02)
# Add a legend
plt.legend(('control', 'treated'), loc='lower right')
# Label axes and show plot
plt.xlabel('millions of alive sperm per mL')
plt.ylabel('ECDF')
plt.show()
Compute the mean alive sperm count of
controlminus that of
treated.
Compute the mean of all alive sperm counts. To do this, first concatenate
controland
treatedand
take the mean of the concatenated array.
Generate shifted data sets for both
controland
treatedsuch
that the shifted data sets have the same mean. This has already been done for you.
Generate 10,000 bootstrap replicates of the mean each for the two shifted arrays. Use your
draw_bs_reps()function.
Compute the bootstrap replicates of the difference of means.
The code to compute and print the p-value has been written for you. Hit 'Submit Answer' to see the result!
# Compute the difference in mean sperm count: diff_means
diff_means = np.mean(control) - np.mean(treated)
# Compute mean of pooled data: mean_count
mean_count = np.mean(np.concatenate((control, treated)))
# Generate shifted data sets
control_shifted = control - np.mean(control) + mean_count
treated_shifted = treated - np.mean(treated) + mean_count
# Generate bootstrap replicates
bs_reps_control = draw_bs_reps(control_shifted,
np.mean, size=10000)
bs_reps_treated = draw_bs_reps(treated_shifted,
np.mean, size=10000)
# Get replicates of difference of means: bs_replicates
bs_replicates = bs_reps_control - bs_reps_treated
# Compute and print p-value: p
p = np.sum(bs_replicates >= np.mean(control) - np.mean(treated)) \
/ len(bs_replicates)
print('p-value =', p)
相关文章推荐
- class Importing Data in Python (Part 1)
- Eclipse:Could not create the view: Plug-in org.eclipse.jdt.ui was unable to load class org.eclipse.jdt.internal.ui.packageview.PackageExplorerPart. 解决方法
- OOP Concepts in Python 2.x - Part 1
- Part 34 to 35 Talking about multiple class inheritance in C#
- Thinking in one of my experience——My second lecture in Miss Harrell’s class
- Use Module and Function instead of Class in Python
- Operator Overloading part 3(Chapter 12 of Thinking in C++)
- What is a metaclass in Python?
- Thinking In Python(Bruce Eckel) 程序员的Python快速入门
- An Introduction to Interactive Programming in Python (Part 1) - Week 1
- Thinking in Python(前言 + Python语言概览)
- class & inheritance in python demo
- python -- why defined &#39;__new__&#39; and &#39;__init__&#39; all in a class
- Base class hijacks an interface(Thinking in java(英文版)700)
- WebSphere Class Loaders and Shared Library, Part 3 (Class loader in WebSphre Portal + Shared Library )
- python调用py方法,报错【ValueError: no such test method in <class 'mytestcase.MyTestCase'>: runTest】
- Thinking in Hibernate Configuration Class
- Some tips about Class in Python
- [C++/Python] 如何在C++中使用一个Python类? (Use Python-defined class in C++)
- metaclass in python