
Distance in Statistics

2017-03-28 18:02

Minkowski distance

The Minkowski distance is a metric in a normed vector space which can be considered a generalization of both the Euclidean distance and the Manhattan distance.

Definition

The Minkowski distance of order p between two points

$$X = (x_1, x_2, \ldots, x_n) \quad \text{and} \quad Y = (y_1, y_2, \ldots, y_n) \in \mathbb{R}^n$$

is defined as

$$\left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

For $p \geq 1$, the Minkowski distance is a metric as a result of the Minkowski inequality. For $p < 1$, however, the distance between $(0,0)$ and $(1,1)$ is $2^{1/p} > 2$, while the point $(0,1)$ is at distance 1 from both of these points. Since this violates the triangle inequality, the Minkowski distance is not a metric for $p < 1$.

The Minkowski distance is typically used with $p$ equal to 1 or 2. The latter is the Euclidean distance, while the former is sometimes known as the Manhattan distance. In the limiting case of $p$ reaching infinity, we obtain the Chebyshev distance:

$$\lim_{p \to \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} = \max_{i=1}^{n} |x_i - y_i|$$

Similarly, for p reaching negative infinity, we have:

$$\lim_{p \to -\infty} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} = \min_{i=1}^{n} |x_i - y_i|$$

The Minkowski distance can also be viewed as a multiple of the power mean of the component-wise differences between $X$ and $Y$.
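
As a quick illustration, here is a minimal Python sketch of these definitions; the function names are my own, not taken from any particular library:

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p >= 1 between two equal-length sequences."""
    if p < 1:
        raise ValueError("the Minkowski distance is not a metric for p < 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def chebyshev_distance(x, y):
    """Limiting case p -> infinity: the maximum component-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0, 0), (1, 1)
print(minkowski_distance(x, y, 1))  # Manhattan distance: 2.0
print(minkowski_distance(x, y, 2))  # Euclidean distance: ~1.414
print(chebyshev_distance(x, y))     # Chebyshev distance: 1
```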

Mahalanobis distance

Unlike the Euclidean distance, the Mahalanobis distance takes into account the correlations between the dimensions of the data.

Definition

For random vectors $X \in \mathbb{R}^m$ and $Y \in \mathbb{R}^n$, the $m \times n$ cross-covariance matrix is equal to

$$\operatorname{Cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])^T\right] = E[XY^T] - E[X]E[Y]^T$$

Similarly, the covariance matrix $\Sigma$ of a random vector $X \in \mathbb{R}^n$ is

$$\Sigma(X) = \operatorname{Cov}(X, X)$$

The Mahalanobis distance between two points X and Y is defined as

$$D_M(X, Y) = \sqrt{(X - Y)^T \Sigma^{-1} (X - Y)}$$

If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, then the resulting distance is called a normalized Euclidean distance:

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\sigma_i^2}}$$
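
A minimal NumPy sketch of the Mahalanobis distance, assuming the covariance matrix is estimated from sample data and is invertible (the function name and sample data are illustrative):

```python
import numpy as np

def mahalanobis_distance(x, y, cov):
    """Mahalanobis distance between points x and y given a covariance matrix."""
    diff = np.asarray(x) - np.asarray(y)
    cov_inv = np.linalg.inv(cov)  # Sigma^{-1}; assumes cov is invertible
    return float(np.sqrt(diff @ cov_inv @ diff))

# Estimate the covariance from sample data (rows are observations).
data = np.array([[1.0, 2.0], [2.0, 2.5], [3.0, 4.0], [4.0, 4.5]])
cov = np.cov(data, rowvar=False)

print(mahalanobis_distance(data[0], data[2], cov))
# With the identity covariance the result reduces to the Euclidean distance:
print(mahalanobis_distance(data[0], data[2], np.eye(2)))
```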

Hamming distance

In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ. In other words, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.

Examples

The Hamming distance between:

“karolin” and “kathrin” is 3.

“karolin” and “kerstin” is 3.

1011101 and 1001001 is 2.

2173896 and 2233796 is 3.
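
A straightforward Python sketch that reproduces the examples above:

```python
def hamming_distance(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance is only defined for equal lengths")
    return sum(a != b for a, b in zip(s, t))

print(hamming_distance("karolin", "kathrin"))  # 3
print(hamming_distance("karolin", "kerstin"))  # 3
print(hamming_distance("1011101", "1001001"))  # 2
print(hamming_distance("2173896", "2233796"))  # 3
```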

Jaccard similarity coefficient

The Jaccard similarity coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

So, the Jaccard distance is defined as

$$J_\delta(A, B) = 1 - J(A, B)$$
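
A small Python sketch over finite sets; treating two empty sets as having similarity 1 is a common convention, not part of the definition above:

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for finite sets; J = 1 when both sets are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)

print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 2 / 4 = 0.5
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))    # 0.5
```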

Pearson correlation coefficient

The Pearson correlation coefficient is the covariance of two variables divided by the product of their standard deviations:

$$\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

The correlation distance is then defined as $D_{X,Y} = 1 - \rho_{X,Y}$.
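
A NumPy sketch of the sample version of this distance (the function name is my own):

```python
import numpy as np

def correlation_distance(x, y):
    """1 - Pearson correlation between two equal-length samples."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))  # sample covariance
    rho = cov / (x.std() * y.std())                 # Pearson correlation
    return 1.0 - rho

print(correlation_distance([1, 2, 3, 4], [2, 4, 6, 8]))  # 0.0: perfectly correlated
print(correlation_distance([1, 2, 3, 4], [8, 6, 4, 2]))  # 2.0: perfectly anti-correlated
```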

Information entropy

Information entropy is a measure of the dispersion of a distribution:

$$\operatorname{Entropy}(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where

n: the number of classes in the sample set X

$p_i$: the probability that a sample belongs to the i-th class
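
A short Python sketch of this formula, assuming the class probabilities sum to 1; terms with $p_i = 0$ are conventionally treated as contributing 0:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: maximal uncertainty for two classes
print(entropy([1.0]))       # 0.0: no uncertainty
print(entropy([0.25] * 4))  # 2.0 bits
```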