Calculate ECDF in Python

Published Feb 26, 2018. Last updated May 22, 2018.

ECDF: Empirical Cumulative Distribution Function
An empirical distribution function is the function associated with the empirical measure of a sample. This cumulative distribution function is a step function that jumps up by 1/n at each of the n data points. Its value at any specified value of the measured variable is the fraction of observations of the measured variable that are less than or equal to the specified value.
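Let (x1, …, xn) be independent, identically distributed real random variables with the common cumulative distribution function F(t). Then the empirical distribution function is defined as:

  \hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{x_i \le t}

where \mathbf{1}_{x_i \le t} equals 1 if x_i ≤ t and 0 otherwise.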
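To make the definition concrete, here is a minimal sketch that evaluates the ECDF at a single point by counting the observations that are less than or equal to it. The function name `ecdf_at` and the small sample are illustrative assumptions, not part of the original post.

```python
import numpy as np

def ecdf_at(sample, t):
    """Fraction of observations in `sample` that are <= t."""
    sample = np.asarray(sample, dtype=float)
    return np.count_nonzero(sample <= t) / sample.size

sample = [3, 1, 2, 2]        # illustrative data, n = 4
print(ecdf_at(sample, 0))    # 0.0  (below every observation)
print(ecdf_at(sample, 2))    # 0.75 (three of the four values are <= 2)
print(ecdf_at(sample, 3))    # 1.0  (at the maximum)
```

Because the value 2 appears twice, the function jumps by 2/n at that point rather than 1/n.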

Coming to my point: it is hard to find a Python alternative to R's ecdf() function. A few code snippets are available online, but the one below is the closest match to R's ecdf() that I have verified. It follows the algorithm for calculating the ECDF of a given data set.

If you want to calculate it on your own, the function is not long (in terms of lines of code).

Let's say we are given a one-dimensional list of values named data.

  data = [101, 118, 121, 103, 142, 111, 119, 122, 128, 112, 117,157]

First, convert it into a NumPy array.

  import numpy as np

  raw_data = np.array(data)

Next, compute the x and y values that represent the ECDF of the data.

  def ecdf(raw_data):
      # sorted unique values of the data
      cdfx = np.unique(raw_data)
      # x-data for the ECDF: evenly spaced points from the smallest to the
      # largest value, one point per unique value
      x_values = np.linspace(start=cdfx.min(), stop=cdfx.max(), num=len(cdfx))

      # number of observations in the raw data
      size_data = raw_data.size
      # y-data for the ECDF
      y_values = []
      for i in x_values:
          # all the values in raw data less than or equal to the ith value in x_values
          temp = raw_data[raw_data <= i]
          # fraction of the observations, relative to the size of the raw data
          value = temp.size / size_data
          # push the value onto y_values
          y_values.append(value)
      # return both x and y values
      return x_values, y_values

  x_values, y_values = ecdf(raw_data)
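The loop above can also be expressed with NumPy broadcasting. The following is a sketch of the same computation under that assumption; the name `ecdf_vectorized` is hypothetical and is not taken from the original post.

```python
import numpy as np

def ecdf_vectorized(data):
    """Return (x_values, y_values) of the ECDF on an evenly spaced grid."""
    raw_data = np.asarray(data, dtype=float)
    uniques = np.unique(raw_data)  # sorted unique values
    x_values = np.linspace(uniques[0], uniques[-1], num=uniques.size)
    # Compare every grid point with every observation: shape (len(x), n).
    below_or_equal = raw_data[np.newaxis, :] <= x_values[:, np.newaxis]
    # The row-wise mean is the fraction of observations <= each grid point.
    y_values = below_or_equal.mean(axis=1)
    return x_values, y_values

data = [101, 118, 121, 103, 142, 111, 119, 122, 128, 112, 117, 157]
x_values, y_values = ecdf_vectorized(data)
print(np.round(x_values, 3))
print(np.round(y_values, 4))
```

This trades memory for speed: the intermediate boolean array has one entry per (grid point, observation) pair, so the explicit loop may be preferable for very large inputs.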

The result is two sequences holding the x and y values of the ECDF:

"cdf_x": [101, 106.090909090909, 111.181818181818, 116.272727272727, 121.363636363636, 126.454545454545, 131.545454545455, 136.636363636364, 141.727272727273, 146.818181818182, 151.909090909091, 157]

"cdf_y": [0.0833333333333333, 0.166666666666667, 0.25, 0.333333333333333, 0.666666666666667, 0.75, 0.833333333333333, 0.833333333333333, 0.833333333333333, 0.916666666666667, 0.916666666666667, 1]

These results were verified with the following function in R:

x <- c(101, 118, 121, 103, 142, 111, 119, 122, 128, 112, 117,157)
ecdf(x)

Note that ecdf(x) in R returns a function rather than a table of values; evaluating that function at the x values listed above gives the corresponding y values.
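Since R's ecdf() returns a function, a rough Python analogue is a factory that returns a callable. The sketch below is one way to do that, assuming NumPy's `searchsorted`; the names `make_ecdf` and `Fn` are hypothetical, and this is not a port of R's implementation.

```python
import numpy as np

def make_ecdf(sample):
    """Return a callable that evaluates the ECDF of `sample`."""
    sorted_sample = np.sort(np.asarray(sample, dtype=float))
    n = sorted_sample.size

    def ecdf(t):
        # side="right" counts the observations that are <= t
        return np.searchsorted(sorted_sample, t, side="right") / n

    return ecdf

x = [101, 118, 121, 103, 142, 111, 119, 122, 128, 112, 117, 157]
Fn = make_ecdf(x)
print(Fn(120))               # 7 of the 12 values are <= 120
print(Fn([100, 101, 157]))   # accepts a sequence of points as well
```

The returned callable can be evaluated at any points, including the evenly spaced x values used earlier.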

Kripanshu Bhargava

Software Developer intern at Harvard | Graduate Student at UT Dallas
