Calculate ECDF in Python

Published Feb 26, 2018

ECDF: Empirical Cumulative Distribution Function
An empirical distribution function is the distribution function associated with the empirical measure of a sample. This cumulative distribution function is a step function that jumps up by 1/n at each of the n data points. Its value at any specified value of the measured variable is the fraction of observations that are less than or equal to that value.
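
As a minimal sketch of this definition (the helper name `ecdf_at` and the four-element sample are made up for illustration), the ECDF at a value t is simply the share of observations that are less than or equal to t:

```python
import numpy as np

def ecdf_at(sample, t):
    """Fraction of observations in `sample` that are <= t (illustrative helper)."""
    sample = np.asarray(sample)
    # The mean of a boolean array is the fraction of True entries
    return np.mean(sample <= t)

sample = [3, 1, 2, 2]          # made-up sample, n = 4
print(ecdf_at(sample, 0))      # 0.0  (nothing is <= 0)
print(ecdf_at(sample, 2))      # 0.75 (1, 2 and 2 are <= 2)
print(ecdf_at(sample, 3))      # 1.0  (every observation is <= 3)
```

Note that the repeated value 2 makes the function jump by 2/n at that point, since each observation contributes 1/n.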

It is hard to find a Python equivalent of R's ecdf() function. A few code snippets are available online, but I verified the approach below as the closest match to R's ecdf(). It follows the standard algorithm for calculating the ECDF of a given data set.

If you want to calculate it yourself, the function is short (in terms of lines of code).

Let's say we are given a 1-D array named data.

  data = [101, 118, 121, 103, 142, 111, 119, 122, 128, 112, 117, 157]

We first convert it into a NumPy array.

  import numpy as np
  raw_data = np.array(data)

Now we find the x and y values that represent the empirical CDF of the data.

  def ecdf(raw_data):
      # create a sorted series of unique data (np.unique returns sorted values)
      cdfx = np.unique(raw_data)
      # x-data for the ECDF: evenly spaced sequence spanning the uniques
      x_values = np.linspace(start=min(cdfx), stop=max(cdfx), num=len(cdfx))
      # size of the raw data
      size_data = raw_data.size
      # y-data for the ECDF:
      y_values = []
      for i in x_values:
          # all the values in raw data less than or equal to the ith value in x_values
          temp = raw_data[raw_data <= i]
          # fraction of those values with respect to the size of the raw data
          value = temp.size / size_data
          # pushing the value in the y_values
          y_values.append(value)
      # return both x and y values
      return x_values, y_values
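
The same computation can be sketched without an explicit loop. This version is only an illustration, not the code the results in this post were produced with: the helper name `ecdf_on_grid` is hypothetical, and it assumes the grid runs from the smallest to the largest unique value with as many points as there are unique values, which is what the x values listed in this post suggest.

```python
import numpy as np

data = [101, 118, 121, 103, 142, 111, 119, 122, 128, 112, 117, 157]

def ecdf_on_grid(values):
    """Vectorized ECDF on an evenly spaced grid (hypothetical helper name)."""
    values = np.sort(np.asarray(values))
    uniques = np.unique(values)
    # Assumption: one grid point per unique value, from min to max
    x_values = np.linspace(uniques[0], uniques[-1], num=uniques.size)
    # side="right" counts how many sorted values are <= each grid point
    counts = np.searchsorted(values, x_values, side="right")
    return x_values, counts / values.size

x_values, y_values = ecdf_on_grid(data)
print(x_values)
print(y_values)
```

Sorting once and using `np.searchsorted` avoids rescanning the whole array for every grid point, which may matter for large inputs.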

For the sample data, the result is two arrays holding the x and y values of the ECDF:

"cdfplotx": [101, 106.090909090909, 111.181818181818, 116.272727272727, 121.363636363636, 126.454545454545, 131.545454545455, 136.636363636364, 141.727272727273, 146.818181818182, 151.909090909091, 157]

"cdfploty": [0.0833333333333333, 0.166666666666667, 0.25, 0.333333333333333, 0.666666666666667, 0.75, 0.833333333333333, 0.833333333333333, 0.833333333333333, 0.916666666666667, 0.916666666666667, 1]

These results were verified against R's ecdf() function, using the same data:

x <- c(101, 118, 121, 103, 142, 111, 119, 122, 128, 112, 117, 157)

Evaluating the function that R's ecdf(x) returns at the x values above gives the corresponding y values.
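
As a rough cross-check in Python (a sketch under my own assumptions, not R's implementation), the exact jump points of the step function can be listed with `np.unique`. The variable names `knots` and `heights` are illustrative only.

```python
import numpy as np

x = np.array([101, 118, 121, 103, 142, 111, 119, 122, 128, 112, 117, 157])

# Sorted distinct values are the points where the step function jumps
knots, counts = np.unique(x, return_counts=True)
# Cumulative share of observations <= each knot
heights = np.cumsum(counts) / x.size

for k, h in zip(knots, heights):
    print(k, round(h, 4))
```

Because every value in this sample is distinct, each jump is exactly 1/12, and the last height is 1.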

Discover and read more posts from Kripanshu Bhargava