Calculate ECDF in Python
ECDF: Empirical Cumulative Distribution Function
An empirical distribution function is the function associated with the empirical measure of a sample. This cumulative distribution function is a step function that jumps up by 1/n at each of the n data points. Its value at any specified value of the measured variable is the fraction of observations of the measured variable that are less than or equal to the specified value.
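Let (x1, …, xn) be independent, identically distributed real random variables with the common cumulative distribution function F(t). Then the empirical distribution function is defined as:
$$\hat{F}_n(t) = \frac{\text{number of elements in the sample} \le t}{n} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{x_i \le t}$$
where $\mathbf{1}_{x_i \le t}$ is 1 when $x_i \le t$ and 0 otherwise.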
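As a quick illustration of the definition, the sketch below evaluates the empirical distribution function at a single point t by counting the observations that are less than or equal to t. The helper name ecdf_at and the small sample are made up for this example and do not come from any library.

```python
def ecdf_at(sample, t):
    """Fraction of observations in `sample` that are <= t (hypothetical helper)."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must not be empty")
    # count the observations that do not exceed t, then divide by the sample size
    return sum(1 for x in sample if x <= t) / n


sample = [3, 1, 4, 1, 5]  # made-up sample, for illustration only
print(ecdf_at(sample, 0))  # 0.0: no observation is <= 0
print(ecdf_at(sample, 1))  # 0.4: two of the five observations are <= 1
print(ecdf_at(sample, 4))  # 0.8: four of the five observations are <= 4
print(ecdf_at(sample, 9))  # 1.0: every observation is <= 9
```

Each repeated value contributes its own 1/n jump, which is why the function rises by 2/5 at t = 1 in this sample.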
Coming to my point: it is hard to find a direct alternative to R's ecdf() function in Python. A few code snippets are available online, but the approach below is the closest match to R's ecdf() that I have verified. It follows the standard algorithm for calculating the ECDF of a given data set.
If you want to calculate it yourself, the function is short (in terms of lines of code).
Let's say we are given a 1-D array that we name data.
data = [101, 118, 121, 103, 142, 111, 119, 122, 128, 112, 117, 157]
We first convert it into a NumPy array.
import numpy as np
raw_data = np.array(data)
Now we find the x and y values that represent the ECDF of the data.
def ecdf(raw_data):
    # create a sorted array of the unique data (np.unique returns sorted values)
    cdfx = np.unique(raw_data)
    # x-data for the ECDF: evenly spaced sequence spanning the uniques
    x_values = np.linspace(start=cdfx.min(), stop=cdfx.max(), num=len(cdfx))
    # number of observations in the raw data
    size_data = raw_data.size
    # y-data for the ECDF:
    y_values = []
    for i in x_values:
        # all the values in raw data less than or equal to the ith value in x_values
        temp = raw_data[raw_data <= i]
        # fraction of those values with respect to the size of the raw data
        value = temp.size / size_data
        # pushing the value in the y_values
        y_values.append(value)
    # return both x and y values
    return x_values, y_values

x_values, y_values = ecdf(raw_data)
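The same calculation can also be written without an explicit Python loop. The following is a minimal sketch, assuming NumPy is installed; the name ecdf_on_grid is hypothetical. It relies on np.searchsorted with side="right", which, for sorted data, returns the number of observations less than or equal to each grid point.

```python
import numpy as np


def ecdf_on_grid(data):
    """Evaluate the ECDF of `data` on an evenly spaced grid (hypothetical helper)."""
    raw = np.sort(np.asarray(data, dtype=float))
    # one grid point per unique value, spanning min(data) to max(data)
    n_unique = np.unique(raw).size
    grid = np.linspace(raw[0], raw[-1], num=n_unique)
    # for sorted data, side="right" counts the observations <= each grid point
    counts = np.searchsorted(raw, grid, side="right")
    return grid, counts / raw.size


data = [101, 118, 121, 103, 142, 111, 119, 122, 128, 112, 117, 157]
x_values, y_values = ecdf_on_grid(data)
print(np.round(x_values, 2))
print(np.round(y_values, 4))
```

For a sample this small the loop and the vectorized form are equally practical; the vectorized form mainly helps when the data set is large.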
The result is two arrays representing the x and y values of the ECDF:
"cdf_x": [101, 106.090909090909, 111.181818181818, 116.272727272727, 121.363636363636, 126.454545454545, 131.545454545455, 136.636363636364, 141.727272727273, 146.818181818182, 151.909090909091, 157]
"cdf_y": [0.0833333333333333, 0.166666666666667, 0.25, 0.333333333333333, 0.666666666666667, 0.75, 0.833333333333333, 0.833333333333333, 0.833333333333333, 0.916666666666667, 0.916666666666667, 1]
These results are verified with the following function in R:
x <- c(101, 118, 121, 103, 142, 111, 119, 122, 128, 112, 117,157)
ecdf(x)
Note that ecdf(x) in R returns a step function rather than a pair of arrays; evaluating that function at a set of x values gives the corresponding y values.
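For readers who prefer an interface closer to R's, where ecdf() hands back a function, here is a rough Python sketch that uses only the standard library. The name make_ecdf is hypothetical, and the sketch is not claimed to match R's ecdf() in every detail; it only returns a callable that gives the fraction of observations less than or equal to its argument.

```python
from bisect import bisect_right


def make_ecdf(sample):
    """Return a step function F with F(t) = fraction of sample <= t (hypothetical helper)."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        raise ValueError("sample must not be empty")

    def F(t):
        # bisect_right gives the number of sorted observations that are <= t
        return bisect_right(ordered, t) / n

    return F


x = [101, 118, 121, 103, 142, 111, 119, 122, 128, 112, 117, 157]
Fn = make_ecdf(x)
print(Fn(100))  # 0.0: below the smallest observation
print(Fn(121))  # 8 of the 12 observations are <= 121
print(Fn(157))  # 1.0: at the largest observation
```

Sorting once and then using a binary search keeps each evaluation cheap, which matters if the function is called at many points.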