Data Science and Python

Published Mar 29, 2018. Last updated Sep 24, 2018.

What is this post about?

When I google "Data Science", 8 out of 10 results are data science courses. When I open these courses, I see the same phrases again and again, such as "learn Python for Data Science" and "basics of Data Analysis in Python". You can probably see why I emphasized Python. Python is widely regarded in the tech industry as the most important language for data science at present. I want to dig into this claim and find out why it holds true: how Python became the most important language for data science, what bond it has with the field, and what developments in the last decade helped Python climb the data science ladder to the top spot, leaving R and SAS behind directly or indirectly.

What does this post not cover?

This post is not a tutorial of any kind. It does not cover data science techniques or Python programming. Instead, it highlights the relationship between the Python language and data science, and how they work together. If you want to learn Python or data science techniques, this post is not for you.

What is a "Data Scientist" and how did the role evolve?

The title Data Scientist became popular in the late 90s. I loved the definition given at alexa.com: "Data Scientist is the adult version of a kid who can't stop asking - Why?". Data science has developed so quickly that Data Scientist is now described as the sexiest job title.

There are roughly 4.4 trillion GB of digital data, and this is expected to grow tenfold in the next decade. Companies are increasingly interested in making sense of the available data to gain a competitive edge. Now the question arises: how do they hire?

In 2006
Designation- Data Analyst or Business Analyst
Tools- SAS (most popular), R, WEKA, Statistica
Background- Statistics, Mathematics and Economics

In 2011
Companies realized the importance of domain knowledge. Analysts were working in industries they had never been part of, on products they had never used, and for customers they could never relate to. At this point, business understanding became key. Analytics teams now included MBAs, engineers and even psychology graduates.
R emerged as a strong competitor to SAS because it was open source. Hadoop emerged, and large data sets came to be called Big Data.

At Present
Data Scientists have separated themselves from Data Analysts. Data Scientists can specialize in Data Science, Machine Learning, Big Data or Data Visualization.
Each of these specializations leads a practitioner to a specialized role according to their expertise.

Any student who wants to learn Data Science must spend some time analysing their toolkit.

Python's offering

Python provides the functionality data scientists need and integrates well with tools such as Hadoop and Spark. Let us see how.

Python Packages

| Data Scientist's Question | Python's Answer |
| --- | --- |
| Q1. How to do numerical analysis easily? | NumPy - supports large, N-dimensional arrays and powerful mathematical functions. |
| Q2. How to manipulate data? | Pandas - supports tabular data structures called DataFrames and operations on them. |
| Q3. How should I visualize data? | Matplotlib - built on NumPy, and works with Pandas data, to support data visualization. |
| Q4. How to do scientific analysis and computing? | SciPy - supports scientific and technical computing. |
| Q5. How to do statistical analysis? | statsmodels - supports statistical modelling and analysis. |
| Q6. How to implement machine learning? | Scikit-learn - supports machine learning and predictive modelling. It is built on NumPy, SciPy and Matplotlib. |
| Q7. How to implement neural networks? | TensorFlow - supports creation of deep learning models directly or by using wrapper libraries. |
| Q8. How to connect to a MySQL database? | PyMySQL - connects to a MySQL database to execute queries and extract data. |
| Q9. How to read XML and HTML data? | BeautifulSoup - parses XML and HTML data. |
| Q10. I want an interactive programming notebook like R, what can I do? | Jupyter Notebook - supports interactive programming along with visualizations. |
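
To make Q1 and Q2 above a little more concrete, here is a minimal sketch of NumPy and Pandas working together. The sales table, its column names and its numbers are made up purely for illustration; they do not come from any real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: units sold per region and month (illustrative only).
sales = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south"],
        "month": ["jan", "feb", "jan", "feb"],
        "units": [120, 150, 90, 110],
    }
)

# Q1 (NumPy): pull the column out as an array and do vectorised maths on it.
units = sales["units"].to_numpy()
mean_units = units.mean()

# Reshape to (region, month) and take the feb - jan difference for each region.
growth = np.diff(units.reshape(2, 2), axis=1).ravel()

# Q2 (Pandas): group the table by region and total the units.
totals = sales.groupby("region")["units"].sum()

print(mean_units)
print(growth)
print(totals)
```

The same DataFrame can be passed on to Matplotlib for plotting or to Scikit-learn for modelling, which is a large part of why these packages are usually learned together.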

Conclusion

This post does not tell you to practice data science with Python; it only highlights the richness of the libraries Python offers. Developers are working on Python packages all the time to provide more functionality, which has made Python a good choice.

Post Notes
I tried to cover some of the most important requirements in data science and Python's solutions to them. Please let me know if I missed anything. 😃 😃

Discover and read more posts from Kunal Dhawan
Post comments
bootcamprag
6 years ago

Thanks for the information… I am loving the Python language and hope to do good things… I am looking for a mentor and direction

Kunal Dhawan
6 years ago

feel free to ask for help from me

M. Badawy
6 years ago

For DS poc projects, and specifically for statistical learning algorithms, Python is a distant second to R imho. I use both, and to me libraries like bnlearn, Shiny Dashboard, even igraph offer more capabilities that are hard to come by in the Python world. And SAS, really?

Now, u wanna work on a large project that is production ready, python is prob only option there.

TF, Spark, H2o, etc. don’t add anything to the discussion as you can access them equally as easily from R as well.

Gael Lickindorf
6 years ago

Really, it took me 1s to find python-igraph

M. Badawy
6 years ago

Lol, did you try it? It’s like 2-3 versions behind, not same level of support for some weird reason. Best graph library (performance-wise) I could find for Python was graph-tool, but it’s still behind R’s igraph implementation. Wake me up when you find something as awesome as “bnlearn” though, well, maybe Uber’s Pyro, but I’m not sure they are directly comparable.

My comment was not meant to start a pissing contest, I use both R & Python for totally different use-cases. I only meant to stress that R can be far superior, especially for cutting edge statistical-based ML algos. Avoiding R because it is “single-threaded”, or “not very Pythonic” is not wise imho.

SAS on the other hand, oh well …;)

KWGD1980
6 years ago

Hit the nail on the head. Python is not the only language one should learn for the data science field…

In the end, you pick the best tool for the job, and in some cases it’s Python, and some cases it’s not.

Karandeep Singh
6 years ago

Informative and very nicely written.
