Data Science and Python
What is this post about?
When I google "Data Science", 8 out of 10 results are Data Science courses. When I open these courses, I see the same phrases again and again, such as "learn Python for Data Science" and "basics of Data Analysis in Python". You have probably noticed the word these phrases share: Python. It is widely accepted in the tech industry that Python is currently the most important language for Data Science. I want to dig into this claim and find out why it holds true: how Python became the most important language for Data Science, what bond it has with the field, and which developments over the last decade helped Python climb the Data Science ladder to the top spot, leaving R and SAS behind directly or indirectly.
What does this post not cover?
This post is not a tutorial of any kind. It does not cover Data Science techniques or Python programming. Instead, it highlights the relationship between the Python language and Data Science and how the two work together. If you want to learn Python or Data Science techniques, this post is not for you.
What is a "Data Scientist" and how did the role evolve?
The title "Data Scientist" became popular in the late 90s. I loved the definition given at alexa.com: "Data Scientist is the adult version of a kid who can't stop asking - Why?". Data Science has developed so quickly that Data Scientist is now considered the sexiest job title.
There are roughly 4.4 trillion GB of digital data, and this is expected to grow tenfold in the next decade. Companies are increasingly interested in making sense of the available data to gain a competitive edge. Now the question arises: how do they hire?
Designation - Data Analyst or Business Analyst
Tools - SAS (most popular), R, WEKA, Statistica
Background - Statistics, Mathematics and Economics
Companies then realized the importance of domain knowledge. Analysts were working for industries they had never been part of, on products they had never used, and for customers they could not relate to. At this point, business understanding became a key requirement. Analytics teams now included MBAs, engineers and even psychology graduates.
R emerged as a strong competitor to SAS because it was open source. Hadoop emerged, and large data sets came to be called Big Data.
Data Scientists have since separated themselves from Data Analysts. They can specialize in Data Science, Machine Learning, Big Data or Data Visualization.
Each of these specializations leads a practitioner to a specialized role that matches their expertise.
Any student who wants to learn Data Science must spend some time analysing their toolkit.
Python provides the functionality data scientists need and integrates well with tools such as Hadoop and Spark. Let us see how:
|Data Scientist's Question|Python's Answer|
|Q1. How to do numerical analysis easily?|NumPy - supports large, N-dimensional arrays and powerful mathematical functions.|
|Q2. How to manipulate data?|Pandas - supports tabular data structures called DataFrames and operations on them.|
|Q3. How should I visualize data?|Matplotlib - built on NumPy to support data visualization; it also works well with Pandas.|
|Q4. How to do scientific analysis and computing?|SciPy - supports scientific and technical computing.|
|Q5. How to do statistical analysis?|statsmodels - supports statistical analysis.|
|Q6. How to implement machine learning?|Scikit-learn - supports machine learning and predictive modelling. It is built on NumPy, SciPy and Matplotlib.|
|Q7. How to implement neural networks?|TensorFlow - supports creation of deep learning models directly or by using wrapper libraries.|
|Q8. How to connect to a MySQL database?|PyMySQL - supports easy connectivity to a MySQL database to execute queries and extract data.|
|Q9. How to read XML and HTML data?|BeautifulSoup - makes it easy to parse XML and HTML data.|
|Q10. I want an interactive programming notebook like R. What can I do?|Jupyter Notebook - supports interactive programming along with visualizations.|
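To make the first two rows of the table a little more concrete, here is a minimal sketch rather than a tutorial. It assumes NumPy and pandas are installed; the store names, sales figures and variable names are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: a few sales records for two made-up stores.
sales = pd.DataFrame(
    {
        "store": ["A", "A", "B", "B", "B"],
        "units": [10, 14, 7, 9, 11],
        "price": [2.5, 2.5, 4.0, 4.0, 4.0],
    }
)

# Q1 (NumPy): element-wise arithmetic on whole arrays, no explicit loop.
revenue = sales["units"].to_numpy() * sales["price"].to_numpy()
total_revenue = float(np.sum(revenue))

# Q2 (pandas): add a column, then group and aggregate,
# much like a GROUP BY query in SQL.
sales["revenue"] = revenue
revenue_by_store = sales.groupby("store")["revenue"].sum()

print(total_revenue)
print(revenue_by_store.to_dict())
```

The point of the sketch is only that the two libraries share the same underlying arrays, so data can move from a DataFrame column to a NumPy calculation and back without any conversion code.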
This post does not insist that you practice Data Science with Python; it simply highlights the richness of the libraries Python offers. Developers are working on Python packages all the time to provide more functionality, which has made Python a good choice.
I have tried to cover some of the most important requirements in Data Science and Python's solutions to them. Please let me know if I missed anything.