Cheat Sheet: Python For Data Science
Starting to learn a new programming language is never easy, and for aspiring data scientists it can be even harder: most of the time, they come from other fields of study or already have several years of experience in an industry that is very different from data science.
Luckily, there are several resources that you can fall back on, online as well as in real life. For data science specifically, though, the available material can sometimes be lacking: general Python cheat sheets cover the most important things you need to know to program with Python, but they do not specifically target data science.
To help students who are taking its free Python for Data Science course, DataCamp started a series of cheat sheets aimed at those who are just starting out with data science and who could use some more material to support their learning.
Table of Contents
- Variables and data types
- Installing Python
- NumPy Arrays
- Some Useful Basic Statistical Functions
- Python For Data Science Cheat Sheet
Python For Data Science Cheat Sheet
The cheat sheet is a handy addition to your learning, as it covers the basics, brought together in seven topics, that any beginner needs to know to get started doing data science with Python.
Variables and data types
To start with Python, you first need to know about variables and data types. That should not come as a surprise, as they are the basics of every programming language.
Variables are used to name and store a value for later use, such as reference or manipulation, by the computer program. To store a value, you assign it to a variable. This is called variable assignment: you set or reset the value that is stored in one or more locations denoted by a variable name.
When you have assigned a value to a variable, your variable gains or changes its data type. The data type specifies which type of value a variable holds and what type of operations can be applied to it. In Python, you can easily assign values to variables like this: x = 5. When you then print out or refer to x, you'll get back the value 5. Naturally, the data type of x will be an integer.
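As a minimal sketch of the assignment described above (the reassignment to 5.0 is an extra illustration, not something the cheat sheet is known to show), you can check the value and the data type of a variable like this:

```python
# Assign the value 5 to the variable x
x = 5

# Referring to x gives back the stored value
print(x)        # 5

# type() reports the data type of the value that x currently holds
print(type(x))  # <class 'int'>

# Reassigning x changes both its value and its data type
x = 5.0
print(type(x))  # <class 'float'>
```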
These are just the bare Python basics. The next step is then to do calculations with variables. The ones that the cheat sheet mentions are addition, subtraction, multiplication, exponentiation, remainder and division. You will already know how these operations work and what effect they can have on values, but the cheat sheet also shows you how to perform these operations in Python.
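The six calculations could look roughly like this in Python 3. The variable names and the operand 2 are chosen here for illustration only:

```python
x = 5

total = x + 2        # addition: 7
difference = x - 2   # subtraction: 3
product = x * 2      # multiplication: 10
power = x ** 2       # exponentiation: 25
remainder = x % 2    # remainder after division: 1
quotient = x / 2     # division: 2.5 (the / operator returns a float in Python 3)

print(total, difference, product, power, remainder, quotient)
```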
When you're just starting out with Python, you might also find it useful to get to know more about certain functions or types. Luckily, others before you have also had this need, so there is a way to ask for more information: just use the built-in help() function. Don't forget to pass the element about which you want to know more. In other words, you need to put the element, which in this case is str, in between the parentheses to get back the information you need.
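A short sketch of asking for help on str follows. The second call, on a single method, is an added illustration that keeps the output shorter:

```python
# Print the documentation for the built-in str type (the output is long)
help(str)

# You can also ask about a single method to get a shorter answer
help(str.upper)
```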
Next, the cheat sheet lists two of the most popular built-in data structures: strings and lists.
Strings are one of the basic elements of programming languages in general and this is not much different for Python. Things that you should master when it comes to working with strings are some string operations and string methods.
There are basically four string operations that you need to know to get started on working with strings:
- If you multiply your string by a number n, you get back a longer string that repeats the original string n times.
- If you add a string to your original string, you’ll get back a concatenation of your string and the new string that you have added to it.
- You should be able to check whether a certain element is present in your string.
- You should know that you can also select elements from strings in Python. Don’t forget here that the index starts at 0; you’ll see this coming back later when you’re working with lists and numpy arrays, but also in other programming languages.
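The four string operations above can be sketched as follows. The example string and the variable names are illustrative assumptions:

```python
my_string = "thisStringIsAwesome"

# Multiplication repeats the string
doubled = my_string * 2          # 'thisStringIsAwesomethisStringIsAwesome'

# Addition concatenates two strings
extended = my_string + "Innit"   # 'thisStringIsAwesomeInnit'

# The in operator checks whether an element is present
has_m = "m" in my_string         # True

# Indexing starts at 0, so index 3 is the fourth character
fourth = my_string[3]            # 's'

# Slicing selects the characters at indices 4 up to, but not including, 9
part = my_string[4:9]            # 'Strin'
```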
When it comes to string methods, it's definitely handy to know that you should use the upper() and lower() methods to put your string in uppercase or lowercase, respectively. Also, knowing how to count string elements or how to replace them is no frivolous luxury. And, especially when you're parsing text, you'll find a method to strip the whitespace from the ends enormously handy.
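Here is a small sketch of these string methods. The example string, including its padding, is made up for illustration:

```python
my_string = "  thisStringIsAwesome  "

# strip() removes the whitespace from both ends
stripped = my_string.strip()            # 'thisStringIsAwesome'

# upper() and lower() change the case
upper = stripped.upper()                # 'THISSTRINGISAWESOME'
lower = stripped.lower()                # 'thisstringisawesome'

# count() counts occurrences of a substring (case-sensitive)
count_s = stripped.count("s")           # 3

# replace() returns a new string with the substitutions applied
replaced = stripped.replace("e", "i")   # 'thisStringIsAwisomi'
```

Strings are immutable, so each of these methods returns a new string and leaves the original untouched.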
You might feel that these strings aren’t immediately what you’ll be using when you start doing data science and this will be mostly true; Text Mining and Natural Language Processing (NLP) are already advanced, but this is no excuse to neglect this data structure!
Lists, on the other hand, will seem more useful from the start. Lists are used to store an ordered collection of items, which might be of different types, although usually they aren't. The elements that are contained within a list are separated by commas and enclosed in square brackets. In this case, the my_list variable is made up of strings: you have "my" and "list", but also variables that hold strings. You see that we have put a reference to NumPy arrays in this section. That's mainly because there have been some discussions about whether to use lists or arrays in some cases.
The four reasons that most Pythonistas mention for preferring NumPy arrays over lists are:
- NumPy arrays are more compact than lists,
- Access in reading and writing items is faster with NumPy,
- NumPy can be more convenient to work with, thanks to the fact that you get a lot of vector and matrix operations for free,
- NumPy arrays can be more efficient to work with because they are implemented more efficiently.
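To illustrate the convenience argument only (this sketch does not measure speed or memory, and the values are made up), compare an element-wise sum with lists and with NumPy arrays:

```python
import numpy as np

a_list = [1, 2, 3]
b_list = [10, 20, 30]

# With lists, + concatenates, so an element-wise sum needs an explicit loop
concatenated = a_list + b_list                        # [1, 2, 3, 10, 20, 30]
list_sum = [x + y for x, y in zip(a_list, b_list)]    # [11, 22, 33]

# With NumPy arrays, + is already an element-wise (vector) operation
a_arr = np.array(a_list)
b_arr = np.array(b_list)
array_sum = a_arr + b_arr                             # array([11, 22, 33])
```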
For data science and the amount of data that you'll be working with in real-life situations, it's also useful for you to know your way around NumPy arrays.
Lists are easily initialized with the help of square brackets ([]). Note also that you can make lists of lists, as in the variable my_list2! This is especially tricky when you're first starting out. Next, like with the strings, you also need to know how to select list elements. Make sure that you don't forget that here, too, the index starts at 0.
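A minimal sketch of initializing and selecting from lists follows. The names my_list and my_list2 come from the text, while the contents of a, b and my_list2 are assumptions made for illustration:

```python
a = "is"
b = "nice"

# A list of strings, partly given through variables
my_list = ["my", "list", a, b]

# A list of lists
my_list2 = [[4, 5, 6, 7], [3, 4, 5, 6]]

first = my_list[0]        # 'my' (the index starts at 0)
last = my_list[-1]        # 'nice' (negative indices count from the end)
middle = my_list[1:3]     # ['list', 'is'] (index 3 is not included)

# For a list of lists, select the inner list first, then the item in it
inner = my_list2[1][0]    # 3
```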
When you have covered some of the absolute basics of Python, it's time to get started with Python's data science libraries. The popular ones that you should check out are pandas, NumPy, scikit-learn and matplotlib. But why are these libraries so important for data science?
- Pandas is used for data manipulation with Python. The handy data structures that pandas offers, such as the Series and DataFrame, are indispensable for data analysis.
- NumPy, a package that offers the NumPy array as a more efficient alternative data structure to lists, will come in handy when you get your hands dirty with data science.
- Scikit-learn, on the other hand, is the ideal tool if you want to get started on machine learning and data mining.
- Lastly, matplotlib is one of the basic Python libraries that you need to master to start making impressive visualizations of your data and analyses.
You immediately see that these four libraries will offer you everything that you need to get started with doing data science.
There will be times when you want to import these libraries entirely to carry out your analyses, but at other times, you want to perform only a selective import, where you import only certain modules or functions of a library.
Also, there are certain conventions that you need to follow when you import the libraries that have been mentioned above: pandas is imported as pd, NumPy is imported as np, scikit-learn is actually called sklearn when you want to import modules, and you import matplotlib.pyplot as plt.
Right now, these conventions might strike you as odd or totally unnecessary, but you’ll quickly see that it becomes easier as you start to work intensively with them.
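The difference between a full and a selective import can be sketched with NumPy. The other conventional imports are shown as comments so that the sketch runs with only NumPy installed, and the datasets module is just one example of something you might import from sklearn:

```python
# Full import under the conventional alias
import numpy as np

# Selective import: bring in only the names you need
from numpy import mean, median

# The other conventional imports follow the same pattern:
# import pandas as pd
# import matplotlib.pyplot as plt
# from sklearn import datasets   # scikit-learn's package name is sklearn

values = np.array([1, 2, 3, 4])

print(np.mean(values))   # function reached through the alias
print(mean(values))      # same function through the selective import
```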
Installing Python
Now that you have covered some of the basics, you might want to install Python if you haven't already. Consider getting one of the Python distributions, such as Anaconda. It's the leading open data science platform, powered by Python. The absolute advantage of installing Anaconda is that you'll easily get access to over 720 packages that you can install with conda. But you also have a dependency and environment manager and the Spyder Integrated Development Environment (IDE). And as if these tools weren't enough, you also get the Jupyter Notebook, an interactive data science environment that allows you to use your favorite data science tools and share your code and analyses with great ease.
In short, all the tools that you need to get started on doing data science with Python!
NumPy Arrays
When you have imported the libraries that you need to do data science, you will probably need to create the most important data structure for scientific computing in Python: the NumPy array.
You'll see that these arrays look a lot like lists and that, maybe to some surprise, you can convert your lists to NumPy arrays. Remember the performance question from above? This is the solution!
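Converting a list to a NumPy array might look like this. The variable names and values are illustrative assumptions:

```python
import numpy as np

my_list = [1, 2, 3, 4]

# np.array() builds an array from a list
my_array = np.array(my_list)

# A list of lists becomes a two-dimensional array
my_2darray = np.array([[1, 2, 3], [4, 5, 6]])

# tolist() converts an array back to a plain Python list
back_to_list = my_array.tolist()

print(my_array)
print(my_2darray)
```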
Subsetting and slicing NumPy arrays works very much like it does with lists. Don't forget that the index starts at 0. When you look at the operations that you can perform on NumPy arrays, you'll see that you can subset them with a condition when you use comparison operators such as >. You can also multiply and add NumPy arrays. These operations work element by element and return a new array that holds the resulting values.
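A hedged sketch of subsetting, slicing and array arithmetic follows. The arrays and variable names are made up for illustration:

```python
import numpy as np

my_array = np.array([1, 2, 3, 4])
my_2darray = np.array([[1, 2, 3], [4, 5, 6]])

# Subsetting and slicing, with the index starting at 0
second = my_array[1]            # 2
first_two = my_array[0:2]       # array([1, 2])
first_col = my_2darray[:, 0]    # array([1, 4]): all rows, first column

# A comparison yields a boolean array that can be used to subset
mask = my_array > 3             # array([False, False, False,  True])
large = my_array[mask]          # array([4])

# Arithmetic is applied element by element and returns a new array
doubled = my_array * 2                       # array([2, 4, 6, 8])
summed = my_array + np.array([5, 6, 7, 8])   # array([ 6,  8, 10, 12])
```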
Some Useful Basic Statistical Functions
Lastly, there are some functions that will surely come in handy when you're starting out with Python: basic statistical measures such as the mean, median, correlation coefficient, and standard deviation, which you can retrieve with the help of mean(), median(), corrcoef(), and std(), respectively. You can also insert, delete, and append items to your arrays. Also, make sure not to miss the shape attribute to get the dimensions of the array. If your array has n rows and m columns, you'll get back the tuple (n, m), which is very handy if you want to inspect your data.
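These functions can be tried out in a short sketch. The arrays are made up, and the functions are used here through the np alias:

```python
import numpy as np

my_array = np.array([1, 2, 3, 4])
my_2darray = np.array([[1, 2, 3], [4, 5, 6]])

mean_value = np.mean(my_array)       # 2.5
median_value = np.median(my_array)   # 2.5
std_value = np.std(my_array)         # population standard deviation by default

# corrcoef() returns a correlation matrix; entry [0, 1] relates the two inputs
corr = np.corrcoef(my_array, my_array * 2)

# shape is an attribute, not a call: 2 rows and 3 columns give (2, 3)
dimensions = my_2darray.shape

# append, insert and delete return new arrays and leave my_array unchanged
appended = np.append(my_array, [5, 6])   # array([1, 2, 3, 4, 5, 6])
inserted = np.insert(my_array, 1, 5)     # array([1, 5, 2, 3, 4])
deleted = np.delete(my_array, [1])       # array([1, 3, 4])
```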
Python For Data Science Cheat Sheet
The first one that was published was the Python for Data Science cheat sheet.
Martijn is the co-founder of DataCamp, an online interactive education platform for data science that combines fun video instructions with in-browser coding challenges. In his spare time, he keeps himself busy collecting superhero T-shirts.