
Working with a real-world Problem in Data Science

Published May 16, 2019

Working with a real-world dataset is not as easy as it looks while you are learning. Working with ready-made data from Kaggle or Zindi is much easier than collecting the data yourself.

When you work on a real-world problem, you don’t always have a dataset ready. The first step is to gather your data. Data comes in different formats and from different sources, so there are several techniques for collecting it.

Data Collection

Data collection is one of the most important parts of data science, because the quality of the collected data largely determines how well the analysis goes. Data comes in different formats, such as CSV, TSV, XLSX, and HTML.
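As a small illustration of handling more than one format, here is a minimal sketch that parses CSV and TSV text with Python's standard csv module. The sample data and the read_delimited helper are made up for this example; in practice you would read from files on disk.

```python
import csv
import io

# Hypothetical sample data standing in for real files.
CSV_TEXT = "name,age\nAda,36\nTunde,29\n"
TSV_TEXT = "name\tage\nAda\t36\nTunde\t29\n"


def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_delimited(CSV_TEXT, ",")
tsv_rows = read_delimited(TSV_TEXT, "\t")

# Once parsed, both formats give the same records.
print(csv_rows == tsv_rows)
```

Note that the csv module returns every value as a string, so numeric columns such as age still need converting. If you use pandas, read_csv, read_excel, and read_html cover the formats listed above.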

Data Collection Techniques

  • Interviews
  • Questionnaires and Surveys
  • Observations
  • Focus Groups
  • Ethnographies, Oral History, and Case Studies
  • Documents and Records
  • Web Scraping

You can read more about these data collection techniques at https://cyfar.org/data-collection-techniques

Data Cleaning

Once you have your data, the next step is to clean it. Data cleaning is the process of identifying and then fixing or removing incorrect, incomplete, or unwanted records. It can involve removing duplicate or irrelevant observations, removing outliers, filling in missing values, creating calculated columns, and stripping stray symbols.
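The cleaning steps above can be sketched in plain Python. This is only one possible approach, under stated assumptions: the records, the column names, the median fill, the 1.5 * IQR outlier rule, and the age_group column are all hypothetical choices made for illustration.

```python
import statistics

# Hypothetical raw records: one exact duplicate, one missing age, one extreme spend.
raw = [
    {"id": 1, "age": 25, "spend": 100.0},
    {"id": 2, "age": None, "spend": 120.0},
    {"id": 2, "age": None, "spend": 120.0},  # duplicate of the row above
    {"id": 3, "age": 31, "spend": 110.0},
    {"id": 4, "age": 40, "spend": 105.0},
    {"id": 5, "age": 28, "spend": 5000.0},   # likely outlier
]


def clean(records):
    """Return a cleaned copy of records; the input list is left unchanged."""
    # 1. Drop exact duplicates, keeping the first occurrence.
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(record))

    # 2. Fill missing ages with the median of the known ages.
    known_ages = [r["age"] for r in unique if r["age"] is not None]
    median_age = statistics.median(known_ages)
    for r in unique:
        if r["age"] is None:
            r["age"] = median_age

    # 3. Remove spend outliers using the 1.5 * IQR rule.
    q1, _, q3 = statistics.quantiles(
        [r["spend"] for r in unique], n=4, method="inclusive"
    )
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [r for r in unique if low <= r["spend"] <= high]

    # 4. Add a calculated column derived from an existing one.
    for r in kept:
        r["age_group"] = "under_30" if r["age"] < 30 else "30_plus"
    return kept


cleaned = clean(raw)
for row in cleaned:
    print(row)
```

Whether to fill, drop, or keep a suspicious value depends on the problem, so treat each rule here as a starting point rather than a default.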

Define your question

In data analysis, questions should be measurable, clear, and concise. They should be designed to qualify or disqualify a potential solution to a problem. In the advertising industry, for example, you might ask ‘Does age affect the rate at which people subscribe to this service?’ or ‘How does gender affect the type of advert a person would like to see?’. Questions like these help you understand the problem you are working on, and they help you target the people who are likely to use a particular product or subscribe to a particular channel.
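To show what "measurable" can look like, here is a minimal sketch that turns the age question into a number: the subscription rate per age band. The observations and the band boundaries are invented for illustration.

```python
from collections import defaultdict

# Hypothetical observations: (age, subscribed?)
observations = [
    (19, True), (22, False), (24, True),
    (31, False), (35, False), (38, True),
    (52, False), (57, False),
]


def age_band(age):
    """Map an age to a coarse band; the cut-offs are an assumption."""
    if age < 30:
        return "18-29"
    if age < 50:
        return "30-49"
    return "50+"


def subscription_rate_by_band(obs):
    """Return {band: share of people in that band who subscribed}."""
    totals = defaultdict(int)
    subscribed = defaultdict(int)
    for age, did_subscribe in obs:
        band = age_band(age)
        totals[band] += 1
        subscribed[band] += int(did_subscribe)
    return {band: subscribed[band] / totals[band] for band in totals}


rates = subscription_rate_by_band(observations)
for band, rate in rates.items():
    print(f"{band}: {rate:.0%}")
```

A difference in rates between bands is only a hint. With this few observations it could easily be noise, which is why the later analysis and interpretation steps matter.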

Set Clear Measurement Priorities

This comes down to two decisions:

  • Decide what to measure.
  • Decide how to measure it.

One of the key challenges in performance management is selecting what to measure. The priority is to focus on quantifiable factors that are clearly linked to the drivers of business success.

Analyze your data

Data can be manipulated in a number of ways, such as plotting it, creating pivot tables, or grouping it by a particular category. Tools like pandas, Excel, Tableau, and Power BI are very useful for data analysis.
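To make grouping and pivot tables concrete, here is a minimal standard-library sketch. The sales records and the pivot_sum and group_sum helpers are hypothetical; in pandas, DataFrame.groupby and pandas.pivot_table do the same job with less code.

```python
from collections import defaultdict

# Hypothetical sales records.
sales = [
    {"region": "North", "product": "A", "amount": 100},
    {"region": "North", "product": "B", "amount": 50},
    {"region": "South", "product": "A", "amount": 70},
    {"region": "North", "product": "A", "amount": 30},
    {"region": "South", "product": "B", "amount": 20},
]


def group_sum(rows, key, value):
    """Sum `value` for each distinct `key` (a simple group-by)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def pivot_sum(rows, index, column, value):
    """Sum `value` for each (index, column) pair (a simple pivot table)."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[index]][row[column]] += row[value]
    return {k: dict(v) for k, v in table.items()}


by_region = group_sum(sales, "region", "amount")
pivot = pivot_sum(sales, "region", "product", "amount")

print(by_region)
print(pivot)
```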

Interpret Results

After analyzing the data, the next step is to interpret the results. This is where you draw conclusions and decide whether the evidence supports your hypothesis or leads you to reject it.
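One way to judge whether a difference between two groups is more than chance is a permutation test. The sketch below is an illustration under assumptions, not the only valid method: the two groups and their values are invented, and the 0.05 threshold you might compare the result against is a convention rather than a rule.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(a, b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Group labels are shuffled repeatedly to see how often a difference at
    least as large as the observed one appears by chance alone.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # The +1 correction keeps the estimate from ever being exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two groups of users.
group_a = [12, 15, 14, 16, 13, 17, 15, 14]
group_b = [22, 25, 24, 23, 26, 21, 24, 25]

p_value = permutation_test(group_a, group_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be due to chance alone. It does not show that the hypothesis is true, only that the data are hard to explain without it.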

The Conclusion

As you can see, data is not always readily available. When you collect it yourself, you have to be careful with privacy and licenses. Encrypt or anonymize all personal data before releasing it to the public, read a website’s robots.txt file before scraping it, and remove all access tokens and keys before sharing your code or data publicly.
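Two of these precautions are easy to sketch in code. The example below checks a robots.txt policy with the standard urllib.robotparser module and reads an access token from an environment variable instead of hard-coding it. The robots.txt content, the example.com URLs, the "my-scraper" user agent, and the MY_API_TOKEN variable name are all hypothetical.

```python
import os
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_to_scrape(robots_txt, url, user_agent="my-scraper"):
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def load_token(name="MY_API_TOKEN"):
    """Read a secret from the environment so it never appears in shared code."""
    return os.environ.get(name)


print(allowed_to_scrape(ROBOTS_TXT, "https://example.com/blog/post"))
print(allowed_to_scrape(ROBOTS_TXT, "https://example.com/private/data"))
print("token configured:", load_token() is not None)
```

Passing a robots.txt check does not settle licensing or privacy questions, so it is still worth reading the site's terms of use before collecting data.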

Thanks for Reading.

Cheers!

Discover and read more posts from Hammed Busirah