Data Science and Data Engineering are some of the hottest topics on the market right now, with a huge array of applications. However, as the demand for these professionals grows, so do the requirements. Many professionals enter the area purely by studying computer science-related concepts, such as programming languages like Python or machine learning libraries like TensorFlow and Scikit-Learn, which is not bad in itself. But if you want to go deeper in your studies, it’s also really valuable to have a strong statistical foundation, which is where many of these concepts originally came from. More specifically, statistical inference. The objective of this article is to shed a little light on this topic and on where you can begin to study it.
In more formal terms, statistical inference is the use of statistical concepts to make statements about the characteristics of a population based on the information given by a sample. In our case, the population is the set of all elements or results in a given situation, while the sample is a subset of that population. Without realizing it, we use statistical inference in plenty of day-to-day situations. For example, say we just cooked some food and want to check whether we added enough salt. We don’t need to eat the whole dish to reach that conclusion; we just need to take a bite. In fact, with this one taste test we can determine not only whether there is enough salt, but also whether the food is at a nice temperature, whether it’s well seasoned, or even whether it’s spicy.
There are two basic problems that we can solve using statistical inference: hypothesis testing and estimation. Basically, using these two concepts we can make generalizations about our population using only our samples! While the objective of this article is not to go deep into these concepts, I will explain a little bit about estimation, which can later serve as a basis for hypothesis testing (a more complex concept).
Say we want to discover the mean age of the students in a high school. However, we don’t have access to the ages of every student in the school, only to a few classes. In this case, the school is our population while the available classes are our sample. The natural estimator here is the sample mean, which we take as representing the mean of the entire population. This can be written as:

$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
where p is the population parameter we want to estimate (here, the mean age of the whole school), \hat{p} is its estimate, n is the number of students in the sample, and X_i is the age of the i-th student in the sample. This is a very classic use of estimation and a great way to start the discussion. It’s important to note, however, that estimators can’t be biased! If we only pick classes with younger students, we will obviously get a lower mean. So it’s very important to guarantee that we pick a sample that represents the entire population and not just a part of it. This is usually achieved by picking the sample elements at random, although there are other ways to do it as well.
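To make this concrete, here is a minimal sketch in Python, assuming only NumPy; the population of ages, the sample size, and the seed are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: the ages of every student in the school.
population_ages = rng.integers(14, 19, size=1200)

# A random sample of n students (the classes we happen to have access to).
n = 60
sample_ages = rng.choice(population_ages, size=n, replace=False)

# The estimator: the sample mean stands in for the population mean.
p_hat = sample_ages.sum() / n

print(f"Population mean (unknown in practice): {population_ages.mean():.2f}")
print(f"Estimate from the random sample:       {p_hat:.2f}")

# A biased sample (say, only the youngest students) misses the target.
biased_sample = np.sort(population_ages)[:n]
print(f"Estimate from a biased sample:         {biased_sample.mean():.2f}")
```

Rerunning this with different seeds shows that the random sample’s estimate stays close to the true mean, while the biased sample is consistently off.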
Another very famous use of estimation is in linear and logistic regression, super important concepts for entering the world of data science. With regression, we are essentially trying to fit a function to our data points. Once we have this function, we can use it to estimate values for elements of our population that we never observed. This is what makes statistical inference so important to Data Science as a whole: it gives you the tools you need to understand the math behind our most used algorithms. There is a huge chance you were already using these concepts without even knowing it.
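As a small illustration, here is a minimal sketch using Scikit-Learn (mentioned earlier); the data is synthetic, with a slope of 2.5 and an intercept of 1.0 that I chose arbitrarily, just to show that fitting a regression is itself an estimation problem.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data following y = 2.5 * x + 1.0, plus some noise.
X = rng.uniform(0, 10, size=(100, 1))
y = 2.5 * X.ravel() + 1.0 + rng.normal(0, 1.0, size=100)

# Fitting the model estimates the slope and intercept from the sample.
model = LinearRegression().fit(X, y)
print(f"Estimated slope: {model.coef_[0]:.2f} (true value 2.5)")
print(f"Estimated intercept: {model.intercept_:.2f} (true value 1.0)")

# The fitted function can then estimate y for points we never observed.
print(model.predict([[4.0]]))
```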
There are many interesting sources for starting to study statistical inference, from books to online courses. For books, I recommend the classic “Statistical Inference” by Casella and Berger. You can also use “Estatística Básica” by Wilton Bussab and Pedro Morettin if you want a reference in Portuguese. This is the book I used during my undergraduate studies, and I could not recommend it more. I personally prefer reading rather than watching online courses, but if you are interested in something more visual, the Statistical Inference course by Johns Hopkins University on Coursera might be great for you. It is very complete and extensive, so if you really want to get serious about the subject, it’s going to be worth your precious time.
Lastly, you can also practice the concepts you just learned using some programming languages and frameworks. For example, you can start with R, a programming language that is tailor-made for statistics and inference. While you probably won’t build complex applications with it, it is great for practicing the fundamentals, with built-in functions for just about everything you are looking for. Its most popular IDE, RStudio, also contains a lot of resources for learning!
But if you don’t really like using R, that’s completely fine, because there are also Python libraries that offer much of the same functionality. Aside from our usual libraries such as NumPy and Pandas, which come in handy in these scenarios, there is scipy.stats, a very in-depth module with a ton of different statistical functions, and Pingouin, which is a personal favorite of mine. It is very simple to use and contains pretty much everything you need for statistical inference.
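As a quick taste of what that looks like, here is a minimal sketch using NumPy and scipy.stats; the sample of ages is synthetic, and the value 16 in the test is just a number I picked for illustration. Pingouin offers similar tests with an even friendlier output.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic sample: the measured ages of 30 students.
sample_ages = rng.normal(loc=16.2, scale=1.1, size=30)

# Estimation: the sample mean and a 95% confidence interval around it.
mean = sample_ages.mean()
sem = stats.sem(sample_ages)
ci_low, ci_high = stats.t.interval(0.95, len(sample_ages) - 1, loc=mean, scale=sem)
print(f"Estimated mean age: {mean:.2f}, 95% CI: ({ci_low:.2f}, {ci_high:.2f})")

# Hypothesis test: is the population mean age different from 16?
t_stat, p_value = stats.ttest_1samp(sample_ages, popmean=16)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

Hopefully, with these tips, you can either get started or get even better in the world of Data Science.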
Enjoy your studies!