August 31, 2022

*Written by* Ajay Gandecha

In last week’s post, I talked about how this academic year will likely be amongst the most exciting and consequential years for data science at UNC. I also talked about the new Data Science minor and ways to explore and get involved in data science on campus. However, many of you might be at the start of your college career here at UNC! You may not have taken any related courses yet, or might not even know anything about data science. Well, one of the best ways to explore an interest and learn more about a field at UNC is to take an introductory class!

The UNC Statistics department offers two great introductory courses that serve as the foundational sequence for data science – STOR 120 and STOR 320. Last year, I had the chance to take both courses in this introductory sequence. Here, I talk about both of these courses in detail and why you should take them if you have the opportunity.

The first course of the introductory sequence is STOR 120: Foundations of Statistics and Data Science. I took the class in the Fall semester of 2021 with Dr. Jeff McLean. Overall, this class was terrific. The professor was excellent, and the material was interesting and involved a good balance of basic programming and statistics concepts without being too overwhelming or incorporating any complex math and statistics.

The first third of the course specifically focused on learning the Python programming language and applying Python to data science. At first, we talked about the basics of programming in Python, from variables and data types to if-statements and loops. After that, we began to incorporate some statistical functions into our code, building up to working with complex tables and creating entire visualizations from scratch.

I think that spending this much time teaching Python is one of the biggest strengths of STOR 120, and it *even leads me to believe that every statistics major should take STOR 120 as their introductory statistics course*. Python is amongst the most powerful programming languages for data scientists. It has a vast library of data science and machine learning packages that are well maintained by thousands of passionate developers. Python has a simple syntax, too, and is highly approachable as an introductory programming language, even for those with no computer science background. Despite its simplicity, Python gives data scientists the full power of a general-purpose programming language. This is a real advantage for data scientists who want to integrate more complicated logic into their programs, or who want to use the thousands of APIs and additional functionality that Python supports far beyond statistics and visualizations.

For those interested in knowing more about the Python data science packages we used in class, STOR 120 primarily uses the `datascience` package. The `datascience` package is excellent, especially for beginners working with datasets for the first time. I wish we had used the ever-popular Pandas package more often in class, since it is the standard data analysis package for Python and runs significantly faster. However, the operations we covered with the `datascience` package generally have close Pandas counterparts. We also got to work with NumPy and Matplotlib, which are vital to know for most data science work in Python.

It can be hard to tell at first, but STOR 120 is a foundations of statistics class, not purely a programming class. Especially in the last two-thirds of the course, STOR 120 covers essential statistics and data science concepts. The course does talk about the absolute basics – mean, median, min, and max – before moving on to more intermediate concepts such as standard deviations, p-values, sampling, and hypothesis testing.

The course also spends a reasonable amount of time on some interesting statistics concepts that are usually not covered in a first statistics class. For example, one of my favorite things we learned about was bootstrap sampling. Professor McLean also gave great lectures that allowed us to follow along and implement these statistical concepts on real-world datasets in our code. Near the end of the class, we talked about developing and analyzing simulations. Creating simulations of different events was a lot more intuitive than the typical hypothesis test formulas given in an AP Statistics or intro-level math-based statistics class.
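To give a flavor of the bootstrap sampling mentioned above, here is a minimal sketch of the idea: resample the data with replacement many times and look at how the mean varies across resamples. This is not code from the class. The commute-time numbers are made up, the function names are my own, and it uses only Python's standard library rather than the `datascience` package.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=2000, seed=0):
    """Return the means of n_resamples bootstrap resamples of sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, keeping the original sample size.
        resample = rng.choices(sample, k=len(sample))
        means.append(statistics.mean(resample))
    return means


def percentile_interval(values, confidence=0.95):
    """Approximate an interval covering the middle `confidence` share of values."""
    ordered = sorted(values)
    tail = (1 - confidence) / 2
    lower = ordered[int(tail * len(ordered))]
    upper = ordered[int((1 - tail) * len(ordered)) - 1]
    return lower, upper


# Made-up sample (hypothetical commute times in minutes), for illustration only.
commute_minutes = [12, 15, 9, 22, 30, 18, 14, 25, 11, 19]

means = bootstrap_means(commute_minutes)
low, high = percentile_interval(means)
print(f"Sample mean: {statistics.mean(commute_minutes):.1f}")
print(f"Approximate 95% bootstrap interval for the mean: ({low:.1f}, {high:.1f})")
```

The appeal is that the interval comes straight from repeated resampling instead of a memorized formula, which is much of why the simulation-based approach felt intuitive.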

STOR 120 had a lab component, making it unique compared to other introductory math and programming classes. In this lab, we all got the chance to play around with the concepts we learned, talk to others, and ask questions if necessary. We also had homework every week, two midterms, and a final. The second midterm was notoriously rough, with over half of the class failing the in-person part of the exam. Luckily, though, the class did exceptionally well on the first midterm and the final, so most of us still did well in the course.

After STOR 120, I was really excited to take another data science course. After some time on the waitlist, I successfully enrolled in the second course in the sequence, STOR 320: Introduction to Data Science. My professor, Dr. Mario Giacomazzo, was also excellent and pushed everyone in the class to succeed and use good data science practices throughout the course.

The programming language we used was the first significant change from STOR 120 to STOR 320. STOR 320’s curriculum focuses entirely on R rather than Python. Unlike in STOR 120, I had no prior knowledge of the programming language going into the class. Despite this, the professor walked us through how to code in R with many examples to get us proficient, or at least literate, quite quickly.

Although switching languages can be annoying at first, STOR 320’s switch to R is also very good for its students. R joins Python as one of the most-used languages for data science. R also has great data visualization libraries, and we got to make visually appealing graphics right out of the gate. Also, many upper-division STOR courses at UNC use R, so STOR 320 is an excellent foundation for those classes as well.

There are no midterms or final exam in this course. Instead, the STOR 320 curriculum centers on a semester-long final project that makes up a pretty hefty 30% of the final grade. The project’s premise is deceptively simple: find a dataset, develop a few interesting questions, and create a final report to present your findings. The project itself, however, is much more involved, and we used the more advanced modeling methods that we learned throughout the course. It was a group project and a good chance to meet other statistics majors and people interested in data science.

STOR 120 and STOR 320 were amongst my favorite classes in my first year of college, persuading me to get more involved with data science on campus and potentially study it in some capacity. For statistics majors, data science minors, or anyone wishing to dabble in data science, STOR 120 and STOR 320 are terrific classes to take at UNC! If you have taken either of these courses, let me know what you think below.