Data Science for High Schoolers

A k-Nearest-Neighbor algorithm that I developed as my first data science project

If you’re a high school student, chances are you’re thinking about what career to pursue. You might be interested in STEM subjects like math or biology, but unsure which field to choose. I was in the same position as many of you. I knew I wanted to get into the tech sector, but I was also interested in physical sciences such as chemistry and physics. Then I learned about data science and its interdisciplinary connections with several other STEM fields. In this article, I’ll introduce data science to students who are still unsure which STEM field to pursue, and I’ll explain how to get started learning it.

Data science touches almost every STEM field, from the physical sciences to mathematics to computer science. For example, when you conduct a complex lab experiment, you will most likely work with multivariable data. Modeling this data to predict the outcomes of future experiments helps investigators learn more about the system they are studying. You can also visualize the data to look for correlations between specific variables. Once you have found those correlations, you can modify your experiment to improve future results.

Data science is not only used in the lab. Many companies and organizations use it for a variety of tasks. Some companies use data science to analyze how their product is growing in the market and what they can do to boost revenue. Other organizations, such as the CDC, are currently using advanced data science to build models that project the spread of COVID-19 and its potential effects.

Finally, the fundamentals of data science are approachable. You don’t need a heavy math background to understand the basics of data analytics. Almost all high schoolers have the mathematical skills to understand data science concepts such as regression analysis. Furthermore, getting started requires little coding experience, thanks to the simplicity of the Python language and the libraries available for it.

The vast applications of data science have enabled it to become one of the most popular fields in STEM, making it the perfect choice for high schoolers to invest their time in.

For starters, data science requires a basic understanding of a programming language, typically R or Python. For high schoolers, I recommend Python since it is generally easier to learn and more readable for new programmers. Students should be familiar with the following concepts before beginning their journey in data science:

  • Variables
  • Data Types (Specifically Strings)
  • Control Flow
  • Basic Object-Oriented Programming
  • Functions

These are just a few concepts that students can learn beforehand to become better acquainted with Python. Complete knowledge of everything Python provides is by no means necessary. Most programmers and data scientists regularly look up the documentation for specific modules because it is impossible to memorize all of the methods and functions they provide.
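To give a sense of the level involved, here is a minimal sketch that touches each concept in the list above. The names and numbers (`Student`, `describe_score`, the scores) are made up for illustration and are not from any real project.

```python
# Variables and data types
name = "Ada"   # string
grade = 11     # integer
gpa = 3.8      # float

# Strings: formatting and a built-in string method
greeting = f"Hello, {name.upper()}!"


# Functions and control flow
def describe_score(score):
    """Return a label for a test score out of 100."""
    if score >= 90:
        return "excellent"
    elif score >= 70:
        return "good"
    return "needs practice"


labels = [describe_score(s) for s in [95, 72, 40]]


# Basic object-oriented programming
class Student:
    def __init__(self, name, scores):
        self.name = name
        self.scores = scores

    def average(self):
        """Return the mean of this student's scores."""
        return sum(self.scores) / len(self.scores)


student = Student(name, [95, 72, 40])
print(greeting)
print(labels)
print(student.average())
```

If you can read this snippet and roughly predict what it prints, you likely know enough Python to start on the libraries below.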

Moving on, there are many Python libraries available for data analytics. However, high school students do not need to use all of them in order to develop complex projects. The core libraries that every data scientist should have experience with are the following:

  • NumPy
  • Pandas
  • Matplotlib
  • Scikit-Learn

I have listed these Python libraries in the order in which a high school student should learn them. They are essential for most data science projects. They allow the user to analyze massive datasets and to build models and visualizations that reveal new insights.

To actually get started with data science, you will need a coding environment. For data science projects, I prefer PyCharm by JetBrains because it provides a clean interface with many useful features. Additionally, set up an Anaconda environment to streamline your workflow. This virtual environment lets you easily install Python libraries and make them available to your Python interpreter, which will save you a lot of time on future projects.
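Once your environment is set up, a quick way to confirm that the libraries are available is to try importing them. The sketch below is one possible check, not an official tool; the helper name `check_libraries` is my own. Note that Scikit-Learn is imported under the name `sklearn`.

```python
import importlib

# Import names of the four libraries recommended in this article.
LIBRARIES = ["numpy", "pandas", "matplotlib", "sklearn"]


def check_libraries(names):
    """Map each import name to its version string, or None if it is missing."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            # Some modules do not expose __version__, so fall back to "unknown".
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions


for name, version in check_libraries(LIBRARIES).items():
    status = version if version is not None else "not installed"
    print(f"{name}: {status}")
```

If any line reports "not installed", install that library into the active environment before moving on.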

Now, I’ll discuss what each of the recommended Python libraries above is used for and how it helps with data science projects.

Python Libraries

NumPy is one of the most essential Python libraries for any data science project. As the name suggests, NumPy is meant for numerical calculations and analysis. It provides a wide range of functions for advanced mathematics and helps the user find correlations in massive datasets. Through NumPy, the user can leverage array objects to perform complicated mathematical operations such as Fourier transforms and linear algebra. It also supports reshaping large datasets and processing image data. NumPy serves as a foundation for much of the Python data ecosystem as well: libraries such as Pandas and Scikit-Learn are built on it, and machine learning frameworks such as TensorFlow work closely with NumPy arrays. If you want to venture into data science, you need a solid understanding of NumPy and what it offers.
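Here is a short sketch of the operations mentioned above: array math, correlation, reshaping, linear algebra, and a Fourier transform. The data is made up for illustration.

```python
import numpy as np

# Two made-up measurement series (hypothetical data).
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
scores = np.array([52.0, 58.0, 65.0, 70.0, 78.0, 85.0])

# Element-wise math works on whole arrays at once, with no explicit loop.
scaled = (scores - scores.mean()) / scores.std()

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(hours, scores)[0, 1]

# Reshape six values into a grid with 2 rows and 3 columns.
grid = scores.reshape(2, 3)

# Linear algebra: solve the system A @ x = b.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)

# Fourier transform of one full cycle of a sine wave sampled at 8 points.
signal = np.sin(2 * np.pi * np.arange(8) / 8)
spectrum = np.fft.fft(signal)

print(round(r, 3), grid.shape, x)
```

The functions used here (`np.corrcoef`, `reshape`, `np.linalg.solve`, `np.fft.fft`) are only a small sample, but they cover the kinds of tasks beginners run into first.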

NumPy is a very simple package to use. However, it offers countless functions, so the most efficient way to learn it is to focus on a few of the main ones first. There are many tutorials available on YouTube and on other platforms such as Coursera and edX (I’ll link some tutorials at the end of the article).

Pandas is a Python library built for data analysis, and it provides the user with helpful data structures. The two main data structures in Pandas are the Series and the DataFrame.

Data manipulation and preparation is arguably the hardest part of any data science project. Almost all real-world datasets are far from perfect, and we have to clean them as best we can in order for our machine learning models to produce accurate results. Some datasets include empty or null values, and some contain infinite values that a machine learning model can’t handle. Pandas allows us to eliminate these issues and feed our model data it can work with. Additionally, we may not want to use all of the data in the dataset. Most datasets have many variables and thousands of data points. We might only want to analyze a few of those variables, so Pandas allows us to “drop” certain variables and keep exactly what we want to analyze.
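As a minimal sketch of the cleanup described above, assume a small made-up table (the column names and values are hypothetical). It drops an unwanted variable, treats infinite values as missing, and removes incomplete rows.

```python
import numpy as np
import pandas as pd

# A small made-up dataset with the problems described above.
df = pd.DataFrame({
    "temperature": [20.5, 21.0, np.nan, 23.1, 22.4],  # one missing value
    "pressure": [1.01, np.inf, 1.03, 1.02, 1.00],     # one infinite value
    "yield": [0.81, 0.84, 0.79, 0.90, 0.88],
    "notes": ["ok", "ok", "sensor fault", "ok", "ok"],
})

# A single column of a DataFrame is a Series.
yield_series = df["yield"]

# Drop a variable we do not want to analyze.
numeric = df.drop(columns=["notes"])

# Treat infinite values as missing, then drop rows with any missing value.
clean = numeric.replace([np.inf, -np.inf], np.nan).dropna()

print(clean)
```

Dropping rows is the simplest strategy, not the only one. Depending on the dataset, filling missing values (for example with `fillna`) may preserve more data.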

Learning Pandas isn’t very hard once you understand NumPy, since Pandas is built on top of NumPy. Corey Schafer has a great tutorial series on Pandas that is an excellent introduction to using the module.

After learning how to analyze and prepare our data, we may also want to visualize it. Visualizations come in many forms: scatter plots, line graphs, heat maps, etc. Matplotlib allows us to plot data on graphs and to create images that make correlations across several variables in our datasets easy to see.
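For instance, a scatter plot with a best-fit line takes only a few lines of Matplotlib. The sketch below uses randomly generated, made-up data and saves the figure to a file named `study_scatter.png` (a name I chose for illustration).

```python
import matplotlib
matplotlib.use("Agg")  # render to a file so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up data (hypothetical): study hours versus test scores with some noise.
rng = np.random.default_rng(seed=0)
hours = np.linspace(1, 10, 30)
scores = 50 + 4 * hours + rng.normal(0, 3, size=hours.size)

fig, ax = plt.subplots()
ax.scatter(hours, scores, label="students")

# Overlay a least-squares straight line to make the trend visible.
slope, intercept = np.polyfit(hours, scores, 1)
ax.plot(hours, slope * hours + intercept, color="red", label="best-fit line")

ax.set_xlabel("Hours studied")
ax.set_ylabel("Test score")
ax.set_title("Study time vs. score")
ax.legend()
fig.savefig("study_scatter.png")
```

To see the plot in a window instead, remove the `matplotlib.use("Agg")` line and call `plt.show()` at the end.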

Matplotlib allows researchers to present their findings in an efficient manner. Visualizations and graphs provided by Matplotlib are very easy to understand and are perfect for research projects. Additionally, by using Matplotlib, data scientists have the ability to make any changes as they learn more about their dataset.

Matplotlib provides many features for creating polished graphs and data visualizations. If you want to step it up a notch, I recommend using Seaborn along with Matplotlib. Seaborn is built on top of Matplotlib, and it gives the user more control over the appearance of a graph as well as additional types of graphs. Again, I recommend watching Corey Schafer’s course on Matplotlib to get started with data visualization.

Now, let’s move on to what most of you were probably waiting for: machine learning.

Machine learning is essentially the study of algorithms that learn patterns from data. It is usually divided into three types of learning: supervised, unsupervised, and reinforcement. Scikit-Learn mainly provides tools for supervised and unsupervised learning, so I won’t touch on reinforcement learning in this article.
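The difference between the first two types is easiest to see side by side. In the sketch below, which uses synthetic data generated by Scikit-Learn’s `make_blobs`, the supervised model is given the correct labels while the unsupervised model has to find the groups on its own. The choice of logistic regression and k-Means here is mine, purely for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic data: 200 points in two well-separated groups.
X, y = make_blobs(
    n_samples=200,
    centers=[[-5, -5], [5, 5]],
    cluster_std=1.0,
    random_state=0,
)

# Supervised: the model sees the labels y during training.
classifier = LogisticRegression()
classifier.fit(X, y)
supervised_accuracy = classifier.score(X, y)

# Unsupervised: the model sees only X and has to find the groups itself.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print(supervised_accuracy, set(cluster_ids))
```

Note that the cluster numbers k-Means assigns are arbitrary: its group 0 may correspond to label 1 in `y`, since it never saw the labels.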

Scikit-Learn is a framework used for predictive analysis of data, and most data scientists use Scikit-Learn for the following:

  • Classification
  • Regression
  • Clustering
  • Preprocessing

Scikit-Learn provides many machine learning models for classification, regression, and clustering. These include linear regression, logistic regression, k-Nearest Neighbor, k-Means, Support Vector Machines (SVM), and more. Data scientists rely heavily on these models for predictive data analysis. Scikit-Learn also condenses the machine learning workflow so much that you could develop a model for a complex, multivariate dataset in potentially fewer than 100 lines of code. This makes it perfect for beginners who want to start developing data science projects and gain more experience in the field.

Additionally, Scikit-Learn offers a package named preprocessing. This package allows the user to convert string data into numerical data, a key feature since most machine learning models expect numerical inputs. By handling much of this data preparation for you, Scikit-Learn lets you focus on the more important parts of the project.
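To illustrate how compact this can be, here is a minimal sketch of a k-Nearest Neighbor classifier on the iris dataset that ships with Scikit-Learn. The dataset, the split size, and `n_neighbors=5` are my own choices for the example. The labels are deliberately turned into strings first so that `LabelEncoder` from the preprocessing package has something to convert.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

iris = load_iris()
X = iris.data

# Build string labels to mimic a dataset that stores its classes as text.
species = iris.target_names[iris.target]

# Preprocessing: convert the string labels into integers.
encoder = LabelEncoder()
y = encoder.fit_transform(species)

# Hold out a quarter of the data to measure accuracy on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Classify each flower by a vote among its 5 nearest training samples.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# Convert a few predictions back into readable species names.
predicted = encoder.inverse_transform(model.predict(X_test[:3]))
print(accuracy, predicted)
```

The whole workflow (load, encode, split, train, evaluate) fits in about twenty lines, and swapping in another model is mostly a matter of changing the line that creates `model`.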

For Scikit-Learn, I recommend watching Tech with Tim’s tutorial series on basic machine learning models in which he uses Scikit-Learn for regression, classification, and clustering analysis.

This guide outlined the fundamentals of data science and how high school students can get involved. It can be a lot to take in at once, but when you put in enough dedication and effort to learn how to use these programming frameworks, you can truly create awesome projects that can have a major impact. Furthermore, obtaining these data analytics skills is essential when pursuing almost all STEM careers. If you have any questions or would like to learn more about how to get involved in data science, feel free to contact me at zeeshanp@berkeley.edu.

I’m a freshman studying CS and Statistics at UC Berkeley.