Stock Market Prediction Using Machine Learning

Originally published at Analyst Admin.

In this post, I will be transforming raw Stock Market trading data into refined machine-learning-ready data that can be used to predict the price change for the next trading day. For example, will AAPL go up 1% or down 3% tomorrow?

Using data science principles, we will engineer technical indicators, such as open-close percent change and simple moving averages, that will be used as features to train a machine learning model. Additionally, we will transform the dataset from single-point observations into a time series. We will wrap up by evaluating our model so that it can provide a benchmark for future revisions and enhancements.

As you may have guessed already, I will be using Scikit-Learn, Python, Pandas, and NumPy. If you don’t already have them, I suggest downloading the full package from Anaconda, which conveniently installs everything in one go. Setup instructions can be found in Data Science Like a Pro: Anaconda and Jupyter Notebook on Visual Studio Code.

See my GitHub repository for the full project, including data and Jupyter Notebook.

The Data

I pulled the daily data from the Yahoo API (before it was taken down) and combined it with Nasdaq data to bring in features like Market Cap, IPO Year, etc. One of my future posts will cover where and how to pull your own stock data, including how to store it locally in a database. For now, my GitHub contains sample data.

Some questions about the data:

  1. Which column is the target vector? We will need to create it!
  2. How will features in the sample help predict the target vector? We will need to chain together several days of trading into one observation so that the model can learn based on patterns.

Clean Up

For the sake of simplicity, I will work with a single company and keep only the features that change day to day: date, open, high, low, close, volume, and adjclose.
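As a rough sketch of this clean-up step, the snippet below filters a tiny made-up table down to one company and the day-to-day columns. The "symbol" and "ipoyear" column names and all values are assumptions for illustration only; the real sample data may be laid out differently.

```python
import pandas as pd

# Hypothetical raw sample: column names and values are invented for illustration.
raw = pd.DataFrame({
    "symbol": ["AAPL", "AAPL", "MSFT"],
    "date": ["2016-11-11", "2016-11-10", "2016-11-11"],
    "open": [100.0, 102.0, 50.0],
    "high": [103.0, 104.0, 52.0],
    "low": [99.0, 100.0, 49.0],
    "close": [101.0, 100.5, 51.0],
    "volume": [1_000_000, 1_200_000, 900_000],
    "adjclose": [101.0, 100.5, 51.0],
    "ipoyear": [1980, 1980, 1986],
})

daily_columns = ["date", "open", "high", "low", "close", "volume", "adjclose"]

# Keep a single company and only the columns that change from day to day.
df = raw.loc[raw["symbol"] == "AAPL", daily_columns].reset_index(drop=True)

# Parse the date strings so the rows can be sorted chronologically later.
df["date"] = pd.to_datetime(df["date"])
print(df)
```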

Technical Indicators

I’ll start by adding the first technical indicator and target vector: open-close percent change. To create this series we can use Pandas vectorized operations on the open and close columns.
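A minimal sketch of that calculation, assuming the percent change is measured from the open to the close of the same day and stored as a fraction. The column name "pctChange" and the prices are placeholders, not taken from the original notebook.

```python
import pandas as pd

# Made-up prices for illustration.
df = pd.DataFrame({
    "open": [100.0, 50.0, 200.0],
    "close": [101.0, 48.5, 200.0],
})

# Vectorized: the arithmetic is applied to every row at once.
# The result is a fraction, so 0.01 means +1% and -0.03 means -3%.
df["pctChange"] = (df["close"] - df["open"]) / df["open"]
print(df)
```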

Once we have our target vector we can start to engineer new features that help us model the relationship between features and the target vector. For example, for observation #1 (2016-11-11), how do open, high, low, close, volume, and adjclose correlate with the percent change? The percent change is a direct function of that day’s open and close, so these same-day features give the answer away. In the real world, if we want to predict the next day, we will not have its opening and closing prices!

Let’s continue by creating a new feature based on the previous trading day’s percent change.
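One way to sketch this is with the Pandas shift method. The example assumes the rows are still sorted newest-first at this point, so the previous trading day sits one row below; the names "pctChange" and "pctChange-1" and the values are placeholders.

```python
import pandas as pd

# Made-up values, newest date on top.
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-11-11", "2016-11-10", "2016-11-09"]),
    "pctChange": [0.01, -0.03, 0.02],
})

# With newest-first rows, the previous trading day is the next row down,
# so shift(-1) pulls that value up onto the current row.
df["pctChange-1"] = df["pctChange"].shift(-1)
print(df)
```

If the data were sorted oldest-first instead, shift(1) would be the equivalent call. Either way, the oldest row ends up with NaN because it has no earlier day to borrow from.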

Now let’s try that again to get the percent change from two days ago.

Notice how the data is evolving from what happened during the isolated trading day to what happened in the days leading to that day. We can repeat this process to bring in ’n’ number of previous percent changes so that the machine learning algorithm can learn how trends correlate with the target vector.
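The repetition described above could be wrapped in a loop. This is only a sketch: the number of lags, the column naming pattern, and the newest-first row order are all assumptions.

```python
import pandas as pd

# Made-up percent changes, newest date on top.
df = pd.DataFrame({"pctChange": [0.01, -0.03, 0.02, 0.00, -0.01]})

n_lags = 3  # hypothetical choice of how many previous days to bring in

for lag in range(1, n_lags + 1):
    # shift(-lag) pulls the value from `lag` rows below (i.e. `lag` days earlier)
    # onto the current row.
    df[f"pctChange-{lag}"] = df["pctChange"].shift(-lag)

print(df)
```

Each extra lag costs one more row of NaN at the old end of the data, so a larger n trades history for context.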

Before calculating the simple moving average (SMA), we first need to sort the data with the oldest date on top. A Pandas rolling window always covers the current row and the rows before it, so the row order alone decides whether the SMA is backward-looking or forward-looking. If you skip this step and leave the newest date on top, each average will be built from future days.
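A short sketch of the sort, assuming a parsed "date" column; the values are made up.

```python
import pandas as pd

# Made-up values, newest date on top.
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-11-11", "2016-11-10", "2016-11-09"]),
    "close": [103.0, 102.0, 101.0],
})

# Oldest date first; reset the index so row 0 is the earliest trading day.
df = df.sort_values("date", ascending=True).reset_index(drop=True)
print(df)
```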

Now we calculate SMA using the rolling and mean operations of a Pandas series. The “5” below is for the number of days. If we want to create a Long SMA we can use something like “100” days instead of “5”. However, I will stick with a small number so that the effect can be viewed in a small window of data.
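A minimal sketch of a 5-day SMA, assuming it is computed on the close column (the original notebook may use a different column). The prices are made up and sorted oldest-first.

```python
import pandas as pd

# Made-up closing prices, oldest date on top.
df = pd.DataFrame({"close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]})

# rolling(5) builds a window of the current row plus the 4 rows before it;
# mean() averages each window.
df["shortSma"] = df["close"].rolling(5).mean()
print(df)
```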

Notice the first 4 days (index 0–3) do not have a “shortSma” value. That’s because fewer than five days of data are available up to and including each of those days, which is not enough for rolling(5).

Now that we have SMA we can shift it to create two new features.
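A sketch of the shift, assuming the two new features are the SMA from one and two trading days earlier. The name "shortSma-2" is a guess that follows the "shortSma-1" pattern, and the values are made up.

```python
import numpy as np
import pandas as pd

# Made-up SMA values, oldest date on top.
df = pd.DataFrame({"shortSma": [np.nan, 12.0, 13.0, 14.0]})

# With oldest-first rows, shift(1) moves each value down one row,
# so every row sees the SMA as of the previous trading day.
df["shortSma-1"] = df["shortSma"].shift(1)
df["shortSma-2"] = df["shortSma"].shift(2)
print(df)
```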

That’s it for new features. Now we can clean up the dataframe by removing all the NaNs.

The Pandas “dropna” function removes every observation where at least one feature contains a NaN, so be careful. If you are not sure about your data, you can use the “subset” argument to drop only the observations where specific columns contain NaN.
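The difference between the two behaviors can be sketched on a small made-up frame; the column names are placeholders.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pctChange": [0.01, -0.03, 0.02, 0.00],
    "shortSma": [np.nan, 12.0, 13.0, 14.0],
    "shortSma-1": [np.nan, np.nan, 12.0, 13.0],
})

# Drop every row that has a NaN in any column.
clean_all = df.dropna()

# Drop only the rows where the listed columns are NaN.
clean_subset = df.dropna(subset=["shortSma"])

print(len(df), len(clean_all), len(clean_subset))
```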

Now let’s separate the data into an X and y (features & target vector).
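A sketch of the split into X and y. Which columns go into X is an assumption here: only lagged features are used, so that nothing from the target day itself leaks into the inputs. All names and values are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    "pctChange": [0.01, -0.03, 0.02, 0.00],
    "pctChange-1": [-0.03, 0.02, 0.00, -0.01],
    "shortSma-1": [12.0, 13.0, 14.0, 15.0],
})

# Hypothetical feature list: lagged columns only.
feature_columns = ["pctChange-1", "shortSma-1"]

X = df[feature_columns]  # features (DataFrame)
y = df["pctChange"]      # target vector (Series)
print(X.shape, y.shape)
```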

We will then scale X so that, when we train our linear regression model, the coefficients can be compared on the same scale. I will use a scikit-learn function called “scale” from the preprocessing module.
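A small sketch of the scaling step on made-up features. Note that scale returns a NumPy array rather than a DataFrame, so the column names have to be kept separately if they are needed later.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

# Made-up features on very different scales.
X = pd.DataFrame({
    "pctChange-1": [0.01, -0.03, 0.02, 0.00],
    "shortSma-1": [12.0, 13.0, 14.0, 15.0],
})

# Standardize each column to zero mean and unit variance.
X_scaled = scale(X)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```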

Next, I will use the scikit-learn “train_test_split” function to divide my data into training and testing sets randomly. The testing size is set to “0.2”, or 20%, which means I will be training the model on the remaining 80% of the dataset, a common split for a simple model such as linear regression.

Note that this produces four new objects: X_train and X_test (the features) and y_train and y_test (the target vectors).
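The split can be sketched as follows on random stand-in data. The random_state argument is an addition for reproducibility and is not mentioned in the original.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data: 100 observations, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Hold out 20% of the rows for testing; rows are shuffled before splitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```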

Now the data is 100% ready for a machine learning algorithm. From this point on we jump into a linear regression algorithm to train and test the model. Note that in the real world you will probably not be using linear regression to predict the stock market. I’m using it here for the sake of simplicity.

To fit the model to the training data we will use the “fit” function and pass in our training data.
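A minimal sketch of fitting scikit-learn's LinearRegression. The training data here is synthetic and noiseless, with a known relationship, purely so the result is easy to check; real stock data will not behave this cleanly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data with a known linear relationship.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 2))
y_train = 0.5 * X_train[:, 0] - 0.2 * X_train[:, 1]

model = LinearRegression()
model.fit(X_train, y_train)

# On this noiseless data the fit recovers the weights used above.
print(model.coef_, model.intercept_)
```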

Now that the model is fit, we can see the coefficient of each feature. Because the features were scaled, the coefficient with the largest absolute value generally has the strongest influence on the predicted target vector. A positive coefficient means the prediction rises as the feature rises, while a negative coefficient means it falls.

To see which feature has the most correlation we can create a new dataframe with coefficients and column names. I’ve also gone ahead and taken the absolute value and sorted the data.
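A sketch of such a table. The coefficient values below are placeholders standing in for the fitted model's coef_ attribute, and the feature names are assumptions; they are not results from the original notebook.

```python
import pandas as pd

# Placeholder names and values; in practice these would come from the
# feature column list and from model.coef_.
feature_names = ["pctChange-1", "pctChange-2", "shortSma-1"]
coefficients = [0.002, -0.001, -0.010]

coef_table = pd.DataFrame({"feature": feature_names, "coefficient": coefficients})

# Rank by magnitude, ignoring the sign.
coef_table["absCoefficient"] = coef_table["coefficient"].abs()
coef_table = coef_table.sort_values("absCoefficient", ascending=False).reset_index(drop=True)
print(coef_table)
```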

Notice how shortSma-1 has the highest absolute value! That means that, among the scaled features, shortSma-1 has the strongest linear relationship with the percent change (target vector) in this model. This type of analysis can be used to create more relevant features and to understand what value each feature is bringing to the model.

Now let’s use the trained model to predict on the testing set.
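A sketch of the prediction step, again on synthetic noiseless data so the output is easy to verify. The simple 80/20 slice is only for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 0.5 * X[:, 0] - 0.2 * X[:, 1]

X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

model = LinearRegression().fit(X_train, y_train)

# One predicted value per row of the testing set.
predictions = model.predict(X_test)
print(predictions[:5])
```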

Now that we have the test predictions, we can calculate the mean squared error (MSE) and then take its square root to arrive at the root mean squared error (RMSE).
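The two-step calculation can be sketched with scikit-learn's mean_squared_error and NumPy's sqrt. The actual and predicted values are made up to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted percent changes (as fractions).
y_test = np.array([0.01, -0.03, 0.02, 0.00])
predictions = np.array([0.00, -0.01, 0.02, 0.02])

# MSE: the average of the squared errors.
mse = mean_squared_error(y_test, predictions)

# RMSE: the square root brings the error back to the units of the target.
rmse = np.sqrt(mse)
print(mse, rmse)
```

Because the RMSE is in the same units as the target, it can be read directly as a typical size of the prediction error.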

As you can see, our RMSE is 0.04, which means our predictions are typically off by about 4%. That’s not very good for trading, but it provides a baseline to compare against. To improve the results, we can try adding more features and observations, trying different machine learning algorithms, and optimizing the parameters of those algorithms. I hope this gives you an idea of how we go from bare stock trading data to refined machine learning data!

Thank you for reading!

Originally published at Analyst Admin on October 18, 2020.

Hi! I’m an analyst and administrator who enjoys helping the analytics community.
