PSEi Stock Prediction: A Data Science Project
Hey guys! Ever wondered if you could predict the Philippine Stock Exchange index (PSEi) using data science? Well, you're in the right place! This article dives into building a data science project for PSEi stock market prediction. We'll cover everything from gathering data to building and evaluating models. Get ready to unleash your inner data scientist!
Why Predict the PSEi?
Predicting the stock market, especially the PSEi, is a fascinating and challenging task. It's not just about making money (though that's a definite perk!). Understanding the forces that drive the PSEi can give you valuable insights into the Philippine economy. Here’s why this project is super cool:
- Economic Indicator: The PSEi reflects the overall health of the Philippine economy. A rising PSEi often indicates economic growth, while a falling PSEi can signal potential downturns. By analyzing historical data and identifying patterns, we can gain a better understanding of these economic trends and potentially forecast future movements.
 - Investment Opportunities: Accurate predictions, even if they are not perfect, can help investors make informed decisions. Identifying potential upward trends allows for strategic investments, while recognizing potential declines can help mitigate risk. This empowers investors to navigate the market with greater confidence and potentially maximize their returns.
 - Risk Management: Understanding the volatility and potential risks associated with the PSEi is crucial for effective risk management. Predictive models can help investors assess the potential impact of various market scenarios and develop strategies to protect their investments from significant losses. By quantifying risk, investors can make more informed decisions about asset allocation and portfolio diversification.
 - Learning Experience: Building a PSEi prediction model is an excellent way to enhance your data science skills. You'll learn about data collection, cleaning, feature engineering, model selection, and evaluation. It's a real-world project that combines theory with practical application, solidifying your understanding of key data science concepts. This hands-on experience will make you a more competent and valuable data scientist.
 - Personal Satisfaction: Successfully building a predictive model, even a basic one, can be incredibly rewarding. Seeing your code come to life and generate meaningful insights is a fantastic feeling. It's a testament to your skills and dedication, and it can inspire you to tackle even more complex data science challenges. The sense of accomplishment you'll gain from this project is well worth the effort.
 
So, are you excited? Let's dive into the project!
Step 1: Data Acquisition
The first step in any data science project is gathering the data. For PSEi prediction, you'll need historical stock data. Here's where you can find it:
- Philippine Stock Exchange (PSE) Website: The official PSE website (https://www.pse.com.ph/) is a primary source. They often have historical data available, though you might need to do some digging or contact them directly. Look for sections related to market statistics, historical data, or data downloads.
 - Financial APIs: APIs like Yahoo Finance, Alpha Vantage, and Tiingo provide historical stock data for various markets. Coverage of Philippine data varies by provider, so confirm that the one you pick actually carries the PSEi. These APIs are usually the easiest way to automate data collection. Some require an account and an API key (several offer free tiers). Python libraries like yfinance and alpha_vantage make this process super simple.
 - Third-Party Data Providers: Several companies specialize in providing financial data. These providers often offer comprehensive datasets that include not just historical prices but also other relevant information like news sentiment, economic indicators, and company financials. However, these services typically come with a subscription fee.
 - Web Scraping: As a last resort, you can try web scraping data from websites that display historical PSEi information. However, be aware that web scraping can be brittle (websites change their structure) and may violate the website's terms of service. Use this method with caution and respect the website's robots.txt file.
 
Data Points to Collect:
- Date: The date of the observation.
 - Open: The opening price of the PSEi on that day.
 - High: The highest price of the PSEi during that day.
 - Low: The lowest price of the PSEi during that day.
 - Close: The closing price of the PSEi on that day. This is often your target variable.
 - Volume: The volume of shares traded on that day.
 - Adjusted Close: The closing price adjusted for dividends and stock splits.
 
Coding Example (using yfinance in Python):
import yfinance as yf
# Get PSEi data (ticker symbol: PSEI.PS)
psei = yf.Ticker("PSEI.PS")
# Get historical data
data = psei.history(period="5y") # Get data for the last 5 years
print(data.head())
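This code snippet fetches the last 5 years of PSEi data and prints the first few rows. You can adjust the period parameter to get data for different timeframes. The history() method already returns a Pandas DataFrame indexed by date, so it's ready for further analysis. By default, yfinance returns prices that are already adjusted for splits and dividends (auto_adjust=True), so you won't see a separate Adjusted Close column.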
Step 2: Data Cleaning and Preprocessing
Okay, you've got your data! Now comes the less glamorous but absolutely crucial step: cleaning and preprocessing. Real-world data is messy. It often contains missing values, outliers, and inconsistencies. Cleaning it ensures your model learns from accurate information.
- Handling Missing Values:
    - Identify Missing Values: Use data.isnull().sum() in Pandas to find columns with missing values.
    - Impute or Remove:
        - Imputation: Fill missing values with a reasonable estimate. Common methods include using the mean, median, or a constant value. For time series data, forward fill (carry the last known value forward) is usually the safer choice; backward fill copies future values into the past, which can leak information into your model. Pandas provides data.fillna(), data.ffill(), and data.bfill() for this purpose.
        - Removal: If a small percentage of rows have missing values, you can simply remove them using data.dropna(). However, be cautious about removing too much data.
 - Outlier Detection and Treatment:
    - Visualize Data: Use box plots or scatter plots to identify potential outliers.
    - Z-Score or IQR: Calculate the Z-score or Interquartile Range (IQR) to identify data points that fall outside a predefined range.
    - Winsorizing or Truncating: Replace extreme values with a less extreme value (Winsorizing) or remove them entirely (truncating).
 - Data Transformation:
    - Scaling: Scale numerical features to a similar range. Common methods include MinMaxScaler (scales to [0, 1]) and StandardScaler (standardizes to have mean 0 and standard deviation 1). Use sklearn.preprocessing for these transformations, and fit the scaler on the training set only so that test-set statistics don't leak into training.
    - Normalization: Normalize each sample (row) to have a unit norm. This is useful when the magnitude of the features is not important.
    - Log Transformation: Apply a logarithmic transformation to reduce skewness in the data. This can be helpful for features with a long tail distribution.
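Here's a minimal sketch of the cleaning ideas above, not a definitive recipe. It forward-fills a gap, winsorizes daily returns using IQR fences, and adds a log-transformed price. The function name clean_prices, the multiplier k=1.5, and the toy prices are all assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

def clean_prices(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Forward-fill gaps, then winsorize daily returns with the IQR rule."""
    out = df.sort_index().copy()
    # Forward fill only uses past values, so it does not pull in future data.
    out["Close"] = out["Close"].ffill()
    out = out.dropna(subset=["Close"])  # rows before the first known price

    # Work on daily returns: index levels trend over time, so outlier rules
    # applied to raw levels tend to flag entire periods rather than odd days.
    ret = out["Close"] / out["Close"].shift(1) - 1
    q1, q3 = ret.quantile(0.25), ret.quantile(0.75)
    iqr = q3 - q1
    # Winsorize: cap extreme returns at the IQR fences instead of deleting rows.
    out["Return"] = ret.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    # Log transform of the price level (prices are strictly positive).
    out["LogClose"] = np.log(out["Close"])
    return out

# Toy data (hypothetical values): one missing day and one extreme jump.
idx = pd.date_range("2024-01-01", periods=8, freq="D")
toy = pd.DataFrame(
    {"Close": [100.0, 101.0, np.nan, 102.0, 150.0, 103.0, 104.0, 105.0]},
    index=idx,
)
cleaned = clean_prices(toy)
print(cleaned)
```

In a real project you'd probably compute the IQR fences on the training period only and reuse them on the test period. Also keep in mind that a large move in the index may be a genuine market event rather than a data error, so look before you clip.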
 
Feature Engineering:
This is where you get creative! Feature engineering involves creating new features from existing ones that might be more informative for your model.
- Technical Indicators: Calculate common technical indicators used in stock trading:
    - Moving Averages (SMA, EMA): Calculate simple and exponential moving averages over different time periods (e.g., 5-day, 20-day, 50-day). These smooth out price fluctuations and highlight trends.
    - Relative Strength Index (RSI): Measures the magnitude of recent price changes to evaluate overbought or oversold conditions in the price of a stock or other asset.
    - Moving Average Convergence Divergence (MACD): A trend-following momentum indicator that shows the relationship between two moving averages of a security’s price.
    - Bollinger Bands: Bands plotted at standard deviation levels above and below a moving average. These bands can indicate volatility and potential price breakouts.
 
 - Lagged Variables: Include past values of the PSEi (e.g., the closing price from the previous day, the day before that, etc.). These lagged variables can capture the autocorrelation in the time series data.
 - Volatility: Calculate the rolling standard deviation of the PSEi price over a certain period to measure volatility.
 - Date-Related Features: Extract features from the date, such as the day of the week, month, or quarter. These features can capture seasonal patterns in the stock market.
 
Coding Example (Feature Engineering with Pandas):
import pandas as pd
# Assuming you have a DataFrame called 'data'
# Calculate 5-day moving average
data['SMA_5'] = data['Close'].rolling(window=5).mean()
# Calculate RSI (Relative Strength Index)
def calculate_rsi(data, window=14):
    delta = data['Close'].diff()
    up, down = delta.copy(), delta.copy()
    up[up < 0] = 0
    down[down > 0] = 0
    roll_up1 = up.rolling(window=window).mean()
    roll_down1 = down.abs().rolling(window=window).mean()
    RS = roll_up1 / roll_down1
    RSI = 100.0 - (100.0 / (1.0 + RS))
    return RSI
data['RSI'] = calculate_rsi(data)
print(data.head())
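This code snippet demonstrates how to calculate a 5-day moving average and the RSI. The first rows will contain NaN because a rolling window needs a full window of past values before it can produce a result. This RSI uses a simple rolling mean of gains and losses; Wilder's original version uses a smoothed average, so your numbers may differ slightly from what charting platforms show. You can adapt these examples to create other technical indicators.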
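The other features from the list (MACD, Bollinger Bands, lagged closes, rolling volatility, and date features) can be sketched the same way. This is one possible version, run on a synthetic random walk so it works without a download. The helper name add_features, the column names, and the window sizes (12/26/9 for MACD, 20 days and 2 standard deviations for the bands) are common conventions I'm assuming here, not requirements.

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a few common technical and calendar features to a price DataFrame."""
    out = df.copy()
    close = out["Close"]

    # MACD: difference of a fast and a slow EMA, plus an EMA "signal" line.
    ema12 = close.ewm(span=12, adjust=False).mean()
    ema26 = close.ewm(span=26, adjust=False).mean()
    out["MACD"] = ema12 - ema26
    out["MACD_signal"] = out["MACD"].ewm(span=9, adjust=False).mean()

    # Bollinger Bands: 20-day SMA plus/minus 2 rolling standard deviations.
    sma20 = close.rolling(20).mean()
    std20 = close.rolling(20).std()
    out["BB_upper"] = sma20 + 2 * std20
    out["BB_lower"] = sma20 - 2 * std20

    # Lagged closes: the close from 1, 2, and 3 trading days earlier.
    for lag in (1, 2, 3):
        out[f"Close_lag{lag}"] = close.shift(lag)

    # Volatility: rolling standard deviation of daily returns.
    out["Volatility_20"] = (close / close.shift(1) - 1).rolling(20).std()

    # Calendar features (requires a DatetimeIndex).
    out["DayOfWeek"] = out.index.dayofweek
    out["Month"] = out.index.month
    return out

# Synthetic random-walk prices standing in for real PSEi data.
rng = np.random.default_rng(0)
idx = pd.bdate_range("2023-01-02", periods=120)
prices = pd.DataFrame({"Close": 6500 + rng.normal(0, 30, len(idx)).cumsum()}, index=idx)

features = add_features(prices)
print(features.dropna().tail())
```

Every feature here only looks backward in time, which is what you want: a feature for a given day should never depend on prices from later days.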
Step 3: Model Selection and Training
Alright, data's clean and features are engineered! Time for the fun part: building your prediction model. Here are a few popular choices:
- Linear Regression: A simple and interpretable model. It assumes a linear relationship between the features and the target variable. It's a good starting point but might not capture the complexities of the stock market.
 - Support Vector Regression (SVR): A powerful model that can capture non-linear relationships. It's more complex than linear regression but can often provide better results.
 - Random Forest: An ensemble method that combines multiple decision trees. It's robust to outliers and can handle non-linear relationships well. One caveat: trees can't predict values outside the range they saw during training, so for a trending index it often works better to predict returns or price changes than raw price levels.
 - Long Short-Term Memory (LSTM) Networks: A type of recurrent neural network (RNN) specifically designed for sequential data. LSTMs can capture long-term dependencies in the data, which makes them a popular choice for time series like stock prices. They are more complex to implement and need more data and tuning, and they don't automatically beat simpler models, so always compare them against a simple baseline.
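If you go the LSTM route, the main data-preparation step is turning the price series into overlapping windows. Here's a small NumPy sketch of that step only (no network is built). The helper name make_windows and the lookback of 3 are illustrative assumptions; the (samples, timesteps, features) shape is the layout Keras LSTM layers expect, and PyTorch accepts it with batch_first=True.

```python
import numpy as np

def make_windows(series, lookback):
    """Turn a 1-D series into (samples, lookback, 1) inputs and next-step targets."""
    values = np.asarray(series, dtype=float)
    X, y = [], []
    for end in range(lookback, len(values)):
        X.append(values[end - lookback:end])  # the previous `lookback` values
        y.append(values[end])                 # the value right after the window
    X = np.array(X).reshape(-1, lookback, 1)
    return X, np.array(y)

# Toy series 0..9 so the windows are easy to check by eye.
closes = np.arange(10.0)
X, y = make_windows(closes, lookback=3)
print(X.shape, y.shape)    # (7, 3, 1) (7,)
print(X[0].ravel(), y[0])  # [0. 1. 2.] 3.0
```

With real prices you'd typically scale the series first (fitting the scaler on the training portion) and then split the windows by time, just like the tabular data.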
 
Splitting the Data:
Before training, split your data into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance on unseen data. A common split is 80% for training and 20% for testing. Make sure to use a time-based split to preserve the temporal order of the data.
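# Assuming 'data' is your DataFrame with the engineered features
data = data.copy()
# Predict the NEXT day's close; same-day Open/High/Low would leak the answer
data['Target'] = data['Close'].shift(-1)
# Drop rows with NaNs (rolling-window warm-up rows and the last row, which has no target)
data = data.dropna()
X = data.drop('Target', axis=1) # Features known at the end of each day
y = data['Target'] # Target variable
# Time-based split: first 80% for training, last 20% for testing
train_size = int(len(X) * 0.8)
X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]
print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")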
Training the Model:
Now, train your chosen model using the training data. This involves feeding the training data to the model and allowing it to learn the relationships between the features and the target variable.
from sklearn.linear_model import LinearRegression
# Create a linear regression model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
print("Model trained!")
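For more complex models like SVR, Random Forest, or LSTM, you'll need to adjust the code accordingly. Scikit-learn provides classes for SVR and Random Forest, while TensorFlow or PyTorch are commonly used for LSTMs. Be sure to optimize hyperparameters using techniques like grid search or random search to improve model performance. With time series, use a time-aware splitter such as scikit-learn's TimeSeriesSplit for the cross-validation, so that each validation fold always comes after the data it was trained on.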
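As a rough sketch of what tuning a more complex model can look like, the example below wraps a scaler and an SVR in a Pipeline and runs a grid search with TimeSeriesSplit. It uses synthetic prices and lagged returns as features, and the parameter grid is an arbitrary assumption chosen to keep the run short, not a recommended search space.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic random-walk prices standing in for real PSEi data.
rng = np.random.default_rng(42)
idx = pd.bdate_range("2022-01-03", periods=300)
close = pd.Series(6500 + rng.normal(0, 30, len(idx)).cumsum(), index=idx)
ret = close / close.shift(1) - 1

# Features: the two previous daily returns. Target: the current daily return.
frame = pd.DataFrame(
    {"ret_lag1": ret.shift(1), "ret_lag2": ret.shift(2), "target": ret}
).dropna()
X, y = frame[["ret_lag1", "ret_lag2"]], frame["target"]

# Putting the scaler inside the pipeline means it is re-fit on each training
# fold, so validation data never influences the scaling.
pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR())])
param_grid = {"svr__C": [0.1, 1.0, 10.0], "svr__epsilon": [0.001, 0.01]}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),  # validation folds always follow training folds
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Cross-validated MAE: {-search.best_score_:.5f}")
```

In the article's workflow you'd run the search on X_train and y_train only and keep the test set untouched until the final evaluation.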
Step 4: Model Evaluation
Time to see how well your model performs! Use the testing data to evaluate its accuracy. Common metrics include:
- Mean Absolute Error (MAE): The average absolute difference between the predicted and actual values. Lower MAE indicates better performance.
 - Mean Squared Error (MSE): The average squared difference between the predicted and actual values. MSE penalizes larger errors more heavily than MAE.
 - Root Mean Squared Error (RMSE): The square root of the MSE. RMSE is easier to interpret than MSE because it's in the same units as the target variable.
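 - R-squared (Coefficient of Determination): Measures the proportion of variance in the target variable that is explained by the model. The best possible score is 1. A score of 0 means the model does no better than always predicting the mean, and on test data the score can even be negative when the model does worse than that.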
 
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
# Make predictions on the test set
y_pred = model.predict(X_test)
# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R2): {r2:.2f}")
Interpreting the Results:
- MAE, MSE, and RMSE: These metrics provide a measure of the average prediction error. Smaller values indicate better accuracy.
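 - R-squared: This metric indicates the proportion of variance in the target variable that is explained by the model. An R-squared of 0.8, for example, means that the model explains 80% of the variance in the PSEi closing price. Be careful with price levels, though: today's close is usually very near yesterday's, so even a naive "tomorrow equals today" forecast can score a high R-squared. Always compare your model against a simple baseline like that.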
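To put the error metrics in context, it helps to compare the model against a naive "persistence" forecast that predicts tomorrow's close as today's close. Here's a sketch on synthetic data; the column names and the choice of features are assumptions for illustration, and the numbers it prints say nothing about real PSEi performance.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic random-walk prices standing in for real PSEi data.
rng = np.random.default_rng(1)
idx = pd.bdate_range("2022-01-03", periods=400)
close = pd.Series(6500 + rng.normal(0, 30, len(idx)).cumsum(), index=idx)

frame = pd.DataFrame({"Close": close, "Close_lag1": close.shift(1)})
frame["Target"] = frame["Close"].shift(-1)  # next day's close
frame = frame.dropna()

# Time-based split: first 80% for training, last 20% for testing.
split = int(len(frame) * 0.8)
train, test = frame.iloc[:split], frame.iloc[split:]
feature_cols = ["Close", "Close_lag1"]

model = LinearRegression().fit(train[feature_cols], train["Target"])
mae_model = mean_absolute_error(test["Target"], model.predict(test[feature_cols]))

# Persistence baseline: tomorrow's close is predicted to equal today's close.
mae_naive = mean_absolute_error(test["Target"], test["Close"])

print(f"Model MAE:    {mae_model:.2f}")
print(f"Baseline MAE: {mae_naive:.2f}")
```

If your model's error isn't clearly lower than the baseline's, it hasn't learned much beyond "prices move slowly", no matter how good the R-squared looks.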
 
Visualizing Predictions:
Plot the predicted values against the actual values to get a visual sense of the model's performance. This can help you identify areas where the model is performing well and areas where it is struggling.
import matplotlib.pyplot as plt
# Plot predictions vs. actual values
plt.figure(figsize=(12, 6))
plt.plot(y_test.index, y_test, label='Actual', color='blue')
plt.plot(y_test.index, y_pred, label='Predicted', color='red')
plt.xlabel('Date')
plt.ylabel('PSEi Closing Price')
plt.title('PSEi Prediction: Actual vs. Predicted')
plt.legend()
plt.show()
Step 5: Iteration and Improvement
Don't be discouraged if your initial model isn't perfect! Data science is an iterative process. Here's how to improve your model:
- Feature Engineering: Experiment with different features. Try adding more technical indicators, lagged variables, or external data sources (e.g., economic indicators, news sentiment).
 - Model Selection: Try different models. If linear regression isn't working well, try SVR, Random Forest, or LSTM.
 - Hyperparameter Tuning: Optimize the hyperparameters of your model using techniques like grid search or random search.
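 - More Data: Collect more data. A longer history gives the model more market conditions to learn from, though very old data may reflect a market regime that no longer applies, so check whether adding it actually improves your test metrics.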
 - Ensemble Methods: Combine multiple models to create an ensemble. This can often improve performance by averaging out the errors of individual models.
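As one example of the ensemble idea, scikit-learn's VotingRegressor averages the predictions of several models. The sketch below combines a linear regression and a random forest on synthetic lagged returns; the choice of models, the features, and the settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic daily returns standing in for real PSEi returns.
rng = np.random.default_rng(7)
returns = rng.normal(0, 0.01, 300)

# Features: the two previous returns. Target: the current return.
X = np.column_stack([returns[1:-1], returns[:-2]])
y = returns[2:]

# Time-based split: first 80% for training, last 20% for testing.
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# VotingRegressor fits each model and averages their predictions.
ensemble = VotingRegressor([
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
])
ensemble.fit(X_train, y_train)
pred = ensemble.predict(X_test)

print(pred[:5])
```

Averaging tends to help most when the individual models make different kinds of mistakes, so it's worth checking each model's test error on its own before combining them.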
 
Conclusion
Building a PSEi stock market prediction model is a challenging but rewarding data science project. It requires a combination of data acquisition, cleaning, feature engineering, model selection, and evaluation skills. By following the steps outlined in this article, you can build your own PSEi prediction model and gain valuable insights into the Philippine stock market. Remember to experiment, iterate, and have fun! Good luck, data scientists!