Training ML Models with Apache Spark on VPS

By Anurag Singh

Updated on Oct 17, 2024


In this tutorial, we will train machine learning models with Apache Spark on a VPS running Ubuntu 24.04.

We will cover how to install and configure Apache Spark on a VPS, and then use it to train machine learning models. Apache Spark is a powerful, open-source distributed computing system designed for big data processing. It provides an easy-to-use API for machine learning tasks and can process large-scale data sets efficiently by distributing workloads across multiple nodes.

Prerequisites

  • A KVM VPS or dedicated server with Ubuntu 24.04.
  • Root access, or a normal user with sudo rights.
  • A basic understanding of Linux command line and machine learning concepts.
  • Java Development Kit (JDK) version 8 or later.
  • Python 3.x installed (for using PySpark, Apache Spark's Python API).

Training ML Models with Apache Spark on VPS

Step 1: Install Java (JDK)

Apache Spark requires Java to run, so we need to install the Java Development Kit (JDK).

Update the package list:

sudo apt update

Install OpenJDK 17:

sudo apt install openjdk-17-jdk -y

Verify the installation by checking the version:

java -version

You should see output similar to:

openjdk version "17.0.12" 2024-07-16
OpenJDK Runtime Environment (build 17.0.12+7-Ubuntu-1ubuntu224.04)
OpenJDK 64-Bit Server VM (build 17.0.12+7-Ubuntu-1ubuntu224.04, mixed mode, sharing)

Step 2: Install Apache Spark

Download Spark:

Navigate to the Apache Spark Downloads page and copy the link to the latest stable release. Alternatively, you can use the following command to download Apache Spark directly:

wget https://dlcdn.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz

Extract the downloaded archive:

tar -xvzf spark-3.5.3-bin-hadoop3.tgz

Move Spark to /opt for easier access:

sudo mv spark-3.5.3-bin-hadoop3 /opt/spark

Set environment variables for Spark:

Open your .bashrc file:

nano ~/.bashrc

Add the following lines at the bottom of the file:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Save and exit the file (CTRL+O, Enter to confirm, then CTRL+X).

Apply the changes:

source ~/.bashrc

Then log out and log back in to your server via SSH so every new session also picks up the variables.
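
You can also quickly confirm that the variables are in place:

echo $SPARK_HOME
spark-submit --version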

Verify the Spark installation: Run the Spark shell to ensure Spark is installed correctly:

spark-shell

If everything is set up correctly, you should see the Spark interactive shell prompt, which looks like:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.3
      /_/
         
Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 17.0.12)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Exit the shell: Type :quit to exit the Spark shell.

Step 3: Install Python and PySpark

PySpark is the Python API for Spark, and it's the easiest way to work with Spark if you're familiar with Python.

Install Python 3 and pip:

sudo apt install python3 python3-pip python3-venv -y

Set Up a Virtual Environment

It's recommended to create a Python virtual environment for the Spark ML project to avoid conflicts between packages.

mkdir ~/spark_ml_app
cd ~/spark_ml_app
python3 -m venv venv
source venv/bin/activate
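
Note: when you later run jobs with spark-submit, you can point Spark at this environment's interpreter so it uses the packages installed here (the path below matches the ~/spark_ml_app/venv directory created above):

export PYSPARK_PYTHON=$HOME/spark_ml_app/venv/bin/python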

Install PySpark: You can install PySpark using pip:

pip install pyspark

Verify PySpark installation: Start the PySpark shell to verify that PySpark is installed correctly:

pyspark

If PySpark is correctly installed, you should see the PySpark shell prompt, which looks like:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.3
      /_/

Using Python version 3.12.3 (main, Sep 11 2024 14:17:37)
Spark context Web UI available at http://65.20.66.35:4040
Spark context available as 'sc' (master = local[*], app id = local-1729130920340).
SparkSession available as 'spark'.
>>> 

Exit the session by typing exit().

Step 4: Install Additional Python Libraries

To train machine learning models, we will need additional libraries like pandas and numpy. Install them using pip:

pip install pandas numpy

You might also need scikit-learn for preprocessing and model evaluation:

pip install scikit-learn

Install setuptools as well; on Python 3.12 it provides the distutils module that PySpark still expects:

pip install setuptools

Step 5: Prepare a Dataset for Training

We will use a sample dataset to train a machine learning model. For demonstration purposes, we’ll use the popular Iris dataset.

Download the dataset:

wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data -O iris.csv
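
Optionally, inspect the first few rows to see the file layout:

head -5 iris.csv

Each row holds four numeric measurements followed by the species name, with no header row; the first line, for example, is 5.1,3.5,1.4,0.2,Iris-setosa.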

Upload your own dataset (optional): If you're working with your own dataset instead, you can copy it to the server with tools like scp or rsync.
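
For example, from your local machine (the file name is a placeholder):

scp /path/to/your_dataset.csv user@your-server-ip:~/spark_ml_app/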

Step 6: Train a Machine Learning Model Using PySpark

Now, let’s create a simple machine learning model to classify the Iris dataset.

Create a Python file: Create a Python script on your server. Let's call it train_model.py.

nano train_model.py

Copy and paste the following content:

# train_model.py

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Step 1: Initialize Spark session
spark = SparkSession.builder \
    .appName('IrisClassification') \
    .getOrCreate()

# Step 2: Load the dataset (no header row; the fifth column is the species name)
data = spark.read.csv('iris.csv', inferSchema=True, header=False)
data = data.toDF('sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species')
data = data.na.drop()  # drop any empty rows (the UCI file ends with blank lines)

# Step 3: Prepare features and labels
# The species names are strings, so index them into a numeric 'label' column
indexer = StringIndexer(inputCol='species', outputCol='label')
data = indexer.fit(data).transform(data)

# Assemble the four measurement columns into a single feature vector
assembler = VectorAssembler(
    inputCols=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
    outputCol='features'
)
output = assembler.transform(data)
final_data = output.select('features', 'label')

# Step 4: Split the dataset into training and testing sets (fixed seed for reproducibility)
train_data, test_data = final_data.randomSplit([0.8, 0.2], seed=42)

# Step 5: Train a Logistic Regression model
lr = LogisticRegression(featuresCol='features', labelCol='label')
model = lr.fit(train_data)

# Step 6: Make predictions on the test data
predictions = model.transform(test_data)
predictions.select('features', 'label', 'prediction').show()

# Step 7: Evaluate the model's accuracy
evaluator = MulticlassClassificationEvaluator(
    labelCol='label', predictionCol='prediction', metricName='accuracy'
)
accuracy = evaluator.evaluate(predictions)
print(f"Test Accuracy: {accuracy}")

# Step 8: Save the trained model (overwrite if it already exists from a previous run)
model.write().overwrite().save("logistic_regression_model")

# Stop the Spark session
spark.stop()

Save the train_model.py file and make it executable:

chmod +x train_model.py

Run the script using spark-submit, the standard launcher for Spark applications whether they run locally or on a cluster:

spark-submit train_model.py
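
Once the run finishes, the saved model can be reloaded in a later script or pyspark session, for example:

from pyspark.ml.classification import LogisticRegressionModel

# Load the model saved by train_model.py; it can then score any DataFrame with a 'features' column
loaded_model = LogisticRegressionModel.load("logistic_regression_model")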

Step 7: Running Spark on Multiple Cores or a Cluster

If you want to scale up and take advantage of distributed processing, you can configure Spark to run on multiple cores or across a cluster.

Update Spark configuration: Open the Spark configuration file (/opt/spark/conf/spark-defaults.conf); if it doesn't exist yet, create it from the bundled template with cp /opt/spark/conf/spark-defaults.conf.template /opt/spark/conf/spark-defaults.conf. Then define the number of cores and the memory allocated to each executor. Example:

spark.executor.cores 4
spark.executor.memory 8g
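
These settings can also be passed per job at submission time instead of globally; for example, to use four local cores and 4 GB of driver memory (adjust the numbers to your VPS):

spark-submit --master local[4] --driver-memory 4g train_model.py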

Start Spark on multiple nodes (optional): If you're working with a Spark standalone cluster, start the master process:

$SPARK_HOME/sbin/start-master.sh

Then start a worker on each node, pointing it at the master (the script was renamed from start-slave.sh to start-worker.sh in Spark 3.x):

$SPARK_HOME/sbin/start-worker.sh spark://<master-ip>:7077
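
With the master and a worker running, you can submit the same training script to the cluster instead of running it locally:

spark-submit --master spark://<master-ip>:7077 train_model.py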

Conclusion

In this tutorial, we covered the complete setup of Apache Spark on a VPS and demonstrated how to use PySpark to train a machine learning model. With this foundation, you can now explore more advanced machine learning techniques, handle large datasets, and take advantage of Spark's distributed computing capabilities to process data at scale.