In this tutorial, we will train machine learning models with Apache Spark on an Ubuntu 24.04 VPS.
We will cover how to install and configure Apache Spark on a VPS, and then use it to train machine learning models. Apache Spark is a powerful, open-source distributed computing system designed for big data processing. It provides an easy-to-use API for machine learning tasks and can process large-scale data sets efficiently by distributing workloads across multiple nodes.
Prerequisites
- A KVM VPS or dedicated server with Ubuntu 24.04.
- Root access, or a regular user with sudo privileges.
- A basic understanding of Linux command line and machine learning concepts.
- Java Development Kit (JDK) version 8 or later.
- Python 3.x installed (for using PySpark, Apache Spark's Python API).
Training ML Models with Apache Spark on VPS
Step 1: Install Java (JDK)
Apache Spark requires Java to run, so we need to install the Java Development Kit (JDK).
Update the package list:
sudo apt update
Install OpenJDK 17:
sudo apt install openjdk-17-jdk -y
Verify the installation by checking the version:
java -version
You should see output similar to:
openjdk version "17.0.12" 2024-07-16
OpenJDK Runtime Environment (build 17.0.12+7-Ubuntu-1ubuntu224.04)
OpenJDK 64-Bit Server VM (build 17.0.12+7-Ubuntu-1ubuntu224.04, mixed mode, sharing)
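Spark will pick up the java binary from your PATH, but you can optionally point JAVA_HOME at this JDK as well. As a minimal sketch, first find where the installed JDK actually lives:
readlink -f "$(which java)"
Then, if you want JAVA_HOME set explicitly, add the corresponding line to ~/.bashrc (the path below assumes the default openjdk-17-jdk location on an amd64 system):
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64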
Step 2: Install Apache Spark
Download Spark:
Navigate to the Apache Spark Downloads page and copy the link to the latest stable release. Alternatively, you can use the following command to download Apache Spark directly:
wget https://dlcdn.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz
Extract the downloaded archive:
tar -xvzf spark-3.5.3-bin-hadoop3.tgz
Move Spark to /opt for easier access:
sudo mv spark-3.5.3-bin-hadoop3 /opt/spark
Set environment variables for Spark:
Open your .bashrc file:
nano ~/.bashrc
Add the following lines at the bottom of the file:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Save and exit the file (CTRL+O, Enter to confirm, then CTRL+X).
Apply the changes:
source ~/.bashrc
Exit the current shell session and log back in to your server via SSH so the new environment variables are picked up.
Verify the Spark installation: Run the Spark shell to ensure Spark is installed correctly:
spark-shell
If everything is set up correctly, you should see the Spark interactive shell prompt, which looks like:
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.5.3
/_/
Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 17.0.12)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Exit the shell: Type :quit to exit the Spark shell.
Step 3: Install Python and PySpark
PySpark is the Python API for Spark, and it's the easiest way to work with Spark if you're familiar with Python.
Install Python 3 and pip:
sudo apt install python3 python3-pip python3-venv -y
Set Up a Virtual Environment
It's recommended to create a Python virtual environment for the Spark project to avoid conflicts between packages.
mkdir ~/spark_ml_app
cd ~/spark_ml_app
python3 -m venv venv
source venv/bin/activate
Install PySpark: You can install PySpark using pip:
pip install pyspark
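Note that pip installs its own copy of the Spark runtime along with the Python bindings. To avoid a mismatch with the standalone build in /opt/spark, you may want to pin the same version as the archive downloaded earlier, for example:
pip install pyspark==3.5.3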
Verify PySpark installation: Start the PySpark shell to verify that PySpark is installed correctly:
pyspark
If PySpark is correctly installed, you should see the PySpark shell prompt, which looks like:
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.5.3
/_/
Using Python version 3.12.3 (main, Sep 11 2024 14:17:37)
Spark context Web UI available at http://65.20.66.35:4040
Spark context available as 'sc' (master = local[*], app id = local-1729130920340).
SparkSession available as 'spark'.
>>>
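Optionally, before exiting, you can run a quick computation to confirm the session works; for example, this illustrative one-liner should return 1000:
spark.range(1000).count()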
Exit the session with exit().
Step 4: Install Additional Python Libraries
To train machine learning models, we will need additional libraries like pandas and numpy. Install them using pip:
pip install pandas numpy
You might also need scikit-learn for preprocessing and model evaluation:
pip install scikit-learn
Finally, install setuptools, which PySpark needs at runtime on recent Python versions (Python 3.12 no longer ships the distutils module that setuptools provides):
pip install setuptools
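These libraries are mainly used on the driver side, for example to inspect a small Spark DataFrame locally. Below is a minimal, illustrative sketch of that workflow; it is safe only for data that fits in driver memory:
# pandas_interop.py - illustrative sketch: moving a small dataset between pandas and Spark
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PandasInterop').getOrCreate()

# Build a Spark DataFrame from a tiny pandas DataFrame
sdf = spark.createDataFrame(pd.DataFrame({'x': [1.0, 2.0, 3.0]}))

# Pull the (small) result back to pandas on the driver for local analysis
local_df = sdf.toPandas()
print(local_df.describe())

spark.stop()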
Step 5: Prepare a Dataset for Training
We will use a sample dataset to train a machine learning model. For demonstration purposes, we’ll use the popular Iris dataset.
Download the dataset:
wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data -O iris.csv
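The file has no header row; each line contains four numeric measurements followed by the species name as text. You can take a quick look to confirm the format:
head -n 3 iris.csv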
Upload the dataset to your server: If you're using your own dataset instead, you can copy it to the server with tools like scp or rsync.
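For example, from your local machine (the path, username, and IP below are placeholders, so replace them with your own):
scp /path/to/your_dataset.csv user@your-server-ip:~/spark_ml_app/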
Step 6: Train a Machine Learning Model Using PySpark
Now, let’s create a simple machine learning model to classify the Iris dataset.
Create a Python file: Create a Python script on your server. Let's call it train_model.py.
nano train_model.py
Copy and paste the following content:
# train_model.py
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Step 1: Initialize Spark session
spark = SparkSession.builder \
    .appName('IrisClassification') \
    .getOrCreate()

# Step 2: Load the dataset (the UCI file has no header row)
data = spark.read.csv('iris.csv', inferSchema=True, header=False)
data = data.toDF('sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species')
data = data.dropna()  # drop any blank or incomplete rows

# Step 3: Prepare features and labels
# The species column contains strings (e.g. 'Iris-setosa'), so index it into
# a numeric 'label' column that LogisticRegression can work with.
indexer = StringIndexer(inputCol='species', outputCol='label')
data = indexer.fit(data).transform(data)

assembler = VectorAssembler(
    inputCols=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
    outputCol='features'
)
output = assembler.transform(data)
final_data = output.select('features', 'label')

# Step 4: Split the dataset into training and testing sets
train_data, test_data = final_data.randomSplit([0.8, 0.2])

# Step 5: Train a Logistic Regression model
lr = LogisticRegression(featuresCol='features', labelCol='label')
model = lr.fit(train_data)

# Step 6: Make predictions on the test data
predictions = model.transform(test_data)
predictions.select('features', 'label', 'prediction').show()

# Step 7: Evaluate the model's accuracy
evaluator = MulticlassClassificationEvaluator(
    labelCol='label', predictionCol='prediction', metricName='accuracy'
)
accuracy = evaluator.evaluate(predictions)
print(f"Test Accuracy: {accuracy}")

# Step 8: Save the trained model
model.save("logistic_regression_model")

# Stop the Spark session
spark.stop()
Save the train_model.py file and make it executable:
chmod +x train_model.py
Run the script with spark-submit, the tool used to launch PySpark applications locally or on a cluster:
spark-submit train_model.py
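By default the application runs with a local master using all available cores. If you want to control resources explicitly when submitting, you can pass them on the command line; the values below are only illustrative:
spark-submit --master "local[4]" --driver-memory 2g train_model.py
The script also saves the fitted model to a directory named logistic_regression_model. As a minimal sketch (assuming that directory exists in your working directory), you can reload and inspect it later like this:
# load_model.py - reload the model saved by train_model.py (illustrative sketch)
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegressionModel

spark = SparkSession.builder.appName('IrisModelReuse').getOrCreate()

# Load the model from the directory created by model.save(...)
model = LogisticRegressionModel.load('logistic_regression_model')
print(model.coefficientMatrix)   # learned coefficients, one row per class

spark.stop()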
Step 7: Running Spark on Multiple Cores or a Cluster
If you want to scale up and take advantage of distributed processing, you can configure Spark to run on multiple cores or across a cluster.
Update Spark configuration: Open the Spark configuration file (/opt/spark/conf/spark-defaults.conf) and define how many cores and how much memory each executor may use. Example:
spark.executor.cores 4
spark.executor.memory 8g
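A fresh installation ships only a template for this file; if /opt/spark/conf/spark-defaults.conf does not exist yet, create it from the template first:
cp /opt/spark/conf/spark-defaults.conf.template /opt/spark/conf/spark-defaults.conf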
Start Spark on multiple nodes (optional): If you're working with a Spark cluster, start the master node:
$SPARK_HOME/sbin/start-master.sh
Then start worker nodes by pointing them to the master (in Spark 3.x the startup script is named start-worker.sh):
$SPARK_HOME/sbin/start-worker.sh spark://<master-ip>:7077
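Once the master and workers are running, you can submit the training job to the cluster instead of the local master, for example:
spark-submit --master spark://<master-ip>:7077 train_model.py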
Conclusion
In this tutorial, we covered the complete setup of Apache Spark on a VPS and demonstrated how to use PySpark to train a machine learning model. With this foundation, you can now explore more advanced machine learning techniques, handle large datasets, and take advantage of Spark's distributed computing capabilities to process data at scale.