You Are Not a Machine, So Learn Machine Learning


We will use Java and Apache Spark ML v2.1.0. As you know, Java is still used at the core of the T-800's programming, and it was one of the most widely used development languages in enterprise applications back in 2017. Spark is a wonderful analytics package. ML is the Machine Learning library—yes, they lacked inspiration the day they had to find a name.

There are two ways to get the code: get it on GitHub or just enjoy typing a few lines of code, like in the good ol’ days of thousands of DATA lines from Compute! magazine—thinking of it, that’s even older than the 21st century.

If you picked the lazy and fast way, simply go to GitHub at http://bit.ly/SparkJavaCookbookCode (and really, while you are at it, don't be shy, make it a favorite or fork the project). There are a couple of dependencies, nothing to worry about.

There are quite a few imports, even for a small example. I left them here as I do not want you to be confused by similar names in different packages (Vector is one example). A few packages differ from the previous major version, Spark v1.6.2. You will see all the dependencies in Maven's pom.xml.
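If you prefer to set the project up by hand instead, the dependency section of the pom.xml should look roughly like the sketch below. I am assuming the Scala 2.11 builds of Spark here; check the pom.xml in the GitHub repository for the exact coordinates the project uses.

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>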

package net.jgp.labs.spark;

import static org.apache.spark.sql.functions.callUDF;

import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.ml.regression.LinearRegressionTrainingSummary;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import net.jgp.labs.spark.udf.VectorBuilder;

Then we have a basic main() that will instantiate the class and start() it.

public class SimplePredictionFromTextFile {

    public static void main(String[] args) {
        System.out.println("Working directory = " + System.getProperty("user.dir"));
        SimplePredictionFromTextFile app = new SimplePredictionFromTextFile();
        app.start();
    }

Since v2.0.0, Spark enforces the use of a Spark session. In prior versions, it was a little confusing because you might have needed several session and configuration objects.

    private void start() {
        SparkSession spark = SparkSession.builder()
                .appName("Simple prediction from Text File")
                .master("local").getOrCreate();

We need a UDF (User Defined Function) that transforms our input into a format that can be used by Spark.

        spark.udf().register("vectorBuilder", new VectorBuilder(), new VectorUDT());

Our data is in tuple-data-file.csv. Our first set is in tuple-data-file-set1.csv and the second is in guess-what-file.csv.

No, it is in tuple-data-file-set2.csv, but I wanted to check if you were following.

        String filename = "data/tuple-data-file.csv";

In this situation, we need to force the structure of our dataset as Spark needs some guidance on the metadata of our data.

        StructType schema = new StructType(
                new StructField[] {
                        new StructField("_c0", DataTypes.DoubleType,
                                false, Metadata.empty()),
                        new StructField("_c1", DataTypes.DoubleType,
                                false, Metadata.empty()),
                        new StructField("features", new VectorUDT(),
                                true, Metadata.empty()), });

In a Java context, one of the major changes of Spark v2.0.0 is that our beloved dataframe is now implemented as a Dataset<Row>. I named the dataframe variable df, not for nostalgic reasons, but to align with Spark standards: the preferred data container is a dataframe.

        Dataset<Row> df = spark.read().format("csv")
                .schema(schema).option("header", "false")
                .load(filename);

        df = df.withColumn("valuefeatures", df.col("_c0")).drop("_c0");
        df = df.withColumn("label", df.col("_c1")).drop("_c1");
        df = df.withColumn("features",
                callUDF("vectorBuilder", df.col("valuefeatures")));

As you can see, we transformed our dataframe to create a label and features. More precisely, each label is now associated with a vector of features. Spark ML requires those columns to be named label and features.

        df.printSchema();
        df.show();

We are now ready to build our linear regression. We will limit it to 20 iterations; this criterion depends on the complexity of the data—yeah, our dataset is ridiculously small. I encourage you to play with the number of iterations to see the behavior.

        LinearRegression lr = new LinearRegression().setMaxIter(20);

We then fit our linear regression to our dataframe. This way, we get the trained model back.

        LinearRegressionModel model = lr.fit(df);

        // Given a dataset, predict each point's label, and show the results.
        model.transform(df).show();

And now, we can throw in a 7th element, whose feature value is 7. We create a vector from it and ask the model for a prediction:

        Double feature = 7.0;
        Vector features = Vectors.dense(feature);
        double prediction = model.predict(features);

        System.out.println("Prediction for feature " + feature + " is " + prediction);

    }
}

Before we run our app, let’s quickly look at our UDF.

package net.jgp.labs.spark.udf;

import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.api.java.UDF1;

public class VectorBuilder implements UDF1<Double, Vector> {
    private static final long serialVersionUID = -2991355883253063841L;

    @Override
    public Vector call(Double t1) throws Exception {
        return Vectors.dense(t1);
    }
}

Without going into much detail, you can see that we are building a Vector from a Double using the Vectors.dense() method. Typical plumbing code.

In the code on GitHub, you’ll have a lot more statistical information displayed.
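To give you an idea, most of that extra output comes from the training summary exposed by the model. A minimal sketch of what it could look like (not the exact code from the repository), using the LinearRegressionTrainingSummary we imported earlier:

        // Training summary: iteration count, error metrics, and residuals.
        LinearRegressionTrainingSummary trainingSummary = model.summary();
        System.out.println("Iterations: " + trainingSummary.totalIterations());
        System.out.println("RMSE: " + trainingSummary.rootMeanSquaredError());
        System.out.println("r2: " + trainingSummary.r2());
        trainingSummary.residuals().show();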

After executing, we should get the following information on the first dataset:

And the following information on the second dataset:

As you see, it matches our previous values.

This is a very basic example with a couple of very limited datasets. Now that you have worked through it, you can proudly check the box for acquiring “knowledge of machine learning.”

The Future is Now

I encourage you to read and discover more about this fascinating world. One good resource is Fundamentals of Deep Learning by Nikhil Buduma. (A preview of the first three chapters is available here.)

In conclusion, even if we are not in the game space of Dungeons and Dragons, or the futuristic environment of Star Trek, and even less in the science fiction universe of Terminator, the world of data and its automated treatment (informatics, in short) is going through a revolution from within. Moore's law took care of us until now, but it has reached its limits with the explosion of data we are living with.

The entire profession is going through a revolution: a few years ago, we had DBAs (database administrators); now we have data stewards, data analysts, data engineers, and data scientists. But you, my friend who patiently read this article, must understand that you need to acquire these new skills.

The availability of open source tools and of new collaborative models for software engineering—which strangely remind me of the utopian world of Star Trek—allows us to access new technologies such as Apache Spark.

