Before Skynet, two other social networks called Facebook and LinkedIn were used to amuse the crowds. One could argue that the latter was looking more and more like the former, which was maybe one of the reasons for their collapse in favor of Skynet. On these social networks, it was common to share memes (especially funny cat ones) and small brain teasers like this one.
What's after 55 in our sequence (5, 15, 25, 35, 45, 55)? Almost instantaneously, you're going to tell me 65, right? But how did you get to 65? Did you just add 10 to the last number?
And if my numbers were 5, 13, 27, 39, 41 and 55, would you still tell me 65? You can always add 10 to 55, but where does that 10 come from?
This mathematical concept is called linear regression. It is a statistical technique that relies on linear algebra, and it is one of the first exercises you do when you tackle machine learning or deep learning. Spark's machine learning library, Spark ML, includes an implementation of it.
To better understand this principle, take a piece of graph paper and start plotting. You will get something like the following diagram:
On the abscissa (x axis), we mark the position in the list, while on the ordinate (y axis), we mark the value.
We clearly see that the seventh element of the set is 65. We can also imagine that the eighth would be 75.
As with any new concept, you need to acquire some new vocabulary: the elements on our x axis (1, 2, 3…8) are called features, while the values (5, 15…) are called labels.
When using my second set, I get the following graph:
The regression line appears more clearly. We understand (and visualize) that the line is trying to get as close as possible to all the points. We can express this line as the following equation:
y = B1 x + B0
B0, the intercept, and B1, the regression parameter (the slope of the line), can be (easily) calculated by hand if you like linear algebra, or you can let a tool do it. One of these tools is Apache Spark, as we are about to see. So, for our first data set, the equation is:
y = 10 x - 5
In our second example, the equation is roughly y = 9.8857 x - 4.6. There is indeed a difference: when x is 7, we get 65 from our first equation, but 64.6 from our second. Close enough?
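If you want to check these numbers before bringing in Spark, here is a minimal sketch in plain Python (not Spark) of the ordinary least squares calculation. It assumes the first sequence is 5, 15, 25, 35, 45, 55 and that the features are simply the positions 1 to 6. The names fit_line and predict are made up for this illustration.

```python
# Ordinary least squares for y = B1 * x + B0, standard library only.
# fit_line and predict are illustrative names, not part of any library.

def fit_line(labels):
    """Fit a line through the points (1, labels[0]), (2, labels[1]), ..."""
    n = len(labels)
    features = list(range(1, n + 1))  # positions in the list: 1, 2, 3, ...
    mean_x = sum(features) / n
    mean_y = sum(labels) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, labels))
    sxx = sum((x - mean_x) ** 2 for x in features)
    b1 = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    b0 = mean_y - b1 * mean_x
    return b1, b0


def predict(b1, b0, x):
    """Value of the fitted line at position x."""
    return b1 * x + b0


first = [5, 15, 25, 35, 45, 55]    # assumed first sequence
second = [5, 13, 27, 39, 41, 55]   # second sequence from the text

for name, labels in (("first", first), ("second", second)):
    b1, b0 = fit_line(labels)
    print(f"{name}: B1 = {b1:.4f}, B0 = {b0:.4f}, "
          f"prediction for x = 7: {predict(b1, b0, 7):.1f}")
```

Running it should give you back the slopes, intercepts and seventh values quoted above, up to rounding.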
So far, so good? Okay. I loved math during my high school and college years, but I must admit that it is not my biggest passion anymore. Let’s code!