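Next, we measure the accuracy of the model on the test DataFrame, the 20% random split of the original DataFrame that was not used for training.
In the following code, we call transform on the pipeline model. This passes the test DataFrame through the feature extraction stage according to the pipeline steps, makes estimates with the random forest model chosen by model tuning, and returns the predictions in a new DataFrame column.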
// var rather than val, because predictions is reassigned later when the error column is added
var predictions = pipelineModel.transform(testData)
predictions.select("prediction", "medhvalue").show(5)
result:
+------------------+---------+
|        prediction|medhvalue|
+------------------+---------+
|104349.59677450571|  94600.0|
| 77530.43231856065|  85800.0|
|111369.71756877871|  90100.0|
| 97351.87386020401|  82800.0|
+------------------+---------+
With the predictions and labels from the test data, we can now evaluate the model. To evaluate a regression model, you measure how close the predicted values are to the label values. The error in a prediction, shown by the green lines below, is the difference between the prediction (the regression line Y value) and the actual Y value, or label (Error = prediction - label).
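The Mean Absolute Error (MAE) is the mean of the absolute differences between the labels and the model's predictions. Taking the absolute value removes any negative signs.
MAE = sum(absolute(prediction - label)) / number of observations.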
The Mean Square Error (MSE) is the sum of the squared errors divided by the number of observations. Squaring removes any negative signs and also gives more weight to larger differences.
MSE = sum(squared(prediction - label)) / number of observations.
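The Root Mean Squared Error (RMSE) is the square root of the MSE. RMSE can be read as the standard deviation of the prediction errors (the two coincide when the mean error is zero). The error measures how far the label data points are from the regression line, and RMSE measures how spread out those errors are.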
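To make the three definitions concrete, here is a minimal sketch in plain Scala (no Spark required). The object name ToyMetrics, its helper methods, and the three prediction/label pairs are made up for illustration and are not part of the housing example.

```scala
object ToyMetrics {
  // Per-observation error: prediction minus label.
  def errors(predictions: Seq[Double], labels: Seq[Double]): Seq[Double] = {
    require(predictions.nonEmpty && predictions.length == labels.length,
      "predictions and labels must be non-empty and the same length")
    predictions.zip(labels).map { case (p, l) => p - l }
  }

  // MAE: mean of the absolute errors.
  def mae(predictions: Seq[Double], labels: Seq[Double]): Double = {
    val e = errors(predictions, labels)
    e.map(x => math.abs(x)).sum / e.length
  }

  // MSE: mean of the squared errors.
  def mse(predictions: Seq[Double], labels: Seq[Double]): Double = {
    val e = errors(predictions, labels)
    e.map(x => x * x).sum / e.length
  }

  // RMSE: square root of the MSE.
  def rmse(predictions: Seq[Double], labels: Seq[Double]): Double =
    math.sqrt(mse(predictions, labels))

  // Made-up values (hypothetical), not taken from the housing data.
  val toyPredictions: Seq[Double] = Seq(3.0, 5.0, 2.0)
  val toyLabels: Seq[Double] = Seq(2.0, 7.0, 2.0)

  def main(args: Array[String]): Unit = {
    // The errors are 1, -2 and 0, so MAE = 3 / 3 and MSE = 5 / 3.
    println(s"errors = ${errors(toyPredictions, toyLabels)}")
    println(s"MAE    = ${mae(toyPredictions, toyLabels)}")
    println(s"MSE    = ${mse(toyPredictions, toyLabels)}")
    println(s"RMSE   = ${rmse(toyPredictions, toyLabels)}")
  }
}
```

In this toy data the single error of -2 contributes 4 of the 5 units of squared error, which is the extra weight on larger differences that the MSE definition describes.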
The following code example uses the DataFrame withColumn transformation to add a column for the error in prediction: error = prediction - medhvalue. We then display summary statistics for the prediction, the median house value, and the error (in dollars).
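import org.apache.spark.sql.functions.col

// add an "error" column: prediction minus the label (medhvalue)
predictions = predictions.withColumn("error",
  col("prediction") - col("medhvalue"))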
predictions.select("prediction", "medhvalue", "error").show
result:
+------------------+---------+-------------------+
|        prediction|medhvalue|              error|
+------------------+---------+-------------------+
| 104349.5967745057|  94600.0|  9749.596774505713|
|  77530.4323185606|  85800.0| -8269.567681439352|
| 101253.3225967887| 103600.0| -2346.677403211302|
+------------------+---------+-------------------+
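As a quick sanity check on the error column, the subtraction can be redone by hand for the three rows shown above. This is a plain Scala sketch; the object name ErrorColumnCheck is hypothetical, and the numbers are copied from the displayed output, so the last digits may differ slightly from what Spark holds internally.

```scala
object ErrorColumnCheck {
  // (prediction, medhvalue, error) copied from the three rows displayed above.
  val rows: Seq[(Double, Double, Double)] = Seq(
    (104349.5967745057, 94600.0, 9749.596774505713),
    (77530.4323185606, 85800.0, -8269.567681439352),
    (101253.3225967887, 103600.0, -2346.677403211302)
  )

  // Recompute error = prediction - medhvalue for each row.
  val recomputed: Seq[Double] = rows.map { case (p, label, _) => p - label }

  // Largest absolute gap between the displayed error and the recomputed one.
  val maxGap: Double = rows.zip(recomputed)
    .map { case ((_, _, shown), e) => math.abs(shown - e) }
    .max

  def main(args: Array[String]): Unit = {
    recomputed.foreach(println)
    println(s"largest gap vs. displayed error: $maxGap")
  }
}
```

A positive error means the model predicted above the actual median house value, and a negative error means it predicted below.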
predictions.describe("prediction", "medhvalue", "error").show
result:
+-------+-----------------+------------------+------------------+
|summary|       prediction|         medhvalue|             error|
+-------+-----------------+------------------+------------------+
|  count|             4161|              4161|              4161|
|   mean|206307.4865123929|205547.72650805095| 759.7600043416329|
| stddev|97133.45817381598|114708.03790345002| 52725.56329678355|
|    min|56471.09903814694|           26900.0|-339450.5381565819|
|    max|499238.1371374392|          500001.0|293793.71945819416|
+-------+-----------------+------------------+------------------+
The following code example uses the Spark RegressionEvaluator to calculate the MAE on the predictions DataFrame, which returns 36636.35 (in dollars).
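import org.apache.spark.ml.evaluation.RegressionEvaluator

// the prediction column defaults to "prediction", so only the label column is set
val maevaluator = new RegressionEvaluator()
  .setLabelCol("medhvalue")
  .setMetricName("mae")
val mae = maevaluator.evaluate(predictions)
result:
mae: Double = 36636.35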
The following code example uses the Spark RegressionEvaluator to calculate the RMSE on the predictions DataFrame, which returns 52724.70.
// same evaluator setup as for MAE, with the metric switched to RMSE
val evaluator = new RegressionEvaluator()
  .setLabelCol("medhvalue")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
result:
rmse: Double = 52724.70
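The RMSE can also be cross-checked against the summary statistics of the error column shown earlier, since the mean of the squared errors equals the squared mean error plus the population variance of the errors. The sketch below is plain Scala with the count, mean, and stddev copied from the describe output. It assumes that the stddev reported there is the sample standard deviation (n - 1 denominator), and the object name RmseFromSummary is hypothetical.

```scala
object RmseFromSummary {
  // Values copied from the describe output for the "error" column.
  val n: Double = 4161.0
  val meanError: Double = 759.7600043416329
  // Assumption: this is the sample standard deviation (n - 1 denominator).
  val stddevError: Double = 52725.56329678355

  // Convert the sample variance to the population variance (n denominator).
  val populationVariance: Double = stddevError * stddevError * (n - 1) / n

  // mean(error^2) = mean(error)^2 + population variance, and RMSE is its square root.
  val rmse: Double = math.sqrt(meanError * meanError + populationVariance)

  def main(args: Array[String]): Unit =
    println(f"RMSE from summary statistics: $rmse%.2f")
}
```

Under that assumption the result agrees with the evaluator's 52724.70 to two decimal places. Because the mean error (about 760) is small compared with the spread of the errors, the RMSE and the standard deviation of the error column are nearly the same number here.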