Solar Radiation
Introduction
Being able to reasonably predict solar radiation data is vital for the success of any solar
project. Photovoltaic (PV) and concentrating solar power (CSP) thermal systems may have slightly
differing requirements, but they need accurate solar radiation information for the same reasons.
To address this challenge, we applied SpeedWise® Machine Learning (a commercial AutoML
solution) to build a predictive model for solar radiation. Our goal was to quickly propose a
model that predicts solar radiation from a set of important parameters such as time,
temperature, pressure and wind. Models like this help in understanding energy potential,
planning energy consumption around the expected output of solar technologies, and selecting
the right equipment for a given project.
Problem Description, Solution and Machine Learning Results
Problem Description
The dataset contains meteorological data from the HI-SEAS Habitat in Hawaii. We want to
accurately model solar irradiance from the other meteorological parameters in the dataset
using SpeedWise Machine Learning (SML). SML is a powerful AutoML solution that automatically
deals with the idiosyncrasies of real data (the dataset considered here is large, incomplete
and noisy) while extracting patterns, trends and correlations from that data.
What were we looking to accomplish with machine learning?
For this study we solve a regression problem using machine learning, since the target is a
real, continuous variable. Many different models are available in SML for such a problem. The
idea is to build a model that predicts a continuous variable (solar radiation) for a given set
of inputs (time, temperature, wind, etc.) based on available data. Therefore, we select
radiation as the target in our training data. Once an optimum machine learning model is
identified, we look at some quick evaluation metrics for regression problems so that we can
determine the level of confidence in our predictions.
Unwrapping the Data
The dataset for this study includes a total of 32,686 entries and 11 features or attributes
associated with each entry. This means the dataset is a table with more than 359,000 cells.
Data preprocessing is an essential step in building a robust and meaningful machine learning
model. SML allows users to cleanse their data efficiently, systematically and quickly. The
“Autopilot Mode” feature in SML cleanses the data automatically in one click. Behind the
scenes, SML was able to recognize which data cleansing steps were needed to train a robust
machine learning model. For the more data-savvy user, there are over 18 data cleansing steps
you could manually apply, and recommendations are available if needed. For this dataset, our
first step is to convert the time and date parameters into a more useful format. To get a
better understanding of the data, several variables were visualized using the Visualization
function in SML.
Figure 1: Visualization of the data in the processing step.
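The date and time conversion mentioned above can be illustrated with a minimal sketch. It is not
SML's Datetime Parsing feature, only a plain-Python analogue. The input format ("month/day/year"
plus a 24-hour time string), the function name extract_time_features and the choice of derived
features are all assumptions made for illustration.

```python
from datetime import datetime

def extract_time_features(date_str: str, time_str: str) -> dict:
    """Turn separate date and time strings into numeric model inputs.

    Assumed (hypothetical) input format: '9/29/2016' and '12:30:00'.
    """
    stamp = datetime.strptime(f"{date_str} {time_str}", "%m/%d/%Y %H:%M:%S")
    return {
        "month": stamp.month,                        # captures the seasonal trend
        "day_of_year": stamp.timetuple().tm_yday,    # 1..366
        # Fractional hour, so 12:30 becomes 12.5 (captures the daily cycle)
        "hour_of_day": stamp.hour + stamp.minute / 60 + stamp.second / 3600,
    }

features = extract_time_features("9/29/2016", "12:30:00")
print(features)
```

Numeric features such as the fractional hour are what allow a regression model to pick up the
midday peak, which a raw date string cannot express.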
The plots in Figure 1 show that temperature has a strong correlation with solar radiation. The
relationship between pressure and solar radiation is less clear. As expected, solar radiation
peaks at approximately 12:00. Additionally, monthly means of both solar radiation and
temperature appear to decrease as winter approaches. To further help visualize any relationships
between the variables, a Pearson correlation heatmap was plotted.
Figure 2: Pearson correlation of the data.
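For readers who want to see what sits behind such a heatmap, the sketch below computes Pearson
correlation coefficients by hand. The column names and the handful of values are made up for
illustration and are not taken from the HI-SEAS dataset; pearson and correlation_matrix are
hypothetical helper names.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Illustrative values only (not real measurements).
toy = {
    "radiation":   [10, 120, 480, 900, 650, 200],
    "temperature": [48, 51, 57, 63, 61, 54],
    "humidity":    [95, 90, 70, 55, 60, 85],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Each cell of the heatmap corresponds to one entry of this matrix: values near +1 or -1 indicate
a strong linear relationship, and values near 0 indicate little linear relationship.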
Results
SpeedWise Machine Learning, our AutoML solution of choice, greatly facilitates identifying a
series of good machine learning models for the problem at hand. In this case, the technology
found (with little effort) a model (XGBoost) that explains 94% of the variation in solar
radiation. Some key metrics from the actual model performance are shown in Figure 3. In
addition, a plot of the Ground Truth vs. Prediction is shown to evaluate the accuracy of the
model. Most of the points lie along the 45° line, implying that the model was able to
accurately predict the solar radiation. Finally, a feature contribution plot was generated to
identify the heavy hitters for this model. The variables most relevant to the prediction were
found to be time of day and temperature.
Figure 3: Results of the machine learning model.
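A statement such as "explains 94% of the variation" is usually read as a coefficient of
determination (R²) of 0.94. Assuming that is the metric meant here, the sketch below shows how R²
and the root-mean-square error are computed from ground-truth and predicted values. The numbers
are illustrative only and are not the model's actual outputs.

```python
import math

def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions account for."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # unexplained
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Illustrative values only (not real measurements or model outputs).
truth = [5.0, 150.0, 600.0, 950.0, 400.0]
predicted = [20.0, 130.0, 640.0, 900.0, 430.0]
print("R^2 :", round(r_squared(truth, predicted), 3))
print("RMSE:", round(rmse(truth, predicted), 1))
```

Points lying exactly on the 45° line of a Ground Truth vs. Prediction plot correspond to zero
residuals, which is the case where R² equals 1.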
To further explore the relationship between the model features and the solar radiation, SML
generates partial dependency plots. A partial dependence plot can show whether the relationship
between the target and a feature is linear, monotonic or more complex. The model shows a
positive correlation between radiation and temperature. On the other hand, a negative
correlation is observed between humidity and radiation. As we have seen in the feature
engineering analysis, the radiation peak is in the middle of the day. This analysis provides
insight into the effect that individual features have on the predicted outcome of a machine
learning model, and that insight can help guide economic decisions. For example, it is clear
from this analysis that the radiation is a function of the day length. Based on the plot, the
radiation peak’s plateau appears to lie between 10:00 and 14:30. The peak plateau can be
optimized by installing a solar tracker, which orients a solar panel toward the sun to maximize
the amount of energy produced from a fixed amount of installed power generating capacity.
Figure 4: Partial dependency plots of the model’s features.
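The idea behind a partial dependence curve can be sketched in a few lines: fix one feature at a
grid value for every row, average the model's predictions, and repeat across the grid. The
function toy_model below is a hypothetical stand-in for a trained model, and the feature names
and values are assumptions made for illustration.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)        # copy so the original row is untouched
            modified[feature] = value   # override only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

def toy_model(row):
    """Hypothetical stand-in for a trained model (not the XGBoost model)."""
    return 10.0 * row["temperature"] - 2.0 * row["humidity"]

# Illustrative rows only (not real measurements).
rows = [
    {"temperature": 50, "humidity": 90},
    {"temperature": 60, "humidity": 60},
    {"temperature": 55, "humidity": 75},
]
temperature_curve = partial_dependence(toy_model, rows, "temperature", [45, 55, 65])
print(temperature_curve)
```

A rising curve, as for temperature in this toy case, corresponds to the positive relationship
described in the text; a falling curve would correspond to the negative one seen for humidity.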
When generating a forecast, uncertainty quantification is useful for understanding the risk
associated with the predictions. Since our data is subject to sampling bias, which carries into
the trained model, our goal is to provide a distribution describing how sensitive each
prediction is to that bias. In Figure 5, the predictions are sorted from low to high (blue
line). The grey band describes the uncertainty: it has an 80% chance of containing the true
value (green stars).
Figure 5: Uncertainty quantification plot.
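The text does not say how SML computes this band. One common way to probe sensitivity to the
sampling of the training data is bootstrap resampling, sketched below under that assumption:
refit a model on many resampled copies of the training set and take the 10th and 90th
percentiles of the resulting predictions. A simple least-squares line stands in for the real
model, and the data values are made up for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def bootstrap_band(xs, ys, x_new, coverage=0.80, n_rounds=500, seed=0):
    """Central `coverage` range of predictions at x_new across bootstrap refits."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_rounds):
        picks = [rng.randrange(n) for _ in range(n)]   # sample rows with replacement
        bx = [xs[i] for i in picks]
        by = [ys[i] for i in picks]
        if len(set(bx)) < 2:
            continue                                   # a line needs two distinct x values
        slope, intercept = fit_line(bx, by)
        predictions.append(slope * x_new + intercept)
    predictions.sort()
    last = len(predictions) - 1
    lower = predictions[int((1 - coverage) / 2 * last)]
    upper = predictions[int((1 + coverage) / 2 * last)]
    return lower, upper

# Illustrative values only (not real measurements).
temperature = [48, 50, 52, 55, 57, 60, 62, 65, 67, 70]
radiation = [20, 60, 150, 310, 380, 560, 600, 810, 870, 1010]
low, high = bootstrap_band(temperature, radiation, x_new=58)
print(f"80% bootstrap band at 58 degrees: {low:.0f} to {high:.0f}")
```

A band built this way reflects how much the fitted prediction moves when the training sample
changes. Whether it matches the band in Figure 5 depends on SML's actual method, which is not
described here.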
Key Insights
• We built an ML model that explains 94% of the variation in solar radiation, using a dataset
that includes observations such as temperature, pressure, humidity and time.
• These results show that accurate predictive models can be very easily obtained, even for
relatively large datasets, by applying smart built-in automated capabilities of SML.
• The variables most relevant to the prediction of solar radiation were found to be temperature
and time of day.
• An uncertainty quantification analysis was executed to understand the risk associated with the
predictions.
How AutoML Workflow Solved this Problem
For this study we went through five basic steps in SML, which are very standardized in this
technology:
1. Upload Data: This is achieved with a simple browser-based “drag and drop” exercise of the
original dataset file (a .csv file in our case). The user must then define the output variable
to predict and indicate whether this is a classification or regression problem (in our case, a
regression problem, since the target is continuous).
2. Clean and Visualize Data: The original data required attention before it could be used
effectively for building a machine learning model. For instance, we had to change the date
format so that it could be used in a regression model. We solved this using the Datetime
Parsing feature, which extracts information from a date-time string feature. In addition, data
visualization helped evaluate the nature of the data being used in our problem, and some
specialized actions could certainly be taken based on that visualization exercise.
Nevertheless, SML offers an autopilot option to automatically deal with most of these data
processing issues and to generate a well-conditioned dataset, which is very handy for people
who lack a data science background. This process also includes appropriately splitting the
data into training, validation and test sets, which SML facilitates in a smart way.
3. Machine Learning Model Building/Optimization: SML allows the user to choose from a variety of
machine learning models, or they can try all of them if desired. For each model, a
hyperparameter optimization process is also necessary to identify the best possible machine
learning configuration. This technology leverages cloud computing to carry out this model
building and optimization process in a very efficient manner.
4. Machine Learning Model Evaluation and Uncertainty Quantification: Once the best possible
model is identified, a series of quantitative metrics and plots are used to properly evaluate
the model (we showed some of those above).
5. Machine Learning Model Deployment: While deployment was not the main objective of this study,
it is also possible within SML to generate an API (in Python, MATLAB and/or JavaScript). This
would facilitate deployment of the model and allow predictions to run automatically for new
data as it becomes available.
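The data splitting in step 2 and the hyperparameter optimization in step 3 can be sketched
together in plain Python. This is not SML's implementation: the 70/15/15 split, the exhaustive
grid search and the k-nearest-neighbour regressor (swapped in here for XGBoost to keep the
example dependency-free) are all assumptions, and the data is synthetic.

```python
import math
import random
from itertools import product

def split_rows(rows, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle rows and cut them into train / validation / test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and keep the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, math.inf
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def knn_predict(train, x, k):
    """Average the targets of the k training rows nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def rmse(train, rows, k):
    """Root-mean-square error of the k-NN stand-in model on `rows`."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in rows]
    return math.sqrt(sum(errors) / len(errors))

# Synthetic (hour, radiation-like) rows: a daytime bump plus noise.
rng = random.Random(0)
rows = [(q / 4, max(0.0, math.sin(math.pi * (q / 4 - 6) / 12)) + rng.gauss(0, 0.05))
        for q in range(96)]

train, val, test = split_rows(rows)
# Hyperparameters are chosen on the validation set only.
best_params, val_rmse = grid_search({"k": [1, 3, 5, 9]},
                                    lambda p: rmse(train, val, p["k"]))
# The test set is touched once, after the choice is made.
test_rmse = rmse(train, test, best_params["k"])
print(best_params, round(val_rmse, 3), round(test_rmse, 3))
```

Keeping the test set out of the search is what makes the final error an honest estimate of how
the chosen configuration behaves on unseen data.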