Solar Radiation


Reliable solar radiation predictions are vital for the success of any solar project. Photovoltaic (PV) and concentrating solar power (CSP) thermal systems may have slightly differing requirements, but they need accurate solar radiation information for the same reasons. To address this challenge, we applied SpeedWise® Machine Learning (a commercial AutoML solution) to build a predictive model for solar radiation. Our goal was to quickly propose a model that predicts solar radiation from a set of important parameters such as time, temperature, pressure and wind. Models like this aid in understanding the energy potential of a site, in estimating energy consumption based on the expected output of solar technologies, and in selecting the right equipment for a given project.

Problem Description, Solution and Machine Learning Results

Problem Description

The dataset contains meteorological data from the HI-SEAS Habitat in Hawaii. Here we want to accurately model solar irradiance from the other meteorological parameters in the dataset using SpeedWise Machine Learning (SML). SML is an AutoML solution that can automatically deal with the idiosyncrasies of real data (the dataset considered here is large, incomplete and noisy) while automatically extracting patterns, trends and correlations from it.

What were we looking to accomplish with machine learning?

For this study we will be solving a regression problem using machine learning, since the target is a real-valued, continuous variable. Many different models are available in SML for such a problem. The idea is to build a model that predicts a continuous variable (solar radiation) for a given set of inputs (time, temperature, wind, etc.) based on available data. Therefore, we select radiation as the target in our training data. Once an optimum machine learning model is identified, we will look at some quick evaluation metrics for regression problems so that we can determine the level of confidence in our predictions.

Unwrapping the Data

The dataset for this study includes a total of 32,686 entries and 11 features or attributes associated with each entry. This means the dataset is a table with 32,686 × 11 = 359,546 cells. Data preprocessing is an essential step in building a robust and meaningful machine learning model. SML allows users to cleanse their data efficiently, systematically and quickly. The “Autopilot Mode” feature in SML cleanses the data automatically in one click. Behind the scenes, SML recognizes which data cleansing steps are needed to train a robust machine learning model. For the more data-savvy user, there are over 18 data cleansing steps that can be applied manually, and recommendations are available if needed. For this dataset, our first step was to convert the time and date parameters into a more useful format. To get a better understanding of the data, several variables were visualized using the Visualization function in SML.
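The date and time conversion described above can be sketched in plain Python. This is only an illustration and not SML's own parsing step: the function name `datetime_features`, the assumed input formats (`%m/%d/%Y` and `%H:%M:%S`) and the sample timestamp are all hypothetical.

```python
from datetime import datetime


def datetime_features(date_str, time_str):
    """Split a date string and a time string into numeric model inputs.

    The input formats are assumptions made for this example.
    """
    stamp = datetime.strptime(f"{date_str} {time_str}", "%m/%d/%Y %H:%M:%S")
    return {
        "month": stamp.month,
        "day_of_year": stamp.timetuple().tm_yday,
        "hour": stamp.hour,
        # Minutes since midnight give a finer-grained time-of-day feature.
        "minute_of_day": stamp.hour * 60 + stamp.minute,
    }


# Hypothetical timestamp, used only to exercise the function.
features = datetime_features("9/29/2016", "23:55:26")
print(features)
```

Numeric features such as the hour or the day of the year are easier for a regression model to use than a raw date string.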

Figure 1: Visualization of the data in the processing step.

The above plots show that temperature has a strong correlation with solar radiation, while the relationship between pressure and solar radiation is less clear. As expected, solar radiation peaks at approximately 12:00. Additionally, the monthly means of both solar radiation and temperature appear to decrease as winter approaches. To further help visualize any relationships between the variables, a Pearson correlation heatmap was plotted.

Figure 2: Pearson correlation of the data.
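For readers who want to see what sits behind a Pearson correlation heatmap, the coefficient for each pair of columns can be computed as in the sketch below. The column names and values are made up for illustration and are not taken from the HI-SEAS dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson coefficients for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values for illustration only.
toy = {
    "temperature": [48, 52, 57, 61, 59, 50],
    "humidity": [95, 80, 60, 45, 55, 90],
    "radiation": [5, 210, 640, 980, 720, 40],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Each cell of the heatmap corresponds to one entry of this matrix, with values close to +1 or -1 indicating a strong linear relationship.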


SpeedWise Machine Learning, our AutoML solution of choice, greatly facilitates identifying a series of good machine learning models for the problem at hand. In this case, the technology found (with little effort) a model (XGBoost) that explains 94% of the variation in solar radiation. Some key metrics from the actual model performance are shown in Figure 3. In addition, a plot of the Ground Truth vs. Prediction is shown to evaluate the accuracy of the model. Most of the points lie along the 45° line, implying that the model was able to accurately predict the solar radiation. Finally, a feature contribution plot was generated to identify the heavy hitters for this model. The variables most relevant to the prediction were found to be time of day and temperature.

Figure 3: Results of the machine learning model.
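As a rough illustration of the metrics behind such a results panel, the sketch below computes the coefficient of determination (R²) and the root mean squared error by hand. It assumes that the "explains 94% of the variation" figure refers to R²; the numbers used here are made up and are not the study's results.

```python
from math import sqrt


def r2_score(y_true, y_pred):
    """Fraction of the variance in y_true explained by y_pred."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))


# Made-up ground truth and predictions for illustration only.
ground_truth = [100, 200, 300, 400]
prediction = [110, 190, 310, 390]
score = r2_score(ground_truth, prediction)
error = rmse(ground_truth, prediction)
print(round(score, 3), round(error, 1))
```

Predictions that fall exactly on the 45° line of a Ground Truth vs. Prediction plot give an R² of 1, and the score drops as the points scatter away from that line.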

To further explore the relationship between the model features and solar radiation, SML generates partial dependence plots. A partial dependence plot can show whether the relationship between the target and a feature is linear, monotonic or more complex. The model shows a positive relationship between radiation and temperature and a negative relationship between radiation and humidity. As seen in the data visualization step, radiation peaks in the middle of the day. This analysis provides insight into the effect that individual features have on the predicted outcome of a machine learning model, and that insight can help guide economic decisions. For example, it is clear from this analysis that radiation is a function of day length. Based on the plot, the radiation peak’s plateau appears to lie between 10:00 and 14:30. The peak plateau can be optimized by installing a solar tracker, which orients a solar panel toward the sun to maximize the amount of energy produced from a fixed amount of installed power generating capacity.

Figure 4: Partial dependency plots of the model’s features.
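The idea behind a partial dependence curve can be sketched as follows: force one feature to each value on a grid for every row, and average the model's predictions. The `toy_model` function below is a hypothetical stand-in for a trained model, and the rows are made up; neither comes from SML or from the study.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original row is untouched
            modified[feature] = value  # override only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Hypothetical stand-in for a trained regression model."""
    daylight = max(0.0, 1.0 - abs(row["hour"] - 12) / 6)
    return 800 * daylight + 5 * row["temperature"] - 2 * row["humidity"]


# Made-up rows for illustration only.
rows = [
    {"hour": 9, "temperature": 55, "humidity": 70},
    {"hour": 12, "temperature": 62, "humidity": 50},
    {"hour": 16, "temperature": 58, "humidity": 60},
]
temp_curve = partial_dependence(toy_model, rows, "temperature", [50, 55, 60])
humidity_curve = partial_dependence(toy_model, rows, "humidity", [40, 60, 80])
hour_curve = partial_dependence(toy_model, rows, "hour", [6, 12, 18])
print(temp_curve, humidity_curve, hour_curve)
```

Plotting each curve against its grid gives one panel of a partial dependence figure: a rising curve for temperature, a falling one for humidity and a midday peak for the hour in this toy setup.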

When generating a forecast, uncertainty quantification is useful for understanding the risk associated with the predictions. Since our data is subject to bias, which carries into our trained model, our goal is to provide a distribution describing the sensitivity of each prediction to sampling bias in the training data. In Figure 5, the predictions are sorted from low to high (blue line). A grey band describes the uncertainty and has an 80% chance of containing the true value (green stars).

Figure 5: Uncertainty quantification plot.
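The article does not state how SML computes this band. One common way to probe sensitivity to the training sample is the bootstrap, sketched below under that assumption: refit a model on resampled data many times and take the 10th and 90th percentiles of its predictions as an 80% band. The straight-line model, the function names and the data values are all hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def bootstrap_interval(xs, ys, x_new, n_boot=500, coverage=0.80, seed=0):
    """Percentile band for the prediction at x_new across bootstrap refits."""
    rng = random.Random(seed)
    indices = range(len(xs))
    preds = []
    while len(preds) < n_boot:
        sample = [rng.choice(indices) for _ in indices]
        bx = [xs[i] for i in sample]
        if len(set(bx)) < 2:
            continue  # skip degenerate resamples with a single x value
        by = [ys[i] for i in sample]
        intercept, slope = fit_line(bx, by)
        preds.append(intercept + slope * x_new)
    preds.sort()
    tail = (1 - coverage) / 2
    lower = preds[int(tail * (n_boot - 1))]
    upper = preds[int((1 - tail) * (n_boot - 1))]
    return lower, upper


# Made-up temperature and radiation values for illustration only.
temperature = [48, 50, 52, 55, 57, 59, 61, 63]
radiation = [20, 60, 200, 420, 610, 700, 930, 1010]
lower, upper = bootstrap_interval(temperature, radiation, x_new=56)
print(round(lower, 1), round(upper, 1))
```

A wide band at a given input signals that the prediction there changes noticeably depending on which observations happen to be in the training data.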

Key Insights

• We built an ML model that explains 94% of the variation in solar radiation, using a dataset of observations such as temperature, pressure, humidity and time.
• These results show that accurate predictive models can be very easily obtained, even for relatively large datasets, by applying smart built-in automated capabilities of SML.
• The variables most relevant to the prediction of solar radiation were found to be temperature and time of day.
• An uncertainty quantification analysis was executed to understand the risk associated with the predictions.

How AutoML Workflow Solved this Problem

For this study we went through five basic steps in SML, which are standard in this technology:

1. Upload Data: This is achieved with a simple browser-based “drag and drop” exercise of the original dataset file (a .csv file in our case). A necessary step from the user was to define the output variable that we wanted to predict, and to indicate whether this was a classification or regression problem (in our case it was a regression problem, since the target is a continuous variable).

2. Clean and Visualize Data: The original data required attention before it could be effectively used for building a machine learning model. For instance, we had to change the date format so that it could be used in a regression model. We solved this problem using the Datetime Parsing feature, which extracts information from a date-time string feature. In addition, data visualization helped evaluate the nature of the data being used in our problem, and some specialized actions could certainly be taken based on that visualization exercise.

Nevertheless, SML offers an autopilot option to automatically deal with most of these data processing issues and to generate a well-conditioned dataset, which is very handy for users who lack a data science background. This process also includes appropriately splitting the data into training, validation and test sets, which SML facilitates in a smart way.

3. Machine Learning Model Building/Optimization: SML allows the user to choose from a variety of machine learning models, or they can try all of them if desired. For each model, a hyperparameter optimization process is also necessary to identify the best possible machine learning configuration. This technology leverages cloud computing to carry out this model building and optimization process in a very efficient manner.

4. Machine Learning Model Evaluation and Uncertainty Quantification: Once the best possible model is identified, a series of quantitative metrics and plots are used to properly evaluate the model (we showed some of those above).

5. Machine Learning Model Deployment: While deployment was not the main objective of this study, it is also possible within SML to generate an API (in Python, MATLAB and/or JavaScript). This would facilitate deployment of the model and allow predictions to be run automatically for new data as it becomes available.
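The data splitting mentioned in step 2 and the hyperparameter optimization in step 3 can be illustrated outside of SML with the minimal sketch below. It is not SML's implementation: the k-nearest-neighbours regressor is a simple stand-in for the models SML offers, and the split ratios, candidate values and synthetic data are assumptions made for the example.

```python
import random
from math import sqrt


def split_rows(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into training, validation and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def knn_predict(train_rows, x, k):
    """Average target of the k training rows closest to x."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def rmse(train_rows, eval_rows, k):
    """Root mean squared error of the k-NN model on eval_rows."""
    errors = [(knn_predict(train_rows, x, k) - y) ** 2 for x, y in eval_rows]
    return sqrt(sum(errors) / len(errors))


def grid_search(train_rows, val_rows, candidates):
    """Pick the hyperparameter value with the lowest validation error."""
    return min(candidates, key=lambda k: rmse(train_rows, val_rows, k))


# Synthetic (hour, radiation) pairs with a midday peak, for illustration only.
rows = [(h / 2, max(0.0, 1000 * (1 - ((h / 2 - 12) / 6) ** 2))) for h in range(48)]
train_rows, val_rows, test_rows = split_rows(rows)
best_k = grid_search(train_rows, val_rows, candidates=(1, 3, 5))
test_error = rmse(train_rows, test_rows, best_k)
print(best_k, round(test_error, 1))
```

The validation set is used only to choose the hyperparameter, and the test set is held back until the end so that the reported error reflects data the model has not seen during tuning.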