Solar Power Generation
Introduction
The use of solar PV panels as a renewable energy source has gained traction in recent
years, and many plants have been built worldwide. PV cells absorb the sun’s energy and
convert it to direct current (DC) electricity; a solar inverter then converts the DC
electricity from the solar modules to alternating current (AC) electricity, which is used
by most home appliances. Solar power is intermittent because the output of PV systems is
weather-dependent and cannot be controlled. The ability to predict solar PV energy output
is therefore essential for supply and demand planning in an electric grid and for deciding
how to operate the system. Machine learning (ML) algorithms have shown great potential in
generating such forecasts. Here we used SpeedWise® Machine Learning (a commercial AutoML
solution) to build a predictive model of daily power generation.
Problem Description, Solution and Machine Learning Results
Problem Description
This dataset was gathered from a solar power plant in India over a 34-day period.
Two datasets were collected: power generation data and sensor data. The power generation
data is gathered at the inverter level; each inverter has multiple lines of solar
panels attached to it. The sensor data is gathered at the plant level from a single array
of sensors optimally placed at the plant. Our goal here is to accurately model the power
generation of the plant based on the parameters measured by the sensors. For
that we used SpeedWise Machine Learning (SML), an AutoML solution that
can automatically deal with the idiosyncrasies of real data (the dataset considered here
is large, incomplete and noisy) while automatically extracting patterns,
trends and correlations from it.
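Because the generation data is recorded per inverter and the sensor data per plant, the two
tables have to be lined up by timestamp before sensor readings can serve as model inputs.
The sketch below shows one way to do that in plain Python. It is only an illustration: the
field names, inverter IDs and values are hypothetical, and the source does not describe how
SML performs this step.

```python
# Hypothetical rows; field names and values are illustrative, not the dataset's own.
generation = [
    {"time": "2020-05-15 10:00", "inverter_id": "INV-01", "dc_power": 910.0},
    {"time": "2020-05-15 10:00", "inverter_id": "INV-02", "dc_power": 890.0},
    {"time": "2020-05-15 10:15", "inverter_id": "INV-01", "dc_power": 950.0},
]
sensors = [
    {"time": "2020-05-15 10:00", "ambient_temp": 30.1, "irradiation": 0.71},
    {"time": "2020-05-15 10:15", "ambient_temp": 30.6, "irradiation": 0.75},
]

# Index the single plant-level sensor reading by timestamp.
sensor_by_time = {row["time"]: row for row in sensors}

# Attach that reading to every inverter row recorded at the same time.
# Inverter rows without a matching sensor reading are dropped here.
merged = [
    {**gen, **sensor_by_time[gen["time"]]}
    for gen in generation
    if gen["time"] in sensor_by_time
]

for row in merged:
    print(row)
```

Every inverter row ends up carrying the same plant-level temperature and irradiation values
for its timestamp, which is what allows a single model to relate sensor readings to power.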
What were we looking to accomplish with machine learning?
The power generation from the plant is a continuous quantity measured over time, so the
task is a natural fit for regression. Many different regression models are
available in SML. The idea is to build a model that can automatically
identify the relationship between our target (energy generation) and a set of inputs
(time, temperature, etc.) based on the available data. Once an optimal machine learning
model is identified, we review the metrics for our regression
problem to understand how well the model predicts power generation.
Unwrapping the Data
The dataset for this study includes a total of 67,669 entries and 11 features or
attributes associated with each entry. This means the dataset is a table with roughly
744,000 cells. Figure 1 shows the different measurement points from our plant. The
meteorological measurements consist of the date and time of each observation, the
ambient temperature at the plant, the temperature of a module (solar panel) attached to
the sensor panel, and the amount of irradiation. The generation data consists of the
inverter ID (22 units), the DC power generated by the inverter (in kW), the AC power
generated by the inverter (in kW), the daily yield, which is the cumulative
sum of power generated on that day up to a given point in time, and the total yield of the
inverter up to that point in time. Observations are recorded at 15-minute intervals. A
critical step before building the model is to perform an Exploratory Data Analysis
(EDA) to investigate possible suboptimal performance of the equipment.
Figure 1: Solar plant diagram.
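The daily yield described above is a running total that starts again each day. A minimal
sketch of that calculation follows. It assumes power readings in kW taken every 15 minutes,
so that each reading contributes power × 0.25 h of energy; that conversion, the function
name and the sample readings are assumptions made for illustration, not details taken from
the plant data.

```python
from datetime import datetime, timedelta
from itertools import groupby

INTERVAL_HOURS = 0.25  # assumption: one reading every 15 minutes


def daily_yield(readings):
    """Return (timestamp, cumulative kWh so far that day) for each reading.

    `readings` is a time-ordered list of (timestamp, power_kw) tuples.
    """
    result = []
    for _, day_readings in groupby(readings, key=lambda r: r[0].date()):
        running_kwh = 0.0  # the total restarts at the beginning of each day
        for timestamp, power_kw in day_readings:
            running_kwh += power_kw * INTERVAL_HOURS
            result.append((timestamp, running_kwh))
    return result


# Illustrative readings: four on one day, two on the next.
start = datetime(2020, 5, 15, 11, 0)
powers_day1 = [400.0, 800.0, 1200.0, 800.0]
readings = [(start + timedelta(minutes=15 * i), p) for i, p in enumerate(powers_day1)]
next_day = datetime(2020, 5, 16, 11, 0)
readings += [(next_day, 400.0), (next_day + timedelta(minutes=15), 400.0)]

yields = daily_yield(readings)
for timestamp, kwh in yields:
    print(timestamp.isoformat(), kwh)
```

The running total climbs through the first day and restarts from the first reading of the
second day.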
Data preprocessing is an essential step in building a robust and meaningful machine learning
model. SML allows users to cleanse their data efficiently, systematically and quickly. For
this dataset, our first step was to convert the time and date parameters into a more useful
format. Next, we handled the missing data using SML’s automatic imputation step.
Alternatively, users can remove incomplete data by column or by row, operations that are
also available in SML. To get a better understanding of the data, several variables were
visualized using the Visualization function in SML.
Figure 2: Visualization of the data in the processing step.
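The two preprocessing steps mentioned above, converting date and time into usable features
and filling in missing values, can be sketched generically as follows. SML's own parsing and
imputation logic is not described in this study, so the sketch uses an assumed timestamp
format and simple mean imputation as a stand-in; the function names and sample values are
hypothetical.

```python
from datetime import datetime
from statistics import mean


def parse_timestamp(text, fmt="%d-%m-%Y %H:%M"):
    """Split a date-time string into numeric features a regressor can use.

    The default format is an assumption; adjust it to the actual file.
    """
    ts = datetime.strptime(text, fmt)
    return {
        "day_of_year": ts.timetuple().tm_yday,
        "hour": ts.hour + ts.minute / 60.0,  # fractional hour of day
    }


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


features = parse_timestamp("15-05-2020 06:15")
print(features)

ambient_temp = [24.0, None, 26.0, 28.0]  # made-up readings with one gap
filled = impute_mean(ambient_temp)
print(filled)
```

Mean imputation is the simplest choice and ignores the time ordering of the readings; for
15-minute sensor data, interpolating between neighbouring readings may be a better fit.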
The plots in Figure 2 show that temperature and irradiation are strongly correlated with
energy generation. The relationship between DC power and AC power has a slope of
approximately one, which suggests that the inverters are performing close to optimally. As
expected, DC power generation increases after sunrise and peaks at approximately 10:00 am.
It then plateaus for 6 to 8 hours before decreasing as the day progresses. To
further help visualize any relationships between the variables, a Pearson correlation
heatmap was plotted.
Figure 3: Pearson correlation of the data.
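For reference, each cell of such a heatmap is the Pearson coefficient between two columns:
the covariance of the pair divided by the product of their standard deviations. The sketch
below computes a small correlation matrix in plain Python. The column names and values are
made up for illustration and are not the plant's measurements.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up columns, for illustration only.
columns = {
    "irradiation": [0.0, 0.2, 0.5, 0.8, 1.0],
    "dc_power": [0.0, 2.1, 5.2, 7.9, 10.1],
    "hours_before_noon": [5.0, 4.0, 3.0, 2.0, 1.0],
}

# Build the full matrix: one coefficient per pair of columns.
names = list(columns)
matrix = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}
for a in names:
    print(a, [round(matrix[a][b], 2) for b in names])
```

Values near +1 or -1 indicate a strong linear relationship, while values near 0 indicate
little linear relationship, although a nonlinear one may still exist.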
Results
Two machine learning models were generated to forecast the energy generation of the solar
plant. SML found two models (Random Forest + XGBoost) that explain 90% of the variation in
the DC power. Some key metrics of the model performance are shown in Figure 4. In
addition, a plot of Ground Truth vs. Prediction is shown to evaluate the accuracy of the
model. The gray points are the training results and the green points are the
validation and testing results. The points lie along the unit line, implying that
the model was able to accurately predict the DC power. Finally, a feature contribution plot
was generated to identify the heavy hitters for this model. The variables most relevant to
the prediction were found to be the module temperature and the irradiation.
Figure 4: Results of the machine learning model.
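A statement that a model explains a given share of the variation in the target normally
refers to the coefficient of determination, R². The sketch below shows how R² and two common
error metrics are computed from ground-truth and predicted values. The numbers are made up,
and the exact set of metrics that SML reports is not listed in this study.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Return R², RMSE and MAE for paired actual and predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: variation of the target around its mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
    }


# Made-up values, for illustration only.
actual = [0.0, 2.0, 4.0, 6.0, 8.0]
predicted = [1.0, 2.0, 4.0, 6.0, 7.0]
metrics = regression_metrics(actual, predicted)
print({name: round(value, 3) for name, value in metrics.items()})
```

An R² of 1 means the predictions sit exactly on the unit line of a Ground Truth vs.
Prediction plot; lower values mean more scatter around it.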
To further explore the relationship between the model features and the energy generation,
SML generates partial dependency plots. A partial dependence plot can show whether the
relationship between the target and a feature is linear, monotonic or more complex. The
model shows a positive correlation between DC power and temperature. As we saw when
visualizing the data, DC power plateaus between 10:00 and 17:00. This analysis
provides insight into the effect that individual features have on the predicted outcome of
a machine learning model.
Figure 5: Partial dependency plots of the model’s features.
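A partial dependence curve for one feature is obtained by forcing that feature to a fixed
value in every row, averaging the model's predictions, and repeating over a grid of values.
The sketch below illustrates the calculation. The "model" is a hypothetical stand-in
function, not the model trained in this study, and the rows are made up.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over `rows` with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        # Overwrite only the chosen feature; all other features keep their values.
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append((value, sum(preds) / len(preds)))
    return curve


def toy_model(row):
    """Stand-in for a fitted regressor; NOT the model from this study."""
    return 10.0 * row["irradiation"] + 0.1 * row["module_temp"]


# Made-up rows, for illustration only.
rows = [
    {"irradiation": 0.2, "module_temp": 30.0},
    {"irradiation": 0.6, "module_temp": 45.0},
    {"irradiation": 1.0, "module_temp": 60.0},
]

curve = partial_dependence(toy_model, rows, "module_temp", [20.0, 40.0, 60.0])
for value, avg_prediction in curve:
    print(value, round(avg_prediction, 3))
```

Because the stand-in model is linear in module temperature, the resulting curve is a
straight line; a tree-based model would typically produce a stepped or saturating curve.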
When generating a forecast, uncertainty quantification is useful for understanding the risk
associated with the predictions. Since our data is subject to bias, which carries over into
our trained model, our goal is to provide a distribution describing the sensitivity of each
prediction to sampling bias in the training data. In Figure 6, the predictions are sorted
from low to high (blue line). A gray band describes the uncertainty and has an 80%
chance of containing the true value (green stars).
Figure 6: Uncertainty quantification plot.
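One common way to measure how sensitive a prediction is to the particular training sample is
the bootstrap: refit the model on many resampled copies of the training data and look at the
spread of the resulting predictions. The sketch below applies this idea to a simple
least-squares line and reports the 10th and 90th percentiles as an 80% band. This is a
swapped-in illustration. The study does not state how SML computes its band, and the data,
function names and parameters here are hypothetical.

```python
import random


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bootstrap_band(points, x_new, n_boot=500, lower=0.10, upper=0.90, seed=0):
    """Percentile band of predictions at x_new across bootstrap refits."""
    rng = random.Random(seed)
    preds = []
    while len(preds) < n_boot:
        # Resample the training data with replacement.
        sample = [rng.choice(points) for _ in points]
        if len({x for x, _ in sample}) < 2:
            continue  # a line cannot be fitted to a single distinct x
        slope, intercept = fit_line(sample)
        preds.append(slope * x_new + intercept)
    preds.sort()
    return preds[int(lower * n_boot)], preds[int(upper * n_boot) - 1]


# Synthetic (irradiation, dc_power) pairs with noise, for illustration only.
data_rng = random.Random(42)
points = [(i / 20, 10.0 * (i / 20) + data_rng.gauss(0.0, 0.5)) for i in range(21)]

slope, intercept = fit_line(points)
point_pred = slope * 0.5 + intercept
low, high = bootstrap_band(points, x_new=0.5)
print("prediction:", round(point_pred, 3), "80% band:", round(low, 3), round(high, 3))
```

A band built this way reflects only the variability caused by resampling the training data.
It does not include the noise in individual observations, so on its own it is narrower than
an interval meant to contain the true value.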
Key Insights
• We built an ML model that explains 94% of the variation in energy
generation using a dataset that includes observations such as temperature, irradiation
and time.
• These results show that accurate predictive models can be very easily obtained, even for
relatively large datasets, by applying smart built-in automated capabilities of SML.
• The variables most relevant to the prediction of energy generation were found to be
temperature, irradiation, and time of day.
• An uncertainty quantification analysis was executed to understand the risk associated with
the predictions.
How AutoML Workflow Solved this Problem
For this study we went through five basic steps in SML, which are standard for this
technology:
1. Upload Data: This is achieved with a simple browser-based “drag and drop” of the
original dataset file (a .csv file in our case). The user then has to
define the output variable to predict and indicate whether this is a
classification or regression problem (in this case, a regression problem).
2. Clean and Visualize Data: The original data required attention before it could be
effectively used for building a machine learning model. For instance, we had to change the
date format so that it could be used in a regression model. We solved this problem using
the Datetime Parsing feature, which extracts information from a date-time string. In
addition, data visualization helped evaluate the nature of the data being used in our
problem, and specialized actions can be taken based on that visualization exercise.
SML also offers an autopilot option that automatically deals with most of these data
processing issues and generates a well-conditioned dataset, which is very handy for
people who lack a data science background. This process also includes
appropriately splitting the data into training, validation and test sets, which SML
facilitates.
3. Machine Learning Model Building/Optimization: SML allows the user to choose from a
variety of machine learning models, or to try all of them if desired. For each model,
a hyperparameter optimization process is also necessary to identify the best possible
machine learning configuration. This technology leverages cloud computing to carry out this
model building and optimization process in a very efficient manner.
4. Machine Learning Model Evaluation and Uncertainty Quantification: Once the best
possible model is identified, a series of quantitative metrics and plots are used to
properly evaluate the model (we showed some of those above).
5. Machine Learning Model Deployment: While deployment was not the main objective of this
study, it is also possible within SML to generate an API (in Python, MATLAB and/or
JavaScript). This would facilitate deployment of the model and allow predictions to run
automatically on new data as it becomes available.
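The splitting, model optimization and evaluation steps in the workflow above can be sketched
in miniature as follows. This is a generic illustration, not SML's procedure. It assumes a
random 70/15/15 split, uses a k-nearest-neighbour regressor as a stand-in for the models
named in this study, and runs on synthetic data; all names and parameters are hypothetical.

```python
import random


def split(rows, train=0.7, valid=0.15, seed=0):
    """Shuffle and split rows into training, validation and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_train = int(train * len(rows))
    n_valid = int(valid * len(rows))
    return rows[:n_train], rows[n_train:n_train + n_valid], rows[n_train + n_valid:]


def knn_predict(train_rows, x, k):
    """Average the targets of the k training rows closest to x."""
    nearest = sorted(train_rows, key=lambda r: abs(r[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train_rows, eval_rows, k):
    """Mean squared error of k-NN predictions on eval_rows."""
    errors = [(knn_predict(train_rows, x, k) - y) ** 2 for x, y in eval_rows]
    return sum(errors) / len(errors)


# Synthetic (irradiation, dc_power) pairs, for illustration only.
rng = random.Random(1)
rows = []
for _ in range(200):
    x = rng.random()
    rows.append((x, 10.0 * x + rng.gauss(0.0, 0.5)))

train_rows, valid_rows, test_rows = split(rows)

# Hyperparameter search: keep the k with the lowest validation error.
candidates = [1, 3, 5, 9, 15]
best_k = min(candidates, key=lambda k: mse(train_rows, valid_rows, k))

# Final evaluation on data used for neither fitting nor tuning.
test_mse = mse(train_rows, test_rows, best_k)
print("best k:", best_k, "test MSE:", round(test_mse, 3))
```

Keeping the test set out of the hyperparameter search is what makes the final error an
honest estimate. For time-ordered data such as 15-minute plant readings, a chronological
split may be more appropriate than a random shuffle.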