Solar Power Generation


The use of solar PV panels as a renewable energy source has gained traction in recent years, and many plants have been built worldwide. PV cells absorb the sun’s energy and convert it to direct current (DC) electricity; a solar inverter then converts the DC electricity from the solar modules to alternating current (AC) electricity, which is used by most home appliances. Solar power is intermittent because the output of PV systems is weather-dependent and uncontrollable. The ability to predict solar PV energy output is therefore essential for supply and demand planning in an electric grid and for deciding how the system is operated. Machine learning (ML) algorithms have shown great potential in generating such forecasts. Here we used SpeedWise® Machine Learning (a commercial AutoML solution) to generate a predictive model of daily power generation.

Problem Description, Solution and Machine Learning Results

Problem Description

This dataset was gathered from a solar power plant in India over a 34-day period. Two datasets were collected: power generation data and sensor data. The power generation data is gathered at the inverter level; each inverter has multiple lines of solar panels attached to it. The sensor data is gathered at the plant level from a single array of sensors optimally placed at the plant. Our goal here is to accurately model the power generation of the plant based on the parameters measured by the sensors. For that we used SpeedWise Machine Learning (SML), a powerful AutoML solution that can automatically deal with the idiosyncrasies of real data (the dataset considered here is large, incomplete and noisy) while extracting patterns, trends and correlations from it.

What were we looking to accomplish with machine learning?

The power generation of the plant is a continuous quantity measured over time, so a regression problem is a good fit. Many different models are available in SML for such a problem. The idea is to build a model that can automatically identify the relationship between our target (energy generation) and a set of inputs (time, temperature, etc.) based on the available data. Once an optimal machine learning model is identified, we review the metrics for our regression problem to understand how well the model predicts power generation.

Unwrapping the Data

The dataset for this study includes a total of 67,669 entries and 11 features or attributes associated with each entry. This means the dataset is a table with roughly 744,000 cells. Figure 1 shows the different measurement points from our plant. The meteorological measurements consist of the date and time of each observation, the ambient temperature at the plant, the temperature of a module (solar panel) attached to the sensor panel, and the amount of irradiation. The generation data consists of the inverter id (22 units), the amount of DC power generated by the inverter (in kW), the amount of AC power generated by the inverter (in kW), the daily yield, which is a cumulative sum of the power generated on that day up to a specific point, and the total yield of the inverter up to that point in time. Observations are recorded at 15-minute intervals. A critical step before building the model is an Exploratory Data Analysis (EDA) to investigate suboptimal performance of the equipment.

Figure 1: Solar plant diagram.
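To make the daily yield definition concrete, here is a minimal sketch of a running sum that resets each day. It assumes that energy per observation can be approximated as power multiplied by the 15-minute interval. The function name, the sample timestamps and the power values are illustrative only, and how the plant's loggers actually compute the yield is not stated in the dataset description.

```python
from collections import defaultdict
from datetime import datetime

INTERVAL_HOURS = 0.25  # observations are recorded every 15 minutes


def cumulative_daily_yield(readings):
    """Return (timestamp, running energy in kWh) pairs, restarting each day.

    `readings` is an iterable of (timestamp, power_kw) tuples sorted by time.
    Energy per reading is approximated as power * interval (an assumption).
    """
    totals = defaultdict(float)
    result = []
    for timestamp, power_kw in readings:
        day = timestamp.date()
        totals[day] += power_kw * INTERVAL_HOURS
        result.append((timestamp, totals[day]))
    return result


# Illustrative readings, not taken from the plant data.
sample = [
    (datetime(2020, 5, 15, 10, 0), 8.0),
    (datetime(2020, 5, 15, 10, 15), 12.0),
    (datetime(2020, 5, 16, 10, 0), 4.0),
]
daily = cumulative_daily_yield(sample)
```

The running total restarts whenever the calendar date changes, which mirrors the "on that day until a specific point" wording above.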

Data preprocessing is an essential step in building a robust and meaningful machine learning model. SML allows users to cleanse their data efficiently, systematically and quickly. For this dataset, our first step was to convert the time and date parameters into a more useful format. Next, we handled the missing data using SML’s automatic imputation step. Alternatively, users can remove data by column or by row; both operations are available in SML. To get a better understanding of the data, we visualized several variables using the Visualization function in SML.

Figure 2: Visualization of the data in the processing step.
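The two preprocessing steps described above, date-time conversion and imputation, can be sketched outside SML roughly as follows. This is not SML's implementation: the timestamp format, the choice of derived features and the use of mean imputation are all assumptions made for illustration.

```python
from datetime import datetime
from statistics import mean


def parse_timestamp(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Turn a date-time string into numeric features a regressor can use.

    The format string is an assumed layout, not the dataset's documented one.
    """
    ts = datetime.strptime(text, fmt)
    return {
        "day_of_year": ts.timetuple().tm_yday,
        "hour": ts.hour + ts.minute / 60.0,  # fractional hour of the day
    }


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Illustrative inputs, not taken from the plant data.
features = parse_timestamp("2020-05-15 10:15:00")
temperatures = impute_mean([25.0, None, 27.0])
```

Expressing time of day as a fractional hour keeps the 15-minute resolution of the observations in a single numeric column.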

The plots in Figure 2 show that temperature and irradiation are strongly correlated with energy generation. The relationship between DC power and AC power has a slope close to one, which suggests that the inverters are performing close to optimally. As expected, DC power generation increases after sunrise and peaks at approximately 10:00 am. It then stays on a plateau for 6 to 8 hours before decreasing as the day progresses. To further help visualize any relationships between the variables, a Pearson correlation heatmap was plotted.

Figure 3: Pearson correlation of the data.
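For readers who want to reproduce a heatmap like this one, the underlying numbers are pairwise Pearson coefficients. The sketch below computes them from scratch; the column names and values are made up for illustration and are not the plant data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict of pairwise coefficients, one entry per heatmap cell."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical columns, chosen only to show the call pattern.
data = {
    "irradiation": [0.0, 0.2, 0.5, 0.8, 1.0],
    "module_temp": [22.0, 28.0, 40.0, 51.0, 58.0],
    "dc_power": [0.0, 2.4, 6.1, 9.5, 12.0],
}
matrix = correlation_matrix(data)
```

Each cell of the heatmap corresponds to one `matrix[a][b]` entry; the diagonal is always 1 and the matrix is symmetric.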


Two machine learning models were generated to forecast the energy generation of the solar plant. SML found two models (Random Forest and XGBoost) that explain 90% of the variation in the DC power. Some key metrics of the model performance are shown in Figure 4. In addition, a plot of the Ground Truth vs. Prediction is shown to evaluate the accuracy of the model. The gray points are the training results and the green points are the validation and testing results. The points lie along the unit line, implying that the model was able to accurately predict the DC power. Finally, a feature contribution plot was generated to identify the heavy hitters for this model. The variables most relevant to the prediction were found to be the module temperature and the irradiation.

Figure 4: Results of the machine learning model.
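A statement such as "explains 90% of the variation" is usually read as a coefficient of determination (R²) of 0.90; we assume that reading here, since the text does not list the metrics in Figure 4. The sketch below shows how R² and the root mean squared error can be computed from ground truth and predictions. The numbers are illustrative, not model output.

```python
from math import sqrt


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Illustrative values only.
actual = [0.0, 2.0, 4.0, 6.0, 8.0]
predicted = [1.0, 2.0, 4.0, 6.0, 7.0]
score = r_squared(actual, predicted)
error = rmse(actual, predicted)
```

Points lying exactly on the unit line of a Ground Truth vs. Prediction plot would give an R² of 1 and an RMSE of 0.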

To further explore the relationship between the model features and the energy generation, SML generates partial dependency plots. A partial dependence plot can show whether the relationship between the target and a feature is linear, monotonic or more complex. The model shows a positive relationship between DC power and temperature. As seen in the data visualization step, DC power plateaus between 10:00 and 17:00. This analysis provides insight into the effect that individual features have on the predicted outcome of a machine learning model.

Figure 5: Partial dependency plots of the model’s features.
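The idea behind a partial dependence curve can be sketched in a few lines: fix one feature at a grid value for every row, average the model output, and repeat across the grid. The `toy_model` function below is a hypothetical stand-in for a trained model, and the rows are invented; neither reflects the model SML produced.

```python
def partial_dependence(model, rows, feature, grid):
    """Average model output when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the input rows stay untouched
            modified[feature] = value  # override only the feature of interest
            total += model(modified)
        curve.append((value, total / len(rows)))
    return curve


def toy_model(row):
    # Hypothetical stand-in for a trained regressor: power rises with
    # irradiation and with module temperature.
    return 10.0 * row["irradiation"] + 0.1 * row["module_temp"]


# Invented rows, used only to exercise the function.
rows = [
    {"irradiation": 0.2, "module_temp": 30.0},
    {"irradiation": 0.6, "module_temp": 45.0},
    {"irradiation": 1.0, "module_temp": 55.0},
]
curve = partial_dependence(toy_model, rows, "module_temp", [20.0, 40.0, 60.0])
```

Because the other features keep their observed values, the curve isolates the average effect of the chosen feature on the prediction.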

When generating a forecast, uncertainty quantification is useful for understanding the risk associated with the predictions. Since our data is subject to bias, which carries into the trained model, our goal is to provide a distribution describing the sensitivity of each prediction to sampling bias in the training data. In Figure 6, the predictions are sorted from low to high (blue line). The grey band describes the uncertainty and has an 80% chance of containing the true value (green stars).

Figure 6: Uncertainty quantification plot.
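How SML constructs its uncertainty band is not described here. One common way to measure the sensitivity of a prediction to the training sample is the bootstrap: refit the model on resampled training data many times and take percentiles of the resulting predictions. The sketch below assumes that approach and uses a deliberately simple one-parameter model (a line through the origin) with invented data, so it only illustrates the mechanics.

```python
import random


def fit_slope(pairs):
    """Least-squares slope for a line through the origin: y = k * x."""
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return num / den


def bootstrap_interval(pairs, x_new, n_rounds=500, coverage=0.80, seed=0):
    """Percentile interval of predictions at x_new over bootstrap refits."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_rounds):
        resample = [rng.choice(pairs) for _ in pairs]  # sample with replacement
        predictions.append(fit_slope(resample) * x_new)
    predictions.sort()
    tail = (1.0 - coverage) / 2.0
    low = predictions[round(tail * (n_rounds - 1))]
    high = predictions[round((1.0 - tail) * (n_rounds - 1))]
    return low, high


# Invented (irradiation, dc_power) pairs; no x value is zero.
pairs = [(0.2, 2.1), (0.4, 4.3), (0.5, 4.8), (0.7, 7.4), (0.9, 8.8), (1.0, 10.3)]
low, high = bootstrap_interval(pairs, x_new=0.6)
```

An interval built this way reflects only variation due to resampling the training data; a band meant to contain the true observed value would also need to account for noise in the observations.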

Key Insights

• We built an ML model that explains 94% of the variation in energy generation using a dataset that includes observations such as temperature, irradiation, time, etc.

• These results show that accurate predictive models can be very easily obtained, even for relatively large datasets, by applying smart built-in automated capabilities of SML.

• The variables most relevant to the prediction of energy generation were found to be module temperature, irradiation, and time of day.

• An uncertainty quantification analysis was executed to understand the risk associated with the predictions.

How AutoML Workflow Solved this Problem

For this study we went through five basic steps in SML, which are standard for this technology:

1. Upload Data: This is achieved with a simple browser-based “drag and drop” of the original dataset file (a .csv file in our case). The user then has to define the output variable to predict and indicate whether this is a classification or regression problem (in this case it was a regression problem).

2. Clean and Visualize Data: The original data required attention before it could be effectively used for building a machine learning model. For instance, we had to change the date format so that it could be used in a regression model. We solved this using the Datetime Parsing feature, which extracts information from a date-time string. In addition, data visualization helped us evaluate the nature of the data used in our problem, and specialized actions can be taken based on that visualization exercise. SML also offers an autopilot option that automatically deals with most of these data processing issues and generates a well-conditioned dataset, which is very handy for users who lack a data science background. This process also includes appropriately splitting the data into training, validation and test sets, which SML facilitates in a smart way.

3. Machine Learning Model Building/Optimization: SML allows the user to choose from a variety of machine learning models, or they can try all of them if desired. For each model, a hyperparameter optimization process is also necessary to identify the best possible machine learning configuration. This technology leverages cloud computing to carry out this model building and optimization process in a very efficient manner.

4. Machine Learning Model Evaluation and Uncertainty Quantification: Once the best possible model is identified, a series of quantitative metrics and plots are used to properly evaluate the model (we showed some of those above).

5. Machine Learning Model Deployment: While deployment was not the main objective of this study, it is also possible within SML to generate an API (in Python, MATLAB and/or JavaScript). This would facilitate deployment of the model and allow predictions to be run automatically for new data as it becomes available.
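Steps 2 and 3 above mention splitting the data and optimizing hyperparameters. The sketch below shows the general pattern under stated assumptions: a 70/15/15 split in the existing row order, and a tiny k-nearest-neighbour regressor as a stand-in model whose only hyperparameter is k. Neither the split proportions nor the model are taken from SML, which selected Random Forest and XGBoost in this study; the data is synthetic.

```python
def split_rows(rows, train_pct=70, validation_pct=15):
    """Split rows, in their existing order, into train/validation/test."""
    n = len(rows)
    a = n * train_pct // 100
    b = n * (train_pct + validation_pct) // 100
    return rows[:a], rows[a:b], rows[b:]


def knn_predict(train_rows, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mean_squared_error(train_rows, eval_rows, k):
    errors = [(knn_predict(train_rows, x, k) - y) ** 2 for x, y in eval_rows]
    return sum(errors) / len(errors)


def grid_search(train_rows, validation_rows, candidates):
    """Return the candidate k with the lowest validation error."""
    return min(
        candidates,
        key=lambda k: mean_squared_error(train_rows, validation_rows, k),
    )


# Synthetic (irradiation, dc_power) rows with a small alternating offset.
rows = []
for i in range(20):
    x = ((i * 7) % 20) / 20
    rows.append((x, 10.0 * x + (0.5 if i % 2 == 0 else -0.5)))

train, validation, test = split_rows(rows)
best_k = grid_search(train, validation, [1, 3, 5])
test_error = mean_squared_error(train, test, best_k)
```

The test set is used only once, after the hyperparameter has been chosen on the validation set, so that `test_error` remains an honest estimate of performance on unseen data.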