Where to find data for AI validation
The paper listed in the References section [1] summarizes data sources that can be used for AI validation. The data range from turbofan engines, to vibration and acoustic emission measurements, to battery current.

| No. | Source | Dataset | Description |
|---|---|---|---|
| 1 | Saxena, A.; Goebel, K. Turbofan Engine Degradation Simulation Data Set. NASA Ames Progn. Data Repos. 2008. Available online: https://data.nasa.gov/Aerospace/CMAPSS-Jet-Engine-Simulated-Data/ff5v-kuh6/ (accessed on 19 January 2024) | NASA Turbofan Dataset-CMAPSSD | The turbofan engine degradation simulation dataset, generated with the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dynamical model |
| 2 | Saxena, A.; Goebel, K. Phm08 challenge data set. In NASA Ames Prognostics Data Repository; NASA Ames Research Center: Moffett Field, CA, USA, (consulted 2014-02-15); 2008. Available online: https://www.nasa.gov/intelligent-systems-division/discovery-and-systems-health/pcoe/pcoe-data-set-repository/ (accessed on 19 January 2024). | PHM 2008 Dataset | Degradation data from aircraft engines, derived from CMAPSSD |
| 3 | Agogino, A.; Goebel, K. Milling data set. In NASA Ames Prognostics Data Repository; NASA Ames Research Center: Moffett Field, CA, USA, 2007. Available online: https://www.nasa.gov/intelligent-systems-division/discovery-and-systems-health/pcoe/pcoe-data-set-repository/ (accessed on 19 January 2024) | NASA Ames Milling Dataset | Acoustic emission, vibration, and motor current data collected under different experimental conditions for predicting the milling tool wear |
| 4 | Lee, J.; Qiu, H.; Yu, G.; Lin, J. Bearing Data Set. IMS, University of Cincinnati, NASA Ames Prognostics Data Repository, Rexnord Technical Services. 2007. Available online: https://www.nasa.gov/intelligent-systems-division/discovery-and-systems-health/pcoe/pcoe-data-set-repository/ (accessed on 19 January 2024) | NASA Bearing Dataset | Run-to-failure vibration data from 4 accelerometers in a shaft |
| 5 | FEMTO Bearing Data Set; FEMTO-ST Institute: Besançon, France, 2012; IEEE PHM 2012 Data Challenge. Available online: https://www.femto-st.fr/en/Research-departments/AS2M/Research-groups/DATA-PHM (accessed on 19 January 2024) | FEMTO Ball Bearing Dataset from IEEE PHM Challenge | Run-to-failure temperature and vibration data from engine thermocouple and accelerometer sensors |
| 6 | Backblaze. Hard Drive Data and Stats 2019. Available online: https://www.backblaze.com/b2/hard-drive-test-data.html (accessed on 30 December 2023) | Backblaze Hard Disk Drive Dataset | The daily status of hard disk drives (HDDs), consisting of 433 failed drives and 22,962 good drives |
| 7 | PAKDD2020 Alibaba AI OPS Competition. 2020. Available online: https://tianchi.aliyun.com/competition/entrance/231775/introduction (accessed on 30 December 2023) | PAKDD2020 Alibaba AI OPS Competition Dataset | HDD daily health status data including both a raw and a normalized value as well as a label and the time of failure |
| 8 | Saha, B.; Goebel, K. Battery data set. In NASA Ames Prognostics Data Repository; NASA Ames Research Center: Moffett Field, CA, USA, 2007. Available online: https://www.nasa.gov/intelligent-systems-division/discovery-and-systems-health/pcoe/pcoe-data-set-repository/ (accessed on 19 January 2024) | NASA Ames Prognostics Dataset | Li-ion battery degradation data during repeated charge and discharge cycles |
| 9 | Hasib, S.A.; Islam, S.; Chakrabortty, R.K.; Ryan, M.J.; Saha, D.K.; Ahamed, M.H.; Moyeen, S.I.; Das, S.K.; Ali, M.F.; Islam, M.R. A comprehensive review of available battery datasets, RUL prediction approaches, and advanced battery management. IEEE Access 2021, 9, 86166–86193. [https://ieeexplore.ieee.org/document/9454160/] | Lithium-ion Battery Dataset of the University of Maryland | The current and voltage data on different EV drive cycles at varying ambient temperatures (including 0 °C, 25 °C, and 45 °C) |
| 10 | Celaya, J.; Saxena, A.; Saha, S.; Goebel, K. MOSFET thermal overstress aging data set. In NASA Ames Prognostics Data Repository; NASA Ames Research Center: Moffett Field, CA, USA, 2011. Available online: https://www.nasa.gov/intelligent-systems-division/discovery-and-systems-health/pcoe/pcoe-data-set-repository/ (accessed on 19 January 2024) | MOSFET Thermal Overstress Aging Dataset | Run-to-failure experiments on power MOSFETs under thermal overstress |
| 11 | Ribeiro, F. MaFaulDa-Machinery Fault Database; Signals, Multimedia, and Telecommunications Laboratory: Rio de Janeiro, Brazil, 2016 https://www02.smt.ufrj.br/~offshore/mfs/page_01.html | MAFAULDA | Fault measurements from machinery simulators run under different load conditions |
| 12 | Microsoft Azure. Azure AI Guide for Predictive Maintenance Solutions. 2020. Available online: https://docs.microsoft.com/pt-br/azure/machine-learning/team-data-science-process/predictive-maintenance-playbook#solution-templates-for-predictive-maintenance (accessed on 30 December 2023) | Microsoft Azure PdM Dataset | Machine, telemetry, error, maintenance, and failure data modules collected by a Microsoft employee for PdM modeling |
| 13 | Hong, T.; Pinson, P.; Fan, S.; Zareipour, H.; Troccoli, A.; Hyndman, R.J. Probabilistic energy forecasting: Global energy forecasting competition 2014 and beyond. Int. J. Forecast. 2016, 32, 896–913. [https://www.sciencedirect.com/science/article/abs/pii/S0169207016000133?via%3Dihub] | Global Energy Forecasting Competition (GEFCOM) Dataset | Hourly solar power generation data and the associated numerical weather forecasts from 1 April 2012 to 1 July 2014 |
| 14 | McCann, M.; Johnston, A. SECOM Dataset UCI Machine Learning Repository. 2008. Available online: https://archive.ics.uci.edu/ml/datasets/secom (accessed on 19 January 2024) | The UCI SECOM Dataset | Measurements of features of semiconductor production within a semiconductor manufacturing process |
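Several of the run-to-failure datasets above are distributed as plain text files. As a minimal sketch, assuming a whitespace-separated layout in which the first column is a unit id and the second a cycle counter (this layout, the synthetic rows, and the helper names `parse_rows` and `add_remaining_cycles` are illustrative assumptions, not taken from the datasets' documentation), the rows could be parsed and labelled with the number of cycles remaining before the end of each run:

```python
from collections import defaultdict

# Synthetic rows in an assumed layout (NOT real NASA data):
# unit id, cycle, then operating settings and sensor readings.
SAMPLE = """\
1 1 0.0 0.0 100.0 500.0 640.0
1 2 0.0 0.0 100.0 500.0 641.0
1 3 0.0 0.0 100.0 500.0 643.0
2 1 0.0 0.0 100.0 500.0 640.5
2 2 0.0 0.0 100.0 500.0 642.5
"""


def parse_rows(text):
    """Turn whitespace-separated lines into dictionaries."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        rows.append({
            "unit": int(fields[0]),
            "cycle": int(fields[1]),
            "values": [float(v) for v in fields[2:]],
        })
    return rows


def add_remaining_cycles(rows):
    """Label each row with the cycles left until the unit's last record."""
    last_cycle = defaultdict(int)
    for row in rows:
        last_cycle[row["unit"]] = max(last_cycle[row["unit"]], row["cycle"])
    for row in rows:
        row["remaining"] = last_cycle[row["unit"]] - row["cycle"]
    return rows


rows = add_remaining_cycles(parse_rows(SAMPLE))
remaining = [row["remaining"] for row in rows]
print(remaining)  # [2, 1, 0, 1, 0]
```

Treating the last recorded cycle as the moment of failure only makes sense for run-to-failure data; for other datasets in the table a different label would be needed.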
Why are training databases important
Testing an AI methodology is crucial to its success for at least two reasons:
- The artificial neural network (ANN), the foundation of many AI methodologies, is reliable only when interpolating. For instance, an ANN will predict well if the failure mode was part of the training data (interpolation), but can make very poor predictions when the inputs fall outside the training range (extrapolation).
- An AI methodology needs to provide the right feedback. Consider an AI applied to car engine failure: we want alarms only when it is time to act. We don't want the AI to raise so many alerts that they fill the entire dashboard, and we also don't want it to stay silent when the engine is close to a catastrophic failure.
Both reasons explain why we need a lot of data to train the AI and a lot of data to test it against. The fact that so much data is freely available helps unlock the potential of AI and improve its efficacy and reliability.
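The interpolation versus extrapolation point can be illustrated with a toy experiment. The sketch below is only an illustration under stated assumptions: a very small one-hidden-layer network written from scratch (the function `train_tiny_ann`, the target function x², and all hyperparameters are hypothetical choices, not a method from the referenced paper) is trained on inputs between -1 and 1 and then queried both inside and far outside that range.

```python
import math
import random


def train_tiny_ann(xs, ys, hidden=8, epochs=1000, lr=0.05, seed=0):
    """Train a 1-input, 1-output network with one tanh hidden layer (plain SGD)."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        h = [math.tanh(w1[i] * x + b1[i]) for i in range(hidden)]
        return h, sum(w2[i] * h[i] for i in range(hidden)) + b2

    samples = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            h, out = forward(x)
            err = out - y  # gradient of 0.5 * (out - y)^2 w.r.t. out
            for i in range(hidden):
                grad_pre = err * w2[i] * (1.0 - h[i] ** 2)
                w2[i] -= lr * err * h[i]
                w1[i] -= lr * grad_pre * x
                b1[i] -= lr * grad_pre
            b2 -= lr * err

    return lambda x: forward(x)[1]


# Training data covers only the range [-1, 1].
train_x = [i / 20 - 1 for i in range(41)]
train_y = [x * x for x in train_x]
predict = train_tiny_ann(train_x, train_y)

interp_err = abs(predict(0.525) - 0.525 ** 2)  # inside the training range
extrap_err = abs(predict(3.0) - 3.0 ** 2)      # far outside the training range
print(f"interpolation error: {interp_err:.4f}")
print(f"extrapolation error: {extrap_err:.4f}")
```

Because the hidden units saturate outside the range they were trained on, the prediction at x = 3 stays close to the values seen during training, while the true value keeps growing. More training data inside the range does not help; only data that covers the new condition does.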
How are data generated
The source of the data depends on the subject. NASA, which provides a rich collection of data, offers a mixture of experimental data from experimental machines, real measurement data, and simulations. Why so much data? As mentioned in the previous section, ANNs do not extrapolate well, so it is important to simulate all the conditions the AI may experience. A condition that was never simulated can lead to false alarms or to no alarms at all.
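False alarms and missed alarms can be counted once a test set with known failures is available. The following is a minimal sketch with made-up alarm and failure flags (the function name `alarm_report` and the sample values are assumptions for illustration only):

```python
def alarm_report(alarms, failures):
    """Compare raised alarms (1/0) with true failures (1/0), position by position."""
    pairs = list(zip(alarms, failures))
    hits = sum(1 for a, f in pairs if a and f)
    false_alarms = sum(1 for a, f in pairs if a and not f)
    missed = sum(1 for a, f in pairs if f and not a)
    # Precision: share of alarms that were justified.
    precision = hits / (hits + false_alarms) if hits + false_alarms else 0.0
    # Recall: share of real failures that triggered an alarm.
    recall = hits / (hits + missed) if hits + missed else 0.0
    return {
        "hits": hits,
        "false_alarms": false_alarms,
        "missed": missed,
        "precision": precision,
        "recall": recall,
    }


# Made-up example: eight inspection intervals.
alarms = [0, 1, 1, 0, 0, 1, 0, 0]
failures = [0, 0, 1, 0, 1, 1, 0, 0]
report = alarm_report(alarms, failures)
print(report["hits"], report["false_alarms"], report["missed"])  # 2 1 1
```

A low precision corresponds to a dashboard full of unnecessary alarms, while a low recall corresponds to an AI that stays silent before a real failure.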
In some cases, simulated data can be better than real data. Going back to the example of car engine failure: these machines have proven to be very reliable, so little failure data is available. In this case, real data showing small failures or regular-to-severe degradation could be used, and modelled points could simulate the most severe failures. In both cases, as already mentioned in the previous article https://aisciencetalk.blog/2024/06/10/the-importance-of-data-quality-while-using-ai/, it is fundamental to rely on good-quality data and to make sure the modelled data are realistic. On this point, the data shared by NASA https://data.nasa.gov/Aerospace/CMAPSS-Jet-Engine-Simulated-Data/ff5v-kuh6/ include simulated data with noise added on top. The purpose of the noise is precisely to make the data realistic. C-MAPSS is one of the well-known data sources in the field of gas turbine deterioration analysis, and several publications show the potential of AI for detecting these failures.
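The idea of adding noise to simulated data can be sketched in a few lines. This is not how C-MAPSS generates its data; it is a hypothetical example in which a linear "health index" (an assumed quantity, with assumed wear rate and noise level) is degraded cycle by cycle and Gaussian noise is added on top:

```python
import random


def simulate_health(n_cycles, start=1.0, wear_per_cycle=0.004):
    """Idealised, noise-free degradation: health drops linearly per cycle."""
    return [start - wear_per_cycle * c for c in range(n_cycles)]


def add_noise(signal, sigma, seed=0):
    """Add zero-mean Gaussian noise to mimic sensor scatter."""
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    return [v + rng.gauss(0.0, sigma) for v in signal]


clean = simulate_health(200)
noisy = add_noise(clean, sigma=0.02)

# The average offset between noisy and clean values should be close to zero.
mean_offset = sum(n - c for n, c in zip(noisy, clean)) / len(clean)
print(f"first clean value: {clean[0]:.3f}, first noisy value: {noisy[0]:.3f}")
print(f"mean offset introduced by the noise: {mean_offset:.4f}")
```

The noise level should be chosen to match what real sensors show; too little noise makes the model overconfident, and too much hides the degradation trend.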
References
- [1] Ucar, A.; Karakose, M.; Kırımça, N. Artificial Intelligence for Predictive Maintenance Applications: Key Components, Trustworthiness, and Future Trends. Appl. Sci. 2024, 14, 898. https://doi.org/10.3390/app14020898
Copyright
Author: Simone Togni
Platform: https://aisciencetalk.blog/