Top Data Sources for AI Validation

Where to find data for AI validation

The paper cited in the References section [1] summarizes data sources that can be used for AI validation. The data range from turbofan engine degradation, to vibration and acoustic emission measurements, to battery current.

| No. | Reference | Dataset | Description |
|---|---|---|---|
| 1 | Saxena, A.; Goebel, K. Turbofan Engine Degradation Simulation Data Set. NASA Ames Progn. Data Repos. 2008. Available online: https://data.nasa.gov/Aerospace/CMAPSS-Jet-Engine-Simulated-Data/ff5v-kuh6/ (accessed on 19 January 2024) | NASA Turbofan Dataset-CMAPSSD | The turbofan engine degradation simulation dataset, generated with the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dynamical model |
| 2 | Saxena, A.; Goebel, K. Phm08 challenge data set. In NASA Ames Prognostics Data Repository; NASA Ames Research Center: Moffett Field, CA, USA, (consulted 2014-02-15); 2008. Available online: https://www.nasa.gov/intelligent-systems-division/discovery-and-systems-health/pcoe/pcoe-data-set-repository/ (accessed on 19 January 2024) | PHM 2008 Dataset | The degradation collected from aircraft engines derived from CMAPSSD |
| 3 | Agogino, A.; Goebel, K. Milling data set. In NASA Ames Prognostics Data Repository; NASA Ames Research Center: Moffett Field, CA, USA, 2007. Available online: https://www.nasa.gov/intelligent-systems-division/discovery-and-systems-health/pcoe/pcoe-data-set-repository/ (accessed on 19 January 2024) | NASA Ames Milling Dataset | Acoustic emission, vibration, and motor current data collected under different experimental conditions for predicting the milling tool wear |
| 4 | Lee, J.; Qiu, H.; Yu, G.; Lin, J. Bearing Data Set. IMS, University of Cincinnati, NASA Ames Prognostics Data Repository, Rexnord Technical Services. 2007. Available online: https://www.nasa.gov/intelligent-systems-division/discovery-and-systems-health/pcoe/pcoe-data-set-repository/ (accessed on 19 January 2024) | NASA Bearing Dataset | Run-to-failure vibration data from 4 accelerometers in a shaft |
| 5 | FEMTO Bearing Data Set; FEMTO-ST Institute: Besançon, France, 2012; IEEE PHM 2012 Data Challenge. Available online: https://www.femto-st.fr/en/Research-departments/AS2M/Research-groups/DATA-PHM (accessed on 19 January 2024) | FEMTO Ball Bearing Dataset from IEEE PHM Challenge | Run-to-failure temperature and vibration data from engine thermocouple and accelerometer sensors |
| 6 | Backblaze. Hard Drive Data and Stats 2019. Available online: https://www.backblaze.com/b2/hard-drive-test-data.html (accessed on 30 December 2023) | Backblaze Hard Disk Drive Dataset | The daily status of hard disk drives (HDDs), consisting of 433 failed drives and 22,962 good drives |
| 7 | PAKDD2020 Alibaba AI OPS Competition. 2020. Available online: https://tianchi.aliyun.com/competition/entrance/231775/introduction (accessed on 30 December 2023) | PAKDD2020 Alibaba AI OPS Competition Dataset | HDD daily health status data including both a raw and a normalized value as well as a label and the time of failure |
| 8 | Saha, B.; Goebel, K. Battery data set. In NASA Ames Prognostics Data Repository; NASA Ames Research Center: Moffett Field, CA, USA, 2007. Available online: https://www.nasa.gov/intelligent-systems-division/discovery-and-systems-health/pcoe/pcoe-data-set-repository/ (accessed on 19 January 2024) | NASA Ames Prognostics Dataset | Li-ion battery degradation data during repeated charge and discharge cycles |
| 9 | Hasib, S.A.; Islam, S.; Chakrabortty, R.K.; Ryan, M.J.; Saha, D.K.; Ahamed, M.H.; Moyeen, S.I.; Das, S.K.; Ali, M.F.; Islam, M.R. A comprehensive review of available battery datasets, RUL prediction approaches, and advanced battery management. IEEE Access 2021, 9, 86166–86193. https://ieeexplore.ieee.org/document/9454160/ | Lithium-ion Battery Dataset of the University of Maryland | The current and voltage data on different EV drive cycles at varying ambient temperatures (including 0 °C, 25 °C, and 45 °C) |
| 10 | Celaya, J.; Saxena, A.; Saha, S.; Goebel, K. MOSFET thermal overstress aging data set. In NASA Ames Prognostics Data Repository; NASA Ames Research Center: Moffett Field, CA, USA, 2011. Available online: https://www.nasa.gov/intelligent-systems-division/discovery-and-systems-health/pcoe/pcoe-data-set-repository/ (accessed on 19 January 2024) | MOSFET Thermal Overstress Aging Dataset | Run-to-failure experiments on power MOSFETs under thermal overstress |
| 11 | Ribeiro, F. MaFaulDa-Machinery Fault Database; Signals, Multimedia, and Telecommunications Laboratory: Rio de Janeiro, Brazil, 2016. https://www02.smt.ufrj.br/~offshore/mfs/page_01.html | MAFAULDA | Fault measurements from machinery simulators run under different load conditions |
| 12 | Microsoft Azure. Azure ai Guide for Predictive Maintenance Solutions. 2020. Available online: https://docs.microsoft.com/pt-br/azure/machine-learning/team-data-science-process/predictive-maintenance-playbook#solution-templates-for-predictive-maintenance (accessed on 30 December 2023) | Microsoft Azure PdM Dataset | Data modules of machines, telemetry, errors, maintenance, and failures collected by a Microsoft employee for PdM modeling collection |
| 13 | Hong, T.; Pinson, P.; Fan, S.; Zareipour, H.; Troccoli, A.; Hyndman, R.J. Probabilistic energy forecasting: Global energy forecasting competition 2014 and beyond. Int. J. Forecast. 2016, 32, 896–913. https://www.sciencedirect.com/science/article/abs/pii/S0169207016000133?via%3Dihub | Global Energy Forecasting Competition (GEFCOM) Dataset | Hourly solar power generation data and assigning numerical weather forecasts from 1 April 2012 to 1 July 2014 |
| 14 | McCann, M.; Johnston, A. SECOM Dataset UCI Machine Learning Repository. 2008. Available online: https://archive.ics.uci.edu/ml/datasets/secom (accessed on 19 January 2024) | The UCI SECOM Dataset | Measurements of features of semiconductor production within a semiconductor manufacturing process |

Why are training databases important

Testing an AI methodology is crucial for its success for at least two reasons:

  1. The artificial neural network (ANN), which is the foundation of many AI methodologies, is reliable only when interpolating. For instance, an ANN predicts well when the failure mode was part of the training data (interpolation), but it can predict very poorly when the inputs fall outside the training range (extrapolation).
  2. An AI methodology needs to provide the right feedback. Consider an AI applied to car engine failure: we want alarms only when it is time to act. We don't want the AI to produce so much feedback that it fills the entire dashboard, and we also don't want it to stay silent when the engine is close to a catastrophic failure.

Both reasons explain why we need a lot of data to train the AI and a lot of data to test it against. The fact that many datasets are freely available helps unlock the potential of AI and improve its efficacy and reliability.
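The gap between interpolation and extrapolation can be illustrated with a small sketch. It does not use an ANN: as a stand-in, it uses a hypothetical nearest-neighbour model that simply memorises its training points, and the sine target, the training range, and the helper names are all assumptions made for illustration. The idea it shows is the same one described above: a data-driven model does well inside the range it was trained on and poorly outside it.

```python
import math

def make_training_set(n=50, lo=0.0, hi=math.pi):
    """Sample the (assumed) true behaviour sin(x) on the training range [lo, hi]."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return [(x, math.sin(x)) for x in xs]

def predict(model, x):
    """Stand-in data-driven model: return the target of the closest training input."""
    return min(model, key=lambda point: abs(point[0] - x))[1]

def mean_abs_error(model, xs):
    """Average absolute difference between the prediction and the true value."""
    return sum(abs(predict(model, x) - math.sin(x)) for x in xs) / len(xs)

model = make_training_set()

# Test inputs inside the training range [0, pi] (interpolation)
inside = [math.pi * (i + 0.5) / 20 for i in range(20)]
# Test inputs beyond the training range, in (pi, 2*pi) (extrapolation)
outside = [math.pi + math.pi * (i + 0.5) / 20 for i in range(20)]

err_in = mean_abs_error(model, inside)
err_out = mean_abs_error(model, outside)
print(f"mean error inside the training range:  {err_in:.3f}")
print(f"mean error outside the training range: {err_out:.3f}")
```

The error outside the training range is much larger because the model has never seen that regime, which is why test data should cover the conditions the AI is expected to meet.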

How are data generated

The source of the data depends on the subject. NASA, which provides a rich collection of datasets, offers a mixture of experimental data from test machines, real measurement data, and simulations. Why so much data? As mentioned in the previous section, an ANN is not very good at extrapolation, so it is important to simulate all the conditions the AI may experience. A condition that was never simulated can lead to false alarms or to no alarm at all.
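When testing an AI against labelled data, false alarms and missed alarms can be counted directly. The sketch below is only an illustration: the failure scores, the labels, and the `alarm_counts` helper are hypothetical and do not come from any of the datasets listed above. It shows how moving the alarm threshold trades one kind of error for the other.

```python
def alarm_counts(scores, failing, threshold):
    """Count false alarms (alarm raised, unit healthy) and missed alarms
    (no alarm, unit actually failing) for a given alarm threshold."""
    false_alarms = sum(
        1 for s, f in zip(scores, failing) if s >= threshold and not f
    )
    missed_alarms = sum(
        1 for s, f in zip(scores, failing) if s < threshold and f
    )
    return false_alarms, missed_alarms

# Hypothetical failure scores produced by an AI model (higher = more alarming)
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.80, 0.90, 0.95]
# Hypothetical ground truth: True if the unit was really failing
failing = [False, False, False, False, True, False, True, True, True, True]

for threshold in (0.3, 0.5, 0.75):
    false_alarms, missed_alarms = alarm_counts(scores, failing, threshold)
    print(f"threshold={threshold}: false alarms={false_alarms}, "
          f"missed alarms={missed_alarms}")
```

A low threshold fills the dashboard with alarms, while a high one stays silent on real failures. A large test set makes it possible to choose the threshold with both counts in view.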

In some cases, simulated data can be better than real data. Going back to the example of the failing car engine: these machines have proven to be very reliable, so few failure data are available. In this case, real data covering small failures or regular to severe degradation could be used, and modelled points could simulate the most severe failures. In both cases, as mentioned in the previous article https://aisciencetalk.blog/2024/06/10/the-importance-of-data-quality-while-using-ai/, it is fundamental to rely on good-quality data and to make sure the modelled data are realistic. To this end, the data shared by NASA https://data.nasa.gov/Aerospace/CMAPSS-Jet-Engine-Simulated-Data/ff5v-kuh6/ include simulated data with noise added on top. The noise serves precisely to make the data realistic. C-MAPSS is one of the well-known data repositories in the field of gas turbine deterioration analysis, and several publications show the potential of AI for detecting these failures.
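The idea of adding noise to simulated data can be sketched in a few lines. This is not the C-MAPSS model: the exponential wear curve, the parameter values, and the `simulate_health` function are assumptions chosen only to illustrate how a clean modelled signal can be made to look more like a real measurement.

```python
import math
import random

def simulate_health(n_cycles=200, wear_rate=0.02, noise_std=0.01, seed=0):
    """Return a clean and a noisy version of a hypothetical health indicator.

    The clean signal is an assumed exponential degradation curve. The noisy
    signal adds zero-mean Gaussian noise to mimic sensor measurement scatter.
    """
    rng = random.Random(seed)  # fixed seed so the simulation is reproducible
    clean = [math.exp(-wear_rate * t) for t in range(n_cycles)]
    noisy = [value + rng.gauss(0.0, noise_std) for value in clean]
    return clean, noisy

clean, noisy = simulate_health()
for t in (0, 50, 100, 199):
    print(f"cycle {t:3d}: clean={clean[t]:.4f}  noisy={noisy[t]:.4f}")
```

The noise level should itself be realistic: too little and the model learns an idealised signal, too much and the degradation trend is buried.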

References

Copyright

Author: Simone Togni

Platform: https://aisciencetalk.blog/
