By Ken Kleinberg, Practice Lead, Innovative Technologies
Synthetic Data – It's widely known that AI requires large amounts of data to train, and that the volume of data being created these days is overwhelming and growing. Less appreciated is that much of that data is incomplete and dirty, particularly within healthcare. Synthetic data is data created by a system to have characteristics similar to real data; it can be used to ensure a large training set with the requisite number of well-distributed examples to properly train an application without bias.
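To make that concrete, here is a minimal sketch (my illustration, not any vendor's method) of one naive way to synthesize tabular records: fit per-column statistics on a handful of real rows and sample new rows from those distributions. The column names are hypothetical, and production generators use far richer models that also capture correlations between columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# A tiny stand-in for "real" patient data (column names are hypothetical).
real = pd.DataFrame({
    "age": [34, 51, 47, 62, 29, 55],
    "systolic_bp": [118, 135, 128, 142, 110, 138],
})

# Fit per-column means/standard deviations, then sample new rows.
# This preserves marginal statistics only; real generators also
# model the correlations between columns.
synthetic = pd.DataFrame({
    col: rng.normal(real[col].mean(), real[col].std(), size=100)
    for col in real.columns
})

print(synthetic.describe())  # similar means and spreads, no real patients
```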
Much like the systems that generate convincing fake faces of people who do not exist, synthetic data generation systems can also reduce data acquisition costs and data privacy concerns (the data does not represent any real-world person). Such approaches go beyond simple random data generation: there is no point in creating data to distinguish cats from dogs by generating images that are an unrealistic blend of the two species, although the degree of class separation should be controllable. Synthetic data can be used in support of both supervised and unsupervised learning, and it's already being explored in healthcare.
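The point about controllable class separation can be illustrated with scikit-learn's make_classification, whose class_sep parameter tunes how cleanly the synthetic classes separate. This is a generic sketch, not how any particular vendor does it:

```python
from sklearn.datasets import make_classification

# class_sep controls how far apart the two synthetic classes sit:
# low values yield realistic overlap between classes, high values
# yield easy, unrealistically clean separation.
for sep in (0.5, 1.0, 2.0):
    X, y = make_classification(
        n_samples=500, n_features=10, n_informative=5,
        class_sep=sep, random_state=0,
    )
    print(f"class_sep={sep}: X shape {X.shape}, positives {y.sum()}")
```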
One challenge with synthetic data technology is that it can be used to create "deepfakes," such as fraudulent medical claims or fabricated patient encounters. Fortunately, the same technology can also help detect these fakes, for example by noting that certain combinations of field values in claims data don't make sense, or that patient notes were manufactured and don't match the actual patient.
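As a sketch of the detection side, a generic anomaly detector such as scikit-learn's IsolationForest can flag claim records whose combinations of field values are implausible. The features and numbers below are entirely hypothetical:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(seed=0)

# Hypothetical claim features: [procedure_cost, patient_age, length_of_stay]
normal_claims = np.column_stack([
    rng.normal(2000, 400, 500),   # typical procedure costs
    rng.uniform(20, 80, 500),     # typical patient ages
    rng.poisson(3, 500),          # typical stays, in days
])

# A fabricated claim with an implausible combination of values.
suspect = np.array([[45000, 5, 0]])  # huge cost, age 5, zero-day stay

detector = IsolationForest(random_state=0).fit(normal_claims)
print(detector.predict(suspect))  # -1 means flagged as anomalous
```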
Data Labeling – The labeling or scoring of cases as having a certain condition or not, needed to create training and validation sets, can be highly time-intensive and require expensive experts. That became abundantly apparent from some large AI initiatives, such as the work Memorial Sloan Kettering has done with IBM Watson for cancer treatment (for a detailed analysis of how complex this issue is, see this article). In fact, a whole submarket of AI consisting of human/manual data labeling services (including crowd-sourced approaches) has emerged, sometimes referred to as "AI's hidden workforce." These services can reduce the cost and timeframe of large projects. For example, there is an especially large need for labeling and annotating medical images to determine which ones show a tumor.
As a variation and expansion of the synthetic data approach mentioned above, such labeling systems can be directed, programmed, or trained to create labeled examples of different types. For example, a semi-supervised learning approach, which uses both labeled and unlabeled data, can be used to generate a set of patient skin images with and without tumors.
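Here is a minimal sketch of the semi-supervised pattern in scikit-learn: unlabeled cases are marked with -1, and a SelfTrainingClassifier propagates labels from the few expert-labeled cases to the rest. The features stand in for image-derived measurements; this is an illustration, not a skin-imaging pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic stand-in for image-derived features: two classes,
# e.g., "tumor" vs. "no tumor".
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Pretend only ~10% of cases were labeled by experts; mark the rest -1.
rng = np.random.default_rng(seed=1)
y_partial = y.copy()
unlabeled = rng.random(len(y)) > 0.10
y_partial[unlabeled] = -1

# Self-training: fit on the labeled subset, then iteratively label
# the high-confidence unlabeled cases and refit.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)
print(f"accuracy on true labels: {model.score(X, y):.2f}")
```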
Automated Data Analysis – Data sets with many variables, or where there is minimal domain expertise as to which data will be important and in what combinations and dependencies, can benefit from automated data analysis techniques. For example, a system could flag field values with the same number of characters as likely representing the same kind of thing, or identify rows in which two or more fields tend to track each other, and use these signals to surface potentially valuable patterns.
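A toy version of that kind of automated profiling might look like the following, assuming a pandas DataFrame with hypothetical claims columns. It flags fixed-width text columns (often codes or identifiers) and strongly correlated numeric pairs:

```python
import pandas as pd

df = pd.DataFrame({
    "member_id": ["A10234", "B98211", "C44120"],   # fixed-width codes
    "zip": ["30301", "10001", "94105"],
    "charge": [120.0, 530.0, 210.0],
    "allowed": [100.0, 450.0, 180.0],               # tracks "charge"
})

# Columns where every value has the same character length often
# represent codes or identifiers of the same kind.
for col in df.select_dtypes("object"):
    lengths = df[col].astype(str).str.len()
    if lengths.nunique() == 1:
        print(f"{col}: fixed width of {lengths.iloc[0]} characters")

# Numeric column pairs that move together may be dependent fields.
corr = df.select_dtypes("number").corr()
for a in corr.columns:
    for b in corr.columns:
        if a < b and corr.loc[a, b] > 0.9:
            print(f"{a} and {b} are highly correlated ({corr.loc[a, b]:.2f})")
```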
At the AI World conference in Boston a few months ago, one vendor's product demonstration used the passenger list of the Titanic to determine who was more likely to survive. It turned out that certain ranges of room numbers corresponded to decks deeper in the ship, where passengers were less likely to escape, and that anyone with the designation of "Master" rather than "Mister" had an increased chance of survival (boys were often called Master back then).
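That Master-vs.-Mister signal is easy to reproduce. A sketch assuming the classic Kaggle Titanic training file (train.csv) is available locally; its Name column embeds each passenger's title:

```python
import pandas as pd

# Assumes the classic Kaggle Titanic training file is available locally.
df = pd.read_csv("train.csv")

# Titles such as "Mr", "Master", and "Mrs" sit between the comma and
# the period in names like "Allen, Master. Hudson Trevor".
df["title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Survival rate by title: "Master" (young boys) fares far better
# than "Mr", consistent with the demo described above.
print(df.groupby("title")["Survived"].mean().sort_values(ascending=False))
```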
While automated data analysis may produce unusable, naïve, or nonsensical "insights" (e.g., that color of clothing determines likelihood of cancer), it can help direct experts to where further exploration is worthwhile.
Looking Forward – Whichever side one takes in the debate over whether AI is overhyped or the world has yet to appreciate how profoundly it will change our lives, AI's eventual fate and value will depend on how well systems can be trained and applied to real-world problems. It's encouraging that vendors are starting to focus more on the "how" than the "what" and the "why," as is the real-world use of AI to improve medication management, which our Practice Lead, Michael Solomon, conveyed in his 2-part blog series. AI overall is on POCP's collective radar. In a future blog, we'll take a closer look at Robotic Process Automation (RPA). Contact me at ken.kleinberg@pocp.com if you have a particular area of interest.