Preparing the ground for mining

The term materials informatics – referring to applying data mining techniques to materials design – has been around for 7 years now. Yet, while the materials informatics approach slowly nucleates in a few institutions, a true revolution in the way most materials are designed has not arrived.

Why aren’t more people data mining materials? Some point to a lack of unified materials data sets to analyze. That situation is changing: as you read this, supercomputers around the country are systematically computing materials properties and the results are being posted openly to the web. Many are optimistic that these new databases will soon kickstart the materials informatics revolution.

However, even as the availability of materials data sets grows, an underestimated obstacle to materials informatics remains: we’ve yet to come up with a good mathematical description of materials that can be leveraged by the data mining community. Even as materials scientists expose acres of materials data for mining, it is as if these data resources are encased in hard rock that mathematicians and computer scientists can’t immediately dig into.

Encoding materials

Because data mining operates on numbers rather than on atoms or electrons, materials scientists must first encode materials in a numerical format for the machine learning community. Encoding materials is nothing new; researchers have always needed to describe materials in numbers (e.g., in research articles and to computational chemistry software). However, the details of the encoding are crucial to the success of the data mining effort. In particular, such encodings should be universal, reversible, and meaningful.

Encoding materials for the data mining community is an important and perhaps underestimated challenge.


A good encoding is applicable to the entire space of the objects it hopes to describe. For materials, this problem was solved long ago. Just as an audio recording can digitally represent any piece of music using 0s and 1s, materials scientists can describe any material using three vectors to describe the repeating unit cell plus additional vectors for the positions and elemental identity of each atom in that cell.
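
As a minimal sketch of this vector description, the snippet below stores three unit cell vectors plus an element and position for each atom. The `Crystal` class and `cell_volume` helper are hypothetical names chosen for illustration, and the rock salt lattice constant is an approximate value; real materials codes use their own, richer structure formats.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = Tuple[float, float, float]


@dataclass
class Crystal:
    """Hypothetical container for the vector description of a material."""
    lattice: Tuple[Vector, Vector, Vector]  # three unit cell vectors (angstroms)
    species: List[str]                      # element symbol of each atom
    frac_coords: List[Vector]               # atom positions, as fractions of the cell vectors


def cell_volume(crystal: Crystal) -> float:
    """Unit cell volume from the scalar triple product a1 . (a2 x a3)."""
    a1, a2, a3 = crystal.lattice
    cross = (
        a2[1] * a3[2] - a2[2] * a3[1],
        a2[2] * a3[0] - a2[0] * a3[2],
        a2[0] * a3[1] - a2[1] * a3[0],
    )
    return abs(sum(x * y for x, y in zip(a1, cross)))


# Rock salt (NaCl) primitive cell; a = 5.64 angstroms is an approximate lattice constant.
a = 5.64
nacl = Crystal(
    lattice=((0.0, a / 2, a / 2), (a / 2, 0.0, a / 2), (a / 2, a / 2, 0.0)),
    species=["Na", "Cl"],
    frac_coords=[(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)],
)
```

Any periodic material fits this layout by changing the three cell vectors and the list of atoms, which is what makes the description universal.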


A machine learning technique might be able to describe a potential new material with ideal properties in terms of numbers, but those numbers must have a way to transform back into a complete material. We will refer to this characteristic – that an encoded prediction can be decoded back to an object – as reversibility.

A reversible materials encoding is not difficult to achieve on its own or in conjunction with universality (the vector strategy described previously works just fine). However, both universality and reversibility become problematic when we attempt to define meaningful representations of materials.
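
To make the idea of reversibility concrete, here is a small sketch that flattens the vector description into a plain list of numbers and then rebuilds the material from that list. The `encode` and `decode` functions, the flat layout (nine lattice numbers followed by four numbers per atom), and the tiny element table are all assumptions made for illustration.

```python
# Tiny illustrative subset of the periodic table (an assumption for this sketch).
ATOMIC_NUMBER = {"H": 1, "O": 8, "Na": 11, "Cl": 17}
SYMBOL = {z: s for s, z in ATOMIC_NUMBER.items()}


def encode(lattice, species, frac_coords):
    """Flatten a material into numbers: 9 lattice values, then [Z, x, y, z] per atom."""
    flat = [x for vec in lattice for x in vec]
    for sym, pos in zip(species, frac_coords):
        flat.append(float(ATOMIC_NUMBER[sym]))
        flat.extend(pos)
    return flat


def decode(flat):
    """Invert encode(): rebuild lattice vectors, element symbols, and positions."""
    lattice = [tuple(flat[i:i + 3]) for i in (0, 3, 6)]
    species, frac_coords = [], []
    for i in range(9, len(flat), 4):
        species.append(SYMBOL[int(flat[i])])
        frac_coords.append(tuple(flat[i + 1:i + 4]))
    return lattice, species, frac_coords


# Rock salt primitive cell with an approximate lattice constant of 5.64 angstroms.
h = 5.64 / 2
material = (
    [(0.0, h, h), (h, 0.0, h), (h, h, 0.0)],
    ["Na", "Cl"],
    [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)],
)
round_trip = decode(encode(*material))
```

Here `round_trip` equals `material`, so nothing is lost in the encoding. The numbers themselves, however, say little about what the material is like.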


While universality and reversibility are important for representing data, the crucial ingredient for extracting patterns and making predictions is meaningfulness of the encoding. For example, audio encodings such as MP3 are universal and reversible, but not particularly meaningful. How would an algorithm take a raw string of 0s and 1s and attempt to guess its composer or mood? It is not as if a certain binary pattern indicates Beethoven and another indicates Debussy. A machine learning expert given data encoded using formats meant for recording and playback would be hopelessly lost if he or she attempted to directly apply those encodings to machine learning.

Unfortunately, as encoding strategies increase in meaning, they tend to become less universal and reversible.

In music, we can convey more meaning if we change our encoding strategy to be a musical score composed of note pitches, note lengths, and dynamic markings. This format lends itself to mathematically detecting note patterns that predict something meaningful such as musical genre or mood,[1] which will in turn vastly improve the success of data mining. However, a traditional musical score is not universal across all music; for example, it struggles to describe microtonal music and electronic music. It is also not fully reversible; as any fan of classical music knows, the score for Beethoven’s Ninth Symphony has been reversed into countless valid musical interpretations by different orchestras.[2] Thus, our quest for meaning has limited us in universality and reversibility.

You might instead encode music using words; a classical music piece might be described as “orchestral” or “small ensemble”, “cheerful” or “somber”, “melodic” or “twelve-tone”. These descriptors are in fact quite meaningful: we would expect that other music with the same or similar descriptors would sound similar, and such descriptors should also correlate with other properties of the music such as the country or time period it was composed in. However, despite being high in meaning, these textual descriptors are not reversible: reading a description of a piece of music does not allow you to faithfully recreate the sound in your head. In addition, given a limited musical vocabulary, text descriptors are also not universal[3] (a vocabulary of classical terms such as “twelve-tone” won’t describe rap, although it might be a more interesting universe to live in if it did).

Thus, the road to meaning often leads us away from universality and reversibility.

A “killer” encoding?

The typical description of a material as a set of vectors satisfies universality and reversibility, but is quite barren in meaning. The more meaningful materials encodings that have been devised, such as graph-based representations of materials or vectors of “chemical descriptors”, tend to be like the musical score and the verbal descriptions of music: they don’t apply to all materials and don’t capture enough detail to allow a machine learning solution to be reversed back into a material. Generating these alternate encodings also requires complex custom analysis code that can assess a material’s topology and determine its features. Therefore, each new materials informatics application requires the development of a new, meaningful data representation and lots of initial programming work before any real data mining can begin.
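
The loss of reversibility can be seen with even a very simple descriptor. The sketch below uses a toy composition descriptor (number of distinct elements and mean atomic number); it is a hypothetical example and not a descriptor taken from any particular study. Diamond and graphite, two very different materials, collapse onto the same numbers, so a prediction expressed in this encoding could not be decoded back to a unique structure.

```python
from collections import Counter

# Small illustrative element table (an assumption for this sketch).
ATOMIC_NUMBER = {"C": 6, "O": 8, "Si": 14, "Ti": 22}


def composition_descriptor(species):
    """Toy descriptor: (number of distinct elements, mean atomic number)."""
    counts = Counter(species)
    mean_z = sum(ATOMIC_NUMBER[s] * c for s, c in counts.items()) / len(species)
    return (len(counts), mean_z)


diamond_species = ["C", "C"]             # diamond primitive cell holds 2 carbon atoms
graphite_species = ["C", "C", "C", "C"]  # hexagonal graphite cell holds 4 carbon atoms

same = composition_descriptor(diamond_species) == composition_descriptor(graphite_species)
```

Because `same` is true, the descriptor cannot tell the two carbon structures apart, let alone reconstruct either one.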

As materials data continues to pour in, the materials encoding problem will become more visible and might even become the primary limit to the growth of materials informatics. An opportunity exists for the ambitious materials hacker to develop a “killer” encoding: a universal, reversible representation that can serve as a meaningful “first pass” for almost any materials mining problem.

With some work, such an encoding should be possible. For example, the word2vec algorithm is an encoding for data mining that turns words into numbers.[4] The encoding is pretty universal (most words can be turned into numbers) and reversible (numbers can be turned back to words). It is also meaningful; for example, you can use the encoding to measure meaningful distances between words (the word “France” would be closer in distance to the word “Spain” than to “keyboard”). You can even take meaningful sums (sometimes); for example, you can enter “Justin Bieber – man + woman” and get back an encoding that reverses to “Taylor Swift”.[5] Thus, it is certainly possible to design encodings that score well in universality, reversibility, and meaningfulness.
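
The distance and sum operations can be sketched with a handful of hand-made vectors. The values below are invented purely for illustration and are not output from a trained word2vec model; the `cosine` and `nearest` helpers are likewise hypothetical. A real model learns vectors with hundreds of dimensions from a large text corpus.

```python
import math

# Hand-made toy vectors (NOT real word2vec output); the four dimensions loosely
# stand for: country, royalty, gender, object.
VECTORS = {
    "France":   (1.0, 0.0, 0.0, 0.0),
    "Spain":    (0.9, 0.1, 0.0, 0.0),
    "keyboard": (0.0, 0.0, 0.0, 1.0),
    "king":     (0.0, 1.0, 1.0, 0.0),
    "queen":    (0.0, 1.0, -1.0, 0.0),
    "man":      (0.0, 0.0, 1.0, 0.0),
    "woman":    (0.0, 0.0, -1.0, 0.0),
}


def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def nearest(target, exclude=()):
    """Decode a vector back to the closest word, skipping the excluded words."""
    candidates = [w for w in VECTORS if w not in exclude]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))


# Meaningful distance: France sits closer to Spain than to keyboard.
closer_to_spain = cosine(VECTORS["France"], VECTORS["Spain"]) > cosine(
    VECTORS["France"], VECTORS["keyboard"]
)

# Meaningful sum: king - man + woman, then decode the result back to a word.
target = tuple(
    k - m + w for k, m, w in zip(VECTORS["king"], VECTORS["man"], VECTORS["woman"])
)
answer = nearest(target, exclude=("king", "man", "woman"))
```

With these toy vectors, `answer` decodes to "queen". The decoding step in `nearest` is what makes the encoding reversible: any vector, including one produced by arithmetic, maps back to a word.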

The word2vec algorithm is not perfect, and a similar encoding for materials would not be either. However, it would enable researchers to get shovels in the ground on a new problem quickly. The machine learning community has spent decades developing powerful and sophisticated digital drills for extracting patterns from data; materials scientists now need to prepare the soil.

[1] For example, people have described sequences of notes using Markov chains, and then used that information to generate music of a certain style.
[2] There is a nice iOS app from Touch Press dedicated to exploring the different versions of Beethoven’s Ninth.
[3] The Music Genome Project hired musicians to describe much of the world’s music catalog using a vocabulary of about 400 words. Despite being able to power the Pandora music service with this encoding, it was not universal (Latin and classical music were two examples of holdouts).
[4] More about word2vec here.
[5] Try your luck at word additions at
