Important note: if you are just joining us, you probably want to go back and start from the first page of the Phase Diagram comic!
Page 2 of the phase diagram comic doesn’t accomplish much on its own, but sets the stage for the upcoming pages. Those of you familiar with computational phase diagrams might already see where this analogy is headed, although there’s still a lot of ground to cover and maybe even a few surprises (unless you are really far ahead), so stay tuned.
And here’s page 2!
To be continued next time, when our protagonist wanders into an experimental synthesis lab…
One of the more powerful tools in materials screening is the computational phase stability diagram. Unfortunately, only a few research groups make use of it, and I thought that a comic book about these diagrams might improve the situation.
The project is taking a bit longer than expected…so I am releasing the comic page-by-page, with the first page below. This page is just an introduction; there will be about 5 more pages that fill in more details, with one post per page. The final post will provide the full comic as a PDF and will include an article with more details.
Developing parallel computer algorithms is becoming more important as CPU architectures jettison clock speed in favor of multicore designs. Indeed, many scientific applications are moving beyond CPUs to graphical processing units (GPUs), which are composed of thousands of individual cores that can complete certain tasks orders of magnitude faster than CPUs. The concept of GPU computing was visually demonstrated by the Mythbusters, who unveiled a massively parallel paintball bazooka (with 1100 individual barrels standing in for a GPU’s many computing cores) that physically produced a painting of the Mona Lisa in 80 milliseconds.
What about human parallelization? In software development, it’s certainly possible to achieve spectacular gains in productivity by progressing from a single coffee-addled, “overclocked” hacker to a large and distributed team. For example, few would have predicted that a complex, tightly integrated, and extremely stable operating system could be produced by a team of volunteers working in their spare time. Yet, Linux is just that and somehow manages to be competitive with billion dollar efforts in the commercial sector!
How can open source and parallel software development work so well? Karl Fogel offers one clue in his book, Producing Open Source Software: namely, that debugging works remarkably well in parallel. Not only do more contributors add more "eyes" on the code, but large and diverse teams are more likely to contain someone who has the precise background needed to identify a subtle bug. Such an effect can be seen outside the software world as well; for example, the internet community recently decoded a cryptic, decades-old letter from a mother to her grown children in very little time. The key to success? Some of the amateur sleuths were intimately familiar with the common Christian prayers that were the focus of her letter. I've had similar luck when posting a tricky probability question to the Math Stack Exchange website; the problem was non-trivial to two esteemed mathematician colleagues but was expertly answered several times over (including formulas!) in under an hour by internet citizens more familiar with that type of problem. The first correct response arrived within 15 minutes.
Unfortunately, examples of such community-driven development are rare in the field of materials science (especially in the United States). For example, none of the current electronic structure databases fit the bill. Perhaps this is because such efforts are still young and building momentum. However, it is also possible that scientists underestimate the difficulty of building software with a healthy community of developers. Adding to the difficulty are challenges particular to the scientific realm.
The following are some early experiences from building software for the Materials Project.
Work with computer scientists (but don’t expect them to solve all your computer science problems)
Software development is a fast-moving field, and computer scientists can provide crucial guidance on modern software technologies and development practices. Yet, at the same time, the bulk of actual programming tasks will most likely involve gluing together software design principles with materials science applications. The most effective "bonders" are those who hybridize themselves between materials science and computer science. Trying to divide programming work into materials science problems and computer science problems leads to weak bonds that are more susceptible to communication overhead and more likely to impede progress. Therefore, the Materials Project generally hires research postdocs who are also competent programmers. One strategy that is surprisingly effective in this effort is to employ a basic programming assignment to assess core competency and motivation before the phone interview stage.
Structure code for compartmentalized development (then work a lot to help people anyway)
Getting the community to adopt your project is quite difficult (my own software project, FireWorks, has certainly not gotten there yet!). One thing that is certain is that functional, useful code combined with an open license doesn't equal a community-driven project. Contrary to potential fears of code theft or receiving harsh criticism upon going open-source (or dreams of immediate fame and thank-you letters), the most likely scenario when sharing code is that the world will not take much notice.
The usual tips involve writing code that at least has the potential for distributed development, e.g., by writing modular code and spending time on documentation. But following the advice is more difficult when working with scientific collaborators. For example, modular software that computer scientists can read easily is often impenetrable to materials scientists, simply because the latter are often not familiar with programming abstractions such as object-oriented programming that are meant to facilitate scalability and productivity. And because a codebase itself is generally not seen as the final output (scientists are hired and promoted based on scientific results and journal articles, not Github contributions), scientists can be poorly motivated to work on documentation, code cleanup, or unit tests that could serve as a force multiplier for collaboration.
Unfortunately, perhaps the only way to write good code is to first write lots of bad code. This can lead to tension because senior developers can act like already-industrialized nations that expect newcomers to never “pollute” the codebase whilst themselves being guilty of producing poor code during their development.
One strategy to address mixed levels of programming proficiency is to entrust the more technical programmers with the overall code design and core library elements and to train newcomers by having them implement specific and limited components. Surprisingly, the situation can end up not too different from that described in Wikipedia’s article on early Netherlandish painting:
“…the master was responsible for the overall design of the painting, and typically painted the focal portions, such as the faces, hands and the embroidered parts of the figure’s clothing. The more prosaic elements would be left to assistants; in many works it is possible to discern abrupt shifts in style, with the relatively weak Deesis passage in van Eyck’s Crucifixion and Last Judgement diptych being a better-known example.”
If an in-house software collaboration is like a small artist studio, a public software project might be more like a large, ever-expanding community mural that must recruit and retain random volunteers of various skill levels. Somehow, the project must reach a point where the mural can largely extend itself in unexpected and powerful ways while still maintaining a consistent and uniform artistic vision. In particular, the project is truly healthy only when (a) it can quickly integrate new contributions and fixes to older sections and (b) one is confident that the painting would go on even if the lead artist departs. These added considerations involve many human factors and can often be more difficult to achieve than producing good code.
We are often our own worst enemy
As scientists, we are often our own worst enemy in scaling software projects. Whereas computer scientists generally take pride in their "laziness" and happily reuse or adopt existing code (probably learning something in the process), regular scientists are by nature xenophobic and prefer to write their own versions of codes from scratch using programming techniques they are comfortable with. This often leads to myopic software and stagnation in programming paradigms. In particular, the programming model of "single-use script employs custom parser to read haphazard input file to produce custom output file" is severely outdated but extremely common in scientific codes.
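As a rough illustration of the alternative (the file names and fields below are hypothetical), even switching from a hand-rolled parser to a standard, self-describing format such as JSON makes a script's inputs and outputs reusable by other tools:

```python
import json

# Outdated pattern: a one-off parser for a haphazard "key = value" input file
def parse_legacy_input(path):
    params = {}
    with open(path) as f:
        for line in f:
            if "=" in line:
                key, value = line.split("=", 1)
                params[key.strip()] = value.strip()
    return params

# More reusable pattern: a standard, self-describing format that any tool can read
def load_input(path):
    with open(path) as f:
        return json.load(f)  # e.g. {"structure": "NaCl.cif", "encut": 520}

def save_output(results, path):
    with open(path, "w") as f:
        json.dump(results, f, indent=2)  # downstream codes can consume this directly
```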
Scientists can also be very protective of code, too heavily weighting the potential negative aspects of being open (e.g., unhelpful criticism, non-attribution) and inadequately weighting its benefits (bug fixes, enhancements, impact). In particular, even some supposedly "open" scientific software requires explicitly requesting the code by personal email or comes saddled with non-compete agreements. It is unclear what fear motivates the authors to put up barricades to programs that they ostensibly want to share. But it is doubtful that Linux – despite its brilliant kernel – would have ever seen such success if all users and collaborators were required to first write Linus Torvalds an email and agree to never work on a competing operating system.
As the years pass, what might distinguish one electronic structure database effort from another might not be the number of compounds or its initial software infrastructure but rather how successfully it leverages the community to scale beyond itself. It will most likely be a difficult exercise in human parallelization, but it can’t be more complicated than writing an operating system with a bunch of strangers – right?
 Some of the issues in parallel programming are summarized here.
 Here’s that Mythbusters video.
 The Cathedral and the Bazaar by Eric S. Raymond
 One (crude) estimate puts the cost of Windows Vista development at 10 billion dollars. The budget for Windows 8 advertising alone is estimated to be over a billion dollars.
 Producing Open Source Software by Karl Fogel.
 The internet quickly decodes a decades-long family mystery.
 The usefulness of the programming challenge and other tips for hiring programmers are explained by Coding Horror.
 Another strategy is to employ multiple codebases that are cleanly separated in functionality but integrate and stack in a modular way. For example, the Materials Project completely separates the development of its workflow code from its materials science code. Such “separation of powers” can also accommodate different personalities by giving different members full ownership of one code, affording them the authority to resolve small and counterproductive arguments quickly (à la the benevolent dictator model of software management).
 Wikipedia article on Early Netherlandish painting. Incidentally, while Wikipedia is often criticized for being unreliable due to its crowdsourced nature, a Nature study found that the online material of Britannica was itself guilty of 2.92 mistakes per science article; Wikipedia was not so much worse with 3.86 mistakes per article. Another interpretation is that both these numbers are way too large!
As a species, we are particularly proud of our capacity for creative thought. Our ability to invent tools and imagine abstract concepts distinguishes us from other animals and from modern day computers (for example, the Turing Test is essentially a clever form of a creativity test).
However, in many cases, humans can now program computers to solve problems better than humans themselves can. In 1987, chess grandmaster Garry Kasparov proclaimed that "No computer can ever beat me". The statement was perfectly reasonable at the time, but it would be proven wrong within a decade. Similar turning points in history won't be limited to games. For a 2006 NASA mission that required an unconventional antenna design, the best-performing, somewhat alien-looking solution was not imagined by a human antenna design expert (of which NASA surely had plenty) but was instead evolved within a computer program. When mission parameters changed, a new design was re-evolved for the new specifications much more rapidly than could be achieved by NASA staff.
It is difficult to say whether current achievements by computers truly constitute “creativity” or “imagination”. However, the philosophical ambiguity has not stopped materials researchers from joining in on the action. As one example, Professor Artem Oganov’s lab at Stony Brook University has used computer programs to evolve new materials that have been observed at high pressures and might play a role in planet formation. These materials can be quite unexpected, such as a new high-pressure form of sodium that is transparent and insulating rather than silver and metallic. Thus, while we may not know whether to label the computer’s process as a “creative” one, the end result can certainly possess the hallmarks of purposeful design.
Indeed, if there is any doubt that computer algorithms are capable of producing creative solutions, one needs only to visit the Boxcar2D site. This website uses evolutionary algorithms to design a two-dimensional car from scratch; the process unfolds before your eyes, in real time.
It is instructive to observe the Boxcar2D algorithm “thinking”. Towards the beginning of the simulation, most designs are underwhelming, but a few work better than you’d intuit. For example, my simulation included a one-wheeler that employed an awkward chassis protrusion as a brake that modulated speed during downhill sections. It was a subtle strategy that outperformed a more classic motorbike design that wiped out on a mild slope.
Eventually, the one-wheelers proved too cautious, and the algorithm began designing faster two-wheelers that better matched the wheel sizes to the proportions of the frame to prevent flipping. Finally, the algorithm designed a car that was symmetric under being flipped upside down, eliminating the problem altogether. All this happened in the course of minutes, in a somewhat hypnotic visual progression from random to ordered.
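Boxcar2D's own internals aren't reproduced here, but the loop it illustrates – evaluate, select, mutate, repeat – is short enough to sketch. Below is a toy Python version of that loop for a made-up fitness function (all names and numbers are invented for illustration):

```python
import random

def fitness(design):
    # stand-in for "how far did the car travel?"; here just a toy function
    return -sum((x - 0.5) ** 2 for x in design)

def mutate(design, rate=0.1):
    # perturb each "gene" slightly
    return [x + random.gauss(0, rate) for x in design]

def evolve(pop_size=20, genes=8, generations=50):
    population = [[random.random() for _ in range(genes)] for _ in range(pop_size)]
    for _ in range(generations):
        # rank designs by fitness and keep the best half
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        # refill the population with mutated copies of the survivors
        offspring = [mutate(random.choice(survivors)) for _ in range(pop_size - len(survivors))]
        population = survivors + offspring
    return max(population, key=fitness)

best_design = evolve()
```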
However, despite the successes in computational optimization, several obstacles still prevent future materials from being designed by computers. The biggest problem is that computers cannot yet predict all the important properties of a material accurately or automatically. In many cases, computational theory can predict only a few heuristic indicators covering less than half of a material's important technological properties. Thus, a material that looks promising to the computer is really an incomplete model that requires further evaluation, and perhaps re-ranking, by human experts. Indeed, the most successful automatic algorithms have been those used for crystal structure prediction (such as the research of Professor Oganov), for which simple computer calculations are very good at ranking different compounds without adult supervision.
There are other problems; for example, genetic algorithms typically require many more calculations to find a good solution compared with human-generated guesses. However, this “problem” may also be an advantage. The computer’s willingness to produce several rounds of very poor-performing, uninhibited designs frees it from the bias and conservatism that can be displayed by human researchers, thereby revealing better solutions in the long run. Still, helping the computer become a smarter and more informed guesser would certainly improve the prospects for designing materials in computers.
Today, almost all materials are still devised by humans within a feedback loop of hypothesis and experiment. The next step might be to mix human and machine – that is, to use human intuition to suggest compounds that are further refined by a computer. Yet, perhaps one day, materials design may become more like chess or antenna design. Like the parents of a gifted child, materials scientists might find it more logical to train their computers to be more imaginative than they themselves are.
 A nice illustrated history of artificial intelligence in games was presented by XKCD.
 More about NASA’s evolved antenna design here.
 Attempt to evolve cars from randomness at the Boxcar2D site.
 For example, my colleagues (with minor help from myself) devised data-driven algorithms for materials prediction that more closely mimic a first step undertaken by many researchers (links here and here). These algorithms are more efficient at finding new materials but are much less “creative” than the evolutionary algorithms employed by Professor Oganov and others.
The Top500 list ranks the most powerful computers in the world. As of this writing, the top spot belongs to Tianhe-2, a Chinese computer capable of 33.86 petaflops (quadrillion computations per second). That’s perhaps a million times the performance of the computer you are viewing this on now. Many of the remaining top 10 computers are located at national laboratories around the United States, including Titan at Oak Ridge National Laboratory and Sequoia at Lawrence Livermore National Laboratory (both reaching 17 petaflops).
The supercomputer rankings are based on the LINPACK benchmark, which is essentially a measure of a computer’s “horsepower”. While there are endless discussions about employing better benchmarks that give a more practical “top speed”, perhaps a more important problem is that many researchers needing large amounts of computing don’t care much about the top speed of their computer. Rather than supercharged dragsters that set land speed records, these scientists require a fleet of small and reliable vans that deliver lots of total mileage over the course of a year.
HPC and HTC
The fundamental difference between the “drag racers” and the “delivery van operators” is whether the simulations coordinate all the computer processors towards a common task. In combustion studies and climate simulations, scientists typically run a single large computing job that coordinates results between tens or even hundreds of thousands of processors. These projects employ “computer jocks” who are the equivalent of a NASCAR pit crew; they tweak the supercomputers and simulation code to run at peak performance. Meanwhile, many applications in biology and materials science require many small independent simulations – perhaps one calculation per gene sequence or one per material. Each calculation consumes only a modest amount of processing power; however, the total computational time needed for such applications can now rival the bigger simulations because millions of such tiny simulations might be needed. Like running a delivery service, the computer science problems tend to be more about tracking and managing all these computing jobs rather than breaking speed records. And instead of a highly parallel machine tuned for performance, many inexpensive computers would do just fine.
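As a rough sketch of the "delivery van" mode (the function and material list below are placeholders, not actual Materials Project code), the programming problem is mostly bookkeeping over many independent tasks:

```python
from concurrent.futures import ProcessPoolExecutor

def compute_one_material(material_id):
    # placeholder for one small, independent calculation (e.g., a single DFT run)
    return {"material": material_id, "energy": 0.0}

if __name__ == "__main__":
    material_ids = [f"mp-{i}" for i in range(10000)]  # one small job per material

    # launch, track, and collect many independent jobs rather than one big one
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(compute_one_material, material_ids))

    print(len(results), "calculations finished")
```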
These two modes of computing are sometimes called “high-performance computing” (or HPC) for the large coordinated simulations and “high-throughput computing” (or HTC) for myriad small simulations. They are not always at odds. Just as consumer cars benefit from improvements on the racetrack, today’s high-throughput computing has benefited from designing yesterday’s high-performance systems. And the machines built for high-performance computing can sometimes still be used for high-throughput computing while maintaining low cost per CPU cycle and high energy efficiency. But, fundamental differences prevent the relationship from being completely symbiotic.
Making deliveries in racecars
One issue with large supercomputers concerns drivability – how easy is it to use and program the (often complex) computing architecture? Unfortunately, tweaking the computers to go faster often means making them less flexible. Sometimes, enhancing performance for large simulations prevents small jobs from running at all. The Mira Blue Gene cluster at Argonne Lab (currently the #5 system in the Top500 list) has a minimum partition of over 8000 processors and a low ratio of I/O nodes to compute nodes, meaning the machine is fundamentally built for large computations and doesn’t support thousands of small ones writing files concurrently. These top supercomputing machines are built like Death Stars: they are designed for massive coordinated power towards a single target but can potentially crumble under a load of many small jobs.
Other times, there are no major technical barriers to running small jobs on large machines. However, “run limits” imposed by the computing centers effectively enforce that only large computing tasks achieve large total computer time. One can bypass this problem by writing code that allows a group of small jobs to masquerade as a big job. It doesn’t always work, and is a crude hack even when it does.
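One common form of that hack is to request a single large allocation and have each MPI rank peel off its own small calculation. Here is a minimal sketch using mpi4py, where run_small_job is a hypothetical stand-in for one small calculation:

```python
from mpi4py import MPI

def run_small_job(job_id):
    # hypothetical stand-in for one small, independent calculation
    return {"job": job_id, "status": "done"}

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

jobs = list(range(1000))      # the many small tasks being "packed" into one big job
my_jobs = jobs[rank::size]    # round-robin assignment of tasks to MPI ranks

results = [run_small_job(j) for j in my_jobs]
all_results = comm.gather(results, root=0)  # rank 0 collects everything at the end
```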
If supercomputers are naturally suited to support large jobs, why not take small high-throughput jobs elsewhere? A major reason is cost. The Materials Project currently consumes approximately 20 million CPU-hours and 50 terabytes of storage a year. This level of computing might cost two million dollars a year on a cloud computing service such as Amazon EC2 but can be gotten essentially for “free” by writing a competitive science grant to use the existing government supercomputers (for example, two of these centers – Argonne and Oak Ridge – have a combined computing budget of 8 billion CPU-hours*).
Still, many high-throughput computational scientists have chosen not to employ supercomputers. The Harvard Clean Energy Project, an effort to discover new organic solar cells through millions of small simulations, leverages the personal computers of thousands of citizen volunteers via the IBM World Community Grid. Cycle Computing is a business that maps scientific computing problems to cloud computing services. The National Science Foundation is funding computing centers such as XSEDE that are more accommodating to massive amounts of small jobs. Adopting politically-tinged slogans like “Supercomputing/HPC for the 99 percent” in their press releases, these efforts are actively rebelling against the majority of computer time going to the small minority of computer jocks that can operate the supercomputing behemoths.
Can’t we all just get along?
The HPC/HTC divide is discouraging, especially because large and small jobs are not completely immiscible. The National Energy Research Scientific Computing Center (NERSC) has added a high-throughput queue to its large production cluster, Hopper (a 1 petaflop machine that was at one time the #8 computer on the Top500 list). In theory, this could be of mutual benefit. Supercomputing centers are partly evaluated by their utilization, the average percentage of computing cycles that are used for scientific work (versus waiting idly for an appropriately sized task). A small job can fill a gap in computing like a well-shaped Tetris piece, thereby improving utilization. The two communities can certainly cooperate more. HPC and HTC may belong to different computing families, but a Montague-Capulet style war would only hurt both sides.
 Many people believe that supercomputers owned by government agencies or commercial entities such as Google would win if they decided to enter into the competition.
 I’m assuming ~30 gigaflops LINPACK performance for a desktop based on this article in The Register.
 The FireWorks software I develop is one workflow code that has this feature built-in.
 (9/18/2014) I’ve heard an estimate that computer time at DOE leadership computing facilities costs about 2.5 cents per core-hour, such that 20 million CPU-hours would be equivalent to about $500,000.
 For example, see these articles released by San Diego Computing Center and in Ars Technica.
 Note: I am part of a queueing subcommittee at NERSC that hopes to start investigating this soon.
* (2/7/2014) The original version of this article stated that the DOE had a total 2013 computing budget of 2.4 billion CPU-hours, which is too low. The updated figure of 8 billion for just two top centers is based on this presentation.
Ansel Adams is famous for capturing iconic black and white photographs of Yosemite Park and the American West during the early and mid 1900s. In those early days of photography, landscape photographers needed to be hardy. To capture his seminal Monolith image, Ansel climbed 2500 feet of elevation in the snow carrying a large-format camera and wooden tripod. In addition to such physical stamina, photographers needed a deep and intuitive mental understanding of how their camera inputs (shutter speed, focal length, film type, and lens filters) mapped to the final output (exposure, sharpness, depth of field, and contrast). Ansel had only two of his large glass plate negatives remaining when he reached the spot to photograph the Monolith. Without the luxury of infinite shots or digital previews, he had to previsualize all the complex interactions between the outside light and his camera equipment before opening the shutter.
Today, most photographs at Yosemite are likely taken by cell phones carried in a pocket along well-marked trails. All the decisions of the photographic process – shutter speed, white balance, ISO speed – are made by the camera. After the capture, one can select amongst dozens of predesigned post-processing treatments to get the right “look” – perhaps a high-contrast black and white reminiscent of Adams. Today, one can be a photographer whilst knowing almost nothing about the photographic process. Even today’s professionals offload many of the details of determining exposure and white balance to their cameras, taking advantage of the convenience and opportunities that digital photography offers.
Computational materials science is now following a similar trajectory to that of photography. In the not too distant past, one needed to know both DFT theory and the details of its implementation to have any hope of computing an accurate result. Computing time was limited, so each calculation setting was thought through and adjusted manually before “opening the shutter” and performing the calculation.
These days, DFT software packages offer useful default settings that work over a wide range of materials. In addition, software “add-ons” are starting to promise essentially “point-and-compute” technology – that is, give the software a material and a desired property and it will figure out all the settings and execute the calculation workflow. Performing a DFT computation is still not as easy as snapping an iPhone picture, but it is becoming more and more accessible to experimental groups and those sitting outside theory department walls.
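As one hedged illustration of what "point-and-compute" can look like in practice, the pymatgen library can generate a complete set of VASP inputs from a crystal structure using curated defaults. A minimal sketch, assuming pymatgen is installed and pseudopotentials are configured (the file and folder names are placeholders):

```python
from pymatgen.core import Structure
from pymatgen.io.vasp.sets import MPRelaxSet

# load a crystal structure from a file (placeholder path)
structure = Structure.from_file("NaCl.cif")

# "point-and-compute": curated default settings are chosen automatically
input_set = MPRelaxSet(structure)
input_set.write_input("NaCl_relax")  # writes INCAR, KPOINTS, POSCAR, POTCAR
```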
But, along with these new steps forward in automation, might we be losing something as well?
Often one of the first victims of increased convenience is the quality of the final product. The large-format negative used by Ansel Adams to capture “The Monolith” was likely capable of recording over 500 megapixels of detail, far more than today’s highest-end digital cameras (and a far cry from Instagram snapshots).
In the case of DFT, both manual and semi-automatic computations employ the same backend software packages, so it is not as if the automatic “cameras” are of poorer quality. Careful manual control of the calculation settings still yields the most precise results. However, algorithms for fully automatic calculation parameters are growing in sophistication and may someday be embraced almost universally, perhaps in the same manner as the automatic white balance feature in cameras.
More controversial, perhaps, is the rise of automated post-processing routines to “correct” the raw computed data in areas of known DFT inaccuracy. Such techniques are how cell phone cameras provide good images despite having poor sensors and lenses: post-processing algorithms reduce noise and boost sharpness and color to make the final image look better than the raw data. The danger is potentially overcorrecting and making things worse. Fundamentally-minded researchers (like fans of high quality lenses) would insist that quality originate in the raw data itself. The problem is that employing a quality “computational lens” requires much more computational time and expense, and designing better “lenses” that produce better raw data is a very slow research process. It appears that the use of post-processing to correct for DFT’s shortcomings will only grow while researchers wait for more fundamental improvements to accuracy.
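In spirit, such post-processing amounts to nudging raw computed numbers where the theory is known to err. A toy sketch (the tags and correction values below are invented for illustration, not any production correction scheme):

```python
# toy example: empirical energy shifts (eV/atom) applied where plain DFT is known to err;
# the tags and values here are invented for illustration only
CORRECTIONS = {"O2_molecule": -0.7, "transition_metal_oxide": -0.2}

def corrected_energy(raw_energy_per_atom, tags):
    """Apply post-hoc corrections to a raw computed energy based on descriptive tags."""
    return raw_energy_per_atom + sum(CORRECTIONS.get(tag, 0.0) for tag in tags)

print(corrected_energy(-6.5, ["transition_metal_oxide"]))  # -6.7
```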
If you point your camera up at the night sky, open up the aperture, and take a very long exposure (or stack multiple shots), you can reveal the circular trails left by stars as the Earth rotates. These images are not accurate depictions of the sky at any one moment, but they expose a different truth: how the sky changes with time. To get these images, one must think of the camera not just as a point-and-click device but as a multi-purpose tool.
Creative controls also exist for the quantum calculation camera; one can artificially stretch the spacing between atoms past their normal configuration or calculate the properties of materials with magnetic configurations unknown in reality. These calculations are opportunities to predict not only what is, but what might be and see things in a different way. For example, could a battery material be charged faster if we increased the spacing between atoms? Sadly, the “point-and-compute” method does not encourage this type of creative approach; those unfamiliar with the manual controls may think of DFT in a reduced vocabulary.
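For example, one such "manual control" takes only a couple of lines in pymatgen – a hedged sketch of artificially stretching a structure before computing its properties (the file name is a placeholder):

```python
from pymatgen.core import Structure

structure = Structure.from_file("LiFePO4.cif")  # placeholder: a battery cathode material

# artificially expand the lattice by 3% in every direction ("what might be")
stretched = structure.copy()
stretched.apply_strain(0.03)

# the stretched structure can then be fed into a DFT calculation as usual
print(stretched.lattice.volume / structure.lattice.volume)  # roughly 1.03**3
```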
Organization, sharing, and the democratization of DFT
Perhaps the most unambiguous improvement in the new generation of DFT calculations will be how data is organized and shared. In the past, the results of a DFT study were buried within a maze of custom file folder hierarchies stored on a research group computing cluster that could be accessed by only a dozen people and usually navigated only by the author of the study.
We are starting to see a shift. Researchers are developing software to manage and share calculations, and large global data repositories share tens of thousands of DFT results with thousands of users via the internet (something unimaginable to many in the field only a decade ago). The audience for a once purely theoretical method is greatly expanding.
The changes to DFT calculations are not occurring as quickly or drastically as they did for photography, but they are certainly happening. Like it or not, today’s computational materials scientists will soon be sharing the field with many “point-and-compute” enthusiasts. The old guard, then, must learn to maintain the strengths of the old ways while still taking advantage of all the new opportunities.
The term materials informatics – referring to the application of data mining techniques to materials design – has been around for seven years now. Yet, while the materials informatics approach slowly nucleates in a few institutions, a true revolution in the way most materials are designed has not arrived.
Why aren’t more people data mining materials? Some point to a lack of unified materials data sets to analyze. That situation is changing: as you read this, supercomputers around the country are systematically computing materials properties and the results are being posted openly to the web. Many are optimistic that these new databases will soon kickstart the materials informatics revolution.
However, even as the availability of materials data sets grows, an underestimated obstacle remains to materials informatics: we’ve yet to come up with a good mathematical description of materials that can be leveraged by the data mining community. Even as materials scientists expose acres of materials data for mining, it is as if these data resources are encased in hard rock that mathematicians and computer scientists can’t immediately dig into.
Because data mining operates on numbers rather than on atoms or electrons, materials scientists must first encode materials in a numerical format for the machine learning community. Encoding materials is nothing new; researchers have always needed to describe materials in numbers (e.g., in research articles and to computational chemistry software). However, the details of the encoding are crucial to the success of the data mining effort. In particular, such encodings should be universal, reversible, and meaningful.
A good encoding is universal: it is applicable to the entire space of objects it hopes to describe. For materials, this problem was solved long ago. Just as an audio recording can digitally represent any piece of music using 0s and 1s, materials scientists can describe any material using three vectors to define the repeating unit cell plus additional vectors for the position and elemental identity of each atom in that cell.
A machine learning technique might be able to describe a potential new material with ideal properties in terms of numbers, but those numbers must have a way to transform back into a complete material. We will refer to this characteristic – that an encoded prediction can be decoded back to an object – as reversibility.
A reversible materials encoding is not difficult to achieve on its own or in conjunction with universality (the vector strategy described previously works just fine). However, both universality and reversibility become problematic when we attempt to define meaningful representations of materials.
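To be concrete about that vector strategy, here is a minimal sketch (assuming pymatgen and numpy) of an encoder and decoder that is universal and reversible but carries little meaning:

```python
import numpy as np
from pymatgen.core import Lattice, Structure

def encode(structure):
    # universal and reversible, but not very meaningful: just flatten the raw numbers
    lattice = structure.lattice.matrix.flatten()                       # 9 lattice numbers
    species = [site.specie.Z for site in structure]                    # one atomic number per site
    coords = np.array([site.frac_coords for site in structure]).flatten()
    return np.concatenate([lattice, species, coords])

def decode(vector, n_sites):
    # reverse the encoding back into a structure object
    lattice = Lattice(vector[:9].reshape(3, 3))
    species = [int(z) for z in vector[9 : 9 + n_sites]]                # atomic numbers
    coords = vector[9 + n_sites :].reshape(n_sites, 3)
    return Structure(lattice, species, coords)
```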
While universality and reversibility are important for representing data, the crucial ingredient for extracting patterns and making predictions is the meaningfulness of the encoding. For example, audio encodings such as MP3 are universal and reversible, but not particularly meaningful. How would an algorithm take a raw string of 0s and 1s and attempt to guess its composer or mood? It is not as if a certain binary pattern indicates Beethoven and another indicates Debussy. A machine learning expert handed data in a format meant for recording and playback would be hopelessly lost if he or she tried to mine it directly.
In music, we can convey more meaning if we change our encoding strategy to be a musical score composed of note pitches, note lengths, and dynamic markings. This format lends itself to mathematically detecting note patterns that predict something meaningful such as musical genre or mood, which will in turn vastly improve the success of data mining. However, a traditional musical score is not universal across all music; for example, it struggles to describe microtonal music and electronic music. It is also not fully reversible; as any fan of classical music knows, the score for Beethoven’s Ninth Symphony has been reversed into countless valid musical interpretations by different orchestras. Thus, our quest for meaning has limited us in universality and reversibility.
You might instead encode music using words; a classical music piece might be described as “orchestral” or “small ensemble”, “cheerful” or “somber”, “melodic” or “twelve-tone”. These descriptors are in fact quite meaningful: we would expect that other music with the same or similar descriptors would sound similar, and such descriptors should also correlate with other properties of the music such as the country or time period it was composed in. However, despite being high in meaning, these textual descriptors are not reversible: reading a description of a piece of music does not allow you to faithfully recreate the sound in your head. In addition, given a limited musical vocabulary, text descriptors are also not universal (a vocabulary of classical terms such as “twelve-tone” won’t describe rap, although it might be a more interesting universe to live in if it did).
Thus, the road to meaning often leads us away from universality and reversibility.
A “killer” encoding?
The typical description of a material as a set of vectors satisfies universality and reversibility but is quite barren in meaning. The more meaningful materials encodings that have been devised, such as graph-based representations of materials or vectors of “chemical descriptors”, tend to be like the musical score and the verbal descriptions of music: they don’t apply to all materials and don’t capture enough detail to allow a machine learning solution to be reversed back into a material. Generating these alternate encodings also requires complex custom analysis code that can assess a material’s topology and determine its features. Therefore, each new materials informatics application requires the development of a new, meaningful data representation and lots of initial programming work before any real data mining can begin.
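As a small example of what a vector of "chemical descriptors" might look like (a hand-rolled sketch rather than any particular published scheme), a composition can be summarized by a few element-property statistics – meaningful, but clearly neither universal nor reversible:

```python
from pymatgen.core import Composition

def composition_descriptors(formula):
    comp = Composition(formula)
    n = comp.num_atoms
    # atom-fraction-weighted averages of a few elemental properties
    mean_Z = sum(el.Z * amt for el, amt in comp.items()) / n
    mean_X = sum(el.X * amt for el, amt in comp.items()) / n            # Pauling electronegativity
    mean_mass = sum(el.atomic_mass * amt for el, amt in comp.items()) / n
    return [mean_Z, mean_X, mean_mass, len(comp.elements)]

print(composition_descriptors("LiFePO4"))  # a short, meaningful (but irreversible) fingerprint
```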
As materials data continues to pour in, the materials encoding problem will become more visible and might even become the primary limit to the growth of materials informatics. An opportunity exists for the ambitious materials hacker to develop a “killer” encoding: a universal, reversible representation that can serve as a meaningful “first pass” for almost any materials mining problem.
With some work, such an encoding should be possible. For example, the word2vec algorithm is an encoding for data mining that turns words into numbers. The encoding is fairly universal (most words can be turned into numbers) and reversible (the numbers can be turned back into words). And it is meaningful; for example, you can use the encoding to measure meaningful distances between words (the word “France” would be closer in distance to the word “Spain” than to the word “keyboard”). You can even take meaningful sums (sometimes) – for example, ThisPlusThat.me allows you to enter “Justin Bieber – man + woman” and get back an encoding that reverses to “Taylor Swift”. Thus, it is certainly possible to design encodings that score well in universality, reversibility, and meaningfulness.
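For the curious, the word arithmetic above looks roughly like this with the gensim library and a set of pre-trained word2vec vectors (the file name is a placeholder):

```python
from gensim.models import KeyedVectors

# load pre-trained word2vec vectors (placeholder path, e.g. the Google News vectors)
vectors = KeyedVectors.load_word2vec_format("word_vectors.bin", binary=True)

# meaningful distances: "France" should be closer to "Spain" than to "keyboard"
print(vectors.similarity("France", "Spain"))
print(vectors.similarity("France", "keyboard"))

# meaningful sums (sometimes): roughly "king - man + woman ≈ queen"
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```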
The word2vec algorithm is not perfect, and neither would be a similar encoding for materials. However, it would enable researchers to get shovels in the ground on a new problem quickly. The machine learning community has spent decades developing powerful and sophisticated digital drills for extracting patterns from data; the materials scientists now need to prepare the soil.
 For example, people have described sequences of notes using Markov chains, and then used that information to generate music of a certain style.
 There is a nice iOS app from Touch Press dedicated to exploring the different versions of Beethoven’s Ninth.
 The Music Genome Project hired musicians to describe much of the world’s music catalog using a vocabulary of about 400 words. Despite powering the Pandora music service, this encoding was not universal (Latin and classical music were two examples of holdouts).
 More about word2vec here.
 Try your luck at word additions at ThisPlusThat.me.